Even if that’s the case, you don’t really lose anything by using the methods listed here, but you might waste a little effort.

I have used the <a href="http://en.wikipedia.org/wiki/Akaike_information_criterion" rel="nofollow">Akaike Information Criterion corrected (AICc)</a> to test how well the exponential model fits Mr. Falkovich’s data. I made a visualization of the compatibility scores that I will use to help illustrate how well each model fits this data. It includes a better histogram of the scores than the one Mr. Falkovich used in his post, as well as the Empirical Distribution Function (EDF) for the compatibility scores.

Ideally, I should choose my method of analysis before I know what the data is. I was not able to follow this advice this time. After reading the post, I speculated that the log-normal distribution would be a good fit. This came from visualizing the data and from the fact that I can come up with an explanation for why the compatibility values would be distributed this way.

On the surface, it seems like a <a href="http://en.wikipedia.org/wiki/Normal_distribution" rel="nofollow">normal distribution</a> would work. A decision matrix takes a linear combination of different values, and a sum of many independent random (or close enough) values produces an approximately normal distribution regardless of what the probability distribution is for each variable. The issue is that, in this case, the variables are likely not independent. I would guess that, say, “20,000 Wednesdays” and “Building a Future” would be correlated for each person.

A more complicated model would be (underlying characteristics)->(superficial characteristics)->(measured traits)->(compatibility score). In this view, how the underlying characteristics combine to produce the superficial characteristics should ultimately determine the basic nature of the distribution of the compatibility score (though not necessarily the specific final distribution). If this combination is dominated by simple addition, it would yield a normal distribution. If instead it is dominated by simple multiplication, it would yield a log-normal distribution.
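The addition-versus-multiplication argument above can be illustrated with a small simulation. This is only a sketch: the number of factors (12), the uniform(0.5, 1.5) draws, and the helper names are assumptions chosen for illustration and have nothing to do with the actual decision matrix.

```python
import math
import random

def skewness(values):
    """Sample skewness: roughly 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def simulate(n_samples=20_000, n_factors=12, seed=0):
    """Combine the same hypothetical random factors by addition and by multiplication."""
    rng = random.Random(seed)
    sums, products = [], []
    for _ in range(n_samples):
        factors = [rng.uniform(0.5, 1.5) for _ in range(n_factors)]
        sums.append(sum(factors))
        products.append(math.prod(factors))
    return sums, products

sums, products = simulate()
log_products = [math.log(p) for p in products]

print("skewness of sums:         ", round(skewness(sums), 3))
print("skewness of products:     ", round(skewness(products), 3))
print("skewness of log(products):", round(skewness(log_products), 3))
```

Under these assumptions the sums come out close to symmetric, the products have a long right tail, and taking the logarithm of the products removes most of that skew, which is the signature of a log-normal.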

(I want to point out that in the above paragraph, I use superficial in a denotative sense: that is the superficial characteristics are the characteristics that can be directly observed. The underlying characteristics are opaque and can only be guessed at from indirect information. I do not intend superficial in the common connotative sense in that I think plenty of superficial characteristics can be deep and meaningful.)

I can buy that this process is either largely addition, largely multiplication, or something more complicated. As the data does not look symmetrical (and looks to me like it would be a good candidate for a log-normal fit), I guessed the log-normal distribution would fit better than an exponential distribution. An exponential distribution is used to model things like how long a DMV transaction takes (not including the wait time, which would not follow an exponential distribution) or the distances between successive road-kill carcasses on a highway. I don’t see any reason why scores for how compatible a person is in marrying Jacob would follow such a pattern.

Before I do calculations there is a big problem. To use the AICc, I need to calculate the likelihood function for each model I use. The exponential distribution is zero for numbers smaller than the starting point. This would yield a likelihood of 0 for the exponential model as there is a data point for a number smaller than the starting point. I need to figure out how to deal with that.

I spent a bit of time going over different ways to handle that single data point. I don’t want to ignore inconvenient data points, as that is horrible form, but there isn’t any good way of handling this issue in this problem that I know of. I decided to break up the data into two different distributions concatenated together: an exponential distribution for the data points above the zero point and another exponential distribution for the data points below the zero point. I then weight each distribution in proportion to how many data points are above and below the zero point, so that the total probability over all values still comes to 1. This, unfortunately for the exponential model, increased the degrees of freedom from 2 (the standard deviation and the zero point) to 4 (two standard deviations, the zero point, and the weighting).

Also, since Mr. Falkovich wisely has not released details of his decision matrix, I do not know if the scores have to come out to a multiple of 0.1 or if he rounded the scores to the nearest tenth. Since the probability distributions I am considering are all continuous probability distributions, I should integrate each distribution from 0.05 below to 0.05 above the given data point. I did the lazier thing and multiplied the value of the probability distribution at each data point by 0.1. I think that this won’t affect anything in my analysis, but I haven’t done the work to show this.
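A rough way to see how much the shortcut matters is to compare it with the exact integral for one model. The sketch below assumes the normal model fitted later in this comment (mean about 5.78, standard deviation about 0.99) and scores on a 0.1 grid from 4.5 to 8.1; the function names are made up for the example.

```python
import math

MEAN, SD = 5.78, 0.99   # rounded values quoted for the normal model
BIN = 0.1               # assumed rounding width of the scores

def normal_pdf(x):
    z = (x - MEAN) / SD
    return math.exp(-0.5 * z * z) / (SD * math.sqrt(2 * math.pi))

def normal_cdf(x):
    return 0.5 * (1 + math.erf((x - MEAN) / (SD * math.sqrt(2))))

def exact_bin_probability(x):
    """Integrate the density from 0.05 below to 0.05 above the score."""
    return normal_cdf(x + BIN / 2) - normal_cdf(x - BIN / 2)

def lazy_bin_probability(x):
    """The shortcut: density at the score times the bin width."""
    return BIN * normal_pdf(x)

def relative_error(x):
    exact = exact_bin_probability(x)
    return abs(lazy_bin_probability(x) - exact) / exact

scores = [round(4.5 + 0.1 * i, 1) for i in range(37)]  # 4.5, 4.6, ..., 8.1
worst = max(relative_error(x) for x in scores)
print(f"worst relative error of the shortcut: {worst:.4%}")
```

For this one model the shortcut stays well under one percent off for every score in the observed range. The exponential model has a jump at its starting point, so it would need its own check before drawing the same conclusion.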

First, the result for the exponential distribution. I used the measured standard deviation of about 0.99 and the measured starting point (the mean less the standard deviation) of 4.78 to define the exponential distribution above the starting point. For the data less than the starting point, I used the measured mean of 4.5 (there is only one point) and took the difference between this and the starting point, about 0.28, as the standard deviation for the exponential distribution below the starting point. Finally, the lower distribution is multiplied by 1/20 and the upper by 19/20, in proportion to the number of points below and above the starting point.
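The two-sided construction described above can be written out as a single density. This is a minimal sketch of one reading of that description, using the rounded parameters quoted in this comment; the function names and the numerical integration bounds are assumptions made for the example.

```python
import math

X0 = 4.78                    # starting ("zero") point
SCALE_UP = 0.99              # scale of the branch above X0
SCALE_LO = 0.28              # scale of the branch below X0
W_UP, W_LO = 19 / 20, 1 / 20 # weights from the share of points on each side

def two_sided_exp_pdf(x):
    """Weighted exponential decaying away from X0 in both directions."""
    if x >= X0:
        return W_UP / SCALE_UP * math.exp(-(x - X0) / SCALE_UP)
    return W_LO / SCALE_LO * math.exp(-(X0 - x) / SCALE_LO)

def total_mass(lo=-5.0, hi=30.0, steps=350_000):
    """Midpoint-rule integral of the density; should be very close to 1."""
    width = (hi - lo) / steps
    return width * sum(two_sided_exp_pdf(lo + (i + 0.5) * width) for i in range(steps))

def point_probability(x, bin_width=0.1):
    """Shortcut probability of a rounded score: density times bin width."""
    return bin_width * two_sided_exp_pdf(x)

print("total probability:", round(total_mass(), 4))
print("P(score = 4.8):   ", round(point_probability(4.8), 4))
print("P(score = 4.5):   ", round(point_probability(4.5), 4))
```

Because each branch integrates to its own weight, the weights 19/20 and 1/20 are what keep the total at 1. With these rounded inputs the individual point probabilities will differ slightly from ones computed with unrounded values.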

This yields a likelihood of getting the exact distribution of values of about 3.1 x 10^-13. That’s pretty good for 20 data points. This yields an AICc of about 68.27 (lower AICc values are better). This shows the <a href="//imgur.com/F7TKUf9" rel="nofollow">PDF of the fit</a> and the CDF of the fit. The exponential model looks like it fits the data very well (compare the CDF with the EDF).
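For reference, the standard AICc formula is AICc = 2k - 2 ln(L) + 2k(k+1)/(n - k - 1), where L is the likelihood, k the number of fitting parameters, and n the number of data points. A minimal sketch, with a hypothetical helper name and the rounded likelihood quoted above as input:

```python
import math

def aicc(likelihood, n_params, n_points):
    """Corrected Akaike Information Criterion from a model's likelihood."""
    aic = 2 * n_params - 2 * math.log(likelihood)
    correction = 2 * n_params * (n_params + 1) / (n_points - n_params - 1)
    return aic + correction

# Exponential model as described above: 4 fitting parameters, 20 data points.
exponential_aicc = aicc(likelihood=3.1e-13, n_params=4, n_points=20)
print(round(exponential_aicc, 2))
```

The result lands close to the 68.27 quoted above; small differences are expected because the likelihood here is rounded to two significant figures.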

I compare this to the normal distribution using a measured mean of about 5.78 and measured standard deviation of about 0.99. The PDF and <a href="//imgur.com/U8S3EHq" rel="nofollow">CDF</a> can be found in these links. It doesn’t look like it fits as well as the exponential and, indeed, the likelihood with the normal model is about 1.4 x 10^-14 with an AICc of about 68.55.

Then there is the log-normal distribution, with a measured mean of the natural logarithms of the values of about 1.74 and a standard deviation of the natural logarithms of the values of about 0.16. The PDF and CDF can be found in these links. To me it looks like the log-normal fits the data better than the normal (which shouldn’t be a surprise) but not as well as the exponential. As expected, the likelihood for the log-normal model is about 5.3 x 10^-14, which is between the other two. It has, however, the lowest AICc so far, at about 65.82. It seems that the extra fitting parameters of the exponential model I used have made the log-normal model more attractive.

As a sanity check, I test a uniform model with a minimum of 4.5 (the smallest data point) and a maximum of 8.1 (the largest data point). The PDF and especially the <a href="//imgur.com/DXYWmVY" rel="nofollow">CDF</a> show this not to be a good fit. However, it has the second highest likelihood so far, at about 1.8 x 10^-13, and the lowest (best) AICc so far, at about 63.38.

It may be natural to wonder why such a bad fit has such a high likelihood. The reason is that models whose bounds can be fitted to the minimum and/or maximum of the data tend to have artificially high likelihood values. The fitted uniform model gives a 0% chance of finding any compatibility score lower than 4.5 or higher than 8.1 for anybody that Mr. Falkovich would have been willing to date. The other distributions are effectively open ended on both ends. The fitted uniform model states that values slightly more extreme than those that have already occurred cannot happen, while the other models have to assign a non-zero probability to ludicrously extreme values. This problem is not unique to the uniform model (it would also exist if I fitted a beta distribution to the data), but it is still of value to test it.

Lastly, after looking at why the models gave the scores they gave (I discuss this a few paragraphs later), I tested a fitted <a href="//en.wikipedia.org/wiki/Gamma_distribution" rel="nofollow">gamma distribution</a> using a zero point of 0. I chose parameters that maximized the likelihood function and got a shape parameter of about 36.7 and a rate parameter of about 6.36. Here are the PDF and CDF comparisons. This gives a likelihood of about 3.6 x 10^-14 and an AICc of about 66.61.
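As a quick plausibility check on the fitted gamma parameters, the mean and standard deviation they imply can be compared with the measured values quoted earlier (mean about 5.78, standard deviation about 0.99). This sketch only uses the textbook moment formulas for a gamma distribution in the shape/rate parameterization; it does not redo the fit.

```python
import math

shape, rate = 36.7, 6.36   # fitted gamma parameters quoted above

gamma_mean = shape / rate            # mean of a gamma distribution
gamma_sd = math.sqrt(shape) / rate   # standard deviation of a gamma distribution

print("implied mean:", round(gamma_mean, 3))
print("implied sd:  ", round(gamma_sd, 3))
```

Both come out close to the measured values, which is what one would hope for from a maximum-likelihood fit on this data.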

From the AICc values one can find the relative probability that a model is the correct model for the data if the only things known are the data, the likelihood that the particular model would produce that particular data, and the number of fitting parameters. A model with a relative probability of 0.4 would be half as likely, given this limited information, as a model with a relative probability of 0.8. Other information (such as the feasibility of a model or how the model was chosen) should matter but is not measured by this test.

Here is a summary of the results:

Model (number of fitting parameters): likelihood; AICc; relative probability

Uniform (2): 1.82×10^-13; 63.38; 1.00

Log Normal (2): 5.35×10^-14; 65.83; 0.294

Gamma fitted (2): 3.61×10^-14; 66.61; 0.198

Exponential (4): 3.11×10^-13; 68.27; 0.0867

Normal (2): 1.37×10^-14; 68.55; 0.0752
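The relative probability column follows from the AICc column through the usual relation exp((AICc_min - AICc_i) / 2). A minimal sketch that recomputes it from the rounded AICc values in the summary (the function name is made up for the example):

```python
import math

aicc_values = {
    "Uniform": 63.38,
    "Log Normal": 65.83,
    "Gamma fitted": 66.61,
    "Exponential": 68.27,
    "Normal": 68.55,
}

def relative_probabilities(aicc_by_model):
    """Relative likelihood of each model compared with the lowest-AICc model."""
    best = min(aicc_by_model.values())
    return {name: math.exp((best - value) / 2) for name, value in aicc_by_model.items()}

for name, probability in relative_probabilities(aicc_values).items():
    print(f"{name}: {probability:.3g}")
```

Because the inputs are rounded to two decimals, the last digit can differ slightly from the summary.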

Why these models give the values they give is instructive. The uniform model is simple: for any value in its range it gives a 2 7/9 % (about 2.78%) chance of choosing that value. The other four models give comparable numbers until the final three values (7.2, 7.8, and 8.1), where the probability drops to less than 1.5% for getting 7.2 in any of the other models and less than 0.35% for getting 8.1 in any of the other models. It seems that Mr. Falkovich’s desire is fulfilled and the drop off for higher scores is not as severe as the normal (or any of the other models, including the exponential) would predict.

Mr. Falkovich is wrong in claiming that the exponential model is better because of its behavior at the tails. The normal model, the log-normal model, and the gamma model give higher probabilities of getting 7.2 and 7.8 than the exponential model gives. Only at the 8.1 data point does the exponential model give a higher probability than the other models, and this is where the probability is very low anyway.
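The tail comparison can be reproduced approximately from the rounded parameters quoted in this comment. The sketch below uses the density-times-0.1 shortcut and only the upper branch of the exponential model (valid for scores above the starting point of 4.78); the function names are hypothetical, and rounded inputs mean the figures are approximate.

```python
import math

BIN = 0.1  # bin width used for the density-times-width shortcut

def normal_pdf(x, mean=5.78, sd=0.99):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def lognormal_pdf(x, mu=1.74, sigma=0.16):
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

def gamma_pdf(x, shape=36.7, rate=6.36):
    # Work in logs to avoid overflow in rate**shape and the gamma function.
    log_pdf = shape * math.log(rate) + (shape - 1) * math.log(x) - rate * x - math.lgamma(shape)
    return math.exp(log_pdf)

def exponential_pdf(x, start=4.78, scale=0.99, weight=19 / 20):
    # Upper branch only: assumes x >= start.
    return weight / scale * math.exp(-(x - start) / scale)

MODELS = {
    "normal": normal_pdf,
    "log-normal": lognormal_pdf,
    "gamma": gamma_pdf,
    "exponential": exponential_pdf,
}

def point_probabilities(x):
    """Approximate probability each model gives to a score rounded to x."""
    return {name: BIN * pdf(x) for name, pdf in MODELS.items()}

for score in (7.2, 7.8, 8.1):
    row = ", ".join(f"{name}: {p:.3%}" for name, p in point_probabilities(score).items())
    print(score, "->", row)
```

With these inputs the exponential model gives the lowest probability of the four at 7.2 and the highest only at 8.1, which matches the pattern described above.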

The exponential model gives the highest likelihood because of its behavior in the low range. The exponential model gives about a 9.5% chance of getting a value of 4.8, which is the highest of any data point for any model. The other models don’t start giving higher probabilities than the exponential until the 5.8 data point, where the normal, log-normal, and gamma functions all start doing so. The exponential distribution may be the best model of those tested, but that would be because of how well it predicts the relatively low scores, not because of its ability to predict relatively high scores (unless someone wants to date more than a hundred people).

Looking at why the uniform model wins is instructive. All of the other models tested (including the exponential distribution) fall off too quickly at higher values to be believable. A distribution that does not fall off as quickly would fit better. As the gamma distribution can have different end behavior, I decided to test that (though without fitting the zero point of the gamma distribution, as this can skew the results in the way discussed for the uniform distribution). A beta distribution would be better, but without knowing what minimum and maximum bounds to use, fitting those bounds would give an artificially low AICc. It is also a mess to fit properly, and I didn’t want to bother.

In the end, I do not believe that a uniform model is a good model (both because it visually fits so poorly and because there is good reason to believe that it is wrong). None of the relative probabilities are really all that bad. In this situation, analyzing things qualitatively as well as quantitatively, I would say that I don’t know what the true distribution is. Such comparable relative probabilities suggest that one cannot differentiate between the tested models solely on goodness of fit anyway.

I should point out that if I keep the likelihood of my exponential model but incorrectly set the number of fitting parameters to 2, then that model would win. I made a choice on how to adjust that model for the 4.5 value, and making different choices could very well change where this model ranks. I doubt any reasonable choice would make that model a clear winner or a clear loser among the models chosen to be tested, but it is reasonable that someone would come to slightly different fine-grained conclusions than I did.

I should point out that I am not a professional at data analysis. I am very much an amateur and do this for fun. I hope that if you’ve reached the end then you have gotten some value out of my discussion, but I caution you not to trust my methods or the claims I have made without some other supporting evidence. If you actually know what you’re doing and spotted an error of mine (major or minor), then please let people here know. I did all calculations and created all visualizations in Microsoft Excel Home 2013 32-bit version 15.0.4953.1000, except for finding the fit parameters for the gamma model, for which I did the calculations in Octave version 4.2.2. I hope I got all of the formatting in this comment correct.

Post Script: Now that I am ready to post this, I have a desire to test a beta distribution with a lower bound of 0 and an upper bound of 10. This is because I am guessing that these are the minimum and maximum possible values that the decision matrix could create. I’ve already written everything up and don’t want to spend time rewriting it or delay publishing. I don’t want to be scooped.
