Statistics and Online Matchmaking

Posted in Statistics on November 26th, 2009 by Econmancer

The dating site OKCupid maintains a blog, OKTrends, about the data it collects from its users. OKTrends has some extremely entertaining posts, and they all happen to be about statistics. The site has data from 100 times the number of people polled for nationwide Gallup polls (300k people v. 3000 people).

Commenters point out that, whereas Gallup polls random people, OKCupid polls only its members. The people who are members of OKCupid could be very different from the people who are not. Still, it’s a lot of data and it’s interesting to see differences between people by state and gender. Examples include the relationship between the length of an introductory message and the chance of a response, and how different religions and races interact on the site.

I found the site through a post on BoingBoing. The interesting part of the post was how men and women respond to attractiveness. Men’s ratings of female attractiveness on the site followed a fairly even bell curve. Men replied more often as the attractiveness of the woman increased, until the woman was so attractive that messages dropped off. This is about what I’d expect. Guys like attractive women, and many guys would feel they had no chance with the extremely attractive members, so they wouldn’t even try.

Women, however, were different. The women on the site found very few of the men to be attractive (rating 80% of them below average in looks) and tended to message the slightly below-average men the most.

The discussion on the BoingBoing post and the original post by OKTrends are both worth reading if you are interested. There is a lot of discussion about why women found so many of the men below average in looks. An important part of that discussion has to do with how OKCupid works. I don’t understand it fully, since I’m not a member, but from what I read in the comments there is a system that automatically alerts a person if you rate them at a certain level or above. The commenters suggested that women are intentionally not selecting higher levels to avoid unwanted contact from the men they are rating. Another commenter complained that the system “baits and switches” when you are rated high by someone, by showing other profiles alongside the one person who thought you were attractive. The woman said she rates all of those people as “unattractive” in protest of this feature.


Engine Displacement and Automotive Performance

Posted in Statistics on November 22nd, 2009 by Econmancer

I entered all of the data from the SCCOIA website into a spreadsheet and used a random number generator to pick a sample of fifty vehicles from the list of 1845. I then found the cubic inch displacement for each of the vehicles that were randomly chosen.

I made graphs of displacement (x) against 1/4 mile times (y), and displacement (x) against 0-60 mph times (y), then looked at the linear regression of the data. Please remember that this does not take into account the weight of the vehicles, gearing, tires, or countless other factors that change the performance of a vehicle.

Interpretation of R and R^2

The correlation coefficient, R, is -.4739 for 1/4 mile times and -.4791 for 0-60 times. This indicates a negative linear association between the displacement of an engine and its track times. It makes sense that as the displacement gets larger, the times get smaller (quicker).

The coefficient of determination, R^2, is .2246 for 1/4 mile times and .2295 for 0-60 times. This indicates that about 22.5% – 23% of the variation in these cars’ performance times is accounted for by the displacement of the engines. This leaves 77% – 78% of the variation in the residuals.

t-Scores and P-Values

The t-scores are 28.95 (1/4 mile times) and 13.45 (0-60 times) and the P-values are 0.0000000000000000000000000000000048 (1/4 mile times) and 0.00000000000000000071 (0-60 times). P-values this small are strong evidence that the slope of neither line is zero.
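
As a rough cross-check, the slope t-statistic of a simple linear regression can be recovered from R and the sample size alone, using t = R*sqrt(n-2)/sqrt(1-R^2). The sketch below applies that identity to the R values above with n = 50. The function name and the critical value 2.011 (an approximate two-sided 5% table value for 48 degrees of freedom) are assumptions made for illustration, not output from the original spreadsheet.

```python
import math

def slope_t_from_r(r, n):
    """Slope t-statistic of a simple linear regression, from r and n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

N = 50          # sample size stated in the post
T_CRIT = 2.011  # assumed table value: two-sided 5%, 48 degrees of freedom

t_quarter = slope_t_from_r(-0.4739, N)  # 1/4 mile times
t_sixty = slope_t_from_r(-0.4791, N)    # 0-60 times

for label, t in (("1/4 mile", t_quarter), ("0-60", t_sixty)):
    verdict = "nonzero slope" if abs(t) > T_CRIT else "no evidence of slope"
    print(f"{label}: t = {t:.2f} -> {verdict}")
```

By this identity both statistics come out near -3.7, which is well past the critical value, so the conclusion that the slopes are not zero still holds. These magnitudes are much smaller than the t-scores quoted above; one possible explanation, offered only as a guess, is that the quoted figures come from the intercept row of the regression output.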

Some interesting highlights from the random sample

The largest displacement was a 1973 Pontiac Firebird with a 455ci engine. The smallest displacement was a 1992 Geo Metro LSi with a 61ci engine. The quickest 0-60 time was a 2002 Porsche 911 GT2 with 3.6 seconds. The slowest was a 1967 MG Midget III with 14.7 seconds. The quickest 1/4 mile time was the 2002 911 GT2 with 11.9 seconds. The slowest was the 1992 Geo Metro LSi with 19.4 seconds. The average car from the random sample would have a 212.28 cubic inch engine, go 0-60 in 8.02 seconds and have a 1/4 mile time of 15.91 seconds.


Linear Regression: Diamond Carat Weight and Price

Posted in Statistics on November 22nd, 2009 by Econmancer

I was interested in seeing exactly how the weight of a diamond relates to the price. I gathered some data and did a simple linear regression.

I collected a random sample of loose diamonds listed for sale on BlueNile.com. The sample was taken from all “round” cut diamonds graded as having clarity with very slight inclusions (VS1), color ranging from D-H (near colorless), and carat weights of .25-.50. Blue Nile had 614 individual stones that matched these categories. I labeled the diamonds 1-614 and took a random sample of 40 using a random number generator. I then collected the size in carats (x) and price in dollars (y) of the selected diamonds. The data is listed below.

Size(ct.) Price ($)
0.27 509
0.27 509
0.28 518
0.25 582
0.31 597
0.35 660
0.36 661
0.32 670
0.32 670
0.36 678
0.3 696
0.36 700
0.33 727
0.28 747
0.31 782
0.31 782
0.3 800
0.33 808
0.33 829
0.3 849
0.37 851
0.32 852
0.32 859
0.39 890
0.39 890
0.41 981
0.4 1017
0.4 1068
0.41 1071
0.39 1113
0.39 1113
0.42 1119
0.42 1310
0.42 1331
0.41 1398
0.43 1476
0.42 1516
0.46 1543
0.46 1595
0.48 1803

Interpretation of Regression Coefficients

The y-intercept is b0 = -836.79 and the slope is b1 = 4950.64. These numbers might seem unusual, but the slope is easy to interpret: each .01 ct increase in weight adds about $49.51 to the predicted price of these diamonds. The intercept has no practical meaning on its own, since a zero-carat diamond is far outside the range of the data.
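
The coefficients above, along with R and R^2, can be reproduced from the listed data with the ordinary least-squares formulas. This is a minimal sketch in plain Python rather than the spreadsheet used for the post; the function name `fit_line` is made up for illustration.

```python
import math

# Carat weights and prices transcribed from the table above.
carats = [0.27, 0.27, 0.28, 0.25, 0.31, 0.35, 0.36, 0.32, 0.32, 0.36,
          0.30, 0.36, 0.33, 0.28, 0.31, 0.31, 0.30, 0.33, 0.33, 0.30,
          0.37, 0.32, 0.32, 0.39, 0.39, 0.41, 0.40, 0.40, 0.41, 0.39,
          0.39, 0.42, 0.42, 0.42, 0.41, 0.43, 0.42, 0.46, 0.46, 0.48]
prices = [509, 509, 518, 582, 597, 660, 661, 670, 670, 678,
          696, 700, 727, 747, 782, 782, 800, 808, 829, 849,
          851, 852, 859, 890, 890, 981, 1017, 1068, 1071, 1113,
          1113, 1119, 1310, 1331, 1398, 1476, 1516, 1543, 1595, 1803]

def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

slope, intercept, r = fit_line(carats, prices)
print(f"b1 = {slope:.2f}, b0 = {intercept:.2f}")
print(f"R = {r:.4f}, R^2 = {r * r:.4f}")
print(f"price change per .01 ct = ${slope * 0.01:.2f}")
```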

Interpretation of R and R^2

The correlation coefficient, R, is .8836 and indicates a strong positive linear association between the weight of a diamond and its price. The coefficient of determination, R^2, is .7807. This indicates that about 78.1% of the variation in the price of these diamonds is accounted for by the weight of the stones. This leaves 21.9% of the variation in the residuals. This is the linear association expected from the scatter plot.

Sample Predictions of Price from Weight

The regression equation can now be used to predict the price of a diamond by its weight. For the example we use a diamond that is .25 carats because it is within the domain of the regression line.

4950.6(.25)-836.79=400.86

So it would be reasonable to expect to pay $400.86 for a .25 ct diamond.

None of the random diamonds in this data set were .5 ct, but I can use the equation to predict a price:

4950.6(.5)-836.79=1638.51

The equation predicts the cost of a .5 ct diamond to be $1638.51, so that is a price that I could reasonably expect to pay for a .5 ct stone.
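
The two predictions above can be written as a small function. This sketch uses the rounded coefficients from the worked examples (4950.6 and -836.79); the function names are hypothetical, and the range check uses the lightest and heaviest stones in the sample (.25 ct and .48 ct) to flag weights the line was not fitted on.

```python
SLOPE = 4950.6       # dollars per carat, rounded as in the worked examples
INTERCEPT = -836.79  # dollars

# Lightest and heaviest stones in the sample listed above.
MIN_CT, MAX_CT = 0.25, 0.48

def predict_price(ct):
    """Predicted price in dollars for a diamond weighing ct carats."""
    return SLOPE * ct + INTERCEPT

def within_sample_range(ct):
    """True when ct lies between the lightest and heaviest sampled stones."""
    return MIN_CT <= ct <= MAX_CT

for ct in (0.25, 0.50):
    note = "" if within_sample_range(ct) else " (outside sampled weights)"
    print(f"{ct:.2f} ct -> ${predict_price(ct):,.2f}{note}")
```

The .5 ct stone sits just past the heaviest diamond in the sample, so its predicted price is a slight extrapolation and deserves a little more caution than the .25 ct figure.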

Confidence Interval for Predicted Mean Value

There is 95% confidence that the average price of a .25 ct diamond falls between $293.22 and $508.50. The interval is not too broad and gives an idea of the average prices for a diamond that is .25 ct.

Prediction Interval for Individual Predicted Value

There is 95% confidence that the price of a .25 ct diamond will fall between $234.91 and $566.95. This interval is once again not too broad and gives a rough idea of the prices for a .25 ct stone, with the understanding that selling a .25 ct diamond for less than $234.91 prices it too cheaply and asking more than $566.95 overprices it.

Confidence Interval for Slope

There is 95% confidence that the slope of the true regression line is between 4086.51 and 5814.77. This means that we can be 95% confident that the price of a diamond rises between $40.87 and $58.15 for every additional .01 ct in weight. It can be concluded that, because zero is not in the interval, there is a positive linear association between the variables.
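
The three intervals above can be recomputed from the listed data using the textbook formulas for simple linear regression. This is a sketch under stated assumptions, not a reproduction of the spreadsheet output: the critical value 2.024 is an approximate two-sided 95% table value for 38 degrees of freedom, and all variable names are illustrative.

```python
import math

carats = [0.27, 0.27, 0.28, 0.25, 0.31, 0.35, 0.36, 0.32, 0.32, 0.36,
          0.30, 0.36, 0.33, 0.28, 0.31, 0.31, 0.30, 0.33, 0.33, 0.30,
          0.37, 0.32, 0.32, 0.39, 0.39, 0.41, 0.40, 0.40, 0.41, 0.39,
          0.39, 0.42, 0.42, 0.42, 0.41, 0.43, 0.42, 0.46, 0.46, 0.48]
prices = [509, 509, 518, 582, 597, 660, 661, 670, 670, 678,
          696, 700, 727, 747, 782, 782, 800, 808, 829, 849,
          851, 852, 859, 890, 890, 981, 1017, 1068, 1071, 1113,
          1113, 1119, 1310, 1331, 1398, 1476, 1516, 1543, 1595, 1803]

T_CRIT = 2.024  # assumed table value: two-sided 95%, 38 degrees of freedom

n = len(carats)
x_bar = sum(carats) / n
y_bar = sum(prices) / n
sxx = sum((x - x_bar) ** 2 for x in carats)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(carats, prices))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual standard error, with n - 2 degrees of freedom.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(carats, prices))
s = math.sqrt(sse / (n - 2))

# Confidence interval for the slope.
se_slope = s / math.sqrt(sxx)
slope_ci = (slope - T_CRIT * se_slope, slope + T_CRIT * se_slope)

# Intervals at x0 = .25 ct: mean response vs. a single new diamond.
x0 = 0.25
y0 = intercept + slope * x0
leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
se_mean = s * math.sqrt(leverage)      # standard error of the mean price
se_pred = s * math.sqrt(1 + leverage)  # standard error of one new price
mean_ci = (y0 - T_CRIT * se_mean, y0 + T_CRIT * se_mean)
pred_int = (y0 - T_CRIT * se_pred, y0 + T_CRIT * se_pred)

print(f"SE(b1) = {se_slope:.2f}")
print(f"slope CI: {slope_ci[0]:.2f} to {slope_ci[1]:.2f}")
print(f"mean price CI at .25 ct: {mean_ci[0]:.2f} to {mean_ci[1]:.2f}")
print(f"prediction interval at .25 ct: {pred_int[0]:.2f} to {pred_int[1]:.2f}")
```

Under these assumptions the slope interval and the mean-price interval land within a few dollars of the figures quoted above. The individual prediction interval comes out noticeably wider, roughly $65 to $737. The quoted half-width of about $166 is close to a single standard error of prediction, so the quoted prediction interval may be missing the t multiplier.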

Hypothesis:
Ho: B1 = 0 The null hypothesis is that there is no linear association between the price of a diamond and its carat weight. (slope is zero)

Ha: B1 =/= 0 The alternative hypothesis is that there is a linear association between the price of a diamond and its carat weight. (slope is not zero)

Model:
All of the linear regression t-test conditions are met
The scatter plot appears linear
The residual plot has no apparent pattern
The residuals have a roughly consistent spread
The normal probability plot appears basically straight

Mechanics:
The statistics in the test have been calculated by Gnumeric. The statistics are also calculated below.

t = (b1 - 0)/SE(b1) = 4950.64/425.68 = 11.63

P = P(|t| > 11.63) = 4.37E-014, or about 0.00000000000004
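
The arithmetic in this step is easy to check by hand or in a few lines of code. The sketch below recomputes the t-statistic from the slope and its standard error, and writes the P-value out in decimal form to make the number of leading zeros explicit. The P-value itself is taken as reported by Gnumeric and is not recomputed here.

```python
slope = 4950.64     # b1 from the regression output
se_slope = 425.68   # SE(b1) from the regression output
null_slope = 0.0    # value of the slope under Ho

t_stat = (slope - null_slope) / se_slope

p_value = 4.37e-14               # as reported in the post, not recomputed
p_decimal = f"{p_value:.16f}"    # fixed-point form with 16 decimal places

print(f"t = {t_stat:.2f}")
print(f"P = {p_decimal}")
```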

Conclusion:
With this very small P-Value, the null hypothesis is rejected. The probability of calculating a slope of 4950.64 if the actual slope is zero is extremely small. This serves as significant evidence that there is a positive linear relationship between diamond weight and price.
