A colleague asked me recently how I explain “degrees of freedom*.” This got me thinking; here’s a story that might help to clarify the concept.
Imagine that you must determine the average number of children to be picked up at each stop along a school bus route. There are 20 stops along the route. This particular bus serves areas of private residences and apartment complexes, so you are pretty sure that there will be variability in the numbers.
In a perfect world where all data are readily available, you would simply count the kids at each stop. This would yield an answer without error. You set out to do this, but at the end of the route you realize that you have only 19 data points—hard to believe, but you must have dozed off at some point and missed a stop. You now have all but one of the data points necessary for an errorless answer; if you use these 19 points to estimate the average, how far off could you be?
The missing point (in fact, each data point) contributes 1/20 of the information that factors into the final True Answer. The missing data point accounts for 5% of the total information, but you have 95% of the information—surely your answer must be close to correct.
The answer is that you don’t really know. The missing data point could have had any value. However, reason and common sense suggest that it is not very likely that the missing number would be lower than the smallest number that you recorded, or larger than the largest number. It might be, of course, but if you were to randomly exclude one of the 20 points (which is essentially what you have done) there is only a 1 in 10 chance that you would exclude either the lowest or highest number, so you can be 90% sure that the missing point is not an extreme. If you use the average of the 19 points as your estimate of the True Answer, and if the missing number is as low as the lowest number that you recorded, then your estimate will be too high by (the difference between your estimate and this lowest number)/20, and if it is as high as the highest number that you recorded, then your estimate will be too low by (the difference between this highest number and your estimate)/20. The divisor of 20 is there because each of the 20 points contributes 1/20 of the True Answer. You are therefore 90% sure that the True Answer lies in the range
[Your Estimate – (Your Estimate – Lowest Number)/20] < True Answer < [Your Estimate + (Highest Number – Your Estimate)/20]
Algebraically, the width of this range reduces to (HN – LN)/20, so there is a 90% chance that your estimate is off by less than that amount**.
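To make the arithmetic concrete, here is a minimal Python sketch of that range. The 19 counts are made up for illustration, and the function name `true_mean_range` is my own. It assumes exactly one stop was missed and that the missed value lies between the lowest and highest recorded counts, so the upper bound uses the positive quantity (highest number – estimate)/20.

```python
def true_mean_range(observed, total_stops=20):
    """Bounds on the full-route mean when exactly one stop was missed.

    Assumes the missed count lies between the lowest and highest
    recorded counts (the "90% sure" case in the text).
    """
    estimate = sum(observed) / len(observed)
    lowest, highest = min(observed), max(observed)
    # Missed value as low as the recorded minimum: estimate is too high.
    lower = estimate - (estimate - lowest) / total_stops
    # Missed value as high as the recorded maximum: estimate is too low.
    upper = estimate + (highest - estimate) / total_stops
    return estimate, lower, upper


# Hypothetical counts for the 19 stops that were recorded.
counts = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9, 14]
estimate, lower, upper = true_mean_range(counts)
print(f"{lower:.2f} < True Answer < {upper:.2f}")  # 4.80 < True Answer < 5.45
```

With these made-up counts the estimate is 5 and the range is 0.65 wide, which is (14 – 1)/20.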
This represents the amount that the true answer is free to vary from your estimate when 19 of the 20 values were free to vary. Once you decided to accept the mean of these 19 as the average for the entire group of 20, the data point that you missed could no longer vary – its value was fixed at the mean of the others. This estimate, then, reflects a situation with 19 points free to vary: 19 degrees of freedom.
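One way to see why the missed point is "fixed" is to fill it in and recompute. The short sketch below uses hypothetical counts again: accepting the 19-point mean as the answer for all 20 stops is the same as pretending the missed stop had exactly that mean.

```python
import math

# Hypothetical counts for the 19 recorded stops.
observed = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9, 14]
estimate = sum(observed) / len(observed)

# Fill the missed stop with the mean of the others and average all 20.
filled = observed + [estimate]
full_mean = sum(filled) / len(filled)

# The 20-point mean matches the 19-point estimate, so the filled-in
# value had no freedom left: any other value would move the mean.
print(math.isclose(full_mean, estimate))
```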
Now repeat this thought experiment, except that this time you collected your data after an especially rough night of little sleep. At the end of the route you realize that you have only 1 of the 20 data points (it was a really hard night!). If you decide to use this single data point as an estimate of the true average for all the stops, how far off will you be? The missing values represent 95% of the information that you need—you have only 5% of the information. Surely your answer must be way off. In this case only one value was free to vary—19 were determined when you decided to use the one recorded measure as the mean for all of them. Here you had only 1 degree of freedom.
Finally, consider an intermediate situation – you miss 10 of the 20 stops. You have 50% of the information, and lack 50%. The average of your 10 data points will be far more likely to be close to the True Answer than would the “average” of the 1 data point following the rough night, but not as accurate as the estimate when you have 19 of the 20 data points. Here there are 10 points that are free to take on any value: 10 degrees of freedom.
Each missing data point allows the estimate to vary from the True Answer. The more missing points, the more the estimate will vary. Each missing point represents a lost degree of freedom: therefore the more missing data points, the fewer degrees of freedom. If there is only one missing or unknown value, there are many degrees of freedom. As the number of missing values increases, the number of degrees of freedom will decrease. With more degrees of freedom, your certainty about the accuracy of your answer increases; another way of saying this is that with more degrees of freedom, the range within which the True Answer could fall decreases.
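The three scenarios (1, 10, and 19 recorded stops) can be compared with a small simulation. This is only a sketch under my own assumptions: the 20 counts are invented, the stops you record are chosen at random, and "how far off" is measured as the standard deviation of the estimate's error over many repeated trips. The helper name `error_spread` is hypothetical.

```python
import random
import statistics


def error_spread(route, n_observed, trials=2000, seed=0):
    """Standard deviation of (estimate - true mean) over repeated trips."""
    rng = random.Random(seed)
    true_mean = statistics.mean(route)
    errors = []
    for _ in range(trials):
        # Record a random subset of stops; the rest were slept through.
        recorded = rng.sample(route, n_observed)
        errors.append(statistics.mean(recorded) - true_mean)
    return statistics.pstdev(errors)


# Hypothetical counts for all 20 stops on the route.
route = [0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 8, 9, 12, 15]

spreads = {n: error_spread(route, n) for n in (1, 10, 19)}
for n, spread in spreads.items():
    print(f"{n:2d} stops recorded: typical error about {spread:.2f}")
```

The spread should shrink steadily from 1 to 10 to 19 recorded stops, which is the pattern the story describes.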
In more general terms, the more of the statistics entering into your ultimate answer that you have to estimate rather than measure, the fewer degrees of freedom you have, and the more likely it is that your ultimate answer will deviate from the True Answer.
I appreciate comments. Please email me at email@example.com, especially if you want to point out errors in my logic or my algebra. They likely exist.
* If you don’t know the term, just imagine a concept essential to your job but difficult to explain to newbies and poorly understood by most everyone, even you.
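** (YE – LN)/20 + (HN – YE)/20 = YE/20 – LN/20 + HN/20 – YE/20 = (HN – LN)/20. The second term is written as (HN – YE)/20 because the highest number lies above the estimate, so the YE terms cancel.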