ANOVA is a useful procedure, as is its simpler cousin the t-test. Both of these tests are “parametric” — a term that means they rely on assumptions about the parameters or characteristics of the underlying population from which the data were drawn. Tests exist that make no such assumptions. These tests are called “non-parametric.”
Test of Exact Probability
This is a wonderful inferential statistical test, but for large data sets it can be too computationally intensive (requiring lots of computer power, or lots of time). However, it is conceptually very simple and logical:
Of the many different ways the data could be different between the two conditions (Pre and Post, say, or Experimental and Control), how many are as extreme or more extreme than the obtained difference? If few (by convention, fewer than .05 of all the possible differences) are as extreme or more extreme, then the obtained result is unlikely; conclude that it reflects the effect of what you did and is not just a chance effect. For example:
Expt: 1, 2, 4 -- Sum = 7, Mean = 7/3
Ctrl: 3, 5, 6 -- Sum = 14, Mean = 14/3
H-0, Null Hypothesis: If the group results were totally unrelated to anything that was done to the subjects, then the numbers within each group are there simply by chance – a subject would have produced a given score regardless of group, so the group differences simply reflect the chance, random assignment of subjects to groups.
H-1, Alternate Hypothesis: The group results at least in part reflect what was done to the groups. If the difference between the groups is not likely to have occurred just by chance, we will conclude that the experimental manipulation caused the difference.
Total number of ways of assigning these 6 scores to two different groups by chance:
Possible Data        Group Sums
1 2 3 | 4 5 6         6 | 15
1 2 4 | 3 5 6         7 | 14
1 2 5 | 3 4 6         8 | 13
1 2 6 | 3 4 5         9 | 12
1 3 4 | 2 5 6         8 | 13
1 3 5 | 2 4 6         9 | 12
1 3 6 | 2 4 5        10 | 11
1 4 5 | 2 3 6        10 | 11
1 4 6 | 2 3 5        11 | 10
1 5 6 | 2 3 4        12 | 9
2 3 4 | 1 5 6         9 | 12
2 3 5 | 1 4 6        10 | 11
2 3 6 | 1 4 5        11 | 10
2 4 5 | 1 3 6        11 | 10
2 4 6 | 1 3 5        12 | 9
2 5 6 | 1 3 4        13 | 8
3 4 5 | 1 2 6        12 | 9
3 4 6 | 1 2 5        13 | 8
3 5 6 | 1 2 4        14 | 7
4 5 6 | 1 2 3        15 | 6
20 different possible outcomes could occur just by chance. If our outcome were among the .05 x 20 = 1 most extreme, we would consider it unlikely to have occurred by chance, reject the null hypothesis, and conclude that the manipulation had an effect. As it happens, two results (6 | 15 and 15 | 6) are tied for most extreme, and ours is not one of them, so we retain the null hypothesis.
What is the probability of getting a result as extreme or more extreme than our obtained result, just by chance? There are two results that are as extreme as our obtained result, and two that are more extreme, so the probability of getting a result at least this extreme by chance is 4/20 = .20.
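This counting is easy to hand to a computer. Here is a minimal sketch in Python of the enumeration just described (the scores and the obtained sums are simply the ones from the example above):

    from itertools import combinations

    scores = [1, 2, 3, 4, 5, 6]
    obtained = abs(7 - 14)   # |Expt sum - Ctrl sum| for the obtained data

    extreme = 0
    splits = 0
    for expt in combinations(scores, 3):   # every way to fill the Expt group
        ctrl_sum = sum(scores) - sum(expt)
        splits += 1
        if abs(sum(expt) - ctrl_sum) >= obtained:
            extreme += 1

    print(extreme, splits, extreme / splits)   # 4 20 0.2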
Enumerating the possible outcomes becomes unwieldy as the number of data points increases. If you break the subjects into equal (or nearly equal) groups, here is what you will encounter:
N    k (size of one group)    Outcomes
6 3 20
7 3 35
8 4 70
9 4 126
10 5 252
11 5 462
12 6 924
13 6 1716
14 7 3432
15 7 6435
16 8 12870
17 8 24310
18 9 48620
19 9 92378
20 10 184756
In theory you could enumerate the .05 x 184756 = 9238 most extreme outcomes for the N = 20 situation, but it would not be fun.
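These counts are just the binomial coefficients C(N, k), so you never need to build the table by hand. A quick sketch, assuming Python 3.8+ for math.comb:

    from math import comb

    for n in range(6, 21):
        k = n // 2                 # size of one (equal or smaller) group
        print(n, k, comb(n, k))    # reproduces the table above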
Alternatives to the Test of Exact Probability
Our situation is a bit different. In the simplest case of improvement within a session, we have two data points (First and Second) for each participant. If any difference between the two times is due just to chance, then someone’s lowest time would be equally likely to come first or second. Thus we could have obtained:
Person A B C D E F G
First 1 3 5 7 2 4 5
Second 6 6 8 8 8 2 8
or
Person A B C D E F G
First 6 3 8 8 2 4 5
Second 1 6 5 7 8 2 8
or any other arrangement, all with the same probability. There are 2^7 = 128 different possible outcomes if chance alone is responsible. According to our convention of Unlikely = .05 or less, there are at most .05 x 128 = 6.4, i.e. 6, unlikely outcomes (the 7 most extreme would already carry probability 7/128 = .055). If our obtained outcome is among them we would reject the idea that chance alone was responsible. With our N = 14 from the first day of collecting data, the number of possible outcomes increases to 2^14 = 16384, with 819 unlikely outcomes. We could try to enumerate them all in order to determine if the results were due to chance, but with this many possibilities we should probably use a computer.
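That computer work is straightforward. One reasonable choice for “how extreme is our outcome” is the sum of the First − Second differences (the passage above doesn’t single out a statistic, so this choice is an assumption). A sketch of the brute-force enumeration under that choice:

    from itertools import product

    first  = [1, 3, 5, 7, 2, 4, 5]
    second = [6, 6, 8, 8, 8, 2, 8]
    diffs = [f - s for f, s in zip(first, second)]
    obtained = abs(sum(diffs))               # 19

    extreme = 0
    for signs in product([1, -1], repeat=len(diffs)):
        # Each sign pattern is one way chance could have ordered the pairs:
        # flipping a sign swaps that person's First and Second scores.
        total = sum(s * abs(d) for s, d in zip(signs, diffs))
        if abs(total) >= obtained:
            extreme += 1

    print(extreme, 2**len(diffs), extreme / 2**len(diffs))   # 6 128 0.046875

By this exact count the obtained outcome would be among the unlikely ones (6/128 = .047 < .05).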
Fortunately there are some easily calculated and commonly used statistical tests that will address these questions for you with little trouble. We will talk about two of them, and will pretend that the first set of scores listed above is our actual, obtained data.
The Sign Test
The sign test throws out much of the information in the data set, and examines only the direction of the difference between each participant’s two scores. First let’s calculate the differences between First and Second scores for each participant, then throw away the number and just keep the sign.
Person A B C D E F G
First 1 3 5 7 2 4 5
Second 6 6 8 8 8 2 8
Diff -5 -3 -3 -1 -6 2 -3
Sign - - - - - + -
There are 128 different ways of assigning + and – signs to 7 participants. The 2 most extreme of these have all changes going in the same direction, and 14 of them include a single + or a single –, thus the probability of a result as extreme or more extreme than the obtained result is 16/128 = .125. As this is not unlikely, we retain the null hypothesis and conclude that the result could have occurred by chance. If there are too many possible permutations to list, you can use a table to assess the probability of your result (or a more extreme result). To use the table, determine C = the number of changes in either the positive or negative direction, whichever is greater.
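Because each participant’s sign is a fair coin flip under the null hypothesis, the sign test probability is just a binomial count, easy to verify by hand or in Python:

    from math import comb

    n = 7   # participants
    c = 6   # changes in the more common (negative) direction

    # Outcomes as extreme or more extreme, in either direction:
    # 0 or 1 pluses, or 0 or 1 minuses.
    extreme = sum(comb(n, i) for i in range(n - c + 1)) \
            + sum(comb(n, i) for i in range(c, n + 1))
    print(extreme / 2**n)   # 16/128 = 0.125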
The Wilcoxon Signed Ranks Test
To reach this conclusion we discarded lots of information – specifically the magnitude of the changes from First to Second. The Wilcoxon signed ranks test maintains some of that information.
Person A B C D E F G
First 1 3 5 7 2 4 5
Second 6 6 8 8 8 2 8
Diff -5 -3 -3 -1 -6 2 -3
|Diff| 5 3 3 1 6 2 3
Rank 6 4 4 1 7 2 4
Signed Rank -6 -4 -4 -1 -7 2 -4
(The three |Diff| = 3 scores tie for ranks 3, 4, and 5, so each gets the average rank 4; the remaining |Diff| values of 5 and 6 then take ranks 6 and 7.)
Wilcoxon then calculates a value T = the smaller of the two sums of same-signed ranks, taken without regard to sign. Here the positive ranks sum to 2 and the negative ranks to -26, so T = 2. (A low T is unlikely under the null hypothesis.) The probability that a given T occurred just by chance is given by consulting a table. In this case the probability of a T this low or lower is just under .05 (6/128 = .047), so we can reject the null hypothesis and conclude that the first and second scores differed significantly. Note that by keeping some information about the magnitude of the differences we were able to reject the null hypothesis; keeping this info gave our test more power.
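With only 7 participants you can also check the tabled value directly, by enumerating all 128 ways chance could attach signs to the ranks. A sketch, using the midranks computed above:

    from itertools import product

    ranks = [6, 4, 4, 1, 7, 2, 4]    # midranks of |Diff|, ties averaged

    extreme = 0
    for signs in product([1, -1], repeat=len(ranks)):
        pos = sum(r for r, s in zip(ranks, signs) if s > 0)
        neg = sum(ranks) - pos
        if min(pos, neg) <= 2:       # T at least as low as the obtained T = 2
            extreme += 1

    print(extreme, 2**len(ranks), extreme / 2**len(ranks))   # 6 128 0.046875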
The Mann-Whitney U Test
For independent groups, the Wilcoxon and sign tests will not work, as there is no way to pair up the scores across the conditions. The Mann-Whitney U test is the nonparametric test to use for two independent samples.
Rank the scores across both groups, with the lowest score given the rank of 1, and the highest score the rank of (n1 + n2). Ties share a rank. Once this is done, calculate the value of U for each of the two groups, as follows:

U1 = n1 x n2 + n1(n1 + 1)/2 - R1
U2 = n1 x n2 + n2(n2 + 1)/2 - R2

where R1 and R2 are the sums of the ranks for each group. The test statistic U is the smaller of these two Us.
Consult a table of critical values. If your U is less than or equal to the critical U, then p < alpha.
Example: here are two groups:
Our U is the smaller of the two: 15
Critical U (from the table): 6. Our U is LARGER THAN the critical U, so our p is GREATER THAN .05. Retain the null hypothesis of no effect, and conclude that the groups do not differ.
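As a worked sketch with made-up scores (g1 and g2 below are hypothetical, not the groups from the example above), assuming SciPy is available for the tie-averaged ranking:

    from scipy.stats import rankdata

    g1 = [12, 15, 9, 20, 18, 25]
    g2 = [8, 7, 14, 10, 6, 11]
    n1, n2 = len(g1), len(g2)

    ranks = rankdata(g1 + g2)    # lowest score gets rank 1; ties share a rank
    R1 = sum(ranks[:n1])
    R2 = sum(ranks[n1:])

    U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1
    U2 = n1 * n2 + n2 * (n2 + 1) / 2 - R2
    U = min(U1, U2)
    print(U)   # 4.0

Against the usual tabled critical U of 5 for two groups of 6 at alpha = .05 (two-tailed), this hypothetical U of 4 would be significant. (scipy.stats.mannwhitneyu will also compute a p-value directly, though it reports U under a slightly different convention.)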
Summary
Non-parametric stats are especially useful if the underlying population is not normally distributed. They are also nice “down and dirty” tests for a quick assessment of the significance of some results. Unlike the sign test or Wilcoxon, the test of exact probability retains all of the information available in the data, and is thus as powerful as any of the more standard parametric tests.
————————————-
In the case of our data – you could apply one of these tests to determine if improvement occurred within a session, or between the two sessions. A note of caution, though: the convention that an event occurring less than 5% of the time is unlikely has to hold for your whole set of comparisons, not just for each test separately. Suppose you had six different measures for each participant, and you wanted to determine if any of them differed from each other. This would entail 15 different tests of differences between pairs of measures. If you set your criterion of unlikelihood at .05 for each one, more than half the time at least one of the comparisons will be significant just by chance. There are ways of correcting for this “inflated alpha level,” but we will not address them.
As discussed, these tests apply to only two data points per participant (or two groups). Variants exist that can examine more than two, but they are beyond the scope of this discussion.
————————-
My source for most information about nonparametric tests is the classic book on the topic:
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.