Circles at the bottom come from studies with few participants (about 20); circles at the top come from large studies with many more participants. Source: Verhaeghen.

The second reason why published effect sizes are likely to be overestimates is that researchers have many incentives to decrease the p-values of almost significant results and none to increase the p-values of significant findings (John et al.). Because p-values are inversely related to effect sizes, these practices lead to larger published effect sizes than would be warranted if the data had not been tinkered with.
The end result of publication bias and p-hacking is that published findings usually are not a good source for estimating the size of the effect you are going to examine (remember that power programs critically rely on this information).
Another potential source for estimating the effect size is the data of a small-scale study you already ran. Indeed, grant submissions have a better chance of being awarded if they include pilot testing. Unfortunately, pilot studies do not solve the problem; even worse, pilot testing is likely to put you on a false trail if a significant effect in the pilot is the only reason to embark on a project. Pilot studies do not give valid estimates of effect sizes for the simple reason that they are too small. Everything we discussed in relation to Figure 1 is also true for pilot testing.
Do older adults know more words than young adults? Before continuing, you may want to think about what your answer to that question is. The nice aspect of the question is that there are many data around. In many studies, young adults are compared to older adults, and quite often a vocabulary test is included to get an estimate of the crystallized intelligence of the participants. So, all we have to do is search for studies including vocabulary tests and write down (1) the sample sizes tested and (2) the effect sizes reported.
Verhaeghen did exactly this analysis. Figure 2 shows the outcome. Two aspects are noteworthy in Figure 2. First, it looks like older adults know more words than young adults. Second, the effect sizes reported in individual studies vary widely, particularly for the smaller studies. This illustrates the problem you are confronted with when you run a single, small-scale pilot study: you have no idea where your single data point falls relative to the entire picture.
Such a figure also informs you that with small samples the confidence intervals around the obtained effect sizes cover almost all values, going from large effect sizes in favor of the hypothesis to large effect sizes against it. Hopefully this will save you from overinterpreting effect sizes obtained in pilot studies. Another nice aspect of Figure 2 is that the question was theory-neutral.
The data had not been collected to test one or the other theory; they were just collected for descriptive purposes. As a result, the figure looks symmetric, as it should. However, such funnel-based techniques require many highly powered studies before the shape of the funnel becomes visible. Thus far the story has been largely negative, and in this respect it mimics many existing papers on power analysis: you need an estimate of the effect size to get started, and it is very difficult to get a useful estimate.
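To see how little a small pilot constrains the effect size, the following sketch (not part of the original analysis; the true effect of d = 0.4 and the group size of 20 are assumptions chosen for illustration) simulates many small two-group pilot studies and reports the spread of the observed effect sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
true_d, n_per_group, n_sims = 0.4, 20, 10_000   # assumed values, for illustration only

observed_d = []
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    observed_d.append((treatment.mean() - control.mean()) / pooled_sd)

observed_d = np.array(observed_d)
print(f"true d = {true_d}")
print(f"middle 95% of pilot estimates: "
      f"{np.percentile(observed_d, 2.5):.2f} to {np.percentile(observed_d, 97.5):.2f}")
```

Under these assumptions the pilot estimates range from clearly negative to far larger than the true effect, which is exactly the spread visible at the bottom of the funnel.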
There is a way out, however. This is nicely summarized in the statistical guidelines of the Psychonomic Society, one of the larger and more thoughtful publishers in scientific psychology: "Statistical power refers to the sensitivity of a test to detect hypothetical true effects. Studies with low statistical power often produce ambiguous results.
Thus it is highly desirable to have ample statistical power for effects that would be of interest to others and to report a priori power (at several effect sizes), not post hoc power, for tests of your main hypotheses." Best practice is to determine what effects would be interesting (e.g., the smallest effect size you would still care about) before the data are collected. For a repeated-measures factor, such an effect size means that about two thirds of the participants show the effect. Finally, in all our examples we assume that you have balanced designs.
That is, each cell of the design contains the same number of observations. Unbalanced designs are known to have lower power, the more so as the imbalance becomes stronger. Imbalances are more prevalent for between-groups variables than for repeated-measures variables, as participants in the various groups must be recruited separately, whereas most participants take part in all within-participant conditions.
Nearly every text on power deals with the simplest cases: t-tests and single correlations. For t-tests, a distinction is made between a t-test for a between-groups factor and a t-test for a repeated-measures factor (see also under ANOVA). These numbers are part of the guidelines to be used for more complex designs. As a rule of thumb, never expect the numbers for complex designs to be lower than the numbers for simple designs (see below for the few exceptions).
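As an illustration of how the numbers for these simple cases are obtained (a sketch using standard power routines, not the article's own calculations; the effect size d = 0.4, alpha = .05, and power = .80 are assumed values):

```python
from statsmodels.stats.power import TTestIndPower, TTestPower

d, alpha, power = 0.4, 0.05, 0.80           # assumed inputs, for illustration

# Independent-samples t-test: participants needed per group
n_between = TTestIndPower().solve_power(effect_size=d, alpha=alpha, power=power)

# Paired (repeated-measures) t-test: the effect size here must be d_z,
# i.e., based on the difference scores (see the discussion of d_z below)
n_within = TTestPower().solve_power(effect_size=d, alpha=alpha, power=power)

print(f"between-groups design: about {n_between:.0f} participants per group")
print(f"repeated-measures design: about {n_within:.0f} participants in total")
```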
For between-groups designs, assume that the number of participants per condition is the one you have to keep constant; so, a design with three groups will require at least three groups of that many participants each. The number of data pairs needed for a simple correlation can be extended to multiple regression analysis.
So, the required number of participants increases with each predictor that is added. The required numbers are higher when the intercorrelations among the predictors are higher than the correlations of the predictors with the dependent variable (Maxwell).

In recent years, Bayesian analysis has been proposed as an alternative to traditional frequentist tests, such as t-tests and ANOVAs.
An advantage of Bayesian analysis is that it gives information about the relative likelihood not only of the alternative hypothesis but also of the null hypothesis. A Bayes factor of 10 or more is considered strong evidence for the alternative hypothesis; a Bayes factor of 1/10 or less, strong evidence for the null hypothesis. In the analyses below, we use a crude method, recommended for researchers without detailed knowledge of the processes they are investigating. It is the analysis most likely to be used by researchers unfamiliar with the details of Bayesian analysis who want to use the technique for null hypothesis significance testing.
There are no power calculators for Bayesian analyses yet, but we can estimate the required sample sizes by simulation with existing algorithms. An advantage of Bayesian analysis is that it allows you to conclude in favor of the null hypothesis. What is usually not mentioned is that you need a lot of data for that. Because Bayes factors that provide strong evidence for the null hypothesis require very large samples, less stringent criteria are often used; these are more feasible, as can be seen below.
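A minimal sketch of such a simulation-based estimate. It does not use the default-prior Bayes factor of standard software but the cruder BIC approximation (Wagenmakers, 2007); the true effect d_z = 0.4, the sample size of 80, and the threshold of BF10 > 3 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def bic_normal(x, mu):
    """BIC of a normal model for x, with the mean fixed at mu (null) or estimated (alternative)."""
    n = len(x)
    k = 1 if mu is not None else 2          # free parameters: sigma only, or mu and sigma
    m = mu if mu is not None else x.mean()
    sigma2 = np.mean((x - m) ** 2)          # maximum-likelihood variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return k * np.log(n) - 2 * loglik

def bf10_paired(diff):
    """Approximate Bayes factor (alternative over null) for difference scores via BIC."""
    return np.exp((bic_normal(diff, mu=0.0) - bic_normal(diff, mu=None)) / 2)

true_dz, n, n_sims, threshold = 0.4, 80, 5000, 3.0   # assumed values, for illustration
hits = sum(bf10_paired(rng.normal(true_dz, 1.0, n)) > threshold for _ in range(n_sims))
print(f"P(BF10 > {threshold}) with n = {n}: {hits / n_sims:.2f}")
```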
A typical mistake made within traditional statistical testing is that the absence of a significant effect is interpreted as evidence for the null hypothesis. This is wrong, because only some of the non-significant effects are due to the null hypothesis being true. To show that the data are in line with the null hypothesis, you must go further and demonstrate that the effect is so small that it does not have theoretical importance (Lakens et al.).
This can be done by examining whether the effect stays within two pre-established narrow bounds around zero (equivalence testing; a sketch follows below).

The simple tests just described are the backbone of power analysis, because we use them every time we calculate post hoc pairwise comparisons to understand the pattern of effects observed in more complicated designs.
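A minimal sketch of such an equivalence test (the two one-sided tests, or TOST, procedure). The equivalence bounds of ±0.1 and the simulated data are assumptions for illustration, not values from the article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical paired data: difference scores with a true mean of zero
diff = rng.normal(0.0, 1.0, 200)

low, high = -0.1, 0.1                        # pre-registered equivalence bounds (assumed)

# TOST: the effect counts as "equivalent to zero" only if BOTH one-sided tests are significant
p_lower = stats.ttest_1samp(diff, low, alternative="greater").pvalue
p_upper = stats.ttest_1samp(diff, high, alternative="less").pvalue
p_tost = max(p_lower, p_upper)

print(f"TOST p = {p_tost:.3f} -> "
      f"{'effect falls within the bounds' if p_tost < .05 else 'equivalence not shown'}")
```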
If anything, these post hoc tests will require more participants than the numbers reported above, because they need to be corrected for multiple testing. On the positive side, if we have specific predictions for pairwise comparisons, we can use one-tailed tests, which reduce the number of participants needed. In terms of power, simple designs (one independent variable, two levels) are preferable, and researchers are advised to keep their designs as simple as possible.
However, sometimes it makes sense to have three levels of a categorical variable. This is the case, for instance, when there is a known difference between two conditions, and a third condition is examined which is expected to yield results in line with one of the two conditions, or results in-between. Then it makes sense to compare the new condition to the two already well-established conditions.
For example, it is known that associated words prime each other. Associated words are words that spontaneously come to mind upon seeing a prime word. The second word (the target) is processed faster when it follows the first word (the prime) than when it follows an unrelated word (girl–boy vs. an unrelated pair). Suppose a researcher now wants to know to what extent non-associated, semantically related words prime target words. Then it makes sense to present the three types of primes in a single experiment to (a) make sure that a priming effect is observed for the associated pairs, and (b) examine how large the priming effect is for the new primes relative to the associated pairs.
The semantic priming example is likely to be a repeated-measures experiment. However, the same reasoning applies to a between-groups design. We start with the latter design (three independent groups), because the independence of observations makes the calculations easier. It is also important to be able to find the population pattern in post hoc tests (why else include three conditions?).
Simulations indicate the following numbers. The numbers are especially high in the last design because of the need for significant post hoc tests. This illustrates once again that you must have good reasons to add extra conditions to your design! When we increase the number of groups to 3, G*Power informs us that we now only need 246 participants, or 82 per group (Figure 3). In addition, this is only the power of the omnibus ANOVA test, with no guarantee that the population pattern will be observed in pairwise comparisons of the sample data.
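The following sketch shows what such a simulation can look like. The group size of 82 comes from the G*Power output just mentioned; the population means and the Bonferroni-corrected check of the extreme pairwise comparison are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Assumed population means (sd = 1): condition B sits halfway between A and C
means, n, n_sims = [0.0, 0.2, 0.4], 82, 2000

omnibus_hits = pattern_hits = 0
for _ in range(n_sims):
    a, b, c = (rng.normal(m, 1.0, n) for m in means)
    if stats.f_oneway(a, b, c).pvalue < .05:
        omnibus_hits += 1
        # Post hoc: is the crucial A-vs-C difference still significant after correction?
        p_ac = stats.ttest_ind(a, c).pvalue
        if p_ac * 3 < .05:                  # crude Bonferroni correction for 3 comparisons
            pattern_hits += 1

print(f"power of omnibus ANOVA: {omnibus_hits / n_sims:.2f}")
print(f"omnibus AND corrected A-vs-C test significant: {pattern_hits / n_sims:.2f}")
```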
We can also run a Bayesian analysis for a design with three independent groups. In Bayesian analysis, the Bayes factors have been argued not to require adjustment for multiple tests.
Because the numbers of participants in properly powered between-groups designs are so high, many researchers are motivated to run designs within participants, also called repeated-measures designs. Unfortunately, different effect sizes can be calculated for repeated-measures designs and this is the source of much confusion and incorrect use of power calculators.
To illustrate the issue, we make use of a toy dataset, shown in Table 2. It includes the average reaction times in milliseconds of 10 participants to target words preceded by related primes and unrelated primes.
Every participant responded to both conditions, so the manipulation is within subjects.

Table 2: Example data (reaction times) from a word recognition experiment as a function of prime type (related, unrelated).
As we can see in Table 2, almost all participants showed the expected priming effect (faster responses after a related prime than after an unrelated prime). Only the last participant had a difference in the opposite direction. The effect size d in a t-test for related samples is based on the difference scores. Notice that the effect size is uncommonly large, as often happens in statistics textbooks when small datasets are used as examples. This d-value for the related-samples t-test is the one we implicitly assume when we think of a pairwise effect size in a repeated-measures design.
A common way to obtain d is to convert the t-value with the equation d = 2t/√df. Unfortunately, this equation is only valid for between-groups designs. For repeated measures the correct equation is d_z = t/√N, with N the number of participants. The multiplication by 2 is not warranted because the two observations per participant are not independent, as can be seen in the degrees of freedom of the t-test. As a result, the typical conversion overestimates the d-value by a factor of two. Still, the error is easily made, because people use the same equation for between-subjects designs and within-subjects designs.
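A small numerical check of this point (simulated reaction times, not the values of Table 2): converting the related-samples t with the between-groups formula gives roughly twice the d_z computed directly from the difference scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

n = 10
unrelated = rng.normal(600, 50, n)              # hypothetical RTs after unrelated primes
related = unrelated - rng.normal(20, 10, n)     # hypothetical priming effect of about 20 ms

diff = unrelated - related
t = stats.ttest_rel(unrelated, related).statistic

dz_direct = diff.mean() / diff.std(ddof=1)      # d_z from the difference scores
dz_from_t = t / np.sqrt(n)                      # correct conversion for repeated measures
d_wrong = 2 * t / np.sqrt(n - 1)                # between-groups formula, applied wrongly

print(f"d_z (direct) = {dz_direct:.2f}, t/sqrt(N) = {dz_from_t:.2f}, "
      f"2t/sqrt(df) = {d_wrong:.2f}  (about twice too large)")
```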
Another way in which we can go astray with the calculation of d in repeated-measures designs is due to the fact that d can be defined in two ways. First, it can be defined, as we just did, on the basis of the difference scores, and this definition is the one that matters for power analysis. However, d can also be defined as the difference in means divided by the mean standard deviation. The latter definition is interesting for meta-analysis because it makes the effect size comparable across between-groups designs and repeated-measures designs.
Some of the studies included in the meta-analysis have the same participants in both conditions; other studies have different participants in the conditions.
Because there are two definitions of d for pairwise comparisons in repeated-measures designs, it makes sense to give them different names and to calculate both. The d-value based on the t-test for related samples is traditionally called d_z, and the d-value based on the means and the average standard deviation is called d_av. Because d_z and d_av are related to each other, we can derive the mathematical relation between them.
The crucial variable is the correlation between the levels of the repeated-measures factor. In the data of Table 2 this correlation is very high: all participants show more or less the same difference score, despite large differences in overall response times. More specifically, it can be shown that d_z = d_av / √(2(1 − r)), where r is the correlation between the two conditions. The inclusion of the correlation in the equation makes sense, because the more correlated the observations are across participants, the more stable the difference scores are and, hence, the larger d_z. We can easily see how things can go wrong when using a power calculator.
It might look like an unrealistically high correlation. However, it is a value that can be observed in well-run reaction time experiments.
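To make the role of the correlation concrete, here is a sketch (the population means, standard deviation, and correlations are assumed values, chosen only for illustration) that simulates paired data at two correlation levels and compares d_av with d_z.

```python
import numpy as np

rng = np.random.default_rng(6)

def simulated_dz_dav(r, n=100_000, mean_diff=20, sd=50):
    """Simulate two correlated conditions and return (d_av, d_z)."""
    cov = [[sd**2, r * sd**2], [r * sd**2, sd**2]]
    x = rng.multivariate_normal([600, 600 - mean_diff], cov, size=n)
    unrelated, related = x[:, 0], x[:, 1]
    diff = unrelated - related
    d_av = diff.mean() / ((unrelated.std(ddof=1) + related.std(ddof=1)) / 2)
    d_z = diff.mean() / diff.std(ddof=1)
    return d_av, d_z

for r in (0.5, 0.9):
    d_av, d_z = simulated_dz_dav(r)
    print(f"r = {r}: d_av = {d_av:.2f}, d_z = {d_z:.2f}, "
          f"d_av / sqrt(2*(1-r)) = {d_av / np.sqrt(2 * (1 - r)):.2f}")
```

Under these assumptions d_av stays the same, whereas d_z grows substantially as the correlation increases, in line with the equation above.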
To get a better idea of the correlations observed in psychological research, we analyzed some of the repeated-measures experiments studied in two large replication projects (one of them Camerer et al.). The advantage of replication studies is that the full datasets are available. Table 3 shows the results.
In particular, studies with reaction times and ratings (two heavily used dependent variables in psychological research) have high intercorrelations between the levels of a repeated-measures factor. We will return to these correlations in the second half of the article.

Table 3: Correlations observed between the levels of a repeated-measures factor in a number of studies with different dependent variables.
There are two surprising cases of negative correlations in Table 3. The first is a study of valence ratings (from negative to positive, on a Likert scale from 1 to 9). Apparently, the participants who rated the positive images very positively also rated the negative images very negatively, whereas other participants had less extreme ratings. The second negative correlation comes from a study comparing memories for information not presented (false memories) to memories for information presented.
Apparently, participants who remembered less had a tendency to report more false memories.

For three levels, the required number of participants becomes 42; for four levels, 36; and for five levels fewer still. This is because we wrongly assume that the f-value does not decrease as more levels with in-between values are added, and because we are only asking for the significance of the omnibus ANOVA test.
This is where simulations form a nice addition. There are two ways in which we can simulate the design. As a matter of fact, adding a condition increases the number of participants to be tested, even in a repeated-measures design. This is not because the omnibus ANOVA fails to reach significance (as it happens, it is seriously overpowered with the recommended numbers), but because many observations are needed to replicate the pattern of pairwise population differences that drive the effect.
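A sketch of one such simulation. The sample size of 42 is the number mentioned above for three levels; the population means, standard deviation, and correlation are assumptions for illustration. The simulation counts how often all Bonferroni-corrected pairwise comparisons come out significant, i.e., how often the population pattern is recovered.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Assumed population: three conditions with means 600/610/620 ms, sd 50, correlation .8
n, means, r, sd, n_sims = 42, np.array([600.0, 610.0, 620.0]), 0.8, 50.0, 2000
cov = sd**2 * (r * np.ones((3, 3)) + (1 - r) * np.eye(3))

full_pattern = 0
for _ in range(n_sims):
    data = rng.multivariate_normal(means, cov, size=n)
    # All three pairwise related-samples comparisons, Bonferroni-corrected
    ps = [stats.ttest_rel(data[:, i], data[:, j]).pvalue * 3
          for i in range(3) for j in range(i + 1, 3)]
    full_pattern += all(p < .05 for p in ps)   # pattern recovered only if all are significant

print(f"probability of recovering the full pairwise pattern: {full_pattern / n_sims:.2f}")
```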
Because this was the case, we no longer report the separate results. Just know that the numbers are valid across the range of correlations we examined.

Sometimes we want to include two variables in our design, for instance two repeated-measures factors. There are two reasons why we may want to include an extra variable in a study. The first is that we want to control for a possible nuisance variable. In that case we are primarily interested in the main effect of the target variable.
We do not expect the other variable to have much effect, but we include it for certainty. Basically, the question is the same as in a t-test for repeated measures. The only things that differ are that the design has become more complex and that we collect four observations from each participant instead of two. The number of required participants is about half that of the t-test for related samples. This is because the effect of A is observed across both levels of B and we have twice as many observations per participant (four instead of two).
A mistake often made in this respect, however, is that researchers assume that the reduction in the number of participants remains valid when they simply increase the number of stimuli (e.g., to 80).

The second reason why we may want to include two variables in the design is that we are interested in the interaction between the variables. There are three possible scenarios (Figure 3). In the first case we expect a full crossing of the effects. That is, for one level of factor A the effect of factor B will be positive, and for the other level of A it will be negative (left panel of Figure 3).
The most common origin of such an interaction is when a control variable has an effect as strong as the variable we are interested in.
For instance, we have to use different stimulus lists for a memory experiment and it turns out that some lists are considerably easier than the others. Then we are likely to find a cross-over interaction between the variable we are interested in and the lists used in the different conditions. Ironically, this interaction is the easiest to find.
Different types of interactions researchers may be interested in. Left panel: fully crossed interaction. Middle panel: the effect of A is only present for one level of B. Right panel: The effect of A is very strong for one level of B and only half as strong for the other level.
The power requirements of this type of interaction have long been misunderstood. Compared with the fully crossed interaction, the interactions in the middle and right panels of Figure 3 involve a much smaller effect size for the interaction term. Needless to say, this reduction of effect size has serious consequences for the number of participants required to find the interaction.
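To see why the consequences are so serious, recall how required sample size scales with effect size (a general back-of-the-envelope approximation, not a derivation given in the article): for fixed alpha and power, n grows with the inverse square of the standardized effect, so halving the relevant effect size roughly quadruples the required number of participants.

```latex
% Per-group sample size for comparing two means (normal approximation):
n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}}{d^{2}}
\qquad\Longrightarrow\qquad
d \rightarrow d/2 \;\;\text{implies}\;\; n \rightarrow 4n
```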
Simonsohn showed that the numbers have to be multiplied by four (see also Giner-Sorolla; Perugini et al.). In addition, we not only want to find a significant interaction; we also want to find a significant effect of A at the appropriate level of B in a separate pairwise test, and the absence of the effect at the other level. Indeed, much scientific research examines the boundary conditions of well-established effects. Such an interaction is only clearly interpretable when the size of the interaction is larger than the smallest main effect.
When the size of the interaction is equal to the smallest main effect (as in the right panel of Figure 3), the interaction is borderline interpretable. However, when the size of the interaction is smaller than the smallest main effect, it cannot be interpreted, because the interaction could be due to a non-linear relationship between the unobservable process of interest and the overt response that can be measured.
Garcia-Marques et al. discuss this issue in more general terms. As a rule of thumb, the interaction will not be smaller than both main effects when the lines touch or cross each other at some point. The numbers of participants required are very similar to those for the situation depicted in the middle panel of Figure 3. So, the effect size of the interaction is much the same. The remaining small difference in numbers is due to the extra requirements related to the pairwise post hoc tests.
When performance of two groups is compared, researchers often use a so-called split-plot design with one between-groups variable and one repeated-measures factor.
Indeed, researchers often wonder whether such a design is not more powerful than a simple between-groups comparison. Suppose you want to examine whether students with dyslexia are disadvantaged in naming pictures of objects.
What is to be preferred then? A simple one-way design in which you compare students with dyslexia and controls on picture naming? Or a split-plot design in which the group comparison is combined with a repeated-measures factor? For a long time, the author was convinced that the latter option was to be preferred (because of what power calculators told him), but is this confirmed in simulations?
Before we start with the interactions, it is good to have a look at the main effect of the repeated-measures factor. In a first scenario, the between-groups variable is not expected to have a main effect or to interact with the repeated-measures factor. It just increases the complexity of the design. In a second scenario, the between-groups variable (e.g., a Latin-square counterbalancing group) interacts with the main effect of the repeated-measures variable.
One stimulus set is easier than the other, and this introduces an effect of equal size. How many participants do we need in such a scenario to find the main effect of the repeated-measures variable with the desired power?
This is interesting news, because it tells us that we can add extra between-groups control factors to our design without much impact on the power to detect the main effect of the repeated-measures variable, as was indeed argued by Pollatsek and Well.

We can also look at the power of the between-groups variable. Is it the same as for the between-groups t-test, or does the fact that we have two observations per participant make a difference?
And does the outcome depend on the correlation between the levels of the repeated-measures variable? The simulations show the following pattern: the lower the correlation between the levels of the repeated-measures variable, the smaller the number of participants becomes. This can be understood, because highly correlated data do not add much new information and therefore do little to decrease the noise in the data.
In contrast, uncorrelated data add new information. When the interaction is the focus of attention, we have to make a distinction between the three types of interactions illustrated in Figure 3. The fully crossed interaction is most likely to be found with control variables (e.g., the counterbalanced stimulus lists discussed above). The other two interactions are more likely to be of theoretical interest. If we only look at the significance of the interaction, then two groups of 27 participants each are enough for an F-test.
Half of the time, however, the interaction will not be accompanied by the right pattern of post hoc effects in the groups. For the complete pattern to be present, we need two groups of 67 participants for the F-test and even larger groups for the Bayesian analysis.
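A sketch of how such a split-plot simulation can be set up. The group size of 67 is the number just mentioned; the effect size, correlation, and group labels are assumptions for illustration. The 2 × 2 interaction is tested as a between-groups comparison of the within-participant difference scores, and the post hoc pattern is checked naively (effect significant in one group, non-significant in the other).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

def simulate_group(n, effect, r, sd=1.0):
    """n participants with two correlated scores; `effect` is the condition difference."""
    cov = sd**2 * np.array([[1, r], [r, 1]])
    return rng.multivariate_normal([0.0, effect], cov, size=n)

n, d, r, n_sims = 67, 0.4, 0.8, 2000            # assumed values
interaction_hits = pattern_hits = 0
for _ in range(n_sims):
    dyslexia = simulate_group(n, effect=d, r=r)    # hypothetical group showing the effect
    controls = simulate_group(n, effect=0.0, r=r)  # hypothetical group without the effect
    diff_dys = dyslexia[:, 1] - dyslexia[:, 0]
    diff_con = controls[:, 1] - controls[:, 0]
    # The group-by-condition interaction equals a between-groups test on the difference scores
    if stats.ttest_ind(diff_dys, diff_con).pvalue < .05:
        interaction_hits += 1
        # Naive post hoc pattern: effect present in one group, absent in the other
        if (stats.ttest_1samp(diff_dys, 0.0).pvalue < .05
                and stats.ttest_1samp(diff_con, 0.0).pvalue >= .05):
            pattern_hits += 1

print(f"interaction significant: {interaction_hits / n_sims:.2f}; "
      f"with the full post hoc pattern: {pattern_hits / n_sims:.2f}")
```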
So, a split-plot design is not more powerful than a between-subjects design in terms of participants required. It does give more information, though, because it adds information about a possible main effect of the between-groups variable, and the group dependency of the repeated-measures effect.
Notice how different the outcome is from the conviction mentioned in the introduction that you can find small effect sizes in a split-plot design with 15 participants per group. When seeing this type of output, it is good to keep in mind that you need 50 participants for a typical effect in a t-test with related samples. This not only sounds too good to be true; it is too good to be true. Some authors have recommended using an analysis of covariance for situations in which pretest and posttest scores of two groups of people are compared.
Different numbers of participants are appropriate for different confidence levels and desired margins of error. Even though there are many different recommendations for sample sizes in quantitative usability testing, they are all consistent with each other — they simply make slightly different assumptions.
We think this user guideline is the simplest and the most likely to lead to good results — namely, a relatively small margin of error with a high confidence level. Moreover, if you have tolerance for a larger margin of error, you can drop the number of users to 20 or even fewer, but that is generally a lot riskier. An acceptable strategy (especially if you are on a tight budget and mostly interested in continuous metrics such as task time and satisfaction) is to start with as many users as you can comfortably afford — say, 20–25 users — and then look at the confidence intervals for your metrics, as in the sketch below.
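A sketch of that check (the task times below are made-up values for 20 hypothetical users): compute the mean task time and its 95% confidence interval, then judge whether the margin of error is acceptable.

```python
import numpy as np
from scipy import stats

# Hypothetical task times in seconds from a pilot round with 20 users
task_times = np.array([34, 52, 41, 66, 38, 47, 55, 29, 61, 44,
                       50, 36, 58, 43, 70, 39, 48, 33, 57, 45])

n = len(task_times)
mean = task_times.mean()
sem = stats.sem(task_times)                                  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

print(f"mean task time: {mean:.1f} s, 95% CI: [{ci_low:.1f}, {ci_high:.1f}] s")
print(f"margin of error: ±{(ci_high - ci_low) / 2:.1f} s "
      f"({(ci_high - ci_low) / 2 / mean:.0%} of the mean)")
```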
If the confidence intervals are too wide, then consider adding more users. Otherwise, you risk compromising the validity of your study. To learn how to correctly analyze and interpret your quantitative data, check out our full-day seminar, How to Interpret UX Numbers: Statistics for UX.
CHART 1. Description of different parameters to be considered in the calculation of sample size for a study aiming at estimating the frequency of health outcomes, behaviors, or conditions.

Population size: the total population from which the sample will be drawn and about which researchers will draw conclusions (the target population). Information regarding population size may be obtained from secondary data (hospitals, health centers, census surveys, schools, etc.). The smaller the target population, the larger the sample size will proportionally be.

Expected prevalence: information regarding expected prevalence rates should be obtained from the literature or by carrying out a pilot study.

Sample error of the estimate: the value we are willing to accept as error in the estimate obtained by the study. The smaller the sample error, the larger the sample size and the greater the precision. In health studies, values between two and five percentage points are usually recommended.

Significance (confidence) level: the probability that the expected prevalence will fall within the established error margin. The higher the confidence level (greater expected precision), the larger the sample size.

Design effect: necessary when the study participants are chosen by cluster selection procedures. This means that, instead of participants being selected individually (simple, systematic, or stratified sampling), they are first divided into groups (census tracts, neighborhoods, households, days of the week, etc.) and these groups are randomly selected. Thus, greater similarity is expected among the respondents within a group than in the general population. This generates a loss of precision, which needs to be compensated by an increase in sample size; the principle is that the total estimated variance may have been reduced as a consequence of the cluster selection. The value of the design effect may be obtained from the literature; when not available, a value between 1 and 2 is commonly adopted. The greater the homogeneity within each group (the more similar the respondents are within each cluster), the greater the design effect and the larger the sample size required to preserve precision. In studies that do not use cluster selection (simple, systematic, or stratified sampling), the design effect is taken to be 1.
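Putting these parameters together, the standard formula for such a prevalence estimate is n = DEFF × z² × p(1 − p) / e², with an optional finite-population correction (a textbook sketch, not a calculation from the source; all input values below are illustrative).

```python
import math
from scipy.stats import norm

def prevalence_sample_size(p, error, confidence=0.95, deff=1.0, population=None):
    """Sample size for estimating a prevalence p with a given absolute error.

    Uses n = deff * z^2 * p * (1 - p) / error^2, with an optional finite
    population correction. All argument names and defaults are illustrative.
    """
    z = norm.ppf(1 - (1 - confidence) / 2)        # e.g. 1.96 for 95% confidence
    n = deff * z**2 * p * (1 - p) / error**2
    if population is not None:
        n = n / (1 + (n - 1) / population)        # finite population correction
    return math.ceil(n)

# Illustrative inputs: expected prevalence 20%, acceptable error of 5 percentage points
print(prevalence_sample_size(p=0.20, error=0.05))                              # simple random sample
print(prevalence_sample_size(p=0.20, error=0.05, deff=1.5, population=2000))   # clustered, finite population
```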
CHART 2. Acceptable error of 5 percentage points; acceptable error of 2 percentage points.

CHART 3. Description of different parameters to be considered in the calculation of sample size for a study aiming at comparing the frequency of health outcomes, behaviors, or conditions between groups.

Alpha error (significance level): expressed by the p value. The smaller the Alpha error (the greater the confidence level), the larger will be the sample size.

Statistical power (1 − Beta): the ability of the test to detect a difference in the sample when it exists in the target population, calculated as 1 − Beta. The greater the power, the larger the required sample size will be.

Ratio between the groups: for observational studies, the value is usually obtained from the scientific literature. In intervention studies, a value of 1 is frequently adopted, indicating that half of the individuals will receive the intervention and the other half will be the control or comparison group. Some intervention studies may use a larger number of controls than of individuals receiving the intervention. The more distant this ratio is from one, the larger will be the required sample size.

Expected frequencies of the outcome in each group: data usually obtained from the literature.
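For completeness, a standard textbook approximation that combines these parameters when comparing two proportions (equal group sizes assumed; this formula is not taken from the source):

```latex
% Per-group sample size for comparing two proportions p1 and p2 (equal groups assumed):
n_{\mathrm{per\ group}} \approx
\frac{(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\bigl[p_{1}(1-p_{1}) + p_{2}(1-p_{2})\bigr]}
     {(p_{1}-p_{2})^{2}}
```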