The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables for a sample and computing descriptive statistics for that sample. Generally, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called parameters. Imagine, for instance, that a researcher measures the number of depressive symptoms exhibited by each of 50 clinically depressed adults and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for clinically depressed adults).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of clinically depressed adults, 6.45 in a second sample, and 9.44 in a third, even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s r) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third, again even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called sampling error. (Note that the term error here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)
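To make sampling error concrete, here is a minimal simulation sketch in Python. The population mean of 8 symptoms, the standard deviation, and the sample size of 50 are illustrative assumptions, not values from any real study; the point is simply that the sample mean changes from sample to sample even though the population never does.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: number of depressive symptoms among clinically
# depressed adults, with an assumed population mean of 8 symptoms.
population_mean = 8.0

for i in range(3):
    # Draw a new random sample of 50 adults each time.
    sample = rng.normal(loc=population_mean, scale=3.0, size=50)
    print(f"Sample {i + 1}: mean = {sample.mean():.2f}")

# The three sample means differ from one another and from the population
# mean of 8.0 purely because of sampling error.
```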

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is simply a matter of sampling error. Similarly, a Pearson’s r value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is simply a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

There is a relationship in the population, and the relationship in the sample reflects this relationship in the population.

There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the null hypothesis (often symbolized H0 and read as “H-naught”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the alternative hypothesis (often symbolized as H1). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: it might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows (a brief simulation after the list illustrates them):

Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.

Determine how likely the sample relationship would be if the null hypothesis were true.

If the sample relationship would be extremely unlikely, then reject the null hypothesis in favor of the alternative hypothesis. If it would not be extremely unlikely, then retain the null hypothesis.
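As a minimal sketch of this logic, the following Python snippet assumes the null hypothesis is true, simulates many studies under that assumption, and checks how often a group difference at least as large as the observed one appears by chance alone. All of the numbers (the observed difference, sample size, and standard deviation) are hypothetical and only meant to illustrate the three steps.

```python
import numpy as np

rng = np.random.default_rng(42)

observed_diff = 0.80   # hypothetical observed difference between two group means
n_per_group = 25       # hypothetical sample size per group
population_sd = 2.0    # hypothetical standard deviation of the characteristic

# Step 1: assume the null hypothesis is true (both groups share the same mean).
# Step 2: determine how likely a difference this large would be under that assumption.
n_simulations = 10_000
count_as_extreme = 0
for _ in range(n_simulations):
    group_a = rng.normal(0.0, population_sd, n_per_group)
    group_b = rng.normal(0.0, population_sd, n_per_group)
    if abs(group_a.mean() - group_b.mean()) >= observed_diff:
        count_as_extreme += 1

p_value = count_as_extreme / n_simulations

# Step 3: reject the null hypothesis if the observed result would be extremely
# unlikely under it; otherwise retain the null hypothesis.
print(f"Simulated p value: {p_value:.3f}")
print("Reject the null hypothesis" if p_value < .05 else "Retain the null hypothesis")
```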

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of d = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis, concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favor of the alternative hypothesis, concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the probability of the sample result if the null hypothesis were true. This probability is called the p value. A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A high p value means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the p value be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called α (alpha) and is almost always set to .05. If there is less than a 5% chance of a result as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be statistically significant. If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true; it means only that there is not currently enough evidence to conclude that it is false. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”
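As a hedged sketch of how this decision rule might look in practice, the snippet below runs an independent-samples t test with SciPy and compares the resulting p value with α = .05. The two groups of scores are simulated for illustration, not taken from any real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated scores for two groups (hypothetical data).
group_a = rng.normal(loc=10.0, scale=3.0, size=40)
group_b = rng.normal(loc=12.0, scale=3.0, size=40)

alpha = .05  # the conventional criterion

# The t test returns the test statistic and the p value.
t_statistic, p_value = stats.ttest_ind(group_a, group_b)

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject the null hypothesis (statistically significant)")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject the null hypothesis")
```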

The Misunderstood p Value

The p value is one of the most misunderstood quantities in psychological research (Cohen, 1994)[1]. Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the p value is the probability that the null hypothesis is true, that is, that the sample result occurred by chance. For example, a misguided researcher might say that because the p value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect. The p value is really the probability of a result at least as extreme as the sample result if the null hypothesis were true. So a p value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the p value is not the probability that any particular hypothesis is true or false. Instead, it is the probability of obtaining the sample result if the null hypothesis were true.
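One way to internalize the correct interpretation is to simulate a world in which the null hypothesis is known to be true and see how often “extreme” results still occur. This sketch, with arbitrary simulation settings, shows that when the null is true, p values below .05 turn up in roughly 5% of samples, which is exactly what the definition of the p value implies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_studies = 10_000
false_alarms = 0

for _ in range(n_studies):
    # The null hypothesis is true by construction: both groups come from
    # the same population, so any sample difference is pure sampling error.
    group_a = rng.normal(0.0, 1.0, 30)
    group_b = rng.normal(0.0, 1.0, 30)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < .05:
        false_alarms += 1

# Roughly 5% of these studies yield p < .05 even though the null is true,
# illustrating that p is the probability of the data given the null,
# not the probability that the null is true.
print(f"Proportion of p < .05 when the null is true: {false_alarms / n_studies:.3f}")
```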

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the p value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the p value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s d is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s d is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.
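A short sketch can make this concrete. The effect sizes and sample sizes below mirror the two imagined studies above, but the data are simulated rather than real, so the exact p values are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def simulated_p(d, n_per_group):
    """Simulate one two-group study with a true standardized difference d and return its p value."""
    women = rng.normal(loc=d, scale=1.0, size=n_per_group)
    men = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    return stats.ttest_ind(women, men).pvalue

# Strong relationship (d = 0.50) based on a large sample (500 per group).
print(f"d = 0.50, n = 500 per group: p = {simulated_p(0.50, 500):.4f}")

# Weak relationship (d = 0.10) based on a tiny sample (3 per group).
print(f"d = 0.10, n = 3 per group:   p = {simulated_p(0.10, 3):.4f}")
```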

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other, so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word Yes, then this combination would be statistically significant for both Cohen’s d and Pearson’s r. If it contains the word No, then it would not be statistically significant for either. There is one cell where the decision for d and r would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests.”

Although Table 13.1 provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this lesson in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.

Statistical Significance Versus Practical Significance

Table 13.1 illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007)[2]. The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word significant can cause people to interpret these differences as strong and important, perhaps even important enough to influence the courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak, perhaps even “trivial.”

This is why it is important to distinguish between the statistical significance of a result and the practical significance of that result. Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant, and may even be interesting for purely scientific reasons, but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice, especially if simpler and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.

Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is simply due to chance.

The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favor of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.

The probability of obtaining the sample result if the null hypothesis were true (the p value) depends on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.

Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.