$\begingroup$

In different areas of STEM, different p-values are accepted as indicating significant differences between groups. For example, in behavioral biology a p-value of 0.05 is often considered to indicate statistical significance (primarily because of very small sample sizes), while in some areas of physics or bioinformatics p-values of 1e-5 or lower may be required as the threshold for rejecting the null hypothesis.

For example, here I compare samples from the same distribution, but the p-value trends noticeably downward as the sample size increases:

import numpy as np
from scipy import stats
N1 = stats.norm(loc=0, scale=1)
N2 = stats.norm(loc=0.1, scale=1)
for n in [10, 100, 1000]:
    np.random.seed(1)
    print(f"{n=} p={stats.ttest_ind(N1.rvs(n), N2.rvs(n)).pvalue:.3f}")
# n=10 p=0.958
# n=100 p=0.138
# n=1000 p=0.049

# simulate multiple tests
N_simulations = 10_000
observations_pvalues = {
    10:   np.zeros(N_simulations),
    100:  np.zeros(N_simulations),
    1000: np.zeros(N_simulations),
}
np.random.seed(1)
for n_observations, pvalues in observations_pvalues.items():
    for i in range(N_simulations):
        pvalues[i] = stats.ttest_ind(N1.rvs(n_observations), N2.rvs(n_observations)).pvalue
for n_observations, pvalues in observations_pvalues.items():
    print(f"n={n_observations} p-value: mean={pvalues.mean():.3f} std.dev.={pvalues.std():.4f}")
# n=10 p-value: mean=0.492 std.dev.=0.2895
# n=100 p-value: mean=0.427 std.dev.=0.3014
# n=1000 p-value: mean=0.107 std.dev.=0.1826

Is there any theory or know-how on how to choose a p-value threshold? Maybe something that depends on the number of data points or the type of statistical test?

$\endgroup$
  • $\begingroup$ Welcome to CV, Daniil. It is a common misconception, even among some statisticians, that the threshold can depend on the dataset size, especially for distributional tests (such as tests of normality). But the short answer is no, because the threshold reflects how you balance statistical risks, and that's a completely independent consideration. You can find scattered remarks among our higher-voted posts on p-values, so take a look at them. $\endgroup$
  • $\begingroup$ (Continued) Another fruitful avenue to investigate is that of the power of a test, because that quantifies the other side of the risk-balancing equation. $\endgroup$
  • $\begingroup$ A recent discussion on a very similar topic can be found here. Also, you don't "compare samples from the same distribution": N1 and N2 are shifted in location (so smaller p-values as sample size increases can be expected). $\endgroup$
  • $\begingroup$ @PBulls It is a convention, albeit unfortunate, to use the term "distribution" in the sense of "distributional family," and I believe that was the intended meaning in this question. $\endgroup$

3 Answers

$\begingroup$

Before deciding which p-value threshold to use for separating statistically "significant" from "not significant" results, first decide whether you really want that all-or-none dichotomy at all. Many (me, of course!) argue for a more evidentially based statistical examination of the data in support of inference.

One such approach is a neo-Fisherian significance test, where a p-value is interpreted as an index of the strength of evidence in the data against the null hypothesis (according to the statistical model); a smaller p-value is stronger evidence. That will, no doubt, be familiar to you in concept, but note that it allows the evidence to be expressed continuously rather than as the all-or-none "significant/not significant" result of the Neyman-Pearson hypothesis test approach, where a critical threshold is chosen in advance of the analysis. An important advantage is that the inference to draw from a significance test p-value is left up to the experimenter, who almost always knows more about the system and background than the statistical recipe does.

Here are a few links to help you explore that difference:

A reckless guide to p-values: local evidence and global errors.

$p$-value: Fisherian vs. contemporary frequentist definitions

Why was the term "significance" ($\alpha$) chosen for the probability of Type I error?

Are effect sizes really superior to p-values?

Interpretation of p-value in hypothesis testing

Is the "hybrid" between Fisher and Neyman-Pearson approaches to statistical testing really an "incoherent mishmash"?

$\endgroup$
$\begingroup$

As @whuber said in a comment, you should not choose the threshold based on sample size but, rather, on a comparison of risks.

How bad is a type 1 error? How bad is a type 2 error?

Suppose you work in pharma and you develop new drugs.

Case 1: You develop a new drug to treat common teenage acne. You think it works better than the current treatments, but it is very expensive and has bad side effects.

Type 1 error means that teens will pay a lot of money and endure side effects for no reason. Type 2 error means that teens will use a less effective medication for a fairly innocuous ailment.

Case 2: Same as above, but the ailment is pancreatic cancer, which is one of the worst kinds. But now:

Type 1 error means people who are dying take a useless medicine. Type 2 means people who could have lived die.

We ought to weigh each case and use judgment to decide. Instead, we often rely on the usual standard in our field.
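This tradeoff can be made concrete with a small Monte Carlo sketch (the numbers are invented for illustration: a standardized treatment effect of 0.5 and 50 patients per arm). Tightening $\alpha$ from 0.05 to 0.01 reduces the type 1 error rate, but also reduces the power, i.e. it raises the type 2 error rate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect, sims = 50, 0.5, 2000  # invented: patients per arm, standardized effect

def rejection_rate(true_effect, alpha):
    """Fraction of simulated trials in which H0 is rejected at this alpha."""
    hits = 0
    for _ in range(sims):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        if stats.ttest_ind(control, treated).pvalue < alpha:
            hits += 1
    return hits / sims

for alpha in (0.05, 0.01):
    fp = rejection_rate(0.0, alpha)        # type 1 error rate (drug is useless)
    power = rejection_rate(effect, alpha)  # 1 - type 2 error rate (drug works)
    print(f"alpha={alpha}: false-positive rate ~ {fp:.3f}, power ~ {power:.2f}")
```

In the pancreatic-cancer case, where a type 2 error means people who could have lived die, one might deliberately accept the higher false-positive rate of $\alpha=0.05$ to keep the power up; for the acne drug, the stricter threshold is easier to defend.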

$\endgroup$
$\begingroup$
  1. You do not choose a p-value threshold (called $\alpha$, or significance level) based on sample size. The simulation you ran, where the p-values decreased as sample size increased, shows completely expected behavior. The 2 samples you compare are from different distributions (there is a 0.1 shift in location). Therefore, as sample sizes increase, the power of your test increases, which means you are more likely to detect that shift, which implies the p-values will tend to decrease. So your significance level is independent of the sample size.
  2. You also do not choose it based on the type of statistical test. For a t-test, you use $\alpha=.05$, but for a median test (Mood) you use $\alpha=.025$? No, the significance level is independent of the test type.
  3. No, there is no "theory or know-how", or algorithm, or rule for how to choose a significance level. And yes, $\alpha=.05$ is the most generally used, by a wide margin, and the one which will generate the least amount of push-back.
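Point 1 can be quantified exactly, without simulation: at a fixed $\alpha=.05$, the power of the two-sample t-test against the question's 0.1 location shift grows with n, so ever-smaller p-values are expected while the threshold itself never moves. A sketch using scipy's noncentral t distribution (equal group sizes assumed):

```python
import numpy as np
from scipy import stats

alpha, d = 0.05, 0.1  # fixed significance level; standardized location shift

powers = {}
for n in (10, 100, 1000):
    df = 2 * n - 2                # degrees of freedom for equal group sizes
    nc = d * np.sqrt(n / 2)       # noncentrality parameter of the t statistic
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Power = P(|T| > t_crit), where T is noncentral t under the alternative.
    powers[n] = stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)
    print(f"n={n}: power = {powers[n]:.3f}")
```

The power climbs from barely above $\alpha$ at n=10 to roughly 0.6 at n=1000, matching the drop in the question's simulated mean p-values.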

The $p=0.05$ threshold is generally attributed to (incriminated upon?) Ronald Fisher. Indeed, his statement in his book “Statistical Methods for Research Workers” (1925) seems to be the first specific mention of this threshold determining statistical significance:

It is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. (p. 13)

As one can infer, this value corresponds to (approximately) 2 standard deviations away from the mean of a normal distribution. This excellent paper by Cowles and Davis actually shows that such a threshold, at 2 standard deviations of a normal deviate, has a much older, established origin.

For example, Student (aka Gosset) stated in his paper that

three times the probable error in the normal curve, for most purposes, would be considered significant. (p. 13)

It turns out, as Cowles and Davis note, that 3 probable errors (an older measure of variability) is approximately 2 standard deviations (our newer measure of variability), which indeed corresponds to $p=.05$.
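The arithmetic is easy to verify with scipy: the probable error of a normal deviate is its 75th-percentile distance (half the distribution lies within ±1 probable error), so three probable errors land at about 2.02 standard deviations, a two-sided tail probability of roughly 0.043:

```python
from scipy import stats

pe = stats.norm.ppf(0.75)              # probable error: half the mass lies within +/- pe
three_pe = 3 * pe                      # Gosset's "three times the probable error"
p_two_sided = 2 * stats.norm.sf(three_pe)
print(f"probable error = {pe:.4f} sd")
print(f"3 probable errors = {three_pe:.4f} sd")
print(f"two-sided tail probability = {p_two_sided:.4f}")
```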

So Fisher may have been wrongly “accused”, and this threshold may not be as “arbitrary” as many make it to be.

Note that Fisher did not think that this $p=.05$ threshold should be universally applied, for all experiments. In a 1926 article, he wrote:

If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent. point). Personally, the writer prefers to set a low standard of significance at the 5 per cent. point, and ignore entirely all results which fail to reach this level.

So indeed, one can pick other significance levels, depending on the risks (and benefits) of mistaken conclusions from a statistical test, as other answers have already stated.

However, in the same article, he added

A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.

“Rarely fails” implies that an experiment should be repeated (not a wild ask; that is a tenet of proper scientific practice, though one which seems to have fallen out of customary use these days...). So, for Fisher at least, the significance level was a long-term threshold, applied over a set of experiments and replications, one which guaranteed that one would be “wrong” in no more than 1 in 20 experiments, and that this was something a statistician should be able to “live with”.
Why is that? Because, as a smarter (but apparently anonymous) statistician said:

Statistics means never having to say that you are certain

(and here is even a little, hopefully humorous, song about it)

So we need to accept that the data will fool us sometimes, and we must learn to live with it. And as such, a threshold at 5% is not an unreasonable compromise (but it is certainly not the only compromise one could make).

To finish I will share some practices from my own experience, as an engineer working on medical devices.
I usually, by default, use $\alpha=.05$. Why? Because (almost) no one will ever challenge it, including the FDA, and because it provides an acceptable compromise between low risk of Type I errors, and required sample size (and associated time, effort and $$).
But say that I am testing a particular attribute of a life-support device, one whose failure would very likely result in the death of the patient. Then I am almost sure to use $\alpha=.01$. Why? Because I want to sleep peacefully at night knowing that I tried to minimize risks to patients, and because I want to protect the reputations of my product and employer.
But now, say that I am testing the readability of a user manual (IFU in our lingo). I pretty much always have used $\alpha=.2$, indeed because the associated risks are low, but also because we (in industry) all know that hardly any user will read it. And I can tell you that the FDA did not object to this relaxed significance level (they might if I tried $\alpha=.25$).

TL;DR: .05 is not as arbitrary as some may say and is not a bad compromise; it certainly can be adjusted based on the specific experiment, but not based on sample size or type of test.

$\endgroup$
  • $\begingroup$ It is certainly worth pointing out that Fisher used "significant" to say that a result is worthy of notice and (ideally) follow-on experiments, rather than in the one-and-done way that is built into the Neyman-Pearson hypothesis test method. $\endgroup$
