- You do not choose a p-value threshold ($\alpha$, the significance level) based on sample size. The behavior in the simulation you ran, where the p-values decreased as sample size increased, is completely expected. The 2 samples you compare are from different distributions (there is a 0.1 shift in location). Therefore, as sample sizes increase, the power of your test increases, which means you are more likely to get a small (significant) p-value; hence the decreasing p-values. So your significance level is independent of the sample size.
- You also do not choose it based on the type of statistical test. For a t-test you would use $\alpha=.05$, but for a median test (Mood's) you would use $\alpha=.025$? No; the significance level is independent of the test type.
- No, there is no theory, know-how, algorithm, or rule for how to choose a significance level. And yes, $\alpha=.05$ is, by a wide margin, the most generally used, and the one which will generate the least amount of push-back.
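The first bullet's point, that p-values shrink with sample size when a true location shift exists, can be illustrated with a minimal simulation sketch (assuming SciPy is available; the 0.1 shift and the sample sizes below are illustrative choices, not from the original question):

```python
# Sketch: with a true 0.1 location shift, power grows with sample size,
# so p-values shrink on average. Shift size and sample sizes are
# illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def mean_p(n, reps=200, shift=0.1):
    """Average two-sample t-test p-value over `reps` simulated experiments."""
    ps = []
    for _ in range(reps):
        x = rng.normal(0.0, 1.0, n)          # sample 1: N(0, 1)
        y = rng.normal(shift, 1.0, n)        # sample 2: N(0.1, 1)
        ps.append(stats.ttest_ind(x, y).pvalue)
    return float(np.mean(ps))

for n in (50, 500, 5000):
    print(n, round(mean_p(n), 4))
```

At small $n$ the average p-value hovers well above any conventional $\alpha$; at large $n$ it collapses toward zero, even though $\alpha$ itself never changed.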
The $p=0.05$ threshold is generally attributed to (blamed on?) Ronald Fisher. Indeed, his statement in his book *Statistical Methods for Research Workers* (1925) seems to be the first specific mention of this threshold determining statistical significance:
> It is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. (p. 13)
As one can infer, this value corresponds to (approximately) 2 standard deviations away from the mean of a normal distribution. This excellent paper by Cowles and Davis actually shows that such a threshold, at 2 sd's of a normal distribution, has a much older and more established origin.
For example, Student (aka Gosset) stated in his paper that
> three times the probable error in the normal curve, for most purposes, would be considered significant. (p. 13)
It turns out, as Cowles and Davis note, that 3 probable errors (an older measure of variability) is approximately 2 standard deviations (our newer measure of variability), which indeed corresponds to $p=.05$.
So Fisher may have been wrongly “accused”, and this threshold may not be as “arbitrary” as many make it to be.
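The arithmetic behind that correspondence is easy to verify numerically; a minimal sketch, assuming SciPy is available:

```python
# Check: 3 probable errors ≈ 2 standard deviations ≈ p = .05,
# for the standard normal distribution.
from scipy import stats

# The probable error is the deviation with 50% central coverage,
# i.e. the 75th-percentile z-score.
pe = stats.norm.ppf(0.75)
print(round(pe, 4))          # ~0.6745 sd

# Three probable errors is roughly two standard deviations...
print(round(3 * pe, 4))      # ~2.0235 sd

# ...and the two-sided tail probability beyond 2 sd is close to .05.
p_two_sd = 2 * stats.norm.sf(2)
print(round(p_two_sd, 4))    # ~0.0455
```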
Note that Fisher did not think that this $p=.05$ threshold should be universally applied, for all experiments. In a 1926 article, he wrote:
> If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent. point). Personally, the writer prefers to set a low standard of significance at the 5 per cent. point, and ignore entirely all results which fail to reach this level.
So indeed, one can pick other significance levels, depending on the risks (and benefits) of mistaken conclusions from a statistical test, as other answers have already stated.
However, in the same article, he added
> A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.
“Rarely fails” implies that an experiment should be repeated (not a wild ask; that is a tenet of proper scientific practice, though one which seems to have fallen out of customary use these days...). So this means that, for Fisher at least, the significance level was a long-term threshold, over a set of experiments and replications, one which guaranteed that one would be “wrong” for no more than 1 in 20 experiments, and that this was something a statistician should be able to “live with”.
Why is that? Because, as a smarter (but apparently anonymous) statistician said
> Statistics means never having to say that you are certain
(and here is even a little, hopefully humorous, song about it)
So we need to accept that the data will fool us sometimes, and we must learn to live with it. And as such, a threshold at 5% is not an unreasonable compromise (but it is certainly not the only compromise one could make).
To finish, I will share some practices from my own experience, as an engineer working on medical devices.
I usually, by default, use $\alpha=.05$. Why? Because (almost) no one will ever challenge it, including the FDA, and because it provides an acceptable compromise between a low risk of Type I errors and the required sample size (and associated time, effort, and $$).
But say that I am testing a particular attribute of a life support device, and that if that attribute were to fail, it would very likely result in the death of the patient. Then I am almost sure to use $\alpha=.01$. Why? Because I want to sleep peacefully at night knowing that I tried to minimize risks to patients, and because I want to protect the reputations of my product and my employer.
But now, say that I am testing the readability of a user manual (an IFU, in our lingo). I have pretty much always used $\alpha=.2$, indeed because the associated risks are low, but also because we (in industry) all know that hardly any user will read it. And I can tell you that the FDA did not object to this relaxed significance level (they might if I tried $\alpha=.25$).
TL;DR: .05 is not as arbitrary as some may say, and is not a bad compromise; it can certainly be adjusted based on the specific experiment, but not based on sample size or type of test.