The p-value is the most used and most misread number in applied statistics. This section clears up the common misinterpretations, shows how statistical significance differs from practical significance, walks through the standard α thresholds used across fields, and explains why p-hacking and multiple comparisons have driven a full-blown replication crisis in published research.
What a P-Value Actually Means
A p-value is the probability of observing a test statistic at least as extreme as the one you got, assuming the null hypothesis is true. It is a statement about the data under a specific hypothetical world, not a statement about the hypothesis itself.
| Common wrong reading | What p actually is |
|---|---|
| "p = 0.03 means there is a 3% chance the null hypothesis is true." | p is P(data \| null), not P(null \| data). Flipping that conditional requires Bayes and a prior. |
| "p = 0.03 means there is a 97% chance my finding is real." | p says nothing about the probability your effect is real. That depends on power, prior plausibility, and bias. |
| "p = 0.03 means the effect is due to chance with 3% probability." | p assumes chance is the only explanation (the null) and asks how surprising the data would be under that assumption. |
| "p > 0.05 means no effect exists." | Absence of evidence is not evidence of absence. A large p often just means the sample is too small to detect a real effect. |
A cleaner plain-English reading: "If the null hypothesis were true and I repeated this experiment many times, I would see a result this extreme or more extreme p × 100 percent of the time."
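The "repeat the experiment many times" reading can be checked numerically. The sketch below is illustrative only: it assumes the test statistic is a z score that follows a standard normal distribution under the null, and `z_obs = 2.17` is a hypothetical observed value chosen because its two-sided p-value is close to 0.03. The function names are made up for this example.

```python
import random
from statistics import NormalDist

def two_sided_p(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z: the two-sided p-value of a z statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def simulated_tail_frequency(z_observed: float, n_repeats: int = 100_000, seed: int = 0) -> float:
    """Share of simulated null experiments at least as extreme as the observed one."""
    rng = random.Random(seed)
    # Each draw stands in for the z statistic of one experiment run under a true null.
    hits = sum(abs(rng.gauss(0.0, 1.0)) >= abs(z_observed) for _ in range(n_repeats))
    return hits / n_repeats

z_obs = 2.17  # hypothetical observed z statistic
print(f"analytic p  = {two_sided_p(z_obs):.4f}")
print(f"simulated p = {simulated_tail_frequency(z_obs):.4f}")
```

The two numbers should agree to within simulation noise. Both describe how often null-world data look this extreme, and neither says how likely the null itself is.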
Statistical Significance vs Practical Significance
A significant p-value answers the question "is the effect detectably different from zero?" It does not answer "is the effect big enough to care about?" With a large enough sample, a useless 0.1% difference becomes highly significant. With a tiny sample, a 30% difference can come back p = 0.3.
The fix is to always report an effect size and a confidence interval next to the p-value. Common effect size measures include Cohen's d for mean differences, r or R² for correlation and regression, odds ratio or risk ratio for 2×2 tables, and η² for ANOVA. A Cohen's d of 0.2 is small, 0.5 is medium, 0.8 is large. A p = 0.001 paired with d = 0.05 is a statistically loud but practically silent finding.
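To see how sample size alone moves the p-value, here is a minimal sketch under simplifying assumptions: two equal-sized groups, a known common standard deviation, and a z-test, so that z = d × √(n / 2). The sample sizes and effect sizes are hypothetical, and a real analysis would more often use a t-test on the raw data.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_p(d: float, n_per_group: int) -> float:
    """Two-sided p-value for a standardized mean difference d with n per group.

    Assumes equal group sizes and a known common standard deviation,
    so the test statistic is z = d * sqrt(n / 2).
    """
    z = d * sqrt(n_per_group / 2)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def d_confidence_interval(d: float, n_per_group: int, level: float = 0.95):
    """Approximate normal-theory interval for d under the same assumptions."""
    half_width = NormalDist().inv_cdf(0.5 + level / 2) * sqrt(2 / n_per_group)
    return d - half_width, d + half_width

# Hypothetical scenarios: a trivial effect in a huge sample, a medium effect in a tiny one.
for d, n in [(0.05, 20_000), (0.5, 10)]:
    low, high = d_confidence_interval(d, n)
    print(f"d = {d}, n = {n} per group: p = {two_sample_z_p(d, n):.2g}, "
          f"95% CI for d = ({low:.2f}, {high:.2f})")
```

Under these assumptions the trivial effect comes out highly significant and the medium effect does not. The interval width, not the p-value, is what shows how little the small study can say.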
Common Significance Thresholds Across Fields
| Field | Typical α | Corresponding two-tailed z | Why this threshold |
|---|---|---|---|
| Psychology, social science, education | 0.05 | 1.96 | Fisher's 1925 convention, still the default in most journals |
| Medical trials (primary endpoint) | 0.05 (often 0.025 one-sided) | 1.96 | Regulated by FDA / EMA protocols |
| Biomedical, clinical chemistry | 0.01 | 2.576 | Stricter because of downstream risk |
| Genomics, GWAS | 5 × 10⁻⁸ | 5.45 | Bonferroni correction for ~10⁶ independent tests |
| Particle physics (discovery) | ≈ 5.7 × 10⁻⁷ two-sided (5σ); physicists usually quote the one-sided tail, ≈ 2.9 × 10⁻⁷ | 5.00 | Historical standard for new-particle claims, e.g. Higgs boson |
| Proposed reform (Benjamin et al. 2018) | 0.005 | 2.81 | Raise the bar to reduce false positives in behavioral science |
The takeaway: 0.05 is not a law of nature, it is a convention. Pick the threshold before you look at the data, justify it from the cost of a false positive, and stick with it.
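The z values in the table above can be reproduced from the α values with the inverse normal CDF. This is a small sketch using Python's standard library. The dictionary labels are shorthand for the table rows, and the last line shows the one-sided tail beyond 5σ, which is the form physicists usually quote.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def two_tailed_z(alpha: float) -> float:
    """Critical |z| such that P(|Z| >= z) = alpha for a standard normal Z."""
    # Use the lower tail directly to avoid precision loss near 1 for tiny alpha.
    return -STD_NORMAL.inv_cdf(alpha / 2.0)

def one_sided_tail(z: float) -> float:
    """P(Z >= z) for a standard normal Z."""
    return STD_NORMAL.cdf(-z)

thresholds = {
    "Conventional": 0.05,
    "Biomedical": 0.01,
    "Proposed reform": 0.005,
    "Particle physics (5 sigma)": 5.7e-7,
    "GWAS": 5e-8,
}

for field, alpha in thresholds.items():
    print(f"{field:<28} alpha = {alpha:<8g} two-tailed z = {two_tailed_z(alpha):.3f}")

print(f"one-sided tail beyond 5 sigma = {one_sided_tail(5.0):.2e}")
```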
P-Hacking, Multiple Comparisons, and the Replication Crisis
If you run 20 independent tests at α = 0.05 under a true null, you expect 1 of them to come back "significant" by pure chance, and the probability that at least one does is 1 − 0.95²⁰ ≈ 64%. This is the multiple-comparisons problem, and it is the engine behind a large share of findings that do not replicate.
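The arithmetic behind the 20-test example is short enough to write out. The sketch assumes the tests are independent and every null is true. The function names are illustrative.

```python
def expected_false_positives(m: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results among m true-null tests."""
    return m * alpha

def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 5, 20, 100):
    print(f"m = {m:>3}: expected false positives = {expected_false_positives(m):.2f}, "
          f"P(at least one) = {familywise_error_rate(m):.3f}")
```

The expected count grows linearly with the number of tests, while the chance of at least one false positive climbs toward 1.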
P-hacking is the practice, intentional or not, of running many analyses and only reporting the ones that crossed the 0.05 line. Variants include trying different outcome measures, dropping "outliers" after seeing the result, adding covariates until p dips below 0.05, and optional stopping (peeking at the data and stopping when significance hits). All of these inflate the true false-positive rate far above the stated α.
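Optional stopping is the easiest of these to demonstrate by simulation. The sketch below assumes a deliberately simple setup: data are drawn from a normal distribution with mean 0 and known standard deviation 1 (so the null is true), and a two-sided z-test at α = 0.05 is run at each peek. The peek schedule and simulation count are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided critical value for alpha = 0.05

def false_positive_rate(peek_at, n_sims: int = 4000, seed: int = 1) -> float:
    """Fraction of true-null experiments declared significant at any of the peeks."""
    rng = random.Random(seed)
    peeks = set(peek_at)
    max_n = max(peeks)
    false_positives = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # null is true: mean 0, known sd 1
            # z statistic for the running mean is sum / sqrt(n) when sd = 1
            if n in peeks and abs(total / sqrt(n)) > Z_CRIT:
                false_positives += 1
                break  # stop at the first "significant" look
    return false_positives / n_sims

fixed = false_positive_rate([100])                  # one test at the planned n
peeking = false_positive_rate(range(10, 101, 10))   # test after every 10 observations

print(f"single look at n = 100:  {fixed:.3f}")
print(f"ten looks, stop on hit:  {peeking:.3f}")
```

With a single planned look the rate should sit near the nominal 0.05. With ten looks it should come out well above that, even though every individual test used α = 0.05.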
The standard fixes, each with tradeoffs:
- Bonferroni correction: divide α by the number of tests m. With 10 tests at family-wise α = 0.05, each individual test needs p < 0.005. Conservative but simple.
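- Holm-Bonferroni: step-down version of Bonferroni that is uniformly more powerful. Sort the p-values, compare the k-th smallest to α / (m − k + 1), and stop at the first one that fails.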
- Benjamini-Hochberg (FDR): controls the expected proportion of false positives among rejections rather than the chance of any false positive. Standard in genomics and large-scale screening.
- Pre-registration: lock in the hypothesis, sample size, and analysis plan before collecting data. Removes most informal p-hacking.
- Replication: the only real test. A single p < 0.05 is a hint, not a finding.
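The three corrections in the list above can be sketched in a few lines of plain Python. This is an illustrative implementation, not a replacement for a tested statistics library: the function names and the p-values are made up, and rejection uses p ≤ threshold, which is the usual textbook form.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m. Controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p to alpha / (m - k + 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] > alpha / (m - rank):
            break  # first failure stops the procedure
        reject[i] = True
    return reject

def benjamini_hochberg(p_values, q=0.05):
    """BH step-up: find the largest k with p_(k) <= (k / m) * q, reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values from 10 tests, in no particular order.
p_values = [0.04, 0.001, 0.60, 0.012, 0.0052, 0.20, 0.03, 0.90, 0.45, 0.75]

print("Bonferroni rejections:", sum(bonferroni(p_values)))
print("Holm rejections:      ", sum(holm(p_values)))
print("BH (FDR) rejections:  ", sum(benjamini_hochberg(p_values)))
```

On these made-up p-values the three methods reject 1, 2, and 3 hypotheses respectively, which mirrors the tradeoff in the list: the stricter the error guarantee, the fewer discoveries survive.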
Ioannidis's 2005 paper "Why Most Published Research Findings Are False" put the problem on the map, and the large-scale replication projects that followed put numbers on it: the Open Science Collaboration's work in psychology (roughly 40% of findings replicated) and a parallel effort in cancer biology (roughly 60% of findings failed to replicate). Together they are the reason modern style guides now push for effect sizes, confidence intervals, pre-registration, and Bayesian reporting alongside, or instead of, a bare p-value.