P-Value Calculator

Calculate p-value from Z-score or T-score for one-tailed and two-tailed hypothesis tests.

P-value

0.049996

Significant at α = 0.05? Yes
Significant at α = 0.01? No

CDF at 1.96: 0.975002

Test: two-tailed

How to Use the P-Value Calculator

This p-value calculator converts a z-score or t-score into a p-value for one-tailed and two-tailed hypothesis tests. Enter the test statistic and you get the tail probability, plus a quick verdict on significance at α = 0.05 and α = 0.01. Chi-square and F statistics follow the same tail-area logic, and their formulas are covered in the formulas section of this page.

  1. Choose Z-score or T-score. Use Z when you know the population standard deviation or have a large sample (n > 30). Use T for small samples where the standard deviation was estimated from the sample.
  2. Enter your test statistic. This is the number you computed from your data, for example z = (x̄ − μ) ÷ (σ ÷ √n) or t = (x̄ − μ) ÷ (s ÷ √n). Typical real results land between −4 and 4.
  3. Enter degrees of freedom if you picked T. For a one-sample t-test df = n − 1. For a two-sample t-test with equal variances df = n₁ + n₂ − 2. For a paired t-test df = number of pairs − 1.
  4. Select one-tailed or two-tailed. Two-tailed (≠) tests whether the parameter differs in either direction. Right-tailed (>) and left-tailed (<) only check one side. Two-tailed is the default for almost all published research.
  5. Read the p-value and significance flags. A small p means the data would be unlikely under the null hypothesis. If p is below your pre-chosen α (usually 0.05), you reject the null and call the effect statistically significant.

The output also shows the CDF value at your statistic, which is handy for sanity-checking tables. Remember that a p-value describes how surprising the data is under the null, not how large or important the effect is. Always pair p with an effect size and a confidence interval before writing up a result.
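Steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration, not the calculator's own code; the function names and the sample numbers are hypothetical.

```python
import math

def z_statistic(sample_mean, null_mean, population_sd, n):
    """z = (x̄ − μ) ÷ (σ ÷ √n), for a known population standard deviation."""
    return (sample_mean - null_mean) / (population_sd / math.sqrt(n))

def t_statistic(sample_mean, null_mean, sample_sd, n):
    """t = (x̄ − μ) ÷ (s ÷ √n); a one-sample test has df = n − 1."""
    t = (sample_mean - null_mean) / (sample_sd / math.sqrt(n))
    return t, n - 1

# Hypothetical data: 25 measurements averaging 103, tested against a null mean of 100.
z = z_statistic(103, 100, 15, 25)        # sigma = 15 assumed known
t, df = t_statistic(103, 100, 15, 25)    # s = 15 estimated from the sample
print(f"z = {z:.3f}, t = {t:.3f}, df = {df}")
```

The two statistics are numerically identical here; what changes is the reference distribution (normal versus Student t with 24 degrees of freedom) used to turn them into a p-value.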

P-Value Formulas and Worked Examples

A p-value is the probability of getting a test statistic at least as extreme as the one you observed, assuming the null hypothesis is true. The exact formula depends on which distribution your test statistic comes from.

General rules across all distributions:

Two-tailed:   p = 2 × min(CDF, 1 − CDF)
Right-tailed: p = 1 − CDF
Left-tailed:  p = CDF

CDF = cumulative area to the left of the observed statistic.
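The three rules translate directly into code. The sketch below assumes you already have the CDF value at your statistic; the function name and the tail labels are illustrative choices.

```python
def p_from_cdf(cdf, tail="two"):
    """Convert a CDF value (area to the left of the statistic) into a p-value."""
    if not 0.0 <= cdf <= 1.0:
        raise ValueError("cdf must lie between 0 and 1")
    if tail == "two":
        return 2 * min(cdf, 1 - cdf)
    if tail == "right":
        return 1 - cdf
    if tail == "left":
        return cdf
    raise ValueError("tail must be 'two', 'right', or 'left'")

# CDF at z = 1.96, as shown in the calculator output at the top of the page.
print(round(p_from_cdf(0.975002, "two"), 6))   # 0.049996
```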

P-Value From a Z-Score

For a z-score you use the standard normal CDF, written Φ. For a two-tailed test, p = 2 × (1 − Φ(|z|)). For a one-tailed upper test, p = 1 − Φ(z). For a one-tailed lower test, p = Φ(z). The normal curve is symmetric, so flipping the sign of z only flips the one-tailed side, never the two-tailed answer.
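As a rough sketch, the z formulas can be reproduced with Python's standard library, where `statistics.NormalDist().cdf` plays the role of Φ. The function name `p_value_from_z` is hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF, mean 0 and standard deviation 1

def p_value_from_z(z, tail="two"):
    """P-value for a z-score; tail is 'two', 'right', or 'left'."""
    if tail == "two":
        # By symmetry 2 × (1 − Φ(|z|)) equals 2 × Φ(−|z|); the second form
        # keeps more precision when |z| is large.
        return 2 * PHI(-abs(z))
    if tail == "right":
        return PHI(-z)      # same as 1 − Φ(z)
    if tail == "left":
        return PHI(z)
    raise ValueError("tail must be 'two', 'right', or 'left'")

for z in (1.96, -1.96, 2.05, 2.576):
    print(z, round(p_value_from_z(z), 6))
```

Flipping the sign of z leaves the two-tailed result unchanged and swaps the left-tailed and right-tailed results, as described above.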

P-Value From a T-Score With df

For a t-score you use the Student t CDF with the matching degrees of freedom, written T(t, df). Two-tailed: p = 2 × (1 − T(|t|, df)). One-tailed upper: p = 1 − T(t, df). One-tailed lower: p = T(t, df). With df > 30 the t distribution is close to the normal, so z and t give almost identical p-values. With df below 10 the tails are noticeably heavier, so the same raw statistic produces a larger p under t than under z.
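The standard library has no Student t CDF, so the sketch below approximates T(t, df) by integrating the t density numerically with Simpson's rule. This is one possible approach chosen for illustration, not necessarily how this calculator or any statistics package computes it; the function names and the `steps` parameter are assumptions.

```python
import math

def t_pdf(u, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

def t_cdf(t, df, steps=2000):
    """Approximate T(t, df) as 0.5 plus or minus the area between 0 and |t|."""
    upper = abs(t)
    h = upper / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

def p_value_from_t(t, df, tail="two"):
    if tail == "two":
        return 2 * (1 - t_cdf(abs(t), df))
    if tail == "right":
        return 1 - t_cdf(t, df)
    if tail == "left":
        return t_cdf(t, df)
    raise ValueError("tail must be 'two', 'right', or 'left'")

# Same raw statistic, different degrees of freedom.
for df in (5, 10, 30, 1000):
    print(df, round(p_value_from_t(2.05, df), 4))
```

The loop shows the two-tailed p-value for the same statistic shrinking toward the normal-distribution value as df grows.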

P-Value From a Chi-Square Statistic

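Chi-square tests of goodness of fit and independence are one-tailed on the upper side because χ² is always non-negative and larger values mean worse fit to the null. p = 1 − F(χ², df), where F is the chi-square CDF. Degrees of freedom come from the table dimensions: for an r × c contingency table, df = (r − 1)(c − 1); for a goodness-of-fit test with k categories and no estimated parameters, df = k − 1. A chi-square test for a single variance uses the same CDF but can be lower-tailed or two-tailed, depending on the alternative hypothesis.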
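A minimal sketch of the upper-tail calculation, using only the standard library. The chi-square CDF equals the regularized lower incomplete gamma function P(df/2, χ²/2), which is evaluated here with its power series. The function names are hypothetical, and the approach loses precision for extremely small p-values because it computes 1 − CDF.

```python
import math

def regularized_lower_gamma(a, x, max_terms=1000, tol=1e-15):
    """P(a, x) via the series x^a e^(-x) / Γ(a) × Σ x^n / (a (a+1) ... (a+n))."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi_square_p_value(chi2, df):
    """Upper-tail p-value: 1 − F(chi2, df)."""
    return 1.0 - regularized_lower_gamma(df / 2, chi2 / 2)

# Hypothetical 3 × 4 contingency table: df = (3 − 1)(4 − 1) = 6.
df = (3 - 1) * (4 - 1)
print(df, round(chi_square_p_value(12.59, df), 4))
```

For df = 2 the result can be checked by hand, because the upper-tail probability reduces to exp(−χ²/2).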

P-Value From an F-Statistic

F-tests (ANOVA, comparing two variances, regression overall fit) are also one-tailed upper. p = 1 − F(F, df₁, df₂), where df₁ is the numerator degrees of freedom and df₂ is the denominator. For a one-way ANOVA with k groups and N total observations, df₁ = k − 1 and df₂ = N − k.
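The upper-tail F probability can be sketched without external libraries by using the fact that x = df₁F ÷ (df₁F + df₂) follows a Beta(df₁/2, df₂/2) distribution and integrating the beta density from x to 1 with Simpson's rule. This is an illustrative approach under the assumption df₂ ≥ 2 (so the integrand stays finite at 1); the function name and `steps` parameter are hypothetical.

```python
import math

def f_p_value(f_stat, df1, df2, steps=2000):
    """Upper-tail p-value for an F statistic, assuming df2 >= 2."""
    if df2 < 2:
        raise ValueError("this sketch assumes df2 >= 2")
    if f_stat <= 0:
        return 1.0
    a, b = df1 / 2, df2 / 2
    x = df1 * f_stat / (df1 * f_stat + df2)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def kernel(u):
        # Unnormalized Beta(a, b) density.
        return u ** (a - 1) * (1 - u) ** (b - 1)

    h = (1 - x) / steps                     # steps must be even for Simpson's rule
    total = kernel(x) + kernel(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * kernel(x + i * h)
    return total * h / 3 / math.exp(log_beta)

# Hypothetical one-way ANOVA: k = 3 groups, N = 15 observations.
k, N = 3, 15
df1, df2 = k - 1, N - k
print(df1, df2, round(f_p_value(3.885, df1, df2), 4))
```

Two quick checks are available: with df₁ = 2 the tail has the closed form (df₂ ÷ (df₂ + 2F))^(df₂/2), and with df₁ = 1 the F statistic is the square of a t statistic, so the result should match the two-tailed t p-value.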

Worked Examples

Example 1: z = 2.05, two-tailed
  Φ(2.05) ≈ 0.9798
  p = 2 × (1 − 0.9798) = 2 × 0.0202 ≈ 0.0404
  Significant at α = 0.05, not at α = 0.01.

Example 2: z = 1.96, two-tailed
  Φ(1.96) ≈ 0.9750
  p = 2 × (1 − 0.9750) ≈ 0.0500
  Essentially on the classic 0.05 cutoff (0.049996 to six decimals).

Example 3: z = 2.576, two-tailed
  Φ(2.576) ≈ 0.9950
  p = 2 × (1 − 0.9950) = 0.0100
  The α = 0.01 critical value.

Example 4: t = 2.05, df = 10, two-tailed
  T(2.05, 10) ≈ 0.96625
  p = 2 × (1 − 0.96625) ≈ 0.0675
  Same raw statistic as Example 1, larger p because df is small.

Quick-Reference Z to P Table

Use this as a sanity check when you calculate a p-value from a z-score by hand. The one-tailed column is the upper-tail probability; left-tailed is the mirror image.

z | One-tailed p | Two-tailed p | Notes
0.50 | 0.3085 | 0.6171 | Not close to significant
1.00 | 0.1587 | 0.3173 | 1 standard deviation
1.282 | 0.1000 | 0.2000 | α = 0.10 one-tailed
1.645 | 0.0500 | 0.1000 | α = 0.05 one-tailed
1.960 | 0.0250 | 0.0500 | α = 0.05 two-tailed cutoff
2.000 | 0.0228 | 0.0455 | Just past 0.05
2.326 | 0.0100 | 0.0200 | α = 0.01 one-tailed
2.576 | 0.0050 | 0.0100 | α = 0.01 two-tailed cutoff
3.000 | 0.00135 | 0.0027 | 3-sigma
3.290 | 0.0005 | 0.0010 | α = 0.001 two-tailed

What a P-Value Really Tells You (and What It Does Not)

The p-value is the most used and most misread number in applied statistics. This section clears up the common misinterpretations, shows how statistical significance differs from practical significance, walks through the standard α thresholds used across fields, and explains why p-hacking and multiple comparisons have driven a full-blown replication crisis in published research.

What a P-Value Actually Means

A p-value is the probability of observing a test statistic at least as extreme as the one you got, assuming the null hypothesis is true. It is a statement about the data under a specific hypothetical world, not a statement about the hypothesis itself.

Common wrong reading: "p = 0.03 means there is a 3% chance the null hypothesis is true."
What p actually is: p is P(data | null), not P(null | data). Flipping that conditional requires Bayes and a prior.

Common wrong reading: "p = 0.03 means there is a 97% chance my finding is real."
What p actually is: p says nothing about the probability your effect is real. That depends on power, prior plausibility, and bias.

Common wrong reading: "p = 0.03 means the effect is due to chance with 3% probability."
What p actually is: p assumes chance is the only explanation (the null) and asks how surprising the data would be under that assumption.

Common wrong reading: "p > 0.05 means no effect exists."
What p actually is: Absence of evidence is not evidence of absence. A large p often just means the sample is too small to detect a real effect.

A cleaner plain-English reading: "If the null hypothesis were true and I repeated this experiment many times, I would see a result this extreme or more extreme p × 100 percent of the time."
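The "repeated many times" reading can be illustrated with a small simulation. The sketch below assumes a z-test whose statistic is standard normal when the null is true, draws many such statistics, and counts how often the two-tailed p-value falls at or below α. The function name, trial count, and seed are arbitrary choices.

```python
import random
from statistics import NormalDist

def simulate_null_rejections(trials=20000, alpha=0.05, seed=0):
    """Fraction of simulated experiments with p <= alpha when the null is true."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    hits = 0
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)        # test statistic under a true null
        p = 2 * phi(-abs(z))           # two-tailed p-value
        if p <= alpha:
            hits += 1
    return hits / trials

rate = simulate_null_rejections()
print(f"Share of null experiments with p <= 0.05: {rate:.3f}")
```

The share should land close to 0.05: when the null is true, a result with p ≤ α turns up about α × 100 percent of the time.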

Statistical Significance vs Practical Significance

A significant p-value answers the question "is the effect detectably different from zero?" It does not answer "is the effect big enough to care about?" With a large enough sample, a useless 0.1% difference becomes highly significant. With a tiny sample, a 30% difference can come back p = 0.3.

The fix is to always report an effect size and a confidence interval next to the p-value. Common effect size measures include Cohen's d for mean differences, r or R² for correlation and regression, odds ratio or risk ratio for 2×2 tables, and η² for ANOVA. A Cohen's d of 0.2 is small, 0.5 is medium, 0.8 is large. A p = 0.001 paired with d = 0.05 is a statistically loud but practically silent finding.
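To see how sample size drives the p-value for a fixed effect size, the sketch below uses a large-sample normal approximation for two equal-sized groups, where z ≈ d × √(n ÷ 2) and n is the size of each group. The approximation is rough for small n (a t-test would be more appropriate there), and the function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist

def approx_two_sample_p(cohens_d, n_per_group):
    """Two-tailed p for a standardized mean difference, normal approximation."""
    z = cohens_d * math.sqrt(n_per_group / 2)
    return 2 * NormalDist().cdf(-abs(z))

# A negligible effect with a huge sample versus a medium effect with a tiny one.
tiny_effect_big_n = approx_two_sample_p(0.05, 20000)
medium_effect_small_n = approx_two_sample_p(0.5, 8)
print(f"d = 0.05, n = 20000 per group: p = {tiny_effect_big_n:.2e}")
print(f"d = 0.50, n = 8 per group:     p = {medium_effect_small_n:.3f}")
```

The first case is highly significant despite an effect far below the "small" benchmark of 0.2, while the second, a medium effect, does not come close to α = 0.05.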

Common Significance Thresholds Across Fields

Field | Typical α | Corresponding two-tailed z | Why this threshold
Psychology, social science, education | 0.05 | 1.96 | Fisher's 1925 convention, still the default in most journals
Medical trials (primary endpoint) | 0.05 (often 0.025 one-sided) | 1.96 | Regulated by FDA / EMA protocols
Biomedical, clinical chemistry | 0.01 | 2.576 | Stricter because of downstream risk
Genomics, GWAS | 5 × 10⁻⁸ | 5.45 | Bonferroni correction for ~10⁶ independent tests
Particle physics (discovery) | ≈ 5.7 × 10⁻⁷ two-sided (5σ; ≈ 2.9 × 10⁻⁷ one-sided, the form physicists usually quote) | 5.00 | Historical standard for new-particle claims, e.g. Higgs boson
Proposed reform (Benjamin et al. 2018) | 0.005 | 2.81 | Raise the bar to reduce false positives in behavioral science

The takeaway: 0.05 is not a law of nature, it is a convention. Pick the threshold before you look at the data, justify it from the cost of a false positive, and stick with it.
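The two-tailed z cutoffs in the table can be recomputed from α with the inverse normal CDF. A minimal sketch using `statistics.NormalDist().inv_cdf`; the function name and the dictionary labels are illustrative.

```python
from statistics import NormalDist

def two_tailed_critical_z(alpha):
    """Return z such that 2 × (1 − Φ(z)) = alpha."""
    # Work in the lower tail and flip the sign; this avoids forming
    # 1 − alpha/2, which loses precision when alpha is very small.
    return -NormalDist().inv_cdf(alpha / 2)

thresholds = {
    "Conventional": 0.05,
    "Stricter": 0.01,
    "Proposed reform": 0.005,
    "Five sigma (two-sided)": 5.7e-7,
    "GWAS": 5e-8,
}
for label, alpha in thresholds.items():
    print(f"{label:<24} alpha = {alpha:<8g} z = {two_tailed_critical_z(alpha):.3f}")
```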

P-Hacking, Multiple Comparisons, and the Replication Crisis

If you run 20 independent tests at α = 0.05 under a true null, you expect 1 of them to come back "significant" by pure chance. This is the multiple-comparisons problem, and it is the engine behind a large share of findings that do not replicate.
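The arithmetic is short enough to check directly. Besides the expected count of false positives (m × α), it is worth computing the chance of at least one, 1 − (1 − α)^m, which assumes the tests are independent. The function names below are hypothetical.

```python
def expected_false_positives(m, alpha=0.05):
    """Expected number of 'significant' results among m tests of true nulls."""
    return m * alpha

def family_wise_error_rate(m, alpha=0.05):
    """Probability of at least one false positive among m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(m, expected_false_positives(m), round(family_wise_error_rate(m), 4))
```

With 20 tests the expected count is 1, and the probability of at least one false positive is roughly 64%.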

P-hacking is the practice, intentional or not, of running many analyses and only reporting the ones that crossed the 0.05 line. Variants include trying different outcome measures, dropping "outliers" after seeing the result, adding covariates until p dips below 0.05, and optional stopping (peeking at the data and stopping when significance hits). All of these inflate the true false-positive rate far above the stated α.

The standard fixes, each with tradeoffs:

  • Bonferroni correction: divide α by the number of tests m. With 10 tests at family-wise α = 0.05, each individual test needs p < 0.005. Conservative but simple.
  • Holm-Bonferroni: step-down version of Bonferroni that is uniformly more powerful.
  • Benjamini-Hochberg (FDR): controls the expected proportion of false positives among rejections rather than the chance of any false positive. Standard in genomics and large-scale screening.
  • Pre-registration: lock in the hypothesis, sample size, and analysis plan before collecting data. Removes most informal p-hacking.
  • Replication: the only real test. A single p < 0.05 is a hint, not a finding.
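The three p-value corrections in the list above can be sketched in a few lines each. These are textbook versions written for illustration, with hypothetical function names and made-up p-values; each returns a list of booleans marking which hypotheses are rejected.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Step-down: compare the i-th smallest p to alpha / (m − i), stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):          # rank starts at 0
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up: find the largest k with p_(k) <= (k / m) × alpha, reject ranks 1..k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject

# Made-up p-values from five tests.
ps = [0.001, 0.008, 0.012, 0.039, 0.20]
print("Bonferroni:", sum(bonferroni(ps)), "rejections")
print("Holm:      ", sum(holm(ps)), "rejections")
print("BH (FDR):  ", sum(benjamini_hochberg(ps)), "rejections")
```

On this example Bonferroni rejects 2 hypotheses, Holm 3, and Benjamini-Hochberg 4, which reflects the ordering from most to least conservative.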

Ioannidis's 2005 paper "Why most published research findings are false" and the subsequent Open Science Collaboration replication projects (roughly 40% of psychology findings replicated, 60% of cancer biology findings failed) are the reason modern style guides now push for effect sizes, confidence intervals, pre-registration, and Bayesian reporting alongside, or instead of, a bare p-value.

Frequently Asked Questions

What does a p-value of 0.05 mean?

A p-value of 0.05 means that if the null hypothesis were true, you would see results at least this extreme about 5% of the time by chance alone. By long-standing convention, p < 0.05 is called statistically significant and you reject the null. It does not mean there is a 5% chance the null is true, and it does not mean the effect is large or important. Always report an effect size and confidence interval next to any p-value.
