A p-value is the probability of observing test results as extreme as, or more extreme than, the actual results — assuming the null hypothesis is true.
Formally, for a test statistic T with observed value t:

p = P(T ≥ t | H₀) (right-tailed), p = P(T ≤ t | H₀) (left-tailed), p = P(|T| ≥ |t| | H₀) (two-tailed).
Interpretation: a small p-value means the observed data would be surprising if H₀ were true, so we have evidence against H₀. A large p-value means the data are consistent with H₀, but it does not prove H₀ is true.
Decision rule: compare p to a pre-chosen significance level α (typically 0.05): reject H₀ if p ≤ α; otherwise fail to reject H₀.
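The decision rule can be sketched in a few lines of Python (a minimal illustration; the function name is our own):

```python
# Minimal sketch of the frequentist decision rule: reject H0 when p <= alpha.
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the decision for a given p-value and significance level."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.03))  # p below alpha = 0.05
print(decide(0.20))  # p above alpha = 0.05
```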
What a p-value is NOT:
- It is NOT the probability that H₀ is true.
- It is NOT the probability that the result occurred by chance.
- It is NOT a measure of effect size or practical importance.
For a standard normal test statistic Z with observed value z:
- Right-tailed: p = P(Z ≥ z) = 1 − Φ(z)
- Left-tailed: p = P(Z ≤ z) = Φ(z)
- Two-tailed: p = P(|Z| ≥ |z|) = 2(1 − Φ(|z|))
Quick reference: |z| = 1.96 → two-tailed p ≈ 0.05. |z| = 2.58 → two-tailed p ≈ 0.01.
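The z-based tail probabilities above can be computed with only the standard library, using the identity Φ(z) = ½(1 + erf(z/√2)) (a sketch; the function names are our own):

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_value_z(z: float, tail: str = "two") -> float:
    """p-value for an observed z statistic: 'right', 'left', or 'two'-tailed."""
    if tail == "right":
        return 1.0 - phi(z)
    if tail == "left":
        return phi(z)
    return 2.0 * (1.0 - phi(abs(z)))  # two-tailed

print(round(p_value_z(1.96), 3))  # → 0.05
print(round(p_value_z(2.58), 3))  # → 0.01
```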
Use the t-distribution with n − 1 degrees of freedom (or as specified by the test). Same tail logic as z, but the distribution has slightly heavier tails for small df.
Chi-square tests are inherently right-tailed because χ² ≥ 0 and larger values indicate worse fit to H₀: p = P(χ²ₖ ≥ χ²_obs), where k is the degrees of freedom.
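For even degrees of freedom, the chi-square right-tail probability has a simple closed form, which makes the right-tailed logic easy to check by hand (a sketch; for general df, a library routine such as scipy.stats.chi2.sf does the same job):

```python
from math import exp, factorial

def chi2_sf_even_df(x: float, df: int) -> float:
    """Right-tail P(X >= x) for a chi-square with EVEN df (closed form).

    Uses P(X >= x) = exp(-x/2) * sum_{i<df/2} (x/2)^i / i!.
    """
    assert df > 0 and df % 2 == 0, "closed form shown here needs even df"
    half = x / 2.0
    return exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))

# A 3-category goodness-of-fit test has k = 2 degrees of freedom:
print(round(chi2_sf_even_df(5.99, 2), 3))  # → 0.05
```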
Never pick the tail after seeing the data — that's p-hacking.
| p-value threshold | Common label |
|---|---|
| 0.10 | suggestive |
| 0.05 | standard |
| 0.01 | strong |
| 0.001 | very strong |
The American Statistical Association has warned against treating p = 0.05 as a bright line: context and effect size matter more than crossing a threshold.
**What does p < 0.05 actually mean?** It means the observed data (or more extreme data) would occur in less than 5% of repeated samples if the null hypothesis were true. By convention, this is treated as 'statistically significant' — but it doesn't mean the null hypothesis is necessarily false, and it doesn't measure the size of the effect.
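A quick simulation makes the "less than 5% of repeated samples" reading concrete: if we repeatedly run a two-tailed z-test on data where H₀ really is true, about 5% of the runs produce p < 0.05 (an illustrative sketch; the sample size and seed are arbitrary choices of ours):

```python
import random
from math import erf, sqrt

random.seed(42)

def phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def two_tailed_p(z: float) -> float:
    return 2.0 * (1.0 - phi(abs(z)))

# Simulate many z-tests where H0 is TRUE (true mean 0, known sd 1, n = 25).
n, sims, hits = 25, 20000, 0
for _ in range(sims):
    sample_mean = sum(random.gauss(0, 1) for _ in range(n)) / n
    z = sample_mean * sqrt(n)  # (x̄ − 0) / (1/√n)
    if two_tailed_p(z) < 0.05:
        hits += 1

print(hits / sims)  # close to 0.05, as the definition predicts
```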
**Is the p-value the probability that H₀ is true?** No. The p-value is calculated *assuming* H₀ is true — it is conditional on H₀. Computing P(H₀ true | data) requires Bayesian methods with a prior probability for H₀, which the frequentist p-value does not use.
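A toy Bayes'-rule calculation shows the distinction. The prior and power below are made-up illustrative numbers, not values from the text:

```python
# Assumed, illustrative numbers:
prior_h0 = 0.5  # prior probability that H0 is true
alpha = 0.05    # P(significant | H0 true), the false-positive rate
power = 0.8     # P(significant | H0 false)

# Bayes' rule: P(H0 true | result was significant)
p_sig = prior_h0 * alpha + (1 - prior_h0) * power
posterior_h0 = prior_h0 * alpha / p_sig
print(round(posterior_h0, 3))  # → 0.059, which is not alpha and depends on the prior
```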
**When is a one-tailed test appropriate?** Only when the research question is genuinely directional and pre-specified before seeing the data — e.g., a new drug must perform *better* than placebo to be useful, with worse performance equivalent to no effect. Choosing the tail post-hoc is p-hacking.
**What is p-hacking?** P-hacking is the practice of running many analyses (different subsets, transformations, exclusions) and reporting only the significant ones, or switching test directions after seeing the data. It inflates false-positive rates and is a major contributor to the replication crisis.
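The inflation from running many analyses can be seen with simple arithmetic: with 20 independent tests at α = 0.05 and every null true, the chance of at least one "significant" result is about 64% (the test count here is an assumed example):

```python
# Probability of at least one false positive across many independent tests,
# all with a true null hypothesis: 1 - (1 - alpha)^n.
alpha, n_tests = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one, 2))  # → 0.64
```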