Calculate all classification performance metrics from your confusion matrix
A confusion matrix is the cornerstone of evaluating machine learning classifiers, medical diagnostic tests, spam filters, fraud detection systems, and any binary or multi-class prediction model. At its simplest, it is a 2×2 table that organizes a model's predictions into four categories: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). From these four numbers alone, you can calculate more than 15 distinct performance metrics, each answering a different question about how well your model is performing.
Accuracy tells you the overall fraction of correct predictions: (TP + TN) / (TP + FP + TN + FN). It is the most intuitive metric, but notoriously misleading for imbalanced datasets. If 99% of your samples are negative, a model that predicts negative for everything achieves 99% accuracy while catching zero positive cases — a phenomenon known as the accuracy paradox. This is why precision, recall, and the F1 score matter.
Precision (also called Positive Predictive Value or PPV) answers: of all the cases the model flagged as positive, how many actually were positive? It is defined as TP / (TP + FP). High precision means few false alarms — critical in spam filtering, where flagging a legitimate email as spam is costly. Recall (Sensitivity or True Positive Rate) answers the opposite question: of all actual positive cases, how many did the model find? It is TP / (TP + FN). High recall is vital in medical screening, where missing a cancer diagnosis (a false negative) could be life-threatening. The F1 Score is the harmonic mean of precision and recall, providing a single number that balances both concerns. The more general F-beta score lets you weight recall (β > 1) or precision (β < 1) more heavily depending on your application.
The Matthews Correlation Coefficient (MCC) is widely regarded as the most reliable single metric for imbalanced binary classification. It ranges from −1 (perfectly wrong) to +1 (perfectly correct), with 0 indicating random guessing — and unlike accuracy and F1, it accounts for all four cells of the confusion matrix symmetrically.
Specificity (True Negative Rate) tells you how well the model identifies negative cases: TN / (TN + FP). Together with sensitivity, it defines the ROC space — plotting sensitivity on the y-axis against 1 − specificity (False Positive Rate) on the x-axis. Youden's J Statistic summarizes both as Sensitivity + Specificity − 1, ranging from 0 (worthless) to 1 (perfect). Positive and Negative Likelihood Ratios (LR+ and LR−) are particularly valued in diagnostic medicine: an LR+ above 10 provides strong evidence to rule in a disease, while an LR− below 0.1 provides strong evidence to rule it out.
Cohen's Kappa measures agreement corrected for chance: it compares your model's observed accuracy against what would be expected if predictions were made randomly based on class frequencies. Kappa values above 0.8 indicate almost perfect agreement, 0.6–0.8 substantial, 0.4–0.6 moderate, 0.2–0.4 fair, and below 0.2 slight. Balanced Accuracy — the arithmetic mean of sensitivity and specificity — is another robust alternative to accuracy for imbalanced datasets.
For multi-class classification problems (3 or more classes), the confusion matrix becomes an N×N grid. Each cell M[i][j] represents cases where the true class was i but the model predicted class j; the diagonal contains correct predictions. Per-class precision and recall can be computed for each class, then averaged as a macro-average (equal weight to all classes) or a weighted average (weight by class frequency). Cohen's Kappa extends naturally to the multi-class case.
This calculator handles all of the above.
Enter your TP, FP, TN, FN values directly or use one of the preset scenarios — Balanced Model, High Precision, High Recall, Medical Screening, or Spam Filter — to explore how different error distributions affect each metric. Switch to multi-class mode to enter an N×N matrix for up to 6 classes and get per-class breakdowns with macro and weighted averages. Adjust the F-beta slider to tune the precision-recall tradeoff for your use case. Use the prevalence adjustment to see how PPV and NPV change when applied to a population with a different prevalence than your test sample.
Understanding Confusion Matrix Metrics
What Is a Confusion Matrix?
A confusion matrix is a tabular summary of the prediction results of a classification algorithm. For binary classification, it is a 2×2 table with rows representing actual classes and columns representing predicted classes. The four cells are True Positives (TP: correctly predicted positives), True Negatives (TN: correctly predicted negatives), False Positives (FP: negatives incorrectly predicted as positive — Type I Error), and False Negatives (FN: positives incorrectly predicted as negative — Type II Error). The matrix provides a complete picture of classification performance, revealing not just how often the model is correct overall, but specifically where it goes wrong — whether it tends to over-predict or under-predict the positive class. For multi-class problems, it extends to an N×N grid where the diagonal represents correct predictions for each class and off-diagonal cells represent misclassifications between specific class pairs.
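As a minimal sketch, the four cells can be tallied from paired lists of actual and predicted binary labels (1 = positive, 0 = negative); `confusion_counts` is an illustrative helper name, not part of the calculator:

```python
# Tally the 2x2 confusion matrix cells from paired label lists.
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # Type I
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # Type II
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

actual    = [1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 0, 1]
print(confusion_counts(actual, predicted))  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```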
How Are the Metrics Calculated?
All metrics derive from the four values TP, FP, TN, FN and their combinations. Primary metrics: Accuracy = (TP+TN)/N; Precision = TP/(TP+FP); Recall = TP/(TP+FN); Specificity = TN/(TN+FP); F1 = 2TP/(2TP+FP+FN). Error rates: FPR = FP/(FP+TN) = 1−Specificity; FNR = FN/(FN+TP) = 1−Recall; FDR = FP/(FP+TP) = 1−Precision; FOR = FN/(FN+TN) = 1−NPV. Advanced metrics: MCC = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)); Balanced Accuracy = (Recall + Specificity) / 2; Cohen's Kappa = (Po − Pe) / (1 − Pe) where Pe is the expected accuracy by chance; Youden's J = Sensitivity + Specificity − 1; LR+ = Sensitivity/FPR; LR− = FNR/Specificity. When a denominator is zero (e.g., no positive samples in the test set), the corresponding metric is undefined and shown as N/A.
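The formulas above translate directly into code. The sketch below uses `None` to stand in for the N/A case; `binary_metrics` and its return keys are illustrative names, not the calculator's API:

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics; None stands in for N/A (zero denominator)."""
    def div(num, den):
        return num / den if den else None

    n = tp + fp + tn + fn
    precision = div(tp, tp + fp)
    recall = div(tp, tp + fn)            # sensitivity / TPR
    specificity = div(tn, tn + fp)       # TNR
    po = div(tp + tn, n)                 # observed accuracy
    # expected accuracy by chance, from the marginal totals
    pe = div((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn), n * n)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": po,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": div(2 * tp, 2 * tp + fp + fn),
        "balanced_accuracy": (recall + specificity) / 2
            if None not in (recall, specificity) else None,
        "mcc": div(tp * tn - fp * fn, mcc_den),
        "kappa": div(po - pe, 1 - pe)
            if po is not None and pe is not None else None,
        "youden_j": recall + specificity - 1
            if None not in (recall, specificity) else None,
    }

m = binary_metrics(95, 40, 850, 5)   # the Medical Screening worked example
print({k: round(v, 4) for k, v in m.items() if v is not None})
```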
Why Do Different Metrics Matter?
Choosing the right metric depends entirely on the cost structure of your application. In medical screening, missing a disease (false negative) can be fatal, so maximizing recall (sensitivity) is paramount — even at the cost of more false alarms. In spam filtering, marking a legitimate email as spam (false positive) damages user trust, so precision takes priority. Fraud detection requires high recall to catch most fraud, while accepting some false positives as the cost of vigilance. The MCC is considered the most informative single metric for imbalanced binary classification because it accounts for all four cells symmetrically and is not inflated by class imbalance. Balanced Accuracy and Kappa are also robust to imbalance. For medical diagnostics, Likelihood Ratios directly update pre-test probabilities to post-test probabilities via Bayes' theorem, making them directly actionable in clinical settings.
Limitations and the Accuracy Paradox
The accuracy paradox is the most important limitation to understand: when classes are severely imbalanced, a naive model that always predicts the majority class can achieve high accuracy while being completely useless. For example, if 95% of patients are healthy, predicting everyone as healthy gives 95% accuracy with 0% recall for disease detection. This is why accuracy alone should never be the sole metric for imbalanced problems. Additional limitations: Precision and recall are undefined when there are no positive predictions or no actual positives, respectively. MCC requires all four cells to be non-zero for a fully meaningful result. Kappa can be misleading for very imbalanced distributions. Confidence intervals (available in full statistical software) are important when sample sizes are small — a high recall on 10 samples is far less reliable than on 1,000. Always report multiple metrics and consider your specific application's cost asymmetry between false positives and false negatives.
Formulas
The fraction of all predictions that were correct. Intuitive but misleading for imbalanced datasets where one class dominates.
Precision measures the fraction of positive predictions that are correct. Recall measures the fraction of actual positives that were found. F1 is their harmonic mean, balancing both concerns.
A balanced metric ranging from −1 (perfectly wrong) to +1 (perfect). Unlike accuracy, MCC accounts for all four cells symmetrically and is robust to class imbalance.
Measures agreement corrected for chance. Pₒ is observed accuracy; Pₑ is expected accuracy if predictions were random based on class frequencies. Values: <0.2 slight, 0.2–0.4 fair, 0.4–0.6 moderate, 0.6–0.8 substantial, >0.8 almost perfect.
Reference Tables
Metric Interpretation Guide
| Metric | Range | Best For | Key Limitation |
|---|---|---|---|
| Accuracy | 0 – 1 | Balanced datasets | Misleading when classes are imbalanced |
| Precision | 0 – 1 | Spam filters, fraud alerts | Ignores false negatives |
| Recall | 0 – 1 | Medical screening, safety | Ignores false positives |
| F1 Score | 0 – 1 | General single metric | Does not account for true negatives |
| MCC | −1 – +1 | Imbalanced binary classification | Requires all four cells non-zero |
| Cohen's Kappa | −1 – +1 | Inter-rater agreement | Can understate agreement for skewed classes |
| Balanced Accuracy | 0 – 1 | Imbalanced datasets | Does not penalize false positives separately |
| Youden's J | 0 – 1 | Diagnostic tests | Only meaningful for binary classification |
Cohen's Kappa Interpretation Scale
| Kappa Range | Strength of Agreement |
|---|---|
| < 0.00 | Less than chance agreement |
| 0.00 – 0.20 | Slight agreement |
| 0.21 – 0.40 | Fair agreement |
| 0.41 – 0.60 | Moderate agreement |
| 0.61 – 0.80 | Substantial agreement |
| 0.81 – 1.00 | Almost perfect agreement |
Worked Examples
Medical Screening Test — High Recall Scenario
Total N = 95 + 40 + 850 + 5 = 990
Accuracy = (95 + 850) / 990 = 0.9545 (95.45%)
Precision = 95 / (95 + 40) = 0.7037 (70.37%)
Recall = 95 / (95 + 5) = 0.9500 (95.00%)
F1 = 2 × (0.7037 × 0.9500) / (0.7037 + 0.9500) = 0.8085
MCC = (95×850 − 40×5) / √(135 × 100 × 890 × 855) = 0.7947
Spam Filter — High Precision Scenario
Total N = 180 + 3 + 800 + 17 = 1000
Accuracy = (180 + 800) / 1000 = 0.9800 (98.00%)
Precision = 180 / (180 + 3) = 0.9836 (98.36%)
Recall = 180 / (180 + 17) = 0.9137 (91.37%)
F1 = 2 × (0.9836 × 0.9137) / (0.9836 + 0.9137) = 0.9474
MCC = (180×800 − 3×17) / √(183 × 197 × 803 × 817) = 0.9360
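The hand calculations can be checked with a few lines of Python; this sketch recomputes the Spam Filter cells:

```python
import math

# Spam Filter scenario: TP=180, FP=3, TN=800, FN=17
tp, fp, tn, fn = 180, 3, 800, 17
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round(precision, 4), round(recall, 4), round(f1, 4), round(mcc, 4))
# 0.9836 0.9137 0.9474 0.936
```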
Imbalanced Dataset — Accuracy Paradox Demonstration
Total N = 0 + 0 + 9900 + 100 = 10000
Accuracy = (0 + 9900) / 10000 = 0.9900 (99.00%)
Precision = 0 / (0 + 0) = undefined (no positive predictions)
Recall = 0 / (0 + 100) = 0.0000 (0%)
MCC = (0×9900 − 0×100) / √(0 × 100 × 9900 × 10000) = undefined (0/0; conventionally reported as 0.000)
Balanced Accuracy = (0 + 1.0) / 2 = 0.5000 (50%)
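The paradox is easy to reproduce in a few lines; a sketch using the same cells as this example:

```python
# Always-negative classifier on a 1%-positive dataset:
# TP=0, FP=0, TN=9900, FN=100
tp, fp, tn, fn = 0, 0, 9900, 100
accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2
print(accuracy, recall, balanced_accuracy)  # 0.99 0.0 0.5
```

Balanced accuracy at 0.5 immediately exposes the chance-level performance that the 99% accuracy hides.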
How to Use the Confusion Matrix Calculator
Enter Your TP, FP, TN, FN Values
Type your confusion matrix values directly into the 2×2 grid. True Positives (TP) are cases correctly predicted as positive. False Positives (FP, Type I Error) are negatives predicted as positive. True Negatives (TN) are correctly predicted negatives. False Negatives (FN, Type II Error) are positives predicted as negative. Or click any preset scenario to load example values.
Review All Performance Metrics
After entering your values, all 15+ metrics are calculated instantly. Check the Performance Metrics section for accuracy, precision, recall, specificity, F1, and balanced accuracy with a visual bar chart. The Error Rates section shows FPR, FNR, FDR, FOR, and NPV. The Advanced Metrics section shows MCC, Cohen's Kappa, Youden's J, likelihood ratios, and prevalence.
Adjust Beta and Prevalence for Your Use Case
Use the F-beta slider to tune the balance between precision and recall. Set β < 1 (e.g. 0.5) to penalize false positives more heavily — ideal for spam filters. Set β > 1 (e.g. 2) to penalize false negatives more — ideal for medical screening. Enter a custom prevalence percentage in Advanced Options to see adjusted PPV and NPV for a target population that differs from your test sample.
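The slider applies the standard F-beta formula; a sketch, using the Medical Screening precision/recall pair rounded to 0.70/0.95:

```python
def f_beta(precision, recall, beta):
    # beta > 1 weights recall more heavily; beta < 1 weights precision
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.70, 0.95
print(round(f_beta(p, r, 1.0), 4))  # 0.8061 (plain F1)
print(round(f_beta(p, r, 2.0), 4))  # 0.8867 (F2 rewards the high recall)
print(round(f_beta(p, r, 0.5), 4))  # 0.7389 (F0.5 penalizes the lower precision)
```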
Export or Switch to Multi-Class Mode
Click Export CSV to download all computed metrics as a spreadsheet. Click Print Results for a print-friendly report. For multi-class models, switch to the Multi-Class (N×N) tab, set the number of classes (2–6), name each class, and fill in the matrix cells. You will get per-class precision, recall, and F1 scores plus macro and weighted averages and Cohen's Kappa.
Frequently Asked Questions
What is the difference between precision and recall?
Precision and recall measure different failure modes. Precision (TP / (TP + FP)) asks: of everything the model predicted as positive, what fraction was actually positive? A low precision means many false alarms. Recall (TP / (TP + FN)) asks: of all actual positive cases, what fraction did the model catch? A low recall means the model misses many real positives. They exist in a fundamental tradeoff — increasing the threshold for predicting positive typically raises precision and lowers recall, while lowering it does the opposite. The F1 score balances both; the F-beta score lets you weight one over the other depending on whether false positives or false negatives are more costly in your application.
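The tradeoff can be seen by sweeping a decision threshold over classifier scores; the scores and labels below are illustrative toy data:

```python
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
labels = [1, 1, 0, 1, 0, 1, 0, 0]   # 1 = actually positive

def precision_recall(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, a in zip(preds, labels) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(preds, labels) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(preds, labels) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn)
    return precision, recall

print(precision_recall(0.45))  # low threshold: perfect recall, more false alarms
print(precision_recall(0.85))  # high threshold: perfect precision, misses half
```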
When should I use MCC instead of accuracy or F1?
The Matthews Correlation Coefficient (MCC) should be your go-to metric whenever your dataset is imbalanced — when one class significantly outnumbers the other. MCC accounts for all four cells of the confusion matrix (TP, TN, FP, FN) symmetrically and is not inflated by class imbalance. For example, if 95% of samples are negative, a model predicting everything as negative gets 95% accuracy and a high F1 for the majority class, but an MCC of 0, correctly signaling random performance. MCC ranges from −1 (perfectly wrong) to +1 (perfect predictions), with 0 indicating chance-level performance. It is also invariant to swapping positive and negative class labels, making it more objective.
What does Cohen's Kappa tell me and how do I interpret it?
Cohen's Kappa (κ) measures how much better your model performs compared to a random classifier that makes predictions based purely on class frequency proportions. It corrects for chance agreement. Kappa of 0 means the model performs no better than random chance; negative values mean it performs worse. Interpretation guidelines: below 0.20 = slight agreement; 0.20–0.40 = fair; 0.40–0.60 = moderate; 0.60–0.80 = substantial; 0.80–1.00 = almost perfect agreement. Kappa is especially useful when evaluating inter-rater reliability in annotation tasks, and for imbalanced datasets where accuracy overstates performance.
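As a sketch, here is the chance-corrected calculation for the Medical Screening example (TP=95, FP=40, TN=850, FN=5):

```python
tp, fp, tn, fn = 95, 40, 850, 5
n = tp + fp + tn + fn
po = (tp + tn) / n                                   # observed accuracy
# chance agreement: sum over classes of (actual rate x predicted rate)
pe = ((tp + fn) * (tp + fp) + (fn + tn) * (fp + tn)) / n ** 2
kappa = (po - pe) / (1 - pe)
print(round(kappa, 4))  # 0.7834, "substantial" on the scale above
```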
What is the accuracy paradox and how do I detect it?
The accuracy paradox occurs when a naive model that always predicts the majority class achieves high accuracy despite being completely useless. For example, if only 1% of emails are spam, predicting every email as not spam yields 99% accuracy but catches no spam at all. This calculator warns you automatically when one class makes up more than 80% of your dataset. When you see this warning, shift focus to metrics that are robust to imbalance: MCC, Balanced Accuracy (average of sensitivity and specificity), F1 Score, and Cohen's Kappa. These metrics will give you a much more honest picture of whether your model has learned anything meaningful.
What are likelihood ratios and when are they used?
Likelihood Ratios (LRs) are used primarily in diagnostic medicine to quantify how much a positive or negative test result changes the probability that a patient has a disease. The Positive Likelihood Ratio (LR+ = Sensitivity / (1 − Specificity)) tells you how much more likely a positive test result is in a truly diseased patient than in a healthy one. LR+ above 10 is considered strong evidence to rule in a diagnosis. The Negative Likelihood Ratio (LR− = (1 − Sensitivity) / Specificity) below 0.1 is considered strong evidence to rule out a diagnosis. LRs work with Bayes' theorem to convert pre-test probability (disease prevalence) into post-test probability, making them directly actionable for individual patient decisions.
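In code, the odds-form Bayes update looks like this; the sensitivity, specificity, and 10% prevalence below are hypothetical:

```python
def post_test_probability(pre_test_prob, lr):
    # probability -> odds, multiply by the likelihood ratio, odds -> probability
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

sens, spec = 0.95, 0.90
lr_pos = sens / (1 - spec)    # ~9.5
lr_neg = (1 - sens) / spec    # ~0.056
prevalence = 0.10             # 10% pre-test probability
print(round(post_test_probability(prevalence, lr_pos), 3))  # 0.514
print(round(post_test_probability(prevalence, lr_neg), 4))  # 0.0061
```

A positive result lifts the probability from 10% to about 51%; a negative result drops it below 1%.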
How does the multi-class confusion matrix work?
For a model that classifies into N categories, the confusion matrix becomes an N×N grid. Each row represents the actual class and each column represents the predicted class. The diagonal cells (top-left to bottom-right) count correct predictions for each class; off-diagonal cells count misclassifications between specific pairs of classes. Per-class precision is computed as the diagonal value divided by the column sum (how many of the model's predictions for that class were correct). Per-class recall is the diagonal divided by the row sum (how many of the actual cases of that class were correctly identified). Macro-average weights each class equally; weighted average weights by class frequency. Cohen's Kappa extends naturally to multi-class settings and accounts for chance agreement across all classes simultaneously.
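A sketch of the per-class and averaged computations for a hypothetical 3×3 matrix (rows = actual, columns = predicted):

```python
matrix = [
    [50,  3,  2],   # actual class 0
    [ 5, 40,  5],   # actual class 1
    [ 2,  8, 35],   # actual class 2
]
n_classes = len(matrix)
total = sum(sum(row) for row in matrix)

recalls, weights = [], []
for k in range(n_classes):
    col_sum = sum(matrix[i][k] for i in range(n_classes))  # predicted as k
    row_sum = sum(matrix[k])                               # actually k
    precision_k = matrix[k][k] / col_sum
    recall_k = matrix[k][k] / row_sum
    print(f"class {k}: precision={precision_k:.4f} recall={recall_k:.4f}")
    recalls.append(recall_k)
    weights.append(row_sum)

macro_recall = sum(recalls) / n_classes
weighted_recall = sum(r * w for r, w in zip(recalls, weights)) / total
print(round(macro_recall, 4), round(weighted_recall, 4))  # 0.829 0.8333
```

Note that the weighted-average recall always equals overall accuracy (the diagonal sum divided by the total), which is why the macro average is the one that exposes weak minority classes.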