EverydayTools — Simple • Free • Fast
F1 Score Calculator

Compute classification metrics from confusion matrix values or precision and recall

The F1 score is one of the most widely used evaluation metrics in machine learning and data science. It provides a single number that balances two competing concerns in classification models: precision (how many of the model's positive predictions are actually correct) and recall (how many of the actual positive cases the model successfully identified). This calculator lets you compute the F1 score along with a comprehensive suite of related metrics from either raw confusion matrix values or directly from precision and recall scores.

When building a binary classifier — whether it is detecting spam emails, identifying fraudulent transactions, diagnosing diseases from medical scans, or flagging defective products on a production line — accuracy alone can be misleading. Consider a fraud detection model applied to a dataset where only 1% of transactions are fraudulent. A naive model that simply labels every transaction as legitimate achieves 99% accuracy, yet it catches zero fraud cases. The F1 score avoids this trap by focusing on the model's performance on the positive class, making it particularly valuable when dealing with imbalanced datasets.

Understanding the four cells of a confusion matrix is the starting point for any thorough model evaluation. True Positives (TP) are cases correctly identified as positive. True Negatives (TN) are cases correctly identified as negative. False Positives (FP) — sometimes called Type I errors — are negative cases incorrectly predicted as positive. False Negatives (FN) — Type II errors — are positive cases the model failed to detect. From these four numbers, you can derive precision, recall, specificity, negative predictive value, false positive rate, false negative rate, false discovery rate, Matthews Correlation Coefficient, and of course the F1 score itself.

Precision answers the question: of everything the model flagged as positive, what fraction was actually positive? A high-precision model is conservative — it only predicts positive when it is very confident. Recall answers: of all the actual positives in the data, what fraction did the model catch? A high-recall model is aggressive — it would rather over-predict than miss cases. These two goals are often in tension. Increasing a model's decision threshold raises precision but lowers recall; lowering the threshold does the opposite.

The F1 score is the harmonic mean of precision and recall: F1 = 2 × (P × R) / (P + R). The harmonic mean is used instead of the arithmetic mean because it punishes extreme imbalances — a model with 100% precision but 0% recall gets an F1 of exactly 0, just as it should, since it catches nothing. F1 ranges from 0 (worst) to 1 (perfect).

The F-beta score generalises the F1 score by introducing a parameter beta (β) that controls how much weight to give recall relative to precision: F-beta = (1 + β²) × (P × R) / (β² × P + R). When β = 1, this reduces to F1. When β = 0.5, precision is weighted more — useful when false positives are costly, such as in spam filtering where blocking a legitimate email is worse than missing one spam. When β = 2, recall is weighted more — essential in medical screening where missing a true positive (failing to diagnose a sick patient) is far worse than a false alarm.

For multi-class classification problems, this calculator also supports macro, micro, and weighted F1 averaging. Macro F1 computes per-class F1 scores then averages them equally, treating all classes the same regardless of size — ideal when class imbalance should not influence the overall metric. Micro F1 aggregates all true positives, false positives, and false negatives globally before computing — it weights by class frequency, making it equivalent to accuracy in many multiclass settings. Weighted F1 weights each class's F1 by its support (number of actual instances), striking a balance useful when class imbalance is real and should be acknowledged.
This calculator provides a visual confusion matrix, a precision–recall comparison chart, and a progress ring showing where your F1 score falls on the 0–1 scale with colour-coded performance zones. Use the preset scenarios to explore how different model behaviour patterns (high precision vs. high recall, imbalanced datasets, near-random classifiers) affect all the metrics at once. Export your results to CSV for inclusion in reports or further analysis.

Understanding F1 Score and Classification Metrics

What Is the F1 Score?

The F1 score is the harmonic mean of precision and recall, computed as F1 = 2 × (Precision × Recall) / (Precision + Recall). It ranges from 0 (worst) to 1 (perfect). Unlike accuracy, which can be inflated by class imbalance, the F1 score focuses on the model's positive-class performance. It penalises models that sacrifice either precision or recall. An F1 score above 0.85 is generally considered excellent, 0.70–0.85 is good, 0.50–0.70 is fair, and below 0.50 suggests the model is near-random. The F1 score is especially popular in information retrieval, NLP, medical diagnostics, and fraud detection tasks.
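The harmonic-mean behaviour described above is easy to verify in a few lines. This is a minimal Python sketch, not this calculator's actual implementation; the `f1_score` helper name is illustrative:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model with perfect precision but zero recall scores 0, not 0.5:
print(f1_score(1.0, 0.0))   # 0.0
print(f1_score(0.8, 0.6))   # ~0.686
```

Note how the harmonic mean sits below the arithmetic mean (0.7 here) whenever precision and recall differ — that is the penalty for imbalance.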

How Is It Calculated?

From a binary confusion matrix with True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN): Precision = TP / (TP + FP); Recall = TP / (TP + FN); F1 = 2×TP / (2×TP + FP + FN). The F-beta generalisation is: F-beta = (1 + β²) × P × R / (β²×P + R). Matthews Correlation Coefficient (MCC) is computed as MCC = (TP×TN − FP×FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), providing a balanced measure even on imbalanced datasets. For multi-class problems, per-class metrics are averaged using macro (equal weight), micro (pooled counts), or weighted (by class support) strategies.
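These formulas translate directly into code. The sketch below follows the definitions above; the function name and the example counts are made up for illustration:

```python
import math

def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive precision, recall, F1, and MCC from the four confusion matrix cells."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

m = confusion_metrics(tp=90, fp=10, tn=880, fn=20)
# precision = 90/100 = 0.90, recall = 90/110 ≈ 0.818, F1 = 180/210 ≈ 0.857
```

The guard clauses matter: with degenerate inputs (e.g. no predicted positives) the textbook formulas divide by zero, and the usual convention is to report 0.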

Why Does This Matter?

Choosing the right evaluation metric shapes how you tune and compare models. Accuracy is misleading on imbalanced data — a model predicting the majority class always scores high accuracy but is useless. The F1 score balances the cost of false positives and false negatives. In medical screening, missing a positive case (FN) may be catastrophic, so high recall is prioritised — use F2 (β=2). In spam filtering, a false positive (blocking legitimate mail) is more annoying than missing spam, so use F0.5. MCC is considered the most informative single metric for binary classification as it accounts for all four confusion matrix cells and is robust to class imbalance.

Limitations to Keep in Mind

The F1 score ignores True Negatives entirely. When TN performance matters — for example, in imbalanced anomaly detection where the negative class is huge and important — MCC or the area under the ROC curve (AUC-ROC) may be more appropriate. F1 also assumes that precision and recall have equal importance when β=1; this is not always true. Macro F1 gives equal weight to all classes regardless of frequency, which can mislead in severely imbalanced multi-class scenarios. Weighted F1 may mask poor performance on rare but important minority classes. Always consider multiple metrics together rather than relying on any single score.

How to Use This Calculator

1. Choose Your Mode

Select Binary for a standard two-class classification problem (e.g., spam vs. not-spam, fraud vs. legitimate) or Multi-class for models with 2–5 output categories. Within binary mode, choose Confusion Matrix to enter TP/FP/TN/FN, or Precision & Recall to enter the scores directly.

2. Enter Your Values

For confusion matrix mode, type your True Positives, False Positives, True Negatives, and False Negatives from your model's evaluation output. These are available from scikit-learn's confusion_matrix() function, PyTorch metrics, or any evaluation framework. Use the preset buttons to load example scenarios and see how different model behaviours compare.
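If you are not using an evaluation framework, the four cells can be tallied by hand from paired label lists. The `count_cells` helper below is a hypothetical sketch; with scikit-learn, `confusion_matrix(y_true, y_pred).ravel()` returns the same four counts in the order `tn, fp, fn, tp` for binary labels:

```python
def count_cells(y_true, y_pred, positive=1):
    """Tally TP/FP/TN/FN for a binary problem from paired label lists."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        elif t == positive:
            fn += 1       # predicted negative, actually positive
        else:
            tn += 1       # predicted negative, actually negative
    return tp, fp, tn, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(count_cells(y_true, y_pred))  # (3, 1, 3, 1)
```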

3. Set Beta for F-Beta Score

Choose β=1 for the standard F1 score (equal precision/recall weight), β=0.5 to prioritise precision (e.g., spam filters where false alarms are annoying), or β=2 to prioritise recall (e.g., medical screening where missing a case is dangerous). Use Custom to enter any positive beta value.

4. Review Results and Export

The progress ring shows your F1 score on a colour-coded 0–1 scale (green = excellent, red = poor). The metrics chart compares precision, recall, and accuracy visually. The confusion matrix grid shows the breakdown with percentages. Click Export CSV to download all metrics for reports or further analysis.

Frequently Asked Questions

What is a good F1 score?

An F1 score above 0.85 is generally considered excellent and indicates a strong model for most real-world applications. Scores between 0.70 and 0.85 are good and acceptable for many use cases. Scores between 0.50 and 0.70 are fair and suggest the model needs further tuning or more data. Scores below 0.50 indicate poor performance — roughly equivalent to or worse than random guessing for a balanced dataset. However, what counts as 'good' is domain-dependent. In medical diagnostics, even a 0.95 F1 may be insufficient if the cost of missed diagnoses is catastrophic. Always set thresholds based on the specific risk tolerance of your application.

What is the difference between precision and recall?

Precision measures how many of the model's positive predictions are actually correct: Precision = TP / (TP + FP). It answers 'When the model says positive, how often is it right?' Recall (also called sensitivity or true positive rate) measures how many of the actual positives the model successfully detected: Recall = TP / (TP + FN). It answers 'Of all real positives, how many did the model catch?' These metrics trade off: raising the decision threshold increases precision (fewer but more confident positives) while lowering recall. The F1 score balances both into a single metric using their harmonic mean.

When should I use F-beta instead of F1?

Use F-beta (F-β) when precision and recall are not equally important in your application. Set β < 1 (e.g., β=0.5) when false positives are more costly than false negatives — for example, in spam detection where blocking legitimate email is worse than missing spam. Set β > 1 (e.g., β=2) when false negatives are more costly — for example, in cancer screening where missing a diagnosis is far more dangerous than a false alarm. β=1 gives the standard F1 score for equal importance. The formula is: F-beta = (1 + β²) × P × R / (β²×P + R).
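To see the effect of β, evaluate the formula at a fixed precision and recall. A small sketch with assumed values P = 0.9, R = 0.6 (since precision exceeds recall here, F0.5 > F1 > F2):

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta = (1 + β²)·P·R / (β²·P + R); 0.0 when the denominator is 0."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

p, r = 0.9, 0.6
print(fbeta(p, r, 0.5))  # precision-weighted, ~0.818
print(fbeta(p, r, 1.0))  # standard F1 = 0.72
print(fbeta(p, r, 2.0))  # recall-weighted, ~0.643
```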

What is the Matthews Correlation Coefficient (MCC)?

The Matthews Correlation Coefficient (MCC) is a measure of classification quality that accounts for all four confusion matrix cells — TP, FP, TN, and FN. It ranges from -1 (perfect inverse prediction) through 0 (random) to +1 (perfect prediction). Unlike F1, MCC is not affected by class imbalance and rewards models that correctly classify both the positive and the negative class. Many researchers consider MCC the single most informative metric for binary classification. A model with a high F1 score but a low MCC likely performs well on positives but poorly on negatives. MCC = (TP×TN − FP×FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

What is the difference between macro, micro, and weighted F1?

For multi-class classification, per-class F1 scores must be averaged. Macro F1 averages all class F1 scores with equal weight, regardless of how many samples each class has — it treats all classes equally and highlights poor performance on rare classes. Micro F1 pools TP, FP, and FN across all classes before computing — it is dominated by the majority class and equals overall accuracy when each sample belongs to exactly one class. Weighted F1 computes per-class F1 then weights by class support (number of actual instances), balancing micro and macro. Use macro when minority classes matter equally; use weighted when class imbalance reflects reality and should be respected.
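The three averaging strategies can be compared on toy per-class counts. This sketch assumes three hypothetical classes, each described by its (TP, FP, FN) counts, with support taken as TP + FN:

```python
def per_class_f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0

# (tp, fp, fn) per class; the third class is rare and poorly handled
cells = [(50, 5, 10), (30, 10, 5), (5, 2, 20)]
f1s = [per_class_f1(*c) for c in cells]
supports = [tp + fn for tp, _, fn in cells]

macro = sum(f1s) / len(f1s)                 # ≈ 0.661, dragged down by the rare class
tp_all = sum(tp for tp, _, _ in cells)
fp_all = sum(fp for _, fp, _ in cells)
fn_all = sum(fn for _, _, fn in cells)
micro = per_class_f1(tp_all, fp_all, fn_all)  # ≈ 0.766, dominated by big classes
weighted = sum(f * s for f, s in zip(f1s, supports)) / sum(supports)  # ≈ 0.733
```

Macro comes out lowest because the weak minority class counts as much as the strong majority classes — exactly the behaviour described above.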

Why use F1 score instead of accuracy?

Accuracy measures the proportion of all predictions (both positive and negative) that are correct: (TP + TN) / total. This works well when classes are balanced, but fails on imbalanced datasets. Consider a disease screening test where 1% of patients are positive. A model that always predicts negative achieves 99% accuracy while catching zero cases — F1 = 0. The F1 score focuses on the positive class, making it much more meaningful when the cost of missed detections is high or when the negative class vastly outnumbers the positive class. Always use F1 (or MCC) alongside accuracy for any imbalanced classification task.
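The 99%-accuracy trap is easy to reproduce. A sketch of the always-negative model on a 1%-positive dataset of 1000 samples (the data here is synthetic, purely for illustration):

```python
# 1000 samples, 1% positive; a "model" that always predicts negative.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)                              # 0.99
f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0   # 0.0
print(accuracy, f1)
```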

© 2026 EverydayTools.io. All rights reserved.