Monday, May 11, 2026

Model Evaluation Metrics


| Category | Metric | Formula / Idea | What It Measures | Best Used When | Weakness |
|---|---|---|---|---|---|
| Classification | Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness | Classes are balanced | Misleading on imbalanced datasets |
| Classification | Precision | TP / (TP + FP) | How correct positive predictions are | False positives are costly | May miss many real positives |
| Classification | Recall | TP / (TP + FN) | How many actual positives are caught | Missing positives is dangerous | Can increase false alarms |
| Classification | F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Balance between precision & recall | Both FP and FN matter | Harder to interpret intuitively |
| Classification | ROC-AUC (Receiver Operating Characteristic, Area Under Curve) | Area under the ROC curve | Class-separation capability | Comparing probabilistic classifiers | Can look good on imbalanced data |
| Classification | Log Loss / Cross-Entropy | Penalizes confident wrong predictions | Probability quality | Neural networks, probabilistic outputs | Less interpretable |
| Classification | MCC (Matthews Correlation Coefficient) | Correlation between predictions & truth | Balanced evaluation | Imbalanced datasets | More mathematically complex |
| Regression | MAE (Mean Absolute Error) | MAE = (1/n) Σ \|yᵢ − ŷᵢ\| | Average absolute error | You want an interpretable error | Treats all errors equally |
| Regression | MSE (Mean Squared Error) | MSE = (1/n) Σ (yᵢ − ŷᵢ)² | Squared prediction error | Large errors must be punished | Sensitive to outliers |
| Regression | RMSE (Root Mean Squared Error) | RMSE = √[(1/n) Σ (yᵢ − ŷᵢ)²] | Root of squared error | You need the same unit as the target | Still sensitive to outliers |
| Ranking / Retrieval | Precision@K | Fraction of top-K results that are relevant | Retrieval accuracy in top results | Search, RAG, recommenders | Ignores missed relevant items |
| Ranking / Retrieval | Recall@K | Fraction of relevant items retrieved in top K | Retrieval coverage | RAG retrieval | Can still retrieve irrelevant items |
| Ranking / Retrieval | MRR (Mean Reciprocal Rank) | Reciprocal rank of the first correct result | How early the first correct answer appears | QA systems, search | Ignores later results |
| Ranking / Retrieval | NDCG (Normalized Discounted Cumulative Gain) | Ranking quality with graded relevance | Overall ranking usefulness | Search, recommendation | More complex to compute |
| Ranking / Retrieval | MAP (Mean Average Precision) | Mean of average precision across queries | Retrieval quality across a dataset | Information retrieval | Computationally heavier |
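The classification and regression formulas in the table can be sketched in plain Python. This is a minimal reference implementation (function and variable names are illustrative, not from any particular library):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute core classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # MCC: correlation between predictions and truth, ranges over [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

def mae(y_true, y_pred):
    """Mean absolute error: average |yᵢ − ŷᵢ|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average (yᵢ − ŷᵢ)²."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: back in the target's original units."""
    return math.sqrt(mse(y_true, y_pred))
```

Note how the guard clauses avoid division by zero when a class is never predicted, which is exactly the regime where accuracy alone would hide problems.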

Precision vs Recall vs F1

| Metric | Core Meaning | Focus | Minimizes |
|---|---|---|---|
| Precision | "Be correct when predicting positive" | Prediction purity | False positives |
| Recall | "Catch all real positives" | Coverage | False negatives |
| F1 Score | "Balance precision & recall" | Overall balance | Both FP & FN |

When to Optimize What

| Situation | Optimize | Why |
|---|---|---|
| Spam filtering | Precision | Avoid marking important emails as spam |
| Facial recognition unlock | Precision | Avoid unauthorized access |
| Search engines | Precision | Show only relevant results |
| RAG reranking | Precision | Filter out irrelevant retrieved documents |
| Cancer detection | Recall | Missing the disease is dangerous |
| Fraud detection | Recall | Missing fraud causes financial loss |
| Intrusion detection | Recall | Missing attacks is risky |
| RAG retrieval | Recall first | Retrieve enough relevant documents |
| Chatbot intent classification | F1 | Both wrong predictions and missed intents matter |
| Imbalanced datasets | F1 / MCC | Accuracy becomes misleading |
| Balanced datasets | Accuracy | Simple overall correctness works |
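To see why accuracy misleads on imbalanced data, consider a toy sketch (the numbers are made up for illustration): a model that always predicts "negative" on a 95:5 dataset scores 95% accuracy while catching zero positives.

```python
# 100 samples: 95 negatives, 5 positives; the model predicts all negatives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)                    # 0.95 -- looks impressive
recall = tp / (tp + fn) if (tp + fn) else 0.0         # 0.0  -- catches nothing
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)  # 0.0  -- exposes the failure
print(accuracy, recall, f1)
```

F1 (and MCC) collapse to zero here because the model never finds a positive, which is exactly the failure that 95% accuracy hides.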

Threshold Tuning Effects

| Threshold Strategy | Precision | Recall | Behavior |
|---|---|---|---|
| Higher threshold | Increases | Decreases | Stricter predictions |
| Lower threshold | Decreases | Increases | More permissive predictions |
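The effect of moving the decision threshold can be demonstrated with a small sketch (the scores and labels below are invented example data):

```python
def precision_recall_at(scores, labels, threshold):
    """Binarize scores at a threshold, then compute precision and recall."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Model confidence scores and true labels for 8 examples.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

print(precision_recall_at(scores, labels, 0.80))  # stricter:   (1.0, 0.5)
print(precision_recall_at(scores, labels, 0.50))  # middle:     (0.75, 0.75)
print(precision_recall_at(scores, labels, 0.25))  # permissive: (0.666..., 1.0)
```

Raising the threshold from 0.25 to 0.80 pushes precision from about 0.67 to 1.0 while recall falls from 1.0 to 0.5, which is the trade-off the table describes.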

Common Real-World ML Pipeline Strategy


| Stage | Typical Goal | Main Metric Focus |
|---|---|---|
| Initial Retrieval | Retrieve broadly | Recall |
| Reranking | Remove irrelevant items | Precision |
| Final Generation | Correct, grounded answers | Faithfulness / Accuracy |
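The retrieval-side metrics behind this pipeline (Precision@K, Recall@K, MRR) can be sketched in a few lines of Python; the document IDs below are hypothetical:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(retrieved, relevant, 3))  # 1 relevant doc in the top 3
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant docs found
```

Note the two stages tune different knobs: broad initial retrieval raises Recall@K by fetching more candidates, then the reranker restores precision near the top of the list.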

Quick Recap


| Metric | Easy Way to Remember |
|---|---|
| Accuracy | "Overall correctness" |
| Precision | "Be accurate" |
| Recall | "Catch everything" |
| F1 | "Balance both" |
| MAE | "Average error" |
| MSE | "Punish large mistakes" |
| RMSE | "Error in original units" |
| ROC-AUC | "How well classes separate" |
