Monday, May 11, 2026

TF-IDF

TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a numerical statistic used in information retrieval and text mining to reflect how important a word is to a document in a collection or corpus.

It is essentially a way to convert text into numbers so that a machine can understand which words carry the most "meaning" in a specific context.

Its roots can be traced back to 1972, when Karen Spärck Jones introduced the idea behind IDF, which she described as term specificity.


The Core Components

To understand TF-IDF, you have to break it down into its two constituent parts:

1. Term Frequency (TF)

This measures how frequently a term occurs in a document. The logic is simple: the more a word appears in a document, the more important it likely is for that specific text.

Formula:

TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
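
As a minimal sketch of the formula above, the following Python computes TF for one term in one document. The function name term_frequency and the whitespace tokenization are assumptions made for illustration; real pipelines usually lowercase text and strip punctuation first.

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """Fraction of the document's tokens that equal `term`."""
    if not document_tokens:
        return 0.0  # assumption: an empty document gives TF = 0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# Hypothetical document, tokenized by splitting on whitespace.
tokens = "the giraffe eats leaves and the giraffe sleeps".split()
print(term_frequency("giraffe", tokens))  # 2 of 8 tokens -> 0.25
```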

2. Inverse Document Frequency (IDF)

This measures how rare, and therefore how informative, a term is across the entire corpus. While TF rewards words that occur often within a single document, IDF penalizes words that appear in many documents.

Formula:

IDF(t, D) = log( Total number of documents N / Number of documents containing term t )
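
The IDF formula can be sketched the same way. The version below assumes the natural logarithm (the formula does not fix a base) and represents each document as a set of words. The function name and the choice to return 0 for a term that appears in no document are illustrative assumptions, not part of the definition.

```python
import math

def inverse_document_frequency(term, corpus):
    """log(N / df), where df is the number of documents containing `term`."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # assumption: avoid division by zero for unseen terms
    return math.log(n_docs / df)

# Hypothetical corpus: each document is a set of its distinct words.
corpus = [
    {"the", "giraffe", "eats"},
    {"the", "lion", "sleeps"},
    {"the", "zebra", "runs"},
    {"the", "lion", "hunts"},
]
idf_the = inverse_document_frequency("the", corpus)          # log(4 / 4) = 0
idf_giraffe = inverse_document_frequency("giraffe", corpus)  # log(4 / 1)
print(idf_the, idf_giraffe)
```

A word found in every document gets an IDF of exactly 0, while a word found in a single document gets the largest value available for that corpus size.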


How It Works Together

The final TF-IDF score is calculated by multiplying the two values:

TF-IDF = TF × IDF

  • High TF-IDF: Occurs when a word has a high frequency in one document but appears rarely in other documents.
  • Low TF-IDF: Occurs when a word is very common across all documents, or when it barely appears in the document being scored.
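
Putting the two parts together, here is a small sketch that scores every term in every document of a toy corpus. The helper name tf_idf, the sample sentences, and the use of the natural logarithm are assumptions for illustration. Libraries such as scikit-learn apply additional smoothing and normalization, so their numbers will differ from this plain version.

```python
import math
from collections import Counter

def tf_idf(corpus_tokens):
    """Return one {term: score} dict per document, using TF x IDF."""
    n_docs = len(corpus_tokens)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    scores = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical three-document corpus.
docs = [
    "the giraffe has a long neck the giraffe eats leaves".split(),
    "the lion hunts at night".split(),
    "the zebra runs from the lion".split(),
]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
print(top_term, scores[0]["the"])
```

In the first document "giraffe" gets the highest score because it is both frequent there and absent elsewhere, while "the" scores 0 because it appears in every document.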

Why Do We Use It?

  • Filtering out noise: Common filler words become less important.
  • Highlighting distinctiveness: Unique technical words get higher importance.
  • Search Engine Ranking: Helps identify which documents best match a query.

A Practical Example

Imagine you have a collection of 1,000 documents about animals.

  • The word "the" appears in all 1,000 documents. Its IDF is log(1000 / 1000) = 0, so its TF-IDF is 0 no matter how often it occurs.
  • The word "Giraffe" appears many times in one document but in only a few other documents. Its TF-IDF for that document becomes high.

This tells the computer that the document is specifically about giraffes.
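
To make the arithmetic concrete, here is the same example with numbers plugged in. Only the 1,000-document corpus size comes from the example above. The document length, the word counts, and the number of documents mentioning "giraffe" are hypothetical values chosen for illustration, and the natural logarithm is assumed.

```python
import math

N = 1000            # documents in the corpus (from the example)
doc_length = 500    # hypothetical: words in the giraffe document

# Hypothetical counts for the two words in that document.
the_count, the_docs = 40, 1000          # "the" appears in every document
giraffe_count, giraffe_docs = 50, 10    # "giraffe" appears in 10 documents

the_score = (the_count / doc_length) * math.log(N / the_docs)
giraffe_score = (giraffe_count / doc_length) * math.log(N / giraffe_docs)

print(the_score)                 # 0.08 * log(1) = 0.0
print(round(giraffe_score, 2))   # 0.1 * log(100), about 0.46
```

Under these assumed counts the two words appear with similar frequency in the document, yet only "giraffe" receives a non-zero score.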
