TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a numerical statistic used in information retrieval and text mining to reflect how important a word is to a document in a collection or corpus.
It is essentially a way to convert text into numbers so that a machine can understand which words carry the most "meaning" in a specific context.
Its roots can be traced back to 1972, when Karen Spärck Jones conceived the idea of IDF.
The Core Components
To understand TF-IDF, you have to break it down into its two constituent parts:
1. Term Frequency (TF)
This measures how frequently a term occurs in a document. The logic is simple: the more a word appears in a document, the more important it likely is for that specific text.
Formula:
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
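As a minimal sketch of this formula, assuming a document has already been split into a list of lowercase tokens (the function name term_frequency and the sample sentence are illustrative, not from any particular library):

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """Share of the document's tokens that are equal to `term`."""
    if not document_tokens:
        return 0.0  # assumption: an empty document gives every term a TF of 0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# Hypothetical five-word document, tokenized by whitespace.
tokens = "the giraffe ate the leaves".split()
print(term_frequency("the", tokens))      # 2 of 5 tokens -> 0.4
print(term_frequency("giraffe", tokens))  # 1 of 5 tokens -> 0.2
```

Note that "the" scores higher than "giraffe" here, which is exactly why TF alone is not enough.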
2. Inverse Document Frequency (IDF)
This measures how rare, and therefore how informative, a term is across the entire corpus. While TF rewards words that occur often within a single document, IDF penalizes words that appear in many documents of the corpus.
Formula:
IDF(t, D) = log( (Total number of documents N) / (Number of documents containing term t) )
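The IDF formula can be sketched the same way. The version below assumes the natural logarithm (the formula does not fix a base) and treats the corpus as a list of token lists. The formula is undefined for a term that appears in no document, so returning 0.0 in that case is an assumption made here for illustration:

```python
import math

def inverse_document_frequency(term, corpus):
    """log(N / df), where df is the number of documents containing `term`."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    if doc_freq == 0:
        return 0.0  # assumption: terms absent from the corpus score 0
    return math.log(n_docs / doc_freq)

# Hypothetical three-document corpus.
corpus = [
    ["the", "giraffe", "eats"],
    ["the", "lion", "sleeps"],
    ["the", "zebra", "runs"],
]
print(inverse_document_frequency("the", corpus))      # log(3/3) = 0.0
print(inverse_document_frequency("giraffe", corpus))  # log(3/1), about 1.1
```

Many libraries use smoothed variants of this formula to avoid the division-by-zero case, so their scores may differ from this sketch.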
How It Works Together
The final TF-IDF score is calculated by multiplying the two values:
TF-IDF = TF × IDF
- High TF-IDF: Occurs when a word has a high frequency in one document but appears rarely in other documents.
- Low TF-IDF: Occurs when a word is very common across all documents.
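Putting the two parts together, a self-contained sketch might look like the following. The function name tf_idf, the natural logarithm, and the zero score for unseen terms or empty documents are assumptions made for illustration:

```python
import math
from collections import Counter

def tf_idf(term, document_tokens, corpus):
    """TF(t, d) * IDF(t, D) for one term in one document of the corpus."""
    if not document_tokens:
        return 0.0
    tf = Counter(document_tokens)[term] / len(document_tokens)
    doc_freq = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / doc_freq) if doc_freq else 0.0
    return tf * idf

# Hypothetical corpus: "the" is everywhere, "giraffe" is in one document only.
corpus = [
    "the giraffe eats the leaves".split(),
    "the lion sleeps".split(),
    "the zebra runs".split(),
]
first_doc = corpus[0]
print(tf_idf("the", first_doc, corpus))      # 0.0: high TF, but IDF is log(1) = 0
print(tf_idf("giraffe", first_doc, corpus))  # positive: 0.2 * log(3)
```

Even though "the" is the most frequent word in the first document, its score collapses to zero because it appears in every document.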
Why Do We Use It?
- Filtering out noise: Common filler words become less important.
- Highlighting distinctiveness: Unique technical words get higher importance.
- Search Engine Ranking: Helps identify which documents best match a query.
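To illustrate the search-ranking use case, here is a rough sketch that scores each document by summing the TF-IDF of the query terms. This is one simple scoring scheme chosen for illustration, not how any particular search engine works, and rank_documents and the sample corpus are hypothetical:

```python
import math
from collections import Counter

def rank_documents(query_tokens, corpus):
    """Return (document index, score) pairs, best match first."""
    n_docs = len(corpus)
    scores = []
    for index, doc in enumerate(corpus):
        counts = Counter(doc)
        score = 0.0
        for term in query_tokens:
            doc_freq = sum(1 for other in corpus if term in other)
            if doc_freq == 0 or not doc:
                continue  # assumption: unseen terms contribute nothing
            tf = counts[term] / len(doc)
            idf = math.log(n_docs / doc_freq)
            score += tf * idf
        scores.append((index, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

corpus = [
    "the giraffe has a long neck".split(),
    "the lion hunts at night".split(),
    "the giraffe and the zebra share the savanna".split(),
]
ranking = rank_documents(["giraffe", "neck"], corpus)
print([index for index, _ in ranking])  # [0, 2, 1]
```

The first document wins because it contains both query terms, the third contains only "giraffe", and the second contains neither.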
A Practical Example
Imagine you have a collection of 1,000 documents about animals.
- The word "the" appears in all 1,000 documents. Its IDF is log(1000/1000) = 0, so its TF-IDF is 0 no matter how often it occurs.
- The word "Giraffe" appears many times in one document but in only a few other documents. Its TF-IDF in that document becomes high.
This tells the computer that the document is specifically about giraffes.
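The arithmetic behind this example can be checked with a few lines. Only the corpus size of 1,000 and the fact that "the" appears in every document come from the example; the document length, the word counts, and the number of documents mentioning giraffes are made-up values chosen for illustration, and the natural logarithm is assumed:

```python
import math

n_docs = 1000           # from the example: 1,000 documents about animals
doc_length = 1000       # assumed: the giraffe document has 1,000 words

# "the" appears in every document (from the example).
the_count = 80          # assumed count within the giraffe document
the_doc_freq = 1000
the_tf_idf = (the_count / doc_length) * math.log(n_docs / the_doc_freq)

# "giraffe" is concentrated in one document.
giraffe_count = 50      # assumed count within the giraffe document
giraffe_doc_freq = 10   # assumed: only 10 documents mention giraffes
giraffe_tf_idf = (giraffe_count / doc_length) * math.log(n_docs / giraffe_doc_freq)

print(the_tf_idf)                # 0.0, because log(1000/1000) = 0
print(round(giraffe_tf_idf, 3))  # 0.23, i.e. 0.05 * log(100)
```

Under these assumptions "the" occurs more often than "giraffe" in the document, yet only "giraffe" receives a non-zero score.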