Thursday, May 7, 2026

RAKE and YAKE

RAKE (Rapid Automatic Keyword Extraction) and YAKE (Yet Another Keyword Extractor) are popular, lightweight, unsupervised keyword extraction algorithms used in Natural Language Processing (NLP) to identify the most relevant words or phrases in a document. Both analyze the text, discard uninformative words, and rank the remaining terms by importance without needing prior training, external corpora, or labeled data.
RAKE (Rapid Automatic Keyword Extraction)
RAKE is designed for high efficiency, making it ideal for processing individual documents quickly. 
  • How it works:
    1. Stoplist Filtering: Uses stopwords (e.g., "the", "and") and punctuation as delimiters to split the text into candidate phrases.
    2. Word Statistics: For each word w, it computes the frequency f(w) and the degree d(w), the number of words it co-occurs with in candidate phrases.
    3. Word Score: Each word's score is the ratio d(w)/f(w).
    4. Ranking: RAKE ranks candidate phrases by the sum of the scores of the words they contain.
  • Best for: Cases where speed is the priority, such as processing large amounts of text.
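The four RAKE steps above can be sketched in plain Python. This is a minimal illustration, not the rake-nltk implementation: the tiny STOPWORDS set and the function names candidate_phrases and rake_scores are assumptions made for this example, and the degree of a word is taken to include the word itself, which is one common convention.

```python
import re
from collections import defaultdict

# Tiny illustrative stoplist (an assumption); real setups use a much larger one.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for", "on", "with"}


def candidate_phrases(text):
    """Step 1: split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation always ends a phrase, so split on it first.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text):
    """Steps 2-4: score words by degree/frequency, then rank phrases."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree here counts the word itself plus its phrase neighbours.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    phrase_score = {
        " ".join(phrase): sum(word_score[w] for w in phrase) for phrase in phrases
    }
    # Higher score means a more important phrase in RAKE.
    return sorted(phrase_score.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = "Natural Language Processing is a branch of Artificial Intelligence."
    for phrase, score in rake_scores(sample):
        print(f"{score:4.1f}  {phrase}")
```

Because a word's degree grows with the length of the phrases it appears in, this scoring tends to favor longer candidate phrases.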
YAKE (Yet Another Keyword Extractor)
YAKE is a more modern, flexible alternative that is independent of language, domain, and document size. 
  • How it works:
    1. Candidate Selection: Like RAKE, it first identifies candidate words and phrases in the text.
    2. Statistical Features: It scores each candidate using several statistical features rather than RAKE's single ratio, including word frequency, word position in the document, and how often a word appears in different contexts.
    3. Scoring: It assigns each candidate a score, where lower scores indicate better keywords.
  • Best for: When higher accuracy is needed, as it often produces more precise results than RAKE.
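To make the feature-based idea more concrete, here is a toy sketch that collects three per-word features of the kind described above (frequency, position, and number of distinct neighbouring words) and combines them into a lower-is-better score. Everything in it is an assumption made for illustration: the STOPWORDS set, the names word_features and toy_score, and especially the scoring formula, which is not YAKE's actual formula. The sketch also assumes that a word seen next to many different neighbours is less specific and should score worse.

```python
import re
from collections import defaultdict
from statistics import median

# Tiny illustrative stoplist (an assumption for this example).
STOPWORDS = {"a", "an", "and", "the", "of", "is", "are", "in", "to", "by"}


def word_features(text):
    """Collect frequency, position, and context-diversity features per word."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    freq = defaultdict(int)
    sentence_ids = defaultdict(list)
    neighbours = defaultdict(set)
    for idx, sentence in enumerate(sentences):
        words = re.findall(r"\w+", sentence)
        for i, word in enumerate(words):
            if word in STOPWORDS:
                continue
            freq[word] += 1
            sentence_ids[word].append(idx)
            # Record the distinct words directly left and right of this one.
            for j in (i - 1, i + 1):
                if 0 <= j < len(words):
                    neighbours[word].add(words[j])
    return {
        w: {
            "frequency": freq[w],
            # Median index of the sentences the word occurs in (0 = first).
            "position": median(sentence_ids[w]),
            "contexts": len(neighbours[w]),
        }
        for w in freq
    }


def toy_score(features):
    """Hypothetical lower-is-better score; NOT YAKE's published formula."""
    return (
        (1 + features["position"])
        * (1 + features["contexts"])
        / (1 + features["frequency"])
    )


if __name__ == "__main__":
    sample = (
        "Keyword extraction finds key terms. "
        "Keyword extraction is fast. Terms are ranked by score."
    )
    feats = word_features(sample)
    # Sort ascending: the lowest score is treated as the best keyword.
    for word in sorted(feats, key=lambda w: toy_score(feats[w])):
        print(f"{toy_score(feats[word]):5.2f}  {word}")
```

In this toy version, words that appear early and often get low (good) scores, which mirrors the lower-is-better convention of YAKE's output without reproducing its real weighting.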
Key Differences at a Glance

  Feature        RAKE                         YAKE
  Approach       Frequency & Co-occurrence    Statistical & Contextual Features
  Speed          Extremely fast               Fast, but often slower than RAKE
  Accuracy       Good                         Better/Higher
  Independence   Domain independent           Language & Domain independent

Usage Example (Python)
Both algorithms are available in Python. RAKE is often used via rake-nltk, and YAKE has its own library, yake.

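    # YAKE example (requires: pip install yake)
    import yake

    text = "Natural Language Processing is a branch of Artificial Intelligence."

    # Create an extractor with the library's default settings
    kw_extractor = yake.KeywordExtractor()

    # Returns a list of (keyword, score) tuples; lower scores mean better keywords
    keywords = kw_extractor.extract_keywords(text)
    print(keywords)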

