Keyword Extraction & Topic Modelling

Keyword Extraction
1. TF
2. TF-IDF
3. RAKE (rake-keyword, python-rake)
4. YAKE (yake)
5. KeyBERT (KeyBERT)
6. spaCy
7. Spark NLP
8. TextRank
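
As a rough illustration of item 2, here is a minimal TF-IDF keyword sketch using only the standard library. The tokenizer, the unsmoothed `log(N / df)` weighting, and the helper name `tfidf_keywords` are assumptions made for this example; libraries such as scikit-learn use smoothed variants and will give different scores.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs only (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_k=3):
    """Return the top_k (term, score) pairs for docs[doc_index]."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang on the tree",
]
keywords = tfidf_keywords(docs, doc_index=1, top_k=2)
print([term for term, _ in keywords])  # ['chased', 'dog']
```

Words that occur in every document (here "the") get a score of zero, which is why TF-IDF needs at least a small reference corpus, unlike plain TF.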

Topic Modeling
Topic modeling needs a large collection of text documents to work, so it is not well suited to extracting topics from single sentences, paragraphs, or tweets.
1. Latent Dirichlet Allocation (LDA)
2. BERTopic
3. Non-Negative Matrix Factorization (NMF)
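
To make item 3 concrete, the sketch below factorizes a tiny document-term count matrix with NMF using the classic multiplicative update rules, in pure Python. The toy corpus, the function names, and the fixed iteration count are assumptions for illustration; in practice a library implementation (for example scikit-learn's `NMF`) on a much larger corpus would be the usual choice.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into W (docs x k) and H (k x terms), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


docs = [
    "cat dog pet vet",
    "dog pet food cat",
    "stock market trade bank",
    "bank trade stock price",
]
vocab = sorted({t for d in docs for t in d.split()})
v = [[float(d.split().count(t)) for t in vocab] for d in docs]

w, h = nmf(v, k=2)
for topic in h:
    top = sorted(zip(topic, vocab), reverse=True)[:3]
    print([term for _, term in top])
```

Each row of `h` is a topic expressed as term weights, and each row of `w` says how strongly a document loads on each topic. With only four documents the result is purely illustrative, which is the point made above about needing a large collection.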

Zero-shot classification is also an option, but it requires a predefined list of candidate keywords or tags.

Alternative approach: use LLMs, which can run locally and may understand context better than the libraries above.
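
A minimal sketch of the LLM approach might look like the following. It only covers the parts that do not depend on a particular model: building the prompt and parsing the reply defensively. The `complete` callable, the prompt wording, and `fake_complete` are hypothetical stand-ins; swap in whatever local or hosted model client is actually used.

```python
import json


def build_tagging_prompt(text, max_tags=5):
    """Ask for a strict JSON array so the reply is easy to parse."""
    return (
        f"Extract up to {max_tags} short topical tags from the text below. "
        "Respond with a JSON array of lowercase strings and nothing else.\n\n"
        f"Text:\n{text}"
    )


def parse_tags(raw_response, max_tags=5):
    """Return a de-duplicated list of tags, or [] if the reply is not a JSON list."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    tags = []
    for item in data:
        if isinstance(item, str):
            tag = item.strip().lower()
            if tag and tag not in tags:
                tags.append(tag)
    return tags[:max_tags]


def tag_with_llm(text, complete, max_tags=5):
    """`complete` is any callable str -> str wrapping a model (hypothetical interface)."""
    return parse_tags(complete(build_tagging_prompt(text, max_tags)), max_tags)


def fake_complete(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return '["Keyword Extraction", "nlp", "NLP", "topic modeling"]'


tags = tag_with_llm("Notes on RAKE, YAKE and LDA.", fake_complete)
print(tags)  # ['keyword extraction', 'nlp', 'topic modeling']
```

Normalizing and de-duplicating the reply matters in practice because models often return the same tag with different casing.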

List of approaches:

Method	Supervised/Unsupervised	Description	Notes
Keyword Extraction	Unsupervised	Extracts important words/phrases directly from text	Algorithms: RAKE, YAKE, KeyBERT, TF-IDF
Topic Modeling	Unsupervised	Discovers hidden themes across document collections	Algorithms: LDA, NMF, BERTopic
Zero-Shot Classification	Unsupervised	Classifies text into predefined categories without training	Requires candidate labels but no training data
Generative Tagging	Unsupervised	Uses LLMs to generate tags from scratch	GPT, Claude, local LLMs; no predefined labels needed
Semantic Similarity Matching	Unsupervised	Matches text to tags using embedding similarity	Uses sentence transformers; requires tag vocabulary
~~Rule-Based Tagging~~	~~Unsupervised~~	~~Uses predefined patterns and rules~~	~~If text contains “python” or “javascript” → tag as “programming”~~
Clustering	Unsupervised	Groups similar documents, then manually labels clusters	K-means, DBSCAN, Hierarchical clustering
Graph-Based Methods	Unsupervised	Represents text as a graph (words as nodes, co-occurrence as edges)	TextRank for keyword extraction; LexRank for ranking sentences
Taxonomy Learning	Unsupervised	Automatically builds hierarchical tag structures	Discovers parent-child relationships between tags
Co-occurrence Analysis	Unsupervised	Finds words/phrases that frequently appear together	Useful for discovering tag relationships
Prompt Engineering	Unsupervised	Carefully crafted prompts to guide LLM tagging	No training but requires prompt iteration
Lexicon-Based Methods	Unsupervised	Uses predefined dictionaries/wordlists for categories	If text contains “goal”, “score”, “team” → sports
Named Entity Recognition (NER)	Supervised/Unsupervised	Identifies specific entities (people, places, organizations)	Can use pre-trained models (no labeling of your own needed) or custom-trained ones
Supervised Text Classification	Supervised	Traditional ML trained on labeled examples	Typically needs roughly 1000+ labeled examples; methods: Naive Bayes, SVM, Logistic Regression
Fine-tuned Transformers	Supervised	Neural models (BERT, RoBERTa) trained on your specific data	Often the best accuracy, but typically needs significant labeled data (roughly 5000+)
Multi-label Classification	Supervised	Assigns multiple tags per document simultaneously	Binary Relevance, Classifier Chains, Label Powerset
Transfer Learning	Supervised	Uses a pre-trained model and fine-tunes it on a small dataset	Can work with roughly 100-500 labeled examples
Attention-Based Tagging	Supervised	Neural network learns which parts of text matter for each tag	Modern approach; part of transformer models
Few-Shot Learning	Semi-supervised	Learns from just a few examples per category	Needs 5-50 examples per tag; uses models like SetFit
Active Learning	Semi-supervised	Iteratively asks a human to label the most informative examples	Can reduce labeling effort (roughly 50-80% as a rule of thumb)
Weak Supervision	Semi-supervised	Uses noisy/programmatic labels instead of manual labels	Tools: Snorkel; combines heuristics and patterns
Contrastive Learning	Semi-supervised	Learns by comparing similar vs dissimilar examples	Good when you have unlabeled data + some labels
Ensemble Methods	Either	Combines multiple tagging approaches	Vote/average results from different methods for better accuracy
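
As a small illustration of the last row, the sketch below combines three taggers by majority vote. The lexicon reuses the example words from the table; `pattern_tagger` and `stub_model_tagger` are hypothetical stand-ins for whichever methods are actually combined, and the `min_votes` threshold is an assumption.

```python
import re
from collections import Counter

# Word lists taken from the table's lexicon and rule-based examples.
LEXICON = {
    "sports": {"goal", "score", "team"},
    "programming": {"python", "javascript"},
}


def lexicon_tagger(text):
    """Tag a text if it contains any word from a category's word list."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {tag for tag, words in LEXICON.items() if tokens & words}


def pattern_tagger(text):
    """Rule-based stand-in: regex patterns instead of word lists."""
    rules = {"sports": r"\b(match|league|goal)\b", "finance": r"\b(stock|bank)\b"}
    return {tag for tag, pat in rules.items() if re.search(pat, text.lower())}


def stub_model_tagger(text):
    """Placeholder for a learned model; always returns the same guess."""
    return {"finance"}


def ensemble_tags(text, taggers, min_votes=2):
    """Keep tags proposed by at least min_votes taggers (one vote per tagger)."""
    votes = Counter(tag for tagger in taggers for tag in tagger(text))
    return sorted(tag for tag, n in votes.items() if n >= min_votes)


taggers = [lexicon_tagger, pattern_tagger, stub_model_tagger]
text = "The team scored a late goal in the league match."
print(ensemble_tags(text, taggers))  # ['sports']
```

Because each tagger returns a set, no single method can vote twice for the same tag; raising `min_votes` trades recall for precision.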
