1. Keyword Extraction
    1. TF
    2. TF-IDF
    3. RAKE (rake-keyword, python-rake)
    4. YAKE (yake)
    5. KeyBERT (keybert)
    6. spaCy
    7. Spark NLP
    8. TextRank
2. Topic Modeling: needs a large collection of text documents to work well, so it is not well suited to extracting topics from single sentences, paragraphs, or tweets.
    1. Latent Dirichlet Allocation (LDA)
    2. BERTopic
    3. Non-Negative Matrix Factorization (NMF)
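
To make the TF-IDF entry above concrete, here is a minimal standard-library sketch. It assumes a crude regex tokenizer and the plain, unsmoothed `log(N / df)` weighting; the function names (`tokenize`, `tfidf_keywords`) and the sample documents are made up for illustration. A real project would more likely use a library implementation, which may weight terms differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of ASCII letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        if not tokens:
            results.append([])
            continue
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Illustrative mini-corpus (hypothetical data).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell sharply",
]
keywords = tfidf_keywords(docs, top_k=2)
```

Because "the" occurs in every document, its IDF is zero and it never ranks as a keyword. This also shows why the score depends on the surrounding corpus: with a single short text there is nothing to compare against.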

Zero-shot classification is also an option; however, it requires a predefined list of candidate keywords or tags.
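
A rough sketch of how zero-shot tagging could look with the Hugging Face `transformers` pipeline. The model name `facebook/bart-large-mnli`, the 0.5 threshold, and the helper names (`select_tags`, `tag_text`) are assumptions chosen for illustration, not recommendations from these notes. The import is done lazily so the threshold logic can be tried without installing the library or downloading a model.

```python
def select_tags(result, threshold=0.5):
    """Keep labels whose score meets the threshold.

    `result` is expected to have the shape returned by the zero-shot
    pipeline: a dict with parallel "labels" and "scores" lists.
    """
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


def tag_text(text, candidate_labels, threshold=0.5):
    """Score every candidate label against the text and keep the confident ones."""
    # Lazy import: requires `pip install transformers` plus a backend such as PyTorch.
    from transformers import pipeline

    # Assumed model choice; any NLI model supported by the pipeline could be swapped in.
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    # multi_label=True scores each label independently, which suits tagging.
    result = classifier(text, candidate_labels=candidate_labels, multi_label=True)
    return select_tags(result, threshold)


# Hand-written dict that only mimics the pipeline's output shape (not real model output).
example_result = {
    "sequence": "How do I speed up a pandas groupby?",
    "labels": ["programming", "data analysis", "sports"],
    "scores": [0.93, 0.71, 0.02],
}
example_tags = select_tags(example_result, threshold=0.5)
```

The candidate labels are the "predefined list" mentioned above: the model only ranks tags you supply and cannot invent new ones.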

Alternative approach: use LLMs (possibly running locally). LLMs may understand context better than the libraries above.
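
One way the LLM route might be wired up, independent of any particular provider. Everything here is a hypothetical sketch: `call_llm` stands for whatever function sends a prompt to your model (hosted or local) and returns its text reply, and the prompt wording and JSON-array reply format are assumptions that would need iteration in practice.

```python
import json


def build_tagging_prompt(text, max_tags=5):
    """Compose a prompt asking for tags as a JSON array (wording is an assumption)."""
    return (
        f"Suggest up to {max_tags} short topic tags for the text below. "
        "Reply with a JSON array of lowercase strings and nothing else.\n\n"
        f"Text: {text}"
    )


def parse_tags(reply, max_tags=5):
    """Pull a list of tags out of the reply; return [] if no valid JSON array is found."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(reply[start : end + 1])
    except json.JSONDecodeError:
        return []
    tags = []
    for item in data:
        if isinstance(item, str):
            tag = item.strip().lower()
            # Skip empty strings and duplicates after normalization.
            if tag and tag not in tags:
                tags.append(tag)
    return tags[:max_tags]


def generate_tags(text, call_llm, max_tags=5):
    """`call_llm` is a hypothetical callable: prompt string in, reply string out."""
    return parse_tags(call_llm(build_tagging_prompt(text, max_tags)), max_tags)


def fake_llm(prompt):
    """Stand-in for a real model so the sketch runs offline."""
    return 'Here are the tags: ["Machine Learning", "nlp", "NLP "]'


tags = generate_tags("Transformers changed how we process language.", fake_llm)
```

Parsing defensively matters because models often wrap the answer in extra prose or repeat tags with different casing, as the stub reply does.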

List of approaches:

| Method | Supervised/Unsupervised | Description | Notes |
| --- | --- | --- | --- |
| Keyword Extraction | Unsupervised | Extracts important words/phrases directly from text | Algorithms: RAKE, YAKE, KeyBERT, TF-IDF |
| Topic Modeling | Unsupervised | Discovers hidden themes across document collections | Algorithms: LDA, NMF, BERTopic |
| Zero-Shot Classification | Unsupervised | Classifies text into predefined categories without training | Requires candidate labels but no training data |
| Generative Tagging | Unsupervised | Uses LLMs to generate tags from scratch | GPT, Claude, local LLMs; no predefined labels needed |
| Semantic Similarity Matching | Unsupervised | Matches text to tags using embedding similarity | Uses sentence transformers; requires a tag vocabulary |
| Rule-Based Tagging | Unsupervised | Uses predefined patterns and rules | If text contains “python” or “javascript” → tag as “programming” |
| Clustering | Unsupervised | Groups similar documents; clusters are then labeled manually | K-means, DBSCAN, hierarchical clustering |
| Graph-Based Methods | Unsupervised | Represents text as a graph (words as nodes, co-occurrence as edges) | TextRank, LexRank for keyword extraction |
| Taxonomy Learning | Unsupervised | Automatically builds hierarchical tag structures | Discovers parent-child relationships between tags |
| Co-occurrence Analysis | Unsupervised | Finds words/phrases that frequently appear together | Useful for discovering tag relationships |
| Prompt Engineering | Unsupervised | Carefully crafted prompts to guide LLM tagging | No training, but requires prompt iteration |
| Lexicon-Based Methods | Unsupervised | Uses predefined dictionaries/wordlists for categories | If text contains “goal”, “score”, “team” → sports |
| Named Entity Recognition (NER) | Supervised/Unsupervised | Identifies specific entities (people, places, organizations) | Can use pre-trained models (no labeling needed on your side) or custom-trained ones |
| Supervised Text Classification | Supervised | Traditional ML trained on labeled examples | Typically requires 1000+ labeled examples; methods: Naive Bayes, SVM, Logistic Regression |
| Fine-tuned Transformers | Supervised | Neural models (BERT, RoBERTa) trained on your specific data | Often the best accuracy, but typically requires significant labeled data (5000+) |
| Multi-label Classification | Supervised | Assigns multiple tags per document simultaneously | Binary Relevance, Classifier Chains, Label Powerset |
| Transfer Learning | Supervised | Uses a pre-trained model and fine-tunes it on a small dataset | May need only 100-500 labeled examples |
| Attention-Based Tagging | Supervised | Neural network learns which parts of the text matter for each tag | Modern approach; part of transformer models |
| Few-Shot Learning | Semi-supervised | Learns from just a few examples per category | Needs roughly 5-50 examples per tag; uses models like SetFit |
| Active Learning | Semi-supervised | Iteratively asks a human to label the most informative examples | Can reduce labeling effort, by an estimated 50-80% |
| Weak Supervision | Semi-supervised | Uses noisy/programmatic labels instead of manual labels | Tools: Snorkel; combines heuristics and patterns |
| Contrastive Learning | Semi-supervised | Learns by comparing similar vs. dissimilar examples | Good when you have unlabeled data plus some labels |
| Ensemble Methods | Either | Combines multiple tagging approaches | Vote/average results from different methods for better accuracy |
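
As a small illustration of the rule-based and lexicon-based rows, the sketch below reuses the two examples from the table ("python"/"javascript" → programming; "goal"/"score"/"team" → sports). The `LEXICON` structure, the `lexicon_tags` function, and the `min_hits` parameter are hypothetical names invented for this example.

```python
import re

# Tag -> trigger words, taken from the examples in the table above.
LEXICON = {
    "programming": {"python", "javascript"},
    "sports": {"goal", "score", "team"},
}


def lexicon_tags(text, lexicon=LEXICON, min_hits=1):
    """Return every tag whose word list overlaps the text in at least min_hits words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        tag for tag, terms in lexicon.items() if len(words & terms) >= min_hits
    )


sports_only = lexicon_tags("The team needed one more goal to win.")
both = lexicon_tags("I wrote a Python script to track my team's score.")
strict = lexicon_tags("I wrote a Python script to track my team's score.", min_hits=2)
```

Raising `min_hits` trades recall for precision: a single ambiguous word such as "score" no longer triggers a tag on its own.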