How to represent a word?
- Knowledge-based representation (e.g. WordNet)
  - Might miss nuance (e.g. "proficient" is listed as a synonym for "good", which is only correct in some contexts)
  - Might miss new meanings of words (e.g., wicked, badass, nifty, wizard)
  - Impossible to keep up to date (requires human labor to create and adapt)
  - Can't compute accurate word similarity
  - Example: WordNet, a dictionary of words with synonyms and hypernyms ("is a" relationships)
- One-hot representation
  - Words are represented as one-hot encodings (lots of zeros!)
  - Limitations: high dimensional, sparsity leading to inefficiency in storage and computation, lack of contextual information, no quantification of similarity, cannot handle out-of-vocabulary or unseen words
- Co-occurrence matrix
  - Low-dimensional dense word vectors (dimensionality reduction on the co-occurrence matrix)
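To make the one-hot limitations concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the helper names `one_hot` and `dot` are assumptions made up for illustration. It shows that any two different words have a dot product of zero, so one-hot vectors carry no notion of similarity.

```python
def one_hot(word, vocab):
    """Return a one-hot list for `word` over the ordered vocabulary `vocab`."""
    index = vocab.index(word)  # raises ValueError for out-of-vocabulary words
    return [1 if i == index else 0 for i in range(len(vocab))]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


vocab = ["cat", "dog", "mat", "sat"]  # hypothetical toy vocabulary

cat = one_hot("cat", vocab)
dog = one_hot("dog", vocab)

print(cat)            # [1, 0, 0, 0]
print(dot(cat, dog))  # 0: different words never overlap
print(dot(cat, cat))  # 1: a word only matches itself
```

The vector length equals the vocabulary size, which is why the representation becomes high dimensional and sparse for realistic vocabularies.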
Similarity Metrics
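- Cosine similarity:

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

where the numerator is the dot product and the denominator is the product of the two vectors' lengths.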
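A minimal sketch of cosine similarity in plain Python follows. The function name and the choice to return 0.0 for an all-zero vector are assumptions for illustration, not a fixed convention.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1, orthogonal vectors score 0.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```

Because the result depends only on direction, a long document and a short document with the same word proportions are treated as similar.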
How to represent documents?
- TF-IDF: documents can be represented as vectors
  - doc1 = [0 0 3 0 2 0 0 0 0 0 5 0 0 1 0]
  - doc2 = [0 1 0 0 0 2 0 0 4 0 0 1 1 0 0]
TF-IDF
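Term Frequency (TF) of word $t$ in document $d$ (a common definition):

$$\mathrm{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of words in } d}$$

Inverse Document Frequency (IDF) of word $t$, where $N$ is the number of documents and $\mathrm{DF}(t)$ is the number of documents containing $t$:

$$\mathrm{IDF}(t) = \log \frac{N}{\mathrm{DF}(t)}$$

$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

Note: Document Frequency (DF) can be adjusted using smoothing if its value is too small.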
Example
- Doc 1: "the cat sat on the mat"
- Doc 2: "the dog sat on the log"
- Doc 3: "cats and dogs are animals"
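The three example documents can be scored with a short sketch. It assumes one common variant of the formulas: TF is the word count divided by document length, and IDF is the natural log of (number of documents / document frequency), with no smoothing. Other variants give different numbers. Tokenization is a plain whitespace split, so "cat" and "cats" count as different words.

```python
import math
from collections import Counter

docs = {
    "Doc 1": "the cat sat on the mat",
    "Doc 2": "the dog sat on the log",
    "Doc 3": "cats and dogs are animals",
}

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter()
for tokens in tokenized.values():
    df.update(set(tokens))


def tf_idf(term, tokens):
    """TF-IDF of `term` in one tokenized document (unsmoothed variant)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / df[term])
    return tf * idf


for name, tokens in tokenized.items():
    scores = {t: round(tf_idf(t, tokens), 3) for t in sorted(set(tokens))}
    print(name, scores)
```

In Doc 1, "the" appears twice but is shared with Doc 2, so it scores about 0.135, while "cat" appears once but only in Doc 1 and scores about 0.183.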
Where is TF-IDF used?
- Search Engine Ranking: Early search engines used TF-IDF to find the most relevant page for a query. If you search "Blueberry Muffin Recipe," the engine looks for pages where those three words have the highest TF-IDF scores, ensuring you get recipes rather than just a page that happens to mention "the" or "and" a lot.
- Keyword Extraction: If you upload a long document to a tool that generates "Tags," it is likely using TF-IDF. It calculates the scores for every word and picks the top 5 or 10. Words with high TF (frequent in your doc) and high IDF (rare in the general world) make the best tags.
- Document Clustering: News apps like Google News group similar stories together using TF-IDF. By comparing the vectors of hundreds of articles, the system can see that 50 different articles all have high TF-IDF scores for "Earthquake" and "Tokyo," so it groups them into a single "Breaking News" cluster.
- Recommendation Engines: If you read an article about "Quantum Physics," a site can recommend other articles by finding the documents whose TF-IDF vectors have the highest Cosine Similarity to the one you just read.
- Spam Filtering: Old-school spam filters look for words that are rare in normal emails but common in junk (high TF-IDF words like "Lottery," "Inheritance," or "Winner"). If an incoming email's vector aligns too closely with the "Spam Vector," it gets filtered out.
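As an illustration of the search-ranking use case, the sketch below scores each document by summing the TF-IDF of the query words it contains. The three-document corpus, the function names, and the scoring rule (a plain sum, unsmoothed IDF) are all assumptions for illustration. Real search engines use more elaborate scoring.

```python
import math

# Hypothetical mini corpus, lowercased and whitespace-tokenized.
corpus = {
    "muffins": "blueberry muffin recipe with fresh blueberry topping",
    "pancakes": "the best pancake recipe and the best syrup",
    "news": "the weather and the news of the day",
}
tokens = {name: text.split() for name, text in corpus.items()}
n_docs = len(tokens)


def idf(term):
    """Unsmoothed IDF; words that appear nowhere contribute nothing."""
    df = sum(1 for doc_tokens in tokens.values() if term in doc_tokens)
    return math.log(n_docs / df) if df else 0.0


def score(query, doc_tokens):
    """Sum of TF-IDF values of the query words in one document."""
    return sum(
        (doc_tokens.count(term) / len(doc_tokens)) * idf(term)
        for term in query.split()
    )


def rank(query):
    """Document names ordered from most to least relevant."""
    return sorted(tokens, key=lambda name: score(query, tokens[name]), reverse=True)


print(rank("blueberry muffin recipe"))  # ['muffins', 'pancakes', 'news']
```

The "pancakes" page still matches on "recipe", but that word appears in two documents, so its IDF is low and the page ranks below the one that also contains the rarer words.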
Brain teaser: What if you get a DF score of 1? What does this mean?
Representing with a Co-occurrence Matrix
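There are two possible ways: a window-based matrix and a word-document matrix.
Window-based co-occurrence matrix (count the words that fall within a fixed-size window around each word)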
Example with window size 1:
| Count | I | like | enjoy | deep | learning |
|---|---|---|---|---|---|
| I | 0 | 2 | 1 | 0 | 0 |
| like | 2 | 0 | 0 | 1 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 |
| deep | 0 | 1 | 0 | 0 | 1 |
| learning | 0 | 0 | 0 | 1 | 0 |
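The counts in the table can be reproduced with a short sketch. The three-sentence corpus below is an assumption, chosen so that the counts for the five words shown come out the same. The function name `cooccurrence_counts` is also hypothetical.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Hypothetical corpus; only the five words from the table are printed.
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
counts = cooccurrence_counts(corpus, window=1)

words = ["I", "like", "enjoy", "deep", "learning"]
for row in words:
    print(row, [counts[(row, col)] for col in words])
```

The matrix is symmetric because the window looks the same distance to the left and to the right of each word.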
Problems:
- Sparsity issues
- High dimensional, size increases with vocabulary (requires more storage)
Possible solution: Singular Value Decomposition (SVD), keeping only the largest singular values to obtain low-dimensional dense word vectors
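A minimal sketch of the SVD step, assuming NumPy is available. It applies `np.linalg.svd` to the 5×5 count matrix from the table and keeps the top `k = 2` singular directions as word vectors. The value of `k` and the choice to scale the left singular vectors by the singular values are illustrative assumptions.

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning"]

# Window-size-1 co-occurrence counts copied from the table.
X = np.array(
    [
        [0, 2, 1, 0, 0],
        [2, 0, 0, 1, 0],
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0],
    ],
    dtype=float,
)

# X = U @ diag(S) @ Vt, with singular values in S sorted from largest to smallest.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # assumed target dimensionality
embeddings = U[:, :k] * S[:k]  # one k-dimensional dense vector per word

for word, vector in zip(words, embeddings):
    print(word, np.round(vector, 2))
```

Each word is now a dense vector of length `k` rather than a sparse row whose length grows with the vocabulary.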
Word-document co-occurrence matrix
You create a matrix where rows are words and columns are documents. You mark a "1" or a count every time a word appears in a specific file. It identifies general topics. For example, "touchdown," "quarterback," and "stadium" will all have high counts in the same sports articles, even if they aren't right next to each other in a sentence. This leads to Latent Semantic Analysis (LSA), which helps a computer realize that two documents are about "Sports" even if they use slightly different vocabulary.
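A word-document matrix can be sketched as follows. The three short documents and their names are assumptions made up for illustration. Each row is a word and each column is a document.

```python
from collections import Counter

# Hypothetical documents: two about sports, one about cooking.
documents = {
    "sports_1": "the quarterback threw a touchdown in the stadium",
    "sports_2": "fans filled the stadium to watch the quarterback",
    "cooking_1": "bake the muffin batter in the oven",
}

doc_names = list(documents)
counts = {name: Counter(text.split()) for name, text in documents.items()}
vocab = sorted({word for c in counts.values() for word in c})

# Row per word, one count per document (column order follows doc_names).
matrix = {word: [counts[name][word] for name in doc_names] for word in vocab}

for word in ["quarterback", "stadium", "touchdown", "oven"]:
    print(word, matrix[word])
# quarterback [1, 1, 0]
# stadium [1, 1, 0]
# touchdown [1, 0, 0]
# oven [0, 0, 1]
```

"quarterback" and "stadium" have identical rows even though they are never adjacent in a sentence, which is the topic-level signal that LSA builds on.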
Word Vector (Embeddings)
- Word2vec (Word Embedding and Word2Vec)
  - Two variants:
    - Skip-gram (SG): predict the context ("outside") words given the center word
    - Continuous Bag of Words (CBOW): predict the center word from its context words
- GloVe
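The two Word2vec variants differ in what they treat as input and what they treat as the prediction target. The sketch below only generates the training examples each variant would see. It does not train a model, and the function names and window size are assumptions for illustration.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is used to predict each context word."""
    return [
        (center, context_word)
        for i, center in enumerate(tokens)
        for context_word in context_of(tokens, i, window)
    ]


def cbow_examples(tokens, window=2):
    """(context_words, center) examples: the context is used to predict the center word."""
    return [(context_of(tokens, i, window), center) for i, center in enumerate(tokens)]


tokens = "the cat sat on the mat".split()

print(skipgram_pairs(tokens, window=1)[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_examples(tokens, window=1)[1])
# (['the', 'sat'], 'cat')
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position with all context words grouped together.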