3. Word Vectors

How to represent a word?

Knowledge-based representation (e.g. WordNet)
- Might miss nuance (e.g., “proficient” is listed as a synonym for “good,” which is correct only in some contexts)
- Might miss new meanings of words (e.g., wicked, badass, nifty, wizard, etc.)
- Impossible to keep up-to-date (requires human labor to create and adapt)
- Can’t compute accurate word similarity
- Example: WordNet, a dictionary of words with synonyms and hypernyms (“is a” relationships)
One-hot representation
- Words are represented as one-hot-encodings (lots of zeros!)
- Limitations: High dimensional, sparsity leading to inefficiency in storage and computation, lack of contextual information, no quantification of similarity, cannot handle out-of-vocabulary or unseen words
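
To make the "no quantification of similarity" limitation concrete, here is a minimal sketch of one-hot encoding. The three-word vocabulary and the `one_hot` helper are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary, assumed for illustration only.
vocab = ["hotel", "motel", "cat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1  # raises KeyError for out-of-vocabulary words
    return vec

hotel, motel = one_hot("hotel"), one_hot("motel")
print(hotel)               # [1 0 0]
print(int(hotel @ motel))  # 0: two distinct words never overlap
```

The dot product of any two different one-hot vectors is 0, so "hotel" looks exactly as unrelated to "motel" as it does to "cat". The vector length also grows with the vocabulary size.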
Co-occurrence matrix
Low-dimensional dense word vector (dimensionality reduction on the co-occurrence matrix)

Similarity Metrics

Cosine similarity

$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}$

where the numerator is the dot product.
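
The formula can be sketched in a few lines of NumPy. The function name `cosine_similarity` and the choice to return 0 for a zero vector are assumptions made for this example, not part of the definition.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # assumption: treat similarity with a zero vector as 0
    return float(a @ b / denom)

# Vectors pointing the same way score about 1; orthogonal vectors score 0.
print(round(cosine_similarity([1, 2, 3], [2, 4, 6]), 6))
print(cosine_similarity([1, 0], [0, 1]))
```

Because the norms are divided out, only the direction of the vectors matters, not their length.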

How to represent documents?

TF-IDF: Documents can be represented as vectors
- doc1 = [0 0 3 0 2 0 0 0 0 0 5 0 0 1 0]
- doc2 = [0 1 0 0 0 2 0 0 4 0 0 1 1 0 0]

TF-IDF

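Term Frequency (TF) of word $w_{i}$ in document $d_{j}$:

$TF(w_{i}, d_{j}) = \frac{\text{Count of } w_{i} \text{ in } d_{j}}{\text{Total number of words in } d_{j}}$

Inverse Document Frequency (IDF) of word $w_{i}$ in corpus $D$:

$IDF(w_{i}, D) = \log\left(\frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing word } w_{i}}\right)$

$TF\text{-}IDF(w_{i}, d_{j}, D) = TF(w_{i}, d_{j}) \cdot IDF(w_{i}, D)$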

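Note: Document Frequency (DF), the number of documents containing the word, can be adjusted using smoothing (for example, by adding 1) if its value is too small, which also avoids division by zero.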

Example

Doc 1: “the cat sat on the mat”
Doc 2: “the dog sat on the log”
Doc 3: “cats and dogs are animals”

$TF(\text{cat}, \text{Doc 1}) = 1/6 = 0.167$

$IDF(\text{cat}) = \log(3/1) = 1.099$

$TF\text{-}IDF(\text{cat}, \text{Doc 1}) = TF \cdot IDF = 0.167 \times 1.099 = 0.183$
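
The worked example can be reproduced with a short script. This is a sketch under two assumptions: tokens are split on whitespace with no stemming (so "cat" and "cats" count as different words), and the logarithm is the natural log, which is what yields 1.099. The helper names `tf`, `idf`, and `tf_idf` are illustrative.

```python
import math

docs = {
    "Doc 1": "the cat sat on the mat",
    "Doc 2": "the dog sat on the log",
    "Doc 3": "cats and dogs are animals",
}
# Assumption: whitespace tokenization, no stemming or lowercasing needed here.
tokens = {name: text.split() for name, text in docs.items()}

def tf(word, doc_name):
    """Count of the word in the document divided by the document length."""
    words = tokens[doc_name]
    return words.count(word) / len(words)

def idf(word):
    """Natural log of (number of documents / documents containing the word)."""
    n_containing = sum(1 for words in tokens.values() if word in words)
    return math.log(len(tokens) / n_containing)  # unsmoothed: fails if n_containing is 0

def tf_idf(word, doc_name):
    return tf(word, doc_name) * idf(word)

print(round(tf("cat", "Doc 1"), 3))      # 0.167
print(round(idf("cat"), 3))              # 1.099
print(round(tf_idf("cat", "Doc 1"), 3))  # 0.183
```

A word such as "the", which appears in two of the three documents, gets a lower IDF of log(3/2), so its TF-IDF stays small even though it occurs twice in Doc 1.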

Where is TF-IDF used?

Search Engine Ranking: Early search engines used TF-IDF to find the most relevant page for a query. If you search “Blueberry Muffin Recipe,” the engine looks for pages where those three words have the highest TF-IDF scores, ensuring you get recipes rather than just a page that happens to mention “the” or “and” a lot.
Keyword Extraction: A tool that generates “Tags” for a long uploaded document may well be using TF-IDF. It calculates the score for every word and picks the top 5 or 10. Words with high TF (frequent in your document) and high IDF (rare across the rest of the corpus) make the best tags.
Document Clustering: News apps like Google News group similar stories together, and TF-IDF vectors are one way to do this. By comparing the vectors of hundreds of articles, a system can see that 50 different articles all have high TF-IDF scores for “Earthquake” and “Tokyo,” so it groups them into a single “Breaking News” cluster.
Recommendation Engines: If you read an article about “Quantum Physics,” a site can recommend other articles by finding the documents whose TF-IDF vectors have the highest Cosine Similarity to the one you just read.
Spam Filtering: Old-school spam filters look for words that are rare in normal emails but common in junk (high TF-IDF words like “Lottery,” “Inheritance,” or “Winner”). If an incoming email’s vector aligns too closely with the “Spam Vector,” it gets filtered out.

Brain teaser: What if you get a DF score of 1? What does this mean?

Representing with a Co-occurrence Matrix

There are two possible ways:

Window-based co-occurrence matrix (count the words that fall within a window around each word)

Example with window size 1:

Count	I	like	enjoy	deep	learning
I	0	2	1	0	0
like	2	0	0	1	0
enjoy	1	0	0	0	0
deep	0	1	0	0	1
learning	0	0	0	1	0
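
The counting procedure behind such a table can be sketched as follows. The notes do not state the corpus, so the three sentences below are an assumption, chosen so that the counts among the five listed words match the table. `cooccurrence_counts` is an illustrative helper name.

```python
from collections import defaultdict

# Assumed toy corpus (not given in the notes), chosen so that the counts
# among "I", "like", "enjoy", "deep", "learning" match the table above.
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]

def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(int)
    for sentence in sentences:
        words = sentence.split()
        for i, center in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own neighbor
                    counts[(center, words[j])] += 1
    return counts

counts = cooccurrence_counts(corpus, window=1)
vocab = ["I", "like", "enjoy", "deep", "learning"]
matrix = [[counts[(row, col)] for col in vocab] for row in vocab]
for word, row in zip(vocab, matrix):
    print(word, row)
```

The matrix is symmetric because every pair is counted once from each side. Restricting `vocab` to five words only keeps the printout small; the full matrix would have one row and one column per vocabulary word.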

Problems:

Sparsity issues
High dimensional, size increases with vocabulary (requires more storage)

Possible solution: Singular Value Decomposition
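
As a rough sketch of this idea, the co-occurrence table above can be factorized with NumPy's SVD and truncated to the top `k` singular directions. The value `k = 2` and the choice to scale the left singular vectors by the singular values are assumptions made for illustration; other conventions exist.

```python
import numpy as np

# Co-occurrence counts copied from the window-size-1 table above.
words = ["I", "like", "enjoy", "deep", "learning"]
X = np.array([
    [0, 2, 1, 0, 0],
    [2, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# X = U @ diag(S) @ Vt, with singular values in S sorted in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # assumed target dimensionality, chosen only for illustration
word_vectors = U[:, :k] * S[:k]  # one dense k-dimensional row per word

for word, vec in zip(words, word_vectors):
    print(word, np.round(vec, 3))
```

Each word is now described by `k` numbers instead of one count per vocabulary word, which addresses both the sparsity and the dimensionality problems at the cost of an approximation.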

Word-document co-occurrence matrix

You create a matrix where rows are words and columns are documents. Each entry records either a “1” if the word appears in that document or a count of how many times it appears. This captures general topics. For example, “touchdown,” “quarterback,” and “stadium” will all have high counts in the same sports articles, even if they aren’t right next to each other in a sentence. This leads to Latent Semantic Analysis (LSA), which helps a computer realize that two documents are about “Sports” even if they use slightly different vocabulary.
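
A minimal sketch of building such a matrix is shown below. The three mini-documents and their names are hypothetical, made up only to show the layout of rows (words) and columns (documents).

```python
from collections import Counter

# Hypothetical mini-documents, assumed for illustration only.
documents = {
    "sports_1": "quarterback throws touchdown in packed stadium",
    "sports_2": "stadium crowd cheers late touchdown",
    "cooking_1": "bake the blueberry muffin recipe",
}

doc_names = list(documents)
doc_counts = {name: Counter(text.split()) for name, text in documents.items()}
vocab = sorted({word for counts in doc_counts.values() for word in counts})

# Rows are words, columns are documents, entries are raw counts.
matrix = {word: [doc_counts[name][word] for name in doc_names] for word in vocab}

print(doc_names)
print("touchdown", matrix["touchdown"])  # [1, 1, 0]
print("stadium", matrix["stadium"])      # [1, 1, 0]
print("muffin", matrix["muffin"])        # [0, 0, 1]
```

"touchdown" and "stadium" end up with identical rows because they occur in the same documents, even though they are never adjacent in a sentence. This is the kind of topical signal that LSA builds on.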

Word Vector (Embeddings)

Word2vec (Word Embedding and Word2Vec)
- Two variants:
  - Skip-grams (SG): predict context (“outside”) words given the center word
  - Continuous Bag of Words (CBOW): predict the center word from its context words
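
The difference between the two variants shows up in how training examples are formed from a sentence. The sketch below only builds those examples; it does not include the neural network or the training objective. The helper names and the window size of 1 are assumptions for illustration.

```python
def skipgram_pairs(words, window=1):
    """(center, context) pairs: the center word is the input, a context word the target."""
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

def cbow_examples(words, window=1):
    """(context_words, center) examples: the context is the input, the center the target."""
    examples = []
    for i, center in enumerate(words):
        context = [
            words[j]
            for j in range(max(0, i - window), min(len(words), i + window + 1))
            if j != i
        ]
        examples.append((context, center))
    return examples

sentence = "I like deep learning".split()
print(skipgram_pairs(sentence))
print(cbow_examples(sentence))
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position, with all context words grouped together as the input.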

GloVe
