How to represent a word?

  • Knowledge-based representation (e.g. WordNet)
    • Misses new meanings of words (e.g., wicked, badass, nifty, wizard)
    • Impossible to keep up to date (requires human labor to create and adapt)
    • Can’t compute accurate word similarity
  • One-hot representation
    • Words are represented as one-hot-encodings (lots of zeros!)
    • Limitations:
      • High-dimensional and sparse, making storage and computation inefficient
      • No contextual information
      • No quantification of similarity (any two distinct one-hot vectors are orthogonal)
      • Cannot handle out-of-vocabulary (unseen) words
  • Co-occurrence matrix
  • Low-dimensional dense word vectors (dimensionality reduction on the co-occurrence matrix)
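
A minimal sketch of one-hot encoding, using a hypothetical toy vocabulary, to make the limitations above concrete:

```python
import numpy as np

# Toy vocabulary (hypothetical) to illustrate one-hot encoding
vocab = ["the", "cat", "sat", "on", "mat"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec
```

Note that the dot product of any two distinct one-hot vectors is 0, so the representation carries no similarity signal at all.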

    Similarity Metrics

  • Cosine similarity
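
    Cosine similarity compares the directions of two vectors regardless of their magnitudes; a minimal NumPy sketch:

    ```python
    import numpy as np

    def cosine_similarity(u, v):
        # cos(u, v) = (u . v) / (||u|| * ||v||)
        # 1 = same direction, 0 = orthogonal, -1 = opposite
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    ```

    For example, parallel vectors such as [1, 2] and [2, 4] score 1.0 even though their lengths differ.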

    How to represent documents?

  • TF-IDF: Documents can be represented as vectors
    • doc1 = [0 0 3 0 2 0 0 0 0 0 5 0 0 1 0]
    • doc2 = [0 1 0 0 0 2 0 0 4 0 0 1 1 0 0]

      TF-IDF

      Term Frequency (TF) of word $w_i$ in document $d_i$:

$$TF(w_i, d_i) = \frac{\text{Count of } w_i \text{ in } d_i}{\text{Total number of words in } d_i}$$

$$IDF(w_i, D) = \log\left(\frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing word } w_i}\right)$$

$$\text{TF-IDF} = TF \cdot IDF$$

Note: the Document Frequency (DF) in the IDF denominator is often smoothed (e.g., by adding 1) to avoid division by zero for unseen words and to temper extreme IDF values.

Example

  • Doc 1: “the cat sat on the mat”
  • Doc 2: “the dog sat on the log”
  • Doc 3: “cats and dogs are animals”

$$TF(\text{cat}, \text{Doc 1}) = 1/6 = 0.167$$

$$IDF(\text{cat}) = \log(3/1) = 1.099$$

$$\text{TF-IDF}(\text{cat}, \text{Doc 1}) = TF \cdot IDF = 0.167 \times 1.099 = 0.183$$
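
The worked example above can be reproduced with a short sketch (natural log, matching the numbers; note that "cats" in Doc 3 does not match "cat", so DF(cat) = 1):

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are animals".split(),
]

def tf(word, doc):
    # fraction of tokens in the document equal to `word`
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log of (corpus size / number of documents containing the word)
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)
```

Here `tf_idf("cat", docs[0], docs)` gives (1/6) x ln(3) ≈ 0.183, as in the example.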

Representing with a Co-occurrence Matrix

There are two possible ways:

Window-based (count co-occurrences within a window around each word)

Example with window size 1:

Count     I   like  enjoy  deep  learning
I         0   2     1      0     0
like      2   0     0      1     0
enjoy     1   0     0      0     0
deep      0   1     0      0     1
learning  0   0     0      1     0
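
A minimal sketch of window-based counting. The corpus below is an assumption (the classic three-sentence toy corpus often used with this table); restricted to the five words above, it reproduces the table's counts:

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count (word, context) pairs within +/- `window` positions."""
    counts = defaultdict(int)
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, sent[j])] += 1
    return counts

# Toy corpus (assumption) consistent with the counts in the table above
corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]
counts = cooccurrence_counts(corpus, window=1)
```

For instance, "I" and "like" are adjacent in two sentences, so `counts[("I", "like")]` is 2, matching the table.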

Problems:

  • Sparsity issues
  • High dimensional, size increases with vocabulary (requires more storage)

Possible solution: Singular Value Decomposition
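
A sketch of how SVD turns the sparse count matrix above into low-dimensional dense vectors: factor X = U S V^T and keep only the top-k singular directions:

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning"]
# Window-1 co-occurrence matrix from the example above
X = np.array([
    [0, 2, 1, 0, 0],
    [2, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Full SVD, then truncate: keep the top-k singular directions
U, S, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k] * S[:k]  # one dense k-dimensional vector per word
```

Each row of `embeddings` is a 2-dimensional dense vector for the corresponding word, replacing its sparse 5-dimensional count row.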

Word-document co-occurrence matrix (count words over entire documents rather than local windows)
  • Word2vec (learns dense word embeddings from raw text)
    • Two variants:
      • Skip-gram (SG): predict context (“outside”) words given the center word
      • Continuous Bag of Words (CBOW): predict the center word from the context words
  • GloVe
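
A sketch of how skip-gram framing generates its training data: each word in turn is the center, and every word within the window becomes a (center, context) prediction target. (The pair generation shown here is the same for CBOW; only the prediction direction differs.)

```python
def skipgram_pairs(sentence, window=2):
    """(center, context) training pairs: skip-gram predicts each context word from its center."""
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs
```

For example, with window 1 the sentence "the cat sat" yields the pairs (the, cat), (cat, the), (cat, sat), (sat, cat).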