How to represent a word?
- Knowledge-based representation (e.g. WordNet)
- Might miss new meanings of words (e.g., wicked, badass, nifty, wizard, etc.)
- Impossible to keep up-to-date (requires human labor to create and adapt)
- Can’t compute accurate word similarity
- One-hot representation
- Words are represented as one-hot encodings (lots of zeros!)
- Limitations: high dimensionality (vector size equals vocabulary size); sparsity, leading to inefficient storage and computation; no contextual information; no quantification of similarity (every pair of distinct words is orthogonal); cannot handle out-of-vocabulary (unseen) words
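A minimal sketch of one-hot encoding over a tiny, made-up vocabulary (the vocabulary and words here are illustrative only):

```python
# Hypothetical toy vocabulary for illustration.
vocab = ["cat", "dog", "mat", "sat"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0, 0]
```

Note that a word outside `vocab` raises a `KeyError`, illustrating the out-of-vocabulary limitation.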
- Co-occurrence matrix
- Low-dimensional dense word vectors (dimensionality reduction on the co-occurrence matrix)
Similarity Metrics
- Cosine similarity
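Cosine similarity is the dot product of two vectors divided by the product of their norms, $\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$. A minimal sketch:

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [1, 0]))  # 1.0 (identical direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
```

On one-hot vectors, any two distinct words score 0, which is why one-hot representations cannot quantify similarity.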
How to represent documents?
- TF-IDF: Documents can be represented as vectors
- doc1 = [0 0 3 0 2 0 0 0 0 0 5 0 0 1 0]
- doc2 = [0 1 0 0 0 2 0 0 4 0 0 1 1 0 0]
TF-IDF
Term Frequency (TF) of word $w_i$ in document $d_j$:
$$\mathrm{tf}(w_i, d_j) = \frac{\text{count of } w_i \text{ in } d_j}{\text{total number of terms in } d_j}$$
Inverse Document Frequency (IDF), with $N$ documents and $\mathrm{df}(w_i)$ the number of documents containing $w_i$:
$$\mathrm{idf}(w_i) = \log \frac{N}{\mathrm{df}(w_i)}$$
The TF-IDF score is the product: $\mathrm{tfidf}(w_i, d_j) = \mathrm{tf}(w_i, d_j) \cdot \mathrm{idf}(w_i)$.
Note: Document Frequency (DF) can be adjusted using smoothing (e.g., $\log \frac{N}{1 + \mathrm{df}(w_i)}$) if its value is too small, avoiding division by zero.
Example
- Doc 1: “the cat sat on the mat”
- Doc 2: “the dog sat on the log”
- Doc 3: “cats and dogs are animals”
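A minimal sketch computing TF-IDF for the three documents above, using the unsmoothed definitions $\mathrm{tf} = \text{count}/\text{length}$ and $\mathrm{idf} = \log(N/\mathrm{df})$:

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf(word, doc_tokens):
    """Fraction of the document's terms that are `word`."""
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word):
    """log(N / df); df > 0 for any word that occurs in the corpus."""
    df = sum(1 for toks in tokenized if word in toks)
    return math.log(N / df)

def tfidf(word, doc_tokens):
    return tf(word, doc_tokens) * idf(word)

# "the" appears in 2 of 3 docs, so its IDF is low; "cat" is rarer.
print(round(tfidf("the", tokenized[0]), 3))
print(round(tfidf("cat", tokenized[0]), 3))
```

"cat" scores higher than the frequent "the", showing how IDF down-weights common words.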
Representing with a Co-occurrence Matrix
There are two possible ways:
Window-based: count co-occurrences of word pairs within a fixed-size window around each word
Example with window size 1:
| Count | I | like | enjoy | deep | learning |
|---|---|---|---|---|---|
| I | 0 | 2 | 1 | 0 | 0 |
| like | 2 | 0 | 0 | 1 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 |
| deep | 0 | 1 | 0 | 0 | 1 |
| learning | 0 | 0 | 0 | 1 | 0 |
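A counting matrix like the one above can be built with a few lines of Python. The toy corpus below is a hypothetical reconstruction consistent with the table's counts, not taken from the source:

```python
from collections import defaultdict

# Hypothetical toy corpus (assumed; chosen to match counts like I-like = 2).
corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]
window = 1  # window size 1, as in the table above

counts = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the word itself
                counts[(word, sentence[j])] += 1

print(counts[("I", "like")])        # 2
print(counts[("deep", "learning")]) # 1
```

The matrix is symmetric because co-occurring within the window is a mutual relation.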
Problems:
- Sparsity issues
- High dimensional, size increases with vocabulary (requires more storage)
Possible solution: Singular Value Decomposition
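A minimal sketch of the SVD step, assuming NumPy is available: factor the co-occurrence matrix from the table above and keep only the top-$k$ singular values to obtain dense $k$-dimensional word vectors.

```python
import numpy as np

# Co-occurrence matrix from the table (rows/cols: I, like, enjoy, deep, learning).
X = np.array([
    [0, 2, 1, 0, 0],
    [2, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Full SVD: X = U diag(S) Vt.
U, S, Vt = np.linalg.svd(X)

# Truncate to the top-k components for dense, low-dimensional word vectors.
k = 2
word_vectors = U[:, :k] * S[:k]  # one 2-d embedding per row (word)

print(word_vectors.shape)  # (5, 2)
```

The choice of `k` trades off compactness against reconstruction error.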
Word-document co-occurrence matrix (the second option): rows are words, columns are documents; entry $(i, j)$ counts how often word $i$ appears in document $j$
- Word2vec (Word Embedding and Word2Vec)
- Two variants:
- Skip-grams (SG): predict the context (“outside”) words given the center word
- Continuous Bag of Words (CBOW): predict the center word from the context words
- GloVe
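As a sketch of what skip-gram trains on: each sentence is turned into (center, context) pairs within a window. The sentence below is a made-up example; real training would then learn embeddings that predict context from center.

```python
# Generate skip-gram (center, context) training pairs from one toy sentence.
sentence = "I like deep learning".split()
window = 1

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs)
# [('I', 'like'), ('like', 'I'), ('like', 'deep'),
#  ('deep', 'like'), ('deep', 'learning'), ('learning', 'deep')]
```

CBOW uses the same windows but flips the direction: the context words jointly predict the center word.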