How to represent a word?

  • Knowledge-based representation (e.g. WordNet)
    • Misses new meanings of words (e.g., wicked, badass, nifty, wizard)
    • Impossible to keep up to date (requires human labor to create and adapt)
    • Can’t compute accurate word similarity
  • One-hot representation
    • Words are represented as one-hot-encodings (lots of zeros!)
    • Limitations:
      • High-dimensional and sparse, making storage and computation inefficient
      • No contextual information
      • No quantification of similarity (any two distinct one-hot vectors are orthogonal)
      • Cannot handle out-of-vocabulary (unseen) words
  • Co-occurrence matrix
  • Low-dimensional dense word vectors (dimensionality reduction on the co-occurrence matrix)
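
A minimal sketch of one-hot encoding, using a hypothetical toy vocabulary, to make the limitations above concrete:

```python
import numpy as np

# Toy vocabulary (hypothetical) to illustrate one-hot encoding
vocab = ["the", "cat", "sat", "on", "mat"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec
```

Note that the dot product of any two distinct one-hot vectors is 0, so the representation carries no similarity signal at all.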

    Similarity Metrics

  • Cosine similarity
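
    Cosine similarity compares the directions of two vectors regardless of their magnitudes; a minimal NumPy sketch:

    ```python
    import numpy as np

    def cosine_similarity(u, v):
        # cos(u, v) = (u . v) / (||u|| * ||v||)
        # 1 = same direction, 0 = orthogonal, -1 = opposite
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    ```

    For example, parallel vectors such as [1, 2] and [2, 4] score 1.0 even though their lengths differ.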

    How to represent documents?

  • TF-IDF: Documents can be represented as vectors
    • doc1 = [0 0 3 0 2 0 0 0 0 0 5 0 0 1 0]
    • doc2 = [0 1 0 0 0 2 0 0 4 0 0 1 1 0 0]

      TF-IDF

      Term Frequency (TF) of word $w_i$ in document $d_i$:

$$TF(w_i, d_i) = \frac{\text{Count of } w_i \text{ in } d_i}{\text{Total number of words in } d_i}$$

$$IDF(w_i, D) = \log\left(\frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing word } w_i}\right)$$

$$\text{TF-IDF} = TF \cdot IDF$$

Note: the Document Frequency (DF) in the IDF denominator is often smoothed (e.g., by adding 1) to avoid division by zero for unseen words and to temper extreme IDF values.

Example

  • Doc 1: “the cat sat on the mat”
  • Doc 2: “the dog sat on the log”
  • Doc 3: “cats and dogs are animals”

$$TF(\text{cat}, \text{Doc 1}) = 1/6 = 0.167$$

$$IDF(\text{cat}) = \log(3/1) = 1.099$$

$$\text{TF-IDF}(\text{cat}, \text{Doc 1}) = TF \cdot IDF = 0.167 \times 1.099 = 0.183$$
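
The worked example above can be reproduced with a short sketch (natural log, matching the numbers; note that "cats" in Doc 3 does not match "cat", so DF(cat) = 1):

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are animals".split(),
]

def tf(word, doc):
    # fraction of tokens in the document equal to `word`
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log of (corpus size / number of documents containing the word)
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)
```

Here `tf_idf("cat", docs[0], docs)` gives (1/6) x ln(3) ≈ 0.183, as in the example.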

Representing with a Co-occurrence Matrix

There are two possible ways:

Window-based (count co-occurrences within a window around each word)

Example with window size 1:

Count     I   like  enjoy  deep  learning
I         0   2     1      0     0
like      2   0     0      1     0
enjoy     1   0     0      0     0
deep      0   1     0      0     1
learning  0   0     0      1     0
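
A minimal sketch of window-based counting. The corpus below is an assumption (the classic three-sentence toy corpus often used with this table); restricted to the five words above, it reproduces the table's counts:

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count (word, context) pairs within +/- `window` positions."""
    counts = defaultdict(int)
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, sent[j])] += 1
    return counts

# Toy corpus (assumption) consistent with the counts in the table above
corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]
counts = cooccurrence_counts(corpus, window=1)
```

For instance, "I" and "like" are adjacent in two sentences, so `counts[("I", "like")]` is 2, matching the table.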

Problems:

  • Sparsity issues
  • High dimensional, size increases with vocabulary (requires more storage)

Possible solution: Singular Value Decomposition
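
A sketch of how SVD turns the sparse count matrix above into low-dimensional dense vectors: factor X = U S V^T and keep only the top-k singular directions:

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning"]
# Window-1 co-occurrence matrix from the example above
X = np.array([
    [0, 2, 1, 0, 0],
    [2, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Full SVD, then truncate: keep the top-k singular directions
U, S, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k] * S[:k]  # one dense k-dimensional vector per word
```

Each row of `embeddings` is a 2-dimensional dense vector for the corresponding word, replacing its sparse 5-dimensional count row.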

Word-document co-occurrence matrix (count words over entire documents rather than local windows)
  • Word2vec (learns dense word embeddings from raw text)
    • Two variants:
      • Skip-gram (SG): predict context (“outside”) words given the center word
      • Continuous Bag of Words (CBOW): predict the center word from the context words
  • GloVe
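
A sketch of how skip-gram framing generates its training data: each word in turn is the center, and every word within the window becomes a (center, context) prediction target. (The pair generation shown here is the same for CBOW; only the prediction direction differs.)

```python
def skipgram_pairs(sentence, window=2):
    """(center, context) training pairs: skip-gram predicts each context word from its center."""
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs
```

For example, with window 1 the sentence "the cat sat" yields the pairs (the, cat), (cat, the), (cat, sat), (sat, cat).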