How do we make a computer understand human language? Hmm...
You probably guessed it right: text processing. But how do we get computers to perform clustering, classification and so on over text data, given that they are generally poor at handling raw strings in any meaningful way?
Sure, a computer can match two strings and tell you whether they are the same or not. But how do we make a computer tell you about football or Ronaldo when you search for Messi? How do you make a computer understand that “Apple” in “Apple is a tasty fruit” is a fruit that can be eaten and not a company?
The answer to these questions lies in creating a representation for words that captures their meanings, their semantic relationships and the different contexts they are used in.
All of this is done with word embeddings, that is, numerical representations of text that computers can work with. A word embedding is text converted into numbers, and the same text may have several different numerical representations.
OK, but why do we need word embeddings?
Machine learning and deep learning algorithms cannot process raw strings as input. They require numbers as input to do any job, whether classification or regression.
Hmm... are there different types of embeddings?
- Frequency-based embedding: there are generally three types of vectors
i. Count Vector
ii. TF-IDF Vector
iii. Co-occurrence Vector
- Prediction-based embedding:
i. CBOW (Continuous Bag of Words)
ii. Skip-Gram model
Count Vector: Consider a corpus C of D documents {d1, d2, ..., dD} and N unique tokens extracted from the corpus C. The N tokens form our dictionary, and the count vector matrix M has size D × N. Each row of M contains the frequency of each token in document d(i).
Example:
D1: He is lazy guy. She is lazy too
D2: Nikita is lazy person
Dictionary of unique tokens: [‘He’, ‘lazy’, ‘guy’, ‘She’, ‘Nikita’, ‘person’] (the common words ‘is’ and ‘too’ are left out of this dictionary)
Here, D = 2, N = 6
     He  lazy  guy  She  Nikita  person
D1   1   2     1    1    0       0
D2   0   1     0    0    1       1
Now, a column can also be read as the word vector for the corresponding word in the matrix M.
For example, the word vector for ‘lazy’ in the above matrix is [2, 1], and so on. Here, the rows correspond to the documents in the corpus and the columns correspond to the tokens in the dictionary. The second row in the above matrix may be read as: D2 contains ‘lazy’ once, ‘Nikita’ once and ‘person’ once.
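To make the example concrete, here is a minimal sketch in plain Python that rebuilds the matrix M for D1 and D2. The `tokenize` and `count_matrix` helpers are names made up for this illustration, and the tokenizer is deliberately naive (it only drops full stops and splits on spaces).

```python
from collections import Counter

# Documents from the example above.
docs = {
    "D1": "He is lazy guy. She is lazy too",
    "D2": "Nikita is lazy person",
}

# Dictionary from the example; 'is' and 'too' are left out, as in the text.
vocab = ["He", "lazy", "guy", "She", "Nikita", "person"]


def tokenize(text):
    # Naive tokenizer (an assumption): drop full stops, split on whitespace.
    return text.replace(".", " ").split()


def count_matrix(docs, vocab):
    # One row per document, one column per dictionary token.
    rows = []
    for text in docs.values():
        counts = Counter(tokenize(text))
        rows.append([counts[token] for token in vocab])
    return rows


M = count_matrix(docs, vocab)

# A column of M is the word vector for the corresponding token.
lazy_vector = [row[vocab.index("lazy")] for row in M]

print(M)            # [[1, 2, 1, 1, 0, 0], [0, 1, 0, 0, 1, 1]]
print(lazy_vector)  # [2, 1]
```

Tokens that are not in the dictionary are simply never looked up, which is why ‘is’ and ‘too’ do not show up in M.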
TF-IDF Vector: This is another frequency-based method, but it differs from count vectorization in that it takes into account not just the occurrence of a word in a single document but its occurrence across the entire corpus.
Common words like ‘is’, ‘the’, ‘a’ etc. tend to appear far more often than the words that are actually important to a document. For example, a document A on Lionel Messi is going to contain more occurrences of the word “Messi” than other documents do. But common words like “the” are also going to be present at high frequency in almost every document.
What we want is to down-weight the common words that occur in almost all documents and give more importance to words that appear in only a subset of documents.
TF-IDF does exactly this: it penalizes the common words by assigning them lower weights, while words that are concentrated in a few documents receive higher weights.
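The text above does not give a formula, so the sketch below assumes the common textbook definition: TF is a token's count divided by the document length, and IDF is log(number of documents / number of documents containing the token). The `tf_idf` function and the three toy sentences are invented for this illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    df = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # TF (share of the document) times IDF (rarity across the corpus).
            token: (count / total) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return weights


# Toy corpus (made up): 'the' occurs in every document, 'messi' in only one.
docs = [
    "messi scored the goal".split(),
    "the match was the best".split(),
    "the crowd loved the goal".split(),
]

w = tf_idf(docs)
print(w[0])
```

In this toy corpus ‘the’ ends up with a weight of zero because it appears in every document, while ‘messi’ gets the highest weight in the first document. Libraries often use smoothed variants of this formula, so their numbers will generally differ from this sketch.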
Co-occurrence Vector: The idea is that similar words tend to occur together and to have similar contexts.
For example: Apple is a fruit. Mango is a fruit.
Apple and mango tend to have a similar context, i.e. fruit.
Co-occurrence: for a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they have appeared together in a context window.
Context window: a context window is specified by a size (a number of words) and a direction (to the left, to the right, or both).
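As a rough illustration of counting co-occurrences inside a context window, the sketch below uses the two fruit sentences from above. The `cooccurrence` function is a hypothetical helper, and the choices it makes (a symmetric window, counted per sentence, window size 3) are assumptions rather than anything fixed by the text.

```python
from collections import defaultdict


def cooccurrence(sentences, window=2):
    """Count word pairs that appear within `window` positions of each other.

    The window looks in both directions, so the counts are symmetric, and
    pairs are never counted across sentence boundaries.
    """
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Pair the word with up to `window` words to its right and
            # record both orders, which also covers the left direction.
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts


sentences = [
    ["apple", "is", "a", "fruit"],
    ["mango", "is", "a", "fruit"],
]

counts = cooccurrence(sentences, window=3)
print(counts[("apple", "fruit")])  # 1
print(counts[("mango", "fruit")])  # 1
print(counts[("is", "a")])         # 2
```

‘apple’ and ‘mango’ never co-occur directly here, yet both co-occur with ‘is’, ‘a’ and ‘fruit’, which is the shared context the example describes.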
Let’s say there are V unique words in the corpus, so the vocabulary size is V. The co-occurrence matrix is then a V × V matrix whose columns correspond to the context words. This matrix is decomposed into factors using techniques like PCA or SVD, and a combination of these factors forms the word vector representation.
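For the SVD case, the decomposition can be sketched as follows. The symbols are assumptions introduced here: X is the V × V co-occurrence matrix, k is the number of singular values we choose to keep, and the right factor is called W to avoid clashing with the vocabulary size V.

```latex
X \approx U_k \, \Sigma_k \, W_k^{\top},
\qquad
U_k \in \mathbb{R}^{V \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
W_k \in \mathbb{R}^{V \times k}
```

One common choice is to take row i of U_k Σ_k as the k-dimensional vector for word i, so each word is described by k numbers instead of V.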
Advantages of Co-occurrence Matrix:
- It preserves the semantic relationship between words, e.g. man and woman tend to be closer than man and apple.
- When factorized with SVD, it can produce more accurate word vector representations than the simpler count-based methods above.
- It has to be computed only once and can be reused at any time after that. In this sense, it is faster than the alternatives.
Disadvantages of Co-occurrence Matrix:
- It requires a huge amount of memory to store the co-occurrence matrix. However, this problem can be circumvented by factorizing the matrix outside the system, for example on a Hadoop cluster, and saving the result.
https://github.com/Niki1ta96/Data-Science/blob/master/Python/Word%20Embedding%20basics.ipynb