Word2vec from scratch in Python

aquib (45) in #nlp • 6 years ago

Word2Vec is a method for creating word embeddings. A word embedding is a feature engineering technique used in NLP in which words or phrases from a vocabulary are mapped to vectors of real numbers. Word2Vec has two variants, Continuous Bag of Words (CBOW) and Skip-gram, and either one can be used to create word embeddings.
CBOW: predicts a word given its context (the surrounding words).
Skip-gram: predicts the context given a word.
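
To make the difference concrete, here is a minimal sketch of how the two variants frame a single training example. The function names and the toy sentence are made up for illustration; they are not taken from the original implementation.

```python
def cbow_example(tokens, i, window):
    """CBOW: the surrounding words are the input, the center word is the target."""
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    return context, tokens[i]


def skipgram_examples(tokens, i, window):
    """Skip-gram: the center word is the input, each surrounding word is a target."""
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    return [(tokens[i], c) for c in context]


# Hypothetical toy sentence, center word "brown" (index 2), one word on each side.
tokens = ["the", "quick", "brown", "fox", "jumps"]
print(cbow_example(tokens, 2, 1))       # (['quick', 'fox'], 'brown')
print(skipgram_examples(tokens, 2, 1))  # [('brown', 'quick'), ('brown', 'fox')]
```

CBOW produces one example per position with several input words, while skip-gram produces one example per (center, context) pair.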

Figure source: https://arxiv.org/pdf/1301.3781.pdf
Skip-gram model: the current word is taken as the input, and the model predicts the words within a certain range before and after it.

Example: Ashraf Marwan is remembered most famously for spying for the Egyptian intelligence agency.
After removing the stop words and converting everything to lowercase:

ashraf marwan remembered famously spying egyptian intelligence agency
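
The cleaning step above could look roughly like the sketch below. The stop-word set is an assumption: a tiny hand-picked list that covers only this example sentence. A real pipeline would more likely use a full stop-word list from an NLP library.

```python
import re

# Assumption: a minimal stop-word list chosen for this example only.
STOP_WORDS = {"is", "most", "for", "the"}


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


sentence = ("Ashraf Marwan is remembered most famously for spying "
            "for the Egyptian intelligence agency.")
tokens = preprocess(sentence)
print(" ".join(tokens))
# ashraf marwan remembered famously spying egyptian intelligence agency
```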

If we set this range, also called WINDOW_SIZE, to 2, each word is paired with the words up to two positions before and after it.
[Image: Screen Shot 2018-10-16 at 6.01.50 PM.png]
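
As a rough sketch of what WINDOW_SIZE = 2 means for the cleaned sentence, the snippet below lists the (input, target) pairs skip-gram would train on. The helper name generate_pairs is hypothetical.

```python
WINDOW_SIZE = 2


def generate_pairs(tokens, window_size=WINDOW_SIZE):
    """Pair each word with every word at most window_size positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "ashraf marwan remembered famously spying egyptian intelligence agency".split()
pairs = generate_pairs(tokens)
for pair in pairs[:5]:
    print(pair)
# ('ashraf', 'marwan')
# ('ashraf', 'remembered')
# ('marwan', 'ashraf')
# ('marwan', 'remembered')
# ('marwan', 'famously')
```

Words near the start or end of the sentence get fewer pairs because part of their window falls outside the text.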

Training word2vec involves three main steps:

1. Clean and pre-process the text documents.
2. Convert the input and target words into one-hot vectors using the vocabulary.
3. Create a shallow neural network and train it.
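
The last two steps can be sketched as follows: one-hot encoding with a vocabulary, then a one-hidden-layer network with a full softmax output trained by plain SGD. This is an illustrative sketch, not the code from the linked write-up. The embedding size, learning rate, epoch count and all names are assumptions, and it uses plain Python lists to stay dependency-free where NumPy would be the usual choice.

```python
import math
import random

tokens = "ashraf marwan remembered famously spying egyptian intelligence agency".split()
vocab = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 5   # vocabulary size; embedding size N is an assumption
WINDOW_SIZE = 2


def one_hot(word):
    """Vector of length V with a 1.0 at the word's vocabulary index."""
    vec = [0.0] * V
    vec[word_to_id[word]] = 1.0
    return vec


# Skip-gram (input, target) pairs within WINDOW_SIZE of each word.
pairs = [(tokens[i], tokens[j])
         for i in range(len(tokens))
         for j in range(max(0, i - WINDOW_SIZE), min(len(tokens), i + WINDOW_SIZE + 1))
         if j != i]

random.seed(0)
W1 = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]  # input -> hidden
W2 = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]  # hidden -> output


def forward(x):
    """Return the hidden layer h = x.W1 and the softmax over h.W2."""
    h = [sum(x[i] * W1[i][k] for i in range(V)) for k in range(N)]
    u = [sum(h[k] * W2[k][j] for k in range(N)) for j in range(V)]
    m = max(u)  # subtract the max for numerical stability
    exp_u = [math.exp(v - m) for v in u]
    total = sum(exp_u)
    return h, [v / total for v in exp_u]


def train_epoch(lr=0.05):
    """One SGD pass over all pairs; returns the mean cross-entropy loss."""
    loss = 0.0
    for inp, tgt in pairs:
        x, y = one_hot(inp), one_hot(tgt)
        h, p = forward(x)
        loss -= math.log(p[word_to_id[tgt]])
        err = [p[j] - y[j] for j in range(V)]  # gradient w.r.t. output scores
        grad_h = [sum(W2[k][j] * err[j] for j in range(V)) for k in range(N)]
        for k in range(N):
            for j in range(V):
                W2[k][j] -= lr * h[k] * err[j]
        for i in range(V):
            for k in range(N):
                W1[i][k] -= lr * x[i] * grad_h[k]
    return loss / len(pairs)


losses = [train_epoch() for _ in range(200)]
embeddings = {w: W1[word_to_id[w]] for w in vocab}  # rows of W1 are the word vectors
print(f"mean loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

Because the input is one-hot, multiplying it by W1 simply selects one row of W1, which is why the rows of W1 serve as the learned embeddings. With a single eight-word sentence the vectors carry little meaning; the sketch only shows the mechanics.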

All the above steps and code are explained here: https://aquibjk.wordpress.com/2018/10/03/word2vec-analysis-and-implementation/

#deep #learning #word2vec
