Setup
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from nltk.text import Text
from nltk import FreqDist
from nltk.corpus import wordnet as wn
import nltk
import re
from nltk.corpus import inaugural
from collections import Counter
from nltk import CFG
from nltk.tree import Tree
Use the text of the Universal Declaration of Human Rights (UDHR). Create a table for 5 languages in which you will collect statistics about the languages used. Place in that table the number of words in each language in UDHR, number of unique words, average length of words, number of sentences contained in UDHR and average number of words per sentence. Create a distribution of sentence lengths for each language. Plot those (non-cumulative) distributions on one diagram.
Import the UDHR text and show some of the available languages.
from nltk.corpus import udhr
udhr.fileids()[100:110]
I selected five languages: English, German, French, Czech, and Spanish.
languages = ['English-Latin1', 'German_Deutsch-Latin1', 'French_Francais-Latin1',
'Czech-Latin2', 'Spanish-Latin1']
In order to calculate the statistics, we loop through the languages and extract the different values.
# Collect one row of statistics per language
np_udhr = []
# Loop through the languages
for i in languages:
    # Get statistics
    n_words = len(udhr.words(i))
    n_sents = len(udhr.sents(i))
    n_unique = len(set(udhr.words(i)))
    # Average word length based on the characters of the tokens themselves
    n_char = sum(len(w) for w in udhr.words(i))
    mean_word_length = round(n_char / n_words, 2)
    mean_n_words_sents = round(n_words / n_sents, 2)
    # Append the row for this language
    np_udhr.append([i, n_words, n_unique, n_sents,
                    mean_word_length, mean_n_words_sents])
Next we can output the calculated values as a table.
df_stat = pd.DataFrame(np_udhr, columns=['Language', 'Word Count',
                                         'Word Count Unique',
                                         'Number of Sentences',
                                         'Word Length (Mean)',
                                         'Words Per Sentence (Mean)'])
df_stat
Next we're creating a conditional frequency distribution of the different sentence lengths for each individual language from above.
cfd = nltk.ConditionalFreqDist(
    (lang, len(sent))
    for lang in languages
    for sent in udhr.sents(lang))
We can visualize the data either through a sample table or a CFD plot.
cfd.tabulate(conditions=languages, samples=range(10), cumulative=False)
Next we're plotting the data.
cfd.plot(cumulative=False)
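For more control over axis labels and the legend, the same non-cumulative distributions can also be drawn by hand. The following is only a minimal sketch and assumes matplotlib is available (it is not part of the setup imports above).
import matplotlib.pyplot as plt
# Plot the non-cumulative sentence-length distribution of every language
plt.figure(figsize=(10, 5))
for lang in languages:
    lengths = sorted(cfd[lang])                  # observed sentence lengths
    counts = [cfd[lang][n] for n in lengths]     # number of sentences per length
    plt.plot(lengths, counts, label=lang)
plt.xlabel('Sentence length (words)')
plt.ylabel('Number of sentences')
plt.legend()
plt.show()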
Identify 10 most frequently used words longer than 7 characters in the entire corpus of Inaugural Addresses. Do not identify 10 words for every speech but rather 10 words for the entire corpus. Which among those words has the largest number of synonyms? List all synonyms for those 10 words. Which one of those 10 words has the largest number of hyponyms? List all hyponyms of those 10 most frequently used “long” words. The purpose of this problem is to familiarize you with WordNet and concepts of synonyms and hyponyms.
First we're loading all the words from the inaugural corpus.
np_words = nltk.corpus.inaugural.words(nltk.corpus.inaugural.fileids())
Next we can restrict the corpus to words longer than 7 characters and identify the 10 most frequently used ones. With that we can separate the words and their counts.
# Keep only words longer than 7 characters (lower-cased)
filt_char = [w.lower() for w in np_words if len(w) > 7]
# Frequency distribution
f_dist = FreqDist(filt_char)
freq = f_dist.most_common(10)
# Create empty lists
np_1 = []
np_2 = []
# Separate words and counts
for i in range(10):
    np_1.append(freq[i][0])
    np_2.append(freq[i][1])
# Create dataframe
df_freq = pd.DataFrame({'Word': np_1, 'n': np_2})
With that we can print the top 10 words and their number of occurrences.
df_freq
As we can see above, government appears most often among the top 10 words and principles the least.
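As a side note, FreqDist.most_common() already returns (word, count) pairs, so the table above could also be built in a single step; df_freq_alt below is just an illustrative name.
# Equivalent, more compact construction of the frequency table
df_freq_alt = pd.DataFrame(f_dist.most_common(10), columns=['Word', 'n'])
df_freq_alt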
Next we can get the synonyms for all of these words and print them.
# Create list of (word, synonym count) pairs
np_syn = []
# Loop through the 10 most frequent words
for i in f_dist.most_common(10):
    # Print word
    print('Word: ', i[0])
    count = 0
    # Print synonyms
    print('Synonyms: ')
    for j in wn.synsets(i[0]):
        print(' ', j.lemma_names())
        count += len(j.lemma_names())
    np_syn.append([i[0], count])
# Create empty lists
np_1 = []
np_2 = []
# Separate words and synonym counts
for i in range(10):
    np_1.append(np_syn[i][0])
    np_2.append(np_syn[i][1])
# Create dataframe
df_syn = pd.DataFrame({'Word': np_1, 'n': np_2})
The top 10 words with their synonym counts are:
df_syn.sort_values(by="n", ascending=False)
As we can see above, interests appears to have the most synonyms and citizens the least.
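Note that the counts above sum lemma names over all synsets, so synonyms shared by several synsets (and the word itself) are counted more than once. A small sketch that instead counts distinct synonyms could look like this (distinct_syn is just an illustrative name):
# Count distinct lemma names per word, excluding the word itself
distinct_syn = []
for word, _ in f_dist.most_common(10):
    lemmas = {l for s in wn.synsets(word) for l in s.lemma_names()}
    lemmas.discard(word)
    distinct_syn.append((word, len(lemmas)))
pd.DataFrame(distinct_syn, columns=['Word', 'Distinct Synonyms']).sort_values(
    by='Distinct Synonyms', ascending=False)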
Next we list all the hyponyms of those 10 most frequently used words. The goal is to find the word with the largest number of hyponyms.
# Create list of (word, hyponym count) pairs
np_hyp = []
# Loop through the 10 most frequent words
for i in f_dist.most_common(10):
    # Print word
    print('Word: ', i[0])
    count = 0
    # Print hyponyms
    print('Hyponyms: ')
    for j in wn.synsets(i[0]):
        print(j)
        print(j.hyponyms())
        count += len(j.hyponyms())
    np_hyp.append([i[0], count])
# Create empty lists
np_1 = []
np_2 = []
# Separate words and hyponym counts
for i in range(10):
    np_1.append(np_hyp[i][0])
    np_2.append(np_hyp[i][1])
# Create dataframe
df_hyp = pd.DataFrame({'Word': np_1, 'n': np_2})
The top 10 words with their hyponym counts are:
df_hyp.sort_values(by="n", ascending=False)
As can be seen above, the word american has the most hyponyms and political the least.
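hyponyms() only returns direct hyponyms. If the whole hyponym subtree is of interest, WordNet's closure() can be used to walk it recursively; the sketch below (deep_hyp is an illustrative name) counts all transitively reachable hyponyms.
# Count all hyponyms reachable from each synset (transitive closure)
deep_hyp = []
for word, _ in f_dist.most_common(10):
    total = sum(len(list(s.closure(lambda x: x.hyponyms())))
                for s in wn.synsets(word))
    deep_hyp.append((word, total))
pd.DataFrame(deep_hyp, columns=['Word', 'Hyponyms (Transitive)']).sort_values(
    by='Hyponyms (Transitive)', ascending=False)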
Create your own grammar for the following sentence: "Describe every step of your work and present all intermediate and final results in a Word document".
We first create and split the above sentence.
text = """Describe every step of your work and present all intermediate
and final results in a Word document"""
sentence = text.split()
Next we can print the sentence.
print(sentence)
Next we're defining a context free grammar for the above sentence.
# define a simple context-free grammar.
grammar = CFG.fromstring("""
S -> VP | VP Cnj VP
VP -> V NP | V PP
NP -> Det N | Det N PP | Adj N | Adj N PP
PP -> P NP
N -> 'step' | 'work' | 'results' | 'Word' 'document'
P -> 'in' | 'of' | 'all'
V -> 'Describe' | 'present'
Det -> 'an' | 'my' | 'every' | 'your' | 'a'
Adj -> 'intermediate' | 'final' | Adj Cnj Adj
Cnj -> 'and' | 'or'
""")
Next we can parse the sentence and print the resulting trees.
# Parse
par_grammar = nltk.ChartParser(grammar, trace=0)
par_trees = par_grammar.parse(sentence)
# Print the parse trees
for i in par_trees:
    print(i)
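Since parse() returns a generator that is exhausted after one pass, it can be convenient to materialize the parses in a list first; the sketch below also pretty-prints the first tree (pretty_print() assumes a reasonably recent NLTK version).
# Collect all parses in a list so they can be counted and reused
trees = list(par_grammar.parse(sentence))
print('Number of parse trees:', len(trees))
if trees:
    trees[0].pretty_print()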
Install and compile Word2Vec C executables. Train CBOW model and create 200 dimensional embedding of Word Vectors. Demonstrate that you could run analogical reasoning when searching for country’s favorite food starting with japan and sushi. Note that words might have to be in lower case. Find favorite food for 5 different countries. Report improbable results as well as good results. Use scripts provided with original Google C code.
I downloaded the Word2Vec C executables from the following GitHub repository by William Yeh to my EC2 Ubuntu instance, since I somehow wasn't able to install Word2Vec directly. The nice thing about the code in Yeh's repository is that it comes with a makefile and can be compiled easily.
git clone https://github.com/William-Yeh/word2vec-mac.git
Cloning into 'word2vec-mac'...
remote: Counting objects: 123, done.
remote: Total 123 (delta 0), reused 0 (delta 0), pack-reused 123
Receiving objects: 100% (123/123), 111.30 KiB | 0 bytes/s, done.
Resolving deltas: 100% (97/97), done.
Checking connectivity... done.
Next I compiled the word2vec programs. As explained above, installation through pip had somehow failed.
cd word2vec-mac/
make
gcc word2vec.c -o word2vec -lm -pthread -Ofast -march=native -Wall \
-funroll-loops -Wno-unused-result
gcc word2phrase.c -o word2phrase -lm -pthread -Ofast -march=native \
-Wall -funroll-loops -Wno-unused-result
gcc distance.c -o distance -lm -pthread -Ofast -march=native -Wall \
-funroll-loops -Wno-unused-result
gcc word-analogy.c -o word-analogy -lm -pthread -Ofast -march=native \
-Wall -funroll-loops -Wno-unused-result
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -Ofast \
-march=native -Wall -funroll-loops -Wno-unused-result
I next trained a neural net using the CBOW model and created 200-dimensional embeddings of the word vectors. This was done with demo-word.sh, using text8 as the training data. Afterwards I queried the trained model with japan and sushi.
chmod +x *.sh
./demo-word.sh
make: Nothing to be done for 'all'.
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 29.8M 100 29.8M 0 0 1913k 0 0:00:16 0:00:16 --:--:-- 1963k
Archive: text8.zip
inflating: text8
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
Alpha: 0.000121 Progress: 99.58% Words/thread/sec: 97.71k
real 1m30.997s
user 2m54.924s
sys 0m0.236s
Enter word or sentence (EXIT to break): japan
Word: japan Position in vocabulary: 582
Word Cosine distance
------------------------------------------------------------------------
china 0.666397
korea 0.584256
singapore 0.572973
cambodia 0.563123
...
Enter word or sentence (EXIT to break): sushi
Word: sushi Position in vocabulary: 30906
Word Cosine distance
------------------------------------------------------------------------
dashi 0.726945
tofu 0.723628
glutinous 0.705772
steamed 0.696959
...
I also ran the demo-phrases.sh script.
./demo-phrases.sh
Starting training using file text8
Words processed: 17000K Vocab size: 4399K
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
Words written: 17000K
real 0m24.892s
user 0m23.432s
sys 0m0.760s
Starting training using file text8-phrase
Vocab size: 84069
Words in train file: 16307293
Alpha: 0.000117 Progress: 99.60% Words/thread/sec: 40.23k
real 3m29.039s
user 6m49.088s
sys 0m0.264s
Enter word or sentence (EXIT to break):
I entered japan sushi germany:
Enter word or sentence (EXIT to break): japan sushi germany
Word: japan Position in vocabulary: 547
Word: sushi Position in vocabulary: 32615
Word: germany Position in vocabulary: 319
Word Cosine distance
------------------------------------------------------------------------
exports_partners 0.509551
russia 0.486049
italy 0.485402
france 0.481288
...
The closest word to the above search is exports_partners, followed by Russia and Italy.
Next I ran demo-analogy.sh in order to find the food analogies, using japan sushi as the analogy base. I tried to find the favorite foods for Germany, France, Italy, Spain, and the USA.
./demo-analogy.sh
make: Nothing to be done for 'all'.
----------------------------------------------------------
Note that for the word analogy to perform well, the models
should be trained on much larger data sets
Example input: paris france berlin
----------------------------------------------------------
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
Alpha: 0.000121 Progress: 99.58% Words/thread/sec: 98.08k
real 1m30.090s
user 2m54.356s
sys 0m0.252s
Enter three words (EXIT to break):
Enter three words (EXIT to break): japan sushi germany
Word: japan Position in vocabulary: 582
Word: sushi Position in vocabulary: 30906
Word: germany Position in vocabulary: 324
Word Distance
------------------------------------------------------------------------
turnips 0.521571
cabbage 0.521392
glazed 0.516707
hams 0.512228
...
Enter three words (EXIT to break): japan sushi france
Word: japan Position in vocabulary: 582
Word: sushi Position in vocabulary: 30906
Word: france Position in vocabulary: 303
Word Distance
------------------------------------------------------------------------
omelette 0.551152
grilled 0.541595
caramel 0.537879
breads 0.536886
...
Enter three words (EXIT to break): japan sushi italy
Word: japan Position in vocabulary: 582
Word: sushi Position in vocabulary: 30906
Word: italy Position in vocabulary: 843
Word Distance
------------------------------------------------------------------------
omelette 0.542516
tofu 0.541085
cooked 0.538601
lettuce 0.538252
...
Enter three words (EXIT to break): japan sushi spain
Word: japan Position in vocabulary: 582
Word: sushi Position in vocabulary: 30906
Word: spain Position in vocabulary: 804
Word Distance
------------------------------------------------------------------------
caramel 0.539891
breads 0.527792
savoury 0.525772
paprika 0.523075
...
Enter three words (EXIT to break): japan sushi usa
Word: japan Position in vocabulary: 582
Word: sushi Position in vocabulary: 30906
Word: usa Position in vocabulary: 1164
Word Distance
------------------------------------------------------------------------
raspberry 0.532414
crepe 0.530286
shallots 0.527220
kodak 0.525473
...
According to the word analogies, the top-ranked "favorite foods" are: turnips for Germany, omelette for France, omelette for Italy, caramel for Spain, and raspberry for the USA.
The resulting foods do make sense to a certain degree and can be associated with the country in question. However, in most cases one would expect different results, such as burgers for the USA, sausages for Germany, pasta for Italy or tapas for Spain. The main reason for these results is the small training set in the text8 file.
Install and run the Gensim Python Word2Vec API. Find the most probable words you will obtain when you start with an emperor, add a woman and subtract a man. Use this tutorial as a guide: https://rare-technologies.com/word2vec-tutorial/
Note: somehow the installation didn't work under Python 2.7, and I didn't have Jupyter Notebook installed for Python 3. That is why I executed the following steps directly in the shell.
It is possible to download the text8 corpus used (see the comments in the rare-technologies tutorial):
wget http://mattmahoney.net/dc/text8.zip
unzip text8.zip
Next I opened python3 and did the following analysis in the shell.
# import gensim
import gensim, logging
from gensim.models import word2vec
# open text8
sentences = word2vec.Text8Corpus('text8')
# build the model with vector size 200
model = word2vec.Word2Vec(sentences, size=200)
# Run query: emperor + woman - man
top_5 = model.most_similar(positive=['emperor', 'woman'],
                           negative=['man'], topn=5)
Next we can output the words:
# Print list
for word in top_5:
    print(word)
('empress', 0.6854598522186279)
('emperors', 0.6028348207473755)
('ruler', 0.5929163694381714)
('augustus', 0.5781252384185791)
('daughter', 0.5724169015884399)
The most likely words are empress, emperors, ruler, augustus and daughter.
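In principle the same gensim model could also reproduce the japan/sushi food analogy from the C word-analogy demo, i.e. vector('sushi') - vector('japan') + vector('germany'). This is only a sketch; the results will differ from the C run because the training parameters differ.
# Analogy query: japan is to sushi as germany is to ?
analogy = model.most_similar(positive=['sushi', 'germany'],
                             negative=['japan'], topn=5)
for word, score in analogy:
    print(word, score)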