Python Programming

Use sklearn CountVectorizer vocabulary specification with bigrams

The n-gram technique is comparatively simple, and raising the value of n captures more context around each word. Search engines use this technique to predict and recommend the next characters or words in a sequence as users type.
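
To make the idea of n concrete, here is a minimal sketch of n-gram extraction in plain Python. The helper name ngrams and the sample sentence are made up for illustration, and the lowercase whitespace split is a simplifying assumption; it is not the tokenizer that CountVectorizer uses, which by default also drops punctuation and single-character tokens.

```python
def ngrams(text, n):
    """Return the word n-grams of text as space-joined strings.

    Tokenization here is a plain lowercase whitespace split (an assumption
    made for illustration only).
    """
    tokens = text.lower().split()
    # Slide a window of n tokens across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "machine language is a low level programming language"
bigrams = ngrams(sentence, 2)    # pairs of adjacent words
trigrams = ngrams(sentence, 3)   # triples of adjacent words
print(bigrams)
print(trigrams)
```

A larger n yields fewer, longer n-grams, each carrying more surrounding context.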


Bigram-based Count Vectorizer

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"

df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})

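# Initialize the vectorizer to extract bigrams only (n = 2)
vectorizer = CountVectorizer(ngram_range=(2, 2))
# df1.iloc[0] is the first row, which holds one document per column
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create a DataFrame with one row per bigram and one column per document
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())

# Label the columns with the document names
df2.columns = df1.columns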
print(df2)

                        Assembly  Machine
also either                    0        1
and or                         0        1
are also                       0        1
are readable                   1        0
are still                      1        0
assembly language              5        0
because each                   1        0
but difficult                  0        1
by computers                   0        1
by people                      0        1
can execute                    0        1
comes with                     1        0
compiled and                   0        1
computers but                  0        1
computers can                  0        1
difficult to                   0        1
disadvantage of                1        0
each assembly                  1        0
each platform                  1        0
easily understood              0        1
either compiled                0        1
execute them                   0        1
high level                     0        1
higher level                   0        1
in high                        0        1
in other                       1        0
instruction though             1        0
instruction translates         1        0
interpreted into               0        1
into machine                   0        1
...                          ...      ...
or interpreted                 0        1
other words                    1        0
particular assembly            1        0
people this                    0        1
people use                     0        1
platform comes                 1        0
portable because               1        0
programming language           0        1
programming languages          0        1
programs written               0        1
read by                        0        1
readable the                   1        0
representation of              1        0
so that                        0        1
statements are                 2        0
still low                      1        0
that computers                 0        1
that it                        1        0
the statements                 1        0
this is                        0        1
though assembly                1        0
to machine                     1        0
to read                        0        1
translates to                  1        0
understood by                  0        1
use higher                     0        1
why people                     0        1
with particular                1        0
words each                     1        0
written in                     0        1
 
[83 rows x 2 columns]
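
The listing above lets the vectorizer learn every bigram it finds. To count only a chosen set of bigrams, CountVectorizer also accepts a vocabulary argument. The following is a minimal sketch under stated assumptions: the two short documents and the three-entry vocabulary are hypothetical and chosen only for illustration, not taken from the data above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical short documents, for illustration only
docs = [
    "Machine language is a low-level programming language.",
    "Assembly language is a representation of machine language.",
]

# Hypothetical fixed vocabulary: only these bigrams are counted,
# and the output columns follow this order
vocab = ["machine language", "assembly language", "programming language"]

# ngram_range must cover the n-gram length used in the vocabulary,
# otherwise the bigram entries can never match
vectorizer = CountVectorizer(ngram_range=(2, 2), vocabulary=vocab)
counts = vectorizer.fit_transform(docs).toarray()

for name, column in zip(vocab, counts.T):
    print(name, column)
```

Each row of counts corresponds to one document and each column to one vocabulary entry. Bigrams outside the vocabulary are ignored, which keeps the resulting matrix small when only a few phrases matter.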