Use sklearn CountVectorizer vocabulary specification with bigrams
An n-gram is a contiguous sequence of n tokens, so a bigram is a pair of adjacent words. The n-gram technique is comparatively simple, and raising the value of n captures more context around each word. Search engines use this technique to predict and recommend the next characters or words in a sequence as users type.
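To make the idea concrete, here is a minimal pure-Python sketch of n-gram extraction and a toy next-word suggestion built on bigram counts. The helper names `ngrams` and `suggest` and the sample sentence are assumptions made for illustration. They are not part of scikit-learn, and real search engines use far more elaborate models.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative sentence, split on whitespace only
tokens = "machine language is a low level programming language".split()

print(ngrams(tokens, 1)[:3])  # unigrams: single words
print(ngrams(tokens, 2)[:3])  # bigrams: pairs of adjacent words
print(ngrams(tokens, 3)[:3])  # trigrams: more context per feature

# Count which word follows each word, using the bigrams
following = defaultdict(Counter)
for first, second in ngrams(tokens, 2):
    following[first][second] += 1


def suggest(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None


print(suggest("machine"))   # language
print(suggest("computer"))  # None, the word never appears in the sample
```

A larger n gives each feature more context, but it also makes every individual n-gram rarer, so the counts become sparser.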
Bigram-based Count Vectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"

# One column per document; 'Assembly' is listed first so the column order
# matches the output below regardless of pandas version
df1 = pd.DataFrame({'Assembly': [data2], 'Machine': [data1]})

# Initialize the vectorizer to extract bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create a DataFrame with one row per bigram
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())

# Change column headers
df2.columns = df1.columns
print(df2)
                        Assembly  Machine
also either                    0        1
and or                         0        1
are also                       0        1
are readable                   1        0
are still                      1        0
assembly language              5        0
because each                   1        0
but difficult                  0        1
by computers                   0        1
by people                      0        1
can execute                    0        1
comes with                     1        0
compiled and                   0        1
computers but                  0        1
computers can                  0        1
difficult to                   0        1
disadvantage of                1        0
each assembly                  1        0
each platform                  1        0
easily understood              0        1
either compiled                0        1
execute them                   0        1
high level                     0        1
higher level                   0        1
in high                        0        1
in other                       1        0
instruction though             1        0
instruction translates         1        0
interpreted into               0        1
into machine                   0        1
...                          ...      ...
or interpreted                 0        1
other words                    1        0
particular assembly            1        0
people this                    0        1
people use                     0        1
platform comes                 1        0
portable because               1        0
programming language           0        1
programming languages          0        1
programs written               0        1
read by                        0        1
readable the                   1        0
representation of              1        0
so that                        0        1
statements are                 2        0
still low                      1        0
that computers                 0        1
that it                        1        0
the statements                 1        0
this is                        0        1
though assembly                1        0
to machine                     1        0
to read                        0        1
translates to                  1        0
understood by                  0        1
use higher                     0        1
why people                     0        1
with particular                1        0
words each                     1        0
written in                     0        1

[83 rows x 2 columns]
2019-04-29T15:21:06+05:30
Amit Arora