Python Programming

How do we convert text to numbers using CountVectorizer?

Before most algorithms can be applied to text, the text has to be represented as numbers. CountVectorizer converts a collection of text documents to a matrix of token counts. CountVectorizer itself returns one row per document and one column per term; the example below transposes that result, so the matrix has terms as the rows, document names (here, the DataFrame columns) as the columns, and the number of times each term occurs in each document as the cells.
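To make the layout concrete before the longer example, here is a minimal sketch on a tiny corpus. The two sentences in `docs` are made up for illustration and are not part of the sample data used later. It reads the term order from the fitted `vocabulary_` attribute and transposes the counts so that terms become rows.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, used only for illustration
docs = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index; sort the terms by that index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Transpose so that each row is a term and each column is a document
matrix = counts.toarray().T.tolist()

for term, row in zip(terms, matrix):
    print(f"{term:<4} {row}")
```

Each printed row holds one term followed by its count in the first and second document, which is the same terms-by-documents arrangement described above.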


Converting Text to Numbers Using Count Vectorizing

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

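df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize the vectorizer and count the tokens in each document
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create a DataFrame with terms as rows and documents as columns
# (get_feature_names_out() requires scikit-learn 1.0 or later;
# older releases used get_feature_names())
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())

# Change column headers
df2.columns = df1.columns

# to_string() prints every row instead of a truncated preview
print(df2.to_string())

             Java  Python  Go
2009            0       0   1
and             2       2   2
application     1       0   0
are             0       1   0
by              0       0   1
bytecode        1       0   0
can             1       0   0
code            1       0   0
collection      0       0   1
comes           0       1   0
compiled        1       0   1
concurrency     0       0   1
created         0       0   1
csp             0       0   1
derived         1       0   0
develops        1       0   0
for             2       0   0
from            1       0   0
functional      0       1   0
garbage         0       0   1
go              0       0   1
griesemer       0       0   1
imperative      0       1   0
in              0       0   1
included        0       1   0
including       1       0   0
is              2       0   1
it              0       0   1
java            3       0   0
ken             0       0   1
language        1       0   2
languages       1       0   0
large           0       1   0
library         0       1   0
linux           2       0   0
mac             1       0   0
memory          0       0   1
most            2       0   0
multiple        0       1   0
object          0       1   0
of              3       0   1
offers          0       0   1
on              2       0   0
operating       2       0   0
or              1       0   0
oriented        0       1   0
paradigms       0       2   0
pike            0       0   1
platforms       1       0   0
procedural      0       1   0
programming     1       1   0
python          0       1   0
rob             0       0   1
robert          0       0   1
run             1       0   0
safety          0       0   1
several         1       0   0
software        1       0   0
standard        0       1   0
statically      0       0   1
structural      0       0   1
style           0       0   1
supports        0       1   0
syntax          1       0   0
system          1       0   0
systems         1       0   0
that            1       0   0
the             3       0   0
this            0       0   1
thompson        0       0   1
typed           0       0   1
typing          0       0   1
up              0       1   0
was             0       0   1
with            0       1   0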
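
Note that single-character words such as "a" and the "C" in "C++" do not appear in the counts. With default settings, CountVectorizer lowercases the text and keeps only tokens of two or more word characters. The sketch below reuses the last sentence of the Java sample to show the difference. The alternative `token_pattern` is only an illustrative choice for keeping one-character tokens, not something the example above uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Last sentence of the Java sample text
text = ["Most of the syntax of Java is derived from the C++ and C languages."]

# Default settings: tokens need at least two word characters
default_vec = CountVectorizer()
default_vec.fit(text)

# Illustrative alternative pattern that also keeps one-character tokens
single_char_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_vec.fit(text)

print("c" in default_vec.vocabulary_)      # is "c" a term under the defaults?
print("c" in single_char_vec.vocabulary_)  # is "c" a term with the wider pattern?
```

Whether one-character tokens are worth keeping depends on the task. For text about programming languages, dropping "C" loses information, so a custom pattern may be a reasonable choice.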