Python Programming

Create document term matrix with TF-IDF

This example converts a collection of raw documents into a matrix of TF-IDF features. TfidfVectorizer does this in one step: it is equivalent to CountVectorizer, which tokenizes the documents and returns the token counts as a sparse matrix, followed by TfidfTransformer, which applies Term Frequency-Inverse Document Frequency weighting and normalization to that count matrix.
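As a rough illustration of how these three classes relate, the sketch below checks that the one-step and two-step routes produce the same matrix when every class keeps its default settings. The two-sentence corpus `docs` is invented for this check and is not part of the example that follows.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus, used only for this comparison
docs = ["the cat sat on the mat", "the dog sat on the log"]

# Two-step route: raw token counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route
one_step = TfidfVectorizer().fit_transform(docs)

# Both routes should agree element by element
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step form is mainly useful when the raw counts are needed for something else as well; otherwise TfidfVectorizer is the shorter option.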


import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

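df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Fit the vectorizer on the three documents (one per column of df1)
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Build a term-by-document DataFrame: one row per term
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())

# Label the columns with the document names
df2.columns = df1.columns
print(df2.to_string())  # to_string() prints every row without truncation
                 Java    Python        Go
2009         0.000000  0.000000  0.178332
and          0.143616  0.251990  0.210652
application  0.121581  0.000000  0.000000
are          0.000000  0.213328  0.000000
by           0.000000  0.000000  0.178332
bytecode     0.121581  0.000000  0.000000
can          0.121581  0.000000  0.000000
code         0.121581  0.000000  0.000000
collection   0.000000  0.000000  0.178332
comes        0.000000  0.213328  0.000000
compiled     0.092466  0.000000  0.135626
concurrency  0.000000  0.000000  0.178332
created      0.000000  0.000000  0.178332
csp          0.000000  0.000000  0.178332
derived      0.121581  0.000000  0.000000
develops     0.121581  0.000000  0.000000
for          0.243163  0.000000  0.000000
from         0.121581  0.000000  0.000000
functional   0.000000  0.213328  0.000000
garbage      0.000000  0.000000  0.178332
go           0.000000  0.000000  0.178332
griesemer    0.000000  0.000000  0.178332
imperative   0.000000  0.213328  0.000000
in           0.000000  0.000000  0.178332
included     0.000000  0.213328  0.000000
including    0.121581  0.000000  0.000000
is           0.184932  0.000000  0.135626
it           0.000000  0.000000  0.178332
java         0.364744  0.000000  0.000000
ken          0.000000  0.000000  0.178332
language     0.092466  0.000000  0.271253
languages    0.121581  0.000000  0.000000
large        0.000000  0.213328  0.000000
library      0.000000  0.213328  0.000000
linux        0.243163  0.000000  0.000000
mac          0.121581  0.000000  0.000000
memory       0.000000  0.000000  0.178332
most         0.243163  0.000000  0.000000
multiple     0.000000  0.213328  0.000000
object       0.000000  0.213328  0.000000
of           0.277397  0.000000  0.135626
offers       0.000000  0.000000  0.178332
on           0.243163  0.000000  0.000000
operating    0.243163  0.000000  0.000000
or           0.121581  0.000000  0.000000
oriented     0.000000  0.213328  0.000000
paradigms    0.000000  0.426656  0.000000
pike         0.000000  0.000000  0.178332
platforms    0.121581  0.000000  0.000000
procedural   0.000000  0.213328  0.000000
programming  0.092466  0.162242  0.000000
python       0.000000  0.213328  0.000000
rob          0.000000  0.000000  0.178332
robert       0.000000  0.000000  0.178332
run          0.121581  0.000000  0.000000
safety       0.000000  0.000000  0.178332
several      0.121581  0.000000  0.000000
software     0.121581  0.000000  0.000000
standard     0.000000  0.213328  0.000000
statically   0.000000  0.000000  0.178332
structural   0.000000  0.000000  0.178332
style        0.000000  0.000000  0.178332
supports     0.000000  0.213328  0.000000
syntax       0.121581  0.000000  0.000000
system       0.121581  0.000000  0.000000
systems      0.121581  0.000000  0.000000
that         0.121581  0.000000  0.000000
the          0.364744  0.000000  0.000000
this         0.000000  0.000000  0.178332
thompson     0.000000  0.000000  0.178332
typed        0.000000  0.000000  0.178332
typing       0.000000  0.000000  0.178332
up           0.000000  0.213328  0.000000
was          0.000000  0.000000  0.178332
with         0.000000  0.213328  0.000000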
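To see where scores like those in the table come from, the sketch below rebuilds the weights by hand and compares them with TfidfVectorizer. It assumes scikit-learn's default settings (smoothed IDF, L2 normalization, raw term counts) and uses a small hypothetical corpus `docs` rather than the three texts above. Here each row is a document and each column a term, the transpose of the table printed above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, chosen so the arithmetic is easy to follow
docs = ["red apple", "green apple", "green tea"]

# Term counts: one row per document, one column per term
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Number of documents in which each term appears
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF as used by the defaults: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each document row to unit L2 length
weights = counts * idf
manual = weights / np.linalg.norm(weights, axis=1, keepdims=True)

# Compare with the library result
reference = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, reference)
print(matches)
```

Because every row is scaled to unit length, a term that appears in all documents still receives a non-zero score; it is only down-weighted relative to rarer terms in the same document.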