Python Programming

Stemming a list of sentences, words, or phrases using NLTK

Stemming is the process of reducing a word to its root form, or stem. For example, "jumping", "jumps" and "jumped" are all stemmed to "jump". Stemming standardizes words to a common base regardless of their inflections, which helps when classifying or clustering text. The stem does not have to be a dictionary word: in the examples in this article, "dancing" is reduced to "danc". Search engines use these techniques extensively to return better and more accurate results irrespective of the word form. The nltk package provides several stemmer implementations.


NLTK Stemming using Porter stemmer

from nltk.stem import PorterStemmer

st = PorterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))

where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
 

NLTK Stemming using Lancaster stemmer

from nltk.stem import LancasterStemmer

st = LancasterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))

wher did he learn to dant lik that?
his ey wer dant with humor.
she shook her head and dant away
alex was an excel dancer.
--------------------------------------------------
jump jump jump
 

NLTK Stemming using Snowball stemmer

from nltk.stem import SnowballStemmer

st = SnowballStemmer("english")
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))

where did he learn to danc like that?
his eye were danc with humor.
she shook her head and danc away
alex was an excel dancer.
--------------------------------------------------
jump jump jump
 

NLTK Stemming using RegexpStemmer

from nltk.stem import RegexpStemmer

st = RegexpStemmer('ing$|s$|ed$|er$', min=4)
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))

Where did he learn to dance like that?
His eye were danc with humor.
She shook her head and danc away
Alex was an excellent dancer.
--------------------------------------------------
jump jump jump
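
One detail worth noting in the output above: `dancer.` is left unchanged because `sentence.split()` keeps the trailing period attached to the word, so the `er$` rule never matches. The following is a minimal sketch of one way to work around that, assuming it is acceptable to stem only the alphabetic runs of a sentence. The names `stem_sentence` and `strip_suffix` are hypothetical helpers, not part of NLTK; `strip_suffix` is a plain `re` stand-in that applies the same pattern and minimum length as the `RegexpStemmer` above.

```python
import re

# Same suffix pattern and minimum length as the RegexpStemmer example.
SUFFIXES = re.compile(r'ing$|s$|ed$|er$')
MIN_LENGTH = 4


def strip_suffix(word):
    """Hypothetical stand-in: drop a matching suffix from words of MIN_LENGTH or more."""
    if len(word) < MIN_LENGTH:
        return word
    return SUFFIXES.sub('', word)


def stem_sentence(sentence, stem):
    """Apply `stem` to each alphabetic run, leaving punctuation and spacing as they are."""
    return re.sub(r'[A-Za-z]+', lambda match: stem(match.group(0)), sentence)


print(stem_sentence('Alex was an excellent dancer.', strip_suffix))
print(stem_sentence('His eyes were dancing with humor.', strip_suffix))
```

Any callable that maps a word to its stem can be passed as `stem`, for example the `st.stem` method of one of the NLTK stemmers used in this article.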
 

You can see that the results differ from one stemmer to another. Choose a stemmer based on your problem. If needed, you can even build your own stemmer with your own rules.
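
As a rough illustration of building a stemmer from your own rules, here is a minimal sketch of a hand-written, rule-based one. The class name `RuleStemmer`, its rule list, and the `min_stem` parameter are all assumptions made for this example. It does not use NLTK and only mirrors the `stem()` method, so it could be dropped into the loops shown earlier.

```python
class RuleStemmer:
    """Hypothetical suffix-rewriting stemmer; rules are tried in order, first usable match wins."""

    def __init__(self, rules, min_stem=3):
        self.rules = rules          # list of (suffix, replacement) pairs
        self.min_stem = min_stem    # shortest result a rule is allowed to produce

    def stem(self, word):
        word = word.lower()
        for suffix, replacement in self.rules:
            if word.endswith(suffix):
                candidate = word[:len(word) - len(suffix)] + replacement
                if len(candidate) >= self.min_stem:
                    return candidate
        # No rule applied, so the lowercased word is returned unchanged.
        return word


# Longer suffixes come first so that "ies" is tried before "s".
rules = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]
st = RuleStemmer(rules)

for word in ["jumping", "jumps", "jumped", "Parties", "his"]:
    print(word, "->", st.stem(word))
```

Rule order matters here: putting `("s", "")` first would turn "parties" into "partie" instead of "party". The `min_stem` check is what keeps a short word such as "his" from being cut down to "hi".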