Word Tokenization using NLTK and TextBlob
Word tokenization is the process of splitting sentences into their constituent words. This also includes splitting standard contractions (e.g., "it's" becomes "it" and "'s") and treating punctuation marks (like commas, single quotes, and periods followed by white-space) as separate tokens. The code snippet below shows how to word-tokenize text using the NLTK and TextBlob libraries.
Word Tokenizer
import nltk
from textblob import TextBlob

nltk.download('punkt')  # tokenizer models used by word_tokenize and TextBlob

data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."

nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).words

print(nltk_output)
print(textblob_output)
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']

Note that NLTK keeps the punctuation marks as separate tokens, while TextBlob's words property drops them.
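To see what these tokenizers are doing under the hood, here is a minimal, standard-library-only sketch of the splitting rules described above: contraction suffixes like "'s" and "n't" become their own tokens, and punctuation is separated from words. This is only an illustration; real tokenizers such as NLTK's word_tokenize handle many more cases (abbreviations like "Mr.", quotes, ellipses). The function name simple_word_tokenize is our own, not part of either library.

```python
import re

# Order matters: peel a word off before a trailing "n't", then match "n't"
# itself, then other contraction suffixes, then plain words, then any
# single non-space punctuation character.
TOKEN_RE = re.compile(r"\w+?(?=n't)|n't|'(?:s|re|ve|ll|d|m)|\w+|[^\w\s]")

def simple_word_tokenize(text):
    """Naive treebank-style word tokenizer (illustrative only)."""
    return TOKEN_RE.findall(text)

print(simple_word_tokenize("It's raining, isn't it?"))
# → ['It', "'s", 'raining', ',', 'is', "n't", 'it', '?']
```

This mirrors the NLTK output above: "it's" splits into "it" and "'s", and the comma and question mark come out as separate tokens.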
2019-04-24T15:30:39+05:30
Amit Arora