How to calculate a word-word co-occurrence matrix?
A co-occurrence matrix has specific entities in its rows (ER) and columns (EC). Its purpose is to present the number of times each ER appears in the same context as each EC. Consequently, to use a co-occurrence matrix you have to define your entities and the context in which they co-occur. It is an easy way to understand relationships between entities, such as words or other n-grams, in terms of how frequently they appear together. In the example below, the entities are words and the context is adjacency: two words co-occur when one immediately follows the other, that is, when they form a bigram.
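To make the idea of "entities plus context" concrete, here is a minimal sketch that counts co-occurrences with only the standard library. It assumes the context is a symmetric window inside a single sentence; the function name `count_cooccurrences` and the `window` parameter are illustrative choices, not something the rest of this post relies on.

```python
from collections import Counter


def count_cooccurrences(sentences, window=1):
    """Count ordered (row word, column word) pairs that share a context.

    Assumption for this sketch: two words co-occur when they are at most
    `window` positions apart within the same sentence.
    """
    counts = Counter()
    for tokens in sentences:
        for i, row_word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:  # a word is not its own context
                    counts[(row_word, tokens[j])] += 1
    return counts


sentences = [["Python", "is", "used"], ["Python", "is", "best"]]
counts = count_cooccurrences(sentences, window=1)

print(counts[("Python", "is")])   # 2
print(counts[("is", "Python")])   # 2
print(counts[("used", "best")])   # 0, the words never share a sentence
```

Widening `window` changes what counts as "the same context", which is exactly the definition you have to settle before building the matrix.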
Generate Co-occurrence Matrix
import itertools

import nltk
import numpy as np
import pandas as pd
from nltk import bigrams


def generate_co_occurrence_matrix(corpus):
    vocab = list(set(corpus))
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for (previous, current), count in bigram_freq:
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count

    # Return the matrix (a plain NumPy array) and the word-to-position index
    return co_occurrence_matrix, vocab_index


# Note the comma between 'Python' and 'used': without it, Python joins the
# two adjacent string literals into the single token 'Pythonused'.
text_data = [['Where', 'Python', 'is', 'used'],
             ['What', 'is', 'Python', 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

# Create one list using many lists
# (bigrams that span sentence boundaries are therefore counted too)
data = list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)

# Use the vocabulary words, in index order, as row and column labels
labels = list(vocab_index)
data_matrix = pd.DataFrame(matrix, index=labels, columns=labels)
print(data_matrix)
Output (the row and column order follows set ordering, so it can differ between runs; the Pythonused label comes from a run in which 'Python' and 'used' were joined into a single token):

            best  use  What  Where  ...   in   is  Python  used
best         0.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   1.0
use          0.0  0.0   0.0    0.0  ...  0.0  1.0     0.0   0.0
What         1.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   0.0
Where        0.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   0.0
Pythonused   0.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   1.0
Why          0.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   1.0
companies    0.0  1.0   0.0    1.0  ...  1.0  0.0     0.0   0.0
in           0.0  0.0   0.0    0.0  ...  0.0  0.0     1.0   0.0
is           0.0  0.0   1.0    0.0  ...  0.0  0.0     0.0   0.0
Python       0.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   0.0
used         0.0  0.0   1.0    0.0  ...  0.0  0.0     0.0   0.0

[11 rows x 11 columns]
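Because the vocabulary above is built from a set, the row and column order is not stable, and chaining the sentences into one list also counts bigrams that straddle two sentences. The following is a minimal sketch of a variant that avoids both, assuming a sorted vocabulary, bigrams taken within each sentence only, and 'Python' and 'used' as separate tokens in the second sentence. The helper name `bigram_matrix` is hypothetical, and it uses plain nested lists rather than NumPy to stay dependency-free.

```python
def bigram_matrix(sentences):
    """Build a dense matrix where matrix[current][previous] counts bigrams.

    Hypothetical helper for illustration: bigrams never cross sentence
    boundaries, and the vocabulary is sorted so the layout is reproducible.
    """
    vocab = sorted({word for tokens in sentences for word in tokens})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for tokens in sentences:
        # zip pairs each word with its successor inside the same sentence
        for previous, current in zip(tokens, tokens[1:]):
            matrix[index[current]][index[previous]] += 1
    return matrix, index


text_data = [['Where', 'Python', 'is', 'used'],
             ['What', 'is', 'Python', 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

matrix, index = bigram_matrix(text_data)

print(len(index))                             # 10 distinct words
print(matrix[index['is']][index['Python']])   # 2: "Python is" occurs twice
print(matrix[index['What']][index['used']])   # 0: "used What" only spans sentences
```

To look up any other pair, index the row with the current word and the column with the word that precedes it, the same convention as `co_occurrence_matrix[current][previous]` above.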
2019-05-24T15:30:33+05:30
Amit Arora