Here, we have 15 project reports from our MSBA students who took BT5153 last year. Each report discusses a machine learning system for text data. These 15 reports were processed into a small dataset via the 01 PDF Extraction.ipynb notebook. In this dataset, each report consists of its main text, its number of pages, and its final score. We are going to use natural language processing methods to analyze these reports.
import pandas as pd
# Load the dataset
reports = pd.read_csv('reprots_2018.csv', index_col=None)
# check the first three rows
reports.head(3)
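As a quick sanity check on the load, the minimal sketch below only assumes the reports DataFrame created above; we expect 15 rows, one per report.
# Quick sanity checks on the loaded dataset
print(reports.shape)    # expect 15 rows, one per report
print(reports.columns)  # should include the main text, page count, and score columns
print(reports.dtypes)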
Before any analysis, we first need to clean the raw report texts: tokenize on whitespace, lowercase the tokens, remove English stop words and single-character tokens, and lemmatize what remains. A word cloud of the most frequent words will later help us verify that this preprocessing worked as intended.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Note: the first run may require nltk.download('stopwords') and nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_english = stopwords.words('english')
# Tokenize each report on whitespace
reports['clean_text'] = reports['text'].apply(lambda x: x.split(' '))
# Lowercase, then remove stop words and single-character tokens
reports['clean_text'] = reports['clean_text'].apply(lambda x: [w.lower() for w in x if w.lower() not in stop_english and len(w) >= 2])
# Text normalization via lemmatization, e.g., 'pages' -> 'page'
reports['stemed_clean_text'] = reports['clean_text'].apply(lambda x: [lemmatizer.lemmatize(w) for w in x])
# Corpus with stop word removal only
final_corpus = reports['clean_text'].tolist()
# Corpus with stop word removal and lemmatization
stemed_final_corpus = reports['stemed_clean_text'].tolist()
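To check that the cleaning behaves as expected, we can compare token counts before and after for a single report. This is a minimal sketch that only uses the columns created above.
# Compare token counts before and after cleaning for the first report
raw_tokens = reports['text'].iloc[0].split(' ')
clean_tokens = reports['clean_text'].iloc[0]
lemma_tokens = reports['stemed_clean_text'].iloc[0]
# The cleaned and lemmatized lists should be noticeably shorter than the raw one
print(len(raw_tokens), len(clean_tokens), len(lemma_tokens))
# Peek at the first few cleaned tokens
print(clean_tokens[:10])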
In order to verify whether the preprocessing happened correctly, we can make a word cloud of the most frequent words in the processed reports. This gives us a visual representation of the most common words. Visualisation is key to understanding whether we are still on the right track! In addition, it allows us to check whether we need additional preprocessing before analyzing the text data further.
Python has a massive number of open-source libraries! Instead of trying to develop a method to create word clouds ourselves, we'll use Andreas Mueller's wordcloud library.
from collections import Counter
# Number of top-frequency words to keep from each report
N_top = 5
# Store the top words collected from all reports
top_nwords = []
# Process each report
for final_doc in final_corpus:
    c_doc = Counter(final_doc)
    # obtain the top N_top words
    top_words = c_doc.most_common(N_top)
    # a list of (word, count) tuples
    print(top_words)
    # the first element in each tuple is the word
    t_nwords = [w[0] for w in top_words]
    # concatenate the lists, e.g., [a, b, c] + [d, a] = [a, b, c, d, a]
    top_nwords = top_nwords + t_nwords
# Import the wordcloud library
import wordcloud
# Join the top words from all reports into one long string
long_string = " ".join(top_nwords)
# Create a WordCloud object
wcloud = wordcloud.WordCloud()
# Generate a word cloud
wcloud.generate(long_string)
# Visualize the word cloud
wcloud.to_image()
from collections import Counter
# Repeat the same procedure on the lemmatized corpus
top_nwords = []
for final_doc in stemed_final_corpus:
    c_doc = Counter(final_doc)
    # obtain the top 5 words of each report
    top_words = c_doc.most_common(5)
    print(top_words)
    t_nwords = [w[0] for w in top_words]
    top_nwords = top_nwords + t_nwords
# Join the top words from all reports into one long string
long_string = " ".join(top_nwords)
# Build and visualize the word cloud for the lemmatized corpus
wcloud = wordcloud.WordCloud()
wcloud.generate(long_string)
wcloud.to_image()
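Since we already count the words with Counter, an alternative is to build the cloud directly from aggregate frequencies instead of re-joining and re-counting a string. The sketch below assumes the wordcloud library's generate_from_frequencies method and the stemed_final_corpus list defined above.
from collections import Counter
import wordcloud
# Aggregate word counts across all lemmatized reports
all_counts = Counter()
for doc in stemed_final_corpus:
    all_counts.update(doc)
# Build the cloud directly from the frequency dictionary,
# so word sizes reflect corpus-wide counts rather than re-tokenized text
wcloud_freq = wordcloud.WordCloud()
wcloud_freq.generate_from_frequencies(dict(all_counts))
wcloud_freq.to_image()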
The main text analysis method that we will use is latent Dirichlet allocation (LDA). LDA performs topic detection on large document sets, determining what the main 'topics' are in a large unlabeled collection of texts. A 'topic' is a collection of words that tend to co-occur often. The hypothesis is that LDA might be able to clarify what the different topics across the reports are. These topics can then be used as a starting point for further analysis.
LDA does not work directly on text data. First, it is necessary to convert the documents into a simple vector representation. This representation will then be used by LDA to determine the topics. Each entry of a 'document vector' corresponds to the number of times a word occurs in that document. In short, we will convert the list of reports into a list of vectors, each with length equal to the vocabulary size. For example, 'Analyzing machine learning trends with neural networks.' would be transformed into [1, 0, 1, ..., 1, 0].
We'll then plot the 10 most common words based on the outcome of this operation (the list of document vectors). As a check, these words should also occur in the word cloud.
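Before running this on the reports, here is a tiny illustration of the bag-of-words vectors described above. The two toy sentences and the toy_docs / toy_vectorizer names are made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
# Two toy documents, chosen only to illustrate the representation
toy_docs = [
    "analyzing machine learning trends with neural networks",
    "machine learning for text data",
]
toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)
# Vocabulary learned from the toy corpus
# (on scikit-learn < 1.0 this method is called get_feature_names)
print(toy_vectorizer.get_feature_names_out())
# One count vector per document, with length equal to the vocabulary size
print(toy_vectors.toarray())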
# Load the library with the CountVectorizer method
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
# Helper function
def plot_10_most_common_words(count_data, count_vectorizer):
    import matplotlib.pyplot as plt
    # Vocabulary learned by the vectorizer
    # (on scikit-learn >= 1.0 this method is called get_feature_names_out)
    words = count_vectorizer.get_feature_names()
    total_counts = np.zeros(len(words))
    # Sum the counts of each word over all documents
    for t in count_data:
        total_counts += t.toarray()[0]
    # Keep the 10 words with the highest total counts
    count_dict = zip(words, total_counts)
    count_dict = sorted(count_dict, key=lambda x: x[1], reverse=True)[0:10]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words))
    plt.bar(x_pos, counts, align='center')
    plt.xticks(x_pos, words, rotation=90)
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.title('10 most common words')
    plt.show()
# Initialise the count vectorizer (stop words were already removed during preprocessing)
count_vectorizer = CountVectorizer()
# Rebuild each lemmatized report as a single space-separated string
all_cleancorpus = [' '.join(x) for x in stemed_final_corpus]
# Fit and transform the processed report texts
count_data = count_vectorizer.fit_transform(all_cleancorpus)
# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
# Helper function
def print_topics(model, count_vectorizer, n_top_words):
    # Vocabulary learned by the vectorizer
    # (on scikit-learn >= 1.0 this method is called get_feature_names_out)
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        # argsort is ascending, so take the last n_top_words indices in reverse
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
# Tweak the two parameters below (use int values below 15)
number_topics = 6
number_words = 6
# Create and fit the LDA model (random_state is fixed so the topics are reproducible)
lda = LDA(n_components=number_topics, random_state=0)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
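The topics above can serve as the starting point for further analysis mentioned earlier. For example, LDA's transform method returns a per-document topic distribution; the minimal sketch below (assuming the fitted lda and count_data from the previous cell, and a hypothetical dominant_topic column name) tags each report with its most probable topic.
# Per-document topic distributions: one row per report, one column per topic
doc_topic = lda.transform(count_data)
# The most probable topic for each report
dominant_topic = doc_topic.argmax(axis=1)
# Attach it back to the reports DataFrame for further analysis
reports['dominant_topic'] = dominant_topic
print(reports['dominant_topic'].value_counts())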