18/19 BT5153 Report Analysis

Tasks

Here, we have 15 project reports from our MSBA students who took BT5153 last year. Each report discusses a machine learning system for text data. These 15 reports were processed into a small dataset via the 01 PDF Extraction.ipynb notebook. In this dataset, each report consists of its main text, its number of pages, and its final score. We are going to use natural language processing methods to analyze these reports.

  1. Basic NLP Techniques (a minimal sketch follows this list):
    • Tokenization: breaking text into tokens (words, sentences, n-grams)
    • Stop word removal: removing common words
    • TF (term frequency): computing word importance
    • Stemming and lemmatization: reducing words to their base form
    • LDA: topic modelling for text
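
As a quick, self-contained illustration of these building blocks, here is a minimal sketch on a made-up sentence. It assumes NLTK is installed together with its 'punkt', 'stopwords', and 'wordnet' data; the sentence and variable names are illustrative only and are not part of the report data.

from collections import Counter

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# A made-up example sentence, for illustration only
sentence = "The models were trained on thousands of restaurant reviews."

# Tokenization: break the sentence into word tokens
tokens = word_tokenize(sentence.lower())

# Stop word removal: drop common English words and non-alphabetic tokens
stop_english = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stop_english]

# TF: count how often each remaining word occurs
print(Counter(tokens))

# Lemmatization: reduce words to their base form, e.g. 'models' -> 'model'
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
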
In [1]:
import pandas as pd

# Load the dataset  
reports = pd.read_csv('reprots_2018.csv', index_col=None)

# check the first three rows
reports.head(3)
Out[1]:
                                                text  numpage
0  Table of Contents Introduction Problem Definit...       16
1  Table of Contents Introduction Background Prob...       11
2  Table of Contents INTRODUCTION DATA COLLECTION...       20

1. Preprocess Text Data

First, we clean the raw report text. Each report is split into tokens on whitespace, the tokens are lowercased, English stop words and single-character tokens are removed, and the remaining words are lemmatized to their base form. We keep two versions of the corpus: one with stop word removal only, and one with stop word removal plus lemmatization.

In [6]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_english = stopwords.words('english')

# Tokenize each report by splitting on spaces
reports['clean_text'] = reports['text'].apply(lambda x: x.split(' '))
# Lowercase the tokens, then remove stop words and single-character tokens
reports['clean_text'] = reports['clean_text'].apply(lambda x: [w.lower() for w in x if w.lower() not in stop_english and len(w) >= 2])
# Lemmatization: reduce words to their base form, e.g. 'movies' -> 'movie'
reports['stemed_clean_text'] = reports['clean_text'].apply(lambda x: [lemmatizer.lemmatize(w) for w in x])
# Corpus with stop word removal only
final_corpus = reports['clean_text'].tolist()
# Corpus with stop word removal and lemmatization
stemed_final_corpus = reports['stemed_clean_text'].tolist()

2. A word cloud to visualize the preprocessed text data

In order to verify whether the preprocessing happened correctly, we can make a word cloud from the most frequent words in each report. This gives us a visual representation of the most common words. Visualisation is key to understanding whether we are still on the right track! In addition, it allows us to verify whether we need additional preprocessing before further analyzing the text data.

Python has a massive number of open-source libraries! Instead of trying to develop a method to create word clouds ourselves, we'll use Andreas Mueller's wordcloud library.

In [7]:
from collections import Counter
# Number of top-frequency words to keep from each report
N_top = 5
# Collect the top words across all reports
top_nwords = []
# Process each report
for final_doc in final_corpus:
    c_doc = Counter(final_doc)
    # The N_top most frequent words, as a list of (word, count) tuples
    top_words = c_doc.most_common(N_top)
    print(top_words)
    # Keep only the word (the first element of each tuple)
    t_nwords = [w[0] for w in top_words]
    # Concatenate the lists, e.g. [a, b, c] + [d, a] = [a, b, c, d, a]
    top_nwords = top_nwords + t_nwords
[('user', 67), ('model', 55), ('restaurant', 53), ('review', 52), ('text', 45)]
[('model', 42), ('word', 30), ('toxic', 30), ('data', 24), ('lstm', 22)]
[('movie', 58), ('movies', 42), ('data', 34), ('reviews', 29), ('revenue', 25)]
[('questions', 44), ('insincere', 35), ('words', 35), ('word', 30), ('sincere', 27)]
[('features', 100), ('model', 81), ('text', 71), ('review', 63), ('models', 46)]
[('talk', 27), ('ted', 22), ('talks', 20), ('string', 15), ('views', 14)]
[('tweets', 56), ('model', 35), ('models', 33), ('topic', 31), ('sentiment', 28)]
[('project', 55), ('model', 20), ('feature', 19), ('words', 19), ('approval', 19)]
[('comments', 36), ('data', 35), ('model', 29), ('toxic', 22), ('online', 21)]
[('features', 51), ('feature', 47), ('data', 29), ('image', 25), ('model', 24)]
[('features', 33), ('comments', 31), ('character', 29), ('comment', 28), ('model', 22)]
[('reviews', 84), ('star', 63), ('app', 61), ('thank', 48), ('replies', 44)]
[('movie', 45), ('tweets', 35), ('cnn', 27), ('tweet', 26), ('features', 26)]
[('wine', 69), ('aroma', 35), ('words', 26), ('model', 25), ('world', 22)]
In [8]:
# Import the wordcloud library
import wordcloud

# Join the top words from all reports into one long string.
long_string = " ".join(top_nwords)

# Create a WordCloud object
wcloud = wordcloud.WordCloud()

# Generate a word cloud
wcloud.generate(long_string)

# Visualize the word cloud
wcloud.to_image()
Out[8]:
After Lemmatization
In [9]:
from collections import Counter
top_nwords = []
for final_doc in stemed_final_corpus:
    c_doc = Counter(final_doc)
    top_words = c_doc.most_common(5)
    print(top_words)
    t_nwords = [w[0] for w in top_words]
    top_nwords = top_nwords + t_nwords
[('restaurant', 86), ('user', 79), ('model', 65), ('review', 65), ('feature', 47)]
[('model', 60), ('word', 46), ('comment', 34), ('toxic', 30), ('data', 24)]
[('movie', 100), ('model', 37), ('data', 34), ('review', 33), ('revenue', 28)]
[('word', 65), ('question', 60), ('insincere', 35), ('sincere', 27), ('model', 26)]
[('model', 127), ('feature', 122), ('review', 103), ('text', 71), ('using', 39)]
[('talk', 47), ('ted', 22), ('topic', 17), ('number', 16), ('view', 16)]
[('tweet', 74), ('model', 68), ('word', 43), ('topic', 43), ('sentiment', 42)]
[('project', 68), ('feature', 32), ('model', 31), ('word', 29), ('essay', 25)]
[('comment', 41), ('model', 39), ('data', 35), ('toxic', 22), ('online', 21)]
[('feature', 98), ('model', 46), ('image', 41), ('pet', 41), ('data', 29)]
[('comment', 59), ('feature', 42), ('model', 31), ('character', 29), ('article', 24)]
[('review', 126), ('reply', 81), ('star', 75), ('app', 61), ('model', 59)]
[('tweet', 61), ('movie', 58), ('model', 44), ('feature', 40), ('cnn', 27)]
[('wine', 79), ('word', 37), ('country', 37), ('aroma', 36), ('description', 33)]
In [10]:
long_string = " ".join(top_nwords)
wcloud = wordcloud.WordCloud()
wcloud.generate(long_string)
wcloud.to_image()
Out[10]:
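
If you want to tweak the word cloud's appearance, the WordCloud constructor accepts optional styling parameters. A minimal sketch, reusing long_string from above (the parameter values here are arbitrary examples):

# Optional styling; values are arbitrary examples
styled_cloud = wordcloud.WordCloud(width=800, height=400, background_color='white', max_words=50)
styled_cloud.generate(long_string)
styled_cloud.to_image()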

3. Prepare the text for LDA analysis

The main text analysis method that we will use is latent Dirichlet allocation (LDA). LDA can perform topic detection on large document sets, determining what the main 'topics' are in a large unlabeled collection of texts. A 'topic' is a collection of words that tend to co-occur often. The hypothesis is that LDA can reveal the different topics covered by the project reports. These topics can then be used as a starting point for further analysis.

LDA does not work directly on text data. First, it is necessary to convert the documents into a simple vector representation, which LDA then uses to determine the topics. Each entry of a 'document vector' corresponds to the number of times a word occurs in that document. In short, we will convert the list of reports into a list of vectors, each with length equal to the vocabulary size. For example, the sentence 'Analyzing machine learning trends with neural networks.' would be transformed into a vector such as [1, 0, 1, ..., 1, 0].
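
To make the document-vector idea concrete, here is a minimal sketch of the vectorization step on two made-up sentences (the sentences and variable names are illustrative only; the actual report texts are vectorized in the cell below):

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up mini documents, for illustration only
toy_docs = ["analyzing machine learning trends with neural networks",
            "machine learning with text data"]

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# Vocabulary words, in column order
print(sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get))
# Each row is one document; each entry is how often that word occurs in it
print(toy_vectors.toarray())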

We'll then plot the 10 most common words based on the outcome of this operation (the list of document vectors). As a check, these words should also occur in the word cloud.

In [11]:
# Load the library with the CountVectorizer method
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Helper function
def plot_10_most_common_words(count_data, count_vectorizer):
    import matplotlib.pyplot as plt
    words = count_vectorizer.get_feature_names()
    total_counts = np.zeros(len(words))
    for t in count_data:
        total_counts+=t.toarray()[0]
    
    count_dict = (zip(words, total_counts))
    count_dict = sorted(count_dict, key=lambda x:x[1], reverse=True)[0:10]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words)) 

    plt.bar(x_pos, counts,align='center')
    plt.xticks(x_pos, words, rotation=90) 
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.title('10 most common words')
    plt.show()

# Initialise the count vectorizer (stop words have already been removed during preprocessing)
count_vectorizer = CountVectorizer()

# Join the token lists back into whitespace-separated strings for CountVectorizer
all_cleancorpus = [' '.join(x) for x in stemed_final_corpus]
# Fit and transform the processed reports
count_data = count_vectorizer.fit_transform(all_cleancorpus)

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
In [12]:
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
 
# Helper function
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        
# Tweak the two parameters below (use int values below 15)
number_topics = 6
number_words = 6

# Create and fit the LDA model
lda = LDA(n_components=number_topics)
lda.fit(count_data)

# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
Topics found via LDA:

Topic #0:
word question talk data model insincere

Topic #1:
feature model pet image data animal

Topic #2:
system nb scale suitable hard factor

Topic #3:
model comment word data feature tweet

Topic #4:
model review movie user restaurant feature

Topic #5:
model review feature text word reply
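
The fitted model can also assign a topic mixture to each report, which is one way to use these topics as a starting point for further analysis. A minimal sketch, assuming lda, count_data, and the reports DataFrame from the cells above are still in memory (the main_topic column name is just illustrative):

import numpy as np

# Topic mixture for each report: one row per report, one column per topic
doc_topics = lda.transform(count_data)

# The single most likely topic for each report (column name is illustrative)
reports['main_topic'] = np.argmax(doc_topics, axis=1)
print(reports[['numpage', 'main_topic']].head())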