19/20 BT5153 Report Analysis

Tasks

Here, we have 20 project reports from MSBA students who took BT5153 last year. Each report discusses a machine learning system for text data. These 20 reports were processed into a tiny dataset via the 01 PDF Extraction.ipynb notebook. In this dataset, each report consists of its main text, its number of pages, and its final score. We are going to utilize natural language processing methods to analyze these reports.

  1. Basic NLP Techniques (see the short sketch after this list):
    • Tokenization: breaking text into tokens (words, sentences, n-grams)
    • Stop word removal: removing common words such as 'the' and 'of'
    • Term frequency (TF): computing word importance by counting occurrences
    • Stemming and lemmatization: reducing words to their base form
    • LDA: topic modelling for text
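As a quick illustration of the first four techniques, here is a minimal sketch on a single toy sentence. It assumes the NLTK stopwords and wordnet corpora have already been downloaded; the sentence and variable names are illustrative only and are not part of the assignment data.

In [ ]:
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Toy sentence, for illustration only
sentence = "Our models predict the ratings of hotel reviews"

# Tokenization: break the text into word tokens (here simply by splitting on whitespace)
tokens = sentence.lower().split()

# Stop word removal: drop common English words such as 'the' and 'of'
stop_english = set(stopwords.words('english'))
content_tokens = [w for w in tokens if w not in stop_english]

# Lemmatization: reduce words to their base form, e.g., 'reviews' -> 'review'
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w) for w in content_tokens]

# TF: term frequency, i.e., how often each word occurs
print(Counter(lemmas))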
In [1]:
import pandas as pd

# Load the dataset  
reports = pd.read_csv('reprots_2019_20.csv', index_col=None)

# check the first three rows
reports.head(3)
Out[1]:
                                                text  numpage
0  In a survey was conducted in Singapore by glob...        8
1  BT Team The Quintet Group Project Report terns...       11
2  Predicting Good Stack Overflow Answers Figure ...       11
In [2]:
reports.shape
Out[2]:
(20, 2)

1. Preprocess Text Data

Before any analysis, each report's text is tokenized by splitting on whitespace and lowercased; English stop words and single-character tokens are removed, and the remaining words are lemmatized to their base form (e.g., 'reviews' becomes 'review'). We keep two versions of the corpus: one with only stop word removal, and one that is additionally lemmatized, so we can compare their effect later.

Python has a massive number of open-source libraries! Instead of implementing these steps ourselves, we use NLTK's English stop word list and its WordNetLemmatizer.

In [3]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 
stop_english = stopwords.words('english')
# Tokenize each report by splitting on spaces
reports['clean_text'] = reports['text'].apply(lambda x: x.split(' '))
# Lowercase the tokens, then remove stop words and single-character tokens
reports['clean_text'] = reports['clean_text'].apply(lambda x: [w.lower() for w in x if w.lower() not in stop_english and len(w) >= 2])
# Lemmatization: reduce words to their base form, e.g., reviews -> review
reports['stemed_clean_text'] = reports['clean_text'].apply(lambda x: [lemmatizer.lemmatize(w) for w in x])
# Only stop word removal
final_corpus = reports['clean_text'].tolist()
# Stop word removal and words stem
stemed_final_corpus = reports['stemed_clean_text'].tolist()
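A quick sanity check, not part of the original notebook, to eyeball the effect of the cleaning and lemmatization on the first report:

In [ ]:
# Compare the raw text with the cleaned and lemmatized token lists of report 0
print(reports['text'].iloc[0][:100])
print(reports['clean_text'].iloc[0][:10])
print(reports['stemed_clean_text'].iloc[0][:10])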

3. A word cloud to visualize the preprocessed text data

To verify that the preprocessing worked as intended, we can build a word cloud from the most frequent words of each report. This gives us a visual representation of the most common words across the corpus. Visualisation is key to understanding whether we are still on the right track! In addition, it lets us check whether we need additional preprocessing before analyzing the text data further.

Python has a massive number of open-source libraries! Instead of trying to develop a method to create word clouds ourselves, we'll use Andreas Mueller's wordcloud library.

In [4]:
from collections import Counter
# how many of the most frequent words to keep per report
N_top = 5
# store the high-frequency words from all reports
top_nwords = []
# check each report 
for final_doc in final_corpus:
    c_doc = Counter(final_doc)
    # obtain top N_top words
    top_words = c_doc.most_common(N_top)
    # a list of tuple
    print(top_words)
    # the first element in tuple is the word
    t_nwords = [w[0] for w in top_words]
    # concatenate the list, e.g., [a,b,c] + [d,a] = [a,b,c,d,a]
    top_nwords = top_nwords + t_nwords
[('news', 54), ('fake', 33), ('score', 26), ('dataset', 22), ('data', 19)]
[('data', 47), ('host', 42), ('review', 33), ('airbnb', 32), ('reviews', 28)]
[('answer', 103), ('model', 65), ('accepted', 64), ('answers', 56), ('question', 47)]
[('model', 34), ('ad', 30), ('match', 28), ('data', 26), ('classi', 26)]
[('model', 57), ('image', 43), ('images', 41), ('search', 39), ('cation', 32)]
[('hotel', 66), ('model', 46), ('rating', 43), ('reviews', 34), ('review', 33)]
[('model', 53), ('movie', 46), ('genre', 37), ('analysis', 31), ('learning', 31)]
[('mvp', 71), ('approach', 53), ('data', 45), ('players', 42), ('top', 41)]
[('sentiment', 56), ('weibo', 39), ('user', 31), ('data', 25), ('positive', 22)]
[('climate', 45), ('change', 42), ('model', 41), ('tweets', 32), ('sentiment', 31)]
[('model', 56), ('topic', 37), ('using', 34), ('text', 31), ('score', 30)]
[('tweets', 41), ('tweet', 38), ('rumor', 37), ('model', 35), ('features', 29)]
[('model', 71), ('data', 61), ('price', 51), ('features', 44), ('image', 31)]
[('article', 56), ('model', 47), ('articles', 46), ('news', 44), ('features', 38)]
[('reviews', 154), ('negative', 70), ('user', 65), ('review', 64), ('sentiment', 62)]
[('model', 75), ('toxic', 67), ('words', 46), ('comment', 42), ('text', 37)]
[('images', 69), ('image', 55), ('tags', 55), ('dataset', 34), ('model', 31)]
[('reviews', 37), ('text', 31), ('model', 29), ('based', 27), ('review', 25)]
[('fires', 48), ('fire', 45), ('data', 42), ('model', 40), ('models', 34)]
[('book', 133), ('search', 103), ('books', 93), ('genres', 60), ('model', 57)]
In [5]:
# Import the wordcloud library
import wordcloud

# Join the selected top words into one long string
long_string = " ".join(top_nwords)

# Create a WordCloud object
wcloud = wordcloud.WordCloud()

# Generate a word cloud
wcloud.generate(long_string)

# Visualize the word cloud
wcloud.to_image()
Out[5]:
[word cloud image]
After Word Stemming (Lemmatization)
In [6]:
from collections import Counter
top_nwords = []
for final_doc in stemed_final_corpus:
    c_doc = Counter(final_doc)
    top_words = c_doc.most_common(5)
    print(top_words)
    t_nwords = [w[0] for w in top_words]
    top_nwords = top_nwords + t_nwords
[('news', 54), ('fake', 33), ('model', 26), ('score', 26), ('dataset', 22)]
[('review', 61), ('host', 49), ('data', 47), ('score', 39), ('feature', 38)]
[('answer', 159), ('model', 73), ('accepted', 64), ('question', 54), ('post', 51)]
[('match', 44), ('player', 44), ('model', 40), ('ad', 35), ('team', 33)]
[('image', 84), ('model', 84), ('product', 47), ('layer', 41), ('search', 39)]
[('hotel', 84), ('review', 67), ('model', 66), ('rating', 61), ('customer', 25)]
[('model', 82), ('movie', 54), ('feature', 51), ('genre', 43), ('layer', 37)]
[('player', 76), ('mvp', 71), ('approach', 61), ('tweet', 54), ('data', 45)]
[('sentiment', 68), ('user', 48), ('weibo', 39), ('post', 30), ('score', 26)]
[('model', 60), ('sentiment', 51), ('climate', 45), ('change', 42), ('tweet', 40)]
[('model', 76), ('topic', 51), ('customer', 36), ('score', 36), ('using', 34)]
[('tweet', 79), ('rumor', 58), ('word', 50), ('model', 48), ('feature', 39)]
[('model', 101), ('feature', 75), ('data', 61), ('price', 59), ('image', 35)]
[('article', 102), ('model', 62), ('feature', 52), ('news', 44), ('word', 43)]
[('review', 218), ('sentiment', 90), ('user', 85), ('negative', 70), ('topic', 69)]
[('model', 101), ('comment', 68), ('word', 68), ('toxic', 67), ('text', 41)]
[('image', 124), ('tag', 71), ('model', 39), ('dataset', 34), ('product', 34)]
[('review', 62), ('feature', 47), ('model', 38), ('text', 34), ('user', 33)]
[('fire', 93), ('model', 74), ('class', 59), ('data', 42), ('wildfire', 26)]
[('book', 226), ('search', 107), ('genre', 105), ('model', 80), ('description', 53)]
In [7]:
long_string = " ".join(top_nwords)
wcloud = wordcloud.WordCloud()
wcloud.generate(long_string)
wcloud.to_image()
Out[7]:
[word cloud image]

4. Prepare the text for LDA analysis

The main text analysis method that we will use is latent Dirichlet allocation (LDA). LDA performs topic detection on large document sets, determining what the main 'topics' are in a large unlabeled collection of texts. A 'topic' is a collection of words that tend to co-occur often. The hypothesis is that LDA can clarify what the different topics across the project reports are. These topics can then be used as a starting point for further analysis.

LDA does not work directly on text data. First, the documents must be converted into a simple vector representation, which LDA then uses to determine the topics. Each entry of a 'document vector' corresponds to the number of times a word occurs in that document. In other words, we will convert the list of reports into a list of vectors, each with length equal to the vocabulary size. For example, a short document such as 'Analyzing machine learning trends with neural networks.' would be transformed into a count vector like [1, 0, 1, ..., 1, 0].

We'll then plot the 10 most common words based on the outcome of this operation (the list of document vectors). As a check, these words should also appear in the word clouds above.
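Before vectorizing the full corpus, here is a minimal sketch of this step on a toy two-document corpus (the documents are illustrative only). Each row of the resulting array is one document vector of word counts, with the columns ordered by the fitted vocabulary.

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, for illustration only
toy_docs = ["analyzing machine learning trends with neural networks",
            "neural networks for text data"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Vocabulary learned from the toy corpus (one column per word)
print(toy_vectorizer.get_feature_names())
# One row per document; each entry is a word count
print(toy_counts.toarray())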

In [8]:
# Load the library with the CountVectorizer method
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
# Helper function
def plot_10_most_common_words(count_data, count_vectorizer):
    import matplotlib.pyplot as plt
    words = count_vectorizer.get_feature_names()
    total_counts = np.zeros(len(words))
    for t in count_data:
        total_counts+=t.toarray()[0]
    
    count_dict = (zip(words, total_counts))
    count_dict = sorted(count_dict, key=lambda x:x[1], reverse=True)[0:10]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words)) 

    plt.bar(x_pos, counts,align='center')
    plt.xticks(x_pos, words, rotation=90) 
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.title('10 most common words')
    plt.show()

# Initialise the count vectorizer (stop words were already removed during preprocessing)
count_vectorizer = CountVectorizer()

# Join each report's token list back into a single string
all_cleancorpus = [' '.join(x) for x in stemed_final_corpus]
# Fit and transform the processed reports
count_data = count_vectorizer.fit_transform(all_cleancorpus)

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
In [9]:
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
 
# Helper function
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        
# Tweak the two parameters below (use int values below 15)
number_topics = 6
number_words = 6

# Create and fit the LDA model
lda = LDA(n_components=number_topics)
lda.fit(count_data)

# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
Topics found via LDA:

Topic #0:
model data word score comment feature

Topic #1:
model data word feature article hotel

Topic #2:
image model review dataset product feature

Topic #3:
model book genre feature search data

Topic #4:
review sentiment user negative positive topic

Topic #5:
model answer feature data score sentiment
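As a possible follow-up, not shown in the original analysis, the fitted model can also assign each report a mixture over the six topics via lda.transform; these per-report topic proportions could then be related back to the reports' final scores.

In [ ]:
# Document-topic distribution: one row per report, one column per topic; rows sum to 1
doc_topics = lda.transform(count_data)
print(doc_topics.shape)        # expected: (20, 6)
print(doc_topics[0].round(3))  # topic mixture of the first report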