Text mining - Process - R

This is Part II of a four-part post. Part I talks about collecting text data from Twitter while Part II discusses analysis on text data i.e. text mining. Part III outlines the process of presenting the data using Tableau and Part IV delves into insights from the analysis.

With the data scraped from Twitter, we can now analyse the tweets to find the words that appear most often. I used Tableau to extract the columns I needed from the scraped JSON file, then copied them out and saved them as a csv file. (I'll look into loading the JSON file directly with R next time; for now I used Tableau because loading data with its drag-and-drop function is easy. Feel free to comment below if you have a solution.) This is what the csv file for text mining looks like.

Note: Make sure that the file is saved in the directory that R is pointing to. To check the working directory, run getwd(). To change it, run setwd("C:/Users/HuiXiang/Documents"), replacing the path with the directory you would like to use.
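Here is a minimal sketch of that check in code, so that a missing file is caught before read.csv fails. The helper name in_working_dir is my own and not part of R.

```r
# Hypothetical helper: TRUE if `filename` sits in the current working directory.
in_working_dir <- function(filename) {
  file.exists(file.path(getwd(), filename))
}

getwd()  # shows where R is currently pointing

if (!in_working_dir("TweetsOB.csv")) {
  message("TweetsOB.csv not found in ", getwd(),
          "; move the file there or change the directory with setwd().")
}
```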

We need to install the R packages "tm" and "SnowballC". Click Packages > Install package(s), choose a CRAN mirror (pick 0-Cloud [https] if your country is not listed), then look for the package to be installed (one at a time). Then we can start the text mining!
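If you prefer the console to the menu, the same installation can be done with install.packages(). This is a small sketch rather than the only way to do it; the helper name missing_packages is my own.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tm", "SnowballC")
to_install <- missing_packages(needed)

# Install only what is missing. Uncomment to run (needs internet access).
# if (length(to_install) > 0) install.packages(to_install)
```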

textdata<-read.csv("TweetsOB.csv",header=TRUE) #TRUE if variable names are included and FALSE if they're not

nrow(textdata) #check number of rows to see if data is read in correctly

colnames(textdata) #see variable names


text2017<-textdata[textdata$Year==2017,] #select tweets within year 2017 (Year is a column of textdata, so it must be referenced through the data frame)

text2017<-text2017["Text"] #select column Text only

nrow(text2017) #see number of tweets in 2017

library(tm)

doc<-Corpus(DataframeSource(text2017)) #note: recent versions of tm expect this data frame to have columns named doc_id and text; Corpus(VectorSource(text2017$Text)) is an alternative

doc1<-tm_map(doc, removePunctuation)

doc1<-tm_map(doc1, removeNumbers)

doc1<-tm_map(doc1, content_transformer(tolower))

doc1<-tm_map(doc1, removeWords, stopwords("english"))

library(SnowballC)

doc1<-tm_map(doc1,stripWhitespace)

doc1<-tm_map(doc1,PlainTextDocument)

dtm<-DocumentTermMatrix(doc1)

review_dtm<-removeSparseTerms(dtm,0.95) #adjust the value to a smaller one if review_dtm returns a large number of terms. We're removing terms that appear in at most 5% of the tweets here.

review_dtm #print a summary of the reduced matrix

F<-findFreqTerms(review_dtm,5) #adjust the value to a larger (or smaller) one if you want a shorter (or longer) list of words. We're looking at words that appear at least 5 times within the entire text corpus, i.e. the whole chunk of text.


F<-as.matrix(F)

write.csv(F,file="FreqWords2017.csv") # run this command only if you would like to save the list of words into a file

stemDocument(F) # run this command if you would like to see the root words
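To make the two thresholds above more concrete, here is a small base R sketch on a made-up matrix of word counts. The tweets, words and counts are invented, and the code mimics rather than calls tm. As far as I understand, removeSparseTerms keeps a term only if it appears in more than (1 - sparse) of the documents, while findFreqTerms looks at the total count across the corpus.

```r
# Made-up counts: rows are tweets (documents), columns are words (terms).
toy <- matrix(
  c(2, 0, 1,
    1, 0, 0,
    3, 1, 0,
    0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("tweet", 1:4), c("health", "ofa", "plan"))
)

sparse <- 0.5                  # maximal allowed share of tweets without the word
doc_freq <- colSums(toy > 0)   # number of tweets containing each word
kept <- names(doc_freq)[doc_freq > nrow(toy) * (1 - sparse)]

total_count <- colSums(toy)    # occurrences across the whole corpus
frequent <- names(total_count)[total_count >= 5]

doc_freq
total_count
```

In this toy data "health" is found in 3 tweets but occurs 6 times in total, because one tweet can mention a word more than once.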

Repeat the steps above to get the list of commonly used words for other years. One thing to note is that the number of tweets containing a particular word can be lower than the number of times the word appears within the text corpus, as a word could appear multiple times within a single tweet. Furthermore, a word in the returned list might be a stemmed word (i.e. a root word), e.g. insur (which can indicate insurance or insure). Hence, we can try to complete the word with its possible variations.

To find out the percentage of tweets containing each word, I used Excel, more specifically the "FIND" function. We search the tweet text, converted to lowercase, for each of the commonly used words found within the year. The formula returns 0 if the word is not found, 1 if it is found, and a blank if the word is not relevant for that particular year. The number of tweets containing a particular word can then be derived by summing the column, and dividing that by the number of tweets in the year gives the percentage.
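The Excel step can also be done in R. Below is a minimal sketch, assuming the tweets sit in a character vector. The sample tweets and word list are made up, and pct_containing is a name I chose. grepl with fixed = TRUE does a plain substring search, much like FIND, so a stem such as insur matches both insurance and insure.

```r
# Made-up tweets and words, for illustration only.
tweets <- c("Get insured today. Insurance matters!",
            "Health care for all",
            "Insure your family: health insurance plans")
words <- c("insur", "health", "ofa")

# Hypothetical helper: percentage of tweets that contain `word` at least once.
pct_containing <- function(word, texts) {
  found <- grepl(word, tolower(texts), fixed = TRUE)  # one TRUE/FALSE per tweet
  100 * mean(found)
}

pct <- vapply(words, pct_containing, numeric(1), texts = tweets)
round(pct, 1)
```

Because this is a substring search, a short word can also match inside longer, unrelated words, so it is worth eyeballing a few matches.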

However, if you're only interested in the number of times the word appeared in the entire text corpus (instead of treating each document individually), the following code will do that for us.

freq <- sort(colSums(as.matrix(review_dtm)), decreasing=T)

freq

And if we're interested in words closely associated with a particular word (e.g. "ofa"), we can make use of the function findAssocs. However, as the amount of data is small, it might not return any results.

findAssocs(review_dtm, "ofa", corlimit=0.70)
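As far as I understand, findAssocs reports the terms whose correlation with the chosen word, measured across documents, is at least corlimit. The sketch below reproduces that idea with base R's cor() on a made-up count matrix. It is an illustration of the idea, not tm's actual code.

```r
# Made-up counts: rows are tweets, columns are words.
toy <- matrix(
  c(1, 1, 0,
    0, 0, 1,
    2, 2, 0,
    0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("tweet", 1:4), c("ofa", "action", "plan"))
)

target <- "ofa"
others <- setdiff(colnames(toy), target)

# Correlation of the target word's counts with every other word's counts.
assoc <- cor(toy[, target], toy[, others])[1, ]

strong <- assoc[assoc >= 0.70]  # same cut-off as corlimit=0.70
strong
```

Here only "action" passes the cut-off, since it tends to appear in the same tweets as "ofa", while "plan" appears mostly in the tweets where "ofa" is absent.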

The final list of frequently mentioned words with the corresponding percentage of tweets containing the respective words can be found here.

[/Edited on 22 Oct 2018] In the tm package, the classes TermDocumentMatrix and DocumentTermMatrix (depending on whether you want terms as rows and documents as columns, or vice versa) employ sparse matrices for corpora. Inspecting a term-document matrix displays a sample, whereas as.matrix() yields the full matrix in dense format (which can be very memory consuming for large matrices). --"Introduction to the tm Package, Text Mining in R" by Ingo Feinerer.

[/Edited on 26 Oct 2018, 11 Dec 2018] Separately, I found a website that generates a word cloud from the text provided, for free. While I think it can fulfill most basic needs, there is of course a limit on how much you can customize compared to coding. An example illustrating the features of the website tool can be found in a separate post here.

Other useful resources: