Sentiment analysis

Sentiment analysis is the process of using text analytics to mine various sources of data for opinions. Often, sentiment analysis is done on data that is collected from the Internet and from various social media platforms. Politicians and governments often use sentiment analysis to understand how people feel about them and their policies.
With the advent of social media, data is captured from different sources, such as mobile devices and web browsers, and it is stored in various formats. Because social media content is unstructured with respect to traditional storage systems, such as relational database management systems (RDBMS), we need tools that can process and analyze this disparate data. Big data technology is designed to handle data that comes from different sources and in different formats, both structured and unstructured. In this article, I describe how to use big data tools to capture data for storage and to process that data for sentiment analysis.

Working with big data
Whenever you gather data from multiple sources in multiple formats (structured, semi-structured, or unstructured), you need to consider setting up a Hadoop cluster and the Hadoop Distributed File System (HDFS) to store your data. HDFS provides a flexible way of managing big data (a few basic HDFS shell commands are sketched after this list):
You can move some of your analyzed data into an existing relational database management system (RDBMS), such as Oracle or MySQL, so that you can use existing BI or reporting tools.
You can store the data in HDFS for future analysis, such as comparing old data with new data by running tests such as ANOVA or t-tests.
You can discard the data if you just need the analysis of the data at the point of impact.
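For example, after the cluster is running, a few basic HDFS shell commands cover the storage and housekeeping side of these options; the directory and file names here are placeholders only:

hadoop fs -mkdir /user/governmentTopics
hadoop fs -put analyzedData.csv /user/governmentTopics/   # keep data in HDFS for future analysis
hadoop fs -ls /user/governmentTopics
hadoop fs -rm /user/governmentTopics/analyzedData.csv     # discard data you no longer need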

Retrieving the data and storing it in HDFS
The best sentiment analysis includes data from multiple sources. In this article, I describe how to retrieve data from these sources:
A Twitter feed
An RSS feed
A mobile application
I'll also explain how to store the data from these different sources in HDFS in your Hadoop cluster.

Retrieving data from Twitter by using Jaql
After your application is authenticated with Twitter, we can use the Twitter APIs to fetch tweets.
Because we want to receive tweets in streaming mode, our Twitter URL is:
url = "https://stream.twitter.com/1.1/statuses/filter.json?track=governmentTopic";
Replace governmentTopic with the name of the government topic that we are mining information about. We can fetch the tweets into a variable by using code similar to the following:

jsonResultTweets = read(http(url));
jsonResultTweets;

When you run the Jaql script, it fetches tweets that relate to the government topic. The tweets are returned in JSON format.
If we want to know the scope of discussions on our government topic by location, we can use the following code snippet to group the tweets by location:

governmentTopicDiscussionByLocation = jsonResultTweets ->
  transform {location: $.location, user_id: $.from_user_id_str, date_created: $.created_at, comment: $.text} ->
  group by key = $.location;

Then, we can store this information in HDFS with the following code snippet:

governmentTopicDiscussionByLocation ->
  write(del("/user/governmentTopics/governmentTopic_1Tweets.del",
    schema = schema {list_of_comma_separated_json_fields}));

where list_of_comma_separated_json_fields is the comma-separated list of fields to write: location, user_id, date_created, and comment.
So the full Jaql script that can be run by an Oozie workflow might look like this code sample:

url = "https://stream.twitter.com/1.1/statuses/filter.json?track=governmentTopic";
jsonResultTweets = read(http(url));
jsonResultTweets;
governmentTopicDiscussionByLocation = jsonResultTweets ->
  transform {location: $.location, user_id: $.from_user_id_str, user_name: $.user.name,
    user_location: $.user.location, date_created: $.created_at, comment: $.text} ->
  group by key = $.location;
governmentTopicDiscussionByLocation ->
  write(del("/user/governmentTopics/governmentTopic_1Tweets.del",
    schema = schema {location, user_id, user_name, user_location, date_created, comment}));

Retrieving data from Twitter by using R
To use R to retrieve tweets, you need to install certain packages on your system. Although you can use RStudio, these steps show how to set up and use the R console.
On an Ubuntu computer, I completed these steps to install the necessary R packages:
Install these system packages (an example apt-get command is shown after the list):

libcurl4-gnutls-dev
libcurl4-nss-dev
libcurl4-openssl-dev
r-base
r-base-dev
r-cran-rjson
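On Ubuntu, you can install these with a single apt-get command; the example below uses the OpenSSL variant of the libcurl development package:

sudo apt-get install libcurl4-openssl-dev r-base r-base-dev r-cran-rjson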

Open the R console, and run these commands to install the R packages for accessing Twitter:

install.packages("twitteR")
install.packages("ROAuth")
install.packages("RCurl")

Load these libraries into your R workspace:

rm(list=ls())
library(twitteR)
library(ROAuth)
library(RCurl)

Now, we can use the following R script to authenticate with Twitter. Substitute your own application credentials for the placeholder key and token values:

download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
# placeholders: use the keys and tokens from your Twitter application settings
consumerKey <- "your_consumer_key"
consumerSecret <- "your_consumer_secret"
accessToken <- "your_access_token"
accessSecret <- "your_access_secret"
myCred <- OAuthFactory$new(consumerKey=consumerKey,
                           consumerSecret=consumerSecret,
                           requestURL=requestURL,
                           accessURL=accessURL,
                           authURL=authURL)

setup_twitter_oauth(consumerKey, consumerSecret, accessToken, accessSecret)

Then, we can use a code snippet like the following, which uses the searchTwitter() function from the twitteR package, to fetch the tweets:

govt_sentiment_data <- searchTwitter("#keyWord", since=last_date_pulled)

The keyWord is the government topic that you are analyzing, and last_date_pulled is the date of the last time that you fetched tweets.
If you want to stream the Twitter data and pull data automatically at intervals, replace the previous code snippet with a streaming call; the track, timeout, and oauth arguments shown here match the filterStream() function from the streamR package:

govt_sentiment_data <- filterStream(file.name="", track="#keyWord", timeout=3600, oauth=myCred)

We can use the following R script to cleanse the data:

# extract the tweet text from the status objects
govt_sentiment_data_txt = sapply(govt_sentiment_data, function(x) x$getText())
# remove retweet entities
govt_sentiment_data_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", govt_sentiment_data_txt)
# remove at people
govt_sentiment_data_txt = gsub("@\\w+", "", govt_sentiment_data_txt)
# remove punctuation
govt_sentiment_data_txt = gsub("[[:punct:]]", "", govt_sentiment_data_txt)
# remove numbers
govt_sentiment_data_txt = gsub("[[:digit:]]", "", govt_sentiment_data_txt)
# remove html links
govt_sentiment_data_txt = gsub("http\\w+", "", govt_sentiment_data_txt)
# collapse unnecessary spaces
govt_sentiment_data_txt = gsub("[ \t]{2,}", " ", govt_sentiment_data_txt)
govt_sentiment_data_txt = gsub("^\\s+|\\s+$", "", govt_sentiment_data_txt)
# keep only plain text characters
govt_sentiment_data_txt = gsub("[^0-9a-zA-Z ,./?><:;'~`!@#&*']", "", govt_sentiment_data_txt)

Finally, to save the cleansed data to HDFS, we can use the following code snippet; the hdfs.* functions come from the rhdfs package:

hdfsFile <- hdfs.file("/tmp/govt_sentiment_data.txt", "w")
hdfs.write(govt_sentiment_data_txt, hdfsFile)
hdfs.close(hdfsFile)
# optionally keep a local copy as well
write(govt_sentiment_data_txt, "govt_sentiment_data.txt")

Retrieving data from RSS feeds
In addition to tweets, we want to gather personal opinions or views from news articles. For this type of data, I suggest that you use a combination of Java and the ROME library to obtain data from RSS feeds. ROME is a Java library that is used to access and manipulate news feeds on the web.
In this example, we get the following information about a news article: title, link, and description. Then, we extract the information that we want from those data points.
To determine which news feed to use, we need to use some sort of page ranking technique. This technique is used in search algorithms and determines the relevance of an item in terms of its references and popularity. The basic principle is that articles with a higher number of hits or references by external entities have a higher priority and hence appear at the top of the search results.
The following Java code reads the entries from the news feed that you identified and writes them to a local file:

private static void getFeeds(String newsFeedUrlLink) {
    File f = new File("newsFeeds.txt");
    boolean ok = false;
    try {
        URL feedUrl = new URL(newsFeedUrlLink);
        SyndFeedInput input = new SyndFeedInput();
        InputSource source = new InputSource(feedUrl.openStream());
        SyndFeed feed = input.build(source);
        for (Iterator i = feed.getEntries().iterator(); i.hasNext();) {
            SyndEntry entry = (SyndEntry) i.next();
            writeToFile(f, entry);
        }
        ok = true;
    }
    catch (Exception ex) {
        ex.printStackTrace();
        System.out.println("ERROR: " + ex.getMessage());
    }
    if (!ok) {
        System.out.println();
        System.out.println("FeedReader reads and prints any RSS/Atom feed type.");
        System.out.println("The first parameter must be the URL of the feed to read.");
        System.out.println();
    }
}

private static void writeToFile(File f, SyndEntry entry) throws IOException {
    FileWriter fw = new FileWriter(f.getName(), true);
    BufferedWriter bw = new BufferedWriter(fw);
    bw.write(entry.getTitle() + "\n");
    bw.close();
}
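A minimal way to drive this code is a main method in the same class; the feed URL here is only a placeholder for the feed that you selected by page ranking:

public static void main(String[] args) {
    // placeholder URL: substitute the RSS feed that you selected
    getFeeds("http://example.com/news/rss.xml");
}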

Next, we want to append the RSS data to the HDFS file that we created with the Twitter data. Because HDFS does not allow appending to a file by default, we must first set the dfs.support.append property in the hdfs-site.xml file.
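A minimal hdfs-site.xml entry for this might look like the following:

<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>

With appending enabled, we can use an R code snippet like the following, which uses the rhdfs package and assumes the newsFeeds.txt file name from the Java code, to write the RSS data into the combined HDFS file: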

# read the RSS titles that the Java code wrote to newsFeeds.txt
mydata <- readLines("newsFeeds.txt")
# open the combined HDFS data file and write the RSS data to it
myfile <- hdfs.file("/tmp/govt_sentiment_data.txt", "w")
hdfs.write(mydata, myfile)
hdfs.close(myfile)

Retrieving data from a mobile application
In addition to Twitter data and RSS feed data, we can also gather data from mobile applications that capture personal opinions and views. In this example, I assume that you created a simple J2ME mobile app that allows users to provide their opinions about government topics or policies. The app can be uploaded to a WAP server, where mobile devices (even older feature phones that support J2ME) can download and install it. The information that users provide is sent back to an RDBMS and stored for future analyses.
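For example, the opinions might land in a MySQL table like the following sketch; the column names here are assumptions, and the Sqoop command that follows only relies on the table being named opinions:

CREATE TABLE opinions (
  id           INT AUTO_INCREMENT PRIMARY KEY,
  user_id      VARCHAR(64),
  topic        VARCHAR(128),
  comment      TEXT,
  date_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);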
You can use Sqoop to import the data from the RDBMS server into our Hadoop cluster. Run the following Sqoop command on the Hadoop cluster:

sqoop import --options-file dbCredentials.txt \
  --connect jdbc:mysql://217.8.156.117/govt_policy_app \
  --table opinions \
  --target-dir /tmp \
  --append

The --append flag tells Sqoop to append the imported data to the data set that we already have from the previous data sources, as indicated by the --target-dir flag.
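The --options-file flag points to a plain-text file that holds options you prefer not to type on the command line, such as the database credentials. A dbCredentials.txt file might look like this sketch, where the user name and password are placeholders:

--username
appuser
--password
secretPassword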

Combining the data that was collected into one data source
As we collected the data from Twitter (by using Jaql or R), from RSS feeds (by using Java), and from a mobile app (by using Sqoop), we appended the data into a single HDFS file. You can automate these scripts by implementing the Oozie workflow engine and setting the commands to run at certain intervals or in response to a trigger event.
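For example, a minimal Oozie coordinator definition that reruns a data collection workflow every hour might look like the following sketch; the application path and dates are placeholders:

<coordinator-app name="govt-sentiment-collect" frequency="${coord:hours(1)}"
                 start="2014-01-01T00:00Z" end="2014-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <!-- placeholder: HDFS path of the workflow.xml that runs the Jaql, R, and Sqoop steps -->
      <app-path>hdfs://namenode:8020/user/governmentTopics/collect-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>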

Performing sentiment analysis on the combined data
Now that we have combined the data, we can complete the sentiment analysis on a single data source, which allows for uniformity, consistency, and accuracy in our analyses. You can use R, Jaql, Pig, or Hive to do these analyses. Hive provides an SQL-like query language and Pig provides a high-level data flow language; both run on the Hadoop platform. In this example, I use R to analyze the retrieved data because R has rich built-in modeling functions and libraries for graphical representation, such as ggplot2.
To complete sentiment analyses, we need a dictionary of words, or word list. The dictionary includes a set of standard words that depict positive and negative sentiment within a context. It can also identify sarcasm, innuendo, slang terms, new vocabulary, special characters, and smileys that are often used in social media. These word lists can be obtained from the Internet, updated regularly, and integrated into our sentiment analysis logic.
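Word list files are typically plain text with one word per line; for example, a fragment of a positive word list might look like this, where the semicolon lines are comments (matching the comment.char=';' argument used below):

; positive opinion words
good
great
accountable
transparent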
The following code takes the retrieved data and matches the words with our word list to get the number of positive and negative words. The difference between the number of positive words and the number of negative words gives us a score that indicates how positive or negative our data is with respect to the government topic that we are analyzing.

sentiment.pos = scan('/Users/charles/Downloads/r/positive-words.txt', what='character', comment.char=';')
sentiment.neg = scan('/Users/charles/Downloads/r/negative-words.txt', what='character', comment.char=';')
pos.words = c(sentiment.pos, 'good', 'reelect', 'accountable', 'stable')
neg.words = c(sentiment.neg, 'bad', 'corrupt', 'greedy', 'unstable')

And, the following code represents the sentiment scoring algorithm:

require(plyr)
require(stringr)
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  # score each sentence in the vector of sentences
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    sentence = tolower(sentence)
    word.list = str_split(sentence, '\\s+')
    words = unlist(word.list)
    # match the words against the positive and negative word lists
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # the score is the count of positive words minus the count of negative words
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress)
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

We can then call our sentiment scoring function to score the data by using the following code snippet:

require(plyr)
opinion.scores = score.sentiment(govt_sentiment_data_txt, pos.words, neg.words, .progress='text')

Finally, we can run further analysis on the scored data by using R's built-in chart and graph capabilities and draw a chart of the score distribution by using the following code snippet:

library("ggplot2")
hist(opinion.scores$score)
qplot(opinion.scores$score)

Conclusion
Big data tools can provide insight into data that is generated from virtually any source, which supports proper and accurate decision making. You can readily realize a return on investment by implementing big data tools such as those described in this article.