--- title: "Business data analytics." author: "Brand value monitoring" output: prettydoc::html_pretty: null highlight: github html_document: default html_notebook: default github_document: default theme: cayman --- ```{r global_options, include=FALSE} knitr::opts_chunk$set(warning=FALSE, message=FALSE) ``` In this lab session we will analyse user comments for various airline companies. * We will study user's comments and figure out the most frequent words. * Do sentiment analysis for each airline and find the best one. This Lab exercises consists of 2 parts. * First part: We will perform descriptive analysis and visualization. * Second part: We will also see example of K-NN: how it can be used for text classification. ##Libraries For today session we will need following libraries: ```{r} library("dplyr") library("tidyr") library("ggplot2") library("class") library("tm") # Text mining package library("wordcloud2") # Package for building word clouds library("syuzhet") # Package for sentement analysis library("stringr") # Package for work with strings library("randomForest") ``` ##Loading data ```{r} airline_df <- read.csv(file.choose()) # airline.csv ``` ```{r} head(airline_df, 3) ``` ```{r} str(airline_df) ``` Here in column "text" there are some comments that were left by customer. Our objective will be to study them and figure out customer's mood. This will help us to understand what customers think about airlines they use. Now let's print the list of airlines that we have in our data: ```{r} unique(airline_df$airline) ``` Let's look at destribution of the tweets for each airline. ```{r} dist <- airline_df %>% group_by(airline) %>% summarise(n_row = n()) ggplot(dist, aes(x=airline, y=n_row)) + geom_bar(stat = "identity") + theme_bw() ``` ![](aa.png) Now, let's find tweets for different sentiments: ```{r} dist2 <- airline_df %>% group_by(airline, airline_sentiment) %>% summarise(n_row = n()) ggplot(dist2, aes(x=airline, y=n_row, fill=airline_sentiment)) + geom_bar(stat = "identity", position = "dodge") + theme_bw() ``` ```{r} dist2 <- airline_df %>% filter(airline_sentiment == "negative") %>% group_by(airline, negativereason) %>% summarise(n_row = n()) ggplot(dist2, aes(x=airline, y=n_row, fill=negativereason)) + geom_bar(stat = "identity", position = "dodge") + theme_bw() ``` Now we can tell that the biggest problem in airlines is customer service and late flights. Let's study tweets more closer. ##Counting words using text corpus Corpus is the collection of documents (or texts) which we will use to do analysis. By using corpuses we make our life easier because otherwise we would have to do cleaning of the text and counting of the words manually. ###Building corpus To process the text, first we need to "clean" it from punctuation, links or other things that can affect our analysis. To build corpus first we need to change encoding of our texts to *"UTF-8"*: ```{r} sacred_texts <- iconv(airline_df$text, to = "utf-8") ``` Note: If you using mac, use *"utf-8-mac"* instead of *"utf-8"*. Function *iconv()* is used for converting character vectors to specified encodings. Now we will create corpus based on the texts we converted: ```{r} corpus <- Corpus(VectorSource(sacred_texts)) inspect(corpus[1:5]) ``` ###Cleaning corpus Next step is to clean the corpus from punctuation, links and etc. To do it we will use function *tm_map()* (from tm package). First argument that should be passed into tm_map() is the corpus and second is some specific method that you want to use to clean the data. 
Some of the widely used methods are:

* tolower - converts all text to lower case.
* removePunctuation - removes punctuation such as dots, commas or dashes.
* removeNumbers - removes numbers from the text.
* wordLengths - removes words that have fewer than 3 characters (3 by default; can be changed).
* removeWords - removes the words that you specify.

Let's go step by step through the cleaning process.

First we convert all uppercase letters to lowercase:

```{r}
corpus <- tm_map(corpus, tolower)
inspect(corpus[1:5])
```

Next, we remove punctuation from the texts:

```{r}
corpus <- tm_map(corpus, removePunctuation)
inspect(corpus[1:5])
```

Removing numbers:

```{r}
corpus <- tm_map(corpus, removeNumbers)
inspect(corpus[1:5])
```

And finally we can remove words that are used very often (so-called stop words) and carry no significant sentiment. Examples of such words: "I", "me", "am", "is", "the", etc.

```{r}
cleanset <- tm_map(corpus, removeWords, stopwords('english'))
inspect(cleanset[1:5])
```

Another point you should pay attention to is the order in which you clean the data. For example, earlier we removed punctuation from the text, which turned the link "https://t.co/hfhxqj0iob" into a meaningless long word: "httpstcohfhxqjiob". If we wanted to remove the links completely, we should have done it before removing the punctuation.

Unfortunately there is no predefined function for removing links from text, so we will have to create one:

```{r}
removeURL <- function(x) gsub('https://[[:alnum:]|[:punct:]]*', '', x)

corpus2 <- Corpus(VectorSource(sacred_texts))
noUrl <- tm_map(corpus2, content_transformer(removeURL))
inspect(noUrl[1:5])
```

Let's go step by step and see what is going on.

First, we create the function "removeURL", which we will use inside tm_map. This function takes a string ("x") and removes all links from it using the following gsub call:

*gsub('https://[[:alnum:]|[:punct:]]\*', '', x)*

Here we match every substring that starts with https:// and continues with zero or more letters and digits (*[:alnum:]*) or (*|*) punctuation characters (*[:punct:]*), and replace it with an empty string. This pattern is called a regular expression. In this course we will not go into details, but if you want to know more you can use this documentation: http://www.endmemo.com/program/R/gsub.php

We create another corpus ("corpus2") and use tm_map to remove the links. Here *content_transformer()* (from the tm package) is used to turn an ordinary function into a transformation function. As tm_map accepts only transformation functions, we wrap our *removeURL* function with *content_transformer()*.

Let us proceed with cleaning the corpus:

```{r}
noUrl <- tm_map(noUrl, tolower)
noUrl <- tm_map(noUrl, removePunctuation)
noUrl <- tm_map(noUrl, removeNumbers)

airlines_low <- sapply(unique(airline_df$airline), tolower)
airlines_low_nowhitespace <- gsub(" ", "", airlines_low)

cleanset <- tm_map(
  noUrl,
  removeWords,
  c(stopwords('english'), airlines_low, airlines_low_nowhitespace)
)
cleanset <- tm_map(cleanset, stripWhitespace)
inspect(cleanset[1:5])
```

The result of our cleaning is stored in the variable called cleanset. We converted the characters to lowercase and removed a lot of meaningless words as well as the punctuation; this is what we get in the end:

```{r}
inspect(cleanset[1:5])
```

### Document-Term Matrix

Now let us proceed to our goal. The next step is to calculate how many times each word occurs in our data.
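To get an intuition for what we are about to build, here is a toy document-term matrix on three made-up "documents" (an illustration only, not part of the airline data): each row is a document, each column a term, and each cell counts how often that term appears in that document.

```{r}
# Toy illustration: build a tiny corpus and inspect its document-term matrix.
toy_corpus <- Corpus(VectorSource(c("late flight again",
                                    "great crew great service",
                                    "flight cancelled")))
inspect(DocumentTermMatrix(toy_corpus))
```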
For our cleanset we build the Document-Term Matrix in the same way:

```{r}
dtm <- DocumentTermMatrix(cleanset)
inspect(dtm)
```

From the output we can learn some interesting facts:

* In our cleanset there are 14431 different words across 14640 documents;
* We can see how many times each word occurs in each document.

However, working with a DocumentTermMatrix object is inconvenient with standard R functions, so let's convert it to a regular data frame:

```{r}
dtm <- as.data.frame(as.matrix(dtm))
dtm$AAirline <- airline_df$airline
dtm[1:10, 1:20]
```

Great! We obtained the data we need. Now let's do some analysis on it. First, we count the occurrences of each word per airline, regardless of the document in which it appeared:

```{r}
wordCount <- dtm %>%
  group_by(AAirline) %>%
  summarise_all(sum)
wordCount
```

Let's gather the columns:

```{r}
gathered_wc <- gather(wordCount, "word", "amount", 2:ncol(wordCount))
```

And visualize the result:

```{r}
ggplot(filter(gathered_wc, amount > 50)) +
  geom_histogram(aes(x=word, y=amount, fill=AAirline),
                 stat = "identity", position = "dodge") +
  coord_flip()
```

### Word cloud

There is a better way to visualize the results we obtained: word clouds. As a reminder, these are the airlines in our data:

```{r}
unique(airline_df$airline)
```

### Virgin America

```{r}
set.seed(8)
gathered_wc %>%
  filter(AAirline == "Virgin America") %>%
  select(word, amount) %>%
  wordcloud2(size = 0.6,          # scale of the words
             shape = 'triangle',  # shape of the cloud
             rotateRatio = 0.5,   # proportion of words to rotate
             minSize = 10)        # minimal size of a word to be displayed
```

Try running word clouds for the other companies:

### United

```{r echo=F}
set.seed(8)
gathered_wc %>%
  filter(AAirline == "United") %>%
  select(word, amount) %>%
  wordcloud2(size = 0.6,
             shape = 'square',
             rotateRatio = 0.5,
             minSize = 10)
```

### Southwest

```{r echo=F}
set.seed(8)
gathered_wc %>%
  filter(AAirline == "Southwest") %>%
  select(word, amount) %>%
  wordcloud2(size = 1,
             shape = 'circle',
             rotateRatio = 0.5,
             minSize = 1)
```

### Delta

```{r echo=F}
set.seed(8)
gathered_wc %>%
  filter(AAirline == "Delta") %>%
  select(word, amount) %>%
  wordcloud2(size = 1.5,
             shape = 'square',
             rotateRatio = 0.5,
             minSize = 2)
```

### US Airways

```{r echo=F}
set.seed(8)
gathered_wc %>%
  filter(AAirline == "US Airways") %>%
  select(word, amount) %>%
  wordcloud2(size = 0.6,
             shape = 'triangle',
             rotateRatio = 0.5,
             minSize = 10)
```

### American

```{r echo=F}
set.seed(8)
gathered_wc %>%
  filter(AAirline == "American") %>%
  select(word, amount) %>%
  wordcloud2(size = 1,
             shape = 'circle',
             rotateRatio = 0.5,
             minSize = 2)
```

## Plotting coordinates on a geomap

Let's look at the coordinates:

```{r}
head(airline_df$tweet_coord)
```

There are some empty values in the tweet_coord column, so we keep only the rows that contain coordinates:

```{r}
data_with_coord <- airline_df[airline_df$tweet_coord != "",]
```

Next, to plot the tweets on the map, we have to split the coordinate strings into latitude and longitude.
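The coordinates are stored as strings of the form "[lat, lon]". As a quick illustration of how str_split_fixed() handles one such string (a made-up coordinate pair, not taken from the data):

```{r}
# Split on ", " into exactly two pieces; the brackets are stripped afterwards with gsub().
str_split_fixed("[40.64, -73.78]", ", ", 2)
```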
```{r}
tmp <- data.frame(str_split_fixed(data_with_coord$tweet_coord, ", ", 2))
tmp$X1 <- gsub('^.', '', tmp$X1)                  # Remove leading [
tmp$X2 <- gsub('.$', '', tmp$X2)                  # Remove trailing ]
colnames(tmp) <- c("lat", "lon")                  # Set column names
data_with_coord$tweet_lat <- as.numeric(tmp$lat)  # Add latitude column to data_with_coord
data_with_coord$tweet_lon <- as.numeric(tmp$lon)  # Add longitude column to data_with_coord
```

Now let's plot the map:

```{r}
# Setting world data
# Note: map_data() needs the 'maps' package installed, and coord_map() needs 'mapproj'.
WorldData <- map_data('world')
WorldData <- fortify(WorldData)

# Setting the world map
geomap <- ggplot() +
  geom_map(
    data=WorldData, map=WorldData,
    aes(x=long, y=lat, group=group, map_id=region),
    fill="white", colour="#7f7f7f", size=0.5)

# Setting map borders
geomap <- geomap +
  coord_map("rectangular", lat0=0, xlim=c(-180,180), ylim=c(-60, 90))

# Setting coordinates of tweets
geomap <- geomap +
  geom_point(
    aes(x = tweet_lon, y = tweet_lat, color = airline),
    alpha = 1, data = data_with_coord, size = 2)

# Setting colors
geomap <- geomap +
  scale_colour_manual(values=c("orange", "purple", "red", "darkgreen", "blue", "cyan"))

geomap
```

### Tweet distribution by day of the week

Let's find the weekday for each tweet:

```{r}
airline_df <- airline_df %>%
  mutate(weekday = as.POSIXlt(tweet_created)$wday + 1)
```

```{r}
airline_df %>%
  group_by(airline, weekday) %>%
  mutate(n_row = n()) %>%
  ggplot(aes(x=weekday, y=n_row, color=airline)) +
  geom_point() +
  geom_line() +
  theme_bw() +
  facet_grid(airline_sentiment ~ .)
```

## Sentiment analysis

If you remember, we extracted the text from our data and changed its encoding:

```{r}
head(sacred_texts, 3)
```

We will use it for the sentiment analysis.

### Library

For this we will use the syuzhet package, which provides useful methods for extracting sentiment data.

### Sentiment scores

To calculate sentiment scores we will use the NRC emotion lexicon, which distinguishes 8 different emotions and 2 sentiments (negative and positive). Let's check it out by calculating sentiment scores for two example texts:

```{r}
rbind(get_nrc_sentiment('The flight was bad. Plane delayed on 2 hours!!!'),
      get_nrc_sentiment('sunny'))
```

Now let's calculate scores for our texts:

```{r}
scores <- get_nrc_sentiment(sacred_texts)
```

Let us study the scores a bit:

```{r}
summary(scores)
```

Let's add information about the airlines to our data:

```{r}
scores$Airline <- airline_df$airline
```

We can calculate the total score for each emotion and each airline:

```{r}
scores <- scores %>%
  mutate(rows = rowSums(select(., 2:10))) %>%
  group_by(Airline) %>%
  summarise(
    anger = sum(anger),
    anticipation = sum(anticipation),
    disgust = sum(disgust),
    fear = sum(fear),
    joy = sum(joy),
    sadness = sum(sadness),
    surprise = sum(surprise),
    negative = sum(negative),
    positive = sum(positive),
    rows = sum(rows))
```

Let us visualize our data:

```{r}
scores_gathered <- scores %>%
  gather("sentiment", "value", 2:10) %>%
  mutate(perc = value/rows * 100)

ggplot(scores_gathered, aes(x = sentiment, y = perc, fill = sentiment)) +
  geom_histogram(stat = "identity") +
  coord_flip() +
  theme_bw() +
  scale_fill_brewer(palette="RdYlGn") +
  facet_grid(Airline ~ .)
```

## Part 2: Text classification using the k-NN method

### Data preparation

Let's look at our data once again:

```{r}
head(airline_df, 3)
```

Let's predict sentiments based on the comments we have. Our dataset contains two relevant columns: the comment text and airline_sentiment. We can use classification to predict the sentiment from the words in a comment.
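As a reminder of the knn() interface from the class package that we will rely on below, here is a toy call on made-up numeric data (not the airline data): each row is an observation, each column a feature, and a test row gets the label of its nearest training row(s).

```{r}
# Toy illustration of knn() (class package, loaded at the top of this document).
toy_train  <- matrix(c(1, 1,
                       1, 2,
                       8, 9,
                       9, 8), ncol = 2, byrow = TRUE)
toy_labels <- factor(c("negative", "negative", "positive", "positive"))
toy_test   <- matrix(c(2, 1,
                       8, 8), ncol = 2, byrow = TRUE)
knn(train = toy_train, test = toy_test, cl = toy_labels, k = 1)
```

With the document-term matrix, each word count column plays the role of one of these numeric features.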
Before we start, to save lab session time, we will keep only part of the rows:

```{r}
airline_df <- airline_df %>% filter(! is.na(airline_sentiment))
airline_df <- airline_df[1:3640,]
```

Let's clean the text (as we did before):

```{r}
sacred_texts <- iconv(airline_df$text, to = "utf-8")

corpus_k <- Corpus(VectorSource(sacred_texts))
clean_k <- tm_map(corpus_k, tolower)
clean_k <- tm_map(clean_k, removeNumbers)
clean_k <- tm_map(clean_k, removeWords, stopwords("english"))
clean_k <- tm_map(clean_k, removePunctuation)
clean_k <- tm_map(clean_k, stripWhitespace)

inspect(clean_k[1:3])
```

Next we create a Document-Term matrix from our corpus:

```{r}
dtm <- DocumentTermMatrix(clean_k)
```

And transform it into a data frame, as we will need that for further work:

```{r}
knn_words <- as.data.frame(data.matrix(dtm), stringsAsFactors = FALSE)
```

So now, instead of the texts, we have data on how many times each word occurs in each comment. For example, the first comment in our data was:

```{r}
airline_df$text[1]
```

And now we have the number of times each word appears in the text (note that we cleaned this text, so some words might have changed):

```{r}
knn_words[1:3, 1:15]
```

### Model training and prediction

Now that we have our data, we can predict the sentiments. First, let's store the actual values from the initial dataset:

```{r}
actual_val <- airline_df$airline_sentiment
```

Next, let's divide our data into 2 sets:

* training data (usually 60-80% of the data)
* test data (the remaining 20-40%)

```{r}
amount_for_train <- round(nrow(knn_words) / 100 * 80)
rows <- sample(nrow(airline_df), amount_for_train)

train <- knn_words[rows,]
test <- knn_words[-rows,]
```

Now we can train the model and predict the sentiments:

```{r}
prediction <- knn(train, test, actual_val[rows])
```

### Evaluation

For evaluation of our model we can use a confusion matrix:

```{r}
confMX <- table(predictions = prediction, actual = actual_val[-rows])
confMX
```

Or we can calculate the accuracy:

```{r}
sum(diag(confMX))/nrow(test) * 100
```

So the model predicts the sentiment of a comment with an accuracy of approximately 50%, which is not much better than guessing. That is not good. Let's try another algorithm.

## Random Forest

First, we have to change our data a bit. Some column names (words) contain characters that are not valid in R names, so we fix them in order to run Random Forest:

```{r}
names(knn_words) <- make.names(names(knn_words))  # creates syntactically valid names
knn_words$SSenti <- airline_df$airline_sentiment
knn_words <- subset(knn_words, select=which(!duplicated(names(knn_words))))  # removes duplicated words
```

Separating into training and test data:

```{r}
amount_for_train <- round(nrow(knn_words) / 100 * 80)
rows <- sample(nrow(airline_df), amount_for_train)

train <- knn_words[rows,]
test <- knn_words[-rows,]
```

```{r}
rf_model <- randomForest(formula = SSenti ~ ., data = train)
rf_pred <- predict(rf_model, test)
```

### Evaluation of Random Forest

```{r}
confMX2 <- table(rf_pred, actual = actual_val[-rows])
confMX2
```

```{r}
sum(diag(confMX2))/nrow(test) * 100
```
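If the forest does better, it is also worth checking which words it relies on most. A quick way to do this (a sketch using the importance measures built into the randomForest package, assuming the rf_model fitted above) is:

```{r}
# Plot the 10 terms with the highest mean decrease in Gini impurity.
varImpPlot(rf_model, n.var = 10, main = "Most informative terms")
```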