In this lab session we will analyse user comments for various airline companies.

This lab exercise consists of two parts.

Libraries

For today's session we will need the following libraries:

library("dplyr")
library("tidyr")
library("ggplot2")
library("class") 
library("tm")           # Text mining package
library("wordcloud2")   # Package for building word clouds
library("syuzhet")      # Package for sentement analysis
library("stringr")      # Package for work with strings
library("randomForest")

Loading data

airline_df <- read.csv(file.choose()) # airline.csv
head(airline_df, 3)
##       tweet_id airline_sentiment negativereason        airline       name
## 1 5.703061e+17           neutral                Virgin America    cairdin
## 2 5.703011e+17          positive                Virgin America   jnardino
## 3 5.703011e+17           neutral                Virgin America yvonnalynn
##                                                                       text
## 1                                      @VirginAmerica What @dhepburn said.
## 2 @VirginAmerica plus you've added commercials to the experience... tacky.
## 3  @VirginAmerica I didn't today... Must mean I need to take another trip!
##   tweet_coord             tweet_created tweet_location
## 1             2015-02-24 11:35:52 -0800               
## 2             2015-02-24 11:15:59 -0800               
## 3             2015-02-24 11:15:48 -0800      Lets Play
str(airline_df)
## 'data.frame':    14640 obs. of  9 variables:
##  $ tweet_id         : num  5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
##  $ airline_sentiment: Factor w/ 3 levels "negative","neutral",..: 2 3 2 1 1 1 3 2 3 3 ...
##  $ negativereason   : Factor w/ 11 levels "","Bad Flight",..: 1 1 1 2 3 3 1 1 1 1 ...
##  $ airline          : Factor w/ 6 levels "American","Delta",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ name             : Factor w/ 7701 levels "___the___","__betrayal",..: 1073 3477 7666 3477 3477 3477 1392 5658 1874 7665 ...
##  $ text             : Factor w/ 14427 levels "\"LOL you guys are so on it\" - me, had this been 4 months ago...“@JetBlue: Our fleet's on fleek. http://t.co/LYcARlTFHl”",..: 14005 13912 13790 13844 13648 13926 14038 13917 14004 13846 ...
##  $ tweet_coord      : Factor w/ 833 levels "","[-33.87144962, 151.20821275]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ tweet_created    : Factor w/ 14247 levels "2015-02-16 23:36:05 -0800",..: 14212 14170 14169 14168 14166 14165 14164 14160 14158 14106 ...
##  $ tweet_location   : Factor w/ 3082 levels "","'Greatness has no limits'",..: 1 1 1465 1 1 1 2407 1529 2389 1529 ...
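
Note: the str() output above shows the text columns stored as factors, which was the default behaviour of read.csv() before R 4.0. If you are running R 4.0 or later and want to reproduce the same structure, you can set the option explicitly (a small optional tweak; the file is the same airline.csv selected above):

airline_df <- read.csv(file.choose(), stringsAsFactors = TRUE) # reproduce the factor columns shown in str()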

The “text” column contains comments left by customers. Our objective is to study them and figure out each customer's mood, which will help us understand what customers think about the airlines they use.

Now let’s print the list of airlines that we have in our data:

unique(airline_df$airline)
## [1] Virgin America United         Southwest      Delta         
## [5] US Airways     American      
## Levels: American Delta Southwest United US Airways Virgin America

Let’s look at the distribution of tweets for each airline.

dist <- airline_df %>%
  group_by(airline) %>%
  summarise(n_row = n())

ggplot(dist, aes(x=airline, y=n_row)) + geom_bar(stat = "identity") + theme_bw()

Now let’s break the tweets down by sentiment for each airline:

dist2 <- airline_df %>%
  group_by(airline, airline_sentiment) %>%
  summarise(n_row = n())

ggplot(dist2, aes(x=airline, y=n_row, fill=airline_sentiment)) + geom_bar(stat = "identity", position = "dodge") + theme_bw()

For the negative tweets, let’s also look at the reasons that were reported:

dist2 <- airline_df %>%
  filter(airline_sentiment == "negative") %>%
  group_by(airline, negativereason) %>%
  summarise(n_row = n())

ggplot(dist2, aes(x=airline, y=n_row, fill=negativereason)) + geom_bar(stat = "identity", position = "dodge") + theme_bw()

Now we can see that the biggest problems across airlines are customer service issues and late flights.

Let’s study the tweets more closely.

Counting words using a text corpus

A corpus is a collection of documents (or texts) on which we will perform our analysis. Using a corpus makes our life easier: otherwise we would have to clean the text and count the words by hand.

Building corpus

To process the text, we first need to “clean” it of punctuation, links and other things that can affect our analysis. To build the corpus, we first need to convert the encoding of our texts to “UTF-8”:

sacred_texts <- iconv(airline_df$text, to = "utf-8")

Note: if you are using a Mac, use “utf-8-mac” instead of “utf-8”.
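
If you want your script to pick the right encoding automatically, a small sketch (assuming Sys.info() reports "Darwin" on macOS) could look like this:

target_enc <- if (Sys.info()[["sysname"]] == "Darwin") "utf-8-mac" else "utf-8" # choose encoding by operating system
sacred_texts <- iconv(airline_df$text, to = target_enc)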

The iconv() function converts character vectors to a specified encoding. Now we will create a corpus based on the converted texts:

corpus <- Corpus(VectorSource(sacred_texts))
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] @VirginAmerica What @dhepburn said.                                                                                           
## [2] @VirginAmerica plus you've added commercials to the experience... tacky.                                                      
## [3] @VirginAmerica I didn't today... Must mean I need to take another trip!                                                       
## [4] @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
## [5] @VirginAmerica and it's a really big bad thing about it

Cleaning corpus

The next step is to clean the corpus of punctuation, links, and so on. To do this we will use the tm_map() function (from the tm package). The first argument passed to tm_map() is the corpus, and the second is the transformation you want to apply to clean the data. Some widely used transformations are:

  • tolower - converts all text to lower case.
  • removePunctuation - removes punctuation such as periods, commas and dashes.
  • removeNumbers - removes numbers from the text.
  • wordLengths - a control option (used when building a term-document matrix) that drops words shorter than 3 characters (3 is the default lower bound and can be changed).
  • removeWords - removes the words that you specify (see the short example after this list).
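
For example, removeWords can be combined with tm_map() to drop any custom vocabulary, not only the built-in stop words. A minimal sketch (the tiny corpus and the word list are made up for illustration):

demo_corpus <- Corpus(VectorSource(c("the flight was great", "great crew, terrible delay")))
demo_corpus <- tm_map(demo_corpus, removeWords, c("great", "terrible")) # drop our own word list
inspect(demo_corpus)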

Let’s go through the cleaning process step by step. First we will convert all uppercase letters to lowercase:

corpus <- tm_map(corpus, tolower)
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] @virginamerica what @dhepburn said.                                                                                           
## [2] @virginamerica plus you've added commercials to the experience... tacky.                                                      
## [3] @virginamerica i didn't today... must mean i need to take another trip!                                                       
## [4] @virginamerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
## [5] @virginamerica and it's a really big bad thing about it

Next, we are removing punctuation from the texts:

corpus <- tm_map(corpus, removePunctuation) 
inspect(corpus[1:5]) 
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] virginamerica what dhepburn said                                                                                       
## [2] virginamerica plus youve added commercials to the experience tacky                                                     
## [3] virginamerica i didnt today must mean i need to take another trip                                                      
## [4] virginamerica its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse
## [5] virginamerica and its a really big bad thing about it

Removing numbers:

corpus <- tm_map(corpus, removeNumbers) 
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] virginamerica what dhepburn said                                                                                       
## [2] virginamerica plus youve added commercials to the experience tacky                                                     
## [3] virginamerica i didnt today must mean i need to take another trip                                                      
## [4] virginamerica its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse
## [5] virginamerica and its a really big bad thing about it

And finally we can remove frequently used words (also called stop words) that carry no significant sentiment, for example “I”, “me”, “am”, “is”, “the”, etc.
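
To see exactly which words will be dropped, you can print a few of the built-in English stop words:

head(stopwords('english'), 10) # the first ten of the built-in English stop words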

cleanset <- tm_map(corpus, removeWords, stopwords('english')) 
inspect(cleanset[1:5]) 
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] virginamerica  dhepburn said                                                                        
## [2] virginamerica plus youve added commercials   experience tacky                                       
## [3] virginamerica  didnt today must mean  need  take another trip                                       
## [4] virginamerica  really aggressive  blast obnoxious entertainment   guests faces amp   little recourse
## [5] virginamerica    really big bad thing

Another point you should pay attention to is the order in which you clean the data. For example, earlier we removed punctuation from the text, which turned the link “https://t.co/hfhxqj0iob” into a meaningless long word: “httpstcohfhxqjiob”. If we want to remove the links completely, we should do it before removing the punctuation. Unfortunately there is no predefined function for removing links from text, so we will have to create one:

removeURL <- function(x) gsub('https://[[:alnum:]|[:punct:]]*', '', x) 
corpus2 <- Corpus(VectorSource(sacred_texts))
noUrl <- tm_map(corpus2, content_transformer(removeURL)) 
inspect(noUrl[1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] @VirginAmerica What @dhepburn said.                                                                                           
## [2] @VirginAmerica plus you've added commercials to the experience... tacky.                                                      
## [3] @VirginAmerica I didn't today... Must mean I need to take another trip!                                                       
## [4] @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
## [5] @VirginAmerica and it's a really big bad thing about it

Let’s go step by step and look at what is going on. First, we create the function “removeURL”, which we will use inside tm_map(). It takes a string (“x”) and removes all links from it using the following gsub() call:

gsub('https://[[:alnum:]|[:punct:]]*', '', x)

Here we match a substring that starts with https:// and is followed by zero or more letters and digits ([:alnum:]) or (|) punctuation characters ([:punct:]). Such a pattern is called a regular expression. We will not go into the details in this course, but if you want to know more you can consult the following documentation: http://www.endmemo.com/program/R/gsub.php
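
To see what the pattern does, you can try it on a made-up string (the link below is purely illustrative):

gsub('https://[[:alnum:]|[:punct:]]*', '', "great service https://t.co/abc123 thanks") # the URL part is removed, the surrounding words are kept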

We create another corpus (“corpus2”) and use tm_map() to remove the links. Here content_transformer() (from the tm package) is used to turn an ordinary function into a transformation function. Since tm_map() accepts only transformation functions, we wrap our removeURL function with content_transformer().
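
The same pattern works for any other custom cleaning step. For instance, here is a sketch of a function of our own (removeMentions is not a tm function) that strips Twitter @-mentions; it is not used in the rest of the lab:

removeMentions <- function(x) gsub('@[[:alnum:]_]*', '', x)
noMentions <- tm_map(noUrl, content_transformer(removeMentions)) # wrapped so tm_map() accepts it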

Let us proceed with cleaning the corpus. This time we also remove the airline names themselves (with and without spaces), since they appear in most tweets and tell us nothing about sentiment:

noUrl <- tm_map(noUrl, tolower)
noUrl <- tm_map(noUrl, removePunctuation) 
noUrl <- tm_map(noUrl, removeNumbers)

airlines_low <- sapply(unique(airline_df$airline), tolower)
airlines_low_nowhitespace <- gsub(" ", "", airlines_low)

cleanset <- tm_map( noUrl, 
                    removeWords, 
                    c(stopwords('english'), airlines_low, airlines_low_nowhitespace)
                    ) 
cleanset <- tm_map(cleanset, stripWhitespace)
inspect(cleanset[1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1]  dhepburn said                                                                   
## [2]  plus youve added commercials experience tacky                                   
## [3]  didnt today must mean need take another trip                                    
## [4]  really aggressive blast obnoxious entertainment guests faces amp little recourse
## [5]  really big bad thing

The result of our cleaning is stored in the variable cleanset: the text is now in lower case, and punctuation, numbers, stop words and the airline names have all been removed.

Document-Term Matrix

Now let us proceed toward our goal. The next step is to count how many times each word occurs in our data. To do so, we will build a document-term matrix:

dtm <- DocumentTermMatrix(cleanset)
inspect(dtm)
## <<DocumentTermMatrix (documents: 14640, terms: 14375)>>
## Non-/sparse entries: 134132/210315868
## Sparsity           : 100%
## Maximal term length: 46
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    americanair can cancelled flight get jetblue just now southwestair
##   11499           0   0         0      0   0       0    0   0            0
##   11703           0   0         0      0   0       0    0   0            0
##   11715           0   0         0      0   0       0    0   0            0
##   2855            0   0         0      0   0       0    0   0            0
##   3698            0   0         0      2   0       0    0   0            0
##   3995            0   0         0      0   0       0    0   0            0
##   4357            0   0         0      1   0       0    1   1            1
##   4797            0   0         1      2   0       0    0   0            1
##   6764            0   0         2      2   0       2    0   1            0
##   7786            0   1         0      0   0       1    0   0            0
##        Terms
## Docs    thanks
##   11499      0
##   11703      0
##   11715      0
##   2855       0
##   3698       0
##   3995       0
##   4357       0
##   4797       0
##   6764       0
##   7786       0

From the output we can learn a few interesting facts:

  • our cleanset contains 14375 distinct terms across 14640 documents;
  • we can see how many times each term occurs in each document (see the example below).
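
If you just want a quick look at the terms that occur often, the tm package also offers findFreqTerms(); for example (the threshold of 100 is arbitrary):

findFreqTerms(dtm, lowfreq = 100) # terms appearing at least 100 times across all documents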

However, working with a DocumentTermMatrix object is awkward with standard R functions, so let’s convert it to a regular data frame:

dtm <- as.data.frame(as.matrix(dtm))
dtm$AAirline <- airline_df$airline # add each tweet's airline as an extra column
dtm[1:10, 1:20]
##    dhepburn said added commercials experience plus tacky youve another
## 1         1    1     0           0          0    0     0     0       0
## 2         0    0     1           1          1    1     1     1       0
## 3         0    0     0           0          0    0     0     0       1
## 4         0    0     0           0          0    0     0     0       0
## 5         0    0     0           0          0    0     0     0       0
## 6         0    0     0           0          0    0     0     0       0
## 7         0    0     0           0          0    0     0     0       0
## 8         0    0     0           0          0    0     0     0       0
## 9         0    0     0           0          0    0     0     0       0
## 10        0    0     0           0          0    0     0     0       0
##    didnt mean must need take today trip aggressive amp blast entertainment
## 1      0    0    0    0    0     0    0          0   0     0             0
## 2      0    0    0    0    0     0    0          0   0     0             0
## 3      1    1    1    1    1     1    1          0   0     0             0
## 4      0    0    0    0    0     0    0          1   1     1             1
## 5      0    0    0    0    0     0    0          0   0     0             0
## 6      1    0    0    0    0     0    0          0   0     0             0
## 7      0    0    0    0    0     0    0          0   0     0             0
## 8      0    0    0    0    0     0    0          0   0     0             0
## 9      0    0    0    0    0     0    0          0   0     0             0
## 10     0    0    0    0    0     0    0          0   0     0             0

Great! We obtained the data we need. Now let’s do some analysis on it. First, we will count the words for each airline, regardless of the document in which they appeared:

wordCount <- dtm %>%
  group_by(AAirline) %>%
  summarise_all(sum)

wordCount
## # A tibble: 6 x 14,376
##   AAirline   dhepburn  said added commercials experience  plus tacky youve
##   <fct>         <dbl> <dbl> <dbl>       <dbl>      <dbl> <dbl> <dbl> <dbl>
## 1 American         0.   38.    0.          0.        29.    5.    0.   14.
## 2 Delta            0.   29.    3.          2.        25.    5.    0.    2.
## 3 Southwest        0.   20.    4.          2.        25.    4.    0.    8.
## 4 United           0.   47.    5.          0.        56.   37.    0.   22.
## 5 US Airways       0.   42.    5.          1.        55.    7.    0.    8.
## 6 Virgin Am~       1.    2.    3.          1.        12.    1.    1.    2.
## # ... with 14,367 more variables: another <dbl>, didnt <dbl>, mean <dbl>,
## #   must <dbl>, need <dbl>, take <dbl>, today <dbl>, trip <dbl>,
## #   aggressive <dbl>, amp <dbl>, blast <dbl>, entertainment <dbl>,
## #   faces <dbl>, guests <dbl>, little <dbl>, obnoxious <dbl>,
## #   really <dbl>, recourse <dbl>, bad <dbl>, big <dbl>, thing <dbl>,
## #   flight <dbl>, flying <dbl>, pay <dbl>, playing <dbl>, seats <dbl>,
## #   seriously <dbl>, away <dbl>, ear <dbl>, every <dbl>, fly <dbl>,
## #   nearly <dbl>, time <dbl>, wonã <dbl>, wormã <dbl>, yes <dbl>,
## #   hats <dbl>, men <dbl>, missed <dbl>, opportunity <dbl>, parody <dbl>,
## #   prime <dbl>, without <dbl>, didntã <dbl>, now <dbl>, well <dbl>,
## #   amazing <dbl>, arrived <dbl>, early <dbl>, good <dbl>, hour <dbl>,
## #   youre <dbl>, among <dbl>, cause <dbl>, death <dbl>, know <dbl>,
## #   leading <dbl>, second <dbl>, suicide <dbl>, teens <dbl>, better <dbl>,
## #   graphics <dbl>, iconography <dbl>, minimal <dbl>, much <dbl>,
## #   pretty <dbl>, already <dbl>, australia <dbl>, deal <dbl>, even <dbl>,
## #   gone <dbl>, great <dbl>, havent <dbl>, thinking <dbl>, yet <dbl>,
## #   fabulous <dbl>, httptcoahlxhhkiyn <dbl>, seductive <dbl>, skies <dbl>,
## #   stress <dbl>, travel <dbl>, virginmedia <dbl>, thanks <dbl>,
## #   mia <dbl>, schedule <dbl>, sfopdx <dbl>, still <dbl>, country <dbl>,
## #   cross <dbl>, daystogo <dbl>, excited <dbl>, first <dbl>, heard <dbl>,
## #   ive <dbl>, lax <dbl>, mco <dbl>, nothing <dbl>, things <dbl>,
## #   couldnt <dbl>, due <dbl>, ...

Let’s gather the word columns into long format, so that each row holds one airline, one word and its count:

gathered_wc <- gather(wordCount, "word", "amount", 2:ncol(wordCount))
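
Note that gather() has been superseded by pivot_longer() in newer versions of tidyr; if your version provides it, an equivalent call would look roughly like this:

gathered_wc <- pivot_longer(wordCount, cols = -AAirline, names_to = "word", values_to = "amount")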

And visualize the result:

ggplot(filter(gathered_wc, amount > 50)) + 
  geom_col(aes(x=word, y=amount, fill=AAirline), position = "dodge") + 
  coord_flip()

Word cloud

There is a better way to visualize the results we obtained: word clouds. Let’s build one for each airline.

unique(airline_df$airline)
## [1] Virgin America United         Southwest      Delta         
## [5] US Airways     American      
## Levels: American Delta Southwest United US Airways Virgin America

Virgin America

set.seed(8)

gathered_wc %>%
filter(AAirline == "Virgin America") %>% 
  select(word, amount) %>%
  wordcloud2(size = 0.6, # set scale of the words
   shape = 'triangle', # shape of the cloud
   rotateRatio = 0.5, # share of words that are drawn rotated
   minSize = 10) # minimal frequency of the word

Try running word clouds for other companies:
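
To avoid copy-pasting the whole pipeline, you could wrap it in a small helper of your own (plot_airline_cloud is a hypothetical name, not part of wordcloud2):

plot_airline_cloud <- function(airline_name) {
  gathered_wc %>%
    filter(AAirline == airline_name) %>%
    select(word, amount) %>%
    wordcloud2(size = 0.6, shape = 'triangle', rotateRatio = 0.5, minSize = 10)
}
# e.g. plot_airline_cloud("United")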

United

Southwest

Delta