In this lab session we will analyse user comments for various airline companies.
- We will study the users’ comments and find the most frequent words.
- We will perform sentiment analysis for each airline and find the best one.
This lab exercise consists of two parts.
- First part: we will perform descriptive analysis and visualization.
- Second part: we will see an example of k-NN and how it can be used for text classification.
Libraries
For today’s session we will need the following libraries:
library("dplyr")
library("tidyr")
library("ggplot2")
library("class")
library("tm") # Text mining package
library("wordcloud2") # Package for building word clouds
library("syuzhet") # Package for sentement analysis
library("stringr") # Package for work with strings
library("randomForest")
Loading data
airline_df <- read.csv(file.choose()) # airline.csv
head(airline_df, 3)
## tweet_id airline_sentiment negativereason airline name
## 1 5.703061e+17 neutral Virgin America cairdin
## 2 5.703011e+17 positive Virgin America jnardino
## 3 5.703011e+17 neutral Virgin America yvonnalynn
## text
## 1 @VirginAmerica What @dhepburn said.
## 2 @VirginAmerica plus you've added commercials to the experience... tacky.
## 3 @VirginAmerica I didn't today... Must mean I need to take another trip!
## tweet_coord tweet_created tweet_location
## 1 2015-02-24 11:35:52 -0800
## 2 2015-02-24 11:15:59 -0800
## 3 2015-02-24 11:15:48 -0800 Lets Play
str(airline_df)
## 'data.frame': 14640 obs. of 9 variables:
## $ tweet_id : num 5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
## $ airline_sentiment: Factor w/ 3 levels "negative","neutral",..: 2 3 2 1 1 1 3 2 3 3 ...
## $ negativereason : Factor w/ 11 levels "","Bad Flight",..: 1 1 1 2 3 3 1 1 1 1 ...
## $ airline : Factor w/ 6 levels "American","Delta",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ name : Factor w/ 7701 levels "___the___","__betrayal",..: 1073 3477 7666 3477 3477 3477 1392 5658 1874 7665 ...
## $ text : Factor w/ 14427 levels "\"LOL you guys are so on it\" - me, had this been 4 months ago...â@JetBlue: Our fleet's on fleek. http://t.co/LYcARlTFHlâ",..: 14005 13912 13790 13844 13648 13926 14038 13917 14004 13846 ...
## $ tweet_coord : Factor w/ 833 levels "","[-33.87144962, 151.20821275]",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ tweet_created : Factor w/ 14247 levels "2015-02-16 23:36:05 -0800",..: 14212 14170 14169 14168 14166 14165 14164 14160 14158 14106 ...
## $ tweet_location : Factor w/ 3082 levels "","'Greatness has no limits'",..: 1 1 1465 1 1 1 2407 1529 2389 1529 ...
The “text” column contains the comments left by customers. Our objective will be to study them and figure out the customers’ mood. This will help us understand what customers think about the airlines they use.
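It is worth glancing at a few raw comments first (a quick check; as.character() is used because text is stored as a factor):
head(as.character(airline_df$text), 5) # a few raw tweets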
Now let’s print the list of airlines that we have in our data:
unique(airline_df$airline)
## [1] Virgin America United Southwest Delta
## [5] US Airways American
## Levels: American Delta Southwest United US Airways Virgin America
Let’s look at the distribution of tweets for each airline.
dist <- airline_df %>%
group_by(airline) %>%
summarise(n_row = n())
ggplot(dist, aes(x=airline, y=n_row)) + geom_bar(stat = "identity") + theme_bw()
Now, let’s look at the number of tweets for each sentiment:
dist2 <- airline_df %>%
group_by(airline, airline_sentiment) %>%
summarise(n_row = n())
ggplot(dist2, aes(x=airline, y=n_row, fill=airline_sentiment)) + geom_bar(stat = "identity", position = "dodge") + theme_bw()
dist2 <- airline_df %>%
filter(airline_sentiment == "negative") %>%
group_by(airline, negativereason) %>%
summarise(n_row = n())
ggplot(dist2, aes(x=airline, y=n_row, fill=negativereason)) + geom_bar(stat = "identity", position = "dodge") + theme_bw()
From this plot we can tell that the biggest problems for the airlines are customer service and late flights.
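As a quick check of that claim, we can count the negative reasons across all airlines combined (a sketch using the same columns as above):
airline_df %>%
  filter(airline_sentiment == "negative") %>%
  group_by(negativereason) %>%
  summarise(n_row = n()) %>%
  arrange(desc(n_row)) # most frequent complaint types first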
Let’s study the tweets more closely.
Counting words using text corpus
A corpus is a collection of documents (or texts) that we will use for our analysis. Working with a corpus makes our life easier, because otherwise we would have to clean the text and count the words manually.
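As a minimal toy illustration (separate from the lab data), a corpus can be built from any character vector, with one document per element:
toy_texts <- c("First tiny document.", "Second tiny document!")
toy_corpus <- Corpus(VectorSource(toy_texts)) # two documents, one per vector element
inspect(toy_corpus)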
Building corpus
To process the text, we first need to “clean” it of punctuation, links and other things that can affect our analysis. To build the corpus, we first need to change the encoding of our texts to “UTF-8”:
sacred_texts <- iconv(airline_df$text, to = "utf-8")
Note: If you are using a Mac, use “utf-8-mac” instead of “utf-8”.
The iconv() function converts character vectors to a specified encoding. Now we will create a corpus based on the texts we converted:
corpus <- Corpus(VectorSource(sacred_texts))
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] @VirginAmerica What @dhepburn said.
## [2] @VirginAmerica plus you've added commercials to the experience... tacky.
## [3] @VirginAmerica I didn't today... Must mean I need to take another trip!
## [4] @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse
## [5] @VirginAmerica and it's a really big bad thing about it
Cleaning corpus
The next step is to clean the corpus of punctuation, links and so on. To do this we will use the tm_map() function (from the tm package). The first argument passed into tm_map() is the corpus and the second is the specific method that you want to use to clean the data. Some of the widely used methods are:
- tolower - converts all text to lower case.
- removePunctuation - removes punctuation such as periods, commas and dashes.
- removeNumbers - removes numbers from the text.
- wordLengths - removes words with fewer than 3 characters (the default of 3 can be changed); see the sketch after this list.
- removeWords - removes the words that you specify.
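A small sketch of the last two options, with a purely illustrative word list. Note that wordLengths is not a tm_map() transformation itself but a control option applied later, when the document-term matrix is built:
# removeWords with a custom (illustrative) word list
demo_corpus <- tm_map(corpus, removeWords, c("flight", "plane"))

# wordLengths is passed via the control argument when the matrix is built:
# keep only terms with at least 4 characters (the default lower bound is 3)
demo_dtm <- DocumentTermMatrix(demo_corpus, control = list(wordLengths = c(4, Inf)))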
Let’s go step by step through the cleaning process. First we will convert all uppercase letters to lowercase:
corpus <- tm_map(corpus, tolower)
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] @virginamerica what @dhepburn said.
## [2] @virginamerica plus you've added commercials to the experience... tacky.
## [3] @virginamerica i didn't today... must mean i need to take another trip!
## [4] @virginamerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse
## [5] @virginamerica and it's a really big bad thing about it
Next, we remove punctuation from the texts:
corpus <- tm_map(corpus, removePunctuation)
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] virginamerica what dhepburn said
## [2] virginamerica plus youve added commercials to the experience tacky
## [3] virginamerica i didnt today must mean i need to take another trip
## [4] virginamerica its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse
## [5] virginamerica and its a really big bad thing about it
Removing numbers:
corpus <- tm_map(corpus, removeNumbers)
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] virginamerica what dhepburn said
## [2] virginamerica plus youve added commercials to the experience tacky
## [3] virginamerica i didnt today must mean i need to take another trip
## [4] virginamerica its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse
## [5] virginamerica and its a really big bad thing about it
Finally, we can remove frequently used words (also called stop words) that make no significant contribution to sentiment. Examples of such words: “I”, “me”, “am”, “is”, “the”, etc.
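To get an idea of what the built-in English stop word list contains, we can print the first few entries before removing anything:
head(stopwords('english'), 20) # first 20 of tm's built-in English stop words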
cleanset <- tm_map(corpus, removeWords, stopwords('english'))
inspect(cleanset[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] virginamerica dhepburn said
## [2] virginamerica plus youve added commercials experience tacky
## [3] virginamerica didnt today must mean need take another trip
## [4] virginamerica really aggressive blast obnoxious entertainment guests faces amp little recourse
## [5] virginamerica really big bad thing
Another point you should pay attention to is the order in which you clean the data. For example, earlier we removed punctuation from the text, which turned the link “https://t.co/hfhxqj0iob” into a meaningless long word: “httpstcohfhxqjiob”. If we want to completely remove the links, we should do it before removing the punctuation. Unfortunately there is no predefined function for removing links from text, so we will have to create one:
removeURL <- function(x) gsub('https://[[:alnum:]|[:punct:]]*', '', x)
corpus2 <- Corpus(VectorSource(sacred_texts))
noUrl <- tm_map(corpus2, content_transformer(removeURL))
inspect(noUrl[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] @VirginAmerica What @dhepburn said.
## [2] @VirginAmerica plus you've added commercials to the experience... tacky.
## [3] @VirginAmerica I didn't today... Must mean I need to take another trip!
## [4] @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse
## [5] @VirginAmerica and it's a really big bad thing about it
Let’s go step by step and look at what is going on. First, we create the function removeURL, which we will use in our tm_map() call. This function takes a string (x) and removes all links from it using the following gsub() call:
gsub('https://[[:alnum:]|[:punct:]]*', '', x)
Here we match a substring that starts with https:// and is followed by zero or more letters and digits ([:alnum:]) or (|) punctuation characters ([:punct:]). This is called a regular expression. We will not go into details in this course, but if you want to know more you can consult the following documentation: http://www.endmemo.com/program/R/gsub.php
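To see the pattern in action, here is a quick check on a made-up string (not from the dataset):
gsub('https://[[:alnum:]|[:punct:]]*', '', "great crew today https://t.co/abc123 thanks")
# the https:// link is removed, the surrounding words are kept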
We create another corpus (corpus2) and use tm_map() to remove the links. Here content_transformer() (from the tm package) is used to turn a plain function into a transformation function. Since tm_map() accepts only transformation functions, we wrap our removeURL function with content_transformer().
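The same pattern works for any custom cleaning step. As a purely illustrative sketch (not used in the rest of this lab), a hypothetical removeMentions() function that strips Twitter handles such as @VirginAmerica could be wrapped in exactly the same way:
removeMentions <- function(x) gsub('@\\S+', '', x) # drop @username tokens
demo_noMentions <- tm_map(corpus2, content_transformer(removeMentions)) # same wrapping as for removeURL
inspect(demo_noMentions[1:3])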
Let us proceed with cleaning of the corpus:
noUrl <- tm_map(noUrl, tolower)
noUrl <- tm_map(noUrl, removePunctuation)
noUrl <- tm_map(noUrl, removeNumbers)
airlines_low <- sapply(unique(airline_df$airline), tolower) # airline names in lowercase
airlines_low_nowhitespace <- gsub(" ", "", airlines_low) # e.g. "virgin america" -> "virginamerica"
cleanset <- tm_map(noUrl,
removeWords,
c(stopwords('english'), airlines_low, airlines_low_nowhitespace)
)
cleanset <- tm_map(cleanset, stripWhitespace) # collapse the extra spaces left by removed words
inspect(cleanset[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] dhepburn said
## [2] plus youve added commercials experience tacky
## [3] didnt today must mean need take another trip
## [4] really aggressive blast obnoxious entertainment guests faces amp little recourse
## [5] really big bad thing
The result of our cleaning is stored in the variable cleanset. We converted the text to lowercase and removed a lot of meaningless words and punctuation; the following is what we receive in the end:
inspect(cleanset[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] dhepburn said
## [2] plus youve added commercials experience tacky
## [3] didnt today must mean need take another trip
## [4] really aggressive blast obnoxious entertainment guests faces amp little recourse
## [5] really big bad thing
Document-Term Matrix
Now let us proceed towards our goal. The next step is to count how many times each word occurs in our data. To do so, we will use a document-term matrix:
dtm <- DocumentTermMatrix(cleanset)
inspect(dtm)
## <<DocumentTermMatrix (documents: 14640, terms: 14375)>>
## Non-/sparse entries: 134132/210315868
## Sparsity : 100%
## Maximal term length: 46
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs americanair can cancelled flight get jetblue just now southwestair
## 11499 0 0 0 0 0 0 0 0 0
## 11703 0 0 0 0 0 0 0 0 0
## 11715 0 0 0 0 0 0 0 0 0
## 2855 0 0 0 0 0 0 0 0 0
## 3698 0 0 0 2 0 0 0 0 0
## 3995 0 0 0 0 0 0 0 0 0
## 4357 0 0 0 1 0 0 1 1 1
## 4797 0 0 1 2 0 0 0 0 1
## 6764 0 0 2 2 0 2 0 1 0
## 7786 0 1 0 0 0 1 0 0 0
## Terms
## Docs thanks
## 11499 0
## 11703 0
## 11715 0
## 2855 0
## 3698 0
## 3995 0
## 4357 0
## 4797 0
## 6764 0
## 7786 0
From the output we can figure out some interesting facts:
- In our cleanset there are 14375 different terms across 14640 documents;
- We can see how many times each word occurs in each document.
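The tm package also provides helpers that work directly on this object; for example, findFreqTerms() lists the terms whose overall frequency is at least a given threshold (the value 100 below is an arbitrary choice):
findFreqTerms(dtm, lowfreq = 100) # terms that appear at least 100 times in total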
But working with a DocumentTermMatrix would be problematic with standard functions in R, so let’s convert it to a regular data frame:
dtm <- as.data.frame(as.matrix(dtm)) # one row per tweet, one column per term
dtm$AAirline <- airline_df$airline # add the airline label for each tweet
dtm[1:10, 1:20]
## dhepburn said added commercials experience plus tacky youve another
## 1 1 1 0 0 0 0 0 0 0
## 2 0 0 1 1 1 1 1 1 0
## 3 0 0 0 0 0 0 0 0 1
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0
## didnt mean must need take today trip aggressive amp blast entertainment
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## 3 1 1 1 1 1 1 1 0 0 0 0
## 4 0 0 0 0 0 0 0 1 1 1 1
## 5 0 0 0 0 0 0 0 0 0 0 0
## 6 1 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0 0
Great! We obtained the data we need. Now let’s do some analysis on it. First, we calculate how many times each word appears for each airline, regardless of the tweet in which it appeared:
wordCount <- dtm %>%
group_by(AAirline) %>%
summarise_all(sum)
wordCount
## # A tibble: 6 x 14,376
## AAirline dhepburn said added commercials experience plus tacky youve
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 American 0. 38. 0. 0. 29. 5. 0. 14.
## 2 Delta 0. 29. 3. 2. 25. 5. 0. 2.
## 3 Southwest 0. 20. 4. 2. 25. 4. 0. 8.
## 4 United 0. 47. 5. 0. 56. 37. 0. 22.
## 5 US Airways 0. 42. 5. 1. 55. 7. 0. 8.
## 6 Virgin Am~ 1. 2. 3. 1. 12. 1. 1. 2.
## # ... with 14,367 more variables: another <dbl>, didnt <dbl>, mean <dbl>,
## # must <dbl>, need <dbl>, take <dbl>, today <dbl>, trip <dbl>,
## # aggressive <dbl>, amp <dbl>, blast <dbl>, entertainment <dbl>,
## # faces <dbl>, guests <dbl>, little <dbl>, obnoxious <dbl>,
## # really <dbl>, recourse <dbl>, bad <dbl>, big <dbl>, thing <dbl>,
## # flight <dbl>, flying <dbl>, pay <dbl>, playing <dbl>, seats <dbl>,
## # seriously <dbl>, away <dbl>, ear <dbl>, every <dbl>, fly <dbl>,
## # nearly <dbl>, time <dbl>, wonã <dbl>, wormã <dbl>, yes <dbl>,
## # hats <dbl>, men <dbl>, missed <dbl>, opportunity <dbl>, parody <dbl>,
## # prime <dbl>, without <dbl>, didntã <dbl>, now <dbl>, well <dbl>,
## # amazing <dbl>, arrived <dbl>, early <dbl>, good <dbl>, hour <dbl>,
## # youre <dbl>, among <dbl>, cause <dbl>, death <dbl>, know <dbl>,
## # leading <dbl>, second <dbl>, suicide <dbl>, teens <dbl>, better <dbl>,
## # graphics <dbl>, iconography <dbl>, minimal <dbl>, much <dbl>,
## # pretty <dbl>, already <dbl>, australia <dbl>, deal <dbl>, even <dbl>,
## # gone <dbl>, great <dbl>, havent <dbl>, thinking <dbl>, yet <dbl>,
## # fabulous <dbl>, httptcoahlxhhkiyn <dbl>, seductive <dbl>, skies <dbl>,
## # stress <dbl>, travel <dbl>, virginmedia <dbl>, thanks <dbl>,
## # mia <dbl>, schedule <dbl>, sfopdx <dbl>, still <dbl>, country <dbl>,
## # cross <dbl>, daystogo <dbl>, excited <dbl>, first <dbl>, heard <dbl>,
## # ive <dbl>, lax <dbl>, mco <dbl>, nothing <dbl>, things <dbl>,
## # couldnt <dbl>, due <dbl>, ...
Let’s gather the columns:
gathered_wc <- gather(wordCount, "word", "amount", 2:ncol(wordCount))
And visualize the result:
ggplot(filter(gathered_wc, amount > 50)) +
geom_bar(aes(x=word, y=amount, fill=AAirline), stat = "identity", position = "dodge") +
coord_flip()
Word cloud
There is a better way to visualize the results we obtained: word clouds.
unique(airline_df$airline)
## [1] Virgin America United Southwest Delta
## [5] US Airways American
## Levels: American Delta Southwest United US Airways Virgin America
Virgin America
set.seed(8)
gathered_wc %>%
filter(AAirline == "Virgin America") %>%
select(word, amount) %>%
wordcloud2(size = 0.6, # set scale of the words
shape = 'triangle', # shape of the cloud
rotateRatio = 0.5, # proportion of words that get rotated
minSize = 10) # minimal frequency of the word
Try running word clouds for other companies:
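For example, a cloud for United could be built in the same way; only the filter value changes (the other settings below simply mirror the ones above and can be tweaked):
gathered_wc %>%
  filter(AAirline == "United") %>%
  select(word, amount) %>%
  wordcloud2(size = 0.6,
             shape = 'circle',
             rotateRatio = 0.5,
             minSize = 10)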