In this lab session we will analyse user comments for various airline companies.

This lab exercise consists of two parts.

Libraries

For today's session we will need the following libraries:

library("dplyr")
library("tidyr")
library("ggplot2")

library("tm")           # Text mining package
library("wordcloud2")   # Package for building word clouds
library("syuzhet")      # Package for sentement analysis
library("stringr")      # Package for work with strings
library("class")        # KNN
library("e1071")        # For SVM

Loading data

airline_df <- read.csv(file.choose()) # airline.csv
head(airline_df, 3)
##       tweet_id airline_sentiment negativereason        airline       name
## 1 5.703061e+17           neutral                Virgin America    cairdin
## 2 5.703011e+17          positive                Virgin America   jnardino
## 3 5.703011e+17           neutral                Virgin America yvonnalynn
##                                                                       text
## 1                                      @VirginAmerica What @dhepburn said.
## 2 @VirginAmerica plus you've added commercials to the experience... tacky.
## 3  @VirginAmerica I didn't today... Must mean I need to take another trip!
##   tweet_coord             tweet_created tweet_location
## 1             2015-02-24 11:35:52 -0800               
## 2             2015-02-24 11:15:59 -0800               
## 3             2015-02-24 11:15:48 -0800      Lets Play
str(airline_df)
## 'data.frame':    14640 obs. of  9 variables:
##  $ tweet_id         : num  5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
##  $ airline_sentiment: Factor w/ 3 levels "negative","neutral",..: 2 3 2 1 1 1 3 2 3 3 ...
##  $ negativereason   : Factor w/ 11 levels "","Bad Flight",..: 1 1 1 2 3 3 1 1 1 1 ...
##  $ airline          : Factor w/ 6 levels "American","Delta",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ name             : Factor w/ 7701 levels "___the___","__betrayal",..: 1073 3477 7666 3477 3477 3477 1392 5658 1874 7665 ...
##  $ text             : Factor w/ 14427 levels "\"LOL you guys are so on it\" - me, had this been 4 months ago...“@JetBlue: Our fleet's on fleek. http://t.co/LYcARlTFHl”",..: 14005 13912 13790 13844 13648 13926 14038 13917 14004 13846 ...
##  $ tweet_coord      : Factor w/ 833 levels "","[-33.87144962, 151.20821275]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ tweet_created    : Factor w/ 14247 levels "2015-02-16 23:36:05 -0800",..: 14212 14170 14169 14168 14166 14165 14164 14160 14158 14106 ...
##  $ tweet_location   : Factor w/ 3082 levels "","'Greatness has no limits'",..: 1 1 1465 1 1 1 2407 1529 2389 1529 ...

Here, the “text” column contains comments left by customers. Our objective is to study them and figure out the customers’ mood. This will help us understand what customers think about the airlines they use.

Now let’s print the list of airlines that we have in our data:

unique(airline_df$airline)
## [1] Virgin America United         Southwest      Delta         
## [5] US Airways     American      
## Levels: American Delta Southwest United US Airways Virgin America

Let’s look at the distribution of tweets for each airline.

dist <- airline_df %>%
  group_by(airline) %>%
  summarise(n_row = n())

ggplot(dist, aes(x=airline, y=n_row)) + geom_bar(stat = "identity") + theme_bw()

We googled the domestic market share for airlines and found the following plot:

Now, let’s count the tweets for each airline and sentiment:

dist2 <- airline_df %>%
  group_by(airline, airline_sentiment) %>%
  summarise(n_row = n())

ggplot(dist2, aes(x=airline, y=n_row, fill=airline_sentiment)) + geom_bar(stat = "identity", position = "dodge") + theme_bw()

dist2 <- airline_df %>%
  filter(airline_sentiment == "negative") %>%
  group_by(airline, negativereason) %>%
  summarise(n_row = n())

ggplot(dist2, aes(x=airline, y=n_row, fill=negativereason)) + geom_bar(stat = "identity", position = "dodge") + theme_bw()

Now we can tell that the biggest problems for the airlines are customer service issues and late flights.

Let’s study the tweets more closely.

Counting words using a text corpus

A corpus is a collection of documents (or texts) that we will use for our analysis. Using a corpus makes our life easier: otherwise we would have to clean the text and count the words by hand.

Building a corpus

To process the text, we first need to “clean” it of punctuation, links and other things that can affect our analysis. To build the corpus, we first change the encoding of our texts to “UTF-8”:

sacred_texts <- iconv(airline_df$text, to = "utf-8")

Note: If you are using a Mac, use “utf-8-mac” instead of “utf-8”.

The function iconv() converts character vectors to a specified encoding. Now we will create a corpus based on the converted texts:

corpus <- Corpus(VectorSource(sacred_texts))
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] @VirginAmerica What @dhepburn said.                                                                                           
## [2] @VirginAmerica plus you've added commercials to the experience... tacky.                                                      
## [3] @VirginAmerica I didn't today... Must mean I need to take another trip!                                                       
## [4] @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
## [5] @VirginAmerica and it's a really big bad thing about it

Cleaning the corpus

The next step is to clean the corpus of punctuation, links, etc. To do this we use the function tm_map() (from the tm package). The first argument passed to tm_map() is the corpus; the second is the specific method you want to use to clean the data. Some widely used methods are:

  • tolower - converts all text to lower case.
  • removePunctuation - removes punctuation such as dots, commas or dashes.
  • removeNumbers - removes numbers from the text.
  • wordLengths - strictly speaking not a tm_map() method but a control option used when the Document-Term matrix is built; by default it drops words shorter than 3 characters (the bounds can be changed); see the note after this list.
  • removeWords - removes the words that you specify.
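
Since the word-length filter is applied when the Document-Term matrix is built rather than through tm_map(), here is a minimal illustration (a sketch using the corpus object created above and the DocumentTermMatrix() function we will meet later) of how the default bound of 3 characters can be changed:

dtm_min4 <- DocumentTermMatrix(corpus, control = list(wordLengths = c(4, Inf))) # keep only words of 4+ characters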

Let’s go step by step through the cleaning process. First, we convert all uppercase letters to lowercase:

corpus <- tm_map(corpus, tolower)
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] @virginamerica what @dhepburn said.                                                                                           
## [2] @virginamerica plus you've added commercials to the experience... tacky.                                                      
## [3] @virginamerica i didn't today... must mean i need to take another trip!                                                       
## [4] @virginamerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
## [5] @virginamerica and it's a really big bad thing about it

Next, we are removing punctuation from the texts:

corpus <- tm_map(corpus, removePunctuation) 
inspect(corpus[1:5]) 
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] virginamerica what dhepburn said                                                                                       
## [2] virginamerica plus youve added commercials to the experience tacky                                                     
## [3] virginamerica i didnt today must mean i need to take another trip                                                      
## [4] virginamerica its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse
## [5] virginamerica and its a really big bad thing about it

Removing numbers:

corpus <- tm_map(corpus, removeNumbers) 
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] virginamerica what dhepburn said                                                                                       
## [2] virginamerica plus youve added commercials to the experience tacky                                                     
## [3] virginamerica i didnt today must mean i need to take another trip                                                      
## [4] virginamerica its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse
## [5] virginamerica and its a really big bad thing about it

Finally, we can remove very common words (so-called stop words) that carry no significant sentiment. Examples of such words: “I”, “me”, “am”, “is”, “the”, etc.

cleanset <- tm_map(corpus, removeWords, stopwords('english')) 
inspect(cleanset[1:5]) 
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] virginamerica  dhepburn said                                                                        
## [2] virginamerica plus youve added commercials   experience tacky                                       
## [3] virginamerica  didnt today must mean  need  take another trip                                       
## [4] virginamerica  really aggressive  blast obnoxious entertainment   guests faces amp   little recourse
## [5] virginamerica    really big bad thing

Another point to pay attention to is the order in which you clean the data. For example, earlier we removed punctuation from the text, which turned the link “https://t.co/hfhxqj0iob” into a meaningless long word: “httpstcohfhxqjiob”. If we wanted to remove links completely, we should have done it before removing the punctuation. Unfortunately there is no ready-made function for removing links from text, so we will have to create one:

removeURL <- function(x) gsub('https://[[:alnum:]|[:punct:]]*', '', x) 
corpus2 <- Corpus(VectorSource(sacred_texts))
noUrl <- tm_map(corpus2, content_transformer(removeURL)) 
inspect(noUrl [1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] @VirginAmerica What @dhepburn said.                                                                                           
## [2] @VirginAmerica plus you've added commercials to the experience... tacky.                                                      
## [3] @VirginAmerica I didn't today... Must mean I need to take another trip!                                                       
## [4] @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
## [5] @VirginAmerica and it's a really big bad thing about it

Let’s go step by step and see what is going on. First, we create the function removeURL, which we will use inside tm_map(). This function takes a string (“x”) and removes all links from it using the following gsub() call:

gsub('https://[[:alnum:]|[:punct:]]*', '', x)

Here we match a substring that starts with https:// and is followed by zero or more alphanumeric ([:alnum:]) or punctuation ([:punct:]) characters (the | between the two classes is matched literally inside the brackets, but since | is itself punctuation it does no harm). This is called a regular expression. In this course we will not go into details, but if you want to know more you can consult the following documentation: http://www.endmemo.com/program/R/gsub.php
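
As a quick check on a made-up string, we can call removeURL() directly. Note that, as written, the pattern only matches links starting with https://; plain http:// links slip through (which is why a term like httptcoahlxhhkiyn still shows up in the word counts later). Replacing https:// with https?:// in the pattern would catch both; we keep the original version here.

removeURL("see https://t.co/hfhxqj0iob and http://t.co/hfhxqj0iob") # only the https link is removed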

We create another corpus (“corpus2”) and use tm_map() to remove the links. Here content_transformer() (from the tm package) turns a plain function into a transformation function. Since tm_map() expects transformation functions, we wrap our removeURL function in content_transformer().

Let us proceed with cleaning of the corpus:

<YOUR CODE> # Repeat cleaning as we did before
            # Cast letters to lower case
            # Remove punctuation
            # Remove numbers
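# One possible solution (a sketch): repeat the earlier cleaning steps on the
# URL-free corpus, overwriting noUrl at each step. content_transformer() keeps
# the corpus intact; plain tolower also worked above.
noUrl <- tm_map(noUrl, content_transformer(tolower))
noUrl <- tm_map(noUrl, removePunctuation)
noUrl <- tm_map(noUrl, removeNumbers)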
airlines_low <- sapply(unique(airline_df$airline), tolower)
airlines_low_nowhitespace <- gsub(" ", "", airlines_low)

cleanset <- tm_map( noUrl, 
                    removeWords, 
                    c(stopwords('english'), airlines_low, airlines_low_nowhitespace)
                    ) 
cleanset <- tm_map(cleanset, stripWhitespace)
inspect(cleanset[1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1]  dhepburn said                                                                   
## [2]  plus youve added commercials experience tacky                                   
## [3]  didnt today must mean need take another trip                                    
## [4]  really aggressive blast obnoxious entertainment guests faces amp little recourse
## [5]  really big bad thing

The result of our cleaning is stored in the variable cleanset: the text is now in lowercase, punctuation and numbers are gone, and many meaningless words have been removed, as the inspection above shows.

Document-Term Matrix

Now let us move toward our goal. The next step is to calculate how many times each word occurs in our data. To do so, we will build a Document-Term matrix:

dtm <- DocumentTermMatrix(cleanset)
inspect(dtm)
## <<DocumentTermMatrix (documents: 14640, terms: 14375)>>
## Non-/sparse entries: 134132/210315868
## Sparsity           : 100%
## Maximal term length: 46
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    americanair can cancelled flight get jetblue just now southwestair
##   11499           0   0         0      0   0       0    0   0            0
##   11703           0   0         0      0   0       0    0   0            0
##   11715           0   0         0      0   0       0    0   0            0
##   2855            0   0         0      0   0       0    0   0            0
##   3698            0   0         0      2   0       0    0   0            0
##   3995            0   0         0      0   0       0    0   0            0
##   4357            0   0         0      1   0       0    1   1            1
##   4797            0   0         1      2   0       0    0   0            1
##   6764            0   0         2      2   0       2    0   1            0
##   7786            0   1         0      0   0       1    0   0            0
##        Terms
## Docs    thanks
##   11499      0
##   11703      0
##   11715      0
##   2855       0
##   3698       0
##   3995       0
##   4357       0
##   4797       0
##   6764       0
##   7786       0

From the output we can figure out some interesting facts:

  • In our cleanset there are 14375 different words in 14640 documents;
  • Now we can see how many times each word occurs in each document.

But working with a DocumentTermMatrix object is inconvenient with standard R functions, so let’s convert it to a regular data frame:

dtm <- as.data.frame(as.matrix(dtm))
dtm$AAirline <- airline_df$airline
dtm[1:10, 1:20]
##    dhepburn said added commercials experience plus tacky youve another
## 1         1    1     0           0          0    0     0     0       0
## 2         0    0     1           1          1    1     1     1       0
## 3         0    0     0           0          0    0     0     0       1
## 4         0    0     0           0          0    0     0     0       0
## 5         0    0     0           0          0    0     0     0       0
## 6         0    0     0           0          0    0     0     0       0
## 7         0    0     0           0          0    0     0     0       0
## 8         0    0     0           0          0    0     0     0       0
## 9         0    0     0           0          0    0     0     0       0
## 10        0    0     0           0          0    0     0     0       0
##    didnt mean must need take today trip aggressive amp blast entertainment
## 1      0    0    0    0    0     0    0          0   0     0             0
## 2      0    0    0    0    0     0    0          0   0     0             0
## 3      1    1    1    1    1     1    1          0   0     0             0
## 4      0    0    0    0    0     0    0          1   1     1             1
## 5      0    0    0    0    0     0    0          0   0     0             0
## 6      1    0    0    0    0     0    0          0   0     0             0
## 7      0    0    0    0    0     0    0          0   0     0             0
## 8      0    0    0    0    0     0    0          0   0     0             0
## 9      0    0    0    0    0     0    0          0   0     0             0
## 10     0    0    0    0    0     0    0          0   0     0             0

Great! We obtained the data we need. Now let’s do some analysis on it. First, we count how many times each word appears for each airline, regardless of the document in which it appeared:

wordCount <- dtm %>%
  group_by(AAirline) %>%
  summarise_all(sum)

wordCount
## # A tibble: 6 x 14,376
##   AAirline   dhepburn  said added commercials experience  plus tacky youve
##   <fct>         <dbl> <dbl> <dbl>       <dbl>      <dbl> <dbl> <dbl> <dbl>
## 1 American         0.   38.    0.          0.        29.    5.    0.   14.
## 2 Delta            0.   29.    3.          2.        25.    5.    0.    2.
## 3 Southwest        0.   20.    4.          2.        25.    4.    0.    8.
## 4 United           0.   47.    5.          0.        56.   37.    0.   22.
## 5 US Airways       0.   42.    5.          1.        55.    7.    0.    8.
## 6 Virgin Am~       1.    2.    3.          1.        12.    1.    1.    2.
## # ... with 14,367 more variables: another <dbl>, didnt <dbl>, mean <dbl>,
## #   must <dbl>, need <dbl>, take <dbl>, today <dbl>, trip <dbl>,
## #   aggressive <dbl>, amp <dbl>, blast <dbl>, entertainment <dbl>,
## #   faces <dbl>, guests <dbl>, little <dbl>, obnoxious <dbl>,
## #   really <dbl>, recourse <dbl>, bad <dbl>, big <dbl>, thing <dbl>,
## #   flight <dbl>, flying <dbl>, pay <dbl>, playing <dbl>, seats <dbl>,
## #   seriously <dbl>, away <dbl>, ear <dbl>, every <dbl>, fly <dbl>,
## #   nearly <dbl>, time <dbl>, wonã <dbl>, wormã <dbl>, yes <dbl>,
## #   hats <dbl>, men <dbl>, missed <dbl>, opportunity <dbl>, parody <dbl>,
## #   prime <dbl>, without <dbl>, didntã <dbl>, now <dbl>, well <dbl>,
## #   amazing <dbl>, arrived <dbl>, early <dbl>, good <dbl>, hour <dbl>,
## #   youre <dbl>, among <dbl>, cause <dbl>, death <dbl>, know <dbl>,
## #   leading <dbl>, second <dbl>, suicide <dbl>, teens <dbl>, better <dbl>,
## #   graphics <dbl>, iconography <dbl>, minimal <dbl>, much <dbl>,
## #   pretty <dbl>, already <dbl>, australia <dbl>, deal <dbl>, even <dbl>,
## #   gone <dbl>, great <dbl>, havent <dbl>, thinking <dbl>, yet <dbl>,
## #   fabulous <dbl>, httptcoahlxhhkiyn <dbl>, seductive <dbl>, skies <dbl>,
## #   stress <dbl>, travel <dbl>, virginmedia <dbl>, thanks <dbl>,
## #   mia <dbl>, schedule <dbl>, sfopdx <dbl>, still <dbl>, country <dbl>,
## #   cross <dbl>, daystogo <dbl>, excited <dbl>, first <dbl>, heard <dbl>,
## #   ive <dbl>, lax <dbl>, mco <dbl>, nothing <dbl>, things <dbl>,
## #   couldnt <dbl>, due <dbl>, ...

Let’s gather the word columns into long format:

gathered_wc <- gather(wordCount, "word", "amount", 2:ncol(wordCount))

And visualize the result:

ggplot(filter(gathered_wc, amount > 50)) + 
  geom_col(aes(x=word, y=amount, fill=AAirline), position = "dodge") + 
  coord_flip()

Word cloud

There is a better way to visualize the results we obtained: word clouds.

unique(airline_df$airline)
## [1] Virgin America United         Southwest      Delta         
## [5] US Airways     American      
## Levels: American Delta Southwest United US Airways Virgin America

United
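
A possible sketch (the original code chunk is not shown here): build a word-frequency table for United from the gathered word counts and pass it to wordcloud2().

united_freq <- gathered_wc %>%
  filter(AAirline == "United", amount > 0) %>% # keep words that actually occur in United tweets
  select(word, freq = amount) %>%              # wordcloud2() expects columns named word and freq
  arrange(desc(freq)) %>%
  as.data.frame()

wordcloud2(united_freq)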

Try running word clouds for other companies:

<YOUR CODE> # Create your own wordcloud

Tweet distribution by day of the week

Let’s find the weekday for each tweet:

airline_df <- airline_df %>%
  mutate(weekday = as.POSIXlt(tweet_created)$wday + 1) # $wday runs 0-6 starting on Sunday, so weekday runs 1 (Sun) to 7 (Sat)
airline_df %>%
  group_by(airline, airline_sentiment, weekday) %>% # count within each sentiment so the facets show different values
  mutate(n_row = n()) %>%
ggplot(aes(x=weekday, y=n_row, color=airline)) + 
  geom_point() +
  geom_line() + 
  theme_bw() + 
  facet_grid(airline_sentiment ~ .)

Sentiment analysis

If you remember, we extracted the text from our data and changed its encoding:

head(sacred_texts, 3)
## [1] "@VirginAmerica What @dhepburn said."                                     
## [2] "@VirginAmerica plus you've added commercials to the experience... tacky."
## [3] "@VirginAmerica I didn't today... Must mean I need to take another trip!"

We will use it to do sentiment analysis.

Library

For this we will use the syuzhet package, which provides useful methods for extracting sentiment data.

Sentiment scores

To calculate sentiment scores we will use the NRC emotion lexicon, according to which there are 8 different emotions and 2 sentiments (negative and positive).

Let’s check it out by calculating sentiment scores for a couple of short texts:

rbind(get_nrc_sentiment('The flight was bad. Plane delayed on 2 hours!!!'),
get_nrc_sentiment('sunny'))
##   anger anticipation disgust fear joy sadness surprise trust negative
## 1     1            0       1    1   0       1        0     0        2
## 2     0            1       0    0   1       0        1     0        0
##   positive
## 1        0
## 2        1

Now let’s calculate scores for our texts:

scores <- get_nrc_sentiment(sacred_texts)

Let us study the scores a bit:

summary(scores)
##      anger         anticipation       disgust            fear       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1875   Mean   :0.3672   Mean   :0.1464   Mean   :0.2264  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :4.0000   Max.   :4.0000   Max.   :4.0000   Max.   :4.0000  
##       joy            sadness          surprise          trust       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.2265   Mean   :0.2859   Mean   :0.1536   Mean   :0.6552  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :4.0000   Max.   :4.0000   Max.   :4.0000   Max.   :5.0000  
##     negative         positive     
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000  
##  Mean   :0.4958   Mean   :0.8656  
##  3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :6.0000   Max.   :6.0000

Let’s add information about airlines to our data:

scores$Airline <- airline_df$airline

We can calculate the total score for each emotion and each airline:

scores <- scores %>% 
  mutate(rows = rowSums(select(., 2:10))) %>%
  group_by(Airline) %>%
  summarise(
    anger = sum(anger),
    anticipation = sum(anticipation),
    disgust = sum(disgust),
    fear = sum(fear),
    joy = sum(joy),
    sadness = sum(sadness),
    surprise = sum(surprise),
    negative = sum(negative),
    positive = sum(positive),
    rows = sum(rows))

Let us visualize our data:

scores_gathered <- scores %>% 
  gather("sentiment", "value", 2:10) %>%
  mutate(perc = value/rows * 100)

ggplot(scores_gathered, aes(x = sentiment, y = perc, fill = sentiment)) +
  geom_col() + 
  coord_flip() + 
  theme_bw() + 
  scale_fill_brewer(palette="RdYlGn") + 
  facet_grid(Airline ~ .)

Part 2: Text classification using K-NN and SVM

Data preparation

Let’s look at our data once again:

head(airline_df,3)
##       tweet_id airline_sentiment negativereason        airline       name
## 1 5.703061e+17           neutral                Virgin America    cairdin
## 2 5.703011e+17          positive                Virgin America   jnardino
## 3 5.703011e+17           neutral                Virgin America yvonnalynn
##                                                                       text
## 1                                      @VirginAmerica What @dhepburn said.
## 2 @VirginAmerica plus you've added commercials to the experience... tacky.
## 3  @VirginAmerica I didn't today... Must mean I need to take another trip!
##   tweet_coord             tweet_created tweet_location weekday
## 1             2015-02-24 11:35:52 -0800                      3
## 2             2015-02-24 11:15:59 -0800                      3
## 3             2015-02-24 11:15:48 -0800      Lets Play       3

Let’s predict sentiments based on the comments we have.

In our dataset we have two relevant columns: text (the comment) and airline_sentiment. We can use classification to predict the sentiment from the words in a comment. Before we start, to save lab time, we will keep only a subset of the rows:

airline_df <- airline_df %>% filter(! is.na(airline_sentiment))
airline_df <- airline_df[1:3640,]

Let’s clean the text (as we did before):

sacred_texts <- iconv(airline_df$text, to = "utf-8")
corpus_k <- Corpus(VectorSource(sacred_texts))
clean_k <- tm_map(corpus_k, tolower)
clean_k <- tm_map(clean_k, removeNumbers)
clean_k <- tm_map(clean_k, removeWords, stopwords("english"))
clean_k <- tm_map(clean_k, removePunctuation)
clean_k <- tm_map(clean_k, stripWhitespace)

inspect(clean_k[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] virginamerica dhepburn said                          
## [2] virginamerica plus added commercials experience tacky
## [3] virginamerica today must mean need take another trip

Next we create a Document-Term matrix from our corpus:

dtm <- DocumentTermMatrix(clean_k)

And we transform it into a data frame, as we will need that for further work.

knn_words <- as.data.frame(as.matrix(dtm), stringsAsFactors = FALSE)

So, instead of the texts we now have data on how many times each word occurs in each tweet. For example, the first tweet in our data was:

airline_df$text[1]
## [1] @VirginAmerica What @dhepburn said.
## 14427 Levels: "LOL you guys are so on it" - me, had this been 4 months ago...“@JetBlue: Our fleet's on fleek. http://t.co/LYcARlTFHl” ...

And now we have the number of times each word appears in that text (note that we cleaned the text, so some words may have changed):

knn_words[1:3,1:15]
##   dhepburn said virginamerica added commercials experience plus tacky
## 1        1    1             1     0           0          0    0     0
## 2        0    0             1     1           1          1    1     1
## 3        0    0             1     0           0          0    0     0
##   another mean must need take today trip
## 1       0    0    0    0    0     0    0
## 2       0    0    0    0    0     0    0
## 3       1    1    1    1    1     1    1

Model training and prediction

Now that we have our data, we can do predictions of sentiments.

First, let’s store the actual sentiment values from the initial dataset:

actual_val <- airline_df$airline_sentiment

Next, let’s divide the data we have into 2 data sets:

  • training data (usually 60-80% of the data)
  • test data (the remaining 20-40%)
<YOUR CODE> # Split the data
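
One possible split (a sketch, not the only valid choice): sample 80% of the row indices for training and keep the rest for testing. The later code expects objects named rows, train and test.

set.seed(123)                                  # for reproducibility
rows  <- sample(seq_len(nrow(knn_words)),
                floor(0.8 * nrow(knn_words)))  # indices of the training rows
train <- knn_words[rows, ]                     # 80% of the tweets
test  <- knn_words[-rows, ]                    # remaining 20%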

Now we can train the model and predict sentiments:

prediction <- knn(train, test, actual_val[rows]) # third argument = training labels; k = 1 by default

Evaluation

To evaluate our model we can use a confusion matrix:

<YOUR CODE> # Create confusion matrix and calculate accuracy
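
One possible sketch: cross-tabulate the predicted labels against the actual labels of the test rows, and take the share of correct predictions as the accuracy.

conf_knn <- table(Predicted = prediction, Actual = actual_val[-rows]) # confusion matrix
conf_knn
sum(diag(conf_knn)) / sum(conf_knn)                                   # accuracy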

SVM

For the SVM model (from the e1071 package) we add the sentiment labels to the word data frame and tidy up its column names:

names(knn_words) <- make.names(names(knn_words)) # make column names syntactically valid for the formula interface
knn_words$SSenti <- airline_df$airline_sentiment 
knn_words <- subset(knn_words, select=which(!duplicated(names(knn_words)))) # drop columns with duplicated names

Separating into training and test data:

train <- knn_words[rows,]
test <- knn_words[-rows,]
svm_model <- svm(formula = SSenti ~ ., data = train)
svm_pred <- predict(svm_model, test)

Evaluation of SVM

<YOUR CODE> # Create confusion matrix and calculate accuracy
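
Analogously to the K-NN evaluation, a possible sketch:

conf_svm <- table(Predicted = svm_pred, Actual = test$SSenti) # confusion matrix on the test set
conf_svm
sum(diag(conf_svm)) / sum(conf_svm)                           # accuracy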