## Creating objects in R

You can get output from R simply by typing math in the console:

5 + 5
## [1] 10
12 / 6
## [1] 2

In order to do more complex things we need to assign values to variables:

X <- 12

Variable is an object that can contain some value like number, string or even table. In order to assign value to variable you can use operator “<-”. Variable name cannot start with number or contain spacebars(" “) and also R is case sensitive language. This means next variables var, Var and vAr are different and can contain different values. Also some variable names are reserved by R (e.g., if, else, for, see here for a complete list).

Let’s look at example:

varX <- 5 # we assign value to variable varX
varX * 10 # multiply it by 10
## [1] 50

Now we will assign new value to varX:

varX <- 6
varX * 10 # multiply it by 10
## [1] 60

In order to leave some notes in the code, you can use comments. Simply go to the end of the line where you want leave a comment and put # and write youre note on the right side of it.

## Functions and their arguments

Functions can be used to execute the same calculations on different set of values. For example:

a <- 256
b <- sqrt(a)
b
## [1] 16

sqrt() is a function that takes value from a variable, calculates square root from it and assigns result to b. Let’s try a function that can take multiple arguments: round(). Note: use ?round to see help on this function.

round(3.14159) # takes 1 argument
## [1] 3

By default function round() takes the value and rounds to the nearest whole number. Now we will use args command and look, which arguments can be accepted by round.

args(round)
## function (x, digits = 0)
## NULL

Now we can use second argument to leave 2 digits after the dot:

round(3.14159, digits = 2)
## [1] 3.14

Or if you provide the arguments in the exact same order as they are defined you don’t have to name them:

round(3.14159, 2)
## [1] 3.14

And if you do name the arguments, you can switch their order:

round(digits = 2, x = 3.14159)
## [1] 3.14

## Vectors and data types

As we mentioned earlier, variable contains some value, but there can be several types of this value. The basics types of the value in R are:

• Numeric (1,2,3,-4363, 2.2222)
• Character (or String) (“A”, “AAAAAAA”)
• Logic (TRUE, FALSE)

Apart from basic data types, R also has more complex data types. One of them is vector. A vector is the most common and basic data type in R that can contain several values. For example the following is a numeric vector:

X <- c(1,5,4,9,0)
X
## [1] 1 5 4 9 0

Vector can also contain characters:

queue <- c("first", "second", "third")
queue 
## [1] "first"  "second" "third"

Note that quotes around “first”, “second”, etc. are essential here. Without the quotes R will assume there are variables called first, second and third. As these variables don’t exist in R’s memory, there will be an error message. There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:

length(X)
## [1] 5
length(queue)
## [1] 3

Also the vectors can have missing values (NA):

z <- c(NA, 3, 14, NA, 33, 17, NA, 41)

NA is name reserved by R and can also be used in Character or Logic vector.

Now let us do a few exercises with vectors. Take previous numeric vector and multiply it by 2:

k <- z*2
k
## [1] NA  6 28 NA 66 34 NA 82

multiply it by c(1, 0, 0, 2, 5, 10, 0, -1):

m <- c(1, 0, 0, 2, 5, 10, 0, -1)
n <- z*m
n
## [1]  NA   0   0  NA 165 170  NA -41

Bind 2 vectors:

• 1, 3, 5, 7, 11, 13, 17, -1
• 1, 0, 0, 2, 5, 10, 0, -1
a <- c(1, 3, 5, 7, 11, 13, 17, -1)
b <- c(1, 0, 0, 2, 5, 10, 0, -1)
d <- cbind(a,b)
d
##       a  b
## [1,]  1  1
## [2,]  3  0
## [3,]  5  0
## [4,]  7  2
## [5,] 11  5
## [6,] 13 10
## [7,] 17  0
## [8,] -1 -1

Or another way:

e <- rbind(a,b)
e
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## a    1    3    5    7   11   13   17   -1
## b    1    0    0    2    5   10    0   -1

As you have noticed, all of the elements are the same type of data in vector. The function class() indicates the class (the type of element) of an object:

class(a)
## [1] "numeric"
class(c("cat", "dog"))
## [1] "character"

You can use the c() function to add other elements to your vector:

weight_g <- c(50,49,70,45)
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
## [1] 30 50 49 70 45 90

In the first line, we take the original vector weight_g, add the value 90 to the end of it, and save the result back into weight_g. Then we add the value 30 to the beginning, again saving the result back into weight_g. We can do this over and over again to grow a vector, or assemble a dataset. As we program, this may be useful to add results that we are collecting or calculating.

## Subsetting vectors

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:

animals <- c("mouse", "rat", "dog", "cat")
animals[2]
## [1] "rat"
animals[c(3, 2)]
## [1] "dog" "rat"

We can also repeat the indices to create an object with more elements than the original one:

more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals
## [1] "mouse" "rat"   "dog"   "rat"   "mouse" "cat"

R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.

## Conditional subsetting

Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:

weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
## [1] 21 39 54

Typically, you do not have to type, but can use the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:

weight_g > 50
## [1] FALSE FALSE FALSE  TRUE  TRUE
weight_g[weight_g > 50]
## [1] 54 55

You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):

weight_g[weight_g < 30 | weight_g > 50]
## [1] 21 54 55
weight_g[weight_g >= 30 & weight_g == 21]
## numeric(0)

The result numeric(0) stands for an empty vector with the type numeric for its entries. The type starts to matter once you do some operations with this vector. A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The function %in% allows you to test if any of the elements of a search vector are found:

animals <- c("mouse", "rat", "dog", "cat")
animals[animals == "cat" | animals == "rat"] # returns both rat and cat
## [1] "rat" "cat"
animals %in% c("rat", "cat", "dog", "duck", "goat")
## [1] FALSE  TRUE  TRUE  TRUE
animals[animals %in% c("rat", "cat", "dog", "duck", "goat")]
## [1] "rat" "dog" "cat"

## Data frames

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. This is how you create a new data frame manually, using data.frame function

df <- data.frame(
Names = c("Jhon", "Joseph", "Martin", "Ivan", "Andrea"),
Goods = c("Bread", "Milk", "Apples", "Meat", "Eggs"),
Sales = c(15, 18, 21, NA, 60),
Price = c(34, 52, 33, 44, NA),
stringsAsFactors = FALSE)
str(df)
## 'data.frame':    5 obs. of  4 variables:
##  $Names: chr "Jhon" "Joseph" "Martin" "Ivan" ... ##$ Goods: chr  "Bread" "Milk" "Apples" "Meat" ...
##  $Sales: num 15 18 21 NA 60 ##$ Price: num  34 52 33 44 NA

A data frame can be created by hand, but most commonly they are generated by the functions read.csv() or read.table(); in other words, when importing spreadsheets from your hard drive (or the web). A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because the column are vectors, they all contain the same type of data (e.g., characters, integers, factors).

## Inspecting data.frame Objects

We already saw how the function str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!

• Size:
• dim(surveys) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
• nrow(surveys) - returns the number of rows
• ncol(surveys) - returns the number of columns
• Content:
• head(surveys) - shows the first 6 rows
• tail(surveys) - shows the last 6 rows
• Names:
• names(surveys) - returns the column names (synonym of colnames() for data.frame objects)
• rownames(surveys) - returns the row names
• Summary:
• summary(surveys) - summary statistics for each column

Note: most of these functions are “generic”, that is, they also can be used on other types of objects besides data.frame.

## Indexing and subsetting data frames

Now imagine that we need specific value from the variable. We can extract it by using [ ] and adding indexes into it. When it comes to vector, we can use

vector <- c("a", "b", "c", "d")
vector[3]    
## [1] "c"

to get 3th value from it. However it is different in data.frame, as it has 2 dimensions(rows and columns). In that case we can use 2 indexes instead of one. Row index come first, followed by column index. However, note that different ways of specifying these coordinates lead to results with different classes.

df
##    Names  Goods Sales Price
## 1   Jhon  Bread    15    34
## 2 Joseph   Milk    18    52
## 3 Martin Apples    21    33
## 4   Ivan   Meat    NA    44
## 5 Andrea   Eggs    60    NA
df[1, 1]    # first element in the first column of the data frame (as a vector)
## [1] "Jhon"
df[1, 3]    # first element in the 3rd column (as a vector)
## [1] 15
df[, 1]     # first column in the data frame (as a vector)
## [1] "Jhon"   "Joseph" "Martin" "Ivan"   "Andrea"
df[1]       # first column in the data frame (as a data.frame)
##    Names
## 1   Jhon
## 2 Joseph
## 3 Martin
## 4   Ivan
## 5 Andrea
df[1:3, 2]  # first three elements in the 2nd column (as a vector)
## [1] "Bread"  "Milk"   "Apples"
df[1:3,1:2] # first three elements in the first two columns (as a data.frame)
##    Names  Goods
## 2 Joseph   Milk
## 3 Martin Apples
df[3, ]     # the 3rd element for all columns (as a data.frame) 
##    Names  Goods Sales Price
## 3 Martin Apples    21    33

: is a special function that creates numeric vectors of integers in increasing or decreasing order, test 1:10 and 10:1 for instance. You can also exclude certain parts of a data frame using the “-” sign:

df[,-1]          # The whole data frame, except the first column
##    Goods Sales Price
## 2   Milk    18    52
## 3 Apples    21    33
## 4   Meat    NA    44
## 5   Eggs    60    NA
df[-c(1:2),]     # The whole data frame, except for the first two rows
##    Names  Goods Sales Price
## 3 Martin Apples    21    33
## 4   Ivan   Meat    NA    44
## 5 Andrea   Eggs    60    NA

As well as using numeric values to subset a data.frame (or matrix), columns can be called by name, using one of the four following notations:

df["Names"]       # Result is a data.frame
##    Names
## 1   Jhon
## 2 Joseph
## 3 Martin
## 4   Ivan
## 5 Andrea
df[, "Names"]     # Result is a vector
## [1] "Jhon"   "Joseph" "Martin" "Ivan"   "Andrea"
df[["Names"]]     # Result is a vector
## [1] "Jhon"   "Joseph" "Martin" "Ivan"   "Andrea"
df$Names # Result is a vector ## [1] "Jhon" "Joseph" "Martin" "Ivan" "Andrea" For our purposes, the last three notations are equivalent. RStudio knows about the columns in your data frame, so you can take advantage of the autocompletion feature to get the full and correct column name.$ is specific operator, that can extract subobject from the collection(like dataframe). Let us do a few exercises:

df <- data.frame(
Name = c("Jhon", "Joseph", "Martin", "Ivan", "Andrea"),
Goods = c("Bread", "Milk", "Apples", "Meat", "Eggs"),
Sales = c(15, 18, 21, NA, 60),
Price = c(34, 52, 33, 44, NA),
stringsAsFactors = FALSE)

Print “head” and “tail” of dataset.

head(df)
##     Name  Goods Sales Price
## 1   Jhon  Bread    15    34
## 2 Joseph   Milk    18    52
## 3 Martin Apples    21    33
## 4   Ivan   Meat    NA    44
## 5 Andrea   Eggs    60    NA
tail(df)
##     Name  Goods Sales Price
## 1   Jhon  Bread    15    34
## 2 Joseph   Milk    18    52
## 3 Martin Apples    21    33
## 4   Ivan   Meat    NA    44
## 5 Andrea   Eggs    60    NA

Add some rows to dataset. Print again.

newRow <- data.frame("Mike", "Oranges", 22, 35)
colnames(newRow) <- colnames(df)
rbind(df, newRow)
##     Name   Goods Sales Price
## 1   Jhon   Bread    15    34
## 2 Joseph    Milk    18    52
## 3 Martin  Apples    21    33
## 4   Ivan    Meat    NA    44
## 5 Andrea    Eggs    60    NA
## 6   Mike Oranges    22    35

Remove 2nd row of dataset.

df <- df[-2,]

## Functions

Make function that looks through data.frame and returns Name of a person, who sold the biggest amount of goods.

getBiggestName <- function(data){
max <- which.max(data$Sales) return(data$Name[max])
}
getBiggestName(df)
## [1] "Andrea"

Make function that looks through data.frame and calculates how much each seller earned.

getProfit <- function(data){
return(df$Sales * df$Price)
}
getProfit(df)
## [1] 510 693  NA  NA

## Factors

Factors are very useful and are actually something that makes R particularly well suited to working with data, so we’re going to spend a little time introducing them. Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting. Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings. Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

gender <- factor(c("male", "female", "female", "male"))

R will assign 1 to the level “female” and 2 to the level “male” (because f comes before m, even though the first element in this vector is “male”). You can check this by using the function levels(), and check the number of levels using nlevels():

levels(gender)
## [1] "female" "male"
nlevels(gender)
## [1] 2

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the gender vector would be:

gender # current order
## [1] male   female female male
## Levels: female male
gender <- factor(gender, levels = c("male", "female"))
gender # after re-ordering
## [1] male   female female male
## Levels: male female

In R’s memory, these factors are represented by integers (1, 2), but are more informative than integers because factors are self-describing: “female”, “male” is more descriptive than 1, 2. Which one is “male”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels.

## Converting factors

If you need to convert a factor to a character vector, you use as.character(x).

as.character(gender)
## [1] "male"   "female" "female" "male"

Converting factors where the levels appear as numbers (such as concentration levels, or years) to a numeric vector is a little trickier. One method is to convert factors to characters and then numbers. Another method is to use the levels() function. Compare:

f <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(f)               # wrong! and there is no warning...
## [1] 3 2 1 4 3
as.numeric(as.character(f)) # works...
## [1] 1990 1983 1977 1998 1990

Lets do some exercises:

Take previous data.frame and add column. Generate factor using sample function (nrow(df) - amount of rows in dataframe):

ftrs <- sample(as.factor(c("low", "medium", "high")), nrow(df), replace=TRUE)
ftrs
## [1] medium low    medium low
## Levels: high low medium

df <- cbind(df, ftrs)
df
##     Name  Goods Sales Price   ftrs
## 1   Jhon  Bread    15    34 medium
## 3 Martin Apples    21    33    low
## 4   Ivan   Meat    NA    44 medium
## 5 Andrea   Eggs    60    NA    low

Rename new column to “Efficiency”.

colnames(df)[5] <- "Efficiency"
df
##     Name  Goods Sales Price Efficiency
## 1   Jhon  Bread    15    34     medium
## 3 Martin Apples    21    33        low
## 4   Ivan   Meat    NA    44     medium
## 5 Andrea   Eggs    60    NA        low

Sort data.frame by worker’s Efficiency.

df <- df[order(df\$Efficiency),]
df
##     Name  Goods Sales Price Efficiency
## 3 Martin Apples    21    33        low
## 5 Andrea   Eggs    60    NA        low
## 1   Jhon  Bread    15    34     medium
## 4   Ivan   Meat    NA    44     medium

## Dplyr package

In order to retrieve information from data frames we use the package dplyr. This package makes filtering, sorting and grouping operations on a data frame very easy. To install the package, write

install.packages("dplyr")

After the package has been installed, you have to load it with the following command

library(dplyr)

The main commands of the dplyr package are:

• select(): choosing a subset of columns
• filter(): choosing a subset of rows
• arrange(): sort the rows
• summarise(): aggregates the values
• group_by(): change the data into grouped data in order to apply functions to each of the groups separately
• top_n(): choose n first/last rows

The first argument of these functions is always the data.frame and all the functions also return a data.frame object.

Next we will show you some simple examples to demonstrate the functionality of dplyr package.

data = data.frame(gender = c("M", "M", "F"),
age = c(20, 60, 30),
height = c(180, 200, 150))
data
##   gender age height
## 1      M  20    180
## 2      M  60    200
## 3      F  30    150

### select()

Selecting a subset of columns.

select(data, age)
##   age
## 1  20
## 2  60
## 3  30
select(data, gender, age)
##   gender age
## 1      M  20
## 2      M  60
## 3      F  30
select(data, -height)
##   gender age
## 1      M  20
## 2      M  60
## 3      F  30

### filter()

Selecting a subset of rows.

filter(data, height > 160)
##   gender age height
## 1      M  20    180
## 2      M  60    200
filter(data, height > 160, age > 30)
##   gender age height
## 1      M  60    200
filter(data, height > 160 & age > 30)
##   gender age height
## 1      M  60    200

### arrange()

Sorts rows.

arrange(data, height)
##   gender age height
## 1      F  30    150
## 2      M  20    180
## 3      M  60    200
arrange(data, desc(height))
##   gender age height
## 1      M  60    200
## 2      M  20    180
## 3      F  30    150

### mutate()

mutate(data, height2 = height / 100)
##   gender age height height2
## 1      M  20    180     1.8
## 2      M  60    200     2.0
## 3      F  30    150     1.5
mutate(data, height2 = height / 100, random_feature = height * age)
##   gender age height height2 random_feature
## 1      M  20    180     1.8           3600
## 2      M  60    200     2.0          12000
## 3      F  30    150     1.5           4500

### summarise()

Aggregates the values.

summarise(data, average_height = mean(height))
##   average_height
## 1       176.6667

### group_by()

Changes the data into grouped data where functions are applied separately to each group.

grouped_data = group_by(data, gender)
# Applying the function summarise to each group separately
summarise(grouped_data, average_height = mean(height))
## # A tibble: 2 x 2
##   gender average_height
##   <fct>           <dbl>
## 1 F                 150
## 2 M                 190
# In addition to average height we can also count the number of observations in that group
summarise(grouped_data,
average_height = mean(height),
nr_of_people = n())
## # A tibble: 2 x 3
##   gender average_height nr_of_people
##   <fct>           <dbl>        <int>
## 1 F                 150            1
## 2 M                 190            2

### top_n()

Separates top n values from the dataset by some feature (column). NOTE: The resulting data.frame is not ordered by these values.

top_n(data, 1, height)
##   gender age height
## 1      M  60    200
top_n(data, 2, height)
##   gender age height
## 1      M  20    180
## 2      M  60    200

### Applying multiple functions

Example: Let’s sort the data by height and select only the rows where gender == “M”.

sorted = arrange(data, height)
filter(sorted, gender == "M")
##   gender age height
## 1      M  20    180
## 2      M  60    200
filter(arrange(data, height),
gender == "M")
##   gender age height
## 1      M  20    180
## 2      M  60    200

### %>% operator

You can make your code more readable by using the dyplr’s pipe operator (%>%). This operator takes the object from the left and gives it as the first argument to the function on the right. For example the function f(x, y) can be written as x %>% f(y).

data %>%
arrange(height) %>%
filter(gender == "M")
##   gender age height
## 1      M  20    180
## 2      M  60    200

You can read the code written with the pipe operator in the following way:

Take the dataset called “data”, then sort it by height, then extract rows where gender == “M”

Code written in this way is easier to read, especially if multiple functions are applied. For example the example written before

grouped_data = group_by(data, gender)
summarise(grouped_data, average_height = mean(height))
## # A tibble: 2 x 2
##   gender average_height
##   <fct>           <dbl>
## 1 F                 150
## 2 M                 190

can be written as

data %>%
group_by(gender) %>%
summarise(average_height = mean(height))
## # A tibble: 2 x 2
##   gender average_height
##   <fct>           <dbl>
## 1 F                 150
## 2 M                 190