--- title: "Introduction to Business Data Analytics" subtitle: "Network Science" author: "University of Tartu" output: prettydoc::html_pretty: null highlight: github html_document: default html_notebook: default github_document: default theme: cayman --- ```{r global_options, include=FALSE} knitr::opts_chunk$set(warning=FALSE, message=FALSE) ``` ```{r setup, echo=FALSE} library(knitr) ``` Today we are going to do some network analysis. The goals of our lab session: * Get familiar with the graphs and how to work with them; * Analyse the data about media companies. Figure out their connections; * Learn about various way of displaying statistic on the graphs. ## Libraries Today we will need following libraries: ```{r eval=FALSE} install.packages("igraph") ``` ```{r warning=FALSE} library("igraph") library("ggplot2") ``` Igraph is a package for creating and manipulating graphs and analyzing networks. There are a number of different software packages available for this purpose, but iGraph has become perhaps the most flexible and powerful library for performing network analysis. ## Basic work with networks ### Creating networks Let's create an undirected graph with 3 edges. ```{r} g1 <- graph(edges=c(1,2, 2,3, 3, 1), n=3, directed=F) plot(g1) ``` On the plot above we have 3 nodes (1,2,3) and undirected edges that connect nodes next way: **1->2, 2->3, 3->1**. Now we can study our graph simply by calling the variable: ```{r} g1 ``` Above we can find some important information about the graph: * First number - the number of nodes * Second number - the number of edges in the graph * List of edges Now let us create another graph: ```{r} g2 <- graph(edges=c(1,2, 2,3, 3,1, 4,5, 8,5, 4,7, 2,6), n=8) g2 ```
How much nodes and edges does this graph have?
```{r} plot(g2) ``` Above is an example of disconnected graph. Having numbers as names of the nodes can be meaningless, so let us name them: ```{r} g3 <- graph(c("John", "Jim", "Jim", "Jill", "Jill", "John")) plot(g3) ``` We can make isolated nodes by passing the list to "isolates" attribute: ```{r} g4 <- graph(c("John", "Jim", "Jim", "Jack", "Jim", "Jack", "John", "John"), isolates=c("Jesse", "Janis", "Jennifer", "Justin")) plot(g4, edge.arrow.size=.5, vertex.color="darkgreen", vertex.size=15, vertex.frame.color="black", vertex.label.color="black", vertex.label.cex=0.8, vertex.label.dist=2.5, edge.curved=0.5) ``` Small graphs can also be generated with a description of this kind: - for undirected tie, +- or -+ for directed ties pointing left & right, ++ for a symmetric tie, and ":" for sets of vertices. ###Edge, vertex, and network attributes We can call specific functions to learn about graph attributes: ```{r} E(g4) # returns edges V(g4) # returns nodes ``` Or we can study matrix: ```{r} g4[] ``` Or even separate rows: ```{r} g4[1,] ``` And columns: ```{r} g4[,1] ``` We can add new attributes to existing graph: ```{r} V(g4)$name # was already there V(g4)$gender <- c("male", "male", "male", "male", "female", "female", "male") E(g4)$type <- "email" # assigns "email" to all edges E(g4)$weight <- 10 ``` Now, let's examine attributes: ```{r} edge_attr(g4) ``` ```{r} vertex_attr(g4) ``` If you want to assign attribute value to the whole graph, you can use: ```{r} g4 <- set_graph_attr(g4, "name", "Email Network") g4 <- set_graph_attr(g4, "something", "A thing") graph_attr_names(g4) ``` ```{r} graph_attr(g4, "something") ``` ```{r} graph_attr(g4) ``` Now, if we want to delete some attribute: ```{r} g4 <- delete_graph_attr(g4, "something") graph_attr(g4) ``` ```{r} plot(g4, edge.arrow.size=.5, vertex.label.color="black", vertex.label.dist=1.5, vertex.color=c("pink", "skyblue")[1+(V(g4)$gender=="male")]) ``` We have the situation where we have a loop from "John" to himself and two directed edges between "Jim" and "Jack". We can remove unnecessary edges, using **edge.attr.comb()** to indicate how edge attributes should be combined. We can choose following options: * sum * mean * prod * min * max * first/last * ignore ```{r} g4s <- simplify(g4, remove.multiple = T, remove.loops = F, edge.attr.comb=c(weight="sum", type="ignore")) plot(g4s, vertex.label.dist=1.5) ``` ```{r} E(g4s)$weight ``` ```{r} g4s ``` Now we can see some additional information in summary: * D or U, for a directed or undirected graph * N for a named graph (where nodes have a name attribute) * W for a weighted graph (where edges have a weight attribute) * B for a bipartite (two-mode) graph (where nodes have a type attribute) The two numbers that follow (7 3) refer to the number of nodes and edges in the graph. The description also lists node & edge attributes, for example: * (g/c) - graph-level character attribute * (v/c) - vertex-level character attribute * (e/n) - edge-level numeric attribute ##Network analysis As we learned basic functions to work with graphs, now we will apply them to use in practice. To be more precise, we will study a dataset to understand how different media organizations are related with each other. That would let us know, which marketing channel is better to use to do advertising. We will start by loading the data: ```{r} nodes <- read.csv(file.choose()) # Dataset1-NODES.csv links <- read.csv(file.choose()) # Dataset1-EDGES.csv ``` Let's examine the data: ```{r} str(nodes) ``` ```{r} str(links) ``` ```{r} cat("Amount of rows in nodes data: ", nrow(nodes), "\n") cat("Amount of unique nodes: ", length(unique(nodes$id)), "\n") cat("Amount of rows in links data: ", nrow(links), "\n") cat("Amount of unique links: ", nrow(unique(links[,c("from", "to")])), "\n") ``` As you can see, total amount of links is bigger then unique links with combination from;to. This shows us that there are nodes with two or more edges. We will collapse all links of the same type between the same two nodes by summing their weights: ```{r} links <- aggregate(links[,3], links[,-3], sum) #summing weights by other cols colnames(links)[4] <- "weight" ``` ###Creating igraph objects Now, as we have data loaded, we can create igraph objects and work with them. For this, we will use *graph_from_data_frame()* that accepts data.frames that describe edges and nodes. ```{r} net <- graph_from_data_frame(d=links, vertices=nodes, directed=T) net ``` ```{r} E(net) # The edges of the "net" object V(net) # The vertices of the "net" object E(net)$type # Edge attribute "type" V(net)$media # Vertex attribute "media" ``` ```{r} plot(net, edge.arrow.size=.4, vertex.label.dist=2.5) ``` As you can see, there are some loops on the graph. Let's remove them: ```{r} net <- simplify(net, remove.multiple = F, remove.loops = T) plot(net, edge.arrow.size=.4, vertex.label.dist=2.5) ``` There are also double edges, but we won't remove them this time since they can belong to different type(for example "hyperlinks" and "mentions"). We can obtain information about edges using next functions: The following function returns the list of edges: ```{r} as_edgelist(net, names=T) ``` The function below creates matrix that shows us weight for each edge. The first one shows the list of all edges from(the first column) and to(the second) Another function shows us the matrics, that shows edges from(the first column) and to(first row). Numbers here represent the attribute weight, as we passed it into the function. ```{r} as_adjacency_matrix(net, attr="weight") ``` Now we have information on how media companies are related with each other. Let's study the data by visualizing. For starters, let's replace node labels, with node names stored in "media". ```{r} plot(net, edge.arrow.size=.2, vertex.label.dist=1.8, vertex.label.cex=0.8, vertex.size=10, vertex.color="orange", vertex.frame.color="black", vertex.label=V(net)$media, vertex.label.color="black") ``` Now, let's add different colors depending on which media type each node has. ```{r} unique(V(net)$media.type) ``` ```{r} colors <- c("gray50", "tomato", "gold") V(net)$color <- colors[V(net)$media.type] plot(net, edge.arrow.size=.2, vertex.label.dist=1.8, vertex.label.cex=0.8, vertex.size=10, vertex.label=V(net)$media, vertex.label.color="black") ``` Now it's easier for us to see, to which media type belongs each node: * Red - TV * Yellow - internet * Grey - newspapers Let's continue our research and make our plot to show audience size: ```{r} V(net)$size <- V(net)$audience.size*.4 plot(net, edge.arrow.size=.2, vertex.label.dist=1.8, vertex.label.cex=.8, vertex.label=V(net)$media, vertex.label.color="black") ``` And the weight of the edge: ```{r} E(net)$width <- E(net)$weight/6 plot(net, edge.arrow.size=.2, vertex.label.dist=1.8, vertex.label.cex=.8, vertex.label=V(net)$media, vertex.label.color="black") ``` And final touch would be setting layout for graph: ```{r} graph_attr(net, "layout") <- layout_with_lgl plot(net, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label=V(net)$media, vertex.label.color="black") ``` Now let us add some explanations for the colors: ```{r} graph_attr(net, "layout") <- layout_with_lgl plot(net, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label=V(net)$media, vertex.label.color="black") legend(x=-1.5, y=-1.1, #coordinates c("Newspaper","Television", "Online News"), pch=21, pt.bg=colors, pt.cex=2, #size of the circle cex=.8, #size of the tex ncol=3) #amount of columns in legends ``` ###More network analysis At this point, we learned how to create a simple network and analyze it. But what if we want to see more information? Let's replace node names like 's01', 's02', 's03' etc. with names of media companies, so that we do not have to specify node name each time we plot. ```{r} V(net)$id=V(net)$name V(net)$name=V(net)$media ``` Now, let's add different colors depending on which media type each node has. ```{r} unique(V(net)$media.type) ``` ```{r} colors <- c("gray50", "tomato", "gold") V(net)$color <- colors[V(net)$media.type] plot(net, edge.arrow.size=.2, vertex.label.dist=1.8, vertex.label.cex=0.8, vertex.size=10, vertex.label.color="black") ``` Now it's easier for us to see, to which media type belongs each node: * Red - TV * Yellow - internet * Grey - newspapers Let's continue our research and make our plot to show audience size: ```{r} V(net)$size <- V(net)$audience.size * .4 plot(net, edge.arrow.size=.2, vertex.label.dist=1.8, vertex.label.cex=.8, vertex.label.color="black") ``` And the weight of the edge: ```{r} E(net)$width <- E(net)$weight/6 plot(net, edge.arrow.size=.2, vertex.label.dist=1.8, vertex.label.cex=.8, vertex.label.color="black") ``` Finally, let us add some explanations for the colors: ```{r} graph_attr(net, "layout") <- layout_with_lgl plot(net, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black") legend(x=-1.5, y=-1.1, #coordinates c("Newspaper","Television", "Online News"), pch=21, pt.bg=colors, pt.cex=2, #size of the circle cex=.8, #size of the tex ncol=3) #amount of columns in legends ``` ## Network and node's description Let's calculate some statistical information: ### Density Density ratio of the number of edges and the number of possible edges(fully connected graph). ```{r} edge_density(net) ``` ###Reciprocity Reciprocity is the measure of the likelihood of vertices in a directed network to be mutually linked ```{r} reciprocity(net) ``` ###Clustering coefficient (Transitivity) Transitivity the probability that the adjacent vertices of a vertex are connected ```{r} transitivity(net, type="global") transitivity(net, type="local") ``` ###Diameter Diameter at network considered as the length of the shortest paths between the nodes. We can use next function to obtain diameter: ```{r} diameter(net, directed=T, weights=NA) ``` ```{r} diameter(net, directed=T) ``` Or we can obtain the nodes along the first found path of that distance. ```{r} diam <- get_diameter(net, directed=T) diam ``` Let's colorize this path on our network: ```{r} vcol <- V(net)$color vcol[diam] <- "green" ecol <- rep("gray80", ecount(net)) ecol[E(net, path=diam)] <- "green" # E(net, path=diam) finds edges along a path, here 'diam' plot(net, vertex.color=vcol, edge.color=ecol, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black", edge.curved=0.3) ``` Let's pring edges and nodes to validate: ```{r} V(net)[diam] E(net)[E(net, path=diam)] E(net)$weight[E(net, path=diam)] ``` ###Node degrees ```{r} deg <- degree(net, mode="all") plot(net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black", vertex.size=deg*3) ``` Let's make a histogram of node degree: ```{r} deg_df <- as.data.frame(deg, node=names(deg), deg=deg) ggplot(deg_df) + geom_bar(aes(x=as.factor(deg)), stat="count") + xlab("degree") ``` ### Degree distribution ```{r} plot(degree_distribution(net, mode='all')) ``` ### Degree ```{r} head(sort(degree(net, mode='all'), decreasing = T),5) ``` ### Closeness ```{r} head(sort(closeness(net, mode='all'), decreasing = T),5) ``` ### Betweenness ```{r} head(sort(betweenness(net), decreasing = T),5) ``` ### Eigen vector Centrality ```{r} head(sort(eigen_centrality(net)$vector, decreasing = T),5) ``` ###Hubs and authorities Let's look at hubs and authorities on our data. Hubs mean nodes that outgoing edges. Authorities are nodes that have incoming edges. ```{r} hs <- hub_score(net, weights=NA)$vector as <- authority_score(net, weights=NA)$vector ``` Plotting hubs: ```{r} plot(net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black", vertex.size=hs*50, main="Hubs") ``` Plotting authorities: ```{r} plot(net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black", vertex.size=as*50, main="Authorities") ``` ###Distances and paths Now we will calculate the average path length between each pair of nodes in our graph. Let's assume that our graph is undirected: ```{r} mean_distance(net, directed=F) ``` And also calculate for directed graph: ```{r} mean_distance(net, directed=T) ``` We also can find distances of the all shortest paths in the graph: ```{r} distances(net) # with edge weights distances(net, weights=NA) # ignore weights ``` This way we can extract distances to the node we are interested in: ```{r} dist_NYT <- distances(net, v=V(net)[media=="NY Times"], to=V(net), weights=NA) #Plotting palette <- colorRampPalette(c("dark red", "gold")) col <- palette(max(dist_NYT)+1) col <- col[dist_NYT+1] plot(net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.cex=.8, vertex.color=col, vertex.label=dist_NYT, vertex.label.color="white") ``` Now let's do a bit more complex task. We will find the shortest path (edges) from NY Times to FOX News: ```{r} nyt_fox_path <- shortest_paths(net, from = V(net)[media=="FOX News"], to = V(net)[media=="New York Post"], output = "both") # both path nodes and edges # Generate edge color variable to plot the path: ecol <- rep("gray80", ecount(net)) ecol[unlist(nyt_fox_path$epath)] <- "green" # Generate edge width variable to plot the path: edge_width <- rep(2, ecount(net)) edge_width[unlist(nyt_fox_path$epath)] <- 4 # Generate node color variable to plot the path: vcol <- V(net)$color vcol[unlist(nyt_fox_path$vpath)] <- "green" plot(net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black", vertex.color=vcol, edge.color=ecol, edge.width=edge_width) ``` We can find edges that go in or out of our node. Let's find all edges that are around our FOX News: ```{r} fox_edges <- incident(net, V(net)[media=="FOX News"], mode="all") #Set colors to plot the selected edges. ecol <- rep("gray80", ecount(net)) ecol[fox_edges] <- "green" vcol <- V(net)$color vcol[V(net)$media=="FOX News"] <- "green" #Plotting plot(net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black", vertex.color=vcol, edge.color=ecol) ``` Or we can use incident_edges() to select edges of several nodes: ```{r} selected_nodes <- V(net)[c(2,6)] nodes2_edges <- incident_edges(net, selected_nodes, mode="all") #Set colors to plot the selected edges. ecol <- rep("gray80", ecount(net)) ecol[E(net) %in% nodes2_edges$s02] <- "blue" ecol[E(net) %in% nodes2_edges$s06] <- "green" vcol <- V(net)$color vcol[selected_nodes[1]] <- "blue" vcol[selected_nodes[2]] <- "green" #Plotting plot(net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black", vertex.color=vcol, edge.color=ecol) ``` Another way of selecting edges would be to use %-% or %->%: * E(network)[X %-% Y] takes all edges between X and Y; * E(network)[X %->% Y] takes edges going from X to Y. Let's see an example: ```{r} selected_edges <- E(net)[ V(net)[type.label=="Newspaper"] %->% V(net)[type.label=="Online"] ] #Set colors to plot the selected edges. ecol <- rep("gray80", ecount(net)) ecol[selected_edges] <- "green" #Plotting plot(net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black", edge.color=ecol) ``` ##Subgroups and communities There are several algorithms to detect communities on the plot. Likely, IGraph package contains functions that can be called to execute many of them. ###Finding communities based on propagating labels Label Propagation algorithm assigns labels to the points of data. In the beginning, a subset of the data points has labels. These labels are propagated to the unlabeled points throughout the algorithm. The following function automatically detects the communities and returns them: ```{r} clp <- cluster_label_prop(as.undirected(net)) clp plot(clp, net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black") ``` ###Louvain algorithm for communities detection: Let's use another algorithm to obtain communities. Louvain algorithm consists of repeated application of two steps. The first step is a "greedy" assignment of nodes to communities. The second step is the definition of a new network in terms of the communities found in the first step. ```{r} lou <- cluster_louvain(as.undirected(net)) lou #Plotting plot(lou, net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black") ``` ###Walktrap community detection algorithm The idea of this method relies on using random walks. These walks are more likely to stay inside the community as there can be only a few edges that are going outside, to another community. ```{r} wtc <- walktrap.community(as.undirected(net)) wtc #Plotting plot(wtc, net, edge.arrow.mode=0, edge.arrow.size=.2, vertex.label.dist=2, vertex.label.cex=.8, vertex.label.color="black") ```