How much nodes and edges does this graph have?

```{r}
plot(g2)
```
Above is an example of disconnected graph.
Having numbers as names of the nodes can be meaningless, so let us name them:
```{r}
g3 <- graph(c("John", "Jim", "Jim", "Jill", "Jill", "John"))
plot(g3)
```
We can make isolated nodes by passing the list to "isolates" attribute:
```{r}
g4 <- graph(c("John", "Jim", "Jim", "Jack", "Jim", "Jack", "John", "John"),
isolates=c("Jesse", "Janis", "Jennifer", "Justin"))
plot(g4, edge.arrow.size=.5, vertex.color="darkgreen", vertex.size=15,
vertex.frame.color="black", vertex.label.color="black",
vertex.label.cex=0.8, vertex.label.dist=2.5, edge.curved=0.5)
```
Small graphs can also be generated with a description of this kind: - for undirected tie, +- or -+ for directed ties pointing left & right, ++ for a symmetric tie, and ":" for sets of vertices.
###Edge, vertex, and network attributes
We can call specific functions to learn about graph attributes:
```{r}
E(g4) # returns edges
V(g4) # returns nodes
```
Or we can study matrix:
```{r}
g4[]
```
Or even separate rows:
```{r}
g4[1,]
```
And columns:
```{r}
g4[,1]
```
We can add new attributes to existing graph:
```{r}
V(g4)$name # was already there
V(g4)$gender <- c("male", "male", "male", "male", "female", "female", "male")
E(g4)$type <- "email" # assigns "email" to all edges
E(g4)$weight <- 10
```
Now, let's examine attributes:
```{r}
edge_attr(g4)
```
```{r}
vertex_attr(g4)
```
If you want to assign attribute value to the whole graph, you can use:
```{r}
g4 <- set_graph_attr(g4, "name", "Email Network")
g4 <- set_graph_attr(g4, "something", "A thing")
graph_attr_names(g4)
```
```{r}
graph_attr(g4, "something")
```
```{r}
graph_attr(g4)
```
Now, if we want to delete some attribute:
```{r}
g4 <- delete_graph_attr(g4, "something")
graph_attr(g4)
```
```{r}
plot(g4,
edge.arrow.size=.5,
vertex.label.color="black",
vertex.label.dist=1.5,
vertex.color=c("pink", "skyblue")[1+(V(g4)$gender=="male")])
```
We have the situation where we have a loop from "John" to himself and two directed edges between "Jim" and "Jack". We can remove unnecessary edges, using **edge.attr.comb()** to indicate how edge attributes should be combined.
We can choose following options:
* sum
* mean
* prod
* min
* max
* first/last
* ignore
```{r}
g4s <- simplify(g4,
remove.multiple = T,
remove.loops = F,
edge.attr.comb=c(weight="sum", type="ignore"))
plot(g4s, vertex.label.dist=1.5)
```
```{r}
E(g4s)$weight
```
```{r}
g4s
```
Now we can see some additional information in summary:
* D or U, for a directed or undirected graph
* N for a named graph (where nodes have a name attribute)
* W for a weighted graph (where edges have a weight attribute)
* B for a bipartite (two-mode) graph (where nodes have a type attribute)
The two numbers that follow (7 3) refer to the number of nodes and edges in the graph. The description also lists node & edge attributes, for example:
* (g/c) - graph-level character attribute
* (v/c) - vertex-level character attribute
* (e/n) - edge-level numeric attribute
##Network analysis
As we learned basic functions to work with graphs, now we will apply them to use in practice.
To be more precise, we will study a dataset to understand how different media organizations are related with each other. That would let us know, which marketing channel is better to use to do advertising.
We will start by loading the data:
```{r}
nodes <- read.csv(file.choose()) # Dataset1-NODES.csv
links <- read.csv(file.choose()) # Dataset1-EDGES.csv
```
Let's examine the data:
```{r}
str(nodes)
```
```{r}
str(links)
```
```{r}
cat("Amount of rows in nodes data: ", nrow(nodes), "\n")
cat("Amount of unique nodes: ", length(unique(nodes$id)), "\n")
cat("Amount of rows in links data: ", nrow(links), "\n")
cat("Amount of unique links: ", nrow(unique(links[,c("from", "to")])), "\n")
```
As you can see, total amount of links is bigger then unique links with combination from;to. This shows us that there are nodes with two or more edges.
We will collapse all links
of the same type between the same two nodes by summing their weights:
```{r}
links <- aggregate(links[,3], links[,-3], sum) #summing weights by other cols
colnames(links)[4] <- "weight"
```
###Creating igraph objects
Now, as we have data loaded, we can create igraph objects and work with them.
For this, we will use *graph_from_data_frame()* that accepts data.frames that describe edges and nodes.
```{r}
net <- graph_from_data_frame(d=links, vertices=nodes, directed=T)
net
```
```{r}
E(net) # The edges of the "net" object
V(net) # The vertices of the "net" object
E(net)$type # Edge attribute "type"
V(net)$media # Vertex attribute "media"
```
```{r}
plot(net, edge.arrow.size=.4, vertex.label.dist=2.5)
```
As you can see, there are some loops on the graph. Let's remove them:
```{r}
net <- simplify(net, remove.multiple = F, remove.loops = T)
plot(net, edge.arrow.size=.4, vertex.label.dist=2.5)
```
There are also double edges, but we won't remove them this time since they can belong to different type(for example "hyperlinks" and "mentions").
We can obtain information about edges using next functions:
The following function returns the list of edges:
```{r}
as_edgelist(net, names=T)
```
The function below creates matrix that shows us weight for each edge.
The first one shows the list of all edges from(the first column) and to(the second)
Another function shows us the matrics, that shows edges from(the first column) and to(first row).
Numbers here represent the attribute weight, as we passed it into the function.
```{r}
as_adjacency_matrix(net, attr="weight")
```
Now we have information on how media companies are related with each other.
Let's study the data by visualizing.
For starters, let's replace node labels, with node names stored in "media".
```{r}
plot(net,
edge.arrow.size=.2,
vertex.label.dist=1.8,
vertex.label.cex=0.8,
vertex.size=10,
vertex.color="orange",
vertex.frame.color="black",
vertex.label=V(net)$media,
vertex.label.color="black")
```
Now, let's add different colors depending on which media type each node has.
```{r}
unique(V(net)$media.type)
```
```{r}
colors <- c("gray50", "tomato", "gold")
V(net)$color <- colors[V(net)$media.type]
plot(net,
edge.arrow.size=.2,
vertex.label.dist=1.8,
vertex.label.cex=0.8,
vertex.size=10,
vertex.label=V(net)$media,
vertex.label.color="black")
```
Now it's easier for us to see, to which media type belongs each node:
* Red - TV
* Yellow - internet
* Grey - newspapers
Let's continue our research and make our plot to show audience size:
```{r}
V(net)$size <- V(net)$audience.size*.4
plot(net,
edge.arrow.size=.2,
vertex.label.dist=1.8,
vertex.label.cex=.8,
vertex.label=V(net)$media,
vertex.label.color="black")
```
And the weight of the edge:
```{r}
E(net)$width <- E(net)$weight/6
plot(net,
edge.arrow.size=.2,
vertex.label.dist=1.8,
vertex.label.cex=.8,
vertex.label=V(net)$media,
vertex.label.color="black")
```
And final touch would be setting layout for graph:
```{r}
graph_attr(net, "layout") <- layout_with_lgl
plot(net,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label=V(net)$media,
vertex.label.color="black")
```
Now let us add some explanations for the colors:
```{r}
graph_attr(net, "layout") <- layout_with_lgl
plot(net,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label=V(net)$media,
vertex.label.color="black")
legend(x=-1.5, y=-1.1, #coordinates
c("Newspaper","Television", "Online News"),
pch=21,
pt.bg=colors,
pt.cex=2, #size of the circle
cex=.8, #size of the tex
ncol=3) #amount of columns in legends
```
###More network analysis
At this point, we learned how to create a simple network and analyze it. But what if we want to see more information?
Let's replace node names like 's01', 's02', 's03' etc. with names of media companies, so that we do not have to specify node name each time we plot.
```{r}
V(net)$id=V(net)$name
V(net)$name=V(net)$media
```
Now, let's add different colors depending on which media type each node has.
```{r}
unique(V(net)$media.type)
```
```{r}
colors <- c("gray50", "tomato", "gold")
V(net)$color <- colors[V(net)$media.type]
plot(net,
edge.arrow.size=.2,
vertex.label.dist=1.8,
vertex.label.cex=0.8,
vertex.size=10,
vertex.label.color="black")
```
Now it's easier for us to see, to which media type belongs each node:
* Red - TV
* Yellow - internet
* Grey - newspapers
Let's continue our research and make our plot to show audience size:
```{r}
V(net)$size <- V(net)$audience.size * .4
plot(net,
edge.arrow.size=.2,
vertex.label.dist=1.8,
vertex.label.cex=.8,
vertex.label.color="black")
```
And the weight of the edge:
```{r}
E(net)$width <- E(net)$weight/6
plot(net,
edge.arrow.size=.2,
vertex.label.dist=1.8,
vertex.label.cex=.8,
vertex.label.color="black")
```
Finally, let us add some explanations for the colors:
```{r}
graph_attr(net, "layout") <- layout_with_lgl
plot(net,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black")
legend(x=-1.5, y=-1.1, #coordinates
c("Newspaper","Television", "Online News"),
pch=21,
pt.bg=colors,
pt.cex=2, #size of the circle
cex=.8, #size of the tex
ncol=3) #amount of columns in legends
```
## Network and node's description
Let's calculate some statistical information:
### Density
Density ratio of the number of edges and the number of possible edges(fully connected graph).
```{r}
edge_density(net)
```
###Reciprocity
Reciprocity is the measure of the likelihood of vertices in a directed network to be mutually linked
```{r}
reciprocity(net)
```
###Clustering coefficient (Transitivity)
Transitivity the probability that the adjacent vertices of a vertex are connected
```{r}
transitivity(net, type="global")
transitivity(net, type="local")
```
###Diameter
Diameter at network considered as the length of the shortest paths between the nodes.
We can use next function to obtain diameter:
```{r}
diameter(net, directed=T, weights=NA)
```
```{r}
diameter(net, directed=T)
```
Or we can obtain the nodes along the first found path of that distance.
```{r}
diam <- get_diameter(net, directed=T)
diam
```
Let's colorize this path on our network:
```{r}
vcol <- V(net)$color
vcol[diam] <- "green"
ecol <- rep("gray80", ecount(net))
ecol[E(net, path=diam)] <- "green"
# E(net, path=diam) finds edges along a path, here 'diam'
plot(net,
vertex.color=vcol,
edge.color=ecol,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black", edge.curved=0.3)
```
Let's pring edges and nodes to validate:
```{r}
V(net)[diam]
E(net)[E(net, path=diam)]
E(net)$weight[E(net, path=diam)]
```
###Node degrees
```{r}
deg <- degree(net, mode="all")
plot(net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black",
vertex.size=deg*3)
```
Let's make a histogram of node degree:
```{r}
deg_df <- as.data.frame(deg, node=names(deg), deg=deg)
ggplot(deg_df) + geom_bar(aes(x=as.factor(deg)), stat="count") + xlab("degree")
```
### Degree distribution
```{r}
plot(degree_distribution(net, mode='all'))
```
### Degree
```{r}
head(sort(degree(net, mode='all'), decreasing = T),5)
```
### Closeness
```{r}
head(sort(closeness(net, mode='all'), decreasing = T),5)
```
### Betweenness
```{r}
head(sort(betweenness(net), decreasing = T),5)
```
### Eigen vector Centrality
```{r}
head(sort(eigen_centrality(net)$vector, decreasing = T),5)
```
###Hubs and authorities
Let's look at hubs and authorities on our data.
Hubs mean nodes that outgoing edges.
Authorities are nodes that have incoming edges.
```{r}
hs <- hub_score(net, weights=NA)$vector
as <- authority_score(net, weights=NA)$vector
```
Plotting hubs:
```{r}
plot(net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black",
vertex.size=hs*50,
main="Hubs")
```
Plotting authorities:
```{r}
plot(net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black",
vertex.size=as*50,
main="Authorities")
```
###Distances and paths
Now we will calculate the average path length between each pair of nodes in our graph.
Let's assume that our graph is undirected:
```{r}
mean_distance(net, directed=F)
```
And also calculate for directed graph:
```{r}
mean_distance(net, directed=T)
```
We also can find distances of the all shortest paths in the graph:
```{r}
distances(net) # with edge weights
distances(net, weights=NA) # ignore weights
```
This way we can extract distances to the node we are interested in:
```{r}
dist_NYT <- distances(net, v=V(net)[media=="NY Times"], to=V(net), weights=NA)
#Plotting
palette <- colorRampPalette(c("dark red", "gold"))
col <- palette(max(dist_NYT)+1)
col <- col[dist_NYT+1]
plot(net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.cex=.8,
vertex.color=col,
vertex.label=dist_NYT,
vertex.label.color="white")
```
Now let's do a bit more complex task.
We will find the shortest path (edges) from NY Times to FOX News:
```{r}
nyt_fox_path <- shortest_paths(net,
from = V(net)[media=="FOX News"],
to = V(net)[media=="New York Post"],
output = "both") # both path nodes and edges
# Generate edge color variable to plot the path:
ecol <- rep("gray80", ecount(net))
ecol[unlist(nyt_fox_path$epath)] <- "green"
# Generate edge width variable to plot the path:
edge_width <- rep(2, ecount(net))
edge_width[unlist(nyt_fox_path$epath)] <- 4
# Generate node color variable to plot the path:
vcol <- V(net)$color
vcol[unlist(nyt_fox_path$vpath)] <- "green"
plot(net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black",
vertex.color=vcol,
edge.color=ecol,
edge.width=edge_width)
```
We can find edges that go in or out of our node. Let's find all edges that are around our FOX News:
```{r}
fox_edges <- incident(net, V(net)[media=="FOX News"], mode="all")
#Set colors to plot the selected edges.
ecol <- rep("gray80", ecount(net))
ecol[fox_edges] <- "green"
vcol <- V(net)$color
vcol[V(net)$media=="FOX News"] <- "green"
#Plotting
plot(net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black",
vertex.color=vcol,
edge.color=ecol)
```
Or we can use incident_edges() to select edges of several nodes:
```{r}
selected_nodes <- V(net)[c(2,6)]
nodes2_edges <- incident_edges(net, selected_nodes, mode="all")
#Set colors to plot the selected edges.
ecol <- rep("gray80", ecount(net))
ecol[E(net) %in% nodes2_edges$s02] <- "blue"
ecol[E(net) %in% nodes2_edges$s06] <- "green"
vcol <- V(net)$color
vcol[selected_nodes[1]] <- "blue"
vcol[selected_nodes[2]] <- "green"
#Plotting
plot(net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black",
vertex.color=vcol,
edge.color=ecol)
```
Another way of selecting edges would be to use %-% or %->%:
* E(network)[X %-% Y] takes all edges between X and Y;
* E(network)[X %->% Y] takes edges going from X to Y.
Let's see an example:
```{r}
selected_edges <- E(net)[ V(net)[type.label=="Newspaper"] %->% V(net)[type.label=="Online"] ]
#Set colors to plot the selected edges.
ecol <- rep("gray80", ecount(net))
ecol[selected_edges] <- "green"
#Plotting
plot(net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black",
edge.color=ecol)
```
##Subgroups and communities
There are several algorithms to detect communities on the plot. Likely, IGraph package contains functions that can be called to execute many of them.
###Finding communities based on propagating labels
Label Propagation algorithm assigns labels to the points of data. In the beginning, a subset of the data points has labels. These labels are propagated to the unlabeled points throughout the algorithm.
The following function automatically detects the communities and returns them:
```{r}
clp <- cluster_label_prop(as.undirected(net))
clp
plot(clp, net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black")
```
###Louvain algorithm for communities detection:
Let's use another algorithm to obtain communities. Louvain algorithm consists of repeated application of two steps. The first step is a "greedy" assignment of nodes to communities. The second step is the definition of a new network in terms of the communities found in the first step.
```{r}
lou <- cluster_louvain(as.undirected(net))
lou
#Plotting
plot(lou, net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black")
```
###Walktrap community detection algorithm
The idea of this method relies on using random walks. These walks are more likely to stay inside the community as there can be only a few edges that are going outside, to another community.
```{r}
wtc <- walktrap.community(as.undirected(net))
wtc
#Plotting
plot(wtc, net,
edge.arrow.mode=0,
edge.arrow.size=.2,
vertex.label.dist=2,
vertex.label.cex=.8,
vertex.label.color="black")
```