HW8. Graphs (2nd of Nov, 23.59)
EX1. There are many data types that can be represented as graphs. Study online collection at https://snap.stanford.edu/data/. Select from there 2 graphs that are different in type and size. Consider them as undirected (if edge A->B exists, but B->A doesn't exist, then also add B->A) and characterize them by calculating:
- Nr of nodes
- Nr of edges
- Distribution of node degrees (visualize it using a histogram)
What is the graph about? What can you tell about the graphs from these statistics? Are these graphs dense or sparse?
EX2.
Read in the following UNdirected Graph dataset from: dataset link
The dataset starts with a description of vertices (below text VERTICES), which describe all graph vertices with their corresponding coordinates X and Y.
E.g. 0 652 396, is vertex 0 at location X=652 Y=396.
Lines after the line "EDGES" describe graph edges.
E.g. 0 1, there exists an edge between nodes 0 and 1.
You can visualize this graph on your own or by making use of the following web tool: Webtool searchvis. Python code to generate input for visualisation can be found at Graph task visualisation code. Feel free to change parameters, modify the code or implement your own. Sample generated input for the web tool can be found here: link:
Your task here is to start from vertex with id 0 and perform Breadth-First Search and Depth-First Search:
Start from node 0. Perform BFS and DFS and return the order of processing the nodes (the first time you start to process the node (e.g. when you add neighbours to queue/stack)). When adding neighbours, prioritize lower ID nodes first! E.g. if you have to add nodes 1 and 2 to stack/queue, add 1 first and then 2.
The output of this task should be two sequences (one for BFS, one for DFS) [0,1,...] . Add this to your report!
Now, visualize the output of your traversal by adding the order of processing to the visualisation (the one in darkred). To do that, you can either do it by hand (paint) or use the provided visualisation code and change line 152-153, where there is a variable "traversal_order". Change the sequence within the list to be the order of vertices being processed (previous output of the task, you can also use this to confirm your solution works as intended!). Also, add the visualisations of orderings to your report!
Hint: If you use a queue (FiLo: First in - Last out) for first traversal and a stack (LiFo: Last in - First out) for second, then the code for the two solutions should be almost the same! In order to not process nodes you've already processed, you probably need to create an additional data structure to remind you which nodes have already been processed.
EX3. Use the following directed graph. Convert it into adjacency matrix representation and then convert the matrix into the graph where the original edges are replaced by "3-hop" edges or in other words - connect only vertices that are reachable in 3 steps of the original graph.
EX4. Make a transitive closure of the same graph - using two approaches. 1) Multiplication approach: G*G*G...; 3) with Warshall algorithm. Verify that you got the same answer.
EX5. Implement a "random walk" procedure following links randomly (equal probabilities). Using the same graph again, estimate the probability of being in any given node by varying how you deal with "dead ends" (e.g. node B) and "nodes with no links into it" (e.g. nodes D or A). First, try "re-appearing anywhere" from dead ends only. Secondly, when walking, with 20% probability jump to any random vertex in the network, and 80% of times select randomly and go to one of the neighbouring vertices. For both scenarios, provide the respective probabilities and identify the most important nodes.
EX6 (Bonus 2p). Use the random walk data from the previous exercise. Implement the "same" random walk procedure but now clearly using matrix multiplications much like in finding transitive closures. Assign uneven probabilities to links - e.g. 80%-20% for two-link nodes (larger probability to alphabetically the first neighbour) and 60%-30%-10% for three link ones. Feel free to modify the input graph a bit. Introduce enough links so that you do not have "dead ends" or nodes where nothing leads into. E.g. I recommend introducing links G->H, B->K, K->H. But feel free to experiment. Of course, you should also figure out what happens to graph multiplications when you have not yet changed the graph. Your task is to report transition probabilities after N steps. How do you know when to stop multiplying?
EX7 (Bonus 2p). Extend your solution from EX6 and implement the Google Page Rank algorithm (here is a small playground), the algorithm itself is very simple) where 20% of times you would re-appear anywhere in graph. Both using equal-probabilities links and the uneven 80-20, 60-30-10 probabilities. Report the outcome probabilities and compare to results in EX6.