Potential Topics for the Course Project
Study of Interesting Patterns in Email Exchange Data
The Internet2 project has collected a large set of network flow logs. They are publicly available as one of the data collections on the Internet2 Observatory webpage. The amount of data makes it possible to monitor various changes in network dynamics. In particular, various events in email exchange protocols can be identified and studied. More precisely, all network logs are in the NetFlow format, which consists of the following data entries
source IP | destination IP | protocol | source port | destination port | octets | packets |
---|---|---|---|---|---|---|
128.112.128.0 | 65.55.200.0 | 6 | 80 | 22233 | 1420 | 1 |
128.112.128.0 | 65.55.200.0 | 6 | 80 | 20961 | 1420 | 1 |
where `protocol` encodes the type of communication protocol, such as TCP or UDP, and `octets` encodes the total length of a packet.
As email exchange is mostly done over ports 25 and 465 (SMTP), ports 143 and 993 (IMAP), and ports 110 and 995 (POP), it is possible to filter out the network traffic related to email exchange. Unfortunately, the network logs are anonymized by setting the last 11 bits of IPv4 addresses to zero. As a result, one can only observe email exchange between small subnetworks. Also, the packet sampling ratio is 1:100 and thus not all interaction signatures are observable. Hence, the first task is the identification of email exchange events that can be reliably extracted from the data.
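As a starting point, here is a minimal Python sketch of the filtering step. It assumes a pipe-separated log file laid out like the example above; the parsing has to be adapted to the actual Internet2 export format, and the file name is only a placeholder.

```python
# Minimal sketch: keep only NetFlow records that touch a mail-related port.
# Assumes pipe-separated rows in the column order shown above; adjust the
# parsing to the real export format of the Internet2 logs.

MAIL_PORTS = {25, 465, 143, 993, 110, 995}

def mail_flows(path):
    with open(path) as f:
        for line in f:
            fields = [x.strip() for x in line.split("|")]
            if len(fields) < 7:
                continue
            src_ip, dst_ip, proto, sport, dport = fields[:5]
            try:
                sport, dport = int(sport), int(dport)
            except ValueError:
                continue  # skip header, separator or malformed rows
            if sport in MAIL_PORTS or dport in MAIL_PORTS:
                yield src_ip, dst_ip, sport, dport

for flow in mail_flows("netflow.log"):  # hypothetical file name
    print(flow)
```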
The second task is the reconstruction of a directed graph with arc labels such that dynamic events can be observed. There are essentially two options. First, it is possible to construct a single graph with timing information. Second, one can construct graphs corresponding to certain time intervals and later study the changes between the graphs. Different time resolutions reveal different aspects of mail exchange. A time window that spans many hours characterizes average-case behavior, whereas a window spanning a few minutes reveals sporadic user behavior. Both types of information are useful in the following stages.
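A hedged sketch of the second option follows. It assumes each flow record also carries a timestamp (the example table above omits it, but real NetFlow exports record flow times); the networkx library and the hourly window are choices of this sketch, not requirements.

```python
# Sketch of the second option: one directed graph per fixed time window.
# Assumes flows are given as (timestamp, src, dst, packets) tuples.
from collections import defaultdict
import networkx as nx

WINDOW = 3600  # window length in seconds (one hour); tune per question

def windowed_graphs(flows):
    graphs = defaultdict(nx.DiGraph)
    for ts, src, dst, packets in flows:
        g = graphs[int(ts) // WINDOW]
        # accumulate packet counts as arc weights
        if g.has_edge(src, dst):
            g[src][dst]["packets"] += packets
        else:
            g.add_edge(src, dst, packets=packets)
    return dict(graphs)
```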
The third task is to identify consumer networks, the locations of major mail servers, and a list of potential spammers based on standard graph parameters. You should describe the rationale behind your choices and estimate how reliable your classification is. As a next step, you should study how different mail protocols are intertwined and what kind of interaction patterns can be found in the data. Find both spatial and temporal patterns and interpret the results.
Provided that you can semi-reliably detect end users, mail servers and spammers, identify how large the fraction of spam is and how it diffuses through the network. This cannot be done without heuristic assumptions about email traffic in general, so you have to make and document some educated guesses that make the analysis tractable.
Besides these aspects, you might try to identify other types of patterns, such as global temporal trends related to the time of day or patterns that reveal weak links in the network.
Technical tips. Note that the fixed port ranges of the SMTP protocol are on the receiver side, whereas the POP and IMAP protocols use fixed ports on the sender side. As a result, the following rows
source | destination | packets | mail protocol |
---|---|---|---|
xx:** | yy:25 | 4 | SMTP |
xx:** | yy:465 | 10 | SMTP |
xx:143 | yy:** | 14 | IMAP |
xx:993 | yy:** | 13 | IMAP |
xx:110 | yy:** | 42 | POP |
xx:995 | yy:** | 1 | POP |
indicate mail communication from the node `xx` to `yy`, and the packet count is correlated with the number and size of messages.
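The classification rule from the tip above could be encoded roughly as follows; the port sets simply restate the list given earlier and the function is only a sketch.

```python
# Sketch of the rule in the technical tip: SMTP has the fixed port on the
# receiver side, IMAP and POP on the sender side.
SMTP_PORTS = {25, 465}
IMAP_PORTS = {143, 993}
POP_PORTS = {110, 995}

def classify_mail_flow(src, sport, dst, dport):
    """Return (sender, receiver, protocol) or None if not mail traffic."""
    if dport in SMTP_PORTS:
        return src, dst, "SMTP"   # xx:** -> yy:25/465
    if sport in IMAP_PORTS:
        return src, dst, "IMAP"   # xx:143/993 -> yy:**
    if sport in POP_PORTS:
        return src, dst, "POP"    # xx:110/995 -> yy:**
    return None
```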
Analysis of Social Networks
A social network is a graph by design. In most cases, the social bonds between users are symmetric and thus the graph can be viewed as an undirected graph. However, this graph contains much more information. Usually, each user has a profile that can be scraped as well. Secondly, many social networks provide a way to send public short messages to other persons. These can be viewed as a different type of edge. Of course, not all persons keep their profile open to everybody, but the fraction of publicly available information is still quite large. Hence, it would be interesting to build a prediction algorithm that, given a person, crawls among his or her friends and provides reasonable predictions for missing attributes.
The concrete task consists of three basic steps. First, you have to scrape some part of the social network. For that you need a web crawler and a crawling strategy. I would suggest taking the initial starting points from a certain community and then using either snowball sampling, a random walk together with neighborhood sampling, or any other idea that comes to your mind. The resulting graph should be well connected and not too sparse, otherwise it is difficult to get enough data for the attribute prediction. Building a good enough web crawler is not as difficult as you might think. For instance, Minas Gjoka has written Python scripts for crawling Facebook, and a freeware .NET webcrawler can be used for Facebook as well. It is also not that hard to write such a crawler from scratch; see the discussion on Stack Overflow. As a friendly piece of advice, use either publicly available pages or create a special account for crawling.
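A minimal snowball-sampling skeleton is sketched below; `friends_of` is a hypothetical function that you would implement for the chosen network, e.g. on top of the crawlers mentioned above.

```python
# Minimal snowball-sampling skeleton. friends_of(node) is a placeholder:
# it should download the profile page of `node` and return the friend
# identifiers or URLs found there.
from collections import deque

def snowball(seeds, friends_of, max_nodes=1000, fanout=20):
    """Breadth-first snowball sample keeping at most `fanout` friends per node."""
    seen = set(seeds)
    queue = deque(seeds)
    edges = []
    while queue and len(seen) < max_nodes:
        node = queue.popleft()
        for friend in friends_of(node)[:fanout]:
            edges.append((node, friend))
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return seen, edges
```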
The second step consists of text scraping. Now that you have downloaded a huge pile of HTML pages, you need to extract attributes and link information. For that you can use any text processing tools; the command line tools `grep` and `awk` combined with perl and python scripts would be my choice. As a result, you should obtain a list of nodes decorated with attributes and a list of edges, also decorated with attributes. A well-documented and well-sampled graph with semantically rich and meaningful annotations is alone a good project work. However, you should not stop there: you should also study basic descriptive properties of the graph.
The third and final step is attribute prediction. Since most machine learning algorithms cannot handle complex data types, you should extract features or try out some specialized algorithms. As usual, use disjoint training, validation and test sets for tuning and evaluating the prediction algorithm. For categorical data, you should use standard SVM packages or decision trees (if you want to get easily interpretable knowledge). I would use either Matlab, GNU R, Weka or RapidMiner for this task unless you want to experiment with specialized algorithms.
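If you prefer Python to the tools listed above, the evaluation loop could look roughly like the scikit-learn sketch below; it assumes the node neighbourhoods have already been turned into fixed-length feature vectors `X` with target attribute `y`.

```python
# Hedged sketch of the prediction step with scikit-learn (one alternative
# to the tools named above). X: feature vectors per node, y: the attribute
# to predict (e.g. home town); building X from the graph is up to you.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def evaluate(X, y):
    # disjoint training and test sets, as recommended in the text
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    clf = DecisionTreeClassifier(max_depth=5)  # shallow tree stays interpretable
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```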
The final report should contain a detailed description of all three steps and a final assessment of whether the resulting predictor is usable in practice. If not, then a detailed analysis of the potential reasons and methodological shortcomings should be included.
Understanding Blogosphere and Its Dynamics
A blog, as a collection of webpages, is not a graph by definition. However, a collection of blogs, or even the entire blogosphere, can easily be visualized as a graph or a collection of graphs. Moreover, changes in blogs are easily accessible---every blog entry is dated. On the flip side, every blog looks a bit different, which complicates automatic analysis. Fortunately, most blogs are hosted on specific websites that provide standardized HTML tagging of all posts. With the help of a good browser inspector it is easy to make initial guesses about which ids and classes of DIV elements indicate entry titles, entry dates, links to other blogs, etc. Next, you should use a good HTML parser library and a set of blogs to test and refine your initial guesses. One possibility is to study the Estonian blogosphere and pick an initial set of blogs from the top node, but any other choice is just as good provided that the number of blogs is more than 1000.
As the first step, you should download the blogs and do an initial parsing of the blog entries. For every blog entry, you should extract the title, main text, date, meta tags and comments. For every blog, you should extract the blog entries, the blogger's personal profile, links to other blogs, etc. As explained, this is a process of trial and error. Hence, it is important to document the extraction procedure so that others can reproduce it.
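A sketch of such an extraction with BeautifulSoup is given below; the CSS class names are purely hypothetical initial guesses that you would refine per blog platform, exactly in the trial-and-error spirit described above.

```python
# Sketch of the extraction step with BeautifulSoup. The class names
# ("post-title", "post-body", "date-header") are hypothetical guesses;
# inspect your chosen blog platform and refine them.
from bs4 import BeautifulSoup

def parse_entry(html):
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.find("h3", class_="post-title").get_text(strip=True),
        "text": soup.find("div", class_="post-body").get_text(" ", strip=True),
        "date": soup.find("h2", class_="date-header").get_text(strip=True),
        "links": [a["href"] for a in soup.find_all("a", href=True)],
    }
```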
As the second step, you should form a collection of graphs where each node is a blog and arcs correspond to features extracted from the blog posts. For that, you should fix a reasonable time window so that there are enough blog posts for the analysis. The length of the time window depends entirely on the type of phenomena you are trying to investigate. If you study news propagation through blogs, then changes occur within minutes. For slower marketing and buzz events, a time resolution of days is sufficient. The arcs in the graph should represent different aspects. For instance, you could add a link between two blogs if both blogs have postings with very similar titles. How to quantify similarity is up to you; for example, you can say that titles are similar if they contain the same keyword, e.g. iPhone, Linux, diving. Two blog postings with high similarity could lead to a different kind of arc. Again, the definition of similarity is up to you, but I would use a standard bag-of-words approach with global frequency correction. Additionally, you should use keyword-specific links that connect two blogs if both blog entries contain the same keyword. Similar arc construction methods can be devised for meta tags and comments. Note that the resulting graph contains edges of different types, i.e., properties of titles and the main text should not be merged. Besides labelled edges, the nodes also have labels corresponding to the attributes of a blogger's profile.
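For the bag-of-words similarity with global frequency correction, a TF-IDF sketch in Python is shown below; the cosine threshold is an assumption to be tuned, and the arcs it produces are just one of the edge types discussed above.

```python
# Sketch: connect two blogs if their posts are similar under a bag-of-words
# model with global frequency correction (here TF-IDF with cosine similarity).
# The 0.3 threshold is an arbitrary assumption to tune.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_arcs(blog_ids, texts, threshold=0.3):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sim = cosine_similarity(tfidf)
    arcs = []
    for i in range(len(blog_ids)):
        for j in range(len(blog_ids)):
            if i != j and sim[i, j] >= threshold:
                arcs.append((blog_ids[i], blog_ids[j], float(sim[i, j])))
    return arcs
```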
For the third step, you should investigate how the graphs change over time. Firstly, you should study and visualize basic graph characteristics like degree and diameter. Secondly, you should try to analyze more specific features. One possibility is to study information flow. Read some articles about the topic; for instance, start with Cascading Behavior in Large Blog Graphs, and if it is not relevant, try the references provided therein or articles that cite it. Anyhow, it would be nice to see how topics like Windows 7, eggplant salad or volcano propagate through the network over time. Alternatively, you could test whether there is a community-like structure in the blogosphere. Use some community detection algorithm to discover overlapping clusters of blogs. Test whether new buzzwords (by definition, a buzzword is a term that is very frequent during the time window compared to its base frequency) stay in the cluster. If not, try to find out which are the gatekeeper nodes between the community and other blogs. Give a list of local buzzwords that become popular only inside a specific community and a list of global buzzwords. Characterize the drift of communities---how do they change over time?
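The buzzword definition above translates into a simple frequency-ratio test; the ratio and count thresholds in the sketch below are assumptions to tune.

```python
# Sketch of the buzzword definition: a term is a buzzword if its relative
# frequency inside the time window is much higher than its base frequency
# over the whole corpus. The factor 5 and min_count 10 are assumed thresholds.
from collections import Counter

def buzzwords(window_tokens, corpus_tokens, factor=5.0, min_count=10):
    win, base = Counter(window_tokens), Counter(corpus_tokens)
    n_win, n_base = sum(win.values()), sum(base.values())
    out = []
    for term, c in win.items():
        if c < min_count:
            continue
        win_freq = c / n_win
        base_freq = base[term] / n_base
        if base_freq > 0 and win_freq / base_freq >= factor:
            out.append((term, win_freq / base_freq))
    return sorted(out, key=lambda x: -x[1])
```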
Yet another option is to use exponential random graphs as a tool to map structural changes in terms of a few parameters. For that, you need the `statnet` package for GNU R. The sub-package `ergm` allows you to fit various exponential models.
The package contains many predefined features that can be used as terms in an exponential model. However, if you feel that some specific new term is more relevant and you are daring enough, it is possible to define new terms with the `ergm.userterms` package. Anyhow, there are two principal ways to use the `ergm` package. The first option is to find a model that fits the graphs with enough precision and then observe how the estimated coefficients change in time and give an interpretation to the changes. Secondly, you could first fit the model to the graph and then use the model as a background for estimating the significance of patterns you discover. For example, you could check whether certain communities or node neighborhoods are abnormal with respect to that background model.
The final alternative is to predict changes in the graph. For instance, predict new keywords for emerging blog posts or predict new links in the graph. It is also possible to predict changes in node attributes; sometimes users modify their personal profile or their links to other blogs. Another interesting question is whether buzzwords near the blogger trigger new blog posts. This can also be formulated as a prediction task.
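As one hedged illustration of the link-prediction variant, blogs could be scored by neighbourhood overlap; the Jaccard score and the evaluation idea of hiding the most recent links are choices of this sketch, not the only options.

```python
# Minimal sketch of link prediction: score unconnected blog pairs by
# neighbourhood overlap and propose the top-scoring pairs as likely new
# links. Evaluate by hiding the links of the latest time window and
# checking how many of them are recovered.
import networkx as nx

def predict_links(g, top_k=20):
    # jaccard_coefficient works on undirected graphs; for the directed
    # blog graph we score the undirected skeleton (a simplification)
    scores = nx.jaccard_coefficient(g.to_undirected())
    return sorted(scores, key=lambda t: t[2], reverse=True)[:top_k]
```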
How do Co-expression Patterns Change in Human Tissues?
Gene expression measurements are a common way to study the transcriptional regulation of a cell. It is widely believed that genes that behave similarly over many different conditions must share some mechanism of transcriptional regulation, such as a common binding site for a specific transcription factor---a protein that regulates the transcription of a gene. The most obvious way to quantify the similarity of two genes is to compute the correlation between their expression values over a fixed set of experiments. These correlation values can be treated as edge weights. Alternatively, if the values in the dataset are discretized, one can estimate the probabilities up-->up, ..., down-->down and define inhibitor and activator arcs. A collection of such datasets generates a corresponding collection of graphs. The aim of this project is to quantify and visualize changes in these graphs so that the results are easily interpretable.
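A minimal sketch of the first construction is given below, turning one expression matrix into a weighted co-expression graph; the correlation threshold is an assumption that should be tuned or replaced by a significance test.

```python
# Minimal sketch: build a co-expression graph from one dataset given as an
# expression matrix (rows = genes, columns = experiments). The absolute
# correlation threshold 0.8 is an assumption to tune per dataset.
import numpy as np
import networkx as nx

def coexpression_graph(expr, gene_names, threshold=0.8):
    corr = np.corrcoef(expr)            # gene-by-gene Pearson correlation
    g = nx.Graph()
    g.add_nodes_from(gene_names)
    n = len(gene_names)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:
                g.add_edge(gene_names[i], gene_names[j], weight=float(corr[i, j]))
    return g
```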
One possibility is to use exponential random graph models to fit the data and then quantify the changes in terms of parameter variations. In particular, you can cluster the graphs according to the parameters and investigate whether the resulting clusters are biologically meaningful, e.g., whether cancer datasets are clearly separated from the others. Secondly, you can order the datasets so that the distance between neighbors is minimal and then visualize the changes in the graphs. Also, you can study the similarity of datasets in terms of significance. Each dataset defines a best exponential model, and for any dataset we can find the corresponding likelihood. By computing the factors log(P[graph | model] / P[original graph | model]) we can find significantly dissimilar graph pairs. Again, you should verify whether these results have a clear biological interpretation.
Another possibility is to use the same reasonable clustering or community detection algorithm on all of these graphs. As a result, you get many clusterings. To illustrate the changes, you should devise a visualization method for changes in clusterings, such as the so-called alluvial diagrams. Note that there is no prescribed order for the datasets and thus you have to define it yourself. One possibility is to use exponential random graph models to define the order, as in the previous paragraph. Alternatively, you could choose the order so that the resulting picture looks pretty.
A third and even more interesting option is to study how specific the clusters are to particular conditions. For that, you need to combine both ideas described above. Let C1 and C2 be the most similar clusters, according to some reasonable metric, in two adjacent graphs. Then we can ask whether the clusters are significantly dissimilar or not. For that, we should fix a background model and then generate many graphs to see whether the resulting dissimilarity is large or small on average. As usual, we have to generate a large number of sample graphs, find the clusters, compute the dissimilarity score, and then find the corresponding empirical p-values. If the model is chosen as the best fit for the entire collection of graphs, a small empirical p-value (i.e., few sampled graph pairs are as dissimilar as the observed pair) indicates that the change in the clustering is specific to the change in conditions.
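A rough sketch of the p-value computation follows. As a simplification it uses degree-preserving edge swaps as the background model instead of a fitted exponential random graph model, and `cluster_dissimilarity` is a placeholder for your chosen score.

```python
# Sketch of the empirical p-value. The background model here is
# degree-preserving randomisation (double edge swaps), a simplification of
# the ERGM-based background discussed above; cluster_dissimilarity(g1, g2)
# is a placeholder that clusters both graphs and returns a dissimilarity.
import networkx as nx

def randomize(g):
    r = g.copy()
    nswap = 2 * r.number_of_edges()
    nx.double_edge_swap(r, nswap=nswap, max_tries=10 * nswap + 1)
    return r

def empirical_p_value(g1, g2, cluster_dissimilarity, n_samples=200):
    observed = cluster_dissimilarity(g1, g2)
    hits = sum(
        cluster_dissimilarity(randomize(g1), randomize(g2)) >= observed
        for _ in range(n_samples)
    )
    # small value => the observed change is specific to the condition change
    return (hits + 1) / (n_samples + 1)
```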
Example dataset: Collection of Tissues
Fast Generation of Surveys on Academic Topics
Writing an academic article is such a chore---you have to do a literature review in order to find relevant results. The first phase of this work is rather tedious. You just enter some random keywords into Google, find a list of articles that seem relevant, and read through some articles referenced by them and some articles which reference these articles. To be precise, you do not really read them; you just glance through the title and abstract to see whether the article is relevant at all. As an end result, you get a large list of articles that you really need to read. The aim of this project is to automate this process as much as you can.
To limit the scope, consider only three sources of information: Google Scholar, the DBLP Bibliography Server and the ACM Digital Library. The first source provides access to articles, similar articles, and articles which cite the current article. The second source provides author information such as the list of all articles and the list of co-authors. The third source allows you to extract the abstract and references of an article.
As a first step, you should build a crawler that, given a keyword combination or a list of articles, uses Google Scholar and the ACM Digital Library to build a graph of relevant articles. Intuitively, you should find all relevant articles about the topic cited in the initial list of articles, and then refine that list by following references and citations further---but not indefinitely, as then you would download the entire web. Here you should use some heuristic criteria like a maximal number of forward and backward links, a list of good keywords, a list of stop words, co-citation indices, etc.
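A skeleton of such a bounded crawl is sketched below; `references_of` and `cited_by` are placeholders to be implemented on top of the chosen sources (mind their terms of use), and the depth and fan-out limits stand in for the heuristic cut-offs discussed above.

```python
# Skeleton of the bounded citation crawl. references_of(a) and cited_by(a)
# are placeholders returning lists of article identifiers; is_relevant()
# encodes your keyword/stop-word heuristics.
from collections import deque

def build_citation_graph(seed_ids, references_of, cited_by,
                         max_depth=2, max_links=30,
                         is_relevant=lambda article: True):
    edges, seen = [], set(seed_ids)
    queue = deque((a, 0) for a in seed_ids)
    while queue:
        article, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for other in (references_of(article) + cited_by(article))[:max_links]:
            if not is_relevant(other):
                continue
            edges.append((article, other))
            if other not in seen:
                seen.add(other)
                queue.append((other, depth + 1))
    return seen, edges
```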
After you have obtained the graph, you need to decorate it with abstracts; Google Scholar and the ACM Digital Library can be used for this. Next, you should use the abstracts, keywords, number of citations and other measures to find out which publications are likely to be more relevant. After that, you should present the results in a tree structure that mimics the historical development of the topic.
In general, this is an open-ended project. You have to find a solution and a way to evaluate its quality by yourself. In a certain sense, the project can fail; however, you must document the failure well enough.
Empirical Study of Sampling Methods in Directed Graphs
It is relatively straightforward to design estimation procedures for graphs that are undirected and in which either edges or vertices can be sampled uniformly at random. In many cases, neither of these assumptions is satisfied. The simplest example is the estimation of various network parameters of the Internet or the World Wide Web. The goal is to apply different well-known sampling techniques specifically to directed graphs and to run various estimators on the collected samples, analyzing and comparing the produced values. In particular, one should investigate whether global properties of the underlying graph (such as diameter or link density) influence the goodness of the estimates.
The experimental data collection should consist of graphs with diverse properties. One possibility is to choose several very different real-world networks and compare the performance of various estimators. Another alternative is to make systematic modifications to the graphs in order to study the dependency on a certain parameter. To study the dependency on graph diameter, one can choose a random center point and delete all nodes outside n1, n2, ... hops. Similarly, links can be dropped to study the dependency on link density.
Most well-known sampling methods require uniform sampling of vertices or edges. As the graph is given implicitly, this random sampling must be implemented as a separate routine. That is, given only a few starting nodes in the graph, a sampling algorithm should crawl the network to collect vertices. After that, one can apply classical sampling designs, such as snowball sampling, and later correct the bias with Horvitz-Thompson style estimators. Of course, the bias and variance of the estimators depend on the algorithm that is used to generate random vertices and thus require separate investigation.
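As one concrete example, the average degree can be estimated from a random walk on the undirected skeleton by reweighting each visited node with the inverse of its degree, a Horvitz-Thompson style correction for the fact that the walk visits nodes proportionally to their degree. The sketch below ignores burn-in and mixing issues, and `neighbors` is a placeholder for crawler access.

```python
# Sketch: estimate the average degree of an implicitly given graph from a
# random walk on its undirected skeleton. Nodes are visited with probability
# proportional to their degree, so reweighting by 1/degree corrects the bias.
import random

def random_walk_avg_degree(neighbors, start, walk_length=10_000):
    """neighbors(v) must return the list of neighbours of v (crawler access)."""
    v = start
    inv_deg_sum, count = 0.0, 0
    for _ in range(walk_length):
        nbrs = neighbors(v)
        if not nbrs:
            break
        inv_deg_sum += 1.0 / len(nbrs)
        count += 1
        v = random.choice(nbrs)
    # under the walk's stationary distribution, the mean of 1/degree
    # estimates 1/(average degree), hence the inversion below
    return count / inv_deg_sum
```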
The set of estimators should include totals (average degree, number of edges, clustering coefficient), estimators of vertex group sizes (hidden populations, total number of vertices) and complex graph properties (number of triangles, average number of paths between two nodes, average distance). The bias and variance of an estimator should be estimated empirically by repeating the estimation procedure sufficiently many times. To illustrate the typical behavior of the estimation procedure, it is advisable to draw some smoothed histograms of the outputs.
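The empirical evaluation itself is a simple repetition loop; the sketch below assumes the estimator is wrapped as a zero-argument function and that the true value is known for the test graphs.

```python
# Sketch: empirical bias and variance of an estimator, obtained by repeating
# the sampling-and-estimation procedure many times. The returned list of
# estimates can be fed to matplotlib/seaborn for a smoothed histogram.
import statistics

def empirical_bias_variance(estimator, true_value, repetitions=200):
    estimates = [estimator() for _ in range(repetitions)]
    bias = statistics.mean(estimates) - true_value
    variance = statistics.variance(estimates)
    return bias, variance, estimates
```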
Prediction of Traffic Congestion
The goal of the Internet Mathematics 2010 contest is to predict the rate of traffic congestion based on previous observations. The data provided to contest participants consist of a graph of Moscow streets and observation information---the speed of traffic flow on street segments during one month. The task is to predict the rate of traffic congestion on the last day of the month.
Further details: http://imat2010.yandex.ru/en/datasets