Lectures
There will be five sets of lectures by the following distinguished lecturers.
Keyword-Based Querying of Geo-Textual Data (2 lectures)
Data-Intensive Routing in Spatial Networks (1 lecture)
Christian S. Jensen
Keyword-Based Querying of Geo-Textual Data - lecture slides
Data-Intensive Routing in Spatial Networks - lecture slides
Abstracts
Keyword-Based Querying of Geo-Textual Data
The web is increasingly accessed by users for whom an accurate geo-location is available, and increasing volumes of geo-tagged content are available on the web, including web pages, points of interest, and microblog posts. Studies suggest that each week, several billion keyword-based queries are issued that have some form of local intent and that target geo-tagged web content with textual descriptions.
This state of affairs gives prominence to spatial web data management, and it opens up a research area full of new and exciting opportunities and challenges. A prototypical spatial web query takes a user location and user-supplied keywords as arguments, and it returns content that is spatially and textually relevant to these arguments. Due perhaps to the rich semantics of geographical space and its importance to our daily lives, many different kinds of relevant spatial web query functionality may be envisioned.
Based on recent and ongoing work by the speaker and his colleagues, the lecture presents key functionality, concepts, and techniques relating to spatial web object ranking and querying; it presents functionality that addresses different kinds of user intent; and it outlines directions for the future development of keyword-based spatial web querying.
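As a rough illustration of the prototypical spatial web query described above, the following Python sketch ranks objects by a weighted sum of spatial proximity and keyword overlap. It is only a sketch under stated assumptions, not the speaker's method: the linear scoring function, the weight alpha, the cut-off max_km, and all function names are hypothetical choices made for this example.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def text_relevance(keywords, description):
    """Fraction of query keywords that occur in the description (assumed measure)."""
    words = set(description.lower().split())
    kws = [k.lower() for k in keywords]
    return sum(k in words for k in kws) / len(kws) if kws else 0.0

def top_k(location, keywords, objects, k=2, alpha=0.5, max_km=10.0):
    """Return the names of the k best objects.

    objects is a list of (name, (lat, lon), description) tuples.
    alpha balances spatial proximity against text relevance (assumed weighting).
    """
    scored = []
    for name, pos, desc in objects:
        # Proximity falls linearly from 1 at the user location to 0 at max_km.
        proximity = max(0.0, 1.0 - haversine_km(location, pos) / max_km)
        score = alpha * proximity + (1 - alpha) * text_relevance(keywords, desc)
        scored.append((score, name))
    scored.sort(key=lambda t: -t[0])  # highest score first
    return [name for _, name in scored[:k]]
```

For example, calling top_k with a user location, the keyword list ["coffee"], and a handful of described points of interest returns the objects that are both nearby and mention coffee ahead of those that satisfy only one of the two criteria.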
Data-Intensive Routing in Spatial Networks
We are increasingly instrumenting reality and are increasingly capturing aspects of our lives digitally. As a result, data is increasingly becoming available that enables us to capture the states of important processes at an unprecedented level of detail, in turn enabling us to better understand and improve those processes.
The lecture describes recent advances in vehicle routing by the speaker and his colleagues that exploit GPS data from vehicles, data that capture the time-varying state of the traffic in a road network. More specifically, the lecture describes techniques that enable the annotation of a road network with time-varying and uncertain data that in turn enable the computation of time-varying travel costs, e.g., travel time and greenhouse gas emissions, associated with the traversal of routes in the road network. The lecture covers so-called stochastic skyline routing in this setting and also covers an alternative approach to routing that exploits trajectory data from local drivers.
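To give a feel for the skyline idea mentioned above, the sketch below filters candidate routes down to those that are not dominated on any cost. It is a deliberately simplified, deterministic illustration: each route has one fixed cost vector, whereas the stochastic setting in the lecture deals with time-varying and uncertain costs. The route names and cost values are made up for the example.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def skyline(routes):
    """Keep the routes whose cost vector is not dominated by any other route.

    routes maps a route name to a tuple of costs where lower is better,
    e.g. (travel time in minutes, emissions in kg CO2).
    """
    return {name: cost for name, cost in routes.items()
            if not any(dominates(other, cost) for other in routes.values())}

# Hypothetical candidate routes: (travel time, emissions).
candidates = {
    "highway": (20, 3.0),
    "city": (30, 2.0),
    "detour": (35, 3.5),
}
print(sorted(skyline(candidates)))
```

Here the detour is slower and more polluting than the highway, so only the highway and city routes remain as meaningful trade-offs.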
Biography
Christian S. Jensen is Obel Professor of Computer Science at Aalborg University, Denmark, and he was previously with Aarhus University for three years and spent a one-year sabbatical at Google Inc., Mountain View. His research concerns data management and data-intensive systems, and its focus is on temporal and spatio-temporal data management. Christian is an ACM and an IEEE Fellow, and he is a member of Academia Europaea, the Royal Danish Academy of Sciences and Letters, and the Danish Academy of Technical Sciences. He has received several national and international awards for his research. He is Editor-in-Chief of ACM Transactions on Database Systems.
Quantitative methods in software engineering (1 lecture)
Mining GitHub for fun and profit (2 lectures)
Georgios Gousios
Quantitative methods in software engineering - lecture slides
Mining GitHub for fun and profit - lecture slides
Abstracts
Modern organizations use telemetry and process data to make software production more efficient. Consequently, software engineering is an increasingly data-centered scientific field. With over 30 million repositories and 10 million users, GitHub is currently the largest code hosting site in the world. Software engineering researchers have been drawn to GitHub due to this popularity, as well as its integrated social features and the metadata that can be accessed through its API. To make research with GitHub data approachable, we created the GHTorrent project, a scalable, off-line mirror of all data offered through the GitHub API.
In our lectures, we will give an overview of how software engineering data are being used by researchers and organisations, discuss the GHTorrent project in detail, and go through a case study of applying quantitative methods to analyze data from software repositories.
Biography
Dr. Georgios Gousios is an assistant professor at the Web Information Systems group, TU Delft. His research interests include software engineering, software analytics and programming languages. He works in the fields of distributed software development processes, software quality, software testing, developer productivity assessment and research infrastructures. His research has been published in top venues, where he has received 4 best paper awards. In total, he has published more than 40 papers and also co-edited the "Beautiful Architectures" book (O'Reilly, 2009). He is an avid committer to open source projects as well as the main author of the GHTorrent data collection and curation framework and the Alitheia Core repository mining platform. His research results are being used by hundreds of researchers as well as companies like Microsoft and GitHub. Dr. Gousios holds a PhD in Software Engineering (mining software repositories) from the Athens University of Economics and Business (AUEB) and an MSc in software engineering from the University of Manchester, both with distinction.
Basic principles of algorithmic graph mining
Aristides Gionis
Introduction to graph mining - lecture slides
Computing basic graph statistics - lecture slides
Finding dense subgraphs - lecture slides
Spectral graph analysis - lecture slides
Abstracts
Networks, or graphs, provide a powerful abstraction for modeling a wide variety of real-world data. Graph mining is the discipline of analyzing data represented as graphs. In this course we will provide an overview of some major problems in graph mining. We will then present fundamental algorithmic principles for solving graph-mining tasks and discovering hidden structure in large graphs; emphasis will be placed on obtaining algorithms with provable guarantees and on the efficiency of the developed methods. The course will cover five different themes: (1) an introduction to graph mining; (2) efficient estimation of graph statistics; (3) finding dense subgraphs; (4) spectral graph analysis; and (5) applications.
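As one illustration of what a dense-subgraph algorithm with a provable guarantee can look like, the sketch below implements greedy peeling for the densest-subgraph problem, where density is the number of edges divided by the number of nodes. Greedy peeling is a well-known technique that returns a subgraph with at least half the optimal density. It is chosen here only as an example and is not necessarily the method presented in the course. The function name is hypothetical, and this simple version favours clarity over speed.

```python
def densest_subgraph_peeling(edges):
    """Greedy peeling: repeatedly delete a minimum-degree node and remember
    the densest intermediate subgraph (density = edges / nodes)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    nodes = set(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # current edge count
    best_density, best_nodes = 0.0, set()

    while nodes:
        density = m / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        # Remove a node of minimum degree in the remaining graph.
        u = min(nodes, key=lambda x: len(adj[x]))
        for v in adj[u]:
            adj[v].discard(u)
        m -= len(adj[u])
        nodes.remove(u)
        del adj[u]

    return best_nodes, best_density

# A 4-clique on nodes 1..4 with a short tail 4-5-6 attached.
example_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
dense_nodes, dense_value = densest_subgraph_peeling(example_edges)
```

On this small example, the tail nodes are peeled off first and the 4-clique is reported as the densest part. A production version would keep nodes in a bucket or heap structure keyed by degree to avoid the linear scan for the minimum.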
Biography
Aristides Gionis is an associate professor in the Department of Computer Science at Aalto University. He is the director of the Algorithmic Data Analysis (ADA) programme at the Helsinki Institute for Information Technology (HIIT) and he leads the Data Mining group at Aalto University. Previously, he was a senior research scientist at Yahoo! Research. He received his PhD from the Computer Science department of Stanford University in 2003. He is currently serving as an associate editor of the IEEE Transactions on Knowledge and Data Engineering (TKDE) and the ACM Transactions on Knowledge Discovery from Data (TKDD), and as a managing editor of Internet Mathematics. He has served on the PC of numerous premier conferences, including being the PC chair for WSDM 2013 and ECML PKDD 2010.
Accessing textual information in large collections
Eric Gaussier
Accessing textual information in large collections - lecture slides
Abstracts
In this series of lectures, we will first study how large-scale textual collections are indexed, before reviewing how information is retrieved and classified in such collections. Information retrieval aims at scoring documents given a user need that takes the form of a query; in this domain, we will review the most important models, including supervised learning-to-rank models based on clickthrough data, as well as approaches aiming at efficiently computing approximate similarities. We will then discuss several issues related to text classification in large-scale taxonomies, such as Directory Mozilla or the Wikipedia category system. We will see how to deploy classifiers in such taxonomies as well as how to accurately treat their "small" classes, by exploiting the power law distributions underlying textual collections.
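To make the indexing and scoring steps mentioned above concrete, here is a minimal Python sketch of an inverted index with a simple TF-IDF score. It is an illustration under assumptions rather than the models covered in the lectures: tokenisation is plain whitespace splitting, the idf formula is one common variant, and the function names and sample documents are invented for the example.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a postings dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query):
    """Score documents by summed tf * idf over the query terms.

    Returns doc ids ordered by decreasing score (ties broken by id).
    """
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(1 + n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical three-document collection.
sample_docs = {
    "d1": "graph mining on large graphs",
    "d2": "text retrieval and text classification",
    "d3": "large text collections",
}
sample_index = build_index(sample_docs)
ranking = search(sample_index, len(sample_docs), "text retrieval")
```

Only documents that share at least one term with the query are touched, which is the reason inverted indexes scale to large collections: the cost of a query depends on the length of the relevant postings lists, not on the size of the whole collection.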
Biography
After a PhD in Computer Science conducted in both the Centre Scientifique d’IBM France and the University Paris 7, Eric Gaussier joined the Xerox Research Centre Europe (XRCE) in 1996, where he became Area Manager of the group "Learning and Content Analysis". He joined the University Grenoble Alps (at that time University J. Fourier) and the Laboratoire d’Informatique de Grenoble (LIG) as a Professor in 2006. In the LIG, he created, with other colleagues, the AMA research team on machine learning and data analysis, which he headed from 2011 to 2015. He is currently Director of the LIG, after having been Deputy Director from 2011 to 2015. Eric Gaussier conducts research on Data Science, focusing on models and systems to extract information, insights and knowledge from data, and more particularly from large-scale (multilingual) text collections. He has been primarily involved in machine learning, information retrieval and computational linguistics, and is interested in theoretical models that explain and take into account properties of the data studied. He is also interested in modeling how textual information is shared in social (content) networks, and how such networks evolve over time. He has been on the editorial boards of major journals and on the programme committees of major conferences in these domains, and has received several prizes for his research contributions.