Quo vadis entrepreneurial hackathons?
Time-bounded events such as hackathons, data dives, codefests, hack-days, sprints and edit-a-thons have received widespread attention in recent years. Events that are organized with the aim of supporting teams in developing innovative products and services that can be turned into successful start-ups have been at the forefront of this recent surge. The question, however, remains how hackathons and entrepreneurship are actually connected. The aim of this topic is to conduct a comprehensive literature review covering the current state of the art of research on entrepreneurial hackathons and their connection to the startup scene. This review should outline current knowledge as well as open questions and shortcomings.
[1] Cobham, D., Hargrave, B., Jacques, K., Gowan, C., Laurel, J., & Ringham, S. (2017). From hackathon to student enterprise: an evaluation of creating successful and sustainable student entrepreneurial activity initiated by a university hackathon.
[2] Komssi, M., Pichlis, D., Raatikainen, M., Kindström, K., & Järvinen, J. (2015). What are hackathons for?. IEEE Software, 32(5), 60-67.
[3] Taylor, N., & Clarke, L. (2018, April). Everybody's Hacking: Participation and the Mainstreaming of Hackathons. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (p. 172). ACM.
Proposed by A. Nolte
email: alexander.nolte@udo.edu
Replication of Empirical Software Engineering Case Study Experiments
The empirical software engineering community publishes many case studies validating different approaches and analytical algorithms for software engineering. Unfortunately, these studies are rarely validated by independent replication. To make matters worse, the studies use different validation metrics, which makes them incomparable. Thus, your mission, should you choose to accept it, is to analyse different published case studies on one topic (e.g. bug detection, code churn estimation) to evaluate their replicability and to replicate the studies in order to make them comparable. In short, you will: 1. envisage a workflow/pipeline for replicating published studies (including testing for replicability); 2. use the workflow to replicate several studies; 3. validate these studies and compare their results on a common scale.
[1] Le Goues, C., Dewey-Vogt, M., Forrest, S., & Weimer, W. (2012, June). A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In Software Engineering (ICSE), 2012 34th International Conference on (pp. 3-13). IEEE. http://ieeexplore.ieee.org/abstract/document/6227211/
[2] Tian, Y., Lawall, J., & Lo, D. (2012, June). Identifying linux bug fixing patches. In Proceedings of the 34th International Conference on Software Engineering (pp. 386-396). IEEE Press. https://dl.acm.org/citation.cfm?id=2337269
[3] Kagdi, H., Collard, M. L., & Maletic, J. I. (2007). A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of Software: Evolution and Process, 19(2), 77-131. http://onlinelibrary.wiley.com/doi/10.1002/smr.344/full
[4] Thomas, S. W. (2011, May). Mining software repositories using topic models. In Proceedings of the 33rd International Conference on Software Engineering (pp. 1138-1139). ACM. https://dl.acm.org/citation.cfm?id=1986020
Proposed by S. Karus
email: siim.karus@ut.ee
GPU-Accelerated Data Analytics
In this project, a set of GPU-accelerated data mining or analytics algorithms will be implemented as an extension to an analytical database solution. For this task, you will need to learn parallel processing optimizations specific to GPU programming (balancing between bandwidth and processing power), implement the analytics algorithms, and design a user interface to accompany them. As the aim is to provide an extension to analytical databases (preferably MSSQL, Oracle or PostgreSQL), you will also need to learn the extension interfaces of these databases and their native development and BI tools. Finally, you will assess the performance gains of your algorithms compared to comparable algorithms in existing analytical database tools.
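To give a flavour of the bandwidth-versus-compute trade-off mentioned above, here is a minimal sketch that compares a simple aggregation on the CPU (NumPy) with the same aggregation on the GPU (CuPy). It assumes a CUDA-capable GPU and the cupy package; the data size and the aggregation are illustrative only and not part of the project requirements.

    # Minimal sketch: CPU aggregation (NumPy) vs. GPU aggregation (CuPy).
    # Assumes a CUDA-capable GPU and the cupy package; sizes are illustrative only.
    import time
    import numpy as np
    import cupy as cp

    n = 50_000_000
    cpu_data = np.random.rand(n).astype(np.float32)

    t0 = time.perf_counter()
    cpu_sum = cpu_data.sum()                  # aggregation on the CPU
    t_cpu = time.perf_counter() - t0

    gpu_data = cp.asarray(cpu_data)           # host-to-device transfer: this is the bandwidth cost
    t0 = time.perf_counter()
    gpu_sum = cp.sum(gpu_data)                # aggregation on the GPU (runs asynchronously)
    cp.cuda.Stream.null.synchronize()         # wait for the kernel before stopping the timer
    t_gpu = time.perf_counter() - t0

    print(f"CPU: {cpu_sum:.1f} in {t_cpu:.3f}s; GPU: {float(gpu_sum):.1f} in {t_gpu:.3f}s")

In the actual project the GPU kernels would, of course, live inside the database extension rather than in a standalone script.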
[1] Bakkum, P., & Skadron, K. (2010, March). Accelerating SQL database operations on a GPU with CUDA. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (pp. 94-103). ACM. https://dl.acm.org/citation.cfm?id=1735706
[2] Breß, S., Heimel, M., Siegmund, N., Bellatreche, L., & Saake, G. (2014). GPU-accelerated database systems: Survey and open challenges. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XV (pp. 1-35). Springer, Berlin, Heidelberg. https://link.springer.com/chapter/10.1007/978-3-662-45761-0_1
[3] Sitaridi, E. A., & Ross, K. A. (2016). GPU-accelerated string matching for database applications. The VLDB Journal, 25(5), 719-740. https://link.springer.com/article/10.1007/s00778-015-0409-y
[4] Karnagel, T., Mueller, R., & Lohman, G. M. (2015). Optimizing GPU-accelerated Group-By and Aggregation. In ADMS@ VLDB (pp. 13-24).
Proposed by S. Karus
email: siim.karus@ut.ee
Graph Reasoning on Software Repositories
Software repositories offer many insights into the software development process. Most of the analytical processes used on software repositories rely heavily on the availability of training data – samples of positive and negative cases. These samples, however, have to be determined by people. This limits the usefulness of the analytics, as people might miss possible relationships or make mistakes in specifying the training output values. Graph reasoning, on the other hand, is a machine learning technique that does not require training data and uses internal rules to find relationships in the data. As such, graph reasoning can be used to discover unwanted or unnoticed patterns in software and software evolution. Your task is to bridge these two disciplines in order to further our understanding of software and its evolution, and perhaps even to improve the quality assurance process.
[1] Kiefer, C., Bernstein, A., & Tappolet, J. (2007, May). Mining software repositories with iSPARQL and a software evolution ontology. In Proceedings of the Fourth International Workshop on Mining Software Repositories (p. 10). IEEE Computer Society. https://dl.acm.org/citation.cfm?id=1269048
[2] Watkins, E. R., & Nicole, D. A. (2006, January). Named graphs as a mechanism for reasoning about provenance. In Asia-Pacific Web Conference (pp. 943-948). Springer, Berlin, Heidelberg. https://link.springer.com/chapter/10.1007/11610113_99
[3] Keivanloo, I., Forbes, C., Hmood, A., Erfani, M., Neal, C., Peristerakis, G., & Rilling, J. (2012, June). A linked data platform for mining software repositories. In Mining Software Repositories (MSR), 2012 9th IEEE Working Conference on (pp. 32-35). IEEE. http://ieeexplore.ieee.org/abstract/document/6224296/
[4] Martinez, M., & Monperrus, M. (2015). Mining software repair models for reasoning on the search space of automated program fixing. Empirical Software Engineering, 20(1), 176-205. https://link.springer.com/article/10.1007/s10664-013-9282-8
Proposed by S. Karus
email: siim.karus@ut.ee
Interpretable Predictive Monitoring of Business Processes
Recent advances of supervised machine learning in various tasks stem from the use of powerful and complex models (neural networks, deep learning, random forests). However, adoption in practice remains challenging because of the limited interpretability of these methods and their low actionability (what should the user do to alter the ongoing process instance to improve the expected/predicted outcome?). The lack of understandability and actionability poses a serious challenge in domains such as financial and medical services, where understanding the decision behind a prediction is crucial. Moreover, the interpretability of the model can provide valuable feedback for improving it even further. As such, this thesis project goes beyond the state of the art in predictive process monitoring by developing methods and techniques to translate complex predictive models into understandable knowledge for key stakeholders in the process.
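As a minimal illustration of the problem (not a proposed solution), the sketch below trains a random-forest outcome predictor on hand-crafted prefix features and inspects it with permutation importance, one simple form of post-hoc interpretation. The feature names and the synthetic data are hypothetical; scikit-learn is assumed.

    # Minimal sketch: a "black-box" outcome predictor plus permutation importance.
    # Feature names and data are hypothetical; scikit-learn is assumed.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    features = ["num_events_so_far", "amount_requested", "num_reworks", "elapsed_hours"]
    X = rng.random((500, len(features)))
    y = (X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.random(500) > 1.0).astype(int)  # synthetic outcome

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
    for name, score in sorted(zip(features, result.importances_mean), key=lambda p: -p[1]):
        print(f"{name}: {score:.3f}")   # which prefix features drive the prediction?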
[1] https://www.sciencedirect.com/science/article/pii/S0957417417303950
Proposed by F. M. Maggi
email: f.m.maggi@ut.ee
Discovery of Hybrid Process Models
The declarative-procedural dichotomy is highly relevant when choosing the most suitable process modeling language to represent a discovered process in the context of process discovery techniques. Less-structured processes with a high level of variability can be described in a more compact way using a declarative language. By contrast, procedural process modeling languages seem more suitable for describing structured and stable processes. However, in various cases, a process may incorporate parts that are better captured in a declarative fashion, while other parts are more suitable to be described procedurally. In these scenarios, hybrid models are the best choice for describing the discovery results. In this thesis, an approach for the discovery of hybrid process models from logs of process executions will be developed. The approach will be implemented in the process mining tool ProM and evaluated in real-life case studies.
[1] Fabrizio Maria Maggi, Tijs Slaats, Hajo A. Reijers: The Automated Discovery of Hybrid Processes. BPM 2014: 392-399
Proposed by F. M. Maggi
email: f.m.maggi@ut.ee
Online Data-Aware Declarative Process Discovery from Event Streams
Stream processing is defined as “technologies designed to process large real-time streams of event data”, and one of its example applications is process monitoring. The challenge of dealing with streaming event data is also discussed in the Process Mining Manifesto. A process discovery algorithm is a function that maps an event log onto a process model such that the model is representative of the behavior seen in the event log. A declarative process model is a set of business rules that describe the process behavior under an open world assumption, i.e., everything that is not forbidden by the model is allowed. Such models can compactly express process behaviors involving multiple alternatives, can be enriched with data-aware conditions that depend on values represented as attributes in the data, and are very suitable for changeable and unstable environments compared to conventional procedural approaches. In [1], an approach to automatically discover declarative process models from streams of data has been presented. However, this approach does not consider data-aware conditions. In this thesis, the algorithm in [1] will be extended in order to generate data-aware declarative process models.
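For intuition only, the sketch below (which is not the algorithm of [1]) incrementally checks a single data-aware Declare constraint over an event stream and maintains an online estimate of its support. The constraint, the activity names, and the data attribute are hypothetical.

    # Minimal sketch: online checking of one data-aware Declare constraint,
    #   response(A, B) with data condition amount > 1000
    # i.e. whenever A occurs with amount > 1000, B must eventually follow in the same case.
    from collections import defaultdict

    pending = defaultdict(int)       # case id -> open activations of A still waiting for B
    activations = fulfilments = 0

    def observe(case_id, activity, attributes):
        """Consume one event from the stream and return the current constraint support."""
        global activations, fulfilments
        if activity == "A" and attributes.get("amount", 0) > 1000:
            pending[case_id] += 1
            activations += 1
        elif activity == "B" and pending[case_id] > 0:
            fulfilments += pending[case_id]
            pending[case_id] = 0
        return fulfilments / activations if activations else 1.0

    for event in [("c1", "A", {"amount": 1500}), ("c1", "B", {}), ("c2", "A", {"amount": 200})]:
        print(observe(*event))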
[1] Andrea Burattin, Marta Cimitile, Fabrizio Maria Maggi, Alessandro Sperduti: Online Discovery of Declarative Process Models from Event Streams. IEEE Trans. Services Computing 8(6): 833-846 (2015)
Proposed by F. M. Maggi
email: f.m.maggi@ut.ee
Deviance Mining of Business Processes
Deviant business process executions are those that deviate in a negative or positive way from normative or desirable outcomes, such as executions that undershoot or exceed performance targets. There are classification methods that can be used to discriminate between normal and deviant executions. In particular, they can be used to discover rules that explain potential causes of observed deviances. In this thesis, an approach for deviance mining of business processes will be implemented in the process mining tool ProM and evaluated in real-life case studies.
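A minimal sketch of the basic idea, assuming scikit-learn and using hypothetical toy traces: encode each trace as a vector of activity frequencies, train a decision tree to separate normal from deviant executions, and print the learned rules as candidate explanations of the deviance.

    # Minimal sketch: frequency-encode traces and learn interpretable deviance rules.
    # Toy traces and labels are hypothetical; a recent scikit-learn is assumed.
    from collections import Counter
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.tree import DecisionTreeClassifier, export_text

    traces = [
        ["register", "check", "approve"],                     # normal
        ["register", "check", "approve"],                     # normal
        ["register", "check", "rework", "check", "reject"],   # deviant
        ["register", "rework", "rework", "reject"],           # deviant
    ]
    labels = ["normal", "normal", "deviant", "deviant"]

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(Counter(t) for t in traces)         # activity-frequency encoding

    tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
    print(export_text(tree, feature_names=list(vec.get_feature_names_out())))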
[1] Hoang Nguyen, Marlon Dumas, Marcello La Rosa, Fabrizio Maria Maggi, Suriadi Suriadi: Business Process Deviance Mining: Review and Evaluation. CoRR abs/1608.08252 (2016)
Proposed by F. M. Maggi
email: f.m.maggi@ut.ee
Static Analysis of Node.js applications
Node.js is a runtime environment that allows JavaScript applications to run outside the Web browser. It has become a popular platform for implementing server-side applications in JavaScript. The lack of type safety of JavaScript makes it prone to errors and vulnerabilities, including injection attacks. A number of analysis techniques specifically designed for Node.js have emerged in recent years. Your task is to conduct a review of techniques and tools for static and dynamic analysis of JavaScript in general, and Node.js in particular, and to discuss the maturity and limitations of existing solutions in this field.
[1] https://dl.acm.org/citation.cfm?id=3179527
[2] https://dl.acm.org/citation.cfm?id=3236502
[3] https://plg.uwaterloo.ca/~olhotak/pubs/oopsla15.pdf
[4] https://dl.acm.org/citation.cfm?doid=3145473.3106739
[5] https://dl.acm.org/citation.cfm?doid=2771783.2771809
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Case Studies of Robotic Process Automation
Robotic Process Automation (RPA) tools, such as UIPath and Automation Anywhere, allow organizations to automate repetitive work by executing scripts that encode sequences of fine-grained interactions with Web and desktop applications, such as opening a file, selecting a field in a form or a cell in a spreadsheet, and copy-pasting data across fields or cells. A typical task that can be automated using an RPA tool is transferring data from one system to another via their respective user interfaces, e.g. copying records from a spreadsheet application into a Web-based enterprise information system. Several case studies of robotic process automation have been reported in recent years. Your task is to do a survey of case studies in the field of RPA, and to derive from this survey some advantages and pitfalls of RPA. Below are some initial examples of such case studies, which you can use as a starting point.
- http://eprints.lse.ac.uk/64518/
- http://tinyurl.com/y5m8gxht
- https://journals.sagepub.com/doi/pdf/10.1057/jittc.2016.5
- http://tinyurl.com/yxu2rs64
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Automated Discovery of Data Transformations From Examples
Implementing transformations between multiple data schemas or document formats is a recurrent task in enterprise system integration efforts. Recently, a new family of techniques has emerged, which seeks to automatically discover mappings between two data schemas based on examples. Your task is to conduct a survey of the literature in this field and to discuss the capabilities and limitations of existing techniques. Below are some starting points.
- https://web.eecs.umich.edu/~michjc/papers/jin_foofah_sigmod17.pdf
- http://tinyurl.com/yyzns4l2
- https://www.cc.gatech.edu/~xchu33/chu-papers/TDE-paper.pdf
- https://cs.uwaterloo.ca/~ilyas/papers/AbedjanICDE16.pdf
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Collaborative Business Process Execution on Blockchains
Blockchain technology provides basic building blocks to support the execution of collaborative business processes involving mutually untrusted parties in a decentralized environment. Several research proposals have demonstrated the feasibility of designing blockchain-based collaborative business processes using a high-level notation, such as the Business Process Model and Notation (BPMN), and thereon automatically generating the code artifacts required to execute these processes on a blockchain platform. Your task is to conduct a review of the state of the art in this field.
- http://kodu.ut.ee/~dumas/pubs/bpm2017caterpillar-demo.pdf
- https://arxiv.org/pdf/1808.03517.pdf
- http://tinyurl.com/yyk29bbb
- https://arxiv.org/pdf/1704.03610.pdf
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Predicting the Next Task in a Business Process
Predictive process monitoring is a family of techniques to predict future events or properties of running executions of a process. Within this field, several techniques have been proposed to address the following question: what will be the next event or task to occur in an ongoing execution of a process? Your task is to review the literature in this field and to discuss the capabilities and limitations of existing techniques. Below are some initial pointers; a minimal baseline sketch follows them.
- https://arxiv.org/pdf/1612.04600.pdf
- https://arxiv.org/pdf/1612.02130.pdf
- https://link.springer.com/article/10.1007/s00607-018-0593-x
- https://arxiv.org/pdf/1811.00062.pdf
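For comparison, the minimal baseline sketch below predicts the next activity purely from transition frequencies observed in a hypothetical toy log; the techniques in the pointers above aim to beat exactly this kind of baseline.

    # Minimal baseline sketch (not one of the surveyed techniques): predict the next
    # activity from the most recent one using transition frequencies in a toy log.
    from collections import Counter, defaultdict

    log = [
        ["register", "check", "approve", "pay"],
        ["register", "check", "reject"],
        ["register", "check", "approve", "pay"],
    ]

    transitions = defaultdict(Counter)
    for trace in log:
        for current_act, next_act in zip(trace, trace[1:]):
            transitions[current_act][next_act] += 1

    def predict_next(prefix):
        """Return the most frequent successor of the last activity of the running case."""
        last = prefix[-1]
        return transitions[last].most_common(1)[0][0] if transitions[last] else None

    print(predict_next(["register", "check"]))   # -> 'approve'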
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Principles of microservice architectures
Microservice architectures are a modern approach to develop, deploy, operate, and maintain software applications, particularly in enterprise settings. They rely on the tenet that the data and business logic of an application should be decomposed into cohesive pieces, which are developed, deployed and operated independently. But what exactly is a microservice architecture? What are the key principles or tenets of a microservice architecture? And what trade-offs does it strike with respect to other alternative types of software architectures? Your task is to conduct a review of definitions and principles of microservice architectures in order to provide a synthetic answer to the above questions.
- https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7030212
- https://www.computer.org/csdl/proceedings/icsaw/2017/4793/00/07958492.pdf
- https://research.birmingham.ac.uk/portal/files/29076197/bare_conf_1.pdf
- https://www.ifsoftware.ch/uploads/tx_icscrm/1_msa-pospaperzio4summersoc2016v15nc.pdf
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Migrating monoliths to microservices
Microservice architectures are a modern approach to develop, deploy, operate, and maintain software applications, particularly in enterprise settings. They rely on the tenet that the data and business logic of an application should be decomposed into cohesive pieces, which are developed, deployed and operated independently. Over the past few years, many companies have migrated their enterprise systems from a so-called monolithic (three-tier) architecture to a microservices architecture. Your task is to conduct a review of case studies and methods for migrating monolithic applications to microservices.
- https://arxiv.org/pdf/1704.04173.pdf
- https://www.computer.org/csdl/proceedings/icsaw/2017/4793/00/07958457.pdf
- https://arxiv.org/ftp/arxiv/papers/1807/1807.10059.pdf
- http://www.ivanomalavolta.com/files/papers/ICSA_2018.pdf
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Offline-First Web Applications
Most Web applications are architected in such a way that they cannot be used in the absence of a (stable) connection between the client and the server. In the context of mobile computing, it is however desirable to have at least some functionality available even when the device is disconnected from the network. There are several approaches for developing offline-first web applications, capable of operating (possibly in degraded mode) without a connection. Your task is to review existing approaches in this field.
- https://aaltodoc.aalto.fi/bitstream/handle/123456789/29096/master_Vanhala_Janne_2017.pdf
- http://tinyurl.com/yxv7nvzj
- https://link.springer.com/chapter/10.1007/978-3-319-04244-2_15
- http://tinyurl.com/yy3nwftq
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Progressive Web Applications
Web applications are said to be progressive when they provide a similar user experience regardless of the browser and the device (e.g. desktop PC, tablet, or smartphone) and can work offline in a degraded mode. Your task is to review existing definitions of progressive Web apps and approaches to implement such apps. Below are some initial pointers:
- http://www.scitepress.org/Papers/2017/63537/63537.pdf
- https://aisel.aisnet.org/hicss-51/st/mobile_app_development/7/
- https://link.springer.com/chapter/10.1007/978-3-319-93527-0_4
- https://www.smashingmagazine.com/2018/11/guide-pwa-progressive-web-applications/
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Static Bug Detection
Detecting bugs automatically is a holy grail in the field of software verification. Decades of development in the field of static program analysis have led to relatively mature bug finding tools, which have been deployed in industrial settings in the past 10-15 years. Your task is to conduct a survey of bug finding tools based on static analysis (e.g. FindBugs, Infer, Julia) and to discuss their current capabilities and limitations. Below are some initial pointers:
[1] https://storage.googleapis.com/pub-tools-public-publication-data/pdf/34339.pdf
[2] https://storage.googleapis.com/pub-tools-public-publication-data/pdf/32791.pdf
[3] https://link.springer.com/chapter/10.1007/978-3-662-53413-7_3
[4] http://www.cs.columbia.edu/~junfeng/14fa-e6121/papers/coverity.pdf
[5] https://link.springer.com/chapter/10.1007%2F978-3-319-17524-9_1
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Sampling-based Java Profiling
Developers generally rely on profilers to analyze the performance of software applications and to identify the source of performance issues. However, profiling comes at a cost: it generally slows down the application under observation noticeably, which in many cases diminishes the usefulness of profilers. Sampling-based profiling is a technique to profile applications in a way that limits the performance overhead. This technique is, however, difficult to implement correctly and with minimal overhead, particularly in the context of multi-threaded applications. A significant amount of research and development is still ongoing in the field of sampling-based profiling of (e.g.) Java applications.
[1] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, Peter F. Sweeney. Evaluating the accuracy of Java profilers. PLDI 2010: 187-197
[2] Peter Hofer, David Gnedt, Hanspeter Mössenböck: Lightweight Java Profiling with Partial Safepoints and Incremental Stack Tracing. Proc. of ICPE 2015: 75-86
[3] Peter Hofer, Hanspeter Mössenböck: Efficient and accurate stack trace sampling in the Java hotspot virtual machine. ICPE 2014: 277-280
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Process Mining for Auditing
Auditing is a time-consuming task. The widespread use of enterprise systems (e.g. CRM and ERP systems) to perform day-to-day business processes in companies is making it possible to apply data mining techniques to support the work of auditors. In particular, several experiences have shown that process mining can be used to support certain auditing tasks.
[1] Mieke Jans, Michael Alles, Miklos Vasarhelyib: The case for process mining in auditing: Sources of value added and areas of application. International Journal of Accounting Information Systems 14(1): 1-20 (2013)
[2] Wil M. P. van der Aalst, Kees M. van Hee, Jan Martijn E. M. van der Werf, Marc Verdonk: Auditing 2.0: Using Process Mining to Support Tomorrow's Auditor. IEEE Computer 43(3): 90-93 (2010)
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Discovering concept drift in business processes from event logs
In the field of business process mining, the term "concept drift" refers to the fact that business processes tend to change over time. Several methods have been proposed to discover concept drift in business processes from event logs, for example the ones below; a minimal drift-detection sketch follows the references.
[1] R. P. Jagadeesh Chandra Bose, Wil M. P. van der Aalst, Indre Zliobaite, Mykola Pechenizkiy: Dealing With Concept Drifts in Process Mining. IEEE Trans. Neural Netw. Learning Syst. 25(1): 154-171 (2014)
[2] Josep Carmona, Ricard Gavaldà: Online Techniques for Dealing with Concept Drift in Process Mining. Proc. of IDA 2012, pp. 90-102
[3] Abderrahmane Maaradji, Marlon Dumas, Marcello La Rosa, Alireza Ostovar: Detecting Sudden and Gradual Drifts in Business Processes from Execution Traces. IEEE Trans. Knowl. Data Eng. 29(10): 2140-2154 (2017)
[4] Alireza Ostovar, Abderrahmane Maaradji, Marcello La Rosa, Arthur H. M. ter Hofstede: Characterizing Drift from Event Streams of Business Processes. CAiSE 2017: 210-228
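As a minimal drift-detection sketch in the spirit of the statistical-test-based methods above (cf. [3]), the snippet below compares the activity-frequency distributions of two adjacent windows of a log with a chi-square test. The windows are hypothetical and scipy is assumed.

    # Minimal sketch: test whether two adjacent log windows differ significantly in
    # their activity frequencies (a possible sign of concept drift). Toy data; scipy assumed.
    from collections import Counter
    from scipy.stats import chi2_contingency

    window_1 = ["A", "B", "C", "A", "B", "C", "A", "B"]   # activities before the candidate drift point
    window_2 = ["D", "C", "D", "A", "D", "D", "C", "D"]   # activities after it

    activities = sorted(set(window_1) | set(window_2))
    c1, c2 = Counter(window_1), Counter(window_2)
    table = [[c1[a] for a in activities], [c2[a] for a in activities]]

    stat, p_value, _, _ = chi2_contingency(table)
    print(f"chi2 = {stat:.2f}, p-value = {p_value:.4f}")
    if p_value < 0.05:
        print("distributions differ significantly -> possible concept drift")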
Proposed by M. Dumas
email: marlon.dumas@ut.ee
Is media biased? An empirical analysis
News channels often try to portray news stories from their own perspectives. In particular, it has been observed that media houses are biased towards specific topics, people and political parties. In this thesis, you will be analysing a set of news stories derived from different news websites (such as BBC, CNN, etc.). The study will be done with the intention of exploring whether news channels are biased towards specific 1) topics, 2) people or 3) political parties. You will be using data science techniques (such as opinion mining and machine learning) for performing the empirical analysis of your study.
[1] Robert M. Entman. Media framing biases and political power: Explaining slant in news of Campaign 2008. Journalism. Vol 11, Issue 4, pp. 389 - 408
[2] David Niven. Bias in the News: Partisanship and Negativity in Media Coverage of Presidents George Bush and Bill Clinton. International Journal of Press and Politics. Vol 6, Issue 3, pp. 31 - 46
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Understanding Filter bubbles in social media networks
A filter bubble is an algorithmic bias that skews or limits the information an individual user sees on the internet. The bias is caused by the weighted algorithms that search engines, social media sites and marketers use to personalize the user experience. The concept is particularly important in creating opinionated individuals. In this thesis, a study will be performed to understand the effect of filter bubbles on social media users.
[1] Mario Haim, Andreas Graefe & Hans-Bernd Brosius, Burst of the Filter Bubble? Effects of personalization on the diversity of Google News. Digital Journalism.
[2] Nguyen, Tien T. and Hui, Pik-Mai and Harper, F. Maxwell and Terveen, Loren and Konstan, Joseph A. Exploring the Filter Bubble: The Effect of Using Recommender Systems on Content Diversity, WWW 2014
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Analysing echo chambers in social networks
An echo chamber is a metaphorical description of a situation in which information, ideas, or beliefs are amplified or reinforced by communication and repetition inside a defined system. In this thesis, we will investigate echo chambers in social media platforms such as Twitter or Facebook and their effect on social media users. Techniques from network science and machine learning will be explored for understanding echo chambers in social media.
[1] Eric Gilbert, Tony Bergstrom, and Karrie Karahalios. 2009. Blogs are echo chambers: Blogs are echo chambers. In 42nd Hawaii International Conference on System Sciences. IEEE, 1–10.
[2] Eric Lawrence, John Sides, and Henry Farrell. 2010. Self-segregation or deliberation? Blog readership, participation, and polarization in American politics. Perspectives on Politics 8, 1 (2010), 141–157
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Left, Center or Right?: Controversial groups on Social Media
With respect to political views, users in social media can often be classified broadly into three categories, namely left, right or center. In this thesis, users of social media platforms, in particular platforms like Facebook, will be studied anonymously. The crux of the problem will be to predict users' inclination towards right, center or left political parties. Data science techniques such as network science, machine learning and sentiment analysis will be explored for this prediction problem.
[1] Gottipati S., Qiu M., Yang L., Zhu F., Jiang J. (2013) Predicting User’s Political Party Using Ideological Stances. In: Jatowt A. et al. (eds) Social Informatics. SocInfo 2013. Lecture Notes in Computer Science, vol 8238. Springer,
[2] Michael D. Conover, Bruno Gonçalves, Jacob Ratkiewicz, Alessandro Flammini and Filippo Menczer. Predicting the Political Alignment of Twitter Users. IEEE Third International Conference on Social Computing (SocialCom), 2011.
[3] Aaron Acosta (ateam91), Silviana Ciurea-Ilcus (smci), Michal Wegrzynski (michalw). Predicting users' political support from their Reddit comment history. goo.gl/Gchzyw
[4] Daniel Xiaodan Zhou, Paul Resnick, Qiaozhu Mei. Classifying the Political Leaning of News Articles and Users from User Votes. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media 2011
[5] Waheed H, Anjum M, Rehman M, Khawaja A (2017) Investigation of user behavior on social networking sites. PLoS ONE 12(2)
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Measuring corporate reputation through online social media
When businesses are caught engaging in illegal or immoral activities, their reputation might suffer. Corporate reputation is a reflection of how a business is regarded by its customers and the public in general. If corporate misbehavior negatively affects a business' reputation, customers might switch to rival businesses. For this reason, reputation has a central role in free markets, as it has the potential to deter businesses from misbehaving. The extent to which corporate wrongdoings trigger a reputational loss is still debated and is the subject of a large body of academic work. Most of these works are based on survey methods to measure reputation. This research relies on a more direct method to measure reputational changes, by conducting a sentiment analysis of how the public reacted on Twitter to some of the most high-profile corporate misconducts. In this thesis, corporate reputation will be studied using the Volkswagen (VW) scandal as a case study, together with the public reaction it created on Twitter. VW's scandal has been chosen because it has been widely covered over time through both traditional and social media. Moreover, we can measure how changes in media coverage and social media reaction affected VW's financial performance. The dataset and related literature will be provided to speed up the work.
[1] Corné Dijkmans, Peter Kerkhof, Camiel J. Beukeboom. A stage to engage: Social media use and corporate reputation, 2014.
[2] Nadine Gatzert, The impact of corporate reputation and reputation damaging events on financial performance: Empirical evidence from the literature, 2015.
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Predicting information diffusion using regression techniques
Information diffusion on online social media platforms (such as Facebook, Twitter, etc.) has applications in various domains such as viral marketing and news propagation. Some information spreads faster than other information, depending on the topics of interest of the online users. For example, taking Twitter as a use case, researchers have investigated tweet prediction, i.e., given a tweet, predicting how many times that tweet will be retweeted. The problem has been analysed both as a classification and as a regression problem. Your task is to review the related literature that has analysed the problem as a regression (or classification) problem.
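A minimal sketch of the regression formulation, assuming scikit-learn and using synthetic data; the tweet features below (follower count, number of hashtags, tweet length, posting hour) are hypothetical stand-ins for whatever features the reviewed papers actually use.

    # Minimal sketch: retweet-count prediction as a regression problem over simple features.
    # Synthetic data and hypothetical features; scikit-learn is assumed.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    n = 1000
    followers = rng.integers(10, 100_000, n)
    hashtags = rng.integers(0, 5, n)
    length = rng.integers(10, 280, n)
    hour = rng.integers(0, 24, n)
    X = np.column_stack([followers, hashtags, length, hour])
    retweets = 0.001 * followers + 5 * hashtags + rng.poisson(3, n)   # synthetic target

    X_tr, X_te, y_tr, y_te = train_test_split(X, retweets, random_state=1)
    model = GradientBoostingRegressor(random_state=1).fit(X_tr, y_tr)
    print("MAE on held-out tweets:", mean_absolute_error(y_te, model.predict(X_te)))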
[1] Hong, L., Dan, O., Davison, B.D.: Predicting popular messages in twitter. In: Proceedings of the 20th International Conference Companion on World Wide Web, WWW'11, pp. 57-58. ACM, New York, NY, USA (2011)
[2] K Lytvyniuk, R Sharma, A Jurek-Loughrey. Predicting Information Diffusion in Online Social Platforms: A Twitter Case Study. International Workshop on Complex Networks and their Applications, 405-417, 2017.
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
The impact of social media on consumers
By analyzing social media discussions about products, we would like to determine whether there is any correlation between the sales or economic success of a particular product and the product's popularity before and after its announcement and release (and, in case of a positive correlation, to find the parameters or features that drive success). In particular, we are interested in analyzing popular products (mobile phones, computers or video games) through social platforms like Facebook, Twitter, blogs or web pages. This may involve sentiment analysis of the discussions about products, feature extraction of the product itself, using critics' opinions and ratings, and looking at the sales numbers of the products in question.
[1] M. Nick Hajli. A study of the impact of social media on consumers. International Journal of Market Research Vol. 56 Issue 3, 2014
[2] M Odhiambo. Social media as a tool of marketing and creating brand awareness.
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
The predictive power of the wisdom of the crowd
The term “wisdom of the crowd” refers to the collective opinion of a community or group. In comparison, expert views refer to the views expressed by the experts of a particular domain. In this thesis, you will investigate whether it is the experts or the wisdom of the crowd that can better predict the box office outcome of movies. In particular, you will analyse tweets about movies around the period of their release dates. A small dataset of tweets about various movies will be provided. However, we also expect to expand the analysis by collecting tweets about more movies during the thesis period. The thesis involves sentiment analysis of the tweets and, subsequently, the proposal of a model for predicting the box office results of the movies.
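A minimal sketch of the sentiment-analysis step, assuming NLTK with the vader_lexicon resource downloaded; the tweets are hypothetical. Aggregate scores computed this way would then be fed, together with other features, into the box-office prediction model.

    # Minimal sketch: aggregate the sentiment of pre-release tweets about a movie.
    # Requires nltk and nltk.download("vader_lexicon"); the tweets are hypothetical.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    tweets = [
        "Cannot wait for this movie, the trailer looks amazing!",
        "Meh, looks like a boring sequel to me.",
        "Best cast of the year, definitely watching on opening night.",
    ]

    analyzer = SentimentIntensityAnalyzer()
    scores = [analyzer.polarity_scores(t)["compound"] for t in tweets]   # -1 (negative) .. +1 (positive)
    avg_sentiment = sum(scores) / len(scores)
    share_positive = sum(s > 0.05 for s in scores) / len(scores)
    print(f"average sentiment: {avg_sentiment:.2f}, share of positive tweets: {share_positive:.2f}")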
[1] Fabian Abel, Ernesto Diaz-Aviles, Nicola Henze, Daniel Krause, and Patrick Siehndel. Analyzing the blogosphere for predicting the success of music and movie products. In Advances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on, pages 276–280. IEEE, 2010.
[2] Sitaram Asur and Bernardo A Huberman. Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 1, pages 492–499. IEEE, 2010.
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Behavior analysis of bike users in a city setting
As part of smart cities, authorities are creating separate lanes and bicycle racks for bikers. However, the key question is the utilization of these resources put in place for better traffic management. In this thesis, we will be analyzing a real dataset of an Italian city with a population of 385,192 inhabitants. The dataset covers a period of 6 months, from April 2017 to September 2017. We will be predicting users' behavior in terms of using these resources. We expect you to use data science and machine learning techniques. The dataset is the property of SRM Reti e Mobilità Srl, and all analyses must respect the NDA, preserving the privacy and anonymity of the users.
[1] Gabriel Martins Dias, Boris Bellalta and Simon Oechsner . Predicting Occupancy Trends in Barcelona’s Bicycle Service Stations Using Open Data , SAI Intelligent Systems Conference 2015
[2] Eoin O’Mahony, David B. Shmoys. Data Analysis and Optimization for (Citi)Bike Sharing. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence 2015
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Tracking unusual activities in Traffic police data
Sensors and cameras placed on highways are meant to keep traffic free of disruptions by detecting accidents so that appropriate and timely actions can be taken. However, the collected data can also be used for detecting unusual activities. In this thesis, you will be analyzing large-scale traffic police data shared by the Italian authorities to detect anomalies or unusual behavior on highways. A dataset will be provided. We expect you to find unusual behavior in this traffic data using machine learning techniques, in particular anomaly detection techniques.
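A minimal sketch of one possible approach, assuming scikit-learn; the per-interval features (vehicle count and average speed) are synthetic, hypothetical stand-ins for the real traffic data.

    # Minimal sketch: flag unusual time intervals in traffic data with an Isolation Forest.
    # Synthetic, hypothetical features; scikit-learn is assumed.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(7)
    normal = np.column_stack([rng.normal(500, 50, 200), rng.normal(100, 10, 200)])  # count, km/h
    unusual = np.array([[900, 20], [50, 140]])    # e.g. a jam and an almost empty, speeding interval
    X = np.vstack([normal, unusual])

    detector = IsolationForest(contamination=0.01, random_state=7).fit(X)
    labels = detector.predict(X)                  # -1 = anomaly, 1 = normal
    print("intervals flagged as unusual:", np.where(labels == -1)[0])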
[1] Liang Xiong, Xi Chen, Jeff Schneider. Direct Robust Matrix Factorization for Anomaly Detection. International Conference on Data Mining (ICDM), 2011.
[2] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, Rabab Ward. Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval. https://arxiv.org/pdf/1502.06922.pdf
[3] Jefferson Ryan Medel. Anomaly Detection Using Predictive Convolutional Long Short-Term Memory Units. http://scholarworks.rit.edu/theses/9319/
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Analyzing customers behavior using purchase data
The purchasing transactions performed by customers can provide valuable insights about the behavior of individuals. For example, transactions can reveal the purchasing power as well as the eating habits of users. In this thesis, you will analyse a large-scale set of customer transactions, using data science techniques, for predicting customers' behavior.
[1] Diego Pennacchioli, Michele Coscia, Salvatore Rinzivillo, Dino Pedreschi, Fosca Giannotti. Explaining the Product Range Effect in Purchase Data. BigData 2013
[2] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” SIGMOD Rec., vol. 22, no. 2, pp. 207–216, Jun. 1993.
[3] https://bigml.com/user/czuriaga/gallery/dataset/info?reload
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Analyzing question-answering systems: The Quora case study
Question answering systems (QASs) generate answers to questions asked in natural language. Early QASs were developed for restricted domains and had limited capabilities. More recently, platforms like Quora have helped to remove these boundaries. In this thesis, using Quora as a case study, you will perform user analytics to understand the reasons behind the success of a platform where "all kinds of questions are welcome". We expect you to perform an empirical study of the users of the platform.
[1] Abdelghani Bouziane, Djelloul Bouchiha, Noureddine Doumi, Mimoun Malki. Question Answering Systems: Survey and Trends. Procedia Computer Science
[2] Albert Tung and Eric Xu, Determining Entailment of Questions in the Quora Dataset. Class Reports https://web.stanford.edu/class/cs224n/reports/2748301.pdf
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Vulnerability analysis in multilayer networks: A data science approach
Traditionally, networks are studied individually; that is, studies do not consider the relations that might exist between two or more networks connected with each other. For example, some social media users are present not only on Facebook but also on Twitter and other social media platforms such as Instagram. As another example, consider transportation networks: it is more prudent to analyze the various transportation modes, such as road, rail and air, collectively. A bird's-eye view of such a collection of networks is called a multilayer (ML) network, encompassing various individual networks. In such ML networks, we would like to study the problem of vulnerability analysis, that is, how strong a network is against any kind of breakage. The breakage could occur due to natural or unnatural causes (for example, earthquakes, accidents or riots). In this thesis, using transportation networks as a use case, a study will be performed with a particular focus on vulnerability analysis.
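For intuition, the sketch below builds a toy two-layer transportation network (road and rail over the same stations) with networkx and uses a deliberately simple vulnerability measure: the relative drop in the size of the largest connected component when a node fails. The layers and the measure are illustrative only.

    # Minimal sketch: toy multilayer (road + rail) network and a naive vulnerability measure.
    # networkx is assumed; nodes and edges are toy data.
    import networkx as nx

    road = nx.Graph([("A", "B"), ("B", "C"), ("C", "D")])
    rail = nx.Graph([("A", "C"), ("C", "E")])
    combined = nx.compose(road, rail)        # flattened view of the multilayer network

    def largest_component_size(g):
        return max((len(c) for c in nx.connected_components(g)), default=0)

    baseline = largest_component_size(combined)
    for node in list(combined.nodes):
        damaged = combined.copy()
        damaged.remove_node(node)
        print(node, "->", largest_component_size(damaged) / baseline)   # lower = more critical node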
[1] Mikko Kivelä, Alex Arenas, Marc Barthelemy, James P. Gleeson, Yamir Moreno, Mason A. Porter; Multilayer networks, Journal of Complex Networks, Volume 2, Issue 3, 1 September 2014, Pages 203–271
[2] Aleta, Alberto, Sandro Meloni, and Yamir Moreno. "A Multilayer perspective for the analysis of urban transportation systems." Scientific reports 7 (2017): 44359.
[3] Kivelä, Mikko, et al. "Multilayer networks." Journal of complex networks 2.3 (2014): 203-271.
[4] A Furno, NE El Faouzi, R Sharma, E Zimeo. Two-level Clustering Fast Betweenness Centrality Computation for Requirement-driven Approximation. IEEE Big Data 2017
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Analyzing Server Logs for predicting Job Failures
Server logs generally refer to files which are created for monitoring the activities performed on servers. In recent years, a lot of research has been performed on analyzing server logs to determine the status of the jobs or tasks that arrive at servers. In this thesis, you will be analyzing logs from a Google cluster, which is a set of machines responsible for running real Google jobs, for example search queries. The research encompasses the domains of large-scale data analytics and machine learning. The main contribution of the thesis is a model to predict job failures on servers. A real dataset of Google traces will be provided along with related literature to ramp up the learning process.
[1] Chunhong Liu et al. Predicting of Job Failure in Compute Cloud Based on Online Extreme Learning Machine: A Comparative Study. 2017
[2] Andrea Rosa et al. Predicting and Mitigating Jobs Failures in Big Data Cluster. 2015
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Context Based Text Classification using Deep Learning
Rent seeking is a term used by economists to capture situations where businesses acquire extra wealth without doing anything productive (lobbying is almost synonymous). A textbook example is when taxi companies lobby governments around the world to limit the entry of Uber-like operators. Rent seeking is one of the most important ideas in economics, but its measurement has always been a challenge for economists because of its latent nature, and the empirical literature is very thin. One way to address this problem is to use classical machine learning (SVMs, etc.) for text classification; however, these approaches do not return good results. An alternative is to use deep learning approaches, such as word embeddings (word2vec), for classifying the text, i.e., deciding whether a document is about rent seeking or not. For this topic, you will investigate the state-of-the-art literature on capturing the context of documents for better document classification.
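A minimal sketch of the embedding-based alternative, assuming gensim (version 4 or later) and scikit-learn, with a hypothetical toy corpus and labels: each document is represented as the average of its word2vec vectors and classified with logistic regression.

    # Minimal sketch: average word2vec vectors as document features for classification.
    # Toy corpus and labels are hypothetical; gensim >= 4 and scikit-learn are assumed.
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    docs = [
        "taxi lobby urges government to block new ride hailing licenses",
        "firm invests in new factory to increase production capacity",
        "industry group pays lobbyists to keep import quotas in place",
        "startup launches faster delivery service for rural customers",
    ]
    labels = [1, 0, 1, 0]                    # 1 = rent seeking, 0 = not (toy labels)
    tokens = [d.split() for d in docs]

    w2v = Word2Vec(sentences=tokens, vector_size=50, window=3, min_count=1, epochs=50, seed=0)

    def doc_vector(words):
        return np.mean([w2v.wv[w] for w in words if w in w2v.wv], axis=0)

    X = np.vstack([doc_vector(t) for t in tokens])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X))                    # sanity check on the training documents

In practice, pre-trained embeddings (or contextual models) trained on a large corpus would replace the tiny word2vec model trained here.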
[1] Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, Wojciech Samek. "What is Relevant in a Text Document?": An Interpretable Machine Learning Approach. Link on arxiv: https://arxiv.org/abs/1612.07843
[2] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean. Distributed representations of words and phrases and their compositionality. Neural information processing systems
Proposed by R. Sharma
email: rajesh.sharma@ut.ee
Analysis of developers’ interactions through social network data
Nowadays, software development is supported by different social media. Developers usually communicate and manage their projects through well-known social coding platforms such as GitHub, Bitbucket, and JIRA. These repositories contain a wealth of information that is available for analysis and allows us to understand how developers interact and influence each other during software development. In this context, several studies have explored developers' interactions and their social behaviour patterns through the analysis of social networks. Your task is to provide an overview of the main findings as well as the methods and techniques that have been proposed to understand the interactions of software developers in the context of social networks.
[1] http://ieeexplore.ieee.org/document/7752419/
[2] https://arxiv.org/abs/1407.2535
[3] https://dl.acm.org/citation.cfm?id=2666571
Proposed by E. Scott and R. Sharma
email: ezequiel.scott@ut.ee
Methods and techniques to identify dependencies among User Stories
In Agile software development, requirements are usually expressed as User Stories. Although User Stories are expected to follow a fixed structure (“As <a role>, I want to <a feature> in order to <a benefit>”), they are still written using natural language and informal descriptions. This can lead to poor-quality User Stories that overlap each other in concept and cannot be scheduled and implemented in an arbitrary order. In this context, some studies have explored different approaches to organize and identify dependent user stories. Your task is to provide an overview of the methods and techniques that have been proposed to deal with the dependencies among User Stories in agile contexts.
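One simple heuristic is to flag potentially overlapping or dependent stories by the textual similarity of their descriptions. The sketch below, assuming scikit-learn and using hypothetical stories and an arbitrary threshold, illustrates the idea with TF-IDF vectors and cosine similarity.

    # Minimal sketch: flag user-story pairs with high textual similarity as candidates
    # for overlap/dependency analysis. Stories and threshold are hypothetical.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    stories = [
        "As a customer, I want to reset my password in order to regain access to my account",
        "As a customer, I want to change my password in order to keep my account secure",
        "As an admin, I want to export monthly sales reports in order to share them with finance",
    ]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(stories)
    sim = cosine_similarity(tfidf)

    threshold = 0.3
    for i in range(len(stories)):
        for j in range(i + 1, len(stories)):
            if sim[i, j] > threshold:
                print(f"stories {i} and {j} may be related (similarity {sim[i, j]:.2f})")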
[1] https://link.springer.com/chapter/10.1007/978-3-319-06862-6_8
[2] http://ieeexplore.ieee.org/document/7549299/
[3] http://ieeexplore.ieee.org/document/7272550/
[4] https://link.springer.com/chapter/10.1007/978-3-642-13054-0_17
Proposed by E. Scott
email: ezequiel.scott@ut.ee
Team performance in agile software development
Having teams that perform well is critical for the success of a project, and good performance is often related to effectiveness. For this reason, one of the agile principles states that teams should have room to reflect on how to become more effective and adjust their behaviour accordingly. However, there is still a need for a clear definition of team performance, team effectiveness, and what enables team performance. Your task is to review the current literature on definitions, models, and approaches related to team performance, and on how they can be used to improve the effectiveness of agile software development teams.
[1] Lindsjørn, Y., Sjøberg, D. I., Dingsøyr, T., Bergersen, G. R., & Dybå, T. (2016). Teamwork quality and project success in software development: A survey of agile development teams. Journal of Systems and Software, 122, 274-286.
[2] Kozak, Yavuz (2013). Barriers Against Better Team Performance in Agile Software Projects. Chalmers University of Technology, Sweden.
[3] Downey, S., & Sutherland, J. (2013, January). Scrum metrics for hyperproductive teams: how they fly like fighter aircraft. In 2013 46th Hawaii International Conference on System Sciences (pp. 4870-4878). IEEE.
Proposed by E. Scott
email: ezequiel.scott@ut.ee
Issue/bug recommender systems
In agile software development, issue allocation is often based on self-assignment. That is, developers choose the issues (e.g. user stories, bugs) that they will work on during the sprint according to their own preferences and experience. Industry practices give some evidence to support this method of issue allocation, but how it takes place is not completely clear yet. As far as we know, developers apply different strategies for self-assigning different types of issues (new features, enhancements, bug fixes). Recommender systems have been developed to help developers choose their issues/bugs. The goal of this project is to review the current literature on which approaches for recommending issues/bugs to developers have been proposed, which sources of information have been used, how those approaches have been evaluated, and what their importance is for agile software development.
[1] Kanwal, J., & Maqbool, O. (2010, December). Managing open bug repositories through bug report prioritization using SVMs. In Proceedings of the International Conference on Open-Source Systems and Technologies, Lahore, Pakistan (pp. 22-24).
[2] Anvik, J., Hiew, L., & Murphy, G. C. (2006, May). Who should fix this bug?. In Proceedings of the 28th international conference on Software engineering (pp. 361-370). ACM.
[3] Alenezi, M., Banitaan, S., & Zarour, M. (2018). Using Categorical Features in Mining Bug Tracking Systems to Assign Bug Reports. arXiv preprint arXiv:1804.07803.
Proposed by E. Scott
email: ezequiel.scott@ut.ee
Case studies of fintech companies applying agile software development
Agile software development has been popular during the last decades. There are many industrial sectors that rely on agile practices, ranging from e-commerce to automotive and healthcare. The financial services and fintech industry is no exception; in fact, it is considered the second most relevant industrial sector that uses agile methods [1]. However, there is still a lack of understanding of the extent to which this sector has tailored its processes to support its particularities. Your task is to summarize the case studies of fintech companies and financial services that have applied agile practices and to determine the particularities that impact the software development process.
[1] "The 12th annual State of Agile Development survey”, Version One, 2018, [online] Available: https://explore.versionone.com/state-of-agile/versionone-12th-annual-state-of-agile-report.
[2] Gechevski, D., Poposka, K., Angelova, B., & Gecevska, V. (2014). AGILE SOFTWARE DEVELOPMENT PRODUCTS FOR FINTECH-FINANCIAL TECHNOLOGIES. In Proceedings of the 8th International Conference on Mass Customization and Personalization– (Vol. 23, No. 6, p. 107). UNIVERSITY OF NOVI SAD–FACULTY OF TECHNICAL SCIENCES DEPARTMENT OF INDUSTRIAL ENGINEERING AND MANAGEMENT 21000 Novi Sad, Trg Dositeja Obradovića 6, Serbia.
[3] Dapp, T. F. (2017). Fintech: the digital transformation in the financial sector. In Sustainability in a Digital World (pp. 189-199). Springer, Cham.
[4] Kilu, E., Milani, F., Scott, E., & Pfahl, D. (2019, January). Agile Software Process Improvement by Learning from Financial and Fintech Companies: LHV Bank Case Study. In International Conference on Software Quality (pp. 57-69). Springer, Cham.
Proposed by E. Scott
email: ezequiel.scott@ut.ee
Lean Software Development metrics
Lean Software Development is an approach where traditional Lean manufacturing philosophies, principles, and tools are applied to software development. In recent years, Lean Software Development has gained popularity and many companies have used this approach to optimize efficiency and minimize waste in the development of their software. In this context, the use of meaningful software metrics is critical. This project aims to summarize and review the current metrics related to Lean Software Development that are relevant to industry, as well as to describe how they should be used in a given context.
[1] Alahyari, H., Gorschek, T., & Svensson, R. B. (2019). An exploratory study of waste in software development organizations using agile or lean approaches: A multiple case study at 14 organizations. Information and Software Technology, 105, 78-94.
[2] Feyh, M., & Petersen, K. (2013). Lean software development measures and indicators-a systematic mapping study. In Lean Enterprise Software and Systems (pp. 32-47). Springer, Berlin, Heidelberg.
[3] Kupiainen, E., Mäntylä, M. V., & Itkonen, J. (2015). Using metrics in Agile and Lean Software Development–A systematic literature review of industrial studies. Information and Software Technology, 62, 143-163.
[4] Staron, M., Meding, W., & Palm, K. (2012, May). Release readiness indicator for mature agile and lean software development projects. In International Conference on Agile Software Development (pp. 93-107). Springer, Berlin, Heidelberg.
Proposed by E. Scott and F. Milani
email: ezequiel.scott@ut.ee
A Survey of Security Risks in Permissionless Blockchain Applications
The goal of this survey is to explain the security risks that can be mitigated by permissionless blockchain applications, such as those based on Ethereum. In addition, once the infrastructure of a system is moved to blockchain-supported platforms, new security risks can potentially be introduced. Thus, the second goal is to identify and explain the most frequent security risks that can be found in this type of technology.
Proposed by R. Matulevicius
email: raimundas.matulevicius@ut.ee
A Survey of Security Risks in Permissioned Blockchain Applications
The goal of this survey is to explain the security risks that can be mitigated by permissioned blockchain applications, such as those based on Hyperledger Fabric. In addition, once the infrastructure of a system is moved to blockchain-supported platforms, new security risks can potentially be introduced. Thus, the second goal is to identify and explain the most frequent security risks that can be found in this type of technology.
Proposed by R. Matulevicius
email: raimundas.matulevicius@ut.ee
A Survey of the Privacy-by-Design Techniques and Tools to Comply with GDPR
GDPR is the European regulation that guarantees the privacy of personal data when it is processed. The goal of this survey is to explain the privacy-by-design techniques that could be applied to make organisational processes compliant with the (selected) articles of the regulation. The survey should provide a comparative explanation and potentially suggest some guidelines for selecting the proper techniques.
Proposed by R. Matulevicius
email: raimundas.matulevicius@ut.ee
A Survey of the Architecture Frameworks of IoT Systems for the Security Need
The architecture of Internet of Things (IoT) systems potentially consists of various layers, components and their relationships. The goal of this survey is to explain the differences and similarities between IoT architecture proposals. The major emphasis should be placed on system and business assets in order to characterise various levels of security need.
Proposed by R. Matulevicius
email: raimundas.matulevicius@ut.ee
A Survey of Modelling Techniques for the Security Risk Modelling
There exist a number of modelling techniques for security risk modelling (e.g., misuse cases, Secure Tropos, attack trees, etc.). The goal of this report is to explain which techniques should be selected for modelling particular targeted security risks, e.g., distributed denial of service attacks, semantic social engineering attacks, and botnet-based attacks.
Proposed by R. Matulevicius
email: raimundas.matulevicius@ut.ee
How to make mutants resemble actual defects introduced into the code by programmers?
Mutation testing provides a way to measure the effectiveness of a test suite with regard to finding defects in a software program. A mutant is a slightly changed version of the program under test. If an existing test suite can detect the mutated line of code in the program, then the mutant has been killed – if not, the mutant is alive. The ratio between killed and alive mutants is a measure of the strength of the test suite: the more mutants the test suite kills, the better. One limitation of mutation testing is that the automatically generated mutants might not correspond to actual defects introduced by programmers. To address this shortcoming, improved mutant generation methods have been proposed that help in creating potential defects that are more closely coupled with changes made by actual programmers. What are these methods? To answer this question, below follow a few references that help to get started with the literature search; a minimal sketch of conventional mutant generation follows the references.
[1] [Wild-caught mutants] David Bingham Brown, Michael Vaughn, Ben Liblit, and Thomas Reps. 2017. The care and feeding of wild-caught mutants. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, New York, NY, USA, 511-522. DOI: https://doi.org/10.1145/3106237.3106280
[2] [Higher-Order Mutants] Jackson A. Prado Lima and Silvia R. Vergilio. 2017. A Multi-objective optimization approach for selection of second order mutant generation strategies. In Proceedings of the 2nd Brazilian Symposium on Systematic and Automated Software Testing (SAST). ACM, New York, NY, USA, Article 6, 10 pages. DOI: https://doi.org/10.1145/3128473.3128479
[3] [Weighted Random Mutant Selection] Ali Parsai, Alessandro Murgia, and Serge Demeyer. 2016. Evaluating random mutant selection at class-level in projects with non-adequate test suites. In Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering (EASE '16). ACM, New York, NY, USA, Article 11, 10 pages. DOI: https://doi.org/10.1145/2915970.2915992
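For context, the minimal sketch below shows the kind of conventional, rule-based mutant generation that the methods above try to improve on: Python's ast module applies a classical arithmetic-operator mutation to a small, hypothetical function (Python 3.9+ is assumed for ast.unparse).

    # Minimal sketch: conventional mutant generation with a single mutation operator.
    # The target function is hypothetical; Python >= 3.9 is assumed (ast.unparse).
    import ast
    import textwrap

    source = textwrap.dedent("""
        def price_with_tax(price, tax):
            return price + price * tax
    """)

    class SwapAddToSub(ast.NodeTransformer):
        def visit_BinOp(self, node):
            self.generic_visit(node)
            if isinstance(node.op, ast.Add):   # the classical arithmetic-operator-replacement mutation
                node.op = ast.Sub()
            return node

    mutant = SwapAddToSub().visit(ast.parse(source))
    ast.fix_missing_locations(mutant)
    print(ast.unparse(mutant))                 # the mutated program; a good test suite should kill it

Such mutants are syntactically plausible but may not look like the defects real programmers introduce, which is exactly the gap the references above address.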
Proposed by D. Pfahl
email: dietmar.pfahl@ut.ee
How to speed up mutation testing?
Mutation testing provides a way to measure the effectiveness of a test suite with regard to finding defects in a software program. A mutant is a slightly changed version of the program under test. If an existing test suite can detect the mutated line of code in the program, then the mutant has been killed – if not, the mutant is alive. The ratio between killed and alive mutants is a measure of the strength of the test suite: the more mutants the test suite kills, the better. One limitation of mutation testing is that, for large programs and test suites, the actual execution of all tests on all mutants might take a prohibitively long time. Experiments in industry have indicated that in some situations it may take several weeks to execute all tests on all mutants. To address this shortcoming, various strategies have been proposed. What are these strategies? To answer this question, below follow a few references that help to get started with the literature search.
[1] [Using cloud computing] Sten Vercammen, Serge Demeyer, Markus Borg, and Sigrid Eldh. 2018. Speeding up mutation testing via the cloud: lessons learned for further optimisations. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '18). ACM, New York, NY, USA, Article 26, 9 pages. DOI: https://doi.org/10.1145/3239235.3240506
[2] [Systematic mutant reduction] Goran Petrović and Marko Ivanković. 2018. State of mutation testing at google. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP '18). ACM, New York, NY, USA, 163-171. DOI: https://doi.org/10.1145/3183519.3183521
[3] [Mutation testing with focal methods] Sten Vercammen, Mohammad Ghafari, Serge Demeyer, and Markus Borg. 2018. Goal-oriented mutation testing with focal methods. In Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation (A-TEST 2018). ACM, New York, NY, USA, 23-30. DOI: https://doi.org/10.1145/3278186.3278190
[4] [Avoiding useless mutants] Leonardo Fernandes, Márcio Ribeiro, Luiz Carvalho, Rohit Gheyi, Melina Mongiovi, André Santos, Ana Cavalcanti, Fabiano Ferrari, and José Carlos Maldonado. 2017. Avoiding useless mutants. In Proceedings of the 16th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE 2017). ACM, New York, NY, USA, 187-198. DOI: https://doi.org/10.1145/3136040.3136053
[5] [Higher Order Mutation Testing] Jackson A. Prado Lima, Giovani Guizzo, Silvia R. Vergilio, Alan P. C. Silva, Helson L. Jakubovski Filho, and Henrique V. Ehrenfried. 2016. Evaluating Different Strategies for Reduction of Mutation Testing Costs. In Proceedings of the 1st Brazilian Symposium on Systematic and Automated Software Testing (SAST). ACM, New York, NY, USA, Article 4, 10 pages. DOI: https://doi.org/10.1145/2993288.2993292
[6] [Predictive Mutation Testing] Jie Zhang, Ziyi Wang, Lingming Zhang, Dan Hao, Lei Zang, Shiyang Cheng, and Lu Zhang. 2016. Predictive mutation testing. In Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA 2016). ACM, New York, NY, USA, 342-353. DOI: https://doi.org/10.1145/2931037.2931038
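As a concrete illustration of one of the simplest cost-reduction strategies, random mutant sampling, the sketch below estimates the mutation score from a random subset of mutants instead of executing the test suite against every mutant. The helpers generate_mutants and kills are hypothetical placeholders; in practice they would be supplied by a mutation-testing tool.

```python
import random

def generate_mutants(program):
    """Return the full list of mutants for the program (hypothetical placeholder)."""
    raise NotImplementedError

def kills(test_suite, mutant):
    """Return True if the test suite kills the given mutant (hypothetical placeholder)."""
    raise NotImplementedError

def sampled_mutation_score(program, test_suite, sampling_rate=0.1, seed=42):
    """Estimate the mutation score from a random sample of mutants.

    Running the tests against, say, 10% of the mutants costs roughly 10%
    of a full run, in exchange for a less precise score estimate.
    """
    mutants = generate_mutants(program)
    rng = random.Random(seed)
    sample_size = max(1, int(len(mutants) * sampling_rate))
    sample = rng.sample(mutants, sample_size)
    killed = sum(1 for mutant in sample if kills(test_suite, mutant))
    return killed / len(sample)
```

The references above cover more sophisticated strategies, such as distributing test runs over cloud infrastructure, reducing the mutant set systematically, or predicting test outcomes without executing the tests at all.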
Proposed by D. Pfahl
email: dietmar.pfahl@ut.ee
Smart Contract Patterns
Blockchain technology is slowly maturing, and one part of this technology is the concept of smart contracts. Given the immutability of smart contracts, it is important to code them well. In light of this, a number of papers have proposed design patterns for smart contracts. For this topic, you are to review the existing proposed patterns, categorise them, and present them as a repository of patterns. A good starting point would be the following publications.
[1] Xu, Xiwei, et al. "A Pattern Collection for Blockchain-based Applications." Proceedings of the 23rd European Conference on Pattern Languages of Programs. ACM, 2018.
[2] Liu, Yue, et al. "Applying Design Patterns in Smart Contracts." International Conference on Blockchain. Springer, Cham, 2018.
[3] Bartoletti, Massimo, and Livio Pompianu. "An empirical analysis of smart contracts: platforms, applications, and design patterns." International Conference on Financial Cryptography and Data Security. Springer, Cham, 2017.
[4] Worley, Carl R., and Anthony Skjellum. "Opportunities, Challenges, and Future Extensions for Smart-Contract Design Patterns." International Conference on Business Information Systems. Springer, Cham, 2018.
[5] Wohrer, Maximilian, and Uwe Zdun. "Smart contracts: Security patterns in the ethereum ecosystem and solidity." 2018 International Workshop on Blockchain Oriented Software Engineering (IWBOSE). IEEE, 2018.
Proposed by F. Milani
email: fredrik.milani@ut.ee
Blockchain Application Architectures
Blockchain applications are growing and maturing. The body of research now contains enough applications and papers to analyse the architectures and architectural options of blockchain-based applications. For this topic, you are to review papers dealing with this topic and synthesise the results. A good starting point would be the following papers.
[1] Zheng, Zibin, et al. "An overview of blockchain technology: Architecture, consensus, and future trends." 2017 IEEE international congress on big data (BigData congress). IEEE, 2017.
[2] Xu, Xiwei, et al. "A taxonomy of blockchain-based systems for architecture design." 2017 IEEE International Conference on Software Architecture (ICSA). IEEE, 2017.
[3] Cachin, Christian. "Architecture of the hyperledger blockchain fabric." Workshop on distributed cryptocurrencies and consensus ledgers. Vol. 310. 2016.
Proposed by F. Milani
email: fredrik.milani@ut.ee
Hyperledger Fabric Architecture
Hyperledger Fabric has become one of the most popular blockchain platforms for commercial use cases. However, Fabric is rich in functionality and offers a wide range of design choices. For this topic, you are to review papers that present Fabric and produce a report that summarises and categorises all the design choices (including their rationale) available in Fabric. A good starting point would be the following papers.
[1] Cachin, Christian. "Architecture of the hyperledger blockchain fabric." Workshop on distributed cryptocurrencies and consensus ledgers. Vol. 310. 2016.
[2] Androulaki, Elli, et al. "Hyperledger fabric: a distributed operating system for permissioned blockchains." Proceedings of the Thirteenth EuroSys Conference. ACM, 2018.
[3] Brandenburger, Marcus, et al. "Blockchain and trusted computing: Problems, pitfalls, and a solution for hyperledger fabric." arXiv preprint arXiv:1805.08541 (2018).
[4] Sukhwani, Harish, et al. "Performance Modeling of Hyperledger Fabric (Permissioned Blockchain Network)." 2018 IEEE 17th International Symposium on Network Computing and Applications (NCA). IEEE, 2018.
Proposed by F. Milani
email: fredrik.milani@ut.ee
Government Cloud
G-cloud, government cloud, or e-cloud are names for cloud-based solutions for governmental e-services. For this topic, the task is to review the literature in order to give an overview of what G-cloud is, including aspects such as naming differences, solutions, architectures, use cases, and acceptance. A good starting point might be the following papers.
[1] Almarabeh, Tamara, Yousef Kh Majdalawi, and Hiba Mohammad. "Cloud computing of e-government." (2016).
[2] Aubakirov, Margulan, and Evgeny Nikulchev. "Development of system architecture for e-government cloud platforms." arXiv preprint arXiv:1603.08297 (2016).
[3] Mahmood, Zaigham. "Cloud computing technologies for open connected government." Cloud Computing Technologies for Connected Government. IGI Global, 2016. 1-14.
[4] Kotka, Taavi, and Innar Liiv. "Concept of Estonian Government cloud and data embassies." International Conference on Electronic Government and the Information Systems Perspective. Springer, Cham, 2015.
Proposed by F. Milani
email: fredrik.milani@ut.ee
Estonian Data Embassy
Estonia is the first country in the world to have implemented a data embassy, located in Luxembourg. For this topic, you are to review and summarise, from a software solution perspective, how data embassies work, including technical solutions, architecture, advantages and disadvantages, and risks. A good starting point might be the following papers.
[1] Millard, Christopher. "Forced Localization of Cloud Services: Is Privacy the Real Driver?." IEEE Cloud Computing 2.2 (2015): 10-14.
[2] Robinson, Nick, and Keith Martin. "Distributed denial of government: The Estonian data embassy initiative." Network Security 2017.9 (2017): 13-16.
[3] Thurnay, Lőrinc, et al. "The Potential of the Estonian e-Governance Infrastructure in Supporting Displaced Estonian Residents." International Conference on Electronic Government and the Information Systems Perspective. Springer, Cham, 2017.
Proposed by F. Milani
email: fredrik.milani@ut.ee
Water-Scrum-Fall
Agile methods are very popular, but there is also the idea of combining agile with waterfall, and some propose that such a hybrid is more effective. For this topic, you are to review relevant papers and summarise what the research says on the topic. A good starting point would be the following references.
[1] West, Dave, et al. "Water-scrum-fall is the reality of agile for most organizations today." Forrester Research 26 (2011).
[2] Theocharis, Georgios, et al. "Is water-scrum-fall reality? on the use of agile and traditional development practices." International Conference on Product-Focused Software Process Improvement. Springer, Cham, 2015.
[3] Schlauderer, Sebastian, Sven Overhage, and Björn Fehrenbach. "Widely used but also highly valued? Acceptance factors and their perceptions in water-scrum-fall projects." (2015).
[4] Kuhrmann, Marco, et al. "Hybrid software and system development in practice: waterfall, scrum, and beyond." Proceedings of the 2017 International Conference on Software and System Process. ACM, 2017.
[5] Gregorio, Donna D. "How the Business Analyst supports and encourages collaboration on agile projects." 2012 IEEE International Systems Conference SysCon 2012. IEEE, 2012.
Proposed by F. Milani
email: fredrik.milani@ut.ee