Already chosen topics:
Karl Potisepp - MapReduce efficiency
John Okugbeni - Comparison of CouchDB and MongoDB (or Riak)
Ilgün Ilgün - Automated Data Migration Tool from SQL to MongoDB
Mobile Cloud -- Huber Flores (Responsible person)
This track examines the following issues in Mobile Cloud Computing (MCC). Implementations will target the Galaxy S2 i9100.
- Mobile Code Offloading with Annotations: The aim of this project is to create a tool (e.g. an Eclipse plugin) that can recognize annotations within the code. Annotations are recognized at runtime so that the annotated code can later be offloaded from the mobile device (e.g. Android) to a server running in the cloud (Android x86). The Java reflection API may be used.
- Mobile Cloud Service Composition: The aim of this project is to create/improve a composition tool for cloud services using Mobile Cloud Middleware. Basically, the tool aims to execute a multi-cloud operation (workflow) by performing one-time offloading from the smartphone.
- VM Synchronization with a Mobile (Android - Android_x86): The project consists of building CyanogenMod 9 from source (a guide exists for CM7) and then synchronizing the Dalvik machine with the x86 Dalvik, so that code from the mobile device can be executed within the x86 server's Dalvik (a general implementation idea is available).
- Fuzzy Logic Cloud Engine for Arduino: The idea of this project is to implement a fuzzy logic algorithm in the cloud which can be used for deciding when to transport sensor data from Arduino to Amazon S3. The aim is to save energy on the micro-controller (a source-code base is available).
- Simulation of Mobile Cloud Traffic: The idea of this project is to configure Android_x86 instances in the cloud (parallelization) and later simulate code offloading using benchmarking tools such as Tsung and HAProxy (guides are available for both).
- A Comparative Analysis between Static Analysis and Heuristic Algorithms for Cloud Offload: The project consists of implementing both strategies in order to analyse Android applications. Basically, the aim of the analysis is to determine which portions of code (e.g. methods, classes) can be offloaded to the cloud.
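The annotation-based offloading topic above relies on annotations having runtime retention so that the Java reflection API can see them while the app is running. A minimal sketch of that mechanism; the `@Offloadable` annotation and the workload class are hypothetical names for illustration, not part of any existing middleware:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class OffloadScanner {

    // Hypothetical marker annotation; RUNTIME retention is what makes it
    // visible to the reflection API after the code is loaded.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface Offloadable {}

    // Example workload class: only one method is marked for offloading.
    public static class ImageTasks {
        @Offloadable
        public byte[] applyFilter(byte[] pixels) { return pixels; }

        public void updateUi() {}
    }

    // Collect the names of all methods carrying @Offloadable, i.e. the
    // candidates the middleware could ship to the cloud-side VM.
    public static List<String> findOffloadable(Class<?> cls) {
        List<String> names = new ArrayList<>();
        for (Method m : cls.getDeclaredMethods()) {
            if (m.isAnnotationPresent(Offloadable.class)) {
                names.add(m.getName());
            }
        }
        return names;
    }
}
```

A real plugin would feed this candidate list to an offloading decision engine rather than just printing it, but the retention-plus-reflection pattern is the core idea.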
Mobile Web Services -- Satish Srirama (Responsible person)
- Merging Mobile Cloud Middleware with Mobile Web Services Mediation Framework (MWSMF): MWSMF is an Enterprise Service Bus based middleware between the Mobile Host and the mobile client. The task is to join MCM as a service engine component to the MWSMF and demonstrate it in a scenario.
SciCloud -- Pelle Jakovits (Responsible person)
- CloudSwitch - It provides migration and deployment of virtual machines to the cloud and allows integrating the migrated servers with existing systems running locally or in other clouds. Additionally, CloudSwitch provides secure channels and firewall virtual machines that can be deployed to secure the communication and data between a local cluster and virtual machines running in different public cloud platforms. Do a literature survey of papers written about CloudSwitch and its features, and compare CloudSwitch to other similar tools and related work that try to support a similar cloud migration and deployment process.
- High Performance Computing support in different public cloud providers and their comparison. Amazon EC2 provides High Performance Computing instance types for scientists, research groups and companies who require high-quality dedicated servers for resource-demanding computing tasks and experiments. Which other public cloud providers have similar services, what are the options for scientists to choose from, and how do these options compare to each other (pricing, availability, size of available resources, performance, etc.)? Start from http://aws.amazon.com/hpc-applications/ and do a literature review of scientific papers on the topic.
- HaLoop – Alternative MapReduce framework for large-scale distributed cloud computing. Study the framework, its documentation and scientific articles. How does it differ from Hadoop, and what are its disadvantages and advantages? The goal of the student is to give a very good overview of the framework and describe in detail how iterative algorithms can be adapted to it. The student should implement 1-2 iterative algorithms in the framework.
- Spark – Alternative MapReduce framework for large-scale distributed cloud computing. Study the framework, its documentation and scientific articles. How does it differ from Hadoop, and what are its disadvantages and advantages?
- JPREGEL - A Java BSP-based large-scale distributed graph computing framework. It is based on Google's proprietary Pregel framework.
- Apache Giraph – Apache Giraph is another graph processing framework based on Pregel, which aims to leverage existing Hadoop infrastructure to run Hadoop MapReduce-like jobs that use the Bulk Synchronous Parallel model instead. This means that, in contrast to Hadoop MapReduce, concurrent tasks are allowed to communicate with each other, and the computation is divided into a number of supersteps. Giraph also aims to provide fault tolerance. The goal of the student is to give a very good overview of the framework and describe in detail how graph algorithms can be adapted to it. The student should implement 1-2 graph processing algorithms in Giraph.
- Stanford GPS is yet another Pregel-inspired implementation of BSP, developed at Stanford University. It is open source and sports several features not present in either Google Pregel or Apache Giraph, namely support for algorithms that include global as well as vertex-centric computations (whereas the Pregel API focuses only on vertex-centric computations), the ability to repartition the graph during processing, and partitioning of the adjacency lists of high-degree vertices to reduce the amount of communication. The goal of the student is to give a very good overview of the framework and describe in detail how graph algorithms can be adapted to it.
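To make the vertex-centric BSP model shared by Pregel, Giraph and GPS concrete, here is a toy single-machine sketch of maximum-value propagation: in each superstep every vertex reads the values "messaged" by its neighbours, keeps the maximum, and the computation halts when nothing changes. This is only an illustration of the programming model; the real frameworks distribute vertices across workers and add message passing and fault tolerance.

```java
// Toy, in-memory illustration of a Pregel-style superstep loop.
public class MaxValueBsp {

    // adj[i] lists the neighbours of vertex i; values[i] is its start value.
    public static int[] run(int[][] adj, int[] values) {
        int[] current = values.clone();
        boolean changed = true;
        while (changed) {                      // one iteration = one superstep
            changed = false;
            int[] next = current.clone();
            for (int v = 0; v < adj.length; v++) {
                for (int neighbour : adj[v]) { // "message" from v to neighbour
                    if (current[v] > next[neighbour]) {
                        next[neighbour] = current[v];
                        changed = true;
                    }
                }
            }
            current = next;                    // barrier between supersteps
        }
        return current;
    }
}
```

On a chain graph 0-1-2 with values {3, 1, 2}, the maximum reaches every vertex after two supersteps.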
Hadoop cloud computing projects -- Pelle Jakovits (Responsible person)
- Hadoop on Amazon EC2 spot instances. Literature survey of existing studies that utilize a portion (30% - 50%) of (Amazon) resources from temporary spot instances. What are the disadvantages and advantages of using spot instances, is it possible to reduce costs by using them, and how high a percentage of spot instances can be utilized without overly compromising fault tolerance?
- Hive – Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for summarization, ad-hoc querying and analysis of large datasets. What exactly are the differences between Hive and Pig, another data query language based on Hadoop?
- Mountable HDFS – Replicated, mountable Hadoop distributed file system. Study its usability for replicated network storage and its performance latencies, and compare it to other similar solutions. What are the main advantages and disadvantages of the HDFS-based solution?
- Graph computing on the Cloud - Overview of the different distributed cloud frameworks for graph processing. Additionally, the student can deploy one of the graph computing frameworks on the SciCloud and demonstrate it.
- Cloudera Impala is based on Google Dremel and aims to provide real-time queries on top of Apache Hadoop. Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. The student should study how Impala can be used to speed up MapReduce or Hive computations by replacing some of the computing tasks with Impala queries instead.
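For the spot-instance topic, the basic cost trade-off a student would quantify can be sketched as a blended-price calculation: part of the cluster runs at the (cheaper, interruptible) spot price and the rest at the on-demand price. The prices below are illustrative assumptions, not current EC2 pricing.

```java
// Back-of-the-envelope cost model for a mixed on-demand/spot Hadoop cluster.
public class SpotCost {

    // Blended hourly cost when `spotFraction` of `nodes` run as spot
    // instances and the remainder as on-demand instances.
    public static double hourlyCost(int nodes, double onDemandPrice,
                                    double spotPrice, double spotFraction) {
        double spotNodes = nodes * spotFraction;
        double onDemandNodes = nodes - spotNodes;
        return spotNodes * spotPrice + onDemandNodes * onDemandPrice;
    }
}
```

For example, a 10-node cluster at a hypothetical $0.10/h on-demand and $0.03/h spot price, with half the nodes on spot, costs $0.65/h instead of $1.00/h; the survey question is whether fault tolerance survives that 50% spot share.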
Large scale image processing -- Pelle Jakovits, Karl Potisepp (Responsible persons)
- Image Processing Using Parallel Technologies - provide an overview of the current state of the art in image processing with different parallel computing technologies (cloud, grid, etc.). What are the major issues? What sort of images/graphics are better suited for parallel processing? What sort of algorithms/processes are more easily parallelised?
- Analysis of Web-Scale Image Data - provide an overview of the current state of the art in processing web-scale image/video datasets (terabyte- and petabyte-sized). What are the major issues? What sort of problems can currently be solved with regard to this sort of data? Provide at least one reasonably detailed overview of how someone (e.g. Google) has handled this.
- Analysis of Large Image Data - provide an overview of the current state of the art in processing large images (measuring several gigapixels and up). What are the major issues? What sort of problems can currently be solved with regard to this sort of data? Provide at least one reasonably detailed overview of how someone (e.g. NASA) has handled this.
Network analysis on the Cloud -- Briti Deb (Responsible person)
- Information retrieval using MapReduce: Mining a text corpus to find interesting patterns.
- Data processing using parallel R: Studies in using parallel R to analyze large datasets.
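The information-retrieval topic above builds on the map/reduce pattern, whose smallest instance is word counting: the "map" step emits a (word, 1) pair per token, and the "reduce" step sums the counts per word. A minimal in-memory sketch; a real Hadoop job expresses the same two steps as Mapper and Reducer classes running over a distributed corpus:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {

    public static Map<String, Integer> count(String corpus) {
        Map<String, Integer> counts = new HashMap<>();
        // "map": tokenize and emit one occurrence per word
        for (String token : corpus.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;        // skip splitting artifacts
            // "reduce": sum the emitted 1s per key
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

From counts like these one can derive the term frequencies that classic IR measures such as tf-idf are built on.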
Data storage, text mining and web technologies -- Jürmo Mehine (Responsible Person)
- Writing an automated tool for data migration from a relational database to a non-relational one. In recent years the NoSQL movement has been picking up speed and many companies are switching to non-relational databases. We want a tool that can take an existing SQL database and automatically transfer all of its data to a NoSQL system. An incomplete list of NoSQL systems to try: Cassandra, Redis, CouchDB, MongoDB, HBase.
- Comparison of MapReduce implementations in different data stores. Several non-relational database engines use the MapReduce programming model for data processing, analysis and querying. While the high-level concept of MapReduce is the same, we are interested in understanding how data stores differ in their implementation and application of MapReduce. Some databases to consider for comparison: MongoDB, CouchDB, HBase, Riak.
- Introduction to Jaql. Jaql is a query language for JSON data. We are interested in finding out more about Jaql, because many modern NoSQL data stores store documents in JSON format and Jaql uses MapReduce for parallelism. Jaql could be a potential unifying query language for different document stores.
- Solving a data mining problem using Pig or Hive, measuring performance. Pig and Hive are tools built on top of the Apache Hadoop MapReduce framework. Pig is meant mainly for data processing and preparation, while Hive is a tool for data warehousing, querying and presentation. We are interested in seeing some text mining algorithms implemented using one or both of these tools. We want to determine the feasibility of these tools for different data analysis tasks and also to measure the performance on large data in a distributed deployment. We are also interested in measuring productivity gains when using these tools compared to using plain Hadoop MapReduce.
- Amazon public data set analysis. Find an interesting data set from the AWS public datasets http://aws.amazon.com/datasets and apply text mining techniques to find interesting information (e.g. sentiment analysis). One possible example is the Common Crawl corpus: Common Crawl maintains an open repository of web crawl data. You can use distributed data processing tools such as MapReduce, Pig, Hive or Mahout.
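The core transformation inside the SQL-to-NoSQL migration topic above is turning a relational row (column names plus values) into a schemaless document, typically serialised as JSON for stores like CouchDB or MongoDB. A hedged sketch of just that mapping; a real tool would read rows through JDBC and write through the target store's driver, and a production version would use a proper JSON library with full escaping:

```java
import java.util.List;

public class RowToDocument {

    // Turn one relational row into a JSON document string. Only numbers and
    // plain strings are handled here; this is a simplification for illustration.
    public static String toJson(List<String> columns, List<Object> row) {
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < columns.size(); i++) {
            if (i > 0) json.append(", ");
            json.append('"').append(columns.get(i)).append("\": ");
            Object value = row.get(i);
            if (value instanceof Number) {
                json.append(value);            // numbers are unquoted in JSON
            } else {
                json.append('"').append(value).append('"');
            }
        }
        return json.append('}').toString();
    }
}
```

The interesting design questions for the project start where this sketch stops: foreign keys (embed the referenced row or keep a reference?), type mapping, and migrating data in batches.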
Migrating enterprise applications to the cloud -- Martti Vasar (Responsible person)
- Refactoring enterprise applications for the cloud
Generally, communication latencies are the major problem in migrating enterprise applications to the cloud. The topic should study how to remodel the deployment of n-tier enterprise applications so that the communication loads are minimized. For example, the topic can zero in on Enterprise Service Bus based systems.
- Framework for monitoring performance of cloud based applications
Study frameworks that let you know the health of an enterprise setup deployed in the cloud. The topic can also go down to the level of monitoring the health of the cloud itself.
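The building block such monitoring frameworks share can be sketched as a timed health probe: run a check (in practice an HTTP request against the deployed application), measure how long it takes, and flag the service unhealthy when the check fails or exceeds a latency budget. The names and the threshold here are illustrative, not taken from any particular framework.

```java
import java.util.concurrent.Callable;

public class HealthProbe {

    // Returns true only if the check succeeds within the latency budget.
    public static boolean healthy(Callable<Boolean> check, long budgetMillis) {
        long start = System.nanoTime();
        try {
            boolean ok = check.call();
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            return ok && elapsedMillis <= budgetMillis; // too slow = unhealthy
        } catch (Exception e) {
            return false;                               // failure = unhealthy
        }
    }
}
```

Real frameworks layer scheduling, aggregation across instances, and alerting on top of probes like this one.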