Objectives
The objective of this course is to introduce students to the principles and methods of advanced data management and processing. The course covers techniques for storing and processing different types of data (structured, semi-structured and unstructured). It also covers the state of the art in different types of big data processing systems (e.g., stream processing, graph data processing, scalable machine learning and deep learning systems, and Federated Learning).
Learning outcomes
On successful completion of this course, students should be able to:
1. Demonstrate knowledge of the emerging scalable data storage and processing requirements in different application domains.
2. Understand the capabilities of advanced data management solutions and develop the ability to choose appropriate systems for different problems.
3. Apply state-of-the-art advanced data processing systems to build scalable solutions for data-intensive challenges in different application domains.
4. Apply qualitative and quantitative techniques to analyse and compare the performance of advanced data processing systems in order to identify the strengths and weaknesses of the various systems.
5. Demonstrate the ability to build complex data processing pipelines that integrate different systems for dealing with different data types.
Brief description of content
1. Principles of Big Data Phenomena: What is Big Data? What are the main characteristics of Big Data? What are the main sources and application domains for Big Data?
2. Batch Processing Systems for Big Data: Hadoop, Spark, Flink. Distributed Databases and Parallel Databases.
3. Big Stream Processing Systems: Challenges, requirements and systems for processing massive amounts of streaming data in real-time.
4. Big Graph Processing: What are the main challenges of distributed graph processing? What are the adequate programming models and data storage techniques for distributed graph processing?
5. Big Data Analytics: Techniques for designing efficient and accurate ML/DL models in a distributed environment that can deal with the increasing amounts of available digital data.
6. Federated Learning: Understanding and designing privacy-preserving distributed ML systems to meet the increased requirements of large-scale data analytics.
7. Case studies and projects in big data processing.
Info
Lectures and practice sessions are held in person in classrooms and are recorded.
- Lectures on Thursdays 16:15 – 14:00
- Practice sessions on Fridays 14:15 – 16:00
Syllabus
- Introduction to Big Data
- Deployment Models
- Taming Data Volume with Apache Spark
- Taming Data Velocity with Spark Streaming
- Taming Data Variety with GraphFrames
- Gaining Value with Big Data Analytics and Federated Learning
The detailed syllabus can be found here: https://tartuulikool-my.sharepoint.com/:b:/g/personal/ahmed79_ut_ee/EXLQOC3CfzxPmfxHWwX6cocBt3EMS-n9kPEVY6HqI0wWPg?e=knm9ll
Grading
- 60% on projects: there are two types of projects, and you choose one
  - Applied projects: four mini projects (15% each: 10% for the deliverables, 5% for the presentation)
  - Research projects: limited in number, done as individual work, and extendable to a thesis topic (20% for the initial prototype, 20% for the experiments, and 20% for the project report)
- 40% on three MCQ tests (only the best two are counted, which acts as a bonus)
NOTE: The lecturers reserve the right to call for an individual interview that can impact the final grade.
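As a rough illustration of the grading arithmetic for the applied-project track, the weights above can be combined as in the sketch below. It is not an official grade calculator. The function name `final_grade`, the 0 to 1 score scale, and the equal split of the 40% between the two best MCQ tests (20 points each) are assumptions made for this example, and the individual interview is not modelled.

```python
def final_grade(mini_projects, mcqs):
    """Combine applied-project and MCQ scores into a 0-100 course grade.

    mini_projects: four (deliverables, presentation) pairs, each score in 0..1.
    mcqs: three MCQ test scores, each in 0..1.
    """
    if len(mini_projects) != 4 or len(mcqs) != 3:
        raise ValueError("expected four mini projects and three MCQ scores")

    # Each mini project is worth 15 points: 10 for deliverables, 5 for presentation.
    project_points = sum(10 * d + 5 * p for d, p in mini_projects)

    # Assumption: the two best MCQ tests share the 40 points equally (20 each).
    best_two = sorted(mcqs, reverse=True)[:2]
    mcq_points = sum(20 * score for score in best_two)

    return project_points + mcq_points


# Hypothetical student: same scores on every mini project, three MCQ results.
example = final_grade([(0.8, 1.0)] * 4, [0.5, 0.9, 0.7])
print(f"Final grade: {example:.1f} / 100")
```

The lowest MCQ score is simply dropped, so a missed or weak test does not reduce the total under this reading of the rule.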
Textbooks:
- Big Data: Principles and Best Practices of Scalable Real-Time Data Systems by Nathan Marz and James Warren, 2015.
- Big Data for Beginners: Understanding SMART Big Data, Data Mining & Data Analytics for Improved Business Performance, Life Decisions & More! by Vince Reynolds, 2016.
- Mining of Massive Datasets (1st Edition) by A. Rajaraman, J. Leskovec, and J. D. Ullman, 2011.
Reference Books:
- Hadoop for Dummies Applications (1st Edition) by Dirk deRoos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, and Rafael Coss, 2014.
- Big Data and Analytics by Seema Acharya and Subhashini Chellappan, 2015.
Course Leader
Feras M. Awaysheh
Edge Intelligence and Data Analytics Research Group https://eida.cs.ut.ee/