General information
The course covers modern large-scale data management and processing solutions, with an emphasis on architectural choices and practical implementation trade-offs. It follows a flipped classroom approach: theory is provided via short pre-recorded materials and readings, while on-site sessions are mandatory and focus on group presentations, short written exams (every second seminar), and discussion and review.
Schedule: Thursdays 14:00–18:00 (8 seminars, see table below).
Primary tools and technologies include Apache Spark (batch and Structured Streaming), Kafka, Apache Iceberg, and Airflow. Selected topics also touch on MongoDB and Neo4j, as well as operational aspects such as reliability, governance, and cost/performance optimisation.
Recommended prior knowledge: SQL fundamentals, basic Python, Git basics, and the ability to work in a Linux terminal. Prior exposure to data warehousing or data engineering concepts is helpful.
Objectives
The objective of this course is to develop students’ ability to design, build, and evaluate modern large-scale data management and processing solutions. The course covers data processing models (batch, micro-batch and streaming), lakehouse architectures and ACID table formats (e.g. Apache Iceberg), change data capture and messaging systems (Kafka), workflow orchestration (Airflow), management of semi-structured and unstructured data (e.g. MongoDB, Neo4j), as well as principles of reliability and cost/performance optimisation.
Learning outcomes
On successful completion of this course, students should be able to:
- Analyse and justify the choice of data processing and storage architectures (e.g. batch vs streaming, lakehouse and related patterns) for different data-intensive use cases.
- Design and integrate storage and processing solutions for semi-structured and unstructured data (e.g. document and graph databases) into a coherent data platform.
- Evaluate and improve the reliability, data quality, governance and cost/performance characteristics of big data systems by applying appropriate metrics, data contracts and analysis techniques.
- Critically reflect on data-engineering design decisions, communicate the related trade-offs to technical and non-technical stakeholders, and document solutions according to good engineering practice.
Brief description of content
- Modern large-scale data management and processing architectures: batch, micro-batch and streaming, and the lakehouse concept (see the sketch after this list).
- Practical tools for storing and processing data: Spark, Kafka, Apache Iceberg, and Airflow.
- Change data capture, data streams, and workflow orchestration: moving data from raw sources to analytics- and ML-ready datasets.
- Management of semi-structured and unstructured data (e.g. MongoDB, Neo4j) and simple AI-oriented data pipelines.
- Principles of data quality, reliability, and cost/performance optimisation in big data platforms.
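To give a concrete feel for the batch vs. streaming contrast mentioned above, the sketch below shows a minimal PySpark job that runs a similar aggregation first as a bounded batch query and then as a Structured Streaming query over a Kafka topic. It is an illustrative sketch only: the paths, column name, topic name, and broker address are assumptions, not part of the course infrastructure, and the streaming part additionally assumes the spark-sql-kafka connector is available.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("processing-models-sketch").getOrCreate()

# Batch: read a bounded dataset once, aggregate, write the result out.
# Paths and the event_date column are placeholders, not course-provided data.
events = spark.read.parquet("/data/raw/events")
daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("/data/curated/daily_counts")

# Micro-batch streaming: a similar aggregation over an unbounded Kafka topic,
# processed incrementally by Structured Streaming.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "events")                         # assumed topic name
    .load()
)
running_counts = stream.groupBy("topic").count()  # "topic" is a built-in Kafka source column
query = (
    running_counts.writeStream
    .outputMode("complete")   # emit the full updated aggregate each micro-batch
    .format("console")
    .start()
)
query.awaitTermination()
```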
The course follows a flipped classroom approach: theoretical material is provided via pre-recorded video lectures and readings, while on-site classroom sessions (8 × 4 hours) are mandatory and focus on group presentations, short tests/exams, review, and discussion.
Learning is both individual and group-based, following blended learning principles (recorded theory combined with in-person sessions).
How the seminars run (4 hours)
Each seminar follows a consistent rhythm:
- Group presentations and Q&A (about 60 minutes, typically 3 groups × 20 minutes including Q&A).
- Short written exam every second seminar (seminars 2, 4, 6, 8), 30 minutes, pen-and-paper. If there is no exam, the time is used for guided discussion, solution review, and project troubleshooting.
- Discussion and review of the week’s topics, common pitfalls, and design trade-offs.
- Wrap-up and course feedback prompt at the end of the session.
There are no guided step-by-step labs during class time. Practical work is completed independently based on provided instructions (videos + GitHub READMEs). Classroom time is used for presentations, review, and discussion.
Assessment methods and criteria
The final grade is a weighted sum of the components below (a worked example follows the list). All components must be completed.
- Mini-tests (20%)
  - Short Moodle quizzes on the pre-recorded lecture material and required readings.
  - Typically completed before each seminar.
  - Points are counted only if the student also attends the corresponding seminar.
- Individual exams (40%)
  - Pen-and-paper exams during seminars 2, 4, 6, and 8.
  - Duration: 30 minutes.
  - Each exam covers the previous two seminar topics (one course module).
- Topic presentations in groups (10% total: 7% presentation + 3% peer review)
  - Students form groups (typically 4 students).
  - Each group presents one topic during one seminar.
  - Each group also performs one peer review using a provided rubric (structure, correctness, clarity, trade-offs, and discussion).
- Project tasks in groups (30%)
  - Students form groups (typically 4 students) and submit technical work.
  - Work is submitted via GitHub repositories according to the provided templates and deadlines.
  - Project work is evaluated on correctness, completeness, documentation quality, and demonstrated understanding of trade-offs.
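As a worked example of the weighting (assuming, purely for illustration, that each component is scored on a 0–100 scale): a student scoring 80 on mini-tests, 70 on the individual exams, 90 on the group presentation component, and 85 on the project tasks would receive 0.20 × 80 + 0.40 × 70 + 0.10 × 90 + 0.30 × 85 = 16 + 28 + 9 + 25.5 = 78.5 points out of 100.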
Topics and schedule
| Date | Week | Type | Topic | Subtitle and description |
|---|---|---|---|---|
| 19.02.2026 | 25 | seminar 1 | Big Data Foundations | Course overview. Data platform building blocks. Environment setup (Docker Compose and course infrastructure). Processing models overview (batch, micro-batch, streaming). |
| 05.03.2026 | 27 | seminar 2 | Batch Processing, Spark | Distributed batch processing concepts. Spark fundamentals, DataFrames, joins, partitioning. Includes individual exam (30 min). |
| 19.03.2026 | 29 | seminar 3 | Streaming Fundamentals | Streaming concepts and trade-offs. Event time vs processing time. Intro to Structured Streaming concepts. Kafka basics as a streaming backbone. |
| 02.04.2026 | 31 | seminar 4 | Lakehouse Design | Lakehouse architecture and medallion patterns. Iceberg fundamentals (tables, snapshots, schema evolution, partitioning). Includes individual exam (30 min). |
| 16.04.2026 | 33 | seminar 5 | Change Data Capture | CDC concepts and use cases. Kafka-based ingestion patterns. Applying CDC to lakehouse tables and incremental processing. |
| 30.04.2026 | 35 | seminar 6 | Orchestration | Workflow orchestration with Airflow. DAG design, scheduling, retries, SLAs. Connecting ingestion, transformation, and validation steps. Includes individual exam (30 min). |
| 14.05.2026 | 37 | seminar 7 | Unstructured Data & AI | Handling semi-structured and unstructured data. Document and graph storage (MongoDB, Neo4j) and integration patterns. Simple AI-oriented data pipeline patterns. |
| 28.05.2026 | 39 | seminar 8 | Reliability, Optimisations | Reliability and governance (metrics, data contracts, data quality checks). Cost and performance optimisation (partitioning, clustering, compute sizing). Includes individual exam (30 min). |
Reading list
- Martin Kleppmann. Designing Data-Intensive Applications.
- Tyler Akidau et al. Streaming Systems.
- Additional articles, documentation links, and short recorded materials are provided in Moodle for each seminar.