
Big Data Management (LTAT.02.003), 2025/26 spring


General information

The course covers modern large-scale data management and processing solutions, with an emphasis on architectural choices and practical implementation trade-offs. The course follows a flipped classroom approach. Theory is provided via short pre-recorded materials and readings. On-site sessions are mandatory and focus on group presentations, short written exams (every second seminar), and discussion and review.

Schedule: Thursdays 14:00–18:00 (8 seminars, see table below).

Primary tools and technologies include Apache Spark (batch and Structured Streaming), Kafka, Apache Iceberg, and Airflow. Selected topics also touch on MongoDB and Neo4j, as well as operational aspects such as reliability, governance, and cost/performance optimisation.

Recommended prior knowledge: SQL fundamentals, basic Python, Git basics, and ability to work in a Linux terminal. Prior exposure to data warehousing or data engineering concepts is helpful.

Objectives

The objective of this course is to develop students’ ability to design, build, and evaluate modern large-scale data management and processing solutions. The course covers data processing models (batch, micro-batch, and streaming), lakehouse architectures and ACID table formats (e.g. Apache Iceberg), change data capture and messaging systems (Kafka), workflow orchestration (Airflow), management of semi-structured and unstructured data (e.g. MongoDB, Neo4j), as well as principles of reliability and cost/performance optimisation.

Learning outcomes

On successful completion of this course, students should be able to:

  • Analyse and justify the choice of data processing and storage architectures (e.g. batch vs streaming, lakehouse and related patterns) for different data-intensive use cases.
  • Design and integrate storage and processing solutions for semi-structured and unstructured data (e.g. document and graph databases) into a coherent data platform.
  • Evaluate and improve the reliability, data quality, governance and cost/performance characteristics of big data systems by applying appropriate metrics, data contracts and analysis techniques.
  • Critically reflect on data-engineering design decisions, communicate the related trade-offs to technical and non-technical stakeholders, and document solutions according to good engineering practice.

Brief description of content

Modern large-scale data management and processing architectures: batch, micro-batch and streaming, the lakehouse concept.

Practical tools for storing and processing data: Spark, Kafka, Apache Iceberg, and Airflow.
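As a rough illustration of the batch-processing side, the sketch below shows a minimal PySpark job that reads raw event data, aggregates it per day, and writes a curated dataset. The input path, column names, and output location are hypothetical placeholders rather than part of the course setup; an Iceberg table could serve as the output instead of plain Parquet, given a configured catalog.

```python
# Minimal PySpark batch job sketch. Paths and column names are
# illustrative assumptions, not part of the course materials.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-example").getOrCreate()

# Read raw events (hypothetical JSON files with event_time and event_type).
events = spark.read.json("data/raw/events/")

# Aggregate: daily event counts per event type.
daily_counts = (
    events
    .withColumn("event_date", F.to_date("event_time"))
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("n_events"))
)

# Write the curated result (an Iceberg table could be used instead,
# given a configured catalog).
daily_counts.write.mode("overwrite").parquet("data/curated/daily_counts/")

spark.stop()
```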

Change data capture (CDC), data streams, and workflow orchestration: moving data from raw sources to analytics- and ML-ready datasets.
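For the streaming side, here is a minimal Structured Streaming sketch that consumes a Kafka topic and lands the raw messages as files, a common first step before CDC-style merging into lakehouse tables. The broker address, topic name, and paths are assumptions for illustration, and running it requires the spark-sql-kafka connector package.

```python
# Minimal Structured Streaming sketch reading from Kafka. Broker address,
# topic, and paths are illustrative assumptions; the spark-sql-kafka
# connector package must be on the classpath.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-stream-example").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to a string here
# (a real pipeline would parse JSON or Avro into typed columns).
messages = raw.select(F.col("value").cast("string").alias("payload"))

# Append the messages to files, tracking progress via a checkpoint directory.
query = (
    messages.writeStream
    .format("parquet")
    .option("path", "data/bronze/orders/")
    .option("checkpointLocation", "chk/orders/")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```

The checkpoint directory is what lets the query resume from its last committed offsets after a restart instead of reprocessing the topic from scratch.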

Management of semi-structured and unstructured data (e.g. MongoDB, Neo4j) and simple AI-oriented data pipelines.
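As a small example of the document-database side, the sketch below stores and queries a semi-structured document with pymongo; the connection string, database, and collection names are illustrative assumptions only.

```python
# Minimal document-store sketch with pymongo. Connection string, database,
# and collection names are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["coursedb"]

# Insert a semi-structured document; fields need not follow a fixed schema.
db.products.insert_one({
    "name": "sensor-42",
    "tags": ["iot", "temperature"],
    "readings": [{"ts": "2026-02-19T14:00:00", "value": 21.5}],
})

# Query by an array field and project only the name.
for doc in db.products.find({"tags": "iot"}, {"name": 1, "_id": 0}):
    print(doc)
```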

Principles of data quality, reliability, and cost/performance optimisation in big data platforms.
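To make the data-quality idea concrete, the sketch below applies two simple checks (non-empty dataset, bounded share of nulls in a key column) in PySpark and fails the pipeline if either is violated; the table path, column, and threshold are illustrative assumptions in the spirit of a simple data contract.

```python
# Minimal data-quality check sketch in PySpark. Path, column, and threshold
# are illustrative assumptions, not course requirements.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-check-example").getOrCreate()

df = spark.read.parquet("data/curated/daily_counts/")

total_rows = df.count()
null_dates = df.filter(F.col("event_date").isNull()).count()

# Fail fast when the dataset is empty or a key column has too many nulls.
if total_rows == 0:
    raise ValueError("Data quality check failed: dataset is empty")
if null_dates / total_rows > 0.01:
    raise ValueError(f"Data quality check failed: {null_dates} null event_date values")

print(f"Data quality check passed: {total_rows} rows")
```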

The course follows a flipped (reverse) classroom approach: theoretical material is provided via pre-recorded video lectures and readings, while on-site classroom sessions (8 × 4 hours) are mandatory and focus on short tests/exams, review, and discussion.

Learning is both individual and group-based, and the course follows blended-learning principles (recorded theory combined with in-person sessions).

How the seminars run (4 hours)

Each seminar follows a consistent rhythm:

  • Group presentations and Q&A (about 60 minutes, typically 3 groups × 20 minutes including Q&A).
  • Short written exam every second seminar (seminars 2, 4, 6, 8), 30 minutes, pen-and-paper. If there is no exam, the time is used for guided discussion, solution review, and project troubleshooting.
  • Discussion and review of the week’s topics, common pitfalls, and design trade-offs.
  • Wrap-up and course feedback prompt at the end of the session.

There are no guided step-by-step labs during class time. Practical work is completed independently based on provided instructions (videos + GitHub READMEs). The classroom time is used for presentation, review, and discussion.

Assessment methods and criteria

The final grade is a weighted sum of the components below. All components must be completed.

  • Mini-tests (20%)
    • Short Moodle quizzes on the pre-recorded lecture material and required readings.
    • Typically completed before each seminar.
    • Points are counted only if the student also attends the corresponding seminar.
  • Individual exams (40%)
    • Pen-and-paper exams during seminars 2, 4, 6, and 8.
    • Duration 30 minutes.
    • Each exam covers the previous two seminar topics (one course module).
  • Topic presentations in groups (10% total: 7% presentation + 3% peer review)
    • Students form groups (typically 4 students).
    • Each group presents one topic during one seminar.
    • Each group also performs one peer review using a provided rubric (structure, correctness, clarity, trade-offs, and discussion).
  • Project tasks in groups (30%)
    • Students form groups (typically 4 students) and submit technical work.
    • Work is submitted via GitHub repositories according to the provided templates and deadlines.
    • The project work is evaluated based on correctness, completeness, documentation quality, and demonstrated understanding of trade-offs.

Topics and schedule

Date | Week | Type | Topic | Subtitle and description
19.02.2026 | 25 | seminar 1 | Big Data Foundations | Course overview. Data platform building blocks. Environment setup (Docker Compose and course infrastructure). Processing models overview (batch, micro-batch, streaming).
05.03.2026 | 27 | seminar 2 | Batch Processing, Spark | Distributed batch processing concepts. Spark fundamentals, DataFrames, joins, partitioning. Includes individual exam (30 min).
19.03.2026 | 29 | seminar 3 | Streaming Fundamentals | Streaming concepts and trade-offs. Event time vs processing time. Intro to Structured Streaming concepts. Kafka basics as a streaming backbone.
02.04.2026 | 31 | seminar 4 | Lakehouse Design | Lakehouse architecture and medallion patterns. Iceberg fundamentals (tables, snapshots, schema evolution, partitioning). Includes individual exam (30 min).
16.04.2026 | 33 | seminar 5 | Change Data Capture | CDC concepts and use cases. Kafka-based ingestion patterns. Applying CDC to lakehouse tables and incremental processing.
30.04.2026 | 35 | seminar 6 | Orchestration | Workflow orchestration with Airflow. DAG design, scheduling, retries, SLAs. Connecting ingestion, transformation, and validation steps. Includes individual exam (30 min). (A minimal DAG sketch follows this table.)
14.05.2026 | 37 | seminar 7 | Unstructured Data & AI | Handling semi-structured and unstructured data. Document and graph storage (MongoDB, Neo4j) and integration patterns. Simple AI-oriented data pipeline patterns.
28.05.2026 | 39 | seminar 8 | Reliability, Optimizations | Reliability and governance (metrics, data contracts, data quality checks). Cost and performance optimisation (partitioning, clustering, compute sizing). Includes individual exam (30 min).
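As referenced in the seminar 6 row above, below is a minimal Airflow DAG sketch connecting ingestion, transformation, and validation with a daily schedule and retries. Task names and callables are placeholders, and the imports and schedule argument assume a reasonably recent Airflow 2.x release (2.4 or newer).

```python
# Minimal Airflow DAG sketch (assumes Airflow 2.4+). Task names, schedule,
# and callables are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest():
    print("ingest raw data")


def transform():
    print("transform to curated tables")


def validate():
    print("run data quality checks")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2026, 2, 19),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)

    # Ingestion feeds transformation, which is then validated.
    ingest_task >> transform_task >> validate_task
```

The retry settings in default_args illustrate how transient failures (e.g. a temporarily unavailable source) can be handled without manual intervention.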

Reading list

  • Martin Kleppmann. Designing Data-Intensive Applications.
  • Tyler Akidau et al. Streaming Systems.
  • Additional articles, documentation links, and short recorded materials are provided in Moodle for each seminar.