Institute of Computer Science

Data Engineering (LTAT.02.007) 2022/23 fall

Course Objective

The course gives an overview of the foundational concepts of Data Engineering. It is tailored for 1st- and 2nd-year MSc students and PhD students who would like to strengthen their fundamental understanding of Data Engineering, i.e., Data Modelling, Collection, and Wrangling.

IMPORTANT

  • Lecturers will be in class unless announced otherwise; lectures will be recorded.
  • Practice sessions will mostly be in class unless announced otherwise; they will also be recorded.
  • Lectures are in Room 1008.
  • Weeks do not always follow a lecture/practice sequence: some weeks may have two lectures, and others may have two practice sessions in a row. See the tentative syllabus below. Any changes will be announced at least one session ahead and on the course Moodle page.
  • Time slots:
    • Mon. 16.15 - 18.00, weeks 2-16
    • Thu. 10.15 - 12.00, weeks 2-16
  • The link for practice sessions (visible after logging in to Courses) will be used in case of online sessions.
  • Material for classes and practices will be listed on Moodle.

Prerequisites

Familiarity with the following concepts is strongly recommended to succeed in the course:

  • Algorithms and Data Structures
    • Graphs, Trees, Tables, Lists
  • Programming Languages
    • Java and Python
  • (Relational) Databases and Query Languages
    • SQL, JSONPath, and openCypher
    • Joins, Aggregations, Table definition, and manipulation (Create, Update, Insert, Alter)

Syllabus

  • Introduction Lecture
    • What is (Big) Data?
    • The Role of the Data Engineer
    • From Data Warehouse to Data Lakes
  • Introduction Practice
    • Docker
    • Jupyter Notebooks

Part 1: Data Lifecycle

This part will be covered in ~four weeks.

  • Lectures
    • Data lifecycle, ETL/ELT, Data processing pipelines (Airflow)
    • Data Ingestion
    • Data Pre-processing
    • Data cleansing
  • Practice
    • Data cleansing

Part 2: Data Modelling and Query Languages

This part will be covered in ~seven weeks.

  • Lectures
    • Relational Data
    • Data Warehousing
      • Star and Snowflake schemas
    • NoSQL
      • Key-Value Stores
      • Document
      • Graph
  • Practice
    • Modelling and Querying Document data: MongoDB
    • Modelling and Querying Graph data: Cypher

Part 3: Scalable data processing

This part will be covered in ~four weeks.

  • Lectures
    • Parallel processing with Hadoop MapReduce
    • Big Data Schema-on-read: Apache Hive
    • High-performance data analytics
  • Practice
    • Hadoop MapReduce
    • Hive
    • Singularity containers

Contacts

  • Lecturers:
    • Ahmed Awad - ahmed.awad@ut.ee
    • (Guest) Pelle Jakovits - pelle.jakovits@ut.ee
    • Feras Awaysheh - feras.awaysheh@ut.ee
  • Teaching Assistants
    • Kristo Raun
    • Mohamed Ragab

Recommended Books

  • Database System Concepts
  • Designing Data-Intensive Applications - Martin Kleppmann
  • The Data Warehouse Toolkit
  • Learning Neo4j

The proprietary copyrights of educational materials belong to the University of Tartu. The use of educational materials is permitted for the purposes and under the conditions provided for in the copyright law for the free use of a work. When using educational materials, the user is obligated to give credit to the author of the educational materials.
The use of educational materials for other purposes is allowed only with the prior written consent of the University of Tartu.