Course Objective
The course gives an overview of the foundational concepts of Data Engineering. It is tailored to 1st- and 2nd-year MSc students and PhD students who would like to strengthen their fundamental understanding of Data Engineering, i.e., Data Modelling, Collection, and Wrangling.
IMPORTANT
- Lectures will be held in class unless announced otherwise; they will be recorded (NOT streamed).
- Practices will be fully online and recorded.
- Mon. 12.15 - 14.00, weeks 2-16: invite link (log in to courses to see the link)
- Tue. 10.15 - 12.00, weeks 2-16: invite link (log in to courses to see the link)
- Material for classes and practices will be listed on GitHub: https://github.com/DataSystemsGroupUT/dataeng
Prerequisites
Familiarity with the following concepts is strongly recommended to succeed in the course:
- Algorithms and Data Structures
- Graphs, Trees, Tables, Lists
- Programming Languages
- Java and Python
- (Relational) Databases and Query Languages
- SQL, JSONPath, and openCypher
- Joins, aggregations, and table definition and manipulation (CREATE, UPDATE, INSERT, ALTER)
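As a rough self-check for the SQL prerequisites above, the following minimal sketch exercises table definition, ALTER, INSERT, UPDATE, a join, and an aggregation. It assumes Python's built-in sqlite3 module with an in-memory database purely for convenience (the course practice itself lists MySQL); the `student` and `enrolment` tables and their contents are hypothetical and only serve as an illustration.

```python
import sqlite3

# In-memory SQLite database: an assumption made so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Table definition (CREATE); table and column names are hypothetical.
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    "CREATE TABLE enrolment ("
    "student_id INTEGER REFERENCES student(id), "
    "course TEXT NOT NULL, "
    "grade INTEGER)"
)

# Table manipulation (ALTER): add a column after creation.
cur.execute("ALTER TABLE student ADD COLUMN level TEXT")

# INSERT: made-up rows for illustration only.
cur.executemany(
    "INSERT INTO student (id, name, level) VALUES (?, ?, ?)",
    [(1, "Ada", "MSc"), (2, "Linus", "PhD")],
)
cur.executemany(
    "INSERT INTO enrolment (student_id, course, grade) VALUES (?, ?, ?)",
    [(1, "Databases", 4), (1, "Algorithms", 5), (2, "Databases", 3)],
)

# UPDATE: change a single grade.
cur.execute(
    "UPDATE enrolment SET grade = 4 WHERE student_id = 2 AND course = 'Databases'"
)

# Join + aggregation: number of courses and average grade per student.
rows = cur.execute(
    """
    SELECT s.name, COUNT(*) AS n_courses, AVG(e.grade) AS avg_grade
    FROM student AS s
    JOIN enrolment AS e ON e.student_id = s.id
    GROUP BY s.id, s.name
    ORDER BY s.name
    """
).fetchall()
conn.close()

for name, n_courses, avg_grade in rows:
    print(name, n_courses, avg_grade)
```

If each statement here reads naturally, the relational part of the prerequisites is likely covered.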
Syllabus (Tentative)
Note: The syllabus is subject to change and will be adjusted during August/September.
- Introduction Lecture
- What is (Big) Data?
- The Role of the Data Engineer
- From Data Warehouses to Data Lakes
- Introduction Practice
- Docker
- Jupyter Notebooks
Part 1: Data Modelling and Query Languages
- Lecture
- Relational Data
- NoSQL
- Key-Value Stores
- Document
- Graph
- Data Warehousing
- Star and Snowflake schemas
- Practice
- Modelling and Querying Relational data: MySQL
- Modelling and Querying Key-Value data: Redis
- Modelling and Querying Document data: MongoDB
- Modelling and Querying Graph data: Cypher
- Extras
- Modelling and Querying RDF data: SPARQL
- Domain-Driven Design: a summary
- Event Sourcing: a summary
Part 2: (Big) Data Pipelines
- Lecture
- Big Data Systems Architectures
- ETL and Data Pipelines
- Best Practices and Anti-Patterns
- Batch vs Streaming Processing
- Data Replication
- Data Partitioning
- Transactions
- Practice
- Data Ingestion with Apache Kafka
- Data Pipelines with Apache Airflow
- Data Processing with Kafka Streams/KSQL
- Extras
- Data Pipelines with Luigi
- Data Pipelines with Apache NiFi
- Data Processing with Apache Flink
Part 3: Data Wrangling
- Lecture
- Data Cleansing
- Augmentation
- Practice
- Cleansing examples using Python
- Augmentation examples using Pandas and TensorFlow
Video Lectures
Slides
Contacts
- Lecturer:
- Riccardo Tommasini - riccardo.tommasini@ut.ee
- Teaching Assistants:
- Mohamed Ragab
- Hassan Eldeeb
- Fabiano Spiga
Recommended Books