Distributed Big Data Processing

Course Description

The primary goal of this course is to teach students to independently perform distributed processing of big data using state-of-the-art open-source technologies such as Apache Hadoop, Apache Lucene, Apache Mahout, and Apache Spark. The first part of the course covers the map-reduce programming model and its design patterns, as well as approaches to distributed big data storage. These concepts are then applied to recommender systems, real-time data stream processing, efficient search in large text collections, and link and social network analysis.

Learning Outcomes

  1. identify big data characteristics
  2. compare distributed algorithms for big data processing
  3. develop simple algorithms for distributed big data processing
  4. apply open source technologies for distributed big data processing and storage
  5. develop a distributed recommender system
  6. develop a distributed data stream processing system
  7. analyze big networks

Forms of Teaching

Lectures

During lectures, theoretical aspects of distributed storage and processing of big data are explained and discussed using various examples and datasets.

Exams

Midterm exam (week 8) and final exam (week 15).

Laboratory Work

During laboratory exercises, students solve several short practical assignments in Java using the required open-source technologies (Apache Hadoop, Apache Lucene, Apache Mahout, and Apache Spark) and discuss their solutions.

Grading Method

                          Continuous Assessment           Exam
Type                      Threshold   Percent of Grade    Threshold   Percent of Grade
Laboratory Exercises      0 %         40 %                0 %         40 %
Homework                  0 %         10 %                0 %         10 %
Attendance                0 %         10 %                0 %         10 %
Midterm Exam: Written     0 %         20 %                0 %         -
Final Exam: Written       0 %         20 %                -           -
Exam: Written             -           -                   50 %        40 %

Week by Week Schedule

  1. Introduction to Distributed Big Data Processing.
  2. Distributed Big Data Storage. Distributed File Systems.
  3. Map-reduce Programming Model.
  4. Basic Design Patterns in the Map-reduce Programming Model.
  5. Advanced Design Patterns in the Map-reduce Programming Model.
  6. Distributed Storage of Structured Big Data.
  7. Distributed Recommender Systems.
  8. Midterm exam.
  9. Midterm exam.
  10. Real-time Data Stream Processing.
  11. Real-time Data Stream Processing (2).
  12. Efficient Search in Large Textual Collections.
  13. Efficient Search in Large Textual Collections (2).
  14. Link and Large Network Analysis.
  15. Distributed Analysis of Social Networks.

Study Programmes

University graduate
  Computer Science (profile): Specialization Course (2nd semester)
  Software Engineering and Information Systems (profile): Specialization Course (2nd semester)
  Telecommunication and Informatics (profile): Specialization Course (2nd semester)

Prerequisites

Literature

Tom White (2015), Hadoop: The Definitive Guide, O'Reilly Media
Jimmy Lin, Chris Dyer (2010), Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers
Donald Miner, Adam Shook (2012), MapReduce Design Patterns, O'Reilly Media
Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman (2014), Mining of Massive Datasets, Cambridge University Press
Michael Manoochehri (2013), Data Just Right, Addison-Wesley
Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia (2015), Learning Spark, O'Reilly Media

General

ID: 147660
Semester: Summer
ECTS credits: 4
English level: L0
e-Learning level: L1
Lectures: 30
Exercises: 0
Laboratory exercises: 15
Project laboratory: 0

Grading System

85 Excellent
75 Very Good
65 Good
55 Acceptable