Distributed Big Data Processing

Course Description

Introduction to Big Data. Distributed storage of Big Data and distributed file systems. MapReduce programming model. MapReduce design patterns. Distributed processing of large textual collections. Efficient search in large textual collections. Analysis of links and large networks. Distributed storage of large collections of structured data. Distributed recommender systems. Distributed Big Data processing based on dataflow programming. Real-time distributed data stream processing. Distributed machine learning. Distributed analysis of social networks.

Learning Outcomes

  1. identify big data characteristics
  2. compare distributed algorithms for big data processing
  3. develop simple algorithms for distributed big data processing
  4. apply open source technologies for distributed big data processing and storage
  5. develop a distributed recommender system
  6. develop a distributed data stream processing system
  7. analyze big networks

Forms of Teaching

Lectures

Independent assignments

Laboratory

Week by Week Schedule

  1. Performance evaluation of distributed systems, Big data concepts, Failure and recovery
  2. External storage, physical organization, and drives, Distributed DBMS, Data replication and consistency models, Data management
  3. Time and space trade-offs in algorithms, Programming middleware for distributed systems, Performance evaluation of distributed systems, Dynamic analysis of distributed systems (parallelism, synchronisation, and simulation), Multiple simultaneous computations, Parallelism, communication, and coordination, Programming constructs for parallelism, Basic knowledge of parallel decomposition concepts, Core distributed algorithms, Parallel algorithmic patterns (divide-and-conquer, map and reduce, master-workers, others)
  4. Time and space trade-offs in algorithms, Strategies for choosing the appropriate data structure, Programming middleware for distributed systems, Performance evaluation of distributed systems, Dynamic analysis of distributed systems (parallelism, synchronisation, and simulation), Multiple simultaneous computations, Parallelism, communication, and coordination, Programming constructs for parallelism, Core distributed algorithms, Parallel algorithmic patterns (divide-and-conquer, map and reduce, master-workers, others)
  5. Information retrieval models (vector space, probabilistic, Boolean)
  6. Web search (PageRank and HITS)
  7. Strategies for choosing the appropriate data structure, External storage, physical organization, and drives, NoSQL databases
  8. Midterm exam
  9. Multiple simultaneous computations, Basic knowledge of parallel decomposition concepts, Core distributed algorithms
  10. Time series and sequences mining, Lazy evaluation and infinite streams, Transmission Control Protocol (TCP) server and client; Concurrency; Application protocols based on TCP; Hypertext Transfer Protocol (HTTP) and File Transfer Protocol (FTP); Simple HTTP server, Case studies focused on Java network programming and network programming in Python
  11. Time series and sequences mining, Lazy evaluation and infinite streams
  12. Programming middleware for distributed systems, Big data concepts, Data management
  13. Visit
  14. Clustering, Centrality, Degree distributions, Degree correlations, Community structure, Diameter, Structure of social network graphs, Social network analysis
  15. Final exam

Study Programmes

University graduate
Audio Technologies and Electroacoustics (profile)
Free Elective Courses (1st semester)
Communication and Space Technologies (profile)
Free Elective Courses (1st semester)
Computational Modelling in Engineering (profile)
Free Elective Courses (1st semester)
Computer Engineering (profile)
Free Elective Courses (1st semester)
Specialization Course (3rd semester)
Computer Science (profile)
Free Elective Courses (1st semester)
Control Systems and Robotics (profile)
Free Elective Courses (1st semester)
Data Science (profile)
Free Elective Courses (1st semester)
Electrical Power Engineering (profile)
Free Elective Courses (1st semester)
Electric Machines, Drives and Automation (profile)
Free Elective Courses (1st semester)
Electronic and Computer Engineering (profile)
Free Elective Courses (1st semester)
Electronics (profile)
Free Elective Courses (1st semester)
Information and Communication Engineering (profile)
Free Elective Courses (1st semester)
Software Engineering and Information Systems (profile)
Specialization Course (3rd semester)
Telecommunication and Informatics (profile)
Specialization Course (3rd semester)

Literature

Tom White (2015), Hadoop: The Definitive Guide, O'Reilly Media
Donald Miner, Adam Shook (2012), MapReduce Design Patterns, O'Reilly Media
Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia (2015), Learning Spark, O'Reilly Media
Jimmy Lin, Chris Dyer (2010), Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers
Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman (2014), Mining of Massive Datasets, Cambridge University Press
Michael Manoochehri (2013), Data Just Right, Addison-Wesley

For students

General

ID 222765
Winter semester
5 ECTS
L3 English Level
L1 e-Learning
30 Lectures
13 Laboratory exercises