Big Data Management

Credits
6
Type
  • BDMA: Compulsory
  • MIRI: Elective
Requirements
This course has no requirements, but it assumes previous capacities
Department
ESSI
The main goal of this course is to analyze the technological and engineering needs of Big Data Management. The enabling technology for this challenge is cloud services, which provide the elasticity needed to scale the infrastructure as the needs of the company grow. Thus, students will learn advanced data management techniques (e.g., NoSQL solutions) that also scale with the infrastructure. Since Big Data Management is the evolution of Data Warehousing, knowledge of the latter is assumed in this course (see the corresponding subject in the Data Science speciality for more details on its contents), which will focus specifically on managing data Volume and Velocity.

On the one hand, to deal with high volumes of data, we will see how a distributed file system can scale to as many machines as necessary. Then, we will study the different physical structures we can use to store our data in it. Such structures can take the form of a file format at the operating system level, or sit at a higher level of abstraction. In the latter case, they take the form of sets of key-value pairs, collections of semi-structured documents, or column-wise stored tables. We will see that, regardless of the kind of storage we choose, current processing systems are highly parallelizable and follow functional programming principles (typically based on Map and Reduce functions), and that their processing framework can rely either on temporary files (like Hadoop MapReduce) or mainly on in-memory structures (like Spark).
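
As an illustration of the Map and Reduce style of processing mentioned above, here is a minimal single-machine sketch in Python. It is not how Hadoop MapReduce or Spark are implemented; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example. In a real framework the map and reduce calls would run in parallel on different machines, and the shuffle step would move data between them.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all the values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Chain the three phases over an iterable of text lines."""
    mapped = chain.from_iterable(map(map_phase, lines))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Prints {'big': 2, 'data': 2, 'management': 1, 'processing': 1}
    print(word_count(["big data management", "Big Data processing"]))
```

Because `map_phase` looks at one line at a time and `reduce_phase` at one key at a time, both can be distributed without coordination; only the shuffle needs to see data from every mapper.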

On the other hand, to deal with high-velocity data, we need a low-latency system that processes either streams or micro-batches. However, data production nowadays already exceeds the capacity of processing technologies: more data is being generated than we can store or even process on the fly. Thus, we will recognize the need to (a) select subsets of the data (i.e., filter or sample), (b) summarize them while maximizing the valuable information retained, and (c) simplify our algorithms to reduce their computational complexity (e.g., by making a single pass over the data) and provide an approximate answer.
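
One well-known way to combine sampling with a single pass over the data is reservoir sampling. The sketch below is only an illustration of the idea, not necessarily the technique taught in the course; the function name `reservoir_sample` and its parameters are assumptions made for this example. It keeps a uniform random sample of `k` items from a stream whose length is not known in advance, using memory proportional to `k` only.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items in one pass."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # evicting a uniformly chosen current member.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # The stream is consumed lazily and never stored in full.
    readings = (x * x for x in range(1_000_000))
    print(reservoir_sample(readings, k=5, rng=random.Random(42)))
```

Any statistic computed on the sample is an approximate answer for the whole stream, which is the trade-off described above.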

Finally, the complexity of a Big Data project (combining all the necessary tools in a collaborative ecosystem), which typically involves several people with different backgrounds, requires the definition of a high-level architecture that abstracts away technological difficulties and focuses on the functionalities provided and the interactions between modules. Therefore, we will also analyse different software architectures for Big Data.

Weekly hours

Theory
1.9
Problems
0
Laboratory
1.9
Guided learning
0
Autonomous learning
6.85

Objectives

  1. Understand the main advanced methods of data management, and design and implement non-relational database managers, with special emphasis on distributed systems.
    Related competences: CB7, CEC1, CEC2, CEC3, CTR3, CG5
  2. Understand, design, explain and carry out parallel information processing in massively distributed systems.
    Related competences: CB7, CEC1, CEC2, CEC3, CTR3, CG5
  3. Manage and process a continuous flow of data.
    Related competences: CB7, CEC1, CEC2, CEC3, CTR3, CG5
  4. Design, implement and maintain system architectures that manage the data life cycle in analytical environments.
    Related competences: CB7, CEC1, CEC2, CEC3, CTR3, CG5

Contents

  1. Introduction
    Big Data, Cloud Computing, Scalability
  2. Big Data Design
    Polyglot systems; Schemaless databases; Key-value stores; Wide-column stores; Document-stores
  3. Distributed Data Management
    Transparency layers; Distributed file systems; File formats; Fragmentation; Replication and synchronization; Sharding; Distributed hash; LSM-Trees
  4. In-memory Data Management
    NUMA architectures; Columnar storage; Late reconstruction; Light-weight compression
  5. Distributed Data Processing
    Distributed Query Processing; Sequential access; Pipelining; Parallelism; Synchronization barriers; Multitenancy; MapReduce; Resilient Distributed Datasets; Spark
  6. Stream management and processing
    One-pass algorithms; Sliding window; Stream to relation operations; Micro-batching; Sampling; Filtering; Sketching
  7. Big Data Architectures
    Centralized and Distributed functional architectures of relational systems; Lambda architecture

Activities

Theoretical lectures

In these activities, the lecturer will introduce the main theoretical concepts of the subject. Besides lecturing, cooperative learning techniques will be used. These demand the active participation of the students, which will consequently be evaluated.
Objectives: 1 2 3 4
Contents:
Theory
25h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
25h

Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
17h

Theory
0h
Problems
0h
Laboratory
27h
Guided learning
0h
Autonomous learning
54h

Teaching methodology

The course comprises theory and lab sessions.

Theory: Classical theory lectures in conjunction with complementary explanations and problem solving.

Lab: The course contents are applied to a realistic problem in the course project, carried out in teams, where students will put into practice the kinds of tools studied during the course. Since this course is part of the BDMA Erasmus Mundus master syllabus, this project is conducted jointly with the Viability of Business Projects (VBP) and Debates on Ethics of Big Data (DEBD) courses.

Evaluation method

Final Mark = 60%E + 40%L

L = Weighted average of the marks of the lab deliverables and presentations
E = Final exam
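
The formula above can be worked through with a short Python sketch. The weights of the individual lab deliverables are not stated here, so the `lab_weights` argument and the sample marks below are purely hypothetical; only the 60% / 40% split comes from the formula.

```python
def final_mark(exam, lab_marks, lab_weights):
    """Final Mark = 60% E + 40% L, with L a weighted average of lab marks."""
    total_weight = sum(lab_weights)
    lab = sum(m * w for m, w in zip(lab_marks, lab_weights)) / total_weight
    return 0.6 * exam + 0.4 * lab


if __name__ == "__main__":
    # Hypothetical marks: exam 7.0, two equally weighted deliverables.
    print(final_mark(7.0, [8.0, 6.0], [1, 1]))
```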

Previous capacities

Since Big Data Management is the evolution of Data Warehousing, knowledge of the latter is assumed in this course. Thus, general knowledge is expected of: relational database design; database management system architecture; ETL and OLAP.

Specifically, knowledge is expected of:
- Multidimensional modeling (e.g., star schemas)
- Querying relational databases
- Physical design of relational tables (e.g., partitioning)
- Hash and B-tree indexing
- External sorting algorithms (e.g., merge sort)
- ACID transactions