Big Data Management

Credits

6

Types

BDMA: Compulsory
MIRI: Specialization complementary (Service Engineering)

Requirements

This course has no requirements, but it does assume prior skills

Department

ESSI

Web

https://learnsql2.fib.upc.edu/moodle

The main goal of this course is to analyze the technological and engineering needs of Big Data Management. The enabling technology for such a challenge is cloud services, which provide the elasticity needed to properly scale the infrastructure as the needs of the company grow. Thus, students will learn advanced data management techniques (i.e., NoSQL solutions) that also scale with the infrastructure. Since Big Data Management is the evolution of Data Warehousing, knowledge of the latter is assumed in this course (see the corresponding subject in the Data Science speciality for more details on its contents), which will focus specifically on the management of data Volume and Velocity.

On the one hand, to deal with high volumes of data, we will see how a distributed file system can scale to as many machines as necessary. Then, we will study the different physical structures we can use to store our data in it. Such structures can take the form of a file format at the operating system level, or sit at a higher level of abstraction. In the latter case, they take the form of sets of key-value pairs, collections of semi-structured documents, or column-wise stored tables. We will see that, independently of the kind of storage we choose, current highly parallelizable processing systems use functional programming principles (typically based on Map and Reduce functions), and that their processing frameworks can rely on temporary files (like Hadoop MapReduce) or mainly on in-memory structures (like Spark).
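To make the Map and Reduce idea concrete, here is a minimal single-machine sketch in plain Python. It is only an illustration of the functional principle, not the API of Hadoop MapReduce or Spark; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit one (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big ideas", "data at scale"])
```

Because `map_phase` handles each line independently and `reduce_phase` handles each key independently, both phases could in principle run in parallel on different machines; a real framework would additionally decide whether the intermediate pairs are kept in temporary files or in memory.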

On the other hand, to deal with high velocity of data, we need a low-latency system that processes either streams or micro-batches. However, data production is nowadays already beyond the capacity of processing technologies: more data is being generated than we can store or even process on the fly. Thus, we will recognize the need for techniques that (a) select subsets of data (i.e., filter out or sample), (b) summarize them while maximizing the valuable information retained, and (c) simplify our algorithms to reduce their computational complexity (i.e., doing one single pass over the data) and provide an approximate answer.
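As one possible illustration of sampling in a single pass, the sketch below uses reservoir sampling, a standard technique that keeps a uniform random sample of fixed size from a stream of unknown length. The choice of technique and the names `reservoir_sample`, `k` and `seed` are assumptions made for this example; the course description does not name a specific algorithm.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items using one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# The stream is consumed lazily; it is never stored in full.
sample = reservoir_sample(range(100_000), k=5, seed=42)
```

Each item is inspected exactly once and then discarded unless it lands in the reservoir, which is what makes the approach suitable when the data cannot be stored.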

Finally, the complexity of a Big Data project (combining all the necessary tools in a collaborative ecosystem), which typically involves several people with different backgrounds, requires the definition of a high-level architecture that abstracts away technological difficulties and focuses on the functionalities provided and the interactions between modules. Therefore, we will also analyse different software architectures for Big Data.

Teachers

Person in charge

Besim Bilalli

Others

Sergi Nadal Francesch

Weekly hours

Theory: 1.9
Problems: 0
Laboratory: 1.9
Guided learning: 0
Autonomous learning: 6.85

Competences

Generic Technical Competences

Generic

CG5 - Capability to apply innovative solutions and make progress in the knowledge to exploit the new paradigms of computing, particularly in distributed environments.

Transversal Competences

Teamwork

CTR3 - Capacity to work as a team member, either as a regular member or in a leading role, in order to help develop projects pragmatically and with a sense of responsibility; capability to take into account the available resources.

Basic

CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on the social and ethical responsibilities linked to the application of their knowledge and judgments.

Technical Competences of each Specialization

Specific

CEC1 - Ability to apply scientific methodologies in the study and analysis of phenomena and systems in any field of Information Technology as well as in the conception, design and implementation of innovative and original computing solutions.
CEC2 - Capacity for mathematical modelling, calculation and experimental design in engineering technology centres and business, particularly in research and innovation in all areas of Computer Science.
CEC3 - Ability to apply innovative solutions and make progress in the knowledge that exploit the new paradigms of Informatics, particularly in distributed environments.

Objectives

1. Understand the main advanced methods of data management, and design and implement non-relational database managers, with special emphasis on distributed systems.
Related competences: CB7, CEC1, CEC2, CEC3, CTR3, CG5
2. Understand, design, explain and carry out parallel information processing in massively distributed systems.
Related competences: CB7, CEC1, CEC2, CEC3, CTR3, CG5
3. Manage and process a continuous flow of data.
Related competences: CB7, CEC1, CEC2, CEC3, CTR3, CG5
4. Design, implement and maintain system architectures that manage the data life cycle in analytical environments.
Related competences: CB7, CEC1, CEC2, CEC3, CTR3, CG5

Contents

1. Introduction: Big Data, Cloud Computing, Scalability
2. Big Data Design: Polyglot systems; Schemaless databases; Key-value stores; Wide-column stores; Document-stores
3. Distributed Data Management: Transparency layers; Distributed file systems; File formats; Fragmentation; Replication and synchronization; Sharding; Distributed hash; LSM-Trees
4. In-memory Data Management: NUMA architectures; Columnar storage; Late reconstruction; Light-weight compression
5. Distributed Data Processing: Distributed Query Processing; Sequential access; Pipelining; Parallelism; Synchronization barriers; Multitenancy; MapReduce; Resilient Distributed Datasets; Spark
6. Stream management and processing: One-pass algorithms; Sliding window; Stream to relation operations; Micro-batching; Sampling; Filtering; Sketching
7. Big Data Architectures: Centralized and Distributed functional architectures of relational systems; Lambda architecture

Activities

Theoretical lectures

In these activities, the lecturer will introduce the main theoretical concepts of the subject. Besides lecturing, cooperative learning techniques will be used. These demand the active participation of the students, and consequently will be evaluated.
Objectives: 1 2 3 4
Contents:

1. Introduction
2. Big Data Design
3. Distributed Data Management
4. In-memory Data Management
5. Distributed Data Processing
6. Stream management and processing
7. Big Data Architectures

Theory: 25h
Problems: 0h
Laboratory: 0h
Guided learning: 0h
Autonomous learning: 25h

Exam

Written exam on the theoretical and practical concepts introduced throughout the course.
Objectives: 1 2 3 4
Contents:

1. Introduction
2. Big Data Design
3. Distributed Data Management
4. In-memory Data Management
5. Distributed Data Processing
6. Stream management and processing
7. Big Data Architectures

Theory: 2h
Problems: 0h
Laboratory: 0h
Guided learning: 0h
Autonomous learning: 17h

Lab

Students will use different NoSQL tools in a sandbox environment.
Objectives: 1 2 3 4
Contents:

2. Big Data Design
3. Distributed Data Management
4. In-memory Data Management
5. Distributed Data Processing
6. Stream management and processing
7. Big Data Architectures

Theory: 0h
Problems: 0h
Laboratory: 27h
Guided learning: 0h
Autonomous learning: 54h

Teaching methodology

The course comprises theory and lab sessions.

Theory: Classical theory lectures in conjunction with complementary explanations and problem solving.

Lab: The course contents are applied to a realistic problem in the course project, done in teams, where students will put into practice the kinds of tools studied during the course. Since this course is part of the BDMA Erasmus Mundus master syllabus, this project is conducted jointly with the Viability of Business Projects (VBP) and Debates on Ethics of Big Data (DEBD) courses.

Evaluation method

Final Mark = 60%E + 40%L

L = Weighted average of the marks of the lab deliverables and presentations
E = Final exam
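
As a worked illustration of the formula above, the following sketch computes a final mark. The 0 to 10 scale, the example marks and the lab weights are hypothetical values chosen for this example; the source only states the 60%/40% split and that L is a weighted average.

```python
def lab_mark(marks, weights):
    """L: weighted average of the lab deliverable and presentation marks."""
    return sum(m * w for m, w in zip(marks, weights)) / sum(weights)


def final_mark(exam, lab):
    """Final Mark = 60% E + 40% L."""
    return 0.6 * exam + 0.4 * lab


# Hypothetical student: two lab items weighted 3:1, marks on an assumed 0-10 scale.
lab = lab_mark(marks=[8.0, 6.0], weights=[3, 1])  # (8*3 + 6*1) / 4
mark = final_mark(exam=7.0, lab=lab)
```

With these assumed inputs the lab average is 7.5, so the final mark is 0.6 × 7.0 + 0.4 × 7.5.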

Bibliography

Basic:

Principles of distributed database systems - Özsu, M.T.; Valduriez, P, Springer, 2020. ISBN: 9783030262525
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004193569706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Encyclopedia of database systems - Liu, L.; Özsu, M.T, Springer, 2009. ISBN: 9780387399409
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991000621799706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
NoSQL distilled: a brief guide to the emerging world of polyglot persistence - Sadalage, P.J.; Fowler, M, Addison-Wesley, 2013. ISBN: 9780321826626
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003990429706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
In-memory data management - Plattner, H.; Zeier, A, Springer, 2012. ISBN: 9783642295744
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004007899706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
An architecture for fast and general data processing on large clusters - Zaharia, M, ACM Books, 2016. ISBN: 9781970001563
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004088079706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Mining of massive datasets - Leskovec, J.; Rajaraman, A.; Ullman, J.D, Cambridge University Press, 2020. ISBN: 9781108476348
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004193679706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Data streams: models and algorithms - Aggarwal, C.C. (ed.), Springer, 2007. ISBN: 9780387287591
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003199179706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Understanding ETL: Data Pipelines for Modern Data Architectures - Palmer, M, O'Reilly Media, 2024. ISBN: 9781098159252
https://www.oreilly.com/library/view/understanding-etl/9781098159269/

Complementary:

Database systems: the complete book - Garcia-Molina, H.; Ullman, J.D.; Widom, J, Pearson Education Limited, 2014. ISBN: 9781292024479
https://ebookcentral-proquest-com.recursos.biblioteca.upc.edu/lib/upcatalunya-ebooks/detail.action?pq-origsite=primo&docID=5174436
Master data management - Loshin, D, Morgan Kaufmann/Elsevier, 2009. ISBN: 9781282285507

Web links

Summer school http://cs.ulb.ac.be/conferences/ebiss.html
PhD programme https://deds.ulb.ac.be

Prior skills

Since Big Data Management is the evolution of Data Warehousing, knowledge of the latter is assumed in this course. Thus, general knowledge is expected of: relational database design; database management system architecture; ETL and OLAP.

Specifically, knowledge is expected of:
- Multidimensional modeling (e.g., star schemas)
- Querying relational databases
- Physical design of relational tables (e.g., partitioning)
- Hash and B-tree indexing
- External sorting algorithms (e.g., merge-sort)
- ACID transactions
