Big Data Management

Credits: 6
Type: Compulsory
Requirements: This subject has no formal requirements, but it assumes prior knowledge (see Previous capacities).
Department: ESSI
Web: https://learnsql3.fib.upc.edu/moodle/course/view.php?id=169

The main goal of this course is to analyze the technological and engineering needs of Big Data Management. The enabling technology for such a challenge is cloud services, which provide the elasticity needed to properly scale the infrastructure as the needs of the company grow. Thus, students will learn advanced data management techniques (i.e., NoSQL solutions) that also scale with the infrastructure. Since Big Data Management is the evolution of Data Warehousing, knowledge of the latter is assumed in this course (see the corresponding subject in the Data Science speciality for more details on its contents). The course will specifically focus on the management of data Volume and Velocity.

On the one hand, to deal with high volumes of data, we will see how a distributed file system can scale to as many machines as necessary. Then, we will study the different physical structures we can use to store our data in it. Such structures can take the form of a file format at the operating system level, or sit at a higher level of abstraction. In the latter case, they take the form of sets of key-value pairs, collections of semi-structured documents, or column-wise stored tables. We will see that, independently of the kind of storage we choose, current highly parallelizable processing systems use functional programming principles (typically based on Map and Reduce functions), and that their processing framework can rely on temporary files (like Hadoop MapReduce) or mainly on in-memory structures (like Spark).
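
To make the Map and Reduce idea concrete, the following is a minimal single-machine sketch in Python. It is not how Hadoop MapReduce or Spark are implemented; the word-count task and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not part of the course material.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: fold all the values of one key into a single count."""
    return reduce(lambda a, b: a + b, values, 0)


def word_count(lines):
    # Each line could be mapped on a different machine; here it is done sequentially.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return {key: reduce_phase(values) for key, values in shuffle(pairs).items()}


if __name__ == "__main__":
    print(word_count(["big data management", "big data processing"]))
```

In this sketch, `map_phase` treats every line independently and the reduction is an associative sum, which is the property that lets both phases be spread over many machines.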

On the other hand, to deal with high velocity of data, we need a low-latency system that processes either streams or micro-batches. However, data production nowadays is already beyond the capacity of processing technologies: more data is being generated than we can store or even process on the fly. Thus, we will recognize the need for techniques to (a) select subsets of data (i.e., filter out or sample), (b) summarize them while maximizing the valuable information retained, and (c) simplify our algorithms to reduce their computational complexity (i.e., doing one single pass over the data) and provide an approximate answer.
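
As one possible illustration of sampling in a single pass, the sketch below uses reservoir sampling (Algorithm R), a standard technique that keeps a uniform random sample of fixed size from a stream of unknown length. The choice of this particular algorithm and the name `reservoir_sample` are assumptions made for illustration; the course may present other techniques.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable, reading it once."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Memory use stays at k items regardless of how long the stream is.
    print(reservoir_sample(range(1_000_000), 5, seed=42))
```

The sample is only an approximation of the full stream, but it needs constant memory and never revisits an element, which matches the one-pass constraint described above.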

Finally, the complexity of a Big Data project (combining all the necessary tools in a collaborative ecosystem), which typically involves several people with different backgrounds, requires the definition of a high-level architecture that abstracts away technological difficulties and focuses on the functionalities provided and the interactions between modules. Therefore, we will also analyze different software architectures for Big Data.

Teachers

Person in charge

Alex Barceló Cuerda

Others

Marc Maynou Yelamos
Sergi Nadal Francesch

Weekly hours

Theory: 1.9
Problems: 0
Laboratory: 1.9
Guided learning: 0
Autonomous learning: 6.85

Competences

Transversal Competences

Teamwork

CT3 - Ability to work as a member of an interdisciplinary team, as a normal member or performing direction tasks, in order to develop projects with pragmatism and sense of responsibility, making commitments taking into account the available resources.

Third language

CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.

Entrepreneurship and innovation

CT1 - Know and understand the organization of a company and the sciences that govern its activity; have the ability to understand labor standards and the relationships between planning, industrial and commercial strategies, quality and profit. Being aware of and understanding the mechanisms on which scientific research is based, as well as the mechanisms and instruments for transferring results among socio-economic agents involved in research, development and innovation processes.

Basic

CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
CB8 - Capability to communicate their conclusions, and the knowledge and rationale underpinning these, to both skilled and unskilled public in a clear and unambiguous way.
CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.
CB10 - Possess and understand knowledge that provides a basis or opportunity to be original in the development and/or application of ideas, often in a research context.

Generic Technical Competences

Generic

CG1 - Identify and apply the most appropriate data management methods and processes to manage the data life cycle, considering both structured and unstructured data
CG3 - Define, design and implement complex systems that cover all phases in data science projects

Technical Competences

Specific

CE2 - Apply the fundamentals of data management and processing to a data science problem
CE4 - Apply scalable storage and parallel data processing methods, including data streams, once the most appropriate methods for a data science problem have been identified
CE5 - Model, design, and implement complex data systems, including data visualization
CE12 - Apply data science in multidisciplinary projects to solve problems in new or poorly explored domains from a data science perspective that are economically viable, socially acceptable, and in accordance with current legislation
CE13 - Identify the main threats related to ethics and data privacy in a data science project (both in terms of data management and analysis) and develop and implement appropriate measures to mitigate these threats

Objectives

1. Understand the main advanced methods of data management and design and implement non-relational database managers, with special emphasis on distributed systems.
Related competences: CT3, CT5, CG1, CG3, CE2, CE4, CE5, CB6, CB7, CB8, CB9, CB10
2. Understand, design, explain and carry out parallel information processing in massively distributed systems.
Related competences: CT3, CT5, CG1, CG3, CE2, CE4, CE5, CB6, CB7, CB8, CB9, CB10
3. Manage and process a continuous flow of data.
Related competences: CT3, CT5, CG1, CG3, CE2, CE4, CE5, CB6, CB7, CB8, CB9, CB10
4. Design, implement and maintain system architectures that manage the data life cycle in analytical environments.
Related competences: CT1, CT3, CT5, CG1, CG3, CE2, CE4, CE5, CE12, CE13, CB6, CB7, CB8, CB9, CB10

Contents

1. Introduction
Big Data, Cloud Computing, Scalability
2. Big Data Design
Polyglot systems; Schemaless databases; Key-value stores; Wide-column stores; Document-stores
3. Distributed Data Management
Transparency layers; Distributed file systems; File formats; Fragmentation; Replication and synchronization; Sharding; Distributed hash; LSM-Trees
4. In-memory Data Management
NUMA architectures; Columnar storage; Late reconstruction; Light-weight compression
5. Distributed Data Processing
Distributed Query Processing; Sequential access; Pipelining; Parallelism; Synchronization barriers; Multitenancy; MapReduce; Resilient Distributed Datasets; Spark
6. Stream management and processing
One-pass algorithms; Sliding window; Stream to relation operations; Micro-batching; Sampling; Filtering; Sketching
7. Big Data Architectures
Centralized and Distributed functional architectures of relational systems; Lambda architecture

Activities

Theoretical lectures

In these activities, the lecturer will introduce the main theoretical concepts of the subject. The active participation of the students will be required.
Objectives: 1 2 3 4
Contents:

1. Introduction
2. Big Data Design
3. Distributed Data Management
4. In-memory Data Management
5. Distributed Data Processing
6. Stream management and processing
7. Big Data Architectures

Theory: 25h
Problems: 0h
Laboratory: 0h
Guided learning: 0h
Autonomous learning: 25h

Exam

Written exam on the theoretical and practical concepts introduced throughout the course.
Objectives: 1 2 3 4
Contents:

1. Introduction
2. Big Data Design
3. Distributed Data Management
4. In-memory Data Management
5. Distributed Data Processing
6. Stream management and processing
7. Big Data Architectures

Theory: 2h
Problems: 0h
Laboratory: 0h
Guided learning: 0h
Autonomous learning: 17h

Lab

Students will use different NoSQL tools in a sandbox environment.
Objectives: 1 2 3 4
Contents:

2. Big Data Design
3. Distributed Data Management
4. In-memory Data Management
5. Distributed Data Processing
6. Stream management and processing
7. Big Data Architectures

Theory: 0h
Problems: 0h
Laboratory: 27h
Guided learning: 0h
Autonomous learning: 54h

Teaching methodology

The course comprises theory and lab sessions.

Theory: Classical lectures combined with complementary explanations and problem solving.

Lab: Students will carry out a team project in which they put into practice the kinds of tools studied during the course. The project will be evaluated through two deliverables and individual tests.

Evaluation methodology

Final Mark = 60%E + 40%L

L = Weighted average of the marks of the lab deliverables and tests
E = Final exam
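
As a worked example of the formula above, the following small sketch computes the final mark from an exam mark and a lab mark. The function name `final_mark` and the sample marks are hypothetical, and it assumes both marks are expressed on the same scale; the weights inside L are not stated here, so L is taken as already computed.

```python
def final_mark(exam, lab):
    """Final Mark = 60% E + 40% L, with both marks on the same scale."""
    return 0.6 * exam + 0.4 * lab


if __name__ == "__main__":
    # Hypothetical marks: E = 7.0 in the final exam and L = 8.0 in the lab.
    print(round(final_mark(7.0, 8.0), 2))
```

With these hypothetical marks, the result is 0.6 × 7.0 + 0.4 × 8.0 = 4.2 + 3.2 = 7.4.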

Bibliography

Basic:

Principles of distributed database systems - Özsu, M.T.; Valduriez, P, Springer, 2020. ISBN: 9783030262525
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004193569706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Encyclopedia of database systems - Liu, L.; Özsu, M.T, Springer, 2009. ISBN: 9780387399409
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004877013906711&context=L&vid=34CSUC_UPC:VU1&lang=ca
NoSQL distilled: a brief guide to the emerging world of polyglot persistence - Sadalage, P.J.; Fowler, M, Addison-Wesley, 2013. ISBN: 9780321826626
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003990429706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
In-memory data management - Plattner, H.; Zeier, A, Springer, 2012. ISBN: 9783642295744
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004007899706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
An architecture for fast and general data processing on large clusters - Zaharia, M, ACM Books, 2016. ISBN: 9781970001563
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004088079706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Mining of massive datasets - Leskovec, J.; Rajaraman, A.; Ullman, J.D, Cambridge University Press, 2020. ISBN: 9781108476348
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004193679706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Data streams: models and algorithms - Aggarwal, C.C. (ed.), Springer, 2007. ISBN: 9780387287591
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003199179706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Understanding ETL: data pipelines for modern data architectures - Palmer, M, O'Reilly Media, 2024. ISBN: 9781098159252
https://www.oreilly.com/library/view/understanding-etl/9781098159269/

Complementary:

Database systems: the complete book - Garcia-Molina, H.; Ullman, J.D.; Widom, J, Pearson Education Limited, 2014. ISBN: 9781292024479
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004168919706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Master data management - Loshin, D, Morgan Kaufmann/Elsevier, 2009. ISBN: 9781282285507

Web links

Summer school http://cs.ulb.ac.be/conferences/ebiss.html
PhD programme https://deds.ulb.ac.be

Previous capacities

Since Big Data Management is the evolution of Data Warehousing, knowledge of the latter is assumed in this course. Thus, general knowledge is expected of: relational database design; database management system architecture; ETL and OLAP.

Specifically, knowledge is expected of:
- Multidimensional modeling (e.g., star schemas)
- Querying relational databases
- Physical design of relational tables (e.g., partitioning)
- Hash and B-tree indexing
- External sorting algorithms (e.g., merge-sort)
- ACID transactions
