The main goal of this course is to analyze the technological and engineering needs of Big Data Management. The enabling technology for such a challenge is cloud services, which provide the elasticity needed to properly scale the infrastructure as the needs of the company grow. Thus, students will learn advanced data management techniques (i.e., NOSQL solutions) that also scale with the infrastructure. Being Big Data Management the evolution of Data Warehousing, such knowledge (see the corresponding subject in Data Science speciality for more details on its contents) is assumed in this course , which will specifically focus on the management of data Volume and Velocity.
On the one hand, to deal with high volumes of data, we will see how a distributed file system can scale to as many machines as necessary. Then, we will study different physical structures we can use to store our data in it. Such structures can be in the form of a file format at the operating system level, or at a higher level of abstraction. In the latter case, they take the form of either sets of key-value pairs, collections of semi-structured documents or column-wise stored tables. We will see that, independently of the kind of storage we choose, current highly parallelizable processing systems using funtional programming principles (typically based on Map and Reduce functions), whose processing framework can rely on temporal files (like Hadoop MapReduce) or in-memory structures (like Spark).
On the other hand, to deal with high velocity of data, we need some low latency system which processes either streams or micro-batches. However, nowadays, data production is already beyond processing technologies capacity. More data is being generated than we can store or even process on the fly. Thus, we will recognize the need of (a) some techniques to select subsets of data (i.e., filter out or sample), (b) summarize them maximizing the valuable information retained, and (c) simplify our algorithms to reduce their computational complexity (i.e., doing one single pass over the data) and provide an approximate answer.
Finally, the complexity of a Big Data project (combining all the necessary tools in a collaborative ecosystem), which typically involves several people with different backgrounds, requires the definition of a high level architecture that abstracts technological difficulties and focuses on functionalities provided and interactions between modules. Therefore, we will also analyse different software architectures for Big Data.
Person in charge
Alberto Abello Gamazo (
Sergi Nadal Francesch (
Generic Technical Competences
CG5 - Capability to apply innovative solutions and make progress in the knowledge to exploit the new paradigms of computing, particularly in distributed environments.
CTR3 - Capacity of being able to work as a team member, either as a regular member or performing directive activities, in order to help the development of projects in a pragmatic manner and with sense of responsibility; capability to take into account the available resources.
CB7 - Ability to integrate knowledges and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
Technical Competences of each Specialization
CEC1 - Ability to apply scientific methodologies in the study and analysis of phenomena and systems in any field of Information Technology as well as in the conception, design and implementation of innovative and original computing solutions.
CEC2 - Capacity for mathematical modelling, calculation and experimental design in engineering technology centres and business, particularly in research and innovation in all areas of Computer Science.
CEC3 - Ability to apply innovative solutions and make progress in the knowledge that exploit the new paradigms of Informatics, particularly in distributed environments.
Understand the differences and benefits of in-memory data management.
Understand the execution flow of a distributed query.
Identify the difficulties of scalability and parallelization.
Stream management and processing
One-pass algorithms; Sliding window; Stream to relation operations; Micro-batching; Sampling; Filtering; Sketching
Big Data Architectures
Centralized and Distributed functional architectures of relational systems; Data Wareshousing architectures; Service Oriented Architecture; Lambda architecture
In these activities, the lecturer will introduce the main theoretical concepts of the subject. Besides lecturing, cooperative learning techniques will be used. These demand the active participation of the students, and consequently will be evaluated. Objectives:213567 Contents:
The course comprises theory, problems, and lab sessions.
Theory: The theory classes comprise the teacher's explanations and constitute the main part of the course. The students will also have some contents to be read and prepared outside the classroom, and will be asked to participate in cooperative learning activities to solve some problems.
Lab: There will be some lab sessions to introduce some of the technologies and a cluster execution environment.
Project: At the end of the course, there will be a small project where the student will show all the knowledge acquired during the semester in a proof of concept.
Final Mark = 50% min (10,1.1 * sum (Ci * Wi) / sum (Wi)) + 10% P + 30% E + 10%Pr
Ci = Marks on collaborative activities and laboratories
Wi = Weight equal to 1, 2, 4, 6 or 8 (depending on the relevance and difficulty of the corresponding activity / laboratory)
Pe = Peer evaluation
E = Exam
Pr = Project
Calculation of Pe: Students will have multiple mates in the activities carried out during the semester and those will report on them. Based on these reports, the teacher will assign a mark to each student.
Being Big Data Management the evolution of Data Warehousing, such knowledge is assumed in this course. Thus, general knowledge is expected on: Relational database desing; Database management system architecture; ETL and OLAP
Specifically, knowledge is expected on:
- Multidimensional modeling (i.e, star schemas)
- Querying relational databases
- Physical design of relational tables (i.e., partitioning)
- Hash and B-tree indexing
- External sorting algorithms (i.e., merge-sort)
- ACID transactions
Where we are
B6 Building Campus Nord
C/Jordi Girona Salgado,1-3
08034 BARCELONA Spain
Tel: (+34) 93 401 70 00