Advanced Databases

Credits
6
Types
Compulsory
Requirements
This subject has not requirements, but it has got previous capacities
Department
ESSI
This course will train students in the competencies necessary to design and configure analytical databases, evaluating the different possible alternatives in the context of their company. Concepts of generic relational databases (applicable to decision-making environments for data storage) are dealt with, in order to further study non-relational alternatives, also known as NOSQL managers, most appropriate for Big Data environments. First, data warehouse concepts and then key-value and columnars and architectures (distributed and in memory) will be presented alternatives to relational databases in certain scenarios. An introduction to the design and configuration of non-relational databases is also included, emphasizing the configuration in analytical environments and, including massive data processing in functional style environments. The knowledge given is essential to face the tasks of the data engineer.

Teachers

Person in charge

  • Alberto Abello Gamazo ( )

Others

  • Oscar Romero Moral ( )
  • Sergi Nadal Francesch ( )

Weekly hours

Theory
2
Problems
0
Laboratory
2
Guided learning
0.4
Autonomous learning
6

Competences

Technical Competences

Technical competencies

  • CE7 - Demonstrate knowledge and ability to apply the necessary tools for the storage, processing and access to data.

Transversal Competences

Transversals

  • CT4 - Teamwork. Be able to work as a member of an interdisciplinary team, either as a member or conducting management tasks, with the aim of contributing to develop projects with pragmatism and a sense of responsibility, taking commitments taking into account available resources.
  • CT6 - Autonomous Learning. Detect deficiencies in one's own knowledge and overcome them through critical reflection and the choice of the best action to extend this knowledge.
  • CT7 - Third language. Know a third language, preferably English, with an adequate oral and written level and in line with the needs of graduates.

Basic

  • CB2 - That the students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of ??study.
  • CB3 - That students have the ability to gather and interpret relevant data (usually within their area of ??study) to make judgments that include a reflection on relevant social, scientific or ethical issues.

Generic Technical Competences

Generic

  • CG1 - To design computer systems that integrate data of provenances and very diverse forms, create with them mathematical models, reason on these models and act accordingly, learning from experience.
  • CG2 - Choose and apply the most appropriate methods and techniques to a problem defined by data that represents a challenge for its volume, speed, variety or heterogeneity, including computer, mathematical, statistical and signal processing methods.

Objectives

  1. Be able to discuss and justify in detail the bottlenecks of the relational managers in front of alternative storage and processing systems.
    Related competences: CE7, CT4, CT6, CT7, CG1, CG2, CB2, CB3,
  2. Be able to analyze the pros and cons of having (or not) a unique model of reference that adapts to all possible storage scenarios.
    Related competences: CT7, CG1, CG2, CB2, CB3,
  3. Be able to detect and correct defects in a logical design.
    Related competences: CE7, CT4, CT6, CT7, CG2, CB2, CB3,
  4. Be able to obtain the logical scheme from a conceptual schema expressed in UML taking into account the consequences of the variety and variability of the data.
    Related competences: CE7, CT4, CT6, CT7, CG1, CG2, CB2, CB3,
  5. Being able, given certain characteristics of the data (volume, schema / data variability, expected workload), choose the data model and the appropriate physical structures to guarantee a correct balance between maintenance of the database and performance .
    Related competences: CE7, CT4, CT6, CT7, CG1, CG2, CB2, CB3,
  6. Be able to explain the operation and calculate the cost of access of the main data structures used by the managers.
    Related competences: CE7, CT4, CT6, CT7, CG2, CB2, CB3,
  7. Be able to obtain the access plan for a consultation based on optimization criteria.
    Related competences: CE7, CT4, CT6, CT7, CG2, CB2, CB3,
  8. Be able to reproduce the execution of algorithms that intervene in a process tree and estimate its cost.
    Related competences: CE7, CT4, CT6, CT7, CG2, CB2, CB3,
  9. Be able to decide the indexes (primary and secondary) that must be defined based on the expected operations.
    Related competences: CE7, CT4, CT6, CT7, CG2, CB2, CB3,
  10. Be able to choose and justify the use of storage based on rows or columns.
    Related competences: CE7, CT4, CT6, CT7, CG2, CB2, CB3,
  11. Be able to explain and use the main mechanisms of parallel processing of queries in distributed environments, and detect bottlenecks.
    Related competences: CE7, CT4, CT6, CT7, CG2, CB2, CB3,
  12. Be able to discuss and justify in detail the architectural principles that share the new non-relational storage systems.
    Related competences: CE7, CT6, CT7, CG2, CB2, CB3,
  13. Being able, given a specific scenario with user requirements (partial or total), identify what characteristics of relational managers would potentially act as bottleneck and talk about what types of storage managers would be most appropriate.
    Related competences: CE7, CT4, CT6, CT7, CG1, CG2, CB2, CB3,
  14. Be able to justify and use distributed functional data processing environments, like MapReduce/Spark.
    Related competences: CE7, CT4, CT6, CT7, CG1, CG2, CB2, CB3,

Contents

  1. Introduction
    Data warehousing and Big Data
  2. Data Warehousing
    Data warehousing. ETL data flows. Data integration. OLAP tools.
  3. Techniques for the improvement of performance of database systems
    Materialized views Data structures and indexing techniques (hash, trees and bitmaps). Techniques of compression and columnar storage. Parallelism
  4. Distributed databases
    Taxonomy of distributed databases. Architectures. Distributed database design (fragmentation and replication). Measures of scalability. Non-relational Key-Value systems.
  5. Distributed data processing
    Importance of parallel sequential access. Synchronization barriers (Bulk Synchronous Parallel model). Distributed processing environments of functional data (MapReduce and Spark). Abstraction of distributed datasets (Resilient Distributed Datasets).

Activities

Activity Evaluation act


Introduction

Introduction of the subject, motivation and overview of existing data management tools, their advantages and disadvantages
Objectives: 1
Contents:
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
0h

Study of data warehouses



Theory
10h
Problems
0h
Laboratory
12h
Guided learning
1h
Autonomous learning
32h

Study of techniques for improving the performance of database systems

Learning the types of management problems of materialized views and indexing structures, as well as the main associated costs that they have in each case
Objectives: 9 6 10
Contents:
Theory
4h
Problems
0h
Laboratory
4h
Guided learning
1h
Autonomous learning
12h

Study of distributed databases

Learning the principles of distributed databases and their application in NOSQL systems
Objectives: 1 12 13 11
Contents:
Theory
4h
Problems
0h
Laboratory
4h
Guided learning
1h
Autonomous learning
12h

Study of the distributed processing of data

Learning of distributed data processing techniques in functional style environments
Objectives: 14
Contents:
Theory
10h
Problems
0h
Laboratory
10h
Guided learning
1h
Autonomous learning
24h

Final exam

Global examination of the subject
Objectives: 1 2 12 9 7 8 3 4 13 6 11 10 14 5
Week: 15 (Outside class hours)
Type: final exam
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
2h
Autonomous learning
10h

Teaching methodology

At the theory hours, the teacher exposes the concepts corresponding to one of the contents. Some of the concepts are not discussed by the teacher, but students must work materials published on the virtual campus. The doubts that may arise when reading these materials are resolved by the teacher or their peers, through cooperative learning activities.

At the laboratory hours, the teacher presents different exercises, which students must solve in pairs and will be solved in class. Apart from that, two projects will also be carried out: one descriptive analysis of data in a data warehouse and the other one for predictive analysis in a Big Data environment.

The course also has an autonomous learning component, given that they will have to work with different data management tools (relational and non-relational). Apart from the support material, they must be able to solve doubts or usage problems of these managers.

Evaluation methodology

Final grade = max(20%EP+40%EF ; 60% EF) + 40% P

EP = partial (mid term) exam mark
EF = final exam mark
P = project mark, as a weighted average of the course projects

For students who may take the resit session, the reassessment examination mark will replace EF.

Bibliography

Basic:

Complementary:

Previous capacities

Be able to read and understand materials in English.
Be able to list the stages that make up the software engineering process.
Be able to understand conceptual schemas in UML.
Be able to create, query and manipulate databases with SQL.

Addendum

Contents

El contingut de l'assignatura no canviarà.

Teaching methodology

Es modificaran les classes magistrals per classes invertides (flipped classroom) en que l'estudiant haurà d'estudiar materials (texts o videos) de forma independent abans de la sessió corresponent, que es dedicarà a la realitazió d'exercicis i discussió/resolució de dubtes.

Evaluation methodology

El métode d'avaluació no canviarà.

Contingency plan

La realització d'exercicis, discussions i resolució de dubtes de les classes invertides es farà de forma remota.