Advanced Databases

Credits
6
Types
Compulsory
Requirements
This subject has no formal prerequisites, but it does assume some previous capacities (listed below).
Department
ESSI
This subject trains students in the skills needed to design and configure analytical databases, evaluating the different possible alternatives in the context of their company. It covers concepts of generic relational databases (applicable to decision-making data storage environments) and then delves into non-traditional alternatives, also known as NewSQL managers, which are more appropriate for Big Data environments. First, data warehouse concepts are presented; then come data managers (columnar) and architectures (distributed and in-memory) that serve as alternatives to traditional relational databases in certain scenarios. Big Data processing in functional-style environments is also included.
The knowledge imparted is essential for tackling the tasks of a data engineer.

Teachers

Person in charge

  • Alberto Abello Gamazo

Others

  • Besim Bilalli

Weekly hours

Theory
2
Problems
0
Laboratory
2
Guided learning
0
Autonomous learning
6

Competences

Technical Competences

Technical competencies

  • CE7 - Demonstrate knowledge and ability to apply the necessary tools for the storage, processing and access to data.

Transversal Competences

Transversals

  • CT4 - Teamwork. Be able to work as a member of an interdisciplinary team, either as one more member or performing management tasks, with the aim of contributing to the development of projects with pragmatism and a sense of responsibility, making commitments in accordance with the available resources.
  • CT6 [Assessable] - Autonomous Learning. Detect deficiencies in one's own knowledge and overcome them through critical reflection and the choice of the best action to extend this knowledge.

Basic

  • CB2 - That students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of study.
  • CB3 - That students have the ability to gather and interpret relevant data (usually within their area of study) to make judgments that include a reflection on relevant social, scientific or ethical issues.

Generic Technical Competences

Generic

  • CG1 - To design computer systems that integrate data from very diverse sources and formats, build mathematical models with them, reason on these models and act accordingly, learning from experience.
  • CG2 - Choose and apply the most appropriate methods and techniques to a problem defined by data that represents a challenge for its volume, speed, variety or heterogeneity, including computer, mathematical, statistical and signal processing methods.

Objectives

  1. Be able to discuss and justify in detail the architectural principles and bottlenecks of relational database managers compared with alternative storage and processing systems.
    Related competences: CE7, CT4, CT6, CG1, CG2, CB2, CB3,
  2. Be able to obtain the logical schema of a data warehouse from a conceptual schema expressed in UML, and to detect and correct defects in it.
    Related competences: CE7, CT4, CT6, CB2, CB3,
  3. Be able to choose between row-based and column-based storage and justify the choice.
    Related competences: CE7, CT4, CT6, CG2, CB2, CB3,
  4. Be able to explain and use the main mechanisms of parallel processing of queries in distributed environments, and detect bottlenecks.
    Related competences: CE7, CT4, CT6, CG2, CB2, CB3,
  5. Be able to justify and use distributed functional data processing environments, like MapReduce/Spark.
    Related competences: CE7, CT4, CT6, CG1, CG2, CB2, CB3,

Contents

  1. Introduction
    Data warehousing and Big Data
  2. Data Warehousing
    Data warehousing. ETL data flows. Data integration. OLAP tools. Compression techniques and columnar storage.
  3. Distributed databases
    Taxonomy of distributed databases. Architectures. Distributed database design (fragmentation and replication). Parallelism. Scalability measures. Distributed file systems.
  4. Distributed data processing
    Importance of parallel sequential access. Synchronization barriers (Bulk Synchronous Parallel model). Functional-style distributed data processing environments (MapReduce and Spark). Abstraction of distributed datasets (Resilient Distributed Datasets). Big Data architectures. (A minimal illustrative sketch follows this list.)
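
As an illustration of the functional-style processing covered in topic 4, below is a minimal word-count sketch using the Spark RDD API. It is only an orientation sketch: the input path and the local session setup are assumptions for the example, not part of the course materials.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (the lab environment may configure a real cluster instead).
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# Load a hypothetical text file into a Resilient Distributed Dataset (RDD).
lines = sc.textFile("hdfs:///data/sample.txt")  # assumed path

# Functional-style transformations: flatMap/map form the "map" phase;
# reduceByKey performs the "reduce" phase after a shuffle (a synchronization barrier).
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Actions such as take() trigger the actual distributed execution (transformations are lazy).
for word, n in counts.take(10):
    print(word, n)

spark.stop()
```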

Activities



Introduction

Introduction of the subject, motivation and overview of existing data management tools, their advantages and disadvantages
Objectives: 1
Contents:
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
0h

Study of data warehouses


Objectives: 2 3
Contents:
Theory
10h
Problems
0h
Laboratory
14h
Guided learning
0h
Autonomous learning
38h

Study of distributed databases

Learning the principles of distributed databases and their application in NoSQL systems
Objectives: 1 4
Contents:
Theory
6h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
4h

Study of the distributed processing of data

Learning distributed data processing techniques in functional-style environments
Objectives: 1 4 5
Contents:
Theory
10h
Problems
0h
Laboratory
12h
Guided learning
0h
Autonomous learning
38h

Final exam

Global examination of the subject
Objectives: 1 2 3 4 5
Week: 15 (Outside class hours)
Type: theory exam
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
10h

Teaching methodology

The course consists of theory and laboratory sessions.

Theory: Flipped-classroom techniques will be used, which require students to work on multimedia materials before class. Theory classes consist of complementary explanations by the teacher and problem solving.

Laboratory: Representative tools will be used to apply the theoretical concepts (for example, Indyco Builder, PostgreSQL, Pentaho Data Integration, Spark). There will also be two projects, in which students will work in teams: one on descriptive data analysis in a data warehouse and the other on predictive analysis in a Big Data environment. Consequently, there will be two deliverables outside of class hours, but students will also be assessed individually in the classroom on the knowledge gained during each of the projects.
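
As an orientation for the first project (descriptive analysis in a data warehouse), the sketch below shows the kind of star-schema aggregation query that might be run against PostgreSQL from Python. Connection details, table and column names are hypothetical and do not correspond to the actual lab schema.

```python
import psycopg2

# Placeholder connection parameters; the lab environment provides its own.
conn = psycopg2.connect(host="localhost", dbname="dw", user="student", password="secret")

# A typical descriptive (OLAP-style) query over a star schema:
# the fact table is joined to a time dimension and measures are aggregated per month.
query = """
    SELECT d.year, d.month, SUM(f.amount) AS total_sales
    FROM sales_fact f
    JOIN time_dim d ON f.time_id = d.time_id
    GROUP BY d.year, d.month
    ORDER BY d.year, d.month;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for year, month, total in cur.fetchall():
        print(year, month, total)

conn.close()
```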

The course has an autonomous learning component, as students will have to work with different data management and processing tools. Beyond the support material provided, students are expected to resolve doubts or problems with these tools on their own.

Evaluation methodology

Final grade = min(10; max(20% EP + 40% EF; 60% EF) + 40% P + 10% C)

EP = partial (midterm) exam mark
EF = final exam mark
P = project mark, as a weighted average of the course projects
C = class participation mark
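
A hypothetical worked example of the formula (all marks below are invented for illustration):

```python
def final_grade(EP, EF, P, C):
    # Final grade = min(10; max(20% EP + 40% EF; 60% EF) + 40% P + 10% C)
    exams = max(0.2 * EP + 0.4 * EF, 0.6 * EF)
    return min(10.0, exams + 0.4 * P + 0.1 * C)

# Example: EP = 5.0, EF = 7.0, P = 8.0, C = 10.0
#   exams = max(0.2*5 + 0.4*7, 0.6*7) = max(3.8, 4.2) = 4.2
#   final = min(10, 4.2 + 3.2 + 1.0) = 8.4
print(round(final_grade(5.0, 7.0, 8.0, 10.0), 2))  # 8.4
```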

For students who may take the resit session, the reassessment examination mark will replace EF.

Bibliography

Basic:

Complementary:

  • Exercises Big Data Management.
  • Exercises Data Warehousing.

Web links

Previous capacities

Be able to read and understand materials in English.
Be able to list the stages that make up the software engineering process.
Be able to understand conceptual schemas in UML.
Be able to create, query and manipulate databases with SQL.