Semantic Data Management

Teachers
Weekly hours
Competences
Objectives
Contents
Activities
Teaching methodology
Evaluation methodology
Bibliography
Previous capacities

Credits

6

Types

Compulsory

Requirements

This subject has not requirements, but it has got previous capacities

Department

ESSI

Web

https://learnsql.fib.upc.es/moodle/

Big Data is traditionally defined with the three V's: Volume, Velocity and Variety. Traditionally, Big Data has been associated with Volume (e.g., the Hadoop ecosystem) and recently Velocity has earned its momentum (especially, with the arrival of Stream processors such as Spark). However, currently, associating Big Data with simply Volume or Velocity is a capital mistake. The biggest challenge in Big Data Management is nowadays the Variety challenge and how to tackle Variety in real-world projects is yet not clear and there are no standarized solutions (such as Hadoop for Volume or Spark for Velocity) for this challenge. Yet, the main efforts in Big Data, nowadays, go in this direction.

In this course the student will be introduced to advanced database technologies, modeling techniques and methods for tackling Variety for decision making. We will also explore the difficulties that arise when combining Variety with Volume and / or Velocity. The focus of this course is on the need to enrich the available data (typically owned by the organization) with external repositories (special attention will be paid to Open Data), in order to gain further insights into the organization business domain. There is a vast amount of examples of external data to be considered as relevant in the decision making processes of any company. For example, data coming from social networks such as Facebook or Twitter; data released by governmental bodies (such as town councils or governments); data coming from sensor networks (such as those in the city services within the Smart Cities paradigm); etc.

This is a new hot topic without a clear and established (mature enough) methodology. For this reason, it requires rigorous thinking, innovation and a strong technical background in order to master the inclusion of external data in an organization decision making processes. Accordingly, this course focuses on three main aspects:

1.- The use of property graphs to ingest, process and query highly unstructured data. The course covers the basic graph algorithms to perform graph-oriented data analysis and foundations on large graph processing.
2.- The use of knowledge graphs to overcome data exchange and data integration, specially with / from third parties, and the use of graph embeddings to perform analysis on them.
3.- Fundamentals on data integration for Big Data and its current application in real-world projects.

Teachers

Person in charge

Anna Queralt Calafat ( )

Others

Gerard Pons Recasens ( )
Oscar Romero Moral ( )

Weekly hours

Theory

2

Problems

0

Laboratory

2

Guided learning

0

Autonomous learning

7.11

Competences

Transversal Competences

Teamwork

CT3 - Ability to work as a member of an interdisciplinary team, as a normal member or performing direction tasks, in order to develop projects with pragmatism and sense of responsibility, making commitments taking into account the available resources.

Third language

CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.

Entrepreneurship and innovation

CT1 - Know and understand the organization of a company and the sciences that govern its activity; have the ability to understand labor standards and the relationships between planning, industrial and commercial strategies, quality and profit. Being aware of and understanding the mechanisms on which scientific research is based, as well as the mechanisms and instruments for transferring results among socio-economic agents involved in research, development and innovation processes.

Basic

CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
CB8 - Capability to communicate their conclusions, and the knowledge and rationale underpinning these, to both skilled and unskilled public in a clear and unambiguous way.
CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.
CB10 - Possess and understand knowledge that provides a basis or opportunity to be original in the development and/or application of ideas, often in a research context.

Generic Technical Competences

Generic

CG1 - Identify and apply the most appropriate data management methods and processes to manage the data life cycle, considering both structured and unstructured data
CG3 - Define, design and implement complex systems that cover all phases in data science projects

Technical Competences

Especifics

CE3 - Apply data integration methods to solve data science problems in heterogeneous data environments
CE5 - Model, design, and implement complex data systems, including data visualization
CE9 - Apply appropriate methods for the analysis of non-traditional data formats, such as processes and graphs, within the scope of data science
CE12 - Apply data science in multidisciplinary projects to solve problems in new or poorly explored domains from a data science perspective that are economically viable, socially acceptable, and in accordance with current legislation
CE13 - Identify the main threats related to ethics and data privacy in a data science project (both in terms of data management and analysis) and develop and implement appropriate measures to mitigate these threats

Objectives

Learn, understand and apply the fundamentals of property graphs
Related competences: CT3, CT5, CG1, CE5, CE9, CB6, CB9, CB10,
Learn, understand and apply the fundamentals of knowledge graphs
Related competences: CT3, CT5, CG1, CE5, CE9, CB6, CB9, CB10,
Perform graph data processing both in centralized and distributed environments
Related competences: CT3, CT5, CG1, CE5, CE9, CB6, CB9, CB10,
Integrate, combine and refine semi-structured or non-structured data using graph formalisms
Related competences: CT3, CT5, CT1, CG1, CG3, CE3, CE5, CE9, CE12, CE13, CB6, CB7, CB8, CB9,
Determine how to apply graph formalisms to solve the Variety challenge (data integration)
Related competences: CT5, CT1, CG3, CE3, CE5, CE9, CE12, CE13, CB6, CB7, CB9,
Apply property or knowledge graphs to solve realistic problems such as data integration, graph-based data analysis, etc.
Related competences: CT3, CT5, CT1, CG1, CG3, CE3, CE5, CE9, CE12, CE13, CB6, CB7, CB8, CB9, CB10,

Introduction and formalisation of Variety in Big Data and its management
Definition of data management tasks: from a database perspective and knowledge representation.

Definition of Variety and Big Data. Syntactic and Semantic heterogeneities. Impact of heterogeneities in the identified data management tasks.

Data integration. Theoretical framework for the management and integration of heterogeneous data sources.

Main components of an integration system: data sources, global schema and mappings.

The concept of canonical model for data integration. Definition of data model. Main characteristics of a canonical data model.
Property graphs management
Data structures. Integrity constraints.

Basic operations. Based on topology, content and hybrid.

Graph query languages: GraphQL.

Graph database concept: tool heterogeneity when implementing the graph structures. Impact of such decisions in the main operations.

Distributed graph databases. Need and difficulties. Thinking like a vertex paradigm as standard de facto in distributed graph processing.

Main distributed graph algorithms.
Knowledge graph management
Data structure. RDF. Origin and relationship with Linked Open Data. Integrity constraints.

Data structure: RDFS and OWL. Relationship with first order logic. Foundations in Description Logics. Integrity constraints. Reasoning.

Basic operations and query language. SPARQL and underlying algebra. Entailment regimes (reasoning).

Triplestores. Differences with graph databases. Native implementations. Implementations based on the relational data model. Impact of such decisions on the basic operations.

Distributed triplestore. Needs and difficulties. Graph Engine 1.0 as paradigm of distributed triplestore.

Main distributed algorithms.

Graph embeddings.
Graphs as solution to the Variety challenge
Graphs as the best canonical model for data integration.

Graph data models main features. Differences with other data models (specially the relational data model).

Data and metadata concepts and their formalization in graph models.

Use cases (highlighting topological benefits): fraud detection, bioinformatics, traffic and logistics, social networks, etc.

Introduction to the main graph models: property graph and knowledge graphs.
Property and knowledge graphs comparison. Use cases
Recap about both models. Commonalities and differences. Concepts to borrow between both paradigms.

Main use cases. Metadata management: Data Lake semantification and data governance.

Main use cases. Exploitation of their topological features: recommenders on graphs and data mining.

Visualization: by means of a GUI (Gephi) or programmatically (D3.js or GraphLab).

Activities

Activity Evaluation act

Lectures

During lectures the main concepts will be discussed. Lectures will combine master lectures and active / cooperative learning activities. The student is meant to have a pro-active attitude during active / cooperative learning activities. During master lectures, the student is meant to listen, take notes and ask questions.
Objectives: 2 5 3 1
Contents:

1 . Introduction and formalisation of Variety in Big Data and its management
2 . Property graphs management
3 . Knowledge graph management
5 . Property and knowledge graphs comparison. Use cases
4 . Graphs as solution to the Variety challenge

Theory

25h

Problems

0h

Laboratory

0h

Guided learning

0h

Autonomous learning

28h

Hands-on Session

The student will be asked to practice the different concepts introduced in the lectures. This includes problem solving either on the computer or on paper.
Objectives: 6 5 4
Contents:

2 . Property graphs management
3 . Knowledge graph management
5 . Property and knowledge graphs comparison. Use cases

Theory

0h

Problems

0h

Laboratory

27h

Guided learning

0h

Autonomous learning

60h

Final Exam

Written exam of the theoretical concepts introduced along the course.
Objectives: 2 5 3 4 1
Contents:

1 . Introduction and formalisation of Variety in Big Data and its management
2 . Property graphs management
3 . Knowledge graph management
5 . Property and knowledge graphs comparison. Use cases
4 . Graphs as solution to the Variety challenge

Theory

2h

Problems

0h

Laboratory

0h

Guided learning

0h

Autonomous learning

8h

Teaching methodology

The course spans theory and lab sessions.

Theory: These lectures are based on teacher's explanations and constitute the main part of the course. The students will also have some contents to read and prepare outside the classroom and will be asked to participate in cooperative learning activities during the lectures.

Laboratory: Mainly, the lab sessions will be dedicated to hands-on of the concepts introduced in the theory lectures. Specific and relevant tools will be introduced in these sessions. Small-sized projects will be conducted using these tools.

Evaluation methodology

Final mark = 40% EX + 60% LAB

EX = Final exam mark
LAB = Weighted mark of the labs

Bibliography

Basic:

Data Integration: A Theoretical Perspective - Lenzerini, Maurizio, PODS '02: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2002. ISBN: 1-58113-507-6
https://doi.org/10.1145/543613.543644
Managing and mining graph data - Aggarwal, Charu C; Wang, Haixun, Springer, 2010. ISBN: 9781441960443
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003843179706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
The description logic handbook: theory, implementation and applications - Baader, Franz, Cambridge University Press, 2003. ISBN: 0521781760
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002562579706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Web data management - Abiteboul, Serge, Cambridge University Press, 2012. ISBN: 9781107012431
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003929239706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Ontology-Driven software development - Pan, Jeff Z, Spinger, cop. 2013. ISBN: 9783642312250
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003980109706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Data management and query processing in semantic web databases - Groppe, Sven, Springer, 2011. ISBN: 9783642193569
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003898129706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Database systems : the complete book - Garcia-Molina, Hector; Ullman, Jeffrey D; Widom, Jennifer, Pearson Education, [2014]. ISBN: 9781292024479
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004168919706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
A Survey of RDF Data Management Systems - Özsu, M. Tamer, Cornell University Library, 2016.
https://arxiv.org/abs/1601.00707
The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing - Sahu, Siddhartha; Mhedhbi, Amine; Salihoglu, Semih; Lin, Jimmy; Özsu, M. Tamer, Cornell University Library, 2017.
https://arxiv.org/abs/1709.03188

Previous capacities

The student must be familiar with basics on databases and data modeling. Advanced programming skills are mandatory.

Semantic Data Management

Teachers

Person in charge

Others

Weekly hours

Competences

Transversal Competences

Teamwork

Third language

Entrepreneurship and innovation

Basic

Generic Technical Competences

Generic

Technical Competences

Especifics

Objectives

Contents

Activities

Lectures

Hands-on Session

Final Exam

Teaching methodology

Evaluation methodology

Bibliography

Basic:

Previous capacities

Where we are

Contact with us

Semantic Data Management

You are here

Teachers

Person in charge

Others

Weekly hours

Competences

Transversal Competences

Teamwork

Third language

Entrepreneurship and innovation

Basic

Generic Technical Competences

Generic

Technical Competences

Especifics

Objectives

Contents

Activities

Lectures

Hands-on Session

Final Exam

Teaching methodology

Evaluation methodology

Bibliography

Basic:

Previous capacities

Where we are

Contact with us