Semantic Data Management

Credits

6

Type

Compulsory

Requirements

This course has no formal requirements, but it does assume prior skills

Department

ESSI

Web

https://learnsql.fib.upc.es/moodle/

Big Data is traditionally defined by the three V's: Volume, Velocity and Variety. It was first associated with Volume (e.g., the Hadoop ecosystem), and more recently Velocity has gained momentum (especially with the arrival of stream processors such as Spark). However, associating Big Data solely with Volume or Velocity is now a mistake. The biggest challenge in Big Data management today is Variety: how to tackle it in real-world projects is not yet clear, and there are no standardized solutions for it (comparable to Hadoop for Volume or Spark for Velocity).

In this course the student will be introduced to advanced database technologies, modeling techniques and methods for tackling Variety for decision making. We will also explore the difficulties that arise when combining Variety with Volume and / or Velocity. The focus of this course is on the need to enrich the available data (typically owned by the organization) with external repositories (special attention will be paid to Open Data), in order to gain further insights into the organization's business domain. Many kinds of external data can be relevant to the decision-making processes of any company: for example, data coming from social networks such as Facebook or Twitter; data released by governmental bodies (such as town councils or governments); and data coming from sensor networks (such as those in the city services within the Smart Cities paradigm).

This is a new, hot topic that still lacks a clear, mature methodology. For this reason, it requires rigorous thinking, innovation and a strong technical background in order to master the inclusion of external data in an organization's decision-making processes. Accordingly, this course focuses on the following main aspects:

1.- Technical aspect. This represents the core discussion in the course and includes:
- dealing with semi-structured or unstructured data (as in the Web),
- the effective use of metadata to understand external data, for instance by means of Linked Data,
- mastering the main formalisms (mostly coming from the Semantic Web) to enrich the data with metadata (ontology languages, RDF, XML, etc.),
- determining relevant sources, and applying semantic mechanisms to automate the addition (potentially integration), linkage and / or crossing of data between heterogeneous data sources,
- refining and visualizing Open Data

2.- Entrepreneurship and innovation, which includes:
- working on the visionary aspect to boost new analytical perspectives on a business domain by considering external sources, and
- adding value to current systems by means of such external data

Lecturers

Coordinator

Oscar Romero Moral

Others

Besim Bilalli
Petar Jovanovic

Weekly hours

Theory

1.9

Problems

0

Laboratory

1.9

Guided learning

0

Autonomous learning

6.85

Objectives

1. Determine how to apply graph formalisms to solve the Variety challenge (data integration)
2. Master the main semantic-aware formalisms to enable semantic modeling
3. Determine how to apply graph formalisms to solve the Variety challenge (data integration)
4. Reinforce team work capabilities in order to develop innovative solutions by means of complementing the organization data with external data
5. Perform graph data processing both in centralized and distributed environments

Contents

Introduction and formalisation of Variety in Big Data and its management
Definition of data management tasks from the database and knowledge representation perspectives.

Definition of Variety and Big Data. Syntactic and Semantic heterogeneities. Impact of heterogeneities in the identified data management tasks.

Data integration. Theoretical framework for the management and integration of heterogeneous data sources.

Main components of an integration system: data sources, global schema and mappings.
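As an illustration of the three components named above, the following is a minimal sketch of an integration system, assuming global-as-view (GAV) mappings expressed as plain Python functions. The class `IntegrationSystem`, the source names `crm` and `open_data`, and the global relation `Person` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Sources = Dict[str, List[Row]]


@dataclass
class IntegrationSystem:
    """Toy integration system: data sources, a global schema and mappings."""
    sources: Sources                                   # source name -> rows
    global_schema: Dict[str, List[str]]                # global relation -> attributes
    mappings: Dict[str, Callable[[Sources], List[Row]]]  # GAV: relation -> view over sources

    def query(self, relation: str) -> List[Row]:
        """Answer a query on the global schema by unfolding its mapping."""
        rows = self.mappings[relation](self.sources)
        attrs = self.global_schema[relation]
        # Expose only the attributes declared in the global schema.
        return [{a: row[a] for a in attrs} for row in rows]


# Two hypothetical sources with different attribute names (syntactic heterogeneity).
sources = {
    "crm": [{"id": 1, "full_name": "Ada"}, {"id": 2, "full_name": "Alan"}],
    "open_data": [{"person": "Grace", "city": "Barcelona"}],
}
global_schema = {"Person": ["name"]}
mappings = {
    # The global relation Person is defined as a view over both sources.
    "Person": lambda s: (
        [{"name": r["full_name"]} for r in s["crm"]]
        + [{"name": r["person"]} for r in s["open_data"]]
    )
}

system = IntegrationSystem(sources, global_schema, mappings)
people = system.query("Person")
```

The user only sees `Person(name)`; the mapping hides where each row comes from and how its attributes were originally named.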

The concept of a canonical model for data integration. Definition of data model. Main characteristics of a canonical data model.

Graphs as solution to the Variety challenge
Graphs as the best canonical model for data integration.

Main features of graph data models. Differences from other data models (especially the relational data model).

Data and metadata concepts and their formalization in graph models.

Use cases (highlighting topological benefits): fraud detection, bioinformatics, traffic and logistics, social networks, etc.

Introduction to the main graph models: property graphs and knowledge graphs.

Property graphs management
Data structures. Integrity constraints.

Basic operations: topology-based, content-based and hybrid.
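The three kinds of basic operations can be sketched on a toy in-memory property graph. This is only an illustration under simple assumptions (labelled nodes and directed labelled edges, both carrying key-value properties); the class `PropertyGraph` and its method names are hypothetical and do not correspond to any particular graph database.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class PropertyGraph:
    """Nodes and directed edges, both with a label and key-value properties."""
    nodes: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    edges: List[Tuple[str, str, str, Dict[str, Any]]] = field(default_factory=list)

    def add_node(self, node_id: str, label: str, **props: Any) -> None:
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src: str, label: str, dst: str, **props: Any) -> None:
        self.edges.append((src, label, dst, props))

    def neighbours(self, node_id: str, edge_label: Optional[str] = None) -> List[str]:
        """Topology-based operation: follow outgoing edges from a node."""
        return [dst for src, lbl, dst, _ in self.edges
                if src == node_id and (edge_label is None or lbl == edge_label)]

    def select(self, **conditions: Any) -> List[str]:
        """Content-based operation: filter nodes by their property values."""
        return [n for n, props in self.nodes.items()
                if all(props.get(k) == v for k, v in conditions.items())]

    def neighbours_where(self, node_id: str, edge_label: Optional[str] = None,
                         **conditions: Any) -> List[str]:
        """Hybrid operation: traverse first, then filter on content."""
        matching = set(self.select(**conditions))
        return [n for n in self.neighbours(node_id, edge_label) if n in matching]


g = PropertyGraph()
g.add_node("ada", "Person", city="London")
g.add_node("alan", "Person", city="London")
g.add_node("grace", "Person", city="New York")
g.add_edge("ada", "KNOWS", "alan", since=1936)
g.add_edge("ada", "KNOWS", "grace")

known = g.neighbours("ada", "KNOWS")                        # topology
londoners = g.select(city="London")                         # content
known_londoners = g.neighbours_where("ada", "KNOWS", city="London")  # hybrid
```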

Graph query languages: GraphQL.

Graph database concept: tool heterogeneity when implementing the graph structures. Impact of such decisions on the main operations.

Distributed graph databases. Need and difficulties. The think-like-a-vertex paradigm as the de facto standard in distributed graph processing.
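To give an intuition of the vertex-centric style, the sketch below simulates Pregel-like supersteps on a single machine for single-source shortest paths. It is not a distributed implementation: the function name `vertex_centric_sssp` and the message-passing dictionaries are assumptions made for illustration, and edge weights are assumed to be non-negative.

```python
import math
from collections import defaultdict


def vertex_centric_sssp(edges, source):
    """Single-source shortest paths written from the point of view of a vertex.

    edges: dict mapping a vertex to a list of (neighbour, weight) pairs.
    Returns the distance of every vertex and the number of supersteps run.
    """
    vertices = set(edges) | {v for outs in edges.values() for v, _ in outs}
    dist = {v: math.inf for v in vertices}
    inbox = {source: [0.0]}          # messages delivered at the next superstep
    supersteps = 0
    while inbox:                     # stop when no vertex receives a message
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            # Each vertex only looks at its own state and incoming messages.
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best
                # Notify neighbours of the improved distance.
                for neighbour, weight in edges.get(vertex, []):
                    outbox[neighbour].append(best + weight)
        inbox = outbox
        supersteps += 1
    return dist, supersteps


graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
distances, steps = vertex_centric_sssp(graph, "a")
```

In a real distributed engine the loop body would run in parallel on each partition, with messages exchanged between workers at the end of every superstep.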

Main distributed graph algorithms.

Knowledge graph management
Data structure. RDF. Origin and relationship with Linked Open Data. Integrity constraints.
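As a rough illustration of the RDF data structure, a graph can be seen as a set of (subject, predicate, object) triples, and two graphs that reuse the same IRIs can be combined by set union. The sketch below uses plain Python tuples rather than an RDF library; the `http://example.org/` namespace and the resources in it are hypothetical, and literals are simplified to quoted strings.

```python
EX = "http://example.org/"  # hypothetical namespace, for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Data owned by the organization.
internal = {
    (EX + "ada", RDF_TYPE, EX + "Employee"),
    (EX + "ada", FOAF_NAME, '"Ada"'),
}

# Data published elsewhere that reuses the same IRI for the same resource.
external = {
    (EX + "ada", EX + "livesIn", EX + "London"),
}

# Because both graphs are sets of triples, merging them is a set union.
merged = internal | external


def describe(graph, subject):
    """Return every (predicate, object) pair stated about a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


about_ada = describe(merged, EX + "ada")
```

Sharing IRIs across datasets is what lets the external statement attach to the internal resource without any schema alignment step.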

Data structure: RDFS and OWL. Relationship with first order logic. Foundations in Description Logics. Integrity constraints. Reasoning.
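A small part of RDFS reasoning can be sketched as a fixpoint computation over triples. The example below only covers two entailment rules from the RDF 1.1 Semantics, rdfs9 (type inheritance through `rdfs:subClassOf`) and rdfs11 (transitivity of `rdfs:subClassOf`); it is a naive illustration, not a complete or efficient reasoner, and the `ex:` resources are hypothetical.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(triples):
    """Apply rules rdfs9 and rdfs11 until no new triple can be derived."""
    inferred = set(triples)
    changed = True
    while changed:
        new = set()
        for s, p, o in inferred:
            if p != SUBCLASS:
                continue
            for s2, p2, o2 in inferred:
                # rdfs11: (s subClassOf o), (o subClassOf o2) => (s subClassOf o2)
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # rdfs9: (s2 type s), (s subClassOf o) => (s2 type o)
                if p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))
        changed = not new <= inferred
        inferred |= new
    return inferred


asserted = {
    ("ex:ada", TYPE, "ex:Professor"),
    ("ex:Professor", SUBCLASS, "ex:Employee"),
    ("ex:Employee", SUBCLASS, "ex:Person"),
}
closure = rdfs_closure(asserted)
```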

Basic operations and query language. SPARQL and underlying algebra. Entailment regimes (reasoning).
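The core of a SPARQL query is the basic graph pattern (BGP), whose evaluation can be understood as joining the solution mappings of its triple patterns. The following sketch evaluates a BGP over an in-memory list of triples with a nested-loop join; the function names are hypothetical, variables are written as strings starting with `?`, and no optimization or entailment is attempted.

```python
def is_var(term):
    """Variables are written SPARQL-style, e.g. '?x'."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend a binding so that the pattern matches the triple, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was already bound to.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def evaluate_bgp(patterns, graph):
    """Join the matches of each triple pattern, one pattern at a time."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


graph = [
    ("ada", "knows", "alan"),
    ("alan", "knows", "grace"),
    ("ada", "name", "Ada"),
]
# Roughly: SELECT * WHERE { ?x :knows ?y . ?y :knows ?z }
answers = evaluate_bgp([("?x", "knows", "?y"), ("?y", "knows", "?z")], graph)
```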

Triplestores. Differences with graph databases. Native implementations. Implementations based on the relational data model. Impact of such decisions on the basic operations.
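One simple way to picture a triplestore built on the relational data model is a single three-column table, where each additional triple pattern in a query becomes a self-join. The sketch below uses Python's built-in `sqlite3` module; the table name `triples`, the index and the sample data are assumptions for illustration, and real systems typically use more elaborate layouts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per triple: subject, predicate, object.
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
# An index led by the predicate helps patterns with a fixed predicate.
conn.execute("CREATE INDEX idx_pos ON triples (p, o, s)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ada", "knows", "alan"),
        ("alan", "knows", "grace"),
        ("ada", "name", "Ada"),
    ],
)

# Two triple patterns (?x knows ?y . ?y knows ?z) become one self-join.
rows = conn.execute(
    """
    SELECT t1.s, t2.o
    FROM triples AS t1
    JOIN triples AS t2 ON t1.o = t2.s
    WHERE t1.p = 'knows' AND t2.p = 'knows'
    """
).fetchall()
```

Every extra hop adds another join on the same table, which is one reason the storage layout has a visible impact on the basic operations.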

Distributed triplestores. Need and difficulties. Graph Engine 1.0 as a paradigm of a distributed triplestore.

Main distributed algorithms.

Property and knowledge graphs comparison. Use cases
Recap about both models. Commonalities and differences. Concepts to borrow between both paradigms.

Main use cases. Metadata management: Data Lake semantification and data governance.

Main use cases. Exploitation of their topological features: recommenders on graphs and data mining.
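As one possible example of exploiting topology for recommendation, the sketch below scores candidates by the number of neighbours they share with a user (a friend-of-a-friend heuristic). The function `recommend` and the sample undirected graph are hypothetical; this is a simple heuristic chosen for illustration, not a method prescribed by the course.

```python
from collections import Counter


def recommend(adjacency, user, k=3):
    """Rank non-neighbours of `user` by their number of common neighbours."""
    friends = adjacency.get(user, set())
    scores = Counter()
    for friend in friends:
        for candidate in adjacency.get(friend, set()):
            # Skip the user and anyone already directly connected.
            if candidate != user and candidate not in friends:
                scores[candidate] += 1
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Undirected graph stored as symmetric adjacency sets.
adjacency = {
    "ada": {"alan", "grace"},
    "alan": {"ada", "edsger", "linus"},
    "grace": {"ada", "edsger"},
    "edsger": {"alan", "grace"},
    "linus": {"alan"},
}
suggestions = recommend(adjacency, "ada")
```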

Visualization: by means of a GUI (Gephi) or programmatically (D3.js or GraphLab).

Activities

Lectures

During lectures the main concepts will be discussed. Lectures will combine master lectures and active / cooperative learning activities. The student is meant to have a pro-active attitude during active / cooperative learning activities. During master lectures, the student is meant to listen, take notes and ask questions.
Objectives: 1 2 3 4 5
Contents:

1. Introduction and formalisation of Variety in Big Data and its management
2. Graphs as solution to the Variety challenge
3. Property graphs management
4. Knowledge graph management
5. Property and knowledge graphs comparison. Use cases

Theory

25.5h

Problems

0h

Laboratory

0h

Guided learning

0h

Autonomous learning

28h

Hands-on Session

The student will be asked to practice the different concepts introduced in the lectures. This includes problem solving either on the computer or on paper.
Objectives: 1 4
Contents:

3. Property graphs management
4. Knowledge graph management
5. Property and knowledge graphs comparison. Use cases

Theory

0h

Problems

0h

Laboratory

25.5h

Guided learning

3h

Autonomous learning

60h

Final Exam

Written exam on the theoretical concepts introduced throughout the course.
Objectives: 1 2 3 5
Contents:

1. Introduction and formalisation of Variety in Big Data and its management
2. Graphs as solution to the Variety challenge
3. Property graphs management
4. Knowledge graph management
5. Property and knowledge graphs comparison. Use cases

Theory

2h

Problems

0h

Laboratory

0h

Guided learning

0h

Autonomous learning

8h

Teaching methodology

The course comprises theory and lab sessions.

Theory: These lectures comprise the teacher's explanations and constitute the main part of the course. Students will also have some content to read and prepare outside the classroom and will be asked to participate in cooperative learning activities.

Laboratory: Mainly, the lab sessions will be dedicated to the practice (with and without computer) of the concepts introduced in the theory lectures. Specific and relevant tools will be introduced in these sessions. Small-sized projects will be conducted using these tools.

Project: The course contents are applied to a realistic problem in the course project.

Assessment method

Final mark = 10% EC + 40% EX + 40% LAB + 10% P

EX = Final exam mark
LAB = Weighted mark of the labs
EC = Mark from the activities in the theoretical sessions
P = Project
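The formula above can be checked with a short calculation. The sketch below assumes every mark is on a 0 to 10 scale, and the lab weights passed to `lab_mark` are hypothetical, since the syllabus does not state them; both function names are chosen only for this example.

```python
def lab_mark(marks, weights):
    """Weighted average of the lab marks (weights are hypothetical)."""
    return sum(m * w for m, w in zip(marks, weights)) / sum(weights)


def final_mark(ec, ex, lab, p):
    """Final mark = 10% EC + 40% EX + 40% LAB + 10% P."""
    return 0.10 * ec + 0.40 * ex + 0.40 * lab + 0.10 * p


# Example: three labs where the last one counts double (assumed weights).
lab = lab_mark([6, 7, 8], [1, 1, 2])
mark = final_mark(ec=8, ex=6, lab=7, p=9)
```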

EC: In some theory sessions, activities will be proposed. Students need to solve them and hand them in to the lecturer before the end of the session.

LAB: There are three lab sessions, each with a potentially different weight. LABs will be performed in groups assigned by the lecturer.

P: Final course project

Bibliography

Basic:

Data Integration: A Theoretical Perspective - Lenzerini, Maurizio, ACM, 2002. ISBN: 1-58113-507-6
https://doi.org/10.1145/543613.543644
Managing and mining graph data - Aggarwal, Charu C; Wang, Haixun, Springer, cop. 2010. ISBN: 9781441960443
http://cataleg.upc.edu/record=b1384488~S1*cat
The Description logic handbook : theory, implementation, and applications - Baader, Franz, Cambridge University Press, 2003. ISBN: 0521781760
http://cataleg.upc.edu/record=b1230856~S1*cat
Web data management - Abiteboul, S, Cambridge University Press, 2011. ISBN: 9781107012431
http://cataleg.upc.edu/record=b1410074~S1*cat
Ontology-Driven software development - Pan, Jeff Z, Springer, cop. 2013. ISBN: 9783642312250
http://cataleg.upc.edu/record=b1427265~S1*cat
Data management and query processing in semantic web databases - Groppe, Sven, Springer, cop. 2011. ISBN: 9783642193569
http://cataleg.upc.edu/record=b1394290~S1*cat
Database systems : the complete book - Garcia-Molina, Hector; Ullman, Jeffrey D; Widom, Jennifer, Pearson Education, 2009. ISBN: 978-0131873254
http://cataleg.upc.edu/record=b1346544~S1*cat
A Survey of RDF Data Management Systems - Özsu, M. Tamer, Cornell University Library, 2016.
https://arxiv.org/abs/1601.00707
The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing - Sahu, Siddhartha; Mhedhbi, Amine; Salihoglu, Semih; Lin, Jimmy; Özsu, M. Tamer, Cornell University Library, 2017.
https://arxiv.org/abs/1709.03188

Prior skills

The student must be familiar with the basics of databases and data modeling. Programming skills are also mandatory.

© Facultat d'Informàtica de Barcelona - Universitat Politècnica de Catalunya