This course introduces the principles and techniques of semantic data management for representing, integrating, and exploiting complex and heterogeneous data. Students learn how graph-based data models enable the explicit representation of entities and relationships, overcoming the limitations of traditional key-based data models when dealing with highly connected data. The course covers property graphs and knowledge graphs as foundational abstractions for semantic data integration.
The first part of the course focuses on property graphs, which build upon traditional graph data management systems and provide the basis for efficient graph storage, querying, and processing. Within this framework, students study fundamental graph algorithms and graph processing techniques to analyze structure, connectivity, and patterns in large-scale graph data.
The second part of the course introduces knowledge graphs, which extend graph data management with semantic annotations and formal vocabularies, enabling symbolic reasoning, inference, and richer forms of data integration. This perspective highlights how semantics add interpretability and reasoning capabilities beyond purely structural graph analysis.
The final part of the course presents a complementary form of graph exploitation based on graph embeddings. By mapping graph elements into continuous vector spaces, embeddings enable the application of machine learning techniques directly on graph-structured data. This includes an introduction to graph neural networks (GNNs) as a powerful paradigm for representation learning on graphs that explicitly captures structural and relational context.
As this is a rapidly evolving and still maturing research area, there is no single, well-established methodology. Consequently, the course emphasizes rigorous reasoning, technical depth, and innovation, preparing students to effectively incorporate complex, graph-structured data into organizational decision-making processes.
Teachers
Person in charge
Anna Queralt Calafat
Others
Gerard Pons Recasens
Oscar Romero Moral
Weekly hours
Theory
2
Problems
0
Laboratory
2
Guided learning
0
Autonomous learning
7.11
Competences
Transversal Competences
Teamwork
CT3 - Ability to work as a member of an interdisciplinary team, either as a regular member or in a leadership role, in order to develop projects pragmatically and responsibly, making commitments that take the available resources into account.
Third language
CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.
Entrepreneurship and innovation
CT1 - Know and understand the organization of a company and the sciences that govern its activity; have the ability to understand labor standards and the relationships between planning, industrial and commercial strategies, quality and profit. Being aware of and understanding the mechanisms on which scientific research is based, as well as the mechanisms and instruments for transferring results among socio-economic agents involved in research, development and innovation processes.
Basic
CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
CB8 - Ability to communicate their conclusions, and the knowledge and rationale underpinning them, clearly and unambiguously to both specialist and non-specialist audiences.
CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.
CB10 - Possess and understand knowledge that provides a basis or opportunity to be original in the development and/or application of ideas, often in a research context.
Generic Technical Competences
Generic
CG1 - Identify and apply the most appropriate data management methods and processes to manage the data life cycle, considering both structured and unstructured data
CG3 - Define, design and implement complex systems that cover all phases in data science projects
Technical Competences
Specific
CE3 - Apply data integration methods to solve data science problems in heterogeneous data environments
CE5 - Model, design, and implement complex data systems, including data visualization
CE9 - Apply appropriate methods for the analysis of non-traditional data formats, such as processes and graphs, within the scope of data science
CE12 - Apply data science in multidisciplinary projects to solve problems in new or poorly explored domains from a data science perspective that are economically viable, socially acceptable, and in accordance with current legislation
CE13 - Identify the main threats related to ethics and data privacy in a data science project (both in terms of data management and analysis) and develop and implement appropriate measures to mitigate these threats
Objectives
Learn, understand and apply the fundamentals of property graphs
Related competences: CT3, CT5, CG1, CE5, CE9, CB6, CB9, CB10
Learn, understand and apply the fundamentals of knowledge graphs
Related competences: CT3, CT5, CG1, CE5, CE9, CB6, CB9, CB10
Perform graph data processing both in centralized and distributed environments
Related competences: CT3, CT5, CG1, CE5, CE9, CB6, CB9, CB10
Integrate, combine and refine semi-structured or non-structured data using graph formalisms
Related competences: CT3, CT5, CT1, CG1, CG3, CE3, CE5, CE9, CE12, CE13, CB6, CB7, CB8, CB9
Determine how to apply graph formalisms to solve the Variety challenge (data integration)
Related competences: CT5, CT1, CG3, CE3, CE5, CE9, CE12, CE13, CB6, CB7, CB9
Apply property or knowledge graphs to solve realistic problems such as data integration, graph-based data analysis, etc.
Related competences: CT3, CT5, CT1, CG1, CG3, CE3, CE5, CE9, CE12, CE13, CB6, CB7, CB8, CB9, CB10
Contents
Introduction and formalization of semantic data management
Definition of data management tasks from the perspectives of databases and knowledge representation. Syntactic and semantic heterogeneity, and the impact of data heterogeneity on different data management tasks. Concept of data integration and definition of a theoretical framework for managing and integrating heterogeneous data sources. The need for a canonical data model for data integration, including the definition of a data model and the essential characteristics of canonical data models.
Property graphs
Data structures. Model integrity constraints. Basic operations based on topology, content, and hybrid approaches. Graph query languages: GraphQL and Cypher. Graph database concepts. Native implementations and implementations based on relational algebra. Impact of these design decisions on core operations. Efficient graph design. Impact of these heterogeneities on the main operations. Distributed graph databases: motivation and challenges. The "think like a vertex" paradigm as the de facto standard for distributed graph processing. Main distributed graph processing algorithms.
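As an informal illustration of the vertex-centric style of processing, the sketch below computes connected components by label propagation on a single machine. It is a minimal sketch and not course material: the function name `connected_components`, the synchronous "superstep" loop, and the sample graph are assumptions made for this example, and a distributed engine would partition the vertices and exchange the messages over the network.

```python
from collections import defaultdict


def connected_components(edges, vertices):
    """Label every vertex with the smallest vertex id in its component.

    Each vertex only sees its own label and the messages sent to it,
    which is the essence of the vertex-centric style (hypothetical sketch).
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    # Every vertex starts with its own id as label.
    label = {v: v for v in vertices}

    # Superstep 0: every vertex sends its label to all of its neighbours.
    inbox = defaultdict(list)
    for v in vertices:
        for n in neighbours[v]:
            inbox[n].append(label[v])

    # Later supersteps: only vertices that received messages do any work.
    while inbox:
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < label[v]:
                label[v] = smallest
                # Tell the neighbours about the improved label.
                for n in neighbours[v]:
                    outbox[n].append(smallest)
        inbox = outbox  # no messages left means the computation has converged

    return label


# Sample undirected graph with three components: {1, 2, 3}, {4, 5}, {6}.
labels = connected_components([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 4, 5, 6])
print(labels)
```

The computation stops when no vertex sends a message, which mirrors the usual halting condition of vertex-centric frameworks.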
Knowledge graphs
RDF, RDFS, and OWL. Data structures. Integrity constraints. Relationship with first-order logic. Foundations in Description Logics. Inference. Basic operations and query languages. SPARQL and its algebra. Entailment regimes (inference).
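To give a flavour of what inference over a knowledge graph means, the following sketch materializes two RDFS entailment rules over a set of triples: transitivity of `rdfs:subClassOf` (rule rdfs11) and propagation of `rdf:type` along `rdfs:subClassOf` (rule rdfs9). It is a simplified illustration under stated assumptions: triples are plain Python tuples of prefixed names, the `ex:` resources and the function name `rdfs_closure` are hypothetical, and a real triple store would apply the full rule set far more efficiently.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(triples):
    """Apply rules rdfs9 and rdfs11 until no new triple is produced."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            if p != SUBCLASS:
                continue
            for s2, p2, o2 in closure:
                # rdfs11: (s subClassOf o), (o subClassOf o2) => (s subClassOf o2)
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # rdfs9: (s subClassOf o), (s2 type s) => (s2 type o)
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if new <= closure:
            return closure  # fixpoint reached
        closure |= new


# Hypothetical example data.
graph = {
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:alice", RDF_TYPE, "ex:Student"),
}
inferred = rdfs_closure(graph) - graph
print(sorted(inferred))
```

Under an RDFS entailment regime, a SPARQL query asking for all instances of `ex:Agent` would be expected to return `ex:alice`, even though that triple is never stated explicitly.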
Property and knowledge graphs comparison. Use cases
Recap of both models. Commonalities and differences. Concepts that each paradigm can borrow from the other.
Main use cases. Metadata management: Data Lake semantification and data governance.
Main use cases. Exploitation of their topological features: recommenders on graphs and data mining.
Visualization: by means of a GUI (Gephi) or programmatically (D3.js or GraphLab).
Embeddings and GNNs
Concept of embeddings. Properties. Application to graphs and connection with Machine Learning and learning algorithms. GNN architectures. Applications.
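The idea behind message passing in GNNs can be sketched in a few lines: each node averages its own feature vector with those of its neighbours, applies a linear map, and passes the result through a ReLU. This is a deliberately simplified, dependency-free illustration rather than any specific published architecture; the function name `mean_aggregate_layer`, the fixed weight matrix, and the toy graph are assumptions for the example, and in practice the weights would be learned.

```python
def mean_aggregate_layer(features, neighbours, weight):
    """One simplified message-passing step with mean aggregation.

    features:   dict mapping node -> feature vector (list of floats)
    neighbours: dict mapping node -> list of adjacent nodes
    weight:     list of rows, one row per output dimension
    """
    updated = {}
    for v, h_v in features.items():
        # Collect the node's own vector plus the vectors of its neighbours.
        group = [h_v] + [features[u] for u in neighbours.get(v, [])]
        mean = [sum(h[i] for h in group) / len(group) for i in range(len(h_v))]
        # Linear transform followed by ReLU.
        out = [sum(w * m for w, m in zip(row, mean)) for row in weight]
        updated[v] = [max(0.0, x) for x in out]
    return updated


# Toy path graph a - b - c with 2-dimensional features (hypothetical data).
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
identity = [[1.0, 0.0], [0.0, 1.0]]

embeddings = mean_aggregate_layer(features, neighbours, identity)
print(embeddings)
```

Stacking k such layers lets each node's vector depend on its k-hop neighbourhood, which is how these models capture structural and relational context.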
Activities
Lectures
The main concepts are discussed during lectures, which combine traditional lecturing with active and cooperative learning activities. Students are expected to participate proactively in the active and cooperative learning activities and, during traditional lectures, to listen, take notes, and ask questions. Objectives: 2, 5, 3, 1.
Students are asked to practice the concepts introduced in the lectures. This includes problem solving either on the computer or on paper. Objectives: 6, 5, 4.
Lectures: The instructor presents the topic. Students follow the lecture, take notes, and prepare additional material outside the classroom. They may also be asked to carry out activities during these sessions.
Laboratory: Laboratory sessions are mainly devoted to practical work (with or without a computer) on the concepts introduced in the lecture sessions. Tools relevant to the introduced concepts are presented and used in projects during these sessions. Laboratory work requires the submission of project-based assignments, to be developed both in class and at home, which are assessed together with an on-site examination.
Evaluation methodology
Final grade = 40% EX + 60% LAB
EX = Final exam grade
LAB = Weighted grade of the laboratory work. Laboratory assessment is based on the submission (E) and an on-site assessment test (C) related to the submission. The final laboratory grade is computed as the geometric mean of E and C.
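The grading formula above can be checked with a short calculation. The sketch below is illustrative only: it assumes a single laboratory assignment with one submission grade E and one on-site test grade C, and the function names and sample grades are hypothetical.

```python
import math


def laboratory_grade(e, c):
    """Geometric mean of the submission grade (E) and the on-site test grade (C)."""
    return math.sqrt(e * c)


def final_grade(ex, e, c):
    """Final grade = 40% EX + 60% LAB."""
    return 0.4 * ex + 0.6 * laboratory_grade(e, c)


# Hypothetical grades: exam 7, submission 9, on-site test 4.
lab = laboratory_grade(9, 4)
total = final_grade(7, 9, 4)
print(round(lab, 2), round(total, 2))
```

Because the geometric mean is never larger than the arithmetic mean, a strong submission cannot fully compensate for a weak on-site test: with E = 9 and C = 4 the laboratory grade is 6, not 6.5.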
Bibliography
Basic:
Data Integration: A Theoretical Perspective -
Lenzerini, Maurizio,
PODS '02: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2002. ISBN: 1-58113-507-6 https://doi.org/10.1145/543613.543644
The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing -
Sahu, Siddhartha; Mhedhbi, Amine; Salihoglu, Semih; Lin, Jimmy; Özsu, M. Tamer,
Cornell University Library, 2017. https://arxiv.org/abs/1709.03188
Neural Network Methods in Natural Language Processing (Synthesis Lectures on Human Language Technologies) -
Goldberg, Yoav; Hirst, Graeme,
Morgan & Claypool, 2017. ISBN: 9781681732350 https://mitpressbookstore.mit.edu/book/9781681732350
A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications -
Cai, HongYun; Zheng, Vincent W.; Chang, Kevin Chen-Chuan,
IEEE Transactions on Knowledge and Data Engineering, 9 (2018). ISSN: 1558-2191 https://doi.org/10.1109/TKDE.2018.2807452
Previous capacities
Students must be familiar with the basics of databases, data modeling, logic, and linear algebra. Advanced programming skills are mandatory.