Big Data is traditionally defined by the three V's: Volume, Velocity and Variety. Historically, Big Data has been associated with Volume (e.g., the Hadoop ecosystem), and more recently Velocity has gained momentum (especially with the arrival of stream processors such as Spark). However, even though Variety has always been part of the Big Data definition, how to tackle it in real-world projects is not yet clear, and there are no standardized solutions for this challenge (comparable to Hadoop for Volume or Spark for Velocity).
In this course the student will be introduced to advanced database technologies, modeling techniques and methods for tackling Variety for decision making. We will also explore the difficulties that arise when combining Variety with Volume and / or Velocity. The focus of this course is on the need to enrich the available data (typically owned by the organization) with external repositories (special attention will be paid to Open Data), in order to gain further insights into the organization's business domain. There are many examples of external data that may be relevant to the decision-making processes of any company: data coming from social networks such as Facebook or Twitter; data released by governmental bodies (such as town councils or governments); data coming from sensor networks (such as those in the city services within the Smart Cities paradigm); etc.
This is a new, hot topic that still lacks a clear, mature methodology. For this reason, mastering the inclusion of external data in an organization's decision-making processes requires rigorous thinking, innovation and a strong technical background. Accordingly, this course focuses on three main aspects:
1.- Technical aspect. This represents the core discussion in the course and includes:
- dealing with semi-structured or unstructured data (as on the Web),
- the effective use of metadata to understand external data, for example by means of Linked Data,
- mastering the main formalisms (mostly coming from the Semantic Web) to enrich the data with metadata (ontology languages, RDF, XML, etc.),
- determining relevant sources and applying semantic mechanisms to automate the addition (potentially integration), linkage and / or crossing of data between heterogeneous data sources,
- refining and visualizing Open Data
2.- Ethical and social aspects, which include:
- data ownership aspects,
- ethics and,
- identifying and knowing the relevant legal frameworks (such as the LOPDP in Spain)
3.- Entrepreneurship and innovation, which includes:
- working on the visionary aspect to boost new analytical perspectives on a business domain by considering external sources and,
- developing added value to current systems by means of (such) external data
Web: https://learnsql.fib.upc.es/moodle/
Person in charge
Oscar Romero Moral
Generic Technical Competences
CG4 - Capacity for general and technical management of research, development and innovation projects, in companies and technology centers in the field of Informatics Engineering.
Entrepreneurship and innovation
CTR1 - Capacity for knowing and understanding a business organization and the science that rules its activity, capability to understand the labour rules and the relationships between planning, industrial and commercial strategies, quality and profit. Capacity for developing creativity, entrepreneurship and an orientation towards innovation.
CTR3 - Capacity of being able to work as a team member, either as a regular member or performing directive activities, in order to help the development of projects in a pragmatic manner and with sense of responsibility; capability to take into account the available resources.
Technical Competences of each Specialization
CEC1 - Ability to apply scientific methodologies in the study and analysis of phenomena and systems in any field of Information Technology as well as in the conception, design and implementation of innovative and original computing solutions.
CEC3 - Ability to apply innovative solutions and make progress in the knowledge that exploits the new paradigms of Informatics, particularly in distributed environments.
Determine relevant external sources to be considered in the decision-making processes in order to generate added value in the day-to-day processes
Master the main semantic-aware formalisms to enable semantic modeling
Integrate, combine and refine semi-structured or unstructured data, mostly coming from the Web, into decisional systems
Reinforce teamwork capabilities in order to develop innovative solutions by complementing the organization's data with external data
Enable effective open / linked data visualization
Recognise the social and legal aspects of open data
Big Data & Business Intelligence 2.0. The relevance of external data. Open Data.
Linked Data and Semantic Modeling
Definition. The four rules. The 5 stars of linked data. The relevance of metadata.
XML and XML databases
XML. XPath. XQuery. Foundations of XML databases
JSON and Document-Stores
Principles of document-stores. JSON and BSON
RDF, Graph Databases and Triplestores
RDF. RDFS. SPARQL. Foundations of graph databases. Foundations of triplestores
Ontology Languages and Ontology-Based Data Access
OWL. Datalog. Description Logics. Foundations of Ontology-Based Data Access
Refining, Combining and Integrating External Data
XSLT. Restructuring data. Data integration. Stream Processing. Mashups. Data Warehousing 2.0. Distributed Systems
Theory of visualization. Visual representations. User Experience
Legal Aspects, Innovation and Entrepreneurship
Ethics. Legal frameworks. LOPDP. Adding value to day-to-day processes
During lectures the main concepts will be discussed. Lectures will combine master lectures and active / cooperative learning activities. Students are expected to take a proactive attitude during active / cooperative learning activities. During master lectures, students are expected to listen, take notes and ask questions.
The students will be asked to choose a case study in which integrating external data would add value to the organization's current processes. They will need to determine what data should be considered and what the benefit of the new solution is. The students are asked to sketch a solution and prepare a demo, which will be presented to their classmates.
The course comprises theory, lab sessions and seminars.
Theory: These lectures comprise the teacher's explanations and constitute the main part of the course. The students will also have some contents to be read and prepared outside the classroom and will be asked to participate in cooperative learning activities.
Laboratory: The lab sessions will mainly be dedicated to practising (with and without a computer) the concepts introduced in the theory lectures, by means of graded exercises that will be done during class time. Some tools will be used for design and practice on a specific DBMS.
Seminar: The students will have to prepare a practical seminar by themselves. This seminar is focused on developing innovative and entrepreneurial aspects related to the inclusion of external data into the organization's decisional systems.
Final mark = 30% P + 30% EX + 30% S + 10% C
EX = Final exam mark
P = Course practice
S = Seminars and hand-outs
C = Peer evaluation
S: This mark corresponds to the seminars students must prepare and present in front of their classmates. It also includes the hand-outs to be delivered during the course. Based on the seminar presentation and the delivered materials, the lecturer will assign a mark.
P: Each group will prepare a course practice during the whole term. There will be several hand-outs during the course that the lecturer will pick up and assess. These hand-outs account for 50% of P. The rest of the mark (50%) will rely on the final presentation in the last week of the course.
C: During the course practice each student will interact with two other students. Since the practice is meant to entail several working hours, each student will peer-mark his / her teammates twice (at midterm and at the end of the term). The lecturer will assign a mark to each student according to the peer-marking received from his / her teammates.
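As an illustration of how the weights above combine, the following minimal sketch computes P and the final mark for one hypothetical student. The function names, the sample marks and the 0-10 marking scale are assumptions made for this example only; they are not specified in the course guide.

```python
def practice_mark(handouts: float, presentation: float) -> float:
    """P: 50% hand-outs collected during the course, 50% final presentation."""
    return 0.5 * handouts + 0.5 * presentation


def final_mark(p: float, ex: float, s: float, c: float) -> float:
    """Final mark = 30% P + 30% EX + 30% S + 10% C."""
    return 0.30 * p + 0.30 * ex + 0.30 * s + 0.10 * c


# Hypothetical marks on an assumed 0-10 scale.
p = practice_mark(handouts=8.0, presentation=7.0)
mark = final_mark(p=p, ex=6.0, s=8.0, c=9.0)
print(f"P = {p:.2f}, final mark = {mark:.2f}")
```

With these sample values, P is 7.5 and the final mark comes to roughly 7.35. Note that the peer evaluation C, although weighted at only 10%, can still move the final mark by up to one point on this assumed scale.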