Advanced Topics in Data Engineering II

Credits

Types

Compulsory

Requirements

This subject has no requirements, but it does have previous capacities.

Department

ETSETB;FIB;FME;ESSI;ENTEL

The course is structured in two parts: the study of data-driven software engineering and the study of aspects related to data privacy and security.

1. Software Engineering. The availability of large volumes of data, both from the development of software systems and from their use, makes it possible to exploit these data in various stages and activities of software engineering and, going even further, defines a new approach that considers data as the cornerstone of the software life cycle. The first part of the course presents this new vision of software engineering and delves into the emerging software engineering practices and tools to automate the construction of ML-enabled components and the end-to-end ML component life cycle, from model building to production deployment.

2. Data privacy and security. Data analysis techniques can help to anticipate various problems, identify their sources and implement solutions, in contexts as varied as business competitiveness, marketing, social relations, transport, health, education and politics. However, while data analysis is extremely valuable, it also has a crucial drawback: it increasingly invades the privacy of the people about whom data is collected. The second part of the course presents basic concepts of information privacy and delves into the main privacy technologies and metrics, as well as the anonymization algorithms used to prevent any disclosure of sensitive information about individuals.

Teachers

Person in charge

Javier Parra Arnau (javier.parra@upc.edu)
Silverio Juan Martínez Fernández (silverio.martinez@upc.edu)

Others

Esteve Pallares Segarra (esteve@entel.upc.edu)
Jordi Forne Muñoz (jforne@entel.upc.edu)
Santiago Del Rey Juarez (santiago.del.rey@upc.edu)
Víctor Rubio Jornet (victor.rubio.jornet@upc.edu)

Weekly hours

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Competences

Technical Competences

Technical competencies

CE1 - Skillfully use mathematical concepts and methods that underlie the problems of science and data engineering.

CE2 - To be able to program solutions to engineering problems: Design efficient algorithmic solutions to a given computational problem, implement them in the form of a robust, structured and maintainable program, and check the validity of the solution.

CE3 - Analyze complex phenomena through probability and statistics, and propose models of these types in specific situations. Formulate and solve mathematical optimization problems.

CE7 - Demonstrate knowledge and ability to apply the necessary tools for the storage, processing and access to data.

CE8 - Ability to choose and employ techniques of statistical modeling and data analysis, evaluating the quality of the models, validating and interpreting them.

Transversal Competences

Transversals

CT4 [Assessable] - Teamwork. Be able to work in an interdisciplinary team, either as a member or carrying out management tasks, with the aim of contributing to the development of projects with pragmatism and a sense of responsibility, making commitments that take into account the available resources.

Basic

CB2 - That the students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of study.

CB3 - That students have the ability to gather and interpret relevant data (usually within their area of study) to make judgments that include a reflection on relevant social, scientific or ethical issues.

CB5 - That the students have developed those learning skills necessary to undertake later studies with a high degree of autonomy.

Generic Technical Competences

Generic

CG1 - To design computer systems that integrate data of very diverse provenances and forms, build mathematical models with them, reason about these models and act accordingly, learning from experience.

CG2 - Choose and apply the most appropriate methods and techniques to a problem defined by data that represents a challenge for its volume, speed, variety or heterogeneity, including computer, mathematical, statistical and signal processing methods.

CG4 - Identify opportunities for innovative data-driven applications in evolving technological environments.

Objectives

1. Interpret the basic concepts of Software Engineering for ML systems, especially in relation to the use and exploitation of MLOps practices.
Related competences: CG1, CB2
2. Apply and analyze good software engineering practices related to data science and machine learning projects.
Related competences: CE1, CE2, CT4, CG1, CG4, CB2, CB5
3. Apply and analyze MLOps practices to build ML models, fostering reproducibility and quality assurance.
Related competences: CE1, CE2, CE3, CE7, CT4, CG1, CG4, CB2, CB5
4. Apply and analyze MLOps practices to deploy ML models, fostering API development.
Related competences: CE1, CE2, CE7, CE8, CG1, CG2, CB2, CB5
5. Understand the privacy risks associated with browsing and publishing data. Achieve a deeper understanding of the different privacy metrics and their application in different scenarios.
Related competences: CE1, CE3, CE8, CT4, CG2, CB3, CB5
6. Understand the main anonymization algorithms for statistical databases.
Related competences: CE1, CE2, CE3, CE8, CT4, CG1, CG2, CG4, CB2, CB3, CB5
7. Evaluate the trade-off between privacy and data usability.
Related competences: CE1, CE3, CE8, CT4, CG1, CG4, CB2, CB3, CB5
8. Understand the privacy risks in communications and anonymous communication systems.
Related competences: CE1, CE3, CE8, CG1, CG4, CB2, CB5

Contents

1. Introduction to Software Engineering
First, the traditional concept of software engineering is presented. Then, the impact of data availability on this traditional concept is analyzed, and the resulting software life cycle when data is taken into account is shown. Motivation of the need for software engineering for ML systems. Introduction to MLOps and key concepts. Requirements engineering for ML.
2. Good software engineering practices for data science and machine learning projects
The complexity and diversity of data science projects and machine learning systems call for engineering techniques to ensure they are built in a robust and future-proof manner. In this chapter we address software engineering best practices for data science projects and software that includes ML components.
3. MLOps practices to build ML models and manage the quality of the software and its development process
The complexity and diversity of data science projects and ML systems call for engineering techniques to ensure they are built in a robust and future-proof manner. In this chapter we address software engineering best practices for data science projects and software that includes ML components: version control systems; ML pipeline reproducibility and tracking; software measurement for ML; quality assurance and testing for ML, including environmental sustainability.
4. MLOps practices to deploy ML models
The complexity and diversity of ML systems call for engineering techniques to ensure they are deployed in a robust and production-ready manner. In this chapter we address software engineering best practices for ML components: software architecture for ML; deploying ML models; APIs for ML; packaging of ML components.
5. Introduction to data privacy and security
Motivation. Definition of basic concepts. Attackers and trusted parties. Privacy metrics.
6. Algorithms for data anonymization
Statistical disclosure control. Measuring the risk of disclosure. Microaggregation algorithms. Measurement of the privacy-utility trade-off. Case studies.
7. Privacy in personalised information systems
User profiles: a measure of privacy risk. Privacy-enhancing technologies.
8. Security and privacy in communications
Cryptographic algorithms. Authentication and key management. Anonymous communication systems.

Activities

Study of basic concepts of Software Engineering for ML systems (MLOps)

Objectives: 1
Contents:

1. Introduction to Software Engineering

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Study of good software engineering practices for data science and machine learning projects

Objectives: 2
Contents:

2. Good software engineering practices for data science and machine learning projects

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Study of MLOps practices to build ML models and software quality management and its development process

Objectives: 3
Contents:

3. MLOps practices to build ML models and manage the quality of the software and its development process

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Study of MLOps practices to deploy ML models

Objectives: 4
Contents:

4. MLOps practices to deploy ML models

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Practical development of a case study of MLOps practices in the context of ML-based systems

Students will progressively develop a practical project that allows them to exercise the basic concepts introduced in the theory part. It will be developed in teams of 4-5 students. The resulting software, duly documented, will be uploaded to a code repository. Each team will present a report, written in English, summarizing the main aspects of the project, for example, the process of building an ML component of an ML-based system, and an evaluation of the accuracy of the models and algorithms used.
Objectives: 2 3 4
Contents:

2. Good software engineering practices for data science and machine learning projects
3. MLOps practices to build ML models and manage the quality of the software and its development process
4. MLOps practices to deploy ML models

Theory

Problems

Laboratory

13h

Guided learning

Autonomous learning

31.5h

First partial exam: Software Engineering part (PARC1)

Evaluation of the first part of the course
Objectives: 1 2 3 4
Week: 7

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Final Exam (EXF)

This exam evaluates the two parts of the subject. Students who have failed either of the two partial exams are required to take it. The remaining students may also take it if they want to improve their grades.
Objectives: 1 2 3 4 5 6 7 8
Week: 15 (Outside class hours)

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Second partial exam: Data Privacy and Security part (PARC2)

Evaluation of the second part of the subject
Objectives: 5 6 7 8
Week: 14

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Study of introductory concepts on data privacy and security

Objectives: 5 6 7 8

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Practical development of data anonymization algorithms

Objectives: 6 7

Theory

Problems

Laboratory

15h

Guided learning

Autonomous learning

22.5h

Study of risks and privacy technologies for personalised information systems

Objectives: 5 7

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Study of mechanisms and technologies for communications security and privacy

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Teaching methodology

The theoretical contents of the course are taught in the theory classes. These classes are complemented with practical examples and problems that students must solve in the Autonomous Learning hours.

In the laboratory sessions, the knowledge acquired in the theory classes is consolidated by solving problems and carrying out practical assignments related to the theoretical contents. During the laboratory classes, the teacher will introduce new techniques and will leave a significant part of the class for the students to work on the proposed exercises.

Evaluation methodology

The evaluation is structured according to the two parts of the course: software engineering (PART1) and data privacy (PART2).

For the first part, the grade is calculated by weighting the grade of a theoretical exam (weight 40%) with the laboratory grade of this part of the course (weight 60%):
PART1 = 40% PARC1 + 60% LABO1 * IndivFactorLABO1
- PARC1: Exam at the end of the first part of the course.
- LABO1: Submission of the laboratory project of the first part of the course.
- IndivFactorLABO1: A multiplicative individual factor between 0.8 and 1.2 (in any case, it cannot raise LABO1 above 10). This factor is obtained from the teacher's evaluation of the student's participation in the project development and from the teammates' evaluation of that same participation. In truly exceptional situations, IndivFactorLABO1 can be less than 0.8 for students whose participation in the project throughout the course has been very low.

For the second part, the grade is calculated by weighting the grade of a theoretical exam (weight 50%) with the grade of the practical work of this part of the course (weight 50%). That is,

PART2 = 50% PARC2 + 50% LABO2, where

- PARC2: Exam at the end of the second part of the course.
- LABO2: Submission of the practical assignments of the second part of the course.

The final grade of the course, NOTA-FIN, is calculated as the arithmetic mean of the two parts of the course:
NOTA-FIN = 50% PART1 + 50% PART2.
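
The formulas above can be sketched in a few lines of Python. This is only an illustrative sketch, not an official grade calculator: the function names and the example marks are hypothetical, and it assumes that IndivFactorLABO1 multiplies LABO1 only and that the adjusted laboratory grade is capped at 10, as the description of the factor suggests.

```python
def part1_grade(parc1: float, labo1: float, indiv_factor: float = 1.0) -> float:
    """PART1 = 40% PARC1 + 60% (LABO1 * IndivFactorLABO1).

    Assumption: the factor is applied to LABO1 only, and the adjusted
    laboratory grade is capped at 10.
    """
    adjusted_lab = min(labo1 * indiv_factor, 10.0)
    return 0.4 * parc1 + 0.6 * adjusted_lab


def part2_grade(parc2: float, labo2: float) -> float:
    """PART2 = 50% PARC2 + 50% LABO2."""
    return 0.5 * parc2 + 0.5 * labo2


def final_grade(part1: float, part2: float) -> float:
    """NOTA-FIN = 50% PART1 + 50% PART2."""
    return 0.5 * part1 + 0.5 * part2


# Hypothetical marks, on a 0-10 scale, for illustration only.
p1 = part1_grade(parc1=6.0, labo1=8.0, indiv_factor=1.1)
p2 = part2_grade(parc2=5.0, labo2=7.0)
print(round(p1, 2), round(p2, 2), round(final_grade(p1, p2), 2))
```

With these hypothetical marks, the adjusted laboratory grade is 8.8, so PART1 is 7.68, PART2 is 6.0 and NOTA-FIN is 6.84.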

The course will be considered passed when both of the following conditions are met:
* The NOTA-FIN is equal to or higher than 5.
* The student has attended the partial exams PARC1 and PARC2 and has submitted the majority of the reports corresponding to LABO1 and LABO2.

Students may only be evaluated in a final exam if NOTA-FIN < 5. In this case, the final exam will cover only the contents corresponding to the partial exams that have not been passed. Thus, if PARC1 and/or PARC2 are equal to or higher than 5, the material assessed in those partial exams will be considered passed and will not be included in the final exam.

On the other hand, if during the grading of an assignment or laboratory exercise the instructor has well-founded doubts about the information or authorship of the work submitted by the student (such as incomplete problems, results that are not logically justified, or extreme discrepancies in level), the publication of the provisional grade will be withheld and the student will be called to an oral verification session before issuing the final grade.

Bibliography

Basic

Machine Learning in Production: From Models to Products - Kästner, Christian, MIT Press, 2025. ISBN: 9780262049726
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991005330527706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Statistical disclosure control for microdata: methods and applications in R - Templ, M, Springer International Publishing AG, 2017. ISBN: 9783319502724
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991001685219706711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Complementary

SEET@ICSE - Lanubile, Filippo; Martínez-Fernández, Silverio; Quaranta, Luigi, SEET@ICSE, 2023.
https://doi.org/10.1109/ICSE-SEET58685.2023.00015
IEEE Software - Lanubile, Filippo; Martínez-Fernández, Silverio; Quaranta, Luigi, IEEE Software, 2024. ISSN: 0740-7459
https://doi.org/10.1109/MS.2023.3310768
Reliable Machine Learning - Chen, Cathy, O'Reilly Media, Inc., 2022. ISBN: 1098106172
https://ebookcentral-proquest-com.recursos.biblioteca.upc.edu/lib/upcatalunya-ebooks/detail.action?pq-origsite=primo&docID=30130756
Data privacy: foundations, new developments and the big data challenge - Torra i Reventós, V, Springer International Publishing, 2017. ISBN: 9783319573564
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004122599706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Advanced research in data privacy - Navarro-Arribas, G.; Torra i Reventós, V. (eds.), Springer International Publishing, 2015. ISBN: 9783319098852
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004048289706711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Previous capacities

Those provided by the subjects of the previous quarters of the degree.