Advanced Topics in Data Engineering II

Credits
6
Types
Compulsory
Requirements
This subject has no formal requirements, but it does assume previous capacities
Department
ETSETB;FIB;FME;ESSI;ENTEL
The course is structured in two parts: the study of data-driven software engineering and the study of data privacy and security.

1. Software Engineering. Large volumes of data are now available from both the development of software systems and their use. This data can be exploited in various stages and activities of software engineering and, going further, it defines a new approach that treats data as the cornerstone of the software life cycle. The first part of the course presents this new vision of software engineering and delves into the emerging software engineering practices and tools that automate the construction of ML-enabled components and the end-to-end ML component life cycle, from model building to production deployment.

2. Data privacy and security. Data analysis techniques can help to obtain information to anticipate various problems, identify their sources and implement solutions, in contexts as varied as business competitiveness, marketing, social relations, transport, health, education and politics. However, while data analysis is extremely valuable, it also has a crucial drawback: it increasingly invades the privacy of the people about whom data is collected. The second part of the course presents basic concepts of information privacy and delves into the main privacy technologies and metrics, as well as the anonymization algorithms used to prevent the disclosure of sensitive information about individuals.

Teachers

Person in charge

  • Silverio Juan Martínez Fernández

Others

  • Esteve Pallares Segarra
  • Javier Parra Arnau
  • Jordi Forne Muñoz
  • Santiago Del Rey Juarez

Weekly hours

Theory
2
Problems
0
Laboratory
2
Guided learning
0
Autonomous learning
6

Competences

Technical Competences

Technical competencies

  • CE1 - Skillfully use mathematical concepts and methods that underlie the problems of science and data engineering.
  • CE2 - To be able to program solutions to engineering problems: Design efficient algorithmic solutions to a given computational problem, implement them in the form of a robust, structured and maintainable program, and check the validity of the solution.
  • CE3 - Analyze complex phenomena through probability and statistics, and propose models of these types in specific situations. Formulate and solve mathematical optimization problems.
  • CE7 - Demonstrate knowledge and ability to apply the necessary tools for the storage, processing and access to data.
  • CE8 - Ability to choose and employ techniques of statistical modeling and data analysis, evaluating the quality of the models, validating and interpreting them.

Transversal Competences

Transversals

  • CT4 [Assessable] - Teamwork. Be able to work in an interdisciplinary team, either as a regular member or carrying out management tasks, with the aim of contributing to the development of projects with pragmatism and a sense of responsibility, and making commitments that take the available resources into account.

Basic

  • CB2 - That the students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of study.
  • CB3 - That students have the ability to gather and interpret relevant data (usually within their area of study) to make judgments that include a reflection on relevant social, scientific or ethical issues.
  • CB5 - That the students have developed those learning skills necessary to undertake later studies with a high degree of autonomy.

Generic Technical Competences

Generic

  • CG1 - To design computer systems that integrate data of very diverse provenances and forms, create mathematical models from them, reason on these models and act accordingly, learning from experience.
  • CG2 - Choose and apply the most appropriate methods and techniques to a problem defined by data that represents a challenge for its volume, speed, variety or heterogeneity, including computer, mathematical, statistical and signal processing methods.
  • CG4 - Identify opportunities for innovative data-driven applications in evolving technological environments.

Objectives

  1. Interpret the basic concepts of Software Engineering for ML systems, especially in relation to the use and exploitation of MLOps practices.
    Related competences: CG1, CB2
  2. Apply and analyze good software engineering practices related to data science and machine learning projects.
    Related competences: CE1, CE2, CT4, CG1, CG4, CB2, CB5
  3. Apply and analyze MLOps practices to build ML models, fostering reproducibility and quality assurance.
    Related competences: CE1, CE2, CE3, CE7, CT4, CG1, CG4, CB2, CB5
  4. Apply and analyze MLOps practices to deploy ML models, fostering API development.
    Related competences: CE1, CE2, CE7, CE8, CG1, CG2, CB2, CB5
  5. Understand the privacy risks associated with browsing and publishing data, and gain a deeper understanding of the different privacy metrics and their application in different scenarios.
    Related competences: CE1, CE3, CE8, CT4, CG2, CB3, CB5
  6. Understand the main anonymization algorithms for statistical databases.
    Related competences: CE1, CE2, CE3, CE8, CT4, CG1, CG2, CG4, CB2, CB3, CB5
  7. Evaluate the trade-off between privacy and data usability.
    Related competences: CE1, CE3, CE8, CT4, CG1, CG4, CB2, CB3, CB5
  8. Understand the privacy risks in communications and the anonymous communication systems.
    Related competences: CE1, CE3, CE8, CG1, CG4, CB2, CB5

Contents

  1. Introduction to Software Engineering
    First, the traditional concept of software engineering is presented. Then, the impact of data availability on this traditional concept is analyzed, and the software life cycle that results when data is taken into account is shown. Motivation for the need for software engineering for ML systems. Introduction to MLOps and key concepts. Requirements engineering for ML.
  2. Good software engineering practices for data science and machine learning projects
    The complexity and diversity of data science projects and machine learning systems call for engineering techniques to ensure they are built in a robust and future-proof manner. In this chapter we address software engineering best practices for data science projects and for software that includes ML components.
  3. MLOps practices to build ML models and manage the quality of the software and its development process
    The complexity and diversity of data science projects and ML systems call for engineering techniques to ensure they are built in a robust and future-proof manner. In this chapter we address software engineering best practices for data science projects and for software that includes ML components: version control systems; ML pipeline reproducibility and tracking; software measurement for ML; quality assurance for ML.
  4. MLOps practices to deploy ML models
    The complexity and diversity of ML systems call for engineering techniques to ensure they are deployed in a robust and production-ready manner. In this chapter we address software engineering best practices for ML components: software architecture for ML; deploying ML models; APIs for ML.
  5. Introduction to data privacy and security
    Motivation. Definition of basic concepts. Attackers and trusted parties. Privacy metrics.
  6. Algorithms for data anonymization
    Statistical disclosure control. Measurement of disclosure risk. Microaggregation algorithms. Measurement of the privacy-utility trade-off. Case studies.
  7. Privacy in personalised information systems
    User profiles: a measure of privacy risk. Privacy-enhancing technologies.
  8. Security and privacy in communications
    Cryptographic algorithms. Authentication and key management. Anonymous communication systems.

Activities

Activity Evaluation act


Study of basic concepts of Software Engineering for ML systems (MLOps)


Objectives: 1
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
2h

Study of good software engineering practices for data science and machine learning projects


Objectives: 2
Contents:
Theory
4h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
2h

Study of MLOps practices to build ML models and manage the quality of the software and its development process


Objectives: 3
Contents:
Theory
4h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
2h

Study of MLOps practices to deploy ML models


Objectives: 4
Contents:
Theory
3h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
2h

Practical development of a case study of MLOps practices in the context of ML-based systems

Students will progressively develop a practical project that allows them to exercise the basic concepts introduced in the theory part. It will be developed in teams of 4-5 students. The resulting software, duly documented, will be uploaded to a code repository. Each team will submit a report, written in English, summarizing the main aspects of the project, for example, the process of building an ML component of an ML-based system, and an evaluation of the accuracy of the models and algorithms used.
Objectives: 2 3 4
Contents:
Theory
0h
Problems
0h
Laboratory
13h
Guided learning
0h
Autonomous learning
31.5h

First partial exam: Software Engineering part (PARC1)

Evaluation of the first part of the course
Objectives: 1 2 3 4
Week: 7
Theory
1.5h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
5.5h

Final Exam (EXF)

This exam evaluates the two parts of the subject. Students who have failed either of the two partial exams are required to take it. The remaining students may also take it if they want to improve their grades.
Objectives: 1 2 3 4 5 6 7 8
Week: 15 (Outside class hours)
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
0h

Second partial exam: Data Privacy and Security part (PARC2)

Evaluation of the second part of the subject
Objectives: 5 6 7 8
Week: 14
Theory
1.5h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
5.5h

Study of introductory concepts on data privacy and security


Objectives: 5 6 7 8
Theory
4h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
5h

Practical development of data anonymization algorithms


Objectives: 6 7
Theory
0h
Problems
0h
Laboratory
15h
Guided learning
0h
Autonomous learning
22.5h

Study of risks and privacy technologies for personalised information systems


Objectives: 5 7
Theory
4h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
6h

Study of mechanisms and technologies for communications security and privacy



Theory
4h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
6h

Teaching methodology

The theoretical contents of the course are taught in the theory classes. These classes are complemented with practical examples and problems that students must solve in the Autonomous Learning hours.

In the laboratory sessions, the knowledge acquired in the theory classes is consolidated by solving problems and developing practical assignments related to the theoretical contents. During the laboratory classes, the teacher will introduce new techniques and will leave a substantial part of the class for the students to work on the proposed exercises.

Evaluation methodology

The evaluation is structured according to the two parts of the course: software engineering (PART1) and data privacy (PART2).

For the first part, the grade is calculated by weighting the grade of a theoretical exam (weight 40%) with the grade of the laboratory project of this part of the subject (weight 60%):
PART1 = 40% PARC1 + 60% LABO1 * IndivFactorLABO1
- PARC1: Examination at the end of the first part of the course.
- LABO1: Delivery of the laboratory project of the first part of the course.
- IndivFactorLABO1: The individual factor is a multiplicative factor between 0.8 and 1.2 (in any case, it cannot raise the LABO1 grade above 10). This factor is obtained from the teacher's evaluation of the student's participation in the project development and from the teammates' evaluation of that same participation. In truly exceptional situations, the factor can be less than 0.8 for students whose participation in the project has been very low throughout the course.

For the second part, the grade is calculated by weighting the grade of a theoretical exam (weight 50%) with the grade of the practical work of this part of the subject (weight 50%):
PART2 = 50% PARC2 + 50% LABO2
- PARC2: Examination at the end of the second part of the course.
- LABO2: Delivery of the practical assignments of the second part of the course.

The final grade of the course, NOTA-FIN, is calculated as the arithmetic mean of the two parts of the course:
NOTA-FIN = 50% PART1 + 50% PART2
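
As an illustration, the grade computation described above can be sketched in Python. This is only a minimal sketch, not an official calculator: the function names are hypothetical, grades are assumed to be on a 0-10 scale, and the accepted range for the individual factor (0 to 1.2, so that exceptional values below 0.8 remain possible) is an assumption.

```python
def part1_grade(parc1: float, labo1: float, indiv_factor: float = 1.0) -> float:
    """PART1 = 40% PARC1 + 60% LABO1 * IndivFactorLABO1."""
    # Assumption: the factor is normally in [0.8, 1.2]; values below 0.8
    # are allowed because the guide permits them in exceptional cases.
    if not 0.0 <= indiv_factor <= 1.2:
        raise ValueError("indiv_factor must lie in [0, 1.2]")
    # The factor cannot raise the lab grade above 10.
    adjusted_labo1 = min(labo1 * indiv_factor, 10.0)
    return 0.4 * parc1 + 0.6 * adjusted_labo1


def part2_grade(parc2: float, labo2: float) -> float:
    """PART2 = 50% PARC2 + 50% LABO2."""
    return 0.5 * parc2 + 0.5 * labo2


def final_grade(part1: float, part2: float) -> float:
    """NOTA-FIN = 50% PART1 + 50% PART2."""
    return 0.5 * part1 + 0.5 * part2


if __name__ == "__main__":
    # Made-up example grades, for illustration only.
    p1 = part1_grade(parc1=6.0, labo1=8.0, indiv_factor=1.1)
    p2 = part2_grade(parc2=7.0, labo2=9.0)
    print(round(p1, 2), round(p2, 2), round(final_grade(p1, p2), 2))
```

With these made-up grades, the lab grade of 8.0 is scaled to 8.8 by the individual factor before the 60% weight is applied; a lab grade of 9.5 with a factor of 1.2 would be capped at 10.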


Students who do not pass the course through the partial exams are evaluated by a final exam, in which they are exempt from the parts whose partial exams they have already passed.

Previous capacities

Those acquired in the subjects of the previous terms of the degree.