Advanced Multivariate Analysis

Credits

6

Types

Elective

Requirements

This subject has no formal requirements, but it assumes some prior knowledge (see "Previous capacities").

Department

EIO

The course starts by covering advanced multivariate statistical methods that have proved their utility in unsupervised learning: nonparametric multivariate density estimation, clustering based on density estimation, and nonlinear dimensionality reduction (also known as manifold learning: nonlinear and nonparametric generalizations of principal component analysis, PCA, and multidimensional scaling, MDS).
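As an illustration of nonparametric multivariate density estimation, the following minimal sketch builds a kernel density estimator with a product Gaussian kernel. It is not taken from the course materials: the function name `gaussian_kde`, the simulated two-cluster data and the bandwidth values are all assumptions chosen for the example, and the course itself works primarily in R.

```python
import math
import random


def gaussian_kde(data, bandwidths):
    """Return a density estimate built from a product Gaussian kernel.

    data: list of d-dimensional points (tuples or lists of floats)
    bandwidths: one positive smoothing parameter per coordinate
    """
    n = len(data)
    d = len(bandwidths)
    # Normalising constant of the d-dimensional product kernel.
    norm = n * math.prod(bandwidths) * (2 * math.pi) ** (d / 2)

    def density(x):
        total = 0.0
        for point in data:
            # Sum of squared, bandwidth-scaled coordinate differences.
            z2 = sum(((xj - pj) / hj) ** 2
                     for xj, pj, hj in zip(x, point, bandwidths))
            total += math.exp(-0.5 * z2)
        return total / norm

    return density


# Simulated data (an assumption for this sketch): two clusters in the plane.
rng = random.Random(0)
sample = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
sample += [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(200)]

f_hat = gaussian_kde(sample, bandwidths=(0.5, 0.5))
print(f_hat((0, 0)), f_hat((3, 3)), f_hat((6, 6)))
```

The estimated density is high near the two cluster centres and low in the valley between them, which is the idea exploited by clustering methods based on density estimation.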

The second part explores nonparametric multivariate statistical modelling for supervised learning, with the objective of extending the classical multiple Linear Model (LM) and the Generalized Linear Model (GLM) in flexibility and predictive power without losing interpretability. Here the Additive Model and the Generalized Additive Model (GAM) are introduced. Model selection and validation are emphasized.
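To make the role of model selection concrete, here is a minimal sketch of choosing a smoothing parameter by leave-one-out cross-validation. It assumes a Nadaraya-Watson (local constant) smoother with a Gaussian kernel, simulated data and an arbitrary grid of candidate bandwidths; none of these choices comes from the course materials, and the helper names `nw_predict` and `loocv_score` are hypothetical.

```python
import math
import random


def nw_predict(x0, xs, ys, h, skip=None):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel and bandwidth h.

    If skip is an index, that observation is left out (leave-one-out CV).
    Assumes h is large enough that the kernel weights do not all underflow.
    """
    num = den = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = math.exp(-0.5 * ((x0 - xi) / h) ** 2)
        num += w * yi
        den += w
    return num / den


def loocv_score(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    errors = [(ys[i] - nw_predict(xs[i], xs, ys, h, skip=i)) ** 2
              for i in range(len(xs))]
    return sum(errors) / len(errors)


# Simulated data (an assumption for this sketch): noisy sine on a regular grid.
rng = random.Random(1)
n = 100
xs = [i / (n - 1) for i in range(n)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]

candidates = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
scores = {h: loocv_score(xs, ys, h) for h in candidates}
best_h = min(scores, key=scores.get)
print(best_h, scores[best_h])
```

Very small bandwidths follow the noise and very large ones flatten the curve; the cross-validation score balances fit to the observed sample against generalization.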

The last part of the course covers Interpretable Machine Learning (IML). Machine learning models are increasingly accurate in their predictions, but the gains in predictive performance are often achieved at the cost of greater model complexity, which is why such models are often called "black boxes". As machine learning algorithms become more ubiquitous and more complex, more and more voices are demanding to understand how and why these algorithms make their decisions. In response, a whole literature has appeared in recent years (known as "Interpretable Machine Learning", IML, or "eXplainable Artificial Intelligence", XAI) whose purpose is to provide transparency and interpretability to automatic algorithms in order to gain the trust of potential users. We will introduce some of the current IML tools, describe how to use them in practice through examples (implemented in R and Python) and present their theoretical foundations. We will also see that Multivariate Analysis techniques can help to develop interpretability tools.
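One simple model-agnostic tool of this kind is permutation-based variable importance: shuffle one feature at a time and measure how much the prediction error grows. The sketch below is only an illustration under stated assumptions: `black_box` is a hypothetical stand-in for any fitted model, the data are simulated, and mean squared error is an arbitrary choice of loss.

```python
import random


def mse(model, X, y):
    """Mean squared error of the model predictions on (X, y)."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data set with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical "black box": any function mapping a feature list to a prediction.
def black_box(row):
    return 3.0 * row[0] + 0.5 * row[1] ** 2  # ignores row[2] entirely


# Simulated data (an assumption for this sketch).
rng = random.Random(42)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(300)]
y = [black_box(row) + rng.gauss(0, 0.1) for row in X]

importances = permutation_importance(black_box, X, y)
print(importances)
```

A feature the model never uses gets zero importance, because shuffling it leaves every prediction unchanged.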

A fundamental part of the course is the study of real cases, both by the teacher in class and by students in the weekly assignments.

Teachers

Person in charge

Pedro Delicado Useros

Weekly hours

Theory: 4
Problems: 0
Laboratory: 0
Guided learning: 0
Autonomous learning: 7.1

Competences

Transversal Competences

Information literacy

CT4 - Capacity for managing the acquisition, the structuring, analysis and visualization of data and information in the field of specialisation, and for critically assessing the results of this management.

Third language

CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.

Basic

CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
CB10 - Possess and understand knowledge that provides a basis or opportunity to be original in the development and/or application of ideas, often in a research context.

Generic Technical Competences

Generic

CG2 - Identify and apply methods of data analysis, knowledge extraction and visualization for data collected in disparate formats

Technical Competences

Specific

CE3 - Apply data integration methods to solve data science problems in heterogeneous data environments
CE5 - Model, design, and implement complex data systems, including data visualization
CE6 - Design the Data Science process and apply scientific methodologies to obtain conclusions about populations and make decisions accordingly, from both structured and unstructured data and potentially stored in heterogeneous formats.
CE8 - Extract information from structured and unstructured data by considering their multivariate nature.
CE10 - Identify machine learning and statistical modeling methods to use and apply them rigorously in order to solve a specific data science problem
CE13 - Identify the main threats related to ethics and data privacy in a data science project (both in terms of data management and analysis) and develop and implement appropriate measures to mitigate these threats

Objectives

1. Know the structure of the main unsupervised learning problems.
Related competences: CT4, CT5, CE10
2. Learn different methods for dimensionality reduction when the standard assumptions of classical Multivariate Analysis are not fulfilled.
Related competences: CT4, CT5, CG2, CE3, CE5, CE6, CE8, CE10, CB6, CB10
3. Learn how to combine dimensionality reduction techniques with prediction algorithms.
Related competences: CT5, CG2, CE6, CE8, CE10, CB10
4. At the end of the course the student will be able to propose, estimate, interpret and validate nonparametric versions of linear regression models and generalized linear models.
Related competences: CT4, CT5, CG2, CE5, CE6, CE8, CE10, CB10
5. At the end of the course the student will know how to choose the smoothing parameters that, in nonparametric regression models, control the trade-off between a good fit to the observed sample and good generalization.
Related competences: CT4, CT5, CG2, CE5, CE6, CE8, CE10, CB10
6. At the end of the course the student will be aware of the need to provide interpretability to machine learning algorithms, will know the most common interpretability techniques, how to classify them and how they relate to one another, and will know how to use them in R and/or Python.
Related competences: CT4, CT5, CG2, CE6, CE8, CE13, CB7

Contents

1. Unsupervised Learning through Advanced Multivariate Analysis
   a. Introduction to Unsupervised Learning.
   b. Density estimation.
   c. Clustering.
      i. Mixture models.
      ii. DBSCAN.
   d. Nonlinear dimensionality reduction.
      i. Principal curves.
      ii. Local Multidimensional Scaling.
      iii. ISOMAP.
      iv. t-distributed Stochastic Neighbor Embedding (t-SNE).
2. Nonparametric regression models
   a. Nonparametric regression model. Local polynomial regression. Linear smoothers. Choosing the smoothing parameter.
   b. Generalized nonparametric regression model. Estimation by maximum local likelihood.
   c. Spline smoothing. Penalized least squares nonparametric regression. Cubic splines and interpolation. Smoothing splines. B-splines. Fitting generalized nonparametric regression models with splines.
   d. Multiple (generalized) nonparametric regression. The curse of dimensionality. Additive Models and Generalized Additive Models.
3. Interpretable Machine Learning
   a. Introduction to interpretability in machine learning.
      i. Transparent models versus black-box models.
      ii. Global methods (relevance of variables) versus local methods (explainability).
   b. Interpretability methods for specific models.
      i. Random forests.
      ii. Neural networks.
   c. Model-agnostic interpretability methods.
      i. Global methods (Variable importance through perturbations. Importance based on the Shapley value. Partial dependence plots. Accumulated local effects plots.)
      ii. Local methods (LIME: Local Interpretable Model-agnostic Explanations. Local importance based on the Shapley value. SHAP: SHapley Additive exPlanations. Break-down plots. ICE: Individual Conditional Expectation, or ceteris paribus, plots.)
   d. Interpretability in deep learning for images.
      i. Gradient-based methods (Grad-CAM, saliency maps).
      ii. Perturbation-based methods (LIME for images, SHAP's DeepExplainer).
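Two of the model-agnostic tools listed in the contents, individual conditional expectation (ICE) curves and partial dependence, are closely related: the partial dependence curve is the pointwise average of the ICE curves. The following minimal sketch illustrates this under stated assumptions: `black_box` is a hypothetical model with an interaction, the data are simulated, and the grid is arbitrary.

```python
import random


def ice_curves(model, X, feature, grid):
    """One ICE curve per observation: predictions as `feature` sweeps over
    `grid` while the other features keep their observed values."""
    curves = []
    for row in X:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature] = value
            curve.append(model(modified))
        curves.append(curve)
    return curves


def partial_dependence(model, X, feature, grid):
    """Partial dependence: the pointwise average of the ICE curves."""
    curves = ice_curves(model, X, feature, grid)
    return [sum(c[k] for c in curves) / len(curves) for k in range(len(grid))]


# Hypothetical black box with an interaction between features 0 and 1.
def black_box(row):
    return row[0] * row[1] + 2.0 * row[0]


# Simulated data (an assumption for this sketch).
rng = random.Random(7)
X = [[rng.uniform(-1, 1), rng.choice([-1.0, 1.0])] for _ in range(200)]
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]

ice = ice_curves(black_box, X, feature=0, grid=grid)
pdp = partial_dependence(black_box, X, feature=0, grid=grid)
print(pdp)
```

In this example the ICE curves have slope 1 or 3 depending on the second feature, while the partial dependence curve shows only their average slope. This is why ICE curves are useful for detecting interactions that a partial dependence plot hides.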

Activities

Unsupervised Learning through Advanced Multivariate Analysis
Objectives: 1, 2, 3
Contents: 1. Unsupervised Learning through Advanced Multivariate Analysis
Theory: 18h
Problems: 0h
Laboratory: 0h
Guided learning: 0h
Autonomous learning: 34.3h

Nonparametric regression models
Objectives: 4, 5
Contents: 2. Nonparametric regression models
Theory: 20h
Problems: 0h
Laboratory: 0h
Guided learning: 0h
Autonomous learning: 34.3h

Interpretable Machine Learning
Objectives: 6
Contents: 3. Interpretable Machine Learning
Theory: 16h
Problems: 0h
Laboratory: 0h
Guided learning: 0h
Autonomous learning: 27.3h

Teaching methodology

There are two weekly 2-hour sessions.
The first three hours are devoted to the exposition of the theoretical subjects by the teacher.
The last hour is dedicated to putting these contents into practice: each student brings a laptop to class and performs the tasks proposed by the teacher.
Each week ends with an assignment that students must submit within 7 days. The software used will be primarily R.

Evaluation methodology

Homework will be assigned during the course. Homework grades are worth 40% of the course grade.

An exam at the end of the semester will evaluate the assimilation of the basic concepts of the whole subject. The final exam has a short first theoretical part (closed book) and a longer second practical part (open book, to be done by students on their own laptops, with a structure similar to that of the homework).

Course Grade = 0.4 * Hwk Grade + 0.6 * Exam Grade
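
As a worked example of the formula above, the short sketch below computes the course grade for a hypothetical student. The 0-10 scale and the two input grades are assumptions made for illustration only. With a homework grade of 8 and an exam grade of 6, the formula gives 0.4 × 8 + 0.6 × 6 = 6.8.

```python
def course_grade(homework_grade, exam_grade):
    """Weighted course grade: 40% homework, 60% final exam."""
    return 0.4 * homework_grade + 0.6 * exam_grade


# Hypothetical grades, assuming a 0-10 scale.
print(round(course_grade(homework_grade=8.0, exam_grade=6.0), 2))
```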

Bibliography

Basic:

The Elements of statistical learning : data mining, inference, and prediction - Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome, Springer, cop. 2009. ISBN: 9780387952840
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003549679706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
All of nonparametric statistics - Wasserman, Larry, Springer, cop. 2010. ISBN: 9781441920447
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003728809706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Generalized additive models : an introduction with R - Wood, Simon N, CRC Press/Taylor & Francis Group, [2017]. ISBN: 9781498728331
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004129709706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Explanatory model analysis: explore, explain and examine predictive models - Biecek, P.; Burzykowski, T, Oxford University Press, 2018. ISBN: 9780367135591
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004922848206711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Statistical foundations of data science - Fan, Jianqing; Li, Runze; Zhang, Cun-hui; Zou, Hui, Oxon : CRC Press, 2020. ISBN: 9781466510845
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991005054179106711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Previous capacities

- Principal Component Analysis, Multidimensional Scaling and Clustering, at the level covered by the compulsory subject "Multivariate Analysis" (first year of the MDS programme).
- Knowledge of the statistical software R and RStudio.
