Advanced Multivariate Analysis

Credits
6
Types
Elective
Requirements
This subject has no prerequisites, but some previous capacities are expected
Department
EIO
The course starts by covering advanced multivariate statistical methods that have proved their utility in unsupervised learning: nonparametric multivariate density estimation, clustering based on density estimation, nonlinear dimensionality reduction (or manifold learning: nonlinear and nonparametric generalizations of principal component analysis, PCA, and multidimensional scaling, MDS), and dimensionality reduction with sparsity (another way to extend PCA and MDS).
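To make the first of these topics concrete: although the course works mainly in R, a minimal sketch of nonparametric multivariate density estimation in Python (assuming numpy and scipy are available; the data here are simulated, not course material) could look like this:

```python
# Kernel density estimation on a simulated bimodal 2-D sample.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Two Gaussian clusters centred at (0, 0) and (4, 4)
x = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])

kde = gaussian_kde(x.T)            # gaussian_kde expects shape (d, n)
grid = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])
dens = kde(grid.T)                 # density estimate at each grid point
# The estimated density is higher at the two cluster centres than
# at the saddle point between them.
```

Density-based clustering, as covered in the course, would then assign points to the modes of such an estimate.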

The second part explores nonparametric multivariate statistical modelling for supervised learning, with the objective of extending the classical multiple Linear Model (LM) and Generalized Linear Model (GLM) in flexibility and predictive power without losing interpretability. Here the Additive Model and the Generalized Additive Model (GAM) are introduced. Model selection and validation are emphasized.
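As a taste of this part, the simplest nonparametric regression estimator, the Nadaraya-Watson kernel smoother, can be written in a few lines. This is an illustrative sketch in Python with numpy (the bandwidth `h` is the smoothing parameter that controls the fit/generalization trade-off studied in the course):

```python
# Nadaraya-Watson (local constant) kernel regression with a Gaussian kernel.
import numpy as np

def nw_smoother(x_train, y_train, x_eval, h):
    """Kernel-weighted average of y_train around each point of x_eval."""
    # Pairwise Gaussian weights, shape (len(x_eval), len(x_train))
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / h) ** 2)
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(0, 0.2, 200)    # noisy observations of sin(x)

y_hat = nw_smoother(x, y, x, h=0.3)
# y_hat tracks the true regression function sin(x) more closely
# than the raw noisy responses do.
```

Choosing `h` by cross-validation, and replacing the local constant by a local polynomial, are exactly the refinements treated in this part of the course.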

The third part of the course is devoted to Functional Data Analysis (FDA), a corpus of statistical methods and tools that extend multivariate analysis (MVA) from finite to infinite dimension. In FDA each observation is a complete function (an element of an infinite-dimensional space) instead of a vector (as in MVA), so FDA can be thought of as the statistical analysis of samples of curves. In the last two decades, many standard statistical methods have been adapted to functional data: regression models (LM, GLM, nonparametric regression, ...), multivariate analysis (PCA, MDS, clustering, depth measures, ...), time series, and spatial statistics, among others. At the same time, FDA methods have been applied quite broadly in medicine, science, business, engineering, demography, the social sciences, and other fields.
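A minimal sketch of one FDA idea, functional principal components: curves observed on a common grid are stored as rows of a matrix and PCA is applied to the discretized curves (a real analysis would first smooth the curves or represent them in a basis; the simulated data below are only illustrative):

```python
# Functional PCA on discretized simulated curves.
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 100)                 # common evaluation grid
n = 50
# Each curve = mean function + random amplitude of one mode + small noise
scores = rng.normal(0, 1, n)
curves = (np.sin(2 * np.pi * t)
          + scores[:, None] * np.cos(2 * np.pi * t)
          + rng.normal(0, 0.05, (n, 100)))

mean_curve = curves.mean(axis=0)           # functional mean
centered = curves - mean_curve
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

explained = s ** 2 / np.sum(s ** 2)
# The single simulated mode of variation dominates the decomposition,
# and Vt[0] recovers (up to sign) the cosine shape of that mode.
```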

The last part of the course covers Interpretable Machine Learning (IML). Machine learning models are increasingly accurate in their predictions, but the improvements in predictive performance are often achieved at the cost of increasing model complexity, which is why such models are frequently referred to as "black boxes". As machine learning algorithms grow in ubiquity and complexity, more and more voices demand to understand how and why these algorithms make their decisions. In response to this demand, a whole literature has appeared in recent years (known as "Interpretable Machine Learning" or "eXplainable Artificial Intelligence", IML or XAI) whose purpose is to provide transparency and interpretability to automatic algorithms in order to gain the trust of potential users. We will introduce some of the current IML tools, describe how to use them in practice through examples (implemented in R and Python) and show their theoretical foundations. We will see that Multivariate Analysis techniques can help to develop interpretability tools.
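One of the model-agnostic IML tools covered in this part, permutation variable importance, can be sketched in Python with scikit-learn (an assumed dependency; the data and model are illustrative):

```python
# Permutation importance of a black-box model's input variables.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 300)   # only feature 0 is informative

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# Shuffling the informative feature degrades predictions the most,
# so its mean importance is the largest of the three.
```

Note that the same idea applies to any fitted model, which is what "model-agnostic" means here.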

A fundamental part of the course is the study of real cases, both by the teacher and by the students in the weekly assignments.

Teachers

Person in charge

  • Pedro Delicado Useros

Weekly hours

Theory
3
Problems
0
Laboratory
1
Guided learning
0
Autonomous learning
7.1

Competences

Transversal Competences

Information literacy

  • CT4 - Capacity for managing the acquisition, the structuring, analysis and visualization of data and information in the field of specialisation, and for critically assessing the results of this management.

Third language

  • CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.

Basic

  • CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
  • CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
  • CB10 - Possess and understand knowledge that provides a basis or opportunity to be original in the development and/or application of ideas, often in a research context.

Generic Technical Competences

Generic

  • CG2 - Identify and apply methods of data analysis, knowledge extraction and visualization for data collected in disparate formats

Technical Competences

Specific

  • CE3 - Apply data integration methods to solve data science problems in heterogeneous data environments
  • CE5 - Model, design, and implement complex data systems, including data visualization
  • CE6 - Design the Data Science process and apply scientific methodologies to obtain conclusions about populations and make decisions accordingly, from both structured and unstructured data and potentially stored in heterogeneous formats.
  • CE8 - Extract information from structured and unstructured data by considering their multivariate nature.
  • CE10 - Identify machine learning and statistical modeling methods to use and apply them rigorously in order to solve a specific data science problem
  • CE13 - Identify the main threats related to ethics and data privacy in a data science project (both in terms of data management and analysis) and develop and implement appropriate measures to mitigate these threats

Objectives

  1. Know the structure of the main unsupervised learning problems.
    Related competences: CT4, CT5, CE10,
  2. Learn different methods for dimensionality reduction when the standard assumptions in classical Multivariate Analysis are not fulfilled
    Related competences: CT4, CT5, CG2, CE3, CE5, CE6, CE8, CE10, CB10,
  3. Learn how to combine dimensionality reduction techniques with prediction algorithms
    Related competences: CT5, CG2, CE6, CE8, CE10, CB10,
  4. At the end of the course the student will be able to propose, estimate, interpret and validate non-parametric versions of linear regression models and generalized linear models.
    Related competences: CT4, CT5, CG2, CE5, CE6, CE8, CE10, CB10,
  5. At the end of the course the student will know how to properly choose the smoothing parameters which, in nonparametric regression models, control the trade-off between good fit to the observed sample and good generalization.
    Related competences: CT4, CT5, CG2, CE5, CE6, CE8, CE10, CB10,
  6. At the end of the course the students will be able to identify situations in which they can treat their data as functional, to represent them computationally, to apply simple FDA techniques (descriptions, dimensionality reduction, regression) and to visualize the results.
    Related competences: CT4, CT5, CG2, CE3, CE5, CE6, CE8, CE10, CB6, CB10,
  7. At the end of the course, the student will be aware of the need to provide interpretability to machine learning algorithms, will know the most common interpretability techniques, how to classify them and how they relate to each other, and how to use them in R and/or Python.
    Related competences: CT4, CT5, CG2, CE6, CE8, CE13, CB7,

Contents

  1. Unsupervised Learning through Advanced Multivariate Analysis
    a. Introduction to Unsupervised Learning. Main problems in unsupervised learning (density estimation, dimensionality reduction, latent variables, clustering).
    b. Nonlinear dimensionality reduction.
    i. Principal curves. ii. Local Multidimensional Scaling. iii. ISOMAP. iv. t-distributed Stochastic Neighbor Embedding (t-SNE).
    c. Dimensionality reduction with sparsity
    i. Matrix decompositions, approximations, and completion. Nuclear norm. ii. Sparse Principal Components. iii. Applications: (i) Recommender systems. (ii) Estimating causal effects.
  2. Nonparametric regression models
    a. Nonparametric regression model. Local polynomial regression. Linear smoothers. Choosing the smoothing parameter.
    b. Generalized nonparametric regression model. Estimation by maximum local likelihood.
    c. Spline smoothing. Penalized least squares nonparametric regression. Cubic splines and interpolation. Smoothing splines. B-splines. Fitting generalized nonparametric regression models with splines.
    d. Multiple (generalized) nonparametric regression. The curse of dimensionality. Additive Models and Generalized Additive Models.
  3. Functional Data Analysis
    a. Introduction to Functional Data Analysis (FDA). An overview of FDA. Concepts of Functional Analysis useful in FDA.
    b. Observed functional data and its computational representation.
    i. Expansions in function bases. ii. Smoothing: kernels, local polynomials, splines. iii. Registration and transformations of functional data.
    c. Exploratory analysis of functional data.
    i. Location and dispersion statistics. ii. Depth measures. iii. Outlier detection.
    d. Dimensionality reduction.
    i. Functional Principal Components. ii. Multidimensional Scaling.
    e. Regression models for functional data.
    i. Scalar response and functional regressor. ii. Functional response.
    f. Applications: FDA in Demography.
  4. Interpretable Machine Learning
    a. Introduction to interpretability in machine learning.
    i. Transparent models versus "black box" models. ii. Global methods (relevance of variables) versus local methods (explainability).
    b. Interpretability methods for specific models.
    i. Random forests. ii. Neural networks.
    c. Model-agnostic interpretability methods.
    i. Global methods (Variable importance via perturbations. Importance based on the Shapley value. Partial dependence plots. Accumulated local effects plots.)
    ii. Local methods (LIME: Local Interpretable Model-agnostic Explanations. Local importance based on the Shapley value. SHAP: SHapley Additive exPlanations. Break-down plots. ICE: Individual Conditional Expectation, or ceteris paribus plots.)
    d. Interpretability in deep image learning.
    i. Gradient-based methods (Grad-CAM, Saliency maps). ii. Perturbation-based methods (LIME for images, SHAP's DeepExplainer).
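Several of the smoothing tools listed in item 2c above are available in standard libraries. A minimal Python sketch of spline smoothing, assuming scipy is available (scipy's `UnivariateSpline` is one possible stand-in for the penalized smoothing splines studied in the course; the data are simulated):

```python
# Spline smoothing of noisy observations of a known function.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(0, 0.2, 100)

# The smoothing factor s plays the role of the smoothing parameter:
# roughly, target residual sum of squares ~ n * noise variance.
spl = UnivariateSpline(x, y, s=len(x) * 0.2 ** 2)
y_smooth = spl(x)
# The spline fit is closer to the true sin(x) than the noisy data.
```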

Activities



Unsupervised Learning through Advanced Multivariate Analysis

Objectives: 1 2 3
Contents:
Theory
12h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
28.4h

Nonparametric regression models

Objectives: 4 5
Contents:
Theory
12h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
28.4h

Functional Data Analysis

Objectives: 6
Contents:
Theory
9h
Problems
0h
Laboratory
3h
Guided learning
0h
Autonomous learning
21.3h

Interpretable Machine Learning

Objectives: 7
Contents:
Theory
9h
Problems
0h
Laboratory
3h
Guided learning
0h
Autonomous learning
21.3h

Teaching methodology

There are two weekly two-hour sessions.
The first three hours are devoted to the exposition of the theoretical contents by the teacher.
The last hour is dedicated to implementing these contents: each student brings a laptop to class and performs the tasks proposed by the teacher.
Each week ends with an assignment that students must deliver within 7 days. The software used will be primarily R.

Evaluation methodology

Homework will be assigned during the course. Homework grades will be worth 40% of the course grade.

There will be an exam at the end of the semester that will evaluate the assimilation of the basic concepts of the whole subject. The final exam will have a first, short theoretical part (closed book) and a second, longer practical part (open book, to be done by the students on their own laptops, with a structure similar to the homework).

Course Grade = 0.4 * Hwk Grade + 0.6 * Exam Grade
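The weighting above is a simple convex combination; as a one-line sanity check (with hypothetical grades, on a 0-10 scale):

```python
# The course grading formula: 40% homework, 60% final exam.
def course_grade(hwk, exam):
    return 0.4 * hwk + 0.6 * exam

# e.g. homework 8.0 and exam 6.0 yield a course grade of 6.8
```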

Bibliography

Basic:

Previous capacities
