The course starts by covering advanced multivariate statistical methods that have proved useful in unsupervised learning: nonparametric multivariate density estimation, clustering based on density estimation, and nonlinear dimensionality reduction (or manifold learning: nonlinear and nonparametric generalizations of principal component analysis, PCA, and multidimensional scaling, MDS).
The second part explores nonparametric multivariate statistical modelling for supervised learning, with the aim of extending the classical multiple Linear Model (LM) and Generalized Linear Model (GLM) in flexibility and predictive power without losing interpretability. Here the Additive Model and the Generalized Additive Model (GAM) are introduced. Model selection and validation are emphasized.
The last part of the course covers Interpretable Machine Learning (IML). Machine learning models are increasingly accurate in their predictions. Improvements in predictive performance are often achieved at the cost of increased model complexity, which is why such models are often referred to as "black boxes". As machine learning algorithms grow in ubiquity and complexity, more and more voices are demanding to understand how and why these algorithms make their decisions. In response to this demand, a whole literature has appeared in recent years (known as "Interpretable Machine Learning" or "eXplainable Artificial Intelligence", IML or XAI) whose purpose is to provide transparency and interpretability to automatic algorithms in order to gain the trust of potential users. We will introduce some of the current IML tools, describe how to use them in practice through examples (implemented in R and Python) and show their theoretical foundations. We will see that Multivariate Analysis techniques can help to develop interpretability tools.
A fundamental part of the course is the study of real cases, both by the teacher and by students in the weekly assignments.
Teachers
Person in charge
Pedro Delicado Useros
Others
Cristian Pachón Garcia
Weekly hours
Theory: 4
Problems: 0
Laboratory: 0
Guided learning: 0
Autonomous learning: 7.1
Competences
Transversal Competences
Information literacy
CT4 - Capacity for managing the acquisition, the structuring, analysis and visualization of data and information in the field of specialisation, and for critically assessing the results of this management.
Third language
CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.
Basic
CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
CB10 - Possess and understand knowledge that provides a basis or opportunity to be original in the development and/or application of ideas, often in a research context.
Generic Technical Competences
Generic
CG2 - Identify and apply methods of data analysis, knowledge extraction and visualization for data collected in disparate formats
Technical Competences
Specific
CE3 - Apply data integration methods to solve data science problems in heterogeneous data environments
CE5 - Model, design, and implement complex data systems, including data visualization
CE6 - Design the Data Science process and apply scientific methodologies to obtain conclusions about populations and make decisions accordingly, from both structured and unstructured data and potentially stored in heterogeneous formats.
CE8 - Extract information from structured and unstructured data by considering their multivariate nature.
CE10 - Identify machine learning and statistical modeling methods to use and apply them rigorously in order to solve a specific data science problem
CE13 - Identify the main threats related to ethics and data privacy in a data science project (both in terms of data management and analysis) and develop and implement appropriate measures to mitigate these threats
Objectives
Know the structure of the main unsupervised learning problems.
Related competences: CT4, CT5, CE10
Learn different methods for dimensionality reduction when the standard assumptions in classical Multivariate Analysis are not fulfilled.
Related competences: CT4, CT5, CG2, CE3, CE5, CE6, CE8, CE10, CB6, CB10
Learn how to combine dimensionality reduction techniques with prediction algorithms.
Related competences: CT5, CG2, CE6, CE8, CE10, CB10
At the end of the course the student will be able to propose, estimate, interpret and validate non-parametric versions of linear regression models and generalized linear models.
Related competences: CT4, CT5, CG2, CE5, CE6, CE8, CE10, CB10
At the end of the course the student will know how to properly choose the smoothing parameters that, in nonparametric regression models, control the trade-off between fitting the observed sample well and generalizing well.
Related competences: CT4, CT5, CG2, CE5, CE6, CE8, CE10, CB10
At the end of the course, the student will be aware of the need to provide interpretability to machine learning algorithms. The student will know the most common interpretability techniques, how to classify them, how they relate to one another, and how to use them in R and/or Python.
Related competences: CT4, CT5, CG2, CE6, CE8, CE13, CB7
Contents
Unsupervised Learning through Advanced Multivariate Analysis
a. Introduction to Unsupervised Learning.
b. Density estimation.
c. Clustering
i. Mixture models
ii. DBSCAN
d. Nonlinear dimensionality reduction.
i. Principal curves.
ii. Local Multidimensional Scaling.
iii. ISOMAP.
iv. t-Distributed Stochastic Neighbor Embedding (t-SNE).
Nonparametric regression models
a. Nonparametric regression model. Local polynomial regression. Linear smoothers. Choosing the smoothing parameter.
b. Generalized nonparametric regression model. Estimation by maximum local likelihood.
c. Spline smoothing. Penalized least squares nonparametric regression. Cubic splines and interpolation. Smoothing splines. B-splines. Fitting generalized nonparametric regression models with splines.
d. Multiple (generalized) nonparametric regression. The curse of dimensionality. Additive Models and Generalized Additive Models.
Interpretable Machine Learning
a. Introduction to interpretability in machine learning.
i. Transparent models versus black-box models.
ii. Global methods (relevance of variables) versus local methods (explainability).
b. Interpretability methods for specific models.
i. Random forests.
ii. Neural networks.
c. Model-agnostic interpretability methods.
i. Global methods (Variable importance through perturbations. Importance based on the Shapley value. Partial dependence plots. Accumulated local effects plots.)
ii. Local methods (LIME: Local Interpretable Model-agnostic Explanations. Local importance based on the Shapley value. SHAP: SHapley Additive exPlanations. Break-down plots. ICE: Individual conditional expectation, or ceteris paribus plots.)
d. Interpretability in deep learning for images.
i. Gradient-based methods (Grad-CAM, Saliency maps).
ii. Perturbation-based methods (LIME for images, SHAP's DeepExplainer).
Activities
Unsupervised Learning through Advanced Multivariate Analysis
Objectives: 1, 2, 3
There are two 2-hour sessions per week.
The first three hours are devoted to the exposition of the theoretical subjects by the teacher.
The last hour is dedicated to implementing these contents: each student brings a laptop to class and carries out the tasks proposed by the teacher.
Each week ends with an assignment that students must submit within 7 days. The software used will be primarily R.
Evaluation methodology
Homework will be assigned during the course. Homework grades will be worth 40% of your course grade.
An exam at the end of the semester will evaluate the assimilation of the basic concepts of the whole subject. The final exam will have a short first theoretical part (closed book) and a longer second practical part (open book, to be done by students on their own laptops, with a structure similar to the homework).
- Principal Component Analysis, Multidimensional Scaling and Clustering, at the level covered by the mandatory subject "Multivariate Analysis" (1st year of MDS).
- Knowledge of the statistical software R and RStudio.