Multivariate Analysis

Credits
6
Types
Specialization compulsory (Data Science)
Requirements
This subject has no formal requirements, but it does assume previous capacities.
Department
EIO
The objective of MVA is to give students knowledge of the statistical concepts of multivariate data analysis and of its most basic methodologies and techniques, which form a core part of Data Mining.

Teachers

Person in charge

  • Arturo Palomino Gayete

Others

  • Belchin Adriyanov Kostov
  • Daniel Fernández Martínez

Weekly hours

Theory
2
Problems
0
Laboratory
2
Guided learning
0.15
Autonomous learning
7.39

Competences

Generic Technical Competences

Generic

  • CG1 - Capability to apply the scientific method to the study and analysis of phenomena and systems in any area of Computer Science, and in the conception, design and implementation of innovative and original solutions.
  • CG3 - Capacity for mathematical modelling, calculation and experimental design in technology centres and engineering companies, particularly in research and innovation in all areas of Computer Science.

Transversal Competences

Information literacy

  • CTR4 - Capability to manage the acquisition, structuring, analysis and visualization of data and information in the area of informatics engineering, and critically assess the results of this effort.

Reasoning

  • CTR6 - Capacity for critical, logical and mathematical reasoning. Capability to solve problems in their area of study. Capacity for abstraction: the capability to create and use models that reflect real situations. Capability to design and implement simple experiments, and analyze and interpret their results. Capacity for analysis, synthesis and evaluation.

Technical Competences of each Specialization

Specific

  • CEC1 - Ability to apply scientific methodologies in the study and analysis of phenomena and systems in any field of Information Technology as well as in the conception, design and implementation of innovative and original computing solutions.
  • CEC2 - Capacity for mathematical modelling, calculation and experimental design in engineering technology centres and business, particularly in research and innovation in all areas of Computer Science.

Objectives

  1. Multivariate description of data
    Related competences: CG1, CG3, CEC1, CEC2, CTR4, CTR6
  2. Data visualisation
    Related competences: CG3, CTR4
  3. Multivariate inference
    Related competences: CG3, CEC1, CEC2, CTR6
  4. Classification of new individuals
    Related competences: CG1, CG3, CEC1, CEC2, CTR6

Contents

  1. Introduction to Multivariate Data Analysis
    Advantages of the multivariate treatment. Examples of multivariate data. Probabilistic and distribution-free methods. Exploratory versus modelling approach.
  2. Principal Component Analysis
    Analysis of individuals. Analysis of variables. Visual representation of the information. Dimensionality reduction. Supplementary information.
  3. Correspondence Analysis
    Method for exploring and visualizing rows and columns of a contingency table.
  4. Multiple Correspondence Analysis
    Method for exploring and visualizing datasets with categorical variables, typically datasets obtained from a survey or a questionnaire.
  5. Factor Analysis
    Dimension reduction method. Very common in text mining. Examples of how to use it for textual data will be detailed.
  6. Association rules
    Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  7. Multiple Factor Analysis
    This method deals with datasets in which the variables are organised in groups, typically because the data come from different sources. The method highlights the structure common to all the groups and the specificity of each group. It allows the results of several PCAs or MCAs to be compared in a single frame of reference. Each group of variables can be continuous, categorical, or a contingency table.
  8. Discriminant Analysis and Naïve Bayes
    Discriminant Analysis (DA) and Naïve Bayes (NB) are classification methods. DA classifies observations into non-overlapping groups, based on scores on one or more quantitative predictor variables. NB is a simple learning algorithm that uses Bayes' rule together with the strong assumption that the attributes are conditionally independent, given the class.
  9. Classification and Regression Trees
    This method can be used for prediction or classification. It explains how the values of an outcome variable can be predicted or classified from the values of other variables. Its tree-shaped graphical structure makes the results easy to read.
  10. Hierarchical and Partitioning Clustering
    Two approaches to clustering methods used to classify observations, within a data set, into multiple groups based on their similarity.
  11. Model-based Clustering
    In the family of these algorithms, one uses certain models for clusters and tries to optimise the fit between the data and the models. In the model-based clustering approach, the data are viewed as coming from a mixture of probability distributions, each of which represents a different cluster.
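
The Naïve Bayes rule described in item 8 can be illustrated with a minimal sketch. It is written in Python for brevity, although the course itself works in R. The toy data, the function names `train_nb` and `predict_nb`, and the add-one (Laplace) smoothing are assumptions made for this example only; they are not taken from the course material.

```python
from collections import Counter, defaultdict
import math

def train_nb(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    domains = defaultdict(set)           # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
            domains[j].add(value)
    return class_counts, value_counts, domains

def predict_nb(model, row):
    """Return the class maximising log P(class) + sum_j log P(x_j | class)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for j, value in enumerate(row):
            count = value_counts[(label, j)][value]
            # Conditional independence: one factor per attribute.
            # Add-one smoothing (an assumption here) avoids log(0).
            score += math.log((count + 1) / (n_c + len(domains[j])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot"), ("overcast", "cool")]
labels = ["no", "no", "yes", "yes", "yes", "yes"]

model = train_nb(rows, labels)
print(predict_nb(model, ("sunny", "hot")))   # prints "no"
print(predict_nb(model, ("rainy", "cool")))  # prints "yes"
```

Working in log space turns the product of conditional probabilities into a sum, which avoids numerical underflow when there are many attributes.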

Activities

Activity Evaluation act


Introduction to the course + Multivariate Data Analysis


Objectives: 2 1
Contents:
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
5h

Principal Component Analysis


Objectives: 2 1
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Correspondence Analysis


Objectives: 2 1
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Multiple Correspondence Analysis


Objectives: 2 1
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Factor Analysis


Objectives: 2 1
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Association rules


Objectives: 2 4
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Multiple Factor Analysis


Objectives: 2 1
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Discriminant Analysis and Naïve Bayes


Objectives: 3 4
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Classification and Regression Trees


Objectives: 2 3 4
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Hierarchical and Partitioning Clustering


Objectives: 2 4
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Model-based Clustering


Objectives: 2 4
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Association rules


Objectives: 4
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Final Practical Work



Week: 18
Type: assignment
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
1h
Autonomous learning
13h

Quiz



Week: 14
Type: theory exam
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
13h

Theory
0h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Theory
0h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h


Teaching methodology

The course aims to give the statistical foundations for data mining. Learning combines theoretical explanation with its application to a real case. The lectures develop the necessary scientific knowledge, while the lab classes apply it to solving data mining problems. The practical work fosters generic skills related to teamwork and the presentation of results, and serves to integrate the different topics of the subject. The software used will be primarily R.

Evaluation methodology

The course grade is based on the marks obtained in the practical exercises carried out during the course, a written examination, and the final practice.
Each practice requires a written report and may be done jointly, in groups of up to three students.
The exercises carried out throughout the course aim to consolidate the learning of multivariate techniques.
In the final practice, students demonstrate their maturity in solving a real problem using multivariate visualisation techniques, clustering interpretation, and prediction. Students will choose between different alternatives to solve the problem. This practice will be presented and publicly defended, and the student must answer any questions about the theoretical models and methods used in the solution. Practices are carried out using the R software.
The written test will be held on the last day of class and will evaluate the assimilation of the basic concepts of the subject, while the presentation of the second practice will take place during the examination period.

The in-class exercises are weighted 30%, examination 30%, and final practice 40%.
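
The weighting above amounts to a weighted average of the three components. The following short Python sketch works through one case; the marks and the 0 to 10 scale are hypothetical values chosen for illustration, and the names `WEIGHTS` and `final_grade` are not part of the course material.

```python
# Weights as stated in the evaluation methodology.
WEIGHTS = {"exercises": 0.30, "exam": 0.30, "final_practice": 0.40}

def final_grade(marks):
    """Weighted average of the three evaluation components."""
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)

# Hypothetical marks on an assumed 0-10 scale.
example = {"exercises": 7.0, "exam": 6.0, "final_practice": 8.0}
grade = final_grade(example)
print(round(grade, 2))  # 0.3*7 + 0.3*6 + 0.4*8 = 7.1
```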

Previous capacities

The course assumes that students have previously taken basic courses in statistics, programming and mathematics, and in particular that they have acquired the following concepts:
- Mean, covariance and correlation matrix.
- Hypothesis testing.
- Matrix algebra, eigenvalues and eigenvectors.
- Programming algorithms.
- Multiple linear regression.

Addendum

Contents

No changes with respect to the information published in the teaching guide.

Teaching methodology

No changes with respect to the information published in the teaching guide.

Evaluation methodology

No changes with respect to the information published in the teaching guide.

Contingency plan

Classes will be held by videoconference.