The objective of MVA is to give students a working knowledge of the statistical concepts of multivariate data analysis and of its most basic methodologies and techniques, which form a core foundation of Data Mining.
Teachers
Person in charge
Dante Conti
Karina Gibert Oliveras
Sergi Ramirez Mitjans
Others
Ariel Duarte López
David Rodriguez Segado
Weekly hours
Theory: 2
Problems: 0
Laboratory: 2.2
Guided learning: 0
Autonomous learning: 7.53
Competences
Transversal Competences
Information literacy
CT4 - Capacity for managing the acquisition, the structuring, analysis and visualization of data and information in the field of specialisation, and for critically assessing the results of this management.
Third language
CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.
Entrepreneurship and innovation
CT1 - Know and understand the organization of a company and the sciences that govern its activity; have the ability to understand labor standards and the relationships between planning, industrial and commercial strategies, quality and profit. Being aware of and understanding the mechanisms on which scientific research is based, as well as the mechanisms and instruments for transferring results among socio-economic agents involved in research, development and innovation processes.
Basic
CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
CB8 - Capability to communicate their conclusions, and the knowledge and rationale underpinning these, to both skilled and unskilled public in a clear and unambiguous way.
CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.
CB10 - Possess and understand knowledge that provides a basis or opportunity to be original in the development and/or application of ideas, often in a research context.
Generic Technical Competences
Generic
CG2 - Identify and apply methods of data analysis, knowledge extraction and visualization for data collected in disparate formats
CG3 - Define, design and implement complex systems that cover all phases in data science projects
Technical Competences
Specific
CE5 - Model, design, and implement complex data systems, including data visualization
CE6 - Design the Data Science process and apply scientific methodologies to obtain conclusions about populations and make decisions accordingly, from both structured and unstructured data and potentially stored in heterogeneous formats.
CE7 - Identify the limitations imposed by data quality in a data science problem and apply techniques to smooth their impact
CE8 - Extract information from structured and unstructured data by considering their multivariate nature.
CE9 - Apply appropriate methods for the analysis of non-traditional data formats, such as processes and graphs, within the scope of data science
CE10 - Identify machine learning and statistical modeling methods to use and apply them rigorously in order to solve a specific data science problem
CE11 - Analyze and extract knowledge from unstructured information using natural language processing techniques, text and image mining
CE12 - Apply data science in multidisciplinary projects to solve problems in new or poorly explored domains from a data science perspective that are economically viable, socially acceptable, and in accordance with current legislation
CE13 - Identify the main threats related to ethics and data privacy in a data science project (both in terms of data management and analysis) and develop and implement appropriate measures to mitigate these threats
Classification of new individuals
Related competences: CT1, CG3, CE6, CE10, CB6, CB7
Contents
Introduction to Multivariate Data Analysis
Advantages of the multivariate treatment. Examples of multivariate data. Probabilistic and distribution-free methods. Exploratory versus modeling approach.
Principal Component Analysis
Analysis of individuals. Analysis of variables. Visual representation of the information. Dimensionality reduction. Supplementary information. Singular value decomposition.
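As an illustration of how the analysis of individuals, the analysis of variables and the singular value decomposition fit together, the following is a minimal sketch in Python with NumPy (the course itself uses R). The function name `pca`, the toy data and the choice to centre but not scale the variables are assumptions made for this example, not the course's implementation.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA of a data matrix X (rows = individuals, columns = variables) via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                            # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U @ diag(s) @ Vt
    scores = U[:, :n_components] * s[:n_components]    # coordinates of the individuals
    loadings = Vt[:n_components].T                     # principal directions (variables)
    explained = s ** 2 / np.sum(s ** 2)                # share of variance per component
    return scores, loadings, explained[:n_components]

# Hypothetical toy data: 5 individuals measured on 2 variables
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
scores, loadings, explained = pca(X, n_components=2)
```

Keeping fewer components than variables gives the dimensionality reduction mentioned above; `explained` indicates how much variance each retained component accounts for.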
Singular Value Decomposition. Biplots
Method for exploring and visualizing the rows and columns of a table through singular value decomposition.
Factor Analysis
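A dimension reduction method that explains the correlations among the observed variables through a smaller number of unobserved (latent) factors.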
Multidimensional Scaling
This method works with distances or similarities between elements. It reveals a structure common to all the elements and the specificity of each one, showing what makes them close or distant.
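One way to obtain coordinates from a distance matrix is classical (Torgerson) scaling. The sketch below is in Python with NumPy (the course itself uses R); the function name `classical_mds` and the toy distance matrix are assumptions made for illustration, and other MDS variants exist.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: k-dimensional coordinates from a distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)   # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical toy example: distances between the corners of a unit square
P = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
Y = classical_mds(D, k=2)

# Distances between the recovered points, for comparison with D
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
```

The recovered configuration `Y` is only determined up to rotation, reflection and translation, so it is the distances, not the coordinates, that should be compared with the input.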
Hierarchical and Partitioning Clustering
Two families of clustering methods, both of which group the observations of a data set according to their similarity.
Automatic profiling methods
Profiling methods help to understand the common characteristics of clusters.
Multivariate normal distribution
Particularities of the normal distribution in the general case of multivariate approaches, where the points are distributed in several dimensions.
Discriminant Analysis
Discriminant Analysis (DA) and Naïve Bayes (NB) are classification methods. DA classifies observations into non-overlapping groups based on their scores on one or more quantitative predictor variables. NB is a simple learning algorithm that applies Bayes' rule together with the strong assumption that the attributes are conditionally independent given the class.
Classification and Regression Trees
This method can be used for prediction (regression) or classification. It explains how the values of a response variable can be predicted or classified from the values of other variables. Its tree structure gives a graphical representation that is easy to interpret.
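The basic step of growing a classification tree is choosing a split that makes the resulting groups as pure as possible. The following is a minimal Python sketch of that single step, assuming one numeric predictor and the Gini impurity as the criterion (the course itself uses R). The function names `gini` and `best_split` and the data are hypothetical; a full tree would apply the step recursively and add stopping and pruning rules.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(x, y):
    """Best threshold on one numeric predictor x for class labels y.

    Returns (threshold, weighted_impurity); threshold is None when no
    split lowers the impurity of the unsplit node.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = (None, gini(y))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            midpoint = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (midpoint, impurity)
    return best

# Hypothetical toy data: two well-separated classes
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(x, y)
```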
Association rules
Find common patterns, associations, correlations, or causal structures between sets of items or objects in transaction databases, relational databases, and other information repositories.
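Association rules are usually assessed with the support, confidence and lift measures. The sketch below computes them in plain Python for a small set of transactions (the course itself uses R). The function names and the transaction data are hypothetical and chosen only to illustrate the definitions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent appears in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule relative to the baseline support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Hypothetical toy transaction database
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support({"bread", "milk"}, transactions)       # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 transactions with bread
l = lift({"bread"}, {"milk"}, transactions)
```

A lift below 1, as in this toy example, suggests that the antecedent makes the consequent less likely than its baseline frequency.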
Activities
Introduction to the course + Multivariate Data Analysis
The course aims to give the statistical foundations for data mining. Learning combines theoretical explanation with its application to a real case. The lectures develop the necessary scientific knowledge, while the lab classes apply it to solving data mining problems. The practical work fosters generic skills related to teamwork and the presentation of results, and serves to integrate the different topics of the subject. The software used will be primarily R.
Evaluation methodology
The course evaluation is based on the marks obtained in the practical exercises carried out during the course, a theory grade, and the grade obtained in the final practice.
Each practice requires a written report and may be done in groups of up to four students.
The exercises carried out throughout the course aim to consolidate the learning of multivariate techniques.
In the final practice, students show their maturity in solving a real problem using multivariate visualisation techniques, clustering interpretation, and prediction. Students will choose between different alternatives to solve the problem. This practice will be presented and publicly defended, and the students must answer any questions about the theoretical models and methods used in the solution. Practices are carried out using the software R.
The written tests evaluate the assimilation of the basic concepts of the subject. There will be three tests during the course, held in the theory class, while the presentation of the final practice will take place during the examination period.
The in-class exercises are weighted 20%, theory 40%, and final practice 40%.
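The weighting above amounts to a simple weighted average. As a minimal sketch in Python, assuming all three marks are expressed on the same scale (the 0 to 10 scale and the example marks are assumptions, not stated in this guide):

```python
# Weights stated in the evaluation methodology
WEIGHTS = {"exercises": 0.20, "theory": 0.40, "final_practice": 0.40}

def final_grade(exercises, theory, final_practice):
    """Weighted average of the three marks, all assumed to be on the same scale."""
    marks = {
        "exercises": exercises,
        "theory": theory,
        "final_practice": final_practice,
    }
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)

# Hypothetical marks: 8 in the exercises, 6 in theory, 7 in the final practice
grade = final_grade(8, 6, 7)  # 0.2*8 + 0.4*6 + 0.4*7, approximately 6.8
```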
Bibliography
Basic:
The Elements of statistical learning : data mining, inference, and prediction -
Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome,
Springer, cop. 2009. ISBN: 9780387848570 http://cataleg.upc.edu/record=b1343839~S1*cat
Applied multivariate statistical analysis -
Johnson, Richard A.; Wichern, Dean W,
Pearson Education Limited, [2014]. ISBN: 9781292024943 http://cataleg.upc.edu/record=b1520493~S1*cat
Exploratory multivariate analysis by example using R -
Husson, François; Lê, Sébastien; Pagès, Jérôme,
CRC Press, Taylor & Francis Group, 2017. ISBN: 9781315301860 http://cataleg.upc.edu/record=b1496325~S1*cat
Aprender de los datos : el análisis de componentes principales : una aproximación desde el Data Mining -
Aluja Banet, Tomàs; Morineau, Alain,
EUB, 1999. ISBN: 8483120224 http://cataleg.upc.edu/record=b1153963~S1*cat
Multivariate descriptive statistical analysis : correspondence analysis and related techniques for large matrices -
Lebart, Ludovic; Morineau, Alain; Warwick, Kenneth M,
John Wiley and Sons, cop. 1984. ISBN: 0471867438 http://cataleg.upc.edu/record=b1004061~S1*cat
The course assumes that students have previously taken basic courses in statistics, programming and mathematics; in particular, they should have acquired the following concepts:
- Average, covariance and correlation matrix.
- Hypothesis testing.
- Matrix algebra, eigenvalues and eigenvectors.
- Programming algorithms.
- Multiple linear regression.