Data Analysis

Credits
6
Type
Compulsory
Requirements
This subject has no formal requirements, but it assumes some prior knowledge (see Previous capacities)
Department
EIO
The aim of the Data Analysis course is to present the philosophy and the main methods for extracting the information contained in data. It covers data preparation, exploratory analysis, visualization of information, the modeling of patterns, and the implementation of these methods in computer systems.

Teachers

Person in charge

  • Jan Graffelman

Others

  • Josep Anton Sánchez Espigares
  • Tomas Aluja Banet

Weekly hours

Theory
2
Problems
0
Laboratory
2
Guided learning
0.4
Autonomous learning
5.5

Competences

Technical Competences

  • CE1 - Skillfully use mathematical concepts and methods that underlie the problems of science and data engineering.
  • CE2 - To be able to program solutions to engineering problems: Design efficient algorithmic solutions to a given computational problem, implement them in the form of a robust, structured and maintainable program, and check the validity of the solution.
  • CE3 - Analyze complex phenomena through probability and statistics, and propose models of these types in specific situations. Formulate and solve mathematical optimization problems.
  • CE4 - Use current computer systems, including high performance systems, to process large volumes of data, based on knowledge of their structure, operation and particularities.
  • CE8 - Ability to choose and employ techniques of statistical modeling and data analysis, evaluating the quality of the models, validating and interpreting them.

Transversal Competences

Transversals

  • CT4 - Teamwork. Be able to work in an interdisciplinary team, either as a regular member or in a management role, with the aim of contributing to the development of projects with pragmatism and a sense of responsibility, making commitments that take available resources into account.
  • CT7 - Third language. Know a third language, preferably English, with an adequate oral and written level and in line with the needs of graduates.

Basic

  • CB2 - That the students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of study.
  • CB4 - That the students can transmit information, ideas, problems and solutions to both specialized and non-specialized audiences.

Generic Technical Competences

Generic

  • CG1 - Design computer systems that integrate data of very diverse provenance and form, build mathematical models from them, reason about these models and act accordingly, learning from experience.
  • CG2 - Choose and apply the most appropriate methods and techniques to a problem defined by data that represents a challenge for its volume, speed, variety or heterogeneity, including computer, mathematical, statistical and signal processing methods.
  • CG3 - Work in multidisciplinary teams and projects related to the processing and exploitation of complex data, interacting fluently with engineers and professionals from other disciplines.
  • CG4 - Identify opportunities for innovative data-driven applications in evolving technological environments.

Objectives

  1. Exploratory Data Analysis
    Related competences: CE1, CE2, CE3, CE4, CE8, CT7, CG3, CG4, CB2, CB4
    Subcompetences:
    • Pre-processing. Outliers, missing values. Transformations.
    • PCA, SVD, Factor Analysis. Multidimensional Scaling.
    • Correspondence Analysis. Multiple Correspondence Analysis.
    • Clustering. Profiling.
  2. Discriminant Analysis with probabilistic hypothesis
    Related competences: CE1, CE3, CE8, CT4, CT7, CG2
    Subcompetences:
    • Multivariate normal distribution. Sampling distributions.
    • Linear Discriminant Analysis, Fisher's discriminant. Quadratic Discriminant Analysis.
  3. Multivariate modeling
    Related competences: CE1, CE3, CE8, CT4, CT7, CG1, CG2, CG4, CB2
    Subcompetences:
    • Principal Component Regression, Partial Least Squares Regression
    • Multivariate Regression
    • Canonical Correlation Analysis
  4. Time series
    Related competences: CE1, CE3, CE8
    Subcompetences:
    • Applications of the Kalman Filter
    • Outliers, Calendar Effects and Intervention Analysis
    • Univariate models of time series

Contents

  1. Data preprocessing
    Outliers, missing data and transformations
  2. Principal component analysis
    Multivariate description of a table of continuous variables. Regression with principal components.
  3. Factor analysis
    The singular value decomposition, biplots, factor analysis
  4. Multidimensional scaling (MDS)
    Distance measures. Metric multidimensional scaling. Algorithms.
  5. Cluster analysis
    Hierarchical clustering techniques. Agglomeration methods. Ward's criterion. Dendrogram.
  6. Correspondence analysis
    Contingency tables. Row and column profiles. Independence and chi-square statistics. Simple correspondence analysis. Biplot.
  7. Discriminant analysis
    Multivariate normal distribution. Fisher's linear discriminant analysis.
  8. Univariate time series models
    Exponential smoothing, ARIMA models
  9. Intervention analysis
    Outliers, seasonal effects, intervention analysis.

Activities


Data preprocessing

Practical on data preprocessing
Objectives: 1
Contents:
Theory
4h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
4h

Principal component analysis

Application of principal component analysis in practical data analysis
Objectives: 1
Contents:
Theory
4h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
6h

Factor analysis

Practical data analysis using the method
Objectives: 1
Contents:
Theory
2h
Problems
0h
Laboratory
3h
Guided learning
0h
Autonomous learning
4h

Multidimensional scaling

Analysis of distance matrices with this method
Objectives: 1
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
4h

Clustering

Application of the method to quantitative data matrices.

Theory
4h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
4h

Correspondence Analysis

Application of the method with cross tables.
Objectives: 2
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
4h

Discriminant Analysis

Application of the method to empirical data sets
Objectives: 2
Contents:
Theory
4h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
4h

Univariate time series models

Fitting time series models to data sets on the computer
Objectives: 4
Contents:
Theory
4h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
6h

Intervention analysis

Application of intervention analysis to real data sets
Objectives: 4
Contents:
Theory
2h
Problems
0h
Laboratory
3h
Guided learning
0h
Autonomous learning
4h

Practical on exploratory data analysis

Students carry out an exploratory analysis of a data set and hand in a questionnaire about it.
Objectives: 1 2 3 4
Week: 8 (Outside class hours)
Type: assignment
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
3h
Autonomous learning
15h

Project

Students carry out, in pairs, a complete multivariate study of a dataset using the techniques studied during the course, and hand in a written report on it.
Objectives: 1 2 3 4
Week: 15 (Outside class hours)
Type: assignment
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
3h
Autonomous learning
13h

Exam concerning basic concepts

There are two exams related to the theoretical concepts of the course.
Objectives: 1 2 3 4
Week: 14
Type: final exam
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
14h

Teaching methodology

The learning process is a combination of theoretical explanation and practical application. The theory classes are used to explain the basic scientific contents of the course, whereas the laboratory sessions work on their application to solve real-life problems.

The practicals and the project form the basis for developing the students' transversal competences related to teamwork and public presentation of results. They also serve to integrate the different pieces of knowledge covered in the course.

For hands-on computer training we use the R statistical environment.

Evaluation methodology

The student's final grade for the course is based on the grades obtained for weekly homework assignments (25%), a partial exam halfway through the course (25%), a final exam covering the second half of the course (25%) and a project (25%).
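
As an illustration of the weighting described above, the final grade can be computed as a weighted sum of the four components. This is a minimal sketch: the names WEIGHTS and final_grade and the component labels are hypothetical, and it assumes all four components are graded on the same scale.

```python
# Weights taken from the evaluation description; names are illustrative only.
WEIGHTS = {
    "homework": 0.25,      # weekly homework assignments
    "partial_exam": 0.25,  # exam halfway through the course
    "final_exam": 0.25,    # exam on the second half of the course
    "project": 0.25,       # project report
}


def final_grade(grades):
    """Return the weighted course grade; all components share one scale."""
    missing = set(WEIGHTS) - set(grades)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)
```

For example, component grades of 8, 6, 7 and 9 would be combined as 0.25 × (8 + 6 + 7 + 9).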

Each weekly assignment consists of completing a questionnaire. These assignments aim to consolidate knowledge of the techniques presented in the theory sessions, and require the analysis of datasets in the R statistical environment.

The project is carried out by groups of two students, who have to show that they can solve problems with the techniques learned during the course. Each group hands in a written report on its project at the end of the course.

The two exams will be scheduled according to the faculty calendar, and assess whether students have assimilated the basic concepts of the course material.

For the resit exam, the student can choose to do a re-examination of only the first partial (25%), or of only the second partial (25%), or of both partials (50%). The re-evaluation thus represents at most 50% of the final course grade.

Previous capacities

Knowledge of basic statistical concepts, descriptive statistics and hypothesis testing. Familiarity with the statistical software R.