Statistical Inference and Modelling

Credits
6
Types
Compulsory
Requirements
This subject has no formal requirements, but it does assume previous capacities.
Department
EIO
Statistical inference and modeling are indispensable for analyzing data affected by chance, and thus essential for data scientists. In this course, you will learn these key concepts through a motivating case study on election forecasting.

This course shows how inference and modeling can be applied to develop the statistical approaches that make polls an effective tool, and how to do this using R. You will learn the concepts needed to define estimates and margins of error, and how to use them to make predictions and to estimate the precision of your forecasts.

Once you have learned this, you will be able to understand two concepts that are ubiquitous in data science: confidence intervals and p-values.

This course covers the basic knowledge and skills needed to begin the Data Science process rigorously, using the tools of traditional statistical inference adapted to the new context of massive data of any type. This includes accessing, cleaning, and preparing data for exploratory analysis and modeling (statistical or machine learning). In particular, the subject places special emphasis on the fundamental concepts and the different stages of the analytical process underlying any Data Science project.

Teachers

Person in charge

  • Lidia Montero Mercadé

Others

  • Josep Franquet Fàbregas

Weekly hours

Theory
1.8
Problems
0
Laboratory
1.8
Guided learning
0
Autonomous learning
6.4

Competences

Transversal Competences

Information literacy

  • CT4 - Capacity for managing the acquisition, the structuring, analysis and visualization of data and information in the field of specialisation, and for critically assessing the results of this management.

Third language

  • CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.

Basic

  • CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
  • CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.

Generic Technical Competences

Generic

  • CG1 - Identify and apply the most appropriate data management methods and processes to manage the data life cycle, considering both structured and unstructured data
  • CG2 - Identify and apply methods of data analysis, knowledge extraction and visualization for data collected in disparate formats

Technical Competences

Specific

  • CE6 - Design the Data Science process and apply scientific methodologies to obtain conclusions about populations and make decisions accordingly, from both structured and unstructured data and potentially stored in heterogeneous formats.
  • CE10 - Identify machine learning and statistical modeling methods to use and apply them rigorously in order to solve a specific data science problem

Objectives

  1. Know how to perform data-based inference in the traditional parametric way for decision making.
    Related competences: CT5, CE6, CB6, CB9
  2. Know how to write a report on data quality and preprocessing.
    Related competences: CT4, CT5, CG2, CB6
  3. Determination of significant characteristics for numerical and categorical targets in groups of individuals.
    Related competences: CT4, CT5, CG2
  4. Estimation of parameters and interpretation of linear models of normal response.
    Related competences: CT4, CT5, CG1, CG2, CE10, CB6
  5. Validation of normal response models. Identification of unusual and influential data. Residual analysis.
    Related competences: CT4, CT5, CG1, CG2, CE10, CB6
  6. Inference of hypotheses on single and multiple parameters in normal response models.
    Related competences: CT5, CG2, CE6, CB6
  7. Estimation of parameters and interpretation of linear models of binary response.
    Related competences: CT5, CE6, CB9
  8. Validation of binary response models. Identification of unusual and influential data. Types of residuals.
    Related competences: CT4, CT5, CG1, CG2, CE6, CB6
  9. Inference of hypotheses on single and multiple parameters in binary response models.
    Related competences: CG1, CE6, CB9
  10. Estimation of parameters and interpretation of linear models of nominal and ordinal polytomous response.
    Related competences: CT5, CG1, CE10, CB6
  11. Validation of nominal and ordinal polytomous response models. Identification of unusual and influential data.
    Related competences: CT5, CG2, CE10, CB6
  12. Inference of hypotheses on single and multiple parameters in nominal and ordinal polytomous response models.
    Related competences: CT5, CG1, CG2, CE6, CE10
  13. Estimation of parameters and interpretation of linear models for count data.
    Related competences: CT5, CG1, CG2, CE10, CB9
  14. Validation of count models. Identification of unusual and influential data. Types of residuals. Overdispersion diagnosis. Parametric probabilistic models.
    Related competences: CT5, CG1, CE6, CB6
  15. Inference of hypotheses on single and multiple parameters in count models.
    Related competences: CT5, CE6
  16. Know how to design factorial and fractional factorial experiments.
    Related competences: CT5, CG1, CE6, CB6, CB9

Contents

  1. Classical vs Fisherian inference
    Classical Inference. Likelihood function. Properties of MLE. Likelihood ratio test.
    Parametric vs non-parametric inferential procedures.
    Using historical data for hypothesis testing. Links to Fisherian inference and bootstrapping.
  2. Data Quality
    Univariate and multivariate outliers.
    Missing data. Imputation procedures: deterministic, stochastic.
  3. Normal linear models
    Description of the normal linear model. Estimation by least squares. Model comparison. Goodness of fit. Diagnostics: influential data and outliers. Use of categorical explanatory variables. Model selection. Prediction.
    Neural network estimation of linear regression models.
  4. Generalized linear models
    Statement of the generalized linear models. Models for binary response data. Models for count data. Overdispersion issues. Multinomial response data. Model comparison. Diagnostics: influential data and outliers. Model comparison and selection.
  5. Design of Experiments
    Factorial and fractional factorial experimental designs.
    Modern data analysis techniques for experimental design

Activities

Activity Evaluation act


Classical vs Fisherian Inference

Know how to distinguish the conditions under which the different methods of inference apply, and how to choose the most appropriate one for the Data Science process at hand. Perform inference to draw conclusions about populations. Use p-values, confidence intervals, and permutation tests for decision making and for interpreting analyses in a recurring or one-time Data Science problem.
Objectives: 1
Contents:
Theory
4h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
12h

Data quality

Data quality problems: using the Case Study, examine the problems that the data present or may present: inconsistencies, redundancy, missing data, and outliers. How to write a Data Quality Report. What data standardization is.
Objectives: 2
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
3h

Profiling and feature selection

Application of statistical inference to determine the relationships between the variables present in a database and a response variable (numerical or categorical).
Objectives: 3
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
1h

Estimation of parameters and interpretation of linear models of normal response

Perspective of modeling by linear regression techniques: statistical components involved. Roles: response / explanatory variables. Estimation by least squares. Properties of estimators. Inferential processes involved.
Objectives: 4
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
4h

Validation of normal response models. Identification of unusual and influential data. Residual analysis

Elements involved in the validation of regression modeling. Influential and/or atypical values.
Objectives: 5
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
4h

Inference of hypotheses on single and multiple parameters in normal response models

Inference on parameter estimators in linear models of normal response. Confidence intervals and confidence regions. Tests of single and multiple hypotheses and of linear combinations. Inference about predictions and calculation of confidence intervals.
Objectives: 6
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
2h

Estimation of parameters and interpretation of linear models of binary response

Maximum likelihood estimation. Role of the link function. Link function used. Properties of estimators. Inferential processes involved.
Objectives: 7
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
2h

Validation of binary response models. Identification of unusual and influential data. Types of residuals


Objectives: 8
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
2h

Inference of hypotheses on single and multiple parameters in binary response models

Inference on parameter estimators in linear models of binary response. Confidence intervals. Tests of single and multiple hypotheses and of linear combinations. Inference about predictions and calculation of confidence intervals.
Objectives: 9
Contents:
Theory
1h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h

Estimation of parameters and interpretation of linear models of nominal and ordinal polytomous response

Maximum likelihood estimation. Nominal versus ordinal modelling. Link functions used. Properties of estimators. Inferential processes involved.
Objectives: 10
Contents:
Theory
1h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
2h

Validation of nominal and ordinal polytomous response models. Identification of unusual and influential data

Deviance and Pearson residuals. Studentized residuals. Indicators of unusual and influential data, obtained by extending the indicators used in normal regression.
Objectives: 11
Contents:
Theory
0.5h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h

Inference of hypotheses on single and multiple parameters in nominal and ordinal polytomous response models

Inference on parameter estimators in linear polytomous response models. Confidence intervals. Tests of single and multiple hypotheses and of linear combinations. Inference about predictions and calculation of confidence intervals.
Objectives: 12
Contents:
Theory
1h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h

Estimation of parameters and interpretation of linear models for count data

Maximum likelihood estimation. Poisson and negative binomial modeling. Overdispersion. Link functions used. Inferential processes involved.
Objectives: 13
Contents:
Theory
0.5h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h

Validation of count models. Identification of unusual and influential data. Types of residuals. Overdispersion diagnosis. Parametric probabilistic models

Indicators of unusual and influential data. Checking for overdispersion. How to overcome overdispersion.
Objectives: 14
Contents:
Theory
0.5h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h

Inference of hypotheses on single and multiple parameters in count models

Inference on parameter estimators in linear models for count data. Confidence intervals. Tests of single and multiple hypotheses and of linear combinations. Inference about predictions and calculation of confidence intervals.
Objectives: 15
Contents:
Theory
0.5h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h

Theory and practice of factorial and fractional factorial experiment design


Objectives: 16
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
4h

Partial Exam


Objectives: 1 2 3 4 5 6
Week: 7
Type: lab exam
Theory
0h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
6h

Final Exam


Objectives: 7 8 9 10 11 12 13 14 15 16
Week: 14
Type: theory exam
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
8h

Linear Model Assignment


Objectives: 2 3 4 5 6
Week: 12
Type: assignment
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
20h

Generalized Linear Model Assignment


Objectives: 7 8 9 10 11 12 13 14 15
Week: 14
Type: assignment
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
20h

Teaching methodology

Learning in this subject consists of three different phases:
1. Acquisition of specific knowledge through the study of the bibliography and the material provided by the teachers.
2. Acquisition of skills in specific techniques of data analysis, selection of the statistical modeling process, and validation of the model.
3. Integration of knowledge, skills and competences (specific and transversal) through the resolution of real case studies.

In the theory classes, the fundamentals of the methodologies and techniques of the subject are presented. Laboratory classes are used to learn specific problem-solving techniques with the appropriate computer tools: students first reproduce a problem solved by the teachers and then solve a similar one. The Case Studies, solved in groups during self-learning hours, serve to put knowledge, skills and competences into practice on real cases.

Evaluation methodology

The evaluation of the subject integrates the three phases of learning described: knowledge, skills and competences.

Knowledge is assessed by two exams: one in the middle of the course (T1, weight 1/3) and one during the final exam week (T2, weight 2/3). A student who fails the partial exam may retake it as an extension of the final exam (mark T).

Skills, as well as transversal competences, are assessed through two practical assignments. The first (P1) covers blocks 1, 2 and 3 and the second (P2) covers blocks 4 and 5; each is done individually or in groups of two. The average of the two marks gives the mark P.

The Final Grade (NF) is calculated as follows:

Partial Exam (T1, 1/3) and Final Exam (T2, 2/3).
Practice 1 (P1) and Practice 2 (P2).
P: Practice mark = (P1 + P2) / 2.
T: Theory mark = Max (T2, (T1 + 2T2) / 3).
NF: Final Grade = 0.6T + 0.4P.
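
As an illustration only, the grading rule above can be sketched in a few lines of Python. The function names are hypothetical and the marks in the example are invented, assuming the usual 0 to 10 scale; the guide itself does not state the scale.

```python
def theory_mark(t1: float, t2: float) -> float:
    """T = Max(T2, (T1 + 2*T2) / 3): a weak partial exam never lowers T below T2."""
    return max(t2, (t1 + 2 * t2) / 3)


def practice_mark(p1: float, p2: float) -> float:
    """P = (P1 + P2) / 2: plain average of the two practices."""
    return (p1 + p2) / 2


def final_grade(t1: float, t2: float, p1: float, p2: float) -> float:
    """NF = 0.6*T + 0.4*P."""
    return 0.6 * theory_mark(t1, t2) + 0.4 * practice_mark(p1, p2)


# Invented marks, assuming a 0-10 scale (not stated in the guide).
print(final_grade(t1=4, t2=7, p1=8, p2=6))
```

With these invented marks, the partial exam (4) would pull the weighted average down to 6, so the theory mark is the final exam mark, T = 7. With P = 7, the final grade is about 7.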

Bibliography

Basic:

Complementary:

Web links

Previous capacities

Students must have sufficient knowledge of algebra and mathematical analysis to assimilate concepts related to set algebra, numerical series, functions of one or more real variables, differentiation, and integration. Students must also have taken a course in probability and statistics.