Statistical Inference and Modelling


Credits

Types

Compulsory

Requirements

This subject has no formal requirements, but it does assume the previous capacities listed below.

Department

EIO

Statistical inference and modeling are indispensable for analyzing data affected by chance, and thus essential for data scientists. In this course, you will learn these key concepts through a motivating case study on election forecasting.

This course shows how inference and modeling can be applied to develop the statistical approaches that make polls an effective tool, and how to do this using R. You will learn the concepts needed to define estimates and margins of error, and how to use them to make reasonably accurate predictions and to quantify the precision of your forecast.

Once you learn this you will be able to understand two concepts that are ubiquitous in data science: confidence intervals, and p-values.
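As an illustration of these two concepts, the following is a minimal sketch of a poll estimate with its margin of error, confidence interval and p-value. The course itself works in R; Python is used here only for illustration. The poll figures (520 of 1000 respondents) are hypothetical, and the normal (Wald) approximation is an assumption that is reasonable only for large samples.

```python
import math
from statistics import NormalDist

def poll_interval(successes, n, level=0.95):
    """Wald interval for a proportion estimated from a simple random sample."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)      # estimated standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for level=0.95
    margin = z * se                              # margin of error
    return p_hat, margin, (p_hat - margin, p_hat + margin)

def two_sided_p_value(successes, n, p0=0.5):
    """Normal-approximation test of H0: true proportion equals p0."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)           # standard error under H0
    z = (p_hat - p0) / se0
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical poll: 520 of 1000 respondents favour a candidate.
p_hat, margin, ci = poll_interval(520, 1000)
p_value = two_sided_p_value(520, 1000)
print(f"estimate={p_hat:.3f} margin={margin:.3f} "
      f"ci=({ci[0]:.3f}, {ci[1]:.3f}) p={p_value:.3f}")
```

In this hypothetical poll the interval contains 0.5, which agrees with a p-value well above 0.05: the data do not rule out a tied race.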

This course covers the basic knowledge and skills needed to begin a Data Science process rigorously, using the tools of traditional statistical inference adapted to the new context of massive data of any type. This includes accessing, cleaning, and preparing data for exploratory analysis and for modeling (statistical or machine learning). The subject places special emphasis on the fundamental concepts and on the stages of the analytical process that underlies any Data Science project.

Teachers

Person in charge

Lidia Montero Mercadé

Others

Josep Franquet Fàbregas

Weekly hours

Theory

1.8

Problems

Laboratory

1.8

Guided learning

Autonomous learning

6.4

Competences

Transversal Competences

Information literacy

CT4 - Capacity for managing the acquisition, the structuring, analysis and visualization of data and information in the field of specialisation, and for critically assessing the results of this management.

Third language

CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.

Basic

CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.

Generic Technical Competences

Generic

CG1 - Identify and apply the most appropriate data management methods and processes to manage the data life cycle, considering both structured and unstructured data
CG2 - Identify and apply methods of data analysis, knowledge extraction and visualization for data collected in disparate formats

Technical Competences

Specific

CE6 - Design the Data Science process and apply scientific methodologies to obtain conclusions about populations and make decisions accordingly, from both structured and unstructured data and potentially stored in heterogeneous formats.
CE10 - Identify machine learning and statistical modeling methods to use and apply them rigorously in order to solve a specific data science problem

Objectives

1. Know how to carry out data-based inference in the traditional parametric way for decision making.
Related competences: CT5, CE6, CB6, CB9
2. Know how to write a report on data quality and preprocessing.
Related competences: CT4, CT5, CG2, CB6
3. Determine the significant characteristics associated with numerical and categorical targets in groups of individuals.
Related competences: CT4, CT5, CG2
4. Estimation of parameters and interpretation of linear models of normal response.
Related competences: CT4, CT5, CG1, CG2, CE10, CB6
5. Validation of normal response models. Identification of unusual and influential data. Residual analysis.
Related competences: CT4, CT5, CG1, CG2, CE10, CB6
6. Inference of hypotheses on single and multiple parameters in normal response models.
Related competences: CT5, CG2, CE6, CB6
7. Estimation of parameters and interpretation of linear models of binary response.
Related competences: CT5, CE6, CB9
8. Validation of binary response models. Identification of unusual and influential data. Types of residuals.
Related competences: CT4, CT5, CG1, CG2, CE6, CB6
9. Inference of hypotheses on single and multiple parameters in binary response models.
Related competences: CG1, CE6, CB9
10. Estimation of parameters and interpretation of linear models of nominal and ordinal polytomous response.
Related competences: CT5, CG1, CE10, CB6
11. Validation of nominal and ordinal polytomous response models. Identification of unusual and influential data.
Related competences: CT5, CG2, CE10, CB6
12. Inference of hypotheses on single and multiple parameters in nominal and ordinal polytomous response models.
Related competences: CT5, CG1, CG2, CE6, CE10
13. Estimation of parameters and interpretation of linear models for count data.
Related competences: CT5, CG1, CG2, CE10, CB9
14. Validation of count-data models. Identification of unusual and influential data. Types of residuals. Overdispersion diagnosis. Parametric probabilistic models.
Related competences: CT5, CG1, CE6, CB6
15. Inference of hypotheses on single and multiple parameters in count-data models.
Related competences: CT5, CE6
16. Know how to design factorial and fractional factorial experiments.
Related competences: CT5, CG1, CE6, CB6, CB9

Contents

1. Classical vs Fisherian inference
Classical Inference. Likelihood function. Properties of MLE. Likelihood ratio test.
Parametric vs non-parametric inferential procedures.
Using historical data for hypothesis testing. Links to Fisherian inference and bootstrapping.
2. Data Quality
Univariate and multivariate outliers.
Missing data. Imputation procedures: deterministic, stochastic.
3. Normal linear models
Description of the normal linear model. Estimation by least squares. Model comparison. Goodness of fit. Diagnostics: influential data and outliers. Use of categorical explanatory variables. Model selection. Prediction.
Neural network estimation of linear regression models.
4. Generalized linear models
Statement of the generalized linear models. Models for binary response data. Models for count data. Overdispersion issues. Multinomial response data. Diagnostics: influential data and outliers. Model comparison and selection.
5. Design of Experiments
Factorial and fractional factorial experimental designs.
Modern data analysis techniques for experimental design.

Activities


Classical vs Fisherian Inference

Know how to distinguish the conditions under which the different inference methods apply, and how to choose the one best suited to the Data Science process at hand. Perform inference to draw conclusions about populations. Use p-values, confidence intervals, and permutation tests for decision making and for interpreting analyses in a recurring or one-off Data Science problem.
Objectives: 1
Contents:

1 . Classical vs Fisherian inference

Theory

Problems

Laboratory

Guided learning

Autonomous learning

12h
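The permutation test mentioned in this activity can be sketched as follows. This is a minimal Python illustration, not the course's R material: the function name, the choice of the absolute difference in means as test statistic, and the sample values are all assumptions made for the example.

```python
import random

def permutation_test(a, b, n_perm=5000, seed=0):
    """Monte Carlo permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(a)

    def abs_mean_diff(values):
        left, right = values[:n_a], values[n_a:]
        return abs(sum(left) / len(left) - sum(right) / len(right))

    pooled = list(a) + list(b)
    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                  # relabel the units at random
        # small tolerance guards against floating-point ties
        if abs_mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    # add-one correction so the estimated p-value is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Hypothetical samples from two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [11, 12, 13, 14, 15]
p_separated = permutation_test(group_a, group_b)
p_identical = permutation_test(group_a, group_a)
print(p_separated, p_identical)
```

The test makes no distributional assumption beyond exchangeability of the observations under the null hypothesis, which is what separates it from the parametric procedures covered in the same block.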

Data quality

Data quality problems: the Case Study is used to examine the problems that the data present or may present, namely inconsistencies, redundancy, missing data, and outliers. How to write a Data Quality Report. What data standardization is.
Objectives: 2
Contents:

2 . Data Quality

Theory

Problems

Laboratory

Guided learning

Autonomous learning
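Two of the data quality checks named in this activity, outlier detection and imputation of missing data, can be sketched as below. This is a minimal Python illustration under stated assumptions: the 1.5 × IQR rule is one common univariate criterion among several, mean imputation stands in for a deterministic procedure, random draws from the observed values stand in for a stochastic one, and the function names and data are hypothetical.

```python
import random
from statistics import mean, quantiles

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def impute_mean(values):
    """Deterministic imputation: replace None by the observed mean."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_random(values, seed=0):
    """Stochastic imputation: replace None by a random observed value."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Hypothetical measurements with one extreme value.
flagged = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
filled_mean = impute_mean([1.0, None, 3.0])
filled_random = impute_random([1.0, None, 3.0])
print(flagged, filled_mean, filled_random)
```

Mean imputation shrinks the variance of the completed variable, which is one reason stochastic procedures are studied alongside deterministic ones.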

Profiling and feature selection

Application of statistical inference to determine the relationships between the variables in a database and a response variable (numerical or categorical).
Objectives: 3
Contents:

1 . Classical vs Fisherian inference

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Estimation of parameters and interpretation of linear models of normal response

Perspective of modeling by linear regression techniques: statistical components involved. Roles: response / explanatory variables. Estimation by least squares. Properties of estimators. Inferential processes involved.
Objectives: 4
Contents:

3 . Normal linear models

Theory

Problems

Laboratory

Guided learning

Autonomous learning
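Least-squares estimation for the simplest case, one explanatory variable, can be sketched in closed form. This is a minimal Python illustration with hypothetical data; in the course's R setting the same quantities would normally come from a model-fitting function rather than hand-written formulas.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # unbiased estimate of the error variance: n - 2 degrees of freedom
    sigma2 = sum(e * e for e in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    return {"intercept": intercept, "slope": slope,
            "sigma2": sigma2, "se_slope": se_slope}

# Hypothetical data lying exactly on y = 1 + 2x.
fit = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(fit)
```

The standard error of the slope is the quantity on which the confidence intervals and hypothesis tests of the later activities are built.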

Validation of normal response models. Identification of unusual and influential data. Residual analysis

Elements involved in the validation of regression models. Influential and/or atypical values.
Objectives: 5
Contents:

3 . Normal linear models

Theory

Problems

Laboratory

Guided learning

Autonomous learning
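The usual indicators of unusual and influential observations can be sketched for a simple linear regression. This is a minimal Python illustration with hypothetical data: it computes leverage, internally studentized residuals and Cook's distance by hand, and the threshold of twice the average leverage is a common rule of thumb rather than a fixed standard.

```python
import math

def influence_measures(x, y):
    """Leverage, studentized residual and Cook's distance per observation."""
    n = len(x)
    p = 2                                    # parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)
    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage (hat value)
        r = e / math.sqrt(s2 * (1 - h))      # internally studentized residual
        cook = r * r * h / (p * (1 - h))     # Cook's distance
        rows.append({"leverage": h, "studentized": r, "cook": cook})
    return rows

# Hypothetical data: the last point lies far from the others in x.
x = [1, 2, 3, 4, 20]
y = [1.1, 1.9, 3.2, 3.9, 8.0]
rows = influence_measures(x, y)
threshold = 2 * 2 / len(x)                   # twice the average leverage p/n
high_leverage = [i for i, row in enumerate(rows) if row["leverage"] > threshold]
print(high_leverage)
```

High leverage alone does not make a point influential; Cook's distance combines leverage with the size of the residual.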

Inference of hypotheses on single and multiple parameters in normal response models

Inference on parameter estimators in linear models of normal response. Confidence intervals and confidence regions. Tests of simple and multiple hypotheses and of linear combinations. Inference about predictions and calculation of confidence intervals.
Objectives: 6
Contents:

3 . Normal linear models

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Estimation of parameters and interpretation of linear models of binary response

Maximum likelihood estimation. Role of the link function. Link function used. Properties of estimators. Inferential processes involved.
Objectives: 7
Contents:

4 . Generalized linear models

Theory

Problems

Laboratory

Guided learning

Autonomous learning
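Maximum likelihood estimation with the logit link can be sketched for a single explanatory variable using Newton-Raphson iterations. This is a minimal Python illustration with hypothetical data and a fixed number of iterations; it is not the course's R workflow, and it assumes the data are not perfectly separable (otherwise the estimates diverge).

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson MLE for P(y=1) = 1 / (1 + exp(-(b0 + b1*x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # score vector (gradient of the log-likelihood)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Fisher information matrix, a symmetric 2x2 matrix
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve information * delta = score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical binary outcomes that tend to increase with x.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_logistic(x, y)
odds_ratio = math.exp(b1)   # multiplicative change in odds per unit of x
print(b0, b1, odds_ratio)
```

Because the logit is the canonical link for the binomial family, the Newton-Raphson step above coincides with Fisher scoring.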

Validation of binary response models. Identification of unusual and influential data. Types of residuals

Objectives: 8
Contents:

4 . Generalized linear models

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Inference of hypotheses on single and multiple parameters in binary response models

Inference on parameter estimators in linear models of binary response. Confidence intervals. Tests of simple and multiple hypotheses and of linear combinations. Inference about predictions and calculation of confidence intervals.
Objectives: 9
Contents:

4 . Generalized linear models

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Estimation of parameters and interpretation of linear models of nominal and ordinal polytomous response

Maximum likelihood estimation. Nominal versus ordinal modelling. Link functions used. Properties of estimators. Inferential processes involved.
Objectives: 10
Contents:

4 . Generalized linear models

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Validation of nominal and ordinal polytomous response models. Identification of unusual and influential data

Deviance and Pearson residuals. Studentized residuals. Indicators of unusual and influential data, obtained by extending the indicators used in normal regression.
Objectives: 11
Contents:

4 . Generalized linear models

Theory

0.5h

Problems

Laboratory

Guided learning

Autonomous learning

Inference of hypotheses on single and multiple parameters in nominal and ordinal polytomous response models

Inference on parameter estimators in linear polytomous response models. Confidence intervals. Simple, multiple hypothesis tests, linear combinations. Inference about predictions and confidence interval calculations.
Objectives: 12
Contents:

4 . Generalized linear models

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Estimation of parameters and interpretation of linear models for count data

Maximum likelihood estimation. Poisson and negative binomial modeling. Overdispersion. Link functions used. Inferential processes involved.
Objectives: 13
Contents:

4 . Generalized linear models

Theory

0.5h

Problems

Laboratory

Guided learning

Autonomous learning

Validation of count-data models. Identification of unusual and influential data. Types of residuals. Overdispersion diagnosis. Parametric probabilistic models

Unusual and influential data indicators. Overdispersion checking. How to overcome overdispersion.
Objectives: 14
Contents:

4 . Generalized linear models

Theory

0.5h

Problems

Laboratory

Guided learning

Autonomous learning
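One common overdispersion check is the Pearson dispersion statistic: the Pearson chi-square divided by the residual degrees of freedom, which should be close to 1 under a Poisson model. The sketch below is a minimal Python illustration for an intercept-only model, where the fitted mean is simply the sample mean; the function name and the counts are hypothetical, and a real analysis would use the fitted means of the full model.

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Hypothetical counts with many zeros and a few large values.
counts = [0, 0, 0, 0, 10, 0, 0, 0, 0, 10]
mean_count = sum(counts) / len(counts)
fitted = [mean_count] * len(counts)      # intercept-only Poisson fit
dispersion = pearson_dispersion(counts, fitted, n_params=1)
print(dispersion)
```

A value well above 1, as here, suggests the Poisson variance assumption does not hold; the negative binomial model named in the contents is one standard alternative.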

Inference of hypotheses on single and multiple parameters in count-data models

Inference on parameter estimators in linear models for count data. Confidence intervals. Tests of simple and multiple hypotheses and of linear combinations. Inference about predictions and calculation of confidence intervals.
Objectives: 15
Contents:

4 . Generalized linear models

Theory

0.5h

Problems

Laboratory

Guided learning

Autonomous learning

Theory and practice of factorial and fractional factorial experiment design

Objectives: 16
Contents:

5 . Design of Experiments

Theory

Problems

Laboratory

Guided learning

Autonomous learning
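The run tables of two-level designs can be generated mechanically. The following minimal Python sketch builds a full 2^k factorial and a half fraction in coded units (-1, +1). The half fraction uses the defining relation in which the last factor equals the product of the others; that generator is an assumption made for the example, and other generators are possible.

```python
import math
from itertools import product

def full_factorial(k):
    """All 2**k runs of a two-level design, coded as -1 / +1."""
    return list(product((-1, 1), repeat=k))

def half_fraction(k):
    """2**(k-1) runs: the last factor is the product of the first k-1."""
    runs = []
    for base in product((-1, 1), repeat=k - 1):
        runs.append(base + (math.prod(base),))
    return runs

full = full_factorial(3)
half = half_fraction(3)      # 2^(3-1) design with C = AB
for run in half:
    print(run)
```

In this half fraction the main effect of the last factor is aliased with the interaction of the other two, which is the price paid for halving the number of runs.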

Partial Exam

Objectives: 1 2 3 4 5 6
Week: 7

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Final Exam

Objectives: 7 8 9 10 11 12 13 14 15 16
Week: 14

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Linear Model Assignment

Objectives: 2 3 4 5 6
Week: 12

Theory

Problems

Laboratory

Guided learning

Autonomous learning

20h

Generalized Linear Model Assignment

Objectives: 7 8 9 10 11 12 13 14 15
Week: 14

Theory

Problems

Laboratory

Guided learning

Autonomous learning

20h

Teaching methodology

The learning of the subject consists of three different phases:
1. Acquisition of specific knowledge through the study of the bibliography and the material provided by the teachers.
2. Acquisition of skills in specific data analysis techniques, in selecting the statistical modeling process, and in validating the model.
3. Integration of knowledge, skills and competences (specific and transversal) through the resolution of real case studies.

Theory classes present the fundamentals of the methodologies and techniques of the subject. Laboratory classes teach the use of specific problem-solving techniques with the appropriate software tools: students first reproduce a problem solved by the teachers and then solve a similar one. The Case Studies, solved in groups during self-learning hours, put the knowledge, skills and competences into practice on real cases.

Evaluation methodology

The evaluation of the subject integrates the three phases of learning described: knowledge, skills and competences.

Knowledge is assessed by two exams: one in the middle of the course (T1, weight 1/3) and one during the final exam week (T2, weight 2/3). A student who fails the partial exam may repeat it as an extension of the final exam (mark T).

Skills and transversal competences are assessed through two practical assignments: blocks 1, 2 and 3 for the first (P1) and blocks 4 and 5 for the second (P2). Each assignment is carried out individually or in groups of at most 3 people, and each is assessed individually through a questionnaire. The average of the two marks gives the mark P.

The Final Grade (NF) is calculated:

Partial Exam (T1, 1/3) and Final Exam (T2, 2/3).
Practice 1 (P1) and Practice 2 (P2)
P: Practice mark = (P1 + P2) / 2.
T: Theory mark = max(T2, (T1 + 2 × T2) / 3).
NF: Final Grade = 0.5T + 0.5P if T > 3.5 otherwise NF = T
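
The grading rule above can be written as a short function. This is a minimal Python sketch that follows the stated formulas literally; the function name and the example marks are hypothetical.

```python
def final_grade(t1, t2, p1, p2):
    """Final grade NF computed from exam marks T1, T2 and practice marks P1, P2."""
    p = (p1 + p2) / 2                    # practice mark P
    t = max(t2, (t1 + 2 * t2) / 3)       # theory mark T
    # the practice mark only counts when the theory mark exceeds 3.5
    return 0.5 * t + 0.5 * p if t > 3.5 else t

grade_pass = final_grade(t1=8, t2=5, p1=7, p2=9)
grade_low_theory = final_grade(t1=2, t2=3, p1=10, p2=10)
print(grade_pass, grade_low_theory)
```

In the first case T = max(5, 6) = 6 and P = 8, so NF = 7. In the second case T = 3, which does not exceed 3.5, so NF = T = 3 regardless of the practice marks.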

Bibliography

Basic:

Applied regression analysis and generalized linear models - Fox, John, SAGE, 2016. ISBN: 9781452205663
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004150669706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
An R companion to applied regression - Fox, J.; Weisberg, S, SAGE Publications, Inc, 2019. ISBN: 9781544336473
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004175439706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Ggplot2: elegant graphics for data analysis - Wickham, H, Springer, 2016. ISBN: 9783319242774
http://cataleg.upc.edu/record=99100487437720671~S1*cat
Design and Analysis of Experiments - Montgomery, D, Wiley, 2020. ISBN: 9781119722106
http://cataleg.upc.edu/record=99100491634860671~S1*cat
Statistics for experimenters : design, innovation, and discovery - Box, George E. P; Hunter, J. Stuart; Hunter, William Gordon, John Wiley & Sons, 2005. ISBN: 9780471718130
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002902039706711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Complementary:

The Elements of statistical learning : data mining, inference, and prediction - Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome, Springer, 2009. ISBN: 9780387952840
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003549679706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Probability and statistics with reliability, queuing and computer science applications - Trivedi, K.S, John Wiley and Sons, 2016. ISBN: 1119285429
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002351769706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Mathematical Statistics with applications - Mendenhall, W.; Wackerly, D.; Scheaffer, R, Thomson Brooks/Cole, 2008. ISBN: 9780495110811
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004874536506711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Web links

Previous capacities

Students must have sufficient knowledge of algebra and mathematical analysis to assimilate concepts related to set algebra, numerical series, real functions of one or more variables, differentiation, and integration. Students must also have taken a course in probability and statistics.
