Credits
6
Types
Compulsory
Requirements
This subject has no requirements, but previous capacities are expected
Department
EIO
This course shows how inference and modeling can be applied to develop the statistical approaches that make polls an effective tool, and how to implement them in R. You will learn the concepts needed to define estimates and margins of error, and how to use them to make reasonably accurate predictions and to quantify the precision of a forecast.
Once you have learned this, you will be able to understand two concepts that are ubiquitous in data science: confidence intervals and p-values.
This course covers the basic knowledge and skills needed to begin a Data Science process rigorously, using the tools of traditional statistical inference adapted to the new context of massive data of any type. This includes accessing, cleaning, and preparing data for exploratory analysis and modeling (statistical or machine learning). The subject places special emphasis on the fundamental concepts and on the stages of the analytical process underlying any Data Science project.
Teachers
Person in charge
- Lidia Montero Mercadé ( lidia.montero@upc.edu )
Others
- Josep Franquet Fàbregas ( josep.franquet@upc.edu )
Weekly hours
Theory
1.8
Problems
0
Laboratory
1.8
Guided learning
0
Autonomous learning
6.4
Competences
Information literacy
Third language
Basic
Generic
Specific
Objectives
- Know how to perform inference processes based on data, in a traditional parametric way, for decision making.
  Related competences: CT5, CE6, CB6, CB9
- Know how to write a report on data quality and preprocessing.
  Related competences: CT4, CT5, CG2, CB6
- Determination of significant characteristics aimed at numerical and categorical targets in groups of individuals.
  Related competences: CT4, CT5, CG2
- Estimation of parameters and interpretation of linear models of normal response.
  Related competences: CT4, CT5, CG1, CG2, CE10, CB6
- Validation of normal response models. Identification of unusual and influential data. Residual analysis.
  Related competences: CT4, CT5, CG1, CG2, CE10, CB6
- Inference of hypotheses on single and multiple parameters in normal response models.
  Related competences: CT5, CG2, CE6, CB6
- Estimation of parameters and interpretation of linear models of binary response.
  Related competences: CT5, CE6, CB9
- Validation of binary response models. Identification of unusual and influential data. Types of residuals.
  Related competences: CT4, CT5, CG1, CG2, CE6, CB6
- Inference of hypotheses on single and multiple parameters in binary response models.
  Related competences: CG1, CE6, CB9
- Estimation of parameters and interpretation of linear models of nominal and ordinal polytomous response.
  Related competences: CT5, CG1, CE10, CB6
- Validation of nominal and ordinal polytomous response models. Identification of unusual and influential data.
  Related competences: CT5, CG2, CE10, CB6
- Inference of hypotheses on single and multiple parameters in nominal and ordinal polytomous response models.
  Related competences: CT5, CG1, CG2, CE6, CE10
- Estimation of parameters and interpretation of linear models for count data.
  Related competences: CT5, CG1, CG2, CE10, CB9
- Validation of count models. Identification of unusual and influential data. Types of residuals. Overdispersion diagnosis. Parametric probabilistic models.
  Related competences: CT5, CG1, CE6, CB6
- Inference of hypotheses on single and multiple parameters in count models.
  Related competences: CT5, CE6
- Know how to design factorial and fractional factorial experiments.
  Related competences: CT5, CG1, CE6, CB6, CB9
Contents
- Classical vs Fisherian inference
  Classical inference. Likelihood function. Properties of the MLE. Likelihood ratio test.
  Parametric vs non-parametric inferential procedures.
  Using historical data for hypothesis testing. Links to Fisherian inference and bootstrapping.
- Data quality
  Univariate and multivariate outliers.
  Missing data. Imputation procedures: deterministic, stochastic.
- Normal linear models
  Description of the normal linear model. Estimation by least squares. Model comparison. Goodness of fit. Diagnostics: influential data and outliers. Use of categorical explanatory variables. Model selection. Prediction.
  Neural network estimation of linear regression models.
- Generalized linear models
  Formulation of generalized linear models. Models for binary response data. Models for count data. Overdispersion issues. Multinomial response data. Diagnostics: influential data and outliers. Model comparison and selection.
- Design of experiments
  Factorial and fractional factorial experimental designs.
  Modern data analysis techniques for experimental design.
Activities
Classical vs Fisherian Inference
Know how to distinguish the conditions of applicability of the different inference methods and how to choose the one most appropriate to the Data Science process at hand. Perform inference processes to draw conclusions about populations. Use p-values, confidence intervals, and permutation tests for decision-making and for interpreting analyses in a recurring or one-time Data Science problem.
Objectives: 1
Contents:
Theory
4h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
12h
Data quality
Data quality problems: examine, in the Case Study, the problems that the data present or may present: inconsistencies, redundancy, missing data, outliers. How to write a Data Quality Report. What data standardization is.
Objectives: 2
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
3h
Profiling and feature selection
Application of statistical inference to determine the relationships between the variables present in a database and a response variable (numerical or categorical).
Objectives: 3
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
1h
Estimation of parameters and interpretation of linear models of normal response
Perspective of modeling by linear regression techniques: statistical components involved. Roles: response / explanatory variables. Estimation by least squares. Properties of estimators. Inferential processes involved.
Objectives: 4
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
4h
Validation of normal response models. Identification of unusual and influential data. Residual analysis
Elements involved in the validation of regression modeling. Influential and/or atypical values.
Objectives: 5
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
4h
Inference of hypotheses on single and multiple parameters in normal response models
Inference on parameter estimators in linear models of normal response. Confidence intervals, confidence regions. Tests of single and multiple hypotheses and of linear combinations. Inference on predictions and calculation of confidence intervals.
Objectives: 6
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
2h
Estimation of parameters and interpretation of linear models of binary response
Maximum likelihood estimation. Role of the link function. Link functions used. Properties of estimators. Inferential processes involved.
Objectives: 7
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
2h
Validation of binary response models. Identification of unusual and influential data. Types of residuals
Objectives: 8
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
2h
Inference of hypotheses on single and multiple parameters in binary response models
Inference on parameter estimators in linear models of binary response. Confidence intervals. Tests of single and multiple hypotheses and of linear combinations. Inference on predictions and calculation of confidence intervals.
Objectives: 9
Contents:
Theory
1h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h
Estimation of parameters and interpretation of linear models of nominal and ordinal polytomous response
Maximum likelihood estimation. Nominal versus ordinal modelling. Link functions used. Properties of estimators. Inferential processes involved.
Objectives: 10
Contents:
Theory
1h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
2h
Validation of nominal and ordinal polytomous response models. Identification of unusual and influential data
Deviance and Pearson residuals. Studentized residuals. Indicators of unusual and influential data, obtained by extending the indicators used in normal regression.
Objectives: 11
Contents:
Theory
0.5h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h
Inference of hypotheses on single and multiple parameters in nominal and ordinal polytomous response models
Inference on parameter estimators in linear polytomous response models. Confidence intervals. Tests of single and multiple hypotheses and of linear combinations. Inference on predictions and calculation of confidence intervals.
Objectives: 12
Contents:
Theory
1h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h
Estimation of parameters and interpretation of linear models for count data
Maximum likelihood estimation. Poisson and negative binomial modeling. Overdispersion. Link functions used. Inferential processes involved.
Objectives: 13
Contents:
Theory
0.5h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h
Validation of count models. Identification of unusual and influential data. Types of residuals. Overdispersion diagnosis. Parametric probabilistic models
Indicators of unusual and influential data. Checking for overdispersion. How to overcome overdispersion.
Objectives: 14
Contents:
Theory
0.5h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h
Inference of hypotheses on single and multiple parameters in count models
Inference on parameter estimators in linear models for count data. Confidence intervals. Tests of single and multiple hypotheses and of linear combinations. Inference on predictions and calculation of confidence intervals.
Objectives: 15
Contents:
Theory
0.5h
Problems
0h
Laboratory
1h
Guided learning
0h
Autonomous learning
1h
Theory and practice of factorial and fractional factorial experiment design
Objectives: 16
Contents:
Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
4h
Teaching methodology
Learning in this subject consists of three different phases:
1. Acquisition of specific knowledge through the study of the bibliography and the material provided by the teachers.
2. Acquisition of skills in specific data analysis techniques, in the selection of the statistical modeling process, and in the validation of the model.
3. Integration of knowledge, skills and competences (specific and transversal) through the resolution of real case studies.
In the theory classes, the fundamentals of the methodologies and techniques of the subject are presented. Laboratory classes are used to learn specific problem-solving techniques with the appropriate computer tools: students first reproduce a problem solved by the teachers and then solve a similar one. The Case Studies, solved in groups during autonomous learning hours, serve to put knowledge, skills and competences into practice on real cases.
Evaluation methodology
The evaluation of the subject integrates the three phases of learning described: knowledge, skills and competences. Knowledge is assessed by two exams: a partial exam held mid-course (T1, weight 1/3) and a final exam held during the final exam week (T2, weight 2/3). A student who fails the partial exam may retake it as an extension of the final exam (mark T).
Skills, as well as transversal competences, are assessed through two practical assignments. Blocks 1, 2 and 3 are covered by the first practice (P1) and blocks 4 and 5 by the second (P2). Each practice is carried out individually or in groups of at most 3 people. Each practice will be assessed individually through a questionnaire. The average of the two marks gives the mark P.
The Final Grade (NF) is calculated as follows:
Partial Exam (T1, 1/3) and Final Exam (T2, 2/3).
Practice 1 (P1) and Practice 2 (P2).
P: Practice mark, P = (P1 + P2) / 2.
T: Theory mark, T = max(T2, (T1 + 2 T2) / 3).
NF: Final Grade, NF = 0.5 T + 0.5 P if T > 3.5; otherwise NF = T.
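As an illustration only, the grading rule above can be sketched in a few lines of Python. The function name and argument names are hypothetical and not part of the course guide; the sketch assumes all marks are on the same scale and applies the formulas exactly as stated, including the strict inequality T > 3.5.

```python
def final_grade(t1: float, t2: float, p1: float, p2: float) -> float:
    """Combine exam and practice marks following the stated evaluation rule.

    Names are illustrative assumptions, not taken from the course guide.
    """
    p = (p1 + p2) / 2                # practice mark P
    t = max(t2, (t1 + 2 * t2) / 3)   # theory mark T: the final exam alone can replace the weighted mean
    # Practice marks count only when the theory mark is strictly above 3.5.
    return 0.5 * t + 0.5 * p if t > 3.5 else t


if __name__ == "__main__":
    print(final_grade(t1=6.0, t2=4.0, p1=8.0, p2=6.0))
    print(final_grade(t1=2.0, t2=3.0, p1=9.0, p2=9.0))
```

In the second call the theory mark is 3.0, which is not above 3.5, so the final grade equals the theory mark regardless of the practice marks.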
Bibliography
Basic
- Applied regression analysis and generalized linear models
  Fox, John. SAGE, 2016. ISBN: 9781452205663
  https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004150669706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
- An R companion to applied regression
  Fox, J.; Weisberg, S. SAGE Publications, Inc, 2019. ISBN: 9781544336473
  https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004175439706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
- Ggplot2: elegant graphics for data analysis
  Wickham, H. Springer, 2016. ISBN: 9783319242774
  http://cataleg.upc.edu/record=99100487437720671~S1*cat
- Design and Analysis of Experiments
  Montgomery, D. Wiley, 2020. ISBN: 9781119722106
  http://cataleg.upc.edu/record=99100491634860671~S1*cat
- Statistics for experimenters : design, innovation, and discovery
  Box, George E. P; Hunter, J. Stuart; Hunter, William Gordon. John Wiley & Sons, 2005. ISBN: 9780471718130
  https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002902039706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Complementary
- The Elements of statistical learning : data mining, inference, and prediction
  Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome. Springer, cop. 2009. ISBN: 9780387952840
  https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003549679706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
- Probability and statistics with reliability, queuing and computer science applications
  Trivedi, K.S. John Wiley and Sons, 2016. ISBN: 1119285429
  https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002351769706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
- Mathematical Statistics with applications
  Mendenhall, W.; Wackerly, D.; Scheaffer, R. Thomson Brooks/Cole, 2008. ISBN: 9780495110811
  https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004874536506711&context=L&vid=34CSUC_UPC:VU1&lang=ca