Statistical inference and modeling are indispensable for analyzing data affected by chance, and thus essential for data scientists. In this course, you will learn these key concepts through a motivating case study on election forecasting.
This course shows how inference and modeling can be applied to develop the statistical approaches that make polls an effective tool, and how to do this using R. You will learn the concepts needed to define estimates and margins of error, and how to use them to make reasonably accurate predictions and to quantify the precision of your forecast.
Once you have learned this, you will be able to understand two concepts that are ubiquitous in data science: confidence intervals and p-values.
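As a small illustration of an estimate, its margin of error, and the resulting confidence interval, the sketch below works through a single poll. The poll counts are invented for illustration, the normal approximation is an assumption that is reasonable only for large samples, and Python is used here for brevity even though the course material refers to R.

```python
import math

def poll_estimate(successes, n, z=1.96):
    """Return (estimate, margin_of_error, (lower, upper)) for a sample proportion.

    Uses the normal approximation; z=1.96 gives an approximate 95% interval.
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    moe = z * se
    return p_hat, moe, (p_hat - moe, p_hat + moe)

# Hypothetical poll: 520 of 1000 respondents favour candidate A.
p_hat, moe, (lo, hi) = poll_estimate(520, 1000)
print(f"estimate = {p_hat:.3f}, margin of error = {moe:.3f}")
print(f"approximate 95% CI = ({lo:.3f}, {hi:.3f})")
```

In this made-up example the interval contains 0.5, so the poll on its own would not settle which candidate is ahead.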
This course covers the basic knowledge and skills needed to begin the Data Science process rigorously, using the tools of traditional statistical inference adapted to the new context of massive data of any type. This includes accessing, cleaning, and preparing data for exploratory analysis and modeling (statistics or machine learning). The subject places special emphasis on the fundamental concepts and the different stages of the analytical process underlying any Data Science project.
Teachers
Person in charge
Lidia Montero Mercadé
Others
Josep Franquet Fàbregas
Weekly hours
Theory: 1.8
Problems: 0
Laboratory: 1.8
Guided learning: 0
Autonomous learning: 6.4
Competences
Transversal Competences
Information literacy
CT4 - Capacity for managing the acquisition, the structuring, analysis and visualization of data and information in the field of specialisation, and for critically assessing the results of this management.
Third language
CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.
Basic
CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.
Generic Technical Competences
Generic
CG1 - Identify and apply the most appropriate data management methods and processes to manage the data life cycle, considering both structured and unstructured data
CG2 - Identify and apply methods of data analysis, knowledge extraction and visualization for data collected in disparate formats
Technical Competences
Specific
CE6 - Design the Data Science process and apply scientific methodologies to obtain conclusions about populations and make decisions accordingly, from both structured and unstructured data and potentially stored in heterogeneous formats.
CE10 - Identify machine learning and statistical modeling methods to use and apply them rigorously in order to solve a specific data science problem
Objectives
Know how to perform inference processes based on data and in a traditional parametric way for decision making.
Related competences: CT5, CE6, CB6, CB9
Know how to produce a report on data quality and preprocessing
Related competences: CT4, CT5, CG2, CB6
Determination of significant characteristics aimed at numerical and categorical targets in groups of individuals
Related competences: CT4, CT5, CG2
Estimation of parameters and interpretation of linear models of normal response
Related competences: CT4, CT5, CG1, CG2, CE10, CB6
Validation of normal response models. Identification of unusual and influential data. Residual analysis
Related competences: CT4, CT5, CG1, CG2, CE10, CB6
Inference of hypotheses on single and multiple parameters in normal response models
Related competences: CT5, CG2, CE6, CB6
Estimation of parameters and interpretation of linear models of binary response
Related competences: CT5, CE6, CB9
Validation of binary response models. Identification of unusual and influential data. Residual types
Related competences: CT4, CT5, CG1, CG2, CE6, CB6
Inference of hypotheses on single and multiple parameters in binary response models
Related competences: CG1, CE6, CB9
Estimation of parameters and interpretation of linear models of nominal and ordinal polytomous response
Related competences: CT5, CG1, CE10, CB6
Validation of nominal and ordinal polytomous response models. Identification of unusual and influential data.
Related competences: CT5, CG2, CE10, CB6
Inference of hypotheses on single and multiple parameters in nominal and ordinal polytomous response models
Related competences: CT5, CG1, CG2, CE6, CE10
Estimation of parameters and interpretation of linear models for count data
Related competences: CT5, CG1, CG2, CE10, CB9
Validation of count data models. Identification of unusual and influential data. Types of residuals. Overdispersion diagnosis. Parametric probabilistic models
Related competences: CT5, CG1, CE6, CB6
Inference of hypotheses on single and multiple parameters in count data models
Related competences: CT5, CE6
Know how to design factorial and fractional factorial experiments
Related competences: CT5, CG1, CE6, CB6, CB9
Contents
Classical vs Fisherian inference
Classical Inference. Likelihood function. Properties of MLE. Likelihood ratio test.
Parametric vs non-parametric inferential procedures.
Using historical data for hypothesis testing. Links to Fisherian inference and bootstrapping.
Data Quality
Univariate and multivariate outliers.
Missing data. Imputation procedures: deterministic, stochastic.
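To make the distinction between deterministic and stochastic imputation concrete, here is a minimal Python sketch. The function names and the data are hypothetical, and mean imputation and random draws from the observed values are only two simple representatives of each family, not necessarily the procedures taught in the course.

```python
import random
import statistics

def impute_mean(values):
    """Deterministic imputation: replace each missing value (None) with the observed mean."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def impute_random_draw(values, rng):
    """Stochastic imputation: replace each missing value with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Hypothetical variable with two missing entries.
x = [4.0, None, 6.0, 5.0, None, 9.0]
x_det = impute_mean(x)
x_sto = impute_random_draw(x, random.Random(0))  # seeded for reproducibility
print(x_det)
print(x_sto)
```

Mean imputation always returns the same completed data and shrinks the variance of the variable, whereas the stochastic version preserves more of the observed spread at the cost of run-to-run variability.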
Normal linear models
Description of the normal linear model. Estimation by least squares. Model comparison. Goodness of fit. Diagnostics: influential data and outliers. Use of categorical explanatory variables. Model selection. Prediction.
Neural network estimation of linear regression models.
Generalized linear models
Formulation of generalized linear models. Models for binary response data. Models for count data. Overdispersion issues. Multinomial response data. Diagnostics: influential data and outliers. Model comparison and selection.
Design of Experiments
Factorial and fractional factorial experimental designs.
Modern data analysis techniques for experimental design
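As an illustration of the difference between a full and a fractional factorial design, the following Python sketch builds a two-level design for three factors and its half fraction. The coding of levels as -1/+1 and the choice of defining relation (last factor equal to the product of the others) are assumptions made for the example.

```python
from itertools import product

def full_factorial(k):
    """All 2^k runs of a two-level design, with levels coded as -1 / +1."""
    return [list(run) for run in product((-1, 1), repeat=k)]

def half_fraction(k):
    """2^(k-1) fractional design: the last factor is set to the product of
    the other k-1 factors, so it is aliased with their interaction."""
    runs = []
    for base in full_factorial(k - 1):
        last = 1
        for level in base:
            last *= level
        runs.append(base + [last])
    return runs

full = full_factorial(3)   # 8 runs for factors A, B, C
half = half_fraction(3)    # 4 runs with C = A * B
for run in half:
    print(run)
```

The half fraction needs only 4 runs instead of 8, but the main effect of C can no longer be separated from the A×B interaction.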
Activities
Classical vs Fisherian Inference
Know how to distinguish the conditions under which the different methods of inference apply, and how to choose the most appropriate one for the Data Science process at hand.
Perform inference processes to draw conclusions about populations. Use p-values, confidence intervals, and permutation tests for decision-making and interpretation of analyses in a recurring or one-time Data Science problem.
Objectives: 1
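A permutation test can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions: the two groups of measurements are invented, the test statistic is the difference in means, and the p-value is estimated from random reshuffles rather than from the full enumeration.

```python
import random
import statistics

def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in means between groups a and b.

    Returns the observed difference and the estimated p-value.
    """
    rng = random.Random(seed)
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)

# Hypothetical measurements from two groups.
a = [12.1, 11.8, 13.0, 12.6, 12.9]
b = [10.9, 11.2, 11.5, 10.7, 11.1]
observed, p_value = permutation_test(a, b)
print(f"observed difference = {observed:.2f}, p-value = {p_value:.4f}")
```

No distributional assumption is needed beyond exchangeability of the labels, which is what distinguishes this approach from a parametric two-sample test.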
Data quality problems: using the Case Study, examine the problems that the data present or may present: inconsistencies, redundancy, missing data, and outliers. How to produce a Data Quality Report. What data standardization is.
Objectives: 2
Application of statistical inference to determine the relationships between the variables present in a database and a response variable (numerical or categorical).
Objectives: 3
Estimation of parameters and interpretation of linear models of normal response
Modeling by linear regression techniques: statistical components involved. Roles: response and explanatory variables. Estimation by least squares. Properties of estimators. Inferential processes involved.
Objectives: 4
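For the simplest case of one explanatory variable, least squares estimation has a closed form, sketched below in Python. The data are invented and the function name is hypothetical; the general case with several explanatory variables would use matrix algebra instead.

```python
def least_squares_line(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares (one explanatory variable)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Hypothetical data: x is the explanatory variable, y the response.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

With an intercept in the model, the residuals sum to zero and are uncorrelated with the explanatory variable, which is a quick check on any least squares fit.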
Inference of hypotheses on single and multiple parameters in normal response models
Inference on parameter estimators in linear models of normal response. Confidence intervals and confidence regions. Tests of simple and multiple hypotheses and of linear combinations. Inference on predictions and calculation of confidence intervals.
Objectives: 6
Estimation of parameters and interpretation of linear models of binary response
Maximum likelihood estimation. Role of the link function. Link functions used. Properties of estimators. Inferential processes involved.
Objectives: 7
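The following Python sketch shows maximum likelihood estimation for a binary response with the logit link, using Newton-Raphson on a model with an intercept and one explanatory variable. The data and the function name are hypothetical, the logit is only one of the possible link functions, and a fixed number of iterations is used instead of a convergence test to keep the example short.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Maximum likelihood fit of logit(p) = b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector (gradient of the log-likelihood).
        u0 = sum(yi - pi for yi, pi in zip(y, p))
        u1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Fisher information matrix [[i00, i01], [i01, i11]].
        w = [pi * (1.0 - pi) for pi in p]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Newton step: beta += inverse(information) * score (2x2 inverse written out).
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1

# Hypothetical data: the two outcomes overlap in x, so the MLE exists.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(x, y)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, odds ratio per unit of x = {math.exp(b1):.3f}")
```

On the logit scale the slope is a log odds ratio, so exp(b1) is the multiplicative change in the odds of a positive response per unit increase in x.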
Inference of hypotheses on single and multiple parameters in binary response models
Inference on parameter estimators in linear models of binary response. Confidence intervals. Tests of simple and multiple hypotheses and of linear combinations. Inference on predictions and calculation of confidence intervals.
Objectives: 9
Estimation of parameters and interpretation of linear models of nominal and ordinal polytomous response
Maximum likelihood estimation. Nominal versus ordinal modelling. Link functions used. Properties of estimators. Inferential processes involved.
Objectives: 10
Validation of nominal and ordinal polytomous response models. Identification of unusual and influential data
Deviance and Pearson residuals. Studentized residuals. Indicators of unusual and influential data, extending the indicators used in normal regression.
Objectives: 11
Inference of hypotheses on single and multiple parameters in nominal and ordinal polytomous response models
Inference on parameter estimators in linear polytomous response models. Confidence intervals. Tests of simple and multiple hypotheses and of linear combinations. Inference on predictions and calculation of confidence intervals.
Objectives: 12
Validation of count data models. Identification of unusual and influential data. Types of residuals. Overdispersion diagnosis. Parametric probabilistic models
Indicators of unusual and influential data. Checking for overdispersion. How to overcome overdispersion.
Objectives: 14
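One common way to check for overdispersion is to divide the Pearson chi-square statistic by the residual degrees of freedom. The Python sketch below does this for invented counts and, as a simplifying assumption, an intercept-only Poisson model, so every fitted mean equals the sample mean; with a real model the fitted means would come from the regression.

```python
def pearson_dispersion(counts, fitted, n_params):
    """Pearson chi-square statistic divided by the residual degrees of freedom.

    For a Poisson model, values well above 1 suggest overdispersion.
    """
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted))
    return chi2 / (len(counts) - n_params)

# Hypothetical counts; intercept-only model, so one estimated parameter.
counts = [0, 1, 0, 7, 2, 12, 0, 3, 1, 14]
mean = sum(counts) / len(counts)
phi = pearson_dispersion(counts, [mean] * len(counts), n_params=1)
print(f"estimated dispersion = {phi:.2f}")
```

When the estimate is clearly above 1, usual remedies include a quasi-Poisson fit, which inflates the standard errors by the square root of the dispersion, or a negative binomial model.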
Inference of hypotheses on single and multiple parameters in count data models
Inference on parameter estimators in linear models for count data. Confidence intervals. Tests of simple and multiple hypotheses and of linear combinations. Inference on predictions and calculation of confidence intervals.
Objectives: 15
The learning of the subject consists of three different phases:
1. Acquisition of specific knowledge through the study of the bibliography and the material provided by the teachers.
2. Acquisition of skills in specific data analysis techniques, in selecting the statistical modeling process, and in validating the model.
3. Integration of knowledge, skills and competences (specific and transversal) through the resolution of real case studies.
In the Theory classes, the fundamentals of the methodologies and techniques of the subject are presented. Laboratory classes are used to learn specific problem-solving techniques with the appropriate software tools: students first reproduce a problem solved by the teachers and then solve a similar one. The Case Studies, solved in groups during self-study hours, serve to put knowledge, skills, and competences into practice on real cases.
Evaluation methodology
The evaluation of the subject integrates the three phases of learning described: knowledge, skills and competences.
Knowledge is assessed by two exams: a partial exam in the middle of the course (T1, weight 1/3) and a final exam during the final exam week (T2, weight 2/3). A student who fails the partial exam may retake it as an extension of the final exam (mark T).
Skills and transversal competences are assessed through the submission of 2 practices. Blocks 1, 2 and 3 are covered by the first practice (P1) and blocks 4 and 5 by the second (P2); each practice is done individually or in groups of 2. The average of the two marks gives the mark P.
The Final Grade (NF) is calculated as follows:
T1: Partial Exam (weight 1/3). T2: Final Exam (weight 2/3).
P1: Practice 1. P2: Practice 2.
P: Practice mark, P = (P1 + P2) / 2.
T: Theory mark, T = max(T2, (T1 + 2 × T2) / 3).
NF: Final Grade, NF = 0.5 × T + 0.5 × P if T > 3.5; otherwise NF = T.
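The grading rule above can be checked with a short Python sketch. The function name and the example marks are hypothetical, and a 0 to 10 marking scale is assumed because the source does not state the scale.

```python
def final_grade(t1, t2, p1, p2):
    """Final grade NF from the partial exam (t1), final exam (t2) and two practices (p1, p2)."""
    p = (p1 + p2) / 2                    # practice mark P
    t = max(t2, (t1 + 2 * t2) / 3)       # theory mark T: the partial exam can only help
    return 0.5 * t + 0.5 * p if t > 3.5 else t

# Hypothetical marks on an assumed 0-10 scale.
print(final_grade(4, 6, 7, 8))   # T = 6, P = 7.5
print(final_grade(2, 3, 9, 9))   # T = 3 is not above 3.5, so the practices do not count
```

Because T is the maximum of T2 and the weighted average, a weak partial exam never lowers the theory mark, while a theory mark of 3.5 or below becomes the final grade regardless of the practice marks.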
Students must have sufficient knowledge of algebra and mathematical analysis to assimilate concepts related to set algebra, numerical series, functions of one or more real variables, differentiation, and integration. Students must also have taken a course in probability and statistics.