
Statistical Inference and Modelling

Credits
6
Types
Compulsory
Requirements
This subject has no formal requirements, but it assumes some previous capacities.
Department
EIO
Statistical inference and modeling are indispensable for analyzing data affected by chance, and thus essential for data scientists. In this course, you will learn these key concepts through a motivating case study on election forecasting.

This course shows how inference and modeling can be applied to develop the statistical approaches that make polls an effective tool, and how to do this using R. You will learn the concepts needed to define estimates and margins of error, and how to use them to make reasonably accurate predictions and to estimate the precision of your forecast.

Once you learn this you will be able to understand two concepts that are ubiquitous in data science: confidence intervals, and p-values.
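
As an illustration of the polling case study, the following is a minimal sketch of an estimate, its margin of error, a 95% confidence interval and a p-value for a proportion. It is written in Python for illustration only (the course itself works in R), it relies on the normal approximation, and the poll figures (520 supporters out of 1000 respondents) and the function names are hypothetical.

```python
import math

def poll_interval(successes, n, z=1.96):
    """Return (estimate, margin of error, lower, upper) for a proportion.

    Uses the normal approximation; z=1.96 corresponds to a 95% interval.
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    moe = z * se
    return p_hat, moe, p_hat - moe, p_hat + moe

def two_sided_p_value(successes, n, p0=0.5):
    """Normal-approximation p-value for the null hypothesis p = p0."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se0
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - normal_cdf)

# Hypothetical poll: 520 of 1000 respondents support a candidate.
p_hat, moe, lower, upper = poll_interval(520, 1000)
p_val = two_sided_p_value(520, 1000)
print(f"estimate={p_hat:.3f} +/- {moe:.3f}, CI=({lower:.3f}, {upper:.3f}), p={p_val:.3f}")
```

In this hypothetical poll the interval contains 0.5 and the p-value is well above 0.05, so the data alone do not rule out a tie.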

This course covers the basic knowledge and skills needed to start the Data Science process rigorously, using the tools of traditional statistical inference adapted to the new context of massive data of any type. This includes accessing, cleaning, and preparing data for exploratory analysis and modeling (statistics or machine learning). The subject places special emphasis on the fundamental concepts and on the different stages of the analytical process underlying any Data Science project.

Teachers

Person in charge

Others

Weekly hours

Theory
1.8
Problems
0
Laboratory
1.8
Guided learning
0
Autonomous learning
6.4

Competences

Information literacy

  • CT4 - Capacity for managing the acquisition, the structuring, analysis and visualization of data and information in the field of specialisation, and for critically assessing the results of this management.

Third language

  • CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.

Basic

  • CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
  • CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.

Generic

  • CG1 - Identify and apply the most appropriate data management methods and processes to manage the data life cycle, considering both structured and unstructured data.
  • CG2 - Identify and apply methods of data analysis, knowledge extraction and visualization for data collected in disparate formats.

Specific

  • CE6 - Design the Data Science process and apply scientific methodologies to obtain conclusions about populations and make decisions accordingly, from both structured and unstructured data and potentially stored in heterogeneous formats.
  • CE10 - Identify machine learning and statistical modeling methods to use and apply them rigorously in order to solve a specific data science problem.

Objectives

    1. Know how to perform inference processes based on data, in the traditional parametric way, for decision making.
      Related competences: CT5, CE6, CB6, CB9
    2. Know how to write a report on data quality and preprocessing.
      Related competences: CT4, CT5, CG2, CB6
    3. Determination of significant characteristics associated with numerical and categorical targets in groups of individuals.
      Related competences: CT4, CT5, CG2
    4. Estimation of parameters and interpretation of linear models of normal response.
      Related competences: CT4, CT5, CG1, CG2, CE10, CB6
    5. Validation of normal response models. Identification of unusual and influential data. Residual analysis.
      Related competences: CT4, CT5, CG1, CG2, CE10, CB6
    6. Inference of hypotheses on single and multiple parameters in normal response models.
      Related competences: CT5, CG2, CE6, CB6
    7. Estimation of parameters and interpretation of linear models of binary response.
      Related competences: CT5, CE6, CB9
    8. Validation of binary response models. Identification of unusual and influential data. Types of residuals.
      Related competences: CT4, CT5, CG1, CG2, CE6, CB6
    9. Inference of hypotheses on single and multiple parameters in binary response models.
      Related competences: CG1, CE6, CB9
    10. Estimation of parameters and interpretation of linear models of nominal and ordinal polytomous response.
      Related competences: CT5, CG1, CE10, CB6
    11. Validation of nominal and ordinal polytomous response models. Identification of unusual and influential data.
      Related competences: CT5, CG2, CE10, CB6
    12. Inference of hypotheses on single and multiple parameters in nominal and ordinal polytomous response models.
      Related competences: CT5, CG1, CG2, CE6, CE10
    13. Estimation of parameters and interpretation of linear models for count data.
      Related competences: CT5, CG1, CG2, CE10, CB9
    14. Validation of count data models. Identification of unusual and influential data. Types of residuals. Overdispersion diagnosis. Parametric probabilistic models.
      Related competences: CT5, CG1, CE6, CB6
    15. Inference of hypotheses on single and multiple parameters in count data models.
      Related competences: CT5, CE6
    16. Know how to design factorial and fractional factorial experiments.
      Related competences: CT5, CG1, CE6, CB6, CB9

    Contents

    1. Classical vs Fisherian inference
      Classical Inference. Likelihood function. Properties of MLE. Likelihood ratio test.
      Parametric vs non-parametric inferential procedures.
      Using historical data for hypothesis testing. Links to Fisherian inference and bootstrapping.
    2. Data Quality
      Univariate and multivariate outliers.
      Missing data. Imputation procedures: deterministic, stochastic.
    3. Normal linear models
      Description of the normal linear model. Estimation by least squares. Model comparison. Goodness of fit. Diagnostics: influential data and outliers. Use of categorical explanatory variables. Model selection. Prediction.
      Neural network estimation of linear regression models.
    4. Generalized linear models
      Statement of the generalized linear model. Models for binary response data. Models for count data. Overdispersion issues. Multinomial response data. Diagnostics: influential data and outliers. Model comparison and selection.
    5. Design of Experiments
      Factorial and fractional factorial experimental designs.
      Modern data analysis techniques for experimental design

    Activities



    Classical vs Fisherian Inference

    Know how to distinguish the conditions of applicability of the different inference methods and how to choose the most appropriate one for the Data Science process at hand. Perform inference processes to draw conclusions about populations. Use p-values, confidence intervals, and permutation tests for decision-making and for interpreting analyses in a recurring or one-time Data Science problem.
    Objectives: 1
    Contents:
    Theory
    4h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    12h
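
The permutation tests mentioned in this activity can be sketched as follows. This is a minimal Python illustration (the course itself uses R) of a two-sided Monte Carlo permutation test for a difference in means; the function name, the data and the number of permutations are assumptions made for the example.

```python
import random

def permutation_test(x, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffle the pooled data and split it into two new groups.
        rng.shuffle(pooled)
        px, py = pooled[:len(x)], pooled[len(x):]
        diff = abs(sum(px) / len(px) - sum(py) / len(py))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_perm + 1)

# Hypothetical measurements for two groups.
group_a = [12.1, 11.8, 12.6, 12.9, 12.3, 12.7]
group_b = [10.9, 11.2, 10.7, 11.5, 11.0, 11.3]
p_value = permutation_test(group_a, group_b)
print(f"permutation p-value: {p_value:.4f}")
```

No distributional assumption is needed beyond exchangeability under the null hypothesis, which is what separates this approach from the parametric tests covered in the same block.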

    Data quality

    Data quality problems: the case study is used to examine the problems that the data present or may present, such as inconsistencies, redundancy, missing data and outliers. How to write a data quality report. What data standardization is.
    Objectives: 2
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    3h
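
Two of the data quality checks named above, outlier detection and imputation of missing data, can be sketched as follows. This is a minimal Python illustration (the course itself uses R); the 1.5 x IQR rule, the choice of mean and random hot-deck imputation, the function names and the `ages` data are assumptions made for the example, not the course's prescribed methods.

```python
import random
import statistics

def iqr_outliers(values, k=1.5):
    """Return observed values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in observed if v < low or v > high]

def impute_mean(values):
    """Deterministic imputation: fill missing entries with the observed mean."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def impute_hot_deck(values, seed=0):
    """Stochastic imputation: fill missing entries with a random observed value."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Hypothetical variable with two missing values (None) and one suspicious entry.
ages = [23, 25, None, 31, 28, 27, 95, None, 26, 24]
outliers = iqr_outliers(ages)
ages_mean = impute_mean(ages)
ages_hot_deck = impute_hot_deck(ages)
print(outliers, ages_mean[2], ages_hot_deck[2])
```

Mean imputation always produces the same filled value and shrinks the variance of the variable, whereas the stochastic version preserves more of the observed spread at the cost of run-to-run variability.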

    Profiling and feature selection

    Application of statistical inference to determine the relationships between the variables present in a database and a response variable (numerical or categorical).
    Objectives: 3
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    1h

    Estimation of parameters and interpretation of linear models of normal response

    Perspective of modeling by linear regression techniques: statistical components involved. Roles: response / explanatory variables. Estimation by least squares. Properties of estimators. Inferential processes involved.
    Objectives: 4
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    4h
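
Least squares estimation for a single explanatory variable can be sketched as follows. This is a minimal Python illustration (the course itself uses R) of the closed-form estimates and of the standard error of the slope under the normal linear model; the function name and the data are hypothetical.

```python
import math

def fit_simple_ols(x, y):
    """Least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, standard error of the slope).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    return intercept, slope, se_slope

# Hypothetical explanatory variable and response.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope, se_slope = fit_simple_ols(x, y)
print(f"intercept={intercept:.3f}, slope={slope:.3f}, se(slope)={se_slope:.3f}")
```

Dividing the slope by its standard error gives the t statistic used for the inferential processes mentioned in this activity.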

    Validation of normal response models. Identification of unusual and influential data. Residual analysis

    Elements involved in the validation of regression modeling. Influential and / or atypical values
    Objectives: 5
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    4h

    Inference of hypotheses on single and multiple parameters in normal response models

    Inference on parameter estimators in linear models of normal response. Confidence intervals and confidence regions. Tests of single and multiple hypotheses and of linear combinations of parameters. Inference on predictions and calculation of confidence intervals.
    Objectives: 6
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    2h

    Estimation of parameters and interpretation of linear models of binary response

    Maximum likelihood estimation. Role of the link function. Link function used. Properties of estimators. Inferential processes involved.
    Objectives: 7
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    2h
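
Maximum likelihood estimation with the logit link can be sketched as follows. This is a minimal Python illustration (the course itself uses R) of Newton-Raphson iterations for a binary response with one explanatory variable; the function name, the fixed number of iterations and the data are assumptions made for the example, and the data are chosen so that the two classes overlap and the estimate exists.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson MLE for logit P(y = 1) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate the score vector (g) and the Fisher information (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # inverse logit link
            w = p * (1.0 - p)                            # Bernoulli variance
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data with overlapping classes.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0_hat, b1_hat = fit_logistic(xs, ys)
print(f"b0={b0_hat:.3f}, b1={b1_hat:.3f}")
```

At convergence the score is zero, so the fitted probabilities sum to the observed number of successes. A production fit would also add a convergence check and guard against perfectly separated data, where the maximum likelihood estimate does not exist.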

    Validation of binary response models. Identification of unusual and influential data. Types of residuals


    Objectives: 8
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    2h

    Inference of hypotheses on single and multiple parameters in binary response models

    Inference on parameter estimators in linear models of binary response. Confidence intervals. Tests of single and multiple hypotheses and of linear combinations of parameters. Inference on predictions and calculation of confidence intervals.
    Objectives: 9
    Contents:
    Theory
    1h
    Problems
    0h
    Laboratory
    1h
    Guided learning
    0h
    Autonomous learning
    1h

    Estimation of parameters and interpretation of linear models of nominal and ordinal polytomous response

    Maximum likelihood estimation. Nominal versus ordinal modelling. Link functions used. Properties of estimators. Inferential processes involved.
    Objectives: 10
    Contents:
    Theory
    1h
    Problems
    0h
    Laboratory
    1h
    Guided learning
    0h
    Autonomous learning
    2h

    Validation of nominal and ordinal polytomous response models. Identification of unusual and influential data

    Deviance and Pearson residuals. Studentized residuals. Indicators of unusual and influential data, obtained by extending the indicators used in normal regression.
    Objectives: 11
    Contents:
    Theory
    0.5h
    Problems
    0h
    Laboratory
    1h
    Guided learning
    0h
    Autonomous learning
    1h

    Inference of hypotheses on single and multiple parameters in nominal and ordinal polytomous response models

    Inference on parameter estimators in linear polytomous response models. Confidence intervals. Simple, multiple hypothesis tests, linear combinations. Inference about predictions and confidence interval calculations.
    Objectives: 12
    Contents:
    Theory
    1h
    Problems
    0h
    Laboratory
    1h
    Guided learning
    0h
    Autonomous learning
    1h

    Estimation of parameters and interpretation of linear models for count data

    Maximum likelihood estimation. Poisson and negative binomial modeling. Overdispersion. Link functions used. Inferential processes involved.
    Objectives: 13
    Contents:
    Theory
    0.5h
    Problems
    0h
    Laboratory
    1h
    Guided learning
    0h
    Autonomous learning
    1h

    Validation of count data models. Identification of unusual and influential data. Types of residuals. Overdispersion diagnosis. Parametric probabilistic models

    Unusual and influential data indicators. Overdispersion checking. How to overcome overdispersion.
    Objectives: 14
    Contents:
    Theory
    0.5h
    Problems
    0h
    Laboratory
    1h
    Guided learning
    0h
    Autonomous learning
    1h
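
One common overdispersion check can be sketched as follows: the Pearson chi-square statistic divided by the residual degrees of freedom, which should be close to 1 under a Poisson model. This is a minimal Python illustration (the course itself uses R); the intercept-only model, the function name and the counts are assumptions made for the example.

```python
def pearson_dispersion(counts, fitted, n_params):
    """Pearson chi-square statistic divided by the residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted))
    return chi2 / (len(counts) - n_params)

# Hypothetical counts. For an intercept-only Poisson model the fitted
# mean of every observation is the sample mean.
counts = [0, 1, 0, 2, 15, 0, 1, 9, 0, 2]
mu_hat = sum(counts) / len(counts)
fitted = [mu_hat] * len(counts)
dispersion = pearson_dispersion(counts, fitted, n_params=1)
print(f"fitted mean={mu_hat:.2f}, dispersion={dispersion:.2f}")
```

A value far above 1, as in this example, suggests that the variance exceeds the mean and that a quasi-Poisson or negative binomial model may be more appropriate.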

    Inference of hypotheses on single and multiple parameters in count data models

    Inference on parameter estimators in linear models for count data. Confidence intervals. Tests of single and multiple hypotheses and of linear combinations of parameters. Inference on predictions and calculation of confidence intervals.
    Objectives: 15
    Contents:
    Theory
    0.5h
    Problems
    0h
    Laboratory
    1h
    Guided learning
    0h
    Autonomous learning
    1h

    Theory and practice of factorial and fractional factorial experiment design


    Objectives: 16
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    4h
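
The construction of two-level factorial and fractional factorial designs can be sketched as follows. This is a minimal Python illustration (the course itself uses R) using coded levels -1 and +1; the function names are hypothetical, and the half fraction shown uses one particular generator (the last factor equals the product of the others), which is an assumption of the example.

```python
from itertools import product

def full_factorial(k):
    """All 2^k runs of a two-level design, coded as -1 / +1."""
    return [list(run) for run in product((-1, 1), repeat=k)]

def half_fraction(k):
    """A 2^(k-1) design in which the last factor is the product of the others."""
    runs = []
    for base in product((-1, 1), repeat=k - 1):
        last = 1
        for level in base:
            last *= level
        runs.append(list(base) + [last])
    return runs

full = full_factorial(3)   # 8 runs for factors A, B, C
half = half_fraction(3)    # 4 runs with C = A * B
for run in half:
    print(run)
```

The half fraction needs half the runs, but with C = AB the main effect of C is aliased with the AB interaction, which is the trade-off that fractional designs make.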

    Partial Exam


    Objectives: 1 2 3 4 5 6
    Week: 7
    Theory
    0h
    Problems
    0h
    Laboratory
    0h
    Guided learning
    0h
    Autonomous learning
    0h

    Final Exam


    Objectives: 7 8 9 10 11 12 13 14 15 16
    Week: 14
    Theory
    0h
    Problems
    0h
    Laboratory
    0h
    Guided learning
    0h
    Autonomous learning
    0h

    Linear Model Assignment


    Objectives: 2 3 4 5 6
    Week: 12
    Theory
    0h
    Problems
    0h
    Laboratory
    0h
    Guided learning
    0h
    Autonomous learning
    0h

    Generalized Linear Model Assignment


    Objectives: 7 8 9 10 11 12 13 14 15
    Week: 14
    Theory
    0h
    Problems
    0h
    Laboratory
    0h
    Guided learning
    0h
    Autonomous learning
    0h

    Teaching methodology

    The learning of the subject consists of three phases:
    1. Acquisition of specific knowledge through the study of the bibliography and the material provided by the teachers.
    2. Acquisition of skills in specific data analysis techniques, in selecting the statistical modeling process, and in validating the model.
    3. Integration of knowledge, skills and competences (specific and transversal) through the resolution of real case studies.

    In the theory classes, the fundamentals of the methodologies and techniques of the subject are presented. Laboratory classes are used to learn specific problem-solving techniques with the appropriate computer tools: students first reproduce a problem solved by the teachers and then solve a similar one. The case studies, solved in groups during autonomous learning hours, serve to put the knowledge, skills and competences into practice on real cases.

    Evaluation methodology

    The evaluation of the subject integrates the three phases of learning described: knowledge, skills and competences.

    Knowledge is assessed by two exams: a partial exam in the middle of the course (T1, weight 1/3) and a final exam during the final exam week (T2, weight 2/3). A student who fails the partial exam may retake it as an extension of the final exam. The resulting theory mark is T.

    Skills, together with the transversal competences, are assessed through two practical assignments: the first (P1) covers blocks 1, 2 and 3, and the second (P2) covers blocks 4 and 5. Each assignment may be done individually or in groups of at most 3 people, and each is assessed individually through a questionnaire. The average of the two marks gives the mark P.

    The Final Grade (NF) is calculated:

    Partial Exam (T1, 1/3) and Final Exam (T2, 2/3).
    Practice 1 (P1) and Practice 2 (P2)
    P: Practice mark, P = (P1 + P2) / 2.
    T: Theory mark, T = max(T2, (T1 + 2 T2) / 3).
    NF: Final Grade, NF = 0.5 T + 0.5 P if T > 3.5; otherwise NF = T.
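
The grading rule above can be written as a short function. This is a minimal Python sketch; the function name and the example marks are hypothetical, and it assumes all marks are on the same scale.

```python
def final_grade(t1, t2, p1, p2):
    """Final grade NF from the exam marks (T1, T2) and practice marks (P1, P2)."""
    p = (p1 + p2) / 2                # practice mark
    t = max(t2, (t1 + 2 * t2) / 3)   # theory mark: the final exam alone can replace the weighted mean
    # The practice mark only counts when the theory mark is above 3.5.
    return 0.5 * t + 0.5 * p if t > 3.5 else t

print(final_grade(9, 6, 7, 9))    # T = 7.0, P = 8.0
print(final_grade(2, 3, 10, 10))  # T = 3.0, so practice marks are ignored
```

Because T takes the maximum of T2 and the weighted mean, a weak partial exam never lowers the theory mark below the final exam mark.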

    Bibliography

    Basic

    Complementary

    Web links

    Previous capacities

    Students must have sufficient knowledge of algebra and mathematical analysis to assimilate concepts related to set algebra, numerical series, functions of one or more real variables, differentiation, and integration. Students must also have taken a course in probability and statistics.