Preprocessing and Advanced Models of Data Analysis

Credits
6
Types
Compulsory
Requirements
This subject has no formal requirements, but it does assume previous capacities.
Department
EIO
Preprocessing and Advanced Models of Data Analysis is the third subject in a sequence that builds on the rudiments of Probability & Statistics. The sequence includes Introduction to Statistics (IE, Semester 2) and the most basic statistical models (EM, Semester 3). In these previous subjects, the undergraduate AI student has learned basic notions of exploratory and descriptive data analysis, probability and sampling theory, statistical inference, design of experiments and simple linear regression models. In Statistical Modeling, in turn, the student is introduced to more complex models. These include, on the one hand, supervised methods (classification models, general and generalized linear models and an introduction to time series) and, on the other hand, unsupervised methods (clustering and multivariate analysis techniques of the PCA type).

In this subject, the data preprocessing methodology is systematized and extended to more complex scenarios (compositional data, multivalued variables, multilingual data, ...). More complex methods for imputing missing data and for diagnosing and treating outliers will also be studied, so that the data can be used to support complex decisions in real applications. The subject integrates the most complex data preprocessing techniques into a generic data science scenario, connecting the cleaned data to either multivariate statistical or machine learning models.

Regarding advanced methods of data analysis, new multivariate analysis techniques will be covered, such as those that allow hierarchical clustering to scale, new ways of representing data (semantic variables), techniques that generalize the topology of the classes that can be recognized, and the automation of data post-processing, which helps to interpret representative patterns in classes. In addition, different multivariate statistical techniques will be explored to deal with spatio-temporal and textual data, as well as the extraction of topics.

Teachers

Person in charge

  • Karina Gibert Oliveras

Others

  • Dante Conti
  • Sergi Ramirez Mitjans

Weekly hours

Theory
2
Problems
0
Laboratory
2
Guided learning
0
Autonomous learning
6

Competences

Transversal Competences

Transversals

  • CT3 [Assessable] - Efficient oral and written communication. Communicate orally and in writing with other people about the results of learning, thinking and decision making; participate in debates on topics within the specialty.
  • CT4 [Assessable] - Teamwork. Be able to work in an interdisciplinary team, either as a regular member or performing management tasks, with the aim of contributing to the development of projects with pragmatism and a sense of responsibility, making commitments that take available resources into account.
  • CT7 - Third language. Know a third language, preferably English, with an adequate oral and written level and in line with the needs of graduates.
  • CT8 [Assessable] - Gender perspective. An awareness and understanding of sexual and gender inequalities in society in relation to the field of the degree, and the incorporation of different needs and preferences due to sex and gender when designing solutions and solving problems.

Basic

  • CB2 - That the students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of study.
  • CB3 - That students have the ability to gather and interpret relevant data (usually within their area of study) to make judgments that include a reflection on relevant social, scientific or ethical issues.
  • CB4 - That the students can transmit information, ideas, problems and solutions to a specialized and non-specialized public.
  • CB5 - That the students have developed those learning skills necessary to undertake later studies with a high degree of autonomy.

Technical Competences

Specific

  • CE09 - To ideate, design and integrate intelligent data analysis systems with their application in production and service environments.
  • CE17 - To develop and evaluate interactive systems and presentation of complex information and its application to solving human-computer and human-robot interaction design problems.
  • CE18 - To acquire and develop computational learning techniques and to design and implement applications and systems that use them, including those dedicated to the automatic extraction of information and knowledge from large volumes of data.
  • CE20 - To select and put to use techniques of statistical modeling and data analysis, assessing the quality of the models, validating and interpreting.

Generic Technical Competences

Generic

  • CG4 - Reasoning, analyzing reality and designing algorithms and formulations that model it. To identify problems and construct valid algorithmic or mathematical solutions, eventually new, integrating the necessary multidisciplinary knowledge, evaluating different alternatives with a critical spirit, justifying the decisions taken, interpreting and synthesizing the results in the context of the application domain and establishing methodological generalizations based on specific applications.
  • CG8 - Perform an ethical exercise of the profession in all its facets, applying ethical criteria in the design of systems, algorithms, experiments, use of data, in accordance with the ethical systems recommended by national and international organizations, with special emphasis on security, robustness, privacy, transparency, traceability, prevention of bias (race, gender, religion, territory, etc.) and respect for human rights.
  • CG9 - To face new challenges with a broad vision of the possibilities of a professional career in the field of Artificial Intelligence. Develop the activity applying quality criteria and continuous improvement, and act rigorously in professional development. Adapt to organizational or technological changes. Work in situations of lack of information and/or with time and/or resource restrictions.

Objectives

  1. Familiarize yourself with the tools and techniques of advanced data analysis to be able to treat data correctly and internalize the data and information obtained as a source of support for decision-making processes.
    Related competences: CG4, CB3, CE09, CE20
  2. Select, treat and adapt the relevant data to support a specific question.
    Related competences: CG4, CG8, CT8, CB4, CE09, CE17
  3. Perform advanced data preprocessing.
    Related competences: CG4, CE20
  4. Obtain profiles or patterns from mixed databases using advanced clustering techniques and interpret the results with profiling and post-processing tools.
    Related competences: CG4, CB2, CB4, CB5, CE09, CE20
  5. Apply multivariate data analysis, especially to categorical data, mixed data and unstructured data.
    Related competences: CG4, CE20
  6. Treat semi-structured or unstructured text data for text mining, sentiment analysis and topic modelling.
    Related competences: CG4, CE09, CE18, CE20
  7. Analyze spatiotemporal data. Model data or problems with latent variables.
    Related competences: CG4, CE20
  8. Build the statistical models correctly from the data and the context of the reference problem, and present them publicly.
    Related competences: CG4, CG8, CT3, CB2, CE09, CE20
  9. Develop practical work and projects with a gender perspective.
    Related competences: CG8, CT8
  10. Integrate teamwork mechanisms in the performance of practical work.
    Related competences: CT4
  11. Skillfully handle the computer tools needed to solve the real problems posed, using the techniques seen in class.
    Related competences: CG4, CE09, CE20
  12. Interpret and contextualize the models built from data.
    Related competences: CG4, CT3, CT8
  13. Validate the models obtained and make a critical interpretation of the results from a technical point of view, contextualizing the results in the framework, reference or understanding of the problem addressed.
    Related competences: CG4, CG8, CE09, CE20
  14. Write a final report on the practical assignments or subject project.
    Related competences: CG4, CG8, CG9, CT3, CT4, CT7, CT8, CE17
  15. Publicly present a report with the results of the project or practical assignment of the subject.
    Related competences: CG4, CG8, CT3, CT4, CT7, CT8

Contents

  1. Introduction
    Data quality, Importance of data preprocessing, Introduction to advanced data analysis techniques, Relationship between Multivariate Analysis, Machine Learning and Data Science
  2. Preprocessing
    Data acquisition and homogenization, Selection of variables (feature selection, feature weighting and reduction of variables), Missing data: MICE, MIMMI, Derivation of variables, Transformation of variables, Anomalous data (outliers)
  3. Advanced Clustering methods
    Scalability: CURE strategy, Mixed distances and metrics, Ontology-based distances, Clustering on mixed data, DBSCAN, OPTICS, Time series classification
  4. Multiple correspondence analysis and multiple factorial analysis
    CMA
  5. Data analysis - spatiotemporal models
    Basic concepts, geolocated data, geodesic distance, components of spatiotemporal models and basic methods
  6. Text mining
    Sentiment Analysis, Latent Semantic Analysis, Topic Modelling
  7. Modeling based on latent variables
    Modeling based on latent variables

Activities


Teamwork

The students organize themselves into groups and look for real data that meet certain requirements set by the teacher. They use these data to apply the techniques and methodologies seen throughout the course. At the end, they submit a report with the results and give an oral presentation of the most relevant results of the study.
Objectives: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Contents:
Theory
0h
Problems
0h
Laboratory
28h
Guided learning
0h
Autonomous learning
50h

Initial presentation of the practical work

Initial presentation of the practical work
Objectives: 2 3 4 5 6 9 14 15
Contents:
Theory
0h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
4h

Quiz 1

Quiz 1
Objectives: 2 3
Week: 3
Type: theory exam
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
0h

Theory
30h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
30h

Quiz 2

During the course there will be short-answer tests to consolidate what has been learned. They will be held at the end of certain laboratory classes.
Objectives: 4 5 8
Week: 7
Type: theory exam
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
0h

Practical work presentation

Practical work presentation
Objectives: 14 15
Week: 15 (Outside class hours)
Type: assignment
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
6h

Quiz 3



Week: 13
Type: theory exam
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
0h

Quiz 4

During the course there will be short-answer tests to consolidate what has been learned. They will be held at the end of certain laboratory classes.

Week: 15 (Outside class hours)
Type: theory exam
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
0h

Teaching methodology

The 12 proposed topics will be developed in 12 theory sessions (2 hours per week), each with its corresponding practical or laboratory session (also 2 hours per week). The 3 remaining sessions of the 15 per semester established at the FIB will be used for theoretical evaluations (quizzes or similar) and practical evaluations (defense of the practical work in the middle and at the end of the semester). There are also a couple of weeks without lectures, reserved for partial and/or final exams, during which students can be offered advice, support and guidance as reinforcement or preparation for their assessments.

Evaluation methodology

The following evaluation system is proposed:
- Group work delivered at the end of the course: 20%.
- Oral knowledge-control test: 10% (discussion between the teacher and the team during the oral presentation of the work).
- Quality and performance of the work team: 10%.
- Oral and written communication: 10%.
- Ethics of the work and of the teamwork itself: 10%.
- Gender perspective of the team and the work: 10%.
- Attendance and participation in classes and laboratories: 10%.
- 4 quizzes held during the course: 20%.
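
As a rough illustration of how the percentages above combine, the sketch below computes a final mark as a weighted average. It assumes that each component is marked on a 0-10 scale and that the final mark is the plain weighted sum; the guide states neither explicitly. The component names and the `final_grade` function are hypothetical labels introduced only for this example.

```python
# Weights copied from the evaluation list above. The keys are hypothetical
# labels chosen for this sketch, not official names.
WEIGHTS = {
    "group_work": 0.20,
    "oral_test": 0.10,
    "team_quality": 0.10,
    "communication": 0.10,
    "ethics": 0.10,
    "gender_perspective": 0.10,
    "attendance": 0.10,
    "quizzes": 0.20,
}

# The eight weights should add up to 100%.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9


def final_grade(marks: dict) -> float:
    """Weighted average of the component marks (assumed to be on a 0-10 scale)."""
    missing = WEIGHTS.keys() - marks.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Made-up marks, for illustration only.
    example = {name: 7.0 for name in WEIGHTS}
    example["quizzes"] = 9.0
    print(round(final_grade(example), 2))
```

Under these assumptions, the group work and the quizzes together account for 40% of the final mark, and each of the remaining six components accounts for 10%.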

Reassessment

Only students who have previously taken the final exam and failed it may take the reassessment exam.

Previous capacities

The Statistical Modeling and Probability and Statistics courses.