Preprocessing and Advanced Models of Data Analysis is the third subject in a sequence that assumes the rudiments of Probability & Statistics have already been acquired. The sequence includes Introduction to Statistics (IE, Semester 2) and the most basic statistical models (ME, Semester 3). In these previous subjects, the undergraduate AI student has learned basic notions of exploratory and descriptive data analysis, probability and sampling theory, statistical inference, design of experiments, and simple linear regression models. Statistical Modeling, in turn, introduces more complex models: on the one hand, supervised methods, including classification models, general and generalized linear models, and an introduction to time series; on the other hand, unsupervised methods, including clustering and multivariate analysis techniques of the PCA type.
In this subject, data preprocessing methodology is approached with a view to systematizing the process and addressing more complex scenarios (compositional data, multivalued variables, multilingual data, ...). More complex methods for imputing missing data and for diagnosing and treating outliers are also studied, so that the data can support complex decisions in real applications. The subject integrates the most complex data preprocessing techniques into a generic data science scenario, connecting the cleaned data to either multivariate statistical or machine learning models.
Regarding advanced methods of data analysis, new multivariate analysis techniques are covered, such as those that allow hierarchical clustering to scale, new ways of representing data (semantic variables), methods that generalize the topology of the classes that can be recognized, and the automation of data post-processing, which helps to interpret representative patterns in classes. In addition, different multivariate statistical techniques are explored for dealing with spatio-temporal and textual data, as well as for topic extraction.
Teachers
Person in charge
Dante Conti
Others
Sergi Ramirez Mitjans
Weekly hours
Theory: 2
Problems: 0
Laboratory: 2
Guided learning: 0
Autonomous learning: 6
Competences
Transversal Competences
Transversals
CT3 [Assessable] - Efficient oral and written communication. Communicate orally and in writing with other people about the results of learning, thinking and decision making; participate in debates on topics of the specialty itself.
CT4 [Assessable] - Teamwork. Be able to work in an interdisciplinary team, either as a regular member or in a management role, with the aim of contributing to the development of projects with pragmatism and a sense of responsibility, making commitments that take available resources into account.
CT8 [Assessable] - Gender perspective. An awareness and understanding of sexual and gender inequalities in society in relation to the field of the degree, and the incorporation of different needs and preferences due to sex and gender when designing solutions and solving problems.
Basic
CB2 - That the students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of study.
CB3 - That students have the ability to gather and interpret relevant data (usually within their area of study) to make judgments that include a reflection on relevant social, scientific or ethical issues.
CB4 - That the students can transmit information, ideas, problems and solutions to a specialized and non-specialized public.
CB5 - That the students have developed those learning skills necessary to undertake later studies with a high degree of autonomy.
Technical Competences
Specific
CE09 - To ideate, design and integrate intelligent data analysis systems with their application in production and service environments.
CE17 - To develop and evaluate interactive systems and systems for presenting complex information, and to apply them to solving human-computer and human-robot interaction design problems.
CE18 - To acquire and develop computational learning techniques and to design and implement applications and systems that use them, including those dedicated to the automatic extraction of information and knowledge from large volumes of data.
CE20 - To select and put to use techniques of statistical modeling and data analysis, assessing the quality of the models and validating and interpreting them.
Generic Technical Competences
Generic
CG4 - Reasoning, analyzing reality and designing algorithms and formulations that model it. To identify problems and construct valid algorithmic or mathematical solutions, possibly novel ones, integrating the necessary multidisciplinary knowledge, evaluating different alternatives with a critical spirit, justifying the decisions taken, interpreting and synthesizing the results in the context of the application domain and establishing methodological generalizations based on specific applications.
CG8 - Practice the profession ethically in all its facets, applying ethical criteria in the design of systems, algorithms and experiments and in the use of data, in accordance with the ethical systems recommended by national and international organizations, with special emphasis on security, robustness, privacy, transparency, traceability, prevention of bias (race, gender, religion, territory, etc.) and respect for human rights.
CG9 - To face new challenges with a broad vision of the possibilities of a professional career in the field of Artificial Intelligence. Develop the activity applying quality criteria and continuous improvement, and act rigorously in professional development. Adapt to organizational or technological changes. Work in situations of lack of information and/or with time and/or resource restrictions.
Objectives
1. Familiarize yourself with the tools and techniques of advanced data analysis, in order to treat data correctly and to internalize the data and information obtained as a source of support for decision-making processes.
Related competences: CG4, CB3, CE09, CE20
2. Select, treat and adapt the relevant data to support a specific question.
Related competences: CG4, CG8, CT8, CB4, CE09, CE17
3. Perform advanced data preprocessing.
Related competences: CG4, CE20
4. Obtain profiles or patterns from mixed databases using advanced clustering techniques, and interpret the results with profiling and post-processing tools.
Related competences: CG4, CB2, CB4, CB5, CE09, CE20
5. Apply multivariate data analysis, especially to categorical data, mixed data and unstructured data.
Related competences: CG4, CE20
6. Process semi-structured or unstructured text data for text mining, sentiment analysis and topic modelling.
Related competences: CG4, CE09, CE18, CE20
7. Analyze spatiotemporal data. Model data or problems with latent variables.
Related competences: CG4, CE20
8. Build statistical models correctly from the data and the context of the reference problem, and present them publicly.
Related competences: CG4, CG8, CT3, CB2, CE09, CE20
9. Develop practical work and projects with a gender perspective.
Related competences: CG8, CT8
10. Integrate teamwork mechanisms in the performance of practical work.
Related competences: CT4
11. Skillfully handle the software tools needed to solve the real problems posed, using the techniques seen in class.
Related competences: CG4, CE09, CE20
12. Interpret and contextualize the models built from data.
Related competences: CG4, CT3, CT8
13. Validate the models obtained and interpret the results critically from a technical point of view, contextualizing them within the framework, reference or understanding of the problem addressed.
Related competences: CG4, CG8, CE09, CE20
14. Write a final report on the practical assignments or subject project.
Related competences: CG4, CG8, CG9, CT3, CT4, CT8, CE17
15. Publicly present a report with the results of the project or practical assignment of the subject.
Related competences: CG4, CG8, CT3, CT4, CT8
Contents
Introduction
Data quality, importance of data preprocessing, introduction to advanced data analysis techniques, relationship between multivariate analysis, machine learning and data science
Preprocessing
Data acquisition and homogenization, selection of variables (feature selection, feature weighting and reduction of variables), missing data: MICE, MIMMI, derivation of variables, transformation of variables, anomalous data (outliers)
Advanced Clustering methods
Scalability: CURE strategy, Mixed distances and metrics, Clustering on mixed data, DBSCAN, OPTICS, Time series classification
Factorial analysis
Multiple correspondence analysis (MCA) and factor analysis of mixed data (FAMD)
Data analysis - spatiotemporal models
Basic concepts, geolocated data, visualization, distances in spatio-temporal analysis, components of spatio-temporal models and basic methods (Kriging)
Text mining
Sentiment Analysis, Latent Semantic Analysis, Topic Modelling
Modeling based on latent variables
Activities
Activity / Evaluation act
Teamwork
The students organize themselves into groups and look for real data that meet certain requirements set by the teacher. They use these data to apply the techniques and methodologies seen throughout the course. At the end, they submit a report with the results and give an oral presentation of the most relevant findings of the study.
Objectives: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
Short-answer tests will be given during the course to consolidate what has been learned. They will take place at the end of certain laboratory classes.
Objectives: 4, 5, 8
Week: 6
Theory: 0h
Problems: 0h
Laboratory: 0.5h
Guided learning: 0h
Autonomous learning: 0.5h
Practical work presentation
Objectives: 14, 15
Week: 14
Theory: 0h
Problems: 0h
Laboratory: 2h
Guided learning: 0h
Autonomous learning: 10h
Quiz 3
Short-answer tests will be given during the course to consolidate what has been learned. They will take place at the end of certain laboratory classes.
Objectives: 1, 5, 8, 12, 13
Week: 11
Theory: 0h
Problems: 0h
Laboratory: 0.5h
Guided learning: 0h
Autonomous learning: 0.5h
Quiz 4
Short-answer tests will be given during the course to consolidate what has been learned. They will take place at the end of certain laboratory classes.
Objectives: 1, 6, 7, 8, 12, 13
Week: 14
Theory: 0h
Problems: 0h
Laboratory: 0.5h
Guided learning: 0h
Autonomous learning: 0.5h
Final Test
Objectives: 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13
Week: 15 (Outside class hours)
Theory: 0h
Problems: 0h
Laboratory: 0h
Guided learning: 0h
Autonomous learning: 10h
Practical application of the subject syllabus
Execution of practical scripts in R on the concepts seen in theory.
Theory: 0h
Problems: 0h
Laboratory: 13h
Guided learning: 0h
Autonomous learning: 0h
Teaching methodology
The 7 topics will be developed in 12 theory sessions (2 hours per week), each with its corresponding practical or laboratory session (also 2 hours per week). The remaining 3 of the 15 sessions per semester established at the FIB will be used for theoretical evaluations (quizzes or similar) and practical evaluations (defense of the practical work in the middle and at the end of the semester). Note also that there are a couple of weeks without lectures, set aside for partial and/or final exams, during which advice, support and guidance can be offered to students as reinforcement or preparation for their assessments.
Ordinary Final Grade = 0.2 * Q + 0.4 * P + 0.4 * EF
Q: Four quizzes of 5-10 questions each, all with the same weight in the final grade.
Q = (Q1 + Q2 + Q3 + Q4)/4
P: Team project, in which the following skills will be assessed:
- (P1) Data collection, analysis and interpretation of results (30%)
- (P2) Transmission of results (20%)
- (P3) Oral and written communication (20%)
- (P4) Teamwork (10%)
- (P5) Gender perspective (10%)
- (P6) Autonomy (10%)
To pass the subject, students must obtain a minimum grade of 3.5 in the individual in-person tests, that is, 1/3 * Q + 2/3 * EF > 3.5. In addition, completing the project is mandatory in order to pass in the ordinary evaluation.
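The grading rules above can be sketched in a few lines of code. This is only an illustrative sketch, not an official calculator: it is written in Python (the course labs use R), the function names are hypothetical, all grades are assumed to be on a 0-10 scale, P is assumed to be the weighted sum of P1 to P6, EF is assumed to be the final exam grade, and the minimum-grade check follows the inequality literally as written.

```python
# Illustrative sketch of the ordinary grading rules (hypothetical helper names).
# Assumptions: grades on a 0-10 scale; P is the weighted sum of P1..P6;
# EF is the final exam grade.

# Project rubric weights as listed in the syllabus (they sum to 100%).
PROJECT_WEIGHTS = {
    "P1": 0.30,  # data collection, analysis and interpretation of results
    "P2": 0.20,  # transmission of results
    "P3": 0.20,  # oral and written communication
    "P4": 0.10,  # teamwork
    "P5": 0.10,  # gender perspective
    "P6": 0.10,  # autonomy
}


def quiz_grade(quizzes):
    """Q = (Q1 + Q2 + Q3 + Q4) / 4."""
    if len(quizzes) != 4:
        raise ValueError("expected exactly four quiz grades")
    return sum(quizzes) / 4


def project_grade(parts):
    """Weighted sum of the six rubric items (assumed combination rule)."""
    return sum(weight * parts[key] for key, weight in PROJECT_WEIGHTS.items())


def ordinary_final_grade(q, p, ef):
    """Ordinary Final Grade = 0.2 * Q + 0.4 * P + 0.4 * EF."""
    return 0.2 * q + 0.4 * p + 0.4 * ef


def meets_individual_minimum(q, ef, threshold=3.5):
    """Check 1/3 * Q + 2/3 * EF > 3.5 (strict, as the formula is written)."""
    return (q + 2 * ef) / 3 > threshold


if __name__ == "__main__":
    q = quiz_grade([6.0, 7.0, 8.0, 7.0])
    p = project_grade({key: 8.0 for key in PROJECT_WEIGHTS})
    ef = 6.0
    print("Q:", q)
    print("P:", round(p, 2))
    print("Ordinary final grade:", round(ordinary_final_grade(q, p, ef), 2))
    print("Individual minimum met:", meets_individual_minimum(q, ef))
```

The sketch deliberately stops at the two stated formulas. The syllabus does not say what grade is recorded when the individual minimum is not met, so that case is not modeled.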
Reassessment:
---------------------------------
(EE) Extraordinary Final Exam
Only students who took the ordinary assessment and failed it may sit this exam. Students who did not take the ordinary assessment (NP) are therefore excluded.
There is no minimum grade requirement to pass. The maximum grade obtainable in this call is 7.
Bibliography
Basic:
A survey on pre-processing techniques: relevant issues in the context of environmental data mining - Gibert, Karina; Sànchez-Marré, Miquel; Izquierdo, Joaquin, AI Communications: The European Journal of Artificial Intelligence, 2016. https://upcommons.upc.edu/handle/2117/123530
Exploratory multivariate analysis by example using R - Husson, François; Lê, Sébastien; Pagès, Jérôme, CRC Press, Taylor & Francis, 2017. ISBN: 9781315301860
Correspondence Analysis in Practice - Greenacre, Michael, Chapman and Hall/CRC, 2016. ISBN: 9781315369983
Previous capacities
The courses Statistical Modeling (ME) and Probability and Statistics (IE).