Credits
6
Types
Compulsory
Requirements
This subject has no formal requirements, but it does assume prior skills
Department
EIO
Mail
dante.conti@upc.edu
This subject approaches data preprocessing methodology by systematizing the process and addressing more complex scenarios (compositional data, multivalued variables, multilingual data, ...). It also covers more advanced methods for imputing missing data and for diagnosing and treating outliers, so that the data can support complex decisions in real applications. The subject integrates the most complex data preprocessing techniques into a generic data science scenario, connecting the cleaned data to either multivariate statistical or machine learning models.
Regarding advanced methods of data analysis, the subject presents new multivariate analysis techniques, such as those that allow hierarchical clustering to scale, offer new ways of representing data (semantic variables), or generalize the topology of the classes that can be recognized. It also covers the automation of data post-processing, which helps to interpret representative patterns in classes. In addition, different multivariate statistical techniques will be explored to deal with spatio-temporal and textual data, as well as topic extraction.
Teachers
Person in charge
- Dante Conti (dante.conti@upc.edu)
Others
- Sergi Ramirez Mitjans (sergi.ramirez@upc.edu)
Weekly hours
Theory
2
Problems
0
Laboratory
2
Guided learning
0
Autonomous learning
6
Competences
Transversal
Basic
Specific
Generic
Objectives
- Become familiar with the tools and techniques of advanced data analysis in order to treat data correctly and internalize the data and information obtained as a source of support for decision-making processes.
Related competences: CG4, CB3, CE09, CE20
- Select, treat and adapt the relevant data to support a specific question.
Related competences: CG4, CG8, CT8, CB4, CE09, CE17
- Perform advanced data preprocessing.
Related competences: CG4, CE20
- Obtain profiles or patterns from mixed databases using advanced clustering techniques, and interpret the results with profiling and post-processing tools.
Related competences: CG4, CB2, CB4, CB5, CE09, CE20
- Apply multivariate data analysis, especially to categorical, mixed and unstructured data.
Related competences: CG4, CE20
- Process semi-structured or unstructured text data for text mining, sentiment analysis and topic modelling.
Related competences: CG4, CE09, CE18, CE20
- Analyze spatiotemporal data. Model data or problems with latent variables.
Related competences: CG4, CE20
- Build statistical models correctly from the data and the context of the reference problem, and present them publicly.
Related competences: CG4, CG8, CT3, CB2, CE09, CE20
- Develop practical work and projects with a gender perspective.
Related competences: CG8, CT8
- Integrate teamwork mechanisms in the performance of practical work.
Related competences: CT4
- Skillfully handle the computer tools needed to solve the real problems posed, using the techniques seen in class.
Related competences: CG4, CE09, CE20
- Interpret and contextualize the models built from data.
Related competences: CG4, CT3, CT8
- Validate the models obtained and interpret the results critically from a technical point of view, contextualizing them within the framework of the problem addressed.
Related competences: CG4, CG8, CE09, CE20
- Write a final report on the practical assignments or subject project.
Related competences: CG4, CG8, CG9, CT3, CT4, CT8, CE17
- Publicly present a report with the results of the project or practical assignment of the subject.
Related competences: CG4, CG8, CT3, CT4, CT8
Contents
- Introduction
Data quality, importance of data preprocessing, introduction to advanced data analysis techniques, relationship between multivariate analysis, machine learning and data science.
- Preprocessing
Data acquisition and homogenization, selection of variables (feature selection, feature weighting and reduction of variables), missing data: MICE, MIMMI, derivation of variables, transformation of variables, anomalous data (outliers).
- Advanced clustering methods
Scalability: CURE strategy, mixed distances and metrics, clustering on mixed data, DBSCAN, OPTICS, time series classification.
- Factorial analysis
MCA (multiple correspondence analysis) and FAMD (factor analysis of mixed data).
- Data analysis: spatiotemporal models
Basic concepts, geolocated data, visualization, distances in spatio-temporal analysis, components of spatio-temporal models and basic methods (Kriging).
- Text mining
Sentiment analysis, latent semantic analysis, topic modelling.
- Modeling based on latent variables
Activities
Teamwork
Students organize themselves into groups and look for real data that meet the requirements set by the teacher. They use these data to apply the techniques and methodologies covered throughout the course. At the end, they submit a report and give an oral presentation of the most relevant results of the study.
Objectives: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Contents:
Theory
0h
Problems
0h
Laboratory
11h
Guided learning
0h
Autonomous learning
28h
Practical application of the subject syllabus
Running practical R scripts on the concepts covered in theory.
Theory
0h
Problems
0h
Laboratory
13h
Guided learning
0h
Autonomous learning
0h
Teaching methodology
The 7 topics will be covered in 12 theory sessions (2 hours per week), each with its corresponding practical or laboratory session (also 2 hours per week). The remaining 3 of the 15 sessions per semester established at the FIB will be used for theoretical evaluations (quizzes or similar) and practical evaluations (defense of the practical work in the middle and at the end of the semester). In addition, there are a couple of weeks without lectures, reserved for midterm and/or final exams, during which students can be offered advice, support and guidance as reinforcement or preparation for their assessments.
Evaluation methodology
Ordinary Evaluation:
---------------------
(Q) Quizzes. 20%
(P) Project. 40%
(EF) Final Exam. 40%
Ordinary Final Grade = 0.2 * Q + 0.4 * P + 0.4 * EF
Q: Four tests of 5-10 questions each, all with the same weight in the final grade.
Q = (Q1 + Q2 + Q3 + Q4)/4
P. Team project where the following skills will be assessed:
- (P1) Data collection, analysis and interpretation of results (30%);
- (P2) Transmission of results (20%)
- (P3) Oral and written communication (20%)
- (P4) Teamwork (10%)
- (P5) Gender perspective (10%)
- (P6) Autonomy (10%)
P = 0.3 * P1 + 0.2 * P2 + 0.2 * P3 + 0.1 * P4 + 0.1 * P5 + 0.1 * P6
To pass the subject, students must obtain a minimum grade of 3.5 in the individual in-person tests, that is, 1/3 * Q + 2/3 * EF > 3.5. In addition, completing the project is mandatory in order to pass in the ordinary evaluation.
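The ordinary evaluation arithmetic can be sketched in Python as follows. This is only an illustrative sketch, not an official grading tool: the function names are hypothetical, the project weights are taken from the percentages listed for P1 to P6 (30/20/20/10/10/10), and the minimum-grade check uses the strict inequality exactly as written in the text.

```python
from statistics import mean

# Assumption: project weights follow the percentages listed for P1..P6.
PROJECT_WEIGHTS = (0.3, 0.2, 0.2, 0.1, 0.1, 0.1)


def quiz_grade(quizzes):
    """Q: plain average of the four equally weighted quizzes."""
    if len(quizzes) != 4:
        raise ValueError("expected exactly 4 quiz grades")
    return mean(quizzes)


def project_grade(parts):
    """P: weighted sum of the six project components P1..P6."""
    if len(parts) != len(PROJECT_WEIGHTS):
        raise ValueError("expected exactly 6 project component grades")
    return sum(w * p for w, p in zip(PROJECT_WEIGHTS, parts))


def ordinary_grade(q, p, ef):
    """Ordinary Final Grade = 0.2 * Q + 0.4 * P + 0.4 * EF."""
    return 0.2 * q + 0.4 * p + 0.4 * ef


def meets_minimum(q, ef, threshold=3.5):
    """Individual-tests condition: 1/3 * Q + 2/3 * EF > 3.5 (as written)."""
    return q / 3 + 2 * ef / 3 > threshold


# Hypothetical grades on a 0-10 scale, for illustration only.
q = quiz_grade([6, 7, 8, 9])
p = project_grade([8, 7, 9, 10, 6, 8])
ef = 6.0
print(round(ordinary_grade(q, p, ef), 2), meets_minimum(q, ef))
```

The weights 1/3 and 2/3 in the minimum-grade condition are simply the quiz and final exam weights (0.2 and 0.4) rescaled to sum to 1, so the check looks only at the individual tests and ignores the team project.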
Reassessment:
---------------------------------
(EE) Extraordinary Final Exam
Extraordinary Grade = Minimum{7, Maximum{EE, 0.2 * Q + 0.4 * P + 0.4 * EE}}
Only students who took the exam in the ordinary evaluation and failed it may sit the reassessment. Students who did not take the ordinary evaluation (NP, not presented) are therefore excluded.
No minimum grade is required in the reassessment. The highest grade obtainable in the reassessment is 7.
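As an illustration of the reassessment formula, the following Python sketch takes the better of the extraordinary exam alone and the weighted grade with EE in place of EF, then caps the result at 7. The function name and the sample grades are hypothetical.

```python
def extraordinary_grade(q, p, ee, cap=7.0):
    """Reassessment grade: min{cap, max{EE, 0.2*Q + 0.4*P + 0.4*EE}}."""
    weighted = 0.2 * q + 0.4 * p + 0.4 * ee
    return min(cap, max(ee, weighted))


# Hypothetical grades on a 0-10 scale, for illustration only.
print(extraordinary_grade(q=4.0, p=6.0, ee=5.0))  # weighted grade is the better option
print(extraordinary_grade(q=2.0, p=3.0, ee=6.0))  # exam alone is the better option
print(extraordinary_grade(q=9.0, p=9.0, ee=9.0))  # result is capped
```

Because of the inner maximum, weak quiz or project grades cannot pull the result below the extraordinary exam grade itself; the outer minimum enforces the cap of 7.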
Bibliography
Basic
- A survey on pre-processing techniques: Relevant issues in the context of environmental data mining
Gibert, Karina; Sànchez-Marrè, Miquel; Izquierdo, Joaquin,
AI Communications: The European Journal of Artificial Intelligence,
2016.
https://upcommons.upc.edu/handle/2117/123530
- Preprocessing and Artificial Intelligence for Increasing Explainability in Mental Health
Angerri, X.; Gibert, K.,
International Journal on Artificial Intelligence Tools,
https://www.worldscientific.com/doi/abs/10.1142/S0218213023400110
- The Elements of statistical learning : data mining, inference, and prediction
Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome,
Springer,
cop. 2009.
ISBN: 9780387952840
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003549679706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
- Exploratory multivariate analysis by example using R
Husson, François; Lê, Sébastien; Pagès, Jérôme,
CRC Press, Taylor & Francis Group,
2017.
ISBN: 9781315301860
https://ebookcentral-proquest-com.recursos.biblioteca.upc.edu/lib/upcatalunya-ebooks/detail.action?pq-origsite=primo&docID=4856173
- Applied multivariate statistical analysis
Johnson, Richard A; Wichern, Dean W,
Pearson,
[2014].
ISBN: 9781292024943
https://ebookcentral-proquest-com.recursos.biblioteca.upc.edu/lib/upcatalunya-ebooks/detail.action?pq-origsite=primo&docID=5174865
- Statistics: the art and science of learning from data
Agresti, Alan; Franklin, Christine,
Pearson Education,
2018.
ISBN: 9781292164779
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004951010406711&context=L&vid=34CSUC_UPC:VU1&lang=ca
- Practical statistics for data scientists: 50+ essential concepts using R and Python
Bruce, Peter; Bruce, Andrew; Gedeck, Peter,
O'Reilly,
[2020].
ISBN: 9781492072942
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004946307706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Complementary
- Análisis de datos multivariantes
Peña, Daniel,
McGraw-Hill,
cop. 2002.
ISBN: 9788448136109
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002497609706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
- Exploratory multivariate analysis by example using R
Husson, François; Lê, Sébastien; Pagès, Jérôme,
CRC Press, Taylor & Francis,
2017.
ISBN: 9781315301860
- Correspondence Analysis in Practice
Greenacre, Michael,
Chapman and Hall/CRC,
2016.
ISBN: 9781315369983