Data Analysis and Knowledge Discovery

Credits

Types

Optional

Requirements

This subject has no requirements, but it does have previous capacities.

Department

Web

https://raco.fib.upc.edu/home/assignatura?espai=270650

Mail

assig-DAKD-MIRI@fib.upc.edu

This exciting course broaches the hot topic of Data Analysis and Knowledge Discovery (DAKD) from the viewpoint of Data Mining.
Most areas in science, engineering and business are becoming increasingly data dependent. Clear examples of this are, to name a few, bioinformatics, medicine, or electronic commerce.
Data analysis techniques are needed to deal with these data and generate usable knowledge from them. DAKD techniques are among the most promising of these approaches, and they are at the core of the contents of this course.

Teachers

Person in charge

Alfredo Vellido Alcacena (avellido@cs.upc.edu)

Others

Carlos Cano Domingo (carlos.cano.domingo@upc.edu)
Caroline König (caroline.leonore.konig@upc.edu)

Weekly hours

Theory

Problems

Laboratory

Guided learning

0.6

Autonomous learning

6.4

Competences

Transversal Competences

Information literacy

CT4 - Capacity for managing the acquisition, the structuring, analysis and visualization of data and information in the field of specialisation, and for critically assessing the results of this management.

Third language

CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.

Basic

CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.

CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.

CB10 - Possess and understand knowledge that provides a basis or opportunity to be original in the development and/or application of ideas, often in a research context.

Generic Technical Competences

Generic

CG2 - Identify and apply methods of data analysis, knowledge extraction and visualization for data collected in disparate formats

Technical Competences

Specific

CE2 - Apply the fundamentals of data management and processing to a data science problem

CE5 - Model, design, and implement complex data systems, including data visualization

CE8 - Extract information from structured and unstructured data by considering their multivariate nature.

CE10 - Identify machine learning and statistical modeling methods to use and apply them rigorously in order to solve a specific data science problem

CE12 - Apply data science in multidisciplinary projects to solve problems in new or poorly explored domains from a data science perspective that are economically viable, socially acceptable, and in accordance with current legislation

CE13 - Identify the main threats related to ethics and data privacy in a data science project (both in terms of data management and analysis) and develop and implement appropriate measures to mitigate these threats

Objectives

Presenting DM as a process that should follow a properly applied methodology.
Related competences: CT4, CT5, CG2, CE2, CE8, CE10, CB10
Subcompetences
- Techniques for searching and processing information in heterogeneous environments
- Data cleaning
- Data derivation
Introducing the students to the new concept of DM for processes, called Process Mining.
Related competences: CT4, CT5, CG2, CE2, CE5, CE8, CE10, CE12, CB6
Subcompetences
- Algorithms for analyzing continuous data streams
Delving in some detail into one of the stages of DM: data exploration.
Related competences: CT4, CT5, CG2, CE2, CE5, CE8, CE10, CB10
Subcompetences
- Data exploration and visualization in data mining
Dealing in detail with the problem of data visualization for exploration as a key issue in DM.
Related competences: CT4, CT5, CE5
Subcompetences
- Data exploration and visualization in data mining
Introducing the students to the basics of probability theory as applied in Data Analysis and Knowledge Discovery (DAKD).
Related competences: CT5, CE8, CE10, CB7, CB10
Subcompetences
- Bayesian statistics
Introducing the students to the probabilistic variant of DAKD in the form of Statistical Machine Learning, both for supervised and unsupervised learning models.
Related competences: CT5, CE8, CE10, CB7, CB10
Subcompetences
- Bayesian statistics
- Modeling based on latent factors
Dealing in detail with different unsupervised models for data visualization, including case studies.
Related competences: CT5, CG2, CE2, CE5, CE8, CE10, CE12, CB6, CB10
Subcompetences
- Advanced algorithms for data mining
- Data exploration and visualization in data mining
- Design and implementation of visualization systems
Approaching the multi-faceted concept of data mining (DM) from different perspectives.
Related competences: CT4, CT5, CE12, CE13, CB6, CB7
Subcompetences
- Techniques for searching and processing information in heterogeneous environments

Introduction to the concept of data mining (DM).
DM is a multi-faceted concept that requires discussion and clarification. We will do this at the beginning of the course.
DM as a methodology.
We argue that DM should not be focused on the concept of data analysis/modeling, but, instead, should be treated as a methodology with diverse inter-related stages.
DM for processes: Process Mining.
A recent development in DM methodologies is one specifically suited to processes. It is called Process Mining and will be described and discussed in this course.
Data exploration in DM.
One of the main stages of well-structured DM methodologies is data exploration. It will be discussed as a preamble to data visualization.
Data visualization for exploration.
One of the aspects of the problem of data exploration is data visualization. It has a research 'life' of its own as it involves not only computer-based mathematical models, but also natural perception and processing.
Basics of probability theory in Data Analysis and Knowledge Discovery (DAKD)
For much of the last half-century, multivariate statistics and artificial intelligence (mostly in the field of machine learning) developed in parallel without fully meeting. Statistical machine learning has bridged that gap over the last two decades. We introduce it by first providing some basic principles of probability theory (Bayesian inference).
Statistical Machine Learning for DAKD: supervised models.
Once the basics of Bayesian inference are set, we will delve into the field of Statistical Machine Learning for DAKD, starting with supervised learning models, with an emphasis on feed-forward artificial neural networks.
Statistical Machine Learning for DAKD: unsupervised models.
Once the basics of Bayesian inference and of Statistical Machine Learning for DAKD in supervised models are set, we will continue with unsupervised models, focusing on self-organizing maps and related models.
Unsupervised models for data visualization, with case studies.
In the final item of the course contents, we will bring statistical machine learning and data visualization together by discussing some probabilistic unsupervised learning models for data visualization, with some case studies as examples.

Activities

Activity Evaluation act

Essay on DAKD for DM

Students will have to write a research essay on the topic of DAKD for DM, choosing one of these options:
1. State of the art on a specific DAKD-DM topic
2. Evaluation of a DAKD-DM software tool with original experiments
3. Pure research essay, with original experimental content
Objectives: 1 3 5 7 2 4 6 8
Week: 18

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Introduction to Data Mining and its Methodologies

Introduction to Data Mining as a general concept and to its methodologies for practical implementation
Objectives: 1
Contents:

1. Introduction to the concept of data mining (DM).
2. DM as a methodology.

Theory

Problems

Laboratory

Guided learning

Autonomous learning

13h

Process Mining

Introduction to the novel concept of Process Mining and its application within the DM framework.
Objectives: 2
Contents:

3. DM for processes: Process Mining.

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Data Visualization

As part of the DM stage of Data Exploration, we focus on the problem of Data Visualization.
Objectives: 3 4
Contents:

4. Data exploration in DM.
5. Data visualization for exploration.

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Basics of probability theory for intelligent data analysis

Introduction to probability theory for intelligent data analysis, with a focus on Bayesian statistics
Objectives: 5
Contents:

6. Basics of probability theory in Data Analysis and Knowledge Discovery (DAKD)

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Statistical Machine Learning methods

The meeting of statistics and machine learning: Statistical Machine Learning methods, from the point of view of both supervised and unsupervised learning
Objectives: 5 6
Contents:

7. Statistical Machine Learning for DAKD: supervised models.
8. Statistical Machine Learning for DAKD: unsupervised models.

Theory

12h

Problems

Laboratory

Guided learning

Autonomous learning

18h

SML in data visualization, with case studies

We merge the topics of SML and data visualization, illustrating their combined use with some real case studies
Objectives: 7 4 8
Contents:

9. Unsupervised models for data visualization, with case studies.

Theory

Problems

Laboratory

Guided learning

Autonomous learning

15h

Teaching methodology

This course will build on different teaching methodology (TM) aspects, including:
TM1: Expositive seminars
TM2: Expositive-participative seminars
TM3: Orientation for individual assignments (essays)
TM4: Individual tutoring

Evaluation methodology

The course will include two evaluation tasks:
The first one will be a purely analytical data science task performed according to data mining principles.
The second one will involve writing an essay according to one of these three modalities:
1. State of the art on a specific DAKD-DM topic
2. Evaluation of a DAKD-DM software tool with original experiments
3. Pure research essay, with original experimental content

Bibliography

Basic

Information theory, inference, and learning algorithms - MacKay, D.J.C, Cambridge University Press, 2003. ISBN: 0521642981
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002876809706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Principles of data mining - Hand, D.; Mannila, H.; Smyth, P, MIT Press, 2001. ISBN: 026208290X
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002287109706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Pattern recognition and machine learning - Bishop, C.M, Springer, 2006. ISBN: 0387310738
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003157379706711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Complementary

Statistics: a very short introduction - Hand, D.J, Oxford University Press, 2008. ISBN: 9780199233564
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003868839706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Information visualization: design for interaction - Spence, R, Pearson/Prentice Hall, 2007. ISBN: 9780132065504
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003948629706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Visualize this: the flowing data guide to design, visualization, and statistics - Yau, N, Wiley, 2011. ISBN: 9780470944882
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003948649706711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Previous capacities

Students are expected to have at least some basic background in the area of artificial intelligence and, more specifically, in the areas of Machine Learning and Computational Intelligence.
Some basic knowledge of probability theory and statistics would be beneficial.
Other than this, the course is open to students and researchers of all types of background.