Masters in Computer Science and Engineering

FACULTAT D'INFORMÀTICA
DE BARCELONA

Data Mining (MD)

Credits	Dept.
7.5 (6.0 ECTS)	EIO-CS

Instructors

Person in charge:	(-)
Others:	(-)

General goals

The philosophy behind Data Mining is the conversion of data into knowledge for decision making. Data mining is the central phase of the knowledge extraction process known as KDD (Knowledge Discovery in Databases). In this sense, it is a meeting point for different disciplines: statistics, machine learning, database techniques and decision-making systems. Together these make it possible to solve the kinds of information processing problems that organisations face today. The subject is conceptually divided into three parts, focusing on association, classification and prediction, which together make up the vast majority of problems faced in Data Mining. A parallel objective is to use a free programming environment for solving data mining problems, and to become familiar with different professional application areas.

Relevance of this course in a computer science curriculum

Data Mining is a discipline devoted to processing large volumes of data from the complex information systems of large organisations in order to extract relevant, new, understandable and useful knowledge for decision making. It applies in all kinds of contexts, from e-commerce to social networks, including environmental systems monitoring, customer loyalty cards, consumption in general, public health, banking, finance and industrial production.
Data Mining is an umbrella discipline that combines techniques and methodologies from several areas of computer science (data warehouse design, machine learning, statistical modelling, multivariate data analysis, data visualisation, intensive computing, software engineering) to address the complexity of the field.
It is now clear that the value of an organisation is directly related to the information that can be extracted from its available data, yet professionals with a suitable profile for this task are still scarce. Data Mining is the science that transforms data into value for organisations, and acquiring skills in this area is an excellent complement for computer science professionals, whatever their specialisation.
For an information systems profile, this course provides the skills needed to complete the data processing cycle: too often an excellent information system design is underused because it lacks a good exploitation service with suitable mining. Knowing what can be extracted from data is also an important reference when designing the data structure itself. For software engineering, the course provides useful criteria for identifying and standardising the data mining services to be included in the large applications that support an organisation, by deciding on and planning the data consumption services to be provided.
In information technologies, a topic of interest is the relationship between real-time monitoring of fixed or mobile systems and data mining, which can reduce signals to relevant features, detect events to be communicated, or extract relevant information in an incremental process (data stream mining). Knowledge extraction from distributed data or from the cloud is an area expected to grow strongly in the near future.
The field also poses very interesting challenges in developing new knowledge extraction algorithms that are more efficient and/or scalable, to deal with big datasets or with less classical structures such as graphs (social network mining) or documents (web mining).

For a general overview of the discipline, please see this lecture:
http://videolectures.net/learning06_gibert_dmtae/

Specific goals

Knowledge

Automatic statistical description of databases.

Tools for reducing dimensionality, and multivariate visualisation.

Generation of association rules.

Tools for defining clusters.

The generation of statistical forecasting models.

The general classification rules.

The workings of multi-layer perceptrons and the support vector machine.

Use of the R programming environment, distributed under the GNU General Public License, for Data Mining.

Abilities

Identifying Data Mining problems in a professional setting.

Identifying the most appropriate statistical techniques and/or AI techniques for solving a problem.

Implementing simple learning algorithms.

Using Data Mining systems for solving real-life problems.

Assessing the quality of the results.

Building a Data Mining system that integrates different areas of learning and is focused on decision-making.

Acquaintance with the most commonly used professional Data Mining systems.

Competences

Teamwork.

Ability to solve quantitative problems in a computing environment.

Drawing up reports and orally defending them.

Ability to critically appraise data mining tools and results.

Contents

Estimated time (hours):

T = Theory, P = Problems, L = Laboratory, Alt = Other activities, Ext. L = External Laboratory, Stu = Study, A. time = Additional time

1.	INTRODUCTION TO DATA MINING

Estimated hours: Theory 1.0, Total 1.0

1. The data learning process.

2. Data Mining problems.

3. Data Mining techniques.

4. The data. Data types. Pre-processing

2.	AUTOMATIC STATISTICAL DESCRIPTION OF DATABASES.

Estimated hours: Theory 3.0, Laboratory 2.0, External Laboratory 1.0, Study 3.0, Total 9.0

1. Concept of hypothesis testing.

2. Description of a continuous variable.

3. Description of a categorical variable.

Laboratory
Practical session 1.1. Automatic description of a Database
Additional laboratory activities:
Preparation of a report on the practice session 1.1. Automatic description of a Database

3.	MULTIVARIATE VISUALISATION OF DATA

Estimated hours: Theory 4.0, Laboratory 2.0, External Laboratory 1.0, Study 4.0, Total 11.0

1. Principal Components Analysis.

2. Multiple Correspondence Analysis (MCA).

3. Projection of supplementary information.

Laboratory
Practical session 1.2. Multivariate Visualisation
Additional laboratory activities:
Practical session 1.2. Multivariate Visualisation

4.	GENERATION OF ASSOCIATION RULES

Estimated hours: Theory 3.0, Laboratory 2.0, External Laboratory 1.0, Study 3.0, Total 9.0

1. Market basket analysis.

2. Rule generation algorithms.

3. Example of association rules.

Laboratory
Practical session 1.3. Generation of association rules.
Additional laboratory activities:
Preparation of a report on the practice session. Generation of association rules.

5.	CLUSTERING TECHNIQUES

Estimated hours: Theory 4.0, Laboratory 2.0, External Laboratory 2.0, Study 4.0, Total 12.0

1. Direct partition methods: K-means algorithm.

2. Accelerated K-means algorithm.

3. Agglomerative (ascending hierarchical) methods.

4. Mixed methods.

5. EM algorithm.

6. Example of clustering.

Laboratory
Practical session 2.2. Programming a clustering algorithm.
Additional laboratory activities:
Practical session 2.2. Programming a clustering algorithm.

6.	PREDICTION MODELS EMPLOYING CONTINUOUS VARIABLES

Estimated hours: Theory 4.0, Laboratory 2.0, External Laboratory 1.0, Study 4.0, Total 11.0

1. Linear regression.

2. Additive models.

3. Evaluating the quality of the results.

4. Regression on uncorrelated components.

7.	GENERALISED LINEAR MODELS

Estimated hours: Theory 2.0, Laboratory 2.0, External Laboratory 1.0, Study 2.0, Total 7.0

1. GLM (Generalised Linear Model) formulation.

2. Logistic regression.

3. Example of logistic regression

Laboratory
Practical session 3. Prediction model for logistic regression.
Additional laboratory activities:
Practice 3. Prediction model for logistic regression.

8.	PARAMETRIC DISCRIMINATION METHODS

Estimated hours: Theory 3.0, Laboratory 2.0, External Laboratory 1.0, Study 3.0, Total 9.0

1. Linear discrimination and quadratic discrimination.

2. Naive Bayes.

3. Example of parametric discrimination.

Laboratory
Practical session 3. Linear discriminant prediction model.
Additional laboratory activities:
Practice 3. Prediction model for linear discrimination.

9.	NON-PARAMETRIC DISCRIMINATION

Estimated hours: Theory 3.0, Laboratory 2.0, External Laboratory 1.0, Study 3.0, Total 9.0

1. KNN (K-Nearest Neighbor) local discrimination.
2. Example of local discrimination.

Laboratory
Practical session 3. KNN prediction models.
Additional laboratory activities:
Practice 3. KNN prediction models.

10.	DECISION TREES

Estimated hours: Theory 3.0, Laboratory 2.0, External Laboratory 2.0, Study 3.0, Total 10.0

1. CART.

2. Other decision trees.

3. Example of a decision tree.

Laboratory
Practical session 3. Tree-based prediction model.
Additional laboratory activities:
Practice 3. Tree-based prediction model.

11.	NEURAL NETWORKS

Estimated hours: Theory 5.0, Laboratory 4.0, External Laboratory 3.0, Study 5.0, Total 17.0

1. Formulation of neural networks.

2. Single and multi-layer perceptrons.

3. Example of a neural network.

4. Kohonen maps

Laboratory
Practical session 3. Prediction model for a neural network.
Additional laboratory activities:
Practice session 3. Prediction model for a neural network.

12.	FLEXIBLE DISCRIMINATION METHODS

Estimated hours: Theory 3.0, Study 3.0, Total 6.0

FLEXIBLE METHODS

1. Support Vector Machines (SVMs).

13.	COMBINING MODELS AND APPLICATIONS

Estimated hours: Theory 1.0, Study 1.0, Total 2.0

1. Bagging and boosting.

2. Web mining and text mining.

14.	USING AN INTEGRATED DATA MINING SYSTEM.

Estimated hours: Laboratory 2.0, Total 2.0

1. R

2. Weka

Laboratory
1. Introduction to R
2. Introduction to Weka
Additional laboratory activities:
Students will learn R through various lab sessions throughout the course.

15.	PROFESSIONAL DATA MINING SYSTEMS

Estimated hours: Laboratory 4.0, Total 4.0

1. SPAD.

2. Clementine.

3. Enterprise Miner.

Laboratory
Presentation of SPAD, Clementine, and Enterprise Miner.

16.	PRESENTATION OF RESULTS

Estimated hours: External Laboratory 10.0, Total 10.0

Additional laboratory activities:
Preparation of a presentation for practice session 3.

	T	P	L	Alt	Ext. L	Stu	A. time	Total
Total per kind	39.0	0	28.0	0	24.0	38.0	0	129.0
Additional assessment hours								10.0
Total student work hours								139.0

Teaching Methodology

Students will learn through case studies and the analysis of complex data sets from real-life problems. These problems will be used to develop scientific concepts in the theory classes and to apply them in the lab classes. Programming activities and/or the incorporation of data mining will help students assimilate the various concepts taught in the course. The R system will be used for this purpose.

R is a freely distributed, open programming system and can use other software available at the FIB: WEKA, Minitab, Saad, Excel, Matlab, etc. Given the purpose of this course, stress will also be laid on professional data mining systems, such as SPAD, Clementine, and Enterprise Miner.

Documents in pdf format are provided on the course Web site and set out information on the schedule and content of the theory classes.

Evaluation Methodology

Academic assessment will be based on the grades obtained in the three practical assignments carried out during the course, plus a short test. The first practical assignment consists of data pre-processing, visualisation and clustering on a database.
The second practical assignment consists of generating association rules from commercial transaction data.
The third practical assignment covers a full prediction problem, which students can choose freely from among the alternatives available. It includes elements from the previous assignments. Its purpose is to give students the opportunity to solve a prediction problem by using and critically comparing various models. This assignment must be defended orally and publicly, and students will also have to answer technical questions on the models and methods used in their solution. The R system will be used for the practical assignments.
The test will take place on the last day of the course. It aims to assess, in a simple manner, the students' familiarity with the foundations of the course.
The three practical assignments carry weights of 15%, 15% and 50% respectively, and the test 20%. Students will write a report on each practical assignment. The report may be written jointly by pairs of students.
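
As an illustration of the weighting described above, the final grade can be sketched as a weighted average. The sketch is written in Python and rests on assumptions that the syllabus does not state: the 0-10 marking scale, the key names and the example marks are all hypothetical.

```python
# Weights taken from the assessment description: three practical works and one test.
WEIGHTS = {
    "practical_1": 0.15,  # pre-processing, visualisation and clustering
    "practical_2": 0.15,  # association rules
    "practical_3": 0.50,  # full prediction problem, defended orally
    "test": 0.20,         # final test
}


def final_grade(marks):
    """Return the weighted average of the four marks.

    Assumption: all marks use the same scale (0-10 here); the syllabus
    does not specify the scale or any rounding rule.
    """
    missing = set(WEIGHTS) - set(marks)
    if missing:
        raise ValueError(f"missing marks: {sorted(missing)}")
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)


# Hypothetical marks, for illustration only.
example = {"practical_1": 7.0, "practical_2": 8.0, "practical_3": 6.0, "test": 9.0}
print(round(final_grade(example), 2))
```

Because the third practical assignment carries half of the total weight, a one-point change in its mark moves the final grade by 0.5 points, compared with 0.15 for either of the first two.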

Basic Bibliography

Tomàs Aluja Banet, Alain Morineau. Aprender de los datos: el análisis de componentes principales: una aproximación desde el Data Mining, EUB, 1999.
D.J. Hand. Construction and assessment of classification rules, Wiley, 1997.
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of statistical learning: data mining, Springer, 2001.
José Hernández Orallo, Mª José Ramírez Quintana, Cèsar Ferri Ramírez. Introducción a la minería de datos, Pearson, 2004.
Ian H. Witten, Eibe Frank. Data mining: practical machine learning tools and techniques with Java implementations, Morgan Kaufmann Publishers, 1999.

Complementary Bibliography

Michael J. A. Berry, Gordon Linoff. Data mining techniques: for marketing, sales, and customer relationship management, Wiley, 2004.
David Hand, Heikki Mannila, Padhraic Smyth. Principles of data mining, MIT Press, 2001.
Ludovic Lebart, Alain Morineau, Marie Piron. Statistique exploratoire multidimensionnelle, Dunod, 1997.
Daniel Peña. Regresión y diseño de experimentos, Alianza, 2002.
B. D. Ripley. Pattern recognition and neural networks, Cambridge University Press, 1996.
Christopher M. Bishop. Neural networks for pattern recognition, Clarendon Press, 1995.
Leo Breiman ... [et al.]. Classification and regression trees, Chapman & Hall: ITP International Thomson Publishing, 1994.
Krzysztof J. Cios, Witold Pedrycz, Roman W. Swiniarski. Data mining methods for knowledge discovery, Kluwer Academic, 1998.
Maria L. Rizzo. Statistical Computing with R, Chapman and Hall, 2008.

Web links

http://www.cran.es.r-project.org

http://www.kdnuggets.com/

http://www.cs.waikako.ac.nz

Prior knowledge

The course is self-contained but students should be familiar with the following concepts:
- The mean, the covariance matrix and the correlation matrix.
- Hypothesis testing.
- Singular value decomposition of a matrix.
- Programming algorithms.
- Multiple linear regression.

The prerequisite courses are: Statistics, Programming, and Mathematics.
