Data Mining (MD)

Credits: 7.5 (6.0 ECTS)
Dept.: EIO-CS

Instructors

Person in charge:  (-)
Others:(-)

General goals

The philosophy behind Data Mining is the conversion of data into knowledge for decision making. Data mining is the central phase of the knowledge extraction process known as KDD (Knowledge Discovery in Databases). In this sense, it is a meeting point for several disciplines: statistics, machine learning, database techniques and decision-making systems. Together, these make it possible to solve the kinds of information processing problems that organisations face today. The subject is conceptually divided into three parts, focusing on association, classification and prediction, which together cover the vast majority of problems faced in Data Mining. A parallel objective is to use a free programming environment for solving data mining problems and to become familiar with different areas of professional practice.

Relevance of this course in a computer science curriculum

Data Mining is a discipline devoted to processing big data from the complex information systems of large organizations in order to extract relevant, new, understandable and useful knowledge for decision making. It applies in all kinds of contexts, from e-commerce to social networks, including environmental systems monitoring, customer loyalty cards, consumption in general, public health, banking, finance and industrial production.
Data Mining is an umbrella discipline: it combines techniques and methodologies from several computer science areas (data warehouse design, machine learning, statistical modelling, multivariate data analysis, data visualization, intensive computing, software engineering) to address the complexity of the field.
It is now clear that the value of an organization is directly related to the information that can be extracted from its available data, and there is still a shortage of professionals with the profile needed to do this. Data Mining is the science that transforms data into value for organizations, and acquiring skills in it is an excellent complement for computer science professionals, whatever their specialization.
For an information systems profile, this course provides the skills needed to complete the data processing cycle: too often an excellent information system design is underused because it lacks a good exploitation service with suitable mining. Knowing what can be extracted from data is also an important reference when designing the data structure itself. For software engineering, the course provides useful criteria for identifying and standardizing the data mining services to include in the large applications that support an organization, by deciding and planning which data consumption services to provide.
In information technologies, a point of interest is the relationship between real-time monitoring of fixed or mobile systems and data mining, which can reduce signals to relevant features, detect events to be communicated, or extract relevant information incrementally (data stream mining). Knowledge extraction from distributed data or from the cloud is an area with great prospects in the near future.
The subject also poses very interesting challenges in developing new knowledge extraction algorithms that are more efficient and/or scalable, in order to deal with big datasets or with less classical structures such as graphs (social network mining) or documents (web mining).

For a general overview of the discipline, see this master lecture:
http://videolectures.net/learning06_gibert_dmtae/

Specific goals

Knowledge

  1. Automatic statistical description of databases.
  2. Tools for reducing dimensionality, and multivariate visualisation.
  3. Generation of association rules.
  4. Tools for defining clusters.
  5. The generation of statistical forecasting models.
  6. General classification rules.
  7. The workings of multi-layer perceptrons and the support vector machine.
  8. Use of R, a programming environment distributed under the GNU General Public License, for Data Mining.

Abilities

  1. Identifying Data Mining problems in a professional setting.
  2. Identifying the most appropriate statistical techniques and/or AI techniques for solving a problem.
  3. Implementing simple learning algorithms.
  4. Using Data Mining systems for solving real-life problems.
  5. Assessing the quality of the results.
  6. Building a Data Mining system that integrates different areas of learning and is focused on decision-making.
  7. Acquaintance with the most commonly used professional Data Mining systems.

Competences

  1. Teamwork.
  2. Ability to solve quantitative problems in a computing environment.
  3. Drawing up reports and orally defending them.
  4. Ability to critically appraise data mining tools and results.

Contents

Estimated time (hours), with the following column abbreviations:

  • T: Theory
  • P: Problems
  • L: Laboratory
  • Alt: Other activities
  • Ext. L: External laboratory
  • Stu: Study
  • A. time: Additional time

1. INTRODUCTION TO DATA MINING
T     P     L     Alt   Ext. L  Stu   A. time  Total
1.0   0     0     0     0       0     0        1.0

  1. The data learning process.
  2. Data Mining problems.
  3. Data Mining techniques.
  4. The data. Data types. Pre-processing.

2. AUTOMATIC STATISTICAL DESCRIPTION OF DATABASES
T     P     L     Alt   Ext. L  Stu   A. time  Total
3.0   0     2.0   0     1.0     3.0   0        9.0

  1. Concept of hypothesis testing.
  2. Description of a continuous variable.
  3. Description of a categorical variable.

  • Laboratory
    Practical session 1.1. Automatic description of a Database
  • Additional laboratory activities:
    Preparation of a report on practical session 1.1. Automatic description of a Database

3. MULTIVARIATE VISUALISATION OF DATA
T     P     L     Alt   Ext. L  Stu   A. time  Total
4.0   0     2.0   0     1.0     4.0   0        11.0

  1. Principal Components Analysis (PCA).
  2. Multiple Correspondence Analysis (MCA).
  3. Projection of supplementary information.

  • Laboratory
    Practical session 1.2. Multivariate Visualisation
  • Additional laboratory activities:
    Practical session 1.2. Multivariate Visualisation

4. GENERATION OF ASSOCIATION RULES
T     P     L     Alt   Ext. L  Stu   A. time  Total
3.0   0     2.0   0     1.0     3.0   0        9.0

  1. Market basket analysis.
  2. Rule generation algorithms.
  3. Example of association rules.

  • Laboratory
    Practical session 1.3. Generation of association rules.
  • Additional laboratory activities:
    Preparation of a report on the practical session. Generation of association rules.

5. CLUSTERING TECHNIQUES
T     P     L     Alt   Ext. L  Stu   A. time  Total
4.0   0     2.0   0     2.0     4.0   0        12.0

  1. Direct partition methods: K-means algorithm.
  2. Accelerated K-means algorithm.
  3. Ascendant (agglomerative) hierarchical methods.
  4. Mixed methods.
  5. EM algorithm.
  6. Example of classification.

  • Laboratory
    Practical session 2.2. Programming a clustering algorithm.
  • Additional laboratory activities:
    Practical session 2.2. Programming a clustering algorithm.

6. PREDICTION MODELS EMPLOYING CONTINUOUS VARIABLES
T     P     L     Alt   Ext. L  Stu   A. time  Total
4.0   0     2.0   0     1.0     4.0   0        11.0

  1. Linear regression.
  2. Additive models.
  3. Evaluating the quality of the results.
  4. Regression on uncorrelated components.

7. GENERALISED LINEAR MODELS
T     P     L     Alt   Ext. L  Stu   A. time  Total
2.0   0     2.0   0     1.0     2.0   0        7.0

  1. GLM formulation.
  2. Logistic regression.
  3. Example of logistic regression.

  • Laboratory
    Practical session 3. Prediction model for logistic regression.
  • Additional laboratory activities:
    Practical session 3. Prediction model for logistic regression.

8. PARAMETRIC DISCRIMINATION METHODS
T     P     L     Alt   Ext. L  Stu   A. time  Total
3.0   0     2.0   0     1.0     3.0   0        9.0

  1. Linear discrimination and quadratic discrimination.
  2. Naive Bayes.
  3. Example of parametric discrimination.

  • Laboratory
    Practical session 3. Linear discriminant prediction model.
  • Additional laboratory activities:
    Practical session 3. Prediction model for linear discrimination.

9. NON-PARAMETRIC DISCRIMINATION
T     P     L     Alt   Ext. L  Stu   A. time  Total
3.0   0     2.0   0     1.0     3.0   0        9.0

  1. KNN (K-Nearest Neighbors) local discrimination.
  2. Example of local discrimination.

  • Laboratory
    Practical session 3. KNN prediction models.
  • Additional laboratory activities:
    Practical session 3. KNN prediction models.

10. DECISION TREES
T     P     L     Alt   Ext. L  Stu   A. time  Total
3.0   0     2.0   0     2.0     3.0   0        10.0

  1. CART.
  2. Other decision trees.
  3. Example of a decision tree.

  • Laboratory
    Practical session 3. Tree-based prediction model.
  • Additional laboratory activities:
    Practical session 3. Tree-based prediction model.

11. NEURAL NETWORKS
T     P     L     Alt   Ext. L  Stu   A. time  Total
5.0   0     4.0   0     3.0     5.0   0        17.0

  1. Formulation of neural networks.
  2. Single and multi-layer perceptrons.
  3. Example of a neural network.
  4. Kohonen maps.

  • Laboratory
    Practical session 3. Prediction model for a neural network.
  • Additional laboratory activities:
    Practical session 3. Prediction model for a neural network.

12. FLEXIBLE DISCRIMINATION METHODS
T     P     L     Alt   Ext. L  Stu   A. time  Total
3.0   0     0     0     0       3.0   0        6.0

  1. Support Vector Machines (SVMs).

13. COMBINING MODELS AND APPLICATIONS
T     P     L     Alt   Ext. L  Stu   A. time  Total
1.0   0     0     0     0       1.0   0        2.0

  1. Bagging and boosting.
  2. Web mining and text mining.

14. USING AN INTEGRATED DATA MINING SYSTEM
T     P     L     Alt   Ext. L  Stu   A. time  Total
0     0     2.0   0     0       0     0        2.0

  1. R
  2. Weka

  • Laboratory
    1. Introduction to R
    2. Introduction to Weka
  • Additional laboratory activities:
    Students will learn R through various lab sessions throughout the course.

15. PROFESSIONAL DATA MINING SYSTEMS
T     P     L     Alt   Ext. L  Stu   A. time  Total
0     0     4.0   0     0       0     0        4.0

  1. SPAD.
  2. Clementine.
  3. Enterprise Miner.

  • Laboratory
    Presentation of SPAD, Clementine, and Enterprise Miner.

16. PRESENTATION OF RESULTS
T     P     L     Alt   Ext. L  Stu   A. time  Total
0     0     0     0     10.0    0     0        10.0

  • Additional laboratory activities:
    Preparation of a presentation for practical session 3.


Total per kind
T     P     L     Alt   Ext. L  Stu   A. time  Total
39.0  0     28.0  0     24.0    38.0  0        129.0

Additional evaluation hours: 10.0
Total student work hours: 139.0

Teaching Methodology

Students will learn through case studies and the analysis of complex data sets from real-life problems. The problems will be used to develop scientific concepts in the theory classes and to apply them in the lab classes. Programming activities and/or the incorporation of data mining will help students assimilate the various concepts taught in the course. The R system will be used for this purpose.

R is a freely distributed, open programming system that can be used alongside other software available at the FIB: WEKA, Minitab, Saad, Excel, Matlab, etc. Given the purpose of this course, emphasis will also be placed on professional data mining systems such as SPAD, Clementine, and Enterprise Miner.


Documents in PDF format on the course Web site set out the schedule and content of the theory classes.

Evaluation Methodology

Academic assessment will be based on the grades obtained in the three practical assignments carried out during the course, plus a short test. The first practical assignment consists of solving a data pre-processing, visualization and clustering problem on a database.
The second practical assignment consists of generating association rules from commercial transaction data.
The third practical assignment covers a full prediction problem, which students can choose freely from the alternatives available. This last assignment includes elements from the previous ones. Its purpose is to give students the opportunity to solve a prediction problem by using and critically comparing various models. This assignment must be defended orally and publicly. Students will also have to answer technical questions on the models and methods used in their solution. The R system will be used for the practical assignments.
The test will take place on the last day of the course. It aims to evaluate, in a simple manner, the student's familiarity with the foundations of the course.
The three practical assignments carry weights of 15%, 15% and 50%, respectively, and the test carries 20%. Students will write a report on each practical assignment. The report may be written jointly by pairs of students.
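
As an illustration of how the weights above combine, the following is a minimal sketch in Python. The weights come from the evaluation rules stated here; the function name, the dictionary keys, the example grades and the assumption that all four components are graded on a common scale are hypothetical and are not part of the official course rules.

```python
# Weights taken from the evaluation rules: three practical assignments and one test.
WEIGHTS = {
    "practical_1": 0.15,
    "practical_2": 0.15,
    "practical_3": 0.50,
    "test": 0.20,
}


def final_grade(grades):
    """Return the weighted average of the four assessed components.

    Assumes (hypothetically) that every component uses the same grading scale.
    """
    missing = set(WEIGHTS) - set(grades)
    if missing:
        raise ValueError(f"missing grades: {sorted(missing)}")
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)


# Hypothetical grades, for illustration only.
example = {"practical_1": 7.0, "practical_2": 8.0, "practical_3": 6.0, "test": 9.0}
print(round(final_grade(example), 2))
```

Because the third practical assignment carries half of the total weight, a change in that grade moves the final result more than the same change in any other component.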

Basic Bibliography

  • Tomàs Aluja Banet, Alain Morineau. Aprender de los datos: el análisis de componentes principales: una aproximación desde el Data Mining, EUB, 1999.
  • D.J. Hand. Construction and assessment of classification rules, Wiley, 1997.
  • Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of statistical learning: data mining, Springer, 2001.
  • José Hernández Orallo, Mª José Ramírez Quintana, Cèsar Ferri Ramírez. Introducción a la minería de datos, Pearson, 2004.
  • Ian H. Witten, Eibe Frank. Data mining: practical machine learning tools and techniques with java implementations, Morgan Kaufmann Publishers, 1999.

Complementary Bibliography

  • Michael J. A. Berry, Gordon Linoff. Data mining techniques: for marketing, sales, and customer relationship management, Wiley, 2004.
  • David Hand, Heikki Mannila, Padhraic Smyth. Principles of data mining, MIT Press, 2001.
  • Ludovic Lebart, Alain Morineau, Marie Piron. Statistique exploratoire multidimensionnelle, Dunod, 1997.
  • Daniel Peña. Regresión y diseño de experimentos, Alianza, 2002.
  • B. D. Ripley. Pattern recognition and neural networks, Cambridge University Press, 1996.
  • Christopher M. Bishop. Neural networks for pattern recognition, Clarendon Press, 1995.
  • Leo Breiman ... [et al.]. Classification and regression trees, Chapman & Hall : ITP International Thomson Publishing, 1994.
  • Krzysztof J. Cios, Witold Pedrycz, Roman W. Swiniarski. Data mining methods for knowledge discovery, Kluwer Academic, 1998.
  • Maria L. Rizzo. Statistical Computing with R, Chapman and Hall, 2008.

Web links

  1. http://www.cran.es.r-project.org
  2. http://www.kdnuggets.com/
  3. http://www.cs.waikako.ac.nz


Prior knowledge

The course is self-contained but students should be familiar with the following concepts:
- The mean, and covariance and correlation matrices.
- Hypothesis testing.
- Singular value decomposition of a matrix.
- Programming algorithms.
- Multiple linear regression.

The prerequisite courses are: Statistics, Programming, and Mathematics.


© Barcelona School of Informatics (FIB)