Person in charge: ()
Others: ()
Credits: 7.5 (6.0 ECTS)
Dept.: EIOCS
The philosophy behind Data Mining is the conversion of data into knowledge for decision making. Data mining constitutes the central phase of the KDD (Knowledge Discovery in Databases) process of knowledge extraction. In this sense, data mining is a meeting point of several disciplines: statistics, machine learning, database techniques and decision-support systems. Together, these make it possible to solve the kind of information-processing problems that organisations face today. The subject is conceptually divided into three parts, focusing on association, classification and prediction, which together cover the vast majority of problems faced in Data Mining. A parallel objective is the use of a free programming environment for solving data mining problems, as well as becoming acquainted with different spheres of the professional arena.
On the interest of this course in a Computer Science curriculum
Data Mining is a discipline devoted to processing big data from the complex information systems of large organisations in order to extract relevant, novel, understandable and useful knowledge for decision making, in all kinds of contexts: from e-commerce to social networks, including environmental monitoring systems, customer loyalty cards, consumer behaviour in general, public health, banking, finance and industrial production.
Data Mining is an umbrella discipline that combines techniques and methodologies from several computer science areas (data warehouse design, machine learning, statistical modelling, multivariate data analysis, data visualization, intensive computing, software engineering) to address the complexity of the field.
Currently, it is clear that the value of an organisation is directly related to the information that can be extracted from its available data, and there is still a shortage of professionals with the right profile to do this. Data Mining is the science that transforms data into value for organisations, and acquiring skills in this area is an excellent complement for any computer science professional, whatever specialization he or she follows.
Regarding the information systems profile, this course provides the skills needed to complete the data-processing cycle: too often an excellent information system design is under-exploited for lack of a good exploitation service with suitable mining. Moreover, knowing what can be extracted from data is an important point to take into account in the design of the data structures themselves. Regarding software engineering, the course provides useful criteria to identify and standardize the data mining services to include in the large computer applications that support an organisation, by deciding and planning the data-consumption services to be provided.
In information technologies, the relationship between real-time monitoring of fixed or mobile systems and data mining is of particular interest: mining can reduce signals to relevant features, detect events to report, or extract relevant information incrementally (data stream mining). Knowledge extraction from distributed data or from the cloud is an area with enormous projection in the near future.
The subject also poses very interesting challenges related to the development of new knowledge-extraction algorithms that are more efficient and/or more scalable, to deal with big datasets or with less classical structures such as graphs (social network mining) or documents (web mining).
Please have a look at this master lecture to get a general idea of the discipline:
http://videolectures.net/learning06_gibert_dmtae/
Estimated time (hours):
T = Theory, P = Problems, L = Laboratory, Alt = Other activities, Ext. L = External Laboratory, Stu = Study, A. time = Additional time

T 3,0 | P 0 | L 2,0 | Alt 0 | Ext. L 1,0 | Stu 3,0 | A. time 0 | Total 9,0
1. Concept of hypothesis testing. 2. Description of a continuous variable. 3. Description of a categorical variable.
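As a minimal illustration of points 2 and 3, the following sketch (hypothetical data, standard library only) describes a continuous variable by its summary statistics and a categorical variable by its frequency table:

```python
import statistics
from collections import Counter

# Hypothetical sample of a continuous variable
x = [12.3, 15.1, 9.8, 14.4, 11.0, 13.7, 10.5, 16.2]
mean = statistics.mean(x)       # central tendency
sd = statistics.stdev(x)        # sample standard deviation
median = statistics.median(x)   # robust central tendency

# Hypothetical categorical variable: a frequency table describes it
colour = ["red", "blue", "red", "green", "red", "blue"]
freq = Counter(colour)

print(f"mean={mean:.2f}, sd={sd:.2f}, median={median:.2f}")
print(freq.most_common())
```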


T 4,0 | P 0 | L 2,0 | Alt 0 | Ext. L 1,0 | Stu 4,0 | A. time 0 | Total 11,0
1. Principal Components Analysis. 2. Multiple Correspondence Analysis (MCA). 3. Projection of supplementary information.
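The core computation behind Principal Components Analysis can be sketched for two variables without any library: the eigenvalues of the covariance matrix give the variance captured by each component. The data below are purely hypothetical:

```python
import math

# Toy data: rows = individuals, two hypothetical variables
X = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7)]

n = len(X)
mx = sum(x for x, _ in X) / n
my = sum(y for _, y in X) / n
# Sample covariance matrix of the centred data
sxx = sum((x - mx) ** 2 for x, _ in X) / (n - 1)
syy = sum((y - my) ** 2 for _, y in X) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in X) / (n - 1)

# Eigenvalues of the 2x2 symmetric covariance matrix = variance per component
tr, det = sxx + syy, sxx * syy - sxy ** 2
l1 = (tr + math.sqrt(tr ** 2 - 4 * det)) / 2   # first principal component
l2 = (tr - math.sqrt(tr ** 2 - 4 * det)) / 2   # second principal component
explained = l1 / (l1 + l2)                     # share of total variance on PC1
print(f"PC1 explains {explained:.1%} of the variance")
```

Because the two variables are strongly correlated, nearly all the variance falls on the first component.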


T 3,0 | P 0 | L 2,0 | Alt 0 | Ext. L 1,0 | Stu 3,0 | A. time 0 | Total 9,0
1. Market basket analysis. 2. Rule generation algorithms. 3. Example of association rules.
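A minimal sketch of how support, confidence and lift are computed for an association rule such as {bread} -> {milk}; the transactions are hypothetical:

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {bread} -> {milk}
conf = support({"bread", "milk"}) / support({"bread"})   # confidence
lift = conf / support({"milk"})                          # lift vs. independence
print(f"confidence={conf:.2f}, lift={lift:.2f}")
```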


T 4,0 | P 0 | L 2,0 | Alt 0 | Ext. L 2,0 | Stu 4,0 | A. time 0 | Total 12,0
1. Direct partition methods: the K-means algorithm. 2. Accelerated K-means algorithm. 3. Ascending hierarchical methods. 4. Mixed methods. 5. EM algorithm. 6. Example of classification.
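The two alternating steps of K-means (assign each point to the nearest centre, then move each centre to the mean of its cluster) can be sketched as follows; the points are hypothetical and the initialisation is deterministic for simplicity, whereas real K-means uses random restarts:

```python
import math

# Hypothetical 2-D points forming two well-separated groups
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8),
          (0.8, 1.1), (8.2, 7.9), (7.9, 8.3)]

def kmeans(points, k, iters=10):
    centers = list(points[:k])   # deterministic start for this sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:         # assignment step: nearest centre
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):   # update step: centre = cluster mean
            if cl:
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```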


T 4,0 | P 0 | L 2,0 | Alt 0 | Ext. L 1,0 | Stu 4,0 | A. time 0 | Total 11,0
1. Linear regression. 2. Additive models. 3. Evaluating the quality of the results. 4. Regression of uncorrelated components. 
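A least-squares sketch of simple linear regression, including the R^2 quality measure from point 3; the data are hypothetical:

```python
# Simple least-squares linear regression y = a + b*x on hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope = covariance(x, y) / variance(x); intercept from the means
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# R^2: proportion of the variance of y explained by the model
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - my) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot
print(f"y = {a:.2f} + {b:.2f}x, R^2 = {r2:.3f}")
```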

T 2,0 | P 0 | L 2,0 | Alt 0 | Ext. L 1,0 | Stu 2,0 | A. time 0 | Total 7,0
1. GLM (generalized linear model) formulation. 2. Logistic regression. 3. Example of logistic regression.
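Logistic regression can be sketched as gradient descent on the log-loss; the one-dimensional data below are hypothetical:

```python
import math

# Hypothetical 1-D data: class 1 tends to occur for larger x
data = [(0.5, 0), (1.0, 0), (1.5, 0), (3.0, 1), (3.5, 1), (4.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):                    # gradient descent on the log-loss
    gw = gb = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)           # predicted probability of class 1
        gw += (p - y) * x
        gb += (p - y)
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

print(f"P(y=1|x=0.5)={sigmoid(w*0.5 + b):.3f}, P(y=1|x=4.0)={sigmoid(w*4.0 + b):.3f}")
```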


T 3,0 | P 0 | L 2,0 | Alt 0 | Ext. L 1,0 | Stu 3,0 | A. time 0 | Total 9,0
1. Linear discrimination and quadratic discrimination. 2. Naive Bayes. 3. Example of parametric discrimination.
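A Naive Bayes sketch for a single continuous feature: each class is modelled by a Gaussian, and the class with the highest prior-weighted density wins. The training data are hypothetical:

```python
import math
from collections import defaultdict

# Hypothetical training data: (feature value, class label)
train = [(1.0, "a"), (1.2, "a"), (0.9, "a"), (3.0, "b"), (3.2, "b"), (2.9, "b")]

by_class = defaultdict(list)
for x, c in train:
    by_class[c].append(x)

def gauss(x, mu, var):
    """Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict(x):
    best, best_p = None, -1.0
    for c, xs in by_class.items():
        mu = sum(xs) / len(xs)                              # class mean
        var = sum((v - mu) ** 2 for v in xs) / (len(xs) - 1)  # class variance
        prior = len(xs) / len(train)                        # class prior
        p = prior * gauss(x, mu, var)
        if p > best_p:
            best, best_p = c, p
    return best

print(predict(1.1), predict(3.1))
```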


T 3,0 | P 0 | L 2,0 | Alt 0 | Ext. L 1,0 | Stu 3,0 | A. time 0 | Total 9,0
1. Local discrimination with k-NN (k-Nearest Neighbours). 2. Example of local discrimination.
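The k-NN rule can be sketched in a few lines: find the k closest training points and take a majority vote. The labelled points are hypothetical:

```python
import math
from collections import Counter

# Hypothetical labelled points: (x, y, class)
train = [(1, 1, "red"), (1, 2, "red"), (2, 1, "red"),
         (6, 6, "blue"), (6, 7, "blue"), (7, 6, "blue")]

def knn_predict(x, y, k=3):
    # Sort training points by Euclidean distance, keep the k nearest
    nearest = sorted(train, key=lambda p: math.dist((x, y), (p[0], p[1])))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]   # majority class among neighbours

print(knn_predict(1.5, 1.5), knn_predict(6.5, 6.5))
```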


T 3,0 | P 0 | L 2,0 | Alt 0 | Ext. L 2,0 | Stu 3,0 | A. time 0 | Total 10,0
1. CART. 2. Other decision trees. 3. Example of a decision tree.
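CART grows a tree by repeatedly choosing the split that minimises the weighted Gini impurity of the two children; one such split search can be sketched as follows, on hypothetical one-dimensional data:

```python
# Gini impurity and the best single split, as used by CART-style trees
data = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (7.0, "b"), (8.0, "b"), (9.0, "b")]

def gini(labels):
    """Impurity 1 - sum of squared class proportions (0 = pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(data):
    best_t, best_score = None, float("inf")
    xs = sorted(x for x, _ in data)
    # Candidate thresholds: midpoints between consecutive values
    for t in ((a + b) / 2 for a, b in zip(xs, xs[1:])):
        left = [c for x, c in data if x <= t]
        right = [c for x, c in data if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

t, score = best_split(data)
print(t, score)
```

Here the best threshold separates the two classes perfectly, so the weighted impurity drops to zero.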


T 5,0 | P 0 | L 4,0 | Alt 0 | Ext. L 3,0 | Stu 5,0 | A. time 0 | Total 17,0
1. Formulation of neural networks. 2. Single- and multilayer perceptrons. 3. Example of a neural network. 4. Kohonen maps.
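A single-layer perceptron trained with the classical perceptron learning rule, here learning the AND function, is perhaps the smallest runnable neural-network example:

```python
# Perceptron learning the AND function (linearly separable, so it converges)
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [0.0, 0.0]
b = 0.0
lr = 0.1
for _ in range(20):                                   # training epochs
    for (x1, x2), target in data:
        out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0   # step activation
        # Perceptron rule: move weights by lr * error * input
        w[0] += lr * (target - out) * x1
        w[1] += lr * (target - out) * x2
        b += lr * (target - out)

preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for (x1, x2), _ in data]
print(preds)
```

A single perceptron can only learn linearly separable functions; multilayer networks, covered in this topic, remove that limitation.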


T 3,0 | P 0 | L 0 | Alt 0 | Ext. L 0 | Stu 3,0 | A. time 0 | Total 6,0
FLEXIBLE METHODS
1. Support Vector Machines (SVMs). 

T 1,0 | P 0 | L 0 | Alt 0 | Ext. L 0 | Stu 1,0 | A. time 0 | Total 2,0
1. Bagging and boosting. 2. Web mining and text mining. 

T 0 | P 0 | L 2,0 | Alt 0 | Ext. L 0 | Stu 0 | A. time 0 | Total 2,0
1. R. 2. Weka.


T 0 | P 0 | L 4,0 | Alt 0 | Ext. L 0 | Stu 0 | A. time 0 | Total 4,0
1. SPAD. 2. Clementine. 3. Enterprise Miner.


T 0 | P 0 | L 0 | Alt 0 | Ext. L 10,0 | Stu 0 | A. time 0 | Total 10,0

Total per kind: T 39,0 | P 0 | L 28,0 | Alt 0 | Ext. L 24,0 | Stu 38,0 | A. time 0 | Total 129,0
Assessment additional hours: 10,0
Total work hours for student: 139,0
Students will learn through case studies and the analysis of complex data sets from real-life problems. The problems will be used to develop the scientific concepts in the theory classes and their application in the lab classes. Programming activities and/or the incorporation of data mining techniques will help students assimilate the various concepts taught in the course. The R system will be used for this purpose.
R is a freely distributed, open programming system; other software available at the FIB (WEKA, Minitab, Saad, Excel, Matlab, etc.) may also be used. Given the purpose of this course, stress will also be laid on professional data mining systems, such as SPAD, Clementine and Enterprise Miner.
Documents in pdf format are provided on the course Web site and set out information on the schedule and content of the theory classes.
Academic assessment will be based on the grades obtained in the three practical assignments carried out during the course, plus a short test. The first practical assignment involves preprocessing, visualization and clustering of the data in a database.
The second consists of generating association rules from commercial transaction data.
The third covers a full prediction problem, which students may choose freely among the available alternatives. This last assignment includes elements from the previous ones. Its purpose is to give students the opportunity to solve a prediction problem by using and critically comparing various models. It must be defended orally and publicly, and students will also have to answer technical questions on the models and methods used in their solution. The R system will be used for the practical work.
The test will take place on the last day of the course. It aims to evaluate, in a simple manner, the degree of familiarity with the foundations of the course.
The relative weights of the three practical assignments are 15%, 15% and 50%, respectively, and the test counts for the remaining 20%. Students will write a report on each practical assignment; the report may be written jointly by pairs of students.
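Under the stated weights, the final grade would be computed as in this small sketch (marks are hypothetical, assumed on a 0-10 scale):

```python
# Weighted final grade: practicals 15% + 15% + 50%, final test 20%
def final_grade(p1, p2, p3, test):
    return 0.15 * p1 + 0.15 * p2 + 0.50 * p3 + 0.20 * test

print(round(final_grade(8.0, 7.0, 9.0, 6.0), 2))
```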
The course is self-contained, but students should be familiar with the following concepts:
- The mean, and covariance and correlation matrices.
- Hypothesis testing.
- Singular value decomposition of a matrix.
- Programming of algorithms.
- Multiple linear regression.
The prerequisite courses are Statistics, Programming and Mathematics.