Person in charge: | (-) |
Others: | (-) |
Credits | Dept. |
---|---|
7.5 (6.0 ECTS) | EIO-CS |
The philosophy behind data mining is the conversion of data into knowledge for decision making. Data mining is the central phase of the knowledge extraction process known as KDD (Knowledge Discovery in Databases). In this sense, data mining is a meeting point of different disciplines: statistics, machine learning, database techniques and decision-making systems. Together, these make it possible to solve the kinds of information processing problems that organisations face today. The subject is conceptually divided into three parts, which focus on association, classification and prediction; together these account for the vast majority of problems faced in data mining. A parallel objective is to use a free programming environment for solving data mining problems and to become acquainted with different areas of professional practice.
Relevance of this course in a computer science curriculum
Data mining is a discipline devoted to processing large volumes of data from the complex information systems of large organizations, in order to extract relevant, new,
understandable and useful knowledge for decision making in all kinds of contexts, from e-commerce to social networks, including environmental
systems monitoring, customer loyalty cards, consumption in general, public health, banking, finance and industrial production.
Data mining is an umbrella discipline that combines techniques and methodologies from several areas of computer science (data
warehouse design, machine learning, statistical modelling, multivariate data analysis, data visualization, intensive computing, software
engineering) to address the complexity of the field.
Currently, it is clear that the value of an organization is directly related to the information that can be extracted from its available data, and
there is still a shortage of professionals with a suitable profile for doing so. Data mining is the science that transforms data into value for
organizations, and acquiring skills in this area is an excellent complement for computer science professionals, whatever their specialization.
For an information systems profile, this course provides the skills needed to complete the data processing chain: too often an excellent
information system design is underused because it lacks a good exploitation service with suitable mining. Knowing what
can be extracted from data is also an important reference to take into account when designing the data structure itself. For software engineering, it provides useful criteria for identifying and standardizing the data mining services to include in the large computer applications that support the
organization, by deciding and planning which data consumption services to provide.
In information technologies, a point of interest is the relationship between the real-time monitoring of fixed or mobile
systems and data mining, which can reduce signals to relevant features, detect events to communicate, or extract relevant information in
an incremental process (data stream mining). Knowledge extraction from distributed data or from the cloud is an area expected to grow
strongly in the near future.
The field also poses very interesting challenges related to the development of new knowledge extraction
algorithms that are more efficient and/or scalable, in order to deal with large datasets or with less classical structures such as graphs (social network mining)
or documents (web mining).
Please have a look at this master lecture to get a general idea of the discipline:
http://videolectures.net/learning06_gibert_dmtae/
Estimated time (hours):
T | P | L | Alt | Ext. L | Stu | A. time |
---|---|---|---|---|---|---|
Theory | Problems | Laboratory | Other activities | External Laboratory | Study | Additional time |
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
3.0 | 0 | 2.0 | 0 | 1.0 | 3.0 | 0 | 9.0 |
1. Concept of hypothesis testing. 2. Description of a continuous variable. 3. Description of a categorical variable.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
4.0 | 0 | 2.0 | 0 | 1.0 | 4.0 | 0 | 11.0 |
1. Principal Components Analysis. 2. Multiple Correspondence Analysis (MCA). 3. Projection of supplementary information.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
3.0 | 0 | 2.0 | 0 | 1.0 | 3.0 | 0 | 9.0 |
1. Market basket analysis. 2. Rule generation algorithms. 3. Example of association rules.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
4.0 | 0 | 2.0 | 0 | 2.0 | 4.0 | 0 | 12.0 |
1. Direct partition methods: K-means algorithm. 2. Accelerated K-means algorithm. 3. Ascending (agglomerative) methods. 4. Mixed methods. 5. EM algorithm. 6. Example of classification.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
4.0 | 0 | 2.0 | 0 | 1.0 | 4.0 | 0 | 11.0 |
1. Linear regression. 2. Additive models. 3. Evaluating the quality of the results. 4. Regression of uncorrelated components.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
2.0 | 0 | 2.0 | 0 | 1.0 | 2.0 | 0 | 7.0 |
1. GLM (generalized linear model) formulation. 2. Logistic regression. 3. Example of logistic regression.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
3.0 | 0 | 2.0 | 0 | 1.0 | 3.0 | 0 | 9.0 |
1. Linear discrimination and quadratic discrimination. 2. Naive Bayes. 3. Example of parametric discrimination.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
3.0 | 0 | 2.0 | 0 | 1.0 | 3.0 | 0 | 9.0 |
1. KNN (K-Nearest Neighbors) local discrimination. 2. Example of local discrimination.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
3.0 | 0 | 2.0 | 0 | 2.0 | 3.0 | 0 | 10.0 |
1. CART. 2. Other decision trees. 3. Example of a decision tree.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
5.0 | 0 | 4.0 | 0 | 3.0 | 5.0 | 0 | 17.0 |
1. Formulation of neural networks. 2. Single and multi-layer perceptrons. 3. Example of a neural network. 4. Kohonen maps.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
3.0 | 0 | 0 | 0 | 0 | 3.0 | 0 | 6.0 |
FLEXIBLE METHODS
1. Support Vector Machines (SVMs).
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
1.0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 2.0 |
1. Bagging and boosting. 2. Web mining and text mining.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
0 | 0 | 2.0 | 0 | 0 | 0 | 0 | 2.0 |
1. R. 2. Weka.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
0 | 0 | 4.0 | 0 | 0 | 0 | 0 | 4.0 |
1. SPAD. 2. Clementine. 3. Enterprise Miner.
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 10.0 | 0 | 0 | 10.0 |
Total per kind:
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
39.0 | 0 | 28.0 | 0 | 24.0 | 38.0 | 0 | 129.0 |
Additional assessment hours: 10.0
Total work hours for student: 139.0
Students will learn through case studies and the analysis of complex data sets from real-life problems. The problems will be used to develop scientific concepts in the theory classes and to apply them in the lab classes. Programming activities and/or the incorporation of data mining will help students assimilate the various concepts taught in the course. The R system will be used for this purpose.
R is a freely distributed, open programming system and can use other software available at the FIB: WEKA, Minitab, Saad, Excel, Matlab, etc. Given the purpose of this course, stress will also be laid on professional data mining systems, such as SPAD, Clementine, and Enterprise Miner.
Documents in PDF format are provided on the course website; they set out the schedule and content of the theory classes.
Academic assessment will be based on the grades obtained in the three practical assignments carried out during the course, plus a short test. The first practical assignment consists of data pre-processing, visualization and clustering on a database.
The second practical assignment consists of generating association rules from commercial transaction data.
The third practical assignment covers a full prediction problem, which students can choose freely from the available alternatives. This last assignment includes elements from the previous ones. Its purpose is to give students the opportunity to solve a prediction problem by using and critically comparing various models. This assignment must be defended orally and publicly. Students will also have to answer technical questions on the models and methods used in their solution. The R system will be used for the practical assignments.
The test will take place on the last day of the course. It aims to assess, in a simple way, the students' familiarity with the foundations of the course.
The relative weights of the three practical assignments are 15%, 15% and 50%, respectively, and the test counts for 20%. Students will write a report on each practical assignment. The report may be written jointly by pairs of students.
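Read as a weighted average, the weights above combine into a single final mark. A minimal sketch, assuming all four components are marked on a common scale, and writing $P_1$, $P_2$, $P_3$ for the marks of the three practical assignments and $T$ for the test mark (these symbols are introduced here for illustration only):

$$G = 0.15\,P_1 + 0.15\,P_2 + 0.50\,P_3 + 0.20\,T$$

For example, hypothetical marks of $P_1 = 7$, $P_2 = 8$, $P_3 = 6$ and $T = 9$ would give $G = 1.05 + 1.20 + 3.00 + 1.80 = 7.05$.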
The course is self-contained but students should be familiar with the following concepts:
- Concepts of the mean, the covariance matrix and the correlation matrix.
- Concept of hypothesis testing.
- Singular value decomposition of a matrix.
- Programming algorithms.
- Multiple linear regression.
The prerequisite courses are: Statistics, Programming, and Mathematics.