Massive Information Search and Analysis

You are here

Credits
6
Types
Specialization complementary (Information Systems)
Requirements
  • Prerequisite: BD
  • Prerequisite: PE
  • Corequisite: PROP
Department
CS
Mail
The amount of information stored digitally in organizations, or collectively on the web, is today large enough to make searching this information a generally complicated task. The field known as "Information Retrieval" finds methods to organize information in such a way that finding information afterwards can be done simply and efficiently. We will cover basic keyword-based techniques to search in textual information. Then, we will examine search in the web, where hyperlinks can be used not only to direct the search but to assess the interest value of each page - as is the case with the well-known PageRank algorithm. We will see extensions of these techniques to the case of Social Networks where interactions among users can provide very useful information. Finally, we will study ways in which these techniques can be exploited for the benefit of specific organizations.

Teachers

Person in charge

  • Ramon Ferrer Cancho ( )

Others

  • Albert Calvo Ibañez ( )
  • Ignasi Gómez Sebastià ( )
  • Javier Béjar Alonso ( )
  • Jose Luis Balcázar Navarro ( )
  • Marta Arias Vicente ( )

Weekly hours

Theory
1.5
Problems
0.5
Laboratory
2
Guided learning
0
Autonomous learning
6

Competences

Technical Competences of each Specialization

Information systems specialization

  • CSI2 - To integrate solutions of Information and Communication Technologies, and business processes to satisfy the information needs of the organizations, allowing them to achieve their objectives effectively.
    • CSI2.3 - To demonstrate knowledge and application capacity of extraction and knowledge management systems .
    • CSI2.6 - To demonstrate knowledge and capacity to apply decision support and business intelligence systems.

Computer science specialization

  • CCO2 - To develop effectively and efficiently the adequate algorithms and software to solve complex computation problems.
    • CCO2.5 - To implement information retrieval software.

Transversal Competences

Autonomous learning

  • G7 [Avaluable] - To detect deficiencies in the own knowledge and overcome them through critical reflection and choosing the best actuation to extend this knowledge. Capacity for learning new methods and technologies, and versatility to adapt oneself to new situations.
    • G7.3 - Autonomous learning: capacity to plan and organize personal work. To apply the acquired knowledge when performing a task, in function of its suitability and importance, decide how to perform it and the needed time, and select the most adequate information sources. To identify the importance of establishing and maintaining contacts with students, teacher staff and professionals (networking). To identify information forums about ICT engineering, its advances and its impact in the society (IEEE, associations, etc.).

Objectives

  1. Understand the problems associated with storage and information retrieval, in particular with information in textual form.
    Related competences: CCO2.5,
  2. Understand that effective search and information retrieval is closely related to the organization and description of this information.
    Related competences: CCO2.5, G7.3,
  3. To know and understand the structure, architecture and functioning of the web, and elements related to it: indices, search engines, crawlers, among others.
    Related competences: CSI2.3, G7.3,
  4. To know and understand the descriptive parameters of complex networks and the algorithms to analyze their structure.
    Related competences: CSI2.3, CSI2.6, G7.3,
  5. Recognizing the opportunities for using massive information to an organization's goals, and choose the most appropriate methods, tools, and procedures.
    Related competences: CSI2.6, G7.3,
  6. Be able to decide the information retrieval techniques that may be effective in a specific information system, especially those of textual type.
    Related competences: CSI2.3, CSI2.6, CCO2.5, G7.3,
  7. Be able to evaluate the effectiveness and usefulness of an information retrieval system, according to several criteria.
    Related competences: CSI2.3, CSI2.6, CCO2.5, G7.3,
  8. To implement themain techniques learned during the course.
    Related competences: CCO2.5, G7.3,
    Subcompetences:
    • Be able to implement the basic techniques (algorithms and data structures) for information retrieval.
    • Be able to implement basic algorithms for network analysis.
  9. Know how to use, adapt and extend open-source software.
    Related competences: G7.3,
    Subcompetences:
    • For example: Lucene, Dex database, WIRE crawler, among others.

Contents

  1. Introduction
    Need of search and analysis techniques of massive information. Search and analysis vs. databases. Information retrieval process. Preprocessing and lexical analysis.
  2. Models of information retrieval
    Formal definition and basic concepts: abstract models of documents and query languages. Boolean model. Vector model. Latent Semantic Indexing.
  3. Implementation: Indexing and searching
    Inverse and signature files. Index compression. Example: Efficient implementation of the rule of the cosine measure with tf-idf. Example: Lucene.
  4. Evaluation in information retrieval
    Recall and precision. Other performance measures. Reference collections. Relevance feedback and query expansion.
  5. Web search
    Ranking and relevance in the web. The PageRank algorithm. Crawling. Architecture of a simple web search system.
  6. Architecture of massive information processing systems
    Scalability, high performance, and fault tolerance: the case of massive web searchers. Distributed architectures. Example: Hadoop.
  7. Network analysis
    Descriptive parameters and characteristics of networks: degree, diameter, small-world networks, among others. Algorithms on networks: clustering, community detection and detection of influential nodes, reputation, among others.
  8. Information Systems based on massive information analysis. Combination with other technologies.
    Search Engine Optimization. Joint use of IR techniques with Data Mining and Machine Learning. Recommender Systems.

Activities

Activity Evaluation act


Introduction and models of information retrieval

2 theory hours, 2 problem hours, and 4 lab hours on the topics "Introduction" and "Models of information retrieval". See the description in the Teaching Methodology section.
Objectives: 1 2 6
Contents:
Theory
4h
Problems
2h
Laboratory
6h
Guided learning
0h
Autonomous learning
13.5h

Implementation and evaluation

2 theory hours, 2 problem hours, and 4 lab hours on the topics "Implementation" and "Evaluation". See the description in the Teaching Methodology section.
Objectives: 2 7 8 9
Contents:
Theory
2.5h
Problems
1h
Laboratory
4h
Guided learning
0h
Autonomous learning
6.5h

Searching the web

2 theory hours, 2 problem hours, and 4 lab hours on the topic "Web search". See the description in the Teaching Methodology section.
Objectives: 3 5 9
Contents:
Theory
2.5h
Problems
1h
Laboratory
6h
Guided learning
0h
Autonomous learning
12.5h

First partial exam

Partial exam of the first part of the course.
Objectives: 1 2 3 5 6 7
Week: 9
Type: theory exam
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
8h

Architecture of web search systems

2 theory hours and 6 lab hours on the topic "Architecture". See the description in the Teaching Methodology section.
Objectives: 3 6 8 9
Theory
3.5h
Problems
1.5h
Laboratory
6h
Guided learning
0h
Autonomous learning
12.5h

Network analysis

4 theory hours and 6 lab hours on the topic "Network Analysis". See the description in the Teaching Methodology section.
Objectives: 4 6 7 8 9
Contents:
Theory
3.5h
Problems
1.5h
Laboratory
4h
Guided learning
0h
Autonomous learning
12.5h

Information systems based on massive information analysis

Theory, problems, and labs on this topic. The emphasis is on practical cases in the problems and lab sessions. See the description in the Teaching Methodology section.
Objectives: 5 6 7 9
Contents:
Theory
1.5h
Problems
0.5h
Laboratory
4h
Guided learning
0h
Autonomous learning
12.5h

Second partial exam or final exam

The student chooses between an exam of the second part of the course or an exam on the whole course.
Objectives: 1 2 3 4 5 6 7 8 9
Week: 15 (Outside class hours)
Type: theory exam
Theory
3h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
12h

Teaching methodology

- Theory lectures. Before each class, students must have read the notes and materials on the topic to be discussed in class, which will be announced with enough time to prepare. Students will also have at their disposal a questionnaire with basic questions to see if a basic degree of understanding has been reached. In class, the teacher will present the main points, assuming that the student has done the job indicated and has tried to answer the questionnaire; difficulties found by students will be discussed in class collectively.

- Problem-solving sessions. Teachers and students will discuss and compare the solutions to problems provided by the teacher with sufficient time before each class. Discussions can be made collectively in class or individually between teacher and student. The teacher will assume that the students have spent a reasonable amount of time trying to solve these exercises, and priority will be given to those who have done so.

- Laboratory sessions. Before each class, students are assumed to have read the script of practical work to be developed during the session. During class, students will do the work specified in the script with the guidance of the teacher. In many cases, students will probably need extra time to finish the work. For most lab sessions the students will have to write a short report and/or deliver files associated with it (output files and code).

- Personal work. Every type of classroom activity involves a certain amount of personal work. Additionally, some topic or topics of the course could have no theory classes or exercises associated; students must study these on their own, and can take advantage the directed activities' sessions to assess whether they have learnt them sufficiently or not.

Given that the course appears as part of two different degree specialties, the activities proposed (in theory, problem-solving or lab sessions) may slightly differ for different types of students, always ensuring fairness regarding difficulty or workload.

Evaluation methodology

The course will include the following evaluation events:

- Reports of laboratory sessions, which will be delivered within a time limit for each session (generally around 2 weeks). We will compute a weighted average from the grade of these laboratory reports, which we refer to as L.

- A mid-term exam, covering material seen until the exam is done. Let P1 be the grade obtained in this exam.

- The day of the final exam, each student will choose between two options mutually exclusive: 1) do a second partial exam covering what was not covered in the mid-term exam (we refer to the grade of this exam as P2), or 2) do a final exam covering the whole course (whose grade we refer as F). There is no requirement in the grade of P1 to chose between options 1) and 2).

The four grades L, P1, P2, and F are between 0 and 10. The final grade is computed as:

0.4*L + maximum(0.3*P1+0.3*P2, 0.6*F).

The choice of final exam implies P2 = 0. The choice of second parcial implies F = 0. 
No attending the day of the final / second partial exam implies that the student has chosen second partial exam.

As to the competency grade associated to Autonomous Learning, it will be computed as follows:

- For the i-th laboratory report submitted: the value Ri will be 1 if the report has been submitted within the deadline and enough effort has been put into the report. Ri will be 0 otherwise. Let Rsum be the sum of all the individual Ri values (which can add up to k if there are k lab sessions).

- Some questions in partial or final exams, marked appropriately, will focus on topics that are only partially covered during theory and problem-solving sessions, and that therefore students must prepare
on their own. Let E be the weighted average of such questions, scaled to the interval [0.1].

Let S be the value of (Rsum/k+E)/2, which lies between 0 and 1.

The competency grade is:

- D if S is less than 0.5
- C if S lies between 0.5 and 0.599
- B if S lies between 0.6 and 0.799
- A if S is 0.8 or more.

Bibliography

Basic:

Web links

Previous capacities

In general, all those that are acquired in the required prior courses.

Specifically:

- To know and use comfortably basic concepts of linear algebra, discrete mathematics, probability and statistics.

- To program comfortably in object-oriented languages, including inheritance between classes.

- To know the main data structures to access information efficiently and their implementations (lists, hashing, trees, graphs, heaps). To be able to use them to build efficient programs. To be able to analyze the execution time and memory used by an algorithm of average difficulty. To have an idea of the difference in time to access main memory and disk.

- To know the main elements of a relational database and SQL-like access language.