Enlarging letters   Home   Information   Contacting   Map
Catalā   Castellano

Information Retrieval (RI)

Credits Dept. Type Requirements
7.5 (6.0 ECTS) LSI
  • Elective for DIE
  • Elective for DCSFW
  • Elective for DCSYS
PRED - Prerequisite for DIE , DCSYS
PS - Prerequisite for DCSFW

Instructors

Person in charge:  Ricard Gavaldā Mestre (gavalda@lsi.upc.edu)
Others:Xavier Messeguer Peypoch (peypoch@lsi.upc.edu)

General goals

Understanding the problem of retrieving information. Understanding the different components of a RI system, the factors and techniques that can optimize the process and know how to use and adapt them. Know some applications of these systems, at least in bioinformatics and the Web.

Specific goals

Knowledges

  1. Understand the problems associated with information storage and retrieval, particularly text information.
  2. Understand that effective information search and retrieval is closely linked with the way the information is organised and described.
  3. Understand the most important algorithms and techniques for searching for patterns in textual information.
  4. Describe information retrieval techniques in biocomputing and Web applications.

Abilities

  1. Ability to decide which information recovery techniques are likely to be the most effective in a given information system (particularly in relation to text information).
  2. In particular, ability to decide which information search and retrieval techniques are likely to be the most effective in simple applications for biocomputing and Web fields.
  3. Ability to employ various criteria for assessing the effectiveness and usefulness of an information retrieval system.
  4. Ability to implement basic techniques (algorithms and data structures) for retrieving information.
  5. Ability to use and adapt a tool to manage textual information, like Lucene.

Competences

  1. Ability to apply mathematical knowledge and logic in solving problems.
  2. Ability to create and use models of reality.
  3. Ability to design and carry out experiments and analyse the results.
  4. Ability to design systems, components and processes meeting certain needs, using the most appropriate methods, techniques and tools in each case.
  5. Ability to take take decisions when faced with uncertainty or contradictory requirements
  6. Ability to act independently: Know how to work on one"s own with just the bare minimum of knowledge and guidance.
  7. Ability to learn on one"s own.

Contents

Estimated time (hours):

T P L Alt Ext. L Stu A. time
Theory Problems Laboratory Other activities External Laboratory Study Additional time

1. Introduction
T      P      L      Alt    Ext. L Stu    A. time Total 
2,0 0 2,0 0 2,0 2,0 0 8,0
Browsing versus search. Documents. Logical vision. Process for retrieving information. Lexical analysis.

2. Information retrieval models
T      P      L      Alt    Ext. L Stu    A. time Total 
4,0 4,0 2,0 0 2,0 10,0 0 22,0
Formal characterisation and basic concepts. Boolean model.
Vector model. Probabilistic model.
Query languages. Main components and models.

3. Indexing and searching, implementation
T      P      L      Alt    Ext. L Stu    A. time Total 
2,0 2,0 0 0 0 4,0 0 8,0
Inverted files and signature files. Index compression. Example: Efficient implementation of the cosine rule with tf-idf. Example: Lucene.

4. Assessment
T      P      L      Alt    Ext. L Stu    A. time Total 
2,0 2,0 2,0 0 4,0 4,0 0 14,0
Precision and Recall. Other measures of performance. Reference collections.
Relevance feedback and query expansion.

5. Clustering and classification
T      P      L      Alt    Ext. L Stu    A. time Total 
1,0 1,0 0 0 0 2,0 0 4,0
Notion of clustering. Basic algorithms: k-means, hierarchical clustering.
Notion of classification and some algorithms. Applications to searching.
Latent Semantic Indexing (LSI).

6. Applications to the Web
T      P      L      Alt    Ext. L Stu    A. time Total 
4,0 4,0 0 0 0 4,0 0 12,0
Ranking and relevance in Web models. PageRank and HITS algorithms. Web Crawling. XML retrieval.

7. Sequential search and indexed search
T      P      L      Alt    Ext. L Stu    A. time Total 
8,0 8,0 3,0 0 2,0 16,0 0 37,0
Pattern search. Algorithms for approximate search and exact search. Hidden Markov models.
Tries. Inverted files, suffix tree. Construction algorithms, use and analysis.

8. Applications to biocomputing
T      P      L      Alt    Ext. L Stu    A. time Total 
6,0 6,0 3,0 0 2,0 15,0 0 32,0
DNA chain patterns. Sequence similarity.
DNA sequencing. DNA databases.


Total per kind T      P      L      Alt    Ext. L Stu    A. time Total 
29,0 27,0 12,0 0 12,0 57,0 0 137,0
Avaluation additional hours 6,0
Total work hours for student 143,0

Docent Methodolgy

The lab classes will implement variations of the algorithms seen in the theory and problem sessions, or will apply them to search for information in lifelike situations.







Students may be required to prepare for some of the lab sessions. Some of them will require the drafting of a short report or submission of code. This work will count towards student assessments.



Nowadays we are using the Lucene package.

Evaluation Methodgy

There will be a first part exam at about half the course, and at the end the students can choose whether to take a second part exam or a final exam of the whole course.



The lab grade will be based on the reports or the programmes submitted after the lab sessions.







The course note is calculated as follows:





Students who choose to take the second part exam:



0.2 * lab note

+ 0.4 * 1st part exam

+ 0.4 * 2nd part exam



Students who choose to take the final exam:



0.2 * lab note

+ max (0.2 * 1st part exam + 0.6 * final exam, 0.8 * final exam)

Basic Bibliography

  • BAEZA-YATES, Ricardo; RIBEIRO-NIETO, Berthier Modern Information Retrieval, Addison Wesley , 1998.
  • WITTEN, Ian. H; MOFFAT, Alistair.; BELL, Timothy C. Managing Gigabytes, Morgan Kaufmann, 1999.
  • NAVARRO, Gonzalo.; RAFFINOT, Matthieu Flexible Pattern Matching in Strings: Practical on-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, 2002.
  • GUSFIELD, Dan Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • MARKOV, Zdravko; LAROSE, Daniel T. Data mining the web , Wiley Interscience, 2007.

Complementary Bibliography

  • BELEW, Richard K. Finding Out About: Perspective on Search Engine Technology and the WWW, Cambridge University Press, 2001.
  • GOSPODNETIC, Otis.; HATCHER, Erik Lucene in Action , Manning Publications Co., 2005.

Previous capacities

Ability to produce medium-sized programmes, preferably of object-oriented nature.

Ability to design and analyze simple data structures.

Know the difference between main memory and secondary memory and its impact on the program's efficiency.



 
logo FIB © Barcelona school of informatics - webmaster@fib.upc.edu - RSS RSS