The amount of information stored digitally in organizations, or collectively on the web, is today large enough to make searching this information a generally complicated task. The field known as "Information Retrieval" finds methods to organize information in such a way that finding information afterwards can be done simply and efficiently. We will cover basic keyword-based techniques to search in textual information. Then, we will examine search in the web, where hyperlinks can be used not only to direct the search but to assess the interest value of each page - as is the case with the well-known PageRank algorithm. We will see extensions of these techniques to the case of Social Networks where interactions among users can provide very useful information. Finally, we will study ways in which these techniques can be exploited for the benefit of specific organizations.
Person in charge
Marta Arias Vicente (
Ramon Ferrer Cancho (
Javier Béjar Alonso (
Jose Luis Balcázar Navarro (
Ricard Gavaldà Mestre (
Technical Competences of each Specialization
Information systems specialization
CSI2 - To integrate solutions of Information and Communication Technologies, and business processes to satisfy the information needs of the organizations, allowing them to achieve their objectives effectively.
- To demonstrate knowledge and application capacity of extraction and knowledge management systems .
- To demonstrate knowledge and capacity to apply decision support and business intelligence systems.
Computer science specialization
CCO2 - To develop effectively and efficiently the adequate algorithms and software to solve complex computation problems.
- To implement information retrieval software.
G7 - To detect deficiencies in the own knowledge and overcome them through critical reflection and choosing the best actuation to extend this knowledge. Capacity for learning new methods and technologies, and versatility to adapt oneself to new situations.
- Autonomous learning: capacity to plan and organize personal work. To apply the acquired knowledge when performing a task, in function of its suitability and importance, decide how to perform it and the needed time, and select the most adequate information sources. To identify the importance of establishing and maintaining contacts with students, teacher staff and professionals (networking). To identify information forums about ICT engineering, its advances and its impact in the society (IEEE, associations, etc.).
Understand the problems associated with storage and information retrieval, in particular with information in textual form.
Understand that effective search and information retrieval is closely related to the organization and description of this information.
To know and understand the structure, architecture and functioning of the web, and elements related to it: indices, search engines, crawlers, among others.
To know and understand the descriptive parameters of complex networks and the algorithms to analyze their structure.
Recognizing the opportunities for using massive information to an organization's goals, and choose the most appropriate methods, tools, and procedures.
Be able to decide the information retrieval techniques that may be effective in a specific information system, especially those of textual type.
Be able to evaluate the effectiveness and usefulness of an information retrieval system, according to several criteria.
To implement themain techniques learned during the course.
Be able to implement the basic techniques (algorithms and data structures) for information retrieval.
Be able to implement basic algorithms for network analysis.
Know how to use, adapt and extend open-source software.
For example: Lucene, Dex database, WIRE crawler, among others.
Need of search and analysis techniques of massive information. Search and analysis vs. databases. Information retrieval process. Preprocessing and lexical analysis.
Models of information retrieval
Formal definition and basic concepts: abstract models of documents and query languages. Boolean model. Vector model. Latent Semantic Indexing.
Implementation: Indexing and searching
Inverse and signature files. Index compression. Example: Efficient implementation of the rule of the cosine measure with tf-idf. Example: Lucene.
Evaluation in information retrieval
Recall and precision. Other performance measures. Reference collections. Relevance feedback and query expansion.
Ranking and relevance in the web. The PageRank algorithm. Crawling. Architecture of a simple web search system.
Architecture of massive information processing systems
Scalability, high performance, and fault tolerance: the case of massive web searchers. Distributed architectures. Example: Hadoop.
Descriptive parameters and characteristics of networks: degree, diameter, small-world networks, among others. Algorithms on networks: clustering, community detection and detection of influential nodes, reputation, among others.
Information Systems based on massive information analysis. Combination with other technologies.
Search Engine Optimization. Joint use of IR techniques with Data Mining and Machine Learning. Recommender Systems.
Introduction and models of information retrieval
2 theory hours, 2 problem hours, and 4 lab hours on the topics "Introduction" and "Models of information retrieval". See the description in the Teaching Methodology section. Objectives:126 Contents:
The student chooses between an exam of the second part of the course or an exam on the whole course. Objectives:123456789 Week:
15 (Outside class hours) Type:
- Theory lectures. Before each class, students must have read the notes and materials on the topic to be discussed in class, which will be announced with enough time to prepare. Students will also have at their disposal a questionnaire with basic questions to see if a basic degree of understanding has been reached. In class, the teacher will present the main points, assuming that the student has done the job indicated and has tried to answer the questionnaire; difficulties found by students will be discussed in class collectively.
- Problem-solving sessions. Teachers and students will discuss and compare the solutions to problems provided by the teacher with sufficient time before each class. Discussions can be made collectively in class or individually between teacher and student. The teacher will assume that the students have spent a reasonable amount of time trying to solve these exercises, and priority will be given to those who have done so.
- Laboratory sessions. Before each class, students are assumed to have read the script of practical work to be developed during the session. During class, students will do the work specified in the script with the guidance of the teacher. In many cases, students will probably need extra time to finish the work. For most lab sessions the students will have to write a short report and/or deliver files associated with it (output files and code).
- Personal work. Every type of classroom activity involves a certain amount of personal work. Additionally, some topic or topics of the course could have no theory classes or exercises associated; students must study these on their own, and can take advantage the directed activities' sessions to assess whether they have learnt them sufficiently or not.
Given that the course appears as part of two different degree specialties, the activities proposed (in theory, problem-solving or lab sessions) may slightly differ for different types of students, always ensuring fairness regarding difficulty or workload.
The course will include the following evaluation events:
- Reports of laboratory sessions, which will be delivered within a time limit for each session (generally around 2 weeks). We will compute a weighted average from the grade of these laboratory reports, which we refer to as L.
- A mid-term exam, covering material seen until the exam is done. Let P1 be the grade obtained in this exam.
- The day of the final exam, each student will choose between two options mutually exclusive: 1) do a second partial exam covering what was not covered in the mid-term exam (we refer to the grade of this exam as P2), or 2) do a final exam covering the whole course (whose grade we refer as F). There is no requirement in the grade of P1 to chose between options 1) and 2).
The four grades L, P1, P2, and F are between 0 and 10. The final grade is computed as:
0.4*L + maximum(0.3*P1+0.3*P2, 0.6*F).
As to the competency grade associated to Autonomous Learning, it will be computed as follows:
- For the i-th laboratory report submitted: the value Ri will be 1 if the report has been submitted within the deadline and enough effort has been put into the report. Ri will be 0 otherwise. Let Rsum be the sum of all the individual Ri values (which can add up to k if there are k lab sessions).
- Some questions in partial or final exams, marked appropriately, will focus on topics that are only partially covered during theory and problem-solving sessions, and that therefore students must prepare
on their own. Let E be the weighted average of such questions, scaled to the interval [0.1].
Let S be the value of (Rsum/k+E)/2, which lies between 0 and 1.
The competency grade is:
- D if S is less than 0.5
- C if S lies between 0.5 and 0.599
- B if S lies between 0.6 and 0.799
- A if S is 0.8 or more.
In general, all those that are acquired in the required prior courses.
- To know and use comfortably basic concepts of linear algebra, discrete mathematics, probability and statistics.
- To program comfortably in object-oriented languages, including inheritance between classes.
- To know the main data structures to access information efficiently and their implementations (lists, hashing, trees, graphs, heaps). To be able to use them to build efficient programs. To be able to analyze the execution time and memory used by an algorithm of average difficulty. To have an idea of the difference in time to access main memory and disk.
- To know the main elements of a relational database and SQL-like access language.
Where we are
B6 Building Campus Nord
C/Jordi Girona Salgado,1-3
08034 BARCELONA Spain
Tel: (+34) 93 401 70 00