Information Retrieval and Analysis

Credits
6
Types
Compulsory
Requirements
This subject has no formal requirements, but it assumes previous capacities.
Department
CS
Large repositories of semi-structured data such as text, and especially the web, require specialized techniques to be searched and analyzed efficiently. The course covers techniques for searching and analyzing text and other semi-structured information; linked structures, in particular the web and social networks; recommender systems as a way of complementing user-initiated search; and some algorithmic techniques and data structures that are particularly useful with massive data.

Teachers

Person in charge

  • Marta Arias Vicente

Weekly hours

Theory
2
Problems
0
Laboratory
2
Guided learning
0
Autonomous learning
6

Competences

Technical Competences

Technical competencies

  • CE1 - Skillfully use mathematical concepts and methods that underlie the problems of science and data engineering.
  • CE4 - Use current computer systems, including high performance systems, to process large volumes of data, based on knowledge of their structure, operation and particularities.
  • CE6 - Build or use systems of processing and comprehension of written language, integrating it into other systems driven by the data. Design systems for searching textual or hypertextual information and analysis of social networks.
  • CE7 - Demonstrate knowledge and ability to apply the necessary tools for the storage, processing and access to data.

Transversal Competences

Transversals

  • CT4 - Teamwork. Be able to work as a member of an interdisciplinary team, either as a member or conducting management tasks, with the aim of contributing to develop projects with pragmatism and a sense of responsibility, making commitments that take available resources into account.
  • CT5 - Solvent use of information resources. Manage the acquisition, structuring, analysis and visualization of data and information in the field of specialty and critically evaluate the results of such management.
  • CT6 [Assessable] - Autonomous Learning. Detect deficiencies in one's own knowledge and overcome them through critical reflection and the choice of the best action to extend this knowledge.
  • CT7 - Third language. Know a third language, preferably English, with an adequate oral and written level and in line with the needs of graduates.

Basic

  • CB2 - That the students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of study.
  • CB3 - That students have the ability to gather and interpret relevant data (usually within their area of study) to make judgments that include a reflection on relevant social, scientific or ethical issues.
  • CB4 - That the students can transmit information, ideas, problems and solutions to a specialized and non-specialized public.

Generic Technical Competences

Generic

  • CG2 - Choose and apply the most appropriate methods and techniques to a problem defined by data that represents a challenge for its volume, speed, variety or heterogeneity, including computer, mathematical, statistical and signal processing methods.
  • CG3 - Work in multidisciplinary teams and projects related to the processing and exploitation of complex data, interacting fluently with engineers and professionals from other disciplines.
  • CG4 - Identify opportunities for innovative data-driven applications in evolving technological environments.
  • CG5 - To be able to draw on fundamental knowledge and sound work methodologies acquired during the studies to adapt to the new technological scenarios of the future.

Objectives

  1. Describe different models for evaluating similarity between texts, and how they apply to textual search. Decide which of the models is best suited to a specific scenario involving text search. Implement the models from scratch (in a very basic system) or on a highly scalable text indexing system.
    Related competences: CE1, CE4, CE6, CE7, CT5, CT6, CT7, CG2, CG4, CG5, CB2, CB3
  2. Describe the advantages, for effective search, of using the information given by links in hyperlinked structures such as the web, digital social networks, and the semantic web. Describe the main parameters used to characterize these linked structures. Reproduce the most commonly used algorithms to establish importance in these structures (e.g. PageRank), to discover structure in them (e.g. community discovery) and to improve the results of a user's search. Implement these algorithms from scratch in a very basic system, or on top of massive data processing systems so that they can scale.
    Related competences: CE1, CE4, CE6, CE7, CT5, CT6, CT7, CG2, CG4, CG5, CB2, CB3
  3. Evaluate the effectiveness of search systems in complex systems, describing it in terms of hard measures such as "recall" and "precision" but also in terms of soft measures such as user satisfaction, novelty and task completion. Adapt the operation and presentation of information search systems using methodically collected feedback on the user experience.
    Related competences: CE1, CT4, CT5, CT6, CT7, CG3, CG4, CG5, CB2, CB3, CB4
  4. Define the recommendation problem and how it differs from other problems involving previously stored information (search, learning, ...). Describe the main approaches to item recommendation and the advantages and disadvantages of each one. Describe the main algorithms of each approach. Be able to implement basic versions from scratch, or advanced versions on top of massive data processing systems. Evaluate the effectiveness of recommendation systems, both in terms of hard measures and soft measures such as user satisfaction. Decide on the most appropriate forms of recommendation for simple real scenarios, including the characterization of potential users.
    Related competences: CE1, CE4, CE7, CT5, CT6, CT7, CG2, CG4, CG5, CB2, CB3, CB4
  5. Use known algorithmic paradigms to deal with data problems characterized by high volume and high speed. These include: streaming algorithms that process data flows with little time per element and little memory. Algorithms to answer proximity questions, particularly with geolocated information. Algorithms that use sampling to draw reliable conclusions about large volumes of data. Integration of the techniques seen in the rest of the course with algorithmic techniques of other subjects, such as "machine learning", "clustering" and "pattern mining". Techniques for dealing with sensitive data, such as anonymization and privacy-preserving machine learning. Consistent and distributed caching.
    Related competences: CE1, CE4, CE7, CT5, CT6, CT7, CG2, CG4, CG5, CB2, CB3
  6. Integrate the techniques described in the previous objectives into a small but realistic project. Be able to design the architecture of a complex system and choose which of the techniques and technologies seen during the course to apply. The objective is not to finalize the implementation of the system, but to reach a level of design detail that would allow its completion to be commissioned to a programming team.
    Related competences: CE1, CE4, CE6, CE7, CT4, CT5, CT7, CG2, CG3, CG4, CG5, CB2, CB4
  7. Evaluate, at an elementary level, the implications of the systems built in the course in terms of privacy, security, ethics and people's rights. "At an elementary level" means being able to detect when these implications are significant enough to seek the opinion of an expert in the matter, particularly in relation to the GDPR (RGPD) and the need to carry out risk and impact analyses.
    Related competences: CE7, CT5, CG4, CB2, CB3, CB4

Contents

  1. Search and analysis of text information
    Boolean and vector models. Keyword-based search. Text preprocessing. Indexing. Evaluation of search strategies. Clustering and classification of texts. Generative models (LSI, LDA).
  2. Search and analysis in linked structures
    The web: evaluation algorithms for hyperlinked structures. Crawling and scraping. Social networks: centrality measures. Communities. Influence. Semantic web.
  3. Recommendation
    Recommender systems. Content-based recommendation and community-based recommendation ("collaborative filtering"). Practical considerations.
  4. Massive data algorithms
    Sketches and data streams (streaming). Sampling. Proximity queries. Geolocated data. Consistent and distributed caching. Handling sensitive data: anonymization, end-to-end encryption and privacy-preserving machine learning.
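
As an illustration of the vector model listed under content 1, the following is a minimal Python sketch of ranking documents by cosine similarity to a query. It is not course material: the whitespace tokenizer, the raw term-frequency weighting and the sample documents are assumptions chosen only to keep the example short.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed preprocessing: lowercase and split on whitespace only.
    return text.lower().split()


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the raw term-frequency vectors of a and b."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini-collection and query.
docs = [
    "web search engines index pages",
    "recommender systems suggest items",
    "search the web",
]
query = "web search"

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, d), reverse=True)
for d in ranked:
    print(f"{cosine_similarity(query, d):.3f}  {d}")
```

A realistic system would typically replace raw term counts with a weighting such as tf-idf and use an inverted index instead of scanning every document.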

Activities

Activity Evaluation act


Activity on the content "Search and analysis of text information"

In theory sessions, the teacher presents the motivations and main concepts, and afterwards teacher and students jointly solve 2-3 consolidation problems. In the laboratory, students work through a case related to the content.
Objectives: 1 3 6 7
Contents:
Theory
6h
Problems
0h
Laboratory
6h
Guided learning
0h
Autonomous learning
12h

Activity on the content "Search and analysis in linked structures"

In theory sessions, the teacher presents the motivations and main concepts, and afterwards teacher and students jointly solve 2-3 consolidation problems. In the laboratory, students work through a case related to the content.
  • Theory: Lecture format + group problem solving
Objectives: 2 6 7
Theory
6h
Problems
0h
Laboratory
6h
Guided learning
0h
Autonomous learning
12h

Activity on the topic "Recommendation"

In theory sessions, the teacher presents the motivations and main concepts, and afterwards teacher and students jointly solve 2-3 consolidation problems. In the laboratory, students work through a case related to the content.
Objectives: 4 6 7
Contents:
Theory
4h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
8h

Activity on the content "Massive data algorithms"

In theory sessions, the teacher presents the motivations and main concepts, and afterwards teacher and students jointly solve 2-3 consolidation problems. In the laboratory, students work through a case related to the content.
Objectives: 5 6 7
Contents:
Theory
8h
Problems
0h
Laboratory
8h
Guided learning
0h
Autonomous learning
18h

Integration. Building real systems. Implications for privacy, security and people's rights.

In theory sessions, the teacher presents the motivations and main concepts, and afterwards teacher and students jointly solve 2-3 consolidation problems. In the laboratory, students work through a case related to the content.
Objectives: 6 7
Theory
4h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
8h

Partial exam

Take-home partial exam, solved outside class. Students have 1 week to submit their solution from the time the exam statement is published.
Objectives: 1 2 3
Week: 7
Type: theory exam
Theory
3h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
10h

Final exam

Final exam, taken once the teaching period has ended, in an assigned classroom and time slot.
Objectives: 1 2 3 4 5 7
Week: 15 (Outside class hours)
Type: theory exam
Theory
3h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
12h

Teaching methodology

Expository "theory" classes given by the teacher. A number of exercises will be proposed, to be solved outside class before the next session.

"Theory" classes devoted to problem solving. The solutions to the exercises proposed in the preceding session(s) will be discussed together. Students are expected to have attempted them.

"Laboratory" classes: following a guide handed out at the start of the session, students carry out a computer-based task to consolidate the concepts seen in the "theory" classes. Typically this will be the implementation of, and experimentation with, an algorithm, or the analysis of a dataset.

Evaluation methodology

P = partial take-home exam mark, mid term.
F = final exam mark.
L = lab session reports mark.

The final grade is computed as 20% P + 40% F + 40% L.
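
As a worked example of this weighting, here is a short Python sketch. The function name, the 0-10 marking scale and the sample marks are assumptions for illustration only; just the 20/40/40 weights come from the formula above.

```python
def final_grade(partial: float, final_exam: float, lab: float) -> float:
    """Weighted final grade: 20% P + 40% F + 40% L."""
    return 0.2 * partial + 0.4 * final_exam + 0.4 * lab


# Hypothetical marks on an assumed 0-10 scale.
grade = final_grade(partial=7.0, final_exam=6.0, lab=8.5)
print(round(grade, 2))  # 0.2*7.0 + 0.4*6.0 + 0.4*8.5 = 1.4 + 2.4 + 3.4 = 7.2
```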

The grade for the transversal competence CT6 (autonomous learning) will be computed from exam answers and/or information in project reports on a topic, proposed by the instructor, that students will have to learn on their own.

Bibliography

Basic:

Previous capacities

Those provided by the courses in semesters 1 to 4 of the degree. The course is largely "comprehensive" of much of what has been covered before, especially concepts from mathematics (discrete mathematics, algebra, some calculus), probability and statistics, general algorithmics (in particular, graph algorithms), machine learning and data analysis, databases, and distributed and parallel computer systems. The labs are done in the Python programming language.