Skip to main content

Massive Information Search and Analysis

Credits
6
Types
Specialization complementary (Information Systems)
Requirements
Department
CS
Mail
caim@cs.upc.edu
The amount of information stored digitally in organizations, or collectively on the web, is today large enough to make searching this information a generally complicated task. The field known as "Information Retrieval" finds methods to organize information in such a way that finding information afterwards can be done simply and efficiently. We will cover basic keyword-based techniques to search in textual information. Then, we will examine search in the web, where hyperlinks can be used not only to direct the search but to assess the interest value of each page - as is the case with the well-known PageRank algorithm. We will see extensions of these techniques to the case of Social Networks where interactions among users can provide very useful information. Finally, we will study ways in which these techniques can be exploited for the benefit of specific organizations.

Teachers

Person in charge

Others

Weekly hours

Theory
1.5
Problems
0.5
Laboratory
2
Guided learning
0
Autonomous learning
6

Competences

Information systems specialization

  • CSI2 - To integrate solutions of Information and Communication Technologies, and business processes to satisfy the information needs of the organizations, allowing them to achieve their objectives effectively.
    • CSI2.3 - To demonstrate knowledge and application capacity of extraction and knowledge management systems .
    • CSI2.6 - To demonstrate knowledge and capacity to apply decision support and business intelligence systems.
  • Computer science specialization

  • CCO2 - To develop effectively and efficiently the adequate algorithms and software to solve complex computation problems.
    • CCO2.5 - To implement information retrieval software.
  • Autonomous learning

  • G7 [Avaluable] - To detect deficiencies in the own knowledge and overcome them through critical reflection and choosing the best actuation to extend this knowledge. Capacity for learning new methods and technologies, and versatility to adapt oneself to new situations.
    • G7.3 - Autonomous learning: capacity to plan and organize personal work. To apply the acquired knowledge when performing a task, in function of its suitability and importance, decide how to perform it and the needed time, and select the most adequate information sources. To identify the importance of establishing and maintaining contacts with students, teacher staff and professionals (networking). To identify information forums about ICT engineering, its advances and its impact in the society (IEEE, associations, etc.).
  • Objectives

    1. Understand the problems associated with storage and information retrieval, in particular with information in textual form.
      Related competences: CCO2.5,
    2. Understand that effective search and information retrieval is closely related to the organization and description of this information.
      Related competences: CCO2.5, G7.3,
    3. To know and understand the structure, architecture and functioning of the web, and elements related to it: indices, search engines, crawlers, among others.
      Related competences: CSI2.3, G7.3,
    4. To know and understand the descriptive parameters of complex networks and the algorithms to analyze their structure.
      Related competences: CSI2.3, CSI2.6, G7.3,
    5. Recognizing the opportunities for using massive information to an organization's goals, and choose the most appropriate methods, tools, and procedures.
      Related competences: CSI2.6, G7.3,
    6. Be able to decide the information retrieval techniques that may be effective in a specific information system, especially those of textual type.
      Related competences: CSI2.3, CSI2.6, CCO2.5, G7.3,
    7. Be able to evaluate the effectiveness and usefulness of an information retrieval system, according to several criteria.
      Related competences: CSI2.3, CSI2.6, CCO2.5, G7.3,
    8. To implement themain techniques learned during the course.
      Related competences: CCO2.5, G7.3,
      Subcompetences
      • Be able to implement the basic techniques (algorithms and data structures) for information retrieval.
      • Be able to implement basic algorithms for network analysis.
    9. Know how to use, adapt and extend open-source software.
      Related competences: G7.3,
      Subcompetences
      • For example: Lucene, Dex database, WIRE crawler, among others.

    Contents

    1. Introduction
      Need of search and analysis techniques of massive information. Search and analysis vs. databases. Information retrieval process. Preprocessing and lexical analysis.
    2. Models of information retrieval
      Formal definition and basic concepts: abstract models of documents and query languages. Boolean model. Vector model. Latent Semantic Indexing.
    3. Implementation: Indexing and searching
      Inverse and signature files. Index compression. Example: Efficient implementation of the rule of the cosine measure with tf-idf. Example: Lucene.
    4. Evaluation in information retrieval
      Recall and precision. Other performance measures. Reference collections. Relevance feedback and query expansion.
    5. Web search
      Ranking and relevance in the web. The PageRank algorithm. Crawling. Architecture of a simple web search system.
    6. Architecture of massive information processing systems
      Scalability, high performance, and fault tolerance: the case of massive web searchers. Distributed architectures. Example: Hadoop.
    7. Network analysis
      Descriptive parameters and characteristics of networks: degree, diameter, small-world networks, among others. Algorithms on networks: clustering, community detection and detection of influential nodes, reputation, among others.
    8. Information Systems based on massive information analysis. Combination with other technologies.
      Search Engine Optimization. Joint use of IR techniques with Data Mining and Machine Learning. Recommender Systems.

    Activities

    Activity Evaluation act


    Introduction and models of information retrieval

    2 theory hours, 2 problem hours, and 4 lab hours on the topics "Introduction" and "Models of information retrieval". See the description in the Teaching Methodology section.
    Objectives: 1 2 6
    Contents:
    Theory
    4h
    Problems
    2h
    Laboratory
    6h
    Guided learning
    0h
    Autonomous learning
    13.5h

    Implementation and evaluation

    2 theory hours, 2 problem hours, and 4 lab hours on the topics "Implementation" and "Evaluation". See the description in the Teaching Methodology section.
    Objectives: 2 7 8 9
    Contents:
    Theory
    2.5h
    Problems
    1h
    Laboratory
    4h
    Guided learning
    0h
    Autonomous learning
    6.5h

    Searching the web

    2 theory hours, 2 problem hours, and 4 lab hours on the topic "Web search". See the description in the Teaching Methodology section.
    Objectives: 3 5 9
    Contents:
    Theory
    2.5h
    Problems
    1h
    Laboratory
    6h
    Guided learning
    0h
    Autonomous learning
    12.5h

    First partial exam

    Partial exam of the first part of the course.
    Objectives: 1 2 3 5 6 7
    Week: 9
    Theory
    0h
    Problems
    0h
    Laboratory
    0h
    Guided learning
    0h
    Autonomous learning
    0h

    Architecture of web search systems

    2 theory hours and 6 lab hours on the topic "Architecture". See the description in the Teaching Methodology section.
    Objectives: 3 6 8 9
    Theory
    3.5h
    Problems
    1.5h
    Laboratory
    6h
    Guided learning
    0h
    Autonomous learning
    12.5h

    Network analysis

    4 theory hours and 6 lab hours on the topic "Network Analysis". See the description in the Teaching Methodology section.
    Objectives: 4 6 7 8 9
    Contents:
    Theory
    3.5h
    Problems
    1.5h
    Laboratory
    4h
    Guided learning
    0h
    Autonomous learning
    12.5h

    Information systems based on massive information analysis

    Theory, problems, and labs on this topic. The emphasis is on practical cases in the problems and lab sessions. See the description in the Teaching Methodology section.
    Objectives: 5 6 7 9
    Contents:
    Theory
    1.5h
    Problems
    0.5h
    Laboratory
    4h
    Guided learning
    0h
    Autonomous learning
    12.5h

    Second partial exam or final exam

    The student chooses between an exam of the second part of the course or an exam on the whole course.
    Objectives: 1 2 3 4 5 6 7 8 9
    Week: 15 (Outside class hours)
    Theory
    0h
    Problems
    0h
    Laboratory
    0h
    Guided learning
    0h
    Autonomous learning
    0h

    Teaching methodology

    - Theory lectures. Before each class, students must have read the notes and materials on the topic to be discussed in class, which will be announced with enough time to prepare. Students will also have at their disposal a questionnaire with basic questions to see if a basic degree of understanding has been reached. In class, the teacher will present the main points, assuming that the student has done the job indicated and has tried to answer the questionnaire; difficulties found by students will be discussed in class collectively.

    - Problem-solving sessions. Teachers and students will discuss and compare the solutions to problems provided by the teacher with sufficient time before each class. Discussions can be made collectively in class or individually between teacher and student. The teacher will assume that the students have spent a reasonable amount of time trying to solve these exercises, and priority will be given to those who have done so.

    - Laboratory sessions. Before each class, students are assumed to have read the script of practical work to be developed during the session. During class, students will do the work specified in the script with the guidance of the teacher. In many cases, students will probably need extra time to finish the work. For most lab sessions the students will have to write a short report and/or deliver files associated with it (output files and code).

    - Personal work. Every type of classroom activity involves a certain amount of personal work. Additionally, some topic or topics of the course could have no theory classes or exercises associated; students must study these on their own, and can take advantage the directed activities' sessions to assess whether they have learnt them sufficiently or not.

    Given that the course appears as part of two different degree specialties, the activities proposed (in theory, problem-solving or lab sessions) may slightly differ for different types of students, always ensuring fairness regarding difficulty or workload.

    Evaluation methodology

    The subject will include the following assessment components:

    - Laboratory report submissions, which must be turned in within a specified period for each session (approximately 2 weeks). A laboratory grade, L, will be calculated based on a weighted average of these reports.

    - A first mid-term exam, taken around the middle of the course, covering material learned up to that point. Let P1 be the grade earned on this exam.

    - A final exam, focused on the second half of the course but able to include any part of the subject. Let P2 be the grade earned on this exam.

    The three grades, L, P1, and P2, are each between 0 and 10. The final grade for the subject will be the average of these three grades.

    Regarding the grade for the competence related to Autonomous Learning, a numerical grade will be calculated as follows:

    For each lab report, the value Ri will be 1 if it was submitted on time and at the professor's discretion, demonstrates reasonable effort to complete the work; otherwise, it will be 0. Let Rsum be the sum of all Ri values (which can reach k if k reports are required).

    Some questions on the final or partial exams, especially marked, will cover topics that students must prepare on their own¿with little or no class coverage¿indicated during the course. Let E be the weighted average of these questions' scores on the exams, scaled to the interval [0,1].

    Let S be calculated as (Rsum / k + E) / 2, which will also be between 0 and 1.

    The grade for the competence will be:
    - D if S is less than 0.5
    - C if S is between 0.5 and 0.599
    - B if S is between 0.6 and 0.799
    - A if S is 0.8 or higher.

    Bibliography

    Basic

    Web links

    Previous capacities

    In general, all those that are acquired in the required prior courses.

    Specifically:

    - To know and use comfortably basic concepts of linear algebra, discrete mathematics, probability and statistics.

    - To program comfortably in object-oriented languages, including inheritance between classes.

    - To know the main data structures to access information efficiently and their implementations (lists, hashing, trees, graphs, heaps). To be able to use them to build efficient programs. To be able to analyze the execution time and memory used by an algorithm of average difficulty. To have an idea of the difference in time to access main memory and disk.

    - To know the main elements of a relational database and SQL-like access language.