Massive Information Search and Analysis

Credits

Types

Complementary specialization (Information Systems)

Requirements

Prerequisite: BD
Prerequisite: PE
Corequisite: PROP

Department

Mail

caim@cs.upc.edu

The amount of information stored digitally in many organizations, or collectively on the web, is now so large that finding what you are looking for is often difficult. The field known as "Information Retrieval" deals with methods for organizing information so that users can find it conveniently and efficiently. We will cover the basic techniques for keyword-based search over text documents. We will then examine web search, where hyperlinks can be used not only to guide the search but also to assess the importance of each page, as the well-known PageRank algorithm does. We will see how these techniques extend to social networks, where the graph of interactions between users reveals a great deal about what may interest each of them. Finally, we will study efficient randomized algorithms for massive data streams.

Teachers

Person in charge

David Garcia Soriano (david.garcia.soriano@upc.edu)
Marta Arias Vicente (marias@cs.upc.edu)

Weekly hours

Theory

1.5

Problems

0.5

Laboratory

Guided learning

Autonomous learning

Competences

Technical Competences of each Specialization

Information systems specialization

CSI2 - To integrate solutions of Information and Communication Technologies, and business processes to satisfy the information needs of the organizations, allowing them to achieve their objectives effectively.

CSI2.3 - To demonstrate knowledge of, and the capacity to apply, knowledge extraction and management systems.
CSI2.6 - To demonstrate knowledge and capacity to apply decision support and business intelligence systems.

Computer science specialization

CCO2 - To develop effectively and efficiently the adequate algorithms and software to solve complex computation problems.

CCO2.5 - To implement information retrieval software.

Transversal Competences

Autonomous learning

G7 [Assessable] - To detect gaps in one's own knowledge and overcome them through critical reflection and by choosing the best course of action to extend this knowledge. Capacity for learning new methods and technologies, and versatility to adapt to new situations.

G7.3 - Autonomous learning: capacity to plan and organize personal work. To apply the acquired knowledge when performing a task according to its suitability and importance, to decide how to perform it and the time needed, and to select the most adequate information sources. To identify the importance of establishing and maintaining contacts with students, teaching staff and professionals (networking). To identify information forums about ICT engineering, its advances and its impact on society (IEEE, associations, etc.).

Objectives

Understand the problems associated with information storage and retrieval, in particular for information in textual form.
Related competences: CCO2.5
Understand that effective search and information retrieval is closely related to the organization and description of this information.
Related competences: CCO2.5, G7.3
To know and understand the structure, architecture and functioning of the web, and elements related to it: indices, search engines, crawlers, among others.
Related competences: CSI2.3, G7.3
To know and understand the descriptive parameters of complex networks and the algorithms to analyze their structure.
Related competences: CSI2.3, CSI2.6, G7.3
Recognize the opportunities for using massive information to further an organization's goals, and choose the most appropriate methods, tools, and procedures.
Related competences: CSI2.6, G7.3
Be able to decide which information retrieval techniques may be effective in a specific information system, especially text-based ones.
Related competences: CSI2.3, CSI2.6, CCO2.5, G7.3
Be able to evaluate the effectiveness and usefulness of an information retrieval system, according to several criteria.
Related competences: CSI2.3, CSI2.6, CCO2.5, G7.3
To implement the main techniques learned during the course.
Related competences: CCO2.5, G7.3
Subcompetences
- Be able to implement the basic techniques (algorithms and data structures) for information retrieval.
- Be able to implement basic algorithms for network analysis.
Know how to use, adapt and extend open-source software.
Related competences: G7.3
Subcompetences
- For example: Lucene, Dex database, WIRE crawler, among others.

Introduction
The need for techniques for searching and analyzing massive information. Search and analysis vs. databases. The information retrieval process. Preprocessing and lexical analysis.
Search in large volumes of data
Ranking and relevance models for the web. PageRank algorithm. Crawling. Architecture of a simple web search system. Techniques based on locality-sensitive hashing (LSH).
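As a rough illustration of the PageRank idea mentioned above, the following Python sketch ranks the pages of a tiny link graph by power iteration. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the toy graph are assumptions made for this example, not course material.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page shares its current rank equally among its links.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web: c is linked from a, b and d; nobody links to d.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph, page c collects links from a, b and d, so it ends up with the highest score, while d, which no page links to, keeps only the baseline share.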
Models of information retrieval
Formal definition and basic concepts: abstract document models and query languages. Boolean model. Vector model. Inverted files and signature files. Index compression. Example: efficient implementation of the cosine measure with tf-idf weights. Recall and precision. Other performance measures. Reference collections. "Relevance feedback" and "query expansion".
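To make the vector model concrete, here is a minimal Python sketch that builds tf-idf vectors and compares them with the cosine measure. It is a naive version that compares documents directly, not the efficient inverted-file implementation the topic refers to; the function names, the raw-count tf, the log(N/df) idf variant, and the sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical three-document collection, already split into tokens.
docs = [
    "web search with hyperlinks".split(),
    "web search ranking".split(),
    "streaming algorithms for big data".split(),
]
vectors = tfidf_vectors(docs)
```

Here the first two documents share the terms "web" and "search", so their similarity is positive, whereas the first and third share no terms and their similarity is 0.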
Architecture of massive information processing systems
Scalability, high performance, and fault tolerance: the case of large-scale web search engines. Distributed architectures. Example: Hadoop.
Network analysis
Descriptive parameters and characteristics of networks: degree, diameter, small-world networks, among others. Algorithms on networks: clustering, community detection and detection of influential nodes, reputation, among others.
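On small graphs, two of the descriptive parameters listed above can be computed with very little code. The sketch below assumes an undirected, connected graph stored as a dictionary of neighbor sets (a representation chosen for this example) and computes node degrees and the diameter by running a breadth-first search from every node.

```python
from collections import deque


def degrees(graph):
    """Degree of every node in an undirected graph given as node -> set of neighbors."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def eccentricity(graph, source):
    """Longest shortest-path distance from source, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())


def diameter(graph):
    """Diameter of a connected undirected graph: the largest eccentricity."""
    return max(eccentricity(graph, node) for node in graph)


# Hypothetical path graph 1 - 2 - 3 - 4.
path_graph = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
```

Running one search per node takes time proportional to the number of nodes times the number of edges, which is fine for a toy graph but not for the massive networks this topic targets.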
Algorithms for big data streams
Summaries (sketches) and data streams (streaming). Sampling. Algorithms covered include reservoir sampling, count-min sketch, and HyperLogLog, among others.
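As one example of the sampling algorithms named above, this Python sketch shows a common form of reservoir sampling (often called Algorithm R), which keeps a uniform random sample of k items from a stream without knowing its length in advance. The function name, its parameters, and the use of `random.Random` are assumptions for this illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Sample 5 values from a stream of 10,000 integers, reading each item once.
sample = reservoir_sample(range(10_000), 5, random.Random(42))
```

Only the k sampled items are held in memory, regardless of how long the stream is. If the stream has fewer than k items, the function returns all of them.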

Activities

Introduction and models of information retrieval

Objectives: 1 2 6
Contents:

3. Models of information retrieval
1. Introduction

Theory

4.5h

Problems

Laboratory

10h

Guided learning

Autonomous learning

16h

Search in large volumes of data

Objectives: 3 5 9
Contents:

2. Search in large volumes of data

Theory

3.5h

Problems

Laboratory

Guided learning

Autonomous learning

12.5h

First partial exam

Partial exam of the first part of the course.
Objectives: 1 2 3 5 6 7
Week: 9

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Architecture of web search systems

Objectives: 3 6 8 9
Contents:

2. Search in large volumes of data
5. Network analysis

Theory

Problems

Laboratory

Guided learning

Autonomous learning

8.5h

Network analysis

Objectives: 4 6 7 8 9
Contents:

5. Network analysis

Theory

3.5h

Problems

1.5h

Laboratory

Guided learning

Autonomous learning

12.5h

Algorithms for big data

Objectives: 5 8
Contents:

6. Algorithms for big data streams

Theory

Problems

Laboratory

Guided learning

Autonomous learning

12h

Second partial exam or final exam

Objectives: 1 2 3 4 5 6 7 8 9
Week: 15 (Outside class hours)

Theory

Problems

Laboratory

Guided learning

Autonomous learning

First lab assessment

Objectives: 1 2 3 4 5 6 7 8 9
Week: 6

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Second lab assessment

Objectives: 1 2 3 4 5 6 7 8 9
Week: 14

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Teaching methodology

- Theory lectures. Before each class, students must have read the notes and materials on the topic to be discussed, which will be announced with enough time to prepare. Students will also have a questionnaire with basic questions to check whether they have reached a basic degree of understanding. In class, the teacher will present the main points, assuming that students have done the assigned work and have tried to answer the questionnaire; difficulties found by students will be discussed collectively in class.

- Problem-solving sessions. Teachers and students will discuss and compare solutions to problems that the teacher provides well in advance of each class. Discussions can be held collectively in class or individually between teacher and student. The teacher will assume that the students have spent a reasonable amount of time trying to solve these exercises, and priority will be given to those who have done so.

- Laboratory sessions. Before each class, students are assumed to have read the script of practical work to be developed during the session. During class, students will do the work specified in the script with the guidance of the teacher. In many cases, students will probably need extra time to finish the work. For most lab sessions the students will have to write a short report and/or deliver files associated with it (output files and code).

- Personal work. Every type of classroom activity involves a certain amount of personal work. Additionally, some course topics may have no associated theory classes or exercises; students must study these on their own, and can take advantage of the guided learning sessions to assess whether they have learnt them sufficiently.

Evaluation methodology

The subject will include the following assessment activities:

- A first partial exam, held halfway through the course, on the material covered up to that point. Let P1 be the grade obtained in this exam.

- A second partial exam, focused on the second half of the course, but which can include any part of the subject. Let P2 be the grade obtained in this exam.

- Two in-person laboratory tests. Let L be the average grade obtained from these two tests.

The three grades L, P1 and P2 are between 0 and 10.
The final grade for the subject will be the result of the formula 20% L + 40% P1 + 40% P2.

Regarding the grade for the competency associated with Autonomous Learning, a numerical grade will be calculated as follows:

- Some specially marked questions in the in-person assessment tests will deal, wholly or in part, with topics that students must prepare on their own, with little or no coverage in theory and problem classes; these topics will be indicated during the course. Let S be the average score on these questions in the exams applicable to the student, scaled to the interval [0,1].

The competency grade will be:
- D if S is less than 0.3
- C if S is between 0.3 and 0.499
- B if S is between 0.5 and 0.699
- A if S is 0.7 or more.
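The grading rules above can be expressed as a short Python sketch. The function names are hypothetical, and the handling of values that fall between the listed ranges (for example S = 0.4995) is an assumption: the sketch treats 0.3, 0.5 and 0.7 as cut-offs.

```python
def final_grade(lab, p1, p2):
    """Final grade: 20% L + 40% P1 + 40% P2, with each grade between 0 and 10."""
    for grade in (lab, p1, p2):
        if not 0 <= grade <= 10:
            raise ValueError("grades must be between 0 and 10")
    return 0.2 * lab + 0.4 * p1 + 0.4 * p2


def competency_grade(s):
    """Letter grade for the Autonomous Learning competency, with s in [0, 1]."""
    if not 0 <= s <= 1:
        raise ValueError("s must be between 0 and 1")
    # Assumption: the listed ranges are read as cut-offs at 0.3, 0.5 and 0.7.
    if s < 0.3:
        return "D"
    if s < 0.5:
        return "C"
    if s < 0.7:
        return "B"
    return "A"
```

For instance, lab test grades of 7 and 9 give L = 8; with P1 = 6 and P2 = 7.5, the final grade is 0.2·8 + 0.4·6 + 0.4·7.5 = 7.0.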

Bibliography

Basic

Mining of massive datasets - Leskovec, J; Rajaraman, A; Ullman, J.D., Cambridge University Press, 2020. ISBN: 9781108476348
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004193679706711
Modern information retrieval: the concepts and technology behind search - Baeza-Yates, R.; Ribeiro-Neto, B, Addison-Wesley / Pearson, 2011. ISBN: 9780321416919
https://discovery.upc.edu/permalink/34CSUC_UPC/l60p4r/alma991003938679706711

Complementary

Introduction to information retrieval - Manning, C.D.; Raghavan, P; Schütze, H, Cambridge University Press, 2008. ISBN: 9780521865715
https://discovery.upc.edu/permalink/34CSUC_UPC/i7glq6/alma991003641259706711
Mining the social web: data mining Facebook, Twitter, LinkedIn, Instagram, Github, and more - Russell, Matthew A; Klassen, Mikhail, O'Reilly Media, 2018. ISBN: 9781491973509
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991001686489706711&context=L&vid=34CSUC_UPC:VU1
Search engines: information retrieval in practice - Croft, W. Bruce; Metzler, Donald; Strohman, Trevor, Pearson, 2010. ISBN: 9780131364899
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003969369706711&context=L&vid=34CSUC_UPC:VU1

Previous capacities

In general, all those that are acquired in the required prior courses.

Specifically:

- To know, and be comfortable using, basic concepts of linear algebra, discrete mathematics, probability and statistics.

- To program comfortably in object-oriented languages, including inheritance between classes.

- To know the main data structures for accessing information efficiently and their implementations (lists, hashing, trees, graphs, heaps). To be able to use them to build efficient programs. To be able to analyze the execution time and memory used by an algorithm of average difficulty. To have a sense of the difference in access time between main memory and disk.

- To know the main elements of a relational database and of SQL-like access languages.