Techniques and Tools for Bioinformatics | FIB - Barcelona School of Informatics

Teachers

Person in charge

Santiago Marco Sola (santiago.marco@upc.edu)

Weekly hours

Theory

2

Problems

1

Laboratory

1

Guided learning

0

Autonomous learning

8.5

Competences

Technical Competences of each Specialization

Direcció i gestió

CDG1 - Capability to integrate technologies, applications, services and systems of Informatics Engineering, in general and in broader and multicisciplinary contexts.

Especifics

CTE7 - Capability to understand and to apply advanced knowledge of high performance computing and numerical or computational methods to engineering problems.

CTE9 - Capability to apply mathematical, statistical and artificial intelligence methods to model, design and develop applications, services, intelligent systems and knowledge-based systems.

Generic Technical Competences

Generic

CG4 - Capacity for mathematical modeling, calculation and simulation in technology and engineering companies centers, particularly in research, development and innovation tasks in all areas related to Informatics Engineering.

Transversal Competences

Information literacy

CTR4 - Capability to manage the acquisition, structuring, analysis and visualization of data and information in the area of informatics engineering, and critically assess the results of this effort.

Basic

CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.

Objectives

Understand pattern matching algorithms and data structures and their implementation in a modern programming language, and apply them to solve practical problems in bioinformatics.
Related competences: CB6, CTR4, CDG1, CTE7, CTE9, CG4,

Sequencing Technologies and Genome Analysis
The course gives an overview of modern DNA sequencing technologies and the computational characteristics of genomic data. We introduce the main sources of error and variability in sequencing data and review fundamental genome analysis tasks, including read mapping, variant calling, and assembly. These problems motivate the abstraction of biological sequences as strings and highlight the algorithmic and computational challenges posed by genome-scale datasets.
Exact String Matching Algorithms
We study the exact string matching problem as a foundational algorithmic task. Classical algorithms are introduced to illustrate different design paradigms, including prefix-based preprocessing, heuristic skipping strategies, hashing, and bit-parallelism. Emphasis is placed on correctness, time complexity, and practical performance considerations.
Text-Indexing Data Structures
This block focuses on data structures that support efficient pattern matching over large-scale texts, with special attention to genomic applications. We cover hash-based and k-mer indexing strategies, lexicographic tree structures, suffix trees and suffix arrays, and compressed full-text indexes based on the Burrows-Wheeler Transform.
Approximate String Matching
We extend pattern matching to the approximate setting, where mismatches and indels are allowed. The block introduces distance and error models, followed by filtering and seeding techniques to reduce the search space. Index-assisted methods, sketch-based similarity estimation, verification algorithms, and chaining strategies are presented as key components of modern approximate matching algorithms and tools.
Pairwise Sequence Alignment
This block addresses sequence alignment as a fundamental task in bioinformatics and genome analysis. We introduce dynamic programming techniques for computing edit distance and other alignment models under different scoring schemes, including affine gap penalties. Both classical alignment modalities and advanced algorithmic optimizations are studied, with attention to space efficiency, sparsity, and output-sensitive methods such as wavefront alignment.
Sequence Graphs and Pangenomics
The course concludes with an introduction to graph-based representations of genomic variation. We discuss sequence graphs, pangenome models, and their relationship to linear references. The sequence-to-graph alignment problem is introduced, highlighting the algorithmic challenges and opportunities that arise when extending alignment techniques to graph-based genome representations.

Activities

Activity Evaluation act

Sequencing Technologies and Genome Analysis

Objectives: 1
Contents:

1 . Sequencing Technologies and Genome Analysis

Theory

2h

Problems

1h

Laboratory

0h

Guided learning

0h

Autonomous learning

5h

Exact String Matching Algorithms

Objectives: 1
Contents:

2 . Exact String Matching Algorithms

Theory

2h

Problems

1h

Laboratory

0h

Guided learning

0h

Autonomous learning

5h

Text-indexing data structures

Objectives: 1
Contents:

3 . Text-Indexing Data Structures

Theory

3h

Problems

1h

Laboratory

2h

Guided learning

0h

Autonomous learning

5h

Approximate string matching

Objectives: 1
Contents:

4 . Approximate String Matching

Theory

2h

Problems

1h

Laboratory

1h

Guided learning

0h

Autonomous learning

5h

Pairwise sequence alignment

Objectives: 1
Contents:

5 . Pairwise Sequence Alignment

Theory

2h

Problems

1h

Laboratory

2h

Guided learning

0h

Autonomous learning

5h

Sequence graphs and pangenomics

Objectives: 1
Contents:

6 . Sequence Graphs and Pangenomics

Theory

1h

Problems

1h

Laboratory

1h

Guided learning

0h

Autonomous learning

5h

Programming Project

Objectives: 1
Week: 7 (Outside class hours)

Theory

0h

Problems

0h

Laboratory

0h

Guided learning

0h

Autonomous learning

0h

Presentation of advanced algorithms and data structures

Objectives: 1
Week: 7 (Outside class hours)

Theory

0h

Problems

0h

Laboratory

0h

Guided learning

0h

Autonomous learning

0h

Teaching methodology

The course combines lectures, problem-solving sessions, and independent work to develop both theoretical understanding and practical skills. During lecture sessions, the instructor introduces algorithmic concepts, data structures, and core techniques for genomics and pangenomics, supported by illustrative examples and discussion of complexity and performance trade-offs.

Problem-solving sessions are devoted to the guided resolution of exercises, in which students work on algorithmic problems related to the course content. These sessions emphasize reasoning, correctness, and analysis, and students receive continuous instructor supervision and feedback.

Independent study and project work play a central role in the course. Students are expected to read and analyze research literature, prepare an oral presentation of a selected paper, and develop an applied programming project that consolidates the techniques studied throughout the course.

Evaluation methodology

Assessment in this course is based on continuous evaluation.

The main component, accounting for 80% of the final grade, consists of an oral presentation and defence of a research paper on topics covered in the course. Students are expected to independently read, analyze, and critically present a research-level contribution that demonstrates understanding of the underlying algorithmic ideas, methodological choices, and their relevance to genome-scale sequence analysis.

The second component, accounting for the remaining 20% of the final grade, consists of a programming project, in which students autonomously design and implement an applied, experimental solution that puts into practice the algorithmic techniques studied throughout the course.

Bibliography

Basic

Genome-scale algorithm design : bioinformatics in the era of high-throughput sequencing - Mäkinen, Veli; Belazzougui, Djamal; Cunial, Fabio; Tomescu, Alexandru I, Cambridge University Press, 2023. ISBN: 9781009341233
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991005219277706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Algorithms on strings, trees, and sequences : computer science and computational biology - Gusfield, Dan, Cambridge University Press, 1997. ISBN: 0521585198
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991001989459706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Biological sequence analysis : probabilistic models of proteins and nucleic acids - Durbin, Richard, Cambridge University Press, 1998. ISBN: 0521629713
https://ebookcentral-proquest-com.recursos.biblioteca.upc.edu/lib/upcatalunya-ebooks/detail.action?pq-origsite=primo&docID=320915

Complementary

Bioinformatics : sequence and genome analysis - Mount, David W, Cold Spring Harbor Laboratory Press, cop. 2004. ISBN: 0879696877
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002934579706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
An Introduction to bioinformatics algorithms - Jones, Neil C; Pevzner, Pavel A, MIT Press, cop. 2004. ISBN: 0262101068
https://ebookcentral-proquest-com.recursos.biblioteca.upc.edu/lib/upcatalunya-ebooks/detail.action?pq-origsite=primo&docID=3338847

Previous capacities

No prior knowledge in bioinformatics is required. Basic concepts in molecular biology, genetics, or genomics may be helpful, but all necessary biological background will be introduced during the course.

Students are expected to have basic knowledge of computer science concepts, including basic algorithms and data structures. Moreover, students should be comfortable with programming in at least one mainstream programming language (e.g., Python, C, C++, Java, or Rust). Prior exposure to high-performance computing, computer architecture, or performance engineering is not required, although a basic understanding of these topics may be advantageous for the course's implementation and optimization components.

Students must have a strong interest and motivation in algorithms and data structures.