Techniques and Tools for Bioinformatics

You are here

Credits
3
Types
  • MEI: Elective
  • MIRI: Elective
  • MDS: Elective
Requirements
This subject has not requirements, but it has got previous capacities
Department
CS
This course presents the main algorithmic techniques and data structures used in modern genomics and pangenomics. It focuses on exact and approximate string matching, pairwise sequence alignment, and alignment over graph-based genome representations. Emphasis is placed on understanding the underlying algorithmic principles, complexity trade-offs, and practical considerations required to process genome-scale data efficiently and correctly.

Teachers

Person in charge

  • Santiago Marco Sola ( )

Weekly hours

Theory
2
Problems
1
Laboratory
1
Guided learning
0
Autonomous learning
8.5

Competences

Technical Competences of each Specialization

Direcció i gestió

  • CDG1 - Capability to integrate technologies, applications, services and systems of Informatics Engineering, in general and in broader and multicisciplinary contexts.

Especifics

  • CTE7 - Capability to understand and to apply advanced knowledge of high performance computing and numerical or computational methods to engineering problems.
  • CTE9 - Capability to apply mathematical, statistical and artificial intelligence methods to model, design and develop applications, services, intelligent systems and knowledge-based systems.

Generic Technical Competences

Generic

  • CG4 - Capacity for mathematical modeling, calculation and simulation in technology and engineering companies centers, particularly in research, development and innovation tasks in all areas related to Informatics Engineering.

Transversal Competences

Information literacy

  • CTR4 - Capability to manage the acquisition, structuring, analysis and visualization of data and information in the area of informatics engineering, and critically assess the results of this effort.

Basic

  • CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.

Objectives

  1. Understand pattern matching algorithms and data structures and their implementation in a modern programming language, and apply them to solve practical problems in bioinformatics.
    Related competences: CB6, CTR4, CDG1, CTE7, CTE9, CG4,

Contents

  1. Sequencing Technologies and Genome Analysis
    The course gives an overview of modern DNA sequencing technologies and the computational characteristics of genomic data. We introduce the main sources of error and variability in sequencing data and review fundamental genome analysis tasks, including read mapping, variant calling, and assembly. These problems motivate the abstraction of biological sequences as strings and highlight the algorithmic and computational challenges posed by genome-scale datasets.
  2. Exact String Matching Algorithms
    We study the exact string matching problem as a foundational algorithmic task. Classical algorithms are introduced to illustrate different design paradigms, including prefix-based preprocessing, heuristic skipping strategies, hashing, and bit-parallelism. Emphasis is placed on correctness, time complexity, and practical performance considerations.
  3. Text-Indexing Data Structures
    This block focuses on data structures that support efficient pattern matching over large-scale texts, with special attention to genomic applications. We cover hash-based and k-mer indexing strategies, lexicographic tree structures, suffix trees and suffix arrays, and compressed full-text indexes based on the Burrows-Wheeler Transform.
  4. Approximate String Matching
    We extend pattern matching to the approximate setting, where mismatches and indels are allowed. The block introduces distance and error models, followed by filtering and seeding techniques to reduce the search space. Index-assisted methods, sketch-based similarity estimation, verification algorithms, and chaining strategies are presented as key components of modern approximate matching algorithms and tools.
  5. Pairwise Sequence Alignment
    This block addresses sequence alignment as a fundamental task in bioinformatics and genome analysis. We introduce dynamic programming techniques for computing edit distance and other alignment models under different scoring schemes, including affine gap penalties. Both classical alignment modalities and advanced algorithmic optimizations are studied, with attention to space efficiency, sparsity, and output-sensitive methods such as wavefront alignment.
  6. Sequence Graphs and Pangenomics
    The course concludes with an introduction to graph-based representations of genomic variation. We discuss sequence graphs, pangenome models, and their relationship to linear references. The sequence-to-graph alignment problem is introduced, highlighting the algorithmic challenges and opportunities that arise when extending alignment techniques to graph-based genome representations.

Activities

Activity Evaluation act


Sequencing Technologies and Genome Analysis


Objectives: 1
Contents:
Theory
2h
Problems
1h
Laboratory
0h
Guided learning
0h
Autonomous learning
5h

Exact String Matching Algorithms


Objectives: 1
Contents:
Theory
2h
Problems
1h
Laboratory
0h
Guided learning
0h
Autonomous learning
5h

Text-indexing data structures


Objectives: 1
Contents:
Theory
3h
Problems
1h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Approximate string matching


Objectives: 1
Contents:
Theory
2h
Problems
1h
Laboratory
1h
Guided learning
0h
Autonomous learning
5h

Pairwise sequence alignment


Objectives: 1
Contents:
Theory
2h
Problems
1h
Laboratory
2h
Guided learning
0h
Autonomous learning
5h

Sequence graphs and pangenomics


Objectives: 1
Contents:
Theory
1h
Problems
1h
Laboratory
1h
Guided learning
0h
Autonomous learning
5h

Programming Project


Objectives: 1
Week: 7 (Outside class hours)
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
15h

Presentation of advanced algorithms and data structures


Objectives: 1
Week: 7 (Outside class hours)
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
6h

Teaching methodology

The course combines lectures, problem-solving sessions, and independent work to develop both theoretical understanding and practical skills. During lecture sessions, the instructor introduces algorithmic concepts, data structures, and core techniques for genomics and pangenomics, supported by illustrative examples and discussion of complexity and performance trade-offs.

Problem-solving sessions are devoted to the guided resolution of exercises, in which students work on algorithmic problems related to the course content. These sessions emphasize reasoning, correctness, and analysis, and students receive continuous instructor supervision and feedback.

Independent study and project work play a central role in the course. Students are expected to read and analyze research literature, prepare an oral presentation of a selected paper, and develop an applied programming project that consolidates the techniques studied throughout the course.

Evaluation methodology

Assessment in this course is based on continuous evaluation.

The main component, accounting for 80% of the final grade, consists of an oral presentation and defence of a research paper on topics covered in the course. Students are expected to independently read, analyze, and critically present a research-level contribution that demonstrates understanding of the underlying algorithmic ideas, methodological choices, and their relevance to genome-scale sequence analysis.

The second component, accounting for the remaining 20% of the final grade, consists of a programming project, in which students autonomously design and implement an applied, experimental solution that puts into practice the algorithmic techniques studied throughout the course.

Bibliography

Basic:

Complementary:

Previous capacities

No prior knowledge in bioinformatics is required. Basic concepts in molecular biology, genetics, or genomics may be helpful, but all necessary biological background will be introduced during the course.

Students are expected to have basic knowledge of computer science concepts, including basic algorithms and data structures. Moreover, students should be comfortable with programming in at least one mainstream programming language (e.g., Python, C, C++, Java, or Rust). Prior exposure to high-performance computing, computer architecture, or performance engineering is not required, although a basic understanding of these topics may be advantageous for the course's implementation and optimization components.

Students must have a strong interest and motivation in algorithms and data structures.