Bioinformatics and Statistical Genetics

You are here

Specialization complementary (Data Science)
This subject has not requirements, but it has got previous capacities
Bioinformatics and Statistical Genetics


Person in charge

  • Gabriel Valiente Feruglio ( )


  • Jan Graffelman ( )

Weekly hours

Guided learning
Autonomous learning


Generic Technical Competences


  • CG3 - Capacity for mathematical modeling, calculation and experimental designing in technology and companies engineering centers, particularly in research and innovation in all areas of Computer Science.

Transversal Competences


  • CTR6 - Capacity for critical, logical and mathematical reasoning. Capability to solve problems in their area of study. Capacity for abstraction: the capability to create and use models that reflect real situations. Capability to design and implement simple experiments, and analyze and interpret their results. Capacity for analysis, synthesis and evaluation.


  • CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
  • CB7 - Ability to integrate knowledges and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
  • CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.

Technical Competences of each Specialization


  • CEC1 - Ability to apply scientific methodologies in the study and analysis of phenomena and systems in any field of Information Technology as well as in the conception, design and implementation of innovative and original computing solutions.
  • CEC2 - Capacity for mathematical modelling, calculation and experimental design in engineering technology centres and business, particularly in research and innovation in all areas of Computer Science.
  • CEC3 - Ability to apply innovative solutions and make progress in the knowledge that exploit the new paradigms of Informatics, particularly in distributed environments.


  1. Introduce the student to the algorithmic, computational, and statistical problems that arise in the analysis of biological data.
    Related competences: CB6, CB7, CB9, CTR6, CEC1, CEC2, CEC3, CG3,
  2. Reinforce the knowledge of discrete structures, algorithmic techniques, and statistical techniques that the student may have from previous courses.
    Related competences: CB6, CB7, CB9, CTR6, CEC1, CEC2, CEC3, CG3,


  1. How do we assemble genomes?
    A short history of DNA sequencing technologies.
    Tractable and intractable problems.
    From Euler to Hamilton to de Bruijn.
    The string reconstruction problem.
    String reconstruction as a walk in the overlap graph.
    Another graph for string reconstruction.
    Walking in the de Bruijn graph.
    Assembling genomes from read-pairs.
  2. How do we compare biological sequences?
    Cracking the non-ribosomal code.
    Introduction to sequence alignment.
    The Manhattan tourist problem.
    Sequence alignment is the Manhattan tourist problem in disguise.
    An introduction to dynamic programming: The change problem.
    The Manhattan tourist problem revisited.
    From Manhattan to an arbitrary directed acyclic graph.
    Backtracking in the alignment graph.
    Scoring alignments.
    From global to local alignment.
    The changing faces of sequence alignment.
    Penalizing insertions and deletions in sequence alignment.
    Space-efficient sequence alignment.
  3. Are there fragile regions in the human genome?
    The random breakage model of chromosome evolution.
    Sorting by reversals.
    A greedy heuristic for sorting by reversals.
    Rearrangements in tumor genomes.
    From unichromosomal to multichromosomal genomes.
    Breakpoint graphs.
    Computing the 2-break distance.
    Rearrangement hotspots in the human genome.
  4. Which animal gave us SARS?
    Transforming distance matrices into evolutionary trees.
    Toward an algorithm for distance-based phylogeny construction.
    Using least squares to construct approximate distance-based phylogenies.
    Ultrametric evolutionary trees.
    The neighbor-joining algorithm.
    Character-based tree reconstruction.
    The small parsimony problem.
    The large parsimony problem.
  5. How did yeast become a wine maker?
    An evolutionary history of wine making.
    Identifying genes responsible for the diauxic shift.
    Introduction to clustering.
    The good clustering principle.
    Clustering as an optimization problem.
    Farthest first traversal.
    k-means clustering.
    The Lloys algorithm.
    Clustering genes implicated in the diauxic shift.
    Limitations of k-means clustering.
    From coin flipping to k-means clustering.
    Making soft decisions in coin flipping.
    Soft k-means clustering.
    Hierarchical clustering.
  6. How do we locate disease-causing mutations?
    Introduction to multiple pattern matching.
    Herding patterns into a trie.
    Preprocessing the genome instead.
    Suffix trees.
    Suffix arrays.
    The Burrows-Wheeler transform.
    Inverting the Burrows-Wheeler transform.
    Pattern matching with the Burrows-Wheeler transform.
    Speeding up Burrows-Wheeler pattern matching.
  7. How do we analyze microbial communities?
    Sequence quality.
    Single-read and paired-end sequencing.
    Demultiplexing reads from multiple metagenomic samples.
    Clustering reads into operational taxonomic units.
    Chimeric sequences.
    Taxonomy assignment and taxonomic assignment.
    Analysing alpha and beta diversity of microbial communities.
  8. Introduction to statistical genetics
    Basic genetic terminology. Population-based and family-based studies. Traits, markers and polymorphisms. Single nucleotide polymorphisms and microsatellites. R-package genetics.
  9. Hardy-Weinberg equilibrium
    Hardy-Weinberg law. Hardy-Weinberg assumptions. Multiple alleles. Statistical tests for Hardy-Weinberg equilibrium: chi-square, exact and likelihood-ratio tests. Graphical representations. Disequilibrium coefficients: the inbreeding coefficient, Weir's D. R-package HardyWeinberg.
  10. Linkage disequilibrium
    Definition of linkage disequilibrium (LD). Measures for LD. Estimation of LD by maximum likelihood. Haplotypes. The HapMap project. Graphics for LD. The LD heatmap.
  11. Phase estimation
    Phase ambiguity for double heterozygotes. Phase estimation with the EM algorithm. Estimation of haplotype frequencies. R-package haplo.stats.
  12. Population substructure
    Definition of population substructure. Population substructure and Hardy-Weinberg equilibrium. Population substructure and LD. Statistical methods for detecting substructure. Multidimensional scaling. Metric and non-metric multidimensional scaling. Euclidean distance matrices. Stress. Graphical representations.
  13. Genetic association analysis
    Disease-marker association studies. Genetic models: dominant, co-dominant and recessive models. Testing models with chi-square tests. The alleles test and the Cochran-Armitage trend test. Genome-wide assocation tests.
  14. Family relationships and allele sharing
    Identity by state (IBS) and Identity by descent (IBD). Kinship coefficients. Allele sharing. Detection of family relationships. Graphical representations.


Teaching methodology

All classes consist of a theoretical session (a lecture in which the professor introduces new concepts or techniques and detailed examples illustrating them) followed by a practical session (in which the students work on the examples and exercises proposed in the lecture). On the average, two hours a week are dedicated to theory and one hour a week to practice, and the professor allocates them according to the subject matter. Students are required to take an active part in class and to submit the exercises at the end of each class.

Evaluation methodology

Students are evaluated during class, and in a final exam. Every student is required to submit one exercise each week, graded from 0 to 10, and the final grade consists of 50% for the exercises and 50% for the final exam, also graded from 0 to 10.




  • Algorithms on trees and graphs - Valiente, Gabriel, Springer , 2002. ISBN: 3-540-43550-6*cat
  • Analysis of phylogenetics and evolution with R - Paradis, Emmanuel, Springer , 2006. ISBN: 0-387-32914-5*cat
  • Bioinformatics with R - Gentleman, Robert, Chapman & Hall/CRC , 2008. ISBN: 1-420-06367-7
  • R programming for bioinformatics - Gentleman, Robert, Chapman & Hall/CRC , 2008. ISBN: 978-1-420-06367-7
  • Genetic data analysis II: Methods for discrete population genetic data - Weir, Bruce S., Sinauer Associates , 1996. ISBN: 978-0878939022

Web links

Previous capacities

Basic knowledge of algorithms and data structures.
Basic knowledge of statistics.
Basic knowledge of the Python programming language.
Basic knowledge of the R programming language.