Bioinformatics I

Course Aim

The aim of this course is to provide students with a practical foundation in bioinformatics for analyzing modern sequencing datasets. It is intended to help them become confident, critical, and independent users of common bioinformatics tools, rather than relying on black-box workflows or fragmented online tutorials. They will learn how bioinformatics analyses are designed, implemented, checked, documented, and interpreted in real research contexts. The course emphasizes reproducibility, data quality, critical evaluation of automated results, and the production of clear, publication-ready outputs.

Student Learning Outcomes

The successful student will be able to:

(1) Understand research articles that use algorithms and computational tools for analyzing sequencing data.

(2) Design and carry out bioinformatics analyses relevant to their research project.

(3) Assemble reproducible bioinformatics pipelines.

(4) Prepare publication-grade figures of bioinformatics analyses.

Course Description

Rapid decreases in sequencing costs have revolutionized modern biology: even small laboratories can now generate large-scale genome, transcriptome, metagenome, epigenomic, spatial, or single-cell datasets. As a result, bioinformatics has become an essential research skill across many fields represented at OIST. A hands-on Bioinformatics course will give students the foundation to analyze their own sequencing data critically, reproducibly, and independently, rather than relying on black-box pipelines, fragmented or outdated online tutorials, or unverified AI-generated code.

The course will consist of relatively short lectures focused on understanding core algorithms and concepts used in the field (mostly for sequencing, but with an introduction to imaging and mass spectrometry data), followed by extensive hands-on computer labs. Given the growing role of AI in bioinformatics, particular attention will be paid to critical thinking, data curation, reproducibility, handling real-world noisy datasets, generating publication-grade figures, and exercises where AI gives misleading answers or fails completely. For example, students will be provided with results generated by AI from raw data (including all commands and code) and asked to reanalyze the data to identify what went wrong and explain why. The course is designed to be very hands-on and is modeled on popular two-week workshops such as the MBL Workshop on Molecular Evolution [https://www.mbl.edu/education/advanced-research-training-courses/course…], Evomics workshops [https://evomics.org/], and Software Carpentry workshops [https://software-carpentry.org/workshops/workshops-upcoming/]. During the computer labs, students will reuse the same raw data across multiple exercises whenever possible (e.g., in most genomics and transcriptomics labs), so that they learn different workflows on a dataset they already know.

Course Contents

Lectures

1. Introduction and sequencing technologies

2. Sequence alignment/mapping algorithms (Smith-Waterman, BLAST, Needleman-Wunsch, Burrows-Wheeler, HMM, HHpred, Phyre2, Foldseek, AlphaFold, etc.) and common databases (SRA, NR/NT/RefSeq, GTDB, UniProt, Pfam, EukProt, KEGG, COG, eggNOG, etc.)

3. Genome and transcriptome assembly algorithms (de Bruijn, OLC, and hybrid assemblers)

4. The importance of being critical of input data, data curation, and common issues that make automated workflows produce reproducible but invalid results (orthology assignment, non-standard genetic codes, non-standard genome architectures, data quality and contamination in public repositories, biases toward specific groups of organisms, etc.)

5. Programming reproducible pipelines (Nextflow, etc.) and publication-grade figures in Python, R, Processing, etc.

6. Gene prediction, functional genome annotation, and comparative genomics

7. Ortholog predictions and phylogenomics with probability-based methods (maximum likelihood and Bayesian inference)

8. Transcriptome annotations and digital expression values (TPM, FPKM, RPKM)

9. Differential gene expression analyses and epigenetics (Hi-C, ChIP-Seq, ATAC-Seq, bisulfite sequencing, etc.)

10. Genome-resolved vs. single-cell vs. spatial omics

11. A brief introduction to population genomics data and basic algorithms

The computer labs (and two project assignments) will include exercises to get familiar with:

1. Unix terminal and using Deigo HPC (to be coordinated with SCDA)

2. Bash as a simple programming language

3. Text editors, bioinformatics files, formats, and data QC (FastQC, MultiQC, fastp, etc.)

4. Standalone BLAST/DIAMOND, Bowtie 2, minimap2, HMMER, and SAMtools

5. Software environments (Conda or Docker) and workflow managers (Nextflow/nf-core or Snakemake)

6. Version control with Git and GitHub

7. Understanding and editing bioinformatics code (Python and Biopython)

8. Understanding and editing bioinformatics code (R and ggplot2)

9. Genome assembly and QC (SPAdes, Flye/hifiasm, QUAST, BUSCO)

10. Genome annotation (Bakta, BRAKER3, eggNOG-mapper) and browsers (e.g., Tablet, Artemis, IGV, IGB)

11. Transcriptome assembly and downstream analyses (Trinity/Trinotate ecosystem, STAR/HISAT2, Salmon/kallisto/RSEM, and DESeq2/edgeR)

12. Metagenomics and binning (phyloFlash, BlobTools, CheckM2, and other tools)

13. Comparative genomics and gene vs. species trees (OrthoFinder, MAFFT, IQ-TREE, PhyloBayes)

Assessment

Participation and practical exercises: 40%
Students will complete hands-on exercises using test datasets during computer labs and as homework. They will keep a record of all their work and notes using their preferred approach (Jupyter notebooks, Quarto, RStudio, VS Code) and plain scripts.

Mid-term bioinformatics project: 20%
Students will complete a project assigned by the lecturer, starting from raw or minimally processed data and producing interpretable results and figures.

Final student-designed bioinformatics project: 40%
Students will design and carry out a bioinformatics analysis relevant to their research interests (using either their own data or data downloaded from public databases), including reproducible code, documented workflow steps, and publication-quality figures.

Prerequisites or Prior Knowledge

Prior experience with Unix-based operating systems and scripting languages such as Bash, Python, or R will be helpful, but is not required. The course will introduce the necessary computational skills during hands-on computer labs.

The course is positioned in the curriculum as a broader practical follow-up to B23 Molecular Evolution, and as a hands-on computer-lab-based complement to B54 Decoding Genomes: From Sequences to Phylodynamics. Students who have taken either of these courses will have useful background knowledge, but neither course is required as a formal prerequisite.

Textbooks

Recommended textbooks (not required purchases):

Vince Buffalo. Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools. O'Reilly 2015, 507 pages
[https://github.com/vsbuffalo/bds-files/]

Phillip Compeau and Pavel Pevzner. Bioinformatics Algorithms: An Active Learning Approach. Active Learning Publishers 2018, 728 pages
[https://www.bioinformaticsalgorithms.org/]

Steven Haddock and Casey Dunn. Practical Computing for Biologists. Oxford University Press 2010, 564 pages
[https://practicalcomputing.org/; with updated Python 3 scripts]

Altuna Akalin. Computational Genomics with R. CRC Press 2020, 462 pages
[https://compgenomr.github.io/book/]

Martin Jones. Python for Biologists book series 2013-2020
Python for Biologists: A complete programming course for beginners; Advanced Python for Biologists; Biological data exploration with Python, pandas and seaborn
[https://pythonforbiologists.com/]

Notes

New for AY2026; alternates with A319