Introduction to Clinical Bioinformatics

Clinical Bioinformatics - An introduction

Where to start?

If you are a clinical trainee in medical genetics (a resident or a genetic counseling student), a practicing medical geneticist, or a clinician from another specialty that works with genomic data (e.g. clinical molecular genetics/cytogenetics, pathology), you might be thinking: I wish I knew more about bioinformatics, but I don’t know where to start. Hopefully, this blog post will help you clarify your expectations and give you some useful starting points.

The short answer to your question is: it depends on what you want to learn, how much time you have on your hands and how realistic your expectations are.

My name is My Linh Thibodeau. I am a resident in the Medical Genetics and Genomics Residency Training Program at the University of British Columbia (Vancouver, Canada), and I am currently pursuing an M.Sc. in Bioinformatics through the Clinician Investigator Program (UBC). I have always had an interest in coding and computer science. Here I share my point of view on my own learning process and my personal opinion on the best approaches to use depending on the desired outcome.

Will clinical bioinformatics be useful for me?

Bioinformatics training would be useful for any clinical trainee or clinician working with genomic data. The most important question is: How much bioinformatics do you need to know? The answer depends heavily on your training, career goals and clinical practice. I recommend an overall understanding of bioinformatics to all medical genetics trainees and clinicians, but since computer science is a vast scientific field, the challenge is determining what you need to know and in how much detail.

The best way to learn applied bioinformatics skills is with a “data-driven approach”. Ideally, you should have some dataset(s) available that you wish you knew how to analyze. Try to think of instances during your training or your career where you thought: If only I knew some bioinformatics, I could do this or that more efficiently. Start a list of tasks you believe bioinformatics would be helpful with.

If you are a clinician with a leadership role in genomics research, then acquiring applied coding skills might not be the most appropriate use of your time. Instead, you might want to learn more about the basic theoretical concepts underlying big data research, such as:

Sequencing technologies
Algorithms, methods and tools for
- Alignment/Assembly
- Variant calling
- Gene and variant annotations
Bioinformatics pipeline design
Data file format, data storage, data security

What will this post cover?

This blog post is a work in progress. I will be providing an overview of some bioinformatics genomics skills and resources based on my personal training experience.

This post will also be followed by “workshop posts” containing simple hands-on exercises and examples, which will also be made available on my personal GitHub.

First of all, register your accounts

The very first step on your journey is to be able to retrieve genomic data from the Web. My personal advice would be to register accounts for these resources:

UCSC
Ensembl
Genomic Oligoarray and SNP array evaluation tool v3.0 - University of Miami CMA tool. This has been a popular tool in clinical practice, but in my opinion other useful tools perform similar tasks (e.g. NCBI Bulk Conversion or Ensembl BioMart).
COSMIC. If you are working with families with hereditary cancer, I suggest you also register a COSMIC account.

If you are planning to acquire intermediate-to-advanced coding skills, you can also register for this:

OMIM API access request.

Overview

100% useful clinical bioinformatics skills

While coding is a powerful tool, it can have a steep learning curve at the beginning. Many tools require no coding at all, so rather than starting from scratch, you can benefit from the work of those who came before you and focus on the most pertinent part of your data analysis.

If you are a clinical trainee with a limited amount of time available, the best avenue for you is likely to learn how to make the most out of GUI (Graphical User Interface) resources.

Don’t be quick to dismiss GUI websites and resources: they are often underused, and most clinicians (and even researchers) do not use them to their full potential. For example, although most people know how to enter genomic coordinates into the UCSC (University of California Santa Cruz) Genome Browser or BLAT a sequence (that is, perform a targeted alignment), most do not know how to add custom tracks of genomic coordinates (e.g. in BED file format) to visualize contigs (assembled de novo or by alignment from sequencing data) or splicing junctions (e.g. from transcriptome data). I provide some instructions on how to do this below.

Clinical genetics training should give you the opportunity to learn the basic uses of these resources (see below). Note, however, that for most of these individual GUI resources there are 1-4 day workshops available to learn the intricacies of their workings, so even if you already know the basics, I believe taking it further could be useful for clinical and research applications.

Genome browsers

UCSC Genome Browser

As mentioned previously, the UCSC Genome Browser can be used for diverse tasks.

Basic uses: Query a genomic position and retrieve database annotations.
Additional uses: (see Exercises)
- Upload custom genomic coordinates text or BED file data to UCSC to visualize genomic contigs.
- Visualize transcriptome splicing events

I have personally used UCSC to visualize and characterize somatic splicing abnormalities. If you are working with assembly data, this UCSC feature could be extremely useful for you. It allows you to visualize genomic events alongside the genomic annotations (gene transcripts, database entries such as ClinVar, dbSNP, DECIPHER, etc.).
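
As a rough illustration of the custom track upload mentioned above, here is a minimal Python sketch that builds the text of a BED custom track, which could then be pasted into or uploaded on the UCSC custom tracks page. The function names, the track name and the coordinates are made up for illustration. The detail worth noticing is that BED start positions are 0-based and end positions are exclusive, so a 1-based inclusive interval needs its start shifted by one.

```python
def to_bed_interval(chrom, start_1based, end_1based, name):
    """Convert a 1-based inclusive interval to BED's 0-based, half-open form."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError(f"invalid interval: {chrom}:{start_1based}-{end_1based}")
    return (chrom, start_1based - 1, end_1based, name)


def format_bed_track(intervals, track_name, description):
    """Build a BED custom track: one track line, then one row per interval."""
    lines = [f'track name="{track_name}" description="{description}"']
    for chrom, start, end, name in intervals:
        # BED columns are tab-separated: chrom, chromStart, chromEnd, name
        lines.append(f"{chrom}\t{start}\t{end}\t{name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical contigs; replace with coordinates from your own data.
    contigs = [
        to_bed_interval("chr17", 1001, 2000, "contig_1"),
        to_bed_interval("chr17", 5001, 5500, "contig_2"),
    ]
    print(format_bed_track(contigs, "my_contigs", "Example contigs"), end="")
```

Saving the printed text to a file such as my_contigs.bed (a placeholder name) gives something that can be loaded as a custom track, provided the chromosome names match the assembly selected in the browser.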

Ensembl

Basic use: search for gene/variant/transcript annotations
Additional uses: (see Exercises)
- Use BioMart for data query and extensive annotations
- Convert genomic coordinates between assemblies

Remap

Basic use: Remap converts genomic coordinates from one assembly to another (e.g. from GRCh37 to GRCh38). Note that this task can also be accomplished with the Ensembl Assembly Converter or UCSC LiftOver tool.

See exercise for an example.
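
For readers who eventually want to script the assembly conversion described above, the sketch below uses the Ensembl REST API "map" endpoint rather than the web-based converters named in this section. The helper names are my own, the species is fixed to human, and the forward strand is assumed. Treat it as a starting point under those assumptions, not as a validated clinical tool.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def build_map_url(chrom, start, end, source="GRCh37", target="GRCh38"):
    """Build the URL for the Ensembl REST 'map' endpoint (human, forward strand)."""
    region = f"{chrom}:{start}..{end}:1"
    return (
        f"{ENSEMBL_REST}/map/human/{source}/{region}/{target}"
        "?content-type=application/json"
    )


def extract_mapped_regions(response):
    """Pull (chromosome, start, end) tuples out of a decoded 'map' response."""
    return [
        (m["mapped"]["seq_region_name"], m["mapped"]["start"], m["mapped"]["end"])
        for m in response.get("mappings", [])
    ]


def convert_region(chrom, start, end, source="GRCh37", target="GRCh38"):
    """Query Ensembl and return the mapped regions (requires internet access)."""
    url = build_map_url(chrom, start, end, source, target)
    with urllib.request.urlopen(url) as reply:
        return extract_mapped_regions(json.load(reply))
```

A call such as convert_region("X", 1000000, 1000100) returns a list, because one region can map to zero, one or several regions in the target assembly. Note that Ensembl chromosome names have no "chr" prefix.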

Population Databases

ExAC and gnomAD

See example for IGV.js visualization. The sequencing reads for each individual can be visualized at the bottom, as ExAC uses the IGV.js visualization tool.

dbSNP

If you repeatedly search for more than 2-3 SNPs, it is more time-efficient to submit a batch query. Unfortunately, the batch query tool is being deprecated, but I think it is still a useful exercise to try.

See exercise for an example.

Phenotypic resources

Phenomizer

This is a tool from the Human Phenotype Ontology (HPO).

Likely useful coding skills

Acquiring coding skills takes approximately 4 months of full-time, dedicated and intensive work. Why 4 months? Because that is approximately how long it took me. Although I was doing some research work at the same time, I had also completed one year of mechanical engineering before medical school and had already learned basic coding skills. Therefore, all things considered, I am fairly confident that 4 months is the minimum amount of time required to learn sufficient coding skills to apply them to real clinical or research problems.

Unfortunately, there is no curriculum for clinical bioinformatics yet. Moreover, to reach a coding skill level that allows you to work with clinical data, you will most likely need to learn some genomics research coding as a baseline; you can then build on this skill set with more complex clinical questions.

Learn coding skills mainly useful for research

Introductory curriculum:

Individual Genomic Tools: Workshops such as the ones offered at conferences can be a good introduction. I highly recommend seeking out and attending such workshops when available, especially at clinical conferences, since these sessions provide a snapshot overview of a genomic tool. For example, the American Society of Human Genetics Annual Meeting 2017 had several interactive invited workshops, including “Accessing the Breadth of Data in Ensembl: A Worked Clinical Example” and “Overview and Interpretation of GTEx Resources: eQTLs and Gene Expression”.

Basic coding/bioinformatics: The Canadian Bioinformatics Workshops can provide a “crash course-like” curriculum. Although some of the more advanced workshops require a letter of support from a person in the field of bioinformatics (ideally from a known leader in the field), these workshops are meant to be introductory and might be suitable for a keen clinical trainee.

More comprehensive “coding oriented” curriculum:

R coding: STAT545/STAT547
Python: self-learning > CPSC 301 Computing for Life Sciences (UBC)
- Other possibility: Workshops (UBC and SFU)
Bash (Linux/Unix) terminal: No real “medical/biological” formal curriculum available on online platforms or at nearby teaching institutions (pending review).

Learn clinically useful coding skills

Here are some examples of useful skills to acquire and the language(s) that would allow you to carry them out.

R coding or Python:
- Comparing lists of genes
- Using an API (Application Programming Interface), for example the OMIM API, to retrieve genetic and phenotypic data.
Bash:
- Learn how to use the terminal to query text files (.txt, .tsv, .csv) or a VCF file
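
Two of the tasks listed above can be sketched in a few lines. The example below compares two gene lists with Python sets and pulls the variants within a region out of VCF-formatted lines. I use Python for both for consistency, although the file query is listed under Bash above. The function names and gene lists are illustrative only, and the VCF reader assumes the standard tab-separated column order (CHROM, POS, ID, REF, ALT, ...).

```python
def normalize_symbols(symbols):
    """Strip whitespace and upper-case symbols so that ' brca2' matches 'BRCA2'."""
    return {s.strip().upper() for s in symbols if s.strip()}


def compare_gene_lists(list_a, list_b):
    """Return the genes shared by both lists and those unique to each."""
    a, b = normalize_symbols(list_a), normalize_symbols(list_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


def variants_in_region(vcf_lines, chrom, start, end):
    """Yield (CHROM, POS, REF, ALT) for VCF records with start <= POS <= end."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the column header and blank lines
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if fields[0] == chrom and start <= pos <= end:
            yield fields[0], pos, fields[3], fields[4]


if __name__ == "__main__":
    # Illustrative lists only, not real clinical panels.
    result = compare_gene_lists(["BRCA1", "brca2 ", "TP53"], ["BRCA2", "PTEN"])
    print(result["shared"], result["only_a"], result["only_b"])
```

Because variants_in_region accepts any iterable of lines, an open file handle can be passed to it directly. The chromosome name must match the naming used in the file (for example "17" versus "chr17"). Upper-casing is only a convenience for matching; it will alter the few official symbols that contain lowercase letters.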

Perhaps less useful coding skills, except if you have a lot of time available and are an independent learner

This section is meant for the motivated individual who wants to acquire bioinformatics skills independently in a self-learning process.

Step 1. Download some datasets (intermediate-to-advanced)

This step only applies to you if you are planning to acquire intermediate-to-advanced coding skills. If that is the case, I would advise you to download these useful datasets to your computer:

DECIPHER The Development Disorder Genotype - Phenotype Database (DDG2P database).
Catalogue of Somatic Mutations in Cancer (COSMIC) Cancer Gene Census. You need to have registered an account to download the file to your computer (otherwise, you are stuck with browsing the data on the web). I find that the COSMIC Cancer Gene Census encompasses almost all hereditary cancer predisposition genes and can therefore be useful as a reference list.
Orphadata from Orphanet. This resource could potentially be very useful to a clinician as it contains a lot of phenotypic data, but its use is limited by the difficulty of querying the XML-format data (e.g. Phenotypes associated with rare disorders) and by the inconsistencies of the tree structure. Using this resource requires a lot of patience, time and hard work, and will most likely require intermediate Python coding skills, so I would classify it as “Advanced”.
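
To give a sense of what querying XML data involves, here is a minimal sketch using Python's built-in xml.etree.ElementTree module. The sample document and its tag names are hypothetical and heavily simplified. The real Orphadata files are nested more deeply and use their own element names, so the tag names would need to be adapted after inspecting the actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified structure for illustration (NOT the real Orphadata schema).
SAMPLE_XML = """
<DisorderList>
  <Disorder>
    <Name>Example disorder A</Name>
    <HPOTerm>Seizure</HPOTerm>
    <HPOTerm>Microcephaly</HPOTerm>
  </Disorder>
  <Disorder>
    <Name>Example disorder B</Name>
    <HPOTerm>Short stature</HPOTerm>
  </Disorder>
</DisorderList>
"""


def phenotypes_by_disorder(xml_text):
    """Map each disorder name to the list of phenotype terms found beneath it."""
    root = ET.fromstring(xml_text)
    table = {}
    for disorder in root.iter("Disorder"):
        name = disorder.findtext("Name", default="(unnamed)")
        # iter() searches at any depth, which helps when nesting is inconsistent
        table[name] = [term.text for term in disorder.iter("HPOTerm")]
    return table


if __name__ == "__main__":
    for name, terms in phenotypes_by_disorder(SAMPLE_XML).items():
        print(name, "->", ", ".join(terms))
```

Searching with iter() rather than relying on a fixed path is a deliberate choice here, since it tolerates some variation in how deeply the elements are nested.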

Step 2. Learn the basics

Although less useful in the short run, starting with basic computer science skills would teach you the underlying workings of very simple algorithms built from if/else/or statements.
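
As a small example of the kind of if/else/or logic meant here, the sketch below labels a variant by comparing the lengths of its reference and alternate alleles. The function name and the labels are my own simplification, and it assumes alleles written in the VCF style, where both are non-empty strings.

```python
def classify_variant(ref, alt):
    """Label a variant by comparing the lengths of its REF and ALT alleles."""
    if not ref or not alt:
        raise ValueError("both alleles must be non-empty strings")
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    elif len(ref) < len(alt):
        return "insertion"
    elif len(ref) > len(alt):
        return "deletion"
    else:
        # same length, longer than one base
        return "multi-nucleotide substitution"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("A", "AT"), ("AT", "A"), ("AT", "GC")]:
        print(ref, alt, classify_variant(ref, alt))
```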

Basics of R (3-4 weeks): Online Technical Foundations of Informatics. Specifically, Introduction to Git and Github, Introduction to R.
- Pros: If a limited amount of time is available, learning how to employ “ready to use” functions in R is more appropriate. The time to reach usefulness is shorter. Application to biological (and clinical) questions/problems is easier.
- Cons: This coding language was not meant for building algorithms; it was meant as a statistical tool for research. Therefore, it is less flexible and contains several inconsistencies. Error messages are difficult to troubleshoot.
Basics in Python (3-4 weeks): Online Codecademy Learn Python. Python is a more coherent coding language.
Basics of terminal/bash commands: Unfortunately, terminal use (bash, Linux/Unix) is not as intuitive as the R or Python languages, and therefore I believe that “self-learning” might be a difficult approach. This is why I would recommend a more formal training avenue.
- The University of British Columbia - WestGrid/Compute Canada Introduction to Genomic Analysis is a workshop series initially meant for Canadian researchers using Compute Canada resources, but the curriculum seems to cover the most important basic terminal/bash skills required for bioinformatics and therefore appears to be a good option for genetics/genomics clinical trainees.
