Sunday, September 18, 2016

Single cell RNA-sequencing: are there going to be some established analysis standards?

Novel technology that allows to detect precisely the cell types and understand how they are organised in such multi-combined cell systems as brain is really awesome. However, due to high complexity of the experiment procedure it is really not easy to find correct overcomes for multiple problems in research project with scRNA-seq. An interesting aspect is the selection of appropriate data analysis methods and tools. There was a nice recent publication in Genome Biology about this. Glad to note that Qualimap2 was mentioned there, stating that even though it is working on multiple datasets,however was not designed for scRNA-seq and additional precise quality control is required for such data. And I totally agree with this statement.

Actually, a good overview for scRNA-seq quality control can be found in a recent tutorial publication in f1000. Additionally, there is a nice detailed course available from University of Cambridge Bioinformatics training unit.

In general scRNA data analysis procedure has many issues that might lead to errors and completely different final results can be produced by various tools on the same data. A known example: comments about the publication describing the tool for cell-cycle heterogenity normalization. Two recent publications provide rather useful status of scRNA-seq analysis procedure in my opinion. First one is "Disentangling neural cell diversity using single-cell transcriptomics" by JF Poulin et al. This manuscript gives an overview about current status in scRNA-seq data analysis and advices about experiment strategy selection. Second one is "Comprehensive Classification of Retinal Bipolar Neurons by Single-Cell Transcriptiomics" by K Shekhar et al. There is a full description of the analysis procedure of ~25000(!) bipolar cells (retina, visibility) including source code to reproduce it. In my opinion, such description should become a standard.

Also, there are already various resources about scRNA-seq tools. For example, here's a list in special github repo from Linnarsson lab. Of course, there are many other places with such information. So, if there are any additional confident resources really useful for this topic - will be glad to see the comments ;)

Wednesday, December 9, 2015

Multi-sample quality control of RNA-seq data: interpreting the results

I am happy to report that Qualimap2 manuscript is published. Therefore, I would like to provide a small tutorial on how to interpret the quality control of RNA-seq alignment data results generated by Qualimap. There are also other frequently used applications such as RSeQC and RNA-SeQC, but they were not updated rather long. Also, there are not too many manuals, explaining how to interpret the results of the performed QC analysis. Of course, I'll be happy to have any comments, questions or suggestions about this small tutorial.

Ok, let's assume we designed an experiment to analyze and compare gene expression between two conditions and there is a number of samples for each condition. RNA-sequencing was performed for each sample. The details, about how to perform the main analysis can be found in many places (for example here). We will focus only on quality control.

The initial quality control performed with FastQC is well-known. However, a number of problematic issues can be detected only after the alignment of reads is performed and counts are computed. Most of technological biases including transcript length, coverage uneven distribution or contamination by non-required transcripts can be detected only from a processed dataset, therefore data investigation performed by Qualimap is required to confirm sufficient quality of the processed data.

The quality control process should start from the examination of RNA-seq QC report. Here's an instruction to launch the mode from GUI and example command line. The report will look like this. How to interpret the results?

The Summary statistics provides status of the sequencing process in general. The low proportion of aligned reads reported in the summary can demonstrate ligation and amplification biases. Frequently, in RNA-seq analysis the typical majority of aligned reads should lie in known exonic regions. For example, comparing the proportion of aligned reads falling to exon region in case of well-known organisms such as Mus Musculus or Homo Sapiens should be up to 90%. Smaller proportion can indicate sequencing biases or some mistakes in mapping process. The same conclusions can be derived from Junction analysis plot: known junctions should dominate novel.

Some specific issues occurring during rRNA depletion and polyA selection can lead to biases in 5' region and 3' region. For example, it is quite well-known that polyA selection can lead to high expression in 3' region. The 5'-3' bias allows to detect this event. In correct experiments it should be close to one.

The plots focused on transcriptome coverage analysis such as Coverage Histogram and Coverage profile along genes give an overview of available level of expressed genes. Importantly, low level coverage in RNA-seq data influences significantly on differential expression analysis, therefore demonstration of high and low coverage level is quite useful to detect biases and apply correct gene expression normalization.

After the quality control of the BAM file is performed, the computed counts will allow to investigate further experiment properties. Most importantly, Counts QC analysis can be performed applying the experiment design, such as biological conditions and number of samples. Here are instruction about the application in general and command line. Here's an example output.

The analysis of expression contamination allows to verify if the total number of reads in the experiment is enough to detect all expressed genes. The plot Saturation demonstrates the influence of number of sequencing reads on expression distribution. Basically, the angle of the line plot shows how the increase in number of reads controls the proportion of novel detected genes. If more reads do not lead to the growth of a number of detected genes (line angle close to zero), additional sequencing is not required. Notably, there is also a specific customizable limitation of minimum number of read counts required for a transcript to be added to the plot.

The biotype analysis of counts performed for each sample allows to detect the types of expressed features. This is specially important if RNA-seq experiment focused on long RNA or other type of RNA. Abnormal contamination can be detected from plots Biotype detection and Counts per biotype. Typically in mRNA-seq experiments protein coding proportion should dominate in counts. Additionally, the results detect suitability level of a selected sequencing protocol.

For correct normalization of counts values in gene expression analysis it is important to take into account the length and GC content of expressed transcripts. The requirements for such normalization can be checked from Length bias and GC bias plots. Cubic spline regression model is applied to detect if length and GC proportion fits gene expression. Generally, the computed coefficient of determination more than 70% along with p-value lower than 0.05 indicate effect on expression level and importance of normalization.

The computed global plots demonstrating all samples together such as Scatterplot matrix, Counts density and Counts distribution allow to detect outliers. For example, despite different biological conditions the global expression levels among analyzed samples should match sufficiently. Importantly, different biological conditions should influence the gene expression. Therefore all plots related to expression analysis normalized to a specific condition also allow to detect outliers.

So, these are some suggestions how to interpret the detailed quality control. Unfortunately there might be some other issues occurring in experiments, here I only mentioned the most frequent problems. I'll be glad to have more suggestions to update the tutorial.

Thursday, July 16, 2015

BOSC 2015: amazing as always

I just returned from ISMB/ECCB 2015 conference that was in Dublin. However, I would like to describe my favorite pre-conference event. Of course, this is BOSC. I consider BOSC as one of the best conferences due to the importance of "open" activity in any science. As always, all the presentations were awesome. Hopefully the videos of the talks will be available soon, I look forward to it. Here I'll just mention some talks that were specially interesting for me.

Of course, both keynotes were amazing.

First one was from Holly Bik. As a biologist she turned to bioinformatics and described in detail this process and current results. Already from the name of the talk it's clear to understand the main point: "Bioinformatics: Still a scary world for biologists". As a bioinformatician in frequent contact with biologists I understand this status and totally agree with it. Biologists have a lot of work to do, and getting experience in bioinformatics is not so easy. However, it is great that there are already more and more biologists who are going deeper in this direction.

Second keynote was from Ewan Birney, a Joint Associate Director of EMBL-EBI and co-founder of Open Bioinformatics Foundation. He is a perfect example of bioinformatics super-specialist. Several topics were described in detail during this talk. The most interesting subject in my opinion - importance of open source code. Three basic reasons: transparency, efficiency and community. In the end of the talk there was a surprising block about storing of text information in DNA (more details here). In this case one room will be enough to keep encoded in DNA "all" written books. So, if you read "Fahrenheit 451" by Ray Bradbury, this looks like a perfect solution for the main problem.

Among other talks I was especially interested in RNA-seq analysis. "GO express" project presented by Kevin Rue-Albrecht is a nice example of a detailed clustering and group detection investigation from gene expression analysis.

One other project I would like to mention is a novel pipeline design toolkit called Nextflow (presented by Paolo Di Tommaso). Of course, there is a number of workflow managers. However the toolkit that can be applied only by creating small scripts to make runs parallel and with detailed reports is very useful for practical bioinformaticians.

The most surprising talk for me that was from Bastian Greshake about openSNP database. Basically this database allows to submit personal SNP data created by companies such as 23andme. In result it provides rather interesting statistics comparison from supplied datasets. This comparison might be useful for for future research projects and even just for users to find people rather similar to them. One cool point: disclaimer, totally recommend to read it!

I was also giving a small talk during the conference. It was about second version of Qualimap. The talk was rather quick (only 5 mintues, no questions afterwards) therefore very useful as practice. Additionally I got several interesting questions in the poster session. Moreover, I met people who are using Qualimap and additionally during the ISMB/ECCB conference there was a cool poster about Eager project from Alexander Peltzer that mentions Qualimap as a part of a pipeline.

Ok, that is enough about BOSC event, totally recommend to visit the next one during ISMB 2016!

Wednesday, May 13, 2015

Linux: how to print header line of tab-delimeted file in column with numbers

Let's say there is a header of certain atab-delimited file that has a lot of parameters. There are many examples: annotation files, results of certain algorithms, etc...

Sometimes it's interesting to show only select certain parameters. This can be performed easily using cut command. For example, this is how to report only chromosome name and start-end positions of gene exon from GTF file:

cut -f 1,4,5 Homo_sapiens.GRCh38.78.gtf | less

To perform such task using cut command it's important to report exact indexes of required columns. But, what if there are too many columns (more than 30 for example)? How to detect exact indexes?

Here's an example of a command to detect index numbers of a first line of a file:

head -n 1 fusions.detailed.txt | tr "\t" "\n" | cat -n

This example creates a list of all parameters of a detailed report from InFusion tool. Result of this example can be found here (numbers and words in bold font).

Monday, March 30, 2015

Computing coverage out of exon regions

After performing alignment of RNA-seq or exome sequencing data it is quite important to compute coverage. There are several ways how to perform this task. However, there is one additional interesting issue: check if there is any coverage outside of exon regions. For example, in case of RNA-seq this allows to show if there are some previously unknown transcripts expressed. In this post I will describe how to compute out-of-region coverage from BAM file.

To perform out-of-region coverage check two files should be available:
- alignment data in SAM/BAM format. Sample: kidney.bam.
- gene annotations in BED format. Sample: transcripts.human.64.bed.

I will describe two methods to perform this task. Of course, I will recommend second method, using my favorite tool ;)

Using BEDtools

First of all, a file listing the coordinates of regions outside of genes from annotation file is required. This can be done using command complmentBed:

$~/tools/BEDTools/bin/complementBed -i transcripts.human.64.bed -g hg19.fa.fai > transcripts.human.64.outside.bed

After, the creating out-of-region annotation file, coverage can be computed:

$~/tools/BEDTools/bin/coverageBed -abam kidney.bam -b transcripts.human.64.outside.bed -d | awk '{c+=$5;len+=1}END{print "mappedBases=" c ,"RegionsSize=" len, "meanCoverage=" c/len}'

More details about BEDtools can be found here.

Using Qualimap

Qualimap BAM QC mode supports performing analysis of a BAM file within regions from annotation file in BED/GFF/GTF formats. Additionally, there is an option to perform additional analysis in "out of regions" block. The options is called "Analyze outside regions" , more details in documentation. So, here's an example:

$ qualimap bamqc -bam kidney.bam -gff annotations/transcripts.human.64.bed -os --java-mem-size=4G

Qualimap can be downloaded from here.

Well, that's enough for now. Have fun ;)

P.S. Thanks a lot to Tristan Carland for reporting a bug in QualiMap when computing coverage out of regions. In version 2.1 the bug is fixed.

Tuesday, February 24, 2015

Neuroscience is coming

Well, I was out of this world for a couple months, but now I am back :) And my current report will be not about bioinformatics, but about neuroscience.

Not so long ago I started reading neuropsychology books. The first book I read was called "The Man Who Mistook His Wife" by Oliver Sacks. I enjoyed this book from the beginning till the end. It is really amazing. The book describes some negative events which might happen to the brain because of a certain disease. There are four parts in this book: "Losses", "Excesses", "Transports" and "The World of the Simple". Each part focuses on a certain type of a brain damaging event.

I am not going to explain this in detail, but will just give some examples:

- A man loses an ability to memorize anything. When he is 40 years old, he still thinks that he is 20. He can not memorize anyone or anything, but still happy about his life.

- A young woman loses an ability to use her body, even to move or to walk (so called proprioception). But she manages to restore herself using other parts of the brain.

- Old ladies hear in their brains music from their young years. When the music plays, they can not hear anything else.

These are just three examples but there are much more amazing stories. I really recommend to read this book. Moreover, Oliver Sacks wrote more than ten books. Some of them were even used to create a movie. For example, a film called "Awakenings" with Robert De Niro and Robin Williams is really incredible in my opinion.

Also Oliver Sacks in his books always refers to other great neuroscientists. I didn't even know previously that one of the famous neuropsychologists was from Soviet Union. His name is Alexander Luria. I also read some of his books, and I totally recommend one that is called "The Man with a Shattered World: The History of a Brain Wound". This is a story about a Russian who was wounded during WWII. A bullet stuck in his brain, but he managed to survive and improve himself. I really recommend this book too.

There are other neuroscience authors, perhaps I will write again on this topic after reading their books.

And the last notice: one of the developing sciences is called neuroinformatics. Maybe soon it will become even more crucial than bioinformatics :)

P.S.: Thanks a lot to neuropsychologists whom I met and to Arakdy Urkop for recommending me the book.


Tuesday, October 7, 2014

Visualizing RNA-seq alignments with fusions: is there anything better than IGV?

I was recently working on improving the module for generating alignments with annotated fusion-supporting reads in the context of InFusion pipeline for fusion gene and chimeric transcript discovery from RNA-seq data. Once you have such alignment in BAM format it is quite useful to visually inspect the reads supporting fusion breakpoint (junction between fused genes or transcripts). First of all it is a good sanity check to see if the prediction makes sense. Additionally, inspecting the coverage around the breakpoint might give some hints if the prediction is false positive.

So far I was using IGV for this purpose and it was working nicely for me. One cool thing about IGV is that it allows to color reads by any SAM record tag. This makes possible to color-code reads belonging to the same fusion based on a custom tag which is assigned by InFusion. IGV also allows splitting the screen to visualize both parts of the fusion (sadly without zooming). You get to see something like this:

When checking several fusion predictions I noticed that IGV was not showing some fusion-supporting alignments. Later I figured out that this was due to a configuration setting of a maximum number of reads displayed. But at first I thought - why not try using something else? There are tons of genome browsers out there.

I tested several popular of them, but none worked as good as IGV. See my comments below.

Important note: all software was tested on Ubuntu 12.04 x64.

In my tests I was trying to visualize a certain region of a BAM file that according to prediction has a fusion breakpoint. This how the region looks in IGV v2.2.2:

As you can see reads belonging to fusions are shown in different colors based on a custom SAM record tag. This allows to easily distinguish between the fusion-supporting and "normal" alignments. Another very nice feature is the coverage track. It makes possible to see the overall coverage level in the region and visually detect any suspicious peaks.

Now let's see how other browsers display the same region.

IGB v8.1.11. I loaded the BAM file and successfully zoomed to the region. At first I was not able to see any reads, however after clicking on "Optimize track height" the reads appeared. Unfortunately it was not possible to color reads by arbitrary tag and I could not find an option to search for a read by name. So it was not easy to find and inspect fusion-supporting alignments:

GenomeView 2450. As previously I loaded the BAM file and zoomed to the region of interest. Only certain number of reads was shown. Also as I understand this browser has a fixed color-scheme for reads. Therefore it was not possible to visualize fusion supporting reads:

Tablet v1.14.04. I loaded the BAM file and tried to jump to the desired base in the chromosome, however the software crashed complaining that the BAM file is not sorted or index is missing (which was not true). I also tried to search by name for a certain read supporting the fusion breakpoint in order to quickly zoom to the location. The read was found successfully, but once again when I tried to go the location of the read the software crashed.

Update 08.10.2014: The developers replied very quickly to my bug-report. The issue was due to one alignment record in the BAM file having wrongly formatted CIGAR. After fixing this record the region was shown successfully. Searching read by name is a very handy feature of Tablet. However the coloring by tag is not supported. One possible solution to mark fusion-supporting reads is to assign read groups to them, but this would require some additional work.

Savant v2.0.5. I loaded the BAM file, but unfortunately when I zoomed to the region of interest the data track was flickering and it was not possible to see anything. I reported the issue to the developers. Here how it looks on a screenshot:

Conclusion. Although the bugs in the software might be fixed in future versions, at the current moment IGV remains the best browser to visualize RNA-seq alignments with fusion genes. I suppose the same true (and some people agree) for visualization of structural variations in WGS-data.