Sunday, May 21, 2017

Developing R package: library specific functions

Recently started developing an R package with focus on Bioconductor. Had a lot of experience in using R, but package development in this language is a novel area for me. Got into very strange problem: testing code of function out of package works well, but activating this function from prebuilt package leads to error. The problem was with usage of "intersect" base operation. By default it was loaded from sets library, while was required from GRanges in my case.

Spend some time to figure this out, but thanks to this post fixed it. Issue was related to additional marking of function origin in DESCRIPTION. Basically, additional mark was required in Roxygen function annotation:

#' @importMethodsFrom GenomicRanges intersect

Also, one thing that I had to control - frequent reload of library to run testing. The following command works well:

detach("package:InTAD", unload = TRUE)

Of course, easier testing can be organized using testthat, but initial coding requires simple reload.

These are only minor issues, but closer to Bioconductor submission attempt I will try to give more overview.

Tuesday, April 4, 2017

News about fusion genes discovery

Fusion discovery from RNA-seq goes on. Finally, a nice and useful publication trying to establish perfectly the variance between fusions and chimeras was published. Probably it will help to decrease the mess in the area (especially interpretation of termins like chimera, fusion, read-through...).

Also, InFusion tooklit is being applied. The manuscript describing it was already cited in the recent investigation of meningoma methylation patterns. Moreover, it was included and compared to other tools by Brian Haas in upcoming paper about STAR-fusion method. Glad to observe precise and honest comparison of methods, look forward to see this manuscript published.

Sunday, December 18, 2016

This finally happend: novel tool for fusion discovery from RNA-seq data

There was a long and a bit crazy story, however the toolkit that I was working on has now a publication in PLOS OnE. In short, tool has some unique features including support of strand-specificity to detect anti-sense chimeras and discovery of intergenic fusions. It's open-source and available in bitbucket repo.

It took us more than 3 years to finish the publication and this was amazing experience, which allowed me to learn what really can happen in science. Main lesson: there can be serious competition in research area that results in multiple unexpected issues. The journals with high impact factor are complex systems. Imagine submitting a manuscript, waiting for review 4 months(!) and receiving a statement that the "manuscript should be rejected while tool ability X is not important", with some strange citations. And then, exactly one month after rejection of your manuscript, there is a publication in the same journal with main statement of "the importance of ability X" in their tool. In our case the "ability X" was fusion isoform discovery confirmed with experimental validation.

This was probably main project for me during PhD and a perfect lesson. Of course, I am super grateful to my supervisor Dr. Fernando Garcia-Alcalde, to the whole team from MPIIB and to Lexogen company for amazing support. Moreover, additional useful comments I received from Prof. Steven Salzberg and he doesn't need an introduction for those who are working in sequencing data analysis area (btw, there is also his awesome blog).

Despite the complexity and multiple factors, science remains to me the most exciting and interesting area, where critical revision and communication have high impact. I will be glad to any comments, fixes or suggestions about InFusion. And of course, wish everyone not to experience any strange "non-scientific" problems, but have honest and correct reviews. Stay in science ;)

Sunday, September 18, 2016

Single cell RNA-sequencing: are there going to be some established analysis standards?

Novel technology that allows to detect precisely the cell types and understand how they are organised in such multi-combined cell systems as brain is really awesome. However, due to high complexity of the experiment procedure it is really not easy to find correct overcomes for multiple problems in research project with scRNA-seq. An interesting aspect is the selection of appropriate data analysis methods and tools. There was a nice recent publication in Genome Biology about this. Glad to note that Qualimap2 was mentioned there, stating that even though it is working on multiple datasets,however was not designed for scRNA-seq and additional precise quality control is required for such data. And I totally agree with this statement.

Actually, a good overview for scRNA-seq quality control can be found in a recent tutorial publication in f1000. Additionally, there is a nice detailed course available from University of Cambridge Bioinformatics training unit.

In general scRNA data analysis procedure has many issues that might lead to errors and completely different final results can be produced by various tools on the same data. A known example: comments about the publication describing the tool for cell-cycle heterogenity normalization. Two recent publications provide rather useful status of scRNA-seq analysis procedure in my opinion. First one is "Disentangling neural cell diversity using single-cell transcriptomics" by JF Poulin et al. This manuscript gives an overview about current status in scRNA-seq data analysis and advices about experiment strategy selection. Second one is "Comprehensive Classification of Retinal Bipolar Neurons by Single-Cell Transcriptiomics" by K Shekhar et al. There is a full description of the analysis procedure of ~25000(!) bipolar cells (retina, visibility) including source code to reproduce it. In my opinion, such description should become a standard.

Also, there are already various resources about scRNA-seq tools. For example, here's a list in special github repo from Linnarsson lab. Of course, there are many other places with such information. So, if there are any additional confident resources really useful for this topic - will be glad to see the comments ;)

Wednesday, December 9, 2015

Multi-sample quality control of RNA-seq data: interpreting the results


I am happy to report that Qualimap2 manuscript is published. Therefore, I would like to provide a small tutorial on how to interpret the quality control of RNA-seq alignment data results generated by Qualimap. There are also other frequently used applications such as RSeQC and RNA-SeQC, but they were not updated rather long. Also, there are not too many manuals, explaining how to interpret the results of the performed QC analysis. Of course, I'll be happy to have any comments, questions or suggestions about this small tutorial.

Ok, let's assume we designed an experiment to analyze and compare gene expression between two conditions and there is a number of samples for each condition. RNA-sequencing was performed for each sample. The details, about how to perform the main analysis can be found in many places (for example here). We will focus only on quality control.

The initial quality control performed with FastQC is well-known. However, a number of problematic issues can be detected only after the alignment of reads is performed and counts are computed. Most of technological biases including transcript length, coverage uneven distribution or contamination by non-required transcripts can be detected only from a processed dataset, therefore data investigation performed by Qualimap is required to confirm sufficient quality of the processed data.

The quality control process should start from the examination of RNA-seq QC report. Here's an instruction to launch the mode from GUI and example command line. The report will look like this. How to interpret the results?

The Summary statistics provides status of the sequencing process in general. The low proportion of aligned reads reported in the summary can demonstrate ligation and amplification biases. Frequently, in RNA-seq analysis the typical majority of aligned reads should lie in known exonic regions. For example, comparing the proportion of aligned reads falling to exon region in case of well-known organisms such as Mus Musculus or Homo Sapiens should be up to 90%. Smaller proportion can indicate sequencing biases or some mistakes in mapping process. The same conclusions can be derived from Junction analysis plot: known junctions should dominate novel.

Some specific issues occurring during rRNA depletion and polyA selection can lead to biases in 5' region and 3' region. For example, it is quite well-known that polyA selection can lead to high expression in 3' region. The 5'-3' bias allows to detect this event. In correct experiments it should be close to one.

The plots focused on transcriptome coverage analysis such as Coverage Histogram and Coverage profile along genes give an overview of available level of expressed genes. Importantly, low level coverage in RNA-seq data influences significantly on differential expression analysis, therefore demonstration of high and low coverage level is quite useful to detect biases and apply correct gene expression normalization.

After the quality control of the BAM file is performed, the computed counts will allow to investigate further experiment properties. Most importantly, Counts QC analysis can be performed applying the experiment design, such as biological conditions and number of samples. Here are instruction about the application in general and command line. Here's an example output.

The analysis of expression contamination allows to verify if the total number of reads in the experiment is enough to detect all expressed genes. The plot Saturation demonstrates the influence of number of sequencing reads on expression distribution. Basically, the angle of the line plot shows how the increase in number of reads controls the proportion of novel detected genes. If more reads do not lead to the growth of a number of detected genes (line angle close to zero), additional sequencing is not required. Notably, there is also a specific customizable limitation of minimum number of read counts required for a transcript to be added to the plot.

The biotype analysis of counts performed for each sample allows to detect the types of expressed features. This is specially important if RNA-seq experiment focused on long RNA or other type of RNA. Abnormal contamination can be detected from plots Biotype detection and Counts per biotype. Typically in mRNA-seq experiments protein coding proportion should dominate in counts. Additionally, the results detect suitability level of a selected sequencing protocol.

For correct normalization of counts values in gene expression analysis it is important to take into account the length and GC content of expressed transcripts. The requirements for such normalization can be checked from Length bias and GC bias plots. Cubic spline regression model is applied to detect if length and GC proportion fits gene expression. Generally, the computed coefficient of determination more than 70% along with p-value lower than 0.05 indicate effect on expression level and importance of normalization.

The computed global plots demonstrating all samples together such as Scatterplot matrix, Counts density and Counts distribution allow to detect outliers. For example, despite different biological conditions the global expression levels among analyzed samples should match sufficiently. Importantly, different biological conditions should influence the gene expression. Therefore all plots related to expression analysis normalized to a specific condition also allow to detect outliers.

So, these are some suggestions how to interpret the detailed quality control. Unfortunately there might be some other issues occurring in experiments, here I only mentioned the most frequent problems. I'll be glad to have more suggestions to update the tutorial.

Thursday, July 16, 2015

BOSC 2015: amazing as always


I just returned from ISMB/ECCB 2015 conference that was in Dublin. However, I would like to describe my favorite pre-conference event. Of course, this is BOSC. I consider BOSC as one of the best conferences due to the importance of "open" activity in any science. As always, all the presentations were awesome. Hopefully the videos of the talks will be available soon, I look forward to it. Here I'll just mention some talks that were specially interesting for me.

Of course, both keynotes were amazing.

First one was from Holly Bik. As a biologist she turned to bioinformatics and described in detail this process and current results. Already from the name of the talk it's clear to understand the main point: "Bioinformatics: Still a scary world for biologists". As a bioinformatician in frequent contact with biologists I understand this status and totally agree with it. Biologists have a lot of work to do, and getting experience in bioinformatics is not so easy. However, it is great that there are already more and more biologists who are going deeper in this direction.

Second keynote was from Ewan Birney, a Joint Associate Director of EMBL-EBI and co-founder of Open Bioinformatics Foundation. He is a perfect example of bioinformatics super-specialist. Several topics were described in detail during this talk. The most interesting subject in my opinion - importance of open source code. Three basic reasons: transparency, efficiency and community. In the end of the talk there was a surprising block about storing of text information in DNA (more details here). In this case one room will be enough to keep encoded in DNA "all" written books. So, if you read "Fahrenheit 451" by Ray Bradbury, this looks like a perfect solution for the main problem.

Among other talks I was especially interested in RNA-seq analysis. "GO express" project presented by Kevin Rue-Albrecht is a nice example of a detailed clustering and group detection investigation from gene expression analysis.

One other project I would like to mention is a novel pipeline design toolkit called Nextflow (presented by Paolo Di Tommaso). Of course, there is a number of workflow managers. However the toolkit that can be applied only by creating small scripts to make runs parallel and with detailed reports is very useful for practical bioinformaticians.

The most surprising talk for me that was from Bastian Greshake about openSNP database. Basically this database allows to submit personal SNP data created by companies such as 23andme. In result it provides rather interesting statistics comparison from supplied datasets. This comparison might be useful for for future research projects and even just for users to find people rather similar to them. One cool point: disclaimer, totally recommend to read it!

I was also giving a small talk during the conference. It was about second version of Qualimap. The talk was rather quick (only 5 mintues, no questions afterwards) therefore very useful as practice. Additionally I got several interesting questions in the poster session. Moreover, I met people who are using Qualimap and additionally during the ISMB/ECCB conference there was a cool poster about Eager project from Alexander Peltzer that mentions Qualimap as a part of a pipeline.

Ok, that is enough about BOSC event, totally recommend to visit the next one during ISMB 2016!


Wednesday, May 13, 2015

Linux: how to print header line of tab-delimeted file in column with numbers


Let's say there is a header of certain atab-delimited file that has a lot of parameters. There are many examples: annotation files, results of certain algorithms, etc...

Sometimes it's interesting to show only select certain parameters. This can be performed easily using cut command. For example, this is how to report only chromosome name and start-end positions of gene exon from GTF file:

cut -f 1,4,5 Homo_sapiens.GRCh38.78.gtf | less

To perform such task using cut command it's important to report exact indexes of required columns. But, what if there are too many columns (more than 30 for example)? How to detect exact indexes?

Here's an example of a command to detect index numbers of a first line of a file:

head -n 1 fusions.detailed.txt | tr "\t" "\n" | cat -n

This example creates a list of all parameters of a detailed report from InFusion tool. Result of this example can be found here (numbers and words in bold font).