Thursday, January 24, 2013

Chromosome names: with "chr" prefix or without

One irritating problem caused by the lack of standardization in genome annotations is the inconsistent use of the "chr" prefix in chromosome names. For example, UCSC RefSeq annotations include this prefix, while Ensembl annotations don't.

It's a good rule to take this problem into account when developing bioinformatics software. Unfortunately, not everyone follows this guidance. As a result, one might need to remove the "chr" prefix from, or insert it into, a FASTA file or an alignment file to prepare correct input data.

Here is a nice post from Pierre Lindenbaum that provides a solution to the problem using sed.

And here is my small example command, which inserts the "chr" prefix into every header line of a FASTA file, editing the file in place:

sed -i -e 's/^>/>chr/' hg19.fa
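
The opposite direction, and a variant that is safe to run twice, can be sketched along the same lines. This is only a minimal illustration: the file names (example.fa, example.nochr.fa, example.chr.fa) and the tiny test records are made up, and the commands write to new files rather than editing in place, so they can be tried without touching real data.

```shell
# Create a tiny, hypothetical FASTA file with mixed header styles.
printf '>chr1 test\nACGT\n>chrX\nTTGA\n>2\nGGCC\n' > example.fa

# Remove a leading "chr" from header lines only; sequence lines never
# start with ">", so they are left untouched.
sed -e 's/^>chr/>/' example.fa > example.nochr.fa

# Add "chr" only to headers that do not already have it, so running
# the command on an already-prefixed file changes nothing.
sed -e '/^>chr/!s/^>/>chr/' example.fa > example.chr.fa
```

Note that a plain prefix change may not be enough for every sequence: for instance, the mitochondrial genome is named chrM in UCSC hg19 and MT in Ensembl, so it is worth checking the header names afterwards with something like grep '^>' example.chr.fa.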


P.S. Thanks to my supervisor for the improvement: it's better to use the "^" anchor so that the pattern matches only at the start of the line.
