Thursday, January 24, 2013

Chromosome names: with "chr" prefix or without

One irritating consequence of the lack of standardization in genome annotations is the inconsistent use of the "chr" prefix in chromosome names. For example, UCSC RefSeq annotations include this prefix, while Ensembl annotations don't.

It's a good rule to take this problem into account when developing bioinformatics software. Unfortunately, not everyone follows this guidance. As a result, one might need to remove the "chr" prefix from a FASTA file or an alignment file, or insert it there, to prepare correct input data.

Here is a nice post from Pierre Lindenbaum, which provides a solution to the problem using sed.

And here is my small example command, which inserts the "chr" prefix into every header line of a FASTA file:

sed -i -e 's/^>/>chr/' hg19.fa
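The opposite direction, removing the prefix, works the same way. The block below is only a minimal sketch under a few assumptions: the demo_*.fa file names and their contents are made up for illustration, and the output is written to new files instead of being edited in place with -i.

```shell
# Build a tiny stand-in FASTA file (hypothetical data, not a real genome).
printf '>chr1\nACGT\n>chrX\nTTGA\n' > demo_chr.fa

# Remove the "chr" prefix from header lines only; "^" anchors the match
# to the start of the line, so sequence lines are never touched.
sed -e 's/^>chr/>/' demo_chr.fa > demo_nochr.fa

# Insert the prefix again, skipping headers that already carry it,
# so running the command twice does not produce ">chrchr1".
sed -e '/^>chr/!s/^>/>chr/' demo_nochr.fa > demo_roundtrip.fa

# Show the prefix-free version.
cat demo_nochr.fa
```

Writing to a new file also sidesteps a portability detail: -i without an argument is GNU sed syntax, while BSD/macOS sed expects a backup suffix after -i (an empty one, -i '', for no backup).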


P.S. Thanks to my supervisor for the improvement: it's better to use the "^" anchor, so that ">" is matched only at the start of a line.

Friday, January 11, 2013

Output a subset of columns from a file

Nice tip to remember: the Unix command-line tool cut can select columns not only as a comma-separated list, but also as a range, so -f 1,2,3,4,5,6 and -f 1-6 produce the same output.

Example:

kokonech@ultor:~/playgrnd/scythe$ cut -f 1,2,3,4,5,6 scythe_output/fusions.txt
#ref1 break_pos1 strand1 ref2 break_pos2 strand2
chr20 49446917 + chr17 58761341 +
kokonech@ultor:~/playgrnd/scythe$ cut -f 1-6 scythe_output/fusions.txt
#ref1 break_pos1 strand1 ref2 break_pos2 strand2
chr20 49446917 + chr17 58761341 +
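
Lists and ranges can also be mixed, and a range may be left open on either side. The following is a small sketch, assuming the file is tab-separated (the default delimiter for cut); fusions_demo.txt is a hypothetical stand-in that recreates the two lines shown above.

```shell
# Recreate the two lines shown above as a tab-separated file
# (tabs are an assumption; cut splits on tabs by default).
printf '#ref1\tbreak_pos1\tstrand1\tref2\tbreak_pos2\tstrand2\n' > fusions_demo.txt
printf 'chr20\t49446917\t+\tchr17\t58761341\t+\n' >> fusions_demo.txt

# Single column plus a range: column 1 and columns 4 to 6.
cut -f 1,4-6 fusions_demo.txt

# Open-ended ranges: everything up to column 3, then everything from column 4.
cut -f -3 fusions_demo.txt
cut -f 4- fusions_demo.txt

# A different delimiter is set with -d, here for comma-separated input.
printf 'chr20,49446917,+\n' | cut -d ',' -f 1-2
```

Note that cut always prints the selected columns in their original file order, so -f 4,1 gives the same result as -f 1,4.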