Wang Zheng Yuan: October 2013

OSDDLinux.net

Next Generation Sequencing (NGS) software packages

In the era of Next Generation Sequencing (NGS) technology, it is easy to sequence the whole genome, exome and transcriptome of an organism. However, analysing the data produced by these technologies remains challenging, because high-throughput data comes in the form of short reads and contains several artifacts. We have developed several modules for the analysis of NGS data generated by sequencing whole genomes, transcriptomes and human exomes.

Automated pipeline for whole genome assembly and annotation
We have developed an automated pipeline for the assembly and annotation of microbial genomes. The user provides the paths of the input sequencing read files and the parameters in a configuration file. The pipeline works in three steps: (i) filtering of the genome sequencing data, (ii) genome assembly of the filtered reads, (iii) annotation of the assembled genome.
USAGE: assemb_anno.pl -i (Configuration file) -o (Output directory name)
Example Command: ./assemb_anno.pl -i Configuration_file -o my_out
-i Configuration_file
-o Output Directory

Benchmarking of genome assemblers (GenomeABC server)
Recently, several algorithms have been developed for assembling whole genomes from short reads. A number of them are freely available for public use as software packages, such as Velvet, SOAPdenovo, ABySS, EULER-SR, Edena and SSAKE. At present, it is difficult for a user to choose an appropriate assembler for their genome because existing genome assemblers have not been benchmarked. We have developed the GenomeABC software for the benchmarking of assembled genomes. It includes three modules: (i) benchmarking of genome assemblies, (ii) generation of an artificial genome and simulated reads, (iii) generation of a mutated genome and the corresponding simulated reads.
(i) Benchmarking of genome assemblies
This is the major module of GenomeABC, which allows users to evaluate their assemblers. To use this module, the user should provide a reference genome and the contigs generated by their assembler. The module compares the contigs against the reference genome in order to evaluate the performance of the assembler. BLAT is used to map the contigs onto the reference genome.

USAGE: benchmarking_new_assembled_genome.pl -c (fasta format contig file) -r (fasta format reference genome file) -o (output file name)
Example Command: ./benchmarking_new_assembled_genome.pl -c contigs.fasta -r ref.fasta -o out.txt
-c Contig file in FASTA format
-r Reference genome file in FASTA format
-o Output file name
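
As a rough illustration of the kind of summary a benchmarking run can start from, the sketch below computes basic contig statistics (number of contigs, total bases, longest contig and N50) from FASTA-formatted lines. This is an assumption about what is useful to report, not a description of what benchmarking_new_assembled_genome.pl actually outputs, and it does not perform the BLAT mapping against the reference. The function names are hypothetical.

```python
def read_fasta_lengths(lines):
    """Return the length of every sequence found in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def assembly_summary(lengths):
    """Collect a few common assembly statistics in one dictionary."""
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    example = [">contig1", "ACGTAC", ">contig2", "GGG"]
    print(assembly_summary(read_fasta_lengths(example)))
```

To use it on a real file, pass an open file handle (for example `open("contigs.fasta")`) to `read_fasta_lengths`.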

(ii) Generation of artificial genome and simulated reads
This module of the server allows users to generate an artificial genome. The user specifies the genome size and the percentage of each nucleotide (A, T, G and C) in the genome. The module also generates simulated short reads (single-end or paired-end) from the artificial genome for a given read length, insert length and coverage.

USAGE: make_genome.pl -s (Genome Size (Put 5000000 for 5-Mb)) -a (A % (i.e. 25%)) -t (T % (i.e. 25%)) -g (G % (i.e. 25%)) -c (C % (i.e. 25%)) -l (Read length) -i (Insert length) -v (Coverage) -y (Type of reads) -o (Out directory)
-s Size of the genome to be created.
-a Percentage of A in the genome.
-t Percentage of T in the genome.
-g Percentage of G in the genome.
-c Percentage of C in the genome.
-l Read length.
-i Insert length.
-v Coverage.
-y Type of reads (single-end (1) or paired-end (2)).
-o Output directory name.
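
The idea behind this module can be sketched in a few lines of Python. The code below is a minimal illustration, not the implementation of make_genome.pl: it draws each base independently according to the requested percentages and samples error-free reads uniformly along the genome. The function names, the `seed` parameter, and the assumption that the insert length is the full fragment length (with the second read of a pair reported as the reverse complement of the fragment end) are all choices made for this sketch.

```python
import math
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def make_genome(size, a_pct, t_pct, g_pct, c_pct, seed=None):
    """Random genome of `size` bases with the given expected base composition."""
    if abs(a_pct + t_pct + g_pct + c_pct - 100) > 1e-6:
        raise ValueError("base percentages must sum to 100")
    rng = random.Random(seed)
    return "".join(rng.choices("ATGC", weights=[a_pct, t_pct, g_pct, c_pct], k=size))


def simulate_reads(genome, read_length, coverage, read_type=1,
                   insert_length=None, seed=None):
    """Sample error-free reads; read_type 1 = single-end, 2 = paired-end.

    Single-end mode returns a list of strings, paired-end mode a list of
    (read1, read2) tuples.
    """
    if read_type == 1:
        span, bases_per_draw = read_length, read_length
    elif read_type == 2:
        if insert_length is None or insert_length < read_length:
            raise ValueError("paired-end reads need insert_length >= read_length")
        span, bases_per_draw = insert_length, 2 * read_length
    else:
        raise ValueError("read_type must be 1 or 2")
    if span > len(genome):
        raise ValueError("genome is shorter than the requested fragment")

    rng = random.Random(seed)
    # Number of reads (or pairs) needed to reach the requested coverage.
    n_draws = math.ceil(len(genome) * coverage / bases_per_draw)
    reads = []
    for _ in range(n_draws):
        start = rng.randint(0, len(genome) - span)
        fragment = genome[start:start + span]
        if read_type == 1:
            reads.append(fragment)
        else:
            reads.append((fragment[:read_length],
                          reverse_complement(fragment[-read_length:])))
    return reads


if __name__ == "__main__":
    genome = make_genome(5000, 25, 25, 25, 25, seed=1)
    pairs = simulate_reads(genome, 100, 20, read_type=2, insert_length=300, seed=2)
    print(len(genome), len(pairs))
```

Because bases are drawn independently, the realised composition only approximates the requested percentages for small genomes.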

(iii) Generation of mutated genome and simulated reads
This module of the server allows users to mutate a genome. The user should upload a reference genome and specify the percentage of nucleotides to be mutated. The module randomly mutates the desired number of positions (% of mutation) in the reference genome. It also generates simulated short reads (single-end or paired-end). This module is useful for evaluating assemblers that assemble genomes using a similar reference genome.

USAGE: make_mut_genome.pl -i (Input genome fasta file) -m (Percentage of mutation) -l (Read length) -f (Insert length) -c (Coverage) -y (Type of reads) -o (Output file)
-i Input genome file.
-m Percentage of mutation.
-l Read length.
-f Insert length.
-c Coverage.
-y Type of reads (single-end (1) or paired-end (2)).
-o Output directory name.
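
The random mutation step described for this module can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of make_mut_genome.pl: it introduces substitutions only (no insertions or deletions), picks distinct positions uniformly at random, and always replaces a base with a different one. The function name and the `seed` parameter are hypothetical.

```python
import random


def mutate_genome(genome, percent, seed=None):
    """Substitute `percent` % of positions in an uppercase DNA string.

    Returns the mutated sequence and the sorted list of mutated positions.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    rng = random.Random(seed)
    n_mutations = round(len(genome) * percent / 100)
    # sample() guarantees that no position is chosen twice.
    positions = rng.sample(range(len(genome)), n_mutations)
    bases = list(genome)
    for pos in positions:
        alternatives = [b for b in "ACGT" if b != bases[pos]]
        bases[pos] = rng.choice(alternatives)
    return "".join(bases), sorted(positions)


if __name__ == "__main__":
    reference = "ACGT" * 25
    mutated, positions = mutate_genome(reference, 5, seed=0)
    print(positions)
    print(mutated)
```

Returning the mutated positions makes it possible to check later whether an assembler or variant caller recovers them.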

Variation detection in normal-tumor paired data
We have developed a pipeline for the identification of SNPs and somatic variations in normal-tumor paired sequencing data. The user should provide sequencing data from a tumor sample and a normal tissue sample of the same individual, so that both data sets can be compared simultaneously to identify SNPs and somatic variations. The pipeline works in several steps using different freely available tools: (i) filtering of the sequencing data, (ii) alignment of the filtered reads to the human genome, (iii) variation detection in the normal-tumor samples, (iv) mapping of somatic variations at the gene level.
USAGE: variation_detect.pl -i (Configuration file) -o (Output directory name)
Example Command: ./variation_detect.pl -i Configuration_file -o my_out
-i Configuration_file
-o Output Directory

Software packages (.deb) for genome assembly and annotation

We have also developed Debian (.deb) packages for whole genome assembly and annotation from Next Generation Sequencing (NGS) data. After installing OSDDlinux, the user can download and install these .deb packages on the system.

Program	Purpose	Usage
ABySS	Genome assembler	Command line
Amos	Genome assembler	Command line
Artemis	Genome Viewer	Graphical user interface
Augustus	Gene prediction	Command line
Amphora	Phylogenomic Inference Pipeline for Bacterial and Archaeal Sequences	Command line
Annovar	Variation prediction	Command line
Blat	Alignment tool, faster than BLAST	Command line
Blast	Alignment tool	Command line
Brig	Genome Viewer	Graphical user interface
Celera	Genome assembler	Command line
Chimerascan	Chimeric transcripts detector	Command line
Cufflinks	Transcript assembly, differential expression, and differential regulation for RNA-Seq	Command line
Edena	Genome assembler	Command line
EVM	Gene prediction	Command line
FastQC	Filter NGS data i.e. Short reads	Graphical user interface
FastXQC	Filter NGS data i.e. Short reads	Command line
Genemark	Gene prediction	Command line
Genosets	Comparative Genomics visualization	Graphical user interface
Glimmer	Gene prediction	Command line
IGV	Genome Viewer	Graphical user interface
JSpecies	Genome comparison	Graphical user interface
ALLPATHS-LG	Genome assembler	Command line
Maker	Genome annotation pipeline, Eukaryotes	Command line
Maq	Short reads aligner	Command line
Mauve	Genome Viewer	Command line
Mummer	Genome comparison	Command line
Mira	Genome assembler	Command line
NGS-QC toolkit	Filter NGS data i.e. Short reads	Command line
OFS	An operon prediction program	Command line
Pasha	Parallelized Short Read Assembly	Command line
Ray	Genome assembler	Command line
RNAmmer	RNA prediction	Command line
SOAPdenovo	Genome assembler	Command line
SOAP-aligner	Short reads aligner	Command line
Spades	Genome assembler	Command line
Tablet	Genome alignment viewer	Graphical user interface
Tophat	A spliced read mapper for RNA-Seq	Command line
Vaast	Variation prediction	Command line
VCFtools	Variation prediction	Command line
Newbler	Genome assembler	Command line

Installation instructions

Installation of whole genome assembly and annotation pipeline
This pipeline has been developed for whole genome assembly and annotation of microbes (bacterial and fungal genomes). It uses a wide variety of software and runs in three main steps.
(1) Filter the raw sequencing data
The first step filters the raw sequencing reads, keeping high-quality bases and removing vector- and adaptor-contaminated reads. For this purpose, the NGS QC Toolkit is integrated in the pipeline. BioPerl is required for this software to work.
(2) Genome assembly of filtered data
The filtered reads are then assembled into a genome with user-defined parameters (e.g. the hash length, k). The genome assembly results are provided to the user, who selects the best one. Velvet and SOAPdenovo are used for genome assembly at this step.
(3) Whole genome annotation
The best genome assembly set is then used for genome annotation. Prokka and MAKER have been integrated for the annotation of bacterial and fungal genomes respectively. The genome assembly set and the annotated genome files are produced as the output of this pipeline.
Dependencies: Several BioPerl libraries need to be installed for Prokka and MAKER to function fully. The user should be aware of the dependencies of the integrated software.

Benchmarking of genome assemblers (GenomeABC)
A standalone version of the GenomeABC server has been developed for analysing assembled genomes and benchmarking assemblers. It is a set of simple Perl scripts that is easy to use. BLAT and BioPerl are required to run the GenomeABC software.

Variation detection in normal-cancer paired data
This pipeline uses several software packages.
(1) The first step filters the raw sequencing reads, keeping high-quality bases and removing vector- and adaptor-contaminated reads. For this purpose, the NGS QC Toolkit has been integrated in the pipeline.
(2) BWA aligns the filtered reads to the human reference genome.
(3) In the next step, SAMtools processes the alignment files.
(4) Finally, VarScan.v2.3.5 detects the somatic variations and SNPs in the given sequencing data.
The user should have all of this software installed to run the pipeline.
Debian packages of all these tools can be downloaded from the OSDDlinux website (http://osddlinux.osdd.net/ngs.php).
Dependencies: All of the software mentioned above must be present in the default path, i.e. /gpsr/local/bin, to run this pipeline.
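
To make steps (2) to (4) more concrete, the sketch below assembles the kind of shell commands such a pipeline might issue. It is an illustration only: the exact invocations, options and file names used by variation_detect.pl are not documented here, so everything below is an assumption. The filtering step (1) is omitted, single-end FASTQ input is assumed, and the `samtools sort -o` form follows SAMtools 1.x (older releases use a different syntax). The function name and the output naming scheme are hypothetical.

```python
import shlex


def build_somatic_commands(reference, normal_fastq, tumor_fastq, out_prefix,
                           varscan_jar="VarScan.v2.3.5.jar"):
    """Return shell command strings for alignment, pileup and somatic calling."""
    q = shlex.quote  # protect paths containing spaces or shell metacharacters
    commands = [f"bwa index {q(reference)}"]
    pileups = {}
    for label, fastq in (("normal", normal_fastq), ("tumor", tumor_fastq)):
        sam = f"{out_prefix}.{label}.sam"
        bam = f"{out_prefix}.{label}.sorted.bam"
        pileup = f"{out_prefix}.{label}.pileup"
        # Align the filtered reads to the reference genome.
        commands.append(f"bwa mem {q(reference)} {q(fastq)} > {q(sam)}")
        # Sort the alignments by coordinate (SAMtools 1.x syntax).
        commands.append(f"samtools sort -o {q(bam)} {q(sam)}")
        # Produce the pileup that VarScan reads.
        commands.append(f"samtools mpileup -f {q(reference)} {q(bam)} > {q(pileup)}")
        pileups[label] = pileup
    # Compare normal and tumor pileups to call somatic variants and SNPs.
    commands.append(
        f"java -jar {q(varscan_jar)} somatic "
        f"{q(pileups['normal'])} {q(pileups['tumor'])} {q(out_prefix + '.varscan')}"
    )
    return commands


if __name__ == "__main__":
    for cmd in build_somatic_commands("hg19.fa", "normal.fq", "tumor.fq", "run1"):
        print(cmd)
```

The function only builds the command strings; printing them first and running them by hand is a simple way to check each step before automating it.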

Installation of Debian (.deb) packages
The user can download the .deb packages from the OSDDlinux page (http://osddlinux.osdd.net/ngs.php). To install these packages, the user needs the OSDDlinux operating system with the /gpsr/software directory. To install a package, simply execute the command:
sudo dpkg -i package.deb
The software is installed automatically in the /gpsr/software/ directory, and the executable files can be called from the /gpsr/local/bin directory.
Example:- sudo dpkg -i maq.deb
Installation location : /gpsr/software/
Executables present in: /gpsr/local/bin
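
After installing packages, it can be handy to confirm that the expected executables are actually reachable. The sketch below is one possible check, assuming the executable names passed in (for example maq, bwa or samtools) match what the packages install; the function name is hypothetical and the default directory is the /gpsr/local/bin location mentioned above.

```python
import shutil


def find_missing_tools(tools, search_dir="/gpsr/local/bin"):
    """Return the tool names that have no executable in `search_dir`."""
    return [tool for tool in tools
            if shutil.which(tool, path=search_dir) is None]


if __name__ == "__main__":
    missing = find_missing_tools(["maq", "bwa", "samtools"])
    if missing:
        print("Not found in /gpsr/local/bin:", ", ".join(missing))
    else:
        print("All tools found.")
```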

Wednesday, October 23, 2013

NGS packages from OSDDLinux.net
