Preparing sequences for vPhyloMM

7/19/2009

This protocol is NOT for installing or running vPhyloMM. It describes the steps needed to prepare the FASTA files and the auxiliary files that vPhyloMM requires.

The first step is to gather a collection of related sequences that you wish to run in vPhyloMM. There are several ways of accomplishing this, from generating the sequences in the lab to downloading them from previous studies. In an effort to be comprehensive, this protocol starts with a list of GenBank accession numbers.
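One possible way to fetch the records for such a list from the command line is the NCBI E-utilities efetch service. The following is a minimal sketch, not part of vPhyloMM: it assumes curl is installed, and the file names accessions.txt and patient_A_raw.fasta are hypothetical. It downloads whatever is listed into one file, so run it once per patient with that patient's accessions.

```bash
# Build the NCBI E-utilities efetch URL for one nucleotide accession.
efetch_url() {
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%s&rettype=fasta&retmode=text\n' "$1"
}

# Download every accession listed in a file (one per line) into one FASTA file.
# Both file names are supplied by the caller; nothing here is specific to vPhyloMM.
fetch_accessions() {
  acc_file=$1
  out_fasta=$2
  : > "$out_fasta"                      # start with an empty output file
  while IFS= read -r acc; do
    [ -n "$acc" ] || continue           # skip blank lines
    curl -fsS "$(efetch_url "$acc")" >> "$out_fasta"
    sleep 1                             # pause between requests to be polite to NCBI
  done < "$acc_file"
}

# Usage (hypothetical file names):
#   fetch_accessions accessions.txt patient_A_raw.fasta
```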

The sequences need to be downloaded from GenBank and organized into FASTA files, one file per patient. In addition, each sequence within a FASTA file must have a FASTA ID of the format >XXX.UNIQUE_ID, where XXX is a three-digit time code padded with zeros and UNIQUE_ID is a unique string identifying the sequence. The unit of the time code is not important as long as ALL sequences use the same time scale. For example, a time code of 031 might represent 31 days or 31 months. FASTA IDs of sequences that use a different time scale must be converted.
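The header format can be applied with a small awk filter. This is a sketch under stated assumptions: it keeps the first word of the existing header (for GenBank downloads, typically the accession) as UNIQUE_ID, and it does not check that the resulting IDs are actually unique. The function name add_time_code is made up for this example.

```bash
# Rewrite FASTA headers as >XXX.UNIQUE_ID, where XXX is the zero-padded time code.
# Reads FASTA on stdin and writes to stdout. The first whitespace-delimited word
# of the old header is kept as UNIQUE_ID (an assumption of this sketch).
add_time_code() {
  awk -v t="$1" '
    /^>/ {
      id = substr($1, 2)              # drop ">" and any description text
      printf(">%03d.%s\n", t, id)     # pad the time code to three digits
      next
    }
    { print }                         # sequence lines pass through unchanged
  '
}

# Usage (hypothetical file names): all sequences in this file were sampled at time 31
#   add_time_code 31 < day31_raw.fasta > day31_named.fasta
```

Sequences sampled at different times can be processed separately and then concatenated into the single file for that patient.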

After obtaining the sequences, giving them appropriate FASTA IDs, and organizing them into files by patient, it is necessary to globally align them, usually against a reference sequence, so that every sequence is exactly the same length. For HIV sequence data, the GeneCutter tool at the Los Alamos HIV Sequence Database, http://www.hiv.lanl.gov/content/sequence/GENE_CUTTER/cutter.html, can do this through its web interface. Note that GeneCutter accepts only a single input file. For other data, any tool that can align a set of sequences against a reference is fine.
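Whether an aligned file really contains sequences of identical length can be checked before going further. The sketch below is one way to do this with awk; the function names are made up for this example, and gap characters are counted as part of the length.

```bash
# Print each distinct sequence length found in a FASTA file, one per line.
# A correctly aligned file yields exactly one line of output.
alignment_lengths() {
  awk '
    /^>/ { if (seen) print len; len = 0; seen = 1; next }
    { gsub(/[ \t\r]/, ""); len += length($0) }   # ignore stray whitespace
    END { if (seen) print len }
  ' "$1" | sort -n | uniq
}

# Succeed (exit status 0) only if every sequence has the same length.
is_aligned() {
  n=$(alignment_lengths "$1" | wc -l | tr -d ' ')
  [ "$n" -eq 1 ]
}

# Usage (hypothetical file name):
#   is_aligned patient_A_aligned.fasta || echo "sequences differ in length"
```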

After aligning sequences with GeneCutter, the sequences should be trimmed to the relevant region. Determine the appropriate start and end position for the sequences using an alignment viewer or reference sequence.
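Once the start and end positions are known, the trimming itself can be scripted. The following sketch assumes the positions are 1-based, inclusive column numbers in the alignment and that the input is already aligned; trim_alignment is a name invented for this example.

```bash
# Keep alignment columns start..end (1-based, inclusive) of every sequence.
# Reads FASTA on stdin; writes each header followed by its trimmed sequence
# on a single line to stdout.
trim_alignment() {
  awk -v s="$1" -v e="$2" '
    function emit() {
      if (hdr != "") { print hdr; print substr(seq, s, e - s + 1) }
    }
    /^>/ { emit(); hdr = $0; seq = ""; next }    # new record: flush the previous one
    { gsub(/[ \t\r]/, ""); seq = seq $0 }        # join wrapped sequence lines
    END { emit() }
  '
}

# Usage (hypothetical file names and positions):
#   trim_alignment 100 399 < patient_A_aligned.fasta > patient_A_trimmed.fasta
```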

Once the sequences have been trimmed to the appropriate length, the files must be renamed and placed in a single directory. The naming scheme for patient files follows the same format as the time code within the files: each file must be named XXX.fasta, where XXX is a unique three-digit patient code. The following bash commands automate the renaming by numbering the FASTA files serially, starting from "000.fasta":

mkdir ALLFILES
j=0
# Copy every FASTA file from the listed directories, numbering them 000, 001, ...
for i in {<DIR_1>,<DIR_2>,<DIR_3>,...}/*.fas*
do
  cp "$i" "ALLFILES/$(printf "%03i" $j).fasta"
  let j=j+1
done

After renaming the sequence files and placing them all in a single directory, create the markers file, which identifies phenotypically significant amino acid mutations. This file should have the positional offset as the first value, followed by all of the markers on the same line, separated by commas. Each marker may optionally begin with the wild-type amino acid, followed by the amino acid sequence position number, followed by a backslash-delimited list of amino acids (single-letter abbreviations). An example markers file might look like the following (note that only the uncommented line will be used):

# Efavirenz mutations of interest:
0,L100I,K101E,101Q,K103N,108I\A,G190S,P225H\R
# Indinavir mutations in the pol gene:
# 0,V82A\T\F\S\M,M46I\L,I54V\T\A
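A markers file can be sanity-checked against the format described above before it is handed to vPhyloMM. The pattern in this sketch is an assumption derived only from that description (an integer offset, then comma-separated markers of the form optional wild-type letter, position, and one or more backslash-delimited uppercase letters); it is not taken from the vPhyloMM source, and markers_ok is a name invented for this example.

```bash
# Succeed if every active (non-comment, non-blank) line of a markers file
# matches the assumed format; otherwise report the offending lines on stderr.
markers_ok() {
  bad=$(LC_ALL=C grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" |
        LC_ALL=C grep -v -E '^-?[0-9]+(,[A-Z]?[0-9]+[A-Z](\\[A-Z])*)+$') || true
  if [ -n "$bad" ]; then
    printf 'Malformed markers line: %s\n' "$bad" >&2
    return 1
  fi
}

# Usage (hypothetical file name):
#   markers_ok markers.txt && echo "markers file looks well-formed"
```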

The sequences are now ready for vPhyloMM. First, create a variables file with the command:

perl Run_vPhyloMM.pl --new myVariables.txt

If you are running vPhyloMM from the GUI, the remaining changes can be made there. If you are running vPhyloMM from the command line, you must now edit the variables file as needed to run the reports that you want. Note that graphing and concurrent threading are disabled by default.

Last updated: 7/31/2009