This protocol does NOT cover installing or running vPhyloMM. It describes the steps needed to prepare the FASTA files and the auxiliary files that vPhyloMM requires.
The first step is to gather a collection of related sequences that you wish to run in vPhyloMM. There are several ways of accomplishing this, from generating the sequences in the lab to downloading them from previous studies. In an effort to be comprehensive, this protocol starts with a list of GenBank accession numbers.
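Starting from a list of accession numbers, the download can be scripted. The following is a minimal sketch, not part of vPhyloMM: it assumes the NCBI E-utilities efetch service, a file with one accession per line, and curl being installed. The names efetch_url, fetch_all, accessions.txt and raw are hypothetical.

```shell
# Build the NCBI E-utilities efetch URL that returns one accession as FASTA.
efetch_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%s&rettype=fasta&retmode=text\n' "$1"
}

# Download every accession listed in a file (one per line) into a directory.
# $1 = accession list, $2 = output directory (both names are up to you).
fetch_all() {
    list=$1
    outdir=$2
    mkdir -p "$outdir"
    while IFS= read -r acc; do
        [ -z "$acc" ] && continue          # skip blank lines
        curl -fsS "$(efetch_url "$acc")" -o "$outdir/$acc.fasta"
        sleep 1                            # stay well under NCBI's request-rate limit
    done < "$list"
}
# Usage: fetch_all accessions.txt raw
```

The downloaded files still have to be sorted by patient and given time-coded FASTA IDs by hand or by a further script, since GenBank records do not carry that information in a uniform way.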
The sequences need to be downloaded from GenBank and organized into
single FASTA files by patient. In addition, each sequence within a FASTA
file must have a FASTA ID of the format >XXX.UNIQUE_ID, where XXX is a
three-digit time code padded with 0's and UNIQUE_ID is a unique string
identifying the sequence. The unit for the time code is not important as
long as ALL sequences use the same time scale. For example, the time
code 031 might represent 31 days or 31 months. FASTA IDs of sequences
not using the same time scale must be converted.
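Rewriting the FASTA IDs can also be automated. The sketch below assumes that all sequences in the input share one sampling time and that the first word of each original header (typically the accession) is unique enough to serve as UNIQUE_ID. The function name stamp_ids and the file names in the usage line are hypothetical.

```shell
# Rewrite every header of a FASTA stream as >XXX.UNIQUE_ID.
# $1 = sampling time as a plain number (for example 31).
# Assumption: the first whitespace-delimited word of the old header
# is used as UNIQUE_ID; the rest of the header is dropped.
stamp_ids() {
    awk -v t="$1" '
        /^>/ { printf(">%03d.%s\n", t, substr($1, 2)); next }
        { print }                          # sequence lines pass through unchanged
    '
}
# Usage: stamp_ids 31 < patientA_day31.fasta > patientA_day31.stamped.fasta
```

Run it once per sampling time and concatenate the results to build one file per patient.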
After obtaining the sequences, giving them appropriate FASTA IDs, and
organizing them into files by patient, it is necessary to globally align
them, usually against a reference sequence, such that each sequence is
exactly the same length. The GeneCutter tool at HIV Los Alamos
http://www.hiv.lanl.gov/content/sequence/GENE_CUTTER/cutter.html
can do this via the web interface for HIV sequence data. Note that
GeneCutter can only accept a single input file. For other data,
any tool that can align a set of sequences against a reference is fine.
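Whichever aligner is used, it is worth confirming that every sequence really has the same length afterwards. A minimal sketch, assuming a standard multi-line FASTA file; seq_lengths and same_length are hypothetical helper names, not vPhyloMM commands:

```shell
# Print "ID<TAB>length" for every record of a FASTA stream.
seq_lengths() {
    awk '
        /^>/ { if (seen) print id "\t" len; seen = 1; id = substr($1, 2); len = 0; next }
        seen { gsub(/[ \t\r]/, ""); len += length($0) }   # gap characters count as columns
        END  { if (seen) print id "\t" len }
    '
}

# Succeed only if all records in the file $1 have the same length.
same_length() {
    n=$(seq_lengths < "$1" | cut -f2 | sort -u | wc -l | tr -d ' ')
    [ "$n" -eq 1 ]
}
# Usage: same_length aligned.fasta || seq_lengths < aligned.fasta
```

If the check fails, the per-sequence listing from seq_lengths shows which records differ.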
After aligning sequences with GeneCutter, the sequences should be trimmed to the relevant region. Determine the appropriate start and end position for the sequences using an alignment viewer or reference sequence.
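Once the start and end positions are known, the trimming itself can be scripted. The sketch below assumes the positions are 1-based alignment columns and inclusive at both ends; trim_region is a hypothetical name, and the positions in the usage line are placeholders.

```shell
# Trim every record of an aligned FASTA stream to columns START..END
# (1-based, inclusive). Each trimmed sequence is written on a single line.
# $1 = START, $2 = END
trim_region() {
    awk -v s="$1" -v e="$2" '
        function emit() { if (seen) { print hdr; print substr(seq, s, e - s + 1) } }
        /^>/ { emit(); seen = 1; hdr = $0; seq = ""; next }
        seen { gsub(/[ \t\r]/, ""); seq = seq $0 }        # join wrapped sequence lines
        END  { emit() }
    '
}
# Usage: trim_region 100 399 < aligned.fasta > trimmed.fasta
```

Because the input is already globally aligned, cutting the same columns from every record keeps all sequences at equal length.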
Once the sequences have been trimmed to the appropriate length, they
must be renamed and placed in a single directory. The naming scheme for
patient files follows the same format as that for sampling time within
the files. Each file must be named XXX.fasta, where XXX represents a
unique three-digit patient code. The following bash commands automate
the renaming of FASTA files by simply naming them serially, starting
from "000.fasta":
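# Copy every FASTA file from the source directories into ALLFILES,
# naming the copies serially: 000.fasta, 001.fasta, ...
mkdir ALLFILES
j=0
for i in {<DIR_1>,<DIR_2>,<DIR_3>,...}/*.fas*
do
    cp "$i" "ALLFILES/$(printf '%03d' "$j").fasta"
    j=$((j + 1))
done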
After renaming the sequence files and placing them all in a single directory, you can create the markers file, which identifies phenotypically significant amino acid mutations. This file should have the positional offset as the first value, followed by all of the markers on the same line, separated by commas. Each marker may optionally begin with the wild-type amino acid, followed by the amino acid sequence position number, followed by a backslash-delimited list of amino acids (single-letter abbreviations). An example markers file might look like the following (note that only the uncommented line will be used):
# Efavirenz mutations of interest:
0,L100I,K101E,101Q,K103N,108I\A,G190S,P225H\R
# Indinavir mutations in the pol gene:
# 0,V82A\T\F\S\M,M46I\L,I54V\T\A
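A quick syntax check of the markers file can catch typos before a run. This is a minimal sketch based only on the format described above, not on vPhyloMM's own parser: it assumes the offset is an integer, amino acids are single uppercase letters, and the first uncommented, non-blank line is the one used. The name check_markers is hypothetical.

```shell
# Check the first active (uncommented, non-blank) line of a markers file.
# Prints each problem found and returns non-zero if there is any.
check_markers() {
    awk -F, '
        /^[ \t]*#/ || /^[ \t]*$/ { next }      # skip comments and blank lines
        {
            found = 1
            if ($1 !~ /^-?[0-9]+$/) { print "bad offset: " $1; bad = 1 }
            for (i = 2; i <= NF; i++)
                if ($i !~ /^[A-Z]?[0-9]+[A-Z](\\[A-Z])*$/) {
                    print "bad marker: " $i; bad = 1
                }
            exit                                # later lines are ignored
        }
        END {
            if (!found) { print "no active line"; exit 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
# Usage: check_markers markers.txt && echo "markers file looks well-formed"
```

The check is purely syntactic; it cannot tell whether a position or amino acid is biologically meaningful for your gene region.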
The sequences are now ready for vPhyloMM. First, create a variables file with the command:
perl Run_vPhyloMM.pl --new myVariables.txt
If you are running vPhyloMM from the GUI, the remaining changes can be made there. If you are running vPhyloMM from the command line, you must now edit the variables file as needed to run the reports that you want. Note that graphing and concurrent threading are disabled by default.