Example 2: How to Generate a Consensus FASTA File from a Multi-FASTA of a Sample
In this example, you will take a sample and convert it into a consensus FASTA file. You can find some sample input files in ANDES/example_data. Remember that the multi-FASTA contains all the reads for a given sample, so you are generating a consensus sequence for that sample.
This example also shows you how to remove noise from a sample, so that you don’t have too many ambiguity codes in your consensus.
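To make the idea of ambiguity codes concrete, here is a minimal sketch of how a consensus base could be derived from the bases observed at one alignment position. It assumes the standard IUPAC nucleotide codes and a hypothetical in-memory representation of a column (a dict of base counts); it is not the ANDES profile format, and the actual rules used by Profile_To_ConsensusFASTA.pl may differ.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_base(counts):
    """Return one IUPAC code covering every base seen at a position.

    `counts` is a hypothetical representation: a dict mapping a base
    (A, C, G or T) to the number of reads showing it at this position.
    """
    observed = frozenset(base for base, n in counts.items() if n > 0)
    # An empty column has no information; report it as N (an assumption).
    return IUPAC.get(observed, "N")


def consensus(columns):
    """Build a consensus string from a list of per-position count dicts."""
    return "".join(consensus_base(column) for column in columns)
```

With this sketch, a position where 98 reads show A and 2 reads show G becomes R, even though the G may only be sequencing noise. That is why an unfiltered consensus can end up with many ambiguity codes.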
Convert a Multi-FASTA file into a profile:
In this example, we will convert a multi-FASTA into a profile.
1. Go (cd) into the ANDES/example_data directory.
2. Run Clustalw2 on the sequences to generate an .aln file:
clustalw2 -infile=20081201.fasta -quicktree
This will generate the 20081201.dnd and 20081201.aln files.
3. Convert the .aln file to a profile:
../ClustalALN_to_PositionProfile.pl -a 20081201.aln
This will generate the 20081201.prof file.
4. Convert the profile to a consensus FASTA file:
../Profile_To_ConsensusFASTA.pl -c 20081201.cons.fasta -p 20081201.prof
This will generate the 20081201.cons.fasta file.
Suppose this FASTA file contains so many ambiguity codes that it is not useful. In that case, you can filter out low-frequency alleles.
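The idea behind filtering low-frequency alleles can be sketched as follows. This is an illustration only: the dict of base counts is a hypothetical stand-in for one profile position, the percentage is assumed to be computed per position, and the strict "greater than" comparison is an assumption about how the threshold is applied.

```python
def filter_by_percent(counts, min_percent):
    """Keep only the alleles present at more than `min_percent` percent.

    `counts` is a hypothetical representation of one profile position:
    a dict mapping each base to the number of reads that show it.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        base: n
        for base, n in counts.items()
        # Assumption: an allele at exactly the threshold is removed.
        if 100.0 * n / total > min_percent
    }
```

For example, filter_by_percent({"A": 90, "G": 6, "T": 4}, 10) keeps only A, so a consensus built from the filtered position is a plain A rather than an ambiguity code.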
Apply a percentage filter to the profile to remove noise:
5. Run a percent threshold filter on the profile:
../Filter_Profile_By_Threshold.pl -i 20081201.prof -o 20081201.gt10p.prof -p 10
This will remove any alleles from the profile that are not present at greater than 10%, and write a new profile named 20081201.gt10p.prof.
6. Generate a new consensus from the filtered profile:
../Profile_To_ConsensusFASTA.pl -c 20081201.cons.gt10p.fasta -p 20081201.gt10p.prof
This will generate a new consensus FASTA file named 20081201.cons.gt10p.fasta.
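If you want to repeat steps 2 through 6 for other samples, the commands can be assembled in a small script. The sketch below uses only the commands and flags shown in the steps above. The function names build_commands and run_pipeline are hypothetical, and it assumes that the output files are named after the input prefix in the same way as for 20081201 (for instance, that the .aln conversion writes a .prof file with the same prefix).

```python
import subprocess


def build_commands(sample, percent=10, script_dir=".."):
    """Return the command lines of steps 2-6 for a sample prefix.

    Assumes <sample>.fasta exists in the working directory and that each
    tool names its output after the prefix, as it does for 20081201.
    """
    prof = f"{sample}.prof"
    filtered = f"{sample}.gt{percent}p.prof"
    return [
        ["clustalw2", f"-infile={sample}.fasta", "-quicktree"],
        [f"{script_dir}/ClustalALN_to_PositionProfile.pl", "-a", f"{sample}.aln"],
        [f"{script_dir}/Profile_To_ConsensusFASTA.pl",
         "-c", f"{sample}.cons.fasta", "-p", prof],
        [f"{script_dir}/Filter_Profile_By_Threshold.pl",
         "-i", prof, "-o", filtered, "-p", str(percent)],
        [f"{script_dir}/Profile_To_ConsensusFASTA.pl",
         "-c", f"{sample}.cons.gt{percent}p.fasta", "-p", filtered],
    ]


def run_pipeline(sample, percent=10, script_dir="..", cwd=None):
    """Run each step in order and stop at the first failing command."""
    for command in build_commands(sample, percent, script_dir):
        subprocess.run(command, check=True, cwd=cwd)
```

Calling run_pipeline("20081201", cwd="ANDES/example_data") would then run the same commands as the steps above, provided clustalw2 is on your PATH and the Perl scripts are executable.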
You can sanity-check the results by running diff on the two files. You should see that the filtered FASTA file has fewer ambiguity codes in it.
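As a complement to diff, you can count the ambiguity codes in each file directly. The sketch below is a hypothetical helper, not part of ANDES. It assumes plain FASTA text and treats any letter other than A, C, G, T, or U in a sequence line as an ambiguity code.

```python
def count_ambiguity_codes(fasta_text):
    """Count letters other than A, C, G, T, U in the sequence lines."""
    count = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # skip FASTA header lines
        for ch in line.strip().upper():
            # Gap characters such as '-' are not letters and are ignored.
            if ch.isalpha() and ch not in "ACGTU":
                count += 1
    return count
```

To compare the two consensus files, read each one with open(name).read() and pass the text to count_ambiguity_codes, first for 20081201.cons.fasta and then for 20081201.cons.gt10p.fasta. The second count should be the smaller one.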