Example 2:

How to Generate a Consensus FASTA File from a Multi-FASTA of a Sample

In this example, you will generate a consensus FASTA file for a sample. Sample input files are in ANDES/example_data. Remember that the multi-FASTA file contains all the reads for a given sample, so the result is a consensus sequence for that sample. This example also shows how to remove noise from a sample, so that the consensus does not contain too many ambiguity codes.

Convert a Multi-FASTA file into a profile:

The following steps convert a multi-FASTA file into a profile, and then into a consensus.

1. Go (cd) into the ANDES/example_data directory.

2. Run ClustalW2 on the sequences to generate an .aln file:

clustalw2 -infile=20081201.fasta -quicktree

This will generate the 20081201.dnd and 20081201.aln files.

3. Convert the .aln file to a profile:

../ClustalALN_to_PositionProfile.pl -a 20081201.aln

This will generate the 20081201.prof file.
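To make the idea of a profile concrete, here is a minimal Python sketch that counts how often each residue appears in each column of an alignment. It is only an illustration: the function name `position_profile`, the in-memory list-of-dicts layout, and the choice to skip gap characters are assumptions, and they do not describe the actual .prof file format or the behavior of ClustalALN_to_PositionProfile.pl.

```python
from collections import Counter


def position_profile(aligned_seqs):
    """Return one {residue: count} dict per alignment column.

    Hypothetical helper for illustration; gaps ('-') are skipped,
    which is an assumption rather than documented ANDES behavior.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    profile = []
    for col in range(length):
        counts = Counter(s[col].upper() for s in aligned_seqs if s[col] != "-")
        profile.append(dict(counts))
    return profile


# Three toy aligned reads; the last column disagrees (T, A, T).
reads = ["ACGT", "ACGA", "AC-T"]
profile = position_profile(reads)
```

In this toy input, the last column records two reads with T and one with A, which is the kind of per-position variation a profile is meant to capture.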

4. Convert the profile to a consensus FASTA file:

../Profile_To_ConsensusFASTA.pl -c 20081201.cons.fasta -p 20081201.prof

This will generate the 20081201.cons.fasta file.
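The Python sketch below shows one way ambiguity codes can arise in a consensus: each position is reduced to the standard IUPAC nucleotide code that covers every base observed there. The function name `consensus` and the dict-based profile are hypothetical, and whether Profile_To_ConsensusFASTA.pl uses exactly this rule is an assumption.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

BASES = {"A", "C", "G", "T"}


def consensus(profile):
    """Map each position's observed bases to a single IUPAC code.

    Hypothetical helper; positions with no A/C/G/T counts become 'N'.
    """
    out = []
    for counts in profile:
        observed = frozenset(b for b, n in counts.items() if n > 0 and b in BASES)
        out.append(IUPAC.get(observed, "N"))
    return "".join(out)


# Toy profile: the last position has both T and A, so it becomes 'W'.
toy_profile = [{"A": 3}, {"C": 3}, {"G": 2}, {"T": 2, "A": 1}]
toy_consensus = consensus(toy_profile)
```

Under this rule, a single stray read is enough to turn a position into an ambiguity code, which is why noisy samples can yield consensus sequences that are hard to use.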

Suppose this FASTA file contains so many ambiguity codes that it is not useful. You can filter out low-frequency alleles.

Apply a percentage filter to the profile to remove noise:

5. Run a percent threshold filter on the profile:

../Filter_Profile_By_Threshold.pl -i 20081201.prof -o 20081201.gt10p.prof -p 10

This will remove any alleles from the profile that do not occur at a frequency greater than 10%, and write a new profile named 20081201.gt10p.prof.
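As a rough illustration of what a percent threshold does, the Python sketch below drops every allele whose share of a position's total count is not greater than the threshold. The function name `filter_by_percent` and the dict-based profile are hypothetical, and the handling of an allele at exactly 10% (removed here) is an assumption based on the wording above, not on the script itself.

```python
def filter_by_percent(profile, min_percent):
    """Keep only alleles whose frequency exceeds min_percent at each position.

    Hypothetical helper for illustration. Integer arithmetic avoids
    floating-point surprises at the threshold boundary.
    """
    filtered = []
    for counts in profile:
        total = sum(counts.values())
        kept = {
            base: n
            for base, n in counts.items()
            if n * 100 > min_percent * total
        }
        filtered.append(kept)
    return filtered


# Position 1: G is at 5% and is removed. Position 2: both alleles survive.
toy_profile = [{"A": 95, "G": 5}, {"C": 60, "T": 40}]
toy_filtered = filter_by_percent(toy_profile, 10)
```

After filtering, the first position has a single allele and would no longer need an ambiguity code, while the second position still reflects real variation.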

6. Regenerate a new consensus with the filtered profile:

../Profile_To_ConsensusFASTA.pl -c 20081201.cons.gt10p.fasta -p 20081201.gt10p.prof

This will generate a new consensus FASTA file named 20081201.cons.gt10p.fasta.

You can sanity-check the results by running diff on the two consensus files. The filtered FASTA file should contain fewer ambiguity codes.
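If a diff is hard to read, counting the ambiguity codes in each file gives a single number to compare. The Python sketch below is one way to do that; the function name `count_ambiguity_codes` is hypothetical, and it assumes that anything other than A, C, G, T, or a gap character on a sequence line counts as an ambiguity code.

```python
def count_ambiguity_codes(fasta_text):
    """Count sequence characters that are not A, C, G, T, or '-'.

    Hypothetical helper; header lines starting with '>' are ignored.
    """
    unambiguous = set("ACGTacgt-")
    total = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue
        total += sum(1 for ch in line.strip() if ch not in unambiguous)
    return total


# Toy FASTA strings standing in for the unfiltered and filtered consensus.
unfiltered = ">sample\nACGW\nRYAC\n"
filtered = ">sample\nACGT\nRCAC\n"
before = count_ambiguity_codes(unfiltered)
after = count_ambiguity_codes(filtered)
```

To apply this to the real output, read 20081201.cons.fasta and 20081201.cons.gt10p.fasta with open(...).read() and pass the text to the function; the second count should be the smaller one.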