Example 1:

How to Generate an RMSD Plot Between a Reference and Multiple Samples

In this example, you will take 3 samples and compare it against a reference. You can find some sample input files in ANDES/example_data. Remember that each fasta file contains all the reads for a given sample, thus each profile represents the reads that are in a sample.

Convert a Multi-FASTA file into a profile:

In this example, we will convert a multi-FASTA into a profile.

1. Go (cd) into the ANDES/example_data directory.

2. Run Clustalw2 on sequences to generate .aln file

clustalw2 -infile=20081201.fasta -quicktree

This will generate the 20081201.dnd and the 20081201.aln file.

3. Convert .aln file to profile.

../ClustalALN_to_PositionProfile.pl -a 20081201.aln

This will generate the 20081201.prof. Repeat steps 2 and 3, for 20081116.fasta, and 20090501.fasta, as well.

This will leave you with 4 .prof files.

Convert a FASTA file with ambiguity codes into a profile:

In this example, we will convert a single record FASTA file into a profile.

1. In the ANDES/example_data directory, execute the following script:

../FASTA_to_PositionProfile.pl -f reference.fasta -o reference.prof

This will generate the FJ966952.prof. The name of the profile is based on the id in the defline. If you want a different name, use the -o option, for this example, we will go with the default.

Compute the RMSD distance between the reference profile and each of the sample profiles

1. In the ANDES/example_data, execute the following script:

../Compute_DiffFromRef_PositionProfile_per_position.pl -o distFromRef -r reference.prof 20081201.prof 20090501.prof 20090416.prof

This will generate the distFromRef file.

2. To plot the distFromRef file, run the script:

../Plot_PositionProfile_From_Reference.r distFromRef

This will generate the distFromRef.pdf file.

3. You can view this with the Linux acroread command, or whatever Acrobat reader you have installed.

You should see something like the following:

From this plot you can see that samples 20090416 and 20090501 are similar to the reference, but 20081201 has many more positions that are different. There is also a region between 500 and 1000 that seems to have more variation.