How to Generate an RMSD Plot Between a Reference and Multiple Samples
In this example, you will take 3 samples and compare it against a reference. You can find some sample input files in ANDES/example_data. Remember that each fasta file contains all the reads for a given sample, thus each profile represents the reads that are in a sample.
Convert a Multi-FASTA file into a profile:
In this example, we will convert a multi-FASTA into a profile.
1. Go (cd) into the ANDES/example_data directory.
2. Run Clustalw2 on sequences to generate .aln file
clustalw2 -infile=20081201.fasta -quicktree
This will generate the 20081201.dnd and the 20081201.aln file.
3. Convert .aln file to profile.
../ClustalALN_to_PositionProfile.pl -a 20081201.aln
This will generate the 20081201.prof. Repeat steps 2 and 3, for 20081116.fasta, and 20090501.fasta, as well.
This will leave you with 4 .prof files.
Convert a FASTA file with ambiguity codes into a profile:
In this example, we will convert a single record FASTA file into a profile.
1. In the ANDES/example_data directory, execute the following script:
../FASTA_to_PositionProfile.pl -f reference.fasta -o reference.prof
This will generate the FJ966952.prof. The name of the profile is based on the id in the defline. If you want a different name, use the -o option, for this example, we will go with the default.
Compute the RMSD distance between the reference profile and each of the sample profiles
1. In the ANDES/example_data, execute the following script:
../Compute_DiffFromRef_PositionProfile_per_position.pl -o distFromRef -r reference.prof 20081201.prof 20090501.prof 20090416.prof
This will generate the distFromRef file.
2. To plot the distFromRef file, run the script:
This will generate the distFromRef.pdf file.
3. You can view this with the Linux acroread command, or whatever Acrobat reader you have installed.
You should see something like the following:
From this plot you can see that samples 20090416 and 20090501 are similar to the reference, but 20081201 has many more positions that are different. There is also a region between 500 and 1000 that seems to have more variation.