JelleZijlstra/JPP | GitXplorer

CS205 Final Project: Computational Phylogenetics Jelle Zijlstra, November--December 2012

Introduction This project implements phylogenetic inference using maximum parsimony (MP). Its input is a list of taxa (e.g., animal species) with their states for a variety of characters and its output is a tree that describes their relationships. This method is used to reconstruct the evolutionary relationships among different organisms.

1.1. Example of MP in action For example, suppose we have four species---A, B, C, and D. They vary in two characters: A has red eyes while B, C, and D have blue eyes; and A and B have a blue nose while C and D have a red one. We arbitrarily select A as the outgroup (the first-branching taxon). That means there are three possible trees:

(1) /-- A --| /-- B --| /-- C --| -- D

(2) /-- A --| /-- C --| /-- B --| -- D

(3) /-- A --| /-- D --| /-- B --| -- C

The character matrix (the input to our program) would look like this:

4 2 A 00 B 10 C 11 D 11

When using maximum parsimony, we want to select the tree that requires the fewest changes in character states. For all three trees above, there is one change in character 1: from 0 in A to 1 in the branch (clade) formed by B, C, and D. However, for character 2, trees (2) and (3) require two changes in state: from state 0 in A, it changed to 1 independently in C and D, or it changed to 1 in the B-C-D clade, and then back to 0 in B only. In either case, there are two state changes. In tree (1), on the other hand, the character changes only once in the branch leading to C and D.

Summing over the two characters, tree (1) has a length of 2, while trees (2) and (3) have a length of 3. Therefore, (1) is the most parsimonious tree. We conclude that C and D shared a more recent common ancestor with each other than either did with B.

1.2. MP on larger datasets Real-world datasets tend to have more than four taxa and two characters. This poses a problem, because the number of possible trees increases very fast with the number of taxa--at O(n!). Therefore, it is impossible in practice to enumerate all possible trees when there are more than ~10 taxa.

However, there are alternatives. These algorithms rely on the observation that optimal and near-optimal trees tend to be similar to each other. This is often visualized as a tree landscape'', where similar good trees are clustered together in mountains''. We then need a ``hill-climbing'' algorithm, which can move from a tree to other, similar trees in order to climb up the tree.

This is accomplished using heuristic algorithms that perform slight rearrangements on a tree in order to improve it---for example, exchange two adjacent nodes or branches of the tree. Common heuristic algorithms include:

Nearest Neighbor Interchange (NNI) -- perform rearrangements of the type (A,(B,C)) -> (B,(A,C))
Subtree Pruning Rebranching (SPR) -- prune a random subtree from the tree, then replant that tree at a random different location in the tree.
Tree Bissection Reconnection (TBR) -- similar to SPR, but more powerful. The tree is bissected at a random node, and the two partial trees that result are both treated as unrooted and reconnected at random spots.

Since these algorithms all rely on random rearrangements of a previous tree, they do not provide an exact solution. However, in practice they usually come at least close to finding the best tree.

Implementation and usage In this project, I implemented exhaustive search, branch-and-bound search (a slightly faster-running exact algorithm), and the heuristic algorithms NNI, SPR, and TBR. The heuristic algorithms can be run in parallel by multiple processes using MPI.

The code is written in Python and requires the following external modules:

mpi4py
numpy
argparse
CProfile (if profiling is enabled)

Type ``python main.py'' for usage notes. Here are some example invocations:

openmpirun -n 4 python main.py -a tbr -t 60 -i 30 -d data/oryzos.txt Runs the program on the datafile ``data/oryzos.txt'' using TBR for 60 minutes with 30 s between communications between the 4 MPI processes.
python main.py -a exhaustive -d data/shortoryzos.txt Performs an exhaustive search on the datafile ``data/shortoryzos.txt''.

2.1. Organization. The implementation resides in the src/ directory. The program comprises the following code files:

branchnbound.py -- Implementation of branch-and-bound search.
charmatrix.py -- Class representing a taxon-character matrix.
exhaustive.py -- Implementation of exhaustive search.
fuser.py -- Implementation of tree fusion (an operation on two trees to detect common subtrees and unite the optimal components of both)
list_tree.py -- Class representing a tree as a list of pointers to other nodes in the tree; useful for heuristic algorithms like SPR and TBR.
list_tree_searcher.py -- Abstract class used for search in multiple processes using a heuristic algorithm.
main.py -- Entry point of the program.
mixed.py -- Searcher implementation that lets different processes use different algorithms.
nni.py -- Implementation of NNI.
parallel_search.py -- Class used for communication between different MPI processes to exchange trees.
run_tests.py -- Runs unit tests on modules that have them.
serial_util.py -- Provides a utility function for timing the execution of pieces of code.
spr.py -- Implementation of SPR.
tbr.py -- Implementation of TBR.
tree.py -- Implementation of trees as immutable, recursive data structures.
tree_parse.py -- Parser that parses a string of the form "(1,(2,3))", used for MPI communication, into a tree object. The data/ directory contains example datasets, and the dev/ directory contains some code used in development.

Performance. Since this program uses heuristic search and does not calculate an exact result, it is not meaningful to compute measures like speedup. Instead, we can look at how short the trees produced by the program are as a function of the number of processes and the runtime.

I ran the program on the dataset in data/oryzos.txt, with a communication interval of 15 s, using TBR, under the following three configurations:

Single process, run for 2 min -- produces a single tree of length 973
Single process, run for 8 min -- produces a single tree of length 918
Four processes, run for 2 min -- produce a single tree of length 961 I also ran the program under the "mixed" configuration with four processes for two minutes with an interval of 15 s. This yielded a tree of length 934.

Thus, the multiple processes together perform better than a single process, but not nearly as well as a single process run longer. This suggests that the strategy used for communicating between the processes is not yet optimal.

Possible future optimizations. Profiling (using CProfile) reveals that the TBR algorithm spends about half of its time in the list_tree.char_length method. This method could probably be optimized by caching partial results in the tree and invalidating portions of the cache in the event of rearrangements.

The program also spends a significant amount of time in the MPI recv method. It may be possible to increase productivity by performing further searches during recv wait time.

Links.

http://www.github.com/JelleZijlstra/JPP - source code
http://cloud.cs50.net/~zijlstra/JPP/html/ - project website
http://youtu.be/0fS0zfdye_k - project video on YouTube

JPP

Commits

Why TypeError?

Use builtin methods instead of flatten

Small fixes

Add more pages

Add datafile

Add labels to leaf and node functions

README