Bandelt, H.-J., Quintana-Murci, L., Salas, A. and Macaulay, V. (2002). The fingerprint of phantom mutations in mtDNA data. American Journal of Human Genetics, 71, 1150-1160. [PDF file from journal site]
(June 2010): the programs SPECTRA and NETMAT have been recompiled and should now work in Windows XP, Vista and 7. Let me know if they don't!
Prepare your data file as follows. Each line of the file represents
one individual's DNA sequence, with a list of positions that differ
from some reference sequence. The distinct mutations at each site
(transitions, transversions, indels) are given distinct labels. The
convention we use is to label (i) a transition with respect to the
reference sequence by its position in the reference sequence and (ii)
a transversion or indel by its position, plus a suffix to indicate the
base change. For example, if the reference sequence were
AAGGCCTTA
we might code the sequence
AGGGCAT-A
as
2 6A 8del
Sequences that match the reference sequence should be signified with a
"0". Labels should be restricted to a maximum of 6 characters.
An example data file of the 50 Adygei mtDNA HVS-I sequences from Macaulay et al. (1999) Am. J. Hum. Genet. 64:232-249 is available here. In this file the mutations are labelled by their position in the reference sequence, minus 16,000. Three-character codes refer to transitions, and the (single) four-character code to a transversion.
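The labelling convention described above can be sketched in a few lines of Python. This is only an illustration, not part of NETMAT: the function names `is_transition` and `encode` are hypothetical, the two sequences are assumed to be aligned to the same length over A, C, G, T and "-", and insertions relative to the reference are rejected because the text does not give a label format for them.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref_base, base):
    """True for a purine/purine or pyrimidine/pyrimidine change."""
    pair = {ref_base, base}
    return pair.issubset(PURINES) or pair.issubset(PYRIMIDINES)


def encode(reference, sequence):
    """Return one data-file line for `sequence` relative to `reference`."""
    if len(reference) != len(sequence):
        raise ValueError("sequences must be aligned to the same length")
    labels = []
    pairs = zip(reference.upper(), sequence.upper())
    for pos, (ref_base, base) in enumerate(pairs, start=1):
        if ref_base == base:
            continue
        if ref_base == "-":
            # Assumption: insertion labels are not specified in the text.
            raise ValueError("insertions are not handled in this sketch")
        if base == "-":
            label = f"{pos}del"        # deletion: position plus suffix
        elif is_transition(ref_base, base):
            label = str(pos)           # transition: position only
        else:
            label = f"{pos}{base}"     # transversion: position plus new base
        if len(label) > 6:
            raise ValueError(f"label {label!r} is longer than 6 characters")
        labels.append(label)
    # A sequence identical to the reference is written as "0".
    return " ".join(labels) if labels else "0"


if __name__ == "__main__":
    print(encode("AAGGCCTTA", "AGGGCAT-A"))  # 2 6A 8del
```

Applied to the example reference and sequence above, the sketch gives the same line, "2 6A 8del".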
To generate an RDF file for NETWORK, use:
netmat -s -o output.rdf < input.txt
where "input.txt" is the name of the data file and "output.rdf" is the name of the output file (for an example, see "adygei.rdf").
To generate an RDF file for NETWORK while filtering away some
mutations, use:
netmat -s -o output.rdf -f filter.txt < input.txt
The file "filter.txt" should contain a list of the codes corresponding
to the speedy mutations, one per line. In the example above, if the
deletion at position 8 were speedy, you would prepare a file containing
the single line "8del", and use that as "filter.txt". Our suggested
filter files for HVS-I and HVS-II of human mtDNA can be found below.
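To make the effect of a filter file concrete, here is a minimal Python sketch that drops the listed codes from each line of a data file. It assumes that "filtering away" a mutation amounts to ignoring its label wherever it occurs; how NETMAT implements the filter internally is not described here, and the function name `filter_labels` is hypothetical.

```python
def filter_labels(data_lines, speedy_codes):
    """Remove every label listed in `speedy_codes` from each data line."""
    speedy = set(speedy_codes)
    filtered = []
    for line in data_lines:
        kept = [label for label in line.split() if label not in speedy]
        # A sequence left with no mutations matches the reference: "0".
        filtered.append(" ".join(kept) if kept else "0")
    return filtered


if __name__ == "__main__":
    # The filter file from the example would hold the single line "8del".
    print(filter_labels(["2 6A 8del", "8del", "0"], ["8del"]))
```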
To generate a matrix for the program SPECTRA while filtering away some
mutations, use:
netmat -t -s -o output.txt -f filter.txt < input.txt
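If you prefer to drive NETMAT from a script, the three command lines above could be assembled as in the following Python sketch. It uses only the options shown on this page (-s, -o, -f and -t) and assumes that the "netmat" executable is on your PATH; the helper names `netmat_command` and `run_netmat` are hypothetical.

```python
import subprocess


def netmat_command(output, filter_file=None, for_spectra=False):
    """Build the argument list for one of the command lines shown above."""
    cmd = ["netmat"]
    if for_spectra:
        cmd.append("-t")           # used above only when preparing SPECTRA input
    cmd += ["-s", "-o", output]
    if filter_file is not None:
        cmd += ["-f", filter_file]
    return cmd


def run_netmat(input_path, output, filter_file=None, for_spectra=False):
    """Run NETMAT with the data file supplied on standard input."""
    with open(input_path, "rb") as data:
        subprocess.run(
            netmat_command(output, filter_file, for_spectra),
            stdin=data,
            check=True,            # raise if NETMAT exits with an error
        )


if __name__ == "__main__":
    print(" ".join(netmat_command("output.txt", "filter.txt", for_spectra=True)))
```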
A suggested weighty filter file for HVS-I
A file containing the list of speedy transitions in HVS-I (between
16051 and 16365, less 16000) that we used for the weighty filter
examples in the paper can be downloaded here. To use this file with
NETMAT, a possible command line would be
netmat -s -o matrix.rdf -f speedy.hvsi < input.txt
which would prepare a file "matrix.rdf" for NETWORK.
For a given binary matrix, the program SPECTRA computes the cube and incompatibility spectra, and can also perform the permutation described in the paper.
In the context we describe, the matrix is the result of a mapping from aligned DNA sequence data. Each row represents the sequence of a (haploid) individual and each column the nucleotide at a particular position in the DNA. If a position is polymorphic in the sample, only two nucleotides are assumed to be segregating, and these are coded as '0' and '1'. A non-segregating position can be represented as a column of '0's or of '1's.
The first line of the data file should contain two positive integers:
(i) the number of sequences (rows in the matrix) and (ii) the number
of DNA positions (columns in the matrix). The matrix then follows,
with one row of the file per row of the matrix. An
example would be:
5 4
0001
0011
1001
1111
0001
which represents 5 individuals and 4 DNA positions, the first and third
of which are incompatible, the last of which is fixed.
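The statements about this example can be checked with a short Python sketch that reads the file format just described. It assumes that two positions are "incompatible" when all four combinations 00, 01, 10 and 11 occur among the sequences (the usual four-gamete criterion); the function names are hypothetical, and positions are reported counting from one, as in the sentence above.

```python
def parse_matrix(text):
    """Read the header line and the 0/1 rows of a SPECTRA-style data file."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    n_rows, n_cols = (int(x) for x in lines[0].split())
    rows = lines[1:]
    if len(rows) != n_rows or any(len(r) != n_cols for r in rows):
        raise ValueError("matrix does not match the dimensions on line 1")
    if any(set(r) - {"0", "1"} for r in rows):
        raise ValueError("matrix entries must be 0 or 1")
    return rows


def fixed_positions(rows):
    """Positions (counting from 1) at which every sequence has the same state."""
    return [j + 1 for j in range(len(rows[0]))
            if len({r[j] for r in rows}) == 1]


def incompatible_pairs(rows):
    """Pairs of positions (counting from 1) showing all four combinations."""
    n_cols = len(rows[0])
    pairs = []
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            if len({(r[i], r[j]) for r in rows}) == 4:
                pairs.append((i + 1, j + 1))
    return pairs


if __name__ == "__main__":
    example = "5 4\n0001\n0011\n1001\n1111\n0001\n"
    rows = parse_matrix(example)
    print(incompatible_pairs(rows))  # [(1, 3)]
    print(fixed_positions(rows))     # [4]
```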
The program, which you can download here, runs
under the DOS prompt in the various versions of Windows. (If you
would like the C code in order to compile the program for your
favorite operating system, please contact me at the email address
below.) It reads the data from the standard input and sends the
results to the standard output, so you will probably want to use
redirection to make things easy. So, for example, you might type:
spectra < data.txt > results.txt
at the DOS prompt to run the program on the data file
"data.txt" and to put the output in "results.txt".
Applied to the above example, the program should put the following in
"results.txt":
The raw data:
No. of haplotypes: 5
No. of characters: 4
000 : 0001
001 : 0011
002 : 1001
003 : 1111
004 : 0001
------------
Non-pruned characters: 0 1 2
The cooked data:
No. of haplotypes: 5
No. of characters: 3
000 : 000
001 : 001
002 : 100
003 : 111
004 : 000
------------
The incompatibility matrix of the cooked data:
000 : 001
001 : 000
002 : 100
------------
The incompatibility and cube spectra:
s = ( 1 3 1)
f = ( 5 5 1)
------------
This contains the data that the program read from the input file (sequences numbered consecutively, starting from zero); the "cooked" data, in which
fixed positions have been removed and positions that split the sequences
into the same two sets have been merged (positions are numbered consecutively, starting from zero);
the incompatibility matrix of the cooked data; and
the incompatibility (s) and cube (f) spectra.
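The "cooking" step and the incompatibility matrix can be reproduced for the example with the Python sketch below. It is a reading of the description above rather than SPECTRA's own code: fixed columns are dropped, a column is treated as a duplicate if it induces the same split as an earlier one (identical or with 0 and 1 exchanged), the first such column is kept unchanged, and incompatibility is assumed to mean that all four combinations occur. The function names are hypothetical, and the spectra themselves are not computed.

```python
FLIP = str.maketrans("01", "10")


def cook(rows):
    """Drop fixed columns and merge columns inducing the same split.

    Returns the cooked rows and the original indices of the kept columns.
    """
    seen_splits = set()
    kept_indices = []
    kept_columns = []
    for j in range(len(rows[0])):
        column = "".join(r[j] for r in rows)
        if len(set(column)) == 1:
            continue                      # fixed position
        # A split is the same whichever side is called "0".
        split = min(column, column.translate(FLIP))
        if split in seen_splits:
            continue                      # same two sets as an earlier column
        seen_splits.add(split)
        kept_indices.append(j)
        kept_columns.append(column)
    cooked = ["".join(col[i] for col in kept_columns)
              for i in range(len(rows))]
    return cooked, kept_indices


def incompatibility_matrix(cooked):
    """Entry (i, j) is '1' if characters i and j show all four combinations."""
    n = len(cooked[0]) if cooked else 0
    matrix = []
    for i in range(n):
        row = ""
        for j in range(n):
            combos = {(r[i], r[j]) for r in cooked}
            row += "1" if len(combos) == 4 else "0"
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    cooked, kept = cook(["0001", "0011", "1001", "1111", "0001"])
    print(cooked)                          # ['000', '001', '100', '111', '000']
    print(incompatibility_matrix(cooked))  # ['001', '000', '100']
```

For the example, the cooked rows and the matrix agree with the program output listed above.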
You can specify the option "-n perms" on the command line to perform the permutation described in the paper. Here "perms" is the number of permutations to perform. Crudely speaking, the 0s and 1s in each column are jumbled up in each permutation. The output file then contains the same output as above, for the original (unpermuted) data, and then for each permuted data set; and finally the mean cube and incompatibility spectra across the permutations (with standard deviations).
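The "crude" picture of a permutation given above can be sketched in Python as follows: each column is shuffled independently, so the number of 0s and 1s at every position is preserved while the associations between positions are broken up. This is an assumption-laden illustration only; the procedure in the paper and in SPECTRA may differ in detail, and the function name `permute_columns` is hypothetical.

```python
import random


def permute_columns(rows, rng):
    """Shuffle the entries of each column independently of the others."""
    n_rows, n_cols = len(rows), len(rows[0])
    columns = []
    for j in range(n_cols):
        column = [r[j] for r in rows]
        rng.shuffle(column)        # jumble the 0s and 1s within this column
        columns.append(column)
    return ["".join(columns[j][i] for j in range(n_cols))
            for i in range(n_rows)]


if __name__ == "__main__":
    data = ["0001", "0011", "1001", "1111", "0001"]
    for row in permute_columns(data, random.Random(2002)):
        print(row)
```

Passing a seeded `random.Random` instance makes a run repeatable, which is convenient when comparing results.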
If you download the software and would like to be told of any bugs or new versions, please send me your email address (at the address below).
We would be very grateful to hear of any problems with the software, errors in the paper or of any problems you have with the web site. A list of any errata will appear here.
Vincent Macaulay
11th November 2002 (corrected 26th June 2010)
v.macaulay@stats.gla.ac.uk