Running CG with other prediction programs
- by Mirela Andronescu, last modified Mar 6, 2008.
CG (Constraint Generation) is a computational
approach that estimates free energy parameters used by RNA secondary
structure prediction software. CG was described in detail in:
M. Andronescu, A. Condon, H.H. Hoos, D.H. Mathews, and K.P. Murphy,
"Efficient parameter estimation for RNA secondary structure
prediction", Bioinformatics 2007 23(13): i19-i28.
CG can essentially work with
any RNA secondary structure prediction software, as long as the energy
function is linear or quadratic in the parameter vector.
You just need a prediction function and a few other functions for your
model (see details below). One option is to send
me the necessary files, and I'll run CG with your files for you. Here's
what you need:
- Configuration file
- Initial parameter file
- Training data set
- Testing data set
- Code to create the structural constraints
- Code to predict and analyse results of new
parameters
- [optional] Thermodynamic file
- [optional] File that specifies which parameters are fixed and
which are variable
- [optional] File with additional constraints
When you have all of these, please send me a directory that contains
them.
Here's a sample directory, where I used Simfold as the
prediction program: Simfold.tar.gz
1. The configuration file is where you specify the names (and paths)
of all the other files described on this web page, so read the rest of
the document first.
This file also contains some input options for CG. I will test several
settings of these options to get the best performance. If you have a
strong opinion about what these options should be, please state it
clearly in the configuration file.
Here's a configuration file example: config_sample.txt
2. Initial parameter file. This is a text
file, with the values of the initial parameters, one per line. Here's
an example: turner_parameters_fm363_constrdangles.txt
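As a rough illustration of the "one value per line" format, a parameter
file could be read with a C++ sketch like the one below. The function
names read_parameters and read_parameter_file are hypothetical and are
not taken from Simfold or CG, and skipping blank lines is an assumption.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Reads parameter values from a stream, one per line.
// The first field on each line is parsed as a number.
// Blank lines are skipped (an assumption, not part of the stated format).
std::vector<double> read_parameters(std::istream& in) {
    std::vector<double> params;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        double value;
        if (fields >> value) {
            params.push_back(value);
        } else if (line.find_first_not_of(" \t\r") != std::string::npos) {
            throw std::runtime_error("not a number: " + line);
        }
    }
    return params;
}

// Convenience wrapper (hypothetical) that opens the file by name.
std::vector<double> read_parameter_file(const std::string& path) {
    std::ifstream in(path);
    if (!in) throw std::runtime_error("cannot open " + path);
    return read_parameters(in);
}
```

The index of each value in the returned vector is its position in the
file, which is what the other files on this page rely on.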
3. Training data set. This is one text file,
to be used as "structural training set" (see paper).
There are two options:
- Option 1. I have a large data set of RNA sequences with known
secondary structures from the RNA SSTRAND database, which might be good
enough for most purposes. If you think it is good enough for you, you
need to create a C++ program based on the C++ example below. I need
this program from you because I have to obtain predicted structures
using your prediction program and your initial parameter set.
- Here are the functions you will need:
- A function that takes as input a file with a set of parameters
(like the initial parameters file above), and fills your internal data
structures
- A function that predicts the minimum free energy (or low free
energy) secondary structure for a given sequence
- If your data has secondary structure constraints, you need a
variant of the prediction function to accommodate constraints
- Here's what you have to do:
- If your prediction program is written in C or C++, I
recommend you use the following C++ program as a model to get the
training data set (see comments in the file): add_initial_predictions_simfold.cpp.
The name of your file can be anything, just specify this name in the
configuration file. (e.g. if your prediction program is called foo, you
might want to call this program add_initial_predictions_foo.cpp.)
- I will call your compiled program as follows, so make sure it works.
I will run it on Linux OpenSUSE 10.1.
- ./add_initial_predictions_simfold
input_initial_parameter_file.txt input_training_set_without_predictions.txt
output_training_set.txt
- where input_initial_parameter_file.txt is the initial parameter
file you provided in step 2
- input_training_set_without_predictions.txt has the format
below, see example1_nopred.txt, example2_nopred.txt:
> some name of the molecule
sequence: should have only A, C, G and U. Otherwise
please make sure your program can deal with all the base types.
known structure, in dot-parentheses format
optional, constrained structure
empty line
- The directory you send me should contain the code and makefile,
or just the executable. It should work on Linux OpenSUSE 10.1.
- Option 2. Create a training set. You might want to use the C++
model above and create the training set that way.
- The format of the training set should be as below:
> some name of the molecule
sequence: should have only A, C, G and U. Otherwise
please make sure your program can deal with all the base types.
known structure, in dot-parentheses format
optional, constrained structure
secondary structure predicted with the initial set of parameters
empty line
- Examples of training data sets. Some of the molecules have
constrained structures, some don't: example1_pred.txt,
example2_pred.txt.
- In this case, provide me with a training set.
The training set should be comprehensive enough for good training. The
better it is, the better the quality of the estimated parameters.
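The record format above could be parsed with a C++ sketch like the
following. It assumes that records are separated by empty lines and
that the caller knows whether the file contains predictions, because
the optional constrained-structure line makes a four-line record
ambiguous otherwise. The names Molecule, make_molecule and
read_data_set are hypothetical.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct Molecule {
    std::string name;
    std::string sequence;
    std::string known;       // known structure, dot-parentheses
    std::string constraint;  // empty if the record has no constraint line
    std::string predicted;   // empty if the file has no predictions
};

// Turns the non-empty lines of one record into a Molecule.
static Molecule make_molecule(const std::vector<std::string>& lines,
                              bool has_predictions) {
    const std::size_t min_lines = has_predictions ? 4 : 3;
    if (lines.size() < min_lines || lines.size() > min_lines + 1 ||
        lines[0][0] != '>') {
        throw std::runtime_error("malformed record");
    }
    Molecule m;
    m.name = lines[0].substr(1);
    m.name.erase(0, m.name.find_first_not_of(' '));  // drop spaces after '>'
    m.sequence = lines[1];
    m.known = lines[2];
    if (m.known.size() != m.sequence.size()) {
        throw std::runtime_error("structure length differs from sequence");
    }
    if (lines.size() == min_lines + 1) m.constraint = lines[3];
    if (has_predictions) m.predicted = lines.back();
    return m;
}

// Reads all records from a stream; an empty line ends a record.
std::vector<Molecule> read_data_set(std::istream& in, bool has_predictions) {
    std::vector<Molecule> molecules;
    std::vector<std::string> record;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) {
            if (!record.empty()) {
                molecules.push_back(make_molecule(record, has_predictions));
                record.clear();
            }
        } else {
            record.push_back(line);
        }
    }
    if (!record.empty()) {
        molecules.push_back(make_molecule(record, has_predictions));
    }
    return molecules;
}
```

The same reader would work for the testing data set, since it uses the
same format.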
4. Testing data set. This has exactly the same format as the training
data set, and you can use either of the two options above. The
molecules in this set should be different from the ones in the
training data set.
5. Code to create the structural constraints.
You need to create an executable that takes a data set as input and
writes two output files (see details below). The minimum you need for
this is:
- A function that returns the number of parameters used in the
model
- A function that takes as input a file with a set of parameters,
and fills your internal data structures
- A function that returns the number of times each feature occurs in
a given structure. Here are some details:
- If your energy function is linear
in the energy parameters, then your energy function can be written like
this:
- deltaG = c' x + f
- where x is the vector of parameters
- c is the vector of how many times each parameter occurs
in the given structure
- c' means c transposed
- f is a constant
- As a simple example, if your model has 3 parameters x1, x2
and x3, and deltaG for some structure is x1 + x3 + x3 + 0.5, then
c' = (1,0,2) and f = 0.5.
- The Turner model underlying mfold, simfold, RNAstructure, the
Vienna RNA package, etc. is linear.
- If your energy function is quadratic
in the energy parameters, then your energy function can be written like
this:
- deltaG = x' P x + c' x + f
- where x is the vector of parameters
- P is a symmetric matrix of the coefficients for each
quadratic term
- c is a vector of counts for each linear term
- c' means c transposed
- f is a constant
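- As a simple example, if your model has 3 parameters x1, x2
and x3, and deltaG for some structure is x1*x2 + 0.5*x2*x3 + 2*x2, then
P = (0,0.5,0; 0.5,0,0.25; 0,0.25,0), c' = (0,2,0) and f = 0. Because P
is symmetric, the coefficient of each cross term is split evenly
between P(i,j) and P(j,i).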
- The Dirks & Pierce and Rivas & Eddy models for
pseudoknotted structures are quadratic.
- Optionally, a function that returns the free energy (under your
model) of a sequence folded into a given structure.
- If you have your functions already written in C/C++, I recommend
using the C/C++ model provided below. (Writing a Perl script with
system calls will be much slower.) Just replace the necessary code,
following the comments in the file:
- The executable will be run like this, so make sure it works. I
will run it on Linux OpenSUSE 10.1.
- ./create_structural_constraints_simfold params.txt
input_set_no_pseudoknots.txt constraints_output.txt
num_constraints_per_molecule.txt
- Files to test that your code works well.
- The directory you send me should contain the code and makefile,
or just the executable.
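To make the linear and quadratic forms in step 5 concrete, here is a
small C++ sketch that evaluates deltaG = c' x + f and
deltaG = x' P x + c' x + f. The function names are hypothetical and the
code is not part of CG. It takes x' P x exactly as written, so with a
symmetric P a cross term a*xi*xj is stored as a/2 in both P(i,j) and
P(j,i); if your convention puts a factor 1/2 in front of x' P x, the
entries of P double.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<std::vector<double>>;

// Linear model: deltaG = c' x + f, where c[i] counts how many times
// parameter i occurs in the structure.
double linear_energy(const Vector& c, const Vector& x, double f) {
    if (c.size() != x.size()) throw std::invalid_argument("size mismatch");
    double g = f;
    for (std::size_t i = 0; i < x.size(); ++i) g += c[i] * x[i];
    return g;
}

// Quadratic model: deltaG = x' P x + c' x + f, with P symmetric.
double quadratic_energy(const Matrix& P, const Vector& c,
                        const Vector& x, double f) {
    if (P.size() != x.size()) throw std::invalid_argument("size mismatch");
    double g = linear_energy(c, x, f);
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (P[i].size() != x.size()) {
            throw std::invalid_argument("P is not square");
        }
        for (std::size_t j = 0; j < x.size(); ++j) {
            g += x[i] * P[i][j] * x[j];
        }
    }
    return g;
}
```

For the linear example x1 + x3 + x3 + 0.5, calling linear_energy with
c = (1,0,2), f = 0.5 and any x gives the same value as evaluating the
expression directly, which is a quick way to check a feature-counting
function.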
6. Code to predict and analyse results of new
parameters. You need to create an executable that takes as input a set
of parameters compatible with your model and a data set file. The
program predicts structures with the new parameters and computes the
accuracy obtained. The functions you need are:
- A function that takes as input a file with a set of parameters
(like the initial parameters file above), and fills your internal data
structures
- A function that predicts the minimum free energy (or low free
energy) secondary structure for a given sequence
- If your data has secondary structure constraints, you need a
variant of the prediction function to accommodate constraints
- A function that computes the sensitivity of prediction
- A function that computes the positive predictive value of the
prediction
- If you have your functions written in C++, I recommend using the
model below. Just replace the necessary code, following the comments in
the file:
- The executable will be run like this, so make sure it works. I
will run it on Linux OpenSUSE 10.1.
- ./predict_and_analyse_results_simfold params.txt input_set.txt
output_predictions.txt output_accuracy.txt
- Files to test that your code works well:
- Example parameter set, one parameter value per line: params.txt
- Input data set, to test if your program works: input_set.txt
- First output file, containing predictions. Make sure you get
something similar (don't worry about the details): output_predictions.txt
- Second output file, containing accuracy results. Make sure you
get something similar (don't worry about the details): output_accuracy.txt
- The directory you send me should contain the code and makefile,
or just the executable.
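The two accuracy functions in step 6 could look like the C++ sketch
below. It uses the usual definitions: sensitivity is the number of
correctly predicted base pairs divided by the number of base pairs in
the known structure, and positive predictive value (PPV) is the number
of correctly predicted base pairs divided by the number of predicted
base pairs. The names are hypothetical, only '(' and ')' are treated as
paired positions (so pseudoknot notation is not handled), and returning
0 when a denominator is zero is an assumption.

```cpp
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using PairSet = std::set<std::pair<std::size_t, std::size_t>>;

// Extracts base pairs (i, j) from a dot-parentheses string.
// Any character other than '(' or ')' counts as unpaired.
PairSet base_pairs(const std::string& structure) {
    PairSet pairs;
    std::vector<std::size_t> open;  // positions of unmatched '('
    for (std::size_t i = 0; i < structure.size(); ++i) {
        if (structure[i] == '(') {
            open.push_back(i);
        } else if (structure[i] == ')') {
            if (open.empty()) throw std::invalid_argument("unbalanced ')'");
            pairs.insert(std::make_pair(open.back(), i));
            open.pop_back();
        }
    }
    if (!open.empty()) throw std::invalid_argument("unbalanced '('");
    return pairs;
}

struct Accuracy {
    double sensitivity;  // correct pairs / pairs in the known structure
    double ppv;          // correct pairs / pairs in the predicted structure
};

// Compares one predicted structure against the known structure.
Accuracy compare_structures(const std::string& known,
                            const std::string& predicted) {
    const PairSet k = base_pairs(known);
    const PairSet p = base_pairs(predicted);
    std::size_t correct = 0;
    for (const auto& bp : p) {
        if (k.count(bp) > 0) ++correct;
    }
    Accuracy a;
    a.sensitivity = k.empty() ? 0.0 : static_cast<double>(correct) / k.size();
    a.ppv = p.empty() ? 0.0 : static_cast<double>(correct) / p.size();
    return a;
}
```

This compares a single molecule; how the per-molecule values are
combined over a whole data set is left open here.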
7. [optional] The thermodynamic file is one
text file, to be used as constraints corresponding to the thermodynamic
set (see paper). I have a file with all optical melting experiments
that I could find, in XML format. I need a piece of code that creates
the linear constraints corresponding to this file, for your model. The
code will be similar to the code to create the structural constraints.
TODO. I will provide a model soon; talk to me if you need this.
8. [optional] File that specifies which parameters are fixed and
which are variable.
Sometimes you might want to keep some parameters fixed at given values.
If so, start from a file like the initial parameters file, and replace
every value that you do NOT wish to keep fixed with the word "variable".
Here's an example in which the parameters with indices 205 and 259 have
fixed values, and all the others are variable: params_fix_205_259.txt
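A file in this format could be read with a C++ sketch like the one
below, where each line holds either a number (a fixed value) or the
word "variable". The names ParameterSpec and read_fixed_parameters are
hypothetical, and skipping blank lines is an assumption.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct ParameterSpec {
    bool fixed;    // true if the parameter keeps the value below
    double value;  // meaningful only when fixed is true
};

// Reads one entry per line: a number means "fixed at this value",
// the word "variable" means the parameter is free to change.
std::vector<ParameterSpec> read_fixed_parameters(std::istream& in) {
    std::vector<ParameterSpec> specs;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string token;
        if (!(fields >> token)) continue;  // blank line
        if (token == "variable") {
            specs.push_back(ParameterSpec{false, 0.0});
        } else {
            std::size_t used = 0;
            // std::stod throws std::invalid_argument on non-numeric input.
            const double value = std::stod(token, &used);
            if (used != token.size()) {
                throw std::runtime_error("unexpected entry: " + token);
            }
            specs.push_back(ParameterSpec{true, value});
        }
    }
    return specs;
}
```

The position of each entry corresponds to the position of the same
parameter in the initial parameters file.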
9. [optional] File with additional
constraints. Sometimes you need to specify constraints on some of the
parameters. In the following example, we want all dangling end
parameters to be negative or zero, and we want the 3' dangling ends
to be less than or equal to the 5' dangling ends: constraints_dangling_ends_fm363.txt