Constraint Generation algorithm for RNA parameter estimation

Constraint Generation (CG) - an algorithm for efficient RNA parameter estimation

CG version 1

What is CG
Copyright
Download CG
CG user manual
Running CG with other prediction programs
Download data sets
Prediction with CG parameters
History
Contact

What is CG

Constraint generation (CG) is the first computational approach to RNA free energy parameter estimation that can be efficiently trained on large sets of structural as well as thermodynamic data. Our constraint generation approach employs a novel iterative scheme, whereby the energy values are first computed as the solution to a constrained optimization problem. Then the newly-computed energy parameters are used to update the constraints on the optimization function, so as to better optimize the energy parameters in the next iteration.
Using our method on biologically sound data, we obtain revised parameters for the Turner99 energy model. We show that by using our new parameters, we obtain significant improvements in prediction accuracy over current state-of-the-art methods.
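The iterative scheme described above can be sketched as a generic loop. This is only a minimal illustration under stated assumptions, not the CG implementation: the names constraint_generation, solve and find_violation are hypothetical, the real optimization step is a quadratic program solved with CPLEX, and the real constraints come from RNA structure predictions. Here a single number stands in for the energy parameters and the constraints are simple lower bounds.

```python
from typing import Callable, List, Optional, Tuple


def constraint_generation(
    solve: Callable[[List[float]], float],
    find_violation: Callable[[float], Optional[float]],
    max_iterations: int = 100,
) -> Tuple[float, List[float]]:
    """Alternate between solving the constrained problem and adding constraints.

    `solve` and `find_violation` are hypothetical stand-ins for the
    optimization step and the prediction step, respectively.
    """
    constraints: List[float] = []
    theta = solve(constraints)          # parameters under the current constraints
    history = [theta]
    for _ in range(max_iterations):
        new_constraint = find_violation(theta)
        if new_constraint is None:      # nothing violated: stop iterating
            break
        constraints.append(new_constraint)
        theta = solve(constraints)      # re-optimize with the enlarged constraint set
        history.append(theta)
    return theta, history


# Toy problem (invented for illustration): stay as close as possible to PRIOR
# while satisfying lower bounds that are only discovered one at a time.
PRIOR = 0.0
HIDDEN_BOUNDS = [1.0, 2.5, 4.0]


def toy_solve(lower_bounds: List[float]) -> float:
    # Minimizer of (theta - PRIOR)^2 subject to theta >= b for every known bound b.
    return max([PRIOR] + lower_bounds)


def toy_find_violation(theta: float) -> Optional[float]:
    # Report the smallest hidden bound that theta does not yet satisfy, if any.
    violated = [b for b in HIDDEN_BOUNDS if b > theta]
    return min(violated) if violated else None


theta, history = constraint_generation(toy_solve, toy_find_violation)
print(theta, history)
```

In this toy run each iteration discovers one more bound, so the parameter moves step by step until no constraint is violated.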

Copyright

The CG algorithm and code are copyrighted under the GNU General Public License by Mirela Andronescu, Anne Condon and Holger Hoos, Department of Computer Science, University of British Columbia.

Disclaimer:
Although the authors have made every effort to ensure that CG correctly implements the underlying models and fulfills the functions described in the documentation, neither the authors nor the University of British Columbia guarantee its correctness, fitness for a particular purpose, or future availability.

Reference:
If you use CG for your work or publications, we kindly ask you to let us know, and to cite the following reference:

M. Andronescu, A. Condon, H.H. Hoos, D.H. Mathews and K.P. Murphy, "Efficient parameter estimation for RNA secondary structure prediction", Bioinformatics (in press), ISMB/ECCB 2007.

Download CG

This version is CG 1.0.
CG was implemented and tested on SUSE 10.1 and SUSE 9.1. It uses several packages and compilers:

ILOG CPLEX version 9.1. All the quadratic program optimizations are performed with CPLEX. An early version also worked with Matlab; however, that version has not been tested for a long time and is probably incomplete, since the package has changed considerably in the meantime.
MultiRNAFold version 1.5 or later must be downloaded and successfully compiled. This package is used for the RNA secondary structure prediction required at each CG iteration. See the MultiRNAFold download page.
G++ version 4.1 is used to compile MultiRNAFold and several C++ files in the CG package.
LAM MPI version 7.1.2 is used by CG to do the prediction on several processors in a load-balanced way.
Perl version 5.8.0 is used for the main script, which runs each component and parses the results.

Once you have all the above ready, you can download and run CG.
Also make sure you read the user manual file.

CG has been tested with the versions mentioned above. Other versions may work too. Please let me know if you encounter any problems.

Download data sets

The following sets have been used to train and test CG.

Note: these files do not contain the Turner99 predictions. The CG package contains these data sets together with the Turner99 predictions.

The format of the following 4 files is:

documentation line;
RNA sequence, all on one line;
RNA secondary structure in dot-parentheses format, all on one line;
optionally, a "restricted string", which results from processing the original structures;
an empty line.
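A small reader for this record format might look like the sketch below. It is an illustration only: the function name parse_records and the dictionary keys are hypothetical, the sample records are invented, and the fourth line of the second record is just a placeholder, since the alphabet of restricted strings is not described here. The only check applied is that the sequence and the dot-parentheses structure have the same length.

```python
from typing import Dict, List, Optional


def parse_records(text: str) -> List[Dict[str, Optional[str]]]:
    """Split blank-line-separated records into their 3 or 4 fields."""
    records: List[Dict[str, Optional[str]]] = []
    block: List[str] = []
    # The extra "" makes sure the last record is flushed even without
    # a trailing empty line in the input.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if line:
            block.append(line)
            continue
        if not block:
            continue                      # tolerate repeated empty lines
        if len(block) not in (3, 4):
            raise ValueError(f"expected 3 or 4 lines per record, got {len(block)}")
        doc, sequence, structure = block[:3]
        if len(sequence) != len(structure):
            raise ValueError(f"sequence/structure length mismatch in: {doc}")
        records.append({
            "doc": doc,
            "sequence": sequence,
            "structure": structure,
            "restricted": block[3] if len(block) == 4 else None,
        })
        block = []
    return records


# Invented sample data, for illustration only.
sample = """\
hypothetical hairpin 1
GGGAAAUCCC
(((....)))

hypothetical hairpin 2
GCGCAAAAGCGC
((((....))))
((((____))))
"""

records = parse_records(sample)
print(len(records), records[0]["sequence"])
```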

S-Full - a "structural data set" containing 1660 currently available RNA sequences and known secondary structures.
The entire S-Processed set (including training, test and validation) - a "structural data set" containing 5273 structures (most of them restricted), processed from known RNA secondary structures. S-Full and S-Processed overlap, although the structures were processed differently, and many of the S-Processed structures are restricted and usually shorter. We split this set into three:
- S-Processed-TRA used for training (this is what we called in the paper S-Processed), with 3439 structures;
- S-Processed-VAL used for validation, with 858 structures;
- S-Processed-TES used for testing, with 974 structures.
S-151Rfam - a "structural data set" containing 151 structures from Rfam, as collected by Do et al. for CONTRAfold training.
T-Processed - a "thermodynamic data set" containing 207 triples, the subset of T-Full that consists of single sequences (as opposed to duplexes).

An XML file with the T-Full data, a "thermodynamic data set" containing 946 RNA sequences, secondary structures and free energies obtained from optical melting experiments. This file contains full information about the papers from which the data were collected.

Prediction with CG parameters

To do RNA secondary structure prediction with the new ISMB/ECCB 2007 energy parameters obtained with CG, download MultiRNAFold-1.6, and run simfold with the -p option as in the example.
These are the best parameters we obtained; they give a 7% better F-measure than the Turner99 parameters when tested on S-Full. Note that these parameters are currently not in the right format to be used with other software packages (such as Mfold, the Vienna RNA Package or RNAstructure). For now, you can use them with SimFold (which is part of MultiRNAFold), or write your own conversion script.
The features corresponding to each parameter can be found in this file.
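For readers who want to reproduce an accuracy comparison, the sketch below computes an F-measure between a known and a predicted structure in dot-parentheses format. It assumes the common definition, the harmonic mean of sensitivity and positive predictive value over base pairs; this page does not spell out the definition, so treat that as an assumption. The function names base_pairs and f_measure are hypothetical and are not part of MultiRNAFold.

```python
from typing import Set, Tuple


def base_pairs(structure: str) -> Set[Tuple[int, int]]:
    """Return the set of (i, j) index pairs encoded by matching parentheses."""
    stack = []
    pairs: Set[Tuple[int, int]] = set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def f_measure(reference: str, predicted: str) -> float:
    """Harmonic mean of sensitivity and positive predictive value (assumed definition)."""
    ref = base_pairs(reference)
    pred = base_pairs(predicted)
    correct = len(ref & pred)            # correctly predicted base pairs
    if correct == 0:
        return 0.0
    sensitivity = correct / len(ref)     # fraction of known pairs that were predicted
    ppv = correct / len(pred)            # fraction of predicted pairs that are known
    return 2 * sensitivity * ppv / (sensitivity + ppv)


print(f_measure("((..))", "(....)"))
```

In this example one of the two known pairs is predicted and the single predicted pair is correct, so sensitivity is 0.5 and positive predictive value is 1.0.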

History

June 20, 2007: Uploaded the first version 1.0, used in the ISMB/ECCB 2007 paper.

Contact

We would like to know who is using our package and for what purpose, whether you find it useful, and any other feedback you may have. We would appreciate it if you sent us this information.

If you have any questions/suggestions/comments/concerns or you find bugs, please contact Mirela Andronescu: andrones at cs dot ubc dot ca.

Thank you for your interest in Constraint Generation!

CG version 2 (latest version)
