|
What is CG
Constraint generation (CG) is the
first computational approach to RNA free energy parameter estimation
that can be efficiently trained on large sets of structural as well as
thermodynamic data. Our constraint generation approach employs a novel
iterative scheme, whereby the energy values are first computed as the
solution to a constrained optimization problem. Then the newly-computed
energy parameters are used to update the constraints on the
optimization function, so as to better optimize the energy parameters
in the next iteration.
Using our method on biologically sound data, we obtain revised
parameters for the Turner99 energy model. We show that by using
our new parameters, we obtain significant improvements in prediction
accuracy over current state-of-the-art methods.
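The iterative scheme described above can be illustrated with a toy sketch. This is not the CG implementation: CG solves a quadratic program with CPLEX and predicts structures with MultiRNAFold, whereas the sketch below assumes a linear energy model over hypothetical feature vectors, a small fixed list of candidate structures in place of a real predictor, and a plain subgradient step in place of the quadratic program. All function names, the margin, and the numeric values are illustrative assumptions.

```python
def energy(theta, feats):
    """Linear energy model: dot product of parameters and feature counts."""
    return sum(t * f for t, f in zip(theta, feats))


def predict(theta, candidates):
    """Stand-in for a structure predictor: pick the lowest-energy candidate."""
    return min(candidates, key=lambda feats: energy(theta, feats))


def fit(theta0, constraints, reg=0.1, margin=1.0, lr=0.01, steps=2000):
    """Approximately minimize
         reg * ||theta - theta0||^2
         + sum over constraints of max(0, E(known) - E(other) + margin)
    by subgradient descent (an assumed substitute for a QP solver)."""
    theta = list(theta0)
    for _ in range(steps):
        # Gradient of the regularization term.
        grad = [2.0 * reg * (t - t0) for t, t0 in zip(theta, theta0)]
        for known, other in constraints:
            # A constraint contributes only while it is violated.
            if energy(theta, known) - energy(theta, other) + margin > 0:
                for i, (k, o) in enumerate(zip(known, other)):
                    grad[i] += k - o
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def constraint_generation(theta0, examples, max_iters=10):
    """examples: list of (known_features, candidate_feature_list) pairs."""
    theta, constraints = list(theta0), []
    for _ in range(max_iters):
        added = False
        for known, candidates in examples:
            pred = predict(theta, candidates)
            # A wrong prediction yields a new constraint:
            # the known structure should have lower energy than it.
            if pred != known and (known, pred) not in constraints:
                constraints.append((known, pred))
                added = True
        if not added:
            break
        # Re-estimate the parameters under all constraints collected so far.
        theta = fit(theta0, constraints)
    return theta, constraints


# Hypothetical features: (stacked pairs, unpaired loop bases).
theta0 = [-1.0, 1.0]
known = (3, 3)
candidates = [(1, 0), (3, 3), (0, 0)]
theta, constraints = constraint_generation(theta0, [(known, candidates)])
print(predict(theta0, candidates), predict(theta, candidates))
```

With these toy numbers, the initial parameters pick the candidate (1, 0), and after one round of constraint generation the refitted parameters should pick the known structure (3, 3).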
Copyright
The CG algorithm and code are copyrighted under the GNU General Public License
by Mirela Andronescu, Anne Condon and Holger Hoos, Department of
Computer Science, University of British Columbia.
Disclaimer:
Although the authors have made every effort to ensure that CG
correctly implements the underlying models and fulfills the
functions described in the documentation, neither the authors nor the
University of British Columbia guarantee its correctness, fitness for a
particular purpose, or future availability.
Reference:
If you use CG for your work or publications, we kindly ask you to let
us know, and to cite the following reference:
M. Andronescu, A. Condon, H.H. Hoos, D.H. Mathews and K.P. Murphy,
"Efficient parameter estimation for RNA secondary structure
prediction", Bioinformatics (in press), ISMB/ECCB 2007.
Download CG
This version is CG 1.0.
CG was implemented and tested on SUSE 10.1 and SUSE 9.1. It uses
several packages and compilers:
- ILOG CPLEX version 9.1. All the quadratic program optimizations are
performed with CPLEX. An early version also worked with Matlab;
however, that version has not been tested for a long time and is
probably incomplete, since the package has changed considerably in the
meantime.
- MultiRNAFold version 1.5+ has to be downloaded and successfully
compiled. This package is used for RNA secondary structure prediction,
which is necessary at each CG iteration. See the MultiRNAFold download
page.
- G++ version 4.1 is used to compile MultiRNAFold and several C++ files
in the CG package.
- LAM MPI version 7.1.2 is used by CG to run the predictions on several
processors in a load-balanced way.
- Perl version 5.8.0 is used for the main script, which runs each
component and parses the results.
Once you have all the above ready, you can download and run CG.
Also make sure you read the user manual file.
CG has been tested with the versions mentioned above. Other versions
may work too. Please let me know if you encounter any problems.
Download data
The following sets have been used to train and test CG.
Note: these files do not contain the Turner99 predictions. The CG
package contains these data-sets with Turner99 predictions.
Each record in the following 4 files consists of:
- a documentation line;
- the RNA sequence, all on one line;
- the RNA secondary structure in dot-parentheses format, all on
one line;
- optionally, a "restricted string", which results from
processing the original structures;
- an empty line.
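A minimal sketch of a reader for this record format is shown below. It assumes records are separated by blank lines and that the sequence and the dot-parentheses structure have equal length; the restricted string is kept as opaque text because its alphabet is not described here. The function name `parse_records`, the dictionary keys, and the sample data (including the ">" prefix on the documentation lines) are hypothetical and not taken from the real files.

```python
import re


def parse_records(text):
    """Parse blank-line-separated records into a list of dictionaries."""
    if not text.strip():
        return []
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3:
            raise ValueError(f"incomplete record: {lines!r}")
        doc, sequence, structure = lines[:3]
        if len(sequence) != len(structure):
            raise ValueError(f"length mismatch in record: {doc!r}")
        records.append({
            "doc": doc,
            "sequence": sequence,
            "structure": structure,
            # The restricted string is optional; None when absent.
            "restricted": lines[3] if len(lines) > 3 else None,
        })
    return records


# Made-up sample; the fourth line of the second record is only a
# placeholder for a restricted string.
sample = """\
> hypothetical hairpin 1
GGGAAAUCCC
(((....)))

> hypothetical hairpin 2
GCGCAAAAGCGC
((((....))))
((((....))))
"""
records = parse_records(sample)
print(len(records), records[1]["restricted"])
```

Raising on a length mismatch is a design choice for the sketch; a more lenient reader could instead log and skip such records.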
- S-Full - a "structural data set" containing 1660 currently available
RNA sequences and known secondary structures.
- The entire S-Processed set (including training, test and validation) -
a "structural data set" containing 5273 structures (most of them
restricted), processed from known RNA secondary structures. S-Full and
S-Processed overlap, although the structures have been processed in a
different way, and many of the S-Processed structures are restricted
and usually shorter. We split this set into three parts: training,
test and validation.
- S-151Rfam - a "structural data set" containing 151 structures from
Rfam, as collected by Do et al. for CONTRAfold training.
- T-Processed - a "thermodynamic data set" containing 207 triples, the
subset of T-Full that consists of single sequences (as opposed to
duplexes).
An XML file with the T-Full data is also available. T-Full is a
"thermodynamic data set" containing 946 RNA sequences, secondary
structures and free energies, obtained from optical melting
experiments. The file contains full information about the papers from
which the data were collected.
Prediction with CG parameters
- To do RNA secondary structure prediction with the new
ISMB/ECCB 2007 energy parameters obtained with CG, download MultiRNAFold-1.6, and
run simfold with the -p option as in the example.
- The best parameters we obtained give a 7% better F-measure than
the Turner99 parameters when tested on S-Full. Note that these
parameters are currently not in the right format to be used with other
software packages (such as Mfold, the Vienna RNA Package or
RNAstructure). For now, you can use them with SimFold (which is part of
MultiRNAFold), or write your own conversion script.
- The features corresponding to each parameter can be found in this
file.
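For readers who want to compare predictions themselves, here is a minimal sketch of an F-measure computation on dot-parentheses strings. It assumes the common definition for secondary structure prediction: the harmonic mean of sensitivity (correctly predicted base pairs over reference base pairs) and positive predictive value (correctly predicted base pairs over predicted base pairs). This page does not spell out the definition, so treat that as an assumption; the function names are hypothetical and the two structures are made up.

```python
def base_pairs(structure):
    """Return the set of (i, j) index pairs in a dot-parentheses string."""
    stack, pairs = [], set()
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % j)
            pairs.add((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def f_measure(reference, predicted):
    """Harmonic mean of sensitivity and positive predictive value."""
    ref, pred = base_pairs(reference), base_pairs(predicted)
    correct = len(ref & pred)
    if correct == 0:
        return 0.0
    sensitivity = correct / len(ref)
    ppv = correct / len(pred)
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Reference has 3 base pairs; the prediction recovers 2 of them
# and predicts no wrong pairs.
score = f_measure("(((....)))", "((......))")
print(round(score, 3))
```

In this example the sensitivity is 2/3 and the positive predictive value is 1, so the F-measure is 0.8.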
History
June 20, 2007: Uploaded the first version, 1.0, used in
the ISMB/ECCB 2007 paper.
Contact
We would like to know who is using our package, what you are using it
for, and whether you find it useful, along with any other feedback you
may have. We would appreciate it if you sent this information to us.
If you have any questions/suggestions/comments/concerns or you find
bugs, please contact Mirela Andronescu: andrones at cs dot ubc dot ca.
Thank you for your interest in Constraint Generation!
|