User manual for Constraint Generation (CG) version 1.0


Compilation and setup
Source files
Data files
Running CG
Output files
Running CG with other prediction programs



Compilation and setup
  • The package MultiRNAFold has to be downloaded and compiled. You may copy it into the same directory as CG, or modify the Makefile to point to where it is installed.
  • CG currently uses LAM-MPI for the prediction step at each iteration. You need a file called hostfile in the CG directory that lists the machines you want to use. Run lamboot -v hostfile to start the lamd daemons.
  • Type make to compile the C++ programs that are called by the main file.



Source files
  • CGlearn.pl is the main file. It calls various other files, outlined below.
  • options_*.txt (e.g. options_151Rfam.txt or options_best_ISMB.txt) are files that are given as input to the main file.
  • get_percentage_lb_ub.pl is a Perl script that creates files with the upper and lower bounds on the initial parameters.
  • The data directory contains training and test data sets, parameter files, and other files.
    • data/all_thermodynamic_constraints_fm363_l2norm.lp contains the CPLEX constraints coming from the thermodynamic set T-Full.
  • pick_best_training.pl is a script which picks the best parameter set according to the f-measure on the training set, and then tests this set on the provided test set. It is called by CGlearn.pl, but it can also be called separately.
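The selection step performed by pick_best_training.pl can be illustrated with a minimal Python sketch. It is not the script's actual code. It assumes the common definition of the f-measure for RNA secondary structures (the harmonic mean of sensitivity and positive predictive value over base pairs) and assumes that the score of a parameter set is the mean over all structures in the set; the function names base_pairs, f_measure, mean_f_measure, and pick_best are hypothetical.

```python
def base_pairs(structure):
    """Return the set of (i, j) index pairs encoded by a dot-parentheses string."""
    stack = []
    pairs = set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def f_measure(reference, predicted):
    """Harmonic mean of sensitivity and positive predictive value (assumed definition)."""
    ref = base_pairs(reference)
    pred = base_pairs(predicted)
    correct = len(ref & pred)
    if correct == 0:
        return 0.0
    sensitivity = correct / len(ref)   # fraction of reference pairs that were predicted
    ppv = correct / len(pred)          # fraction of predicted pairs that are correct
    return 2 * sensitivity * ppv / (sensitivity + ppv)


def mean_f_measure(structure_pairs):
    """Average f-measure over (reference, predicted) pairs (assumed aggregation)."""
    scores = [f_measure(ref, pred) for ref, pred in structure_pairs]
    return sum(scores) / len(scores)


def pick_best(scores_by_name):
    """Return the key with the highest score, e.g. a parameter file name."""
    return max(scores_by_name, key=scores_by_name.get)
```

Here pick_best would receive one training-set score per parameter set and return the name of the set to evaluate on the test set.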



Data files
  • The training files are called TRA_*.txt, and the test files are called TES_*.txt.  The format is:
    1. documentation line;
    2. RNA sequence, all on one line;
    3. RNA secondary structure in dot-parentheses format, all on one line;
    4. optionally, a "restricted string", which was derived by processing the original structures;
    5. the secondary structure predicted with the initial parameter set (usually Turner99);
    6. an empty line.
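The format above can be read with a short Python sketch such as the following. It is an illustration only and not part of CG. It assumes that a file holds one or more entries separated by empty lines, that an entry with five lines includes the restricted string as its fourth line, and that an entry with four lines omits it; the function name parse_records and the dictionary keys are hypothetical.

```python
def parse_records(text):
    """Split the content of a TRA_*.txt or TES_*.txt file into entries."""
    records = []
    block = []
    # The extra "" guarantees that the final entry is flushed even if the
    # file does not end with an empty line.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line:
            block.append(line)
            continue
        if not block:
            continue  # skip consecutive empty lines
        if len(block) not in (4, 5):
            raise ValueError("expected 4 or 5 lines per entry, got %d" % len(block))
        record = {
            "doc": block[0],
            "sequence": block[1],
            "structure": block[2],
            # Assumption: the restricted string is present only in 5-line entries.
            "restricted": block[3] if len(block) == 5 else None,
            "predicted": block[-1],
        }
        if len(record["sequence"]) != len(record["structure"]):
            raise ValueError("sequence and structure lengths differ in: " + block[0])
        records.append(record)
        block = []
    return records
```

A typical call would be parse_records(open(path).read()), where path names one of the training or test files.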



Running CG
  • Typing "perl CGlearn.pl" at the command prompt prints a usage message. The easiest way to run CG is to modify the provided options file options_151Rfam.txt and run "perl CGlearn.pl options_151Rfam.txt". Alternatively, you can give the options directly at the command prompt; see the usage message.
  • If, for some reason, CGlearn.pl stops during an iteration, you can run the same command again, and it will continue from the last completed iteration, so you don't have to start over.



Output files
  • CG creates an output directory whose name is derived from the input options provided. This directory contains several files of interest:
    • results_final.txt contains the accuracy of the prediction on the training set at each CG iteration. After all iterations are done, CG finds the best parameter set according to the f-measure on the training set, and tests it on the test set specified as an input option. This information is written at the end of results_final.txt.
    • params_sub-*.txt are the parameters estimated at each iteration.
    • output_verbose.txt contains the input options and a log of the run.
    • input_options.txt is just a copy of the input options file.