RNA STRAND v2.0 - The RNA secondary STRucture and statistical ANalysis Database

    [ Home | Search | Analyse | Submit structures | News | Help ]



This page describes the meaning of keywords or phrases used on the RNA STRAND web pages. The purpose is to clarify what information should be entered in the submissions, search, and analysis pages, and what information is presented in the search and analysis results pages. The individual entries are ordered consistently with the way they appear in the various input and output pages.


Submitter name:

The name of the person who submitted a secondary structure. The submitter is credited with the structure entry, using the name as entered upon submission.

Submitter email:

The email address of the person who submitted a molecule. This email address is also used to contact the submitter when confirming the molecule and structure. Note: it is stored internally, and is never made public or used for any other purpose.

Molecule ID or RNA STRAND ID:

A unique identifier associated with each entry in RNA STRAND.

Molecule name:

The molecule name may be derived from its function or the organism in which it is found, or from a file name or nomenclature of the database from which it was obtained. When submitting a molecule, the submitter may choose any suitable name.

Type:

The type of the molecule typically refers to the family of functionally related molecules to which it belongs, e.g. ribosomal RNA (rRNA) or transfer RNA (tRNA). Non-biological molecules are of "Synthetic" type.

If "Other" specify:

If the type is not among those currently supported by the database, it is listed as "Other". To choose a type, when submitting or searching the database, use a drop-down menu. If the "Other" type is chosen when submitting a molecule, please provide details about the type of the molecule.

Organism:

This refers to the scientific name of the organism in which the molecule is found. When submitting a molecule, providing the scientific name is optional but strongly encouraged for biological molecules. When searching the database, part of a scientific name may be entered, and database entries which contain the entered text as a substring are matches.

Source:

If the secondary structure is already contained in another (external) database, the source refers to that database. In other cases, such as when the structure is manually entered from a publication or taken from a web page that is not a database, the source is listed as "Other Sources". When submitting a structure for a molecule that appears in more than one source, only one source may be chosen from the drop-down menu. When searching the database, the search may be restricted to molecules from a particular source by choosing that source from a drop-down menu.

Source ID:

This refers to the ID of the molecule in the source database, if any (see previous entry), and is otherwise empty. The source ID may be used to conduct a search in our database, which can be convenient when working with a molecule whose ID in another database is on hand. For example, molecule 1626 in the RNA STRAND was taken from the RCSB Protein Data Bank (PDB), where it has ID 1nbs; searching on this ID in the RNA STRAND database enables the user to analyze the secondary structure of the molecule. When submitting a structure for a molecule that has a source database, please enter the ID of the molecule in that source.

Reference:

This refers to a publication that contains more detailed information about the molecule or its structure. When submitting a structure, enter the most pertinent reference here.

Validated by NMR or X-Ray:

If the secondary or 3D structure was determined using experimental methods, then this field should be "Yes". Otherwise this field should be "No", for example, if the structure was determined computationally.

Method for secondary structure determination:

This field gives information about how the secondary structure was determined. In cases where the secondary structure was determined from a 3D structure, details of the experimental method are listed (e.g. NMR, X-ray crystallography, chemical modification data) plus additional details on the relevant conditions or the degree of resolution of the method. In addition, if a secondary structure is obtained computationally (such as via the Simple MC-Annotate tool [1]), the algorithm or program used should be listed here. If the secondary structure is determined using comparative sequence analysis, this should be specified here, and if possible, a reference to details of the algorithm or alignment should be given.

Number of molecules in complex:

This is the number of RNA molecules that interact to form the secondary structure. In almost all instances it is 1. When searching the database, the user can select entries based on the number of molecules in the complex. If "Any number" is chosen from the drop-down menu, then complexes with any number of molecules are selected. If "Greater than or equal to" is chosen, then a number should be entered in the following field, and all complexes with a number of molecules greater than or equal to that number are selected. If "Less than or equal to" is chosen, then a number should be entered in the following field, and all complexes with a number of molecules less than or equal to that number are selected. If "Between" is selected, then numbers should be entered in both of the following fields, and the complexes with a number of molecules between the two numbers are selected.

Fragments used?

This is "Yes" if the entered sequence forms just a part of a larger molecule, and "No" if all the nucleotides of the molecule are listed.

Duplicates:

If two entries have the exact same sequence, this field is "No" for one of them, and "Yes" for the other one. This way, a user can easily select unique sequences on the Search page, or sequences which have duplicates. The entry having "Yes" in this field contains the RNA STRAND ID of the other identical sequence, in the field "Comments".

Sequence:

In the "Search" page, if a sequence is specified, then any entry is reported which contains the specified sequence as part of the entire sequence for that entry. Any of the IUPAC RNA codes (see below) are allowed (case does not matter). An asterisk (*) stands for any base repeated any number of times. For example, you can search for entries with the sequence AMGHGNUC*ACGNA.
    Symbol  Meaning            Explanation
    A       A                  adenine
    C       C                  cytosine
    G       G                  guanine
    U       U                  uracil
    M       A or C             amino
    R       A or G             purine
    W       A or U
    S       C or G
    Y       C or U             pyrimidine
    K       G or U             keto
    V       A or C or G        not U
    H       A or C or U        not G
    D       A or G or U        not C
    B       C or G or U        not A
    N       A or C or G or U   any
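As a rough illustration of how such a query could be matched, the sketch below translates an IUPAC query into a regular expression and tests it against a sequence. The function names are hypothetical and this is not the database's own implementation; in particular, treating * as "zero or more letters" is an assumption.

```python
import re

# IUPAC RNA codes from the table above, mapped to the bases they stand for.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "M": "AC", "R": "AG", "W": "AU", "S": "CG", "Y": "CU", "K": "GU",
    "V": "ACG", "H": "ACU", "D": "AGU", "B": "CGU", "N": "ACGU",
}


def iupac_to_regex(query):
    """Translate an IUPAC query (case-insensitive, '*' allowed) into a regex."""
    parts = []
    for symbol in query.upper():
        if symbol == "*":
            # Assumption: '*' matches zero or more letters of any kind.
            parts.append("[A-Z]*")
        elif symbol in IUPAC_RNA:
            bases = IUPAC_RNA[symbol]
            parts.append(bases if len(bases) == 1 else "[" + bases + "]")
        else:
            raise ValueError("not an IUPAC RNA code: " + symbol)
    return re.compile("".join(parts))


def sequence_matches(query, sequence):
    """Return True if the query occurs anywhere within the sequence."""
    return iupac_to_regex(query).search(sequence.upper()) is not None
```

For instance, sequence_matches("AMGHGNUC*ACGNA", "GGAAGCGAUCGGGGACGUAUU") returns True, because the query only has to match part of the entry's sequence.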


Abstract shape:

In the "Search" page, an abstract shape can be specified, following the RNAshapes representation (see http://bibiserv.techfak.uni-bielefeld.de/rnashapes/, and P. Steffen, B. Voß, M. Rehmsmeier, J. Reeder, and R. Giegerich. RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics, 22(4):500-503, 2006). We use three levels of abstract shape: level 5 (most abstract shape), level 3 (intermediate abstraction level) and level 1 (least abstract shape). The only characters allowed are brackets [], underscore _ and star *. Matching brackets denote an interrupted or uninterrupted stem (depending on the level of abstraction). Underscore _ denotes a loop, and star * means any characters any number of times. For example, to search for a tRNA shape, query for the abstract shape level 5 [[][][]]. To get a specified exact shape, no stars should be added. To get shapes that contain the specified shape, stars should be added.
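To give a feel for what a level 5 shape captures, the sketch below derives one from a pseudoknot-free dot-parentheses string. It assumes that level 5 drops all unpaired bases and merges stems that are nested directly inside one another (i.e. separated only by bulges or internal loops). The function name is hypothetical, and the RNAshapes package remains the authoritative definition.

```python
def level5_shape(dot_parentheses):
    """Return a level-5-style abstract shape for a pseudoknot-free structure."""
    root = []            # children of the external loop
    stack = [root]       # stack of currently open base pairs
    for symbol in dot_parentheses:
        if symbol == "(":
            node = []
            stack[-1].append(node)
            stack.append(node)
        elif symbol == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure: unexpected ')'")
            stack.pop()
        # Any other character (e.g. '.') is an unpaired base and is ignored.
    if len(stack) != 1:
        raise ValueError("unbalanced structure: missing ')'")

    def emit(node):
        # A pair that encloses exactly one other pair belongs to the same
        # (possibly interrupted) stem, so it adds no bracket of its own.
        while len(node) == 1:
            node = node[0]
        return "[" + "".join(emit(child) for child in node) + "]"

    return "".join(emit(child) for child in root)
```

Under these assumptions, a cloverleaf such as (((..((...))..((...))..((...))..))) gives [[][][]], the tRNA shape mentioned above.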

File upload:

When submitting a secondary structure, the structure needs to be uploaded as a file in one of three formats. In the "File upload" input field, the file name can be specified by browsing your file system.

Formats:

We refer to several file formats, described below:
  • RNAML format: This is an XML format that has been designed specifically to easily express data on RNA sequence and structure. Examples can be found here.

  • BPSEQ format: The file name should end with the suffix ".bpseq", as in "mystr.bpseq". The bpseq format is a simple text format in which there is one line per base in the molecule, listing the position of the base (leftmost position is 1), the base name (A, C, G, U, or other alphabetical characters), and the position number of the base to which it is paired, with a 0 denoting that the base is unpaired. For more information, see the Comparative RNA Web Site. An example is as follows:

    1 G 8 
    2 G 7 
    3 C 0 
    4 A 0 
    5 U 0  
    6 U 0 
    7 C 2 
    8 C 1 
    

    For complexes with more than one molecule, the molecules are listed one after another, with the base numbers of each successive molecule continuing in order from the previous molecule.

  • CT format: The first line contains the sequence length L. There are L subsequent lines, one per nucleotide. The ith line starts with i, then the letter denoting the ith nucleotide, then the 5'-connecting base index (i-1), then the 3'-connecting base index (i+1, or 0 for the last nucleotide), then the paired base index (or 0 if unpaired), and finally the base index in the original sequence. For example, the structure represented in bpseq format above would be represented in ct format as follows:

         8
         1  G  0  2  8  1
         2  G  1  3  7  2
         3  C  2  4  0  3
         4  A  3  5  0  4
         5  U  4  6  0  5
         6  U  5  7  0  6
         7  C  6  8  2  7
         8  C  7  0  1  8
    

    The CT format is better than BPSEQ for complexes of two or more RNA molecules because the boundaries can be expressed explicitly. If the i-th line corresponds to the first nucleotide in one of the molecules, the third column is 0. If the i-th line corresponds to the last nucleotide in one of the molecules, the fourth column is 0. This is an example of a short duplex stem, i.e. composed of two molecules:

         4
         1  G  0  2  4  1
         2  G  1  0  3  2
         3  C  0  4  2  3
         4  C  3  0  1  4
    

  • Dot-parentheses format: In this format, the sequence is given first, from 5' to 3' end, on lines 50-characters long. Then, the structure is given, with matching parentheses ( and ) denoting a base pair, and a dot denoting an unpaired base, on lines 50-characters long.
    For example, the structure represented in bpseq format above would be represented in dot-parentheses format as follows:

         GGCAUUCC
         ((....))
    

    To add pseudoknotted base pairs, matching letters can be used to indicate matching base pairs, with the upper case form of the letter used closest to the 5' end, and the lower case form used closest to the 3' end, as in the following example:

         GGGCACACUUCCCUGUG
         (((..AAA..))).aaa
    

  • PDB format: This is a standard representation provided by the Protein Data Bank for macromolecular structure data derived from X-ray diffraction and NMR studies. Details about the format can be found here.
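The relationship between the BPSEQ, CT and dot-parentheses formats described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single molecule, no pseudoknots in the dot-parentheses output, and no header lines in the BPSEQ input. All function names are hypothetical.

```python
def parse_bpseq(text):
    """Return (sequence, partners); partners[i-1] is the partner of base i, 0 if unpaired."""
    bases, partners = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        position, base, partner = int(fields[0]), fields[1], int(fields[2])
        if position != len(bases) + 1:
            raise ValueError("positions must be consecutive, starting at 1")
        bases.append(base)
        partners.append(partner)
    return "".join(bases), partners


def to_ct(sequence, partners):
    """Write the CT lines for a single molecule (no internal strand boundaries)."""
    length = len(sequence)
    lines = [str(length)]
    for i, (base, partner) in enumerate(zip(sequence, partners), start=1):
        three_prime = i + 1 if i < length else 0  # 0 marks the last nucleotide
        lines.append(f"{i} {base} {i - 1} {three_prime} {partner} {i}")
    return "\n".join(lines)


def to_dot_parentheses(sequence, partners):
    """Write sequence and structure lines; assumes a pseudoknot-free structure."""
    symbols = []
    for i, partner in enumerate(partners, start=1):
        if partner == 0:
            symbols.append(".")
        elif partner > i:
            symbols.append("(")
        else:
            symbols.append(")")
    return sequence + "\n" + "".join(symbols)


example = """1 G 8
2 G 7
3 C 0
4 A 0
5 U 0
6 U 0
7 C 2
8 C 1"""
sequence, partners = parse_bpseq(example)
print(to_ct(sequence, partners))
print(to_dot_parentheses(sequence, partners))
```

Run on the BPSEQ example above, this prints the same CT fields as the CT example (with single spaces between columns) followed by the two dot-parentheses lines. Line wrapping at 50 characters and multi-molecule boundaries are deliberately left out.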


Length:

This refers to the number of bases in a molecule. For complexes with more than one molecule, the sum of the lengths of the molecules is taken. When searching the database, the user can select for molecules in a certain length range. If "Any length" is chosen from the drop-down menu, then molecules of any length are selected. If "Greater than or equal to" is chosen, then a number should be entered in the following field, and all molecules of length greater than or equal to that number are selected. If "Less than or equal to" is chosen, then a number should be entered in the following field, and all molecules of length less than or equal to that number are selected. If "Between" is selected, then numbers should be entered in both of the following fields, and the molecules of length between the two numbers are selected.

Number of stems:

This is the number of stems in the molecule. A stem (or uninterrupted stem) is a sequence of base pairs, with no interspersed internal or bulge loops. We include not only canonical (Watson-Crick and wobble G-U) base pairs, but also non-canonical base pairs and base pairs involving modified bases, in our stems. Base pairs involving at least one base which is not A, C, G, or U (i.e. modified bases or bases represented using general characters of the IUPAC code) are reported as "other base pair" in the analyzer's output. We also include stems of length one, sometimes called isolated base pairs, in our secondary structure.

Stem length:

The stem length is the number of base pairs in the stem. In the "Search" page, if a range is specified for "Stem length, per stem", then any molecule that contains at least one stem whose length is in the specified range is reported. For example, if "Stem length, per stem greater than 2" is specified, then any molecule that has at least one stem whose length is greater than 2 is reported.
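The stem definition above can be illustrated with a short sketch that lists the length of every uninterrupted stem, given the pairing partner of each base (1-based, 0 for unpaired, as in the BPSEQ format). The function name is hypothetical, the analyzer's own code may differ, and strand boundaries in multi-molecule complexes are ignored here.

```python
def stem_lengths(partners):
    """Return the length (in base pairs) of each uninterrupted stem."""
    pair_of = {i: p for i, p in enumerate(partners, start=1) if p != 0}
    lengths = []
    for i, j in sorted(pair_of.items()):
        if i > j:
            continue  # visit each base pair once, from its 5' base
        if pair_of.get(i - 1) == j + 1:
            continue  # (i, j) stacks on (i-1, j+1), so it is not the outermost pair
        length = 1
        # Extend inwards while the next pair stacks directly on the previous one.
        while i + length < j - length and pair_of.get(i + length) == j - length:
            length += 1
        lengths.append(length)
    return lengths
```

With this, the number of stems is len(stem_lengths(partners)), and a per-stem search such as "stem length greater than 2" amounts to any(n > 2 for n in stem_lengths(partners)). Isolated base pairs are counted as stems of length 1, as described above.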

Estimated free energy:

The estimated free energy of each stem was computed using the nearest-neighbour thermodynamic parameters, as defined in the following paper: Xia T, SantaLucia J Jr, Burkard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry. 1998 Oct 20;37(42):14719-35.

Number of hairpin loops:

This is the number of hairpin loops in the molecule. A hairpin loop has one closing base pair. For complexes of two or more interacting molecules, the analyzer may report loops at the "junctions" of molecules; thus when analyzing loops, it is advisable to exclude complexes with two or more molecules.

Hairpin loop sequence:

In the "Search" page, if a sequence is specified for "Hairpin loop sequence", then any molecule is reported which contains at least one hairpin whose closing base pair and unpaired bases, from the 5' end to the 3' end, exactly match that sequence. Any IUPAC code can be used (see below, case does not matter). For example, to search for GNRA hairpin loops closed by any base pair, the user will enter the sequence NGNRAN.
    Symbol  Meaning            Explanation
    A       A                  adenine
    C       C                  cytosine
    G       G                  guanine
    U       U                  uracil
    M       A or C             amino
    R       A or G             purine
    W       A or U
    S       C or G
    Y       C or U             pyrimidine
    K       G or U             keto
    V       A or C or G        not U
    H       A or C or U        not G
    D       A or G or U        not C
    B       C or G or U        not A
    N       A or C or G or U   any
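As an illustration of what is compared in a hairpin loop sequence search, the sketch below extracts, for every hairpin, the closing base pair together with the unpaired bases between them, read from 5' to 3'. The function name is hypothetical and the input is a sequence plus a list of pairing partners (1-based, 0 for unpaired).

```python
import re


def hairpin_loop_sequences(sequence, partners):
    """Return each hairpin as closing base + unpaired bases + closing base."""
    loops = []
    for i, j in enumerate(partners, start=1):
        # (i, j) closes a hairpin if every base strictly between i and j is unpaired.
        if j > i and all(p == 0 for p in partners[i:j - 1]):
            loops.append(sequence[i - 1:j])
    return loops


# NGNRAN written out by hand as a regular expression (N = any base, R = A or G).
GNRA_WITH_CLOSING_PAIR = re.compile("[ACGU]G[ACGU][AG]A[ACGU]")

loops = hairpin_loop_sequences("GGCGAAAGCC", [10, 9, 8, 0, 0, 0, 0, 3, 2, 1])
print([loop for loop in loops if GNRA_WITH_CLOSING_PAIR.fullmatch(loop)])
```

Here the only hairpin is CGAAAG (closing pair C-G around the loop GAAA), which matches NGNRAN exactly.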


Number of free bases in hairpin loops:

In the "Search" page, if a range is specified for "Number of free bases in hairpin loops, per hairpin loop", then any molecule that contains at least one hairpin loop with a number of unpaired bases in the specified range is reported.

Number of bulges:

This is the number of bulge loops in the molecule. A bulge loop has two closing base pairs i.j, i'.j', with either i and i' adjacent or j and j' adjacent (but not both). For complexes of two or more interacting molecules, the analyzer may report loops at the "junctions" of molecules; thus when analyzing loops, it is advisable to exclude complexes with two or more molecules.

Number of free bases in bulge loops, per bulge loop:

In the "Search" page, if a range is specified for "Number of free bases in bulge loops, per bulge loop", then any molecule that contains at least one bulge loop with a number of unpaired bases in the specified range is reported.

Number of internal loops:

This is the number of internal loops in the molecule. An internal loop has two closing base pairs i.j, i'.j', with neither i and i' adjacent nor j and j' adjacent, i.e. there are unpaired bases on each side of the loop. For complexes of two or more interacting molecules, the analyzer may report loops at the "junctions" of molecules; thus when analyzing loops, it is advisable to exclude complexes with two or more molecules.

Number of free bases in internal loops:

In the "Search" page, if a range is specified for "Number of free bases in internal loops, per internal loop", then any molecule that contains at least one internal loop with a number of unpaired bases in the specified range is reported.

Internal loop absolute asymmetry:

The absolute asymmetry of an internal loop is the absolute value of the difference between the number of unpaired bases on each side of the loop. The relative asymmetry of an internal loop is the ratio of the number of unpaired bases on the larger side of the loop to the number of unpaired bases on the smaller side of the loop. In the "Search" page, if absolute or relative asymmetry is specified, then all molecules which have at least one internal loop whose absolute or relative asymmetry fits the search criteria are reported.
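A small worked sketch of the two asymmetry measures, with a hypothetical function name; it takes the number of unpaired bases on each side of the internal loop.

```python
def internal_loop_asymmetry(unpaired_side_a, unpaired_side_b):
    """Return (absolute asymmetry, relative asymmetry) of an internal loop."""
    larger = max(unpaired_side_a, unpaired_side_b)
    smaller = min(unpaired_side_a, unpaired_side_b)
    if smaller < 1:
        # An internal loop has unpaired bases on both sides by definition.
        raise ValueError("each side of an internal loop needs at least one unpaired base")
    return larger - smaller, larger / smaller
```

For a loop with 1 unpaired base on one side and 3 on the other, this gives an absolute asymmetry of 2 and a relative asymmetry of 3.0; a symmetric 2x2 loop gives 0 and 1.0.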

Number of multi-loops:

This is the number of multi-loops in the molecule. A multi-loop has at least three closing base pairs. For complexes of two or more interacting molecules, the analyzer may report loops at the "junctions" of molecules; thus when analyzing loops, it is advisable to exclude complexes with two or more molecules.

Number of free bases in multi-loops, per molecule:

In the "Search" page, if a range is specified for "Number of free bases in multi-loops, per multi-loop", then any molecule that contains at least one multi-loop with a number of unpaired bases in the specified range is reported.

Multi-loop asymmetry:

In a multi-loop with k branches, there are k sequences of unpaired bases between branches (some sequences may be empty). Suppose that the lengths of these k sequences, in increasing order of length, are L1, L2, ..., Lk. Then, for the multi-loop,

  • the maximum absolute asymmetry is Lk - L1
  • the minimum absolute asymmetry is min [Li - L{i-1}], taken over i between 2 and k
  • the average absolute asymmetry is the average of the values Lj - Li, taken over all i,j between 2 and k, where j > i

  • the maximum relative asymmetry is Lk/Li, where Li is the smallest positive length
  • the minimum relative asymmetry is min [Li/L{i-1}], taken over i between 2 and k, or 0 if L1 = 0
  • the average relative asymmetry is the average of the values Lj/Li, taken over all i,j between 2 and k, where j > i and the ratio Lj/Li is defined to be 0 if Li = 0.

In the "Search" page, if a range is specified for the "Multi-loop ... asymmetry, per multi-loop" option, then all molecules are reported which have at least one multi-loop whose specified asymmetry lies in the specified range.
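The maximum and minimum asymmetry definitions listed above can be sketched as follows, given the lengths of the k unpaired stretches of one multi-loop. The function name is hypothetical, the averages are left out, and returning 0 for the maximum relative asymmetry when every stretch is empty is an assumption not covered by the definitions.

```python
def multiloop_asymmetry(unpaired_lengths):
    """Return max/min absolute and relative asymmetry of one multi-loop."""
    lengths = sorted(unpaired_lengths)          # L1 <= L2 <= ... <= Lk
    if len(lengths) < 2:
        raise ValueError("need at least two unpaired stretches")
    consecutive = list(zip(lengths, lengths[1:]))  # (L{i-1}, Li) for i = 2..k

    max_absolute = lengths[-1] - lengths[0]
    min_absolute = min(b - a for a, b in consecutive)

    positive = [length for length in lengths if length > 0]
    # Assumption: if every stretch is empty, report 0 rather than dividing by zero.
    max_relative = lengths[-1] / positive[0] if positive else 0.0
    if lengths[0] == 0:
        min_relative = 0.0                       # defined as 0 when L1 = 0
    else:
        min_relative = min(b / a for a, b in consecutive)

    return {
        "max_absolute": max_absolute,
        "min_absolute": min_absolute,
        "max_relative": max_relative,
        "min_relative": min_relative,
    }
```

For stretches of lengths 2, 0, 4 and 1 (sorted: 0, 1, 2, 4) this gives a maximum absolute asymmetry of 4, a minimum absolute asymmetry of 1, a maximum relative asymmetry of 4.0 and a minimum relative asymmetry of 0.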

Multi-loop number of branches:

The multi-loop number of branches is the number of stems that meet in a multibranched loop. For example, a typical transfer RNA multi-loop has four branches.

Non-canonical base pairs:

This is the number of non-canonical base pairs per molecule. It does not include base pairs with modified nucleotides. We consider that all C-G, A-U and G-U pairs are canonical base pairs, and all other base pairs are non-canonical base pairs. However, note that from the point of view of the planar edge-to-edge hydrogen bonding interaction, there are C-G, A-U and G-U base pairs that do not interact via Watson-Crick edges, and vice-versa. This information cannot be obtained from comparative structure analysis data, and is beyond the purposes of this work.
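The classification described here, together with the "other base pair" category mentioned under Number of stems, can be sketched as a small helper. The function name and the returned labels are hypothetical.

```python
STANDARD_BASES = set("ACGU")
CANONICAL_PAIRS = {frozenset("CG"), frozenset("AU"), frozenset("GU")}


def classify_base_pair(base_a, base_b):
    """Classify a base pair as 'canonical', 'non-canonical' or 'other'."""
    a, b = base_a.upper(), base_b.upper()
    if a not in STANDARD_BASES or b not in STANDARD_BASES:
        # Modified bases or general IUPAC characters.
        return "other"
    if frozenset((a, b)) in CANONICAL_PAIRS:
        return "canonical"
    return "non-canonical"
```

For example, classify_base_pair("G", "U") is "canonical", classify_base_pair("A", "G") is "non-canonical", and classify_base_pair("A", "N") is "other".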

Number of pseudoknots:

This is the number of pseudoknots in the molecule. A pseudoknot has base pairs i.j and i'.j' which overlap, i.e. i < i' < j < j'. Since we include stems of length 1 (i.e. isolated base pairs) in our secondary structures (see Number of Stems), note that a pseudoknot may contain a pseudoknotted stem of length 1, and a pseudoknot may be broken by removing just one base pair.

Precise definitions of pseudoknots, bands, and related pseudoknotted structures can be found in: Linear Time Algorithm for Parsing RNA Secondary Structure, (B. Rastegari and A. Condon), 5th Workshop on Algorithms in Bioinformatics (WABI), Oct 2005.
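The overlap condition i < i' < j < j' can be checked directly, as in the sketch below. It only lists crossing base pairs and does not group them into pseudoknots or bands as done in the paper by Rastegari and Condon. The function name is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def crossing_pairs(partners):
    """Return all pairs of base pairs ((i, j), (i2, j2)) with i < i2 < j < j2."""
    pairs = [(i, j) for i, j in enumerate(partners, start=1) if j > i]
    crossings = []
    for index, (i, j) in enumerate(pairs):
        # pairs is ordered by 5' base, so every later pair has i2 > i.
        for i2, j2 in pairs[index + 1:]:
            if i2 < j < j2:
                crossings.append(((i, j), (i2, j2)))
    return crossings


# Partners for the pseudoknotted dot-parentheses example (((..AAA..))).aaa
pseudoknotted = [13, 12, 11, 0, 0, 17, 16, 15, 0, 0, 3, 2, 1, 0, 8, 7, 6]
print(len(crossing_pairs(pseudoknotted)))
```

A structure is pseudoknot free exactly when this list is empty; for the nested hairpin ((....)) it is empty, while in the example above each of the three ( ) pairs crosses each of the three A a pairs.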

Number of bands/un-bands/in-bands/unpaired bases, per pseudoknot:

Roughly, a band is a stem, or stems interspersed with bulge or interior or multiloops, which crosses another stem in a pseudoknot (see paper by Rastegari and Condon for details; we additionally use "un-band" loops to refer to closed regions within a pseudoloop as defined in that paper, and "in-band" loops to refer to pseudo-interior and pseudo-multi loops as defined in that paper).

In the "Search" page, if a range is specified for "Number of bands/un-bands/in-bands/unpaired bases, per pseudoknot", then any molecule that contains at least one pseudoknot with a number of bands/un-bands/in-bands/unpaired bases in the specified range is reported.

Number of unpaired bases:

This is the number of unpaired bases in the molecule. The number of unpaired bases plus the number of paired bases is the length of the molecule.

Number of paired bases:

This is the number of paired bases in the molecule. The number of unpaired bases plus the number of paired bases is the length of the molecule.

Minimum number of bands/base pairs to remove in order to break the pseudoknot:

This is the minimum number of bands or base pairs whose removal leaves the secondary structure pseudoknot free.

In the "Search" page, if a range is specified for "Minimum number of bands/base pairs to remove in order to break the pseudoknot", then any molecule is reported that contains at least one pseudoknot for which the number of bands/base pairs to remove in order to break the pseudoknot is in the specified range.

Several ways to remove minimum number of base pairs:

In the "Search" page, the user may restrict the search to report only structures which have the property that, for at least one pseudoknot in the structure, there is more than one way to remove a minimum number of base pairs, in order to break the pseudoknot. To do this, "Yes" is chosen. Alternatively, by choosing "No", the search is restricted to report only structures which have the property that, for at least one pseudoknot in the structure, there is only one way to remove a minimum number of base pairs, in order to break the pseudoknot.

Length of pseudoknot:

The length of a pseudoknot is the number of nucleotides spanning the pseudoknot.

Number of domains:

This is the number of branches (or pseudoknotted regions) in the external loop.

Average stem length:

The length of a stem is the number of base pairs in the stem. The average is taken over all stems in the selected molecule or set of molecules.

Average hairpin loop length:

The length of a hairpin loop is the number of bases (all unpaired) strictly between the bases in the closing base pair of the hairpin. The average is taken over all hairpins in the selected molecule or set of molecules.

Average bulge length:

The length of a bulge loop is the number of (unpaired) bases in a bulge loop. The average is taken over all bulges in the selected molecule or set of molecules.

Average internal loop length:

The length of an internal loop is the number of (unpaired) bases in the internal loop. The average is taken over all internal loops in the selected molecule or set of molecules.

Average multi loop length:

The length of a multi loop is the number of (unpaired) bases in a multi loop. The average is taken over all multi loops in the selected molecule or set of molecules.

Average band length:

The length of a band is the number of base pairs in the band. The average is taken over all bands in the selected molecule or set of molecules.

Structure figures:

We use RNAplot from the Vienna RNA package of Hofacker et al. to generate our figures. We augment the figure produced by RNAplot in the following ways. For pseudoknotted structures, a minimum number of base pairs that cause the structure to be pseudoknotted are indicated by using blue lines to connect the bases of each pair. We colour in red the lines connecting non-canonical base pairs (i.e. AA, AC, AG, CC, CU, GG, or UU) and other non-standard base pairs (i.e. at least one of the bases is not A,C,G, or U). If a base pair is both pseudoknotted and non-canonical or non-standard, then the color of its line is both red and blue: a blue dashed line on top of a thicker red line. Finally, we position a yellow circle behind the nucleotide at the 5' end of the sequence, and a cyan circle behind the nucleotide at the 3' end of the sequence, to aid the viewer in finding the ends of the sequence.

Search options:

When the results of a search are presented, the Search Options box near the top of the page can be used to control how the output is presented. The number in the "start" field controls the range of molecule IDs that are presented: only molecules in the output of the search whose Molecule ID is greater than or equal to the "start" value plus 1,000 are reported. The "limit" value controls the number of results listed per page of output. The default value is 10. Note: selecting a limit of more than 500 results per page might delay the display of the page.

Remove outliers:

If the "Remove outliers" checkbox is selected, then all the outliers are removed. A point is an outlier if it is less than q1-1.5*(q3-q1) or greater than q3+1.5*(q3-q1), where q1 and q3 are the first and third quartiles.
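The outlier rule can be sketched with Python's statistics module. How the quartiles are computed is not stated here, so the "inclusive" method used below is an assumption, and the function name is hypothetical.

```python
import statistics


def remove_outliers(values):
    """Drop points below q1 - 1.5*(q3 - q1) or above q3 + 1.5*(q3 - q1)."""
    # Assumption: quartiles by the 'inclusive' method; other conventions
    # can give slightly different cut-offs on small data sets.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [value for value in values if low <= value <= high]
```

For the values 1 to 8 plus 100, the quartiles are 3 and 7, so the accepted range is -3 to 13 and only 100 is removed.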

Normalize by RNA type:

If the "Normalize by RNA type" checkbox is selected, then instead of using one data point per molecule or per structural feature, we use one data point for each RNA type, where this point is determined by averaging all data points for that type. This way, the user can avoid biasing the analysis when some RNA types contain many more structures than other RNA types.
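A minimal sketch of this normalization, assuming the data arrive as (RNA type, value) pairs; the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def normalize_by_type(points):
    """Average the values of each RNA type, giving one data point per type."""
    grouped = defaultdict(list)
    for rna_type, value in points:
        grouped[rna_type].append(value)
    return {rna_type: sum(values) / len(values) for rna_type, values in grouped.items()}
```

For example, three tRNA values of 4, 5 and 6 and a single rRNA value of 20 become two data points, 5.0 for tRNA and 20.0 for rRNA, so the more numerous tRNA entries no longer dominate.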




For questions, comments, suggestions and bug reports, please contact:

Copyright © 2004-2008 BETA LAB - University of British Columbia