
MoFa - The Molecular Fragment Miner


MoFa, the Molecular Fragment Miner, is a program that automatically finds molecular substructures and discriminative fragments in a set of molecule descriptions, given some user-defined parameters.

The algorithm was designed in cooperation with Tripos, Inc., Data Analysis Research Lab, South San Francisco, CA, USA and the Working Group Neural Networks and Fuzzy Systems of the University of Magdeburg.

Details about the application and the algorithm can be found in these papers:

  • Mining Molecular Fragments: Finding Relevant Substructures of Molecules
    Christian Borgelt and Michael R. Berthold
    IEEE International Conference on Data Mining (ICDM 2002, Maebashi, Japan), 51-58
    IEEE Press, Piscataway, NJ, USA 2002
    (8 pages) icdm_02.pdf (112 kb) icdm_02.ps.gz (69 kb)
    Data files and instructions to reproduce the results published in the paper can be found here or under the download section.
    (Note: You should have at least 500 MB of memory in order to run the experiments.)
  • Large Scale Mining of Molecular Fragments with Wildcards
    Heiko Hofer, Christian Borgelt, and Michael Berthold.
    Proc. 5th International Symposium on Intelligent Data Analysis (IDA 2003, Berlin, Germany), 380-389.
    Springer-Verlag, Heidelberg, Germany 2003
    (10 pages) ida_03.pdf (187 kb) ida_03.ps.gz (125 kb)
  • Finding Discriminative Molecular Fragments
    Christian Borgelt, Heiko Hofer, and Michael Berthold
    Workshop Information Mining - Navigating Large Heterogeneous Spaces of Multimedia Information
    German Conference on Artificial Intelligence, Hamburg, Germany 2003
    (13 pages) wsim_03.pdf (303 kb) wsim_03.ps.gz (143 kb)

Note that this program version does not support wildcard atoms and does not have a graphical user interface, unlike the version described in the last two papers. The version supporting these features is the property of Tripos, Inc.

An example

To demonstrate the usage of the program, consider the following steroid data set of 17 sample molecules:

[Figure: Data Set of Steroids (structure drawings of the 17 sample molecules)]

The characteristic core structure of steroid molecules is a system of four fused rings: three 6-rings and one 5-ring. Different branches may emerge from various ring atoms, but the ring system is the most characteristic feature.

To run MoFa on this data set one has to specify the input parameters. This example uses these settings:

          java -jar mofa.jar -s25 -r5:6 "" steroids.smiles

The program generates 18 fragments (in our test environment, a Pentium IV at 2.6 GHz, in much less than a second). The following picture shows three of these fragments along with their relative frequencies. (Click here to see all 18 fragments.)

Three (out of 18) fragments satisfying the above-mentioned conditions
[Figure: three frequent fragments, occurring in 5/17, 14/17, and 17/17 molecules, respectively]

Note that this example does not find discriminative fragments, as there are no complement molecules to match the fragments against. If there were a complement set, the occurrences of the fragments would also be counted in this set, and fragments that occur too often in it would be discarded.
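The filtering idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not MoFa's implementation: the function names, the two thresholds `min_focus` and `max_compl`, and the use of precomputed occurrence counts are all hypothetical.

```python
def support(count: int, total: int) -> float:
    """Relative frequency of a fragment in a set of molecules."""
    return count / total if total else 0.0


def is_discriminative(focus_count: int, focus_total: int,
                      compl_count: int, compl_total: int,
                      min_focus: float = 0.25,
                      max_compl: float = 0.05) -> bool:
    """Keep a fragment if it is frequent in the focus set and rare in the
    complement set. Both thresholds are assumed values for illustration."""
    frequent_in_focus = support(focus_count, focus_total) >= min_focus
    rare_in_complement = support(compl_count, compl_total) <= max_compl
    return frequent_in_focus and rare_in_complement
```

With an empty complement set, as in the steroid example, the second condition is always satisfied, so only the frequency in the focus set decides whether a fragment is reported.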

The above example uses Ring Mining as introduced in the IDA paper. Ring Mining carries out a preprocessing step in which all rings of certain sizes (here sizes 5 and 6) are found and marked in the molecules of the database. The search algorithm then adds rings as they are: once a ring is encountered during the search, the whole ring is added. This leads to tremendous speed-ups and produces fewer fragments than with Ring Mining disabled. As the example above shows, the algorithm with Ring Mining does not generate fragments with "open rings", i.e. fragments in which only part of a ring structure is present. This is desirable, since rings in chemical structures are mostly considered as a unit. Fragments generated without Ring Mining that contain long chains of carbon atoms (or other atom types occurring in ring structures) are therefore not very useful, since it cannot be determined whether these chains were originally part of a ring.
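The preprocessing step can be illustrated with a small sketch that marks all bonds lying on a ring of an allowed size. It assumes a molecule is given as an adjacency mapping from integer atom indices to lists of neighbouring atom indices. The function name `mark_ring_bonds` and the simple bounded depth-first search are choices made for this example, and MoFa's actual procedure may differ.

```python
def mark_ring_bonds(adjacency, min_size=5, max_size=6):
    """Return the bonds (frozensets of two atom indices) that lie on at
    least one simple ring with min_size..max_size atoms."""
    ring_bonds = set()

    def extend(start, path, on_path):
        current = path[-1]
        for nxt in adjacency[current]:
            if nxt == start and len(path) >= max(3, min_size):
                # Closed a ring of acceptable size: mark all of its bonds.
                cycle = path + [start]
                for a, b in zip(cycle, cycle[1:]):
                    ring_bonds.add(frozenset((a, b)))
            elif nxt > start and nxt not in on_path and len(path) < max_size:
                # Only visit atoms with a larger index than the start atom,
                # so each ring is enumerated from its smallest atom only.
                on_path.add(nxt)
                path.append(nxt)
                extend(start, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for start in adjacency:
        extend(start, [start], {start})
    return ring_bonds


# A 5-ring (atoms 0..4) with one branch atom (5) attached to atom 0.
five_ring = {0: [1, 4, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0], 5: [0]}
marked = mark_ring_bonds(five_ring)
```

Here the five ring bonds are marked while the branch bond between atoms 0 and 5 is not. A search that treats marked bonds as part of a unit can then add a whole ring in one step instead of growing it bond by bond.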

The following command line disables Ring Mining:

        java -jar mofa.jar -s25 "" steroids.smiles

The search takes considerably longer due to redundant search (1 1/2 minutes compared to 0.06 seconds with Ring Mining). The outcome comprises 153 fragments, most of them with long chains of carbon atoms, i.e. open rings. This page lists the generated fragments with their relative frequencies.
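To repeat the timing comparison, the two command lines shown above can be run from a small script. The sketch below assumes that `java`, `mofa.jar`, and `steroids.smiles` are available in the current directory. The helper names are hypothetical, and the arguments are copied verbatim from the commands in the text. The measured wall-clock time includes JVM start-up, so it will not match the search times quoted above exactly.

```python
import subprocess
import time


def mofa_command(ring_mining: bool) -> list[str]:
    """Build the command line from the text, with or without -r5:6."""
    cmd = ["java", "-jar", "mofa.jar", "-s25"]
    if ring_mining:
        cmd.append("-r5:6")
    # The empty string and the input file are passed exactly as in the text.
    return cmd + ["", "steroids.smiles"]


def time_run(ring_mining: bool) -> float:
    """Run MoFa once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(mofa_command(ring_mining), check=True)
    return time.perf_counter() - start
```

Calling `time_run(True)` and `time_run(False)` one after the other gives a rough comparison of the two settings.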