Similarity Measure: Similarity Flooding
Developers: Malte Kiesel Björn Endres Lizhen Qu
Description
This module is based on the algorithm similarity flooding from Sergey Melnik, Hector Garcia-Molina and Erhard Rahm. The detailed description of the algorithmus can be found at Similarity Flooding: A Versatile Graph Matching Algorithm. Besides a implementation of the original algorithmus, Lizhen Qu provides a optimized version to make it run more efficiently with huge ontologies.
Original SF
Developers: Malte Kiesel Lizhen Qu Björn Endres
The original SF takes a matrix calculation based approach, the library colt supporting efficient matrix calculations. The whole implementation is achieved through the two classes Graph and SimFlood. The class Graph is used for several purposes: A graph model representing the Ontology, a data structure for pairwise connectivity graph (PCG) and the conversion from the PCG to the induced propagation graph (IPG). The IPG is represented as an object of cern.colt.matrix.DoubleMatrix2D. The class SimFlood, which implements the SimilarityMeasure, calculates the similarity values from IPG and stores them in an object of class de.dfki.km.phaselib.impl.similarities.common.SFSimilarityMatrix.
Characteristics
Every call of either calcSimilarities() or calcSimilarity() results in a new calculation of similarity values.
Evaluation/Performance
TODO
Specification
Intitialisation
The SimilarityMeasure class is
de.dfki.km.phaselib.impl.similarities.sf.SimFlood
Initialisation is straight forward (new SimFlood())
Parameters
Parameter name | ValueType | Default | Description |
PARAM_TAXONOMY_ONLY | Boolean | FALSE | Defines whether the properties should be included in the connectivity graph. Thus, this option is vital if you want to calculate similarities between properties as well. |
Dependencies
- colt - an efficient scientific library used here for matrix based calculation.
Optimized SF
Developer: Lizhen Qu
The central class of the optimized SF is SimilarityMatcher, which implements SimilarityMeasure. OntoGraph describes a model, which is a graph representation of an ontology. A pairwise connectivity graph is represented by PCGGraph. Because IPG is basically a PCG with weighted edges going into two directions, it's also represented by the class PCGGraph. Other than the original implementation, optimized SF uses PCGVertex to store the similarity values, which makes it easier to add new subgraphs later on.
Characteristics
The optimized similarity flooding (SF) contains both the implementation of original algorithm and 3 optimized ones. The differences are characterized by the way of constructing pairwise connectivity graphs (PCG), which is configurable by the parameter sf.optimized.growthPolicy. If the growth policy is set to All, the original algorithm is used. If the policy is set to SinglePCG, it will constructs PCGs based on the relations of the input alignment. It constructs only PCGs containing all the input relations. With a parameter sf.optimized.numberOfPCGs, the top n PCGs ranked by variance of confidence will be selected as output. According to observation, PCGs with lower quality has always low variance. Another policy is called PCGStepwise, it tries to construct all the PCGs having more than one vertex. Then based on the parameter sf.optimized.numberOfPCGsTaxonomy, the top n PCGs with biggest variance will be selected. The last one is Taxonomy. Just as the name said, only one Taxonomy from ontology A will be compared with a taxonomy of ontology B.
The optimized SF allows also setting all fixpoint computing formula introduced in article Similarity Flooding: A Versatile Graph Matching Algorithm. 2 kinds of propagation coefficients are also available.
To achieve the best performance, if there is no input relation, fixpoint formula Basic is proven to be the best. With input relations e.g. from string based measures, formula C is generally the best. For faster evaluation, SinglePCG with formula C are considered a good match in circumstance that SF get input alignment from other similarity measure, the growth policy can results in constructing only 1 to 3 PCGs.
de.dfki.km.phaselib.impl.alignments.SimCombinedAlignGenerator is a example of SF accepting a alignment from string based measure, threshold by 0.5, which has a precision of 83% and recall 38%. In de.dfki.km.phaselib.impl.alignments.inria.BestFirstAlignGenerator SF uses original implementation with formula basic, and string based measure generates alignment parallel with threshold 0.5. The better one of the 2 generated alignment will be choosen as the final result. It's assumed that the best alignment find more relations. With the simple rule this generator reaches a precision of 0.98 and recall 0.58.
Evaluation/Performance
Alone with formula basic and inria benchmarks. precision 0.68, recall 0.31.
Suggested configuration
If there is a alignment from another similarity measure as input, fixpoint computing formula C and growth policy All together is a good combination. One alternative is to use growth policy SinglePCG for faster computation, because it generates much less PCGs. If there is a no alignment as input(pure structure comparision). One possibility is to use fixpoint computing formula Basic and growth policy. Sometimes better results can be achieved if SF_OPTIMIZED_BIND_INSTANCE is set to be true, growth policy is PCGStepwise and SF_BIND_WHOLE_DATA_TAXONOMY is set to be false
Intitialisation
The SimilarityMeasure class is
de.dfki.km.phaselib.impl.similarities.sf.optimized.SimilarityMatcher
Initialisation is straight forward (new SimilarityMatcher())
Parameters
Parameter name | ValueType | Default | Description |
PARAM_TAXONOMY_ONLY | Boolean | FALSE | Defines whether the properties should be included in the connectivity graph. Thus, this option is vital if you want to calculate similarities between properties as well. |
SF_OPTIMIZED_BIND_INSTANCE | Boolean | FALSE | Defines whether the instances should be included in the connectivity graph. |
SF_OPTIMIZED_FORMULA | SimilarityMatcher.Formula | Formula.Basic | fixpoint computing formula according to the paper of Similarity Flooding. |
SF_OPTIMIZED_GROWTH_POLICY | SimilarityMatcher.GrowthPolicy | GrowthPolicy.All | The way to build paarwise connectivity graph. |
SF_PROPAGATION_COEFFICIENTS | SimilarityMatcher.PC | PC.inv_prod | propagation coefficients, details see melnik paper. |
SF_BIND_WHOLE_DATA_TAXONOMY | Boolean | TRUE | if bind all the supported datatypes into the PCGs |
SF_OPTIMIZED_NUMBER_OF_PCGS | Integer | 100 | it is used only together with growth policy PCGStepwise. It choose top n PCGs according to their variance as result |
Dependencies
- log4j - logging