wiki:similarity_SimFlood

Version 9 (modified by lqu, 18 years ago) (diff)

--

TracNav?

Similarity Measure: Similarity Flooding

Developers: Malte Kiesel Björn Endres Lizhen Qu

Description

This module is based on the algorithm similarity flooding from Sergey Melnik, Hector Garcia-Molina and Erhard Rahm. The detailed description of the algorithmus can be found at Similarity Flooding: A Versatile Graph Matching Algorithm. Besides a implementation of the original algorithmus, Lizhen Qu provides a optimized version to make it run more efficiently with huge ontologies.


Original SF

Developers: Malte Kiesel Lizhen Qu Björn Endres

The original SF takes a matrix calculation based approach, the library colt supporting efficient matrix calculations. The whole implementation is achieved through the two classes Graph and SimFlood. The class Graph is used for several purposes: A graph model representing the Ontology, a data structure for pairwise connectivity graph (PCG) and the conversion from the PCG to the induced propagation graph (IPG). The IPG is represented as an object of cern.colt.matrix.DoubleMatrix2D. The class SimFlood, which implements the SimilarityMeasure, calculates the similarity values from IPG and stores them in an object of class de.dfki.km.phaselib.impl.similarities.common.SFSimilarityMatrix.

Characteristics

Every call of either calcSimilarities() or calcSimilarity() results in a new calculation of similarity values.

Evaluation/Performance

TODO

Specification

Intitialisation

The SimilarityMeasure class is

de.dfki.km.phaselib.impl.similarities.sf.SimFlood

Initialisation is straight forward (new SimFlood())

Parameters

Parameter name ValueType Default Description
PARAM_TAXONOMY_ONLY Boolean FALSE Defines whether the properties should be included in the connectivity graph. Thus, this option is vital if you want to calculate similarities between properties as well.

Dependencies

  1. colt - an efficient scientific library used here for matrix based calculation.

Optimized SF

Developer: Lizhen Qu

The central class of the optimized SF is SimilarityMatcher, which implements SimilarityMeasure. OntoGraph describes a model, which is a graph representation of an ontology. A pairwise connectivity graph is represented by PCGGraph. Because IPG is basically a PCG with weighted edges going into two directions, it's also represented by the class PCGGraph. Other than the original implementation, optimized SF uses PCGVertex to store the similarity values, which makes it easier to add new subgraphs later on.

Characteristics

The optimized similarity flooding (SF) contains both the implementation of original algorithm and 3 optimized ones. The differences are characterized by the way of constructing pairwise connectivity graphs (PCG), which is configurable by the parameter sf.optimized.growthPolicy. If the growth policy is set to All, the original algorithm is used. If the policy is set to SinglePCG, it will constructs PCGs based on the relations of the input alignment. It constructs only PCGs containing all the input relations. With a parameter sf.optimized.numberOfPCGs, the top n PCGs ranked by variance of confidence will be selected as output. According to observation, PCGs with lower quality has always low variance. Another policy is called PCGStepwise, it tries to construct all the PCGs having more than one vertex. Then based on the parameter sf.optimized.numberOfPCGsTaxonomy, the top n PCGs with biggest variance will be selected. The last one is Taxonomy. Just as the name said, only one Taxonomy from ontology A will be compared with a taxonomy of ontology B.

The optimized SF allows also setting all fixpoint computing formula introduced in article Similarity Flooding: A Versatile Graph Matching Algorithm. 2 kinds of propagation coefficients are also available.

To achieve the best performance, if there is no input relation, fixpoint formula Basic is proven to be the best. With input relations e.g. from string based measures, formula C is generally the best. For faster evaluation, SinglePCG with formula C are considered a good match in circumstance that SF get input alignment from other similarity measure, the growth policy can results in constructing only 1 to 3 PCGs.

de.dfki.km.phaselib.impl.alignments.SimCombinedAlignGenerator is a example of SF accepting a alignment from string based measure, threshold by 0.5, which has a precision of 83% and recall 38%. In de.dfki.km.phaselib.impl.alignments.inria.BestFirstAlignGenerator SF uses original implementation with formula basic, and string based measure generates alignment parallel with threshold 0.5. The better one of the 2 generated alignment will be choosen as the final result. It's assumed that the best alignment find more relations. With the simple rule this generator reaches a precision of 0.98 and recall 0.58.

Evaluation/Performance

Alone with formula basic and inria benchmarks. precision 0.68, recall 0.31.

Suggested configuration

If there is a alignment from another similarity measure as input, fixpoint computing formula C and growth policy All together is a good combination. One alternative is to use growth policy SinglePCG for faster computation, because it generates much less PCGs. If there is a no alignment as input(pure structure comparision). One possibility is to use fixpoint computing formula Basic and growth policy. Sometimes better results can be achieved if SF_OPTIMIZED_BIND_INSTANCE is set to be true, growth policy is PCGStepwise and SF_BIND_WHOLE_DATA_TAXONOMY is set to be false

Intitialisation

The SimilarityMeasure class is

de.dfki.km.phaselib.impl.similarities.sf.optimized.SimilarityMatcher

Initialisation is straight forward (new SimilarityMatcher())

Parameters

Parameter name ValueType Default Description
PARAM_TAXONOMY_ONLY Boolean FALSE Defines whether the properties should be included in the connectivity graph. Thus, this option is vital if you want to calculate similarities between properties as well.
SF_OPTIMIZED_BIND_INSTANCE Boolean FALSE Defines whether the instances should be included in the connectivity graph.
SF_OPTIMIZED_FORMULA SimilarityMatcher.Formula Formula.Basic fixpoint computing formula according to the paper of Similarity Flooding.
SF_OPTIMIZED_GROWTH_POLICY SimilarityMatcher.GrowthPolicy GrowthPolicy.All The way to build paarwise connectivity graph.
SF_PROPAGATION_COEFFICIENTS SimilarityMatcher.PC PC.inv_prod propagation coefficients, details see melnik paper.
SF_BIND_WHOLE_DATA_TAXONOMY Boolean TRUE if bind all the supported datatypes into the PCGs
SF_OPTIMIZED_NUMBER_OF_PCGS Integer 100 it is used only together with growth policy PCGStepwise. It choose top n PCGs according to their variance as result

Dependencies

  1. log4j - logging