Version 1 (modified by kiesel, 15 years ago)

spam: reset to 2006 page version

SimilarityMeasure: SimPack Adapter

Developer: Björn Endres

Description

This module tries to make any measure provided by the SimPack project usable as a PhaseLibs SimilarityMeasure. Since the two projects follow completely different approaches, the adapter needs a set of parameters to work properly. The module includes a small database of these parameters, so first try one of the SimPack measure configurations from the list below. If the measure config you need is not (yet) in the database, an autodetection process is started to guess the required parameters. This may produce odd results, especially since the SimPack measures are not normalised in any way.

The internal measure database is being extended constantly, but feel free to request the integration of a specific measure if you need it.
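
The lookup order described above (parameter database first, autodetection as fallback) can be pictured with the following minimal sketch. It is not the adapter's actual code: the class ConfigLookupSketch, its map, and the returned strings are hypothetical, and the fully qualified class names are assumed from the package prefixes given in the list below.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of "database first, autodetection second".
final class ConfigLookupSketch {

    // Stand-in for the internal parameter database: config name -> measure class.
    // The class names are assumptions derived from the package prefixes in the list.
    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("SSJaro", "simpack.measure.external.secondstring.Jaro");
        KNOWN.put("SSLevenshtein", "simpack.measure.external.secondstring.Levenshtein");
    }

    // Returns a description of how the given name would be resolved.
    static String resolve(String nameOrClass) {
        String measureClass = KNOWN.get(nameOrClass);
        if (measureClass != null) {
            return "database: " + measureClass;
        }
        // Not in the database: the parameters would have to be guessed.
        return "autodetect: " + nameOrClass;
    }

    public static void main(String[] args) {
        System.out.println(resolve("SSJaro"));
        System.out.println(resolve("simpack.measure.string.Jaro"));
    }
}
```

A known config name is answered from the map, while anything else (for example a raw class name) falls through to the guessing branch.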

Evaluation/Performance

The SimPack project uses one measure object to calculate a single value. Calculating a similarity matrix therefore requires creating a large number of such measure objects, so the adapter performs suboptimally. In return, it allows a large number of measures to be integrated easily.
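
To make the cost concrete, here is a small self-contained sketch of the "one measure object per value" pattern. The interface PairMeasure, the toy equality measure, and the counter are hypothetical stand-ins and not SimPack or PhaseLibs classes; the point is only that an n x n matrix creates n * n measure objects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a measure that is bound to one pair of items
// and yields exactly one value.
interface PairMeasure {
    double getSimilarity();
}

final class MatrixSketch {

    // Counts how many measure objects were created, to make the overhead visible.
    static int created = 0;

    // Toy measure: 1.0 for equal strings, 0.0 otherwise.
    static PairMeasure newMeasure(String a, String b) {
        created++;
        return () -> a.equals(b) ? 1.0 : 0.0;
    }

    // Fills the full matrix; every cell needs its own measure object.
    static double[][] similarityMatrix(List<String> items) {
        int n = items.size();
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                matrix[i][j] = newMeasure(items.get(i), items.get(j)).getSimilarity();
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[][] matrix = similarityMatrix(Arrays.asList("a", "b", "a"));
        System.out.println(Arrays.deepToString(matrix));
        System.out.println("measure objects created: " + created);
    }
}
```

For three items the sketch creates nine measure objects; the count grows quadratically with the number of items.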

Characteristics

The following list contains the SimPack measures that have already been included in the parameter database.

|| Config Name || SimPack measure class used || Description || Status || Remarks ||
SimPack internal measures:
|| Jaro || simpack.measure.string.Jaro || Jaro metric || buggy || Original SimPack measure seems to be buggy. Use one of the other Jaro implementations (SMJaro, SSJaro) instead. ||
Simmetrics based: (simpack.measure.external.simmetrics.-)
|| SMJaro || Jaro || Jaro metric || works (unconfirmed) || none ||
Secondstring based: (simpack.measure.external.secondstring.-)
|| SSLevenshtein || Levenshtein || Levenshtein edit distance || works (unconfirmed) || none ||
|| SSJaro || Jaro || Jaro metric || works (unconfirmed) || none ||
|| SSMongeElkan || MongeElkan || The match method proposed by Monge and Elkan || works (unconfirmed) || none ||
|| SSSLIM || SLIM || Experimental, invented by SecondString author William Cohen. || works (unconfirmed) || none ||
|| SSSmithWaterman || SmithWaterman || Smith-Waterman string distance, following Durbin et al. || not working || normalisation fails ||
|| SSNeedlemanWunsch || NeedlemanWunsch || Needleman-Wunsch string distance, following Durbin et al., Sec. 2.3. || works (unconfirmed) || none ||
|| SSMixture || Mixture || Mixture-based distance metric. || not working || Bug found in com.wcohen.ss.Mixture.java:48 ||
|| SSWinklerRescorerOnJaro || WinklerRescorer || Winkler's reweighting scheme for distance metrics, applied to the Jaro metric ('An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census' by William E. Winkler and Yves Thibaudeau). || works (unconfirmed) || none ||
|| SSSoftTFIDF || SoftTFIDF || TFIDF-based distance metric, extended to use "soft" token-matching. Specifically, tokens are considered a partial match if they get a good score using an inner string comparator. || not working || strange results ||
Tree based: (simpack.measure.tree.-)
|| SubTreeEditDistance || TreeEditDistance || Implements an edit distance calculation for the class-rooted subtrees. The algorithm is taken from Gabriel Valiente's book "Algorithms on Trees and Graphs" (Springer) and described in chapter 2.1, "The tree edit distance problem". || works (unconfirmed) || none ||

Specification

Initialisation

The SimilarityMeasure class is

de.dfki.km.phaselib.impl.similarities.simPack.SimPackMeasure

The initialisation should be straightforward in most cases: the constructor simply takes the config name of the SimPack measure (see the list above) as an argument. Example:

   SimilarityMeasure measure = 
      new SimPackMeasure("SSJaro");

If you want the system to guess the config for a given measure, pass the class name instead. Example:

   SimilarityMeasure measure = 
      new SimPackMeasure(Jaro.class.getName());

If the measure is not present in the internal database (i.e. not listed above), or if you want to use the measure with custom settings, you have to use the full constructor:

   SimilarityMeasure measure = new SimPackMeasure(

      // the SimPack measure class to wrap (e.g. Jaro, as above)
      the_measureClass_I_want_to_use,

      // initialises the measure the way you want it
      my_Initialiser_object,

      // (re)normalises the calculated values
      my_ValueTransformer_object,

      // provides the measure with IAccessor objects
      my_AccessorFactory_object);
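
Since the SimPack measures are not normalised, the value transformer is the place to map raw results into a common range. The sketch below shows two possible mappings. The interface ValueTransformerSketch and its transform method are assumptions made for illustration; the real PhaseLibs transformer type and its signature are not documented on this page and may differ.

```java
// Hypothetical transformer interface; NOT the actual PhaseLibs type.
interface ValueTransformerSketch {
    double transform(double raw);
}

final class TransformerExamples {

    // Maps a non-negative distance to a similarity in (0, 1]:
    // distance 0 becomes 1.0, larger distances approach 0.
    static final ValueTransformerSketch DISTANCE_TO_SIMILARITY =
            raw -> 1.0 / (1.0 + Math.max(0.0, raw));

    // Linearly rescales values from [min, max] to [0, 1] and clamps outliers.
    static ValueTransformerSketch minMax(double min, double max) {
        if (!(max > min)) {
            throw new IllegalArgumentException("max must be greater than min");
        }
        return raw -> Math.min(1.0, Math.max(0.0, (raw - min) / (max - min)));
    }

    public static void main(String[] args) {
        System.out.println(DISTANCE_TO_SIMILARITY.transform(3.0));
        System.out.println(minMax(0.0, 10.0).transform(5.0));
    }
}
```

Which mapping is appropriate depends on whether the wrapped measure returns a distance or a score, and on the range of values it can produce.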

Parameters

TODO

Dependencies

The dependencies are determined by the SimPack measures being wrapped. Refer to the SimPack dependencies.

License Issues

The SimPack project is published under the following Creative Commons license:

Creative Commons License