SimilarityMeasure: SimPack Adapter
Developer: Björn Endres
Description
This module tries to make any measure provided by the SimPack project usable as a PhaseLibs SimilarityMeasure. Since the two projects follow completely different approaches, the adapter needs a set of parameters to work properly. The module includes a small database of these parameters, so first try one of the SimPack measure configurations listed below. If the measure config you need is not (yet) in the database, an autodetection process is started to guess the needed parameters. This may produce weird results, especially since the SimPack measures are not normalised in any way.
The internal measure database is being extended constantly, but feel free to request the integration of a specific measure if you need it.
Evaluation/Performance
The SimPack project uses one measure object to calculate a single value. Calculating a similarity matrix therefore requires generating a large number of such measure objects, so the adapter performs suboptimally. In return, it allows the easy integration of a large number of measures.
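The cost of the one-object-per-value approach can be illustrated with a small, self-contained sketch. The classes `OneShotMeasure` and `MatrixDemo` are hypothetical stand-ins invented for this illustration and are not part of SimPack or PhaseLibs; the toy similarity (fraction of matching character positions) is likewise only a placeholder for a real measure.

```java
/** Hypothetical stand-in for a one-shot measure: one object yields one value. */
final class OneShotMeasure {
    // Counts constructed objects, for illustration only.
    static int created = 0;

    private final double value;

    OneShotMeasure(String a, String b) {
        created++;
        // Toy similarity: fraction of positions holding the same character.
        int max = Math.max(a.length(), b.length());
        int min = Math.min(a.length(), b.length());
        int same = 0;
        for (int i = 0; i < min; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        value = (max == 0) ? 1.0 : (double) same / max;
    }

    double getValue() {
        return value;
    }
}

final class MatrixDemo {
    /** Fills an n x n matrix, constructing one measure object per cell. */
    static double[][] similarityMatrix(String[] items) {
        int n = items.length;
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                matrix[i][j] = new OneShotMeasure(items[i], items[j]).getValue();
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        similarityMatrix(new String[] {"jaro", "jara", "winkler"});
        System.out.println("measure objects created: " + OneShotMeasure.created);
    }
}
```

For n items the sketch constructs n * n measure objects, which is the overhead described above. A symmetric measure would allow computing only half of the matrix, but whether a given SimPack measure is symmetric is not something this sketch assumes.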
Characteristics
The following list contains the SimPack measures that have already been included in the parameter database.
Config Name | SimPack measure class used | Description | Status | Remarks |
SimPack internal measures: | ||||
| simpack.measure.string.Jaro | Jaro metric | buggy | Original SimPack measure seems to be buggy. Use the other Jaro implementation instead |
Simmetrics based: | (simpack.measure.external.simmetrics.-) | |||
SMJaro | Jaro | Jaro metric | works (unconfirmed) | none |
Secondstring based: | (simpack.measure.external.secondstring.-) | |||
SSLevenshtein | Levenshtein | Levenshtein editing distance | works (unconfirmed) | none |
SSJaro | Jaro | Jaro metric | works (unconfirmed) | none |
SSMongeElkan | MongeElkan | The match method proposed by Monge and Elkan | works (unconfirmed) | none |
SSSLIM | SLIM | Experimental, invented by SecondString author William Cohen. | works (unconfirmed) | none |
| SmithWaterman | Smith-Waterman string distance, following Durbin et al. | not working | normalisation fails |
SSNeedlemanWunsch | NeedlemanWunsch | Needleman-Wunsch string distance, following Durbin et al. Sec 2.3. | works (unconfirmed) | none |
| Mixture | Mixture-based distance metric. | not working | Bug found in com.wcohen.ss.Mixture.java:48 |
SSWinklerRescorerOnJaro | WinklerRescorer | Winkler's reweighting scheme for distance metrics, applied to the Jaro metric ('An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census' by William E. Winkler and Yves Thibaudeau.) | works (unconfirmed) | none |
| SoftTFIDF | TFIDF-based distance metric, extended to use "soft" token-matching. Specifically, tokens are considered a partial match if they get a good score using an inner string comparator. | not working | strange results |
Tree based: | (simpack.measure.tree.-) | |||
SubTreeEditDistance | TreeEditDistance | This implements an edit distance calculation for the class-rooted subtrees. The algorithm is taken from Gabriel Valiente's book "Algorithms on trees and graphs" (Springer) and described in chapter 2.1 "The tree edit distance problem". | works (unconfirmed) | none |
Specification
Initialisation
The SimilarityMeasure class is
de.dfki.km.phaselib.impl.similarities.simPack.SimPackMeasure
Initialisation is straightforward in most cases: the constructor simply takes the config name of a SimPack measure from the list above as its argument. Example:
SimilarityMeasure measure = new SimPackMeasure("SSJaro");
If you want the system to guess the config for a given measure, pass the class name instead. Example:
SimilarityMeasure measure = new SimPackMeasure(Jaro.class.getName());
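The two ways of calling the constructor can be pictured as "database first, autodetection as fallback". The sketch below only illustrates that idea. `ConfigLookupSketch`, `Resolution` and `resolve` are hypothetical names, and the module's actual lookup logic is not documented here. The two database entries are assembled from the config names and package prefixes in the table above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of "database first, autodetection as fallback". */
final class ConfigLookupSketch {

    /** Result of resolving a constructor argument. */
    static final class Resolution {
        final String measureClassName;
        final boolean autodetected;

        Resolution(String measureClassName, boolean autodetected) {
            this.measureClassName = measureClassName;
            this.autodetected = autodetected;
        }
    }

    // Two entries taken from the table above; the real database is larger.
    private static final Map<String, String> DATABASE = new HashMap<>();
    static {
        DATABASE.put("SSJaro", "simpack.measure.external.secondstring.Jaro");
        DATABASE.put("SMJaro", "simpack.measure.external.simmetrics.Jaro");
    }

    /**
     * Known config names resolve to their stored measure class.
     * Anything else is treated as a class name whose parameters
     * would have to be guessed (assumption of this sketch).
     */
    static Resolution resolve(String configOrClassName) {
        String known = DATABASE.get(configOrClassName);
        if (known != null) {
            return new Resolution(known, false);
        }
        return new Resolution(configOrClassName, true);
    }
}
```

In this picture, `resolve("SSJaro")` yields a stored configuration, while an unlisted class name is flagged as autodetected, which is the case where results may need checking by hand.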
If the measure is not present in the internal database, i.e. not listed above, or if you want to use the measure with custom settings, use the full constructor:
SimilarityMeasure measure = new SimPackMeasure(
    // the SimPack measure class to use (e.g. Jaro, as above)
    the_measureClass_I_want_to_use,
    // initialises the measure in the way you want it
    my_Initialiser_object,
    // (re)normalises the calculated values
    my_ValueTransformer_object,
    // provides the measure with IAccessor objects
    my_AccessorFactory_object);
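Since the SimPack measures are not normalised, the value transformer is the place where raw values are mapped to a common range. The following is a minimal sketch of that idea for an edit distance, assuming that a similarity in [0, 1] is wanted. `NormalisationSketch` and its methods are hypothetical and do not implement the actual PhaseLibs value transformer interface, whose signature is not shown on this page.

```java
/** Hypothetical sketch: turning an unbounded edit distance into a [0, 1] similarity. */
final class NormalisationSketch {

    /** Plain Levenshtein edit distance; the raw value grows with string length. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap the rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Maps a raw distance to a similarity in [0, 1], using the longer length as bound. */
    static double toSimilarity(int distance, int lengthA, int lengthB) {
        int max = Math.max(lengthA, lengthB);
        if (max == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        double similarity = 1.0 - (double) distance / max;
        return Math.max(0.0, Math.min(1.0, similarity));
    }
}
```

Dividing by the longer string length works here because the Levenshtein distance never exceeds it. Other measures have different value ranges and need their own mapping, which is presumably why the transformer is passed in per measure.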
Parameters
TODO
Dependencies
The dependencies depend on the SimPack measures wrapped. Refer to the SimPack dependencies.
License Issues
The SimPack project is published under the following Creative Commons license: