SimilarityMeasure: Acronym Matcher

Developer: Björn Endres

Description

This module uses the entities' labels to estimate the likelihood that one label is meant to be an acronym of the other. Beyond plain initial letters, the algorithm recognises extensions as in W3C and basic leet as in 2L8. It is meant to supplement more general similarity measures by adding the ability to detect acronyms. The measure is symmetric, since the shorter of the two labels is always checked for being an acronym of the longer one. A set of parameters allows tuning the measure to different scenarios.
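To make the symmetry argument concrete, here is a minimal sketch of the simplest possible acronym check. It is not the module's algorithm: the class and method names are hypothetical, it only matches word-initial letters, and it ignores extensions, leet, in-word letters and all penalties. It only shows how picking the shorter label as the candidate makes the result independent of argument order.

```java
// Hypothetical illustration, not part of the AcronymMatcher module.
final class AcronymSketch {

    // True if every character of the shorter label matches, in order and
    // ignoring case, the first character of some word of the longer label.
    static boolean looksLikeAcronym(String a, String b) {
        // The shorter label is always the acronym candidate, so swapping
        // the arguments does not change the result.
        String candidate = a.length() <= b.length() ? a : b;
        String term = a.length() <= b.length() ? b : a;

        // Split the longer label into words on anything that is not a letter or digit.
        String[] words = term.trim().split("[^\\p{L}\\p{N}]+");
        int matched = 0;
        for (String word : words) {
            if (word.isEmpty() || matched == candidate.length()) {
                continue;
            }
            char initial = Character.toLowerCase(word.charAt(0));
            if (initial == Character.toLowerCase(candidate.charAt(matched))) {
                matched++;
            }
        }
        return !candidate.isEmpty() && matched == candidate.length();
    }

    public static void main(String[] args) {
        String term = "International Semantic Web Conference";
        System.out.println(looksLikeAcronym("ISWC", term)); // true
        System.out.println(looksLikeAcronym(term, "ISWC")); // true (same result, arguments swapped)
        System.out.println(looksLikeAcronym("GNU",
                "Graduate Management in Admission Test (Educational Testing Service)")); // false
    }
}
```

A check this strict rejects a pair such as Bundesrepublik Deutschland / BRD, because the R sits inside a word. The real measure handles such cases with a graded score rather than a yes/no answer.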

Characteristics

To demonstrate the abilities of this SimilarityMeasure, here are some examples (using the default values):

Frame A name | Frame B name | Measure value
Graduate Management in Admission Test (Educational Testing Service) | GMAT | 1.00
International Semantic Web Conference 2005 | ISWC05 | 1.00
The World Wide Web Consortium | W3C | 1.00
ventricular fibrillation | v-fib | 1.00
Bundesrepublik Deutschland | BRD | 0.92
Roll on the floor, laughing! | rofl | 0.89
False positive examples:
Bundesrepublik Deutschland | brb | 0.63
Graduate Management in Admission Test (Educational Testing Service) | GNU | 0.49
ventricular fibrillation | BAT | 0.35

The examples suggest that a threshold of approximately 0.9 should be applied in order to get reliable results. The values can, however, always be used as additional evidence.
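As a rough illustration of what such a threshold does, the following sketch applies a cut-off of exactly 0.9 to the measure values listed in the examples. The class name, the method name and the choice of a hard cut-off are assumptions made for this illustration; they are not part of the module.

```java
// Hypothetical illustration, not part of the AcronymMatcher module.
final class ThresholdSketch {

    // Assumed cut-off, taken from the "approximately 0.9" suggested by the examples.
    static final double THRESHOLD = 0.9;

    static boolean accept(double measureValue) {
        return measureValue >= THRESHOLD;
    }

    public static void main(String[] args) {
        // Frame B names and measure values copied from the example table.
        String[] labels = {"GMAT", "ISWC05", "W3C", "v-fib", "BRD", "rofl", "brb", "GNU", "BAT"};
        double[] values = {1.00, 1.00, 1.00, 1.00, 0.92, 0.89, 0.63, 0.49, 0.35};
        for (int i = 0; i < labels.length; i++) {
            System.out.println(labels[i] + " " + values[i] + " "
                    + (accept(values[i]) ? "accept" : "reject"));
        }
    }
}
```

With a cut-off of exactly 0.9, all three false positives are rejected, but so is rofl at 0.89. A slightly lower threshold would keep it while still excluding brb at 0.63.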

Evaluation/Performance

TODO

Specification

Initialisation

The SimilarityMeasure main class is

de.dfki.km.phaselib.impl.similarities.acronymMatch.AcronymMatcher

Initialisation is straightforward:

new AcronymMatcher()

Parameters

Parameter name | ValueType | Default | Description
PARAM_MAX_ACRONYM_LENGTH | Integer | 12 | Maximum acronym length; longer acronyms are not considered and score 0.0
PARAM_MAX_EXTENSIONS | Integer | 12 | Maximum number of extensions extracted from a word
PARAM_CASE_PENALTY | Float | 0.43 | Penalty per acronym letter found in the term with the wrong case
PARAM_JUMP_PENALTY | Float | 4.00 | Penalty applied if a large part of the term is skipped
PARAM_INWORD_PENALTY | Float | 0.33 | Penalty per acronym letter found inside a word (not at its beginning)
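The sketch below records the default values from the table as plain constants and illustrates the one rule the table states outright: acronyms longer than PARAM_MAX_ACRONYM_LENGTH score 0.0. The class, constant and method names are hypothetical, and measuring the acronym length as the character count of the shorter label is an assumption. How the penalties combine into the final value is not documented here, so the sketch does not attempt it.

```java
// Hypothetical illustration, not part of the AcronymMatcher module.
final class AcronymParamsSketch {

    // Default values copied from the parameter table; the constant names are illustrative.
    static final int MAX_ACRONYM_LENGTH = 12;
    static final int MAX_EXTENSIONS = 12;
    static final float CASE_PENALTY = 0.43f;
    static final float JUMP_PENALTY = 4.00f;
    static final float INWORD_PENALTY = 0.33f;

    // Returns 0.0 if the acronym candidate (assumed to be the shorter label)
    // is longer than MAX_ACRONYM_LENGTH, otherwise passes rawScore through.
    static double applyLengthCutoff(String labelA, String labelB, double rawScore) {
        int candidateLength = Math.min(labelA.length(), labelB.length());
        return candidateLength > MAX_ACRONYM_LENGTH ? 0.0 : rawScore;
    }

    public static void main(String[] args) {
        // 6-character candidate: the score is kept.
        System.out.println(applyLengthCutoff(
                "ISWC05", "International Semantic Web Conference 2005", 1.0)); // 1.0
        // 13-character candidate: the score is forced to 0.0.
        System.out.println(applyLengthCutoff(
                "ABCDEFGHIJKLM", "a much longer label than the thirteen letters above", 0.8)); // 0.0
    }
}
```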

Dependencies

none

License Issues

TODO

Last modified on 09/25/06 18:07:26