liblevenshtein 4.0.0
A library for generating Finite State Transducers based on Levenshtein Automata.
Loading...
Searching...
No Matches
liblevenshtein::distance::MergeAndSplitDistance Class Reference

Computes the standard Levenshtein distance extended with two additional elementary operations: (1) merge and (2) split. More...

#include <merge_and_split_distance.h>

Inheritance diagram for liblevenshtein::distance::MergeAndSplitDistance:
Collaboration diagram for liblevenshtein::distance::MergeAndSplitDistance:

Public Member Functions

auto between (std::string v, std::string w) -> std::size_t override
 Measures the edit distance between two terms.
 
auto operator() (const std::string &v, const std::string &w) -> std::size_t
 Measures the edit distance between two terms.
 

Protected Member Functions

auto get (const SymmetricPair &key, std::size_t &distance) -> bool
 Collects the memoized distance between the pair of terms represented by the SymmetricPair if the distance has been previously determined.
 
auto set (const SymmetricPair &key, const std::size_t &distance) -> std::size_t
 Memoizes the distance between the SymmetricPair of terms for future reference.
 

Static Protected Member Functions

static auto f (const std::string &u, std::size_t t) -> std::string
 Returns the suffix of \(u\) from position \(t\).
 

Private Attributes

std::unordered_map< SymmetricPair, std::size_t > memo
 Memoized distances between pairs of terms.
 
std::shared_mutex mutex
 Coordinates memoization among threads.
 

Detailed Description

Computes the standard Levenshtein distance extended with two additional elementary operations: (1) merge and (2) split.

This is most useful for correcting OCR errors (optical character recognition).

Definition at line 15 of file merge_and_split_distance.h.

Member Function Documentation

◆ between()

auto liblevenshtein::distance::MergeAndSplitDistance::between ( std::string v,
std::string w ) -> std::size_t
overridevirtual

Measures the edit distance between two terms.

Parameters
vFirst term to compare.
wSecond term to compare.
Returns
Edit distance between v and w.

Implements liblevenshtein::distance::Distance.

Definition at line 6 of file merge_and_split_distance.cpp.

7 {
8 const SymmetricPair key(v, w);
9
10 std::size_t distance;
11 if (get(key, distance)) {
12 return distance;
13 }
14
15 if (v.empty()) {
16 return set(key, w.length());
17 }
18
19 if (w.empty()) {
20 return set(key, v.length());
21 }
22
23 char a = v[0];
24 std::string x = v.substr(1);
25 char b = w[0];
26 std::string y = w.substr(1);
27
28 // Discard identical characters
29 while (a == b && !(x.empty() || y.empty())) {
30 a = x[0];
31 v = x;
32 x = v.substr(1);
33 b = y[0];
34 w = y;
35 y = w.substr(1);
36 }
37
38 // x.empty() || y.empty()
39 if (a == b) {
40 if (x.empty()) {
41 return set(key, y.length());
42 }
43 return set(key, x.length());
44 }
45
46 distance = between(x, w);
47 if (0 == distance) {
48 return set(key, 1);
49 }
50
51 std::size_t min_distance = distance;
52
53 distance = between(v, y);
54 if (0 == distance) {
55 return set(key, 1);
56 }
57
58 if (distance < min_distance) {
59 min_distance = distance;
60 }
61
62 distance = between(x, y);
63 if (0 == distance) {
64 return set(key, 1);
65 }
66
67 if (distance < min_distance) {
68 min_distance = distance;
69 }
70
71 if (w.length() > 1) {
72 distance = between(x, f(w, 1));
73
74 if (0 == distance) {
75 return set(key, 1);
76 }
77
78 if (distance < min_distance) {
79 min_distance = distance;
80 }
81 }
82
83 if (v.length() > 1) {
84 distance = between(f(v, 1), y);
85
86 if (0 == distance) {
87 return set(key, 1);
88 }
89
90 if (distance < min_distance) {
91 min_distance = distance;
92 }
93 }
94
95 return set(key, 1 + min_distance);
96}
auto get(const SymmetricPair &key, std::size_t &distance) -> bool
Collects the memoized distance between the pair of terms represented by the SymmetricPair if the dist...
static auto f(const std::string &u, std::size_t t) -> std::string
Returns the suffix of from position .
auto set(const SymmetricPair &key, const std::size_t &distance) -> std::size_t
Memoizes the distance between the SymmetricPair of terms for future reference.
auto between(std::string v, std::string w) -> std::size_t override
Measures the edit distance between two terms.
void query(ll::Dawg *dawg, const std::string &query_term, std::size_t max_distance)
Definition main.cpp:25
auto distance(State *state, std::size_t query_length) -> std::size_t
Infers the Levenshtein distance from the given State and query length.

References liblevenshtein::distance(), and query().

Here is the call graph for this function:

◆ f()

auto liblevenshtein::distance::MemoizedDistance::f ( const std::string & u,
std::size_t t ) -> std::string
staticprotectedinherited

Returns the suffix of \(u\) from position \(t\).

If \(t \ge \left|u\right|\), then the empty string is returned.

Parameters
uTerm whose suffix is to be returned.
tBeginning position in u for the substring.
Returns
The substring of u beginning at position t if \(t < \left|u\right|\), or the empty string otherwise.

Definition at line 26 of file memoized_distance.cpp.

26 {
27 if (t < u.length()) {
28 return u.substr(1 + t);
29 }
30 return "";
31}

References query().

Here is the call graph for this function:

◆ get()

auto liblevenshtein::distance::MemoizedDistance::get ( const SymmetricPair & key,
std::size_t & distance ) -> bool
protectedinherited

Collects the memoized distance between the pair of terms represented by the SymmetricPair if the distance has been previously determined.

Otherwise, the distance is not collected and false is returned.

Parameters
keySymmetricPair of terms whose distance may have been memoized.
distanceOutgoing reference containing the memoized distance between the pair of terms, if the distance has been previously determined.
Returns
Whether the distance between the pair of terms has been previously determined.

Definition at line 9 of file memoized_distance.cpp.

9 {
10 std::shared_lock reader(mutex);
11 auto iter = memo.find(key);
12 if (iter != memo.end()) {
13 distance = iter->second;
14 return true;
15 }
16 return false;
17}
std::unordered_map< SymmetricPair, std::size_t > memo
Memoized distances between pairs of terms.
std::shared_mutex mutex
Coordinates memoization among threads.

References liblevenshtein::distance(), and query().

Here is the call graph for this function:

◆ operator()()

auto liblevenshtein::distance::Distance::operator() ( const std::string & v,
const std::string & w ) -> std::size_t
inherited

Measures the edit distance between two terms.

This is equivalent to calling between(v, w).

Parameters
vFirst term to compare.
wSecond term to compare.
Returns
Edit distance between v and w.

Definition at line 5 of file distance.cpp.

6 {
7 return between(v, w);
8}
virtual auto between(std::string v, std::string w) -> std::size_t=0
Measures the edit distance between two terms.

References query().

Here is the call graph for this function:

◆ set()

auto liblevenshtein::distance::MemoizedDistance::set ( const SymmetricPair & key,
const std::size_t & distance ) -> std::size_t
protectedinherited

Memoizes the distance between the SymmetricPair of terms for future reference.

Parameters
keySymmetricPair of terms whose distance is to be memoized.
distanceDistance between the SymmetricPair of terms to memoize.
Returns
The distance once it has been memoized.

Definition at line 19 of file memoized_distance.cpp.

20 {
21 std::unique_lock writer(mutex);
22 memo[key] = distance;
23 return distance;
24}

References liblevenshtein::distance(), and query().

Here is the call graph for this function:

Member Data Documentation

◆ memo

std::unordered_map<SymmetricPair, std::size_t> liblevenshtein::distance::MemoizedDistance::memo
privateinherited

Memoized distances between pairs of terms.

Definition at line 57 of file memoized_distance.h.

◆ mutex

std::shared_mutex liblevenshtein::distance::MemoizedDistance::mutex
mutableprivateinherited

Coordinates memoization among threads.

Definition at line 60 of file memoized_distance.h.


The documentation for this class was generated from the following files: