liblevenshtein 4.0.0
A library for generating Finite State Transducers based on Levenshtein Automata.
Loading...
Searching...
No Matches
liblevenshtein-cpp

Action Type Build Status
Operating System Test Ubuntu Test MacOS Test Windows
Engineering Excellence Coverage Status Linter CodeQL
Demo App Run Demo
Documentation Deploy static content to Pages

A library for generating Finite State Transducers based on Levenshtein Automata.

NOTE: This library is currently in rc phase. I'll have it production ready as soon as possible. Currently, there is >90% test coverage over the sources and the library is usable as described below.

Due to limited resources on my part, this library requires C++20 features (or whichever is the latest standard). If you need compatibility with an older standard, please either submit a pull request (preferably) or create an issue stating the standard you need compatibility with and I will comply if I can.

For a demonstration, please reference the example app.

For API documentation, please reference the GitHub Pages.

Development and Installation

For instructions how to develop and install liblevenshtein, please reference the wiki.

Usage

Algorithms

liblevenshtein supports three variations of Levenshtein distance, where each variation is defined by the elementary operations it supports. An elementary operation is an edit operation that errs in a penalty of 1 unit.

  1. liblevenshtein::Algorithm::STANDARD
    • Standard Levenshtein distance including the traditional elementary operations of insert, delete, and substitute.
  2. liblevenshtein::Algorithm::TRANSPOSITION
    • Standard Levenshtein distance extended with transpose as an elementary operation.
    • The elementary operations supported by this algorithm follow: insert, delete, substitute, and transpose.
    • A transposition reorders the characters ab as ba, erring with a penalty of 1 unit instead of 2.
      • The standard algorithm treats transpositions as either a sequence of delete+insert, insert+delete, or substitute+substitute, each of which errs in a penalty of 2 units.
    • This algorithm is preferred for correcting typographical errors, where the majority of misspellings in English are within 2 units of error from the intended spelling with many errors being transpositions.
  3. liblevenshtein::Algorithm::MERGE_AND_SPLIT
    • Standard Levenshtein distance extended with two additional elementary operations: merge and split.
    • The elementary operations supported by this algorithm follow: insert, delete, substitute, merge, and split.
    • This algorithm does not include transpose as an elementary operation.
    • A merge collapses characters cl as a single character d.
    • A split expands character d as two characters cl.
    • This algorithm is preferred for correcting OCR (Optical Character Recognition) errors, where an OCR model may incorrectly read the sequence of characters cl as d or the character d as the sequence cl. Of course, these operations consider all combinations of characters from your dictionary and not just the obvious ones.

Results

liblevenshtein supports returning results in two formats:

  1. std::string
    • Spelling candidates are returned as strings without including their edit distances from the query term.
    • This is likely what you want for production.
  2. liblevenshtein::Candidate
    • Spelling candidates are returned as instances of std::pair<std::string, std::size_t>, where each pair includes the spelling candidate and its edit distance from the query term.
    • This is likely what you want for development.

Example

# file: CMakeLists.txt
cmake_minimum_required(VERSION 3.20 FATAL_ERROR)
project(liblevenshtein-demo
VERSION 1.0.0
DESCRIPTION "Demonstrates how to use liblevenshtein-cpp."
HOMEPAGE_URL "https://github.com/universal-automata/liblevenshtein-cpp"
LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_EXTENSIONS OFF)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
SET(CMAKE_CXX_FLAGS_DEBUG "-g -O0")
SET(CMAKE_C_FLAGS_DEBUG "-g -O0")
set(CMAKE_COMPILE_WARNING_AS_ERROR ON)
set(CMAKE_VERBOSE_MAKEFILE ON)
include(GNUInstallDirs)
find_package(Protobuf REQUIRED)
find_package(liblevenshtein REQUIRED)
add_executable(${PROJECT_NAME}
"main.cpp")
target_link_libraries(${PROJECT_NAME}
PRIVATE
protobuf::libprotobuf
levenshtein)
// file: main.cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>
#include <google/protobuf/stubs/common.h>
namespace ll = liblevenshtein;
int main(int argc, char *argv[]) {
// Verify that the version of the library that we linked against is
// compatible with the version of the headers we compiled against.
// path to file containing serialized dictionary
std::string serialization_path;
ll::Dawg *dawg = ll::deserialize_protobuf(serialization_path);
if (dawg == nullptr) {
std::vector<std::string> terms; // populate with your spelling candidates
std::sort(terms.begin(), terms.end()); // must be sorted for now
// NOTE: If (dawg == nullptr) then the construction of the dictionary
// failed, probably because terms wasn't sorted lexicographically in
// ascending order.
dawg = ll::sorted_dawg(terms.begin(), terms.end());
}
std::string query_term; // assign the term whose spelling you wish to correct
std::size_t max_distance = 2; // maximum number of operations allowed to transform
// a spelling candidate into query_term (edit distance)
// NOTE: ll:Candidate is an alias for std::pair<std::string, std::size_t>
for (const ll::Candidate& candidate : transduce(query_term, max_distance)) {
const std::string& term = candidate.first; // spelling candidate for query_term
const std::size_t& distance = candidate.second; // minimum number of operations required
// to transform query_term into term
}
// save the dictionary for reuse
ll::serialize_protobuf(dawg, serialization_path);
delete dawg;
// Optional: Delete all global objects allocated by libprotobuf.
google::protobuf::ShutdownProtobufLibrary();
return 0;
}
A Directed Acyclic Word Graph (DAWG) maps sequences of characters to form words; the collection of wo...
Definition dawg.h:23
Constructs a Levenshtein Transducer that, when given a dictionary automaton, query term,...
Definition transducer.h:26
auto main(int argc, char *argv[]) -> int
Definition main.cpp:58
void query(ll::Dawg *dawg, const std::string &query_term, std::size_t max_distance)
Definition main.cpp:25
Various utilities regarding Levenshtein transducers.
Definition namespaces.dox:9
auto sorted_dawg(IterType iter, IterType end) -> SortedDawg *
Factory function that initializes a new SortedDawg using the terms yielded from an iterator.
std::pair< std::string, std::size_t > Candidate
Spelling candidate that includes both the dictionary term and its edit distance from the query term.
Definition transducer.h:17
auto deserialize_protobuf(const fs::path &path) -> Dawg *
Deserializes the protobuf containing a Dawg at the given path or returns nullptr if none exists.
auto serialize_protobuf(Dawg *dawg, const fs::path &path) -> bool
Serializes the given Dawg to protobuf at the given path.