Action Type | Build Status |
Operating System | | | |
Engineering Excellence | | | |
Demo App | |
Documentation | |
A library for generating Finite State Transducers based on Levenshtein Automata.
NOTE: This library is currently in rc phase. I'll have it production ready as soon as possible. Currently, there is >90% test coverage over the sources and the library is usable as described below.
Due to limited resources on my part, this library requires C++20 features (or whichever is the latest standard). If you need compatibility with an older standard, please either submit a pull request (preferably) or create an issue stating the standard you need compatibility with and I will comply if I can.
For a demonstration, please reference the example app.
For API documentation, please reference the GitHub Pages.
Development and Installation
For instructions how to develop and install liblevenshtein, please reference the wiki.
Usage
Algorithms
liblevenshtein
supports three variations of Levenshtein distance, where each variation is defined by the elementary operations it supports. An elementary operation is an edit operation that errs in a penalty of 1 unit.
liblevenshtein::Algorithm::STANDARD
- Standard Levenshtein distance including the traditional elementary operations of
insert
, delete
, and substitute
.
liblevenshtein::Algorithm::TRANSPOSITION
- Standard Levenshtein distance extended with
transpose
as an elementary operation.
- The elementary operations supported by this algorithm follow:
insert
, delete
, substitute
, and transpose
.
- A transposition reorders the characters
ab
as ba
, erring with a penalty of 1 unit instead of 2.
- The standard algorithm treats transpositions as either a sequence of
delete+insert
, insert+delete
, or substitute+substitute
, each of which errs in a penalty of 2 units.
- This algorithm is preferred for correcting typographical errors, where the majority of misspellings in English are within 2 units of error from the intended spelling with many errors being transpositions.
liblevenshtein::Algorithm::MERGE_AND_SPLIT
- Standard Levenshtein distance extended with two additional elementary operations:
merge
and split
.
- The elementary operations supported by this algorithm follow:
insert
, delete
, substitute
, merge
, and split
.
- This algorithm does not include
transpose
as an elementary operation.
- A
merge
collapses characters cl
as a single character d
.
- A
split
expands character d
as two characters cl
.
- This algorithm is preferred for correcting OCR (Optical Character Recognition) errors, where an OCR model may incorrectly read the sequence of characters
cl
as d
or the character d
as the sequence cl
. Of course, these operations consider all combinations of characters from your dictionary and not just the obvious ones.
Results
liblevenshtein
supports returning results in two formats:
std::string
- Spelling candidates are returned as strings without including their edit distances from the query term.
- This is likely what you want for production.
liblevenshtein::Candidate
- Spelling candidates are returned as instances of
std::pair<std::string, std::size_t>
, where each pair includes the spelling candidate and its edit distance from the query term.
- This is likely what you want for development.
Example
# file: CMakeLists.txt
cmake_minimum_required(VERSION 3.20 FATAL_ERROR)
project(liblevenshtein-demo
VERSION 1.0.0
DESCRIPTION "Demonstrates how to use liblevenshtein-cpp."
HOMEPAGE_URL "https://github.com/universal-automata/liblevenshtein-cpp"
LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_EXTENSIONS OFF)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
SET(CMAKE_CXX_FLAGS_DEBUG "-g -O0")
SET(CMAKE_C_FLAGS_DEBUG "-g -O0")
set(CMAKE_COMPILE_WARNING_AS_ERROR ON)
set(CMAKE_VERBOSE_MAKEFILE ON)
include(GNUInstallDirs)
find_package(Protobuf REQUIRED)
find_package(liblevenshtein REQUIRED)
add_executable(${PROJECT_NAME}
"main.cpp")
target_link_libraries(${PROJECT_NAME}
PRIVATE
protobuf::libprotobuf
levenshtein)
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>
#include <google/protobuf/stubs/common.h>
int main(
int argc,
char *argv[]) {
std::string serialization_path;
std::vector<std::string>
terms;
}
std::size_t max_distance = 2;
const std::size_t& distance =
candidate.second;
}
google::protobuf::ShutdownProtobufLibrary();
return 0;
}
A Directed Acyclic Word Graph (DAWG) maps sequences of characters to form words; the collection of wo...
Constructs a Levenshtein Transducer that, when given a dictionary automaton, query term,...
auto main(int argc, char *argv[]) -> int
void query(ll::Dawg *dawg, const std::string &query_term, std::size_t max_distance)
Various utilities regarding Levenshtein transducers.
auto sorted_dawg(IterType iter, IterType end) -> SortedDawg *
Factory function that initializes a new SortedDawg using the terms yielded from an iterator.
std::pair< std::string, std::size_t > Candidate
Spelling candidate that includes both the dictionary term and its edit distance from the query term.
auto deserialize_protobuf(const fs::path &path) -> Dawg *
Deserializes the protobuf containing a Dawg at the given path or returns nullptr if none exists.
auto serialize_protobuf(Dawg *dawg, const fs::path &path) -> bool
Serializes the given Dawg to protobuf at the given path.