GitXplorerGitXplorer
h

bitap

public
10 stars
0 forks
0 issues

Commits

List of commits on branch master.
Unverified
3cee8ad9ae4dcfe93ac7177b2f81125c9f25765d

Create .gitattributes

hheyimalex committed 5 years ago
Unverified
3f6dbb36ad82a1f9fe63d01a8a839acd16d574b4

Bump versions, set readme

hheyimalex committed 5 years ago
Unverified
edfae4fe49f01990e6a6ffb92e5bc7bcad1748fa

Improve inline documentation

hheyimalex committed 5 years ago
Unverified
da90f5416a5f9fe8d0360a969990ce2a9c82246f

Update README.md

hheyimalex committed 5 years ago
Unverified
febb939c2562c41beb3b74d64b6e0be3bf5aa266

Make MaskIterator private

hheyimalex committed 5 years ago
Unverified
3fb5457ba1d9308b2631d815407400989689dc97

Make find return start index

hheyimalex committed 5 years ago

README

The README file for this repository.

Bitap

Crates.io

Implementation of the bitap algorithm for fuzzy string search in Rust.

If you have a bunch of text to search through and a substring that you're looking for, bitap can efficiently find all places in the source text that are at most k edits away from the search string, where "edits" are in terms of Levenshtein distance or optimal string alignment distance. There's a small upfront cost, but the runtime is O(nk) in practice, which is pretty solid if that's what you need.

Usage

Compile a Pattern and then use it to search.

use bitap::{Pattern,Match};

// Compile the pattern you're searching for.
let pattern = Pattern::new("wxrld")?;

// Run one of the search functions:
// - pattern.lev for levenshtein distance matching
// - pattern.osa for optimal string alignment distance matching
// The second parameter determines the maximum edit distance
// that you want to return matches for.
let max_distance = 1;
let matches = pattern.lev("hello world", max_distance);

// Horray!
assert_eq!(matches.next(), Some(Match{ distance: 1, end: 10 }));

Limitations

  • Pattern size is limited to system word size (mem::size_of::<usize>() - 1), so you can't search for anything longer than 31/63 characters, depending on architecture. This is a fundamental limitation of the algorithm. This seems like a pretty bad limitation, but for fuzzy search at least you're probably going to split up your query into tokens and run bitap n times. Antidisestablishmentarianism is only 28 characters after all.

  • Bitap can tell you where a match ends, but not where it begins. The section on match highlighting goes into more detail about this.

  • Unicode is weird. When you think of "edit distance", you usually think in terms of "characters". But a Unicode code-point doesn't map to a single "character". A character could be "a" or it could be "ă̘̙̤̪̹̰͔͒̃̃͐̂͘". A single character could be ten dads. Under the 1-char-per-character rule, "alyx" is technically two edits away from "aléx" (where é is e + ́ ) when you really expect it to be one. The Pattern struct works this way internally, where one char equals one character. If you need more nuanced behavior, you're free to use iterator adapters described in the section below. You could also normalize text to remove extraneous decorations, which may be what your users want anyway.

Adapters

The Pattern struct handles most common usages of bitap; fuzzy searching for unicode patterns in unicode text. But it's not perfect, and bitap itself is generalizable to a much broader range of problems.

Luckily, the core of bitap is actually representable in a way that doesn't care about whether you're dealing with code points or graphemes or even nucleotides, and I can punt all those concerns to someone who cares!

They key insight is that the main algorithm works on an iterator of pattern masks. Bitap can then be implemented as a iterator adapter that takes in Iterator<Item = usize> and returns an iterator of matches. That's what the top level find, levenshtein and optimal_string_alignment functions are; you write the code that makes the pattern-mask iterator, they find the matches.

Static Variants

There are a couple of static versions of the iterator adapters, levenshtein_static and optimal_string_alignment_static. What's that about?

According to wikipedia:

In his seminal paper, Damerau stated that more than 80% of all human misspellings can be expressed by a single error of one of the four types.

Also, I did a lot of toying with algolia while thinking about fuzzy search, and they only allow up to two typos. So allowing up to two errors is probably a common enough case to optimize for, and we can eek out ~15% more performance by avoiding an allocation!

Match Highlighting

Bitap can unfortunately only tell you what index a match ends on. When edit distance is zero, the beginning of the match is trivially match.end - pattern_length + 1, but with edits it's that +- match.distance. This makes perfectly accurate highlighting a pain, but here are some strategies that have worked for me.

  • You almost certainly want filter your matches into local-minima; every zero distance match is sandwiched by two one edit matches, those by two edit matches, those by three edit matches, and so on. By filtering out those wrapping matches, you save yourself a lot of work.

  • If matches are relatively rare and max_distance is low, it's probably fine to use something like the strsim crate to brute-force the beginning of the match by checking all substrings between and returning when one has the appropriate edit distance.

  • Highlighting around insertions, ie "hello" highlighting "helxlo", is difficult and I haven't come up with an easy way to do it. Just highlight the whole thing and the humans reading it will understand.

  • In general, people care more about the start of a match than the end. If you run bitap in reverse, with a reversed pattern over a reversed string, match.end is actually the beginning! You can then highlight pattern_length characters ahead, skipping leading and trailing whitespace, and it's probably good enough.

As I think about this more some it may be added into the crate.

Also, I should note that while I haven't figured out how to recover the beginning of the match from the internal bitap state, that doesn't mean it's impossible. Interested to see if anyone can come up with something!