Fuzzy searching, all-in-SQLite version #103

dmfay · 2020-09-28T23:55:10Z

A quarter-second faster than the skim version at ~100k entries; the one thing this needs is highlight detection, which still only looks for a contiguous match.

cantino · 2020-09-30T16:56:40Z

Thanks! I'll play with this shortly :)

cantino · 2020-10-15T01:01:38Z

Sorry for the delay :)

This seems plenty fast enough. It'd be really nice if it could prioritize longer matches somehow, though. ...I'm not sure how best to do that using the % approach.

dmfay · 2020-10-15T13:11:29Z

No worries, there's a lot happening right now 😬

I'll look at fixing up the highlighting soon but here's how you'd prioritize longer matches in general, for results having the same rank:

diff --git a/src/history/history.rs b/src/history/history.rs
index 6c2eca1..cea1d2f 100644
--- a/src/history/history.rs
+++ b/src/history/history.rs
@@ -247,7 +247,8 @@ impl History {
                                   selected_occurrences_factor, occurrences_factor
                            FROM contextual_commands
                            WHERE cmd LIKE (:like)
-                           ORDER BY rank DESC LIMIT :limit";
+                           ORDER BY rank DESC, length(cmd) DESC
+                           LIMIT :limit";
         let mut statement = self
             .connection
             .prepare(query)

However, I don't think it's a good idea. Especially with fuzzy searching, longer matches get filtered out naturally only if you keep typing characters the shorter ones lack, so prioritizing longer matches makes the shorter ones all but unreachable by fuzzy searching alone.

$ cruns

cargo run -- search --fuzzy
cargo run -- search

If anything, I'd want the opposite, so cargo run -- search would be the top result until you entered an f or a z.
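To make the filtering behaviour concrete, here is a minimal sketch (not the code in this PR). It assumes the "% approach" means interleaving `%` between the typed characters before handing the string to LIKE; `fuzzy_like_pattern` and `is_subsequence` are hypothetical names, and `is_subsequence` is only an in-memory stand-in for what the LIKE filter would accept.

```rust
/// Build a LIKE pattern that matches the query characters in order with
/// anything in between, e.g. "cruns" -> "%c%r%u%n%s%".
/// Assumption: `%` and `_` typed by the user are not escaped here.
fn fuzzy_like_pattern(query: &str) -> String {
    let mut pattern = String::from("%");
    for ch in query.chars() {
        pattern.push(ch);
        pattern.push('%');
    }
    pattern
}

/// In-memory stand-in for the LIKE filter: true when every character of
/// `query` appears in `cmd` in the same order. The comparison ignores ASCII
/// case, mirroring SQLite's default LIKE behaviour.
fn is_subsequence(query: &str, cmd: &str) -> bool {
    let mut cmd_chars = cmd.chars();
    // For each query character, consume `cmd` until that character is found.
    query
        .chars()
        .all(|want| cmd_chars.any(|have| have.eq_ignore_ascii_case(&want)))
}
```

With these helpers, `cruns` accepts both commands above, while `crunsf` accepts only the `--fuzzy` one, so the longer command stays reachable by typing one more character and the shorter one does not.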

cantino · 2020-10-15T15:49:18Z

Agreed that shorter commands (or longer contiguous substring matches?) would be better. Since the rank from McFly is a float, though, I think the length would almost never be used as a tiebreaker in your implementation.

dmfay · 2020-10-15T16:05:52Z

Serves me right for not checking; you could order by rank rounded to int, then length (ascending), then the full rank, but that's getting a bit kludgy.
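The three-key ordering described above could look roughly like this when applied to rows that have already been fetched. This is a sketch under assumptions: `Row` and `sort_rows` are hypothetical, and the real query returns more columns than `cmd` and `rank`.

```rust
/// Hypothetical row shape; the real result set has more columns.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    cmd: String,
    rank: f64,
}

/// Sort by rank rounded to an integer (descending), then by command length
/// (ascending), then by the full rank (descending).
fn sort_rows(rows: &mut [Row]) {
    rows.sort_by(|a, b| {
        b.rank
            .round()
            .total_cmp(&a.rank.round())
            .then_with(|| a.cmd.len().cmp(&b.cmd.len()))
            .then_with(|| b.rank.total_cmp(&a.rank))
    });
}
```

Rounding puts nearby float ranks into the same bucket, which is what lets the length key take effect at all; the bucket width of 1.0 is arbitrary, which is part of why this feels kludgy.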

Weighting contiguous characters higher would be a good next step after basic fuzzy searching. That's beyond the capabilities of SQLite, so it'd have to happen after results are returned. Because the reweighting would run after the LIMIT is applied, some results near the bottom could be worse than ones that were cut off by the LIMIT, but that's not much of a problem in practice.

cantino · 2020-10-15T16:55:29Z

I'd be fine merging this as-is if we could fix highlighting, and then we can always work on improving the sorting.

dmfay · 2020-10-17T14:40:55Z

This implementation could be smartened up a little bit -- right now clr matches a[cceler]ate instead of ac[celer]ate -- but it works, it's correct, and, if moved a bit earlier, end - start is exactly what we need to weight closer/contiguous matches.
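For illustration, here is a sketch of the two behaviours in the `clr` example. It is not the code in this PR: `greedy_bounds` and `tightened_bounds` are hypothetical names, matching is case-sensitive for simplicity, and the returned pair is a byte range `[start, end)`.

```rust
/// Leftmost greedy bounds: the range from the first occurrence of the first
/// query character to the position where the in-order match completes.
fn greedy_bounds(cmd: &str, query: &str) -> Option<(usize, usize)> {
    let mut wanted = query.chars();
    let mut next = wanted.next()?; // an empty query has no bounds
    let mut start: Option<usize> = None;
    for (i, ch) in cmd.char_indices() {
        if ch == next {
            if start.is_none() {
                start = Some(i);
            }
            match wanted.next() {
                Some(c) => next = c,
                None => return Some((start?, i + ch.len_utf8())),
            }
        }
    }
    None
}

/// Keep the greedy end, then walk backwards matching the query in reverse so
/// the span starts as late as possible for that end.
fn tightened_bounds(cmd: &str, query: &str) -> Option<(usize, usize)> {
    let (_, end) = greedy_bounds(cmd, query)?;
    let mut wanted = query.chars().rev();
    let mut next = wanted.next()?;
    for (i, ch) in cmd[..end].char_indices().rev() {
        if ch == next {
            match wanted.next() {
                Some(c) => next = c,
                None => return Some((i, end)),
            }
        }
    }
    None
}
```

On `accelerate` with `clr`, the greedy version covers `cceler` and the tightened version covers `celer`. The backward pass only minimizes the span for the first completed match; it does not search for the shortest span anywhere in the command.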

dmfay · 2020-11-01T19:30:25Z

Fuzzy matches are now weighted by length, so a lower-ranked but shorter match has a chance to come in above a higher-ranked but longer match. For example, with the search text sshmyk and these two matches:

[ssh -i ~/.ssh/my_k]ey user@host
[Some gigantically long String wHich has already Matched 'ssh' but will not complete the match until a Y and a K] are found

The difference in the two match lengths (18 vs 111 characters) is factored into the weight calculation. Here the real ssh command gets a rank bump of +0.86 while the longer string's rank is increased only by +0.14, so the ssh command can overcome an unweighted rank disparity of up to 0.72.

If the longer match were instead, say, 30 characters, the ssh command would be ranked at +0.63 while the longer string would add +0.38. The ssh command's unweighted rank would have to be within 0.25 of the longer string to jump ahead.
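The figures above are consistent with giving each match a bump of one minus its share of the combined match length. That formula is an assumption reverse-engineered from these two-match examples, not a quote of the implementation, and `length_bumps` is a hypothetical name.

```rust
/// Assumed formula: bump = 1 - len / (sum of all match lengths).
/// Only checked against the two-match figures quoted in the thread.
fn length_bumps(match_lens: &[usize]) -> Vec<f64> {
    let total: usize = match_lens.iter().sum();
    match_lens
        .iter()
        .map(|&len| {
            if total == 0 {
                0.0 // no matched characters at all: nothing to weight
            } else {
                1.0 - len as f64 / total as f64
            }
        })
        .collect()
}
```

For lengths 18 and 111 this gives roughly 0.86 and 0.14 (a gap of about 0.72); for 18 and 30 it gives 0.625 and 0.375 (a gap of 0.25), matching the rounded numbers above.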

cantino · 2020-11-15T20:01:41Z

I think this is a very sensible approach. Ideally, it'd be nice to empirically derive some sort of weighting for fuzzy matches that takes into account how long the character runs are, how many different sections it's matching, etc., but to do this, I think we'd need a bunch of training data that we don't have. For the time being, this seems fine if it's working well for you in practice.

cantino · 2020-11-15T20:05:55Z

I'll install this locally and start using it in fuzzy mode as well.

The code looks good. Would you mind running cargo clippy and cargo fmt on it, though? I think there are a couple of style differences.

dmfay · 2020-11-15T20:47:19Z

Updated with formatting. I've been using it locally and am pretty happy with it; there's definitely room for further improvements but it's already quite useful.

cantino · 2020-12-06T19:33:02Z

Been working for me! Thanks for the contribution :)

cantino · 2020-12-06T21:26:41Z

Released in v0.5.1 :)

dmfay force-pushed the sqlite-fuzzy branch from 908011c to 0f17937 Compare September 29, 2020 00:10

dmfay added 2 commits September 28, 2020 20:11

--fuzzy search, sqlite version

9f2c90c

MCFLY_FUZZY env variable

0f17937

fuzzy highlights

96f5aa8

weight fuzzy results by ascending length in addition to rank

31e5ffd

formatting

065bac4

dmfay mentioned this pull request Nov 18, 2020

--fuzzy search entire history with skim #102

Closed

cantino merged commit 24a2443 into cantino:master Dec 6, 2020

dmfay mentioned this pull request Oct 27, 2021

Option to priorize exact matches over fuzzy ones #183

Open
