Fuzzy searching, all-in-SQLite version #103

Merged 5 commits on Dec 6, 2020


dmfay
Contributor

@dmfay dmfay commented Sep 28, 2020

A quarter-second faster than the skim version at ~100k entries; the one thing this needs is highlight detection, which still only looks for a contiguous match.

@cantino
Owner

cantino commented Sep 30, 2020

Thanks! I'll play with this shortly :)

@cantino
Owner

cantino commented Oct 15, 2020

Sorry for the delay :)

This seems plenty fast enough. It'd be really nice if it could prioritize longer matches somehow, though. ...I'm not sure how best to do that using the % approach.

@dmfay
Contributor Author

dmfay commented Oct 15, 2020

No worries, there's a lot happening right now 😬

I'll look at fixing up the highlighting soon but here's how you'd prioritize longer matches in general, for results having the same rank:

diff --git a/src/history/history.rs b/src/history/history.rs
index 6c2eca1..cea1d2f 100644
--- a/src/history/history.rs
+++ b/src/history/history.rs
@@ -247,7 +247,8 @@ impl History {
                                   selected_occurrences_factor, occurrences_factor
                            FROM contextual_commands
                            WHERE cmd LIKE (:like)
-                           ORDER BY rank DESC LIMIT :limit";
+                           ORDER BY rank DESC, length(cmd) DESC
+                           LIMIT :limit";
         let mut statement = self
             .connection
             .prepare(query)

However, I don't think it's a good idea. Especially with fuzzy searching, shorter matches get filtered out naturally as you continue to type, so prioritizing longer matches makes the shorter ones all but unreachable by fuzzy searching alone.

$ cruns

cargo run -- search --fuzzy
cargo run -- search

If anything, I'd want the opposite, so cargo run -- search would be the top result until you entered an f or a z.
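To make the filtering argument concrete, here is a minimal sketch of the idea, not the PR's actual code. It assumes "the % approach" means interleaving `%` wildcards between the typed characters to build the `LIKE` pattern; `fuzzy_like_pattern` and `is_subsequence` are hypothetical names, and the sketch does not escape `%` or `_` in the input.

```rust
/// Hypothetical helper: builds a LIKE pattern such as "%c%r%u%n%s%" from "cruns".
/// Assumes the input contains no literal '%' or '_' (no escaping is done here).
fn fuzzy_like_pattern(input: &str) -> String {
    let mut pattern = String::from("%");
    for ch in input.chars() {
        pattern.push(ch);
        pattern.push('%');
    }
    pattern
}

/// In-memory stand-in for what such a pattern accepts: true when every
/// character of `query` appears in `text` in order (ASCII case-insensitive,
/// like SQLite's default LIKE behavior for ASCII letters).
fn is_subsequence(query: &str, text: &str) -> bool {
    let mut remaining = text.chars();
    query
        .chars()
        .all(|q| remaining.any(|t| t.eq_ignore_ascii_case(&q)))
}
```

With these helpers, `cruns` matches both commands above, while `crunsf` or `crunsz` matches only `cargo run -- search --fuzzy`, so the longer command stays reachable by typing one more character even if the shorter one is listed first.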

@cantino
Owner

cantino commented Oct 15, 2020 via email

@dmfay
Contributor Author

dmfay commented Oct 15, 2020

Serves me right for not checking; you could order by rank rounded to int, then length (ascending), then the full rank, but that's getting a bit kludgy.
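The three-key ordering described above could be sketched in Rust as follows, applied to rows after they come back from SQLite. This is an illustration only: the `Candidate` struct and `sort_candidates` function are hypothetical, and the sketch assumes higher rank is better, as in the `ORDER BY rank DESC` query.

```rust
use std::cmp::Ordering;

/// Hypothetical row type: a command and its rank as returned by the query.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    cmd: String,
    rank: f64,
}

/// Sorts by rank rounded to an integer (descending), then command length
/// (ascending), then the full rank (descending).
fn sort_candidates(items: &mut [Candidate]) {
    items.sort_by(|a, b| {
        b.rank
            .round()
            .partial_cmp(&a.rank.round())
            .unwrap_or(Ordering::Equal)
            .then_with(|| a.cmd.len().cmp(&b.cmd.len()))
            .then_with(|| b.rank.partial_cmp(&a.rank).unwrap_or(Ordering::Equal))
    });
}
```

Rounding makes nearby ranks tie, so the shorter command wins within a bucket. The cost is a cliff at each rounding boundary: ranks of 1.49 and 1.51 land in different buckets no matter how the lengths compare, which is part of why this feels kludgy.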

Weighting contiguous characters higher would be a good next step after basic fuzzy searching. That's beyond the capabilities of SQLite, so it'd have to happen after results are returned. Because the LIMIT is applied before that re-weighting, some results near the bottom could also be worse than ones that missed the cut, but that's not much of a problem in practice.

@cantino
Owner

cantino commented Oct 15, 2020

I'd be fine merging this as-is if we could fix highlighting, and then we can always work on improving the sorting.

@dmfay
Contributor Author

dmfay commented Oct 17, 2020

This implementation could be smartened up a little bit -- right now clr matches a[cceler]ate instead of ac[celer]ate -- but it works, it's correct, and if moved a bit earlier, end - start is exactly what we need to weight closer/contiguous matches.
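As an illustration of the bounds described above, here is a minimal sketch rather than the code in this PR. It assumes a greedy leftmost scan produces the `a[cceler]ate` result, and shows one possible way to tighten it by scanning backward from the match end. Both function names are hypothetical, and bounds are byte offsets into the text.

```rust
/// Greedy leftmost match: takes the first occurrence of each query character.
/// Returns byte bounds [start, end) of the matched span, or None.
fn greedy_bounds(query: &str, text: &str) -> Option<(usize, usize)> {
    let q: Vec<char> = query.chars().collect();
    if q.is_empty() {
        return None;
    }
    let mut next = 0;
    let mut start = 0;
    for (i, ch) in text.char_indices() {
        if q[next].eq_ignore_ascii_case(&ch) {
            if next == 0 {
                start = i;
            }
            next += 1;
            if next == q.len() {
                return Some((start, i + ch.len_utf8()));
            }
        }
    }
    None
}

/// Keeps the greedy end, then walks backward matching the query in reverse,
/// which yields the latest possible start for that end.
fn tight_bounds(query: &str, text: &str) -> Option<(usize, usize)> {
    let (_, end) = greedy_bounds(query, text)?;
    let q: Vec<char> = query.chars().collect();
    let mut remaining = q.len();
    for (i, ch) in text[..end].char_indices().rev() {
        if q[remaining - 1].eq_ignore_ascii_case(&ch) {
            remaining -= 1;
            if remaining == 0 {
                return Some((i, end));
            }
        }
    }
    None
}
```

For `clr` against `accelerate`, the greedy scan gives bounds (1, 7), a span of 6, while the tightened version gives (2, 7), a span of 5. The backward pass only finds the shortest span ending at the greedy end, not necessarily the shortest span anywhere in the string.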

@dmfay
Contributor Author

dmfay commented Nov 1, 2020

Fuzzy matches are now weighted by length, so a lower-ranked but shorter match has a chance to come in above a higher-ranked but longer match. For example, with the search text sshmyk and these two matches:

[ssh -i ~/.ssh/my_k]ey user@host
[Some gigantically long String wHich has already Matched 'ssh' but will not complete the match until a Y and a K] are found

The difference in the two match lengths (18 vs 111 characters) is factored into the weight calculation. Here the real ssh command gets a rank bump of +0.86 while the longer string's rank is increased only by +0.14, so the ssh command can overcome an unweighted rank disparity of up to 0.72.

If the longer match were instead, say, 30 characters, the ssh command would get a bump of +0.63 while the longer string would get +0.38. The ssh command's unweighted rank would have to be within 0.25 of the longer string's to jump ahead.
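The figures above are consistent with a bonus of one minus each match's share of the combined match length. The sketch below uses that formula as an assumption inferred from the numbers in this comment, not as the merged implementation; `length_bonuses` is a hypothetical name, and how the real code behaves with more than two candidates is not stated here.

```rust
/// Hypothetical weighting: each match's bonus is 1 - (its length / total length).
/// Shorter matches take a smaller share of the total, so they get a larger bonus.
fn length_bonuses(match_lengths: &[usize]) -> Vec<f64> {
    let total: usize = match_lengths.iter().sum();
    if total == 0 {
        // No match lengths to compare; give no bonus.
        return vec![0.0; match_lengths.len()];
    }
    match_lengths
        .iter()
        .map(|&len| 1.0 - len as f64 / total as f64)
        .collect()
}
```

With lengths 18 and 111 this gives roughly 0.86 and 0.14; with 18 and 30 it gives 0.625 and 0.375, which round to the +0.63 and +0.38 quoted above.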

@cantino
Owner

cantino commented Nov 15, 2020

I think this is a very sensible approach. Ideally, it'd be nice to empirically derive some sort of weighting for fuzzy matches that takes into account how long the character runs are, how many different sections it's matching, etc., but to do this, I think we'd need a bunch of training data that we don't have. For the time being, this seems fine if it's working well for you in practice.

@cantino
Owner

cantino commented Nov 15, 2020

I'll install this locally and start using it in fuzzy mode as well.

The code looks good. Would you mind running cargo clippy and cargo fmt on it, though? I think there are a couple of style differences.

@dmfay
Contributor Author

dmfay commented Nov 15, 2020

Updated with formatting. I've been using it locally and am pretty happy with it; there's definitely room for further improvements but it's already quite useful.

@cantino cantino merged commit 24a2443 into cantino:master Dec 6, 2020
@cantino
Owner

cantino commented Dec 6, 2020

Been working for me! Thanks for the contribution :)

@cantino
Owner

cantino commented Dec 6, 2020

Released in v0.5.1 :)
