Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add vignette on nonequi joins #20

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

benmarwick
Copy link

Here is short vignette in response to your call, showing a use that seems in demand, but not easily available elsewhere, cf. tidyverse/dplyr#557 and http://stackoverflow.com/q/41132081/1036500. Let me know what you think!

@codecov-io
Copy link

codecov-io commented Dec 15, 2016

Current coverage is 84.65% (diff: 100%)

Merging #20 into master will increase coverage by 1.16%

@@             master        #20   diff @@
==========================================
  Files             7          9     +2   
  Lines           327        443   +116   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits            273        375   +102   
- Misses           54         68    +14   
  Partials          0          0          

Powered by Codecov. Last update 972a7f3...52be550

@dgrtwo
Copy link
Owner

dgrtwo commented Dec 15, 2016

Wow! I really didn't realize it was this simple (I've tried non-equi joins before, but with solutions that weren't as concise).

Before merging, I wonder if there's a way to turn this into a ineq_join family of functions. I suppose it wouldn't end up much more concise than this, but I like that it would make it clearer to the reader what was happening (and not making the end user regularly use match_fun).

@benmarwick
Copy link
Author

Thanks for taking a look, the credit goes to the author of this SO answer.

For a specific function, do you mean something that might work like this?

ineq_join(x, y, 
          join_by = c("x1" >= "y1",   
                      "x1" <= "y2"))

@dgrtwo
Copy link
Owner

dgrtwo commented Jan 16, 2017

I'm looking at this again while planning a CRAN release. I'm starting to think if we do encourage ineq joins, there should be a function provided, and furthermore that it should use data.table as a backend, since it's much faster.

One example, of joining two tables of size 1000 each, indicates data.table can be ~100X faster:

library(nycflights13)
library(fuzzyjoin)
library(data.table)

f <- head(flights, 1000)

library(microbenchmark)

mb <- microbenchmark(fj = fuzzy_left_join(f, f, by = c("hour" = "hour", "minute" = "minute"), match_fun = list(`==`, `>=`)),
                     dt = setDT(f)[setDT(f), on = .(hour == hour, minute >= minute), allow.cartesian = TRUE],
                     times = 5)

mb

Results on my machine:

Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval cld
   fj 2492.45894 2658.27418 2708.33949 2663.66922 2803.16895 2924.12617     5   b
   dt   13.97189   16.14638   28.32267   16.54213   33.72303   61.22995     5  a 

I don't mind taking on data.table as a dependency (probably IMPORTS, though could be SUGGESTS with a check at the start of the function) since there are likely other opportunities to use it to speed up functions.

@benmarwick
Copy link
Author

That's great to know about the speed up from data.table. I'll have a go at making an ineq function and some tests in my fork, but if you best me to it I won't be upset 😄

@dgrtwo
Copy link
Owner

dgrtwo commented Jan 17, 2017

I've got most of this implemented now, so I'll go ahead with finishing it!

@dgrtwo dgrtwo closed this Jan 17, 2017
@dgrtwo dgrtwo reopened this Jan 17, 2017
@benmarwick
Copy link
Author

Sounds good, thanks!

@benmarwick
Copy link
Author

Did this make it into the current CRAN version? I'm not sure that I can find it in the docs. Thanks!

@dgrtwo
Copy link
Owner

dgrtwo commented Jun 19, 2017

Unfortunately not yet- I think I'm going to submit a CRAN version today (there are some long-running bug fixes and new features) and then get back to work on this for the next version. I'd rather have it all complete though, and again I really do appreciate the vignette!

@benmarwick
Copy link
Author

Righto, thanks for the update. I'm looking forward to the next CRAN release!

@CGMossa
Copy link

CGMossa commented Mar 27, 2022

What happened to this update :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants