
Error when working with big dataset #152

Open
dtgnn opened this issue Dec 7, 2021 · 5 comments

dtgnn commented Dec 7, 2021

Hi and thank you for your work on the coder package. I ran into issues while applying the categorize function to a fairly large dataframe (~4GB). The function returns the following error message:

Error in copybig(x, .copy) : 
  Object is > 1 GB. Set argument 'copy' to TRUE' or FALSE to declare wether it should be copied or changed by reference!

But there seems to be no way, judging from the documentation, to actually set the copy argument. I've tried adding either copy = TRUE or .copy = TRUE to my calls to categorize(), in both cases without effect. Is there another way to address the issue?


eribul commented Dec 7, 2021

Dear @dtgnn!

I am very happy to hear that you are using the package! And thank you very much!
I realize the documentation here should be improved!

It is a little complex: the .copy argument is used by the coder::copybig() function, but it is passed from coder::categorize() via coder::codify() before it gets there. Hence, it is not documented in ?categorize but in ?copybig and ?codify. Anyway, arguments passed from coder::categorize() to coder::codify() must be wrapped in a list, such as: categorize(..., codify_args = list(.copy = <TRUE/FALSE>)). This is because categorize() can pass arguments both between its methods and on to codify() and set_classcodes() (the latter if x is of class data.table).
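For reference, a full call could look roughly like this. This is a sketch only: the data frame, the coded data set, and the column names are placeholders for your own data, and the argument names are assumed to follow the package's examples.

```r
library(coder)

# Hypothetical data: `people` holds one row per unit, `diagnoses` the coded data.
categorize(
  people,
  codedata    = diagnoses,
  cc          = "charlson",            # example classcodes scheme
  id          = "id",
  code        = "icd10",
  codify_args = list(.copy = FALSE)    # forwarded to codify() and on to copybig()
)
```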

Please let me know whether this works!


dtgnn commented Dec 7, 2021

Thank you for your message, @eribul. With your input I now see that the .copy argument is indeed listed in the codify() help page... my bad! I'll try to amend my code and report back with the results.


dtgnn commented Dec 8, 2021

Hello @eribul,

Just a quick update to say that I tried passing both options to codify(), but neither seemed to handle my large dataframe well.

Using categorize(..., codify_args = list(.copy = FALSE)) produced the following error:

Error: cannot allocate vector of size 1.3 Gb
Error during wrapup: cannot allocate vector of size 1.4 Gb
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

Using categorize(..., codify_args = list(.copy = TRUE)) caused the R session to consume all my available memory (>100 GB); I interrupted the process to keep the session from crashing.

I have resorted to slicing my dataset and iterating over the samples. It seems to do the job.
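For anyone hitting the same wall, the slicing approach can be sketched in base R as below. Here process_chunk stands in for the actual categorize() call on each slice, and the chunk size is an arbitrary illustration value.

```r
# Split a data frame into row chunks, process each, and reassemble the results.
chunked_apply <- function(df, chunk_size, process_chunk) {
  n <- nrow(df)
  starts <- seq(1, n, by = chunk_size)
  pieces <- lapply(starts, function(s) {
    chunk <- df[s:min(s + chunk_size - 1, n), , drop = FALSE]
    process_chunk(chunk)
  })
  do.call(rbind, pieces)  # stack the per-chunk results back together
}

# Toy example with identity processing (replace with the real categorize() call):
toy <- data.frame(id = 1:10, x = letters[1:10])
out <- chunked_apply(toy, 3, identity)
```

Processing chunks keeps the peak memory footprint at roughly one chunk (plus its result) instead of the whole data set at once.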

Thank you again for your help!


eribul commented Dec 9, 2021

I am sorry to hear that!

Is it possible, however, that you are running a 32-bit version of R? If so, I suspect the 1.4 Gb limit is caused by that rather than by your actual RAM. If you are unsure, you can type R.version$arch in the console to find out (it is also stated on the third line of the start-up message when R starts). If possible, I would suggest using a 64-bit version of R.
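For completeness, two quick ways to check this from the console:

```r
# Both report whether the running R build is 64-bit:
R.version$arch           # e.g. "x86_64" on a 64-bit build
.Machine$sizeof.pointer  # 8 on 64-bit R, 4 on 32-bit R
```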

And just to rule out the obvious: the >100 GB is your RAM (not your disk space), right? :-)


dtgnn commented Dec 10, 2021

R arch is x86_64 (so 64-bit). 100 GB of RAM.
