
Option to ignore encoding errors #763

Open
brandonpoc opened this issue Jan 21, 2017 · 6 comments

Comments

@brandonpoc

I use the csvkit tools on very, very large files piped in from another process's stdout - often tens of gigabytes and hundreds of millions of lines - and these runs can take a very long time, sometimes in excess of 12 hours. A major thorn in my side is checking on a run's progress and finding that it bailed out after only a few minutes because of some uncaught Python exception. The older version of csvkit I had was full of these problems, and the latest version remedied many of them, but I hit an error involving a bad UTF-8 byte that caused the program to bail out. The error from csvgrep was:

Your file is not "utf-8" encoded. Please specify the correct encoding with the -e flag. Use the -v flag to see the complete error.

I found the line and ran it through iconv / uni2ascii / recode, etc., and the verdict was unanimous - there was some bad byte sequence in the input file for whatever reason. Using -e to specify different encodings (e.g. ascii, utf-8, etc.) did not help. Ultimately, because running iconv or recode over the whole file would have taken too long, and uni2ascii was bailing out, I just piped the file through the "strings" utility before passing it to csvgrep as ASCII.

So, in order to prevent these types of errors from causing the program to unequivocally exit (crash, in my opinion!), it would be nice to have an option common to all csvkit tools that forces all errors to be ignored, with each one perhaps just output to stderr or written to a log file along with the content of the accompanying record(s), line number(s), and the reason for the exception. The offending line could then be left out of the output, and if needed it could be fixed manually before re-running the csvkit tools.
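Just to illustrate the kind of behaviour I mean - this is only a sketch, not a proposal for how csvkit should actually implement it, and the handler name and file name are arbitrary - Python already allows registering a custom codec error handler that logs the bad bytes and substitutes U+FFFD instead of raising:

```python
import codecs
import sys

def log_and_replace(exc):
    """Report the undecodable bytes to stderr, substitute U+FFFD, keep decoding."""
    bad = exc.object[exc.start:exc.end]
    print(f"encoding error: {bad!r} at offset {exc.start} in the current buffer "
          f"({exc.reason})", file=sys.stderr)
    return ("\ufffd", exc.end)

codecs.register_error("log_and_replace", log_and_replace)

# A tool could then open its input with this handler and never bail out:
with open("data.csv", encoding="utf-8", errors="log_and_replace") as f:
    for line in f:
        pass  # normal processing continues past the bad bytes
```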

This would make it much, much friendlier for running against large file sets. Again, when it takes, say, 12 hours to pipe just one of my data sets through csvgrep, it absolutely crushes me to see an error that stopped it cold just 45 minutes in. Then I have to grep the original file for the line it crashed on, do the subtraction to work out how many lines remain, tail that remaining count from the source file into another file, try to figure out the problem line, and re-run csvkit only to find AGAIN that a SINGLE BAD BYTE crashed the dang thing.

I hope you understand my frustration, and why an option to forcefully and explicitly continue in the face of errors, ignore the record(s) in error, and just output them to stderr and/or a log would be helpful.

Thank you!

@jpmckinney
Member

Hmm, not quite sure how we'd implement this. Currently, encoding errors are caught in _install_exception_handler. We'd instead have to catch them closer to where they are raised, and then either continue processing or re-raise the error.
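As a very rough illustration of that pattern - this is not csvkit's actual reader code, and the ignore_errors flag and function names are made up:

```python
import csv
import sys

def decoded_lines(path, encoding="utf-8", ignore_errors=False):
    """Yield decoded lines, skipping (and logging) any that fail to decode."""
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                yield raw.decode(encoding)
            except UnicodeDecodeError as e:
                if not ignore_errors:
                    raise  # current behaviour: bubble up to _install_exception_handler
                print(f"line {lineno}: skipped ({e})", file=sys.stderr)

def rows(path, ignore_errors=False):
    # csv.reader accepts any iterable of strings, so the filter slots in here.
    return csv.reader(decoded_lines(path, ignore_errors=ignore_errors))
```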

@VladimirAlexiev

I second this issue. For example, metmuseum/openaccess#11 concerns MetObjects.csv, a 224 MB file from the Met Museum; to extract some statistics I'd like to skip rows with badly encoded characters. The exception there is:

'charmap' codec can't encode character '\xa9' in position 73: character maps to <undefined>
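Side note: that particular traceback is an encode error when writing output (the console's charmap codec can't represent '\xa9', i.e. ©), not a decode error when reading the CSV. Assuming Python 3.7+, one sketch of a workaround on the output side is to force UTF-8 with replacement before anything is written; setting PYTHONIOENCODING=utf-8 in the environment has a similar effect from the outside:

```python
import sys

# Emit a replacement character for anything the output encoding can't
# represent, instead of raising an encode error (Python 3.7+).
sys.stdout.reconfigure(encoding="utf-8", errors="replace")
sys.stderr.reconfigure(encoding="utf-8", errors="replace")
```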

@jpmckinney changed the title from "Forcefully ignore errors to prevent programs bailing out and exiting." to "Option to ignore errors" on May 21, 2018
@allentc

allentc commented Jun 2, 2018

The world is rife with huge CSV files that contain the occasional "bad" UTF-8 sequence. The Unicode REPLACEMENT CHARACTER (U+FFFD) exists to handle exactly these situations. While I don't care to argue the virtue of ignoring all conceivable errors, it seems reasonable to expect at least csvclean, and possibly other csvkit tools, to (optionally?) handle invalid UTF-8 by replacement.

In the meantime, one can work around at least the UTF-8 exceptions by dropping the invalid bytes with iconv (where it exists):

$ cat data.csv | iconv -f utf-8 -t utf-8 -c | csvclean
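A Python-level sketch of the replacement behaviour mentioned above (not something csvkit exposes today; the file names are just examples) is the built-in "replace" error handler, which substitutes U+FFFD for each undecodable sequence rather than dropping it the way iconv -c does:

```python
# Rewrite data.csv as clean.csv with every undecodable byte sequence replaced
# by U+FFFD, so tools reading clean.csv never hit a UnicodeDecodeError.
with open("data.csv", encoding="utf-8", errors="replace") as src, \
     open("clean.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)
```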

@halloleo

halloleo commented Jun 2, 2018

I concur. The codec error has caught me a few times, and it is quite a hassle to create a cleaned version of the CSV file before using csvkit. Most of the time I don't care about the special characters and would be happy with a result stripped of the undecodable ones. So I vote strongly for this option!

@jpmckinney added this to the 1.0.4 milestone on Jun 2, 2018
@majascules

Agreed with all of the above. This has cost me mountains of time trying to sanitize input, and an option like this would be a big help. If I had an ounce of practical skill, I'd make a pull request myself.

@dagoldman

I know this is an old thread, but here goes anyway.

I also run into this problem all the time - "special" characters in a variety of data files, for example hex bytes 96 bd be e0 e9 91 and many others. It seems the OP was seeing even worse problems, but I think the same principle I outline here may apply.

I want the file to be JUST ASCII, so I use a preprocessing step - in my case a short piece of C code that reads a UNIX input data file.

For each character read: if it is a \n newline, I increment the line count and pass it through. If it is a tab character in a file that is not tab-separated, or if isprint() returns zero, I replace the character with '!' (my arbitrary choice) and report what happened to a log file (file name, line number, column, character replaced). Otherwise, I just write the input character and move on to the next one. This way I am guaranteed to import the file successfully, and I get a log file showing all the "special" characters that got replaced.
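A rough Python equivalent of that C filter - the file names, the '!' placeholder, and the IS_TSV switch are just my own choices, not anything csvkit provides - would be:

```python
import sys

IS_TSV = False      # set True if the input really is tab-separated
REPLACEMENT = b"!"  # arbitrary placeholder, as described above

def clean(in_path, out_path, log_path):
    line_no, col = 1, 0
    with open(in_path, "rb") as src, \
         open(out_path, "wb") as dst, \
         open(log_path, "w") as log:
        while True:
            byte = src.read(1)
            if not byte:
                break
            ch = byte[0]
            col += 1
            if ch == 0x0A:                      # newline: keep it, bump line count
                dst.write(byte)
                line_no += 1
                col = 0
            elif ch == 0x09 and IS_TSV:         # tab in a TSV file: keep it
                dst.write(byte)
            elif not 0x20 <= ch <= 0x7E:        # tab elsewhere, or non-printable byte
                dst.write(REPLACEMENT)
                log.write(f"{in_path} line {line_no} col {col}: "
                          f"replaced byte 0x{ch:02x}\n")
            else:
                dst.write(byte)                 # plain printable ASCII: pass through

if __name__ == "__main__":
    clean(sys.argv[1], sys.argv[2], sys.argv[3])
```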

I run it as a preprocessing step each time, but you could also convert the input file to a cleaned version once and keep using that. Anyway, the above works well for me, and in my experience it does not take all that long to run - the bottleneck is almost entirely the input and output.

Luckily for me, the "special" characters are ALWAYS in a field that I do not use. For the files I see, there are large text fields with various descriptions that contain the "special" characters. I never use those fields, so I don't care about wiping out the "special" characters. I just want the file to import, and the process not to crash, as others have said.

Daniel

@jpmckinney removed this from the Next version milestone on Oct 17, 2023
@jpmckinney changed the title from "Option to ignore errors" to "Option to ignore encoding errors" on Feb 15, 2024