Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to force all numbers to be integers (at least for csvjson) #1251

Open
rickhg12hs opened this issue Jun 16, 2024 · 8 comments
Open

Option to force all numbers to be integers (at least for csvjson) #1251

rickhg12hs opened this issue Jun 16, 2024 · 8 comments

Comments

@rickhg12hs
Copy link

rickhg12hs commented Jun 16, 2024

$ csvjson --version
csvjson 2.0.0

This is a strange CSV file, but it does seem to have a header line and three columns. csvjson seems to be having trouble with the data columns.

$ csvjson credits.csv 
ValueError: Row 0 has 45 values, but Table only has 3 columns.

credits.csv:

cast,crew,id
"[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}]", "[{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', 'gender': 2, 'id': 7879, 'job': 'Director', 'name': 'John Lasseter', 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}, {'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', 'gender': 2, 'id': 12891, 'job': 'Screenplay', 'name': 'Joss Whedon', 'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}, {'credit_id': '52fe4284c3a36847f8024f55', 'department': 'Writing', 'gender': 2, 'id': 7, 'job': 'Screenplay', 'name': 'Andrew Stanton', 'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'}, {'credit_id': '52fe4284c3a36847f8024f5b', 'department': 'Writing', 'gender': 2, 'id': 12892, 'job': 'Screenplay', 'name': 'Joel Cohen', 'profile_path': '/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg'}]", 862

I tried various csvjson commandline options, but couldn't find the right recipe.

@sufferfish
Copy link

Can you post a snippet of the CSV file? It would help us troubleshoot.

@rickhg12hs
Copy link
Author

Can you post a snippet of the CSV file? It would help us troubleshoot.

Please see code block for credits.csv above.

@jpmckinney
Copy link
Member

jpmckinney commented Jun 18, 2024

You have initial spaces after the delimiters (commas), so you need to use -S (--skipinitialspace). Your data also is unusual (what looks like Python dictionaries containing commas), so Python's CSV sniffer is getting confused. Disable it with -y0 (--snifflimit 0). You then get csvjson -S -y0, for example:

$ printf "cast,crew,id\n\"[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}]\", \"[{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', 'gender': 2, 'id': 7879, 'job': 'Director', 'name': 'John Lasseter', 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}, {'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', 'gender': 2, 'id': 12891, 'job': 'Screenplay', 'name': 'Joss Whedon', 'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}, {'credit_id': '52fe4284c3a36847f8024f55', 'department': 'Writing', 'gender': 2, 'id': 7, 'job': 'Screenplay', 'name': 'Andrew Stanton', 'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'}, {'credit_id': '52fe4284c3a36847f8024f5b', 'department': 'Writing', 'gender': 2, 'id': 12892, 'job': 'Screenplay', 'name': 'Joel Cohen', 'profile_path': '/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg'}]\", 862" | csvjson -S -y0
[{"cast": "[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}]", "crew": "[{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', 'gender': 2, 'id': 7879, 'job': 'Director', 'name': 'John Lasseter', 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}, {'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', 'gender': 2, 'id': 12891, 'job': 'Screenplay', 'name': 'Joss Whedon', 'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}, {'credit_id': '52fe4284c3a36847f8024f55', 'department': 'Writing', 'gender': 2, 'id': 7, 'job': 'Screenplay', 'name': 'Andrew Stanton', 'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'}, {'credit_id': '52fe4284c3a36847f8024f5b', 'department': 'Writing', 'gender': 2, 'id': 12892, 'job': 'Screenplay', 'name': 'Joel Cohen', 'profile_path': '/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg'}]", "id": 862.0}]

@rickhg12hs
Copy link
Author

@jpmckinney Super! Is there also a way to maintain integers as integers? 862 gets converted to 862.0.

@jpmckinney
Copy link
Member

Sure, use -I (--no-inference)

@rickhg12hs
Copy link
Author

Sure, use -I (--no-inference)

But then 862 becomes "862". :-(

@jpmckinney
Copy link
Member

jpmckinney commented Jun 24, 2024

Ah, yes. Why do you need it to be an int instead of a float with .0?

Edit: Agate (on which csvkit relies) stores numbers as decimals. When serializing to JSON, it converts the decimal to a float in the jsonify method, which Python represents with .0. The jsonify method takes no arguments. It's maybe possible to add an argument the the Number instantiation to force the use of int instead of float, and then expose that argument via a csvkit command-line argument.

@rickhg12hs
Copy link
Author

rickhg12hs commented Jul 13, 2024

Why do you need it to be an int instead of a float with .0?

For my specific use case, converting (and possibly editing, with e.g., jq) CSV files for inclusion in MongoDB collections. Maintaining ints as ints saves space and unifies data types throughout the collection.

@jpmckinney jpmckinney reopened this Jul 13, 2024
@jpmckinney jpmckinney changed the title csvjson: Trouble with atypical CSV file Option to force all numbers to be integers Jul 13, 2024
@jpmckinney jpmckinney changed the title Option to force all numbers to be integers Option to force all numbers to be integers (at least for csvjson) Jul 13, 2024
@jpmckinney jpmckinney added this to the Priority milestone Jul 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants