Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle more human readable number imports #1107

Closed
mathemancer opened this issue Mar 1, 2022 · 16 comments · Fixed by #1355
Closed

Handle more human readable number imports #1107

mathemancer opened this issue Mar 1, 2022 · 16 comments · Fixed by #1355
Assignees
Labels
affects: architecture Improvements or additions to architecture good first issue Everything in "Help wanted", PLUS being relatively easy and straightforward to implement. help wanted Community contributors can implement this type: enhancement New feature or request work: backend Related to Python, Django, and simple SQL
Milestone

Comments

@mathemancer
Copy link
Contributor

mathemancer commented Mar 1, 2022

Problem

Currently, we only detect a column as a number type if the entries can be cast to a NUMERIC using the default PostgreSQL casting functionality. This is a problem in certain locales, and also can't handle numbers with grouping separators (e.g., the , in 1,000.00).

Proposed solution

We should write custom logic to allow handling more strings when casting to number types.

Additional context

@mathemancer mathemancer added type: enhancement New feature or request affects: architecture Improvements or additions to architecture work: backend Related to Python, Django, and simple SQL work: database ready Ready for implementation labels Mar 1, 2022
@mathemancer mathemancer added this to the [v1] Data-Handling Improvements milestone Mar 1, 2022
@mathemancer mathemancer added good first issue Everything in "Help wanted", PLUS being relatively easy and straightforward to implement. help wanted Community contributors can implement this labels Mar 4, 2022
@mathemancer
Copy link
Contributor Author

Marking as "help wanted" and "good first issue" since it should be largely an extension of the work done for the MATHESAR_MONEY type.

@sahilsaini1107
Copy link
Contributor

@mathemancer may u assign this to me. I would like to work on it.

@kgodey kgodey added status: started and removed ready Ready for implementation labels Mar 11, 2022
@kgodey kgodey modified the milestones: [v1] Data-Handling Improvements, [06.1] 2022-03 improvements Mar 11, 2022
@kgodey
Copy link
Contributor

kgodey commented Mar 11, 2022

@sahilsaini110 Sure, I've assigned it to you. I also updated the issue description to link to the relevant PR with the custom casting code for money, you'll want to emulate that code.

@sahilsaini1107
Copy link
Contributor

@sahilsaini110 Sure, I've assigned it to you. I also updated the issue description to link to the relevant PR with the custom casting code for money, you'll want to emulate that code.

Thanks, @kgodey for assigning this issue to me. Also, I will go through the pr you have linked and try to fix this issue.

@kgodey
Copy link
Contributor

kgodey commented Mar 29, 2022

@sahilsaini110 I wanted to follow up on this issue since it's been a few days and I don't see a PR yet. Please let us know if you need any help or if you are no longer working on the issue.

@sahilsaini1107
Copy link
Contributor

@sahilsaini110 I wanted to follow up on this issue since it's been a few days and I don't see a PR yet. Please let us know if you need any help or if you are no longer working on the issue.

Sorry for the late reply I was trying to solve this issue but was unable to understand how to proceed then I got stuck in some other work sorry for the inconvenience I will take care of that from now on. Also if possible I need some guidance for this issue that like how to start and proceed with this.

@kgodey
Copy link
Contributor

kgodey commented Apr 10, 2022

@sahilsaini110 The related PR should provide guidance since it implements the same logic for the money data type. If you have a specific question, please ask.

If you're confused by this issue (or any other issue), it would be better to instead work on a different issue that you can grasp quickly. That would give you a better idea of the codebase before working on a more complex issue like this one.

@sahilsaini1107
Copy link
Contributor

@sahilsaini110 The related PR should provide guidance since it implements the same logic for the money data type. If you have a specific question, please ask.

If you're confused by this issue (or any other issue), it would be better to instead work on a different issue that you can grasp quickly. That would give you a better idea of the codebase before working on a more complex issue like this one.

Yes, I think you are right this is a little bit complex issue since it involves lots of files to be changed so I think I have to first become more familiar with the codebase. Also right now I am busy writing the proposal so unable to give that much time to this.

@kgodey kgodey added ready Ready for implementation and removed status: started labels Apr 16, 2022
@kgodey
Copy link
Contributor

kgodey commented Apr 16, 2022

@sahilsaini110 Thanks for the update, I unassigned you.

@Jyuart
Copy link
Contributor

Jyuart commented Apr 19, 2022

I'd like to work on this one.
I've read the discussions and see that it's not an easy one. And still, I'd start with getting acquainted with the codebase along with investigating parts that are necessary for this specific task.
Since this issue doesn't have priority labels, can I take some time to work on this? I'll ask questions whenever they occur and let you know if it turns out too difficult for me.
If it's ok, can you please assign this one to me?

@kgodey
Copy link
Contributor

kgodey commented Apr 19, 2022

@Jyuart assigned to you, thanks! Feel free to take your time.

@kgodey kgodey added status: started and removed ready Ready for implementation labels Apr 19, 2022
@Jyuart
Copy link
Contributor

Jyuart commented Apr 20, 2022

@mathemancer I'm making my way through the code, getting knowledge (#1137 is very useful for this one, thanks).
I just wanted to be clear about the specific type. This issue is about casting just the PostgreSQL's NUMERIC type, right? You are pretty clear about that in the description:

Currently, we only detect a column as a number type if the entries can be cast to a NUMERIC using the default PostgreSQL casting functionality.

But the title of the issue still confuses me a bit. Maybe you meant all the numeric types (integer, decimal, etc?)

@kgodey
Copy link
Contributor

kgodey commented Apr 20, 2022

@Jyuart "Number" refers to the Mathesar UI type, which is a grouping of all numeric type. Please read this wiki page for more context: https://wiki.mathesar.org/en/engineering/glossary/ui-types

@mathemancer
Copy link
Contributor Author

@kgodey is correct in that all types under Number in the page she linked are handled. However, my original intent with this issue was about detecting whether a column is of number type. For that, the only relevant target is NUMERIC, since that's the type in the DAG defined in db/columns/operations/infer_types.py.

So, to enable human readable number imports, changing the function for casting to NUMERIC would suffice.

@kgodey Do we have a use-case for casting directly from TEXT (or string-like) to other number types? Somehow I thought the flow pointed towards NUMERIC first, then allowed changing to other more specific number-like types. In that case, we should also change all other casting functions with number target types accordingly. I think we should make that a separate issue, however, to avoid blocking getting the work for handling imports merged.

@kgodey
Copy link
Contributor

kgodey commented Apr 21, 2022

@kgodey Do we have a use-case for casting directly from TEXT (or string-like) to other number types? Somehow I thought the flow pointed towards NUMERIC first, then allowed changing to other more specific number-like types.

For inferring the type when importing data, we decided to only do NUMERIC for now since we can handle most numbers with it. Once we have global inference, we might want to be smarter about this.

We should allow users to manually cast from TEXT to any number type, though.

In that case, we should also change all other casting functions with number target types accordingly. I think we should make that a separate issue, however, to avoid blocking getting the work for handling imports merged.

Please make a separate issue if needed. I agree that this issue shouldn't be blocked.

@seancolsen
Copy link
Contributor

In Matrix, @Jyuart asked about which number formats should be considered and how to find all possible numeric representations in different locales. I'll respond here...

Without grouping separators

Without grouping separators, the problem is fairly straightforward because the decimal separators can either be a dot or a comma.

With grouping separators

I did some research and determined that the following five cases should represent all possible locales, when grouping separators are applied:

example locale example format
'en' -123,456.7
'de' -123.456,7
'fr' -123 456,7
'hi' -1,23,456.7
'de-CH' -123'456.7

Notes:

  • There are many ways you could hypothetically format a number which (according to my research) do not occur for any locale. For example, -123 456.7 or -123'456,7.

  • 'en' and 'hi' differ in the logic use to apply grouping separators.

  • You can test out different numbers with different locales by running js like the following in your browser:

    Intl.NumberFormat('hi').format(987654321.09)

@kgodey kgodey modified the milestones: [08.1] 2022-05 improvements, High Priority Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects: architecture Improvements or additions to architecture good first issue Everything in "Help wanted", PLUS being relatively easy and straightforward to implement. help wanted Community contributors can implement this type: enhancement New feature or request work: backend Related to Python, Django, and simple SQL
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

5 participants