Upgrade text-numeric inferring #1355

Merged

Conversation

Jyuart

@Jyuart Jyuart commented May 5, 2022

Fixes #1107

This PR upgrades type inference from text (TEXT, CHAR, VARCHAR) to numeric (NUMERIC) types. Because inference is based on casting capabilities, casting to NUMERIC is made more comprehensive as well. Values like:

  • 3.14
  • 123,456.7
  • 123.456,7
  • 123 456,7
  • 1,23,456.7
  • 123'456.7

are now recognized as NUMERIC.

Technical details

It was inspired by PR #1137 and is implemented in the same manner:

  1. A separate function returns a numeric array in the format ['number', 'separator', 'floating-point']. For example: ['331,209.05', ',', '.']
  2. Another function removes the locale-specific parts and extracts the actual number, with . as the decimal point and no separators: 331209.05
  3. Supported formats are based on this comment: Handle more human readable number imports #1107 (comment), but can be easily extended if necessary. Excerpt:

     example locale   example format
     'en'             -123,456.7
     'de'             -123.456,7
     'fr'             -123 456,7
     'hi'             -1,23,456.7
     'de-CH'          -123'456.7

  4. One test in the test_table_api.py file was updated.
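The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the PR's actual functions live in db/types/operations/cast.py and are not shown in this thread, so the names `get_numeric_array` and `normalize_number`, the patterns, and the order in which they are tried are all hypothetical.

```python
import re

# Hypothetical patterns, tried in order. Each may capture the grouping
# separator as "sep" and the decimal mark as "dec".
_PATTERNS = [
    re.compile(r"-?[0-9]+(?:(?P<dec>\.)[0-9]+)?"),                                      # 3.14
    re.compile(r"-?[0-9]{1,3}(?:(?P<sep>,)[0-9]{3})+(?:(?P<dec>\.)[0-9]+)?"),           # 'en'
    re.compile(r"-?[0-9]{1,3}(?:(?P<sep>\.)[0-9]{3})+(?:(?P<dec>,)[0-9]+)?"),           # 'de'
    re.compile(r"-?[0-9]{1,3}(?:(?P<sep> )[0-9]{3})+(?:(?P<dec>,)[0-9]+)?"),            # 'fr'
    re.compile(r"-?[0-9]{1,2}(?:(?P<sep>,)[0-9]{2})+,[0-9]{3}(?:(?P<dec>\.)[0-9]+)?"),  # 'hi'
    re.compile(r"-?[0-9]{1,3}(?:(?P<sep>')[0-9]{3})+(?:(?P<dec>\.)[0-9]+)?"),           # 'de-CH'
]


def get_numeric_array(text):
    """Step 1: return [number, separator, decimal mark], or None if no pattern fits."""
    for pattern in _PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            groups = match.groupdict()
            return [text, groups.get("sep"), groups.get("dec")]
    return None


def normalize_number(text):
    """Step 2: strip separators and use '.' as the decimal mark."""
    parts = get_numeric_array(text)
    if parts is None:
        return None
    number, sep, dec = parts
    if sep:
        number = number.replace(sep, "")  # drop grouping separators first
    if dec and dec != ".":
        number = number.replace(dec, ".")  # then normalize the decimal mark
    return number


for sample in ["3.14", "123,456.7", "123.456,7", "123 456,7", "1,23,456.7", "123'456.7"]:
    print(sample, "->", normalize_number(sample))
```

Inputs such as 123.456 are ambiguous between formats; in this sketch the first matching pattern wins, which is an assumption and not necessarily how the PR resolves it.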

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the master branch of the repository
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@codecov-commenter

codecov-commenter commented May 16, 2022

Codecov Report

Merging #1355 (928c91c) into master (80b8a20) will increase coverage by 0.05%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1355      +/-   ##
==========================================
+ Coverage   92.94%   93.00%   +0.05%     
==========================================
  Files         123      123              
  Lines        5003     5045      +42     
==========================================
+ Hits         4650     4692      +42     
  Misses        353      353              
Flag             Coverage Δ
pytest-backend   93.00% <100.00%> (+0.05%) ⬆️


Impacted Files                Coverage Δ
db/types/categories.py        100.00% <ø> (ø)
db/types/operations/cast.py   99.04% <100.00%> (+0.14%) ⬆️


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80b8a20...928c91c. Read the comment docs.

@Jyuart Jyuart marked this pull request as ready for review May 16, 2022 15:53
@Jyuart Jyuart requested review from a team and kgodey and removed request for a team May 16, 2022 15:53
@Jyuart Jyuart changed the title from "Upgrade numeric inferring" to "Upgrade text-numeric inferring" May 16, 2022
@kgodey kgodey requested review from mathemancer and removed request for kgodey May 23, 2022 16:22
@kgodey kgodey added the pr-status: review A PR awaiting review label May 23, 2022

@mathemancer mathemancer left a comment


See my specific comments about the regexes.

db/types/operations/cast.py (outdated)
Comment on lines 1040 to 1044
period_sep_comma_decimal = r"[0-9]{1,3}(?:(\.)[0-9]{3})+(?:(,)[0-9]+)?"
comma_sep_period_decimal = r"[0-9]{1,3}(?:(,)[0-9]{3})+(?:(\.)[0-9]+)?"
space_sep_comma_decimal = r"[0-9]{1,3}(?:( )[0-9]{3})+(?:(,)[0-9]+)?"
comma_sep_period_decimal_lakh = r"[0-9]{1,3}(?:(,)[0-9]{2})+,[0-9]{3}(?:(\.)[0-9]+)?"
apostophe_sep_period_decima = r"[0-9]{1,3}(?:(\'')[0-9]{3})+(?:(\.)[0-9]+)?"

These lines (or many of them) have some subtle bugs similar to what I described in more detail in the other comment. Please consider just using the regexes from the previous PR (sans the ones related specifically to finding currency symbols).
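Two issues in the quoted lines can be checked directly. Whether these are the bugs the reviewer had in mind is not stated in this thread, so treat the snippet below as an independent check; the variable names and patterns are copied verbatim from the lines under review.

```python
import re

# Copied verbatim from the lines under review (including the original names).
apostophe_sep_period_decima = r"[0-9]{1,3}(?:(\'')[0-9]{3})+(?:(\.)[0-9]+)?"
comma_sep_period_decimal_lakh = r"[0-9]{1,3}(?:(,)[0-9]{2})+,[0-9]{3}(?:(\.)[0-9]+)?"

# (\'') is an escaped apostrophe followed by a second apostrophe, so the
# separator group only matches two apostrophes in a row.
single = re.fullmatch(apostophe_sep_period_decima, "123'456.7")
double = re.fullmatch(apostophe_sep_period_decima, "123''456.7")
print(single is None, double is not None)  # True True

# The lakh pattern allows a three-digit leading group, so it accepts
# 123,45,678. The lakh pattern from the money-detection function quoted
# later in this thread limits the leading group to [0-9]{1,2}.
lakh = re.fullmatch(comma_sep_period_decimal_lakh, "123,45,678")
print(lakh is not None)  # True
```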

@dmos62

dmos62 commented May 25, 2022

@mathemancer @Jyuart conversation about regexes reminded me of this blog post about composing regex expressions using f-strings and re.VERBOSE: https://death.andgravity.com/f-re
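For readers unfamiliar with the technique, here is a minimal sketch of composing a pattern from named pieces with an f-string and re.VERBOSE. It covers only the comma-separated 'en' format; the names `leading_digits`, `digit_group`, and `en_number_re` are made up for this illustration and come from neither the blog post nor the PR.

```python
import re

# Building blocks are plain raw strings, so their {m,n} quantifiers
# need no brace escaping.
leading_digits = r"[0-9]{1,3}"
digit_group = r"[0-9]{3}"

# In the raw f-string only {name} placeholders are substituted.
# re.VERBOSE ignores whitespace outside character classes and treats
# "#" as the start of a comment.
en_number_re = re.compile(
    rf"""
    -?                        # optional sign
    {leading_digits}          # one to three leading digits
    (?: , {digit_group} )+    # comma-separated groups of three
    (?: \. [0-9]+ )?          # optional fractional part
    """,
    re.VERBOSE,
)

print(bool(en_number_re.fullmatch("-123,456.7")))  # True
print(bool(en_number_re.fullmatch("1,23")))        # False
```

A literal space separator would have to be written as `[ ]` or `\ ` under re.VERBOSE, since bare whitespace is ignored.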

@mathemancer

mathemancer commented May 26, 2022

> @mathemancer @Jyuart conversation about regexes reminded me of this blog post about composing regex expressions using f-strings and re.VERBOSE: https://death.andgravity.com/f-re

To some extent, that's the strategy taken by the function for money detection. The goal is to decompose into understandable nuggets, then compose them in a reasonable way.

    # An attempt to separate pieces into logical bits for easier
    # understanding and modification
    non_numeric = r"(?:[^.,0-9]+)"
    no_separator_big = r"[0-9]{4,}(?:([,.])[0-9]+)?"
    no_separator_small = r"[0-9]{1,3}(?:([,.])[0-9]{1,2}|[0-9]{4,})?"
    comma_separator_req_decimal = r"[0-9]{1,3}(,)[0-9]{3}(\.)[0-9]+"
    period_separator_req_decimal = r"[0-9]{1,3}(\.)[0-9]{3}(,)[0-9]+"
    comma_separator_opt_decimal = r"[0-9]{1,3}(?:(,)[0-9]{3}){2,}(?:(\.)[0-9]+)?"
    period_separator_opt_decimal = r"[0-9]{1,3}(?:(\.)[0-9]{3}){2,}(?:(,)[0-9]+)?"
    space_separator_opt_decimal = r"[0-9]{1,3}(?:( )[0-9]{3})+(?:([,.])[0-9]+)?"
    comma_separator_lakh_system = r"[0-9]{1,2}(?:(,)[0-9]{2})+,[0-9]{3}(?:(\.)[0-9]+)?"


    inner_number_tree = "|".join(
        [
            no_separator_big,
            no_separator_small,
            comma_separator_req_decimal,
            period_separator_req_decimal,
            comma_separator_opt_decimal,
            period_separator_opt_decimal,
            space_separator_opt_decimal,
            comma_separator_lakh_system,
        ]
    )
    inner_number_group = f"({inner_number_tree})"
    required_currency_beginning = f"{non_numeric}{inner_number_group}{non_numeric}?"
    required_currency_ending = f"{non_numeric}?{inner_number_group}{non_numeric}"
    money_finding_regex = f"^(?:{required_currency_beginning}|{required_currency_ending})$"

I suppose it would be possible to decompose things further; perhaps that would help with readability.
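To see how the composed pattern behaves, here is a reduced, runnable version that keeps only three of the inner-number alternatives from the snippet above. The helper `looks_like_money` is hypothetical and added only for the demonstration; the reduced alternative list means this sketch accepts fewer formats than the full regex.

```python
import re

# Pieces copied from the snippet above (reduced set of alternatives).
non_numeric = r"(?:[^.,0-9]+)"
no_separator_big = r"[0-9]{4,}(?:([,.])[0-9]+)?"
comma_separator_req_decimal = r"[0-9]{1,3}(,)[0-9]{3}(\.)[0-9]+"
period_separator_req_decimal = r"[0-9]{1,3}(\.)[0-9]{3}(,)[0-9]+"

inner_number_tree = "|".join(
    [
        no_separator_big,
        comma_separator_req_decimal,
        period_separator_req_decimal,
    ]
)
inner_number_group = f"({inner_number_tree})"
required_currency_beginning = f"{non_numeric}{inner_number_group}{non_numeric}?"
required_currency_ending = f"{non_numeric}?{inner_number_group}{non_numeric}"
money_finding_regex = f"^(?:{required_currency_beginning}|{required_currency_ending})$"


def looks_like_money(text):
    """Hypothetical helper: True if text is a number with a non-numeric prefix or suffix."""
    return re.search(money_finding_regex, text) is not None


print(looks_like_money("$1,234.56"))   # True: currency symbol at the beginning
print(looks_like_money("1.234,56 €"))  # True: currency symbol at the end
print(looks_like_money("1234"))        # False: no non-numeric part on either side
```

Because at least one non-numeric part is required, a bare number never matches, which is why the reviewer suggested dropping the currency-specific pieces for numeric inference.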

@Jyuart Jyuart requested a review from mathemancer June 2, 2022 15:55

@mathemancer mathemancer left a comment


Okay, I think this is almost there. I did have a question about a change you introduced, but I don't think it should break anything.

@@ -72,6 +71,7 @@
NUMERIC_TYPES = frozenset({
*INTEGER_TYPES,
*DECIMAL_TYPES,
PostgresType.NUMERIC

Why this change? NUMERIC is a decimal type.

@Jyuart Jyuart (author) replied:

My intention was to exclude it from the _get_decimal_number_type_body_map function, which creates the 'default' casting functions for the other decimal types, while still treating it as a numeric type.
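The effect described in this reply can be illustrated with a small sketch. It assumes NUMERIC was taken out of DECIMAL_TYPES, which the diff excerpt suggests but does not show. The enum members, the contents of the sets, and the helper `types_with_default_decimal_cast` are hypothetical stand-ins, not the project's actual definitions.

```python
from enum import Enum


class PostgresType(Enum):
    """Hypothetical subset of the project's type enum."""
    INTEGER = "integer"
    REAL = "real"
    DOUBLE_PRECISION = "double precision"
    NUMERIC = "numeric"


INTEGER_TYPES = frozenset({PostgresType.INTEGER})

# Assumption: NUMERIC is no longer listed here, so code that iterates
# over DECIMAL_TYPES to build default casts skips it.
DECIMAL_TYPES = frozenset({PostgresType.REAL, PostgresType.DOUBLE_PRECISION})

# NUMERIC is added explicitly, so it still counts as a numeric type.
NUMERIC_TYPES = frozenset({
    *INTEGER_TYPES,
    *DECIMAL_TYPES,
    PostgresType.NUMERIC,
})


def types_with_default_decimal_cast():
    """Stand-in for the set a function like _get_decimal_number_type_body_map would cover."""
    return DECIMAL_TYPES


print(PostgresType.NUMERIC in NUMERIC_TYPES)                      # True
print(PostgresType.NUMERIC in types_with_default_decimal_cast())  # False
```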

@mathemancer mathemancer merged commit 7e3018a into mathesar-foundation:master Jun 3, 2022
Labels
pr-status: review A PR awaiting review
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Handle more human readable number imports
5 participants