Upgrade text-numeric inferring #1355

Jyuart · 2022-05-05T18:45:58Z

This PR upgrades type inference from the text types (TEXT, CHAR, VARCHAR) to the NUMERIC type. Because inference is based on casting capabilities, casting to NUMERIC is made more comprehensive accordingly. Values like:

3.14
123,456.7
123.456,7
123 456,7
1,23,456.7
123'456.7
are now recognized as NUMERIC.

Technical details

It was inspired by PR #1137 and is implemented in the same manner:

A separate function returns a numeric array in the format ['number', 'separator', 'floating-point']. For example: ['331,209.05', ',', '.']
Another function removes the locale-specific parts and extracts the actual number, with . as the decimal point and no separators: 331209.05
Supported formats are based on this comment: Handle more human readable number imports #1107 (comment). They can easily be extended if necessary. Excerpt:

example locale	example format
'en'	-123,456.7
'de'	-123.456,7
'fr'	-123 456,7
'hi'	-1,23,456.7
'de-CH'	-123'456.7
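
The two steps described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `FORMATS`, `get_numeric_array`, and `to_numeric` are hypothetical, the patterns are simplified, and the PR's actual casting logic (in db/types/operations/cast.py) may differ.

```python
import re
from decimal import Decimal

# Hypothetical table: (pattern, group separator, decimal mark),
# one entry per example format listed above. Order matters: the
# plain format is tried first, so "1.234" is read as a decimal.
FORMATS = [
    (r"-?[0-9]+(?:\.[0-9]+)?", "", "."),                              # 3.14
    (r"-?[0-9]{1,3}(?:,[0-9]{3})+(?:\.[0-9]+)?", ",", "."),           # 'en'
    (r"-?[0-9]{1,3}(?:\.[0-9]{3})+(?:,[0-9]+)?", ".", ","),           # 'de'
    (r"-?[0-9]{1,3}(?: [0-9]{3})+(?:,[0-9]+)?", " ", ","),            # 'fr'
    (r"-?[0-9]{1,2}(?:,[0-9]{2})+,[0-9]{3}(?:\.[0-9]+)?", ",", "."),  # 'hi'
    (r"-?[0-9]{1,3}(?:'[0-9]{3})+(?:\.[0-9]+)?", "'", "."),           # 'de-CH'
]


def get_numeric_array(text):
    """Return [number, separator, decimal mark], or None if no format fits."""
    for pattern, separator, decimal_mark in FORMATS:
        if re.fullmatch(pattern, text):
            return [text, separator, decimal_mark]
    return None


def to_numeric(text):
    """Strip the locale-specific parts and return the value as a Decimal."""
    parts = get_numeric_array(text)
    if parts is None:
        raise ValueError(f"not a recognized numeric format: {text!r}")
    number, separator, decimal_mark = parts
    if separator:
        # Remove group separators first, so that a "." separator is gone
        # before the decimal mark is rewritten to ".".
        number = number.replace(separator, "")
    if decimal_mark != ".":
        number = number.replace(decimal_mark, ".")
    return Decimal(number)


print(get_numeric_array("331,209.05"))  # ['331,209.05', ',', '.']
print(to_numeric("331,209.05"))         # 331209.05
```

In this sketch the separator is removed before the decimal mark is normalized; doing it the other way round would corrupt formats such as 'de', where . is the separator.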

One test in the test_table_api.py file was updated

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the master branch of the repository
My commit messages follow best practices.
My code follows the established code style of the repository.
I added tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

codecov-commenter · 2022-05-16T15:49:34Z

Codecov Report

Merging #1355 (928c91c) into master (80b8a20) will increase coverage by 0.05%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1355      +/-   ##
==========================================
+ Coverage   92.94%   93.00%   +0.05%     
==========================================
  Files         123      123              
  Lines        5003     5045      +42     
==========================================
+ Hits         4650     4692      +42     
  Misses        353      353

Flag	Coverage Δ
pytest-backend	`93.00% <100.00%> (+0.05%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
db/types/categories.py	`100.00% <ø> (ø)`
db/types/operations/cast.py	`99.04% <100.00%> (+0.14%)`	⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80b8a20...928c91c. Read the comment docs.

mathesar/tests/api/test_table_api.py

mathemancer

See my specific comments about the regexes.

mathemancer · 2022-05-25T06:42:29Z

db/types/operations/cast.py

+    period_sep_comma_decimal = r"[0-9]{1,3}(?:(\.)[0-9]{3})+(?:(,)[0-9]+)?"
+    comma_sep_period_decimal = r"[0-9]{1,3}(?:(,)[0-9]{3})+(?:(\.)[0-9]+)?"
+    space_sep_comma_decimal = r"[0-9]{1,3}(?:( )[0-9]{3})+(?:(,)[0-9]+)?"
+    comma_sep_period_decimal_lakh = r"[0-9]{1,3}(?:(,)[0-9]{2})+,[0-9]{3}(?:(\.)[0-9]+)?"
+    apostophe_sep_period_decima = r"[0-9]{1,3}(?:(\'')[0-9]{3})+(?:(\.)[0-9]+)?"


These lines (or many of them) have some subtle bugs similar to what I described in more detail in the other comment. Please consider just using the regexes from the previous PR (sans the ones related specifically to finding currency symbols).
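
As a hedged illustration of one kind of ambiguity these patterns allow: the input 1.234 could be a plain decimal, yet the first pattern in the diff accepts it as a grouped integer. Whether this is the bug meant in the other comment is an assumption, since that comment is not reproduced in this thread. The two stricter patterns are copied from the money-detection excerpt quoted later in this conversation; the helper `matches` is made up for the example.

```python
import re

# Pattern from the diff above: a single "." group is enough to match.
period_sep_comma_decimal = r"[0-9]{1,3}(?:(\.)[0-9]{3})+(?:(,)[0-9]+)?"

# Money-detection variants: a single group must be followed by a decimal
# part; without a decimal part, at least two groups are required.
period_separator_req_decimal = r"[0-9]{1,3}(\.)[0-9]{3}(,)[0-9]+"
period_separator_opt_decimal = r"[0-9]{1,3}(?:(\.)[0-9]{3}){2,}(?:(,)[0-9]+)?"


def matches(pattern, text):
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


def strict_match(text):
    """True if either stricter variant matches the whole string."""
    return any(
        matches(p, text)
        for p in (period_separator_req_decimal, period_separator_opt_decimal)
    )


ambiguous = "1.234"
print(matches(period_sep_comma_decimal, ambiguous))  # True
print(strict_match(ambiguous))                       # False
```

Unambiguous inputs such as 1.234,5 or 1.234.567 are accepted by both the loose and the strict patterns.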

dmos62 · 2022-05-25T10:55:51Z

@mathemancer @Jyuart conversation about regexes reminded me of this blog post about composing regex expressions using f-strings and re.VERBOSE: https://death.andgravity.com/f-re
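
A rough sketch of that technique, for readers who have not seen it (this is not code from the post or from this PR; the names `THREE_DIGITS`, `DECIMAL_PART`, and `COMMA_GROUPED` are made up for illustration): small named fragments are interpolated with an f-string, and the result is compiled with re.VERBOSE, which ignores unescaped whitespace and allows inline comments.

```python
import re

# Small, individually readable fragments.
THREE_DIGITS = r"[0-9]{3}"
DECIMAL_PART = r"\.[0-9]+"

# Quantifier braces must be doubled inside an f-string; interpolated
# fragments are inserted as-is and keep their single braces.
COMMA_GROUPED = rf"""
    -?                       # optional sign
    [0-9]{{1,3}}             # leading group of one to three digits
    (?: , {THREE_DIGITS} )+  # one or more comma-separated groups of three
    (?: {DECIMAL_PART} )?    # optional decimal part
"""

comma_grouped = re.compile(COMMA_GROUPED, re.VERBOSE)

print(bool(comma_grouped.fullmatch("-123,456.7")))  # True
print(bool(comma_grouped.fullmatch("1,23,456.7")))  # False
```

One caveat with this style: a literal space or # in the pattern has to be escaped or placed in a character class, because re.VERBOSE would otherwise drop it.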

mathemancer · 2022-05-26T02:56:12Z

> @mathemancer @Jyuart conversation about regexes reminded me of this blog post about composing regex expressions using f-strings and re.VERBOSE: https://death.andgravity.com/f-re

To some extent, that's the strategy taken by the function for money detection. The goal is to decompose into understandable nuggets, then compose them in a reasonable way.

    # An attempt to separate pieces into logical bits for easier
    # understanding and modification
    non_numeric = r"(?:[^.,0-9]+)"
    no_separator_big = r"[0-9]{4,}(?:([,.])[0-9]+)?"
    no_separator_small = r"[0-9]{1,3}(?:([,.])[0-9]{1,2}|[0-9]{4,})?"
    comma_separator_req_decimal = r"[0-9]{1,3}(,)[0-9]{3}(\.)[0-9]+"
    period_separator_req_decimal = r"[0-9]{1,3}(\.)[0-9]{3}(,)[0-9]+"
    comma_separator_opt_decimal = r"[0-9]{1,3}(?:(,)[0-9]{3}){2,}(?:(\.)[0-9]+)?"
    period_separator_opt_decimal = r"[0-9]{1,3}(?:(\.)[0-9]{3}){2,}(?:(,)[0-9]+)?"
    space_separator_opt_decimal = r"[0-9]{1,3}(?:( )[0-9]{3})+(?:([,.])[0-9]+)?"
    comma_separator_lakh_system = r"[0-9]{1,2}(?:(,)[0-9]{2})+,[0-9]{3}(?:(\.)[0-9]+)?"


    inner_number_tree = "|".join(
        [
            no_separator_big,
            no_separator_small,
            comma_separator_req_decimal,
            period_separator_req_decimal,
            comma_separator_opt_decimal,
            period_separator_opt_decimal,
            space_separator_opt_decimal,
            comma_separator_lakh_system,
        ]
    )
    inner_number_group = f"({inner_number_tree})"
    required_currency_beginning = f"{non_numeric}{inner_number_group}{non_numeric}?"
    required_currency_ending = f"{non_numeric}?{inner_number_group}{non_numeric}"
    money_finding_regex = f"^(?:{required_currency_beginning}|{required_currency_ending})$"

I suppose it would be possible to decompose things further; perhaps that would help with readability.
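
One possible direction for such further decomposition, offered only as a sketch: the grouped variants differ mainly in separator, decimal mark, and minimum number of groups, so a small helper could generate them. The helper name `grouped` and its parameters are hypothetical and not part of the codebase.

```python
import re


def grouped(sep, dec, min_groups=1):
    """Build a pattern for digits grouped in threes by `sep`, with an
    optional decimal part introduced by `dec`. Both marks are captured."""
    repeat = "+" if min_groups == 1 else f"{{{min_groups},}}"
    sep, dec = re.escape(sep), re.escape(dec)
    return rf"[0-9]{{1,3}}(?:({sep})[0-9]{{3}}){repeat}(?:({dec})[0-9]+)?"


# Behaves like the two *_opt_decimal patterns shown above.
comma_separator_opt_decimal = grouped(",", ".", min_groups=2)
period_separator_opt_decimal = grouped(".", ",", min_groups=2)

# The captured groups expose which separator and decimal mark were found.
match = re.fullmatch(grouped(" ", ","), "123 456,7")
print(match.groups())  # (' ', ',')
```

Patterns with a different shape, such as the lakh system or the variants that require a decimal part, would still need their own definitions.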

mathemancer

Okay, I think this is almost there. I did have a question about a change you introduced, but I don't think it should break anything.

mathemancer · 2022-06-03T08:24:08Z

db/types/categories.py

@@ -72,6 +71,7 @@
 NUMERIC_TYPES = frozenset({
    *INTEGER_TYPES,
    *DECIMAL_TYPES,
+    PostgresType.NUMERIC


Why this change? NUMERIC is a decimal type.

Jyuart · 2022-06-03

My intention was to exclude NUMERIC from the _get_decimal_number_type_body_map function, which creates the 'default' casting function for the other decimal types, while still treating it as a numeric type.
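
The arrangement described in this reply can be pictured with a reduced, hypothetical stand-in for the real definitions. The enum members, the contents of `INTEGER_TYPES` and `DECIMAL_TYPES`, and the helper `types_with_default_decimal_cast` are assumptions for illustration only; the actual sets in db/types/categories.py contain more types.

```python
from enum import Enum


class PostgresType(Enum):
    """Reduced stand-in for the project's enum (assumed members)."""
    INTEGER = "integer"
    BIGINT = "bigint"
    REAL = "real"
    DOUBLE_PRECISION = "double precision"
    NUMERIC = "numeric"


INTEGER_TYPES = frozenset({PostgresType.INTEGER, PostgresType.BIGINT})

# NUMERIC is left out here on purpose, so that anything iterating over
# DECIMAL_TYPES to build 'default' casts does not generate one for it.
DECIMAL_TYPES = frozenset({PostgresType.REAL, PostgresType.DOUBLE_PRECISION})

# NUMERIC is added back explicitly, so it is still treated as numeric.
NUMERIC_TYPES = frozenset({
    *INTEGER_TYPES,
    *DECIMAL_TYPES,
    PostgresType.NUMERIC,
})


def types_with_default_decimal_cast():
    """Hypothetical stand-in for what _get_decimal_number_type_body_map covers."""
    return DECIMAL_TYPES


print(PostgresType.NUMERIC in NUMERIC_TYPES)                      # True
print(PostgresType.NUMERIC in types_with_default_decimal_cast())  # False
```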

Jyuart added 8 commits May 5, 2022 21:44

add casting functions

04117bf

update casting functions to match 5 possible locales

8c30743

add more casting options

0eeea8c

add tests for to-numeric casting

bbc372d

add more to-numeric cast tests

5b6ae89

Merge branch 'master' into upgrade-numeric-inferring

d6cc002

fix typo

5191d3a

fix typos in tests

12a610f

Jyuart marked this pull request as ready for review May 16, 2022 15:53

Jyuart requested review from a team and kgodey and removed request for a team May 16, 2022 15:53

remove copy-paste artifacts

5233bf7

Jyuart changed the title ~~Upgrade numeric inferring~~ Upgrade text-numeric inferring May 16, 2022

Merge branch 'master' into upgrade-numeric-inferring

3276804

dmos62 reviewed May 23, 2022

kgodey requested review from mathemancer and removed request for kgodey May 23, 2022 16:22

kgodey assigned mathemancer May 23, 2022

kgodey added the pr-status: review A PR awaiting review label May 23, 2022

mathemancer requested changes May 25, 2022

Merge branch 'master' into upgrade-numeric-inferring

5874326

Jyuart added 4 commits June 1, 2022 20:03

use mathesar_money regexes for numeric inferring

e460144

Merge branch 'master' into upgrade-numeric-inferring

2ede403

Merge branch 'upgrade-numeric-inferring' of https://github.com/Jyuart…

c0db614

…/mathesar into upgrade-numeric-inferring

fix test

271d053

Jyuart requested a review from mathemancer June 2, 2022 15:55

mathemancer approved these changes Jun 3, 2022

Merge branch 'master' into upgrade-numeric-inferring

928c91c

mathemancer enabled auto-merge June 3, 2022 08:29

mathemancer merged commit 7e3018a into mathesar-foundation:master Jun 3, 2022

Aditramesh mentioned this pull request Feb 13, 2023

Handle human readable number casting in general #2465

Closed

7 tasks

Aditramesh mentioned this pull request Apr 10, 2023

Handle human readable number casting in general #2797

Closed

7 tasks
