-
-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade text-numeric inferring #1355
Upgrade text-numeric inferring #1355
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1355 +/- ##
==========================================
+ Coverage 92.94% 93.00% +0.05%
==========================================
Files 123 123
Lines 5003 5045 +42
==========================================
+ Hits 4650 4692 +42
Misses 353 353
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my specific comments about the regexes.
db/types/operations/cast.py
Outdated
period_sep_comma_decimal = r"[0-9]{1,3}(?:(\.)[0-9]{3})+(?:(,)[0-9]+)?" | ||
comma_sep_period_decimal = r"[0-9]{1,3}(?:(,)[0-9]{3})+(?:(\.)[0-9]+)?" | ||
space_sep_comma_decimal = r"[0-9]{1,3}(?:( )[0-9]{3})+(?:(,)[0-9]+)?" | ||
comma_sep_period_decimal_lakh = r"[0-9]{1,3}(?:(,)[0-9]{2})+,[0-9]{3}(?:(\.)[0-9]+)?" | ||
apostophe_sep_period_decima = r"[0-9]{1,3}(?:(\'')[0-9]{3})+(?:(\.)[0-9]+)?" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines (or many of them) have some subtle bugs similar to what I described in more detail in the other comment. Please consider just using the regexes from the previous PR (sans the ones related specifically to finding currency symbols).
@mathemancer @Jyuart conversation about regexes reminded me of this blog post about composing regex expressions using f-strings and |
To some extent, that's the strategy taken by the function for money detection. The goal is to decompose into understandable nuggets, then compose them in a reasonable way. # An attempt to separate pieces into logical bits for easier
# understanding and modification
non_numeric = r"(?:[^.,0-9]+)"
no_separator_big = r"[0-9]{4,}(?:([,.])[0-9]+)?"
no_separator_small = r"[0-9]{1,3}(?:([,.])[0-9]{1,2}|[0-9]{4,})?"
comma_separator_req_decimal = r"[0-9]{1,3}(,)[0-9]{3}(\.)[0-9]+"
period_separator_req_decimal = r"[0-9]{1,3}(\.)[0-9]{3}(,)[0-9]+"
comma_separator_opt_decimal = r"[0-9]{1,3}(?:(,)[0-9]{3}){2,}(?:(\.)[0-9]+)?"
period_separator_opt_decimal = r"[0-9]{1,3}(?:(\.)[0-9]{3}){2,}(?:(,)[0-9]+)?"
space_separator_opt_decimal = r"[0-9]{1,3}(?:( )[0-9]{3})+(?:([,.])[0-9]+)?"
comma_separator_lakh_system = r"[0-9]{1,2}(?:(,)[0-9]{2})+,[0-9]{3}(?:(\.)[0-9]+)?"
inner_number_tree = "|".join(
[
no_separator_big,
no_separator_small,
comma_separator_req_decimal,
period_separator_req_decimal,
comma_separator_opt_decimal,
period_separator_opt_decimal,
space_separator_opt_decimal,
comma_separator_lakh_system,
]
)
inner_number_group = f"({inner_number_tree})"
required_currency_beginning = f"{non_numeric}{inner_number_group}{non_numeric}?"
required_currency_ending = f"{non_numeric}?{inner_number_group}{non_numeric}"
money_finding_regex = f"^(?:{required_currency_beginning}|{required_currency_ending})$" I suppose It would be possible to further decompose things; perhaps that would help with readability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think this is almost there. I did have a question about a change you introduced, but I don't think it should break anything.
@@ -72,6 +71,7 @@ | |||
NUMERIC_TYPES = frozenset({ | |||
*INTEGER_TYPES, | |||
*DECIMAL_TYPES, | |||
PostgresType.NUMERIC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? NUMERIC
is a decimal type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention was to exclude it from being used in the _get_decimal_number_type_body_map
function which creates 'default' casting function for other decimal types but to still treat it as a numeric type.
Fixes #1107
It's an upgrade of the text (
TEXT
,CHAR
,VARCHAR
) --> numeric (NUMERIC
) types inferring (and more comprehensive casting to numeric type accordingly since inferring is based on casting capabilities). Values like:are now successfully recognized as of
NUMERIC
typeTechnical details
It was inspired by this PR #1137 and is implemented in the same manner:
['number', 'separator', 'floating-point']
. For example:['331,209.05', ',', '.']
.
as a floating-point and with no separators:331209.05
test_table_api.py
file was updatedChecklist
Update index.md
).master
branch of the repositoryvisible errors.
Developer Certificate of Origin
Developer Certificate of Origin