Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom URI type #692

Merged
merged 24 commits into from
Oct 12, 2021
Merged

Add custom URI type #692

merged 24 commits into from
Oct 12, 2021

Conversation

mathemancer
Copy link
Contributor

@mathemancer mathemancer commented Oct 4, 2021

Fixes #412

This adds a custom URI type to Mathesar. It also adds functions at the DB layer for getting the parts of a URI.

Technical details

The DB layer functions are based on a regular expression given in the RFC that defines a URI: https://datatracker.ietf.org/doc/html/rfc3986 . Note that similarly to the situation for the email type, the functions to get the parts of a URI are not yet hooked up to the API. The parts that can be gotten are:

  • scheme
  • authority
  • path
  • query
  • fragment
    These are all of the pieces defined in the RFC. The function to get part X is called mathesar_types.uri_<X>, takes a TEXT (or VARCHAR or CHAR) and returns the same.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the master branch of the repository
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@mathemancer mathemancer requested review from a team, kgodey and pavish October 4, 2021 17:33
Copy link
Contributor

@kgodey kgodey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mathemancer This looks good to me generally but it isn't recognizing URLs without schemes (e.g. centerofci.org). Could we detect those as well and default to scheme HTTP if none is detected?

domains.csv should serve as a good inference test.

@mathemancer
Copy link
Contributor Author

mathemancer commented Oct 5, 2021

@mathemancer This looks good to me generally but it isn't recognizing URLs without schemes (e.g. centerofci.org). Could we detect those as well and default to scheme HTTP if none is detected?

There is no such thing as a URL without a scheme 😉. The problem is that localhost (or abc, etc.) is a valid URL if you prepend http://. We'd want to use some kind of "obvious" subset of URLs at that point, rather than trying to notice every valid URL, or the net will be far too wide. Maybe noticing an authority segment with a suffix of .<tld>, where <tld> is in some abbreviated list of common TLDs? I'll see what I come up with that doesn't seem too sloppy. I agree that it would be a nice feature.

@kgodey
Copy link
Contributor

kgodey commented Oct 5, 2021

@mathemancer I think it's fine if we don't detect localhost or any TLD-less domain. it might be simpler to detect URIs in the form the form <foo>.<bar>, I'm not sure we need a list of TLDs to check <bar> against. We might have some false positives, but that's okay.

@mathemancer
Copy link
Contributor Author

mathemancer commented Oct 6, 2021

I think it would actually be simpler to check the suffix against known TLDs in the long run (i.e., as we fix bugs and extend; we can more easily extend a list of accepted TLDs than an increasingly-arcane list of rules to determine whether a string is a valid URL as per relevant RFCs). Domains (or authorities in the case of URIs) of the form foo.bar are a pretty limited case, and I'm not convinced it's worth handling with special code. For example, in addition to foo.bar, we'd want to be able to find:

  • www.foo.bar (I note that GH seems to pick up this one from the www prefix)
  • baz.foo.bar
  • www3.foo.bar
  • www3.foo12.bar123

Also, the false positive list would be pretty bad once number-ish things are in the mix. We'd want to exclude:

  • 123.45USD and 123.45-USD, but these are both valid URLs if http:// is prepended to them. (Not live addresses, but they satisfy the relevant RFCs. Note that a purely numeric TLD is disallowed by convention (though I can't find a rule against it in an RFC)).

  • Here's a valid URL with numerals and hyphens in both the prefix and suffix: https://xn--h1adlhdnlo2c.xn--p1ai/ (It's a real link, but I can't vouch for the site it links to, since I don't read that language; click with caution). We'd want to include that (without the scheme), but exclude the money examples above. I think that any currently-used TLD with numerals in it has the first four characters xn--, but I'm not sure. And I reiterate that I don't think maintaining a growing list of rules (or a growing regex that we come up with and then extend) will be as easy as a list of supported TLDs.

@kgodey
Copy link
Contributor

kgodey commented Oct 6, 2021

@mathemancer Okay, I'm convinced. :)

@kgodey kgodey added this to the [08] Initial Data Types milestone Oct 8, 2021
@kgodey kgodey added the pr-status: review A PR awaiting review label Oct 12, 2021
@kgodey kgodey self-assigned this Oct 12, 2021
@kgodey kgodey removed the pr-status: review A PR awaiting review label Oct 12, 2021
Copy link
Contributor

@kgodey kgodey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@kgodey kgodey merged commit 0eaa437 into master Oct 12, 2021
@kgodey kgodey deleted the uri_type branch October 12, 2021 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Create URI custom type and handle it in the backend.
2 participants