The spec doesn't seem clear on how to handle "incomplete" hostnames #694

Open
rushmorem opened this issue Jul 19, 2018 · 9 comments

@rushmorem
Contributor

By "incomplete" hostname, I mean a hostname that forms the tail of one or more rules but does not have enough labels to match those rules fully. Examples of such hostnames are yokohama.jp and kobe.jp.

The relevant rules for those hostnames are:

jp
*.kobe.jp
*.yokohama.jp
!city.yokohama.jp

I have seen these two interpreted differently by at least two implementations, and I understand how it can go either way. libpsl returns the public suffixes for those domains as yokohama.jp and kobe.jp respectively. Servo's net_traits crate, however, returns jp for both, which leads to weird test cases like these.
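To make the two readings concrete, here is a minimal Python sketch. It is not taken from libpsl, Servo, or the spec text; the names `rule_matches`, `public_suffix`, `JP_RULES`, and the `wildcard_implies_parent` flag are hypothetical and exist only for illustration. With the flag off, a wildcard rule needs a label in the wildcard position, so kobe.jp falls back to jp. With the flag on, a wildcard rule is assumed to also cover its parent, which gives the kobe.jp result described above.

```python
JP_RULES = ["jp", "*.kobe.jp", "*.yokohama.jp", "!city.yokohama.jp"]


def rule_matches(rule_labels, domain_labels):
    """True if every rule label, compared right to left, equals the
    corresponding domain label or is the wildcard '*'."""
    if len(rule_labels) > len(domain_labels):
        return False
    tail = domain_labels[len(domain_labels) - len(rule_labels):]
    return all(r == "*" or r == d for r, d in zip(rule_labels, tail))


def public_suffix(domain, rules, wildcard_implies_parent=False):
    """Return the public suffix of `domain` under the given rules.

    wildcard_implies_parent=False: a wildcard rule only matches names
    that supply a label for the '*' position.
    wildcard_implies_parent=True: assumption for illustration, a rule
    '*.x' is treated as if 'x' were listed as well.
    """
    labels = domain.lower().split(".")
    exceptions, normal = [], []
    for rule in rules:
        is_exception = rule.startswith("!")
        rule_labels = (rule[1:] if is_exception else rule).split(".")
        if rule_matches(rule_labels, labels):
            (exceptions if is_exception else normal).append(rule_labels)
        elif (wildcard_implies_parent and not is_exception
              and rule_labels[0] == "*"
              and rule_matches(rule_labels[1:], labels)):
            normal.append(rule_labels[1:])
    if exceptions:
        # An exception rule prevails; its leftmost label is dropped.
        prevailing = max(exceptions, key=len)[1:]
    elif normal:
        # Otherwise the matching rule with the most labels prevails.
        prevailing = max(normal, key=len)
    else:
        # No rule matched: the default rule is '*'.
        prevailing = ["*"]
    return ".".join(labels[len(labels) - len(prevailing):])


print(public_suffix("kobe.jp", JP_RULES))
print(public_suffix("kobe.jp", JP_RULES, wildcard_implies_parent=True))
```

Names with enough labels, such as foo.kobe.jp or city.yokohama.jp, come out the same in both modes in this sketch; only the "incomplete" hostnames differ.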

What's the official position on this?

@sleevi
Contributor

sleevi commented Jul 20, 2018

@rushmorem Thanks for opening this. This is the wildcard problem originally captured at https://bugzilla.mozilla.org/show_bug.cgi?id=1124625#c6 and more broadly documented at https://wiki.mozilla.org/Public_Suffix_List/platform.sh_Problem

@rushmorem
Contributor Author

Thanks @sleevi. That clarifies it. I'm rewriting my Rust implementation, so I wanted to know the correct way to handle this. I think adding these to the official test cases would help iron out the differences between implementations. What do you think? Should I submit a pull request?

@sleevi
Contributor

sleevi commented Jul 20, 2018 via email

@rushmorem
Contributor Author

According to the wiki you linked to:

If we follow the defined PSL algorithm, the above rules should result in the following determinations:

 get_public_suffix(foo.bar.platform.sh) == "bar.platform.sh"
 get_public_suffix(bar.platform.sh) == "bar.platform.sh"
 get_public_suffix(platform.sh) == "sh"
 get_public_suffix(sh) == "sh"

So I thought this was already decided. In any case, I think the spec should clear this up one way or another.

@peterthomassen
Contributor

In the following, when I say "loose interpretation", I mean the one where the rule *.platform.sh implies that platform.sh is a public suffix.

On the other hand, there's the "strict interpretation" which takes the current rules literally, such that the rule *.platform.sh does not make a statement about whether platform.sh is a public suffix. Strict interpretation of the current rules gives that the public suffix of platform.sh is sh.

Let's assume that a client has access to a function to look up the public suffix using strict interpretation. If the client is interested in the loose interpretation, it can first look up the public suffix for platform.sh, and if the result is not the same as the query (i.e. it is sh, not platform.sh), then the client can query *.platform.sh to see if the wildcard exists, and if so, draw its conclusions and, for example, block cookies on platform.sh.

If the lookup function implements loose lookup, then the client's ability to determine whether platform.sh itself is on the PSL is lost entirely.
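The two-tiered lookup can be sketched roughly as follows in Python. Everything here is an assumption for illustration: the rule set `SH_RULES` is a guess consistent with the determinations quoted earlier in the thread, `strict_public_suffix` is a stand-in for whatever strict lookup the client actually has, and it is assumed that this lookup accepts a literal `*` label in the query.

```python
# Assumed rule set, consistent with the wiki determinations quoted above.
SH_RULES = ["sh", "*.platform.sh"]


def strict_public_suffix(domain, rules=SH_RULES):
    """Stand-in strict lookup: a wildcard rule only matches names that
    supply a label for the '*' position."""
    labels = domain.lower().split(".")
    best = ["*"]          # default rule when nothing else matches
    exception = None
    for rule in rules:
        is_exception = rule.startswith("!")
        rule_labels = (rule[1:] if is_exception else rule).split(".")
        if len(rule_labels) > len(labels):
            continue
        tail = labels[len(labels) - len(rule_labels):]
        if not all(r == "*" or r == d for r, d in zip(rule_labels, tail)):
            continue
        if is_exception:
            # Exception rules prevail, minus their leftmost label.
            exception = rule_labels[1:]
        elif len(rule_labels) > len(best):
            best = rule_labels
    prevailing = exception if exception is not None else best
    return ".".join(labels[len(labels) - len(prevailing):])


def loose_is_public_suffix(domain, lookup=strict_public_suffix):
    """Two-tiered check built only on a strict lookup function."""
    domain = domain.lower()
    # Tier 1: the name is a public suffix under the strict reading.
    if lookup(domain) == domain:
        return True
    # Tier 2: probe for a wildcard rule directly below the name.
    probe = "*." + domain
    return lookup(probe) == probe


print(strict_public_suffix("platform.sh"))
print(loose_is_public_suffix("platform.sh"))
```

The strict lookup still reports sh for platform.sh, so a client that wants the literal answer keeps it, while a client that wants the loose answer calls the wrapper instead.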

The strict interpretation (= literal interpretation of the current algorithm) therefore gives greater flexibility to the client, without the list showing prejudice regarding what the use case will be. I think it's a good thing for the list to not make assumptions about the use case.

Based on the documents linked here, the Chrome implementation appears to follow the loose interpretation. One solution for the problem could be to define the algorithm as strict, with Chrome (implicitly) adhering to the "two-tiered lookup" described above. This is equivalent to the loose interpretation, and the contradiction is removed.

(In the case where kobe.jp should be considered a public suffix by all clients, it could be added to the PSL explicitly. I am aware of Firefox's implementation issues, but I would think that could be coordinated, especially if the alternative would be to pay the price of losing the algorithm's generality.)

@sleevi
Contributor

sleevi commented Jun 11, 2019 via email

@peterthomassen
Contributor

If the lookup function implements loose lookup, then the client's ability to determine whether platform.sh itself is on the PSL is lost entirely.

I think that is making assumptions about the service that aren’t specified. The service could implement the loose lookup itself and return appropriate results - not allowing for “holes” in the namespace.

There is no assumption about the service here. In my original post, I wrote:

Let's assume that a client has access to a function to look up the public suffix using strict interpretation.

This is the assumption that there may already be PSL clients, libraries, or other implementations that output the public suffix according to the strict interpretation. It is an assumption not about the PSL service but about existing implementations. Actually, it's a fact, as I know of at least one implementation that works this way, and @rushmorem said something similar in the initial post.

If the meaning of the *.platform.sh rule is relaxed to, by definition, imply that platform.sh is a public suffix, then such existing implementations will break (= their behavior changes), and they would need to be fixed.

On the other hand, the strict algorithm allows emulating the loose interpretation by first getting the public suffix of platform.sh (which turns out to be sh), and then getting the public suffix of *.platform.sh (which turns out to be *.platform.sh itself). Thus, with the current (strict) definition of the algorithm, implementations are free to implement the loose interpretation without requiring any changes to the PSL or to other existing implementations. (This does not even require adding kobe.jp and friends to the PSL; the implementation can decide by itself!)

(Arguably, this is what Chrome already does implicitly, according to the Mozilla wiki article. It may not use the two-step approach, but the implementation has chosen to interpret the PSL this way, and it could continue doing so even if the algorithm's definition were clarified to mean the strict interpretation. In that case, even if Chrome decided to migrate to a new, strictly compliant PSL library, the two-step approach described in the previous paragraph would recover the loose interpretation's behavior, resulting in no change as far as Chrome's use case is concerned.)

The converse is not true: if the algorithm were changed to follow the loose interpretation, so that a wildcard rule's parent is always a public suffix as well (barring an exception rule), that would reduce flexibility, because implementations could no longer decide which interpretation they want to implement. All implementations would follow the loose interpretation (permanently breaking pre-existing implementations that relied, say, on the non-publicness of kobe.jp). If an exception from the loose rule is required, such as in the platform.sh cookie policy case, adding !platform.sh to the PSL would explicitly exempt platform.sh from the loose interpretation; all implementations would then consider platform.sh non-public.

Now, in turn, it is unclear why that would be desirable, as it removes the choice on implementation level. There may be use cases where the strict interpretation is preferable, and those would be thwarted by imposing the loose one.

This stems from the fact that granularity is reduced if rules denote public-suffix policy not only for domain names with the same number of labels (dots) as the rule, but also for domains with fewer labels (dots) than the rule. In the strict interpretation, granularity is higher. As a result, one can retrieve all "loose statements" from "strict rules" (you may have to check the *. child rule), while the inverse is not true.

So, defining the algorithm by the loose interpretation has the following cons:

  • Breaks pre-existing implementations
  • Requires adding !platform.sh to the PSL
  • Removes the possibility of choosing the interpretation depending on the application's use case

On the other hand, the strict interpretation does not have these downsides, while allowing for either use case: With the strict interpretation, you actually get (the possibility to have) both.

The strict interpretation (= literal interpretation of the current algorithm) therefore gives greater flexibility to the client, without the list showing prejudice regarding what the use case will be. I think it's a good thing for the list to not make assumptions about the use case.

I don’t agree with this being good. I think this would be very bad. Can you explain more why you think it would be good?

It would be good because of the above reasons. Why would it be bad?

@ko-zu
Contributor

ko-zu commented Jun 20, 2024

As I commented in #1986, the conflicting rules between the wiki and the test case/linters should be resolved. I believe the test case and linter are correct, supported by implementations and intended use cases.

The existing implementations that do not follow the test case should not be a reason to leave the conflicting rules in this repository.
If someone needs to use a definition from a specific revision of the wiki, they can choose such an implementation regardless of what the current rule is. Clarifying which rule the PSL follows does not prevent users from using any rules or any implementations.

I believe it would be better to have one self-consistent rule and put a notice about the possible differences between implementations instead.

@publicsuffix publicsuffix deleted a comment from Jamirais94 Jun 30, 2024
@dnsguru
Member

dnsguru commented Jun 30, 2024

A couple of things are at play here. Sometimes both at once, but it is often one or the other.

1] Epochs / Legacy entries
We sometimes have something I call "standardization drift epochs", where specs get made more precise but there are legacy entries that were put in place before the precision was added.

2] What we proclaim vs implementation choices (aka "Browsers are gonna do what browsers are gonna do")
Essentially, we attempt to document what happens, but different parties who incorporate or use the file are making their own choices about what they will do.

In some applications, the loose interpretation is adequate. In others the strict is much wiser.

This is ultimately just a text file.
