Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unique display names as variants #20

Merged
merged 4 commits into from
Jul 26, 2024
Merged

Add unique display names as variants #20

merged 4 commits into from
Jul 26, 2024

Conversation

kg583
Copy link
Member

@kg583 kg583 commented May 27, 2024

Adds all <display> names as <variant> names for the given token so long as

  • The name uniquely identifies the token
  • The name isn't redundant

Both of those conditions are just the requirements for any variant name. Not every display name satisfies them (e.g. stats "e" and lowercase "e"), hence why display names are not used by trie.py (and hence tivars_lib_py) for tokenization. Adding the ones that are unique as variants is the most straightforward way to make such names available.

The new sheet passed validation on my end, though it's worth double-checking.

@kg583 kg583 added the enhancement New feature or request label May 27, 2024
@rpitasky
Copy link
Member

How did you decide which token to include a conflicting display name under?

@kg583
Copy link
Member Author

kg583 commented May 27, 2024

I didn't; if there's a conflict then nobody gets it.

@rpitasky
Copy link
Member

My intention was for the <display> entry to be used in this way by consumers of the sheet attempting to turn text into tokens; the decisions on conflicts would have to be made downstream.

This change makes a lot of sense, though, and now I think it's worth considering whether the display entry should instead be an attribute like ti-ascii.

@kg583
Copy link
Member Author

kg583 commented May 27, 2024

My intention was for the entry to be used in this way by consumers of the sheet attempting to turn text into tokens; the decisions on conflicts would have to be made downstream.

Hmmm, I didn't think of that angle (suppose I could have dug back through our discussions on the spec). Deciding this downstream for the varslib tokenizer was my intention, but it felt a bit bodgy.

now I think it's worth considering whether the display entry should instead be an attribute like ti-ascii

Good idea. It'd make sense to make an attribute though it's not pressing either way.

@rpitasky
Copy link
Member

rpitasky commented May 27, 2024

I think we either do both #20 (this PR) and make display an attribute, or neither-- definitely worth reading up on our original rationale in #12 when we first added the display names.

@kg583
Copy link
Member Author

kg583 commented May 31, 2024

So I proposed the the initial form of the current format here:

<lang code="en">
  <encodings>
    <ti-ascii>Field A</ti-ascii>
    <!-- are there any other viable encodings?
         cause if not this does seem maybe a tad silly -->
  </encodings>
  <names>
    <display>Field B</display>
    <accessible>Field C</accessible>
    <variant>Field D</variant>
    <!-- more variants as need be -->
  </names>
</lang>

Tari then suggested we simplify the structure a bit by not having a separate <encodings> tag:

<lang code="en" ti-ascii="Field A">
  <display>Field B</display>
  <accessible>Field C</accessible>
  <variant>Field D</variant>
  <!-- additional variants if applicable -->
</lang>

Here's his reasoning, which at least partially resonated with everyone since it's now the format:

I think if we're unable to identify any other relevant encodings, the value of Field A could just as easily be placed in an attribute of the lang element; additional encodings could always be added in a new schema version if a reason were found to include them.

It's clear that it is worthwhile to distinguish encodings and names, named so literally in my proposal. Since there is only one encoding we care about, and the number of distinct names could be pretty big, it made sense to collapse everything down to a single containing tag, with ti-ascii as an attribute. If another encoding were to come around, we'd look into adding it as an attribute as well.

Thus, there is due cause to keep display out of the attributes, since it's not an encoding. However, Tari also said in the same comment:

I suppose we could take the attribute-ification even further and convert all of the non-repeatable elements into attributes of lang which doesn't sacrifice any expressiveness and is even easier to parse.

This bit received basically no reply, neither enthusiasm nor objection. Since display is necessarily unique within a lang tag, it could be made an attribute without issue. Although, one then asks if accessible should receive the same treatment, since it's also unique, and then all of a sudden we don't need variant anymore:

<lang code="en" ti-ascii="..." display="..." accessible="...">
  <name>...</name>
  <name>...</name>
  ...
</lang>

It's certainly "easier" to parse, though human readability starts to diminish as the attribute line can get quite long (e.g. Insert Comment Above).

@adriweb
Copy link
Contributor

adriweb commented May 31, 2024

I'm OK with having all the unique things as attributes as it makes it clearer that they are indeed unique (per <lang>.

We could still have something "visually easy" to parse by having them on separate lines if needed?

<lang code="en"
      ti-ascii="..."
      display="..."
      accessible="...">
  <name>...</name>
  <name>...</name>
  ...
</lang>

@kg583
Copy link
Member Author

kg583 commented May 31, 2024

I do rather like collapsing things down; simplifies parsing and validation for sure.

However, thinking it over I'd vote for making display an attribute but leaving accessible alone. The idea is that ti-ascii and display are both properties of their tokens, which are well-defined (i.e. unique within a lang) but not globally unique. While the accessible and (list of) variant names are also properties, they can instead be thought of as indexes into the sheet, since they are globally unique. You can use them to tokenize and it's never ambiguous.

In Pythonistic terms (if we pretend that versions and langs don't exist for a moment) you can write

print(token.ti_ascii)
print(token.display)

since they are attributes of the token, but you cannot form the dictionaries

{token.ti_ascii: token for token in tokens}
{token.display: token for token in tokens}

without collisions. Indeed, you can only do this with accessible and variant names, which is exactly how TokenTrie more-or-less works:

{name: token for token in tokens for name in token.names}

This is what should be, IMO, the defining difference between attributes and child tags. accessible is a special case, since it is both unique within a lang and the entire sheet, but I think it makes more sense from the POV of tokenization to put it with the variant names.

@kg583 kg583 requested a review from rpitasky July 24, 2024 01:48
@kg583 kg583 merged commit 1d02427 into main Jul 26, 2024
2 checks passed
@kg583 kg583 deleted the display-names branch July 26, 2024 01:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants