Add unique display names as variants #20

kg583 · 2024-05-27T01:59:23Z

Adds all <display> names as <variant> names for the given token so long as

The name uniquely identifies the token
The name isn't redundant

Both of those conditions are just the requirements for any variant name. Not every display name satisfies them (e.g. stats "e" and lowercase "e"), hence why display names are not used by trie.py (and hence tivars_lib_py) for tokenization. Adding the ones that are unique as variants is the most straightforward way to make such names available.

The new sheet passed validation on my end, though it's worth double-checking.

rpitasky · 2024-05-27T02:02:53Z

How did you decide which token to include a conflicting display name under?

kg583 · 2024-05-27T02:03:38Z

I didn't; if there's a conflict then nobody gets it.

rpitasky · 2024-05-27T02:13:12Z

My intention was for the <display> entry to be used in this way by consumers of the sheet attempting to turn text into tokens; the decisions on conflicts would have to be made downstream.

This change makes a lot of sense, though, and now I think it's worth considering whether the display entry should instead be an attribute like ti-ascii.

kg583 · 2024-05-27T02:25:51Z

My intention was for the entry to be used in this way by consumers of the sheet attempting to turn text into tokens; the decisions on conflicts would have to be made downstream.

Hmmm, I didn't think of that angle (suppose I could have dug back through our discussions on the spec). Deciding this downstream for the varslib tokenizer was my intention, but it felt a bit bodgy.

now I think it's worth considering whether the display entry should instead be an attribute like ti-ascii

Good idea. It'd make sense to make an attribute though it's not pressing either way.

rpitasky · 2024-05-27T02:28:27Z

I think we either do both #20 (this PR) and make display an attribute, or neither-- definitely worth reading up on our original rationale in #12 when we first added the display names.

kg583 · 2024-05-31T03:09:07Z

So I proposed the the initial form of the current format here:

<lang code="en">
  <encodings>
    <ti-ascii>Field A</ti-ascii>
    <!-- are there any other viable encodings?
         cause if not this does seem maybe a tad silly -->
  </encodings>
  <names>
    <display>Field B</display>
    <accessible>Field C</accessible>
    <variant>Field D</variant>
    <!-- more variants as need be -->
  </names>
</lang>

Tari then suggested we simplify the structure a bit by not having a separate <encodings> tag:

<lang code="en" ti-ascii="Field A">
  <display>Field B</display>
  <accessible>Field C</accessible>
  <variant>Field D</variant>
  <!-- additional variants if applicable -->
</lang>

Here's his reasoning, which at least partially resonated with everyone since it's now the format:

I think if we're unable to identify any other relevant encodings, the value of Field A could just as easily be placed in an attribute of the lang element; additional encodings could always be added in a new schema version if a reason were found to include them.

It's clear that it is worthwhile to distinguish encodings and names, named so literally in my proposal. Since there is only one encoding we care about, and the number of distinct names could be pretty big, it made sense to collapse everything down to a single containing tag, with ti-ascii as an attribute. If another encoding were to come around, we'd look into adding it as an attribute as well.

Thus, there is due cause to keep display out of the attributes, since it's not an encoding. However, Tari also said in the same comment:

I suppose we could take the attribute-ification even further and convert all of the non-repeatable elements into attributes of lang which doesn't sacrifice any expressiveness and is even easier to parse.

This bit received basically no reply, neither enthusiasm nor objection. Since display is necessarily unique within a lang tag, it could be made an attribute without issue. Although, one then asks if accessible should receive the same treatment, since it's also unique, and then all of a sudden we don't need variant anymore:

<lang code="en" ti-ascii="..." display="..." accessible="...">
  <name>...</name>
  <name>...</name>
  ...
</lang>

It's certainly "easier" to parse, though human readability starts to diminish as the attribute line can get quite long (e.g. Insert Comment Above).

adriweb · 2024-05-31T06:10:18Z

I'm OK with having all the unique things as attributes as it makes it clearer that they are indeed unique (per <lang>.

We could still have something "visually easy" to parse by having them on separate lines if needed?

<lang code="en"
      ti-ascii="..."
      display="..."
      accessible="...">
  <name>...</name>
  <name>...</name>
  ...
</lang>

kg583 · 2024-05-31T21:07:43Z

I do rather like collapsing things down; simplifies parsing and validation for sure.

However, thinking it over I'd vote for making display an attribute but leaving accessible alone. The idea is that ti-ascii and display are both properties of their tokens, which are well-defined (i.e. unique within a lang) but not globally unique. While the accessible and (list of) variant names are also properties, they can instead be thought of as indexes into the sheet, since they are globally unique. You can use them to tokenize and it's never ambiguous.

In Pythonistic terms (if we pretend that versions and langs don't exist for a moment) you can write

print(token.ti_ascii)
print(token.display)

since they are attributes of the token, but you cannot form the dictionaries

{token.ti_ascii: token for token in tokens}
{token.display: token for token in tokens}

without collisions. Indeed, you can only do this with accessible and variant names, which is exactly how TokenTrie more-or-less works:

{name: token for token in tokens for name in token.names}

This is what should be, IMO, the defining difference between attributes and child tags. accessible is a special case, since it is both unique within a lang and the entire sheet, but I think it makes more sense from the POV of tokenization to put it with the variant names.

Add unique display names as variants

822e6e2

kg583 added the enhancement New feature or request label May 27, 2024

kg583 added 3 commits July 23, 2024 17:19

Merge branch 'main' into display-names

03b06a3

Make display an attribute

7ce2782

Fix build message

f286a9d

kg583 requested a review from rpitasky July 24, 2024 01:48

kg583 assigned kg583 and rpitasky Jul 24, 2024

rpitasky approved these changes Jul 26, 2024

View reviewed changes

kg583 merged commit 1d02427 into main Jul 26, 2024
2 checks passed

kg583 deleted the display-names branch July 26, 2024 01:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add unique display names as variants #20

Add unique display names as variants #20

kg583 commented May 27, 2024

rpitasky commented May 27, 2024

kg583 commented May 27, 2024

rpitasky commented May 27, 2024

kg583 commented May 27, 2024

rpitasky commented May 27, 2024 •

edited

Loading

kg583 commented May 31, 2024

adriweb commented May 31, 2024

kg583 commented May 31, 2024

Add unique display names as variants #20

Add unique display names as variants #20

Conversation

kg583 commented May 27, 2024

rpitasky commented May 27, 2024

kg583 commented May 27, 2024

rpitasky commented May 27, 2024

kg583 commented May 27, 2024

rpitasky commented May 27, 2024 • edited Loading

kg583 commented May 31, 2024

adriweb commented May 31, 2024

kg583 commented May 31, 2024

rpitasky commented May 27, 2024 •

edited

Loading