Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeMath implementation reference #2

Open
znjameswu opened this issue Jul 28, 2020 · 3 comments
Open

UnicodeMath implementation reference #2

znjameswu opened this issue Jul 28, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@znjameswu
Copy link
Owner

znjameswu commented Jul 28, 2020

Here is a dedicated issue for the implementation of UnicodeMath spec.

After some quick experiment, I find some quite burning discrepencies in the UnicodeMath. The following discrepencies will directly affect parser design and behavior. I will document them within this issue and the docs.

Current Plan

There are two areas of UnicodeMath input support currently under consideration: user input, or a string input. The plan is to support both by treating string input as a stream of user character inputs (with the cursor fixed on end). (This behavior is not guaranteed by the spec itself, but it seems good.)

The UnicodeMath spec itself proposed a reference parsing algorithm in Appendix A. However there are some conflicts in the UnicodeMath spec.

Conflicts and Decision

Spec-wise, this reference algorithm deviates from the spec:

  • Handling of the unary operators (Align with reference)
    • '+^2', 'a^-2' as subsup? (Spec says no. Reference says yes.)

Also, MS Word's implementation deviates from the reference:

  • Handling of the unary operators (remains to be decided)
    • '1/+2', '+1/2' as fraction? (Reference disallows both. MS Word permits the first one.)
  • Differentiation between the unary plus and unary minus (remains to be decided)
    • 'a^-1' vs 'a^+1' as subsup? (Reference only allows the first. MS Word allows both.)
  • Differentiation between / and other operators in handling unary operators (remains to be decided)
    • '1/+2' as fraction, 'a\of+b' as function, and '\sqrt+2' as sqrt? (Reference says no no no. MS Word permits the first.)

Reference algorithm proposed by spec


  • char ← Unicode character
  • space ← ASCII space (U+0020)
  • αASCII ← ASCII A-Z a-z
  • nASCII ← ASCII 0-9
  • αnMath ← Unicode math alphanumeric (U+1D400 – U+1D7FF with some Letterlike symbols U+2102 – U+2134)
  • αnOther ← Unicode alphanumeric not including αnMath nor nASCII
  • αn ← αnMath | αnOther

  • diacritic ← Unicode combining mark
  • opArray ← ‘&’ | VT | ‘■’
  • opClose ← ‘)’ | ‘]’ | ‘}’ | ‘⌍’
  • opCloser ← opClose | “\close”
  • opDecimal ← ‘.’ | ‘,’
  • opHbracket ← Unicode math horizontal bracket
  • opNary ← Unicode integrals, summation, product, and other nary ops
  • opOpen ← ‘(’ | ‘[’ | ‘{’ | ‘⌌’
  • opOpener ← opOpen | “\open”
  • opOver ← ‘/’ | “\atop”
  • opBuildup ← ‘_’ | ‘^’ | ‘√’ | ‘∙’ | ‘√’ | ‘□’ | ‘/’ | ‘|’ | opArray | opOpen | opClose | opNary | opOver | opHbracket | opDecimal
  • other ← char – {αn + nASCII + diacritic + opBuildup + CR}

  • diacriticbase ← αn | nASCII | ‘(’ exp ‘)’
  • diacritics ← diacritic | diacritics diacritic
  • atom ← αn | diacriticbase diacritics
  • atoms ← atom | atoms atom
  • digits ← nASCII | digits nASCII
  • number ← digits | digits opDecimal digits

  • expBracket ← opOpener exp opCloser
    ← ‘||’ exp ‘||’
    ← ‘|’ exp ‘|’
  • word ← αASCII | word αASCII
  • scriptbase ← word | word nASCII | αnMath | number | other | expBracket | opNary
  • soperand ← operand | ‘∞’ | ‘-’ operand | “-∞”
  • expSubsup ← scriptbase ‘’ soperand ‘^’ soperand | scriptbase ‘^’ soperand ‘’ soperand
  • expSubscript ← scriptbase ‘_’ soperand
  • expSuperscript ← scriptbase ‘^’ soperand
  • expScript ← expSubsup | expSubscript | expSuperscript

  • entity ← atoms | expBracket | number
  • factor ← entity | entity ‘!’ | entity “!!” | function | expScript
  • operand ← factor | operand factor
  • box ← ‘□’ operand
  • hbrack ← opHbracket operand
  • sqrt ← ‘√’ operand
  • cubert ← ‘∙’ operand
  • fourthrt ← ‘√’ operand
  • nthrt ← “√(” operand ‘&’ operand ‘)’
  • function ← sqrt | cubert | fourthrt | nthrt | box | hbrack
  • numerator ← operand | fraction
  • fraction ← numerator opOver operand

  • row ← exp | row ‘&’ exp
  • rows ← row | rows ‘@’ row
  • array ← “\array(” rows ‘)’

  • element ← fraction | operand | array
  • exp ← element | exp other element
@znjameswu znjameswu pinned this issue Jul 28, 2020
@znjameswu
Copy link
Owner Author

znjameswu commented Jul 28, 2020

Current intended solution is to specialize '/' and '_', '^', so that they automatically raise the precedence of the unary-able operators behind them. I still need more testing on MS Word

@znjameswu znjameswu unpinned this issue Sep 4, 2020
@znjameswu znjameswu added the enhancement New feature or request label Nov 20, 2020
@znjameswu
Copy link
Owner Author

znjameswu commented Dec 30, 2020

Found a really good project to be a reference https://github.com/doersino/UnicodeMathML

After going through its implementation. I think the following solution might work for this project:

  • Modify petitparser so that it can take List<GreenNode> as input.
  • Non-SymbolNode or SymbolNode marked as escaped will be viewed as primitives. Non-escaped SymbolNode will be recognized as operators by the parser.
  • In conversion mode which takes a String as input, we translate the String into List<SymbolNode>, and run the conversion-mode parser in forward order. This parser variant will be exhaustive and will try to parse everything till the end. (This will highly resemble UnicodeMathML project mentioned above)
  • In interactive mode which takes user input, we only supply a sublist of children of the current active EquationRowNode as the input to the parser. This parser variant will try to parse all input into one term, given operator precedence requirement.
    • We starting by giving sublist in the cursorPos-2:cursorPos range.
    • If parser fails, decrease the lower bound by 1 and retry.
    • If parser succeeds, use parsing results to modify the children list. Decrease the lower bound by 1 and try again.
    • The loop terminates when the lower bound crosses zero.
    • (This is not a backtracking process in regular expression! Everytime the parser succeeds, the input gets modified!)

The rationale is that:

  • It is quite clear that we can't straight up use forward parsing for interactive mode (we should only built one term on the right end). Also, forward parsing sometimes would just give counter-intuitive (and non-MS-Office-compliant) results (e.g. \int_i^j_i^j evaluates to (\int_i^j)_i^j in forward parsing)
  • The conversion mode needs to work in forward order. There are certain cases the interactive-mode parsing would just produce spec-breaking results (e.g. \int_i^j_i^j a evaluates to \int_i^(j_i^j) a in backward parsing)

Why not let the parser parse backward for the interactive mode?

  • Advantage: No backtracking. Start from the cursor position and matches leftward. No loop involved.
  • Disadvantage: It seems MS Office disagrees with backward parsing (See separator ambiguity test results below)

@znjameswu
Copy link
Owner Author

znjameswu commented Dec 30, 2020

Some other ambiguities related to separators (Tested in MS Word):

  • (1|2|3) evaluates to a bracket containing 1, |2|, 3 (Documented in the spec)
  • (1||) evaluates to (, 1, |, |\box )
  • (1|2|) evaluates to a bracket containing 1, |2|
  • (1|2|3|4) evaluates to the form of (A|B) where A=1|2|3 and B=4 (This comes as a surprise. Backward parsing shouldn't produce this)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant