More flexible citation key syntax #6026

Closed
dhimmel opened this issue Jan 3, 2020 · 20 comments

Comments

@dhimmel

dhimmel commented Jan 3, 2020

From the Pandoc manual:

Citations go inside square brackets and are separated by semicolons. Each citation must have a key, composed of '@' + the citation identifier from the database, and may optionally have a prefix, a locator, and a suffix. The citation key must begin with a letter, digit, or _, and may contain alphanumerics, _, and internal punctuation characters (:.#$%&-+?<>~/)

The citation key syntax is limited (see as a regex), preventing use of a variety of types of citekeys that various users would like:

One use case that I'm interested in for Manubot is citation-by-persistent-identifier, where the citekey is an actual identifier. Often, however, identifiers contain characters forbidden by Pandoc's citation key syntax. For example, we'd like to be able to include citekeys like:

Citekey with parentheses @doi:10.1016/S0022-2836(05)80360-2
Citekey with closing slash @https://www.google.com/
Citekey with equal sign @https://openreview.net/forum?id=HkwoSDPgg
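
To make the restriction concrete, here is a rough Python sketch of the manual's rule as a regular expression. It is a loose reading of the sentence quoted above (ASCII only, with punctuation allowed anywhere except at the end of the key), not pandoc's actual parser; the names `CITEKEY_RE` and `parsed_key` are made up for illustration.

```python
import re

# Loose reading of the manual: the first character is a letter, digit, or "_";
# after that, alphanumerics, "_", and internal punctuation are allowed.
# "Internal" is taken here to mean "not the last character of the key".
PUNCT = r":.#$%&\-+?<>~/"
CITEKEY_RE = re.compile(rf"[A-Za-z0-9_](?:[{PUNCT}]*[A-Za-z0-9_])*")


def parsed_key(identifier):
    """Return the leading part of `identifier` that this rule accepts as a key."""
    match = CITEKEY_RE.match(identifier)
    return match.group(0) if match else ""


for identifier in [
    "doi:10.1016/S0022-2836(05)80360-2",
    "https://www.google.com/",
    "https://openreview.net/forum?id=HkwoSDPgg",
]:
    print(f"{identifier!r} -> {parsed_key(identifier)!r}")
```

Under this reading each example is cut short: at the opening parenthesis, before the trailing slash, and at the equals sign.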

My intended use case would just require more flexible citekeys for markdown input. We would likely use a custom filter to generate new citekeys for the output. However, it also seems from the issues above that some users would like more flexibility for citekeys across the board.

In jgm/pandoc-citeproc#308 (comment), @jgm proposed a syntax like:

@{whatever you want here}
@mb21
Collaborator

mb21 commented Jan 3, 2020

Not sure about your use case, but potentially you could also just use a markdown link or span with custom attributes and use a filter to go from there...

btw. #813 might be related?

@dhimmel
Author

dhimmel commented Jan 3, 2020

potentially you could also just use a markdown link or span with custom attributes and use a filter to go from there

The pandoc-url2cite filter by @phiresky has a nice syntax for defining citekey aliases that can contain forbidden characters:

Citekey with closing slash @google

[@google]: https://www.google.com/

I think this is a good workaround, but I still see a benefit in being able to skip the link reference / alias step altogether and do something like:

Citekey with closing slash @{https://www.google.com/}

@jgm
Owner

jgm commented Jan 4, 2020

This seems like a reasonable change to me.

@jgm
Owner

jgm commented May 17, 2020

Coming back to this in response to the PR, I'm a bit torn.

In many ways I really like the syntax

Citekey with closing slash [@google]

[@google]: https://www.google.com/

It seems to me this would be nicer to write with than including long DOI links inline -- for one thing, it's not obvious what a DOI citation is unless you follow the link, so using mnemonic names would increase source readability, in line with Markdown's goals.

But a drawback is that things like

Citekey with closing slash [@google]

[@google]: https://www.google.com/

already have a well-defined meaning in Markdown (as regular links), which we'd be changing. We'd also be removing the equivalence between this and

Citekey with closing slash [@google](https://www.google.com/)

There's also the question whether to support a bare (author-in-text) @google with this syntax.

Aver1y added a commit to Aver1y/pandoc that referenced this issue May 17, 2020
@phiresky

phiresky commented May 17, 2020

I've wanted to open a separate issue about this before, but I guess now it's part of this discussion anyways:

In my opinion, pandoc should change the parsing of both [@foo](bar) and [@foo]\n\n[@foo]: bar to be the same as the parsing of [foo](bar) and [foo]\n[foo]: bar respectively. That is, just regarding parsing it should behave the same as pandoc -f markdown-citations.

This might require an AST change - citations could be a special form of links, or it could be output the same as now just adding some "cite key alias" logic (the "link label" would be the short cite key and the "link target" would be the full cite key (e.g. doi / url)).
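
The "cite key alias" idea can be sketched at the text level in Python. This is only an illustration: it assumes aliases are defined on their own lines as `[@label]: target` and that `@label` is rewritten to the brace form `@{target}` proposed earlier in this thread. The function names are hypothetical, and a real filter would operate on pandoc's AST rather than on raw text.

```python
import re

# One alias definition per line, e.g. "[@google]: https://www.google.com/"
ALIAS_DEF_RE = re.compile(r"^\[@([^\]\s]+)\]:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def collect_aliases(markdown):
    """Map each short label to its full citekey (the "link target")."""
    return {label: target for label, target in ALIAS_DEF_RE.findall(markdown)}


def expand_aliases(markdown):
    """Drop alias definitions and rewrite known @label uses to @{target}."""
    aliases = collect_aliases(markdown)
    body = ALIAS_DEF_RE.sub("", markdown)

    def replace(match):
        label = match.group(1)
        if label in aliases:
            return "@{" + aliases[label] + "}"
        return match.group(0)  # unknown labels are left untouched

    return re.sub(r"@([A-Za-z0-9_]+)", replace, body)


source = "Citekey with closing slash [@google]\n\n[@google]: https://www.google.com/\n"
print(expand_aliases(source).strip())
```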

We also now have (at least) three competing implementations of exactly the above (converting the AST produced by [@x]\n\n[@x]: https://.... to have it mean citekey alias instead of literal text):

That's because we all wanted some form of using globally unique identifiers like urls, isbns and dois, as cite-keys, in combination with a 100% automated bibliography manager.

All three of these implementations are incompatible (as far as I know), since we handle newlines / paragraphs differently - so it would be great if this was just supported in pandoc core.

Changing the parsing would make it easier to write "polyglot" markdown documents that use citations but are also parsed by a different parser, such as GitHub's or commonmark.

It would also mean that adding the new syntax proposed above is unnecessary, since that then just becomes [@](https://openreview.net/forum?id=HkwoSDPgg) (or [@x]\n[@x]: https...) with all the same escaping rules like URLs.

I also don't think it would affect existing documents much, since who puts round brackets right after citations or [@x]: y in its own paragraph/line?

@Aver1y
Contributor

Aver1y commented May 17, 2020

I'm somewhat confused what exactly the meaning of [@x](y) is supposed to be. How is it different from [@](y)?

Also, somewhat related: the parsing of link definitions right now allows them only at the start of a paragraph. I find that rather weird; either they should just have to be on their own line or they should have to be their own paragraph. I suppose the latter was probably intended but wasn't done, in order to not have to backtrack that far?

@jgm
Owner

jgm commented May 18, 2020

It would also mean that adding the new syntax proposed above is unnecessary, since that then just becomes [@](https://openreview.net/forum?id=HkwoSDPgg) (or [@x]\n[@x]: https...) with all the same escaping rules like URLs.

With the proposed brace syntax you could distinguish between a regular citation [@{foo}] and an author-in-text citation @{foo}. I'm not sure how this would work with your proposal. [@](foo) would presumably only correspond to one of these.
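
A small Python sketch of the distinction drawn here, assuming the proposed brace syntax and handling only the simplest forms (no prefix, locator, or suffix, and no braces inside the key). The regular expression and the `find_citations` name are illustrative, not pandoc's implementation.

```python
import re

# First alternative: bracketed [@{key}]; second alternative: bare @{key}.
CITATION_RE = re.compile(r"\[@\{([^{}]+)\}\]|@\{([^{}]+)\}")


def find_citations(text):
    """Return (kind, key) pairs in document order."""
    citations = []
    for match in CITATION_RE.finditer(text):
        bracketed, bare = match.groups()
        if bracketed is not None:
            citations.append(("regular", bracketed))
        else:
            citations.append(("author-in-text", bare))
    return citations


text = "As @{https://www.google.com/} shows [@{doi:10.1016/S0022-2836(05)80360-2}]."
for kind, key in find_citations(text):
    print(kind, key)
```

It is not obvious how a single form such as `[@](foo)` would carry the same two-way distinction.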

@jgm
Owner

jgm commented May 18, 2020

Instead of overloading reference link syntax as proposed above, if the point is just to provide short, readable aliases for unreadable citation keys that might be used in a bibliography, we could make pandoc-citeproc's citation lookup sensitive to a table of aliases that could be provided in the metadata:

citation-aliases:
  foo: @{big-long-citation-with-weird-symbols}
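
How such a lookup might behave can be sketched in Python, with the metadata table represented as a plain dictionary. The `citation-aliases` field is only a proposal in this thread, and `resolve_citekey` is a hypothetical helper, not part of pandoc-citeproc.

```python
def resolve_citekey(key, aliases):
    """Return the full citekey for `key`, or `key` itself if it has no alias."""
    target = aliases.get(key)
    if target is None:
        return key
    # Accept the alias value written as "@{...}", as "@key", or as a bare key.
    if target.startswith("@{") and target.endswith("}"):
        return target[2:-1]
    if target.startswith("@"):
        return target[1:]
    return target


# Stand-in for the parsed document metadata (assumed shape).
metadata = {"citation-aliases": {"foo": "@{big-long-citation-with-weird-symbols}"}}
aliases = metadata["citation-aliases"]

print(resolve_citekey("foo", aliases))
print(resolve_citekey("bar", aliases))
```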

@dhimmel
Author

dhimmel commented May 18, 2020

Thanks @Aver1y for the implementation in #6373 to support @{citekey}. This will provide sufficient flexibility to include all the types of citation keys we're interested in.

we could make pandoc-citeproc's citation lookup sensitive to a table of aliases that could be provided in the metadata

For the pandoc-manubot-cite filter, we support metadata.citekey-aliases. Credit to @nichtich for initially suggesting this approach. We also support the reference link syntax.

I think @phiresky likes the reference link syntax because the hyperlinks on citekeys will render in basic markdown engines, like when viewing the .md file on GitHub.

I haven't used the other-ids field for references (jgm/pandoc-citeproc#356 / jgm/pandoc-citeproc@8326a10), but this feature is also something to be aware of when thinking about citekey aliases.

@Aver1y
Contributor

Aver1y commented May 18, 2020

With url2cite I'm really thinking of these alias definitions as definitions of bibliography entries and I like that I can define them locally to where I use them. Also the parallel to link aliases and footnotes is nice. I think we should also allow mixing definitions of footnotes, link aliases and citekey aliases inside one paragraph like this:

You can mix footnotes[^1], [link] alias and citekey alias [@Hirshfeld2016]
definitions in one paragraph.

[^1]: This is a footnote
[link]: https://example.com
[@Hirshfeld2016]: https://www.ncbi.nlm.nih.gov/pubmed/27672412
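
As a rough illustration of how the three kinds of definition could be told apart when they share a paragraph, here is a Python sketch that classifies each line by its label. It assumes one definition per line and the label shapes used in the example above; it is not how pandoc or url2cite parse definitions.

```python
import re

# "[label]: content" on a single line
DEFINITION_RE = re.compile(r"^\[([^\]]+)\]:[ \t]*(.+)$")


def classify_definition(line):
    """Return (kind, label, content), or None if the line is not a definition."""
    match = DEFINITION_RE.match(line.strip())
    if match is None:
        return None
    label, content = match.groups()
    if label.startswith("^"):
        return ("footnote", label[1:], content)
    if label.startswith("@"):
        return ("citekey alias", label[1:], content)
    return ("link alias", label, content)


paragraph = """[^1]: This is a footnote
[link]: https://example.com
[@Hirshfeld2016]: https://www.ncbi.nlm.nih.gov/pubmed/27672412"""

for line in paragraph.splitlines():
    print(classify_definition(line))
```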

@brainchild0

Also the parallel to link aliases and footnotes is nice.

Superficially it may seem that the idea completes a symmetry with existing handling of links and footnotes, but cite keys are different because they are keys to data outside the document body. Thus, the effect of the proposition would be to produce a key to a key, where the first key is resolved internally to the body text (which includes the link and footnote definitions), and the second is resolved by the separate citations machinery. Thus the idea by @jgm to consider this first key as an alias that might be given also in the metadata seems more appropriate than the attempt to force citations into the same category as links and footnotes.

@Aver1y
Contributor

Aver1y commented May 29, 2020

but cite keys are different because they are keys to data outside the document body

I'm not sure why that is a relevant difference.

@phiresky

phiresky commented May 29, 2020

Thus, the effect of the proposition would be to produce a key to a key

This doesn't seem as clear-cut to me either. The link syntax is already overloaded to mean many things: in [a](b), a is a usually short visible name / identifier, and b is an often longer, less readable, sometimes globally unique identifier that specifies some external resource, including

  • Relative file paths. Resolved via local file system or as a URL based on the <base> element.
  • Other partially relative URLs: /foo for domain-relative and //foo.com for protocol-relative.
  • Absolute URLs
  • mailto:... email addresses
  • magnet links to torrents (which are not a location specifier, but a content-based description)
  • any other kind of handler the user agent (browser, pdf viewer?) may allow, like javascript:alert("foo") or data:text/plain,hello or steam://open/bigpicture or whatever

All of these are resolved by a "separate machinery" and some but not all of them are "keys to data outside the document body". IMO the content of a URL is that website, not the char-sequence that comprises the link.

As a comparison, footnotes like [^a] are keys to content, kinda similar to [a](data:text/plain,hello).

Citekeys with the above syntax would be similar to links: a is a short locally defined identifier, while b is an often longer, less readable, sometimes globally unique identifier, like:

This long identifier can both be resolved by the user agent, or by a preprocessor that turns them into something more readable or useful.


Mainly, in my opinion, it makes sense to keep pandoc syntax close to commonmark, especially where there is no real reason to deviate - adding back [@foo]\n[@foo]: xyz syntax is something you intuitively expect to work if you know markdown, and it makes it more compatible with other markdown parsers (where handling those specially could then be implemented as a post-processor).

I wouldn't really care that much about the [@foo](bar) syntax, since while it makes sense for the use case of polyglot documents between commonmark and pandoc-url2cite, I can see why it doesn't have well-defined semantics in general.

@brainchild0

brainchild0 commented May 30, 2020

but cite keys are different because they are keys to data outside the document body

I'm not sure why that is a relevant difference.

Yes, well, I was trying to be brief, but now I will offer the details.

In the case of footnotes and links, all of the information for the document is in the document, which includes both the main text and the definitions list for footnotes and links. Placing footnote content or link addresses in a physical position outside the main text is useful because it makes the appearance of the Markdown representation closer to a published target, more fully meeting natural expectations of how a document appears visually.

Although it may be useful to put short-form keys in the text, the use of the footnote and link definitions list for this purpose has the effect of adding an intermediary location into the process for resolving the ultimate target of the citation. This design does not advance the original purpose of the list, as a place for items that are part of the text but visually separate from the main text.

Since cite keys ultimately are resolved by citation processors evaluating the metadata, resolving the short-form keys from an alias table also appearing in the metadata preserves the constraint that all the keys are found in the metadata, only adding to the metadata one further table. This approach separates concerns, simplifies design, and clarifies operation. It also appears, as far as I understand so far, to have minimal or no adverse effect for the user in most cases.

While the distinction is subtle from some perspectives, the details within it are relevant, I would politely argue, for choosing among the design choices that have been offered.

@brainchild0

brainchild0 commented May 30, 2020

All of these are resolved by a "separate machinery" and some but not all of them are "keys to data outside the document body". IMO the content of a URL is that website, not the char-sequence that comprises the link.

Yes, formally, but see above. The distinction is that while a human is free to open a website from a hyperlink while reading a document, the document includes only the address itself, not the content of the web site. The link address is a final destination from a standpoint of document processing. The cite key is not.


I think @phiresky likes the reference link syntax because the hyperlinks on citekeys will render in basic markdown engines, like when viewing the .md file on GitHub.

I can see how this effect is useful, but the method described, as I understand, would force a @ prefix into the visible text. Perhaps a cleaner method is to post-process (i.e. filter) links matching some criteria (e.g. pattern match, appearance in a table) into full references. (This behavior appears already to be supported in url2cite.)

@brainchild0

brainchild0 commented May 30, 2020

At the risk of adding unwanted clutter, I would also observe that putting mapping rules in metadata opens the possibility of much more sophisticated rules, if needed: for example, a pattern rule such that @{OR-HkwoSDPgg} is shorthand for @{https://openreview.net/forum?id=HkwoSDPgg}, without the HkwoSDPgg key appearing in any static table. Conventions of this kind might be defined per document or in a general pipeline applied to multiple input documents.

citation-patterns:
  - [ 'OR-(\w+)', 'https://openreview.net/forum?id=$1' ]

I'm not sure how strong the case is for this functionality, but based on the original request, it seems that keeping open possibilities such as this one is compelling.
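
To illustrate, a pattern rule like the one above could be applied as in the following Python sketch. The `citation-patterns` field is hypothetical (it exists only in this comment), the `expand_key` helper is made up, and the `$1` placeholder is assumed to refer to the first capture group.

```python
import re

# Stand-in for the parsed `citation-patterns` metadata (assumed shape):
# a list of (regex, template) pairs, where $N refers to capture group N.
CITATION_PATTERNS = [
    (r"OR-(\w+)", "https://openreview.net/forum?id=$1"),
]


def expand_key(key, patterns=CITATION_PATTERNS):
    """Expand `key` with the first pattern that matches it entirely."""
    for pattern, template in patterns:
        match = re.fullmatch(pattern, key)
        if match:
            # Substitute each $N in the template with the text of group N.
            return re.sub(
                r"\$(\d+)",
                lambda ref: match.group(int(ref.group(1))),
                template,
            )
    return key  # no rule applies; the key is used as written


print(expand_key("OR-HkwoSDPgg"))
print(expand_key("smith2020"))
```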

@nichtich
Contributor

nichtich commented May 30, 2020 via email

@jgm
Owner

jgm commented May 13, 2021

Now that we have the @{...} syntax, can we close this?

@dhimmel
Author

dhimmel commented May 14, 2021

Now that we have the @{...} syntax, can we close this?

Fantastic news! Is there a commit or pull request that added this? Couldn't find anything in the recent history.

Closing since the @{...} syntax provides the required flexibility.

@dhimmel dhimmel closed this as completed May 14, 2021
@jgm jgm reopened this May 14, 2021
@jgm
Owner

jgm commented May 14, 2021

Sorry, false alarm! I could have sworn that we'd added this feature, but I guess not; it was only discussed!
