Expose a proper tokenization/colorization API to extensions #1967

Closed
dajoh opened this issue Jan 13, 2016 · 15 comments
Labels: api, *duplicate, feature-request

Comments

@dajoh

dajoh commented Jan 13, 2016

Rich context-sensitive syntax colorization is very hard to do (if not impossible) with tmLanguage syntax definitions. The functionality for specifying custom colorizers seems to be there, but not exposed to extensions (ITokenizationSupport).

One way to expose colorization would be to just let the extension provide an ITokenizationSupport implementation, and have that completely override the tmLanguage syntax definition (if any).

Another way is to let multiple tokenizers work in parallel, each classifying different tokens of the program. For example, a tmLanguage-based tokenizer could classify easy tokens such as keywords, strings, and literals, while a custom tokenizer classifies tokens such as identifiers, which generally need context information (think type names). The reason for wanting this is that classifying identifiers is generally much slower than classifying keywords. Separating the tokenizers allows instant colorization of the easy tokens, while harder tokens like identifiers are classified in the background and colorized when ready. It might make sense to allow only one tokenizer per language, but multiple (potentially async) token classifiers.
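
To make the two-tier idea concrete, here is a minimal TypeScript sketch. Every name in it (`Token`, `classifyFast`, `classifySlow`, `mergeTokens`) is hypothetical and not part of any VS Code API. The slow pass is a plain function here; in a real extension it would presumably run in the background and deliver its results later.

```typescript
// Hypothetical types for illustration; none of these names exist in the VS Code API.
type TokenKind = "keyword" | "string" | "number" | "identifier" | "type";

interface Token {
  start: number; // offset of the first character
  end: number;   // offset one past the last character
  kind: TokenKind;
}

const KEYWORDS = ["let", "new", "return"];

// Fast pass: purely lexical, cheap enough to run on every keystroke.
function classifyFast(text: string): Token[] {
  const tokens: Token[] = [];
  const pattern = /"[^"]*"|\d+|[A-Za-z_]\w*/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const word = match[0];
    let kind: TokenKind = "identifier";
    if (word.charAt(0) === '"') kind = "string";
    else if (/^\d/.test(word)) kind = "number";
    else if (KEYWORDS.indexOf(word) !== -1) kind = "keyword";
    tokens.push({ start: match.index, end: match.index + word.length, kind });
  }
  return tokens;
}

// Slow pass: stands in for semantic analysis (a compiler or language server)
// that knows which identifiers name types.
function classifySlow(text: string, fast: Token[], typeNames: string[]): Token[] {
  return fast
    .filter(t => t.kind === "identifier" && typeNames.indexOf(text.slice(t.start, t.end)) !== -1)
    .map((t): Token => ({ start: t.start, end: t.end, kind: "type" }));
}

// Overlay refined tokens on the fast result: a refined token replaces the
// fast token that covers exactly the same range.
function mergeTokens(fast: Token[], refined: Token[]): Token[] {
  return fast.map(t => {
    for (const r of refined) {
      if (r.start === t.start && r.end === t.end) return r;
    }
    return t;
  });
}

const source = 'let w = new Widget("ok", 42)';
const fast = classifyFast(source);                       // available immediately
const refined = classifySlow(source, fast, ["Widget"]);  // would arrive later
const merged = mergeTokens(fast, refined);               // Widget is now a "type"
```

The editor could paint `fast` right away and repaint with `merged` once the slower classifier reports back, which is the "eventually colorized" behavior described above.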

@mattacosta
Contributor

> Rich context-sensitive syntax colorization is very hard to do (if not impossible) with tmLanguage syntax definitions. The functionality for specifying custom colorizers seems to be there, but not exposed to extensions (ITokenizationSupport).

I too support being able to replace the tokenizer implementation. For example, since I was working on first-class features for a language that I wanted to implement, I had to create an AST and its associated lexer/parser (which was based on flex/bison by the way). It's a shame that I can't reuse the lexer for syntax highlighting as well. Returning the token/position and saving the lexer state doesn't seem that difficult.

> Another way is to let multiple tokenizers work in parallel...

This part doesn't really make sense to me though.

@isidorn isidorn added feature-request Request for new features or functionality api labels Jan 13, 2016
@Rohansi

Rohansi commented Jan 13, 2016

> > Another way is to let multiple tokenizers work in parallel...
>
> This part doesn't really make sense to me though.

I think they wouldn't be considered as tokenizers from VSCode's point of view. Expanding on @dajoh's example, you could have a tmLanguage tokenizer classifying the simple tokens quickly and have Roslyn parse and return detailed token information from another process. In this case VSCode might only need a way to change a token's style at any time.

@jrieken jrieken added this to the Backlog milestone Jan 13, 2016
@jrieken
Member

jrieken commented Jan 13, 2016

@alexandrudima fyi

@alexdima
Member

👍 This is a great request.

@jrieken
Member

jrieken commented Jan 13, 2016

related to #580

@vilicvane

I noticed that the TypeScript-tmLanguage project is not "actively" maintained. Can I assume that TypeScript will be one of the first to benefit from this set of APIs?

@tiansivive

Just wondering if there's any update on this?

@jrieken
Member

jrieken commented Jul 7, 2016

Sorry, nothing to report yet...

@DanTup
Contributor

DanTup commented May 21, 2017

I don't suppose it's likely this will happen any time soon?

@happyzhao

I hope this feature gets a higher priority. It would make VS Code the best code editor for me.

@jacehensley-wf

Has there been any update on this?

@scrthq

scrthq commented Jul 28, 2017

Would love to see this added to allow syntax highlighting for PowerShell at the same level that ISE has! This would allow me to never touch ISE again, as I currently have to use it to sanity-check certain items that aren't highlighted as expected in Code.

@gwk

gwk commented Aug 24, 2017

I am very interested in a tokenization/highlighting API. I am developing a lexer generator for multiple output targets and would like to support VS Code as a first-class output. I put some serious effort into trying to generate regex definitions, but because the generated lexer is really a state machine, it got very messy.

The key requirement for me would be that the API allow the lexer to maintain state across different lines of source code, e.g. a stack of contexts. This is necessary to support grammars that are not strictly regular, e.g. nested string interpolation syntaxes. For example, Swift lets you nest multiple levels of string interpolation, so the lexer needs a stack to switch between code and string lexing modes, and multiline string literals require that this state persist across lines.
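
As a rough illustration of that requirement, the sketch below tokenizes one line at a time and hands the mode stack back to the caller, which passes it in again for the next line. It assumes a toy Swift-like syntax in which `"` opens and closes a string (and may span lines) and `\(` ... `)` is an interpolation. The names `Frame`, `Span`, and `tokenizeLine` are made up for this example and are not an existing API.

```typescript
type Mode = "code" | "string";

// One stack frame per nesting level. `parens` counts unmatched "(" inside a
// code frame, so the ")" that closes an interpolation can be told apart.
interface Frame { mode: Mode; parens: number; }
interface Span { start: number; end: number; mode: Mode; }
interface LineResult { spans: Span[]; endState: Frame[]; }

function initialState(): Frame[] {
  return [{ mode: "code", parens: 0 }];
}

function tokenizeLine(line: string, startState: Frame[]): LineResult {
  // Copy the frames so the caller's saved state is never mutated.
  const stack: Frame[] = startState.map(f => ({ mode: f.mode, parens: f.parens }));
  const spans: Span[] = [];
  // Append a span, extending the previous one when it is adjacent and same-mode.
  const add = (start: number, end: number, mode: Mode): void => {
    const last = spans[spans.length - 1];
    if (last && last.mode === mode && last.end === start) last.end = end;
    else spans.push({ start, end, mode });
  };
  let i = 0;
  while (i < line.length) {
    const top = stack[stack.length - 1];
    const ch = line.charAt(i);
    if (top.mode === "string") {
      if (ch === "\\" && line.charAt(i + 1) === "(") {
        add(i, i + 2, "code");                    // "\(" opens an interpolation
        stack.push({ mode: "code", parens: 0 });
        i += 2;
      } else if (ch === "\\") {
        const end = Math.min(i + 2, line.length); // other escapes stay in the string
        add(i, end, "string");
        i = end;
      } else {
        add(i, i + 1, "string");
        if (ch === '"') stack.pop();              // closing quote
        i += 1;
      }
    } else {
      if (ch === '"') {
        add(i, i + 1, "string");                  // opening quote
        stack.push({ mode: "string", parens: 0 });
      } else {
        add(i, i + 1, "code");
        if (ch === "(") top.parens += 1;
        else if (ch === ")") {
          if (top.parens > 0) top.parens -= 1;
          else if (stack.length > 1) stack.pop(); // closes the interpolation
        }
      }
      i += 1;
    }
  }
  return { spans, endState: stack };
}

// A string containing a nested interpolated string, split across two lines.
const first = tokenizeLine('let s = "a \\(f("b', initialState());
const second = tokenizeLine('c") + 1) d"; x', first.endState);
```

After the first line the stack is four frames deep (code, string, code, string), and the second line unwinds it back to the single top-level code frame. A purely per-line regex tokenizer has no way to know that the second line starts inside a nested string.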

Another aspect that would be great to see would be documentation on Unicode correctness. For example, I assume that the API would be operating on JavaScript UTF-16 strings, so code points outside of the BMP would be represented as surrogate pairs. Are these counted as 1 or 2 columns by the highlighting engine? This is also important for problem matchers. This stuff gets hard (e.g. deciding how wide a character will render is involved), so I wouldn't expect it to be perfect at first, but it's worth keeping these challenges in mind during the design phase.
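
The ambiguity is easy to show in a few lines of TypeScript. This sketch only demonstrates the two ways of counting and says nothing about which one the editor actually uses. `codePointColumn` is a hypothetical helper written for this example.

```typescript
// Count how many code points lie in the first `utf16Offset` code units of
// `text`. A surrogate pair (two UTF-16 code units) counts as one code point.
// If the offset falls in the middle of a pair, the whole pair is counted.
function codePointColumn(text: string, utf16Offset: number): number {
  let column = 0;
  let i = 0;
  while (i < utf16Offset && i < text.length) {
    const unit = text.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    const isPair = isHigh && next >= 0xdc00 && next <= 0xdfff;
    i += isPair ? 2 : 1;
    column += 1;
  }
  return column;
}

// "a", then U+1D4B3 (outside the BMP, stored as a surrogate pair), then "b".
const sample = "a\uD835\uDCB3b";
const utf16Index = sample.indexOf("b");                  // 3 code units precede "b"
const codePoints = codePointColumn(sample, utf16Index);  // 2 code points precede "b"
```

A tool that reports columns in code points and an API that expects UTF-16 offsets would disagree by one on the position of `b` here, which is why the convention is worth documenting. Neither count says how many cells the character occupies when rendered.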

@bobbrow
Member

bobbrow commented Oct 18, 2017

The Microsoft C++ extension is also very interested in this. At the very least, we would like a way to colorize sections of code to mark them as inactive based on #ifdef/#else/#endif/etc sections. It's something that Visual Studio can do, but unfortunately we can't do this with TextMate grammars since the tokens need to be evaluated by the compiler, not regular expressions.
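
For a sense of the output such a feature needs, here is a deliberately simplified TypeScript sketch that computes inactive line ranges. It assumes the set of defined macros is already known and handles only `#ifdef`, `#ifndef`, `#else`, and `#endif`. Real code also needs `#if` expressions, `#elif`, and macros defined along the way, which is why a compiler has to do the evaluation. `LineRange` and `inactiveRanges` are hypothetical names.

```typescript
interface LineRange { startLine: number; endLine: number; } // zero-based, inclusive

function inactiveRanges(lines: string[], defined: string[]): LineRange[] {
  // One entry per open conditional: was the enclosing region active, and has
  // any branch of this conditional been taken yet?
  const stack: { parentActive: boolean; taken: boolean }[] = [];
  const ranges: LineRange[] = [];
  let active = true;
  for (let i = 0; i < lines.length; i++) {
    const m = /^\s*#\s*(ifdef|ifndef|else|endif)\b\s*(\w*)/.exec(lines[i]);
    let dim = !active; // ordinary lines are dimmed when the region is inactive
    if (m && (m[1] === "ifdef" || m[1] === "ifndef")) {
      const isDefined = defined.indexOf(m[2]) !== -1;
      const cond = m[1] === "ifdef" ? isDefined : !isDefined;
      stack.push({ parentActive: active, taken: active && cond });
      active = active && cond;
    } else if (m && stack.length > 0) {
      const top = stack[stack.length - 1];
      dim = !top.parentActive; // directives are dimmed only if nested in dead code
      if (m[1] === "else") {
        active = top.parentActive && !top.taken;
        top.taken = true;
      } else {
        active = top.parentActive;
        stack.pop();
      }
    }
    if (dim) {
      const last = ranges[ranges.length - 1];
      if (last && last.endLine === i - 1) last.endLine = i; // extend the current range
      else ranges.push({ startLine: i, endLine: i });
    }
  }
  return ranges;
}

const program = [
  "#ifdef _WIN32",
  "use_win32();",
  "#else",
  "use_posix();",
  "#ifdef DEBUG",
  "log();",
  "#endif",
  "#endif",
  "done();",
];
const onPosix = inactiveRanges(program, ["DEBUG"]);    // only line 1 is inactive
const onWindows = inactiveRanges(program, ["_WIN32"]); // lines 3 through 6 are inactive
```

The result is a list of line ranges rather than tokens, so an API for this case would need to accept ranges computed outside the grammar and let the extension restyle them.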

@jrieken
Member

jrieken commented Oct 31, 2017

Actually a dupe of #585

@jrieken jrieken closed this as completed Oct 31, 2017
@jrieken jrieken added the *duplicate Issue identified as a duplicate of another issue(s) label Oct 31, 2017
@vscodebot vscodebot bot locked and limited conversation to collaborators Dec 15, 2017