rtc read textContent does not insert whitespace between elements #31

Open
ComLock opened this issue Oct 26, 2018 · 0 comments

ComLock commented Oct 26, 2018

As a result, multiple words get bunched together into long "invalid" words.
This becomes a problem when you index the scraped text and want to use ngram search on it.
https://xp.readthedocs.io/en/stable/developer/search/query-functions/ngram.html

I don't know how closely the cheerio evaluator's textContent matches the browser variant,
but it might be behaving correctly.
https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent

One might get slightly better results using innerText, but that is not supported by surgeon (yet).

Notice that textContent ignores <br/> while innerText does not:
http://perfectionkills.com/the-poor-misunderstood-innerText/

A workaround I could try is to modify all block elements by appending a single space to each of them and then use textContent, but I'm uncertain whether it's actually smart, or even possible, to rely on an element's style.display property.
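
A minimal sketch of that workaround, written in Python with only the standard-library HTML parser so it runs without cheerio or surgeon. Everything named here is hypothetical and not part of either library: `BLOCK_TAGS`, `TextExtractor`, and `text_content` are illustration only. It also assumes a fixed list of tag names is an acceptable stand-in for "block elements", which sidesteps style.display entirely and therefore misclassifies any element whose display is changed by CSS.

```python
from html.parser import HTMLParser

# Assumption: a hard-coded tag list approximates "block elements".
# It does not consult CSS, so style.display overrides are ignored.
BLOCK_TAGS = {
    "address", "article", "aside", "blockquote", "dd", "div", "dl", "dt",
    "footer", "h1", "h2", "h3", "h4", "h5", "h6", "header", "li", "main",
    "nav", "ol", "p", "pre", "section", "table", "td", "th", "tr", "ul",
}


class TextExtractor(HTMLParser):
    """Collects text nodes; optionally pads block ends and <br> with a space."""

    def __init__(self, pad_blocks):
        super().__init__()
        self.pad_blocks = pad_blocks
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <br/> via handle_startendtag.
        if self.pad_blocks and tag == "br":
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if self.pad_blocks and tag in BLOCK_TAGS:
            self.parts.append(" ")


def text_content(html, pad_blocks=False):
    """Concatenate text nodes; with pad_blocks, separate block-level text."""
    parser = TextExtractor(pad_blocks)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    if pad_blocks:
        # Collapse the runs of whitespace that the padding introduces.
        text = " ".join(text.split())
    return text


html = "<div><p>Hello</p><p>World</p>foo<br/>bar</div>"
print(text_content(html))                   # HelloWorldfoobar
print(text_content(html, pad_blocks=True))  # Hello World foo bar
```

The unpadded call reproduces the bunched-together words described above, while the padded call keeps them apart for indexing. Inline elements such as `<b>` or `<span>` are deliberately left unpadded so that a word split across inline markup is not broken in two.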
