clean_content_tags doesn't seem to work on tags other than <script> or <style> #24

Open
mkf62 opened this issue Sep 6, 2023 · 1 comment

mkf62 commented Sep 6, 2023

Either I am misunderstanding what clean_content_tags does or it is not working correctly. I cannot get the clean_content_tags argument to work on any tags other than <script> and <style>. I am using nh3 version 0.2.14 with Python 3.11.0.

import nh3

item = "<script>alert('hello')</script><p>hello</p>"
print(nh3.clean(html=item, tags=None, clean_content_tags={'p'}))

I receive this error:

thread '<unnamed>' panicked at 'assertion failed: !self.tags.contains(tag_name)', /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ammonia-3.3.0/src/lib.rs:1792:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/home/Desktop/import nh3.py", line 4, in <module>
    print(nh3.clean(html=item, tags=None, clean_content_tags={'p'}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: assertion failed: !self.tags.contains(tag_name)

I have been able to reproduce this error with b, br, div, and img tags. I haven't tried any others. script and style tags work as expected.

mkf62 commented Sep 6, 2023

So I figured out that the problem is that you MUST pass at least one tag in tags for clean_content_tags to work if you're trying to clear anything other than <style> or <script> tags, meaning this works:

item = "<div><b>hi</b><script><style></div>"
print("Output: ", nh3.clean(html=item, tags={'div'}, clean_content_tags={'b', 'script', 'style'}))
#Output: <div></div>

And this works:

item = "<div><b>hi</b><script><style></div>"
print("Output: ", nh3.clean(html=item, clean_content_tags={'script', 'style'}))
#Output: <div><b>hi</b></div>

But this doesn't work because the <div> tag isn't specified in the tags attribute and I'm trying to clear tags that aren't EXCLUSIVELY script or style:

item = "<div><b>hi</b><script><style></div>"
print("Output: ", nh3.clean(html=item, clean_content_tags={'b', 'script'}))
#traceback appears with assertion error previously stated in OP

This is very confusing and I doubt it's intended to work this way. Why would clean_content_tags only work on its own with script and style but not with other tags? Why does at least one tag need to be whitelisted before clean_content_tags will accept tags other than style and script?
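
One possible reading of the panic message (`assertion failed: !self.tags.contains(tag_name)`) is that a tag may not appear in both tags and clean_content_tags at once. If that is right, and if the default allowlist includes b, p, div, br, and img but not script or style, it would explain all three cases above. The sketch below only models that rule in plain Python. The helper name `split_tag_sets` and the four-tag allowlist are hypothetical, chosen for illustration, and are not part of nh3.

```python
def split_tag_sets(allowed_tags, clean_content_tags):
    """Return (tags, clean_content_tags) so that no tag is in both sets.

    Assumption: the sanitizer rejects any tag that is both allowed and
    listed for content removal, so such tags are dropped from the allowlist.
    """
    clean_content = set(clean_content_tags)
    tags = set(allowed_tags) - clean_content
    return tags, clean_content


# Hypothetical stand-in for a default allowlist, for illustration only.
default_like = {"a", "b", "div", "p"}

tags, clean_content = split_tag_sets(default_like, {"b", "script"})
print(sorted(tags))           # ['a', 'div', 'p']
print(sorted(clean_content))  # ['b', 'script']
```

Under the same assumption, the two sets could then be passed on as `nh3.clean(html, tags=tags, clean_content_tags=clean_content)`. Starting from `nh3.ALLOWED_TAGS` rather than the stand-in set should keep the library's defaults minus the tags being cleared, though that has not been verified in this thread.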

Also, if you're sanitizing user input and don't want to allow any HTML tags at all, I'm still not sure how to remove them, because I don't know ahead of time what they will be. It would be helpful to have an option that strips any and all HTML tags.
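
As a rough illustration of "strip every tag without knowing the tags in advance", here is a minimal sketch that uses only the Python standard library (`html.parser`), not nh3. The class `TextOnly` and the function `strip_all_tags` are hypothetical names. It keeps text content, drops the contents of script and style elements, and re-escapes the result. It is not a vetted sanitizer. With nh3 itself, passing an empty allowlist (`nh3.clean(html, tags=set())`) is probably the more direct route, but that is an assumption not confirmed in this thread.

```python
from html import escape
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    SKIP_CONTENT = {"script", "style"}  # drop these elements' contents too

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_CONTENT:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_CONTENT and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def strip_all_tags(html):
    """Return the text of `html` with all tags removed and re-escaped."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Escape again so decoded entities cannot reintroduce markup.
    return escape("".join(parser.parts), quote=False)


print(strip_all_tags("<div><b>hi</b></div>"))  # hi
```

Because no allowlist is involved, the clash between tags and clean_content_tags cannot arise here. The trade-off is that this sketch has none of the hardening of a dedicated sanitizer.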
