[Owasp-antisamy] antisamy problems with &nbsp;

Joel Worrall jworrall at mzinga.com
Thu Apr 3 18:23:14 EDT 2008


After several additional days of work on this issue, including multiple attempted tweaks to the AntiSamy code itself to resolve the issue, my problem remains the same, despite different output.

I resolved the encoding issues with &nbsp; that were causing it to render as á. However, I cannot get AntiSamy to output &nbsp; as its original value. Instead, the DOMFragmentParser appears to take the liberty of changing the &nbsp; (or its numeric character reference, for that matter; same result) into a blank space. I have validated that the text input string remains intact (as I would prefer) until it is parsed and output by the FragmentParser and subsequent OutputFormat.

Note the following input:

<p>Does this work?&nbsp;&nbsp;</p>

And AntiSamy's output (CleanResults.getCleanHTML()):

<p>Does this work?  </p>

This doesn't work for me. I need the &nbsp; to find its way through AntiSamy's scan and cleanup so that I can preserve the "whitespace" in my HTML markup.

Try it yourself by running AntiSamy on my input string above. 

Any additional help you can provide is most appreciated.
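
One possible stopgap, sketched here under assumptions and not taken from AntiSamy itself, is to post-process the cleaned string and turn each non-breaking space character back into the named entity. This only helps if the cleaned output actually contains the U+00A0 character; if the parser emits ordinary spaces (U+0020), the information is already lost and this cannot recover it. The class name NbspRestorer and the method restoreNbsp are hypothetical.

```java
public class NbspRestorer {

    // Hypothetical helper, not part of the AntiSamy API: replaces every
    // U+00A0 (non-breaking space) character with the entity "&nbsp;".
    // Assumption: the cleaned HTML still carries U+00A0, not plain spaces.
    static String restoreNbsp(String cleanedHtml) {
        return cleanedHtml.replace("\u00A0", "&nbsp;");
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by CleanResults.getCleanHTML().
        String cleaned = "<p>Does this work?\u00A0\u00A0</p>";
        System.out.println(restoreNbsp(cleaned));
    }
}
```

Ordinary spaces are left untouched, so running this on markup that never contained a non-breaking space changes nothing.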


From: Arshan Dabirsiaghi [mailto:arshan.dabirsiaghi at aspectsecurity.com] 
Sent: Monday, March 31, 2008 9:29 PM
To: Joel Worrall; owasp-antisamy at lists.owasp.org
Subject: RE: [Owasp-antisamy] antisamy problems with &nbsp;

Sure. Thanks for your input, and I dearly want to address it. Let me use this opportunity to first summarize, for myself mostly, the character set issues we've run into in 1.0:
1. The policy files don't consider non-Latin Unicode characters (should be fixed in 1.1.1 or 1.2)
2. Special characters like MS Word's extra long dash were turned into irrelevant characters (fixed in 1.1)
3. ASCII entity replacement weirdness, like &nbsp; becoming á
So, the onus was on me to fix #3 for 1.1. I probably should have mentioned on this list that, from my research, this may be a translation issue.
Consider the following link:
It looked like AntiSamy is correctly generating the data in UTF-8 but that whatever mechanism people are using to view the output is iso-8859-1 (or another set that's not UTF-8). That may be wrong - it also may be that the CyberNeko HTML library that I use to parse the data into a DOM defaults its encoding with a value from the JVM (since AntiSamy doesn't specify) to that of a non-UTF-8 character set.
Given that, and these facts:
a) I'm no encoding expert and may not realize the implications of global character translations
b) The entity replacement wasn't buying any security (that we know of)
c) I want to be as character set neutral as possible
... I ripped out the entity replacement. It was probably my mistake to be character set neutral, and what may end up happening is the addition of a set of parameters to AntiSamy that specify the input and output character sets.
So I guess in summary - the entity replacement was ripped out because I didn't feel like it was adding much, and the translation of &nbsp; to Unicode &#160; in any non-UTF-8 readers was causing a problem. In the future, however, I may put it back in, but that would enforce UTF-8 on our API users, and I'm not sure that's a good route to take either.
Am I making any sense? Let me know what you think.
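
The mismatch described above (output produced as UTF-8 but read as ISO-8859-1) can be reproduced with the standard library alone. The sketch below is an illustration only and does not touch AntiSamy or CyberNeko; the class name NbspMojibake and the method misread are hypothetical. It encodes a single non-breaking space as UTF-8 and then decodes those bytes with the wrong character set.

```java
import java.nio.charset.StandardCharsets;

public class NbspMojibake {

    // Encodes the text as UTF-8, then deliberately decodes the bytes as
    // ISO-8859-1 to imitate a viewer that assumes the wrong character set.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // the character that &nbsp; stands for
        String seen = misread(nbsp);
        // U+00A0 is two bytes in UTF-8 (0xC2 0xA0), so the misreading
        // yields two characters: U+00C2 followed by U+00A0.
        for (char c : seen.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Which stray character a reader actually sees depends on the character set the viewer assumes, so the accented letter reported on the list may come from a different pairing than the one shown here.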

From: owasp-antisamy-bounces at lists.owasp.org on behalf of Joel Worrall
Sent: Mon 3/31/2008 6:04 PM
To: owasp-antisamy at lists.owasp.org
Subject: [Owasp-antisamy] antisamy problems with &nbsp;

I found a previous thread where people were mentioning trouble with HTML "nbsp" being turned into non-standard characters like "á". Also read your reply where you committed to address that issue in 1.1.

I see in the code that the AntiSamyDOMScanner has the method "replaceEntityCodes" commented out. I read from a reply you posted on February 27 that you planned to replace the HTML 1.0 process that is performing that work, but that code (the code I assume you mean to resolve the issue) is commented out in 1.1.

Can you comment or provide further assistance? Is there a reason the HTML entities are not processed by antisamy's own process?

Thanks for the help and for writing this,
