[Owasp-antisamy] antisamy problems with &nbsp;

Arshan Dabirsiaghi arshan.dabirsiaghi at aspectsecurity.com
Thu Apr 3 19:02:41 EDT 2008


I actually noticed this Wednesday. In fact, this must have been an issue since my test page on i8jesus.com displays the same behavior. I will check the CyberNeko lists and investigate. Hopefully their license is good enough that we'll be able to remove that behavior and distribute in 1.1.1.

Thanks for your research Joel. We can figure this out!


-----Original Message-----
From: Joel Worrall [mailto:jworrall at mzinga.com]
Sent: Thu 4/3/2008 6:23 PM
To: Arshan Dabirsiaghi; owasp-antisamy at lists.owasp.org
Subject: RE: [Owasp-antisamy] antisamy problems with &nbsp;
 
Arshan, 

After several additional days of work on this issue, including multiple attempted tweaks to the AntiSamy code itself to resolve the issue, my problem remains the same, despite different output.

I resolved the encoding issues with &nbsp; that were causing a rendered decoding of á; however, I cannot get AntiSamy to output &nbsp; as its original value. Instead, the DOMFragmentParser appears to take the liberty of changing the &nbsp; (or &#160; for that matter, with the same result) into a blank space. I have validated that the text input string remains intact (as I would prefer it) until it is parsed and output by the FragmentParser and subsequent OutputFormat.

Note the following input:

<p>Does this work?&nbsp;&nbsp;</p>

And AntiSamy's output (CleanResults.getCleanHTML()):

<p>Does this work?  </p>

This doesn't work for me. I need the &nbsp; to find its way through AntiSamy's scan and cleanup so that I can preserve the "whitespace" in my HTML markup.
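
[Editor's sketch] One possible workaround, assuming the parser has decoded each &nbsp; into the literal U+00A0 character rather than an ordinary space, is to re-escape that character after cleanup. The class and method names below are hypothetical and are not part of AntiSamy; the string literal stands in for the value returned by CleanResults.getCleanHTML(). If the parser emits plain U+0020 spaces instead, this approach cannot recover the original entities.

```java
public class NbspReescape {

    // Replace every literal no-break space (U+00A0) with the named entity.
    // Hypothetical helper: not an AntiSamy API.
    static String reescapeNbsp(String cleanHtml) {
        return cleanHtml.replace("\u00A0", "&nbsp;");
    }

    public static void main(String[] args) {
        // Stand-in for CleanResults.getCleanHTML(); assumes the entities
        // were decoded to U+00A0, which the thread does not confirm.
        String cleaned = "<p>Does this work?\u00A0\u00A0</p>";
        System.out.println(reescapeNbsp(cleaned));
    }
}
```

Ordinary spaces are left untouched, so only characters that were no-break spaces after parsing are turned back into entities.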

Try it yourself by running AntiSamy on my input string above. 

Any additional help you can provide is most appreciated.

Sincerely,
joel

________________________________________
From: Arshan Dabirsiaghi [mailto:arshan.dabirsiaghi at aspectsecurity.com] 
Sent: Monday, March 31, 2008 9:29 PM
To: Joel Worrall; owasp-antisamy at lists.owasp.org
Subject: RE: [Owasp-antisamy] antisamy problems with &nbsp;

Joel,
 
Sure. Thanks for your input, and I dearly want to address it. Let me use this opportunity to first summarize, for myself mostly, the character set issues we've run into in 1.0:
 
1. The policy files don't consider non-latin Unicode characters (should be fixed in 1.1.1 or 1.2)
2. Special characters like MS Word's extra long dash were turned into unrelated characters (fixed in 1.1)
3. ASCII entity replacement weirdness, like &nbsp; becoming á
 
So, the onus was on me to fix #3 for 1.1. I probably should have mentioned on this list that this may be a translation issue from my research.
 
Consider the following link:
http://weblogs.java.net/blog/kohsuke/archive/2008/01/a_and_nbsp_myst.html
 
It looked like AntiSamy is correctly generating the data in UTF-8 but that whatever mechanism people are using to view the output is reading it as iso-8859-1 (or another set that's not UTF-8). That may be wrong - it may also be that the CyberNeko HTML library that I use to parse the data into a DOM takes its default encoding from the JVM (since AntiSamy doesn't specify one), and that default is a non-UTF-8 character set.
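
[Editor's sketch] The mismatch described above can be reproduced with the JDK alone. The example encodes a no-break space (U+00A0) as UTF-8 and then decodes the same bytes as ISO-8859-1, the way a viewer assuming the wrong character set would. The class and method names are illustrative only; AntiSamy and CyberNeko are not involved. The exact stray character a user sees depends on which pair of character sets is in play, so this shows the general mechanism rather than the specific symptom reported on the list.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NbspMojibake {

    // Encode text as UTF-8, then decode the same bytes as ISO-8859-1,
    // mimicking a reader that assumes the wrong character set.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Hex dump of the UTF-8 encoding of the given text.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // the character that &nbsp; stands for

        // The default a library falls back to when no encoding is specified.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // One character becomes two bytes in UTF-8.
        System.out.println("UTF-8 bytes: " + utf8Hex(nbsp));

        // Read back as ISO-8859-1, those two bytes become two characters.
        for (char c : misdecode(nbsp).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Specifying the character set explicitly on both the writing and the reading side, instead of relying on Charset.defaultCharset(), avoids this class of surprise.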
 
Given that, and these facts:
a) I'm no encoding expert and may not realize the implications of global character translations
b) The entity replacement wasn't buying any security (that we know of)
c) I want to be as character set neutral as possible
 
... I ripped out the entity replacement. It was probably my mistake to be character set neutral, and what may end up happening is the addition of a set of parameters to AntiSamy that specify the input and output character sets.
 
So I guess in summary - the entity replacement was ripped out because I didn't feel like it was adding much, and the translation of &nbsp; to Unicode &#160; was causing a problem in any non-UTF-8 readers. In the future, however, I may put it back in, but that enforces UTF-8 on our API users, and I'm not sure if that's a good route to take either.
 
Am I making any sense? Let me know what you think.
 
Arshan
 
 

________________________________________
From: owasp-antisamy-bounces at lists.owasp.org on behalf of Joel Worrall
Sent: Mon 3/31/2008 6:04 PM
To: owasp-antisamy at lists.owasp.org
Subject: [Owasp-antisamy] antisamy problems with &nbsp;
Arshan,

I found a previous thread where people were mentioning trouble with HTML "nbsp" being turned into non-standard characters like "á". Also read your reply where you committed to address that issue in 1.1.

I see in the code that the AntiSamyDOMScanner has the method "replaceEntityCodes" commented out. I read from a reply you posted on February 27 that you planned to replace the HTML 1.0 process that is performing that work, but that code (the code I assume you mean to resolve the issue) is commented out in 1.1.

Can you comment or provide further assistance? Is there a reason the HTML entities are not processed by antisamy's own process?

Thanks for the help and for writing this,
joel
