[owasp-antisamy] UTF-16 multilingual plane characters being stripped

Arshan Dabirsiaghi arshan.dabirsiaghi at aspectsecurity.com
Tue Aug 16 07:51:01 EDT 2011

This was done to prevent the DOM parser from blowing up when it receives
invalid UTF-16 sequences.




From: owasp-antisamy-bounces at lists.owasp.org
[mailto:owasp-antisamy-bounces at lists.owasp.org] On Behalf Of Paul Curren
Sent: Tuesday, August 16, 2011 7:09 AM
To: owasp-antisamy at lists.owasp.org
Subject: [owasp-antisamy] UTF-16 multilingual plane characters being
stripped


I've run into a problem in AntisamyDOMScanner#stripNonValidCharacters:


\\u000A\\u000D]]", "");


This method will corrupt any code point in the 'in' string that is encoded
as two chars (a UTF-16 surrogate pair).
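
To make the failure mode concrete, here is a minimal sketch. It is not the
AntiSamy code: the class and both method names are hypothetical, and it
assumes the corruption comes from treating each UTF-16 char independently.
The first method drops every surrogate char, which destroys valid pairs. The
second walks the string by code point and drops only unpaired surrogates,
which is one way to keep invalid UTF-16 away from a parser without touching
supplementary characters.

```java
// Hypothetical demo class; none of these names exist in AntiSamy.
class SurrogateStripDemo {

    // Per-char filter: removes every surrogate char, paired or not,
    // so a supplementary character (two chars) disappears entirely.
    static String stripPerChar(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (!Character.isSurrogate(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Code-point-aware filter: keeps valid surrogate pairs and removes
    // only surrogates that have no partner.
    static String stripUnpairedSurrogates(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            // codePointAt returns the bare surrogate value when no valid
            // pair starts at this index.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is encoded as two chars in a Java String.
        String smile = new String(Character.toChars(0x1F600));
        String input = "a" + smile + "b";
        System.out.println(stripPerChar(input).length());
        System.out.println(stripUnpairedSurrogates(input).equals(input));
    }
}
```

With the input above, the per-char filter returns "ab", while the
code-point-aware filter returns the input unchanged.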


The most obvious fix that comes to mind is simply not to perform this
operation. What was the reason for introducing this method in the first
place, and what is my risk in not applying it?


I see no direct equivalent in the SAX scanner, but I can't easily test it
because of a number of customisations we rely on in our DOM scanner.




Paul C

