[owasp-antisamy] UTF-16 multilingual plane characters being stripped

Paul Curren pcurren at atlassian.com
Tue Aug 16 09:28:13 EDT 2011


On 16/08/2011, at 12:51 PM, Arshan Dabirsiaghi wrote:

> This was done to prevent the DOM parser from blowing up when receiving invalid UTF-16 sequences.

Thanks.
I'm not sure which UTF-16 sequences would be invalid. If Java manages to represent them in a String (which is UTF-16), does that not mean they are valid?
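
On that question: a Java String is a sequence of 16-bit char code units, and nothing stops it from holding a surrogate that has no partner, so "representable in a String" does not by itself imply well-formed UTF-16. A minimal sketch of a check follows; the class and method names are made up for illustration and are not part of AntiSamy.

```java
class LoneSurrogateDemo {
    /** Returns true if s contains a surrogate code unit that is not part of a valid high/low pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // codePointAt returns the surrogate value itself when it is not part of a valid pair.
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                return true;
            }
            // Advance by 2 chars for a supplementary code point, 1 otherwise.
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        String paired = "\uD83D\uDE00"; // U+1F600 encoded as a valid surrogate pair
        String lone = "a\uD83Db";       // high surrogate with no low surrogate after it
        System.out.println(hasUnpairedSurrogate(paired)); // false
        System.out.println(hasUnpairedSurrogate(lone));   // true
    }
}
```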

I'm in two minds whether to skip the step altogether or to replace it with something more targeted, such as:
in.replaceAll("[\\p{Cs}\\p{Cn}\\p{Co}]", "");

(not that I'm hitting any failures in those categories yet, mind you).
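
Another option, sketched here as an assumption rather than as anything AntiSamy ships, is to avoid the regex and walk the string by code point, keeping only what the XML 1.0 Char production allows (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). Valid surrogate pairs are read as a single supplementary code point and kept; unpaired surrogates are dropped. The class and method names are hypothetical.

```java
class XmlCharFilter {
    /** True for code points permitted by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes code points that XML 1.0 does not allow, leaving surrogate pairs intact. */
    static String stripInvalid(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // A valid pair yields one supplementary code point; a lone surrogate yields itself.
            int cp = in.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE00"; // U+1F600, two chars, one code point
        System.out.println(stripInvalid(smiley).equals(smiley));  // true
        System.out.println(stripInvalid("a\u0000b"));             // ab
    }
}
```

Note that this differs from the character-class idea above: private-use and unassigned code points are kept, since XML 1.0 permits them, so it only helps if the goal is purely to keep the parser from choking.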

Paul C



>  
> Arshan
>  
> From: owasp-antisamy-bounces at lists.owasp.org [mailto:owasp-antisamy-bounces at lists.owasp.org] On Behalf Of Paul Curren
> Sent: Tuesday, August 16, 2011 7:09 AM
> To: owasp-antisamy at lists.owasp.org
> Subject: [owasp-antisamy] UTF-16 multilingual plane characters being stripped
>  
> I've run into a problem in AntisamyDOMScanner#stripNonValidCharacters:
>  
> return in.replaceAll("[\\u0000-\\u001F\\uD800-\\uDFFF\\uFFFE-\\uFFFF&&[^\\u0009\\u000A\\u000D]]", "");
>  
> This method will corrupt any code point in the 'in' string that is represented by two chars (a surrogate pair).
>  
> So the most obvious fix that comes to my mind is to just not perform this operation. I'm wondering what are the reasons for the introduction of this method in the first place? What is my risk by not applying it?
>  
> I see no direct equivalent in the SAX scanner but I can't easily test it due to a number of customisations we rely on in our DOM Scanner implementation.
>  
> Cheers,
>  
> Paul C
>  
