[owasp-antisamy] UTF-16 multilingual plane characters being stripped

Arshan Dabirsiaghi arshan.dabirsiaghi at aspectsecurity.com
Tue Aug 16 13:19:45 EDT 2011


It's not Java that has the issue, it's the DOM parser. There's some
disconnect there, and I happened upon that fix on some blog. It made
everything work and fixed the test case, so it's survived. It's a cheap
solution. Your targeted Unicode regex is probably a better solution in
the long term, but I'm not aware of the short-term impact in i18ned
environments.

 

OTOH, I doubt I'll ever be able to guess the impact without data from
the community, so we may just make the change and wait for the
community to scream at us.

 

Arshan

 

From: Paul Curren [mailto:pcurren at atlassian.com] 
Sent: Tuesday, August 16, 2011 9:28 AM
To: Arshan Dabirsiaghi
Cc: owasp-antisamy at lists.owasp.org
Subject: Re: [owasp-antisamy] UTF-16 multilingual plane characters being
stripped

 

 

On 16/08/2011, at 12:51 PM, Arshan Dabirsiaghi wrote:

This was done to prevent the DOM parser from blowing up when receiving
invalid UTF16 sequences.

 

Thanks.

I'm not sure which UTF-16 sequences would be invalid. If Java manages to
represent them in a String (which is UTF-16), does that not mean they
are valid?
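
For what it's worth, a Java String is just a sequence of 16-bit chars,
so it can hold a surrogate with no partner; being representable in a
String does not by itself make the data well-formed UTF-16. A minimal
sketch of such a check (Utf16Check and isWellFormed are made-up names
for illustration, not part of AntiSamy):

```java
// Illustration only: Utf16Check and isWellFormed are hypothetical names, not AntiSamy API.
class Utf16Check {
    /** Returns true if every surrogate in s belongs to a high+low pair. */
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed directly by a low surrogate.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate with no preceding high surrogate is unpaired.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String paired = "a\uD834\uDD1Eb"; // U+1D11E stored as a surrogate pair
        String lone = "a\uD834b";         // high surrogate with no partner
        System.out.println(isWellFormed(paired)); // true
        System.out.println(isWellFormed(lone));   // false
    }
}
```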

 

I'm in two minds whether to just skip the step altogether or whether to
replace it with something more targeted such as:

in.replaceAll("[\\p{Cs}\\p{Cn}\\p{Co}]", "");

 

(not that I'm hitting any failures in those categories yet mind you).
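
One way to get a similar effect without depending on how the regex
engine treats surrogate pairs is to walk the string by code point and
drop anything in those three general categories. A rough sketch,
assuming that dropping Cs, Cn and Co is the intended behaviour
(CategoryStrip and strip are made-up names, not AntiSamy API):

```java
// Illustration only: CategoryStrip and strip are hypothetical names, not AntiSamy API.
class CategoryStrip {
    /** Drops code points whose general category is Cs, Cn or Co. */
    static String strip(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // A valid surrogate pair is read here as one supplementary code point.
            int cp = in.codePointAt(i);
            int type = Character.getType(cp);
            boolean drop = type == Character.SURROGATE    // Cs: only unpaired surrogates reach this
                        || type == Character.UNASSIGNED   // Cn
                        || type == Character.PRIVATE_USE; // Co
            if (!drop) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String paired = "a\uD834\uDD1Eb"; // U+1D11E stored as a surrogate pair
        System.out.println(strip(paired).equals(paired)); // true: the pair survives intact
        System.out.println(strip("a\uD834b"));            // ab: the unpaired surrogate is dropped
    }
}
```

Which code points count as unassigned depends on the Unicode version
the JDK ships with, so newly added characters could still be dropped on
an older runtime.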

 

Paul C

Arshan

 

From: owasp-antisamy-bounces at lists.owasp.org
[mailto:owasp-antisamy-bounces at lists.owasp.org] On Behalf Of Paul Curren
Sent: Tuesday, August 16, 2011 7:09 AM
To: owasp-antisamy at lists.owasp.org
Subject: [owasp-antisamy] UTF-16 multilingual plane characters being
stripped

 

I've run into a problem in AntisamyDOMScanner#stripNonValidCharacters:

 

return in.replaceAll("[\\u0000-\\u001F\\uD800-\\uDFFF\\uFFFE-\\uFFFF&&[^\\u0009\\u000A\\u000D]]", "");

 

This method will corrupt any code point in the 'in' string that is
encoded as two chars (a surrogate pair).
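
To make the mechanism concrete: a code point above U+FFFF is stored as
two chars, and both of them fall inside the D800-DFFF range, so any
filter that tests chars one at a time against that range hits both
halves. The sketch below is a hand-written char loop over the same
ranges as the regex above, not the AntiSamy code itself (CharWiseStrip
and strip are made-up names), and it only illustrates the idea:

```java
// Illustration only: a char-by-char filter over the same ranges as the regex.
// CharWiseStrip and strip are hypothetical names, not part of AntiSamy.
class CharWiseStrip {
    static String strip(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            // Control characters other than tab, line feed and carriage return.
            boolean control = c <= 0x1F && c != 0x09 && c != 0x0A && c != 0x0D;
            // Either half of a surrogate pair, or an unpaired surrogate.
            boolean surrogate = c >= 0xD800 && c <= 0xDFFF;
            boolean nonChar = c == 0xFFFE || c == 0xFFFF;
            if (!(control || surrogate || nonChar)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "a\uD834\uDD1Eb"; // U+1D11E, stored as two chars
        System.out.println(in.length());                       // 4
        System.out.println(in.codePointCount(0, in.length())); // 3
        System.out.println(strip(in));                         // ab
    }
}
```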

 

So the most obvious fix that comes to mind is to just not perform this
operation. What was the reason for introducing this method in the first
place, and what is my risk if I don't apply it?

 

I see no direct equivalent in the SAX scanner but I can't easily test it
due to a number of customisations we rely on in our DOM Scanner
implementation.

 

Cheers,

 

Paul C

 

 
