[Esapi-user] HTML and XML encoder do not support unicode codepoints using surrogate pair

Olivier Jaquemet olivier.jaquemet at jalios.com
Thu Apr 23 15:18:23 UTC 2015


Hello all,

The default codec/encoder classes do not properly handle Unicode code 
points above U+FFFF, i.e. those that need more than 16 bits and that 
Java represents as a surrogate pair.
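
To make the distinction concrete, here is a minimal sketch using only 
standard java.lang calls. The class and method names are made up for 
illustration and have nothing to do with ESAPI. It shows that U+2000A is 
one code point but two Java chars:

```java
public class SurrogatePairDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hex value of the code point starting at index 0.
    static String firstCodePointHex(String s) {
        return Integer.toHexString(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String in = "\uD840\uDC0A"; // the single code point U+2000A
        System.out.println(codeUnits(in));                            // 2
        System.out.println(codePoints(in));                           // 1
        System.out.println(firstCodePointHex(in));                    // 2000a
        System.out.println(Character.isHighSurrogate(in.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(in.charAt(1)));   // true
    }
}
```

Any code that walks the string one char at a time therefore sees two 
separate values, neither of which is a character on its own.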

For example, using the default HTML encoder, the following code:
    String in = "\uD840\uDC0A"; // https://codepoints.net/U+2000A
    System.out.println("HTML: " + ESAPI.encoder().encodeForHTMLAttribute(in));
    System.out.println("XML : " + ESAPI.encoder().encodeForXML(in));

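... outputs one entity per UTF-16 surrogate (0xD840, then 0xDC0A). A 
surrogate is not a valid character reference on its own, so each one 
renders as a replacement character: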
HTML: ��
XML : ��

... whereas a single entity for the code point U+2000A (for example 
&#x2000A;) would be expected, since it correctly represents the 
character in both HTML and XML:
HTML: 𠀊
XML : 𠀊

As far as I can see, the problem is located in the Codec implementations:
1. In method Codec.encode(char[], String), characters are not iterated 
properly (surrogate pairs should be detected, or code points used).
2. Consequently, method Codec.encodeCharacter(char[], Character) cannot 
handle code points wider than 16 bits.
3. In the end, in every codec implementation, the method 
encodeCharacter(char[] immune, Character c) *cannot* properly process 
such code points. This can be observed in both HTMLEntityCodec and 
XMLEntityCodec.
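
To illustrate what iterating by code point could look like, here is a 
minimal sketch. It is not ESAPI code and not a proposed patch: the class 
CodePointEntityEncoder and its method are hypothetical, and the rule 
"ASCII letters and digits pass through, everything else becomes a hex 
entity" is an assumption chosen to keep the example short.

```java
public class CodePointEntityEncoder {
    // Hypothetical helper (not part of ESAPI): walks the input by code
    // point, so a surrogate pair yields one entity instead of two.
    static String encode(String input) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);   // combines a valid surrogate pair
            if (isAsciiAlphanumeric(cp)) {
                sb.append((char) cp);
            } else {
                sb.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            i += Character.charCount(cp);    // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    // Assumed "safe" set for this sketch: ASCII letters and digits only.
    static boolean isAsciiAlphanumeric(int cp) {
        return (cp >= 'a' && cp <= 'z')
            || (cp >= 'A' && cp <= 'Z')
            || (cp >= '0' && cp <= '9');
    }

    public static void main(String[] args) {
        System.out.println(encode("a\uD840\uDC0A<")); // a&#x2000a;&#x3c;
    }
}
```

Note that an unpaired surrogate is still emitted as its own entity by 
this sketch, because codePointAt returns the lone char in that case. A 
real implementation would have to decide how to treat such input.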

That being said:
Is ESAPI still maintained?
If so, are you interested in adding such support?

Olivier Jaquemet

PS: I'm reporting this because I could not find anything in previous 
discussions or in the javadoc indicating that it is a known 
limitation.

