[Esapi-user] HTMLEntityCodec and optional semicolon for named entities

Olivier Jaquemet olivier.jaquemet at jalios.com
Wed Oct 19 09:17:58 EDT 2011


Hello all,

I have a question regarding a questionable behavior of the HTMLEntityCodec 
in ESAPI 2.0.1 (which I think is a bug).

Let's say a user sends the following value for a URL input:
http://www.example.com/someservlet?foo=bar&super=great&baz=qux

I use the HTTPURL validator (in the default ESAPI.properties) to make 
sure the input is appropriate:
String validatedUrl = org.owasp.esapi.ESAPI.validator().getValidInput(
    "userURL", request.getParameter("userURL"), "HTTPURL", 2000, true);

As expected, "canonicalization" occurs and the HTMLEntityCodec is 
applied to decode characters.
Problem: the start of the parameter "&super=great" is treated as an HTML 
entity ("&supe", i.e. ⊇) and the URL is canonicalized this way:
http://www.example.com/someservlet?foo=bar⊇r=great&baz=qux
(Unicode code point U+2287, described as "superset of or equal to", 
appears after "bar" instead of the expected parameter name.)
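
The effect can be reproduced without ESAPI. The following is a minimal, self-contained sketch, not ESAPI's actual code: the class name, method names, and the three-entry entity table are all hypothetical. It assumes a decoder that takes the longest entity name matching after "&" and treats the trailing ";" as optional, which is the behavior the JavaDoc describes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a lenient named-entity decoder (NOT ESAPI code).
final class LenientEntityDemo {
    // Tiny illustrative subset of the HTML named-entity table.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("sup", "\u2283");   // U+2283, superset of
        ENTITIES.put("supe", "\u2287");  // U+2287, superset of or equal to
    }

    // Decodes named entities; the terminating ';' is optional.
    static String decodeLenient(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (input.charAt(i) == '&') {
                String name = longestEntityAt(input, i + 1);
                if (name != null) {
                    out.append(ENTITIES.get(name));
                    i += 1 + name.length();
                    // Swallow the semicolon if present, but do not require it.
                    if (i < input.length() && input.charAt(i) == ';') {
                        i++;
                    }
                    continue;
                }
            }
            out.append(input.charAt(i));
            i++;
        }
        return out.toString();
    }

    // Returns the longest known entity name starting at 'start', or null.
    private static String longestEntityAt(String input, int start) {
        String best = null;
        for (String name : ENTITIES.keySet()) {
            if (input.startsWith(name, start)
                    && (best == null || name.length() > best.length())) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/someservlet?foo=bar&super=great&baz=qux";
        // "&supe" is consumed as an entity, leaving a stray "r" before "=great".
        System.out.println(decodeLenient(url));
    }
}
```

With this table, "&super" matches both "sup" and "supe"; the longer name wins, so the parameter name is lost exactly as in the canonicalized URL above.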

It is clearly stated in the JavaDoc of the HTMLEntityCodec that this 
behavior is applied:
> Formats all are legal both with and without semi-colon, upper/lower case
cf. 
http://owasp-esapi-java.googlecode.com/svn/trunk_doc/latest/org/owasp/esapi/codecs/HTMLEntityCodec.html#decodeCharacter%28org.owasp.esapi.codecs.PushbackString%29

However, the Wikipedia article "List of XML and HTML character entity 
references" (cf. 
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references 
) states:
> The semicolon is required.
Which is not entirely true. Indeed, if we dive into the HTML 
specification for character references ( 
http://www.w3.org/TR/REC-html40/charset.html#entities ), we can read:
> In SGML, it is possible to eliminate the final ";" after a 
> character reference in some cases (e.g., at a line break or immediately 
> before a tag). In other circumstances it may not be eliminated (e.g., in 
> the middle of a word). We strongly suggest using the ";" in all cases to 
> avoid problems with user agents that require this character to be present.

And I think this is the problem with the HTMLEntityCodec:
> In other circumstances it may not be eliminated (e.g., in the 
> middle of a word).

In my use case "&super=great", the entity "&supe" is part of a longer 
word ("super") and must not be decoded.
I think the HTMLEntityCodec should be modified to apply this behavior, 
otherwise it leads to invalid data being retrieved.
But as I am no security expert... What do you think?
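
To make the proposed rule concrete, here is a minimal sketch of a decoder that accepts a named entity without ";" only when the name is not immediately followed by a letter or digit. Everything in it is an assumption made for illustration: the class, the method names, and the small entity table are hypothetical, and this is not a patch against ESAPI. Whether such a rule is acceptable from a security standpoint is exactly the open question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "not in the middle of a word" rule (NOT ESAPI code).
final class WordBoundaryEntityDemo {
    // Tiny illustrative subset of the HTML named-entity table.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("sup", "\u2283");   // U+2283, superset of
        ENTITIES.put("supe", "\u2287");  // U+2287, superset of or equal to
    }

    // Decodes "&name;" always, and "&name" only when the next character
    // is not a letter or digit (or the input ends there).
    static String decodeAtWordBoundary(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (input.charAt(i) == '&') {
                String name = longestEntityAt(input, i + 1);
                if (name != null) {
                    int end = i + 1 + name.length();
                    boolean hasSemicolon = end < input.length() && input.charAt(end) == ';';
                    boolean midWord = end < input.length()
                            && Character.isLetterOrDigit(input.charAt(end));
                    if (hasSemicolon || !midWord) {
                        out.append(ENTITIES.get(name));
                        i = hasSemicolon ? end + 1 : end;
                        continue;
                    }
                    // Otherwise the name is a prefix of a longer word: keep the text as-is.
                }
            }
            out.append(input.charAt(i));
            i++;
        }
        return out.toString();
    }

    // Returns the longest known entity name starting at 'start', or null.
    private static String longestEntityAt(String input, int start) {
        String best = null;
        for (String name : ENTITIES.keySet()) {
            if (input.startsWith(name, start)
                    && (best == null || name.length() > best.length())) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/someservlet?foo=bar&super=great&baz=qux";
        // "supe" is followed by the letter 'r', so the URL is left untouched.
        System.out.println(decodeAtWordBoundary(url));
    }
}
```

Checking only the longest match is enough here: if the longest matching name is followed by a letter or digit, every shorter matching name is followed by one as well, so none of them would be decoded either.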

Regards,
Olivier Jaquemet

