[Esapi-user] HTMLEntityCodec and optional semicolon for named entities
Olivier Jaquemet
olivier.jaquemet at jalios.com
Wed Oct 19 09:17:58 EDT 2011
Hello all,
I have a question regarding a debatable behavior of the HTMLEntityCodec
in ESAPI 2.0.1 (which I think is a bug).
Let's say a user sends the following value for a URL input:
http://www.example.com/someservlet?foo=bar&super=great&baz=qux
I use the HTTPURL validator (in the default ESAPI.properties) to make
sure the input is appropriate :
String validatedUrl =
org.owasp.esapi.ESAPI.validator().getValidInput("userURL",
request.getParameter("userURL"), "HTTPURL", 2000, true);
As expected, a "canonicalization" occurs and the HTMLEntityCodec is
applied to decode characters.
Problem: the start of the parameter "&super=great" is treated as an HTML
entity (⊇), and the URL is canonicalized this way:
http://www.example.com/someservlet?foo=bar⊇r=great&baz=qux
(Unicode code point U+2287, described as "superset of or equal to",
appears after "bar" instead of the expected parameter name.)
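
To make the effect concrete, here is a minimal, self-contained sketch of a
lenient decoder. It is only a simplified model of the behavior described
above, not ESAPI's actual code: the class name, the method name and the
three-entry entity table are assumptions made for illustration. It accepts a
named entity with or without the semicolon and matches the longest known
name that starts right after the "&".

```java
import java.util.Map;

class LenientEntityDemo {
    // Tiny hypothetical entity table; a real codec knows many more names.
    static final Map<String, Character> ENTITIES = Map.of(
            "supe", '\u2287',   // "superset of or equal to"
            "amp", '&',
            "lt", '<');

    // Decodes named entities, semicolon optional, longest known name wins.
    static String decodeLenient(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '&') {
                // Find the end of the alphanumeric run following '&'.
                int end = i + 1;
                while (end < in.length() && Character.isLetterOrDigit(in.charAt(end))) {
                    end++;
                }
                // Try the whole run first, then ever shorter prefixes of it.
                boolean matched = false;
                for (int stop = end; stop > i + 1; stop--) {
                    Character decoded = ENTITIES.get(in.substring(i + 1, stop));
                    if (decoded != null) {
                        out.append(decoded);
                        i = stop;
                        // Swallow the semicolon when it is present.
                        if (i < in.length() && in.charAt(i) == ';') {
                            i++;
                        }
                        matched = true;
                        break;
                    }
                }
                if (matched) {
                    continue;
                }
            }
            // Not an entity: copy the character unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/someservlet?foo=bar&super=great&baz=qux";
        // "&supe" is replaced by U+2287, leaving "r=great" behind.
        System.out.println(decodeLenient(url));
    }
}
```

With this model, "&baz" is left alone because no known entity name is a
prefix of "baz", while "&super" is damaged because "supe" is one.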
The JavaDoc of the HTMLEntityCodec clearly states that this behavior is
intended:
> Formats all are legal both with and without semi-colon, upper/lower case
cf.
http://owasp-esapi-java.googlecode.com/svn/trunk_doc/latest/org/owasp/esapi/codecs/HTMLEntityCodec.html#decodeCharacter%28org.owasp.esapi.codecs.PushbackString%29
However, the Wikipedia article "List of XML and HTML character entity
references" (cf
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
) states:
> The semicolon is required.
Which is not entirely true. Indeed, if we dive into the HTML
specification for Character references (
http://www.w3.org/TR/REC-html40/charset.html#entities ), we can read :
> In SGML, it is possible to eliminate the final ";" after a
character reference in some cases (e.g., at a line break or immediately
before a tag). In other circumstances it may not be eliminated (e.g., in
the middle of a word). We strongly suggest using the ";" in all cases to
avoid problems with user agents that require this character to be present.
And I think this is where the HTMLEntityCodec goes wrong:
> In other circumstances it may not be eliminated (e.g., in the
middle of a word).
In my use case "&super=great", the entity "&supe" is part of another
word and MUST NOT be decoded.
I think the HTMLEntityCodec should be modified to apply this behavior,
otherwise valid input is corrupted during canonicalization.
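
The rule I have in mind could be sketched as follows. This is only an
illustration under assumptions, not a patch for ESAPI: the class name, the
method name and the small entity table are hypothetical. A named entity
without a semicolon is decoded only when the whole alphanumeric run after
the "&" is a known entity name, i.e. when the name is not in the middle of
a word.

```java
import java.util.Map;

class WordBoundaryEntityDemo {
    // Tiny hypothetical entity table; a real codec knows many more names.
    static final Map<String, Character> ENTITIES = Map.of(
            "supe", '\u2287',   // "superset of or equal to"
            "amp", '&',
            "lt", '<');

    // Decodes a named entity only if the full alphanumeric run after '&'
    // is a known name; the trailing semicolon stays optional.
    static String decodeAtWordBoundary(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '&') {
                // Find the end of the alphanumeric run following '&'.
                int end = i + 1;
                while (end < in.length() && Character.isLetterOrDigit(in.charAt(end))) {
                    end++;
                }
                // No prefix matching: the run must be an entity name as a whole.
                Character decoded = ENTITIES.get(in.substring(i + 1, end));
                if (decoded != null) {
                    out.append(decoded);
                    // Swallow the semicolon when it is present.
                    i = (end < in.length() && in.charAt(end) == ';') ? end + 1 : end;
                    continue;
                }
            }
            // Not an entity: copy the character unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/someservlet?foo=bar&super=great&baz=qux";
        // "super" is not an entity name, so the URL comes back unchanged.
        System.out.println(decodeAtWordBoundary(url));
    }
}
```

With this rule "&supe;" and "&supe " (followed by a space) are still
decoded, but "&super=great" is left untouched.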
But I am no security expert... What do you think?
Regards,
Olivier Jaquemet