[Esapi-user] HTMLEntityCodec and optional semicolon for namedentities

Olivier Jaquemet olivier.jaquemet at jalios.com
Wed Oct 19 15:12:38 EDT 2011

Thank you Jeff for this detailed explanation.

I do understand the fact that the spec is irrelevant when it comes to 
security and that we must rely on the potential way browsers will 
interpret the data.

That being said, I don't see how anybody could reasonably develop a 
whole webapp without using any parameter starting with an entity name!
copy, lang, times, and, sub, sup, ne, mu, nu, le, ge... you get my point.
This is definitely not realistic.
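
To make the collision concrete, here is a minimal sketch of a decoder that
accepts named entities without the trailing semicolon. It is NOT the ESAPI
HTMLEntityCodec: the class name, the method names, the requireBoundary flag
and the five-entry entity table are all assumptions made up for illustration.
The flag toggles the "not in the middle of a word" rule from the HTML 4 spec.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, not part of ESAPI.
class EntityPrefixDemo {
    // Tiny illustrative subset of the HTML named entities (assumption: the real table is far larger).
    static final Map<String, Character> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("supe", '\u2287'); // superset of or equal to
        ENTITIES.put("sup", '\u2283');  // superset of
        ENTITIES.put("copy", '\u00A9'); // copyright sign
        ENTITIES.put("lang", '\u2329'); // left-pointing angle bracket
        ENTITIES.put("amp", '&');
    }

    // Decodes named entities, with or without ';'.
    // If requireBoundary is true, an unterminated entity directly followed by
    // a letter or digit is left untouched.
    static String decode(String input, boolean requireBoundary) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '&') {
                String name = longestEntityAt(input, i + 1);
                if (name != null) {
                    int end = i + 1 + name.length();
                    boolean hasSemicolon = end < input.length() && input.charAt(end) == ';';
                    boolean nextIsWordChar = end < input.length()
                            && Character.isLetterOrDigit(input.charAt(end));
                    if (hasSemicolon || !requireBoundary || !nextIsWordChar) {
                        out.append(ENTITIES.get(name));
                        i = hasSemicolon ? end + 1 : end; // swallow the ';' if present
                        continue;
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // Returns the longest known entity name starting at 'start', or null if none matches.
    static String longestEntityAt(String input, int start) {
        String best = null;
        for (String name : ENTITIES.keySet()) {
            if (input.startsWith(name, start)
                    && (best == null || name.length() > best.length())) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String query = "foo=bar&super=great&baz=qux";
        System.out.println(decode(query, false)); // lenient: "&supe" is decoded
        System.out.println(decode(query, true));  // boundary rule: left untouched
    }
}
```

Note that in this sketch the boundary rule only helps when the entity name is
a strict prefix of the parameter name: a parameter named exactly "copy" or
"sup" is followed by "=", so "&copy=2011" is still decoded in both modes.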

And disabling canonicalization is definitely a dangerous way to go, as I 
understand it.
But there is still some documentation to be written on this subject... ;)

I will try to have a look at implementing a custom codec (with the 
implied risks), or check other suggestions made by people on the mailing 
list (no canonicalization + semantic validation of decoded URL).
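
For the second option, a rough sketch of what "semantic validation" of the
URL could look like using only java.net.URI, with no entity decoding at all.
The class name, the method name and the specific checks (http/https only,
mandatory host, no user-info) are my own assumptions, not something ESAPI
provides or recommends; a real deployment would likely need more rules.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of ESAPI.
class UrlSemanticCheck {
    // Returns true only if 'candidate' parses as a URI and looks like a plain http(s) URL.
    static boolean isAcceptableHttpUrl(String candidate, int maxLength) {
        if (candidate == null || candidate.isEmpty() || candidate.length() > maxLength) {
            return false;
        }
        final URI uri;
        try {
            uri = new URI(candidate); // rejects illegal characters such as '<' or spaces
        } catch (URISyntaxException e) {
            return false;
        }
        String scheme = uri.getScheme();
        if (scheme == null) {
            return false; // relative references are not accepted here
        }
        scheme = scheme.toLowerCase(Locale.ROOT);
        if (!scheme.equals("http") && !scheme.equals("https")) {
            return false; // e.g. javascript:, data:, ftp:
        }
        String host = uri.getHost();
        if (host == null || host.isEmpty()) {
            return false;
        }
        // Assumption: URLs carrying credentials (user@host) are refused.
        return uri.getUserInfo() == null;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/someservlet?foo=bar&super=great&baz=qux";
        System.out.println(isAcceptableHttpUrl(url, 2000));
        System.out.println(isAcceptableHttpUrl("javascript:alert(1)", 2000));
    }
}
```

The trade-off is the one Jeff describes: without canonicalization, anything
that is later interpreted by an HTML parser still has to be output-encoded
for that context.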


On 19/10/2011 20:23, Jeff Williams wrote:
> This is an excellent question that came up a few years ago, actually.
> First, whether or not the spec requires a ; at the end of an entity is sort of irrelevant.  The fact is that browsers do actually interpret those characters.  So we need to treat those non-terminated entities as potential injection ingredients.
> The bigger question, IMO, is to figure out what the "canonical" form of the input you sent actually is.  The problem is that if the URL interpreter handles this first, then the & will be used to separate the URL parameters, and there are no HTML entities.  Alternatively, if the HTML interpreter handles this input first, it will see the &super and decide to decode the special character.
> This behavior is not necessarily standard, as different browsers may have made different choices.  If you want to change the way that the canonicalizer in ESAPI works, you can create your own encoder with a list of codecs.  The order of the codecs should match the order of the interpreters and/or decoders that you expect the data to visit.
> There are many other possible examples of data that can be canonicalized two different ways given the rich encoding environment created for us in modern web application petri dishes.  One that bit me recently was that "c:\file" decoded into "c:[0x0f]ile" -- beware multiple encoding schemes!!
> Personally, I recommend not using URL parameters that start with HTML entity names.  This is unfortunate, but I don't see a great way around it.
> --Jeff
> Jeff Williams, CEO
> Aspect Security
> 410-707-1487
> -----Original Message-----
> From: esapi-user-bounces at lists.owasp.org [mailto:esapi-user-bounces at lists.owasp.org] On Behalf Of Olivier Jaquemet
> Sent: Wednesday, October 19, 2011 9:18 AM
> To: esapi-user at lists.owasp.org
> Subject: [Esapi-user] HTMLEntityCodec and optional semicolon for namedentities
> Hello all,
> I have a question regarding a debatable behavior of the HTMLEntityCodec in ESAPI 2.0.1 (which I think is a bug).
> Let's say a user sends the following value for a URL input:
> http://www.example.com/someservlet?foo=bar&super=great&baz=qux
> I use the HTTPURL validator (in the default ESAPI.properties) to make sure the input is appropriate:
> String validatedUrl =
> org.owasp.esapi.ESAPI.validator().getValidInput("userURL",
> request.getParameter("userURL"), "HTTPURL", 2000, true);
> As expected, a "canonicalization" occurs and the HTMLEntityCodec is applied to decode characters.
> Problem: the start of the parameter "&super=great" is considered an HTML entity (&supe, i.e. ⊇) and the URL will be canonicalized this way:
> http://www.example.com/someservlet?foo=bar⊇r=great&baz=qux
> (Unicode code point U+2287, described as "superset of or equal to",
> appears after "bar" instead of the expected parameter name)
> It is clearly stated in the JavaDoc of the HTMLEntityCodec that this behavior is applied:
> > Formats all are legal both with and without semi-colon, upper/lower case
> (cf. http://owasp-esapi-java.googlecode.com/svn/trunk_doc/latest/org/owasp/esapi/codecs/HTMLEntityCodec.html#decodeCharacter%28org.owasp.esapi.codecs.PushbackString%29)
> However, the Wikipedia article "List of XML and HTML character entity references" (cf. http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references) states:
> > The semicolon is required.
> Which is not entirely true. Indeed, if we dive into the HTML specification for character references (http://www.w3.org/TR/REC-html40/charset.html#entities), we can read:
> > In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
> And I think this is the problem with the HTMLEntityCodec:
> > In other circumstances it may not be eliminated (e.g., in the middle of a word).
> In my use case "&super=great", the entity "&supe" is part of another word and must not be decoded.
> I think the HTMLEntityCodec should be modified to apply this behavior; otherwise it leads to invalid data being retrieved.
> But as I am no security expert... what do you think?
> Regards,
> Olivier Jaquemet

Olivier Jaquemet <olivier.jaquemet at jalios.com>
Ingénieur R&D Jalios S.A. - http://www.jalios.com/
@OlivierJaquemet +33970461480
