[Esapi-user] HTMLEntityCodec and optional semicolon for namedentities

Jim Manico jim.manico at owasp.org
Wed Oct 19 14:45:44 EDT 2011

So when you are trying to validate and display a complete URL from 
untrusted input, I would suggest the following workflow:

1) (on input) When first accepting the URL from the user, validate that 
URL with the Apache commons URL validation class or something similar:
* If the URL does not pass then reject the input.
PS: If you use ESAPI's validator for this validation then I recommend 
you turn off canonicalization for that one input.

2) (on input) Make sure the URL starts with http:// or https:// or 
otherwise reject the input (this rule will need to be modified for your 
app most likely)
* If the URL does not pass this rule then reject the input.

3) (on output) Render the URL using the normal encoder rules based on 
context of display. If you are putting this URL in an HREF link so it can 
be clicked, you need to do attribute encoding. If you are putting this URL 
in body content for display only, then just do normal HTML entity 
encoding.

This workflow should solve your problem.
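
A rough sketch of that workflow in plain Java, for illustration only. The
class and method names are made up for this example. java.net.URI stands
in for the Apache Commons validator in step 1, and the small
encodeForHtml() helper stands in for a real encoder such as ESAPI's
encodeForHTMLAttribute() / encodeForHTML(). Note that no canonicalization
is applied anywhere, so "&super=great" is left alone on input.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper class; names are illustrative, not part of ESAPI.
final class UrlWorkflowSketch {

    /** Steps 1 and 2 (on input): return the URL unchanged or reject it. */
    static String validateOnInput(String untrusted) {
        if (untrusted == null || untrusted.length() > 2000) {
            throw new IllegalArgumentException("URL missing or too long");
        }
        // Step 2: accept only http:// or https:// (adjust for your app).
        String lower = untrusted.toLowerCase(Locale.ROOT);
        if (!lower.startsWith("http://") && !lower.startsWith("https://")) {
            throw new IllegalArgumentException("Only http/https URLs are accepted");
        }
        // Step 1 (stand-in): a syntax check with java.net.URI. A dedicated
        // URL validator would be stricter than this.
        try {
            URI uri = new URI(untrusted);
            if (uri.getHost() == null) {
                throw new IllegalArgumentException("URL has no host");
            }
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed URL", e);
        }
        return untrusted;
    }

    /** Step 3 (on output): minimal encoder for HTML body text or a quoted attribute. */
    static String encodeForHtml(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String url = validateOnInput(
            "http://www.example.com/someservlet?foo=bar&super=great&baz=qux");
        String safe = encodeForHtml(url);
        // The href value must be quoted for this encoding to be sufficient.
        System.out.println("<a href=\"" + safe + "\">" + safe + "</a>");
    }
}
```

With the example URL from this thread, the encoded value is
http://www.example.com/someservlet?foo=bar&amp;super=great&amp;baz=qux,
which the browser decodes back to the original when it reads the attribute.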


PS: Keep in mind that nothing above does any kind of URL encoding. If the 
untrusted data lands only in a GET parameter value, then URL encode it. :)
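
For that GET parameter case, a minimal sketch using the JDK's
java.net.URLEncoder (the Charset overload needs Java 10 or later). The
class name, method name, and example URL are made up for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative only.
final class QueryParamSketch {

    /** Builds base?name=value with the name and the untrusted value URL-encoded. */
    static String buildLink(String base, String name, String untrustedValue) {
        return base + "?"
            + URLEncoder.encode(name, StandardCharsets.UTF_8)
            + "="
            + URLEncoder.encode(untrustedValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // '&', '=' and the space in the value can no longer break out of the parameter.
        System.out.println(buildLink("http://www.example.com/search", "q", "a&b=c d"));
        // prints http://www.example.com/search?q=a%26b%3Dc+d
    }
}
```

If the resulting link is then written into a page, it still needs the
HTML or attribute encoding from step 3.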

> This is an excellent question that came up a few years ago, actually.
> First, whether or not the spec requires a ; at the end of an entity is sort of irrelevant.  The fact is that browsers do actually interpret those characters.  So we need to treat those non-terminated entities as potential injection ingredients.
> The bigger question, IMO, is to figure out what the "canonical" form of the input you sent actually is.  The problem is that if the URL interpreter handles this first, then the & will be used to separate the URL parameters, and there are no HTML entities.  Alternatively, if the HTML interpreter handles this input first, it will see the &super and decide to decode the special character.
> This behavior is not necessarily standard, as different browsers may have made different choices.  If you want to change the way that the canonicalizer in ESAPI works, you can create your own  encoder with a list of codecs.  The order of the codecs should match the order of the interpreters and/or decoders that you expect the data to visit.
> There are many other possible examples of data that can be canonicalized two different ways given the rich encoding environment created for us in modern web application petri dishes.  One that bit me recently was that "c:\file" decoded into "c:[0x0f]ile" -- beware multiple encoding schemes!!
> Personally, I recommend not using URL parameters that start with HTML entity names.  This is unfortunate, but I don't see a great way around it.
> --Jeff
> Jeff Williams, CEO
> Aspect Security
> 410-707-1487
> -----Original Message-----
> From: esapi-user-bounces at lists.owasp.org [mailto:esapi-user-bounces at lists.owasp.org] On Behalf Of Olivier Jaquemet
> Sent: Wednesday, October 19, 2011 9:18 AM
> To: esapi-user at lists.owasp.org
> Subject: [Esapi-user] HTMLEntityCodec and optional semicolon for namedentities
> Hello all,
> I have a question regarding a debatable behavior of the HTMLEntityCodec in ESAPI 2.0.1 (which I think is a bug).
> Let's say a user sends the following value for a URL input :
> http://www.example.com/someservlet?foo=bar&super=great&baz=qux
> I use the HTTPURL validator (in the default ESAPI.properties) to make sure the input is appropriate :
> String validatedUrl =
> org.owasp.esapi.ESAPI.validator().getValidInput("userURL",
> request.getParameter("userURL"), "HTTPURL", 2000, true);
> As expected, a "canonicalization" occurs and the HTMLEntityCodec is applied to decode characters.
> Problem : the start of the parameter "&super=great" is treated as an HTML entity (⊇) and the URL is canonicalized this way :
> http://www.example.com/someservlet?foo=bar⊇r=great&baz=qux
> (Unicode code point U+2287, described as "superset of or equal to",
> appears after "bar" instead of the expected parameter name)
> It is clearly stated in the JavaDoc of the HTMLEntityCodec that this behavior is applied :
> > Formats all are legal both with and without semi-colon, upper/lower case
> (cf. http://owasp-esapi-java.googlecode.com/svn/trunk_doc/latest/org/owasp/esapi/codecs/HTMLEntityCodec.html#decodeCharacter%28org.owasp.esapi.codecs.PushbackString%29 )
> However, the Wikipedia article "List of XML and HTML character entity references" (cf. http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references ) states :
> > The semicolon is required.
> Which is not entirely true. Indeed, if we dive into the HTML specification for character references ( http://www.w3.org/TR/REC-html40/charset.html#entities ), we can read :
> > In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
> And I think this is the problem of the HTMLEntityCodec.
> --> In other circumstances it may not be eliminated (e.g., in the middle of a word).
> In my use case "&super=great", the entity "&supe" is part of another word and MUST NOT be decoded.
> I think the HTMLEntityCodec should be modified to apply this behavior; otherwise it leads to invalid data being retrieved.
> But as I am no security expert... What do you think ?
> Regards,
> Olivier Jaquemet
> _______________________________________________
> Esapi-user mailing list
> Esapi-user at lists.owasp.org
> https://lists.owasp.org/mailman/listinfo/esapi-user
