[Developer-outreach] FW: Modifying the custom tags and docs for the form, bean:write, and other output tags in SpringMVC, Struts, Wicket, etc and JavaScript Frameworks

Mark Curphey mark at curphey.com
Mon Feb 28 12:59:50 EST 2011

Forwarding as Barry is not subscribed and his mail bounced

From: Barry Dorrans [bdorrans at microsoft.com]
Sent: Monday, February 28, 2011 9:45 AM
To: Mark Curphey; dinis cruz
Cc: Jim Manico; developer-outreach at lists.owasp.org; Jeff Ichnowski; Chris Schmidt
Subject: RE: [Developer-outreach] Modifying the custom tags and docs for the form, bean:write, and other output tags in SpringMVC, Struts, Wicket, etc and JavaScript Frameworks

Yup, ran off my feet this week.

So a couple of things spring to mind

<%: now HTML encodes in ASP.NET/MVC - but it's HTML encoding only
The SRE plugin only works with webforms, and controls I know about. And the current version is in a CTP space. So I don't know how it maps.

I think part of the problem I've seen is people applying the wrong encoding. Or multiple encodings in the wrong order.

I can have a proper think towards the end of the week, or try to address more specific questions - but remember I can in no way, shape or form talk for the ASP.NET team, though I can loop them in.

From: Mark Curphey [mailto:mark at curphey.com]
Sent: Monday, February 28, 2011 09:31
To: dinis cruz; barry at idunno.org
Cc: Jim Manico; developer-outreach at lists.owasp.org; Jeff Ichnowski; Chris Schmidt
Subject: RE: [Developer-outreach] Modifying the custom tags and docs for the form, bean:write, and other output tags in SpringMVC, Struts, Wicket, etc and JavaScript Frameworks

Adding Barry as the core Anti-XSS / WPL developer (MVP Summit this week so I doubt he will be super responsive)

From: dinis cruz [dinis.cruz at owasp.org]
Sent: Monday, February 28, 2011 8:46 AM
To: Mark Curphey
Cc: Jim Manico; developer-outreach at lists.owasp.org; Jeff Ichnowski; Chris Schmidt
Subject: Re: [Developer-outreach] Modifying the custom tags and docs for the form, bean:write, and other output tags in SpringMVC, Struts, Wicket, etc and JavaScript Frameworks
A great next-step would be to map the ideas and concepts in Mike Samuel's paper with the great work that Microsoft has done with the AntiXss library, namely the Security Runtime Engine (SRE) which applies encoding in-context to ASP.NET controls.

John W, as per our call a couple minutes ago, this is the type of sub-project (or sub-initiative) that needs to gain its own space, so that the people who are interested are able to just focus on it and get things done

Btw, is Mike Samuel's 'Using Type Qualifiers to Make Web Templates Robust Against XSS' paper published? (Google can't seem to find it). I want to read it properly, and was looking for a nicely formatted pdf :)

Dinis Cruz

2011/2/28 Mark Curphey <mark at curphey.com>
Awesome. Any MSFT people you need me to go talk to or intro you to? I don't think we play in the space but don't know.....

Sent from my Phone

On Feb 28, 2011, at 1:43 AM, "Jim Manico" <jim.manico at owasp.org> wrote:

> Mike Samuel (an Auto-escape template author from Google) would like to send this letter to several framework communities, starting with Django. Could you kindly review and pass along any suggestions to msamuel at google.com - in the interest of security developer outreach?
> Thanks all,
> Jim
>  Using Type Qualifiers to Make Web Templates Robust Against XSS
>    Motivation
> Scripting vulnerabilities plague web applications today. To streamline the
> output generation from application code, numerous web templating frameworks have
> recently emerged and are gaining widespread adoption. However, existing web
> frameworks fall short in providing mechanisms to automatically and
> context-sensitively sanitize untrusted data.
> For example, a naive web template might look like
> <div>{$name}</div>
> but this template is vulnerable to Cross-site scripting (XSS) vulnerabilities.
> An attacker who controls the value of |name| could pass in
> |<script>document.location = 'http://phishing.com/';</script>| to redirect users
> to a malicious site, steal the users' credentials or personal data, or initiate a
> download of malware.
> The template author might manually encode name:
> <div>{$name |escapeHTML}</div>
> making sure that the user sees exactly the value of |name| as per spec, and
> defeating this particular attack.
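The manual encoding step above can be sketched in Python. This is only an illustration: the standard library's html.escape stands in for the template's |escapeHTML| directive, and render_name_div is a hypothetical helper, not part of any templating framework.

```python
import html


def render_name_div(name: str) -> str:
    """Hand-written equivalent of <div>{$name |escapeHTML}</div>."""
    # html.escape turns &, <, >, " and ' into entities, so the value
    # is displayed as text instead of being parsed as markup.
    return "<div>" + html.escape(name, quote=True) + "</div>"


attack = "<script>document.location = 'http://phishing.com/';</script>"
print(render_name_div(attack))
print(render_name_div("I <3 Ponies"))
```

The sketch only covers the HTML text context; the same call would be the wrong choice inside a URL, a style attribute, or a script block, which is the gap the letter goes on to discuss.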
> A better web templating system might automatically insert the ||escape***|
> directives, relieving the template author of the burden.
> This paper argues that correct sanitization is too important, that manual
> sanitization is an unreasonable burden to place on template authors (and
> especially maintainers), defines goals that any automatic approach should
> satisfy, and introduces an automatic approach that is particularly suitable for
> bolting onto existing web templating languages.
>    Abstract
> In this paper, we propose a type-based approach to bolt context-sensitive
> automatic sanitization onto existing widely used web templating languages. In
> particular, we introduce the new notion of "context" type qualifiers to
> represent the contexts in which untrusted data can be embedded. We propose a new
> type system that refines the base type system of a web templating language with
> the context type qualifier. Based on the new type system, we design and develop a
> context-sensitive auto-sanitization (CSAS) engine which runs during the
> compilation stage of a web templating framework to add proper sanitization and
> runtime checks to ensure the correct sanitization. We implement our system in
> Google Closure Templates, a commercially used open-source templating framework
> that is used in GMail, Google Docs and other applications. We evaluate our type
> system on 1035 real-world Closure templates. We demonstrate that our approach
> achieves both better security and performance than previous approaches.
>    Glossary
> Context
>    A parser state in the combined HTML, CSS, and JavaScript grammar used to
>    determine the stack of sanitization routines that need to be applied to any
>    untrusted data interpolated at that point to preserve the security
>    properties outlined here.
> Cross-Site Scripting
>    A quoting confusion <#glossary-quoting_confusion> attack whereby untrusted
>    data naively interpolated into HTML, CSS, or JavaScript causes code to run
>    with the privileges of an origin not owned by the attacker.
> CSS
>    CSS 2 and 3 plus vendor specific extensions such as |expression:| and
>    comment parsing and error recovery quirks so that our sanitization function
>    definitions survive a worst-case analysis. This paper assumes a basic
>    familiarity with CSS.
> Escaper
>    A sanitization function <#glossary-sanitization_function> that takes content
>    in an input language (usually |text/plain|) and produces content in an
>    output language. E.g. the function |escapeHTML| is an escaper that takes
>    plain text, |'I <3 Ponies'|, and transforms that to semantically equivalent
>    HTML by turning HTML special characters into entities: |'I &lt;3 Ponies'|.
>    (Escapers may, in the process, break hearts.) See also OWASP's definition
>    <http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#Escaping_.28aka_Output_Encoding.29>.
> Filter
>    A sanitization function <#glossary-sanitization_function> that takes a
>    string and either returns it, returns an innocuous string, or aborts
>    template processing. E.g. an untrusted value at the start of a URL can
>    specify a powerful protocol such as |javascript:|. A filter can ensure that
>    an untrusted value at the beginning of a URL either contains no protocol or
>    contains one in a whitelist (|http|, |https|, or |mailto|) and if it finds
>    an untrusted value that violates this rule, might return an innocuous value
>    such as |'#'| which defangs the URL.
> HTML
>    HTML as parsed by browsers. Typically HTML5 but we need to deal with
>    syntactic quirks from mainstream browser lines as old as IE5. This paper
>    assumes a basic familiarity with HTML.
> JavaScript
>    EcmaScript 5 but including vendor specific extensions such as conditional
>    compilation directives so that our sanitization function definitions survive
>    a worst-case analysis. This paper assumes a basic familiarity with JavaScript.
> Normalizer
>    A sanitization function <#glossary-sanitization_function> that takes content
>    in an input language and produces content in that same language but that can
>    be used in more contexts. E.g. the function |normalizeURI| might make sure
>    that quotes are encoded so that a URI path can be embedded in an HTML
>    attribute unchanged:
>    |'mailto:<Mohammed%20"The%20Greatest"%20Ali>%20ali@gmail.com'| →
>    |'mailto:%3cMohammed%20%22The%20Greatest%22%20Ali%3e%20ali@gmail.com'|
>    and a function that strips tags from valid HTML allows the tagless HTML to
>    be included in an HTML attribute context.
> Quoting Confusion
>    A vulnerability (or an exploitation of such) due to a failure to encode data
>    in one language (such as |text/plain|) before concatenating it with content
>    in another language such as |text/html| in the case of XSS. Other examples
>    of quoting confusion include SQL Injection, Shell Injection, and HTTP header
>    splitting.
> RTTI
>    RunTime Type Information. Reflective access to the type of a value at the
>    time a program is running. RTTI APIs include |typeid| in C++; |typeof| in C#
>    and JavaScript; |instanceof| in Java and JavaScript; |Object.instance_of?()|
>    in Ruby; |type| in Python; |Object.getClass()| in Java; and
>    |Object.GetType()| in C#.
> Sanitization Function
>    A function that takes untrusted data and returns a snippet of web content.
>    There are several kinds of sanitization functions: escapers, normalizers,
>    and filters.
> Template
>    A function from (typically untrusted) data to a string that is specified as
>    a DSL that clearly separates static trusted snippets of content (usually
>    appearing as literal text) from interpolations of untrusted data (usually
>    appearing as expressions or variable names). In this paper the term is used
>    synonymously with "Web Template" which is a type of template that produces a
>    string of content in a web language: HTML, CSS, or JavaScript.
> Trusted Path
>    The ability of an application or piece of code to establish a channel to the
>    user that the user can be sure leads to that piece of code. E.g. Browsers
>    use an unspoofable dialog box for HTTP auth to gather passwords, Windows
>    uses Ctrl-Alt-Delete for the same purpose, and browsers disallow spoofing of
>    the URL bar so that informed users can reliably tell if a page is secure and
>    using valid certs. See also Wikipedia
>    <http://en.wikipedia.org/wiki/Trusted_path>.
> XSS
>    See Cross-Site Scripting <#glossary-cross_site_scripting>
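The three kinds of sanitization function defined in the glossary can be sketched in Python. The function names, the regular expression, and the set of characters left untouched by the normalizer are all assumptions made for illustration; they are not the definitions used in the paper or in Closure Templates.

```python
import html
import re
from urllib.parse import quote

# Whitelist taken from the Filter entry above.
SAFE_PROTOCOLS = {"http", "https", "mailto"}


def escape_html(text: str) -> str:
    """Escaper: plain text in, semantically equivalent HTML out."""
    return html.escape(text, quote=True)


def filter_url_start(value: str) -> str:
    """Filter: return the value unchanged, or the innocuous '#'."""
    match = re.match(r"^([^:/?#]*):", value)
    if match is None:
        return value  # no protocol at all
    if match.group(1).lower() in SAFE_PROTOCOLS:
        return value  # whitelisted protocol
    return "#"  # anything else, e.g. javascript:, is defanged


def normalize_uri(uri: str) -> str:
    """Normalizer: URI in, URI out, but safe inside an HTML attribute."""
    # Existing %XX escapes and URI punctuation are kept; quotes and
    # angle brackets are percent-encoded.
    return quote(uri, safe="%:/?#[]@!$&()*+,;=")


print(escape_html("I <3 Ponies"))
print(filter_url_start("javascript:alert(1)"))
print(normalize_uri('mailto:<Mohammed%20"The%20Greatest"%20Ali>%20ali@gmail.com'))
```

Note that urllib.parse.quote emits uppercase hex digits (%3C rather than %3c); the two forms are equivalent in a URI.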
>    Solution Sketch: A static approach with RTTI to avoid over-escaping.
> First, a brief overview of the proposed solution. Later, while discussing goals
> and alternatives, we will refer back to this.
> The template below specifies a form whose |action| depends on two values |$name|
> and |$target| which may come from untrusted sources. The |{if …}…{else}…{/if}|
> branches to define a dynamic URL.
> <form action="{if $target}/{$name}/handle?tgt={$target}{else}/{$name}/default{/if}">Hello {$name}…
> First, we parse the template to find trusted static content, dynamic "data
> holes" that may be filled by untrusted data, and flow control constructs: |if|,
> |for|, etc. In the copy below, the data holes are blanked out and the remaining
> text is trusted static content.
> <form action="{if $target}/        /handle?tgt=          {else}/        /default{/if}">Hello        …
> Next we do a flow-sensitive analysis, propagating types to determine the context
> in which each data hole appears.
> <form action="{if $target}/        /handle?tgt=          {else}/        /default{/if}">Hello        …
> ↑PCDATA    ↑Attr URL start    ↑Attr URL path    ↑Attr URL query    ↑Attr URL path    ↑PCDATA
> Based on those contexts, we determine the type of content that is expected for
> each hole.
> <form action="{if $target}/(HTML-encoded URL)/handle?tgt=(HTML-encoded query){else}/(HTML-encoded URL)/default{/if}">Hello(text/html)…
> Finally we insert calls to sanitizer functions <#glossary-sanitization_function>
> into the template.
> <form action="{if $target}/{escapeHTML($name)}/handle?tgt={encodeURIComponent($target)}{else}/{escapeHTML($name)}/default{/if}">Hello {escapeHTML($name)}…
> That is the gist of the solution, though the above example glosses over issues
> with re-entrant templates, templates that are invoked in multiple start
> contexts, and joining branches that end in different contexts; and the exact
> sanitization functions chosen are different than shown in this simplified example.
> The example only shows HTML and URL encoding, but our solution deals with data
> holes that occur inside embedded JavaScript and CSS as any solution for AJAX
> applications must.
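The last step of the sketch, choosing a sanitizer per data hole from its context, can be illustrated in Python. Everything here is an assumption for illustration: the contexts are assigned by hand rather than inferred, the context names and the SANITIZERS table are hypothetical, only the branch where $target is set is modelled, and urllib.parse.quote only approximates encodeURIComponent.

```python
import html
from urllib.parse import quote

# Hypothetical context -> sanitizer table mirroring the simplified example.
SANITIZERS = {
    "PCDATA": lambda v: html.escape(v, quote=True),
    "ATTR_URL_PATH": lambda v: html.escape(v, quote=True),
    "ATTR_URL_QUERY": lambda v: quote(v, safe=""),
}

# Static strings are trusted; tuples are (hole name, context) pairs.
TEMPLATE_WITH_TARGET = [
    '<form action="/', ("name", "ATTR_URL_PATH"),
    "/handle?tgt=", ("target", "ATTR_URL_QUERY"),
    '">Hello ', ("name", "PCDATA"),
]


def render(parts, data):
    """Concatenate static parts and sanitized hole values."""
    out = []
    for part in parts:
        if isinstance(part, str):
            out.append(part)
        else:
            hole, context = part
            out.append(SANITIZERS[context](data[hole]))
    return "".join(out)


print(render(TEMPLATE_WITH_TARGET, {"name": 'a"b', "target": "x&y=<z>"}))
```

The same $name value is treated differently depending on where it lands, which is the point of doing the analysis per hole rather than per variable.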
>    Problem Definition
> In this section we present several metrics on which any competing sanitization
> scheme should be judged, and a definition of a safe template that can be used to
> prove or disprove the soundness of a sanitization scheme that we think is
> relevant to security properties that web applications commonly want to enforce.
>      Performance
> A sanitization scheme should be judged on several performance metrics:
>   1. Compile- or load-time overhead. The cost of any static analysis.
>   2. Run-time analysis overhead. The cost of any dynamic analysis done when the
>      template is run.
>   3. Run-time sanitization overhead. The cost of sanitizing
>      <#glossary-sanitization_function> untrusted data.
>   4. One-time development overhead. The burden placed on a developer to learn
>      the system.
>   5. Continual development overhead. The burden placed on a developer to add
>      sanitization directives, review code to ensure they are used correctly,
>      debug the resulting template code, and deal with any over- or
>      mis-sanitization.
> Run-time analysis overhead (proportional to overall template runtime) often
> differs substantially by platform. High quality parser-generators exist for C
> and Java, so the overhead may be much lower there than in browser, since
> iterating char by char over a string is slow in JavaScript.
> Our proposal has a modest compile-/load-time cost taking slightly less than 1
> second to do static inference for 1035 templates comprising 782kB of source code
> or about 1ms per template. The runtime analysis overhead for our proposal is zero. The
> runtime sanitization overhead on a benchmark is between 3% and 10% of the total
> template execution time, and is indistinguishable from the overhead when
> non-contextual auto-sanitization is used (all data holes sanitized using HTML
> entity escaping).
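The per-template figure quoted above follows from simple division. A quick check in Python, treating "slightly less than 1 second" as an upper bound of exactly one second (an assumption):

```python
templates = 1035
source_kb = 782
inference_seconds = 1.0  # upper bound: "slightly less than 1 second"

ms_per_template = inference_seconds * 1000 / templates
kb_per_template = source_kb / templates

print(f"{ms_per_template:.2f} ms of static inference per template (upper bound)")
print(f"{kb_per_template:.2f} kB of source per template on average")
```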
> Development overhead is hard to measure but the 1035 templates were migrated by
> an application group in a matter of weeks without stopping application
> development with little coordination, so the one-time overhead ¡X the overhead to
> learn the system ¡X is lower than that to learn and adopt a new templating
> language. Since the system works by inserting function calls, we provided
> debugging tools that diffed templates before and after the inference was run to
> show developers what the system was doing and aid in debugging. Due to the need
> to debug templates written using any approach, the continual development
> overhead can never be zero, but tool support, like diffing, can make the system
> transparent and ease debugging.
> Finally, once a bug has been identified, we try to make sure there are simple
> bugfixing recipes.
>    * If the problem is over-escaping of known-safe data values that were
>      already sanitized, then wrap the data value in a |SanitizedContent|
>      wrapper object of the appropriate type as close to where it is sanitized
>      as possible. Ideally, an HTML tag whitelisting sanitizer would return a
>      value of type |SanitizedContent|.
>    * If the sanitizer is choosing the wrong sanitization functions for whatever
>      reason, insert your own. The system recognizes sanitization functions and
>      does not interfere with developer choices.
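The first recipe can be sketched in Python. The SanitizedContent class below is a hypothetical stand-in for the wrapper the letter describes, and escape_html is an illustrative escaper that uses a runtime type check to avoid re-escaping content already marked as safe HTML.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class SanitizedContent:
    """Hypothetical wrapper marking a string as already sanitized."""
    content: str
    content_type: str  # e.g. "text/html"


def escape_html(value) -> str:
    """Escape plain text, but pass through content known to be safe HTML."""
    if isinstance(value, SanitizedContent):
        if value.content_type == "text/html":
            return value.content  # already HTML: do not over-escape
        value = value.content  # sanitized for another language: still escape
    return html.escape(value, quote=True)


signature = SanitizedContent("Alan <font color=green>Green</font>span", "text/html")
print(escape_html(signature))
print(escape_html("Alan <font color=green>Green</font>span"))
```

The wrapper should only be constructed by code that has actually sanitized the value, such as a tag whitelister, which is why the recipe asks for it to be applied as close to the sanitizer as possible.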
>      Ease of Adoption/Migration
> What kind of changes, if any, do developers have to make to take an existing
> codebase of templates and have them properly sanitized? For example, adding
> sanitization functions manually is time-consuming and error-prone. Making sure
> that all static content is valid XHTML requires repetitive, time-consuming
> changes, but would not be as error-prone.
> Our proposal allows contextual auto-sanitization to be turned on for some
> templates and not for others; most templating languages allow templates to be
> composed, i.e. templates can call other templates, and standard practice seems
> to be to have a few large templates that call out to many smaller templates.
> Since this can be done per template, a codebase can be migrated piecemeal,
> starting with complicated templates that have known problems.
> Our proposal does not impose an element structure on template boundaries. Many
> top level templates look like:
> {include "common-header.foo"}
> <!-- Body content -->
> {include "common-footer.foo"}
> where the common header opens elements that are closed in the common footer:
> <html>
> <head>
> <!-- Common style and script definitions -->
> ...
> </head>
> <body>
> <!-- Common menus -->
> Approaches that require template code to be well-formed XML, such as XSLT,
> cannot support this idiom. Our proposal works for templating languages that
> allow this idiom because we propagate types as they flow through template calls rather
> than inferring types of content based on a DOM derived from a template.
>      Ease of Abandonment
> If a development team adopts a sanitization scheme, and finds that it does not
> meet their needs, how easily can they switch it off, and how much of the effort
> they invested in deploying it can they recover?
> Since our solution works by inserting calls to sanitization functions into
> templates, a development team having second thoughts can simply run the type
> inference engine to insert the calls, and print out the resulting templates to
> generate a patch to their codebase and then remove whatever directives turned on
> auto-sanitization. We argued above that the cost of adoption is low, and most of the
> work put into verifying that the sanitization functions chosen were reasonable
> is recoverable.
>      Security under Maintenance
> Security measures tend to be removed from code under maintenance. Imagine a
> template that is not auto-sanitized:
> <div>Your friend, {escapeHTML($name)}, thinks you'll like this.</div>
> that is passed a plain text name. While merging two applications, developers add
> a call to this code, passing in a rich HTML signature that has been proven safe
> by a tag whitelister, e.g. |"Alan <font color=green>Green</font>span"|.
> Eventually, Mr. Greenspan notices that his name is misrendered and files a bug.
> A developer might check that the rich text signature is sanitized properly
> before being passed in, but not notice the other caller that doesn't do any
> sanitization. They resolve the bug by removing the call to |escapeHTML| which
> fixes the bug but opens a vulnerability.
> Over-encoding is more likely to be noticed by end-users than XSS
> vulnerabilities, so a project under maintenance is more likely to lose manual
> sanitization directives than to gain them.
> Our proposal addresses this by introducing sanitized content types
> <#sanitized_content_types> as a principled solution to over-encoding problems.
>      Structure Preservation Property
> We define a safe template as one that has several properties: the structure
> preservation property described here, and the code effect and least surprise
> properties defined in later sections.
> Intuitively, this property holds that when a template author writes an HTML tag
> in a safe templating language, the browser will interpret the corresponding
> portion of the output as a tag regardless of the values of untrusted data, and
> similarly for other structures such as attribute boundaries and JS and CSS
> string boundaries.
> This property can be violated in a number of ways. E.g. in the following
> JavaScript, the author is composing a string that they expect will contain a
> single top-level bold element surrounded by text.
> document.write(greeting + ', <b>' + planet + '</b>!');
> and if |greeting| is |"Hello"| and |planet| is |"World"| then this holds as the
> output written is "|Hello, <b>World</b>!|"; but if |greeting| is
> |"<script>alert('pwned');//"| and |planet| is |"</script>"| then this does not
> hold since the structure has changed: the |<b>| should have started a bold
> element but the browser interprets it as part of a JavaScript comment in
> "|<script>alert('pwned');//, <b></script></b>!|".
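The change in structure can be observed with a small Python sketch. It uses the standard library's html.parser, which is not a browser parser and is used here only as an approximation; write and start_tags are hypothetical helpers.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record every start tag the parser recognizes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def write(greeting: str, planet: str) -> str:
    """Naive concatenation, as in the document.write example."""
    return greeting + ", <b>" + planet + "</b>!"


def start_tags(markup: str):
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


print(start_tags(write("Hello", "World")))
print(start_tags(write("<script>alert('pwned');//", "</script>")))
```

With benign inputs the author's bold element is seen as a tag; with the malicious inputs the only start tag recognized is the injected script element, and the author's bold tag is swallowed as script text.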
> Lower level encoding attacks, such as UTF-7
> <http://ha.ckers.org/xss.html#XSS_UTF-7> attacks, may also violate this property.
> More formally, given any template, e.g.
> <div id="{$id}" onclick="alert('{$message}')">{$message}</div>
> we can derive an /innocuous template/ by replacing every untrusted variable with
> an innocuous string, a string that is not empty, is not a keyword in any
> programming language and does not contain special characters in any of the
> languages we're dealing with. We choose our innocuous string so that it is not a
> substring of the concatenation of literal string parts. Using the innocuous
> string |"zzz"|, an innocuous template derived from the above is:
> <div id="zzz" onclick="alert('zzz')">zzz</div>
> Parsing this, we can derive a tree structure where each inner node has a type
> and children, and each leaf has a type and a string value.
> Element
>  ├Name : "div"
>  ├Attribute
>  │  ├Name : "id"
>  │  └Text : "zzz"
>  ├Attribute
>  │  ├Name : "onclick"
>  │  └JsProgram
>  │      └FunctionCall
>  │          ├Identifier : "alert"
>  │          └String : "zzz"
>  └Text : "zzz"
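Deriving the innocuous template can be sketched in Python. The hole syntax matched by the regular expression ({$name} only) and the helper name innocuous_template are assumptions for illustration; a real template language has more constructs than this.

```python
import re

# Matches simple data holes such as {$id} or {$message}.
HOLE = re.compile(r"\{\$\w+\}")


def innocuous_template(template: str, innocuous: str = "zzz") -> str:
    """Replace every data hole with the innocuous string."""
    if not innocuous:
        raise ValueError("innocuous string must not be empty")
    # The innocuous string must not occur in the literal parts, otherwise
    # holes could not be told apart from static content after parsing.
    literal_text = "".join(HOLE.split(template))
    if innocuous in literal_text:
        raise ValueError("innocuous string occurs in the literal template text")
    return HOLE.sub(innocuous, template)


template = """<div id="{$id}" onclick="alert('{$message}')">{$message}</div>"""
print(innocuous_template(template))
```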
> A template has the structure preservation property when for all possible branch
> decisions through a template, and for all possible data table inputs, a template
> either produces no output (fails with an exception) or produces an output that
> can be parsed to a tree that is structurally the same as that produced by the
> innocuous template derived from it for the same set of branch decisions.
> ∀ branch-decisions ∀ data, areEquivalent(
> parse(innocuousTemplate(T)(branch-decisions, data)),
> parse(T(branch-decisions, data)))
> where parse parses using a combined HTML/JavaScript/CSS grammar to the tree
> structure described above, branch-decisions is a path through flow control
> constructs (the conditions in |for| loops and |if| conditions) and where
> areEquivalent is defined thus:
> INNOCUOUS_STRING = 'zzz'  # The innocuous string from the example above.
>
> def areEquivalent(innocuous_tree, actual_tree):
>   if innocuous_tree.is_leaf:
>     if INNOCUOUS_STRING in innocuous_tree.leaf_value:
>       # Ignore the contents of actual since it was generated by
>       # a hole.  We only care that it does not interfere with
>       # the structure in which it was embedded.
>       return True
>     # Leaves structurally the same.
>     # Assumes same node type implies actual is leafy.
>     return (innocuous_tree.node_type == actual_tree.node_type
>             and innocuous_tree.leaf_value == actual_tree.leaf_value)
>   # Require type equivalence for inner nodes.
>   if innocuous_tree.node_type != actual_tree.node_type:
>     return False
>   # Zip below will silently drop extras.
>   if len(innocuous_tree.children) != len(actual_tree.children):
>     return False
>   # Recurse to children.
>   for innocuous_child, actual_child in zip(
>       innocuous_tree.children, actual_tree.children):
>     if not areEquivalent(innocuous_child, actual_child):
>       return False
>   return True  # All grounds on which they could be inequivalent disproven.
> This definition is not computationally tractable, but is formal so can be used
> as a basis for correctness proofs, and in practice branch decisions that go
> through loops more than twice or recurse more than twice can be ignored and
> fuzzers do a good job of generating bad data inputs.
> This property is essential to capturing developer intent. When the developer
> writes a tag, the browser should interpret that as a tag, and when the developer
> writes paired start and end tags, the browser should interpret those as a
> matched pair. It is also important to applications that want to embed sanitized
> data while preserving a trusted path <#glossary-trusted_path> since the
> structure preservation property is a prerequisite for visual containment.
>      Code Effect Property
> Untrusted data may specify data values in code (loosely, strings, booleans,
> numbers, JSON) but /only/ code specified by the template author should run as a
> result of injecting the template output into a page and /all/ code specified by
> the template author should run as a result of the same.
> There are a dizzyingly large number of ways this property can fail to hold for a
> template. A non-exhaustive sample of ways to cause extra code to run:
>    * Unencoded text could contain a |<script>| element.
>    * A dynamic attribute name could specify an event handler: |onclick|.
>    * A dynamic |src| or |href| could specify |javascript|, |livescript|, etc.
>      as the protocol in myriad <http://ha.ckers.org/xss.html> ways.
>    * Dynamic CSS
>      <http://code.google.com/p/browsersec/wiki/Part1#Cascading_stylesheets>
>      could use a vendor-specific extension, e.g. |expression| or |-moz-binding|.
>    * A dynamic |<object>| might load Flash cross-origin and set |AllowScriptAccess|.
>    * Dynamic code might reach |eval| or any number of JavaScript APIs that
>      invoke the HTML, CSS, or JavaScript parsers.
> There are also many ways to cause security-critical code to not run. In general,
> it is not wise to rely on JavaScript running in a browser, but many developers,
> not unreasonably, rely on some code having run if other code is running at a
> later time. A non-exhaustive sample of ways to stop code running via XSS:
>    * Change the base href of the page by inserting a |<base>| element disabling
>      |src=|'ed |<script>|s with relative URLs.
>    * Inject improperly encoded data into a data hole in an inline <script>
>      element that causes it to fail to parse. Putting Unicode codepoints U+2028
>      or U+2029 into a string body is a favorite since JavaScript treats those
>      as newline characters. (A violation of the Structure Preservation Property).
>    * Cause an inline event handler or |<script>| tag to not be interpreted as
>      such. (A violation of the Structure Preservation Property).
>    * Insert code that disables, removes, or changes critical APIs:
>      |Object.prototype.toString = function () { throw new Error(); };|
>    * Inject a |<noscript>| element around the |<head>|. (A violation of the
>      Structure Preservation Property).
>    * Fool heuristic XSS defense browser plugins into thinking that the security
>      code was injected by manipulating query parameters.
> Our proposal enforces this property by filtering <#glossary-filter> URLs to
> prevent any data hole from specifying an exotic protocol, by filtering CSS
> keywords, and by only allowing data holes in JavaScript contexts to specify
> simple boolean, numeric, and string values, or complex JSON values which cannot
> have free variables. We assume that the JavaScript interpreter will work on
> arbitrarily large inputs.
> Finally, our escapers <#glossary-escaper> are designed to produce output that
> avoids grammatical quirks such as semicolon insertion, non-ASCII newline
> characters, and regular-expression/division-operator/line-comment confusion.
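An escaper for JavaScript string contexts along these lines can be sketched in Python. This is an illustration only, not the escaper used in the paper: escape_js_string is a hypothetical name, and the sketch leans on json.dumps, whose default ensure_ascii=True already writes U+2028 and U+2029 as \u escapes.

```python
import json


def escape_js_string(value: str) -> str:
    """Return a double-quoted JS string literal for an inline <script>."""
    # json.dumps escapes quotes, backslashes, control characters and,
    # with ensure_ascii=True (the default), all non-ASCII code points,
    # including the JavaScript newlines U+2028 and U+2029.
    literal = json.dumps(value)
    # Hide HTML-significant characters so the literal cannot close the
    # surrounding <script> element or start a comment or entity.
    return (literal.replace("<", "\\u003c")
                   .replace(">", "\\u003e")
                   .replace("&", "\\u0026"))


escaped = escape_js_string("a\u2028b</script>")
print("var msg = " + escaped + ";")
```

The result is still valid JSON, so it decodes back to the original value; it is meant for script element bodies, and an event-handler attribute would additionally need HTML escaping.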
> Identifying all places in which a URL might appear in HTML (incl. MathML and
> SVG) is relatively easy compared to CSS. In CSS, it is difficult. For example,
> in |<div style="background: {$bg}">|, |$bg| might specify a URL, a color name, a
> color value like |#000|, a function-like color |rgb(0,0,0)|, a keyword value
> like |transparent|, or a combination of the above. Given how hard it is to
> reliably black-list URLs, when you know the content is a URL, we took the rather
> drastic approach of forbidding anything that might specify a colon in CSS data
> holes. This seems to affect very little in practice, and we could relax this
> constraint to allow colons preceded by a safe word like the name of an element,
> pseudo-element, or innocuous property. Even if we did, it is possible that
> existing code uses colons in data holes to specify list separators a la semantic
> HTML, and we would break that use case:
> ul.inline li { list-style: none; display: inline }
> ul.inline li:before { content: ': ' }  /* ', ' here would give a normal looking list. */
> ul.inline li:first-child:before { content: '' }
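The colon-forbidding rule for CSS data holes can be sketched as a filter in Python. The whitelist pattern, the innocuous replacement value, and the name filter_css_value are assumptions for illustration; the sketch is more conservative than the text requires, for instance it also rejects |rgb(0,0,0)|.

```python
import re

INNOCUOUS_CSS = "zzz"  # hypothetical innocuous replacement value

# Hex colors, bare keywords, and simple numeric quantities only.
_SAFE_CSS_VALUE = re.compile(
    r"^(?:#[0-9a-fA-F]{3,8}|[A-Za-z][A-Za-z0-9-]*|-?[0-9.]+(?:px|em|%)?)$")


def filter_css_value(value: str) -> str:
    """Filter: pass simple CSS values through, defang everything else."""
    candidate = value.strip()
    # Anything containing a colon might name a URL protocol or smuggle
    # in another property, so it is rejected outright.
    if ":" in candidate:
        return INNOCUOUS_CSS
    if _SAFE_CSS_VALUE.match(candidate):
        return candidate
    return INNOCUOUS_CSS


for bg in ["#000", "transparent", "url(javascript:alert(1))", "red; x: y"]:
    print(repr(bg), "->", repr(filter_css_value(bg)))
```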
> This property is a prerequisite for many application /privacy/ goals. If a
> third-party can cause script to run with the privileges of the origin, it can
> steal user data and phone home. Even if credentials are unavailable to
> JavaScript (HTTPOnly
> <http://www.codinghorror.com/blog/2008/08/protecting-your-cookies-httponly.html>
> cookies), scripts with same-origin privileges can screen scrape (using DOM APIs)
> user names and identifiers and associated page content and phone home.
> This property is also a prerequisite for many /informed consent/ goals. If a
> third-party script can install |onsubmit| handlers, it can rewrite form data
> before it is submitted with the XSRF tokens that are meant to ensure that the
> data submitted was specified by the user.
>      Least Surprise Property
> The last of the security properties that any auto-sanitization scheme should
> preserve is the property of least surprise. This is not a property that can be
> proven mathematically as it depends on developers' intuition, but it is
> nonetheless important.
> Developer intuition is important. A developer familiar with HTML, CSS, and
> JavaScript, who knows that auto-sanitization is happening should be able to look
> at a template and correctly infer what happens to dynamic values without having
> to read a complex specification document. Simple rules-of-thumb should be
> sufficient to understand the system. E.g. if a mythical average developer sees
> |<script>var msg = '{$msg}';</script>| and their intuition is that |$msg|
> /should/ be escaped using JavaScript style |\| sequences, and that is sufficient
> to preserve the other security properties, then that is what the system should
> do. Templates should be both easy to write and to code review.
> Exceptions to the system should be easily audited. SQL prepared statements are
> great, but there's no way to have exceptions to the rule without giving up the
> whole safety net, so sometimes developers work around them by concatenating
> strings. It's hard to |grep| (or craft presubmit triggers) for all the places
> where concatenated strings are passed to SQL APIs, so it's hard for a more
> senior developer to find these after the fact and explain how they can achieve
> their goal working within the system, notice a trend that points to a systemic
> problem with schemas, or agree that the exception to the rule is warranted and
> document it for future security auditors.
> Our proposal was designed with this goal in mind, but we have not managed to
> quantify our success. We can note that 1035 templates were converted within a
> matter of weeks without a flood of questions to the mailing lists we monitor, so
> we infer that most of the parts of the system that were heavily exercised were
> non-controversial. Different communities of developers may have different
> expectations. We worked with a group of developers most of whom knew Java, C++,
> or both before starting web application development, and among whom a high
> proportion have at least a bachelor's degree in CS or a related field. They may
> differ, intuition-wise, from developers who came to web development from a Ruby,
> Perl, or PHP background.
>    Alternate Approaches
> In this section we introduce a number of alternative proposals and explain why
> they perform worse on the metrics above. We cite real systems as examples of some of
> these alternatives. Many of these systems are well-thought out, reasonable
> solutions to particular problems their authors faced. We merely argue that they
> do not extend well to the criteria we outlined above and explicitly label these
> sections "strawmen" to clarify the difference between our design criteria and
> the contexts in which these systems arose. We do claim though that any
> comprehensive solution to XSS, at a tools level, should meet the criteria above.
>      Strawman 0: Manual sanitization
> Manual sanitization is the current state of the art. Developers use a suite of
> functions, such as OWASP's open source ESAPI encoders
> <http://code.google.com/p/owasp-esapi-java/source/browse/trunk/src/main/java/org/owasp/esapi/Encoder.java>
> and every developer must learn when and how to apply them correctly. They must
> apply sanitizers either before data reaches a template or within the template by
> inserting function calls into code.
> This places a significant burden on developers and does not guarantee any of the
> security properties listed above. One lapse can undo all the work put into
> hardening a website because of the all-or-nothing nature of the same-origin policy.
> There is a tradeoff between correctness and simplicity of API that works in the
> attacker's favor. Manual sanitization is particularly error-prone because
> developers learn /the good parts/ of the languages they work in, but attackers
> have available to them /the bad parts/ as well. The syntax of HTML, CSS, and
> JavaScript is much gnarlier than most developers imagine, and it is an
> unreasonable burden to expect them to learn and remember obscure syntactic
> corner cases. These corner cases mean that the typical suite of 4-6 escaping
> functions is the most that many developers can reliably choose from, but they
> are insufficient to handle corner cases or nested contexts.
> Changes in language syntax or vendor-specific extensions (e.g. XML4J and
> embedded SVG) may invalidate developers' previously valid assumptions. Code that
> was safe before may no longer be safe. With an automated system, a security
> patch and recompile may suffice, but a patch to code that took a team of
> developers years to write will take a team of developers to fix.
> XSS Scanners (e.g. lemon
> <http://googleonlinesecurity.blogspot.com/2007/07/automating-web-application-security.html>)
> can mitigate some of manual sanitization's cons (though they work with any of
> the other solutions here as well to provide defense-in-depth), but there are no
> good scanners for AJAX applications, and, with manual sanitization, scanners
> impose a continual burden on developers to respond to the reported errors.
>      Strawman I: Non-contextual auto-sanitization
> Context-less auto-sanitization is a great improvement over manual sanitization
> and is implemented in a number of languages including Django templates
> <http://code.djangoproject.com/wiki/AutoEscaping>.
> It works by assuming that every data hole should be sanitized the same way,
> usually by HTML entity encoding. As such, it is prone to over-escaping and
> mis-escaping.
> To understand mis-escaping, consider what happens when the following
> template is called with |', alert('XSS'), '| :
> <button onclick="setName('{$name}')">
> The template produces
> |<button onclick="setName('&apos;, alert(&apos;XSS&apos;), &apos;')">|
> which is exactly the same, to the browser, as
> |<button onclick="setName('', alert('XSS'), '')">|
> because the browser HTML entity decodes the attribute value /before/ invoking
> the JavaScript parser on it.
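> The decoding order described above can be checked with a short Python sketch.
> It assumes the standard library's html.escape and html.unescape as stand-ins
> for the template's entity encoder and the browser's attribute decoding;
> html.escape emits &#x27; rather than &apos;, but the effect is the same.

```python
import html

# Untrusted value from the example above.
name = "', alert('XSS'), '"

# A non-contextual auto-sanitizer entity-encodes every hole the same way.
encoded = html.escape(name, quote=True)
attr_value = "setName('" + encoded + "')"

# The browser entity-decodes the attribute value *before* handing it to the
# JavaScript parser; html.unescape stands in for that step here.
script_seen_by_js_parser = html.unescape(attr_value)
print(script_seen_by_js_parser)
```

> The encoded value contains no raw quotes, yet the script the JavaScript
> parser receives has the attacker's quotes and call restored.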
> Non-contextual auto-sanitization cannot preserve the structure preservation
> property for JavaScript, CSS, or URLs because it is unaware of those languages.
> It also fails to preserve the code effect property.
> Bolting filters on non-contextual auto-sanitization will not help it to preserve
> the code effect property. It is possible to write bizarre JavaScript that does
> not need even alphanumerics
> <http://securitymusings.com/article/2022/code-with-javascript-letters-and-numbers-optional>.
> Since JavaScript has no regular lexical grammar, regular expressions that are
> less than draconian are insufficient to filter out attacks.
> Non-contextual auto-sanitization, with auditable exceptions like Django's, does
> preserve the least surprise property in a sense. With very little training, a
> developer can predict exactly what it will do, and empirically, 74% of the time
> it does what they want (our system chose some kind of HTML entity encoding for
> 992 out of 1348 data holes).
>      Strawman II: Strict structural containment
> Examples of strict structural containment languages are XSLT
> <http://www.w3.org/TR/xslt>, GXP <http://code.google.com/p/gxp/>, and possibly
> XHP
> <http://www.facebook.com/notes/facebook-engineering/xhp-a-new-way-to-write-php/294003943919>.
> What they have in common is that the input is (or is coercible via fancy
> tricks to) a tree structure like XML. So for every data hole, it is obvious to
> the system which element and attribute context the hole appears in (modulo the
> dynamic element mechanisms noted at the end of this section). Then a
> similar structural constraint could be applied in principle to embedded
> JavaScript, CSS, and URIs.
> Strict structural containment is a sound, principled approach to building safe
> templates that is a great approach for anyone planning a new template language.
> It cannot be bolted onto existing languages though because it requires that
> every element and attribute start and end in the same template. This assumption
> is violated by several very common idioms, such as the header-footer
> <#header-footer-example> idiom, in ways that often require drastic changes to
> a codebase to repair.
> Since it cannot be bolted onto existing languages, limiting ourselves to it
> would doom most of the template code existing today to insecurity. Most project
> managers who know that their teams have trouble writing XSS-free code know this
> because they have existing code written in a language that does not have this
> property.
> Note on the claim that a hole's element and attribute context is obvious: this
> holds modulo mechanisms like |<xsl:element name="...">|
> <http://www.w3schools.com/xsl/el_element.asp>, which can, in principle, be
> repaired using equivalence classes of elements and attributes. I.e. one could
> define an equivalence class of elements all of whose attributes have the same
> meaning and which have the same content type: (TBODY, THEAD, TFOOT), (OL, UL),
> (TD, TH), (SPAN, I, B, U), (H1, H2, H3, …) and allow a dynamic element mechanism
> to switch between element types within the same equivalence class. Similar
> approaches can allow selecting among equivalent dynamic attribute types: all
> event handlers are equivalent (modulo perhaps those that imply user interaction
> for some applications).
>      Strawman III: A runtime typing approach
> Prior to this work, the best auto-sanitization scheme was a runtime scheme
> <http://googleonlinesecurity.blogspot.com/2009_03_01_archive.html>.
> A runtime contextual auto-sanitizer plugs into a template runtime at a low
> level. Instead of writing content to an output buffer, the template runtime
> passes trusted and untrusted chunks to the autoescaper. The template:
> <ul>{for $item in $items}<li onclick="alert('{$item}')">{$item}{/for}</ul>
> might produce the output in the Content column below. By propagating context
> at runtime, the autoescaper infers the context at the start of each chunk
> (Context column) and chooses the sanitization function in the last column
> before writing to the output buffer.
> Content                       Trusted  Context    Sanitization function
> |<ul>|                        Yes      PCDATA     none
> |<li onclick="alert('|        Yes      PCDATA     none
> |foo|                         No       JS string  escapeJSString
> |')">|                        Yes      JS string  none
> |foo|                         No       PCDATA     escapeHTML
> |<li onclick="alert('|        Yes      PCDATA     none
> |<script>doEvil()</script>|   No       JS string  escapeJSString
> |')">|                        Yes      JS string  none
> |<script>doEvil()</script>|   No       PCDATA     escapeHTML
> |</ul>|                       Yes      PCDATA     none
> This works, and with a hand-tuned C parser has been deployed successfully on
> CTemplates
> <http://google-ctemplate.googlecode.com/svn/trunk/doc/auto_escape.html> and
> ClearSilver <http://www.clearsilver.net/>.
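> The chunk-by-chunk process in the table above could be sketched as follows.
> This is a toy, not the CTemplates or ClearSilver implementation: the four
> state names, the propagate and render helpers, and the reduced escape table
> are all assumptions made for illustration, and only the |onclick| attribute
> and single-quoted JS strings are recognized.

```python
import html

# Reduced JS string escape table (illustrative subset).
JS_ESCAPES = {'"': "\\u0022", "'": "\\u0027", "<": "\\u003C", ">": "\\u003E",
              "/": "\\/", "\\": "\\\\", "&": "\\u0026", "\n": "\\n", "\r": "\\r"}

def escape_js_string(text):
    return "".join(JS_ESCAPES.get(ch, ch) for ch in text)

def propagate(chunk, context):
    """Toy context propagation over a trusted chunk (four states only)."""
    i = 0
    while i < len(chunk):
        ch = chunk[i]
        if context == "PCDATA" and ch == "<":
            context = "TAG"
        elif context == "TAG" and chunk.startswith('onclick="', i):
            context = "JS"
            i += len('onclick="') - 1  # skip to the opening quote
        elif context == "TAG" and ch == ">":
            context = "PCDATA"
        elif context == "JS" and ch == "'":
            context = "JS_STRING"
        elif context == "JS" and ch == '"':
            context = "TAG"
        elif context == "JS_STRING" and ch == "'":
            context = "JS"
        i += 1
    return context

def render(chunks):
    """chunks is a list of (text, trusted) pairs, as in the table above."""
    context, out = "PCDATA", []
    for text, trusted in chunks:
        if trusted:
            out.append(text)
            context = propagate(text, context)
        elif context == "JS_STRING":
            out.append(escape_js_string(text))
        elif context == "PCDATA":
            out.append(html.escape(text))
        else:
            raise ValueError("no sanitizer for context " + context)
    return "".join(out)

evil = "<script>doEvil()</script>"
result = render([("<ul>", True), ('<li onclick="alert(\'', True), (evil, False),
                 ("')\">", True), (evil, False), ("</ul>", True)])
print(result)
```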
> Writing a highly tuned parser in JavaScript is difficult, though, so
> implementing this scheme there requires a hard trade-off among flexibility,
> correctness, and download size/speed.
> Our proposal is a factor of 4 faster than a runtime scheme implemented in
> JavaScript and has no download size cost above and beyond the code for the
> sanitization functions and the calls to them.
> Even in languages for which there are efficient parser generators, runtime
> approaches might suffer performance-wise. The overhead for the static approach
> is independent of the number of times a loop is re-entered, so templates that
> take large array inputs might perform worse with even a highly efficient runtime
> scheme.
> Runtime sanitization is more elegant in at least one area, though. Dynamic
> tag and attribute names pose no problems to a runtime sanitizer. Whereas our
> scheme has to filter attribute names so that |$aname| cannot be |"onclick"| in
> |<button {$aname}=…>|, because a static approach must decide that the beginning
> of the attribute value is either a JavaScript context or some other context, a
> runtime approach can take into account the actual value of |$aname|. This is not
> a common problem, and our approach does handle many dynamic attribute situations
> including: |<button on{$handlerType}=…>|.
>      Strawman IV: A purely static approach
> We know of no purely static approaches, though they are possible. A purely
> static approach is one that, like our proposal, infers contexts at compile or
> load time, but does not take into account the runtime type of the values that
> fill the data holes.
> This approach has problems with over-escaping. Existing systems often use a mix
> of sanitization in-template and sanitization outside the template in the
> front-end code that calls the template.
> Our solution takes into account the runtime type of the values that fill a hole.
> If the runtime type marks the value as a known-safe string of HTML, then an HTML
> entity escaping sanitization function can use that information to decide not to
> re-escape, and instead normalize or do nothing.
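> A minimal sketch of a sanitization function that consults the runtime type
> is shown below. The SanitizedHtml wrapper mirrors the name used in the
> examples later in this document; the two function names and the use of
> Python's html.escape are assumptions for illustration.

```python
import html

class SanitizedHtml:
    """Hypothetical wrapper marking a string as known-safe HTML."""
    def __init__(self, content):
        self.content = content

def escape_html_pcdata(value):
    """In PCDATA, plain text is entity-encoded; safe HTML is left alone."""
    if isinstance(value, SanitizedHtml):
        return value.content  # "do nothing": already safe here
    return html.escape(str(value), quote=False)

def escape_html_rcdata(value):
    """In RCDATA, safe HTML is normalized: tags are escaped, '&' is kept."""
    if isinstance(value, SanitizedHtml):
        return value.content.replace("<", "&lt;").replace(">", "&gt;")
    return html.escape(str(value), quote=False)

print(escape_html_pcdata("I <3 ponies"))
print(escape_html_pcdata(SanitizedHtml("<b>Hello, World!</b>")))
print(escape_html_rcdata(SanitizedHtml("<b>Hello, World!</b>")))
```

> A purely static scheme has only the first branch available to it, which is
> where the over-escaping comes from.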
> See caveats <#caveats> for other problems that are as applicable to purely
> static systems as to our proposal.
>    Definitions and Algorithms
> This section is only relevant to implementors, testers, and others who want to
> understand the implementation. Everyone else, including web application
> developers, can ignore it.
> At a high level, the type system defines four things which are expanded upon below:
>   1. An initial start context for a public template. Typically |HTML_PCDATA|.
>   2. A context propagation algorithm which takes a chunk of literal text from
>      the template and the context at its start and returns the context at its
>      end. |(context * string) → context|.
>   3. An algorithm that chooses a sanitization function for a data hole. It
>      takes the context before the hole and returns a sanitization function and
>      the context after the hole. |context → ((α → string) * context)|. If data
>      holes have statically available type info, then the type could be taken
>      into account: |(context * type) → ((α → string) * context)|.
>   4. A context join operator that takes the contexts at the end of branches and
>      yields the context after the branches have joined. This is used to
>      determine the context at the end of a conditional |{if}| by joining the
>      context at the end of the then-branch with the context at the end of the
>      else-branch. It is also used with loops, where (unless proven otherwise)
>      we have to join the context at the start (loop never entered) with a
>      context once through, with a steady state context for many repetitions.
>      |context list → context|
> By contrast, the runtime auto-sanitization scheme described in strawman III has
> the same initial context, the same context propagation operator, no context join
> operator and uses a slightly differently shaped sanitization function chooser:
> |context → (α → (string * context))|.
>      Contexts
> A context captures the state of the parser in a combined HTML/CSS/JS lexical
> grammar. It is composed of a number of fields which pack into 2 bytes with room
> to spare:
>    * State: a coarse parser state that distinguishes between
>      CDATA/RCDATA/PCDATA and attributes in HTML; between comments, strings, and
>      regular expressions in JavaScript; and between comments, strings, and URLs
>      in CSS.
>    * Element type: when in an HTML tag (between |<| and |>|), keeps track of
>      whether the tag body is PCDATA, RCDATA, or CDATA; and once in an RCDATA or
>      CDATA tag body, used to keep track of the expected end tag, e.g. inside a
>      |<script>| body we have to find a |</script>| tag, but should ignore any
>      apparent |</style>| tags.
>    * Attribute type: the type of attribute we're in. Distinguishes between
>      script attributes (|onclick|, etc.), |style| attributes, URL attributes
>      (|href|, etc.), and other attributes.
>    * Attribute end delimiter: indicates the termination condition for the
>      attribute value we're in: double quoted, single quoted, unquoted, or none.
>    * JavaScript following slash: for JavaScript states, explains what to do
>      with a |/| that does not start a comment: enter a regular expression
>      literal, or a division operator, or fail with an error message due to
>      ambiguity from context joining.
>    * URI part: for URI states, the part of the URI that we're in: the start,
>      path, query, fragment, or an ambiguous part due to context joining.
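> One way to picture how the six fields above fit in two bytes is the sketch
> below. The bit widths and the short field names are assumptions chosen for
> illustration (they sum to 15 bits); the source does not give the actual
> layout.

```python
from typing import NamedTuple

# Hypothetical bit widths per field, most significant field first.
WIDTHS = {"state": 4, "el_type": 2, "attr_type": 2,
          "delim": 2, "slash": 2, "uri_part": 3}

class Context(NamedTuple):
    state: int = 0
    el_type: int = 0
    attr_type: int = 0
    delim: int = 0
    slash: int = 0
    uri_part: int = 0

    def pack(self) -> int:
        """Pack all fields into a single small integer."""
        packed = 0
        for name, width in WIDTHS.items():
            value = getattr(self, name)
            assert 0 <= value < (1 << width), name
            packed = (packed << width) | value
        return packed

    @classmethod
    def unpack(cls, packed: int) -> "Context":
        """Inverse of pack(): peel fields off from the low bits upward."""
        fields = {}
        for name, width in reversed(list(WIDTHS.items())):
            fields[name] = packed & ((1 << width) - 1)
            packed >>= width
        return cls(**fields)

ctx = Context(state=3, el_type=1, attr_type=2, delim=1, slash=0, uri_part=4)
print(hex(ctx.pack()), Context.unpack(ctx.pack()) == ctx)
```

> A compact integer encoding keeps contexts cheap to compare and to use as map
> keys, which matters when every data hole and template call records one.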
> Contexts support two operators: join and ε-commit.
> The join operator produces the context at the end of a condition, loop, switch,
> or other flow control construct. This sometimes introduces an ambiguity. In the
> template:
> <form action="{if $target}/{$name}/handle?tgt={$target}{else}/{$name}/default{/if}↑">Hello {$name}…
> One branch ends in the query portion of a URI, and one ends outside it. If there
> were a data hole at the ↑, then we would not be able to determine an appropriate
> sanitization function for it (see the note at the end of this section). So
> context joining often introduces just enough ambiguity, by using do-not-know
> values for fields, and in the common case, we later reach a point where we
> discard that info. In the URI case, if there were a |#| character at the ↑ we
> can reliably transition into a URI fragment context, and in any case, the end
> of the attribute moots the question.
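> The join behavior described above might look like the following sketch. The
> Ctx class, the string-valued fields, and the "UNKNOWN" marker are
> hypothetical; the real operator covers more fields and was tuned empirically.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Ctx:
    state: str
    uri_part: str = "NONE"

def context_join(a: Ctx, b: Ctx) -> Ctx:
    """Join the contexts at the ends of two branches."""
    if a == b:
        return a
    if a.state == b.state == "URI":
        # Branches disagree only on the URI part: record a do-not-know value.
        return replace(a, uri_part="UNKNOWN")
    raise ValueError(f"cannot join {a} and {b}")

def after_hash(ctx: Ctx) -> Ctx:
    """A '#' in static text resolves the ambiguity: we are in the fragment."""
    return replace(ctx, uri_part="FRAGMENT")

then_end = Ctx("URI", "QUERY")  # .../handle?tgt={$target}
else_end = Ctx("URI", "PATH")   # .../default
joined = context_join(then_end, else_end)
print(joined, after_hash(joined))
```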
> The ε-commit operator is used when we see a data hole. In some cases, we
> introduce parser states to delay decision making. In the template fragment, |<a
> href=|, we could see a quote character next, or space, or the start of an
> unquoted value, or the end of the tag (implying empty href), or a data hole
> specifying the start of an unquoted attribute value. If the next construct is a
> data hole we need to commit to it being an unquoted attribute. The ε-commit
> operator in this case goes from an HTML_BEFORE_ATTRIBUTE_VALUE state with an
> attribute end delimiter of NONE to a state appropriate to the value type (e.g.
> JS for an |onclick| attribute) with an attribute end delimiter of SPACE_OR_TAG_END.
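> As a rough sketch of the transition just described: the state and delimiter
> names HTML_BEFORE_ATTRIBUTE_VALUE, NONE, SPACE_OR_TAG_END, and JS come from
> the text above, while the Ctx class, the attribute type names, and the
> mapping table are assumptions.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Ctx:
    state: str
    attr_type: str = "NONE"
    delim: str = "NONE"

# Hypothetical mapping from attribute type to the state of its value.
STATE_FOR_ATTR_TYPE = {
    "SCRIPT": "JS",
    "STYLE": "CSS",
    "URI": "URI",
    "PLAIN_TEXT": "HTML_NORMAL_ATTR_VALUE",
}

def epsilon_commit(ctx: Ctx) -> Ctx:
    """Resolve a delayed decision when a data hole is encountered."""
    if ctx.state == "HTML_BEFORE_ATTRIBUTE_VALUE":
        # A hole right after `attr=` must be the start of an unquoted value.
        return replace(ctx, state=STATE_FOR_ATTR_TYPE[ctx.attr_type],
                       delim="SPACE_OR_TAG_END")
    return ctx  # nothing was being delayed

before = Ctx("HTML_BEFORE_ATTRIBUTE_VALUE", attr_type="SCRIPT")
print(epsilon_commit(before))
```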
> The precise details of both these operators were determined empirically to come
> up with the simplest semantics that handles cases found in real code that web
> developers do not consider to be badly written or confusing.
> Note: the ambiguity at such a data hole could be fixed by migrating the
> problematic data hole and the code leading up to it into each branch, but this
> is tricky to do across template boundaries and has not proven to be necessary
> for the codebase we migrated.
>      Grammar
> The context propagation algorithm uses a combined HTML/CSS and JS lexical
> grammar described below.
>        HTML
>        Attributes
>        JS
>        CSS
>        URI
>        DynamicText
> Converts plain text to HTML by entity encoding unless its type indicates it is
> known safe HTML.
>    * `I <3 ponies` → `I &lt;3 ponies`
>    * |new SanitizedHtml('<b>Hello, World!</b>')| → `<b>Hello, World!</b>`
> The first case is handled by encoding all PCDATA special characters (<, >, and
> &) as HTML entities (&lt;, &gt;, and &amp;). Other code-points may be escaped,
> but need not be.
> In the second case, the safe HTML is emitted as is. It must be a mixed group of
> complete tags and text nodes such that there exists a safe template that could
> have produced it starting from an HTML PCDATA context and ending in the same
> context, or there exists a safe HTML sanitizer that could have produced it.
>        DynamicRcdata
> Converts plain text to HTML by entity encoding unless its type indicates it is
> known safe HTML.
>    * `I <3 ponies` → `I &lt;3 ponies`
>    * |new SanitizedHtml('<b>Hello, World!</b>')| →
>      `&lt;b&gt;Hello, World!&lt;/b&gt;`
> The first case is handled by encoding all RCDATA special characters (<, >, and
> &) as HTML entities (&lt;, &gt;, and &amp;). Other code-points may be escaped,
> but need not be.
> In the second case, the safe HTML is normalized. All the HTML special characters
> are escaped except for ampersands (&), which are left as-is. Since all RCDATA
> end tags contain `<`, and `<` is escaped to a string that does not contain it,
> and no other code units are escaped to a string that contains it, no safe HTML
> chunk can cause premature ending of an RCDATA tag. This means that
> the odd but valid Soy template
> |<textarea>{$foo}<script>alert('Keystone kop');</script></textarea>|
> will not violate the structure security goal or the unauthored code security
> goal even when a chunk of safe HTML contains an RCDATA end tag like
> |</textarea>|.
>        DynamicTagName
> Allows through parts of non-CDATA, non-RCDATA tag names. So the Soy
> |<h{$headerLevel}>| can be used to generate |<h1>|, |<h2>|, …
> To avoid problems where a tag name might be combined with a static part to form
> |script|, |style|, or another |CDATA| or |RCDATA| tag, we impose the following
> restrictions:
>    * must contain only ASCII letters, digits, dashes and colons; and
>    * must
>          o contain a colon (a namespace), or
>          o contain a digit, or
>          o be the full name (case-insensitive) of a non-RCDATA, non-CDATA HTML
>            element.
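> The restrictions above could be checked along these lines. The allowlist of
> element names is deliberately short and is an assumption, as are the function
> name and the choice to raise an error on rejection.

```python
import re

# Small, non-exhaustive allowlist of non-RCDATA, non-CDATA HTML elements.
SAFE_FULL_NAMES = frozenset({
    "a", "b", "div", "i", "li", "ol", "p", "span",
    "table", "td", "th", "tr", "u", "ul",
})
VALID_CHARS = re.compile(r"[A-Za-z0-9:-]+")

def filter_tag_name_part(part: str) -> str:
    """Return `part` if it is safe to splice into a tag name, else raise."""
    if not VALID_CHARS.fullmatch(part):
        raise ValueError("illegal character in tag name part: %r" % part)
    if (":" in part                           # namespaced name
            or any(ch.isdigit() for ch in part)  # cannot help spell "script"
            or part.lower() in SAFE_FULL_NAMES):  # complete, known-safe name
        return part
    raise ValueError("tag name part could complete a special tag: %r" % part)

print(filter_tag_name_part("1"))  # e.g. for <h{$headerLevel}>
```

> Because none of |script|, |style|, or the other special element names
> contains a digit or a colon, a part that does contain one cannot complete
> such a name when concatenated with static text.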
>        DynamicAttrName
> Allows through parts of a non-special attribute name.
>    * `checked` → `checked`
>    * `<script>alert(pwned)</script>` → /error/
> TODO: scheme to avoid concatenation from producing |on|*, |style|, |href|, etc.
>        DynamicAttrValue
> Converts plain text to HTML by entity encoding so it can be embedded in an HTML
> attribute. If embedded in a quoteless attribute, then also encodes spaces.
> If the result is known safe HTML, strips tags so that the Soy |<abbr
> title="{$longDesc}">{$shortDesc}</abbr>| works even when both |$longDesc| and
> |$shortDesc| are snippets of sanitized HTML.
>    * `I <3 ponies` → `I &lt;3 ponies`
>    * |new SanitizedHtml('<b>Hello, World!</b>')| → `Hello, World!`
> The first case is handled by encoding all HTML special characters including
> quotes (<, >, &, ", ', and =) as HTML entities (&lt;, &gt;, &amp;, &quot;,
> &#39;, and &#61;).
> The second case is handled by stripping HTML tags and comments from the safe
> HTML, and then normalizing it by applying the same escaping scheme as for the
> first case, but without encoding ampersands (&).
> For both cases, when the HTML attribute is not quoted, we additionally have to
> quote all codepoints that would signal the end of an HTML attribute, including a
> number of space and control characters. This set was derived empirically, and
> includes the backtick (`) which can be used as a quoting character on some
> versions of IE.
>        DynamicJsString
> Escapes plain text so it can be incorporated into part of a JS string literal by
> escaping special characters, e.g. newline → |\n|.
>    * `John "The Anonymous" Doe` → `John \"The Anonymous\" Doe`
> We escape dynamic JS strings using the following table:
> Codepoint  Glyph  Escape
> U+000A     (LF)   \n
> U+000D     (CR)   \r
> U+0022     "      \u0022
> U+0027     '      \u0027
> U+002F     /      \/
> U+003C     <      \u003C
> U+003E     >      \u003E
> U+005C     \      \\
> U+2028     (LS)   \u2028
> U+2029     (PS)   \u2029
> These escapes prevent premature string closing, since all JS quote characters
> are encoded to a sequence that does not contain a quote character and no other
> codepoint is encoded to a sequence containing a quote character. This prevents
> additional JS syntax errors by properly encoding all JS newline codepoints. It
> preserves structure by encoding any sequences that would end a CDATA tag, CDATA
> section, escaping text span, or quoted HTML attribute value. The output can be
> embedded in an HTML attribute value by additionally escaping & to \u0026. In the
> case of unquoted HTML attribute values, just escaping ampersands is not
> sufficient; the output needs to be HTML entity escaped per DynamicAttrValue
> <#DynamicAttrValue>.
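> The escape table above translates directly into a lookup. In this sketch the
> function name and the in_html_attribute flag are assumptions; because it
> follows the table literally, quotes come out as \u0022 and \u0027 rather
> than as backslash-quote pairs.

```python
# Escapes from the table above, keyed by the character they replace.
JS_STRING_ESCAPES = {
    "\n": "\\n", "\r": "\\r",
    '"': "\\u0022", "'": "\\u0027",
    "/": "\\/", "<": "\\u003C", ">": "\\u003E",
    "\\": "\\\\",
    "\u2028": "\\u2028", "\u2029": "\\u2029",
}

def escape_js_string(text: str, in_html_attribute: bool = False) -> str:
    """Escape text for the inside of a quoted JS string literal."""
    out = []
    for ch in text:
        if in_html_attribute and ch == "&":
            out.append("\\u0026")  # keeps the HTML attribute decoder inert
        else:
            out.append(JS_STRING_ESCAPES.get(ch, ch))
    return "".join(out)

print(escape_js_string('John "The Anonymous" Doe'))
print(escape_js_string("</script>"))
```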
>        DynamicRegExp
> Like DynamicJsString <#DynamicJsString>, but additionally escapes characters
> special in regexp like ? and *.
>    * `John "The Anonymous" Doe + 1` → `John \"The Anonymous\" Doe \+ 1`
>        DynamicJsValue
> Quotes strings and encodes them like DynamicJsString <#DynamicJsString>, puts
> spaces around boolean, null, and numeric values.
>    * `John "The Anonymous" Doe + 1` → `"John \"The Anonymous\" Doe + 1"`
>    * `42` → `42 `
>    * `false` → `false `
> Putting spaces around non-string values makes sure that they will be separate
> tokens but will not introduce a function call in the case of the Soy template
>       |var f = function () {}  // Missing semicolon.
>       {$myBoolean}&& sideEffect();|
> where due to semicolon insertion, adding parentheses would cause the template to
> produce the equivalent of
>       |var f = ((function () {})(false))&& sideEffect();|
> given |{ myBoolean: false }|.
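> A sketch of this value-level behavior follows. It assumes Python's json.dumps
> as a stand-in for the string quoting step (it is not the DynamicJsString
> table), adds angle-bracket escaping on top, and pads non-string values with
> a space on both sides.

```python
import json

def escape_js_value(value):
    """Render a runtime value as a JS expression that stays a single token."""
    if isinstance(value, str):
        quoted = json.dumps(value)  # adds quotes, escapes " and \
        return quoted.replace("<", "\\u003C").replace(">", "\\u003E")
    if value is None or isinstance(value, (bool, int, float)):
        # Spaces, not parentheses, so no function call can be introduced.
        return " " + json.dumps(value) + " "
    raise TypeError("unsupported value type: %s" % type(value).__name__)

print(escape_js_value('John "The Anonymous" Doe + 1'))
print(repr(escape_js_value(False)))
```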
>        DynamicCssString
> Escapes plain text so it can be incorporated into part of a CSS string literal
> by escaping special characters, e.g. newline → |\a |.
>    * `John "The Anonymous" Doe` → `John \22 The Anonymous\22 Doe`
> We encode *all* CSS special characters using CSS hex escaping. CSS hex escaping
> allows an escape to be followed optionally by a space or tab character so that
> an escape may be followed by an unescaped hex digit. We always emit a following
> space.
> We aggressively encode all CSS special characters to prevent unspecified CSS
> error recovery <http://www.w3.org/TR/css3-syntax/#error> from restarting parsing
> inside quoted strings.
>            9.2.1. Error conditions
>    In general, this document does not specify error handling behavior for user
>    agents (e.g., how they behave when they cannot find a resource designated by
>    a URI).
>    However, user agents must observe the rules for handling parsing errors.
>    Since user agents may vary in how they handle error conditions, authors and
>    users must not rely on specific error recovery behavior.
> We also escape both angle brackets (< and >) (which are already CSS specials) so
> that HTML escaping text spans, CDATA sections, CDATA end tags, etc. cannot be
> introduced into the middle of CSS strings.
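> CSS hex escaping with a mandatory trailing space could be sketched as below.
> The set of characters passed through unescaped is an assumption (the
> document does not enumerate the CSS specials), and everything outside it,
> including non-ASCII text, is escaped conservatively.

```python
import string

# Characters emitted as-is (assumed safe set); everything else is hex-escaped.
CSS_SAFE = frozenset(string.ascii_letters + string.digits + " -_.")

def escape_css_string(text: str) -> str:
    """Escape text for the inside of a quoted CSS string literal."""
    out = []
    for ch in text:
        if ch in CSS_SAFE:
            out.append(ch)
        else:
            # Always emit the optional trailing space so that a following
            # hex digit is never absorbed into the escape.
            out.append("\\%x " % ord(ch))
    return "".join(out)

print(escape_css_string('John "The Anonymous" Doe'))
print(escape_css_string("</style>"))
```

> Note that the literal space after the closing quote is preserved in addition
> to the escape's own terminator, so two spaces appear before "Doe".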
>        DynamicCssQuantityOrKeywordOrName
> Allows a CSS keyword, quantity, or ID or class name through, but filter content
> containing special characters. Some use cases:
>    * |color: #{$hashColor}|
>    * |color: {$colorName}|
>    * |border-{$rtlLeft}: … /* left for English, right for Arabic */|
>    * |div.{$className} { … }|
>    * |width: {$width}{$widthUnits}|
> Some example values:
>    * `24px` → `24px`
>    * `left` → `left`
>    * `background` → `background`
>    * `expression` → /error/
> TODO: explain the allowed set and its derivation.
>        DynamicSchemeFilteredUriPart
> Whitelists a protocol if present to prevent code execution via |javascript:…|,
> and normalizes the URI (encoding all unencoded HTML special characters, quotes,
> spaces, and parentheses) so it can be embedded. E.g. `"` → %22.
> URI normalization percent escapes all codepoints escaped by DynamicQueryPart
> <#DynamicQueryPart> except for the percent character (%).
> TODO: Explain the filter details and their derivation.
>        DynamicQueryPart
> Encodes all characters that are special or disallowed in a URI.
> We encode all codepoints encoded by |encodeURIComponent| making the same
> assumption that the URL is UTF-8 encoded.
> Over |encodeURIComponent|, we additionally encode single quotes (') and
> parentheses (( and )) so that the result can be safely embedded in single quoted
> HTML attributes and in single quoted and unquoted CSS |url(…)| constructs. Note
> that applying an extra level of CSS escaping using |\27 | style escapes is not
> an option since IE (for interoperability with DOS file paths?) does not
> interpret |\| as the beginning of an escape when it appears inside a |url(…)|.
> Each of these characters is significant in a URI as specified in RFC 3986:
>            2.2 <http://www.apps.ietf.org/rfc/rfc3986.html#sec-2.2> Reserved
>            Characters
>    sub-delims  = "!" / "$" / "&" /_"'" / "(" / ")"_
> so escaping them is technically not semantics preserving, but encoding them is
> safe for all schemes that commonly appear in HTML because those codepoints only
> appear in the obsolete mark productions.
>            D.2 <http://www.apps.ietf.org/rfc/rfc3986.html#sec-D.2> Modifications
>    The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of
>    [RFC2234]. This change required all rule names that formerly included
>    underscore characters to be renamed with a dash instead. In addition, a
>    number of syntax rules have been eliminated or simplified to make the
>    overall grammar more comprehensible. Specifications that refer to the
>    obsolete grammar rules may be understood by replacing those rules according
>    to the following table: …
>    mark    "-" / "_" / "." / "!" / "~" / "*" /_"'" / "(" / ")"_
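> The DynamicQueryPart encoding described above can be approximated in Python
> with urllib.parse.quote. The function name is an assumption; the safe set is
> chosen so that the result matches |encodeURIComponent| except that single
> quotes and parentheses are also percent-encoded.

```python
from urllib.parse import quote

def escape_query_part(text: str) -> str:
    """Percent-encode text for use inside a URI query component."""
    # quote() never encodes ASCII letters, digits, or "_.-~". Adding "!*"
    # covers the rest of what encodeURIComponent leaves alone, minus the
    # single quote and parentheses, which are deliberately encoded here.
    return quote(text, safe="!*~", encoding="utf-8")

print(escape_query_part("O'Reilly (x)&y=1"))
```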
>        DynamicUriPart
> Normalizes the URI like URI normalization <#DynamicSchemeFilteredUriPart> so
> that an already encoded path or fragment can be emitted inline but does not
> filter since a protocol part cannot appear here.
>      Context Propagation
> The context propagation algorithm uniquely determines the context at every data
> hole so that a later pass may choose a sanitization function for each hole.
> The algorithm operates at two levels, one on the graph of templates, and another
> individually within templates.
> The first deals with identifying the minimal set of templates that need to be
> processed, and might clone templates to deal with templates that are called in
> multiple different contexts.
> The template context propagation algorithm uses an inference object which is
> implemented as a set of nested maps and a pointer to a parent inference object.
> This allows us to speculatively type a template sub-graph, and when we have a
> consistent view of types, we can collapse our conclusions into the parent by
> simply copying maps from children to parent. The maps include maps from holes to
> start contexts, and from templates to end contexts (used to type calls).
> def autosanitize(templates):
>   inferences = Inferences()
>   for template in templates:
>     if inferences.getEndContext(template) is not None:
>       continue  # already done
>     if template.is_public() or template.is_contextually_autosanitized():
>       # By exploring the call graph from only public templates, ones
>       # that can be invoked by front-end code, or ones that must be
>       # contextually sanitized, we do not trigger error checks for
>       # parts of the code-base that don't yet use contextual
>       # auto-sanitization, easing migration.
>       # HTML_PCDATA is a simplifying assumption; see the note on start
>       # contexts at the end of this section.
>       compute_end_context(template, inferences, start_context=HTML_PCDATA)
>   return inferences
> That algorithm delegates all the hard work to another algorithm below that
> examines the template graph reachable from one particular top-level template.
> def compute_end_context(template, inferences, start_context):
>   # First, assume that the end context is the same as the start context.
>   # Template authors seem to write templates that fit this way.
>   # Empirically, less than 0.2% of templates in our sample violate
>   # this assumption.
>   # The ones that do tend to be some of the gnarliest code that
>   # template authors would rather not refactor.
>   # We need to choose an end context now to avoid infinite regress
>   # if a template recurses.
>   # Start with the optimistic assumption that the above is true.
>   optimistic_assumption_1 = Inferences(parent=inferences)
>   optimistic_assumption_1.template_end_contexts[template] = start_context
>   end_context = propagate_context(
>       template.children, start_context, optimistic_assumption_1)
>   if start_context == end_context:
>     # Our optimistic assumption was warranted.
>     optimistic_assumption_1.commit_into_parent()
>     return end_context
>   # Otherwise, assume that the end_context above is the end_context
>   # and check that we have reached a fixed point.
>   optimistic_assumption_2 = Inferences(parent=inferences)
>   optimistic_assumption_2.template_end_contexts[template] = end_context
>   end_context_fixed_point = propagate_context(
>       template.children, start_context, optimistic_assumption_2)
>   if end_context_fixed_point == end_context:
>     # We have found a fixed point.  Phew!
>     optimistic_assumption_2.commit_into_parent()
>     return end_context_fixed_point
>   # There are various other strategies we could try here, but
>   # we have not seen a need in real template code.
>   raise Error(...)
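> The inference object that both routines above rely on (nested maps plus a
> parent pointer) might be sketched as follows. The attribute names
> template_end_contexts and context_for_data_hole and the commit_into_parent
> method follow the pseudocode; get_end_context and everything else about the
> class are assumptions.

```python
class Inferences:
    """Speculative typing results, layered over an optional parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.template_end_contexts = {}   # template -> end context
        self.context_for_data_hole = {}   # hole -> start context

    def get_end_context(self, template):
        """Look up a template's end context here, then in ancestors."""
        scope = self
        while scope is not None:
            if template in scope.template_end_contexts:
                return scope.template_end_contexts[template]
            scope = scope.parent
        return None

    def commit_into_parent(self):
        """Collapse consistent conclusions into the parent by copying maps."""
        self.parent.template_end_contexts.update(self.template_end_contexts)
        self.parent.context_for_data_hole.update(self.context_for_data_hole)

root = Inferences()
speculative = Inferences(parent=root)
speculative.template_end_contexts["tmpl"] = "HTML_PCDATA"
print(root.get_end_context("tmpl"))   # not yet visible in the parent
speculative.commit_into_parent()
print(root.get_end_context("tmpl"))
```

> Dropping a speculative child without committing it leaves the parent
> untouched, which is what lets the second optimistic assumption start clean.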
> Thus far, we have done nothing that is particular to the syntax of the templating
> language itself. Different languages have different semantics around parameter
> passing, and provide different flow control constructs. The algorithm below is
> an example for one that deals with a simple template language that provides
> calls, conditions, chunks of static template text, and expression interpolations
> which fill data holes. On a call, it may recurse to the compute end context
> algorithm above, which is how we lazily explore the portion of the template call
> graph needed.
> def propagate_context(parse_tree_nodes, context, inferences):
>   for parse_tree_node in parse_tree_nodes:
>     if is_safe_text_node(parse_tree_node):
>       context = apply_html_grammar(parse_tree_node.safe_text, context)
>     elif is_data_hole(parse_tree_node):
>       context = epsilon_commit(context)  # see definition above
>       inferences.context_for_data_hole[parse_tree_node] = context
>       context = ...   # compute context after hole.
>     elif is_conditional(parse_tree_node):
>       if_context = propagate_context(parse_tree_node.if_branch, context, inferences)
>       else_context = propagate_context(parse_tree_node.else_branch, context, inferences)
>       context = context_join(if_context, else_context)
>     elif is_call_node(parse_tree_node):
>       output_context = None
>       # possible_callees comes up with the templates this might be calling,
>       # and may clone templates if they are called in multiple different contexts.
>       # Most template languages have static call graphs, so in practice, there is
>       # exactly one possible callee.
>       for possible_callee in possible_callees_of(parse_tree_node, context):
>         if possible_callee not in inferences.template_end_contexts:
>           context_after_call = compute_end_context(possible_callee, inferences, context)
>         else:
>           context_after_call = inferences.template_end_contexts[possible_callee]
>         if output_context is None:
>           output_context = context_after_call
>         else:
>           # Since 99% of templates end in their start context, in practice,
>           # this join does little.
>           output_context = context_join(output_context, context_after_call)
>       context = output_context
>   return context
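
The pseudocode above relies on context_join without defining it in this section. A minimal sketch of the strictest possible join follows, assuming contexts are plain comparable values; the ContextJoinError name and the "fail on any mismatch" policy are assumptions for illustration, and a real implementation may reconcile some differing contexts instead of rejecting them.

```python
class ContextJoinError(Exception):
    """Raised when two branches end in contexts that cannot be reconciled."""


def context_join(a, b):
    # Hypothetical policy: branches that end in the same context join to
    # that context; anything else is reported as a template error.
    if a == b:
        return a
    raise ContextJoinError(
        "branches end in different contexts: %r vs %r" % (a, b))
```

With this policy, a conditional whose branches both end in HTML_PCDATA continues in HTML_PCDATA, while one branch that leaves an attribute open is rejected at compile time.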
> Note: We make the simplifying assumption that the start context for all public
> templates is HTML_PCDATA. Some templating languages may be used in different
> contexts, and so this assumption might not prove valid. We could choose the
> starting context for public templates based on some kind of annotation or naming
> convention particular to the templating language.
>      Sanitization Functions
> We define a suite of sanitization functions. The table below briefly describes
> each one and the context in which it is used. There are significantly more of
> them than in most manual escaping schemes. As noted above, most developers who
> don't work on parsers for HTML/CSS/JS have a simplified mental model of the
> grammar, which makes it difficult to choose between this many options. We have
> many sanitization functions because we want to minimize template output size to
> minimize network latency; having more sanitization functions lets us avoid
> escaping common characters like spaces when safe. The naming convention for
> sanitization functions reflects the escaper <#glossary-escaper>, filter
> <#glossary-filter>, and normalizer <#glossary-normalizer> definitions from the
> glossary.
> |escapeHTML|    HTML entity escapes plain text, and allows pre-sanitized HTML
> content through unchanged.
> |normalizeHTML|    Normalizes HTML. Same as |escapeHTML|, but does not encode
> ampersands.
> |{escape,normalize}HTMLRcdata|    Like |escapeHTML| but does not allow
> pre-sanitized HTML content through unchanged since tags are not allowed in
> RCDATA contexts, |<title>| and |<textarea>|.
> |{escape,normalize}HTMLAttribute|    Like |escapeHTML| but strips tags from
> pre-sanitized HTML content instead of passing it through unchanged, since tags
> are not allowed in attribute values.
> |filterHtmlElementName|    Rejects any invalid element name or non PCDATA element.
> |filterHtmlAttribName|    Rejects any invalid attribute name or attribute name
> that has JS, CSS, or URI content.
> |{escape,normalize}URI|    Percent encodes (assuming UTF-8) URI, HTML, JS, and CSS
> special characters so that the URL can be safely embedded. This means encoding
> parentheses and single quotes which should not be normalized according to RFC
> 3986, and is not valid for all non-hierarchical URI schemes, but the only
> productions using single quotes or parentheses are obsolete marker productions,
> and normalizing these characters is essential to safely embedding URIs in
> unquoted CSS |url(...)| and to make sure that CSS error recovery mode doesn't jump
> into the middle of a quoted string.
> |filterNormalizeURI|    Like |normalizeURI| but first rejects any input that might
> embed a protocol other than |http|, |https|, or |mailto|.
> |{escape,normalize}JSStringChars|    Uses |\\| and |\uABCD| style escapes for any
> code-units special in HTML, JS, or conditional compilation directives.
> |{escape,normalize}JSRegexChars|    Like |{escape,normalize}JSStringChars| but
> also escapes regular expression special characters like |'$'|.
> |{escape,normalize}JSValue|    Encodes a boolean or a number to the string
> representation of that surrounded by spaces. Otherwise escapes a string value
> and wraps it in quotes.
> |escapeCSSStringChars|    Uses |\ABCD| style escapes to escape HTML and CSS
> special characters.
> |filterCssIdentOrValue|    Allows CLASSes and IDs for CSS selectors, parts of
> property names necessary for many BIDI applications, CSS keyword values, color
> literals, and quantities. But disallows property names that might nest
> javascript, and disallows URL schemes.
> |noAutoescape|    Passes its input through unchanged. This is an auditable
> exception to auto-sanitization.
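
To make the escaper/filter distinction in the table concrete, here is a minimal Python sketch of one escaper and one filter. It is an illustration under stated assumptions, not the implementation described above: the function names are Python-style stand-ins, the "about:invalid" replacement value and the exact set of characters left unencoded are choices made for this example.

```python
import html
import re
from urllib.parse import quote

# Accept an explicit http, https, or mailto scheme, or a value with no
# scheme at all (no colon before the first '/', '?', or '#').
_APPROVED_URI = re.compile(
    r"^(?:(?:https?|mailto):|[^&:/?#]*(?:[/?#]|$))", re.IGNORECASE)


def escape_html(text):
    # Escaper: every input is accepted and encoded as HTML text.
    return html.escape(text, quote=True)


def filter_normalize_uri(uri):
    # Filter: inputs that might carry another protocol are rejected outright.
    if not _APPROVED_URI.match(uri):
        return "about:invalid"  # hypothetical innocuous replacement
    # Normalizer: percent-encode (as UTF-8) anything outside this set, which
    # deliberately excludes quotes and parentheses; '%' is kept so existing
    # escapes are not double-encoded.
    return quote(uri, safe="/:?#@!$&*+,;=%[]")
```

For example, filter_normalize_uri("javascript:alert(1)") is rejected, while a URL containing a space, parentheses, or single quotes is passed through with those characters percent-encoded.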
>      Sanitized Content Types
> Sanitized content types allow template users to pre-sanitize some content, and
> to pass approved structured content through.
> |new SanitizedContent('<b>Hello, World!</b>')| specifies a chunk of HTML that
> the creator asserts is safe to embed in HTML PCDATA.
> It is possible for misuse of this feature to violate all the safety properties
> contextual auto-sanitization provides. We assert that allowing this makes it
> easier to migrate code that has no XSS safety net to a better place, and
> satisfies some compelling use cases. But it needs to be used carefully.
> Developers should heed this advice:
>    * Don't roll your own escapers. If you find them in existing code, prefer
>      escaping in the template via the contextual auto-sanitization. This does
>      not apply to filters. Filter early, and filter often.
>    * Put the sanitized content type constructor as close as possible to the
>      code that does the sanitization.
>    * Don't use tag or attribute black-lists.
>    * Be skeptical of "safe" HTML from a database. This is a vector for SQL
>      Injection to turn into XSS.
> Compelling use cases include:
>    * HTML from a trusted source such as translators who are translating strings
>      into foreign languages. Consider using a template system that supports
>      text L10N directly.
>    * HTML from tag whitelisters, wiki-text-to-html converters, rich text
>      editors, etc.
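
A minimal sketch of how a sanitized content type might interact with the HTML escaper, assuming a Python port: the SanitizedContent class here only mirrors the constructor shown above, and escape_html is a stand-in for the |escapeHTML| function, not the actual implementation.

```python
import html


class SanitizedContent:
    """Wraps a string that its creator asserts is safe HTML PCDATA."""

    def __init__(self, content):
        self.content = content


def escape_html(value):
    # Pre-sanitized HTML passes through unchanged; the safety claim rests
    # entirely on whoever constructed the wrapper.
    if isinstance(value, SanitizedContent):
        return value.content
    # Everything else is treated as plain text and entity-escaped.
    return html.escape(str(value), quote=True)
```

Because the wrapper is a distinct type, every place that constructs it can be found and audited, which is why the advice above asks for the constructor to sit next to the sanitizing code.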
>    Caveats
> As noted above (in the runtime contextual auto-sanitization strawman), static
> approaches (including ours) cannot handle all possible uses of dynamic attribute
> and element names. These seem rare in real code, and relatively easy to fix, but
> if necessary, a hybrid runtime/static approach could address this problem.
> Static approaches get into corner cases around zero-length untrusted values. For
> example, to preserve the code effect property <#code_effect_property>, we need
> to make sure that no untrusted value specifies a |javascript:| or similar URL
> protocol. In template code like |<img src="{$x}{$y}">| we might naively decide
> that it is sufficient to filter |$x| to make sure that it specifies no protocol
> or an approved one. But if |$x| is the empty string, then |$y| might still
> specify a dangerous protocol. Alternatively |$x| might specify |"javascript"|
> and |$y| start with a colon. This hole can be closed a number of ways, but is a
> source of considerable complexity because the two interpolations might cross
> template boundaries. Another zero-length problem arises in JavaScript regular
> expressions: in |var myPattern = /{$x}/| an empty |$x| would turn the regular
> expression literal into a line comment.
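
The two failure modes for |<img src="{$x}{$y}">| can be reproduced with a small Python sketch. The filter below is a hypothetical protocol check written for this example (it is not the filter used by the system described above), and render_src models the naive decision to filter only |$x|.

```python
import re

# Hypothetical check: allow http, https, mailto, or no protocol at all.
_APPROVED = re.compile(
    r"^(?:(?:https?|mailto):|[^&:/?#]*(?:[/?#]|$))", re.IGNORECASE)


def naive_filter(value):
    return value if _APPROVED.match(value) else "about:invalid"


def render_src(x, y):
    # Naive strategy: only the first interpolation is filtered.
    return naive_filter(x) + y


# Case 1: $x is empty, so $y supplies the whole protocol.
empty_x = render_src("", "javascript:alert(1)")

# Case 2: $x looks like a harmless relative path; $y supplies the colon.
split_protocol = render_src("javascript", ":alert(1)")
```

In both cases the concatenated attribute value is a javascript: URL even though |$x| passed the filter on its own, which is why the check has to account for adjacent interpolations rather than each value in isolation.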
> Our JavaScript parser is unsound. JavaScript is a language that does not have a
> regular lexical grammar (even ignoring conditional compilation) because of the
> way it specifies whether a |/| starts a regular expression or a division
> operator. We use a scheme based on a draft JavaScript 1.9 grammar devised by
> Waldemar Horwat that makes that decision based on the last non-comment token.
> This works well for all the code we've seen that people actually write, and
> makes our approach feasible, but there is a known case where it fails: |x++
> /a/i| vs |x = ++/a/i|. The second code snippet, while nonsensical, is valid
> JavaScript that our scheme fails to handle correctly.
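
A minimal sketch of a last-token heuristic, to show why the known failure is unavoidable for any scheme of this shape. The token sets below are assumptions chosen for the example and are far from a complete JavaScript lexer; they are not taken from the grammar mentioned above.

```python
import re

# Hypothetical token classes for illustration only.
_KEYWORDS_BEFORE_REGEX = {
    "return", "typeof", "case", "in", "delete", "void", "throw", "new",
    "do", "else", "instanceof",
}
_PUNCTUATION_BEFORE_DIVISION = {")", "]", "++", "--"}
_IDENTIFIER_OR_NUMBER = re.compile(r"^[\w$.]+$")


def slash_starts_regex(last_token):
    """Guess whether '/' starts a regex, given the last non-comment token."""
    if last_token is None:                       # start of input
        return True
    if last_token in _KEYWORDS_BEFORE_REGEX:     # e.g. return /a/i
        return True
    if last_token in _PUNCTUATION_BEFORE_DIVISION:
        return False                             # e.g. (a + b) / 2
    if _IDENTIFIER_OR_NUMBER.match(last_token):  # e.g. x / 2
        return False
    return True                                  # other operators, e.g. x = /a/i


# In both "x++ /a/i" and "x = ++/a/i" the token before the slash is "++",
# so a decision based on that token alone must give the same answer twice.
after_postfix = slash_starts_regex("++")  # x++ /a/i    (division is right)
after_prefix = slash_starts_regex("++")   # x = ++/a/i  (regex is right)
```

Whichever answer is chosen for "++", one of the two snippets is misclassified; this sketch, like the scheme described above, sides with the form people actually write.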
> Our parser does not currently recognize HTML5 escaping text spans
> <http://dev.w3.org/html5/markup/aria/syntax.html#escaping-text-span>, the
> regions inside |<script>| and |<style>| bodies delimited by |<!--| and |-->|
> that suppress end-tag processing. This can be fixed if a codebase seems to use
> them. Our sanitization function choices are designed to not produce content
> containing escaping text span boundaries.
>    Case Study
> We studied 1035 templates that were migrated from an existing codebase to use
> contextually sanitized templates. Most of the templates were relatively small
> but totalled 21098 LOC and 783 kB. The compilation load time cost for these 1035
> templates was 998339279 ns (about 1 s) on a platform with 2 GB of RAM and an
> Intel 2.6 GHz dual-core processor running Linux 2.6.31.
> 1- 18    ######################################## (685)
> 19- 36    ############ (210)
> 37- 55    #### (78)
> 56- 73    # (33)
> 74- 91    (10)
> 92- 110    (7)
> 111- 128    (4)
> 129- 147    (3)
> 148- 165    (1)
> 166- 183    (1)
> 184- 202    (1)
> 203- 220    (1)
> 221- 238    (0)
> 239- 257    (0)
> 258- 275    (0)
> 276- 294    (0)
> 295- 312    (1)
> Most of the sanitization functions chosen were plain text to HTML escapers, the
> same ones that non-contextual auto-sanitization would choose.
> |escapeHtml|    602
> |escapeHtmlAttribute|    380
> |filterNormalizeUri|, |escapeHtmlAttribute|    231
> |escapeJsValue|    39
> |filterCssValue|    33
> |escapeJsString|    27
> |escapeUri|    15
> |escapeHtmlRcdata|    10
> |escapeHtmlAttributeNospace|    7
> |filterHtmlIdent|    3
> |filterNormalizeUri|    1
> 268 out of 1348 interpolation sites (19.9%) require runtime filtering, mostly
> |filterNormalizeUri|.
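
The 268 and 19.9% figures can be re-derived from the table above. The short Python check below copies the counts from that table and assumes that "requires runtime filtering" means the chain of sanitization functions at a site includes at least one whose name starts with "filter".

```python
# (chain of sanitization functions, number of interpolation sites)
site_counts = [
    (("escapeHtml",), 602),
    (("escapeHtmlAttribute",), 380),
    (("filterNormalizeUri", "escapeHtmlAttribute"), 231),
    (("escapeJsValue",), 39),
    (("filterCssValue",), 33),
    (("escapeJsString",), 27),
    (("escapeUri",), 15),
    (("escapeHtmlRcdata",), 10),
    (("escapeHtmlAttributeNospace",), 7),
    (("filterHtmlIdent",), 3),
    (("filterNormalizeUri",), 1),
]

total_sites = sum(count for _, count in site_counts)
filtered_sites = sum(
    count for chain, count in site_counts
    if any(name.startswith("filter") for name in chain))
percent_filtered = round(100 * filtered_sites / total_sites, 1)
```

Under that assumption the totals come out to 1348 sites, 268 of them filtered, matching the figures quoted above.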
> The benchmark runs over a large template with dummy data that is meant to be
> representative of the application using it. The benchmarks range from 15.2 ms to
> 16.8 ms and the standard deviation reaches roughly 6 ms, which puts the runtime
> cost of the sanitization functions in the noise.
> No sanitization
> ====
> 50% Scenario 16709334.99 ns; σ=615548.54 ns @ 10 trials
> Non-contextual auto-sanitization
> ====
> 50% Scenario 16835324.39 ns; σ=6030836.03 ns @ 10 trials
> Full contextual auto-sanitization
> ====
> 50% Scenario 15227861.39 ns; σ=616193.00 ns @ 10 trials
> In JavaScript, a state-machine based runtime contextual auto-sanitization
> approach shows a 3-4 times slowdown over string concatenation.
> # rows    string +=    Array.join    open(Template(...))    DOM        render time
> 1000      54 ms        68 ms         204 ms                 508 ms     586 ms
> 5000      267 ms       332 ms        1159 ms                2528 ms    1458 ms
> We ran the same benchmark against a runtime contextual auto-sanitizer we wrote
> for JavaScript. The "noEscape" case simply appends all the strings to a buffer.
> It does no context inference. The "parseOnly" case appends to a buffer and does
> context inference, but does no escaping. The "dynEscape" case does context
> propagation and chooses one of three escaping methods by looking at the context
> from the parser. The cost of applying the escaping directive is about the same
> as a string copy, and the cost of parsing and propagating context at runtime is
> about 6 times that cost. This benchmark is a good comparison for templates where
> the logic that computes values to fill data holes is simple, so the cost of
> executing the template should approach string concatenation.
> Totals for 1000 runs:
> noEscape   :    491316000 ns  (1.0)
> parseOnly  :   2979672000 ns  (6.1)
> dynEscape  :   3531971000 ns  (7.2)
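
The ratios in parentheses, and the claim that escaping costs about as much as a string copy, follow from the three totals. A short Python check using only the numbers listed above (the variable names are ours):

```python
# Totals for 1000 runs, in nanoseconds, as listed above.
no_escape_ns = 491316000
parse_only_ns = 2979672000
dyn_escape_ns = 3531971000

# Each case relative to plain buffer appends.
parse_only_ratio = round(parse_only_ns / no_escape_ns, 1)
dyn_escape_ratio = round(dyn_escape_ns / no_escape_ns, 1)

# Extra cost of choosing and applying an escaper on top of parsing,
# expressed in units of one "noEscape" run (a string copy).
escaping_overhead = round((dyn_escape_ns - parse_only_ns) / no_escape_ns, 1)
```

The difference between "dynEscape" and "parseOnly" is close to one "noEscape" run, which is the basis for saying that applying the escaping directive costs about the same as a string copy.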
> --------------------------------------------------------------------------------
> Last modified: Wed Feb 23 17:06:20 EST 2011
> _______________________________________________
> Developer-outreach mailing list
> Developer-outreach at lists.owasp.org<mailto:Developer-outreach at lists.owasp.org>
> https://lists.owasp.org/mailman/listinfo/developer-outreach
