[Developer-outreach] Modifying the custom tags and docs for the form, bean:write, and other output tags in SpringMVC, Struts, Wicket, etc and JavaScript Frameworks

dinis cruz dinis.cruz at owasp.org
Mon Feb 28 11:46:08 EST 2011


A great next step would be to map the ideas and concepts in Mike Samuel's
paper with the great work that Microsoft has done with the AntiXss library,
namely the Security Runtime Engine (SRE) which applies encoding in-context
to ASP.NET controls.

John W, as per our call a couple minutes ago, this is the type of
sub-project (or sub-initiative) that needs to gain its own space, so that
the people who are interested are able to just focus on it and get things
done.

Btw, is Mike Samuel's 'Using Type Qualifiers to Make Web Templates Robust
Against XSS' paper published? (Google can't seem to find it.) I want to read
it properly, and was looking for a nicely formatted pdf :)

Dinis Cruz


2011/2/28 Mark Curphey <mark at curphey.com>

> Awesome. Any MSFT people you need me to go talk to or intro you to? I don't
> think we play in the space but don't know.....
>
> Sent from my Phone
>
> On Feb 28, 2011, at 1:43 AM, "Jim Manico" <jim.manico at owasp.org> wrote:
>
> > Mike Samuel (an Auto-escape template author from Google) would like to
> > send this letter to several framework communities, starting with Django.
> > Could you kindly review and pass along any suggestions to
> > msamuel at google.com - in the interest of security developer outreach?
> >
> > Thanks all,
> > Jim
> >
> >
> >
> >
> >
> >
> >  Using Type Qualifiers to Make Web Templates Robust Against XSS
> >
> >
> >    Contents
> >
> >
> >    Motivation
> >
> > Scripting vulnerabilities plague web applications today. To streamline
> the
> > output generation from application code, numerous web templating
> frameworks have
> > recently emerged and are gaining widespread adoption. However, existing
> web
> > frameworks fall short in providing mechanisms to automatically and
> > context-sensitively sanitize untrusted data.
> >
> > For example, a naive web template might look like
> >
> > <div>{$name}</div>
> >
> > but this template is vulnerable to Cross-Site Scripting (XSS). An attacker
> > who controls the value of |name| could pass in
> > |<script>document.location = 'http://phishing.com/';</script>| to redirect
> > users to a malicious site, steal the users' credentials or personal data, or
> > initiate a download of malware.
> >
> > The template author might manually encode name:
> >
> > <div>{$name |escapeHTML}</div>
> >
> > making sure that the user sees exactly the value of |name| as per spec,
> and
> > defeating this particular attack.
> >
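A minimal Python sketch of the manual-escaping step described above. The function names are hypothetical, and the standard library's html.escape stands in for the |escapeHTML| directive; it illustrates the idea only and is not the paper's implementation.

```python
import html

def render_naive(name):
    # Interpolates untrusted data verbatim: vulnerable to XSS.
    return "<div>" + name + "</div>"

def render_escaped(name):
    # Manually applies an HTML escaper to the untrusted value first.
    return "<div>" + html.escape(name) + "</div>"

attack = "<script>document.location = 'http://phishing.com/';</script>"
print(render_naive(attack))    # the script tag survives intact
print(render_escaped(attack))  # the angle brackets become entities
```

The escaped variant defeats this particular attack, but only because the hole sits in plain HTML text; other contexts need other sanitizers.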
> > A better web templating system might automatically insert the |escape*|
> > directives, relieving the template author of the burden.
> >
> > This paper argues that correct sanitization is too important to leave to chance,
> > and that manual sanitization is an unreasonable burden to place on template
> > authors (and especially maintainers). It defines goals that any automatic approach
> > should satisfy, and introduces an automatic approach that is particularly suitable
> > for bolting onto existing web templating languages.
> >
> >
> >    Abstract
> >
> > In this paper, we propose a type-based approach to bolt context-sensitive
> > automatic sanitization onto existing widely used web templating
> languages. In
> > particular, we introduce the new notion of "context" type qualifiers to
> > represent the contexts in which untrusted data can be embedded. We propose a new
> > type system that refines the base type system of a web templating language with
> > the context type qualifier. Based on the new type system, we design and develop a
> > context-sensitive auto-sanitization (CSAS) engine which runs during the
> > compilation stage of a web templating framework to add proper
> sanitization and
> > runtime checks to ensure the correct sanitization. We implement our
> system in
> > Google Closure Templates, a commercially used open-source templating
> framework
> > that is used in GMail, Google Docs and other applications. We evaluate
> our type
> > system on 1035 real-world Closure templates. We demonstrate that our
> approach
> > achieves both better security and performance than previous approaches.
> >
> >
> >    Glossary
> >
> > Context
> >    A parser state in the combined HTML, CSS, and JavaScript grammar used
> to
> >    determine the stack of sanitization routines that need to be applied
> to any
> >    untrusted data interpolated at that point to preserve the security
> >    properties outlined here.
> > Cross-Site Scripting
> >    A quoting confusion <#glossary-quoting_confusion> attack whereby
> untrusted
> >    data naively interpolated into HTML, CSS, or JavaScript causes code to
> run
> >    with the privileges of an origin not owned by the attacker.
> > CSS
> >    CSS 2 and 3 plus vendor specific extensions such as |expression:| and
> >    comment parsing and error recovery quirks so that our sanitization
> function
> >    definitions survive a worst-case analysis. This paper assumes a basic
> >    familiarity with CSS.
> > Escaper
> >    A sanitization function <#glossary-sanitization_function> that takes
> content
> >    in an input language (usually |text/plain|) and produces content in an
> >    output language. E.g. the function |escapeHTML| is an escaper that
> takes
> >    plain text, |'I <3 Ponies'|, and transforms that to semantically
> equivalent
> >    HTML by turning HTML special characters into entities: |'I &lt;3
> Ponies'|.
> >    (Escapers may, in the process, break hearts.) See also OWASP's definition
> >    <http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#Escaping_.28aka_Output_Encoding.29>.
> >
> > Filter
> >    A sanitization function <#glossary-sanitization_function> that takes a
> >    string and either returns it, returns an innocuous string, or aborts
> >    template processing. E.g. an untrusted value at the start of a URL can
> >    specify a powerful protocol such as |javascript:|. A filter can ensure
> that
> >    an untrusted value at the beginning of a URL either contains no
> protocol or
> >    contains one in a whitelist (|http|, |https|, or |mailto|) and if it
> finds
> >    an untrusted value that violates this rule, might return an innocuous
> value
> >    such as |'#'| which defangs the URL.
> > HTML
> >    HTML as parsed by browsers. Typically HTML5 but we need to deal with
> >    syntactic quirks from mainstream browser lines as old as IE5. This
> paper
> >    assumes a basic familiarity with HTML.
> > JavaScript
> >    EcmaScript 5 but including vendor specific extensions such as
> conditional
> >    compilation directives so that our sanitization function definitions
> survive
> >    a worst-case analysis. This paper assumes a basic familiarity with
> JavaScript.
> > Normalizer
> >    A sanitization function <#glossary-sanitization_function> that takes
> content
> >    in an input language and produces content in that same language but
> that can
> >    be used in more contexts. E.g. the function |normalizeURI| might make
> sure
> >    that quotes are encoded so that a URI path can be embedded in an HTML
> >    attribute unchanged:
> >    |'mailto:_<_Mohammed%20_"_The%20Greatest_"_%20Ali_>_%20ali at gmail.com'|
> >    becomes
> >    |'mailto:_%3c_Mohammed%20_%22_The%20Greatest_%22_%20Ali_%3e_%20ali at gmail.com'|
> >    Similarly, a function that strips tags from valid HTML allows the tagless HTML to
> >    be included in an HTML attribute context.
> > Quoting Confusion
> >    A vulnerability (or an exploitation of such) due to a failure to
> encode data
> >    in one language (such as |text/plain|) before concatenating it with
> content
> >    in another language such as |text/html| in the case of XSS. Other
> examples
> >    of quoting confusion include SQL Injection, Shell Injection, and HTTP
> header
> >    splitting.
> > RTTI
> >    RunTime Type Information. Reflective access to the type of a value at the
> >    time a program is running. RTTI APIs include |typeid| in C++; |typeof| in C#
> >    and JavaScript; |instanceof| in Java and JavaScript; |Object#instance_of?| in
> >    Ruby; |type| in Python; |Object.getClass()| in Java; and
> >    |Object.GetType()| in C#.
> > Sanitization Function
> >    A function that takes untrusted data and returns a snippet of web content.
> >    There are several kinds of sanitization functions: escapers, normalizers,
> >    and filters.
> > Template
> >    A function from (typically untrusted) data to a string that is
> specified as
> >    a DSL that clearly separates static trusted snippets of content
> (usually
> >    appearing as literal text) from interpolations of untrusted data
> (usually
> >    appearing as expressions or variable names). In this paper the term is
> used
> >    synonymously with "Web Template" which is a type of template that
> produces a
> >    string of content in a web language : HTML, CSS, or JavaScript.
> > Trusted Path
> >    The ability of an application or piece of code to establish a channel
> to the
> >    user that the user can be sure leads to that piece of code. E.g.
> Browsers
> >    use an unspoofable dialog box for HTTP auth to gather passwords,
> Windows
> >    uses Ctrl-Alt-Delete for the same purpose, and browsers disallow
> spoofing of
> >    the URL bar so that informed users can reliably tell if a page is
> secure and
> >    using valid certs. See also Wikipedia
> >    <http://en.wikipedia.org/wiki/Trusted_path>.
> > XSS
> >    See Cross-Site Scripting <#glossary-cross_site_scripting>
> >
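The three kinds of sanitization function defined in the glossary can be sketched in Python as follows. All names here are hypothetical, the scheme whitelist is the one quoted in the Filter entry, and the bodies are deliberately simplistic assumptions rather than the functions the paper ships.

```python
import html
import re

ALLOWED_SCHEMES = {"http", "https", "mailto"}

def escape_html(text):
    """Escaper: plain text in, semantically equivalent HTML out."""
    return html.escape(text)

def filter_url(url):
    """Filter: return the URL unchanged, or an innocuous '#' if it names
    a protocol outside the whitelist."""
    # Anything before the first ':' (with no '/', '?' or '#' in between)
    # is treated as a protocol, however oddly it is spelled.
    match = re.match(r"([^:/?#]*):", url)
    if match and match.group(1).lower() not in ALLOWED_SCHEMES:
        return "#"
    return url

def normalize_uri(uri):
    """Normalizer: URI in, URI out, with quotes and angle brackets
    percent-encoded so the result can sit in an HTML attribute."""
    return re.sub(r"[<>\"']", lambda m: "%%%02x" % ord(m.group(0)), uri)

print(escape_html("I <3 Ponies"))
print(filter_url("javascript:alert(1)"))
print(normalize_uri('mailto:<Mohammed%20"The%20Greatest"%20Ali>'))
```

Note the difference in failure behaviour: the escaper and normalizer always transform their input, while the filter either passes it through or replaces it wholesale.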
> >
> >    Solution Sketch: A static approach with RTTI to avoid over-escaping.
> >
> > First, a brief overview of the proposed solution. Later, while discussing
> goals
> > and alternatives, we will refer back to this.
> >
> > The template below specifies a form whose |action| depends on two values, |$name|
> > and |$target|, which may come from untrusted sources. The |{if …}…{else}…{/if}|
> > construct branches to define a dynamic URL.
> >
> > <form action="{if $target}/{$name}/handle?tgt={$target}{else}/{$name}/default{/if}">Hello {$name}…
> >
> > First, we parse the template to find trusted static content, dynamic "data
> > holes" that may be filled by untrusted data, and flow control constructs: |if|,
> > |for|, etc. Below, the data holes are shown as blanks; the remaining text is
> > trusted static content and flow control.
> >
> > <form action="{if $target}/        /handle?tgt=          {else}/        /default{/if}">Hello        …
> >
> > Next we do a flow-sensitive analysis, propagating types to determine the
> context
> > in which each data hole appears.
> >
> > <form action="{if $target}/        /handle?tgt=          {else}/        /default{/if}">Hello        …
> >
> > ↑PCDATA    ↑Attr URL start    ↑Attr URL path    ↑Attr URL query    ↑Attr URL path    ↑PCDATA
> >
> >
> > Based on those contexts, we determine the type of content that is
> expected for
> > each hole.
> >
> > <form action="{if $target}/(HTML-encoded URL)/handle?tgt=(HTML-encoded query){else}/(HTML-encoded URL)/default{/if}">Hello (text/html)…
> >
> > Finally we insert calls to sanitizer functions
> <#glossary-sanitization_function>
> > into the template.
> >
> > <form action="{if $target}/{escapeHTML($name)}/handle?tgt={encodeURIComponent($target)}{else}/{escapeHTML($name)}/default{/if}">Hello {escapeHTML($name)}…
> >
> > That is the gist of the solution, though the above example glosses over issues
> > with re-entrant templates, templates that are invoked in multiple start
> > contexts, and joining branches that end in different contexts; and the exact
> > sanitization functions chosen differ from those shown in this simplified example.
> >
> > The example only shows HTML and URL encoding, but our solution deals with
> data
> > holes that occur inside embedded JavaScript and CSS as any solution for
> AJAX
> > applications must.
> >
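The last step of the sketch, choosing a sanitizer from the inferred context of each hole, might look like the Python below. The context names, the SANITIZERS table, and the list-of-parts template representation are all assumptions made for illustration; the paper itself notes that the real sanitization functions differ.

```python
import html
from urllib.parse import quote

# Hypothetical mapping from inferred context to a sanitizer.
SANITIZERS = {
    "PCDATA": html.escape,
    "ATTR_URL_PATH": lambda v: html.escape(quote(v, safe="")),
    "ATTR_URL_QUERY": lambda v: html.escape(quote(v, safe="")),
}

def render(parts, data):
    """Joins trusted static strings with sanitized data holes.

    Each part is either a str (trusted static content) or a
    (context, variable_name) pair describing a data hole.
    """
    out = []
    for part in parts:
        if isinstance(part, str):
            out.append(part)
        else:
            context, name = part
            out.append(SANITIZERS[context](str(data[name])))
    return "".join(out)

# The {if $target} branch of the form template, after context inference.
FORM_WITH_TARGET = [
    '<form action="/', ("ATTR_URL_PATH", "name"),
    '/handle?tgt=', ("ATTR_URL_QUERY", "target"),
    '">Hello ', ("PCDATA", "name"),
]

print(render(FORM_WITH_TARGET, {"name": 'a"b', "target": "x&y=<z>"}))
```

Because the context is attached to the hole at compile time, nothing needs to be parsed while the template runs; only the table lookup and the sanitizer call remain.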
> >
> >    Problem Definition
> >
> > In this section we present several metrics on which any competing
> sanitization
> > scheme should be judged, and a definition of a safe template that can be
> used to
> > prove or disprove the soundness of a sanitization scheme that we think is
> > relevant to security properties that web applications commonly want to
> enforce.
> >
> >
> >      Performance
> >
> > A sanitization scheme should be judged on several performance metrics:
> >
> >   1. Compile- or load-time overhead. The cost of any static analysis.
> >   2. Run-time analysis overhead. The cost of any dynamic analysis done
> when the
> >      template is run.
> >   3. Run-time sanitization overhead. The cost of sanitizing
> >      <#glossary-sanitization_function> untrusted data.
> >   4. One-time development overhead. The burden placed on a developer to
> learn
> >      the system.
> >   5. Continual development overhead. The burden placed on a developer to
> add
> >      sanitization directives, review code to ensure they are used
> correctly,
> >      debug the resulting template code, and deal with any over- or
> >      mis-sanitization.
> >
> > Run-time analysis overhead (proportional to overall template runtime)
> often
> > differs substantially by platform. High quality parser-generators exist
> for C
> > and Java, so the overhead may be much lower there than in browser, since
> > iterating char by char over a string is slow in JavaScript.
> >
> > Our proposal has a modest compile-/load-time cost, taking slightly less than 1
> > second to do static inference for 1035 templates comprising 782kB of source code,
> > or about 1ms per template. The run-time analysis overhead for our proposal is zero.
> > The run-time sanitization overhead on a benchmark is between 3% and 10% of the total
> > template execution time, and is indistinguishable from the overhead when
> > non-contextual auto-sanitization is used (all data holes sanitized using HTML
> > entity escaping).
> >
> > Development overhead is hard to measure but the 1035 templates were
> migrated by
> > an application group in a matter of weeks without stopping application
> > development with little coordination, so the one-time overhead — the
> overhead to
> > learn the system — is lower than that to learn and adopt a new templating
> > language. Since the system works by inserting function calls, we provided
> > debugging tools that diffed templates before and after the inference was
> run to
> > show developers what the system was doing and aid in debugging. Because
> > templates written using any approach need debugging, the continual development
> > overhead can never be zero, but tool support such as diffing can make the system
> > transparent and ease debugging.
> >
> > Finally, once a bug has been identified, we try to make sure there are
> simple
> > bugfixing recipes.
> >
> >    * If the problem is over-escaping of known-safe data values that were
> >      already sanitized, then wrap the data value in a |SanitizedContent|
> >      wrapper object of the appropriate type as close to where it is
> sanitized
> >      as possible. Ideally, an HTML tag whitelisting sanitizer would
> return a
> >      value of type |SanitizedContent|.
> >    * If the sanitizer is choosing the wrong sanitization functions for
> whatever
> >      reason, insert your own. The system recognizes sanitization
> functions and
> >      does not interfere with developer choices.
> >
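The first recipe above relies on run-time type information to avoid over-escaping. A minimal Python sketch, assuming a hypothetical SanitizedHtml wrapper standing in for a |SanitizedContent| value of HTML type:

```python
import html

class SanitizedHtml:
    """Marks a string as HTML already proven safe, e.g. the output of a
    tag-whitelisting sanitizer (hypothetical wrapper type)."""

    def __init__(self, content):
        self.content = content

def escape_html(value):
    # RTTI check: content already known to be safe HTML passes through,
    # everything else is treated as plain text and entity-escaped.
    if isinstance(value, SanitizedHtml):
        return value.content
    return html.escape(str(value))

print(escape_html("Alan <b>Greenspan</b>"))                 # escaped
print(escape_html(SanitizedHtml("Alan <b>Greenspan</b>")))  # unchanged
```

The wrapper should be created as close as possible to where the value is actually sanitized, so that a reviewer can audit every construction site.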
> >
> >      Ease of Adoption/Migration
> >
> > What kind of changes, if any, do developers have to make to take an
> existing
> > codebase of templates and have them properly sanitized? For example,
> adding
> > sanitization functions manually is time-consuming and error-prone. Making
> sure
> > that all static content is valid XHTML requires repetitive,
> time-consuming
> > changes, but would not be as error-prone.
> >
> > Our proposal allows contextual auto-sanitization to be turned on for some
> > templates and not for others; most templating languages allow templates
> to be
> > composed, i.e. templates can call other templates, and standard practice
> seems
> > to be to have a few large templates that call out to many smaller
> templates.
> > Since this can be done per template, a codebase can be migrated
> piecemeal,
> > starting with complicated templates that have known problems.
> >
> > Our proposal does not impose an element structure on template boundaries.
> Many
> > top level templates look like:
> >
> > {include "common-header.foo"}
> > <!-- Body content -->
> > {include "common-footer.foo"}
> >
> > where the common header opens elements that are closed in the common
> footer:
> >
> > <html>
> > <head>
> > <!-- Common style and script definitions -->
> > ...
> > </head>
> > <body>
> > <!-- Common menus -->
> >
> > Approaches that require template code to be well-formed XML, such as XSLT,
> > cannot support this idiom. Our proposal works for templating languages that
> > allow this idiom because we propagate types as they flow through template calls
> > rather than inferring types of content based on a DOM derived from a template.
> >
> >
> >      Ease of Abandonment
> >
> > If a development team adopts a sanitization scheme, and finds that it
> does not
> > meet their needs, how easily can they switch it off, and how much of the
> effort
> > they invested in deploying it can they recover?
> >
> > Since our solution works by inserting calls to sanitization functions into
> > templates, a development team having second thoughts can simply run the type
> > inference engine to insert the calls, print out the resulting templates to
> > generate a patch to their codebase, and then remove whatever directives turned on
> > auto-sanitization. We argued above that the cost of adoption is low, and most of
> > the work put into verifying that the sanitization functions chosen were reasonable
> > is recoverable.
> >
> >
> >      Security under Maintenance
> >
> > Security measures tend to be removed from code under maintenance. Imagine
> a
> > template that is not auto-sanitized:
> >
> > <div>Your friend, {escapeHTML($name)}, thinks you'll like this.</div>
> >
> > that is passed a plain text name. While merging two applications,
> developers add
> > a call to this code, passing in a rich HTML signature that has been
> proven safe
> > by a tag whitelister, e.g. |"Alan <font color=green>Green</font>span"|.
> > Eventually, Mr. Greenspan notices that his name is misrendered and files
> a bug.
> > A developer might check that the rich text signature is sanitized
> properly
> > before being passed in, but not notice the other caller that doesn't do
> any
> > sanitization. They resolve the bug by removing the call to |escapeHTML|
> which
> > fixes the bug but opens a vulnerability.
> >
> > Over-encoding is more likely to be noticed by end-users than XSS
> > vulnerabilities, so a project under maintenance is more likely to lose
> manual
> > sanitization directives than to gain them.
> >
> > Our proposal addresses this by introducing sanitized content types
> > <#sanitized_content_types> as a principled solution to over-encoding
> problems.
> >
> >
> >      Structure Preservation Property
> >
> > We define a safe template as one that has several properties: the
> structure
> > preservation property described here, and the code effect and least
> surprise
> > properties defined in later sections.
> >
> > Intuitively, this property holds that when a template author writes an HTML tag
> > in a safe templating language, the browser will interpret the corresponding
> > portion of the output as a tag regardless of the values of untrusted data, and
> > similarly for other structures such as attribute boundaries and JS and CSS
> > string boundaries.
> >
> > This property can be violated in a number of ways. E.g. in the following
> > JavaScript, the author is composing a string that they expect will
> contain a
> > single top-level bold element surrounded by text.
> >
> > document.write(greeting + ', <b>' + planet + '</b>!');
> >
> > If |greeting| is |"Hello"| and |planet| is |"World"| then this holds, as the
> > output written is "|Hello, <b>World</b>!|"; but if |greeting| is
> > |"<script>alert('pwned');//"| and |planet| is |"</script>"| then this does not
> > hold since the structure has changed: the |<b>| should have started a bold
> > element but the browser interprets it as part of a JavaScript comment in
> > "|<script>alert('pwned');//, <b></script></b>!|".
> >
> > Lower level encoding attacks, such as UTF-7
> > <http://ha.ckers.org/xss.html#XSS_UTF-7> attacks, may also violate this
> property.
> >
> > More formally, given any template, e.g.
> >
> > <div id="{$id}" onclick="alert('{$message}')">{$message}</div>
> >
> > we can derive an /innocuous template/ by replacing every untrusted
> variable with
> > an innocuous string, a string that is not empty, is not a keyword in any
> > programming language and does not contain special characters in any of
> the
> > languages we're dealing with. We choose our innocuous string so that it
> is not a
> > substring of the concatenation of literal string parts. Using the
> innocuous
> > string |"zzz"|, an innocuous template derived from the above is:
> >
> > <div id="zzz" onclick="alert('zzz')">zzz</div>
> >
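Deriving the innocuous template can be sketched in a few lines of Python. This assumes holes are written exactly as |{$name}| and ignores flow-control constructs; the function name and regular expression are hypothetical.

```python
import re

# Matches a simple data hole such as {$id} or {$message}.
HOLE = re.compile(r"\{\$\w+\}")

def innocuous_template(template, innocuous="zzz"):
    """Replaces every data hole with the innocuous string."""
    # The innocuous string must be non-empty and must not occur in the
    # concatenation of the literal parts, or it could not be told apart
    # from static content later.
    literal_text = "".join(HOLE.split(template))
    if not innocuous or innocuous in literal_text:
        raise ValueError("unsuitable innocuous string: %r" % innocuous)
    return HOLE.sub(innocuous, template)

template = '<div id="{$id}" onclick="alert(\'{$message}\')">{$message}</div>'
print(innocuous_template(template))
```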
> > Parsing this, we can derive a tree structure where each inner node has a
> type
> > and children, and each leaf has a type and a string value.
> >
> > Element
> >  ╠Name : "div"
> >  ╠Attribute
> >  ║  ╠Name : "id"
> >  ║  ╚Text : "zzz"
> >  ╠Attribute
> >  ║  ╠Name : "onclick"
> >  ║  ╚JsProgram
> >  ║      ╚FunctionCall
> >  ║          ╠Identifier : "alert"
> >  ║          ╚String : "zzz"
> >  ╚Text : "zzz"
> >
> > A template has the structure preservation property when for all possible
> branch
> > decisions through a template, and for all possible data table inputs, a
> template
> > either produces no output (fails with an exception) or produces an output
> that
> > can be parsed to a tree that is structurally the same as that produced by
> the
> > innocuous template derived from it for the same set of branch decisions.
> >
> > ∀ branch-decisions, ∀ data: areEquivalent(
> >     parse(innocuousTemplate(T)(branch-decisions, data)),
> >     parse(T(branch-decisions, data)))
> >
> > where parse parses using a combined HTML/JavaScript/CSS grammar to the
> tree
> > structure described above, branch-decisions is a path through flow
> control
> > constructs (the conditions in |for| loops and |if| conditions) and where
> > areEquivalent is defined thus:
> >
> > # The innocuous string substituted for every hole ('zzz' in the example above).
> > INNOCUOUS_STRING = 'zzz'
> >
> > def areEquivalent(innocuous_tree, actual_tree):
> >   # Require type equivalence for all nodes, leaf or inner.
> >   if innocuous_tree.node_type != actual_tree.node_type:
> >     return False
> >   if innocuous_tree.is_leaf:
> >     # A leaf must stay a leaf.
> >     if not actual_tree.is_leaf:
> >       return False
> >     if INNOCUOUS_STRING in innocuous_tree.leaf_value:
> >       # Ignore the contents of actual since it was generated by
> >       # a hole.  We only care that it does not interfere with
> >       # the structure in which it was embedded.
> >       return True
> >     # Leaves structurally the same.
> >     return innocuous_tree.leaf_value == actual_tree.leaf_value
> >   # Zip below will silently drop extras.
> >   if len(innocuous_tree.children) != len(actual_tree.children):
> >     return False
> >   # Recurse to children.
> >   for innocuous_child, actual_child in zip(
> >       innocuous_tree.children, actual_tree.children):
> >     if not areEquivalent(innocuous_child, actual_child):
> >       return False
> >   # All grounds on which they could be inequivalent disproven.
> >   return True
> >
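To see the definition at work, here is a self-contained Python sketch with hand-built trees standing in for parser output. The Node class and the div helper are hypothetical, and this variant compares node types at every node, including leaves produced by holes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

INNOCUOUS_STRING = "zzz"

@dataclass
class Node:
    node_type: str
    leaf_value: Optional[str] = None          # set only on leaves
    children: List["Node"] = field(default_factory=list)

    @property
    def is_leaf(self):
        return self.leaf_value is not None

def are_equivalent(innocuous, actual):
    # Node types and leafiness must agree everywhere.
    if (innocuous.node_type != actual.node_type
            or innocuous.is_leaf != actual.is_leaf):
        return False
    if innocuous.is_leaf:
        # Leaves filled by a hole may hold any value; others must match.
        return (INNOCUOUS_STRING in innocuous.leaf_value
                or innocuous.leaf_value == actual.leaf_value)
    if len(innocuous.children) != len(actual.children):
        return False
    return all(are_equivalent(i, a)
               for i, a in zip(innocuous.children, actual.children))

def div(*body):
    return Node("Element", children=[Node("Name", "div"), *body])

innocuous = div(Node("Text", "zzz"))          # <div>zzz</div>
benign = div(Node("Text", "World"))           # <div>World</div>
injected = div(Node("Element", children=[     # <div><script>…</script></div>
    Node("Name", "script"), Node("Text", "alert(1)")]))

print(are_equivalent(innocuous, benign))
print(are_equivalent(innocuous, injected))
```

The benign output keeps a text leaf where the hole was, so it is equivalent; the injected output turns that leaf into an element, so it is not.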
> > This definition is not computationally tractable, but is formal so can be
> used
> > as a basis for correctness proofs, and in practice branch decisions that
> go
> > through loops more than twice or recurse more than twice can be ignored
> and
> > fuzzers do a good job of generating bad data inputs.
> >
> > This property is essential to capturing developer intent. When the developer
> > writes a tag, the browser should interpret that as a tag, and when the developer
> > writes paired start and end tags, the browser should interpret those as a
> > matched pair. It is also important to applications that want to embed sanitized
> > data while preserving a trusted path <#glossary-trusted_path> since the
> > structure preservation property is a prerequisite for visual containment.
> >
> >
> >      Code Effect Property
> >
> > Untrusted data may specify data values in code (loosely, strings, booleans,
> > numbers, JSON) but /only/ code specified by the template author should run as a
> > result of injecting the template output into a page and /all/ code specified by
> > the template author should run as a result of the same.
> >
> > There are a dizzyingly large number of ways this property can fail to
> hold for a
> > template. A non-exhaustive sample of ways to cause extra code to run:
> >
> >    * Unencoded text could contain a |<script>| element.
> >    * A dynamic attribute name could specify an event handler: |onclick|.
> >    * A dynamic |src| or |href| could specify |javascript|, |livescript|,
> etc.
> >      as the protocol in myriad <http://ha.ckers.org/xss.html> ways.
> >    * Dynamic CSS
> >      <http://code.google.com/p/browsersec/wiki/Part1#Cascading_stylesheets>
> >      could use a vendor-specific extension, e.g. |expression| or |-moz-binding|.
> >    * A dynamic |<object>| might load Flash cross-origin and set |AllowScriptAccess|.
> >    * Dynamic code might reach |eval| or any number of JavaScript APIs
> that
> >      invoke the HTML, CSS, or JavaScript parsers.
> >
> > There are also many ways to cause security-critical code to not run. In
> general,
> > it is not wise to rely on JavaScript running in a browser, but many
> developers,
> > not unreasonably, rely on some code having run if other code is running
> at a
> > later time. A non-exhaustive sample of ways to stop code running via XSS:
> >
> >    * Change the base href of the page by inserting a |<base>| element
> disabling
> >      |src=|'ed |<script>|s with relative URLs.
> >    * Inject improperly encoded data into a data hole in an inline
> <script>
> >      element that causes it to fail to parse. Putting unicode codepoints
> U+2028
> >      or U+2029 into a string body is a favorite since JavaScript treats
> those
> >      as newline characters. (A violation of the Structure Preservation
> Property).
> >    * Cause an inline event handler or |<script>| tag to not be
> interpreted as
> >      such. (A violation of the Structure Preservation Property).
> >    * Insert code that disables, removes, or changes critical APIs:
> >      |Object.prototype.toString = function () { throw new Error(); };|
> >    * Inject a |<noscript>| element around the |<head>|. (A violation of
> the
> >      Structure Preservation Property).
> >    * Fool heuristic XSS defense browser plugins into thinking that the
> security
> >      code was injected by manipulating query parameters.
> >
> > Our proposal enforces this property by filtering <#glossary-filter> URLs
> to
> > prevent any data hole from specifying an exotic protocol, by filtering
> CSS
> > keywords, and by only allowing data holes in JavaScript contexts to
> specify
> > simple boolean, numeric, and string values, or complex JSON values which
> cannot
> > have free variables. We assume that the JavaScript interpreter will work
> on
> > arbitrarily large inputs.
> >
> > Finally, our escapers <#glossary-escaper> are designed to produce output that
> > avoids grammatical pitfalls such as semicolon insertion, non-ASCII newline
> > characters, and regular-expression/division-operator/line-comment confusion.
> >
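As an illustration of an escaper built with those pitfalls in mind, the Python sketch below escapes a value for embedding inside a quoted JavaScript string literal in an inline script. The function name and the exact set of escaped characters are assumptions, not the paper's escaper.

```python
_SHORT_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
# Quotes end the string; <, >, & and = matter to the surrounding HTML.
_HEX_ESCAPED = set("'\"<>&=")

def escape_js_string(value):
    """Escapes text for use between quotes in a JavaScript string."""
    out = []
    for ch in value:
        if ch in _SHORT_ESCAPES:
            out.append(_SHORT_ESCAPES[ch])
        elif ch in _HEX_ESCAPED or ord(ch) < 0x20:
            out.append("\\x%02x" % ord(ch))
        elif ch in ("\u2028", "\u2029"):
            # Non-ASCII newlines that would otherwise break the literal.
            out.append("\\u%04x" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

print(escape_js_string("it's </script>"))
```

Escaping the angle brackets keeps a value such as |</script>| from closing the enclosing script element before the JavaScript parser ever sees it.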
> > Identifying all places in which a URL might appear in HTML (incl. MathML
> and
> > SVG) is relatively easy compared to CSS. In CSS, it is difficult. For
> example,
> > in |<div style="background: {$bg}">|, |$bg| might specify a URL, a color
> name, a
> > color value like |#000|, a function-like color |rgb(0,0,0)|, a keyword
> value
> > like |transparent|, or a combination of the above. Given how hard it is
> to
> > reliably black-list URLs, when you know the content is a URL, we took the
> rather
> > drastic approach of forbidding anything that might specify a colon in CSS
> data
> > holes. This seems to affect very little in practice, and we could relax
> this
> > constraint to allow colons preceded by a safe word like the name of an
> element,
> > pseudo-element, or innocuous property. Even if we did, it is possible
> that
> > existing code uses colons in data holes to specify list separators a la
> semantic
> > HTML, and we would break that use case:
> >
> > ul.inline li { list-style: none; display: inline }
> > ul.inline li:before { content: ': ' }  /* ', ' here would give a normal looking list. */
> > ul.inline li:first-child:before { content: '' }
> >
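The colon rule described above can be sketched as a filter in Python. This covers only that one rule: it treats a backslash as suspect too (an assumption, since a CSS escape such as |\3a| can spell a colon), and it omits the CSS keyword filtering the proposal also performs. The names and the innocuous replacement value are hypothetical.

```python
import re

INNOCUOUS_VALUE = "zzz"

# A literal colon, or a backslash that could begin a CSS escape
# sequence spelling one.
_MIGHT_SPECIFY_COLON = re.compile(r"[:\\]")

def filter_css_value(value):
    """Returns the value unchanged, or an innocuous value if it might
    specify a colon (and hence a URL with a protocol)."""
    if _MIGHT_SPECIFY_COLON.search(value):
        return INNOCUOUS_VALUE
    return value

print(filter_css_value("rgb(0,0,0)"))
print(filter_css_value("url(javascript:alert(1))"))
```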
> > This property is a prerequisite for many application /privacy/ goals. If a
> > third-party can cause script to run with the privileges of the origin, it can
> > steal user data and phone home. Even if credentials are unavailable to
> > JavaScript (HTTPOnly
> > <http://www.codinghorror.com/blog/2008/08/protecting-your-cookies-httponly.html>
> > cookies), scripts with same-origin privileges can screen scrape (using DOM APIs)
> > user names and identifiers and associated page content and phone home.
> >
> > This property is also a prerequisite for many /informed consent/ goals.
> If a
> > third-party script can install |onsubmit| handlers, it can rewrite form
> data
> > before it is submitted with the XSRF tokens that are meant to ensure that
> the
> > data submitted was specified by the user.
> >
> >
> >      Least Surprise Property
> >
> > The last of the security properties that any auto-sanitization scheme should
> > preserve is the property of least surprise. This is not a property that can be
> > proven mathematically as it depends on developers' intuition, but it is
> > nonetheless important.
> >
> > Developer intuition is important. A developer familiar with HTML, CSS, and
> > JavaScript who knows that auto-sanitization is happening should be able to look
> > at a template and correctly infer what happens to dynamic values without having
> > to read a complex specification document. Simple rules-of-thumb should be
> > sufficient to understand the system. E.g. if a mythical average developer sees
> > |<script>var msg = '{$msg}';</script>| and their intuition is that |$msg|
> > /should/ be escaped using JavaScript style |\| sequences, and that is sufficient
> > to preserve the other security properties, then that is what the system should
> > do. Templates should be both easy to write and to code review.
> >
> > Exceptions to the system should be easily audited. SQL prepared
> statements are
> > great, but there's no way to have exceptions to the rule without giving
> up the
> > whole safety net, so sometimes developers work around them by
> concatenating
> > strings. It's hard to |grep| (or craft presubmit triggers) for all the
> places
> > where concatenated strings are passed to SQL APIs, so it's hard for a
> more
> > senior developer to find these after the fact and explain how they can
> achieve
> > their goal working within the system, notice a trend that points to a
> systemic
> > problem with schemas, or agree that the exception to the rule is
> warranted and
> > document it for future security auditors.
> >
> > Our proposal was designed with this goal in mind, but we have not managed
> to
> > quantify our success. We can note that 1035 templates were converted
> within a
> > matter of weeks without a flood of questions to the mailing lists we
> monitor, so
> > we infer that most of the parts of the system that were heavily exercised
> were
> > non-controversial. Different communities of developers may have different
> > expectations. We worked with a group of developers most of whom knew
> Java, C++,
> > or both before starting web application development, and among whom a
> high
> > proportion have at least a bachelor's degrees in CS or a related field.
> They may
> > differ, intuition-wise, from developers who came to web development from
> a Ruby,
> > Perl, or PHP background.
> >
> >
> >    Alternate Approaches
> >
> > In this section we introduce a number of alternative proposals and explain
> > why they perform worse on the metrics above. We cite real systems as
> > examples of some of these alternatives. Many of these systems are
> > well-thought-out, reasonable solutions to particular problems their authors
> > faced. We merely argue that they do not extend well to the criteria we
> > outlined above and explicitly label these sections "strawmen" to clarify
> > the difference between our design criteria and the contexts in which these
> > systems arose. We do claim though that any comprehensive solution to XSS,
> > at a tools level, should meet the criteria above.
> >
> >
> >      Strawman 0: Manual sanitization
> >
> > Manual sanitization is the current state of the art. Developers use a suite
> > of functions, such as OWASP's open source ESAPI encoders
> > <http://code.google.com/p/owasp-esapi-java/source/browse/trunk/src/main/java/org/owasp/esapi/Encoder.java>,
> > and every developer must learn when and how to apply them correctly. They
> > must apply sanitizers either before data reaches a template or within the
> > template by inserting function calls into code.
> >
> > This places a significant burden on developers and does not guarantee any
> of the
> > security properties listed above. One lapse can undo all the work put
> into
> > hardening a website because of the all-or-nothing nature of the
> same-origin policy.
> >
> > There is a tradeoff between correctness and simplicity of API that works in
> > the attacker's favor. Manual sanitization is particularly error-prone
> > because developers learn /the good parts/ of the languages they work in,
> > but attackers have available to them /the bad parts/ as well. The syntax of
> > HTML, CSS, and JavaScript is much gnarlier than most developers imagine,
> > and it is an unreasonable burden to expect them to learn and remember
> > obscure syntactic corner cases. The typical suite of 4-6 escaping functions
> > is the most that many developers can reliably choose from, but such a suite
> > is insufficient to handle corner cases or nested contexts.
> >
> > Changes in language syntax or vendor-specific extensions (e.g. XML4J and
> > embedded SVG) may invalidate developers' previously valid assumptions. Code
> > that was safe before may no longer be safe. With an automated system, a
> > security patch and recompile may suffice, but a patch to code that took a
> > team of developers years to write will take a team of developers to fix.
> >
> > XSS scanners (e.g. lemon
> > <http://googleonlinesecurity.blogspot.com/2007/07/automating-web-application-security.html>)
> > can mitigate some of manual sanitization's cons (though they work with any
> > of the other solutions here as well to provide defense-in-depth), but there
> > are no good scanners for AJAX applications, and, with manual sanitization,
> > scanners impose a continual burden on developers to respond to the reported
> > errors.
> >
> >
> >      Strawman I: Non-contextual auto-sanitization
> >
> > Context-less auto-sanitization is a great improvement over manual
> sanitization
> > and is implemented in a number of languages including Django templates
> > <http://code.djangoproject.com/wiki/AutoEscaping>.
> >
> > It works by assuming that every data hole should be sanitized the same
> way,
> > usually by HTML entity encoding. As such, it is prone to over-escaping
> and
> > mis-escaping.
> >
> > To understand mis-escaping, consider what happens when the following
> > template is called with |', alert('XSS'), '| :
> >
> > <button onclick="setName('{$name}')">
> >
> > The template produces
> > |<button onclick="setName('&apos;, alert(&apos;XSS&apos;), &apos;')">|
> > which is exactly the same, to the browser, as
> > |<button onclick="setName('', alert('XSS'), '')">| because the browser HTML
> > entity decodes the attribute value /before/ invoking the JavaScript parser
> > on it.
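
The mis-escaping above can be reproduced in a few lines. This is a minimal sketch, not part of the original letter: render_button and js_seen_by_browser are hypothetical helpers, and Python's html.escape stands in for a context-less auto-sanitizer (it emits &#x27; where the example above shows &apos;).

```python
import html


def render_button(name: str) -> str:
    """Context-less auto-sanitization: HTML-entity-encode the hole, whatever its context."""
    return '<button onclick="setName(\'' + html.escape(name, quote=True) + '\')">'


def js_seen_by_browser(tag: str) -> str:
    """Rough stand-in for the browser: extract the onclick value and entity-decode it."""
    start = tag.index('onclick="') + len('onclick="')
    end = tag.index('"', start)
    return html.unescape(tag[start:end])


payload = "', alert('XSS'), '"
tag = render_button(payload)      # looks escaped: every quote became an entity
script = js_seen_by_browser(tag)  # what the JavaScript parser actually receives
print(tag)
print(script)
```

The HTML escaping is undone before the JavaScript parser runs, so the quotes in the payload still close the string literal.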
> >
> > Non-contextual auto-sanitization cannot preserve the structure
> preservation
> > property for JavaScript, CSS, or URLs because it is unaware of those
> languages.
> > It also fails to preserve the code effect property.
> >
> > Bolting filters on non-contextual auto-sanitization will not help it to
> > preserve the code effect property. It is possible to write bizarre
> > JavaScript that does not even need alphanumerics
> > <http://securitymusings.com/article/2022/code-with-javascript-letters-and-numbers-optional>.
> > Since JavaScript has no regular lexical grammar, regular expressions that
> > are less than draconian are insufficient to filter out attacks.
> >
> > Non-contextual auto-sanitization, with auditable exceptions like
> Django's, does
> > preserve the least surprise property in a sense. With very little
> training, a
> > developer can predict exactly what it will do, and empirically, 74% of
> the time
> > it does what they want (our system chose some kind of HTML entity
> encoding for
> > 992 out of 1348 data holes).
> >
> >
> >      Strawman II: Strict structural containment
> >
> > Examples of strict structural containment languages are XSLT
> > <http://www.w3.org/TR/xslt>, GXP <http://code.google.com/p/gxp/>, and
> > possibly XHP
> > <http://www.facebook.com/notes/facebook-engineering/xhp-a-new-way-to-write-php/294003943919>.
> >
> >
> > What they have in common is that the input is (or is coercible via fancy
> > tricks to) a tree structure like XML. So for every data hole, it is obvious
> > to the system which element and attribute context the hole appears in†.
> > Then a similar structural constraint could be applied in principle to
> > embedded JavaScript, CSS, and URIs.
> >
> > Strict structural containment is a sound, principled approach to building
> > safe templates and is a great choice for anyone planning a new template
> > language.
> >
> > It cannot be bolted onto existing languages though because it requires that
> > every element and attribute start and end in the same template. This
> > assumption is violated by several very common idioms, such as the
> > header-footer <#header-footer-example> idiom, in ways that often require
> > drastic changes to a codebase to repair.
> >
> > Since it cannot be bolted onto existing languages, limiting ourselves to it
> > would doom to insecurity most of the template code existing today. Most
> > project managers who know their teams have trouble writing XSS-free code
> > know this because they have existing code written in a language that does
> > not have this property.
> >
> > † - modulo mechanisms like |<xsl:element name="...">|
> > <http://www.w3schools.com/xsl/el_element.asp> which can, in principle,
> be
> > repaired using equivalence classes of elements and attributes. I.e. one
> could
> > define an equivalence class of elements all of whose attributes have the
> same
> > meaning and which have the same content type: (TBODY, THEAD, TFOOT), (OL,
> UL),
> > (TD, TH), (SPAN, I, B, U), (H1, H2, H3, …) and allow a dynamic element
> mechanism
> > to switch between element types within the same equivalence class.
> Similar
> > approaches can allow selecting among equivalent dynamic attribute types :
> all
> > event handlers are equivalent (modulo perhaps those that imply user
> interaction
> > for some applications).
> >
> >
> >      Strawman III: A runtime typing approach
> >
> > Prior to this work, the best auto-sanitization scheme was a runtime
> scheme
> > <http://googleonlinesecurity.blogspot.com/2009_03_01_archive.html>.
> >
> > A runtime contextual auto-sanitizer plugs into a template runtime at a
> low
> > level. Instead of writing content to an output buffer, the template
> runtime
> > passes trusted and untrusted chunks to the autoescaper. The template:
> >
> > <ul>{for $item in $items}<li onclick="alert('{$item}')">{$item}{/for}</ul>
> >
> > might produce the output on the left, and by propagating context at
> runtime,
> > infer the context in the middle and choose to apply the escaping
> directives on
> > the right before writing to the output buffer.
> >
> > Content                       Trusted   Context     Sanitization function
> > |<ul>|                        Yes       PCDATA      none
> > |<li onclick="alert('|        Yes       PCDATA      none
> > |foo|                         No        JS string   escapeJSString
> > |')">|                        Yes       JS string   none
> > |foo|                         No        PCDATA      escapeHTML
> > |<li onclick="alert('|        Yes       PCDATA      none
> > |<script>doEvil()</script>|   No        JS string   escapeJSString
> > |')">|                        Yes       JS string   none
> > |<script>doEvil()</script>|   No        PCDATA      escapeHTML
> > |</ul>|                       Yes       PCDATA      none
> >
> > This works, and with a hand-tuned C parser has been deployed successfully on
> > CTemplates
> > <http://google-ctemplate.googlecode.com/svn/trunk/doc/auto_escape.html> and
> > ClearSilver <http://www.clearsilver.net/>.
> >
> > Writing a highly tuned parser in JavaScript is difficult though, so
> > implementing this scheme requires a hard trade-off between flexibility and
> > correctness on the one hand and download size and speed on the other.
> >
> > Our proposal is a factor of 4 faster than a runtime scheme implemented in
> > JavaScript and has no download size cost above and beyond the code for
> the
> > sanitization functions and the calls to them.
> >
> > Even in languages for which there are efficient parser generators,
> runtime
> > approaches might suffer performance-wise. The overhead for the static
> approach
> > is independent of the number of times a loop is re-entered, so templates
> that
> > take large array inputs might perform worse with even a highly efficient
> runtime
> > scheme.
> >
> > Runtime sanitization is more elegant in at least one area though.
> Dynamic
> > tag and attribute names pose no problems to a runtime sanitizer. Whereas
> our
> > scheme has to filter attribute names so that |$aname| cannot be
> |"onclick"| in
> > |<button {$aname}=…>|, because a static approach must decide that the
> beginning
> > of the attribute value is either a JavaScript context or some other
> context, a
> > runtime approach can take into account the actual value of |$aname|. This
> is not
> > a common problem, and our approach does handle many dynamic attribute
> situations
> > including: |<button on{$handlerType}=…>|.
> >
> >
> >      Strawman IV: A purely static approach
> >
> > We know of no purely static approaches, though they are possible. A
> purely
> > static approach is one that, like our proposal, infers contexts at
> compile or
> > load time, but does not take into account the runtime type of the values
> that
> > fill the data holes.
> >
> > This approach has problems with over-escaping. Existing systems often use
> a mix
> > of sanitization in-template and sanitization outside the template in the
> > front-end code that calls the template.
> >
> > Our solution takes into account the runtime type of the values that fill
> a hole.
> > If the runtime type marks the value as a known-safe string of HTML, then an
> HTML
> > entity escaping sanitization function can use that information to decide
> not to
> > re-escape, and instead normalize or do nothing.
> >
> > See caveats <#caveats> for other problems that are equally applicable to
> > pure static systems and to our proposal.
> >
> >
> >    Definitions and Algorithms
> >
> > This section is only relevant to implementors, testers, and others who
> want to
> > understand the implementation. Everyone else, including web application
> > developers, can ignore it.
> >
> > At a high level, the type system defines four things which are expanded
> upon below:
> >
> >   1. An initial start context for a public template. Typically
> |HTML_PCDATA|.
> >   2. A context propagation algorithm which takes a chunk of literal text
> from
> >      the template and the context at its start and returns the context at
> its
> >      end. |(context * string) → context|.
> >   3. An algorithm that chooses a sanitization function for a data hole.
> It
> >      takes the context before the hole and returns a sanitization
> function and
> >      the context after the hole. |context → ((α → string) * context)|. If
> data
> >      holes have statically available type info, then the type could be
> taken
> >      into account : |(context * type) → ((α → string) * context)|.
> >   4. A context join operator that takes the contexts at the end of
> branches and
> >      yields the context after the branches have joined. This is used to
> >      determine the context at the end of a conditional |{if}| by joining
> the
> >      context at the end of the then-branch with the context at the end of
> the
> >      else-branch. It is also used with loops, where (unless proven
> otherwise)
> >      we have to join the context at the start (loop never entered) with a
> >      context once through, with a steady state context for many
> repetitions.
> >      |context list → context|
> >
> > By contrast, the runtime auto-sanitization scheme described in strawman
> III has
> > the same initial context, the same context propagation operator, no
> context join
> > operator and uses a slightly differently shaped sanitization function
> chooser :
> > |context → (α → (string * context))|.
> >
> >
> >      Contexts
> >
> > A context captures the state of the parser in a combined HTML/CSS/JS
> lexical
> > grammar. It is composed of a number of fields which pack into 2 bytes
> with room
> > to spare:
> >
> >    * State — a coarse parser state that distinguishes between
> >      CDATA/RCDATA/PCDATA and attributes in HTML, comments, strings, and
> regular
> >      expressions in JavaScript; and between comments, strings, and URLs
> in CSS.
> >    * Element Type — when in an HTML tag (between |<| and |>|), keeps
> track of
> >      whether the tag body is PCDATA, RCDATA, or CDATA; and once in an
> RCDATA or
> >      CDATA tag body, used to keep track of the expected end tag, e.g.
> inside a
> >      |<script>| body we have to find a |</script>| tag, but should ignore
> any
> >      apparent |</style>| tags.
> >    * Attribute type — the type of attribute we're in. Distinguishes
> between
> >      script attributes (|onclick|, etc.), |style| attributes, URL
> attributes
> >      (|href|, etc.), and other attributes.
> >    * Attribute end delimiter — indicates the termination condition for the
> >      attribute value we're in: double quoted, single quoted, unquoted, or
> >      none.
> >    * JavaScript following slash — for JavaScript states, explains what to
> do
> >      with a |/| that does not start a comment: enter a regular expression
> >      literal, or a division operator, or fail with an error message due
> to
> >      ambiguity from context joining.
> >    * URI part — for URI states, the part of the URI that we're in: the
> >      start, path, query, fragment, or an ambiguous part due to context
> >      joining.
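
As a rough illustration of how the fields listed above could pack into two bytes, here is a minimal sketch. Every enum member, field name and the bit layout below is an assumption made for illustration; the letter does not give the real enumeration values or layout.

```python
from dataclasses import dataclass
from enum import IntEnum

# Hypothetical enumerations; the real member lists are not given in the letter.
State = IntEnum("State", "HTML_PCDATA HTML_RCDATA HTML_TAG "
                         "HTML_BEFORE_ATTRIBUTE_VALUE JS JS_STRING CSS URI", start=0)
ElementType = IntEnum("ElementType", "NONE SCRIPT STYLE TEXTAREA TITLE NORMAL", start=0)
AttrType = IntEnum("AttrType", "NONE SCRIPT STYLE URI PLAIN_TEXT", start=0)
AttrEndDelim = IntEnum("AttrEndDelim",
                       "NONE DOUBLE_QUOTE SINGLE_QUOTE SPACE_OR_TAG_END", start=0)
JsFollowingSlash = IntEnum("JsFollowingSlash", "NONE REGEX DIV_OP UNKNOWN", start=0)
UriPart = IntEnum("UriPart", "NONE START PATH QUERY FRAGMENT UNKNOWN", start=0)

# Field order used when packing, most significant field first.
_LAYOUT = (("state", State), ("element", ElementType), ("attr", AttrType),
           ("delim", AttrEndDelim), ("slash", JsFollowingSlash), ("uri_part", UriPart))


def _width(enum_type) -> int:
    """Number of bits needed to store any member of enum_type."""
    return (len(enum_type) - 1).bit_length()


@dataclass(frozen=True)
class Context:
    state: State = State.HTML_PCDATA
    element: ElementType = ElementType.NONE
    attr: AttrType = AttrType.NONE
    delim: AttrEndDelim = AttrEndDelim.NONE
    slash: JsFollowingSlash = JsFollowingSlash.NONE
    uri_part: UriPart = UriPart.NONE

    def pack(self) -> int:
        packed = 0
        for name, enum_type in _LAYOUT:
            packed = (packed << _width(enum_type)) | int(getattr(self, name))
        return packed

    @classmethod
    def unpack(cls, packed: int) -> "Context":
        values = {}
        for name, enum_type in reversed(_LAYOUT):
            width = _width(enum_type)
            values[name] = enum_type(packed & ((1 << width) - 1))
            packed >>= width
        return cls(**values)


ctx = Context(State.URI, ElementType.NORMAL, AttrType.URI,
              AttrEndDelim.DOUBLE_QUOTE, JsFollowingSlash.NONE, UriPart.QUERY)
total_bits = sum(_width(t) for _, t in _LAYOUT)
```

A small immutable value like this is cheap to compare and to use as a map key, which matters when contexts are recorded per data hole.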
> >
> > Contexts support two operators: join and ε-commit.
> >
> > The join operator produces the context at the end of a condition, loop,
> switch,
> > or other flow control construct. This sometimes introduces an ambiguity.
> In the
> > template:
> >
> > <form action="{if $target}/{$name}/handle?tgt={$target}{else}/{$name}/default{/if}↑">Hello {$name}…
> >
> > One branch ends in the query portion of a URI, and one ends outside it.
> If there
> > were a data hole at the ↑, then we would not be able to determine an
> appropriate
> > sanitization function for it†. So context joining often introduces just
> enough
> > ambiguity, by using do-not-know values for fields, and in the common
> case, we
> > later reach a point where we discard that info. In the URI case, if there
> were a
> > |#| character at the ↑ we can reliably transition into a URI fragment
> context,
> > and in any case, the end of the attribute moots the question.
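
The join behaviour described above for the URI example might look like the sketch below. The two-field Context, the string constants and the after_hash helper are simplifications assumed for illustration; the real operator works over all context fields.

```python
from typing import NamedTuple, Sequence


class Context(NamedTuple):
    state: str      # coarse parser state, e.g. "URI" or "HTML_PCDATA"
    uri_part: str   # "NONE", "START", "PATH", "QUERY", "FRAGMENT" or "UNKNOWN"


def join(contexts: Sequence[Context]) -> Context:
    """Context after several branches rejoin (hypothetical, URI-only rules)."""
    first = contexts[0]
    if all(c == first for c in contexts):
        return first                      # branches agree: nothing is lost
    if all(c.state == "URI" for c in contexts):
        return Context("URI", "UNKNOWN")  # keep the state, forget the URI part
    raise ValueError("branches end in incompatible contexts: %r" % (contexts,))


def after_hash(context: Context) -> Context:
    """A literal '#' enters the fragment whatever URI part we were in."""
    if context.state != "URI":
        raise ValueError("not in a URI")
    return Context("URI", "FRAGMENT")


then_branch = Context("URI", "QUERY")   # .../handle?tgt={$target}
else_branch = Context("URI", "PATH")    # .../default
joined = join([then_branch, else_branch])
recovered = after_hash(joined)
```

A data hole placed directly at the joined context would have no single correct sanitizer, which is the case the footnote discusses.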
> >
> > The ε-commit operator is used when we see a data hole. In some cases, we
> > introduce parser states to delay decision making. In the template
> fragment, |<a
> > href=|, we could see a quote character next, or space, or the start of an
> > unquoted value, or the end of the tag (implying empty href), or a data
> hole
> > specifying the start of an unquoted attribute value. If the next
> construct is a
> > data hole we need to commit to it being an unquoted attribute. The
> ε-commit
> > operator in this case goes from an HTML_BEFORE_ATTRIBUTE_VALUE state with
> an
> > attribute end delimiter of NONE to a state appropriate to the value type
> (e.g.
> > JS for an |onclick| attribute) with an attribute end delimiter of
> SPACE_OR_TAG_END.
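
A minimal sketch of the ε-commit step for the |<a href=| case above. Only HTML_BEFORE_ATTRIBUTE_VALUE, NONE and SPACE_OR_TAG_END come from the text; the three-field Context and the other state and attribute-type names are assumed for illustration.

```python
from typing import NamedTuple


class Context(NamedTuple):
    state: str
    attr_type: str   # "SCRIPT", "STYLE", "URI", "PLAIN_TEXT" or "NONE"
    delim: str       # attribute end delimiter

# Hypothetical mapping from attribute type to the state its value is parsed in.
_VALUE_STATE = {
    "SCRIPT": "JS",
    "STYLE": "CSS",
    "URI": "URI",
    "PLAIN_TEXT": "HTML_NORMAL_ATTR_VALUE",
}


def epsilon_commit(context: Context) -> Context:
    """Resolve a delayed parser decision when a data hole is encountered."""
    if context.state == "HTML_BEFORE_ATTRIBUTE_VALUE":
        # A hole right after `attr=` must be the start of an unquoted value.
        return Context(_VALUE_STATE[context.attr_type],
                       context.attr_type,
                       "SPACE_OR_TAG_END")
    return context   # nothing was pending, so the context is unchanged


before_onclick_value = Context("HTML_BEFORE_ATTRIBUTE_VALUE", "SCRIPT", "NONE")
committed = epsilon_commit(before_onclick_value)
```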
> >
> > The precise details of both these operators were determined empirically to
> come
> > up with the simplest semantics that handles cases found in real code that
> web
> > developers do not consider to be badly written or confusing.
> >
> > † — This could be fixed by migrating the problematic data hole and the
> code
> > leading up to it into each branch, but this is tricky to do across
> template
> > boundaries and has not proven to be necessary for the codebase we
> migrated.
> >
> >
> >      Grammar
> >
> > The context propagation algorithm uses a combined HTML/CSS and JS lexical
> > grammar with productions for HTML, Attributes, JS, CSS, and URI. [Grammar
> > diagrams not reproduced here.] The dynamic productions are described below.
> >
> >
> >        DynamicText
> >
> > Converts plain text to HTML by entity encoding unless its type indicates
> > that it is known safe HTML.
> >
> >    * `I <3 ponies` → `I &lt;3 ponies`
> >    * |new SanitizedHtml('<b>Hello, World!</b>')| → `<b>Hello, World!</b>`
> >
> > The first case is handled by encoding all PCDATA special characters (<,
> >, and
> > &) as HTML entities (&lt;, &gt;, and &amp;). Other code-points may be
> escaped,
> > but need not be.
> >
> > In the second case, the safe HTML is emitted as is. It must be a mixed
> group of
> > complete tags and text nodes such that there exists a safe template that
> could
> > have produced it starting from an HTML PCDATA context and ending in the
> same
> > context, or there exists a safe HTML sanitizer that could have produced
> it.
> >
> >
> >        DynamicRcdata
> >
> > Converts plain text to HTML by entity encoding unless its type indicates
> > that it is known safe HTML.
> >
> >    * `I <3 ponies` → `I &lt;3 ponies`
> >    * |new SanitizedHtml('<b>Hello, World!</b>')| → `&lt;b&gt;Hello,
> >      World!&lt;/b&gt;`
> >
> > The first case is handled by encoding all RCDATA special characters (<,
> >, and
> > &) as HTML entities (&lt;, &gt;, and &amp;). Other code-points may be
> escaped,
> > but need not be.
> >
> > In the second case, the safe HTML is normalized. All the HTML special
> > characters are escaped except for ampersands (&), which are left as-is.
> > Since all RCDATA end tags contain `<`, and `<` is escaped to a string that
> > does not contain it, and no other code units are escaped to a string that
> > contains it, no safe HTML chunk can cause premature ending of an RCDATA
> > tag. This means that the odd but valid Soy template
> > |<textarea>{$foo}<script>alert('Keystone kop');</script></textarea>| will
> > not violate the structure security goal or unauthored code security goal
> > even when a chunk of safe HTML contains an RCDATA end tag like
> > |</textarea>|.
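
The DynamicText and DynamicRcdata rules above can be sketched together. This is an illustration under assumptions: SanitizedHtml is modelled as a str subclass used purely as a marker, and the function names are hypothetical rather than the Soy API.

```python
class SanitizedHtml(str):
    """Marker type for a string already known to be safe HTML (assumed stand-in)."""


def _escape(text: str, also_ampersand: bool) -> str:
    if also_ampersand:
        text = text.replace("&", "&amp;")   # must run first to avoid double escaping
    return text.replace("<", "&lt;").replace(">", "&gt;")


def dynamic_text(value) -> str:
    """PCDATA hole: plain text is entity-encoded, known-safe HTML is emitted as is."""
    if isinstance(value, SanitizedHtml):
        return str(value)
    return _escape(str(value), also_ampersand=True)


def dynamic_rcdata(value) -> str:
    """RCDATA hole: plain text is encoded; safe HTML is normalized, keeping its '&'."""
    return _escape(str(value), also_ampersand=not isinstance(value, SanitizedHtml))


print(dynamic_text("I <3 ponies"))
print(dynamic_text(SanitizedHtml("<b>Hello, World!</b>")))
print(dynamic_rcdata(SanitizedHtml("<b>Hello, World!</b>")))
```

Because `<` never survives dynamic_rcdata, no value can close the enclosing textarea or title early.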
> >
> >
> >        DynamicTagName
> >
> > Allows through parts of non-CDATA, non-RCDATA tag names. So the Soy
> > |<h{$headerLevel}>| can be used to generate |<h1>|, |<h2>|, …
> >
> > To avoid problems where a tag name might be combined with a static part
> to form
> > |script|, |style|, or another |CDATA| or |RCDATA| tag, we impose the
> following
> > restrictions:
> >
> >    * must contain only ASCII letters, digits, dashes and colons; and
> >    * must
> >          o contain a colon (a namespace), or
> >          o contain a digit, or
> >          o be the full name (case-insensitive) of a non-RCDATA, non-CDATA
> HTML
> >            element.
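
The tag-name restrictions above translate fairly directly into a filter. In this sketch the function name is hypothetical, and SAFE_ELEMENTS is a small illustrative subset; a real implementation would need the full list of non-RCDATA, non-CDATA elements.

```python
import re

# Only ASCII letters, digits, dashes and colons are allowed at all.
_ALLOWED_CHARS = re.compile(r"[A-Za-z0-9:-]+")

# Illustrative subset only (assumption), not a complete element list.
SAFE_ELEMENTS = frozenset({
    "a", "b", "div", "em", "i", "li", "ol", "p", "span",
    "table", "tbody", "td", "th", "thead", "tr", "u", "ul",
})


def filter_tag_name_part(value: str) -> str:
    """Return value if it may be interpolated into a tag name, else raise ValueError."""
    if not _ALLOWED_CHARS.fullmatch(value):
        raise ValueError("illegal character in tag name part: %r" % value)
    if ":" in value:                          # namespaced name
        return value
    if any(ch.isdigit() for ch in value):     # e.g. the "1" in <h{$headerLevel}>
        return value
    if value.lower() in SAFE_ELEMENTS:        # full name of a known-safe element
        return value
    raise ValueError("tag name part could help form a special tag: %r" % value)


heading = "<h" + filter_tag_name_part("1") + ">"
```

No special tag name contains a digit or a colon, so a fragment such as "scr" is rejected before it can be joined with static text to spell |script|.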
> >
> >
> >        DynamicAttrName
> >
> > Allows through parts of a non-special attribute name.
> >
> >    * `checked` → `checked`
> >    * `<script>alert(pwned)</script>` → /error/
> >
> > TODO: scheme to avoid concatenation from producing |on|*, |style|,
> |href|, etc.
> >
> >
> >        DynamicAttrValue
> >
> > Converts plain text to HTML by entity encoding so it can be embedded in
> an HTML
> > attribute. If embedded in a quoteless attribute, then also encodes
> spaces.
> >
> > If the value is known safe HTML, strips tags so that the Soy template
> > |<abbr title="{$longDesc}">{$shortDesc}</abbr>| works even when both
> > |$longDesc| and |$shortDesc| are snippets of sanitized HTML.
> >
> >    * `I <3 ponies` → `I &lt;3 ponies`
> >    * |new SanitizedHtml('<b>Hello, World!</b>')| → `Hello, World!`
> >
> > The first case is handled by encoding all HTML special characters including
> > quotes (<, >, &, ", ', and =) as HTML entities (&lt;, &gt;, &amp;, &quot;,
> > &#39;, and &#61;).
> >
> > The second case is handled by stripping HTML tags and comments from the
> safe
> > HTML, and then normalizing it by applying the same escaping scheme as for
> the
> > first case, but without encoding ampersands (&).
> >
> > For both cases, when the HTML attribute is not quoted, we additionally have
> > to encode all codepoints that would signal the end of an HTML attribute,
> > including a number of space and control characters. This set was derived
> > empirically, and includes the backtick (`) which can be used as a quoting
> > character on some versions of IE.
> >
> >
> >        DynamicJsString
> >
> > Escapes plain text so it can be incorporated into part of a JS string
> literal by
> > escaping special characters, e.g. newline → |\n|.
> >
> >    * `John "The Anonymous" Doe` → `John \"The Anonymous\" Doe`
> >
> > We escape dynamic JS strings using the following table:
> > Codepoint   Glyph                   Escape
> > U+000A      (line feed)             \n
> > U+000D      (carriage return)       \r
> > U+0022      "                       \u0022
> > U+0027      '                       \u0027
> > U+002F      /                       \/
> > U+003C      <                       \u003C
> > U+003E      >                       \u003E
> > U+005C      \                       \\
> > U+2028      (line separator)        \u2028
> > U+2029      (paragraph separator)   \u2029
> >
> > These escapes prevent premature string closing, since all JS quote
> characters
> > are encoded to a sequence that does not contain a quote character and no
> other
> > codepoint is encoded to a sequence containing a quote character. This
> prevents
> > additional JS syntax errors by properly encoding all JS newline
> codepoints. It
> > preserves structure by encoding any sequences that would end a CDATA tag,
> CDATA
> > section, escaping text span, or quoted HTML attribute value. The output
> can be
> > embedded in an HTML attribute value by additionally escaping & to \u0026.
> In the
> > case of unquoted HTML attribute values, just escaping ampersands is not
> > sufficient ; the output needs to be HTML entity escaped per
> DynamicAttrValue
> > <#DynamicAttrValue>.
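
A minimal sketch of the escaping table above. The function name and the in_html_attribute flag are assumptions. The code follows the table, so quote characters come out as \u0022 and \u0027 rather than the \" form shown in the one-line example.

```python
# Escapes taken from the table above.
_JS_ESCAPES = {
    "\n": "\\n",
    "\r": "\\r",
    '"': "\\u0022",
    "'": "\\u0027",
    "/": "\\/",
    "<": "\\u003C",
    ">": "\\u003E",
    "\\": "\\\\",
    "\u2028": "\\u2028",
    "\u2029": "\\u2029",
}


def escape_js_string(text: str, in_html_attribute: bool = False) -> str:
    """Escape text for use inside a quoted JavaScript string literal."""
    table = dict(_JS_ESCAPES)
    if in_html_attribute:
        table["&"] = "\\u0026"   # keeps HTML entity decoding from altering the script
    return "".join(table.get(ch, ch) for ch in text)


print(escape_js_string("</script><script>doEvil()"))
print(escape_js_string("Tom & Jerry's", in_html_attribute=True))
```

The output of every escape contains no quote character and no angle bracket, so the value can close neither the string literal nor the enclosing script element.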
> >
> >
> >        DynamicRegExp
> >
> > Like DynamicJsString <#DynamicJsString>, but additionally escapes
> characters
> > special in regexp like ? and *.
> >
> >    * `John "The Anonymous" Doe + 1` → `John \"The Anonymous\" Doe \+ 1`
> >
> >
> >        DynamicJsValue
> >
> > Quotes strings and encodes them like DynamicJsString <#DynamicJsString>, and
> > puts spaces around boolean, null, and numeric values.
> >
> >    * `John "The Anonymous" Doe + 1` → `"John \"The Anonymous\" Doe + 1"`
> >    * `42` → `42 `
> >    * `false` → `false `
> >
> > Putting spaces around non-string values makes sure that they will be
> separate
> > tokens but will not introduce a function call in the case of the Soy
> template
> >
> >       |var f = function () {}  // Missing semicolon.
> >       {$myBoolean}&& sideEffect();|
> >
> > where due to semicolon insertion, adding parentheses would cause the
> template to
> > produce the equivalent of
> >
> >       |var f = ((function () {})(false))&& sideEffect();|
> >
> > given |{ myBoolean: false }|.
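
The value rules above might be sketched as follows. The function name, the choice of Python types standing in for template values, and the rejection of NaN and infinities are assumptions. The string escapes repeat the DynamicJsString table so that the block stands alone.

```python
import math

# Same escapes as the DynamicJsString table.
_JS_ESCAPES = {
    "\n": "\\n", "\r": "\\r", '"': "\\u0022", "'": "\\u0027", "/": "\\/",
    "<": "\\u003C", ">": "\\u003E", "\\": "\\\\",
    "\u2028": "\\u2028", "\u2029": "\\u2029",
}


def escape_js_value(value) -> str:
    """Render a value as a JavaScript expression that is safe to interpolate."""
    if value is None:
        return " null "
    if isinstance(value, bool):            # check bool before int: bool is an int
        return " true " if value else " false "
    if isinstance(value, (int, float)):
        if isinstance(value, float) and not math.isfinite(value):
            raise ValueError("no JavaScript literal for %r" % value)
        return " " + repr(value) + " "
    text = "".join(_JS_ESCAPES.get(ch, ch) for ch in str(value))
    return '"' + text + '"'


# The template from the text: spaces keep the value a separate token without
# turning the preceding function expression into a call.
rendered = "var f = function () {}\n" + escape_js_value(False) + "&& sideEffect();"
print(rendered)
```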
> >
> >
> >        DynamicCssString
> >
> > Escapes plain text so it can be incorporated into part of a CSS string
> literal
> > by escaping special characters, e.g. newline → |\a |.
> >
> >    * `John "The Anonymous" Doe` → `John \22 The Anonymous\22 Doe`
> >
> > We encode *all* CSS special characters using CSS hex escaping. CSS hex
> escaping
> > allows an escape to be followed optionally by a space or tab character so
> that
> > an escape may be followed by an unescaped hex digit. We always emit a
> following
> > space.
> >
> > We aggressively encode all CSS special characters to prevent unspecified
> CSS
> > error recovery <http://www.w3.org/TR/css3-syntax/#error> from restarting
> parsing
> > inside quoted strings.
> >
> >
> >            9.2.1. Error conditions
> >
> >    In general, this document does not specify error handling behavior for
> user
> >    agents (e.g., how they behave when they cannot find a resource
> designated by
> >    a URI).
> >
> >    However, user agents must observe the rules for handling parsing
> errors.
> >
> >    Since user agents may vary in how they handle error conditions,
> authors and
> >    users must not rely on specific error recovery behavior.
> >
> > We also escape both angle brackets (< and >), which are already CSS
> > specials, so that HTML escaping text spans, CDATA sections, CDATA end tags,
> > etc. cannot be introduced into the middle of CSS strings.
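
CSS hex escaping with an always-emitted trailing space can be sketched as below. The letter does not list which characters count as CSS special, so this sketch assumes the most aggressive reading: everything except ASCII letters, ASCII digits and the space character is escaped. The function name is hypothetical.

```python
def escape_css_string(text: str) -> str:
    """Hex-escape text for use inside a quoted CSS string literal."""
    out = []
    for ch in text:
        if ch == " " or (ch.isascii() and ch.isalnum()):
            out.append(ch)
        else:
            # Backslash, hex code point, then a space that terminates the escape so
            # that a following hex digit is not swallowed into it.
            out.append("\\%x " % ord(ch))
    return "".join(out)


print(escape_css_string('John "The Anonymous" Doe'))
print(escape_css_string("</style>"))
```

The terminating space belongs to the escape, so a real space in the input that follows an escaped character is still emitted in addition to it.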
> >
> >
> >        DynamicCssQuantityOrKeywordOrName
> >
> > Allows a CSS keyword, quantity, or ID or class name through, but filters
> > out content containing special characters. Some use cases:
> >
> >    * |color: #{$hashColor}|
> >    * |color: {$colorName}|
> >    * |border-{$rtlLeft}: … /* left for English, right for Arabic */|
> >    * |div.{$className} { … }|
> >    * |width: {$width}{$widthUnits}|
> >
> > Some example values:
> >
> >    * `24px` → `24px`
> >    * `left` → `left`
> >    * `background` → `background`
> >    * `expression` → /error/
> >
> > TODO: explain the allowed set and its derivation.
> >
> >
> >        DynamicSchemeFilteredUriPart
> >
> > Whitelists a protocol if present to prevent code execution via
> |javascript:…|,
> > and normalizes the URI (encoding all unencoded HTML special characters,
> quotes,
> > spaces, and parentheses) so it can be embedded. E.g. `"` → %22.
> >
> > URI normalization percent escapes all codepoints escaped by
> DynamicQueryPart
> > <#DynamicQueryPart> except for the percent character (%).
> >
> > TODO: Explain the filter details and their derivation.
> >
> >
> >        DynamicQueryPart
> >
> > Encodes all characters that are special or disallowed in a URI.
> >
> > We encode all codepoints encoded by |encodeURIComponent| making the same
> > assumption that the URL is UTF-8 encoded.
> >
> > Over |encodeURIComponent|, we additionally encode single quotes (') and
> > parentheses (( and )) so that the result can be safely embedded in single
> quoted
> > HTML attributes and in single quoted and unquoted CSS |url(…)|
> constructs. Note
> > that applying an extra level of CSS escaping using |\27 | style escapes
> is not
> > an option since IE (for interoperability with DOS file paths?) does not
> > interpret |\| as the beginning of an escape when it appears inside a
> |url(…)|.
> >
> > Each of these characters is significant in a URI as specified in RFC
> 3986:
> >
> >
> >            2.2 <http://www.apps.ietf.org/rfc/rfc3986.html#sec-2.2>
> Reserved
> >            Characters
> >
> >    sub-delims  = "!" / "$" / "&" /_"'" / "(" / ")"_
> >
> > so escaping them is technically not semantics preserving, but encoding
> them is
> > safe for all schemes that commonly appear in HTML because those
> codepoints only
> > appear in the obsolete mark productions.
> >
> >
> >            D.2 <http://www.apps.ietf.org/rfc/rfc3986.html#sec-D.2>
> Modifications
> >
> >    The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of
> >    [RFC2234]. This change required all rule names that formerly included
> >    underscore characters to be renamed with a dash instead. In addition,
> a
> >    number of syntax rules have been eliminated or simplified to make the
> >    overall grammar more comprehensible. Specifications that refer to the
> >    obsolete grammar rules may be understood by replacing those rules
> according
> >    to the following table: …
> >
> >    mark    "-" / "_" / "." / "!" / "~" / "*" /_"'" / "(" / ")"_
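
In Python the DynamicQueryPart encoding can be approximated with the standard library. This is a sketch under assumptions: escape_query_part is a hypothetical name, and urllib.parse.quote stands in for |encodeURIComponent|, configured to leave `!` and `*` alone while still encoding single quotes and parentheses.

```python
from urllib.parse import quote


def escape_query_part(text: str) -> str:
    """Percent-encode text (as UTF-8) for use in a URI query.

    quote() never encodes ASCII letters, digits or "_.-~"; adding "!*" to the
    safe set leaves exactly the characters encodeURIComponent leaves alone,
    minus the single quote and the parentheses.
    """
    return quote(text, safe="!*")


print(escape_query_part("O'Reilly (x)"))
print(escape_query_part("a=b&c=d"))
```

The result contains no single quote or parenthesis, so it can sit inside a single-quoted attribute or an unquoted |url(…)| without closing it.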
> >
> >
> >        DynamicUriPart
> >
> > Normalizes the URI like URI normalization <#DynamicSchemeFilteredUriPart>
> so
> > that an already encoded path or fragment can be emitted inline but does
> not
> > filter since a protocol part cannot appear here.
> >
> >
> >      Context Propagation
> >
> > The context propagation algorithm uniquely determines the context at
> every data
> > hole so that a later pass may choose a sanitization function for each
> hole.
> >
> > The algorithm operates at two levels: one on the graph of templates, and
> > another individually within templates.
> >
> > The first deals with identifying the minimal set of templates that need
> to be
> > processed, and might clone templates to deal with templates that are
> called in
> > multiple different contexts.
> >
> > The template context propagation algorithm uses an inference object which
> is
> > implemented as a set of nested maps and a pointer to a parent inference
> object.
> > This allows us to speculatively type a template sub-graph, and when we
> have a
> > consistent view of types, we can collapse our conclusions into the parent
> by
> > simply copying maps from children to parent. The maps include maps from
> > holes to start contexts, and from templates to the end contexts used to
> > type calls.
> >
> > def autosanitize(templates):
> >   inferences = Inferences()
> >   for template in templates:
> >     if inferences.getEndContext(template) is not None:
> >       continue  # already done
> >     if template.is_public() or template.is_contextually_autosanitized():
> >       # By exploring the call graph from only public templates, ones
> >       # that can be invoked by front-end code, or ones that must be
> >       # contextually sanitized, we do not trigger error checks for
> >       # parts of the code-base that don't yet use contextual
> >       # auto-sanitization, easing migration.
> >       compute_end_context(
> >           template, inferences, start_context=HTML_PCDATA)  # see footnote
> >   return inferences
> >
> > That algorithm delegates all the hard work to another algorithm below
> that
> > examines the template graph reachable from one particular top-level
> template.
> >
> > def compute_end_context(template, inferences, start_context):
> >   # First, assume that the end context is the same as the start context.
> >   # Template authors seem to write templates that fit this way.
> >   # Empirically, less than 0.2% of templates in our sample violate
> >   # this assumption.
> >   # The ones that do tend to be some of the gnarliest code that
> >   # template authors would rather not refactor.
> >
> >   # We need to choose an end context now to avoid infinite recursion
> >   # if a template recurses.
> >
> >   # Start with the optimistic assumption that the above is true.
> >   optimistic_assumption_1 = Inferences(parent=inferences)
> >   optimistic_assumption_1.template_end_contexts[template] = start_context
> >   end_context = propagate_context(
> >       template.children, start_context, optimistic_assumption_1)
> >   if start_context == end_context:
> >     # Our optimistic assumption was warranted.
> >     optimistic_assumption_1.commit_into_parent()
> >     return end_context
> >
> >   # Otherwise, assume that the end_context above is the end_context
> >   # and check that we have reached a fixed point.
> >   optimistic_assumption_2 = Inferences(parent=inferences)
> >   optimistic_assumption_2.template_end_contexts[template] = end_context
> >   end_context_fixed_point = propagate_context(
> >       template.body, start_context, optimistic_assumption_2)
> >   if end_context_fixed_point == end_context:
> >     # We have found a fixed point.  Phew!
> >     optimistic_assumption_2.commit_into_parent()
> >     return end_context_fixed_point
> >
> >   # There are various other strategies we could try here, but
> >   # we have not seen a need in real template code.
> >   raise Error(...)
> >
> > Thus far, we have done nothing that is particular to the syntax of the
> > templating language itself. Different languages have different semantics
> > around parameter passing, and provide different flow control constructs. The
> > algorithm below is an example that deals with a simple template language
> > that provides calls, conditions, chunks of static template text, and
> > expression interpolations which fill data holes. On a call, it may recurse
> > to the compute end context algorithm above, which is how we lazily explore
> > the portion of the template call graph needed.
> >
> > def propagate_context(parse_tree_nodes, context, inferences):
> >   for parse_tree_node in parse_tree_nodes:
> >     if is_safe_text_node(parse_tree_node):
> >       context = apply_html_grammar(parse_tree_node.safe_text, context)
> >     elif is_data_hole(parse_tree_node):
> >       context = epsilon_commit(context)  # see definition above
> >       inferences.context_for_data_hole[parse_tree_node] = context
> >       context = …   # compute context after hole.
> >     elif is_conditional(parse_tree_node):
> >       if_context = propagate_context(
> >           parse_tree_node.if_branch, context, inferences)
> >       else_context = propagate_context(
> >           parse_tree_node.else_branch, context, inferences)
> >       context = context_join(if_context, else_context)
> >     elif is_call_node(parse_tree_node):
> >       output_context = None
> >       # possible_callees_of comes up with the templates this might be
> >       # calling, and may clone templates if they are called in multiple
> >       # different contexts.
> >       # Most template languages have static call graphs, so in practice,
> >       # there is exactly one possible callee.
> >       for possible_callee in possible_callees_of(parse_tree_node, context):
> >         if possible_callee not in inferences.template_end_contexts:
> >           context_after_call = compute_end_context(
> >               possible_callee, inferences, context)
> >         else:
> >           context_after_call = (
> >               inferences.template_end_contexts[possible_callee])
> >         if output_context is None:
> >           output_context = context_after_call
> >         else:
> >           # Since 99% of templates end in their start context, in
> >           # practice, this join does little.
> >           output_context = context_join(
> >               output_context, context_after_call)
> >       context = output_context
> >   return context
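To make the propagation step concrete, here is a deliberately tiny, runnable sketch. It is not the grammar or the join used by the authors: the three-state toy grammar, the tuple encoding of parse tree nodes, the strict-equality join, and the simplification that a hole leaves the context unchanged are all assumptions made for illustration, and calls are left out.

```python
PCDATA, TAG, ATTR = "HTML_PCDATA", "HTML_TAG", "HTML_ATTR_VALUE"


def apply_toy_grammar(text, context):
    """Advance the context over static text with a three-state toy grammar."""
    for ch in text:
        if context == PCDATA and ch == "<":
            context = TAG
        elif context == TAG and ch == '"':
            context = ATTR
        elif context == TAG and ch == ">":
            context = PCDATA
        elif context == ATTR and ch == '"':
            context = TAG
    return context


def context_join(a, b):
    """Toy join: both branches must agree, otherwise typing fails."""
    if a != b:
        raise ValueError("branches end in different contexts: %s vs %s" % (a, b))
    return a


def propagate(nodes, context, hole_contexts):
    for node in nodes:
        if isinstance(node, str):              # static template text
            context = apply_toy_grammar(node, context)
        elif node[0] == "hole":                # ("hole", name)
            hole_contexts[node[1]] = context   # record the hole's start context
        elif node[0] == "if":                  # ("if", then_nodes, else_nodes)
            context = context_join(
                propagate(node[1], context, hole_contexts),
                propagate(node[2], context, hole_contexts))
    return context


template = ['<a href="', ("hole", "url"), '">',
            ("if", [("hole", "label")], ["untitled"]),
            "</a>"]
holes = {}
end_context = propagate(template, PCDATA, holes)
```

After the run, `holes` maps `url` to the attribute-value context and `label` to PCDATA, which is the information a later pass would need to pick a sanitization function for each hole.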
> >
> > ? : We make the simplifying assumption that the start context for all public
> > templates is HTML_PCDATA. Some templating languages may be used in different
> > contexts, and so this assumption might not prove valid. We could choose the
> > starting context for public templates based on some kind of annotation or
> > naming convention particular to the templating language.
> >
> >
> >      Sanitization Functions
> >
> > We define a suite of sanitization functions. The table below describes them
> > briefly and the context in which they are used. There are significantly more
> > of them than in most manual escaping schemes. As noted above, most
> > developers who don't work on parsers for HTML/CSS/JS have a simplified
> > mental model of the grammar, which makes it difficult to choose between this
> > many options. We have many sanitization functions because we want to
> > minimize template output size to minimize network latency; having more
> > sanitization functions lets us avoid escaping common characters like spaces
> > when safe. The naming convention for sanitization functions reflects the
> > escaper <#glossary-escaper>, filter <#glossary-filter>, and normalizer
> > <#glossary-normalizer> definitions from the glossary.
> >
> > |escapeHTML|    HTML entity escapes plain text, and allows pre-sanitized
> >     HTML content through unchanged.
> > |normalizeHTML|    Normalizes HTML. Same as |escapeHTML|, but does not
> >     encode ampersands.
> > |{escape,normalize}HTMLRcdata|    Like |escapeHTML| but does not allow
> >     pre-sanitized HTML content through unchanged, since tags are not allowed
> >     in RCDATA contexts, |<title>| and |<textarea>|.
> > |{escape,normalize}HTMLAttribute|    Like |escapeHTML| but strips tags from
> >     pre-sanitized HTML content instead of passing it through unchanged,
> >     since tags are not allowed in attribute values.
> > |filterHtmlElementName|    Rejects any invalid element name or non-PCDATA
> >     element.
> > |filterHtmlAttribName|    Rejects any invalid attribute name or attribute
> >     name that has JS, CSS, or URI content.
> > |{escape,normalize}URI|    Percent-encodes (assuming UTF-8) URI, HTML, JS,
> >     and CSS special characters so that the URL can be safely embedded. This
> >     means encoding parentheses and single quotes, which should not be
> >     normalized according to RFC 3986, and is not valid for all
> >     non-hierarchical URI schemes, but the only productions using single
> >     quotes or parentheses are obsolete marker productions, and normalizing
> >     these characters is essential to safely embedding URIs in unquoted CSS
> >     |url(…)| and to making sure that CSS error recovery mode doesn't jump
> >     into the middle of a quoted string.
> > |filterNormalizeURI|    Like |normalizeURI| but first rejects any input that
> >     might embed a protocol other than |http|, |https|, or |mailto|.
> > |{escape,normalize}JSStringChars|    Uses |\\| and |\uABCD| style escapes
> >     for any code-units special in HTML, JS, or conditional compilation
> >     directives.
> > |{escape,normalize}JSRegexChars|    Like |{escape,normalize}JSStringChars|
> >     but also escapes regular expression special characters like |'$'|.
> > |{escape,normalize}JSValue|    Encodes a boolean or a number as its string
> >     representation surrounded by spaces. Otherwise escapes a string value
> >     and wraps it in quotes.
> > |escapeCSSStringChars|    Uses |\ABCD| style escapes to escape HTML and CSS
> >     special characters.
> > |filterCssIdentOrValue|    Allows CLASSes and IDs for CSS selectors, parts
> >     of property names necessary for many BIDI applications, CSS keyword
> >     values, color literals, and quantities. But disallows property names
> >     that might nest JavaScript, and disallows URL schemes.
> > |noAutoescape|    Passes its input through unchanged. This is an auditable
> >     exception to auto-sanitization.
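The difference between an escaper, a normalizer, and a filter can be sketched roughly as follows. These are simplified stand-ins rather than the functions from the table: the Python names, the `REJECTED` placeholder, and the scheme-detection regular expression are assumptions, the pass-through for pre-sanitized HTML is omitted, and the URI function only filters (it does no percent-encoding).

```python
import html
import re

ALLOWED_SCHEMES = {"http", "https", "mailto"}
REJECTED = "#rejected"  # hypothetical placeholder returned for rejected input


def escape_html(text):
    """Escaper: encodes every HTML special character, including '&'."""
    return html.escape(text, quote=True)


def normalize_html(text):
    """Normalizer: like escape_html, but leaves '&' alone so that entities
    already present in the input are not escaped a second time."""
    return (text.replace("<", "&lt;").replace(">", "&gt;")
            .replace('"', "&quot;").replace("'", "&#x27;"))


def filter_uri(uri):
    """Filter: rejects input whose scheme is not on the allow-list."""
    # A scheme is whatever precedes the first ':' when no '/', '?' or '#'
    # comes before it.
    match = re.match(r"([^:/?#]*):", uri)
    if match and match.group(1).lower() not in ALLOWED_SCHEMES:
        return REJECTED
    return uri


escaped = escape_html("a &amp; <b>")
normalized = normalize_html("a &amp; <b>")
```

An escaper accepts any input and encodes it, a normalizer assumes the input is already mostly in the target language, and a filter leaves acceptable input alone while refusing everything else.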
> >
> >
> >      Sanitized Content Types
> >
> > Sanitized content types allow template users to pre-sanitize some content,
> > and to pass approved structured content through.
> >
> > |new SanitizedContent('<b>Hello, World!</b>')| specifies a chunk of HTML
> > that the creator asserts is safe to embed in HTML PCDATA.
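One way such a wrapper type could interact with an HTML escaper is sketched below. The Python class and the `escape_html` function are illustrative assumptions that mirror the description above, not the authors' API.

```python
import html


class SanitizedContent:
    """Marks a string that its creator asserts is safe for HTML PCDATA."""

    def __init__(self, content):
        self.content = content


def escape_html(value):
    """Passes SanitizedContent through unchanged; escapes everything else."""
    if isinstance(value, SanitizedContent):
        return value.content
    return html.escape(str(value), quote=True)


trusted = escape_html(SanitizedContent("<b>Hello, World!</b>"))
untrusted = escape_html("<b>Hello, World!</b>")
```

Because the type, not the string's contents, decides whether escaping is skipped, every place that constructs a `SanitizedContent` is an easy target for code review.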
> >
> > It is possible for misuse of this feature to violate all the safety
> > properties contextual auto-sanitization provides. We assert that allowing
> > this makes it easier to migrate code that has no XSS safety net to a better
> > place, and satisfies some compelling use cases. But it needs to be used
> > carefully. Developers should heed this advice:
> >
> >    * Don't roll your own escapers. If you find them in existing code, prefer
> >      escaping in the template via the contextual auto-sanitization. This
> >      does not apply to filters. Filter early, and filter often.
> >    * Put the sanitized content type constructor as close as possible to the
> >      code that does the sanitization.
> >    * Don't use tag or attribute black-lists.
> >    * Be skeptical of "safe" HTML from a database. This is a vector for SQL
> >      Injection to turn into XSS.
> >
> > Compelling use cases include:
> >
> >    * HTML from a trusted source such as translators who are translating
> >      strings into foreign languages. Consider using a template system that
> >      supports text L10N directly.
> >    * HTML from tag whitelisters, wiki-text-to-html converters, rich text
> >      editors, etc.
> >
> >
> >    Caveats
> >
> > As noted above (in the runtime contextual auto-sanitization strawman),
> > static approaches (including ours) cannot handle all possible uses of
> > dynamic attribute and element names. These seem rare in real code, and
> > relatively easy to fix, but if necessary, a hybrid runtime/static approach
> > could address this problem.
> >
> > Static approaches get into corner cases around zero-length untrusted values.
> > For example, to preserve the code effect property <#code_effect_property>,
> > we need to make sure that no untrusted value specifies a |javascript:| or
> > similar URL protocol. In template code like |<img src="{$x}{$y}">| we might
> > naively decide that it is sufficient to filter |$x| to make sure that it
> > specifies no protocol or an approved one. But if |$x| is the empty string,
> > then |$y| might still specify a dangerous protocol. Alternatively, |$x|
> > might specify |"javascript"| and |$y| start with a colon. This hole can be
> > closed in a number of ways, but is a source of considerable complexity
> > because the two interpolations might cross template boundaries. Another
> > example of a zero-length problem arises in JavaScript regular expressions:
> > in |var myPattern = /{$x}/|, an empty |$x| would turn the regular expression
> > literal into a line comment.
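Both corner cases can be reproduced in a few lines. The scheme check below is a simplified assumption standing in for a real URI filter; the point is only that a check applied to `$x` alone accepts values whose concatenation with `$y` is dangerous.

```python
import re


def has_safe_scheme(value):
    """True if value names no protocol, or one of http, https, mailto."""
    match = re.match(r"([^:/?#]*):", value)
    return match is None or match.group(1).lower() in {"http", "https", "mailto"}


# Each pair is ($x, $y) for the template <img src="{$x}{$y}">.
cases = [("", "javascript:alert(1)"),      # empty $x, protocol entirely in $y
         ("javascript", ":alert(1)")]      # protocol split across both holes

x_alone_passes = [has_safe_scheme(x) for x, y in cases]
combined_passes = [has_safe_scheme(x + y) for x, y in cases]

# The regular expression case: an empty $x leaves two adjacent slashes.
empty_regex = "var myPattern = /%s/" % ""
```

In both URL cases the filter accepts `$x` on its own while the rendered attribute value still carries the `javascript:` protocol.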
> >
> > Our JavaScript parser is unsound. JavaScript does not have a regular lexical
> > grammar (even ignoring conditional compilation) because of the way it
> > specifies whether a |/| starts a regular expression or is a division
> > operator. We use a scheme based on a draft JavaScript 1.9 grammar devised by
> > Waldemar Horwat that makes that decision based on the last non-comment
> > token. This works well for all the code we've seen that people actually
> > write, and makes our approach feasible, but there is a known case where it
> > fails: |x++ /a/i| vs |x = ++/a/i|. The second code snippet, while
> > nonsensical, is valid JavaScript that our scheme fails to handle correctly.
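A last-token heuristic of this kind might look like the sketch below. The token sets are illustrative assumptions, not the sets from the draft grammar the authors used; the sketch is only meant to show why `++` cannot be classified correctly from the previous token alone.

```python
import re

# After these tokens a '/' is taken to be a division operator (assumed set).
DIVISION_PUNCTUATORS = {")", "]", "++", "--"}
# After these keywords a '/' is taken to start a regular expression (assumed set).
REGEX_KEYWORDS = {"return", "typeof", "case", "in", "instanceof", "void",
                  "delete", "throw", "new", "do", "else"}


def slash_starts_regex(last_token):
    """Guess whether '/' begins a regex, given the last non-comment token."""
    if last_token is None:                      # start of the script
        return True
    if last_token in DIVISION_PUNCTUATORS:
        return False
    if last_token in REGEX_KEYWORDS:
        return True
    if re.match(r"[A-Za-z_$0-9\"']", last_token):
        return False                            # identifier, number, or string
    return True                                 # any other punctuator


after_assign = slash_starts_regex("=")      # x = /a/i   -> regex
after_name = slash_starts_regex("x")        # x / a / i  -> division
after_increment = slash_starts_regex("++")  # same answer for both snippets
```

Since the last token before the slash is `++` in both `x++ /a/i` and `x = ++/a/i`, the function returns the same answer for both, which is right for the first snippet and wrong for the second.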
> >
> > Our parser does not currently recognize HTML5 escaping text spans
> > <http://dev.w3.org/html5/markup/aria/syntax.html#escaping-text-span>, the
> > regions inside |<script>| and |<style>| bodies delimited by |<!--| and
> > |-->| that suppress end-tag processing. This can be fixed if a codebase
> > seems to use them. Our sanitization function choices are designed to not
> > produce content containing escaping text span boundaries.
> >
> >
> >    Case Study
> >
> > We studied 1035 templates that were migrated from an existing codebase to
> > use contextually sanitized templates. Most of the templates were relatively
> > small, but they totalled 21098 LOC and 783kB. The compilation load time cost
> > for these 1035 templates was 998339279 ns (about 1 s) on a platform with
> > 2 GB of RAM and an Intel 2.6 GHz dual-core processor, running Linux 2.6.31.
> >
> > 1- 18    ######################################## (685)
> > 19- 36    ############ (210)
> > 37- 55    #### (78)
> > 56- 73    # (33)
> > 74- 91    (10)
> > 92- 110    (7)
> > 111- 128    (4)
> > 129- 147    (3)
> > 148- 165    (1)
> > 166- 183    (1)
> > 184- 202    (1)
> > 203- 220    (1)
> > 221- 238    (0)
> > 239- 257    (0)
> > 258- 275    (0)
> > 276- 294    (0)
> > 295- 312    (1)
> >
> > Most of the sanitization functions chosen were plain text→HTML, the same
> > choice that non-contextual auto-sanitization makes.
> >
> > |escapeHtml|                                   602
> > |escapeHtmlAttribute|                          380
> > |filterNormalizeUri|, |escapeHtmlAttribute|    231
> > |escapeJsValue|                                 39
> > |filterCssValue|                                33
> > |escapeJsString|                                27
> > |escapeUri|                                     15
> > |escapeHtmlRcdata|                              10
> > |escapeHtmlAttributeNospace|                     7
> > |filterHtmlIdent|                                3
> > |filterNormalizeUri|                             1
> >
> > 268 out of 1348 interpolation sites (19.9%) require runtime filtering,
> > mostly |filterNormalizeUri|.
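The totals can be checked against the table of chosen functions. The sketch below assumes that "requires runtime filtering" means the chosen chain contains a function whose name starts with `filter`; that reading is an inference, though it does reproduce the quoted figures.

```python
# Chosen sanitization function chains and how often each was picked.
counts = {
    ("escapeHtml",): 602,
    ("escapeHtmlAttribute",): 380,
    ("filterNormalizeUri", "escapeHtmlAttribute"): 231,
    ("escapeJsValue",): 39,
    ("filterCssValue",): 33,
    ("escapeJsString",): 27,
    ("escapeUri",): 15,
    ("escapeHtmlRcdata",): 10,
    ("escapeHtmlAttributeNospace",): 7,
    ("filterHtmlIdent",): 3,
    ("filterNormalizeUri",): 1,
}

total_sites = sum(counts.values())
filtered_sites = sum(n for chain, n in counts.items()
                     if any(name.startswith("filter") for name in chain))
filtered_share = round(100.0 * filtered_sites / total_sites, 1)
uri_filtered = sum(n for chain, n in counts.items()
                   if "filterNormalizeUri" in chain)
```

Under that assumption, 232 of the filtered sites involve `filterNormalizeUri`, consistent with "mostly".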
> >
> > The benchmark runs over a large template with dummy data that is meant to be
> > representative of the application using it. The benchmarks range from
> > 15.2 ms to 16.8 ms and the standard deviation ranges from roughly 0.6 ms to
> > 6 ms, which puts the runtime cost of the sanitization functions in the
> > noise.
> >
> > No sanitization
> > ====
> > 50% Scenario 16709334.99 ns;σ=615548.54 ns @ 10 trials
> >
> > Non-contextual auto-sanitization
> > ====
> > 50% Scenario 16835324.39 ns;σ=6030836.03 ns @ 10 trials
> >
> > Full contextual auto-sanitization
> > ====
> > 50% Scenario 15227861.39 ns;σ=616193.00 ns @ 10 trials
> >
> > In JavaScript, a state-machine based runtime contextual auto-sanitization
> > approach shows a 3-4 times slowdown over string concatenation.
> >
> > # rows   string +=   Array.join   open(Template(…))   DOM       render time
> > 1000     54 ms       68 ms        204 ms              508 ms    586 ms
> > 5000     267 ms      332 ms       1159 ms             2528 ms   1458 ms
> >
> > We ran the same benchmark against a runtime contextual auto-sanitizer we
> > wrote for JavaScript. The "noEscape" case simply appends all the strings to
> > a buffer. It does no context inference. The "parseOnly" case appends to a
> > buffer and does context inference, but does no escaping. The "dynEscape"
> > case does context propagation and chooses one of three escaping methods by
> > looking at the context from the parser. The cost of applying the escaping
> > directive is about the same as a string copy, and the cost of parsing and
> > propagating context at runtime is about 6 times that cost. This benchmark is
> > a good comparison for templates where the logic that computes values to fill
> > data holes is simple, so the cost of executing the template should approach
> > string concatenation.
> >
> > Totals for 1000 runs:
> > noEscape   :    491316000 ns  (1.0)
> > parseOnly  :   2979672000 ns  (6.1)
> > dynEscape  :   3531971000 ns  (7.2)
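The ratios in parentheses follow directly from the totals, as the short check below shows. The variable names are chosen for this sketch only.

```python
# Totals for 1000 runs, in nanoseconds, as reported above.
totals_ns = {
    "noEscape": 491316000,
    "parseOnly": 2979672000,
    "dynEscape": 3531971000,
}

baseline = totals_ns["noEscape"]
ratios = {name: round(ns / baseline, 1) for name, ns in totals_ns.items()}

# Extra cost of the escaping step alone, relative to the plain string copy.
escaping_step = round(
    (totals_ns["dynEscape"] - totals_ns["parseOnly"]) / baseline, 1)
```

The escaping step by itself comes out at roughly 1.1 times the plain copy, which matches the statement that applying the escaping directive costs about the same as a string copy.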
> >
> >
> --------------------------------------------------------------------------------
> >
> > Last modified: Wed Feb 23 17:06:20 EST 2011
> >
> > _______________________________________________
> > Developer-outreach mailing list
> > Developer-outreach at lists.owasp.org
> > https://lists.owasp.org/mailman/listinfo/developer-outreach