[Esapi-dev] Thoughts on a new codec framework

Jeff Williams jeff.williams at aspectsecurity.com
Sun Feb 28 23:37:26 EST 2010


This all seems reasonable to me. I'm curious about why you want to
separate encode and decode. While they're independent, it seems logical
to keep them together in the same class.

--Jeff


-----Original Message-----
From: esapi-dev-bounces at lists.owasp.org
[mailto:esapi-dev-bounces at lists.owasp.org] On Behalf Of Ed Schaller
Sent: Saturday, February 27, 2010 12:11 AM
To: ESAPI dev
Subject: [Esapi-dev] Thoughts on a new codec framework

Introduction:
-------------

In dealing with the codec's internals I started thinking of ways that
the architecture might be improved. I was going to wait until after 2.0
but Jim encouraged me to do it now. My proposal here is for
an advanced user API and an SPI. As such it doesn't affect the current
Encoder interface that has received a fair bit of attention of late and
also needs help.

I've probably gone overboard on this but I haven't figured out a way to
make it simpler without having the functionality I would like to see.
Any suggestions for simplification without reduction of functionality
would be appreciated.

It's also worth mentioning that I'm willing to head up implementing this
and that I certainly am not proposing something I want others to
implement;)

Goals:
------

Codecs that are simple to use and simple to implement, yet flexible
enough to allow easy expansion and custom features.

Bases:
------

There are advantages and disadvantages to interfaces vs. abstract
classes as the base. 1.4 uses an interface and 2.0 currently uses an
abstract class. I tend to like the in-between: an interface plus an
abstract base class that implements it.

One thing that I think should be done is to split up the codec interface
so that an encoder can be implemented without requiring a decoder.
Additionally, encode and decode do not need to be separate interfaces
either, as both convert from one type to another. In these thoughts I
refer to either one as a Codec that codes, but other names may be
preferred (Transformer that transforms).

Such combination leads to four Codec types:

Bytes to Bytes
Characters to Characters
	html, css and js encode and decode
Characters to Bytes
	hex and base64 decode
Bytes to Characters
	hex and base64 encode

Technically Bytes to Bytes is the only codec really needed, but not
providing the others requires dealing with character encodings at each
step, which wouldn't be a good idea.

Additionally, there are two general methods of usage for codecs:

buffer to buffer
stream/reader to stream/writer

Currently ESAPI only supports the first one. Either can be implemented
via the other (eg: in an abstract base class). Implementing streaming
with buffers in a generic way is challenging without buffering the
whole stream into memory (where do you chunk the input?). My suggestions
below tend toward the latter method with base methods that implement the
former. This may seem an excessive amount of object wrapping but it
should be pointed out that the compiler itself wraps most string
concatenation in StringBuilder these days.
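
As a rough illustration of "base methods that implement the former", the
sketch below shows a buffer-to-buffer method built on top of a streaming
one. The class names and the toy '%'-doubling codec are made up for this
example and are not part of ESAPI or of the interfaces proposed later in
this message.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical base class: subclasses supply only the streaming method. */
abstract class StreamingByteCodecSketch {
    /** Streaming form: read from in and write coded output until EOF. */
    abstract OutputStream code(InputStream in, OutputStream out) throws IOException;

    /** Buffer form, built on the streaming form by wrapping the arrays. */
    byte[] code(byte[] in, int off, int len) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(len);
        try {
            code(new ByteArrayInputStream(in, off, len), out);
        } catch (IOException e) {
            // In-memory streams are not expected to fail, so treat this as a bug.
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    /** Convenience form without offset and length. */
    byte[] code(byte[] in) {
        return code(in, 0, in.length);
    }
}

/** Toy codec for illustration only: doubles every '%' byte. */
class PercentDoublingCodecSketch extends StreamingByteCodecSketch {
    @Override
    OutputStream code(InputStream in, OutputStream out) throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            out.write(b);
            if (b == '%') {
                out.write(b);
            }
        }
        return out;
    }
}
```

The offset and length are passed straight to ByteArrayInputStream, so no
copy of the input buffer is made on the way in.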

Methods of the streaming type would read from the input and perform
coding to the output until EOF is received on the input.

API vs. SPI:
------------

Making one public API interface with a public SPI seems a logical choice
here. Some of my suggestions on interfaces below are based on JAXP which
is similar. This also allows simpler SPI interfaces.

Byte Streaming Ins and Outs:
----------------------------

InputStream and OutputStream are the obvious choices here.

Char Streaming Ins and Outs:
----------------------------

On the input there isn't much of a choice: Reader.

Output could be Writer but I'm favoring Appendable as you can plug in
either a Writer or a StringBuilder/Buffer. The latter is really
convenient except for the nasty IOExceptions. The API could include
specific StringBuilder/Buffer methods that wrap the IOException in an
IllegalStateException as they never throw the IOException anyway (this
however would not work if we choose to throw encoding/decoding
exceptions as a subclass of IOException).
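
A minimal sketch of that idea follows, assuming a toy coding that only
escapes '<'. The class name is hypothetical. The StringBuilder overload
delegates to the Appendable one and converts the IOException that a
StringBuilder never actually throws.

```java
import java.io.IOException;

class AppendableOverloadSketch {
    /** General form: works with any Appendable (Writer, StringBuilder, ...). */
    static Appendable code(CharSequence in, Appendable out) throws IOException {
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == '<') {
                out.append("&lt;");   // toy escaping, for illustration only
            } else {
                out.append(c);
            }
        }
        return out;
    }

    /** Convenience form: no checked exception for the caller to handle. */
    static StringBuilder code(CharSequence in, StringBuilder out) {
        try {
            // The cast selects the Appendable overload above.
            code(in, (Appendable) out);
        } catch (IOException e) {
            // StringBuilder.append does not throw IOException, so this is a bug.
            throw new IllegalStateException(e);
        }
        return out;
    }
}
```

Note that the catch block here would also swallow any coding exception
declared as a subclass of IOException, which is the caveat raised above.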

Byte Buffer Ins and Outs:
-------------------------

There isn't much to choose here except byte[] which is sufficient. I
think though that offset and length should always be available too, as
this prevents copying buffers around all over the place. Methods that
take byte[] without offset and length should be provided in the API
though.

Output to a byte[] is something I think I would pass on for now as this
gets complicated. Not only would the API have to return the amount
actually written to the output but it would also need to be able to
handle the case where there is more input than will fit in the output.
This may be something worth adding in the future if the need presents
itself.

Char Buffer Ins and Outs:
-------------------------

Clearly input and output of Strings needs to be supported. Is it worth
including offset and length on input here as well? On output, String is
immutable, so the result must be returned as a new String and offset
and length aren't really feasible.

char[] may also be worth supporting too though conversion back and forth
between char[] and String is fairly easy.

One recurring issue is whether to base methods on char, Character or
something else. 1.4 uses char. 2.0 currently uses Character. I suggest
using int. char is simple and fast. Character is faster in 1.5+ than it
was and makes Sets and Maps possible for implementations.

Both char and Character have the drawback though that they can't handle
Unicode code points above 0xFFFF making full internationalization very
difficult. encode(char ch...) isn't possible as you may need two chars
for the full code point. This is one of the frustrating effects of the
Unicode spec changing after Java adopted it.
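
To make the int suggestion concrete, here is a small sketch that walks a
CharSequence by code point rather than by char, using the standard
Character.codePointAt and Character.charCount methods. The class and
method names are invented for the example.

```java
import java.util.Locale;

class CodePointWalkSketch {
    /** Lists each code point of the input as U+XXXX, separated by spaces. */
    static String describe(CharSequence in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); ) {
            // An int holds any code point, including those above 0xFFFF.
            int cp = Character.codePointAt(in, i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append("U+").append(Integer.toHexString(cp).toUpperCase(Locale.ROOT));
            // Advance by one char for BMP code points, two for surrogate pairs.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For example, describe("a\uD835\uDD38") yields "U+61 U+1D538": the String
has three chars but only two code points.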

Exceptions:
-----------

What should be done if an input cannot be encoded/decoded into an
output? This is probably more obviously an issue with decoding (an XML &
with no ';') than encoding but it is still a problem with encoding too.
A null character cannot be encoded for CSS as the CSS spec clearly
states that null characters are not supported.

Input of characters also has the possibility of issues with surrogate
pairs. There is nothing that prevents surrogate pair values from being
in invalid sequences in a String. Similarly, should invalid/reserved
code points throw exceptions as well?

How to report such issues is another question. My suggestion is to throw
runtime exceptions here.
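
One possible shape for this, as a sketch only: a helper that returns the
code point at an index and throws an unchecked exception when it finds an
unpaired surrogate. The exception type and all names are hypothetical
and do not come from ESAPI.

```java
class SurrogateCheckSketch {
    /** Hypothetical unchecked exception type for coding failures. */
    static class CodingExceptionSketch extends RuntimeException {
        CodingExceptionSketch(String msg) {
            super(msg);
        }
    }

    /** Returns the code point at index i, or throws on an unpaired surrogate. */
    static int checkedCodePointAt(CharSequence in, int i) {
        char c = in.charAt(i);
        if (Character.isHighSurrogate(c)) {
            if (i + 1 < in.length() && Character.isLowSurrogate(in.charAt(i + 1))) {
                return Character.toCodePoint(c, in.charAt(i + 1));
            }
            throw new CodingExceptionSketch("unpaired high surrogate at index " + i);
        }
        if (Character.isLowSurrogate(c)) {
            // A low surrogate is only valid directly after a high surrogate.
            throw new CodingExceptionSketch("unpaired low surrogate at index " + i);
        }
        return c;
    }
}
```

A caller walking the input would advance by Character.charCount of the
returned value, so a valid low surrogate is never visited on its own.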

Lookahead:
----------

One of the shortcomings of the current codec implementation is that
there is only one character lookahead. While this is sufficient for most
of the codecs, it is not for HTML. Only recently was it discovered that
the HTML codec couldn't handle entities that started with the same
characters. The previous implementation would happily read characters
until it found a match in a Map. This caused &piv to always be treated
as &pi. When this was originally fixed, using the provided one character
lookahead seemed sufficient until theta and thetasym were found. This
required hacking around the one character lookahead.

My suggestion here is to just use the JRE provided PushbackInputStream
and PushbackReader. These can wrap other InputStreams and Readers if
needed and allow users of the standard classes to have no difficulty
working with ESAPI.
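
The following sketch shows how a PushbackReader with a multi-character
pushback buffer could handle the &pi / &piv and &theta / &thetasym cases:
read up to the longest known name, take the longest match, and unread
whatever was not consumed. The four-entry table, the class name and the
helper methods are assumptions for illustration, not the ESAPI HTML
codec.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

class EntityLookaheadSketch {
    static final Map<String, Integer> ENTITIES = new HashMap<>();
    static final int MAX_NAME = 8; // length of "thetasym", the longest name below
    static {
        ENTITIES.put("pi", 0x3C0);
        ENTITIES.put("piv", 0x3D6);
        ENTITIES.put("theta", 0x3B8);
        ENTITIES.put("thetasym", 0x3D1);
    }

    /**
     * Call after '&' has been consumed. Returns the entity's code point, or
     * -1 with all characters pushed back if nothing matches. The reader
     * needs a pushback buffer of at least MAX_NAME chars.
     */
    static int readEntity(PushbackReader in) throws IOException {
        char[] buf = new char[MAX_NAME];
        int n = 0;
        int c;
        while (n < MAX_NAME && (c = in.read()) != -1) {
            buf[n++] = (char) c;
        }
        // Try the longest candidate first so "piv" wins over "pi".
        for (int len = n; len > 0; len--) {
            Integer cp = ENTITIES.get(new String(buf, 0, len));
            if (cp != null) {
                in.unread(buf, len, n - len);   // return the unused lookahead
                int next = in.read();           // swallow an optional ';'
                if (next != ';' && next != -1) {
                    in.unread(next);
                }
                return cp;
            }
        }
        in.unread(buf, 0, n);
        return -1;
    }

    /** Demo helper: decodes one entity and reports what is left unread. */
    static String demo(String afterAmpersand) {
        try {
            PushbackReader in =
                new PushbackReader(new StringReader(afterAmpersand), MAX_NAME);
            int cp = readEntity(in);
            StringBuilder rest = new StringBuilder();
            for (int c; (c = in.read()) != -1; ) {
                rest.append((char) c);
            }
            return (cp < 0 ? "none" : Integer.toHexString(cp)) + "|" + rest;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this table, demo("piv;x") gives "3d6|x" and demo("theta;z") gives
"3b8|z", while unknown input such as demo("xyz") gives "none|xyz" with
the reader left untouched.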

+---+
|API|
+---+

The general API methods would follow the abstract factory pattern to
create individual codecs. Utility classes and methods are not listed
below but would include such functionality as allowing a codec to be
used in an input or output chain.

The Codec interfaces below are fairly extensive in the methods that are
provided. The idea here is to make the codecs easy to use. Most of these
would be implemented in the framework and call the far simpler SPICodecs
described afterwards.

public abstract class CodecFactory
{
	public static BytesToBytesCodecFactory newByteByteCodecFactory(String name);
	public static BytesToBytesCodecFactory newByteByteCodecFactory(String name, String clsName);
	public static BytesToBytesCodecFactory newByteByteCodecFactory(String name, String clsName, ClassLoader loader);
	public static CharsToCharsCodecFactory newCharCharCodecFactory(String name);
	public static CharsToCharsCodecFactory newCharCharCodecFactory(String name, String clsName);
	public static CharsToCharsCodecFactory newCharCharCodecFactory(String name, String clsName, ClassLoader loader);
	public static CharsToBytesCodecFactory newCharByteCodecFactory(String name);
	public static CharsToBytesCodecFactory newCharByteCodecFactory(String name, String clsName);
	public static CharsToBytesCodecFactory newCharByteCodecFactory(String name, String clsName, ClassLoader loader);
	public static BytesToCharsCodecFactory newByteCharCodecFactory(String name);
	public static BytesToCharsCodecFactory newByteCharCodecFactory(String name, String clsName);
	public static BytesToCharsCodecFactory newByteCharCodecFactory(String name, String clsName, ClassLoader loader);

	public Object getAttribute(String name);
	public void setAttribute(String name, Object value);

	public boolean getFeature(String name);
	public void setFeature(String name);
}

public abstract class BytesToBytesCodecFactory extends CodecFactory
{
	public BytesToBytesCodec newInstance();
}

public abstract class CharsToCharsCodecFactory extends CodecFactory
{
	public CharsToCharsCodec newInstance();
}

public abstract class BytesToCharsCodecFactory extends CodecFactory
{
	public BytesToCharsCodec newInstance();
}

public abstract class CharsToBytesCodecFactory extends CodecFactory
{
	public CharsToBytesCodec newInstance();
}

public interface Codec
{
}

public interface BytesToBytesCodec extends Codec
{
	// buff to buff
	public byte[] code(byte[] in, int off, int len);
	public byte[] code(byte[] in);
	public byte[] code(int b);

	// buff to stream
	public OutputStream code(byte[] in, int off, int len, OutputStream out) throws IOException;
	public OutputStream code(byte[] in, OutputStream out) throws IOException;
	public OutputStream code(int b, OutputStream out) throws IOException;

	// stream to buff
	public byte[] code(InputStream in) throws IOException;
}

public interface CharsToCharsCodec extends Codec
{
	// buff to buff
	public String code(char[] in, int off, int len);
	public String code(char[] in);
	public String code(CharSequence in, int off, int len);
	public String code(CharSequence in);
	public String code(int ch);

	// buff to stream
	public Appendable code(char[] in, int off, int len, Appendable out) throws IOException;
	public Appendable code(char[] in, Appendable out) throws IOException;
	public Appendable code(CharSequence in, int off, int len, Appendable out) throws IOException;
	public Appendable code(CharSequence in, Appendable out) throws IOException;
	public Appendable code(int ch, Appendable out) throws IOException;

	public StringBuilder code(char[] in, int off, int len, StringBuilder out);
	public StringBuilder code(char[] in, StringBuilder out);
	public StringBuilder code(CharSequence in, int off, int len, StringBuilder out);
	public StringBuilder code(CharSequence in, StringBuilder out);
	public StringBuilder code(int ch, StringBuilder out);

	public StringBuffer code(char[] in, int off, int len, StringBuffer out);
	public StringBuffer code(char[] in, StringBuffer out);
	public StringBuffer code(CharSequence in, int off, int len, StringBuffer out);
	public StringBuffer code(CharSequence in, StringBuffer out);
	public StringBuffer code(int ch, StringBuffer out);

	// stream to buff
	public String code(Reader in) throws IOException;
}

public interface BytesToCharsCodec extends Codec
{
	// buff to buff
	public String code(byte[] in, int off, int len);
	public String code(byte[] in);
	public String code(int b);

	// buff to stream
	public Appendable code(byte[] in, int off, int len, Appendable out) throws IOException;
	public Appendable code(byte[] in, Appendable out) throws IOException;
	public Appendable code(int b, Appendable out) throws IOException;

	public StringBuilder code(byte[] in, int off, int len, StringBuilder out);
	public StringBuilder code(byte[] in, StringBuilder out);
	public StringBuilder code(int b, StringBuilder out);

	public StringBuffer code(byte[] in, int off, int len, StringBuffer out);
	public StringBuffer code(byte[] in, StringBuffer out);
	public StringBuffer code(int b, StringBuffer out);

	// stream to buff
	public String code(InputStream in) throws IOException;
}

public interface CharsToBytesCodec extends Codec
{
	// buff to buff
	public byte[] code(char[] in, int off, int len);
	public byte[] code(char[] in);
	public byte[] code(CharSequence in, int off, int len);
	public byte[] code(CharSequence in);
	public byte[] code(int ch);

	// buff to stream
	public OutputStream code(char[] in, int off, int len, OutputStream out) throws IOException;
	public OutputStream code(char[] in, OutputStream out) throws IOException;
	public OutputStream code(CharSequence in, int off, int len, OutputStream out) throws IOException;
	public OutputStream code(CharSequence in, OutputStream out) throws IOException;
	public OutputStream code(int ch, OutputStream out) throws IOException;

	// stream to buff
	public byte[] code(Reader in) throws IOException;

	// stream to stream
	public OutputStream code(Reader in, OutputStream out) throws IOException;
}

+----+
|SPI:|
+----+

The interfaces for the SPI follow. There are far more SPICodec
interfaces than API ones to allow implementation to be fairly simple.
The richer API interfaces are provided by the framework.
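
As a sketch of how the framework might supply the richer methods, the
adapter below wraps a per-character SPI codec and provides String-level
convenience methods on top of it. The interface here is a stand-in
declared for the example, and the adapter name and structure are
assumptions rather than a settled design.

```java
import java.io.IOException;

/** Stand-in for a per-character SPI codec (hypothetical name). */
interface CharToCharsSpiSketch {
    Appendable transform(int ch, Appendable out) throws IOException;
}

/** Framework-side wrapper that adds buffer-level methods. */
final class CharsToCharsAdapterSketch {
    private final CharToCharsSpiSketch spi;

    CharsToCharsAdapterSketch(CharToCharsSpiSketch spi) {
        this.spi = spi;
    }

    /** Feeds each code point of the given range to the SPI codec. */
    Appendable code(CharSequence in, int off, int len, Appendable out)
            throws IOException {
        // Slice first so a surrogate pair is never read past the range.
        CharSequence slice = in.subSequence(off, off + len);
        for (int i = 0; i < slice.length(); ) {
            int cp = Character.codePointAt(slice, i);
            spi.transform(cp, out);
            i += Character.charCount(cp);
        }
        return out;
    }

    /** String in, String out: the form most callers would use. */
    String code(CharSequence in) {
        StringBuilder sb = new StringBuilder(in.length());
        try {
            code(in, 0, in.length(), sb);
        } catch (IOException e) {
            // Not expected: the target is an in-memory StringBuilder.
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }
}
```

An implementer then only writes the single transform method, for example
one that appends "&lt;" for '<' and the character itself otherwise.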

Factory classes for the codecs here are also wrapped by the API.

Base:
-----

public interface SPICodecFactory
{
	public Object getAttribute(String name);
	public void setAttribute(String name, Object value);

	public boolean getFeature(String name);
	public void setFeature(String name);
}

public interface SPICodec
{
}

Chars To Chars:
---------------

public interface CharsToCharsSPICodecFactory extends SPICodecFactory
{
	public CharsToCharsSPICodec newInstance();
}

public interface CharsToCharsSPICodec extends SPICodec
{
	public Appendable transform(Reader in, Appendable out) throws IOException;
}

public interface CharToCharsSPICodecFactory extends SPICodecFactory
{
	public CharToCharsSPICodec newInstance();
}

public interface CharToCharsSPICodec extends SPICodec
{
	public Appendable transform(int ch, Appendable out) throws IOException;
}

Bytes To Bytes:
---------------

public interface BytesToBytesSPICodecFactory extends SPICodecFactory
{
	public BytesToBytesSPICodec newInstance();
}

public interface BytesToBytesSPICodec extends SPICodec
{
	public OutputStream transform(InputStream in, OutputStream out) throws IOException;
}

public interface ByteToBytesSPICodecFactory extends SPICodecFactory
{
	public ByteToBytesSPICodec newInstance();
}

public interface ByteToBytesSPICodec extends SPICodec
{
	public OutputStream transform(int b, OutputStream out) throws IOException;
}

Bytes To Chars:
---------------

public interface BytesToCharsSPICodecFactory extends SPICodecFactory
{
	public BytesToCharsSPICodec newInstance();
}

public interface BytesToCharsSPICodec extends SPICodec
{
	public Appendable transform(InputStream in, Appendable out) throws IOException;
}

public interface ByteToCharsSPICodecFactory extends SPICodecFactory
{
	public ByteToCharsSPICodec newInstance();
}

public interface ByteToCharsSPICodec extends SPICodec
{
	public Appendable transform(int b, Appendable out) throws IOException;
}

Chars To Bytes:
---------------

public interface CharsToBytesSPICodecFactory extends SPICodecFactory
{
	public CharsToBytesSPICodec newInstance();
}

public interface CharsToBytesSPICodec extends SPICodec
{
	public OutputStream transform(Reader in, OutputStream out) throws IOException;
}

public interface CharToBytesSPICodecFactory extends SPICodecFactory
{
	public CharToBytesSPICodec newInstance();
}

public interface CharToBytesSPICodec extends SPICodec
{
	public OutputStream transform(int ch, OutputStream out) throws IOException;
}

>>>------>

