[Esapi-dev] Encoder#normalize(String) redux

Ed Schaller schallee at darkmist.net
Sat Jan 16 16:10:43 EST 2010


While fixing up unit tests in 1.4 today and yesterday (Jim, sorry, I'm
burned out for today, but only 2 are failing on Linux now), I came full
circle back to issue 74 and normalize. I'm hoping for some help deciding
which direction to take in fixing this.

For background, the issue here is that older JVMs had
sun.text.Normalizer, which the reference implementation used, but 1.6
has java.text.Normalizer and not the older class. The javadoc for
Encoder#normalize(String) says:

"Reduce all non-ascii characters to their ASCII form so that simpler
validation rules can be applied. For example, an accented-e character
will be changed into a regular ASCII e character."

There is one unit test for this, EncoderTest#testNormalize(), with
one assertion:

assertEquals( "e a i _ @ \" < > ", ESAPI.encoder().normalize("� � � _ @ \" < > \u20A0"));

Today I went looking for a solution and found ICU4J, which has a Normalizer
that is apparently even more complete than the one in 1.6. I plugged it
in and, to my surprise, the unit test still failed. Baffled, I re-enabled
the 1.6 code and it failed too. The 1.4 sun.text.Normalizer version
also fails. Apparently this never worked. All of these implementations
return the same unequal result, "   _ @ \" < > ", which certainly doesn't
match what the javadoc says it should be.
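
As a point of reference, here is a minimal sketch of the decompose-then-strip
approach the javadoc seems to describe, assuming java.text.Normalizer from
1.6. The class and method names are hypothetical and this is not the ESAPI
reference implementation. The particular accented letters are also an
assumption chosen for illustration, and they are written as Unicode escapes
so the result does not depend on how the source file is encoded.

```java
import java.text.Normalizer;

// Hypothetical sketch, not the ESAPI reference implementation.
class NormalizeSketch {

    // Decompose each character (NFD) so that an accented letter becomes a
    // base letter followed by combining marks, then drop everything that is
    // not ASCII. Characters with no ASCII base letter simply disappear.
    static String stripToAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // e-acute, a-grave, i-circumflex, then the ASCII tail from the unit
        // test, ending with U+20A0 (a non-ASCII currency sign).
        String input = "\u00e9 \u00e0 \u00ee _ @ \" < > \u20A0";
        System.out.println("[" + stripToAscii(input) + "]");
    }
}
```

With that escaped input, stripToAscii returns "e a i _ @ \" < > ", which is
the string the unit test expects.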

For better or worse, it turns out that the only code in either 1.4 or
2.0 that even calls normalize is the unit test.

My general question for the group is: what should be done? It apparently
never worked and isn't used, so removing it is probably possible. On the
other hand, ambiguities between different representations are a great way
to evade security controls. However, even the behavior the javadoc
describes leaves that door open: code after the check may behave
differently depending on whether 'e' is accented, yet after normalize
both would be treated the same.

I guess my only suggestion at this moment is to choose one canonical
form (probably NFC) and just return that instead of stripping it down to
the closest ASCII representation. That, however, is at odds with what the
javadoc has said for every release so far.
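
To make the NFC suggestion concrete, here is a minimal sketch, again
assuming java.text.Normalizer from 1.6. The class and method names are
hypothetical; it only illustrates that equivalent sequences collapse to one
representation while accents are kept.

```java
import java.text.Normalizer;

// Hypothetical sketch of the "pick one canonical form" idea.
class CanonicalFormSketch {

    // Return the NFC form: canonically equivalent sequences collapse to
    // one representation, but accents are preserved rather than stripped.
    static String toNfc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";  // single code point, e with acute
        String decomposed = "e\u0301";  // 'e' followed by a combining acute
        // The two spellings of accented e compare equal after NFC.
        System.out.println(toNfc(precomposed).equals(toNfc(decomposed)));
        // A plain 'e' stays distinct from the accented one.
        System.out.println(toNfc("e").equals(toNfc(precomposed)));
    }
}
```

The first line printed is true and the second is false, so later code can
still tell an accented 'e' from a plain one.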

Thoughts? 

>>>------>