[Esapi-user] Localization and InputValidation

David Sklarew david.sklarew at gmail.com
Wed Jan 27 06:11:56 EST 2010


Rob,

>>my regex would need to allow a range of Unicode values

>> a character class (\p{Alpha} and such) would seamlessly match 'letters' of any language.

 

I am a newbie to Unicode, but I looked into this a while back. My research suggested that Java does not support Unicode character classes in regular expressions (see www.regexbuddy.com: their regex product has a help file that discusses regex support in different languages, and it also includes a tool for debugging regexes). Everyone, if I am mistaken about this Java regex limitation, please let me know.
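For anyone who wants to check the regex side directly, here is a minimal sketch that runs the character in question through java.util.regex. As far as I know, Pattern does accept Unicode general categories such as \p{L}, while the POSIX-style \p{Alpha} covers only US-ASCII letters by default. The class and method names below are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of ESAPI: compares two regex classes in java.util.regex.
final class UnicodeClassCheck {
    // \p{L} is the Unicode general category "Letter" and is not tied to one script.
    static final Pattern ANY_LETTER = Pattern.compile("\\p{L}+");
    // \p{Alpha} is a POSIX-style class; by default it covers US-ASCII letters only.
    static final Pattern ASCII_ALPHA = Pattern.compile("\\p{Alpha}+");

    static boolean isAllLetters(String s) {
        return ANY_LETTER.matcher(s).matches();
    }

    static boolean isAsciiAlpha(String s) {
        return ASCII_ALPHA.matcher(s).matches();
    }

    public static void main(String[] args) {
        String gu = "\u30b0"; // the katakana character from Rob's question
        System.out.println(isAllLetters(gu));    // true
        System.out.println(isAsciiAlpha(gu));    // false
        System.out.println(isAsciiAlpha("abc")); // true
    }
}
```

If this behaves the same on your JDK, \p{Alpha} alone would reject the katakana input, and a category such as \p{L} would be the one to try.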

 

I only had simple needs, so I wrote a simple isValidString() function that uses the Java Character class to validate that every character in the input belongs to one of a set of Unicode character classes (e.g. letters, punctuation, accents). If you need to constrain each Unicode character against either a range of Unicode values or a character class, it may be easiest to loop through each code point in the input and check that it either matches a regex or that the type of the code point is an acceptable character class. The allowed character classes could be externalized in a properties file.
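A rough sketch of that loop, assuming the whitelist is a set of general-category constants from java.lang.Character. This is not my original code; the class name and the default set are illustrative only, and the set could just as well be loaded from a properties file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical reconstruction of the idea; names and the default whitelist are assumptions.
final class CodePointValidator {
    // Example whitelist: letters, combining accents, spaces, and common punctuation.
    static final Set<Integer> DEFAULT_TYPES = new HashSet<>(Arrays.asList(
            (int) Character.UPPERCASE_LETTER,
            (int) Character.LOWERCASE_LETTER,
            (int) Character.TITLECASE_LETTER,
            (int) Character.MODIFIER_LETTER,
            (int) Character.OTHER_LETTER,
            (int) Character.NON_SPACING_MARK,
            (int) Character.SPACE_SEPARATOR,
            (int) Character.DASH_PUNCTUATION,
            (int) Character.OTHER_PUNCTUATION));

    // Returns true only if every code point's general category is in allowedTypes.
    static boolean isValidString(String input, Set<Integer> allowedTypes) {
        if (input == null) {
            return false;
        }
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (!allowedTypes.contains(Character.getType(codePoint))) {
                return false;
            }
            // Advance by 1 or 2 chars so surrogate pairs count as one code point.
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidString("\u30b0", DEFAULT_TYPES));
        System.out.println(isValidString("<script>", DEFAULT_TYPES));
    }
}
```

Walking by code point rather than by char matters for characters outside the Basic Multilingual Plane, which Java stores as two chars.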

 

David

 

 

From: esapi-user-bounces at lists.owasp.org [mailto:esapi-user-bounces at lists.owasp.org] On Behalf Of Rob Spremulli
Sent: Tuesday, January 26, 2010 3:19 PM
To: esapi-user at lists.owasp.org
Subject: [Esapi-user] Localization and InputValidation

 

Hi guys, a question has arisen re: input validation

 

I should preface this by stating we are on 1.4, not 2.0.

 

Let's say I want to pass "グ" in my input.  For those of you who can't read that, it's a Japanese katakana character with Unicode code point U+30B0

 http://www.fileformat.info/info/unicode/char/30b0/index.htm

 

I want to allow this in my input, so I need to create a regex that will permit it.  What I'm not sure about is:

1) what canonicalize is going to do to that string, and 

2) if there's a locale-aware way of identifying characters in a regex.

 

I can see this potentially showing up as 

\u30b0, where I would need to permit \ characters, 

\u30b0, where the slash is encoded, though I doubt this.

グ

 

The last case leads to two possibilities:

1) my regex would need to allow a range of Unicode values

2) a character class (\p{Alpha} and such) would seamlessly match 'letters' of any language.
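To make the two options concrete, here is a minimal sketch in plain java.util.regex, outside of ESAPI. It assumes the input reaches the regex as the literal character and that only katakana should be allowed. The class name is hypothetical; U+30A0 to U+30FF is the Unicode Katakana block.

```java
import java.util.regex.Pattern;

// Hypothetical whitelist patterns for illustration; not ESAPI configuration.
final class KatakanaWhitelist {
    // Option 1: an explicit range of Unicode values (the Katakana block).
    static final Pattern BY_RANGE = Pattern.compile("[\\u30A0-\\u30FF]+");
    // Option 2: a named class; \p{InKatakana} refers to the same Unicode block.
    static final Pattern BY_BLOCK = Pattern.compile("\\p{InKatakana}+");

    static boolean allowed(Pattern whitelist, String input) {
        return whitelist.matcher(input).matches();
    }

    public static void main(String[] args) {
        String gu = "\u30b0";
        System.out.println(allowed(BY_RANGE, gu));
        System.out.println(allowed(BY_BLOCK, gu));
        System.out.println(allowed(BY_RANGE, "abc"));
    }
}
```

A broader property such as \p{L} would accept letters from every script, which may be more than a given field should allow.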

 

The confusion on my end is due to my lack of knowledge of characters outside the typical US character set.  Can anyone shed some light on this issue, in particular the expected canonicalization and the recommended whitelist regex?

 
