[Esapi-user] Localization and InputValidation

David Sklarew david.sklarew at gmail.com
Wed Jan 27 06:12:09 EST 2010


>>my regex would need to allow a range of Unicode values

>> a character class (\p{Alpha} and such) would seamlessly match 'letters' of any language.



I am a newbie to Unicode, but I looked into this a while back. What I found suggested that Java's \p{Alpha}-style character classes do not cover non-ASCII letters in regular expressions (see www.regexbuddy.com; IMHO their regex product has a great help file and a good tool for creating and debugging regexes).


I only had simple needs, so I wrote a simple isValidString() function that uses the Java Character class to check that every character in the input belongs to one of a few allowed categories (e.g. letters, punctuation, accents…).  If you need to constrain each Unicode character (Java calls them code points) against either a range of Unicode values or a character class, it may be easiest to loop through each code point in the input and check that it either matches the regex or has a type that is an acceptable character class.
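
Roughly, the loop I mean looks like the sketch below. This is not my original function, only a minimal reconstruction of the idea: the class name CodePointCheck and the particular set of allowed categories are assumptions picked for illustration, and it relies on the code point methods added in Java 5.

```java
// Illustrative sketch only: the class name and the allowed categories are assumptions.
final class CodePointCheck {

    // Returns true only if every code point in the input is in an allowed category.
    // A null input is rejected; an empty string passes because it contains nothing to reject.
    static boolean isValidString(String input) {
        if (input == null) {
            return false;
        }
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);      // reads a full code point, including surrogate pairs
            if (!isAllowed(cp)) {
                return false;
            }
            i += Character.charCount(cp);       // advance by 1 or 2 chars
        }
        return true;
    }

    // Whitelist by Unicode general category rather than by locale.
    static boolean isAllowed(int cp) {
        if (Character.isLetter(cp) || Character.isDigit(cp)) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.NON_SPACING_MARK:    // combining accents
            case Character.SPACE_SEPARATOR:     // plain spaces
            case Character.DASH_PUNCTUATION:    // hyphen
            case Character.OTHER_PUNCTUATION:   // comma, period, apostrophe, ...
                return true;
            default:
                return false;                   // symbols, controls, everything else
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidString("\u30b0"));   // katakana GU, a letter
        System.out.println(isValidString("<b>"));      // '<' is a math symbol, so this is rejected
    }
}
```

The categories are the part you would tune: add or remove cases in isAllowed() to match what your field should accept.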




From: esapi-user-bounces at lists.owasp.org [mailto:esapi-user-bounces at lists.owasp.org] On Behalf Of Rob Spremulli
Sent: Tuesday, January 26, 2010 3:19 PM
To: esapi-user at lists.owasp.org
Subject: [Esapi-user] Localization and InputValidation


Hi guys, a question has arisen re: input validation.


I should preface this by stating we are on 1.4, not 2.0.


Let's say I want to pass "グ" in my input.  For those of you who can't read that, it's a Japanese katakana character with Unicode value U+30B0.



I want to allow this in my input, so I need to create a regex that will permit it.  What I'm not sure about is:

1) what canonicalize is going to do to that string, and 

2) if there's a locale-aware way of identifying characters in a regex.


I can see this potentially showing up as 

\u30b0, where I would need to permit \ characters, 

\u30b0, where the slash is encoded, though I doubt this.



The latter can lead to two possibilities:

1) my regex would need to allow a range of Unicode values

2) a character class (\p{Alpha} and such) would seamlessly match 'letters' of any language.
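
To make the two possibilities concrete, here is a minimal sketch using plain java.util.regex, assuming the input reaches the regex as the actual character rather than as an escaped sequence. It does not show any ESAPI API or what canonicalize does; the class and field names are made up for illustration. As documented for java.util.regex.Pattern, \p{Alpha} is US-ASCII only by default, while \p{L} (any letter) and \p{InKatakana} (the Katakana block) are Unicode-aware.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: class and field names are assumptions.
final class KatakanaRegexDemo {

    // Possibility 1: an explicit range of Unicode values (the Katakana block, U+30A0 to U+30FF).
    static final Pattern RANGE = Pattern.compile("[\\u30A0-\\u30FF]+");

    // Possibility 2a: a Unicode block class covering the same block.
    static final Pattern BLOCK = Pattern.compile("\\p{InKatakana}+");

    // Possibility 2b: a Unicode category class matching letters of any script.
    static final Pattern ANY_LETTER = Pattern.compile("\\p{L}+");

    // For comparison: the POSIX-style class, which is ASCII-only by default in Java.
    static final Pattern ASCII_ALPHA = Pattern.compile("\\p{Alpha}+");

    public static void main(String[] args) {
        String gu = "\u30b0"; // the katakana character from the question
        System.out.println(RANGE.matcher(gu).matches());
        System.out.println(BLOCK.matcher(gu).matches());
        System.out.println(ANY_LETTER.matcher(gu).matches());
        System.out.println(ASCII_ALPHA.matcher(gu).matches());
    }
}
```

The range and block forms whitelist one script; \p{L} is the broader choice if letters of any language should be accepted.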


The confusion on my end is due to lack of knowledge on characters outside the typical US character set.  Can anyone shed some light on this issue, as to the expected canonicalization and recommended whitelist regex?

