[Opa] Getting rid of the "invalid UTF8 opcode"
Nicolas Glondu
nicolas.glondu at telecom-bretagne.eu
Fri Mar 23 09:55:41 UTC 2012
Hi, I'm trying to parse content from websites to extract pictures. My
problem is that, on some websites, my parsing function generates a lot of
"bslCactutf.lenbytes : invalid UTF8 opcode: NNN" messages with 128 < NNN < 192.
For instance, I get more than 3 gigabytes of such logs for the website
http://www.tudou.com/ (number 23 in Google's list of the 1000 most-visited websites).
This is extremely annoying, and the output is useless as a log.
I have looked at when these messages are generated: they seem to appear when
character codes from 128 to 192 are encountered. UTF-8 characters from 128 to 160
are control characters, so they should really not occur. Those from 161 to 192 are
valid, so I'm a bit puzzled.
For instance, I can generate this error message with:
_ = Cactutf.look("££", 1)
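One possible explanation, sketched in Python rather than Opa: in UTF-8, byte values 128 to 191 are continuation bytes, which are only valid after a lead byte. The sketch assumes that Cactutf.look(s, i) reads the byte at byte offset i, which is a guess from the message above and not confirmed Opa behaviour. The helper name utf8_byte_kind is hypothetical.

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a single byte value by its bit pattern in a UTF-8 stream."""
    if b < 0x80:
        return "ascii"         # 0..127: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 128..191: only valid after a lead byte
    if b < 0xF8:
        # 192..247: lead-byte bit pattern; strict decoders still reject
        # 0xC0, 0xC1 and 0xF5..0xF7.
        return "lead"
    return "invalid"           # 248..255: never appears in UTF-8


# "£" is U+00A3, encoded as the two bytes 0xC2 0xA3 (194, 163).
data = "££".encode("utf-8")
kinds = [(b, utf8_byte_kind(b)) for b in data]
for value, kind in kinds:
    print(value, kind)
```

Under that assumption, offset 1 of "££" is the byte 163, the second half of the first "£", so a function expecting the start of a character would see a value in the reported range.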
If you want an example that generates 3 GB of logs, I include one below. It's a simple
program which removes all characters in the bad range so that I can work on the
content in peace. I know that tudou.com may be an extreme example, since it's not
even encoded in UTF-8, but I'd prefer my parser to return an empty result
rather than hang my server as it currently does.

import stdlib.io.file

function fold8(callback, source, accumulator) {
  len = Cactutf.length(source);
  recursive function aux(offset, accu) {
    if (offset < len) {
      char = Cactutf.sub(source, offset, 1)
      l = Cactutf.look(char, 0)
        |> Cactutf.lenbytes
        |> `?`(_, 1)
      aux(offset + l, callback(char, accu))
    } else accu
  };
  aux(0, accumulator)
}

function clean(s) {
  fold8(
    function(c, acc) {
      n = Cactutf.look(c, 0)
      if (n > 127 && n < 192) acc
      else Text.insert_right(acc, c)
    }, s, Text.cons(""))
  |> Text.to_string
}

_ = {
  // index.html is the result of: wget http://www.tudou.com/
  s = File.content_opt("index.html") ? ""
  r = clean(s)
  jlog("{r}")
  void
}
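
For comparison, here is a minimal Python sketch of the two behaviours wished for above: returning an empty result when the input is not valid UTF-8, or dropping only the malformed bytes. It is not Opa code and says nothing about how Cactutf works. The function names decode_or_empty and decode_lenient are hypothetical.

```python
def decode_or_empty(raw: bytes) -> str:
    """Return the decoded text, or "" when raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return ""


def decode_lenient(raw: bytes) -> str:
    """Decode raw as UTF-8, silently dropping malformed byte sequences."""
    return raw.decode("utf-8", errors="ignore")


# Valid text followed by a stray continuation byte (0xA3) and a byte
# that never occurs in UTF-8 (0xFF).
sample = "a£b".encode("utf-8") + b"\xa3\xff" + b"c"
print(repr(decode_or_empty(sample)))
print(repr(decode_lenient(sample)))
```

The lenient variant lets the decoder decide what is malformed, so well-formed multi-byte characters such as "£" are kept instead of being filtered by value range.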