[Opa] Getting rid of the "invalid UTF8 opcode"

Nicolas Glondu nicolas.glondu at telecom-bretagne.eu
Fri Mar 23 09:55:41 UTC 2012


Hi, I'm trying to parse content from websites to extract pictures. My
problem is that, on some websites, my parsing function generates a lot of
"bslCactutf.lenbytes : invalid UTF8 opcode: NNN" messages, with 128 < NNN < 192.

For instance, I get more than 3 gigabytes of such logs for the website
http://www.tudou.com/ (ranked 23rd in Google's list of the 1000
most-visited websites). This is extremely annoying, and the output is
useless as a log.

I have looked at when these messages are generated, and they seem to appear
when character codes from 128 to 192 are met. UTF-8 characters from 128 to
160 are control characters, so they should really not occur. Those from 161
to 192 are valid, so I'm a bit puzzled.
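
One possible reading of these numbers, sketched in Python rather than Opa:
the values in the message may be raw byte values rather than character
codes. In UTF-8, the bytes 128 to 191 (0x80 to 0xBF) are continuation bytes.
They are valid only after a lead byte and can never start a character, so
there is no byte length to report for them. The helper name utf8_byte_kind
is hypothetical, and it is an assumption that Cactutf.lenbytes is handed the
byte found at the given offset.

```python
def utf8_byte_kind(b: int) -> str:
    """Classify one byte value (0..255) by the role it can play in UTF-8."""
    if b < 0x80:
        return "ascii"         # 0..127: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 128..191: only valid after a lead byte
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "invalid"       # never appears in well-formed UTF-8
    if b < 0xE0:
        return "lead2"         # 0xC2..0xDF: starts a 2-byte character
    if b < 0xF0:
        return "lead3"         # 0xE0..0xEF: starts a 3-byte character
    return "lead4"             # 0xF0..0xF4: starts a 4-byte character


for value in (65, 128, 163, 191, 194):
    print(value, utf8_byte_kind(value))
```

Under that reading, every NNN in the reported range falls in the
"continuation" class.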

For instance, I can generate this error message with:

_ = Cactutf.look("££", 1)
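
To see what sits at offset 1 of that string, here is a small Python sketch
(not Opa). It assumes the second argument of Cactutf.look is a byte offset,
which the source does not confirm. The pound sign is U+00A3 and takes two
bytes in UTF-8, so offset 1 lands in the middle of the first character.

```python
data = "££".encode("utf-8")

# Each pound sign is encoded as the two bytes 0xC2 0xA3.
print(list(data))  # [194, 163, 194, 163]

# The byte at offset 1 is the second byte of the first character.
second = data[1]
print(second, 0x80 <= second <= 0xBF)  # 163 True
```

The value 163 is inside the range reported in the error message.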

If you want an example generating 3 GB of logs, here is one. It's a simple
program which removes all chars in the bad range so I can work peacefully
on the rest. I know that tudou.com may be an extreme example, since it's not
even encoded in UTF-8, but I'd prefer my parser to return an empty result
rather than hang my server as it currently does.

import stdlib.io.file

// Fold callback over source, one character at a time.
function fold8(callback, source, accumulator) {
    len = Cactutf.length(source);
    recursive function aux(offset, accu) {
        if (offset < len) {
            char = Cactutf.sub(source, offset, 1)
            // Advance by the length reported by lenbytes, defaulting to 1.
            l = Cactutf.look(char, 0)
                |> Cactutf.lenbytes
                |> `?`(_, 1)
            aux(offset + l, callback(char, accu))
        } else accu
    };
    aux(0, accumulator)
}

// Drop every character whose code, as returned by look, is in 128..191.
function clean(s) {
    fold8(
        function(c, acc) {
            n = Cactutf.look(c, 0)
            if (n > 127 && n < 192) acc
            else Text.insert_right(acc, c)
        }, s, Text.cons(""))
    |> Text.to_string
}

_ = {
    // index.html is the result of: wget http://www.tudou.com/
    s = File.content_opt("index.html") ? ""
    r = clean(s)
    jlog("{r}")
    void
}
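
For comparison, here is a minimal Python sketch (not Opa) of the "return an
empty result" behaviour wished for above: try a strict UTF-8 decode first
and give up quietly if the input is not valid UTF-8. This is a different
technique from the filtering done by clean, and the function name
decode_or_empty is hypothetical.

```python
def decode_or_empty(raw: bytes) -> str:
    """Return the decoded text, or "" if raw is not well-formed UTF-8."""
    try:
        return raw.decode("utf-8")  # strict mode: raises on any bad byte
    except UnicodeDecodeError:
        return ""


print(repr(decode_or_empty("££".encode("utf-8"))))  # '££'
print(repr(decode_or_empty(b"\xa3\xa3")))           # ''
```

The second call fails because 0xA3 is a continuation byte with no lead byte
in front of it, so the whole input is rejected in one step instead of
producing one log line per bad byte.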