[Owasp-antisamy] antisamy problems with  

Joel Worrall jworrall at mzinga.com
Thu Apr 10 09:36:12 EDT 2008


I found a viable workaround to my formatting and output issues with HTML
character entities (&nbsp;, &amp;, etc.). I also had issues with
AntiSamy's assumptions about output formatting for line separators and
indents. This method resolves both.

Instead of using the CleanResults.getCleanHTML() method, use the
following code:

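import java.io.StringWriter;

import org.apache.xml.serialize.HTMLSerializer;
import org.apache.xml.serialize.OutputFormat;
import org.owasp.validator.html.CleanResults;
import org.owasp.validator.html.Policy;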

...


// Configure the serializer so it emits the markup without extra
// declarations, indentation, or line breaks.
OutputFormat format = new OutputFormat();
format.setEncoding("UTF-8");
format.setOmitXMLDeclaration(true);
format.setOmitDocumentType(true);
format.setIndent(0);
format.setIndenting(false);
format.setLineSeparator("");
format.setPreserveEmptyAttributes(true);
format.setOmitComments(true);
format.setPreserveSpace(true);
format.setStandalone(true);

HTMLSerializer serializer = new HTMLSerializer(format);

// antisamy is an existing AntiSamy instance; stringToBeCleaned is the
// untrusted input.
CleanResults cr = antisamy.scan(stringToBeCleaned, Policy.getInstance());

// Serialize the cleaned DOM fragment ourselves instead of calling
// getCleanHTML().
StringWriter sw = new StringWriter();
serializer.setOutputCharStream(sw);
serializer.serialize(cr.getCleanXMLDocumentFragment());
cleanString = sw.toString();

In this case, the input:

<p>Does this work?&nbsp;&nbsp;&nbsp;</p><br/>

Becomes the output:

<p>Does this work?&nbsp;&nbsp;&nbsp;</p><br />

For my purposes, that's close enough.

The downside of this approach is that HTMLSerializer is deprecated, so
the code works for now but may stop working if the class is removed in
a future release.
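
If the deprecation is a concern, one possible alternative is the standard
JAXP Transformer with the "html" output method, which accepts any DOM node
as its source. The following is only a rough sketch under assumptions, not
something I have run against AntiSamy: the class name
FragmentSerializerSketch and the sampleFragment() helper are made up for
illustration, and the hand-built fragment stands in for what
cr.getCleanXMLDocumentFragment() would return. The exact output (entity
handling, how the br tag is written) depends on the Transformer
implementation and may differ from what HTMLSerializer produces.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class; the names are for illustration only.
final class FragmentSerializerSketch {

    // Serializes a DOM node (for example a DocumentFragment) as HTML
    // without an XML declaration or added indentation.
    static String serialize(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "html");
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, "no");

            StringWriter sw = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(sw));
            return sw.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    // Builds a small fragment by hand as a stand-in for the fragment
    // returned by CleanResults.getCleanXMLDocumentFragment().
    static DocumentFragment sampleFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DocumentFragment frag = doc.createDocumentFragment();
            Element p = doc.createElement("p");
            // \u00A0 is the non-breaking space that &nbsp; refers to.
            p.setTextContent("Does this work?\u00A0\u00A0\u00A0");
            frag.appendChild(p);
            frag.appendChild(doc.createElement("br"));
            return frag;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleFragment()));
    }
}
```

In real use, serialize(cr.getCleanXMLDocumentFragment()) would replace the
sampleFragment() call. Compare the result with the HTMLSerializer output
before switching, since the two serializers may not format things
identically.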


