[Owasp-leaders] Working toward a resolution on the Contrast Security / OWASP Benchmark fiasco

Kevin W. Wall kevin.w.wall at gmail.com
Sat Nov 28 23:39:42 UTC 2015

Until very recently, I've been following at a distance this dispute between
various OWASP members and Contrast Security over the latter's advertising
references to the OWASP Benchmark Project.

While I too believe that mistakes were made, I believe that we all need to
take a step back and not throw out the baby with the bath water.

Unlike Johanna, I have not run the OWASP Benchmark Project against any
given SAST or DAST tool. Having used many such commercial tools, however, I
feel qualified to render a reasoned opinion of the OWASP Benchmark Project,
and perhaps to suggest some steps that we can take towards an amicable resolution.

Let me start with the OWASP Benchmark Project. I find the idea of having an
extensive baseline of tests against which we can gauge the effectiveness of
SAST and DAST software quite sound. In a way, these tests are analogous to unit
tests that we, as developers, use to find bugs in our code and help us
improve it, except that here the false positives and false negatives
that are revealed serve as the PASS / FAIL criteria for the tests. Just
as in unit testing, where the ideal is to have extensive tests to broaden one's
"test coverage" of the software under test, the Benchmark Project strives
to have a broad set of tests to assist in revealing deficiencies (with
the goal of removing these "defects") in various SAST and DAST tools.

This is all well and good, and I whole-heartedly applaud this effort.

However, I see several ways that this Benchmark Project fails. For one,
we have no way to measure the "test coverage" of the vulnerabilities that
the Benchmark Project claims to measure. There are (by figures that I've
seen claimed) something like 21,000 different test cases. How do we, as AppSec
people, know if these 21k 'tests' provide "even" test coverage? For
instance, it is not unreasonable to think that there may be heavy coverage
on tests that are easy to create (e.g., SQLi, buffer overflows, XSS) and
a much lesser emphasis on "test cases" for things like cryptographic
weaknesses. (This would not be surprising in the least, since the coverage
of every SAST and DAST tool that I've ever used seems to excel in some
areas and absolutely suck in others.)
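
One rough way to check how "even" the coverage is would be to count test
cases per vulnerability category. The following is only a minimal sketch: the
test IDs, category names, and counts are made up for illustration, and a real
check would load the categories from the Benchmark's own list of test cases.

```python
from collections import Counter

# Hypothetical (test_id, category) pairs, invented for this example.
test_cases = [
    ("t001", "sqli"), ("t002", "sqli"), ("t003", "sqli"), ("t004", "sqli"),
    ("t005", "xss"), ("t006", "xss"), ("t007", "xss"),
    ("t008", "crypto"),
]


def category_shares(cases):
    """Return each category's fraction of the total number of test cases."""
    counts = Counter(category for _, category in cases)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


shares = category_shares(test_cases)

# Print categories from most to least heavily covered.
for category, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category:8s} {share:6.1%}")
```

A heavily skewed table would not by itself mean the suite is bad, but it
would tell us which categories an overall score is really measuring.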

Another way that the Benchmark Project is lacking is one that is admitted
on the Benchmark Project wiki page under the "Benchmark Validity" section:
    The Benchmark tests are not exactly like real applications. The
    tests are derived from coding patterns observed in real
    applications, but the majority of them are considerably *simpler*
    than real applications. That is, most real world applications will
    be considerably harder to successfully analyse than the OWASP
    Benchmark Test Suite. Although the tests are based on real code,
    it is possible that some tests may have coding patterns that don't
    occur frequently in real code.

A lot of tools are great at detecting data and control flows that are simple,
but fail completely when facing "real code" that uses complex MVC frameworks
like Spring Framework or Apache Struts. The bottom line is that we need
realistic tests. We can be fairly certain that a SAST or DAST tool that
misses the low bar of one of the existing Benchmark Project test cases will
also miss that kind of flaw in real code. But if a tool is able to _pass_ those
tests, that still says *absolutely nothing* about its ability to detect
vulnerabilities in real world code, where the code
is often orders of magnitude more complex. (And I would argue that this is
one reason we see the false positive rate so high for SAST and DAST tools;
rather than err on the side of false negatives, they flag "issues" that
are generally unreliable and then rely on appsec analysts to discern which
are real and which are red herrings. This is still easier than if the
appsec engineers had to hunt down these potential issues manually and then
analyze them, so it is not entirely inappropriate. As long as the tool
provides some sort of "confidence" indicator for the various issues that it
finds, an analyst can easily decide whether they are worth spending effort on
further investigation.)

This brings me to what I see as the third major area where the Benchmark
Project is lacking. In striving to be simple, it attempts to distill all the
findings into a single metric. The nicest thing I can think of saying about
this is that it is woefully naive and misguided. I think where it is misguided
is that it assumes that every IT organization in every company weights
everything equally. For instance, it treats false positives and false negatives
as _equally_ bad. However, in reality, most organizations where I've been
involved in AppSec would strongly prefer false positives over false negatives.
Likewise, all categories (e.g., buffer overflows, heap corruption, SQLi, XSS,
CSRF) are weighted equally. Every appsec engineer knows that this is
generally unrealistic; indeed it is _one_ reason that we have different risk
ratings for different findings. Also, if a company writes all of their
applications in "safe" programming languages like C# or Java, then categories
like buffer overflows or heap corruption completely disappear. What that means
is that those companies don't care at all whether a given SAST or DAST
tool can find those categories of vulnerabilities, because they are
completely irrelevant for them. However, because there is no way to customize
the weighting of Benchmark Project findings when run for a given tool,
everything gets shoe-horned into a single magical figure. The
result is that that magical Benchmark Project figure becomes almost
meaningless. At best, its meaning is very subjective and not at all as
objective as Contrast's advertising is attempting to lead people to believe.
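
To illustrate why the weighting matters, here is a minimal sketch of a
customizable score. It assumes a per-category score of true positive rate
minus a penalty times false positive rate; the function name, the
fp_penalty parameter, the category weights, and all of the result counts are
hypothetical and are not taken from the Benchmark itself.

```python
from dataclasses import dataclass


@dataclass
class CategoryResult:
    tp: int  # real vulnerabilities the tool flagged
    fn: int  # real vulnerabilities the tool missed
    fp: int  # safe test cases the tool flagged anyway
    tn: int  # safe test cases the tool correctly left alone


def weighted_score(results, category_weights, fp_penalty=1.0):
    """Weighted average over categories of (TPR - fp_penalty * FPR).

    A category with weight 0 (irrelevant to the organization) drops out.
    """
    total_weight = sum(category_weights.get(c, 0.0) for c in results)
    if total_weight == 0:
        raise ValueError("at least one category needs a non-zero weight")
    score = 0.0
    for category, r in results.items():
        weight = category_weights.get(category, 0.0)
        tpr = r.tp / (r.tp + r.fn)
        fpr = r.fp / (r.fp + r.tn)
        score += weight * (tpr - fp_penalty * fpr)
    return score / total_weight


# Made-up results: tool A is noisy but thorough on SQLi, tool B is quieter.
tool_a = {"sqli": CategoryResult(tp=9, fn=1, fp=6, tn=4),
          "overflow": CategoryResult(tp=2, fn=8, fp=1, tn=9)}
tool_b = {"sqli": CategoryResult(tp=6, fn=4, fp=1, tn=9),
          "overflow": CategoryResult(tp=8, fn=2, fp=2, tn=8)}

# Everything weighted equally, false positives as bad as false negatives.
equal = {"sqli": 1.0, "overflow": 1.0}
equal_a = weighted_score(tool_a, equal)
equal_b = weighted_score(tool_b, equal)

# A shop using only memory-safe languages that tolerates false positives.
java_shop = {"sqli": 1.0, "overflow": 0.0}
custom_a = weighted_score(tool_a, java_shop, fp_penalty=0.25)
custom_b = weighted_score(tool_b, java_shop, fp_penalty=0.25)

print(f"equal weights: A={equal_a:.3f} B={equal_b:.3f}")
print(f"custom:        A={custom_a:.3f} B={custom_b:.3f}")
```

With these made-up numbers, tool B scores higher under equal weighting, while
tool A comes out ahead once the irrelevant category is weighted at zero and
false positives are discounted. The same raw results can rank the tools in
either order depending on what a given organization cares about.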

I believe that the general reaction to all of this has been negative, at
least based on the comments that I've read not only in the OWASP mailing
lists, but also on Twitter. In the end, this will be damaging to either
OWASP's overall reputation or, at the very least, the reputation of the
OWASP Benchmark Project, either of which I think most of us would agree is
bad for the appsec community in general.

Therefore, I have a simple proposal towards resolution. I would appeal to
the OWASP project leaders to appeal to the OWASP Board to simply mark the
OWASP Benchmark Project Wiki page (and ideally, its GitHub site) to note
that the findings are being disputed. For the wiki page, we could do this
in the manner that Wikipedia marks disputes, using a Template:Disputed tag
(see https://en.wikipedia.org/wiki/Template:Disputed_tag) or their
"Accuracy Disputes" (for example, see
and https://en.wikipedia.org/wiki/Category:Accuracy_disputes)

At a minimum, this tag, which the OWASP Board should apply (so that no one
is permitted to remove it without proper authorization), should render
something like:
    "The use and accuracy of this page are currently being disputed.
    OWASP does not support any vendor endorsing any of their
    software according to the scores resulting from execution of
    the OWASP Benchmark."

I will leave the exact wording up to the board. But just like disputed
pages on Wikipedia, OWASP must take action on this or I think they are
likely to have credibility issues in the future.

Thank you for listening,
-kevin wall
Blog: http://off-the-wall-security.blogspot.com/
NSA: All your crypto bit are belong to us.
