[Owasp-leaders] [Owasp-board] Working toward a resolution on the Contrast Security / OWASP Benchmark fiasco

Michael Coates michael.coates at owasp.org
Tue Dec 1 15:20:34 UTC 2015

I think that's a critical point, Simon, and I tend to agree. In a situation
where a project compares and critiques tools/approaches/etc., we need
independence in the project leadership.

Does this mean a vendor could never lead this kind of project, and that we
lose all the merits of the Benchmark? I think that if the vendor also brought
on two other independent project leaders who aren't from the same vendor, then
maybe that would work.


On Tuesday, December 1, 2015, psiinon <psiinon at gmail.com> wrote:

> I actually disagree.
> I'm fine with vendors leading most types of projects - we should be
> encouraging more vendor involvement / sponsorship.
> But I now don't think it's a good idea for any vendor to lead a project
> which is designed to evaluate competing commercial and open source projects.
> Cheers,
> Simon
> On Mon, Nov 30, 2015 at 11:17 AM, Eoin Keary <eoin.keary at owasp.org> wrote:
>> I don't believe vendors should lead any project.
>> Contribute? Yes. Lead? No.
>> This goes for all projects and shall help with independence and
>> objectivity.
>> Eoin Keary
>> OWASP Volunteer
>> @eoinkeary
>> On 28 Nov 2015, at 11:39 p.m., Kevin W. Wall <kevin.w.wall at gmail.com> wrote:
>> Until very recently, I had been following from a distance this dispute
>> between various OWASP members and Contrast Security over the latter's
>> advertising references to the OWASP Benchmark Project.
>> While I too believe that mistakes were made, I think we all need to take a
>> step back and not throw out the baby with the bath water.
>> Unlike Johanna, I have not run the OWASP Benchmark against any given SAST
>> or DAST tool, but having used many such commercial tools, I feel qualified
>> to render a reasoned opinion of the OWASP Benchmark Project, and perhaps to
>> suggest some steps that we can take toward an amicable resolution.
>> Let me start with the OWASP Benchmark Project. I find the idea of having an
>> extensive baseline of tests against which we can gauge the effectiveness of
>> SAST and DAST software quite sound. In a way, these tests are analogous to
>> the unit tests that we, as developers, use to find bugs in our code and
>> help us improve it, except that here the discovered false positives and
>> false negatives serve as the PASS / FAIL criteria. Just as in unit testing,
>> where the ideal is to have extensive tests that broaden one's "test
>> coverage" of the software under test, the Benchmark Project strives to have
>> a broad set of tests to assist in revealing deficiencies (with the goal of
>> removing these "defects") in various SAST and DAST tools.
>> This is all well and good, and I whole-heartedly applaud this effort.
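>> (To make the analogy concrete, here is a minimal sketch - not anything from
>> the Benchmark itself - of how a single such "test" would be graded; the
>> function and its names are purely illustrative:)
>>
>>     # Hypothetical sketch: grading a tool's result on one Benchmark-style
>>     # test case. "expected_vulnerable" is whether the test case contains a
>>     # real vulnerability; "tool_flagged" is whether the tool reported one.
>>     def grade(expected_vulnerable: bool, tool_flagged: bool) -> str:
>>         if expected_vulnerable and tool_flagged:
>>             return "true positive"    # PASS: real issue, correctly flagged
>>         if expected_vulnerable and not tool_flagged:
>>             return "false negative"   # FAIL: the tool missed a real issue
>>         if not expected_vulnerable and tool_flagged:
>>             return "false positive"   # FAIL: the tool cried wolf
>>         return "true negative"        # PASS: nothing there, nothing flagged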
>> However, I see several ways in which the Benchmark Project falls short. For
>> one, we have no way to measure the "test coverage" of the vulnerabilities
>> that the Benchmark Project claims to measure. There are (by figures that
>> I've seen claimed) something like 21,000 different test cases. How do we,
>> as AppSec people, know whether these 21k "tests" provide "even" test
>> coverage? For instance, it is not unreasonable to think that there may be
>> heavy coverage of tests that are easy to create (e.g., SQLi, buffer
>> overflows, XSS) and much less emphasis on "test cases" for things like
>> cryptographic weaknesses. (This would not be surprising in the least, since
>> every SAST and DAST tool that I've ever used seems to excel in some areas
>> and absolutely suck in others.)
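>> (As a purely hypothetical illustration of what I mean by checking coverage,
>> something as small as the following would do. The CSV file name and column
>> layout below are my assumptions about the project's expected-results file,
>> not something I have verified:)
>>
>>     # Tally Benchmark-style test cases per vulnerability category so we can
>>     # see how "even" the coverage really is. Assumes a CSV where the second
>>     # column is the category and comment lines start with "#".
>>     import csv
>>     from collections import Counter
>>
>>     def category_counts(expected_results_csv: str) -> Counter:
>>         counts: Counter = Counter()
>>         with open(expected_results_csv, newline="") as f:
>>             for row in csv.reader(f):
>>                 if not row or row[0].startswith("#"):
>>                     continue                   # skip blanks and comments
>>                 counts[row[1].strip()] += 1    # column 2 = category (assumed)
>>         return counts
>>
>>     # e.g. print(category_counts("expectedresults-1.2.csv").most_common())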
>> Another way that the Benchmark Project is lacking is one that is admitted
>> on the Benchmark Project wiki page under the "Benchmark Validity" section:
>>    The Benchmark tests are not exactly like real applications. The
>>    tests are derived from coding patterns observed in real
>>    applications, but the majority of them are considerably *simpler*
>>    than real applications. That is, most real world applications will
>>    be considerably harder to successfully analyse than the OWASP
>>    Benchmark Test Suite. Although the tests are based on real code,
>>    it is possible that some tests may have coding patterns that don't
>>    occur frequently in real code.
>> A lot of tools are great at detecting data and control flows that are
>> simple, but fail completely when facing "real code" that uses complex MVC
>> frameworks like the Spring Framework or Apache Struts. The bottom line is
>> that we need realistic tests. While we can be fairly certain that a SAST or
>> DAST tool that misses the low bar of an existing Benchmark Project test
>> case will also miss the equivalent issue in real code, a tool that manages
>> to _pass_ those tests still tells us *absolutely nothing* about its ability
>> to detect vulnerabilities in real-world code, where the code is often
>> orders of magnitude more complex. (And I would argue that this is one
>> reason the false positive rate is so high for SAST and DAST tools; rather
>> than err on the side of false negatives, they flag "issues" about which
>> they are generally unsure and then rely on appsec analysts to discern which
>> are real and which are red herrings. This is still easier than having
>> appsec engineers hunt down these potential issues manually and then analyze
>> them, so it is not entirely inappropriate. As long as the tool provides
>> some sort of "confidence" indicator for the various issues that it finds,
>> an analyst can easily decide whether they are worth the effort of further
>> investigation.)
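>> (A trivial, purely illustrative sketch of that triage step - the field
>> names are hypothetical, not from any particular tool:)
>>
>>     # Triage tool findings by a confidence indicator, highest first.
>>     def worth_investigating(findings, min_confidence=0.7):
>>         candidates = [f for f in findings if f["confidence"] >= min_confidence]
>>         return sorted(candidates, key=lambda f: f["confidence"], reverse=True)
>>
>>     # e.g. worth_investigating([{"issue": "SQLi", "confidence": 0.9},
>>     #                           {"issue": "XSS",  "confidence": 0.3}])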
>> This brings me to what I see as the third major area where the Benchmark
>> Project is lacking. In striving to be simple, it attempts to distill all of
>> the findings into a single metric. The nicest thing I can think of to say
>> about this is that it is woefully naive and misguided. Where I think it is
>> misguided is in assuming that every IT organization in every company
>> weights everything equally. For instance, it treats false positives and
>> false negatives as _equally_ bad. In reality, however, most organizations
>> whose AppSec I've been involved in would much rather have false positives
>> than false negatives. Likewise, all categories (e.g., buffer overflows,
>> heap corruption, SQLi, XSS, CSRF, etc.) are weighted equally. Every appsec
>> engineer knows that this is generally unrealistic; indeed, it is _one_
>> reason that we have different risk ratings for different findings. Also, if
>> a company writes all of its applications in "safe" programming languages
>> like C# or Java, then categories like buffer overflows or heap corruption
>> disappear completely. That means those companies don't care at all whether
>> a given SAST or DAST tool can find those categories of vulnerabilities,
>> because they are completely irrelevant to them. However, because there is
>> no way to customize the weighting of the Benchmark Project findings when
>> they are run for a given tool, everything is shoe-horned into a single
>> magical figure. The result is that this magical Benchmark Project figure
>> becomes almost meaningless. At best, its meaning is very subjective and not
>> at all as objective as Contrast's advertising is attempting to lead people
>> to believe.
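>> (To illustrate, here is a toy sketch - not the actual Benchmark scorecard
>> formula - of the difference between a single take-it-or-leave-it figure and
>> a score an organization could weight for itself; the stand-in formulas and
>> weights are purely illustrative:)
>>
>>     # A stand-in for a single Benchmark-style figure: per-category true
>>     # positive rate minus false positive rate, averaged across categories.
>>     def single_score(per_category):
>>         # per_category: {category: (tpr, fpr)} with rates in [0, 1]
>>         return sum(tpr - fpr for tpr, fpr in per_category.values()) / len(per_category)
>>
>>     # A weighted alternative: penalize misses more heavily than noise, and
>>     # drop categories the organization does not care about (weight 0),
>>     # e.g. buffer overflows for a pure Java/C# shop.
>>     def weighted_score(per_category, category_weight, fn_penalty=2.0):
>>         total = sum(category_weight.get(c, 0.0) for c in per_category)
>>         if total == 0:
>>             return 0.0
>>         score = 0.0
>>         for cat, (tpr, fpr) in per_category.items():
>>             w = category_weight.get(cat, 0.0)
>>             score += w * (1.0 - fn_penalty * (1.0 - tpr) - fpr)
>>         return score / total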
>> I believe that the general reaction to all of this has been negative, at
>> least based on the comments that I've read not only on the OWASP mailing
>> lists but also on Twitter. In the end, this will be damaging to OWASP's
>> overall reputation, or at the very least to the reputation of the OWASP
>> Benchmark Project, either of which I think most of us would agree is bad
>> for the appsec community in general.
>> Therefore, I have a simple proposal toward resolution. I would ask the
>> OWASP project leaders to appeal to the OWASP Board to mark the OWASP
>> Benchmark Project wiki page (and ideally its GitHub site) as noting that
>> the findings are being disputed. For the wiki page, we could do this in the
>> manner that Wikipedia marks disputes, using a Template:Disputed tag
>> (see https://en.wikipedia.org/wiki/Template:Disputed_tag) or its
>> "Accuracy disputes" mechanism (for example, see
>> https://en.wikipedia.org/wiki/Wikipedia:Accuracy_dispute
>> and https://en.wikipedia.org/wiki/Category:Accuracy_disputes).
>> At a minimum, this tag should render something like:
>>    "The use and accuracy of this page is currently being disputed.
>>    OWASP does not support any vendor endorsing any of their
>>    software according to the scores resulting from execution of
>>    the OWASP Benchmark."
>> and it should be applied by the OWASP Board (so that no one is permitted to
>> remove it without proper authorization).
>> I will leave the exact wording up to the Board. But just as with disputed
>> pages on Wikipedia, OWASP must take action on this, or I think it is likely
>> to have credibility issues in the future.
>> Thank you for listening,
>> -kevin wall
>> --
>> Blog: http://off-the-wall-security.blogspot.com/
>> NSA: All your crypto bit are belong to us.
>> _______________________________________________
>> Owasp-board mailing list
>> Owasp-board at lists.owasp.org
>> https://lists.owasp.org/mailman/listinfo/owasp-board
> --
> OWASP ZAP <https://www.owasp.org/index.php/ZAP> Project leader


Michael Coates | @_mwc <https://twitter.com/intent/user?screen_name=_mwc>
OWASP Global Board
