[Owasp-leaders] [Owasp-board] Working toward a resolution on the Constrast Security / OWASP Benchmark fiasco

Eoin Keary eoin.keary at owasp.org
Tue Dec 1 16:15:44 UTC 2015


Point taken. And a better solution Simon. But at least a little bit of governance here would be nice.

Eoin Keary
OWASP Volunteer
@eoinkeary



> On 1 Dec 2015, at 9:19 a.m., psiinon <psiinon at gmail.com> wrote:
> 
> I actually disagree.
> I'm fine with vendors leading most types of projects - we should be encouraging more vendor involvement / sponsorship.
> But I now dont think its a good idea for any vendor to lead a project which is designed to evaluate competing commercial and open source projects.
> 
> Cheers,
> 
> Simon
> 
>> On Mon, Nov 30, 2015 at 11:17 AM, Eoin Keary <eoin.keary at owasp.org> wrote:
>> I don't believe vendors should lead any project. 
>> 
>> Contribute? yes, Lead? No.
>> 
>> This goes for all projects and shall help with independence and objectivity. 
>> 
>> 
>> Eoin Keary
>> OWASP Volunteer
>> @eoinkeary
>> 
>> 
>> 
>>> On 28 Nov 2015, at 11:39 p.m., Kevin W. Wall <kevin.w.wall at gmail.com> wrote:
>>> 
>>> Until very recently, I've been following at a distance this dispute between
>>> various OWASP members and Contrast Security over the latter's advertising
>>> references to the OWASP Benchmark Project
>>> 
>>> While I too believe that mistakes were made, I believe that we all need to
>>> take a step back and not throw out the baby with the bath water.
>>> 
>>> While unlike Johanna, I have not executed the OWASP Benchmark Project for
>>> any given SAST or DAST tool, having used many such commercial tools, I feel
>>> qualified to render a reasoned opinion of the OWASP Benchmark Project, and
>>> perhaps some steps that we can take towards amicable resolution.
>>> 
>>> Let me start with the OWASP Benchmark Project. I find the idea of having an
>>> extensive baseline of tests against we can gauge the effectiveness of SAST
>>> and DAST software quite sound. In a way, these tests are analogous to unit
>>> tests that we, as developers, use to find bugs in our code and help us
>>> improve it, where here the discovered false positives and false negatives
>>> being revealed are being used as the PASS / FAIL criteria for the tests. Just
>>> as in unit testing, where the ideal is to have extensive tests to broaden one's
>>> "test coverage" of the software under test, the Benchmark Project strives
>>> to have a broad set of tests to assist in revealing deficiencies (with
>>> the goal of removing these "defects") in various SAST and DAST tools.
>>> 
>>> This is all well and good, and I whole-heartedly applaud this effort.
>>> 
>>> However, I see several ways that this Benchmark Project fails. For one,
>>> we have no way to measure the "test coverage" of the vulnerabilities that
>>> the Benchmark Project claims to measure. There are (by figures that I've
>>> seen claimed) something like 21,000 different test cases. How do we, as AppSec
>>> people, know if these 21k 'tests' provide "even" test coverage? For
>>> instance, it is not unreasonable to think that they may be heavy coverage
>>> on tests that are easy to create (e.g., SQLi, buffer overflows, XSS) and
>>> a much lesser emphasis on "test cases" for things like cryptographic
>>> weaknesses. (This would not be surprising in the least, since the coverage
>>> of every SAST and DAST tool that I've ever used seems to excel in some
>>> areas and absolutely suck in others.)
>>> 
>>> Another way that the Benchmark Project is lacking is one that is admitted
>>> on the Benchmark Project wiki page under the "Benchmark Validity" section:
>>>    The Benchmark tests are not exactly like real applications. The
>>>    tests are derived from coding patterns observed in real
>>>    applications, but the majority of them are considerably *simpler*
>>>    than real applications. That is, most real world applications will
>>>    be considerably harder to successfully analyse than the OWASP
>>>    Benchmark Test Suite. Although the tests are based on real code,
>>>    it is possible that some tests may have coding patterns that don't
>>>    occur frequently in real code.
>>> 
>>> A lot of tools are great at detecting data and control flows that are simple,
>>> but fail completely when facing "real code" that uses complex MVC frameworks
>>> like Spring Framework or Apache Struts. The bottom line is that we need
>>> realistic tests. While we can be fairly certain that if a SAST or DAST tool
>>> misses the low bar of one of the existing Benchmark Project test cases, if
>>> they are able to _pass_ those tests, it still says *absolutely nothing* about
>>> their ability to detect vulnerabilities in real world code where the code
>>> is often orders of magnitude more complex. (And I would argue that this is
>>> one reason we see the false positive rate so high for SAST and DAST tools;
>>> rather than err on the side of false negatives, they flag "issues" that
>>> they are generally unreliable and then rely on appsec analysts to discern which
>>> are real and which are red herrings. This is still easier than if they
>>> appsec engineers had to hunt down these potential issues manually and then
>>> analyze them, so it is not entirely inappropriate. As long as the tool
>>> provides some sort of "confidence" indicator for the various issues that it
>>> finds, an analyst can easily decide whether they are worth spending effort on
>>> further investigation.)
>>> 
>>> This brings me to what I see as the third major area of where the Benchmark
>>> Project is lacking. In striving to be simple, it attempts to distill all the
>>> findings into a single metric. The nicest thing I can think of saying about
>>> this is that it is woefully naive and misguided. I think where it is misguided
>>> is that it assumes that every IT organization in every company weights
>>> everything equally. For instance, false positives and false negatives are both
>>> _equally_ bad. However, in reality, most organizations that I've been involved
>>> in AppSec would highly prefer false positives over false negatives. Likewise,
>>> all categories (e.g., buffer overflows, heap corruption, SQLi, XSS, CSRF,
>>> etc.) are all weighted equally. Every appsec engineer knows that this is
>>> generally unrealistic; indeed it is _one_ reason that we have different risk
>>> ratings for different findings. Also, if a company writes all of their
>>> applications in "safe" programming languages like C# or Java, then categories
>>> like buffer overflows or heap corruption completely disappear. What that means
>>> is that those companies don't care at all whether or not a given SAST or DAST
>>> tool can find those categories of vulnerabilities or not because they are
>>> completely irrelevant for them. However, because there is no way to customize
>>> the weighting of Benchmark Project findings when run for a given tool,
>>> everything is attempted to be shoe-horned into a single magical figure. The
>>> result is that that magical Benchmark Project figure becomes almost
>>> meaningless. At best, it's meaning is very subjective and not at all as
>>> objective as Contrast's advertising is attempting to lead people to believe.
>>> 
>>> I believe that the general reaction to all of this has been negative, at
>>> least based on the comments that I've read not only in the OWASP mailing
>>> lists, but also on Twitter. In the end, this will be damaging to either
>>> OWASP's overall reputation or at the very least, the reputation of the
>>> OWASP Benchmark Project, both of which I think most of us agreed is
>>> bad for the appsec community in general.
>>> 
>>> Therefore, I have a simple proposal towards resolution. I would appeal to
>>> the OWASP project leaders to appeal to the OWASP Board to simply mark the
>>> OWASP Benchmark Project Wiki page (and ideally, its GitHub site) as noting
>>> that the findings are being disputed. For the wiki page, we could do this
>>> in a manner that Wikipedia marks disputes, using a Template:Disputed tag
>>> (see https://en.wikipedia.org/wiki/Template:Disputed_tag) or their
>>> "Accurracy Disputes" (for example, see
>>> https://en.wikipedia.org/wiki/Wikipedia:Accuracy_dispute
>>> and https://en.wikipedia.org/wiki/Category:Accuracy_disputes)
>>> 
>>> At a mininum, we should have this tag result in rendering something like:
>>>    "The use and accuracy of this page is currently being disputed.
>>>    OWASP does not support any vendor endorsing any of their
>>>    software according to the scores resulting in execution of
>>>    the OWASP Benchmark."
>>> that the OWASP Board should apply (so that no one is permitted to
>>> remove it without proper authorization).
>>> 
>>> I will leave the exact wording up to the board. But just like disputed
>>> pages on Wikipedia, OWASP must take action on this or I think they are
>>> likely to have credibility issues in the future.
>>> 
>>> Thank you for listening,
>>> -kevin wall
>>> -- 
>>> Blog: http://off-the-wall-security.blogspot.com/
>>> NSA: All your crypto bit are belong to us.
>>> _______________________________________________
>>> Owasp-board mailing list
>>> Owasp-board at lists.owasp.org
>>> https://lists.owasp.org/mailman/listinfo/owasp-board
>> 
>> _______________________________________________
>> Owasp-board mailing list
>> Owasp-board at lists.owasp.org
>> https://lists.owasp.org/mailman/listinfo/owasp-board
> 
> 
> 
> -- 
> OWASP ZAP Project leader
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.owasp.org/pipermail/owasp-leaders/attachments/20151201/6be63151/attachment-0001.html>


More information about the OWASP-Leaders mailing list