What is the OWASP Benchmark?
I’m sure most of you are familiar with OWASP (the Open Web Application Security Project), or have at least heard of its famous Top 10 list of vulnerabilities affecting web applications today. If you develop software and are not familiar with it, you’d better start with the links above now: the security of your application is at stake.
To give you a little context, OWASP operates as a non-profit and is not affiliated with any technology company, which puts it in a unique position to provide impartial, practical information about AppSec to individuals, corporations, universities, government agencies, and anyone interested in security from the application perspective. It is a very reliable source of information, and Kiuwan, as an active OWASP contributor member, is very proud to align our products with what the organization has to say. In fact, we implement a lot of functionality that covers specific aspects defined by OWASP, like the aforementioned Top 10 vulnerability list.
But today I want to introduce another OWASP project that is not as well known, yet can be a very useful tool if you are in a position to decide which tools to use to ensure you develop secure applications: the OWASP Benchmark.
From their home page: “The OWASP Benchmark for Security Automation is a free and open test suite designed to evaluate the speed, coverage, and accuracy of automated software vulnerability detection tools and services. Without the ability to measure these tools, it is difficult to understand their strengths and weaknesses, and compare them to each other. Each version of the OWASP Benchmark contains thousands of test cases that are fully runnable and exploitable, each of which maps to the appropriate CWE number for that vulnerability.”
As a Static Application Security Testing (SAST) provider, we just couldn’t resist seeing how our Kiuwan Code Security product performs against the OWASP Benchmark. It makes sense, and it gives you objective information about Kiuwan compared with other tools out there.

So we set out to do it.
Thanks to Kiuwan’s ease of use, analyzing all the test cases was a walk in the park. The only “extra” thing we had to do was transform Kiuwan’s output into the Benchmark’s expected results format (basically a CSV file) so we could use the automatic procedure provided by OWASP to generate the scorecard. If you have a Kiuwan account and are interested in running a Kiuwan analysis on the Benchmark yourself and generating the scorecard, stay tuned. We will be publishing a post with everything you need to reproduce the results I’m about to discuss.
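To give you an idea of what that transformation involves, here is a minimal sketch. This is a hypothetical illustration, not our actual script: the shape of the findings (the `file`, `category`, and `cwe` keys) and the exact CSV column order are assumptions you would adapt to the real Kiuwan export and to what the scorecard generator expects.

```python
# Hypothetical sketch: convert tool findings into a Benchmark-style CSV
# (test name, category, detected flag, CWE). The input structure is an
# assumption, not a documented Kiuwan schema.
import csv
import re

def kiuwan_to_benchmark_csv(findings, out_path):
    rows = []
    for f in findings:
        # Benchmark test files are named BenchmarkTestNNNNN.java, so the
        # test name can be recovered from the reported file path.
        m = re.search(r"(BenchmarkTest\d+)", f["file"])
        if m:
            rows.append((m.group(1), f.get("category", ""), "true", f["cwe"]))
    with open(out_path, "w", newline="") as fh:
        csv.writer(fh).writerows(sorted(rows))
    return len(rows)

# Example with a made-up finding:
n = kiuwan_to_benchmark_csv(
    [{"file": "src/BenchmarkTest00042.java", "category": "xss", "cwe": 79}],
    "kiuwan-results.csv",
)
```

From there, the CSV can be fed to the OWASP scorecard generation procedure.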
The Kiuwan results
First things first. After running the analysis and generating the scorecard, this is what Kiuwan’s results look like:

These are the details corresponding to the points in the graph:
Not bad! Kiuwan scores an almost 100% True Positive Rate (TPR) with a False Positive Rate (FPR) just above 40%. Not bad at all. This means that Kiuwan reports almost all vulnerabilities in the Benchmark code, the True Positives (TP), leaving only 22 False Negatives (FN) behind. On the other hand, Kiuwan reports another 511 vulnerabilities that are False Positives (FP).
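For reference, these are the formulas behind those rates. The sketch below assumes the Benchmark v1.2 split of 1,415 test cases that contain a real vulnerability and 1,325 that don’t; the exact percentages will vary slightly with the Benchmark version you run.

```python
# Sketch of the metrics behind the scorecard. The v1.2 test-case split is
# an assumption; check the version you actually analyze.
REAL, SAFE = 1415, 1325

def rates(fn, fp):
    tp = REAL - fn          # true positives: real vulnerabilities reported
    tn = SAFE - fp          # true negatives: safe cases correctly ignored
    tpr = tp / (tp + fn)    # sensitivity
    fpr = fp / (fp + tn)    # 1 - specificity
    return tpr, fpr

# The figures from the scorecard above: 22 false negatives, 511 false positives.
tpr, fpr = rates(fn=22, fp=511)
print(f"TPR={tpr:.1%}  FPR={fpr:.1%}")
```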
In plain English, Kiuwan is a very sensitive tool, finding almost all real vulnerabilities, but it is a little less specific, reporting some vulnerabilities that are not real. In other words, Kiuwan adds some “noise” to the results.
How does Kiuwan compare to other tools?
Now, if you have to select a SAST tool, which would you prefer: a more sensitive or a more specific tool? Well, obviously the answer should be the most sensitive AND the most specific, right? Let’s see a comparative graph with the results of 17 tools (16 SAST and 1 DAST). These are the tools compared on the OWASP Benchmark site, to which we have added Kiuwan and the latest version of SonarQube. Notice that the other commercial tools are anonymized as SAST-0X. For your reference, you can see the original graph in the OWASP Benchmark wiki.
Except for 3 of them (the ones not designed for security testing), all of the tools are above the “worse than guessing” line; that’s good news. The further above the line, the better: that distance corresponds to the tool’s average score, or Youden’s index, which combines sensitivity and specificity. You can see that no tool is both very sensitive and very specific at the same time. So if you are in a position to decide which tool to use to analyze your code, there is always a trade-off: sensitivity vs. specificity. Youden’s index can give you a hint (the higher, the better), but be careful: several different sensitivity/specificity pairs can produce the same Youden’s index.
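Youden’s index itself is simple arithmetic: J = sensitivity + specificity − 1, which is the same as TPR − FPR, i.e. the vertical distance above the guessing diagonal. A toy comparison (with made-up rates) shows how two quite different tools can share one index:

```python
# Youden's index: J = sensitivity + specificity - 1 = TPR - FPR.
def youden(tpr, fpr):
    return tpr - fpr

# Two hypothetical tools with the same index but opposite trade-offs:
sensitive_tool = youden(tpr=0.95, fpr=0.45)   # noisy but thorough
specific_tool  = youden(tpr=0.60, fpr=0.10)   # quiet but misses more
# Both come out at roughly 0.50, yet they behave very differently in practice.
```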
If you ask me, I want sensitivity over specificity. I want my tool to report as many real vulnerabilities as possible, and if that means it generates some “noise”, that’s okay with me. As long as it is a reasonable amount, the noise can be managed. Undetected real vulnerabilities, on the other hand, just lie there in your code and you have no information about them, none. If you have any suspicions, you need to review your code manually to find them, which is costly and time-consuming.
Taking all this into account, Kiuwan should be your tool of choice. But how fair is the Benchmark?
Giving the Benchmark a little twist
By design, the OWASP Benchmark weighs every vulnerability equally. Is that fair? Well, OWASP itself runs the famous OWASP Top 10 project in an effort to prioritize security vulnerabilities. What happens if we put the Top 10 together with the Benchmark?

The Benchmark classifies all its test cases by vulnerability type and maps each type to the corresponding CWE. Here are the classification and the mapping.
| Vulnerability type | CWE |
|---|---|
| Cross-site Scripting – XSS | 79 |
| OS Command Injection | 78 |
| Reversible One-Way Hash | 328 |
| Trust Boundary Violation | 501 |
| Insufficiently Random Values | 330 |
| Use of a Broken or Risky Cryptographic Algorithm | 327 |
| Sensitive Cookie in HTTPS session without “Secure” attribute | 614 |
There are 11 vulnerability types, each mapping to a corresponding CWE. How do these map to the Top 10? That is not an easy question: some of the vulnerability types and their CWEs map perfectly to a Top 10 entry, but others, like Trust Boundary Violation, are more challenging.
Anyway, we have done the exercise and mapped the 11 types in the Benchmark. Next, we assigned each a weight from 1 to 10 based on its Top 10 position: 10 for A1, 9 for A2, and so on. I know the next table can be controversial, but it is completely open to discussion.
| Vulnerability type | CWEs | Top 10 | Weight |
|---|---|---|---|
| Cross-site Scripting – XSS | 79 | A3 – Cross-site Scripting (XSS) | 8 |
| OS Command Injection | 78 | A1 – Injection | 10 |
| Path Traversal | 22 | A4 – Insecure Direct Object References | 7 |
| Reversible One-Way Hash | 328 | A6 – Sensitive Data Exposure | 5 |
| Trust Boundary Violation | 501 | A2 – Broken Authentication and Session Management | 9 |
| Insufficiently Random Values | 330 | A9 – Using Components with Known Vulnerabilities | 2 |
| Use of a Broken or Risky Cryptographic Algorithm | 327 | A6 – Sensitive Data Exposure | 5 |
| Sensitive Cookie in HTTPS session without “Secure” attribute | 614 | A5 – Security Misconfiguration | 6 |
You probably know where we are going with this, right? We think that a tool that handles the more critical vulnerabilities better than the less critical ones should rank better. Treating all vulnerabilities as equal in the calculation does not give you this information. So instead of calculating each tool’s score using plain averages of the True Positive and False Positive rates over the 11 vulnerability types, we calculate weighted averages using the weights in the table above.
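As a sketch, here is how a weighted average differs from a plain one. The per-category TPR values below are made up for illustration; the weights come from the table above (a subset of the 11 categories, with the first-table names mapped to short labels):

```python
# Top 10 weights for a subset of Benchmark categories, per the table above.
WEIGHTS = {
    "OS Command Injection": 10,
    "Trust Boundary Violation": 9,
    "Cross-site Scripting": 8,
    "Path Traversal": 7,
    "Reversible One-Way Hash": 5,
}

def weighted_avg(per_category, weights):
    # Weighted mean of per-category rates instead of a plain mean.
    total = sum(weights[c] for c in per_category)
    return sum(per_category[c] * weights[c] for c in per_category) / total

# Made-up per-category TPRs for a hypothetical tool:
tpr_by_cat = {
    "OS Command Injection": 1.0,
    "Trust Boundary Violation": 0.7,
    "Cross-site Scripting": 0.9,
    "Path Traversal": 0.8,
    "Reversible One-Way Hash": 1.0,
}
plain = sum(tpr_by_cat.values()) / len(tpr_by_cat)   # treats all types equally
weighted = weighted_avg(tpr_by_cat, WEIGHTS)
```

Here the weighted score lands below the plain one, because the weakest category (Trust Boundary Violation) carries a high weight.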
We have done this only for the open-source tools whose detailed results are in the OWASP Benchmark GitHub repository, the latest version of SonarQube (we ran this one ourselves), and of course Kiuwan. We don’t have access to the other commercial tools, and their detailed results haven’t been published.

After crunching the numbers, we created the comparison scorecard for the 8 tools on what we call the weighted benchmark.
You can see that all the results get displaced in different directions. By taking weights into account, we are measuring not only sensitivity and specificity in general, but also each tool’s ability to handle the more critical vulnerabilities according to the OWASP Top 10 ranking.
For example, if a tool handles, say, half of the True Positives in the Benchmark correctly, the plain average calculation places it halfway up the Y axis of the graph. However, if the vulnerabilities it detects are less critical than the other half it is missing, the weighted calculation places it below the 50% line. If it is the other way around, it lands above the 50% line.
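To put that example in numbers, take an extreme hypothetical case: two equally sized categories with weights 2 and 10, where the tool finds everything in the low-weight category and nothing in the high-weight one.

```python
# Extreme illustration of the displacement: per-category TPRs of 1.0 and 0.0.
found, missed = 1.0, 0.0
plain = (found + missed) / 2                      # halfway up the Y axis
weighted = (2 * found + 10 * missed) / (2 + 10)   # well below the 50% line
```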
I think it is fair. Maybe controversial, since, as I said, you may not agree with the mapping we’ve done, but fair if you broadly agree with it.
Regarding Kiuwan’s result with this approach, you can see that it is slightly displaced to the right in the graph. This means that its reported false positives correspond to more critical vulnerabilities according to the Top 10, which I still think is not a bad result. Since prevention is better than cure, I prefer to be alerted about potential highly critical vulnerabilities than to have no information at all.
Another shortcoming of the Benchmark is that it is only available for Java. That is fine, but many tools out there, like Kiuwan, support other languages. We are making the effort to use known benchmarks for other languages (like the ones available at NIST) and to apply the scorecard mechanism of the OWASP Benchmark described here (and the weighted one too!) to produce the same kind of graph results for other languages.