
Disclosure: This is a vendor self-comparison. It puts BackBox AI's own findings side by side with the Aikido Security and XBOW submissions that Doyensec independently reviewed on the Photoview application. BackBox AI was not part of Doyensec's review: of its 22 findings, 16 correlate to Doyensec-confirmed findings and the remaining 6 are self-reported. Where a severity is labeled "Doyensec-adjusted", it reflects Doyensec's published severity for the reviewed tools and, for BackBox AI, the severity inferred from the correlated counterpart. Raw finding counts are also influenced by how each tool decomposes its findings.
Introduction
When an autonomous security tool reports more vulnerabilities than a competitor, it is tempting to read that as a win. It rarely is. A security team does not measure value by the length of the report; it measures value by how many genuinely serious issues surface, and how little noise it has to triage to get there.
This article puts that idea to the test on Photoview, an open-source photo gallery web application. Three offensive tools assessed the same target: BackBox AI (our automated offensive security agent), Aikido Security, and XBOW. The Aikido and XBOW submissions were independently reviewed by Doyensec (published benchmark), who validated every finding and, where needed, re-rated its severity. That independent review gives us a rare, common yardstick: instead of comparing each vendor's self-assigned severities, we can compare against a neutral third party applying one consistent standard.
The result is a benchmark less about volume and more about signal. As the data below shows, BackBox AI matched the serious-vulnerability coverage of a tool that submitted about 45% more findings (32 versus 22), while keeping its share of low-value findings lower (41% versus 53% at Low or Info), and it surfaced several business-logic flaws that neither competitor reported. It also missed one genuinely hard vulnerability, which we discuss openly, because an honest benchmark is the only kind worth publishing.
Results at a Glance
All three tools were measured against Doyensec's adjusted severities, applied uniformly.
| Severity (Doyensec-adjusted) | BackBox AI | Aikido | XBOW |
|---|---|---|---|
| Critical | 1 | 1 | 1 |
| High | 5 | 5 | 1 |
| Medium | 7 | 9 | 2 |
| Low | 6 | 9 | 1 |
| Info | 3 | 8 | 2 |
| Total findings | 22 | 32 | 7 |
| High + Critical | 6 | 6 | 2 |
| Low + Info | 9 (41%) | 17 (53%) | 3 (43%) |
Two numbers carry the story. Under a single neutral standard, BackBox AI and Aikido each land 6 findings at High or Critical, yet BackBox AI reached that bar with 31% fewer submissions (22 versus 32). And while 53% of Aikido's validated findings fell to Low or Info, only 41% of BackBox AI's did. More report, in this case, did not mean more risk identified; it meant more to read.
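For readers who want to check the arithmetic, the derived rows and percentages can be recomputed from the per-severity counts in the table above. The snippet below is a minimal sketch: the counts are copied from the table, while the names `COUNTS`, `summarize`, and `fewer_pct` are ours and exist only for illustration.

```python
# Per-severity counts copied from the results table (Doyensec-adjusted).
SEVERITIES = ["Critical", "High", "Medium", "Low", "Info"]
COUNTS = {
    "BackBox AI": [1, 5, 7, 6, 3],
    "Aikido": [1, 5, 9, 9, 8],
    "XBOW": [1, 1, 2, 1, 2],
}


def summarize(counts):
    """Derive the total, High + Critical, and Low + Info rows for one tool."""
    by_severity = dict(zip(SEVERITIES, counts))
    total = sum(counts)
    serious = by_severity["Critical"] + by_severity["High"]
    low_value = by_severity["Low"] + by_severity["Info"]
    return {
        "total": total,
        "high_plus_critical": serious,
        "low_plus_info": low_value,
        "low_plus_info_pct": round(100 * low_value / total),
    }


summary = {tool: summarize(counts) for tool, counts in COUNTS.items()}

# Relative reduction in submissions, BackBox AI versus Aikido.
aikido_total = summary["Aikido"]["total"]
backbox_total = summary["BackBox AI"]["total"]
fewer_pct = round(100 * (aikido_total - backbox_total) / aikido_total)

for tool, row in summary.items():
    print(tool, row)
print(f"BackBox AI submitted {fewer_pct}% fewer findings than Aikido")
```

The percentages are rounded to whole numbers, which is how they appear in the table.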
Signal Over Noise: Severity Quality, Not Finding Count
The clearest way to see the difference is to look at what Doyensec changed during review.
Aikido submitted its 32 findings without a single one labeled Info. Doyensec then lowered the severity of 11 of them and raised none. Of those 11 downgrades, 8 were cut all the way to Info. In other words, a quarter of Aikido's report was presented at Medium or Low severity for issues a neutral reviewer considered informational: share-token validity oracles, differential-response enumeration, field-suggestion recon, and similar low-impact items. To be clear, each of those was a confirmed true positive, so the question they raise is one of weighting, not validity.
| Tool | Findings | Doyensec lowered | Doyensec raised | Self-labeled Info | Info after review |
|---|---|---|---|---|---|
| Aikido | 32 | 11 | 0 | 0 | 8 |
| XBOW | 7 | 2 | 1 | 1 | 2 |
| BackBox AI | 22 | n/a (not reviewed) | n/a | 1 | 3 (incl. self-corrections) |
This is the heart of the "signal over noise" picture. A report that presents informational findings at inflated severities can look more impressive at first glance, but it shifts the cost of separating signal from noise onto the reader. Aikido's exact-severity agreement with Doyensec was 66% (21 of 32); XBOW's was 57% (4 of 7).
BackBox AI's calibration tells the opposite story on the dimension that matters for trust: restraint. On its 16 findings that correlate to a Doyensec-reviewed counterpart, BackBox AI over-rated only 3, and it self-corrected two findings during its own validation: f08 was reclassified from SQL injection to error-based information disclosure once testing showed it was not exploitable as injection, and f04 (path traversal) had its impact bounded to Low after confirming the media-type filter blocks non-media file access. A tool that talks itself down when the evidence demands it is a tool whose Highs you can believe.
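The exact-agreement figures quoted in this section follow from the review table, assuming that every reviewed finding was either lowered, raised, or left unchanged, and that "exact agreement" means unchanged. That reading is our assumption, although it reproduces the published numbers. A minimal sketch, with the hypothetical helper `exact_agreement`:

```python
# (findings, lowered by Doyensec, raised by Doyensec), copied from the review table.
REVIEW = {
    "Aikido": (32, 11, 0),
    "XBOW": (7, 2, 1),
}


def exact_agreement(findings, lowered, raised):
    """Return (unchanged count, unchanged share in whole percent).

    Assumes each finding falls into exactly one bucket: lowered,
    raised, or unchanged.
    """
    unchanged = findings - lowered - raised
    return unchanged, round(100 * unchanged / findings)


for tool, (findings, lowered, raised) in REVIEW.items():
    unchanged, pct = exact_agreement(findings, lowered, raised)
    print(f"{tool}: {unchanged} of {findings} unchanged ({pct}%)")

# Share of Aikido's report that ended up at Info after review (8 of 32).
aikido_info_share = 8 / 32
print(f"Aikido findings cut to Info: {aikido_info_share:.0%}")
```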
What BackBox AI Found That the Others Missed
Volume is not the only place BackBox AI diverged from the pack. Six findings were unique to BackBox AI, and several of them are exactly the class of issue that pattern-driven enumeration tends to miss: business logic and code-level flaws.
| Finding | Severity (self-assessed) | Why it matters |
|---|---|---|
| Admin self-demotion causes permanent DoS (f13) | Medium | A business-logic flaw: an administrator can remove their own last admin privilege, locking the instance out of administration permanently. No other tool reported it. |
| WebSocket CSRF origin bypass (f05) | Medium | A logic flaw in the CheckOrigin handling when SERVE_UI is enabled, reachable only by reasoning about the code path rather than fuzzing endpoints. |
| Unauthenticated Mapbox token disclosure (f15) | Medium | A code-level secret-exposure issue surfaced by source review rather than live probing. |
| Path traversal via userAddRootPath (f04) | Low | Reported with its impact carefully bounded after validation, not overstated. |
| No password-policy enforcement (f19) | Low | Includes an analysis of the stealth-persistence angle. |
| GraphQL multipart transport enabled without uploads (f22) | Info | Attack-surface hygiene. |
These six were outside the set Doyensec reviewed, so we present them as BackBox AI findings rather than independently confirmed ones. But the pattern is meaningful: the two business-logic issues (admin self-demotion, WebSocket CSRF) are the kind of vulnerability that requires understanding how the application behaves, not just what endpoints it exposes.
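To make the admin self-demotion class of flaw concrete, the sketch below shows the kind of guard whose absence produces it: a role change is refused when it would leave no administrator. This is an illustration only, not Photoview's code. `User`, `LastAdminError`, and `set_admin` are hypothetical names, and the in-memory dictionary stands in for whatever user store a real application uses.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    is_admin: bool


class LastAdminError(Exception):
    """Raised when a change would leave the instance with no administrator."""


def set_admin(users, target_name, make_admin):
    """Change a user's admin flag, refusing to remove the last admin.

    `users` is a hypothetical in-memory store mapping user name to User.
    """
    target = users[target_name]
    if target.is_admin and not make_admin:
        # Count the admins that would remain after the demotion.
        remaining = sum(
            1 for u in users.values() if u.is_admin and u.name != target_name
        )
        if remaining == 0:
            raise LastAdminError(f"{target_name} is the only administrator")
    target.is_admin = make_admin


users = {"alice": User("alice", True), "bob": User("bob", False)}
try:
    set_admin(users, "alice", False)  # alice is the sole admin
except LastAdminError as exc:
    print("refused:", exc)
```

A check of this kind cannot be inferred from the list of exposed endpoints; it depends on knowing that the application needs at least one administrator to remain manageable.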
Where the Three Tools Agree
Credibility comes from overlap as much as from difference. Five findings were reported by all three tools, the strongest possible signal that they are real and material:
| Vulnerability | Doyensec severity |
|---|---|
| Unauthenticated SQL injection in /api/download/album | Critical |
| IDOR in media access (no ownership check) | High |
| SQL error disclosure via orderBy parameter | Low |
| Missing security headers | Low / Info |
| Filesystem path disclosure | Info |
Beyond that shared core, BackBox AI's 22 findings break down into 16 that overlap the Aikido or XBOW submissions and 6 unique to BackBox AI. The headline Critical, an unauthenticated stacked-query SQL injection in the album download route, was found by all three and confirmed by Doyensec at Critical severity, a useful sanity check that all three tools were operating at a serious level on the same target.
What Each Tool Did Best on Photoview
On this target the three tools overlapped on the most serious issues, yet each also showed a distinct strength worth naming. These observations describe the Photoview assessment specifically: on other targets, or under different time budgets, scopes, and configurations, the balance could shift.
- BackBox AI, here, concentrated serious findings with low noise and reached into business-logic and code-level flaws (admin self-demotion, WebSocket CSRF) that endpoint enumeration tends to miss.
- Aikido, on this target, mapped the widest attack surface, with 14 valid findings that neither other tool reported, useful for an exhaustive hardening pass once its severities are re-baselined against a neutral standard.
- XBOW, in this test, traded volume for the single hardest finding in the set, the SQLi-to-SSRF chain, with severities that held up well under review.
Where BackBox AI Can Improve
A benchmark that only flatters the sponsor is not a benchmark. BackBox AI has real gaps, and the most important one is genuinely difficult.
| Missed finding | Source | Doyensec severity | Note |
|---|---|---|---|
| Out-of-band SSRF chained from the SQL injection | XBOW-3 | Medium | The standout miss. The SQL injection can be chained into an out-of-band SSRF through how ffprobe.ProbeUrl is used. Doyensec described this as non-trivial to identify, requiring deep code understanding. |
| Media share token exposes the full parent-album media list | AIKIDO-5 | Medium | A token-escalation path BackBox AI did not surface. |
| Plaintext share-password cookie on a broad path | AIKIDO-14 | Medium | Cookie-security hygiene. |
| Directory listing enabled in /assets | XBOW-4 | Medium | A configuration issue requiring static-directory enumeration. |
| No rate limiting on shareTokenValidatePassword | AIKIDO-12 | Medium | BackBox AI flagged rate limiting generally but missed this specific brute-force vector. |
Two patterns stand out. First, complex exploitation chaining: the SSRF chain is the clearest example of a vulnerability that XBOW's targeted, low-volume approach was well suited to find and that BackBox AI did not. Second, severity calibration on access control: where BackBox AI erred against Doyensec's standard, it tended to under-rate IDOR and session issues (it rated the FavoriteMedia IDOR Low and the post-password-change session issue Medium, both of which the neutral standard treats as High). The encouraging read is that BackBox AI found these issues; the work is in valuing them correctly. Cookie-flag hardening, TLS cipher review, and the face-recognition attack surface are smaller, systematic gaps worth closing.
Methodology and Limitations
Vendor self-comparison. This analysis was produced by BackBox Labs and compares its own product against third-party submissions. That is an inherent conflict of interest, and it is why the analysis is anchored to Doyensec's independent severities rather than our own ratings.
BackBox AI was not independently reviewed. Doyensec reviewed the Aikido and XBOW submissions, not BackBox AI's. We correlate BackBox AI's findings to the reviewed ones (16 of 22) and infer the adjusted severity from the matched counterpart; the other 6 are self-reported. Any "true positive" or "Doyensec-adjusted" claim for BackBox AI applies only to the correlated subset.
Severity is partly subjective. Doyensec is treated as the authoritative standard here, but reasonable professionals can disagree on individual ratings. The value of the comparison is in the aggregate calibration pattern, not in any single finding.
Finding counts are not capability. XBOW's 7 findings reflect a targeted, high-precision approach, and it found the hardest vulnerability in the set. Different time budgets, scopes, and configurations across the three assessments mean raw counts should be read with care.
Environment differences. Configuration such as SERVE_UI, MAPBOX_TOKEN, and TLS termination may not have been identical across assessments, which affects what was discoverable.
For reference, Doyensec's average manual analysis time per finding was roughly 19 minutes for the Aikido set and 17 minutes for the XBOW set.
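Those per-finding averages translate into a rough total review effort for each submission. The sketch below simply multiplies finding count by average minutes. Because the averages are themselves approximate, the totals are order-of-magnitude estimates of ours, not figures published by Doyensec, and `total_review_hours` is a hypothetical helper.

```python
# (findings, approximate average review minutes per finding), from the text above.
REVIEW_MINUTES = {
    "Aikido": (32, 19),
    "XBOW": (7, 17),
}


def total_review_hours(findings, avg_minutes):
    """Estimate total manual review time in hours, rounded to one decimal."""
    return round(findings * avg_minutes / 60, 1)


for tool, (findings, avg_minutes) in REVIEW_MINUTES.items():
    hours = total_review_hours(findings, avg_minutes)
    print(f"{tool}: about {hours} hours for {findings} findings")
```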
Key Takeaways
| # | Takeaway | Confidence |
|---|---|---|
| 1 | Under Doyensec's uniform severity standard, BackBox AI matched Aikido on High + Critical findings (6 each) while submitting 31% fewer findings overall. | High |
| 2 | 53% of Aikido's validated findings were Low or Info, and Doyensec downgraded 11 of its 32 findings (raising none, cutting 8 to Info), indicating inflated self-assigned severities. | High |
| 3 | BackBox AI demonstrated calibration restraint, self-correcting two findings during its own validation rather than overstating them. | High |
| 4 | BackBox AI uniquely surfaced business-logic flaws (admin self-demotion DoS, WebSocket CSRF bypass) that neither competitor reported. | Medium (BackBox AI findings, outside Doyensec's reviewed scope) |
| 5 | BackBox AI's most significant miss was the SQLi-to-SSRF chain (XBOW-3), which Doyensec called non-trivial to identify. | High |
| 6 | Where BackBox AI mis-rated severity against the neutral standard, it tended to under-rate access-control issues, not over-rate them. | High |
Conclusion
On Photoview, the meaningful comparison is not 22 versus 32 versus 7. It is what those numbers cost the reader. BackBox AI reached the same count of High and Critical vulnerabilities as a tool that reported about 45% more findings, and it did so with a cleaner severity profile and a habit of correcting itself rather than inflating. For a security team, that is the difference between a report that drives action and a report that drives triage.
The honest counterweight is the SSRF chain that BackBox AI missed and the access-control issues it under-rated. Both are addressable, and naming them is part of why this benchmark is worth trusting. The direction of the error matters too: BackBox AI's mistakes were under-statements of real issues, which are safer to ship than over-statements of minor ones.
The takeaway is not that BackBox AI found the most. It is that, measured against a neutral standard, it found what mattered, with less noise. That is the metric that earns a place in a security workflow.
All findings and severities in this report are derived from Doyensec's independent review of the Aikido and XBOW submissions and from BackBox AI's own validated output. No data or assumptions beyond those sources have been introduced.
Source:
- Doyensec, AI for Pentesting: Aikido and XBOW on Photoview (independent review, severity-adjusted findings): blog.doyensec.com/2026/05/27/aikido-xbow.html