cbg0 2 days ago
- No active defenders. Real networks have security teams monitoring for intrusions, responding to alerts, and adapting defences. Our ranges are static: for example, our deployment of Elastic Defend was not configured to block or impede attack progress.
- Detections not penalised. We measured triggered security alerts but did not incorporate them into overall performance scores. A model that completes more steps while triggering many alerts may be a lesser threat than one that is able to reliably remain undetected.
- Vulnerability density varies. Our ranges are designed to have vulnerabilities; real environments are not.
- Lower artefact density than real environments. Our ranges contain fewer nodes, services, and files than typical production networks, reducing the noise a model must navigate. While substantially more complex than CTF-style evaluations, our ranges remain considerably simpler than real enterprise environments.
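To make the "detections not penalised" point concrete, here is one hypothetical way an evaluation could fold triggered alerts into a range score. This is purely an illustrative sketch; the penalty weight and scoring function are my assumptions, not anything from the report.

```python
# Hypothetical sketch: discount a model's step-completion score by the
# number of security alerts it triggered. The 0.05 penalty weight is an
# arbitrary assumption for illustration.

def penalized_score(steps_completed: int, total_steps: int,
                    alerts_triggered: int, alert_penalty: float = 0.05) -> float:
    """Fraction of attack steps completed, discounted per triggered alert."""
    raw = steps_completed / total_steps
    return max(0.0, raw - alert_penalty * alerts_triggered)

# A "noisy" model that finishes the range but trips many alerts can end up
# scoring below a stealthier model that completes fewer steps.
noisy = penalized_score(20, 20, alerts_triggered=12)    # 1.0 - 0.60 = 0.40
stealthy = penalized_score(16, 20, alerts_triggered=1)  # 0.8 - 0.05 = 0.75
```

Under a scheme like this, "completed every step" stops being the headline number on its own, which is the gap the parent comment is pointing at.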
lebovic 2 days ago
Personally, I think we crossed the threshold of meaningfully useful capabilities for autonomous hacking with Opus 4.6 [2], mostly because its behaviors and persistence are useful for finding vulnerabilities out of the box [3]. But it still seems like Mythos is another step up.
[1]: https://cdn.prod.website-files.com/663bd486c5e4c81588db7a48/...
[2]: https://www.noahlebovic.com/testing-an-autonomous-hacker/
Cynddl 2 days ago
With many colleagues (including from AISI themselves!), we recently reviewed 445 AI benchmarks & evaluations from the past few years. Our work was published at NeurIPS (https://openreview.net/pdf?id=mdA5lVvNcU) and we made eight recommendations for better evaluations. One is “use statistical methods to compare models”:
□ Report the benchmark’s sample size and justify its statistical power
□ Report uncertainty estimates for all primary scores to enable robust model comparisons
□ If using human raters, describe their demographics and mitigate potential demographic biases in rater recruitment and instructions
□ Use metrics that capture the inherent variability of any subjective labels, without relying on single-point aggregation or exact matching.
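As a minimal sketch of the "report uncertainty estimates" recommendation: a percentile bootstrap confidence interval over per-task pass/fail results. The task results below are made-up illustrative data, not numbers from any actual evaluation.

```python
# Percentile bootstrap CI for a benchmark's mean score, to check whether
# a headline gap between two models is statistically meaningful.
import random

def bootstrap_ci(results, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-task scores."""
    rng = random.Random(seed)
    n = len(results)
    means = sorted(
        sum(rng.choices(results, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

model_a = [1] * 30 + [0] * 20   # hypothetical: 60% pass rate over 50 tasks
lo, hi = bootstrap_ci(model_a)
# If two models' intervals overlap heavily, a ~7-9 point gap on the
# leaderboard may not survive a proper statistical comparison.
```

With only 50 tasks the interval here is wide (roughly ±13 points), which is exactly why sample size and uncertainty reporting matter for small benchmark deltas.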
I would strongly recommend taking these blog posts with a grain of salt, as there is very little that can be learned without proper evaluations.
thepasch 2 days ago
Like, don’t get me wrong, it’s definitely an improvement, and it’s looking to be a pretty decent one too. But “stepwise”? When GPT-5 has outperformed it at the technical non-expert level since ~mid last year, and 5.4 pretty much matches it at Practitioner level?
And in the charts where Mythos is at the top, it’s usually only by ~7-9 percentage points. It gets an average of 6 more steps than Opus 4.6 in the full takeover simulation. It was the only model to actually complete it, but… I mean, Opus 4.6 apparently already got pretty close?
And Opus 5 is supposed to be between Mythos and 4.6, which, going by the numbers, would seem to me a smaller jump than between 4.5 and 4.6.
If this is the model they can’t deploy yet because it eats ungodly amounts of compute, then I guess scaling really is a dead end.
I dunno. Maybe I’m reading it wrong. I’d probably be more impressed if Anthropic hadn’t proclaimed The End Times Of Cybersecurity Are Upon Us. And I’d be happy to be proven wrong?
edit:
> We expect that performance on our evaluations would continue to improve with more inference compute: we ran the cyber ranges with a 100M token budget; Mythos Preview’s performance continues to scale up to this limit, and we expect performance improvements would continue beyond that.
Right, so this isn’t the ceiling, it’s just a ceiling at that token allocation. If they were seeing continual improvement up to that limit, then it does stand to reason that bumping the limit further would also bump performance. But then that makes me wonder what effect that would have on the other models. Does the gap grow? Shrink? Stay the same?
ofjcihen a day ago
These details are what actually matter to defenders like myself.
As others have pointed out, the limitations are revealing, but the fact that it even made it to the end (despite the cost) is still impressive.
I’m hoping that we can get to a point in the future where any skepticism around claims made by the companies producing these models isn’t met with immediate downvotes and accusations of being a Luddite.