Thank you. Hello everyone and welcome. So we're here, we're going to be talking about security and we're
going to be talking specifically about the AI cybersecurity arms race. You know, unless you've
been living under a rock,
you notice that around November of last year,
coding agents really started to change in terms of their effectiveness.
Claude Code, Codex, these things have been around for a while,
but there's a step change in the work they can do autonomously,
the work they can do reliably, and what they're able to do.
And of course, coding agents are also security agents.
If you can write code, then you can write static analyzers,
you can read code, you can reason about it, you can write exploits, you can debug. These are what the AI
people like to call dual-use capabilities. They help defenders, they help attackers as well.
So it's a very interesting situation for, I think, everyone in the software industry. But I think
in crypto and with smart contract platforms like Sui and the other ones that you all work on,
we're on the front lines of this because we write code that manages money. It's radically open.
There are rich bug bounty programs. So a lot of this is getting a lot of attention and there's increased action.
You know, just inform people both in the crypto industry and outside, who are not on the front lines
but are going to be affected just as much, what
they need to be thinking about and doing to keep themselves safe.
So I wanted to start off by having everyone go around and do a brief intro and just tell
me about a jaw-dropping moment you had with a coding agent related to security, where
it just did something where you're like, wow, I did not think that it could do that.
I didn't expect that that could work.
Or, you know, this is just like your, this is just your moment of things changed since November.
Maybe we could start with you, Ben.
I'd say the jump from Sonnet 3.7 to Opus 4.5,
that transition period was when it really seemed like
we could actually code things with agents
and have them abstract about the code base and actually do tool calling.
Because it's hard to remember, but a couple of months ago models hated calling tools. They would never call tools and they
would never use them properly. And to finally see that work correctly was absolutely incredible.
What was the thing you tried to do where you saw the capability jump concretely?
So what we were trying to do
is we were trying to mix Slither's static analysis
with LLM-based semantic analysis
because there's a lot of limitations with static analysis.
You can't figure out like what's an actor,
like what's the semantic meaning of this code.
And so we tried to plug it into an LLM
and have the LLM infer certain things
about the code and then pass them back into Slither using tool calls. And it just wouldn't do it.
When it would call the tool, it would be malformatted. It would have issues. It wouldn't
know what to do. Sometimes it would call it when it shouldn't call it. Sometimes it would
do the exact opposite. And I'd say it was really the jump after Sonnet 3.7
where we really started to see that get resolved.
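One way to picture the failure mode described above is a validation gate between the model and the analyzer: a tool call only reaches the static analysis side if it parses and matches a declared schema. The following is a minimal sketch under stated assumptions. The tool name `annotate_actor`, its arguments, and the JSON shape are hypothetical illustrations, not Slither's actual interface or the speaker's actual setup.

```python
import json

# Hypothetical tool schema: the tool name and argument names are assumptions
# made for illustration, not part of Slither or any real plugin.
TOOL_SCHEMAS = {
    "annotate_actor": {"contract": str, "function": str, "actor": str},
}


def validate_tool_call(raw):
    """Return (call, None) for a well-formed tool call, else (None, error message)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "tool call must be a JSON object"

    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return None, f"unknown tool: {name!r}"
    schema = TOOL_SCHEMAS[name]

    args = call.get("arguments")
    if not isinstance(args, dict):
        return None, "missing 'arguments' object"

    missing = sorted(set(schema) - set(args))
    if missing:
        return None, f"missing arguments: {missing}"
    unexpected = sorted(set(args) - set(schema))
    if unexpected:
        return None, f"unexpected arguments: {unexpected}"

    # Check each argument against the declared type.
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            return None, f"argument {key!r} should be {expected_type.__name__}"
    return call, None
```

In a loop like the one described, the error string would typically be fed back to the model so it can retry, rather than passing a malformed call through to the analyzer.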
And I should have said mine too.
I'm the co-founder and CTO of Mysten Labs
and creator of the Move programming language.
But mine also had to do with static analysis and triaging.
I did a lot of work on static analysis at Facebook.
And so I said, okay, you know, Claude Code, please take my Quandary taint analyzer that's
written in OCaml for Java and port it to Move and run it on the full corpus of all the Move
programs. And it just did it. And that was like, okay, add entry points as sources and add, you
know, usage of money as a sink. And then it's like, cool, here are the results. And then I was like,
triage the results and show me what looks like a vuln. And then when I saw the results, I was like, whoa,
this is a new world that we're in
in terms of like the speed and the capabilities.
Whereas triaging the results
was completely manual before,
and the coding would have taken a long, long time.
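The source and sink setup described above (entry points as sources, usage of money as a sink) can be pictured as reachability over a data-flow graph. This is a toy sketch only: the graph, the node names, and the breadth-first search are assumptions for illustration and are not how Quandary or the ported analyzer is actually implemented.

```python
from collections import deque


def find_taint_flows(flows, sources, sinks):
    """Return one shortest path for every (source, sink) pair connected by data-flow edges."""
    results = []
    for source in sorted(sources):
        # Breadth-first search from the source, remembering each node's parent.
        parent = {source: None}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for successor in flows.get(node, ()):
                if successor not in parent:
                    parent[successor] = node
                    queue.append(successor)
        # Reconstruct a path for every sink the source can reach.
        for sink in sorted(sinks):
            if sink in parent:
                path = []
                node = sink
                while node is not None:
                    path.append(node)
                    node = parent[node]
                results.append(list(reversed(path)))
    return results


# Hypothetical data-flow edges: "value at this node flows into that node".
FLOWS = {
    "entry::deposit.amount": ["vault::add_balance.amount"],
    "entry::withdraw.amount": ["vault::check_limit.amount", "vault::transfer_out.amount"],
}
SOURCES = {"entry::deposit.amount", "entry::withdraw.amount"}  # entry points
SINKS = {"vault::transfer_out.amount"}                          # usage of money

if __name__ == "__main__":
    for path in find_taint_flows(FLOWS, SOURCES, SINKS):
        print(" -> ".join(path))
```

Each reported path is only a candidate; the triage step mentioned above is what decides whether a flow is actually exploitable.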
Building on that, there was something that we did
very recently around dimensional analysis.
I used to be really into F#,
and F# has these
units of measure in its type system. We were like, what if we took that and we tried to
apply it to Solidity or like a smart contract language and have an LLM infer what the unit
should be and then do the arithmetic to see if all the units work out. And it gave us really,
really good results. We published the plugin to our skills repo if anyone's interested.
But being able to take these traditional techniques
and blend them with the semantic analysis of LLMs,
it's absolutely incredible where it's going.
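The dimensional-analysis idea above can be sketched as unit bookkeeping: multiplication and division combine unit exponents, while addition requires both sides to carry the same unit. This is a minimal sketch, not the published plugin. The `Unit` class, the `token` and `usd` base units, and `check_add` are hypothetical names, and in the approach described an LLM would infer the unit of each variable rather than having them declared by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A unit as a sorted tuple of (base_unit, exponent) pairs, e.g. usd/token."""
    dims: tuple

    @staticmethod
    def of(**exponents):
        # Drop zero exponents so that token * (usd / token) compares equal to usd.
        return Unit(tuple(sorted((k, v) for k, v in exponents.items() if v != 0)))

    def __mul__(self, other):
        merged = dict(self.dims)
        for base, exponent in other.dims:
            merged[base] = merged.get(base, 0) + exponent
        return Unit.of(**merged)

    def __truediv__(self, other):
        return self * Unit.of(**{base: -exponent for base, exponent in other.dims})


def check_add(left, right):
    """Addition and subtraction are only meaningful when both sides share a unit."""
    if left != right:
        raise TypeError(f"unit mismatch: {left.dims} vs {right.dims}")
    return left


# Hypothetical units for a lending-style calculation.
TOKEN = Unit.of(token=1)
USD = Unit.of(usd=1)
PRICE = USD / TOKEN          # usd per token

collateral_value = TOKEN * PRICE   # works out to usd
check_add(collateral_value, USD)   # fine: usd + usd
```

Calling `check_add(TOKEN, USD)` raises `TypeError`, which is the kind of mismatch such a check is meant to surface, for example adding a raw token amount to a dollar value.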
Seth, what about for you?
Please introduce yourself and tell us
about your jaw-dropping moment.
I'm the CEO of Certora, and Certora is a security company,
and we work across all different chains,
all different aspects of security in Web3.
And I think I can take that personally
And what I mean by that is in my role,
I'm not coding day-to-day
and I'm not auditing day-to-day.
So I've had my own personal
recent jaw-dropping moments,
but, and they're kind of interesting.
So for me, recently, and this was a while ago, I wanted to extract a list of all of our contacts from Salesforce.com.
This is like a stupid, annoying task.
And I asked Claude to do it, and I just said, Claude, write me a script that extracts all my contacts from Salesforce.com.
But what's interesting is it does so many things
wrong from a security perspective, but they're subtle. So it stored the files in a place that,
first of all, I don't have access to. And it's like, where did it even get that from?
It's some arbitrary directory on disk that includes a path that has nothing to do with
anything I could imagine. I thought, it's fascinating to know that it pulled this
information from somewhere and injected it into my code.
And I don't know how it got here.
And then just the usage of the environment to store safety-critical secrets like my Salesforce
login key, which really should be kept secret in a more meaningful way.
That gave me very personally both sides of the story.
More professionally, it's like every time I talk to our researchers, there's something
amazing. Just a couple of days ago, I was talking to one of our researchers
and he was saying that a client of his who he worked with before he came to Certora,
who had a really complex protocol, and he'd found seven critical vulnerabilities in that project.
And this client came to him with a list of 200 reports from an AI tool. And painstakingly,
the client had gone through the first 160 and decided
that they weren't real, but there were 40 left that were just painfully difficult to evaluate.
So difficult that our researcher wasn't sure that he was clear on all of them. And he asked the LLM
itself to figure out which of those were valid. And it did. And it came back to him with a list
that when he went back through that sort of self-prioritized list, he found seven real issues out of that 200.
And what's most frightening about that story is that he said three of those intersected with his own findings, but four did not.
And, you know, his seven findings were all steal-funds kinds of findings.
And the four non-intersecting LLM findings
were all steal-funds kinds of findings
embedded in that original list of 200.
So, you know, beyond the interesting things
we're doing with tools and, you know,
which we absolutely do with the Prover all the time,
looking at formal verification results
that are really subtle and hard to understand.
We have our new violation analyzer
that now uses LLMs to do all sorts of interesting stuff.
There's just that very disturbing knowledge
that LLMs can find things that humans can't.
Yeah, I love both those stories.
And I feel like pretty soon we'll feel that "I found four"
is going to be a pretty good score for the human, so he should take it. Robert, what about you? Tell me about
your background and your jaw-dropping moment.
Yes, I'm Robert. I'm the founder of
OtterSec. We've been working with Sui for a long time now, I guess, probably like three
years. I feel like time sort of flies in this industry.
I think for me, there wasn't one particular moment.
I guess it was more like the trend.
Like, I don't think there was one instance where we ran the tool and we found a bunch of bugs and we're like, oh, this is amazing.
I think for me, probably it's been, I mean, we worked on the EVM Bench post with OpenAI.
And I remember, I don't know if I sent you the chart actually, but I remember there was
this one chart which was like GPT-5, I think had a 20% on the benchmark and then GPT-5.3
had like a 40% on the benchmark.
And, you know, I think when we saw that, we actually put the data down.
I mean, that trend was super, right.
And I mean, even if the trend doesn't continue entirely, right? Like you can imagine GPT-5.4 or 5.5 has like 50% or 60%.
The fact that it had doubled over the course of a few months,
I think that was to me a sign of what's to come, right?
And how we all need to be using AI more seriously
EVM Bench is super cool work,
and we're gonna be digging into that a little bit more
But first, Klaas, let me go to you
to hear about your background, your company,
and what really impressed you
with coding and security agents.
Yeah, so we're a tiny company focusing on Sui.
We're doing auditing and formal verification,
actually almost only formal verification lately, for Sui. We've been essentially
working to make the Sui Prover work well for the past almost year and a half.
And yeah, essentially our focus very much over the past few months is making formal
verification scale.
So we've been working with many of the top DeFi projects on Sui.
Now, in terms of story, well, yeah.
Well, it's actually a story from my co-founder, Andrei Steppanescu.
So he was working on modeling borrow_mut for dynamic fields for the prover.
And yeah, basically the agent struggled with it quite a bit
because it was in the loop with Claude Code.
And eventually the agent actually decided to go upstream. The Sui Prover,
for everyone, is actually based on the Sui compiler, right?
And the agent decided to go upstream in the compiler
and essentially start putting print lines there and starting to understand the structure so that it
can propagate ownership information down. And it then made the changes to propagate ownership
information and essentially solve its task, which, considering the complexity of all of this, is very impressive.
Now on the other side, I mean, I'll probably have more examples of failures probably also,
but I think the main thing, kind of a summary of the failures is that,
I think our main struggle is with making the agents reliable,
both for auditing and for verification.
So they can be intermittently brilliant, but at this point, we cannot rely on them for
They're just making us way, way faster.
Yeah, the story of it needs to do something, but it sort of doesn't have enough information in this part of the code,
but it exists elsewhere, and sort of figuring out where it'd be and pulling the plumbing through is a super, super impressive thing.
And, you know, I can't count how many times I've done that as a programmer.
And the fact that you don't have to do that anymore, that someone else can do it, especially the plumbing part.
The discovery part is really the impressive part, but the plumbing is the time-consuming part. It's pretty amazing.
So thanks guys for the intros and the stories. Now we're going to get into the questions. So for
the audience, what I did is I sent out a survey with eight questions, eight questions that are
supposed to be hard, where I thought there would, you know, sort of be different opinions. And on
six out of eight, we managed to split the room and get very different opinions. And so I'm going to
go through the questions and call on people on different sides of it. These are yes or no questions
and sort of get some interesting discussion and debate going.
So the question that we'll start with is, for some value of X,
releasing a model and skills that can reliably find X percent of critical vulnerabilities
in some substantial open source benchmark set is a violation of responsible disclosure.
This is an ethics question.
So Robert, you talked
about EVM Bench. You said yes, this would be a violation of responsible disclosure, but you
clearly don't think X equals 40 because you released the code and the skills. So what's the
x for which this is true and how do you think about this question? Yeah, I mean, I think there's
a bit of a tension here, right, in the sense that if you don't release anything ever, people won't know that security tools are increasing at such a rapid pace.
And, you know, you might actually do worse for the security ecosystem.
But I think the obvious example that comes to mind is like if X is 100 and you release it and immediately everyone gets hacked, then clearly that's also not valid.
Right. And that can't possibly be correct either.
And I think this is one of the questions that we were debating, right,
when we worked on this project too, is like, hey,
if we have this tool that can actually reliably find bugs
in a large amount of these smart contracts,
how do we best get it to developers so they can run it first
And I think one thing that we did for EVM Bench, for example, is that we worked on a front end
where we sponsored, or I guess Paradigm sponsored, the credits, and this allowed people to run their
contracts essentially for free with the framework that we used. And honestly, yeah, I agree with you,
like I think 40% is probably not the right.
I mean, 40% on our specific eval set, which means that, like, in the real world, maybe it's slightly less.
But if it was, like, 90% or, like, 85%, I feel like I would be much more inclined to say, like, hey, you should first run this on all the major projects.
See if there's any findings.
And for us, we actually did some of that.
Like, we did run this tooling against big projects
and made sure that there weren't a bunch of critical bugs laying around.
Yeah, that's the sort of thing I was wondering,
where, like, you guys in EVM Bench,
you very carefully selected the benchmarks,
so it's all vulnerabilities that have been fixed or code
where it's sort of, like, not at risk anymore.
But, you know, at some point, the X is high enough
that you're like, well, someone can also take this
and run it on any open source code.
So like, what's my ethical obligation to run that,
to report it, to sort of give people a leg up?
I think those are the sort of really interesting questions
So Ben, I wanted to go to you. You said no on this question.
You can sort of see it in the way Trail of Bits operates.
Like you guys are really proactive in open sourcing these skills,
open sourcing tools, like sort of making
an AI-powered security auditor, security researcher available.
How do you think about this question?
I think this is a really good question.
I think the way you've got to think about it from a framing perspective
is how hard is it to create a tool that is going to be able to exploit something. So a Claude skill has a very low cost to create. You could
literally open up Claude and be like, hey, I want to create a skill that does blah blah blah blah,
and it'll produce that. For something like that, it's really easy for an attacker to also get
access to that capability. They just have to stumble upon it.
So let's say in some way we built a skill that finds 80% of critical vulnerabilities.
In a case like that, if we actually want to triage it and be able to responsibly disclose this across the industry, it'd be a process that would probably take a couple of months. We'd have to run
it against absolutely everything we could. And just going through the disclosure, I think, would be like
three or four months, because not everyone has a bug bounty. And the people
who don't have a bug bounty are really hard to contact. And there's TVL at risk on
those protocols. On the flip side, if we
were to just publish it and not wait those
three months, then you have to think about,
okay, who's going to be running this? Well, you have white hats and you have black hats.
And if anyone in this room was to publish something that claimed to find 80% of the
critical issues, I'm pretty sure we would all be running it on as much stuff as we could
immediately. And so now you get to ask yourself, how many white hats are there in the industry?
And how many black hats are there? And what are their relative token budgets? And I think the
white hats would be operating on orders-of-magnitude larger token budgets than the black
hats. So by publishing this tool quickly, the white hats are the ones who are going to have
the advantage, not the black hats. Whereas if we keep it private, now there's three months where maybe a black hat is going to stumble on this and start exploiting
it. That's kind of the approach there. But let's say if it was something more expensive,
like a full model, like we're training a model. Training a model costs billions of dollars.
If we were to build a model that was able to find those bugs, it's kind of okay to wait three months
because North Korea is not going to be training
a billion-dollar model in a couple of months.
So it really depends on the circumstances
of how you've been able to create this tool.
Yeah, I think that's a really interesting answer.
I like your, like, the white hat token budget
versus the black hat token budget.
It definitely echoes the like open source,
like more good eyeballs than evil eyeballs,
you know, makes for safer code,
which I think is a good segue
to the next question: it is now safer for your smart contract code
to be open source than it is to be closed source.
So this one split the room, it was 50-50. Seth, you were on the record as saying
you think it's safer to be open source.
How do you think about this?
Yes, and I still stick with that.
I think that closed source,
particularly in a situation where you deploy
and people can reverse engineer the binary
and work from there anyway,
is more of a mirage than anything.
And putting it out open source, you know, creates
a responsibility on your side to make sure that you've done your diligence with security without
the false sense of security you get from this secret that's not really a secret. So, I mean,
I think as with many things, AI related and otherwise, it's about the mentality that goes
into it. If you know it's open to the world and anybody can run AI tools against it,
you can rightfully say to yourself,
if I didn't run the best AI tools
But if you have this illusion
that because it's closed source,
somehow it's safe and protected,
then you might think that you could get away
with not running the best tools out there
That's really not going to help you at all in the end.
So I think open sourcing in this context just raises the security bar for all those involved,
and it sets the right mindset from the beginning, and that's why it's absolutely the right way to go.
And I think to the previous question of responsible disclosure, I would say
one way that I always look at it is, you know, AI tools have made it really easy to execute social engineering attacks,
but we don't say that they should have been closed source and a warning should have been put out
that everybody should be prepared for far more deep fake social engineering attacks because they're coming.
It's really just a matter of if you're in this industry and you take it seriously and you're in an arms race,
you need to play in the arms race.
And so you have to look for the next thing that comes out
because that's part of your responsibility
That said, if I were to come up with,
kind of agreeing with Ben here,
if I were to come up with a new brilliant tool
that could find 100% of the vulnerabilities out there,
that's different because the barrier to entry
And now I need to start thinking about responsible disclosure before I put it out in the world.
Yeah, that's super interesting.
And I think, yeah, I totally agree with you.
Well, I don't want to tip it too hard, but on this one, I think I agree.
I think reversing capabilities are superhuman.
So, yeah, it's like if you're open source, it's almost kind of more of a flex nowadays.
So it's like, hey, you know, hit me, I've done everything,
you know, and especially if you have some,
if you have your skills in there,
if you have verification, like so much more so.
I think one other factor I think about is that
the safest place to be is if you're open source
and you are a dependency of one of the foundation labs,
because you know what they're doing
before they release the models.
Now this isn't very actionable
for smart contract developers,
but I think it definitely matters.
I think if you're open source and you're prominent,
a lot of them think pretty seriously
about these questions too,
and about their responsibility
of what models can and can't do.
So if you're a prominent project in closed source,
I think you're definitely less safe
than a prominent project that's open source
because the angels might be running the early model
and disclosing things to you.
There's also this leak the other day
about how there's like a secret Anthropic employee mode for patching bugs
But Kaz, you were on the other side of this question.
I want to hear your point of view.
I agree with Seth, and it doesn't matter. So models can now basically reverse engineer everything.
So you might as well just publish it in that case.
And I think it's a good thing to publish this outcome.
Now, I think the more interesting part is whether to publish for us.
And that's actually, that was an actual question we asked ourselves.
It's whether to publish specs.
So basically formal verification.
And a few months ago, we were definitely in the camp
of, hey, we're going to publish everything.
We're going to make it easy for people
to trust this by replicating on their system.
And we were thinking, oh, we're going to put this somewhere
so that they can actually see all the traces, everything
executed and all of that. So this was our position a few months ago. And that's no longer our
position. And the thing is, basically, we've seen agents do really well reasoning about specs.
So if we publish them, for each of our clients, we're essentially giving
attackers a way to build upon our work
and see, these are the edges that are not yet super well verified.
Let me try to build a bit more there and see whether there's anything I can exploit.
And that's a risk that's avoidable, right?
You cannot really stop the code from being out there.
On the other hand, the specs,
you can stop them from being out there.
And I think it is
responsible to keep them private, unfortunately.
That's a very interesting trade-off.
Like when we're saying the machines are superhuman at reversing,
so it's easy to go from binary to source code,
I think there's an implicit assumption you're making that they are not superhuman at going from source code
to the correct spec, which is certainly my experience, especially if you want a concise spec.
How long that capability gap will exist is an interesting question, but at least while it does,
I think I'd buy your argument like that's maybe something you'd want to keep closed. And similarly,
like if you have text files or supporting context about where bodies are buried or previous bug
vulnerability reports or, you know, areas that your team is nervous about, like maybe those you could keep private.
But I think eventually, like, you know, you're going to want all of those things out there.
So the next question is about verification, which we've been sort of touching on at a high level. So the yes, no statement is in 2026,
there is no defensible reason to launch a new smart contract without specifying and formally
verifying all financial and access control related properties. This is not common practice in the
industry, but we got three out of four yeses, which is super exciting. And I think I'm eager
to hear why folks think this because it's definitely a step change and implies that we're
going to be seeing more in the future. Maybe let me start with the no. Robert, you were the lone dissenter on this question.
Yeah, I mean, it feels like a super strong statement. I guess maybe I come from a slightly
different perspective where I feel like formal verification depends a lot on the specs you
write and how good the specs are.
And I feel like just saying a blanket of like, hey, you have to launch with these formal properties might sound good in theory.
But in practice, people probably aren't going to do a good job with it.
I also think that there's a lot of other considerations with this.
Like, for example, certain protocols might not need
formal verification or formal verification might not be super suited to them, right?
Or for example, maybe fuzzing is better for certain types of protocols compared to formal
verification if there's like a lot of math involved, for example. And I think this sort
of blanket statement doesn't feel right to me. I could totally see narrowing it, right?
Like certain kinds of protocols,
like maybe multi-sig protocols or vault protocols,
need formal verification of all the financial properties.
But this sort of blanket statement
just doesn't sit right with me.
Yeah, that's totally fair, I think.
And you have to make very strong statements
if you wanna get folks to disagree.
So I appreciate you taking the other side of that argument.
And I think I biased some parts of it.
There's some things like access control,
you should just specify and verify it.
Like, every protocol has it.
You know, there's no, it's not really hard, actually.
There's no reason not to do it.
There might be some other things where it's, like,
maybe should I specify and verify the way my events, you know,
But maybe, you know, there could be bugs there, maybe not.
But, you know, there's more sort of more gray areas.
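To make the access-control point above concrete, here is a toy sketch of what "specify it and check it" can look like. It is a bounded exhaustive check over a tiny hand-written model, not formal verification with a prover, and the vault, its actions, and the property are all hypothetical. The property: only the current owner can reduce the balance or change the owner.

```python
from itertools import product

ACTORS = ("owner", "alice", "mallory")


def step(state, actor, name, arg):
    """Hypothetical toy vault: state is (current_owner, balance)."""
    owner, balance = state
    if name == "deposit":
        return (owner, balance + 1)
    if name == "withdraw" and actor == owner and balance > 0:
        return (owner, balance - 1)
    if name == "set_owner" and actor == owner:
        return (arg, balance)
    return state  # unauthorized or inapplicable calls are no-ops


def check_access_control(transition, max_depth=3):
    """Explore all call sequences up to max_depth; return a counterexample or None."""
    actions = [("deposit", None), ("withdraw", None)]
    actions += [("set_owner", new_owner) for new_owner in ACTORS]
    frontier = {("owner", 0)}
    seen = set(frontier)
    for _ in range(max_depth):
        next_frontier = set()
        for state in frontier:
            for actor, (name, arg) in product(ACTORS, actions):
                new_state = transition(state, actor, name, arg)
                owner_changed = new_state[0] != state[0]
                funds_removed = new_state[1] < state[1]
                # Spec: privileged effects require the caller to be the owner.
                if (owner_changed or funds_removed) and actor != state[0]:
                    return (state, actor, name, arg)
                if new_state not in seen:
                    seen.add(new_state)
                    next_frontier.add(new_state)
        frontier = next_frontier
    return None
```

`check_access_control(step)` returns `None` for this model. A variant of `step` that drops the `actor == owner` test on `withdraw` produces a counterexample at depth two. A real prover would establish the property for all states rather than a bounded set.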
On the yes side, Seth, I wanted to ask you,
because, you know, Certora is like maybe the most successful
formal verification company in history.
Like you guys are obviously gonna have a strong opinion
on this and a lot of perspective.
And of course you have personal background,
you know, with Coverity and working in formal methods
besides verification and beyond.
So I'm really, really interested to see what you think.
Well, actually, let me start by saying,
would you have had the same answer in 25
or did your answer change in 26?
And how do you think about it in both of these?
Actually, my answer didn't change. I would have said the same in 25. And I think fundamentally, let's put aside the limitations of verification, because there are always limits
and I understand that we can, you know, that in those cases what are you going to do? You can't
formally verify, the tools don't work, the mathematics is too complex, et cetera. There's always the limit. But if we put aside the limit, I just focus on the concept.
We're in an arms race. The person on the other side is getting smarter and smarter at a rate
which we cannot possibly keep up with as humans. The only thing you can try to do is go for
the absence of bugs. And formal verification is the best known method for proving the absence of
certain classes of bugs. And yes, you can't prove everything, but you've got to try because tomorrow
the next agent update is going to come. And no matter how many hours of well-guided fuzzing you
did, something is missing and you don't know if the model is going to find it. And it just comes
down to this. We're in an industry where we are, you know, putting out contracts that have no real defense
in depth. Yes, you can code it in, but there are no multiple layers of protection like you have in
every other industry that monitors and manages this kind of money. It's like putting the Visa
transaction system with the code out there to be exploited, no firewalls, no nothing,
you know, here you go. So the level to which we have to try to prove these contracts correct
is drastically different, especially in a world where, you know, we don't have visibility
into how good the model is going to be tomorrow. So to me, it's just a matter of responsibility.
Is the world perfect? Can we all
write great specs? No. Agents are getting better and better at it, although I agree with everyone
who says they're not that good yet. But there's many things we can do to make them better, and we
should. But as a matter of responsibility, you have to try and get as close as you can to verifying
everything that you can. Yeah, I definitely buy that. And I think, you know, this industry or smart contracts
have always had the most adversarial threat model
of maybe any software out there.
Anyone can link to it and call it.
And that's sort of the point.
It manages money, all these things.
So the fact that your answer didn't change from 25 to 26,
It's just like now the threat model just gets worse
because the adversaries are good and getting better faster.
You know, the defenses are too.
So I wanted to move into talking about sort of the consequences of all of this.
So I'll start with a little bit of personal perspective on what we're seeing.
You know, we've run a bug bounty program for Sui since the very beginning.
You know, most of these stats I'm talking about are public.
You can see it's paid out quite a bit
over the years, like 2.3 million over three years. We've had a lot of great relationships
with the white hats, a lot of great reports. It's an excellent resource for us. Something
we've been seeing recently is definitely since LLMs came out, there have been more reports,
it's early on, it's mostly slop, I would say actually, you know, almost all slop, it's
just elevated volume. But the number we've been looking at recently that is interesting
is over the lifetime of the program,
we've only had three duplicate reports over three years.
And then we had a month where there were eight in one month.
Now, why does a duplicate report happen?
A duplicate report happens because you take the bug
bounty program scope and the code,
you put it into the model, and you get roughly the same thing.
Or maybe you have a fancier harness.
You know, this is our thesis,
and I think that this is probably true.
And so clearly people are taking this stuff,
and I mean these are valid duplicate reports.
So it's like, hmm, that's sort of been
one of the wake-up calls to us
that people are really doing this,
and it's working a lot better than it did before.
So, you know, the question I have related to this then is:
in 2026, crypto bug bounty programs
are the best ethical way to convert subsidized tokens,
e.g. your Claude Code and your Codex Max plans, into dollars.
The room is split 50-50 on this.
Ben, you said yes. What do you think?
So I was kind of interested in this question because it raises another question: what's the best way to use an LLM right now?
I think a lot of people have been using LLMs to find bugs on programs like Immunefi using one-shot or few-shot prompts that basically say find the bugs, or they give a list of vulnerability types and say find a vulnerability that matches one of these. And that's effective to some extent,
and I think that's a big reason why we're seeing so many duplicate bugs, because these
are the bugs that the model has been RL'd to be able to find. Then the question is, how do you
find the bugs that the model has not been RL'd for?
You mentioned that there were a couple of critical bugs that were found.
I guarantee there are probably more critical bugs, but the LLMs aren't going to find them with a single
shot. And I think a really talented security researcher is able to take the model and push it into areas of its exploration space
that other people aren't going to reach with a single shot.
And we've seen people do that really effectively.
We've been starting to push people more internally to do bug bounties,
you know, for fun and stuff.
And the level of effectiveness that people have at being able to push these models
in directions where there aren't duplicate findings is really incredible.
But you have to really push beyond that initial, like, I just want to ask it to find the bugs.
What tips do you have for getting it out of that local maximum?
Don't assume that it's a thinking system. I think that it's very easy to make the mistake of thinking
that the system is able to think and reason and do things like that. It's best as a summarization
machine. So let's say you have a giant code base like the Sui node. There's probably like a million
lines of code in there. No single human could review all of that code in any reasonable amount
of time. But using a summarization machine like an
LLM, it might be able to point you at the highest-risk, most important areas to verify first, and then
that's when you dive deep. But if you started from the very beginning and you're like, hey, find all
the high-severity bugs in the code base, you don't have that extra guidance from the engineer saying,
hey, this is what I think is really important. I know there's been a bunch of results by Anthropic and others saying that
when you do hybrid human-plus-LLM, the performance isn't as good as just an LLM.
We haven't been able to corroborate that. So far for us,
human plus LLM is vastly, vastly better than either one alone.
Good for our job security in this room. Klaus,
you're on the other side of this question.
So what do you think is the best ethical
way to convert subsidized tokens into dollars
if it's not crypto bug bounty programs?
Yeah, regarding this question, my take is that
I agree the harness is important,
and you essentially have
both white hat teams and black hat teams
trying to do the same task with different models, right?
And I think both sides
can use all of this.
So the fact that you have
these discounted prices for models is just kind of the substrate of everything that we do, right?
And I think that our job as security researchers is just
to be much, much better than the other side.
And what's the magic of getting much better there?
Yes, that's kind of an open question,
and we're all experimenting, I guess, internally.
But the crux is to be so much better
that we are hopefully one model generation ahead
in terms of finding bugs versus, let's say, an off-the-shelf Claude Code, because if we are not, then when a new model lands we will essentially
have a significant problem. So basically, that's how I see our job as a company: to
be able to be one generation ahead
and then be able to react very quickly
to changes in your environment,
in terms of both the models themselves
and the prices of the models.
I want to dig into the part about the new model releases
since there's a question about this.
I think it's very interesting.
The yes-no question I asked,
and this one actually didn't split the room,
but I think it's still a good question to discuss is,
in 2026, a team with a great 48-hour
new model release playbook and no pre-launch audit
is safer than a team with a top-tier audit and no playbook.
And so what I mean by a 48-hour new model release playbook
is sort of like, the new model drops,
what is your team doing right now?
That's if you're not in the model pre-release program,
which obviously it's preferable to be in
and do this before. Certainly every team should try to do that, but it's just
not going to be possible for everybody. I think for layer ones, you definitely can and should be
if you're not already. But anyway, so what is the playbook that you start running then?
And everyone said no; you know, you want the audit
and no playbook instead of a good playbook but no audit,
which makes sense as auditors.
But I think my framing would be,
Actually, I would say you need both.
Yeah, of course, of course.
I have to try to split the room,
so I make it a question without nuance.
But the way I would think about it is,
are you, the auditor with today's model, better than the totality of the white hats with tomorrow's model?
And that's another way to think about the question.
If the answer seems like the totality of white hats with tomorrow's model are better,
maybe you actually care more about the playbook than the audit.
But of course the right answer is both.
So anyway, I won't call on anyone for that one since everyone has...
I think in the blockchain industry, and I'm sure everyone here will agree,
there's this perception that if you get an audit,
that means that there are no bugs in your code base.
And we constantly have to spend time educating clients, saying,
listen, if we found three or four critical bugs during your audit,
there are probably a lot more and it's not safe to deploy.
And there's also qualitative guidance.
We might tell them, hey, your test coverage isn't great.
Your complexity management isn't great.
You need a private key management strategy. And
if the client comes through and they just fix their bugs
and don't follow any of that guidance, I think they're totally toast.
It doesn't matter who audits them.
If they're not taking that guidance to heart, it's not going to matter.
The job of an audit isn't to be better than the white hats with that next-generation model.
It's to make sure that the development process is going to produce a code base
that is secure against the white hats plus the next generation model.
And if your audit can't produce that, then that's when you run into issues.
I would second that emphatically.
We are not in the business of finding all bugs.
And if we think we are, we are wrong.
I mean, an audit
in the traditional sense in the financial world
is an audit of process and numbers,
but it's not just the numbers.
And it's like, we should be heavily invested
in ensuring that our teams think end to end about security.
And they view it as a whole company initiative
that requires security as a first principle
and compromise as unacceptable
when it comes to the safety of their systems.
And that has to be resilient to the next model and beyond.
So it's like, it can't be just about the bugs.
And I think that's the point you're making.
And let me connect this to a different question, which is one that people also all agreed on:
In 2026, the role of a useful audit shifts from inspecting the code towards building
the invariants and scaffolding that will be used in red teaming in the future.
Now, to put it sort of crassly, maybe even in the past people would have thought that
the deliverable of an audit is a PDF with some bugs or something like that.
I'm sure none of you would say that, but I think teams maybe think of it that way.
But everyone here said yes, the role of an auditor is shifting from that old model, or some better version of that old model, to this new thing that's about scaffolding and invariants.
So Robert, tell me about the shift. How's the work that OtterSec does in '26 different than, say, what you were doing in 2024, if you're
thinking about invariants and scaffolding as the main output instead of a PDF?
Yeah, I guess the
way I took this question maybe was a bit more broadly. So actually I'm a little bit tired
today. One reason for that is because there was a really big hack two days ago. Drift just got hacked
for, like, a quarter billion dollars. So that took my day, and I stayed up all night helping them.
And I think the really interesting thing about the hack is, as we all know, it was not a
smart contract hack, right?
Even though Drift had a relatively complicated code base, the part that hit them in the end
was a multi-sig compromise.
And I think that's a trend that we're seeing as ecosystems
mature, right? I mean, I think this particular hack was inspired a lot by Bybit,
right? But as the code base gets better, hackers look for alternative avenues, right? And at least
the way I took this question is more like, hey, there's a lot of different ways that hacks could
happen. And if we think about the totality of the risk,
AI might be really good at finding smart contract bugs, but if that is the case, then maybe the
weakest link moves somewhere else, right? And hackers want to go for the weakest link;
they don't need to break everything. So whether that's writing invariants to secure the code,
or whether that's working on OPSEC practices with the team, I think there's a lot more that humans, you know,
hopefully can still help with to make teams more secure.
Colin, since you guys do mostly verification,
I hear the wisdom of what Robert's saying.
Do you work with teams on verifying OPSEC properties
outside of the smart contracts
as well as, you know, the core contract invariants?
Like how do you think about that where you're securing one part of the risk, but you know,
it might be moving around?
Yeah, I mean, for now we are focusing strictly on the smart contract itself.
But I think there's a lot you can do in the smart contract itself, including putting in circuit breakers in
case of multi-sig failure or human multi-sig failure. There is a lot that
can be done in code. And, okay, maybe we're very small and naive here,
but we cannot really promise our customers perfection in terms of security, at least from a legal perspective.
But I do think that for smart contracts on Sui specifically, with Sui being easier to verify than
others, we are approaching, and maybe we're even a few months away from, a world where we can
make at least the code of the smart contract completely resilient, and provably resilient,
against catastrophic failure. So yes, of course you can always have nuances, like, oh, does the reward
model allow a bit too much to be drained
out of a DeFi contract or whatever, right?
So there's nuance, but at least I think we are probably a few months away from having
proofs of lack of catastrophic failure.
And yeah, I think we should do it.
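As a rough illustration of the circuit breakers mentioned in this answer, here is a minimal Python sketch of one possible design: a cap on total outflow per rolling time window that applies even to correctly signed calls. The class name, fields, and limits are all hypothetical, and a real contract would implement this on-chain in its own language; this only shows the shape of the check.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class OutflowCircuitBreaker:
    """Hypothetical sketch: limit total outflow within a rolling window.

    Once the cap would be exceeded, the breaker trips and refuses all
    further withdrawals until a human resets it out of band.
    """
    max_outflow: int       # cap per window, in token base units (assumed)
    window_seconds: int    # length of the rolling window (assumed)
    tripped: bool = False
    _events: deque = field(default_factory=deque)  # (timestamp, amount) pairs

    def authorize(self, amount: int, now: int) -> bool:
        if self.tripped:
            return False
        # Forget withdrawals that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()
        spent = sum(a for _, a in self._events)
        if spent + amount > self.max_outflow:
            # Trip instead of merely rejecting: a burst this large is
            # treated as a sign that the signing keys may be compromised.
            self.tripped = True
            return False
        self._events.append((now, amount))
        return True
```

The design choice worth noting is that the breaker does not trust the caller's authority at all, so a compromised multi-sig can drain at most one window's worth of funds before everything halts.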
Yeah, it can't come soon enough.
I want to close by going around the horn on a question for everyone.
This is something where everyone said no, but I think the answer, the specifics of the
answer will be interesting.
And so the question is: in one year, coding agents will advance to a point where find
all the critical vulnerabilities in this code, you know, that sort of trivial-ish prompt
with a sufficient token budget, will work, and it will be indistinguishable from a more sophisticated
harness. It's definitely not true today, but even now, you know, if you watch,
say, this viral talk by Nicholas Carlini, a security researcher at Anthropic, where he talks about what
they do, it's sort of like a file-by-file harness where you say, file one,
find the bugs that are here. File two, find the bugs that are here. So my argument would be,
for any bug, there's one or more files that contain the code for that bug. As you get smarter, you'll eventually be able to work backward from a
suspicious location to the other things that are relevant and find all the bugs. It doesn't work
today, but what stops it from working tomorrow? Or does everyone think that it will work and my
one year was just too aggressive of a prediction? Maybe I'll start with you, Seth.
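The file-by-file harness described in the question can be sketched in a few lines of Python. This is only an illustration of the loop as it is described here, not the harness from the talk: `ask_model` is a hypothetical callable standing in for whatever model API is used, and `stub_model` is a fake that exists only so the sketch runs without one.

```python
from typing import Callable, Dict, Mapping


def scan_file_by_file(
    files: Mapping[str, str],
    ask_model: Callable[[str], str],
) -> Dict[str, str]:
    """Send one prompt per file and collect the raw answers by file name.

    `files` maps a file name to its source text. `ask_model` is an
    assumed interface: it takes a prompt string and returns a string.
    """
    findings: Dict[str, str] = {}
    for name, source in files.items():
        prompt = f"File: {name}\nFind the bugs that are here.\n\n{source}"
        findings[name] = ask_model(prompt)
    return findings


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; flags one keyword and nothing else."""
    if "unchecked" in prompt:
        return "possible unchecked arithmetic"
    return "no findings"
```

Each prompt sees exactly one file, which is why a bug spread across several files needs the extra step of working backward from a suspicious location to the other relevant code.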
Maybe it is that you're one year too aggressive.
I mean, I think as models get smarter and smarter,
of course, finding all the bugs
is something you could ask them to do
and maybe they'll be able to do it.
But I also have just a general belief
that we think we know more than we really know.
And that's always been the case in security
and it's been proven time and time again.
And it's like every attempt to ever map out
the comprehensive set of all possible hacks
against any system has always failed
when someone came up with the next system and the next way.
And so sort of embedded in that is an assumption
that it's a matter of the LLMs getting good
at figuring out how to get to the bottom of everything
that we know to be there.
But I think there are vulnerabilities we don't know are really there.
And so we're just shifting the surface.
Yeah, all the smart contract bugs might be found given the scope of knowledge
that we have now about how smart contracts can fail.
But people will come up with a new way to exploit as long as it's profitable to do so.
And it could be a type of bug that we never considered before.
And do I believe that in a year an LLM can properly explore the full space of everything we've never considered before?
I think that's where it breaks down for me.
Yeah, I think that's pretty interesting.
You know, the problem of find all bugs, given that you know the sort of bug types that you're looking for,
like find all buffer overflows, given that buffer overflows exist and you sort of know how they work,
is a different kind of thing than saying,
invent the buffer overflow, which had to be invented.
You know, there's a security paper that created that,
and so there may be many other things like this
that of course are more complicated,
that aren't just software,
that are interacting processes,
oracles, code, and all of that.
Ben, what's your perspective?
Are we going to get to find all the bugs with
a trivial prompt?
Maybe for like an arbitrarily small program, like something that's so simple
that you could formally verify the whole thing. So what's the point anyways? I think once you get
to like a non-trivial program, the problem's really coverage. These vulnerability
scanners, the coding agents and all this stuff, they're
really good at finding vulnerabilities.
They can find critical vulnerabilities.
I'm sure this next one that's going to come out is going to be way better at finding critical
vulnerabilities, but it always comes back to coverage.
What do you actually look at?
What vulnerabilities did you look at?
And do you know what your unknowns are?
And are there any unknown unknowns remaining?
And like, you can't even get that out of a human.
I don't think there's any reason to believe
that we'll get it out of an LLM.
Koss, I want to hear your perspective
and then we'll close with Robert.
Yeah, well, I think that basically
there are two aspects here. One is agents looking for bugs,
right? And in that case, it's more of a stochastic process. And I think it's very hard,
especially for real sized systems, for the language models to actually find everything,
even in the future generations. I do think that we're going to push the boundary of formal
verification from small projects to
actually reasonably sized projects and even things maybe five years from now on the size of the Linux
kernel. So basically I think that this boundary will keep being pushed because of agents. And
then for those levels we have essentially, in a way, agents proving perfect security or finding all the bugs.
But yeah, it's not in their bug-finding mode, but more in their formal verification mode.
I think thinking of, like, specify and formally verify all the important properties of this code is an interesting tool to think about.
Like, is that reached first?
You know, it gives you sort of more exhaustiveness.
And there are definitely exciting advances and sort of jaw-dropping moments there.
Like, Leo de Moura started writing these blog posts
that are unbelievable about Lean
and his vision for verifying all the things.
And some of these results, like, oh, zlib,
like verify that zip is the inverse of unzip.
Like that's really, really cool stuff.
And yeah, I'm excited to see it scale up
to Linux kernel size efforts or to other large programs.
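To make the zip and unzip property concrete, here is a small Python sketch that states it as a round trip, decompress(compress(x)) == x, and samples it on random inputs. It uses Python's standard `zlib` bindings as a stand-in, and the helper names and trial counts are made up for illustration. Sampling like this is testing, not verification: the Lean results mentioned above prove the property for every input, which no finite number of trials can do.

```python
import random
import zlib


def roundtrip_holds(data: bytes) -> bool:
    """The property itself: decompressing the compressed bytes gives back the input."""
    return zlib.decompress(zlib.compress(data)) == data


def check_random_inputs(trials: int = 200, max_len: int = 512, seed: int = 0) -> bool:
    """Sample the property on random byte strings (evidence, not a proof)."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, max_len)
        data = bytes(rng.getrandbits(8) for _ in range(length))
        if not roundtrip_holds(data):
            return False
    return True
```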
But I'm not convinced, Sam, just to break the order here,
that this solves spec completeness, which is always a huge, huge problem. It's like, are you sure that even if you tell the LLM to create the perfect spec,
it really is the perfect spec?
The way you know is that somebody tomorrow finds a way to break your code.
And so it shifts the race, but the race is still on.
Spec completeness is really difficult to prove.
And I don't think that anybody has figured out how to prove that a spec is actually complete.
Even for the trivial examples, when I'm talking to people about formal verification,
I'll always ask, give me the spec for sort, for example.
They'll get the part about, oh, the elements are in the right order, correct. But they'll always miss the, oh,
and the input is a permutation of the output part. And then it's like, oh yeah,
this specification thing is hard and sort of counterintuitive. And of course, it only
gets trickier when you go for more complex programs and properties. So yeah,
given the correct specs, we may be able to verify all the things with agents. The correct specs,
well, maybe humans will still have a role for some things.
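The sort example can be written down directly. The Python sketch below, with illustrative helper names, splits the spec into its two halves and shows two deliberately wrong "sorts" that satisfy the ordering half alone, which is exactly the gap the permutation clause closes.

```python
from collections import Counter
from typing import List, Sequence


def is_ordered(xs: Sequence[int]) -> bool:
    """First half of the spec: each element is <= the one after it."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def is_permutation(xs: Sequence[int], ys: Sequence[int]) -> bool:
    """Second half: same elements with the same multiplicities."""
    return Counter(xs) == Counter(ys)


def satisfies_sort_spec(inp: Sequence[int], out: Sequence[int]) -> bool:
    return is_ordered(out) and is_permutation(inp, out)


def bogus_sort(xs: Sequence[int]) -> List[int]:
    """Wrong on purpose: an empty list is trivially ordered."""
    return []


def dedup_sort(xs: Sequence[int]) -> List[int]:
    """Wrong on purpose: ordered output, but duplicates are dropped."""
    return sorted(set(xs))
```

With only `is_ordered` as the spec, both `bogus_sort` and `dedup_sort` would count as correct; adding `is_permutation` rejects them while still accepting a real sort.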
Robert, what do you think?
Are we going to be able to find all the bugs
just by saying something pretty trivial?
Yeah, maybe I have a slightly different perspective here.
I mean, I think it's fair to say that LLMs
are unlikely to find all the bugs, right? Because they might not have the right context, or they might not understand the code base.
But I also feel like the same could be said about humans.
So at least the way I took your question, I took it like, will it be the case that LLMs are roughly equivalent, or maybe even arguably superior, to humans?
And I mean, with the rate that AI is improving,
like maybe, right? And I think it's definitely possible that it could find the bugs. And
I think where I disagree with this a bit is what a bug actually means. And I think, you
know, this is an example on my mind because it's recent, right? But the Drift thing
was an example that was not really a bug, right? They sort of accepted that this admin multi-sig could control markets,
and, you know, with normal operations, assuming the admin multi-sig wasn't compromised,
that was totally safe, right? But you could also argue that the fact that this multi-sig
existed was a bug, and I think that's something where oftentimes developers don't even know
themselves, right, what is correct or what counts as a bug.
And I think my perspective is, for all the common bug classes, or for anything that's
a trivial loss of funds or anything that is clearly wrong, I think AI will, within
a year or two years or however long it is, be really good at finding those.
I think the points where humans still have a chance,
and where we should put in a bit more effort, are: is it actually a bug?
Or what are your operational security practices?
Like, if you have this super-powerful multi-sig, is your OPSEC secure enough, or are your people secure enough, to actually enable that?
I guess I'll say sorry about my phone.
Robert's hotel Wi-Fi has turned into an agent.
But I think that's a good place to close.
And I like the, you know, the Drift thing is an interesting one, I mean, because it's
top of mind, but it sort of reminds me of something Seth was saying earlier.
It's like sometimes something's a bug,
and sometimes something's just maybe an unacceptable
lack of defense in depth.
And, you know, maybe that's sort of where we get to:
we get better at finding the bugs
and talk more about, you know,
given something that could go wrong,
what are the layers of defense that we have?
And that also seems like an infinite regress
that, you know, we'll spend a lot of time climbing.
Guys, thank you so much for this substantial conversation.
I really appreciate you engaging with the questions
and sharing your perspective and expertise.
It was a lot of fun for me.
I hope you enjoyed it too.