Building an Agent Sandbox from Scratch
Part 1 of the sandbox series: how I built a disposable runtime for AI agents with Docker-in-Docker, headless execution, and durable outputs.
Most demos of autonomous agents feel like magic shows — the agent is asked to build a new feature or an entire platform, and it just does it. That is easy to pull off for a showcase; the hard part is making it work in a real environment: where does the agent run, how does it access a database and a browser, how do you keep it away from the machine that launched it, and what happens if the environment dies halfway through a task?
Recently I've been working on building that missing layer: a disposable sandbox where agents can run real tasks. One docker run starts an isolated workspace with its own Docker daemon, its own application infrastructure, and its own browser tooling. The agent pulls down any necessary resources, executes the task (e.g. cloning a repository and modifying its code), runs validation/testing, captures screenshots/videos, and writes a durable result. The container disappears when the task finishes, but the outputs — whether commits, logs, media, or metadata — live on persistently.
This write-up documents what I learned while building the sandbox. Although my initial MVP was to run coding agents against full-stack web apps, the sandbox itself is designed to be agent-agnostic: it can support any autonomous task that benefits from an isolated environment. Coding is just the first use case (not least because the sandbox could then theoretically be used to improve itself in parallel), but my overall goal is something much bigger.
Why build from scratch?#
The obvious question is: why not use E2B, Daytona, OpenSandbox, or another existing platform? I asked myself the same question, and two answers ultimately drove me to build my own:
First, my target environment is not just "a terminal with a repository checked out." Many tasks — generating code, spinning up microservices, scraping sites, or triggering workflows — require a full application runtime. For my current coding use cases that means Supabase, Convex, Redis, Postgres, and whatever project-specific containers the task demands, all running inside the sandbox, not on the host.
Second, this sandbox is the foundational building block for a larger project I am working on. The details of that platform are intentionally vague here, but the implication is clear: I need the sandbox to integrate deeply with the overarching system and evolve along with it. Using an off-the-shelf product would have meant coupling to someone else's roadmap and constraints. Building it myself forced me to understand the tradeoffs and left me free to extend the design later.
Those constraints produced a short list of non‑negotiables:
- One command to start the sandbox locally or on a cloud VM.
- Full infrastructure support inside the sandbox, rather than relying on the host.
- A non‑destructive coding workflow where the agent works in a branch and writes a result that a human can inspect.
- A runtime that is agent‑agnostic so I can swap the inner agent/workflow without changing the container environment.
- A self‑contained environment that fetches its own inputs and does not depend on host bind mounts.
The design goal: disposable environments#
The central design principle behind the sandbox is simple:
The environment should be disposable.
Every task starts with a fresh container and ends with that container being destroyed. The sandbox should not depend on host state, local mounts or long‑lived services. Durable state should live somewhere else, ideally somewhere designed for it. For the initial buildout I am using GitHub, but later on this can be moved to Cloudflare Durable Objects or another scalable solution.
This constraint simplifies a surprising number of problems:
- If a run fails, start a new one.
- If infrastructure becomes inconsistent, destroy it.
- If an agent behaves badly, kill the container.
Durability still matters, but it moves up a layer: commits, logs and screenshots are persisted to GitHub or another storage system. The container itself is just a disposable vessel.
With that design goal established, the rest of the architecture mostly falls into place.
What the sandbox actually does#
From the outside, the interface is intentionally boring. A CLI takes a description of the task, packages it into a work order, and starts an isolated container:
sandbox-agent create \
--project my-saas-app \
--task "Add pagination to the users API endpoint"

The --project flag points to a YAML file that describes everything the sandbox needs to know about the target environment:
name: my-saas-app
repo: https://github.com/myorg/my-saas-app
branch: main
supabase: true
services:
redis:
image: redis:7-alpine
env:
REDIS_URL: 'redis://redis:6379'
NODE_ENV: 'test'
lifecycle:
setup:
- npm ci
- supabase db push --db-url $DATABASE_URL
validate:
test: npm test
lint: npm run lint
typecheck: npx tsc --noEmit
dev:
command: npm run dev
port: 3000
harness: claude-code
max_budget_usd: 5.0
timeout_minutes: 30
pr:
labels: [agent-generated]
    draft: true

This config is the contract between "what you want done" and "what the container does." Which repo to clone, which services to start, what setup commands to run, how to validate the output, whether there is a dev server for browser verification — it all lives here. A different YAML file means a different stack, with no changes necessary to the container image itself.
The host CLI reads this config, combines it with the task description into a work order JSON, and launches a container with docker run --privileged. Everything interesting happens after that boundary.
Inside the container, the entrypoint script runs a multi-step pipeline: start the inner Docker daemon, create a network, fetch the necessary resources, spin up infrastructure, generate .env files, run setup commands, and finally hand control over to the task runner. The runner itself is pretty straightforward: it uses standard Python modules and shell commands to execute the agent, validate the output, capture media, and write a result JSON. Because the sandbox is agent-agnostic, the runner can execute a coding agent today and a different type of autonomous task tomorrow.
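The entrypoint pipeline can be sketched as a stubbed shell script. Everything here is illustrative: the function names, ordering, and echo bodies are a sketch of the steps described above, not the real entrypoint.

```shell
# Sketch of the entrypoint pipeline; bodies are stubbed with echoes.
set -eu

start_inner_docker() { echo "start inner dockerd + containerd"; }
create_network()     { echo "create isolated docker network"; }
fetch_resources()    { echo "clone repo, read work order"; }
start_infra()        { echo "start supabase/redis/project services"; }
write_env_files()    { echo "generate .env files from service config"; }
run_setup()          { echo "run lifecycle.setup commands"; }

start_inner_docker
create_network
fetch_resources
start_infra
write_env_files
run_setup

# Final step in the real entrypoint: hand off to the task runner, e.g.
#   su sandbox -c 'python3 -m runner.main'
echo "hand off to task runner"
```

Each step only runs if the previous one succeeded, which keeps a broken infrastructure stage from ever reaching the agent.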
When the agent finishes, the runner validates the work, captures screenshots (if necessary), pushes its deliverables remotely (e.g. code to a GitHub branch), writes results.json, and then the container exits. The important property here is that the container is disposable: it is safe to kill and cheap to recreate. All durable state lives in GitHub (for this initial version).
The first real decision: Docker socket or Docker‑in‑Docker#
Full‑stack tasks force you to answer the Docker question immediately. Supabase alone starts roughly ten containers internally, so the sandbox has to be able to launch its own Docker workloads.
There are two real options:
- Mount the host Docker socket with -v /var/run/docker.sock:/var/run/docker.sock.
- Run a second Docker daemon inside the container with Docker-in-Docker (DinD).
Socket passthrough is convenient, but fundamentally wrong for this use case. If the sandbox can talk to the host daemon, it is not really a sandbox. It can inspect, stop or delete whatever else is running on the machine. For agent tasks beyond coding — anything that could issue arbitrary shell commands — that is a non‑starter.
DinD preserves isolation but introduces complexity: you have to manage two daemons, choose storage drivers, trap signals and clean up residual state.
I chose DinD because it was the only way to guarantee that each run had its own container namespace and its own internal network. The rest of this article goes into the tradeoffs that flowed from that decision.
Background daemons can break shell scripts in boring ways#
My first surprise was that containerd and dockerd could not just be launched in the background and forgotten because they inherited the entrypoint's stdout file descriptor. Downstream pipes stayed open because the daemons still held references, making the script look like it was hanging for no obvious reason.
The fix was simple: redirect daemon output to log files under /var/log/ instead of letting background processes inherit the caller's stdout. This is not specific to Docker; it is a reminder that background processes can block pipelines if you are not careful about which file descriptors they inherit.
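The pattern can be sketched as a small helper that starts a daemon with its output sent to a per-daemon log file instead of the inherited stdout. The start_daemon name and the log directory are illustrative, and sleep stands in for containerd/dockerd so the sketch runs anywhere:

```shell
# Launch a background daemon without letting it inherit our stdout/stderr,
# so downstream pipes can close when the entrypoint finishes.
LOG_DIR="${LOG_DIR:-/tmp/sandbox-logs}"
mkdir -p "$LOG_DIR"

start_daemon() {
  local name="$1"; shift
  # "$@" is the daemon command line; all output goes to a log file
  "$@" > "$LOG_DIR/$name.log" 2>&1 &
  echo "$!" > "$LOG_DIR/$name.pid"
}

# In the real entrypoint these are containerd and dockerd;
# 'sleep 1' stands in here.
start_daemon demo sleep 1
```

Recording the PID alongside the log also gives the shutdown trap something concrete to signal later.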
Treat Docker state as partially disposable#
I wanted to cache image layers across runs so a second task would not need to pull everything again. My first attempt was naive: mount a volume over /var/lib/docker and persist the whole thing.
That worked until a test where the outer container was killed instead of being stopped. Once that happened, the inner daemon could leave behind stale BoltDB locks and dead network endpoint references. The next startup then failed with errors like:
error while opening volume store metadata database: timeout
libnet controller initialization: timeout

The fix was to stop treating Docker state as all-or-nothing:
- Cache what is actually valuable, mainly image layers.
- Delete transient network state on startup.
- Clean known stale metadata files before the daemon starts.
- Trap SIGTERM and shut down the inner daemon gracefully.
- Prefer docker stop -t 10 over docker kill on the host whenever possible.
With these changes implemented, the setup got much more reliable.
Storage drivers are a portability problem#
I started with Alpine’s docker:27-dind image, which defaults to overlay2. That works well on Linux, but it fails in DinD on macOS Docker Desktop because nested overlay mounts are not supported in that environment:
failed to mount overlay: invalid argument

My solution was to detect what the host environment could support and fall back in order: overlay2, then fuse-overlayfs, then vfs.
# Simplified version of the detection logic
if mount -t overlay overlay -o lowerdir=/tmp/test-lower,upperdir=/tmp/test-upper,workdir=/tmp/test-work /tmp/test-mount 2>/dev/null; then
STORAGE_DRIVER="overlay2"
elif command -v fuse-overlayfs &>/dev/null; then
STORAGE_DRIVER="fuse-overlayfs"
else
STORAGE_DRIVER="vfs"
fi

The base image pivot: Alpine was the wrong optimization#
I started with Alpine because it is the default choice for a lot of Docker-heavy tooling. The image was smaller, builds were fast, and it felt like the disciplined option. That all fell apart the moment I started implementing browser automation.
The browser tooling I use (agent-browser from Vercel) depends on Playwright, and Playwright expects glibc. A bunch of native npm dependencies do as well. Alpine uses musl, which turned that "small image" decision into a compatibility tax on everything else.
Switching to Ubuntu 24.04 solved the libc problem, but it uncovered another one: Ubuntu’s chromium-browser package is a snap stub, which is useless inside Docker, and Google’s Chrome .deb is only available for amd64, which breaks when developing on Apple Silicon.
The setup that finally worked for me was Ubuntu 24.04 plus Playwright's bundled Chromium. The image got bigger (from about 1.1 GB to roughly 2.1 GB), but the toolchain actually worked on both arm64 and amd64, which was a win for local and cloud compatibility.
Running the agent headless#
Once the infrastructure worked, the next challenge was less obvious: getting the inner task runner to run unattended inside the container without breaking shutdown, authentication, or basic Unix expectations.
Root privileges are required until the last moment#
The container starts as root because it has to: starting a Docker daemon, creating networks, and setting system‑level git config all need elevated privileges.
The agent, on the other hand, refuses to run headless as root for sensible security reasons.
The compromise was to do infrastructure setup as root, then create a non‑root sandbox user and hand off to that user only for the final execution step. Most containers drop privileges immediately. This one has to keep them until the platform is ready, then cross the boundary right before the agent starts.
PID 1 still matters#
I needed a SIGTERM trap in the entrypoint so the inner Docker daemon could shut down cleanly and flush cached state. The usual exec su sandbox -c '...' pattern broke that immediately because exec replaces PID 1 and throws away the trap handler.
The fix was to keep the entrypoint as PID 1, run the agent in the background, and then wait on it:
trap cleanup SIGTERM SIGINT
su sandbox -c 'python3 -m runner.main' &
RUNNER_PID=$!
wait $RUNNER_PID

It is a small shell detail, but getting it wrong means that your cleanup logic will never run.
Git config scope matters in multi‑user containers#
One of the more annoying bugs had nothing to do with agents. I configured Git with git config --global, which writes to root's home directory. The runner later executes as sandbox, whose home is /home/sandbox, so none of that configuration was visible.
The symptom looked like an authentication problem, which took me more time than I'd like to admit to debug:
could not read Username for 'https://github.com': No such device or address

The fix was using git config --system, which writes to /etc/gitconfig instead and is shared across users in the container. This is the kind of bug that only appears when you stop treating the container like a single-user shell session and start treating it like a small operating system.
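A minimal illustration of the system-scope fix. The values are placeholders, and the sketch uses git's GIT_CONFIG_SYSTEM override (available since git 2.32) so it does not need write access to /etc/gitconfig; inside the container, the plain --system commands write to /etc/gitconfig directly:

```shell
# --system config lives in one file shared by every user in the container.
# GIT_CONFIG_SYSTEM redirects that file for this demo (values illustrative).
export GIT_CONFIG_SYSTEM="$(mktemp)"

git config --system user.name  "sandbox-agent"
git config --system user.email "agent@example.invalid"

# Any user in the container now sees the same settings:
git config --system --get user.name
```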
Durability matters more than orchestration#
The most important design shift was realizing that interruption is not an edge case, but rather the normal failure mode.
The runner pipeline is sequential: the agent executes the task, the runner validates it, the runner captures/records media, the runner writes the result, and optionally pushes up deliverables (e.g. a GitHub PR or an artifact). If the container dies after local changes exist, but before those commits reach the external store (GitHub in this case), the work is gone.
This is not just a hypothetical problem: containers can get stopped by things like timeouts, budget caps, spot instance reclamation, or just plain operator error.
To address this, I added three layers of durability:
- Push on shutdown. If the container receives SIGTERM, the runner pushes committed work before exiting and writes a partial result.json with status: "interrupted".
- Periodic push. During longer runs, the runner pushes every few minutes (or after every milestone) so the worst-case loss is a small slice of recent work instead of the entire session.
- Continue mode. If a run dies anyway, a new container can resume from the same branch. The new agent gets a prompt that tells it to inspect the existing branch, read the git history, assess what is already done and continue from there.
sandbox-agent continue \
--project my-saas-app \
--task "Add pagination to the users API endpoint" \
--task-id T-A1B2C3D4 \
--target-branch agent/T-A1B2C3D4-add-pagination

Continue mode handles interrupted runs, but the sandbox also supports the rest of the PR lifecycle. Revise mode takes review feedback from an existing PR (from something like Bugbot, Greptile, a human, etc.) and spins up a new container to address it. Rebase mode handles branches that have fallen behind — it fetches the latest base, resolves conflicts and force-pushes. The pattern is always the same: fresh container, targeted prompt, disposable environment, durable output.
I considered building explicit checkpoints for agent memory and context restoration. I dropped that idea quickly. Internal agent state is not easy to serialize, and even if you could serialize it, you would now own a much harder recovery system. Git already stores the state I actually care about: what changed, in what order, on what branch. Using Git as the durability layer gets most of the value for a fraction of the complexity.
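The push-on-shutdown and periodic-push layers described earlier can be sketched in shell. The helper names, branch name, and interval are illustrative, and error handling is reduced to logging:

```shell
# Durability layers sketched as shell helpers (names illustrative).
push_work() {
  local branch="$1"
  # Push whatever is committed so far; a failed push is logged, not fatal.
  git push origin "$branch" 2>/dev/null || echo "push failed, will retry" >&2
}

periodic_push() {
  local branch="$1" interval="${2:-300}"
  while sleep "$interval"; do
    push_work "$branch"
  done
}

# In the runner: start periodic pushes in the background,
#   periodic_push "agent/T-A1B2C3D4-add-pagination" 300 &
# and on SIGTERM: push once more, then write result.json with
#   status: "interrupted".
```

Because Git is the durability layer, both helpers reduce to ordinary pushes; there is no custom checkpoint format to maintain.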
Where it stands now#
There's a lot more that needs to be added to the sandbox, but the core loop is completely functional:
- The first complete end‑to‑end run took about 85 seconds: start the container, fetch resources, execute the task, capture media, push the branch and write the result.
- Small tasks like adding a utility function or fixing a type error finish in roughly 7–12 seconds.
- The agent can open a browser mid‑session to verify UI changes, not just after the fact.
- Interrupted runs resume on the same branch without redoing completed work. Review feedback and stale branches each have dedicated modes that follow the same container‑per‑task pattern.
There is still plenty missing. I do not have cost or token tracking yet, and orchestration across multiple concurrent sandboxes is still ahead of me.
But the part I cared about most is now real: task in, result out, with isolation by default and a recovery story when the container dies.
What I would build next#
The immediate next step is observability: where time goes, where money goes, and which parts of the loop fail most often. Right now I know whether a run succeeded, but not much about why a run was slow or expensive.
After that comes orchestration — running multiple sandboxes concurrently, dispatching work from a queue, and automatically retrying interrupted runs. The container is already stateless enough for this. The missing piece is the coordinator that manages the fleet.
The bigger takeaway is that the agent is the easy part. The hard part is giving it somewhere to work — an environment that is isolated enough to be safe, durable enough to survive failure, and transparent enough that a human can review what happened. That is the bar I care about. The sandbox is my attempt to clear it.