borodark

Zed (ZFS + Elixir Deploy) - two secrets-design forks I’d like opinions on

Hey folks,

I’m building Zed (not the editor), a declarative BEAM deploy tool for FreeBSD and illumos that uses ZFS user properties as the state store (com.zed:version=1.4.2) and zfs rollback as the rollback mechanism.

No K8s, no etcd, no external state. It’s Apache 2.0, early days, ~2000 lines of Elixir so far. GitHub: borodark/zed (Declarative BEAM deployment on FreeBSD/illumos. ZFS properties as state store. No etcd, no YAML.)

Phases 1–4 are done (DSL, convergence engine, FreeBSD jails, multi-host via Erlang distribution). I’m now on Phase 5 and adding secrets support. The overall pipeline is pretty clear: the DSL declares sources, a resolver runs on the target at converge time, and resolved values land in a per-app 0600 env file that the release reads via System.get_env/1 in runtime.exs. Full design doc here: docs/SECRETS_DESIGN.md.

The reason I’m posting here is that I’ve got two forks in the road where I don’t want to pick solo, because the choices compound. Both affect what a real deploy looks like, and I’d rather hear from people who’ve done this in anger than commit to an answer and regret it.

The DSL shape (for context)

The motivating example is a shape I suspect many of you have run into: a BEAM app with several external integrations under one deploy — multiple broker-account credential pairs, a license key, a Phoenix secret_key_base, a distribution cookie.

app :broker_bot do
  dataset "apps/broker_bot"
  version "1.5.0"
  cookie {:env, "BEAM_COOKIE"}

  secrets do
    broker_a_key    {:env, "BROKER_A_API_KEY"}
    broker_a_secret {:env, "BROKER_A_API_SECRET"}
    broker_b_token  {:file, "/var/run/secrets/broker_b.token"}
    license_key     {:env, "LICENSE_KEY"}
    secret_key_base {:env, "SECRET_KEY_BASE"}
  end
end

{:env, ...} and {:file, ...} are the MVP sources. The questions are about what else should ship in MVP, and what should stay out.
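For readers who want something concrete, here is a minimal sketch of how the two MVP sources could be resolved and rendered into a 0600 env file. It is not Zed's actual resolver: the module name SecretsSketch, the function names, and the upper-cased KEY=value rendering are all assumptions made for illustration.

```elixir
defmodule SecretsSketch do
  @moduledoc "Hypothetical sketch: resolve {:env, _} / {:file, _} sources, write a 0600 env file."

  # Resolve one source tuple to {:ok, value} or {:error, reason}.
  def resolve({:env, name}) when is_binary(name) do
    case System.get_env(name) do
      nil -> {:error, {:missing_env, name}}
      value -> {:ok, value}
    end
  end

  def resolve({:file, path}) when is_binary(path) do
    case File.read(path) do
      # Drop trailing newlines so "token\n" on disk resolves to "token".
      {:ok, contents} -> {:ok, String.trim_trailing(contents, "\n")}
      {:error, reason} -> {:error, {:unreadable_file, path, reason}}
    end
  end

  # Resolve every declared secret in order; stop at the first failure.
  def resolve_all(secrets) do
    result =
      Enum.reduce_while(secrets, {:ok, []}, fn {key, source}, {:ok, acc} ->
        case resolve(source) do
          {:ok, value} -> {:cont, {:ok, [{key, value} | acc]}}
          {:error, reason} -> {:halt, {:error, {key, reason}}}
        end
      end)

    with {:ok, pairs} <- result, do: {:ok, Enum.reverse(pairs)}
  end

  # Render KEY=value lines (assumed format; multi-line values are not handled).
  def render_env(pairs) do
    Enum.map_join(pairs, "", fn {key, value} ->
      "#{key |> to_string() |> String.upcase()}=#{value}\n"
    end)
  end

  # Set the mode before any secret bytes are written to the file.
  def write_env_file(path, pairs) do
    File.touch!(path)
    File.chmod!(path, 0o600)
    File.write!(path, render_env(pairs))
  end
end
```

Failing on the first unresolved source, rather than writing a partial env file, keeps a converge from starting a release with half its credentials.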

Fork 1 — age-encrypted files in-repo: MVP or Phase 5.1?

The idea: support {:file, "secrets/broker_a.age", mode: :age} so encrypted secrets can live in the git repo alongside the DSL. The target-side resolver shells out to age -d -i <keyfile> to decrypt. This is the standard NixOS / agenix / sops-age pattern.
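A rough sketch of the shell-out, assuming the age binary is on the target's PATH. The module name AgeSketch and the injectable runner argument are hypothetical; only the age -d -i invocation comes from the description above.

```elixir
defmodule AgeSketch do
  # Build the argv for `age -d -i <keyfile> <ciphertext>`.
  def decrypt_args(keyfile, ciphertext_path) do
    ["-d", "-i", keyfile, ciphertext_path]
  end

  # `runner` defaults to System.cmd/3 and is injectable so unit tests
  # do not need the age binary installed.
  def decrypt(keyfile, ciphertext_path, runner \\ &System.cmd/3) do
    args = decrypt_args(keyfile, ciphertext_path)

    # Keep stderr out of the captured output so diagnostics never get
    # mixed into the plaintext.
    case runner.("age", args, [stderr_to_stdout: false]) do
      {plaintext, 0} -> {:ok, plaintext}
      {_output, status} -> {:error, {:age_exit, status}}
    end
  end
end
```

With the default runner, System.cmd/3 raises if the age executable is missing, so a real resolver would probably want to check for the binary up front and report that as a converge error.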

Case for MVP: without it, a first real deploy that involves several credential pairs (e.g., a multi-account broker integration with a structured accounts.config) still relies on out-of-band secret delivery — operator SCPs a file, or a bootstrap script seeds it. That defeats half the point of declarative deploys.

Case for Phase 5.1: adds a binary dependency (age CLI, or vendoring the Rust age crate via Rustler). Introduces a whole new concern — where does the keyfile live on the target, who owns it, how is it rotated. Risk of shipping a bad first version of something that’s security-critical.

My current lean: defer to 5.1. MVP keeps secret files out of git entirely; Option A in the design doc is “opaque file on target, ship path only.” But that means the first real customer deploy still has a plaintext credentials file living on a host with no replication story.

What would change my mind: if someone here has shipped age-decryption in an Elixir deploy tool and has a reusable pattern, or if someone can point at a failure mode I’m not seeing with shell-out to age.

Fork 2 — should `{:zfs_prop, "com.zed:key"}` be a secret source at all?

The appeal: ZFS user properties travel with zfs send/receive. If a secret lives in a property, replicating the dataset replicates the secret for free. No separate secret-replication pipeline, no sops-per-host keys.

The problem: zfs get all shows every property to anyone who can query the dataset, which on a multi-user FreeBSD box is broader than you’d like. Property values also end up in zpool history, backup tooling, and any other introspection path that dumps dataset properties.

Option A: reject {:zfs_prop, ...} from the secrets block entirely. Keep it as a source for non-sensitive config only (node_name, version, feature flags).

Option B: allow it with a big compile-time warning and an opt-in allow_zfs_prop_secrets: true on the deploy. Trust the operator to know their threat model.

My current lean: Option A, hard reject. The convenience isn’t worth the footgun — “replicates for free” is exactly the kind of seductive default that bites at 3 AM.

What would change my mind: a use case where the operator genuinely controls all dataset access (single-user homelab, say) and wants the replication story without adding a second mechanism. Probably a real thing, but I’d rather make those users opt into something else than weaken the default.
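To make Option A concrete, a hard reject could be as small as the check below. This is a sketch under assumptions: the secrets block is modelled as a keyword list of source tuples, and ZfsPropGuardSketch is a made-up name, not a module in Zed.

```elixir
defmodule ZfsPropGuardSketch do
  # Option A: any {:zfs_prop, _} source inside a secrets block is an error.
  def validate_secrets(secrets) do
    # The pattern in the generator skips every non-zfs_prop source.
    offenders = for {key, {:zfs_prop, prop}} <- secrets, do: {key, prop}

    case offenders do
      [] -> :ok
      _ -> {:error, {:zfs_prop_in_secrets, offenders}}
    end
  end
end
```

Collecting every offender instead of stopping at the first lets the error name all the slots that need to move to another source in one pass.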

Specific things I’d love input on

If you’ve shipped secrets in an Elixir deploy (sys.config from envs, {:system, ...} in config, sops, agenix-style flows) — what bit you? What would you want a new tool to not repeat?
Anyone shipped age in production, in an Elixir context? Was it a shell-out or a Rust NIF? Rotation story?
For the zfs_prop question — is anyone actually doing this? Is there a homelab crowd I’m discounting?
Am I missing a source kind that should be in MVP? (I’ve deliberately left off 1Password CLI, HashiCorp Vault, AWS Secrets Manager — they feel out of scope for a zero-infra tool, but push back if you think one belongs.)

Full design doc with threat model, module layout, and wire-up seams: docs/SECRETS_DESIGN.md on the main branch of borodark/zed (GitHub).

Happy to move this to a GitHub Discussion if that’s a better venue — figured ElixirForum would get the most eyes for the “what would a seasoned BEAM person do” angle.

Cheers,
Igor

12 comments

#deployment


2026-05-08 14:30:44 UTC

Most Liked

abrookewood

I love ZFS (best FS ever), but have never thought about using User Properties to store secrets. It seems really niche and feels to me like it could leak way easier than people realise. I’m not your target audience though (I’m a ZFS on Linux user), so take that with a grain of salt. Congrats on trying something unique though - sounds pretty interesting!

Post #2

borodark

Quick update for anyone who followed the original post — and answers to both forks.

When I asked the forum about the two design forks (age-encrypted files vs ZFS user properties for secret material), the project had ~34 tests on FreeBSD and a half-written Zed.Bootstrap. Nine days and a lot of FreeBSD later, here’s where things actually landed.

What shipped

A0 — DSL slot validation. Secrets in the DSL now go through {:secret, slot, field, storage: :local_file} and the validator rejects unknown storage modes at parse time. Future modes (:probnik_vault_pair, :shamir_k_of_n) fail compilation until their implementation lands. The slot catalog is a single source of truth — typo a slot name in your DSL, get a compile error with the source location.
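A simplified sketch of the validation rule described in A0. The real validator reports errors at compile time with source locations; this version just returns tuples. The slot catalog below is a hypothetical stand-in, and the module and error names are invented for the example.

```elixir
defmodule SlotValidatorSketch do
  @implemented_storage [:local_file]
  # Named in the post as future modes that fail until implemented.
  @reserved_storage [:probnik_vault_pair, :shamir_k_of_n]
  # Hypothetical catalog; the real slot list is not given in the post.
  @slots [:beam_cookie, :admin_passwd, :ssh_host_ed25519]

  def validate({:secret, slot, _field, opts}) when is_list(opts) do
    storage = Keyword.get(opts, :storage)

    cond do
      slot not in @slots -> {:error, {:unknown_slot, slot}}
      storage in @implemented_storage -> :ok
      storage in @reserved_storage -> {:error, {:storage_not_implemented, storage}}
      true -> {:error, {:unknown_storage, storage}}
    end
  end
end
```

Separating "reserved but not implemented" from "unknown" gives a typo and a too-early reference different error messages, even though both are rejected.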

A1 — Zed.Bootstrap. Idempotent install-time generator for zed’s own secrets: beam_cookie, admin_passwd (Argon2id), ssh_host_ed25519. All sit on an encrypted dataset (<base>/zed/secrets) with canmount=noauto. Fingerprints get stamped into ZFS user properties (com.zed:fingerprint.<slot>); the values themselves never live there. zed bootstrap status/rotate/verify/export-pubkey are wired up. Re-running init is a no-op. Drift detection is fingerprint-based — corrupt the file on disk and verify tells you which slot drifted.
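Fingerprint-based drift detection can be sketched roughly as follows. The assumptions: the fingerprint is a SHA-256 over the raw file bytes, rendered as sha256:<hex>, and the stored value has already been read back from the ZFS property. FingerprintSketch and its function names are illustrative only.

```elixir
defmodule FingerprintSketch do
  # Lower-case hex SHA-256 with a "sha256:" prefix.
  def fingerprint(bytes) when is_binary(bytes) do
    "sha256:" <> Base.encode16(:crypto.hash(:sha256, bytes), case: :lower)
  end

  # Name of the user property that would hold a slot's fingerprint.
  def property_name(slot), do: "com.zed:fingerprint.#{slot}"

  # Compare a stored fingerprint against the file currently on disk.
  def verify(stored, path) do
    case File.read(path) do
      {:ok, bytes} -> if fingerprint(bytes) == stored, do: :ok, else: :drifted
      {:error, reason} -> {:error, reason}
    end
  end
end
```

Only the digest ever reaches the property, so zfs get reveals that a secret changed but not what it is.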

A2a — zed-web LiveView. Phoenix 1.7 + LiveView, password login against admin_passwd, 8h rolling session, TLS with the bootstrap-generated self-signed cert. The first useful page is /admin showing live Zed.Bootstrap.status/1 — not flashy, but it proved the round-trip from ZFS state to the browser.

A2b — QR admin first-login. Zed.QR renders an ANSI QR with a {zed_admin, …} Erlang-term payload; Zed.Admin.OTT is a GenServer with an ETS-backed atomic single-use consume. bootstrap init prints a 10-minute QR; the dashboard has a “Generate pairing QR” button issuing 2-minute OTTs. Rate-limited 10/min/IP. Audit log records the OTT prefix only.
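One way to get an atomic single-use consume out of ETS is :ets.take/2, which removes and returns a row in a single operation. The sketch below assumes that approach; it is not Zed.Admin.OTT itself, and the table name, token format, and lack of an expiry sweep are simplifications.

```elixir
defmodule OttSketch do
  use GenServer

  @table :ott_sketch

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # Issue a random token that is valid for ttl_ms milliseconds.
  def issue(ttl_ms) do
    token = Base.url_encode64(:crypto.strong_rand_bytes(32), padding: false)
    expires_at = System.monotonic_time(:millisecond) + ttl_ms
    true = :ets.insert_new(@table, {token, expires_at})
    token
  end

  # :ets.take/2 deletes and returns the row atomically, so two concurrent
  # callers presenting the same token cannot both succeed.
  def consume(token) do
    case :ets.take(@table, token) do
      [{^token, expires_at}] ->
        if System.monotonic_time(:millisecond) < expires_at do
          :ok
        else
          {:error, :expired}
        end

      [] ->
        {:error, :invalid}
    end
  end

  @impl true
  def init(_opts) do
    # The GenServer only owns the table; reads and writes go straight to ETS.
    :ets.new(@table, [:named_table, :public, :set])
    {:ok, %{}}
  end
end
```

Expired tokens that are never presented stay in the table here; a real implementation would presumably sweep them periodically.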

A3 — Passkey (WebAuthn). Browser-only; uses wax_ (pure Elixir, no NIF). Register on an authenticated session, sign in with biometric. Sign-count monotonicity catches replays. Works on Chrome desktop, Safari iOS, Chrome Android. The credential lives in the OS secure enclave — zedweb only ever sees the public COSE key.

A4 — SSH-key challenge. For operators who carry ssh-ed25519 muscle memory but no passkey. Pubkey gets pasted in once (authorized_keys format, auditable with stock tools). Login is POST /admin/ssh/challenge → sign with ssh-keygen -Y sign → POST /admin/ssh/response → session cookie. Verification uses :public_key.verify/4 from OTP — no extra dep. There’s a 50-line shell script that does the whole flow and drops a cookie file for curl --cookie. Unblocks scripts.
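The core of the challenge/response idea, reduced to raw Ed25519. Two deliberate simplifications, both assumptions: it calls :crypto.sign/4 and :crypto.verify/5 directly rather than :public_key.verify/4, and it signs the bare challenge bytes. Output from ssh-keygen -Y sign is wrapped in the SSHSIG envelope, which a real verifier has to parse first; that step is not attempted here.

```elixir
defmodule ChallengeSketch do
  # Server side: a fresh random challenge per login attempt.
  def new_challenge, do: :crypto.strong_rand_bytes(32)

  # Client side stand-in: sign the raw challenge with an Ed25519 private key.
  def sign(challenge, private_key) do
    :crypto.sign(:eddsa, :none, challenge, [private_key, :ed25519])
  end

  # Server side: true only if the signature matches this exact challenge.
  def valid?(challenge, signature, public_key) do
    :crypto.verify(:eddsa, :none, challenge, signature, [public_key, :ed25519])
  end
end

{public_key, private_key} = :crypto.generate_key(:eddsa, :ed25519)
challenge = ChallengeSketch.new_challenge()
signature = ChallengeSketch.sign(challenge, private_key)
IO.inspect(ChallengeSketch.valid?(challenge, signature, public_key), label: "valid?")
```

Because each challenge is random and used once, a captured signature is useless against the next challenge.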

A5 — Bastille jail backend (this is the one that nearly broke me). Adapter to FreeBSD’s Bastille (1048-star pure-shell jail manager, BSD-licensed). 540 lines of Elixir, 79-line Runner behaviour, 64-line Mock for unit tests. 175 mocked unit tests passed cleanly on the laptop. The first live run on a real FreeBSD 15.0 Mac Pro found seven distinct production bugs in sequence. Long-form retro here: https://www.dataalienist.com/blog-lie-at-exit-zero.html.

The summary version: bastille destroy -f exits 0 even when it does nothing (running jail, no -a). The mock said the destroy worked. The system kept running. Every other failure was a shape of the same lesson: adapters exist precisely to convert soft contracts into hard ones, and the post-condition check is the only thing that catches a tool that lies on the way out. Final state on the Mac Pro: live integration tests at 5 passing, 0 failing, merged to main as daea21a.
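The post-condition idea, sketched under assumptions: runner is a hypothetical function that runs a bastille argv and returns {output, exit_status}, and exists_fun is a hypothetical probe that reports whether the jail is still present. Neither is Zed's actual Runner behaviour.

```elixir
defmodule DestroySketch do
  # runner:     fn argv -> {output, exit_status} end
  # exists_fun: fn jail_name -> boolean end
  def destroy(name, runner, exists_fun) do
    case runner.(["destroy", "-f", name]) do
      {_output, 0} ->
        # Exit status 0 is only a claim; check the state it claims.
        if exists_fun.(name) do
          {:error, {:still_present, name}}
        else
          :ok
        end

      {output, status} ->
        {:error, {:exit, status, output}}
    end
  end
end
```

A mock runner that always returns exit 0 can no longer make this pass on its own, because the probe has to agree that the jail is gone.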

A5a — privilege boundary (specced, not yet built)

A5.1 ran the BEAM as the same user that ran doas bastille. That’s a perfectly fine pilot but not a production posture. specs/a5a-privilege-boundary.md lays out a two-user split: zedweb (network-facing, no doas) and zedops (privileged, doas-authorized for the bastille subcommands only). Communication via a small gen_tcp line-protocol over a Unix socket with a per-process token. ~1.5 person-months of work. Decisions 12-18 in the spec lock the surface area.
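As an illustration of what the privileged side of such a line protocol might check, here is a parser sketch. Everything in it is assumed rather than taken from the spec: the "TOKEN VERB ARGS..." wire format, the allowlist of verbs, and the module name. The plain token comparison is also a simplification; a real implementation would likely want a constant-time compare.

```elixir
defmodule LineProtoSketch do
  # Hypothetical allowlist; the spec's actual subcommand set is not quoted in the post.
  @allowed ~w(create start stop destroy)

  # Parse one request line of the assumed form "TOKEN VERB ARG...".
  def parse(line, expected_token) do
    case String.split(String.trim(line), " ", trim: true) do
      [token, verb | args] ->
        cond do
          # Simplification: not a constant-time comparison.
          token != expected_token -> {:error, :bad_token}
          verb not in @allowed -> {:error, {:verb_not_allowed, verb}}
          true -> {:ok, {verb, args}}
        end

      _ ->
        {:error, :malformed}
    end
  end
end
```

Returning the verb and args as data, rather than a shell string, means the privileged process can build the doas bastille argv itself and never passes client text through a shell.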

Answers to the original forks

Fork 1: age-encrypted files. Verdict: yes, but as a {:file, path, mode: :age} source mode in the DSL, not as the bootstrap default. Bootstrap stays on encrypted ZFS datasets, which are the right primitive for “secrets that travel with zfs send.” age belongs in the user-supplied secret pipeline (accounts.config.age style), not in zed’s own bootstrap chain. Implementation is Phase 5.1; the DSL syntax is already validated at parse time, so consumers can write the references today and get a “not yet implemented” error at converge time, not a typo six months later.

Fork 2: ZFS user properties for secrets. Verdict: no, with a clarification. Properties get fingerprints (com.zed:fingerprint.<slot> = sha256:<hex>), never values. The reason is a single sentence: ZFS properties are readable by any user who can run zfs get on the dataset. They’re a great metadata backbone (they replicate with snapshots, they survive send/recv, they’re free), but they are not a secret store. The {:zfs_prop, "com.zed:name"} source kind in the DSL is reserved for non-secret configuration only, and the validator will reject it for slots tagged secret: true. This was the cleanest answer once I started writing the threat model: properties optimise for visibility, secrets optimise against it.

What’s next

B0 (the zedz Android+iOS scanner — fork of probnik) is the next thing on the runway. After that, A5a — the privilege boundary that retires “BEAM-runs-as-bastille-user” forever. Layers C and D (NAS-adjacent + Probnik Vault + Shamir) remain shelved unless explicitly unshelved.

Post #3