Building My Homelab with Infrastructure-as-Code
Start as you mean to go on
Before I built any of this, I'd spent a long time lurking in the homelab corners of Reddit and YouTube, and most of the horror stories there go the same way. You spin up one VM to try something out, it works, so it sticks around. A year later you've got six VMs, a reverse proxy, a DNS server, a few Docker hosts, and a pile of firewall rules you set up by clicking around a web UI at 1am and never wrote down. Now there's a box you're a little scared to reboot, and half the threads are people trying to reverse-engineer their own setup after a disk died.
I really didn't want to end up there. Between watching that play out a hundred times online and my day job (I'm a backend engineer, so infra is only part of what I do), I'd seen plenty of what happens when stuff gets hand-configured and nobody writes anything down. The server everyone's scared to touch. The migration that drags on for months because no one's sure what the old box actually does. The bug that turns out to be some setting a person changed by hand once, ages ago, and never recorded.
So when I built my homelab, IaC wasn't a "I'll clean this up later" thing. It was the rule from the very first VM: if it's not in the repo, it doesn't exist. Every machine, every firewall rule, every container, every DNS record lives in git and can be rebuilt from scratch.
This is a tour of how it all fits together. It's the first post in a short series, so I'm staying at the whole-system level here. Later ones will get into the individual pieces.
So why bother for a homelab?
Nobody's paying me to keep this up. There's no on-call, no SLA, no angry customer. So why put in the IaC effort for what's basically a hobby?
Part of it is just how I'm wired. I can't stand clicking through UI buttons to configure things, and I love the opposite: everything written down in plain text, in one place I can read top to bottom. IaC is really just that preference turned into a workflow. But beyond the temperament, a few things turned out to matter more than I expected:
- I can rebuild from nothing. Host dies? Recovery is terraform apply, not detective work.
- The repo is the docs. No separate wiki to go stale, because the code is the source of truth, always up to date by definition.
- git log is my changelog. "What did I change recently?" has an actual answer instead of a shrug.
- No mystery clicks. If a setting exists, it's in a file I can read.
But the single biggest reason is that I don't want my setup to depend on my memory at all. I'll flip some obscure setting to fix a problem and move on, and a week later I won't remember doing it, let alone why. That's not me being forgetful; it's just what happens after a few months of little tweaks here and there. The real cost shows up when you reinstall from scratch: every undocumented fix is gone, and you're stuck rediscovering the exact same problems one at a time.
With everything in code, the repo remembers for me: not just what I changed, but, thanks to commit messages and comments, why. That's the whole game.
And here's a benefit I genuinely didn't see coming when I started: because the whole lab is just text files, AI coding agents are ridiculously good at helping with it. I describe what I want (a new service, a firewall rule, another node) and the agent drafts the Terraform or the compose file for me. I read the diff, tweak it, push. Stuff that used to eat an evening of squinting at provider docs now takes minutes. A click-ops homelab can't be handed off like that; a repo full of declarative config absolutely can. It's quietly become one of the biggest time-savers in the whole setup.
The hardware, quickly
I'll stay vague here; it's my house, and honestly the specifics don't matter much. Roughly:
- A Proxmox box running a handful of VMs. The workhorses.
- A little mini PC and a couple of Raspberry Pis for the lighter, spread-out stuff.
- Some MikroTik gear doing the routing and switching.
Nothing fancy. The interesting bit isn't the hardware anyway. It's that none of it gets configured by hand anymore.
Carving up the network
This is where everything starts, and it's the one part I didn't get right on the first try.
It began with the usual warning you'll see all over the homelab subreddits: cheap smart plugs and bulbs are often insecure, rarely updated, and really shouldn't sit on the same network as your actual computers. Fair enough. So my first move was to dump all the smart-home stuff onto the guest network, away from everything important.
That worked right up until my ISP had an outage, and suddenly I couldn't control my own lights. The guest network was so isolated that my main devices had no local path to the IoT gear, so with the internet down, the only route to them (some cloud round-trip) was gone too. Lights I was physically standing next to, unreachable. Annoying doesn't quite cover it.
That's what sent me down the VLAN rabbit hole. I picked up networking gear that actually supports VLANs and split things properly:
- Management: the network gear's own admin interfaces.
- Main: trusted machines, servers, workstations.
- IoT: smart-home stuff I don't trust an inch.
- Guest: fully isolated, for visitors.
The important bit is the firewall rules between them, and specifically that they're asymmetric. The IoT VLAN can't start a conversation with Main; if a bulb gets popped, it's stuck in its own little box. But Main can reach into the IoT VLAN, so my home-automation controller talks to the lights directly over the local network, no cloud in the loop.
There's one wrinkle worth mentioning: a lot of smart-home auto-discovery (mDNS, SSDP, plain broadcast traffic) doesn't cross VLAN boundaries even when the firewall would happily allow it. So my Home Assistant box is dual-homed, with one interface on Main and a second one sitting directly in the IoT VLAN. That way it discovers and talks to devices natively, as if it were on the same flat network as them, while everything else stays properly separated.
Nothing special happens on the VM itself for this: Proxmox just hands it a tagged interface on each VLAN, and the same asymmetric firewall rules still apply. Being dual-homed lets Home Assistant see both networks, but it doesn't punch a hole between them; the firewall still decides who's allowed to talk to whom.
End result: the smart home is walled off and fully local. The lights keep working when the ISP doesn't, which, given how this whole tangent started, is exactly the point. This segmentation is the base everything else sits on, so it goes in first.
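To make the asymmetry concrete, here is a small Python model of the policy rather than real router config. It assumes a stateful firewall where only Main is allowed to open connections into IoT. The zone identifiers, the Packet type, and the decision to deny every other cross-VLAN direction are assumptions made for the sketch; the post only confirms the Main/IoT behaviour.

```python
from dataclasses import dataclass

# Zone names mirror the four VLANs in the post; the identifiers are made up.
ZONES = {"management", "main", "iot", "guest"}

# (source, destination) pairs allowed to OPEN a connection across VLANs.
# Assumption: only Main -> IoT is stated in the post, so it is the only entry;
# every other cross-VLAN direction is treated as denied in this model.
ALLOWED_NEW = {("main", "iot")}


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    state: str  # "new" opens a flow, "established" belongs to an existing one


def allowed(pkt: Packet) -> bool:
    """Decide whether a stateful inter-VLAN firewall would forward pkt."""
    if pkt.src not in ZONES or pkt.dst not in ZONES:
        raise ValueError(f"unknown zone in {pkt}")
    if pkt.src == pkt.dst:
        return True  # same VLAN: traffic never reaches the inter-VLAN rules
    if pkt.state == "established":
        # A flow can only exist if one side was allowed to open it.
        return (pkt.src, pkt.dst) in ALLOWED_NEW or (pkt.dst, pkt.src) in ALLOWED_NEW
    return (pkt.src, pkt.dst) in ALLOWED_NEW


print(allowed(Packet("main", "iot", "new")))          # True
print(allowed(Packet("iot", "main", "new")))          # False
print(allowed(Packet("iot", "main", "established")))  # True
```

The third case is what keeps the lights usable: a bulb can answer a request that Main started, but it can never start one of its own.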
Building the boxes: layered Terraform
Terraform builds the actual infrastructure, split into numbered layers that run in order. Each one has a single job:
- 00-creds: the only layer that talks to my secrets manager.
- 01-network: the MikroTik devices (identities, firewall, NAT, DHCP).
- 02-compute: provisions the VMs and containers on Proxmox.
- 03-ingress: the reverse proxy, firewall rules, and host config on the main VM.
- 04-dns: the public DNS zone and tunnel config over at Cloudflare.
- 05-nodes: the bare-metal Docker nodes (the Pis).
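Running numbered layers in order is easy to script. The Python sketch below only builds the commands and does not execute them. It assumes the layers sit as sibling directories under a root folder called terraform, which is a made-up name, and it relies on Terraform's global -chdir option. The post doesn't say how the layers are actually driven, so treat this as one possible wrapper.

```python
import re
from collections.abc import Iterable

# A layer directory is two digits, a dash, then a name, e.g. "01-network".
LAYER_RE = re.compile(r"^(\d{2})-[a-z0-9-]+$")


def ordered_layers(names: Iterable[str]) -> list[str]:
    """Keep only layer-style names and sort them by numeric prefix."""
    layers = [n for n in names if LAYER_RE.match(n)]
    return sorted(layers, key=lambda n: int(n[:2]))


def apply_commands(names: Iterable[str], root: str = "terraform") -> list[list[str]]:
    """Build one 'terraform apply' command per layer, in run order.

    'root' is a hypothetical directory name; nothing is executed here.
    """
    return [["terraform", f"-chdir={root}/{name}", "apply"]
            for name in ordered_layers(names)]


# Directory listing in arbitrary order, including a non-layer folder.
names = ["03-ingress", "00-creds", "modules", "05-nodes",
         "01-network", "04-dns", "02-compute"]
for cmd in apply_commands(names):
    print(" ".join(cmd))
```

Each command list could be handed to subprocess.run; stopping at the first failure keeps later layers from applying on top of a broken earlier one.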
The bit I'm happiest with is how secrets move around. Only 00-creds ever talks to the secrets manager. It grabs everything once, spits out a single sensitive identity output, and every other layer just reads that from remote state instead of hitting the API itself:
flowchart TD
OP[Secrets manager] --> |API call, once| Creds[00-creds]
Creds --> |identity output via remote state| Net[01-network]
Creds --> Compute[02-compute]
Creds --> Ingress[03-ingress]
Creds --> DNS[04-dns]
Creds --> Nodes[05-nodes]
Net --> Live[Live infrastructure]
Compute --> Live
Ingress --> Live
DNS --> Live
Nodes --> Live
Concretely, layers 01 through 05 pull those values via terraform_remote_state data sources (the state itself lives in an S3 backend), so they never need direct access to the credentials provider.
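The pattern itself, read once and fan out through a published output, can be shown with a toy Python model. It is not Terraform: the dictionary stands in for the remote state output, CountingSecrets stands in for the secrets manager, and the key names are invented for illustration.

```python
class CountingSecrets:
    """Stand-in for a rate-limited secrets API that counts every read."""

    def __init__(self, values: dict[str, str]) -> None:
        self._values = values
        self.reads = 0

    def read(self, key: str) -> str:
        self.reads += 1
        return self._values[key]


def apply_creds_layer(api: CountingSecrets, keys: list[str]) -> dict[str, str]:
    """Model of 00-creds: read each secret once, publish them as one output."""
    return {key: api.read(key) for key in keys}


def apply_other_layer(identity: dict[str, str], needed: list[str]) -> dict[str, str]:
    """Model of layers 01-05: read from the published output, never the API."""
    return {key: identity[key] for key in needed}


# Hypothetical secret names and dummy values.
api = CountingSecrets({
    "mikrotik_password": "dummy-1",
    "proxmox_token": "dummy-2",
    "cloudflare_token": "dummy-3",
})
identity = apply_creds_layer(api, ["mikrotik_password", "proxmox_token", "cloudflare_token"])

# Ten iterations on the network layer cost no further API reads.
for _ in range(10):
    apply_other_layer(identity, ["mikrotik_password"])

print(api.reads)  # 3
```

The read count depends only on how often the credentials layer runs, not on how often the other layers do.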
That split wasn't me over-engineering for fun; it came straight out of a problem I hit. I keep secrets in 1Password, and its personal plan caps you at roughly a thousand API reads a day. Originally the credential lookups lived in every layer, so a single plan/apply across all of them fired off around eighty 1Password calls. A few iterations in one afternoon and I'd blown through the daily cap and locked myself out of my own secrets. Folding all the reads into one 00-creds layer fixed it for good: those calls only happen when I apply that one layer, and normal day-to-day work on everything else makes zero. So if I'm just poking at a firewall rule, 01-network runs without a single secrets API call.
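The budget arithmetic is quick to check. This sketch plugs in the round numbers from the paragraph above, a cap of about a thousand reads a day and about eighty calls per full pass. Both are approximations, so the result is a ballpark figure.

```python
def full_runs_before_cap(daily_cap: int, calls_per_run: int) -> int:
    """Number of complete runs that fit under a daily API read cap."""
    if daily_cap < 0 or calls_per_run <= 0:
        raise ValueError("daily_cap must be >= 0 and calls_per_run must be > 0")
    return daily_cap // calls_per_run


print(full_runs_before_cap(1000, 80))  # 12
```

On those numbers the whole day's allowance covers about a dozen full passes, which an afternoon of iterating can plausibly burn through.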
Keeping secrets out of git
Quick but important, because this is exactly where homelab repos tend to leak: there are no secrets in the repo. They live in a secrets manager, pulled in at apply time with a service-account token, and the Terraform state lives in remote object storage, not on my laptop. The repo's safe to be sloppy with; if it went public tomorrow, nothing in it would hurt me.
Running the services: GitOps with Komodo
This is the fun part, where it stops being "Terraform I run by hand" and starts running itself.
Terraform builds the boxes, but it doesn't run my containers. That's Komodo's job, and the whole thing is GitOps end to end. The neat trick: Komodo's own config is also just code in the same repo.
Every service it manages is declared in TOML: the stacks, which servers they run on, the repos they pull from, scheduled jobs, all of it in version control. A stack definition is tiny and just points at a compose directory in the same repo:
[[stack]]
name = "cloudflared"
tags = ["infrastructure"]
[stack.config]
server = "main-vm"
linked_repo = "homelab-infrastructure"
run_directory = "services/infrastructure/cloudflared"
poll_for_updates = true
auto_update = true
That run_directory is literally a path to a compose.yaml sitting alongside everything else. No separate "deployment config"; the service and how it gets deployed live in the same tree.
Komodo keeps itself in sync with the repo through a thing it calls Resource Sync. I point it at the folder of TOML files, mark them as managed, and from then on it's: change the TOML, push, Komodo reconciles itself to match. Config that manages the config.
And here's the part that makes it actually GitOps: deploys happen on git push.
flowchart LR
Push[git push] --> WH[Webhook / poll]
WH --> Core[Komodo Core]
Core --> |PullRepo| Pull[Pull latest repo]
Pull --> |BatchDeployStack| Peri[Periphery agents]
Peri --> Up[docker compose up -d]
I lean on two mechanisms here, depending on how fast I need a change live:
- Per-stack polling: most stacks set poll_for_updates and auto_update, so they spot repo changes on their own and redeploy without me doing anything.
- A webhook deploy: a push can fire a webhook that runs PullRepo then BatchDeployStack across everything, for when I want it live now instead of at the next poll.
The layout's simple: every VM, the mini PC, and each Pi runs the Komodo Periphery agent, and one main VM runs Komodo Core. Core's the brain: it holds the definitions and decides what runs where. Periphery's the muscle on each host that actually drives Docker. A Caddy reverse proxy sits out front doing TLS for everything on my internal domain (lab.example.com), with certs from Let's Encrypt over a Cloudflare DNS challenge.
So the day-to-day is: edit a compose file or a stack definition, commit, push. A minute later it's live on the right host. I genuinely can't remember the last time I ran docker compose up by hand, and I don't miss it.
(Quick honesty check on secrets here too: the env values in those stack files are placeholders or references, never the real thing. Push-to-deploy is great right up until it pushes a credential to a public remote.)
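A rule like that is easier to keep when something checks it before the push. Here is a minimal Python sketch of such a check over KEY=value lines. The two placeholder shapes it accepts, [[NAME]] and ${NAME}, and the list of sensitive-looking key names are assumptions for illustration, not the conventions this repo is confirmed to use.

```python
import re

# Assumed placeholder conventions (not confirmed by the post): [[NAME]] or ${NAME}.
PLACEHOLDER_RE = re.compile(r"^(\[\[[A-Z0-9_]+\]\]|\$\{[A-Z0-9_]+\})$")
# Key names that usually hold credentials; extend to taste.
SENSITIVE_KEY_RE = re.compile(r"(TOKEN|SECRET|PASSWORD|API_KEY)", re.IGNORECASE)


def suspicious_env_keys(env_text: str) -> list[str]:
    """Return sensitive-looking keys whose value is a literal, not a placeholder."""
    flagged = []
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip('"')
        if SENSITIVE_KEY_RE.search(key) and value and not PLACEHOLDER_RE.match(value):
            flagged.append(key)
    return flagged


# Made-up environment block for the demo.
sample = """
# tunnel settings
TUNNEL_TOKEN=[[CLOUDFLARED_TOKEN]]
TZ=UTC
API_KEY=abc123
DB_PASSWORD=${DB_PASSWORD}
"""
print(suspicious_env_keys(sample))  # ['API_KEY']
```

It is a heuristic: it catches the obvious pasted-token case, and a dedicated secret scanner would still be the stronger safety net.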
What it cost, what it bought
I won't pretend it was free. There's real upfront effort: learning each provider's quirks, getting the layering right, the occasional afternoon torched on some provider complaining about a field that doesn't exist in its own schema. And push-to-deploy cuts both ways: a sloppy commit ships just as fast as a careful one.
And IaC doesn't cover everything, much as I'd like it to. Some things just can't be expressed as code: a BIOS or UEFI tweak on the host, a one-time firmware setting, the odd manual bootstrap step before the automation can take over. For those I do the next best thing and write them down properly, in the repo, right alongside the code. The non-codifiable bits live as documented runbooks in the same git history as everything else, so even they don't end up as tribal knowledge rattling around in my head. If it can't be code, it's at least a checked-in doc.
But the trade was worth it, easily. Never touching a UI again is the whole point: every change gets reviewed by me, recorded in git, and can be rolled back. It's the kind of setup I'd happily nuke and rebuild on a slow Sunday, because I know exactly what'd come back.
And that's the foundation everything else builds on. There's plenty more worth digging into (the individual services, how I make them resilient, the network underneath), and I'll get to those in later posts. But this groundwork is the part worth getting right first.