How Claude Code and I Built an €18/Month Kubernetes Platform
239 commits, 83 days, one AI copilot. The full story of building a production Kubernetes platform on Hetzner — from a €13.50 dream through a $100+ autoscaling bill to a stable €18/month.
This post was almost entirely written by Claude (AI) while I prompted and directed the content. The infrastructure it describes was also built in collaboration with Claude Code.
The Premise
I wanted a place to build and host my own applications without
parachuting into someone else's platform. A personal Kubernetes cluster
on Hetzner, managed by
Talos Linux, provisioned with
OpenTofu, deployed via
ArgoCD. The original plan was
a stub—an outline of decisions to vet. Then I started building it
with Claude Code and 83 days
later, 239 commits deep, the platform is running production workloads
for about €18 a month.
Nearly every commit in this repository was co-authored with Claude Code.
Not generated and copy-pasted—pair-programmed. I described what I
wanted, Claude wrote the Terraform, the Helm values, the TypeScript
manifest generators, the GitHub Actions workflows, the network policies.
I reviewed, steered, and hit enter. The entire cluster—from the
first tofu apply to the
Prometheus alerting rules that page me on Discord—was built this
way.
What follows is the full story: the €13.50 dream, the day-one
memory crisis, the three-month TLS redirect loop saga, the $100+
autoscaling bill, and where things landed. Everything described here is
live in the
talos-redux repository.
The Journey
The €13.50 Dream (Nov 24, 2025)
It started at 10:06 PM with 14 commits in a single evening. The
vision: 1× cpx11 control plane + 2× cpx11 workers
(2 vCPU, 2 GB RAM each) for €13.50/month. The first
commit used the kube-hetzner
Terraform module, but within an hour I discovered it was K3s-based, not
Talos. By 11:03 PM, the entire foundation had pivoted to
hcloud-k8s/kubernetes for
real Talos Linux—immutable OS, no SSH, API-driven everything.
Digger for IaC was dropped the same night in favor of direct OpenTofu
workflows. Claude Code wrote the replacement module config, the R2
backend configuration, and the ArgoCD bootstrap Helm values in rapid
succession.
Day Two: Memory Crisis and TLS Wars (Nov 25–26)
By morning the cluster was dying. The control plane was at 133% memory utilization, the controller-manager was crash-looping, and 87 evicted pods littered the namespace. Two gigabytes of RAM was not enough to run Kubernetes. The nodes were upgraded from cpx11 to cpx21 (4 GB RAM) and the €13.50 dream became €18.
Then the TLS redirect loops started. Cloudflare terminates TLS at its
edge, so traffic arrives at nginx-ingress over HTTP. But the Kubernetes
Dashboard and ArgoCD both expected HTTPS and returned 307 redirects,
creating infinite loops. The Dashboard fight consumed a full day. ArgoCD
was worse—25 debug/fix commits in a single
day, a stream of debug:, fix:, and ci: prefixes escalating
through the git log. The root cause was simple once found: ArgoCD
needed the --insecure flag
since TLS was already terminated upstream. This same class of bug would
quietly haunt every new application for the next three months until a
universal ssl-redirect: false annotation was standardized with a regression test in the builder.
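For concreteness (the builder itself isn't reproduced in this post), here is a minimal TypeScript sketch of the kind of default the manifest generator can stamp onto every Ingress so Cloudflare-terminated HTTP traffic never loops again. The annotation key and value are the standard ingress-nginx ones; the helper name and shape are hypothetical.

```typescript
// Hypothetical helper: apply Cloudflare-friendly TLS defaults to every
// generated Ingress. Only the annotation key/value is the standard
// ingress-nginx one; the function itself is illustrative.
type IngressAnnotations = Record<string, string>;

function withCloudflareTlsDefaults(extra: IngressAnnotations = {}): IngressAnnotations {
  return {
    // TLS is already terminated at Cloudflare's edge, so never force an
    // HTTPS redirect inside the cluster.
    "nginx.ingress.kubernetes.io/ssl-redirect": "false",
    ...extra,
  };
}

// Example usage for a new app's Ingress annotations.
console.log(withCloudflareTlsDefaults());
```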
Building the Platform (Nov 26 – Dec 2)
With the cluster stable, the real building started. Claude Code and I assembled the platform layer by layer:
- Observability: Prometheus (kube-prometheus-stack), Grafana with Google OAuth, and Loki + Promtail for log aggregation. All wired together with pre-configured datasources.
- Database: CloudNativePG deployed PostgreSQL 17 with automatic failover (1 primary + 1 replica), 20 Gi per instance, pod anti-affinity across nodes.
- First apps: BigCartBuddy (a receipt scanning app) became the first real workload. A private Docker registry went up at registry.frodojo.com.
- Cost optimization: PVC sizes were slashed (Loki 50 Gi→10 Gi, Prometheus 30 Gi→10 Gi). Every stateful workload got a nodeSelector pinning it to the static worker nodes, so the cluster autoscaler could freely manage dynamic capacity without evicting Prometheus mid-scrape (see the sketch after this list).
- Automation: n8n for workflow automation, ARC runners for self-hosted GitHub Actions inside the cluster, and argocd-image-updater to auto-deploy new container images.
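For a rough sense of what that pinning renders down to on a stateful pod spec (assuming the nodepool: worker label described later in this post), a fragment like the one below is enough; the surrounding object is illustrative, but nodeSelector is the standard Kubernetes field.

```typescript
// Illustrative pod-spec fragment: keep stateful workloads (Prometheus, Loki,
// CNPG) on the static worker pool so autoscale nodes can come and go freely.
const statefulPodScheduling = {
  nodeSelector: { nodepool: "worker" }, // label carried by the static workers
};

console.log(JSON.stringify(statefulPodScheduling, null, 2));
```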
The application definition system evolved three times during this
period. It started with CDK8s, moved to lightweight TypeScript classes,
and settled on the current
k8s-apps/ builder pattern
that generates ArgoCD Application YAMLs from typed config. Each
iteration was simpler than the last.
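The post doesn't reproduce the builder's real API, but a typed app definition in that spirit might look roughly like this; the interface, field names, and example values are all illustrative.

```typescript
// Hypothetical shape of a typed app definition in k8s-apps/. The real builder's
// fields are not shown in this post; treat everything below as a sketch.
interface AppConfig {
  name: string;
  namespace: string;
  image: string;                // container image reference
  host?: string;                // public hostname -> Ingress + Cloudflare DNS record
  pinToStaticWorkers?: boolean; // render the nodepool: worker nodeSelector
}

const exampleApp: AppConfig = {
  name: "clockzen",
  namespace: "clockzen",
  image: "registry.frodojo.com/clockzen:latest", // placeholder image path/tag
  host: "clockzen.com",
  pinToStaticWorkers: false,
};
```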
Applications Arrive (Dec 7 – Jan 13)
December 7 brought one of the more novel additions: a Discord bot powered by Claude's Agent SDK with full kubectl access to the cluster. It can query Prometheus, generate charts, and report on cluster health directly in Discord. The SDK integration took multiple iterations—Claude Code helped me try three different Anthropic SDK packages before landing on the right one.
December 26 was the big deployment day—three applications in a single session: Beyond Cloud (auth service + frontend at usebey.com), Tknscope (token analysis platform with its own CNPG database), and a comprehensive alerting system (14 PrometheusRules, Alertmanager→Discord, kubernetes-event-exporter). The cluster immediately ran out of capacity. A third static worker was added on December 27, and on December 31 the control plane was upgraded from cpx21 to cpx31 (8 GB RAM) after hitting 94% memory.
January brought ClockZen (time tracking), DJWriter (content generation), Claude Runner (AI integration), and Plane CE (project management)—each deployed through the same GitOps pipeline.
The $100+ Bill
Then came the Hetzner invoice: over $100. The cluster autoscaler had been spinning up cpx21 nodes in response to pending pods and never tearing them down fast enough. With no cost guardrails in place, autoscale workers silently accumulated and the bill ballooned from the expected €18 baseline to well past $100.
The fix was multi-layered. Autoscale node lifetime was capped so nodes
are torn down after 6+ hours of inactivity. Stateful workloads were
pinned to static workers with explicit
nodeSelector rules so they
never trigger autoscaling. And most importantly,
cost alerting rules were added to Prometheus: alerts
fire if unexpected nodes appear, if the autoscaler creates more than 10
nodes in 24 hours, if any autoscale node runs longer than 6 hours, or
if total volume capacity exceeds 300 Gi. Critical cost alerts push
to my phone via
ntfy.sh. The bill came back
down to the €18 baseline.
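The repo's actual rules aren't quoted here, but the guardrails translate naturally into PromQL over standard kube-state-metrics series (kube_node_info, kube_node_created, kube_persistentvolume_capacity_bytes). The rule names, label patterns, and exact expressions below are illustrative; only the thresholds come from the post.

```typescript
// Illustrative cost-guardrail alerts expressed as PromQL strings. Label
// matchers like node=~".*autoscale.*" are assumptions about node naming.
const costAlertSketches = [
  {
    alert: "UnexpectedStaticNodeCount",
    // 1 control plane + 3 static workers expected.
    expr: 'count(kube_node_info{node!~".*autoscale.*"}) != 4',
  },
  {
    alert: "AutoscaleNodeRunningTooLong",
    // Node created more than 6 hours ago.
    expr: 'time() - kube_node_created{node=~".*autoscale.*"} > 6 * 3600',
  },
  {
    alert: "TooManyAutoscaleNodesIn24h",
    // Autoscale nodes created in the last 24h that still exist.
    expr: 'count(kube_node_created{node=~".*autoscale.*"} > (time() - 86400)) > 10',
  },
  {
    alert: "TotalVolumeCapacityTooHigh",
    expr: "sum(kube_persistentvolume_capacity_bytes) > 300 * 1024^3", // 300 Gi
  },
];
```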
Security Hardening (Jan 31)
For the first two months, security was an afterthought. January 31
brought a dedicated hardening sprint: the Discord agent's ClusterRole
lost secrets write and pods/exec permissions, the Dashboard was
downgraded from cluster-admin to view-only, all frontends got
securityContext (runAsNonRoot,
drop ALL capabilities, seccomp RuntimeDefault), and NetworkPolicy
resources were added to restrict database access to only the application
namespaces that need it. TLS was enforced on every ingress.
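In standard Kubernetes terms, the frontend hardening corresponds to pod- and container-level securityContext fields like these; the exact split and any extra fields in the repo's charts may differ.

```typescript
// Hardened defaults described above, expressed with standard Kubernetes
// securityContext fields.
const hardenedSecurityContext = {
  pod: {
    runAsNonRoot: true,
    seccompProfile: { type: "RuntimeDefault" },
  },
  container: {
    capabilities: { drop: ["ALL"] },
    allowPrivilegeEscalation: false, // assumed companion setting, not stated in the post
  },
};
```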
What’s Running Now
The cluster runs Talos Linux v1.12.2 on Kubernetes v1.33.7,
provisioned through the
hcloud-k8s OpenTofu module
(v3.21.1) in Hetzner's Ashburn region:
- 1 control plane node — CPX31 (4 vCPU, 8 GB RAM). Single control plane, not HA. Rebuilding from state takes minutes.
- 3 static worker nodes — CPX21 (3 vCPU, 4 GB RAM each). Labeled nodepool: worker; stateful workloads are pinned to this pool.
- 0–5 autoscale workers — CPX21 nodes created on-demand. Labeled nodepool: worker-autoscale. Torn down after 6+ hours idle. Closely monitored after the $100 lesson.
- Hetzner LB11 — Load balancer in front of nginx-ingress, managed by OpenTofu.
The Stack
Everything is deployed as Helm charts managed by ArgoCD Application
resources, generated by a TypeScript build step using
cdk8s. Source of truth:
k8s-apps/.
Networking & Ingress
- Cilium — CNI plugin (built into hcloud-k8s). Pod networking and Kubernetes NetworkPolicy enforcement.
- nginx-ingress — Ingress controller behind the Hetzner LB. TLS terminated with cert-manager certificates.
- external-dns v1.14.3 — Auto-creates Cloudflare DNS records from Ingress resources for *.frodojo.com, clockzen.com, bigcartbuddy.com, and usebey.com. Proxied through Cloudflare.
- cert-manager — Let's Encrypt via DNS-01 against Cloudflare. Issuer: letsencrypt-cloudflare (a rough sketch follows this list).
- Cloudflare Access — Zero Trust protection for internal services (ArgoCD, Grafana, Dashboard) via Google OAuth.
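The ClusterIssuer behind that setup follows the standard cert-manager schema for DNS-01 with Cloudflare; the email, secret name, and key below are placeholders, not values from the repo.

```typescript
// Sketch of a cert-manager ClusterIssuer for Let's Encrypt DNS-01 via
// Cloudflare. apiVersion/kind/spec structure is standard cert-manager.
const letsencryptCloudflareIssuer = {
  apiVersion: "cert-manager.io/v1",
  kind: "ClusterIssuer",
  metadata: { name: "letsencrypt-cloudflare" },
  spec: {
    acme: {
      server: "https://acme-v02.api.letsencrypt.org/directory",
      email: "admin@example.com",                                    // placeholder
      privateKeySecretRef: { name: "letsencrypt-cloudflare-account-key" }, // placeholder
      solvers: [
        {
          dns01: {
            cloudflare: {
              apiTokenSecretRef: { name: "cloudflare-api-token", key: "api-token" }, // placeholder secret
            },
          },
        },
      ],
    },
  },
};
```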
Monitoring & Alerting
- Prometheus (kube-prometheus-stack v65.2.0) — 7-day retention, 20 GB on hcloud-volumes. Custom PrometheusRules for nodes, pods, deployments, PVCs, certificates, Loki health, cost tracking, and ArgoCD sync status.
- Grafana v7.0.0 — 10 Gi persistence. Dashboards for cluster overview, node metrics, pod detail, ARC runner stats, and Hetzner cost tracking.
- Loki (loki-stack v2.10.2) — 30 Gi storage, 7-day retention, rate-limited to 10 MB/s ingestion after an early log flood incident.
- Alertmanager — Critical alerts to Discord
immediately, warnings aggregated hourly. Cost alerts push to phone via
ntfy.sh.
Data
- CloudNativePG — PostgreSQL 17.2 with 2 instances for failover. 20 Gi per instance, pod anti-affinity across nodes. Managed roles for ClockZen, Tknscope, and DJWriter.
- hcloud-volumes — Hetzner's block storage CSI driver. ~120 Gi provisioned across the cluster.
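The CloudNativePG cluster described above maps onto the standard postgresql.cnpg.io/v1 Cluster resource roughly like this; the cluster name, image tag, role list, and node labels are illustrative, not copied from the repo.

```typescript
// Sketch of a CloudNativePG Cluster matching the description above.
// apiVersion/kind and field names follow the CNPG v1 API; values are placeholders.
const pgCluster = {
  apiVersion: "postgresql.cnpg.io/v1",
  kind: "Cluster",
  metadata: { name: "pg-main", namespace: "databases" }, // placeholder names
  spec: {
    instances: 2,                                   // 1 primary + 1 replica
    imageName: "ghcr.io/cloudnative-pg/postgresql:17.2",
    storage: { size: "20Gi" },
    affinity: {
      enablePodAntiAffinity: true,                  // spread instances across nodes
      topologyKey: "kubernetes.io/hostname",
      nodeSelector: { nodepool: "worker" },         // pin to static workers
    },
    managed: {
      roles: [
        { name: "clockzen", ensure: "present", login: true, passwordSecret: { name: "clockzen-db-credentials" } }, // placeholder
      ],
    },
  },
};
```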
CI/CD & GitOps
- ArgoCD v7.7.11 — Root Application watches apps/dist/, auto-syncs with prune and self-heal. ApplicationSet auto-discovers repos in the GitHub org every 3 minutes. Google OAuth with admin for [email protected].
- GitHub Actions — Infrastructure changes (infra/) trigger OpenTofu. App changes (k8s-apps/) generate and apply manifests. Kubeconfig pulled from R2 at runtime.
- ARC Runners (v0.10.1) — 0–5 self-hosted GitHub Actions runners with Docker-in-Docker. Currently serving clockzen-next.
- ArgoCD Image Updater — Auto-updates image tags from container registries.
Applications
- ClockZen — Time tracking (API + frontend).
- Tknscope — Token analysis (API + frontend + marketing).
- DJWriter — Content generation.
- n8n — Workflow automation.
- Beyond Cloud — Auth service + frontend (usebey.com).
- Claude Runner — AI integration service.
- Discord K8s Agent — Claude-powered cluster management bot.
- Plane CE — Project management (plane.so).
Each application deploys to its own namespace with default-deny NetworkPolicy. Selective rules open only the required paths.
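A default-deny policy plus one selective opening looks like this in standard networking.k8s.io/v1 terms; the namespaces, cluster name, and labels are placeholders chosen for illustration.

```typescript
// Default-deny ingress for an application namespace, plus a selective rule
// allowing only that app to reach the shared PostgreSQL cluster.
const defaultDeny = {
  apiVersion: "networking.k8s.io/v1",
  kind: "NetworkPolicy",
  metadata: { name: "default-deny", namespace: "clockzen" },
  spec: { podSelector: {}, policyTypes: ["Ingress"] },
};

const allowAppToPostgres = {
  apiVersion: "networking.k8s.io/v1",
  kind: "NetworkPolicy",
  metadata: { name: "allow-clockzen", namespace: "databases" }, // placeholder namespace
  spec: {
    podSelector: { matchLabels: { "cnpg.io/cluster": "pg-main" } }, // placeholder selector
    policyTypes: ["Ingress"],
    ingress: [
      {
        from: [{ namespaceSelector: { matchLabels: { "kubernetes.io/metadata.name": "clockzen" } } }],
        ports: [{ protocol: "TCP", port: 5432 }],
      },
    ],
  },
};
```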
How Deployment Works
Two paths, both triggered by pushing to GitHub:
- Infrastructure changes (infra/) — GitHub Actions runs tofu plan + tofu apply against Hetzner. State in Cloudflare R2. Provisions nodes, LB, and bootstraps Talos + ArgoCD.
- Application changes (k8s-apps/) — TypeScript build generates ArgoCD Application YAMLs into k8s/generated/. GitHub Actions applies them. ArgoCD auto-syncs with prune and self-heal (a sketch of one generated Application follows below).
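For reference, one generated Application object looks roughly like this. The repo URL and path are placeholders, while the apiVersion, kind, and syncPolicy fields are the standard ArgoCD ones that give the prune + self-heal behavior described above.

```typescript
// Sketch of a generated ArgoCD Application. Standard argoproj.io/v1alpha1
// schema; repoURL and path are placeholders, not copied from the repo.
const generatedApplication = {
  apiVersion: "argoproj.io/v1alpha1",
  kind: "Application",
  metadata: { name: "clockzen", namespace: "argocd" },
  spec: {
    project: "default",
    source: {
      repoURL: "https://github.com/<org>/talos-redux", // placeholder
      path: "k8s/generated/clockzen",                  // placeholder path
      targetRevision: "HEAD",
    },
    destination: { server: "https://kubernetes.default.svc", namespace: "clockzen" },
    syncPolicy: {
      automated: { prune: true, selfHeal: true },
      syncOptions: ["CreateNamespace=true"], // assumption
    },
  },
};
```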
Secrets live in GitHub Secrets (CI) and Kubernetes Secrets (runtime). The kubeconfig and talosconfig are uploaded to Cloudflare R2 during provisioning and downloaded by CI workflows at deploy time.
What It Actually Costs
The original dream was €13.50/month. Reality hit fast—the nodes got bigger, the worker count grew, and one month the autoscaler ran away to a $100+ bill. After taming the autoscaler and adding cost alerts, the baseline settled at about €18/month:
| Component | Spec | Monthly Cost |
|---|---|---|
| Control Plane | 1× CPX31 (4 vCPU, 8 GB) | €4.50 |
| Static Workers | 3× CPX21 (3 vCPU, 4 GB each) | €9.00 |
| Load Balancer | Hetzner LB11 | €4.50 |
| Autoscale Workers | 0–5× CPX21 (on-demand) | €0–15.00 |
| Cloudflare | R2 state storage, DNS, Access | Free tier |
| GitHub | Actions CI/CD, ARC runners | Free tier |
| Total (base) | | ≈ €18 / month |
For comparison, an equivalent setup on AWS EKS (managed control plane + 3 comparable nodes + ALB + storage) runs ~$300+/month before egress, and GKE/AKS lands around $200+/month. Hetzner includes 20 TB of egress in the base price—on public clouds, that bandwidth alone costs more than the entire Hetzner bill.
The tradeoffs are real: single control plane (no HA), self-managed Talos upgrades, no IAM integration, and the operational burden is entirely mine. But for a personal platform running side projects, the economics are hard to argue with—especially when the entire thing was built in 83 days of pair-programming with an AI that never gets tired of writing Helm values.