How Claude Code and I Built an €18/Month Kubernetes Platform

239 commits, 83 days, one AI copilot. The full story of building a production Kubernetes platform on Hetzner — from a €13.50 dream through a $100+ autoscaling bill to a stable €18/month.

This post was almost entirely written by Claude (AI) while I prompted and directed the content. The infrastructure it describes was also built in collaboration with Claude Code.

The Premise

I wanted a place to build and host my own applications without parachuting into someone else's platform. A personal Kubernetes cluster on Hetzner, managed by Talos Linux, provisioned with OpenTofu, deployed via ArgoCD. The original plan was a stub—an outline of decisions to vet. Then I started building it with Claude Code and 83 days later, 239 commits deep, the platform is running production workloads for about €18 a month.

Nearly every commit in this repository was co-authored with Claude Code. Not generated and copy-pasted—pair-programmed. I described what I wanted, Claude wrote the Terraform, the Helm values, the TypeScript manifest generators, the GitHub Actions workflows, the network policies. I reviewed, steered, and hit enter. The entire cluster—from the first tofu apply to the Prometheus alerting rules that page me on Discord—was built this way.

What follows is the full story: the €13.50 dream, the day-one memory crisis, the three-month TLS redirect loop saga, the $100+ autoscaling bill, and where things landed. Everything described here is live in the talos-redux repository.

The Journey

The €13.50 Dream (Nov 24, 2025)

It started at 10:06 PM with 14 commits in a single evening. The vision: 1× cpx11 control plane + 2× cpx11 workers (2 vCPU, 2 GB RAM each) for €13.50/month. The first commit used the kube-hetzner Terraform module, but within an hour I discovered it was K3s-based, not Talos. By 11:03 PM, the entire foundation had pivoted to hcloud-k8s/kubernetes for real Talos Linux—immutable OS, no SSH, API-driven everything. Digger for IaC was dropped the same night in favor of direct OpenTofu workflows. Claude Code wrote the replacement module config, the R2 backend configuration, and the ArgoCD bootstrap Helm values in rapid succession.

Day Two: Memory Crisis and TLS Wars (Nov 25–26)

By morning the cluster was dying. The control plane was at 133% memory utilization, the controller-manager was crash-looping, and 87 evicted pods littered the namespace. Two gigabytes of RAM was not enough to run Kubernetes. The nodes were upgraded from cpx11 to cpx21 (4 GB RAM) and the €13.50 dream became €18.

Then the TLS redirect loops started. Cloudflare terminates TLS at its edge, so traffic arrives at nginx-ingress over HTTP. But the Kubernetes Dashboard and ArgoCD both expected HTTPS and returned 307 redirects, creating infinite loops. The Dashboard fight consumed a full day. ArgoCD was worse—25 debug/fix commits in a single day, a stream of debug:, fix:, ci: prefixes escalating through the git log. The root cause was simple once found: ArgoCD needed the --insecure flag since TLS was already terminated upstream. This same class of bug would quietly haunt every new application for the next three months until a universal ssl-redirect: false annotation was standardized with a regression test in the builder.
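
For the record, the fix comes down to two small pieces of configuration. The sketch below is illustrative, not copied from the repo: the Ingress host and names are hypothetical, and the repo may set ArgoCD's insecure flag through a different values path.

```yaml
# Ingress sketch: Cloudflare already terminated TLS, so nginx-ingress
# must not bounce plain-HTTP traffic back to HTTPS.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app                 # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
  ingressClassName: nginx
  rules:
    - host: example.frodojo.com     # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80
---
# argo-cd Helm values sketch: run argocd-server without its own TLS
# (the chart-level equivalent of the --insecure flag), since TLS ends
# upstream at Cloudflare.
configs:
  params:
    server.insecure: true
```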

Building the Platform (Nov 26 – Dec 2)

With the cluster stable, the real building started. Claude Code and I assembled the platform layer by layer:

  • Observability: Prometheus (kube-prometheus-stack), Grafana with Google OAuth, and Loki + Promtail for log aggregation. All wired together with pre-configured datasources.
  • Database: CloudNativePG deployed PostgreSQL 17 with automatic failover—1 primary + 1 replica, 20 Gi per instance, pod anti-affinity across nodes.
  • First apps: BigCartBuddy (a receipt scanning app) became the first real workload. A private Docker registry went up at registry.frodojo.com.
  • Cost optimization: PVC sizes were slashed (Loki 50Gi→10Gi, Prometheus 30Gi→10Gi). Every stateful workload got a nodeSelector: worker to pin it to static nodes so the cluster autoscaler could freely manage dynamic capacity without evicting Prometheus mid-scrape (see the sketch after this list).
  • Automation: n8n for workflow automation, ARC runners for self-hosted GitHub Actions inside the cluster, and argocd-image-updater to auto-deploy new container images.
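
To make the pinning concrete, here is roughly what it looks like as kube-prometheus-stack Helm values, assuming the nodepool: worker label that the static nodes carry; the actual values files in the repo may be organized differently.

```yaml
# kube-prometheus-stack values sketch: keep the stateful monitoring
# components on the static workers so autoscale nodes can come and go
# without evicting Prometheus mid-scrape.
prometheus:
  prometheusSpec:
    nodeSelector:
      nodepool: worker
alertmanager:
  alertmanagerSpec:
    nodeSelector:
      nodepool: worker
grafana:
  nodeSelector:
    nodepool: worker
```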

The application definition system evolved three times during this period. It started with CDK8s, moved to lightweight TypeScript classes, and settled on the current k8s-apps/ builder pattern that generates ArgoCD Application YAMLs from typed config. Each iteration was simpler than the last.
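
The builder's output is plain ArgoCD Application manifests. A hand-written sketch of what one generated file might look like; the app name, repo URL, and paths are placeholders, not values from the repo.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bigcartbuddy                                  # illustrative app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/talos-redux   # placeholder URL
    targetRevision: main
    path: k8s-apps/charts/bigcartbuddy                # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: bigcartbuddy
  syncPolicy:
    automated:
      prune: true      # auto-sync with prune and self-heal,
      selfHeal: true   # matching the ArgoCD setup described later
    syncOptions:
      - CreateNamespace=true
```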

Applications Arrive (Dec 7 – Jan 13)

December 7 brought one of the more novel additions: a Discord bot powered by Claude's Agent SDK with full kubectl access to the cluster. It can query Prometheus, generate charts, and report on cluster health directly in Discord. The SDK integration took multiple iterations—Claude Code helped me try three different Anthropic SDK packages before landing on the right one.

December 26 was the big deployment day—three major additions in a single session: Beyond Cloud (auth service + frontend at usebey.com), Tknscope (token analysis platform with its own CNPG database), and a comprehensive alerting system (14 PrometheusRules, Alertmanager→Discord, kubernetes-event-exporter). The cluster immediately ran out of capacity. A third static worker was added on December 27, and on December 31 the control plane was upgraded from cpx21 to cpx31 (8 GB RAM) after hitting 94% memory.

January brought ClockZen (time tracking), DJWriter (content generation), Claude Runner (AI integration), and Plane CE (project management)—each deployed through the same GitOps pipeline.

The $100+ Bill

Then came the Hetzner invoice: over $100. The cluster autoscaler had been spinning up cpx21 nodes in response to pending pods and never tearing them down fast enough. With no cost guardrails in place, autoscale workers silently accumulated and the bill ballooned from the expected €18 baseline to well past $100.

The fix was multi-layered. Autoscale node lifetime was capped so nodes are torn down after 6+ hours of inactivity. Stateful workloads were pinned to static workers with explicit nodeSelector rules so they never trigger autoscaling. And most importantly, cost alerting rules were added to Prometheus: alerts fire if unexpected nodes appear, if the autoscaler creates more than 10 nodes in 24 hours, if any autoscale node runs longer than 6 hours, or if total volume capacity exceeds 300 Gi. Critical cost alerts push to my phone via ntfy.sh. The bill came back down to the €18 baseline.
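
Two of those guardrails, sketched as a PrometheusRule. The kube-state-metrics series used here (kube_node_info, kube_node_created) are standard, but the thresholds and the autoscale node-name pattern are assumptions rather than the exact rules in the repo.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-guardrails             # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: cost.rules
      rules:
        # More nodes than 1 control plane + 3 static workers plus a
        # small autoscale allowance.
        - alert: UnexpectedNodeCount
          expr: 'count(kube_node_info) > 5'   # assumed threshold
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "More nodes than expected, check the autoscaler"
        # Any autoscale node alive longer than 6 hours.
        - alert: AutoscaleNodeTooOld
          expr: '(time() - kube_node_created{node=~".*autoscale.*"}) > 6 * 3600'
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "An autoscale node has been running for over 6 hours"
```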

Security Hardening (Jan 31)

For the first two months, security was an afterthought. January 31 brought a dedicated hardening sprint: the Discord agent's ClusterRole lost secrets write and pods/exec permissions, the Dashboard was downgraded from cluster-admin to view-only, all frontends got securityContext (runAsNonRoot, drop ALL capabilities, seccomp RuntimeDefault), and NetworkPolicy resources were added to restrict database access to only the application namespaces that need it. TLS was enforced on every ingress.
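
The frontend hardening is only a few lines per Deployment. A minimal sketch of the pod and container securityContext that the post describes; the surrounding Deployment fields and the image name are illustrative.

```yaml
# Deployment pod template fragment (sketch)
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: frontend                                      # illustrative
          image: registry.frodojo.com/example-frontend:latest # placeholder
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
```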

What’s Running Now

The cluster runs Talos Linux v1.12.2 on Kubernetes v1.33.7, provisioned through the hcloud-k8s OpenTofu module (v3.21.1) in Hetzner's Ashburn region:

  • 1 control plane node — CPX31 (4 vCPU, 8 GB RAM). Single control plane, not HA. Rebuilding from state takes minutes.
  • 3 static worker nodes — CPX21 (3 vCPU, 4 GB RAM each). Labeled nodepool: worker; stateful workloads are pinned to these nodes.
  • 0–5 autoscale workers — CPX21 nodes created on-demand. Labeled nodepool: worker-autoscale. Torn down after 6+ hours idle. Closely monitored after the $100 lesson.
  • Hetzner LB11 — Load balancer in front of nginx-ingress, managed by OpenTofu.

Platform topology: Cloudflare proxies traffic to a Hetzner LB, which routes through nginx-ingress to the cluster. ArgoCD syncs from GitHub. OpenTofu provisions infrastructure with state stored in Cloudflare R2.

The Stack

Everything is deployed as Helm charts managed by ArgoCD Application resources, generated by the TypeScript builder described above. Source of truth: k8s-apps/.

Networking & Ingress

  • Cilium — CNI plugin (built into hcloud-k8s). Pod networking and Kubernetes NetworkPolicy enforcement.
  • nginx-ingress — Ingress controller behind the Hetzner LB. TLS terminated with cert-manager certificates.
  • external-dns v1.14.3 — Auto-creates Cloudflare DNS records from Ingress resources for *.frodojo.com, clockzen.com, bigcartbuddy.com, and usebey.com. Proxied through Cloudflare.
  • cert-manager — Let's Encrypt via DNS-01 against Cloudflare. Issuer: letsencrypt-cloudflare (sketched after this list).
  • Cloudflare Access — Zero Trust protection for internal services (ArgoCD, Grafana, Dashboard) via Google OAuth.
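
The letsencrypt-cloudflare issuer referenced above is a standard cert-manager ClusterIssuer with a DNS-01 Cloudflare solver. A sketch, assuming the API token lives in a Secret named cloudflare-api-token; the email and secret names are placeholders.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-cloudflare
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                # placeholder contact email
    privateKeySecretRef:
      name: letsencrypt-cloudflare-key      # assumed secret name
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token    # assumed secret name
              key: api-token
```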

Monitoring & Alerting

  • Prometheus (kube-prometheus-stack v65.2.0) — 7-day retention, 20 GB on hcloud-volumes. Custom PrometheusRules for nodes, pods, deployments, PVCs, certificates, Loki health, cost tracking, and ArgoCD sync status.
  • Grafana v7.0.0 — 10 Gi persistence. Dashboards for cluster overview, node metrics, pod detail, ARC runner stats, and Hetzner cost tracking.
  • Loki (loki-stack v2.10.2) — 30 Gi storage, 7-day retention, rate-limited to 10 MB/s ingestion after an early log flood incident.
  • Alertmanager — Critical alerts to Discord immediately, warnings aggregated hourly. Cost alerts push to phone via ntfy.sh.
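
That routing split (critical straight to Discord, warnings batched hourly) maps to a small Alertmanager route tree. A sketch assuming Alertmanager's native Discord receiver (v0.25+); webhook URLs are placeholders and the ntfy.sh hop is left out.

```yaml
route:
  receiver: discord-warnings        # default route: batched warnings
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 1h                # warnings aggregated hourly
  routes:
    - matchers:
        - severity = "critical"
      receiver: discord-critical
      group_wait: 0s                # critical alerts go out immediately
receivers:
  - name: discord-critical
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/REDACTED   # placeholder
  - name: discord-warnings
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/REDACTED   # placeholder
```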

Data

  • CloudNativePG — PostgreSQL 17.2 with 2 instances for failover. 20 Gi per instance, pod anti-affinity across nodes. Managed roles for ClockZen, Tknscope, and DJWriter (a Cluster sketch follows this list).
  • hcloud-volumes — Hetzner's block storage CSI driver. ~120 Gi provisioned across the cluster.
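
Those CloudNativePG settings correspond roughly to a Cluster resource like the sketch below; the cluster name, namespace, image tag, and storage-class wiring are assumptions layered on top of what the post states.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main                     # illustrative name
  namespace: cnpg                   # illustrative namespace
spec:
  instances: 2                      # 1 primary + 1 replica
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2   # assumed image tag
  storage:
    size: 20Gi
    storageClass: hcloud-volumes
  affinity:
    enablePodAntiAffinity: true     # spread primary and replica across nodes
    topologyKey: kubernetes.io/hostname
    nodeSelector:
      nodepool: worker              # keep the database on static workers
  managed:
    roles:
      - name: clockzen
        ensure: present
        login: true
      - name: tknscope
        ensure: present
        login: true
      - name: djwriter
        ensure: present
        login: true
```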

CI/CD & GitOps

  • ArgoCD v7.7.11 — Root Application watches apps/dist/, auto-syncs with prune and self-heal. ApplicationSet auto-discovers repos in the GitHub org every 3 minutes. Google OAuth with admin for [email protected].
  • GitHub Actions — Infrastructure changes (infra/) trigger OpenTofu. App changes (k8s-apps/) generate and apply manifests. Kubeconfig pulled from R2 at runtime.
  • ARC Runners (v0.10.1) — 0–5 self-hosted GitHub Actions runners with Docker-in-Docker. Currently serving clockzen-next.
  • ArgoCD Image Updater — Auto-updates image tags from container registries.

Applications

  • ClockZen — Time tracking (API + frontend).
  • Tknscope — Token analysis (API + frontend + marketing).
  • DJWriter — Content generation.
  • n8n — Workflow automation.
  • Beyond Cloud — Auth service + frontend (usebey.com).
  • Claude Runner — AI integration service.
  • Discord K8s Agent — Claude-powered cluster management bot.
  • Plane CE — Project management (plane.so).

Each application deploys to its own namespace with default-deny NetworkPolicy. Selective rules open only the required paths.
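
The per-namespace pattern, sketched: a default-deny policy plus one selective egress rule toward the database namespace. The namespace names and the cnpg label value are placeholders, not the actual manifests.

```yaml
# Deny all ingress and egress for every pod in the app namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: clockzen               # illustrative app namespace
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
# Then open only the required paths, e.g. DNS plus PostgreSQL in the
# database namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-postgres
  namespace: clockzen
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: cnpg     # assumed DB namespace
      ports:
        - protocol: TCP
          port: 5432
```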

How Deployment Works

Two paths, both triggered by pushing to GitHub:

  • Infrastructure changes (infra/) — GitHub Actions runs tofu plan + tofu apply against Hetzner. State in Cloudflare R2. Provisions nodes, LB, and bootstraps Talos + ArgoCD.
  • Application changes (k8s-apps/) — TypeScript build generates ArgoCD Application YAMLs into k8s/generated/. GitHub Actions applies them. ArgoCD auto-syncs with prune and self-heal.
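
A rough shape of the application-side workflow. This is heavily hedged: the workflow name, paths filter, npm scripts, bucket name, and the R2 download step are assumptions about how the pipeline is wired, not the repo's actual workflow file.

```yaml
name: deploy-apps                   # hypothetical workflow
on:
  push:
    branches: [main]
    paths: ["k8s-apps/**"]
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - name: Generate ArgoCD Application manifests
        run: |
          npm ci            # assumes an npm-based build at the repo root
          npm run build     # writes Application YAML into k8s/generated/
      - name: Fetch kubeconfig from R2 and apply
        env:
          # Placeholder secret names; R2 is S3-compatible, so a plain
          # S3 client works for the download.
          AWS_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: auto  # R2 accepts the literal region "auto"
        run: |
          aws s3 cp s3://tofu-state/kubeconfig ./kubeconfig \
            --endpoint-url "${{ secrets.R2_ENDPOINT }}"
          kubectl --kubeconfig ./kubeconfig apply -f k8s/generated/
```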

Secrets live in GitHub Secrets (CI) and Kubernetes Secrets (runtime). The kubeconfig and talosconfig are uploaded to Cloudflare R2 during provisioning and downloaded by CI workflows at deploy time.

Deployment flow: infrastructure changes flow through OpenTofu to Hetzner; application changes flow through the TypeScript builder to ArgoCD.

What It Actually Costs

The original dream was €13.50/month. Reality hit fast—the nodes got bigger, the worker count grew, and one month the autoscaler ran away to a $100+ bill. After taming the autoscaler and adding cost alerts, the baseline settled at about €18/month:

Component            Spec                             Monthly Cost
Control Plane        1× CPX31 (4 vCPU, 8 GB)          €4.50
Static Workers       3× CPX21 (3 vCPU, 4 GB each)     €9.00
Load Balancer        Hetzner LB11                     €4.50
Autoscale Workers    0–5× CPX21 (on-demand)           €0–15.00
Cloudflare           R2 state storage, DNS, Access    Free tier
GitHub Actions       CI/CD, ARC runners               Free tier
Total (base)                                          ≈ €18 / month

For comparison, an equivalent setup on AWS EKS (managed control plane + 3 comparable nodes + ALB + storage) runs ~$300+/month before egress, and GKE/AKS lands around $200+/month. Hetzner includes 20 TB of egress in the base price—on public clouds, that bandwidth alone costs more than the entire Hetzner bill.

The tradeoffs are real: single control plane (no HA), self-managed Talos upgrades, no IAM integration, and the operational burden is entirely mine. But for a personal platform running side projects, the economics are hard to argue with—especially when the entire thing was built in 83 days of pair-programming with an AI that never gets tired of writing Helm values.