Domain: infra

Curated entry point for the substrate that every other domain (including platform) runs on top of: AWS Terraform, the homelab Kubernetes cluster, Kafka / Vault / Aurora / Redis / MinIO cluster administration, the GitLab self-hosted server + project administration, DNS, certificates, networking. Cross-functional: owned across the whole engineering org rather than by any one business team.

See the sibling platform/ domain for the engineer-facing capabilities (shared TypeScript packages, on-call agent, alert/SLO/runbook content, CI scripts, dev tooling) that consume this substrate.

Owner

Founder + agents (Phase 1). Future team shape: a small infrastructure / SRE function once the company grows beyond 3-4 engineers; that team owns the substrate that the platform team's capabilities ride on.

Mission

Owns the substrate every workload runs on, in production and non-production:

  • AWS production substrate (Terraform-managed): EKS cluster, Aurora Serverless v2 Multi-AZ Postgres, S3 buckets + Object Lock + KMS, IAM roles + OIDC federation to GitLab, Route 53 DNS, ACM certificates, networking (VPC, subnets, NAT, security groups, VPC endpoints), CloudFront, CloudTrail.
  • Homelab non-production substrate: bare-metal Proxmox + Ceph host, the on-prem K8s cluster (control plane + nodes), Traefik ingress, cert-manager, Cloudflare Tunnel, persistent volume provisioning (Ceph CSI + MinIO).
  • Cluster-admin of shared services running on the above substrates: Kafka brokers (Strimzi-managed), Vault server (3-node Raft cluster, KMS auto-unseal), Redis (Sentinel / cluster), MinIO, Centrifugo deployment + config.
  • Observability-stack substrate (vs. observability content, which is platform): kube-prometheus-stack Helm release, Loki, Tempo, Alertmanager — Helm values, retention policies, scrape configs, storage backends.
  • GitLab self-hosted: project administration of the astrix/ namespace (group hierarchy, project settings, CI variables, OIDC trust to AWS) on the founder's pre-existing self-hosted instance at gitlab.txap.co. Server administration of the GitLab instance itself (backups, upgrades, runner pool, host OS) is owned by the founder's homelab and predates the Astrix bootstrap; we audit-and-reuse rather than re-provision.
  • Cross-cutting cluster resources: namespaces, RBAC, NetworkPolicies, Pod Security Standards.
  • Disaster recovery: backup procedures, snapshot policies, region-failover playbooks.

Boundary: infra owns the substrate. It does not own the engineer-facing TypeScript packages (@astrix/observability, etc.), the on-call agent code, or the alert/SLO/runbook content — those belong to platform. It also does not own application workloads — those belong to the relevant business domain.

Rule of thumb: if it ships as a Terraform module, Helm chart values for a substrate service, raw K8s manifest, or a cluster admin task (DNS record, certificate rotation, IAM policy, GitLab project setting), it's infra. If it ships as a TypeScript package or a YAML rule file under infra/observability/{prometheus/rules,slos}/<x>.yaml, it's platform.

Code paths

  • AWS Terraform: infra/terraform/{aws-tfstate-bootstrap,networking,eks,aurora,s3,iam,kms,route53,acm,cloudfront,cloudtrail}/
  • GitLab Terraform: infra/terraform/gitlab/astrix/ namespace administration (groups, projects, labels, runners, OIDC) on the existing gitlab.txap.co instance. See docs/standards/gitlab-bootstrap.md.
  • Vault Terraform: infra/terraform/vault-config/ — KV v2 mount, base policies, Phase-1b JWT auth. See docs/standards/vault-install-homelab.md for the manual Vault server install playbook.
  • Helm releases for substrate services: infra/helm/observability/, infra/helm/platform/{vault,kafka,redis,minio,centrifugo,docusaurus}/
  • Cluster-scoped K8s manifests: infra/k8s/{namespaces,rbac,network-policies,cert-manager,traefik,cloudflare-tunnel,vault}/
  • GitLab self-hosted operational scripts: infra/gitlab-server/ (Phase 0 deliverable; will hold backup scripts, upgrade runbooks, runner pool definitions)
  • DR / runbook scripts: infra/dr/ (snapshot rotation, restore drills)

Phase-0 substrate bootstrap order

The substrate is bootstrapped in this order; each step is the prerequisite for the next:

  1. DNS records for any homelab service we'll provision. Phase 0 needs vault.dev.astrixtrading.com → 68.203.212.75 (homelab public IP) on Route 53. See docs/standards/dns-naming-convention.md.
  2. Vault server (homelab K8s) — manual playbook; Helm install + init + Shamir unseal + KV v2 enable. Reachable at vault.dev.astrixtrading.com. See docs/standards/vault-install-homelab.md.
  3. infra/terraform/vault-config/ — base Vault policies, KV path skeletons. Local state at Phase 0.
  4. infra/terraform/gitlab/astrix/ namespace administration; writes bot tokens to Vault. Local state at Phase 0.
  5. infra/terraform/aws-tfstate-bootstrap/ — S3 state bucket + KMS key. Local state (the bootstrapper for the bootstrappers).
  6. Migrate vault-config + GitLab modules to S3 state via terraform init -migrate-state.
  7. AWS landing zone Terraform modules (networking, EKS, Aurora, etc.) — apply with S3 state from day 1.
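Step 6 can be sketched as the backend block each module gains before re-running init. Bucket name, key, region, and KMS alias below are illustrative assumptions; the real values come from the aws-tfstate-bootstrap outputs.

```hcl
# Illustrative backend block for infra/terraform/vault-config/ — bucket name,
# region, and KMS alias are assumptions, not the real Phase-0 values.
terraform {
  backend "s3" {
    bucket     = "astrix-tfstate"                  # from aws-tfstate-bootstrap
    key        = "vault-config/terraform.tfstate"
    region     = "us-east-1"
    encrypt    = true
    kms_key_id = "alias/terraform-state"
  }
}
```

With the block in place, terraform init -migrate-state copies the local state file into the bucket and re-points the module at S3.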

Naming convention

All hostnames follow docs/standards/dns-naming-convention.md: production at *.astrixtrading.com, homelab non-prod at *.dev.astrixtrading.com, future managed staging at *.staging.astrixtrading.com. gitlab.txap.co is the single grandfathered exception.

(Note: infra/observability/{prometheus/rules,slos}/<domain>.{yaml,slo.yaml} files contain alert / SLO content and are owned by platform. The infra/helm/observability/ Helm release that runs Prometheus / Grafana / Loki / Tempo / Alertmanager is infra.)

PRD chapters that touch this domain

(Empty — will populate.)

TDD chapters

(Empty — will populate. Expected: Phase 0 GitLab project bootstrap TDD, AWS landing-zone Terraform TDD, EKS cluster TDD, Aurora cluster TDD, Vault server TDD, Kafka cluster TDD, observability-stack-Helm TDD, homelab→prod parity TDD, DR TDD.)

ADRs that affected this domain

(All future substrate ADRs will land here. Examples to expect: GitLab project hierarchy, AWS account topology, EKS node-group strategy, Aurora capacity planning, Kafka cluster sizing, Vault Raft topology, DR + region-failover strategy.)

Service interfaces this domain exposes

Infra exposes substrate, not application service interfaces. The "interfaces" are:

  • AWS endpoints (Aurora connection string, S3 bucket ARNs, EKS API URL) — consumed via @astrix/clients/aws/* typed clients (which themselves live in platform).
  • K8s API — consumed by Helm charts and the platform's CI scripts.
  • Vault HTTP API — consumed via @astrix/clients/vault (a platform package).
  • Kafka broker addresses — consumed via @astrix/clients/kafka.
  • GitLab API — consumed by scripts/create-issue.ts and other CI tooling.

Breaking changes to substrate (e.g., Aurora connection-string format change, K8s API version bump that breaks a manifest) require an ADR and a coordinated rollout plan with platform and the affected business domains.

Events

Infra itself does not produce or consume Kafka domain events.

External integrations

This is the domain that talks to all the substrate vendors:

  • AWS — EKS, Aurora, S3, KMS, IAM, Route 53, ACM, CloudFront, CloudTrail (Q-I series)
  • HashiCorp Vault — server administration; KMS auto-unseal, Transit, PKI, Secrets engines (Q-E16)
  • Strimzi — Kafka operator on K8s (Q-E2)
  • Cloudflare — DNS, Access (auth proxy for non-prod), CDN, Tunnel
  • GitLab self-hosted — server administration + project administration (Q-E1)
  • Sealed Secrets / Vault Secrets Operator — secret distribution into K8s
  • PagerDuty — webhook/integration infrastructure (rotating the integration secrets and wiring them through Vault); the on-call agent that consumes PagerDuty pages is platform
  • Cert-manager + Let's Encrypt / ACM — certificate issuance

Homelab edge architecture

internet (port 80/443)
    ↓
homelab public IP 68.203.212.75
    ↓
HAProxy on the edge router (host-based routing by Host header / SNI)
    ↓
k3s Traefik LoadBalancer (10.0.1.119/120/121/122 :80/:443)
    ↓
Traefik Ingress / IngressRoute (per-namespace routing inside the cluster)
    ↓
backend Service / Pod

HAProxy is the public-facing entrypoint. Adding a new public hostname to a homelab service requires:

  1. DNS record on Route 53 / Cloudflare (e.g., *.dev.astrixtrading.com → 68.203.212.75) — see docs/standards/dns-naming-convention.md.
  2. HAProxy ACL or default-backend rule on the edge router routing the new hostname to the k3s Traefik LB IP.
  3. K8s Ingress / IngressRoute inside the cluster.
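Step 2 can be sketched as an haproxy.cfg fragment. This assumes a TCP/SNI-passthrough setup (TLS terminates at Traefik); the frontend/backend names and the wildcard matching rule are illustrative, not the edge router's actual config.

```
frontend https-in
    bind *:443
    mode tcp
    tcp-request inspect-delay 5s
    tcp-request content accept if { req.ssl_hello_type 1 }
    # Route everything under *.dev.astrixtrading.com to the k3s Traefik LB
    use_backend k3s_traefik if { req.ssl_sni -m end .dev.astrixtrading.com }

backend k3s_traefik
    mode tcp
    server traefik-1 10.0.1.119:443 check
```

With a wildcard rule like this, new dev hostnames only need steps 1 and 3; a per-hostname ACL would instead need a new use_backend line per service.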

Skipping the HAProxy step returns HAProxy's default 503 page ("503 Service Unavailable / No server is available to handle this request."), which is distinguishable from Traefik's 503 (different markup). Diagnostic shorthand: if a curl against the public IP returns 503 but a direct curl to 10.0.1.119 works, HAProxy is the missing layer.

Runbooks for this domain

(Empty — will populate. Expected high-priority Phase 1b runbooks: aurora-failover, kafka-broker-down, vault-sealed, eks-control-plane-degraded, gitlab-server-down, homelab-cluster-down, dns-failure, certificate-expiring, s3-object-lock-replication-lag.)

On-call

Infra on-call is the most operationally sensitive — substrate failures cascade to every other domain. Routing per Section 17.6.1: substrate Sev 0 / Sev 1 page directly via PagerDuty (the infra rotation, distinct from the platform rotation, even though Phase 1 has the same human on both); Sev 2-4 route to #alerts-warnings / #alerts-monitoring.

The L2 diagnostic agent runs under a separate read-only IAM role / Vault policy for substrate diagnosis (per the ADR-0004 RBAC bullet).
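A minimal Terraform sketch of that read-only split, with hypothetical resource names (the real role and policy live in the infra modules and ADR-0004):

```hcl
# Hypothetical names; shown only to illustrate the read-only separation.
resource "aws_iam_role" "l2_diagnostic" {
  name               = "l2-diagnostic-readonly"
  assume_role_policy = data.aws_iam_policy_document.l2_trust.json # OIDC trust, elided
}

resource "aws_iam_role_policy_attachment" "l2_readonly" {
  role       = aws_iam_role.l2_diagnostic.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess" # AWS-managed read-only
}

resource "vault_policy" "l2_diagnostic" {
  name   = "l2-diagnostic-readonly"
  policy = <<-EOT
    path "secret/data/*"     { capabilities = ["read"] }
    path "secret/metadata/*" { capabilities = ["list"] }
  EOT
}
```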

Cross-domain dependencies

  • This domain calls: nothing in the application layer.
  • Every other domain depends on this layer's substrate, either directly (the platform domain's clients wrap substrate APIs) or transitively (every business domain calls @astrix/clients/*).