Domain: infra
Curated entry point for the substrate that every other domain (including platform) runs on top of: AWS Terraform, the homelab Kubernetes cluster, Kafka / Vault / Aurora / Redis / MinIO cluster administration, the GitLab self-hosted server + project administration, DNS, certificates, networking. Cross-functional: owned across the whole engineering org rather than by any one business team. See the sibling platform/ domain for the engineer-facing capabilities (shared TypeScript packages, on-call agent, alert/SLO/runbook content, CI scripts, dev tooling) that consume this substrate.
Owner
Founder + agents (Phase 1). Future team shape: a small infrastructure / SRE function once the company grows beyond 3-4 engineers; that team owns the substrate that the platform team's capabilities ride on.
Mission
Owns the substrate every workload runs on, in production and non-production:
- AWS production substrate (Terraform-managed): EKS cluster, Aurora Serverless v2 Multi-AZ Postgres, S3 buckets + Object Lock + KMS, IAM roles + OIDC federation to GitLab, Route 53 DNS, ACM certificates, networking (VPC, subnets, NAT, security groups, VPC endpoints), CloudFront, CloudTrail.
- Homelab non-production substrate: bare-metal Proxmox + Ceph host, the on-prem K8s cluster (control plane + nodes), Traefik ingress, cert-manager, Cloudflare Tunnel, persistent volume provisioning (Ceph CSI + MinIO).
- Cluster-admin of shared services running on the above substrates: Kafka brokers (Strimzi-managed), Vault server (3-node Raft cluster, KMS auto-unseal), Redis (Sentinel / cluster), MinIO, Centrifugo deployment + config.
- Observability-stack substrate (vs. observability content, which is platform): kube-prometheus-stack Helm release, Loki, Tempo, Alertmanager — Helm values, retention policies, scrape configs, storage backends.
- GitLab self-hosted: project administration of the astrix/ namespace (group hierarchy, project settings, CI variables, OIDC trust to AWS) on the founder's pre-existing self-hosted instance at gitlab.txap.co. Server administration of the GitLab instance itself (backups, upgrades, runner pool, host OS) is owned by the founder's homelab and predates the Astrix bootstrap; we audit-and-reuse rather than re-provision.
- Cross-cutting cluster resources: namespaces, RBAC, NetworkPolicies, Pod Security Standards.
- Disaster recovery: backup procedures, snapshot policies, region-failover playbooks.
Boundary: infra owns the substrate. It does not own the engineer-facing TypeScript packages (@astrix/observability, etc.), the on-call agent code, or the alert/SLO/runbook content — those belong to platform. It also does not own application workloads — those belong to the relevant business domain.
Rule of thumb: if it ships as a Terraform module, Helm chart values for a substrate service, raw K8s manifest, or a cluster admin task (DNS record, certificate rotation, IAM policy, GitLab project setting), it's infra. If it ships as a TypeScript package or a YAML rule file under infra/observability/{prometheus/rules,slos}/<x>.yaml, it's platform.
Code paths
- AWS Terraform: infra/terraform/{aws-tfstate-bootstrap,networking,eks,aurora,s3,iam,kms,route53,acm,cloudfront,cloudtrail}/
- GitLab Terraform: infra/terraform/gitlab/ — astrix/ namespace administration (groups, projects, labels, runners, OIDC) on the existing gitlab.txap.co instance. See docs/standards/gitlab-bootstrap.md.
- Vault Terraform: infra/terraform/vault-config/ — KV v2 mount, base policies, Phase-1b JWT auth. See docs/standards/vault-install-homelab.md for the manual Vault server install playbook.
- Helm releases for substrate services: infra/helm/observability/, infra/helm/platform/{vault,kafka,redis,minio,centrifugo,docusaurus}/
- Cluster-scoped K8s manifests: infra/k8s/{namespaces,rbac,network-policies,cert-manager,traefik,cloudflare-tunnel,vault}/
- GitLab self-hosted operational scripts: infra/gitlab-server/ (Phase 0 deliverable; will hold backup scripts, upgrade runbooks, runner pool definitions)
- DR / runbook scripts: infra/dr/ (snapshot rotation, restore drills)
Phase-0 substrate bootstrap order
The substrate is bootstrapped in this order; each step is the prerequisite for the next:
- DNS records for any homelab service we'll provision. Phase 0 needs vault.dev.astrixtrading.com → 68.203.212.75 (homelab public IP) on Route 53. See docs/standards/dns-naming-convention.md.
- Vault server (homelab K8s) — manual playbook: Helm install + init + Shamir unseal + KV v2 enable. Reachable at vault.dev.astrixtrading.com. See docs/standards/vault-install-homelab.md.
- infra/terraform/vault-config/ — base Vault policies, KV path skeletons. Local state at Phase 0.
- infra/terraform/gitlab/ — astrix/ namespace administration; writes bot tokens to Vault. Local state at Phase 0.
- infra/terraform/aws-tfstate-bootstrap/ — S3 state bucket + KMS key. Local state (the bootstrapper for the bootstrappers).
- Migrate vault-config + GitLab modules to S3 state via terraform init -migrate-state.
- AWS landing zone Terraform modules (networking, EKS, Aurora, etc.) — apply with S3 state from day 1.
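The state-migration step above amounts to adding an S3 backend block to each module and re-running init. A minimal sketch, assuming illustrative bucket, key, and KMS alias names (the real values come from infra/terraform/aws-tfstate-bootstrap/):

```hcl
# Sketch only — bucket, key, region, and KMS alias are illustrative,
# not the actual outputs of infra/terraform/aws-tfstate-bootstrap/.
terraform {
  backend "s3" {
    bucket     = "astrix-tfstate"                  # created by aws-tfstate-bootstrap
    key        = "vault-config/terraform.tfstate"  # one key per module
    region     = "us-east-1"
    encrypt    = true
    kms_key_id = "alias/terraform-state"
  }
}
```

With this block in place, running `terraform init -migrate-state` in the module directory copies the existing local state file into the bucket and switches the backend in place.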
Naming convention
All hostnames follow docs/standards/dns-naming-convention.md: production at *.astrixtrading.com, homelab non-prod at *.dev.astrixtrading.com, future managed staging at *.staging.astrixtrading.com. gitlab.txap.co is the single grandfathered exception.
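As an illustration of the convention, the Phase-0 Vault record from the bootstrap order could be expressed in Terraform roughly as follows (the zone resource name is hypothetical; the hostname and IP are the Phase-0 values given above):

```hcl
# Hypothetical zone reference — the hosted-zone resource name is illustrative.
resource "aws_route53_record" "vault_dev" {
  zone_id = aws_route53_zone.astrixtrading.zone_id
  name    = "vault.dev.astrixtrading.com"
  type    = "A"
  ttl     = 300
  records = ["68.203.212.75"] # homelab public IP
}
```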
(Note: infra/observability/{prometheus/rules,slos}/<domain>.{yaml,slo.yaml} files contain alert / SLO content and are owned by platform. The infra/helm/observability/ Helm release that runs Prometheus / Grafana / Loki / Tempo / Alertmanager is infra.)
PRD chapters that touch this domain
- 12-non-functional.md — availability, security, DR targets that the substrate must meet
- 13-compliance-legal.md — regulatory posture, S3 Object Lock, KMS key management
- 14-roadmap-phases.md — Phase 0 substrate bootstrap is mostly this domain
- 16-open-questions.md — every Tier 1 substrate decision (Q-E1, Q-E2, Q-E9, Q-E10, Q-E16, Q-I-series) lives here
- 17-engineering-standards.md — Section 17.5 observability stack and Section 17.6 deploy silencing both have a substrate dimension
TDD chapters
(Empty — will populate. Expected: Phase 0 GitLab project bootstrap TDD, AWS landing-zone Terraform TDD, EKS cluster TDD, Aurora cluster TDD, Vault server TDD, Kafka cluster TDD, observability-stack-Helm TDD, homelab→prod parity TDD, DR TDD.)
ADRs that affected this domain
- ADR-0001: Modular monolith for the application layer with extracted long-runners — implies the substrate must support 6 deploy units
- ADR-0003: Observability stack architecture — direct app instrumentation drives substrate sizing for Prometheus / Loki / Tempo
(All future substrate ADRs will land here. Examples to expect: GitLab project hierarchy, AWS account topology, EKS node-group strategy, Aurora capacity planning, Kafka cluster sizing, Vault Raft topology, DR + region-failover strategy.)
Service interfaces this domain exposes
Infra exposes substrate, not application service interfaces. The "interfaces" are:
- AWS endpoints (Aurora connection string, S3 bucket ARNs, EKS API URL) — consumed via @astrix/clients/aws/* typed clients (which themselves live in platform).
- K8s API — consumed by Helm charts and the platform's CI scripts.
- Vault HTTP API — consumed via @astrix/clients/vault (a platform package).
- Kafka broker addresses — consumed via @astrix/clients/kafka.
- GitLab API — consumed by scripts/create-issue.ts and other CI tooling.
Breaking changes to substrate (e.g., Aurora connection-string format change, K8s API version bump that breaks a manifest) require an ADR and a coordinated rollout plan with platform and the affected business domains.
Events
Infra itself does not produce or consume Kafka domain events.
External integrations
This is the domain that talks to all the substrate vendors:
- AWS — EKS, Aurora, S3, KMS, IAM, Route 53, ACM, CloudFront, CloudTrail (Q-I series)
- HashiCorp Vault — server administration; KMS auto-unseal, Transit, PKI, Secrets engines (Q-E16)
- Strimzi — Kafka operator on K8s (Q-E2)
- Cloudflare — DNS, Access (auth proxy for non-prod), CDN, Tunnel
- GitLab self-hosted — server administration + project administration (Q-E1)
- Sealed Secrets / Vault Secrets Operator — secret distribution into K8s
- PagerDuty — webhook/integration infrastructure (the secret rotation and Vault integration); the on-call agent that consumes PagerDuty pages is platform
- Cert-manager + Let's Encrypt / ACM — certificate issuance
Homelab edge architecture
internet (port 80/443)
↓
homelab public IP 68.203.212.75
↓
HAProxy on the edge router (host-based routing by Host header / SNI)
↓
k3s Traefik LoadBalancer (10.0.1.119/120/121/122 :80/:443)
↓
Traefik Ingress / IngressRoute (per-namespace routing inside the cluster)
↓
backend Service / Pod
HAProxy is the public-facing entrypoint. Adding a new public hostname to a homelab service requires:
- DNS record on Route 53 / Cloudflare (e.g., *.dev.astrixtrading.com → 68.203.212.75) — see docs/standards/dns-naming-convention.md.
- HAProxy ACL or default-backend rule on the edge router routing the new hostname to the k3s Traefik LB IP.
- K8s Ingress / IngressRoute inside the cluster.
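Step 2 on the edge router might look like the following haproxy.cfg fragment (frontend/backend names and the ACL are illustrative, not the router's actual config; the server IPs are the k3s Traefik LB addresses above):

```
# Illustrative haproxy.cfg fragment — section and ACL names are hypothetical.
frontend https-in
    bind *:443
    # route all non-prod hostnames to the k3s Traefik LoadBalancer
    acl host_dev hdr(host) -i -m end .dev.astrixtrading.com
    use_backend k3s_traefik if host_dev

backend k3s_traefik
    server traefik1 10.0.1.119:443 check
    server traefik2 10.0.1.120:443 check
```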
Skipping the HAProxy step yields HAProxy's default 503 page — <html><body><h1>503 Service Unavailable</h1>No server is available to handle this request.</body></html> — which is distinguishable from Traefik's 503 (different markup). Diagnostic shorthand: if a curl against the public IP returns 503 but a direct curl to 10.0.1.119 works, HAProxy is the missing layer.
Runbooks for this domain
(Empty — will populate. Expected high-priority Phase 1b runbooks: aurora-failover, kafka-broker-down, vault-sealed, eks-control-plane-degraded, gitlab-server-down, homelab-cluster-down, dns-failure, certificate-expiring, s3-object-lock-replication-lag.)
On-call
Infra on-call is the most operationally sensitive — substrate failures cascade to every other domain. Routing per Section 17.6.1: substrate Sev 0 / Sev 1 page directly via PagerDuty (the infra rotation, distinct from the platform rotation, even though Phase 1 has the same human on both); Sev 2-4 route to #alerts-warnings / #alerts-monitoring.
The L2 read-only diagnostic agent has a separate read-only IAM role / Vault policy for substrate diagnosis (per ADR-0004 RBAC bullet).
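That read-only role could be sketched as a Terraform policy document like the one below; the action list is illustrative only, since the authoritative scope is defined by the ADR-0004 RBAC bullet:

```hcl
# Illustrative only — the real action list and resource scoping are per ADR-0004.
data "aws_iam_policy_document" "l2_diag_readonly" {
  statement {
    effect = "Allow"
    actions = [
      "eks:Describe*", "eks:List*",          # cluster/node-group state
      "rds:Describe*",                       # Aurora cluster state
      "cloudwatch:GetMetricData", "cloudwatch:ListMetrics",
      "logs:GetLogEvents", "logs:DescribeLogGroups",
    ]
    resources = ["*"]
  }
}
```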
Cross-domain dependencies
- This domain calls: nothing in the application layer.
- Every other domain depends on this layer's substrate, either directly (the platform domain's clients wrap substrate APIs) or transitively (every business domain calls @astrix/clients/*).