Operating
Day-2 guidance for running the operator in production.
7.1 Monitoring
The operator surfaces its state in three places.
Logs. Structured logs, controlled by RUST_LOG (default
info,stepscale_operator=debug). Each reconcile tick logs the license state, whether this
replica is the leader, and the metrics source in use:

    kubectl logs -n <namespace> deploy/<release>-stepscale-autoscaler -f

Useful log lines to watch for:

| Log message (substring) | Meaning |
|---|---|
| operator starting | Startup; reports provider, namespaces, Prometheus, forecasting, licensed. |
| metrics source: Prometheus history backfill | Prometheus is wired correctly. |
| metrics source: HPA-status fallback | No Prometheus URL; history rebuilds slowly. |
| reconcile tick | A tick ran; includes license=… and leader=…. |
| created ScalingRecommendation | A new recommendation was emitted. |
| applied recommendation / auto-reverted degraded recommendation | Apply / rollback occurred. |
| not leader; skipping mutating pass | This replica is a follower (expected with replicaCount > 1). |
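
To pick the notable events out of a busy log stream, the substrings from the table above can be fed to a fixed-string filter. This is a minimal sketch, assuming each substring appears verbatim in the log line; the function name `stepscale_log_events` is hypothetical, and the per-tick `reconcile tick` message is left out on purpose to keep the output short.

```shell
# Keep only log lines containing one of the notable substrings from the table.
# Assumption: the substrings appear verbatim; the surrounding log layout may differ.
stepscale_log_events() {
  grep -F \
    -e 'operator starting' \
    -e 'metrics source:' \
    -e 'created ScalingRecommendation' \
    -e 'applied recommendation' \
    -e 'auto-reverted degraded recommendation' \
    -e 'not leader; skipping mutating pass'
}
```

Typical use is to pipe the log stream through it, for example `kubectl logs -n <namespace> deploy/<release>-stepscale-autoscaler -f | stepscale_log_events`.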
Recommendation status. The authoritative state of any change is the CR’s
status.phase and status.detail (see Usage §6.5):

    kubectl get scalerec -A
    kubectl get scalerec <name> -n <namespace> -o jsonpath='{.status}{"\n"}'

Kubernetes resources. Inspect the live workload to confirm an applied change:

    kubectl get hpa <name> -n <namespace> \
      -o custom-columns=MIN:.spec.minReplicas,MAX:.spec.maxReplicas,TARGET:.spec.metrics[0].resource.target.averageUtilization

7.2 High availability
Run two or more replicas and keep leader election enabled (the default):

    helm upgrade <release> oci://ghcr.io/stepscale/charts/stepscale-autoscaler \
      --version <version> --namespace <namespace> --reuse-values \
      --set replicaCount=2

- Leader election uses a coordination.k8s.io Lease (leaderElection.leaseName, default stepscale-autoscaler-leader). Only the leader runs the mutating passes (apply, verify, schedule), so multiple replicas never double-apply.
- The lease duration is 3× intervalSeconds. Each mutating pass is re-gated on a fresh leadership check and bounded by a time budget, so a replica that loses the lease stops mutating promptly during a failover.
- Followers still run read-only analysis, so failover is fast: a standby is already warm.
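
As a worked example of the lease timing, the sketch below computes the lease duration from intervalSeconds and looks up which replica currently holds the lease. It assumes the default lease name given above and reads the standard `spec.holderIdentity` field of a coordination.k8s.io Lease; the function names `lease_duration_seconds` and `current_leader` are hypothetical.

```shell
# Lease duration is 3x intervalSeconds, as described above.
lease_duration_seconds() {
  local interval_seconds="${1:-300}"   # 300 is the documented default interval
  echo $(( 3 * interval_seconds ))
}

# Print the identity of the replica that currently holds the leader lease.
# $1 = namespace, $2 = lease name (falls back to the documented default name).
current_leader() {
  local namespace="$1"
  local lease_name="${2:-stepscale-autoscaler-leader}"
  kubectl get lease "$lease_name" -n "$namespace" \
    -o jsonpath='{.spec.holderIdentity}{"\n"}'
}
```

With the default interval of 300 seconds, `lease_duration_seconds` prints 900, so a shorter intervalSeconds also shortens the lease.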
7.3 Upgrades
Upgrades follow the same verify-then-install flow as the initial install:
- Verify the new image signature (see Installation §3.1).
- Mirror the new image and chart if you are air-gapped (§3.3).
- Upgrade in place:

      helm upgrade <release> oci://ghcr.io/stepscale/charts/stepscale-autoscaler \
        --version <new-version> --namespace <namespace> --reuse-values

The CRD ships with the chart. Existing ScalingRecommendation resources and their approval
state are preserved across upgrades. Pull access to new images is tied to an active
subscription, which is the renewal lever - see Licensing.
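
After an upgrade it can be worth confirming that the rollout finished and that existing recommendations are still present. The following is a sketch only: the function name `verify_upgrade` and the five-minute timeout are assumptions, and the Deployment name follows the `<release>-stepscale-autoscaler` pattern used in §7.1.

```shell
# Wait for the upgraded Deployment to finish rolling out, then list recommendations.
# $1 = release name, $2 = namespace.
verify_upgrade() {
  local release="$1" namespace="$2"
  # Assumption: 5m is an arbitrary timeout; adjust it to your rollout times.
  kubectl rollout status "deploy/${release}-stepscale-autoscaler" \
    -n "$namespace" --timeout=5m || return 1
  # Existing ScalingRecommendation resources should still be listed here.
  kubectl get scalerec -A
}
```

Call it as `verify_upgrade <release> <namespace>`; a non-zero exit status means the rollout did not complete within the timeout.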
7.4 Run modes
| Mode | How to run it (current behavior) |
|---|---|
| Analysis-only (advisor) | Install without a license (or without license.publicKey). The operator watches and emits recommendations but never applies; approved recommendations are marked blocked. Equivalently, simply never approve. |
| Apply | Provide a valid license and license.publicKey, then approve recommendations. The operator applies and verifies them. |
| Rules-only (no LLM) | Set llm.provider=none. Analysis uses the deterministic rule engine; no external calls are made. Combine with either mode above. |
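
For instance, switching an existing release to rules-only analysis could look like the sketch below. It assumes llm.provider is exposed as a chart value that can be passed with `--set`, in the same way as replicaCount in §7.2; the function name `enable_rules_only` is hypothetical.

```shell
# Switch an existing release to rules-only analysis (no LLM calls).
# $1 = release, $2 = namespace, $3 = chart version.
# Assumption: llm.provider is a chart value settable with --set.
enable_rules_only() {
  local release="$1" namespace="$2" version="$3"
  helm upgrade "$release" oci://ghcr.io/stepscale/charts/stepscale-autoscaler \
    --version "$version" --namespace "$namespace" --reuse-values \
    --set llm.provider=none
}
```

Because `--reuse-values` keeps the existing license settings, this changes only the analysis engine and leaves the advisor or apply behavior as it was.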
7.5 Tuning the cadence and safety margins
- intervalSeconds trades responsiveness against API-server load. The default of 300 s is appropriate for most clusters; lower it only if you need faster turnaround and your control plane has headroom.
- safety.probationWindowMinutes should be long enough to capture a representative traffic sample for the workload. For spiky daily traffic, keep it at or above the default.
- safety.healthCpuMargin widens or tightens what counts as “degraded.” Raise it to tolerate more post-change CPU headroom before rolling back; lower it for stricter reverts.
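
To get a feel for the intervalSeconds trade-off, it can help to count reconcile ticks per hour. The sketch below treats tick count as a rough proxy for API-server load, which is an assumption since the per-tick cost depends on the cluster; the function name `ticks_per_hour` is hypothetical.

```shell
# Reconcile ticks per hour for a given intervalSeconds (integer division).
ticks_per_hour() {
  local interval_seconds="${1:-300}"   # 300 is the documented default
  if [ "$interval_seconds" -le 0 ]; then
    echo "intervalSeconds must be positive" >&2
    return 1
  fi
  echo $(( 3600 / interval_seconds ))
}
```

The default of 300 seconds gives 12 ticks per hour, while an interval of 60 seconds gives 60, so each replica does five times as many passes.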