Troubleshooting & FAQ
Start by reading the operator logs and the recommendation’s status.detail:

    kubectl logs -n <namespace> deploy/<release>-stepscale-autoscaler --tail=200
    kubectl get scalerec <name> -n <namespace> -o jsonpath='{.status}{"\n"}'

9.1 No recommendations appear
A few normal causes, in order of likelihood:
- History is still warming up. Without Prometheus, the operator accumulates one sample
per tick from HPA status, so it can take many ticks (hours) before there is enough history
to analyze. Fix by pointing at Prometheus: --set metrics.prometheusUrl=…. The logs show
"metrics source: HPA-status fallback" when no Prometheus URL is set.
- Prometheus returns no points. The log line
"prometheus returned no points; check CPU requests/limits or PromQL templates" means the
default query found nothing - usually because the workload has no CPU requests/limits (the
default query’s denominator), or your Prometheus uses non-standard metric names. Set CPU
requests on the workload, or override the queries with metrics.promqlCpu /
metrics.promqlReplicas / metrics.promqlQueue (using {ns} / {deploy} placeholders).
- No optimization opportunity. If a workload is already well-tuned, the rule engine emits
nothing and no recommendation is created. This is expected.
- Not watching that namespace. Check watchNamespaces. Empty means all namespaces; a
non-empty list restricts to exactly those.
- A recommendation already exists. The operator never overwrites an existing
recommendation for a workload. Check kubectl get scalerec -A.
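
The first two causes can usually be told apart from the operator logs alone. The following is a minimal sketch, assuming the two log messages quoted above appear verbatim in the log output; the function name diagnose_metrics is hypothetical and not part of the product.

```shell
# Hypothetical helper: reads operator log text on stdin and reports which
# of the two metrics-related causes described above it can see.
diagnose_metrics() {
  logs=$(cat)
  found=0
  if printf '%s\n' "$logs" | grep -qF 'metrics source: HPA-status fallback'; then
    echo 'HPA-status fallback: no Prometheus URL set; history warms up one sample per tick'
    found=1
  fi
  if printf '%s\n' "$logs" | grep -qF 'prometheus returned no points'; then
    echo 'Prometheus returned no points: check CPU requests/limits or the PromQL overrides'
    found=1
  fi
  if [ "$found" -eq 0 ]; then
    echo 'no metrics-related message found in these log lines'
  fi
}

# Usage (same placeholders as the log command at the top of this page):
#   kubectl logs -n <namespace> deploy/<release>-stepscale-autoscaler --tail=200 | diagnose_metrics
```

If neither message shows up, the remaining causes in the list (no opportunity, namespace not watched, recommendation already exists) are the ones to check next.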
9.2 LLM errors
- The operator falls back to a deterministic rules-only analysis if the LLM is not
configured (llm.provider=none). Recommendations still appear, with the summary
“Rule-based recommendation (no LLM configured).”
- If a configured provider returns an error (bad key, rate limit, network), the analysis pass
logs the error and that tick produces no LLM-judged recommendation; the loop continues and
retries next tick. Verify the key Secret exists and holds the key under
apiKey, and that egress to the provider is allowed.
- "llm provider 'openai' requires an API key" at startup means llm.provider is set but no
key was supplied. Set llm.apiKey or llm.existingSecret, or switch to llm.provider=none.
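
The Secret check in the second bullet can be scripted without printing the key itself. This is a sketch only: the function name check_llm_secret is hypothetical, and the Secret name and namespace are whatever your install uses (for example the Secret referenced by llm.existingSecret).

```shell
# Hypothetical helper: succeeds when the Secret has a non-empty value under
# the apiKey data key. The key material itself is never echoed.
# Arguments: Secret name, namespace.
check_llm_secret() {
  secret="$1"
  ns="$2"
  value=$(kubectl get secret "$secret" -n "$ns" -o jsonpath='{.data.apiKey}') || {
    echo "Secret $secret not readable in namespace $ns"
    return 1
  }
  if [ -z "$value" ]; then
    echo "Secret $secret exists but has no apiKey entry"
    return 1
  fi
  echo "Secret $secret holds a value under apiKey"
}
```

A passing check only confirms the key is present; a bad or rate-limited key still shows up as a provider error in the logs.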
9.3 License problems
Check the license line in the logs (… license=licensed (…) / … expired - grace … /
… unlicensed - …):
| Symptom in logs / status | Cause | Fix |
|---|---|---|
| unlicensed - no public key configured | license.publicKey not set. | Set license.publicKey to the base64 key stepscale provided. Required to apply. |
| unlicensed - license Secret unreadable: … | The license Secret is missing or the operator cannot read it. | Confirm the Secret exists in the expected namespace and the RBAC secrets get grant is intact. |
| unlicensed - license signature does not verify | Wrong public key, or a tampered/mismatched payload/signature. | Re-install the correct payload + signature and the matching publicKey. |
| expired - grace Nd left | Past expires_at, inside the grace window. | Renew by swapping in a new signed license (see Licensing §5.2). |
| Approved recommendation stuck in blocked | Operator is unlicensed/past grace. | Install or renew the license; apply resumes automatically. |
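
For scripted triage, the table above can be expressed as a lookup on the license log line. This is a sketch that assumes the log fragments match the table and the states listed before it; license_fix is a hypothetical name, not a shipped tool.

```shell
# Hypothetical helper: maps one license log line to the fix from the table above.
license_fix() {
  case "$1" in
    *'unlicensed - no public key configured'*)
      echo 'set license.publicKey' ;;
    *'unlicensed - license Secret unreadable'*)
      echo 'check the license Secret and the RBAC secrets get grant' ;;
    *'unlicensed - license signature does not verify'*)
      echo 'reinstall payload + signature with the matching publicKey' ;;
    *'expired - grace'*)
      echo 'renew with a new signed license' ;;
    *'license=licensed'*)
      echo 'license is valid; no action needed' ;;
    *)
      echo 'no known license state in this line' ;;
  esac
}
```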
9.4 RBAC errors
Log lines like apply failed: … is forbidden: User "system:serviceaccount:…" cannot patch resource … mean the service account lacks a grant - usually because the ClusterRole /
ClusterRoleBinding was not installed (e.g. a templated-manifests-only deploy without
cluster-admin). Re-run the Helm install with cluster-admin so the RBAC objects are created,
and confirm:

    kubectl auth can-i patch horizontalpodautoscalers \
      --as=system:serviceaccount:<namespace>:<release>-stepscale-autoscaler -n <namespace>

9.5 A change was applied but immediately rolled back
The verify pass judged the workload degraded after probation (for example CPU pushed past
the target plus safety.healthCpuMargin). The recommendation moves to rolledBack and
status.detail records why. This is the safety net working as designed. If you believe the
change is actually fine, you can widen safety.healthCpuMargin or lengthen
safety.probationWindowMinutes so a brief post-change spike is not misread.
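
The example condition above (CPU past the target plus safety.healthCpuMargin) amounts to a simple threshold. The sketch below illustrates that arithmetic only: it assumes all three values are integer CPU utilization percentages, the numbers are invented for illustration, and is_degraded is a hypothetical name. The operator's actual verify pass may weigh other signals.

```shell
# Hypothetical illustration of the example health rule:
# degraded when observed CPU exceeds target + margin (all integer percentages).
is_degraded() {
  cpu="$1"; target="$2"; margin="$3"
  [ "$cpu" -gt $((target + margin)) ]
}

# With a target of 70 and a margin of 10, the limit is 80:
is_degraded 85 70 10 && echo 'degraded'      # 85 > 80
is_degraded 78 70 10 || echo 'within margin' # 78 <= 80
```

Under these assumptions, widening the margin from 10 to 20 raises the limit to 90, so the same 85% reading would no longer count as degraded.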
9.6 A ScaledObject change did not apply / no rollback
KEDA ScaledObject targets are marked verified at apply time and are not health-verified
or auto-rolled-back yet (reading live ScaledObject config back is a follow-up). If the patch
itself did not take effect, confirm KEDA is installed, the ScaledObject exists in the
target namespace, and the operator has the keda.sh/scaledobjects grant (it does by default).
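
The three checks above can be run in order with kubectl. This is a sketch under assumptions: check_keda_target is a hypothetical name, the service account name follows the <release>-stepscale-autoscaler pattern used in 9.4, and the verb tested (patch) is assumed rather than taken from the chart's RBAC.

```shell
# Hypothetical helper: runs the three checks listed above for a KEDA target.
# Arguments: ScaledObject name, target namespace, operator namespace, Helm release.
check_keda_target() {
  so="$1"; ns="$2"; op_ns="$3"; release="$4"
  sa="system:serviceaccount:${op_ns}:${release}-stepscale-autoscaler"

  # 1. KEDA installed: its ScaledObject CRD is registered.
  kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1 ||
    { echo 'KEDA CRD scaledobjects.keda.sh not found'; return 1; }

  # 2. The ScaledObject exists in the target namespace.
  kubectl get scaledobjects.keda.sh "$so" -n "$ns" >/dev/null 2>&1 ||
    { echo "ScaledObject $so not found in $ns"; return 1; }

  # 3. The operator's service account may patch it (verb assumed to be patch).
  kubectl auth can-i patch scaledobjects.keda.sh --as="$sa" -n "$ns" >/dev/null 2>&1 ||
    { echo "service account cannot patch ScaledObjects in $ns"; return 1; }

  echo 'KEDA target checks passed'
}
```

The function stops at the first failing check, so the message points at the earliest thing to fix.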
9.7 Predictive schedule keeps retracting
If a forecasted peak does not actually materialize, the operator retracts the schedule after
several consecutive idle samples inside the window and restores the baseline floor
(status.detail: "schedule retracted: forecasted peak not materializing"). This is expected
for workloads whose pattern has changed. Forecasting also requires enough history
(forecasting.minHistoryDays) and a genuinely periodic signal
(forecasting.periodicityThreshold); flat or random workloads stay reactive by design.
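
To confirm that a retraction is what happened, read status.detail on the recommendation. A minimal sketch, assuming the detail text begins with the "schedule retracted:" prefix quoted above; was_retracted is a hypothetical name.

```shell
# Hypothetical helper: prints status.detail for a recommendation and returns
# 0 if it records a retraction, 1 if not, 2 if the object could not be read.
# Arguments: recommendation name, namespace.
was_retracted() {
  detail=$(kubectl get scalerec "$1" -n "$2" -o jsonpath='{.status.detail}') || return 2
  echo "$detail"
  case "$detail" in
    'schedule retracted:'*) return 0 ;;
    *) return 1 ;;
  esac
}
```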
9.8 FAQ
Does the operator change anything without approval? No. Nothing is applied until
spec.approved: true. Auto-rollback only ever undoes a change the operator itself applied.
Does it work air-gapped? Yes - with llm.provider=none and an offline license, there is
no outbound traffic at all.
Can I run it purely as an advisor? Yes - run analysis-only (no license) or simply never approve. See Operating §7.4.
Will two replicas double-apply? No - leader election ensures only the leader mutates.
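
On the approval question at the top of this FAQ: approving is a single field change on the recommendation. The sketch below uses a merge patch to set spec.approved; approve_scalerec is a hypothetical wrapper, and your workflow may instead edit the object through GitOps or kubectl edit.

```shell
# Hypothetical helper: sets spec.approved to true on one recommendation.
# Arguments: recommendation name, namespace.
approve_scalerec() {
  kubectl patch scalerec "$1" -n "$2" --type=merge -p '{"spec":{"approved":true}}'
}
```

Until a patch like this is made, the recommendation stays advisory and nothing is applied.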