HIVE-29492: Add AutoScaling to K8s operator #6507

Open: ayushtkn wants to merge 20 commits into apache:master from ayushtkn:K8sautoscaling

@ayushtkn (Member) commented May 26, 2026

What changes were proposed in this pull request?

Add auto scaling to Hive Operator

Why are the changes needed?

Better resource utilization and lower cloud costs.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually

Installed Dependencies (ZK, Postgres & Ozone)

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install zookeeper bitnami/zookeeper \
  --set replicaCount=1 --set auth.enabled=false \
  --set image.repository=bitnamilegacy/zookeeper \
  --set image.tag=3.9.3-debian-12-r21 \
  --set global.security.allowInsecureImages=true --wait


helm install postgres bitnami/postgresql \
  --set auth.username=hive --set auth.password=hive123 \
  --set auth.database=metastore --wait


kubectl create secret generic hive-db-secret --from-literal=password=hive123


helm repo add ozone https://apache.github.io/ozone-helm-charts/
helm install ozone ozone/ozone --version 0.2.0 --wait
sleep 50
kubectl exec statefulset/ozone-om -- ozone sh volume create /s3v
kubectl exec statefulset/ozone-om -- ozone sh bucket create /s3v/hive

Started Hive Operator With AutoScaling Enabled (Very Low Thresholds for Testing)

helm install hive ./helm/hive-operator \
  --set cluster.database.type=postgres \
  --set cluster.database.url="jdbc:postgresql://postgres-postgresql:5432/metastore" \
  --set cluster.database.driver="org.postgresql.Driver" \
  --set cluster.database.username=hive \
  --set cluster.database.passwordSecretRef.name=hive-db-secret \
  --set cluster.database.passwordSecretRef.key=password \
  --set cluster.database.driverJarUrl="https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.5/postgresql-42.7.5.jar" \
  --set cluster.zookeeper.quorum="zookeeper:2181" \
  --set cluster.storage.coreSiteOverrides."fs\.defaultFS"="s3a://hive" \
  --set cluster.storage.coreSiteOverrides."fs\.s3a\.endpoint"="http://ozone-s3g-rest:9878" \
  --set-string cluster.storage.coreSiteOverrides."fs\.s3a\.path\.style\.access"=true \
  --set 'cluster.storage.envVars[0].name=HADOOP_OPTIONAL_TOOLS' \
  --set 'cluster.storage.envVars[0].value=hadoop-aws' \
  --set 'cluster.storage.envVars[1].name=AWS_ACCESS_KEY_ID' \
  --set 'cluster.storage.envVars[1].value=ozone' \
  --set 'cluster.storage.envVars[2].name=AWS_SECRET_ACCESS_KEY' \
  --set 'cluster.storage.envVars[2].value=ozone' \
  --set cluster.hiveServer2.autoscaling.enabled=true \
  --set cluster.metastore.autoscaling.enabled=true \
  --set cluster.llap.autoscaling.enabled=true \
  --set cluster.tezAm.autoscaling.enabled=true \
  --set-string cluster.llap.configOverrides."hive\.llap\.daemon\.task\.scheduler\.wait\.queue\.size"="1" \
  --set cluster.hiveServer2.autoscaling.scaleUpThreshold=1 \
  --set cluster.metastore.autoscaling.scaleUpThreshold=2
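
For readers interpreting the thresholds in the command above, here is a minimal sketch of how a threshold can translate into a replica count. It assumes the threshold is a per-pod value and that an HPA-style proportional rule applies; the function name, the rule, and the min/max bounds are illustrative assumptions, not taken from the operator's code.

```python
import math


def desired_replicas(total_metric: float, per_pod_threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Hypothetical HPA-style rule: enough pods that each stays at or under the threshold."""
    if per_pod_threshold <= 0:
        raise ValueError("per_pod_threshold must be positive")
    wanted = math.ceil(total_metric / per_pod_threshold)
    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))


# Assumed example: a threshold of 1 open session per HS2 pod and two open sessions.
print(desired_replicas(total_metric=2, per_pod_threshold=1,
                       min_replicas=1, max_replicas=5))
```

Under this assumed rule, a very low threshold makes even a second Beeline session enough to request another pod, which is why low values are convenient for manual testing.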

Launched Beeline

kubectl exec -it deployment/hive-hiveserver2 -- beeline -u "jdbc:hive2://hive-hiveserver2:10001/;transportMode=http;httpPath=cliservice"

OUTPUTS:

Initial start -> only 1 HMS and 1 HS2 (the configured minimum of 1)

image

First Beeline session connects -> Tez AM and LLAP daemons start (min 1 configured)

image

Autoscaling HS2 & Tez AM to 2 (reduced max threshold)

image

Tez AM
image

HS2
image

Auto Scaling HMS & LLAP to 2

image

HMS
image

LLAP (the load had already dropped by then because the query had finished :-( )
image

Scale Downs (After Cooling Periods)

Scheduled
image

Done (after waiting out the cool-down period of each service)
image

CPU tracking

HS2
image
HMS
image

@ayushtkn changed the title from "WIP: Add AutoScaling to K8s operator" to "HIVE-29492: Add AutoScaling to K8s operator" on May 29, 2026
@aturoczy


In my mind, this PR is OK. It just needs detailed documentation explaining the logic and the expected behavior.

@tanishq-chugh (Contributor) left a comment

Tried it out, and everything worked as expected (screenshots from my run are attached below).
Amazing feature, and it was easy to deploy and configure 🚀

At start:
Image

TezAM & LLAP start-up when required:
Image

Auto Scale-up:
Image

Scale-Down post cooling periods:
Image

Copilot AI left a comment

Pull request overview

Adds operator-driven autoscaling (and optional auto-suspend/hibernation) to the Hive Kubernetes operator by scraping JMX Exporter metrics directly from pods, refactoring dependent resources/workflow to support autoscaling-safe lifecycle hooks, and surfacing autoscaling state in the HiveCluster status/CRD.

Changes:

  • Introduces autoscaling control loop (background metrics scraping + per-component scaling strategies + stabilization windows) and optional auto-suspend.
  • Refactors operator dependents/workflow into unified ConfigMap/Service/PDB dependents and a programmatic WorkflowSpec.
  • Updates Helm charts, CRD schema/status printer columns, and Kubernetes operator documentation to reflect autoscaling/HTTP-transport defaults.
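
As a rough illustration of the "stabilization windows" mentioned in the first bullet, the sketch below follows the scale-down behaviour of the Kubernetes Horizontal Pod Autoscaler: the highest recommendation seen within the window wins, so a brief dip in load does not remove pods. It is not the PR's Java `StabilizationWindow`; the method names and the 300-second window are assumptions.

```python
from collections import deque


class StabilizationWindow:
    """Hypothetical helper that remembers recommendations for window_seconds."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, recommended_replicas)

    def record(self, now: float, recommended: int) -> None:
        self._samples.append((now, recommended))
        # Drop samples that have aged out of the window.
        while self._samples and self._samples[0][0] < now - self.window_seconds:
            self._samples.popleft()

    def scale_down_target(self) -> int:
        if not self._samples:
            raise ValueError("no recommendations recorded yet")
        # Scale-down uses the highest recent recommendation, so short dips are ignored.
        return max(replicas for _, replicas in self._samples)


window = StabilizationWindow(window_seconds=300)
window.record(now=0, recommended=3)
window.record(now=100, recommended=1)
print(window.scale_down_target())  # still held up by the sample recorded at t=0
window.record(now=400, recommended=1)
print(window.scale_down_target())  # the t=0 sample has now aged out
```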

Reviewed changes

Copilot reviewed 59 out of 59 changed files in this pull request and generated 9 comments.

File Description
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/TestVectorMapJoinOuterGenerateResultOperator.java Removes an internal test comment.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/util/HiveConfigBuilder.java Sets transport-mode defaults and enables JMX metrics when autoscaling is enabled.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/util/ConfigUtils.java Adds shared component constants, port/config defaults, and boolean parsing helper.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/reconciler/HiveWorkflowSpec.java New programmatic workflow definition replacing annotation-based workflow/conditions.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/model/status/ComponentStatus.java Extends component status to include autoscaling and replica bounds/current.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/model/status/AutoscalingStatus.java New status payload for autoscaling decisions/metrics.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/model/spec/TezAmSpec.java Adds autoscaling spec to TezAM configuration with defaults.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/model/spec/MetastoreSpec.java Adds autoscaling spec to Metastore configuration with defaults.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/model/spec/LlapSpec.java Adds autoscaling spec to LLAP configuration with defaults.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/model/spec/HiveServer2Spec.java Adds autoscaling spec to HS2 configuration with defaults.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/model/spec/AutoSuspendSpec.java New auto-suspend configuration record.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/model/spec/AutoscalingSpec.java New autoscaling configuration record (thresholds, windows, scrape interval, ports).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/model/HiveClusterStatus.java Adds phase/idle/suspend fields and CRD printer columns support.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/model/HiveClusterSpec.java Adds autoSuspend + suspend fields and initializes component defaults.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/HiveOperatorMain.java Injects the new programmatic workflow spec at operator startup.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/TezAmStatefulSetDependent.java Adds autoscaling-aware replica resolution and refactors naming/configmap refs.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/TezAmServiceDependent.java Removed (replaced by unified service dependent).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/ScratchPvcDependent.java Adds secondary resource name override for disambiguation.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/SchemaInitJobDependent.java Adds secondary resource name override and consistent component naming.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/MetastoreServiceDependent.java Removed (replaced by unified service dependent).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/MetastoreDeploymentDependent.java Adds autoscaling lifecycle/JMX exporter integration and autoscaling replica resolution.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/MetastoreConfigMapDependent.java Removed (replaced by unified configmap dependent).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/LlapStatefulSetDependent.java Adds autoscaling lifecycle/JMX exporter integration and autoscaling replica resolution.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/LlapServiceDependent.java Removed (replaced by unified service dependent).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/LlapConfigMapDependent.java Removed (replaced by unified configmap dependent).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/HiveServiceDependent.java New unified Service dependent with per-component subclasses.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/HiveServer2ServiceDependent.java Removed (replaced by unified service dependent).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/HiveServer2DeploymentDependent.java Adds autoscaling lifecycle/JMX exporter integration, HTTP port probing, and autoscaling replica resolution.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/HiveServer2ConfigMapDependent.java Removed (replaced by unified configmap dependent).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/HivePdbDependent.java New unified PDB dependent for autoscaling-safe disruption control.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/HiveDependentResource.java Adds autoscaling helper logic (replica resolution, drain scripts, JMX exporter wiring, user volumes helper).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/HiveConfigMapDependent.java New unified ConfigMap dependent with per-component subclasses.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/HadoopConfigMapDependent.java Removed (replaced by unified configmap dependent).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/condition/TezAmEnabledCondition.java Removed (replaced by inline lambda conditions in workflow).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/condition/SchemaJobCompletedCondition.java Removed (replaced by inline lambda conditions in workflow).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/condition/MetastoreReadyCondition.java Removed (replaced by inline lambda conditions in workflow).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/condition/MetastoreEnabledCondition.java Removed (replaced by inline lambda conditions in workflow).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/condition/LlapEnabledCondition.java Removed (replaced by inline lambda conditions in workflow).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/condition/HiveServer2Precondition.java Removed (replaced by inline lambda conditions in workflow).
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/TezAmScalingStrategy.java New TezAM demand-based scaling strategy.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/StabilizationWindow.java New HPA-like stabilization window helper.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/ScalingStrategy.java New scaling strategy interface.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/PrometheusTextParser.java New parser for Prometheus text exposition.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/PodMetrics.java New per-pod metrics record.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/MetricsScraper.java New concurrent pod scraper using pod IPs + HTTP client.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/MetricsCache.java New cache for scraped metrics with staleness handling.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/MetastoreScalingStrategy.java New Metastore scaling strategy based on API request rate.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/LlapScalingStrategy.java New LLAP scaling strategy using busy slots + HS2 activation gate.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/HiveServer2ScalingStrategy.java New HS2 scaling strategy using open sessions.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/HiveClusterAutoscaler.java New autoscaling orchestrator coordinating all components and status reporting.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/ComponentAutoscaler.java New per-component autoscaler combining strategy + stabilization + CPU signal.
packaging/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/BackgroundMetricsScraper.java New background scheduler for periodic scraping into cache.
packaging/src/kubernetes/README.md Adds extensive autoscaling/auto-suspend documentation and updates HS2 connection instructions to HTTP/10001.
packaging/src/kubernetes/pom.xml Updates logging deps (SLF4J2 binding) and docker build args.
packaging/src/kubernetes/helm/hive-operator/values.yaml Adds autoscaling + autoSuspend Helm values with defaults.
packaging/src/kubernetes/helm/hive-operator/templates/hivecluster.yaml Renders autoscaling/autoSuspend sections into HiveCluster CR.
packaging/src/kubernetes/helm/hive-operator/templates/clusterrole.yaml Grants additional RBAC for autoscaling (pods patch, PDBs, scale subresources).
packaging/src/kubernetes/helm/hive-operator/crds/hiveclusters.hive.apache.org-v1.yml Extends CRD schema/status (autoscaling, suspend/autoSuspend, printer columns).
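
Since the autoscaler scrapes JMX Exporter output directly from pods, a minimal sketch of parsing the Prometheus text exposition format may help picture what a component like `PrometheusTextParser` has to handle. This is not the PR's Java implementation: it ignores `# HELP`/`# TYPE` metadata and escaped label values, and the metric names in the usage example are made up.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse `name{labels} value [timestamp]` lines, skipping comments and blanks."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Keep the label set attached to the name; label values may contain spaces.
            head, _, rest = line.rpartition("}")
            name = head + "}"
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue  # malformed line without a value
        samples[name] = float(fields[0])  # an optional timestamp in fields[1] is ignored
    return samples


# Hypothetical scrape body; real metric names depend on the JMX Exporter configuration.
body = """# HELP hs2_open_sessions Open sessions
# TYPE hs2_open_sessions gauge
hs2_open_sessions 3
jvm_memory_bytes_used{area="heap"} 1.5e+08 1700000000000
"""
print(parse_prometheus_text(body))
```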

Comment thread packaging/src/kubernetes/pom.xml
Comment thread packaging/src/kubernetes/README.md

When LLAP and TezAM are configured with `minReplicas: 0` (the default), they start
with zero pods on fresh install. The operator automatically scales them up when HS2
reports open sessions, and scales them back to zero when HS2 is idle.
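
The scale-from-zero behaviour in this README excerpt can be pictured as a simple gate on HS2 activity. The sketch below is an assumption about the shape of that logic, not the operator's code: the function name, the `demand_replicas` input, and the default bounds are hypothetical, and it leaves out the stabilization delay that applies before scaling back down.

```python
def llap_target(hs2_open_sessions: int, demand_replicas: int,
                min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Hypothetical gate: stay at min_replicas while HS2 is idle, else follow demand."""
    if hs2_open_sessions <= 0:
        return min_replicas
    # Once sessions exist, keep at least one pod so the first query has somewhere to run.
    return max(1, min_replicas, min(max_replicas, demand_replicas))


print(llap_target(hs2_open_sessions=0, demand_replicas=5))   # idle cluster
print(llap_target(hs2_open_sessions=1, demand_replicas=0))   # first session arrives
print(llap_target(hs2_open_sessions=3, demand_replicas=50))  # demand above the cap
```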

@zhangbutao (Contributor) commented:

That’s interesting—LLAP can be spun up on demand just like Tez tasks on YARN. I’m curious about the speed of spinning up LLAP on Kubernetes concurrently. For example, if I need to start 100 LLAP instances at the same time to run tasks, will the concurrent startup take a long time?

If LLAP on K8s can start up very quickly, then we might explore using LLAP on K8s for certain batch processing tasks that require many LLAP instances to run concurrently during specific time windows. If that’s feasible, perhaps Tez tasks on K8s wouldn’t be as important—LLAP on K8s might be sufficient.

@ayushtkn (Member, Author) replied:

Thanks @zhangbutao, that is exactly my end goal: to achieve something along those lines. This will definitely lead to some more follow-up :-)
To answer your question about spinning up 100 instances concurrently:

  1. Concurrent Startup Speed
    From a Kubernetes scheduling perspective, it is extremely fast. However, because LLAP and TezAM are deployed as StatefulSets, Kubernetes defaults to spinning them up sequentially (waiting for pod-0 to be Ready before starting pod-1), which would be too slow for this use case.
    To fix this, I just pushed a commit (9545034) adding podManagementPolicy: "Parallel" to the StatefulSets. Now, if the autoscaler requests 100 LLAP instances, K8s will schedule and boot all 100 simultaneously.

The only real bottlenecks are JVM/ZooKeeper bootstrap times and pulling the massive Hive Docker image onto those new nodes.

  2. Why this makes LLAP vastly superior to pure Tez container mode
    Your thought about pure Tez tasks becoming less important on K8s is spot on, and image pull times are exactly why.
    Both LLAP and Tez container modes use the same heavy Hive image, but they pay the startup penalty differently:

Pure Tez Tasks: You pay the K8s pod scheduling, image pull, and JVM boot penalty on every single query. A quick 2-second query might take 45+ seconds just waiting for the K8s task pod to spin up.

LLAP on K8s: You pay that heavy infrastructure penalty exactly once when the autoscaler spins them up for your batch window. Once those 100 LLAP daemons are warm, the next 500 queries routed to them execute in milliseconds because the K8s lifecycle is completely out of the critical path.

So for that 2:00 AM batch processing window, the influx of HS2 sessions will dynamically spin up the required TezAMs, which will in turn cause the LLAP autoscaler to concurrently spin up the 100 LLAP executors. Once the window ends, the stabilization timers will gracefully scale them all back down to zero!
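
To put rough numbers on the two points in this reply, here is a back-of-the-envelope sketch. It assumes an idealized cluster with enough free capacity, treats every pod as taking the same time to become ready, and reuses the 45-second figure quoted above purely as an illustrative startup cost; both function names are hypothetical.

```python
def rollout_seconds(pods: int, boot_seconds: float, parallel: bool) -> float:
    """Idealized rollout time: one boot if parallel, the sum of boots if ordered."""
    if pods <= 0:
        return 0.0
    return boot_seconds if parallel else pods * boot_seconds


def total_startup_overhead(queries: int, boot_seconds: float, warm_pool: bool) -> float:
    """Startup cost across a batch: paid once for a warm pool, once per query otherwise."""
    if queries <= 0:
        return 0.0
    return boot_seconds if warm_pool else queries * boot_seconds


BOOT = 45.0  # assumed seconds for scheduling, image pull and JVM start

print(rollout_seconds(100, BOOT, parallel=False))          # ordered, one pod at a time
print(rollout_seconds(100, BOOT, parallel=True))           # podManagementPolicy: Parallel
print(total_startup_overhead(500, BOOT, warm_pool=False))  # a pod start per query
print(total_startup_overhead(500, BOOT, warm_pool=True))   # warm LLAP daemons
```

Real rollouts will land somewhere between these extremes, since image pulls and ZooKeeper registration do not scale perfectly in parallel.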

@zhangbutao (Contributor) commented:

@ayushtkn Great! Thanks for your effort on this feature!

If I remember correctly, there is the Hive LLAP Workload Management feature (https://issues.apache.org/jira/browse/HIVE-17481) for resource isolation. However, with that feature it is very difficult to achieve absolute resource isolation among multiple tenants within a single LLAP cluster. LLAP cannot simply and effectively limit resource usage for certain users or tasks with queues, the way Tez on YARN can.

But with LLAP on Kubernetes, I think LLAP can achieve better resource isolation. Specifically, user1 can start their own LLAP compute group, for example llap-cluster1 (comprising, say, llap0, llap1, llap2), and user2 can start their own LLAP compute group, llap-cluster2 (llap3, llap4, llap5). Both LLAP compute groups could be recognized by the same HS2/Tez AM. Users might only need to pass the LLAP compute group name in Beeline, for example:

beeline -u "jdbc:hive2://hive-hiveserver2:10001/;transportMode=http;httpPath=cliservice;llap=llap-cluster1" user1

This way, user1 and user2 can use their own LLAP instances for queries, and their query workloads will not interfere with each other at all. In this Kubernetes-based LLAP deployment model, multi-tenant resource isolation is effectively achieved — simpler and more effective than the Workload Management approach.

Haha, this is just a multi-tenant idea I came up with after seeing the elastic deployment capabilities of LLAP on Kubernetes. I think it’s worth exploring further. Thanks!

@ayushtkn (Member, Author) replied:

Thanks @zhangbutao for the great insights!

You hit the nail on the head regarding the shift from "YARN-thinking" to "Kubernetes-native thinking."

  1. Physical vs. Logical Isolation
    You are completely right about Workload Management (WLM). Trying to carve up a single JVM's heap and CPU cycles among competing tenants is incredibly complex and never gives you 100% true isolation. By shifting to Kubernetes, we get true physical isolation via namespaces, cgroups, and dedicated pod resources.

  2. How this could work technically
    What you are describing is entirely feasible. The LLAP instances register themselves in ZooKeeper under a specific app name (defaulting to @llap0). If we update the Operator to support an array of LLAP profiles (e.g., llap-cluster1, llap-cluster2), the Operator would spin up multiple independent StatefulSets, each registering to a different ZK path.

Then, exactly as you said, a user simply sets hive.llap.daemon.service.hosts=@llap-cluster1 in their JDBC string or session. TezAM would look up that specific ZK path, find those specific pods, and route the fragments exclusively to that tenant's dedicated executors.

  3. The Autoscaling Synergy
    The best part is how it ties into the autoscaling logic in this PR! Because each tenant's LLAP cluster would be its own independent K8s StatefulSet, the autoscaler would scale llap-cluster1 and llap-cluster2 completely independently. If user1 isn't running queries, their dedicated LLAP cluster scales to zero, costing nothing, while user2 can comfortably stay scaled up to 100 pods.

This is a fantastic concept for multi-tenancy. Since the core autoscaling loop and K8s operator primitives are established in this PR, building out "Multi-Tenant LLAP Compute Groups" on top of it feels like a perfect follow-up Jira ticket. I think it is definitely worth exploring! I will definitely give it a shot :-)
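
As a sketch of the per-tenant routing idea from item 2, the helper below builds a HiveServer2 JDBC URL in the standard `jdbc:hive2://host:port/db;sessionVars?hiveConfs` layout, passing `hive.llap.daemon.service.hosts` as a Hive configuration override. The helper name is hypothetical, `llap-cluster1` is the example group name from this thread, and whether HS2 would accept this property as a per-session override is an assumption that a follow-up would need to verify.

```python
def hive_jdbc_url(host: str, port: int, session_vars: dict[str, str],
                  hive_confs: dict[str, str], database: str = "") -> str:
    """Build jdbc:hive2://host:port/db;sessionVars?hiveConfs (hypothetical helper)."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    # Session variables are appended as ;key=value pairs.
    for key, value in session_vars.items():
        url += f";{key}={value}"
    # Hive configuration overrides follow a single '?' and are ';'-separated.
    if hive_confs:
        url += "?" + ";".join(f"{k}={v}" for k, v in hive_confs.items())
    return url


print(hive_jdbc_url(
    "hive-hiveserver2", 10001,
    {"transportMode": "http", "httpPath": "cliservice"},
    {"hive.llap.daemon.service.hosts": "@llap-cluster1"},
))
```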
