Start a query Cluster on kubernetes
tip
Expected deployment time: 5 minutes ⏱
This tutorial covers how to install and configure Databend query cluster on kubernetes with minio storage backend.
Before you begin
- Make sure your cluster have enough resource for installation (at least 4 cpus, 4GB RAM, 50GB disk)
- Make sure you have a kubernetes cluster up and running, please take a look on k3d, minikube
- Databend Cluster mode only works on shared storage(AWS S3 or MinIO s3-like storage).
- This cluster mainly used for testing purpose, it is not targeted for production use.
Step 1. Deploy sample minio
caution
This configuration is for demonstration ONLY, never use it in production, please take a look at https://docs.min.io/docs/deploy-minio-on-kubernetes.html for more information on production TLS and High Availability configurations.
We will bootstrap a minio server on kubernetes, with the following configurations
STORAGE_TYPE=s3
S3_STORAGE_BUCKET=sample-storage
S3_STORAGE_REGION=us-east-1
S3_STORAGE_ENDPOINT_URL=http://minio.minio.svc.cluster.local:9000
S3_STORAGE_ACCESS_KEY_ID=minio
S3_STORAGE_SECRET_ACCESS_KEY=minio123
The following configuration shall be applied to the target kubernetes cluster, it would create a bucket named sample-storage
with 10Gi
storage space
kubectl create namespace minio --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f https://raw.githubusercontent.com/datafuselabs/databend/main/scripts/kubernetes/minio-sample.yaml -n minio
Step 2. Deploy standalone databend meta-service layer
The following configuration would configure a standalone databend meta-service on databend-system
namespace
kubectl create namespace databend-system --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f https://raw.githubusercontent.com/datafuselabs/databend/main/scripts/kubernetes/meta-standalone.yaml -n databend-system
Step 3. Deploy databend query cluster
The following configuration would configure a databend query cluster on tenant1
namespace
Each pod under the deployment have 900m
vCPU with 900Mi
memory
kubectl create namespace tenant1 --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f https://raw.githubusercontent.com/datafuselabs/databend/main/scripts/kubernetes/query-cluster.yaml -n tenant1
To scale up or down the query cluster, please use the following command
# scale query cluster number to 0
kubectl scale -n tenant1 deployment query --replicas=0
# scale query cluster number to 3
kubectl scale -n tenant1 deployment query --replicas=3
3.1 Check the Cluster Information
NOTICE: Please make sure that the localhost port 3308 is available.
nohup kubectl port-forward -n tenant1 svc/query-service 3308:3307 &
mysql -h127.0.0.1 -uroot -P3308
SELECT * FROM system.clusters
+----------------------+------------+------+
| name | host | port |
+----------------------+------------+------+
| dIUkzbOaqJEPudb0A7j4 | 172.17.0.6 | 9191 |
| NzfBm4KIQGEHe0sxAWa3 | 172.17.0.7 | 9191 |
| w3MuQR8aTHKHC1OLj5a6 | 172.17.0.5 | 9191 |
+----------------------+------------+------+
Step 4. Distributed query
EXPLAIN SELECT max(number), sum(number) FROM numbers_mt(10000000000) GROUP BY number % 3, number % 4, number % 5 LIMIT 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10 |
| RedistributeStage[expr: 0] |
| Projection: max(number):UInt64, sum(number):UInt64 |
| AggregatorFinal: groupBy=[[(number % 3), (number % 4), (number % 5)]], aggr=[[max(number), sum(number)]] |
| RedistributeStage[expr: sipHash(_group_by_key)] |
| AggregatorPartial: groupBy=[[(number % 3), (number % 4), (number % 5)]], aggr=[[max(number), sum(number)]] |
| Expression: (number % 3):UInt8, (number % 4):UInt8, (number % 5):UInt8, number:UInt64 (Before GroupBy) |
| ReadDataSource: scan schema: [number:UInt64], statistics: [read_rows: 10000000000, read_bytes: 80000000000, partitions_scanned: 1000001, partitions_total: 1000001], push_downs: [projections: [0]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The distributed query works, the cluster will efficiently transfer data through flight_api_address
.
Step 4.1. Upload the data to the cluster
CREATE TABLE t1(i INT, j INT);
INSERT INTO t1 SELECT number, number + 300 from numbers(10000000);
SELECT count(*) FROM t1;
+----------+
| count() |
+----------+
| 10000000 |
+----------+