`InferenceasAService/Kubernetes/README.md`:
# Kubernetes Deployments with Gaudi Devices
## Prerequisites
- **Kubernetes Cluster**: Access to a Kubernetes v1.29 cluster
- **CSI Driver**: The K8s cluster must have the CSI driver installed, using the [local-path-provisioner](https://github.com/rancher/local-path-provisioner) with `local_path_provisioner_claim_root` set to `/mnt`.
- **Operating System**: Ubuntu 22.04
- **Gaudi Software Stack**: Verify that your setup uses a valid software stack for Gaudi accelerators; see the [Gaudi support matrix](https://docs.habana.ai/en/latest/Support_Matrix/Support_Matrix.html). Note that running an LLM on a CPU is possible but will significantly reduce performance.
- **Gaudi Firmware**: Make sure the firmware is installed on the Gaudi nodes. Follow the [Gaudi Firmware Installation](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#driver-fw-install-bare) guide for detailed instructions.
- **K8s Plugin for Gaudi**: Install the K8s plugin by following the instructions in [How to install K8s Plugin for Gaudi](https://docs.habana.ai/en/latest/Orchestration/Gaudi_Kubernetes/Device_Plugin_for_Kubernetes.html).
- **Hugging Face Model Access**: Ensure you have the necessary access to download and use the chosen Hugging Face model. For example, such access is mandatory when using [Mixtral-8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1).
- **Helm CLI**: Make sure the Helm CLI is installed.
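The prerequisites above can be sanity-checked with a short script. This is a minimal sketch; `hl-smi` is Habana's device status tool and is only expected to be present on Gaudi nodes.

```shell
# Check that the CLI tools assumed by this guide are available on the PATH.
for tool in kubectl helm; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING - install it before continuing"
  fi
done
# hl-smi reports Gaudi driver/firmware status; run this on each Gaudi node.
command -v hl-smi >/dev/null 2>&1 && hl-smi || echo "hl-smi not found (expected on non-Gaudi hosts)"
```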
------------

### Deploying the Intel Gaudi Base Operator on K8s

Install the Operator on a cluster by deploying a Helm chart:

#### Create the Operator namespace
```
kubectl create namespace habana-ai-operator
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
```
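To confirm that the namespace exists and carries the pod-security labels, one quick (optional) check is:

```shell
# Show the namespace together with its pod-security labels
kubectl get namespace habana-ai-operator --show-labels
```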

#### Install Helm chart
```
helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
helm repo update
helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.18.0-524 -n habana-ai-operator
```
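Once the chart is installed, the operator rollout can be checked before moving on. This is a sketch; pod names and counts vary by operator version.

```shell
# Confirm the Helm release deployed successfully
helm status habana-ai-operator -n habana-ai-operator
# Inspect the operator pods
kubectl get pods -n habana-ai-operator
# Block until all operator pods are Ready (or the timeout expires)
kubectl wait --for=condition=Ready pod --all -n habana-ai-operator --timeout=10m
```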
------------
### Kubernetes deployment steps for each model
The steps below cover the Kubernetes deployments used for Inference as a Service on Habana Gaudi. The following are example kubectl commands for TGI model inference.
Make sure to update the Hugging Face token in the YAML files before applying them: `HF_TOKEN: "<your-hf-token>"`
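One way to inject the token into every manifest at once is a small substitution loop. This is a hypothetical helper, assuming each file contains the literal placeholder `<your-hf-token>`; files that are not present are skipped.

```shell
HF_TOKEN="hf_xxx"   # replace with your real Hugging Face token
# Replace the placeholder in each manifest before applying it
for f in chatqna-tgi-llama.yml chatqna-tgi-llama70b.yml chatqna-tei.yml chatqna-teirerank.yml; do
  if [ -f "$f" ]; then
    sed -i "s|<your-hf-token>|${HF_TOKEN}|g" "$f"
  fi
done
```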

To deploy Llama3.1-8B on 1 card:
```
kubectl apply -f chatqna-tgi-llama.yml
```
To deploy Llama3.1-70B on 8 cards:
```
kubectl apply -f chatqna-tgi-llama70b.yml
```
To deploy text-embeddings-inference and the reranker:
```
kubectl apply -f chatqna-tei.yml
kubectl apply -f chatqna-teirerank.yml
```
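After applying the manifests, the rollouts can be watched until the model servers are ready. Only `chatqna-tei` is known from the manifest in this change; the other Deployment names depend on the respective YAML files, so list them first.

```shell
# TEI embedding service (Deployment name taken from chatqna-tei.yml)
kubectl rollout status deployment/chatqna-tei --timeout=15m
# List all Deployments to find the TGI ones, then wait on each by name
kubectl get deployments
```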

------------

## Verify Pods and Services

To verify the installation:
- run `kubectl get pods -A` to make sure all pods are running;
- run `kubectl get svc -A` to validate the service-specific configuration of each model deployed above.

Run the following curl command, substituting the cluster IP and service port of the target service, to validate the model response:
```
curl -k http://<Cluster-IP>:<service-port>/ -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' -H 'Content-Type: application/json'
```
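Since the services are of type `ClusterIP` (as in `chatqna-tei.yml`), they are not reachable from outside the cluster; a port-forward is a convenient way to test from a workstation. The payload below mirrors the curl example above; note that the embedding service expects the TEI request format (`{"inputs": "..."}`) rather than TGI generation parameters.

```shell
# Forward local port 8080 to the chatqna-tei service, then query it
kubectl port-forward svc/chatqna-tei 8080:80 &
sleep 2
curl http://localhost:8080/ -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
  -H 'Content-Type: application/json'
kill %1  # stop the port-forward when done
```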
------------
## License
TGI on Habana Gaudi is covered by the TGI license: https://github.com/huggingface/text-generation-inference/blob/main/LICENSE

Please reach out to api-enterprise@huggingface.co if you have any questions.
------------

`InferenceasAService/Kubernetes/chatqna-tei.yml`:
---
# Source: chatqna/charts/tei/templates/configmap.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: v1
kind: ConfigMap
metadata:
  name: chatqna-tei-config
  labels:
    helm.sh/chart: tei-1.0.0
    app.kubernetes.io/name: tei
    app.kubernetes.io/instance: chatqna
    app.kubernetes.io/version: "cpu-1.5"
    app.kubernetes.io/managed-by: Helm
data:
  MODEL_ID: "BAAI/bge-base-en-v1.5"
  PORT: "2081"
  http_proxy: ""
  https_proxy: ""
  no_proxy: ""
  NUMBA_CACHE_DIR: "/tmp"
  TRANSFORMERS_CACHE: "/tmp/transformers_cache"
  HF_HOME: "/tmp/.cache/huggingface"
  MAX_WARMUP_SEQUENCE_LENGTH: "512"
---
# Source: chatqna/charts/tei/templates/service.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: v1
kind: Service
metadata:
  name: chatqna-tei
  labels:
    helm.sh/chart: tei-1.0.0
    app.kubernetes.io/name: tei
    app.kubernetes.io/instance: chatqna
    app.kubernetes.io/version: "cpu-1.5"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 2081
      protocol: TCP
      name: tei
  selector:
    app.kubernetes.io/name: tei
    app.kubernetes.io/instance: chatqna
---
# Source: chatqna/charts/tei/templates/deployment.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatqna-tei
  labels:
    helm.sh/chart: tei-1.0.0
    app.kubernetes.io/name: tei
    app.kubernetes.io/instance: chatqna
    app.kubernetes.io/version: "cpu-1.5"
    app.kubernetes.io/managed-by: Helm
spec:
  # use explicit replica counts only if HorizontalPodAutoscaler is disabled
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: tei
      app.kubernetes.io/instance: chatqna
  template:
    metadata:
      labels:
        app.kubernetes.io/name: tei
        app.kubernetes.io/instance: chatqna
    spec:
      securityContext:
        {}
      containers:
        - name: tei
          envFrom:
            - configMapRef:
                name: chatqna-tei-config
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: false
            runAsNonRoot: true
            runAsUser: 1000
            seccompProfile:
              type: RuntimeDefault
          image: "ghcr.io/huggingface/tei-gaudi:latest"
          imagePullPolicy: IfNotPresent
          args:
            - "--auto-truncate"
          volumeMounts:
            - mountPath: /data
              name: model-volume
            - mountPath: /dev/shm
              name: shm
            - mountPath: /tmp
              name: tmp
          ports:
            - name: http
              containerPort: 2081
              protocol: TCP
          livenessProbe:
            failureThreshold: 24
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          startupProbe:
            failureThreshold: 120
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            limits:
              habana.ai/gaudi: 1
      volumes:
        - name: model-volume
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
        - name: tmp
          emptyDir: {}
---
# Source: chatqna/charts/tei/templates/horizontalPodAutoscaler.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
---
# Source: chatqna/charts/tei/templates/servicemonitor.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
---
# Source: chatqna/charts/tgi/templates/horizontalPorAutoscaler.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
---
# Source: chatqna/charts/tgi/templates/servicemonitor.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#
# Dashboard for the exposed TGI metrics:
# - https://grafana.com/grafana/dashboards/19831-text-generation-inference-dashboard/
# Metric descriptions:
# - https://github.com/huggingface/text-generation-inference/discussions/1127#discussioncomment-7240527
---
# Source: chatqna/templates/customMetrics.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0