`InferenceasAService/Kubernetes/README.md`:
# Kubernetes Deployments with Gaudi Devices
## Prerequisites
- **Kubernetes Cluster**: Access to a Kubernetes v1.29 cluster
- **CSI Driver**: The K8s cluster must have the CSI driver installed, using the [local-path-provisioner](https://github.com/rancher/local-path-provisioner) with `local_path_provisioner_claim_root` set to `/mnt`.
- **Operating System**: Ubuntu 22.04
- **Gaudi Software Stack**: Verify that your setup uses a valid software stack for Gaudi accelerators; see the [Gaudi support matrix](https://docs.habana.ai/en/latest/Support_Matrix/Support_Matrix.html). Note that running an LLM on a CPU is possible but will significantly reduce performance.
- **Gaudi Firmware**: Make sure the firmware is installed on the Gaudi nodes. Follow the [Gaudi Firmware Installation](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#driver-fw-install-bare) guide for detailed instructions.
- **K8s Plugin for Gaudi**: Install the K8s plugin by following the instructions in [How to install K8s Plugin for Gaudi](https://docs.habana.ai/en/latest/Orchestration/Gaudi_Kubernetes/Device_Plugin_for_Kubernetes.html).
- **Hugging Face Model Access**: Ensure you have the necessary access to download and use the chosen Hugging Face model. For example, such access is mandatory when using [Mixtral-8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1).
- **Helm CLI**: Make sure the Helm CLI is installed.
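The prerequisites above can be sanity-checked with a short script. This is a minimal sketch; `hl-smi` is Habana's device status tool and is only expected to be present on Gaudi nodes.

```shell
# Check that the CLI tools assumed by this guide are available on the PATH.
for tool in kubectl helm; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING - install it before continuing"
  fi
done
# hl-smi reports Gaudi driver/firmware status; run this on each Gaudi node.
command -v hl-smi >/dev/null 2>&1 && hl-smi || echo "hl-smi not found (expected on non-Gaudi hosts)"
```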
------------

### Deploying the Intel Gaudi Base Operator on K8s

Install the Operator on a cluster by deploying a Helm chart:

#### Create the Operator namespace
```
kubectl create namespace habana-ai-operator
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
```
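To confirm that the namespace exists and carries the pod-security labels, one quick (optional) check is:

```shell
# Show the namespace together with its pod-security labels
kubectl get namespace habana-ai-operator --show-labels
```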

#### Install Helm chart
```
helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
helm repo update
helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.18.0-524 -n habana-ai-operator
```
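Once the chart is installed, the operator rollout can be checked before moving on. This is a sketch; pod names and counts vary by operator version.

```shell
# Confirm the Helm release deployed successfully
helm status habana-ai-operator -n habana-ai-operator
# Inspect the operator pods
kubectl get pods -n habana-ai-operator
# Block until all operator pods are Ready (or the timeout expires)
kubectl wait --for=condition=Ready pod --all -n habana-ai-operator --timeout=10m
```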
------------
### Kubernetes deployment steps for each model
The steps below cover the Kubernetes deployments used for Inference as a Service on Habana Gaudi. The following are example kubectl commands for TGI model inference.
Make sure to update the Hugging Face token in the YAML files before applying them: `HF_TOKEN: "<your-hf-token>"`
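One way to inject the token into every manifest at once is a small substitution loop. This is a hypothetical helper, assuming each file contains the literal placeholder `<your-hf-token>`; files that are not present are skipped.

```shell
HF_TOKEN="hf_xxx"   # replace with your real Hugging Face token
# Replace the placeholder in each manifest before applying it
for f in chatqna-tgi-llama.yml chatqna-tgi-llama70b.yml chatqna-tei.yml chatqna-teirerank.yml; do
  if [ -f "$f" ]; then
    sed -i "s|<your-hf-token>|${HF_TOKEN}|g" "$f"
  fi
done
```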

To deploy Llama3.1-8B on 1 card:
```
kubectl apply -f chatqna-tgi-llama.yml
```
To deploy Llama3.1-70B on 8 cards:
```
kubectl apply -f chatqna-tgi-llama70b.yml
```
To deploy text-embeddings-inference and the reranker:
```
kubectl apply -f chatqna-tei.yml
kubectl apply -f chatqna-teirerank.yml
```
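After applying the manifests, the rollouts can be watched until the model servers are ready. Only `chatqna-tei` is known from the manifest in this change; the other Deployment names depend on the respective YAML files, so list them first.

```shell
# TEI embedding service (Deployment name taken from chatqna-tei.yml)
kubectl rollout status deployment/chatqna-tei --timeout=15m
# List all Deployments to find the TGI ones, then wait on each by name
kubectl get deployments
```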

------------

## Verify Pods and Services

To verify the installation:
- run `kubectl get pods -A` to make sure all pods are running;
- run `kubectl get svc -A` to validate the service-specific configuration of each model deployed above.

Run the following curl command, substituting the cluster IP and service port of the target service, to validate the model response:
```
curl -k http://<Cluster-IP>:<service-port>/ -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' -H 'Content-Type: application/json'
```
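Since the services are of type `ClusterIP` (as in `chatqna-tei.yml`), they are not reachable from outside the cluster; a port-forward is a convenient way to test from a workstation. The payload below mirrors the curl example above; note that the embedding service expects the TEI request format (`{"inputs": "..."}`) rather than TGI generation parameters.

```shell
# Forward local port 8080 to the chatqna-tei service, then query it
kubectl port-forward svc/chatqna-tei 8080:80 &
sleep 2
curl http://localhost:8080/ -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
  -H 'Content-Type: application/json'
kill %1  # stop the port-forward when done
```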
------------
## License
TGI on Habana Gaudi is covered by the TGI license: https://github.com/huggingface/text-generation-inference/blob/main/LICENSE

Please reach out to api-enterprise@huggingface.co if you have any questions.
------------

`InferenceasAService/Kubernetes/chatqna-tei.yml`:
---
# Source: chatqna/charts/tei/templates/configmap.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: v1
kind: ConfigMap
metadata:
  name: chatqna-tei-config
  labels:
    helm.sh/chart: tei-1.0.0
    app.kubernetes.io/name: tei
    app.kubernetes.io/instance: chatqna
    app.kubernetes.io/version: "cpu-1.5"
    app.kubernetes.io/managed-by: Helm
data:
  MODEL_ID: "BAAI/bge-base-en-v1.5"
  PORT: "2081"
  http_proxy: ""
  https_proxy: ""
  no_proxy: ""
  NUMBA_CACHE_DIR: "/tmp"
  TRANSFORMERS_CACHE: "/tmp/transformers_cache"
  HF_HOME: "/tmp/.cache/huggingface"
  MAX_WARMUP_SEQUENCE_LENGTH: "512"
---
# Source: chatqna/charts/tei/templates/service.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: v1
kind: Service
metadata:
  name: chatqna-tei
  labels:
    helm.sh/chart: tei-1.0.0
    app.kubernetes.io/name: tei
    app.kubernetes.io/instance: chatqna
    app.kubernetes.io/version: "cpu-1.5"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 2081
      protocol: TCP
      name: tei
  selector:
    app.kubernetes.io/name: tei
    app.kubernetes.io/instance: chatqna
---
# Source: chatqna/charts/tei/templates/deployment.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatqna-tei
  labels:
    helm.sh/chart: tei-1.0.0
    app.kubernetes.io/name: tei
    app.kubernetes.io/instance: chatqna
    app.kubernetes.io/version: "cpu-1.5"
    app.kubernetes.io/managed-by: Helm
spec:
  # use explicit replica counts only if HorizontalPodAutoscaler is disabled
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: tei
      app.kubernetes.io/instance: chatqna
  template:
    metadata:
      labels:
        app.kubernetes.io/name: tei
        app.kubernetes.io/instance: chatqna
    spec:
      securityContext:
        {}
      containers:
        - name: tei
          envFrom:
            - configMapRef:
                name: chatqna-tei-config
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: false
            runAsNonRoot: true
            runAsUser: 1000
            seccompProfile:
              type: RuntimeDefault
          image: "ghcr.io/huggingface/tei-gaudi:latest"
          imagePullPolicy: IfNotPresent
          args:
            - "--auto-truncate"
          volumeMounts:
            - mountPath: /data
              name: model-volume
            - mountPath: /dev/shm
              name: shm
            - mountPath: /tmp
              name: tmp
          ports:
            - name: http
              containerPort: 2081
              protocol: TCP
          livenessProbe:
            failureThreshold: 24
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          startupProbe:
            failureThreshold: 120
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            limits:
              habana.ai/gaudi: 1
      volumes:
        - name: model-volume
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
        - name: tmp
          emptyDir: {}
---
# Source: chatqna/charts/tei/templates/horizontalPodAutoscaler.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
---
# Source: chatqna/charts/tei/templates/servicemonitor.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
---
# Source: chatqna/charts/tgi/templates/horizontalPorAutoscaler.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
---
# Source: chatqna/charts/tgi/templates/servicemonitor.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#
# Dashboard for the exposed TGI metrics:
# - https://grafana.com/grafana/dashboards/19831-text-generation-inference-dashboard/
# Metric descriptions:
# - https://github.com/huggingface/text-generation-inference/discussions/1127#discussioncomment-7240527
---
# Source: chatqna/templates/customMetrics.yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0