# KEMU: A Declarative Approach to Emulating Kubernetes Clusters at Scale

Optimizing scheduling efficiency for AI workloads requires extensive experimentation and observation due to the constrained supply and high cost of datacenter GPUs. Maximizing infrastructure efficiency through optimized scheduling and high-density bin packing is critical for achieving high workload throughput, resource utilization, and cost efficiency. Introducing scheduler modifications at production scale is risky: configuration errors can render workloads unschedulable, causing multi-day delays for dozens or hundreds of ML engineers, idle capacity, wasted on-demand resources, and delayed delivery. It is imperative that scheduler modifications with a big blast radius are tested and verified in a safe environment before shipping to production.

This problem is not new, and virtual/emulated clusters for scheduling experimentation and workload right-sizing have been around for a long time. Emulated clusters provide a limited real-cluster experience for a fraction of the price and compute resources required to run them.
To understand what an effective emulation solution needs, let's first establish the requirements based on a typical production GPU cluster setup.

## Requirements

Let's consider the following cluster setup to provide background for the functionality of the emulated cluster:

- A Kubernetes cluster with 1,000+ GPU nodes of different types;
- The nodes are spread across multiple topology domains (availability zones, racks);
- Specialized scheduling and training operators are running on the cluster;
- Observability is provided via the Prometheus stack.

Based on this example, we can derive the following high-level requirements for the emulated cluster:

- Large cluster emulation using few resources (e.g., it can run on a laptop or as part of a CI job);
- Ability to execute workloads and schedule pods on emulated nodes;
- Configurability of cluster shape and size: number of nodes, instance types, node resources, and placement;
- Ability to install, configure, and run unemulated applications such as Kubernetes operators and an observability stack;
- Reproducibility of emulated clusters;
- Automation for cluster lifecycle management.

Well-known technologies and techniques address many (but not all) of these requirements. The next section provides a brief overview.

## Prior art

The most frequently used solution for this set of requirements is a combination of a Kind cluster with emulated KWOK nodes. In such setups, the Kind cluster runs the Kubernetes control plane components and, potentially, a data plane, while KWOK manages the emulated data plane and pod lifecycle.
Each technology alone doesn't solve the problem in full: Kind provides a fully conformant Kubernetes control plane and data plane in Docker but doesn't scale well, whereas KWOK scales to thousands of nodes but doesn't easily allow running real workloads (such as schedulers and operators) due to the fully emulated data plane.

Other technologies in the emulation domain are less applicable to this problem but worth mentioning:

- SimKube: a record-and-replay simulation environment for Kubernetes based on KWOK;
- Minikube: similar to Kind but runs the Kubernetes control plane in a VM instead of Docker;
- Virtual Kubelet: an open-source Kubernetes kubelet implementation that masquerades as a kubelet. It can be used to implement highly customizable emulation of Kubelet behavior;
- kubemark: a performance testing tool that allows users to run experiments on simulated clusters.

Once the control plane and worker nodes are ready, the required dependencies/addons can be installed with Helm Charts, kustomize, or raw Kubernetes manifests (using the helm or kubectl CLIs) to provide the desired component set for the functionality under test and for observability.

The main problem with this approach lies in the configurability and automation of the cluster bootstrap. Every component used in the target setup is versioned and might require custom configuration (for example, a Helm values file). Kind and KWOK might require custom configuration themselves, too. The diagram below demonstrates an example of such a cluster bootstrap process. With a small number of components and an infrequently changing control plane setup, the whole process could be automated with a Makefile, helper scripts, and pre-defined configuration files. However, when it comes to emulated data plane configuration and management, the automation becomes quite involved and might require excessive templating.
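To illustrate the per-node overhead, here is what a single emulated KWOK node manifest typically looks like (the shape follows the KWOK documentation; the name, labels, and capacity values are illustrative):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: kwok-node-0
  annotations:
    kwok.x-k8s.io/node: fake
  labels:
    kubernetes.io/hostname: kwok-node-0
    kubernetes.io/role: agent
    node-role.kubernetes.io/agent: ""
    type: kwok
spec:
  taints:
    - key: kwok.x-k8s.io/node
      value: fake
      effect: NoSchedule
status:
  allocatable:
    cpu: "96"
    memory: 1360Gi
    nvidia.com/gpu: "8"
    pods: "110"
  capacity:
    cpu: "96"
    memory: 1360Gi
    nvidia.com/gpu: "8"
    pods: "110"
```

Every node in the cluster needs its own copy of such a manifest with a unique name, placement labels, and capacity, which is exactly what the scripting and templating must produce at scale.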
Creating a thousand emulated nodes with varying capacity and different placement options depending on the desired cluster shape, and doing so reproducibly, can quickly become overwhelming with scripting and templating alone. For example, creating a 1,000-node cluster with 3 instance types across 3 availability zones using KWOK manifests requires:

- 1,000 individual node YAML files or repeated application of predefined templates (reference);
- Custom scripts to manage naming, labeling, and placement;
- Manual coordination of Kind, KWOK, and Helm installations;
- Significant effort to modify the cluster shape or rerun experiments.

Automating this with Makefiles and shell scripts is fragile and difficult to maintain as cluster requirements evolve. Making this process repeatable, and clusters easily reconfigurable, requires a programmatic solution driven by declarative configuration. KEMU (Kubernetes Emulator Utility) is a tool designed to eliminate this complexity entirely: one declarative specification replaces multiple node manifests and templates, coordination scripts, and fragmented configuration files.

## Introducing KEMU - A Declarative Kubernetes Emulator Utility

KEMU provides a single-spec declarative approach for configuring control plane nodes, installing cluster addons, and defining emulated cluster nodes with various capacity and placement options.

KEMU builds on Kind for control plane and worker node deployment, which is used for running the auxiliary software required for experimentation. Examples include the Prometheus Operator for observability, custom schedulers (Volcano, YuniKorn), workload management operators (Kueue, KubeRay), etc.
Running these components requires actual Kubelet(s) to be available for scheduling, and Kind provides sufficient functionality for this.

The Kubelet emulation is based on KWOK. KEMU provides a lightweight configuration scheme for defining node groups with various properties and generates the specified number of nodes automatically. Creating a cluster with KEMU is as easy as invoking the utility and providing it with a path or a URL to the cluster configuration file:

```shell
go install github.com/datastrophic/kemu@latest
kemu create-cluster --kubeconfig $(pwd)/kemu.config --cluster-config https://raw.githubusercontent.com/datastrophic/kemu/refs/tags/v0.1.0/examples/gcp-small.yaml
```

The cluster is accessible with standard Kubernetes tools and SDKs using the kubeconfig file:

```shell
export KUBECONFIG=$(pwd)/kemu.config
kubectl get nodes
# NAME                    STATUS   ROLES           AGE     VERSION
# a2-ultragpu-8g-use1-0   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use1-1   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use1-2   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use1-3   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use1-4   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use2-0   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use2-1   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use2-2   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use2-3   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use2-4   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use3-0   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use3-1   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use3-2   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use3-3   Ready    agent           99s     v1.33.1
# a2-ultragpu-8g-use3-4   Ready    agent           98s     v1.33.1
# a3-highgpu-8g-use1-0    Ready    agent           98s     v1.33.1
# a3-highgpu-8g-use1-1    Ready    agent           98s     v1.33.1
# a3-highgpu-8g-use1-2    Ready    agent           98s     v1.33.1
# a3-highgpu-8g-use1-3    Ready    agent           98s     v1.33.1
# a3-highgpu-8g-use1-4    Ready    agent           97s     v1.33.1
# a3-highgpu-8g-use2-0    Ready    agent           97s     v1.33.1
# a3-highgpu-8g-use2-1    Ready    agent           97s     v1.33.1
# a3-highgpu-8g-use2-2    Ready    agent           97s     v1.33.1
# a3-highgpu-8g-use2-3    Ready    agent           97s     v1.33.1
# a3-highgpu-8g-use2-4    Ready    agent           96s     v1.33.1
# a3-ultragpu-8g-use1-0   Ready    agent           96s     v1.33.1
# a3-ultragpu-8g-use1-1   Ready    agent           96s     v1.33.1
# a3-ultragpu-8g-use1-2   Ready    agent           96s     v1.33.1
# a3-ultragpu-8g-use1-3   Ready    agent           96s     v1.33.1
# a3-ultragpu-8g-use1-4   Ready    agent           95s     v1.33.1
# kwok-control-plane      Ready    control-plane   7m58s   v1.33.1
```

## Cluster Specification

The following example demonstrates the key components of the KEMU cluster specification. The specification contains 3 main sections:

- kindConfig: a YAML configuration used for creating the Kind cluster. This is a standard Kind configuration that is passed to the Kind cluster provisioner without any modifications.
- clusterAddons: a list of Helm Charts to be installed as part of the cluster bootstrap process. Each cluster addon can be parameterized with a valuesObject containing Helm Chart values for the installation.
- nodeGroups: emulated node groups sharing similar properties (instance type, capacity) and the placement of the nodes. Node placement allows configuring the number of nodes in different availability zones.

Example specification:

```yaml
apiVersion: kemu.datastrophic.io/v1alpha1
kind: ClusterConfig
spec:
  kindConfig: |
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
    - role: worker
    - role: worker
  clusterAddons:
  - name: prometheus
    repoName: prometheus-community
    repoURL: https://prometheus-community.github.io/helm-charts
    namespace: monitoring
    chart: prometheus-community/kube-prometheus-stack
    version: 75.16.1
    valuesObject: |
      alertmanager:
        enabled: false
  nodeGroups:
  - name: a2-ultragpu-8g
    placement:
    - availabilityZone: use1
      replicas: 50
    - availabilityZone: use2
      replicas: 50
    - availabilityZone: use3
      replicas: 50
    nodeTemplate:
      metadata:
        labels:
          datastrophic.io/gpu-type: nvidia-a100-80gb # Custom label for GPU targeting.
      capacity:
        cpu: 96
        memory: 1360Gi
        ephemeralStorage: 3Ti
        nvidia.com/gpu: 8
```

## Scheduling and monitoring of workloads

A typical scheduling experimentation workflow consists of a cluster bootstrap; scheduling and monitoring of the workloads; iteration on the cluster or workload configuration; and tearing down the cluster once experiments are complete.

To demonstrate how KEMU helps with experimentation, we will create a 2,500-node cluster with A100, H100, and H200 Nvidia GPUs distributed across 3 availability zones (the cluster specification is available in kemu/examples/gcp-large.yaml on GitHub). Once the cluster is up and running, we will create a number of jobs to saturate the cluster, observe the allocation, and modify scheduling constraints to understand the impact they have on scheduling efficiency.

To create a cluster and explore it using the Prometheus stack, follow these steps:

```shell
# Create the KEMU cluster:
kemu create-cluster --name gcp-large --kubeconfig $(pwd)/kemu.config --cluster-config https://raw.githubusercontent.com/datastrophic/kemu/refs/tags/v0.1.0/examples/gcp-large.yaml
# Retrieve the Grafana password:
kubectl --kubeconfig $(pwd)/kemu.config get secret prometheus-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 -d
# Expose the Grafana port on localhost:
kubectl --kubeconfig $(pwd)/kemu.config port-forward --namespace monitoring svc/prometheus-grafana 8080:80
```

After the cluster bootstrap is complete and port forwarding is enabled, a Grafana instance becomes available at http://localhost:8080/ and requires authentication with the username admin and the retrieved password. Navigate to the pre-installed cluster overview dashboard for cluster-level information about cluster composition, available and allocated capacity, and their distribution across availability zones and device types.

### Optimal scheduling example

Let's create some synthetic Jobs that require different GPU types and observe the cluster saturation.
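To make the nodeGroups semantics concrete, here is a small Python sketch (not KEMU's actual implementation) of how such a group/placement spec expands into individual node objects; the zone label key is an assumption based on standard Kubernetes topology labels:

```python
# Hypothetical sketch of nodeGroups expansion (not KEMU's actual code):
# each group/zone pair yields `replicas` nodes named <group>-<zone>-<index>.
node_groups = [
    {
        "name": "a2-ultragpu-8g",
        "placement": [
            {"availabilityZone": "use1", "replicas": 50},
            {"availabilityZone": "use2", "replicas": 50},
            {"availabilityZone": "use3", "replicas": 50},
        ],
        "labels": {"datastrophic.io/gpu-type": "nvidia-a100-80gb"},
    },
]

def expand(groups):
    """Expand declarative node groups into per-node definitions."""
    nodes = []
    for group in groups:
        for placement in group["placement"]:
            for i in range(placement["replicas"]):
                nodes.append({
                    "name": f'{group["name"]}-{placement["availabilityZone"]}-{i}',
                    "labels": {
                        **group["labels"],
                        "topology.kubernetes.io/zone": placement["availabilityZone"],
                    },
                })
    return nodes

nodes = expand(node_groups)
print(len(nodes))        # 150
print(nodes[0]["name"])  # a2-ultragpu-8g-use1-0
```

A few dozen lines of specification thus replace hundreds of nearly identical node manifests, and changing the cluster shape means editing replica counts rather than regenerating files.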
The number of jobs, their parallelism, and their resource requirements are designed to demonstrate 100% GPU allocation (2,550 pods × 8 GPUs = 20,400 GPUs allocated out of 20,400 available).

Below is an example job manifest for a distributed A100 job with 100 workers. Note the annotation pod-complete.stage.kwok.x-k8s.io/delay: "10m", which tells KWOK how long to keep pods in the Running state before marking them Completed. Also note that there are no scheduling constraints specified for this job beyond the node affinity and tolerations needed to land its pods on KWOK nodes with a particular GPU type.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: job-a100-no-az-affinity-
spec:
  completions: 100
  parallelism: 100
  template:
    metadata:
      annotations:
        pod-complete.stage.kwok.x-k8s.io/delay: "10m"
    spec:
      restartPolicy: Never
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: datastrophic.io/gpu-type
                operator: In
                values: ["nvidia-a100-80gb"]
              - key: type
                operator: In
                values: ["kwok"]
      tolerations:
      - effect: "NoSchedule"
        key: "kwok.x-k8s.io/node"
        value: "fake"
      containers:
      - name: fake-container
        image: fake-image
        resources:
          requests:
            cpu: 180
            memory: 1100Gi
            nvidia.com/gpu: 8
          limits:
            cpu: 180
            memory: 1100Gi
            nvidia.com/gpu: 8
```

For convenience, we will be using the workload specs from kemu/examples/workloads and will create 15 100-worker A100 jobs, 15 50-worker H100 jobs, and 12 25-worker H200 jobs.

```shell
# Note: Pods will run for 10 minutes before completing (configured via the KWOK annotation).
for i in {1..15}; do
  kubectl --kubeconfig $(pwd)/kemu.config create -f https://raw.githubusercontent.com/datastrophic/kemu/refs/tags/v0.1.0/examples/workloads/job-a100-no-affinity-100-workers.yaml
done
for i in {1..15}; do
  kubectl --kubeconfig $(pwd)/kemu.config create -f https://raw.githubusercontent.com/datastrophic/kemu/refs/tags/v0.1.0/examples/workloads/job-h100-no-affinity-50-workers.yaml
done
for i in {1..12}; do
  kubectl --kubeconfig $(pwd)/kemu.config create -f https://raw.githubusercontent.com/datastrophic/kemu/refs/tags/v0.1.0/examples/workloads/job-h200-no-affinity-25-workers.yaml
done
```

Once the jobs are created, navigate to the Grafana dashboard to observe cluster saturation and workload status. As we can see, all jobs ran in parallel, resulting in 2,550 running pods distributed evenly across 3 availability zones. No pods were stuck in the Pending state, and overall cluster GPU allocation reached 100%. This level of allocation is rarely achieved in production environments, where scheduling constraints like availability zone affinity, anti-affinity rules, and resource fragmentation typically reduce utilization. The following section demonstrates the impact of adding just one common constraint.

### Suboptimal scheduling example

Availability zone affinity is a common requirement for distributed batch jobs, used to achieve better network performance and avoid cross-AZ traffic costs. We will apply the following podAffinity rule to the jobs from the previous example and repeat the experiment:

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: "datastrophic.io/workload"
          operator: In
          values:
          - "JOB_ID"
```

To configure pod affinity for a specific job, a unique label and label selector must be specified in the job template before applying the manifests. We will use the existing templates and substitute the JOB_ID placeholder.
For convenience, workload templates are provided in kemu/examples/workloads.

```shell
# Note: Pods will run for 10 minutes before completing (configured via the KWOK annotation).
for i in {1..15}; do
  curl -s https://raw.githubusercontent.com/datastrophic/kemu/refs/tags/v0.1.0/examples/workloads/job-a100-az-affinity-100-workers.tmpl | \
    sed "s/JOB_ID/a100-job-with-affinity-${i}/g" | \
    kubectl --kubeconfig $(pwd)/kemu.config create -f -
done
for i in {1..15}; do
  curl -s https://raw.githubusercontent.com/datastrophic/kemu/refs/tags/v0.1.0/examples/workloads/job-h100-az-affinity-50-workers.tmpl | \
    sed "s/JOB_ID/h100-job-with-affinity-${i}/g" | \
    kubectl --kubeconfig $(pwd)/kemu.config create -f -
done
for i in {1..12}; do
  curl -s https://raw.githubusercontent.com/datastrophic/kemu/refs/tags/v0.1.0/examples/workloads/job-h200-az-affinity-25-workers.tmpl | \
    sed "s/JOB_ID/h200-job-with-affinity-${i}/g" | \
    kubectl --kubeconfig $(pwd)/kemu.config create -f -
done
```

Once the jobs are created, navigate to the Grafana dashboard to observe cluster saturation and workload status. From the dashboard, we can see that while all jobs are in the Running state, some pods remain Pending even though unallocated capacity is still available.

What we're observing is partial admission, a common scheduling problem in Kubernetes clusters. The Kubernetes scheduler processes pods individually, without job-level awareness. Pods created by jobs are not scheduled as a batch but rather one by one, resulting in fragmentation and an inability to schedule all pods even when cluster capacity is available.

For distributed training workloads that require all workers to start simultaneously (all-or-nothing semantics), this partial admission creates a deadlock. This simple experiment reveals a scheduling issue that would be expensive to discover in production.
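The mechanics of partial admission can be illustrated with a deliberately simplified Python sketch (not a scheduler implementation): pods are bound one at a time, and same-zone affinity pins each job to the zone of its first pod. The zone sizes and job shapes below are made up for illustration.

```python
# Minimal partial-admission illustration: two 6-worker gang jobs,
# a cluster with 4 free nodes in each of 3 zones (12 free nodes total).
zones = {"use1": 4, "use2": 4, "use3": 4}  # free nodes per availability zone

def schedule_gang(zones, workers):
    """Bind pods one by one; zone affinity pins the job to the first pod's zone."""
    pinned = None
    placed = 0
    for _ in range(workers):
        if pinned is None:
            # The first pod lands in the zone with the most free nodes.
            pinned = max(zones, key=zones.get)
        if zones[pinned] > 0:
            zones[pinned] -= 1
            placed += 1
    return placed, workers - placed

placed_a, pending_a = schedule_gang(zones, 6)
placed_b, pending_b = schedule_gang(zones, 6)

print(placed_a, pending_a)  # 4 2
print(placed_b, pending_b)  # 4 2
print(sum(zones.values()))  # 4 nodes still free, yet neither job can start
```

Each job gets only 4 of its 6 workers placed, the third zone's capacity is unusable to either job, and with all-or-nothing semantics neither job makes progress despite free capacity — exactly the deadlock described above.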
With KEMU, you can:

- Test different node distributions across availability zones;
- Experiment with gang scheduling solutions (Volcano, YuniKorn);
- Measure the impact of scheduling policies on cluster allocation and utilization;
- Validate scheduler behavior under different constraint scenarios.

To iterate on solutions, you could modify the cluster specification to adjust the zone distribution, install a gang scheduler addon, or test different scheduling policies, all without touching production infrastructure. When you're done experimenting, clean up with kemu delete-cluster.

## Conclusion

Emulated clusters provide safe environments for testing scheduling optimizations at scale without imposing risks on production infrastructure. Existing solutions such as Kind and KWOK are battle-tested and widely used by the community for solving this problem.

However, end-to-end cluster bootstrap using the existing technologies suffers from a proliferation of tools, fragmented configuration, a lack of native support for cluster addons, and the multi-step process required for configuring the desired cluster shape.

KEMU addresses these problems with a single opinionated utility driven by declarative configuration. Once defined, an emulated cluster can be reliably recreated by a human or a CI system in minutes, without custom scripting or extensive templating. Changing the cluster shape, size, capacity, or pre-installed software only requires updating the cluster specification.

KEMU eliminates the complexity of emulated cluster management, freeing you from maintaining boilerplate scripts and configuration files so you can focus on experimentation and rapid iteration. Head over to the KEMU GitHub repository to get started. We'd love your feedback and contributions!

## References

- Scale Your Batch / Big Data / AI Workloads Beyond the Kubernetes Scheduler: a KubeCon talk that originally sparked my interest in automating emulated clusters.
- Kubernetes E2E Framework: an amazing piece of community work in the Kubernetes testing space. KEMU relies on its API for Kind cluster management.
- Go Helm Client: used in KEMU for managing Helm Chart installations.
- kind and KWOK: the well-known cluster emulation tools KEMU relies on.

*Published 4 November 2025 at https://datastrophic.io/declarative-kubernetes-cluster-emulation-with-kemu/*

---

Publicly exposed insecure service endpoints on Kubernetes pose a major risk of malicious workloads being deployed on your clusters.
We've seen reports of the Kubernetes Dashboard, the Kubeflow Central Dashboard, and Kubeflow Pipelines all being compromised when publicly exposed to the Internet. Combined with wide RBAC permissions, publicly exposed software with workload scheduling capabilities opens your clusters to malicious deployments by anybody who knows the endpoint URL.

This blog post focuses on building a secure ingress and authentication stack on Kubernetes with Istio, targeting Kubeflow installations. The post covers the existing approach used in the open-source Kubeflow distribution and its shortcomings, and provides an alternative solution that uses the latest security features from Istio and an alternative authentication proxy.

## Kubeflow Ingress and Authentication

### Overview

The main open-source Kubeflow releases reside in the kubeflow/manifests repository, which aims at easing the installation and providing a tested starting point for derived distributions (link). The repository provides manifests both for the Kubeflow components and for the dependencies required for the ingress and security stack, such as Istio, Dex, and the OIDC AuthService. Kubeflow relies on Istio for ingress, traffic routing, and authorization policies for multi-tenancy. Let's consider the high-level ingress architecture and resource placement shown in the diagram below:

The Kubeflow installation configures the default shared Istio Ingress Gateway running in the istio-system namespace using Gateway and VirtualService custom resources for routing and an EnvoyFilter for request authorization.
The EnvoyFilter forwards incoming traffic to the OIDC AuthService, which validates the authorization cookie and can either allow the request to proceed, deny it due to invalid authorization, or trigger an authentication workflow with an external system such as Dex.

### Shortcomings of the existing approach

#### Tight coupling of the ingress stack with Kubeflow

Kubeflow relies on several external systems for its security-related features: cert-manager, Istio, Dex, and the OIDC AuthService. The convenience of kubeflow/manifests providing all of these dependencies in one place brings additional coupling when the ingress resources are deployed. For example, the Gateway and EnvoyFilter resources are deployed in the kubeflow namespace, but at the same time they configure the default Istio Ingress Gateway running in the istio-system namespace, which can be used by other services in the cluster. When the Gateway is uninstalled, the configuration for the shared ingress gateway is dropped with it, too.

When a Kubeflow installation becomes yet another cluster citizen, it should be able to seamlessly integrate with the existing platform-side components without producing alternative authentication paths or overwriting existing routes in the default Istio Gateway.

#### Use of an insecure Gateway and endpoints

As of version 1.4, the Kubeflow manifests use an HTTP Gateway without any TLS in place. TLS is an essential security measure nowadays and, surprisingly, these publicly exposed endpoints serve plaintext HTTP regardless of the environment they are deployed into (e.g., on-premise or air-gapped clusters).

The Dex installation shipped with Kubeflow is also exposed via the same Gateway, and its clients (such as the OIDC AuthService) don't have a way to verify the identity of the service.
Although the OIDC AuthService supports verifying the OIDC provider endpoint with a provided CA bundle, this functionality is not utilized.

#### EnvoyFilter maintainability issues

Envoy's ext_authz filter, configured with the EnvoyFilter CRD and used for forwarding requests to the OIDC AuthService, is a pretty low-level primitive that can be difficult to maintain and troubleshoot. The major pain points are described in the Istio Better External Authorization design doc. Here are a few excerpts relevant to Kubeflow:

- No support for conditional triggering of the external authorization flow. Some paths don't need it (e.g., the user-facing authentication endpoint itself).
- Because ext_authz depends on the Envoy API, the EnvoyFilter can start failing after a small change in the upstream. Sometimes it fails silently.
- Overall difficult maintenance and troubleshooting of the EnvoyFilter CRD.

#### OIDC AuthService maintenance and community support

The OIDC AuthService is a great solution for the auth proxy; however, the project's GitHub doesn't look very active, and it is not clear how much maintenance it receives. It also doesn't have any releases in its GitHub repository, so the whole release process is opaque. The single source of truth seems to be the Docker image tag (a commit hash) used in the manifests. Another concern with the OIDC AuthService is that it seems to be tailored to Kubeflow's needs only and is not used in any other setup.
This could make it vulnerable in non-Kubeflow-specific use cases that haven't been tested but might occur in production deployments.

## Proposed solution for Secure Ingress and External Authentication

### Overview

#### Decoupling the ingress stack from Kubeflow

The ingress and authentication stack should be treated as a cluster-scoped entity so that all cluster tenants (services and applications) can integrate with it and benefit from the pre-configured security and authentication flow. That way, Kubeflow becomes just another consumer of the authentication stack and doesn't deal with the installation of the security primitives.

Independent installation and management of the ingress and authentication stack allows using the latest stable software versions and the recommended installation methods, such as official Helm Charts, instead of back-ported manifests.

#### Securing the endpoints

Istio natively supports TLS at the Gateway, and with cert-manager available on the cluster, it is possible to create a CA ClusterIssuer and provide a certificate to the Gateway. The CA ClusterIssuer can then also be used to mount a CA file into the authentication proxy for validating the Dex identity. Additionally, it is beneficial to enable mutual TLS for all user-facing and authentication-related components where possible (not all Kubeflow components work well when Istio sidecars are injected).

#### Using Istio External Authorization

Starting from version 1.9, Istio supports external authorization and allows configuring AuthorizationPolicies to delegate authorization to external systems. This functionality replaces the low-level EnvoyFilter configuration API and was driven by the shortcomings of the ext_authz approach.
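As a sketch of what this looks like in practice (the provider name, namespaces, and the oauth2-proxy service address below are illustrative assumptions, not values from the Kubeflow manifests), the external authorizer is registered as an extension provider in the mesh config and then referenced from an AuthorizationPolicy with the CUSTOM action:

```yaml
# Istiod mesh config excerpt: register the auth proxy as an extension provider.
meshConfig:
  extensionProviders:
    - name: oauth2-proxy
      envoyExtAuthzHttp:
        service: oauth2-proxy.auth.svc.cluster.local
        port: 4180
        includeRequestHeadersInCheck: ["cookie", "authorization"]
---
# Delegate authorization for gateway traffic to the provider defined above.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ext-authz
  namespace: ingress
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: CUSTOM
  provider:
    name: oauth2-proxy
  rules:
    - to:
        - operation:
            notPaths: ["/dex/*"]  # skip external auth for the Dex endpoints themselves
```

Unlike the EnvoyFilter approach, the authorization flow can be conditionally skipped for selected paths (here, the Dex endpoints) directly in the policy rules.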
There's a great blog post from Istio describing it: Better External Authorization.

#### Migrating to OAuth2 Proxy with Dex

Dex is a popular choice for Kubernetes authentication used in production; the OIDC AuthService, however, will be replaced with an alternative solution. OAuth2 Proxy looks like a better alternative based on its functionality, GitHub activity, availability of versioned releases, quality documentation, and the official Helm Chart for the installation.

### Deployment layout

Let's consider the following deployment and ingress diagram for a dedicated Kubeflow cluster:

There are several differences and improvements compared to the default Kubeflow installation:

- The proposed layout follows the idea of a centralized security and authentication stack independent from Kubeflow. Consumers of the stack can be added to or removed from a cluster without affecting other consumers. This setup is beneficial for installations where Kubeflow is yet another tenant sharing the cluster with other business applications.
- The secure Istio Gateway with TLS termination runs in a dedicated ingress namespace (security best practice recommends against installing it in the Istio namespace) and uses a Certificate issued by cert-manager.
- Istio External Authorization is configured for the Ingress Gateway by an AuthorizationPolicy and verifies incoming requests with Dex via OAuth2 Proxy.
- Dex and OAuth2 Proxy have VirtualService routes defined for them and will be using the Ingress Gateway address for the authentication endpoints and callbacks, so that both internal and external users and systems have access to them. This setup is useful when the OIDC provider is external to the cluster or running at a different address.
- Security-related components are deployed in a dedicated namespace and can have additional policies applied to them in a narrow scope.
This layout also benefits other services running on the cluster that require authentication but are not related to Kubeflow, for example user-facing Grafana and Prometheus.

## Implementing Secure Ingress and Authentication

This section contains practical steps and code snippets for installing and configuring the secure ingress and authentication stack. It mostly focuses on the generic part that applies to any cluster requiring this security setup, and at the end provides a sub-section with a basic Kubeflow installation to verify the setup.

The main topics covered in this section include setting up the required security dependencies with Helm, creating a CA ClusterIssuer with a self-signed CA, configuring secure ingress, configuring Istio External Authorization, installing and configuring OAuth2 Proxy and Dex for authentication, and installing a minimalistic Kubeflow distribution.

NOTE: The following tutorial was created on an on-premises deployment with MetalLB assigning addresses for Services of the LoadBalancer type. The exposed ingress endpoints with network IPs assigned will be referenced by IP address instead of FQDN, for simplicity and to avoid setting up and configuring a DNS server.

### Securing the Ingress Gateway

The first step in configuring secure ingress on a cluster is to get the required software and configure it to serve the traffic. Regardless of whether it is required solely for Kubeflow or for generic cluster security, this part of the setup is the same.

#### Installing Cert-manager, Istio, and the Ingress Gateway

Installing cert-manager:

```shell
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --version v1.6.1 \
  --set installCRDs=true \
  --namespace cert-manager \
  --create-namespace
```

Istio ships independent Helm Charts for the CRDs, Istiod, and the Ingress Gateway. The source code for the Charts can be found here.
At this point, only CRDs and Istiod will be needed:

helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update
helm install istio-base istio/base \
  --version 1.12.1 \
  --namespace istio-system \
  --create-namespace
helm install istiod istio/istiod \
  --version 1.12.1 \
  --namespace istio-system \
  --wait

Create the Istio Ingress Gateway in a dedicated ingress namespace:

helm install istio-ingressgateway istio/gateway \
  --version 1.12.1 \
  --namespace ingress \
  --create-namespace \
  --wait

The Ingress Gateway from this Chart creates an Envoy Proxy Deployment and a Service with the LoadBalancer type. At this point there are no routes defined and all requests to the Service will be dropped (you can check it with curl). Verify the Service has an external address assigned. For example:

kubectl get svc istio-ingressgateway -n ingress
NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                                      AGE
istio-ingressgateway   LoadBalancer   10.99.197.129   192.168.50.150   15021:31107/TCP,80:32027/TCP,443:31920/TCP   11m

Creating a CA ClusterIssuer for signing certificates #

For simplicity, we will be using a ClusterIssuer to ease certificate issuance and use the same CA for signing all certificates. For that, it is required to create a CA key and certificate to provide to the cert-manager ClusterIssuer. We will use cfssl, but any other appropriate tool can be used instead.

Create a CSR (Certificate Signing Request) file in JSON format.
For example:

cat <<EOF > csr.json
{
  "CN": "Datastrophic",
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "US",
      "L": "San Francisco",
      "O": "Datastrophic",
      "ST": "California"
    }
  ]
}
EOF

Then, run cfssl to generate the initial CA key and certificate:

cfssl gencert -initca csr.json | cfssljson -bare ca

Create a Kubernetes Secret to hold the key and certificate, as per the cert-manager docs. The secret for the ClusterIssuer should be created in the "cluster resource namespace", which defaults to cert-manager:

kubectl create secret tls ca-secret \
  --namespace cert-manager \
  --cert=ca.pem \
  --key=ca-key.pem

Create a CA ClusterIssuer referencing the previously created Secret:

kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ca-issuer
spec:
  ca:
    secretName: ca-secret
EOF

Verify the ClusterIssuer is ready:

kubectl get clusterissuer -o wide
NAME        READY   STATUS                AGE
ca-issuer   True    Signing CA verified   2s

More ClusterIssuer configuration options are available in the cert-manager docs.

Configuring the Istio Gateway to serve HTTPS traffic #

To expose services via HTTPS, it is required to configure a secure Istio Gateway.
For this purpose, we will use cert-manager to issue a certificate for the Istio Ingress Gateway address and provide it to the Gateway.

Discover the Ingress Gateway address to use in the certificate:

export INGRESS_HOST=$(kubectl get svc istio-ingressgateway --namespace ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

NOTE: if running in the cloud and the LoadBalancer Service type is bound to a load balancer, then .status.loadBalancer.ingress[0].ip might render an empty result. If a LoadBalancer Service has a DNS name assigned to it, use .status.loadBalancer.ingress[0].hostname instead. Alternatively, run kubectl describe svc istio-ingressgateway --namespace ingress and save the publicly exposed address.

The Certificate looks as follows (we'll be using the IP address from the previous step in the ipAddresses field):

kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: gateway-cert
  namespace: ingress
spec:
  secretName: gateway-cert
  ipAddresses:
    - "${INGRESS_HOST}"
  duration: 2160h # 90d
  renewBefore: 360h # 15d
  subject:
    organizations:
      - Datastrophic
  issuerRef:
    name: ca-issuer
    kind: ClusterIssuer
EOF

Verify the Certificate is created:

kubectl get certificate gateway-cert -o wide -n ingress
NAME           READY   SECRET         ISSUER      STATUS                                          AGE
gateway-cert   True    gateway-cert   ca-issuer   Certificate is up to date and has not expired   16s

Create a secure Istio Gateway that configures the ingress proxies to use the certificate created at the previous step:

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: ingress-gateway
  namespace: ingress
spec:
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
  servers:
    - port:
        number: 443
        name: https-main
        protocol: HTTPS
      hosts:
        - "*"
      tls:
        mode: SIMPLE
        credentialName: gateway-cert
EOF

Before deploying the Gateway, curl to https://$INGRESS_HOST returns Connection refused. However, after the Gateway is created, the certificate can be verified by running:

curl --cacert ca.pem -v https://$INGRESS_HOST

Here --cacert ca.pem points to the previously created root CA. It is expected that the above command returns 404, as there are no VirtualService routes configured yet.

Authorizing user requests #

After the TLS ingress is configured, we can proceed with Istio External Authorization, Dex, and OAuth2 Proxy. As a result, the Istio Ingress Gateway will use OAuth2 Proxy as an external authorization service, which in turn will trigger the authorization flow with Dex. The Dex configuration for this blog post serves static users but can be configured to work with supported providers.

First, let's create a dedicated auth namespace with Istio sidecar injection enabled:

kubectl create -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: auth
  labels:
    istio-injection: enabled
EOF

Installing Dex and configuring clients #

Dex will be installed using Helm, and an example Dex configuration can be found here. The configuration below contains a record for an OAuth2 Proxy static client that at this point is not yet installed.
It also includes two static users for testing:

# bcrypt hash for "password"
export PWD_HASH=$(htpasswd -bnBC 10 "" password | tr -d ':\n')
cat <<EOF > dex-values.yaml
config:
  issuer: "https://${INGRESS_HOST}/dex"
  storage:
    type: kubernetes
    config:
      inCluster: true
  oauth2:
    skipApprovalScreen: true
  staticClients:
    - id: oauth2-proxy
      name: OAuth2 Proxy
      secret: "LG7jUjNiyVDPJdlarO5Mgz3CxS7kNL/1OZ0spRsL"
      redirectURIs:
        - "https://${INGRESS_HOST}/oauth2/callback"
  # Password DB must be enabled in order to specify static users
  enablePasswordDB: true
  staticPasswords:
    - email: "user1@datastrophic.io"
      hash: "${PWD_HASH}"
    - email: "user2@datastrophic.io"
      hash: "${PWD_HASH}"
EOF

Install Dex with the provided configuration:

helm repo add dex https://charts.dexidp.io
helm repo update
helm install dex dex/dex \
  --version 0.6.3 \
  --values dex-values.yaml \
  --namespace auth \
  --wait

Expose Dex at the endpoint's /dex path via a VirtualService:

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: dex
  namespace: auth
spec:
  hosts:
    - "*"
  gateways:
    - ingress/ingress-gateway
  http:
    - name: "dex"
      match:
        - uri:
            prefix: "/dex"
      route:
        - destination:
            host: dex.auth.svc.cluster.local
            port:
              number: 5556
EOF

Installing OAuth2 Proxy #

OAuth2 Proxy has quite a few configuration options described in the oauth2-proxy documentation and available in the example values.yaml on GitHub. The majority of the examples set the ssl_insecure_skip_verify parameter to true to skip the verification of the OIDC provider endpoint.
This is convenient when the provider runs with a self-signed certificate; however, skipping certificate verification means we no longer validate who is authenticating our users. In this setup, a dedicated certificate will be issued for the ingress endpoint running Dex and mounted into OAuth2 Proxy for validating certificates against the CA.

Create a second certificate for the Gateway address, but in the auth namespace:

kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: gateway-cert
  namespace: auth
spec:
  secretName: gateway-cert
  ipAddresses:
    - "${INGRESS_HOST}"
  duration: 2160h # 90d
  renewBefore: 360h # 15d
  subject:
    organizations:
      - Datastrophic
  issuerRef:
    name: ca-issuer
    kind: ClusterIssuer
EOF

Create the OAuth2 Proxy Helm configuration:

cat <<EOF > oauth2-proxy-values.yaml
config:
  clientID: "oauth2-proxy"
  # openssl rand -base64 32 | head -c 40
  clientSecret: "LG7jUjNiyVDPJdlarO5Mgz3CxS7kNL/1OZ0spRsL"
  # openssl rand -base64 32 | head -c 32 | base64
  #cookieSecret: "SXRNTGYzNUFtNi9MTGUvbXJmUnlLdUlYTU00a29ick4="
  configFile: |-
    provider = "oidc"
    provider_ca_files = "/etc/gateway-cert/ca.crt"
    oidc_issuer_url = "https://${INGRESS_HOST}/dex"
    set_authorization_header = true
    set_xauthrequest = true
    cookie_samesite = "lax"
    email_domains = ["*"]
    skip_provider_button = true
    upstreams = [ "static://200" ]
extraVolumes:
  - name: gateway-cert
    secret:
      secretName: gateway-cert
extraVolumeMounts:
  - mountPath: /etc/gateway-cert/
    name: gateway-cert
EOF

Install OAuth2 Proxy with the provided configuration:

helm repo add oauth2-proxy https://oauth2-proxy.github.io/manifests
helm repo update
helm install oauth2-proxy oauth2-proxy/oauth2-proxy \
  --version 5.0.6 \
  --namespace auth \
  --values oauth2-proxy-values.yaml \
  --wait

Expose OAuth2 Proxy at the endpoint's /oauth2 path via a VirtualService (note that the destination port is 80, matching the Service created by the Helm Chart):

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: oauth2-proxy
  namespace: auth
spec:
  hosts:
    - "*"
  gateways:
    - ingress/ingress-gateway
  http:
    - name: "oauth2"
      match:
        - uri:
            prefix: "/oauth2"
      route:
        - destination:
            host: oauth2-proxy.auth.svc.cluster.local
            port:
              number: 80
EOF

Configuring Istio External Authorization #

Istio External Authorization is a mesh-wide configuration property that is applied to Istiod. In the example below, we register a new external authorization service (OAuth2 Proxy):

cat <<EOF > istio-values.yaml
meshConfig:
  extensionProviders:
    - name: oauth2-proxy
      envoyExtAuthzHttp:
        service: oauth2-proxy.auth.svc.cluster.local
        port: 80
        includeHeadersInCheck: ["authorization", "cookie"]
        headersToUpstreamOnAllow: ["authorization", "path", "x-auth-request-user", "x-auth-request-email", "x-auth-request-access-token"]
        headersToDownstreamOnDeny: ["content-type", "set-cookie"]
EOF

helm upgrade istiod istio/istiod \
  --namespace istio-system \
  --values istio-values.yaml \
  --wait
kubectl rollout restart deployment/istiod -n istio-system

To enable the external authorization, it is required to apply an AuthorizationPolicy referencing the above extension provider to the Istio Ingress Gateway:

kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: external-auth
  namespace: ingress
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway
  action: CUSTOM
  provider:
    name: oauth2-proxy
  rules:
    - to:
        - operation:
            hosts: ["*"]
            notPaths: ["/dex/*"] # skipping Dex running on the same Gateway to avoid redirect loops
EOF

Now everything is ready, and all unauthorized requests should be redirected to Dex by OAuth2 Proxy. To verify that, navigate to https://$INGRESS_HOST in your browser. There are two static users, user1@datastrophic.io and user2@datastrophic.io, both with the password: password. As we don't have user-facing applications running yet, you'll be redirected to the root path and get a 404, which is expected.

Base Kubeflow installation #

To verify the secure ingress for Kubeflow, let's install several basic components, log in as different users, and use the collaboration feature to share a notebook.

The datastrophic/kubeflow-kustomize repository contains kustomizations for a demo installation of Kubeflow based on the kubeflow/manifests repository. The kustomizations in the repository modify base manifests to make them work with the custom Istio Gateway and OAuth2 Proxy headers. The kustomizations also patch the Central Dashboard VirtualService to add a redirect for the /logout path to the OAuth2 Proxy sign-out endpoint:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: centraldashboard
spec:
  gateways:
    - ingress/ingress-gateway
  http:
    - match:
        - uri:
            exact: /logout
      name: logout
      redirect:
        uri: /oauth2/sign_out
    - match:
        - uri:
            prefix: /
      rewrite:
        uri: /
      route:
        - destination:
            host: centraldashboard.kubeflow.svc.cluster.local
            port:
              number: 80

To install the basic components of Kubeflow, clone the repository and from the root directory run:

kubectl apply -k kubeflow

Once all pods in the Kubeflow namespace are up and running, navigate to https://$INGRESS_HOST. The Gateway should redirect the browser to the Dex login page and, after the login is successful, to the Central Dashboard page.
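One practical aside before wrapping up: the OAuth2 Proxy values earlier left cookieSecret commented out, with an openssl one-liner in the comments. OAuth2 Proxy requires the cookie secret to decode to exactly 16, 24, or 32 bytes (AES key sizes). As a sketch, the same secret can be generated with a few lines of stdlib Python (the function name is mine, not part of any tool):

```python
import base64
import secrets


def generate_cookie_secret(num_bytes: int = 32) -> str:
    """Generate a random OAuth2 Proxy cookie secret.

    OAuth2 Proxy expects the secret to decode to 16, 24, or 32 bytes,
    so anything else is rejected at startup.
    """
    if num_bytes not in (16, 24, 32):
        raise ValueError("cookie secret must be 16, 24, or 32 bytes")
    # URL-safe base64 keeps the value shell- and YAML-friendly
    return base64.urlsafe_b64encode(secrets.token_bytes(num_bytes)).decode()


print(generate_cookie_secret())
```

The output can be dropped into the cookieSecret field of oauth2-proxy-values.yaml as-is.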
Below is a quick demo:

This demonstrates that the external authorization policy works as expected, the required headers are set by the authentication workflow, AuthorizationPolicies are applied correctly, and the configuration of Kubeflow components is compatible with the provided authorization stack.

Conclusion #

Setting up secure ingress with authentication requires an understanding of all the moving parts and the interactions between them. Treating the authentication stack independently of the applications that depend on it is preferable from the cluster management perspective when multiple applications benefit from a centralized solution.

The manual installation is pretty involved and error-prone. As there are several systems with non-trivial configuration involved, these steps are great candidates for being automated and installed as self-contained units of infrastructure with solutions such as Flux or ArgoCD.

References #

- Istio documentation on Traffic Management
- Better External Authorization blog post from Istio
- Istio documentation on best practices for Installing Gateways
- Istio OIDC Authentication blog post from JetStack
- Configuring Istio with OIDC authentication blog post by Justin Gauthier

","date":"16 December 2021","permalink":"https://datastrophic.io/secure-kubeflow-ingress-and-authentication/","section":"","summary":"Publicly exposed insecure service endpoints on Kubernetes pose a major risk of malicious workloads being deployed on your clusters. We've seen reports that the Kubernetes Dashboard, the Kubeflow Central Dashboard, and the Kubeflow Pipelines were all compromised when publicly exposed to the Internet. Combined with wide RBAC permissions, publicly exposed software with workload scheduling capabilities opens your clusters to malicious deployments by anybody knowing the endpoint URL.
This blog post focuses on building a secure ingress and authentication stack on Kubernetes with Istio targeting Kubeflow installations.","title":"Secure Kubeflow Ingress and Authentication with Istio External Auth, Dex, and OAuth2 Proxy"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/devops/","section":"Tags","summary":"","title":"Devops"},{"content":"Whether you're looking for a more powerful development environment or a production-grade Kubernetes cluster for experiments, this guide provides end-to-end deployment and configuration instructions to get the cluster up and running.

The first part of this guide covers the planning and provisioning of the infrastructure with Proxmox and Terraform. The second part is dedicated to installing Kubernetes and essential software such as Calico for networking, OpenEBS for volume provisioning, and MetalLB for network load balancing. At the end, the guide provides steps for deploying the Kubernetes Dashboard with restricted permissions.

Planning and provisioning the infrastructure #

This section contains basic information on how to get a virtual infrastructure up and running in an automated manner. If you already have the infrastructure ready (whether it's a multi-server rack or several pre-provisioned VMs), just skip ahead to the Kubernetes deployment part of this guide. However, if you have a spare server or a commodity workstation you'd like to use, this section might be helpful for bootstrapping the infrastructure from scratch.

Deployment layout #

There are several options for how the target Kubernetes cluster will fit into the existing network and how clients will access it. Also, the target cluster may consist of several hardware nodes, a virtualized environment, or a hybrid of both. Let's look at the following layout:

We're looking at a network with CIDR 192.168.0.0/24 behind a router.
This can be an existing home router connected to the ISP, or another dedicated hardware router connected to the home gateway. The general idea is that the network address range is split into two parts: DHCP addresses that are dynamically assigned to clients connecting to the network, and a reserved static address range to be used for the physical nodes and VMs. Static addressing of the nodes is important for the deployment automation, which uses host addresses to connect to the hosts and apply changes. It also allows other devices to connect to services running on Kubernetes using local network addresses. While this network setup is pretty naive for a potentially internet-facing deployment, it should be considered a basic building block of the infrastructure that can be implemented in a variety of ways (e.g. by using VLANs).

When dealing with virtualization, it is important to take into account the overhead it brings, both when running a hypervisor and when deciding on the number of virtual machines to create on a physical node. Proxmox VE is an open-source, small-footprint hypervisor based on Debian Linux, and it will be used for virtualization in this guide. One of its additional benefits is a Terraform provider that allows virtual machines to be defined declaratively from templates and provisioned automatically.

Infrastructure automation with Proxmox and Terraform #

When working with on-premises environments, infrastructure provisioning can be a tedious task. However, with an intermediate hypervisor layer, it is possible to achieve the same automation levels as with public cloud providers. The Terraform Proxmox provider brings Infrastructure-as-Code capabilities to environments running Proxmox.

NOTE: In order to continue, Proxmox VE must be installed on the target machines.
To install Proxmox VE, follow the official documentation.

Prior to the provisioning of the VMs themselves, it is beneficial to create a cloud-init template to simplify the configuration and provisioning of the future VMs. The template can be created manually on the PVE host as described in several blog posts such as Deploy Proxmox virtual machines using cloud-init, or we can use Ansible to automate this step. A working Ansible playbook with the instructions can be found at datastrophic/kubernetes-deployment/proxmox/.

Once the VM template is created, we can define a Terraform configuration to provision VMs. Here's an excerpt from main.tf with full instructions available at datastrophic/kubernetes-deployment/proxmox/:

resource "proxmox_vm_qemu" "control_plane" {
  count       = 1
  name        = "control-plane-${count.index}.k8s.cluster"
  target_node = "${var.pm_node}"

  clone    = "ubuntu-2004-cloudinit-template"
  os_type  = "cloud-init"
  cores    = 4
  sockets  = "1"
  cpu      = "host"
  memory   = 2048
  scsihw   = "virtio-scsi-pci"
  bootdisk = "scsi0"

  disk {
    size     = "20G"
    type     = "scsi"
    storage  = "local-lvm"
    iothread = 1
  }

  network {
    model  = "virtio"
    bridge = "vmbr0"
  }

  # cloud-init settings
  # adjust the ip and gateway addresses as needed
  ipconfig0 = "ip=192.168.0.11${count.index}/24,gw=192.168.0.1"
  sshkeys   = file("${var.ssh_key_file}")
}

A few things in the above configuration to pay attention to:

- clone must point to the unique name of the VM template created at the previous step
- ipconfig0 should respect the configuration of the network the VMs are running in.
In this case, we assign VMs static IP addresses within the external (to PVE) network range so they look like regular hosts without the need for NAT routing.

Once the configuration is adjusted for the target environment, it is sufficient to run Terraform to get the target VMs created:

terraform init
terraform plan -var="pm_user=<PVE user>" -var="pm_password=<PVE password>" -out plan
terraform apply "plan"

Installing the Kubernetes cluster and essentials #

The CNCF technology landscape is broad: there are many vendors providing solutions for various aspects of Kubernetes, as well as whole Kubernetes distributions. A fully functioning Kubernetes cluster requires several essential pieces: a container runtime, a Kubernetes distribution itself, a CNI (Container Network Interface) implementation for pod networking, a network load balancer for exposing Services with the LoadBalancer type on-premises, and a CSI (Container Storage Interface) implementation for volume provisioning.

Unlike "Kubernetes the Hard Way", this guide relies on Ansible automation for the Kubernetes deployment and covers only the high-level steps required for the cluster bootstrap. Under the hood, the automation uses kubeadm in conjunction with declarative configuration for the cluster deployment.
The source code of the Ansible playbooks is available at github.com/datastrophic/kubernetes-deployment.

Before you begin #

Prior to going forward with the installation, it is recommended to clone the source code repository for this guide locally, then double-check and update the following files to match your environment:

- the Ansible inventory file that contains the addresses of the nodes in kubernetes-deployment/ansible/inventory.yaml
- the default Ansible variables in kubernetes-deployment/ansible/group_vars/all that contain the Kubernetes version, the MetalLB address range, etc.

The client machine must have SSH access to the cluster nodes and sudo privileges on the target hosts.

kubeadm, containerd, and Calico #

In this guide, the Kubernetes distribution of choice is vanilla open-source Kubernetes, which comes with the kubeadm tool for cluster bootstrapping. Vanilla Kubernetes has a bigger footprint compared to e.g. k3s and might not be a good fit for resource-constrained environments. However, it is vendor-independent and fully open-source, doesn't carry any modifications, and both the API changes and the tooling have the same release cadence, so there's a lower risk of running into incompatibilities or delays.

Prior to deploying Kubernetes itself, the cluster nodes require additional configuration and software:

- Nodes must have swap disabled, iptables enabled, and allow forwarding and bridged traffic, as per Bootstrapping clusters with kubeadm.
- Nodes must have a container runtime installed. The most standard container runtime used in various cloud and vendor Kubernetes distributions is containerd, so we will use it. Additional information on why we're not going to use Docker can be found in Don't Panic: Kubernetes and Docker.
- Nodes must have the following packages installed: kubelet, kubectl, and kubeadm. These can be installed via a standard package manager such as apt.
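For reference, the kernel-level part of these prerequisites typically boils down to two small host configuration files. This is a sketch based on the upstream kubeadm prerequisites; the file paths are conventional, and the bootstrap playbook remains the source of truth for this guide:

```
# /etc/modules-load.d/k8s.conf: kernel modules required for bridged pod traffic
overlay
br_netfilter

# /etc/sysctl.d/k8s.conf: let iptables see bridged traffic and enable forwarding
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
```

After placing the files, the settings take effect on reboot, or immediately via modprobe and sysctl --system.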
There's a dedicated playbook for bootstrapping the nodes with all the required configuration and dependencies available at ansible/bootstrap.yaml. Double-check the defaults, and from the root of the repo run:

ansible-playbook -i ansible/inventory.yaml ansible/bootstrap.yaml -K

Once all the prerequisites are in place, we can use kubeadm for the cluster bootstrapping. The Kubernetes cluster installation consists of two major steps: bootstrapping the control plane and joining the worker nodes. We can do both by running the ansible/kubernetes-install.yaml playbook:

ansible-playbook -i ansible/inventory.yaml ansible/kubernetes-install.yaml -K

The playbook runs kubeadm init on the control plane nodes and uses a declarative cluster configuration, which is the preferred way of configuring kubeadm. The configuration template is available at ansible/roles/kubeadm-init/templates/kubeadm.yaml. Once the control plane bootstrap is complete, Ansible fetches a token and a certificate hash that are required for the worker nodes to authenticate with the API Server, and runs kubeadm join on the worker nodes.

The playbook also deploys Calico for cluster networking, although multiple options are available. The choice of Calico is motivated by it being the most widely adopted networking and security solution for Kubernetes (at the time of writing).

Once the playbook execution completes, a kubeconfig file admin.conf will be fetched to the current directory.
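To give a feel for what the templated kubeadm.yaml renders into, a minimal declarative configuration might look like the following sketch. The version and pod subnet here are illustrative assumptions; the actual values live in the Ansible template:

```yaml
# Consumed by: kubeadm init --config kubeadm.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.21.6
networking:
  # illustrative pod CIDR; it must match the Calico IP pool configuration
  podSubnet: 10.244.0.0/16
```

Keeping this file under version control is what makes the cluster bootstrap reproducible across reinstalls.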
To verify the cluster is bootstrapped and connected, run:

$> kubectl --kubeconfig=admin.conf get nodes
NAME                          STATUS   ROLES                  AGE     VERSION
control-plane-0.k8s.cluster   Ready    control-plane,master   4m40s   v1.21.6
worker-0                      Ready    <none>                 4m5s    v1.21.6
worker-1                      Ready    <none>                 4m5s    v1.21.6
worker-2                      Ready    <none>                 4m4s    v1.21.6

NOTE: it is recommended to export the admin.conf location to run kubectl commands without providing the --kubeconfig flag every time:

export KUBECONFIG=$(pwd)/admin.conf

Essential software #

With the Kubernetes cluster up and running, we can now deploy and run containers on it. However, a couple of essential parts of a fully functional cluster are still missing: dynamic volume provisioning and support for Services with the LoadBalancer type.

Volume Provisioning with OpenEBS #

A volume provisioner comes in handy both when 3rd-party applications require a default StorageClass to provision PersistentVolumes and when data replication is required for high availability guarantees.

Using OpenEBS for the home lab setup seems reasonable, as it provides Local Engines for provisioning PersistentVolumes backed directly by the local disks on hosts, which should make the IO pretty fast. If data replication is required, OpenEBS has several Replicated Engines, but their performance varies.

Another alternative considered was Rook, which provides multiple file access APIs (block, shared file system, and object store) and has several options for backend storage. The main user-facing advantage of Rook for home lab purposes was the out-of-the-box support for RWX (ReadWriteMany) volumes.
However, OpenEBS with its local PersistentVolumes looked like a lighter and simpler alternative compared to the Ceph-backed Rook, even with the lack of RWX.

To deploy a minimal installation with host-local PersistentVolumes, OpenEBS provides a "lite" version:

kubectl apply -f https://openebs.github.io/charts/openebs-operator-lite.yaml

Once the Operator is installed, create a StorageClass and annotate it as default. This allows using OpenEBS for volume provisioning without the need to specify the StorageClass for PersistentVolumes every time:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-hostpath
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
    openebs.io/cas-type: local
    cas.openebs.io/config: |
      - name: StorageType
        value: "hostpath"
      - name: BasePath
        value: "/var/openebs/local/"
provisioner: openebs.io/local
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
EOF

To verify the installation, there are steps available in the official OpenEBS documentation, but there is also an end-to-end example at the end of this guide.

A Network Load Balancer with MetalLB #

One last missing piece of functionality in the provisioned cluster is the ability to expose Services of the LoadBalancer type to the local network. When running in the cloud, this functionality is provided by the Kubernetes integrations with cloud providers, and the corresponding network-facing load balancers are provisioned by the infrastructure provider.
When running on bare metal, there's no such integration available in Kubernetes out of the box.

MetalLB is the most widely used solution for network load balancing, although other solutions have started to appear.

The MetalLB installation is configured via a ConfigMap and can contain multiple address pools:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
      - name: default
        protocol: layer2
        addresses:
          - "{{ lab.metallb_address_range }}"

The template above is part of the ansible/metallb.yaml playbook that installs MetalLB and configures it to allocate addresses from the lab.metallb_address_range variable specified in group_vars. The address range must be relevant for the target environment (part of the reserved static address range described in the deployment layout section) so that the addresses can be allocated. To install MetalLB, run:

ansible-playbook -i ansible/inventory.yaml ansible/metallb.yaml -K

Verifying the installation #

To verify the installation, we are going to create a MinIO Deployment with a PersistentVolume for storage, and expose the Deployment to the local network via the LoadBalancer Service type.
The example is based on the Kubernetes storage examples.\nCreate a PersistentVolumeClaim: kubectl apply -f - \u0026lt;\u0026lt;EOF apiVersion: v1 kind: PersistentVolumeClaim metadata: name: minio-pv-claim spec: accessModes: - ReadWriteOnce resources: requests: storage: 1Gi EOF Create a Deployment: kubectl apply -f - \u0026lt;\u0026lt;EOF apiVersion: apps/v1 kind: Deployment metadata: name: minio-deployment spec: selector: matchLabels: app: minio strategy: type: Recreate template: metadata: labels: app: minio spec: volumes: - name: storage persistentVolumeClaim: claimName: minio-pv-claim containers: - name: minio image: minio/minio:latest args: - server - /storage - --console-address - \u0026#34;:9001\u0026#34; env: - name: MINIO_ACCESS_KEY value: \u0026#34;minio\u0026#34; - name: MINIO_SECRET_KEY value: \u0026#34;minio123\u0026#34; ports: - containerPort: 9000 hostPort: 9000 - containerPort: 9001 hostPort: 9001 volumeMounts: - name: storage mountPath: \u0026#34;/storage\u0026#34; EOF Verify the PersistentVolumeClaim is bound and a PersistentVolume is created: kubectl get pvc NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE minio-pv-claim Bound pvc-f43856ab-d0a2-42d3-8088-3010f7966ab9 1Gi RWO openebs-hostpath 77s kubectl get pv NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE pvc-f43856ab-d0a2-42d3-8088-3010f7966ab9 1Gi RWO Delete Bound minio/minio-pv-claim openebs-hostpath 2m42s Verify the Deployment is healthy: kubectl describe deployment minio-deployment ... 
Conditions: Type Status Reason ---- ------ ------ Available True MinimumReplicasAvailable Progressing True NewReplicaSetAvailable OldReplicaSets: \u0026lt;none\u0026gt; NewReplicaSet: minio-deployment-877b8596f (1/1 replicas created) Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal ScalingReplicaSet 7m4s deployment-controller Scaled up replica set minio-deployment-877b8596f to 1 Expose the Deployment via a Service of the LoadBalancer type: kubectl apply -f - \u0026lt;\u0026lt;EOF apiVersion: v1 kind: Service metadata: name: minio spec: ports: - name: http port: 9000 protocol: TCP targetPort: 9000 - name: http-ui port: 9001 protocol: TCP targetPort: 9001 selector: app: minio type: LoadBalancer EOF Verify the Service is created and has the External IP set. For example: kubectl get service minio NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE minio LoadBalancer 10.109.223.141 192.168.0.151 9000:31073/TCP 7s The EXTERNAL-IP address should be from the local network range, and you should now be able to navigate to http://EXTERNAL-IP:9001 from a browser and see the MinIO Console login screen.\nThe default credentials specified in the MinIO Deployment are minio and minio123 for the login and password respectively. After the login, create a bucket named test, and let\u0026rsquo;s verify it is created on the PersistentVolume:\nkubectl exec deploy/minio-deployment -- bash -c \u0026#34;ls -la /storage\u0026#34; total 16 drwxrwxrwx 4 root root 4096 Dec 1 19:04 . drwxr-xr-x 1 root root 4096 Dec 1 19:00 .. drwxr-xr-x 6 root root 4096 Dec 1 18:39 .minio.sys drwxr-xr-x 2 root root 4096 Dec 1 19:04 test That wraps up the verification: the test folder created from the UI exposed to the local network was saved on the PersistentVolume mounted at the /storage path.\nObservability #The final important piece of any permanent cluster is the observability stack. 
Depending on your cluster size, it could be just an instance of the Kubernetes Dashboard or the Prometheus Operator. This guide focuses on the Kubernetes Dashboard but it is important to note that it doesn\u0026rsquo;t provide any historical data view, custom dashboarding, or alerting. If those features are a must-have for your cluster, the Prometheus Operator would be a great place to start.\nKubernetes Dashboard #If the cluster is so constrained in resources that it is hard to squeeze the full Prometheus stack onto it, then the Kubernetes Dashboard is the bare-minimum observability solution. The Kubernetes Dashboard has its respective installation guide and here we\u0026rsquo;ll focus on the appropriate RBAC permissions for the ServiceAccount used by it.\nFirst, let\u0026rsquo;s install the Kubernetes Dashboard:\nkubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.4.0/aio/deploy/recommended.yaml While the Kubernetes Dashboard allows creating new resources and editing existing ones, using it in read-only mode is more secure and wouldn\u0026rsquo;t impose any security risks should anybody gain access to the UI. 
The scope of visibility of the Dashboard is controlled via RBAC of the users accessing it.\nThe most conservative approach would be to use an Aggregated ClusterRole based on the default viewer role and extend it with additional rules as needed:\nkubectl apply -f - \u0026lt;\u0026lt;EOF apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: dashboard-viewer aggregationRule: clusterRoleSelectors: - matchLabels: rbac.authorization.k8s.io/aggregate-to-view: \u0026#34;true\u0026#34; - matchLabels: rbac.homelab.k8s.io/aggregate-to-view: \u0026#34;true\u0026#34; rules: [] --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: dashboard-extended-view labels: rbac.homelab.k8s.io/aggregate-to-view: \u0026#34;true\u0026#34; rules: - apiGroups: - \u0026#34;\u0026#34; resources: - nodes - extensions - apps - batch - storage - networking verbs: - get - list - watch EOF The ClusterRole provides extended view permissions but still doesn\u0026rsquo;t allow viewing Secrets and resources from rbac.authorization.k8s.io API group. 
Now, let\u0026rsquo;s create a dedicated ServiceAccount and bind it to the created ClusterRole:\nkubectl apply -f - \u0026lt;\u0026lt;EOF apiVersion: v1 kind: ServiceAccount metadata: name: dashboard-viewer namespace: kubernetes-dashboard --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: dashboard-viewer roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: dashboard-viewer subjects: - kind: ServiceAccount name: dashboard-viewer namespace: kubernetes-dashboard EOF The Dashboard can be accessed either via kubectl proxy, or via port forwarding:\nkubectl -n kubernetes-dashboard port-forward service/kubernetes-dashboard 8443:443 The Dashboard will be available at https://localhost:8443/.\nTo discover the ServiceAccount token for accessing the Dashboard, run:\nkubectl -n kubernetes-dashboard get secret $(kubectl -n kubernetes-dashboard get sa/dashboard-viewer -o jsonpath=\u0026#34;{.secrets[0].name}\u0026#34;) -o go-template=\u0026#34;{{.data.token | base64decode}}\u0026#34; The Dashboard will display notifications about the inability to list Secrets or resources from the rbac.authorization.k8s.io API group. This is expected because the ClusterRole doesn\u0026rsquo;t allow that.\nConclusion #There\u0026rsquo;s been a lot described in this guide and that might be overwhelming. Although we have a fully functioning Kubernetes cluster suitable for a local network, it\u0026rsquo;s not the end of the story yet. If it is planned for the cluster to be multi-tenant - then it will require an integrated solution for AuthN/Z such as Dex. Also, this guide doesn\u0026rsquo;t cover how to set up and configure TLS-secured Ingress and authenticated access for the services deployed on the cluster. 
Both of these topics will be covered in later posts.\n","date":"1 December 2021","permalink":"https://datastrophic.io/kubernetes-homelab-with-proxmox-kubeadm-calico-openebs-and-metallb/","section":"","summary":"Whether you’re looking for a more powerful development environment or a production-grade Kubernetes cluster for experiments, this guide provides end-to-end deployment and configuration instructions to get the cluster up and running. The first part of this guide covers the planning and provisioning of the infrastructure with Proxmox and Terraform. The second part is dedicated to installing Kubernetes and essential software such as Calico for networking, OpenEBS for volume provisioning, and MetalLB for network load balancing.","title":"The Ultimate Kubernetes Homelab Guide: From Zero to Production Cluster On-Premises"},{"content":"With Kubeflow gaining traction in the community and its early adoption in enterprises, security and observability concerns become more and more important. Many organizations that run AI/ML workloads operate with sensitive personal or financial data and have stricter requirements for data encryption, traceability, and access control. Quite often, we can see the use of the Istio service mesh for solving these problems and gaining other benefits of the rich functionality it provides.\nKubeflow relies on Istio for traffic routing, authorization policies, and user access control. However, at the time of writing, Kubeflow did not fully support Istio for the workloads running on top of it. This post covers architectural and design issues specific to running Kubeflow workloads on Istio and focuses on specific problems of the AI/ML training jobs: TFJob, PyTorchJob, and the like. 
In the end, the post presents a reference implementation of the Istio Aux Controller - an auxiliary Kubernetes Operator that helps to solve these problems in a fully automated manner.\nIstio #High-level architecture #It is important to have a basic understanding of how Istio is designed at a high level.\nOfficial documentation provides an in-depth overview of all components but for the purpose of this post, we will be focusing mostly on the data plane.\nImage source: Istio architecture documentation\nIstio Control Plane injects Envoy proxies as sidecar containers running alongside the payload containers in the same pod. Once the proxy is up and running, it starts managing all network communication between pods in the mesh and also receiving configuration updates from the Control Plane. All the access policies and traffic routes are configured via the Control Plane and then enforced by the proxies.\nTo enable sidecar injection at the namespace level, the namespace should have the istio-injection: enabled label\nSidecar injection #Let\u0026rsquo;s take a deeper look into the timeline of the events when Istio injection is enabled and a new Pod is being created:\nThe Istio CNI plugin configures the Pod\u0026rsquo;s iptables to route all traffic to the Proxy. If there are any initContainers specified, they start and must complete prior to starting the payload and sidecar containers. Payload and sidecar containers start. The network availability issue #While the injection model looks straightforward, there\u0026rsquo;s one major design flaw here - the Pod network is unreachable until the proxy sidecar starts. 
Let\u0026rsquo;s revisit the timeline from this perspective:\nIstio CNI plugin configures routing of all traffic to a non-existent proxy (network becomes unavailable) initContainers run Payload and sidecar containers start Proxy starts (network is available again) This means that if any of the payload containers or initContainers requires network access - it is sensitive to this issue:\nwhen a payload container requires network connectivity on start - it will crashloop until the sidecar proxy is started a situation when any of the initContainers depends on fetching the data over the network (and fails otherwise) introduces a deadlock because none of the payload or sidecar containers can start until all the initContainers complete. The initContainers deadlock issue is beyond the scope of this post as it doesn\u0026rsquo;t affect the Kubeflow training jobs.\nThe Job completion issue #Apart from the racy network availability during the Pod startup, there\u0026rsquo;s another issue with the Kubernetes Job-like resources and their handling of the sidecar. Depending on the type of a Kubernetes Controller or an Operator managing the created resource, the problem is that the Istio Proxy keeps running after the payload container is completed and prevents the Job (and Job-like) resources from completion.\nTraining Operators on Istio #When running distributed training jobs using Tensorflow, PyTorch, or MXNet Operators, it is pretty standard for the training code to access the dataset at a remote location over the network (e.g. from cloud storage). This makes it sensitive to the network availability issue and can lead to sporadic failures when running on Istio. Tensorflow will be used for illustration purposes here, however, the problem surface and approaches to solving it are equally applicable to the PyTorch, MXNet, and other Training Operators.\nLet\u0026rsquo;s consider this naive MNIST classification code as an example workload. 
Note, the mnist.load_data() call downloads the sample dataset from a remote location and requires the network to be available.\nimport tensorflow as tf # import MNIST dataset mnist = tf.keras.datasets.mnist (x_train, y_train), (x_test, y_test) = mnist.load_data() x_train, x_test = x_train / 255.0, x_test / 255.0 # define and compile the model model = tf.keras.models.Sequential( [ tf.keras.layers.Flatten(input_shape=(28, 28)), tf.keras.layers.Dense(128, activation=\u0026#34;relu\u0026#34;), tf.keras.layers.Dropout(0.2), tf.keras.layers.Dense(10), ] ) model.compile( optimizer=\u0026#34;adam\u0026#34;, loss=\u0026#34;sparse_categorical_crossentropy\u0026#34;, metrics=[\u0026#34;accuracy\u0026#34;] ) # train the model model.fit(x_train, y_train, epochs=5) To run this code on a Kubernetes cluster, it needs to be saved into a file (for example, mnist.py) and packed into a Docker image so that it can be pulled and used on any of the cluster nodes by training operator workers. We will use a pre-built Docker image that already includes the code from the above snippet: datastrophic/tensorflow:2.6.0-mnist. Let\u0026rsquo;s create the following TFJob:\napiVersion: kubeflow.org/v1 kind: TFJob metadata: name: mnist spec: tfReplicaSpecs: Worker: replicas: 2 restartPolicy: OnFailure template: spec: containers: - name: tensorflow image: datastrophic/tensorflow:2.6.0-mnist command: [\u0026#39;python\u0026#39;, \u0026#39;-u\u0026#39;, \u0026#39;mnist.py\u0026#39;] It can take some time to pull the image but once it is pulled and launched we can check its logs to see it was unable to download the dataset. For that, let\u0026rsquo;s look into one of the worker pods logs:\n$\u0026gt; kubectl logs mnist-worker-0 -c tensorflow Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz Traceback (most recent call last): ... 
\u0026lt;part of the log omitted for better readability\u0026gt; Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz: None -- [Errno 101] Network is unreachable Although after a couple of attempts the Job will be able to start and pull the data - in a situation when the Istio Proxy becomes ready before the payload container attempts to access the network - the Job still won\u0026rsquo;t be able to complete because a single container keeps running. And this single container is the Istio Proxy, which is unaware of the other containers. We can see the event timeline here:\n$\u0026gt; kubectl get pod mnist-worker-0 -w NAME READY STATUS RESTARTS AGE mnist-worker-0 0/2 Init:0/1 0 1s mnist-worker-0 0/2 PodInitializing 0 6s mnist-worker-0 1/2 Running 0 9s mnist-worker-0 2/2 Running 0 16s mnist-worker-0 1/2 NotReady 0 2m21s Let\u0026rsquo;s now take a look into the prior art and the possible workarounds discussed in the community.\nPrior art #There were quite a few discussions, threads, and blog posts about how these issues can be resolved or if there\u0026rsquo;s any workaround for them. What follows is a quick overview of the most frequently mentioned approaches.\nThe networking issue #One of the most common solutions for this problem is to modify the container command and wait for the sidecar proxy to become available as recommended, for example, in istio/issues#11130. The modified command for the TFJob can look as follows:\ncommand: [\u0026#39;bash\u0026#39;, \u0026#39;-c\u0026#39;] args: [\u0026#39;until curl --silent --head --fail localhost:15000 \u0026gt; /dev/null; do sleep 1; done; python -u mnist.py\u0026#39;] The entrypoint probes the Envoy proxy port 15000 until it becomes available and executes the training code only after that.\nYet another intuitive solution, when network access to the remote data is not stable, is to introduce retries in the source code responsible for its retrieval. 
For example, using the retrying library:\nfrom retrying import retry @retry(wait_fixed=1000) def load_dataset(): mnist = tf.keras.datasets.mnist return mnist.load_data() This looks more like a bandaid for the given example but, in general, retries can improve resilience and help to avoid transient failures in the presence of unreliable data sources.\nThe sidecar termination issue #One of the available approaches is similar to the Envoy probing and proposes to change the entrypoint and terminate the Istio Proxy either via pkill or by calling a dedicated endpoint http://127.0.0.1:15020/quitquitquit. Based on this GitHub comment, the final entrypoint command for the example MNIST TFJob would look like this:\ncommand: [\u0026#34;/bin/bash\u0026#34;, \u0026#34;-c\u0026#34;] args: - | trap \u0026#34;curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit\u0026#34; EXIT while ! curl -s -f http://127.0.0.1:15020/healthz/ready; do sleep 1; done python -u mnist.py An important note on using pkill instead of /quitquitquit is that pkill would require a shared process namespace between containers in the pod, which has its own security implications.\nAnother approach, described in Handling Istio Sidecars in Kubernetes Jobs, proposes a helper process that wraps the entrypoint and communicates with Envoy, waiting for it to start and terminating it after the wrapped application stops.\nThe Good, Bad, and Ugly: Istio for Short-lived Pods proposes injecting a wrapper binary and overwriting the entrypoint command via a webhook, then triggering the binary subcommand from an accompanying controller to terminate the proxy (similar to kubectl exec).\nConclusion #All the approaches described above have pros and cons but the main drawback is that the initial workloads cannot be moved to Istio without modifying either the manifests, entrypoints, or the source code (in case of retries). At any reasonable scale, the number of changes would be significant enough to abandon an initiative like this one. 
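To make the wrapper-process family of workarounds concrete, here is a rough Python sketch of the idea: wait for the proxy health endpoint, run the payload, then POST to the shutdown endpoint. This is an illustrative sketch only, not code from any of the cited projects; the payload callable stands in for the real entrypoint, the timeouts are arbitrary, and the URLs follow the endpoints quoted in the workarounds above.

```python
import time
import urllib.error
import urllib.request

# Endpoints as quoted in the workarounds above (assumed defaults).
READY_URL = "http://127.0.0.1:15020/healthz/ready"
QUIT_URL = "http://127.0.0.1:15020/quitquitquit"


def http_request(url, method="GET"):
    """Return True when the endpoint answers with HTTP 200, False otherwise."""
    try:
        req = urllib.request.Request(url, method=method)
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def run_with_sidecar(payload, ready_url=READY_URL, quit_url=QUIT_URL,
                     poll_interval=1.0, ready_timeout=120.0):
    """Wait for the proxy, run the payload, and always ask the proxy to exit."""
    deadline = time.time() + ready_timeout
    while not http_request(ready_url):
        if time.time() > deadline:
            raise TimeoutError("sidecar proxy never became ready")
        time.sleep(poll_interval)
    try:
        return payload()
    finally:
        http_request(quit_url, method="POST")  # best-effort proxy shutdown
```

Such a helper would be packed into the image and invoked in place of the original entrypoint, which is exactly the kind of per-workload change the conclusion above argues against.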
The automated mutation of the entrypoint looks the closest to a proper solution; however, it proposes to inject an init container with a wrapper binary and mutate the entrypoint, which is not always feasible as there could be issues related to container ordering and multi-container pods.\nMeet Istio AUX Controller #Overview #All the workarounds and the lack of an out-of-the-box solution led me to prototyping a simple MutatingAdmissionWebhook and a Pod Controller aimed at solving the above issues with the following principles in mind:\nThe existing user code including Kubernetes manifests should not change to work on Istio. Full automation. Once the solution is in place - it can be enabled or disabled per namespace by a user. Narrow scope and low impact that doesn\u0026rsquo;t require changing global settings. Container entrypoint must not be mutated. The majority of the workarounds deal with single-container Pods in Jobs. There might be other containers dependent on the network. The good news is that in version 1.7, Istio introduced a global configuration property values.global.proxy.holdApplicationUntilProxyStarts that injects the sidecar container at the beginning of the container list of a Pod and causes other containers to wait until it starts. This is described in great detail in a blog post by Marko Lukša: Delaying application start until sidecar is ready.\nIstio AUX contains a MutatingAdmissionWebhook that mutates the pods submitted to namespaces with specific labels and adds an Istio-specific annotation to Pods:\nproxy.istio.io/config: \u0026#34;holdApplicationUntilProxyStarts: true\u0026#34; That way, Istio Operator will take care of the rearranging of the sidecars and delaying the first non-Istio container start until the proxy is ready. 
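Conceptually, the mutation performed by such a webhook is a single JSONPatch operation that adds the annotation. A minimal sketch of the patch-building logic follows; the function name and shape are illustrative assumptions and are not taken from the istio-aux codebase (a real webhook wraps the patch in an AdmissionReview response).

```python
HOLD_ANNOTATION = "proxy.istio.io/config"
HOLD_VALUE = "holdApplicationUntilProxyStarts: true"


def build_annotation_patch(pod: dict) -> list:
    """Build an RFC 6902 JSONPatch adding the hold-until-proxy-starts annotation.

    In JSONPatch paths, a '/' inside a key is escaped as '~1'.
    """
    annotations = pod.get("metadata", {}).get("annotations")
    if annotations is None:
        # No annotations map yet: create it with the single entry.
        return [{
            "op": "add",
            "path": "/metadata/annotations",
            "value": {HOLD_ANNOTATION: HOLD_VALUE},
        }]
    return [{
        "op": "add",
        "path": "/metadata/annotations/" + HOLD_ANNOTATION.replace("/", "~1"),
        "value": HOLD_VALUE,
    }]
```

The two branches matter in practice: patching a key under a map that does not exist yet is rejected by JSONPatch, so pods without any annotations need the whole map created in one operation.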
This can also be solved by setting the same Istio Proxy property globally; however, it is false by default and it\u0026rsquo;s not clear whether this setting can impact other existing deployments outside Kubeflow.\nAnother part of the Istio AUX Controller is the Controller itself, which is also scoped to namespaces with specific labels and subscribed to Pod Update events. All the container status changes trigger the reconciliation, and the controller keeps checking what containers are still running in the Pod. Once there\u0026rsquo;s only one left and it is the Istio Proxy, the Controller execs into the pod and runs curl -sf -XPOST http://127.0.0.1:15020/quitquitquit inside it. The Istio Proxy container image has curl pre-installed so there\u0026rsquo;s no need for an additional binary or a sidecar to terminate the proxy.\nThe termination heuristic is pretty naive but it is easy to extend it to a more sophisticated version, e.g. checking against a list of container names that have to exit prior to terminating the Proxy.\nIstio AUX Controller is a reference implementation for the above approach and is available on GitHub at datastrophic/istio-aux.\nDemo #Prerequisites #You should have a Kubernetes cluster available; kind will suffice, but ensure the Docker daemon has sufficient resources to run cert-manager, Istio, the Kubeflow Training Operator, and a two-pod TFJob (8 CPU, 8GB RAM should be sufficient). The following software is required:\nkind kubectl istioctl Cluster setup #The cluster setup is pretty straightforward. 
The only highlight here is that we will use the Composite Operator that supports all types of training jobs (former TF Operator).\nkind create cluster # wait for node(s) to become ready kubectl wait --for condition=Ready node --all # install cert-manager kubectl create -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml # wait for pods to become ready kubectl wait --for=condition=Ready pods --all --namespace cert-manager # install istio istioctl install --set profile=demo -y # install the training operator kubectl apply -k \u0026#34;github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=master\u0026#34; # wait for pods to become ready kubectl wait --for=condition=Ready pods --all --namespace kubeflow # install the Istio AUX controller kubectl apply -k \u0026#34;github.com/datastrophic/istio-aux.git/config/default?ref=master\u0026#34; Deploying the workloads #Let\u0026rsquo;s create a TFJob that will be used for testing, enable Istio injection for the default namespace, and submit the job:\nkubectl label namespace default istio-injection=enabled cat \u0026lt;\u0026lt;EOF \u0026gt;./tfjob.yaml apiVersion: kubeflow.org/v1 kind: TFJob metadata: name: mnist spec: tfReplicaSpecs: Worker: replicas: 2 restartPolicy: OnFailure template: spec: containers: - name: tensorflow image: datastrophic/tensorflow:2.6.0-mnist command: [\u0026#39;python\u0026#39;, \u0026#39;-u\u0026#39;, \u0026#39;mnist.py\u0026#39;] EOF kubectl create -f tfjob.yaml kubectl get pods -w We\u0026rsquo;ll see that the Pods will eventually get stuck in the NotReady state with one container still running.\nNow let\u0026rsquo;s enable the Istio AUX Controller for the default namespace and redeploy the TFJob one more time.\nkubectl delete -f tfjob.yaml kubectl label namespace default io.datastrophic/istio-aux=enabled kubectl create -f tfjob.yaml kubectl get pods -w This time, all the pods reached the Completed state.\nIn the meantime, the Istio AUX Controller 
logs contain an output like this:\n... INFO\twebhook.webhook\tprocessing pod mnist-worker-0 INFO\twebhook.webhook\tpod mnist-worker-0 processed ... INFO\twebhook.webhook\tprocessing pod mnist-worker-1 INFO\twebhook.webhook\tpod mnist-worker-1 processed ... INFO\tistio-aux\tfound a pod with istio proxy, checking container statuses\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-0\u0026#34;} INFO\tistio-aux\tsome containers are still running, skipping istio proxy shutdown\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-0\u0026#34;, \u0026#34;containers\u0026#34;: [\u0026#34;tensorflow\u0026#34;]} ... INFO\tistio-aux\tfound a pod with istio proxy, checking container statuses\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-1\u0026#34;} INFO\tistio-aux\tsome containers are still running, skipping istio proxy shutdown\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-1\u0026#34;, \u0026#34;containers\u0026#34;: [\u0026#34;tensorflow\u0026#34;]} ... INFO\tistio-aux\tfound a pod with istio proxy, checking container statuses\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-0\u0026#34;} INFO\tistio-aux\tthe payload containers are terminated, proceeding with the proxy shutdown\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-0\u0026#34;} ... INFO\tistio-aux\tfound a pod with istio proxy, checking container statuses\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-1\u0026#34;} INFO\tistio-aux\tthe payload containers are terminated, proceeding with the proxy shutdown\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-1\u0026#34;} ... INFO\tistio-aux\tfound a pod with istio proxy, checking container statuses\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-0\u0026#34;} INFO\tistio-aux\tistio-proxy is already in a terminated state\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-0\u0026#34;} ... 
INFO\tistio-aux\tfound a pod with istio proxy, checking container statuses\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-1\u0026#34;} INFO\tistio-aux\tistio-proxy is already in a terminated state\t{\u0026#34;pod\u0026#34;: \u0026#34;mnist-worker-1\u0026#34;} ... Final thoughts #The proposed solution works for existing versions of Kubernetes and Istio but, given the fast pace of their evolution, might become outdated relatively quickly. It would be nice to have similar functionality in either but it is understandable that container interdependencies in Pods do not generalize well for a universal generic solution.\nIdeally, it would be great to have this problem solved by Kubernetes itself. As described in Sidecar container lifecycle changes in Kubernetes 1.18, it was proposed to assign containers a lifecycle type so that the sidecars would be terminated by the Kubelet once the payload containers complete.\nAlthough the reference implementation addresses a specific case and a subset of Kubeflow Operators, it provides a relatively generic solution to the problem, though it of course requires additional work to productionize.\nPlease don\u0026rsquo;t hesitate to reach out to me with feedback and/or if you are interested in collaboration.\n","date":"4 October 2021","permalink":"https://datastrophic.io/kubeflow-training-operators-and-istio-solving-the-proxy-sidecar-lifecycle-problem-for-aiml-workloads/","section":"","summary":"With Kubeflow gaining traction in the community and its early adoption in enterprises, security and observability concerns become more and more important. Many organizations that run AI/ML workloads operate with sensitive personal or financial data and have stricter requirements for data encryption, traceability, and access control. 
Quite often, we can see the use of the Istio service mesh for solving these problems and gaining other benefits of the rich functionality it provides.","title":"Kubeflow Training Operators and Istio: solving the proxy sidecar lifecycle problem for AI/ML workloads"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/mlops/","section":"Tags","summary":"","title":"Mlops"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/operators/","section":"Tags","summary":"","title":"Operators"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/docker/","section":"Tags","summary":"","title":"Docker"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/marathon/","section":"Tags","summary":"","title":"Marathon"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/mesos/","section":"Tags","summary":"","title":"Mesos"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/spark/","section":"Tags","summary":"","title":"Spark"},{"content":"After several years of running Spark JobServer workloads, the need for better availability and multi-tenancy emerged across several projects the author was involved in. This blog post covers design decisions made to provide higher availability and fault tolerance of JobServer installations, multi-tenancy for Spark workloads, scalability and failure recovery automation, and software choices made in order to reach these goals.\nSpark JobServer #Spark JobServer is widely used across a variety of reporting and aggregating systems. One of the valuable features among others is a unified REST API to interact with Spark Contexts, execute jobs, and retrieve results asynchronously from a cache. 
Unified API allows standardizing any Spark application and abstracts away the need for application developers to initialize and configure a Spark Context every time a new application is developed.\nIn most cases, Spark applications are developed to be used with spark-submit, which in turn will create a context at the moment of execution. Context creation is a costly operation and takes time depending on cluster utilization and resources requested. JobServer addresses this issue by maintaining long-running contexts so any loaded application doesn’t have to wait for a context to be initialized, which in turn results in faster response and execution times and allows Spark applications to be used as backends for querying data.\nOriginally, the JobServer was developed to run on Spark Standalone clusters and some of its design features address the same problems as, for example, the Application Master in YARN. This blog post is focused on design decisions targeted at increasing stability of the JobServer by utilizing Mesos as the Spark cluster manager and Marathon as an orchestration system for providing high availability.\nSpark uses a Cluster Manager for scheduling tasks to run in distributed mode (Figure 1). Supported cluster managers are Spark Standalone, Mesos and YARN. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program)[source].\nFigure 1. Spark application execution flow\nJobServer runs Spark Applications either as the Spark Driver itself or by spawning a separate JVM per context, thus being out-of-the-box compatible with the supported Cluster Managers (Figure 2). Figure 2. JobServer top-level Architecture. Single JVM (left) and JVM per context (right)\nLimitations #Spark Standalone as a cluster manager has several significant limitations making maintenance and operations harder for engineers:\nSpark master is a single point of failure when a single instance is used. 
In situations when workers fail and/or restart, they register back to the Spark Master and the cluster continues its operations. However, when the master fails and/or restarts, workers are unable to register automatically and a whole-cluster restart is needed. This problem can be solved by running Spark Master in HA mode and performing leader election and service discovery with ZooKeeper. Different Spark versions across applications. With a growing number of Spark applications, dependency versions start to diverge, and at some point it’s hard to perform a big-bang upgrade, so the need for different environments emerges. That is, applications using later Spark major releases will need another cluster with the corresponding version when standalone mode is used, and this situation is suboptimal: the number of clusters, the amount of hardware, and the engineering time needed for support will grow significantly. Heterogeneous environments and dependencies. Although multiple Spark versions could be considered a special case of this problem, it’s different and can arise even if the same Spark version is used. Applications can not only depend on different third-party libraries (e.g. hadoop and/or file format families) but also be compiled with different Java versions (in case JVM languages are used). Managing classpaths and runtime class version conflicts (also known as ‘jar hell’) is a time-consuming task which is better to avoid by means of stronger isolation. 
So let’s look at the requirements expected to be met by a cluster manager (the Omega paper by Google can be used as a reference):\nEfficiency efficient sharing of resources across applications utilization of cluster resources in the most optimal manner Flexibility support of a wide array of current and future frameworks dealing with hardware heterogeneity orchestration framework for applications providing high availability guarantees support of resource requests of different types (RAM, CPU, ports) Scalability scaling to clusters of hundreds of nodes scheduling system response times must remain acceptable while increasing number of machines and applications Robustness fault tolerance guarantees for the system and applications high availability of central scheduler component While part of these requirements is met by YARN, it won’t provide high availability guarantees for JobServer itself. Service failures can be addressed by means of systemd or upstart, but a hardware failure will need manual maintenance in most cases, involving provisioning of a new machine (if there’s no reserved one available) and deployment of JobServer to it. Even when all these steps are automated with tools like Ansible or Chef, the downtime for customer-facing applications is still unacceptable.\nAnother solution available in the open-source world is Apache Mesos. While it can be used as a Spark cluster manager out of the box, it’s also possible to execute standalone applications as long-running cluster tasks by means of Marathon - a container orchestration platform for Mesos.\nMesos overview #Mesos is a cluster resource manager which provides linear scalability, high availability, and container support with a unique two-level scheduling approach. Official documentation provides a detailed overview of Mesos architecture and its components, and here’s a really quick recap to be on the same page (Figure 3).\nFigure 3. 
Mesos architecture overview\nMaster a mediator between slave resources and frameworks enables fine-grained sharing of resources by making resource offers serves as master for Spark (not a single point of failure) Slave (Agent) manages resources on a physical node and runs executors Framework application that solves a specific use case (e.g. Spark) Scheduler negotiates with the master and handles resource offers Executors consume resources and run tasks on slaves In Mesos terminology, Spark is a framework that acquires cluster resources to execute its jobs. Depending on job resource demands (RAM, CPU), Spark accepts or declines resource offers made by the Mesos Master’s allocation module. The allocation module uses the Dominant Resource Fairness algorithm which, in simple terms, orders the sending of offers to frameworks based on their cluster usage, i.e. frameworks using fewer resources than the others receive offers first. More details are available in the Dominant Resource Fairness explained blog post.\nSpark on Mesos #Spark supports two modes of running on Mesos: fine-grained (deprecated) and coarse-grained. To understand the difference let’s have a quick look at the Mesos scheduling algorithm (Figure 4).\nFigure 4. Mesos scheduling overview\nAgent nodes continuously report to the Master the amount and type of available resources: RAM, CPU, disk, ports Allocation module starts offering resources to frameworks Framework receives offers if resources do not satisfy its needs - rejects the offer if resources satisfy its demands - creates a list of tasks and sends it to the master Master verifies the tasks and forwards them to the executor (and launches the executor if it’s not running) A task in Mesos terminology is a single unit of work executed by a framework. 
In fine-grained mode Spark wraps each of its tasks in a Mesos task, thus relying on Mesos scheduling, while in coarse-grained mode Spark only runs its executors (Spark workers) as Mesos tasks and executes tasks relying on its own scheduler and RPC mechanism (Akka or Netty, depending on Spark version) for submitting tasks to executors (Figure 5).\nFigure 5. Spark coarse-grained mode\nHeterogeneity and multi-tenancy #Moving from Spark Standalone to Mesos addresses yet another problem: shrinking the number of clusters in use and providing better cluster utilization. While it’s a valid point that it’s possible to run multiple contexts on the same cluster even with Spark Standalone, it is impossible to manage incompatible Spark versions within the same installation, not to mention Java versions. With proper packaging and environment setup, it’s easy to achieve these goals with Mesos, which makes it possible not only to run heterogeneous Spark contexts on the same cluster but also to share it with other applications (Figure 6).\nFigure 6. Mesos multi-tenancy\nLet’s have a look at an example of running 4 Spark JobServer applications (jobserver-bob and jobserver-alice, 2 instances each) on the same cluster. Each JobServer creates a Spark Context managed by a driver program: spark-driver-bob with 14 CPU and spark-driver-alice with 21 CPU in total. Also, one can observe the Marathon framework using 15% of cluster resources (Figure 7). Marathon is used to run JobServer instances, which in turn use Mesos for running Spark Contexts. Details of running JobServer in Marathon will be covered in the next part of the blog post series.\nFigure 7. Spark on Mesos multi-tenancy\nA naive way of supporting multiple Spark (and Java) versions would be installing all the necessary binaries on every machine in the cluster. 
While this allows ramping up really quickly, environments usually start to diverge before long and maintenance becomes tricky.\nThe good news is that the Spark-Mesos integration supports Docker: the spark.mesos.executor.docker.image configuration parameter allows specifying a custom Docker image to use for executors. Although it doesn’t look like the most important feature, it provides great flexibility as the number of environments and software versions in use grows.\nYet another important feature worth mentioning is Dynamic Resource Allocation, which allows Spark to return unused resources to the cluster manager and acquire more resources when application demands grow. This provides better resource utilization for multi-tenant workloads, but it should be used with caution for latency-critical applications because resource acquisition takes some time and, in the worst case, some other framework can take the requested resources. Dynamic allocation is supported for any Spark cluster manager using coarse-grained mode, and in Mesos it is the engineer’s responsibility to run the Spark External Shuffle Service in order to make it work.\nConclusion #With minor tweaking of existing Spark jobs and Spark JobServer, it became possible to achieve better utilization of a single cluster instead of running multiple idling Standalone clusters (Figure 8). Given that the problem of incompatible versions is solved by means of isolation, it’s possible to migrate all of the existing Spark projects to Mesos without upgrading all of them and keep running the different versions of Spark and Java which are currently in use.\nFigure 8. Mesos cluster utilization\nIt’s worth mentioning that JobServer is running on the same cluster with the same fault tolerance and high availability guarantees provided by Mesos and Marathon. 
Another important aspect of installations running on any kind of orchestration platform is physical node characteristics and a cluster layout that takes them into account. This topic will be covered in the next part of the series, together with the migration of JobServer to Marathon and the use of Docker as the main tool for packaging and distributing the applications running on Mesos.\n","date":"12 October 2017","permalink":"https://datastrophic.io/spark-jobserver-from-spark-standalone-to-mesos-marathon-and-docker-part-i/","section":"","summary":"After several years of running Spark JobServer workloads, the need for better availability and multi-tenancy emerged across several projects the author was involved in. This blog post covers design decisions made to provide higher availability and fault tolerance of JobServer installations, multi-tenancy for Spark workloads, scalability and failure recovery automation, and software choices made in order to reach these goals. Spark JobServer Spark JobServer is widely used across a variety of reporting and aggregating systems.","title":"Spark JobServer: from Spark Standalone to Mesos, Marathon and Docker"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/drf/","section":"Tags","summary":"","title":"DRF"},{"content":"Apache Mesos provides a unique approach to cluster resource management called two-level scheduling: instead of storing information about available cluster resources in a centralized manner it operates with a notion of resource offers which slave nodes advertise to running frameworks via Mesos master, thus keeping the whole system architecture concise and scalable. 
This post presents a set of experiments for different use cases using a simple Mesos framework implementation to observe and analyze DRF behavior.\nArchitecture Recap #The official Mesos documentation provides a great overview of the architecture, so let\u0026rsquo;s briefly recap the framework and resource offers part of it.\nFrameworks #Mesos frameworks consist of two main components: the Scheduler and the Executor.\nThe Scheduler is a single instance/process which negotiates with the Master and is responsible for handling resource offers, task submissions, task status updates, framework messages, and exceptional situations such as slave losses, disconnections, and various errors. Sometimes high availability is a requirement for schedulers, so some sort of leader election or specific logic is needed to avoid conflicts in task submissions.\nThe Executor is a process executed on Slave nodes that receives tasks from the Scheduler and runs them. The Executor lifecycle is bound to the Scheduler, so when the Scheduler finishes its job it also shuts down the executors (actually, the Mesos master performs this routine when the framework terminates). If the Executor is a JVM process it usually has a thread pool for executing received tasks.\nTasks are serialized in protobuf and contain all the information about the resources needed, the executor to run on, and some binary serialized payload (e.g. Spark tasks are serialized and transferred as a payload in Mesos tasks). 
Here\u0026rsquo;s an example of how Task is created:\ndef buildTask(offer: Offer, cpus: Double, memory: Int, executorInfo: ExecutorInfo) = { val cpuResource = Resource.newBuilder .setType(SCALAR) .setName(\u0026#34;cpus\u0026#34;) .setScalar(Scalar.newBuilder.setValue(cpus)) .setRole(\u0026#34;*\u0026#34;) .build val memResource = Resource.newBuilder .setType(SCALAR) .setName(\u0026#34;mem\u0026#34;) .setScalar(Scalar.newBuilder.setValue(memory)) .setRole(\u0026#34;*\u0026#34;) .build TaskInfo.newBuilder() .setSlaveId(SlaveID.newBuilder().setValue(offer.getSlaveId.getValue).build()) .setTaskId(TaskID.newBuilder().setValue(s\u0026#34;$uuid\u0026#34;)) .setExecutor(executorInfo) .setName(UUID.randomUUID().toString) .addResources(cpuResource) .addResources(memResource) .build() } def buildExecutorInfo(d: SchedulerDriver, prefix: String): ExecutorInfo = { val scriptPath = System.getProperty(\u0026#34;executor.path\u0026#34;,\u0026#34;/throttle/throttle-executor.sh\u0026#34;) ExecutorInfo.newBuilder() .setCommand(CommandInfo.newBuilder().setValue(\u0026#34;/bin/sh \u0026#34;+scriptPath)) .setExecutorId(ExecutorID.newBuilder().setValue(s\u0026#34;${prefix}_$uuid\u0026#34;)) .build() } Sources are available at the accompanying github repo.\nResource Offers # Slave nodes periodically report to the Master free resources they can provide. This also happens when executor tasks complete minimizing delay Allocation module starts offering the resources to frameworks. A dominant resource fairness algorithm is used to define the order in which frameworks are offered the resource. The framework can either accept or decline the offer depending on its demand. In case of decline (e.g. insufficient resources of some type) the resources are offered to the next framework according to DRF. If resources in the offer satisfy the framework\u0026rsquo;s demand then it creates a list of Tasks and sends it back to Master. 
Master could respond with an error if task resources exceed the amount provided in the offer. Master sends the set of tasks to the Executor on the Slave (and launches the executor if it\u0026rsquo;s not running) Dominant Resource Fairness (DRF) #The main problem that DRF solves is sharing of resources of multiple types (not only CPU), which hadn\u0026rsquo;t been addressed before and caused unfair resource distribution among applications with heterogeneous resource demands. To address this problem the notions of dominant share and dominant resource have been introduced.\nDominant resource - a resource of a specific type (cpu, memory, disk, ports) which is most demanded by the given framework among the resources it needs, measured as a share of the total cluster resources of the same type.\nExample 1: given a cluster with a total of 10 CPUs and 20GB RAM, resource demands are: Framework A \u0026lt; 4 CPU, 5 GB \u0026gt;, Framework B \u0026lt; 2 CPU, 8 GB \u0026gt;. The same expressed as shares of total cluster resources: Framework A \u0026lt; 40% CPU, 25% RAM \u0026gt;, Framework B \u0026lt; 20% CPU, 40% RAM \u0026gt;. So for Framework A, CPU is the dominant resource, while for Framework B it is RAM.\nDRF computes the share of the dominant resource allocated to a framework (dominant share) and tries to maximize the smallest dominant share in the system. During the next round of resource offers the allocation module applies DRF to identify the dominant shares of the frameworks and offers the resources first to the one with the smallest dominant share, then to the second smallest one, and so on.\nExample 2: consider a cluster with a total of 10 CPUs and 20GB RAM, resource demands are: Framework A \u0026lt; 3 CPU, 2 GB \u0026gt;, Framework B \u0026lt; 1 CPU, 5 GB \u0026gt;. The same in percent: A \u0026lt; 30% CPU, 10% RAM \u0026gt;, B \u0026lt; 10% CPU, 25% RAM \u0026gt;. Steps:\n1. 
(10cpu, 20gb) to A: A(3cpu, 2gb, 30%), B (0cpu, 0gb, 0%) 2. (7cpu, 18gb) to B: A(3cpu, 2gb, 30%), B (1cpu, 5gb, 25%) 3. (6cpu, 13gb) to B: A(3cpu, 2gb, 30%), B (2cpu, 10gb, 50%) 4. (5cpu, 8gb) to A: A(6cpu, 4gb, 60%), B (2cpu, 10gb, 50%) 5. (2cpu, 6gb) to B: A(6cpu, 4gb, 60%), B (3cpu, 15gb, 75%) 6. (1cpu, 1gb) to A: declined, A(6cpu, 4gb, 60%), B (3cpu, 15gb, 75%) 7. (1cpu, 1gb) to B: declined, A(6cpu, 4gb, 60%), B (3cpu, 15gb, 75%) ... and so on until a task of one of the frameworks finishes and its resources are released. Let\u0026rsquo;s walk through the steps. The first resource offer goes to A (let\u0026rsquo;s say it was started first, so it receives the offer first). A accepts the offer and its dominant share becomes 30%, so the next offer goes to the framework with the smallest dominant share, which is B (step 2). After accepting the offer, B\u0026rsquo;s dominant share becomes 25%, which is less than A\u0026rsquo;s share, so it receives the next offer (step 3) and now its share is 50%. A becomes the framework with the smallest share and receives the next offer (step 4), and so on. In steps 6 and 7 both frameworks decline the offer because the remaining resources don\u0026rsquo;t fit a single task of either.\nIn the end, only 1 cpu and 1 gb of RAM remain available, so cluster CPU is utilized at 90% and RAM at 95%, which is pretty good saturation.\nHopefully, the description was not too complicated, because the algorithm itself is very simple and sound. Please refer to the original paper \u0026ldquo;Dominant Resource Fairness: Fair Allocation of Multiple Resource Types\u0026rdquo; for more details.\nThere\u0026rsquo;s not always such perfect cluster saturation in real life, so the next part of the post covers different use cases and how DRF addresses them. 
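The allocation loop from Example 2 can also be expressed as a toy simulation. Python is used below for brevity (the actual framework code later in this post is Scala), and the function names are purely illustrative, not part of any Mesos API:

```python
# Toy DRF allocation loop reproducing Example 2 above.
# Illustrative only: these names and structures are not Mesos APIs.
def dominant_share(alloc, total):
    # The largest fraction of any single cluster resource held by a framework.
    return max(a / t for a, t in zip(alloc, total))

def drf_allocate(total, demands):
    # total: cluster capacity, e.g. (cpus, mem_gb)
    # demands: per-task resource demand of each framework
    free = list(total)
    alloc = {name: [0.0] * len(total) for name in demands}
    while True:
        placed = False
        # Offer resources in ascending order of dominant share.
        for name in sorted(alloc, key=lambda n: dominant_share(alloc[n], total)):
            task = demands[name]
            if all(need <= have for need, have in zip(task, free)):
                for i, need in enumerate(task):
                    alloc[name][i] += need
                    free[i] -= need
                placed = True
                break
        if not placed:
            return alloc, free  # no framework can fit another task

alloc, free = drf_allocate((10, 20), {'A': (3, 2), 'B': (1, 5)})
# A ends with 2 tasks (6 CPU, 4 GB), B with 3 tasks (3 CPU, 15 GB);
# (1 CPU, 1 GB) remains free, i.e. 90% CPU and 95% RAM utilization.
```

Changing the per-task demands passed to drf_allocate reproduces the other scenarios from the experiments below.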
We\u0026rsquo;re about to run several instances of a simple Mesos framework with different resource demands (configured at launch time) in parallel to observe DRF behavior and identify potential bottlenecks.\nExample Framework #For observing DRF behavior a simple Mesos framework is going to be used. The scheduler will submit as many tasks per resource offer as fit in it, and the executor will emulate processing by putting the task thread to sleep for a random amount of time. The main requirement for the framework is configurable task resource demands, to properly set up different conditions for parallel execution with heterogeneous resource demands.\nResource offers handling in the Scheduler code:\noverride def resourceOffers(driver: SchedulerDriver, offers: util.List[Offer]): Unit = { for(offer \u0026lt;- offers){ stateLock.synchronized { logger.info(s\u0026#34;Received resource offer: cpus:${getCpus(offer)} mem: ${getMemory(offer)}\u0026#34;) if(isOfferValid(offer)){ val executorInfo = executors.getOrElseUpdate(offer.getSlaveId.getValue, buildExecutorInfo(driver, \u0026#34;DRFDemoExecutor\u0026#34;)) //amount of tasks is calculated to fully use resources from the offer val tasks = buildTasks(offer, config, executorInfo) logger.info(s\u0026#34;Launching ${tasks.size} tasks on slave ${offer.getSlaveId.getValue}\u0026#34;) driver.launchTasks(List(offer.getId), tasks) } else { logger.info(s\u0026#34;Offer provides insufficient resources. 
Declining.\u0026#34;) driver.declineOffer(offer.getId) } } } } def buildTasks(offer: Offer, config: Config, executorInfo: ExecutorInfo): List[TaskInfo] = { val amount = Math.min(getCpus(offer)/config.cpus, getMemory(offer)/config.mem).toInt (1 to amount).map(_ =\u0026gt; buildTask(offer, config.cpus, config.mem, executorInfo) ).toList } An excerpt from the Executor task handling method:\noverride def launchTask(driver: ExecutorDriver, task: TaskInfo): Unit = { threadPool.execute(new Runnable() { override def run(): Unit = { val taskStatus = TaskStatus.newBuilder().setTaskId(task.getTaskId) val taskId = task.getTaskId.getValue logger.info(s\u0026#34;Task $taskId received by executor: ${task.getExecutor.getExecutorId.getValue}\u0026#34;) driver.sendStatusUpdate( taskStatus .setState(TaskState.TASK_RUNNING) .build() ) val delay = 20000 + Random.nextInt(20000) logger.info(s\u0026#34;Running task for ${delay/1000f} sec.\u0026#34;) Thread.sleep(delay) val msg = s\u0026#34;Task $taskId finished\u0026#34; logger.info(msg) driver.sendStatusUpdate( taskStatus .setState(TaskState.TASK_FINISHED) .setData(ByteString.copyFrom(serialize(msg))) .build() ) } }) } And the framework\u0026rsquo;s entrypoint looks like this:\ncase class Config( mesosURL: String = \u0026#34;\u0026#34;, name: String = \u0026#34;\u0026#34;, cpus: Double = 0.1, mem: Int = 512 ) ... 
def run(config: Config): Unit = { val framework = FrameworkInfo.newBuilder .setName(config.name) .setUser(\u0026#34;\u0026#34;) .setRole(\u0026#34;*\u0026#34;) .setCheckpoint(false) .setFailoverTimeout(0.0d) .build() val driver = new MesosSchedulerDriver(new DRFDemoScheduler(config), framework, config.mesosURL) driver.run() } Full source code is available on github.\nThe config case class is created with scopt (a Scala options parser) and the framework invocation looks as follows:\njava -cp /throttle/throttle-framework.jar \\ -Dexecutor.path=/throttle/drf-executor.sh \\ io.datastrophic.mesos.drf.DRFDemoFramework \\ --mesos-master zk://zookeeper:2181/mesos \\ --framework-name \u0026#39;Framework A\u0026#39; \\ --task-cpus 2 \\ --task-memory 1000 This allows running the same framework code with different task resource demands to easily set up experimental use cases.\nExperiments #A small virtual cluster with a total capacity of 6 CPUs and 6.6 GB RAM will be used for the experiments. We\u0026rsquo;re going to use small dummy tasks (which simply sleep to emulate processing), so the total capacity is not a big deal here. A dockerized environment is also available on github.\nRedistribution of resources between frameworks with the same demands #Let\u0026rsquo;s first look at how Mesos redistributes resources when new frameworks with similar resource demands are added to and removed from the cluster. We\u0026rsquo;re going to use \u0026lt; 0.5 CPU, 512 MB RAM \u0026gt; (or \u0026lt; 8% CPU, 7.5% RAM \u0026gt;) as task resource demands and 20000 + Random.nextInt(20000) millis as task duration to make tasks run for 20-40 seconds each, because when tasks are too small the cluster will not be fully loaded (the default resource offer round frequency is 5 seconds).\nFirst, Framework A is launched as the only application running on the cluster. After a while we\u0026rsquo;ll see this picture in the Mesos UI: Let\u0026rsquo;s now launch a competing Framework B. 
In most cases the behavior is as expected - equal sharing of resources: But for short periods of time one can observe this picture (disregarding which framework is which): This is the result of a transitional situation: both frameworks had an equal dominant share (50/50) when two of Framework A\u0026rsquo;s tasks finished close to each other in time (leaving 4 running tasks). After that the first offer goes to A, which accepts it and launches one task (now there are 5 running tasks). The next offer goes to B, which accepts it (having had no tasks finish up to that moment, the total amount of tasks is 7). The next offer will go to Framework A. It\u0026rsquo;s not an extraordinary situation though: such distortions are normal given the specifics of framework implementations and how they work with resource offers.\nNow Framework C is launched in the cluster: For the reasons described above a different picture can sometimes be observed: From this we can conclude that with more frameworks running in parallel these discrepancies would be observed more often, but none of the frameworks will occupy the majority of the resources forever. DRF defines the priority in which frameworks are offered the resources, but within a single round none of the frameworks is offered resources twice. There exists a Weighted DRF implementation which changes this behavior.\nRedistribution of resources between frameworks with heterogeneous resource demands #Now Framework A\u0026rsquo;s tasks will consume \u0026lt; 2 CPU, 1000 RAM \u0026gt; (or \u0026lt; 33% CPU, 15% RAM \u0026gt;) and Framework B\u0026rsquo;s tasks will consume \u0026lt; 1 CPU, 2000 RAM \u0026gt; (or \u0026lt; 16% CPU, 30% RAM \u0026gt;).\nLet\u0026rsquo;s see how it works: Looks awesome, but here\u0026rsquo;s a trick in the resource demands: together the frameworks completely saturate the cluster and their dominant resources are not in conflict. 
Two tasks of Framework A use ~67% of the cluster CPU while two tasks of Framework B use ~61% of the memory. But let\u0026rsquo;s increase Framework B\u0026rsquo;s memory demand to ~38% (2500MB), so that to run two tasks it will need more than 75% of the cluster memory and more resources in the offers. Most of the time we\u0026rsquo;ll be observing this picture: The picture can change when both tasks of Framework A finish in between resource offer rounds and all freed resources are offered to Framework B, which will launch two of its tasks. After that there are enough resources for launching only one task of Framework A: The conclusion is that running multiple small tasks is better than launching large ones in terms of time spent waiting for enough resources to be freed by other frameworks.\nDiscussion #After a number of simulations with varying numbers of frameworks and different dominant resources the observations described above still hold. Corner cases still appear, but within different time frames, and resource allocations converge to a fair distribution of resources.\nOne important thing to note is that one resource offer represents the amount of resources available on one slave node. This means that if someone wants to execute a task that needs N cpus and M memory, there should exist a physical node in the cluster with this capacity.\nMesos encourages incremental task execution to run the whole job (fine-grained mode), but if some sort of gang scheduling is needed then the framework can implement a coarse-grained approach (like Spark does) to run its executors with the maximum available resources and schedule the job by means of the framework. 
Another important thing to mention is that frameworks don\u0026rsquo;t have information about the total amount of cluster resources so it\u0026rsquo;s hardly possible to allocate all the resources at once.\nDRF Properties #The Dominant Resource Fairness algorithm satisfies several important properties which are worth mentioning to have a full picture. Cites are from the original paper \u0026ldquo;Dominant Resource Fairness: Fair Allocation of Multiple Resource Types\u0026rdquo;:\nSharing incentive: Each user should be better off sharing the cluster, than exclusively using her own partition of the cluster. Consider a cluster with identical nodes and n users. Then a user should not be able to allocate more tasks in a cluster partition consisting of 1/n of all resources. Strategy-proofness: Users should not be able to benefit by lying about their resource demands. This provides incentive compatibility, as a user cannot improve her allocation by lying. Envy-freeness: A user should not prefer the allocation of another user. This property embodies the notion of fairness. Pareto efficiency: It should not be possible to increase the allocation of a user without decreasing the allocation of at least another user. This property is important as it leads to maximizing system utilization subject to satisfying the other properties. Single resource fairness: For a single resource, the solution should reduce to max-min fairness. Bottleneck fairness: If there is one resource that is percent-wise demanded most of by every user, then the solution should reduce to max-min fairness for that resource. Population monotonicity: When a user leaves the system and relinquishes her resources, none of the allocations of the remaining users should decrease. Resource monotonicity: If more resources are added to the system, none of the allocations of the existing users should decrease. 
Rephrasing the sharing incentive point: users are offered a share of the whole cluster\u0026rsquo;s resources instead of exclusively owning their own partition. While in both cases the user has a guaranteed minimum allocation of 1/n of cluster resources (given n users), with partition allocation the user will not be able to allocate more than 1/n of the cluster, while with DRF it\u0026rsquo;s possible to allocate more tasks using resources released by other frameworks.\nWhere to go from here # The original DRF paper: \u0026ldquo;Dominant Resource Fairness: Fair Allocation of Multiple Resource Types\u0026rdquo; Video of the DRF talk at the USENIX Symposium on Networked Systems Design and Implementation Paper: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center Paper on the Omega scheduler describing approaches different from two-level scheduling: Omega: flexible, scalable schedulers for large compute clusters Project code and a dockerized environment to work with at github ","date":"27 March 2016","permalink":"https://datastrophic.io/resource-allocation-in-mesos-dominant-resource-fairness-explained/","section":"","summary":"Apache Mesos provides a unique approach to cluster resource management called two-level scheduling: instead of storing information about available cluster resources in a centralized manner it operates with a notion of resource offers which slave nodes advertise to running frameworks via Mesos master, thus keeping the whole system architecture concise and scalable. 
Master’s allocation module is responsible for deciding which application should receive the next resource offer, and it relies on the Dominant Resource Fairness (DRF) algorithm for making these decisions.","title":"Resource Allocation in Mesos: Dominant Resource Fairness"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/sheduling/","section":"Tags","summary":"","title":"Sheduling"},{"content":"This post covers core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks, and shuffle implementation, and also describes the architecture and main components of the Spark Driver. There\u0026rsquo;s a github.com/datastrophic/spark-workshop project created alongside this post which contains Spark application examples and a dockerized Hadoop environment to play with. Slides are also available at slideshare.\nIntro #Spark is a generalized framework for distributed data processing providing a functional API for manipulating data at scale, in-memory data caching, and reuse across computations. It applies a set of coarse-grained transformations over partitioned data and relies on the dataset lineage to recompute tasks in case of failures. Worth mentioning is that Spark supports the majority of data formats, has integrations with various storage systems, and can be executed on Mesos or YARN.\nA powerful and concise API in conjunction with a rich library makes it easier to perform data operations at scale. E.g. 
performing backup and restore of Cassandra column families in Parquet format:\ndef backup(path: String, config: Config) { sc.cassandraTable(config.keyspace, config.table) .map(_.toEvent).toDF() .write.parquet(path) } def restore(path: String, config: Config) { sqlContext.read.parquet(path) .map(_.toEvent) .saveToCassandra(config.keyspace, config.table) } Or running discrepancy analysis comparing the data in different data stores:\nsqlContext.sql { \u0026#34;\u0026#34;\u0026#34; SELECT count(*) FROM cassandra_event_rollups JOIN mongo_event_rollups ON cassandra_event_rollups.uuid = mongo_event_rollups.uuid WHERE cassandra_event_rollups.value != mongo_event_rollups.value \u0026#34;\u0026#34;\u0026#34;.stripMargin } Recap #Spark is built around the concepts of Resilient Distributed Datasets and a Directed Acyclic Graph (DAG) representing transformations and dependencies between them.\nA Spark Application (often referred to as Driver Program or Application Master) at a high level consists of SparkContext and user code which interacts with it, creating RDDs and performing a series of transformations to achieve the final result. These transformations of RDDs are then translated into a DAG and submitted to the Scheduler to be executed on a set of worker nodes.\nRDD: Resilient Distributed Datasets #An RDD can be thought of as an immutable parallel data structure with failure recovery possibilities. It provides an API for various transformations and materializations of data as well as for control over caching and partitioning of elements to optimize data placement. An RDD can be created either from external storage or from another RDD and stores information about its parents to optimize execution (via pipelining of operations) and recompute partitions in case of failure.\nFrom a developer\u0026rsquo;s point of view, RDD represents distributed immutable data (partitioned data + iterator) and lazily evaluated operations (transformations). 
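The partitioned-data-plus-lazy-transformations idea can be illustrated with a tiny sketch. This is a hypothetical toy class in Python for illustration only, not how Spark implements RDDs:

```python
# Toy sketch of the partitions + lazy transformations idea behind RDDs.
# Hypothetical class for illustration, not actual Spark code.
class ToyRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self.partitions = partitions  # data splits (only for the root RDD)
        self.parent = parent          # lineage: the parent RDD, if any
        self.fn = fn                  # transformation recorded, not applied

    def map(self, fn):
        # A transformation: no work happens here, only the dependency
        # on the parent is recorded (the lineage).
        return ToyRDD(parent=self, fn=fn)

    def compute(self, split):
        # Materialize one partition by recursively recomputing the
        # lineage; this is also how a lost partition is recovered.
        if self.parent is None:
            return iter(self.partitions[split])
        return map(self.fn, self.parent.compute(split))

    def num_splits(self):
        rdd = self
        while rdd.parent is not None:
            rdd = rdd.parent
        return len(rdd.partitions)

    def collect(self):
        # An action: triggers the actual computation of all partitions.
        return [x for s in range(self.num_splits()) for x in self.compute(s)]

base = ToyRDD([[1, 2], [3, 4]])        # two partitions
doubled = base.map(lambda x: x * 2)    # nothing computed yet
result = doubled.collect()             # now the lineage is evaluated
```

Only collect() does any work here; map() merely extends the lineage, and a lost partition of the derived dataset could be rebuilt by replaying compute() over its parent.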
As an interface RDD defines five main properties:\n//a list of partitions (e.g. splits in Hadoop) def getPartitions: Array[Partition] //a list of dependencies on other RDDs def getDependencies: Seq[Dependency[_]] //a function for computing each split def compute(split: Partition, context: TaskContext): Iterator[T] //(optional) a list of preferred locations to compute each split on def getPreferredLocations(split: Partition): Seq[String] = Nil //(optional) a partitioner for key-value RDDs val partitioner: Option[Partitioner] = None Here\u0026rsquo;s an example of the RDDs created during a call to the method sparkContext.textFile(\u0026quot;hdfs://...\u0026quot;), which first loads HDFS blocks in memory and then applies the map() function to drop the keys, creating two RDDs: HadoopRDD: getPartitions = HDFS blocks getDependencies = None compute = load block in memory getPreferredLocations = HDFS block locations partitioner = None MapPartitionsRDD getPartitions = same as parent getDependencies = parent RDD compute = compute parent and apply map() getPreferredLocations = same as parent partitioner = None RDD Operations Operations on RDDs are divided into several groups:\nTransformations apply a user function to every element in a partition (or to the whole partition) apply an aggregation function to the whole dataset (groupBy, sortBy) introduce dependencies between RDDs to form a DAG provide functionality for repartitioning (repartition, partitionBy) Actions trigger job execution used to materialize computation results Extra: persistence explicitly store RDDs in memory, on disk or off-heap (cache, persist) checkpointing for truncating RDD lineage Here\u0026rsquo;s a code sample of a job which aggregates data from Cassandra in lambda style, combining previously rolled-up data with the data from raw storage, and demonstrates some of the transformations and actions available on RDDs:\n//aggregate events after specific date for given campaign val events = 
sc.cassandraTable(\u0026#34;demo\u0026#34;, \u0026#34;event\u0026#34;) .map(_.toEvent) .filter { e =\u0026gt; e.campaignId == campaignId \u0026amp;\u0026amp; e.time.isAfter(watermark) } .keyBy(_.eventType) .reduceByKey(_ + _) .cache() //aggregate campaigns by type val campaigns = sc.cassandraTable(\u0026#34;demo\u0026#34;, \u0026#34;campaign\u0026#34;) .map(_.toCampaign) .filter { c =\u0026gt; c.id == campaignId \u0026amp;\u0026amp; c.time.isBefore(watermark) } .keyBy(_.eventType) .reduceByKey(_ + _) .cache() //joined rollups and raw events val joinedTotals = campaigns.join(events) .map { case (key, (campaign, event)) =\u0026gt; CampaignTotals(campaign, event) } .collect() //count totals separately val eventTotals = events.map{ case (t, e) =\u0026gt; s\u0026#34;$t -\u0026gt; ${e.value}\u0026#34; } .collect() val campaignTotals = campaigns.map{ case (t, e) =\u0026gt; s\u0026#34;$t -\u0026gt; ${e.value}\u0026#34; } .collect() Execution workflow recap # Here\u0026rsquo;s a quick recap on the execution workflow before digging deeper into details: user code containing RDD transformations forms a Directed Acyclic Graph which is then split into stages of tasks by the DAGScheduler. Stages combine tasks that don’t require shuffling/repartitioning of the data. Tasks run on workers and the results are then returned to the client.\nDAG # Here’s the DAG for the code sample above. So basically any data processing workflow can be defined as reading a data source, applying a set of transformations, and materializing the result in different ways. 
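The stage-forming step from this recap can be sketched as a toy function that cuts a linear chain of operations at shuffle boundaries. Python is used for brevity, and this is a hypothetical helper - the real DAGScheduler walks the RDD dependency graph rather than a flat list of operation names:

```python
# Toy sketch of splitting a linear chain of RDD operations into stages at
# shuffle boundaries. Hypothetical helper: the real DAGScheduler works on
# the RDD dependency graph, not on a flat list of operation names.
NARROW = {'map', 'filter', 'flatMap'}        # pipelineable within a stage
WIDE = {'groupBy', 'reduceByKey', 'sortBy'}  # introduce a shuffle boundary

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE:
            # A wide dependency ends the stage: its map output is written
            # for shuffle and the next stage reads it after a barrier.
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

ops = ['map', 'filter', 'reduceByKey', 'map', 'sortBy', 'map']
stages = split_into_stages(ops)
# three stages: [map, filter, reduceByKey], [map, sortBy], [map]
```

For the lambda-style job above this yields a new stage around each reduceByKey, since only the narrow operations in between can be pipelined into one set of tasks.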
Transformations create dependencies between RDDs, and here we can see their different types.\nThe dependencies are usually classified as \u0026ldquo;narrow\u0026rdquo; and \u0026ldquo;wide\u0026rdquo;: Narrow (\u0026ldquo;pipelineable\u0026rdquo;) each partition of the parent RDD is used by at most one partition of the child RDD allow for pipelined execution on one cluster node failure recovery is more efficient as only lost parent partitions need to be recomputed Wide (shuffle) multiple child partitions may depend on one parent partition require data from all parent partitions to be available and to be shuffled across the nodes if some partition is lost, a complete recomputation from all the ancestors is needed Splitting DAG into Stages #Spark stages are created by breaking the RDD graph at shuffle boundaries. RDD operations with \u0026ldquo;narrow\u0026rdquo; dependencies, like map() and filter(), are pipelined together into one set of tasks in each stage; operations with shuffle dependencies require multiple stages (one to write a set of map output files, and another to read those files after a barrier). In the end, every stage will have only shuffle dependencies on other stages, and may compute multiple operations inside it. The actual pipelining of these operations happens in the RDD.compute() functions of various RDDs. There are two types of tasks in Spark: ShuffleMapTask, which partitions its input for shuffle, and ResultTask, which sends its output to the driver. 
The same applies to the types of stages: ShuffleMapStage and ResultStage respectively.\nShuffle #During the shuffle, ShuffleMapTask writes blocks to the local drive, and then tasks in the next stages fetch these blocks over the network.\nShuffle Write redistributes data among partitions and writes files to disk each hash shuffle task creates one file per “reduce” task (total = MxR) a sort shuffle task creates one file with regions assigned to the reducer sort shuffle uses in-memory sorting with spillover to disk to get the final result Shuffle Read fetches the files and applies reduce() logic if data ordering is needed then it is sorted on the “reducer” side for any type of shuffle In Spark, Sort Shuffle is the default one since 1.2, but Hash Shuffle is available too.\nSort Shuffle Incoming records are accumulated and sorted in memory according to their target partition ids Sorted records are written to a file, or to multiple files if spilled, and then merged an index file stores offsets of the data blocks in the data file Sorting without deserialization is possible under certain conditions (SPARK-7081) Spark Components #At a 10,000-foot view there are three major components: Spark Driver a separate process to execute user applications creates SparkContext to schedule job execution and negotiate with the cluster manager Executors run tasks scheduled by the driver store computation results in memory, on disk or off-heap interact with storage systems Cluster Manager Mesos YARN Spark Standalone The Spark Driver contains more components responsible for translation of user code into actual jobs executed on a cluster: SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators, and broadcast variables on that cluster DAGScheduler computes a DAG of stages for each job and submits them to TaskScheduler determines preferred locations for tasks (based on cache status or shuffle file locations) and finds a minimum schedule to run the jobs TaskScheduler 
responsible for sending tasks to the cluster, running them, retrying if there are failures, and mitigating stragglers SchedulerBackend a backend interface for scheduling systems that allows plugging in different implementations (Mesos, YARN, Standalone, local) BlockManager provides interfaces for putting and retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap) Memory Management in Spark 1.6 #Executors run as Java processes, so the available memory is equal to the heap size. Internally, available memory is split into several regions with specific functions. Execution Memory storage for data needed during task execution shuffle-related data Storage Memory storage of cached RDDs and broadcast variables possible to borrow from execution memory (spill otherwise) the safeguard value is 50% of Spark memory, below which cached blocks are immune to eviction User Memory user data structures and internal metadata in Spark safeguarding against OOM Reserved memory memory needed for running the executor itself, not strictly related to Spark Where to go from here # The Spark source code is a great source of information, contains thorough scaladocs, and is absolutely worth checking out Official Spark documentation Great blog on Distributed Systems Architectures containing a lot of Spark-related stuff 0x0fff Spark Internals github project contains extremely deep explanations of different Spark aspects ","date":"3 March 2016","permalink":"https://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/","section":"","summary":"This post covers core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks, and shuffle implementation and also describes the architecture and main components of Spark Driver. There’s a github.com/datastrophic/spark-workshop project created alongside this post which contains Spark application examples and a dockerized Hadoop environment to play with. Slides are also available at slideshare. 
Intro Spark is a generalized framework for distributed data processing providing a functional API for manipulating data at scale, in-memory data caching, and reuse across computations.","title":"Apache Spark: core concepts, architecture and internals"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/dag/","section":"Tags","summary":"","title":"DAG"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/rdd/","section":"Tags","summary":"","title":"RDD"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/shuffle/","section":"Tags","summary":"","title":"Shuffle"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/akka/","section":"Tags","summary":"","title":"Akka"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/cassandra/","section":"Tags","summary":"","title":"Cassandra"},{"content":"This post is a follow-up to the talk given at the Big Data AW meetup in Stockholm and focuses on different use cases and design approaches for building scalable data processing platforms with the SMACK (Spark, Mesos, Akka, Cassandra, Kafka) stack. While the stack is really concise and consists of only a few components, it is possible to implement different system designs covering not only purely batch or stream processing, but more complex Lambda and Kappa architectures as well. 
So let\u0026rsquo;s start with a really short overview to be on the same page and continue with designs and examples coming from production project experience.\nRecap # Spark - a fast and general engine for distributed, large-scale data processing\nMesos - a cluster resource management system that provides efficient resource isolation and sharing across distributed applications\nAkka - a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM\nCassandra - a distributed, highly available database designed to handle large amounts of data across multiple datacenters\nKafka - a high-throughput, low-latency distributed messaging system/commit log designed for handling real-time data feeds\nStorage layer: Cassandra # Cassandra is well-known for its high availability and high-throughput characteristics and is able to handle enormous write loads and survive cluster node failures. In terms of the CAP theorem, Cassandra provides tunable consistency/availability for operations.\nWhat is more interesting when it comes to data processing is that Cassandra is linearly scalable (increased loads could be addressed by just adding more nodes to a cluster) and it provides cross-datacenter replication (XDCR) capabilities. Actually, XDCR provides not only replication but also enables a set of really interesting use cases:\ngeo-distributed datacenters handling data specific for the region or located closer to customers data migration across datacenters: recovery after failures or moving data to a new DC separate operational and analytics workloads But all these features come at a price, and with Cassandra this price is its data model, which could be thought of as a nested sorted map that is distributed across cluster nodes by partition key, with entries sorted/grouped by clustering columns. 
Here\u0026rsquo;s a small example:\nCREATE TABLE campaign( id uuid, year int, month int, day int, views bigint, clicks bigint, PRIMARY KEY (id, year, month, day) ); INSERT INTO campaign(id, year, month, day, views, clicks) VALUES(40b08953-a…,2015, 9, 10, 1000, 42); SELECT views, clicks FROM campaign WHERE id=40b08953-a… and year=2015 and month\u0026gt;8; To get data in a specific range, the full key must be specified, and no range clauses are allowed except for the last column in the list. This constraint is introduced to limit multiple scans for different ranges which would produce random disk access and degrade performance. This means that the data model should be carefully designed against the read queries to limit the number of reads/scans, which means less flexibility when it comes to supporting new queries. Here are C* data modeling 101 slides that provide several examples of how CQL tables are represented internally.\nBut what if one has some tables that need to be joined somehow with other tables? Let\u0026rsquo;s consider the following case: calculate total views per campaign for a given month for all campaigns.\nCREATE TABLE event( id uuid, ad_id uuid, campaign uuid, ts bigint, type text, PRIMARY KEY(id) ); With the given model, the only way to achieve this goal is to read all campaigns, read all events, sum up the proper ones (with matching campaign id) and assign them to the campaign. And it looks really challenging to implement such an application because the amount of data stored in Cassandra could be really huge and won\u0026rsquo;t fit in memory. 
So the processing of such data should be done in a distributed manner, and Spark perfectly fits this use case.\nProcessing layer: Spark # The main abstraction Spark operates with is the RDD (Resilient Distributed Dataset, a distributed collection of elements) and the workflow consists of four main phases:\nRDD operations (transformations and actions) form a DAG (Directed Acyclic Graph) the DAG is split into stages of tasks which are then submitted to the cluster manager stages combine tasks that don’t require shuffling/repartitioning tasks run on workers and results are then returned to the client Here\u0026rsquo;s how one can solve the above problem with Spark and Cassandra:\nval sc = new SparkContext(conf) case class Event(id: UUID, ad_id: UUID, campaign: UUID, ts: Long, `type`: String) sc.cassandraTable[Event](\u0026#34;keyspace\u0026#34;, \u0026#34;event\u0026#34;) .filter(e =\u0026gt; e.`type` == \u0026#34;view\u0026#34; \u0026amp;\u0026amp; checkMonth(e.ts)) .map(e =\u0026gt; (e.campaign, 1)) .reduceByKey(_ + _) .collect() Interaction with Cassandra is performed via the spark-cassandra-connector, which makes it really easy and straightforward. 
There\u0026rsquo;s one more interesting option to work with NoSQL stores - SparkSQL, which translates SQL statements into a series of RDD operations.\ncase class CampaignReport(id: String, views: Long, clicks: Long) sql(\u0026#34;\u0026#34;\u0026#34;SELECT campaign.id as id, campaign.views as views, campaign.clicks as clicks, event.type as type FROM campaign JOIN event ON campaign.id = event.campaign \u0026#34;\u0026#34;\u0026#34;).rdd .groupBy(row =\u0026gt; row.getAs[String](\u0026#34;id\u0026#34;)) .map{ case (id, rows) =\u0026gt; val views = rows.head.getAs[Long](\u0026#34;views\u0026#34;) val clicks = rows.head.getAs[Long](\u0026#34;clicks\u0026#34;) val res = rows.groupBy(row =\u0026gt; row.getAs[String](\u0026#34;type\u0026#34;)).mapValues(_.size) CampaignReport(id, views = views + res(\u0026#34;view\u0026#34;), clicks = clicks + res(\u0026#34;click\u0026#34;)) }.saveToCassandra(“keyspace”, “campaign_report”) With several lines of code, it\u0026rsquo;s possible to implement a naive Lambda design, which of course could be much more sophisticated, but this example shows just how easily this can be achieved.\nAlmost MapReduce: bringing processing closer to data #The Spark-Cassandra connector is data locality aware and reads the data from the closest node in a cluster, thus minimizing the amount of data transferred around the network. To fully facilitate the Spark-C* connector\u0026rsquo;s data locality awareness, Spark workers should be collocated with Cassandra nodes.\nAlongside Spark collocation with Cassandra, it makes sense to separate your operational (or write-heavy) cluster from the one for analytics:\nclusters can be scaled independently data is replicated by Cassandra, no extra work is needed analytics cluster has different Read/Write load patterns analytics cluster could contain additional data (e.g. 
dictionaries) and processing results Spark resource impact is limited to only one cluster. Let\u0026rsquo;s look at Spark application deployment options one more time: There are three main options available for cluster resource managers:\nSpark Standalone - Spark master and Workers are installed and executed as standalone applications (which obviously introduces some ops overhead and supports only static resource allocation per worker) YARN, which is really nice if you already have a Hadoop ecosystem Mesos, which from the beginning was designed for dynamic allocation of cluster resources, not only for running Hadoop applications but for handling heterogeneous workloads Mesos architecture # A Mesos cluster consists of Master nodes that are responsible for resource offers and scheduling, and Slave nodes which do the actual heavy lifting of task execution. In HA mode with multiple Masters, ZooKeeper is used for leader election and service discovery. Applications executed on Mesos are called Frameworks and utilize the API to handle resource offers and submit tasks to Mesos. Generally, the process of task execution consists of these steps:\nSlaves publish available resources to the Master the Master sends resource offers to Frameworks the Scheduler replies with tasks and resources needed per task the Master sends tasks to Slaves Bringing Spark, Mesos, and Cassandra together #As said before, Spark workers should be collocated with Cassandra nodes to enforce data locality awareness, thus lowering the amount of network traffic and the Cassandra cluster load. Here\u0026rsquo;s one of the possible deployment scenarios for how to achieve this with Mesos. 
Mesos Masters and ZooKeepers collocated Mesos Slaves and Cassandra nodes collocated to enforce better data locality for Spark Spark binaries deployed to all worker nodes and spark-env.sh is configured with proper master endpoints and executor jar location Spark Executor JAR uploaded to S3/HDFS With the provided setup, a Spark job can be submitted to the cluster with a simple spark-submit invocation from any worker node that has Spark binaries installed and the assembly jar containing the actual job logic uploaded:\nspark-submit --class io.datastrophic.SparkJob /etc/jobs/spark-jobs.jar There exist options to run Dockerized Spark so that there\u0026rsquo;s no need to distribute binaries across every single cluster node.\nScheduled long-running tasks #Every data processing system sooner or later faces the necessity of running two types of jobs: scheduled/periodic jobs like periodic batch aggregations, and long-running ones, which is the case for stream processing. The main requirement for both of these types is fault tolerance - jobs must continue running even in case of cluster node failures. The Mesos ecosystem comes with two great frameworks supporting each of these types of jobs.\nMarathon is a framework for fault-tolerant execution of long-running tasks supporting HA mode with ZooKeeper, able to run Docker, and having a nice REST API. Here\u0026rsquo;s an example of a simple job configuration running spark-submit as a shell command: Chronos has the same characteristics as Marathon but is designed for running scheduled jobs; in general, it is a distributed HA cron supporting graphs of jobs. Here\u0026rsquo;s an example of an S3 compaction job configuration which is implemented as a simple bash script: There are plenty of frameworks already available or under active development which aim to integrate widely used systems with Mesos resource management capabilities. 
Just to name some of them:\nHadoop Cassandra Kafka Myriad: YARN on Mesos Storm Samza Ingesting the data #So far so good: the storage layer is designed, resource management is set up, and jobs are configured. The only thing which is not there yet is the data to process. Assuming that incoming data will arrive at high rates, the endpoints which receive it should meet the following requirements:\nprovide high throughput/low latency be resilient allow easy scalability support back pressure Backpressure is not a must, but it would be nice to have this as an option to handle load spikes.\nAkka perfectly fits the requirements, and basically it was designed to provide this feature set. So what is Akka:\nan actor model implementation for the JVM message-based and asynchronous enforces no shared mutable state easily scalable from one process to a cluster of machines actors form hierarchies with parental supervision not only a concurrency framework: akka-http, akka-streams, and akka-persistence Here\u0026rsquo;s a simplified example of three actors which handle a JSON HttpRequest, parse it into a domain model case class, and save it to Cassandra:\nclass HttpActor extends Actor { def receive = { case req: HttpRequest =\u0026gt; system.actorOf(Props[JsonParserActor]) ! req.body case e: Event =\u0026gt; system.actorOf(Props[CassandraWriterActor]) ! e } } class JsonParserActor extends Actor { def receive = { case s: String =\u0026gt; Try(Json.parse(s).as[Event]) match { case Failure(ex) =\u0026gt; //error handling code case Success(event) =\u0026gt; sender ! 
event } } } class CassandraWriterActor extends Actor with ActorLogging { //for demo purposes, session initialized here val session = Cluster.builder() .addContactPoint(\u0026#34;cassandra.host\u0026#34;) .build() .connect() override def receive: Receive = { case event: Event =\u0026gt; val statement = new SimpleStatement(event.createQuery) .setConsistencyLevel(ConsistencyLevel.QUORUM) Try(session.execute(statement)) match { case Failure(ex) =\u0026gt; //error handling code case Success(_) =\u0026gt; sender ! WriteSuccessful } } } It looks like only several lines of code are needed to make everything work, but while writing raw data (events) to Cassandra with Akka is easy, there are a number of gotchas:\nCassandra is still designed for fast serving but not batch processing, so pre-aggregation of incoming data is needed computation time of aggregations/rollups will grow with the amount of data actors are not suitable for performing aggregation due to the stateless design model micro-batches could partially solve the problem some sort of reliable buffer for raw data is still needed Kafka as a buffer for incoming data # For keeping incoming data with some retention for further pre-aggregation/processing, some sort of distributed commit log could be used. In this case, consumers will read data in batches, process it, and store it into Cassandra in the form of pre-aggregates. 
Here\u0026rsquo;s an example of publishing JSON data coming over HTTP to Kafka with akka-http:\nval config = new ProducerConfig(KafkaConfig()) lazy val producer = new KafkaProducer[A, A](config) val topic = “raw_events” val routes: Route = { post{ decodeRequest{ entity(as[String]){ str =\u0026gt; JsonParser.parse(str).validate[Event] match { case s: JsSuccess[String] =\u0026gt; producer.send(new KeyedMessage(topic, str)) case e: JsError =\u0026gt; BadRequest -\u0026gt; JsError.toFlatJson(e).toString() } } } } } object AkkaHttpMicroservice extends App with Service { Http().bindAndHandle(routes, config.getString(\u0026#34;http.interface\u0026#34;), config.getInt(\u0026#34;http.port\u0026#34;)) } Consuming the data: Spark Streaming #While Akka could still be used for consuming stream data from Kafka, having Spark in your ecosystem brings Spark Streaming as an option to solve the problem:\nit supports a variety of data sources provides at-least-once semantics exactly-once semantics available with Kafka Direct and idempotent storage Consuming an event stream from Kinesis with Spark Streaming example:\nval ssc = new StreamingContext(conf, Seconds(10)) val kinesisStream = KinesisUtils.createStream(ssc,appName,streamName, endpointURL,regionName, InitialPositionInStream.LATEST, Duration(checkpointInterval), StorageLevel.MEMORY_ONLY) //transforming given stream to Event and saving to C* kinesisStream.map(JsonUtils.byteArrayToEvent) .saveToCassandra(keyspace, table) ssc.start() ssc.awaitTermination() Designing for Failure: Backups and Patching #Usually, this is the most boring part of any system, but it\u0026rsquo;s really important when there exists any possibility that the data which came into the system could be invalid or when the whole analytics datacenter crashes. So why not store the data in Kafka/Kinesis? At the moment of writing, Kinesis has only one day of retention, and without backups, in case of a failure, all processing results could be lost. 
While Kafka supports much larger retention periods, the cost of hardware ownership should be considered because, for example, S3 storage is way cheaper than multiple instances running Kafka, and the S3 SLAs are really good.\nApart from having backups, the restoring/patching strategies should be designed upfront and tested so that any problems with data could be quickly fixed. A programmer\u0026rsquo;s mistake in an aggregation job or duplicate data could break the accuracy of the computation results, so fixing the error shouldn\u0026rsquo;t be a big problem. One thing to make all these operations easier is to enforce idempotency in the data model so that multiple repetitions of the same operations produce the same results (e.g. an SQL update is an idempotent operation while a counter increment is not).\nHere is an example of a Spark job which reads an S3 backup and loads it into Cassandra:\nval sc = new SparkContext(conf) sc.textFile(s\u0026#34;s3n://bucket/2015/*/*.gz\u0026#34;) .map(s =\u0026gt; Try(JsonUtils.stringToEvent(s))) .filter(_.isSuccess).map(_.get) .saveToCassandra(config.keyspace, config.table) The Big picture #The high-level design of a data platform built with SMACK So what does the SMACK stack provide:\na concise toolbox for a wide variety of data processing scenarios battle-tested and widely used software with large communities easy scalability and replication of data while preserving low latencies unified cluster management for heterogeneous loads a single platform for any kind of application an implementation platform for different architecture designs (batch, streaming, Lambda, Kappa) fast time-to-market (e.g. 
for MVP verification) ","date":"16 September 2015","permalink":"https://datastrophic.io/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka/","section":"","summary":"This post is a follow-up to the talk given at the Big Data AW meetup in Stockholm and focuses on different use cases and design approaches for building scalable data processing platforms with the SMACK (Spark, Mesos, Akka, Cassandra, Kafka) stack. While the stack is really concise and consists of only a few components, it is possible to implement different system designs covering not only purely batch or stream processing, but more complex Lambda and Kappa architectures as well.","title":"Data processing platforms architectures with SMACK: Spark, Mesos, Akka, Cassandra and Kafka"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/kafka/","section":"Tags","summary":"","title":"Kafka"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/smack/","section":"Tags","summary":"","title":"SMACK"},{"content":"For some cases, such as the ones present in AdServing, counters come in really handy for accumulating totals for events coming into a system, compared to batch aggregates. While distributed counter consistency is a well-known problem, Cassandra counters in version 2.1 are claimed to be more accurate than the prior implementation. This post describes the approach and the results of Cassandra counters consistency testing in different failure scenarios such as rolling restarts, abnormal termination of nodes, and network splits.\nHere\u0026rsquo;s the initial blog post by DataStax describing the internals of the new Cassandra counters implementation. There exist really good counter consistency tests under network splits performed by Jepsen and published on Aphyr\u0026rsquo;s blog. While there is a wide variety of stress test tools, with counters it would be nice to verify their behavior in business-specific cases, emulating production workload. 
For this purpose, a simple Akka-based application has been implemented, allowing one to produce a high write rate while using the problem domain data model.\nThe Load Testing Tool #The main idea behind the tool is to provide highly concurrent writes with respect to the data model (which reflects a particular business case) and a simple feedback loop to verify the results. The feedback loop is implemented with a configurable number of retries to wait for eventually consistent writes to be propagated to all nodes.\nActor System Overview # Actor responsibilities and relationships # Coordinator controls interactions between the rest of the actors, waits for initialization, executes the broadcast load start command, etc. when all writes are sent to the cluster, the Coordinator executes the feedback loop to compare actual counter values with the expected result Generators are domain-aware random insert query generators that take into account the number of unique rows and counters per row to generate the full path to counters. The total number of counters thus equals the number of unique rows times the number of counters per row Processors are Actors controlling execution for a specific table and a sub-sum of the total counter value. The total number of processors equals number_of_entities * parallelism_factor. Each processor is responsible for providing writes for a totalSum/parallelism subtotal of the target total value for each counter. The number of processors affects the number of concurrent writes to the same counters. Workers are needed for generating increments and coordinating writes to generated keys. Each worker holds a subset of the Processor\u0026rsquo;s keys for a specific table, thus providing faster iterations over smaller insert batches. The number of writers influences the parallelism of writes to different keys, providing writes intermixed with other workers. 
But generally, it is a performance optimization for processing a smaller number of keys per write. CassandraWriter is a short-lived actor which dies after all work is done. These actors live on a dedicated dispatcher with a separate thread pool due to datastax-java-driver specifics: it becomes enormously slow on a shared thread pool for some reason. Data model #A replication factor of 3 will be used in tests against a 5-node C* cluster:\nCREATE KEYSPACE IF NOT EXISTS $keyspaceName WITH replication= {\u0026#39;class\u0026#39;:\u0026#39;SimpleStrategy\u0026#39;, \u0026#39;replication_factor\u0026#39;:3}; During the tests, the data will be written independently representing 3 different entities (e.g. views, clicks, and campaigns). Each of these entities looks as follows:\nCREATE TABLE IF NOT EXISTS $keyspaceName.$name ( id uuid, u ascii, e ascii, d int, c counter, PRIMARY KEY((id, u, e), d)); But for one of the entities, two column families will be created to reflect the case with different read paths (same fields, different row keys).\nTest Setup #Cassandra cluster configuration #Tests are performed against a 5-node Cassandra 2.1.9 cluster. Each node has 32Gb of RAM and a 4-core, 8-thread 3.70GHz Xeon CPU.\nCassandra heap size: 8Gb\nTest configuration #Here\u0026rsquo;s an example of a tool invocation:\njava -jar counters-test.jar -h 10.120.0.252 -k test_keyspace -p 10 -w 100 -s 50 -r 1000 -c 100 Details:\ncounter total value 50 parallelism 10 100 workers, keys per worker 1000 (sequentially fed to async writes) 1000 rows 100 counters per row The tool is configured to write +1 increments which results in 5M total writes to every table (views, clicks, 2x campaign). Write throughput after warmup fluctuates around 30K writes/sec. 
Generally, network partitions and rolling restarts are implemented with Siafu, a Fabric-based Python tool.\nCassandra client configuration #The client is configured with:\nimplicit val cluster = Cluster.builder() .addContactPoint(config.clusterNode) .withQueryOptions(new QueryOptions() .setConsistencyLevel(ConsistencyLevel.QUORUM) ).build() All writes are executed in sync mode, i.e.:\nTry(msg.session.execute(statement)) match { case Failure(ex) =\u0026gt; log.error(ex, \u0026#34;ERROR DURING WRITE\u0026#34;) case _ =\u0026gt; } The Test #Scenarios Overview and Results Interpretation #Counter consistency is going to be evaluated in the following scenarios while writes are executed against the cluster:\nnode normally stops node normally restarts cluster rolling restart node dies (kill -9) node dies, starts again and nodetool repair is executed node is lost due to network split and then comes back Interpretation of results #Results for each entity contain the following set of fields:\novercounts - the total number of counters holding a value greater than expected maxOverCount - maximum overcount value for a single counter totalOverCount - the total number of queries with overcount undercounts - the total number of counters holding a value less than expected maxUnderCount - maximum undercount value for a single counter totalUnderCount - the total number of queries with undercount Load stats #Here\u0026rsquo;s an example of throughput stats for the test being executed without incidents: The spike in the beginning is due to several minutes of warmup done before the actual writes.\nAnd here\u0026rsquo;s an example of throughput deviation while a node fails and comes back: One really important thing during interpretation of the test results is that the test is pretty synthetic and the number of failed under- or overcounts will vary depending on the throughput and the number of failures per test. 
But for the initial evaluation of C* counters behavior in the most common failure scenarios, this should be enough to grasp the whole picture.\nThe Results #Scenario: a normal node stop #DEVIATION STATS: overcounts: 4 maxOverCount: 1 totalOverCount: 4 undercounts: 0 maxUnderCount: 0 totalUnderCount: 0 DEVIATION STATS: overcounts: 1 maxOverCount: 1 totalOverCount: 1 undercounts: 0 maxUnderCount: 0 totalUnderCount: 0 DEVIATION STATS: overcounts: 2 maxOverCount: 1 totalOverCount: 2 undercounts: 0 maxUnderCount: 0 totalUnderCount: 0 Scenario: a normal node restart #DEVIATION STATS: overcounts: 4 maxOverCount: 1 totalOverCount: 4 undercounts: 11 maxUnderCount: 1 totalUnderCount: 11 DEVIATION STATS: overcounts: 1 maxOverCount: 1 totalOverCount: 1 undercounts: 2 maxUnderCount: 1 totalUnderCount: 2 DEVIATION STATS: overcounts: 1 maxOverCount: 1 totalOverCount: 1 undercounts: 6 maxUnderCount: 1 totalUnderCount: 6 Scenario: cluster rolling restart #DEVIATION STATS: overcounts: 8 maxOverCount: 1 totalOverCount: 8 undercounts: 22 maxUnderCount: 1 totalUnderCount: 22 DEVIATION STATS: overcounts: 15 maxOverCount: 1 totalOverCount: 15 undercounts: 20 maxUnderCount: 1 totalUnderCount: 20 DEVIATION STATS: overcounts: 21 maxOverCount: 2 totalOverCount: 22 undercounts: 30 maxUnderCount: 1 totalUnderCount: 30 Normal termination operations conclusion #During normal operations like service stops, restarts, and cluster rolling restarts, the fraction of failed counters did not exceed 0.000007%, which is an acceptable deviation for most cases. 
The results may vary depending on load/throughput, but the fact is that even during normal cluster operations some undercounts (and, what\u0026rsquo;s even worse, overcounts) are registered.\nScenario: node dies (kill -9) #DEVIATION STATS: overcounts: 0 maxOverCount: 0 totalOverCount: 0 undercounts: 16 maxUnderCount: 1 totalUnderCount: 16 DEVIATION STATS: overcounts: 1 maxOverCount: 1 totalOverCount: 1 undercounts: 50 maxUnderCount: 2 totalUnderCount: 52 DEVIATION STATS: overcounts: 0 maxOverCount: 0 totalOverCount: 0 undercounts: 28 maxUnderCount: 1 totalUnderCount: 28 Scenario: network partition #DEVIATION STATS: overcounts: 0 maxOverCount: 0 totalOverCount: 0 undercounts: 13 maxUnderCount: 1 totalUnderCount: 13 DEVIATION STATS: overcounts: 0 maxOverCount: 0 totalOverCount: 0 undercounts: 55 maxUnderCount: 1 totalUnderCount: 55 DEVIATION STATS: overcounts: 0 maxOverCount: 0 totalOverCount: 0 undercounts: 57 maxUnderCount: 1 totalUnderCount: 57 Scenario: 2 new nodes are added to a 5-node cluster while writing #No over- or under-counts registered\nScenario: node dies, starts again and nodetool repair is executed #DEVIATION STATS: overcounts: 1 maxOverCount: 1 totalOverCount: 1 undercounts: 7019 maxUnderCount: 4 totalUnderCount: 8607 DEVIATION STATS: overcounts: 0 maxOverCount: 0 totalOverCount: 0 undercounts: 4146 maxUnderCount: 4 totalUnderCount: 5264 DEVIATION STATS: overcounts: 1 maxOverCount: 1 totalOverCount: 1 undercounts: 4588 maxUnderCount: 4 totalUnderCount: 6027 Abnormal termination observations #All cases except abnormal termination with a subsequent restart (with or without repair) behaved much the same as normal termination, while a node coming back to the cluster after a failure showed major undercounts.\nThe interesting part here is that with the consistency level configured to QUORUM there\u0026rsquo;s a repeatable exception appearing in the logs: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the 
write)\n2015-08-31 15:02:07,336 ERROR [akka://counter-consistency/user/$b/$m/$S/$b] - ERROR DURING WRITE com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write) at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:54) ~[cassandra-tools.jar:1.0] at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:269) ~[cassandra-tools.jar:1.0] at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:183) ~[cassandra-tools.jar:1.0] at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52) ~[cassandra-tools.jar:1.0] ... This appears to be related to the [CASSANDRA-10041] Jira issue (\u0026ldquo;timeout during write query at consistency ONE\u0026rdquo; when updating counter at consistency QUORUM and 2 of 3 nodes alive), which was unresolved at the time these tests were performed.\nIn the node death-then-restart scenario, the observed behavior is that one can\u0026rsquo;t just restart the failed node. This is somewhat surprising, because node failures do happen, and having to re-bootstrap the node by wiping out all its data is cumbersome. But it looks like a reasonable tradeoff for the given counter functionality. 
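The timeout exception above also hints at one plausible source of the overcounts observed earlier (this mechanism was not separately verified in this test setup): a write timeout only means the acknowledgement was lost, so the increment may still have been applied, and counter increments are not idempotent. A self-contained Scala simulation of this failure mode, with no Cassandra involved and all names hypothetical:

```scala
import scala.util.{Failure, Success, Try}

// One replica's counter state, shared between the original attempt and the retry.
final class Replica { var count: Long = 0L }

// The replica applies the increment, but the ack is lost: the client sees a timeout.
def incrementWithLostAck(r: Replica): Try[Unit] = {
  r.count += 1
  Failure(new RuntimeException("write timeout: ack lost"))
}

// A naive client retries on failure: safe for idempotent writes, not for counters.
def incrementWithRetry(r: Replica): Long = {
  incrementWithLostAck(r) match {
    case Failure(_) => r.count += 1 // the retry applies the increment a second time
    case Success(_) => ()
  }
  r.count // one logical increment, two applied increments
}
```

This is the standard argument for why blindly retrying timed-out counter writes is risky, while retrying idempotent regular writes is generally fine.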
The next test scenario checks the re-bootstrapping approach.\nScenario: node killed, then rebootstrapped #Actions performed:\none of the nodes is killed (kill -9) the dead node is removed with nodetool removenode \u0026lt;node uid\u0026gt; data dirs are wiped out on the failed node\u0026rsquo;s machine JVM_OPTS=\u0026quot;$JVM_OPTS -Dcassandra.replace_address=\u0026lt;node self address\u0026gt;\u0026quot; is added to cassandra-env.sh on the dead node the node is started The results:\nDEVIATION STATS: overcounts: 2 maxOverCount: 1 totalOverCount: 2 undercounts: 69 maxUnderCount: 2 totalUnderCount: 70 DEVIATION STATS: overcounts: 1 maxOverCount: 1 totalOverCount: 1 undercounts: 50 maxUnderCount: 1 totalUnderCount: 50 DEVIATION STATS: overcounts: 0 maxOverCount: 0 totalOverCount: 0 undercounts: 0 maxUnderCount: 0 totalUnderCount: 0 After the total wipeout of the commitlog and data directories, deviations are of the same magnitude as in the other scenarios.\nConclusion #The results are pretty clear: Cassandra 2.1 counters are not totally consistent. But. The deviation and the number of undercounts are really small relative to the total number of writes per test (5M), while the number of overcounts is even smaller (not exceeding 10 per test). One thing to mention here again is that results randomly vary depending on which node is restarted, due to row key distribution across the nodes and write throughput. One still can\u0026rsquo;t use the counters in banking, but the number of failed counters is quite acceptable in other cases where strict accuracy is not a mandatory requirement (e.g., ad serving). Combined with some sort of aggregates as part of a Lambda Architecture, Cassandra counters can be used without big risks. But there are still two open questions after the tests:\nwhy are some counters inconsistent in the case of normal operations like stop/restart? 
This basically means that any Cassandra operation which needs a service restart will lose some increments. And why can\u0026rsquo;t one simply restart a failed node without major undercounts? Counter operations are completely different from the regular ones in Cassandra, and nodetool repair simply doesn\u0026rsquo;t work as expected. ","date":"3 September 2015","permalink":"https://datastrophic.io/evaluating-cassandra-2-1-counters-consistency/","section":"","summary":"For some cases such as the ones present in AdServing, the counters come really handy to accumulate totals for events coming into a system compared to batch aggregates. While distributed counters consistency is a well-known problem Cassandra counters in version 2.1 are claimed to be more accurate compared to the prior ones. This post describes the approach and the results of Cassandra counters consistency testing in different failure scenarios such as rolling restarts, abnormal termination of nodes, and network splits.","title":"Cassandra 2.1 Counters: Testing Consistency During Node Failures"},{"content":"The Scala Days Amsterdam conference was full of interesting topics, so in this post I\u0026rsquo;ll cover talks on the Scala platform, core concepts for making Scala code more idiomatic, monad transformers, consistency in distributed systems, distributed domain-driven design, and a little more.\nThis post came out of the post-conference presentation to my team, so the slides are also available here and contain all the links to related materials and presentations so you can discover more on your own.\nMain themes #So here are the main themes/areas discussed at Scala Days:\nReactive Applications - a big topic around non-blocking design with reference to Reactive Streams and Akka Big Data/Fast Data - mainly focused on frameworks and architectures like Spark, GraphX, Kafka Functional Programming in Scala - a topic on functional programming aspects in Scala, mainly covering Scalaz use cases and 
how-to\u0026rsquo;s Scala in practice - notes from production projects experience Distributed Systems - a broad topic focused on building systems with inter-node communication like Akka cluster or cluster management with Mesos Scala keynote #The talk is available at Parleys, here\u0026rsquo;re some highlights:\nA reactive platform has started to form (like JEE). If you check out the Typesafe website, there\u0026rsquo;re a lot of frameworks forming a full stack for application development: DB access with Slick, concurrency/distributed interaction with Akka, Play as a web framework. Scala-specific platform with TASTY (Serialized Typed Abstract Syntax Trees) DOTC compiler (was close to completion at the moment of the talk) Type unions (U\u0026amp;T and U|T) - instead of with mixin trait composition Implicits that compose (implicit function types) Moving towards more pure functional programming TASTY #Java code compilation is bound to the JVM, and the Scala compiler adds Scala-specific signatures to class files, which results in larger class files and lower effectiveness.\nAnother problem that Scala faces is its dependency on the JDK/JVM and its need to adapt to version changes (e.g. migrating from Java 7 to Java 8). Besides, the JVM is not the only target platform, so that\u0026rsquo;s where TASTY comes into play. The main goal is to remove this dependency on specific platform versions. Plus some internal optimizations of the code representation aimed at making code faster.\nDOTC #The long-term plan for the Scala platform is to switch to the new Scala-specific DOTC compiler (based on the Dependent Object Types calculus) working in conjunction with TASTY. At the moment of writing, the compiler was close to completion; for more details about DOTC here\u0026rsquo;s the link to the original paper Dependent Object Types: Towards a foundation for Scala’s type system. 
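To put the type-unions keynote item in context, here is a small Scala sketch (the traits and types are hypothetical examples) of the with-based mixin composition used today, which intersection types (U&T) generalize, alongside the explicit sum type one writes today in place of a true union (U|T):

```scala
// Two small capability traits composed via mixins.
trait HasId { def id: String }
trait Timestamped { def ts: Long }

// An "intersection" today is spelled with `with`: the argument must be both.
def describe(e: HasId with Timestamped): String = s"${e.id}@${e.ts}"

final case class Event(id: String, ts: Long) extends HasId with Timestamped

// A union (U | T) has no direct encoding yet; the closest is an explicit sum type.
sealed trait IntOrString
final case class AnInt(i: Int) extends IntOrString
final case class AString(s: String) extends IntOrString
```

With first-class union types, the IntOrString wrapper classes would become unnecessary.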
Life beyond the illusion of the present #A philosophical talk by Jonas Bonér focused on consistency in distributed systems. One of the main ideas was that data shouldn\u0026rsquo;t actually be updated or deleted, it can only be created and read (CRUD). There\u0026rsquo;s still a number of problems in achieving consistency in distributed systems, but there\u0026rsquo;s also a number of approaches to managing them, like vector clocks, consistency as logical monotonicity (CALM), and commutative replicated data types (CRDTs). Here\u0026rsquo;s a brief description of each:\nVector clocks is an algorithm for generating a partial ordering of events in a distributed system and detecting causality violations\nCALM informally:\na block of code is logically monotonic if it satisfies a simple property: adding things to the input can only increase the output. By contrast, non-monotonic code may need to “retract” a previous output if more is added to its input.\nConflict-free replicated data type:\nis a type of specially designed data structure used to achieve strong eventual consistency and monotonicity (absence of rollbacks). There are two alternative routes to ensuring strong eventual consistency: operation-based CRDTs and state-based CRDTs. The two alternatives are equivalent, as one can emulate the other, but operation-based CRDTs require additional guarantees from the communication middleware\nHere\u0026rsquo;s a link to the original CRDT paper and here\u0026rsquo;s a really good post on CRDTs by Jordan Rule\nLambda Architecture with Spark Streaming, Kafka, Cassandra #It was mainly a pitch talk about Kafka-Spark-Cassandra-based architectures (link to Parleys). What\u0026rsquo;s interesting here is that it looks like a common pattern: using Spark Streaming as an intermediate between Kafka and Cassandra for data ingestion. 
A lot of talks have arisen lately on the subject of moving from Hadoop to Kafka-Spark-Cassandra-based architectures.\nAs for the talk itself, it mainly described these systems and provided some code samples. A reference application (killrweather) using all these technologies is available on GitHub as well. Helena Edelson demonstrated \u0026ldquo;her nerdy chart\u0026rdquo; with different architecture design strategies and technologies suitable for them:\nAnother interesting part was that DataStax is developing a Cassandra Streaming driver which uses Cassandra as a source and is supposed to be used with Spark.\nHow to unsuck your Options in Futures #Link to the video. Very often you need to unwrap the option value contained in a Future or any other container. This involves writing cumbersome code for extracting the actual value, and this cumbersome code grows linearly. But the Scalaz library provides a set of monads called \u0026ldquo;monad transformers\u0026rdquo; which allow you to unwrap values one level deep and perform operations on them, resulting in a single container of the expected type in the end.\ndef f1: Future[Option[Int]] = ??? def f2: Future[Option[Int]] = ??? def f3: Future[Option[Int]] = ??? val result = for{ a \u0026lt;- OptionT(f1) b \u0026lt;- OptionT(f2) c \u0026lt;- OptionT(f3) } yield a + b + c val finOption: Future[Option[Int]] = result.run Let\u0026rsquo;s imagine you have different containers and want to perform some operation on their values. In this case, you need to \u0026ldquo;cast\u0026rdquo; (wrap) the given objects to the closest one and then apply the monad transformer.\ndef f1: Future[String \\/ Int] = ??? def f2: Option[Int] = ??? def f3: Future[Int] = ??? 
val result = for{ a \u0026lt;- EitherT(f1) b \u0026lt;- EitherT(Future(f2 \\/\u0026gt; \u0026#34;B is missing\u0026#34;)) c \u0026lt;- EitherT(f3.map(v =\u0026gt; \\/.right(v))) } yield a + b + c val finEither: Future[String \\/ Int] = result.run And here\u0026rsquo;s how it could look if we implemented it ourselves:\ncase class FutureOption[A](contents: Future[Option[A]]){ def flatMap[B](fn: A =\u0026gt; FutureOption[B]) = FutureOption{ contents.flatMap{ case Some(value) =\u0026gt; fn(value).contents case None =\u0026gt; Future.successful(None) } } def map[B](fn: A =\u0026gt; B) = FutureOption{ contents.map{ option =\u0026gt; option.map(fn) } } } So why is all this needed? For-comprehensions are a really awesome feature which allows writing really concise and readable code. For-comprehensions basically rely on two operations that should be implemented by the classes used in them: map and flatMap.\nSo here we get closer to the Monad. Category Theory aside, a monad can be thought of as a type class with the methods map, flatMap, and create (bind and unit). And to satisfy the for-comprehension contract with our Future-of-Option monad transformer, we need to implement map and flatMap.\nScalaz provides a number of useful monad transformers out of the box. But if you need anything special, you can generalize the given code a bit and use the Monad trait from Scalaz to abstract a bit more.\nUnderstanding the backend of Big Data #The talk was generally about four main topics:\nthe complexity of building distributed systems and how Functional Programming helps to reduce it. Basically, FP is good at transforming immutable data and forces immutable data structures, leading to fewer errors at runtime and providing some sort of compile-time \u0026ldquo;correctness\u0026rdquo; checks upfront\ncluster management is hard: with static resource allocation, cluster resources are utilized at only about 5-10% nowadays because of over-provisioning for handling load spikes. 
Mixed workloads are the most complex ones to estimate resources for.\nso here comes Mesos, which provides dynamic resource allocation and management. The main components of the ecosystem covered in the talk were Mesos itself, Marathon for scheduling long-running tasks, and Myriad which is the Mesos-YARN bridge\nfinally, DCOS GA was released a couple of hours before the talk and it looks really promising, providing a central DC management UI and console allowing easy installation of Mesos frameworks. It\u0026rsquo;s available for free for Amazon and has a trial version for commodity hardware. It\u0026rsquo;s worth checking out.\nEssential Scala #A really impressive talk by Noel Welsh providing several really simple guidelines to make Scala code better structured and more idiomatic, which is crucial when someone has just started using Scala or needs to train a team moving to it. So basically six core concepts form the learning curve of Scala:\nExpressions, types, \u0026amp; values Objects and classes Algebraic data types Structural recursion Sequencing computation Type classes Topics covered in the talk were ADT, Structural recursion, and Sequencing computation\nAlgebraic Data Types # Goal: translate data description into code Model data with logical ORs and ANDs Two patterns: product types (ANDs) and sum types (ORs) Sum and product together make algebraic data types //product type: A has a B and C final case class A(b: B, c: C) //sum type: A is a B or C sealed trait A final case class B() extends A final case class C() extends A //Example: a Calculation is either successful with a value or failed with an error sealed trait Calculation final case class Success(value: Int) extends Calculation final case class Failure(msg: String) extends Calculation Structural recursion # Goal: transform algebraic data type Two patterns here: pattern-matching \u0026amp; polymorphism Structural induction is a proof method that is used in mathematical logic, computer science, graph theory, 
and some other mathematical fields. It is a generalization of mathematical induction over natural numbers and can be further generalized to arbitrary Noetherian induction. Structural recursion is a recursion method bearing the same relationship to structural induction as ordinary recursion bears to ordinary mathematical induction.\n//pattern matching sealed trait A { def doSomething: H = { this match { case B(d, e) =\u0026gt; doB(d, e) case C(f, g) =\u0026gt; doC(f, g) } } } final case class B(d: D, e: E) extends A final case class C(f: F, g: G) extends A //polymorphism sealed trait A { def doSomething: H } final case class B(d: D, e: E) extends A { def doSomething: H = doB(d, e) } final case class C(f: F, g: G) extends A { def doSomething: H = doC(f, g) } The processing of algebraic data types follows immediately from the structure of the data Can choose between pattern matching and polymorphism Pattern matching (within the base trait) is usually preferred Sequencing Computations # Goal: patterns for sequencing computations Functional programming is about transforming values \u0026hellip; without introducing side-effects A =\u0026gt; B =\u0026gt; C Three patterns: fold, map, and flatMap Fold is an abstraction over structural recursion So here\u0026rsquo;s how we\u0026rsquo;d go about our own fold implementation:\n//initial version sealed trait A { def doSomething: H = { this match { case B(d, e) =\u0026gt; doB(d, e) case C(f, g) =\u0026gt; doC(f, g) } } } final case class B(d: D, e: E) extends A final case class C(f: F, g: G) extends A //first refactoring sealed trait A { def fold(doB: (D, E) =\u0026gt; H, doC: (F, G) =\u0026gt; H): H = { this match { case B(d, e) =\u0026gt; doB(d, e) case C(f, g) =\u0026gt; doC(f, g) } } } final case class B(d: D, e: E) extends A final case class C(f: F, g: G) extends A //final result sealed trait Result[A] { def fold[B](s: A =\u0026gt; B, f: B): B = this match { case Success(v) =\u0026gt; s(v) case Failure() =\u0026gt; f } } final case class Success[A](value: A) extends Result[A] final case class Failure[A]() extends Result[A] fold is a generic transform for any algebraic data type, but it\u0026rsquo;s not always the best choice: not all data is an algebraic data type, and there\u0026rsquo;re other methods easier to use (map and flatMap) So the main conclusions of the talk are that Scala is, in fact, simple, that the 3 described patterns cover 90% of code (and 4 cover 99%), and that program design in Scala is systematic.\nA Purely Functional Approach to Building Large Applications #It was a really monadic talk about Scalaz usage in real projects, showing how to properly abstract the logic with monads and wire functions together.\nReader monad: in simple words, it\u0026rsquo;s a wrapper providing context for functions that need it. Generally speaking, the scope where such dependency injection is needed is function scope, so here comes the reader that provides the context to the function. ReaderT is a monad transformer like we\u0026rsquo;ve seen before which unwraps the values which sit in some monad container like Option or Future Kleisli arrow: the base abstraction behind ReaderT, really handy when it comes to the composition of monadic functions; it\u0026rsquo;s simply a wrapper for a function of type A =\u0026gt; F[B]. 
Task monad: a substitute for the Scala Future, which differs in that it occupies a thread only at the moment of materialization, allowing transformations to be performed lazily and lowering the amount of context switching Easy Scalable Akka Applications #The talk was focused on distributed domain-driven design and its implementation with Akka.\nTwo main approaches discussed were CQRS and Event Sourcing:\nCommand Query Responsibility Segregation\nCQRS at its heart is the notion that you can use a different model to update information than the model you use to read information.\nEvent Sourcing\nEvent Sourcing ensures that all changes to the application state are stored as a sequence of events. Not just can we query these events, we can also use the event log to reconstruct past states, and as a foundation to automatically adjust the state to cope with retroactive changes.\nA couple of words on distributed domain-driven design. Some examples of a Non-Distributed Domain:\nBasic CRUD (Create, Read, Update, Delete) Writes and reads to same database (ACID) Scaled via multiple identical replicas Bottlenecks on contention (reads and writes interfere with each other) With Akka and DDDD:\nWouldn\u0026rsquo;t it be great if you could just keep your domain instances in memory? But how to recover from their volatile nature: an event journal! So the main idea here is to use Actors to represent domain entities (e.g. bank accounts) and store the state mutations as a commit log with periodic snapshots, which is achieved with the akka-persistence module. Cassandra was used for storing the journal, so in total the system looks really interesting. CRUD and CQRS were compared and tested with the Gatling stress tool, with results available in the Boldradius blog post.\nSome Other Cool Talks # Spores: type-safe function serialization for remote method invocation. Imagine actors passing functions instead of functions+data, immutable data enforced Project Gålbma: Actors vs. 
Types - so generally the main goal of this project is to provide typed ActorRefs, in the future remove the Actor trait, and switch to a purer actor model enforcing message-passing behavior The talk about using Finagle at SoundCloud was pretty interesting. Finagle is a high-performance RPC server supporting various protocols, built on Netty, and enforcing functional programming in its API. It is really nice and easy to use for building distributed systems. The Scala.js talk was pretty interesting in terms of compiler implementation details, and it seems that it is mature enough. The Reactive Streams topic was pretty interesting, especially in terms of back pressure and constructing the data flow model. Wrapping Up # The Scala language is becoming more and more stable in terms of API changes and is moving towards improving performance, being more functionally pure, and growing its own platform Scala is widely used as an implementation language in major and emerging distributed data processing/computing frameworks (Akka, Spark, Crunch) Scalaz is used in a wide variety of projects for proper abstractions or just for making code more concise and reusable (\\/, monad transformers) There\u0026rsquo;re a lot of Scala libraries/frameworks forming a mature ecosystem (e.g. 
Akka and Play) which can be used for building large distributed applications. More awesome slides and videos from Scala Days Amsterdam are available at Parleys!\n","date":"1 July 2015","permalink":"https://datastrophic.io/in-the-wake-of-scala-days-2015/","section":"","summary":"Scala Days Amsterdam conference was full of interesting topics so in this post I\u0026rsquo;ll cover talks on the Scala platform, core concepts for making Scala code more idiomatic, monad transformers, consistency in distributed systems, distributed domain driven design, and a little more.","title":"In the Wake of Scala Days 2015"},{"content":"","date":null,"permalink":"https://datastrophic.io/tags/scala/","section":"Tags","summary":"","title":"Scala"},{"content":" Anton Kirillov Technical Leader. Compute and AI Infrastructure Hi, I\u0026rsquo;m Anton — welcome to datastrophic.io!\nI\u0026rsquo;m a technical leader and software engineer specializing in distributed systems, container orchestration platforms, and AI infrastructure. I\u0026rsquo;m focusing on performance optimizations, workload scheduling and federation, and resource management in large-scale multi-cluster environments.\nWhat does \u0026ldquo;datastrophic\u0026rdquo; mean? datastrophic (adj.) - relating to or describing a critical platform issue that results in severe data loss or damage. The name comes from my work circa 2015 helping startups design data platforms. We frequently discussed critical issues — data loss, consistency, delivery semantics, idempotent writes, backup strategies — and their business impact. When SLA breaches or irrecoverable data loss occurred, the consequences were often catastrophic. Hence, \u0026ldquo;datastrophic.\u0026rdquo;\nDisclaimer #All the information in this blog is the author\u0026rsquo;s personal opinion and does not represent the views or position of any person, company, or organization. 
None of the content is sponsored, and all the information presented in the blog is for educational purposes only.\nCredits # The website is created with Hugo and Congo. The main page image is based on The Black Wing Computer Lab drawing by Gleaming Scythe Publishing. ","date":"1 January 0001","permalink":"https://datastrophic.io/about/","section":"","summary":"","title":""},{"content":"","date":null,"permalink":"https://datastrophic.io/categories/","section":"Categories","summary":"","title":"Categories"}]