Kubernetes for Data Engineers: A Beginner’s Guide
This tutorial will introduce you to the core concepts of containers, Kubernetes, and Helm, with a focus on local
development and a simple data processing script.
Prerequisites
- Docker installed on your local machine
- minikube installed for local Kubernetes cluster
- kubectl installed for interacting with Kubernetes
- Helm installed for package management
So what are Kubernetes, Minikube, kubectl, and Helm?
- Kubernetes (k8s): An open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications.
- Minikube: A tool that runs a single-node Kubernetes cluster inside a virtual machine on your local machine.
- Kubectl: The command-line tool for interacting with Kubernetes clusters.
- Helm: A package manager for Kubernetes that helps you define, install, and upgrade applications on your cluster.
This tutorial intentionally leaves a few gaps for you to fill in as you learn, but it is a great starting point for understanding the basics of Kubernetes and Helm.
1: Understanding Containers
Containers are lightweight, standalone executable packages that include everything needed to run a piece of software, including the code, runtime, system tools, libraries, and settings. They are isolated from each other and share the same operating system kernel, making them more efficient than virtual machines.
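If you have Docker installed, you can see this in action with a single throwaway container. The command below (a minimal example, using the same Python base image we build on later) runs one statement inside a container and removes the container on exit:
docker run --rm python:3.9-slim python -c "print('Hello from inside a container')"
The Python interpreter that runs here lives entirely inside the image, not on your host.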
1. Create a Python script named data_processor.py:
import time


def process_data() -> None:
    """
    This function simulates data processing.
    """
    print("Starting data processing...")
    time.sleep(5)
    print("Data processing completed!")


if __name__ == "__main__":
    while True:
        process_data()
        time.sleep(10)
This script does not do much, but it is enough to demonstrate how to run a containerized application.
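Before containerizing it, you can sanity-check the script directly on your host with any Python 3 interpreter:
python data_processor.py
You should see "Starting data processing..." followed five seconds later by "Data processing completed!", repeating every ten seconds until you stop it with `Ctrl+C`.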
2. Create a Dockerfile:
FROM python:3.9-slim
WORKDIR /app
COPY data_processor.py .
CMD ["python", "data_processor.py"]
In this configuration, we are using the official Python 3.9 slim image as the base image. We copy the data_processor.py script into the /app directory and set it as the default command to run when the container starts.
The relevant `docker run` flags are as follows (a combined example follows the list):
- --rm: Automatically remove the container when it exits
- -t: Allocate a pseudo-TTY
- -d: Run the container in the background (detached)
- -p: Publish a container's port(s) to the host
- --name: Assign a name to the container
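Our script does not listen on any port, so -d and -p are not needed for this tutorial, but as a hypothetical example of how these flags combine (port 8080 is arbitrary):
docker run --rm -d -p 8080:8080 --name data-processor data-processor
docker logs -f data-processor
docker stop data-processor
The second command follows the detached container's output; the third stops it and, because of --rm, removes it.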
3. Build the Docker image:
docker build -t data-processor .
4. Run the container locally:
We will run the container in the foreground to see the output from the data_processor.py script.
docker run --rm -t data-processor
You should see the output from the data_processor.py script in your terminal. We can stop the container by pressing `Ctrl+C`.
2: Introduction to Kubernetes
Kubernetes is a container orchestration platform that automates the deployment, scaling, and management of containerized applications. It is designed to run distributed systems at scale, making it ideal for cloud-native applications.
We will use minikube to set up a local Kubernetes cluster for development purposes. Minikube is a tool that runs a single-node Kubernetes cluster inside a virtual machine (or container) on your local machine. It is a great way to get started with Kubernetes without needing a cloud provider. (Note that this is not the same as bare-metal Kubernetes, which refers to clusters running directly on physical servers; minikube runs inside a VM or container on your machine.)
1. Start your local Kubernetes cluster:
We will start minikube and load our newly built image into it so the cluster can access the image locally (we verify the load after the command list below).
minikube start
minikube image load data-processor:latest
Other minikube commands include:
- minikube status: Get the status of the local Kubernetes cluster
- minikube stop: Stop the local Kubernetes cluster
- minikube delete: Delete the local Kubernetes cluster
- minikube dashboard: Open the Kubernetes dashboard in a web browser
- minikube service: Access a service running in the cluster
- minikube ip: Get the IP address of the minikube VM
- minikube logs: Get the logs of the minikube VM
- minikube ssh: SSH into the minikube VM
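To verify that the image load succeeded, list the images available inside the cluster:
minikube image ls | grep data-processor
You should see the data-processor image in the output.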
2. Create a Kubernetes deployment YAML file named data-processor-deployment.yaml:
The deployment file specifies the desired state for the data-processor application, including the container image, ports, and replicas. We use YAML to define the configuration, and we can apply it to the Kubernetes cluster using kubectl.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-processor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: data-processor
  template:
    metadata:
      labels:
        app: data-processor
    spec:
      containers:
        - name: data-processor
          image: data-processor:latest
          imagePullPolicy: IfNotPresent
This config defines a deployment with one replica of the data-processor container image. In a production environment, you would typically push the image to a container registry.
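As a sketch of what that would look like (registry.example.com is a placeholder for your actual registry host, and you would update the deployment's image field to match):
docker tag data-processor:latest registry.example.com/data-processor:1.0
docker push registry.example.com/data-processor:1.0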
3. Apply the deployment:
Kubectl is the command-line tool for interacting with Kubernetes clusters. We use it to apply the deployment configuration to the cluster.
Kubectl commands include:
- kubectl apply: Apply a configuration to a resource by filename or stdin
- kubectl get: Display one or many resources
- kubectl describe: Show details of a specific resource or group of resources
- kubectl logs: Print the logs for a container in a pod
- kubectl exec: Execute a command in a container
- kubectl delete: Delete resources by filenames, stdin, resources, and names, or by resources and label selector
kubectl apply -f data-processor-deployment.yaml
4. Check the status of your deployment:
kubectl get deployments
kubectl get pods
You should see the data-processor deployment and pod running in the cluster.
5. View the logs of your running pod:
kubectl logs data-processor-<pod-id>
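If you would rather not look up the generated pod ID, you can select the pod by its label or go through the deployment; either of these follows the same logs for our single-replica deployment:
kubectl logs -l app=data-processor -f
kubectl logs deployment/data-processor -f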
3: Introduction to Helm
Helm is a package manager for Kubernetes that simplifies the deployment and management of applications. We use it by defining a chart, which is a collection of files that describe a set of Kubernetes resources.
We can get charts from the Helm Hub or create our own, and we can install, upgrade, and delete them using the Helm command-line tool.
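For example, adding a public chart repository and searching it looks like this (Bitnami's repository is just one common choice):
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo bitnami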
1. Create a new Helm chart:
helm create data-processor-chart
This command generates a directory containing several files; each serves the following purpose:
- Chart.yaml: This file contains metadata about the chart, such as its name, version, description, and any dependencies.
- values.yaml: This file defines the default configuration values for the chart. It allows you to customize the behavior of your chart without modifying the template files.
- templates/: This directory contains the template files for Kubernetes resources. These templates use Go templating syntax to generate the final Kubernetes manifests.
- templates/NOTES.txt: This file contains plain text that gets printed out after the chart is successfully deployed. It's typically used to display usage notes, next steps, or additional information about the deployment.
- templates/deployment.yaml: This template defines the Kubernetes Deployment resource for your application.
- templates/service.yaml: This template defines the Kubernetes Service resource to expose your application.
- templates/serviceaccount.yaml: This template creates a Kubernetes ServiceAccount for your application, if needed.
- templates/hpa.yaml: This template defines a Horizontal Pod Autoscaler for automatically scaling your application based on resource usage.
- templates/ingress.yaml: This template defines an Ingress resource for routing external traffic to your service.
- templates/tests/: This directory contains test files for your chart.
- templates/_helpers.tpl: This file contains helper templates that can be used across your chart templates.
- .helmignore: Similar to .gitignore, this file specifies which files should be ignored when packaging the chart.
These files provide a structured way to define, customize, and deploy your application using Helm. You can modify these
files to suit your specific application needs and add additional resources as required.
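At any point while editing a chart, you can check that it is still well-formed:
helm lint data-processor-chart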
2. Replace the contents of data-processor-chart/values.yaml with:
replicaCount: 1
image:
  repository: data-processor
  tag: latest
  pullPolicy: Never
serviceAccount:
  create: false
  name: ""
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
autoscaling:
  enabled: false
service:
  type: ClusterIP
  port: 80
Since we’re not using a service or ingress for this simple data processing job, let’s remove those templates:
rm data-processor-chart/templates/service.yaml
rm data-processor-chart/templates/ingress.yaml
rm data-processor-chart/templates/hpa.yaml
rm data-processor-chart/templates/tests/test-connection.yaml
Back in values.yaml, we defined the configuration values for our Helm chart: the number of replicas, the container image details, and the resource limits and requests.
Now, let’s simplify the NOTES.txt file to remove references to ingress and service:
Thank you for installing {{ .Chart.Name }}.
Your release is named {{ .Release.Name }}.
To learn more about the release, try:
$ helm status {{ .Release.Name }}
$ helm get all {{ .Release.Name }}
3. Replace the contents of data-processor-chart/templates/deployment.yaml with:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "data-processor-chart.fullname" . }}
  labels:
    {{- include "data-processor-chart.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "data-processor-chart.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "data-processor-chart.selectorLabels" . | nindent 8 }}
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "data-processor-chart.serviceAccountName" . }}
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
Here, we define the deployment template for our Helm chart. We use the values defined in values.yaml to set the number of replicas, the container image, and the resource limits and requests.
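To see the manifest this template renders with our values.yaml applied, without installing anything, you can ask Helm to render just this file:
helm template data-processor ./data-processor-chart --show-only templates/deployment.yaml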
Finally, let’s update the data-processor-chart/templates/serviceaccount.yaml file:
{{- if .Values.serviceAccount.create -}}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "data-processor-chart.serviceAccountName" . }}
  labels:
    {{- include "data-processor-chart.labels" . | nindent 4 }}
  {{- with .Values.serviceAccount.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}
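Before installing, a dry run will catch rendering and validation errors; --debug prints the fully rendered manifests:
helm install data-processor ./data-processor-chart --dry-run --debug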
4. Install the Helm chart:
helm install data-processor ./data-processor-chart
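Any value from values.yaml can be overridden at install or upgrade time without editing the file. For example, to scale to two replicas:
helm upgrade data-processor ./data-processor-chart --set replicaCount=2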
5. Verify the deployment:
kubectl get deployments
kubectl get pods
You should see something like:
NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
data-processor-data-processor-chart   1/1     1            1           11s
And:
NAME                                           READY   STATUS    RESTARTS   AGE
data-processor-data-processor-chart-<pod-id>   1/1     Running   0          16s
6. Check the logs of your running pod:
kubectl logs data-processor-data-processor-chart-<pod-id> -f
7. To stop the deployment:
helm uninstall data-processor
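You can confirm that the release and its pods are gone:
helm list
kubectl get pods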
Conclusion
In this tutorial, we introduced you to the core concepts of containers, Kubernetes, and Helm, focusing on local development and a simple data processing script. We covered building a Docker image, running a container locally, setting up a local Kubernetes cluster with minikube, and deploying a containerized application using kubectl and Helm.
This is just the beginning of your journey into the world of Kubernetes and container orchestration. As you continue to explore and learn, consider the following best practices:
- Container optimization: Keep your container images small and efficient by using multi-stage builds and minimizing the number of layers.
- Resource management: Always specify resource requests and limits for your containers to ensure efficient cluster utilization.
- Security: Follow the principle of least privilege when setting up service accounts and RBAC policies.
- Monitoring and logging: Implement comprehensive monitoring and logging solutions to gain visibility into your applications and cluster health.
- CI/CD integration: Automate your deployment process by integrating Kubernetes and Helm into your CI/CD pipeline.
To further your knowledge, consider exploring:
- Advanced Kubernetes features like StatefulSets, DaemonSets, and Jobs
- Kubernetes networking concepts and service mesh technologies
- High availability and disaster recovery strategies
- Cloud-native storage solutions
- Kubernetes security best practices
Remember, the key to mastering Kubernetes and container orchestration is hands-on practice and continuous learning. As you build more complex applications and deploy them to production environments, you’ll gain valuable experience and insights.
Happy containerizing! 🐳🚀