Ultimate Guide to Monitoring & Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail

Installation and configuration of Prometheus Operator, Grafana, Loki, and Promtail to ensure seamless application performance and effective log management. Learn to create custom metrics, set up alerting mechanisms, and visualize data through Grafana dashboards.


πŸ™‹β€β™‚οΈ Introduction

Hello everyone, I'm Ankit Jodhani, a DevOps engineer passionate about Cloud and Container technologies. This blog is part of the #10weeksofcloudops series initiated by Piyush Sachdeva.

πŸ“š Synopsis

In this blog, we will implement monitoring and logging from scratch on an AWS EKS cluster using open-source tools like Prometheus, Grafana, Loki, and Promtail. We will also discuss writing custom metrics in a Node.js application, making them scrapable by Prometheus, setting up Alertmanager to receive email alerts, and designing custom dashboards in Grafana.

Once the monitoring part is over, we will move on to logging. We will set up Promtail and Loki to collect and aggregate logs, and finally visualize our logs using Grafana dashboards.

πŸ”Έ Story

  • Run Terraform scripts to create an EKS Cluster with necessary components.
  • Instrumentation: Understand prom-client to write custom metrics in a Node.js app and dockerize it.
  • Deploy the app on Kubernetes and make it accessible over the internet.
  • Install Prometheus Operator and Grafana on the EKS Cluster.
  • Configure Alerting rules, Service Monitors, and AlertManager for email alerts.
  • Create dashboards in Grafana to visualize the performance of the cluster and application.
  • Install Loki on the EKS Cluster and configure it with AWS S3 for log storage.
  • Install Promtail on the EKS Cluster and configure it to send logs to Loki.
  • Configure Grafana to display application logs.
  • Clean up all the infrastructure.

βœ… Prerequisites

  • πŸ“Œ AWS Account
  • πŸ“Œ Basic knowledge of Terraform
  • πŸ“Œ Basic knowledge of Docker
  • πŸ“Œ Basic of Nodejs (Good to have)

πŸ–₯️ Local setup

πŸ’‘
Ensure Terraform, Helm, and AWS CLI are installed on your computer. AWS CLI should be configured with admin privileges to avoid permission issues.

πŸ“¦ List of AWS services

  • πŸ‘‘ Amazon EKS
  • 🌐 Amazon VPC
  • πŸ”’ Amazon IAM
  • πŸ’» Amazon EC2
  • βš–οΈ Amazon Autoscaling
  • πŸͺ£Amazon S3
  • πŸš€ DynamoDB

☸️ List of Kubernetes Tools & Drivers

  • πŸ“‚ EBS CSI Driver (EKS Addon)
  • πŸ“Š Helm charts
  • πŸ§‘β€πŸ­ Prometheus Operator (using kube-prometheus-stack helm chart)
  • πŸ”” Alertmanager (using kube-prometheus-stack helm chart)
  • πŸ’» Grafana (using kube-prometheus-stack helm chart)
  • πŸ—ƒοΈ Loki (using grafana helm chart)
  • πŸ”Ž Promtail (using grafana helm chart)

☸️ Monitoring

  • Monitoring involves tracking the performance of your application and resources, and sending alerts when something is running slowly or failing, to prevent issues from escalating.

πŸ“Š Prometheus

  • It is an open-source monitoring tool that tracks your workload and stores all your metrics in a time-series database.
  • We use PromQL to query the metrics
  • In this blog, we'll store data inside an AWS EBS volume.

πŸ“’ Alertmanager

  • Alert Manager is a Prometheus component responsible for sending alerts to users.

πŸ“˜ Logging

  • Logging helps you see what's happening inside your cluster, nodes, and how your application behaves in response to different requests and components, aiding in troubleshooting errors or bugs.

πŸ“œPromtail

  • Promtail is an open-source tool created by Grafana Labs. It collects all container logs and sends them to Loki.

πŸ”— Loki

  • Loki is also an open-source tool designed and developed by Grafana Labs. It consumes data sent by Promtail or other tools, processes, and filters it.
  • We use LogQL to query the logs from loki.
  • Loki can be integrated with many cloud services, in this blog we'll use the AWS S3 bucket to store the logs.

πŸ–₯️ Grafana

  • Grafana is a visualization tool commonly used for monitoring and logging.
  • Grafana can be integrated with prometheus, loki many other tool to create beautiful dashboard.
  • Grafana will query the prometheus & loki to get the metrics and logs.

🎯 Architecture

Let's understand the architecture of the project. Understanding the architecture makes it easier to proceed with the practical steps.

Monitoring & Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail
  • As you can see in the architecture, Prometheus scrapes metrics from the application and cluster and stores them in AWS EBS Volumes to keep it persistent in case of pod failure. just like that Grafana & Alermanger will also store its data inside EBS Volume.
  • Promtail will collect all the logs from the nodes (application logs + component logs) and send those logs to Loki.
  • Loki will aggregate & process the logs and send them to the AWS S3 bucket.
  • Grafana will query Prometheus and Loki for metrics and logs.

πŸš€ Step-by-Step Guide

πŸ’» Clone the repository

git clone https://github.com/AnkitJodhani/eks-monitoring-and-logging.git

cd eks-monitoring-and-logging
  • Below you can see the directory structures and purpose of each directory
πŸ“‚eks-private-container-registry
β”œβ”€β”€πŸ“app-code
β”‚   └── (Code of nodejs application)
β”œβ”€β”€πŸ“app-k8s-manifest
β”‚   └── (Contents of kubernetes manifest files for nodejs app)
β”œβ”€β”€πŸ“eks-terraform
β”‚   └── (Contains Terraform script to create AWS EKS cluster)
β”œβ”€β”€πŸ“grafana-dashboard
β”‚   └── (Contains json file for grafana dashboard )
β”œβ”€β”€πŸ“kube-prometheus-stack
β”‚   └── (Kubernetes manifest file for prometheus operator)
β”œβ”€β”€πŸ“loki-promtail-stack
β”‚   └── (Contents of Loki & Promtail)
β”œβ”€β”€πŸ˜Ί.gitignore
β”œβ”€β”€πŸ“„readme.md
β””β”€β”€πŸ“„test.sh

πŸ§‘β€πŸ’» Instrumentation

  • Instrumentation is the process of making code changes in the application to write custom metrics & expose metrics.
  • Instrumentation helps in Monitoring Performance + gaining insight of the application
  • I already created a demo nodejs app to demonstrate the Instrumentation. you will find the code inside app-code directory.
  • Please read index.js file. here I'll share a brief overview of the code
    • Express Setup: Initializes an Express application and sets up logging with Morgan.
    • Logging with Pino: Defines a custom logging function using Pino for structured logging.
    • Prometheus Metrics with prom-client: Integrates Prometheus for monitoring HTTP requests using the prom-client library:
      • http_requests_total counter
      • http_request_duration_seconds histogram
      • http_request_duration_summary_seconds summary
      • node_gauge_example gauge for tracking async task duration
    • Basic Routes:
      • / : Returns a "Running" status.
      • /healthy: Returns the health status of the server.
      • /serverError: Simulates a 500 Internal Server Error.
      • /notFound: Simulates a 404 Not Found error.
      • /logs: Generates logs using the custom logging function.
      • /crash: Simulates a server crash by exiting the process.
      • /example: Tracks async task duration with a gauge.
      • /metrics: Exposes Prometheus metrics endpoint.
  • After adding the required metrics, Dockerize the application and push it to the container registry. In my case, I pushed it to the docker hub.

πŸ‘‘ EKS Cluster using Terraform

  • Now, let's go ahead and spin up the EKS Cluster.
cd eks-terraform/main
  • in this directory, you will find all the config files for Terraform like backend.tf terraform.tfvars etc.. and you can modify them based on your requirements but the default setting will work fine for this project.
  • Initialize the terraform
terraform init
  β€’ Validate the script:
terraform validate
  β€’ See the plan of what Terraform is going to create for us:
terraform plan
  β€’ The plan includes a VPC, IAM roles, an EKS cluster with a managed node group, and the EBS CSI driver (as an EKS addon) with IRSA (IAM Roles for Service Accounts).
  β€’ Now, let's apply the Terraform configuration to create the AWS EKS cluster:
terraform apply --auto-approve
☣️
Terraform takes approximately 20-30 minutes, so enjoy the automation πŸ˜€.

Once the above command completes successfully, you will have an EKS cluster running. Let's head over to the AWS console to verify that.

  • Let's update the .kube/config file to connect with the cluster
aws eks list-clusters --region us-east-1

aws eks update-kubeconfig --name monitoring-alerting-logging-eks-cluster --region us-east-1
  • Now, we can review the K8s component
kubectl get all -n kube-system

πŸ§‘β€πŸš€ Deploy Nodejs app

  • Our EKS Cluster is running, and now we can deploy our node js application.
  • You will find the Kubernetes manifest file in app-k8s-manifest directory.
  • You might want to change the image name in app-k8s-manifest/deployment.yml file instead of going with ankitjodhani/prometheus:learning
  • The app-k8s-manifest/service.yml will create a LoadBalancer to expose the app on the internet. Apply the file:
kubectl apply -k app-k8s-manifest/
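  β€’ For reference, the two manifests look roughly like the sketch below. The resource names, labels, and the container port (3000) are assumptions for illustration; check the actual files in app-k8s-manifest/ for the exact values.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodejs-app                # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nodejs-app
  template:
    metadata:
      labels:
        app: nodejs-app
    spec:
      containers:
        - name: nodejs-app
          image: ankitjodhani/prometheus:learning   # replace with your own image
          ports:
            - containerPort: 3000                   # assumed app port
---
apiVersion: v1
kind: Service
metadata:
  name: nodejs-app
  labels:
    app: nodejs-app
spec:
  type: LoadBalancer              # exposes the app through an AWS load balancer
  selector:
    app: nodejs-app
  ports:
    - name: web                   # port name referenced later by the ServiceMonitor sketch
      port: 80
      targetPort: 3000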
  • Head over to the AWS console and verify the load balancer (Classic Load Balancer).
  • Now, you can take the DNS name of the Load Balancer and visit the website.
  • It's good to generate a load using an automated script. In the root directory, you will find test.sh, which will generate the load by sending a lot of requests.
  • So, open another new terminal and execute the below command. just like shown in the below image.
./test.sh YOUR_LOAD_BALANCER_DNS_NAME
  • Note: Keep running the test.sh and don't kill the terminal for a while.

βš“ Install the Helm chart

  • Execute the below commands and install helm charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

βš’οΈ Install & Configure Prometheus Operator

  • Now, let's install the Prometheus operator in the AWS EKS Cluster using the helm chart.
  • You will find all the Prometheus-related manifest files inside kube-prometheus-stack directory.
kubectl create ns monitoring

helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring -f kube-prometheus-stack/custom_kube_prometheus_stack.yml
Install Prometheus Operator on AWS EKS Cluster
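  β€’ I won't reproduce custom_kube_prometheus_stack.yml here, but for this kind of setup the overrides typically look like the hedged sketch below: EBS-backed persistence for Prometheus, Alertmanager, and Grafana, plus letting Prometheus pick up ServiceMonitors that are not part of the Helm release. The key names follow the public kube-prometheus-stack values; verify them against the file in the repo.
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false   # discover ServiceMonitors regardless of labels
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2                      # assumption: use any EBS-backed StorageClass in your cluster
          resources:
            requests:
              storage: 10Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          resources:
            requests:
              storage: 5Gi
grafana:
  persistence:
    enabled: true
    storageClassName: gp2
    size: 5Gi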
  • It's time to configure custom Alerts, an Alertmanager to receive emails, and a ServiceMonitor to scrape our application metrics.
  • Before configuring Alertmanager, we need credentials to send emails. For this blog, I'm using Gmail, but any SMTP provider like AWS SES can be used. so let's grab the credentials for that.
  • Open your Google account settings and search App password & create a new password.
  • Copy the newly created password. it should be like uhnlqkdhnirpqfpy
πŸ“›
Ensure there are NO spaces in the password.
  • Convert that password into base64 format.
  • Now, put your password in the kube-prometheus-stack/email-secret.yml and add your email ID to the kube-prometheus-stack/alertmanagerconfig.yml instead of mine.
  • You can also take a look at kube-prometheus-stack/alerts.yml file to see the Rules that I've set for the alerts.
    • Send an alert when the average node CPU is higher than 50%.
    • Send an alert when a POD restarts more than 2 times.
  • Next, we will configure the Service Monitor to scrape the metrics from our Node.js application.
  • Refer to kube-prometheus-stack/serviceMonitor.yml for the configuration.
configure service monitor with Prometheus operator
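  β€’ A ServiceMonitor simply tells the operator which Service to scrape and on which port and path. A minimal sketch, assuming the app's Service lives in the default namespace with an app: nodejs-app label and a port named web (as in the earlier manifest sketch):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nodejs-app                       # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring                  # so Prometheus discovers it
spec:
  namespaceSelector:
    matchNames:
      - default                          # namespace of the app's Service
  selector:
    matchLabels:
      app: nodejs-app
  endpoints:
    - port: web                          # named port on the Service
      path: /metrics
      interval: 30s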
  • It's time to apply all these configurations. Execute below command
kubectl apply -k kube-prometheus-stack/
  • We need to wait for a couple of minutes for the Prometheus operator to reload its configuration.
  • Now, let's visit the Prometheus UI by running the following command and visiting http://localhost:9090.
kubectl port-forward -n monitoring service/prometheus-operated 9090:9090
Prometheus Operator
  • To check the applied rules, click on the Alerts button at the top.
Configure Alert rules in Prometheus Operator
  • Verify the target configuration by clicking the Targets button from the drop-down menu.
Configure service monitor in Prometheus Operator
  • Let's access the Alertmanager UI to see the alert configurations. Run the following command and visit http://localhost:9093.
kubectl port-forward -n monitoring service/alertmanager-operated 9093:9093
Configure Alertmanager in Prometheus Operator
  • Click on the Status button at the top to see the applied configurations.
Configure Alertmanager in Prometheus Operator
  • Now, let's crash the Node.js app twice to receive alerts from Alertmanager.
  • The Nodejs app has a route /crash, which crashes the container, and Kubernetes automatically restarts it. However, if the app crashes more than 2 times, Alertmanager will send an alert to our email.
  • let's see that practically
http://YOUR_LOAD_BALANCER_DNS_NAME/crash
  • Keep hitting the above endpoint until Kubernetes restarts at least 3 times.
kubectl get pods
Prometheus Operator
  • Check the alert in the firing state by running:
kubectl port-forward -n monitoring service/prometheus-operated 9090:9090
Alert rules in Prometheus Operator
  • Verify Alertmanager received an alert from Prometheus:
kubectl port-forward -n monitoring service/alertmanager-operated 9093:9093
Alertmanager configured in Prometheus Operator
  • You should receive an email notification on your configured mail address.
Configure Alertmanager in Prometheus Operator
  • We configured it to send emails every 5 minutes until the issue is resolved.
Configure Alertmanager in Prometheus Operator
  • Now, it's time to visualize our metrics on a beautiful dashboard. Thankfully, the kube-prometheus-stack Helm chart automatically installs Grafana, so we don't need to install it separately. Access the Grafana UI at http://localhost:8000:
kubectl port-forward -n monitoring service/monitoring-grafana 8000:80
Grafana in Prometheus operator
  • You will see many pre-built dashboards. You can utilize them for monitoring or design/import your own.
  • Import the dashboard I created for the Node.js app, available in the grafana-dashboard directory.
  • Click on the New button at the top right, select Import from the drop-down menu, and import the dashboard.
Grafana in Prometheus Operator
  • Once imported, you will see a screen similar to mine, as shown below, if you haven't stopped the test.sh (load generator script).
Grafana dashboard in Prometheus Operator
  • This is how we can monitor our application, other components, and clusters from Grafana.

βš’οΈ Install & configure Loki

  • We've set up monitoring, now let's configure Loki and Promtail for logging.
  • We already added the Grafana Helm repo in the previous step, which includes both Loki and Promtail.
  • We want Loki to store logs in an AWS S3 bucket, so it needs a bucket and relevant permissions to send logs to the AWS S3 bucket.
  • Head over to the AWS S3 console and create a bucket with a unique name.
ConfigureLoki to send logs to AWS S3 Bucket
  • Next, create an IAM policy in the AWS console. You can find the policy in loki-promtail-stack/aws-s3-policy.json, but remember to add your bucket's ARN.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1719324853777",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": "ARN_OF_YOUR_BUCKET"
        },
        {
            "Sid": "Stmt1719324853778",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": "ARN_OF_YOUR_BUCKET/*"
        }
    ]
}

Configure Loki to send logs to AWS S3 Buckets
  • Lets, ceate an IAM role, attach the policy, and create an access_key_id and secret_access_key.
Configure Loki to send logs to the AWS S3 bucket
  • Now we are ready to configure Loki.
  • Let's first see the values.yml file and write that into loki_distributed_values.yml
helm show values grafana/loki-distributed > loki-promtail-stack/loki_distributed_values.yml
  • loki_distributed_values.yml has all the default settings but we have to make some changes to configure the aws s3 bucket.
  • for reference, you can see the below screenshots for what values I've changed in the file.
Configure Loki to send logs to AWS S3 bucket
Configure Loki to send logs to the AWS S3 bucket
  • I also created an updated configuration file, loki-promtail-stack/custom_loki_distributed_values.yml, with all necessary changes.
  • Ensure you add your bucket name, region, access ID, and secret access ID.
  • Now, let's install Loki on the cluster using the helm chart, hit the below command to install it
helm install loki grafana/loki-distributed -n monitoring -f loki-promtail-stack/custom_loki_distributed_values.yml
Install Loki using helm chart on AWS EKS cluster
  • Yup! We've installed Loki successfully

βš’οΈ Install & configure Promtail

  • Now, let's set up the log collector, Promtail. We already have the Promtail Helm chart in the Grafana repo.
  • Since everything is installed in the monitoring namespace, we need to change one endpoint in Promtail's default configuration.
  • Hit the below command to see the default configuration(values.yml) file at loki-promtail-stack/promtail_values.yml
helm show values grafana/promtail > loki-promtail-stack/promtail_values.yml
  • We have to change clients.url attribute so Promtail knows where to send the logs. Refer to the image for reference.
Configure Promtail on AWS EKS Cluster
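  β€’ Concretely, the change is the client URL pointing at the Loki gateway Service in the monitoring namespace. The config.clients key below follows the public promtail chart values; double-check the structure against your chart version:
config:
  clients:
    - url: http://loki-loki-distributed-gateway.monitoring.svc.cluster.local/loki/api/v1/push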
  • I also provided an updated configuration file, loki-promtail-stack/custom_promtail_values.yml.
  • Now, we are done with configuration. let's go ahead and install Promtail. please hit the below command
helm install promtail grafana/promtail -n monitoring -f  loki-promtail-stack/custom_promtail_values.yml
Install and configure Promtail on AWS EKS
  • Now, let's go ahead and see our logs in the Grafana dashboard. please hit the below command & access grafana at http://localhost:8000
  • Before adding a new dashboard, we need to add new data sources so Grafana can query logs from Loki.
  • So let's add a new data source. see below image for reference
  • Add a new data source with the URL http://loki-loki-distributed-gateway.monitoring.svc.cluster.local
Configure Loki and Promtail on AWS EKS Cluster
  • We've successfully added a data source. Now, import the community dashboard by typing 15414 and selecting Loki as the data source.
Configure Loki and Promtail on AWS EKS Cluster
  • You can now see all the logs in Grafana. Apply filters to get specific namespace or container logs.
Configure Loki and Promtail on AWS EKS Cluster
  • Now, let's try to generate logs from your application by selecting the default namespace from the dropdown menu at the top.
  • You can run thattest.sh script or visit http://YOUR_LOAD_BALANCER_DNS_NAME/logs in the browser.
Configure Loki and Promtail on AWS EKS Cluster
  • Lastly, verify that Loki is sending logs to the S3 bucket by checking the folders created by Loki in the AWS S3 console.
Configure Loki and Promtail on AWS EKS Cluster
  • Yes, we can see the logs are available inside our AWS S3 bucket.

πŸ§Ό Cleanup

  • It's time to clean up what we've created to avoid unnecessary costs.
  • First, delete the Node.js application from Kubernetes:
kubectl delete -k app-k8s-manifest/
  • Next, delete the Helm charts installed, as Prometheus, Grafana, and Alertmanager have created AWS EBS volumes:
helm uninstall monitoring -n monitoring

helm uninstall loki -n monitoring

helm uninstall promtail -n monitoring
  • Let's even delete the monitoring namespace
kubectl delete ns monitoring
  • Also, make sure that we don't have any Persistent volume because if something is left out it will create trouble for Terraform.
kubectl get pv
  • Finally, let's destroy our AWS EKS Cluster. so, please navigate to the eks-terraform/main/ directory & hit the below command
cd eks-terraform/main/

terraform destroy --auto-approve
  • After executing the above command, you will not have any resources in your AWS account.

πŸ™Œ Conclusion

  • In this blog, we've comprehensively walked through setting up a monitoring and logging stack on AWS EKS using Prometheus, Grafana, Loki, and Promtail.
  • From deploying a Node.js application with custom metrics to visualizing logs and metrics in Grafana, we've covered the entire process step-by-step.
  • I aimed to cover all necessary details and best practices. but writing everything in the blog is not possible so I recommend you to dig deeper and check out my Terraform code, Kubernetes manifest files, and the rest of all directories.
  • You can implement CICD for Terraform (GitOps approach)

And here it ends... πŸ™ŒπŸ₯‚

If you like my work, please message me on LinkedIn with "Hi" and your country name.

-πŸ™‹β€β™‚οΈ Ankit Jodhani.

πŸ“¨ reach me at ankitjodhani1903@gmail.com

πŸŽ’ Resources

https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack

https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed

https://github.com/grafana/helm-charts/tree/main/charts/promtail

https://dev.to/aws-builders/monitoring-eks-cluster-with-prometheus-and-grafana-1kpb

https://github.com/grafana/loki/issues/7335

https://stackoverflow.com/questions/76873980/loki-s3-configuration-for-chunks-and-indexes

https://blog.srev.in/posts/grafana-loki-with-amazon-s3/

https://akyriako.medium.com/kubernetes-logging-with-grafana-loki-promtail-in-under-10-minutes-d2847d526f9e