<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Kubernetes Dive]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://blog.ankitjodhani.com/</link><image><url>https://blog.ankitjodhani.com/favicon.png</url><title>Kubernetes Dive</title><link>https://blog.ankitjodhani.com/</link></image><generator>Ghost 5.85</generator><lastBuildDate>Mon, 23 Mar 2026 07:48:04 GMT</lastBuildDate><atom:link href="https://blog.ankitjodhani.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[AWS Break Glass Access: The Complete  Guide]]></title><description><![CDATA[This blog explores the Break Glass concept in AWS. An emergency access mechanism for multi-account environments. We'll walk through the approaches, architecture, step-by-step setup, and real-world scenarios.]]></description><link>https://blog.ankitjodhani.com/the-complete-aws-break-glass-implementation-guide/</link><guid isPermaLink="false">69958b8a4bdb1ed3f7757431</guid><category><![CDATA[AWS]]></category><category><![CDATA[Landing Zone]]></category><category><![CDATA[AWS Organization]]></category><category><![CDATA[Best practices]]></category><dc:creator><![CDATA[Ankit Jodhani]]></dc:creator><pubDate>Thu, 19 Feb 2026 05:02:20 GMT</pubDate><media:content url="https://cdn.ankitjodhani.com/2026/02/breakGlass-1.gif" medium="image"/><content:encoded><![CDATA[<img src="https://cdn.ankitjodhani.com/2026/02/breakGlass-1.gif" alt="AWS Break Glass Access: The Complete  Guide"><p></p><h2 id="%F0%9F%99%8B%E2%80%8D%E2%99%82%EF%B8%8F-introduction">&#x1F64B;&#x200D;&#x2642;&#xFE0F; Introduction</h2><p>Hi All, I&apos;m <a href="https://www.linkedin.com/in/ankit-jodhani/?ref=blog.ankitjodhani.com">Ankit Jodhani</a>, a Kubestronaut and was working as a 
Kubernetes Engineer in the past, and an AWS Community Builder. I&apos;m very passionate about Cloud and Container technologies.</p><p>Recently, I came across the concept of &quot;Break Glass&quot; (I know, it&apos;s not new for everyone, but it was new for me) and honestly, it surprised me how critical it is and how rarely people talk about it in detail. So I spent a good amount of time researching and reading AWS docs and blogs; this blog is the result of all that.</p><p>A little promotion: I&apos;m looking for freelancing clients and projects related to Kubernetes, Cloud, and DevOps. Feel free to reach out if you are looking for someone like me.</p><h2 id="%F0%9F%93%9A-synopsis">&#x1F4DA; Synopsis</h2><ul><li>Let&apos;s imagine a scenario: it&apos;s 2 AM, and you receive a phone call from a colleague: the payment service is failing and customers are getting errors.</li><li>You grab your laptop, open the browser, and go to your AWS SSO portal to log in and fix the issue, but the portal shows <strong>&quot;Service Unavailable&quot;. </strong>You try again. Same thing. Your Identity Provider (Okta, Azure AD, whatever you use) is either down, or something is broken with IAM Identity Center.</li><li>Now you&apos;re standing there, fully awake, knowing exactly what to fix, but </li></ul><blockquote><strong>You</strong> <strong>can&apos;t get into your AWS accounts. You are completely locked out.</strong></blockquote><ul><li>This is a hypothetical scenario, but it can happen to real teams at real companies. Teams with a Break Glass mechanism in place can fix the issue and go back to sleep. And teams without one? Well... for them, it can be a very long night.</li></ul><p>In this blog, we&apos;ll explore:</p><h3 id="%F0%9F%94%B8-story">&#x1F538; Story</h3><ul><ul><li>What the normal &quot;Day-2&quot; access flow looks like &amp; what can go wrong with it</li><li>The Break Glass concept: what it actually means 
&amp; why you need it</li><li>The different approaches to implement Break Glass in AWS</li><li>Complete architecture for a production-grade Break Glass setup</li><li>Step-by-step implementation guide</li><li>How to set up alerts and monitoring for Break Glass usage</li><li>Break Glass drill procedure: How to test it</li><li>Real-world emergency scenarios &amp; exactly how Break Glass saves you</li></ul></ul><h2 id="%F0%9F%94%84-normal-access-flow-day-2-operations">&#x1F504; Normal Access Flow (Day-2 Operations)</h2><p>First, let&apos;s understand how <strong>normal</strong> access works, because Break Glass only makes sense when you understand what it&apos;s replacing.</p><p>Here&apos;s how engineers access AWS accounts on a daily basis:</p><pre><code class="language-bash">+------------------+
|     Engineer     |
+------------------+
          |
          v
+------------------+
|     Web Browser  |
+------------------+
          |
          v
+-----------------------------------+
|   Identity Provider (IdP)        |
|  (Okta / Azure AD / Google)      |
+-----------------------------------+
          |
          v
+-----------------------------------+
|        MFA Challenge             |
| (Authenticator App / SMS / etc.) |
+-----------------------------------+
          |
          v
+-----------------------------------+
| SSO Portal - Account &amp; Role List |
+-----------------------------------+
          |
          v
+-----------------------------------+
| Engineer selects:                |
| &quot;Production Account &#x2192; ReadOnly&quot;  |
+-----------------------------------+
          |
          v
+-----------------------------------+
| IAM Identity Center              |
| Assumes IAM Role in Target Acct  |
+-----------------------------------+
          |
          v
+-----------------------------------+
| Production AWS Account           |
| IAM Role: ReadOnly               |
| Temporary Credentials            |
| (1&#x2013;12 Hour Expiry)               |
+-----------------------------------+</code></pre><ul><li>No passwords are stored. No long-lived access keys. No IAM users in member accounts.</li><li>All access is temporary, auditable, and centrally managed through IAM Identity Center.</li></ul><p>This is good. This is the right way. But what happens when this flow <strong>breaks</strong>?</p><h2 id="%E2%9A%A0%EF%B8%8F-what-can-go-wrong">&#x26A0;&#xFE0F; What Can Go Wrong?</h2><table>
<thead>
<tr>
<th><strong>Failure Scenario</strong></th>
<th><strong>Impact</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Identity Provider (Okta / Azure AD) is down</td>
<td>No one can authenticate. Complete lockout from all accounts.</td>
</tr>
<tr>
<td>IAM Identity Center service outage (rare but possible)</td>
<td>SSO portal unreachable. No one can assume roles.</td>
</tr>
<tr>
<td>Someone misconfigures an SCP on Root or Workload OU</td>
<td>SCP accidentally denies <code>sts:AssumeRole</code>. Identity Center can&apos;t assume roles in member accounts.</td>
</tr>
<tr>
<td>Identity Provider is compromised by attacker</td>
<td>You need to cut off SSO immediately. But then how does YOUR team access AWS to respond to the incident?</td>
</tr>
</tbody>
</table>
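<p>To make the SCP misconfiguration scenario concrete, the accidental lockout can be as small as one overly broad deny statement. This is purely illustrative (the Sid and scope are hypothetical):</p><pre><code class="language-json">{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Sid&quot;: &quot;HypotheticalOverlyBroadDeny&quot;,
      &quot;Effect&quot;: &quot;Deny&quot;,
      &quot;Action&quot;: &quot;sts:AssumeRole&quot;,
      &quot;Resource&quot;: &quot;*&quot;
    }
  ]
}
</code></pre><p>Attached to the Root OU or a Workload OU, a statement like this blocks role assumption in every account underneath it, including the roles behind your SSO permission sets.</p>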
<ul><li>In <strong>all</strong> of these scenarios, your normal access path is broken. You need an alternative way in. And that alternative is <strong>Break Glass.</strong></li></ul><h2 id="%F0%9F%94%90-the-break-glass-concept">&#x1F510; The Break Glass Concept</h2><ul><li>It is a <strong>pre-established emergency access mechanism</strong> that bypasses the normal authentication &amp; authorization flow for a selected set of people in emergency situations.</li></ul><p>It&apos;s called &quot;Break Glass&quot; because it&apos;s like a fire alarm behind a glass panel: you only break the glass in a real emergency.</p><p><strong>A few considerations:</strong></p><ul><li>&#x1F6AB; Never used for normal day-to-day operations</li><li>&#x2705; Must always be functional and ready</li><li>&#x1F6A8; Must trigger an immediate alert when used</li><li>&#x1F9D8; Must be simple enough to use under pressure</li><li>&#x1F512; Requires authorization; not everyone should have access</li></ul><h2 id="%F0%9F%93%8B-break-glass-approaches">&#x1F4CB; Break Glass Approaches</h2><p>There are 4 main approaches:</p><table>
<thead>
<tr>
<th><strong>Sr No</strong></th>
<th><strong>Approach</strong></th>
<th><strong>What It Is</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Treat Root User as Break Glass</td>
<td>Secure the management account root user as your last-resort emergency access</td>
</tr>
<tr>
<td>2</td>
<td>Break Glass IAM User in Management Account</td>
<td>Create dedicated IAM users (BreakGlass-1, BreakGlass-2) in the management account with cross-account roles</td>
</tr>
<tr>
<td>3</td>
<td>Dedicated Break Glass Account</td>
<td>Separate AWS account with its own IAM users + cross-account roles into member accounts</td>
</tr>
<tr>
<td>4</td>
<td>Backup Identity Provider</td>
<td>Configure a second IdP as fallback federation source</td>
</tr>
</tbody>
</table>
<p>In this blog we will focus on the <strong>3rd approach</strong> (Dedicated Break Glass Account), as it covers all the other approaches within itself.</p><p>All of them are fairly simple to implement, and the choice depends on the criticality of your workloads and the scale you operate at.</p><p>There are no hard rules about these approaches. You can also design a custom approach based on what you need. These are patterns, not rigid rules.</p><h2 id="%F0%9F%8E%AF-architecture">&#x1F3AF; Architecture</h2><p>Let&apos;s understand the architecture before we jump into the implementation. This will give you a clear picture of what we&apos;re building.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2026/02/breakGlass.gif" class="kg-image" alt="AWS Break Glass Access: The Complete  Guide" loading="lazy" width="1350" height="1800"><figcaption><span style="white-space: pre-wrap;">AWS Break Glass Access: The Complete Guide</span></figcaption></figure><p>Here&apos;s what the architecture looks like:</p><ul><li><strong>Break Glass Account:</strong> A separate dedicated account in its own OU or in the Security OU, with 2 IAM users (<code>BreakGlass-Admin-1</code> and <code>BreakGlass-Admin-2</code>)</li><li><strong>Management Account:</strong> Has 2 Break Glass IAM users (<code>BreakGlass-1</code> and <code>BreakGlass-2</code>) along with the secured root user.</li><li><strong>Every critical member account (not necessarily all accounts):</strong> Has 2 IAM roles (<code>BreakGlassReadOnly</code> and <code>BreakGlassAdmin</code>)<ul><li>These roles trust all 4 Break Glass users (2 from Management + 2 from Break Glass Account)</li><li>They require MFA for assumption</li></ul></li><li>The Break Glass Account has <strong>NO SSO access (</strong>it should be completely disconnected from Identity Center - no one should have SSO access to it)</li><li>The Break Glass Account has <strong>NO workloads (</strong>only CloudTrail and Config 
running)</li><li><strong>CloudTrail + EventBridge</strong> alerts fire whenever any Break Glass user logs in or assumes a role</li></ul><p>The key idea here: we have <strong>3 layers of emergency access</strong>, each independent of the other:</p><pre><code class="language-bash">Layer 1: Break Glass Account IAM Users
         (handles most common emergencies, or is used when the Management Account itself is compromised or broken)

Layer 2: Break Glass IAM Users in Management Account
         (handles emergencies like SSO fix, SCP fix)

Layer 3: Management Account Root User
         (absolute last resort, when everything else fails)</code></pre><h2 id="%F0%9F%9A%80-step-by-step-implementation-guide">&#x1F680; Step-by-Step Implementation Guide</h2><h3 id="%F0%9F%94%B9-step-1-create-the-break-glass-aws-account">&#x1F539; Step 1: Create the Break Glass AWS Account</h3><p>Create a new AWS account through Account Factory or AWS Organizations.</p><ul><li><strong>Account Name:</strong> <code>BreakGlass</code></li><li><strong>Root Email:</strong> <code>aws-breakglass@xyz.com</code> (dedicated email, not shared with anyone else)</li><li><strong>OU Placement:</strong> Security OU (or create a dedicated sub-OU)</li></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">The Break Glass Account must have its own dedicated email address. This email should NOT be shared or aliased with any other account&apos;s email group. Complete isolation.</div></div><p>A few critical things about this account:</p><ul><li>This account <strong>must be disconnected from SSO / Identity Center (</strong>no one should be able to access it via SSO)</li><li><strong>No workloads</strong> should run in this account (only CloudTrail and AWS Config, which are mandatory via Control Tower)</li><li>The SCPs on this account&apos;s OU should NOT block <code>sts:AssumeRole</code> or <code>iam:*</code>; otherwise the Break Glass users won&apos;t be able to assume roles in other accounts</li></ul><h3 id="%F0%9F%94%B9-step-2-create-break-glass-iam-users-in-the-break-glass-account">&#x1F539; Step 2: Create Break Glass IAM Users in the Break Glass Account</h3><p>In the newly created Break Glass Account, create <strong>2 IAM users</strong> with console access:</p><ul><li><code>BreakGlass-Admin-1</code></li><li><code>BreakGlass-Admin-2</code></li></ul><p>For each user:</p><ul><li>a) Create the user with console access</li><li>b) Attach a policy that allows assuming the Break Glass roles in other accounts</li></ul><pre><code 
class="language-json">{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Sid&quot;: &quot;AllowAssumeBreakGlassRoles&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: &quot;sts:AssumeRole&quot;,
      &quot;Resource&quot;: [
        &quot;arn:aws:iam::*:role/BreakGlassAdmin&quot;,
        &quot;arn:aws:iam::*:role/BreakGlassReadOnly&quot;
      ]
    }
  ]
}
</code></pre>
<ul><li>c) Add an MFA enforcement policy:<ul><li>This is important: Even if someone gets the password, they can&apos;t do anything without the hardware MFA device.</li></ul></li></ul><pre><code class="language-json">{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Sid&quot;: &quot;DenyAllWithoutMFA&quot;,
      &quot;Effect&quot;: &quot;Deny&quot;,
      &quot;NotAction&quot;: [
        &quot;iam:CreateVirtualMFADevice&quot;,
        &quot;iam:EnableMFADevice&quot;,
        &quot;iam:GetUser&quot;,
        &quot;iam:ListMFADevices&quot;,
        &quot;iam:ListVirtualMFADevices&quot;,
        &quot;iam:ResyncMFADevice&quot;,
        &quot;sts:GetSessionToken&quot;
      ],
      &quot;Resource&quot;: &quot;*&quot;,
      &quot;Condition&quot;: {
        &quot;BoolIfExists&quot;: {
          &quot;aws:MultiFactorAuthPresent&quot;: &quot;false&quot;
        }
      }
    }
  ]
}
</code></pre>
<ul><li>d) Set up hardware MFA:<ul><li>Use a <strong>hardware MFA device</strong> (YubiKey or similar), NOT a phone-based authenticator app</li><li>Register the MFA device on each user</li><li>Label the physical device clearly: <code>BG-ADMIN-1-MFA</code></li></ul></li><li>e) Store the credentials securely:<ul><li>Store passwords in your organization&apos;s security vault (1Password Business, CyberArk, HashiCorp Vault, something that does NOT depend on AWS)</li><li>Store the hardware MFA devices in a physically secure location (office safe, locked cabinet)</li></ul></li></ul><p>&#x1F4DB; Best practice: implement <strong>dual control</strong>. One person holds the password, another person holds the MFA device. Both must be present to use Break Glass. This prevents a single person from having unilateral access.</p><ul><li>These credentials should be shared with <strong>2 credible people</strong> in your organization, typically the <strong>Cloud Platform Lead</strong> and the <strong>CTO</strong></li></ul><p>Now repeat the same for <code>BreakGlass-Admin-2</code>.</p><h3 id="%F0%9F%94%B9-step-3-create-break-glass-iam-users-in-the-management-account">&#x1F539; Step 3: Create Break Glass IAM Users in the Management Account</h3><p>Now create <strong>2 more</strong> Break Glass users, but this time in the <strong>Management Account</strong>:</p><ul><li><code>BreakGlass-1</code></li><li><code>BreakGlass-2</code></li></ul><p>The setup is identical to Step 2: same policies, same MFA, same credential storage practices.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F914;</div><div class="kg-callout-text">You might be wondering: why do we need Break Glass users in BOTH accounts?</div></div><table>
<thead>
<tr>
<th><strong>Scenario</strong></th>
<th><strong>Break Glass Account Users</strong></th>
<th><strong>Management Account Users</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Access member accounts when SSO is down</td>
<td>&#x2705; Works</td>
<td>&#x2705; Works (two paths)</td>
</tr>
<tr>
<td>Fix SCPs / AWS Organizations</td>
<td>&#x274C; Must use Management Account root</td>
<td>&#x2705; Use Break Glass IAM user (faster, better audit trail)</td>
</tr>
<tr>
<td>Fix IAM Identity Center (SSO)</td>
<td>&#x274C; Must use Management Account root</td>
<td>&#x2705; Use Break Glass IAM user</td>
</tr>
<tr>
<td>Fix Control Tower</td>
<td>&#x274C; Must use Management Account root</td>
<td>&#x2705; Use Break Glass IAM user</td>
</tr>
</tbody>
</table>
<p><em><strong>Note</strong></em>: <em>The Break Glass Account <strong>cannot manage Organizations, SCPs, or Identity Center; </strong>only the Management Account can. Without Break Glass users in the Management Account, every SCP or SSO issue forces you to use root. And root should be the absolute last resort.</em></p><h3 id="%F0%9F%94%B9-step-4-create-cross-account-roles-in-every-member-account">&#x1F539; Step 4: Create Cross-Account Roles in Every Member Account</h3><ul><li>This is the critical piece that connects everything. In <strong>every critical member account</strong>, create 2 IAM roles:<ul><li><strong>Role 1: <code>BreakGlassReadOnly</code></strong>: For investigation and read-only access</li></ul></li></ul><pre><code class="language-json">{
  &quot;RoleName&quot;: &quot;BreakGlassReadOnly&quot;,
  &quot;MaxSessionDuration&quot;: 14400,
  &quot;AssumeRolePolicyDocument&quot;: {
    &quot;Version&quot;: &quot;2012-10-17&quot;,
    &quot;Statement&quot;: [
      {
        &quot;Sid&quot;: &quot;TrustManagementAccountBreakGlass&quot;,
        &quot;Effect&quot;: &quot;Allow&quot;,
        &quot;Principal&quot;: {
          &quot;AWS&quot;: [
            &quot;arn:aws:iam::MANAGEMENT_ACCOUNT_ID:user/BreakGlass-1&quot;,
            &quot;arn:aws:iam::MANAGEMENT_ACCOUNT_ID:user/BreakGlass-2&quot;
          ]
        },
        &quot;Action&quot;: &quot;sts:AssumeRole&quot;,
        &quot;Condition&quot;: {
          &quot;Bool&quot;: { &quot;aws:MultiFactorAuthPresent&quot;: &quot;true&quot; }
        }
      },
      {
        &quot;Sid&quot;: &quot;TrustBreakGlassAccount&quot;,
        &quot;Effect&quot;: &quot;Allow&quot;,
        &quot;Principal&quot;: {
          &quot;AWS&quot;: [
            &quot;arn:aws:iam::BREAKGLASS_ACCOUNT_ID:user/BreakGlass-Admin-1&quot;,
            &quot;arn:aws:iam::BREAKGLASS_ACCOUNT_ID:user/BreakGlass-Admin-2&quot;
          ]
        },
        &quot;Action&quot;: &quot;sts:AssumeRole&quot;,
        &quot;Condition&quot;: {
          &quot;Bool&quot;: { &quot;aws:MultiFactorAuthPresent&quot;: &quot;true&quot; }
        }
      }
    ]
  },
  &quot;ManagedPolicyArns&quot;: [&quot;arn:aws:iam::aws:policy/ReadOnlyAccess&quot;]
}
</code></pre>
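<p>Because this trust policy must be identical in every member account, it helps to generate it rather than hand-edit account IDs. Here is a minimal Python sketch (the account IDs shown are placeholders, not real values):</p><pre><code class="language-python">import json

# Hypothetical account IDs -- replace with your own.
MANAGEMENT_ACCOUNT_ID = "111111111111"
BREAKGLASS_ACCOUNT_ID = "333333333333"

def break_glass_trust_policy(mgmt_id: str, bg_id: str) -> dict:
    """Build the trust policy that lets all 4 Break Glass users
    (2 per account) assume the role, with MFA required."""
    def statement(sid, account_id, users):
        return {
            "Sid": sid,
            "Effect": "Allow",
            "Principal": {"AWS": [
                f"arn:aws:iam::{account_id}:user/{u}" for u in users
            ]},
            "Action": "sts:AssumeRole",
            "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}},
        }
    return {
        "Version": "2012-10-17",
        "Statement": [
            statement("TrustManagementAccountBreakGlass", mgmt_id,
                      ["BreakGlass-1", "BreakGlass-2"]),
            statement("TrustBreakGlassAccount", bg_id,
                      ["BreakGlass-Admin-1", "BreakGlass-Admin-2"]),
        ],
    }

print(json.dumps(break_glass_trust_policy(
    MANAGEMENT_ACCOUNT_ID, BREAKGLASS_ACCOUNT_ID), indent=2))</code></pre><p>You can feed the generated document into whatever automation stamps the roles into new accounts, so the trust relationships never drift between accounts.</p>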
<ul><ul><li><strong>Role 2: </strong><code>BreakGlassAdmin</code>: For full admin access when you need to fix things<ul><li>Same trust policy as above, but attach <code>AdministratorAccess</code> instead of <code>ReadOnlyAccess</code>.</li></ul></li></ul></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F525;</div><div class="kg-callout-text">Do NOT create these roles manually in every account. Automate it. Use <b><strong style="white-space: pre-wrap;">CloudFormation StackSets</strong></b>, <b><strong style="white-space: pre-wrap;">AFT (Account Factory for Terraform)</strong></b>, or <b><strong style="white-space: pre-wrap;">CfCT (Customizations for Control Tower)</strong></b>. This way, every new account automatically gets these roles.</div></div><h3 id="%F0%9F%94%B9-step-5-setup-alerts-and-monitoring">&#x1F539; Step 5: Set Up Alerts and Monitoring</h3><p>This is non-negotiable. You <strong>MUST</strong> know when anyone uses Break Glass, whether it&apos;s a legitimate emergency or an attacker who got hold of the credentials.</p><p><strong>What to alert on:</strong></p><ul><li>Any console login by Break Glass IAM users</li><li>Any <code>sts:AssumeRole</code> call to the <code>BreakGlassAdmin</code> or <code>BreakGlassReadOnly</code> roles</li><li>Any console login by a root user (any account)</li><li>Any failed login attempts on Break Glass users</li></ul><p>Send notifications to:</p><ul><li>&#x1F4E7; Email: Security team + Cloud Platform Lead</li><li>&#x1F4AC; Slack/Teams: <code>#security-alerts</code> channel</li><li>&#x1F6A8; PagerDuty: High urgency (a Break Glass login is always high urgency)</li></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">These alerts will also be useful when collecting evidence for auditors and for compliance requirements.</div></div><h2
id="%F0%9F%94%A5-emergency-procedure-how-to-actually-use-break-glass">&#x1F525; Emergency Procedure: How to Actually Use Break Glass</h2><ul><li>There should be a complete documented procedure for how to use Break Glass and it should be easily accessible to your team in case of emergency.</li></ul><p>Here&apos;s the exact flow:</p><pre><code class="language-bash">STEP 0: Declare the emergency
  &#x2192; Cloud Lead or CTO approves Break Glass usage
  &#x2192; Notify #incident channel: &quot;Break Glass initiated. Reason: [XYZ]&quot;

STEP 1: Determine which layer you need
  &#x2192; Need to fix SCPs / SSO / Control Tower?
    &#x2192; Use BreakGlass-1 in Management Account 
  &#x2192; Management Account is compromised/broken?
    &#x2192; Use BreakGlass-Admin-1 in Break Glass Account
  &#x2192; Everything else has failed?
    &#x2192; Use Management Account Root (Layer 3)

STEP 2: Retrieve credentials
  &#x2192; Person A retrieves password from the vault
  &#x2192; Person B retrieves hardware MFA device from secure storage
  &#x2192; Both people must be present

STEP 3: Login
  &#x2192; Go to: https://ACCOUNT_ID.signin.aws.amazon.com/console
  &#x2192; Enter IAM username + password + MFA code
  &#x2192; You&apos;re in.

STEP 4: If you need to reach a member account
  &#x2192; Click username (top-right) &#x2192; &quot;Switch Role&quot;
  &#x2192; Enter target Account ID + Role (BreakGlassAdmin or BreakGlassReadOnly)
  &#x2192; You&apos;re now inside the target account

STEP 5: Fix the issue
  &#x2192; Document EVERY action you take (timestamps + what you did + why)

STEP 6: Exit and secure
  &#x2192; Log out. Return MFA devices to storage.
  &#x2192; Notify team: &quot;Break Glass session ended. Normal access restored.&quot;

STEP 7: Post-incident
  &#x2192; Rotate the Break Glass password that was used
  &#x2192; Review CloudTrail logs for the session
  &#x2192; Write incident report
  &#x2192; Conduct post-mortem: Why was Break Glass needed? How to prevent it?
  
</code></pre>
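<p>For the console-login alerts described in Step 5, one option is an EventBridge rule on the default event bus that matches sign-ins by the Break Glass users. A sketch of the event pattern (user names are the ones from this guide; verify the exact field values against a sample event in your own account):</p><pre><code class="language-json">{
  &quot;detail-type&quot;: [&quot;AWS Console Sign In via CloudTrail&quot;],
  &quot;detail&quot;: {
    &quot;eventName&quot;: [&quot;ConsoleLogin&quot;],
    &quot;userIdentity&quot;: {
      &quot;userName&quot;: [&quot;BreakGlass-1&quot;, &quot;BreakGlass-2&quot;,
                   &quot;BreakGlass-Admin-1&quot;, &quot;BreakGlass-Admin-2&quot;]
    }
  }
}
</code></pre><p>Point the rule&apos;s target at an SNS topic that fans out to email, Slack/Teams, and PagerDuty.</p>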
<h2 id="%F0%9F%A7%AA-break-glass-drill">&#x1F9EA; Break Glass Drill</h2><ul><li>As discussed earlier, the Break Glass mechanism must remain functional at all times. To ensure that, you should conduct a <strong>Break Glass drill every 6 months (or on your own schedule)</strong></li><li>Drill checklist:<ul><li>&#x1F518; Notify security team that a drill is starting</li><li>&#x1F518; Retrieve Break Glass credentials from vault</li><li>&#x1F518; Successfully log in as Break Glass user</li><li>&#x1F518; Successfully switch role into a non-production member account</li><li>&#x1F518; Verify alerts fired (security team confirms receipt)</li><li>&#x1F518; Log out and return credentials</li><li>&#x1F518; Rotate the password used during the drill</li><li>&#x1F518; Document results: what worked, what didn&apos;t</li><li>&#x1F518; Update the runbook if anything was unclear</li></ul></li></ul><h2 id="%F0%9F%8E%AC-real-world-scenario-sso-is-down-production-is-on-fire">&#x1F3AC; Real-World Scenario: SSO is Down, Production is on Fire</h2><ul><li>Let me paint a real picture of how all of this comes together:</li></ul><pre><code class="language-bash">2:00 AM &#x2014; PagerDuty fires. Payment service returning 500 errors.

2:02 AM &#x2014; On-call SRE tries SSO portal. &quot;Service Unavailable.&quot;
          Can&apos;t access any AWS account.

2:05 AM &#x2014; SRE escalates to Cloud Lead: &quot;SSO is down. Need Break Glass.&quot;

2:07 AM &#x2014; Cloud Lead approves. Opens 1Password (SaaS &#x2014; not on AWS).
          Retrieves BreakGlass-1 password. Grabs YubiKey from drawer.

2:10 AM &#x2014; Logs into Management Account:
          https://111111111111.signin.aws.amazon.com/console
          Username: BreakGlass-1 | Password: *** | MFA: YubiKey

2:11 AM &#x2014; Switches Role to Production Account:
          Account: 222222222222 | Role: BreakGlassAdmin

2:13 AM &#x2014; Inside Production Account. Investigates the issue.
          Finds bad deployment. Initiates rollback.

2:20 AM &#x2014; Application recovers. 500 errors stop.

2:22 AM &#x2014; Logs out. Returns YubiKey to secure storage.

2:25 AM &#x2014; Posts in #incident: &quot;Production restored. Break Glass ended.&quot;

Next morning:
  &#x2192; Security team reviews CloudTrail logs
  &#x2192; BreakGlass-1 password rotated
  &#x2192; Incident report written
  &#x2192; Post-mortem: Why did SSO go down? How to prevent it?

</code></pre>
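<p>The next-morning CloudTrail review can be partially scripted. As a rough sketch, the snippet below filters CloudTrail records (loaded from a downloaded log file) for actions taken by Break Glass principals; the record layout follows CloudTrail&apos;s standard <code>Records</code> array, and the file name is only an example:</p><pre><code class="language-python">import json

BREAK_GLASS_USERS = {"BreakGlass-1", "BreakGlass-2",
                     "BreakGlass-Admin-1", "BreakGlass-Admin-2"}

def break_glass_events(records: list) -> list:
    """Return (eventTime, eventName, userName) for every CloudTrail
    record made by a Break Glass IAM user, sorted by time."""
    hits = []
    for r in records:
        user = r.get("userIdentity", {}).get("userName")
        if user in BREAK_GLASS_USERS:
            hits.append((r.get("eventTime"), r.get("eventName"), user))
    return sorted(hits)

# Example with inline records; normally you would load a log file:
#   records = json.load(open("trail-log.json"))["Records"]
sample = [
    {"eventTime": "2026-02-19T02:10:00Z", "eventName": "ConsoleLogin",
     "userIdentity": {"type": "IAMUser", "userName": "BreakGlass-1"}},
    {"eventTime": "2026-02-19T02:05:00Z", "eventName": "ListBuckets",
     "userIdentity": {"type": "IAMUser", "userName": "alice"}},
]
print(break_glass_events(sample))
# prints [('2026-02-19T02:10:00Z', 'ConsoleLogin', 'BreakGlass-1')]</code></pre><p>Every line in that output should map to an action documented during the incident; anything unexplained deserves a closer look.</p>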
<h2 id="%F0%9F%99%8C-conclusion">&#x1F64C; Conclusion</h2><p>Break Glass is one of those things you set up hoping you&apos;ll never use. But when you do need it, you&apos;ll be incredibly glad it&apos;s there.</p><p>I tried to cover all the important details and best practices, but covering everything in one blog is obviously not possible.</p><p>And that&apos;s a wrap! &#x1F64C;&#x1F942;</p><p>If you like my work, please message me on LinkedIn with <strong><em>&quot;Hi + your country name&quot;</em></strong></p><ul><li>&#x1F64B;&#x200D;&#x2642;&#xFE0F; Ankit Jodhani (Again, I&apos;m open to Kubernetes, Cloud, and DevOps projects)</li></ul><p>&#x1F4E8; Reach me at <a href="mailto:ankitjodhani1903@gmail.com"><strong>ankitjodhani1903@gmail.com</strong></a></p>
<!--kg-card-begin: html-->
<!DOCTYPE html>
<html lang="en">
<head>
<link href="https://unpkg.com/boxicons@2.1.4/css/boxicons.min.css" rel="stylesheet">
</head>
  <style>
 .social-box {
    display: flex;
    padding: 0px 100px;
    justify-content: space-between;
}
    
 .social-box a {
      font-size: 100px;
      text-decoration: none;
   
    }   
</style>
<body>
  <div class="social-box">
       <a class="social-links-ankit" href="https://www.linkedin.com/in/ankit-jodhani/?ref=blog.ankitjodhani.com"><i class="bx bxl-linkedin-square"> </i> </a>
    
           <a class="social-links-ankit" href="https://twitter.com/Ankit__Jodhani?ref=blog.ankitjodhani.com"><i class="bx bxl-twitter"></i> </a>
    
           <a class="social-links-ankit" href="https://github.com/AnkitJodhani?ref=blog.ankitjodhani.com"><i class="bx bxl-github"></i> </a>
  </div>
</body>
</html>

<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<div class="toc"></div>

<!--kg-card-end: html-->
]]></content:encoded></item><item><title><![CDATA[AWS S3 Cost Optimization: Automate Cleanup of Abandoned Buckets]]></title><description><![CDATA[This blog explores an event-driven architecture that automatically identifies and cleans up abandoned S3 buckets to optimize AWS costs]]></description><link>https://blog.ankitjodhani.com/aws-s3-cost-optimization-automate-cleanup-of-abandoned-buckets/</link><guid isPermaLink="false">680677eb419ce0de7bb30d3a</guid><category><![CDATA[Lambda]]></category><category><![CDATA[Cost Optimization]]></category><category><![CDATA[AWS]]></category><category><![CDATA[AWS S3]]></category><category><![CDATA[Automation]]></category><dc:creator><![CDATA[Ankit Jodhani]]></dc:creator><pubDate>Mon, 21 Apr 2025 19:33:51 GMT</pubDate><media:content url="https://cdn.ankitjodhani.com/2025/04/s3-1.gif" medium="image"/><content:encoded><![CDATA[<img src="https://cdn.ankitjodhani.com/2025/04/s3-1.gif" alt="AWS S3 Cost Optimization: Automate Cleanup of Abandoned Buckets"><p></p><h2 id="%F0%9F%99%8B%E2%80%8D%E2%99%82%EF%B8%8F-introduction">&#x1F64B;&#x200D;&#x2642;&#xFE0F; Introduction</h2><p>Hey folks! I&apos;m&#xA0;<a href="https://www.linkedin.com/in/ankit-jodhani/?ref=blog.ankitjodhani.com" rel="noreferrer">Ankit</a>, working as a Kubernetes Engineer at CirrOps and a newly minted AWS Community Builder. I&#x2019;m passionate about Cloud and Container technologies. But today, I&apos;m switching gears to talk about something equally important - Cost Optimization, more specifically about AWS S3 buckets.</p><h2 id="%F0%9F%93%9A-synopsis">&#x1F4DA; Synopsis</h2><ul><li>In most projects, we spin up a large number of AWS S3 buckets while developing an app or testing, or running in production. It&apos;s often the go-to solution for media-related applications. But the problem here is, when testing applications or in production environments, we end up creating tons of buckets that eventually get abandoned by users, testers, employees, or applications. 
</li><li>These buckets not only lead to unnecessary costs but also create a management mess. Deleting those buckets by hand is boring, easy to mess up, and frankly, the last thing anyone wants to do on a Friday evening. So I spent a weekend putting together an event&#x2011;driven cleanup workflow that removes or retains buckets automatically for us.</li></ul><h2 id="%F0%9F%A4%A9-tldr-the-%E2%80%9C30%E2%80%91second%E2%80%9D-version">&#x1F929; TL;DR (The &#x201C;30&#x2011;Second&#x201D; Version)</h2><table>
<thead>
<tr>
<th><strong>Condition</strong></th>
<th><strong>Action</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Bucket not accessed in the last N days and tag <code>autoDelete=True</code></td>
<td>Delete bucket (all versions, then the bucket itself)</td>
</tr>
<tr>
<td>Bucket not accessed in the last N days and tag <code>autoDelete=False</code></td>
<td>Ignore it and leave it as it is</td>
</tr>
<tr>
<td>Bucket not accessed in the last N days and no valid tag</td>
<td>Notify user &#x279C; URL to Keep (adds tag) or Delete (deletes bucket)</td>
</tr>
</tbody>
</table>
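<p>The decision table above boils down to a tiny function. Here is a minimal Python sketch of that logic (the tag values and the 30-day threshold follow the table; the function itself is illustrative, not the actual Lambda code):</p><pre><code class="language-python">from datetime import datetime, timedelta, timezone

INACTIVITY_DAYS = 30  # the "N days" threshold from the table

def decide(last_access: datetime, auto_delete_tag) -> str:
    """Return the cleanup action for a bucket:
    'delete', 'ignore', or 'notify'."""
    idle = datetime.now(timezone.utc) - last_access
    if idle.days >= INACTIVITY_DAYS:  # bucket looks abandoned
        if auto_delete_tag == "True":
            return "delete"   # opted in to automatic cleanup
        if auto_delete_tag == "False":
            return "ignore"   # explicitly retained
        return "notify"       # missing/invalid tag: ask the owner
    return "ignore"           # recently accessed, never touch it

old = datetime.now(timezone.utc) - timedelta(days=90)
print(decide(old, "True"))  # prints "delete"
print(decide(old, None))    # prints "notify"</code></pre><p>In the real workflow, <code>last_access</code> would come from the DynamoDB table populated by CloudTrail events, and the tag value from the bucket&apos;s <code>autoDelete</code> tag.</p>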
<h3 id="%F0%9F%92%A1high-level-solution">&#x1F4A1;High-Level Solution</h3><p>Imagine every bucket in your account carries a simple tag:</p><ul><li><code>autoDelete=True</code> or <code>autoDelete=False</code></li></ul><p>A scheduled job or script runs daily and checks each bucket&#x2019;s last access date. If a bucket hasn&#x2019;t seen any activity in the last  30 days and its tag is:</p><ul><li><strong>True</strong> &#x2192; Automatically delete the bucket and all of its contents.</li><li><strong>False</strong> &#x2192; Leave the bucket alone.</li><li><strong>Missing or invalid</strong> &#x2192; Send the bucket owner an email with two options:<ol><li><strong>Keep it</strong> (tag it <code>autoDelete=False</code>)</li><li><strong>Delete it</strong> (confirm deletion)</li></ol></li></ul><p>This approach makes sure that we only remove truly abandoned buckets and gives users one-click control over exceptions.</p><h2 id="%E2%9C%85-prerequisites">&#x2705; Prerequisites</h2><ul><li>&#x1F4CD; An AWS account with administrative privileges</li><li>&#x1F4CD; Basic familiarity with Python and Boto3</li><li>&#x1F4CD; Understanding of AWS Lambda, EventBridge, SNS, API Gateway, and DynamoDB</li></ul><h2 id="%F0%9F%93%A6-list-of-aws-services">&#x1F4E6; List of AWS services</h2><ul><li>&#x1FAA3; Amazon S3</li><li>&#x1F4E8; AWS SNS</li><li>&#x26C5; CloudTrail</li><li>&#x1F680; DynamoDB</li><li>&#x1F5A5;&#xFE0F; AWS Lambda </li><li>&#x1F504; AWS EventBridge</li><li>&#x1F30F; Amazon API Gateway</li></ul><h2 id="%F0%9F%8E%AF-architecture">&#x1F3AF; Architecture</h2><p>Let&apos;s dive into how this all works together. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2025/04/s3.gif" class="kg-image" alt="AWS S3 Cost Optimization: Automate Cleanup of Abandoned Buckets" loading="lazy" width="1280" height="1664"><figcaption><span style="white-space: pre-wrap;">S3 Cost Optimization</span></figcaption></figure><p>A key question you might have is:  How can we determine when a bucket was last accessed?  There are several approaches, but I personally prefer using a combination of CloudTrail, Lambda, and DynamoDB</p><p>Here&apos;s the breakdown of the architecture:</p><h3 id="%E2%9B%85-cloudtrail">&#x26C5; CloudTrail:</h3><ul><li>AWS doesn&#x2019;t expose LastAccessed for a bucket out of the box, but CloudTrail records every object&#x2011;level API call. </li><li>I configured a data&#x2011;event trail (yes, it costs a little extra) and pointed it at the logging bucket <code>s3EventLoggingStorage</code> </li><li>Each time an object is listed, uploaded, downloaded, or deleted, the event lands in EventBridge and triggers s3EventLogger, which writes:</li></ul><pre><code class="language-json">{
    &quot;BucketName&quot;: &quot;xyz-terraform&quot;,
    &quot;EventDateTime&quot;: &quot;2024-07-24T00:41:03Z&quot;,
    &quot;EventName&quot;: &quot;ListObjects&quot;,
    &quot;EventDate&quot;: &quot;2024-07-24&quot;,
    &quot;EventTime&quot;: &quot;00:41:03Z&quot;,
    &quot;Status&quot;: &quot;Active&quot;
}
</code></pre>
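<p>For illustration, here is a rough sketch of how <code>s3EventLogger</code> might build that record before writing it to the <code>s3DateLogger</code> table. The <code>to_item</code> helper is hypothetical; <code>eventTime</code>, <code>eventName</code>, and <code>requestParameters.bucketName</code> are standard CloudTrail fields that EventBridge delivers under <code>detail</code>:</p>

```python
def to_item(detail: dict) -> dict:
    """Build the DynamoDB item from the CloudTrail record's 'detail' section."""
    event_time = detail["eventTime"]            # e.g. "2024-07-24T00:41:03Z"
    date, time = event_time.split("T")
    return {
        "BucketName": detail["requestParameters"]["bucketName"],
        "EventDateTime": event_time,
        "EventName": detail["eventName"],
        "EventDate": date,
        "EventTime": time,
        "Status": "Active",
    }

def handler(event, context):
    # boto3 is available in the Lambda runtime; imported lazily here so the
    # module also loads in environments without the AWS SDK installed
    import boto3
    table = boto3.resource("dynamodb").Table("s3DateLogger")
    table.put_item(Item=to_item(event["detail"]))
```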
<h2 id="%F0%9F%94%84-eventbridge">&#x1F504; EventBridge:</h2><p>EventBridge has two important components in our solution:</p><ul><li>Rule: Created a rule in the event bus with an event pattern that captures every S3 bucket event (except for our CloudTrail logging bucket <code>s3EventLoggingStorage</code>)<ul><li>This rule triggers a Lambda function named <code>s3EventLogger</code>  when the pattern matches.</li><li>The event pattern configuration looks like this:</li></ul></li></ul><pre><code class="language-json">{
  &quot;source&quot;: [&quot;aws.s3&quot;],
  &quot;detail-type&quot;: [&quot;AWS API Call via CloudTrail&quot;],
  &quot;detail&quot;: {
    &quot;eventSource&quot;: [&quot;s3.amazonaws.com&quot;],
    &quot;eventName&quot;: [&quot;ListObjects&quot;, &quot;ListObjectVersions&quot;, &quot;PutObject&quot;, &quot;GetObject&quot;, &quot;HeadObject&quot;, &quot;CopyObject&quot;, &quot;GetObjectAcl&quot;, &quot;PutObjectAcl&quot;, &quot;CreateMultipartUpload&quot;, &quot;ListParts&quot;, &quot;UploadPart&quot;, &quot;CompleteMultipartUpload&quot;, &quot;AbortMultipartUpload&quot;, &quot;UploadPartCopy&quot;, &quot;RestoreObject&quot;, &quot;DeleteObject&quot;, &quot;DeleteObjects&quot;, &quot;GetObjectTorrent&quot;, &quot;SelectObjectContent&quot;, &quot;PutObjectLockRetention&quot;, &quot;PutObjectLockLegalHold&quot;, &quot;GetObjectLockRetention&quot;, &quot;GetObjectLockLegalHold&quot;],
    &quot;requestParameters&quot;: {
      &quot;bucketName&quot;: [{
        &quot;anything-but&quot;: [&quot;s3EventLoggingStorage&quot;]
      }]
    }
  }
}

</code></pre>
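<p>If you prefer to create the rule programmatically instead of through the console, a Boto3 sketch could look like the following. The rule name and target ID are illustrative, the Lambda ARN is passed in by the caller, and the event-name list is abridged to a few entries (the real rule lists every object-level event name shown above):</p>

```python
import json

def build_pattern(logging_bucket: str) -> dict:
    """Rebuild the event pattern above, excluding the CloudTrail logging bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            # abridged; the full rule lists every object-level event name
            "eventName": ["ListObjects", "PutObject", "GetObject", "DeleteObject"],
            "requestParameters": {
                "bucketName": [{"anything-but": [logging_bucket]}]
            },
        },
    }

def create_rule(lambda_arn: str):
    import boto3  # lazy import so the module loads without the AWS SDK
    events = boto3.client("events")
    events.put_rule(
        Name="s3-bucket-activity",
        EventPattern=json.dumps(build_pattern("s3EventLoggingStorage")),
    )
    events.put_targets(
        Rule="s3-bucket-activity",
        Targets=[{"Id": "s3EventLogger", "Arn": lambda_arn}],
    )
```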
<ul><li>Scheduler: Executes the <code>s3Scanner</code> Lambda function daily at 9:00 AM </li></ul><h3 id="%F0%9F%96%A5%EF%B8%8F-lambda">&#x1F5A5;&#xFE0F; Lambda</h3><ul><li>We have a total of 3 Lambda functions working together:</li><li>1) <code>s3EventLogger</code>:<ul><li>Primary purpose: Record when any bucket receives any kind of API call </li><li>Collects event data such as BucketName, EventDate, etc. </li><li>Stores this data in a DynamoDB table called <code>s3DateLogger</code></li></ul></li><li>2) <code>s3Scanner</code>:<ul><li>Triggered every day at 9:00 AM UTC via an EventBridge schedule.</li><li>Lists all buckets in the account.</li><li>For each bucket:<ul><li>Fetch the last access date from DynamoDB (if no record exists in DynamoDB, then create an entry and set the date to 15 days in the future as a grace period - useful if you have older buckets)</li><li>Calculate days since the last access </li><li>Retrieve the <code>autoDelete</code> tag (if any)<ul><li>Decision logic:<ul><li><code>autoDelete=True</code> &amp;&amp; days &#x2265; 30 &#x2192; Delete all object versions, then the bucket</li><li><code>autoDelete=False</code> &amp;&amp; days &#x2265; 30 &#x2192; Skip deletion</li></ul></li></ul></li><li>No valid tag &amp;&amp; days &#x2265; 30 &#x2192; Publish a notification to SNS </li></ul></li></ul></li><li>3) <code>userHandler</code>:<ul><li>It will be triggered by the API Gateway </li><li>It either deletes the bucket or adds the <code>autoDelete=False</code> tag based on user choice</li></ul></li></ul><h3 id="%F0%9F%93%A7-sns">&#x1F4E7; SNS:</h3><ul><li>SNS sends an email to the user with two links:<ul><li>Keep It: Calls the API Gateway endpoint <code>?bucket_name={bucket_name}&amp;action=keep</code></li><li>Delete It: Calls the API Gateway endpoint <code>?bucket_name={bucket_name}&amp;action=delete</code></li></ul></li></ul><h3 id="%F0%9F%8C%8F-api-gateway">&#x1F30F; API Gateway: </h3><ul><li>Provides endpoints for users to respond to notifications, triggering the 
<code>userHandler</code> Lambda</li></ul><h2 id="%F0%9F%A7%91%E2%80%8D%F0%9F%92%BB-source-code">&#x1F9D1;&#x200D;&#x1F4BB; Source Code</h2><p>The GitHub repository contains:</p><ul><li><a href="https://github.com/AnkitJodhani/s3CostOptimization?ref=blog.ankitjodhani.com" rel="noreferrer">Link</a>:  Source code for all three Lambda functions</li><li><em>Note:</em> This was a weekend project, so the code has room for improvement. If you&#x2019;re a beginner, try extending it to fit your own environment. &#x1F60A;</li></ul><div class="kg-card kg-button-card kg-align-center"><a href="https://github.com/AnkitJodhani/s3CostOptimization?ref=blog.ankitjodhani.com" class="kg-btn kg-btn-accent">&#x1F446; GitHub Repository</a></div><h2 id="%F0%9F%99%8C-conclusion">&#x1F64C; Conclusion</h2><p>We explored how to automate S3 cleanup with minimal human intervention and potentially save a bunch of dollars in the process. Give it a try over the weekend or at your convenience. I hope you enjoyed this blog as much as I enjoyed creating it.</p><p>And that&#x2019;s a wrap! &#x1F64C;&#x1F942;</p><p>If you liked my work, please message me on LinkedIn with &quot;Hi + your country name&quot;</p><ul><li>&#x1F64B;&#x200D;&#x2642;&#xFE0F; Ankit Jodhani</li></ul><p>&#x1F4E8; Reach me at&#xA0;<a href="mailto:ankitjodhani1903@gmail.com" rel="nofollow noopener"><strong>ankitjodhani1903@gmail.com</strong></a></p>
<!--kg-card-begin: html-->
<!DOCTYPE html>
<html lang="en">
<head>
<link href="https://unpkg.com/boxicons@2.1.4/css/boxicons.min.css" rel="stylesheet">
</head>
  <style>
 .social-box {
    display: flex;
    padding: 0px 100px;
    justify-content: space-between;
}
    
 .social-box a {
      font-size: 100px;
      text-decoration: none;
   
    }   
</style>
<body>
  <div class="social-box">
       <a class="social-links-ankit" href="https://www.linkedin.com/in/ankit-jodhani/?ref=blog.ankitjodhani.com"><i class="bx bxl-linkedin-square"> </i> </a>
    
           <a class="social-links-ankit" href="https://twitter.com/Ankit__Jodhani?ref=blog.ankitjodhani.com"><i class="bx bxl-twitter"></i> </a>
    
           <a class="social-links-ankit" href="https://github.com/AnkitJodhani?ref=blog.ankitjodhani.com"><i class="bx bxl-github"></i> </a>
  </div>
</body>
</html>

<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<div class="toc"></div>

<!--kg-card-end: html-->
]]></content:encoded></item><item><title><![CDATA[Ultimate Guide to Monitoring & Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail]]></title><description><![CDATA[Installation and configuration of Prometheus Operator, Grafana, Loki, and Promtail to ensure seamless application performance and effective log management. Learn to create custom metrics, set up alerting mechanisms, and visualize data through the Grafana dashboard monitoring solution]]></description><link>https://blog.ankitjodhani.com/ultimate-guide-monitoring-logging-aws-eks-prometheus-grafana-loki-promtail/</link><guid isPermaLink="false">66848ec0a44c003edac66d1e</guid><category><![CDATA[EKS]]></category><category><![CDATA[Docker]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[Monitoring]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Prometheus]]></category><category><![CDATA[Grafana]]></category><category><![CDATA[Loki]]></category><category><![CDATA[Promtail]]></category><dc:creator><![CDATA[Ankit Jodhani]]></dc:creator><pubDate>Sun, 07 Jul 2024 15:34:25 GMT</pubDate><media:content url="https://cdn.ankitjodhani.com/2024/07/architecture.gif" medium="image"/><content:encoded><![CDATA[<h2 id="%F0%9F%99%8B%E2%80%8D%E2%99%82%EF%B8%8F-introduction">&#x1F64B;&#x200D;&#x2642;&#xFE0F; Introduction</h2><img src="https://cdn.ankitjodhani.com/2024/07/architecture.gif" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail"><p>Hello everyone, I&apos;m <a href="https://www.linkedin.com/in/ankit-jodhani/?ref=blog.ankitjodhani.com" rel="noreferrer">Ankit Jodhani</a>, a DevOps engineer passionate about Cloud and Container technologies. 
This blog is part of the #10weeksofcloudops series initiated by <a href="https://www.linkedin.com/in/piyush-sachdeva/?ref=blog.ankitjodhani.com" rel="noreferrer">Piyush Sachdeva</a>.</p><h2 id="%F0%9F%93%9A-synopsis">&#x1F4DA; Synopsis</h2><p>In this blog, we will implement monitoring and logging from scratch on an AWS EKS cluster using open-source tools like Prometheus, Grafana, Loki, and Promtail. We will also discuss writing custom metrics in a Node.js application, making them scrapable by Prometheus, setting up Alertmanager to receive email alerts, and designing custom dashboards in Grafana.</p><p>Once the monitoring part is over, we will move on to logging. We will set up Promtail and Loki to collect and aggregate logs, and finally visualize our logs using Grafana dashboards.</p><h3 id="%F0%9F%94%B8-story">&#x1F538; Story</h3><ul><li>Run Terraform scripts to create an EKS Cluster with necessary components.</li><li>Instrumentation: Understand prom-client to write custom metrics in a Node.js app and dockerize it.</li><li>Deploy the app on Kubernetes and make it accessible over the internet.</li><li>Install Prometheus Operator and Grafana on the EKS Cluster.</li><li>Configure Alerting rules, Service Monitors, and AlertManager for email alerts.</li><li>Create dashboards in Grafana to visualize the performance of the cluster and application.</li><li>Install Loki on the EKS Cluster and configure it with AWS S3 for log storage.</li><li>Install Promtail on the EKS Cluster and configure it to send logs to Loki.</li><li>Configure Grafana to display application logs.</li><li>Clean up all the infrastructure.</li></ul><h2 id="%E2%9C%85-prerequisites">&#x2705; Prerequisites</h2><ul><li>&#x1F4CC; AWS Account</li><li>&#x1F4CC; Basic knowledge of Terraform</li><li>&#x1F4CC; Basic knowledge of Docker</li><li>&#x1F4CC; Basic knowledge of Node.js (good to have)</li></ul><h2 id="%F0%9F%96%A5%EF%B8%8F-local-setup">&#x1F5A5;&#xFE0F; Local setup</h2><div class="kg-card kg-callout-card 
kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Ensure Terraform, Helm, and AWS CLI are installed on your computer. AWS CLI should be configured with admin privileges to avoid permission issues.</div></div><h2 id="%F0%9F%93%A6-list-of-aws-services">&#x1F4E6; List of AWS services</h2><ul><li>&#x1F451; Amazon EKS</li><li>&#x1F310; Amazon VPC</li><li>&#x1F512; Amazon IAM</li><li>&#x1F4BB; Amazon EC2</li><li>&#x2696;&#xFE0F; Amazon Autoscaling</li><li>&#x1FAA3;Amazon S3</li><li>&#x1F680; DynamoDB</li></ul><h2 id="%E2%98%B8%EF%B8%8F-list-of-kubernetes-tools-drivers">&#x2638;&#xFE0F; List of Kubernetes Tools &amp; Drivers</h2><ul><li>&#x1F4C2; EBS CSI Driver (EKS Addon)</li><li>&#x1F4CA; Helm charts</li><li>&#x1F9D1;&#x200D;&#x1F3ED; Prometheus Operator (using kube-prometheus-stack helm chart)</li><li>&#x1F514; Alertmanager (using kube-prometheus-stack helm chart)</li><li>&#x1F4BB; Grafana (using kube-prometheus-stack helm chart)</li><li>&#x1F5C3;&#xFE0F; Loki (using grafana helm chart)</li><li>&#x1F50E; Promtail (using grafana helm chart)</li></ul><h2 id="%E2%98%B8%EF%B8%8F-monitoring">&#x2638;&#xFE0F; Monitoring</h2><ul><li>Monitoring involves tracking the performance of your application and resources, and sending alerts when something is running slowly or failing, to prevent issues from escalating.</li></ul><h3 id="%F0%9F%93%8A-prometheus">&#x1F4CA; Prometheus</h3><ul><li>It is an open-source monitoring tool that tracks your workload and stores all your metrics in a time-series database.</li><li>We use PromQL to query the metrics</li><li>In this blog, we&apos;ll store data inside an AWS EBS volume.</li></ul><h3 id="%F0%9F%93%A2-alert-manager">&#x1F4E2; Alert manager</h3><ul><li>Alert Manager is a Prometheus component responsible for sending alerts to users.</li></ul><h2 id="%F0%9F%93%98-logging">&#x1F4D8; Logging </h2><ul><li>Logging helps you see what&apos;s happening inside your cluster, nodes, and how your application 
behaves in response to different requests and components, aiding in troubleshooting errors or bugs.</li></ul><h3 id="%F0%9F%93%9Cpromtail">&#x1F4DC;Promtail </h3><ul><li>Promtail is an open-source tool created by Grafana Labs. It collects all container logs and sends them to Loki.</li></ul><h3 id="%F0%9F%94%97-loki">&#x1F517; Loki</h3><ul><li>Loki is also an open-source tool designed and developed by Grafana Labs. It consumes data sent by Promtail or other tools, processes, and filters it.</li><li>We use LogQL to query the logs from Loki.</li><li>Loki can be integrated with many cloud services; in this blog, we&apos;ll use an AWS S3 bucket to store the logs. </li></ul><h3 id="%F0%9F%96%A5%EF%B8%8F-grafana">&#x1F5A5;&#xFE0F; Grafana</h3><ul><li>Grafana is a visualization tool commonly used for monitoring and logging. </li><li>Grafana can be integrated with Prometheus, Loki, and many other tools to create beautiful dashboards. </li><li>Grafana will query Prometheus &amp; Loki to get the metrics and logs.</li></ul><h2 id="%F0%9F%8E%AF-architecture">&#x1F3AF; Architecture</h2><p>Let&apos;s understand the architecture of the project. Understanding the architecture makes it easier to proceed with the practical steps.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/architecture-1.gif" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="945" height="1080"><figcaption><span style="white-space: pre-wrap;">Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail</span></figcaption></figure><ul><li>As you can see in the architecture, Prometheus scrapes metrics from the application and cluster and stores them in AWS EBS Volumes to keep them persistent in case of pod failure. 
Similarly, Grafana &amp; Alertmanager will also store their data inside EBS volumes.</li><li>Promtail will collect all the logs from the nodes (application logs + component logs) and send those logs to Loki.  </li><li>Loki will aggregate &amp; process the logs and send them to the AWS S3 bucket. </li><li>Grafana will query Prometheus and Loki for metrics and logs.</li></ul><h2 id="%F0%9F%9A%80-step-by-step-guide">&#x1F680; Step-by-Step Guide</h2><h3 id="%F0%9F%92%BB-clone-the-repository">&#x1F4BB; Clone the repository</h3><ul><li>Please clone the<a href="https://github.com/AnkitJodhani/eks-monitoring-and-logging.git?ref=blog.ankitjodhani.com" rel="noreferrer"><u>&#xA0;</u>GitHub repository</a>&#xA0;on your local computer.</li></ul><pre><code class="language-bash">git clone https://github.com/AnkitJodhani/eks-monitoring-and-logging.git

cd eks-monitoring-and-logging
</code></pre>
<ul><li>Below you can see the directory structure and the purpose of each directory </li></ul><pre><code class="language-bash">&#x1F4C2;eks-monitoring-and-logging
&#x251C;&#x2500;&#x2500;&#x1F4C1;app-code
&#x2502;   &#x2514;&#x2500;&#x2500; (Code of nodejs application)
&#x251C;&#x2500;&#x2500;&#x1F4C1;app-k8s-manifest
&#x2502;   &#x2514;&#x2500;&#x2500; (Contents of kubernetes manifest files for nodejs app)
&#x251C;&#x2500;&#x2500;&#x1F4C1;eks-terraform
&#x2502;   &#x2514;&#x2500;&#x2500; (Contains Terraform script to create AWS EKS cluster)
&#x251C;&#x2500;&#x2500;&#x1F4C1;grafana-dashboard
&#x2502;   &#x2514;&#x2500;&#x2500; (Contains json file for grafana dashboard )
&#x251C;&#x2500;&#x2500;&#x1F4C1;kube-prometheus-stack
&#x2502;   &#x2514;&#x2500;&#x2500; (Kubernetes manifest file for prometheus operator)
&#x251C;&#x2500;&#x2500;&#x1F4C1;loki-promtail-stack
&#x2502;   &#x2514;&#x2500;&#x2500; (Contents of Loki &amp; Promtail)
&#x251C;&#x2500;&#x2500;&#x1F63A;.gitignore
&#x251C;&#x2500;&#x2500;&#x1F4C4;readme.md
&#x2514;&#x2500;&#x2500;&#x1F4C4;test.sh
</code></pre>
<h3 id="%F0%9F%A7%91%E2%80%8D%F0%9F%92%BB-instrumentation">&#x1F9D1;&#x200D;&#x1F4BB; Instrumentation</h3><ul><li>Instrumentation is the process of making code changes in the application to write custom metrics &amp; expose metrics.</li><li>Instrumentation helps in Monitoring Performance + gaining insight of the application </li><li>I already created a demo nodejs app to demonstrate the Instrumentation. you will find the code inside <code>app-code</code> directory. </li><li>Please read <code>index.js</code> file. here I&apos;ll share a brief overview of the code<ul><li>Express Setup: Initializes an Express application and sets up logging with Morgan.</li><li>Logging with Pino: Defines a custom logging function using Pino for structured logging.</li><li>Prometheus Metrics with prom-client: Integrates Prometheus for monitoring HTTP requests using the <code>prom-client</code> library:<ul><li><code>http_requests_total</code> counter</li><li><code>http_request_duration_seconds</code> histogram</li><li><code>http_request_duration_summary_seconds</code> summary</li><li><code>node_gauge_example</code> gauge for tracking async task duration</li></ul></li></ul></li><ul><li>Basic Routes:<ul><li><code>/</code> : Returns a &quot;Running&quot; status.</li><li><code>/healthy</code>: Returns the health status of the server.</li><li><code>/serverError</code>: Simulates a 500 Internal Server Error.</li><li><code>/notFound</code>: Simulates a 404 Not Found error.</li><li><code>/logs</code>: Generates logs using the custom logging function.</li><li><code>/crash</code>: Simulates a server crash by exiting the process.</li><li><code>/example</code>: Tracks async task duration with a gauge.</li><li><code>/metrics</code>: Exposes Prometheus metrics endpoint.</li></ul></li></ul><li>After adding the required metrics, Dockerize the application and push it to the container registry. 
In my case, I pushed it to the docker hub.</li></ul><h3 id="%F0%9F%91%91-eks-cluster-using-terraform">&#x1F451; EKS Cluster using Terraform</h3><ul><li>Now, let&apos;s go ahead and spin up the EKS Cluster.</li></ul><pre><code class="language-bash">cd eks-terraform/main
</code></pre>
<ul><li>In this directory, you will find all the config files for Terraform like <code>backend.tf</code> <code>terraform.tfvars</code> etc., and you can modify them based on your requirements, but the default settings will work fine for this project.</li><li>Initialize Terraform</li></ul><pre><code class="language-bash">terraform init
</code></pre>
<ul><li>Validate the script</li></ul><pre><code class="language-bash">terraform validate
</code></pre>
<ul><li>See the plan of what Terraform is going to install for us</li></ul><pre><code class="language-bash">terraform plan
</code></pre>
<ul><li>VPC, IAM Roles, EKS Cluster + Managed NodeGroup, EBS CSI driver (using AWS Addon) + IRSA (IAM role for service account)</li><li>Now, let&apos;s execute the terraform to create the AWS EKS Cluster</li></ul><pre><code class="language-bash">terraform apply --auto-approve
</code></pre>
<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x2623;&#xFE0F;</div><div class="kg-callout-text">Terraform takes approximately 20-30 minutes. So enjoy the automation &#x1F600; &#x2026;.</div></div><p>Once the above command completes successfully, you will have an EKS cluster running. Let&apos;s head over to the AWS console to verify that. </p><figure class="kg-card kg-gallery-card kg-width-wide kg-card-hascaption"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://cdn.ankitjodhani.com/2024/07/2.png" width="1715" height="797" loading="lazy" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail"></div><div class="kg-gallery-image"><img src="https://cdn.ankitjodhani.com/2024/07/3.png" width="2353" height="245" loading="lazy" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail"></div></div><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://cdn.ankitjodhani.com/2024/07/4.png" width="2239" height="299" loading="lazy" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail"></div><div class="kg-gallery-image"><img src="https://cdn.ankitjodhani.com/2024/07/5.png" width="2521" height="264" loading="lazy" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail"></div></div></div><figcaption><p><span style="white-space: pre-wrap;">AWS EKS Cluster using Terraform</span></p></figcaption></figure><ul><li>Let&apos;s update the <code>.kube/config</code> file to connect with the cluster</li></ul><pre><code class="language-bash">aws eks list-clusters --region us-east-1

aws eks update-kubeconfig --name monitoring-alerting-logging-eks-cluster --region us-east-1
</code></pre>
<ul><li>Now, we can review the K8s components </li></ul><pre><code class="language-bash">kubectl get all -n kube-system
</code></pre>
<h3 id="%F0%9F%A7%91%E2%80%8D%F0%9F%9A%80-deploy-nodejs-app">&#x1F9D1;&#x200D;&#x1F680; Deploy Nodejs app</h3><ul><li>Our EKS Cluster is running, and now we can deploy our Node.js application.</li><li>You will find the Kubernetes manifest files in the <code>app-k8s-manifest</code> directory. </li><li>You might want to change the image name in the <code>app-k8s-manifest/deployment.yml</code> file instead of going with <code>ankitjodhani/prometheus:learning</code>.</li><li>The <code>app-k8s-manifest/service.yml</code> will create a LoadBalancer to expose the app on the internet. Apply the file:</li></ul><pre><code class="language-bash">kubectl apply -k app-k8s-manifest/
</code></pre>
<ul><li>Head over to the AWS console and verify the load balancer (Classic Load Balancer).</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/07/6.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2278" height="804"></figure><ul><li>Now, you can take the DNS name of the Load Balancer and visit the website.</li><li>It&apos;s good to generate a load using an automated script. In the root directory, you will find <code>test.sh</code>, which will generate the load by sending a lot of requests.</li><li>So, open another new terminal and execute the below command. just like shown in the below image.</li></ul><pre><code class="language-bash">./test.sh YOUR_LOAD_BALANCER_DNS_NAME
</code></pre>
<figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/07/7.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1178" height="475"></figure><ul><li>Note: Keep running the <code>test.sh</code> and don&apos;t kill the terminal for a while.</li></ul><h3 id="%E2%9A%93-install-the-helm-chart">&#x2693; Install the Helm chart</h3><ul><li>Execute the below commands and install helm charts</li></ul><pre><code class="language-bash">helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
</code></pre>
<h3 id="%E2%9A%92%EF%B8%8F-install-configure-prometheus-operator">&#x2692;&#xFE0F; Install &amp; Configure Prometheus Operator</h3><ul><li>Now, let&apos;s install the Prometheus operator in the AWS EKS Cluster using the helm chart.</li><li>You will find all the Prometheus-related manifest files inside <code>kube-prometheus-stack</code> directory. </li></ul><pre><code class="language-bash">kubectl create ns monitoring

helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring -f kube-prometheus-stack/custom_kube_prometheus_stack.yml
</code></pre>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/8.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1725" height="383"><figcaption><span style="white-space: pre-wrap;">Install Prometheus Operator on AWS EKS Cluster</span></figcaption></figure><ul><li>It&apos;s time to configure custom Alerts, an Alertmanager to receive emails, and a ServiceMonitor to scrape our application metrics.</li><li>Before configuring Alertmanager, we need credentials to send emails. For this blog, I&apos;m using Gmail, but any SMTP provider like AWS SES can be used.  so let&apos;s grab the credentials for that.</li><li>Open your Google account settings and search <code>App password</code>  &amp; create a new password. </li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/07/9.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="985" height="488"></figure><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/07/10.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="821" height="704"></figure><ul><li>Copy the newly created password. 
it should be like <code>uhnlqkdhnirpqfpy</code></li></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4DB;</div><div class="kg-callout-text">Ensure there are NO spaces in the password.</div></div><ul><li>Convert that password into<strong> base64 format.</strong> </li><li>Now, put your password in the <code>kube-prometheus-stack/email-secret.yml</code> and add your email ID to the <code>kube-prometheus-stack/alertmanagerconfig.yml</code> instead of mine.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/07/11.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1995" height="1335"></figure><ul><li>You can also take a look at <code>kube-prometheus-stack/alerts.yml</code> file to see the Rules that I&apos;ve set for the alerts.  <ul><li>Send an alert when the average node CPU is higher than 50%.</li><li>Send an alert when a POD restarts more than 2 times.</li></ul></li><li>Next, we will configure the Service Monitor to scrape the metrics from our Node.js application. </li><li>Refer to <code>kube-prometheus-stack/serviceMonitor.yml</code> for the configuration.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/12.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1986" height="1326"><figcaption><span style="white-space: pre-wrap;">configure service monitor with Prometheus operator</span></figcaption></figure><ul><li>It&apos;s time to apply all these configurations. Execute below command</li></ul><pre><code class="language-bash">kubectl apply -k kube-prometheus-stack/
</code></pre>
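<p>To give a concrete picture of what such rules look like, here is a sketch of the pod-restart alert in <code>PrometheusRule</code> form. The metadata, labels, and exact expression here are illustrative; see <code>kube-prometheus-stack/alerts.yml</code> in the repo for the real file:</p>

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    release: monitoring   # must match the Helm release so the operator discovers it
spec:
  groups:
    - name: app.rules
      rules:
        - alert: PodRestartingTooOften
          # fires once a container has restarted more than 2 times
          expr: kube_pod_container_status_restarts_total > 2
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} has restarted more than 2 times"
```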
<ul><li>We need to wait for a couple of minutes for the Prometheus operator to reload its configuration.</li><li>Now, let&apos;s visit the Prometheus UI by running the following command and visiting <code>http://localhost:9090</code>.</li></ul><pre><code>kubectl port-forward -n monitoring service/prometheus-operated 9090:9090
</code></pre>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/13.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2559" height="438"><figcaption><span style="white-space: pre-wrap;">Prometheus Operator</span></figcaption></figure><ul><li>To check the applied rules, click on the <code>Alerts</code> button at the top.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/14.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2550" height="763"><figcaption><span style="white-space: pre-wrap;">Configure Alert rules in Prometheus Operator</span></figcaption></figure><ul><li>Verify the target configuration by clicking the <code>Targets</code> button from the drop-down menu.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/15.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2560" height="1072"><figcaption><span style="white-space: pre-wrap;">Configure service monitor in Prometheus Operator</span></figcaption></figure><ul><li>Let&apos;s access the Alertmanager UI to see the alert configurations. Run the following command and visit <code>http://localhost:9093</code>.</li></ul><pre><code class="language-bash">kubectl port-forward -n monitoring service/alertmanager-operated 9093:9093
</code></pre>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/16.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1177" height="545"><figcaption><span style="white-space: pre-wrap;">Configure Alertmanager in Prometheus Operator</span></figcaption></figure><ul><li>Click on the <code>Status</code> button at the top to see the applied configurations.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/17.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1179" height="1416"><figcaption><span style="white-space: pre-wrap;">Configure Alermanger in Prometheus Operator</span></figcaption></figure><ul><li>Now, let&apos;s crash the Node.js app twice to receive alerts from Alertmanager.</li><li>The Nodejs app has a route <code>/crash</code>, which crashes the container, and Kubernetes automatically restarts it. However, if the app crashes more than 2 times, Alertmanager will send an alert to our email.</li><li>let&apos;s see that practically </li></ul><pre><code class="language-bash">http://YOUR_LOAD_BALANCER_DNS_NAME/crash
</code></pre>
<ul><li>Keep hitting the above endpoint until Kubernetes restarts at least 3 times.</li></ul><pre><code class="language-bash">kubectl get pods
</code></pre>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/18.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1472" height="138"><figcaption><span style="white-space: pre-wrap;">Prometheus Operator</span></figcaption></figure><ul><li>Check the alert in the firing state by running:</li></ul><pre><code class="language-bash">kubectl port-forward -n monitoring service/prometheus-operated 9090:9090
</code></pre>
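For context, alerts like this one are defined through a PrometheusRule resource that the operator picks up. Below is an illustrative sketch of a restart alert; the names, labels, and threshold are assumptions, so adjust them to match the rule actually shipped with this project's manifests.

```yaml
# Illustrative PrometheusRule (names and threshold are placeholders).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nodejs-restart-alert
  namespace: monitoring
  labels:
    release: monitoring   # must match the kube-prometheus-stack rule selector labels
spec:
  groups:
    - name: nodejs.rules
      rules:
        - alert: PodRestartingTooOften
          # fire when a container restarted more than 2 times within 5 minutes
          expr: increase(kube_pod_container_status_restarts_total{namespace="default"}[5m]) > 2
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting too often"
```

Because the rule carries the `release` label, the operator's rule selector picks it up automatically after `kubectl apply`.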
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/19.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2559" height="585"><figcaption><span style="white-space: pre-wrap;">Alert rules in Prometheus Operator</span></figcaption></figure><ul><li>Verify Alertmanager received an alert from Prometheus:</li></ul><pre><code class="language-bash">kubectl port-forward -n monitoring service/alertmanager-operated 9093:9093
</code></pre>
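The email notifications and the 5-minute repeat interval come from the Alertmanager configuration. A minimal sketch of such a receiver is shown below; the SMTP host, addresses, and credentials are all placeholders for your own provider's values.

```yaml
# Illustrative Alertmanager config (all addresses/credentials are placeholders).
route:
  receiver: email-notifications
  repeat_interval: 5m            # resend the notification every 5 minutes while firing
receivers:
  - name: email-notifications
    email_configs:
      - to: you@example.com
        from: alerts@example.com
        smarthost: smtp.example.com:587
        auth_username: alerts@example.com
        auth_password: YOUR_SMTP_PASSWORD
        send_resolved: true      # also notify when the alert resolves
```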
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/20.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1222" height="816"><figcaption><span style="white-space: pre-wrap;">Alertmanager configured in Prometheus Operator</span></figcaption></figure><ul><li>You should receive an email notification at the email address you configured.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/21.0.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2274" height="767"><figcaption><span style="white-space: pre-wrap;">Configure Alertmanager in Prometheus Operator</span></figcaption></figure><ul><li>We configured it to send emails every 5 minutes until the issue is resolved.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/21.1.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2283" height="151"><figcaption><span style="white-space: pre-wrap;">Configure Alertmanager in Prometheus Operator</span></figcaption></figure><ul><li>Now, it&apos;s time to visualize our metrics on a beautiful dashboard. Thankfully, the <code>kube-prometheus-stack</code> Helm chart automatically installs Grafana, so we don&apos;t need to install it separately. Access the Grafana UI at <code>http://localhost:8000</code>:</li></ul><pre><code class="language-bash">kubectl port-forward -n monitoring service/monitoring-grafana 8000:80
</code></pre>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/22.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2559" height="1351"><figcaption><span style="white-space: pre-wrap;">Grafana in Prometheus operator</span></figcaption></figure><ul><li>You will see many pre-built dashboards. You can utilize them for monitoring or design/import your own.</li><li>Import the dashboard I created for the Node.js app, available in the <code>grafana-dashboard</code> directory.</li><li>Click on the <code>New</code> button at the top right, select <code>Import</code> from the drop-down menu, and import the dashboard.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/23.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1142" height="719"><figcaption><span style="white-space: pre-wrap;">Grafana in Prometheus Operator</span></figcaption></figure><ul><li>Once imported, you will see a screen similar to mine, as shown below, if you haven&apos;t stopped the <code>test.sh</code> (load generator script).</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/24.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2559" height="1428"><figcaption><span style="white-space: pre-wrap;">Grafana dashboard in Prometheus Operator</span></figcaption></figure><ul><li>This is how we can monitor our application, other components, and clusters from Grafana. 
</li></ul><h3 id="%E2%9A%92%EF%B8%8F-install-configure-loki">&#x2692;&#xFE0F; Install &amp; configure Loki </h3><ul><li>We&apos;ve set up monitoring; now let&apos;s configure Loki and Promtail for logging.</li><li>We already added the Grafana Helm repo in the previous step, which includes both Loki and Promtail.</li><li>We want Loki to store logs in an AWS S3 bucket, so it needs a bucket and the permissions to write logs to it.</li><li>Head over to the AWS S3 console and create a bucket with a unique name. </li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/25.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2533" height="493"><figcaption><span style="white-space: pre-wrap;">Configure Loki to send logs to AWS S3 Bucket</span></figcaption></figure><ul><li>Next, create an IAM policy in the AWS console. You can find the policy in <code>loki-promtail-stack/aws-s3-policy.json</code>, but remember to add your bucket&apos;s ARN.</li></ul><pre><code class="language-json">{
    &quot;Version&quot;: &quot;2012-10-17&quot;,
    &quot;Statement&quot;: [
        {
            &quot;Sid&quot;: &quot;Stmt1719324853777&quot;,
            &quot;Action&quot;: [
                &quot;s3:ListBucket&quot;,
                &quot;s3:GetBucketLocation&quot;
            ],
            &quot;Effect&quot;: &quot;Allow&quot;,
            &quot;Resource&quot;: &quot;ARN_OF_YOUR_BUCKET&quot;
        },
        {
            &quot;Sid&quot;: &quot;Stmt1719324853778&quot;,
            &quot;Action&quot;: &quot;s3:*&quot;,
            &quot;Effect&quot;: &quot;Allow&quot;,
            &quot;Resource&quot;: &quot;ARN_OF_YOUR_BUCKET/*&quot;
        }
    ]
}

</code></pre>
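If you prefer working from the terminal, you can substitute your bucket ARN and sanity-check the policy JSON locally before creating it in the IAM console. The ARN and file path below are placeholders.

```shell
# Build the policy with your bucket ARN and validate the JSON locally.
BUCKET_ARN="arn:aws:s3:::YOUR_BUCKET_NAME"   # placeholder: your bucket's ARN
cat > /tmp/loki-s3-policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Effect": "Allow",
            "Resource": "${BUCKET_ARN}"
        },
        {
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": "${BUCKET_ARN}/*"
        }
    ]
}
EOF
# json.tool fails loudly on malformed JSON, so this catches typos early.
python3 -m json.tool /tmp/loki-s3-policy.json > /dev/null && echo "policy JSON is valid"
```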
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/26.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2527" height="952"><figcaption><span style="white-space: pre-wrap;">Configure Loki to send logs to AWS S3 Buckets</span></figcaption></figure><ul><li>Let&apos;s create an IAM user, attach the policy, and generate an <code>access_key_id</code> and <code>secret_access_key</code>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/27.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2509" height="1140"><figcaption><span style="white-space: pre-wrap;">Configure Loki to send logs to the AWS S3 bucket</span></figcaption></figure><ul><li>Now we are ready to configure Loki.</li><li>Let&apos;s first export the default values.yml file into <code>loki_distributed_values.yml</code>:</li></ul><pre><code class="language-bash">helm show values grafana/loki-distributed &gt; loki-promtail-stack/loki_distributed_values.yml
</code></pre>
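After exporting the defaults, the S3-related overrides look roughly like the sketch below. The exact key layout depends on the loki-distributed chart version, and the bucket name, region, and keys here are placeholders, so cross-check against the screenshots and the provided custom_loki_distributed_values.yml.

```yaml
# Sketch of the S3 storage overrides (bucket, region, and keys are placeholders).
loki:
  storageConfig:
    aws:
      s3: s3://us-east-1/YOUR_BUCKET_NAME
      region: us-east-1
      access_key_id: YOUR_ACCESS_KEY_ID
      secret_access_key: YOUR_SECRET_ACCESS_KEY
    boltdb_shipper:
      shared_store: s3
```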
<ul><li><code>loki_distributed_values.yml</code> has all the default settings, but we have to make some changes to configure the AWS S3 bucket. </li><li>For reference, the screenshots below show which values I&apos;ve changed in the file. </li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/28.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1973" height="201"><figcaption><span style="white-space: pre-wrap;">Configure Loki to send logs to AWS S3 bucket</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/29-1.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2443" height="1092"><figcaption><span style="white-space: pre-wrap;">Configure Loki to send logs to the AWS S3 bucket</span></figcaption></figure><ul><li>I also created an updated configuration file, <code>loki-promtail-stack/custom_loki_distributed_values.yml</code>, with all necessary changes.</li><li>Ensure you add your bucket name, region, access key ID, and secret access key.</li><li>Now, let&apos;s install Loki on the cluster using the Helm chart. Run the command below:</li></ul><pre><code class="language-bash">helm install loki grafana/loki-distributed -n monitoring -f loki-promtail-stack/custom_loki_distributed_values.yml
</code></pre>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/31.0.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1666" height="622"><figcaption><span style="white-space: pre-wrap;">Install Loki using Helm chart on AWS EKS cluster</span></figcaption></figure><ul><li>Yup! We&apos;ve installed Loki successfully. </li></ul><h3 id="%E2%9A%92%EF%B8%8F-install-configure-promtail">&#x2692;&#xFE0F; Install &amp; configure Promtail</h3><ul><li>Now, let&apos;s set up the log collector, Promtail. We already have the Promtail Helm chart in the Grafana repo.</li><li>Since everything is installed in the <code>monitoring</code> namespace, we need to change one endpoint in Promtail&apos;s default configuration.</li><li>Run the command below to export the default configuration (values.yml) to <code>loki-promtail-stack/promtail_values.yml</code>:</li></ul><pre><code class="language-bash">helm show values grafana/promtail &gt; loki-promtail-stack/promtail_values.yml
</code></pre>
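The change points Promtail at the Loki gateway service inside the monitoring namespace. In the exported values it looks roughly like the sketch below; the key layout may vary slightly by chart version, so compare with the screenshot and custom_promtail_values.yml.

```yaml
# Sketch: send logs to the loki-distributed gateway in the monitoring namespace.
config:
  clients:
    - url: http://loki-loki-distributed-gateway.monitoring.svc.cluster.local/loki/api/v1/push
```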
<ul><li>We have to change the <code>clients.url</code> attribute so Promtail knows where to send the logs. Refer to the image below.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/32.0.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2554" height="294"><figcaption><span style="white-space: pre-wrap;">Configure Promtail on AWS EKS Cluster</span></figcaption></figure><ul><li>I also provided an updated configuration file, <code>loki-promtail-stack/custom_promtail_values.yml</code>.</li><li>Now that the configuration is done, let&apos;s install Promtail with the command below:</li></ul><pre><code class="language-bash">helm install promtail grafana/promtail -n monitoring -f loki-promtail-stack/custom_promtail_values.yml
</code></pre>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/31.1.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1666" height="648"><figcaption><span style="white-space: pre-wrap;">Install and configure Promtail on AWS EKS</span></figcaption></figure><ul><li>Now, let&apos;s go ahead and see our logs in the Grafana dashboard. Port-forward the Grafana service again (as shown earlier) and access it at <code>http://localhost:8000</code>.</li><li>Before adding a new dashboard, we need to add a new data source so Grafana can query logs from Loki.</li><li>So let&apos;s add a new data source; see the images below for reference.</li><li>Add a new data source with the URL <code>http://loki-loki-distributed-gateway.monitoring.svc.cluster.local</code></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/32.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2357" height="760"><figcaption><span style="white-space: pre-wrap;">Configure Loki and Promtail on AWS EKS Cluster</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/33.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1030" height="519"><figcaption><span style="white-space: pre-wrap;">Configure Loki and Promtail on AWS EKS Cluster</span></figcaption></figure><ul><li>We&apos;ve successfully added a data source.
Now, import the community dashboard by typing <code>15414</code> and selecting Loki as the data source.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/34.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1238" height="838"><figcaption><span style="white-space: pre-wrap;">Configure Loki and Promtail on AWS EKS Cluster</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/35.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1135" height="869"><figcaption><span style="white-space: pre-wrap;">Configure Loki and Promtail on AWS EKS Cluster</span></figcaption></figure><ul><li>You can now see all the logs in Grafana. Apply filters to get logs for a specific namespace or container.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/36.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2547" height="841"><figcaption><span style="white-space: pre-wrap;">Configure Loki and Promtail on AWS EKS Cluster</span></figcaption></figure><ul><li>Now, let&apos;s generate logs from our application; select the <code>default</code> namespace from the dropdown menu at the top to view them.</li><li>You can run the <code>test.sh</code> script or visit <code>http://YOUR_LOAD_BALANCER_DNS_NAME/logs</code> in the browser.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/37-1.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="2559" height="841"><figcaption><span
style="white-space: pre-wrap;">Configure Loki and Promtail on AWS EKS Cluster</span></figcaption></figure><ul><li>Lastly, verify that Loki is sending logs to the S3 bucket by checking the folders created by Loki in the AWS S3 console.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/07/38.png" class="kg-image" alt="Ultimate Guide to Monitoring &amp; Logging on AWS EKS: Prometheus, Grafana, Loki, and Promtail" loading="lazy" width="1501" height="546"><figcaption><span style="white-space: pre-wrap;">Configure Loki and Promtail on AWS EKS Cluster</span></figcaption></figure><ul><li>Yes, we can see the logs are available inside our AWS S3 bucket. </li></ul><h2 id="%F0%9F%A7%BC-cleanup">&#xA0;&#x1F9FC; Cleanup</h2><ul><li>It&apos;s time to clean up what we&apos;ve created to avoid unnecessary costs.</li><li>First, delete the Node.js application from Kubernetes:</li></ul><pre><code class="language-bash">kubectl delete -k app-k8s-manifest/
</code></pre>
<ul><li>Next, uninstall the Helm charts, since Prometheus, Grafana, and Alertmanager created AWS EBS volumes:</li></ul><pre><code class="language-bash">helm uninstall monitoring -n monitoring

helm uninstall loki -n monitoring

helm uninstall promtail -n monitoring
</code></pre>
<ul><li>Let&apos;s also delete the monitoring namespace:</li></ul><pre><code class="language-bash">kubectl delete ns monitoring
</code></pre>
<ul><li>Also, make sure no Persistent Volumes are left behind, because anything left over will cause trouble for Terraform when destroying the infrastructure.</li></ul><pre><code class="language-bash">kubectl get pv
</code></pre>
<ul><li>Finally, let&apos;s destroy our AWS EKS cluster. Navigate to the <code>eks-terraform/main/</code> directory and run the commands below:</li></ul><pre><code class="language-bash">cd eks-terraform/main/

terraform destroy --auto-approve
</code></pre>
<ul><li>After executing the above command, you will not have any resources in your AWS account.</li></ul><h2 id="%F0%9F%99%8C-conclusion">&#x1F64C; Conclusion</h2><ul><li>In this blog, we&apos;ve comprehensively walked through setting up a monitoring and logging stack on AWS EKS using Prometheus, Grafana, Loki, and Promtail.</li><li>From deploying a Node.js application with custom metrics to visualizing logs and metrics in Grafana, we&apos;ve covered the entire process step-by-step.</li><li>I aimed to cover all necessary details and best practices, but it&apos;s not possible to fit everything into one blog, so I recommend digging deeper into my Terraform code, Kubernetes manifest files, and the rest of the directories.</li><li>As a next step, you can implement CI/CD for Terraform (GitOps approach).</li></ul><p>And here it ends... &#x1F64C;&#x1F942;<br><br>If you like my work, please message me on LinkedIn with&#xA0;<strong><em>&quot;Hi and your country name&quot;</em></strong></p><p>-&#x1F64B;&#x200D;&#x2642;&#xFE0F; Ankit Jodhani.</p><p>&#x1F4E8; reach me at&#xA0;<strong>ankitjodhani1903@gmail.com</strong></p><div class="kg-card kg-button-card kg-align-center"><a href="https://github.com/AnkitJodhani/eks-private-container-registry.git?ref=blog.ankitjodhani.com" class="kg-btn kg-btn-accent">&#x1F446; GitHub Repository</a></div>
<!--kg-card-begin: html-->
<!DOCTYPE html>
<html lang="en">
<head>
<link href="https://unpkg.com/boxicons@2.1.4/css/boxicons.min.css" rel="stylesheet">
</head>
  <style>
 .social-box {
    display: flex;
    padding: 0px 100px;
    justify-content: space-between;
}
    
 .social-box a {
      font-size: 100px;
      text-decoration: none;
   
    }   
</style>
<body>
  <div class="social-box">
       <a class="social-links-ankit" href="https://www.linkedin.com/in/ankit-jodhani/?ref=blog.ankitjodhani.com"><i class="bx bxl-linkedin-square"> </i> </a>
    
           <a class="social-links-ankit" href="https://twitter.com/Ankit__Jodhani?ref=blog.ankitjodhani.com"><i class="bx bxl-twitter"></i> </a>
    
           <a class="social-links-ankit" href="https://github.com/AnkitJodhani?ref=blog.ankitjodhani.com"><i class="bx bxl-github"></i> </a>
  </div>
</body>
</html>

<!--kg-card-end: html-->
<h2 id="%F0%9F%8E%92-resources">&#x1F392; Resources</h2><p><a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack?ref=blog.ankitjodhani.com">https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack</a></p><p><a href="https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed?ref=blog.ankitjodhani.com">https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed</a></p><p><a href="https://github.com/grafana/helm-charts/tree/main/charts/promtail?ref=blog.ankitjodhani.com">https://github.com/grafana/helm-charts/tree/main/charts/promtail</a></p><p><a href="https://dev.to/aws-builders/monitoring-eks-cluster-with-prometheus-and-grafana-1kpb?ref=blog.ankitjodhani.com">https://dev.to/aws-builders/monitoring-eks-cluster-with-prometheus-and-grafana-1kpb</a></p><p><a href="https://github.com/grafana/loki/issues/7335?ref=blog.ankitjodhani.com">https://github.com/grafana/loki/issues/7335</a></p><p><a href="https://stackoverflow.com/questions/76873980/loki-s3-configuration-for-chunks-and-indexes?ref=blog.ankitjodhani.com">https://stackoverflow.com/questions/76873980/loki-s3-configuration-for-chunks-and-indexes</a></p><p><a href="https://blog.srev.in/posts/grafana-loki-with-amazon-s3/?ref=blog.ankitjodhani.com">https://blog.srev.in/posts/grafana-loki-with-amazon-s3/</a></p><p><a href="https://akyriako.medium.com/kubernetes-logging-with-grafana-loki-promtail-in-under-10-minutes-d2847d526f9e?ref=blog.ankitjodhani.com">https://akyriako.medium.com/kubernetes-logging-with-grafana-loki-promtail-in-under-10-minutes-d2847d526f9e</a></p>
<!--kg-card-begin: html-->
<div class="toc"></div>

<!--kg-card-end: html-->
]]></content:encoded></item><item><title><![CDATA[Deploy a Private Docker Container Registry on Kubernetes (EKS)]]></title><description><![CDATA[Learn how to deploy a private Docker container registry on Kubernetes (EKS) using Terraform. This guide covers setting up an EKS cluster, deploying Kubernetes components, and configuring persistent storage, ensuring a scalable and secure setup. Ideal for those preparing for the CKA exam.]]></description><link>https://blog.ankitjodhani.com/deploy-a-private-docker-container-registry-on-kubernetes-eks/</link><guid isPermaLink="false">66674f56c38c8136065d7191</guid><category><![CDATA[EKS]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[Docker]]></category><dc:creator><![CDATA[Ankit Jodhani]]></dc:creator><pubDate>Fri, 07 Jun 2024 10:16:00 GMT</pubDate><media:content url="https://cdn.ankitjodhani.com/2024/06/architecture.gif" medium="image"/><content:encoded><![CDATA[<h2 id="%F0%9F%99%8B%E2%80%8D%E2%99%82%EF%B8%8F-introduction">&#x1F64B;&#x200D;&#x2642;&#xFE0F; Introduction</h2><img src="https://cdn.ankitjodhani.com/2024/06/architecture.gif" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)"><p>Hi Everyone, I&apos;m <a href="https://www.linkedin.com/in/ankit-jodhani/?ref=blog.ankitjodhani.com" rel="noreferrer">Ankit Jodhani</a>, a freelance DevOps engineer, and I love sharing my knowledge publicly. This blog is part of the #10WeeksofCloudOps series initiated by <a href="https://www.linkedin.com/in/piyush-sachdeva/?ref=blog.ankitjodhani.com" rel="noreferrer">Piyush Sachdeva</a>. I want to thank Piyush Sachdeva for providing valuable guidance through the journey.</p><h2 id="%F0%9F%93%9A-synopsis">&#x1F4DA; Synopsis</h2><p>In this blog, we will deploy a private container registry on Kubernetes, specifically on AWS EKS. This project will be very useful if you are preparing for the Certified Kubernetes Administrator (CKA) exam. 
Our container repository will function similarly to Docker Hub, with an attached domain name, allowing you to push and pull images just as you would with Docker Hub. We will also ensure our registry has a persistent volume to avoid losing container images in case of pod or cluster failure.</p><h3 id="%F0%9F%94%B8-story">&#x1F538; Story</h3><ul><li>First, we will create the AWS EKS Cluster and all related components, such as IRSA (IAM Role for Service Account), and deploy a Helm chart using Terraform.</li><li>Once we have the infrastructure ready, we will deploy all the necessary Kubernetes components:</li><li>&#x2693; Kubernetes Components or Objects<ul><li>&#x1F680; Deployment</li><li>&#x1F6CE;&#xFE0F; Services</li><li>&#x1F510; ConfigMap &amp; Secret </li><li>&#x2699;&#xFE0F; Ingress</li><li>&#x1F6A8; Network Policy </li><li>&#x1F4C1; Persistent Volume </li><li>&#x1F4C2; Persistent Volume Claim </li><li>&#x1F5C3;&#xFE0F; Storage Class</li><li>&#x1F6A7; Namespace</li></ul></li><li>After deploying the components, access the private Docker container registry using the domain name attached to it.</li></ul><h2 id="%E2%9C%85-prerequisites">&#x2705; Prerequisites</h2><ul><li>&#x1F4CC; AWS Account</li><li>&#x1F4CC; Hosted zone in Route 53 (Domain name)</li><li>&#x1F4CC; Basic knowledge of Terraform</li><li>&#x1F4CC; Basic knowledge of Docker</li></ul><h2 id="%F0%9F%96%A5%EF%B8%8F-local-setup">&#x1F5A5;&#xFE0F; Local setup</h2><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Ensure Terraform and AWS CLI are installed and configured with administrative privileges to avoid permission issues. 
</div></div><h2 id="%F0%9F%93%A6-list-of-aws-services">&#x1F4E6; List of AWS services</h2><ul><li>&#x1F451; Amazon EKS </li><li>&#x1F310; Amazon VPC</li><li>&#x1F512; AWS IAM</li><li>&#x1F4BB; Amazon EC2</li><li>&#x2696;&#xFE0F; Amazon EC2 Auto Scaling</li><li>&#x1FAA3; Amazon S3</li><li>&#x1F680; Amazon DynamoDB</li></ul><h2 id="%E2%98%B8%EF%B8%8F-list-of-kubernetes-tools-drivers">&#x2638;&#xFE0F; List of Kubernetes Tools &amp; Drivers</h2><ul><li>&#x1F3CB; AWS Load Balancer Controller</li><li>&#x1F310; ExternalDNS</li><li>&#x1F4C2; EFS CSI Driver (EKS Addon)</li></ul><h2 id="%F0%9F%8E%AF-architecture">&#x1F3AF; Architecture</h2><p>It&apos;s time to understand the architecture of the project; a clear picture of the moving parts simplifies the process and boosts confidence in following the steps.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn.ankitjodhani.com/2024/06/architecture-1.gif" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="1280" height="1664"><figcaption><span style="white-space: pre-wrap;">Deploy a Private Docker Container Registry on Kubernetes (EKS)</span></figcaption></figure><h2 id="%F0%9F%9A%80-step-by-step-guide">&#x1F680; Step-by-Step Guide</h2><ul><li>Please clone the <a href="https://github.com/AnkitJodhani/eks-private-container-registry.git?ref=blog.ankitjodhani.com" rel="noreferrer">GitHub repository</a> on your local computer.</li></ul><pre><code class="language-bash">git clone https://github.com/AnkitJodhani/eks-private-container-registry.git

cd eks-private-container-registry
</code></pre>
<ul><li>You will find two directories inside the directory.</li></ul><pre><code class="language-md">&#x1F4C2;eks-private-container-registry
&#x251C;&#x2500;&#x2500;&#x1F4C1;kubernetes
&#x2502;   &#x2514;&#x2500;&#x2500; It has all the Kubernetes manifest
&#x251C;&#x2500;&#x2500;&#x1F4C1; terraform
&#x2502;   &#x2514;&#x2500;&#x2500; It has all the Terraform script
&#x2514;&#x2500;&#x2500;&#x1F63A;.gitignore
</code></pre>
<ul><li>First, we will create the whole infrastructure using Terraform. so please navigate to the <code>terraform/main</code> directory </li></ul><pre><code class="language-sh">cd eks-private-container-registry/terraform/main
</code></pre>
<ul><li>Here you will find all the configuration files, like <code>backend.tf</code>, <code>providers.tf</code>, and <code>terraform.tfvars</code>. You can customize them as needed, but the default settings work fine for this project.</li><li>Now, let&apos;s initialize Terraform:</li></ul><pre><code class="language-sh">terraform init
</code></pre>
<ul><li>Validate the configuration:</li></ul><pre><code class="language-sh">terraform validate
</code></pre>
<ul><li>Let&apos;s see the plan of what Terraform is going to install for us:</li></ul><pre><code class="language-sh">terraform plan
</code></pre>
<ul><ul><li>VPC, IAM Roles, EKS Cluster + Managed NodeGroup, EFS CSI driver (using AWS Addon) + IRSA (IAM role for service account), AWS Load Balancer Controller (using Helm Chart)+IRSA (IAM role for service account), ExternalDNS(using Helm Chart) + IRSA (IAM role for service account)</li></ul></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">ExternalDNS will automatically set up records in AWS Route53. but for that, we should have hosted a zone in Route53</div></div><ul><li>It&apos;s time to trigger terraform and wait for the infrastructure to come up.  </li></ul><pre><code class="language-sh">terraform apply --auto-approve
</code></pre>
<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x26A0;&#xFE0F;</div><div class="kg-callout-text">Terraform takes approximately 20-30 minutes, so enjoy the automation &#x1F600; &#x2026;</div></div><ul><li>Once the above command executes successfully, we can go ahead and visit the AWS console to verify everything.</li></ul><figure class="kg-card kg-gallery-card kg-width-wide"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://cdn.ankitjodhani.com/2024/06/2.png" width="2125" height="1454" loading="lazy" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)"></div><div class="kg-gallery-image"><img src="https://cdn.ankitjodhani.com/2024/06/3.png" width="2559" height="454" loading="lazy" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)"></div></div><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://cdn.ankitjodhani.com/2024/06/4.png" width="2361" height="1086" loading="lazy" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)"></div><div class="kg-gallery-image"><img src="https://cdn.ankitjodhani.com/2024/06/5.png" width="2555" height="360" loading="lazy" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)"></div></div></div></figure><ul><li>Let&apos;s review all the components in Kubernetes:</li></ul><pre><code class="language-sh">aws eks list-clusters --region us-east-1

aws eks update-kubeconfig --name private-registry-eks-cluster --region us-east-1

kubectl get pods -n kube-system
</code></pre>
<figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/6.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2254" height="1181"></figure><ul><li>Yeah!! Our infrastructure is up &amp; running. Before applying our Kubernetes manifest files, we have to create a file system inside AWS EFS, where the registry pods will store our container images so the data persists in case of pod or cluster failure.</li><li>Navigate to AWS EFS and create a file system. You can name it anything, but ensure you select the VPC where we created our infrastructure. Refer to the screenshot below for guidance.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/7-1.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="808" height="584"></figure><ul><li>Once the file system is created, make a note of the file system ID, as we need to add it to the Kubernetes manifest <code>storageclass.yml</code> file located in the Kubernetes directory. Add the file system ID as shown in the image below.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/8.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="401"></figure><ul><li>You can find the <code>storageclass.yml</code> file inside the Kubernetes directory; add the file system ID as shown in the image below.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/9.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="806"></figure><ul><li>We need to modify the file system&apos;s security group. 
Go to the AWS EFS console and navigate to the network section.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/10.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2211" height="945"></figure><ul><li>Add the security group of the node so that our worker node can communicate with the file system. Without it, our worker node can&apos;t store images in the file system, and Kubernetes will not be able to bind the volume. Refer to the screenshot below for guidance.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/11.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2206" height="914"></figure><ul><li>You may want to change the <code>username</code> and <code>password</code> of your private Docker container registry. You can do that by editing the <code>kubernetes/registry-secret.yml</code> file, but your username and password must be base64-encoded.</li><li>To convert plain text to base64, visit <a href="https://www.base64encode.org/?ref=blog.ankitjodhani.com" rel="noreferrer">this site</a>.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/12.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="812"></figure><ul><li>As I discussed earlier, we should have a hosted zone inside AWS Route53 so that ExternalDNS can automatically insert or update records that point to the application load balancer.
</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/13.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2493" height="398"></figure><ul><li>You have to change the domain name inside the <code>kubernetes/albingress.yml</code> file to your own domain.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/14.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="810"></figure><ul><li>With that, all the configuration is done. Now let&apos;s apply the Kubernetes manifest files. Execute the command below in your terminal:</li></ul><pre><code class="language-sh">cd eks-private-container-registry

kubectl apply -k kubernetes/
</code></pre>
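<p>By the way, instead of the encoding website mentioned earlier, you can also base64-encode the registry credentials straight from the terminal. A quick sketch, using <code>ankit</code>/<code>jodhani</code> (the example credentials used later in this post) as stand-ins for your own values:</p><pre><code class="language-sh"># -n is important: a trailing newline would change the encoded value
echo -n 'ankit' | base64      # YW5raXQ=
echo -n 'jodhani' | base64    # am9kaGFuaQ==

# And to decode, for double-checking what is in registry-secret.yml:
echo -n 'YW5raXQ=' | base64 --decode   # ankit
</code></pre>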
<figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/15.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2111" height="771"></figure><ul><li>Let&apos;s verify all the components that have been deployed:</li></ul><pre><code class="language-sh">kubectl get pv

kubectl get pvc -n dev

kubectl get pods -n dev

kubectl get cm -n dev

kubectl get secret -n dev

kubectl get networkpolicy -n dev
</code></pre>
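<p>If you prefer a single command over eyeballing each resource, <code>kubectl wait</code> can block until the pods are actually Ready. A small sketch (the 300s timeout is an arbitrary choice; adjust it for your cluster):</p><pre><code class="language-sh"># Block until every pod in the dev namespace reports Ready
kubectl wait --for=condition=Ready pods --all -n dev --timeout=300s

# Confirm the PVC bound to the EFS-backed volume; this should print "Bound"
kubectl get pvc -n dev -o jsonpath='{.items[*].status.phase}'
</code></pre>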
<figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/16.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2317" height="1122"></figure><ul><li>We can also verify the load balancer created by the AWS Load Balancer Controller by going to the AWS console.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/17.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2555" height="1208"></figure><ul><li>We can also see the inserted records inside our hosted zone in Route 53.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/18.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="692"></figure><ul><li>With that, our private Docker container registry is successfully set up. It&apos;s time to test it.</li></ul><h2 id="%F0%9F%A7%AA-testing">&#x1F9EA; Testing</h2><ul><li>Type your registry domain name into your favorite browser to see the user interface (in my case it is <code>images.ankit.study</code>), and enter the <code>username</code> and <code>password</code> that you&apos;ve configured.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/19.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="382"></figure><ul><li>Now, let&apos;s push some Docker container images to our private registry. You can refer to the commands below, but replace the domain name with your own.</li></ul><pre><code class="language-sh"># List all the images
docker images

# Pull the ubuntu image from Docker Hub
docker pull ubuntu

# Tag the image for your registry --------&lt;&lt;YOUR_DOMAIN_NAME&gt;&gt;
docker tag ubuntu:latest images.ankit.study/ubuntu:latest

# Login to our registry 
# docker login YOUR_DOMAIN_NAME -u USERNAME -p PASSWORD
docker login images.ankit.study -u ankit -p jodhani

# Push docker image
docker push images.ankit.study/ubuntu:latest
</code></pre>
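<p>Besides the UI, you can also verify the push through the standard Docker Registry v2 HTTP API, which the registry container exposes. A sketch using the same example domain and credentials as above (substitute your own):</p><pre><code class="language-sh"># List all repositories in the registry
curl -s -u ankit:jodhani https://images.ankit.study/v2/_catalog
# e.g. {"repositories":["ubuntu"]}

# List the tags of a specific repository
curl -s -u ankit:jodhani https://images.ankit.study/v2/ubuntu/tags/list
# e.g. {"name":"ubuntu","tags":["latest"]}
</code></pre>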
<figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/20.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2065" height="769"></figure><ul><li>Let&apos;s review the recently pushed image in the browser.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/21.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="432"></figure><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/22.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="586"></figure><ul><li>Similarly, we can push multiple images to our registry. Using EFS as a persistent volume eliminates concerns about storage size because EFS is serverless and elastic.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/23.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2080" height="813"></figure><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/24.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="555"></figure><ul><li>We can see the occupied storage in the AWS EFS console, and we can also monitor it with AWS CloudWatch.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/25.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2247" height="1157"></figure><ul><li>We are using persistent storage, so let&apos;s put it to the test by deleting all the registry and UI pods from our cluster.</li></ul><pre><code class="language-sh"># List of Pods 
kubectl get pods -n dev

# Delete registry deployment
kubectl delete -f kubernetes/registry-deployment.yml

# Delete ui deployment
kubectl delete -f kubernetes/ui-deployment.yml

# List of Pods 
kubectl get pods -n dev
</code></pre>
<figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/26.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2040" height="842"></figure><ul><li>Now, let&apos;s create these pods again.</li></ul><pre><code class="language-sh">kubectl apply -k kubernetes/
</code></pre>
<figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/27.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2104" height="830"></figure><p>Once all the pods are in the Running state, we can verify the images by visiting the domain in the browser.</p><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/28.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="496"></figure><ul><li>Yeah!! We can see that our images are still there, which means the data was persisted.</li></ul><h2 id="%F0%9F%A7%B9-cleanup">&#x1F9F9; Cleanup</h2><ul><li>Let&apos;s destroy the infrastructure to avoid unnecessary charges.</li><li>First, remove the Kubernetes components.</li></ul><pre><code class="language-sh">kubectl delete -k kubernetes/
</code></pre>
<figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/29.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2115" height="621"></figure><ul><li>Second, delete the file system that we&apos;ve created, via the AWS console.</li></ul><figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/30.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="2559" height="437"></figure><ul><li>Third, destroy the infrastructure.</li></ul><pre><code class="language-sh">terraform destroy --auto-approve
</code></pre>
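<p>Terraform can occasionally leave orphans behind (for example, load balancers created by the controller from inside the cluster rather than by Terraform), so a quick AWS CLI sweep of the usual suspects is a cheap sanity check. These are standard AWS CLI commands; add <code>--region</code> if your default differs from the cluster&apos;s region:</p><pre><code class="language-sh"># Any EFS file systems still around?
aws efs describe-file-systems --query 'FileSystems[].FileSystemId'

# Any load balancers left by the AWS Load Balancer Controller?
aws elbv2 describe-load-balancers --query 'LoadBalancers[].LoadBalancerName'

# NAT gateways are an easy-to-miss cost; list the active ones
aws ec2 describe-nat-gateways --filter Name=state,Values=available --query 'NatGateways[].NatGatewayId'
</code></pre>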
<figure class="kg-card kg-image-card"><img src="https://cdn.ankitjodhani.com/2024/06/31.png" class="kg-image" alt="Deploy a Private Docker Container Registry on Kubernetes (EKS)" loading="lazy" width="1507" height="361"></figure><ul><li>I recommend you visit the AWS Console &amp; verify everything to avoid unnecessary charges.</li></ul><h2 id="%F0%9F%99%8C-conclusion">&#x1F64C; Conclusion</h2><ul><li>Finally, in this blog, we explored deploying our own private container registry on an AWS EKS cluster. We started by provisioning the Amazon EKS cluster using Terraform, ensuring a solid foundation for our deployment.</li><li>I aimed to cover all the necessary details and best practices, but writing everything in a blog is not possible, so I recommend you dig deeper and check out my Terraform code and Kubernetes manifest files.</li><li>You can implement CI/CD for Terraform (a GitOps approach).</li><li>Use AWS Secrets Manager or other tools to store your secrets.</li></ul><p>And here it ends... &#x1F64C;&#x1F942;<br><br>If you like my work, please message me on LinkedIn with <strong><em>&quot;Hi and your country name&quot;</em></strong></p><p>-&#x1F64B;&#x200D;&#x2642;&#xFE0F; Ankit Jodhani.</p><p>&#x1F4E8; reach me at <strong>ankitjodhani1903@gmail.com </strong></p><div class="kg-card kg-button-card kg-align-center"><a href="https://github.com/AnkitJodhani/eks-private-container-registry.git?ref=blog.ankitjodhani.com" class="kg-btn kg-btn-accent">&#x1F446; GitHub Repository</a></div>
<!--kg-card-begin: html-->
<!DOCTYPE html>
<html lang="en">
<head>
<link href="https://unpkg.com/boxicons@2.1.4/css/boxicons.min.css" rel="stylesheet">
</head>
  <style>
 .social-box {
    display: flex;
    padding: 0px 100px;
    justify-content: space-between;
}
    
 .social-box a {
      font-size: 100px;
      text-decoration: none;
   
    }   
</style>
<body>
  <div class="social-box">
       <a class="social-links-ankit" href="https://www.linkedin.com/in/ankit-jodhani/?ref=blog.ankitjodhani.com"><i class="bx bxl-linkedin-square"> </i> </a>
    
           <a class="social-links-ankit" href="https://twitter.com/Ankit__Jodhani?ref=blog.ankitjodhani.com"><i class="bx bxl-twitter"></i> </a>
    
           <a class="social-links-ankit" href="https://github.com/AnkitJodhani?ref=blog.ankitjodhani.com"><i class="bx bxl-github"></i> </a>
  </div>
</body>
</html>

<!--kg-card-end: html-->
<h2 id="%F0%9F%8E%92-resources">&#x1F392; Resources</h2><p><a href="https://joxit.dev/docker-registry-ui/?ref=blog.ankitjodhani.com">https://joxit.dev/docker-registry-ui/</a></p><p><a href="https://hub.docker.com/r/joxit/docker-registry-ui?ref=blog.ankitjodhani.com">https://hub.docker.com/r/joxit/docker-registry-ui</a><br><br><a href="https://medium.com/clarusway/creating-a-private-container-registry-repository-and-web-service-8c753b54f55c?ref=blog.ankitjodhani.com">https://medium.com/clarusway/creating-a-private-container-registry-repository-and-web-service-8c753b54f55c</a></p>
<!--kg-card-begin: html-->
<div class="toc"></div>

<!--kg-card-end: html-->
]]></content:encoded></item></channel></rss>