The 5 AWS Misconfigurations Quietly Bleeding Your Budget

Each of these five misconfigurations has a cost symptom and a security implication. Most teams fix the bill and never ask the security question behind it.

Gourav Das
CostObserver Team

Run enough AWS cost audits and the same misconfigurations keep appearing. Not because teams are careless. Because these are the ones that look fine in every dashboard until someone runs the numbers.

Each item below has two dimensions: what it costs you and what it signals from a security perspective. Most teams fix the cost. Almost none ask the security question behind it.

1. NAT Gateway Traffic That Should Be Using VPC Endpoints

The cost symptom: NAT Gateway charges are a combination of hourly fees ($0.045 per gateway-hour, roughly $33/month just to exist) and data processing fees ($0.045 per GB processed). An account with three NAT Gateways processing 18 TB/month of traffic is paying over $800/month in processing fees alone, before the hourly charges.
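As a sanity check on those numbers, here is a minimal cost model using the rates quoted above (taking 1 TB as 1,000 GB and a 730-hour month):

```python
# Rough NAT Gateway cost model using the published us-east-1 rates.
HOURLY_RATE = 0.045      # $ per NAT Gateway per hour
PROCESSING_RATE = 0.045  # $ per GB processed
HOURS_PER_MONTH = 730

def nat_monthly_cost(gateways: int, gb_processed: float) -> dict:
    """Split a month of NAT Gateway spend into hourly vs. processing fees."""
    hourly = gateways * HOURLY_RATE * HOURS_PER_MONTH
    processing = gb_processed * PROCESSING_RATE
    return {
        "hourly": round(hourly, 2),
        "processing": round(processing, 2),
        "total": round(hourly + processing, 2),
    }

# Three gateways pushing 18 TB/month:
print(nat_monthly_cost(3, 18_000))
# processing alone is $810/month; the hourly fees add roughly $99 on top
```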

The most common driver of high NAT Gateway costs is S3 and DynamoDB traffic routing through NAT when it should be using Gateway VPC endpoints, which are free. Traffic to S3 and DynamoDB from within a VPC does not need to leave the AWS network. When it does, you pay for every gigabyte.

Source: Amazon VPC pricing

The security implication: High NAT Gateway egress is not just an architecture inefficiency. Unmonitored NAT traffic is a blind spot. A sustained increase in NAT Gateway data processing costs with no corresponding increase in legitimate application traffic is a signal worth investigating. Data exfiltration from a compromised EC2 instance or container will show up as NAT Gateway egress before it shows up as a security finding.

The fix: Check your VPC Flow Logs for traffic patterns. Move S3 and DynamoDB traffic to Gateway endpoints. For other services, evaluate whether Interface endpoints are worth the cost. For non-production environments, question whether multi-AZ NAT is necessary at all.
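To make the flow-log check concrete, here is a sketch of the analysis: sum the bytes headed to S3 address ranges and price them at the NAT processing rate. The CIDR below is a placeholder; in practice, pull the real ranges for your region from AWS's published ip-ranges.json or the S3 managed prefix list.

```python
import ipaddress

# Placeholder standing in for S3's published ranges -- replace with the real
# CIDRs for your region from AWS's ip-ranges.json (service == "S3").
S3_CIDRS = [ipaddress.ip_network("52.216.0.0/15")]

NAT_PROCESSING_RATE = 0.045  # $ per GB processed

def endpoint_eligible_cost(flow_records):
    """Price the NAT-processed bytes that a free Gateway endpoint would absorb.

    flow_records: dicts with 'dstaddr' and 'bytes' keys, as parsed from
    VPC Flow Logs for traffic traversing the NAT Gateway."""
    eligible = 0
    for rec in flow_records:
        dst = ipaddress.ip_address(rec["dstaddr"])
        if any(dst in net for net in S3_CIDRS):
            eligible += rec["bytes"]
    gb = eligible / 1e9
    return round(gb * NAT_PROCESSING_RATE, 2)

records = [
    {"dstaddr": "52.216.1.10", "bytes": 500_000_000_000},  # S3 via NAT: avoidable
    {"dstaddr": "8.8.8.8", "bytes": 10_000_000_000},       # genuine internet egress
]
print(endpoint_eligible_cost(records))  # cost of the S3 share only
```

The number this returns is the monthly spend that disappears the day the Gateway endpoint is in place, which is usually the fastest way to get the change prioritized.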

2. CloudWatch Log Groups With No Retention Policy

The cost symptom: CloudWatch charges $0.50 per GB ingested and $0.03 per GB stored per month. Log groups with no retention policy accumulate indefinitely. An account with 50 log groups, each growing at 5GB/month with no retention set, is paying for years of accumulated logs from services that may no longer exist.

Source: AWS CloudWatch pricing

The pattern from real audits: log groups from deprecated Lambda functions, ECS tasks that were decommissioned, and API Gateway stages that were replaced. The services are gone. The logs are still billing.

The security implication: No retention policy is also a compliance problem. Most security frameworks require log retention for a defined period, not indefinite retention. Logs older than your compliance window are not helping your audit posture. They are just costing money and creating a larger data surface that needs to be governed.

More importantly: a log group with no retention policy and no active service writing to it is a ghost resource. Ghost resources are unaudited resources. Who created it? What was writing to it? When did it stop? These are questions a security audit should answer. Most teams never ask them because the log group is not causing any visible problems.

The fix: Set a 30 to 90 day retention policy across all log groups. AWS does not do this by default. A one-time script using the AWS CLI can set retention across every log group in an account in minutes. AWS Trusted Advisor will flag log groups with no retention policy if you have a Business or Enterprise support plan.
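The audit half of that script can be sketched in a few lines. The dicts below mirror the shape of `aws logs describe-log-groups` output, where `retentionInDays` is simply absent when retention was never set; the write-back is then one `aws logs put-retention-policy` call per flagged group.

```python
STORAGE_RATE = 0.03  # $ per GB stored per month in CloudWatch Logs

def retention_audit(log_groups, target_days=90):
    """Find log groups with no retention set and their monthly storage bill.

    log_groups: dicts shaped like describe-log-groups output --
    'logGroupName', 'storedBytes', and 'retentionInDays' when one is set."""
    missing = [g for g in log_groups if "retentionInDays" not in g]
    monthly_cost = sum(g["storedBytes"] for g in missing) / 1e9 * STORAGE_RATE
    return [g["logGroupName"] for g in missing], round(monthly_cost, 2)

groups = [
    {"logGroupName": "/aws/lambda/old-fn", "storedBytes": 400_000_000_000},
    {"logGroupName": "/aws/lambda/live-fn", "storedBytes": 50_000_000_000,
     "retentionInDays": 30},
]
names, cost = retention_audit(groups)
print(names, cost)  # the deprecated function's logs, still billing every month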

3. Idle EC2 Instances Running Under 5% CPU

The cost symptom: An EC2 instance running at 3% average CPU utilization is paying for 97% of its compute capacity to sit idle. AWS Compute Optimizer flags these automatically. In most accounts, there are several. The monthly cost of idle EC2 in a mid-size account typically runs between $400 and $2,000/month.

The security implication: An EC2 instance running at 3% CPU is not necessarily idle. It could be doing something at low intensity that is not your workload. Cryptomining on a compromised instance often runs at deliberately low CPU to avoid detection. A compromised instance used for low-and-slow data exfiltration will show similar utilization patterns.

The cost signal (idle-looking instance) and the security signal (potentially compromised instance) look identical in a utilization dashboard. The difference is in the network traffic. An idle instance should have near-zero network egress. An instance doing something it should not will have network activity that does not match its CPU profile.

The fix: For genuinely idle instances, use scheduling policies to stop them outside business hours. For dev and staging, this alone typically saves 60% of compute costs. Before terminating any instance flagged as idle, check its VPC Flow Logs for the past 30 days. If there is network activity on an instance with near-zero CPU, that is worth investigating before you shut it down.
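The cross-check described above reduces to a small decision rule. The thresholds here are illustrative, not AWS guidance: the point is that CPU alone cannot distinguish "idle" from "compromised and quiet".

```python
def classify_instance(avg_cpu_pct, avg_net_egress_bytes,
                      cpu_idle=5.0, egress_floor=5_000_000):
    """Cross-check CPU against network egress before acting on an 'idle' flag.

    Under the (illustrative) thresholds: low CPU with negligible egress looks
    genuinely idle; low CPU with sustained egress deserves a look first."""
    if avg_cpu_pct >= cpu_idle:
        return "active"
    if avg_net_egress_bytes < egress_floor:
        return "idle: candidate for scheduling or termination"
    return "investigate: low CPU but real network egress"

print(classify_instance(3.0, 1_000))        # matches the idle profile
print(classify_instance(3.0, 800_000_000))  # the pattern worth investigating
```

Feeding this from CloudWatch's CPUUtilization and NetworkOut metrics over a 30-day window turns the "check before you terminate" step into a mechanical pass over the Compute Optimizer list.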

4. Stale EBS Snapshots and Unattached Volumes

The cost symptom: EBS gp3 volumes cost $0.08/GB-month. Snapshots cost $0.05/GB-month for the stored data. An account with 500 snapshots averaging 100GB each is paying $2,500/month for snapshot storage. Many of those snapshots are from instances that were terminated months ago.

Source: Amazon EBS pricing

Unattached EBS volumes are a separate line item. When an EC2 instance is terminated without the “delete on termination” flag set, the volume persists and continues billing. In accounts that have been running for more than a year, unattached volumes accumulate quietly.

The security implication: A snapshot from a terminated instance is not just a cost problem. It is a data governance problem. What data was on that instance? If the instance was part of a workload that handled sensitive data, the snapshot contains that data. Who has access to the snapshot? What is the snapshot’s sharing policy?

Unattached volumes from deleted test environments are the same problem. The test environment is gone. The data on the volume is not. The access permissions on the volume are whatever they were when the instance was running.

The fix: Run a snapshot audit. Delete snapshots older than your backup retention policy. For unattached volumes, check the creation date and the last-attached instance before deleting. Set the “delete on termination” flag on EBS volumes for non-production instances by default.
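The snapshot audit can be sketched as an age filter plus a cost estimate. The field names mirror `describe-snapshots` output; note that `VolumeSize` gives an upper bound, since snapshots bill on stored (changed) data rather than the full volume size.

```python
from datetime import datetime, timedelta, timezone

SNAPSHOT_RATE = 0.05  # $ per GB-month of snapshot storage

def stale_snapshots(snapshots, retention_days=90, now=None):
    """Flag snapshots older than the backup retention window and price them.

    snapshots: dicts with 'SnapshotId', 'StartTime' (datetime), and
    'VolumeSize' (GiB), mirroring describe-snapshots fields."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    stale = [s for s in snapshots if s["StartTime"] < cutoff]
    monthly = sum(s["VolumeSize"] for s in stale) * SNAPSHOT_RATE
    return [s["SnapshotId"] for s in stale], round(monthly, 2)

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
snaps = [
    {"SnapshotId": "snap-old",
     "StartTime": datetime(2024, 1, 1, tzinfo=timezone.utc), "VolumeSize": 100},
    {"SnapshotId": "snap-new",
     "StartTime": datetime(2025, 5, 20, tzinfo=timezone.utc), "VolumeSize": 100},
]
print(stale_snapshots(snaps, now=now))  # only the year-old snapshot is flagged
```

Before deleting, the flagged list is also the input to the governance questions above: check each snapshot's sharing settings and what workload produced it.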

5. Over-Permissioned IAM Roles in Dev and Staging

The cost symptom: This one does not show up directly as a cost line item. It shows up as the blast radius when something goes wrong. A dev environment with an IAM role that has ec2:* across all regions is not expensive until a developer accidentally triggers a runaway process, a misconfigured script launches instances in the wrong region, or a compromised credential is used to spin up GPU instances for cryptomining.

The cost of over-permissioned IAM in dev and staging is not the normal monthly bill. It is the incident bill. And incident bills in AWS can be large.

The security implication: Dev and staging environments typically have broader IAM permissions than production because the friction of least-privilege is felt more acutely during development. That is understandable. It is also the reason that compromised dev credentials cause disproportionate damage.

A dev IAM role with ec2:RunInstances across all regions and no condition keys restricting the instance types or regions is a high-value target. The AWS IAM best practices documentation recommends using condition keys to restrict API calls by region, resource type, and resource tag. Most teams apply this to production. Few apply it to dev.

The fix: Scope dev and staging IAM roles to the regions and resource types they actually need. Add a condition key restricting ec2:RunInstances to specific instance families. Set a Service Control Policy at the AWS Organizations level that prevents instance launches in regions the account does not operate in. This is a one-time setup that eliminates an entire category of incident risk.
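For illustration, here is the shape of a region- and instance-type-scoped statement, plus a check of the kind an IAM audit script might run. `aws:RequestedRegion` and `ec2:InstanceType` are real condition keys; the region and instance family values are examples.

```python
# A minimal Allow statement of the shape described above: RunInstances
# permitted, but only in one region and one instance family.
dev_statement = {
    "Effect": "Allow",
    "Action": "ec2:RunInstances",
    "Resource": "*",
    "Condition": {
        "StringEquals": {"aws:RequestedRegion": "eu-west-1"},  # example region
        "StringLike": {"ec2:InstanceType": "t3.*"},            # example family
    },
}

def is_region_scoped(statement):
    """True if a statement constrains the aws:RequestedRegion condition key."""
    cond = statement.get("Condition", {})
    return any("aws:RequestedRegion" in keys for keys in cond.values())

print(is_region_scoped(dev_statement))                           # True
print(is_region_scoped({"Effect": "Allow", "Action": "ec2:*"}))  # False: audit finding
```

Running a check like this across every dev-account role is a cheap way to surface the unscoped `ec2:*` roles before an incident does it for you.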

The Pattern Across All Five

Each of these misconfigurations has the same structure: a cost signal that is visible in the billing console and a security signal that is invisible unless you look for it.

The FinOps team fixes the cost. The security team never sees the signal. The misconfiguration that caused the cost problem is also the misconfiguration that creates the security risk, but the two teams are looking at different dashboards and asking different questions.

That is the SecFinOps gap. The cost audit and the security audit are the same audit. Most organizations are running them separately.

Start your free CostObserver beta — read-only access, no credit card, connects in minutes.