We watch the AWS Cost Explorer to manage our on-demand, reserved and spot costs.

FinOps Principles

Favour Autoscaling to follow the demand curve very closely

The goal of efficient resource usage is to pool CPU/RAM/disk/network resources into a single cluster.  A set of services can then follow the demand curve very closely by using a core set of reserved capacity, with the rest on auto scaled or spot capacity.  Spot has its disadvantages: the 2 min warning, and the cost fluctuations when demand spikes the spot price, both of which must be planned for in advance.
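As a rough sketch of the reserved-baseline-plus-burst idea, the following compares a statically provisioned peak against a reserved core with burst capacity on top. The hourly rates, the demand curve and the function name `blended_cost` are hypothetical values chosen for illustration, not AWS prices.

```python
def blended_cost(demand, reserved_units, reserved_rate, burst_rate):
    """Cost of serving a per-hour demand curve with a reserved baseline plus burst capacity.

    demand         -- units (e.g. instances) needed in each hour
    reserved_units -- units paid for every hour, whether used or not
    reserved_rate  -- hourly price per reserved unit (hypothetical)
    burst_rate     -- hourly price per unit above the baseline (hypothetical spot/autoscaled rate)
    """
    reserved = reserved_units * reserved_rate * len(demand)
    burst = sum(max(0, d - reserved_units) for d in demand) * burst_rate
    return reserved + burst


# Hypothetical demand curve: instances needed per hour.
demand = [2, 2, 3, 6, 8, 6, 3, 2]

# Everything provisioned for the peak at a hypothetical on-demand rate of 0.10/h.
static_peak = blended_cost(demand, reserved_units=max(demand), reserved_rate=0.10, burst_rate=0.0)

# Reserved core of 2 units at 0.06/h, the rest on burst capacity at 0.03/h.
blended = blended_cost(demand, reserved_units=2, reserved_rate=0.06, burst_rate=0.03)

print(f"static peak: {static_peak:.2f}  blended: {blended:.2f}")
```

The sketch ignores spot interruptions and price spikes, which is exactly the risk the paragraph above says must be planned for.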

Not all autoscaling is equal

Autoscaling is not instant.  The infrastructure underlying NLB autoscaling, for example, is itself EC2 instances that take time to replicate and start.  Even Lambda needs to be pre-warmed.  The disadvantage of placing some of a K8S cluster's worker nodes under an autoscaler is therefore that the capacity will not be instantly available, as it would be if we had overprovisioned excess capacity ahead of time.
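A toy model of this lag, assuming a scale-up-only autoscaler whose new capacity arrives a fixed number of ticks after the demand that triggered it. The function name `shortfall_per_tick` and the numbers are illustrative assumptions, not measurements of any AWS service.

```python
def shortfall_per_tick(demand, initial_capacity, scale_delay):
    """Unmet demand per tick for a reactive, scale-up-only autoscaler.

    Capacity at tick t reflects only the demand observed up to tick t - scale_delay,
    which models the time new nodes need to replicate and start.
    """
    shortfall = []
    for t, needed in enumerate(demand):
        observed = demand[: max(0, t - scale_delay + 1)]
        capacity = max([initial_capacity] + observed)
        shortfall.append(max(0, needed - capacity))
    return shortfall


# Hypothetical step in demand from 2 to 6 units.
demand = [2, 2, 6, 6, 6, 6]

lagging = shortfall_per_tick(demand, initial_capacity=2, scale_delay=2)
overprovisioned = shortfall_per_tick(demand, initial_capacity=6, scale_delay=2)

print("autoscaled with 2-tick lag:", lagging)
print("overprovisioned ahead of time:", overprovisioned)
```

The autoscaled run is short of capacity for exactly the duration of the delay, while the overprovisioned run never is; the trade-off is paying for the idle headroom.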

Stateless Resilient Microservices reduce unused capacity

One of the factors that enabled Kubernetes to take over distributed computing was that workloads had previously been siloed and not cluster aware (each VM was managed in isolation).  Using docker compose, for example, to manage a set of containers per VM did not solve the problem of oversaturated or underprovisioned VMs: one VM would be at 90% utilization while another could be at 10%.

The traditional cloud lift and shift, where we go directly to IaaS (EC2 and RDS) and do not fully utilize managed services or auto scaling, is not cost effective.  Using EC2 directly is the same as using docker compose pre 2016, before managed clusters through Kubernetes (OpenShift, ECS, EKS) came around.  There is a reason applications do not go directly into fully auto scaled mode: microservices that can survive frequent crashes/stops/restarts are hard to design (circuit breakers, avoidance of local persistence, and state lag minimization must be implemented as a start).

FinOps Best Practices

Planned Infrastructure is cost effective - use reserved or spot

AWS lowers the price in a couple of ways: one is traditional reserved instances, another is bidding on excess capacity in the spot market (the reason AWS was envisioned in the first place).  When Amazon started, it purchased equipment in advance for predicted load.  Soon, however, the difference between current and projected load meant that this reserved capacity sat idle until it was needed in each 3 month cycle.  AWS was created to sell this temporary excess capacity, which is primarily the current spot market.

Minimum Utilization

Blue/Green or Canary deployments need double the resources temporarily.  The first time you redeploy an application whose utilization is already over 50%, you will run into issues because the transition temporarily needs over 100%.  Therefore keep maximum utilization below 50%.
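The headroom rule can be checked with simple arithmetic. This is a minimal sketch; the function names and the `surge_fraction` parameter are assumptions introduced here (1.0 means a full second copy runs alongside the old one, as described above).

```python
def peak_utilization_during_deploy(steady_utilization, surge_fraction=1.0):
    """Utilization while old and new versions run side by side.

    steady_utilization -- fraction of cluster capacity used before the deploy (0.0 to 1.0)
    surge_fraction     -- extra copy started during the transition, as a fraction of the
                          running workload (hypothetical knob; 1.0 doubles the resources)
    """
    return steady_utilization * (1.0 + surge_fraction)


def can_deploy(steady_utilization, surge_fraction=1.0):
    """True if the transition fits inside the existing capacity."""
    return peak_utilization_during_deploy(steady_utilization, surge_fraction) <= 1.0


for utilization in (0.45, 0.60):
    peak = peak_utilization_during_deploy(utilization)
    print(f"steady {utilization:.0%} -> peak {peak:.0%}, fits: {can_deploy(utilization)}")
```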

Resource Tradeoffs

There is a granularity sweet spot for all resources.  For example, a 16G VM will have up to 3G of OS overhead.  If your K8S cluster uses 8G VMs then over 1/3 of the RAM will be wasted on the base OS; at 16G it drops to about 20%, and at 32G to about 10%.  However, larger VMs have other issues, such as a rogue pod taking over an entire 32G VM (see Performance#FullKubernetesClusterCPUSaturation).  In a cluster of 4 x 32G that would be 25% saturation; in a cluster of 8 x 16G saturation would top out at about 13%, which is better.
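The two sides of this tradeoff can be put side by side with a short calculation. The sketch below uses the 3G upper bound from the paragraph above; the function names are illustrative.

```python
OS_OVERHEAD_GB = 3  # upper bound on base OS overhead quoted in the text


def os_overhead_fraction(vm_ram_gb, overhead_gb=OS_OVERHEAD_GB):
    """Share of a VM's RAM consumed by the base OS."""
    return overhead_gb / vm_ram_gb


def rogue_pod_saturation(node_count):
    """Share of the cluster lost if a rogue pod takes over one whole node."""
    return 1 / node_count


# Same 128G of total RAM, sliced into different node sizes.
for vm_ram_gb in (8, 16, 32):
    nodes = 128 // vm_ram_gb
    print(
        f"{nodes:2d} x {vm_ram_gb:2d}G: "
        f"OS overhead {os_overhead_fraction(vm_ram_gb):.1%}, "
        f"rogue pod saturation {rogue_pod_saturation(nodes):.1%}"
    )
```

Larger nodes cut the OS overhead but raise the share of the cluster a single rogue pod can take, which is why neither extreme is the sweet spot.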

NoOps is possible with Infrastructure as Code

The current Kubernetes + Operators framework addresses the intent state machine (provisioner and scheduler, via Kubelet and etcd) and ongoing maintenance (restarts/upgrades, via Operators).  With a properly designed microservice architecture (CI/CD including continuous delivery and continuous deployment, stateless resiliency) there should be minimal need for hands-on DevOps beyond coding up the system and its deployment.

Infrastructure is throwaway

The more we treat deployments as stateless and throwaway (some persistence containers still require stateful sets though), the more the system will be able to utilize the lowest cost ephemeral infrastructure (spot, Lambda).

A workload that does not deviate, at the container, service and infrastructure level, from the original automated infrastructure as code deployment will be able to restart with minimal impact on the system.  This is why Rancher, one of the first implementations of Kubernetes outside of kubeadm, was named around the concept of "cattle": we don't treat our infrastructure as "pets" and hand adjust each instance.

Costing Formulas

We need a couple of simple derived formulas for several architectural scenarios so that we can rapidly plan the FinOps profile before going into more detail.  Some base costs around compute and persistence are required.

We also need to derive the base case costs (overhead adjustment).

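| Type | Granularity | Service | Example | Utilization per service | Pricing type | Formula | Free Tier | Cost | Shared overhead |
|---|---|---|---|---|---|---|---|---|---|
| compute | 1 vCPU | IaaS EC2 | t3a.micro | 100% | On Demand; 3y no upfront | | | | |
| | | PaaS K8S | 3 x t3a.large | 1/12 | | | | | |
| | | CaaS Fargate | | | | | | | |
| | | FaaS Lambda | 1M requests, 128 MB, 100 ms = 0.0125 GB-s per request | | | | 400k GB-s | 0.20 req; 0.21 exec | |
| persistence | 1 GB | IaaS RDS | | | | | | | |
| | | DBaaS Aurora | | | | | | | |
| | | DBaaS DynamoDB | | | | | | | |
| storage | 1 GB | S3 | | | | | | | |
| throughput | 1 Gbps | Network In | | | | | | | |
| AI | | AWS Textract | see Text and Image Processing#TextractAPIExamples | | | | | 0.07 / tx | |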
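The Lambda figures in the table can be reproduced with a small calculation, assuming 128 MB of memory per invocation. The per-GB-second price below is an assumed list price for x86 Lambda and should be checked against current AWS pricing for your region; the function names are illustrative.

```python
REQUEST_PRICE_PER_MILLION = 0.20    # USD per 1M requests, from the table
PRICE_PER_GB_SECOND = 0.0000166667  # USD, assumed list price; verify before relying on it


def lambda_gb_seconds(requests, memory_mb, duration_ms):
    """Compute consumed by a batch of invocations, in GB-seconds."""
    return requests * (memory_mb / 1024) * (duration_ms / 1000)


def lambda_cost(requests, memory_mb, duration_ms, free_gb_seconds=0.0):
    """Return (request_cost, execution_cost) in USD.

    free_gb_seconds -- free tier still available to this service; 0 if another
                       service has already consumed it.
    """
    gb_seconds = lambda_gb_seconds(requests, memory_mb, duration_ms)
    billable = max(0.0, gb_seconds - free_gb_seconds)
    request_cost = requests / 1_000_000 * REQUEST_PRICE_PER_MILLION
    execution_cost = billable * PRICE_PER_GB_SECOND
    return request_cost, execution_cost


# 1M requests at 128 MB and 100 ms each, with no free tier left.
req_cost, exec_cost = lambda_cost(1_000_000, 128, 100)
print(f"per request: {lambda_gb_seconds(1, 128, 100)} GB-s")
print(f"request cost: {req_cost:.2f}  execution cost: {exec_cost:.2f}")

# The same workload fits entirely inside a 400k GB-s free tier.
_, exec_cost_free = lambda_cost(1_000_000, 128, 100, free_gb_seconds=400_000)
```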

Costing Options

Cost Explorer

Cost Estimator

Savings Plans

Free tier usage: Most services have a free tier.  Once it is used up it is gone, so the first service in gets the benefit.
Volume pricing: If several services saturate a shared resource such as S3, subsequent services benefit from lower pricing (resource pooling).
Auto scaled reserved: If service A kicks in K8S autoscaling of the worker nodes, all other services benefit by default from the capacity increase.  The reverse is also true: if service A terminates, a (rogue) service B that had full use of most of the vCores on a scaled node now needs to share a smaller, more saturated cluster.
Partitioned use: Move read-only traffic such as monitoring/reporting to a read replica that is optimized for read rather than read/write.

AWS Cost Calculator

It would be ideal if we could plan and track costs as pseudo Costs as Code (tied to CloudFormation/Terraform scripts).  There is a way to export estimates in the using

see also

There are issues with the cost calculator: it does not import estimate templates or break out detailed costs after the initial construction.

For example, an IaaS t3a.medium (2 vCPU / 4 GB) with 100 GB EBS, 10 GB outbound data transfer, no peak scaling, reserved 3y no upfront, and a weekly snapshot is US $31/month.

Export and Share

Export to CSV

And share to public URL

Commercial Licenses

