We watch the AWS Cost Explorer https://console.aws.amazon.com/cost-management/home?#/dashboard to manage our on-demand, reserved and spot costs.
Favour Autoscaling to follow the demand curve very closely
The goal of efficient usage of resources is to pool ram/hd/ram/network resources into a single cluster. A set of services can then follow very closely the demand curve by using a core set of reserved capacity with the rest on auto scaled or spot capacity. Using spot has its disadvantages - the 2 min warning and cost fluctuations that must be pre-planned for when demand spikes the spot price.
All autoscaling is not equal
Autoscaling is not instant. The underlying infrastructure of NLB autoscaling for example - is itself EC2 instances - that take time to replicate and start. Even Lambda needs to be pre-warmed. Therefore the disadvantage to placing some of a K8S cluster worker nodes under an auto scaler for example will be that that capacity will not be instantly available like it would if we overprovisioned for excess capacity ahead of time.
Stateless Resilient Microservices reduce unused capacity
One of the factors that enabled Kubernetes to take over distributed computing was the fact that workloads were siloed and not cluster aware (each VM was managed in isolation). Using for example docker compose to manage a set of containers per VM did not solve the problem of oversaturated or under provisioned VMs. One VM would be at 90% utilization while another could be at 10.
The traditional cloud lift and shift where we go directly to IaaS (ECS and RDS) and not fully utilize managed services or auto scaling is not cost effective. Using EC2 directly is the same as using docker compose pre 2016 before managed clusters through Kubernetes (OpenShift, ECS, EKS came around). There is a reason applications do not go directly into fully auto scaled mode - microservices that can survive frequent crashes/stops/restarts are hard to design (circuit breakers and avoidance of local persistence and state lag minimization must be implemented as a start).
FinOps Best Practices
Planned Infrastructure is cost effective - use reserved or spot
AWS will lower the price in a couple ways - one of which is traditional reserved instances, another is bidding on excess capacity in the spot market (the reason why AWS was envisioned in the first place). When amazon started they purchased predicted load equipment in advance. Soon however the difference between current and projected load mean that that reserved capacity was idle until it was needed in each 3 month cycle. AWS was created to sell this temporary excess capacity - this is primarily the current spot market now.
Blue/Green or Canary deployments need double the resources temporarily. The first time you try to redeploy an application where the utilization is over 50% already will run into issues when you temporarily use over 100% during the transition. Therefore set maximums below 50%.
There is a granularity sweet spot for all resources. For example a 16G VM will have up to 3G OS overhead - if your K8S cluster uses 8G VMs then over 1/3 of the RAM will be wasted on the base OS - switching to 16 it drops to 20%, 32 it drops to 10%. However using larger VMs has other issues like rogue pods taking over an entire 32G VM (see Performance#FullKubernetesClusterCPUSaturation). In a cluster of 4 x 32 that would be 25% saturation, in a cluster of 8 x 16 saturation would top out at 13% which is better.
NoOps is possible with Infrastructure as Code
The current Kubernetes + Operators framework addresses the intent state machine (provisioner and scheduler - via Kubelet and etcd) and ongoing maintenance (restarts/upgrades - via Operators). With the properly designed microservice architecture (CI/CD (continuous delivery and continuous deployment), stateless resiliency) there should be minimal need for hands on devops beyond coding up the system and deployment.
Infrastructure is throwaway
The more we treat deployments as stateless and throwaway (some persistence containers still require stateful sets though) - the more the system will be able to utilize the lowest cost etherial infrastructure (spot, lambda).
A workload at the container, service and infrastructure level that does not deviate from the original automated infrastructure as code deployment - will be able to restart with minimal impact on the system. This is why one of the first implementations of Kubernetes outside of kubeadm - Rancher was named around the concept of "cattle" - as in we don't treat our infrastructure as "pets" and hand adjust each instance.
We need a couple simple derived formulas for several architectural scenarios to be able to rapidly plan the FinOps profile before going into more detail. Some base costs around compute and persistence are required.
We also need to derive out the base case costs (overhead adjustment).
|Type||Granularity||Service||Example||Utilization per service||Formula|
|compute||1 vCPU||IaaS EC2||t3a.micro||100%|
|PaaS K8S||3 x t3a.large||1/12|
|persistence||1 GB||IaaS RDS||100%|
|throughput||1 Gbps||Network In|
|Free tier usage||Most services have tier - once used gone - so the first service in gets the benefit|
|Volume pricing||If several services saturate for example S3 - subsequent services will benefit with lower pricing (resource pooling)|
|Auto scaled reserved||If service A kicks in k8s autoscaling of the worker nodes - all other services benefit by default due the capacity increase.|
The reverse is true - if service A terminates - then service B (rogue) had full use of most of the vCores on a scaled node - now needs to share in a more overall saturated smaller cluster
|Partitioned use||Move read-only traffic - like monitoring/reporting to a read replica that is optimized for read not read/write|
AWS Cost Calculator
It would be ideal if we could plan and track costs as pseudo Costs as Code (tied to Cloud Formation/terraform scripts). There is a way to export estimates in the https://calculator.aws/#/ using https://docs.aws.amazon.com/pricing-calculator/latest/userguide/export-estimate.html
There are issues with the cost calculator - it does not import estimate templates or break out details costs after the initial construction.
For example a IaaS T3a.medium 2vCPU/4Gb 100GB EBS DT outbound 10GB no peak scaling reserved 3y no upfront, snapshot weekly is US $31/month
Export and Share
Export to CSV
|The spreadsheet does not break out the costs though|
And share to public URL