Having assisted with the management of a number of cloud HPC systems over the last few years, I have noticed several cost components that are often overlooked.
At the same time, customers can find it difficult to keep even the more obvious costs under control.
In this article, with a focus on cloud HPC deployments, I hope to explore both of these topics and, in the process, help you optimise your cloud costs.
The Obvious Costs and How to Control Them
In most cases, the major cost components of a cloud HPC deployment are likely to be the performance storage volumes and the virtual machines used to carry out calculations.
Performance Storage
For performance storage, it's important to make sure that you only pay for what you require: many storage services charge for provisioned capacity whether or not it is actually used.
Even when the capacity of a storage volume is adequate for the work being done, users can quickly fill any spare space with files that could be moved elsewhere.
Where user quotas can be implemented, take full advantage of them. Where quotas are not available, encourage users to regularly move unneeded files to a cheaper storage option.
Optimising storage costs may therefore require using multiple storage options and tiering data between them to get the best value for money.
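As a starting point for identifying data that could be tiered off performance storage, the sketch below walks a filesystem and totals everything that has not been accessed recently. The mount point, the 90-day threshold, and the reliance on access times (which are unreliable on filesystems mounted with noatime) are all assumptions to adapt to your own system.

```python
#!/usr/bin/env python3
"""Report files not accessed recently, as candidates for a cheaper storage tier."""
import os
import time

SCRATCH = "/scratch"      # hypothetical performance storage mount point
THRESHOLD_DAYS = 90       # example cut-off for "stale" data
cutoff = time.time() - THRESHOLD_DAYS * 86400

stale_bytes = 0
for root, _dirs, files in os.walk(SCRATCH):
    for name in files:
        path = os.path.join(root, name)
        try:
            st = os.stat(path, follow_symlinks=False)
        except OSError:
            continue                      # file vanished or unreadable; skip it
        if st.st_atime < cutoff:          # not accessed within the threshold
            stale_bytes += st.st_size
            print(f"{st.st_size:>14}  {path}")

print(f"Total candidate data: {stale_bytes / 1e12:.2f} TB")
```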
Although moving data out of the cloud is an option, egress charges will apply, and it is often more cost-effective to keep your data within the same cloud platform: moving data between different storage services on the same platform should, in most cases, be free of charge.
There is a caveat with private link services (e.g. Azure Private Link, AWS PrivateLink), which transport data between cloud resources without any exposure to the public network.
These services charge for ingress and egress between resources within the cloud, which can add up when a large amount of data is transported.
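A back-of-the-envelope estimate makes the point; the data volume and per-GB rate below are placeholders rather than quoted prices, so substitute your provider's current figures.

```python
# Rough estimate of private-link data-processing charges (illustrative numbers only).
data_moved_tb = 50      # hypothetical data pushed through the private endpoint each month
rate_per_gb = 0.01      # hypothetical combined ingress + egress processing rate, in USD/GB

monthly_cost = data_moved_tb * 1000 * rate_per_gb
print(f"Estimated private-link processing cost: ${monthly_cost:,.0f}/month")  # -> $500/month
```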
Virtual Machines
Several options are available to save costs on the virtual machines used to carry out calculations. First, consider the types of calculations you require. For example, will they require internode communication?
If not, it is usually cheaper to use a compute-optimised node type than one which comes with high-performance networking. Similar arguments may apply to memory usage.
It is also worth considering whether your calculations can tolerate preemption. If so, making use of spot pricing can considerably cut costs.
Similarly, reserving nodes in advance can also cut costs; however, in this case, it is important to understand the minimum amount of capacity you will consistently require.
This is, of course, complicated by projected growth in usage of your cloud HPC service; engaging with potential users and analysing historical HPC scheduler logs are both useful in this regard.
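One way to mine those scheduler logs is sketched below. It assumes job records have been exported to a CSV with 'start', 'end', and 'nodes' columns, which is not the native format of any particular scheduler, and it estimates a baseline node count that could safely be reserved, leaving peaks to on-demand or spot capacity.

```python
"""Estimate a reservation baseline from historical scheduler job records."""
import csv
from datetime import datetime

# Build a list of events: +nodes when a job starts, -nodes when it ends.
events = []
with open("jobs.csv") as fh:       # hypothetical export of scheduler accounting data
    for row in csv.DictReader(fh):
        start = datetime.fromisoformat(row["start"])
        end = datetime.fromisoformat(row["end"])
        nodes = int(row["nodes"])
        events.append((start, nodes))
        events.append((end, -nodes))

# Sweep through the events in time order to build a concurrent-node profile.
events.sort()
profile, current = [], 0
for _when, delta in events:
    current += delta
    profile.append(current)

# The usage level met or exceeded in ~90% of samples is a reasonable reservation
# candidate. (Samples are taken at job boundaries, not time-weighted -- a simplification.)
profile.sort()
baseline = profile[int(0.10 * len(profile))] if profile else 0
print(f"Suggested reserved-node baseline: {baseline}")
```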
The most advanced accelerator nodes are among the most expensive virtual machine types available, and because demand for them is so high, spot pricing is usually not viable.
In this case, it is often worth considering previous generations of accelerator nodes, which offer lower prices and greater availability via spot.
Often Overlooked Costs
Virtual Desktops and Disks
Customers often build virtual desktops into their HPC deployment design to allow users to analyse data from calculations while keeping the data in the cloud.
Ideally, these virtual desktops should be switched off when not in use. Often, so as not to degrade the user experience, users are given the ability to stop and start their desktops as required.
I emphasise here that by stopping and starting, I mean deallocating the virtual machines back to the cloud and reallocating them on demand.
This in itself can lead to misunderstandings, with users switching off a virtual machine via the OS rather than deallocating it.
The major problem here is that users often forget to stop their virtual desktops when not in use. It can also be the case that users who request a virtual desktop hardly use it but still leave it running. Even when a virtual machine is stopped, there will still be charges for the attached disks (at least the OS disk).
Beyond enforcing an automated, mandatory stop of virtual machines, there are two further methods I have found useful for reducing the extent of this problem.
Firstly, virtual desktops need to be updated regularly for OS patching reasons. This maintenance window can be used to enforce a mandatory stop of all virtual desktops and require that users start the virtual desktops themselves. Notice of this maintenance period should be given to users in advance. This could be anything from a weekly to a monthly occurrence.
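A minimal sketch of such a mandatory stop is shown below, driven by the Azure CLI from Python. The 'role=desktop' tag used to identify the desktops is a hypothetical labelling convention; in practice this would run from a scheduled pipeline or cron job at the start of the maintenance window.

```python
"""Deallocate every tagged virtual desktop at the start of the maintenance window."""
import json
import subprocess

def az(*args):
    """Run an Azure CLI command and return its parsed JSON output (None if empty)."""
    out = subprocess.run(["az", *args, "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out) if out.strip() else None

# Find every VM tagged as a virtual desktop (the tagging convention is an assumption).
desktops = az("vm", "list",
              "--query", "[?tags.role=='desktop'].{name: name, rg: resourceGroup}")

for vm in desktops or []:
    print(f"Deallocating {vm['name']} in {vm['rg']}")
    # Deallocation releases the compute so it is no longer billed;
    # the attached disks continue to incur charges.
    az("vm", "deallocate", "--resource-group", vm["rg"], "--name", vm["name"], "--no-wait")
```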
Secondly, regularly auditing desktop usage can ensure that virtual desktops that are not required are removed. In addition, users who make occasional use of virtual desktops could be placed onto a shared machine.
DevTest Environments
To ensure updates made to a production environment are not detrimental, it is important to test them in a development and testing environment. On the cloud, this may be a single HPC cluster or several. In addition to HPC clusters, other services may also be in the testing phase.
It is very easy to lose track of these deployments, with a corresponding impact on costs. Regular auditing of these deployments is a must. In addition, if certain components can be switched off when not in use (such as virtual machines or database services), this provides another obvious way to save costs.
If the deployments are quick to set up and are not needed overnight, they can even be deleted at the end of the day and redeployed the next morning. This is made easier by using automation pipelines for deployments, with the bonus that the pipelines themselves are tested and refined every day.
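The teardown half of that cycle can be as simple as the sketch below, which assumes each dev/test environment lives in its own resource group named with a 'devtest-' prefix; both the prefix and the one-group-per-environment layout are assumptions, and redeployment the next morning is left to the automation pipeline.

```python
"""Delete every dev/test resource group overnight (assumes a 'devtest-' naming convention)."""
import json
import subprocess

# List resource groups whose names mark them as dev/test environments.
groups = json.loads(subprocess.run(
    ["az", "group", "list", "--query", "[?starts_with(name, 'devtest-')].name", "-o", "json"],
    check=True, capture_output=True, text=True).stdout)

for name in groups:
    print(f"Deleting resource group {name}")
    # Deleting the resource group removes everything inside it: VMs, disks, IPs, and so on.
    subprocess.run(["az", "group", "delete", "--name", name, "--yes", "--no-wait"], check=True)
```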
Public IP Addresses
Public IP address costs can be confusing for new users of the cloud. With some cloud providers (such as Microsoft Azure), and depending on the deployment method used, a public IP address resource is created by default when a virtual machine is deployed.
Public IP addresses come with a cost on top of the virtual machine and its associated disks. If a public IP is not required and this is overlooked, you end up paying for it unnecessarily.
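When deploying with the Azure CLI, passing an empty string to --public-ip-address skips creating the public IP resource altogether. The resource group, VM name, and image in the sketch below are placeholders.

```python
"""Create a VM without a public IP address (names and image are placeholders)."""
import subprocess

subprocess.run([
    "az", "vm", "create",
    "--resource-group", "my-rg",
    "--name", "compute-node-01",
    "--image", "Ubuntu2204",
    "--generate-ssh-keys",
    "--public-ip-address", "",   # empty string: no public IP resource, no extra charge
], check=True)
```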
Using Azure as an example, public IP addresses can be ‘static’ or ‘dynamic’ (different terms may be used for other cloud providers).
A static IP address ensures that no matter how many times a virtual machine is stopped (i.e. deallocated) and started, it will always have the same public IP address. Such IP addresses usually incur charges regardless of whether a virtual machine is stopped.
On the other hand, dynamic IP addresses can change every time a virtual machine is stopped and started. Such IP addresses are not charged while the virtual machine is stopped.
For development and testing environments, dynamic public IP addresses are an obvious way to cut costs (assuming the relevant virtual machines are stopped when not in use). In addition, regular audits for unnecessary public IP addresses are worthwhile.
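Such an audit can again lean on the Azure CLI: a public IP resource that is not attached to anything has no IP configuration, which makes it easy to flag. A minimal sketch:

```python
"""List public IP address resources that are not attached to any NIC or load balancer."""
import json
import subprocess

out = subprocess.run(["az", "network", "public-ip", "list", "-o", "json"],
                     check=True, capture_output=True, text=True).stdout

for pip in json.loads(out):
    if pip.get("ipConfiguration") is None:   # unattached, so likely safe to delete
        print(f"Unattached public IP: {pip['name']} in {pip['resourceGroup']}")
```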
Manveer Munde
Principal Consultant