Streamlining Your Cloud HPC Journey
Over the last few years, I have had the privilege of being involved in several Cloud High-Performance Computing (HPC) cluster deployments.
Despite the wide variation in designs, which highlights the flexibility of the Cloud, I have noticed some common roadblocks along the way.
At best, these roadblocks cause a small delay to timelines; at worst, they produce a solution which does not meet expectations.
In this article, I highlight the most significant roadblocks I have encountered during Cloud HPC deployments, in the hope that this will help streamline your journey to the cloud.
Engaging With Your Security Team
Since HPC clusters are very commonly used for research, their users need a greater degree of freedom, including access to resources which may be blacklisted or very tightly controlled elsewhere in the organisation.
This will often be in the form of open-source software developed for niche purposes, such as software hosted on GitHub or SourceForge.
It is often the case that HPC and enterprise IT are not fully integrated on-premises, so it is only when moving to the Cloud that the freedoms available to HPC users become apparent to a security team.
The initial reaction of a security team which is not familiar with HPC will probably be to apply the same principles they apply to other parts of the IT estate. This is often too restrictive to make good use of the Cloud HPC service.
It is therefore important to bring your security team into the conversation as early as possible.
Areas which have often been problematic on projects include authentication solutions for logging into the HPC service, auditing admin access to the Cloud HPC service and working around internet firewalls.
Engaging With Your Cloud Management Team
When an organisation moves to the Cloud, it is prudent to form, and if necessary hire, a team dedicated to the management of the cloud estate.
As external consultants hired to assist with a cloud HPC deployment, Red Oak Consulting often needs to engage with such a team.
On many projects I have worked on, such teams have excellent general knowledge of the cloud but are often unaware of the particular needs of Cloud HPC.
A common occurrence is the use of blanket cloud policies.
For example, a policy which controls costs by restricting the virtual machine types that can be used is in general a good idea.
However, Cloud HPC often requires access to specific VM types which usually fall on the more costly side.
As was the case for the security team, it is important to bring your Cloud management team into the conversation early to remove any potential misunderstandings.
In addition to virtual machine types used, resource tagging and naming conventions have also been common areas of misunderstanding in past projects.
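To make the point about blanket policies concrete, here is a minimal sketch of a policy check that keeps a cost-controlling allow-list for general workloads but widens it for resources tagged as HPC, and also enforces a set of required tags. Everything in it is an assumption for illustration: the VM size names, the tag keys and the `VmRequest` and `policy_violations` names are hypothetical, and in practice a rule like this would normally live in your cloud provider's own policy tooling rather than in a standalone script.

```python
from dataclasses import dataclass, field

# Hypothetical values for illustration only; real VM size names and tag
# keys depend on your cloud provider and your organisation's conventions.
GENERAL_ALLOWED_SIZES = {"small-2cpu", "medium-4cpu"}
HPC_ALLOWED_SIZES = GENERAL_ALLOWED_SIZES | {"hpc-120cpu-infiniband", "gpu-8x"}
REQUIRED_TAGS = {"cost-centre", "owner", "workload"}


@dataclass
class VmRequest:
    name: str
    size: str
    tags: dict = field(default_factory=dict)


def policy_violations(request: VmRequest) -> list[str]:
    """Return human-readable policy violations (empty list if compliant)."""
    problems = []

    # Resources tagged as HPC workloads get the wider allow-list,
    # everything else stays on the cost-controlled general list.
    is_hpc = request.tags.get("workload") == "hpc"
    allowed = HPC_ALLOWED_SIZES if is_hpc else GENERAL_ALLOWED_SIZES
    if request.size not in allowed:
        problems.append(f"size '{request.size}' is not allowed for this workload")

    # Every resource must carry the agreed set of tags.
    missing = REQUIRED_TAGS - set(request.tags)
    if missing:
        problems.append("missing tags: " + ", ".join(sorted(missing)))

    return problems
```

Agreeing on something like the HPC exemption and the required tag keys with the Cloud management team at the start of the project is the part that matters; the code is only one way of writing the agreement down.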
Agile vs Waterfall
On-premises HPC procurements are often a long process, requiring months of pre-planning followed by implementation and testing. ‘Waterfall’ project management is a common approach.
When moving to the Cloud, customers often start with a desire to make a Cloud ‘copy’ of an on-premises HPC cluster and ensure that users are provided with a comparable experience to on-premises, commonly known as a ‘lift and shift’ migration.
However, due to the way costs are structured on the Cloud, mirroring an on-premises deployment is frequently uneconomical.
Because of this, the final design is often quite different from an on-premises deployment, even if users are unable to tell the difference.
An ‘agile’ project management approach, which allows customers to familiarise themselves with and test the various options on the cloud, is therefore frequently the better option for Cloud HPC deployments.
This leads to a final solution which makes good use of the cloud and which everybody is happy with.
The use of Agile also meshes well with the fact that infrastructure on the cloud can and in most cases should be deployed via code.
Overlooked Areas of Cloud HPC Design
In my experience, the design of a monitoring solution is by far the most overlooked area of a Cloud HPC deployment.
This should be differentiated from security monitoring, which security teams often insist on having in place before a solution can go live.
Here, I am referring to basic areas of administration such as setting up alerts for disk capacity usage, cost monitoring and budgeting, and monitoring logs for your Cloud orchestration solution (think Azure CycleCloud or AWS ParallelCluster to name a few).
The priority is usually to deploy a working solution, and monitoring is rarely a roadblock to this. However, soon after a solution is handed over to the customer, a lack of monitoring in the design can lead to a negative initial experience for users and costs which alarm your finance department.
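As a rough illustration of how little is needed to get started, the sketch below shows two of the basic checks mentioned above: a disk capacity alert and a simple budget alert. It is a minimal sketch under stated assumptions, not a recommended monitoring design: the function names and thresholds are hypothetical, the budget projection is a naive linear extrapolation, and the spend figure is assumed to come from your cloud provider's billing export. In a real deployment these checks would more likely be configured in the provider's native monitoring and budgeting services.

```python
import shutil
from typing import Optional


def disk_usage_alert(path: str, threshold: float = 0.85) -> Optional[str]:
    """Return a warning if the filesystem holding `path` is at or above `threshold`."""
    usage = shutil.disk_usage(path)
    fraction_used = usage.used / usage.total
    if fraction_used >= threshold:
        return f"{path} is {fraction_used:.0%} full (threshold {threshold:.0%})"
    return None


def budget_alert(spend_to_date: float, monthly_budget: float,
                 day: int, days_in_month: int) -> Optional[str]:
    """Return a warning if spend, projected linearly, would exceed the budget."""
    # Naive assumption: spending continues at the same daily rate all month.
    projected = spend_to_date / day * days_in_month
    if projected > monthly_budget:
        return f"projected spend {projected:.2f} exceeds budget {monthly_budget:.2f}"
    return None
```

For example, `budget_alert(500, 1000, day=10, days_in_month=30)` projects a spend of 1500 against a budget of 1000 and returns a warning, whereas a spend of 100 on the same day returns `None`.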
Another often overlooked area is the disaster recovery aspect of your cloud HPC solution. This is particularly important if the to-be-deployed Cloud HPC solution will be critical to your business.
Unlike on-premises, disaster recovery on the Cloud is often as simple as running a handful of automation scripts if set up appropriately. Yet many customers we have worked with do not consider it at all.
As with monitoring, disaster recovery is unlikely to act as a roadblock to initial solution deployment but can lead to unfortunate consequences after a system goes into production.
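The "handful of automation scripts" idea can be sketched as an ordered runbook that executes each recovery step in turn and stops at the first failure, so that nobody has to remember the sequence under pressure. This is a hypothetical outline only: the `run_recovery` name and the step names in the usage note are my own placeholders, and the real steps would call whatever deployment code and backup tooling your solution already uses.

```python
from typing import Callable


def run_recovery(steps: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Run recovery steps in order; stop and report at the first failure."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Later steps usually depend on earlier ones, so do not continue.
            raise RuntimeError(f"recovery stopped at step '{name}': {exc}") from exc
        completed.append(name)
    return completed
```

A runbook might then be a list such as `[("redeploy network", redeploy_network), ("restore storage", restore_storage), ("redeploy cluster", redeploy_cluster)]`, where each function is a placeholder for one of your own scripts. Running it periodically as a test is what turns disaster recovery from a design document into something you can rely on.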
Manveer Munde
Principal Consultant
Red Oak Consulting