Bags are being packed, calendars are filling up, and people are busy working out where their meetings are on Google Maps. Exactly how far, and how fast, they must hoof it between hotels and meeting-room suites is the looming thought!
As someone who isn’t attending in person this year, I’ve instead been thinking about what we’re seeing with customers and helping brief my colleagues on what to expect: what themes to look out for and, most importantly, the questions they should be asking when they hit the show floor and the meeting rooms.
It feels like there is a spring in the step of HPC once again. After a few years in which Covid made travel and meeting face to face less than satisfactory, we’ve still managed to clear the Exascale barrier (arguably starting with Fugaku, but definitively with the delivery of Frontier). We are starting to get a handle on how to use disruptive AI and ML technologies in HPC, and to use accelerated systems in production rather than just in large-scale government-funded research centres (or for AI/ML large language model training runs).
And it’s not just GPUs that are evolving at a remarkable rate. CPUs are making waves again. With core counts rising at a dizzying rate, we are seeing some efforts to halt the almost precipitous decline in bandwidth per core, which, as we know, gates the performance of so many legacy and real-world applications.
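To make that bandwidth-per-core point concrete, here is a minimal STREAM-style triad sketch (my own illustration, not vendor code) showing why a memory-bound kernel stops scaling long before you run out of cores:

```cpp
// A STREAM-style triad: ~24 bytes of memory traffic per fused multiply-add,
// so throughput is gated by memory bandwidth, not core count.
// Compile (illustrative): g++ -O3 -fopenmp triad.cpp -o triad
#include <chrono>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const size_t n = 1ull << 26;   // 64M doubles per array, ~1.5 GiB total
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
    const double scalar = 3.0;

    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double gbs  = 3.0 * n * sizeof(double) / secs / 1e9;   // 3 arrays touched
    std::printf("threads=%d  %.1f GB/s\n", omp_get_max_threads(), gbs);
    return a[n / 2] > 0 ? 0 : 1;   // keep the compiler honest
}
```

Run it with OMP_NUM_THREADS set to 1, 2, 4 and so on: on most modern sockets the measured GB/s flattens well before the thread count does, and that plateau is exactly the bandwidth-per-core ceiling at work.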
Of course, with monster CPUs and GPUs come some hefty TDPs. While the question of how to cool and where to host our systems has always been pressing, the sorts of power densities that even COTS systems are going to be pushing make the hosting choice a real challenge, especially for academia and SMEs. Add in the sustainability requirements, including renewable-energy commitments, that many commercial and government sites must adhere to, and the efficiency and carbon footprint of a system and its hosting mean there is a genuine need to consider alternatives.
In no small measure this has driven the growing migration of HPC workloads to the cloud, or at least to someone else’s datacentre! As the reservoir of providers of HPC tin shrinks, the number of companies that want to host HPC systems appears to be growing. Of course, that is to some extent swapping one problem for another, and the growth in colocation is matched only by the growth in managed service offerings for HPC.
As a customer there is a rich ecosystem vying for your attention, and whatever your blend of cost sensitivity and value drivers, there are going to be options to suit your cashflow and risk appetite. Much has been made of various companies talking about cloud repatriation, and while there have been some high-profile moves, most of the chatter has focussed on what they are paying today rather than on the value of being in the cloud. What is certain is that it’s a fast-moving market, and that having a handle on your workloads, the value they represent, and what they cost you to run, be it on-prem or in the cloud, is vital.
The pre-SC22 PR train has seen both Intel and AMD formally unveil their latest generation of datacentre-class CPUs. AMD’s latest Genoa offerings seem to have the whip hand as they show off a full stack of SKUs for both HPC and enterprise. We’re a little light on detailed HPC benchmarks thus far, but I suspect we’ll see much more detail on the show floor. With respectable IPC gains, a smart AVX-512 implementation (executed on double-pumped 256-bit datapaths, which sidesteps the frequency throttling that dogged early Intel implementations) and improved power and bandwidth over Milan, it’s hard to see Intel’s Sapphire Rapids making much of a dent beyond simple supply-chain dynamics.
My interpretation is that AMD thinks it’s in pole position, and the pricing of this generation of EPYC reflects that. Why? Well, they will be relying on the improved OPEX characteristics and relative performance against Intel, and against their own Milan family, rather than on a generation-on-generation reduction in cost per core.
There’s been gossip about Intel’s delivery woes since the oft-delayed Aurora was first mooted. It sounds like Intel has finally started to ship Sapphire Rapids HBM Xeons and Ponte Vecchio accelerators (oops, sorry: Max Series CPUs and GPUs) to Argonne, arriving rather later than planned. Watch this space for benchmarking output!
Whether there will be enough Max Series parts to go round on the open market is still an open question. Intel seemed almost keener to talk about the successors to the current Max CPUs and GPUs, which, I believe, tells you a lot about their relative competitiveness. I don’t want to disparage the achievement that the new families represent, but they’re very late, so it is hard to see how they compete on absolute performance with AMD and NVIDIA this time round. The levers Intel has left are volume, pricing and, of course, the software ecosystem. It’s fair to say that, in general, I really like what they are trying to do with oneAPI!
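For those who haven’t looked at oneAPI, much of the appeal is single-source portability across CPUs, GPUs and other accelerators via SYCL. Here’s a minimal sketch of my own (not an Intel sample) to give the flavour:

```cpp
// A minimal SYCL vector add: the same single source can target a CPU,
// GPU or other accelerator, which is the core of the oneAPI pitch.
// Compile (illustrative): icpx -fsycl vadd.cpp -o vadd
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    sycl::queue q{sycl::default_selector_v};   // picks the "best" device found
    {
        sycl::buffer<float, 1> A{a.data(), sycl::range<1>{n}};
        sycl::buffer<float, 1> B{b.data(), sycl::range<1>{n}};
        sycl::buffer<float, 1> C{c.data(), sycl::range<1>{n}};

        q.submit([&](sycl::handler& h) {
            sycl::accessor ra{A, h, sycl::read_only};
            sycl::accessor rb{B, h, sycl::read_only};
            sycl::accessor wc{C, h, sycl::write_only, sycl::no_init};
            h.parallel_for(sycl::range<1>{n},
                           [=](sycl::id<1> i) { wc[i] = ra[i] + rb[i]; });
        });
    }   // leaving scope synchronises and copies results back into c

    std::printf("c[0] = %.1f, computed on: %s\n", c[0],
        q.get_device().get_info<sycl::info::device::name>().c_str());
}
```

The same source runs on whatever device the runtime finds, which is precisely the ecosystem play Intel is making.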
I’m also interested to see more details on ‘Intel On Demand’ (or Software Defined Silicon) and what sort of pricing will persuade customers to manage this sort of complexity in an HPC environment. With Intel’s market share still declining and their recent financial statements making for painful reading, my questions are: will AMD be able to capitalise? What will the impact on CPU ASPs be? And how much competition is there going to be in the procurement space now?
Another interesting angle: memory and platform costs are pushing closer to 50% of a node’s total cost. Couple that with the realisation that, for many systems and workflows, there’s a lot of stranded memory these days, and trading core density for more nodes and more floor space makes even less sense. Thankfully AMD seem to have spotted this, and their very decent performance with single-rank DDR5 DIMMs is a boon.
There was lots of chatter about composable compute at ISC22, and AMD’s decision to delay Genoa to make sure it intercepted the CXL disaggregated memory standard could be a huge win, especially in the cloud hyperscale space. It will be interesting to see who breaks cover with this first and what sort of performance tax there is (hint: according to this paper from Microsoft, not as much as you may fear).
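Since CXL-attached memory is expected to surface to the operating system as a CPU-less NUMA node, you can get a rough preview of that tax today by binding an allocation to a remote node with libnuma and re-running a streaming kernel against it. A hedged sketch of my own, using cross-socket NUMA as a stand-in for real CXL hardware:

```cpp
// CXL memory is expected to appear as a CPU-less NUMA node, so binding an
// allocation to a remote node with libnuma gives a crude preview of the
// "performance tax" of disaggregated memory.
// Compile (illustrative): g++ -O3 cxl_preview.cpp -lnuma -o cxl_preview
#include <numa.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }

    // Node to test: on a real CXL box this would be the CXL-exposed node.
    int node = (argc > 1) ? std::atoi(argv[1]) : 0;

    const size_t n = 1ull << 26;   // 512 MiB of doubles
    double* buf = static_cast<double*>(
        numa_alloc_onnode(n * sizeof(double), node));
    if (!buf) { std::puts("allocation failed"); return 1; }

    for (size_t i = 0; i < n; ++i) buf[i] = 1.0;   // fault pages in on `node`

    auto t0 = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += buf[i];  // streaming read
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("node %d: %.1f GB/s (checksum %g)\n",
                node, n * sizeof(double) / secs / 1e9, sum);
    numa_free(buf, n * sizeof(double));
}
```

Compare the GB/s for the local node against a remote one; the gap gives a feel for the sort of penalty disaggregated memory tiers may carry.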
Then there’s NVIDIA’s latest entry into the HPC space with their Grace Hopper devices. While it’s being marketed as an AI and HPC device, the importance of double-precision floating point is clearly being de-emphasised in favour of more AI-centric, reduced-precision operations. Which makes some of the SC22 Twitter chat about mixed-precision arithmetic in HPC resonate a little more (there’s a BoF, Mixed Feelings About Mixed Precisions, which should be illuminating). It also makes what AMD does with its MI300 and successors interesting to watch: will it maintain the existing ratio of double precision to the lower-precision formats?
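If mixed precision in HPC is new to you, the canonical pattern is iterative refinement: do the expensive solve in low precision, then recover full accuracy with cheap high-precision residual corrections. A toy sketch of my own, with a 2x2 system standing in for a real LU or Cholesky factorisation:

```cpp
// Iterative refinement, the classic mixed-precision pattern: do the O(n^3)
// solve in FP32, then cheaply recover FP64 accuracy via FP64 residuals.
// A toy 2x2 example; real codes apply this to full factorisations.
#include <cstdio>
#include <cmath>

// FP32 solve of a 2x2 system via Cramer's rule: the cheap, low-precision step.
static void solve32(const float A[2][2], const float b[2], float x[2]) {
    float det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    x[0] = (b[0] * A[1][1] - b[1] * A[0][1]) / det;
    x[1] = (A[0][0] * b[1] - A[1][0] * b[0]) / det;
}

int main() {
    const double A[2][2]   = {{4.0, 1.0}, {1.0, 3.0}};
    const double b[2]      = {6.0, 7.0};          // exact solution: x = (1, 2)
    const float  A32[2][2] = {{4.0f, 1.0f}, {1.0f, 3.0f}};

    float  b32[2] = {(float)b[0], (float)b[1]};
    float  dx[2];
    double x[2] = {0.0, 0.0};

    solve32(A32, b32, dx);                        // initial FP32 solve
    x[0] = dx[0]; x[1] = dx[1];

    for (int it = 0; it < 3; ++it) {
        // Residual in FP64: the only step that needs full precision.
        double r0 = b[0] - (A[0][0] * x[0] + A[0][1] * x[1]);
        double r1 = b[1] - (A[1][0] * x[0] + A[1][1] * x[1]);
        float  r32[2] = {(float)r0, (float)r1};
        solve32(A32, r32, dx);                    // FP32 correction solve
        x[0] += dx[0]; x[1] += dx[1];
        std::printf("iter %d: |r| = %.2e\n", it, std::sqrt(r0 * r0 + r1 * r1));
    }
    std::printf("x = (%.15f, %.15f)\n", x[0], x[1]);
}
```

This is the trick that lets hardware tuned for reduced-precision throughput still deliver FP64-quality answers for suitably conditioned problems.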
I’m curious to find out how AMD plans to beef up its ML/AI capabilities going forward, and to see how its strategy evolves around the integration of Xilinx IP and the use of advanced packaging and chiplets. Can you tell I’m usually to be found frequenting the NDA’d vendor briefings?
Speaking of AI/ML, I’ll be interested to see whether there’s a move to integrate AI/ML surrogates into simulation workflows, and more use of AI/ML for parameterisation, computational steering, data preparation and analysis.
I’ll be watching events unfold via #SC22 and #HPC, and a big shout-out to all my HPC friends who are going to be in Dallas. Have a great show and stay safe!
Dairsie Latimer – Senior Principal Advisor