Non-Uniform Memory Access (NUMA) is reshaping microservice placement
Very detailed and accurate description. The author clearly knows way more than I do, but I would venture a few notes:
1. In the cloud, it can be difficult to know the NUMA characteristics of your VMs. AWS, Google, etc., do not publish them. I found the ‘lscpu’ command helpful (quick example just below this list).
2. Tools like https://github.com/SoilRos/cpu-latency plot the core-to-core latency on a 2d grid. There are many example visualisations on that page; maybe you can find the chip you are using.
3. If you get to pick VM sizes, pick ones that are the same size as a NUMA node on the underlying hardware. Eg prefer 64-core m8g.16xlarge over 96-core m8g.24xlarge which will span two nodes.
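On point 1, something like this is usually enough to see what the hypervisor actually exposes (output layout varies by distro and instance type):

```bash
# Node count and CPU-to-node mapping as seen by the guest
lscpu | grep -i numa

# Per-node memory sizes plus the node distance matrix
numactl --hardware

# Same data straight from sysfs, if numactl isn't installed
cat /sys/devices/system/node/node*/cpulist
```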
At OCI, our VM shapes are all single NUMA node by default. We only relatively recently added support for cross-NUMA instances, precisely because of the complications that NUMA introduces.
There are so many performance quirks, and so much software doesn't account for NUMA yet (in part, I'd bet, because most development environments don't have multiple NUMA domains).
Here's a fun example we found a few years ago (not sure if work has happened in the upstream kernel since): the Linux page cache wasn't fully NUMA-aware, and spans NUMA nodes. Someone at work was specifically looking at NUMA performance, and chose to benchmark databases on different NUMA nodes, trying the client on the same NUMA node and then cross-NUMA-node, using numactl to pin. After a bunch of tests it looked like client and server in NUMA node 0 was appreciably faster than client and server in node 1. After a reboot and re-running the tests, it had flipped: node 1 faster than node 0.

Eventually they worked out that the fast node was whichever one was benchmarked first after a reboot, and from there figured out that on a fresh run, the database client library ended up in the page cache in that NUMA domain. So if they benchmarked with server in 0 and client in 1, and then benchmarked with server in 0 and client in 0, that client's accesses to the client library ended up reaching across to the page-cached copy in node 1, paying a nice latency penalty over and over. His solution was to run the client in a NUMA-pinned Docker container so that the client library was a unique file to the OS.
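If you want to reproduce that kind of experiment, the pinning itself is just numactl; roughly this shape (the binary names are placeholders, not our actual setup):

```bash
# Server pinned to node 0 (both CPUs and memory allocations)
numactl --cpunodebind=0 --membind=0 ./db-server &

# Client on node 1 for the cross-node measurement,
# or on node 0 for the same-node measurement
numactl --cpunodebind=1 --membind=1 ./db-client

# Drop the page cache between runs so a library cached on one
# node during an earlier run doesn't skew the next one
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
```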
I've used https://instaguide.io/info.html?type=c5a.24xlarge#tab=lstopo to browse the info. It is getting a bit old though.
> Eg prefer 64-core m8g.16xlarge over 96-core m8g.24xlarge which will span two nodes.
It's sad that we have to do this by ourselves
Solid writeup of NUMA, scheduling, and the need for pinning for folks who don’t spend a lot of time in the IT side of things (where we, unfortunately, have been wrangling with this for over a decade). The long and short of it is that if you’re building an HPC application, or are sensitive to throughput and latency on your cutting-edge/high-traffic system design, then you need to manually pin your workloads for optimal performance.
One thing the writeup didn’t seem to get into is the lack of scalability of this approach (manual pinning). As core counts and chiplets continue to explode, we still need better ways of scaling manual pinning, or of building more NUMA-aware OSes/applications that can auto-schedule with minimal penalties. Don’t get me wrong, it’s a lot better than ye olden days of dual-core, multi-socket servers and stern warnings from vendors against fussing with NUMA schedulers if you wanted to preserve basic functionality, but it’s not a solved problem just yet.
> The long and short of it is that if you’re building an HPC application, or are sensitive to throughput and latency on your cutting-edge/high-traffic system design, then you need to manually pin your workloads for optimal performance.
Last time I was architect of a network chip, 21 years ago, our library did that for the user. For workloads that use threads that consume entire cores, it's a solved problem.
I'd guess that the workload you had in mind doesn't have that property.
This strikes me as something that Kubernetes could handle if it supported it. You can use affinity to ensure workloads stay together on the same machines; if K8s were NUMA-aware, you could extend that affinity/anti-affinity mechanism down to the core/socket level.
EDIT: aaaand ... I commented before reading the article, which describes this very mechanism.
It'd be great to see Kubernetes make more extensive use of cgroups, and especially nested cgroups, imo. The cpuset affinity should build into that layer nicely. More broadly, Kubernetes' desire to schedule everything itself, to fit the workloads intelligently to ensure successful running, feels like an anti-pattern when the kernel has a much more aggressive way to let you trade off and define priorities and bound resources; it sucks having the ultra lo-fi kube take. I want the kernel's "let it fail" version, where nested cgroups get to fight it out according to their allocations.
Really enjoyed this amazing write-up on how Kube does use cgroups. Seems like the QoS controls do give some top-level cgroups that pods then nest inside of. That's something, at least! https://martinheinz.dev/blog/91
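For reference, the raw kernel-side version of what I'm asking for is just nested cgroup v2 cpusets; a hand-rolled sketch (group names, CPU ranges, and $PID are made up, and this is not something Kubernetes does for you today):

```bash
# Enable the cpuset controller for children of the root cgroup
echo "+cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Parent group confined to (say) NUMA node 0's CPUs and memory
sudo mkdir -p /sys/fs/cgroup/node0
echo 0-15 | sudo tee /sys/fs/cgroup/node0/cpuset.cpus
echo 0    | sudo tee /sys/fs/cgroup/node0/cpuset.mems

# A nested child can only ever get a subset of its parent's resources
echo "+cpuset" | sudo tee /sys/fs/cgroup/node0/cgroup.subtree_control
sudo mkdir -p /sys/fs/cgroup/node0/svc-a
echo 0-7 | sudo tee /sys/fs/cgroup/node0/svc-a/cpuset.cpus

# Move a workload into the leaf group
echo $PID | sudo tee /sys/fs/cgroup/node0/svc-a/cgroup.procs
```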
If auto-NUMA doesn't handle your workload well and you don't want to manually pin anything, it's always possible to use single-socket servers and set NPS=1 (one NUMA node per socket). This will make everything uniformly "slow" (which is not that slow).
Historically, the Sparc 6400 was derided for not being NUMA, but instead being Uniformly Slow.
This is one of those way-down-the-road optimizations for folks in fairly rare scale situations, in fairly rare tight loops.
Most of us are in the realm of the lowest hanging fruit being database queries that could be 100x faster and functions being called a million times a day that only need to be called twice.
100% with you there. I can count one time in my entire 15 years where I had to pin a production workload for performance, and it was Hyperion.
In 99% of use cases, there are other, easier optimizations to be had. You’ll know if you’re in the 1% where workload pinning is advantageous.
For everyone else, it’s an excellent explainer of why most guides and documentation will sternly warn you against fussing with the NUMA scheduler.
> In 99% of use cases, there are other, easier optimizations to be had. You’ll know if you’re in the 1% where workload pinning is advantageous.
Cpu pinning can be super easy too. If you have an application that uses the whole machine, you probably already spawn one thread per cpu thread. Pinning those threads is usually pretty easy. Checking if it makes a difference might be harder... For most applications, it won't make a big difference, but some applications will see a big difference. Usually a positive difference, but it depends on the application. If nobody has tried cpu pinning your application lately, it's worth trying.
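From the outside it's as simple as taskset (binary name, CPU range, and $TID below are placeholders; inside the process, sched_setaffinity / pthread_setaffinity_np do the per-thread version):

```bash
# Confine the whole process to (say) node 0's 16 cores at launch
taskset -c 0-15 ./my-server &

# Or pin a single already-running thread, by its TID, to one core
taskset -cp 7 $TID
```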
Of course, doing something efficiently is nice, but not doing it is often a lot faster... Not doing things that don't need to be done has huge potential speedups.
If you want to CPU-pin network sockets, that's not as easy, but it can also make a big difference in some circumstances; mostly if you're a load balancer/proxy kind of thing where you don't spend much time processing packets, just receive and forward. In that case, avoiding cross-CPU reads and writes can provide huge speedups, but it's not easy. That one, yeah, only do it if you have a good idea it will help; it's kind of invasive and it won't be noticeable if you do a lot of work on requests.
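For the curious, the receive-side plumbing looks roughly like this; assume eth0, node 0 owning CPUs 0-15, and a made-up IRQ number:

```bash
# Steer RPS processing for one receive queue onto node-0 CPUs
# (hex bitmask: ffff = CPUs 0-15)
echo ffff | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus

# Keep that queue's IRQ on the same node too
grep eth0 /proc/interrupts                            # find the IRQ number
echo 0-15 | sudo tee /proc/irq/42/smp_affinity_list   # 42 is a placeholder
```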
Whilst you’re right in broad strokes, I would observe that “the garbage-collector” is one of those tight loops. Single-threaded JavaScript is perhaps one of the best defences against NUMA, but anyone running a process on multiple cores and multiple gigabytes should at least know about the problem.
Yeah, I was once in this situation with a perf-focused software defined networking project. Pinning to the wrong NUMA node slowed it down badly.
Probably another situation is if you're working on a DBMS itself.
There are some solutions that try to tackle this in HPC. For example https://github.com/LLNL/mpibind is deployed on El Capitan.
Would be interesting to see if something similar appears for cloud workloads.
There's a constant drum-beat of NUMA-related work going by if you follow phoronix.com.
https://www.phoronix.com/news/Linux-6.17-NUMA-Locality-Rando... https://www.phoronix.com/news/Linux-6.13-Sched_Ext https://www.phoronix.com/news/DAMON-Self-Tuned-Memory-Tierin... https://www.phoronix.com/news/Linux-6.14-FUSE
There's some big work I'm missing that's more recent too, again about allocating & scheduling IIRC. Still trying to find it. The third link is about DAMON, which is trying to do a lot to optimize; good thread to tug more on!
I have this pocket belief that eventually we might see post-NUMA, post-coherency architectures, where even a single chip acts more like multiple independent clusters that use something more like networking (CXL or Ultra Ethernet or something) to allow RDMA, but without coherency.
Even today, the title here is woefully under-describing the problem. An EPYC chip is actually multiple different compute dies, each with their own NUMA zone and their own L3 and other caches. For now, yes, each socket's memory all goes via a single IO die and is semi-uniform, but whether that holds is in question, and even today the multiple NUMA zones on one socket already require careful tuning for efficient workload processing.
Emulating NUMA on a single chip is already a known performance tweak on certain architectures. There are options in place to enable it: https://www.kernel.org/doc/html/v5.8/x86/x86_64/fake-numa-fo...
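If you want to try it, the knob from that doc is a boot parameter; roughly this (node count is arbitrary, grub steps vary by distro):

```bash
# Add to the kernel command line (e.g. GRUB_CMDLINE_LINUX in /etc/default/grub):
#   numa=fake=4        # split the machine into 4 emulated nodes
# then regenerate the grub config and reboot

# Afterwards, confirm the emulated topology
numactl --hardware
lscpu | grep -i numa
```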
Even the Raspberry Pi 5 benefits from NUMA emulation because it makes memory use patterns better match the memory controller’s parallelization capabilities.
IMO, it's a matter of time before an x86 or RISC-V extension shows up to begin the inevitable unification of GPU and SIMD in an ISA. NUMA work and clustering over CCXs and sockets is paving the way for the software support in the OS. The question is what makes as much of Vulkan, OpenCL, and CUDA go away as possible?
The vector-based SIMD of RISC-V is very neat. Very hard, but also very neat. Rather than having fixed instructions for a specific "take 4 fp32 and multiply by 3 fp32", then needing a new instruction for fp64, then a new one for fp32 x fp64, then a new one for 4 x 4, it generalizes the instructions to be more data-shape agnostic: here's a cross-product operation, you tell us what the vector lengths are going to be, and let the hardware figure it out.
I also really enjoyed the Stream Semantic Registers paper, which makes loads/stores implicit in some ops and adds counters that can walk forward and back automatically, so that you can loop immediately, start on the next element, and have the results dropped into the next result slot. This enables near-DSP levels of instruction density, staying ops-focused rather than having to spend instructions on loads and stores for each step. https://www.research-collection.ethz.ch/bitstream/20.500.118...
I still have a bit of a hard time seeing how we bridge CPU and GPU. The whole "single program multiple executor" waves aspect of the GPU is spiritually just launching a bunch of tasks for a job, but I still struggle to see an eventual convergence point. The GPU remains a semi mystical device to me.
The variable length vectors are probably one of those ideas that sound good on paper but don’t work that well in practice. The issue is that you actually do need to know the vector register size in order to properly design and optimize your data structures.
Most advanced uses of e.g. AVX-512 are not just doing simple loop-unrolling style parallelism. They are doing non-trivial slicing and dicing of heterogeneous data structures in parallel. There are idioms that allow you to e.g. process unrelated predicates in parallel using vector instructions, effectively MIMD instead of SIMD. It enables use of vector instructions more pervasively than I think people expect but it also means you really need to know where the register boundaries are with respect to your data structures.
History has generally shown that when it comes to optimization, explicitness is king.
> The variable length vectors are probably one of those ideas that sound good on paper but don’t work that well in practice
I don't understand this take; you can still query the vector length and have specialized implementations if needed.
But the vast majority of cases can be written in a VLA way, even most advanced ones, imo.
E.g. here are a few things that I know to work well in a VLA style: simdutf (upstream), simdjson (I have a POC), sorting (I would still specialize, but you can have a fast generic fallback), jpeg decoding, heapify, ...
Even with SMP, back in the 2000s, the Windows NT/2000 scheduler wasn't that great, re-scheduling processes/threads across CPUs; already back then, by making use of the processor affinity mask, we managed a visible performance improvement.
NUMA systems now make this even more obvious when scheduling is not done properly.
I've been struggling to find the settings to control NUMA on public cloud instances for a long time. Those are typically configured to present a single socket as a single UMA node, even on huge EPYCs. If someone has a tip on where to find those, I'd appreciate a link!
Very nice article. Gonna try this out on our lab cluster and see what improvements it gives.
If you really want to get into it, then you could start worrying about your I/O. In the AMD example here:
* https://www.thomas-krenn.com/en/wiki/Display_Linux_CPU_topol...
you'll see some NUMA nodes with networking I/O attached to them, others with NVMe, and others with no I/O. So if you're really worried about network latency you'd pin the process to the NIC-attached node, but if you care about disk numbers (a database?) you'd potentially be looking at the NVMe-attached node.
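The sysfs shortcut for finding those attachments, if you don't want to read the lstopo picture (device names and the node number are placeholders):

```bash
# Which NUMA node is the NIC attached to? (prints -1 if the platform doesn't say)
cat /sys/class/net/eth0/device/numa_node

# Same question for an NVMe controller
cat /sys/class/nvme/nvme0/device/numa_node

# Then run the latency-sensitive process on whichever node that printed
numactl --cpunodebind=2 --membind=2 ./proxy
```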
In recent years there's also chiplet-level locality that may need to be considered as well.
Examining this has been a thing in the HPC space for a decade or two now:
* https://www.open-mpi.org/projects/hwloc/lstopo/
* https://slurm.schedmd.com/mc_support.html
This "lstopo" tool is amazing and something I have been searching for a while.