
Conversation

@shivammonaka (Contributor) commented Jun 7, 2024

Problem Statement:
Currently, only one BLAS call can be executed at a time, even if that BLAS call requires only 8 out of the 64 available cores. This limitation results in poor core utilization.

Solution:
Remove the restriction on executing only a single BLAS call at a time by modifying the execution architecture and managing thread race conditions on shared variables.

How:
There are multiple approaches to achieve this. I've explored several methods, but encountered limitations with each. The current approach uses POSIX condition variables with mutex locking. Each BLAS call computes the number of cores it requires before execution, reserves them while holding the lock, and then releases the lock.

Execution Model:

  1. Numerous BLAS calls arrive simultaneously at the entry point, such as level3_thread.c.
  2. After setting up their queue status and related information, they all call exec_blas().
  3. Only one BLAS call passes through the mutex (level3_lock), while others wait.
  4. The successful BLAS call attempts to allocate the cores it needs. If allocation fails, it sleeps and releases the mutex. If successful, it reserves the cores and releases the mutex.
  5. The BLAS call that acquired its cores executes the call, releases the cores after execution, and signals sleeping BLAS calls to check for core availability again.

The specifications are as follows:

A machine equipped with 64 cores.
The matrices are square with dimension 100. The rationale behind this choice is that a matrix of size 100 needs only 8 cores (as of OpenBLAS v0.3.23) per BLAS (Basic Linear Algebra Subprograms) call, making it an ideal case for evaluating the impact of parallelism on a 64-core machine.

Observations:
Initially, at low numbers of BLAS calls, performance gains are marginal or even negative. This is because the locking mechanism, though efficient for larger cases, introduces overhead in smaller cases. However, significant improvements are observed with larger cases due to parallelism in BLAS calls.
[figure: performance of original vs. improved OpenBLAS across numbers of BLAS calls]

Core Utilization for Original OpenBLAS vs Improved OpenBLAS:

Original OpenBLAS executing a BLAS call that requires 8 cores on a 64-core machine:
[core-utilization chart]

Improved OpenBLAS executing a BLAS call that requires 8 cores on a 64-core machine:
[core-utilization chart]

Future Improvement:
Asynchronous core allocation: initiating thread execution asynchronously, without waiting for the full core allocation to complete.

Limitations:
Locking overhead takes a toll on performance when the number of BLAS calls is low; input from @martin-frbg would help resolve this.

@shivammonaka (Contributor, Author)

@martin-frbg any thoughts on this?

@martin-frbg martin-frbg added this to the 0.3.28 milestone Jun 20, 2024
@martin-frbg martin-frbg merged commit 7e9a4ba into OpenMathLib:develop Jun 20, 2024
@martin-frbg (Collaborator)

No better thought so far than trivially making this conditional on having more than say 8 or 16 cores in total so that we do not penalize small hardware unnecessarily. At least I do not see how we could know in advance how many level3 BLAS calls to expect. The alternative would be to make it a build-time option, but that would probably be less than ideal for distribution packagers or HPC support staff...

@martin-frbg (Collaborator)

Unfortunately this has now been shown to cause (or at least facilitate) wrong results when calling DDOT from multiple threads in parallel.

@mattip (Contributor) commented Oct 15, 2025

Are there thoughts about backing this out?

@martin-frbg (Collaborator)

I'd hoped to achieve some understanding of the cause of this problem, and/or receive input from the author of this PR. What is particularly confusing to me is that the reported failure is with DOT, which I think should not execute any code in level3_thread.c (the single source file affected by this PR) at all.

@shivammonaka (Contributor, Author)

I’m also a bit confused about why this is happening. One thing that stands out is that we might need to use pthread_cond_broadcast() instead of pthread_cond_signal(), so that all waiting threads are awakened and can re-check the condition — each taking resources as they become available.

I suspect the issue you’re seeing is related to how the Pthreads backend manages shared resources inside blas_server.c. However, I haven’t yet identified any specific resource in that code that could directly cause this breakdown under heavy parallelism.

@mattip (Contributor) commented Oct 22, 2025

the reported failure is with DOT

Actually, NumPy np.dot calls NumPy's cblas_matrixproduct which uses gemv, gemm or syrk (for np.dot(x, x.T)). In the case of the reproducer in numpy/numpy#29391, it is calling gemm.

Redoing the bisect on linux, this PR is the place that causes the reproducer to fail.

I am curious how

a matrix of size 100, only 8 cores(As per OpenBLAS V0.3.23) are needed per BLAS (Basic Linear Algebra Subprograms) call.

Is the limit of 8 cores enforced by OpenBLAS somehow? NumPy gets complaints that, for any OpenBLAS call, all the CPUs are activated. Is there a mechanism we are not using to limit calls to a smaller number of CPUs?

@shivammonaka (Contributor, Author)

In its initialization phase, OpenBLAS invokes num_cpu_avail() to decide the optimal thread count for SGEMM, considering factors such as machine architecture, matrix dimensions, and internal thread-throttling configuration.

@martin-frbg (Collaborator)

Thanks. I have just come to the same conclusion by instrumenting the function calls in openblas - the call to sdot is a red herring, all it does is compute a 2x2 dot product before sgemm is called with sizable inputs. So it totally makes sense that we end up in level3_thread.
Now the problem with this PR is that calling exec_blas without the level3_lock held creates an opportunity for races as idle threads get assigned new workloads and traverse the list of memory buffers to find a free slot. (ISTR this was pointed out at length in one of the earlier issue tickets on GEMV a few years ago - I need to dig that up and perhaps try to distill it into something for the sparse developer documentation that we have).
So unfortunately we're up against a design limitation from the original GotoBLAS (from a time when "multiple" cores rarely meant more than 2 to 4), and until that gets resolved this PR needs to be reverted.
(I have tried to place memory barriers "everywhere" instead, but that did not prove sufficient. Using thread-local storage does not help either, so there is no option to make this conditional on anything)

@martin-frbg (Collaborator)

@mattip since early in the 0.2.2x series, most if not all of the interface functions have received at least a simple conditional to use either one or all threads depending on the problem size. For GEMM and a few others, this has been elaborated into either a slightly more sophisticated rule to make sure that each thread gets a non-trivial workload, or (for certain Arm hosts) a lookup table of thread counts to use per matrix size.

@martin-frbg (Collaborator)

#1851 (comment) was/is the salient point, I think: the single queue struct used to keep track of everything, and the storing of the buffer pointers sa and sb within it. By reusing it before the previous subtask has completed, already active buffers can (and will) get reassigned to new calculations.
