
Conversation

@shivammonaka (Contributor) commented Jun 7, 2024

Problem Statement:
Currently, only one BLAS call can be executed at a time, even if that BLAS call requires only 8 out of the 64 available cores. This limitation results in poor core utilization.

Solution:
Remove the restriction on executing only a single BLAS call at a time by modifying the execution architecture and managing thread race conditions on shared variables.

How:
There are multiple approaches to achieve this. I've explored several methods, but encountered limitations with each. The current approach uses POSIX condition variables with mutex locking. Each BLAS call computes the number of cores it requires before execution, reserves them while holding the lock, and then releases the lock.

Execution Model:

  1. Numerous BLAS calls arrive simultaneously at the entry point, such as level3_thread.c.
  2. After setting up their queue status and related information, they all call exec_blas().
  3. Only one BLAS call passes through the mutex (level3_lock), while others wait.
  4. The successful BLAS call attempts to allocate the cores it needs. If allocation fails, it sleeps and releases the mutex. If successful, it reserves the cores and releases the mutex.
  5. The BLAS call that acquired its cores executes the call, releases the cores after execution, and signals sleeping BLAS calls to check for core availability again.

The specifications are as follows:

A machine equipped with 64 cores.
The matrices are square with dimension 100. The rationale behind this choice is that a matrix of size 100 needs only 8 cores (as of OpenBLAS v0.3.23) per BLAS (Basic Linear Algebra Subprograms) call, making it an ideal case for evaluating the impact of parallelism on a 64-core machine.

Observations:
Initially, at low numbers of BLAS calls, performance gains are marginal or even negative. This is because the locking mechanism, though efficient for larger cases, introduces overhead in smaller cases. However, significant improvements are observed with larger cases due to parallelism in BLAS calls.
[figure: performance of original vs. improved OpenBLAS across numbers of BLAS calls]

Core Utilization for Original OpenBLAS vs Improved OpenBLAS:

Original OpenBLAS executing a BLAS call that requires 8 cores on a 64-core machine:
[core-utilization chart]

Improved OpenBLAS executing a BLAS call that requires 8 cores on a 64-core machine:
[core-utilization chart]

Future Improvement:
Asynchronous core allocation: initiating thread execution asynchronously, without waiting for the full core allocation to complete.

Limitations:
Locking overhead takes a toll on performance when the number of BLAS calls is low; input from @martin-frbg would help resolve this.

@shivammonaka (Contributor, Author)

@martin-frbg any thoughts on this?

@martin-frbg martin-frbg added this to the 0.3.28 milestone Jun 20, 2024
@martin-frbg martin-frbg merged commit 7e9a4ba into OpenMathLib:develop Jun 20, 2024
@martin-frbg (Collaborator)

No better thought so far than trivially making this conditional on having more than say 8 or 16 cores in total so that we do not penalize small hardware unnecessarily. At least I do not see how we could know in advance how many level3 BLAS calls to expect. The alternative would be to make it a build-time option, but that would probably be less than ideal for distribution packagers or HPC support staff...

@martin-frbg (Collaborator)

Unfortunately this has now been shown to cause (or at least facilitate) wrong results when calling DDOT from multiple threads in parallel.

@mattip (Contributor) commented Oct 15, 2025

Are there thoughts about backing this out?

@martin-frbg (Collaborator)

I'd hoped to achieve some understanding of the cause of this problem, and/or receive input from the author of this PR. What is particularly confusing to me is that the reported failure is with DOT, which I think should not execute any code in level3_thread.c (the single source file affected by this PR) at all.

@shivammonaka (Contributor, Author)

I’m also a bit confused about why this is happening. One thing that stands out is that we might need to use pthread_cond_broadcast() instead of pthread_cond_signal(), so that all waiting threads are awakened and can re-check the condition — each taking resources as they become available.

I suspect the issue you’re seeing is related to how the Pthreads backend manages shared resources inside blas_server.c. However, I haven’t yet identified any specific resource in that code that could directly cause this breakdown under heavy parallelism.

@mattip (Contributor) commented Oct 22, 2025

the reported failure is with DOT

Actually, NumPy np.dot calls NumPy's cblas_matrixproduct which uses gemv, gemm or syrk (for np.dot(x, x.T)). In the case of the reproducer in numpy/numpy#29391, it is calling gemm.

Redoing the bisect on linux, this PR is the place that causes the reproducer to fail.

I am curious how

a matrix of size 100, only 8 cores(As per OpenBLAS V0.3.23) are needed per BLAS (Basic Linear Algebra Subprograms) call.

Is the limit of 8 cores enforced by OpenBLAS somehow? NumPy gets complaints that, for any OpenBLAS call, all the CPUs are activated. Is there a mechanism we are not using to limit calls to a smaller number of CPUs?

@shivammonaka (Contributor, Author)

In its initialization phase, OpenBLAS invokes num_cpu_avail() to decide the optimal thread count for SGEMM, considering factors such as machine architecture, matrix dimensions, and internal thread-throttling configuration.

@martin-frbg (Collaborator)

Thanks. I have just come to the same conclusion by instrumenting the function calls in openblas - the call to sdot is a red herring, all it does is compute a 2x2 dot product before sgemm is called with sizable inputs. So it totally makes sense that we end up in level3_thread.
Now the problem with this PR is that calling exec_blas without the level3_lock held creates an opportunity for races as idle threads get assigned new workloads and traverse the list of memory buffers to find a free slot. (ISTR this was pointed out at length in one of the earlier issue tickets on GEMV a few years ago - I need to dig that up and perhaps try to distill it into something for the sparse developer documentation that we have).
So unfortunately we're up against a design limitation from the original GotoBLAS (from a time when "multiple" cores rarely meant more than 2 to 4), and until that gets resolved this PR needs to be reverted.
(I have tried to place memory barriers "everywhere" instead, but that did not prove sufficient. Using thread-local storage does not help either, so there is no option to make this conditional on anything)

@martin-frbg (Collaborator)

@mattip since early in the 0.2.2x series, most if not all of the interface functions have received at least a simple conditional to use either one or all threads depending on the problem size. For GEMM and a few others, this has been elaborated into either a slightly more sophisticated rule to make sure that each thread gets a non-trivial workload, or (for certain Arm hosts) a lookup table of thread counts to use per matrix size.

@martin-frbg (Collaborator)

#1851 (comment) was/is the salient point, I think: the single queue struct used to keep track of everything, and the storing of the buffer pointers sa and sb within it. By reusing it before the previous subtask has completed, already active buffers can (and will) get reassigned to new calculations.
