Enhancing Core Utilization in BLAS Calls: A Scalable Architecture #4741
Conversation
… executed in parallel
@martin-frbg any thoughts on this?
No better thought so far than trivially making this conditional on having more than, say, 8 or 16 cores in total, so that we do not penalize small hardware unnecessarily. At least I do not see how we could know in advance how many level3 BLAS calls to expect. The alternative would be to make it a build-time option, but that would probably be less than ideal for distribution packagers or HPC support staff...
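As a rough illustration of that first suggestion, the gate could be as simple as the following sketch (the threshold constant and helper name are made up for illustration, not actual OpenBLAS symbols):

```c
/* Hypothetical sketch: only take the new core-reservation path on machines
 * with "enough" cores, so that small hardware keeps the original behaviour.
 * MIN_CORES_FOR_RESERVATION is an illustrative constant, not an OpenBLAS symbol. */
#include <unistd.h>

#define MIN_CORES_FOR_RESERVATION 16

static int use_core_reservation(void)
{
    /* cores visible to this process; OpenBLAS has its own CPU-count query internally */
    long total_cores = sysconf(_SC_NPROCESSORS_ONLN);
    return total_cores >= MIN_CORES_FOR_RESERVATION;
}
```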
Unfortunately this has now been shown to cause (or at least facilitate) wrong results when calling DDOT from multiple threads in parallel.
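For context, the reported failure mode has roughly the following shape. This is only a hedged sketch of such a reproducer (assuming CBLAS linkage), not the actual test case behind the report:

```c
/* Sketch of the scenario: several caller threads issue DDOT concurrently
 * and check the known result. Illustrative only. */
#include <math.h>
#include <pthread.h>
#include <stdio.h>
#include <cblas.h>

#define N 1000
#define CALLERS 8

static double x[N], y[N];

static void *worker(void *arg)
{
    (void)arg;
    for (int iter = 0; iter < 10000; iter++) {
        double d = cblas_ddot(N, x, 1, y, 1);
        if (fabs(d - (double)N) > 1e-9)   /* every element is 1.0, so ddot == N */
            fprintf(stderr, "wrong ddot result: %f\n", d);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[CALLERS];
    for (int i = 0; i < N; i++) x[i] = y[i] = 1.0;
    for (int i = 0; i < CALLERS; i++) pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < CALLERS; i++) pthread_join(tid[i], NULL);
    return 0;
}
```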
Are there thoughts about backing this out?
I'd hoped to achieve some understanding of the cause of this problem, and/or receive input from the author of this PR. What is particularly confusing to me is that the reported failure is with DOT, which I think should not execute any code in level3_thread.c (the single source file affected by this PR) at all.
I’m also a bit confused about why this is happening. One thing that stands out is that we might need to use … I suspect the issue you’re seeing is related to how the Pthreads backend manages shared resources inside …
Actually, NumPy … Redoing the bisect on Linux, this PR is the place that causes the reproducer to fail. I am curious how …
Is the limit of 8 cores enforced by OpenBLAS somehow? NumPy gets complaints that for any OpenBLAS call, all the CPUs are activated. Is there a mechanism we are not using to limit the calls to a smaller number of CPUs?
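For reference, the runtime controls available to callers are the OPENBLAS_NUM_THREADS environment variable and the openblas_set_num_threads() function; a minimal usage sketch:

```c
/* Minimal sketch: capping how many threads OpenBLAS may use per call.
 * The same effect can be had by setting OPENBLAS_NUM_THREADS before startup.
 * Note this limits the thread count per call, not which CPUs are used. */
void openblas_set_num_threads(int num_threads);   /* declared in OpenBLAS headers */

void cap_blas_threads(void)
{
    openblas_set_num_threads(8);   /* subsequent BLAS calls use at most 8 threads */
}
```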
In its initialization phase, OpenBLAS invokes …
Thanks. I have just come to the same conclusion by instrumenting the function calls in OpenBLAS - the call to sdot is a red herring: all it does is compute a 2x2 dot product before sgemm is called with sizable inputs. So it totally makes sense that we end up in level3_thread.c.
@mattip since early in the 0.2.2x series, most if not all of the interface functions have received at least a simple conditional to use either one or all threads depending on the problem size. For GEMM and a few others, this has been elaborated into either a slightly more sophisticated rule to make sure that each thread gets a non-trivial workload, or (for certain Arm hosts) a lookup table of thread counts to use per matrix size.
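A hedged sketch of that kind of size-based thread selection (the thresholds and function name below are illustrative, not the actual interface code):

```c
/* Illustrative heuristic: pick a thread count from the problem size so that
 * tiny problems stay single-threaded and larger ones give each thread a
 * non-trivial share of the work. Thresholds are made up for illustration. */
static int gemm_threads_for_size(long m, long n, long k, int max_threads)
{
    long work = m * n * k;                 /* rough proxy for the flop count */
    long tile = 64L * 64L * 64L;           /* minimum useful work per thread */

    if (work < tile)
        return 1;                          /* threading overhead would dominate */

    long candidate = work / tile;
    if (candidate > max_threads)
        candidate = max_threads;
    return (int)candidate;
}
```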
#1851 (comment) was/is the salient point I think - the single …
Problem Statement:
Currently, only one BLAS call can be executed at a time, even if that BLAS call requires only 8 out of the 64 available cores. This limitation results in poor core utilization.
Solution:
Remove the restriction of executing only a single BLAS call at a time by modifying the execution architecture and guarding shared variables against race conditions between concurrent calls.
How:
There are multiple approaches to achieving this. I've explored several methods but encountered limitations with each. The current approach uses POSIX condition-variable waiting with mutex locking: each BLAS call calculates the number of cores it requires before execution, reserves them, and then releases the lock (a minimal sketch of this pattern follows below).
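To make that concrete, here is a minimal self-contained sketch of the wait/reserve/release pattern described above, using a POSIX mutex and condition variable. All names are illustrative; this is not the PR's actual code.

```c
/* Sketch: a BLAS call computes how many cores it needs, blocks until that
 * many are free, runs, then returns them to the pool. */
#include <pthread.h>

#define TOTAL_CORES 64

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pool_cv   = PTHREAD_COND_INITIALIZER;
static int free_cores = TOTAL_CORES;

static void reserve_cores(int needed)
{
    pthread_mutex_lock(&pool_lock);
    while (free_cores < needed)                /* wait until enough cores are idle */
        pthread_cond_wait(&pool_cv, &pool_lock);
    free_cores -= needed;
    pthread_mutex_unlock(&pool_lock);
}

static void release_cores(int used)
{
    pthread_mutex_lock(&pool_lock);
    free_cores += used;
    pthread_cond_broadcast(&pool_cv);          /* wake callers waiting for cores */
    pthread_mutex_unlock(&pool_lock);
}

/* Conceptually, a level-3 call would then do:
 *   reserve_cores(8);        // e.g. 8 cores for a 100x100 GEMM
 *   ... run the threaded kernel on those cores ...
 *   release_cores(8);
 */
```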
Execution Model:
The specifications are as follows:
A machine equipped with 64 cores.
The matrices are square with dimension 100. The rationale behind this choice is that, for a matrix of size 100, only 8 cores are needed per BLAS (Basic Linear Algebra Subprograms) call (as of OpenBLAS v0.3.23). This makes it an ideal case for evaluating the impact of parallelism across BLAS calls on a 64-core machine.
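A sketch of the kind of measurement setup this implies (illustrative only, assuming CBLAS; not the benchmark behind the numbers below): several caller threads each issuing 100x100 DGEMMs, so that up to eight 8-core BLAS calls could in principle run side by side on 64 cores.

```c
/* Illustrative harness: CALLERS threads each issue REPS small DGEMM calls. */
#include <pthread.h>
#include <stdlib.h>
#include <cblas.h>

#define DIM 100
#define CALLERS 8
#define REPS 1000

static void *caller(void *arg)
{
    (void)arg;
    double *a = malloc(DIM * DIM * sizeof(double));
    double *b = malloc(DIM * DIM * sizeof(double));
    double *c = malloc(DIM * DIM * sizeof(double));
    for (int i = 0; i < DIM * DIM; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    for (int r = 0; r < REPS; r++)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    DIM, DIM, DIM, 1.0, a, DIM, b, DIM, 0.0, c, DIM);

    free(a); free(b); free(c);
    return NULL;
}

int main(void)
{
    pthread_t tid[CALLERS];
    for (int i = 0; i < CALLERS; i++) pthread_create(&tid[i], NULL, caller, NULL);
    for (int i = 0; i < CALLERS; i++) pthread_join(tid[i], NULL);
    return 0;
}
```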
Observations:

At low numbers of concurrent BLAS calls, performance gains are marginal or even negative: the locking mechanism, whose cost is easily amortized in larger cases, introduces noticeable overhead in smaller ones. With more concurrent calls, however, significant improvements are observed because the BLAS calls execute in parallel.
Core Utilization for Original OpenBLAS vs Improved OpenBLAS:
Original OpenBLAS executing a BLAS call which requires 8 cores on a 64-core machine

Improved OpenBLAS executing a BLAS call which requires 8 cores on a 64-core machine

Future Improvement:
Asynchronous core allocation: initiating thread execution asynchronously without waiting for the full core allocation to complete.
Limitations:
Locking overhead takes a toll on performance when the number of BLAS calls is low; input from @martin-frbg is needed to resolve this.