Commit 60f9b3a

patched CPU performance
As described in issue #638, QuEST v4 contained a performance regression (relative to v3) that appeared only in some CPU settings. It was caused by the use of std::complex operator overloads in cpu_subroutines.cpp (whereas QuEST v3 hand-rolled its complex arithmetic), and affected compilation with Clang (in both single-threaded and multithreaded settings), with GCC (only in single-threaded settings), and potentially with other compilers. We tentatively patch the issue by passing additional compiler optimisation flags to cpu_subroutines.cpp, which circumvent it. This is a rather aggravating solution to a major pitfall in the C++ standard library. After deliberation, it beat out the alternatives: hand-rolling complex arithmetic, using a custom complex type, and using more precise, compiler-specific flags.
1 parent be7edbb commit 60f9b3a
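
To make the regression described in the commit message concrete, here is a minimal sketch (not QuEST code) that contrasts the std::complex operator* overload with hand-rolled complex multiplication. The names mulOverload and mulHandRolled are hypothetical, and qcomp is assumed here to be std::complex<double>. Without flags such as -ffinite-math-only, compilers typically lower the overload to a routine that also checks for NaN results and tries to recover infinities, whereas the hand-rolled form is just four multiplies and two additions.

```cpp
#include <complex>

// Assumption for this sketch: qcomp is double-precision std::complex.
using qcomp = std::complex<double>;

// Uses the standard library's operator overload. Under default flags this
// may include NaN/inf handling that hinders vectorisation.
qcomp mulOverload(qcomp a, qcomp b) {
    return a * b;
}

// Hand-rolled product (ar + i*ai)(br + i*bi), with no NaN/inf recovery.
qcomp mulHandRolled(qcomp a, qcomp b) {
    double ar = a.real(), ai = a.imag();
    double br = b.real(), bi = b.imag();
    return qcomp(ar * br - ai * bi, ar * bi + ai * br);
}
```

For finite inputs both functions return the same value; they can differ only in edge cases involving NaN or infinity, which is exactly the handling the extra flags tell the compiler to skip.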

2 files changed: 102 additions and 0 deletions
CMakeLists.txt

Lines changed: 65 additions & 0 deletions
@@ -444,6 +444,71 @@ endif()
 
 
 
+# ============================
+#  Patch CPU performance
+# ============================
+
+
+# Patch performance of CPU std::complex arithmetic operator overloads.
+# The cpu_subroutines.cpp file makes extensive use of std::complex operator
+# overloads, and alas these are significantly slower than hand-rolled
+# arithmetic, due to their NaN and inf checks, and interference with SIMD.
+# It is crucial to pass additional optimisation flags to this file to restore
+# hand-rolled performance (else QuEST v3 is faster than v4 eep). In theory,
+# we can achieve this with specific, relatively 'safe' flags such as LLVM's:
+#   -ffinite-math-only -fno-signed-zeros -ffp-contract=fast
+# However, it is a nuisance to find equivalent flags for different compilers
+# and monitor their performance vs accuracy trade-offs. So instead, we use the
+# much more aggressive and ubiquitous -Ofast flag to guarantee performance.
+# This introduces many potentially dangerous optimisations, such as asserting
+# associativity of flops, which would break techniques like Kahan summation.
+# The cpu_subroutines.cpp must ergo be very conscious of these optimisations.
+# We here also explicitly inform the file cpu_subroutines.cpp whether or not
+# we are passing the flags, so it can detect/error when flags are forgotten.
+
+if (CMAKE_BUILD_TYPE STREQUAL "Release")
+
+    # Release build will pass -Ofast when known for the given compiler, and
+    # fallback to giving a performance warning and proceeding with compilation
+
+    if (CMAKE_CXX_COMPILER_ID MATCHES "AppleClang|Clang|Cray|CrayClang|GNU|Intel|IntelLLVM|NVHPC|NVIDIA|XL|XLClang")
+        set(patch_flags "-Ofast")
+        set(patch_macro "-DCOMPLEX_OVERLOADS_PATCHED=1")
+    elseif (CMAKE_CXX_COMPILER_ID MATCHES "HP")
+        set(patch_flags "+Ofast")
+        set(patch_macro "-DCOMPLEX_OVERLOADS_PATCHED=1")
+    elseif (CMAKE_CXX_COMPILER_ID MATCHES "MSVC")
+        set(patch_flags "/fp:fast")
+        set(patch_macro "-DCOMPLEX_OVERLOADS_PATCHED=1")
+    else()
+        message(WARNING
+            "The compiler (${CMAKE_CXX_COMPILER_ID}) is unrecognised and so crucial optimisation flags have not been "
+            "passed to the CPU backend. These flags are necessary for full performance when performing complex algebra, "
+            "otherwise a slowdown of 3-50x may be observed. Please edit the root CMakeLists.txt to include flags which are "
+            "equivalent to GNU's -Ofast flag for your compiler (search this warning), or contact the QuEST developers for help."
+        )
+        set(patch_flags "")
+        set(patch_macro "-DCOMPLEX_OVERLOADS_PATCHED=0")
+    endif()
+
+else()
+
+    # Non-release builds (e.g. Debug) will pass no optimisation flags, and will
+    # communicate to cpu_subroutines.cpp that this is intentional via a macro
+
+    set(patch_flags "")
+    set(patch_macro "-DCOMPLEX_OVERLOADS_PATCHED=0")
+
+endif()
+
+set_source_files_properties(
+    quest/src/cpu/cpu_subroutines.cpp
+    PROPERTIES
+    COMPILE_FLAGS "${patch_flags} ${patch_macro}"
+)
+
+
+
 # ============================
 #  Pass files to library
 # ============================
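
The CMake logic in this file tells cpu_subroutines.cpp explicitly, via COMPLEX_OVERLOADS_PATCHED, whether the patch flags were passed. As a complementary illustration only (this is not what the commit does), GCC and Clang predefine __FAST_MATH__ when -ffast-math is in effect (which -Ofast implies), and MSVC predefines _M_FP_FAST under /fp:fast, so a translation unit can also check what it actually received. The helper name fastMathActive is hypothetical.

```cpp
// Hypothetical helper: reports whether this translation unit was compiled
// with fast-math semantics, judged only by compiler-predefined macros.
// Compilers that define neither macro will report false even if they
// apply equivalent optimisations.
constexpr bool fastMathActive() {
#if defined(__FAST_MATH__) || defined(_M_FP_FAST)
    return true;
#else
    return false;
#endif
}
```

A check like this covers only the compilers that define those macros, which may be one reason to prefer an explicit macro set by the build system, as the commit does.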

quest/src/cpu/cpu_subroutines.cpp

Lines changed: 37 additions & 0 deletions
@@ -2,6 +2,12 @@
  * CPU OpenMP-accelerated definitions of the main backend simulation routines,
  * as mirrored by gpu_subroutines.cpp, and called by accelerator.cpp.
  *
+ * BEWARE that this specific file receives additional compiler optimisation flags
+ * in order to counteract a performance issue in the use of std::complex operator
+ * overloads. These flags (like -Ofast) may induce assumed associativity of qcomp
+ * algebra, breaking techniques like Kahan summation. As such, this file CANNOT
+ * assume IEEE floating-point behaviour.
+ *
  * Some of these definitions are templated, defining multiple versions optimised
  * (at compile-time) for handling different numbers of input qubits; such functions
  * are preceded by macro INSTANTIATE_FUNC_OPTIMISED_FOR_NUM_CTRLS(), to force the
@@ -40,6 +46,28 @@
 using std::vector;
 
 
+/*
+ * Beware that this file makes extensive use of std::complex (qcomp) operator
+ * overloads and so requires additional compiler flags to achieve hand-rolled
+ * arithmetic performance; otherwise a 3-50x slowdown may be observed. We here
+ * enforce that these flags were not forgotten (but may be deliberately avoided).
+ * Beware these flags may induce associativity and break e.g. Kahan summation.
+ */
+
+#if !defined(COMPLEX_OVERLOADS_PATCHED)
+    #error "Crucial, bespoke optimisation flags were not passed (or acknowledged) to cpu_subroutines.cpp which are necessary for full complex arithmetic performance."
+
+#elif !COMPLEX_OVERLOADS_PATCHED
+
+    #if defined(_MSC_VER)
+        #pragma message("Warning: The CPU backend is being deliberately compiled without the necessary flags to obtain full complex arithmetic performance.")
+    #else
+        #warning "The CPU backend is being deliberately compiled without the necessary flags to obtain full complex arithmetic performance."
+    #endif
+
+#endif
+
+
 
 /*
  * GETTERS
@@ -568,6 +596,9 @@ void cpu_statevec_anyCtrlAnyTargDenseMatr_sub(Qureg qureg, vector<int> ctrls, ve
                 /// qureg.cpuAmps[i] is being serially updated by only this thread,
                 /// so is a candidate for Kahan summation for improved numerical
                 /// stability. Explore whether this is time-free and worthwhile!
+                ///
+                /// BEWARE that Kahan summation is incompatible with the optimisation
+                /// flags currently passed to this file
             }
         }
     }
@@ -1758,6 +1789,9 @@ qreal cpu_statevec_calcTotalProb_sub(Qureg qureg) {
     /// final serial combination). This invokes several times
     /// as many arithmetic operations (4x?) but we are anyway
     /// memory-bandwidth bound
+    ///
+    /// BEWARE that Kahan summation is incompatible with the optimisation
+    /// flags currently passed to this file
 
     qreal prob = 0;
 
@@ -1783,6 +1817,9 @@ qreal cpu_densmatr_calcTotalProb_sub(Qureg qureg) {
     /// final serial combination). This invokes several times
     /// as many arithmetic operations (4x?) but we are anyway
     /// memory-bandwidth bound
+    ///
+    /// BEWARE that Kahan summation is incompatible with the optimisation
+    /// flags currently passed to this file
 
     qreal prob = 0;
 
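
The BEWARE notes added in this commit say that Kahan summation is incompatible with the flags now passed to cpu_subroutines.cpp. The following minimal sketch (not QuEST code; kahanSum and naiveSum are hypothetical names, and double stands in for qreal) shows why. The compensation term is algebraically zero, so a compiler that is allowed to treat floating-point addition as associative may simplify it away, leaving only the naive loop.

```cpp
#include <vector>

// Compensated (Kahan) summation. Only meaningful when the compiler
// preserves the written order of floating-point operations.
double kahanSum(const std::vector<double>& xs) {
    double sum = 0.0;
    double comp = 0.0;            // running estimate of lost low-order bits
    for (double x : xs) {
        double y = x - comp;      // apply the correction from the previous step
        double t = sum + y;       // low-order bits of y may be rounded away here
        comp = (t - sum) - y;     // algebraically zero; captures the rounding error
        sum = t;
    }
    return sum;
}

// Plain accumulation, for comparison.
double naiveSum(const std::vector<double>& xs) {
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum;
}
```

Compiled with strict floating-point semantics, kahanSum is typically at least as accurate as naiveSum on long inputs. Under -Ofast or /fp:fast the two may compile to equivalent code, which is the hazard the BEWARE comments flag.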