-
Notifications
You must be signed in to change notification settings - Fork 107
Add AMX support for brute force search #1210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Welcome @xtangxtang! It looks like this is your first PR to zilliztech/knowhere 🎉 |
@xtangxtang 🔍 Important: PR Classification Needed! For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:
For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”. Thanks for your efforts and contribution to the community!. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cleanup required
src/simd/distances_amx.cc
Outdated
#if defined(USE_AMX) | ||
float amx_inner_product_matrix_bf16( char **floatLibraryMatrix, char *floatQueryMatrix, uint64_t dims,uint64_t batchSizeA, | ||
uint64_t batchSizeB, float *results_ptr){ | ||
int DIM=32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments are needed for the magic numbers
also, please use constexpr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, we will use constexpr for magic numbers
src/simd/distances_amx.cc
Outdated
float results[16*16] __attribute__((aligned(64)))={0}; | ||
|
||
if(!init_mem){ | ||
cfg[0]=1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to use defined structs, for example
struct TileConfig {
// must be 1
uint8_t paletteId;
// must be 0
uint8_t startRow;
uint8_t reserved[14];
// measured in bytes
uint16_t colsb[16];
// measured in rows
uint8_t rows[16];
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, has replace it
src/simd/distances_amx.cc
Outdated
cfg[48+2] = 16; | ||
init_mem = true; | ||
|
||
_tile_loadconfig((void *)cfg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi, thread_local
approach here leads to a missing _tile_release()
instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we understand, the release will be called when the thread stop, maybe the process quit. Please provide a position for us to release the tile config, if you have an idea
src/simd/distances_amx.cc
Outdated
} | ||
} | ||
_tile_stored(2, results, batchSizeB*2*2); | ||
_tile_zero(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move _tile_zero(2);
one line above for(int i=0;i<blockCount;i++){
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/simd/distances_amx_intr.h
Outdated
return 1; | ||
} | ||
|
||
static void amx_bf16_mul(void *cfg, void *ma, void *mb, int64_t a_stride, int64_t b_stride, void *mc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused
src/simd/distances_ref.cc
Outdated
float& dis8, float& dis9, float& dis10, float& dis11, | ||
float& dis12, float& dis13, float& dis14, float& dis15 | ||
) { | ||
enable_amx(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function needs to be invoked only once per process, please alter src/simd/hook.cc
appropriately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/simd/distances_amx_intr.h
Outdated
unsigned long bitmask = 0; | ||
long status = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &bitmask); | ||
if (0 != status) { | ||
std::cout << "SYS_arch_prctl(READ) error" << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please alert about the problems via the error code rather than via std::cout
if (status != 0) {
return ...;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
std::lock_guard<std::mutex> lock(patch_bf16_mutex); | ||
#if defined(__x86_64__) | ||
if (use_avx512 && cpu_support_avx512()) { | ||
if (use_amx && cpu_support_amx()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (use_amx && cpu_support_amx()) {
if (enable_amx() != SUCCESS) { fallback to avx512 and alert users } else { use amx }
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the invocation of AMX and AVX512 should be the same here, except that AMX requires an additional enable_amx call.
typename Apply> | ||
void internal_bf16_vec_inner_products_ny_if( | ||
const knowhere::bf16* __restrict x, | ||
const knowhere::bf16* __restrict y, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was __restrict
removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, we have add it back
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: xtangxtang The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi~, I have updated the code from your comment, please help to review. |
/reopen |
@alexanderguzhva: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hi~, reviewer, Intel Advanced Matrix Extensions (AMX) is a set of specialized instructions designed to accelerate matrix operations, which are fundamental in many areas of modern computing such as machine learning, scientific computing, and graphics processing. AMX leverages dedicated tile-based architecture within the CPU to perform these operations more efficiently than traditional scalar or SIMD (Single Instruction, Multiple Data) methods.
During our analysis of brute force search for Knowhere, we find AMX could speedup up to 2x performance according to dbpedia_openai_1M dataset for bf16 data type. All tests were performed on an Intel(R) Xeon(R) 6980P processor, employing 32 physical cores and 32 threads