Skip to content

Conversation

xtangxtang
Copy link

Hi~, reviewer, Intel Advanced Matrix Extensions (AMX) is a set of specialized instructions designed to accelerate matrix operations, which are fundamental in many areas of modern computing such as machine learning, scientific computing, and graphics processing. AMX leverages dedicated tile-based architecture within the CPU to perform these operations more efficiently than traditional scalar or SIMD (Single Instruction, Multiple Data) methods.

During our analysis of brute force search for Knowhere, we find AMX could speedup up to 2x performance according to dbpedia_openai_1M dataset for bf16 data type. All tests were performed on an Intel(R) Xeon(R) 6980P processor, employing 32 physical cores and 32 threads

@sre-ci-robot
Copy link
Collaborator

Welcome @xtangxtang! It looks like this is your first PR to zilliztech/knowhere 🎉

Copy link

mergify bot commented Jun 4, 2025

@xtangxtang 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!.

@xtangxtang xtangxtang changed the title Add amx support for brute force search Add AMX support for brute force search Jun 4, 2025
Copy link
Collaborator

@alexanderguzhva alexanderguzhva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanup required

#if defined(USE_AMX)
float amx_inner_product_matrix_bf16( char **floatLibraryMatrix, char *floatQueryMatrix, uint64_t dims,uint64_t batchSizeA,
uint64_t batchSizeB, float *results_ptr){
int DIM=32;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments are needed for the magic numbers
also, please use constexpr

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, we will use constexpr for magic numbers

float results[16*16] __attribute__((aligned(64)))={0};

if(!init_mem){
cfg[0]=1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to use defined structs, for example

struct TileConfig {
    // must be 1
    uint8_t paletteId;
    // must be 0
    uint8_t startRow;
    uint8_t reserved[14];
    // measured in bytes
    uint16_t colsb[16];
    // measured in rows
    uint8_t rows[16];
};

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, has replace it

cfg[48+2] = 16;
init_mem = true;

_tile_loadconfig((void *)cfg);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi, thread_local approach here leads to a missing _tile_release() instruction.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we understand, the release will be called when the thread stop, maybe the process quit. Please provide a position for us to release the tile config, if you have an idea

}
}
_tile_stored(2, results, batchSizeB*2*2);
_tile_zero(2);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move _tile_zero(2); one line above for(int i=0;i<blockCount;i++){

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return 1;
}

static void amx_bf16_mul(void *cfg, void *ma, void *mb, int64_t a_stride, int64_t b_stride, void *mc) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused

float& dis8, float& dis9, float& dis10, float& dis11,
float& dis12, float& dis13, float& dis14, float& dis15
) {
enable_amx();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function needs to be invoked only once per process, please alter src/simd/hook.cc appropriately

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

unsigned long bitmask = 0;
long status = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &bitmask);
if (0 != status) {
std::cout << "SYS_arch_prctl(READ) error" << std::endl;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please alert about the problems via the error code rather than via std::cout

if (status != 0) {
   return ...;
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

std::lock_guard<std::mutex> lock(patch_bf16_mutex);
#if defined(__x86_64__)
if (use_avx512 && cpu_support_avx512()) {
if (use_amx && cpu_support_amx()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (use_amx && cpu_support_amx()) {
    if (enable_amx() != SUCCESS) { fallback to avx512 and alert users } else { use amx } 
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the invocation of AMX and AVX512 should be the same here, except that AMX requires an additional enable_amx call.

typename Apply>
void internal_bf16_vec_inner_products_ny_if(
const knowhere::bf16* __restrict x,
const knowhere::bf16* __restrict y,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was __restrict removed?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, we have add it back

@sre-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xtangxtang
To complete the pull request process, please assign zhengbuqian after the PR has been reviewed.
You can assign the PR to them by writing /assign @zhengbuqian in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ruclz
Copy link

ruclz commented Jul 9, 2025

Hi~, I have updated the code from your comment, please help to review.

@alexanderguzhva
Copy link
Collaborator

/reopen

@sre-ci-robot sre-ci-robot reopened this Aug 18, 2025
@sre-ci-robot
Copy link
Collaborator

@alexanderguzhva: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the stale label Aug 19, 2025
@github-actions github-actions bot added the stale label Sep 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants