Results reproduction

While reproducing the results, I evaluated qwen-vl and the trained checkpoints and found that the `mmvet` scores were all very low (below 5). How were the `mmvet` results in the paper measured? Was `gpt_eval_score` used? Also, on MathVista's testmini split, the qwenvl base model I tested scored lower than reported. Could this be because I ran inference with vLLM?