What's Changed
- [GHA] Replaced visual_language_chat_sample-ubuntu-minicpm_v2_6 job by @mryzhov in #1909
- [GHA] Replaced cpp-chat_sample-ubuntu pipeline by @mryzhov in #1913
- Add support of Prompt Lookup decoding to llm bench by @sbalandi in #1917
- [GHA] Introduce SDL pipeline by @mryzhov in #1924
- Switch Download OpenVINO step to aks-medium-runner by @ababushk in #1889
- Bump product version 2025.2 by @akladiev in #1920
- [GHA] Replaced cpp-continuous-batching by @mryzhov in #1910
- Update dependencies in samples by @ilya-lavrenov in #1925
- phi3_v: add universal tag by @Wovchena in #1921
- Fix image_id unary error by @rkazants in #1927
- [Docs] Image generation use case by @yatarkan in #1877
- Add perf metrics for CB VLM by @pavel-esir in #1897
- Enhance the flexibility of the c streamer by @apinge in #1941
- add Gemma3 LLM to supported models by @eaidova in #1942
- Added GPTQ/AWQ support with HF Transformers by @AlexKoff88 in #1933
- Add --static_reshape option to llm_bench, to force static reshape + compilation at pipeline creation by @RyanMetcalfeInt8 in #1851
- benchmark_image_gen: Add --reshape option, and ability to specify multiple devices by @RyanMetcalfeInt8 in #1878
- Revert perf regression changes by @dkalinowski in #1949
- Add running greedy_causal_lm for JS to the sample tests by @Retribution98 in #1930
- [Docs] Add VLM use case by @yatarkan in #1907
- Added possibility to generate base text on GPU for text evaluation. by @andreyanufr in #1945
- VLM: change infer to start_async/wait by @dkalinowski in #1948
- [WWB]: Addressed issues with validation on Windows by @AlexKoff88 in #1953
- [GHA] Remove bandit pipeline by @mryzhov in #1956
- Disable MSVC debug assertions, addressing false positives in iterator checking by @apinge in #1952
- [GHA] Replaced genai-tools pipeline by @mryzhov in #1954
- configurable delay by @eaidova in #1963
- Update cast of tensor data pointer for const tensors by @praasz in #1966
- Remove tokens after EOS for draft model for speculative decoding by @sbalandi in #1951
- Add testcase for chat_sample_c by @apinge in #1934
- Skip warm-up iteration during llm_bench results averaging by @nikita-savelyevv in #1972
- Reset pipeline cache usage statistics on each generate call by @vshampor in #1961
- [Docs] Update models, rebuild on push by @yatarkan in #1922
- Updated logic whether PA backend is explicitly required by @ilya-lavrenov in #1976
- [GHA] [MAC] Use latest_available_commit OV artifacts by @mryzhov in #1977
- [GHA] Set HF_TOKEN by @mryzhov in #1986
- [GHA] Setup ov_cache by @mryzhov in #1962
- [GHA] Changed cleanup runner by @mryzhov in #1995
- Added mutex to methods which use blocks map. by @popovaan in #1975
- Add documentation and sample on KV cache eviction by @vshampor in #1960
- StaticLLMPipeline: Simplify compile_model call logic by @smirnov-alexey in #1915
- Fix reshape in heterogeneous SD samples by @helena-intel in #1994
- Update tokenizers by @mryzhov in #2002
- docs: fix max_new_tokens option description by @tpragasa in #1987
- [Docs] Add speech recognition with whisper use case by @yatarkan in #1971
- Revert "VLM: change infer to start_async/wait" by @ilya-lavrenov in #2004
- Revert "Revert perf regression changes" by @ilya-lavrenov in #2003
- Set xfail to failing tests. by @popovaan in #2006
- [GHA] Use cpack bindings in the samples tests by @mryzhov in #1979
- [Docs]: add Phi3.5MoE to supported models by @eaidova in #2012
- add TensorArt SD3.5 models to supported list by @eaidova in #2013
- Move MiniCPM resampler to vision encoder by @popovaan in #1997
- [GHA] Fix ccache on Win/Mac by @mryzhov in #2008
- samples/python/text_generation/lora.py -> samples/python/text_generation/lora_greedy_causal_lm.py by @Wovchena in #2007
- Whisper timestamp fix by @RyanMetcalfeInt8 in #1918
- Unskip Qwen2-VL-2B-Instruct sample test by @as-suvorov in #1970
- [GHA] Use developer openvino packages by @mryzhov in #2000
- Added NNCF to export-requirements.txt by @ilya-lavrenov in #1974
- Bump py-build-cmake from 0.4.2 to 0.4.3 by @dependabot in #2016
- Use OV_CACHE for python tests by @as-suvorov in #2020
- [GHA] Disable HTTP calls to the Hugging Face Hub by @mryzhov in #2021
- Add python bindings to VLMPipeline for encrypted models by @olpipi in #1916
- Bump the npm_and_yarn group across 1 directory with 2 updates by @dependabot in #2017
- CB: auto plugin support by @ilya-lavrenov in #2034
- timeout-minutes: 90 by @Wovchena in #2039
- Bump diffusers from 0.32.2 to 0.33.1 by @dependabot in #2031
- Bump diffusers from 0.32.2 to 0.33.1 in /samples by @dependabot in #2032
- Enable cache and add cache encryption to samples by @olpipi in #1990
- Fix VLM concurrency by @mzegla in #2022
- Move Phi3 vision projection model to vision encoder by @popovaan in #2009
- Fix spelling by @Wovchena in #2025
- [Docs] Enable autogenerated samples docs by @yatarkan in #2029
- Synchronize entire embeddings calculation phase (#1967) by @mzegla in #1993
- Add missing finish reason set when finishing the sequence by @mzegla in #2036
- Bump image-size from 1.2.0 to 1.2.1 in /site in the npm_and_yarn group across 1 directory by @dependabot in #1998
- Add README for C Samples by @apinge in #2040
- Use ov_cache for test_vlm_pipeline by @as-suvorov in #2042
- increase timeouts by @Wovchena in #2041
- [GHA] Use azure runners for python tests by @mryzhov in #1991
- [WWB]: move diffusers imports closer to usage by @eaidova in #2046
- [llm bench] Move calculation of memory consumption to memory_monitor tool by @sbalandi in #1937
- [llm bench] allow loading onnx models using optimum-intel by @eaidova in #2050
- Add cache encryption to vlm sample by @olpipi in #2038
- Remove note about GPU for phi3v by @eaidova in #2053
- Update requirement according to memory_monitor needs by @sbalandi in #2064
- [CI] Freeze optimum-intel by @mryzhov in #2061
- Propose chat template fixes by @Wovchena in #2070
- Add tiny-random-internvl2 to python tests by @yatarkan in #1978
- Don't download on import by @Wovchena in #2054
- Don't mention chat templates in start_chat docstrings by @Wovchena in #2055
- Revert "Set xfail to failing tests. (#2006)" by @popovaan in #2066
- Fix perf metrics update in prompt lookup decoding pipeline by @mzegla in #2044
- Bump http-proxy-middleware from 2.0.7 to 2.0.9 in /site in the npm_and_yarn group across 1 directory by @dependabot in #2072
- GHA: pin OpenVINO by @ilya-lavrenov in #2078
- [JS] Add LLMPipeline samples by @Retribution98 in #2058
- Disable continuous batching if cannot get context by @WeldonWangwang in #2060
- add internvl3 to supported VLM by @eaidova in #2076
- Add get_vocab Method to Tokenizer by @apaniukov in #2059
- GHA: pin OpenVINO by @Wovchena in #2079
- Raise exception if input prompt exceeds its configured max size on NPU by @AsyaPronina in #1996
- Optimize get_inputs_embeds() for Qwen2VL. by @popovaan in #2037
- Revert "GHA: pin OpenVINO" by @ilya-lavrenov in #2088
- Initial GGUF support by @ilya-lavrenov in #2081
- Revert "GHA: pin OpenVINO" by @ilya-lavrenov in #2087
- Bump diffusers and relax test_image_model_genai by @Wovchena in #2084
- add sentencepiece to requirements.txt by @isanghao in #2089
- Revert "Add get_vocab Method to Tokenizer (#2059)" by @Wovchena in #2086
- Revert optimum-intel freeze by @Wovchena in #2083
- [C] Add ov::Property as arguments to the ov_genai_llm_pipeline_create function by @apinge in #2071
- Use reordered images grid in create_position_ids method for Qwen2VL by @yatarkan in #2093
- Allow new Pillow's license by @Wovchena in #2077
- Disable /sdl for gguf-tools by @Wovchena in #2100
- Fix VLM CB metrics. by @popovaan in #2073
- GGUF: fixed GGUF tests by @ilya-lavrenov in #2090
- Fixed whisper tests by @ilya-lavrenov in #2105
- fix llm_bench and wwb parameters for new transformers by @eaidova in #2098
- [llm bench] Avoid crash of memory monitor when framework/pipeline change by exception by @sbalandi in #2106
- GGUF support Qwen2.5 with type of Q4_K Q6_K by @TianmengChen in #2095
- fix whisper optimum run via llm_bench by @eaidova in #2108
- [Docs] Add installation, guides & concepts pages by @yatarkan in #2075
- GGUF WA for GPU by @sammysun0711 in #2110
- llava: add universal tag by @Wovchena in #2091
- Prompt lookup: store encoder stats by @esmirno in #2104
- support 4bit cache copy by @zhangYiIntel in #1980
- Fix of filling of pixel_values tensor in llava_image_embed_make_with_bytes_slice() by @popovaan in #2111
- Update type hints in genai: dict by @Wovchena in #2112
- Print speculative decoding perf metrics in Debug mode by @sbalandi in #2065
- Fix license filter by @Wovchena in #2116
- Increase max_retries by @Wovchena in #2115
- [VLM] Clear inputs embedder cache when chat is finished. by @popovaan in #2117
- InternVL2, LLaVA-NeXT: add universal tag by @Wovchena in #2114
- docs: fix path to kv-cache-areas-diagram.svg by @Wovchena in #2101
- Update llm_bench requirements.txt to contain sentencepiece by @skuros in #2121
- Bring Back get_vocab by @apaniukov in #2107
- GGUF support load split files for Qwen2.5 by @TianmengChen in #2120
- samples: Adds optional device selection to some samples by @apram0d in #2028
- [GHA] Fixed dependabot trigger for github actions by @mryzhov in #2123
- Bump actions/download-artifact from 4.1.8 to 4.3.0 by @dependabot in #2133
- Add remove adapters for LLMPipeline by @wenyi5608 in #1852
- Coverity: exclude C++ and Python Tokenizers by @Wovchena in #2124
- Bump actions/setup-node from 4.0.2 to 4.4.0 by @dependabot in #2132
- GGUF Q6K WA for GPU by @TianmengChen in #2135
- [llm bench]: fix hook for beam search for optimum by @eaidova in #2128
- Update type hints in genai by @Wovchena in #2134
- Copy tags to docs by @Wovchena in #2127
- Bump actions/checkout from 4.1.6 to 4.2.2 by @dependabot in #2136
- Bump actions/setup-python from 5.4.0 to 5.6.0 by @dependabot in #2131
- Bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 by @dependabot in #2137
- Removed KVCacheConfig from internal API by @ilya-lavrenov in #2138
- Alias CLIPTextModelWithProjection as CLIPTextModel by @ilya-lavrenov in #1809
- [GGUF] Optimize Load GGUF with Threading by @sammysun0711 in #2139
- Bump actions/upload-artifact from 4.4.3 to 4.6.2 by @dependabot in #2143
- Continuous batching minor improvements by @Wovchena in #2144
- Remove extra call by @Wovchena in #2148
- Group source files for smart CI by @ilya-lavrenov in #2146
- Log failed output by @Wovchena in #2152
- Add a sample of LLM ReAct Agent by @JamieVC in #1926
- Tokenizer: patch simplified_chat_template by @Wovchena in #2145
- [llm_bench] Fix batch size processing while image gen benchmark by @apram0d in #2125
- [VLM] Add Qwen2.5-VL model support by @yatarkan in #2140
- Use full float for hash by @Wovchena in #2149
- Replace non existing models extend chat template mapping by @Wovchena in #2153
- [RAG] Add text embedding pipeline by @as-suvorov in #2057
- Bump json5 from 0.10.0 to 0.12.0 in /samples by @dependabot in #2154
- [C] Implement type conversion for the property values of MAX_PROMPT_LEN and MIN_RESPONSE_LEN by @apinge in #2142
- Added smart CI for Linux workflow by @ilya-lavrenov in #2158
- [GHA] Unique ov_cache by @mryzhov in #2160
- Zero out other half of int64 for hash by @Wovchena in #2157
- Added smart CI for Windows and macOS by @ilya-lavrenov in #2164
- Enables PA for arm64 by @ilya-lavrenov in #2165
- Fix whisper pipeline beam search decoding by @as-suvorov in #2166
- add phi4 reasoning to supported by @eaidova in #2161
- CVS-167152: fixed CLIPTextModelWithProjection creation in Python by @ilya-lavrenov in #2169
- Adjusted smart CI / labeler configs by @ilya-lavrenov in #2168
- Add Text Embedding pipeline samples by @as-suvorov in #2167
- [GHA] Fix component pattern by @akladiev in #2181
- Add backoff to requirements_conversion.txt by @skuros in #2170
- Bump actions/dependency-review-action from 4.6.0 to 4.7.0 by @dependabot in #2185
- Regenerate Windows cache by @Wovchena in #2188
- Fix race cond. Move get_awaiting_requests method to base class by @olpipi in #2174
- Fix overflow. Fix coverity. by @olpipi in #2179
- Add SD3 LoRA Adapter Support by @sammysun0711 in #2187
- [llm_bench] fix vlm processing without image and add more supported models by @eaidova in #2182
- [Coverity] Fix null pointer dereferences by @popovaan in #2184
- [llm_bench] fix overwriting bos token by @michal-miotk in #2199
- Fix Whisper tests by @as-suvorov in #2203
- [JS] Upgrade the js package versions to the upcoming releases by @Retribution98 in #2045
- Revert "Regenerate Windows cache (#2188)" by @Wovchena in #2196
- Added info to Scheduler docstring, optimized calculation of hash during prefix caching. by @popovaan in #2189
- Explain MODE_STATIC vs MODE_FUSE by @Wovchena in #2198
- Bump actions/dependency-review-action from 4.7.0 to 4.7.1 by @dependabot in #2207
- Add paired input into genai::Tokenizer by @pavel-esir in #2080
- Whisper static pipeline: fix for fp8 models by @eshiryae in #2201
- LoRA scaling fix by @likholat in #2210
- Benchmark add empty lora test by @wenyi5608 in #2183
- Bump pybind11-stubgen from 2.5.3 to 2.5.4 by @dependabot in #2208
- LLM: release plugin once pipeline is removed and WA for GPU by @sbalandi in #2102
- Bump onnx from 1.17.0 to 1.18.0 in /tests/python_tests by @dependabot in #2202
- [llm_bench] first_token_time should not be scaled by batch_size by @pavel-esir in #2217
- [Coverity] Removed dead code in preprocess_clip_image_llava() by @popovaan in #2230
- Fix Coverity issues by @olpipi in #2222
- Fix CI problems: update optimum-intel, use higher memory runner, disable whisper tests by @rkazants in #2229
- [TTS] Introduce Text-to-speech pipeline API and support SpeechT5 TTS by @rkazants in #2209
- Remove commented code by @Wovchena in #2220
- Bump undici from 6.21.1 to 6.21.3 in /site in the npm_and_yarn group across 1 directory by @dependabot in #2219
- [llm_bench] Include #egg=optimum-intel to avoid issues when freezing … by @wkobielx in #2237
- Dont include debug_utils.hpp by @Wovchena in #2223
- Switch VLM to ContinuousBatching by default. by @popovaan in #2129
- add qwen3 chat template to mapping by @eaidova in #2228
- [JS] Add an interrupt option for LLMPipeline by @Retribution98 in #2235
- Increased VLM tests timeout. by @popovaan in #2238
- Fix Coverity issues by @olpipi in #2232
- [StatefulLLMPipeline] Remove GenAI slicing in stateful pipeline for NPU by @smirnov-alexey in #2246
- Add simplified chat template for falcon-7b-instruct by @eaidova in #2252
- [NPUW] Re-fixed issue with long prompt for NPU by @AsyaPronina in #2242
- Enable chat template by default during WWB evaluation of text models by @nikita-savelyevv in #2051
- Remove LoRA scaling fix by @likholat in #2277
- Add Phi-4-multimodal-instruct by @Wovchena in #2221
- Image generation multiconcurrency by @dkalinowski in #2190
- Implement SnapKV (#2067) - release branch PR by @vshampor in #2278
- [GGUF] Support GGUF format for tokenizers and detokenizers by @rkazants in #2272
- Switch to SDPA for VLMs by @yatarkan in #2296
- add new chat template for qwen3 release by @eaidova in #2298
- Revert switch to CB changes. by @popovaan in #2304
- Fix Phi3-vision prompt by @yatarkan in #2306
- Phi4-mm: fix prompt processing, patch position ids and separator inserter by @yatarkan in #2293
New Contributors
- @ababushk made their first contribution in #1889
- @tpragasa made their first contribution in #1987
- @WeldonWangwang made their first contribution in #2060
- @apram0d made their first contribution in #2028
- @JamieVC made their first contribution in #1926
- @michal-miotk made their first contribution in #2199
Full Changelog: 2025.1.0.0...2025.2.0.0