Replies: 2 comments 2 replies
-
Thanks for the feedback, here are the responses:
However, the latter approach runs into several bottlenecks:
We are resolving these issues in our other project, Infinity.
Regarding your other issues:
-
You nailed it: this isn't "multimodal RAG," it's just text-only RAG with an OCR -> caption -> text-embed pipeline bolted on. That's exactly why it shows up in our WFGY map as Problem 9 (entropy collapse) plus Problem 10 (creative freeze): the pipeline collapses context into brittle captions, and the system can't handle direct embeddings from image data. The real failure here is design lock-in. Users can't swap in their own multimodal embeddings, so the architecture blocks the very thing it advertises. We've been documenting fixes for this (a semantic firewall that decouples modality from infrastructure). If you want, ask me for the link and I'll point you straight to the breakdown.
-
After spending several days trying to implement a production-grade multimodal RAG pipeline with RAGFlow, I am forced to abandon the project. This issue is to highlight critical, fundamental flaws in RAGFlow's design and marketing that make it unusable for any serious application.
Let's be perfectly clear: RAGFlow does NOT have a multimodal embedding capability. The marketing is a lie. The pipeline for images is strictly unimodal and text-based:
`Image -> OCR -> Image Caption (CV LLM) -> Text -> Text Embedding`
At no point are the actual visual features of the image embedded. The system only embeds a textual description of the image. This is not multimodal RAG; it's a text-only RAG with a brittle image-to-text pre-processing step.
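The difference can be made concrete with a minimal sketch (this is not RAGFlow code; `caption_model`, `embed_text`, and `embed_image` are hypothetical stand-ins for real models). The key point: in the caption path, any visual feature the caption fails to mention is gone before embedding ever happens.

```python
# Minimal sketch contrasting the two ingestion paths. All three model
# functions below are HYPOTHETICAL stand-ins, not real APIs.

def caption_model(image_bytes: bytes) -> str:
    # Hypothetical CV LLM: reduces the image to a short textual description.
    return "a bar chart comparing quarterly revenue"

def embed_text(text: str) -> list[float]:
    # Hypothetical text embedder: only ever sees characters, never pixels.
    return [float(len(text)), float(text.count(" "))]

def embed_image(image_bytes: bytes) -> list[float]:
    # Hypothetical multimodal embedder: operates on the raw bytes directly.
    return [float(len(image_bytes)), float(image_bytes[0])]

def caption_then_embed(image_bytes: bytes) -> list[float]:
    # The pipeline described above: Image -> caption -> TEXT embedding.
    return embed_text(caption_model(image_bytes))

def direct_multimodal_embed(image_bytes: bytes) -> list[float]:
    # What a true multimodal pipeline would do: embed the image itself.
    return embed_image(image_bytes)
```

Two visually different images that happen to get the same caption collapse to identical vectors in the caption path, while the direct path still distinguishes them.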
This leads to two catastrophic failures in design:
1. The Image2Text (CV LLM) step cannot be disabled. If you do not configure an Image2Text model, the entire ingestion process fails with the error `Type of image2text model is not set`. If the model you configure produces output RAGFlow doesn't like, ingestion fails with cryptic errors like `string indices must be...`.
2. It is IMPOSSIBLE to use a true, external multimodal embedding model (such as one based on Qwen-VL, LLaVA, etc.) that generates embeddings from the image pixels directly. The system actively prevents a proper multimodal implementation.
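For what it's worth, `string indices must be...` is the generic Python `TypeError` you get when code indexes a string as if it were a dict. A plausible (purely illustrative, not RAGFlow's actual code) reproduction: the pipeline expects the CV model to return a parsed JSON object but receives a bare string.

```python
# Illustrative reproduction of the "string indices must be ..." failure mode.
# extract_caption is a HYPOTHETICAL helper, not RAGFlow code; it assumes the
# model returned parsed JSON like {"caption": "..."}.

def extract_caption(model_output):
    # Works only if model_output is a dict; a plain string blows up here.
    return model_output["caption"]

try:
    # A model that returns a bare string instead of the expected JSON object:
    extract_caption("a photo of a cat")
    message = ""
except TypeError as exc:
    message = str(exc)

print(message)  # contains "string indices must be"
```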
Beyond the multimodal failure, the core UX of forcing a single, rigid "Chunking Method" at the Knowledge Base level is an archaic design philosophy. Real-world projects contain a mix of documents (long PDFs, tables, slides, code). Forcing a user to create separate KBs for each document type is incredibly inefficient and feels like a design from the 1980s. Components like "DeepDoc" are marketed as sophisticated solutions but are just opaque parts of this broken black box ("DeepShit" would be a more accurate name).
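The flexibility being asked for is not exotic. A sketch of per-document chunking dispatch (hypothetical chunkers, not RAGFlow's API) shows how one knowledge base could route each file to a suitable strategy instead of fixing a single method KB-wide:

```python
from pathlib import Path

# HYPOTHETICAL chunkers, for illustration only.

def chunk_prose(text: str) -> list[str]:
    # Fixed-size windows for long prose documents (window size is arbitrary).
    return [text[i:i + 20] for i in range(0, len(text), 20)]

def chunk_code(text: str) -> list[str]:
    # Keep whole top-level blocks together for source files.
    return [block for block in text.split("\n\n") if block.strip()]

# One registry per knowledge base, keyed by file extension.
CHUNKERS = {".pdf": chunk_prose, ".py": chunk_code}

def chunk_document(path: str, text: str) -> list[str]:
    # Route each document to its own strategy; fall back to prose chunking.
    chunker = CHUNKERS.get(Path(path).suffix, chunk_prose)
    return chunker(text)
```

With this shape, mixed corpora (PDFs, code, slides) live in one KB and each file still gets an appropriate chunking method.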
Conclusion:
In its current state, RAGFlow is a time sink. It promises an easy-to-use, powerful, multimodal RAG solution, but delivers a rigid, fragile, and fundamentally dishonest text-only system.
I am posting this to warn other developers: if you need true multimodal capabilities or a flexible document processing pipeline, look elsewhere.
I strongly urge the developers to either:
1. Be honest in the documentation and marketing about these severe limitations, or
2. Completely redesign the ingestion pipeline to be modular, allowing users to disable components and, crucially, to use embeddings generated from the actual image data.
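Concretely, the redesign asked for here amounts to making the caption stage optional and accepting a user-supplied image embedder. A minimal sketch of such an interface (hypothetical API, not a proposal for RAGFlow's actual signatures):

```python
from typing import Callable, Optional

# HYPOTHETICAL modular ingestion API: the caption model is optional, and a
# user-supplied image embedder can bypass the caption path entirely.

def ingest_image(
    image_bytes: bytes,
    embed_text: Callable[[str], list[float]],
    caption: Optional[Callable[[bytes], str]] = None,
    embed_image: Optional[Callable[[bytes], list[float]]] = None,
) -> list[float]:
    if embed_image is not None:
        # True multimodal path: no Image2Text model required at all.
        return embed_image(image_bytes)
    if caption is None:
        # Fail with an actionable message instead of a cryptic one.
        raise ValueError("configure either embed_image or a caption model")
    # Legacy caption path, now explicit and opt-in rather than mandatory.
    return embed_text(caption(image_bytes))
```

Either path works, neither is forced, and the failure mode for a missing configuration is explicit.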