Replies: 2 comments 2 replies
-
Thanks for the feedback, here are the responses:
However, the latter approach runs into several bottlenecks:
We are resolving these issues in our other project, Infinity.
Regarding your other issues:
-
You nailed it: this isn't "multimodal RAG," it's just text-only RAG with an OCR -> caption -> text-embed pipeline bolted on. That's exactly why it shows up in our WFGY map as Problem 9 (entropy collapse) plus Problem 10 (creative freeze): the pipeline collapses context into brittle captions, and the system can't handle direct embeddings from image data. The real failure here is design lock-in. Users can't swap in their own multimodal embeddings, so the architecture blocks the very thing it advertises. We've been documenting fixes for this (a semantic firewall that decouples modality from infrastructure). If you want, ask me for the link and I'll point you straight to the breakdown.
-
After spending several days trying to implement a production-grade multimodal RAG pipeline with RAGFlow, I am forced to abandon the project. This issue is to highlight critical, fundamental flaws in RAGFlow's design and marketing that make it unusable for any serious application.
Let's be perfectly clear: RAGFlow does NOT have a multimodal embedding capability. The marketing is a lie. The pipeline for images is strictly unimodal and text-based:
`Image -> OCR -> Image Caption (CV LLM) -> Text -> Text Embedding`
At no point are the actual visual features of the image embedded. The system only embeds a textual description of the image. This is not multimodal RAG; it's a text-only RAG with a brittle image-to-text pre-processing step.
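The difference can be made concrete with a minimal sketch (this is not RAGFlow code; `caption_model`, `embed_text`, and `embed_image` are hypothetical stand-ins for real models). The key point: in the caption path, any visual feature the caption fails to mention is gone before embedding ever happens.

```python
# Minimal sketch contrasting the two ingestion paths. All three model
# functions below are HYPOTHETICAL stand-ins, not real APIs.

def caption_model(image_bytes: bytes) -> str:
    # Hypothetical CV LLM: reduces the image to a short textual description.
    return "a bar chart comparing quarterly revenue"

def embed_text(text: str) -> list[float]:
    # Hypothetical text embedder: only ever sees characters, never pixels.
    return [float(len(text)), float(text.count(" "))]

def embed_image(image_bytes: bytes) -> list[float]:
    # Hypothetical multimodal embedder: operates on the raw bytes directly.
    return [float(len(image_bytes)), float(image_bytes[0])]

def caption_then_embed(image_bytes: bytes) -> list[float]:
    # The pipeline described above: Image -> caption -> TEXT embedding.
    return embed_text(caption_model(image_bytes))

def direct_multimodal_embed(image_bytes: bytes) -> list[float]:
    # What a true multimodal pipeline would do: embed the image itself.
    return embed_image(image_bytes)
```

Two visually different images that happen to get the same caption collapse to identical vectors in the caption path, while the direct path still distinguishes them.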
This leads to two catastrophic failures in design:
1. The Image2Text (CV LLM) step cannot be disabled. If you do not configure an Image2Text model, the entire ingestion process fails with the error `Type of image2text model is not set`. If the model you configure produces output RAGFlow doesn't like, ingestion fails with cryptic errors like `string indices must be...`.
2. It is IMPOSSIBLE to use a true, external multimodal embedding model (such as one based on Qwen-VL, LLaVA, etc.) that generates embeddings from the image pixels directly. The system actively prevents a proper multimodal implementation.
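For what it's worth, `string indices must be...` is the generic Python `TypeError` you get when code indexes a string as if it were a dict. A plausible (purely illustrative, not RAGFlow's actual code) reproduction: the pipeline expects the CV model to return a parsed JSON object but receives a bare string.

```python
# Illustrative reproduction of the "string indices must be ..." failure mode.
# extract_caption is a HYPOTHETICAL helper, not RAGFlow code; it assumes the
# model returned parsed JSON like {"caption": "..."}.

def extract_caption(model_output):
    # Works only if model_output is a dict; a plain string blows up here.
    return model_output["caption"]

try:
    # A model that returns a bare string instead of the expected JSON object:
    extract_caption("a photo of a cat")
    message = ""
except TypeError as exc:
    message = str(exc)

print(message)  # contains "string indices must be"
```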
Beyond the multimodal failure, the core UX of forcing a single, rigid "Chunking Method" at the Knowledge Base level is an archaic design philosophy. Real-world projects contain a mix of documents (long PDFs, tables, slides, code). Forcing a user to create separate KBs for each document type is incredibly inefficient and feels like a design from the 1980s. Components like "DeepDoc" are marketed as sophisticated solutions but are just opaque parts of this broken black box ("DeepShit" would be a more accurate name).
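The flexibility being asked for is not exotic. A sketch of per-document chunking dispatch (hypothetical chunkers, not RAGFlow's API) shows how one knowledge base could route each file to a suitable strategy instead of fixing a single method KB-wide:

```python
from pathlib import Path

# HYPOTHETICAL chunkers, for illustration only.

def chunk_prose(text: str) -> list[str]:
    # Fixed-size windows for long prose documents (window size is arbitrary).
    return [text[i:i + 20] for i in range(0, len(text), 20)]

def chunk_code(text: str) -> list[str]:
    # Keep whole top-level blocks together for source files.
    return [block for block in text.split("\n\n") if block.strip()]

# One registry per knowledge base, keyed by file extension.
CHUNKERS = {".pdf": chunk_prose, ".py": chunk_code}

def chunk_document(path: str, text: str) -> list[str]:
    # Route each document to its own strategy; fall back to prose chunking.
    chunker = CHUNKERS.get(Path(path).suffix, chunk_prose)
    return chunker(text)
```

With this shape, mixed corpora (PDFs, code, slides) live in one KB and each file still gets an appropriate chunking method.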
Conclusion:
In its current state, RAGFlow is a time sink. It promises an easy-to-use, powerful, multimodal RAG solution, but delivers a rigid, fragile, and fundamentally dishonest text-only system.
I am posting this to warn other developers: if you need true multimodal capabilities or a flexible document processing pipeline, look elsewhere.
I strongly urge the developers to either:
1. Be honest in the documentation and marketing about these severe limitations, or
2. Completely redesign the ingestion pipeline to be modular, allowing users to disable components and, crucially, to use embeddings generated from the actual image data.
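Concretely, the redesign asked for here amounts to making the caption stage optional and accepting a user-supplied image embedder. A minimal sketch of such an interface (hypothetical API, not a proposal for RAGFlow's actual signatures):

```python
from typing import Callable, Optional

# HYPOTHETICAL modular ingestion API: the caption model is optional, and a
# user-supplied image embedder can bypass the caption path entirely.

def ingest_image(
    image_bytes: bytes,
    embed_text: Callable[[str], list[float]],
    caption: Optional[Callable[[bytes], str]] = None,
    embed_image: Optional[Callable[[bytes], list[float]]] = None,
) -> list[float]:
    if embed_image is not None:
        # True multimodal path: no Image2Text model required at all.
        return embed_image(image_bytes)
    if caption is None:
        # Fail with an actionable message instead of a cryptic one.
        raise ValueError("configure either embed_image or a caption model")
    # Legacy caption path, now explicit and opt-in rather than mandatory.
    return embed_text(caption(image_bytes))
```

Either path works, neither is forced, and the failure mode for a missing configuration is explicit.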