How to parse images in Word documents into URL format #8752
Replies: 3 comments 12 replies
-
RAGFlow does not currently support extracting images from Word documents and converting them to URL format out of the box. The backend parser for Word files only handles text and tables. The frontend uses the 'mammoth' library to convert DOCX to HTML, but this embeds images as base64 data URIs, not as URLs. To achieve what you want, you’ll need to extract images from your Word documents using an external tool (for example, with Python’s [...]). There’s no built-in or documented workflow for this in RAGFlow, so this process would need to be handled outside the platform. If you have specific requirements for the output format or hosting, let me know and I can suggest more targeted steps.
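As a rough illustration of the "extract outside the platform" step, here is a minimal sketch using only the Python standard library. It relies on the fact that a `.docx` file is a ZIP archive whose embedded images live under `word/media/`. The function name `extract_docx_images`, the output directory, and the `base_url` are all hypothetical; this is not a RAGFlow or Dify API, and you would still need to host the extracted files yourself (static file server, object storage, and so on) so that the resulting URLs resolve.

```python
import hashlib
import zipfile
from pathlib import Path, PurePosixPath


def extract_docx_images(docx_file, out_dir, base_url):
    """Copy embedded images out of a .docx and return {archive_path: url}.

    docx_file: path or binary file-like object for the .docx
    out_dir:   local directory that your web server / bucket sync exposes
    base_url:  hypothetical public prefix under which out_dir is served
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    urls = {}
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            # Embedded images are stored under word/media/ inside the package.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            data = archive.read(name)
            # Name files by content hash so identical images are stored once
            # and names from different documents do not collide.
            suffix = PurePosixPath(name).suffix.lower()
            digest = hashlib.sha256(data).hexdigest()[:16]
            target = out_dir / f"{digest}{suffix}"
            target.write_bytes(data)
            urls[name] = f"{base_url.rstrip('/')}/{target.name}"
    return urls
```

For example, `extract_docx_images("manual.docx", "./static/img", "https://example.com/img")` would return a mapping from each archive path to a URL under that prefix. Placing those URLs at the right positions in the parsed text (for instance as Markdown image links) is a separate step that this sketch does not cover.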
-
@tianshiqingge could you give me an example? If there is one built with Dify, that would help even more.
-
you're actually hitting two common blind spots here:
we’ve mapped these as ProblemMap No.1 and No.4 in our diagnostics, and we’ve got an open-source MIT-licensed toolkit (with backing from the tesseract.js author) that handles exactly this edge case. happy to share if you’re exploring deeper pipelines; let me know.
-
Most of my Word documents contain both images and text. I want to convert the images to URLs so that the data retrieved from the external knowledge base in the Dify workflow can be presented with images and text together.