How to parse images in Word documents into URL format #8752

tianshiqingge · 2025-07-09T09:11:09Z

tianshiqingge
Jul 9, 2025

Most of my Word documents mix images and text. I want to convert the images to URLs so that the data retrieved from the external knowledge base in a Dify workflow can be presented as a combination of images and text. How can I do this?

tianshiqingge · 2025-07-09T09:14:59Z

dosubot[bot]
bot Jul 9, 2025

RAGFlow does not currently support extracting images from Word documents and converting them to URL format out of the box. The backend parser for Word files only handles text and tables. The frontend uses the 'mammoth' library to convert DOCX to HTML, but this embeds images as base64 data URIs, not as URLs (reference).

To achieve what you want, you’ll need to extract images from your Word documents using an external tool (for example, with Python’s python-docx or docx2txt), upload those images to a web-accessible storage (like S3, OSS, or your own server), and then replace the image references in your document content with the corresponding URLs. After that, you can present the combined text and image URLs in your DIFY or RAGFlow workflow.
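The extraction step described above can be sketched with the Python standard library alone, since a DOCX file is a ZIP archive that stores its images under `word/media/`. This is only an illustration under stated assumptions: `BASE_URL` and the function name `extract_docx_images` are hypothetical, and the actual upload to S3, OSS, or your own server is left out.

```python
import zipfile
from pathlib import Path, PurePosixPath

# ASSUMPTION: hypothetical public prefix; replace with your bucket or server URL.
BASE_URL = "https://cdn.example.com/docs"


def extract_docx_images(docx_path, out_dir, base_url=BASE_URL):
    """Copy every file under word/media/ out of a DOCX.

    Returns {archive member name: URL the image would have after upload}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(docx_path).stem
    urls = {}
    with zipfile.ZipFile(docx_path) as archive:
        for member in archive.namelist():
            if not member.startswith("word/media/") or member.endswith("/"):
                continue
            # Prefix with the document name so images from different
            # documents do not overwrite each other.
            filename = f"{stem}_{PurePosixPath(member).name}"
            (out / filename).write_bytes(archive.read(member))
            urls[member] = f"{base_url.rstrip('/')}/{filename}"
    return urls
```

After running this, the files in `out_dir` still have to be uploaded to the storage that serves `BASE_URL`; the returned mapping only records which URL each image is expected to end up at.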

There’s no built-in or documented workflow for this in RAGFlow, so this process would need to be handled outside the platform (reference). If you have specific requirements for the output format or hosting, let me know and I can suggest more targeted steps.

To reply, just mention @dosu.


4 replies

tianshiqingge Jul 11, 2025
Author

@dosu
I want to submit the document through the API. Can RAGFlow accept the content in JSON format instead of as a file?

dosubot[bot] bot Jul 11, 2025

RAGFlow does not support passing full document content (including text and images/URLs) directly as JSON via the API. Document ingestion is file-centric: you can upload files using multipart/form-data, or provide a URL for the backend to fetch and parse a file. JSON endpoints are only used for metadata, folder creation, or virtual/empty document records—not for uploading actual document content or pre-parsed data like text and image URLs (reference).

If you need to ingest pre-parsed content, you’ll need to convert it into a file (such as a DOCX, TXT, or HTML file) and upload it, or host it at a URL and use the URL ingestion endpoint. Let me know if you want more details on how to implement this workaround.
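To illustrate the file-centric workaround, here is a rough sketch that wraps pre-parsed content in a multipart/form-data request using only the standard library. It builds the request without sending it. The endpoint path `/api/v1/datasets/{dataset_id}/documents`, the `file` form field, and the Bearer-token header are assumptions based on my reading of RAGFlow's HTTP API reference and should be checked against the version you run; the function name `build_upload_request` is hypothetical.

```python
import uuid
import urllib.request


def build_upload_request(base_url, api_key, dataset_id, filename, content):
    """Build (but do not send) a multipart/form-data POST carrying one file.

    `content` is the file body as bytes, e.g. HTML or TXT you generated.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    # ASSUMPTION: endpoint path and auth scheme; verify for your RAGFlow version.
    url = f"{base_url.rstrip('/')}/api/v1/datasets/{dataset_id}/documents"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send the upload. Keeping construction separate from sending makes it easy to inspect the request first.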


tianshiqingge Jul 11, 2025
Author

@dosu please tell me

dosubot[bot] bot Jul 11, 2025

RAGFlow does not support ingesting document content (including text and image URLs) directly as JSON via the API—it only accepts files (like DOCX, TXT, or HTML) or URLs to files for ingestion. Its parsers are text-centric: images in DOCX and HTML are ignored, and only the text is extracted. If you want to preserve image URLs, you need to manually extract the images from your Word documents, upload them to a web-accessible location, and insert the image URLs as plain text (e.g., https://...) into your document content before saving as TXT or HTML. Then, upload this file to RAGFlow using the standard file upload API. When you retrieve the content, the image URLs will be present as text, and you can configure your downstream system (like DIFY) to render these URLs as images if needed. There is no built-in support for structured image blocks or automatic image URL extraction in RAGFlow at this time (source, source, source).

If you need a step-by-step example or a script to automate this process, let me know your preferred programming language or workflow.
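A minimal sketch of the "insert the image URLs as plain text" step, assuming Python and only the standard library. It reads `word/document.xml` and the relationship file inside the DOCX and emits one line per paragraph, with each inline image replaced by a URL. The function name `docx_to_text_with_urls` and the `url_for` callback (which maps an archive member such as `word/media/image1.png` to wherever you uploaded it) are hypothetical. Legacy VML images, externally linked images, and nested text boxes are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def docx_to_text_with_urls(docx_path, url_for):
    """Return the document text, one paragraph per line, with every
    inline image replaced by url_for(archive_member)."""
    with zipfile.ZipFile(docx_path) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        body = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship ids (rId...) to archive members.
    # ASSUMPTION: image targets are relative to word/, e.g. "media/image1.png".
    targets = {
        rel.get("Id"): "word/" + rel.get("Target")
        for rel in rels.iter(REL + "Relationship")
        if rel.get("Type", "").endswith("/image")
    }

    lines = []
    for para in body.iter(W + "p"):
        parts = []
        for node in para.iter():  # document order within the paragraph
            if node.tag == W + "t" and node.text:
                parts.append(node.text)
            elif node.tag == A + "blip":
                member = targets.get(node.get(R + "embed"))
                if member:
                    parts.append(f" {url_for(member)} ")
        # Collapse runs of whitespace introduced around the URLs.
        line = " ".join("".join(parts).split())
        if line:
            lines.append(line)
    return "\n".join(lines)
```

The resulting string can be saved as a TXT file and uploaded like any other document; whether the downstream system renders the URLs as images depends on how it is configured.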


ZhenhangTung · 2025-07-10T03:19:57Z

ZhenhangTung
Jul 10, 2025
Collaborator

@tianshiqingge could you give me an example? If there is one built with Dify, that would help even more.

8 replies

ZhenhangTung Jul 11, 2025
Collaborator

I would expect 0.20 to be released this month.

tianshiqingge Jul 11, 2025
Author

Thank you

NyaDoo Jul 24, 2025

@tianshiqingge could you give me an example? If there is any one built with Dify, that will help more.

Is there an exact release date yet? In some aspects (especially tables, pictures, and multi-level directories), there is still a big gap between what the current DeepDoc produces and the desired chunking results.

YyBoom233 Jul 28, 2025

I would expect 0.20 to be released this week, @tianshiqingge.

tianshiqingge Aug 9, 2025
Author

Thank you

onestardao · 2025-08-04T13:45:13Z

onestardao
Aug 4, 2025

you're actually hitting two common blind spots here:

- the image content in Word docs (especially inline base64) can’t be indexed or retrieved properly, so even if it looks embedded, RAG can’t “see” it.
- converting Word images to URLs manually still won’t solve the semantic retrieval gap unless you pair it with OCR + text binding.

we’ve mapped these as ProblemMap No.1 and No.4 in our diagnostics, and we’ve got an open-source MIT-licensed toolkit (with backing from the tesseract.js author) that handles exactly this edge case. happy to share if you’re exploring deeper pipelines — let me know.

0 replies
