Skip to content

Improper results on scanned pdfs #193

Open
@Shravan-Ganji

Description

@Shravan-Ganji

I have been trying to analyze the documents using layout parser on different types of documents, I am able to get expected results on True pdfs but not on scanned pdfs, it is detecting the scanned pdf image contents as figure or not as expected results.

I am facing this issue only for the scanned pdfs

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version, see the Layout Parser Releases

To Reproduce

import layoutparser as lp
import cv2

image = cv2.imread("test.png")
image = image[..., ::-1]

model = lp.models.Detectron2LayoutModel('lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
label_map={0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"})

color_map = {
'Text': 'red',
'Title': 'blue',
'List': 'green',
'Table': 'purple',
'Figure': 'pink',
}

layout = model.detect(image)

lp.draw_box(image, layout, box_width=3,color_map=color_map)

Environment

  1. I am using windows
  2. Latest layout parser version

Contains 2 images:

1: Scanned pdf image result
2: Proper pdf image result
error
positive

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions