How to extract images and image BBox coordinates using python?

I am trying to extract the images in a PDF along with the BBox coordinates of each image.
I tried the pdfrw library. It identifies image objects, and each one has an attribute called media box that holds some coordinates. I am not sure these are the correct bbox coordinates, since for some PDFs it shows something like this:
['0', '0', '684', '864']
But the image doesn't start at the origin of the page, so I don't think this is the bbox.
I tried the following code using pdfrw:
import pdfrw, os
from pdfrw import PdfReader, PdfWriter
from pdfrw.findobjs import page_per_xobj
outfn = 'extract.' + os.path.basename(path)
pages = list(page_per_xobj(PdfReader(path).pages, margin=0.5*72))
writer = PdfWriter(outfn)
writer.addpages(pages)
writer.write()
How do I get each image along with its bbox coordinates?
sample pdf : https://drive.google.com/open?id=1IVbj1b3JfmSv_BJvGUqYvAPVl3FwC2A-

I found a way to do it with a library called pdfplumber. It's built on top of pdfminer and works consistently in my use case. Moreover, it's MIT licensed, which makes it usable for my office work.
import pdfplumber
pdf_obj = pdfplumber.open(doc_path)
page = pdf_obj.pages[page_no]
images_in_page = page.images
page_height = page.height
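image = images_in_page[0]  # assumes the page has at least one image; for illustration only
# crop() expects (x0, top, x1, bottom) measured from the top of the page,
# while y0 and y1 are measured from the bottom, so the vertical values are flipped.
image_bbox = (image['x0'], page_height - image['y1'], image['x1'], page_height - image['y0'])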
cropped_page = page.crop(image_bbox)
image_obj = cropped_page.to_image(resolution=400)
image_obj.save(path_to_save_image)
Worked well for tables and images in my case.
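The vertical flip in the bbox line above can be isolated into a small helper. This is a minimal sketch in plain Python; the function name flip_bbox is hypothetical and not part of pdfplumber. It assumes the input box uses PDF-style coordinates with the origin at the bottom-left of the page, and it returns (x0, top, x1, bottom) measured from the top-left.

```python
def flip_bbox(x0, y0, x1, y1, page_height):
    """Convert a bottom-left-origin box to a top-left-origin (x0, top, x1, bottom) box.

    flip_bbox is a hypothetical helper name used only for illustration.
    """
    top = page_height - y1     # upper edge, measured down from the top of the page
    bottom = page_height - y0  # lower edge, measured down from the top of the page
    return (x0, top, x1, bottom)


# A 200 x 100 pt image whose lower-left corner sits at (50, 600) on a 792 pt tall page.
print(flip_bbox(50, 600, 250, 700, 792))  # (50, 92, 250, 192)
```

The returned tuple has the same shape as image_bbox in the answer, so it could be passed to page.crop() in the same way.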

Related

How do I capture the full dimensions of a pdf table and convert it using Camelot in Python?

I have been trying to use the Camelot library to capture a table (one that isn't really formatted as a table) by setting the flavor parameter to 'stream'. However, it does not detect the entire table. So I decided to try capturing the entire page by passing an area parameter built from the page's dimensions.
I have tried this code, but it still does not give me the whole page:
import camelot
from matplotlib import pyplot as plt
import pandas as pd
import PyPDF2
pdf_file = open(r'C:\Users\PC\PycharmProjects\finstate.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
page = pdf_reader.getPage(10)
width = page.mediaBox.getWidth()
height = page.mediaBox.getHeight()
print("Width:", width)
print("Height:", height)
page_area = [0, 0, 0, 0]
pdf = camelot.read_pdf(r'C:\Users\PC\PycharmProjects\finstate.pdf', pages='0-10', flavor='stream', area=page_area)
first_table = pdf[10]
print(first_table.df)
first_table.to_csv(r'C:\Users\PC\Desktop\table.csv')

To improve the detected area, you can increase the edge_tol (default: 50) value to counter the effect of text being placed relatively far apart vertically. Larger edge_tol will lead to longer text edges being detected, leading to an improved guess of the table area. Let’s use a value of 500.
You can try the following code. If it doesn't work, experiment with the edge_tol value:
tables = camelot.read_pdf(r'C:\Users\PC\PycharmProjects\finstate.pdf', pages='0-10', flavor='stream', edge_tol=500)
The following snippet can help you see how your table was detected:
camelot.plot(tables[0], kind='contour').show()
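If the goal is still to pass the whole page as the table region, the region has to be built from the page's width and height rather than left as [0, 0, 0, 0]. The sketch below assumes Camelot's documented table_areas format: a list of "x1,y1,x2,y2" strings where (x1, y1) is the left-top corner and (x2, y2) is the right-bottom corner in PDF coordinates (origin at the bottom-left). The helper name full_page_area is hypothetical.

```python
def full_page_area(width, height):
    """Return an 'x1,y1,x2,y2' string that covers the whole page.

    Assumes (x1, y1) is the left-top corner and (x2, y2) the right-bottom
    corner, with the origin at the bottom-left of the page.
    full_page_area is a hypothetical helper name used only for illustration.
    """
    # float() accepts ints, floats and Decimal-like values such as PDF box sizes.
    return f"0,{float(height):g},{float(width):g},0"


# US Letter is 612 x 792 points.
print(full_page_area(612, 792))  # 0,792,612,0
```

The resulting string could then be passed as table_areas=[full_page_area(width, height)] to camelot.read_pdf; it is worth checking the parameter name against the Camelot version you have installed.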

Image clarity issue in HTML page

I have Matplotlib & Seaborn visualizations that need to be saved in HTML. Since there is no direct method to do so, I first saved the images in PNG & then converted them to HTML. This decreased the quality of my images.
My code:
import aspose.words as aw
from PIL import Image
def pairplot_fun(eda_file, pairplot_inputs, pairplot_png, pairplot_html):
    pairplot_var = pairplot_inputs[0]
    sns.pairplot(eda_file, hue=pairplot_var, height=4)
    plt.savefig(pairplot_png)
    doc = aw.Document()
    builder_pairplot = aw.DocumentBuilder(doc)
    builder_pairplot.insert_image(pairplot_png)
    doc.save(pairplot_html, dpi=1200)
Specifying the 'dpi' this way isn't making any difference. How do I improve the clarity of my image saved in HTML format?

In Aspose.Words, you can specify the image resolution through the HtmlSaveOptions.image_resolution property:
doc = aw.Document("in.docx")
options = aw.saving.HtmlSaveOptions()
options.image_resolution = 1200
doc.save("out.html", options)

create pdf with fpdf in python. can not move image to the right in a loop

I use FPDF to generate a PDF with Python, and I have a problem I am looking for a solution to. In the folder "images" there are pictures that I would like to display, each on its own page. I did that, though maybe not elegantly. Unfortunately, I can't move the picture to the right. It looks like pdf.set_y won't work in the loop.
from fpdf import FPDF
from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir('../images') if isfile(join('../images', f))]
pdf = FPDF('L')
pdf.add_page()
pdf.set_font('Arial', 'B', 16)
onlyfiles = ['VI 1.png', 'VI 2.png', 'VI 3.png']
y = 10
for image in onlyfiles:
    pdf.set_x(10)
    pdf.set_y(y)
    pdf.cell(10, 10, 'please help me' + image, 0, 0, 'L')
    y = y + 210  # to go to the next page
    pdf.set_x(120)
    pdf.set_y(50)
    pdf.image('../images/' + image, w=pdf.w/2.0, h=pdf.h/2.0)
pdf.output('problem.pdf', 'F')
It would be great if you have a solution or some help for me. Thanks a lot.
greets alex

I see the issue. You want to specify the x and y location in the call to pdf.image(). In PyFPDF, set_y() also resets x to the left margin, so the set_x(120) call just before it has no effect. This assessment is based on the documentation for image here: https://pyfpdf.readthedocs.io/en/latest/reference/image/index.html
So you can instead do this (showing only the for loop here):
for image in onlyfiles:
    pdf.set_x(10)
    pdf.set_y(y)
    pdf.cell(10, 10, 'please help me' + image, 0, 0, 'L')
    y = y + 210  # to go to the next page
    # increase `x` from 120 to, say, 150 to move the image to the right
    pdf.image('../images/' + image, x=120, y=50, w=pdf.w/2.0, h=pdf.h/2.0)
    # new ->                        ^^^^^  ^^^^
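Rather than picking the x value by trial and error, it can be computed from the page width. The sketch below is plain Python and makes a few assumptions that are not in the answer: the helper name right_aligned_x is hypothetical, the 10-unit right margin is arbitrary, and the page width of 297 corresponds to an A4 landscape page measured in millimetres.

```python
def right_aligned_x(page_width, image_width, right_margin=10.0):
    """Return the x position that places the image `right_margin` away from the right page edge.

    right_aligned_x is a hypothetical helper name used only for illustration.
    """
    x = page_width - image_width - right_margin
    if x < 0:
        raise ValueError("image plus margin is wider than the page")
    return x


# The loop above draws each image at half the page width.
page_w = 297.0  # assumed: A4 landscape width in mm
print(right_aligned_x(page_w, page_w / 2.0))  # 138.5
```

The result would take the place of the fixed x=120 in the pdf.image() call, with pdf.w supplying the page width.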

You can check pdfme library. It's the most powerful library in python to create PDF documents. You can add urls, footnotes, headers and footers, tables, anything you would need in a PDF document.
The only problem I see is that currently pdfme only supports jpg format for images. But if that's not a problem it will help you with your task.
Check the docs here.

Disclaimer: I am the author of pText, the library I will use in this solution.
Let's start by creating an empty Document:
pdf: Document = Document()
# create empty page
page: Page = Page()
# add page to document
pdf.append_page(page)
Next, we load an image using Pillow:
import requests
from PIL import Image as PILImage
im = PILImage.open(
    requests.get(
        "https://365psd.com/images/listing/fbe/ups-logo-49752.jpg", stream=True
    ).raw
)
Now we can create a LayoutElement Image and use its layout method:
Image(im).layout(
    page,
    bounding_box=Rectangle(Decimal(20), Decimal(724), Decimal(64), Decimal(64)),
)
Keep in mind that the origin (in PDF space) is at the lower-left corner.
Lastly, we need to store the Document:
# attempt to store PDF
with open("output.pdf", "wb") as out_file_handle:
    PDF.dumps(out_file_handle, pdf)
You can obtain pText either on GitHub or from PyPI.
There are a ton more examples, check them out to find out more about working with images.

extract specific text from image

I'm trying to extract specific text from an image (or extract the whole text and then parse it).
The text in the image is in Hebrew.
What I have already tried, in Node.js, is the Tesseract library, but it does not recognize Hebrew text well.
I also tried converting the image to a PDF and then parsing the PDF, but that doesn't work well for Hebrew either.
Has anyone already tried to do this, maybe with Python or Node.js?
I'm trying to do something like the text detection in Google Cloud Vision.

Have you tried preprocessing the image you feed to Tesseract? If you haven't, I would try OpenCV contour detection, particularly the Hough Line Transform, and then clean the image up a bit. https://www.youtube.com/watch?v=lhMXDqQHf9g&list=PLQVvvaa0QuDeETZEOy4VdocT7TOjfSA8a&index=5 This video doesn't do exactly what you need, but if you take the time to scroll through it, you can see how it can be useful.

Based on our conversation under the original post, here are some options for you to consider.
Option 1:
If you are working directly with PDFs as your input files:
import fitz
input_file = '/path/to/your/pdfs/'
pdf_file = input_file
doc = fitz.open(pdf_file)
noOfPages = doc.pageCount
for pageNo in range(noOfPages):
    page = doc.loadPage(pageNo)
    pageTextblocks = page.getText('blocks')  # This creates a list of items (x0,y0,x1,y1,"line1\nline2\nline3...",...)
    pageTextblocks.sort(key=lambda block: block[3])
    for block in pageTextblocks:
        targetBlock = block[4]  # This gets to the content of each block and you can work your logic here to get relevant data
Option 2:
If you are working with an image as your input, you need to convert it to a PDF before processing it with the code snippet in Option 1.
doc = fitz.open(input_file)  # open the image file
pdfbytes = doc.convertToPDF()  # convert it to PDF data in memory
pdf = fitz.open("pdf", pdfbytes)  # open that data as a PDF document
One useful tip for processing images in PyMuPDF is to use a zoom factor for better resolution if the image is hard to recognize.
zoom = 1.2 # scale the image by 120%
mat = fitz.Matrix(zoom,zoom)
Option 3:
A hybrid approach with PyMuPDF and pytesseract, since you've mentioned Tesseract. I am not sure whether this approach fits your need to extract Hebrew text, but it's an idea. The example is for PDFs.
import fitz
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r'/path/to/your/tesseract/cmd'
input_file = '/path/to/pdfs'
pdf_file = input_file
fullText = ""
doc = fitz.open(pdf_file)
zoom = 1.2
mat = fitz.Matrix(zoom, zoom)
noOfPages = doc.pageCount
for pageNo in range(noOfPages):
    page = doc.loadPage(pageNo)  # load the page by its number
    pix = page.getPixmap(matrix=mat)
    output = '/path/to/save/image' + str(pageNo) + '.png'  # writePNG produces PNG data
    pix.writePNG(output)
    print('Converting PDFs to Image ... ' + output)
    # image_to_string also accepts a lang argument (e.g. lang='heb') if that language data is installed
    text_of_each_page = str(pytesseract.image_to_string(Image.open(output)))
    fullText += text_of_each_page
    fullText += '\n'
Hope this helps. If you need more information, the PyMuPDF documentation has a more detailed explanation to fit your needs.
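On the zoom factor used in the snippets above: PDF page units are points, with 72 points per inch, so a zoom of 1.0 renders at 72 dpi and a zoom of 1.2 at only about 86 dpi. A target of 300 dpi for OCR input is a common rule of thumb and is an assumption here, not something the answer states. The helper names below are hypothetical; this is a minimal sketch of the arithmetic in plain Python.

```python
POINTS_PER_INCH = 72.0  # PDF user space: 72 points per inch


def dpi_for_zoom(zoom):
    """Effective rendering resolution for a given zoom factor (hypothetical helper)."""
    return zoom * POINTS_PER_INCH


def zoom_for_dpi(dpi):
    """Zoom factor needed to render at the requested resolution (hypothetical helper)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / POINTS_PER_INCH


print(round(dpi_for_zoom(1.2), 1))  # 86.4
print(round(zoom_for_dpi(300), 2))  # 4.17
```

The value from zoom_for_dpi could be used in fitz.Matrix(zoom, zoom) in place of the fixed 1.2.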

Embed .SVG files into PDF using reportlab

I have written a script in python that produces matplotlib graphs and puts them into a pdf report using reportlab.
I am having difficulty embedding SVG image files into my PDF file. I've had no trouble using PNG images but I want to use SVG format as this produces better quality images in the PDF report.
This is the error message I am getting:
IOError: cannot identify image file
Does anyone have suggestions or have you overcome this issue before?

Yesterday I succeeded in using svglib to add an SVG image as a reportlab Flowable.
The drawing that svglib returns is an instance of reportlab's Drawing, see here:
from reportlab.graphics.shapes import Drawing
A reportlab Drawing inherits from Flowable:
from reportlab.platypus import Flowable
Here is a minimal example that also shows how to scale it correctly (you only need to specify path and factor):
from svglib.svglib import svg2rlg
drawing = svg2rlg(path)
sx = sy = factor
drawing.width, drawing.height = drawing.minWidth() * sx, drawing.height * sy
drawing.scale(sx, sy)
#if you want to see the box around the image
drawing._showBoundary = True

As mentioned by skidzo, you can totally do this with the svglib package, which you can find here: https://pypi.python.org/pypi/svglib/
According to the website, Svglib is a pure-Python library for reading SVG files and converting them (to a reasonable degree) to other formats using the ReportLab Open Source toolkit.
You can use pip to install svglib.
Here is a complete example script:
# svg_demo.py
from reportlab.graphics import renderPDF, renderPM
from reportlab.platypus import SimpleDocTemplate
from svglib.svglib import svg2rlg
def svg_demo(image_path, output_path):
    drawing = svg2rlg(image_path)
    renderPDF.drawToFile(drawing, output_path)

if __name__ == '__main__':
    svg_demo('/path/to/image.svg', 'svg_demo.pdf')

skidzo's answer is very helpful, but isn't a complete example of how to use an SVG file as a flowable in a reportlab PDF. Hopefully this is helpful for others trying to figure out the last few steps:
from io import BytesIO
import matplotlib.pyplot as plt
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph
from svglib.svglib import svg2rlg
def plot_data(data):
    # Plot the data using matplotlib.
    plt.plot(data)
    # Save the figure to SVG format in memory.
    svg_file = BytesIO()
    plt.savefig(svg_file, format='SVG')
    # Rewind the file for reading, and convert to a Drawing.
    svg_file.seek(0)
    drawing = svg2rlg(svg_file)
    # Scale the Drawing.
    scale = 0.75
    drawing.scale(scale, scale)
    drawing.width *= scale
    drawing.height *= scale
    return drawing

def main():
    styles = getSampleStyleSheet()
    pdf_path = 'sketch.pdf'
    doc = SimpleDocTemplate(pdf_path)
    data = [1, 3, 2]
    story = [Paragraph('Lorem ipsum!', styles['Normal']),
             plot_data(data),
             Paragraph('Dolores sit amet.', styles['Normal'])]
    doc.build(story)

main()
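The fixed scale of 0.75 in the example above could instead be derived from the space available on the page. The sketch below is plain Python with a hypothetical helper name, fit_scale. It assumes you want a uniform scale that preserves the aspect ratio and never enlarges the drawing.

```python
def fit_scale(width, height, max_width, max_height):
    """Largest uniform scale, capped at 1.0, at which width x height fits inside the box.

    fit_scale is a hypothetical helper name used only for illustration.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return min(max_width / width, max_height / height, 1.0)


# An 800 x 400 drawing that must fit a 400 x 300 frame is limited by its width.
print(fit_scale(800, 400, 400, 300))  # 0.5
```

The returned value would replace scale in plot_data, with drawing.width and drawing.height as the first two arguments.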

You need to make sure you are importing PIL (Python Imaging Library) in your code so that ReportLab can use it to handle image types like SVG. Otherwise it can only support a few basic image formats.
That said, I recall having some trouble, even when using PIL, with vector graphics. I don't know if I tried SVG but I remember having a lot of trouble with EPS.
