I have a Python game that uses sprites, and it loads the images from files in its directory. I want to make it so that I do not need the files at all: somehow pre-store the image in a variable so that I can use it from within the program, without the additional .gif files.
The way I am actually loading the image is:
image = PIL.Image.open('image.gif')
So it would be helpful if you could be precise about how to replace this code
Continuing #eatmeimadanish's thoughts, you can do it manually:
import base64
with open('image.gif', 'rb') as imagefile:
    base64string = base64.b64encode(imagefile.read()).decode('ascii')
print(base64string) # print base64string to console
# Will look something like:
# iVBORw0KGgoAAAANS ... qQMAAAAASUVORK5CYII=
# or save it to a file
with open('testfile.txt', 'w') as outputfile:
    outputfile.write(base64string)
# Then make a simple test program:
from tkinter import *
root = Tk()
# Paste the ascii representation into the program
photo = 'iVBORw0KGgoAAAANS ... qQMAAAAASUVORK5CYII='
img = PhotoImage(data=photo)
label = Label(root, image=img)
label.pack()  # pack() returns None, so keep it on its own line
root.mainloop()
This uses tkinter's PhotoImage, but I'm sure you can figure out how to make it work with PIL.
Here is how you can open it using PIL. You need the raw bytes of the image; PIL can then open a file-like object wrapping them.
import base64
from PIL import Image
import io
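with open("picture.png", "rb") as file:
    encoded = base64.b64encode(file.read())  # this is the value you can paste into the program
# decode back to the raw image bytes before handing them to PIL
img = Image.open(io.BytesIO(base64.b64decode(encoded)))
img.show()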
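To make the "paste it into the program" step concrete, here is a minimal sketch using only the standard library. The helper names `to_python_constant` and `open_embedded` and the constant name `IMAGE_B64` are made up for illustration, and the sample bytes are a stand-in for the real contents of `image.gif`.

```python
import base64
import io


def to_python_constant(data: bytes, name: str = "IMAGE_B64") -> str:
    """Return Python source that defines `name` as the base64 text of `data`."""
    encoded = base64.b64encode(data).decode("ascii")
    # Split into 76-character pieces so the generated source stays readable.
    pieces = [encoded[i:i + 76] for i in range(0, len(encoded), 76)]
    body = "\n".join(f'    "{piece}"' for piece in pieces)
    return f"{name} = (\n{body}\n)\n"


def open_embedded(encoded: str) -> io.BytesIO:
    """Decode the embedded text back into a file-like object of raw bytes."""
    return io.BytesIO(base64.b64decode(encoded))


# Stand-in bytes; in practice you would read them with open('image.gif', 'rb').
sample = b"GIF89a" + bytes(range(256))
source = to_python_constant(sample)

# exec() only simulates pasting the generated lines into the game's source.
namespace = {}
exec(source, namespace)
stream = open_embedded(namespace["IMAGE_B64"])
```

With Pillow installed, `PIL.Image.open(stream)` should then take the place of `PIL.Image.open('image.gif')`, since `Image.open` accepts a file-like object.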
I want to add a white background to my transparent images (PNG) and resize them. The images are located in a folder. I need to do this in bulk, not one image at a time.
I first removed the background from the images with rembg (this works well), and now I want to modify the resulting images.
My code:
import rembg
import glob
from pathlib import Path
from rembg import remove, new_session
session = new_session()
for file in Path(r'C:\test\images').glob('*.jpg'):
    input_path = str(file)
    output_path = str(file.parent / (file.stem + ".out.png"))
    with open(input_path, 'rb') as i:
        with open(output_path, 'wb') as o:
            input = i.read()
            output = remove(input, session=session)
            o.write(output)
I do not know how to add the white background and resize with Python because I'm fairly new to this. Thank you in advance!
I think you want a helper function to do the work, something like:
from PIL import Image
import rembg
def process(session, image, *, size=None, bgcolor='white'):
    "session is a rembg Session, and image is a PIL Image"
    if size is not None:
        image = image.resize(size)
    else:
        size = image.size
    # start from an opaque canvas filled with the background colour
    result = Image.new("RGB", size, bgcolor)
    out = rembg.remove(image, session=session)
    # paste the cut-out, using its own alpha channel as the mask
    result.paste(out, mask=out)
    return result
The idea being that you pass a rembg Session and a Pillow Image in and it will remove the background and flatten that image, resizing along the way.
As a working example, you could do something like:
from io import BytesIO
import requests
session = rembg.new_session("u2netp")
res = requests.get("https://picsum.photos/600")
res.raise_for_status()
with Image.open(BytesIO(res.content)) as img:
    out = process(session, img, size=(256, 256), bgcolor='#F0E68C')
    out.save("output.png")
If you wanted to work with lots of files, your pathlib objects can be passed directly to Pillow:
from pathlib import Path
for path_in in Path(r'C:\test\images').glob('*.jpg'):
    path_out = path_in.parent / f"{path_in.stem}-out.png"
    # no point processing images that have already been done!
    if path_out.exists():
        continue
    with Image.open(path_in) as img:
        out = process(session, img, size=(256, 256), bgcolor='#F0E68C')
        out.save(path_out)
Update: it's often worth adding a check into these loops so they can be rerun and not have to process everything again. If you really do want images to be re-processed then just delete *-out.png
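The "skip what is already done" check can also be pulled out into a small generator, so the loop body only deals with image work. This is a sketch using only pathlib; the name `pending_jobs` and its parameters are made up for illustration.

```python
from pathlib import Path
from typing import Iterator, Tuple


def pending_jobs(folder: Path, pattern: str = "*.jpg",
                 suffix: str = "-out.png") -> Iterator[Tuple[Path, Path]]:
    """Yield (input, output) path pairs whose output file does not exist yet."""
    for path_in in sorted(folder.glob(pattern)):
        path_out = path_in.with_name(f"{path_in.stem}{suffix}")
        if not path_out.exists():
            yield path_in, path_out
```

The batch loop would then read `for path_in, path_out in pending_jobs(Path(r'C:\test\images')):`, followed by the same `Image.open` / `process` / `save` steps as before.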
I want to append the contents of a Tkinter widget to the end of an existing pdf.
First, I capture the widget to a PIL image. Then, when using PyPDF2, it seems to be necessary to create an intermediate temporary file from the image, which feels unnecessary. Instead I would like to append the image directly, or at least convert the image to something that can be appended without being written to a file.
In the code snippet below I save the image to a temporary pdf, then open the pdf and append. This works, but is not a very elegant solution.
import tkinter as tk
from PIL import ImageGrab
import os
import PyPDF2
def process_pdf(widget, pdf_filepath):
    """Append Tkinter widget to pdf"""
    # capture the widget
    img = capture_widget(widget)
    # create pdf merger object
    merger = PyPDF2.PdfMerger()
    # creating a pdf file object of original pdf and add to output
    pdf_org = open(pdf_filepath, 'rb')
    merger.append(pdf_org)
    # NOTE I want to get rid of this step
    # create temporary file, read it, append it to output, delete it.
    temp_filename = pdf_filepath[:-4] + "_temp.pdf"
    img.save(temp_filename)
    pdf_temp = open(temp_filename, 'rb')
    merger.append(pdf_temp)
    pdf_temp.close()
    # write
    outputfile = open(pdf_filepath, "wb")
    merger.write(outputfile)
    # clean up
    pdf_org.close()
    outputfile.close()
    os.remove(temp_filename)

def capture_widget(widget):
    """Take screenshot of the passed widget"""
    x0 = widget.winfo_rootx()
    y0 = widget.winfo_rooty()
    x1 = x0 + widget.winfo_width()
    y1 = y0 + widget.winfo_height()
    img = ImageGrab.grab((x0, y0, x1, y1))
    return img
Does someone have a more elegant solution not requiring an intermediate file while retaining the flexibility PyPDF2 provides?
Thanks.
So I was able to find a solution myself. The trick is writing the PIL image to a byte array (see this question), then converting that to a pdf using img2pdf. For img2pdf, it is required to use the format = "jpeg" argument during saving to byte array.
Subsequently, the result of img2pdf can be written to another io.BytesIO() stream. Since it implements a .read() method, PyPDF2.PdfMerger() can read it.
Below is the code, hope this helps someone.
import io
import PyPDF2
import img2pdf
def process_pdf(widget, pdf_filepath):
    """Append Tkinter widget to pdf"""
    # create pdf merger object
    merger = PyPDF2.PdfMerger()
    # creating a pdf file object of original pdf and add to output
    pdf_org = open(pdf_filepath, 'rb')
    merger.append(pdf_org)
    pdf_org.close()
    # capture the widget (capture_widget is the function from the question)
    img = capture_widget(widget)
    # create img bytearray
    img_byte_arr = io.BytesIO()
    img.save(img_byte_arr, format = "jpeg", quality=100)
    img_byte_arr = img_byte_arr.getvalue()
    # create a pdf bytearray and do formatting using img2pdf (A4 page, 210 x 297 mm)
    pdf_byte_arr = io.BytesIO()
    layout_fun = img2pdf.get_layout_fun((img2pdf.mm_to_pt(210),img2pdf.mm_to_pt(297)))
    pdf_byte_arr.write(img2pdf.convert(img_byte_arr, layout_fun=layout_fun))
    merger.append(pdf_byte_arr)
    # write
    outputfile = open(pdf_filepath[:-4] + "_appended.pdf", "wb")
    merger.write(outputfile)
    outputfile.close()
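One detail worth knowing when passing an `io.BytesIO` around as a stand-in for a file: after a `write()`, the stream position sits at the end, so a plain `read()` returns nothing until the stream is rewound. Whether a given library rewinds for you depends on that library, so calling `seek(0)` yourself is a cheap safeguard. The snippet below is a standard-library illustration only, and the placeholder bytes stand in for the output of `img2pdf.convert(...)`.

```python
import io

buffer = io.BytesIO()
buffer.write(b"%PDF-1.4 placeholder bytes")  # stand-in for the real PDF bytes

at_end = buffer.read()      # position is after the written data, so this is empty
buffer.seek(0)              # rewind before handing the stream to a reader
from_start = buffer.read()  # now the full contents come back
whole = buffer.getvalue()   # getvalue() ignores the current position entirely
```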
I am using pypng and pyqrcode for QR code image generation in a Django app.
import os
from pyqrcode import create
import png # pypng
import base64
def embed_QR(url_input, name):
    embedded_qr = create(url_input)
    embedded_qr.png(name, scale=7)

def getQrWithURL(url):
    name = 'url.png'
    embed_QR(url, name)
    with open(name, "rb") as image_file:
        image_data = base64.b64encode(image_file.read()).decode('utf-8')
    return image_data
When I call getQrWithURL with a URL, it writes a file url.png to my directory. Is there a way to get only the image data without producing an output file?
Thanks for your help.
Use a BytesIO as a writable stream:
import io
from pyqrcode import create
# Make a writeable stream
buffer = io.BytesIO()
# Create QR and write to buffer
embedded_qr = create(url_input)
embedded_qr.png(buffer, scale=7)
# Extract buffer contents - this is what you would get by reading a PNG disk file but without creating it
PNG = buffer.getvalue()
Your variable PNG now contains this:
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\xe7\x00\x00\x00\xe7\x01\x00\x00\x00\x00\xcd\x8f|\x8d\x00\x00\x01CIDATx\x9c\xed\x971\x92\x830\x0cE\xc5PPr\x04\x8e\x92\xa3\xc1\xd1|\x14\x8e\x90\x92\x82\xb1V_2\t\xde\xb0;i?c\x15q\xecG\xf3\x91\xfc%D\xff\x89,\x8d6\xda\xe8\x97\xf4)\x88AW\xfb\xed\xb7Q\x93\xef\'fj\x7ft\x1f\x9e\xf2Xt\xb7\xe34\x97Cb\x9a\xa43\x8a\xe3XL\xfd-\xa8\xc8\x1c\x19\xbc\x11-\xbb\x1b\xd0\xa3&m\x11\xe8\xbd\xaeX"\x1a\xbe\x01\xa1Q\x9aW\xaeBE\x8fXe\xee\xf4\x1c\xc44\xcd\x19\n\x93dX\xfcc\xb1\xcd\xc8L\x91A\x08\xb5c\xd5\xcd\x1f\xea\xb7*\xbfl\xd4\xf4\x9ao\xb8\xdeB\xbb\xd3-c\xa4h\xbc\x9dn\xa3\xd5\xa4\t\x1dW\xdf\t5\x9dt\xb1b,N\x88\xc8}\xe5\x93t\xd4\x8aQ\xa1Pp\xbd\x06\xe43\x9f\x9d\x90\x90\xfa-\xdb\xa3\xe3b\xa2\xc0Rw+6\n\',U\x18S\xdfo\xdf\xa0\xa3\x18\xf70J\xac\xf2\xea\xc6\xf5\xdb\xa0\xa3%\xdc>"\x91R\xbd\r>z\xccHq\xcb|T\xba\x98\xfa\xa8\xe8G1J\xcfN\xfd[\xc3/[x\xbbQ\x04=\x8d\xfek\x16\xafn\xf1\xfc\x14m\x18BK\xef\x9a\xa8\xa9\xd7\xa4\'\xed\xfd\x902\xd3\xf0\r\x8c\x12z\xb4\xe1\x0fW\xa1\xa2\x7fF\xa3\x8d6\xfa\x15\xfd\x01\xb9MCH#\xc3\xa2\x96\x00\x00\x00\x00IEND\xaeB`\x82'
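To get the same base64 string that `getQrWithURL` returned, the buffer contents can be encoded directly. Below is a small sketch of that pattern. `render_to_base64` and `fake_writer` are made-up names, and the fake writer only emits the 8-byte PNG signature so that the example runs without pyqrcode installed.

```python
import base64
import io
from typing import BinaryIO, Callable


def render_to_base64(write_image: Callable[[BinaryIO], None]) -> str:
    """Let `write_image` write into memory and return the bytes as base64 text."""
    buffer = io.BytesIO()
    write_image(buffer)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


def fake_writer(stream: BinaryIO) -> None:
    """Hypothetical stand-in for a real image writer: emits the PNG signature."""
    stream.write(b"\x89PNG\r\n\x1a\n")


encoded = render_to_base64(fake_writer)
```

With pyqrcode, the call would presumably look like `render_to_base64(lambda buf: create(url).png(buf, scale=7))`, reusing the `png(buffer, scale=7)` call shown above.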
I tried to decode the following base64 image
data:image/jpeg;base64,iVBORw0KGgoAAAANSUhEUgAAAGQAAAAyCAYAAACqNX6+AAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAASwSURBVHhe7VoxaxsxFPYvye5sJZC5JRkcSoZCoMSFDIVOGWpSQ6FDqLN5CqGBEnAIoe7sKTSbF2fqmiWbMWTLkl+gPp2k89P56U7Syc7Z1Qdf6+9OTz70qvednlv78efwNWPsuyLo9HPUi9c1/sf57SH8NUXUL6eThPAPISfliNpPpwnhCDWpQtTumntImhCOEJNiRO2mo4cAqqSjh0hURUcPQaiCjh5SMR09BFAlHT1Eoio6eghCFXT0kIrp6CGAl9DDdo3VagThfkkPeWBntRP21okX7Oja9/sE3PSEXe7U2e/19Vl2RsmIsN9n0kPWppKACaPKechwQCx4ARsj9qjiEdz1R3bfnF3oO7HGcP89u6sbEpEhjyn/PGZt3BFZlvWQ4z1iwQvY6j3JaPfvw3rSIRa7vgX7QWHEriwTwjmfpBywXoNYeBMhpoSH4HI1YH/lVf/5BAr16Ra5oAmhBE3Hj8QO2dllz/IKR3p/1NFjm/1knPPzGPU4JxltKGAC57fvpqUMtL+H4HK190XeEfCaD0HT2YUzsd5k93J7WM+P56530t1lHS9BaWOZag9nxp+01sQ9+OztIbhcnUG6XeNtNVmaKJp2ggStJ1Mfqm+mCeWwi59C08P2bCIS4p1BjPf3kEWVK3tjvulPPOZHLwawwy77rvGUNpeq7a4hXiUEPvp5yCqUqwTI+GXJcosndPfVTCIExe6g4se9hhgD2stDFlWuFKYalRhMr3IFmPTZjUqIPJNwWMdLYJ3nHQp6/Jh92JBjQHl4iM1hcMCOrecTsNJ4ARH9ytWh5k9Xp+7xGEKby1W9dSAGSqTx4x5rqHFeHuJwGNxvfZaRAlbzI2T1z2+bWiIS+pYrXA6DndjRK2yWxNsV12m54oRrjh7yxI426MWnGbJNErBc4Z2mHSYt4xE0Df/a6zgJGfKqlTseLrl5yHjEWrDQxn/5xO7hJ3Pr+SVIHapcwTy/0mSI3eUUDzBri35VHmEGDw8RMGqZNJwQjrLzhylX8vSexAY8dwCUtu5bUZzP7yG66YfpXYUoV/gVN/TOEBDacpe022w7ew2iw/8ecv0V7ZBAHlK2XGmeMZ+doaC0ZtaI6nBI3ofrbh4ikacfexfp7gjVai9VrubqGQLO2nR4hHuBPUQvV/zQqOA/f4lypTUPA75NAXy10WM23oT3kG7rWNsd3ZLzJdq3XBGdXKvvQ3DTw8QTGr2x1Pr9fLMXbRUrD+GLbGPMWqkC7xjI5yqav0h7lyvbPhgmzOvdYDR2eIu4xlQOLTzkiQ0aapGB7YfkqjYeXnP300QIfjJ1NSVM2rrVTjHbi/JMiK/H+L3u6g3HYg/JnClsqHyDnA9hRvd3ydJky5leVN4viybavigAdC3KFb3oNKneVqGHaJ5QxJKe8dxv0otkw4V4RI52KVfGrq/L/8vKaSiGaSAa3qRsGah1zuGjC8tVo8eU1efNZ+EhAsumyaQ5MPTz2OpiD0FYJk0tsgs5Qj4Ph43O9RCqPPlSwechMVZd53oItbC+xHB9yP9Jr6yHLKteWQ/hWEY9p99DpojaTed6iELUi9PRQySqoqOHIFRBRw+pmI4eAqiOZuwfl74j6FwpdBsAAAAASUVORK5CYII=
with:
import base64
x = """iVBORw0KGgoAAAANSUhEUgAAAGQAAAAyCAYAAACqNX6+AAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAASwSURBVHhe7VoxaxsxFPYvye5sJZC5JRkcSoZCoMSFDIVOGWpSQ6FDqLN5CqGBEnAIoe7sKTSbF2fqmiWbMWTLkl+gPp2k89P56U7Syc7Z1Qdf6+9OTz70qvednlv78efwNWPsuyLo9HPUi9c1/sf57SH8NUXUL6eThPAPISfliNpPpwnhCDWpQtTumntImhCOEJNiRO2mo4cAqqSjh0hURUcPQaiCjh5SMR09BFAlHT1Eoio6eghCFXT0kIrp6CGAl9DDdo3VagThfkkPeWBntRP21okX7Oja9/sE3PSEXe7U2e/19Vl2RsmIsN9n0kPWppKACaPKechwQCx4ARsj9qjiEdz1R3bfnF3oO7HGcP89u6sbEpEhjyn/PGZt3BFZlvWQ4z1iwQvY6j3JaPfvw3rSIRa7vgX7QWHEriwTwjmfpBywXoNYeBMhpoSH4HI1YH/lVf/5BAr16Ra5oAmhBE3Hj8QO2dllz/IKR3p/1NFjm/1knPPzGPU4JxltKGAC57fvpqUMtL+H4HK190XeEfCaD0HT2YUzsd5k93J7WM+P56530t1lHS9BaWOZag9nxp+01sQ9+OztIbhcnUG6XeNtNVmaKJp2ggStJ1Mfqm+mCeWwi59C08P2bCIS4p1BjPf3kEWVK3tjvulPPOZHLwawwy77rvGUNpeq7a4hXiUEPvp5yCqUqwTI+GXJcosndPfVTCIExe6g4se9hhgD2stDFlWuFKYalRhMr3IFmPTZjUqIPJNwWMdLYJ3nHQp6/Jh92JBjQHl4iM1hcMCOrecTsNJ4ARH9ytWh5k9Xp+7xGEKby1W9dSAGSqTx4x5rqHFeHuJwGNxvfZaRAlbzI2T1z2+bWiIS+pYrXA6DndjRK2yWxNsV12m54oRrjh7yxI426MWnGbJNErBc4Z2mHSYt4xE0Df/a6zgJGfKqlTseLrl5yHjEWrDQxn/5xO7hJ3Pr+SVIHapcwTy/0mSI3eUUDzBri35VHmEGDw8RMGqZNJwQjrLzhylX8vSexAY8dwCUtu5bUZzP7yG66YfpXYUoV/gVN/TOEBDacpe022w7ew2iw/8ecv0V7ZBAHlK2XGmeMZ+doaC0ZtaI6nBI3ofrbh4ikacfexfp7gjVai9VrubqGQLO2nR4hHuBPUQvV/zQqOA/f4lypTUPA75NAXy10WM23oT3kG7rWNsd3ZLzJdq3XBGdXKvvQ3DTw8QTGr2x1Pr9fLMXbRUrD+GLbGPMWqkC7xjI5yqav0h7lyvbPhgmzOvdYDR2eIu4xlQOLTzkiQ0aapGB7YfkqjYeXnP300QIfjJ1NSVM2rrVTjHbi/JMiK/H+L3u6g3HYg/JnClsqHyDnA9hRvd3ydJky5leVN4viybavigAdC3KFb3oNKneVqGHaJ5QxJKe8dxv0otkw4V4RI52KVfGrq/L/8vKaSiGaSAa3qRsGah1zuGjC8tVo8eU1efNZ+EhAsumyaQ5MPTz2OpiD0FYJk0tsgs5Qj4Ph43O9RCqPPlSwechMVZd53oItbC+xHB9yP9Jr6yHLKteWQ/hWEY9p99DpojaTed6iELUi9PRQySqoqOHIFRBRw+pmI4eAqiOZuwfl74j6FwpdBsAAAAASUVORK5CYII="""
print(base64.b64decode(x))
I also tried it with encode("ascii") applied to x:
import base64
x = """iVBORw0KGgoAAAANSUhEUgAAAGQAAAAyCAYAAACqNX6+AAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAASwSURBVHhe7VoxaxsxFPYvye5sJZC5JRkcSoZCoMSFDIVOGWpSQ6FDqLN5CqGBEnAIoe7sKTSbF2fqmiWbMWTLkl+gPp2k89P56U7Syc7Z1Qdf6+9OTz70qvednlv78efwNWPsuyLo9HPUi9c1/sf57SH8NUXUL6eThPAPISfliNpPpwnhCDWpQtTumntImhCOEJNiRO2mo4cAqqSjh0hURUcPQaiCjh5SMR09BFAlHT1Eoio6eghCFXT0kIrp6CGAl9DDdo3VagThfkkPeWBntRP21okX7Oja9/sE3PSEXe7U2e/19Vl2RsmIsN9n0kPWppKACaPKechwQCx4ARsj9qjiEdz1R3bfnF3oO7HGcP89u6sbEpEhjyn/PGZt3BFZlvWQ4z1iwQvY6j3JaPfvw3rSIRa7vgX7QWHEriwTwjmfpBywXoNYeBMhpoSH4HI1YH/lVf/5BAr16Ra5oAmhBE3Hj8QO2dllz/IKR3p/1NFjm/1knPPzGPU4JxltKGAC57fvpqUMtL+H4HK190XeEfCaD0HT2YUzsd5k93J7WM+P56530t1lHS9BaWOZag9nxp+01sQ9+OztIbhcnUG6XeNtNVmaKJp2ggStJ1Mfqm+mCeWwi59C08P2bCIS4p1BjPf3kEWVK3tjvulPPOZHLwawwy77rvGUNpeq7a4hXiUEPvp5yCqUqwTI+GXJcosndPfVTCIExe6g4se9hhgD2stDFlWuFKYalRhMr3IFmPTZjUqIPJNwWMdLYJ3nHQp6/Jh92JBjQHl4iM1hcMCOrecTsNJ4ARH9ytWh5k9Xp+7xGEKby1W9dSAGSqTx4x5rqHFeHuJwGNxvfZaRAlbzI2T1z2+bWiIS+pYrXA6DndjRK2yWxNsV12m54oRrjh7yxI426MWnGbJNErBc4Z2mHSYt4xE0Df/a6zgJGfKqlTseLrl5yHjEWrDQxn/5xO7hJ3Pr+SVIHapcwTy/0mSI3eUUDzBri35VHmEGDw8RMGqZNJwQjrLzhylX8vSexAY8dwCUtu5bUZzP7yG66YfpXYUoV/gVN/TOEBDacpe022w7ew2iw/8ecv0V7ZBAHlK2XGmeMZ+doaC0ZtaI6nBI3ofrbh4ikacfexfp7gjVai9VrubqGQLO2nR4hHuBPUQvV/zQqOA/f4lypTUPA75NAXy10WM23oT3kG7rWNsd3ZLzJdq3XBGdXKvvQ3DTw8QTGr2x1Pr9fLMXbRUrD+GLbGPMWqkC7xjI5yqav0h7lyvbPhgmzOvdYDR2eIu4xlQOLTzkiQ0aapGB7YfkqjYeXnP300QIfjJ1NSVM2rrVTjHbi/JMiK/H+L3u6g3HYg/JnClsqHyDnA9hRvd3ydJky5leVN4viybavigAdC3KFb3oNKneVqGHaJ5QxJKe8dxv0otkw4V4RI52KVfGrq/L/8vKaSiGaSAa3qRsGah1zuGjC8tVo8eU1efNZ+EhAsumyaQ5MPTz2OpiD0FYJk0tsgs5Qj4Ph43O9RCqPPlSwechMVZd53oItbC+xHB9yP9Jr6yHLKteWQ/hWEY9p99DpojaTed6iELUi9PRQySqoqOHIFRBRw+pmI4eAqiOZuwfl74j6FwpdBsAAAAASUVORK5CYII=""".encode("ascii")
print(base64.b64decode(x))
I saved the result both as a PNG image and as a JPG.
The issue is that the files didn't open, and I don't understand how the whole thing works. I watched some tutorials, but I failed to display the image on my computer.
Try writing the decoded bytes of this string to the image file in binary mode, like this:
with open('img.png', 'wb') as f:
    f.write(base64.b64decode(x))
You could do as follows:
import base64
image_64_decode = base64.b64decode(x)  # base64.decodestring was removed in Python 3.9
# open the target file in binary mode and write the decoded bytes
with open('filename.png', 'wb') as image_result:
    image_result.write(image_64_decode)
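Two things are easy to trip over with strings like the one in the question. The `data:image/jpeg;base64,` header is not part of the payload and has to be stripped before decoding. The declared type can also be wrong: the payload here starts with `iVBORw0KGgo`, which is the base64 form of the PNG signature, even though the header says JPEG. A minimal sketch that handles both is below; `decode_data_uri` is a made-up helper name, and only PNG and JPEG signatures are checked.

```python
import base64
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def decode_data_uri(text: str) -> Tuple[bytes, str]:
    """Return (raw bytes, suggested extension) for a data URI or bare payload."""
    # Drop an optional "data:<mime>;base64," header; keep only the payload.
    if text.startswith("data:"):
        _, _, text = text.partition(",")
    raw = base64.b64decode(text)
    # Trust the file signature rather than the declared MIME type.
    if raw.startswith(PNG_SIGNATURE):
        ext = ".png"
    elif raw.startswith(JPEG_SIGNATURE):
        ext = ".jpg"
    else:
        ext = ".bin"
    return raw, ext
```

Usage would be along the lines of `raw, ext = decode_data_uri(x)` followed by writing `raw` to `'img' + ext` in `'wb'` mode.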
I would like to extract text from scanned PDFs.
My "test" code is as follows:
from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image
converted_scan = convert_from_path('test.pdf', 500)
for i in converted_scan:
    i.save('scan_image.png', 'png')
text = image_to_string(Image.open('scan_image.png'))
with open('scan_text_output.txt', 'w') as outfile:
    outfile.write(text.replace('\n\n', '\n'))
I would like to know if there is a way to extract the content of the image directly from the object converted_scan, without saving the scan as a new "physical" image file on the disk?
Basically, I would like to skip this part:
for i in converted_scan:
    i.save('scan_image.png', 'png')
I have a few thousand scans to extract text from. Although the generated image files are not particularly large, the overhead is not negligible, and I find it a bit overkill.
EDIT
Here's a slightly different, more compact approach than Colonder's answer, based on this post. For .pdf files with many pages, it might be worth adding a progress bar to each loop using e.g. the tqdm module.
from wand.image import Image as w_img
from PIL import Image as p_img
import pyocr.builders
import regex, pyocr, io
infile = 'my_file.pdf'
tool = pyocr.get_available_tools()[0]  # first available OCR tool
req_image = []
txt = ''
# to convert pdf to img and extract text
with w_img(filename = infile, resolution = 200) as scan:
    image_png = scan.convert('png')
    for i in image_png.sequence:
        img_page = w_img(image = i)
        req_image.append(img_page.make_blob('png'))
for i in req_image:
    content = tool.image_to_string(
        p_img.open(io.BytesIO(i)),
        lang = tool.get_available_languages()[0],
        builder = pyocr.builders.TextBuilder()
    )
    txt += content
# to save the output as a .txt file
with open(infile[:-4] + '.txt', 'w') as outfile:
    full_txt = regex.sub(r'\n+', '\n', txt)
    outfile.write(full_txt)
UPDATE MAY 2021
I realized that although pdf2image simply calls a subprocess, one doesn't have to save the images in order to OCR them afterwards. You can simply do the following (pytesseract works as the OCR library as well):
from pdf2image import convert_from_path
for img in convert_from_path("some_pdf.pdf", 300):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
EDIT: you can also try the pdftotext library.
pdf2image is a simple wrapper around pdftoppm and pdftocairo. Internally it does nothing more than call subprocess. This script should do what you want, but you need the wand library as well as pyocr (I think this is a matter of preference, so feel free to use any text-extraction library you want).
from PIL import Image as Pimage, ImageDraw
from wand.image import Image as Wimage
import sys
import numpy as np
from io import BytesIO
import pyocr
import pyocr.builders
def _convert_pdf2jpg(in_file_path: str, resolution: int=300) -> "Iterator[Pimage.Image]":
    """
    Convert PDF file to JPG, one page at a time
    :param in_file_path: path of pdf file to convert
    :param resolution: resolution with which to read the PDF file
    :return: generator yielding one PIL Image per page
    """
    with Wimage(filename=in_file_path, resolution=resolution).convert("jpg") as all_pages:
        for page in all_pages.sequence:
            with Wimage(page) as single_page_image:
                # transform wand image to bytes in order to transform it into PIL image
                yield Pimage.open(BytesIO(bytearray(single_page_image.make_blob(format="jpeg"))))
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
for img in _convert_pdf2jpg("some_pdf.pdf"):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
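Note that the loops above overwrite `txt` on every page, so for a multi-page scan you would probably want to collect the per-page text. Here is a small sketch of that, kept independent of any particular OCR library: `pages_to_text` is a made-up helper, and the OCR function is passed in as a parameter.

```python
import re
from typing import Any, Callable, Iterable


def pages_to_text(pages: Iterable[Any], ocr: Callable[[Any], str]) -> str:
    """Run `ocr` on each in-memory page image and join the results."""
    chunks = [ocr(page) for page in pages]
    # Collapse runs of newlines into one, similar to the clean-up in the snippets above.
    return re.sub(r"\n+", "\n", "\n".join(chunks)).strip()
```

With the libraries from the question, this would presumably be called as `pages_to_text(convert_from_path('test.pdf', 500), image_to_string)`, since pytesseract's `image_to_string` accepts PIL images directly and nothing needs to be written to disk.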