For machine learning purposes (scikit-learn), I need to extract the raw text from lots of PDF files. At first, I was using xpdf's pdftotext to do this:
import os
import subprocess

exe = os.path.join(xpdf_path, "pdftotext.exe")
# Passing a list avoids manual quoting of paths that contain spaces
subprocess.check_output([exe, pdf, pdf + ".txt"])
with open(pdf + ".txt") as f:
    texto_converted = f.read()
But unfortunately, for a few of them, I was unable to get the text because they use "stream" objects in their PDF source, like this one.
The result is something like this:
59!"#$%&'()*+,-.#/#01"21"" 345667.0*(879:4$;<;4=<6>4?$#"12!/ 21#$#A$3A$>#>BCDCEFGCHIJKIJLMNIJILOCNPQRDS QPFTRPUCTCVQWBCTTQXFPYTO"21 "#/!"#(Z[12\&A+],$3^_3;9`Z &a# .2"#.b#"(#c#A(87*95d$d4?$d3e#Z"f#\"#2b?2"#`Z 2"!eb2"#H1TBRgF JhiO
jFK# 2"k#`Z !#212##"elf/e21m#*c!n2!!#/bZ!#2#`Z "eo ]$5<$#;A533> "/\ko/f\#e#e#p
I even tried using zlib + regex:
import re
import zlib

pdf = open("pdfa.pdf", "rb").read()
stream = re.compile(b'.*?FlateDecode.*?stream(.*?)endstream', re.S)
for s in re.findall(stream, pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s).decode('UTF-8'))
        print("")
    except (zlib.error, UnicodeDecodeError):
        # skip streams that do not inflate or are not UTF-8 text
        pass
The result was something like this:
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
I even tried pdftopng (xpdf) so I could run tesseract on the result, without success.
So, is there any way to extract pure text from a PDF like that using Python or a third-party app?
If you want to decompress the streams in a PDF file, I can recommend using qpdf, but on this file
qpdf --decrypt --stream-data=uncompress document.pdf out.pdf
doesn't help either.
I am not sure, though, why your efforts with xpdf and tesseract did not work out. Using ImageMagick's convert to create PNG files in a temporary directory, and then tesseract, you can do:
import os
from pathlib import Path
from tempfile import TemporaryDirectory
import subprocess

DPI = 600

def call(*args):
    cmd = [str(x) for x in args]
    # print(cmd)
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode('utf-8')

def ocr(docpath, lang):
    result = []
    abs_path = Path(docpath).expanduser().resolve()
    old_dir = os.getcwd()
    out = Path('out.txt')
    with TemporaryDirectory() as tmpdir:
        os.chdir(tmpdir)
        call('convert', '-density', DPI, abs_path, 'out.png')
        index = -1
        while True:
            # names have no leading zeros on the digits, would be difficult to sort glob() output
            # so just count them
            index += 1
            png = Path(f'out-{index}.png')
            if not png.exists():
                break
            call('tesseract', '--dpi', DPI, png, out.stem, '-l', lang)
            result.append(out.read_text())
        os.chdir(old_dir)
    return result

pages = ocr('~/Downloads/document.pdf', 'por')
print('\n'.join(pages[1].splitlines()[21:24]))
which gives:
DA NÃO REALIZAÇÃO DE AUDIÊNCIA DE AUTOCOMPOSIÇÃO NO CASO EM CONCRETO
Com vista a obter maior celeridade processual, assim como da impossibilidade de conciliação entre
If you are on Windows, make sure your PDF file is not open in a different process (like a PDF viewer), as Windows doesn't seem to like that.
The final print is limited as the full output is quite large.
This converting and OCR-ing takes a while so you might want to uncomment the print in call() to get some sense of progress.
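The counting loop in ocr() exists because names such as out-10.png sort before out-2.png as plain strings. As an alternative sketch, the glob() output can be sorted by the page number embedded in the name. The helper names page_number and sorted_pages are made up for this illustration, and the out-N.png naming is assumed to match what convert produced above.

```python
import re
from pathlib import Path

def page_number(path):
    """Return the integer N from a name like 'out-N.png', or -1 if absent."""
    match = re.search(r"-(\d+)\.png$", Path(path).name)
    return int(match.group(1)) if match else -1

def sorted_pages(directory):
    """List the out-*.png files of a directory in numeric page order."""
    return sorted(Path(directory).glob("out-*.png"), key=page_number)
```

Inside ocr(), iterating over sorted_pages('.') would then replace the while loop, at the cost of one extra helper.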
There are two fairly simple techniques you can use.
1) Google's "Tesseract" open source OCR (optical character recognition). You could apply this evenly to all PDFs, though converting all that data into pixels and then working magic upon them is going to be more computationally expensive. Which is more important, engineer time or CPU time? There's a pytesseract module. Note that this tool works on image formats, so you'd have to use something like Ghostscript (another open source project) to convert all of a PDF's pages to images, then run [py]tesseract on those images.
2) pyPDF can get each page and programmatically extract any text draw operations in the order they were drawn onto the page. This may be nothing like the logical reading order of the page... While a PDF could draw all the 'a's and then all the 'b's (and so forth), it's actually more efficient to draw everything in "font a", then everything in "font b". It's important to note that "font b" might just be the italic version of "font a". This produces a shorter/more efficient stream of drawing commands, though probably not by such an amount as to be a good business decision to do so.
The kicker here is that a random pile of PDF files might require you to do some OCR. A poorly assembled PDF (one with a font subset that has no "to unicode" data) can't be properly mined for text even though it has nothing but text drawing operations. "Draw glyphs one through five from font C" doesn't mean much if you don't know that those first five glyphs are "g-l-y-p-h", because that's the order they were used in.
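To make the missing "to unicode" problem concrete, here is a toy sketch in plain Python. It is not a PDF parser: the glyph numbering scheme and the function names are invented for illustration, and real subset fonts are more involved.

```python
def build_subset(text):
    """Toy subset font: number glyphs 1, 2, 3... in order of first use."""
    glyph_ids = {}
    for ch in text:
        glyph_ids.setdefault(ch, len(glyph_ids) + 1)
    return glyph_ids

def draw_ops(text, glyph_ids):
    """The 'drawing commands': just a list of glyph numbers."""
    return [glyph_ids[ch] for ch in text]

def extract_text(ops, to_unicode=None):
    """Recover text from glyph numbers; without a table only numbers remain."""
    if to_unicode is None:
        return " ".join(f"<{g}>" for g in ops)
    return "".join(to_unicode[g] for g in ops)

glyph_ids = build_subset("glyph")
ops = draw_ops("glyph", glyph_ids)
to_unicode = {g: ch for ch, g in glyph_ids.items()}

print(extract_text(ops))              # <1> <2> <3> <4> <5>
print(extract_text(ops, to_unicode))  # glyph
```

When the table is absent, the page still renders correctly, because the font file knows which shape belongs to glyph 3. Only the mapping back to characters is lost, which is why OCR becomes the fallback.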
On the other hand, if you've got home-grown PDFs or all your pdfs are from some known source (Word's pdf converter for example), you'll know what to expect in advance.
Note that the only thing mentioned above that I've actually used is Ghostscript. I remember it having a solid command line interface we used to generate images for some online PDF viewer Many Years Ago.
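Since OCR is the expensive path, one option is to try plain text extraction first and fall back to OCR only when the result looks like garbage. The sketch below is a crude heuristic under stated assumptions: the 0.6 threshold and the "three or more letters" rule are arbitrary choices, and the function names are hypothetical. It can be fooled by garbage that happens to consist of long letter runs.

```python
import re

def looks_like_text(extracted, min_ratio=0.6):
    """Share of characters that sit inside runs of 3+ letters (assumed rule)."""
    stripped = extracted.strip()
    if not stripped:
        return False
    wordish = sum(len(w) for w in re.findall(r"[^\W\d_]{3,}", stripped))
    return wordish / len(stripped) >= min_ratio

def choose_extractor(extracted):
    """Return 'text' when the extraction seems usable, otherwise 'ocr'."""
    return "text" if looks_like_text(extracted) else "ocr"
```

A threshold like this is best tuned on a handful of your own files, in the language of your documents, before trusting it on the whole pile.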
I'm trying to convert an image to ZPL and then print it on a 6.5 x 4 cm label with a TLP 2844 Zebra printer from Python.
My main problems are:
1. Converting the image
2. Printing from Python to the Zebra queue (I've honestly tried all the obvious printing packages like zebra0.5 / win32 print / ZPL...)
Any help would be appreciated.
I had the same issue some weeks ago. I made a Python script specifically for this printer, with some fields available. I commented out (#) what does not apply to your case, but left it in as you may find it helpful.
I also recommend that you set your printer to the EPL2 driver and a 5 cm/s print speed. With this script you'll get the PNG previews with an EAN13 formatted barcode. (If you need other formats, you might need to check the zpl module docs.)
Please bear in mind that if you print with the TLP 2844, you will either need to use their paid software, or you will need to manually configure the whole printer.
import os
import zpl
#import pandas
#df= pandas.read_excel("Datos.xlsx")
#a=pandas.Series(df.GTIN.values,index=df.FINAL).to_dict()
# Placeholder data so the loop runs without the spreadsheet: article name -> GTIN
a = {"ARTICLE": "123456789012"}
lista = []  # collects the generated ZPL, one entry per label
for elem in a:
    l = zpl.Label(15,25.5)
    height = 0
    l.origin(3,1)
    l.write_text("CUIT: 30-11111111-7", char_height=2, char_width=2, line_width=40)
    l.endorigin()
    l.origin(2,5)
    l.write_text("Art:", char_height=2, char_width=2, line_width=40)
    l.endorigin()
    l.origin(5.5,4)
    l.write_text(elem, char_height=3, char_width=2.5, line_width=40)
    l.endorigin()
    l.origin(2, 7)
    l.write_barcode(height=40, barcode_type='2', check_digit='N')
    l.write_text(a[elem])
    l.endorigin()
    height += 8
    l.origin(8.5, 13)
    l.write_text('WILL S.A.', char_height=2, char_width=2, line_width=40)
    l.endorigin()
    print(l.dumpZPL())
    lista.append(l.dumpZPL())
    l.preview()
To print the previews without having to watch and confirm each preview, I ended up modifying the ZPL preview method, to return an IO variable so I can save it to a file.
import io
from PIL import Image

fake_file = io.BytesIO(l.preview())
img = Image.open(fake_file)
img.save('tags/'+'name'+'.png')
On the Label.py from ZPL module (preview method):
#Image.open(io.BytesIO(res)).show(). <---- comment out the show, but add the return of the BytesIO
return res
I had similar issues and created a .NET Core application which takes an image and converts it to ZPL, either to a file or to the console so it's pipeable in bash scripts. You could package it with your Python app and call it as a subprocess like so:
output = subprocess.Popen(["zplify", "path/to/file.png"], stdout=subprocess.PIPE).communicate()[0]
Or feel free to use my code as a reference point and implement it in python.
Once you have a ZPL file or stream, you can send it directly to a printer using lpr if you're on Linux. If you're on Windows, you can connect to a printer using its IP address as shown in this Stack Overflow question.
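For a network-attached printer, sending ZPL over its IP address can be sketched with the standard socket module. This assumes the printer accepts raw print jobs over TCP on port 9100, which is a common default but should be checked against your printer's settings. The function name send_zpl and the address in the usage comment are hypothetical.

```python
import socket

def send_zpl(host, zpl, port=9100, timeout=5.0):
    """Send raw ZPL to a printer over TCP and return the number of bytes sent.

    Port 9100 is an assumption (the usual raw-print port); adjust as needed.
    """
    data = zpl.encode("utf-8") if isinstance(zpl, str) else zpl
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(data)
    return len(data)

# Usage (hypothetical address):
# send_zpl("192.168.1.50", "^XA^FO50,50^ADN,36,20^FDHello^FS^XZ")
```

This bypasses the print queue entirely, so nothing on the machine needs a Zebra driver, but it only works for printers reachable over the network, not for USB-only setups.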
For what it's worth, and for anyone else's reference: I was facing a similar situation and came up with a solution. To whom it may help:
Converting the image?
After trying many libraries, I came across ZPLGRF. Although the demo seems to focus on PDF only, I found in the source that there is a from_image() class method that can convert an image to ZPL when combined with part of the demo/examples. Full code below.
Printing from Python to the Zebra queue?
Again there are many libraries, but I settled on ZEBRA, which seems to be the most straightforward one for sending raw ZPL to a Zebra printer.
CODE
from zplgrf import GRF
from zebra import Zebra
#Open the image file and generate ZPL from it
with open(path_to_your_image, 'rb') as img:
    grf = GRF.from_image(img.read(), 'LABEL')
grf.optimise_barcodes()
#to_zpl is a method, so it has to be called
zpl_code = grf.to_zpl()
#Setup and print to Zebra Printer
z = Zebra()
#This will return all the printer queues on a given machine as a list
#['printer1', 'printer2', ...]
z.getqueues()
#Once you know the printer queue name, you can set it with
z.setqueue('printer1')
#And now it is ready to send the raw ZPL text
z.output(zpl_code)
I have tested the above successfully with a Zebra GX430t printer connected via USB to a Windows 11 machine.
Hope it helps.
I have been working on setting up a PDF conversion-to-png and cropping script with Python 3.6.3 and the wand library.
I tried Pillow, but it's lacking the conversion part. I am experimenting with extracting the alpha channel because I want to feed the images to an OCR, at a later point, so I turned to trying the code provided in this SO answer.
A couple of issues came out: the first is that if the file is large, I get a "Killed" message from the terminal. The second is that it seems rather picky with the file, i.e. files that get converted properly by imagemagick's convert or pdftoppm in the command line, raise errors with wand.
I am mostly concerned with the first one though, and would really appreciate a check from more knowledgeable coders. I suspect it might come from the way the loop is structured:
from wand.image import Image
from wand.color import Color

def convert_pdf(filename, path, resolution=300):
    all_pages = Image(filename=path+filename, resolution=resolution)
    for i, page in enumerate(all_pages.sequence):
        with Image(page) as img:
            img.format = 'png'
            img.background_color = Color('white')
            img.alpha_channel = 'remove'
            image_filename = '{}.png'.format(i)
            img.save(filename=path+image_filename)
I noticed that the script outputs all files at the end of the process, rather than one by one, which I am guessing might put an unnecessary burden on memory and ultimately cause a SEGFAULT or something similar.
Thanks for checking out my question, and for any hints.
Yes, your line:
all_pages = Image(filename=path+filename, resolution=resolution)
will start a Ghostscript process to render the entire PDF to a huge temporary PNM file in /tmp. Wand will then load that massive file into memory and hand out pages from it as you loop.
The C API to MagickCore lets you specify which page to load, so you could perhaps render a page at a time, but I don't know how to get the Python wand interface to do that.
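One way to render a page at a time without going through wand is to shell out to pdftoppm, which the question reports as handling these files from the command line. This is a minimal sketch, assuming poppler's pdftoppm is installed and on PATH and that the caller already knows the page count (for example from pdfinfo). The function names are made up for this illustration.

```python
import subprocess
from pathlib import Path

def pdftoppm_page_cmd(pdf_path, page, out_prefix, dpi=300):
    """Build the pdftoppm argument list that renders one page as PNG."""
    return [
        "pdftoppm", "-png",
        "-r", str(dpi),      # resolution in DPI
        "-f", str(page),     # first page to render
        "-l", str(page),     # last page to render (same page)
        str(pdf_path), str(out_prefix),
    ]

def convert_pdf_page_by_page(pdf_path, out_dir, n_pages, dpi=300):
    """Render each page in its own process so memory use stays per-page."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for page in range(1, n_pages + 1):  # pdftoppm counts pages from 1
        cmd = pdftoppm_page_cmd(pdf_path, page, out_dir / "page", dpi)
        subprocess.run(cmd, check=True)
```

Starting one process per page is slower than a single render, but each process only ever holds one page, so a large document should no longer be killed for running out of memory.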
You could try pyvips. It renders PDFs incrementally by making direct calls to libpoppler, so there are no processes being started and stopped and no temporary files.
Example:
#!/usr/bin/python3
import sys
import pyvips

def convert_pdf(filename, resolution=300):
    # n is number of pages to load, -1 means load all pages
    all_pages = pyvips.Image.new_from_file(filename, dpi=resolution, n=-1,
                                           access="sequential")
    # That'll be RGBA ... flatten out the alpha
    all_pages = all_pages.flatten(background=255)
    # the PDF is loaded as a very tall, thin image, with the pages joined
    # top-to-bottom ... we loop down the image cutting out each page
    n_pages = all_pages.get("n-pages")
    page_width = all_pages.width
    # integer division, since crop() works in whole pixels
    page_height = all_pages.height // n_pages
    for i in range(0, n_pages):
        page = all_pages.crop(0, i * page_height, page_width, page_height)
        print("writing {}.tif ..".format(i))
        page.write_to_file("{}.tif".format(i))

convert_pdf(sys.argv[1])
On this 2015 laptop with this huge PDF, I see:
$ /usr/bin/time -f %M:%e ../pages.py ~/pics/Audi_US\ R8_2017-2.pdf
writing 0.tif ..
writing 1.tif ..
....
writing 20.tif ..
720788:35.95
So 35s to render the entire document at 300dpi, and a peak memory use of 720MB.
I need to generate a customized PDF copy of a template document.
The easiest way, I thought, was to create a source PDF that has some placeholder text where customization needs to happen, i.e. <first_name> and <last_name>, and then replace these with the correct values.
I've searched high and low, but is there really no way of taking the source template PDF, replacing the placeholders with actual values and writing the result to a new PDF?
I looked at PyPDF2 and ReportLab, but neither seems to be able to do so.
Any suggestions? Most of my searches lead to using a Perl app, CAM::PDF, but I'd prefer to keep it all in Python.
There is no direct way to do this that will work reliably. PDFs are not like HTML: they specify the positioning of text character-by-character. They may not even include the whole font used to render the text, just the characters needed to render the specific text in the document. No library I've found will do nice things like re-wrap paragraphs after updating the text. PDFs are for the most part a display-only format, so you'll be much better off using a tool that turns markup into a PDF than updating the PDF in-place.
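The "turn markup into a PDF" route can be sketched with the standard library alone for the templating half. The example below fills an HTML template using string.Template; the template text and the function name render_letter are invented for illustration, and the HTML-to-PDF step is deliberately left to whatever renderer you choose.

```python
from html import escape
from string import Template

# Hypothetical template; $first_name and $last_name are the placeholders.
LETTER_TEMPLATE = Template(
    "<html><body><p>Dear $first_name $last_name,</p></body></html>"
)

def render_letter(first_name, last_name):
    """Fill the placeholders, escaping values so they cannot break the markup."""
    return LETTER_TEMPLATE.substitute(
        first_name=escape(first_name),
        last_name=escape(last_name),
    )

# The returned HTML string is what you would hand to an HTML-to-PDF tool.
```

Because the layout is computed when the PDF is generated, long names re-wrap naturally, which is exactly what in-place PDF editing cannot offer.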
If that's not an option, you can create a PDF form in something like Acrobat, then use a PDF manipulation library like iText (AGPL) or pdfbox, which has a nice Clojure wrapper called pdfboxing that can handle some of that.
From my experience, Python's support for writing to PDFs is pretty limited. Java has, by far, the best language support. Also, you get what you pay for, so it would probably be worth paying for an iText license if you're using this for commercial purposes. I've had pretty good results writing Python wrappers around PDF-manipulation CLI tools like pdfboxing and Ghostscript. That will probably be much easier for your use case than trying to shoehorn this into Python's PDF ecosystem.
There is no definitive solution, but I found two solutions that work most of the time.
In Python, https://github.com/JoshData/pdf-redactor gives good results. Here is the example code:
import re
import pdf_redactor

options = pdf_redactor.RedactorOptions()
# Redact a specific name and things that look like social security numbers,
# replacing the text with X's.
options.content_filters = [
    # First replace a literal name.
    (
        re.compile(u"Tom Xavier"),
        lambda m : "XXXXXXX"
    ),
    # Then do an actual SSN regex.
    # See https://github.com/opendata/SSN-Redaction for why this regex is complicated.
    (
        re.compile(r"(?<!\d)(?!666|000|9\d{2})([OoIli0-9]{3})([\s-]?)(?!00)([OoIli0-9]{2})\2(?!0{4})([OoIli0-9]{4})(?!\d)"),
        lambda m : "XXX-XX-XXXX"
    ),
]
# Perform the redaction using PDF on standard input and writing to standard output.
pdf_redactor.redactor(options)
Full Example can be found here
In Ruby, https://github.com/gettalong/hexapdf works for blacking out text.
Example code:
require 'hexapdf'

class ShowTextProcessor < HexaPDF::Content::Processor
  def initialize(page, to_hide_arr)
    super()
    @canvas = page.canvas(type: :overlay)
    @to_hide_arr = to_hide_arr
  end

  def show_text(str)
    boxes = decode_text_with_positioning(str)
    return if boxes.string.empty?
    if @to_hide_arr.include? boxes.string
      @canvas.stroke_color(0, 0 , 0)
      boxes.each do |box|
        x, y = *box.lower_left
        tx, ty = *box.upper_right
        @canvas.rectangle(x, y, tx - x, ty - y).fill
      end
    end
  end
  alias :show_text_with_positioning :show_text
end

file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")
doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
  processor = ShowTextProcessor.new(page, strings_to_black)
  page.process_contents(processor)
end
new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)
puts "Writing updated file [#{new_file_name}]."
With this, the text is blacked out, but it is still visible when you select it.
As another solution, you may try the Aspose.PDF Cloud SDK for Python; it provides a feature to replace text in a PDF document.
First things first, install the Aspose.PDF Cloud SDK for Python:
pip install asposepdfcloud
Sample code to upload a PDF file to your cloud storage and replace multiple strings in a PDF document:
import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi
# Get App key and App SID from https://aspose.cloud
pdf_api_client = asposepdfcloud.api_client.ApiClient(
    app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxx')
pdf_api = PdfApi(pdf_api_client)
filename = '02_pages.pdf'
remote_name = '02_pages.pdf'
#upload PDF file to storage
pdf_api.upload_file(remote_name,filename)
#Replace Text
text_replace1 = asposepdfcloud.models.TextReplace(old_value='origami',new_value='aspose',regex='true')
text_replace2 = asposepdfcloud.models.TextReplace(old_value='candy',new_value='biscuit',regex='true')
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace1,text_replace2])
response = pdf_api.post_document_text_replace(remote_name, text_replace_list)
print(response)
I'm a developer evangelist at Aspose.
I'm using Adobe Acrobat Pro to extract information from PDFs in XML format. Acrobat does this particularly well. I want to extract information from about a thousand documents and do stuff with that information, so using Acrobat by hand would be annoying. Are there plugins to call Acrobat functions (i.e. save as XML) from any common language, ideally Python?
If you're on Windows, you can talk to Acrobat using DDE commands. The pyWin32 module supports DDE calls, or you could try your luck with this stand-alone binding.
But you'll have to figure out the request to send to Acrobat. (here's some random documentation, but it doesn't mention XML). It seems that the commands change from version to version, (or at least some things break), so keep an eye on the version. Good luck.
Maybe you could take a look at pyPdf? It lets you work with PDF files from Python. PDFMiner also allows extracting XML from PDFs. I know Perl can do it because I have used it myself; here is the reference to the module CAM::PDF.
Example:
from pyPdf import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input1 = PdfFileReader(file("document1.pdf", "rb"))
# print the title of document1.pdf
print "title = %s" % (input1.getDocumentInfo().title)
# add page 1 from input1 to output document, unchanged
output.addPage(input1.getPage(0))
# add page 2 from input1, but rotated clockwise 90 degrees
output.addPage(input1.getPage(1).rotateClockwise(90))
# add page 3 from input1, rotated the other way:
output.addPage(input1.getPage(2).rotateCounterClockwise(90))
# alt: output.addPage(input1.getPage(2).rotateClockwise(270))
# add page 4 from input1, but first add a watermark from another pdf:
page4 = input1.getPage(3)
watermark = PdfFileReader(file("watermark.pdf", "rb"))
page4.mergePage(watermark.getPage(0))
# add page 5 from input1, but crop it to half size:
page5 = input1.getPage(4)
page5.mediaBox.upperRight = (
    page5.mediaBox.getUpperRight_x() / 2,
    page5.mediaBox.getUpperRight_y() / 2
)
output.addPage(page5)
# print how many pages input1 has:
print "document1.pdf has %s pages." % input1.getNumPages()
# finally, write "output" to document-output.pdf
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()
Also take a look at this question: python and pyPdf - how to extract text from the pages so that there are spaces between lines. It describes XML parsing and such in PDFs.
I am generating an SVG image in Python (pure, no external libs yet). I want to know what size a text element will be before I place it. Any good ideas on how to do that? I checked the pysvg library but saw nothing like getTextSize().
This can be pretty complicated. To start with, you'll have to familiarize yourself with the chapter on text of the SVG specification. Assuming you want to get the width of plain text elements, and not textpath elements, at a minimum you'd have to:
Parse the font selection properties, spacing properties and read the xml:space attribute, as well as the writing-mode property (can also be top-bottom instead of just left-to-right and right-to-left).
Based on the above, open the correct font, read the glyph data and extract the widths and heights of the glyphs in your text string. Finding the font alone can be a big task, given the many places where font files can hide.
(optionally) Look through the string for possible ligatures (depending on the language), and replace them with the correct glyph if it exists in the font.
Add the widths for all the characters and spaces, the latter depending on the spacing properties and (optionally) possible kerning pairs.
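The last step above, summing advance widths plus spacing and kerning, can be sketched as follows. This is a simplification under stated assumptions: the advance and kerning tables are hypothetical inputs in em units that you would have to read from the real font, letter spacing is applied only between characters, and ligatures and writing modes are ignored.

```python
def text_width(text, advances, kerning=None, letter_spacing=0.0,
               font_size=1.0, default_advance=0.5):
    """Estimate the width of a left-to-right string.

    advances: dict mapping a character to its advance width in em units.
    kerning:  dict mapping (left, right) character pairs to an adjustment in em.
    default_advance is an arbitrary fallback for characters not in the table.
    """
    kerning = kerning or {}
    total = sum(advances.get(ch, default_advance) for ch in text)
    total += sum(kerning.get(pair, 0.0) for pair in zip(text, text[1:]))
    total += letter_spacing * max(len(text) - 1, 0)
    return total * font_size

# Made-up metrics for two glyphs and one kerning pair:
advances = {"A": 0.75, "V": 0.75}
kerning = {("A", "V"): -0.125}
print(text_width("AV", advances, kerning, font_size=16))  # 22.0
```

The arithmetic is the easy part. Getting trustworthy advances and kerning pairs out of the font file is where a library such as pango earns its keep.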
A possible solution would be to use the pango library. You can find Python bindings for it in py-gtk. Unfortunately, apart from some examples, the documentation for the Python bindings is pretty scarce. But it would take care of the details of font loading and determining the extents of a Layout.
Another way is to study the SVG renderer in your browser. But e.g. the support for SVG text in Firefox is limited.
Also instructive is to study how TeX does it, especially the concept (pdf) of boxes (for letters) and glue (for spacing).
I had this exact same problem, but I had a variable-width font. I solved it by taking the text element (correct font and content) I wanted, writing it to an SVG file, and using Inkscape installed on my PC to render the drawing to a temporary PNG file. I then read back the dimensions of the PNG file (extracted from the header), removed the temporary SVG and PNG files and used the result to place the text where I wanted, with the other elements around it.
I found that rendering to a drawing, using a DPI of 90 seemed to give me the exact numbers I needed, or the native numbers used in svgwrite as a whole. -D is the flag to use so that only the drawable element, i.e. the text, is rendered.
import os
os.system(r'/cygdrive/c/Program\ Files\ \(x86\)/Inkscape/inkscape.exe -f work_temp.svg -e work_temp.png -d 90 -D')
I used these functions to extract the png numbers, found at this link, note mine is corrected slightly for python3 (still working in python2)
import struct

def is_png(data):
    return data[:8] == b'\x89PNG\r\n\x1a\n' and data[12:16] == b'IHDR'

def get_image_info(data):
    if is_png(data):
        # width and height are two big-endian 32-bit integers in the IHDR chunk
        w, h = struct.unpack('>LL', data[16:24])
        width = int(w)
        height = int(h)
    else:
        raise Exception('not a png image')
    return width, height

if __name__ == '__main__':
    with open('foo.png', 'rb') as f:
        data = f.read()
    print(is_png(data))
    print(get_image_info(data))
It's clunky, but it worked