I am using from google.appengine.api import conversion to convert an HTML email into a PDF file. The code is below and it works.
However, the width of the PDF document slices off the right-hand side of my document.
Any clues how to fix this?
asset = conversion.Asset("text/html", message.html, "test.html")
conversion_obj = conversion.Conversion(asset, "application/pdf")
result = conversion.convert(conversion_obj)
if result.assets:
    for asset in result.assets:
        # Attach the converted PDF, reusing the email file name with a .pdf extension.
        pdf_name = BnPresets.email_filename[0:BnPresets.email_filename.find('.')] + ".pdf"
        message.attachments = message.attachments + [(pdf_name, asset.data)]
Unfortunately it looks like you cannot control the width (or dimensions) of converted PDF documents. Seems like you can only do this with .png images. Some extra conversion options:
The width of output images (*.png only)
A specific page or page range to output to an image (*.png only)
The language of source text for OCR operations (*.txt, *.html, and *.pdf only)
Note that one way around this could be to convert your HTML page to a PNG image (with the correct width), and then re-convert that PNG image to a PDF document. I wouldn't advise using this method, because you would end up making two API calls per conversion, which gets expensive really quickly.
A better way would be to structure the dimensions of your input HTML page properly, such that one-off conversions to PDF documents come out pretty much as intended.
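For example, a minimal sketch of that idea: constrain the width of the input HTML itself before handing it to the converter. The 7.5in content width here is an assumption (roughly US Letter minus margins), not something the API prescribes:
# Wrap the email body in a fixed-width container so the converter
# lays it out at a known width; 7.5in is an assumed target width.
wrapped_html = ('<html><body style="width: 7.5in; margin: 0;">'
                + message.html + '</body></html>')
asset = conversion.Asset("text/html", wrapped_html, "test.html")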
Related
I am trying to convert a PDF to an image so I can OCR it, but the quality is being degraded during the conversion.
There seem to be two main methods for converting a PDF to an image (JPG/PNG) with Python: pdf2image and ImageMagick/Wand.
# pdf2image (altering dpi to 300/600 etc does not seem to make a difference):
from pdf2image import convert_from_path

pages = convert_from_path("page.pdf", dpi=300)
for page in pages:
    page.save("page.jpg", 'JPEG')

# ImageMagick (Wand lib)
from wand.image import Image

with Image(filename="page.pdf", resolution=300) as img:
    img.compression_quality = 100
    img.save(filename="page.jpg")
But if I simply take a screenshot of the PDF on a Mac, the quality is higher than using either Python conversion method.
A good way to see this is to run Tesseract OCR on the resulting images - both Python methods give average results, whereas the screenshot gives perfect results. (I've tried both PNG and JPG.)
Assume I have infinite time, computing power and storage space. I am only interested in image quality and OCR output. It's frustrating to have the perfect image just within reach, but not be able to generate it in code.
What is going on here? Is there a better way to convert a PDF? Is there a way I can get more direct control? Why is a screenshot doing such a better job than an actual conversion?
You can use PyMuPDF and set the dpi you want:
import fitz  # PyMuPDF

doc = fitz.open('some/pdf/path')
page = doc.load_page(0)  # first page
pixmap = page.get_pixmap(dpi=300)  # render at the dpi you want
img = pixmap.tobytes()  # PNG-encoded bytes by default
# Continue with whatever logic...
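If the result is going straight to Tesseract, it may also help to write the pixmap to a lossless PNG rather than re-encoding as JPEG (Pixmap.save is part of PyMuPDF's API), avoiding the compression artifacts mentioned in the question:
pixmap.save('page.png')  # lossless PNG, no JPEG compression artifacts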
I am working on an OCR problem for bank receipts and I need to extract details like the date and account number. After processing the input, I am using Tesseract-OCR (via pytesseract in Python). I have obtained the hOCR output file, however I am not able to make sense of it. How do we extract information from the hOCR output file? Note that the receipt has numbers filled in boxes, like normal forms.
I used the code below for extraction. Should I use a different encoding?
import os

if os.path.isfile('output.hocr'):
    fp = open('output.hocr', 'r', encoding='UTF-8')
    text = fp.read()
    fp.close()
Note: the attached image is one example of the data. These images come from PDF files which I am converting programmatically into images.
I personally would use something like Tesseract to do the OCR, and then perhaps OpenCV with SURF for the tick boxes...
Or even do edge detection with OpenCV and SURF for each section, then OCR that specific area. Analyzing a specific region rather than the whole document makes it more robust.
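A minimal sketch of that crop-then-OCR idea with OpenCV and pytesseract; the crop coordinates and the thresholding choice are assumptions for illustration, not values taken from the question:
import cv2
import pytesseract

# Load the receipt in grayscale.
img = cv2.imread('receipt.jpg', cv2.IMREAD_GRAYSCALE)
# Crop the region containing, say, the date box; these coordinates are hypothetical.
x, y, w, h = 50, 100, 300, 60
region = img[y:y + h, x:x + w]
# Binarize with Otsu's method to help Tesseract with boxed characters.
_, binary = cv2.threshold(region, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# --psm 7 tells Tesseract to treat the crop as a single line of text.
print(pytesseract.image_to_string(binary, config='--psm 7'))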
You can simply provide the image as input, instead of processing and creating an HOCR output file.
Try:
from PIL import Image
import pytesseract

im = Image.open("reciept.jpg")
text = pytesseract.image_to_string(im, lang='eng')
print(text)
This program takes the location of the image to be run through OCR, extracts the text from it, stores it in the variable text, and prints it out. If you want, you can also store the data from text in a separate file.
P.S.: the image you are trying to process is far more complex than the images Tesseract is built to deal with, so you may get inaccurate results. I would definitely recommend optimizing the image before use: reduce the character set, preprocess the image before passing it to OCR, upsample it, make sure the dpi is over 250, etc.
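A small sketch of the kind of preprocessing meant here; the 2x upsampling factor and the digit whitelist are assumptions to tune, not fixed recommendations:
from PIL import Image
import pytesseract

im = Image.open("reciept.jpg").convert("L")  # grayscale
# Upsample 2x so small characters are easier for Tesseract to segment.
im = im.resize((im.width * 2, im.height * 2), Image.LANCZOS)
# Reduce the character set: useful when a field is known to be all digits.
config = "-c tessedit_char_whitelist=0123456789"
print(pytesseract.image_to_string(im, config=config))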
Hopefully this question won't be asking for too much and is understandable, but any help would be amazing. Currently I am doing [astronomy] research, and I am required to construct a webpage of quasar spectra to look like this: (sample of the final product).
This is to be done by downloading each individual spectrum from this source: https://data.sdss.org/sas/dr13/eboss/spectro/redux/images/v5_9_0/v5_9_0/3590-55201/.
The problem is that I am struggling to find a way to download large quantities of png files all at once. For some reason, none of the spectra at this link have their coordinates (right ascension and declination) in the file name, whereas the code provided to me as an example expects them.
If I had the png "00:14:53.206-09:12:17.70-4536-55857-0770.png" downloaded, it would be displayed. However, as mentioned before, none of the files I have viewed list those coordinates, so my page shows raw code and no actual images: it cannot pull up those spectra since they are not downloaded, and I would prefer to have them sorted by their coordinates.
Downloading a FITS file which contains the quasar catalog was suggested to me. Presumably, the coordinates would in some way have to be appended to the downloaded png files. Apparently this is all supposed to be easy.
In summary: how do I download large quantities of png files whose names do not contain their coordinates, how do I rename the image files so that their names correspond to the coordinates, and how do I then display them on a webpage?
When displaying images on a website (regardless of where you sourced the images from, or the format, jpg/png etc.), it is advisable that you COMPRESS your images. This is especially valid in cases where the images are big and where there are a number of images on the page (pages like yours!). There are a few online image compressors like tinypng (where you can upload ~30 images at a time, and it compresses both jpgs and pngs) or pngcrush.
Compressing images this way will reduce the file size (greatly in some cases) while the image appears the same. This will very much improve the load time on your site.
When you download a file (any file, not just an image file), you can save it under any name you want, so you can rename the files on download. You will need to upload all the [preferably compressed] images to a web server in order to display them on a webpage. If you don't know ANY web scripting, start with learning basic HTML (you won't need a lot for this project), but the best way to display the images would probably be to loop through the image folder using either JavaScript or PHP.
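A minimal sketch of the download-and-rename idea using only Python's standard library; the source file name and the coordinate mapping are hypothetical placeholders you would fill in from the quasar catalog:
import urllib.request

base = "https://data.sdss.org/sas/dr13/eboss/spectro/redux/images/v5_9_0/v5_9_0/3590-55201/"

# Hypothetical mapping from original file name to coordinates,
# e.g. built from the FITS quasar catalog mentioned in the question.
coords = {
    "spec-image-3590-55201-0001.png": "00:14:53.206-09:12:17.70",
}

for filename, coord in coords.items():
    # Save each file locally under a new, coordinate-based name.
    urllib.request.urlretrieve(base + filename, coord + "-" + filename)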
Goal:
I want to be able to organize many SVG images of Lewis/Kekulé structures into a coherent list of reactions displayed in some common viewing file, like PDF or PNG; that is, each reaction might take up one line with the reactants and products separated by an arrow or some other divider, and there would exist as many lines on as many pages of a PDF as was needed to display each reaction in a given database.
Context:
I am reading canonical SMILES representations of molecules from my local machine with pybel.readstring(), converting that string to an SVG string with molSVG = mol._repr_svg_(), then displaying the results on a jupyter notebook with display(SVG(molSVG)) while also saving the SVG string into a list and an .svg file. For example:
import pybel as pb
from IPython.display import display, SVG

molecules = []
# Read in canonical SMILES to some tractable object.
mol = pb.readstring('smi', '[O-][C]([C]1[CH][CH][CH][O]1)[O]')
# Convert it to an SVG string.
molSVG = mol._repr_svg_()
# Display the SVG.
display(SVG(molSVG))
# Save the SVG to an .svg file and a list.
with open('out.svg', 'w+') as result:
    result.write(molSVG)
molecules.append(molSVG)  # append the SVG string itself, not the closed file handle
What I would like to do instead of displaying these is to save them to some singular image file.
Attempted Solutions:
I've run through quite a few approaches over the last six or seven weeks, so I've forgotten some. I tried pybel's Outfile class, but I don't think that produced any images whatsoever, as it was probably made solely to store OBAtom and OBMol class objects. Matplotlib's PdfPages from matplotlib.backends.backend_pdf seemed like a first step toward getting more than one SVG into a single file, but I think that failed because it requires plots as input, not SVG files. I've also seen modules like svgwrite and the like, but haven't been able to get them to work.
As an example, a page of the finished product might have a similar structure to this, albeit without some of the flourishes like the lines in between and the labels on the arrows.
Also, just a note: I'm extremely inexperienced with all forms of coding and especially with Python, so please keep it as simple as possible for me. Any suggestions are greatly appreciated!
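For what it's worth, one route that might fit this goal is svglib (to turn each SVG into a reportlab drawing) plus reportlab (to place drawings onto PDF pages). A rough, untested sketch, where the input file names and the 150-point line height per reaction are assumptions:
from svglib.svglib import svg2rlg
from reportlab.graphics import renderPDF
from reportlab.pdfgen import canvas

page_w, page_h = 612, 792  # US Letter, in points
c = canvas.Canvas("reactions.pdf", pagesize=(page_w, page_h))
y = page_h - 150  # start near the top of the first page
for svg_file in ["mol1.svg", "mol2.svg", "mol3.svg"]:  # hypothetical files
    drawing = svg2rlg(svg_file)  # parse the SVG into a reportlab drawing
    renderPDF.draw(drawing, c, 50, y)
    y -= 150  # one structure per line, moving down the page
    if y < 0:
        c.showPage()  # start a new page when the current one is full
        y = page_h - 150
c.save()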
I want to extract the text information contained in a PostScript image file (the captions to my axis labels).
These images were generated with pgplot. I have tried ps2ascii and ps2txt on Ubuntu, but they didn't produce any useful results. Does anyone know of another method?
Thanks
It's likely that pgplot drew the text directly with line strokes rather than using PostScript text operators, especially since pgplot is designed to output to a huge range of devices, including plotters, where you would have to do this.
Edit:
If you have enough plots to be worth the effort, then it's a very simple image processing task. Convert each page to something like TIFF in monochrome, then threshold the image to binary; the text will be at the maximum pixel value.
Use a template matching technique. If you have a limited set of possible labels, then just match the entire label; you can even start with a template of the correct size and rotation. Then just flag each plot as containing label[1-n], no need to read the actual text.
If you don't know the label, then you can still do OCR fairly easily: just extract the region around the axis, rotate it for the vertical axis, and use Google's free OCR lib.
If you have pgplot you can even build the training set for OCR, or the template images, directly rather than having to harvest them from the image list.
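A minimal sketch of the threshold-and-template-match approach using OpenCV; the file names and the 0.8 acceptance score are assumptions for illustration:
import cv2

# Load the rendered plot page and one label template, both grayscale.
page = cv2.imread('plot_page.tif', cv2.IMREAD_GRAYSCALE)
template = cv2.imread('label_template.png', cv2.IMREAD_GRAYSCALE)

# Threshold to binary; the drawn text sits at the maximum pixel value.
_, page_bin = cv2.threshold(page, 128, 255, cv2.THRESH_BINARY)
_, tmpl_bin = cv2.threshold(template, 128, 255, cv2.THRESH_BINARY)

# Slide the template across the page and take the best match.
scores = cv2.matchTemplate(page_bin, tmpl_bin, cv2.TM_CCOEFF_NORMED)
_, max_score, _, max_loc = cv2.minMaxLoc(scores)

# Flag the plot as containing this label if the match is strong enough.
if max_score > 0.8:
    print("label found at", max_loc)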