Is it possible to generate a vector-based PDF using wordcloud in Python?

I am using wordcloud in python to generate word clouds.
I was able to reproduce this example on my machine, and then changed the last line from plt.show() to plt.savefig('image.pdf') to get PDF output.
I got a PDF with the same result; however, the PDF appears to be pixel-based rather than vector-based. When I zoom in on a particular point in the PDF, the picture becomes very low quality.
Is there any way to produce a vector-based PDF using wordcloud? If not, is there another Python library that can produce vector-based (PDF) word clouds?

If wordcloud can generate any sort of vector output such as PS or SVG, Inkscape can usually convert it to a PDF without rasterizing it. You can even do this headless, e.g. inkscape my.svg -A my.pdf (on Inkscape 1.0 and later, inkscape my.svg --export-filename=my.pdf).
Hmm, looking at wordcloud, it looks like it uses PIL. I don't think that PIL can produce vector images. But if you could use the logic in wordcloud and separate it from PIL, you can get vector fonts onto PDFs by drawing onto a reportlab canvas.

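You can save figures in a vector format so that they scale without quality loss. Such formats are PDF and EPS. Just change the extension to .pdf or .eps and matplotlib will write the correct format. Note that this only keeps the elements matplotlib draws itself (lines, text, axes) as vectors; an image displayed with imshow, such as a word cloud rendered through PIL, is still embedded as a bitmap.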
plt.savefig('destination_path.eps', format='eps')
plt.savefig('destination_path.pdf', format='pdf')
I have found that eps/pdf files work best.
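What ends up in the PDF depends on how the words are drawn: a bitmap shown with imshow stays a bitmap, whereas text that matplotlib draws itself is written as font glyphs. The following is only a minimal sketch of that idea, assuming the layout (words, positions, sizes) is already known. The `layout` list and the `draw_words` helper are hypothetical and are not part of wordcloud's API.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; no window is opened
import matplotlib.pyplot as plt

# Hypothetical layout: (word, x, y, font size in points, rotation in degrees).
# x and y are fractions of the axes width and height.
layout = [
    ("python", 0.50, 0.60, 40, 0),
    ("vector", 0.30, 0.35, 24, 0),
    ("pdf", 0.75, 0.30, 18, 90),
]


def draw_words(layout, path_or_buffer):
    """Draw each word with Axes.text and save the figure as a PDF."""
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.axis("off")
    for word, x, y, size, rotation in layout:
        ax.text(x, y, word, fontsize=size, rotation=rotation,
                ha="center", va="center", transform=ax.transAxes)
    fig.savefig(path_or_buffer, format="pdf")
    plt.close(fig)


buffer = io.BytesIO()
draw_words(layout, buffer)
pdf_bytes = buffer.getvalue()
```

Passing a file name such as 'image.pdf' instead of the buffer writes the file to disk. Zooming in on the result should keep the letters sharp, because no pixel image is involved.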

Related

OCR for Bank Receipts

I am working on an OCR problem for bank receipts and need to extract details such as the date and account number. After preprocessing the input, I run Tesseract OCR on it (using pytesseract in Python). I have obtained the hOCR output file, but I am not able to make sense of it. How do we extract information from the hOCR output file? Note that the receipt has numbers filled in boxes, as on ordinary forms.
I used the code below to read the file. Should I use a different encoding?
import os

if os.path.isfile('output.hocr'):
    # Read the whole hOCR file as UTF-8 text; the with block closes it.
    with open('output.hocr', 'r', encoding='UTF-8') as fp:
        text = fp.read()
Note: The attached image is one example of the data. These images come from PDF files, which I convert into images programmatically.
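On reading the hOCR file itself: hOCR is HTML in which each recognized word is a span with class ocrx_word, and its title attribute carries the bounding box (bbox x0 y0 x1 y1) and a confidence (x_wconf). A minimal sketch using only the standard library follows. The `HocrWords` class and the sample markup are made up for illustration, so real Tesseract output may need adjustments.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+)")


class HocrWords(HTMLParser):
    """Collect text, bounding box, and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word
        self._depth = 0       # nested <span> depth inside the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            if tag == "span":
                self._depth += 1
            return
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            title = attrs.get("title") or ""
            bbox = BBOX_RE.search(title)
            conf = CONF_RE.search(title)
            self._current = {
                "text": "",
                "bbox": tuple(map(int, bbox.groups())) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,  # integer part
            }
            self._depth = 0

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is None or tag != "span":
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self._current["text"] = self._current["text"].strip()
        self.words.append(self._current)
        self._current = None


# Made-up sample in the shape of hOCR output (not from a real receipt).
sample = """
<div class='ocr_page' title='bbox 0 0 600 200'>
 <span class='ocr_line' title='bbox 10 10 300 40'>
  <span class='ocrx_word' title='bbox 10 10 80 40; x_wconf 96'>Date</span>
  <span class='ocrx_word' title='bbox 90 10 220 40; x_wconf 91'>12/03/2018</span>
 </span>
</div>
"""

parser = HocrWords()
parser.feed(sample)
words = parser.words
```

With the bounding boxes available, a field such as the date can be located by finding the word "Date" and then taking the words whose boxes lie to its right on the same line.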
I personally would use something like Tesseract to do the OCR, and then perhaps something like OpenCV with SURF for the tick boxes.
You could even do edge detection with OpenCV and SURF for each section and OCR that specific area, which is more robust than analyzing the whole document.
You can simply provide the image as input, instead of creating and processing an hOCR output file.
Try:
from PIL import Image
import pytesseract
im = Image.open("reciept.jpg")
text = pytesseract.image_to_string(im, lang = 'eng')
print(text)
This program takes the location of the image to be run through OCR, extracts the text from it, stores it in the variable text, and prints it out. If you want, you can also write the contents of text to a separate file.
P.S.: The image you are trying to process is far more complex than the images Tesseract is designed to deal with, so you may get incorrect results. I would definitely recommend optimizing the input first, for example by reducing the character set used, preprocessing the image before passing it to OCR, upsampling the image, and keeping the DPI above 250.
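The optimizations listed above can be sketched as follows. This is an illustration under assumptions, not a tuned pipeline: `preprocess` and `read_digits` are hypothetical helper names, the scale and threshold values are arbitrary, and whether Tesseract honors the character whitelist depends on its version and engine mode.

```python
from PIL import Image


def preprocess(image, scale=2, threshold=160):
    """Grayscale, upsample, and binarize an image before OCR."""
    gray = image.convert("L")
    big = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Pixels brighter than the threshold become white, the rest black.
    return big.point(lambda p: 255 if p > threshold else 0)


def read_digits(image):
    """OCR an already cropped field, restricting the output to digits."""
    import pytesseract  # imported here so preprocess() works without it

    # --psm 7 treats the image as a single text line.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    return pytesseract.image_to_string(image, lang="eng", config=config)


# Synthetic stand-in for a cropped receipt field (hypothetical data).
field = Image.new("RGB", (100, 40), (200, 200, 200))
clean = preprocess(field)
```

On a real receipt, each boxed field would be cropped first and then passed through preprocess and read_digits one at a time.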

matplotlib: saved imshow pdf looks different from the plot window

The following figure was plotted using imshow in matplotlib with option interpolation='none':
However, after I saved it as a pdf file, the saved pdf file looks quite different:
The problem is: the blue patterns become very blurry.
My question is: How can I save a pdf figure that looks exactly like the plot window?
I solved this problem by specifying the dpi in savefig for the pdf file type. Even though I read online that dpi is not supposed to make a difference in the vector-based PDF format in theory, it did solve the problem for me in practice.
import numpy as np
import matplotlib.pyplot as plt
plt.imshow(np.random.random((10,10)))
plt.savefig("test.pdf", dpi=300)
PDF is a vector image format, which means it is up to the program you open it in to decide how it should be drawn. This has benefits when you want to zoom in and out arbitrarily while keeping high quality. However, some viewers alter the image through anti-aliasing or smoothing, which is why it can look blurry.
Your best bet for consistency is to use a pixel-based image format. I would suggest saving it as a .png.

How do you improve matplotlib image quality?

I am using a python program to produce some data, plotting the data using matplotlib.pyplot and then displaying the figure in a latex file.
I am currently saving the figure as a .png file, but the image quality isn't great. I've tried changing the DPI with matplotlib.pyplot.figure(dpi=200) etc., but this seems to make little difference. I've also tried using different image formats, but they all look a little faded and not very sharp.
Has anyone else had this problem?
Any help would be much appreciated
You can save the images in a vector format so that they will be scalable without quality loss. Such formats are PDF and EPS. Just change the extension to .pdf or .eps and matplotlib will write the correct image format. Remember that LaTeX likes EPS and pdfLaTeX likes PDF images, although most modern LaTeX executables are pdfLaTeX in disguise and convert EPS files on the fly (the same effect as including the epstopdf package in your preamble, which may not perform as well as you'd like).
Alternatively, increase the DPI, a lot. These are the numbers you should keep in mind:
300dpi: plain paper prints
600dpi: professional paper prints. Most commercial office printers reach this in their output.
1200dpi: professional poster/brochure grade quality.
I use these to adapt the quality of PNG figures in conjunction with figure's figsize option, which allows for correctly scaled text and graphics as you improve the quality through dpi.
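The relationship between figsize and dpi is simple arithmetic: pixels = inches × dpi. A small sketch that saves the same 4 × 3 inch figure at two DPI values and reads the pixel size back from the PNG header (`png_size` is a hypothetical helper written for this illustration):

```python
import io
import struct

import matplotlib
matplotlib.use("Agg")  # headless backend; no window is opened
import matplotlib.pyplot as plt


def png_size(width_in, height_in, dpi):
    """Save a figure of the given size in inches; return (width, height) in pixels."""
    fig, ax = plt.subplots(figsize=(width_in, height_in))
    ax.plot([0, 1, 2], [0, 1, 4])
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png", dpi=dpi)
    plt.close(fig)
    # A PNG stores width and height as big-endian 32-bit integers
    # at byte offsets 16 to 24 (inside the IHDR chunk).
    return struct.unpack(">II", buffer.getvalue()[16:24])


size_300 = png_size(4, 3, 300)  # (1200, 900)
size_600 = png_size(4, 3, 600)  # (2400, 1800)
```

Because font sizes are given in points, text keeps the same physical size on the page while the pixel count grows with dpi. Choosing figsize to match the width the figure will occupy in the LaTeX document avoids rescaling.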

How to save figures to pdf as raster images in matplotlib

I have some complex graphs made using matplotlib. Saving them to a pdf using the savefig command uses a vector format, and the pdf takes ages to open. Is there any way to save the figure to pdf as a raster image to get around this problem?
You can force individual figure elements to be rasterized like this:
text(1,1,'foobar',rasterized=True)
Not that I know of, but you can use the 'convert' program (ImageMagick) to convert a jpg to a pdf: 'convert file.jpg file.pdf'.
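A fuller sketch of the rasterized=True approach, assuming the slow part of the figure is a large scatter plot. The data here is random and the `save_pdf` helper is hypothetical. Only the artist flagged as rasterized is stored as a bitmap; axes and labels remain vector, and the dpi passed to savefig sets the bitmap resolution.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 20000))  # stand-in for a complex graph


def save_pdf(rasterized):
    """Return the PDF bytes of a scatter plot, optionally rasterized."""
    fig, ax = plt.subplots()
    # The heavy artist is stored as a bitmap when rasterized is True.
    ax.scatter(x, y, s=1, rasterized=rasterized)
    ax.set_xlabel("x")  # labels and axes stay vector either way
    ax.set_ylabel("y")
    buffer = io.BytesIO()
    fig.savefig(buffer, format="pdf", dpi=150)
    plt.close(fig)
    return buffer.getvalue()


vector_pdf = save_pdf(rasterized=False)
raster_pdf = save_pdf(rasterized=True)
```

Whether the rasterized file is smaller or faster to open depends on the plot, so it is worth comparing both outputs for the figure in question.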

Is there a way to extract text information from a postscript file? (.ps .eps)

I want to extract the text information contained in a postscript image file (the captions to my axis labels).
These images were generated with pgplot. I have tried ps2ascii and ps2txt on Ubuntu but they didn't produce any useful results. Does anyone know of another method?
Thanks
It's likely that pgplot drew the characters directly as line strokes rather than as text, especially since pgplot is designed to output to a huge range of devices, including plotters, where you would have to do this.
Edit:
If you have enough plots to be worth the effort, then it's a very simple image processing task. Convert each page to something like TIFF, in monochrome. Threshold the image to binary; the text will be the max pixel value.
Use a template matching technique. If you have a limited set of possible labels, then just match the entire label; you can even start with a template of the correct size and rotation. Then just flag each plot as containing label[1-n], with no need to read the actual text.
If you don't know the label, then you can still do OCR fairly easily: just extract the region around the axis, rotate it for the vertical axis, and use Google's free OCR lib.
If you have pgplot, you can even build the training set for OCR or the template images directly, rather than having to harvest them from the image list.
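The threshold-then-match idea can be sketched with NumPy alone. This is a toy illustration on synthetic arrays, using a brute-force pixel-agreement score; the `binarize` and `find_template` helpers are hypothetical, and a real pipeline would more likely use a library routine such as OpenCV's matchTemplate.

```python
import numpy as np


def binarize(gray, threshold=128):
    """Dark ink on light paper becomes 1 (the max value); paper becomes 0."""
    return (gray < threshold).astype(np.uint8)


def find_template(image, template):
    """Return the best (row, col) and the fraction of agreeing pixels."""
    ih, iw = image.shape
    th, tw = template.shape
    best_score, best_pos = -1, None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = image[r:r + th, c:c + tw]
            score = int(np.sum(window == template))
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score / template.size


# Synthetic "label": ink along the top row and left column of a 3x4 patch.
label_gray = np.full((3, 4), 255, dtype=np.uint8)
label_gray[0, :] = 0
label_gray[:, 0] = 0

# Synthetic white "page" with the label pasted at row 5, column 7.
page_gray = np.full((20, 30), 255, dtype=np.uint8)
page_gray[5:8, 7:11] = label_gray

position, score = find_template(binarize(page_gray), binarize(label_gray))
```

Here position comes out as (5, 7) with a score of 1.0. On real scans the score will be below 1, so a cutoff has to be chosen to decide whether a plot contains a given label.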
