SVG tspan find the pixel width and height - python

I have some text paragraphs in Unicode.
What I want to do is to convert these paragraphs into images.
The process I chose is to convert the text into SVG first and then into PNG.
I am using Python for this, with no third-party libraries at the moment, since it is just simple text processing (one tspan element per line).
The issue I face is that some sentences are long and run outside the viewBox.
Is there a way to figure out a consistent font size that would work for all paragraphs/images?
Or in other words, is there a way to figure out the height and width in pixels of a given string?
Thanks
This question is related but uses javascript instead.
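For illustration, here is a rough standard-library sketch of the kind of estimate being asked about. Exact widths depend on the font and the renderer, so this only approximates: it assumes full-width (East Asian) characters take about 1 em and every other character about 0.6 em, with a line height of 1.2 em. Those ratios and the names estimate_width_em and fit_font_size are assumptions made up for this example, not measured values.

```python
import unicodedata


def estimate_width_em(line):
    """Estimate the width of one line in em units (multiples of the font size)."""
    width = 0.0
    for ch in line:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous glyph
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 1.0  # assumption: wide/full-width glyphs are about 1 em
        else:
            width += 0.6  # assumption: rough average for other glyphs
    return width


def fit_font_size(paragraphs, box_width, box_height, line_height=1.2):
    """Largest font size at which every paragraph fits inside the box.

    paragraphs is a list of paragraphs, each a list of lines (one per tspan).
    Returns float("inf") if there is nothing to fit.
    """
    best = float("inf")
    for lines in paragraphs:
        if not lines:
            continue
        widest = max(estimate_width_em(line) for line in lines)
        if widest > 0:
            best = min(best, box_width / widest)
        best = min(best, box_height / (len(lines) * line_height))
    return best
```

Taking the minimum over all paragraphs gives one font size that should work for every image, at the cost of leaving the shorter paragraphs with spare room.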

Related

Parse PDF shape data in python

I am trying to put together a script to fix a large number of PDFs that have been exported from AutoCAD via its DWG2PDF print driver.
When using this driver, all SHX fonts are rendered as shape data instead of text data. They do, however, have a comment inserted into the PDF at the expected location with the expected text.
So far in my script I have got it to run through the PDF and insert hidden text on top of each section, with the text squashed to the size of the comment, this gets me 90% of the way and gives me a document that is searchable.
Unfortunately, the sizing of the comment regions is relatively coarse (integer based), which makes it difficult to accurately determine the orientation of short text and results in unevenly sized boxes around text.
What I would like to be able to do is parse through the shape data in the PDF, collect anything within the bounds of the comment, and then determine a smaller and more accurate bounding box. However all the information I can find is by people trying to parse through text data, and I haven't been able to find anything at all in terms of shape data.
The image below is an example of the raw text in the PDF. The second image shows the comment bounding box in blue, with the red text being what I am setting to hidden to make the document searchable and copy/pasteable. I can get things a little better by shrinking the box by a fixed margin, but with small text items the low resolution of the comment box coordinate data messes things up.
To get this far I am using a combination of PyPDF2 and reportlab, but am open to moving to different libraries.
I didn't end up finding a solution with PyPDF2. I was able to find an easy way to iterate over shape data in pdfminer.six, but then couldn't find a nice way in pdfminer to extract annotation data.
As such I am using one library to get the annotations, one to look at the shape data, and last of all a third library to add the hidden text on the new pdf. It runs pretty slowly as sheet complexity increases but is giving me good enough results, see image below where the rough green borders as found in the annotations are shrunk to the blue borders surrounding the text. Of course I don't draw the boundaries, and use invisible text for the actual program output, giving pretty good selectable/searchable text.
If anyone is interested in looping over the shape data in PDFs the below snippet should get you started.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTCurve, LTLine

# Walk every page and print the bounding box of each vector shape.
for page_layout in extract_pages("TestSchem.pdf"):
    for element in page_layout:
        # LTLine is a subclass of LTCurve; both are listed for clarity.
        if isinstance(element, (LTCurve, LTLine)):
            print(element.bbox)
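The shrinking step described above (collect the shapes inside a comment region, then take a tighter box around them) could look roughly like the sketch below. It is not the script from the answer: the helper names inside and tight_bbox, and the tol parameter, are hypothetical. It only assumes boxes are (x0, y0, x1, y1) tuples, which is the form pdfminer's bbox attribute uses.

```python
def inside(bbox, region, tol=0.0):
    """True if bbox lies entirely within region, allowing a tolerance."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return (x0 >= rx0 - tol and y0 >= ry0 - tol
            and x1 <= rx1 + tol and y1 <= ry1 + tol)


def tight_bbox(shape_bboxes, region, tol=0.0):
    """Union of all shape boxes that fall inside region, or None if none do."""
    hits = [b for b in shape_bboxes if inside(b, region, tol)]
    if not hits:
        return None
    return (
        min(b[0] for b in hits),
        min(b[1] for b in hits),
        max(b[2] for b in hits),
        max(b[3] for b in hits),
    )
```

A small positive tol may help when the integer-based comment rectangle clips the outermost strokes of the text.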

Calculate rendered text size in PIL

How can I calculate the text size rendered by text method of an ImageDraw object in PIL?
I would like to draw multiline text with a maximum box width or height for example, being able to calculate the width from the text would help me to decide where to cut the text before writing it on the image.
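You can do it using the ImageFont.getsize() method. As far as I know it will handle text with embedded newlines. Note that getsize() was deprecated in Pillow 9.2 and removed in Pillow 10; on current versions, ImageFont.getbbox() and ImageDraw.textbbox() (or ImageDraw.multiline_textbbox() for text with newlines) are the replacements. See the example of using it in this answer of mine to another question.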
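Once you can measure a string, deciding where to cut the text is a greedy loop. The sketch below is one way to do it; wrap_to_width is a made-up name, and measure is any callable that returns the rendered width of a string, so the function itself does not depend on PIL.

```python
def wrap_to_width(text, max_width, measure):
    """Greedily break text into lines whose measured width fits max_width.

    measure(s) must return the rendered width of the string s.
    Existing newlines are kept as hard line breaks. A single word wider
    than max_width is left on a line of its own rather than split.
    """
    lines = []
    for paragraph in text.split("\n"):
        current = ""
        for word in paragraph.split():
            candidate = word if not current else current + " " + word
            if current and measure(candidate) > max_width:
                lines.append(current)  # the line is full, start a new one
                current = word
            else:
                current = candidate
        lines.append(current)
    return lines
```

With a recent Pillow you could probably pass the getlength method of a font loaded through ImageFont.truetype() as measure; for a quick check without any font, len works as a stand-in that counts characters.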

Adding justified text to image in Python

I can already add text to an image using Pillow in Python. However, I want to know how I can add formatted text. In particular, I want to add a box of text to an image such that the text is center justified.
If this isn't possible using Pillow, I am open to other image manipulation libraries (including in other languages) that make overlaying formatted text on images easier.
Refer to the function at this link: http://pillow.readthedocs.io/en/3.1.x/reference/ImageDraw.html#PIL.ImageDraw.PIL.ImageDraw.Draw.text
The first argument is the location. You can compute it from the size of the image you want to add the text to.
Here is a simple library which does the job of text alignment using PIL:
https://gist.github.com/turicas/1455973
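If you would rather compute the locations yourself, centering comes down to a little arithmetic per line. This is only a sketch under assumptions: centered_positions is a hypothetical helper, measure is any callable returning the width of a string in pixels, and line_height is whatever fixed spacing you choose.

```python
def centered_positions(lines, box, measure, line_height):
    """Return one (x, y) top-left position per line.

    Each line is centered horizontally in box, and the block of lines is
    centered vertically. box is (left, top, right, bottom) in pixels.
    """
    left, top, right, bottom = box
    block_height = len(lines) * line_height
    y = top + ((bottom - top) - block_height) / 2
    positions = []
    for line in lines:
        x = left + ((right - left) - measure(line)) / 2
        positions.append((x, y))
        y += line_height
    return positions
```

Each returned pair can then be used as the location argument when drawing the corresponding line.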

Can PIL be used to get dimensions of an svg file?

I have svg files which I would like to compare based on their dimensions.
I read that PIL is the best image tool in Python. Does PIL handle SVG files? I can't seem to find this anywhere.
When googling, I saw people interpreting SVG files as text, which seems counterintuitive.
If not PIL, what is the best way to get the x and y dimensions of a .svg file?
Thanks
PIL handles many image types, but not (yet?) SVG. Partly, this is because SVG is a set of instructions to produce an image, not a container for raw image data.
Fortunately, SVG can be read as XML, using the tool of your choice; for example, xml.etree.ElementTree in the Python standard library.
Unfortunately, by its nature, SVG doesn't have a single native size. Instead, it has two size concepts: the view box, and the height and width attributes.
If your svg file has width and height attributes, you can safely use those as the x and y dimensions, respectively. Otherwise, if it has a viewBox attribute, it is meant to scale to any size you need it to; however, you can use its third and fourth numbers as width and height, if you need to.
Worse, SVG files could lack either one. In that case, one could potentially compute a height and width based on the elements in the file, but that's trickier than anyone really wants to do, given the full capabilities of the format.
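A minimal sketch of the lookup order described above, using xml.etree.ElementTree. The function name svg_size is made up for this example. It ignores units (so "10mm" and "10px" both come back as 10.0) and treats percentage sizes as missing, which is a simplification.

```python
import re
import xml.etree.ElementTree as ET

_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def _leading_number(value):
    """Parse the numeric part of a length such as '120px'; None if unusable."""
    if value is None or value.strip().endswith("%"):
        return None
    match = _NUMBER.match(value.strip())
    return float(match.group()) if match else None


def svg_size(root):
    """Return (width, height) from the root <svg> element, or None.

    Prefers the width/height attributes and falls back to the third and
    fourth numbers of viewBox.
    """
    width = _leading_number(root.get("width"))
    height = _leading_number(root.get("height"))
    if width is not None and height is not None:
        return width, height
    parts = re.split(r"[\s,]+", root.get("viewBox", "").strip())
    if len(parts) == 4:
        try:
            return float(parts[2]), float(parts[3])
        except ValueError:
            return None
    return None
```

For a file on disk, something like svg_size(ET.parse("drawing.svg").getroot()) would do, where drawing.svg is a placeholder name.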

Is there a way to extract text information from a postscript file? (.ps .eps)

I want to extract the text information contained in a postscript image file (the captions to my axis labels).
These images were generated with pgplot. I have tried ps2ascii and ps2txt on Ubuntu but they didn't produce any useful results. Does anyone know of another method?
Thanks
It's likely that pgplot drew the fonts in the text directly with lines rather than using text, especially since pgplot is designed to output to a huge range of devices, including plotters, where you would have to do this.
Edit:
If you have enough plots to be worth the effort, then it's a very simple image processing task.
Convert each page to something like TIFF, in monochrome. Threshold the image to binary; the text will be the max pixel value.
Use a template matching technique. If you have a limited set of possible labels, then just match the entire label; you can even start with a template of the correct size and rotation. Then just flag each plot as containing label[1-n], with no need to read the actual text.
If you don't know the label, then you can still do OCR fairly easily: just extract the region around the axis, rotate it for the vertical, and use Google's free OCR lib.
If you have pgplot, you can even build the training set for OCR or the template images directly, rather than having to harvest them from the image list.
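To make the threshold and template matching steps concrete, here is a toy pure-Python sketch that works on images stored as lists of rows. A real script would use an image library for speed. The names binarize and best_match, the threshold of 128, and the assumption that text pixels are the bright ones are all choices made for this example.

```python
def binarize(gray, threshold=128):
    """Threshold a grayscale image (rows of 0-255 values) to 0/1 pixels."""
    return [[1 if px >= threshold else 0 for px in row] for row in gray]


def best_match(image, template):
    """Slide template over image and return (position, score).

    position is the (row, column) of the top-left corner of the best
    placement; score is the fraction of template pixels that agree there.
    Both inputs are binary images of the kind binarize() returns.
    """
    th, tw = len(template), len(template[0])
    best_score, best_pos = -1, None
    for top in range(len(image) - th + 1):
        for left in range(len(image[0]) - tw + 1):
            score = sum(
                image[top + r][left + c] == template[r][c]
                for r in range(th)
                for c in range(tw)
            )
            if score > best_score:
                best_score, best_pos = score, (top, left)
    return best_pos, best_score / (th * tw)
```

Running best_match once per known label template and keeping the label with the highest score is enough to flag which label a plot contains.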
