Count number of lines of a powerpoint textbox - python

I have been using python-pptx and if scanny sees this: Thank a lot for your work! It has been an absolute pleasure to work with this package. Every functionality I needed was some how available and if I couldn't find it, there was some answer on SO.
Can I count the number of lines my textbox uses after I have inserted my text? According to my reserach you can only count the paragraphs. Of course with some python wizardry I can calculate the amount words and characters but I guess I look for some functionality which in python-pptx might look like this len(text_frame.lines).
If that functionality is available it would be the icing of the cake! If somebody can do this functionality without using python-pttx or even python I am also interested.
Code snippet looks like this:
for slide in prs.slides:
for shape in slide.shapes:
text_frame = shape.text_frame
key=[substring for substring in keys_dict if substring in text_frame.text]
if 'My_Key' in key:
text_frame.text=item_to_replace
print(len(text_frame.paragraphs))
# searched: len(text_frame.lines)

The short answer is "No".
The reason why is that python-pptx is essentially a PowerPoint (.pptx) file editor and the number of rendered lines in a text box is not specified in the file. The .pptx file specifies the text, how it is divided into paragraphs and runs and the character formatting of each run, but the actual display characteristics, like which letter shows up in what location on the screen or printed page is all determined by the rendering engine.
In Microsoft PowerPoint, this is (part of) the PowerPoint application itself. Other clients like LibreOffice have their own rendering engines.
As you say, with some fancy programming you can approximate the values you mention, like how many lines it takes up and where the line-breaks occur, but in doing so you are implementing your own (partial) rendering engine. This of course cannot be guaranteed to always produce the same result as a particular client.
Note that not all clients will necessarily render identically. You wouldn't have to look far to find small differences in how a slide was rendered in Google Docs, for example, vs. PowerPoint.
Other related common questions like "How can I tell when my table is full so I can continue it on the next slide?" also cannot be solved without resorting to rendering.

Related

How to parse this kind of PDF with python

I am trying to parse the pdf found here: https://corporate.lowes.com/sites/lowes-corp/files/annual-report/lowes-2020ar.pdf with python. It seems to be text-based, according to the copy/paste test, and the first several pages parse just fine using, e.g. pymupdf.
However, after about page 12, there seems to be an internal change in the document encoding. For example, this section from page 18:
It looks like text, but when you copy and paste it, it becomes:
%A>&1;<81
FB9#4AH4EL
%BJ8XF8#C?BL874CCEBK<#4G8?L
9H??G<#84FFB6<4G8F4A7
C4EGG<#84FFB6<4G8F
CE<#4E<?L<AG;8.A<G87,G4G8F4A74A474"A9<F64?
J88KC4A787BHEJBE>9BE68
;<E<A:4FFB6<4G8F<AC4EGG<#8
F84FBA4?
4A79H??G<#8CBF<G<BAFGB9H?9<??G;8F84FBA4?78#4A7B9BHE,CE<A:F84FBA
<A6E84F8778#4A77HE<A:G;8(/"C4A78#<6
4F6HFGB#8EF9B6HF87BA;B#8<#CEBI8#8AGCEB=86GF
4A74A4G<BAJ<78899BEGGB#B7<9LBHEFGBE8?4LBHG
What is going on here? Will I need to use OCR to parse a file like this? Or is there some way of translating that the stuff above back to text?
Pages 13 to 100 have been imported also there are other odd practices thus suggest you will get 12 good pages then need to OCR 13-100 then probably good 3 pages from 101-104 again see https://stackoverflow.com/a/68627207/10802527
The majority of Pages 13-100 contain structured text that is described as Roman, and coincidentally the Romans were fond of encoding messages by sliding the alphabet a few step to the right or left and that's exactly what's happening here by character sliding we could extract much of the corrupted text using chars+n so read
A and replace with n
B and replace with o
C and replace with p
etc. but I will leave it there as I have little time to do 90 pages of analysis on a bad file font definition.
I tried Acrobat and Exchange plus others all agreed the text was defined as a reasonable form of Times Roman thus nothing to fix but content is meaningless nevertheless Selecting the characters for "We" (08) generally jumped to another instance suggesting there could be some slight possibility of redemption but then yet again the same two characters stopped on occasion at "ai" which is what's needed so I would say the file is Borked.
In theory the corruption should be recoverable in the PDF by remapping that font (at least for those pages), and with good Char remapping by adding or subtracting accordingly the plain text may be more easily converted.

PyPDF2 can't read non-English characters, returns empty string on extractText()

i'm working on a script that will extract data from a large PDF File (40-60 plus, pages long)
that isn't in English but the file contains Greek characters and all seems good until i run the extractText() function of PyPDF2 to get the givens page contents, then it returns an empty string.
I'm new to this library and i don't know what to do, to fix this problem!!
PyPDF2's "Extract Text" looks like it will either Work Just Fine, or Fail Completely. There's no parameters you can pass in to try to get things to work properly. It'll work or it won't.
You may not be able to fix this problem. If you can successfully copy/paste the text in Acrobat/Reader, then it's possible to extract the text. So what happens when you try to copy/paste out of Reader? Don't try this with some other third party PDF viewer, use Adobe software. You'll probably have to abandon PyPDF2 and move on to some other PDF API, but if Reader can do it, it's a fixable problem.
There are three different things in a PDF that can look like letters to the human eye.
Letters in the PDF in some text encoding. There are several fixed encodings, plus PDF allows you to embed your own custom encodings (often used with font subsets). Software can create PDFs that look fine but can't really be copy/pasted from, even by Adobe.
Path art that just happens to look an awful lot like letters. "Start drawing a line here, draw a straight line to there, then a curve like this to there" and so on. If you're curious, PDF uses Bezier curves to define its curves. Not terribly related to your question, but interesting.
Bit maps (.jpeg/gif/etc images) that define a grid of pixels.
In the past, Reader has only been able to handle text type 1 above, and then only if the text was encoded properly. Broken custom encodings are alarmingly common (or were 7+ years ago when I stopped working on PDF software).
With broken type 1s, and all of 2 and 3, the only thing you can do is to run OCR on the PDF. OCR: Optical Character Recognition. There are several open source OCR projects out there, as well as commercial ones.

Bluring the Image which is under specific Heading using Python docx

I am planning to search the specific heading in the document, and then i have to strike out all the contents in that heading. The document has many headings, each heading may have paragraph, tables, images altogether or in any combinations.
I have installed docx, i was able to search the specific heading, strike out paragraph, tables.
Now i am not able to access the images under that Heading. To indicate that, the image is strikeout, we are trying to blur the image
Problem 1: I am able to get the Image ID (Resource ID), Image Name for all the images in the document. But i don't know how to get the resource id for the images which is under Specific Heading, and then i have to blur it.
Problem 2: I have enabled Track Changes option using VBMacro from python code. But whatever changes i did using docx (strikeout) is not highlighted for Tracking.
These are two separate questions (or three, depending on how you count). I'll address the first one here, you can post the other question as a separate new question. (Maybe: "How use python-pptx to track changes in Word document").
Regarding blurring the image, you have two challenges:
Identify images associated with a particular area in the document.
Blur the image.
There is no direct API support for either of these operations in python-docx. However, you can use python-docx to access the underlying XML and make the changes using lxml calls (which python-docx uses internally). Such efforts are commonly called "workaround functions", so if you search Google on 'python-docx OR python-pptx workaround function' you will find examples.
An in-line image is stored at the Run level. So you can iterate over all the runs in the section of interest and see if any of them have images. This analysis page from the python-docx project has some of the details you'll need: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/shapes-inline.html
Basically you'd do something like this:
for run in runs: # however you decide to get the runs
r = run._element # this is the `<w:r>` XML element for the run
pics = r.xpath('.//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic')
if not pics:
break
print(r.xml) # if you want to see the XML for this run
This will print the XML for run elements containing a picture.
Regarding the actual blurring, there are two approaches I can think of:
Replace the current picture with a "blurred" version.
Change the transparency of the image in Word to make it look much lighter. This does not remove detail from the image and the actual image is still "behind", unchanged, if for example the user wanted to right click and pick "Save image...".
The second approach is much easier. You'll have to decide whether it meets your requirements.
Once you decide which way you want to go you can search for solutions to that problem or submit a new question focused on that topic.

Page number python-docx

I am trying to create a program in python that can find a specific word in a .docx file and return page number that it occurred on. So far, in looking through the python-docx documentation I have been unable to find how do access the page number or even the footer where the number would be located. Is there a way to do this using python-docx or even just python? Or if not, what would be the best way to do this?
Short answer is no, because the page breaks are inserted by the rendering engine, not determined by the .docx file itself.
However, certain clients place a <w:lastRenderedPageBreak> element in the saved XML to indicate where they broke the page last time it was rendered.
I don't know which do this (although I expect Word itself does) and how reliable it is, but that's the direction I would recommend if you wanted to work in Python. You could potentially use python-docx to get a reference to the lxml element you want (like w:document/w:body) and then use XPath commands or something to iterate through to a specific page, but just thinking it through a bit it's going to be some detailed development there to get that working.
If you work in the native Windows MS Office API you might be able to get something better since it actually runs the Word application.
If you're generating the documents in python-docx, those elements won't be placed because it makes no attempt to render the document (nor is it ever likely to). We're also not likely to add support for w:lastRenderedPageBreak anytime soon; I'm not even quite sure what that would look like.
If you search on 'lastRenderedPageBreak' and/or 'python-docx page break' you'll see other questions/answers here that may give a little more.
Using Python-docx: identify a page break in paragraph
from docx import Document
fn='1.doc'
document = Document(fn)
pn=1
import re
for p in document.paragraphs:
r=re.match('Chapter \d+',p.text)
if r:
print(r.group(),pn)
for run in p.runs:
if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
pn+=1
print('!!','='*50,pn)

Modify and create PDF using Python

I have created a really nice looking invitation letter in word (.doc/.docx). Now, I need to personalize it for 1,000 people with their names and associated QR codes.
I tried working with pyfpdf and reportlab but it seems like in order to use these packages I have to re-generate the whole invitation letter along with text and graphics. I'm not sure if I will be able to generate an equally visually appealing letter as I have now in word (at least not without a lot of effort).
Is there a better way, where I use word template as input, fill-in the name and QR code and generate PDF?
If you are prepared to do the QR code and personalization in reportlab, then pdfrw (disclaimer: I am the author) will let you either merge the PDFs after the fact (similar to a watermarking operation), or can bring the PDF you generate from word in to reportlab a form XObject (similar to an image). You can use it for a background.
You should try using the Microsoft Word MailMerge feature which will probably do exactly what you want from within Word itself.
PDF editing is a very complex beast, as is docx editing. The majority of companies who offer PDF "support" use PDF APIs, since the software to edit and create PDF documents is so complex it's a retailable product in itself.
You can use MailMerge either to print or to email the PDF to lots of people at once with custom settings for each person.

Categories