I want to find a specific heading in a document and then strike out all the content under that heading. The document has many headings, and each heading may contain paragraphs, tables, and images, in any combination.
I have installed docx and can find the specific heading and strike out its paragraphs and tables.
However, I cannot access the images under that heading. To indicate that an image is struck out, we want to blur it.
Problem 1: I can get the image ID (resource ID) and image name for every image in the document, but I don't know how to get the resource IDs of only the images under a specific heading so that I can blur them.
Problem 2: I have enabled the Track Changes option using a VB macro from Python code, but the changes I make with docx (the strikeout) are not highlighted as tracked changes.
These are two separate questions (or three, depending on how you count). I'll address the first one here; you can post the other as a separate new question (maybe: "How to use python-docx to track changes in a Word document").
Regarding blurring the image, you have two challenges:
1. Identify images associated with a particular area in the document.
2. Blur the image.
There is no direct API support for either of these operations in python-docx. However, you can use python-docx to access the underlying XML and make the changes using lxml calls (which python-docx uses internally). Such efforts are commonly called "workaround functions", so if you search Google on 'python-docx OR python-pptx workaround function' you will find examples.
An in-line image is stored at the Run level. So you can iterate over all the runs in the section of interest and see if any of them have images. This analysis page from the python-docx project has some of the details you'll need: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/shapes-inline.html
Basically you'd do something like this:
    for run in runs:  # however you decide to get the runs
        r = run._element  # this is the `<w:r>` XML element for the run
        pics = r.xpath('.//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic')
        if not pics:
            continue  # no picture in this run; move on to the next one
        print(r.xml)  # if you want to see the XML for this run
This will print the XML for run elements containing a picture.
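To narrow this down to one heading, you need to walk the document body in order and remember which heading you are currently under. Here is a minimal sketch of that idea. It uses only the standard library on the raw `word/document.xml` rather than python-docx, and it makes two assumptions you should check against your file: headings are paragraphs whose style id starts with "Heading" (true for the default English template), and any later heading, including a sub-heading, ends the section. The function name `image_rids_under_heading` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML / DrawingML parts.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}
W = "{%s}" % NS["w"]
A = "{%s}" % NS["a"]
R = "{%s}" % NS["r"]


def image_rids_under_heading(document_xml, heading_text):
    """Return the r:embed ids of pictures that follow the given heading.

    Assumption: a heading is a <w:p> whose style id starts with "Heading".
    """
    body = ET.fromstring(document_xml).find("w:body", NS)
    rids = []
    inside = False
    for block in body:  # paragraphs and tables, in document order
        if block.tag == W + "p":
            style = block.find("w:pPr/w:pStyle", NS)
            style_id = style.get(W + "val", "") if style is not None else ""
            if style_id.startswith("Heading"):
                text = "".join(t.text or "" for t in block.iter(W + "t"))
                inside = text.strip() == heading_text
                continue  # the heading paragraph itself holds no content
        if inside:
            for blip in block.iter(A + "blip"):
                rid = blip.get(R + "embed")
                if rid:
                    rids.append(rid)
    return rids
```

For a real file you could pass in `zipfile.ZipFile("your.docx").read("word/document.xml")`. The ids returned are relationship ids (like "rId5"), which is what you need to look up the image part.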
Regarding the actual blurring, there are two approaches I can think of:
1. Replace the current picture with a "blurred" version.
2. Change the transparency of the image in Word to make it look much lighter. This does not remove detail from the image and the actual image is still "behind", unchanged, if for example the user wanted to right click and pick "Save image...".
The second approach is much easier. You'll have to decide whether it meets your requirements.
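As a rough sketch of the transparency approach: DrawingML lets an `<a:blip>` element carry an `<a:alphaModFix amt="..."/>` child, where `amt` is an opacity in thousandths of a percent (30000 means 30%). The helper below adds or updates that child on every blip beneath an element. The name `fade_pictures` is hypothetical, and whether a given Word version actually renders the result as expected is an assumption you would need to verify. It is shown with the standard library's ElementTree; it only uses `iter`, `find`, `makeelement`, `insert` and `set`, which lxml elements also offer, so it is meant to be usable on `run._element` too, but I have not tested that here.

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def fade_pictures(element, opacity_percent=30):
    """Set <a:alphaModFix> on every <a:blip> below `element`.

    Returns the number of blips that were touched. The stored image
    itself is not modified, only how it is displayed.
    """
    amt = str(int(opacity_percent * 1000))  # percent -> 1/1000ths of a percent
    tag = "{%s}alphaModFix" % A_NS
    changed = 0
    for blip in element.iter("{%s}blip" % A_NS):
        fix = blip.find(tag)
        if fix is None:
            fix = blip.makeelement(tag, {})
            blip.insert(0, fix)  # put it first, ahead of any other children
        fix.set("amt", amt)
        changed += 1
    return changed
```

With python-docx you would call something like this for each run you found under the heading, then save the document as usual.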
Once you decide which way you want to go you can search for solutions to that problem or submit a new question focused on that topic.
I'm using Camelot to extract table information from a PDF that I converted from scanned to searchable using ocrmypdf (500 dpi).
Camelot seems to be able to identify the table and extract most of the data within the table but it seems to be unable to extract the bottom half. In essence, it sees the top half of the table but seems to be unable to separate the text from the lower half.
This is the table from the PDF in question:
But when I use Camelot's visual debugging and ask it to show me the words it will extract, it seems to recognize the bottom section of the table as one giant block.
Any guidance you can provide on improving Camelot's "vision" here would be helpful.
Apart from the block, the horizontal lines are also marked as text, which is odd.
Camelot uses pdfminer.six for text extraction and you can pass LAParams (page 16) to camelot.read_pdf() to tweak that.
You should also check out camelot.plot(table, kind="grid") to see if the lines are recognized correctly. If not, that might be where the problem lies.
I have been using python-pptx, and if scanny sees this: thanks a lot for your work! It has been an absolute pleasure to work with this package. Every piece of functionality I needed was somehow available, and if I couldn't find it, there was an answer on SO.
Can I count the number of lines my text box uses after I have inserted my text? According to my research you can only count the paragraphs. Of course, with some Python wizardry I can count the words and characters, but I am looking for something that in python-pptx might look like len(text_frame.lines).
If that functionality is available it would be the icing on the cake! If somebody can do this without using python-pptx, or even without Python, I am also interested.
Code snippet looks like this:
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue  # pictures, charts, etc. have no text frame
            text_frame = shape.text_frame
            key = [substring for substring in keys_dict if substring in text_frame.text]
            if 'My_Key' in key:
                text_frame.text = item_to_replace
                print(len(text_frame.paragraphs))
                # searched: len(text_frame.lines)
The short answer is "No".
The reason why is that python-pptx is essentially a PowerPoint (.pptx) file editor and the number of rendered lines in a text box is not specified in the file. The .pptx file specifies the text, how it is divided into paragraphs and runs and the character formatting of each run, but the actual display characteristics, like which letter shows up in what location on the screen or printed page is all determined by the rendering engine.
In Microsoft PowerPoint, this is (part of) the PowerPoint application itself. Other clients like LibreOffice have their own rendering engines.
As you say, with some fancy programming you can approximate the values you mention, like how many lines it takes up and where the line-breaks occur, but in doing so you are implementing your own (partial) rendering engine. This of course cannot be guaranteed to always produce the same result as a particular client.
Note that not all clients will necessarily render identically. You wouldn't have to look far to find small differences in how a slide was rendered in Google Slides, for example, vs. PowerPoint.
Other related common questions like "How can I tell when my table is full so I can continue it on the next slide?" also cannot be solved without resorting to rendering.
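If an approximation is good enough, the "fancy programming" can be as small as the sketch below. It is not python-pptx functionality; `estimate_line_count` is a made-up helper, and the `avg_char_width` factor (average glyph width as a fraction of the font size) is a guess you would have to tune for your font. It simply word-wraps each paragraph to an estimated number of characters per line, which is exactly the kind of partial rendering engine described above and will not match PowerPoint in every case.

```python
import textwrap


def estimate_line_count(text, box_width_pt, font_size_pt, avg_char_width=0.5):
    """Roughly estimate how many rendered lines `text` needs.

    box_width_pt   -- usable text width of the box, in points
    font_size_pt   -- font size, in points
    avg_char_width -- assumed average character width / font size
    """
    chars_per_line = max(1, int(box_width_pt / (font_size_pt * avg_char_width)))
    total = 0
    for paragraph in text.split("\n"):
        wrapped = textwrap.wrap(paragraph, width=chars_per_line)
        total += max(1, len(wrapped))  # an empty paragraph still takes a line
    return total
```

You would feed it `text_frame.text` together with the shape width (minus margins) converted to points.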
I'm trying to read some PDF documents with Python.
I would like to extract a summary of the first page.
Is there a library that can do this?
There are two parts to your problem: first you must extract the text from the PDF, and then run that through a summarizer.
There are many utilities to extract text from a PDF, though text in a PDF may not be stored in a 'logical' order.
(For instance, a page with two text columns might be stored with the first line of both columns, followed by the next, etc; rather than all the text of the first column, then the second column, as a human would read it.)
The PDFMiner library would seem to be ideal for extracting the text. A quick Google reveals that there are several text summarizer python libraries, though I haven't used any of them and can't attest to their abilities. But parsing human language is tricky - even for humans.
https://pypi.org/project/text-summarizer/
http://ai.intelligentonlinetools.com/ml/text-summarization/
If you're using macOS, there is a built-in text-summarizing Service. Right-click any selected text and click "Summarize" to activate it, though it seems hard to incorporate this into an automated process.
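To give an idea of what the second part involves, here is a deliberately naive extractive summarizer using only the standard library. It is not how any of the libraries above work; `summarize` is a hypothetical helper that scores each sentence by how often its longer words occur in the whole text and keeps the top few in their original order. It assumes you have already extracted the first page's text (for example with PDFMiner) into a string.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences of `text`, in original order."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count words longer than three letters as a crude stop-word filter.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(s for s in sentences if s in top)
```

Real summarizers do far more than this, and text extracted from a PDF in a non-logical order will confuse any of them, so check the extracted text first.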
I'm currently writing part of a program that is supposed to take an entire HTML page and convert it into a Word file. The page contains titles, paragraphs, graphics...
But my HTML page also has tables (table, tr, th, td tags), which I parse into Word tables. My tables are too wide, so some words get split across lines. I expected something like this:
|Myentireword|
but it comes out like this (because there are too many cells in one row):
|Myent|
|irewo|
|rd |
It's really ugly, and I don't see any solution other than turning my table into an image, but I can't find a way to do this: use my HTML parser, create a table of strings, render it to an image, save the image, then load it into my Word document. I can do the first and second steps, and I can load an image into my document, but I have no way to create the image itself.
I cannot change the output format; it IS required to be a Word document. Our customers want to be able to modify and save it, and many of them are not experts. So I'm not able to use some PIL technique (also, it doesn't solve my problem).
I already have a working parser and output for my HTML, so I can't use something like "create a table from HTML".
I'm not able to use CSS or PDF creation, as I said, so this approach won't work either: "Quality tables in python". But the question there is exactly mine!
I cannot use Matplotlib to create the table myself, because I do not know how to write into it, and I need something (if possible) to "merge" cells. If that's not possible, I'll do without; I just need something that works. It's really necessary, please!
I have created a really nice-looking invitation letter in Word (.doc/.docx). Now I need to personalize it for 1,000 people with their names and associated QR codes.
I tried working with pyfpdf and reportlab, but it seems that to use these packages I have to regenerate the whole invitation letter, text and graphics included. I'm not sure I will be able to generate a letter as visually appealing as the one I have now in Word (at least not without a lot of effort).
Is there a better way, where I use the Word template as input, fill in the name and QR code, and generate a PDF?
If you are prepared to do the QR code and personalization in reportlab, then pdfrw (disclaimer: I am the author) will let you either merge the PDFs after the fact (similar to a watermarking operation), or bring the PDF you generate from Word into reportlab as a form XObject (similar to an image), which you can use as a background.
You should try using the Microsoft Word MailMerge feature which will probably do exactly what you want from within Word itself.
PDF editing is a very complex beast, as is docx editing. The majority of companies who offer PDF "support" use PDF APIs, since the software to edit and create PDF documents is so complex it's a retailable product in itself.
You can use MailMerge either to print or to email the PDF to lots of people at once with custom settings for each person.