How can dynamic page numbers be inserted into added slides? - python

When adding slides via a modified python-pptx, placeholders appear for the slide number on each slide. Instead of the actual slide number, however, the words "Slide Number" appear in that field.
Other answers mention using a static text box, but we need the placeholder to be used, as we create slides for a variety of client masters. A standard textbox would not correctly adjust to varied layouts. We also need the box to float, so that we can place photos behind it on some pages. Finally, it must be dynamic: we often shuffle pages afterwards, so passing the index as a static number would cause issues.
Is there a string or command that can be inserted into the placeholder that would automatically pull the slide's position in the deck? The master uses the combination "<#>", but this doesn't work when passed as a string.
Apologies for my ignorance, I've only been working with python and python-pptx for a couple weeks (huge thanks to its creators!).

An "auto-update" slide-number in PowerPoint is a field. The XML looks like this:
<a:fld id="{1F4E2DE4-8ADA-4D4E-9951-90A1D26586E7}" type="slidenum">
  <a:rPr lang="en-US" smtClean="0"/>
  <a:t>2</a:t>
</a:fld>
So the short answer to your question is no. There is no string you can insert in a textbox that triggers this XML to be inserted and there is no API support for adding fields yet either.
The approach would be to get a <a:p> element for a textbox and insert this XML into it using lxml calls (a process commonly referred to as a "workaround function" in other python-pptx posts).
You can get the a:p element like so:
p = textbox.text_frame.paragraphs[0]._p
Something like this might work, and at least provides a sketch of that approach:
from pptx.oxml import parse_xml
from pptx.oxml.ns import nsdecls
# ---get a textbox paragraph element---
p = your_textbox.text_frame.paragraphs[0]._p
# ---add fld element---
fld_xml = (
    '<a:fld %s id="{1F4E2DE4-8ADA-4D4E-9951-90A1D26586E7}" type="slidenum">\n'
    '  <a:rPr lang="en-US" smtClean="0"/>\n'
    '  <a:t>2</a:t>\n'
    '</a:fld>\n' % nsdecls("a")
)
fld = parse_xml(fld_xml)
p.append(fld)
The number enclosed in the a:t element is the cached slide number, and PowerPoint may not automatically refresh it when opening the presentation. If that's the case, you'll need to get it right when inserting, but it will be updated automatically when you shuffle slides.

Hyperlink is showing as a STRING. Outlook Task for creating email holders

I've tried everything, but I still can't figure out why my task.Body is not rendering the hyperlink that I want it to.
I've tried changing it to HTMLBody, changing BodyFormat to 2 so it would be treated as HTML, and formatting it differently, but I still get the same results.
When I use HTMLBody I get "property htmlbody cannot be set".
shortenedLink = f'<a href="{link}">Hyperlink</a>'
inviteItem.Body = shortenedLink
inviteItem.Save()
inviteItem.Display(True)
The TaskItem object does not expose the HTMLBody property the way the MailItem object does, even though Outlook is perfectly capable of displaying HTML in the UI.
You have a couple of options:
Set the TaskItem.RTFBody instead. It is a binary (byte array) property, so the best way to figure out the value to be set is to take a look at an existing task with the RTF body set and use it as a template. At run-time, you can replace a placeholder and insert your own URL in the array to be assigned to the RTFBody property. Note that OOM has a bug that prevents the RTFBody property from being set using late binding (as you'd be doing in Python or VBA); only early binding via the TaskItem interface works. You can take a look at the RTF body of an existing task with OutlookSpy (I am its author) - select or open a task in Outlook, click the IMessage button, and select the PR_RTF_COMPRESSED property.
If using Redemption is an option (I am also its author), it exposes RDOTaskItem.HTMLBody property.
The TaskItem class doesn't provide the HTMLBody property. Instead, you need to use the RTFBody one if you want to use any formatting in the message body. The property returns or sets a Byte array that represents the body of the Microsoft Outlook item in Rich Text Format.
But there is a trick - to use RTF markup in the body of a task, simply try to use the Body property. Setting this property to a byte array containing RTF will automatically use that RTF in the body. Concretely, to get the body you want, you could use the following code:
task.Body = rb'{\rtf1{Here is the }{\field{\*\fldinst { HYPERLINK "https://www.python.org" }}{\fldrslt {link}}}{ I need}}'
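For illustration, the RTF payload above can be built with a small pure-Python helper, keeping the HYPERLINK field syntax in one place; the function name and signature are my own invention, not part of any Outlook API:

```python
def rtf_hyperlink(url, display_text, before="", after=""):
    """Build a minimal RTF body string containing a clickable HYPERLINK field.

    Assign the result (encoded to bytes) to TaskItem.Body via win32com;
    the RTF field markup renders as a live hyperlink in Outlook.
    """
    field = (
        r'{\field{\*\fldinst { HYPERLINK "%s" }}{\fldrslt {%s}}}'
        % (url, display_text)
    )
    return (r'{\rtf1{%s}' % before) + field + (r'{%s}}' % after)

body = rtf_hyperlink("https://www.python.org", "link",
                     before="Here is the ", after=" I need")
# task.Body = body.encode("ascii")  # hypothetical assignment via win32com
```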

How can I extract only text, without tables, from a PDF file using pdfplumber?

I want to process some PDF files with an NLP module, so I want to clean those files of all existing tables.
This is the code for extracting tables using pdfplumber:
import pdfplumber
pdf = pdfplumber.open("file.pdf")
page = pdf.pages[1]
table = page.extract_table()
But I want to reverse the operation and extract only the text.
disclaimer: I am the author of pText, the library used in this answer.
First, load the Document.
Then you need to define a LocationFilter.
A LocationFilter does pretty much what it says on the tin. It will listen to parsing events (like "render text" or "change font to") but it will only allow those to come through that fall within a given boundary.
Keep in mind the origin in PDF coordinates is at the lower left corner.
The LocationFilter in this example will therefore match only text in the lower-left corner of the page.
Add a SimpleTextExtraction to the LocationFilter
The next question is "what is the LocationFilter going to pass events to?"
In this case, you can start by trying a SimpleTextExtraction.
Putting it all together:
l0 = LocationFilter(0, 0, 100, 100)
l1 = SimpleTextExtraction()
l0.add_listener(l1)
doc = PDF.loads(pdf_file_handle, [l0])
After the Document has loaded, you can ask the SimpleTextExtraction for all the text on a given Page.
l1.get_text(0)
You can obtain pText either on GitHub or from PyPI.
There are a ton more examples, check them out to find out more about working with images.
Do you really have to stick to pdfplumber?
If not, I can suggest a better solution: use tabula instead.
Here is an answer to a similar question you can check out: tabula
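If staying with pdfplumber is preferred, another option is to drop every character whose center falls inside the bounding box of a detected table, using pdfplumber's `find_tables()` and `Page.filter()`. This is my own sketch, not from the answers above; the helper names are made up, and `page` is assumed to be a pdfplumber Page object:

```python
def outside_bboxes(bboxes):
    """Return a predicate that is True for pdfplumber objects whose
    center lies outside every given (x0, top, x1, bottom) bounding box."""
    def test(obj):
        x = (obj["x0"] + obj["x1"]) / 2
        y = (obj["top"] + obj["bottom"]) / 2
        return not any(
            x0 <= x <= x1 and top <= y <= bottom
            for (x0, top, x1, bottom) in bboxes
        )
    return test

def text_without_tables(page):
    """Extract a page's text, excluding characters inside detected tables."""
    bboxes = [t.bbox for t in page.find_tables()]
    return page.filter(outside_bboxes(bboxes)).extract_text()
```

Usage would then be `text_without_tables(pdf.pages[1])` on an opened pdfplumber document.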

How to wrap cell text in tables via docx library or xml?

I have been using python docx library and oxml to automate some changes to my tables in my word document. Unfortunately, no matter what I do, I cannot wrap the text in the table cells.
I managed to successfully manipulate the 'autofit' and 'fit-text' properties of my table, but none of them contribute to the wrapping of the text in the cells. I can see that there is a "w:noWrap" element in the XML version of my Word document, and no matter what I do I cannot manipulate or remove it. I believe it is responsible for the word wrapping in my table.
For example, in this case I am adding a table. I can fit text in a cell and set autofit to True, but cannot for the life of me wrap the text:
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
doc = Document()
table = doc.add_table(5, 5)
table.autofit = True  # autofits, but does not wrap
tc = table.cell(0, 0)._tc  # as a test, fit text to cell 0,0
tcPr = tc.get_or_add_tcPr()
tcFitText = OxmlElement('w:tcFitText')
tcFitText.set(qn('w:val'), "true")
tcPr.append(tcFitText)  # fits the text, but still no wrapping
doc.save('demo.docx')
I would appreciate any help or hints.
The <w:noWrap> element appears to be a child of <w:tcPr>, the element that controls table cell properties.
You should be able to access it from the table cell element using XPath:
tc = table.cell(0, 0)._tc
noWraps = tc.xpath(".//w:noWrap")
The noWraps variable here will then be a list containing zero or more <w:noWrap> elements, in your case probably one.
Deleting it is probably the simplest approach, which you can accomplish like this:
if noWraps:  # skip the following code if the list is empty
    noWrap = noWraps[0]
    noWrap.getparent().remove(noWrap)
You can also take the approach of setting the value of the w:val attribute of the w:noWrap element, but then you have to specify the Clark name of the attribute's namespace, which adds some extra fuss and doesn't really produce a different outcome, unless for some reason you want to keep that element around.

Scrapy + Python, Error in Finding links from a website

I am trying to find the URLs of all the events of this page:
https://www.eventshigh.com/delhi/food?src=exp
But I can see the URL only in a JSON format:
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "DANDIYA NIGHT 2018",
  "image": "https://storage.googleapis.com/ehimages/2018/9/4/img_b719545523ac467c4ad206c3a6e76b65_1536053337882_resized_1000.jpg",
  "url": "https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018",
  "eventStatus": "EventScheduled",
  "startDate": "2018-10-14T18:30:00+05:30",
  "doorTime": "2018-10-14T18:30:00+05:30",
  "endDate": "2018-10-14T22:30:00+05:30",
  "description": "Dress code : TRADITIONAL (mandatory)\u00A0 \r\n Dandiya sticks will be available at the venue ( paid)\u00A0 \r\n Lip smacking food, professional dandiya Dj , media coverage , lucky draw \u00A0, Dandiya Garba Raas , Shopping and Games .\u00A0 \r\n \u00A0 \r\n Winners\u00A0 \r\n \u00A0 \r\n Best dress ( all",
  "location": {
    "@type": "Place",
    "name": "K And L Community Hall (senior Citizen Complex )",
    "address": "80 TO 49, Pocket K, Sarita Vihar, New Delhi, Delhi 110076, India"
  },
Here it is:
"url":"https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018"
But I cannot find any other HTML/XML tag which contains the links. Also I cannot find the corresponding JSON file which contains the links. Could you please help me to scrape the links of all events of this page:
https://www.eventshigh.com/delhi/food?src=exp
Gathering information from a JavaScript-powered page like this one may look daunting at first, but it can often be more productive, as all the information is in one place instead of scattered across a lot of expensive HTTP-request lookups.
So when a page gives you JSON data like this, you can thank them by being nice to the server and using it! :)
With a little time invested into "source-view analysis", which you have already done, this will also be more efficient than trying to get the information through an (expensive) Selenium/Splash/etc. render pipeline.
The tool that is invaluable to get there, is XPath. Sometimes a little additional help from our friend regex may be required.
Assuming you have successfully fetched the page, and have a Scrapy response object (or you have a Parsel.Selector() over an otherwise gathered response-body), you will be able to access the xpath() method as response.xpath or selector.xpath:
>>> response.status
200
You have determined that the data exists as plain text (JSON), so we need to drill down to where it hides, to ultimately extract the raw JSON content.
After that, converting it to a Python dict for further use will be trivial.
In this case it's inside a container node <script type="application/ld+json">. Our XPath for that could look like this:
>>> response.xpath('//script[@type="application/ld+json"]')
[<Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n '>]
This will find every "script" node in the xml page which has an attribute of "type" with value "application/ld+json".
Apparently that is not specific enough, since we find three nodes (Selector-wrapped in our returned list).
From your analysis we know that our JSON must contain "@type":"Event", so let our xpath do a little substring search for that:
>>> response.xpath("""//script[@type="application/ld+json"]/self::node()[contains(text(), '"@type":"Event"')]""")
[<Selector xpath='//script[@type="application/ld+json"]/self::node()[contains(text(), \'"@type":"Event"\')]' data='<script type="application/ld+json">\n '>]
Here we added a second qualifier which says our script node must contain the given text.
(The 'self::node()' shows some XPath axes magic to reference back to our current script node at this point - instead of its descendants. We will simplify this, though.)
Now our return list contains a single node/Selector. As we see from the data= string, if we were to extract() this, we would now
get some string like <script type="application/ld+json">[...]</script>.
Since we care about the content of the node, but not the node itself, we have one more step to go:
>>> response.xpath("""//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()""")
[<Selector xpath='//script[@type="application/ld+json"][contains(text(), \'"@type":"Event"\')]/text()' data='\n [\n \n \n '>]
And this returns (a SelectorList of) our target text(). As you may see we could also do away with the self-reference.
Now, xpath() always returns a SelectorList, but we have a little helper for this: response.xpath().extract_first() will grab the list's first element (checking that it exists) before processing it.
We can put this result into a data variable, after which it's simple to json.loads(data) this into a Python dictionary and look up our values:
>>> events = json.loads(data)
>>> [item['url'] for item in events]
['<url>',
'<url>',
'<url>',
'<url>']
Now you can turn them into scrapy.Request(url)s, and you'll know how to continue from there.
As always, crawl responsibly and keep the 'net a nice place to be. I do not endorse any unlawful behavior.
Assessing one's rights or gaining permission to access a given target resource is one's own responsibility.

Python docx paragraph in textbox

Is there any way to access and manipulate text in an existing docx document in a textbox with python-docx?
I tried to find a keyword in all paragraphs in a document by iteration:
from docx import Document

doc = Document('test.docx')
for paragraph in doc.paragraphs:
    if '<DATE>' in paragraph.text:
        print('found date: ', paragraph.text)
It is found if placed in normal text, but not inside a textbox.
A workaround for textboxes that contain only formatted text is to use a floating, formatted table. It can be styled almost like a textbox (frames, colours, etc.) and is easily accessible by the docx API.
doc = Document('test.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if '<DATE>' in paragraph.text:
                    print('found date: ', paragraph.text)
Not via the API, not yet at least. You'd have to uncover the XML structure it lives in and go down to the lxml level and perhaps XPath to find it. Something like this might be a start:
body = doc._body
# assuming differentiating container element is w:textBox
text_box_p_elements = body.xpath('.//w:textBox//w:p')
I have no idea whether textBox is the actual element name here, you'd have to sort that out with the rest of the XPath path details, but this approach will likely work. I use similar approaches frequently to work around features that aren't built into the API yet.
opc-diag is a useful tool for inspecting the XML. The basic approach is to create a minimally small .docx file containing the type of thing you're trying to locate. Then use opc-diag to inspect the XML Word generates when you save the file:
$ opc browse test.docx document.xml
http://opc-diag.readthedocs.org/en/latest/index.html
