Page number python-docx

I am trying to create a program in Python that can find a specific word in a .docx file and return the page number it occurred on. So far, looking through the python-docx documentation, I have been unable to find how to access the page number, or even the footer where the number would be located. Is there a way to do this using python-docx, or even just Python? If not, what would be the best way to do this?

Short answer is no, because the page breaks are inserted by the rendering engine, not determined by the .docx file itself.
However, certain clients place a <w:lastRenderedPageBreak> element in the saved XML to indicate where they broke the page last time it was rendered.
I don't know which clients do this (although I expect Word itself does) or how reliable it is, but that's the direction I would recommend if you want to work in Python. You could potentially use python-docx to get a reference to the lxml element you want (like w:document/w:body) and then use XPath to iterate through to a specific page, but thinking it through a bit, it will take some detailed development to get that working.
If you work in the native Windows MS Office API you might be able to get something better since it actually runs the Word application.
If you're generating the documents in python-docx, those elements won't be placed because it makes no attempt to render the document (nor is it ever likely to). We're also not likely to add support for w:lastRenderedPageBreak anytime soon; I'm not even quite sure what that would look like.
If you search on 'lastRenderedPageBreak' and/or 'python-docx page break' you'll see other questions/answers here that may give a little more.
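As a rough sketch of that lxml/XPath direction (the file name and search term here are placeholders, and the page count is only as trustworthy as the w:lastRenderedPageBreak markers the saving client left behind):

from docx import Document

WORD = 'Chapter'                      # placeholder search term
document = Document('example.docx')   # placeholder file name
body = document.element.body          # lxml element for w:document/w:body

page = 1
for p in body.xpath('w:p'):           # python-docx pre-binds the 'w:' prefix
    if WORD in ''.join(p.itertext()):
        print('found on (approximately) page', page)
    # the saving client drops one of these wherever it last broke a page
    page += len(p.xpath('.//w:lastRenderedPageBreak'))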

Using Python-docx: identify a page break in paragraph
from docx import Document
import re

fn = '1.docx'  # python-docx opens .docx files, not legacy .doc
document = Document(fn)

pn = 1  # running page number, counting explicit page breaks only
for p in document.paragraphs:
    r = re.match(r'Chapter \d+', p.text)
    if r:
        print(r.group(), pn)
    # an explicit page break appears as <w:br w:type="page"/> inside a run
    for run in p.runs:
        if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
            pn += 1
            print('!!', '=' * 50, pn)

Related

Database searches in separate files

I am looking for a kind of database that can search across separate files, e.g. .pdf, .xls, and .doc, that I get from different suppliers. My idea is something like this:
For example, I need to search for a part number and check various data about it. The file containing the part number should then be opened with the part number highlighted. If there are multiple hits, the database should display a list of the various files containing the searched part number. The entries in the list should act as links that open the corresponding file with the part number selected.
Does this already exist or how do I approach it?
Today, it's all assembled into a single PDF file of more than 1000 pages, and it's a time-consuming and laborious process to maintain.
I've only used VBA in connection with Excel, so maybe it's too complicated for me. But is it possible for a programmer without spending 1000 hours on it?
Please help me :-)
Either Access or Excel could do this. I noticed the Python tag. I'm sure Python could handle this as well, although it seems more like a database solution would be best. It sounds like a one-to-many scenario. See the link below for some ideas of how this technique works.
https://www.tutorialspoint.com/ms_access/ms_access_one_to_many_relationship.htm
Also, below is a link with a whole bunch of MS Access templates. Take a look at that and hopefully that will give you some ideas of how to get started.
https://www.microsoftaccessexpert.com/Microsoft-Access-Templates.aspx
I agree, keeping this in a PDF with 1000 pages is NOT the way to go!!
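If you do want to prototype the search side in Python, here is a minimal sketch of the indexing idea; the folder name and part-number pattern are made up, and real .pdf/.xls/.doc input would first need a text extractor (e.g. pdfminer.six for PDF or openpyxl for Excel):

import os
import re
from collections import defaultdict

PART_NO = re.compile(r'\b\d{6}\b')    # made-up part-number pattern

index = defaultdict(list)             # part number -> files that mention it
for root, dirs, files in os.walk('supplier_docs'):   # made-up folder
    for name in files:
        path = os.path.join(root, name)
        with open(path, errors='ignore') as f:
            for part in set(PART_NO.findall(f.read())):
                index[part].append(path)

print(index.get('123456', 'no files mention this part number'))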

Removing heading numbers from docx

I need to use python to pre-process docx (Word) documents, so that pandoc can properly convert them into markdown. One of the key requirements is that the styles of the docx document should be "cleaned up", in particular that the numbering of headings (Heading 1, Heading 2, etc.) should be removed.
Restrictions: I know how to do that using VBA (and likely could do it from python using PyWin32 or such). But it is a requirement that it must be implemented without Microsoft Windows and without LibreOffice/UNO.
How can I use the python-docx package to do that? I have looked at the documentation and there does not seem to be any proper way to do it (actually, the heading numbering style does not seem to be implemented). Did I miss something?
Unless I should use another method, such as applying a different Word template to the docx document, with the main styles correctly predefined according to my requirements? Could that be done through an available python package?
Code in VBA
This is the code in VBA that got the job done:
Sub RemoveHeaderNos()
    ' Remove the header nos
    Debug.Print "Removing header numbers and formatting..."
    For Each s In ActiveDocument.Styles
        s.LinkToListTemplate ListTemplate:=Nothing
    Next
End Sub
On terminology, I'm understanding you to mean "the numbering of heading paragraphs" as opposed to something like page numbers in the page headers, have I got that right? The two terms "heading" and "header" are unfortunately close and mean quite different things, in Word parlance anyway :)
I'm assuming your heading paragraphs are numbered, i.e. the 'Heading 1' style causes the next sequential integer to be prefixed to the heading paragraph text, like '9. Ninth section heading' (then likewise for Heading 2 -> 9.1, 9.2, etc.).
You're correct that this hasn't been implemented in python-docx yet. You would need to get as close to the XML element in question (perhaps the <w:style> element for Heading 1 for example) as possible using the python-docx API, and then use lxml calls to manipulate the XML under that.
You'd need to start with a strategy for what XML changes you need to make. opc-diag is handy for that. You can change the .docx manually (preferably a radically stripped-down, super-short document) using Word to make it look the way you want, then compare the XML before and after to discover what changes you need to make to the XML.
Then you can validate your strategy by extracting a .docx (using opc-diag), manually updating the XML with the minimum required changes, repackage it (also using opc-diag), and load it in Word to make sure it behaves as expected.
I suspect there's a way to "disconnect" the "Heading 1" style from a numbering definition in the styles.xml part that would accomplish what you're after and be a fairly straightforward handful of element changes.
Anyway, that's where I would start.
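As an untested sketch of that "disconnect" idea (file names are placeholders; verify the exact XML changes against your own document with opc-diag first), you could drop the <w:numPr> element from each heading style's paragraph properties:

from docx import Document
from docx.oxml.ns import qn

document = Document('input.docx')     # placeholder file name

for style in document.styles:
    if not (style.name or '').startswith('Heading'):
        continue
    pPr = style.element.find(qn('w:pPr'))
    if pPr is None:
        continue
    numPr = pPr.find(qn('w:numPr'))
    if numPr is not None:
        pPr.remove(numPr)             # unlink the style from its numbering
document.save('cleaned.docx')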
This issue was solved in version 1.17 of pandoc, released on 20 March 2016 ("Don't turn numbered headers into lists"). If other people run into the same issue, the best thing at this stage would be to upgrade to that version or a later one.
Nevertheless, the exploration of the various solutions with python-docx was interesting, because it pointed to a possible improvement.

Crawling all wikipedia pages for phrases in python

I need to design a program that finds certain four- or five-word phrases across the entire Wikipedia collection of articles (yes, I know it's a lot of pages, and I don't need answers calling me an idiot for doing this).
I haven't programmed much stuff like this before, so there are two issues that I would greatly appreciate some help with:
First, how would I get the program to crawl through all of the pages (i.e. NOT hardcoding each one of the millions of pages)? I have downloaded all the articles onto my hard drive, but I'm not sure how I can tell the program to iterate through each one in the folder.
EDIT: I have all the Wikipedia articles on my hard drive.
The snapshots of the pages have pictures and tables in them. How would I extract solely the main text of the article?
Your help on either of the issues is greatly appreciated!
Instead of crawling pages manually, which is slower and can get you blocked, you should download the official data dump. These dumps don't contain images, so the second problem is also solved.
EDIT: I see that you have all the articles on your computer already, so this answer might not help much.
The snapshots of the pages have pictures and tables in them. How would I extract solely the main text of the article?
If you are okay with finding the phrases within the tables, you could try using regular expressions directly, but the better choice would be to use a parser and remove all the markup. You could use Beautiful Soup to do this (you will need lxml too):
from bs4 import BeautifulSoup

# markup_from_file holds the raw markup of one page, read from disk
soup = BeautifulSoup(markup_from_file, 'xml')
# stripped_strings is a generator that yields the text of each tag in turn
list_of_strings = list(soup.stripped_strings)
text = ' '.join(list_of_strings)
BeautifulSoup produces Unicode text, so if you need to change the encoding, you can just do:
list_of_strings = [x.encode('utf-8') for x in list_of_strings]
Plus, Beautiful Soup can help you to better navigate and select from each document. If you know the encoding of the data dump, that will definitely help it go faster. The author also says that it runs faster on Python 3.
Bullet point 1: Python has a module just for the task of recursively iterating over every file and directory at a path: os.walk.
Bullet point 2: what you seem to be asking here is how to distinguish files that are images from files that are text. The magic module, available from the cheese shop (PyPI), provides Python bindings for the standard Unix utility of the same name (usually invoked as file(1)); see the sketch below.
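Here's a hypothetical use of those python-magic bindings (the path is a placeholder) to decide whether a file is worth searching:

import magic

path = 'dump/some_page.html'              # placeholder path
mime = magic.from_file(path, mime=True)   # e.g. 'text/html' or 'image/png'
if mime.startswith('text/'):
    print(path, 'looks like text, safe to search')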
You asked:
I have downloaded all the articles onto my hard drive, but I'm not sure how I can tell the program to iterate through each one in the folder
Assuming all the files are in a directory tree structure, you could use os.walk (link to Python documentation and example) to visit every file and then search each file for the phrase(s) using something like:
for line in open("filename"):
    if "search_string" in line:
        print(line)
Of course, this solution won't be featured on the cover of "Python Perf" magazine, but I'm new to Python so I'll pull the n00b card. There is likely a better way to grep within a file using Python's pre-baked modules.
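Putting the two pieces together, here is a minimal sketch (the dump path and phrase are placeholders, and it assumes the files are plain text):

import os

phrase = "four or five word phrase"                  # placeholder target
for root, dirs, files in os.walk("wikipedia_dump"):  # placeholder path
    for name in files:
        path = os.path.join(root, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            for lineno, line in enumerate(f, 1):
                if phrase in line:
                    print(path, lineno, line.strip())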

Supporting different revisions of XML format with Python LXML

I am writing a server-side process in Python that takes XML files from a directory and puts them into a database. The XML that is put in the directory is generated from forms that are filled out on remote laptops and sent via HTTP to the server. When we add fields to the form, it adds tags to the XML, which allows for situations where one XML file will have more or fewer tags than another. How can I make my server-side script robust enough to handle these scenarios?
I would do something like mentioned here: https://stackoverflow.com/questions/9845943/how-to-convert-xml-data-in-to-sqlite-database/9879617#9879617
There are different ways you can apply the logic in the for loop, depending on any patterns in the XML, but the idea is the same. This should then let you handle the query much more smoothly, depending on which values exist.
Make sure you look at http://lxml.de/tutorial.html; there are lots of great tips on using lxml.
A mini example may get you started:
from xml.dom.minidom import parseString

doc = parseString('<one><two>three</two></one>')
for twoElement in doc.getElementsByTagName('two'):
    print(twoElement.firstChild.data)
Maybe you should have a look at the minidom documentation or ask further questions here. With eggs.getElementsByTagName() you can find all matching elements below the element eggs; of course, you can be more specific than searching from doc.
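Since the question mentions lxml: one way to stay robust as tags come and go between form revisions is to read each expected field with a default and collect anything unexpected, instead of assuming a fixed structure. A small sketch (the tag names and file name are placeholders):

from lxml import etree

EXPECTED = ['name', 'email', 'phone']   # fields the current schema knows about

tree = etree.parse('form.xml')          # placeholder file name
record = {field: tree.findtext(field, default=None) for field in EXPECTED}
# tags added by newer form revisions can be collected rather than dropped
extras = {el.tag: el.text for el in tree.getroot()
          if isinstance(el.tag, str) and el.tag not in EXPECTED}
print(record, extras)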

Is there a way to use urllib to open one site until a specified object in it?

I'm using urllib to open one site and get some information on it.
Is there a way to "open" this site only up to the part I need and discard the rest (by discard I mean don't download/load the rest)?
I'm not sure what you are trying to do. If you are simply trying to parse the site to find the useful "information", then I recommend using the library BeautifulSoup. That library makes it easy to keep certain parts of the site while discarding the rest.
If, however, you are trying to save download bandwidth by downloading only a piece of the site, then you will need to do a lot more work. If that is the case, please say so in your question and I'll update the answer.
You should be able to use read(n) instead of read(); this reads a given number of bytes instead of everything. Append each chunk to the bytes already downloaded and check whether they contain what you're looking for; then you should be able to stop the download with .close().
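A sketch of that chunked approach with Python 3's urllib (the URL and the marker are placeholders):

from urllib.request import urlopen

marker = b'</title>'                           # placeholder stopping point
buf = b''
with urlopen('https://example.com') as resp:   # placeholder URL
    while marker not in buf:
        chunk = resp.read(1024)                # read 1 KiB at a time
        if not chunk:                          # end of the document reached
            break
        buf += chunk
print(len(buf), 'bytes downloaded before the marker was found')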
