Delineating divs of large chunks of text with XPath (or other) - python

Given a page such as this, with two jobs (we'll ignore 'Open applications' for now) fully described one after the other, I'm looking for a reliable way of extracting the individual job specs. The first goal is to extract the specs, and then hopefully wrap them in some enclosing HTML tags so that they render in a browser when saved as an HTML file.
Obviously, if I knew in advance that the class name of the top-level div was "jobitem", I could run a simple XPath like //div[@class='jobitem']
There will be several such sites though (with widely differing designs, but all with full job specs listed one after the other), and my program won't have the luxury of such class name knowledge in advance. One thing my program will know: the absolute and relative position of the job headings (<h2>, <h3> etc.). In other words, I'll be running a query like the following:
//*[self::h2 or self::h3 or self::h4][contains(., 'Country Manager')]
... resulting in an array of Python lxml XPath objects, from which relative XPaths can then be performed. Perhaps this knowledge is a starting point for grabbing all text in between each heading?

"... resulting in an array of Python lxml XPath objects, from which relative XPaths can then be performed. Perhaps this knowledge is a starting point for grabbing all text in between each heading?"
Sure (if I understand this correctly). At this point the task is straightforward using the following-sibling axis in the relative XPath:
following-sibling::div
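Putting the two pieces together, a minimal sketch with lxml might look like the following (page_source is assumed to hold the fetched HTML of the listings page, and the job title is just a placeholder):

from lxml import html

tree = html.fromstring(page_source)

# Locate the job heading, whatever its heading level happens to be.
headings = tree.xpath(
    "//*[self::h2 or self::h3 or self::h4][contains(., 'Country Manager')]"
)

for heading in headings:
    # Step along the following-sibling axis to the div(s) holding the spec.
    for spec in heading.xpath('following-sibling::div'):
        print(html.tostring(spec, encoding='unicode'))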

Related

Page number python-docx

I am trying to create a program in Python that can find a specific word in a .docx file and return the page number it occurred on. So far, in looking through the python-docx documentation, I have been unable to find how to access the page number or even the footer where the number would be located. Is there a way to do this using python-docx, or even just Python? Or if not, what would be the best way to do this?
Short answer is no, because the page breaks are inserted by the rendering engine, not determined by the .docx file itself.
However, certain clients place a <w:lastRenderedPageBreak> element in the saved XML to indicate where they broke the page last time it was rendered.
I don't know which clients do this (although I expect Word itself does) or how reliable it is, but that's the direction I would recommend if you wanted to work in Python. You could potentially use python-docx to get a reference to the lxml element you want (like w:document/w:body) and then use XPath to iterate through to a specific page, but just thinking it through a bit, it's going to take some detailed development to get that working.
If you work in the native Windows MS Office API you might be able to get something better since it actually runs the Word application.
If you're generating the documents in python-docx, those elements won't be placed because it makes no attempt to render the document (nor is it ever likely to). We're also not likely to add support for w:lastRenderedPageBreak anytime soon; I'm not even quite sure what that would look like.
If you search on 'lastRenderedPageBreak' and/or 'python-docx page break' you'll see other questions/answers here that may give a little more.
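As a very rough sketch of that direction (it leans on python-docx internals and on w:lastRenderedPageBreak markers actually being present, so treat the file name and the whole approach as assumptions rather than a supported API):

from docx import Document

document = Document('report.docx')   # hypothetical file name
body = document.element.body         # the underlying lxml w:body element

# Count the soft page-break markers left behind the last time the document
# was rendered; this only approximates the real page count.
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
breaks = body.findall('.//w:lastRenderedPageBreak', ns)
print('approximate page count:', len(breaks) + 1)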
Using Python-docx: identify a page break in paragraph
import re

from docx import Document

fn = '1.docx'            # python-docx can only open .docx files
document = Document(fn)

pn = 1                   # running page counter
for p in document.paragraphs:
    r = re.match(r'Chapter \d+', p.text)
    if r:
        print(r.group(), pn)
    for run in p.runs:
        # A hard page break appears in the run's XML as <w:br w:type="page"/>
        if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
            pn += 1
            print('!!', '=' * 50, pn)

Parsing and editing HTML files using Python

The issue is as follows:
I have a basic, auto-generated HTML file, a dump from an object database. It's table-based information; the structure of the file is the same for each generation, with generally coherent content.
I have to process this file further, add some remarks, etc., so I want to edit this HTML file a bit: say, add an extra table cell with a writable text field for adding remarks in the file, and maybe a final button to generate some additional output. Now the questions:
I chose to write a Python script to handle these changes to the file. Is this the right choice, or can you suggest something better?
For now I'm dealing with it as follows:
1) Make a work copy of the base file
2) Read the work copy into a string in Python:
content = content_file.read()
3) Run this through an html.parser object:
ModifyHtmlParser.feed(content)
4) Using overloaded base-class methods of the HTML parser, search for the interesting parts of the tags:
def handle_starttag(self, tag, attrs):
    # print("Encountered a start tag:", tag)
    if tag == "tr":
        print("Table row start!")
        offset = self.getpos()
        tagText = self.get_starttag_text()
As a result I'm getting an immutable subset of the input and the marked tags, and for now I feel like I'm heading into a dead end... Any ideas on how I should rework my approach? Could any particular library be useful here?
I would recommend you use the following general approach.
Load and parse the HTML into a convenient in-memory tree representation using any of the existing libraries for such tasks.
Find relevant nodes in the tree. (Most libraries from part 1 will provide some form of XPath and/or CSS selectors. Both allow you to find all nodes which satisfy a particular rule. In your case, the rule is probably "tr which ...".)
Process the found nodes individually (most libraries from part 1 will let you edit the tree in-place).
Write out either the modified tree or a newly generated tree.
Here is one particular example of how you could implement the above. (The exact choice of libraries is somewhat flexible; you have multiple options here.)
There are multiple options for an HTML parsing and representation library. The most common recommendation I hear these days is lxml.
LXML provides both CSS selector support and XPath support.
See LXML etree documentation.
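A minimal sketch of that workflow with lxml (the file names and the exact cell contents are placeholders; adapt the XPath to your table's structure):

from lxml import html, etree

# 1. Load and parse the work copy into an in-memory tree.
tree = html.parse('workcopy.html')

# 2. Find the relevant nodes -- here, every table row.
for row in tree.xpath('//tr'):
    # 3. Edit the tree in place: append an extra cell with a text input
    #    for remarks.
    cell = etree.SubElement(row, 'td')
    input_el = etree.SubElement(cell, 'input')
    input_el.set('type', 'text')
    input_el.set('name', 'remark')

# 4. Write the modified tree back out.
tree.write('workcopy_with_remarks.html', method='html', encoding='utf-8')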

Crawling all wikipedia pages for phrases in python

I need to design a program that finds certain four- or five-word phrases across the entire Wikipedia collection of articles (yes, I know it's a lot of pages, and I don't need answers calling me an idiot for doing this).
I haven't programmed much stuff like this before, so there are two issues that I would greatly appreciate some help with:
First, how would I be able to get the program to crawl through all of the pages (i.e. NOT hardcoding each one of the millions of pages; I have downloaded all the articles onto my hard drive, but I'm not sure how I can tell the program to iterate through each one in the folder)?
EDIT - I have all the wikipedia articles on my hard drive
The snapshots of the pages have pictures and tables in them. How would I extract solely the main text of the article?
Your help on either of the issues is greatly appreciated!
Instead of crawling pages manually, which is slower and can be blocked, you should download the official data dump. These don't contain images, so the second problem is also solved.
EDIT: I see that you have all the articles on your computer, so this answer might not help much.
The snapshots of the pages have pictures and tables in them. How would I extract solely the main text of the article?
If you are okay with finding the phrases within the tables, you could try using regular expressions directly, but the better choice would be to use a parser and remove all the markup. You could use Beautiful Soup to do this (you will need lxml too):
from bs4 import BeautifulSoup
# produces an iterable generator that returns the text of each tag in turn
gen = BeautifulSoup(markup_from_file, 'xml').stripped_strings
list_of_strings = [x for x in gen] # list comprehension generates list
' '.join(list_of_strings)
BeautifulSoup produces unicode text, so if you need to change the encoding, you can just do:
list_of_strings = map(lambda x: x.encode('utf-8'),list_of_strings)
Plus, Beautiful Soup can help you to better navigate and select from each document. If you know the encoding of the data dump, that will definitely help it go faster. The author also says that it runs faster on Python 3.
Bullet point 1: Python has a module for exactly this task of recursively iterating over every file and directory under a path: os.walk.
Point 2: what you seem to be asking here is how to distinguish files that are images from files that are text. The magic module, available from the cheese shop (PyPI), provides Python bindings for the standard Unix utility of the same name (usually invoked as file(1)).
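A minimal sketch, assuming the python-magic binding is the one installed (the path is a placeholder):

import magic

mime = magic.from_file('/path/to/some/article', mime=True)  # placeholder path
if mime.startswith('text/'):
    print('treat as text')
else:
    print('skip: probably an image or other binary file')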
You asked:
I have downloaded all the articles onto my hard drive, but I'm not sure how I can tell the program to iterate through each one in the folder
Assuming all the files are in a directory tree structure, you could use os.walk (link to Python documentation and example) to visit every file and then search each file for the phrase(s) using something like:
for line in open("filename"):
    if "search_string" in line:
        print(line)
Of course, this solution won't be featured on the cover of "Python Perf" magazine, but I'm new to Python so I'll pull the n00b card. There is likely a better way to grep within a file using Python's pre-baked modules.
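A minimal sketch combining os.walk with the search above (the root directory and the phrase are placeholders):

import os

root_dir = '/path/to/wikipedia/articles'   # placeholder: folder of saved articles
phrase = 'some four or five word phrase'   # placeholder phrase

for dirpath, dirnames, filenames in os.walk(root_dir):
    for name in filenames:
        path = os.path.join(dirpath, name)
        with open(path, encoding='utf-8', errors='ignore') as f:
            for lineno, line in enumerate(f, start=1):
                if phrase in line:
                    print(f'{path}:{lineno}: {line.strip()}')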

Smart Automatic Scraping

I have this problem: I need to scrape lots of different HTML data sources. Each data source contains a table with lots of rows, for example country name, phone number, price per minute.
I would like to build a semi-automatic scraper which will try to:
find the right table in the HTML page automatically
-- probably by searching the text for some sample data and trying to find the common HTML element which contains both
extract the rows
-- by looking at the above two elements and selecting the same pattern
identify which column contains what
-- by using some fuzzy algorithm to best guess which column is which
export it to some Python / other list
-- cleaning everything
Does this look like a good design? What tools would you choose to do it with if you program in Python?
Does this look like a good design?
No.
What tools would you choose to do it with if you program in Python?
Beautiful Soup
find the right table in the HTML page automatically -- probably by searching the text for some sample data and trying to find the common HTML element which contains both
Bad idea. A better idea is to write a short script to find all tables and dump each table along with its XPath. A person looks at the table and copies the XPath into a script.
extract the rows -- by looking at the above two elements and selecting the same pattern
Bad idea. A better idea is to write a short script to find all tables and dump each table with its headings. A person looks at the table and configures a short block of Python code to map the table columns to data elements in a namedtuple.
identify which column contains what -- by using some fuzzy algorithm to best guess which column is which
A person can do this trivially.
export it to some Python / other list -- cleaning everything
Almost a good idea.
A person picks the right XPath to the table. A person writes a short snippet of code to map column names to a namedtuple. Given these parameters, then a Python script can get the table, map the data and produce some useful output.
Why include a person?
Because web pages are filled with notoriously bad errors.
After having spent the last three years doing this, I'm pretty sure that fuzzy logic and magically "trying to find" and "selecting the same pattern" isn't a good idea and doesn't work.
It's easier to write a simple script to create a "data profile" of the page.
It's easier to write a simple script that reads a configuration file and does the processing.
I cannot see a better solution.
It is convenient to use XPath to find the right table.
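A rough sketch of the configuration-driven split described above, where a person supplies the XPath and the column mapping and a script does the rest (the URL, XPath, and column indices are placeholders):

from collections import namedtuple

import lxml.html

Rate = namedtuple('Rate', ['country', 'phone_code', 'price_per_minute'])

# Written by hand per source, after a person has inspected the page's data profile.
SOURCE_CONFIG = {
    'url': 'http://example.com/rates.html',     # placeholder URL
    'table_xpath': '//table[2]',                # the XPath a person picked
    'columns': {'country': 0, 'phone_code': 1, 'price_per_minute': 3},
}

def scrape(config):
    tree = lxml.html.parse(config['url'])
    table = tree.xpath(config['table_xpath'])[0]
    needed = max(config['columns'].values()) + 1
    for row in table.xpath('.//tr')[1:]:        # skip the header row
        cells = [td.text_content().strip() for td in row.xpath('.//td')]
        if len(cells) < needed:
            continue                            # skip malformed rows
        yield Rate(*(cells[i] for i in config['columns'].values()))

for rate in scrape(SOURCE_CONFIG):
    print(rate)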

Can I look at the actual line that was the source of an element parsed from an html document using lxml

I have been having fun manipulating HTML with lxml. Now I want to do some manipulation of the actual file: after finding a particular element that meets my needs, I want to know whether it is possible to retrieve the source of that element.
I jumped up and down in my chair after seeing sourceline as an attribute of my element, but it did not give me what I wanted:
some_element.sourceline
As near as I can figure, sourceline can only be used when the HTML source is read in as a list of lines, so that you get the line number.
I had better add that I generated my elements with:
from lxml import html

theTree = html.fromstring(open(myFileRef).read())
the_elements = [e for e in theTree.iter()]
To be clear, I am getting None as the value for some_element.sourceline - I tested this for all 27,000 elements in my tree
One thing I am imagining doing is using the html source in an expression to find that particular place in the document, maybe to snip something out. I can't rely on the text of an element because the text is not necessarily unique.
One solution that was posted but then taken down was to use sourceline, but even after reading my file in as a list I was not able to get any value other than None for sourceline. I am going to post another question to see if someone has a working example that uses sourceline.
I just tried and discarded html.tostring(myelement), as it converts at least some encodings automatically (I am probably not phrasing that correctly). Here is an example.
A snippet of the HTML source:
<b> KEY 1A. REGIONAL PRODUCTION <br> </b>
and the output of html.tostring(the_element, method='html'):
'<b> KEY 1A.    REGIONAL PRODUCTION <br></b>'
Clearly I am not getting the original, unvarnished source.
I think I found the issue, as I was having the same problem.
I believe element.sourceline is lost if you apply any kind of XSLT transform to the document when you parse it.
When I do not transform the document I get the sourceline fine; however, when I use etree.XSLT I lose all sourceline data.
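A minimal sketch for checking this (the file and stylesheet names are placeholders, and the behaviour after the transform is what is reported above, not something guaranteed for every lxml version):

from lxml import etree, html

# Parse straight from the file so the parser can record line numbers.
tree = html.parse('page.html')                          # hypothetical file name
for el in tree.getroot().iter():
    print(el.tag, el.sourceline)                        # set by the parser

# Apply an XSLT transform and check again.
transform = etree.XSLT(etree.parse('identity.xslt'))    # hypothetical stylesheet
result = transform(tree)
for el in result.getroot().iter():
    print(el.tag, el.sourceline)                        # reportedly None after the transform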
