Parsing and editing HTML files using Python

The issue is the following:
I have a basic auto-generated HTML file, produced as a dump from an object database. It holds table-based information, and the structure of the file is the same for each generation, with generally coherent content.
I have to process this file further, add some remarks, etc., so I want to edit the HTML a bit: say, add an extra table cell with a writable text field for remarks, and maybe a final button that generates some additional output. Now the questions:
I chose to write a Python script to handle these changes to the file. Is this the right choice, or can you suggest something better?
For now I'm dealing with it as follows:
1) Make a working copy of the base file
2) Read the working copy into a string in Python:
content = content_file.read()
3) Run it through an html.parser object:
ModifyHtmlParser.feed(content)
4) Using the overridden base-class methods of the HTML parser, search for the interesting parts of the tags:
def handle_starttag(self, tag, attrs):
    # print("Encountered a start tag:", tag)
    if tag == "tr":
        print("Table row start!")
        offset = self.getpos()
        tagText = self.get_starttag_text()
As a result I only get an immutable view of the input and the positions of the marked tags, and I feel like I'm heading into a dead end... Any ideas on how I should rework my approach? Is there a particular library that could be useful?

I would recommend you use the following general approach.
1) Load and parse the HTML into a convenient in-memory tree representation using any of the existing libraries for such tasks.
2) Find the relevant nodes in the tree. (Most libraries from step 1 provide some form of XPath and/or CSS selectors. Both allow you to find all nodes that satisfy a particular rule; in your case, the rule is probably "tr which ...".)
3) Process the found nodes individually (most libraries from step 1 will let you edit the tree in place).
4) Write out either the modified tree or a newly generated tree.
Here is one particular example of how you could implement the above. (The exact choice of libraries is somewhat flexible; you have multiple options here.)
There are multiple options for the HTML parsing and representation library. The most common recommendation I hear these days is lxml.
lxml provides both CSS selector support and XPath support.
See the lxml etree documentation.
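As a rough sketch of what that could look like (the file names, the //tr rule, and the remarks field are placeholders for illustration, not a drop-in solution):
from lxml import etree, html

# 1) Parse the working copy into an in-memory tree.
tree = html.parse("workcopy.html")  # placeholder file name

# 2) Find the relevant nodes; adjust the XPath rule to your table.
for row in tree.xpath("//tr"):
    # 3) Edit the tree in place: append a cell holding a writable text field.
    cell = etree.SubElement(row, "td")
    etree.SubElement(cell, "input", type="text", name="remark")

# 4) Write the modified tree back out.
tree.write("workcopy_with_remarks.html", method="html")
If you prefer CSS selectors over XPath, the same rule can be expressed through lxml's cssselect support (which needs the cssselect package installed).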

Related

HTML parser in Python without using external libraries

I've been assigned a problem at university.
The problem is as follows:
I have to make a program, in a language of my choice (I decided on Python since I'm most comfortable with it), that takes an HTML file name as input, parses the file (if it exists), and allows CRUD operations on it. The catch is that I can't use external libraries or packages, only ones that I've written myself.
How should I approach the parser?
I'd have to have a structure in which I can save the parsed data. I was thinking about a list of tuples, each tuple consisting of the opening tag, the content, and the closing tag (if there is one), or discarding the closing tags altogether and focusing on the opening ones and the content. As for the parser, I was thinking of reading the file's content into a string variable and trying to separate the different tags and their content. I don't know if either of those is a good idea, or if there is a better way that I haven't considered.
I'd appreciate any help. Thank you in advance!
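Purely as an illustration of the tuple idea described above (standard library only, since external packages are off-limits; the regular expression and the flat tuple layout are simplifications and won't handle nesting or every quirk of real HTML):
import re

# A token is either a tag (<...>) or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")

def parse(html_text):
    """Return a list of (opening_tag, content, closing_tag) tuples."""
    parsed = []
    open_tag, content = None, []
    for token in TOKEN.findall(html_text):
        token = token.strip()
        if not token:
            continue
        if token.startswith("</"):            # closing tag
            if open_tag is not None:
                parsed.append((open_tag, " ".join(content), token))
            open_tag, content = None, []
        elif token.startswith("<"):           # opening tag
            open_tag, content = token, []
        else:                                 # plain text content
            content.append(token)
    return parsed

print(parse("<html><body><p>Hello</p><p>World</p></body></html>"))
# [('<p>', 'Hello', '</p>'), ('<p>', 'World', '</p>')]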

How to remove all the unnecessary tags and signs from html files?

I am trying to extract only the text information from 10-K reports (e.g. a company's proxy reports) on SEC's EDGAR system, using Python's BeautifulSoup or HTMLParser. However, the parsers I am using do not seem to work well on the 'txt'-format files, which include a large portion of meaningless signs and tags along with some XBRL information that is not needed at all. When I apply the parser directly to the 'htm'-format files, which are more or less free from such meaningless tags, it seems to work relatively fine.
"""for Python 3, from urllib.request import urlopen"""
from urllib2 import urlopen
from bs4 import BeautifulSoup
"""for extracting text data only from txt format"""
txt = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/0001660156-16-000019.txt")
bs_txt = BeautifulSoup(txt.read())
bs_txt_text = bs_txt.get_text()
len(bs_txt_text) # 400051
"""for extracting text data only from htm format"""
html = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/f201510kzec2_10k.htm")
bs_html = BeautifulSoup(html.read())
bs_html_text = bs_html.get_text()
len(bs_html_text) # 98042
But the issue is that I have to rely on the 'txt'-format files, not the 'htm' ones, so my question is: is there any way to remove all the meaningless signs and tags from those files and extract only the text information, the way it comes out of the 'htm' files? I am relatively new to parsing with Python, so if you have any idea on this, it would be of great help. Thank you in advance!
The best way to deal with XBRL data is to use an XBRL processor such as the open-source Arelle (note: I have no affiliation with them) or other proprietary engines.
You can then look at the data with a higher level of abstraction. In terms of the XBRL data model, the process you describe in the question involves
looking for concepts that are text blocks (textBlockItemType) in the taxonomy;
retrieving the value of the facts reported against these concepts in the instance;
additionally, obtaining some meta-information regarding it: who (reporting entity), when (XBRL period), what the text is about (concept metadata and documentation), etc.
An XBRL processor will save you the efforts of resolving the entire DTS as well as dealing with the complexity of the low-level syntax.
The second most appropriate way is to use an XML parser, maybe with an XML Schema engine as well as XQuery or XSLT, but this will require more work as you will need to either:
look at the XML Schema (XBRL taxonomy schema) files, recursively navigating them and looking for text block concepts, dealing with namespaces, links, and so on (which an XBRL processor shields you from);
or only look at the instance, ideally the XML file (e.g., https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/zeci-20151231.xml ) with a few hacks (such as taking XML elements whose names end with TextBlock), but this is at your own risk and not recommended, as it bypasses the taxonomy.
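If you do go that quick-hack route on the instance file, a minimal sketch with lxml might look like this (the local file name is a placeholder, and, as said above, this deliberately bypasses the taxonomy):
from lxml import etree

# Parse the XBRL instance document (downloaded locally beforehand).
tree = etree.parse("zeci-20151231.xml")  # placeholder path

for elem in tree.iter():
    if not isinstance(elem.tag, str):
        continue  # skip comments and processing instructions
    # Quick hack: keep elements whose local name ends with "TextBlock".
    if etree.QName(elem).localname.endswith("TextBlock"):
        text = "".join(elem.itertext()).strip()
        print(etree.QName(elem).localname, len(text))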
Finally, as you suggest in the original question, you can also look at the document-format files (HTML, etc.) rather than at the data files of the SEC filing; however, in this case it defeats the purpose of using XBRL, which is to make the data understandable by a computer thanks to tags and contexts, and it may miss important context information associated with the text -- a bit like opening a spreadsheet file with a text/hex editor.
Of course, there are use cases that could justify using that last approach such as running natural language processing algorithms. All I am saying is that this is then outside of the scope of XBRL.
There is an HTML tag stripper on the pyparsing wiki Examples page. It does not try to build an HTML document; it merely looks for HTML and script tags and strips them out.

Delineating divs of large chunks of text with XPath (or other)

Given a page such as this, with two jobs (we'll ignore 'Open applications' for now) fully described one after the other, I'm looking for a reliable way of extracting the individual job specs. The first goal is to extract the specs, and then hopefully wrap them in some enclosing HTML tags so that they render in a browser when saved as an HTML file.
Obviously, if I knew in advance that the class name of the top-level div were "jobitem", I could run a simple XPath like //div[@class='jobitem']
There will be several such sites though (with widely differing designs, but all with full job specs listed one after the other), and my program won't have the luxury of such class name knowledge in advance. One thing my program will know: the absolute and relative position of the job headings (<h2>, <h3> etc.). In other words, I'll be running a query like the following:
//*[self::h2 or self::h3 or self::h4][contains(., 'Country Manager')]
... resulting in an array of Python lxml XPath objects, from which relative XPaths can then be performed. Perhaps this knowledge is a starting point for grabbing all text in between each heading?
"... resulting in an array of Python lxml XPath objects, from which relative XPaths can then be performed. Perhaps this knowledge is a starting point for grabbing all text in between each heading?"
Sure (if I understand this correctly), at this point the task is straightforward using the following-sibling axis in a relative XPath:
following-sibling::div
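In Python/lxml terms, a sketch of that could look like the following (the file name and the assumption that each spec sits in sibling divs directly after its heading are illustrative):
from lxml import html

tree = html.parse("jobs_page.html")  # placeholder file name

# Locate the headings first, as in the question.
headings = tree.xpath(
    "//*[self::h2 or self::h3 or self::h4][contains(., 'Country Manager')]")

for heading in headings:
    # Relative XPath from the heading: the divs that follow it.
    spec_divs = heading.xpath("following-sibling::div")
    spec_html = "".join(html.tostring(d, encoding="unicode") for d in spec_divs)
    # Wrap the chunk so it renders on its own in a browser.
    print("<html><body>" + spec_html + "</body></html>")
Note that if several jobs share one parent element, following-sibling::div will also pick up the next job's divs, so you may need to stop at the next heading (e.g. by comparing each sibling's position with that of the following h2/h3).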

Supporting different revisions of XML format with Python LXML

I am writing a server-side process in Python that takes XML files from a directory and puts them into a database. The XML that is put in the directory is generated from forms filled out on remote laptops and sent to the server via HTTP. When we add fields to a form, tags are added to the XML, which leads to situations where one XML file has more or fewer tags than another. How can I make my server-side script robust enough to handle these scenarios?
I would do something like what is mentioned here: https://stackoverflow.com/questions/9845943/how-to-convert-xml-data-in-to-sqlite-database/9879617#9879617
There are different ways you can apply the logic in the for loop depending on any patterns in the XML, but the idea is the same. This should then let you handle the query much more smoothly depending on which values exist.
Make sure you look at http://lxml.de/tutorial.html; there are lots of great tips on using lxml.
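As a rough sketch of how the for-loop idea can tolerate forms with more or fewer fields (the record and field names here are made up for illustration):
from lxml import etree

tree = etree.parse("incoming_form.xml")  # placeholder path

rows = []
for record in tree.iter("record"):  # hypothetical per-form element
    rows.append({
        # findtext() returns the default when a tag is absent, so older
        # forms with fewer fields still yield a complete row.
        "name": record.findtext("name", default=""),
        "email": record.findtext("email", default=""),
        "comments": record.findtext("comments", default=None),
    })
# rows now has consistent keys and can be inserted into the database.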
A mini example may get you started:
from xml.dom.minidom import parseString
doc = parseString('<one><two>three</two></one>')
for twoElement in doc.getElementsByTagName('two'):
    print(twoElement.firstChild.data)  # prints "three"
Maybe you should have a look at the minidom documentation or ask further questions here. With getElementsByTagName() you can find all elements below a given node (here the whole document, doc), and of course you can search below a more specific node as well.

edit in place using xpath

Is it possible to do an in-place edit of an XML document using XPath?
I'd prefer a Python solution, but Java would be fine too.
XPath is not intended to edit a document in place, as far as I know; it is intended only to select nodes of the document. XSLT relies on XPath and can transform documents.
Regarding Python, see the answer to this question: how to use xpath in python. It also mentions libraries which can do XSLT transformations.
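In practice, the usual Python pattern is to select nodes with XPath, edit the resulting tree in memory, and then serialize it back out, rather than editing the file in place. A minimal sketch with lxml (the file name and XPath expression are placeholders):
from lxml import etree

tree = etree.parse("data.xml")  # placeholder path

# Select nodes with XPath...
for node in tree.xpath("//item[@status='pending']"):
    # ...then edit the tree in memory.
    node.set("status", "done")

# Finally, write the whole tree back to disk.
tree.write("data.xml", xml_declaration=True, encoding="utf-8")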
Using XML to store data is probably not optimal, as you are experiencing here. Editing XML is extremely costly.
One way of doing the editing is to parse the XML into a tree, insert things into that tree, and then rebuild the XML file.
Editing an XML file in place is also possible, but then you need some kind of search mechanism that finds the location you need to edit or insert into, and then writes to the file from that point. Remember to also read and rewrite the remaining data, because it will be overwritten. This is fine for inserting new tags or data, but editing existing data makes it even more complicated.
My own rule is to not use XML for storage, but to present data. So the storage facility, or some kind of middle man, needs to form xml files from the data it has.
