HTML parser in Python without using external libraries

I've been assigned a problem at university.
The problem is as follows:
I have to make a program, in a language of my choice (I decided on Python since I'm most comfortable with it), that takes an HTML file name as input, parses the file (if it exists), and allows CRUD operations on it. The catch is that I can't use external libraries or packages, only ones that I've written myself.
How should I approach the parser?
I'd have to have a structure in which I can save the parsed data. I was thinking about a list of tuples, each tuple consisting of the opening tag, the content, and the closing tag (if there is one), or discarding the closing tags altogether and focusing on the opening ones and the content. As for the parser, I was thinking of reading the file's content into a string variable and trying to separate the different tags and their content. I don't know if either of those is a good idea, or if there is a better way that I haven't considered.
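To make the idea concrete, here is a rough, untested sketch of the kind of tokenizer I have in mind (the file name and the token labels are just placeholders):
# rough sketch: split raw HTML into ("tag", ...) and ("text", ...) pieces by scanning for '<' and '>'
def tokenize(html):
    tokens = []
    pos = 0
    while pos < len(html):
        if html[pos] == "<":
            end = html.find(">", pos)
            if end == -1:                      # malformed: no closing '>', keep the rest as text
                tokens.append(("text", html[pos:]))
                break
            tokens.append(("tag", html[pos:end + 1]))
            pos = end + 1
        else:
            end = html.find("<", pos)
            if end == -1:
                end = len(html)
            text = html[pos:end]
            if text.strip():                   # skip whitespace-only runs between tags
                tokens.append(("text", text))
            pos = end
    return tokens

with open("page.html") as f:
    print(tokenize(f.read()))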
I'd appreciate any help. Thank you in advance!

Related

How can I see what's wrong with an XML without looking at the data in that XML

I've got some badly-formed XML files that I'm processing with Python, and I need to figure out what's wrong with them (i.e. what the errors are) without actually looking at the data (the files are sensitive client data).
I figure there should be a way to sanitize the XML (i.e. remove all content in all nodes) but keep the tags, so that I can see any structural issues.
However, ElementTree doesn't return any detailed information about mismatched tags - just a line number and a character position which is useless if I can't reference the original XML.
Does anyone know how I can either sanitize the XML so I can view it, or get more detailed error messages for badly formed XML (that won't return tag contents)? I could write a custom parser to strip content, but I wanted to exhaust other options first.
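For reference, the crude kind of content stripper I have in mind would be something along these lines (untested; it blanks out the text between tags but keeps the markup and the newlines, so line numbers still roughly match the original; attribute values, comments and CDATA are not handled):
import re

# crude sketch: blank out the text between tags so only the markup skeleton remains;
# newlines are preserved so line numbers still roughly line up with the original file
def strip_content(xml_text):
    return re.sub(r">([^<]*)<",
                  lambda m: ">" + "\n" * m.group(1).count("\n") + "<",
                  xml_text)

with open("bad.xml") as f:
    skeleton = strip_content(f.read())

with open("bad_skeleton.xml", "w") as f:
    f.write(skeleton)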
It's hard enough to automatically fix markup problems when you can look at the file. If you're not permitted to see the document contents, forget about having any reasonable hope of fixing such doubly undefined problems.
Your best bet is to fix the bad "XML" at its source.
If you can't do that, I suggest that you use a tool listed in How to parse invalid (bad / not well-formed) XML? to attempt to automatically repair the well-formedness problem. Then, after you actually have XML, you can use XML tools to strip or sanitize content (if that's even still necessary at that point).

How to remove all the unnecessary tags and signs from html files?

I am trying to extract "only" text information from 10-K reports (e.g. company proxy reports) on SEC's EDGAR system by using Python's BeautifulSoup or HTMLParser. However, the parsers that I am using do not seem to work well on the 'txt'-format files, which include a large portion of meaningless signs and tags along with some XBRL information that is not needed at all. When I apply the parser directly to the 'htm'-format files, which are more or less free of the meaningless tags, the parser seems to work relatively fine.
"""for Python 3, from urllib.request import urlopen"""
from urllib2 import urlopen
from bs4 import BeautifulSoup
"""for extracting text data only from txt format"""
txt = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/0001660156-16-000019.txt")
bs_txt = BeautifulSoup(txt.read())
bs_txt_text = bs_txt.get_text()
len(bs_txt_text) # 400051
"""for extracting text data only from htm format"""
html = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/f201510kzec2_10k.htm")
bs_html = BeautifulSoup(html.read())
bs_html_text = bs_html.get_text()
len(bs_html_text) # 98042
But the issue is that I have to rely on the 'txt'-format files, not the 'htm' ones. So my question is: is there any way to remove all the meaningless signs and tags from those files and extract only the text, like the text extracted directly from the 'htm' files? I am relatively new to parsing with Python, so if you have any idea on this, it would be of great help. Thank you in advance!
The best way to deal with XBRL data is to use an XBRL processor such as the open-source Arelle (note: I have no affiliation with them) or other proprietary engines.
You can then look at the data with a higher level of abstraction. In terms of the XBRL data model, the process you describe in the question involves
looking for concepts that are text blocks (textBlockItemType) in the taxonomy;
retrieving the value of the facts reported against these concepts in the instance;
additionally, obtaining some meta-information regarding it: who (reporting entity), when (XBRL period), what the text is about (concept metadata and documentation), etc.
An XBRL processor will save you the effort of resolving the entire DTS as well as dealing with the complexity of the low-level syntax.
The second most appropriate way is to use an XML parser, maybe with an XML Schema engine as well as XQuery or XSLT, but this will require more work as you will need to either:
look at the XML Schema (XBRL taxonomy schema) files, recursively navigating them and looking for text block concepts, deal with namespaces, links, and so on (which an XBRL processor shields you from)
or only look at the instance, ideally the XML file (e.g., https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/zeci-20151231.xml ) with a few hacks (such as taking XML elements whose names end with TextBlock, sketched below), but this is at your own risk and not recommended, as it bypasses the taxonomy.
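For illustration only, that hack could look roughly like this with nothing but the standard library; it assumes the instance document has been downloaded locally and relies on the (unguaranteed) naming convention that text block concepts end in "TextBlock":
import xml.etree.ElementTree as ET

# sketch of the "local name ends with TextBlock" hack (not recommended: it bypasses the taxonomy)
tree = ET.parse("zeci-20151231.xml")          # the XBRL instance document, saved locally

for elem in tree.iter():
    local_name = elem.tag.rsplit("}", 1)[-1]  # elem.tag looks like "{namespace}LocalName"
    if local_name.endswith("TextBlock") and elem.text:
        # the value of such a fact is typically escaped (X)HTML markup
        print(local_name, len(elem.text))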
Finally, as you suggest in the original question, you can also look at the document-format files (HTML, etc) rather than at the data files of the SEC filing, however in this case it defeats the purpose of using XBRL, which is to make the data understandable by a computer thanks to tags and contexts, and it may miss important context information associated with the text -- a bit like opening a spreadsheet file with a text/hex editor.
Of course, there are use cases that could justify using that last approach such as running natural language processing algorithms. All I am saying is that this is then outside of the scope of XBRL.
There is an HTML tag stripper on the pyparsing wiki Examples page. It does not try to build an HTML document; it merely looks for HTML and script tags and strips them out.

Python - editing local HTML files - Should I edit all of the content as a one string or as an array line by line?

Just to be clear this is not a scraping question.
I'm trying to automate some editing of similar HTML files. This involves removing content between tags.
When editing HTML files locally, is it easier to open() the file and dump the content into one string, or to work through it line by line, so that it's easier to apply a regular expression?
Thanks
For structured markup like HTML, it is better to use a parser like BeautifulSoup than regular expressions. A few reasons for this include better results for malformed HTML and decreased complexity (you don't need to reinvent the wheel).
Considering the question at face value though, it seems easier to split the HTML up into lines using readlines so that you are dealing with only one line at a time when applying regular expressions.
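For example, a minimal sketch of the BeautifulSoup route; the tag name and class used here are hypothetical, so adapt the selector to whatever content you actually want to remove:
from bs4 import BeautifulSoup

with open("page.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# hypothetical example: empty out every <div class="boilerplate"> element
for div in soup.find_all("div", class_="boilerplate"):
    div.clear()          # removes the children/text but keeps the tag itself
    # div.decompose()    # use this instead to remove the whole tag

with open("page.html", "w") as f:
    f.write(str(soup))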
I suggest that instead of creating your own templating language (which is what this task amounts to), you use one of the many which already exist, and use that to perform the necessary operations. Try Jinja2, Django Templates, or Cheetah to see what you fancy. There are also many others.
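As a tiny, hypothetical sketch of what the templating route looks like with Jinja2 (you generate the page from a template rather than edit the finished HTML):
from jinja2 import Template

# hypothetical: render the page from a template instead of editing finished HTML afterwards
template = Template("<html><body><h1>{{ title }}</h1>{{ body }}</body></html>")
print(template.render(title="Report", body="<p>Hello</p>"))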

Need suggestions for designing a content organizer

I am trying to write a Python program which could take content and categorize it based on tags. I am using Nepomuk to tag files and PyQt for the GUI. The problem is, I am unable to decide how to save the content. Right now, I am saving each entry individually to a text file in a folder. When I need to read the contents, I tell the program to get all the files in that folder and then perform a read operation on each file. Since the number of files is small now (fewer than 20), this approach is decent enough. But I am worried that when the file count increases, this method will become inefficient. Is there any other method to save content efficiently?
Thanks in advance.
You could use the sqlite3 module from the stdlib. The data will be stored in a single file. The code might even be simpler than the code for reading all the ad hoc text files by hand.
You could always export the data in a format suitable for sharing in your case.
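A minimal sketch of what that could look like; the database file name and table layout are just an assumption about how you might store the entries:
import sqlite3

conn = sqlite3.connect("content.db")   # everything lives in this single file
conn.execute("CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, tag TEXT, content TEXT)")

# write an entry
conn.execute("INSERT INTO entries (tag, content) VALUES (?, ?)", ("recipes", "some content"))
conn.commit()

# read back everything with a given tag
for (content,) in conn.execute("SELECT content FROM entries WHERE tag = ?", ("recipes",)):
    print(content)

conn.close()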

What is the most efficient way of extracting information from a large number of xml files in python?

I have a directory full (~10³–10⁴) of XML files from which I need to extract the contents of several fields.
I've tested different xml parsers, and since I don't need to validate the contents (expensive) I was thinking of simply using xml.parsers.expat (the fastest one) to go through the files, one by one to extract the data.
Is there a more efficient way? (simple text matching doesn't work)
Do I need to issue a new ParserCreate() for each new file (or string) or can I reuse the same one for every file?
Any caveats?
Thanks!
Usually, I would suggest using ElementTree's iterparse, or for extra speed, its counterpart from lxml. Also try to use multiprocessing (built into the standard library since Python 2.6, formerly the third-party processing package) to parallelize.
The important thing about iterparse is that you get the element (sub-)structures as they are parsed.
import xml.etree.cElementTree as ET
xml_it = ET.iterparse("some.xml")
event, elem = next(xml_it)  # was xml_it.next(); the next() built-in works on Python 2.6+ and 3.x
event will always be the string "end" in this case, but you can also initialize the parser to tell you about new elements as they are parsed ("start" events). You don't have any guarantee that all child elements will have been parsed at that point, but the attributes are there, if that is all you are interested in.
Another point is that you can stop reading elements from the iterator early, i.e. before the whole document has been processed.
If the files are large (are they?), there is a common idiom to keep memory usage constant just as in a streaming parser.
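That idiom usually looks something like the following; record and field are placeholders for whatever your documents actually contain:
import xml.etree.cElementTree as ET

# stream through the document and throw each subtree away once it has been handled
for event, elem in ET.iterparse("some.xml", events=("end",)):
    if elem.tag == "record":                           # placeholder element name
        print(elem.attrib, elem.findtext("field"))     # placeholder field name
        elem.clear()                                   # frees the subtree, keeps memory roughly constant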
The quickest way would be to match strings (with, e.g., regular expressions) instead of parsing XML - depending on your XMLs this could actually work.
But the most important thing is this: instead of thinking through several options, just implement them and time them on a small set. This will take roughly the same amount of time, and will give you real numbers to drive you forward.
EDIT:
Are the files on a local drive or network drive? Network I/O will kill you here.
The problem parallelizes trivially - you can split the work among several computers (or several processes on a multicore computer).
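A sketch of the single-machine version using the standard library; extract_fields and the field names inside it are placeholders for whatever per-file parsing you settle on:
import glob
import xml.etree.ElementTree as ET
from multiprocessing import Pool

def extract_fields(path):
    # placeholder per-file work: pull a couple of (hypothetical) fields out of one document
    root = ET.parse(path).getroot()
    return path, root.findtext("title"), root.findtext("date")

if __name__ == "__main__":
    pool = Pool()                                   # one worker process per core by default
    for result in pool.imap_unordered(extract_fields, glob.glob("data/*.xml")):
        print(result)
    pool.close()
    pool.join()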
If you know that the XML files are generated using the same algorithm every time, it might be more efficient not to do any XML parsing at all. E.g. if you know that the data is in lines 3, 4, and 5, you might read through the file line by line and then use regular expressions.
Of course, that approach would fail if the files are not machine-generated, or originate from different generators, or if the generator changes over time. However, I'm optimistic that it would be more efficient.
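A sketch of that line-oriented approach; the line numbers and the pattern are purely illustrative:
import re

# illustrative only: assumes the interesting values always sit on lines 3-5
value_re = re.compile(r"<(\w+)>([^<]*)</\1>")

def extract(path):
    fields = {}
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if 3 <= line_no <= 5:
                for name, value in value_re.findall(line):
                    fields[name] = value
            elif line_no > 5:
                break                                # nothing else of interest in this file
    return fields

print(extract("some.xml"))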
Whether or not you recycle the parser objects is largely irrelevant. Many more objects will get created, so a single parser object doesn't really count much.
One thing you didn't indicate is whether or not you're reading the XML into a DOM of some kind. I'm guessing that you're probably not, but on the off chance you are, don't. Use xml.sax instead. Using SAX instead of DOM will get you a significant performance boost.
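A minimal xml.sax sketch, with title standing in for whichever fields you actually need:
import xml.sax

class FieldHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":                  # placeholder element name
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        if self.in_title:
            self.titles[-1] += content       # text may arrive in several chunks

    def endElement(self, name):
        if name == "title":
            self.in_title = False

handler = FieldHandler()
xml.sax.parse("some.xml", handler)
print(handler.titles)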
