Supporting different revisions of XML format with Python LXML

Supporting different revisions of XML format with Python LXML - python

I am writing a server side process in Python that takes XML in a directory and puts it into a database. The XML that is put in the directory is generated from forms that are filled out on remote laptops and sent via HTTP to the server. When we add fields to the form it adds tags to the XML which allows for situations where one XML file will have more or fewer tags than another. How can I make my server side script robust enough to handle these scenarios.

I would do something like mentioned here: https://stackoverflow.com/questions/9845943/how-to-convert-xml-data-in-to-sqlite-database/9879617#9879617
There is different ways you can apply the logic in the for loop depending on any patterns in the xml, but the idea is the same. This should then let you handle the query much more smoothly depending on which values exist.
Make sure you look at: http://lxml.de/tutorial.html there a lots of great tips with using lxml.

A mini example may get you started:
from xml.dom.minidom import parseString
doc = parseString('<one><two>three</two></one>')
for twoElement in doc.getElementsByTagName('two'):
print twoElement.firstChild.data
Maybe you should have a look at the minidom documentation or ask further questions here. But with that eggs.getElementsByTagName() you can find all elements below the tree eggs. Of course you can be more specific than searching in doc.

Related

Is pandas a local-only library

I recently started coding, but took a brief stint. I started a new job and I’m under some confidential restrictions. I need to make sure python and pandas are secure before I do this—I’ll also be talking with IT on Monday
I was wondering if pandas in python was a local library, or does the data get sent to or from elsewhere? If I write something in pandas—will the data be stored somewhere under pandas?
The best example of what I’m doing is best found on a medium article about stripping data from tables that don’t have csv Exports.
https://medium.com/#ageitgey/quick-tip-the-easiest-way-to-grab-data-out-of-a-web-page-in-python-7153cecfca58

Creating a DataFrame out of a dict, doing vectorized operations on its rows, printing out slices of it, etc. are all completely local. I'm not sure why this matters. Is your IT department going to say, "Well, this looks fishy—but some random guy on the internet says it's safe, so forget our policies, we'll allow it"? But, for what it's worth, you have this random guy on the internet saying it's safe.
However, Pandas can be used to make network requests. Some of the IO functions can take a URL instead of a filename or file object. Some of them can also use another library that does so—e.g., if you have lxml installed, read_html, will pass the filename to lxml to open, and if that filename is an HTTP URL, lxml will go fetch it.
This is rarely a concern, but if you want to get paranoid, you could imagine ways in which it might be.
For example, let's say your program is parsing user-supplied CSV files and doing some data processing on them. That's safe; there's no network access at all.
Now you add a way for the user to specify CSV files by URL, and you pass them into read_csv and go fetch them. Still safe; there is network access, but it's transparent to the end user and obviously needed for the user's task; if this weren't appropriate, your company wouldn't have asked you to add this feature.
Now you add a way for CSV files to reference other CSV files: if column 1 is #path/to/other/file, you recursively read and parse path/to/other/file and embed it in place of the current row. Now, what happens if I can give one of your users a CSV file where, buried at line 69105, there's #http://example.com/evilendpoint?track=me (an endpoint which does something evil, but then returns something that looks like a perfectly valid thing to insert at line 69105 of that CSV)? Now you may be facilitating my hacking of your employees, without even realizing it.
Of course this is a more limited version of exactly the same functionality that's in every web browser with HTML pages. But maybe your IT department has gotten paranoid and clamped down security on browsers and written an application-level sniffer to detect suspicious followup requests from HTML, and haven't thought to do the same thing for references in CSV files.
I don't think that's a problem a sane IT department should worry about. If your company doesn't trust you to think about these issues, they shouldn't hire you and assign you to write software that involves scraping the web. But then not every IT department is sane about what they do and don't get paranoid about. ("Sure, we can forward this under-1024 port to your laptop for you… but you'd better not install a newer version of Firefox than 16.0…")

How to remove all the unnecessary tags and signs from html files?

I am trying to extract "only" text information from 10-K reports (e.g. company's proxy reports) on SEC's EDGAR system by using Python's BeautifulSoup or HTMLParser. However, the parsers that I am using do not seem to work well onto the 'txt'-format files, including a large portion of meaningless signs and tags along with some xbrl information, which is not needed at all. However, when I apply the parser directly onto 'htm'-format files, which are more or less free from the issues of meaningless tags, the parser seems works relatively fine.
"""for Python 3, from urllib.request import urlopen"""
from urllib2 import urlopen
from bs4 import BeautifulSoup
"""for extracting text data only from txt format"""
txt = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/0001660156-16-000019.txt")
bs_txt = BeautifulSoup(txt.read())
bs_txt_text = bs_txt.get_text()
len(bs_txt_text) # 400051
"""for extracting text data only from htm format"""
html = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/f201510kzec2_10k.htm")
bs_html = BeautifulSoup(html.read())
bs_html_text = bs_html.get_text()
len(bs_html_text) # 98042
But the issue is I am in a position to rely on 'txt'-format files, not on 'htm' ones, so my question is, is there any way to deal with removing all the meaningless signs and tags from the files and extracting only text information as the one directly extracted from 'htm' files? I am relatively new to parsing using Python, so if you have any idea on this, it would be of great help. Thank you in advance!

The best way to deal with XBRL data is to use an XBRL processor such as the open-source Arelle (note: I have no affiliation with them) or other proprietary engines.
You can then look at the data with a higher level of abstraction. In terms of the XBRL data model, the process you describe in the question involves
looking for concepts that are text blocks (textBlockItemType) in the taxonomy;
retrieving the value of the facts reported against these concepts in the instance;
additionally, obtaining some meta-information regarding it: who (reporting entity), when (XBRL period), what the text is about (concept metadata and documentation), etc.
An XBRL processor will save you the efforts of resolving the entire DTS as well as dealing with the complexity of the low-level syntax.
The second most appropriate way is to use an XML parser, maybe with an XML Schema engine as well as XQuery or XSLT, but this will require more work as you will need to either:
look at the XML Schema (XBRL taxonomy schema) files, recursively navigating them and looking for text block concepts, deal with namespaces, links, and so on (which an XBRL processor shields you from)
or only look at the instance, ideally the XML file (e.g., https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/zeci-20151231.xml ) with a few hacks (such as taking XML elements ending with TextBlock), but this is at your own risks and not recommended as this bypasses the taxonomy.
Finally, as you suggest in the original question, you can also look at the document-format files (HTML, etc) rather than at the data files of the SEC filing, however in this case it defeats the purpose of using XBRL, which is to make the data understandable by a computer thanks to tags and contexts, and it may miss important context information associated with the text -- a bit like opening a spreadsheet file with a text/hex editor.
Of course, there are use cases that could justify using that last approach such as running natural language processing algorithms. All I am saying is that this is then outside of the scope of XBRL.

There is an HTML tag stripper at the pyparsing wiki Examples page. It does not try to build an HTML doc, it merely looks for HTML and script tags and strips them out.

How to extract financial statements only from XBRL files using Arelle's Python API?

Somehow, with the broken documentation on Arelle's python API as of date, I managed to make the API work and successfully load an XBRL file.
Anyways, my question is:
How do I extract only the STATEMENTS from the XBRL file?
Below is a screenshot from Arelle's Windows App.
URL used in this example: https://www.sec.gov/Archives/edgar/data/101984/000010198416000062/ueic-20151231.xml
I tried experimenting with the API and here's my code
from arelle import Cntlr
xbrl = Cntlr.Cntlr().modelManager.load('https://www.sec.gov/Archives/edgar/data/101984/000010198416000062/ueic-20151231.xml')
for fact in xbrl.facts:
print(fact)
but after executing this snippet, I'm bombarded with these:
I tried getting the keys available per modelFact and its a mixture between contextRef, id, decimals and unitRef which is not helpful from what I want to extract. With no documentation to help further with this, I'm at a loss here. Can someone enlighten me on how to achieve extracting only the statements?

I am doing something similar and have so far had some progress which I can share:
Going through the python code files of arelle you can detect which properties you can access for the different classes such as ModelFact, ModelContext, ModelUnit etc.
To extract the individual data, you can for example put them in a panda dataframe as follows:
factData=pd.DataFrame(data=[(fact.concept.qname,
fact.value,
fact.isNumeric,
fact.contextID,
fact.context.isStartEndPeriod,
fact.context.isInstantPeriod,
fact.context.isForeverPeriod,
fact.context.startDatetime,
fact.context.endDatetime,
fact.unitID) for fact in xbrl.facts])
Now it is easier to work with all the data, filter those that you want to use etc. If you want to reproduce the statements tables, you will also need to incorporate the links for each of the facts and than order and sort, but I haven't gotten this far either.

CLI git log statistics

I'm being faced with the task of generating statistics about the history of a Git project, and I need to produce some specific numbers and representations for various metrics - things like commits per author, commits-over-time/date histograms, that sort of thing.
The trouble is that I need all this data generated in a format that can be dealt with via a script or similar - the output has to be text, and if I can get the numbers into a Python (or similar) script, so much the better.
My question is this: are there any existing frameworks or projects that will provide such an interface? I've seen GitStats, and it does a lot of what I want, but then it dumps the results into a HTML structure instead of just providing textual or programmatic representations back to me. Are there (for example) Python bindings for a Git log parser, or even a Git statistics generator that returns a big text dump of data?
I realize it's a very specific need, and I'm willing to do some serious coding to get the precise format I want, but I'd like to think there's a starting point out there somewhere. Ideas?

How about using XML logs instead, and then you can parse the xml in python relativily easily and build your stats
see this answer for how to get an xml log from git

edit in place using xpath

Is it possible to do in place edit of XML document using xpath ?
I'd prefer any python solution but Java would be fine too.

XPath is not intended to edit document in place, as far as I know. It is intended to only select nodes of the document. XSLT relies on XPath and can transform documents.
Regarding Python, see answer to this question: how to use xpath in python. It mentions also libraries which can do XSLT transformations.

Using XML to store data is probably not optimal, as you experience here. Editing XML is extremely costly.
One way of doing the editing is parsing the xml into a tree, and then inserting stuff into that three, and then rebuilding the xml file.
Editing an xml file in place is also possible, but then you need some kind of search mechanism that finds the location you need to edit or insert into, and then write to the file from that point. Remember to also read the remaining data, because it will be overwritten. This is fine for inserting new tags or data, but editing existing data makes it even more complicated.
My own rule is to not use XML for storage, but to present data. So the storage facility, or some kind of middle man, needs to form xml files from the data it has.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.