Using DTDs to Parse XML - python

I'm attempting to parse the USPTO data that is hosted here. I have also retrieved the DTDs associated with the files. My question is: is it possible to use these to parse the files, or are they only used for validation? I have already used one as a guideline for parsing some of the documents, but doing it the way I am would require a separate parser for each DTD. Here is an example snippet of how I'm currently doing it:
# <!ELEMENT document-id (country, doc-number, kind?, name?, date?)>
from datetime import datetime

def parseDocumentId(ref):
    data = {}
    data["Country"] = ref.find("country").text
    data["ID"] = ref.find("doc-number").text
    if ref.find("date") is not None:
        d = ref.find("date").text
        try:
            date = datetime.strptime(d, "%Y%m%d").date()
        except ValueError:
            date = None
        data["Date"] = date
    if ref.find("kind") is not None:
        data["Kind"] = ref.find("kind").text
    if ref.find("name") is not None:
        data["Name"] = ref.find("name").text
    return data
This approach just seems very manual to me, so I'm curious whether there is a better way to automate the process.
Note: I'm using lxml for parsing.

DTDs will just help you follow the specification. You could create a dictionary to tokenize the document and then parse it. In any case, I believe that using lxml is the better way.

The usual approach to processing XML is to use an off-the-shelf XML parser for your programming language, and from its API construct whatever data structures you want to have. When many XML documents using the same XML vocabulary must be processed, it may make sense to generate a parser for that class of XML documents using a tool, or even to construct a parser by hand. But most programs use generic XML parsers instead of custom-constructed parsers.
To store XML documents in a database, however, it may not be necessary to employ an XML parser at all (except perhaps in checking beforehand that the documents are all in fact well-formed): all XML databases and many SQL databases have the ability to read and ingest XML documents.
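If you do want to drive parsing from the DTDs themselves, lxml can introspect a DTD. Below is a minimal sketch, assuming a hypothetical us-patent-grant.dtd and simple text-only children: it walks each declared content model to build a generic element-to-dict converter, instead of one hand-written function per element.

from lxml import etree

def content_model_names(node, acc):
    # Recursively collect child element names from a DTD content model.
    if node is None:
        return acc
    if node.type == "element":
        acc.append(node.name)
    content_model_names(node.left, acc)
    content_model_names(node.right, acc)
    return acc

dtd = etree.DTD(open("us-patent-grant.dtd"))  # hypothetical file name

# element name -> list of child element names the DTD allows
children = {decl.name: content_model_names(decl.content, [])
            for decl in dtd.iterelements()}

def element_to_dict(el):
    # Generic dict builder: one key per child the DTD allows for this tag.
    return {name: getattr(el.find(name), "text", None)
            for name in children.get(el.tag, [])}

Repeatable or nested children would still need special-casing (like the date conversion above), but this saves hard-coding each element's field list.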

Related

Querying XML with Azure Databricks

I currently have a console app which strips down XML files using XPath.
I wanted to recreate this in Databricks in order to speed up processing time; I'm processing around 150k XML files and it takes around 4 hours with the console app.
This is a segment of the console app (I have around another 30 XPath conditions, all similar to the below). As you can see, one XML file can contain multiple "ShareClass" elements, which means one row of data per share class.
XDocument xDoc = XDocument.Parse(xml);
IEnumerable<XElement> elList = xDoc.XPathSelectElements("/x:Feed/x:AssetOverview/x:ShareClasses/x:ShareClass", nm);
foreach (var el in elList)
{
    DataRow dr = dt.NewRow();
    dr[0] = el.Attribute("Id")?.Value;
    dr[1] = Path.GetFileNameWithoutExtension(fileName);
    dr[2] = el.XPathSelectElement("x:Profile/x:ServiceProviders/x:ServiceProvider[@Type='Fund Management Company']/x:Code", nm)?.Value;
    dr[3] = el.XPathSelectElement("x:Profile/x:ServiceProviders/x:ServiceProvider[@Type='Promoter']/x:Code", nm)?.Value;
    dr[4] = el.XPathSelectElement("x:Profile/x:Names/x:Name[@Type='Full']/x:Text[@Language='ENG']", nm)?.Value;
I have replaced this with a Python notebook, shown below, but I'm getting very poor performance, with it taking around 8 hours to run: double my old console app.
First step is to read the xml files into a dataframe as strings
df = spark.read.text(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/Extracted", wholetext=True)
I then use xpath and explode to do the same actions as the console app. Below is a trimmed section of the code
df2 = df.selectExpr("xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/@Id') id",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Fund Management Company\"]/Code/text()') FundManagerCode",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Promoter\"]/Code/text()') PromoterCode",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Fund Management Company\"]/Name/text()') FundManagerName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Promoter\"]/Name/text()') PromoterName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/Names/Name[@Type=\"Legal\"]/Text[@Language=\"ENG\"]/text()') FullName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/FeesExpenses/Fees/Report/FeeSet/Fee[@Type=\"Total Expense Ratio\"]/Percent/text()') Fees",
).selectExpr("explode_outer(arrays_zip(id,FundManagerCode,PromoterCode,FullName,FundManagerName,PromoterName,ISIN_Code,SEDOL_Code,APIR_Code,Class_A,Class_B,KD_1,KD_2,KD_3,KD_4,AG_1,KD_5,KD_6,KD_7,MIFI_1,MIFI_2,MIFI_3,MIFI_4,MIFI_5,MIFI_6,MIFI_7,MIFI_8,MIFI_9,Prev_Perf,AG_2,MexId,AG_3,AG_4,AG_5,AG_6,AG_7,MinInvestVal,MinInvestCurr,Income,Fees)) shareclass"
).select('shareClass.*')
I feel like there must be a much simpler and quicker way to process this data, though I don't have enough knowledge to know what it is.
When I first started rewriting the console app, I loaded the data through Spark using com.databricks.spark.xml and then tried to use JSON path to do the querying, however the Spark implementation of JSON path isn't rich enough to do some of the structural querying you can see above.
If anyone can think of a way to speed up my code, or a completely different way to do it, I'm all ears. The XML structure has lots of nesting, but it doesn't feel like it should be too difficult to flatten those nests into columns. I want to use as much standard functionality as possible, without writing my own loops through the structure, as I feel that would be even slower.
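For reference, a minimal sketch of the spark-xml route mentioned above, assuming the com.databricks.spark.xml library is attached to the cluster; the column paths are guesses based on the XPaths in the post. Reading with ShareClass as the row tag makes each row one share class up front, so the explode/arrays_zip step disappears.

# Hypothetical sketch: let spark-xml split out ShareClass rows at read time,
# instead of running many xpath() calls over whole-file strings.
df = (spark.read.format("xml")
      .option("rowTag", "ShareClass")
      .load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/Extracted"))

# Attributes surface with a leading underscore; nested elements become structs/arrays.
flat = df.select(
    df["_Id"].alias("id"),
    df["Profile.ServiceProviders.ServiceProvider"].alias("providers"),
)

The [@Type=...] predicates would then become DataFrame operations (explode providers and filter on its _Type attribute) rather than XPath string evaluation.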

How to convert jenkins job configuration config.xml to YAML format in python, to be used by jenkins-job-builder?

jenkins-job-builder is a nice tool to help me maintain jobs in YAML files; see the example in the configuration chapter.
Now I have lots of old Jenkins jobs, and it would be nice to have a Python script xml2yaml to convert the existing Jenkins job config.xml files to the YAML format.
Do you have any suggestions for a quick solution in Python?
I don't need it to be usable by jenkins-job-builder directly; it just needs to be converted into YAML for reference.
For the conversion, some parts can be ignored, like namespaces.
config.xml segment looks like:
<project>
  <logRotator class="hudson.tasks.LogRotator">
    <daysToKeep>-1</daysToKeep>
    <numToKeep>20</numToKeep>
    <artifactDaysToKeep>-1</artifactDaysToKeep>
    <artifactNumToKeep>-1</artifactNumToKeep>
  </logRotator>
  ...
</project>
The yaml output could be:
- project:
    logrotate:
      daysToKeep: -1
      numToKeep: 20
      artifactDaysToKeep: -1
      artifactNumToKeep: -1
If you are not familiar with config.xml in Jenkins, you can check the infra_backend-merge-all-repo job at https://ci.jenkins-ci.org.
I'm writing a program that does this conversion from XML to YAML. It can dynamically query a Jenkins server and translate all the jobs to YAML.
https://github.com/ktdreyer/jenkins-job-wrecker
Right now it works for very simple jobs. I've taken a safe/pessimistic approach and the program will bail if it encounters XML that it cannot yet translate.
It's hard to tell from your question exactly what you're looking for here, but assuming you want the basic structure:
Python has good support on most platforms for XML parsing. Chances are you'll want to use something simple and easy like minidom. See the XML Processing Modules in the Python docs for your version of Python.
Once you've opened the file, look for project, parse down from there, and use a simple mapping; that should work pretty well given the simplicity of the YAML format.
from xml.dom.minidom import parse

def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def getTextForTag(nodelist, tag):
    elements = nodelist.getElementsByTagName(tag)
    if elements.length > 0:
        return getText(elements[0].childNodes)
    return ''

def printValueForTag(parent, indent, tag, valueName=''):
    value = getTextForTag(parent, tag)
    if len(value) > 0:
        if valueName == '':
            valueName = tag
        print(indent + valueName + ": " + value)

def emitLogRotate(indent, rotator):
    print(indent + "logrotate:")
    indent += '  '
    printValueForTag(rotator, indent, 'daysToKeep')
    printValueForTag(rotator, indent, 'numToKeep')

def emitProject(project):
    print("- project:")
    # all projects have log rotators, so no need to check
    emitLogRotate('    ', project.getElementsByTagName('logRotator')[0])
    # next section...

dom = parse('config.xml')
emitProject(dom)
This snippet will print just a few lines of the eventual configuration file, but it puts you in the right direction for a simple translator. Based on what I've seen, there's not much room for a fully automatic translation scheme due to naming differences. You could streamline the code as you iterate, adding more options and making it table-driven, but that's "just a matter of programming"; this will at least get you started with the DOM parsers in Python.
I suggest querying and accessing the XML with XPath expressions using xmlstarlet on the command line and in shell scripts. No trouble with low-level programmatic access to XML: XMLStarlet is an XPath swiss-army knife on the command line.
"xmlstarlet el" shows you the element structure of the entire XML as XPath expressions.
"xmlstarlet sel -t -c XPath-expression" will extract exactly what you want.
Maybe you want to spend an hour (or two) refreshing your XPath know-how in advance.
You will shed a couple of tears once you recognize how much time you spent programming XML access before you used XMLStarlet.

How to insert into a DB from XML using Python multithreading?

Could someone please advise me on the possible approaches with Python multithreading?
I have one XML file (163 MB). My task requires me to:
read that XML file
insert the data into a DB (many tables)
record the count of inserted rows in a log file
I already have Python code that reads the XML file and does steps 1, 2 and 3 above. Now I want to speed up that process using multithreading, but I don't know where to start.
Here's XML structure.
<Content id="359366">
  <Title>This title</Title>
  <SortTitle>sorting</SortTitle>
  <PublisherEntity id="2003">ABC Publishing Group</PublisherEntity>
  <Publisher>ABC Publishing Group</Publisher>
  <Imprint>Revell</Imprint>
  <Language code="en">English</Language>
  <GeoRight>
    <GeoCountry code="WW" model="Distribution">World</GeoCountry>
  </GeoRight>
  <Format type="Adobe EPUB eBook">
    <Identifier type="DRMID">xxx-xxx-xx</Identifier>
    <Identifier type="ISBN">1234567</Identifier>
    <SRP currency="SGD">18.89</SRP>
    <WholesaleCost currency="SGD">11.14</WholesaleCost>
    <OnSaleDate>01 Sep 2010</OnSaleDate>
    <MinimumSoftwareVersion number="1.x">Adobe Digital Editions</MinimumSoftwareVersion>
    <DownloadFileName>HouseonMalcolmStreet9781441213877</DownloadFileName>
    <SecurityLevel value="ACS4">Adobe Content Server 4</SecurityLevel>
    <ContentFileSize>473923</ContentFileSize>
    <DownloadUrl>http://xxx.xx.com/</DownloadUrl>
    <DownloadIDType>CRID</DownloadIDType>
    <DrmInfo>
      <Copy>
        <Enabled>1</Enabled>
        <Selections>2</Selections>
        <Interval type="Days">7</Interval>
      </Copy>
      <Print>
        <Enabled>1</Enabled>
        <Selections>20</Selections>
        <Interval type="Days">7</Interval>
      </Print>
      <Lend>
        <Enabled>0</Enabled>
      </Lend>
      <ReadAloud>
        <Enabled>0</Enabled>
      </ReadAloud>
      <Expires>
        <Enabled>0</Enabled>
        <Interval type="Days">-1</Interval>
      </Expires>
    </DrmInfo>
  </Format>
  <Creator rank="1" id="923710">
    <Name>name</Name>
    <FileAs>Kelly, Leisha</FileAs>
    <Role id="aut">Author</Role>
  </Creator>
  <SubTitle>A Novel</SubTitle>
  <Edition></Edition>
  <Series></Series>
  <Coverage></Coverage>
  <AgeGroup></AgeGroup>
  <ContentType></ContentType>
  <PublicationDate>09/01/2010</PublicationDate>
  <ShortDescription>description</ShortDescription>
  <FullDescription>full desc</FullDescription>
  <Image type="Cover Image">http://xxx.xx.jpg</Image>
  <Image type="Thumbnail Image">http://xxx.xx.jpg</Image>
  <Subject code="FIC000000">Fiction</Subject>
  <Subject code="FIC014000">Historical Fiction</Subject>
</Content>
Here's the existing Python code: download.
I've had a look through your code. I don't think multithreading is the answer to your problems.
Not all XML libraries are equal: lxml is a Python interface to libxml2, which is written in C and is the fastest I've used.
Consider, if you haven't already, which operations are comparatively expensive time-wise. File operations are expensive compared to memory access. Each call to a database is expensive. Downloading things from the internet is expensive.
I don't know which database and DB interface you're using, but you should really use the built-in parameterisation instead of your sanitizing functions.
I'd recommend re-structuring your code to use a batch-processing approach:
Process the entire XML file, extracting the data you need into a Python data structure.
Don't use separate files in the filesystem as part of your processing or caching; try to avoid writing something to a file that you want to read later in the same job.
Pre-cache your table lookups, e.g. build a dictionary from select name, id from table instead of making hundreds of select id from table where name=%s calls (see the sketch after this list).
Determine which foreign-key table entries need creating and create them all in one go, updating your id/name cache.
Group database inserts into executemany calls where available.
If you need to tidy rows from tables where they are no longer used as a foreign key, do it at the end, with a single SQL command.
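A minimal sketch of the pre-caching and batching points above, using sqlite3 purely for illustration; the publishers/books schema and sample records are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cur.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, publisher_id INTEGER)")

# Pretend this came out of the XML parsing pass: (title, publisher name) pairs.
parsed_records = [("This title", "ABC Publishing Group"),
                  ("A Novel", "ABC Publishing Group")]

# 1. Pre-cache the lookup table once, instead of one SELECT per record.
publisher_id = dict(cur.execute("SELECT name, id FROM publishers"))

# 2. Create all missing foreign-key rows in one batch, then refresh the cache.
missing = {(pub,) for _, pub in parsed_records if pub not in publisher_id}
cur.executemany("INSERT INTO publishers (name) VALUES (?)", missing)
publisher_id = dict(cur.execute("SELECT name, id FROM publishers"))

# 3. Group the main inserts into a single executemany() call.
rows = [(title, publisher_id[pub]) for title, pub in parsed_records]
cur.executemany("INSERT INTO books (title, publisher_id) VALUES (?, ?)", rows)
conn.commit()
print(cur.execute("SELECT COUNT(*) FROM books").fetchone()[0], "rows inserted")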
Well, from what I understand you can't split the reading of the XML, but what you can do, depending on your XML structure and DB structure, is parallelize the inserts into the database. Unfortunately, without seeing the XML and DB structure, and without knowing the constraints of the database (for instance, keeping the order of the XML records vs auto_increment ids), it's very difficult to advise a solution that would work in your particular situation.

Parsing a text file with a special markup

I need to parse a DSL file using Python. A DSL file is a text file whose text carries a special markup, with tags used by ABBYY Lingvo.
It looks like:
activate
[m0][b]ac·ti·vate[/b] {{id=000000367}} [c rosybrown]\[[/c][c darkslategray][b]activate[/b][/c] [c darkslategray][b]activates[/b][/c] [c darkslategray][b]activated[/b][/c] [c darkslategray][b]activating[/b][/c][c rosybrown]\][/c] [p]BrE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__gb_1.wav[/s] [p]NAmE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__us_1.wav[/s] [c orange] verb[/c] [c darkgray] [/c][b]{{cf}}\~ sth{{/cf}} [/b]
[m1]{{d}}to make sth such as a device or chemical process start working{{/d}}
[m2][ex][*]• [/*][/ex][ex][*]{{x}}The burglar alarm is activated by movement.{{/x}} [/*][/ex]
[m2][ex][*]• [/*][/ex][c darkgray] [/c][ex][*]{{x}}The gene is activated by a specific protein.{{/x}} [/*][/ex]
{{Derived Word}}[m3][c darkslategray][u]Derived Word:[/u][/c] ↑<<activation>>{{/Derived Word}}
{{side_verb_forms}}[m3][c darkslategray][u]Verb forms:[/u][/c] [s]x_verb_forms_activate.jpg[/s]{{/side_verb_forms}}
Right now the only option I see is to parse this file using regexps. But I doubt that can be achieved, since the tags in this format form a hierarchy, with some of them nested inside others.
I can't use the usual XML and HTML parsers. They are perfect for creating a tree structure of the document, but they are designed for the specific tags of HTML and XML.
What is the best way to parse a file in such a format? Is there any Python library for that purpose?
"some engine which allows to create a tree basing on nesting tag structure".
Look at http://www.dabeaz.com/ply/
You may be able to define the syntax quickly and easily as a set of lexical rules and some grammar productions. For instance, here is a minimal sketch of PLY lexical rules for the [tag]...[/tag] markup above (the token names and regexes are assumptions, not something the answer or PLY provides):
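import ply.lex as lex

tokens = ('CLOSE', 'OPEN', 'TEXT')

def t_CLOSE(t):
    r'\[/[^\]]+\]'
    t.value = t.value[2:-1]   # keep just the tag name
    return t

def t_OPEN(t):
    r'\[[^\]/][^\]]*\]'
    t.value = t.value[1:-1]   # keep tag name plus any arguments
    return t

def t_TEXT(t):
    r'[^\[]+'
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('[m0][b]ac·ti·vate[/b]')
for tok in lexer:
    print(tok.type, tok.value)

A grammar production layer (ply.yacc) on top of these tokens would then build the nesting tree; the {{...}} tags and \[ escapes would need extra rules.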
If you don't like that one, here's a list of alternatives:
http://wiki.python.org/moin/LanguageParsing
Using regexps here for anything other than trivial cases will bring heartache and pain.
If you insist on using a regex (NOT RECOMMENDED), look at the methods used HERE on XML.
If by ".dsl" you mean the ABBYY Lingvo dictionary format, you may want to look at stardict. It can read the ABBYY DSL format.

How can I anonymise XML data for selected tags?

My question is as follows:
I have to read a big XML file (50 MB) and anonymise some tags/fields that contain private data, like name, surname, address, email, phone number, etc.
I know exactly which tags in the XML are to be anonymised. Something like:
s|<a>alpha</a>|MD5ed(alpha)|e;
s|<h>beta</h>|MD5ed(beta)|e;
where alpha and beta refer to whatever characters appear inside the tags; these will be hashed, probably with an algorithm like MD5.
I will only convert the tag values, not the tags themselves.
I hope I am clear enough about my problem. How do I achieve this?
You have to do something like the following in Python.
import xml.etree.ElementTree as xml  # or lxml.etree, which has the same API
import hashlib

theDoc = xml.parse("sample.xml")
for alphaTag in theDoc.findall("xpath/to/tag"):
    print(alphaTag, alphaTag.text)
    alphaTag.text = hashlib.md5(alphaTag.text.encode("utf-8")).hexdigest()
xml.dump(theDoc)
Using regexps is indeed dangerous unless you know the exact format of the file, it's easy to parse with regexps, and you are sure that it will not change in the future.
Otherwise you could indeed use XML::Twig, as below. An alternative would be to use XML::LibXML, although the file might be a bit big to load entirely into memory (then again, maybe not, memory is cheap these days), so you might have to use the pull mode, which I don't know much about.
Compact XML::Twig code:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
use Digest::MD5 'md5_base64';

my @tags_to_anonymize = qw( name surname address email phone );

# the handler for each element ($_) sets its content with the md5 and then flushes
my %handlers = map { $_ => sub { $_->set_text( md5_base64( $_->text ) )->flush } } @tags_to_anonymize;

XML::Twig->new( twig_roots => \%handlers, twig_print_outside_roots => 1 )
         ->parsefile( "my_big_file.xml" )
         ->flush;
Bottom line: don't parse XML using regex.
Use your language's DOM parsing libraries instead, and if you know the elements you need to anonymize, grab them using XPath and hash their contents by setting their innerText/innerHTML properties (or whatever your language calls them).
As Welbog said, don't try to parse XML with a regex. You'll regret it eventually.
Probably the easiest way to do this is using XML::Twig. It can process XML in chunks, which lets you handle very large files.
Another possibility would be using SAX, especially with XML::SAX::Machines. I've never really used that myself, but it's a stream-oriented system, so it should be able to handle large files. The downside is that you'll probably have to write more code to collect the text inside each tag that you care about (where XML::Twig will collect that text for you).
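For comparison, here is a minimal Python sketch of the same stream-oriented idea using the standard library's SAX support: the text of selected tags is hashed as the events stream past, and everything else is written through untouched. The tag set and file name are assumptions, and nested target tags are not handled.

import hashlib
import sys
import xml.sax
from xml.sax.saxutils import XMLGenerator

TARGETS = {"name", "surname", "address", "email", "phone"}

class Anonymiser(xml.sax.ContentHandler):
    def __init__(self, out):
        super().__init__()
        self.out = XMLGenerator(out, encoding="utf-8")
        self.buffer = None          # non-None while inside a target tag

    def startElement(self, name, attrs):
        self.out.startElement(name, attrs)
        if name in TARGETS:
            self.buffer = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)   # hold back until the tag closes
        else:
            self.out.characters(content)

    def endElement(self, name):
        if self.buffer is not None:
            text = "".join(self.buffer)
            self.out.characters(hashlib.md5(text.encode("utf-8")).hexdigest())
            self.buffer = None
        self.out.endElement(name)

xml.sax.parse("my_big_file.xml", Anonymiser(sys.stdout))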
