Querying XML with Azure Databricks


I currently have a console app which strips down xml files using xpath.
I wanted to recreate this in Databricks in order to speed up processing time. I'm processing around 150k XML files, and it takes around 4 hours with the console app.
This is a segment of the console app (I have around another 30 XPath conditions, all similar to those below). As you can see, one XML file can contain multiple "ShareClass" elements, which means one row of data per share class.
XDocument xDoc = XDocument.Parse(xml);
IEnumerable<XElement> elList = xDoc.XPathSelectElements("/x:Feed/x:AssetOverview/x:ShareClasses/x:ShareClass", nm);
foreach (var el in elList)
{
    DataRow dr = dt.NewRow();
    dr[0] = el.Attribute("Id")?.Value;
    dr[1] = Path.GetFileNameWithoutExtension(fileName);
    dr[2] = el.XPathSelectElement("x:Profile/x:ServiceProviders/x:ServiceProvider[@Type='Fund Management Company']/x:Code", nm)?.Value;
    dr[3] = el.XPathSelectElement("x:Profile/x:ServiceProviders/x:ServiceProvider[@Type='Promoter']/x:Code", nm)?.Value;
    dr[4] = el.XPathSelectElement("x:Profile/x:Names/x:Name[@Type='Full']/x:Text[@Language='ENG']", nm)?.Value;
    // ... around 30 more XPath lookups of the same shape
}
I have replaced this with the Python notebook shown below, but I'm getting very poor performance: it takes around 8 hours to run, double the time of my old console app.
The first step is to read the XML files into a dataframe as strings:
df = spark.read.text(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/Extracted", wholetext=True)
I then use xpath and explode to do the same actions as the console app. Below is a trimmed section of the code
df2 = df.selectExpr("xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/@Id') id",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Fund Management Company\"]/Code/text()') FundManagerCode",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Promoter\"]/Code/text()') PromoterCode",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Fund Management Company\"]/Name/text()') FundManagerName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Promoter\"]/Name/text()') PromoterName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/Names/Name[@Type=\"Legal\"]/Text[@Language=\"ENG\"]/text()') FullName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/FeesExpenses/Fees/Report/FeeSet/Fee[@Type=\"Total Expense Ratio\"]/Percent/text()') Fees",
).selectExpr("explode_outer(arrays_zip(id,FundManagerCode,PromoterCode,FullName,FundManagerName,PromoterName,ISIN_Code,SEDOL_Code,APIR_Code,Class_A,Class_B,KD_1,KD_2,KD_3,KD_4,AG_1,KD_5,KD_6,KD_7,MIFI_1,MIFI_2,MIFI_3,MIFI_4,MIFI_5,MIFI_6,MIFI_7,MIFI_8,MIFI_9,Prev_Perf,AG_2,MexId,AG_3,AG_4,AG_5,AG_6,AG_7,MinInvestVal,MinInvestCurr,Income,Fees)) shareclass"
).select('shareClass.*')
I feel like there must be a much simpler and quicker way to process this data though I don't have enough knowledge to know what it is.
When I first started re-writing the console app I loaded the data through spark using com.databricks.spark.xml and then tried to use JSON path to do the querying however the spark implementation of json path isn't rich enough to do some of the structure querying you can see above.
Any way anyone can think to speed up my code or do it a completely different way I'm all ears. The XML structure has lots of nesting but doesn't feel like it should be too difficult to flatten these nests as columns. I want to use as much standard functionality as possible without creating my own loops through the structure as I feel like this would be even slower.
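One direction that may be worth measuring is to parse each file once and walk the ShareClass elements, the way the console app does, instead of evaluating about 40 separate xpath() expressions against the same string. As far as I know each of those expressions has to parse the document again, and zipping document-wide arrays can misalign columns whenever one share class lacks a value. The sketch below uses only the standard library; the sample document, the three columns shown, and the assumption that the XML has no namespace are all illustrative and not taken from the real feed.

```python
import xml.etree.ElementTree as ET

# Relative path per output column, evaluated against each ShareClass element.
# Only three of the ~40 columns are shown; names mirror the notebook above.
COLUMN_PATHS = {
    "FundManagerCode": "Profile/ServiceProviders/ServiceProvider[@Type='Fund Management Company']/Code",
    "PromoterCode": "Profile/ServiceProviders/ServiceProvider[@Type='Promoter']/Code",
    "FullName": "Profile/Names/Name[@Type='Legal']/Text[@Language='ENG']",
}


def share_class_rows(xml_text):
    """Parse one document once and return one dict per ShareClass element."""
    root = ET.fromstring(xml_text)  # root is assumed to be <Feed>, no namespace
    rows = []
    for share_class in root.findall("AssetOverview/ShareClasses/ShareClass"):
        row = {"id": share_class.get("Id")}
        for column, path in COLUMN_PATHS.items():
            # findtext returns None when the element is missing, so a gap in
            # one share class cannot shift values into another row.
            row[column] = share_class.findtext(path)
        rows.append(row)
    return rows


# Hypothetical sample document, made up for illustration.
SAMPLE = """
<Feed>
  <AssetOverview>
    <ShareClasses>
      <ShareClass Id="SC1">
        <Profile>
          <ServiceProviders>
            <ServiceProvider Type="Fund Management Company"><Code>FMC1</Code></ServiceProvider>
            <ServiceProvider Type="Promoter"><Code>PR1</Code></ServiceProvider>
          </ServiceProviders>
          <Names><Name Type="Legal"><Text Language="ENG">Alpha Fund</Text></Name></Names>
        </Profile>
      </ShareClass>
      <ShareClass Id="SC2">
        <Profile>
          <ServiceProviders>
            <ServiceProvider Type="Fund Management Company"><Code>FMC2</Code></ServiceProvider>
          </ServiceProviders>
          <Names><Name Type="Legal"><Text Language="ENG">Beta Fund</Text></Name></Names>
        </Profile>
      </ShareClass>
    </ShareClasses>
  </AssetOverview>
</Feed>
"""

rows = share_class_rows(SAMPLE)
for row in rows:
    print(row)
```

A function like this could then be applied per file on the cluster (for example from a flatMap over the whole-text rows), though I have not measured how that compares on 150k files.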

Related

How to load json files with rasdaman

I'm studying array database management systems, in particular rasdaman. I understand, superficially, the architecture and how the system works with sets and multidimensional arrays instead of the tables usual in a relational DBMS. I'm trying to store my own type of data to check whether this kind of database can give me better performance for my specific problem (geospatial data in a particular format: DGGS). To do so I have created my own base type from a structure, as indicated by the documentation, then my array type, my set type, and finally a collection for testing. I'm trying to insert data into this collection with the following idea:
query_executor.execute_update_from_file("insert into test_json_dict values decode($1, 'json', '{\"formatParameters\": {\"domain\": \"[0:1000]\",\"basetype\": struct { char k, long v } } })'", "...path.../rasdapy-demo/dggs_sample.json")
I'm using the rasdapy library to work from Python instead of using rasql alone (I still use rasql to validate small things), but I have been fighting with error messages that give little to no information:
Internal error: RasnetClientComm::executeQuery(): illegal status value 5
My source file contains this kind of data:
{
    "N1": 6
}
It is a simple dict with a key and a value, and I want to save both. I also tried a bigger dict with multiple keys and values, but since the rasdaman decode function expects a base type definition (if I understand correctly), I simplified my data source to a single-entry dict. Clearly either my decode definition is wrong or my source file has the wrong format, but I haven't been able to find any examples on the web. Any ideas on how to proceed? Maybe I am approaching this whole thing from the wrong perspective and should use the OGC Web Coverage Service (WCS) standard instead? I don't understand it yet, so I have been avoiding it. Any advice or direction is greatly appreciated. Thanks in advance.
Edit:
I have been trying to load CSV data with the following format:
1 930
2 461
..
and the following query
query_executor.execute_update_from_file("insert into test_json_dict values decode($1, 'csv', '{\"formatParameters\": {\"domain\": \"[1:255]\",\"basetype\": struct { char key, long value } } })'", "...path.../rasdapy-demo/dggs_sample_4.csv")
but there are still no results, even though it looks quite similar to the CSV/JSON examples in the documentation. What could be the issue?

It seems that my problem was trying to use the rasdapy library. The library works fine, but when working with data formats like CSV and JSON it is best to use the rasql command-line tool. The documentation states:
filePaths - An array of absolute paths to input files to be decoded, e.g. ["/path/to/rgb.tif"]. This improves ingestion performance if the data is on the same machine as the rasdaman server, as the network transport is bypassed and the data is read directly from disk. Supported only for GDAL, NetCDF, and GRIB data formats.
and also it says:
As a first parameter the data to be decoded must be specified. Technically this data must be in the form of a 1D char array. Usually it is specified as a query input parameter with $1, while the binary data is attached with the --file option of the rasql command-line client tool, or with the corresponding methods in the client API.
I don't know whether rasdapy takes this into account. In any case, rasql gives much better error messages, so I recommend it to anyone having a similar problem.
An example command could be:
rasql -q 'insert into test_basic values decode($1, "csv", "{ \"formatParameters\": {\"domain\": \"[0:1,0:2]\",\"basetype\": \"long\" } }")' --out string --file "/home/rasdaman/Documents/TFM/include/DGGS-Comparison/rasdapy-demo/dggs_sample_6.csv" --user rasadmin --passwd rasadmin
using this data:
1,2,3,2,1,3
After that, you just have to make it more and more complex as you need.
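Since hand-escaping the format parameters is easy to get wrong, one option is to build that JSON with a serializer and only then embed it in the query. The following is a minimal sketch that assumes the same double-quoted, backslash-escaped form as the command above; `build_decode_insert` and the file path are hypothetical names used for illustration, and the sketch only prints the command rather than running it.

```python
import json
import shlex


def build_decode_insert(collection, data_format, domain, basetype):
    """Return an insert query whose format parameters are valid JSON."""
    params = {"formatParameters": {"domain": domain, "basetype": basetype}}
    params_json = json.dumps(params)
    # Embed the JSON in a double-quoted string literal, escaping inner quotes
    # the same way the working rasql command above does.
    escaped = params_json.replace('"', '\\"')
    return f'insert into {collection} values decode($1, "{data_format}", "{escaped}")'


query = build_decode_insert("test_basic", "csv", "[0:1,0:2]", "long")

# Argument list for the rasql client; the file path is a placeholder.
command = [
    "rasql", "-q", query,
    "--out", "string",
    "--file", "/path/to/dggs_sample.csv",
    "--user", "rasadmin",
    "--passwd", "rasadmin",
]
print(shlex.join(command))
```

Passing `command` as a list to `subprocess.run` would avoid a second layer of shell quoting.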

Efficient way to get data from lotus notes view

I am trying to get all data from a view (Lotus Notes) with LotusScript and Python (noteslib module) and export it to CSV, but the problem is that this takes too much time. I have tried two ways of looping through all documents:
import noteslib
db = noteslib.Database('database','file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1000 lines of data took 70 seconds, but the view has about 85000 lines, so getting all the data would take far too long. By comparison, when I use File->Export in Lotus Notes manually, it takes about 2 minutes to export all the data to CSV.
I also tried a second way with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.Getfirstentry()
while ent:
    row = []
    for v in ent.Columnvalues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?

It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/

Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using "GetFirstDocument" and "GetNextDocument". Using the NotesViewNavigator as mentioned in the comments will be slightly better, but not significantly.
You might get a little more performance out of your code by setting view.AutoUpdate = False, which stops the view object from refreshing when something in the backend changes. But as you only read data and do not change view data, that will not give you much of a performance boost.
My suggestion: Identify the REAL bottleneck of your code by commenting out single sections to find out when it starts to get slower:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
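The elimination steps above can also be timed rather than judged by eye. This is a rough sketch, not tested against Notes: `time_traversal` is a hypothetical helper, and `FakeView` is a stand-in that only mimics the two view methods used in the question so the example runs without COM.

```python
import time
from types import SimpleNamespace


def time_traversal(first, get_next, per_item=None, limit=1000):
    """Walk up to `limit` items and return (items visited, elapsed seconds)."""
    start = time.perf_counter()
    count = 0
    item = first
    while item is not None and count < limit:
        if per_item is not None:
            per_item(item)
        item = get_next(item)
        count += 1
    return count, time.perf_counter() - start


class FakeView:
    """Stand-in for a Notes view; replace with the real view object."""

    def __init__(self, docs):
        self._docs = docs

    def GetFirstDocument(self):
        return self._docs[0] if self._docs else None

    def GetNextDocument(self, doc):
        position = self._docs.index(doc) + 1
        return self._docs[position] if position < len(self._docs) else None


view = FakeView([SimpleNamespace(ColumnValues=[i, "x"]) for i in range(50)])

# Attempt 1: navigation only.
nav_count, nav_seconds = time_traversal(view.GetFirstDocument(), view.GetNextDocument)

# Attempt 2: navigation plus reading ColumnValues.
read_count, read_seconds = time_traversal(
    view.GetFirstDocument(), view.GetNextDocument,
    per_item=lambda doc: doc.ColumnValues,
)

# Attempt 3: navigation, reading ColumnValues, and appending to a list.
data = []
full_count, full_seconds = time_traversal(
    view.GetFirstDocument(), view.GetNextDocument,
    per_item=lambda doc: data.append(doc.ColumnValues),
)

print(nav_seconds, read_seconds, full_seconds)
```

Comparing the three timings on the first 1000 documents of the real view should show which step accounts for the 70 seconds.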

I would suspect the performance issue is using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging a Notes 'agent' to do this for you instead (LotusScript or Java maybe). Even a basic LotusScript agent can export thousands of docs per minute. A further alternative may be to look at the Notes C-API (not an easy option, and it requires API calls from Python).

Best Practices for Text Generation in Python

I'm writing a python script that generates another python script based off an external file. A small section of my code can be seen below. I haven't been exposed to many examples of these kinds of scripts, so I was wondering what the best practices were.
As seen in the last two lines of the code example, the techniques that I'm using can be unwieldy at times.
SIG_DICT_NAME = "sig_dict"
SIG_LEN_KEYWORD = "len"
SIG_BUS_IND_KEYWORD = "ind"
SIG_EP_ADDR_KEYWORD = "ep_addr"
KEYWORD_DEC = "{} = \"{}\""
SIG_LEN_KEYWORD_DEC = KEYWORD_DEC.format(SIG_LEN_KEYWORD, SIG_LEN_KEYWORD)
SIG_BUS_IND_KEYWORD_DEC = KEYWORD_DEC.format(SIG_BUS_IND_KEYWORD,
                                             SIG_BUS_IND_KEYWORD)
SIG_EP_ADDR_KEYWORD_DEC = KEYWORD_DEC.format(SIG_EP_ADDR_KEYWORD,
                                             SIG_EP_ADDR_KEYWORD)
SIG_DICT_DEC = "{} = dict()"
SIG_DICT_BODY_LINE = "{}[{}.{}] = {{{}:{}, {}:{}, {}:{}}}"
#line1 = SIG_DICT_DEC.format(SIG_DICT_NAME)
#line2 = SIG_DICT_BODY_LINE.format(SIG_DICT_NAME, x, y, z...)

You don't really see examples of this kind of thing because your solution might be a wee bit over-engineered ;)
I'm guessing that you're trying to collect some "state of things", and then you want to run a script to process that "state of things". Rather than writing a meta-script, what is typically far more convenient is to write a script that will do the processing (say, process.py), and another script that will do the collecting of the "state of things" (say, collect.py).
Then you can take the results from collect.py and throw them at process.py and write out todays_results.txt or some such:
collect.py -> process.py -> 20150207_results.txt
If needed, you can write intermediate files to disk with something like:
with open('todays_progress.txt', 'w') as f_out:
    for thing, state in states_of_things.items():
        f_out.write('{}<^_^>{}\n'.format(state, thing))
Then you can parse it back in later with something like:
with open('todays_progress.txt') as f_in:
    lines = f_in.read().splitlines()
# each line is "state<^_^>thing"; rebuild the thing -> state mapping
pairs = [line.split('<^_^>', 1) for line in lines]
states_of_things = {thing: state for state, thing in pairs}
More complicated data structures than a flat dict? Well, this is Python. There's probably more than one module for that! Off the top of my head I would suggest json if plaintext will do, or pickle if you need some more detailed structures. Two warnings with pickle: custom objects don't always get reinstantiated well, and it's vulnerable to code injection attacks, so only use it if your entire workflow is trusted.
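As a small illustration of the json suggestion, here is a sketch that writes a nested structure to disk and reads it back. The contents of `states_of_things` and the file name are made up for the example.

```python
import json
import os
import tempfile

# Hypothetical nested state that a flat "key<sep>value" file could not hold.
states_of_things = {
    "pump": {"state": "on", "readings": [1.5, 2.0]},
    "valve": {"state": "closed", "readings": []},
}

# Write to a scratch directory so the example leaves no files behind in cwd.
path = os.path.join(tempfile.mkdtemp(), "todays_progress.json")

with open(path, "w") as f_out:
    json.dump(states_of_things, f_out, indent=2)

with open(path) as f_in:
    restored = json.load(f_in)

print(restored == states_of_things)
```

Keep in mind that json only round-trips plain types such as dicts with string keys, lists, numbers, strings, booleans and None.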
Hope this helps!

You seem to be translating keyword-by-keyword.
It would almost certainly be better to read each "sentence" into a representative Python class; you could then run the simulation directly, or have each class write itself to an "output sentence".
Done correctly, this should be much easier to write and debug and produce more idiomatic output.
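A minimal sketch of that idea, assuming each "sentence" of the external file describes one signal: the `Signal` class and its fields are hypothetical, loosely modelled on the keywords in the question, and string keys are used in the generated dict instead of separately declared keyword variables.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """One parsed 'sentence' from the external file (fields are assumed)."""
    bus: str
    name: str
    length: int
    index: int
    ep_addr: int

    def to_source(self, dict_name="sig_dict"):
        """Render this signal as one line of the generated script."""
        entry = {"len": self.length, "ind": self.index, "ep_addr": self.ep_addr}
        return f"{dict_name}[{self.bus}.{self.name}] = {entry!r}"


signals = [
    Signal("bus0", "clk", 8, 0, 129),
    Signal("bus0", "data", 32, 1, 130),
]

# Assemble the generated script from the objects instead of format templates.
generated = "\n".join(["sig_dict = dict()"] + [s.to_source() for s in signals])
print(generated)
```

Each class owns its own output format, so adding a new kind of sentence means adding a class rather than another set of template constants.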

How to convert jenkins job configuration config.xml to YAML format in python to be used jenkins-job-builder?

jenkins-job-builder is a nice tool that helps me maintain jobs in YAML files. See the example in the configuration chapter.
I have lots of old Jenkins jobs, so it would be nice to have a Python script, xml2yaml, that converts an existing Jenkins job config.xml to the YAML file format.
Do you have any suggestions for a quick solution in Python?
I don't need the output to be usable in jenkins-job-builder directly; it just needs to be converted into YAML for reference.
For the conversion, some parts, such as namespaces, can be ignored.
A config.xml segment looks like:
<project>
  <logRotator class="hudson.tasks.LogRotator">
    <daysToKeep>-1</daysToKeep>
    <numToKeep>20</numToKeep>
    <artifactDaysToKeep>-1</artifactDaysToKeep>
    <artifactNumToKeep>-1</artifactNumToKeep>
  </logRotator>
  ...
</project>
The YAML output could be:
- project:
    logrotate:
      daysToKeep: -1
      numToKeep: 20
      artifactDaysToKeep: -1
      artifactNumToKeep: -1
If you are not familiar with config.xml in jenkins, you can check infra_backend-merge-all-repo job in https://ci.jenkins-ci.org

I'm writing a program that does this conversion from XML to YAML. It can dynamically query a Jenkins server and translate all the jobs to YAML.
https://github.com/ktdreyer/jenkins-job-wrecker
Right now it works for very simple jobs. I've taken a safe/pessimistic approach and the program will bail if it encounters XML that it cannot yet translate.

It's hard to tell from your question exactly what you're looking for here, but assuming you're looking for the basic structure:
Python has good support on most platforms for XML Parsing. Chances are you'll want to use something simple and easy to use like minidom. See the XML Processing Modules in the python docs for your version of python.
Once you've opened the file, looking for project and then parsing down from there and using a simple mapping should work pretty well given the simplicity of the yaml format.
from xml.dom.minidom import parse

def getText(nodelist):
    # concatenate the text nodes directly under an element
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def getTextForTag(nodelist, tag):
    # text of the first descendant element named `tag`, or '' if none
    elements = nodelist.getElementsByTagName(tag)
    if elements.length > 0:
        return getText(elements[0].childNodes)
    return ''

def printValueForTag(parent, indent, tag, valueName=''):
    value = getTextForTag(parent, tag)
    if len(value) > 0:
        if valueName == '':
            valueName = tag
        print(indent + valueName + ": " + value)

def emitLogRotate(indent, rotator):
    print(indent + "logrotate:")
    indent += '  '
    printValueForTag(rotator, indent, 'daysToKeep')
    printValueForTag(rotator, indent, 'numToKeep')

def emitProject(project):
    print("- project:")
    # all projects have log rotators, so no need to check
    emitLogRotate("    ", project.getElementsByTagName('logRotator')[0])
    # next section...

dom = parse('config.xml')
emitProject(dom)
This snippet will print just a few lines of the eventual configuration file, but it points you in the right direction for a simple translator. Based on what I've seen, there's not much room for an automatic translation scheme because of naming differences. You could streamline the code as you iterate, covering more options and making it table-driven, but that's "just a matter of programming". This will at least get you started with the DOM parsers in Python.
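If a reference dump is enough and the jenkins-job-builder names do not matter, a generic walk over the tree is another option. This is a minimal sketch under several assumptions: it keeps the XML tag names as they are, ignores attributes, does no YAML quoting, and would emit duplicate keys for repeated sibling tags. `element_to_lines` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def element_to_lines(element, indent=0):
    """Return YAML-like lines for an element and all of its descendants."""
    pad = "  " * indent
    children = list(element)
    if not children:
        # Leaf element: emit "tag: text" (attributes are ignored).
        text = (element.text or "").strip()
        return [f"{pad}{element.tag}: {text}"]
    lines = [f"{pad}{element.tag}:"]
    for child in children:
        lines.extend(element_to_lines(child, indent + 1))
    return lines


CONFIG_XML = """
<project>
  <logRotator class="hudson.tasks.LogRotator">
    <daysToKeep>-1</daysToKeep>
    <numToKeep>20</numToKeep>
  </logRotator>
</project>
"""

lines = element_to_lines(ET.fromstring(CONFIG_XML))
print("\n".join(lines))
```

For real job files, a proper YAML library would be safer for quoting and for lists, but the recursion stays the same.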

I suggest querying and accessing the xml with xpath expressions using xmlstarlet on the command line and in shell scripts. No trouble with low-level programmatical access to XML. XMLStarlet is an XPath swiss-army knife on the command line.
"xmlstarlet el" shows you the element structure of the entire XML as XPath expressions.
"xmlstarlet sel -t -c XPath-expression" will extract exactly what you want.
You may want to spend an hour (or two) brushing up on your XPath know-how in advance.
You will shed a couple of tears, once you recognize how much time you spent with programming XML access before you used XMLStarlet.

how to insert into DB from XMl using python multithreading?

Could someone please advise me on the possible approaches with Python multithreading?
I have one XML file (163 MB). My task is to:
1. read that XML file
2. insert the data into a DB (many tables)
3. record the count of inserted rows in a log file
I already have Python code that performs steps 1, 2 and 3 above. I want to speed that process up using multithreading, but I don't know where to start.
Here's XML structure.
<Content id="359366">
<Title>This title</Title>
<SortTitle>sorting</SortTitle>
<PublisherEntity id="2003">ABC Publishing Group</PublisherEntity>
<Publisher>ABC Publishing Group</Publisher>
<Imprint>Revell</Imprint>
<Language code = "en">English</Language>
<GeoRight>
<GeoCountry code = "WW" model = "Distribution">World</GeoCountry>
</GeoRight>
<Format type = "Adobe EPUB eBook">
<Identifier type = "DRMID">xxx-xxx-xx</Identifier>
<Identifier type = "ISBN">1234567</Identifier>
<SRP currency = "SGD">18.89</SRP>
<WholesaleCost currency = "SGD">11.14</WholesaleCost>
<OnSaleDate>01 Sep 2010</OnSaleDate>
<MinimumSoftwareVersion number="1.x">Adobe Digital Editions</MinimumSoftwareVersion>
<DownloadFileName>HouseonMalcolmStreet9781441213877</DownloadFileName>
<SecurityLevel value="ACS4">Adobe Content Server 4</SecurityLevel>
<ContentFileSize>473923</ContentFileSize>
<DownloadUrl>http://xxx.xx.com/</DownloadUrl>
<DownloadIDType>CRID</DownloadIDType>
<DrmInfo>
<Copy>
<Enabled>1</Enabled>
<Selections>2</Selections>
<Interval type = "Days">7</Interval>
</Copy>
<Print>
<Enabled>1</Enabled>
<Selections>20</Selections>
<Interval type = "Days">7</Interval>
</Print>
<Lend>
<Enabled>0</Enabled>
</Lend>
<ReadAloud>
<Enabled>0</Enabled>
</ReadAloud>
<Expires>
<Enabled>0</Enabled>
<Interval type = "Days">-1</Interval>
</Expires>
</DrmInfo>
</Format>
<Creator rank="1" id="923710">
<Name>name</Name>
<FileAs>Kelly, Leisha</FileAs>
<Role id="aut">Author</Role>
</Creator>
<SubTitle>A Novel</SubTitle>
<Edition></Edition>
<Series></Series>
<Coverage></Coverage>
<AgeGroup></AgeGroup>
<ContentType></ContentType>
<PublicationDate>09/01/2010</PublicationDate>
<ShortDescription>description</ShortDescription>
<FullDescription>full desc</FullDescription>
<Image type = "Cover Image">http://xxx.xx.jpg</Image>
<Image type = "Thumbnail Image">http://xxx.xx.jpg</Image>
<Subject code="FIC000000">Fiction</Subject>
<Subject code="FIC014000">Historical Fiction</Subject>
</Content>
Here's existing python code download.

I've had a look through your code. I don't think that multithreading is the answer to your problems.
Not all XML libraries are equal. lxml is a Python interface to libxml2, which is written in C, and it is the fastest I've used.
Consider, if you haven't already, which operations are comparatively expensive time-wise. File operations are expensive compared to memory access. Each call to a database is expensive. Downloading things from the internet is expensive.
I don't know what database and db interface you're using, but you should really use built-in parameterisation instead of your sanitizing functions.
I'd recommend re-structuring your code to use a batch-processing approach:
Process the entire xml file extracting the data you need into a python data structure.
Don't use separate files in the filesystem as part of your processing or caching. Try to avoid writing something to a file that you want to read later as part of the same job.
Pre-cache your table lookups e.g. create a dictionary of select name,id from table instead of 100s of calls to select id from table where name=%s.
Determine what foreign key table entries need creating in one go and create them all in one go, updating your id/name cache.
Group database updates into executemany calls if available.
If you need to tidy rows from tables where they are no longer used as a foreign key, do it at the end, with a single SQL command.
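The batching steps above can be sketched as follows. This is an illustration under assumptions, not the asker's schema: it uses the standard library's `sqlite3` as a stand-in database, a made-up `content` table with three columns, a hypothetical `<Catalog>` root element, and `iterparse` to stream the file instead of holding the whole tree in memory.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def iter_contents(source):
    """Yield (id, title, publisher) for each <Content> element in the stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Content":
            yield (elem.get("id"), elem.findtext("Title"), elem.findtext("Publisher"))
            elem.clear()  # release the children of the element just processed


def load(source, conn, batch_size=1000):
    """Insert rows in batches with parameterised SQL; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS content (id TEXT PRIMARY KEY, title TEXT, publisher TEXT)"
    )
    insert_sql = "INSERT INTO content VALUES (?, ?, ?)"
    batch, total = [], 0
    for row in iter_contents(source):
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            total += len(batch)
            batch = []
    if batch:
        conn.executemany(insert_sql, batch)
        total += len(batch)
    conn.commit()
    return total


# Tiny made-up document in the shape of the question's XML.
SAMPLE = b"""<Catalog>
  <Content id="359366"><Title>This title</Title><Publisher>ABC Publishing Group</Publisher></Content>
  <Content id="359367"><Title>Other title</Title><Publisher>ABC Publishing Group</Publisher></Content>
</Catalog>"""

conn = sqlite3.connect(":memory:")
inserted = load(io.BytesIO(SAMPLE), conn, batch_size=1)
print("inserted rows:", inserted)
```

The returned count is what would go into the log file; the placeholder syntax (`?` here) depends on the database driver in use.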

From what I understand, you can't split the reading of the XML, but depending on your XML structure and DB structure you may be able to parallelize the inserts into the database. Unfortunately, without seeing the XML and DB structure, and without knowing the constraints of the database (for instance, whether the order of the XML records has to match auto_increment ids), it's very difficult to advise on a solution that would work in your particular situation.
