Could someone please advise me on the possible ways to use Python multithreading here?
I have one XML file (163 MB). My task requires me to:
1. read that XML file
2. insert the data into a DB (many tables)
3. record the count of inserted rows in a log file
I already have Python code that does steps 1, 2 and 3 above. Now I want to speed that process up using multithreading, but I don't know where to start.
Here's the XML structure.
<Content id="359366">
<Title>This title</Title>
<SortTitle>sorting</SortTitle>
<PublisherEntity id="2003">ABC Publishing Group</PublisherEntity>
<Publisher>ABC Publishing Group</Publisher>
<Imprint>Revell</Imprint>
<Language code = "en">English</Language>
<GeoRight>
<GeoCountry code = "WW" model = "Distribution">World</GeoCountry>
</GeoRight>
<Format type = "Adobe EPUB eBook">
<Identifier type = "DRMID">xxx-xxx-xx</Identifier>
<Identifier type = "ISBN">1234567</Identifier>
<SRP currency = "SGD">18.89</SRP>
<WholesaleCost currency = "SGD">11.14</WholesaleCost>
<OnSaleDate>01 Sep 2010</OnSaleDate>
<MinimumSoftwareVersion number="1.x">Adobe Digital Editions</MinimumSoftwareVersion>
<DownloadFileName>HouseonMalcolmStreet9781441213877</DownloadFileName>
<SecurityLevel value="ACS4">Adobe Content Server 4</SecurityLevel>
<ContentFileSize>473923</ContentFileSize>
<DownloadUrl>http://xxx.xx.com/</DownloadUrl>
<DownloadIDType>CRID</DownloadIDType>
<DrmInfo>
<Copy>
<Enabled>1</Enabled>
<Selections>2</Selections>
<Interval type = "Days">7</Interval>
</Copy>
<Print>
<Enabled>1</Enabled>
<Selections>20</Selections>
<Interval type = "Days">7</Interval>
</Print>
<Lend>
<Enabled>0</Enabled>
</Lend>
<ReadAloud>
<Enabled>0</Enabled>
</ReadAloud>
<Expires>
<Enabled>0</Enabled>
<Interval type = "Days">-1</Interval>
</Expires>
</DrmInfo>
</Format>
<Creator rank="1" id="923710">
<Name>name</Name>
<FileAs>Kelly, Leisha</FileAs>
<Role id="aut">Author</Role>
</Creator>
<SubTitle>A Novel</SubTitle>
<Edition></Edition>
<Series></Series>
<Coverage></Coverage>
<AgeGroup></AgeGroup>
<ContentType></ContentType>
<PublicationDate>09/01/2010</PublicationDate>
<ShortDescription>description</ShortDescription>
<FullDescription>full desc</FullDescription>
<Image type = "Cover Image">http://xxx.xx.jpg</Image>
<Image type = "Thumbnail Image">http://xxx.xx.jpg</Image>
<Subject code="FIC000000">Fiction</Subject>
<Subject code="FIC014000">Historical Fiction</Subject>
</Content>
Here's the existing Python code download.
I've had a look through your code. I don't think that multithreading is the answer to your problems.
Not all XML libraries are equal; lxml is a Python interface to libxml2, which is written in C and is the fastest I've used.
Consider, if you haven't already, which operations are comparatively expensive time-wise. File operations are expensive compared to memory access. Each call to a database is expensive. Downloading things from the internet is expensive.
I don't know what database and db interface you're using, but you should really use built-in parameterisation instead of your sanitizing functions.
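For example, with Python's DB-API drivers you pass the values separately and let the driver do the quoting. A minimal sketch using the standard-library sqlite3 module (your actual database, driver and table layout are unknown, so everything here is illustrative; other drivers use %s or :name placeholders instead of ?):
import sqlite3

# Sketch: let the driver bind the values instead of building SQL strings yourself.
conn = sqlite3.connect("catalog.db")   # hypothetical database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS content (id INTEGER, title TEXT, publisher TEXT)")
# The driver escapes/binds the values, so no hand-rolled sanitizing is needed.
cur.execute(
    "INSERT INTO content (id, title, publisher) VALUES (?, ?, ?)",
    (359366, "This title", "ABC Publishing Group"),
)
conn.commit()
conn.close()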
I'd recommend re-structuring your code to use a batch-processing approach (a rough sketch follows the list):
Process the entire xml file extracting the data you need into a python data structure.
Don't use separate files in the filesystem as part of your processing or caching. Try to avoid writing something to a file that you want to read later as part of the same job.
Pre-cache your table lookups e.g. create a dictionary of select name,id from table instead of 100s of calls to select id from table where name=%s.
Determine what foreign key table entries need creating in one go and create them all in one go, updating your id/name cache.
Group database updates into executeMany calls if available.
If you need to tidy rows from tables where they are no longer used as a foreign key, do it at the end, with a single SQL command.
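Putting those points together, here's roughly what the batch approach could look like. This is illustrative only: the file name, table and column names are made up, conn/cur are assumed to be an open DB-API connection and cursor from whatever driver you use, and the %s placeholder style is driver-specific:
import lxml.etree as etree

# 1. Parse the whole file once, collecting plain Python tuples.
rows = []
for _, content in etree.iterparse("catalog.xml", tag="Content"):
    rows.append((int(content.get("id")),
                 content.findtext("Title"),
                 content.findtext("Publisher")))
    content.clear()  # free each element so a 163 MB file stays cheap in memory

# 2. Pre-cache the lookup table: one SELECT instead of hundreds.
cur.execute("SELECT name, id FROM publisher")
publisher_ids = dict(cur.fetchall())

# 3. Create all missing foreign-key rows in one go, then refresh the cache.
missing = {pub for _, _, pub in rows if pub not in publisher_ids}
cur.executemany("INSERT INTO publisher (name) VALUES (%s)", [(p,) for p in missing])
cur.execute("SELECT name, id FROM publisher")
publisher_ids = dict(cur.fetchall())

# 4. Group the main inserts into a single executemany call.
cur.executemany(
    "INSERT INTO content (id, title, publisher_id) VALUES (%s, %s, %s)",
    [(cid, title, publisher_ids[pub]) for cid, title, pub in rows],
)
conn.commit()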
Well, from what I understand you can't split the reading of the XML, but what you can do, depending on your XML structure and DB structure, is parallelize the inserts into the database. Unfortunately, without seeing the XML and DB structure, and without knowing the constraints of the database (for instance, keeping the order of the XML records vs. auto_increment ids), it's very difficult to advise a solution that would work in your particular situation.
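For what it's worth, if your records turn out to be independent, one common pattern is a single parser thread feeding a queue of insert workers. This is only an illustrative sketch under those assumptions; parse_records(), connect() and the table layout are placeholders, not real code from the question:
import queue
import threading

record_queue = queue.Queue(maxsize=1000)
SENTINEL = None
NUM_WORKERS = 4

def insert_worker():
    conn = connect()                 # each thread gets its own DB connection
    cur = conn.cursor()
    while True:
        record = record_queue.get()
        if record is SENTINEL:
            break
        cur.execute("INSERT INTO content (id, title) VALUES (%s, %s)", record)
    conn.commit()
    conn.close()

workers = [threading.Thread(target=insert_worker) for _ in range(NUM_WORKERS)]
for t in workers:
    t.start()

for record in parse_records("catalog.xml"):   # single-threaded XML parsing
    record_queue.put(record)

for _ in workers:
    record_queue.put(SENTINEL)                # tell each worker to stop
for t in workers:
    t.join()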
Related
I currently have a console app which strips down xml files using xpath.
I wanted to recreate this in Databricks in order to speed up processing time. I'm processing around 150k XML files and it takes around 4 hours with the console app.
This is a segment of the console app (I have around another 30 XPath conditions, though all similar to the below). As you can see, one XML file can contain multiple "shareclass" elements, which means one row of data per shareclass.
XDocument xDoc = XDocument.Parse(xml);
IEnumerable<XElement> elList = xDoc.XPathSelectElements("/x:Feed/x:AssetOverview/x:ShareClasses/x:ShareClass", nm);
foreach (var el in elList)
{
    DataRow dr = dt.NewRow();
    dr[0] = el.Attribute("Id")?.Value;
    dr[1] = Path.GetFileNameWithoutExtension(fileName);
    dr[2] = el.XPathSelectElement("x:Profile/x:ServiceProviders/x:ServiceProvider[@Type='Fund Management Company']/x:Code", nm)?.Value;
    dr[3] = el.XPathSelectElement("x:Profile/x:ServiceProviders/x:ServiceProvider[@Type='Promoter']/x:Code", nm)?.Value;
    dr[4] = el.XPathSelectElement("x:Profile/x:Names/x:Name[@Type='Full']/x:Text[@Language='ENG']", nm)?.Value;
I have replaced this with a Python notebook shown below, but I'm getting very poor performance: it takes around 8 hours to run, double my old console app.
The first step is to read the XML files into a dataframe as strings:
df = spark.read.text(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/Extracted", wholetext=True)
I then use xpath and explode to do the same actions as the console app. Below is a trimmed section of the code
df2 = df.selectExpr("xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/@Id') id",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Fund Management Company\"]/Code/text()') FundManagerCode",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Promoter\"]/Code/text()') PromoterCode",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Fund Management Company\"]/Name/text()') FundManagerName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Promoter\"]/Name/text()') PromoterName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/Names/Name[@Type=\"Legal\"]/Text[@Language=\"ENG\"]/text()') FullName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/FeesExpenses/Fees/Report/FeeSet/Fee[@Type=\"Total Expense Ratio\"]/Percent/text()') Fees",
).selectExpr("explode_outer(arrays_zip(id,FundManagerCode,PromoterCode,FullName,FundManagerName,PromoterName,ISIN_Code,SEDOL_Code,APIR_Code,Class_A,Class_B,KD_1,KD_2,KD_3,KD_4,AG_1,KD_5,KD_6,KD_7,MIFI_1,MIFI_2,MIFI_3,MIFI_4,MIFI_5,MIFI_6,MIFI_7,MIFI_8,MIFI_9,Prev_Perf,AG_2,MexId,AG_3,AG_4,AG_5,AG_6,AG_7,MinInvestVal,MinInvestCurr,Income,Fees)) shareclass"
).select('shareClass.*')
I feel like there must be a much simpler and quicker way to process this data, though I don't have enough knowledge to know what it is.
When I first started re-writing the console app, I loaded the data through Spark using com.databricks.spark.xml and then tried to use JSON path to do the querying; however, the Spark implementation of JSON path isn't rich enough for some of the structural querying you can see above.
If anyone can think of any way to speed up my code, or a completely different way of doing it, I'm all ears. The XML structure has lots of nesting, but it doesn't feel like it should be too difficult to flatten these nests into columns. I want to use as much standard functionality as possible without writing my own loops through the structure, as I feel like that would be even slower.
What I am doing:
Get data from a data source (could be from an API or scraping) in the form of a dictionary
Clean/manipulate some of the fields
Combine fields from data source dictionary into new dictionaries that represent objects
Save the created dictionaries into database
Is there a pythonic way to do this? I am wondering about the whole process but I'll give some guiding questions:
What classes should I have?
What methods/classes should the cleaning of fields from the data source to objects be in?
What methods/classes should the combining/mapping of fields from the data source to objects be in?
If the method is different in scraping vs. api, please explain how and why
Here is an example:
API returns:
{
    "data": {
        "name": "<b>asd</b>",
        "story": "tame",
        "story2": "adjet"
    }
}
What you want to do:
Clean name
Create a name_story object
Set name_story.name = dict['data']['name']
Set name_story.story = dict['data']['story'] + dict['data']['story2']
Save name_story to database
(and consider that there could be multiple objects to create and multiple incoming data sources)
How would you structure this process? An interface of all classes/methods would be enough for me without any explanation.
What classes should I have?
In Python, there is no strong need to use classes. Classes are a way to manage complexity. If your solution is not complex, use functions (or, maybe, module-level code, if it is a one-time solution).
If the method is different in scraping vs. api, please explain how and why
I prefer to organize my code with respect to modularity and the principle of least knowledge, and to define clear interfaces between the parts of the module system.
Example of modular solution
You can have a module (either a function or a class) for fetching information, and it should return a dictionary with the specified fields, no matter what exactly it does.
Another module should process that dictionary and return a dictionary too (for example).
A third module can save the information from that dictionary to the database.
There is a good chance that this plan is far from what you need or want, and that you should design your module system yourself.
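To make that concrete, here is a minimal sketch of such a split (all names are hypothetical and the ... bodies are placeholders):
import re

def fetch(source_url):
    """Fetch raw data from an API or a scraped page and return it as a dict."""
    ...

def clean_html(text):
    """Strip markup like <b>...</b> from a field (naive example)."""
    return re.sub(r"<[^>]+>", "", text)

def build_name_story(raw):
    """Combine and clean fields from the raw dict into the record to persist."""
    return {
        "name": clean_html(raw["data"]["name"]),
        "story": raw["data"]["story"] + raw["data"]["story2"],
    }

def save(name_story, db):
    """Write one record to whatever storage you chose."""
    ...

def run(source_url, db):
    save(build_name_story(fetch(source_url)), db)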
And a few words about your specific points:
Clean name
Consider this stackoverflow answer
Create a name_story object
Set name_story.name = dict['data']['name']
Set name_story.story = dict['data']['story'] + dict['data']['story2']
If you want to access the attributes of an object through dot notation (as you specified in items 3 and 4), you could use either a Python namedtuple or a plain Python class. If indexed access is OK for you, use a Python dictionary.
In case of namedtuple, it will be:
from collections import namedtuple
NameStory = namedtuple('NameStory', ['name', 'story'])
name_story1 = NameStory(name=dict['data']['name'], story=dict['data']['story'] + dict['data']['story2'])
name_story2 = NameStory(name=dict2['data']['name'], story=dict2['data']['name'])
If your choice is a dictionary, it's easier:
name_story = {
    'name': dict['data']['name'],
    'story': dict['data']['story'] + dict['data']['story2'],
}
Save name_story to database
This is a much more complex question.
You can use raw SQL. The specific instructions depend on your database. Google for 'python sqlite' or 'python postgresql' or whatever fits; there are plenty of good tutorials.
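For instance, with the standard-library sqlite3 module and the name_story dictionary built above (the table layout is just an example):
import sqlite3

# Minimal raw-SQL sketch; table and column names are made up.
conn = sqlite3.connect("stories.db")
conn.execute("CREATE TABLE IF NOT EXISTS name_story (name TEXT, story TEXT)")
conn.execute(
    "INSERT INTO name_story (name, story) VALUES (?, ?)",
    (name_story["name"], name_story["story"]),
)
conn.commit()
conn.close()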
Or you can use one of the Python ORMs:
peewee
SQLAlchemy
google for more options
By the way
It's strongly recommended not to override Python built-in type names (list, dict, str, etc.), as you did in this line:
name_story.name = dict['data']['name']
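For instance, any other name avoids the problem (fetch_from_api and payload are hypothetical names):
# The variable names are just examples; the point is not to reuse the name `dict`.
payload = fetch_from_api()               # instead of: dict = fetch_from_api()
name_story.name = payload['data']['name']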
I'm attempting to parse the USPTO data that is hosted Here. I have also retrieved the DTDs associated with the files. My question is: is it possible to use these to parse the files, or are they only used for validation? I have already used one as a guideline for parsing some of the documents, but doing it the way I am would require having a separate parser for each DTD. Here is an example snippet of how I'm currently doing it.
from datetime import datetime

# <!ELEMENT document-id (country, doc-number, kind?, name?, date?)>
def parseDocumentId(ref):
    data = {}
    data["Country"] = ref.find("country").text
    data["ID"] = ref.find("doc-number").text
    if ref.find("date") is not None:
        d = ref.find("date").text
        try:
            date = datetime.strptime(d, "%Y%m%d").date()
        except (ValueError, TypeError):
            date = None
        data["Date"] = date
    if ref.find("kind") is not None:
        data["Kind"] = ref.find("kind").text
    if ref.find("name") is not None:
        data["Name"] = ref.find("name").text
    return data
This way just seems very manual to me, so I'm curious whether there is a better way to automate the process.
Note: I'm using lxml for parsing.
DTDs will just help you follow the specification. You can create a dictionary to tokenize the document and then parse it. Anyway, I believe that using lxml is the better way.
The usual approach to processing XML is to use an off-the-shelf XML parser for your programming language, and from its API construct whatever data structures you want to have. When many XML documents using the same XML vocabulary must be processed, it may make sense to generate a parser for that class of XML documents using a tool, or even to construct a parser by hand. But most programs use generic XML parsers instead of custom-constructed parsers.
To store XML documents in a database, however, it may not be necessary to employ an XML parser at all (except perhaps in checking beforehand that the documents are all in fact well-formed): all XML databases and many SQL databases have the ability to read and ingest XML documents.
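To illustrate the first approach (a generic parser plus your own data structures), here is a small, hypothetical sketch of driving the extraction with lxml and a per-element field mapping instead of one hand-written function per DTD element; the file name and mapping are assumptions based on the snippet in the question:
from lxml import etree

# Field mapping mirrors the document-id DTD fragment; add one mapping per element type.
DOCUMENT_ID_FIELDS = {"Country": "country", "ID": "doc-number",
                      "Kind": "kind", "Name": "name", "Date": "date"}

def parse_element(el, fields):
    """Collect child-element text into a dict, skipping optional children that are absent."""
    return {key: el.findtext(tag) for key, tag in fields.items()
            if el.find(tag) is not None}

tree = etree.parse("patent.xml")          # hypothetical file name
records = [parse_element(el, DOCUMENT_ID_FIELDS)
           for el in tree.getroot().iter("document-id")]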
I've just written my first script for my first proper job (rather proud).
Basically I parse a large xml file that contains 3 types of data: estates, symbols and types.
It then creates 3 .txt files listing all items of the 3 types, one file for estates, one for symbols and one for types.
What I need to do now is format the output for use in our internal wiki.
I want to be able to filter it with a drop down menu, so that I can select an "Estate" and see what "Symbols" are in the "Estate" and then see what "Type" these symbols are.
For scope, there are ~50 estates, ~26 types, and about 93,000 symbols (they all vary from day to day).
A symbol belongs to an estate and each symbol has a type.
If you want any code snippets from either the xml doc or my current script, feel free to ask, I didn't want to dump a load of code in here.
EDIT:
Here is an example of how the XML is formatted, showing the symbol name, its estate and then its type
<Symbol SymbolName="<SYMNAME>" Estate="<ESTATENAME>" TickType="<TYPE>" />
Names have been omitted for confidentiality.
EDIT2:
Had the idea of using dictionaries to better sort the parsed data.
Eg.
dictionary1 = {symbol1[estate], symbol2[estate]}
dictionary2 = {symbol1[type], symbol2[type]}
TO CLARIFY: I have a bunch of data from an XML file that needs to be written to an output file in such a way that it can be filtered on a web page (drop-down menus preferably).
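If it helps, here is a minimal sketch of that dictionary idea, keyed off the attribute names in the snippet above (the file name is hypothetical):
from collections import defaultdict
from xml.etree import ElementTree as ET

symbols_by_estate = defaultdict(list)   # estate -> list of symbol names
type_of_symbol = {}                     # symbol name -> tick type

for sym in ET.parse("symbols.xml").getroot().iter("Symbol"):
    name = sym.get("SymbolName")
    symbols_by_estate[sym.get("Estate")].append(name)
    type_of_symbol[name] = sym.get("TickType")

# symbols_by_estate["<ESTATENAME>"] now lists every symbol in that estate, and
# type_of_symbol gives each symbol's type, which is enough to drive a drop-down filter.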
jenkins-job-builder is a nice tool to help me maintain jobs in YAML files; see the example in the configuration chapter.
Now I have lots of old Jenkins jobs, and it would be nice to have a Python script (xml2yaml) to convert the existing Jenkins job config.xml files to the YAML format.
Do you have any suggestions for a quick solution in Python?
I don't need it to be usable in jenkins-job-builder directly; just converting it into YAML for reference is enough.
For the conversion, some parts can be ignored, like namespaces.
A config.xml segment looks like:
<project>
  <logRotator class="hudson.tasks.LogRotator">
    <daysToKeep>-1</daysToKeep>
    <numToKeep>20</numToKeep>
    <artifactDaysToKeep>-1</artifactDaysToKeep>
    <artifactNumToKeep>-1</artifactNumToKeep>
  </logRotator>
  ...
</project>
The yaml output could be:
- project:
    logrotate:
      daysToKeep: -1
      numToKeep: 20
      artifactDaysToKeep: -1
      artifactNumToKeep: -1
If you are not familiar with config.xml in Jenkins, you can check the infra_backend-merge-all-repo job on https://ci.jenkins-ci.org
I'm writing a program that does this conversion from XML to YAML. It can dynamically query a Jenkins server and translate all the jobs to YAML.
https://github.com/ktdreyer/jenkins-job-wrecker
Right now it works for very simple jobs. I've taken a safe/pessimistic approach and the program will bail if it encounters XML that it cannot yet translate.
It's hard to tell from your question exactly what you're looking for here, but assuming you're looking for the basic structure:
Python has good support on most platforms for XML parsing. Chances are you'll want to use something simple and easy to use like minidom. See the XML Processing Modules in the Python docs for your version of Python.
Once you've opened the file, looking for project, parsing down from there, and using a simple mapping should work pretty well given the simplicity of the YAML format.
from xml.dom.minidom import parse

def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def getTextForTag(nodelist, tag):
    elements = nodelist.getElementsByTagName(tag)
    if elements.length > 0:
        return getText(elements[0].childNodes)
    return ''

def printValueForTag(parent, indent, tag, valueName=''):
    value = getTextForTag(parent, tag)
    if len(value) > 0:
        if valueName == '':
            valueName = tag
        print(indent + valueName + ": " + value)

def emitLogRotate(indent, rotator):
    print(indent + "logrotate:")
    indent += '  '
    printValueForTag(rotator, indent, 'daysToKeep')
    printValueForTag(rotator, indent, 'numToKeep')

def emitProject(project):
    print("- project:")
    # all projects have log rotators, so no need to check
    emitLogRotate("    ", project.getElementsByTagName('logRotator')[0])
    # next section...

dom = parse('config.xml')
emitProject(dom)
This snippet will print just a few lines of the eventual configuration file, but it points you in the right direction for a simple translator. Based on what I've seen, there's not much room for an automatic translation scheme due to naming differences. You could streamline the code as you iterate for more options and make it table-driven, but that's "just a matter of programming"; this will at least get you started with the DOM parsers in Python.
I suggest querying and accessing the XML with XPath expressions using xmlstarlet on the command line and in shell scripts. No trouble with low-level programmatic access to XML. XMLStarlet is an XPath swiss-army knife for the command line.
"xmlstarlet el" shows you the element structure of the entire XML as XPath expressions.
"xmlstarlet sel -t -c XPath-expression" will extract exactly what you want.
Maybe you want to spend an hour (or two) on refreshing your XPath know-how in advance.
You will shed a couple of tears once you realize how much time you spent programming XML access before you used XMLStarlet.