Removing blank lxml element from ElementTree in python - python

I've been struggling with this for a couple of days now, and I figured I would ask here.
I am working on preparing an XML payload to POST to an Oracle endpoint that contains financials data. I've got most of the XML structured per Oracle's specs, but I am struggling with one aspect of it. This data will feed the general ledger financial system; the XML structure is below (some elements have been omitted to cut down on the post).
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:typ="http://xmlns.oracle.com/apps/financials/generalLedger/journals/desktopEntry/journalImportService/types/" xmlns:jour="http://xmlns.oracle.com/apps/financials/generalLedger/journals/desktopEntry/journalImportService/">
<soapenv:Header/>
<soapenv:Body>
<typ:importJournals>
<typ:interfaceRows>
<jour:BatchName>batch</jour:BatchName>
<jour:AccountingPeriodName>Aug-20</jour:AccountingPeriodName>
<jour:AccountingDate>2020-08-31</jour:AccountingDate>
<jour:GlInterface>
<jour:LedgerId>1234567890</jour:LedgerId>
<jour:PeriodName>Aug-20</jour:PeriodName>
<jour:AccountingDate>2020-08-31</jour:AccountingDate>
<jour:Segment1>1</jour:Segment1>
<jour:Segment2>1</jour:Segment2>
<jour:Segment3>1</jour:Segment3>
<jour:Segment4>1</jour:Segment4>
<jour:Segment5>0</jour:Segment5>
<jour:Segment6>0</jour:Segment6>
<jour:CurrencyCode>USD</jour:CurrencyCode>
<jour:EnteredCrAmount currencyCode="USD">10.0000</jour:EnteredCrAmount>
</jour:GlInterface>
<jour:GlInterface>
<jour:LedgerId>1234567890</jour:LedgerId>
<jour:PeriodName>Aug-20</jour:PeriodName>
<jour:AccountingDate>2020-08-31</jour:AccountingDate>
<jour:Segment1>2</jour:Segment1>
<jour:Segment2>2</jour:Segment2>
<jour:Segment3>2</jour:Segment3>
<jour:Segment4>2</jour:Segment4>
<jour:Segment5>0</jour:Segment5>
<jour:Segment6>0</jour:Segment6>
<jour:CurrencyCode>USD</jour:CurrencyCode>
<jour:EnteredDrAmount currencyCode="USD">10.0000</jour:EnteredDrAmount>
</jour:GlInterface>
</typ:interfaceRows>
</typ:importJournals>
</soapenv:Body>
</soapenv:Envelope>
If you look at the XML above, there are two GlInterface blocks per transaction (one is a debit and one is a credit; if you look at the Segments, the account codes, they are different), and one GlInterface block has an EnteredDrAmount element while the other has an EnteredCrAmount element.
In the source data, either the Cr or the Dr value is null depending on whether the line is a debit or a credit, and a null comes through as "None" in Python.
The way I got this to work is to make two calls to get the data, one where Cr is not null and one where Dr is not null. This process works fine, but in Python I get the error "only one * allowed" (apparently a complaint about using multiple * unpackings in a single call, which Python 3.5+ does allow). Code is below.
xmlOracle = x_Envelope(
    x_Header,
    x_Body(
        x_importJournals(
            x_interfaceRows(
                x_h_BatchName(str(batch[0])),
                x_h_AccountingPeriodName(str(batch[3])),
                x_h_AccountingDate(str(batch[4])),
                # one GlInterface per credit line
                *[x_GlInterface(
                    x_d_LedgerId(str(adid[0])),
                    x_d_PeriodName(str(adid[1])),
                    x_d_AccountingDate(str(adid[2])),
                    x_d_Segment1(str(adid[5])),
                    x_d_Segment2(str(adid[6])),
                    x_d_Segment3(str(adid[7])),
                    x_d_Segment4(str(adid[8])),
                    x_d_Segment5(str(adid[9])),
                    x_d_Segment6(str(adid[10])),
                    x_d_CurrencyCode(str(adid[11])),
                    x_d_EnteredCrAmount(str(adid[14]), currencyCode=str(adid[11]))
                ) for adid in CrAdidToProcess],
                # one GlInterface per debit line
                *[x_GlInterface(
                    x_d_LedgerId(str(adid[0])),
                    x_d_PeriodName(str(adid[1])),
                    x_d_AccountingDate(str(adid[2])),
                    x_d_Segment1(str(adid[5])),
                    x_d_Segment2(str(adid[6])),
                    x_d_Segment3(str(adid[7])),
                    x_d_Segment4(str(adid[8])),
                    x_d_Segment5(str(adid[9])),
                    x_d_Segment6(str(adid[10])),
                    x_d_CurrencyCode(str(adid[11])),
                    x_d_EnteredDrAmount(str(adid[14]), currencyCode=str(adid[11]))
                ) for adid in DrAdidToProcess]
            )
        )
    )
)
I've also tried making a single call to get the line details and then either removing or filtering out the tag (either Cr or Dr) if it's "None" but I had no luck with this.
While the above process works, my code still reports that error, and I'd like to get rid of it.
Thank you all.

After further testing, I believe I figured out the solution. I was trying to remove an element from an ElementTree object, and it was not having any of that. Once I passed an Element to the remove method, it finally worked.
Here is code for the function to remove the "None" entries.
def removeCrDrEmptyElements(element):
    namespaces = {
        'soapenv': 'http://schemas.xmlsoap.org/soap/envelope/',
        'typ': 'http://xmlns.oracle.com/apps/financials/generalLedger/journals/desktopEntry/journalImportService/types/',
        'jour': 'http://xmlns.oracle.com/apps/financials/generalLedger/journals/desktopEntry/journalImportService/'
    }
    for removeElement in element.xpath(
            '/soapenv:Envelope/soapenv:Body/typ:importJournals/typ:interfaceRows/jour:GlInterface/jour:EnteredCrAmount',
            namespaces=namespaces):
        if removeElement.text == 'None':
            removeElement.getparent().remove(removeElement)
    for removeElement in element.xpath(
            '/soapenv:Envelope/soapenv:Body/typ:importJournals/typ:interfaceRows/jour:GlInterface/jour:EnteredDrAmount',
            namespaces=namespaces):
        if removeElement.text == 'None':
            removeElement.getparent().remove(removeElement)
    return element
Obviously this can be rewritten better (which I will do), but I only want to check two elements within the GlInterface tag, EnteredCrAmount and EnteredDrAmount, and remove those elements if their text is "None".
Then you can call the function as below to get back an element with the nulls/Nones removed:
xmlWithoutNull = removeCrDrEmptyElements(xmlElement)
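If it helps, the two nearly identical loops in the function can be collapsed into one by iterating over the two tag names. A minimal sketch (the function name is mine; it assumes the same namespace URIs as the payload above, and that `getparent()` is available because the tree was built with lxml):

```python
from lxml import etree

# Same namespace URIs as in the SOAP payload above.
NSMAP = {
    'soapenv': 'http://schemas.xmlsoap.org/soap/envelope/',
    'typ': 'http://xmlns.oracle.com/apps/financials/generalLedger/journals/desktopEntry/journalImportService/types/',
    'jour': 'http://xmlns.oracle.com/apps/financials/generalLedger/journals/desktopEntry/journalImportService/',
}

def remove_empty_amount_elements(element):
    """Drop EnteredCrAmount/EnteredDrAmount elements whose text is 'None'."""
    for tag in ('EnteredCrAmount', 'EnteredDrAmount'):
        for candidate in element.xpath(f'//jour:GlInterface/jour:{tag}',
                                       namespaces=NSMAP):
            if candidate.text == 'None':
                candidate.getparent().remove(candidate)
    return element
```

The relative `//jour:GlInterface/...` path matches the same elements as the absolute path in the original, so the behavior is unchanged.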
Output before running the function:
<jour:GlInterface>
# omitted elements
<jour:EnteredCrAmount currencyCode="USD">1.000000</jour:EnteredCrAmount>
<jour:EnteredDrAmount currencyCode="USD">None</jour:EnteredDrAmount>
# omitted elements
</jour:GlInterface>
<jour:GlInterface>
# omitted elements
<jour:EnteredCrAmount currencyCode="USD">None</jour:EnteredCrAmount>
<jour:EnteredDrAmount currencyCode="USD">1.000000</jour:EnteredDrAmount>
# omitted elements
</jour:GlInterface>
Output after running the function:
<jour:GlInterface>
# omitted elements
<jour:EnteredCrAmount currencyCode="USD">1.000000</jour:EnteredCrAmount>
# omitted elements
</jour:GlInterface>
<jour:GlInterface>
# omitted elements
<jour:EnteredDrAmount currencyCode="USD">1.000000</jour:EnteredDrAmount>
# omitted elements
</jour:GlInterface>

Related

xml elements in elements to python dataframe

I'm trying to convert XML data into a pandas DataFrame.
What I'm struggling with is that I cannot get the elements inside an element.
Here is an example of my XML file.
I'm trying to extract the following information:
- orth: "decrease"
- cre_date: 2013/12/07
- morph_grp -> var type: "decease"
- subsense -> eg: "abcdabcdabcd."
<superEntry>
<orth>decrease</orth>
<entry n="1" pos="vk">
<mnt_grp>
<cre>
<cre_date>2013/12/07</cre_date>
<cre_writer>james</cre_writer>
<cre_writer>jen</cre_writer>
</cre>
<mod>
<mod_date>2007/04/14</mod_date>
<mod_writer>kim</mod_writer>
<mod_note>edited ver</mod_note>
</mod>
<mod>
<mod_date>2009/11/01</mod_date>
<mod_writer>kim</mod_writer>
<mod_note>edited</mod_note>
</mod>
</mnt_grp>
<morph_grp>
<var type="spr">decease</var>
<cntr opt="opt" type="oi"/>
<org lg="si">decrease_</org>
<infl type="reg"/>
</morph_grp>
<sense n="01">
<sem_grp>
<sem_class>active solution</sem_class>
<trans>be added and subtracted to</trans>
</sem_grp>
<frame_grp type="FIN">
<frame>X=N0-i Y=N1-e V</frame>
<subsense>
<sel_rst arg="X" tht="THM">countable</sel_rst>
<sel_rst arg="Y" tht="GOL">countable</sel_rst>
<eg>abcdabcdabcd.</eg>
<eg>abcdabcdabcd.</eg>
</subsense>
</frame_grp>
</sense>
</entry>
</superEntry>
And I'm using this code:
df_cols = ["orth", "cre_Date", "var type", "eg"]
rows = []
for node in xroot:
    a = node.attrib.get("sense")
    b = node.attrib.get("orth").text if node is not None else None
    c = node.attrib.get("var type").text if node is not None else None
    d = node.attrib.get("eg").text if node is not None else None
    rows.append({"orth": a, "entry": b,
                 "morph_grp": c, "eg": d})
out_df = pd.DataFrame(rows, columns=df_cols)
I'm stuck on getting the elements inside an element.
Any good solution for this?
Thank you so much in advance.
Making some assumptions about what you want, here is an approach using XPath.
I'm assuming you will be iterating over multiple XML files that each have one superEntry root node in order to generate a DataFrame with more than one record.
Or, perhaps your actual XML doc has a higher-level root/parent element above superEntry, and you will be iterating over multiple superEntry elements within that.
You will need to modify the below accordingly to add your loop.
Also, the provided example XML had two of the "eg" elements with same value. Not sure how you want to handle that. The below will just get the first one. If you need to deal with both, then you can use the findall() method instead of find().
I was a little confused about what you wanted from the "var" element. You indicated "var type", but also that you wanted the value to be "decease", which is the text in the "var" element, whereas "type" is an attribute with a value of "spr". I assumed you wanted the text instead of the attribute value.
import pandas as pd
import xml.etree.ElementTree as ET
df_cols = ["orth","cre_Date","var","eg"]
data = []
xmlDocPath = "example.xml"
tree = ET.parse(xmlDocPath)
superEntry = tree.getroot()
#Below XPaths will just get the first occurrence of these elements:
orth = superEntry.find("./orth").text
cre_Date = superEntry.find("./entry/mnt_grp/cre/cre_date").text
var = superEntry.find("./entry/morph_grp/var").text
eg = superEntry.find("./entry/sense/frame_grp/subsense/eg").text
data.append({"orth":orth, "cre_Date":cre_Date, "var":var, "eg":eg})
#After exiting the loop, create the DataFrame:
df = pd.DataFrame(data, columns=df_cols)
df.head()
Output:
orth cre_Date var eg
0 decrease 2013/12/07 decease abcdabcdabcd.
Here is a link to the ElementTree documentation for XPath usage: https://docs.python.org/3/library/xml.etree.elementtree.html#xpath-support
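If you do need both duplicate "eg" values rather than just the first, findall() returns every match along the path. A small sketch against a trimmed-down version of the sample XML (the second "eg" value is made up here just to show the difference):

```python
import xml.etree.ElementTree as ET

xml_snippet = """
<superEntry>
  <entry>
    <sense>
      <frame_grp>
        <subsense>
          <eg>abcdabcdabcd.</eg>
          <eg>efghefghefgh.</eg>
        </subsense>
      </frame_grp>
    </sense>
  </entry>
</superEntry>
"""
superEntry = ET.fromstring(xml_snippet)

# find() stops at the first match; findall() collects them all.
egs = [eg.text for eg in
       superEntry.findall("./entry/sense/frame_grp/subsense/eg")]
print(egs)  # ['abcdabcdabcd.', 'efghefghefgh.']
```

You could then store the list in one DataFrame column, or join it into a single string, depending on how you want duplicates handled.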

What's the correct Scrapy XPath for <p> elements incorrectly placed within <h> tags?

I am setting up my first Scrapy Spider, and I'm having some difficulty using xpath to extract certain elements.
My target is http://www.cbooo.cn/m/641515 (a Chinese website similar to Box Office Mojo). I can extract the Chinese name of the film 阿龙浴血记 with no problem, but I can't figure out how to get the information below it. I believe this is because the HTML is not standard, as discussed here. There are several paragraph elements nested beneath the header.
I have tried the solution in the link above, and also here, to no avail.
def parse(self, response):
    chinesetitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/text()').extract()
    englishtitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/p').extract()
    chinesereleasedate = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[4]').extract()
    productionregions = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[6]').extract()
    chineseboxoffice = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[1]/span/text()[2]').extract()
    yield {
        'chinesetitle': chinesetitle,
        'englishtitle': englishtitle,
        'chinesereleasedate': chinesereleasedate,
        'productionregions': productionregions,
        'chineseboxoffice': chineseboxoffice
    }
When I run the spider in the Scrapy shell, the spider finds the Chinese title as expected. However, the remaining items return either a [], or a weird mish-mash of text on the page.
Any advice? This is my first amateur programming project, so I appreciate your patience with my ignorance and your help. Thank you!
EDIT
I tried implementing the text-cleaning method from the comments. The example in the comments worked, but when I tried to reimplement it I got an "AttributeError: 'list' object has no attribute 'split'" (please see the China box office, country of origin, and genre examples below).
def parse(self, response):
    chinesetitle = response.css('.cont h2::text').extract_first()
    englishtitle = response.css('.cont h2 + p::text').extract_first()
    chinaboxoffice = response.xpath('//span[@class="m-span"]/text()[2]').extract_first()
    chinaboxoffice = chinaboxoffice.split('万')[0]
    chinareleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
    chinareleasedate = chinareleasedate.split(':')[1].split('(')[0]
    countryoforigin = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()
    countryoforigin = countryoforigin.split(':')[1]
    genre = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"类型")]/text()').extract_first()
    genre = genre.split(':')[1]
    director = response.xpath('//*[@id="tabcont1"]/dl/dd[1]/p/a/text()').extract()
director = response.xpath('//*[#id="tabcont1"]/dl/dd[1]/p/a/text()').extract()
Here are some examples from which you can infer the last one. Remember always to use a class or id attribute to identify the HTML element; /div[3]/div[2]/div/div[1]/... is not good practice.
chinesetitle = response.xpath('//div[@class="ziliaofr"]/div/h2/text()').extract_first()
englishtitle = response.xpath('//div[@class="ziliaofr"]/div/p/text()').extract_first()
chinesereleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
productionregions = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()
To find chinesereleasedate I took the p element whose text contains '上映时间'. You have to parse this to get the exact value.
To find productionregions I took the 7th selector from the list response.xpath('//div[@class="ziliaofr"]/div/p')[6] and selected its text. A better method would be to check whether the text contains '国家及地区', just like above.
Edit : To answer the question in the comments,
response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
returns a string like '\r\n 上映时间:2017-7-27(中国)\r\n ' which is not what you are looking for. You can clean it up like:
chinesereleasedate = chinesereleasedate.split(':')[1].split('(')[0]
This gives us the correct date.
You don't have to torture yourself with XPath, by the way; you can use CSS:
response.css('.cont h2::text').extract_first()
# '战狼2'
response.css('.cont h2 + p::text').extract_first()
# 'Wolf Warriors 2'
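As for the AttributeError mentioned in the edit: .extract() returns a list of strings, while .extract_first() returns a single string (or None), and .split() only exists on strings. A plain-Python illustration (no Scrapy needed; the sample string is the release-date text from above):

```python
# What the two selector methods hand back:
extracted_list = ['\r\n 上映时间:2017-7-27(中国)\r\n ']  # .extract() -> list of str
extracted_first = extracted_list[0]                      # .extract_first() -> str

# Calling .split() on the list is what raises:
# AttributeError: 'list' object has no attribute 'split'

# Splitting the string works (note the full-width ':' and '('):
date = extracted_first.split(':')[1].split('(')[0]
print(date)  # 2017-7-27
```

So whenever you plan to .split() the result, use .extract_first() (and guard against it being None if the element may be missing).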

How do I replace an element in lxml with a string

I'm trying to figure out in lxml and python how to replace an element with a string.
In my experimentation, I have the following code:
from lxml import etree as et
docstring = '<p>The value is permitted only when that includes <xref linkend=\"my linkend\" browsertext=\"something here\" filename=\"A_link.fm\"/>, otherwise the value is reserved.</p>'
topicroot = et.XML(docstring)
topicroot2 = et.ElementTree(topicroot)
xref = topicroot2.xpath('//*/xref')
xref_attribute = xref[0].attrib['browsertext']
print(xref_attribute)
The result is: 'something here'
This is the browser text attribute I'm looking for in this small sample. But what I can't seem to figure out is how to replace the entire element with the attribute text I've captured here.
(I do recognize that in my sample I could have multiple xrefs and will need to construct a loop to go through them properly.)
What's the best way to go about doing this?
And for those wondering, I'm having to do this because the link actually goes to a file that doesn't exist because of our different build systems.
Thanks in advance!
Try this (Python 3):
from lxml import etree as et
docstring = '<p>The value is permitted only when that includes <xref linkend=\"my linkend\" browsertext=\"something here\" filename=\"A_link.fm\"/>, otherwise the value is reserved.</p>'
# Get the root element.
topicroot = et.XML(docstring)
topicroot2 = et.ElementTree(topicroot)
# Get the text of the root element. This is a list of strings!
topicroot2_text = topicroot2.xpath("text()")
# Get the xref element.
xref = topicroot2.xpath('//*/xref')[0]
xref_attribute = xref.attrib['browsertext']
# Save a reference to the p element, remove the xref from it.
parent = xref.getparent()
parent.remove(xref)
# Set the text of the p element by combining the list of string with the
# extracted attribute value.
new_text = [topicroot2_text[0], xref_attribute, topicroot2_text[1]]
parent.text = "".join(new_text)
print(et.tostring(topicroot2))
Output:
b'<p>The value is permitted only when that includes something here, otherwise the value is reserved.</p>'
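For the multiple-xref case the asker anticipates, the same idea can be generalized using lxml's .tail handling, which covers an xref appearing anywhere in mixed content, not only directly after the parent's opening text. A sketch (the helper name is mine):

```python
from lxml import etree as et

def replace_xrefs_with_browsertext(root):
    """Replace every <xref> with its browsertext attribute,
    splicing the xref's tail text back into the document."""
    for xref in list(root.iter('xref')):  # list() so removal is safe mid-iteration
        parent = xref.getparent()
        # Replacement text plus whatever text followed the xref.
        text = xref.get('browsertext', '') + (xref.tail or '')
        prev = xref.getprevious()
        if prev is not None:
            prev.tail = (prev.tail or '') + text
        else:
            parent.text = (parent.text or '') + text
        parent.remove(xref)
    return root

docstring = ('<p>The value is permitted only when that includes '
             '<xref linkend="my linkend" browsertext="something here" '
             'filename="A_link.fm"/>, otherwise the value is reserved.</p>')
root = replace_xrefs_with_browsertext(et.XML(docstring))
print(et.tostring(root).decode())
# <p>The value is permitted only when that includes something here, otherwise the value is reserved.</p>
```

The .tail attribute is where lxml stores the text that follows an element inside its parent, which is why it must be re-attached before the element is removed.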

Using search terms with Biopython to return accession numbers

I am trying to use Biopython (Entrez) with search terms that will return the accession number (and not the GI*).
Here is a tiny excerpt of my code:
from Bio import Entrez
Entrez.email = 'myemailaddress'
search_phrase = '(Escherichia coli[organism]) AND (complete genome[keyword])'
handle = Entrez.esearch(db='nuccore', term=search_phrase, retmax=100, rettype='acc', retmode='text')
result = Entrez.read(handle)
handle.close()
gi_numbers = result['IdList']
print(gi_numbers)
'745369752', '910228862', '187736741', '802098270', '802098269',
'802098267', '387610477', '544579032', '544574430', '215485161',
'749295052', '387823261', '387605479', '641687520', '641682562',
'594009615', '557270520', '313848522', '309700213', '284919779',
'215263233', '544345556', '544340954', '144661', '51773702',
'202957457', '202957451', '172051323'
I am sure I can convert from GI to accession, but it would be nice to avoid the additional step. What slice of magic am I missing?
Thank you in advance.
*especially since NCBI is phasing out GI numbers
Looking through the docs for esearch on NCBI's website, there are only two rettypes available: uilist, the default XML format you're currently getting (it's parsed into a dict by Entrez.read()), and count, which just displays the Count value (look at the complete contents of result; it's there). I'm unclear on its exact meaning, as it doesn't represent the total number of items in IdList.
At any rate, Entrez.esearch() will take any value of rettype and retmode you like, but it only returns the uilist or count in xml or json mode - no accession IDs, no nothin'.
Entrez.efetch() will pass you back all sorts of cool stuff, depending on which DB you're querying. The downside, of course, is that you need to query by one or more IDs, not by a search string, so in order to get your accession IDs you'd need to run two queries:
search_phrase = "Escherichia coli[organism]) AND (complete genome[keyword])"
handle = Entrez.esearch(db="nuccore", term=search_phrase, retmax=100)
result = Entrez.read(handle)
handle.close()
fetch_handle = Entrez.efetch(db="nuccore", id=result["IdList"], rettype="acc", retmode="text")
acc_ids = [id.strip() for id in fetch_handle]
fetch_handle.close()
print(acc_ids)
gives
['HF572917.2', 'NZ_HF572917.1', 'NC_010558.1', 'NZ_HG941720.1', 'NZ_HG941719.1', 'NZ_HG941718.1', 'NC_017633.1', 'NC_022371.1', 'NC_022370.1', 'NC_011601.1', 'NZ_HG738867.1', 'NC_012892.2', 'NC_017626.1', 'HG941719.1', 'HG941718.1', 'HG941720.1', 'HG738867.1', 'AM946981.2', 'FN649414.1', 'FN554766.1', 'FM180568.1', 'HG428756.1', 'HG428755.1', 'M37402.1', 'AJ304858.2', 'FM206294.1', 'FM206293.1', 'AM886293.1']
So, I'm not terribly sure if I answered your question satisfactorily, but unfortunately I think the answer is "There is no magic."
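One shortcut worth checking: NCBI's E-utilities later added an idtype parameter to ESearch for sequence databases, and passing idtype="acc" is supposed to make IdList contain accession.version identifiers directly, skipping the efetch round-trip. A hedged sketch (the helper name is mine, this needs network access and Biopython, and I haven't verified it against the live service here):

```python
def search_accessions(search_phrase, email, retmax=100):
    """ESearch with idtype='acc' so IdList should hold accession.version
    identifiers rather than GI numbers (sequence DBs such as nuccore only)."""
    from Bio import Entrez  # imported here so the sketch is self-contained

    Entrez.email = email  # NCBI requires a contact address
    handle = Entrez.esearch(db="nuccore", term=search_phrase,
                            retmax=retmax, idtype="acc")
    result = Entrez.read(handle)
    handle.close()
    return result["IdList"]
```

If your Biopython or the E-utilities endpoint predates the GI phase-out, fall back to the two-query esearch + efetch approach above.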

Is there a formal method for 'walking' XML in Python?

I have been learning how to extract parts of XML using the xml.dom.minidom module, and I can return specific elements and attributes successfully.
I have a number of large XML files I want to parse, and push all the results into a db.
Is there a function like os.walk that I can use to extract elements from the XML in a logical way that preserves the hierarchical structure?
The XML is pretty basic and is very straightforward:
<InternalSignature ID="9" Specificity="Generic">
<ByteSequence Reference="BOFoffset">
<SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="0" MinFragLength="0">
<Sequence>49492A00</Sequence>
<DefaultShift>5</DefaultShift>
<Shift Byte="00">1</Shift>
<Shift Byte="2A">2</Shift>
<Shift Byte="49">3</Shift>
</SubSequence>
</ByteSequence>
</InternalSignature>
<InternalSignature ID="10" Specificity="Generic">
<ByteSequence Reference="BOFoffset">
<SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="0" MinFragLength="0">
<Sequence>4D4D002A</Sequence>
<DefaultShift>5</DefaultShift>
<Shift Byte="2A">1</Shift>
<Shift Byte="00">2</Shift>
<Shift Byte="4D">3</Shift>
</SubSequence>
</ByteSequence>
</InternalSignature>
Is there a formal method of crawling the XML and (in this small example) extracting the elements that relate to each specific InternalSignature?
I can see how to call things via a list using the minidom.parse and the .GetElementsByName methods, but I'm not sure how you associate elements into their hierarchical representation.
So far I have found a tutorial that shows how to return various values:
xmldoc = minidom.parse("file.xml")
Versionlist = xmldoc.getElementsByTagName('FFSignatureFile')
VersionRef = Versionlist[0]
Version = VersionRef.attributes["Version"]
DateCreated = VersionRef.attributes["DateCreated"]
print(Version.value)
print(DateCreated.value)
InternalSignatureList = xmldoc.getElementsByTagName('InternalSignature')
InternalSignatureRef = InternalSignatureList[0]
SigID = InternalSignatureRef.attributes["ID"]
SigSpecificity = InternalSignatureRef.attributes["Specificity"]
print(SigID.value)
print(SigSpecificity.value)
print(len(InternalSignatureList))
I can see from the last line (len) that there is 134 elements in the InternalSignatureList, and essentially I want to be able to extract all the elements inside each InternalSignature as an individual record and flick it into a db.
from xml.etree import ElementTree
e = ElementTree.fromstring(xmlstring)
e.findall("ByteSequence")
