Python lxml find text efficiently

Using Python lxml, I want to test whether an XML document contains an EXPERIMENT_TYPE tag and, if it does, extract the corresponding <VALUE>.
Example:
<EXPERIMENT_SET>
<EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
<TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE>
<STUDY_REF accession="SomeStudy" refcenter="BCCA"/>
<EXPERIMENT_ATTRIBUTES>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
</EXPERIMENT_ATTRIBUTES>
</EXPERIMENT>
</EXPERIMENT_SET>
Is there a faster way than iterating through all elements?
all = etree.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG')
for e in all:
    if e.text == 'EXPERIMENT_TYPE':
        print("Found")
That attempt is also getting messy when I want to extract the <VALUE>.

Preferably, do this with XPath, which lxml evaluates in libxml2's compiled code rather than in a Python loop. My suggestion (tested and working) returns a (possibly empty) list of VALUE elements from which you can extract the text.
PS: do not use names of built-ins such as all as variable names. It is bad practice and may lead to unexpected bugs.
import lxml.etree as ET
from lxml.etree import Element
from typing import List
xml_str = """
<EXPERIMENT_SET>
<EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
<TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE>
<STUDY_REF accession="SomeStudy" refcenter="BCCA"/>
<EXPERIMENT_ATTRIBUTES>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
</EXPERIMENT_ATTRIBUTES>
</EXPERIMENT>
</EXPERIMENT_SET>
"""
tree = ET.ElementTree(ET.fromstring(xml_str))
vals: List[Element] = tree.xpath(".//EXPERIMENT_ATTRIBUTE/TAG[text()='EXPERIMENT_TYPE']/following-sibling::VALUE")
print(vals[0].text)
# DNA Methylation
An alternative XPath declaration was provided below by Michael Kay, which is identical to the answer by Martin Honnen.
.//EXPERIMENT_ATTRIBUTE[TAG='EXPERIMENT_TYPE']/VALUE

In terms of XPath it seems you simply want to select the VALUE element based on the TAG element with e.g. /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE.
I think with Python and lxml people often use a text node selection with e.g. /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE/text() as then the xpath function returns that as a Python string.
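As a minimal sketch of the text() variant described above (assuming lxml is installed and using a trimmed copy of the question's XML), the lookup can also guard against a missing tag. The variable names are illustrative only.

```python
from lxml import etree

# Trimmed copy of the XML from the question.
xml_str = """
<EXPERIMENT_SET>
  <EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
    <EXPERIMENT_ATTRIBUTES>
      <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
      <EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
    </EXPERIMENT_ATTRIBUTES>
  </EXPERIMENT>
</EXPERIMENT_SET>
"""
root = etree.fromstring(xml_str)

BASE = "/EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE"

# Selecting text() makes xpath() return a list of strings, not elements.
found = root.xpath(BASE + "[TAG = 'EXPERIMENT_TYPE']/VALUE/text()")
experiment_type = found[0] if found else None

# A tag that is absent yields an empty list, so no exception is raised.
not_found = root.xpath(BASE + "[TAG = 'NO_SUCH_TAG']/VALUE/text()")
missing = not_found[0] if not_found else None

print(experiment_type, missing)
```

Checking the list before indexing covers the "test if it exists" half of the question without a separate query.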

Using findall is the natural way to do it. I suggest the following code to find the VALUEs:
from lxml import etree
root = etree.parse('toto.xml').getroot()
tags = root.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG')
for e in tags:
    if e.text == 'EXPERIMENT_TYPE':
        v = e.getparent().find('VALUE')
        if v is not None:
            print(f'Found val="{v.text}"')
This outputs:
Found val="DNA Methylation"

Related

xml elements in elements to python dataframe

I'm trying to convert XML data into a pandas DataFrame.
What I'm struggling with is that I cannot get the elements nested inside an element.
Here is an example of my XML file.
I'm trying to extract the following information:
- orth: "decrease"
- cre_date: 2013/12/07
- morph_grp -> var type: "decease"
- subsense -> eg: "abcdabcdabcd."
<superEntry>
<orth>decrease</orth>
<entry n="1" pos="vk">
<mnt_grp>
<cre>
<cre_date>2013/12/07</cre_date>
<cre_writer>james</cre_writer>
<cre_writer>jen</cre_writer>
</cre>
<mod>
<mod_date>2007/04/14</mod_date>
<mod_writer>kim</mod_writer>
<mod_note>edited ver</mod_note>
</mod>
<mod>
<mod_date>2009/11/01</mod_date>
<mod_writer>kim</mod_writer>
<mod_note>edited</mod_note>
</mod>
</mnt_grp>
<morph_grp>
<var type="spr">decease</var>
<cntr opt="opt" type="oi"/>
<org lg="si">decrease_</org>
<infl type="reg"/>
</morph_grp>
<sense n="01">
<sem_grp>
<sem_class>active solution</sem_class>
<trans>be added and subtracted to</trans>
</sem_grp>
<frame_grp type="FIN">
<frame>X=N0-i Y=N1-e V</frame>
<subsense>
<sel_rst arg="X" tht="THM">countable</sel_rst>
<sel_rst arg="Y" tht="GOL">countable</sel_rst>
<eg>abcdabcdabcd.</eg>
<eg>abcdabcdabcd.</eg>
</subsense>
I'm using this code:
df_cols=["orth","cre_Date","var type","eg"]
rows=[]
for node in xroot:
    a=node.attrib.get("sense")
    b=node.attrib.get("orth").text if node is not None else None
    c=node.attrib.get("var type").text if node is not None else None
    d=node.attrib.get("eg").text if node is not None else None
    rows.append({"orth":a, "entry":b,
                 "morph_grp":c, "eg" : d})
out_df= pd.DataFrame(rows,columns=df_cols)
I'm stuck on getting the elements inside an element.
Is there a good solution for this?
Thank you so much in advance.

Making some assumptions about what you want, here is an approach using XPath.
I'm assuming you will be iterating over multiple XML files that each have one superEntry root node in order to generate a DataFrame with more than one record.
Or, perhaps your actual XML doc has a higher-level root/parent element above superEntry, and you will be iterating over multiple superEntry elements within that.
You will need to modify the below accordingly to add your loop.
Also, the provided example XML had two of the "eg" elements with same value. Not sure how you want to handle that. The below will just get the first one. If you need to deal with both, then you can use the findall() method instead of find().
I was a little confused about what you wanted from the "var" element. You indicated "var type", but you wanted the value to be "decease", which is the text of the "var" element, whereas "type" is an attribute with a value of "spr". I assumed you wanted the text rather than the attribute value.
import pandas as pd
import xml.etree.ElementTree as ET
df_cols = ["orth","cre_Date","var","eg"]
data = []
xmlDocPath = "example.xml"
tree = ET.parse(xmlDocPath)
superEntry = tree.getroot()
#Below XPaths will just get the first occurrence of these elements:
orth = superEntry.find("./orth").text
cre_Date = superEntry.find("./entry/mnt_grp/cre/cre_date").text
var = superEntry.find("./entry/morph_grp/var").text
eg = superEntry.find("./entry/sense/frame_grp/subsense/eg").text
data.append({"orth":orth, "cre_Date":cre_Date, "var":var, "eg":eg})
#After exiting Loop, create DataFrame:
df = pd.DataFrame(data, columns=df_cols)
df.head()
Output:
       orth    cre_Date      var             eg
0  decrease  2013/12/07  decease  abcdabcdabcd.
Here is a link to the ElementTree documentation for XPath usage: https://docs.python.org/3/library/xml.etree.elementtree.html#xpath-support
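The loop and the findall() variant mentioned above could look roughly like this. It is only a sketch: the <dictionary> wrapper element and the second "increase" entry are hypothetical, added so that there is more than one record and one record with missing fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: a <dictionary> root holding several superEntry elements.
xml_doc = """
<dictionary>
  <superEntry>
    <orth>decrease</orth>
    <entry n="1" pos="vk">
      <mnt_grp><cre><cre_date>2013/12/07</cre_date></cre></mnt_grp>
      <morph_grp><var type="spr">decease</var></morph_grp>
      <sense n="01">
        <frame_grp type="FIN">
          <subsense>
            <eg>abcdabcdabcd.</eg>
            <eg>abcdabcdabcd.</eg>
          </subsense>
        </frame_grp>
      </sense>
    </entry>
  </superEntry>
  <superEntry>
    <orth>increase</orth>
    <entry n="1" pos="vk"/>
  </superEntry>
</dictionary>
"""
root = ET.fromstring(xml_doc)

rows = []
for super_entry in root.findall("superEntry"):
    # findall() keeps every <eg>, not just the first one.
    egs = [eg.text for eg in super_entry.findall("./entry/sense/frame_grp/subsense/eg")]
    rows.append({
        # findtext() returns None instead of raising when the element is missing.
        "orth": super_entry.findtext("./orth"),
        "cre_Date": super_entry.findtext("./entry/mnt_grp/cre/cre_date"),
        "var": super_entry.findtext("./entry/morph_grp/var"),
        "eg": egs,
    })

print(rows)
```

Passing the list to pd.DataFrame(rows, columns=["orth", "cre_Date", "var", "eg"]) should then give one row per superEntry; using findtext() avoids the AttributeError that find(...).text raises when an element is absent.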

solving XML problem by Python

I have an XML file that contains a lot of information and tags.
For example I have this tag:
<SelectListMap SourceName="Document Type" SourceNumber="43" DestName="Document Type" DestNumber="43"/>
I have 40 other elements like this one with the same attributes, but the attribute values differ from element to element.
SourceName and DestName have the same value.
In some tags the DestName value is empty like this one:
<SelectListMap SourceName="Boolean Values" SourceNumber="73" DestName="" DestNumber="0" IsInternal="True"/>
So, I'm trying to give each empty DestName the value of SourceName.
Here is my Python code:
import re
import xml.etree.ElementTree as ET
tree = ET.parse("SPPID04A_BG3 - Copy - Copy.xml")
root = tree.getroot()
for SelectListMap in root.iter('SelectListMap'):
    #DestName.text = str(DestName)
    for node in tree.iter('SelectListMap'):
        SourceName = node.attrib.get('SourceName')
        SelectListMap.set('DestName', SourceName)
tree.write("SPPID04A_BG3 - Copy - Copy.xml")
This program does not work correctly. Any help or ideas?
Thanks!

You never check whether the DestName attribute is empty. If you replace the first for loop with the following, you should get what you want:
for SelectListMap in root.iter('SelectListMap'):
    if SelectListMap.get("DestName") == "":
        SourceName = SelectListMap.get("SourceName")
        SelectListMap.set("DestName", SourceName)
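To try the fix without touching the real file, here is a small self-contained sketch on an inline document. The <Maps> wrapper element is hypothetical, and treating a missing DestName like an empty one is an assumption that goes slightly beyond the check above.

```python
import xml.etree.ElementTree as ET

# Hypothetical <Maps> wrapper around the two elements from the question.
xml_doc = """
<Maps>
  <SelectListMap SourceName="Document Type" SourceNumber="43" DestName="Document Type" DestNumber="43"/>
  <SelectListMap SourceName="Boolean Values" SourceNumber="73" DestName="" DestNumber="0" IsInternal="True"/>
</Maps>
"""
root = ET.fromstring(xml_doc)

changed = 0
for select_list_map in root.iter("SelectListMap"):
    # Assumption: a missing DestName is handled the same way as an empty one.
    if not select_list_map.get("DestName"):
        select_list_map.set("DestName", select_list_map.get("SourceName", ""))
        changed += 1

dest_names = [m.get("DestName") for m in root.iter("SelectListMap")]
print(changed, dest_names)
```

Only the element with the empty DestName is modified; the one that already had a value is left alone.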

How do I replace an element in lxml with a string

I'm trying to figure out in lxml and python how to replace an element with a string.
In my experimentation, I have the following code:
from lxml import etree as et
docstring = '<p>The value is permitted only when that includes <xref linkend=\"my linkend\" browsertext=\"something here\" filename=\"A_link.fm\"/>, otherwise the value is reserved.</p>'
topicroot = et.XML(docstring)
topicroot2 = et.ElementTree(topicroot)
xref = topicroot2.xpath('//*/xref')
xref_attribute = xref[0].attrib['browsertext']
print(xref_attribute)
The result is: 'something here'
This is the browser text attribute I'm looking for in this small sample. But what I can't seem to figure out is how to replace the entire element with the attribute text I've captured here.
(I do recognize that in my sample I could have multiple xrefs and will need to construct a loop to go through them properly.)
What's the best way to go about doing this?
And for those wondering, I'm having to do this because the link actually goes to a file that doesn't exist because of our different build systems.
Thanks in advance!

Try this (Python 3):
from lxml import etree as et
docstring = '<p>The value is permitted only when that includes <xref linkend=\"my linkend\" browsertext=\"something here\" filename=\"A_link.fm\"/>, otherwise the value is reserved.</p>'
# Get the root element.
topicroot = et.XML(docstring)
topicroot2 = et.ElementTree(topicroot)
# Get the text of the root element. This is a list of strings!
topicroot2_text = topicroot2.xpath("text()")
# Get the xref element.
xref = topicroot2.xpath('//*/xref')[0]
xref_attribute = xref.attrib['browsertext']
# Save a reference to the p element, remove the xref from it.
parent = xref.getparent()
parent.remove(xref)
# Set the text of the p element by combining the list of strings with the
# extracted attribute value.
new_text = [topicroot2_text[0], xref_attribute, topicroot2_text[1]]
parent.text = "".join(new_text)
print(et.tostring(topicroot2))
Output:
b'<p>The value is permitted only when that includes something here, otherwise the value is reserved.</p>'
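The approach above relies on the single xref being a direct child of p. For the multi-xref case the question mentions, a more general sketch is to merge the replacement text and the element's tail into whatever precedes the element. The helper name replace_with_text and the sample sentence are made up for illustration.

```python
from lxml import etree as et

def replace_with_text(element, text):
    """Replace element with plain text, keeping the text that follows it (its tail)."""
    parent = element.getparent()
    previous = element.getprevious()
    merged = text + (element.tail or "")
    if previous is not None:
        # Text after a preceding sibling lives in that sibling's tail.
        previous.tail = (previous.tail or "") + merged
    else:
        # No preceding sibling: the text belongs to the parent itself.
        parent.text = (parent.text or "") + merged
    parent.remove(element)

doc = et.XML(
    '<p>See <xref browsertext="first link" filename="A.fm"/> and '
    '<b>also</b> <xref browsertext="second link" filename="B.fm"/>, then stop.</p>'
)

# xpath() returns a plain list, so removing elements while looping is safe.
for xref in doc.xpath("//xref"):
    replace_with_text(xref, xref.get("browsertext", ""))

result = et.tostring(doc, encoding="unicode")
print(result)
```

The tail is copied before the element is removed because the text that follows an element is stored on that element and disappears with it.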

Simplify SVGs by applying transform - reduce size

I quite often have SVGs with structures like this:
<svg:g
transform="translate(-251.5,36.5)"
id="g12578"
style="fill:#ffff00;fill-opacity:1">
<svg:rect
width="12"
height="12"
x="288"
y="35.999958"
id="rect12580"
style="fill:#ffff00;fill-opacity:1;stroke:#000000;stroke-width:1" />
</svg:g>
I would like to apply the translate directly to the coordinates and delete the transform attribute:
<svg:g
id="g12578"
style="fill:#ffff00;fill-opacity:1">
<svg:rect
width="12"
height="12"
x="36.5"
y="72.499958"
id="rect12580"
style="fill:#ffff00;fill-opacity:1;stroke:#000000;stroke-width:1" />
</svg:g>
Do you know a script or program for simplifying SVGs? Or a Python snippet for parsing SVGs?
This script works for my special case, but I would like one that always works:
# http://epydoc.sourceforge.net/stdlib/xml.dom.minidom.Element-class.html
from xml.dom.minidom import parse, parseString
import re
f = open('/home/moose/mathe/svg/Solitaire-Board.svg', 'r')
xmldoc = parse(f)
p = re.compile(r'translate\(([-\d.]+),([-\d.]+)\)', re.IGNORECASE)
for node in xmldoc.getElementsByTagName('svg:g'):
    transform_dict = node.attributes["transform"]
    m = p.match(transform_dict.value)
    if m:
        x = float(m.group(1))
        y = float(m.group(2))
        child_rectangles = node.getElementsByTagName('svg:rect')
        for rectangle in child_rectangles:
            x_dict = rectangle.attributes["x"]
            y_dict = rectangle.attributes["y"]
            new_x = float(x_dict.value) + x
            new_y = float(y_dict.value) + y
            rectangle.setAttribute('x', str(new_x))
            rectangle.setAttribute('y', str(new_y))
        node.removeAttribute('transform')
print(xmldoc.toxml())
I think the size of the svg could be reduced quite heavily without loss of quality, if the transform-attribute could be removed.
If the tool would be able to reduce coordinate precision, delete unnecessary regions, group and style wisely it would be great.

I'd recommend using lxml. It's extremely fast and has a lot of nice features. You can parse your example if you properly declare the svg namespace prefix. You can do that pretty easily:
>>> svg = '<svg xmlns:svg="http://www.w3.org/2000/svg">' + example_svg + '</svg>'
Now you can parse it with lxml.etree (or xml.etree.ElementTree):
>>> doc = etree.fromstring(svg)
If you use lxml you can take advantage of XPath:
>>> ns = {'svg': 'http://www.w3.org/2000/svg'}
>>> doc.xpath('//svg:g/@transform', namespaces=ns)
<<< ['translate(-251.5,36.5)']
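Building on this, a rough sketch of actually folding the translate into the rect coordinates might look as follows. It uses xml.etree.ElementTree (lxml accepts the same calls), handles only groups whose transform is a lone translate() and whose direct children are all untransformed rects, and rounds to 6 decimals; all of those limits are assumptions made for the example, not a general SVG simplifier.

```python
import re
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
G_TAG = f"{{{SVG_NS}}}g"
RECT_TAG = f"{{{SVG_NS}}}rect"
# Matches translate(tx) or translate(tx, ty) and nothing else.
TRANSLATE = re.compile(r"^\s*translate\(\s*([-+\d.]+)(?:[\s,]+([-+\d.]+))?\s*\)\s*$")

svg = """
<svg:svg xmlns:svg="http://www.w3.org/2000/svg">
  <svg:g transform="translate(-251.5,36.5)" id="g12578">
    <svg:rect width="12" height="12" x="288" y="35.999958" id="rect12580"/>
  </svg:g>
</svg:svg>
"""
doc = ET.fromstring(svg)

for group in doc.iter(G_TAG):
    match = TRANSLATE.match(group.get("transform", ""))
    if match is None:
        continue  # no transform, or something other than a lone translate()
    dx = float(match.group(1))
    dy = float(match.group(2) or 0.0)  # translate(tx) implies ty = 0
    children = list(group)
    # Assumption: skip groups containing anything but untransformed rects.
    if not all(c.tag == RECT_TAG and "transform" not in c.attrib for c in children):
        continue
    for rect in children:
        rect.set("x", str(round(float(rect.get("x", "0")) + dx, 6)))
        rect.set("y", str(round(float(rect.get("y", "0")) + dy, 6)))
    del group.attrib["transform"]

rect = doc.find(f".//{RECT_TAG}")
group = doc.find(f".//{G_TAG}")
print(rect.get("x"), rect.get("y"), group.attrib)
```

Skipping groups that contain other shapes keeps the rendering unchanged, since paths, nested groups and other transform types would each need their own handling.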

You might want to have a look at scour:
Scour aims to reduce the size of SVG files as much as possible, while retaining the original rendering of the files. It does not do so flawlessly for all files, therefore users are encouraged not to overwrite their original files.
Optimizations performed by Scour on SVG files include: removing empty elements, removing metadata elements, removing unused id= attribute values, removing unrenderable elements, trimming coordinates to a certain number of significant places, and removing vector editor metadata.

1) It can be parsed and edited with regular expressions. You can easily get the translate values and the x and y coordinates.
2) If you have tried minidom and are sure that your only problem is the ':', replace the ':' with something else, edit what you need, and then change it back to ':'.
3) See the question "Is there any scripting SVG editor?" to learn how to parse this XML format better.

path to element with conditions on parent(s) attributes using xpath,lxml,python

I am working on a project using lxml. Here is a sample XML:
<PatientsTree>
<Patient PatientID="SKU065427">
<Study StudyInstanceUID="25.2.9.2.1107.5.1.4.49339.30000006050107501192100000001">
<Series SeriesInstanceUID="2.16.840.1.113669.1919.1176798690"/>
<Series SeriesInstanceUID="2.16.840.1.113669.1919.1177084041"/>
<Series SeriesInstanceUID="25.2.9.2.1107.5.1.4.49339.30000006050108064034300000000"/>
</Study>
</Patient>
<Patient PatientID="SKU55527">
<Study StudyInstanceUID="25.2.9.2.1107.5.1.4.49339.30000006120407393721800000007">
<Series SeriesInstanceUID="2.16.840.1.113669.1919.1198835144"/>
</Study>
<Study StudyInstanceUID="25.2.9.2.1107.5.1.4.49339.30000007010207164403100000013">
<Series SeriesInstanceUID="2.16.840.1.113669.1919.1198835358"/>
</Study>
</Patient>
</PatientsTree>
Suppose I want to get to the series element with conditions
PatientID="SKU55527"
StudyInstanceUID="25.2.9.2.1107.5.1.4.49339.30000007010207164403100000013";
My result will be :
<Series SeriesInstanceUID="2.16.840.1.113669.1919.1198835358"/>
If I can understand this solution then I will move one step closer in learning xml. P.S I am working with python and lxml and xpath

import lxml.etree as le
with open('data.xml') as f:
    doc=le.parse( f )
patientID="SKU55527"
studyInstanceUID="25.2.9.2.1107.5.1.4.49339.30000007010207164403100000013"
xpath='''\
/PatientsTree
/Patient[@PatientID="{p}"]
/Study[@StudyInstanceUID="{s}"]
/Series'''.format(p=patientID,s=studyInstanceUID)
seriesInstanceUID=doc.xpath(xpath)
for node in seriesInstanceUID:
    print(node.attrib)
# {'SeriesInstanceUID': '2.16.840.1.113669.1919.1198835358'}
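The expression above is built with str.format, which breaks if an ID ever contains a double quote. As an alternative sketch, lxml lets you pass values as XPath variables ($p and $s here, names chosen for this example) so no quoting is needed. The inline XML is a trimmed copy of the sample for the second patient.

```python
import lxml.etree as le

# Trimmed copy of the sample data (second patient only).
xml_doc = """
<PatientsTree>
  <Patient PatientID="SKU55527">
    <Study StudyInstanceUID="25.2.9.2.1107.5.1.4.49339.30000006120407393721800000007">
      <Series SeriesInstanceUID="2.16.840.1.113669.1919.1198835144"/>
    </Study>
    <Study StudyInstanceUID="25.2.9.2.1107.5.1.4.49339.30000007010207164403100000013">
      <Series SeriesInstanceUID="2.16.840.1.113669.1919.1198835358"/>
    </Study>
  </Patient>
</PatientsTree>
"""
doc = le.fromstring(xml_doc)

# Compile once; $p and $s are filled in through keyword arguments on each call.
find_series = le.XPath(
    "/PatientsTree/Patient[@PatientID=$p]/Study[@StudyInstanceUID=$s]/Series"
)

series = find_series(
    doc,
    p="SKU55527",
    s="25.2.9.2.1107.5.1.4.49339.30000007010207164403100000013",
)
uids = [node.get("SeriesInstanceUID") for node in series]
print(uids)
```

A compiled le.XPath object can be reused for many patient and study combinations without rebuilding the expression string.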

This XPath expression:
/PatientsTree
/Patient[@PatientID='SKU55527']
/Study[@StudyInstanceUID =
'25.2.9.2.1107.5.1.4.49339.30000007010207164403100000013']
/Series
Results in this node selected:
<Series SeriesInstanceUID="2.16.840.1.113669.1919.1198835358"/>

If you prefer lxml's find() with ElementPath predicates over a full XPath expression (otherwise, unutbu's solution is perfect):
from lxml import etree as ET
tree = ET.parse('some_file.xml')
patientID="SKU55527"
studyInstanceUID="25.2.9.2.1107.5.1.4.49339.30000007010207164403100000013"
# find() takes a path, so the attribute values go into predicates
patient_node = tree.find('Patient[@PatientID="%s"]' % patientID)
if patient_node is not None:
    study_node = patient_node.find('Study[@StudyInstanceUID="%s"]' % studyInstanceUID)
    if study_node is not None:
        for child in study_node:
            print(child.attrib)
            # or do whatever useful thing you want
    else:
        pass  # didn't find the study
else:
    pass  # didn't find the patient
