I'm still new to making Python apps... but I'm willing to learn.
I want to take hashtags (extracted from a generated string) and turn them into elements of an XML etree.
e.g.
from the string (object rawData)
rawData = "I'm soooo sleepy - feeling bored #journal #asleep"
I already got code from here to convert these hashtags (#journal and #asleep) into a python set:
hashTags = extract_hash_tags(rawData)
Result would be this (Now I already have a set of tags):
hashTags = set(['journal', 'asleep'])
The problem now is to make that set into:
<array>
<string>journal</string>
<string>asleep</string>
</array>
I know that I'm gonna make a loop for this that'll make individual parts of the set into elements.
I'm still rusty at loops though.
I'm using lxml because I need to prettify the xml. It gets the job done though.
EDIT: The answer for the stackoverflow question used a set not an array. Sorry 'bout that mistake...
With lxml.
from lxml import etree
# Code to make hashTags list...
array = etree.Element('array')
# Note: array can be also SubElement(parent, 'array')
for tag in hashTags:
    string = etree.SubElement(array, 'string')
    string.text = tag
# tostring() returns bytes, so decode before printing
print(etree.tostring(array, pretty_print=True).decode())
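For comparison, here is a minimal standard-library sketch of the same loop, for cases where lxml is not available. It assumes Python 3.9+ (for ET.indent). The extract_hash_tags function below is only a hypothetical stand-in for the helper mentioned in the question, and the tags are sorted simply because a set has no defined order.

```python
import re
import xml.etree.ElementTree as ET


def extract_hash_tags(text):
    # Hypothetical stand-in for the helper from the question:
    # collect the words that follow a '#' into a set.
    return set(re.findall(r"#(\w+)", text))


def hash_tags_to_xml(tags):
    array = ET.Element("array")
    # Sets are unordered, so sort for reproducible output.
    for tag in sorted(tags):
        ET.SubElement(array, "string").text = tag
    ET.indent(array)  # pretty-print in place (Python 3.9+)
    return ET.tostring(array, encoding="unicode")


raw_data = "I'm soooo sleepy - feeling bored #journal #asleep"
xml_text = hash_tags_to_xml(extract_hash_tags(raw_data))
print(xml_text)
```

The loop has the same shape as the lxml version: one SubElement per tag, with the tag stored in .text.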
Related
I am using lxml to parse an XML like this sample one:
<compounddef xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="d2/db7/class_foo" kind="class">
<compoundname>FooClass</compoundname>
<sectiondef kind="public-type">
<memberdef kind="typedef" id="d2/db7/class_bar">
<type><ref refid="d3/d73/struct_foo" kindref="compound">StructFoo</ref></type>
<definition>StructFooDefinition</definition>
</memberdef>
</sectiondef>
</compounddef>
I'm trying to get the element whose <ref> has the refid "d3/d73/struct_foo" and whose <definition> contains the text "Foo".
There could be many elements with that refid and many definitions containing "Foo", but only one has this combination.
I am able to first find all the elements with that refid and then filter this list by checking which of them contains "Foo" in the <definition>, but since I'm working with a really big XML file (~1GB) and the application is time sensitive, I wanted to avoid this.
I tried combining the various etree paths using the keyword 'and' or '//precede:...', but without success.
My last try was:
self.dox_tree_root_.xpath(".//compounddef[@kind = 'class']//memberdef[@kind='typedef'][/type/ref[@refid='%s'] and contains(definition, 'name')]" % (independent_type_refid, name)))
but it is giving me an error.
Is there a way to combine the two filters inside one command?
You can use XPath:
//a[.//ref[@refid="12345"] and contains(c, "Good")]
If I understand you correctly, this should get you close enough:
.//compounddef[@kind = 'class']//memberdef[@kind='typedef'][./type/ref[@refid='d3/d73/struct_foo']][contains(.//definition, 'Foo')]//definition
Output:
StructFooDefinition
XML file:
<?xml version="1.0" encoding="iso-8859-1"?>
<rdf:RDF xmlns:cim="http://iec.ch/TC57/2008/CIM-schema-cim13#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<cim:Terminal rdf:ID="A_T1">
<cim:Terminal.ConductingEquipment rdf:resource="#A_EF2"/>
<cim:Terminal.ConnectivityNode rdf:resource="#A_CN1"/>
</cim:Terminal>
</rdf:RDF>
I want to get the attribute values of both the Terminal.ConnectivityNode element and the Terminal element from the XML above. This is what I have tried:
Python code:
from elementtree import ElementTree as etree
tree= etree.parse(r'N:\myinternwork\files xml of bus systems\cimxmleg.xml')
cim= "{http://iec.ch/TC57/2008/CIM-schema-cim13#}"
rdf= "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
Appending the line below to the code:
print tree.find('{0}Terminal'.format(cim)).attrib
Output 1 (as expected):
{'{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID': 'A_T1'}
Appending this line instead:
print tree.find('{0}Terminal'.format(cim)).attrib['rdf:ID']
Output 2: KeyError for 'rdf:ID'
Appending this line instead:
print tree.find('{0}Terminal/{0}Terminal.ConductivityEquipment'.format(cim))
Output 3: None
How do I get A_T1 as output 2 and #A_CN1 as output 3?
What is the significance of {0} in the code above? I found online that it must be used, but I didn't understand why.
First off, the {0} you're wondering about is part of the syntax for Python's built-in string formatting facility. The Python documentation has a fairly comprehensive guide to the syntax. In your case, it simply gets substituted by cim, which results in the string {http://iec.ch/TC57/2008/CIM-schema-cim13#}Terminal.
The problem here is that ElementTree is a bit silly about namespaces. Instead of being able to simply supply the namespace prefix (like cim: or rdf:), you have to supply the name in its expanded {namespace-uri}name form. This means that rdf:ID becomes {http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID, which is very clunky.
ElementTree does support a way to use the namespace prefix for finding tags, but not for attributes. This means you'll have to expand rdf: to {http://www.w3.org/1999/02/22-rdf-syntax-ns#} yourself.
In your case, it could look as follows (note also that ID is case-sensitive):
tree.find('{0}Terminal'.format(cim)).attrib['{0}ID'.format(rdf)]
Those substitutions expand to:
tree.find('{http://iec.ch/TC57/2008/CIM-schema-cim13#}Terminal').attrib['{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID']
With those hoops jumped through, it works (note that the ID is A_T1 and not #A_T1, however). Of course, this is all really annoying to have to deal with, so you could also switch to lxml and have it mostly handled for you.
Your third case doesn't work simply because 1) the element is named Terminal.ConductingEquipment and not Terminal.ConductivityEquipment, and 2) if you really want #A_CN1 and not #A_EF2, that's the ConnectivityNode and not the ConductingEquipment. You can get #A_CN1 with tree.find('{0}Terminal/{0}Terminal.ConnectivityNode'.format(cim)).attrib['{0}resource'.format(rdf)].
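Putting the pieces above together, a minimal sketch with the standard library's xml.etree.ElementTree might look like this. The XML is inlined (without the encoding declaration) so the example is self-contained, and the variable names are only illustrative. Tags are looked up through a prefix mapping, while the attribute names are expanded by hand.

```python
import xml.etree.ElementTree as ET

CIM = "http://iec.ch/TC57/2008/CIM-schema-cim13#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

doc = """<rdf:RDF xmlns:cim="http://iec.ch/TC57/2008/CIM-schema-cim13#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <cim:Terminal rdf:ID="A_T1">
    <cim:Terminal.ConductingEquipment rdf:resource="#A_EF2"/>
    <cim:Terminal.ConnectivityNode rdf:resource="#A_CN1"/>
  </cim:Terminal>
</rdf:RDF>"""

root = ET.fromstring(doc)

# Prefix mapping: works for tag names in find()/findall() paths.
ns = {"cim": CIM, "rdf": RDF}
terminal = root.find("cim:Terminal", ns)

# Attribute keys are not prefix-aware, so expand them to {uri}name by hand.
terminal_id = terminal.get("{%s}ID" % RDF)

node = terminal.find("cim:Terminal.ConnectivityNode", ns)
node_ref = node.get("{%s}resource" % RDF)

print(terminal_id)  # A_T1
print(node_ref)     # #A_CN1
```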
Hello, I'm writing a bit of code in Maya and running into some issues with ElementTree. I need help reading in this XML, or something similar. The XML is generated based on a selection, so it can change.
<root>
<Locations>
<1 name="CacheLocation">C:\Users\daunish\Desktop</1>
</Locations>
<Objects>
<1 name="Sphere">[u'pSphere1', u'pSphere2']</1>
<2 name="Cube">[u'pCube1']</2>
</Objects>
</root>
I need a way of searching for a particular "name" inside "Locations", and passing the text to a variable.
I also need a way of going through each line inside of "Objects" and performing a function on it, as in a for loop.
I'm open to all suggestions, I have been going crazy trying to get this to work. If you think I should format the XML differently I'm up for that as well. Thanks in advance for the help.
[Note: your XML is not well formed because you can't have tags that start with a number]
Not sure what you've tried but there are many ways to do this, here's one:
Find the first element with name=CacheLocation in Locations:
>>> filename = root.find("./Locations/*[@name='CacheLocation']").text
>>> filename
'C:\\Users\\daunish\\Desktop'
Iterating over all the elements in Objects:
>>> import ast
>>> for target in root.find("./Objects"):
...     for i in ast.literal_eval(target.text):
...         print(target.get('name'), i)
Sphere pSphere1
Sphere pSphere2
Cube pCube1
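Since tag names cannot start with a digit, here is a self-contained sketch of one way the file could be reshaped so that it parses. The item tag name is purely an assumption for illustration. The two lookups are the same as in the session above.

```python
import ast
import xml.etree.ElementTree as ET

# Same data as in the question, but with a hypothetical <item> tag
# in place of the numeric tag names (which are not well-formed XML).
doc = r"""<root>
  <Locations>
    <item name="CacheLocation">C:\Users\daunish\Desktop</item>
  </Locations>
  <Objects>
    <item name="Sphere">[u'pSphere1', u'pSphere2']</item>
    <item name="Cube">[u'pCube1']</item>
  </Objects>
</root>"""

root = ET.fromstring(doc)

# Find the entry named "CacheLocation" and keep its text in a variable.
cache_location = root.find("./Locations/*[@name='CacheLocation']").text

# Walk every entry under <Objects>; literal_eval turns the stored
# list text back into a real Python list.
pairs = []
for target in root.find("./Objects"):
    for obj in ast.literal_eval(target.text):
        pairs.append((target.get("name"), obj))

print(cache_location)
print(pairs)
```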
I'm trying to build a blog system. So I need to do things like transforming '\n' into <br /> and transforming http://example.com into <a href='http://example.com'>http://example.com</a>
The former thing is easy - just using string replace() method
The latter thing is more difficult, but I found solution here: Find Hyperlinks in Text using Python (twitter related)
But now I need to implement "Edit Article" function, so I have to do the reverse action on this.
So, how can I transform <a href='http://example.com'>http://example.com</a> into http://example.com?
Thanks! And I'm sorry for my poor English.
Sounds like the wrong approach. Making round-trips work correctly is always challenging. Instead, store the source text only, and only format it as HTML when you need to display it. That way, alternate output formats / views (RSS, summaries, etc) are easier to create, too.
Separately, we wonder whether this particular wheel needs to be reinvented again ...
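The "store the source text, format only for display" idea above could be sketched roughly as follows. The render_html name and the URL pattern are assumptions made for illustration, not part of any library. The stored article stays as plain text, so the edit form can show it unchanged and no reverse transformation is needed.

```python
import html
import re

# Deliberately simple URL pattern, for illustration only.
URL_RE = re.compile(r"(https?://[^\s<]+)")


def render_html(source_text):
    """Turn the stored plain text into HTML at display time."""
    # Escape first so user text cannot inject markup.
    escaped = html.escape(source_text)
    # Wrap each URL in an anchor, then convert newlines.
    linked = URL_RE.sub(r'<a href="\1">\1</a>', escaped)
    return linked.replace("\n", "<br />")


stored = "See http://example.com\nbye"
rendered = render_html(stored)
print(rendered)
```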
Since you are using the answer from that other question your links will always be in the same format. So it should be pretty easy using regex. I don't know python, but going by the answer from the last question:
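import re
# Anchor markup in the form produced by the linkifying step
myString = "This is my tweet check it out <a href='http://tinyurl.com/blah'>http://tinyurl.com/blah</a>"
# Capture the href value and replace the whole anchor with it
r = re.compile(r"<a href='(http://[^']+)'>[^<]*</a>")
print(r.sub(r'\1', myString))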
Should work.
I'm trying to read an XML file into Python, pull out certain elements from it and then write the results back to an XML file (so basically it's the original XML file without several elements). When I use .removeChild(source) it removes the individual elements I want to remove but leaves white space in their place, making the file very unreadable. I know I can still parse the file with all of the whitespace, but there are times when I need to manually alter the values of certain elements' attributes and the whitespace makes this difficult (and annoying). I can certainly remove the whitespace by hand, but if I have dozens of these XML files that's not really feasible.
Is there a way to do .removeChild and have it remove the white space as well?
Here's what my code looks like:
from xml.dom.minidom import parse
dom = parse(filename)
main = dom.childNodes[0]
sources = main.getElementsByTagName("source")
for source in sources:
    name = source.getAttribute("name")
    spatialModel = source.getElementsByTagName("spatialModel")
    val1 = float(spatialModel[0].getElementsByTagName("parameter")[0].getAttribute("value"))
    val2 = float(spatialModel[0].getElementsByTagName("parameter")[1].getAttribute("value"))
    if angsep(val1, val2, X, Y) >= ROI:
        main.removeChild(source)
    else:
        print(name, val1, val2, angsep(val1, val2, X, Y))
f = open(outfile, "w")
f.write("<?xml version=\"1.0\" ?>\n")
f.write(dom.saveXML(main))
f.close()
Thanks much for the help.
If you have PyXML installed you can use xml.dom.ext.PrettyPrint()
I couldn't figure out how to do this using xml.dom.minidom, so I just wrote a quick function to read in the output file, remove all blank lines, and write the result to a new file:
with open(xmlfile) as src, open('src_model.xml', 'w') as dst:
    for line in src:
        # Skip lines that are empty or contain only whitespace
        if line.strip():
            dst.write(line)
This works good enough for me :)
For anyone who finds this by searching, this funny snippet
skey = lambda n: getattr(n, "tagName", "")
mainnode.childNodes = sorted(
    [n for n in mainnode.childNodes if n.nodeType != n.TEXT_NODE],
    key=skey, reverse=True)
removes all text nodes (and also reverse-sorts the remaining nodes by tag name).
I.e. you can (recursively) do tr.childNodes = [recurseclean(n) for n in tr.childNodes if n.nodeType != n.TEXT_NODE] to remove all text nodes.
Or you might want to do something like … if n.nodeType != n.TEXT_NODE or not re.match(r'^\s*$', n.data) (didn't try that one myself) if you need to keep text nodes that carry some data. Or something more complex to leave text inside specific tags.
After that tree.toprettyxml(…) will return well-formatted XML text.
I know that this question is quite dated, but since it took a while to figure out different approaches to the problem, here are my solutions:
The best way I found is indeed using lxml:
from lxml import etree
root = etree.fromstring(data)
# for tag in root.iter('tag') doesn't cope with namespaces...
for tag in root.xpath('//*[local-name() = "tag"]'):
    tag.getparent().remove(tag)
data = etree.tostring(root, encoding = 'utf-8', pretty_print = True)
With minidom, it's a bit more convoluted because every element is followed by a trailing whitespace node:
import xml.dom.minidom
dom = xml.dom.minidom.parseString(data)
for tag in dom.getElementsByTagName('tag'):
    # Also drop the whitespace-only text node that follows the element
    if tag.nextSibling \
            and tag.nextSibling.nodeType == tag.TEXT_NODE \
            and tag.nextSibling.data.isspace():
        tag.parentNode.removeChild(tag.nextSibling)
    tag.parentNode.removeChild(tag)
data = dom.documentElement.toxml(encoding = 'utf-8')
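To see the effect on a concrete input, here is a small self-contained run of the minidom approach. The library/source document and the remove_with_whitespace helper are made up for illustration. The helper removes an element together with the whitespace-only text node that follows it, so no blank line is left behind.

```python
import xml.dom.minidom

# Made-up sample document for illustration.
data = """<library>
  <source name="keep"/>
  <source name="drop"/>
  <source name="keep2"/>
</library>"""


def remove_with_whitespace(node):
    """Remove node and the whitespace-only text node right after it."""
    sibling = node.nextSibling
    if (sibling is not None
            and sibling.nodeType == sibling.TEXT_NODE
            and sibling.data.isspace()):
        node.parentNode.removeChild(sibling)
    node.parentNode.removeChild(node)


dom = xml.dom.minidom.parseString(data)
for source in dom.getElementsByTagName("source"):
    if source.getAttribute("name") == "drop":
        remove_with_whitespace(source)

result = dom.documentElement.toxml()
print(result)
```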