Xpath - obtaining 2 nodes with 1 node having default value if missing - python

I am using xpath in Python 2.7 with lxml:
from lxml import html
...
tree = html.fromstring(source)
results = tree.xpath(...xpath string...)
Now the problem is the xpath string and am getting quite lost in this. I am trying to get all the nodes from one path like this:
//a[@class="hyperlinkClass"]/span/text() (1)
There are no missing entries in this part and this works fine. But I'm also trying to get a part relative to this as well, like so:
//a[@class="hyperlinkClass"]/span/following-sibling::div[@class="divClassName"]/span[@class="spanClassName"]/text() (2)
This works fine by itself, but (2) may or may not have nodes for each node in (1). What I would like to do is to have a default value for if (2) is missing/empty for each (1), say "absent". This sounds straightforward and maybe it is, but I'm hitting a brick wall here.
By doing '(1) | (2)' I get all the values needed, but no way to match them. If I do '(1) | concat((2), "absent")', this doesn't work either - concat doesn't seem to work in python, though I've read with xpath that it is valid. I saw here the "Becker method", but that doesn't work either (or I can't get it to).
Hopefully, someone can shine a light on how to get this working or if it's even possible.

Don't make this more complicated than it is:
path1 = '//a[@class="hyperlinkClass"]/span'
path2 = './following-sibling::div[@class="divClassName"]/span[@class="spanClassName"]'
for link in tree.xpath(path1):
    # evaluate the second path relative to each matched span
    other_node = link.xpath(path2)
    if len(other_node):
        print(link.text, other_node[0].text)
    else:
        print(link.text, 'n/a')
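The same "loop over (1), look up (2) relative to it, fall back to a default" idea can be tried without lxml. The following is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree, which has no following-sibling axis, so it searches for the div from the parent a element instead. The sample markup and the helper name pair_with_default are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; only the class names come from the question.
SOURCE = """
<body>
  <a class="hyperlinkClass">
    <span>first</span>
    <div class="divClassName"><span class="spanClassName">extra</span></div>
  </a>
  <a class="hyperlinkClass">
    <span>second</span>
  </a>
</body>
"""

def pair_with_default(root, default="absent"):
    """Return (link text, related text or default) for every matching link."""
    pairs = []
    for anchor in root.iterfind(".//a[@class='hyperlinkClass']"):
        span = anchor.find("span")
        if span is None:
            continue
        # ElementTree lacks following-sibling, so look for the div under the parent.
        extra = anchor.find("div[@class='divClassName']/span[@class='spanClassName']")
        pairs.append((span.text, extra.text if extra is not None else default))
    return pairs

pairs = pair_with_default(ET.fromstring(SOURCE))
print(pairs)
```

Because each link is handled on its own, the values stay matched up, which is what the '(1) | (2)' union could not provide.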

Get a single child node using lxml

Edit - The issue was that I was running an outdated version of lxml - I feel really stupid now but I'm glad I found out.
I'm having trouble iterating through an XML tree to export single child elements.
What I'm looking for is a way to isolate child elements and export them in separate xml files. But my problem is that when I'm using the 'etree.iter' function, I'm not only getting the child elements, I'm also getting all following siblings. How can I get only one child element at a time?
This should explain it better. Here's my sample code:
from lxml import etree
root = etree.XML("<users><user><name>Test</name><id>01</id></user> \
<user><name>Test</name><id>02</id></user> \
<user><name>Test</name><id>03</id></user></users>")
for record in root.iter("user"):
    print(etree.tostring(record))
It produces the following output
b'<user><name>Test</name><id>01</id></user><user><name>Test</name><id>02</id></user><user><name>Test</name><id>03</id></user></users>'
b'<user><name>Test</name><id>02</id></user><user><name>Test</name><id>03</id></user></users>'
b'<user><name>Test</name><id>03</id></user></users>'
But what I need is
b'<user><name>Test</name><id>01</id></user>'
b'<user><name>Test</name><id>02</id></user>'
b'<user><name>Test</name><id>03</id></user>'
What am I doing wrong?
I'm not sure why iter is producing that output. Try this instead; it works fine.
xn = etree.fromstring("<users><user><name>Test</name><id>01</id></user><user><name>Test</name><id>02</id></user><user><name>Test</name><id>03</id></user></users>")
user_nodes = xn.findall("user")
str_nodes = [etree.tostring(un) for un in user_nodes]
print(str_nodes)
produces an expected output
[
b'<user><name>Test</name><id>01</id></user>',
b'<user><name>Test</name><id>02</id></user>',
b'<user><name>Test</name><id>03</id></user>']
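Since the stated goal was one export per child element, here is a minimal sketch of serializing each user separately and keying the result by its id. It assumes the standard library's xml.etree.ElementTree rather than lxml, and the helper name export_users is hypothetical; writing the bytes to files is left out.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<users>"
    "<user><name>Test</name><id>01</id></user>"
    "<user><name>Test</name><id>02</id></user>"
    "<user><name>Test</name><id>03</id></user>"
    "</users>"
)

def export_users(root):
    """Map each user's <id> text to that user's own serialized XML."""
    return {user.findtext("id"): ET.tostring(user) for user in root.iter("user")}

exported = export_users(root)
for user_id, xml_bytes in exported.items():
    print(user_id, xml_bytes)
```

Each value holds only the one user element, so it could be written to its own file as-is.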

IDA python Find issues

My goal here is to search through the entire memory range of a process for the following pattern:
pop *
pop *
retn
I've tried using FindText, but it seems to return results only for areas that IDA has already parsed into instructions. So to use FindText I'd need to figure out how to parse the entire memory range for instructions (which seems like it would be intensive).
So I switched to FindBinary, but I ran into an issue there as well. The pattern I'm searching for only needs to match the first 5 bits of each byte; the rest is a wildcard. So my goal would be to search for:
01011***
01011***
11000011
I've found posts claiming IDA has a ? wildcard for bytes, but I haven't been able to get it to work, and even if it did, it only seems to work for a full 8 bits. So for this approach I would need to find a way to search for bit patterns and then parse the bits around the result. This seems like the most doable route, but so far I haven't been able to find anything in the docs that can search bits like this.
Does anyone know a way to accomplish what I want?
In classic Stack Overflow style, I spent hours trying to figure it out, then 20 minutes after asking for help I found the exact function I needed, get_byte():
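def find_test():
    # FindBinary, get_byte, SEARCH_* and BADADDR are assumed to be in scope,
    # as they are in IDA's script console
    base = idaapi.get_imagebase()
    while True:
        res = FindBinary(base, SEARCH_NEXT|SEARCH_DOWN, "C3")
        if res == BADADDR:
            break
        # the top five bits of both preceding bytes must be 01011 (pop reg)
        if 0b01011 == get_byte(res-1) >> 3 and 0b01011 == get_byte(res-2) >> 3:
            print("{0:X}".format(res))
        base = res + 1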
Now, if only I could figure out how to do this with a wildcard in every instruction, because for this solution I need to know at least one full byte of the pattern.
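One way to allow a wildcard in every position is to describe each byte as a (mask, value) pair and compare only the masked bits. This is a sketch under assumptions, not an IDA API: it scans a plain bytes buffer (for example, memory that has already been read out of the process), and the names PATTERN and find_masked are made up for illustration.

```python
# Each entry is (mask, value): a byte matches when (byte & mask) == value.
# 0xF8 keeps the top five bits, so 0x58 matches 01011*** (pop reg).
PATTERN = [(0xF8, 0x58), (0xF8, 0x58), (0xFF, 0xC3)]

def find_masked(buf, pattern):
    """Return every offset in buf where all masked bytes of pattern match."""
    data = bytearray(buf)  # indexing a bytearray yields integers
    n = len(pattern)
    return [
        i
        for i in range(len(data) - n + 1)
        if all((data[i + k] & mask) == value
               for k, (mask, value) in enumerate(pattern))
    ]

# Hypothetical buffer: nop, pop/pop/retn, pop/retn, pop/pop/retn
sample = bytes([0x90, 0x58, 0x5F, 0xC3, 0x5E, 0xC3, 0x59, 0x5A, 0xC3])
hits = find_masked(sample, PATTERN)
print(hits)
```

A fully wildcarded byte would simply be (0x00, 0x00), so no position needs a known full byte.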

Using Strings to Name Hash Keys?

I'm working through a book called "Head First Programming," and there's a particular part where I'm confused as to why they're doing this.
There doesn't appear to be any reasoning for it, nor any explanation anywhere in the text.
The issue in question is in using multiple-assignment to assign split data from a string into a hash (which doesn't make sense as to why they're using a hash, if you ask me, but that's a separate issue). Here's the example code:
line = "101;Johnny 'wave-boy' Jones;USA;8.32;Fish;21"
s = {}
(s['id'], s['name'], s['country'], s['average'], s['board'], s['age']) = line.split(";")
I understand that this will take the string line and split it up into each named part, but I don't understand why what I think are keys are being named by using a string, when just a few pages prior, they were named like any other variable, without single quotes.
The purpose of the individual parts is to be searched based on an individual element and then printed on screen. For example, being able to search by ID number and then return the entire thing.
The language in question is Python, if that makes any difference. This is rather confusing for me, since I'm trying to learn this stuff on my own.
My personal best guess is that it doesn't make any difference and that it was personal preference on part of the authors, but it bewilders me that they would suddenly change form like that without it having any meaning, and further bothers me that they don't explain it.
EDIT: So I tried printing the id key both with and without single quotes around the name, and it worked perfectly fine, either way. Therefore, I'd have to assume it's a matter of personal preference, but I still would like some info from someone who actually knows what they're doing as to whether it actually makes a difference, in the long run.
EDIT 2: Apparently, it doesn't make any sense as to how my Python interpreter is actually working with what I've given it, so I made a screen capture of it working https://www.youtube.com/watch?v=52GQJEeSwUA
"I don't understand why what I think are keys are being named by using a string, when just a few pages prior, they were named like any other variable, without single quotes"
The answer is right there. If there's no quote, mydict[s], then s is a variable, and you look up the key in the dict based on what the value of s is.
If it's a string, then you look up literally that key.
So, in your example s[name] won't work as that would try to access the variable name, which is probably not set.
"EDIT: So I tried printing the id key both with and without single quotes around the name, and it worked perfectly fine, either way."
That's just pure luck... There's a built-in function called id:
>>> id
<built-in function id>
Try another name, and you'll see that it won't work.
Actually, as it turns out, for dictionaries (Python's term for hashes) there is a semantic difference between having the quotes there and not.
For example:
s = {}
s['test'] = 1
s['othertest'] = 2
defines a dictionary called s with two keys, 'test' and 'othertest'. However, if I tried to do this instead:
s = {}
s[test] = 1
I'd get a NameError exception, because this would be looking for an undefined variable called test whose value would be used as the key.
If, then, I were to type this into the Python interpreter:
>>> s = {}
>>> s['test'] = 1
>>> s['othertest'] = 2
>>> test = 'othertest'
>>> print s[test]
2
>>> print s['test']
1
you'll see that using test as a key with no quotes uses the value of that variable to look up the associated entry in the dictionary s.
Edit: Now, the REALLY interesting question is why using s[id] gave you what you expected. The name "id" is not a keyword; it is a built-in function in Python that gives you a unique id for an object passed as its argument. What in the world the Python interpreter is doing with the expression s[id] is a total mystery to me.
Edit 2: Watching the OP's Youtube video, it's clear that he's staying consistent when assigning and reading the hash about using id or 'id', so there's no issue with the function id as a hash key somehow magically lining up with 'id' as a hash key. That had me kind of worried for a while.
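The points made in the answers above can be checked in a few lines. This is a small sketch reusing the book's sample line; the variable names by_string, by_variable, keys_differ and error are only there for illustration.

```python
line = "101;Johnny 'wave-boy' Jones;USA;8.32;Fish;21"
s = {}
(s['id'], s['name'], s['country'], s['average'], s['board'], s['age']) = line.split(";")

# A quoted key is looked up literally.
by_string = s['id']

# An unquoted name is evaluated first; here `key` holds the string 'name'.
key = 'name'
by_variable = s[key]

# The built-in function `id` is hashable, so it is a valid key,
# but it is a different key from the string 'id'.
s[id] = 'stored under the function object'
keys_differ = s[id] != s['id']

# An undefined name fails before the dictionary is even consulted.
try:
    s[country]
except NameError as exc:
    error = type(exc).__name__

print(by_string, by_variable, keys_differ, error)
```

So s[id] only appears to work when the same unquoted id is used for both storing and reading, which matches what the video shows.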

BioPython Pubmed Eutils url?

I'm trying to run some queries against Pubmed's Eutils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to pubmed).
A while ago I bodged together a python script to build a query to do much the same thing, and the resultant url returns the same number of hits (link to Eutils result).
Of course, not having any formal programming background, it was all a bit kludgy, so I'm trying to do the same thing using Biopython. I think the following code should do the same thing, but it returns a greater number of hits, 23303.
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the url is being generated, but I can't work out how to see what url is being generated by Biopython. Can anyone give me some pointers?
Thanks!
EDIT:
It's something to do with how the url is being generated, as I can get back the original number of hits by modifying the code to include double quotes around the search term, thus:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what url is being generated by Biopython, as it'll help me work out how I have to structure the search term when I want to do more complicated searches.
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)
You've solved this already (Entrez likes explicit double quoting round combined search terms), but currently the URL generated is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file to add a print statement inside the _open function.
Update: Recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url)
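To get a feel for what such a URL might look like without touching the network, the parameters can be encoded by hand. This is only a sketch: it assumes the parameters go through standard URL encoding (an assumption about Biopython's internals that this thread does not confirm), and build_esearch_url is a hypothetical helper, not part of Bio.Entrez.

```python
from urllib.parse import urlencode

# Public E-utilities endpoint for esearch.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(**params):
    """Hypothetical helper: join the endpoint and the URL-encoded parameters."""
    return BASE + "?" + urlencode(params)

# A literal "+" in the term is escaped as %2B ...
plain = build_esearch_url(db="pubmed", term="stem+cell[All Fields]")
# ... whereas a space in the term is what becomes "+" in the query string.
spaced = build_esearch_url(db="pubmed", term="stem cell[All Fields]")

print(plain)
print(spaced)
```

Under that assumption, a "+" typed into the term is not the same thing as the "+" seen in a browser URL, so comparing against print(handle.url) is the reliable check.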

Specific doubts on kgp.py program in dive into python book

Dive into Python: XML Processing -
Here I am referring to a portion of kgp.py program -
def getDefaultSource(self):
    xrefs = {}
    for xref in self.grammar.getElementsByTagName("xref"):
        xrefs[xref.attributes["id"].value] = 1
    xrefs = xrefs.keys()
    standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
    if not standaloneXrefs:
        raise NoSourceError, "can't guess source, and no source specified"
    return '<xref id="%s"/>' % random.choice(standaloneXrefs)
self.grammar: parsed XML representation (using xml.dom.minidom) of -
<?xml version="1.0" ?>
<grammar>
  <ref id="bit">
    <p>0</p>
    <p>1</p>
  </ref>
  <ref id="byte">
    <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
  </ref>
</grammar>
self.refs: is the caching of all the refs of the above XML key'd by their id
I have two doubts with this code:
Doubt 1:
for xref in self.grammar.getElementsByTagName("xref"):
    xrefs[xref.attributes["id"].value] = 1
xrefs = xrefs.keys()
Eventually xrefs holds the id values in a list. Couldn't we have done this simply by -
xrefs = [xref.attributes["id"].value
for xref in self.grammar.getElementsByTagName("xref")]
Doubt 2:
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
...
return '<xref id="%s"/>' % random.choice(standaloneXrefs)
Here, we are saving the refs from self.refs which we do NOT see in our computed xrefs. But next, instead of creating a <ref> element, we are creating an <xref> with the same ID. This takes us one step backward, since later we are anyway going to find the cross reference for this computed <xref> and eventually reach the <ref>. We could have just started with this <ref> in the first place.
Disclaimer
I am in no way trying to make a remark on the book. I am not even qualified for that.
I am loving every moment of reading this book. I realize few chapters have gone outdated, but I love Mark Pilgrim's writing style and I cannot stop reading.
Dive Into Python is seven years old now (published 2004), and doesn't always contain the most modern code. So you need to go easy on it: Dive Into Python 3 might be a better bet.
Your suggestion for doubt 1 changes the meaning of the code: putting the ids into the keys of a dictionary and then getting them out again eliminates duplicates, whereas your list comprehension includes duplicates. The modern approach would be to use a set comprehension:
xrefs = {xref.attributes["id"].value
for xref in self.grammar.getElementsByTagName("xref")}
but this wasn't available in 2004.
On your doubt 2, I'm not entirely sure I see the problem. Yes, in some sense this is a waste, but on the other hand the code already has a handler for the xref case, so it makes sense to re-use that handler rather than add an extra special case.
There are several other bits of code in that example that could be modernized. For example,
source and source or self.getDefaultSource()
would now be source or self.getDefaultSource(). And the line
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
would be better expressed as a set difference operation (converted back to a list, because random.choice needs a sequence), something like:
standaloneXrefs = list(set(self.refs) - set(xrefs))
But that's what happens as languages become more expressive: old code starts to look rather inelegant.
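The difference between the list comprehension, the set comprehension and the set difference can be seen on a cut-down grammar. This is a sketch, assuming a shortened version of the XML from the question; the names as_list, as_set and standalone are illustrative and not from kgp.py.

```python
from xml.dom import minidom

# Shortened version of the grammar from the question (two xrefs instead of eight).
GRAMMAR = """<?xml version="1.0" ?>
<grammar>
<ref id="bit"><p>0</p><p>1</p></ref>
<ref id="byte"><p><xref id="bit"/><xref id="bit"/></p></ref>
</grammar>"""

grammar = minidom.parseString(GRAMMAR)
refs = {ref.attributes["id"].value: ref
        for ref in grammar.getElementsByTagName("ref")}

# The list comprehension keeps duplicates ...
as_list = [xref.attributes["id"].value
           for xref in grammar.getElementsByTagName("xref")]
# ... while the set comprehension drops them, like the dict-keys trick.
as_set = {xref.attributes["id"].value
          for xref in grammar.getElementsByTagName("xref")}

# random.choice needs a sequence, so turn the set difference into a list.
standalone = sorted(set(refs) - as_set)
print(as_list, as_set, standalone)
```

Here only "byte" is never cross-referenced, so it is the one candidate for the default source.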
Your doubts are totally justified: that code doesn't look very good to me at all. For example, it uses 1 as a boolean value where True would have sufficed and been clearer.
Doubt 1:
These two snippets don't do the same thing. If there are duplicates, the original code will filter them out, but your alternative won't. On the other hand, your code preserves the original ordering whereas the original returns the elements in an arbitrary order.
To be fully equivalent, we could use the set builtin:
xrefs = list(set([xref.attributes["id"].value
for xref in self.grammar.getElementsByTagName("xref")]))
(It might not make sense to convert back to a list, though.)
Doubt 2:
Out of time, gotta run, sorry...
for xref in self.grammar.getElementsByTagName("xref"):
    xrefs[xref.attributes["id"].value] = 1
xrefs = xrefs.keys()
This is an extremely crude way to construct a set. This should be written as
set(xref.attributes["id"].value
for xref in self.grammar.getElementsByTagName("xref"))
or even (in Python 2.7+):
{xref.attributes["id"].value
for xref in self.grammar.getElementsByTagName("xref")}
If avoiding duplicates is not an issue, your solution (constructing a list) works too. Since xref is iterated over anyway, one could even generate an iterator.
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
...
return '<xref id="%s"/>' % random.choice(standaloneXrefs)
This code is completely broken if xref contains a special character such as " or &.
However, in principle, it is correct to construct an <xref> element here, since this must be the same format that the external source has (getDefaultSource is called as
self.loadSource(source and source or self.getDefaultSource())
).
Both code excerpts are examples of bad programming and should not be included in a book that intends to teach people how to program. Dive Into Python 3 has better XML examples and code.
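As for the special-character problem mentioned above, one possible fix is to let the standard library quote the attribute value. This is a sketch, not the book's code: make_xref is a hypothetical helper, and it relies on xml.sax.saxutils.quoteattr, which escapes the value and adds the surrounding quotes itself.

```python
from xml.dom import minidom
from xml.sax.saxutils import quoteattr

def make_xref(ref_id):
    """Build an <xref> element string with the id escaped and quoted."""
    # quoteattr returns the value already wrapped in suitable quotes.
    return "<xref id=%s/>" % quoteattr(ref_id)

tricky = 'a"b&c'
element = make_xref(tricky)

# Parsing the result back should give the original id unchanged.
round_trip = minidom.parseString(element).documentElement.getAttribute("id")
print(element, round_trip)
```

For ordinary ids such as "bit" the output is the same string the original '%s' formatting would produce.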
