Pulling multiple values from python ElementTree with lxml and xpath

I am almost certainly doing this horribly wrong, and the cause of my problem is my own ignorance, but reading python docs and examples isn't helping.
I am web-scraping. The pages I am scraping have the following salient elements:
<div class='parent'>
    <span class='title'>
        <a>THIS IS THE TITLE</a>
    </span>
    <div class='copy'>
        <p>THIS IS THE COPY</p>
    </div>
</div>
My objective is to pull the text nodes from 'title' and 'copy', grouped by their parent div. In the above example, I should like to retrieve a tuple ('THIS IS THE TITLE', 'THIS IS THE COPY')
Below is my code
## 'tree' is the ElementTree of the document I've just pulled
xpath = "//div[@class='parent']"
filtered_html = tree.xpath(xpath)
arr = []
for i in filtered_html:
    title_filter = "//span[@class='title']/a/text()" # xpath for title text
    copy_filter = "//div[@class='copy']/p/text()" # xpath for copy text
    title = i.getroottree().xpath(title_filter)
    copy = i.getroottree().xpath(copy_filter)
    arr.append((title, copy))
I'm expecting filtered_html to be a list of n elements (which it is). I'm then trying to iterate over that list of elements and for each one, convert it to an ElementTree and retrieve the title and copy text with another xpath expression. So at each iteration, I'm expecting title to be a list of length 1, containing the title text for element i, and copy to be a corresponding list for the copy text.
What I end up with: at every iteration, title is a list of length n containing all elements in the document matching the title_filter xpath expression, and copy is a corresponding list of length n for the copy text.
I'm sure that by now, anyone who knows what they're doing with xpath and etree can recognise I'm doing something horrible and mistaken and stupid. If so, can they please tell me how I should be doing this instead?

Your core problem is that the getroottree call you're making on each element resets your xpath search to the whole tree. getroottree does exactly what it sounds like: it returns the root element tree of the element you call it on. Note also that an xpath beginning with // is absolute, so even run directly on i it would still scan the whole document; drop the getroottree call and make the expressions relative (e.g. ".//span[@class='title']/a/text()") and you'll get what you want.
I personally would use the iterfind method on the element tree for my main loop, and would probably use the findtext method on the resulting elements to ensure that I receive only one title and one copy.
My (untested!) code would look like this:
parent_div_xpath = ".//div[@class='parent']"
title_filter = ".//span[@class='title']/a"
copy_filter = ".//div[@class='copy']/p"
arr = [(i.findtext(title_filter), i.findtext(copy_filter)) for i in tree.iterfind(parent_div_xpath)]
Alternately, you could skip explicit iteration entirely:
title_filter = ".//div[@class='parent']/span[@class='title']/a"
copy_filter = ".//div[@class='parent']/div[@class='copy']/p"
arr = izip((title.text for title in tree.findall(title_filter)),
           (copy.text for copy in tree.findall(copy_filter)))
Note that findall uses ElementPath syntax, which supports neither a leading // nor a trailing text() call, hence the relative paths and the generator expressions pulling .text out of each element. (izip comes from itertools on Python 2; on Python 3 use the built-in zip.)
And you might need to tweak that xpath if having more than one title/copy pair in a parent div is a possibility.
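For what it's worth, the per-parent lookup described above can be sketched with the stdlib xml.etree.ElementTree instead of lxml (the two-parent markup below is hypothetical, just to show the grouping):

```python
# Per-parent scoped lookups: relative paths keep each search inside one div,
# which is exactly what the getroottree() version failed to do.
import xml.etree.ElementTree as ET

doc = """
<root>
  <div class="parent">
    <span class="title"><a>THIS IS THE TITLE</a></span>
    <div class="copy"><p>THIS IS THE COPY</p></div>
  </div>
  <div class="parent">
    <span class="title"><a>SECOND TITLE</a></span>
    <div class="copy"><p>SECOND COPY</p></div>
  </div>
</root>
"""

tree = ET.fromstring(doc)
arr = [
    (i.findtext("span[@class='title']/a"), i.findtext("div[@class='copy']/p"))
    for i in tree.iterfind(".//div[@class='parent']")
]
print(arr)
# [('THIS IS THE TITLE', 'THIS IS THE COPY'), ('SECOND TITLE', 'SECOND COPY')]
```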


Empty list as output from scrapy response object

I am scraping a webpage and while trying to extract text from one element, I am hitting a dead end.
The text I want is within <p> tags inside a <div> (the element is shown in an image in the original post). I tried extracting the text in the scrapy shell using the following code: response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that a space in a css selector means a descendant, so "div.home-hero-blurb no-select" looks for a <no-select> element inside an element with the class home-hero-blurb. To match a single element carrying both classes, chain them with a dot: "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, the selector will return the text directly inside the div, but not the text in its children. Since there is also a strong element as a child of p, I would suggest using a generalist approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div and it's descendants.
It's relevant to point out that extracting text with css selectors (::text) is a Scrapy extension of the standard selectors; the Scrapy documentation mentions this.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
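The div-text-versus-descendant-text distinction can be sketched with the stdlib ElementTree as well (the markup below is a stand-in for the real page, which isn't reproduced here):

```python
# An element's .text is only the text directly inside it; itertext() walks
# every descendant text node, which is what *::text / //text() collect.
import xml.etree.ElementTree as ET

div = ET.fromstring(
    '<div class="no-select"><p>First <strong>bold</strong> bit.</p>'
    '<p>Second bit.</p></div>'
)

print(repr(div.text))            # None -- no text sits directly in the div
print("".join(div.itertext()))   # First bold bit.Second bit.
```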

Following sibling within an xpath is not working as intended

I've been trying to scoop a portion of text out of some html elements using xpath, but it seems I'm going wrong somewhere, which is why I can't make it work.
Html elements:
htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""
I would like to dig out International using xpath. I know I could get it with .next_sibling if I wanted to go the css-selector route, but I'm not interested in that.
That said, I can get it using this xpath:
tree.xpath("//*[@class='content']/p/following::text()")[0]
But the above expression is not what I'm after, because I can't use it within selenium webdriver's driver.find_element_by_xpath(), which must return an element rather than a text node.
The only form I'm interested in is the following, but it is not working:
"//*[@class='content']/p/following::*"
Real-life example:
from lxml.html import fromstring
htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""
tree = fromstring(htmlelem)
item = tree.xpath("//*[@class='content']/p/following::text()")[0].strip()
elem = tree.xpath("//*[@class='content']/p/following::*")[0].text
print(elem)
In the above example, I can print item successfully, but printing elem fails because following::* matches no element here. I would like to modify the expression used for elem.
How can I make it work so that the same xpath I can use within lxml library or within selenium?
Since OP was looking for a solution which extracts the text from outside the xpath, the following should do that, albeit in a somewhat awkward manner:
tree.xpath("//*[@class='content']")[0][0].tail
Output:
International
The need for this approach is a result of the way lxml parses the html code:
tree.xpath("//*[@class='content']") results in a list of length=1.
The first (and only) element in the list - tree.xpath("//*[@class='content']")[0] - is an lxml.html.HtmlElement which can itself be treated as a list (of its children) and also has length=1.
In the tail of the first (and only) element in that lxml.html.HtmlElement hides the desired output...
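Since the question's snippet is well-formed, the .tail behaviour can be checked with the stdlib ElementTree too:

```python
# Text that sits after a closing tag belongs to the *preceding* element's
# .tail attribute, not to the parent's .text.
import xml.etree.ElementTree as ET

htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""

tree = ET.fromstring(htmlelem)
p = tree.find(".//p")
print(repr(p.tail))      # 'International\n'
print(p.tail.strip())    # International
```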

How to delete bs4.element.Tag element in Python list?

I have a Python list which is
url_list = [<img src="https://test.com/temp.jpg" style="display:block"/>, <img src="https://test.com/not_temp.jpg" style="display:block"/>]
Both elements in that list are of type 'bs4.element.Tag'.
How do I delete '<img src="https://test.com/temp.jpg" style="display:block"/>' element while keeping its 'bs4.element.Tag' type?
and the list will keep changing in time, so del url_list[0] is not going to work.
I tried url_list.remove('<img src="https://test.com/temp.jpg" style="display:block"/>')
but it didn't work since its type was different.
Edit:
I want to remove this exact element: '<img src="https://test.com/temp.jpg" style="display:block"/>'. And "while keeping its 'bs4.element.Tag' type" means I don't want to change the type of the list's elements.
Convert the string representation of the tag into a BS object:
tag = '<img src="https://test.com/temp.jpg" style="display:block"/>'
unwanted = bs4.BeautifulSoup(tag, 'html.parser').img
And remove it:
url_list.remove(unwanted)
Easiest thing is probably to simply go through every tag and check whether or not it has a certain attribute value, which you can do with the tag.get() method. For example, you could do something along the lines of
for tag in url_list[:]:  # iterate over a copy so removing is safe
    if tag.get('src') == 'some_url':
        url_list.remove(tag)
the get() method can be used to extract any of the individual attributes of the tag, not just src. How you filter out which tag to remove is then up to you.
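One caveat with any remove-in-a-loop approach: removing from a list while looping over it makes Python's iterator skip the element after each removal, so iterate over a copy or build a new list. bs4 aside, a minimal sketch with plain strings (hypothetical values):

```python
# Buggy: mutating the list being iterated skips the element after a removal.
buggy = ["temp.jpg", "temp.jpg", "keep.jpg"]
for u in buggy:
    if u == "temp.jpg":
        buggy.remove(u)
print(buggy)   # ['temp.jpg', 'keep.jpg'] -- a duplicate survives

# Safe: filter into a new list (or loop over a copy with [:]).
safe = [u for u in ["temp.jpg", "temp.jpg", "keep.jpg"] if u != "temp.jpg"]
print(safe)    # ['keep.jpg']
```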

Extract h1 text from div class with scrapy or selenium

I am using python along with scrapy and selenium. I want to extract the text from the h1 tag which is inside a div class.
For example:
<div class = "example">
    <h1>
        This is an example
    </h1>
</div>
This is my tried code:
for single_event in range(1, length_of_alllinks):
    source_link.append(alllinks[single_event])
    driver.get(alllinks[single_event])
    s = Selector(response)
    temp = s.xpath('//div[@class="example"]//@h1').extract()
    print temp
    title.append(temp)
print title
Each and every time I tried different methods I got an empty list.
Now, I want to extract "This is an example" i.e h1 text and store it or append it in a list i.e in my example title.
Like:
temp = ['This is an example']
Try the following to extract the intended text:
s.xpath('//div[@class="example"]/h1/text()').extract()
For once, it seems that in your HTML the class attribute of the <div> is "example", but in your code you may be looking for other class values; at least for XPath queries, keep in mind that you search by exact attribute value. You can use something like:
s.xpath('//div[contains(@class, "example")]')
to find an element that has the "example" class but may have additional classes. I'm not sure if this is a mistake or this is your actual code. In addition, the spaces in your HTML around the '=' sign of the class attribute may not be helping some parsers either.
Second, your query used in s.xpath seems wrong. Try something like this:
temp = s.xpath('//div[@class="example"]/h1').extract()
It's not clear from your code what s is, so I'm assuming the extract() method does what you think it does. A cleaner code sample would help us help you.
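Since s isn't shown, here is the same extraction sketched with the stdlib ElementTree on the question's markup; note the .strip(), because the literal text node includes the surrounding newlines:

```python
import xml.etree.ElementTree as ET

html = """
<div class="example">
<h1>
This is an example
</h1>
</div>
"""

root = ET.fromstring(html)
h1 = root.find("h1")       # equivalent of //div[@class="example"]/h1 here
temp = [h1.text.strip()]
print(temp)                # ['This is an example']
```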

Using getElementsByTagName from xml.dom.minidom

I'm going through Asheesh Laroia's "Scrape the Web" presentation from PyCon 2010 and I have a question about a particular line of code which is this line:
title_element = parsed.getElementsByTagName('title')[0]
from the function:
def main(filename):
    # Parse the file
    parsed = xml.dom.minidom.parse(open(filename))
    # Get title element
    title_element = parsed.getElementsByTagName('title')[0]
    # Print just the text underneath it
    print title_element.firstChild.wholeText
I don't know what role '[0]' is performing at the end of that line. Does 'xml.dom.minidom.parse' parse the input into a list?
parse() does not return a list; getElementsByTagName() does. You're asking for all elements with a tag of <title>. Most tags can appear multiple times in a document, so when you ask for those elements, you'll get more than one. The obvious way to return them is as a list or tuple.
In this case you expect only one <title> tag in the document, so you just take the first element in the list.
This method's (getElementsByTagName) documentation says:
Search for all descendants (direct children, children’s children,
etc.) with a particular element type name.
Since it mentions "all descendants", then yes, in all likelihood it returns a list that this code just indexes to get the first element.
Looking at the code of this method (in Lib/xml/dom/minidom.py) - it indeed returns a list.
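This is easy to confirm with a tiny in-memory document (parseString instead of parse; the title text is made up):

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    "<page><title>Scrape the Web</title></page>"
)

elements = doc.getElementsByTagName('title')
print(isinstance(elements, list))          # True -- NodeList subclasses list
print(elements[0].firstChild.wholeText)    # Scrape the Web
```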
