I have a webpage consisting of anchor elements. I want to select both the text and the href attribute value of every anchor element, using Scrapy's XPath engine. I have tried the following without success:
response.xpath('//a[position()>1]/(text()|@href)').extract()
response.xpath('//a[position()>1]/text()/@href').extract()
But both of these raise errors.
Is this possible in a single XPath expression in the first place?
PS: it's probably not correct to say "Scrapy's XPath engine"; I think it's the lxml Python package.
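A minimal sketch of two ways this could be done, assuming lxml is installed and using made-up sample markup. XPath 1.0 does not allow a parenthesised union as a path step, but the `|` operator can join two complete paths. Looping over the anchors is usually easier when the text and href need to stay paired.

```python
from lxml import html

# Hypothetical sample page; the real markup is not shown in the question.
SAMPLE = """
<html><body>
  <div>
    <a href="/first">First</a>
    <a href="/second">Second</a>
    <a href="/third">Third</a>
  </div>
</body></html>
"""

root = html.fromstring(SAMPLE)

# Option 1: union of two full paths. The result is a flat list that mixes
# text nodes and attribute values.
mixed = root.xpath('//a[position()>1]/text() | //a[position()>1]/@href')

# Option 2: select the elements once, then read both values per element,
# which keeps each text paired with its href.
pairs = [
    (a.text_content().strip(), a.get('href'))
    for a in root.xpath('//a[position()>1]')
]
print(pairs)
```

In a Scrapy callback the same idea would apply to `response.xpath('//a[position()>1]')`, calling `.xpath('text()')` and `.xpath('@href')` on each selected anchor.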
I've been researching this for two days now, and there seems to be no simple way of doing it. I can find an element on a page by downloading the HTML with Selenium and passing it to BeautifulSoup, then searching by class and string. I want to click on this element after finding it, so I want to pass its XPath to Selenium. I have no minimal working example, only pseudocode for what I'm hoping to do.
Why is there no function or library that lets me search through the HTML of a webpage, find an element, and then request its XPath? I can do this manually by inspecting the webpage and clicking 'Copy XPath'. I can't find any solutions to this on Stack Overflow, so please don't tell me I haven't looked hard enough.
Pseudo-Code:
# parser is a BeautifulSoup HTML object
for box in parser.find_all('span', class_="icon-type-2"):  # find all elements with a particular icon
    xpath = box.get_xpath()  # hypothetical method: this is what I am looking for
I'm willing to change my code entirely, as long as I can locate a particular element and extract its XPath. So ideas involving entirely different libraries are welcome.
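One possible approach, sketched here under the assumption that switching from BeautifulSoup to lxml is acceptable: lxml's `ElementTree.getpath()` returns a positional XPath for an element it has parsed. The sample markup is invented. In practice the HTML could come from Selenium's `driver.page_source`, and the path describes lxml's parse tree, which may not always match the browser's live DOM.

```python
from lxml import html

# Hypothetical sample page standing in for driver.page_source.
SAMPLE = """<html><body>
  <div><span class="icon-type-1">a</span></div>
  <div><span class="icon-type-2">b</span><span class="icon-type-2">c</span></div>
</body></html>"""

root = html.fromstring(SAMPLE)
tree = root.getroottree()

# Locate the elements by class, then ask the tree for each element's path.
boxes = root.xpath('//span[@class="icon-type-2"]')
xpaths = [tree.getpath(box) for box in boxes]

for path in xpaths:
    print(path)
```

Each returned string could then be passed to Selenium, for example with `driver.find_element(By.XPATH, path)`, provided the browser's DOM has the same structure as the parsed source.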
I'm working on a scraper project, and one of the goals is to get every image link from the HTML and CSS of a website. I was using BeautifulSoup and TinyCSS to do that, but now I'd like to switch everything to Selenium so that I can load the JS.
I can't find a way in the docs to target CSS properties without knowing the tag, id, or class. I can get the images from the HTML easily, but I need to find every "background-image" property in the CSS in order to get the URL from it.
ex: background-image: url("paper.gif");
Is there a way to do it, or should I loop over each element and check its computed CSS (which would be time-consuming)?
You can grab all the style tags and parse them, searching for what you need.
You can also download each CSS file using its resource URL and parse it.
Alternatively, you can write an XPath or CSS rule that matches nodes whose style contains the property you're looking for.
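A rough sketch of the parsing step, assuming the CSS text has already been collected (from style tags or downloaded stylesheets). It uses only the standard library. The regular expression is a simplification that covers common `background` and `background-image` declarations, not the full CSS grammar, and the function name is made up for illustration.

```python
import re

# Matches url(...) inside a background or background-image declaration.
# Simplified on purpose: it ignores comments, escapes, and multiple layers.
BACKGROUND_URL_RE = re.compile(
    r'background(?:-image)?\s*:[^;}]*?url\(\s*["\']?([^"\')]+)["\']?\s*\)',
    re.IGNORECASE,
)


def background_image_urls(css_text):
    """Return the URLs referenced by background declarations in css_text."""
    return [url.strip() for url in BACKGROUND_URL_RE.findall(css_text)]


# Hypothetical stylesheet content.
sample_css = """
body { background-image: url("paper.gif"); }
.hero { background: #fff url(img/hero.png) no-repeat; }
.plain { color: red; }
"""

urls = background_image_urls(sample_css)
print(urls)
```

Relative URLs found this way still need to be resolved against the stylesheet's own URL, for example with `urllib.parse.urljoin`.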
I am trying to create a "universal" XPath so that when I run the spider, it can extract the hotel name for each hotel on the list.
This is the XPath that I need to convert:
//*[@id="offerPage"]/div[3]/div[1]/div[1]/div/div/div/div/div[2]/div/div[1]/h3/a
Can anyone point me in the right direction?
This is how the example in the Scrapy docs does it:
https://github.com/scrapy/quotesbot/blob/master/quotesbot/spiders/toscrape-xpath.py
For the text, they have:
'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
When you open "http://quotes.toscrape.com/" and copy the XPath for the text, you get:
/html/body/div/div[2]/div[1]/div[1]/span[1]
When you look at the HTML you are scraping, just using "Copy XPath" from the browser's source viewer is not enough.
You need to look at the attributes the HTML tags have.
Of course, an XPath built only from tag names and positions can work, but what if not every page you are going to scrape follows that pattern?
The Scrapy example you are referring to uses the span's class attribute to point precisely to the target tag.
I suggest reading a bit more about Xpath (for example here) to understand how flexible your search patterns can be.
If you want to go even broader, reading about DOM structure will also be useful. Let us know if you need more pointers.
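To illustrate the difference, here is a small sketch using lxml (assumed installed) on invented markup loosely modelled on quotes.toscrape.com. Each repeated block is selected by its class, and the fields are then read with short paths relative to that block, instead of one long positional path from the document root.

```python
from lxml import html

# Hypothetical markup; class names follow the quotes.toscrape.com example.
SAMPLE = """<html><body><div class="col">
  <div class="quote"><span class="text">First quote</span>
    <span>by <small class="author">A</small></span></div>
  <div class="quote"><span class="text">Second quote</span>
    <span>by <small class="author">B</small></span></div>
</div></body></html>"""

root = html.fromstring(SAMPLE)

items = []
# Select every repeated block by attribute, not by position.
for quote in root.xpath('//div[@class="quote"]'):
    text = quote.xpath('./span[@class="text"]/text()')
    author = quote.xpath('.//small[@class="author"]/text()')
    items.append({
        'text': text[0] if text else None,
        'author': author[0] if author else None,
    })

print(items)
```

The same pattern would apply to the hotel list: find an attribute that identifies each hotel block, then select the `h3/a` relative to that block.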
I am scraping individual listing pages from justproperty.com (individual listing from the original question no longer active).
I want to get the value of the Ref field.
This is my XPath:
>>> sel.xpath('normalize-space(.//div[@class="info_div"]/table/tbody/tr/td[normalize-space(text())="Ref:"]/following-sibling::td[1]/text())').extract()[0]
This has no results in scrapy, despite working in my browser.
The following works perfectly in lxml.html (which modern Scrapy uses):
sel.xpath('.//div[@class="info_div"]//td[text()="Ref:"]/following-sibling::td[1]/text()')
Note that I'm using // to get between the div and the td, not laying out the explicit path. I'd have to take a closer look at the document to grok why, but the path given in that area was incorrect.
Don't build XPath expressions from what Firebug or Chrome DevTools shows; those tools display the browser's modified markup, not the original source. Remove the /tbody step and you'll get exactly what you're looking for.
normalize-space(.//div[@class="info_div"]/table/tr/td[
normalize-space(text())="Ref:"
]/following-sibling::td[1]/text())
Read Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing? for more details.
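The effect can be reproduced with a small sketch, assuming lxml and an invented table whose source contains no `<tbody>` (the reference value is made up). The path with `/tbody` matches nothing, the path without it works, and `//` is tolerant of both cases.

```python
from lxml import html

# Hypothetical source HTML: note that there is no <tbody> in the markup.
SOURCE = """<html><body>
<div class="info_div">
  <table>
    <tr><td class="titles">Ref:</td><td> JP-12345 </td></tr>
  </table>
</div>
</body></html>"""

root = html.fromstring(SOURCE)

# Path copied from browser dev tools: the parser did not add <tbody>,
# so nothing matches.
with_tbody = root.xpath('.//div[@class="info_div"]/table/tbody/tr/td')

# Same path with the tbody step removed.
ref = root.xpath(
    'normalize-space(.//div[@class="info_div"]/table/tr/'
    'td[normalize-space(text())="Ref:"]/following-sibling::td[1]/text())'
)

# Using // between div and td works whether or not <tbody> is present.
tolerant = root.xpath(
    'normalize-space(.//div[@class="info_div"]//'
    'td[normalize-space(text())="Ref:"]/following-sibling::td[1])'
)

print(with_tbody, ref, tolerant)
```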
Another XPath that gets the same thing: (.//td[@class='titles']/../td[2])[1]
I tried your XPath using XPath Checker and it works fine.
I am writing a Scrapy program to extract the data.
This is the URL, and I want to scrape the code value (20111028013117). I took the XPath from the Firefox add-on XPather. This is the path:
/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]
While I am trying to execute this
try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print "temp_list:" + str(temp_list)
except:
    print "error"
It returns an empty list. I have been struggling to find an answer for the last 4 hours. I am a newbie to Scrapy; I have handled issues in other projects well, but this one seems a bit difficult.
Your XPath doesn't work because of tbody. Remove it and check whether you get the result you want.
You can read this in scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to
tables. Scrapy, on the other hand, does not modify the original page
HTML, so you won’t be able to extract any data if you use <tbody> in
your XPath expressions.
I see that the element you are hunting for is inside a <table>.
Firefox adds a tbody tag to every table, even if it does not exist in the source HTML.
That might be why your XPath query works in the browser but fails in Scrapy.
As suggested, use other anchors in your XPath query.
You can extract data with more ease using more robust XPaths instead of taking the direct output from XPather.
For the data you are matching, this XPath would do a lot better:
//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()
This will match the <font> tag containing "Code", then go to the td tag above it and select the next td -> font, which contains the code you are looking for.
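A quick way to check that expression is to run it against a cut-down table with lxml (assumed installed). The markup below is a guess at the page's structure, reduced to two rows; only the code value comes from the question.

```python
from lxml import html

# Hypothetical, simplified version of the nested table layout.
SAMPLE = """<html><body><table>
  <tr><td><font>Name</font></td><td><font>Example</font></td></tr>
  <tr><td><font>Code</font></td><td><font>20111028013117</font></td></tr>
</table></body></html>"""

root = html.fromstring(SAMPLE)

# Anchor on the label text instead of on the position in the table tree.
codes = root.xpath(
    "//font[contains(text(),'Code')]/parent::td"
    "/following-sibling::td/font/text()"
)
print(codes)
```

Because the expression never mentions `table`, `tbody`, or row positions, it keeps working whether or not the parser sees `<tbody>` elements.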
Have you tried removing a few node tags at the end of the query, and re-running until you get a result? Do this several times until you get something, then add items back in cautiously until the query is rectified.
Also, check that your target page validates as XHTML - an invalid page would probably upset the parser.