Scrapy hxs.select returning [] - python

I run the following and it works fine:
hxs.select('//h1').extract()
However this
hxs.select('//div[@class="ClassName"]/text()').extract()
returns [].
Is my syntax wrong? I'm unsure why the div class isn't working (it's definitely there!).

Without the HTML it is difficult to answer this.
I recommend using Firefox with an XPath add-on; I use FirePath to evaluate XPaths on web pages. Check for other add-ons that let you build an XPath from the Firebug view.
Your syntax seems to be correct.
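One common cause worth checking: if the div carries more than one class (e.g. class="ClassName something-else"), an exact @class="ClassName" match returns nothing. A minimal sketch for the Scrapy shell, using the same old HtmlXPathSelector API as above ("ClassName" is a placeholder for your real class):

from scrapy.selector import HtmlXPathSelector

hxs = HtmlXPathSelector(response)

# Exact match -- fails if the element has additional classes
hxs.select('//div[@class="ClassName"]/text()').extract()

# Looser match that tolerates extra classes; //text() also picks up text in
# child elements, which /text() alone would miss
hxs.select('//div[contains(@class, "ClassName")]//text()').extract()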

Related

Python Selenium Find Element

I'm searching for a tag inside a class. I tried many methods but I couldn't get the value (see the source code).
The data I need is inside the "data-description" attribute.
How can I get the "data-description"?
I tried some methods but they didn't work:
driver.find_element_by_name("data-description")
driver.find_element_by_css_selector("data-description")
I solved it with this method:
hizmetler = []
icerisi = browser.find_elements_by_class_name('integratedService ')
for mycode in icerisi:
    hizmetler.append(mycode.get_attribute("data-description"))
Thanks for your help.
I think a CSS selector would work best here. "data-description" isn't an element; it's an attribute of an element. The CSS selector for an element with a given attribute would be:
[attribute]
Or, to be more specific, you could use:
[attribute="attribute value"]
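A hedged sketch of how that looks with the (older) Selenium API used elsewhere in this thread; the attribute name comes from the question, the value is made up:

# Find every element that has a data-description attribute at all
elements = driver.find_elements_by_css_selector('[data-description]')

# Or only elements whose attribute has a specific (here hypothetical) value
elements = driver.find_elements_by_css_selector('[data-description="Some value"]')

for el in elements:
    print(el.get_attribute("data-description"))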
Here's a good tip:
Most web browsers have a way of copying an element's selector or XPath. For example, in Safari, if you view the source code and then right-click on an element, it will give you the option to copy it. Then select XPath or Selector and, in your code, use driver.find_element_by_xpath() or driver.find_element_by_css_selector(). I am certain Google Chrome and Firefox have similar options.
This method is not always failsafe, as the copied XPath can be very specific, meaning that slight changes to the website will cause your script to break. But it is a quick and easy solution, and is especially useful if you don't plan on reusing your code months or years later.
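For illustration only, a copied selector might be dropped into Selenium like this (the paths below are invented, not taken from a real page):

# A browser-generated XPath tends to be an absolute, position-based path like this
element = driver.find_element_by_xpath('/html/body/div[2]/div[1]/section/span')
print(element.text)

# A copied CSS selector is used the same way
element = driver.find_element_by_css_selector('div.listing > span.price')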

Selecting the first link in a google search

When I inspect the website (a Google search page), I'm able to select my desired href by searching for //div[@class="r"]/a/@href in the finder. But when using Scrapy and accessing it via response.xpath('//div[@class="r"]/a/@href'), this returns empty. Many other XPaths, such as the link title, also come back empty. Strangely enough, I'm able to get something when using response.xpath('//cite').get(), which is basically the href but incomplete.
If I do response.body I'm able to see my desired href deep in the markup, but I have no idea how to access it. Trying to select it with the usual CSS or XPath methods that would work on any other website has been futile.
The reason the XPath you're using works in your browser but not on the response is that Google renders the page differently when JS is disabled, which is the case for Scrapy but not for your browser. So you'll need an XPath that works for both, or just for the no-JS case.
This one works for no JS but won't work in the browser (if JS is enabled):
//div[@id='ires']//h3/a[1]/@href
This will return the first URL of the first result.
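In the Scrapy shell that might look like the sketch below (Google's markup changes frequently, so treat the selector as illustrative rather than guaranteed):

# Evaluate the no-JS selector from the answer above; .get() returns the first
# match, or None if the markup has changed since this was written
first_href = response.xpath("//div[@id='ires']//h3/a[1]/@href").get()
print(first_href)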
Try the below.
response.xpath("//div[@class='r']").xpath("//a/@href").extract()
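One caveat: in Scrapy, chaining a second .xpath() call with an expression that starts with // searches the whole document again, not just the nodes already matched. If you only want links inside the matched divs, anchor the inner expression with .// as in this sketch:

# Keep the inner expression relative to each matched <div class="r"> node;
# a bare "//a/@href" would restart the search from the document root
response.xpath("//div[@class='r']").xpath(".//a/@href").extract()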

Xpath is correct but Scrapy doesn't work

I'm trying to download two fields from a webpage, I identify the XPath expressions for each one and then run the spider, but nothing is downloaded.
The webpage:
http://www.morningstar.es/es/funds/snapshot/snapshot.aspx?id=F0GBR04MZH
The field I want to extract into my item is the ISIN.
The spider runs without errors, but the output is empty.
Here is the line code:
item['ISIN'] = response.xpath('//*[@id="overviewQuickstatsDiv"]/table/tbody/tr[5]/td[3]/text()').extract()
Try removing tbody from the XPath:
'//*[@id="overviewQuickstatsDiv"]/table//tr[5]/td[3]/text()'
Note that this tag is added by your browser while rendering the page and is absent from the page source.
P.S. I suggest using an (IMHO) even better XPath:
'//td[.="ISIN"]/following-sibling::td[contains(@class, "text")]/text()'
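As a rough sketch of how that expression might sit in the spider's parse() callback (the ISIN field name comes from the question; everything else is illustrative):

# Hypothetical parse() callback; extract_first() returns None rather than
# raising if the ISIN cell is missing
def parse(self, response):
    item = {}
    item['ISIN'] = response.xpath(
        '//td[.="ISIN"]/following-sibling::td[contains(@class, "text")]/text()'
    ).extract_first()
    yield item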
I think response.selector was not used. Try this:
response.selector.xpath('//*[@id="overviewQuickstatsDiv"]/table/tbody/tr[5]/td[3]/text()').extract()

how to find xpath, class name of elements python selenium

My code is:
driver.get("http://www.thegoodguys.com.au/buyonline/SearchDisplay?pageSize=16&beginIndex=0&searchSource=Q&sType=SimpleSearch&resultCatEntryType=2&showResultsPage=true&pageView=image&searchTerm=laptops")
link = []
linkPrice = []
price = []
productName = []
Site = 'Harvey Norman'
link = driver.find_elements_by_class_name("photo")
linkPrice = driver.find_elements_by_class_name("product-title")
price = driver.find_elements_by_xpath("//div[@class='purchase']/span/span")
I am not sure whether the supplied XPath and class names are correct. Could someone verify them and let me know how to find them?
In Firefox you can use the developer tools or Firebug to check the HTML for classes and element IDs. Following the link in your question I can find a class called photo, but for linkPrice and price you should use other classes.
Try:
price=driver.find_elements_by_class_name("price")
linkPrice=driver.find_elements_by_class_name("addtocart")
Which gives me:
>>> price[0].text
u'$496'
>>> linkPrice[0].text
u'ADD TO CART'
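Putting the pieces together, a minimal sketch pairing each product title with its price (same older Selenium API as the rest of the thread; the class names were taken from the page at the time and may have changed since):

# Hedged sketch: zip the title and price element lists into (name, price) pairs
titles = driver.find_elements_by_class_name("product-title")
prices = driver.find_elements_by_class_name("price")

for title, price in zip(titles, prices):
    print(title.text, price.text)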
You can verify an XPath using the developer tools console in Chrome, e.g. $x("//foo") for XPath or $(".foo") for a CSS selector.
Firebug for Firefox will also let you verify them.
Browsers will also suggest an XPath for you, but these are often verbose and brittle, so I'd recommend hand-crafting your own.

xpath doesn't work on this website

I am scraping individual listing pages from justproperty.com (individual listing from the original question no longer active).
I want to get the value of the Ref field.
This is my XPath:
>>> sel.xpath('normalize-space(.//div[@class="info_div"]/table/tbody/tr/td[normalize-space(text())="Ref:"]/following-sibling::td[1]/text())').extract()[0]
This returns no results in Scrapy, despite working in my browser.
The following works perfectly in lxml.html (which modern Scrapy uses):
sel.xpath('.//div[@class="info_div"]//td[text()="Ref:"]/following-sibling::td[1]/text()')
Note that I'm using // to get between the div and the td rather than laying out the explicit path. I'd have to take a closer look at the document to grok why, but the explicit path given in the question was incorrect.
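For completeness, here is a rough illustration of the same expression evaluated with lxml.html directly (the listing is no longer live, so page_source below is just a placeholder for whatever HTML string you fetch):

import lxml.html

# page_source: the raw HTML of the listing page, however you obtained it
tree = lxml.html.fromstring(page_source)
refs = tree.xpath(
    './/div[@class="info_div"]//td[text()="Ref:"]/following-sibling::td[1]/text()'
)
print(refs)  # list of matching text nodes, if any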
Don't create XPath expressions by looking at Firebug or Chrome Dev Tools; they change the markup. Remove the /tbody axis step and you'll get exactly what you're looking for:
normalize-space(.//div[@class="info_div"]/table/tr/td[
normalize-space(text())="Ref:"
]/following-sibling::td[1]/text())
Read Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing? for more details.
Another XPath that gets the same thing: (.//td[@class='titles']/../td[2])[1]
I tried your XPath using XPath Checker and it works fine.
