I'm trying to download two fields from a webpage. I identify the XPath expressions for each one and then run the spider, but nothing is downloaded.
The webpage:
http://www.morningstar.es/es/funds/snapshot/snapshot.aspx?id=F0GBR04MZH
The field I want to extract is the ISIN.
The spider runs without errors, but the output is empty.
Here is the line of code:
item['ISIN'] = response.xpath('//*[@id="overviewQuickstatsDiv"]/table/tbody/tr[5]/td[3]/text()').extract()
Try removing tbody from the XPath:
'//*[@id="overviewQuickstatsDiv"]/table//tr[5]/td[3]/text()'
Note that this tag is added by your browser while rendering the page; it is absent from the page source.
P.S. I suggest using an even more robust XPath:
'//td[.="ISIN"]/following-sibling::td[contains(@class, "text")]/text()'
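To see the difference, here is a minimal sketch using lxml. The mock HTML below is an assumption, reduced to the relevant cells with a placeholder ISIN; the real page has more rows:

```python
from lxml import html

# Mock of the quick-stats table as served, without <tbody> (assumption:
# reduced to the relevant cells; the real page differs).
doc = html.fromstring("""
<html><body><div id="overviewQuickstatsDiv"><table>
  <tr><td>ISIN</td><td class="line text">XX0000000000</td></tr>
</table></div></body></html>
""")

# The browser-copied path with /tbody/ matches nothing in the raw source:
print(doc.xpath('//*[@id="overviewQuickstatsDiv"]/table/tbody//td/text()'))  # []

# Anchoring on the label cell works regardless of the row's position:
print(doc.xpath('//td[.="ISIN"]/following-sibling::td[contains(@class, "text")]/text()'))
```

The label-anchored query keeps working even if the table gains or loses rows, which is why it is more robust than a positional path.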
I think response.selector was not given. Try this:
response.selector.xpath('//*[@id="overviewQuickstatsDiv"]/table/tbody/tr[5]/td[3]/text()').extract()
Link: https://www.softwareadvice.com/hr/zenefits-profile
I am trying to scrape the description from the above link. The XPath seems correct, but it doesn't return the value in the Scrapy shell. (Please see the screenshots below.)
I tried methods like get(), getall(), extract(), extract_first(), and extractall(), but I am getting an empty list.
Kindly help me identify the error. Thanks...
[Screenshot: the XPath in the browser dev tools]
[Screenshot: the Scrapy shell output]
If you disable JS in your browser, you will find that this XPath no longer works.
That is because of how Scrapy loads pages: it fetches only the raw HTML and does not execute any JS/AJAX.
Try this XPath instead:
response.xpath("/html/body/app-root/main/app-product/div[1]/app-product-detail/div[2]/div/div[1]/div/div/p//text()").getall()
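A quick way to check whether a field is present in the raw HTML at all (and thus reachable without JS) is to fetch the page with a plain HTTP client and search for the text. A minimal stdlib sketch; the helper name is mine:

```python
import urllib.request

def served_without_js(url: str, needle: str) -> bool:
    """Return True if `needle` appears in the raw HTML that a non-JS client
    receives. If it does not, Scrapy alone will never see it, and you need
    a JS-capable fetcher (e.g. a headless browser) instead."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return needle in resp.read().decode("utf-8", errors="replace")
```

For example, `served_without_js("https://www.softwareadvice.com/hr/zenefits-profile", "Zenefits")` tells you whether that text is in the initial HTML or only rendered later by JS.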
When I inspect the website (a Google search), I'm able to select my desired href by searching for //div[@class="r"]/a/@href through the finder. But when using Scrapy and accessing it with response.xpath('//div[@class="r"]/a/@href'), this returns empty. Many other XPaths, such as the link title, also come back empty. Strangely enough, I'm able to get something when using response.xpath('//cite').get(), which is basically the href but incomplete.
If I do response.body, I can see my desired href deep in the code, but I have no idea how to access it. Trying to select it with the usual CSS or XPath methods that would work on any other website has been futile.
The reason the XPath you're using works in your browser but not in the response is that Google renders the page differently when JS is disabled, which is the case for Scrapy but not for your browser. You'll need an XPath that works in both cases, or at least in the non-JS one.
This one works for no JS but won't work in the browser (if JS is enabled):
//div[@id='ires']//h3/a[1]/@href
This will return the first URL of the first result.
Try the below (note the leading . in the inner query, which keeps it relative to the selected node):
response.xpath("//div[@class='r']").xpath(".//a/@href").extract()
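One thing to watch with chained selectors: an inner query that starts with // searches the whole document again, not just the node you selected. A minimal lxml sketch (the mock markup is an assumption; Google's real markup changes often):

```python
from lxml import html

# Mock result markup (assumption; only the relevant structure is shown).
doc = html.fromstring("""
<html><body>
  <div class="r"><a href="https://example.com/first">First result</a></div>
  <div class="s"><a href="https://example.org/other">Unrelated link</a></div>
</body></html>
""")

node = doc.xpath("//div[@class='r']")[0]

# Starting the inner query with // escapes the selected node:
print(node.xpath("//a/@href"))   # both hrefs, in document order
# A leading . keeps the query relative to the selected node:
print(node.xpath(".//a/@href"))  # only the link inside div.r
```

Scrapy's selectors behave the same way when you chain .xpath() calls.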
Here is an example page with pagination controlling dynamically loaded results.
http://www.rehabs.com/local/jacksonville-fl/
All that I presently know to try is:
curButton = 1
driver.find_element_by_css_selector('ul[class="pagination"]').find_elements_by_tag_name('li')[curButton].click()
Nothing seems to happen (the same when trying to click the a tag directly, or passing the a element's href to driver.get()).
Is there another way to access the hidden elements? For instance, when reading the html of the entire page, the elements of different pagination are shown, but are apparently inaccessible with BeautifulSoup.
Pagination was added for humans, so it is clickable. Maybe you used the wrong XPath or CSS selector; check it.
Use this xpath:
//div[@id="listing-basic"]/article/div[@class="h3"]/a/@href
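A minimal lxml sketch of that expression against mock markup (an assumption, reduced to one listing; the real page has many article nodes):

```python
from lxml import html

# Mock of one listing entry (assumption; only the relevant structure shown).
doc = html.fromstring("""
<html><body><div id="listing-basic">
  <article><div class="h3">
    <a href="/local/jacksonville-fl/listing-1">Example Center</a>
  </div></article>
</div></body></html>
""")

links = doc.xpath('//div[@id="listing-basic"]/article/div[@class="h3"]/a/@href')
print(links)
```

Against the real page this returns one relative href per visible listing.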
You can click on the pagination button using:
driver.find_elements_by_css_selector('.pagination li a')[1].click()
I am scraping individual listing pages from justproperty.com (the individual listing from the original question is no longer active).
I want to get the value of the Ref field.
This is my XPath:
>>> sel.xpath('normalize-space(.//div[@class="info_div"]/table/tbody/tr/td[normalize-space(text())="Ref:"]/following-sibling::td[1]/text())').extract()[0]
This has no results in scrapy, despite working in my browser.
The following works perfectly in lxml.html (which modern Scrapy uses):
sel.xpath('.//div[@class="info_div"]//td[text()="Ref:"]/following-sibling::td[1]/text()')
Note that I'm using // to get between the div and the td rather than spelling out the explicit path. I'd have to take a closer look at the document to see why, but the explicit path given in that area was incorrect.
Don't create XPath expressions by looking at Firebug or Chrome Dev Tools; they change the markup. Remove the /tbody axis step and you'll receive exactly what you're looking for:
normalize-space(.//div[@class="info_div"]/table/tr/td[
normalize-space(text())="Ref:"
]/following-sibling::td[1]/text())
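A minimal lxml sketch of the tbody-free query (the mock markup is an assumption, reduced to the Ref row with a placeholder value):

```python
from lxml import html

# Mock of the info table as served, without <tbody> (assumption).
doc = html.fromstring("""
<html><body><div class="info_div"><table>
  <tr><td>Ref:</td><td>JP-12345</td></tr>
</table></div></body></html>
""")

# The /tbody/ version finds nothing in the raw source:
print(doc.xpath('//div[@class="info_div"]/table/tbody/tr/td/text()'))  # []

# Without /tbody/ the label-anchored query returns the value:
ref = doc.xpath('normalize-space(//div[@class="info_div"]/table/tr/td['
                'normalize-space(text())="Ref:"]/following-sibling::td[1]/text())')
print(ref)
```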
Read Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing? for more details.
Another XPath that gets the same thing: (.//td[@class='titles']/../td[2])[1]
I tried your XPath using XPath Checker and it works fine.
I am writing a Scrapy program to extract some data.
This is the url, and I want to scrape the code 20111028013117. I have taken the XPath from the Firefox add-on XPather. This is the path:
/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]
When I try to execute this:
try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print "temp_list:" + str(temp_list)
except:
    print "error"
It returns an empty list. I have been struggling to find an answer for the last 4 hours. I am a newbie to Scrapy; even though I have handled issues well in other projects, this one seems a bit difficult.
The reason your XPath doesn't work is the tbody. You have to remove it and check whether you get the result you want.
You can read this in the Scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to
tables. Scrapy, on the other hand, does not modify the original page
HTML, so you won’t be able to extract any data if you use <tbody> in
your XPath expressions.
I see that the element you are hunting for is inside a <table>.
Firefox adds a tbody tag to every table, even if it does not exist in the source HTML.
That might be the reason your XPath query works in the browser but fails in Scrapy.
As suggested, use other anchors in your XPath query.
You can extract the data more easily using a more robust XPath instead of taking the direct output from XPather.
For the data you are matching, this XPath would do a lot better:
//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()
This matches the <font> tag containing "Code", then goes up to its parent <td> and selects the next <td> -> <font>, which contains the code you are looking for.
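A minimal lxml sketch of that anchor-based query (the mock markup is an assumption, reduced to the relevant row of the deeply nested tables):

```python
from lxml import html

# Mock of the relevant row (assumption; the real page nests many tables).
doc = html.fromstring("""
<html><body><table>
  <tr><td><font>Code</font></td>
      <td><font>20111028013117</font></td></tr>
</table></body></html>
""")

# Anchor on the "Code" label instead of a long positional path:
code = doc.xpath("//font[contains(text(),'Code')]"
                 "/parent::td/following-sibling::td/font/text()")
print(code)
```

Because the query is anchored on the label text, it survives changes in how deeply the tables are nested.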
Have you tried removing a few steps from the end of the query and re-running it until you get a result? Then add steps back cautiously until the query is fixed.
Also, check that your target page validates as XHTML; an invalid page could upset the parser.
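The trim-and-retry loop above can be automated. A rough sketch with lxml; the helper name and mock document are mine, and the naive split would need a real parser for predicates containing /:

```python
from lxml import html

def longest_matching_prefix(doc, xpath_expr):
    """Drop trailing location steps from an absolute XPath until the query
    matches something; return the longest matching prefix, or None.
    Helps find where a browser-copied path diverges from the raw source."""
    steps = [s for s in xpath_expr.split('/') if s]  # naive split on '/'
    for n in range(len(steps), 0, -1):
        expr = '/' + '/'.join(steps[:n])
        try:
            if doc.xpath(expr):
                return expr
        except Exception:
            pass  # an intermediate prefix may be invalid XPath; keep trimming
    return None

# Mock document without <tbody> (assumption):
doc = html.fromstring("<html><body><table><tr><td>x</td></tr></table></body></html>")
print(longest_matching_prefix(doc, '/html/body/table/tbody/tr/td'))
# '/html/body/table' -> the path breaks at the /tbody/ step
```

The returned prefix points directly at the step where the browser-copied path stops matching the served HTML.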