When I inspect the website (a Google search results page), I'm able to select my desired href by searching for //div[@class="r"]/a/@href in the finder. But when using Scrapy, response.xpath('//div[@class="r"]/a/@href') returns empty, and many other XPaths, such as the link title, come back empty as well. Strangely enough, I do get something with response.xpath('//cite').get(), which is basically the href but incomplete.
If I look at response.body I can see my desired href deep in the code, but I have no idea how to access it. Selecting it with the usual CSS or XPath methods that would work on any other website has been futile.
The reason the XPath you're using works in your browser but not in the response is that Google renders the page differently when JS is disabled, which is the case for Scrapy but not for your browser, so you'll need an XPath that works for both, or just for the first case.
This one works for no JS but won't work in the browser (if JS is enabled):
//div[@id='ires']//h3/a[1]/@href
This will return the first URL of the first result.
Try the below.
response.xpath("//div[#class='r']").xpath("//a/#href").extract()
I'm trying to get the names of the users and the content of the comments that exist on this page:
User and text that I need to extract:
When I test the extraction with the chrome plugin Xpath helper, I am getting the user names with the statement:
//*[#id="livefyre"]/div/div/div/div/article/div/header/a/span
and the comments, I get them with:
//*[#id="livefyre"]/div/div/div/div/article/div/section/div/p
When I do the test in the scrapy console, with the query:
response.xpath('//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p').extract()
I get [].
I've also tried with:
response.xpath('//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p/text()').extract()
The same thing happens with my code.
Checking the HTML source of the page, I see that none of those comments exist in it. When I inspect the page in the browser I can see the comment text, but when I look at the page's HTML source there is nothing there.
Where am I making a mistake?
Thanks for the help.
As you stated, there aren't any comments in the page source, which means the website is rendered through JavaScript. There are two ways you can scrape these kinds of websites:
First, use scrapy-splash to render the JavaScript.
Second, find the API/network call that brings in the comments and mock that request in Scrapy to get your data.
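As a sketch of the second approach: open the browser's Network tab, find the XHR that returns the comments, and replay it in Scrapy. The endpoint and JSON shape below are hypothetical placeholders; copy the real URL and field names from your own Network tab:

import json
import scrapy

class CommentsSpider(scrapy.Spider):
    name = 'comments'

    def start_requests(self):
        # Hypothetical endpoint: replace with the real comments URL
        # observed in the browser's Network tab.
        api_url = 'https://example.livefyre.com/comments?article=123'
        yield scrapy.Request(api_url, callback=self.parse_comments)

    def parse_comments(self, response):
        # Such endpoints usually return JSON rather than HTML.
        data = json.loads(response.text)
        for comment in data.get('comments', []):
            yield {'user': comment.get('author'),
                   'text': comment.get('body')}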
I'm trying to download two fields from a webpage. I identified the XPath expression for each one and ran the spider, but nothing is downloaded.
The webpage:
http://www.morningstar.es/es/funds/snapshot/snapshot.aspx?id=F0GBR04MZH
The field I want to itemize is ISIN.
The spider runs without errors, but the output is empty.
Here is the line code:
item['ISIN'] = response.xpath('//*[@id="overviewQuickstatsDiv"]/table/tbody/tr[5]/td[3]/text()').extract()
Try to remove tbody from XPath:
'//*[#id="overviewQuickstatsDiv"]/table//tr[5]/td[3]/text()'
Note that this tag is added by your browser while rendering the page; it's absent from the page source.
P.S. I suggest using an (IMHO) even better XPath:
'//td[.="ISIN"]/following-sibling::td[contains(#class, "text")]/text()'
I think response.selector was not given. Try this.
response.selector.xpath('//*[@id="overviewQuickstatsDiv"]/table/tbody/tr[5]/td[3]/text()').extract()
I've been trying to figure out a simple way to run through a set of URLs that lead to pages that all have the same layout. We figured out that one issue is that in the original list the URLs are http but then they redirect to https. I am not sure if that then causes a problem in trying to pull the information from the page. I can see the structure of the page when I use Inspector in Chrome, but when I try to set up the code to grab relevant links I come up empty (literally). The most general code I have been using is:
import urllib2
from bs4 import BeautifulSoup, SoupStrainer

soup = BeautifulSoup(urllib2.urlopen('https://ngcproject.org/program/algirls').read())
links = SoupStrainer('a')
print links
which yields:
a|{}
Given that I'm new to this I've been trying to work with anything that I think might work. I also tried:
mail = soup.find(attrs={'class':'tc-connect-details_send-email'}).a['href']
and
spans = soup.find_all('span', {'class' : 'tc-connect-details_send-email'})
lines = [span.get_text() for span in spans]
print lines
but these don't yield anything either.
I am assuming that it's an issue with my code and not one that the data are hidden from being scraped. Ideally I want to have the data passed to a CSV file for each URL I scrape but right now I need to be able to confirm that the code is actually grabbing the right information. Any suggestions welcome!
If you press CTRL+U in Google Chrome, or right-click > View Source, you'll see that the page is rendered using JavaScript.
urllib is not going to be able to display/download what you're looking for.
You'll have to use an automated browser (Selenium is the most popular), driving Google Chrome / Firefox or a headless browser (PhantomJS).
You can then get the information from Selenium, store it, and manipulate it any way you see fit.
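For instance, a minimal Selenium sketch for this page; the class name is taken from the question's own code, so treat it as an assumption if the site has changed:

from selenium import webdriver

driver = webdriver.Chrome()  # or Firefox / PhantomJS for headless
driver.get('https://ngcproject.org/program/algirls')

# Unlike urllib, Selenium sees the DOM after JavaScript has run.
# The class below comes from the question; verify it on the live page.
for link in driver.find_elements_by_css_selector('.tc-connect-details_send-email a'):
    print(link.get_attribute('href'))

driver.quit()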
Here is an example page with pagination controlling dynamically loaded results.
http://www.rehabs.com/local/jacksonville-fl/
All that I presently know to try is:
curButton = 1
driver.find_element_by_css_selector('ul[class="pagination"]').find_elements_by_tag_name('li')[curButton].click()
Nothing seems to happen (the same when trying to access and click the a tag directly, or when passing the a element's href to driver.get()).
Is there another way to access the hidden elements? For instance, when reading the html of the entire page, the elements of different pagination are shown, but are apparently inaccessible with BeautifulSoup.
Pagination was added for humans. Maybe you used the wrong XPath or CSS selector; check it.
Use this xpath:
//div[#id="listing-basic"]/article/div[#class="h3"]/a/#href
You can click on the pagination button using:
driver.find_elements_by_css_selector('.pagination li a')[1].click()
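If the click fires before the JavaScript has wired up the controls, it can silently do nothing, so a sketch with an explicit wait may be more reliable (same selectors as above, still assumptions about the page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('http://www.rehabs.com/local/jacksonville-fl/')
wait = WebDriverWait(driver, 10)

# Wait until the pagination links are clickable before touching them.
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.pagination li a')))
driver.find_elements(By.CSS_SELECTOR, '.pagination li a')[1].click()

# Re-read the listing links from the freshly rendered results.
for a in driver.find_elements(By.XPATH,
        '//div[@id="listing-basic"]/article/div[@class="h3"]/a'):
    print(a.get_attribute('href'))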
I am scraping individual listing pages from justproperty.com (the individual listing from the original question is no longer active).
I want to get the value of the Ref field.
This is my XPath:
>>> sel.xpath('normalize-space(.//div[@class="info_div"]/table/tbody/tr/td[normalize-space(text())="Ref:"]/following-sibling::td[1]/text())').extract()[0]
This has no results in scrapy, despite working in my browser.
The following works perfectly in lxml.html (which modern Scrapy uses):
sel.xpath('.//div[#class="info_div"]//td[text()="Ref:"]/following-sibling::td[1]/text()')
Note that I'm using // to get between the div and the td rather than spelling out the explicit path. I'd have to take a closer look at the document to work out why, but the explicit path given there was incorrect.
Don't create XPath expressions by looking at Firebug or Chrome Dev Tools; they change the markup. Remove the /tbody axis step and you'll get exactly what you're looking for.
normalize-space(.//div[#class="info_div"]/table/tr/td[
normalize-space(text())="Ref:"
]/following-sibling::td[1]/text())
Read Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing? for more details.
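You can verify the mismatch outside the browser; a small sketch using requests plus parsel (the library behind Scrapy's selectors), with the URL as a placeholder for whatever listing page you're scraping:

import requests
from parsel import Selector

# Fetch the raw HTML the server sends: no browser, so no JS and no
# browser-inserted tags. Placeholder URL: use your actual listing page.
html = requests.get('http://www.justproperty.com/').text
sel = Selector(text=html)

# Dev Tools shows a <tbody> inside every <table>, but the raw source
# usually lacks it, which is why a /tbody step matches nothing:
print(sel.xpath('//table/tbody'))   # typically an empty SelectorList
print(sel.xpath('//table//tr'))     # skipping tbody with // still finds rows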
Another XPath that gets the same thing: (.//td[@class='titles']/../td[2])[1]
I tried your XPath using XPath Checker and it works fine.