Scraping search results with Scrapy and Selenium - python

This might be a long shot, but people have always been really helpful with the questions I've posted in the past so I'm gonna try. If anyone could help me, that would be amazing...
I'm trying to use Scrapy to get search results (links) after searching for a keyword on a Chinese online newspaper - pages like this
When I inspect the HTML for the page in Chrome, the links to the articles seem to be there. But when I try to grab them with a Scrapy spider, the HTML is much more basic and the links I want don't show up. I think this may be because the results are being drawn onto the page using JavaScript? I've tried combining Scrapy with scrapy-selenium to get around this, but it is still not working. I have heard Splash might work, but it seems complicated to set up.
Here is the code for my Scrapy spider:
import scrapy
from scrapy_selenium import SeleniumRequest

class QuotesSpider(scrapy.Spider):
    name = "XH"

    def start_requests(self):
        urls = [
            'http://so.news.cn/#search/0/%E4%B8%80%E5%B8%A6%E4%B8%80%E8%B7%AF/1/'
        ]
        for url in urls:
            yield SeleniumRequest(url=url, wait_time=90, callback=self.parse)

    def parse(self, response):
        print(response.request.meta['driver'].title)
        page = response.url.split("/")[-2]
        filename = 'XH-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
I can also post any of the other Scrapy files if that is helpful. I have also modified settings.py, following these instructions.
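For reference, the settings.py additions from those instructions boil down to something like the snippet below (this is the standard scrapy-selenium wiring; Chrome and a chromedriver on the PATH are assumptions on my part, adjust for your browser):

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # assumes chromedriver is on the PATH
SELENIUM_DRIVER_ARGUMENTS = ['--headless']               # run the browser without a window

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}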
Any help would be really appreciated. I'm completely stuck with this!

Open the Network tab in the browser's developer tools and watch the requests; you will find that the data comes from this url, so crawl that endpoint with a normal scrapy.Request() instead.
The spider would look like this:
import scrapy
import json

class QuotesSpider(scrapy.Spider):
    name = "XH"

    def start_requests(self):
        urls = [
            'http://so.news.cn/getNews?keyword=%E4%B8%80%E5%B8%A6&curPage=1&sortField=0&searchFields=1&lang=cn'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.body.decode('utf-8'))
        for data in json_data['content']['results']:
            yield {
                'url': data['url']
            }
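Running scrapy crawl XH -o urls.json then dumps whatever the spider yields to a file, which makes it easy to check the output. If you also need results beyond the first page, the curPage query parameter in that URL is the obvious knob to turn; here is a hedged sketch (MAX_PAGES is a made-up cap, I have not checked how the endpoint reports the real page count):

import scrapy
import json

class QuotesSpider(scrapy.Spider):
    name = "XH"
    # Hypothetical cap on how many result pages to fetch; the real total
    # would have to be read from the JSON response or found by trial.
    MAX_PAGES = 5

    def start_requests(self):
        base = ('http://so.news.cn/getNews?keyword=%E4%B8%80%E5%B8%A6'
                '&curPage={page}&sortField=0&searchFields=1&lang=cn')
        for page in range(1, self.MAX_PAGES + 1):
            yield scrapy.Request(url=base.format(page=page), callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.body.decode('utf-8'))
        for data in json_data['content']['results']:
            yield {'url': data['url']}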

Related

Scrapy response url not exactly the same as the one I defined in start urls

I have a spider and I give it this URL: https://tuskys.dpo.store/#!/~/search/keyword=dairy milk
However, when I try to get the URL in the Scrapy parse method, it looks like https://tuskys.dpo.store/?_escaped_fragment_=%2F%7E%2Fsearch%2Fkeyword%3Ddairy%2520milk
Here is some demo code to demonstrate my problem:
import scrapy

class TuskysDpoSpider(scrapy.Spider):
    name = "Tuskys_dpo"
    #allowed_domains = ['ebay.com']
    start_urls = ['https://tuskys.dpo.store/#!/~/search/keyword=dairy milk']

    def parse(self, response):
        yield {'url': response.url}
results: {"url": "https://tuskys.dpo.store/?_escaped_fragment_=%2F%7E%2Fsearch%2Fkeyword%3Ddairy%2520milk"}
Why is my Scrapy response URL not exactly the same as the URL I defined, and is there a way to get around this?
You should use response.request.url: you are redirected from your start URL, so response.url is the URL you were redirected to.
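If that explanation fits your case, the fix in the demo spider is just a different attribute; a minimal sketch (untested against that site):

import scrapy

class TuskysDpoSpider(scrapy.Spider):
    name = "Tuskys_dpo"
    start_urls = ['https://tuskys.dpo.store/#!/~/search/keyword=dairy milk']

    def parse(self, response):
        # response.url is the URL of the response actually received;
        # response.request.url is the URL of the request that produced it
        yield {
            'requested_url': response.request.url,
            'final_url': response.url,
        }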

Scrapy get all links from any website

I have the following code for a web crawler in Python 3:
import requests
from bs4 import BeautifulSoup
import re

def get_links(link):
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))
    return return_links

def recursive_search(links):
    for i in links:
        links.extend(get_links(i))
    recursive_search(links)

recursive_search(get_links("https://www.brandonskerritt.github.io"))
The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.
I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.
This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, it might be that the links are in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.
I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.
Any help will be appreciated!
There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
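Concretely, that page suggests tuning settings along these lines; the values below are only illustrative, not recommendations:

# settings.py -- illustrative broad-crawl tuning
CONCURRENT_REQUESTS = 100              # crawl many sites in parallel
CONCURRENT_REQUESTS_PER_DOMAIN = 8     # but stay polite per domain
REACTOR_THREADPOOL_MAXSIZE = 20        # more threads for DNS resolution
LOG_LEVEL = 'INFO'                     # DEBUG logging gets expensive at scale
COOKIES_ENABLED = False                # usually not needed for broad crawls
RETRY_ENABLED = False                  # don't waste time retrying failing sites
DOWNLOAD_TIMEOUT = 15                  # give up on slow responses quickly
AJAXCRAWL_ENABLED = True               # handle "#!" AJAX-crawlable pages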
To recreate the behaviour you need in Scrapy, you must:
set your start URL in your spider;
write a parse function that follows all links and calls itself recursively, adding the requested URLs to a spider attribute.
An untested example (that can, of course, be refined):
import scrapy

class AllSpider(scrapy.Spider):
    name = 'all'
    start_urls = ['https://yourgithub.com']

    def __init__(self):
        self.links = []

    def parse(self, response):
        self.links.append(response.url)
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.
A simple spider that follows all links:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FollowAllSpider(CrawlSpider):
    name = 'follow_all'
    start_urls = ['https://example.com']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass

Cannot get item when crawl data using scrapy

I have inspected the element in Chrome:
I want to get the data in the red box (there can be more than one) using Scrapy. I used this code (following the tutorial from the Scrapy documentation):
import scrapy

class KamusSetSpider(scrapy.Spider):
    name = "kamusset_spider"
    start_urls = ['http://kbbi.web.id/' + 'abad']

    def parse(self, response):
        for kamusset in response.css("div#d1"):
            text = kamusset.css("div.sub_17 b.tur.highlight::text").extract()
            print(dict(text=text))
But there is no result.
What is happening? I have changed it to this (using Splash) but it is still not working:
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

def parse(self, response):
    html = response.body
    for kamusset in response.css("div#d1"):
        text = kamusset.css("div.sub_17 b.tur.highlight::text").extract()
        print(dict(text=text))
In this case it seems that the page content is generated dynamically: even though you can see the elements when inspecting in the browser, they are not present in the HTML source (i.e. in what Scrapy sees). That's because Scrapy can't render JavaScript. You need to use some kind of browser to render the page and then pass the result to Scrapy for processing. I recommend using Splash for its seamless integration with Scrapy.
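For that integration to work, a Splash instance has to be running (usually the Docker image on port 8050) and the project settings need the scrapy-splash wiring; roughly the following, taken from the scrapy-splash README (the URL is an assumption, point it at your own instance):

# settings.py -- scrapy-splash wiring
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'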

Unable to scrape some URLs from a webpage

I am trying to scrape all the restaurant URLs on a page. There are only 5 restaurant URLs to scrape in this particular example.
At this stage I am just trying to print them to see if my code works. However, I am not even able to get that done: my code is unable to find any of the URLs.
import scrapy
from hungryhouse.items import HungryhouseItem

class HungryhouseSpider(scrapy.Spider):
    name = "hungryhouse"
    allowed_domains = ["hungryhouse.co.uk"]
    start_urls = ["https://hungryhouse.co.uk/takeaways/westhill-ab32",
                  ]

    def parse(self, response):
        for href in response.xpath('//div[@class="restsRestInfo"]/a/@href'):
            url = response.urljoin(href.extract())
            print(url)
Any guidance as to why the five URLs are not being found would be gratefully received.

Scrapy crawling stackoverflow questions matching multiple tags

I am trying out Scrapy now. I tried the example code on the http://doc.scrapy.org/en/1.0/intro/overview.html page. I tried extracting the recent questions with the tag 'bigdata'. Everything worked well. But when I tried to extract questions with both the tags 'bigdata' and 'python', the results were not correct: questions having only the 'bigdata' tag came back in the results. In the browser, however, I get questions with both tags correctly. Please find the code below:
import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/bigdata?page=1&sort=newest&pagesize=50']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
When I change start_urls to
start_urls = ['https://stackoverflow.com/questions/tagged/bigdata+python?page=1&sort=newest&pagesize=50']
the results contain questions with only the 'bigdata' tag. How do I get questions with both tags only?
Edit: I think what is happening is that Scrapy is going into the pages for the tag 'bigdata' from the main page I gave, because the tags are links to the main page for that tag. How can I edit this code to make Scrapy not go into the tag pages and only crawl the questions on that page? I tried using rules like the one below, but the results were still not right.
rules = (Rule(LinkExtractor(restrict_css='.question-summary h3 a::attr(href)'), callback='parse_question'),)
The url you have (as well as the initial css rules) is correct; or more simply:
start_urls = ['https://stackoverflow.com/questions/tagged/python+bigdata']
Extrapolating from this, this will also work:
start_urls = ['https://stackoverflow.com/questions/tagged/bigdata%20python']
The issue you are running into, however, is that Stack Overflow appears to require you to be logged in to access the multiple-tag search feature. To see this, simply log out of your Stack Overflow session and try the same URL in your browser. It will redirect you to a page of results for only the first of the two tags.
TL;DR: the only way to get the multiple-tag feature appears to be logging in (enforced via session cookies).
Thus, when using Scrapy, the fix is to authenticate the session (log in) before doing anything else, and then proceed to parse as normal; it all works from there. To do this, you can use an InitSpider instead of a Spider and add the appropriate login methods. Assuming you log in to Stack Overflow directly (as opposed to through Google or the like), I was able to get it working as expected like this:
import scrapy
import getpass
from scrapy.spiders.init import InitSpider

class StackOverflowSpider(InitSpider):
    name = 'stackoverflow'
    login_page = 'https://stackoverflow.com/users/login'
    start_urls = ['https://stackoverflow.com/questions/tagged/bigdata+python']

    def parse(self, response):
        ...

    def parse_question(self, response):
        ...

    def init_request(self):
        return scrapy.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.FormRequest.from_response(response,
            formdata={'email': 'yourEmailHere@foobar.com',
                      'password': getpass.getpass()},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "/users/logout" in response.text:
            self.log("Successfully logged in")
            return self.initialized()
        else:
            self.log("Failed login")
