I am new to Python and Scrapy, and I am building a simple Scrapy project to scrape posts from a forum. However, sometimes when crawling a post, the request gets a 200 but is redirected to an empty page (maybe because the forum's server is unstable, or for some other reason). I would like to retry all of those failed scrapes.
Since this is long to read in full, here is a summary of my questions:
1) Can I run the retry check of a custom RetryMiddleware only for one specific callback?
2) Can I do something after the first round of scraping has finished?
Okay, let's start.
The overall logic of my code is as below:
Crawl the homepage of forum
Crawl into every post from the homepage
Scrape the data from the post
def start_requests(self):
    yield scrapy.Request('https://www.forumurl.com', self.parse_page)

def parse_page(self, response):  # going into all the threads
    hrefs = response.xpath('blahblah')
    for href in hrefs:
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_post)

def parse_post(self, response):  # really scraping the content
    content_empty = len(response.xpath('//table[@class="content"]'))  # check if the content is empty
    if content_empty == 0:
        pass  # do something
    item = ForumItem()
    item['some_content'] = response.xpath('//someXpathCode')
    yield item
I have read a lot on Stack Overflow, and I think I can do it in two ways (I have tried coding both):
1) Create a custom RetryMiddleware
2) Do the retry inside the spider itself
However, I have had no luck with either. The reasons for the failures are as follows:
For the custom RetryMiddleware, I followed this, but it checks every page I crawl, including robots.txt, so it is always retrying. What I want is to run the retry check only for parse_post. Is this possible?
For the retry inside the spider, I have tried two approaches.
First, I added a class variable _post_not_crawled = [] and appended response.url to it whenever the empty check was true. I then adjusted start_requests to retry all the failed scrapes after the first round of scraping finished:
def start_requests(self):
    yield scrapy.Request('https://www.forumurl.com', self.parse_page)
    while self._post_not_crawled:
        yield scrapy.Request(self._post_not_crawled.pop(0), callback=self.parse_post)
But of course it doesn't work, because start_requests runs before any data is actually scraped, so the loop executes only once, with an empty _post_not_crawled list, before scraping starts. Is it possible to do something after the first round of scraping finishes?
The second attempt is to retry directly inside parse_post():
if content_empty == 0:
    logging.warning('Post was empty: ' + response.url)
    retryrequest = scrapy.Request(response.url, callback=self.parse_post)
    retryrequest.dont_filter = True
    return retryrequest
else:
    pass  # do the scraping
Update: here are some logs from this method:
2017-09-03 05:15:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6778647> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:43 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6778647
2017-09-03 05:15:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6778568> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:44 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6778568
2017-09-03 05:15:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6774780> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:46 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6774780
But it doesn't work either; the retry request was just skipped without any sign.
Thanks for reading all of this. I appreciate all of your help.
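One possible explanation for the silently skipped retry, assuming the full parse_post also contains a yield (as in the first listing): a function with a yield anywhere in its body is a generator, and a `return value` inside a generator only ends it, so the value never reaches whoever iterates it (here, Scrapy). A minimal sketch in plain Python; the function names and the strings standing in for Request and item objects are hypothetical:

```python
def parse_post_with_return(content_found):
    """Mimics the second attempt: return the retry, yield the item."""
    if not content_found:
        return "retry-request"  # ends the generator; the value is not emitted
    yield "item"


def parse_post_with_yield(content_found):
    """Same logic, but the retry is yielded like any other result."""
    if not content_found:
        yield "retry-request"  # emitted to whoever iterates the generator
    else:
        yield "item"


print(list(parse_post_with_return(False)))  # []
print(list(parse_post_with_yield(False)))   # ['retry-request']
```

Under that assumption, yielding scrapy.Request(response.url, callback=self.parse_post, dont_filter=True) instead of returning it would let the retry be scheduled. Keeping a retry counter in the request's meta is a common way to avoid retrying an always-empty page forever.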
Related
I'm trying to learn Scrapy with Python. I've used this website, which might be a bit out of date, but I've managed to get the links and URLs as intended. ALMOST.
import scrapy, time
import random
#from random import randint
#from time import sleep

USER_AGENT = "Mozilla/5.0"

class OscarsSpider(scrapy.Spider):
    name = "oscars5"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]

    def parse(self, response):
        for href in (r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"): #).extract(): Once you extract it, it becomes a string so the library can no longer process it - so dont extarct it/ - https://stackoverflow.com/questions/57417774/attributeerror-str-object-has-no-attribute-xpath
            url = response.urljoin(href)
            print(url)
            time.sleep(random.random()) #time.sleep(0.1) #### https://stackoverflow.com/questions/4054254/how-to-add-random-delays-between-the-queries-sent-to-google-to-avoid-getting-blo #### https://stackoverflow.com/questions/30030659/in-python-what-is-the-difference-between-random-uniform-and-random-random
            req = scrapy.Request(url, callback=self.parse_titles)
            time.sleep(random.random()) #sleep(randint(10,100))
            ##req.meta['proxy'] = "http://yourproxy.com:178" #https://checkerproxy.net/archive/2021-03-10 (from ; https://stackoverflow.com/questions/30330034/scrapy-error-error-downloading-could-not-open-connect-tunnel)
            yield req

    def parse_titles(self, response):
        for sel in response.css('html').extract():
            data = {}
            data['title'] = response.css(r"h1[id='firstHeading'] i::text").extract()
            data['director'] = response.css(r"tr:contains('Directed by') a[href*='/wiki/']::text").extract()
            data['starring'] = response.css(r"tr:contains('Starring') a[href*='/wiki/']::text").extract()
            data['releasedate'] = response.css(r"tr:contains('Release date') li::text").extract()
            data['runtime'] = response.css(r"tr:contains('Running time') td::text").extract()
            yield data
The problem I have is that the scraper retrieves only the first character of the href links, which I can't wrap my head around. I can't understand why, or how to fix it.
Snippet of output when I run the spider in CMD:
2021-03-12 20:09:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners> (referer: None)
https://en.wikipedia.org/wiki/t
https://en.wikipedia.org/wiki/r
https://en.wikipedia.org/wiki/[
2021-03-12 20:09:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/T> from <GET https://en.wikipedia.org/wiki/t>
https://en.wikipedia.org/wiki/s
https://en.wikipedia.org/wiki/t
2021-03-12 20:10:01 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://en.wikipedia.org/wiki/t> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-03-12 20:10:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/T> (referer: https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners)
https://en.wikipedia.org/wiki/y
It does crawl and it finds those links (or the starting points of the links we need), I'm sure, and appends them, but it doesn't get the entire title or link. Hence I'm only scraping incorrect, non-existent pages! The output files are perfectly formatted but contain no data apart from empty strings.
https://en.wikipedia.org/wiki/t in the spider output above for example should be https://en.wikipedia.org/wiki/The_Artist_(film)
and
https://en.wikipedia.org/wiki/r should and could be https://en.wikipedia.org/wiki/rain_man(film)
etc.
in scrapy shell,
response.css("h1[id='firstHeading'] i::text").extract()
returns []
confirming my fears. It's the selector.
How can I fix it?
It's not working as it should, or as it was claimed to. If anyone could help I would be very grateful.
for href in (r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"):
This is just doing for x in "abcde", which iterates over each letter in the string, which is why you get t, r, [, s, ...
Is this really what you intended? The parentheses sort of suggest that you intended this to be a function call. As a plain string, it makes no sense.
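To make the effect concrete, here is a small sketch in plain Python. It uses urllib.parse.urljoin as a stand-in for response.urljoin (which resolves against the response URL in a similar way), and the hrefs list is a made-up stand-in for what response.css(selector).getall() would presumably return:

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"
selector = "tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"

# Iterating over the selector string itself yields one character at a time.
first_chars = [ch for ch in selector][:2]
bad_urls = [urljoin(base, ch) for ch in first_chars]
print(bad_urls)

# What was presumably intended: iterate over the extracted href values.
# This list stands in for response.css(selector).getall().
hrefs = ["/wiki/The_Artist_(film)"]
good_urls = [urljoin(base, href) for href in hrefs]
print(good_urls)
```

The first list reproduces the /wiki/t and /wiki/r URLs from the log. The second shows the kind of URL the loop should produce once the string is passed to response.css(...) rather than iterated directly.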
The problem I have is the following: I am trying to scrape a website that has multiple categories of products, and for each category of products, it has several pages with 24 products in each. I am able to get all starting urls, and scraping every page I am able to get the urls (endpoints, which I then make into full urls) of all pages.
I should say that not for every category I have product pages, and not every starting url is a category and thus it might not have the structure I am looking for. But most of them do.
My intent is: from all pages of all categories I want to extract the href of every product displayed in the page. And the code I have been using is the following one:
import scrapy

class MySpider(scrapy.spiders.CrawlSpider):
    name = 'myProj'
    with open('resultt.txt','r') as f:
        endurls = f.read()
        f.close()
    endurls= endurls.split(sep=' ')
    endurls = ['https://www.someurl.com'+url for url in endurls]
    start_urls = endurls

    def parse(self, response):
        with open('allpages.txt', 'a') as f:
            pages_in_category = response.xpath('//option/@value').getall()
            length = len(pages_in_category)
            pages_in_category = ['https://www.someurl.com'+page for page in pages_in_category]
            if length == 0:
                f.write(str(response.url))
            else:
                for page in pages_in_category:
                    f.write(page)
        f.close()
Through scrapy shell I am able to make it work, though not iteratively. Since I have not initialized a proper Scrapy project structure (I don't need that; it is a simple project for uni and I do not care much about the structure), the command I run in the terminal is:
scrapy runspider ScrapyCarr.py -s USER_AGENT='my-cool-project (http://example.com)'
Unfortunately the file to which I am trying to append my product URLs remains empty, even though the same code works when I enter it in scrapy shell.
The output I am currently getting is the following
2020-10-15 12:51:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/fish/typefish/N-4minn0/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-i50owa/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-1l0cnr6/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-18isujc/c> (referer: None)
The problem was that my class MySpider subclassed scrapy.spiders.CrawlSpider. The code works when the class subclasses scrapy.Spider instead.
SOLVED
I am very new to coding and am struggling with a web scraper I am trying to build. I am using a Lua script so that my Scrapy request waits for any web element to appear after the JavaScript on the website has loaded (I don't care which element; I just need the initial page loader to finish so I can access the HTML elements). The particular website I am trying to access is https://www.ladbrokes.com.au/sports/basketball/usa/nba, which shows a JS initial loader page before any of the elements on the website are loaded.
My current code is this:
import scrapy
from scrapy_splash import SplashRequest

class Ladbrokes(scrapy.Spider):
    name = 'Ladbrokes'
    allowed_domains = ['ladbrokes.com.au']
    start_urls = ['https://www.ladbrokes.com.au/sports']

    def parse(self, response):
        sports_link = select_ladbrokes(response)  # helper not shown in the question
        for link in sports_link:
            url = response.urljoin(link)
            yield SplashRequest(url=url, callback=self.ladbrokes_all_comps, endpoint='execute',
                                args={'lua_source': lua_script})

    def ladbrokes_all_comps(self, response):
        comps = response.xpath('//*[@id="accordion_4e099d27-0f11-4c6e-848e-965fff7ad995"]/div[2]/div[2]/div[1]/div[2]/div[1]/div/div[1]/text()').extract()

lua_script = '''
function main(splash)
    assert(splash:go(splash.args.url))
    while not splash:select('#page-content-left > div > div') do
        splash:wait(0.1)
    end
    return {html=splash:html()}
end '''
When I call my spider I end up getting these errors:
2019-11-25 16:41:30 [scrapy.core.engine] DEBUG: Crawled (504) <GET https://www.ladbrokes.com.au/sports/nrl via http://0.0.0.0:8050/execute> (referer: None)
2019-11-25 16:41:30 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <504 https://www.ladbrokes.com.au/sports/nrl>: HTTP status code is not handled or not allowed
It seems it is timing out on the Lua script While loop, but I am not sure if it is because I am trying to select the web-element incorrectly.
I also tried putting in a long splash wait argument in the SplashRequest function, but it seemed the initial page loader never finished loading. Any help on this would be great!
I am trying to scrape Craigslist. When I try to fetch https://tampa.craigslist.org/search/jjj?query=bookkeeper in the spider, I get the following error:
(extra newlines and white space added for readability)
[scrapy.downloadermiddlewares.retry] DEBUG:
Retrying <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper> (failed 1 times):
[<twisted.python.failure.Failure twisted.internet.error.ConnectionLost:
Connection to the other side was lost in a non-clean fashion: Connection lost.>]
But, when I try to crawl it on scrapy shell, it is being crawled successfully.
[scrapy.core.engine] DEBUG:
Crawled (200) <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper>
(referer: None)
I don't know what I am doing wrong here. I have tried forcing TLSv1.2 but had no luck. I would really appreciate your help.
Thanks!
I've asked for an MCVE in the comments, which means you should provide a Minimal, Complete, and Verifiable example.
To help you out, this is what it's all about:
import scrapy

class CLSpider(scrapy.Spider):
    name = 'CL Spider'
    start_urls = ['https://tampa.craigslist.org/search/jjj?query=bookkeeper']

    def parse(self, response):
        for url in response.xpath('//a[@class="result-title hdrlnk"]/@href').extract():
            yield scrapy.Request(response.urljoin(url), self.parse_item)

    def parse_item(self, response):
        # TODO: scrape item details here
        return {
            'url': response.url,
            # ...
            # ...
        }
Now, this MCVE does everything you want to do in a nutshell:
visits one of the search pages
iterates through the results
visits each item for parsing
This should be your starting point for debugging, removing all the unrelated boilerplate.
Please test the above and verify whether it works. If it works, add more functionality in steps so you can figure out which part introduces the problem. If it doesn't work, don't add anything else until you can figure out why.
UPDATE:
Adding a delay between requests can be done in two ways:
Globally for all spiders in settings.py by specifying for example DOWNLOAD_DELAY = 2 for a 2 second delay between each download.
Per-spider by defining an attribute download_delay,
for example:
class CLSpider(scrapy.Spider):
    name = 'CL Spider'
    download_delay = 2
Documentation: https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
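For the global variant, a sketch of the corresponding settings.py lines could look like this. The value 2 is just the example delay used above; RANDOMIZE_DOWNLOAD_DELAY is a standard Scrapy setting that is on by default, so the actual wait is drawn from between 0.5 and 1.5 times DOWNLOAD_DELAY:

```python
# settings.py (sketch)

# Seconds to wait between consecutive requests to the same website.
DOWNLOAD_DELAY = 2

# Scrapy's default; the real wait is a random value between
# 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
```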
I'm having trouble with Python Scrapy.
I have a spider that attempts to login to a site before crawling it, however the site is configured to return response code HTTP 401 on the login page which stops the spider from continuing (even though in the body of that response, the login form is there for submitting).
These are the relevant parts of my crawler:
class LoginSpider(Spider):
    name = "login"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # Initial user/pass submit
        self.log("Logging in...", level=log.INFO)
The above yields:
2014-02-23 11:52:09+0000 [login] DEBUG: Crawled (401) <GET https://example.com/login> (referer: None)
2014-02-23 11:52:09+0000 [login] INFO: Closing spider (finished)
However if I give it another URL to start on (not the login page) which returns a 200:
2014-02-23 11:50:19+0000 [login] DEBUG: Crawled (200) <GET https://example.com/other-page> (referer: None)
2014-02-23 11:50:19+0000 [login] INFO: Logging in...
You see it goes on to execute my parse() method and make the log entry.
How do I make Scrapy continue to work with the page despite a 401 response code?
On the off-chance this question isn't closed as a duplicate: explicitly adding 401 to handle_httpstatus_list fixed the issue.
class LoginSpider(Spider):
    handle_httpstatus_list = [401]
    name = "login"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # Initial user/pass submit
        self.log("Logging in...", level=log.INFO)
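The reason this helps: by default, Scrapy's HttpError spider middleware only passes responses with a 2xx status to the spider's callbacks, and handle_httpstatus_list names additional codes to let through. The function below is a simplified model of that rule for illustration only, not Scrapy's actual code; reaches_callback is a hypothetical name:

```python
def reaches_callback(status, handle_httpstatus_list=()):
    """Simplified model: 2xx always passes, other codes only if listed."""
    return 200 <= status < 300 or status in handle_httpstatus_list


print(reaches_callback(200))         # True
print(reaches_callback(401))         # False
print(reaches_callback(401, [401]))  # True
```

Once the 401 response reaches parse(), its body (including the login form) is available there like that of any other response.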