I am trying to scrape craigslist. When I try to fetch https://tampa.craigslist.org/search/jjj?query=bookkeeper in the spider I am getting the following error:
(extra newlines and white space added for readability)
[scrapy.downloadermiddlewares.retry] DEBUG:
Retrying <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper> (failed 1 times):
[<twisted.python.failure.Failure twisted.internet.error.ConnectionLost:
Connection to the other side was lost in a non-clean fashion: Connection lost.>]
But, when I try to crawl it on scrapy shell, it is being crawled successfully.
[scrapy.core.engine] DEBUG:
Crawled (200) <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper>
(referer: None)
I don't know what I am doing wrong here. I have tried forcing TLSv1.2 but had no luck. I would really appreciate your help.
Thanks!
I've asked for an MCVE (a Minimal, Complete, and Verifiable example) in the comments.
To help you out, this is what it's all about:
import scrapy

class CLSpider(scrapy.Spider):
    name = 'CL Spider'
    start_urls = ['https://tampa.craigslist.org/search/jjj?query=bookkeeper']

    def parse(self, response):
        for url in response.xpath('//a[@class="result-title hdrlnk"]/@href').extract():
            yield scrapy.Request(response.urljoin(url), self.parse_item)

    def parse_item(self, response):
        # TODO: scrape item details here
        return {
            'url': response.url,
            # ...
        }
Now, this MCVE does everything you want to do in a nutshell:
visits one of the search pages
iterates through the results
visits each item for parsing
This should be your starting point for debugging, removing all the unrelated boilerplate.
Please test the above and verify whether it works. If it works, add more functionality in steps so you can figure out which part introduces the problem. If it doesn't work, don't add anything else until you can figure out why.
UPDATE:
Adding a delay between requests can be done in two ways:
Globally for all spiders in settings.py by specifying for example DOWNLOAD_DELAY = 2 for a 2 second delay between each download.
Per-spider by defining an attribute download_delay,
for example:
class CLSpider(scrapy.Spider):
    name = 'CL Spider'
    download_delay = 2
Documentation: https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
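For completeness, the global variant is a one-line change in your project's settings.py (the value here is just an example):

```python
# settings.py
DOWNLOAD_DELAY = 2  # seconds to wait between downloads from the same site
# Note: by default Scrapy randomizes the actual delay between 0.5x and 1.5x
# of DOWNLOAD_DELAY (controlled by the RANDOMIZE_DOWNLOAD_DELAY setting).
```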
I'm trying to learn Scrapy with Python. I've used this website, which might be a bit out of date, but I've managed to get the links and URLs as intended. ALMOST.
import scrapy, time
import random
#from random import randint
#from time import sleep

USER_AGENT = "Mozilla/5.0"

class OscarsSpider(scrapy.Spider):
    name = "oscars5"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]

    def parse(self, response):
        for href in (r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"): #).extract(): Once you extract it, it becomes a string so the library can no longer process it - so don't extract it - https://stackoverflow.com/questions/57417774/attributeerror-str-object-has-no-attribute-xpath
            url = response.urljoin(href)
            print(url)
            time.sleep(random.random()) #time.sleep(0.1) # https://stackoverflow.com/questions/4054254/how-to-add-random-delays-between-the-queries-sent-to-google-to-avoid-getting-blo # https://stackoverflow.com/questions/30030659/in-python-what-is-the-difference-between-random-uniform-and-random-random
            req = scrapy.Request(url, callback=self.parse_titles)
            time.sleep(random.random()) #sleep(randint(10,100))
            ##req.meta['proxy'] = "http://yourproxy.com:178" # https://checkerproxy.net/archive/2021-03-10 (from: https://stackoverflow.com/questions/30330034/scrapy-error-error-downloading-could-not-open-connect-tunnel)
            yield req

    def parse_titles(self, response):
        for sel in response.css('html').extract():
            data = {}
            data['title'] = response.css(r"h1[id='firstHeading'] i::text").extract()
            data['director'] = response.css(r"tr:contains('Directed by') a[href*='/wiki/']::text").extract()
            data['starring'] = response.css(r"tr:contains('Starring') a[href*='/wiki/']::text").extract()
            data['releasedate'] = response.css(r"tr:contains('Release date') li::text").extract()
            data['runtime'] = response.css(r"tr:contains('Running time') td::text").extract()
            yield data
The problem I have is that the scraper retrieves only the first character of the href links, which I can't wrap my head around. I can't understand why, or how to fix it.
Snippet of output when I run the spider in CMD:
2021-03-12 20:09:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners> (referer: None)
https://en.wikipedia.org/wiki/t
https://en.wikipedia.org/wiki/r
https://en.wikipedia.org/wiki/[
2021-03-12 20:09:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/T> from <GET https://en.wikipedia.org/wiki/t>
https://en.wikipedia.org/wiki/s
https://en.wikipedia.org/wiki/t
2021-03-12 20:10:01 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://en.wikipedia.org/wiki/t> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-03-12 20:10:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/T> (referer: https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners)
https://en.wikipedia.org/wiki/y
It does crawl and it finds those links (or the starting points of the links we need), I'm sure, and appends, but it doesn't get the entire title or link. Hence I'm only scraping incorrect, non-existent pages! The output files are perfectly formatted but contain no data apart from empty strings.
https://en.wikipedia.org/wiki/t in the spider output above for example should be https://en.wikipedia.org/wiki/The_Artist_(film)
and
https://en.wikipedia.org/wiki/r should be https://en.wikipedia.org/wiki/rain_man(film)
etc.
in scrapy shell,
response.css("h1[id='firstHeading'] i::text").extract()
returns []
confirming my fears. It's the selector.
How can I fix it?
It's not working as it should, or as it was claimed to. If anyone could help, I would be very grateful.
for href in (r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"):
This is just doing for x in "abcde", which iterates over each letter in the string, which is why you get t, r, [, s, ...
Is this really what you intended? The parentheses sort of suggest that you intended this to be a function call. As a plain string, it makes no sense.
I'm new to Python, BeautifulSoup, and Scrapy, so I'm not 100% sure how to describe the problem I'm having.
I'd like to scrape the url provided by the 'next' button you can see in this image, it's in-line next to the image links 'tiff' or 'jpeg'.
The issue is that the 'next' (and in subsequent pages, the 'previous') links don't seem to present themselves via the URL I provide to Scrapy. When I asked a friend to check the URL, she told me she didn't see the links. I confirmed this by printing the bs object associated with the tag id 'description':
description = soup.find('div', {'id': 'description'})
Because I generate this page from a search at the LOC website, I'm thinking I must need to pass something to my spider to indicate the search parameters. I tried the solution suggested here, by changing the referer, but it still doesn't work:
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid'
}
I get the following output logs when I run my spider, confirming the referrer has been updated:
2018-07-31 15:41:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.loc.gov/robots.txt> (referer: www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid)
2018-07-31 15:41:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.loc.gov/pictures/resource/fsa.8a07028/?co=fsa> (referer: www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid)
If someone could help, I'd really appreciate it.
AFAICT, that site uses sessions to store history of your search server-side.
The search is initiated from a URL like yours.
But when visiting the image URLs afterwards, your session is active (via your cookies), and the site renders next / back links. If no session is found it doesn't (but you can still see the page). You can prove this by deleting your cookies after the initial search and watch it disappear when you refresh...
You'll need to tell Scrapy to first go to the search URL, and then spider the results, making sure cookie middleware is enabled.
I am new to Python and Scrapy, and now I am making a simple Scrapy project for scraping posts from a forum. However, sometimes when crawling a post, it gets a 200 but redirects to an empty page (maybe because of the instability of the forum server, or other reasons, but whatever). I would like to retry all of those failed scrapes.
Since this is too long to read all of, to summarize, my questions are:
1) Can I execute the retry using a custom RetryMiddleware in only one specific method?
2) Can I do something after the first scraping pass finishes?
Okay, let's start.
The overall logic of my code is as below:
Crawl the homepage of forum
Crawl into every post from the homepage
Scrape the data from the post
def start_requests(self):
    yield scrapy.Request('https://www.forumurl.com', self.parse_page)

def parse_page(self, response):  # going into all the threads
    hrefs = response.xpath('blahblah')
    for href in hrefs:
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_post)

def parse_post(self, response):  # really scraping the content
    content_empty = len(response.xpath('//table[@class="content"]'))  # check if the content is empty
    if content_empty == 0:
        # do something
        pass
    item = ForumItem()
    item['some_content'] = response.xpath('//someXpathCode')
    yield item
I have read lots from stackoverflow, and thought I can do it in two ways (and have done some coding):
1) Create a custom RetryMiddleware
2) Do the retry just inside the spider
However, I am having no luck with either of them. The failure reasons are as below:
For the custom RetryMiddleware, I followed this, but it checks every page I crawl, including robots.txt, so it is always retrying. What I want is to do the retry check only inside parse_post. Is this possible?
For retrying inside the spider, I have tried two approaches.
First, I added a class variable _post_not_crawled = [] and append response.url to it when the empty check is true. I adjusted start_requests to retry all failed scrapes after the first pass finishes:
def start_requests(self):
    yield scrapy.Request('https://www.forumurl.com', self.parse_page)
    while self._post_not_crawled:
        yield scrapy.Request(self._post_not_crawled.pop(0), callback=self.parse_post)
But of course it doesn't work, because it executes before any data is actually scraped, so it runs only once, with an empty _post_not_crawled list, before scraping starts. Is it possible to do something after the first scraping pass finishes?
My second trial was to retry directly inside parse_post():
if content_empty == 0:
    logging.warning('Post was empty: ' + response.url)
    retryrequest = scrapy.Request(response.url, callback=self.parse_post)
    retryrequest.dont_filter = True
    return retryrequest
else:
    # do the scraping
    ...
Update: some logs from this method:
2017-09-03 05:15:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6778647> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:43 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6778647
2017-09-03 05:15:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6778568> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:44 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6778568
2017-09-03 05:15:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6774780> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:46 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6774780
But it doesn't work either, and the retry request was just skipped without any sign of it.
Thanks for reading all of this. I appreciate all of your help.
Hello everyone!
I'm new to the Scrapy framework, and I need to parse wisemapping.com.
At first, I read the official Scrapy tutorial and tried to get access to one of the "wisemaps", but got errors:
[scrapy.core.engine] DEBUG: Crawled (404) <GET https://app.wisemapping.com/robots.txt> (referer: None)
[scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying
<GET https://app.wisemapping.com/c/maps/576786/public> (failed 3 times): 500 Internal Server Error
[scrapy.core.engine] DEBUG: Crawled (500) <GET https://app.wisemapping.com/c/maps/576786/public> (referer: None)
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://app.wisemapping.com/c/maps/576786/public>: HTTP status code is not handled or not allowed
Please give me advice on solving the problems with the following code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://app.wisemapping.com/c/maps/576786/public',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'wisemape.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Navigating to https://app.wisemapping.com/c/maps/576786/public gives the error
"Outch!!. This map is not available anymore.
You do not have enough right access to see this map. This map has been changed to private or deleted."
Does this map exist? If so, try making it public.
If you know for a fact that the map you're trying to access exists, verify that the URL you're using is the correct one.
This is a Scrapy spider. It is supposed to collect the names of all div nodes with class attribute "_5d-5", essentially making a list of all people with x name from y location.
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class fb_spider(scrapy.Spider):
    name = "fb"
    allowed_domains = ["facebook.com"]
    start_urls = [
        "https://www.facebook.com/search/people/?q=jaslyn%20california"]

    def parse(self, response):
        x = response.xpath('//div[@class="_5d-5"]').extract()
        with open("asdf.txt", 'wb') as f:
            f.write(u"".join(x).encode("UTF-8"))
But Scrapy crawls a web page different from the one specified. I got this on the command prompt:
2016-08-15 14:00:14 [scrapy] ERROR: Spider error processing <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
but the URL I specified is:
https://www.facebook.com/search/people/?q=jaslyn%20california
Scraping is not allowed on Facebook: https://www.facebook.com/apps/site_scraping_tos_terms.php
If you want to get data from Facebook, you have to use their Graph API. For example, this would be the API to search for users: https://developers.facebook.com/docs/graph-api/using-graph-api#search
It is not as powerful as the Graph Search on facebook.com though.
Facebook is redirecting the request to the new URL.
It seems as though you are missing some headers in your request.
2016-08-15 14:00:14 [scrapy] ERROR: Spider error processing <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
As you can see, the referer is None. I would advise you to add some headers manually, namely the Referer.