I am trying to scrape a website using Scrapy. It fetches the robots.txt file successfully (status code 200), but when I send a request I get this in the terminal:
[scrapy.core.engine] DEBUG: Crawled (555) <POST https://www.<the_link>.ca/<part_of_link>/UpdateQuery> (referer: None)
b'{"d":{"Message":"SESSION_TIMEOUT","Result":null,"Succeeded":false}}'
I have tried looking up status code 555 on the internet, but could not find much of an explanation for it.
I got some information from https://httptoolkit.tech/blog/new-http-status-code-555 but nothing very clear.
I have also tried it in Postman: the first time I send the request I get the 555 error, and the second time I get status code 200.
How do I handle this in a Scrapy script?
I would be glad if someone could help me get around this.
Thank you.
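One way to tackle this in Scrapy might be to let the retry middleware re-send any request that comes back with a 555, mirroring the second-attempt success seen in Postman. A minimal sketch, assuming the 555 is simply retryable once the server has set up its session (the settings names are standard Scrapy; the 555 entry is the only addition):

# settings.py -- sketch: treat 555 as a retryable status.
RETRY_ENABLED = True
RETRY_TIMES = 2  # re-send each failing request up to twice
# Scrapy's default retry codes, plus the site's custom 555:
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 555]

Since the response body says "SESSION_TIMEOUT", a more robust fix may be to first issue a GET to a page that sets the session cookie, and only send the POST from that request's callback.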
Related
Logging in to (Walmart) returns a 412 error while using FormRequest.from_response().
412 describes an error in preconditions, so I tried manually passing all the headers. That did not work.
I also tried passing the cookies; that still didn't work.
The website has a form, but it passes its values to a login API. I tried making a POST request to the API's URL in Postman, and that worked; when I transferred the idea to Scrapy, it didn't. Besides, doing it that way defeats the purpose of the scraper, which is to use the site's functions while logged in, since the API might not redirect back to the site.
from scrapy.http import FormRequest

def parse(self, response):
    # Forward the cookies set by the login page along with the form post.
    cookie = response.headers.getlist('Set-Cookie')
    yield FormRequest.from_response(
        response,
        formid="sign-in-form",
        formdata={
            "email": "email",
            "password": "pass"
        },
        headers={'Cookie': cookie},
        callback=self.after_login
    )
Here's the Scrapy log output; I hid the dates and times.
<date_hidden> [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/account/login> (referer: None)
<date_hidden> [scrapy.core.engine] DEBUG: Crawled (412) <POST https://www.walmart.com/account/electrode/api/signin> (referer: https://www.walmart.com/account/login)
412 (Precondition Failed)
The 412 error response indicates that the client specified one or more preconditions in its request headers, effectively telling the REST API to carry out the request only if certain conditions were met. A 412 response means those conditions were not met, so instead of carrying out the request, the API returns this status code. In practice this means your request is missing something: try adding all the headers your browser sends. I had the same issue, though with a GET request, and adding retries with proxies also helped me.
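As a rough sketch of that advice, the headers can be set globally and 412 added to the retry codes. The header values below are placeholders, not Walmart's actual requirements; copy the real ones your browser sends from its dev tools:

# settings.py -- sketch: mimic the browser and retry on 412.
DEFAULT_REQUEST_HEADERS = {
    # Placeholder values; replace with what your browser actually sends.
    'Accept': 'application/json',
    'Accept-Language': 'en-US,en;q=0.9',
    'Content-Type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [412, 500, 502, 503, 504]  # include 412 so it is retried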
I'm new to Python, BeautifulSoup, and Scrapy, so I'm not 100% sure how to describe the problem I'm having.
I'd like to scrape the URL provided by the 'next' button you can see in this image; it's inline, next to the 'tiff' and 'jpeg' image links.
The issue is that the 'next' (and, on subsequent pages, 'previous') links don't seem to be present at the URL I provide to Scrapy. When I asked a friend to check the URL, she told me she didn't see the links either. I confirmed this by printing the bs object associated with the tag id 'description':
description = soup.find('div', {'id': 'description'})
Because I generate this page from a search at the LOC website, I'm thinking I must need to pass something to my spider to indicate the search parameters. I tried the solution suggested here, by changing the referer, but it still doesn't work:
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid'
}
I get the following output logs when I run my spider, confirming the referrer has been updated:
2018-07-31 15:41:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.loc.gov/robots.txt> (referer: www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid)
2018-07-31 15:41:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.loc.gov/pictures/resource/fsa.8a07028/?co=fsa> (referer: www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid)
If someone could help, I'd really appreciate it.
AFAICT, that site uses sessions to store history of your search server-side.
The search is initiated from a URL like yours.
But when you visit the image URLs afterwards, your session is active (via your cookies), and the site renders the next/back links. If no session is found, it doesn't (though you can still see the page). You can prove this by deleting your cookies after the initial search and watching the links disappear when you refresh.
You'll need to tell Scrapy to first go to the search URL and then spider the results, making sure the cookie middleware is enabled.
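A minimal sketch of that flow (the spider name and the link filter are guesses based on the URLs in the logs above; cookies are enabled by default in Scrapy, so the session survives across requests):

import scrapy

class LocSpider(scrapy.Spider):
    name = 'loc_fsa'  # hypothetical name
    # Start at the search URL so the server creates the session.
    start_urls = [
        'https://www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid'
    ]

    def parse(self, response):
        # Follow each result from the search grid; with the session cookie
        # set, the detail pages should render the next/previous links.
        for href in response.xpath('//a/@href').extract():
            if '/pictures/resource/' in href:
                yield scrapy.Request(response.urljoin(href), self.parse_item)

    def parse_item(self, response):
        # The 'next'/'previous' links should now be present here.
        yield {'url': response.url}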
I am trying to scrape Craigslist. When I try to fetch https://tampa.craigslist.org/search/jjj?query=bookkeeper in the spider, I get the following error:
(extra newlines and white space added for readability)
[scrapy.downloadermiddlewares.retry] DEBUG:
Retrying <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper> (failed 1 times):
[<twisted.python.failure.Failure twisted.internet.error.ConnectionLost:
Connection to the other side was lost in a non-clean fashion: Connection lost.>]
But, when I try to crawl it on scrapy shell, it is being crawled successfully.
[scrapy.core.engine] DEBUG:
Crawled (200) <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper>
(referer: None)
I don't know what I am doing wrong here. I have tried forcing TLSv1.2 but had no luck. I would really appreciate your help.
Thanks!
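(For reference, "forcing TLSv1.2" in Scrapy is normally done through the downloader's TLS setting; a sketch of what that presumably looked like:)

# settings.py -- sketch: pin the TLS version used by the downloader.
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'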
I've asked for an MCVE in the comments, which means you should provide a Minimal, Complete, and Verifiable example.
To help you out, this is what it's all about:
import scrapy

class CLSpider(scrapy.Spider):
    name = 'CL Spider'
    start_urls = ['https://tampa.craigslist.org/search/jjj?query=bookkeeper']

    def parse(self, response):
        # Follow every result link on the search page.
        for url in response.xpath('//a[@class="result-title hdrlnk"]/@href').extract():
            yield scrapy.Request(response.urljoin(url), self.parse_item)

    def parse_item(self, response):
        # TODO: scrape item details here
        return {
            'url': response.url,
            # ...
            # ...
        }
Now, this MCVE does everything you want to do in a nutshell:
visits one of the search pages
iterates through the results
visits each item for parsing
This should be your starting point for debugging, removing all the unrelated boilerplate.
Please test the above and verify whether it works. If it works, add more functionality in steps so you can figure out which part introduces the problem. If it doesn't, don't add anything else until you can figure out why.
UPDATE:
Adding a delay between requests can be done in two ways:
Globally, for all spiders, in settings.py, by specifying for example DOWNLOAD_DELAY = 2 for a 2-second delay between each download.
Per-spider, by defining a download_delay attribute, for example:
class CLSpider(scrapy.Spider):
    name = 'CL Spider'
    download_delay = 2
Documentation: https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
I am working in Scrapy, fetching a site which consists of a list of URLs.
I requested the main URL in start_urls and got all the href values (links to fetch data) in a list. I then requested each URL in the list to fetch the data, but some of the URLs redirect like below:
Redirecting (301) to <GET example.com/sch/mobile-68745.php> from <GET example.com/sch/mobile-8974.php>
I came to know that Scrapy ignores redirected links, but I want to catch the redirected URL and scrape it just like a URL with a 200 status.
Is there any way to catch that redirect URL and scrape the data from it? Do we need to disable the redirect middleware, or do we need to pass something in the Request's meta? Can you provide an example of that?
I've got no experience with Scrapy, but apparently you can define middlewares that change the way Scrapy works when resolving content.
There is the RedirectMiddleware, which supports and handles redirects out of the box, so all you'd need to do is enable it:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 123,
}
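Note that RedirectMiddleware is enabled by default in recent Scrapy versions, and it records the chain of redirected-from URLs in the request's meta under redirect_urls. A minimal sketch of reading that back in a callback (the callback itself is hypothetical):

def parse_item(self, response):
    # RedirectMiddleware stores the URLs we were redirected from in
    # response.request.meta['redirect_urls'] (absent if no redirect).
    original_urls = response.request.meta.get('redirect_urls', [])
    yield {
        'final_url': response.url,       # the 200 URL after the 301
        'redirected_from': original_urls,
    }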
(I have asked this question on the Scrapy Google group, without luck.)
I am trying to log into Facebook using Scrapy. I tried the following in the interactive shell:
I set the headers and created a request as follows:
header_vals = {
    'Accept-Language': ['en'],
    'Content-Type': ['application/x-www-form-urlencoded'],
    'Accept-Encoding': ['gzip,deflate'],
    'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
    'User-Agent': ['Mozilla/5.0 Gecko/20070219 Firefox/2.0.0.2'],
}
login_request = Request('https://www.facebook.com/login.php', headers=header_vals)
fetch(login_request)
I get redirected:
2011-08-11 13:54:54+0530 [default] DEBUG: Redirecting (meta refresh)
to <GET https://www.facebook.com/login.php?_fb_noscript=1> from <GET
https://www.facebook.com/login.php>
...
[s] request <GET https://www.facebook.com/login.php>
[s] response <200 https://www.facebook.com/login.php?_fb_noscript=1>
I guess it shouldn't be redirected there if I am supplying the right headers?
I still went ahead and supplied the login details using FormRequest as follows:
new_request = FormRequest.from_response(
    response,
    formname='login_form',
    formdata={'email': '...@email.com', 'pass': 'password'},
    headers=header_vals,
)
new_request.meta['download_timeout'] = 180
new_request.meta['redirect_ttl'] = 30
fetch(new_request) results in:
2011-08-11 14:05:45+0530 [default] DEBUG: Redirecting (meta refresh)
to <GET https://www.facebook.com/login.php?login_attempt=1&_fb_noscript=1>
from <POST https://www.facebook.com/login.php?login_attempt=1>
...
[s] response <200 https://www.facebook.com/login.php?login_attempt=1&_fb_noscript=1>
...
What am I missing here? Thanks for any suggestions and help.
I'll add that I've also tried this with a BaseSpider to see if this was a result of the cookies not being passed along in the shell, but it doesn't work there either.
I was able to use Mechanize to log on successfully. Can I take advantage of this to somehow pass cookies on to Scrapy?
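(In principle, yes: a Mechanize cookiejar can be iterated and its cookies passed to a Scrapy Request via its cookies argument. A rough sketch, with the logged-in URL purely illustrative:)

import mechanize
import scrapy

# Assume this Mechanize browser has already logged in successfully.
cookiejar = mechanize.CookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(cookiejar)
# ... perform the Mechanize login here ...

# Hand the session cookies to Scrapy as a plain name -> value dict.
session_cookies = {c.name: c.value for c in cookiejar}
request = scrapy.Request(
    'https://www.facebook.com/home.php',  # hypothetical logged-in page
    cookies=session_cookies,
)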
Notice the "meta refresh" text in those redirect messages. Facebook has a noscript tag that automatically redirects clients without JavaScript to "/login.php?_fb_noscript=1". The problem is that you're posting to "/login.php" instead, so you always get redirected by the meta refresh.
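A minimal, untested sketch of that fix: request the noscript variant (taken from the log output above) directly, so the meta refresh never fires:

from scrapy.http import Request

# Target the noscript URL up front instead of letting the meta
# refresh redirect us there.
login_request = Request(
    'https://www.facebook.com/login.php?_fb_noscript=1',
    headers=header_vals,
)
fetch(login_request)  # in the scrapy shell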
Even if you get past this problem, it's against Facebook's robots.txt, so you shouldn't really be doing this.
Why don't you just use the Facebook Graph API?