Crawling redirected url in scrapy - python

I am working in scrapy.
I am fetching a site that consists of a list of URLs.
I requested the main URL in start_urls and collected all the href values (links to the data pages) into a list, then requested each URL in that list to fetch the data. However, some of the URLs redirect, like this:
Redirecting (301) to <GET example.com/sch/mobile-68745.php> from <GET example.com/sch/mobile-8974.php>
I came to know that Scrapy ignores redirected links, but I want to catch the redirected URL and scrape it just like a URL with a 200 status.
Is there any way to catch that redirected URL and scrape the data from it? Do we need to disable the redirect middleware, or do we need to use some meta key in the Request? Can you provide an example of that?

I've got no experience with Scrapy, but apparently you can define middlewares that change the way Scrapy works when resolving content.
There is the RedirectMiddleware, which supports and handles redirects out of the box, so all you'd need to do is enable it:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 123,
}
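For the meta-key approach the question asks about, here is a minimal sketch (not the asker's code; the spider name and parse_item body are placeholders): the dont_redirect and handle_httpstatus_list keys in Request.meta make Scrapy hand your callback the 301/302 response itself, and the target can then be read from the Location header. Note that with RedirectMiddleware enabled (which it is by default), redirects are followed automatically and the intermediate URLs are available as response.meta.get('redirect_urls').

import scrapy

class RedirectAwareSpider(scrapy.Spider):
    name = 'redirect_aware'  # placeholder name
    start_urls = ['http://example.com/sch/mobile-8974.php']

    def start_requests(self):
        for url in self.start_urls:
            # Ask Scrapy to hand us the 301/302 response itself instead
            # of following it, so the redirect target can be captured.
            yield scrapy.Request(
                url,
                meta={'dont_redirect': True,
                      'handle_httpstatus_list': [301, 302]},
                callback=self.parse,
            )

    def parse(self, response):
        if response.status in (301, 302):
            # Location may be relative, so resolve it against the response URL.
            target = response.urljoin(response.headers['Location'].decode())
            self.logger.info('Redirected to %s', target)
            # Re-request the target and scrape it like any 200 page.
            yield scrapy.Request(target, callback=self.parse_item)
        else:
            yield from self.parse_item(response)

    def parse_item(self, response):
        # ... extract the data here; placeholder item for the sketch ...
        yield {'url': response.url}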

Related

Scrapy login using FormRequest.from_response() returns 412 error even with headers

Logging in to Walmart returns a 412 error while using FormRequest.from_response().
412 describes an error in preconditions, so I tried manually passing all the headers. It did not work.
I also tried passing the cookies; that still didn't work.
The site has a form, but it passes its values to a login API, so I tried making a POST request to the API's URL with Postman, and that worked. I transferred the idea to Scrapy, but it didn't work. Besides, doing it that way defeats the purpose of the scraper, which is to use the site's features while logged in, as it might not redirect back to the site.
def parse(self, response):
    cookie = response.headers.getlist('Set-Cookie')
    yield FormRequest.from_response(
        response,
        formid="sign-in-form",
        formdata={
            "email": "email",
            "password": "pass"
        },
        headers={'Cookie': cookie},
        callback=self.after_login
    )
Here's the Scrapy log output. I hid the date and time.
<date_hidden> [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/account/login> (referer: None)
<date_hidden> [scrapy.core.engine] DEBUG: Crawled (412) <POST https://www.walmart.com/account/electrode/api/signin> (referer: https://www.walmart.com/account/login)
412 (Precondition Failed)
The 412 error response indicates that the client specified one or more preconditions in its request headers, effectively telling the REST API to carry out the request only if certain conditions were met. A 412 response means those conditions were not met, so instead of carrying out the request, the API sends this status code. In practice this means your request is missing something. Try adding all the headers your browser sends. I had the same issue, though with a GET request, and adding retries together with proxies also helped me.
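As a concrete illustration of "add all the headers your browser sends", here is a hedged sketch; the spider name, header values, and retry settings are placeholders, and the real headers should be copied from your browser's network tab:

import scrapy
from scrapy import FormRequest

class WalmartLoginSpider(scrapy.Spider):
    name = 'walmart_login'  # placeholder name
    start_urls = ['https://www.walmart.com/account/login']

    custom_settings = {
        # Retrying 412s (optionally through rotating proxies) is the other
        # mitigation mentioned above; these are standard Scrapy settings.
        'RETRY_TIMES': 5,
        'RETRY_HTTP_CODES': [412, 429, 500, 502, 503, 504],
    }

    def parse(self, response):
        yield FormRequest.from_response(
            response,
            formid='sign-in-form',
            formdata={'email': 'email', 'password': 'pass'},
            headers={
                # Illustrative values; mirror what your browser actually sends.
                'User-Agent': 'Mozilla/5.0 ...',
                'Accept': 'application/json',
                'Accept-Language': 'en-US,en;q=0.9',
                'Referer': 'https://www.walmart.com/account/login',
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info('Login response status: %s', response.status)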

BeautifulSoup/Scrapy: different BeautifulSoup html from Source HTML viewed in Firefox

I'm new to Python, BeautifulSoup, and Scrapy, so I'm not 100% sure how to describe the problem I'm having.
I'd like to scrape the URL provided by the 'next' button you can see in the image, which sits inline next to the 'tiff' and 'jpeg' image links.
The issue is that the 'next' (and, on subsequent pages, 'previous') links don't seem to be present at the URL I give Scrapy. When I asked a friend to check the URL, she told me she didn't see the links either. I confirmed this by printing the BeautifulSoup object associated with the tag id 'description':
description = soup.find('div', {'id':'description'} )
Because I generate this page from a search at the LOC website, I'm thinking I must need to pass something to my spider to indicate the search parameters. I tried the solution suggested here, by changing the referer, but it still doesn't work:
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid'
}
I get the following output logs when I run my spider, confirming the referrer has been updated:
2018-07-31 15:41:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.loc.gov/robots.txt> (referer: www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid)
2018-07-31 15:41:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.loc.gov/pictures/resource/fsa.8a07028/?co=fsa> (referer: www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid)
If someone could help, I'd really appreciate it.
AFAICT, that site uses a session to store the history of your search server-side.
The search is initiated from a URL like yours.
But when you visit the image URLs afterwards, your session is active (via your cookies), and the site renders the next/back links. If no session is found, it doesn't (but you can still see the page). You can prove this by deleting your cookies after the initial search and watching the links disappear when you refresh.
You'll need to tell Scrapy to first go to the search URL and then spider the results, making sure the cookie middleware is enabled.
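A minimal sketch of that flow, assuming default settings (cookies are on unless COOKIES_ENABLED is set to False); the spider name and selectors are guesses, not taken from the question:

import scrapy

class LocPicturesSpider(scrapy.Spider):
    name = 'loc_pictures'  # placeholder name
    # Start at the search URL so the server creates the session that
    # later makes the next/previous links render.
    start_urls = [
        'https://www.loc.gov/pictures/collection/fsa/search/?co=fsa&q=1935&st=grid'
    ]

    def parse(self, response):
        # Follow each result; the session cookie set by the search page
        # travels along with these requests automatically.
        for href in response.css('a::attr(href)').getall():
            if '/pictures/resource/' in href:
                yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # With an active session, the 'next' link should now be present.
        next_href = response.xpath(
            '//div[@id="description"]//a[contains(., "Next")]/@href').get()
        if next_href:
            yield response.follow(next_href, callback=self.parse_item)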

Scrapy response is in a different language from the request and response URL

I'm trying to scrape search results from this page
http://eur-lex.europa.eu/search.html?qid=1437402891621&DB_TYPE_OF_ACT=advGeneral&CASE_LAW_SUMMARY=false&DTS_DOM=EU_LAW&typeOfActStatus=ADV_GENERAL&type=advanced&lang=fr&SUBDOM_INIT=EU_CASE_LAW&DTS_SUBDOM=EU_CASE_LAW
The language according to the URL is French, and that is what I see in the Scrapy shell after 'Crawled (200)'.
If I try response.url, I also get a URL with lang=fr.
Viewing the page in a browser shows me French results.
However, the body of the response is in English.
I've tried disabling cookies in my Scrapy settings.py file.
I've also set DEFAULT_REQUEST_HEADERS to 'Accept-Language': 'fr'.
Any ideas?
In the upper-right corner of the webpage there's a drop-down field to choose the language of the website. Selecting French there adds another parameter to the URL: &locale=fr.
So, add that parameter to your start_urls.
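For example, reusing the URL from the question (a sketch; the only addition is &locale=fr):

start_urls = [
    'http://eur-lex.europa.eu/search.html?qid=1437402891621'
    '&DB_TYPE_OF_ACT=advGeneral&CASE_LAW_SUMMARY=false&DTS_DOM=EU_LAW'
    '&typeOfActStatus=ADV_GENERAL&type=advanced&lang=fr'
    '&SUBDOM_INIT=EU_CASE_LAW&DTS_SUBDOM=EU_CASE_LAW'
    '&locale=fr'  # the parameter the language drop-down adds
]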

Scrapy Crawl all websites in start_url even if redirect

I am trying to crawl a long list of websites. Some of the websites in the start_urls list redirect (301). I want Scrapy to crawl the redirected websites from the start_urls list as if they were also on the allowed_domains list (which they are not). For example, example.com was on my start_urls list and allowed_domains list, and example.com redirects to foo.com. I want to crawl foo.com.
DEBUG: Redirecting (301) to <GET http://www.foo.com/> from <GET http://www.example.com>
I tried dynamically adding to allowed_domains in the parse_start_url method and returning a Request object so that Scrapy would go back and scrape the redirected websites once they were on the allowed_domains list, but I still get:
DEBUG: Filtered offsite request to 'www.foo.com'
Here is my attempt to dynamically add allowed_domains:
def parse_start_url(self, response):
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
        return Request(response.url, callback=self.parse_callback)
    else:
        return self.parse_it(response, 1)
My other ideas were to try and create a function in the spidermiddleware offsite.py that dynamically adds allowed_domains for redirected websites that originated from start_urls, but I have not been able to get that solution to work either.
I figured out the answer to my own question.
I edited the offsite middleware to get the updated list of allowed domains before it filters, and I dynamically add to the allowed_domains list in the parse_start_url method.
I added this function to OffsiteMiddleware:
def update_regex(self, spider):
    self.host_regex = self.get_host_regex(spider)
I also edited this function inside OffsiteMiddleware:
def should_follow(self, request, spider):
    # Custom code to update the regex before each check
    self.update_regex(spider)
    regex = self.host_regex
    # hostname can be None for wrong urls (like javascript links)
    host = urlparse_cached(request).hostname or ''
    return bool(regex.search(host))
Lastly, for my use case, I added this code to my spider:
def parse_start_url(self, response):
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
    return self.parse_it(response, 1)
This code will add the redirected domain for any start_urls that get redirected, and will then crawl those redirected sites.
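To activate the change, the edited class has to replace the built-in one. Here is a sketch of the settings, assuming the subclass lives at the hypothetical path myproject.middlewares.MyOffsiteMiddleware (the built-in class is scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware in older Scrapy versions, scrapy.spidermiddlewares.offsite.OffsiteMiddleware in newer ones):

SPIDER_MIDDLEWARES = {
    # Disable the stock middleware (older path shown; adjust for your version).
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
    # Enable the subclass containing the update_regex/should_follow changes above.
    'myproject.middlewares.MyOffsiteMiddleware': 500,
}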

Problem logging into Facebook with Scrapy

(I have asked this question on the Scrapy google-group without luck.)
I am trying to log into Facebook using Scrapy. I tried the following in the interactive shell:
I set the headers and created a request as follows:
header_vals = {
    'Accept-Language': ['en'],
    'Content-Type': ['application/x-www-form-urlencoded'],
    'Accept-Encoding': ['gzip,deflate'],
    'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
    'User-Agent': ['Mozilla/5.0 Gecko/20070219 Firefox/2.0.0.2'],
}
login_request = Request('https://www.facebook.com/login.php', headers=header_vals)
fetch(login_request)
I get redirected:
2011-08-11 13:54:54+0530 [default] DEBUG: Redirecting (meta refresh) to <GET https://www.facebook.com/login.php?_fb_noscript=1> from <GET https://www.facebook.com/login.php>
...
[s] request <GET https://www.facebook.com/login.php>
[s] response <200 https://www.facebook.com/login.php?_fb_noscript=1>
I guess it shouldn't be redirected there if I am supplying the right headers?
I still attempt to go ahead and supply login details using the FormRequest as follows:
new_request = FormRequest.from_response(
    response,
    formname='login_form',
    formdata={'email': '...#email.com', 'pass': 'password'},
    headers=header_vals,
)
new_request.meta['download_timeout'] = 180
new_request.meta['redirect_ttl'] = 30
fetch(new_request) results in:
2011-08-11 14:05:45+0530 [default] DEBUG: Redirecting (meta refresh) to <GET https://www.facebook.com/login.php?login_attempt=1&_fb_noscript=1> from <POST https://www.facebook.com/login.php?login_attempt=1>
...
[s] response <200 https://www.facebook.com/login.php?login_attempt=1&_fb_noscript=1>
...
What am I missing here? Thanks for any suggestions and help.
I'll add that I've also tried this with a BaseSpider to see if this was a result of the cookies not being passed along in the shell, but it doesn't work there either.
I was able to use Mechanize to log on successfully. Can I take advantage of this to somehow pass cookies on to Scrapy?
Notice the "meta refresh" text next to "Redirecting". Facebook has a noscript tag that automatically redirects clients without JavaScript to "/login.php?_fb_noscript=1". The problem is that you're posting to "/login.php" instead, so you always get redirected by the meta refresh.
Even if you get past this problem, it's against Facebook's robots.txt, so you shouldn't really be doing this.
Why don't you just use the Facebook Graph API?
