How to add Headers to Scrapy CrawlSpider Requests? - python

I'm working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the referer to the request.
As per this question, I checked
response.request.headers.get('Referer', None)
in my response parsing function and the Referer header is not present. I assume that means the Referer is not being submitted in the request (unless the website doesn't return it, I'm not sure on that).
I haven't been able to figure out how to modify the headers of a request. Again, my spider is derived from CrawlSpider. Overriding CrawlSpider's _requests_to_follow or specifying a process_request callback for a rule will not work because the referer is not in scope at those times.
Does anyone know how to modify request headers dynamically?

You can pass the Referer manually on each request using the headers argument:
yield Request(url=..., callback=..., headers={'Referer': ...})
RefererMiddleware does the same automatically, taking the referrer URL from the previous response.
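For instance, a minimal sketch of that manual approach inside a spider (example.com and the link selector are placeholders, not from the question):

import scrapy

class RefererSpider(scrapy.Spider):
    name = 'referer_example'
    start_urls = ['https://example.com/']  # placeholder

    def parse(self, response):
        # Follow each link, manually attaching the current page as the Referer.
        for href in response.css('a::attr(href)').getall():
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse_item,
                headers={'Referer': response.url},
            )

    def parse_item(self, response):
        # The header set above is visible on the request that produced this response.
        self.logger.info('Referer: %s', response.request.headers.get('Referer'))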

You have to enable the SpiderMiddleware that will populate the referer for responses. See the documentation for scrapy.contrib.spidermiddleware.referer.RefererMiddleware
In short, you need to add this middleware to your project's settings file.
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
}
Then, in your response parsing method, you can use response.request.headers.get('Referer', None) to get the referer.
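With the middleware enabled, a callback can then read the header; a small sketch (note Scrapy stores header values as bytes):

def parse_item(self, response):
    referer = response.request.headers.get('Referer')  # bytes, or None if absent
    if referer is not None:
        referer = referer.decode('utf-8')
    self.logger.info('Crawled %s (referred by %s)', response.url, referer)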

Related

Python Request not allowing redirects

I am using the Python requests library to fetch robots.txt data from a list of URLs:
import urllib.parse

import requests
from requests import exceptions

for url in urls:
    url = urllib.parse.urljoin(url, "robots.txt")
    try:
        r = requests.get(url, headers=headers, allow_redirects=False)
        r.raise_for_status()
        extract_robots(r)
    except (exceptions.RequestException, exceptions.HTTPError, exceptions.Timeout) as err:
        handle_exception(err)
In my list of URLs I have this webpage: https://reward.ff.garena.com. When I request https://reward.ff.garena.com/robots.txt, I am redirected straight to https://reward.ff.garena.com/en, even though I specified allow_redirects=False in my request parameters.
How can I skip this kind of redirect and make sure only domain/robots.txt data reaches my extract_robots(data) method?
Do you know for sure that there is a robots.txt at that location?
I note that if I request https://reward.ff.garena.com/NOSUCHFILE.txt I get the same result as for robots.txt.
allow_redirects=False only stops requests from automatically following 3xx responses with a Location header; it doesn't stop the server you're trying to access from returning a redirect as the response to your request.
If you get this type of response, it most likely means the file you requested isn't available, or some other error is preventing you from accessing it. In the general case a redirect like this might indicate a need for authentication, but for robots.txt that shouldn't be the problem, so the simplest interpretation is that the robots.txt isn't there.
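One way to handle that in code is to check whether the un-followed response is a redirect before extracting anything; a minimal sketch (fetch_robots is a hypothetical helper, and extract_robots comes from the question):

import urllib.parse

import requests

def fetch_robots(base_url, headers=None):
    # Fetch robots.txt without following redirects.
    robots_url = urllib.parse.urljoin(base_url, "robots.txt")
    r = requests.get(robots_url, headers=headers, allow_redirects=False, timeout=10)
    if r.is_redirect:
        # The server answered with a 3xx pointing elsewhere (e.g. /en):
        # treat this as "no robots.txt available".
        return None
    r.raise_for_status()
    return r

# Usage:
# resp = fetch_robots("https://reward.ff.garena.com")
# if resp is not None:
#     extract_robots(resp)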

print specific request header python

I am attempting to extract a specific request header using Python, the 'plib' header to be exact, after logging into my account to automate the login. I have successfully logged in and printed out all the request headers using selenium-wire, but I need one of them saved to a variable.
for request in driver.requests:
    # print(request.url)               # Request URL
    print(request.headers)             # Request headers
    # print(request.response.headers)  # Response headers
That is what I am using to print all the request headers, but I just need one. Can someone please assist?
Thanks.
You can index into the headers property like you would with a dictionary.
print(request.headers["plib"])
To quote from the documentation:
headers
A dictionary-like object of request headers. Headers are case-insensitive and duplicates are permitted. Asking for request.headers['user-agent'] will return the value of the User-Agent header. If you wish to replace a header, make sure you delete the existing header first with del request.headers['header-name'], otherwise you'll create a duplicate.
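Building on that, a minimal sketch of saving the value into a variable once a matching request is seen (the header name 'plib' is taken from the question and may need adjusting):

plib_value = None
for request in driver.requests:
    value = request.headers.get('plib')  # dictionary-like, case-insensitive lookup
    if value is not None:
        plib_value = value
        break

print(plib_value)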

Using Cookies To Access HTML

I'm trying to access a site (for which I have a login) through a .get(url) request. I tried passing the cookies that should authenticate my request, but I keep getting a 401 error. I passed the cookies in the request call like so:
requests.post('http://eventregistry.org/json/article?action=getArticles&articlesConceptLang=eng&articlesCount=25&articlesIncludeArticleConcepts=true&articlesIncludeArticleImage=true&articlesIncludeArticleSocialScore=true&articlesPage=1&articlesSortBy=date&ignoreKeywords=&keywords=soybean&resultType=articles', data={'connect.sid': 'long cookie found in Chrome settings'})
(Scroll over to see how cookies were used. Apologies for super long URL)
Am I approaching the cookie situation the wrong way? Should I login in with my username or password instead of passing the cookies? Or did I misinterpret my Chrome's cookie?
Thanks!
Solved:
import requests

payload = {
    'email': '####gmail.com',  # find the right field names in the site's login form HTML
    'pass': '###',
}

# Use 'with' to ensure the session is closed after use.
with requests.Session() as s:
    p = s.post('loginURL', data=payload)  # log in; the session stores the cookies
    r = s.get('restrictedURL')            # subsequent requests reuse those cookies
    print(r)  # etc.
I just wanted to let you know that we've updated the package for accessing the Event Registry data, so you can now make requests without using cookies. Instead you can just append the parameter apiKey=XXXX to the URL. You can find details on the documentation page:
http://eventregistry.org/documentation
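For example, a minimal sketch of that apiKey-based call with requests, reusing a few of the query parameters from the URL in the question (the key value itself is a placeholder):

import requests

params = {
    'action': 'getArticles',
    'keywords': 'soybean',
    'articlesCount': 25,
    'articlesSortBy': 'date',
    'resultType': 'articles',
    'apiKey': 'XXXX',  # placeholder: your Event Registry API key
}

r = requests.get('http://eventregistry.org/json/article', params=params)
r.raise_for_status()
print(r.json())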

Unable to login to my nytimes account using requests in python

I am trying to send a POST request using the nice Requests library in Python. I am sending the payload as shown in the code; however, the r.text print statement shows the HTML dump of the myaccount.nytimes.com page, which is not what I want. Does anyone know what's happening?
import requests

payload = {
    'userid': 'myemail',
    'password': 'mypass'
}

s = requests.session()
r = s.post('https://myaccount.nytimes.com/auth/login/?URI=http://www.nytimes.com/2014/09/13/opinion/on-long-island-a-worthy-plan-for-coastal-flooding.html?partner=rss', data=payload)
print(r.text)
There are a couple of hidden <input> fields that you are omitting from your form:
is_continue
expires
token
token looks like it would be required; maybe the others aren't.
And possibly remember, which is the "remember me" tickbox at the bottom of the form.
Starting with token, try incrementally adding fields until it works.
Edit from comment: Token is provided to you when you first access the login page. Thus you need to do an initial GET to https://myaccount.nytimes.com/auth/login/, parse the HTML (BeautifulSoup?) to get the token (and other fields), then POST back to the server. Or you could use mechanize to handle this more easily.
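A minimal sketch of that GET-parse-POST flow with BeautifulSoup (the hidden-field handling is generic; the live form's field names may differ):

import requests
from bs4 import BeautifulSoup

LOGIN_URL = 'https://myaccount.nytimes.com/auth/login/'

payload = {
    'userid': 'myemail',
    'password': 'mypass',
}

with requests.Session() as s:
    # 1. GET the login page so the server issues the hidden token.
    page = s.get(LOGIN_URL)
    soup = BeautifulSoup(page.text, 'html.parser')

    # 2. Copy every hidden input (token, expires, is_continue, ...) into the payload.
    for hidden in soup.select('input[type=hidden]'):
        if hidden.get('name'):
            payload[hidden['name']] = hidden.get('value', '')

    # 3. POST the combined payload back to the login endpoint.
    r = s.post(LOGIN_URL, data=payload)
    print(r.status_code)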

Relogin to Scraped Website on Resuming a Scrapy Job

Is there a way to have a Scrapy spider log in to a website on resuming a previously paused scraping job?
EDIT: To clarify, my question is really about Scrapy spiders rather than cookies in general. Perhaps a better question is whether there's any method which is called when a Scrapy spider is revived after being frozen in a job directory.
Yes, you can.
You should be clearer about the exact workflow of your scraper.
Anyway, I assume you log in when you scrape for the first time, and you want to reuse the same cookie when you resume the scraping.
You could use the httplib2 library to do something like this. Here is a code sample from their examples page; I have added comments for clarity.
import urllib.parse

import httplib2

http = httplib2.Http()
url = 'http://www.example.com/login'
body = {'USERNAME': 'foo', 'PASSWORD': 'bar'}
headers = {'Content-type': 'application/x-www-form-urlencoded'}

# Submit the form data to log in to the website.
response, content = http.request(url, 'POST', headers=headers, body=urllib.parse.urlencode(body))

# The 'response' object now contains the cookie the website sent,
# which can be used when visiting the website again.

# Set that cookie on the headers for the next request.
headers_2 = {'Cookie': response['set-cookie']}

url = 'http://www.example.com/home'
# Use 'headers_2' to visit the website as a logged-in client.
response, content = http.request(url, 'GET', headers=headers_2)
If you're not clear on how cookies work, do a quick search. In short, cookies are a client-side mechanism that helps servers maintain sessions.
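Since the question is really about a Scrapy spider resumed from a job directory, a more Scrapy-native sketch is to issue the login as a FormRequest with dont_filter=True, so the duplicate filter persisted in JOBDIR does not drop it on resume (the URLs and form field names here are placeholders):

import scrapy

class ReloginSpider(scrapy.Spider):
    name = 'relogin_example'

    def start_requests(self):
        # dont_filter=True keeps the persisted dupefilter from skipping the
        # login request when the job is resumed from JOBDIR.
        yield scrapy.FormRequest(
            'https://www.example.com/login',
            formdata={'USERNAME': 'foo', 'PASSWORD': 'bar'},
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        # Scrapy's cookie middleware carries the session cookie forward automatically.
        yield scrapy.Request('https://www.example.com/home', callback=self.parse_home)

    def parse_home(self, response):
        self.logger.info('Logged-in page fetched: %s', response.url)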
