Python requests with redirection on Heroku

I'm making a GET request with the requests module of Python. The request returns a 302 code and redirects to another URL. Locally I am able to capture that new URL with:
r = requests.get(URL)
finalURL = r.url
But when this code is executed on Heroku, the redirect is not followed and the original URL is returned to me.
I've tried everything, including forcing the redirection with this code:
r = requests.get(URL, allow_redirects=True)
I have also tried to pick up the URL from the response headers, such as Location or X-Originating-URL, but when the request is made from Heroku, the response does not include those values in the headers.

Redirects are enabled by default, so try inspecting the content of the response; something must be suppressing the redirect. You could also inspect the traffic with a proxy tool (e.g. Fiddler) to see what the server actually returns.
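To check whether a redirect is actually being issued, you can inspect the response's redirect history; a minimal sketch (the URL is a hypothetical placeholder):

import requests

URL = "https://example.com/some/path"  # hypothetical placeholder

r = requests.get(URL, allow_redirects=True)

# r.history lists each intermediate redirect response that requests followed;
# if it is empty, the server never sent a 3xx at all.
for hop in r.history:
    print(hop.status_code, hop.url, "->", hop.headers.get("Location"))

print("Final URL:", r.url)
print("Final status:", r.status_code)

If r.history is empty on Heroku but not locally, the server is most likely responding differently to Heroku's IP range or default headers, rather than requests misbehaving.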

Related

Python Request not allowing redirects

I am using the Python requests library to scrape robots.txt data from a list of URLs:
import urllib.parse
import requests
from requests import exceptions

for url in urls:
    url = urllib.parse.urljoin(url, "robots.txt")
    try:
        r = requests.get(url, headers=headers, allow_redirects=False)
        r.raise_for_status()
        extract_robots(r)
    except (exceptions.RequestException, exceptions.HTTPError, exceptions.Timeout) as err:
        handle_exception(err)
In my list of URLs, I have this webpage: https://reward.ff.garena.com. When I request https://reward.ff.garena.com/robots.txt, I am immediately redirected to https://reward.ff.garena.com/en, even though I specified allow_redirects=False in my request parameters.
How can I skip this kind of redirect and make sure only domain/robots.txt data reaches my extract_robots(data) method?
Do you know for sure that there is a robots.txt at that location?
I note that if I request https://reward.ff.garena.com/NOSUCHFILE.txt, I get the same result as for robots.txt.
allow_redirects=False only stops requests from automatically following 302/Location responses; it doesn't stop the server you're trying to access from returning a redirect as the response to the request you're making.
If you get this type of response, it probably indicates that the file you requested isn't available, or that some other error is preventing you from accessing it. In the general case of file access this might indicate a need for authentication, but for robots.txt that shouldn't be the problem, so it's simplest to assume the robots.txt isn't there.
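If you want to treat a redirect on robots.txt as "no robots.txt here", a sketch of that idea (extract_robots and handle_exception are the asker's own functions, and the User-Agent value is illustrative):

import urllib.parse
import requests
from requests import exceptions

headers = {"User-Agent": "robots-scanner"}  # illustrative value

for url in urls:
    robots_url = urllib.parse.urljoin(url, "robots.txt")
    try:
        r = requests.get(robots_url, headers=headers, allow_redirects=False)
        if r.is_redirect or r.is_permanent_redirect:
            # The server answered robots.txt with a redirect; treat that
            # as "no robots.txt at this location" and skip it.
            continue
        r.raise_for_status()
        extract_robots(r)
    except exceptions.RequestException as err:
        handle_exception(err)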

I am trying to write a website form-filling automation script in Python

When I run this code there are no errors, but it doesn't do what I want it to: I'm supposed to receive an email if everything goes right.
I tried using mechanize, but it always reports that the field name I specified doesn't exist on the website, which is not the case with requests.
import requests
url = 'https://silo-airsoft.com/giveaway/'
payload = {'ne':'sergejgolac#gmail.com'}
r = requests.post(url, params=payload)
I am doing all of this in a terminal, if that matters.
You probably mean to do this:
r = requests.post(url, data=payload)
Passing params means that your payload goes into the GET (query-string) parameters, so you are effectively POSTing to https://silo-airsoft.com/giveaway/?ne=sergejgolac#gmail.com with no body.
I assume you want to send the request to https://silo-airsoft.com/giveaway/ with
{'ne':'sergejgolac#gmail.com'}
as form data.
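To see the difference concretely, you can build and inspect a prepared request without sending it; a small sketch (note that requests percent-encodes the # in the address):

import requests

payload = {'ne': 'sergejgolac#gmail.com'}

# With params=, the payload is encoded into the query string:
p1 = requests.Request('POST', 'https://silo-airsoft.com/giveaway/', params=payload).prepare()
print(p1.url)   # ...giveaway/?ne=sergejgolac%23gmail.com
print(p1.body)  # None - the POST body is empty

# With data=, the payload becomes a form-encoded body:
p2 = requests.Request('POST', 'https://silo-airsoft.com/giveaway/', data=payload).prepare()
print(p2.url)   # https://silo-airsoft.com/giveaway/
print(p2.body)  # ne=sergejgolac%23gmail.com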

Redirected URL in Python

I'm trying to get the redirected URL, but something isn't working.
I tried two methods:
from urllib import request as req
import requests

# method 1
url_str = 'http://google.us/'
resp = req.urlopen(url_str)
print(resp.geturl())

# method 2
url_str = "http://google.us/"
resp = requests.get(url_str)
print(resp.url)
Both work and give the result https://www.google.com.
However, when I use this URL as url_str, nothing happens: http://www.kontrakt.szczecin.pl/lista-ofert/?f_listingId=351238&f=&submit=Szukaj. When you visit that page in a browser, you end up at this link:
http://www.kontrakt.szczecin.pl/mieszkanie-wynajem-41m2-1850pln-janusza-kusocinskiego-centrum-szczecin-zachodniopomorskie,351238
It is important for me to get that link, because I need information from it.
With allow_redirects=False you can make the response stay on the URL you requested, even though the server intended to redirect:
resp = requests.get(url_str, allow_redirects=False)
You can find more such usage here
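A sketch of that approach for this case: fetch without following redirects and read the Location header. Note that if the site redirects client-side (via JavaScript or a meta refresh) rather than with an HTTP status code, no Location header will appear, and requests alone cannot capture the final URL.

import requests

url_str = ('http://www.kontrakt.szczecin.pl/lista-ofert/'
           '?f_listingId=351238&f=&submit=Szukaj')

resp = requests.get(url_str, allow_redirects=False)

if 300 <= resp.status_code < 400:
    # The server issued an HTTP redirect; its target is in Location.
    print('Redirect target:', resp.headers.get('Location'))
else:
    # No HTTP redirect: the page may redirect via JavaScript or a
    # meta refresh, which requests cannot follow.
    print('No HTTP redirect; status:', resp.status_code)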

Unable to get the redirected URLs in Python. Tried using requests, urllib, urllib2, and mechanize

I have a huge list of URLs which redirect to different URLs.
I am supplying them in a for loop from a list and trying to print the redirected URLs.
The first redirected URL prints fine.
But from the second one on, requests just stops giving me redirected URLs and simply prints the given URL.
I tried implementing this with urllib, urllib2, and mechanize as well.
They give the first redirected URL fine, then throw an error at the second one and stop.
Can anyone please let me know why this is happening?
Below is the pseudo code/implementation:
import requests

for given_url in url_list:
    print("Given URL: " + given_url)
    s = requests.Session()  # note: this session is created but never used
    r = requests.get(given_url, allow_redirects=True)
    redirected_url = r.url
    print("Redirected URL: " + redirected_url)
Output:
Given URL: www.xyz.com
Redirected URL: www.123456789.com
Given URL: www.abc.com
Redirected URL: www.abc.com
Given URL: www.pqr.com
Redirected URL: www.pqr.com
Try a HEAD request; it won't follow redirects or download the entire body:
r = requests.head('http://www.google.com/')
print(r.headers['Location'])
There is nothing wrong with the code snippet you provided, but as you mentioned in the comments, you are getting HTTP 400 and 401 responses. HTTP 401 means Unauthorized, which means the site is blocking you. HTTP 400 means Bad Request, which typically means the site doesn't understand your request, but it can also be returned when you are being blocked, which I suspect is the case here too.
When I run your code for the ABC website I get redirected properly, which leads me to believe they are blocking your IP address for sending too many requests in a short period of time and/or for having no User-Agent set.
Since you mentioned you can open the links correctly in a browser, you can try setting your User-Agent string to match that of a browser. However, this is not guaranteed to work, since it is only one of many signals a site may use to detect whether you are a bot.
For example:
headers = {'User-agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
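Putting it together, a sketch of a more robust loop (url_list is the asker's list; the User-Agent string, timeout, and one-second delay are illustrative choices):

import time
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

for given_url in url_list:
    try:
        r = requests.get(given_url, headers=headers, allow_redirects=True, timeout=10)
        r.raise_for_status()  # 400/401 responses raise here instead of passing silently
        print('Given URL:', given_url)
        print('Redirected URL:', r.url)
    except requests.RequestException as err:
        print('Failed:', given_url, err)
    time.sleep(1)  # space out requests to reduce the chance of rate limiting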

Fetch a page with cookies using Python requests library

I'm studying the requests library (http://docs.python-requests.org/en/latest/)
and have run into a problem fetching a page with cookies using requests.
For example:
url2= 'https://passport.baidu.com'
parsedCookies = {'PTOKEN': '412f...', 'BDUSS': 'hnN2...', ...}  # cookie values replaced with ... for privacy
req = requests.get(url2, cookies=parsedCookies)
text=req.text.encode('utf-8','ignore')
f=open('before.html','w')
f.write(text)
f.close()
req.close()
When I use the code above to fetch the page, it just saves the login page to 'before.html' instead of the logged-in page, which means I haven't actually logged in successfully.
But if I use urllib2 to fetch the page, it works properly as expected.
parsedCookies="PTOKEN=412f...;BDUSS=hnN2...;..." #Different format but same content with the aboved cookies
req = urllib2.Request(url2)
req.add_header('Cookie', parsedCookies)
ret = urllib2.urlopen(req)
f=open('before_urllib2.html','w')
f.write(ret.read())
f.close()
ret.close()
When I use this code, it saves the logged-in page to before_urllib2.html.
Are there any mistakes in my code? Any reply would be appreciated.
You can use a Session object to get what you want:
import requests

url2 = 'http://passport.baidu.com'
session = requests.Session()  # create a Session object
cookie = requests.utils.cookiejar_from_dict(parsedCookies)
session.cookies.update(cookie)  # set the cookies of the Session object
req = session.get(url2, allow_redirects=True)
If you use the requests.get function, it doesn't send cookies for the redirected page. If you use the Session().get function instead, it will maintain and send cookies for all HTTP requests; this is exactly what the concept of a "session" means.
Let me try to elaborate on what happens here:
When I sent cookies to http://passport.baidu.com/center with the parameter allow_redirects set to False, the returned status code was 302 and one of the response headers was 'location': '/center?_t=1380462657' (this is a dynamic value generated by the server; replace it with what you get from the server):
url2= 'http://passport.baidu.com/center'
req = requests.get(url2, cookies=parsedCookies, allow_redirects=False)
print(req.status_code)  # output: 302
print(req.headers)
But when I set the parameter allow_redirects to True, it still doesn't end up on the redirected page (http://passport.baidu.com/center?_t=1380462657); the server returns the login page. The reason is that requests.get doesn't send cookies for the redirected page, here http://passport.baidu.com/center?_t=1380462657, so we cannot log in successfully. That is why we need the Session object.
If I set url2 = http://passport.baidu.com/center?_t=1380462657, it returns the page you want. So one solution is to use the above code to get the dynamic location value and form the path to your account, like http://passport.baidu.com/center?_t=1380462657; then you can get the desired page:
url2 = 'http://passport.baidu.com' + req.headers.get('location')
req = session.get(url2, cookies=parsedCookies, allow_redirects=True)
But this is cumbersome, so when dealing with cookies, the Session object does an excellent job for us!
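Putting the pieces together, a minimal sketch of the Session-based approach (parsedCookies is the dict from the question; 'after.html' is just an illustrative output name):

import requests

session = requests.Session()
session.cookies.update(requests.utils.cookiejar_from_dict(parsedCookies))

# The Session resends the cookies on every hop of the redirect chain,
# so the dynamic /center?_t=... URL is fetched while logged in.
resp = session.get('http://passport.baidu.com/center', allow_redirects=True)

print(resp.status_code)
for hop in resp.history:
    print('Redirected via:', hop.headers.get('location'))

with open('after.html', 'w', encoding='utf-8') as f:
    f.write(resp.text)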
