Regarding page response - python

When I use the requests library I get a 503 status code, but when I open the page in the browser and inspect the response status, it shows a 404.
I tried this:
import requests
print(requests.get("https://www.amazon.de/dp/1015").status_code)
How can I get the exact response the browser gets?

The reason you are getting different responses is that your browser sends extra headers with each request. Headers carry extra information about a request, and one header browsers always send is User-Agent.
If Amazon sees this header, it assumes a browser is asking for the page and serves the response it is programmed to give browsers, which for this URL is a 404.
Without the header, it can tell you aren't using a browser and blocks the request with a 503 instead. That's just what it's programmed to do.
To add headers, use:
import requests

# Pretend to be Firefox by sending a browser-like User-Agent header.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'}
req = requests.get("https://www.amazon.de/dp/1015", headers=headers)
print(req.status_code)
which prints 404.
The User-Agent header tells Amazon details about your browser and operating system. To find your own User-Agent string, look at the request headers in the Network tab of your browser's developer tools.
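If you are curious what requests sends by default, one quick check (a small sketch using the third-party echo service httpbin.org) is to ask a server to return your own request headers:

import requests

# httpbin.org/headers echoes back the headers it received, as JSON.
resp = requests.get("https://httpbin.org/headers")
print(resp.json()["headers"]["User-Agent"])  # typically 'python-requests/x.y.z', which sites can detect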

Related

get request returns 403 status code even after using header

I'm trying to scrape data from an Autotrader page. I managed to grab the link to every offer on the page, but when I try to get the data from each offer I get a 403 status even though I'm using a header.
What more can I do to get past it?
headers = {"User Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/85.0.4183.121 Safari/537.36'}
page = requests.get("https://www.autotrader.co.uk/car-details/202010145012219", headers=headers)
print(page.status_code) # 403 forbidden
content_of_page = page.content
soup = bs4.BeautifulSoup(content_of_page, 'lxml')
title = soup.find('h1', {'class': 'advert-heading__title atc-type-insignia atc-type-insignia--medium '})
print(title.text)
[For people in the same position: Autotrader uses Cloudflare to protect every "car-details" page, so I would suggest using Selenium, for example.]
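A minimal Selenium sketch of that suggestion (assuming Chrome with a matching chromedriver is installed; the h1 class is copied from the question and may have changed since):

import bs4
from selenium import webdriver

driver = webdriver.Chrome()  # drives a real browser, which passes the Cloudflare check
driver.get("https://www.autotrader.co.uk/car-details/202010145012219")
soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
title = soup.find('h1', {'class': 'advert-heading__title atc-type-insignia atc-type-insignia--medium '})
print(title.text if title else 'title not found')
driver.quit()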
If you can manage to get the data via your browser, i.e. you can somehow see the data on the website, then you can likely replicate that with requests.
Briefly, you need the headers in your request to match the browser's request:
1. Open dev tools in your browser (e.g. F12, Cmd+Opt+I, or via the menu).
2. Open the Network tab.
3. Reload the page (the whole website, or only the target request's url, whatever provokes the desired response from the server).
4. Find the HTTP request to the desired url in the Network tab, right-click it, click 'Copy...', and choose the format you need (e.g. cURL).
Your browser sends tons of extra headers, and you never know which ones the server actually checks, so this technique will save you a lot of time.
However, it can fail if there's some protection against blunt request copies, e.g. temporary tokens, so the requests cannot be reused. In that case you need Selenium (browser emulation/automation); it's not difficult, so it's worth using.
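As a sketch, if the copied cURL command showed headers like the ones below (these values are placeholders, not the ones Autotrader actually checks), you would replay the request like this:

import requests

# Illustrative only: paste the header names and values copied from your own browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.5',
    'Referer': 'https://www.autotrader.co.uk/',
}
page = requests.get("https://www.autotrader.co.uk/car-details/202010145012219", headers=headers)
print(page.status_code)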

Python requests: Url will show table in browser but not when I use requests

I am trying to scrape a table in a webpage, or even download the .xlsx file of this table, using the requests library.
Normal workflow:
I log into the site, go to my reporting page, choose a report, and click a button that says "Test"; a second window opens with my table and gives me the option to download the .xlsx file.
I can copy and paste the url of that window into any Chrome browser where I am already logged in and it works. When I try it with requests, even when passing auth into my get(), I get a 200 response, but it is a simple page with one line of text telling me to "contact my tech staff to receive the proper url to enter your username and password". This is the same page I get when I paste the url into a browser where I am not logged into the site, except that there I am first redirected to a new url carrying the same sentence.
So I imagine there is a slug for the organization that is not passed in the url but sits somewhere in the headers or cookies when I access the site in my browser. How do I identify this parameter in the HTTP headers, and how do I send it with requests so I can get my table and move on to automating the .xlsx download?
import requests

url = 'https://myorganization.com/adhocHTML.xsl?x=adhoc.AdHocFilter-listAdhocData&filterID=45678&source=live'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
data = requests.get(url, headers=headers, auth=('username', 'Password'))
Any help would be greatly appreciated, as I am new to the requests library and just trying to automate some data flow before I even get to analyzing it.
You need to log in with requests. You can do this by making a Session and issuing your other requests through that session (it will keep all the cookies and other state).
Before writing the code you should do a few steps:
1. Make sure you are logged out.
2. Open the browser inspector on the login page and go to the Network tab.
3. Log in, then find the POST request in the Network tab that corresponds to your login.
4. At the bottom of that request's details you will find the parameters sent for the login. Turn those parameters into a dictionary (login_data) in your code and proceed as below:
session = requests.Session()                   # the session stores cookies across requests
session.post('url_to_login_page', login_data)  # log in; the auth cookies are saved on the session
data = session.get(url, headers=headers)       # subsequent requests are now authenticated
Login data are different for each website, so I can't give you a specific example, but you should be able to find them as described above. If you have trouble with that, tell me.
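For illustration only, a login_data dictionary often looks something like the sketch below; the field names here are hypothetical and must be replaced with the ones from the POST request you found in the Network tab:

# Hypothetical field names; copy the real ones from your browser's Network tab.
login_data = {
    'username': 'my_user',
    'password': 'my_password',
    # some sites also require a hidden token scraped from the login page
}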

I call an API from Python and get a 406 Not Acceptable response

I created an API on my site and I'm trying to call it from Python, but I always get a 406 response. However, if I put the url with the parameters in the browser, I can see the correct answer.
I already did some tests on pages that let you test your own API; tested in the browser, it works fine.
I also followed a manual that explains how to call an API from Python, but I do not get the correct response :(
This is the URL of the API with the params:
https://icassy.com/api/login.php?usuario_email=warles34%40gmail.com&usuario_clave=123
This is the code I use to call the API from Python
import requests
urlLogin = "https://icassy.com/api/login.php"
params = {'usuario_email': 'warles34@gmail.com', 'usuario_clave': '123'}
r = requests.get(url=urlLogin, data=params)
print(r)
print(r.content)
and I get:
<Response [406]>
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
I should receive in JSON format the success message and the apikey like this:
{"message":"Successful login.","apikey":"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJodHRwOlwvXC9leGFtcGxlLm9yZyIsImF1ZCI6Imh0dHA6XC9cL2ljYXNzeS5jb20iLCJpYXQiOjEzNTY5OTk1MjQsIm5iZiI6MTM1NzAwMDAwMCwiZGF0YSI6eyJ1c3VhcmlvX2lkIjoiMzQiLCJ1c3VhcmlvX25vbWJyZSI6IkNhcmxvcyIsInVzdWFyaW9fYXBlbGxpZG8iOiJQZXJleiIsInVzdWFyaW9fZW1haWwiOiJ3YXJsZXMzNEBnbWFpbC5jb20ifX0.bOhrC-vXhQEHtbbZGmhLByCxvJY7YxDrLhVOfy9zeFc"}
Looks like there is a validation on the server that checks whether the request is made from a browser. Adding a User-Agent header should do it:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url=urlLogin, params=params, headers=headers)  # note params=, not data=, so the values go in the query string
A list of common user-agent strings might come in handy in the future.
It turned out that the service I was making requests to was hosted on Akamai, which has a bot manager. It looks at where requests come from, and if it decides the client is a bot, you get a 406 error.
The solution was to ask for the server's IP to be whitelisted, or to send a special header with all server communication.
In my case, I had
'Accept': 'text/plain'
and it worked after I replaced it with
'Accept': 'application/json'
I didn't need to set a User-Agent at all.
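Putting the pieces from these answers together, a hedged sketch of a corrected call might look like this (note params= instead of data=, so the values are sent as query-string parameters exactly as in the browser url):

import requests

urlLogin = "https://icassy.com/api/login.php"
params = {'usuario_email': 'warles34@gmail.com', 'usuario_clave': '123'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Accept': 'application/json',  # 'text/plain' triggered the 406 in one case above
}
r = requests.get(urlLogin, params=params, headers=headers)
print(r.status_code)
print(r.json() if r.ok else r.content)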

python-requests gives me a different response from what I see in the browser. Why?

I want to get data from this site.
When I request the main url, I get an HTML file that contains the structure but not the values.
import requests
from bs4 import BeautifulSoup
url ='http://option.ime.co.ir/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')  # parse the response body, not the Response object
print(soup.prettify())
I found out that the site gets its values from:
url1 = 'http://option.ime.co.ir/GetTime'
url2 = 'http://option.ime.co.ir/GetMarketData'
When I open those urls in the browser, I see a JSON response and the time in a specific format, but when I use requests to get the data, it gives me the same HTML that I get from the main url.
Do you know the reason? How can I get the responses that I see in the browser?
I checked the headers for all the urls and didn't find anything special that I would need to send with my request.
You have to provide the proper HTTP headers in the request. In my case, I was able to make it work using the following headers. Note that in my testing the HTTP response was a 200 OK rather than a redirect to the root website (as when no HTTP headers were provided in the request).
Raw HTTP Request:
GET http://option.ime.co.ir/GetTime HTTP/1.1
Host: option.ime.co.ir
Referer: "http://option.ime.co.ir/"
Accept: "application/json, text/plain, */*"
User-Agent: "Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0"
This should give you the proper JSON response you need.
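Translated into requests (the header values below are taken verbatim from the raw request above; requests fills in the Host header itself):

import requests

headers = {
    'Referer': 'http://option.ime.co.ir/',
    'Accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
}
r = requests.get('http://option.ime.co.ir/GetTime', headers=headers)
print(r.json())  # should now be the JSON the browser sees, not the HTML shell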
Your first connection from the browser gets a 302 Redirect response (to the same url). The page then runs some JS, so the second request doesn't redirect anymore and gets the expected JSON.
It is a common technique to stop other people from using the API without permission.
Tick the "Preserve log" checkbox in dev tools and you can see this for yourself.

Using URLFetch in Python GAE to fetch a complete document

I am using urlfetch.fetch on App Engine with Python 2.7.
I tried fetching two URLs belonging to two different domains. For the first one, the result of urlfetch.fetch includes the results of the XHR queries that the page makes to fetch recommended products.
However, for the page belonging to the other domain, the XHR queries are not resolved and I mostly just get the plain HTML. The XHR queries on that page are likewise made to fetch recommended products to show, etc.
Here is how I use urlfetch:
fetch_result = urlfetch.fetch(url, deadline=5, validate_certificate=True)
URL 1 (the one where XHR is resolved and the response is complete)
https://www.walmart.com/ip/HP-15-f222wm-ndash-15.6-Laptop-Touchscreen-Windows-10-Home-Intel-Pentium-Quad-Core-Processor-4GB-Memory-500GB-Hard-Drive/53853531
URL 2 (the one where I just get the plain HTML for the most part)
https://www.flipkart.com/oricum-blue-486-loafers/p/itmezfrvwtwsug9w?pid=SHOEHZWJUMMTEYRU
Can someone please advise what I may be missing regarding this inconsistency?
The server is serving different output based on the user-agent string supplied in the request headers.
By default, urlfetch.fetch will send requests with the user agent header set to something like AppEngine-Google; (+http://code.google.com/appengine; appid: myapp.appspot.com).
A browser will send a user agent header like this: Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
If you override the default headers for urlfetch.fetch
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'}
urlfetch.fetch(url, headers=headers)
you will find that the html that you receive is almost identical to that served to the browser.
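Combined with the parameters from the question, a short sketch (using URL 2 above, the one that returned plain HTML):

from google.appengine.api import urlfetch

url = 'https://www.flipkart.com/oricum-blue-486-loafers/p/itmezfrvwtwsug9w?pid=SHOEHZWJUMMTEYRU'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'}

# Same call as in the question, but with the browser-like User-Agent override.
fetch_result = urlfetch.fetch(url, headers=headers, deadline=5, validate_certificate=True)
print(fetch_result.status_code)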
