I am using urlfetch.fetch in App Engine with Python 2.7.
I tried fetching two URLs belonging to two different domains. For the first one, the result of urlfetch.fetch includes content produced after the page's XHR queries (made to fetch recommended products) are resolved.
However, for the page on the other domain, the XHR queries are not resolved and I mostly just get the plain HTML. That page's XHR queries are likewise made to fetch recommended products to show, etc.
Here is how I use urlfetch:
fetch_result = urlfetch.fetch(url, deadline=5, validate_certificate=True)
URL 1 (the one where XHR is resolved and the response is complete)
https://www.walmart.com/ip/HP-15-f222wm-ndash-15.6-Laptop-Touchscreen-Windows-10-Home-Intel-Pentium-Quad-Core-Processor-4GB-Memory-500GB-Hard-Drive/53853531
URL 2 (the one where I just get the plain HTML for the most part)
https://www.flipkart.com/oricum-blue-486-loafers/p/itmezfrvwtwsug9w?pid=SHOEHZWJUMMTEYRU
Can someone please advise what I may be missing regarding this inconsistency?
The server is serving different output based on the user-agent string supplied in the request headers.
By default, urlfetch.fetch sends requests with the User-Agent header set to something like AppEngine-Google; (+http://code.google.com/appengine; appid: myapp.appspot.com).
A browser will send a user agent header like this: Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
If you override the default headers for urlfetch.fetch:
from google.appengine.api import urlfetch
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'}
urlfetch.fetch(url, headers=headers)
you will find that the html that you receive is almost identical to that served to the browser.
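The behaviour is easy to reproduce without App Engine at all. Below is a toy simulation (the page bodies and the "Mozilla" check are hypothetical, not either site's actual logic) of a server that returns the full page only when the User-Agent looks like a browser:

```python
# Toy simulation of server-side User-Agent sniffing.
# FULL_PAGE / SKELETON are made-up stand-ins for the two responses.
FULL_PAGE = "<html>...full page with recommended products...</html>"
SKELETON = "<html>...plain skeleton...</html>"

def serve(headers):
    ua = headers.get("User-Agent", "")
    # Many sites treat anything without "Mozilla" as a bot
    # and return a stripped-down page.
    return FULL_PAGE if "Mozilla" in ua else SKELETON

print(serve({"User-Agent": "AppEngine-Google; (+http://code.google.com/appengine)"}))
print(serve({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"}))
```

Swapping the user agent string is all it takes to flip which body comes back, which matches the inconsistency described in the question.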
Related
When I use the requests library, I get a different status code (503), but when I inspect the page in the browser and look at the status, it shows 404.
I tried this:
import requests
print(requests.get("https://www.amazon.de/dp/1015").status_code)
How can I get exact response?
The reason you are getting different responses is that your browser sends headers: extra information about a request. Browsers always send certain headers, and one of them is the User-Agent header.
If Amazon sees this header, it assumes a browser is asking for the page, so it is programmed to send a 404 response.
Without this header, it sends a 503 response because it knows you aren't using a browser. That's just what it's programmed to do.
To add headers, use:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'}
req = requests.get("https://www.amazon.de/dp/1015", headers=headers)
print(req.status_code)
which prints 404.
The User-Agent header tells amazon details about your browser and computer. To find your own User-Agent header, look at the headers in the Network part of your browser's inspector tab.
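If you want to verify which User-Agent your script will actually send, building the request object is enough; no traffic leaves your machine. For example, with the standard library (the UA string below is just a sample):

```python
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'}

# Constructing the Request performs no network I/O -- it just records
# what would be sent, so you can inspect it.
req = urllib.request.Request("https://www.amazon.de/dp/1015", headers=headers)
print(req.get_header('User-agent'))
```

With requests, the equivalent is to build a `requests.Request`, run it through `Session().prepare_request(...)`, and inspect the prepared request's `.headers`.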
I am new to web scraping and have stumbled upon an unexpected challenge. The goal is to input an incomplete URL string for a website and "catch" the corrected URL output returned by the website's redirect function. The specific website I am referring to is Marine Traffic.
When searching for a specific vessel profile, a proper query string should contain the parameters shipid, mmsi and imo. For example, this link returns a webpage with the profile for a specific vessel:
https://www.marinetraffic.com/en/ais/details/ships/shipid:368574/mmsi:308248000/imo:9337987/vessel:AL_GHARIYA/_:97e0de64144a0d7abfc154ea3bd1010e
As it turns out, a query string with only the imo parameter will redirect to the exact same url. So, for example, the following query will redirect to the same one as above:
https://www.marinetraffic.com/en/ais/details/ships/imo:9337987
My question is, using cURL in bash or another such tool such as the python requests library, how could one catch the redirect URL in an automated way? Curling the first URL returns the full html, while curling the second URL throws an Access Denied error. Why is this allowed in the browser? What is the workaround for this, if any, and what are some best practices for catching redirect URLs (using either python or bash)?
curl https://www.marinetraffic.com/en/ais/details/ships/imo:9337987
#returns Access Denied
Note: adding a user agent with curl --user-agent 'Chrome/79' does not get around the issue: the Access Denied error goes away, but nothing is returned.
You can try .url on the response object:
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9337987"
r = requests.get(url, headers=headers)
print(r.url)
Prints:
https://www.marinetraffic.com/en/ais/details/ships/shipid:368574/mmsi:308248000/imo:9337987/vessel:AL_GHARIYA
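Once you have the resolved URL, pulling the individual parameters back out is straightforward, since Marine Traffic encodes them as key:value path segments. A small parsing sketch (no network needed):

```python
from urllib.parse import urlparse

# The final URL as resolved by the redirect (from r.url above).
url = "https://www.marinetraffic.com/en/ais/details/ships/shipid:368574/mmsi:308248000/imo:9337987/vessel:AL_GHARIYA"

# Split the path into segments and keep the "key:value" ones.
params = dict(
    seg.split(":", 1)
    for seg in urlparse(url).path.split("/")
    if ":" in seg
)
print(params["shipid"], params["mmsi"], params["imo"])  # -> 368574 308248000 9337987
```

If you need the intermediate hops rather than just the final URL, requests also records them in r.history.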
I am trying to download a file using the Python requests module. My code works for some URLs/hosts, but I've come across one that does not work.
Based on other similar questions, it may be related to the User-Agent request header. I tried to remedy this by adding a Chrome user-agent, but the connection still times out for this particular URL (it does work for others).
I have tested opening the url in chrome browser (which works all OK) and inspecting the request headers, but I still can't figure out why my code is failing:
import requests
url = 'http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Indices-2020-03.csv'
headers = {'Upgrade-Insecure-Requests': '1', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
session = requests.Session()
session.headers.update(headers)
response = session.get(url, stream=True)
# !!! code fails here for this particular url !!!
with open('test.csv', "wb") as fh:
    for x in response.iter_content(chunk_size=1024):
        if x:
            fh.write(x)
Update 2020-08-14
I have figured out what was wrong; on the instances where the code was working the urls were using https protocol. This url is http protocol, and my proxy settings were not configured for http only https. After providing a http proxy to requests my code did work as written.
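For anyone hitting the same wall: requests picks a proxy by the scheme of the target URL, so an https entry in the proxies dict is never consulted for an http:// URL. A minimal sketch of that selection logic (the proxy address is a placeholder; substitute your own):

```python
from urllib.parse import urlparse

# Placeholder proxy addresses -- not a real proxy.
proxies = {
    "http": "http://proxy.example.com:3128",
    "https": "http://proxy.example.com:3128",
}

def proxy_for(url, proxies):
    # requests keys the proxies dict by the target URL's scheme,
    # so an http:// URL only ever sees the "http" entry.
    return proxies.get(urlparse(url).scheme)

print(proxy_for("http://publicdata.landregistry.gov.uk/market-trend-data/", proxies))
```

With only an "https" entry configured, an http:// URL gets no proxy at all, which is exactly the failure mode described in the update.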
The code you posted worked for me; it saved the file (129007 lines). It could be that the host is rate-limiting you; try again later to see if it works.
# count lines
$ wc -l test.csv
129007 test.csv
# inspect headers
$ head -n 4 test.csv
Date,Region_Name,Area_Code,Index
1968-04-01,Wales,W92000004,2.11932727
1968-04-01,Scotland,S92000003,2.108087275
1968-04-01,Northern Ireland,N92000001,3.300419757
You can disable requests' timeouts by passing timeout=None. Here is the official documentation: https://requests.readthedocs.io/en/master/user/advanced/#timeouts
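Note that the timeout applies to individual socket operations, not the total download time, and you can pass separate connect and read timeouts as a tuple. A sketch using the URL from the question:

```python
import requests

url = 'http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Indices-2020-03.csv'

try:
    # timeout=(connect, read): fail fast on the connection, allow slow reads.
    # timeout=None would disable the timeout entirely.
    response = requests.get(url, timeout=(3.05, 27))
except requests.exceptions.ConnectTimeout:
    print('could not connect within 3.05 seconds')
except requests.exceptions.ReadTimeout:
    print('the server stopped sending data for 27 seconds')
except requests.exceptions.RequestException as e:
    # Any other requests failure (DNS error, refused connection, ...).
    print('request failed:', e)
```

Both timeout exceptions are subclasses of requests.exceptions.Timeout, so you can also catch that one class if you don't care which phase failed.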
I'm reading a website's content using the following three lines. I used an example domain that is for sale and doesn't have much content.
url = "http://localbusiness.com/"
response = requests.get(url)
html = response.text
It returns the following HTML, although the website contains more HTML when you check through View Source. Am I doing something wrong here?
Python version 2.7
<html><head></head><body><!-- vbe --></body></html>
Try setting a User-Agent:
import requests
url = "http://localbusiness.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html',
}
response = requests.get(url, headers=headers)
html = response.text
The default User-Agent set by requests is 'User-Agent': 'python-requests/2.8.1'. Try to simulate that the request is coming from a browser and not a script.
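If you want to see the exact default for your installed version, requests exposes it as a helper:

```python
import requests

# The header requests sends when you don't override it,
# e.g. 'python-requests/2.8.1' (the version depends on your install).
print(requests.utils.default_user_agent())
```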
#jason answered it correctly, so I am extending his answer with the reasons.
Why It happens
Some DOM elements are changed by Ajax calls and JavaScript code, so they will not be seen in the response of your call (although that's not the case here, as you are already using View Source (Ctrl+U) to compare, not the element inspector).
Some sites use the user-agent to detect the nature of the user (desktop or mobile) and tailor the response accordingly (the probable cause here).
Other alternatives
You can use the mechanize module of Python to mimic a browser and fool a website (it comes in handy when the site uses some sort of authentication cookies). A small tutorial
Use selenium to actually drive a browser
Some Code
import urllib.request
from urllib import error

headers = {}
headers['user-agent'] = 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
headers['Accept-Language'] = 'en-gb,en;q=0.5'
#headers['Accept-Encoding'] = 'gzip, deflate'

request = urllib.request.Request(sURL, headers=headers)
try:
    response = urllib.request.urlopen(request)
except error.HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: {0}'.format(e.code))
except error.URLError as e:
    print('We failed to reach a server.')
    print('Reason: {0}'.format(e.reason))
else:
    f = open('output/{0}.html'.format(sFileName), 'w')
    f.write(response.read().decode('utf-8'))
A url
http://groupon.cl/descuentos/santiago-centro
The situation
Here's what I did:
1. enable javascript in browser
2. open url above and keep an eye on the console
3. disable javascript
4. repeat step 2 (for those of you who have just tuned in, javascript has now been disabled)
5. use urllib2 to grab the webpage and save it to a file
6. enable javascript
7. open the file with browser and observe console
8. repeat 7 with javascript off
results
In step 2 I saw that a whole lot of the page content was loaded dynamically using ajax. So the HTML that arrived was a sort of skeleton, and ajax was used to fill in the gaps. This is fine and not at all surprising.
Since the page should be SEO friendly, it should work fine without JS. In step 4 nothing happens in the console and the skeleton page loads pre-populated, rendering the ajax unnecessary. This is also completely not confusing.
In step 7 the ajax calls are made but fail. This is also OK, since the urls they use are not local; the calls are thus broken. The page looks like the skeleton. This is also great and expected.
In step 8 no ajax calls are made and the skeleton is just a skeleton. I would have thought that this should behave very much like step 4.
question
What I want to do is use urllib2 to grab the HTML from step 4, but I can't figure out how.
What am I missing, and how could I pull this off?
To paraphrase
If I were writing a spider, I would want to be able to grab plain ol' HTML (as in what resulted in step 4). I don't want to execute ajax stuff or any javascript at all. I don't want to populate anything dynamically. I just want HTML.
The SEO-friendly site wants me to get what I want, because that's what SEO is all about.
How would one go about getting plain HTML content given the situation I outlined?
To do it manually I would turn off JS, navigate to the page, view source, Ctrl-A, Ctrl-C, Ctrl-V (somewhere useful).
To get a script to do it for me I would...?
stuff I've tried
I used wireshark to look at packet headers and the GETs sent off from my pc in steps 2 and 4 have the same headers. Reading about SEO stuff makes me think that this is pretty normal otherwise techniques such as hijax wouldn't be used.
Here are the headers my browser sends:
Host: groupon.cl
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Here are the headers my script sends:
Accept-Encoding: identity
Host: groupon.cl
Accept-Language: en-gb,en;q=0.5
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
The differences are:
my script has Connection: close instead of keep-alive. I can't see how this would cause a problem.
my script has Accept-Encoding: identity. This might be the cause of the problem. I can't really see why the host would use this field to determine the user agent, though. If I change the encoding to match the browser's request headers, then I have trouble decoding the response. I'm working on this now...
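For the record, the decoding trouble is mechanical rather than mysterious: urllib does not decompress the body for you, so a response sent with Content-Encoding: gzip has to be run through the gzip module by hand. A self-contained sketch of just the decompression step (locally compressed bytes stand in for a real response body):

```python
import gzip

# Stand-in for response.read() on a reply sent with Content-Encoding: gzip.
body = gzip.compress(b"<html><head></head><body>hola</body></html>")

# What you'd do after checking response.headers for 'Content-Encoding: gzip'.
html = gzip.decompress(body).decode('utf-8')
print(html)  # -> <html><head></head><body>hola</body></html>
```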
watch this space, I'll update the question as new info comes up