Strange Output from Python urllib2 - python

I would like to read the source code of a webpage using urllib2; however, I'm seeing strange output that I've not seen before. Here's the code (Python 2.7, Linux):
import urllib2
open_url = urllib2.urlopen("http://www.elegantthemes.com/gallery/")
site_html = open_url.read()
site_html[:50]
Which gives the output:
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xe5\\ms\xdb\xb6\xb2\xfel\xcf\xe4?\xc0<S[\x9a\x8a\xa4^\xe28u,\xa5\x8e\x93\xf4\xa4\x93&\x99:9\xbdw\x9a\x8e\x07"'
Does anyone know why it's showing this as the output and not the correct HTML?

The HTTP response sent by the site is gzip-compressed, hence the strange output. urllib2 does not automatically decompress gzipped content. There are two ways to solve this:
1) Decode zipped content before printing -
import urllib2
import io
import gzip
open_url = urllib2.urlopen("http://www.elegantthemes.com/gallery/")
site_html = open_url.read()
bi = io.BytesIO(site_html)
gf = gzip.GzipFile(fileobj=bi, mode="rb")
s = gf.read()
print s[50:]
2) Use Requests library -
import requests
r = requests.get('http://www.elegantthemes.com/gallery/')
print r.content
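For Python 3, the decompression step from option 1 can be sketched as a small helper. This is only a sketch under stated assumptions: the name `decode_body` and its parameters are hypothetical, and the caller is assumed to pass in the raw body together with the value of the response's Content-Encoding header, if any.

```python
import gzip


def decode_body(raw, content_encoding=None):
    """Return the decompressed body if it looks gzip-compressed.

    `decode_body` is a hypothetical helper, not part of urllib.
    """
    # A gzip stream always starts with the magic bytes 0x1f 0x8b,
    # which is exactly what the output in the question begins with.
    looks_gzipped = raw[:2] == b"\x1f\x8b"
    if content_encoding == "gzip" or looks_gzipped:
        return gzip.decompress(raw)
    return raw
```

With `urllib.request`, the two arguments would typically come from `response.read()` and `response.headers.get("Content-Encoding")`.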

Related

Weird json value urllib python

I'm trying to manipulate a dynamic JSON from this site:
http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do
It has three elements: imagem, a base64 string; labelValorCaptcha, just a message; and uuidCaptcha, a value passed as a parameter to play a sound at the link below:
http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha=sajcaptcha_e7b072e1fce5493cbdc46c9e4738ab8a
When I open the first site in a browser and paste its uuidCaptcha value after the equals sign in the second link ("..uuidCaptcha="), the sound plays normally. I wrote some simple code to fetch these elements.
import urllib, json
url = "http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do"
response = urllib.urlopen(url)
data = json.loads(response.read())
urlSound = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha="
print urlSound + data['uuidCaptcha']
But I don't know what's happening: the uuidCaptcha value I fetch doesn't work. It opens an error web page.
Does anyone know why?
Thanks!
It works for me.
$ cat a.py
#!/usr/bin/env python
# encoding: utf-8
import urllib, json
url = "http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do"
response = urllib.urlopen(url)
data = json.loads(response.read())
urlSound = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha="
print urlSound + data['uuidCaptcha']
$ python a.py
http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha=sajcaptcha_efc8d4bc3bdb428eab8370c4e04ab42c
As I said, @Charlie Harding, the best way is to download the page and get the JSON values from it, because this JSON is dynamic and needs an open web link to exist.
More info here.
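The URL assembly from the question can also be written with `urlencode` rather than string concatenation. The sketch below is illustrative only: `build_sound_url` is a hypothetical helper, the timestamp is assumed to be milliseconds since the epoch, and it does not address whether the server ties a uuidCaptcha to the session that requested it.

```python
import json
import time
from urllib.parse import urlencode

SOUND_URL = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do"


def build_sound_url(captcha_json, timestamp_ms=None):
    """Build the sound URL from the JSON text returned by imagemCaptcha.do."""
    data = json.loads(captcha_json)
    if timestamp_ms is None:
        # Assumption: the timestamp parameter is milliseconds since the epoch.
        timestamp_ms = int(time.time() * 1000)
    query = urlencode({"timestamp": timestamp_ms,
                       "uuidCaptcha": data["uuidCaptcha"]})
    return SOUND_URL + "?" + query
```

Fetching the JSON and requesting the sound within the same session (for example, the same cookie jar) is likely to matter if the value is dynamic, but that is a guess rather than something the thread confirms.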

JSON serialization Error in Python 3.2

I am using the json library and trying to import a page feed into a CSV file. I have tried many ways to get the result, but every time the code executes it gives a "not JSON serializable" error. Facebook now uses an auth code, which I have and used, so the connection string will change; however, if you use a page with public privacy, you should still be able to get the result with the code below.
The following is the code:
import urllib3
import json
import requests
#from pprint import pprint
import csv
from urllib.request import urlopen
page_id = "abcd" # username or id
api_endpoint = "https://graph.facebook.com"
fb_graph_url = api_endpoint+"/"+page_id
try:
    #api_request = urllib3.Requests(fb_graph_url)
    #http = urllib3.PoolManager()
    #api_response = http.request('GET', fb_graph_url)
    api_response = requests.get(fb_graph_url)
    try:
        #print (list.sort(json.loads(api_response.read())))
        obj = open('data', 'w')
        # write(json_dat)
        f = api_response.content
        obj.write(json.dumps(f))
        obj.close()
    except Exception as ee:
        print(ee)
except Exception as e:
    print(e)
I have tried many approaches without success. I hope someone can help.
api_response.content is the raw body of the response (bytes in Python 3), not a Python object such as a dict or list, so json.dumps cannot serialize it.
Try either:
f = api_response.text
obj.write(f)
Or
f = api_response.json()
obj.write(json.dumps(f))
requests.get(fb_graph_url).content
is the raw response body (a byte string in Python 3), and json.dumps cannot serialize it. That function expects a JSON-compatible Python object such as a list or a dictionary.
If the response is already JSON, just write it to the file.
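The failure and the fix can be reproduced without any network access. The sketch below is a minimal illustration for Python 3: `response_body_to_json_text` is a hypothetical helper, and the body is assumed to be UTF-8 encoded JSON.

```python
import json


def response_body_to_json_text(body):
    """Turn a raw response body (bytes) into pretty-printed JSON text."""
    # json.dumps(body) would raise TypeError here, because bytes are not
    # JSON serializable; decode and parse first instead.
    parsed = json.loads(body.decode("utf-8"))
    return json.dumps(parsed, indent=2, sort_keys=True)
```

The returned string can be passed straight to `obj.write(...)` on a file opened in text mode.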

sending utf-8 address to urlretrieve in python

While trying to access a file whose name contains UTF-8 characters from a browser, I get the error:
The requested URL /images/0/04/×¤×ª×¨×•× ×•×ª_תרגילי×_על_משטחי×_דיפ'_2014.pdf was not found on this server.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.
In order to access the files I wrote the following python script:
# encoding: utf8
__author__ = 'Danis'
__date__ = '20/10/14'
import urllib
curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
urllib.urlretrieve(link, 'home/danisf/targil4.pdf')
but when I run the code I get the error URLError: <curr_link appears here> contains non-ASCII characters
How can I fix the code to get it to work? (By the way, I don't have access to the server or to the webmaster.) Maybe the browser failed for some reason other than bad encoding of the file name?
You cannot just pass Unicode URLs into urllib functions; URLs must be valid bytestrings instead. You'll need to encode to UTF-8, then url quote the path of your URL:
import urllib
import urlparse
curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
parsed_link = urlparse.urlsplit(curr_link.encode('utf8'))
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
encoded_link = parsed_link.geturl()
urllib.urlretrieve(encoded_link, 'home/danisf/targil4.pdf')
The specific URL you provided in your question produces a 404 error however.
Demo:
>>> import urllib
>>> import urlparse
>>> curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
>>> parsed_link = urlparse.urlsplit(curr_link.encode('utf8'))
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
>>> print parsed_link.geturl()
http://math-wiki.com/images/0/04/2014_%27%D7%93%D7%99%D7%A4_%D7%9E%D7%A9%D7%98%D7%97%D7%99%D7%9D_%D7%A2%D7%9C_%D7%A4%D7%AA%D7%A8%D7%95%D7%A0%D7%95%D7%AA.nn%20uft8pdf
Your browser usually decodes UTF-8 bytes encoded like this, to present a readable URL, but when sending the URL to the server to retrieve, it is encoded in the exact same manner.
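In Python 3 the same idea uses `urllib.parse`, and the explicit `.encode('utf8')` step is no longer needed because `quote` encodes `str` input as UTF-8 by default. A minimal sketch follows; `quote_url_path` is a hypothetical helper name, and it assumes the path has not been percent-encoded already (otherwise the `%` signs would be encoded a second time).

```python
from urllib.parse import quote, urlsplit


def quote_url_path(url):
    """Percent-encode only the path component of a URL (UTF-8)."""
    parts = urlsplit(url)
    # quote() encodes str input as UTF-8 by default and leaves "/" intact.
    return parts._replace(path=quote(parts.path)).geturl()
```

The result can then be passed to `urllib.request.urlretrieve`.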

Given a URL to a text file, what is the simplest way to read the contents of the text file?

In Python, when given the URL for a text file, what is the simplest way to access the contents of the text file and print them out locally line-by-line without saving a local copy of the text file?
TargetURL=http://www.myhost.com/SomeFile.txt
#read the file
#print first line
#print second line
#etc
Edit 09/2016: In Python 3 and up use urllib.request instead of urllib2
Actually the simplest way is:
import urllib2 # the lib that handles the url stuff
data = urllib2.urlopen(target_url) # it's a file like object and works just like a file
for line in data: # files are iterable
    print line
You don't even need "readlines", as Will suggested. You could even shorten it to: *
import urllib2
for line in urllib2.urlopen(target_url):
    print line
But remember in Python, readability matters.
This is the simplest way, but not the safe way: in network programming you usually cannot be sure that the amount of data you expect will be respected. It is generally better to read a fixed, reasonable amount of data, enough for what you expect but capped so that your script cannot be flooded:
import urllib2
data = urllib2.urlopen("http://www.google.com").read(20000) # read only 20 000 chars
data = data.split("\n") # then split it into lines
for line in data:
    print line
* Second example in Python 3:
import urllib.request # the lib that handles the url stuff
for line in urllib.request.urlopen(target_url):
    print(line.decode('utf-8')) #utf-8 or iso8859-1 or whatever the page encoding scheme is
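The bounded read described in this answer can be tried offline in Python 3 by substituting an in-memory stream for the network response. This is a sketch only: `read_lines_capped` is a hypothetical helper, the 20000-byte default simply mirrors the number used above, and `io.BytesIO` stands in for the object returned by `urlopen`.

```python
import io


def read_lines_capped(fileobj, max_bytes=20000, encoding="utf-8"):
    """Read at most max_bytes from a binary file-like object; return its lines."""
    raw = fileobj.read(max_bytes)
    # errors="replace" avoids an exception if the cap cuts a multi-byte
    # character in half at the end of the data.
    return raw.decode(encoding, errors="replace").splitlines()


# io.BytesIO stands in for the response object returned by urlopen().
fake_response = io.BytesIO(b"first line\nsecond line\nthird line\n")
for line in read_lines_capped(fake_response, max_bytes=15):
    print(line)
```

Note that the last line may be truncated when the cap is reached, so code that depends on complete lines should treat it with care.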
I'm a newbie to Python and the offhand comment about Python 3 in the accepted solution was confusing. For posterity, the code to do this in Python 3 is
import urllib.request
data = urllib.request.urlopen(target_url)
for line in data:
    ...
or alternatively
from urllib.request import urlopen
data = urlopen(target_url)
Note that just import urllib does not work.
The requests library has a simpler interface and works with both Python 2 and 3.
import requests
response = requests.get(target_url)
data = response.text
There's really no need to read line-by-line. You can get the whole thing like this:
import urllib
txt = urllib.urlopen(target_url).read()
import urllib2
for line in urllib2.urlopen("http://www.myhost.com/SomeFile.txt"):
    print line
Another way in Python 3 is to use the urllib3 package.
import urllib3
http = urllib3.PoolManager()
response = http.request('GET', target_url)
data = response.data.decode('utf-8')
This can be a better option than urllib, since urllib3 boasts:
Thread safety.
Connection pooling.
Client-side SSL/TLS verification.
File uploads with multipart encoding.
Helpers for retrying requests and dealing with HTTP redirects.
Support for gzip and deflate encoding.
Proxy support for HTTP and SOCKS.
100% test coverage.
import urllib2
f = urllib2.urlopen(target_url)
for l in f.readlines():
    print l
For me, none of the above responses worked right away. Instead, I had to do the following (Python 3):
from urllib.request import urlopen
data = urlopen("[your url goes here]").read().decode('utf-8')
# Do what you need to do with the data.
The requests package works really well and has a simple interface, as @Andrew Mao suggested:
import requests
response = requests.get('http://lib.stat.cmu.edu/datasets/boston')
data = response.text
for i, line in enumerate(data.split('\n')):
    print(f'{i} {line}')
Output:
0 The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
1 prices and the demand for clean air', J. Environ. Economics & Management,
2 vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
3 ...', Wiley, 1980. N.B. Various transformations are used in the table on
4 pages 244-261 of the latter.
5
6 Variables in order:
Check out the Kaggle notebook on how to extract a dataset/dataframe from a URL.
I do think requests is the best option. Also note the possibility of setting encoding manually.
import requests
response = requests.get("http://www.gutenberg.org/files/10/10-0.txt")
# response.encoding = "utf-8"
hehe = response.text
Just updating the Python 2 solution suggested by @ken-kinder so that it works with Python 3:
import urllib.request
urllib.request.urlopen(target_url).read()
You can also use this simple approach:
import requests
url_res = requests.get(url= "http://www.myhost.com/SomeFile.txt")
with open(filename + ".txt", "wb") as file:
    file.write(url_res.content)

Python error when using urllib.open

When I run this:
import urllib
feed = urllib.urlopen("http://www.yahoo.com")
print feed
I get this output in the interactive window (PythonWin):
<addinfourl at 48213968 whose fp = <socket._fileobject object at 0x02E14070>>
I'm expecting to get the source of the above URL. I know this has worked on other computers (like the ones at school) but this is on my laptop and I'm not sure what the problem is here. Also, I don't understand this error at all. What does it mean? Addinfourl? fp? Please help.
Try this:
print feed.read()
See Python docs here.
urllib.urlopen actually returns a file-like object so to retrieve the contents you will need to use:
import urllib
feed = urllib.urlopen("http://www.yahoo.com")
print feed.read()
In Python 3:
import urllib
import urllib.request
fh = urllib.request.urlopen(url)
html = fh.read().decode("iso-8859-1")
fh.close()
print (html)
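A slightly more defensive Python 3 variant closes the response automatically and uses the charset declared in the response headers when one is present. This is a sketch under assumptions: `fetch_text` is a hypothetical helper, UTF-8 is an assumed fallback, and the demonstration uses a local file:// URL so that it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_text(url, default_encoding="utf-8"):
    """Open a URL, read the body, and decode it to str."""
    with urlopen(url) as response:
        # The response object is file-like; printing it shows only its repr,
        # so the body has to be pulled out with read().
        charset = response.headers.get_content_charset() or default_encoding
        return response.read().decode(charset)


# Offline demonstration using a file:// URL, which urlopen also accepts.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<html>hello</html>", encoding="utf-8")
    print(fetch_text(page.as_uri()))
```

For an HTTP URL the call is the same, for example `fetch_text("http://www.yahoo.com")`.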
