I'm downloading the file airports.net from GitHub with urllib3 and reading it as a graph object with networkx.read_pajek, as follows:
import urllib3
import networkx as nx
http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)
G = nx.read_pajek(f.data(), encoding = 'UTF-8')
print(G)
Then I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-16-7728c1228755> in <module>
13 url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
14 f = http.request('GET', url)
---> 15 G = nx.read_pajek(f.data(), encoding = 'UTF-8')
16 print(G)
17
TypeError: 'bytes' object is not callable
Could you please elaborate on how to do this correctly?
Update: If I change f.data() to f.data, then a new error appears
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-e96ad6eb1bfb> in <module>()
6 url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
7 f = http.request('GET', url)
----> 8 G = nx.read_pajek(f.data, encoding = 'UTF-8')
9 print(G)
<decorator-gen-781> in read_pajek(path, encoding)
4 frames
/usr/local/lib/python3.6/dist-packages/networkx/readwrite/pajek.py in <genexpr>(.0)
159 for format information.
160 """
--> 161 lines = (line.decode(encoding) for line in path)
162 return parse_pajek(lines)
163
AttributeError: 'int' object has no attribute 'decode'
As can be inferred from the error message and also read in the docs, HTTPResponse.data is a property of type bytes rather than a method. So you need f.data rather than f.data() in order to retrieve the value.
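For illustration, here is a minimal sketch of the property access, reusing the URL from the question; the variable names are just examples:

import urllib3

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
resp = http.request('GET', url)

raw_bytes = resp.data      # bytes property: no parentheses
print(type(raw_bytes))     # <class 'bytes'>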
Update
Regarding the AttributeError: as can be verified in the networkx docs, the function read_pajek expects its first argument to be a path to a file containing the data (or a file-like object), not the raw bytes themselves. So you could dump the bytes to a file and then pass the path to that file as the argument. There are several options:
Just use a hardcoded filename. This is arguably the simplest and doesn't require additional imports.
import urllib3
import networkx as nx

FILE_NAME = "/tmp/test.net"

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)

with open(FILE_NAME, "w") as fh:
    fh.write(f.data.decode())

G = nx.read_pajek(FILE_NAME, encoding='UTF-8')
print(f"G='{G}', G.size={G.size()}")
Use the tempfile standard library module to manage the file for you (i.e. give it a randomized name, then remove it after it is no longer used).
import tempfile
import urllib3
import networkx as nx

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)

with tempfile.NamedTemporaryFile() as fh:
    fh.write(f.data)
    fh.flush()  # make sure the bytes hit the disk before networkx reads the file
    G = nx.read_pajek(fh.name, encoding='UTF-8')

print(f"G='{G}', G.size={G.size()}")
Use io.BytesIO or io.StringIO (an "in-memory file"). This creates an object that lives in memory (RAM) but exposes the same API as a regular file on disk. Accessing data in RAM is much faster, so this is useful for performance reasons. You can't always use it, since RAM is limited, but in your case you already have the data in memory, so dumping it to disk only for networkx to read it straight back would be wasted work. With a single, not-too-large file you probably won't notice the difference, but the technique may come in handy in the future.
import io
import urllib3
import networkx as nx

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)

data = io.BytesIO(f.data)
G = nx.read_pajek(data, encoding='UTF-8')
print(f"G='{G}', G.size={G.size()}")
Related
I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.
In [1]: import requests
In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
In [3]: response = requests.get(url)
In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
...: f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2 f.write(response.text)
3
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)
In [5]: import codecs
In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
...: f.write(response.text)
...:
I know it is a codec problem of some kind but I can't seem to get it to work.
You should use response.content in this case:
with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
From the documentation:
You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
So that means: response.text returns the output as a string object; use it when you're downloading a text file, such as an HTML page.
And response.content returns the output as a bytes object; use it when you're downloading a binary file, such as a PDF, an audio file, or an image.
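As a quick illustration of the difference, here is a minimal sketch reusing the URL from the question:

import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
r = requests.get(url)

print(type(r.text))     # <class 'str'>   - decoded text, wrong for a PDF
print(type(r.content))  # <class 'bytes'> - raw bytes, safe to write with 'wb'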
You can also use response.raw instead; use it when the file you're about to download is large. Below is a basic streaming example, which you can also find in the documentation:
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
chunk_size = 2000

r = requests.get(url, stream=True)
with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
chunk_size is the chunk size you want to use. If you set it to 2000, requests will download the file 2000 bytes at a time, writing each chunk to the file until the download is complete.
This can save RAM. Still, I'd prefer response.content in this case, since your file is small and the streaming approach is more involved.
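If you do want to use response.raw directly, a common pattern is to copy it with shutil; this is a sketch assuming the same URL and output path, where decode_content=True tells urllib3 to undo any gzip encoding the server applied:

import shutil
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
r = requests.get(url, stream=True)
r.raw.decode_content = True   # undo gzip/deflate if the server applied it

with open('/tmp/metadata.pdf', 'wb') as f:
    shutil.copyfileobj(r.raw, f)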
Related:
How to download large file in python with requests.py?
How to download image using requests
In Python 3, I find pathlib is the easiest way to do this. requests' response.content marries up nicely with pathlib's write_bytes.
from pathlib import Path
import requests
filename = Path('metadata.pdf')
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
filename.write_bytes(response.content)
You can use urllib:
import urllib.request
urllib.request.urlretrieve(url, "filename.pdf")
Please note I'm a beginner. If my solution is wrong, please feel free to correct me and/or let me know; I may learn something new too.
My solution:
Change downloadPath according to where you want your file to be saved. You can also use an absolute path.
Save the below as downloadFile.py.
Usage: python downloadFile.py url-of-the-file-to-download new-file-name.extension
Remember to add an extension!
Example usage: python downloadFile.py http://www.google.co.uk google.html
import requests
import sys
import os

def downloadFile(url, fileName):
    with open(fileName, "wb") as file:
        response = requests.get(url)
        file.write(response.content)

scriptPath = sys.path[0]
downloadPath = os.path.join(scriptPath, '../Downloads/')
url = sys.argv[1]
fileName = sys.argv[2]

print('path of the script: ' + scriptPath)
print('downloading file to: ' + downloadPath)
downloadFile(url, downloadPath + fileName)
print('file downloaded...')
print('exiting program...')
Generally, this should work in Python 3:
import urllib.request
..
urllib.request.urlopen(url)
Remember that urllib2 no longer exists in Python 3; its functionality was merged into urllib.request, and there is no urllib.request.get() function.
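To spell that out, here is a minimal sketch with urllib.request.urlopen, reusing the PDF URL and output path from the question:

import urllib.request

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'

with urllib.request.urlopen(url) as resp, open('/tmp/metadata.pdf', 'wb') as out:
    out.write(resp.read())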
If in some mysterious case requests doesn't work (it happened to me), you can also try the third-party wget package:
import wget
wget.download(url)
Related:
Here's a decent explanation/solution to find and download all PDF files on a webpage:
https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
Regarding Kevin's answer: to write into a tmp folder inside the current directory, it should be like this:
with open('./tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
He forgot the . before the path, and of course the tmp folder must already exist.
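If the folder might not exist yet, here is a small sketch that creates it first (os.makedirs with Python 3's exist_ok=True is a no-op when the folder is already there):

import os
import requests

os.makedirs('./tmp', exist_ok=True)   # create ./tmp if it does not exist

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)

with open('./tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)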
I am trying to use this code to download an image from the given URL
import urllib.request
resource = urllib.request.urlretrieve("http://farm2.static.flickr.com/1184/1013364004_bcf87ed140.jpg")
output = open("file01.jpg","wb")
output.write(resource)
output.close()
However, I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-39-43fe4522fb3b> in <module>()
41 resource = urllib.request.urlretrieve("http://farm2.static.flickr.com/1184/1013364004_bcf87ed140.jpg")
42 output = open("file01.jpg","wb")
---> 43 output.write(resource)
44 output.close()
TypeError: a bytes-like object is required, not 'tuple'
I get that it's the wrong data type for .write(), but I don't know how to feed resource into output.
Right. Use urllib.request.urlretrieve like this:
import urllib.request

resource, headers = urllib.request.urlretrieve("http://farm2.static.flickr.com/1184/1013364004_bcf87ed140.jpg")
image_data = open(resource, "rb").read()

with open("file01.jpg", "wb") as f:
    f.write(image_data)
PS: urllib.request.urlretrieve returns a tuple; the first element is the location of a temporary file holding the download. You can read the bytes from that temporary file and save them to a new file.
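Alternatively, you can let urlretrieve write the destination file for you by passing the target filename as the second argument; a minimal sketch with the same image URL:

import urllib.request

urllib.request.urlretrieve(
    "http://farm2.static.flickr.com/1184/1013364004_bcf87ed140.jpg",
    "file01.jpg",   # urlretrieve downloads straight into this file
)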
In the official documentation:
The following functions and classes are ported from the Python 2 module urllib (as opposed to urllib2). They might become deprecated at some point in the future.
So I would recommend using urllib.request.urlopen instead; try the code below:
import urllib.request
resource = urllib.request.urlopen("http://farm2.static.flickr.com/1184/1013364004_bcf87ed140.jpg")
output = open("file01.jpg", "wb")
output.write(resource.read())
output.close()
When I try to run some code I found on the internet in IPython, it comes up with an error:
TypeError Traceback (most recent call last)
<ipython-input-4-36ec95de9a5d> in <module>()
13 all[i] = r.json()
14
---> 15 cPickle.dump(all, outfile)
TypeError: argument must have 'write' attribute
Here's what I have done in order:
outfile = "C:\John\Footy Bants\R COMPLAEX MATHS"
Then, I pasted in the following code:
import requests, cPickle, shutil, time

all = {}
errorout = open("errors.log", "w")

for i in range(600):
    playerurl = "http://fantasy.premierleague.com/web/api/elements/%s/"
    r = requests.get(playerurl % i)

    # skip non-existent players
    if r.status_code != 200: continue

    all[i] = r.json()

cPickle.dump(all, outfile)
Here's the original article to give you an idea of what I'm trying to achieve:
http://billmill.org/fantasypl/
The second argument to cPickle.dump() must be a file object. You passed in a string containing a filename instead.
You need to use the open() function to open a file object for that filename, then pass the file object to cPickle:
with open(outfile, 'wb') as pickle_file:
    cPickle.dump(all, pickle_file)
See the Reading and Writing Files section of the Python tutorial, including why using with when opening a file is a good idea (it'll be closed for you automatically).
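For completeness, here is a minimal sketch of reading the pickle back later, assuming the same outfile path as above (the variable name is just an example):

with open(outfile, 'rb') as pickle_file:
    all_players = cPickle.load(pickle_file)   # the dict of player data keyed by id

print(len(all_players))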
I'm uploading potentially large files to a web server. Currently I'm doing this:
import urllib2
f = open('somelargefile.zip','rb')
request = urllib2.Request(url,f.read())
request.add_header("Content-Type", "application/zip")
response = urllib2.urlopen(request)
However, this reads the entire file's contents into memory before posting it. How can I have it stream the file to the server?
Reading through the mailing list thread linked to by systempuntoout, I found a clue towards the solution.
The mmap module allows you to open a file so that it acts like a string. Parts of the file are loaded into memory on demand.
Here's the code I'm using now:
import urllib2
import mmap
# Open the file as a memory mapped string. Looks like a string, but
# actually accesses the file behind the scenes.
f = open('somelargefile.zip','rb')
mmapped_file_as_string = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# Do the request
request = urllib2.Request(url, mmapped_file_as_string)
request.add_header("Content-Type", "application/zip")
response = urllib2.urlopen(request)
#close everything
mmapped_file_as_string.close()
f.close()
The documentation doesn't say you can do this, but the code in urllib2 (and httplib) accepts any object with a read() method as data. So using an open file seems to do the trick.
You'll need to set the Content-Length header yourself. If it's not set, urllib2 will call len() on the data, which file objects don't support.
import os.path
import urllib2

data = open(filename, 'rb')
headers = {'Content-Length': str(os.path.getsize(filename))}

request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
This is the relevant code that handles the data you supply. It's from the HTTPConnection class in httplib.py in Python 2.7:
def send(self, data):
    """Send `data' to the server."""
    if self.sock is None:
        if self.auto_open:
            self.connect()
        else:
            raise NotConnected()

    if self.debuglevel > 0:
        print "send:", repr(data)
    blocksize = 8192
    if hasattr(data,'read') and not isinstance(data, array):
        if self.debuglevel > 0: print "sendIng a read()able"
        datablock = data.read(blocksize)
        while datablock:
            self.sock.sendall(datablock)
            datablock = data.read(blocksize)
    else:
        self.sock.sendall(data)
Have you tried with Mechanize?
from mechanize import Browser
br = Browser()
br.open(url)
br.form.add_file(open('largefile.zip'), 'application/zip', 'largefile.zip')
br.submit()
or, if you don't want to use multipart/form-data, check this old post.
It suggests two options:
1. Use mmap, Memory Mapped file object
2. Patch httplib.HTTPConnection.send
Try pycurl. I don't have anything set up that will accept a large file outside of a multipart/form-data POST, but here's a simple example that reads the file as needed.
import os
import pycurl

class FileReader:
    def __init__(self, fp):
        self.fp = fp

    def read_callback(self, size):
        return self.fp.read(size)

c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.UPLOAD, 1)
c.setopt(pycurl.READFUNCTION, FileReader(open(filename, 'rb')).read_callback)
filesize = os.path.getsize(filename)
c.setopt(pycurl.INFILESIZE, filesize)
c.perform()
c.close()
Using the requests library you can do
with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)
as mentioned here in their docs
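requests can also take a generator as the data argument, in which case it sends the body with chunked transfer encoding; here is a minimal sketch (the helper name and chunk size are just examples):

import requests

def read_in_chunks(path, chunk_size=8192):
    # yield the file piece by piece so the whole body never sits in memory
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

requests.post('http://some.url/streamed', data=read_in_chunks('massive-body'))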
Below is a working example for both Python 2 and Python 3:
import os

try:
    from urllib2 import urlopen, Request           # Python 2
except ImportError:
    from urllib.request import urlopen, Request   # Python 3

headers = {'Content-length': str(os.path.getsize(filepath))}

with open(filepath, 'rb') as f:
    req = Request(url, data=f, headers=headers)
    result = urlopen(req).read().decode()
The requests module is great, but sometimes you cannot install any extra modules...