I am trying to download a series of text files from different websites, using urllib.request with Python. I want to extend the list of URLs without making the code long.
The working sequence is
import urllib.request
url01 = 'https://web.site.com/this.txt'
url02 = 'https://web.site.com/kind.txt'
url03 = 'https://web.site.com/of.txt'
url04 = 'https://web.site.com/link.txt'
[...]
urllib.request.urlretrieve(url01, "Liste n°01.txt")
urllib.request.urlretrieve(url02, "Liste n°02.txt")
urllib.request.urlretrieve(url03, "Liste n°03.txt")
[...]
The number of files to download is increasing and I want to keep the second part of the code short.
I tried
i = 0
while i<51
i = i +1
urllib.request.urlretrieve( i , "Liste n°0+"i"+.txt")
It doesn't work, and I suspect that a while loop can be used for strings but not for requests.
So I was thinking of making it a function.
def newfunction(i)
return urllib.request.urlretrieve(url"i", "Liste n°0"+1+".txt")
But it seems that I am missing a big chunk of it.
This request works, but it seems I cannot adapt it to a long list of URLs.
As a general suggestion, I'd recommend the requests module for Python, rather than urllib.
Based on that, some naive code for a possible function:
import requests
def get_file(site, filename):
    target = site + "/" + filename
    try:
        r = requests.get(target, allow_redirects=True)
        # the with block closes the file even if the write fails
        with open(filename, 'wb') as f:
            f.write(r.content)
        return r.status_code
    except requests.exceptions.RequestException as e:
        print("File not downloaded, error: {}".format(e))
You can then call the function, passing in parameters of site and file name:
get_file('https://web.site.com', 'this.txt')
The function catches any request exception and prints the error instead of stopping execution, so one failed download does not abort the run. You could expand the exception handling to deal with files not being writable, but this should be a start.
It seems you're not casting the variable i to a string before concatenating it to the other strings. That may be the reason why your code isn't working. The while-loop/for-loop approach shouldn't affect whether or not the requests get sent out. I recommend using the requests module for making requests as well. Mike's post covers roughly what the function should look like. I also recommend creating a Session object if you're going to be making a lot of requests in a piece of code. The Session object keeps the underlying TCP connection open between requests to the same host, which should reduce latency, CPU usage, and network congestion (https://en.wikipedia.org/wiki/HTTP_persistent_connection#Advantages). The code would look something like this:
import requests
with requests.Session() as s:
    for i in range(10):
        s.get(str(i)+'.com') # make request
        # write to file here
To cast to a string you would want something like this:
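i = 0
while i < 51:
    i = i + 1
    # urls is assumed to be a list of the URL strings: the first argument must be a URL, not the counter
    urllib.request.urlretrieve(urls[i - 1], "Liste n°0" + str(i) + ".txt")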
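Putting the pieces of this thread together, one possible way to keep the download part short is to store the URLs in a list and loop over it. This is only a sketch: the URLs are the placeholders from the question, and target_name and download_all are hypothetical helper names, not part of urllib.

```python
import urllib.request

# Placeholder URLs, copied from the question.
urls = [
    'https://web.site.com/this.txt',
    'https://web.site.com/kind.txt',
    'https://web.site.com/of.txt',
    'https://web.site.com/link.txt',
]


def target_name(index):
    # {:02d} zero-pads the counter: "Liste n°01.txt", "Liste n°02.txt", ...
    return "Liste n°{:02d}.txt".format(index)


def download_all(urls, retrieve=urllib.request.urlretrieve):
    # enumerate(..., start=1) supplies the counter, so no manual i = i + 1 is needed.
    saved = []
    for index, url in enumerate(urls, start=1):
        filename = target_name(index)
        retrieve(url, filename)
        saved.append(filename)
    return saved


# download_all(urls)  # uncomment to actually fetch the files
```

Adding a file then only means adding one line to the list. The retrieve parameter exists so the loop can be tried out with a stand-in function before any real request is sent.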
EDIT2: Solved! Thanks to Michael Butscher's comment, I made a shallow copy of req_params by renaming the argument req_params to req_params_arg and then adding req_params = req_params_arg.copy() in get_assets_api_for_range().
The variable was shared between the threads, so copying before use solved the problem.
EDIT: It seems that "python requests" doesn't like being called simultaneously by several threads, I activated debug mode and I can see that the requests sent to the API are sometimes equal (same range asked) which leads to the duplicates. Weird behavior... Do you think I need to use aiohttp or asyncio??
I'm struggling with the concurrent.futures module in order to fetch a large volume of data from an API.
The API I'm using limits results to 100 per page, so I'm calling it multiple times in order to get all the data.
To accelerate the process I tried to use multithreading (ThreadPoolExecutor) with a maximum of 10 threads.
It works fine and it's very quick but the results are different each time... Sometimes I get duplicates, sometimes not.
Something does not seem to be thread-safe, but I cannot figure out where... Maybe it is hidden in the functions that use pandas behind the scenes?
I tried to print the page ranges being fetched and they look correct (not in order, but that is normal):
Fetching 300-400
Fetching 700-800
Fetching 800-900
Fetching 500-600
Fetching 400-500
Fetching 0-100
Fetching 200-300
Fetching 100-200
Fetching 900-1000
Fetching 600-700
Fetching 1100-1159
Fetching 1000-1100
Another weird behavior is here: when I put the line req_sess = authenticate() after the line print("Fetching {}".format(req_params['range'])) in get_assets_api_for_range function, only one page (the last I believe) is fetched multiple times.
Thanks for your help!!
Here is my code (I removed some parts, it should be enough I think); the main function called is get_assets_from_api_in_df():
from functools import partial
import pandas as pd
import requests
import concurrent.futures as cfu
def get_assets_api_for_range(range_to_fetch, req_params):
    req_sess = authenticate()
    req_params.update({'range': range_to_fetch})
    print("Fetching {}".format(req_params['range']))
    r = req_sess.get(url=auth_config['endpoint_url'] + '/assets',
                     params=req_params)
    if r.status_code != 200:
        raise ConnectionError("API Get assets error: {}".format(r))
    json_response = r.json()
    # This function in return, processes the list into a dataframe
    return process_get_assets_from_api_in_df(json_response["asset_list"])
def get_assets_from_api_in_df() -> pd.DataFrame:
    GET_NB_MAX = 100
    # First, fetch 1 value to get nb to fetch
    req_sess = authenticate()
    r = req_sess.get(url=auth_config['endpoint_url'] + '/assets',
                     params={'range': '0-1'})
    if r.status_code != 200:
        raise ConnectionError("API Get assets error: {}".format(r))
    json_response = r.json()
    nb_to_fetch_total = json_response['total']
    print("Nb to fetch: {}".format(nb_to_fetch_total))
    # Building a queue of ranges to fetch
    ranges_to_fetch_queue = []
    for nb in range(0, nb_to_fetch_total, GET_NB_MAX):
        if nb + GET_NB_MAX < nb_to_fetch_total:
            range_str = str(nb) + '-' + str(nb + GET_NB_MAX)
        else:
            range_str = str(nb) + '-' + str(nb_to_fetch_total)
        ranges_to_fetch_queue.append(range_str)
    params = {
    }
    func_to_call = partial(get_assets_api_for_range,
                           req_params=params)
    with cfu.ThreadPoolExecutor(max_workers=10) as executor:
        result = list(executor.map(func_to_call, ranges_to_fetch_queue))
    print("Fetch finished, merging data...")
    return pd.concat(result, ignore_index=True)
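The fix described in EDIT2 can be sketched without the API as follows. This is a minimal illustration, not the original code: fetch_range, fetch_all, and the 'sort' key are hypothetical stand-ins, and the network call is replaced by returning the range the call would have requested.

```python
import concurrent.futures as cfu
from functools import partial


def fetch_range(range_to_fetch, req_params_arg):
    # Copy first, so each call works on its own dict instead of mutating
    # the single dict that partial() hands to every thread.
    req_params = req_params_arg.copy()
    req_params.update({'range': range_to_fetch})
    # Stand-in for the real request: report which range would be asked for.
    return req_params['range']


def fetch_all(ranges, base_params):
    func_to_call = partial(fetch_range, req_params_arg=base_params)
    with cfu.ThreadPoolExecutor(max_workers=10) as executor:
        # executor.map returns results in the order of the inputs.
        return list(executor.map(func_to_call, ranges))


ranges = ['{}-{}'.format(n, n + 100) for n in range(0, 1000, 100)]
shared = {'sort': 'id'}  # hypothetical extra parameter
fetched = fetch_all(ranges, shared)
```

Without the copy, one thread can overwrite 'range' between another thread's update() and its request, which matches the duplicated ranges seen in debug mode. With the copy, the shared dict is never modified.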
I'm using Django 1.8.1 with Python 3.4 and I'm trying to use requests to download a processed file. The following code works perfectly for a normal requests.get call that downloads the exact, unprocessed file at the server location.
The file needs to be processed based on the passed data (shown below as "data"). This data needs to be passed to the Django backend, which, based on the passed variables, should run an internal program on the server and output a .gcode file instead of an .stl file.
Python file:
import requests, os, json
SERVER='http://localhost:8000'
authuser = 'admin@google.com'
authpass = 'passwords'
#data not implemented
##############################################
data = {'FirstName': 'Steve', 'Lastname': 'Escovar'}
############################################
category = requests.get(SERVER + '/media/uploads/9128342/141303729.stl', auth=(authuser, authpass))
#download to path file
path = "/home/bradman/Downloads/requestdata/newfile.stl"
if category.status_code == 200:
    with open(path, 'wb') as f:
        for chunk in category:
            f.write(chunk)
I'm very confused about this, but I think the best course of action is to pass the data along with requests.get, and somehow write a function to grab it inside my views.py for Django. Does anyone have any ideas?
To send data with a request you can use
get( ... , params=data)
(the data is sent as parameters in the URL)
or
post( ... , data=data)
(the data is sent in the body, like an HTML form).
BTW, some APIs need both params= and data= in a single GET or POST request to send all the needed information.
Read the requests documentation.
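To see the difference between the two keywords without contacting a server, a request can be prepared and inspected before it is sent. The sketch below assumes the endpoint and payload from the question; neither is a real service.

```python
import requests

# Hypothetical endpoint and payload, reusing the names from the question.
url = 'http://localhost:8000/media/uploads/'
data = {'FirstName': 'Steve', 'Lastname': 'Escovar'}

# params= encodes the dict into the query string of the URL.
get_req = requests.Request('GET', url, params=data).prepare()
print(get_req.url)

# data= form-encodes the dict into the request body instead.
post_req = requests.Request('POST', url, data=data).prepare()
print(post_req.body)
print(post_req.headers['Content-Type'])
```

On the Django side, values sent this way are normally read in the view from request.GET (for params=) or request.POST (for form-encoded data=), for example request.GET.get('FirstName').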
I am retrieving data files from a FTP server in a loop with the following code:
response = urllib.request.urlopen(url)
data = response.read()
response.close()
compressed_file = io.BytesIO(data)
gin = gzip.GzipFile(fileobj=compressed_file)
Retrieving and processing the first few files works fine, but after a few requests I get the following error:
530 Maximum number of connections exceeded.
I tried closing the connection (see code above) and using a sleep() timer, but neither worked. What am I doing wrong here?
Trying to make urllib do FTP properly makes my brain hurt. By default, it creates a new connection for each file, apparently without really properly ensuring the connections close.
ftplib is more appropriate I think.
Since I happen to be working on the same data you are(were)... Here is a very specific answer decompressing the .gz files and passing them into ish_parser (https://github.com/haydenth/ish_parser).
I think it is also clear enough to serve as a general answer.
import ftplib
import io
import gzip
import ish_parser # from: https://github.com/haydenth/ish_parser
ftp_host = "ftp.ncdc.noaa.gov"
parser = ish_parser.ish_parser()
# identifies what data to get
USAF_ID = '722950'
WBAN_ID = '23174'
YEARS = range(1975, 1980)
with ftplib.FTP(host=ftp_host) as ftpconn:
    ftpconn.login()
    for year in YEARS:
        ftp_file = "pub/data/noaa/{YEAR}/{USAF}-{WBAN}-{YEAR}.gz".format(USAF=USAF_ID, WBAN=WBAN_ID, YEAR=year)
        print(ftp_file)
        # read the whole file and save it to a BytesIO (stream)
        response = io.BytesIO()
        try:
            ftpconn.retrbinary('RETR '+ftp_file, response.write)
        except ftplib.error_perm as err:
            if str(err).startswith('550 '):
                print('ERROR:', err)
            else:
                raise
        # decompress and parse each line
        response.seek(0) # jump back to the beginning of the stream
        with gzip.open(response, mode='rb') as gzstream:
            for line in gzstream:
                parser.loads(line.decode('latin-1'))
This does read the whole file into memory, which could probably be avoided using some clever wrappers and/or yield or something... but works fine for a year's worth of hourly weather observations.
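One possible way to avoid holding the download in memory is to spool the compressed bytes to a temporary file on disk and decompress from there line by line. The helper below is a sketch, not part of ftplib or gzip: iter_gz_lines and its fetch argument are hypothetical names, and fetch is assumed to be any callable that feeds the compressed bytes to a write callback.

```python
import gzip
import tempfile


def iter_gz_lines(fetch, encoding='latin-1'):
    # fetch receives a write callback and passes it the compressed bytes,
    # block by block, the same way retrbinary calls its callback.
    with tempfile.TemporaryFile() as tmp:
        fetch(tmp.write)
        tmp.seek(0)  # jump back to the beginning before decompressing
        with gzip.open(tmp, mode='rb') as gzstream:
            for line in gzstream:
                yield line.decode(encoding)


# Possible use inside the loop above (needs a live FTP connection):
# for line in iter_gz_lines(lambda cb: ftpconn.retrbinary('RETR ' + ftp_file, cb)):
#     parser.loads(line)
```

The compressed file still has to be downloaded completely before parsing starts, but only one decompressed line is held in memory at a time.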
Probably a pretty nasty workaround, but this worked for me. I made a script (here called test.py) which does the request (see code above). The code below is used in the loop I mentioned and calls test.py
from subprocess import call
with open('log.txt', 'a') as f:
    call(['python', 'test.py', args[0], args[1]], stdout=f)
I am trying to write a class in Python to open a specific URL given and return the data of that URL...
class Openurl:
    def download(self, url):
        req = urllib2.Request( url )
        content = urllib2.urlopen( req )
        data = content.read()
        content.close()
        return data
url = 'www.somesite.com'
dl = openurl()
data = dl.download(url)
Could someone correct my approach? I know one might ask why not just directly open it, but I want to show a message while it is being downloaded. The class will only have one instance.
You have a few problems.
One that I'm sure is not in your original code is the failure to import urllib2.
The second problem is that dl = openurl() should be dl = Openurl(). This is because Python is case sensitive.
The third problem is that your URL needs http:// before it. This gets rid of an unknown url type error. After that, you should be good to go!
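With the three problems listed above fixed, the class might look like the sketch below. It is written for Python 3, where urllib2 became urllib.request, so it is an adaptation rather than the asker's exact code. The printed message is an assumption about what "show a message while it is being downloaded" could mean.

```python
import urllib.request


class Openurl:
    def download(self, url):
        # Tell the user something is happening before the blocking call.
        print("Downloading {} ...".format(url))
        # The with block closes the connection even if read() fails.
        with urllib.request.urlopen(url) as content:
            return content.read()


# Usage (needs network access, so it is left commented out):
# dl = Openurl()                                 # same capitalisation as the class
# data = dl.download('http://www.somesite.com')  # scheme included
```

The returned data is a bytes object; call .decode() with the page's encoding if text is needed.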
It should be dl = Openurl(); Python is case-sensitive.
In Python, when given the URL for a text file, what is the simplest way to access the contents of the text file and print them out locally line-by-line without saving a local copy of the text file?
TargetURL=http://www.myhost.com/SomeFile.txt
#read the file
#print first line
#print second line
#etc
Edit 09/2016: In Python 3 and up use urllib.request instead of urllib2
Actually the simplest way is:
import urllib2 # the lib that handles the url stuff
data = urllib2.urlopen(target_url) # it's a file like object and works just like a file
for line in data: # files are iterable
    print line
You don't even need "readlines", as Will suggested. You could even shorten it to: *
import urllib2
for line in urllib2.urlopen(target_url):
    print line
But remember in Python, readability matters.
However, this is the simplest way but not the safest, because with network programming you usually cannot be sure that the amount of data you expect will be respected. It is generally better to read a fixed, reasonable amount of data: enough for what you expect, but bounded so that your script cannot be flooded:
import urllib2
data = urllib2.urlopen("http://www.google.com").read(20000) # read only 20 000 bytes
data = data.split("\n") # then split it into lines
for line in data:
    print line
* Second example in Python 3:
import urllib.request # the lib that handles the url stuff
for line in urllib.request.urlopen(target_url):
    print(line.decode('utf-8')) #utf-8 or iso8859-1 or whatever the page encoding scheme is
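The bounded read shown earlier in this answer can also be written for Python 3. This is a sketch under assumptions: read_lines_capped is a hypothetical helper name, and the 20000-byte limit and UTF-8 encoding are defaults to adjust for the page being fetched.

```python
from urllib.request import urlopen


def read_lines_capped(target_url, max_bytes=20000, encoding='utf-8'):
    # read(max_bytes) returns at most max_bytes bytes, however large the response is.
    with urlopen(target_url) as response:
        raw = response.read(max_bytes)
    # errors='replace' keeps a multi-byte character cut at the limit from raising.
    return raw.decode(encoding, errors='replace').splitlines()


# for line in read_lines_capped("http://www.myhost.com/SomeFile.txt"):
#     print(line)
```

If the limit is reached, the last line in the list may be incomplete, so the cap should be comfortably above the expected file size.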
I'm a newbie to Python and the offhand comment about Python 3 in the accepted solution was confusing. For posterity, the code to do this in Python 3 is
import urllib.request
data = urllib.request.urlopen(target_url)
for line in data:
    ...
or alternatively
from urllib.request import urlopen
data = urlopen(target_url)
Note that just import urllib does not work.
The requests library has a simpler interface and works with both Python 2 and 3.
import requests
response = requests.get(target_url)
data = response.text
There's really no need to read line-by-line. You can get the whole thing like this:
import urllib
txt = urllib.urlopen(target_url).read()
import urllib2
for line in urllib2.urlopen("http://www.myhost.com/SomeFile.txt"):
    print line
Another way in Python 3 is to use the urllib3 package.
import urllib3
http = urllib3.PoolManager()
response = http.request('GET', target_url)
data = response.data.decode('utf-8')
This can be a better option than urllib since urllib3 boasts having
Thread safety.
Connection pooling.
Client-side SSL/TLS verification.
File uploads with multipart encoding.
Helpers for retrying requests and dealing with HTTP redirects.
Support for gzip and deflate encoding.
Proxy support for HTTP and SOCKS.
100% test coverage.
import urllib2
f = urllib2.urlopen(target_url)
for l in f.readlines():
    print l
For me, none of the above responses worked straight away. Instead, I had to do the following (Python 3):
from urllib.request import urlopen
data = urlopen("[your url goes here]").read().decode('utf-8')
# Do what you need to do with the data.
The requests package works really well and has a simple interface, as @Andrew Mao suggested:
import requests
response = requests.get('http://lib.stat.cmu.edu/datasets/boston')
data = response.text
for i, line in enumerate(data.split('\n')):
    print(f'{i} {line}')
Output:
0 The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
1 prices and the demand for clean air', J. Environ. Economics & Management,
2 vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
3 ...', Wiley, 1980. N.B. Various transformations are used in the table on
4 pages 244-261 of the latter.
5
6 Variables in order:
Check out the Kaggle notebook on how to extract a dataset/dataframe from a URL.
I do think requests is the best option. Also note the possibility of setting encoding manually.
import requests
response = requests.get("http://www.gutenberg.org/files/10/10-0.txt")
# response.encoding = "utf-8"
hehe = response.text
Just updating here the solution suggested by @ken-kinder for Python 2 to work with Python 3:
import urllib.request
urllib.request.urlopen(target_url).read()
You can also use this simple approach, which saves the file locally:
import requests
filename = "SomeFile" # assumed name for the local copy
url_res = requests.get(url="http://www.myhost.com/SomeFile.txt")
with open(filename + ".txt", "wb") as file:
    file.write(url_res.content)