I am downloading a lot of files from a website and want the downloads to run in parallel because the files are heavy. Unfortunately, I can't really share the website, because accessing the files requires a username and password that I can't share. The code below is mine; I know it can't really be run without the website and my credentials, but I am 99% sure I am not allowed to share that information.
import os
from os import makedirs          # used in download_files()
from os.path import join         # used in down_file() and download_files()
import requests
from multiprocessing import Process

dataset = "dataset_name"

################################
def down_file(dspath, file, savepath, ret):
    webfilename = dspath + file
    file_base = os.path.basename(file)
    file = join(savepath, file_base)
    print('...Downloading', file_base)
    req = requests.get(webfilename, cookies=ret.cookies, allow_redirects=True, stream=True)
    filesize = int(req.headers['Content-length'])
    with open(file, 'wb') as outfile:
        chunk_size = 1048576
        for chunk in req.iter_content(chunk_size=chunk_size):
            outfile.write(chunk)
    return None
################################
## Download files
def download_files(filelist, c_DateNow):
    ## Authenticate
    url = 'url'
    values = {'email': 'email', 'passwd': "password", 'action': 'login'}
    ret = requests.post(url, data=values)
    ## Path to files
    dspath = 'datasetwebpath'
    savepath = join(path_script, dataset, c_DateNow)
    makedirs(savepath, exist_ok=True)
    #"""
    processes = [Process(target=down_file, args=(dspath, file, savepath, ret)) for file in filelist]
    print(["dspath, %s, savepath, ret\n" % (file) for file in filelist])
    # kick them off
    for process in processes:
        print("\n", process)
        process.start()
    # now wait for them to finish
    for process in processes:
        process.join()
    #"""
    ####### This works and it's what I want to parallelize
    """
    ## Download files
    for file in filelist:
        down_file(dspath, file, savepath, ret)
    #"""
################################
def main(c_DateNow, c_DateIni, c_DateFin):
    ## Other code
    files = ["list of web file addresses"]
    print(" ...Files being downloaded\n  ", "\n  ".join(files), "\n")
    ## Download files
    download_files(files, c_DateNow)
I want to download 25 files. When I run the code, all the print statements that already appeared earlier in the program are printed again, even though the Process execution is nowhere near them. I am also getting the following error repeatedly:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
I googled the error and don't know how to fix it. Does it have to do with there not being enough cores? Is there a way to stop the Process depending on how many cores I have available? Or is it something else entirely?
In a question here, I read that the Process has to be started from within __main__, but this code is a module that gets imported by another script, so I run it like this:
import this_code
import another1_code
import another2_code
#Step1
another1_code.main()
#Step2
c_DateNow, c_DateIni, c_DateFin = another2_code.main()
#Step3
this_code.main(c_DateNow, c_DateIni, c_DateFin)
#step4
## More code
So I need the Process calls to be inside a function and not in __main__.
I appreciate any help or suggestions on how to correctly parallelize the above code in a way that allows me to use it as a module imported by another script.
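For what it's worth, the idiom the error message refers to usually only needs to live in the top-level script that is actually executed (the one doing the imports), not inside the imported module itself. A minimal sketch reusing the step structure above (the file name run_all.py is a placeholder):

# run_all.py - the script that is actually executed
import this_code
import another1_code
import another2_code

if __name__ == "__main__":
    # The guard stops this block from re-running when child processes
    # re-import the main module (the spawn start method used on Windows/macOS).
    another1_code.main()                                      # Step 1
    c_DateNow, c_DateIni, c_DateFin = another2_code.main()    # Step 2
    this_code.main(c_DateNow, c_DateIni, c_DateFin)           # Step 3
    ## More code                                              # Step 4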
Related
I am writing this post because I have not found a solution for my specific case. I refer to this article, which, however, did not work for me on Windows 10 version 1909.
I wrote a script, "python_code_a.py", that uploads, one at a time, all the images contained in a local folder to a converter server and then downloads them, again one at a time, from the server to another folder on my PC. The script depends on the server, which is public and not owned by me, so roughly every two and a half hours the script crashes due to an unexpected connection error. Obviously, I cannot sit all day watching the Python shell and intervene whenever the script stops.
Following the article above, I wrote a second file named "python_code_b.py", whose job is to restart "python_code_a.py" whenever the latter stops. However, when I try to run it from the "python.exe" command prompt, it only responds to the input with "...", nothing else.
I attach a general example of "python_code_a.py":
import requests
import urllib.request

processnumber = 0
photosindex = 100000
photo = 0
path = 0
while photosindex < "number of photos in folder":
    photo = str('your_path' + str(photosindex) + '.png')
    path = str('your_path' + str(photosindex) + '.jpg')
    print('It\'s converting: ' + photo)
    r = requests.post(
        "converter_site",
        files={
            'image': open(photo, 'rb'),
        },
        headers={'api-key': 'your_api_key'}
    )
    file = r.json()
    json_output = file['output_url']
    while photosindex < 'number of photos in folder':
        urllib.request.urlretrieve(json_output, path)
        print('Finished process number: ' + str(processnumber))
        break
    photosindex = photosindex + 1
    processnumber = processnumber + 1
print()
print('---------------------------------------------------')
print('Every pending job has been completed.')
print()
How can I solve it?
You can use error capturing:
while photosindex < "number of photos in folder":
    try:
        pass  # your code
    except:
        print("Something else went wrong")
https://www.w3schools.com/python/python_try_except.asp
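A slightly fuller sketch of the same idea, wrapping one photo's upload/download in a retry loop so that a connection error does not kill the whole run (the helper name convert_photo, the attempt count, and the wait time are my own assumptions, not part of the original script):

import time
import urllib.request
import requests

def convert_photo(photo, path):
    # hypothetical wrapper around the upload/download step from the script above
    r = requests.post("converter_site",
                      files={'image': open(photo, 'rb')},
                      headers={'api-key': 'your_api_key'})
    urllib.request.urlretrieve(r.json()['output_url'], path)

def convert_with_retry(photo, path, attempts=5, wait=60):
    for attempt in range(attempts):
        try:
            convert_photo(photo, path)
            return True                               # success, stop retrying
        except Exception as err:                      # connection errors, bad JSON, ...
            print('Attempt %d failed for %s: %s' % (attempt + 1, photo, err))
            time.sleep(wait)                          # give the server a moment
    return False                                      # give up on this photo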
I am using the Impala shell to compute some stats over a text file containing table names.
I am using the Python multiprocessing module to pool the processes.
The thing is, the task is very time consuming, so I need to keep track of how many files have been completed to see the job's progress.
So let me give you some idea of the functions that I am using.
job_executor is the function that takes a list of tables and performs the tasks.
main() is the function that takes the file location and the number of executors (pool_workers), converts the file containing tables into a list of tables, and does the multiprocessing part.
I want to see the progress, i.e. how many files have been processed by job_executor, but I can't find a solution. Using a counter also doesn't work.
import os
import argparse
from multiprocessing import Pool

def job_executor(text):
    # impala_node and db_name are defined elsewhere in the original script
    impala_cmd = "impala-shell -i %s -q 'compute stats %s.%s'" % (impala_node, db_name, text)
    impala_cmd_res = os.system(impala_cmd)  # runs the Impala command
    # checks the execution result (success or failure)
    if impala_cmd_res == 0:
        print("invalidated the metadata.")
    else:
        print("error while performing the operation.")

def main(args):
    text_file_path = args.text_file_path
    NUM_OF_EXECUTORS = int(args.pool_executors)
    with open(text_file_path, 'r') as text_file_reader:
        text_file_rows = text_file_reader.read().splitlines()  # list of all the tables in the file
    process_pool = Pool(NUM_OF_EXECUTORS)
    try:
        process_pool.map(job_executor, text_file_rows)
        process_pool.close()
        process_pool.join()
    except Exception:
        process_pool.terminate()
        process_pool.join()

def parse_args():
    """
    function to take scraping arguments from test_hr.sh file
    """
    parser = argparse.ArgumentParser(description='Main Process file that will start the process and session too.')
    parser.add_argument("text_file_path",
                        help='provide text file path/location to be read. ')  # text file path
    parser.add_argument("pool_executors",
                        help='please provide pool executors as an initial argument')  # pool_executor path
    return parser.parse_args()  # returns the parsed arguments

if __name__ == "__main__":
    mail_message_start()
    main(parse_args())
    mail_message_end()
If you insist on needlessly doing it via multiprocessing.pool.Pool(), the easiest way to keep a track of what's going on is to use a non-blocking mapping (i.e. multiprocessing.pool.Pool.map_async()):
def main(args):
    text_file_path = args.text_file_path
    NUM_OF_EXECUTORS = int(args.pool_executors)
    with open(text_file_path, 'r') as text_file_reader:
        text_file_rows = text_file_reader.read().splitlines()
    total_processes = len(text_file_rows)  # keep the number of lines for reference
    process_pool = Pool(NUM_OF_EXECUTORS)
    try:
        print('Processing {} lines.'.format(total_processes))
        processing = process_pool.map_async(job_executor, text_file_rows)
        processes_left = total_processes  # number of processing lines left
        while not processing.ready():  # start a loop to wait for all to finish
            if processes_left != processing._number_left:
                processes_left = processing._number_left
                print('Processed {} out of {} lines...'.format(
                    total_processes - processes_left, total_processes))
            time.sleep(0.1)  # let it breathe a little, don't forget to `import time`
        print('All done!')
        process_pool.close()
        process_pool.join()
    except Exception:
        process_pool.terminate()
        process_pool.join()
This checks every 100 ms whether any of the processes have finished, and if something has changed since the last check it prints the number of lines processed so far. If you need more insight into what's going on in your subprocesses, you can use shared structures like multiprocessing.Queue() or multiprocessing.Manager() to report directly from within your processes.
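As a rough sketch of that last suggestion, a shared counter created with multiprocessing.Manager() can report progress directly from the workers (the wrapper name and the use of a lock are my own assumptions, reusing job_executor from the question):

from multiprocessing import Pool, Manager

def job_executor_counted(packed):
    text, counter, lock = packed
    job_executor(text)                      # the original worker from the question
    with lock:                              # serialize updates to the shared counter
        counter.value += 1
        print('Finished {} tables so far'.format(counter.value))

def main_with_counter(text_file_rows, num_of_executors):
    with Manager() as manager:
        counter = manager.Value('i', 0)
        lock = manager.Lock()
        with Pool(num_of_executors) as pool:
            pool.map(job_executor_counted,
                     [(row, counter, lock) for row in text_file_rows])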
I want to put a list into my threading script, but I am facing a problem.
Contents of list file (example):
http://google.com
http://yahoo.com
http://bing.com
http://python.org
My script:
import codecs
import threading
import sys
import requests
from time import time as timer
from timeout import timeout
import time

try:
    with codecs.open(sys.argv[1], mode='r', encoding='ascii', errors='ignore') as iiz:
        iiz = iiz.read().splitlines()
except IOError:
    pass

oz = list(iiz)

def nnn(url):
    hzz = {'param1': sys.argv[2], 'param2': sys.argv[3]}
    po = requests.post(url, data=hzz)
    if po:
        print("ok \n")

if __name__ == '__main__':
    threads = []
    for i in range(1):
        t = threading.Thread(target=nnn, args=(oz,))
        threads.append(t)
        t.start()
Can you please elaborate on exactly what you're trying to achieve?
I'm guessing that you're trying to request urls to load into a web browser or the terminal...
Also, you shouldn't need to put the urls into a list, because when you read the file containing the urls, splitlines() already turns the contents into a list. In other words, the contents of iiz are already a list.
Personally, I haven't worked much with the modules you're using (apart from time), but I'll try my best to help you and hopefully other users will try and help you too.
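If the intent is simply to send one POST per url, a minimal adjustment of the loop from the question would start one thread per item of the list rather than passing the whole list to nnn (this is my reading of the goal, not a confirmed fix):

# assuming oz and nnn() from the question above
import threading

threads = []
for url in oz:                      # one thread per url, not one thread for the whole list
    t = threading.Thread(target=nnn, args=(url,))
    threads.append(t)
    t.start()

for t in threads:                   # wait for every request to finish
    t.join()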
I am retrieving data files from an FTP server in a loop with the following code:
response = urllib.request.urlopen(url)
data = response.read()
response.close()
compressed_file = io.BytesIO(data)
gin = gzip.GzipFile(fileobj=compressed_file)
Retrieving and processing the first few files works fine, but after a few requests I am getting the following error:
530 Maximum number of connections exceeded.
I tried closing the connection (see the code above) and using a sleep() timer, but neither worked. What am I doing wrong here?
Trying to make urllib do FTP properly makes my brain hurt. By default, it creates a new connection for each file, apparently without really properly ensuring the connections close.
ftplib is more appropriate I think.
Since I happen to be working on the same data you are(were)... Here is a very specific answer decompressing the .gz files and passing them into ish_parser (https://github.com/haydenth/ish_parser).
I think it is also clear enough to serve as a general answer.
import ftplib
import io
import gzip
import ish_parser # from: https://github.com/haydenth/ish_parser
ftp_host = "ftp.ncdc.noaa.gov"
parser = ish_parser.ish_parser()
# identifies what data to get
USAF_ID = '722950'
WBAN_ID = '23174'
YEARS = range(1975, 1980)
with ftplib.FTP(host=ftp_host) as ftpconn:
    ftpconn.login()
    for year in YEARS:
        ftp_file = "pub/data/noaa/{YEAR}/{USAF}-{WBAN}-{YEAR}.gz".format(USAF=USAF_ID, WBAN=WBAN_ID, YEAR=year)
        print(ftp_file)
        # read the whole file and save it to a BytesIO (stream)
        response = io.BytesIO()
        try:
            ftpconn.retrbinary('RETR ' + ftp_file, response.write)
        except ftplib.error_perm as err:
            if str(err).startswith('550 '):
                print('ERROR:', err)
            else:
                raise
        # decompress and parse each line
        response.seek(0)  # jump back to the beginning of the stream
        with gzip.open(response, mode='rb') as gzstream:
            for line in gzstream:
                parser.loads(line.decode('latin-1'))
This does read the whole file into memory, which could probably be avoided using some clever wrappers and/or yield or something... but works fine for a year's worth of hourly weather observations.
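If memory does become a concern, one simple variation (my own sketch, not part of the original answer) is to spool each download to a temporary file instead of a BytesIO and let gzip read from that:

import ftplib
import gzip
import tempfile

def fetch_gz_lines(ftpconn, ftp_file):
    # spool the compressed file to disk instead of holding it all in memory
    with tempfile.TemporaryFile() as spool:
        ftpconn.retrbinary('RETR ' + ftp_file, spool.write)
        spool.seek(0)                                  # rewind before decompressing
        with gzip.open(spool, mode='rb') as gzstream:
            for line in gzstream:
                yield line.decode('latin-1')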
Probably a pretty nasty workaround, but it worked for me. I made a script (here called test.py) that does the request (see the code above). The code below is used in the loop I mentioned and calls test.py:
from subprocess import call
with open('log.txt', 'a') as f:
    call(['python', 'test.py', args[0], args[1]], stdout=f)
I use the Python Requests library to download a big file, e.g.:
r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content
The big file downloads at +- 30 Kb per second, which is a bit slow. Every connection to the bigfile server is throttled, so I would like to make multiple connections.
Is there a way to make multiple connections at the same time to download one file?
You can use the HTTP Range header to fetch just part of the file (already covered for python here).
Just start several threads, fetch a different range with each, and you're done ;)
import threading
import urllib2

url = 'http://www.python.org/'    # file to download
chunk_size = 1 << 20              # bytes fetched per range request (must be defined)

def download(url, start):
    req = urllib2.Request(url)
    req.headers['Range'] = 'bytes=%s-%s' % (start, start + chunk_size - 1)
    f = urllib2.urlopen(req)
    parts[start] = f.read()

threads = []
parts = {}
# Initialize threads
for i in range(0, 10):
    t = threading.Thread(target=download, args=(url, i * chunk_size))
    t.start()
    threads.append(t)
# Join threads back (order doesn't matter, you just want them all)
for i in threads:
    i.join()
# Sort parts and you're done
result = ''.join(parts[i] for i in sorted(parts.keys()))
Also note that not every server supports the Range header (and servers where a PHP script is responsible for serving the data often don't implement handling of it).
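A quick way to check up front is to look at the Accept-Ranges header of a HEAD response; servers are not required to send it, so treat a missing header as "probably unsupported" (a small sketch using requests):

import requests

def supports_ranges(url):
    # 'bytes' means byte-range requests are accepted;
    # 'none' or a missing header means they probably are not
    head = requests.head(url, allow_redirects=True)
    return head.headers.get('Accept-Ranges', 'none').lower() == 'bytes'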
Here's a Python script that saves a given url to a file and uses multiple threads to download it:
#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool  # use threads
from urllib2 import HTTPError, Request, urlopen

def download_chunk(url, byterange):
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        return urlopen(req).read()
    except HTTPError as e:
        return b'' if e.code == 416 else None  # treat range error as EOF
    except EnvironmentError:
        return None

def main():
    url, filename = sys.argv[1:]
    pool = Pool(4)  # define number of concurrent connections
    chunksize = 1 << 16
    ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
    with open(filename, 'wb') as file:
        for s in pool.imap(partial(download_chunk, url), ranges):
            if not s:
                break  # error or EOF
            file.write(s)
            if len(s) != chunksize:
                break  # EOF (servers with no Range support end up here)

if __name__ == "__main__":
    main()
The end of file is detected when the server returns an empty body or a 416 HTTP status, or when the response size is not exactly chunksize.
It supports servers that don't understand the Range header (in that case everything is downloaded in a single request; to support large files, change download_chunk() to save to a temporary file and return the filename to be read in the main thread instead of the file content itself).
It allows you to change independently the number of concurrent connections (the pool size) and the number of bytes requested in a single HTTP request.
To use multiple processes instead of threads, change the import:
from multiprocessing.pool import Pool # use processes (other code unchanged)
This solution requires the linux utility named "aria2c", but it has the advantage of easily resuming downloads.
It also assumes that all the files you want to download are listed in the http directory list for location MY_HTTP_LOC. I tested this script on an instance of lighttpd/1.4.26 http server. But, you can easily modify this script so that it works for other setups.
#!/usr/bin/python
import os
import urllib
import re
import subprocess
MY_HTTP_LOC = "http://AAA.BBB.CCC.DDD/"
# retrieve webpage source code
f = urllib.urlopen(MY_HTTP_LOC)
page = f.read()
f.close()

# extract relevant URL segments from source code
rgxp = '(\<td\ class="n"\>\<a\ href=")([0-9a-zA-Z\(\)\-\_\.]+)(")'
results = re.findall(rgxp, str(page))
files = []
for match in results:
    files.append(match[1])

# download (using aria2c) files
for afile in files:
    if os.path.exists(afile) and not os.path.exists(afile + '.aria2'):
        print 'Skipping already-retrieved file: ' + afile
    else:
        print 'Downloading file: ' + afile
        subprocess.Popen(["aria2c", "-x", "16", "-s", "20", MY_HTTP_LOC + str(afile)]).wait()
You could use a module called pySmartDL for this. It uses multiple threads, can do a lot more, and shows a download progress bar by default.
For more info, check this answer.
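A minimal sketch of pySmartDL's basic usage (the url here is just the placeholder from the question, and the destination path is an assumption):

from pySmartDL import SmartDL

url = "http://bigfile.com/bigfile.bin"   # placeholder url from the question
dest = "."                               # directory (or full path) to save into

obj = SmartDL(url, dest)   # splits the download across several connections by default
obj.start()                # blocks until the download finishes, shows a progress bar
print(obj.get_dest())      # path of the downloaded file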