I am using a thread pool to upload files to S3. However, I suspect that sometimes one thread might fail, so I want to restart the thread that has failed. I would like some pointers on how I can achieve this:
from multiprocessing.pool import ThreadPool
import os
from boto.s3.connection import S3Connection

def uploadS3_helper(args):
    return uploadS3(*args)

def uploadS3(myfile, bucket_name, key_root, path_root, use_rel, define_region):
    if define_region:
        conn = S3Connection(aws_access_key_id=S3_ACCESS_KEY, aws_secret_access_key=S3_SECRET_KEY,
                            host='s3-us-west-2.amazonaws.com')
    else:
        conn = S3Connection(aws_access_key_id=S3_ACCESS_KEY, aws_secret_access_key=S3_SECRET_KEY)
    bucket = conn.get_bucket(bucket_name)
    print key_root + myfile
    print path_root
    print os.path.join(path_root, myfile)
    if use_rel:
        bucket.new_key(key_root + myfile).set_contents_from_file(open(os.path.join(path_root, myfile[1:])))
    else:
        bucket.new_key(key_root + myfile).set_contents_from_file(open(path_root))

pool = ThreadPool(processes=10)
pool.map(uploadS3_helper, args)  # args: list of argument tuples for uploadS3
To expand on @Martin James' comment: consider a "retry decorator". Add the linked code to your project, decorate the uploadS3 function with it, and it will try the upload multiple times, waiting longer after each failure.
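A minimal sketch of such a decorator, assuming exponential backoff (the tries/delay/backoff parameters and their values are illustrative, not the linked implementation):

import time
from functools import wraps

def retry(tries=4, delay=3, backoff=2):
    """Retry the wrapped callable, waiting longer after each failure."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for _ in range(tries - 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    time.sleep(wait)
                    wait *= backoff
            return func(*args, **kwargs)  # final attempt; let any exception propagate
        return wrapper
    return decorator

# then put @retry(tries=4, delay=3, backoff=2) directly above the existing
# def uploadS3(...) so each failed upload is retried with growing waits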
I am downloading a lot of files from a website and want them to run in parallel because they are heavy. Unfortunately I can't really share the website, because to access the files I need a username and password which I can't share. The code below is my code; I know it can't really be run without the website and my username and password, but I am 99% sure I am not allowed to share that information.
import os
from os import makedirs
from os.path import join
import requests
from multiprocessing import Process

dataset = "dataset_name"

################################
def down_file(dspath, file, savepath, ret):
    webfilename = dspath + file
    file_base = os.path.basename(file)
    file = join(savepath, file_base)
    print('...Downloading', file_base)

    req = requests.get(webfilename, cookies=ret.cookies, allow_redirects=True, stream=True)
    filesize = int(req.headers['Content-length'])

    with open(file, 'wb') as outfile:
        chunk_size = 1048576
        for chunk in req.iter_content(chunk_size=chunk_size):
            outfile.write(chunk)
    return None
################################
##Download files
def download_files(filelist, c_DateNow):
    ## Authenticate
    url = 'url'
    values = {'email': 'email', 'passwd': "password", 'action': 'login'}
    ret = requests.post(url, data=values)

    ## Path to files
    dspath = 'datasetwebpath'
    savepath = join(path_script, dataset, c_DateNow)
    makedirs(savepath, exist_ok=True)

    #"""
    processes = [Process(target=down_file, args=(dspath, file, savepath, ret)) for file in filelist]
    print(["dspath, %s, savepath, ret\n" % (file) for file in filelist])

    # kick them off
    for process in processes:
        print("\n", process)
        process.start()

    # now wait for them to finish
    for process in processes:
        process.join()
    #"""

    ####### This works and it's what I want to parallelize
    """
    ##Download files
    for file in filelist:
        down_file(dspath, file, savepath, ret)
    #"""
################################
def main(c_DateNow, c_DateIni, c_DateFin):
    ## Other code

    files = ["list of web file addresses"]
    print(" ...Files being downloaded\n ", "\n ".join(files), "\n")

    ## Download files
    download_files(files, c_DateNow)
I want to download 25 files. When I run the code, all the print lines that already appeared earlier in the run are printed again, even though the Process execution is nowhere near them. I am also constantly getting the following error:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
I googled the error and don't know how to fix it. Does it have to do with there not being enough cores? Is there a way to stop the Process depending on how many cores I have available? Or is it something else entirely?
In a question here, I read that the Process has to be within the __main__ block, but this code is a module that gets imported by another script, so when I run it, I run it as:
import this_code
import another1_code
import another2_code
#Step1
another1_code.main()
#Step2
c_DateNow, c_DateIni, c_DateFin = another2_code.main()
#Step3
this_code.main(c_DateNow, c_DateIni, c_DateFin)
#step4
## More code
So I need the process to be within a function and not in __main__
I appreciate any help or suggestions on how to correctly parallelize the above code in a way that allows me to use it as a module in another script.
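A note on the error quoted above: the if __name__ == '__main__': guard it asks for belongs in the top-level script that does the importing, not necessarily inside this module; the Process objects can still be created inside download_files(). A minimal sketch, reusing the step sequence from above and assuming the three imports shown make up the entry script:

# top-level entry script (the file you actually run)
import this_code
import another1_code
import another2_code

if __name__ == '__main__':  # required when child processes are started via spawn (e.g. on Windows)
    another1_code.main()
    c_DateNow, c_DateIni, c_DateFin = another2_code.main()
    this_code.main(c_DateNow, c_DateIni, c_DateFin)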
I'm not sure if I worded the question correctly, but, for example, I want to return the response without returning from the function.
My context here is: the user asks for a large Excel file to be generated, so a link will be returned to them, and when the Excel file is done an email will also be sent.
Pseudo example:
from flask import Flask
from flask import send_file
from someXlsLib import createXls
from someIoLib import deleteFile
from someMailLib import sendMail
import uuid

app = Flask(__name__)
host = 'https://myhost.com/myApi'

@app.route('/getXls')
def getXls():
    fileName = uuid.uuid4().hex + '.xls'
    downloadLink = host + '/tempfiles/' + fileName

    #Returning the downloadLink for the user to access when the xls file is ready
    return downloadLink

    #But then this code is unreachable
    generateXls(fileName, downloadLink)

def generateXls(fileName, downloadLink):
    createXls('/tempfiles/' + fileName)
    sendMail(downloadLink)

@app.route('/tempfiles/<fileName>')
def getTempFile(fileName):
    #Same problem here, I need the user to finish the download before deleting the file
    return send_file('/tempfiles/' + fileName, attachment_filename=fileName)
    deleteFile('/tempfiles/' + fileName)
Other commenters are right that you need to use something to manage asynchronous actions. One of the most popular options, and one that comes with lots of tools for delayed, scheduled, and asynchronous work, is Celery. You can do what you want using Celery with something like the following:
from celery import Celery
...
# This is for Redis on the local host. You can also use RabbitMQ or AWS SQS.
celery = Celery(app.name, broker='redis://localhost:6379/0')
celery.conf.update(app.config)
...

# Create your Celery task
@celery.task(bind=True)
def generateXls(self, fileName, downloadLink):
    createXls('/tempfiles/' + fileName)
    sendMail(downloadLink)

@app.route('/getXls')
def getXls():
    fileName = uuid.uuid4().hex + '.xls'
    downloadLink = host + '/tempfiles/' + fileName

    # Asynchronously call your Celery task.
    generateXls.delay(fileName, downloadLink)

    return downloadLink
This will return the download link immediately while generateXls continues in a separate Celery worker process.
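Note that this assumes the broker (Redis in the snippet) is running and that a Celery worker has been started alongside the Flask app, e.g. with something like celery -A yourmodule.celery worker, where yourmodule is whatever file creates the celery object; otherwise the queued task will just sit in the broker.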
I am using impala-shell to compute some stats over a text file containing the table names.
I am using the Python multiprocessing module to pool the processes.
The thing is, the task is very time-consuming, so I need to keep track of how many files have been completed to see the job's progress.
So let me give you some idea of the functions that I am using.
job_executor is the function that takes a list of tables and performs the task.
main() is the function that takes the file location and the number of executors (pool_workers), converts the file containing tables into a list of tables, and does the multiprocessing.
I want to see the progress, like how much of the file has been processed by job_executor, but I can't find a solution. Using a counter also doesn't work.
import os
import argparse
from multiprocessing import Pool
# impala_node, db_name, mail_message_start and mail_message_end are defined elsewhere

def job_executor(text):
    impala_cmd = "impala-shell -i %s -q 'compute stats %s.%s'" % (impala_node, db_name, text)
    impala_cmd_res = os.system(impala_cmd)  # runs impala command
    # checks for execution type (success or fail)
    if impala_cmd_res == 0:
        print("invalidated the metadata.")
    else:
        print("error while performing the operation.")

def main(args):
    text_file_path = args.text_file_path
    NUM_OF_EXECUTORS = int(args.pool_executors)
    with open(text_file_path, 'r') as text_file_reader:
        text_file_rows = text_file_reader.read().splitlines()  # this will return a list of all the tables in the file.
        process_pool = Pool(NUM_OF_EXECUTORS)
        try:
            process_pool.map(job_executor, text_file_rows)
            process_pool.close()
            process_pool.join()
        except Exception:
            process_pool.terminate()
            process_pool.join()

def parse_args():
    """
    function to take scraping arguments from test_hr.sh file
    """
    parser = argparse.ArgumentParser(description='Main Process file that will start the process and session too.')
    parser.add_argument("text_file_path",
                        help='provide text file path/location to be read. ')  # text file path
    parser.add_argument("pool_executors",
                        help='please provide pool executors as an initial argument')  # pool_executor path
    return parser.parse_args()  # returns a Namespace of all the arguments.

if __name__ == "__main__":
    mail_message_start()
    main(parse_args())
    mail_message_end()
If you insist on needlessly doing it via multiprocessing.pool.Pool(), the easiest way to keep track of what's going on is to use a non-blocking mapping (i.e. multiprocessing.pool.Pool.map_async()):
def main(args):
    text_file_path = args.text_file_path
    NUM_OF_EXECUTORS = int(args.pool_executors)
    with open(text_file_path, 'r') as text_file_reader:
        text_file_rows = text_file_reader.read().splitlines()
        total_processes = len(text_file_rows)  # keep the number of lines for reference
        process_pool = Pool(NUM_OF_EXECUTORS)
        try:
            print('Processing {} lines.'.format(total_processes))
            processing = process_pool.map_async(job_executor, text_file_rows)
            processes_left = total_processes  # number of processing lines left
            while not processing.ready():  # start a loop to wait for all to finish
                if processes_left != processing._number_left:
                    processes_left = processing._number_left
                    print('Processed {} out of {} lines...'.format(
                        total_processes - processes_left, total_processes))
                time.sleep(0.1)  # let it breathe a little, don't forget to `import time`
            print('All done!')
            process_pool.close()
            process_pool.join()
        except Exception:
            process_pool.terminate()
            process_pool.join()
This will check every 100 ms whether some of the processes have finished, and if something changed since the last check it will print the number of lines processed so far. If you need more insight into what's going on in your subprocesses, you can use shared structures like multiprocessing.Queue() or multiprocessing.Manager() to report directly from within your processes.
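A minimal sketch of that second idea, assuming the per-table work from job_executor is wrapped in a helper that also bumps a shared counter (the helper and argument names here are made up for illustration):

from multiprocessing import Pool, Manager

def job_executor_with_progress(packed):
    table, counter, lock, total = packed
    # ... run the impala-shell command for `table` here, exactly as job_executor does ...
    with lock:  # make the increment and the print atomic across workers
        counter.value += 1
        print('Processed {} out of {} tables'.format(counter.value, total))

def run_with_progress(text_file_rows, num_of_executors):
    manager = Manager()
    counter = manager.Value('i', 0)  # shared counter living in the manager process
    lock = manager.Lock()
    total = len(text_file_rows)
    packed_rows = [(row, counter, lock, total) for row in text_file_rows]
    pool = Pool(num_of_executors)
    pool.map(job_executor_with_progress, packed_rows)
    pool.close()
    pool.join()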
I have just learned the basics of Python, and I am trying to make a few projects so that I can increase my knowledge of the programming language.
Since I am rather paranoid, I created a script that uses PycURL to fetch my current IP address every x seconds, for VPN security. Here is my code [EDITED]:
import requests

enterIP = str(input("What is your current IP address?"))

def getIP():
    while True:
        try:
            result = requests.get("http://ipinfo.io/ip")
            print(result.text)
        except KeyboardInterrupt:
            print("\nProccess terminated by user")
        return result.text

def checkIP():
    while True:
        if enterIP == result.text:
            pass
        else:
            print("IP has changed!")

getIP()
checkIP()
Now I would like to expand the idea so that the script asks the user to enter their current IP, saves that address as a string, then uses a loop to keep running it against the PycURL function to make sure that their IP hasn't changed. The only problem is that I am completely stumped: I cannot come up with a function that would take the output of PycURL and compare it to the string. How could I achieve that?
As @holdenweb explained, you do not need pycurl for such a simple task, but nevertheless, here is a working example:
import pycurl
import time
from StringIO import StringIO

def get_ip():
    buffer = StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, "http://ipinfo.io/ip")
    c.setopt(c.WRITEDATA, buffer)
    c.perform()
    c.close()
    return buffer.getvalue()

def main():
    initial = get_ip()
    print 'Initial IP: %s' % initial
    try:
        while True:
            current = get_ip()
            if current != initial:
                print 'IP has changed to: %s' % current
            time.sleep(300)
    except KeyboardInterrupt:
        print("\nProccess terminated by user")

if __name__ == '__main__':
    main()
As you can see, I moved the logic of getting the IP into a separate function, get_ip, and added a few missing things, like capturing the buffer into a string and returning it. Otherwise it is pretty much the same as the first example in the pycurl quickstart.
The main function is called below, when the script is run directly (not imported).
First it calls get_ip to get the initial IP, and then runs the while loop, which checks whether the IP has changed and lets you know if so.
EDIT:
Since you changed your question, here is your new code in a working example:
import requests

def getIP():
    result = requests.get("http://ipinfo.io/ip")
    return result.text

def checkIP():
    initial = getIP()
    print("Initial IP: {}".format(initial))
    while True:
        current = getIP()
        if initial == current:
            pass
        else:
            print("IP has changed!")

checkIP()
As I mentioned in the comments above, you do not need two loops; one is enough. You don't even need two functions, but it is better to have them: one for getting the data and one for the loop. In the latter, first get the initial value and then run the loop, inside which you check whether the value has changed or not.
It seems, from reading the pycurl documentation, that you would find it easier to solve this problem using the requests library. Curl is more oriented toward file transfer, so the library expects you to provide a file-like object into which it writes the contents. This would greatly complicate your logic.
requests allows you to access the text of the server's response directly:
>>> import requests
>>> result = requests.get("http://ipinfo.io/ip")
>>> result.text
'151.231.192.8\n'
As @PeterWood suggested, a function would be more appropriate than a class for this, or, if the script is going to run continuously, just a simple loop as the body of the program.
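A minimal sketch of that loop-as-program shape, assuming the same ipinfo.io endpoint and an illustrative polling interval:

import time
import requests

def current_ip():
    # the response body is the IP plus a trailing newline, so strip it
    return requests.get("http://ipinfo.io/ip").text.strip()

if __name__ == "__main__":
    initial = current_ip()
    while True:
        if current_ip() != initial:
            print("IP has changed!")
        time.sleep(60)  # illustrative polling interval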
My question basically is: is there a best-practice approach to DB interaction, and am I doing something silly / wrong in the code below that is costing processing time?
My program pulls data from a website and writes it to a SQL database. Speed is very important and I want to be able to refresh the data as quickly as possible. I've tried a number of ways and I feel it's still way too slow, i.e. it could be much better with a better approach / design for the interaction with the DB, and I'm sure I'm making all sorts of mistakes. I can download the data to memory very quickly, but the writes to the DB take much, much longer.
The 3 main approaches I've tried are:
1. Threads that pull the data and populate a list of SQL commands; when the threads complete, run the SQL in the main thread
2. Threads that pull data and push to SQL (as per the code below)
3. Threads that pull data and populate a queue, with separate thread(s) polling the queue and pushing to the DB
Code as below:
import re
import MySQLdb as mydb

class DatabaseUtility():
    def __init__(self):
        """set db parameters"""

    def updateCommand(self, cmd):
        """run SQL commands and return number of matched rows"""
        try:
            self.cur.execute(cmd)
            return int(re.search('Rows matched: (\d+)', self.cur._info).group(1))
        except Exception, e:
            print ('runCmd error: ' + str(e))
            print ('With SQL: ' + cmd)
            return 0

    def addCommand(self, cmd):
        """write SQL command to db"""
        try:
            self.cur.execute(cmd)
            return self.cur.rowcount
        except Exception, e:
            print ('runCmd error: ' + str(e))
            print ('With SQL: ' + cmd)
            return 0
I've created a class that instantiates a db connection and is called as below:
from Queue import Queue
from threading import Thread
import urllib2
import json
from databasemanager import DatabaseUtility as dbU
from datalinks import getDataLink, allDataLinks

numThreads = 3
q = Queue()
dbu = dbU()

class OddScrape():
    def __init__(self, name, q):
        self.name = name
        self.getOddsData(self.name, q)

    def getOddsData(self, i, q):
        """Worker thread - parse each datalink and update / insert to db"""
        while True:
            #get datalink, create db connection
            self.dbu = dbU()
            matchData = q.get()

            #load data link using urllib2 and do a bunch of stuff
            #to parse the data to the required format

            #try to update in db and insert if not found
            sql = "sql to update %s" % (params)
            update = self.dbu.updateCommand(sql)
            if update < 1:
                sql = "sql to insert %s" % (params)
                self.dbu.addCommand(sql)

            q.task_done()
            self.dbu.dbConClose()
            print eventlink

def threadQ():
    #set up some threads
    for i in range(numThreads):
        worker = Thread(target=OddScrape, args=(i, q,))
        worker.start()

    #get urldata for all matches required and add to q
    matchids = dbu.runCommand("sql code to determine scope of urls")
    for match in matchids:
        sql = "sql code to get url data %s" % match
        q.put(dbu.runCommand(sql))

    q.join()
I've also added an index to the table I'm writing to, which seemed to help a tiny bit but not noticeably:
CREATE INDEX `idx_oddsdata_bookid_datalinkid`
ON `dbname`.`oddsdata` (bookid, datalinkid) COMMENT '' ALGORITHM DEFAULT LOCK DEFAULT;
Multiple threads imply multiple connections. Although getting a connection is "fast" in MySQL, it is not instantaneous. I do not know the relative speed of getting a connection versus running a query, but I doubt your multi-threaded idea will win.
Could you show us examples of the actual queries (SQL, not Python code) you need to run? We may have suggestions on combining queries, improved indexes, etc. Please provide SHOW CREATE TABLE, too. (You mentioned a CREATE INDEX, but it is useless out of context.)
It looks like you are doing a multi-step process that could be collapsed into INSERT ... ON DUPLICATE KEY UPDATE ....
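A minimal sketch of that collapsed form with MySQLdb, assuming a UNIQUE key on (bookid, datalinkid); the odds column and the Python variable names are illustrative, not from your schema:

# one round trip replaces the update-then-insert pair
sql = ("INSERT INTO oddsdata (bookid, datalinkid, odds) "
       "VALUES (%s, %s, %s) "
       "ON DUPLICATE KEY UPDATE odds = VALUES(odds)")
cur.execute(sql, (book_id, datalink_id, odds_value))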