I have a list containing ID numbers, and I want each unique ID number to be used in an API call by its own process, with every process running the same functions and the same conditional statements. I have tried to make sense of it, but there is not a lot online about this procedure.
I thought about using a for loop, but I don't want every process running the loop and picking up every item in the list. I just need each item to be associated with one process.
I was thinking something like this:
from multiprocessing import process
import requests, json

ID_NUMBERS = ["ID 1", "ID 2", "ID 3".... ETC]
BASE_URL = "www.api.com"
KEY = {"KEY": "12345"}

a = 0
for x in ID_NUMBERS:
    def[a]():
        while Active_live_data == True:
            # continuously loops over, requesting data from the website
            unique_api_call = "{}/livedata[{}]".format(BASE_URL, x)
            request_it = requests.get(unique_api_call, headers=KEY)
            show_it = (json.loads(request_it.content))
            #some extra conditional code...
    a += 1

processes = []
b = 0
for _ in range(len(ID_NUMBERS)):
    p = multiprocessing.Process(target = b)
    p.start()
    processes.append(p)
    b += 1
Any help would be greatly appreciated!
Kindest regards,
Andrew
You can use the map function:
import multiprocessing as mp
num_cores = mp.cpu_count()
pool = mp.Pool(processes=num_cores)
results = pool.map(your_function, list_of_IDs)
This will execute the function your_function, each time with a different item from the list list_of_IDs, and the values returned by your_function will be stored in a list of values (results).
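For example, a rough sketch combining this with the request code from the question (fetch_live_data is a hypothetical worker name; BASE_URL and KEY are the placeholders from the question):

import multiprocessing as mp
import requests, json

BASE_URL = "www.api.com"
KEY = {"KEY": "12345"}

def fetch_live_data(id_number):
    # one API call per ID; each call runs in a worker process
    unique_api_call = "{}/livedata[{}]".format(BASE_URL, id_number)
    request_it = requests.get(unique_api_call, headers=KEY)
    return json.loads(request_it.content)

if __name__ == "__main__":
    ID_NUMBERS = ["ID 1", "ID 2", "ID 3"]
    with mp.Pool(processes=mp.cpu_count()) as pool:
        results = pool.map(fetch_live_data, ID_NUMBERS)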
Same approach as @AlessiaM but uses the high-level API in the concurrent.futures module.
import concurrent.futures as mp
import requests, json
BASE_URL = ''
KEY = {"KEY": "12345"}
ID_NUMBERS = ["ID 1", "ID 2", "ID 3"]
def job(id):
    unique_api_call = "{}/livedata[{}]".format(BASE_URL, id)
    request_it = requests.get(unique_api_call, headers=KEY)
    show_it = json.loads(request_it.content)
    return show_it

# Defaults to as many workers as there are processors,
# but since your job is IO-bound (vs CPU-bound)
# you could increase this by passing the `max_workers` parameter.
with mp.ProcessPoolExecutor() as pool:
    results = pool.map(job, ID_NUMBERS)

# Process results here
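If the calls really are dominated by waiting on the network, a ThreadPoolExecutor from the same module is a lighter-weight option and lets you raise max_workers cheaply. A minimal sketch reusing the job function above (32 is just an arbitrary example value):

with mp.ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(job, ID_NUMBERS))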
I wrote some code that uses OCR to extract text from screenshots of follower lists and then transfers it into a data frame.
The reason I have to go through the hassle with "name" / "display name" and removing blank lines is that the initial text extraction looks something like this:
Screenname 1
name 1
Screenname 2
name 2
(and so on)
So I know in which order each extraction will be.
My code works well for 1-30 images, but if I take more than that it gets a bit slow. My goal is to run around 5-10k screenshots through it at once. I'm pretty new to programming, so any ideas/tips on how to optimize the speed would be very appreciated! Thank you all in advance :)
from PIL import Image
from pytesseract import pytesseract
import os
import pandas as pd
from itertools import chain
list_final = [""]
list_name = [""]
liste_anzeigename = [""]
list_raw = [""]
anzeigename = [""]
name = [""]
sort = [""]
f = r'/Users/PycharmProjects/pythonProject/images'
myconfig = r"--psm 4 --oem 3"
os.listdir(f)
for file in os.listdir(f):
    f_img = f + "/" + file
    img = Image.open(f_img)
    img = img.crop((240, 400, 800, 2400))
    img.save(f_img)

for file in os.listdir(f):
    f_img = f + "/" + file
    test = pytesseract.image_to_string(Image.open(f_img), config=myconfig)
    lines = test.split("\n")
    list_raw = [line for line in lines if line.strip() != ""]
    sort.append(list_raw)
    name = {list_raw[0], list_raw[2], list_raw[4],
            list_raw[6], list_raw[8], list_raw[10],
            list_raw[12], list_raw[14], list_raw[16]}
    list_name.append(name)
    anzeigename = {list_raw[1], list_raw[3], list_raw[5],
                   list_raw[7], list_raw[9], list_raw[11],
                   list_raw[13], list_raw[15], list_raw[17]}
    liste_anzeigename.append(anzeigename)
reihenfolge_name = list(chain.from_iterable(list_name))
index_anzeigename = list(chain.from_iterable(liste_anzeigename))
sortieren = list(chain.from_iterable(sort))
print(list_raw)
sort_name = sorted(reihenfolge_name, key=sortieren.index)
sort_anzeigename = sorted(index_anzeigename, key=sortieren.index)
final = pd.DataFrame(zip(sort_name, sort_anzeigename), columns=['name', 'anzeigename'])
print(final)
Use a multiprocessing.Pool.
Combine the code under the for-loops, and put it into a function process_file.
This function should accept a single argument; the name of a file to process.
Next using listdir, create a list of files to process.
Then create a Pool and use its map method to process the list:
import multiprocessing as mp
def process_file(name):
    # your code goes here.
    return anzeigename  # Or whatever the result should be.

if __name__ == "__main__":
    f = r'/Users/PycharmProjects/pythonProject/images'
    p = mp.Pool()
    liste_anzeigename = p.map(process_file, os.listdir(f))
This will run your code in parallel on as many cores as your CPU has.
For an N-core CPU this will take approximately 1/N times the time of doing it without multiprocessing.
Note that the return value of the worker function should be pickleable; it has to be returned from the worker process to the parent process.
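For example, the two loops from the question could be folded into the worker roughly like this (an untested sketch; the path, crop box and Tesseract config are copied from the question, and the worker returns the cleaned-up lines for one image):

import os
import multiprocessing as mp
from PIL import Image
from pytesseract import pytesseract

f = r'/Users/PycharmProjects/pythonProject/images'
myconfig = r"--psm 4 --oem 3"

def process_file(file):
    f_img = f + "/" + file
    img = Image.open(f_img)
    img = img.crop((240, 400, 800, 2400))
    img.save(f_img)
    text = pytesseract.image_to_string(Image.open(f_img), config=myconfig)
    # keep only non-empty lines, as in the question
    return [line for line in text.split("\n") if line.strip() != ""]

if __name__ == "__main__":
    with mp.Pool() as p:
        results = p.map(process_file, os.listdir(f))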
I've heard that Python multi-threading is a bit tricky, and I am not sure of the best way to go about implementing what I need. Let's say I have a function called IO_intensive_function that does some API call which may take a while to get a response.
Say the process of queuing jobs can look something like this:
import thread
for job_args in jobs:
    thread.start_new_thread(IO_intense_function, (job_args))
Would the IO_intense_function now just execute its task in the background and allow me to queue in more jobs?
I also looked at this question, which seems like the approach is to just do the following:
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(2)
results = pool.map(IO_intensive_function, jobs)
As I don't need those tasks to communicate with each other, the only goal is to send my API requests as fast as possible. Is this the most efficient way? Thanks.
Edit:
The way I am making the API request is through a Thrift service.
I had to create code to do something similar recently. I've tried to make it generic below. Note I'm a novice coder, so please forgive the inelegance. What you may find valuable, however, is some of the error processing I found it necessary to embed to capture disconnects, etc.
I also found it valuable to perform the json processing in a threaded manner. You have the threads working for you, so why go "serial" again for a processing step when you can extract the info in parallel.
It is possible I have mis-coded something in making it generic. Please don't hesitate to ask follow-ups and I will clarify.
import time
import requests
from multiprocessing.dummy import Pool as ThreadPool
from src_code.config import Config

with open(Config.API_PATH + '/api_security_key.pem') as f:
    my_key = f.read().rstrip("\n")
base_url = "https://api.my_api_destination.com/v1"
headers = {"Authorization": "Bearer %s" % my_key}
itm = list()
itm.append(base_url)
itm.append(headers)
def call_API(call_var):
    base_url = call_var[0]
    headers = call_var[1]
    call_specific_tag = call_var[2]
    endpoint = f'/api_path/{call_specific_tag}'

    connection_tries = 0
    for i in range(3):
        try:
            dat = requests.get((base_url + endpoint), headers=headers).json()
        except:
            connection_tries += 1
            print(f'Call for {call_specific_tag} failed after {i} attempt(s). Pausing for 240 seconds.')
            time.sleep(240)
        else:
            break

    tag = list()
    vars_to_capture_01 = list()
    vars_to_capture_02 = list()
    connection_tries = 0

    try:
        if 'record_id' in dat:
            vars_to_capture_01.append(dat['record_id'])
            vars_to_capture_02.append(dat['second_item_of_interest'])
        else:
            vars_to_capture_01.append(call_specific_tag)
            print(f'Call specific tag {call_specific_tag} is unavailable. Successful pull.')
            vars_to_capture_02.append(-1)
    except:
        print(f'{call_specific_tag} is unavailable. Unsuccessful pull.')
        vars_to_capture_01.append(call_specific_tag)
        vars_to_capture_02.append(-1)
        time.sleep(240)

    pack = list()
    pack.append(vars_to_capture_01)
    pack.append(vars_to_capture_02)
    return pack
vars_to_capture_01 = list()
vars_to_capture_02 = list()

i = 0
max_i = len(all_tags)
while i < max_i:
    ind_rng = range(i, min((i + 10), (max_i)), 1)
    itm_lst = (itm.copy())
    call_var = [itm_lst + [all_tags[q]] for q in ind_rng]
    #packed = call_API(call_var[0]) # for testing of function without pooling
    pool = ThreadPool(len(call_var))
    packed = pool.map(call_API, call_var)
    pool.close()
    pool.join()

    for pack in packed:
        try:
            vars_to_capture_01.append(pack[0][0])
        except:
            print(f'Unpacking error for {all_tags[i]}.')
        vars_to_capture_02.append(pack[1][0])

    i += 10  # advance to the next batch of (up to) ten tags
For network API requests you can use asyncio. Have a look at https://realpython.com/python-concurrency/#asyncio-version for an example of how to implement it.
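For illustration, a minimal sketch of that idea using asyncio together with the third-party aiohttp package (the URLs here are made-up placeholders):

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.json()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # fire all requests concurrently and gather the responses
        return await asyncio.gather(*(fetch(session, u) for u in urls))

urls = ["https://api.example.com/livedata/1", "https://api.example.com/livedata/2"]
results = asyncio.run(main(urls))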
I store QueryText within a pandas dataframe. Once I've loaded all the queries in, I want to conduct an analysis against each query. Currently, I have ~50k to evaluate, so doing them one by one will take a long time.
So, I wanted to implement concurrent.futures. How do I take the individual QueryText stored within fullAnalysis, pass it to concurrent.futures, and return the output as a variable?
Here is my entire code:
import pandas as pd
import time
import gensim
import sys
import warnings
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
fullAnalysis = pd.DataFrame()
def fetch_data(jFile = 'ProcessingDetails.json'):
    print("Fetching data...please wait")

    #read JSON file for latest dictionary file name
    baselineDictionaryFileName = 'Dictionary/Dictionary_05-03-2020.json'

    #copy data to pandas dataframe
    labelled_data = pd.read_json(baselineDictionaryFileName)

    #Add two more columns to get the most similar text and score
    labelled_data['SimilarText'] = ''
    labelled_data['SimilarityScore'] = float()

    print("Data fetched from " + baselineDictionaryFileName + " and there are " + str(labelled_data.shape[0]) + " rows to be evaluated")

    return labelled_data

def calculateScore(inputFunc):
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    model = gensim.models.Word2Vec.load('w2v_model_bigdata')

    inp = inputFunc
    print(inp)

    out = dict()

    strEvaluation = inp.split("most_similar ",1)[1]

    #while inp != 'quit':
    split_inp = inp.split()
    try:
        if split_inp[0] == 'help':
            pass
        elif split_inp[0] == 'similarity' and len(split_inp) >= 3:
            pass
        elif split_inp[0] == 'most_similar' and len(split_inp) >= 2:
            for pair in model.most_similar(positive=[split_inp[1]]):
                out.update({pair[0]: pair[1]})
    except KeyError as ke:
        #print(str(ke) + "\n")
        inp = input()

    return out

def main():
    with ThreadPoolExecutor(max_workers=5) as executor:
        for i in range(len(fullAnalysis)):
            text = fullAnalysis['QueryText'][i]
            arg = 'most_similar'+ ' ' + text
            #for item in executor.map(calculateScore, arg):
            output = executor.map(calculateScore, arg)
    return output
if __name__ == "__main__":
    fullAnalysis = fetch_data()
    results = main()
    print(f'results: {results}')
The Python Global Interpreter Lock or GIL allows only one thread to hold control of the Python interpreter. Since your function calculateScore might be cpu-bound and requires the interpreter to execute its byte code, you may be gaining little by using threading. If, on the other hand, it were doing mostly I/O operations, it would be giving up the GIL for most of its running time allowing other threads to run. But that does not seem to be the case here. You probably should be using the ProcessPoolExecutor from concurrent.futures (try it both ways and see):
def main():
    with ProcessPoolExecutor(max_workers=None) as executor:
        the_futures = {}
        for i in range(len(fullAnalysis)):
            text = fullAnalysis['QueryText'][i]
            arg = 'most_similar'+ ' ' + text
            future = executor.submit(calculateScore, arg)
            the_futures[future] = i # map future to request
        for future in as_completed(the_futures): # results as they become available, not necessarily in the order of submission
            i = the_futures[future] # the original index
            result = future.result() # the result
If you omit the max_workers parameter (or specify a value of None) from the ProcessPoolExecutor constructor, the default will be the number of processors you have on your machine (not a bad default). There is no point in specifying a value larger than the number of processors you have.
If you do not need to tie the future back to the original request, then the_futures can just be a list to which the futures are appended. But simplest yet is not even to bother with the as_completed method:
def main():
    with ProcessPoolExecutor(max_workers=5) as executor:
        the_futures = []
        for i in range(len(fullAnalysis)):
            text = fullAnalysis['QueryText'][i]
            arg = 'most_similar'+ ' ' + text
            future = executor.submit(calculateScore, arg)
            the_futures.append(future)
        # wait for the completion of all the results and return them all:
        results = [f.result() for f in the_futures] # results in creation order
        return results
It should be mentioned that code that launches the ProcessPoolExecutor functions should be in a block governed by if __name__ == '__main__':. If it isn't, you will get into a recursive loop with each subprocess launching the ProcessPoolExecutor. But your code does appear to have that guard already. Perhaps you meant to use the ProcessPoolExecutor all along?
Also:
I don't know what the line ...
model = gensim.models.Word2Vec.load('w2v_model_bigdata')
... in function calculateScore does. It may be the one I/O-bound statement. But this appears to be something that does not vary from call to call. If that is the case and model is not being modified in the function, shouldn't this statement be moved out of the function and computed just once? Then this function would clearly run faster (and be clearly CPU-bound).
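If you switch to ProcessPoolExecutor, one way to load the model only once per worker process is the executor's initializer argument (available since Python 3.7). A rough sketch, assuming the model is only read and never modified:

from concurrent.futures import ProcessPoolExecutor
import gensim

model = None

def init_worker():
    # runs once in each worker process, so the model is loaded once per worker, not once per task
    global model
    model = gensim.models.Word2Vec.load('w2v_model_bigdata')

def calculateScore(arg):
    # use the module-level `model` here instead of reloading it
    return {}

if __name__ == "__main__":
    with ProcessPoolExecutor(initializer=init_worker) as executor:
        results = list(executor.map(calculateScore, ['most_similar word1', 'most_similar word2']))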
Also:
The exception block ...
except KeyError as ke:
    #print(str(ke) + "\n")
    inp = input()
... is puzzling. You are inputting a value that will never be used right before returning. If this is to pause execution, there is no error message being output.
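If the intent is simply to skip words the model does not know, one possibility (a sketch, not the original code, and assuming model is available as discussed above) is to log the unknown word and return the empty dictionary:

def calculateScore(inputFunc):
    out = dict()
    split_inp = inputFunc.split()
    try:
        if split_inp[0] == 'most_similar' and len(split_inp) >= 2:
            for pair in model.most_similar(positive=[split_inp[1]]):
                out[pair[0]] = pair[1]
    except KeyError as ke:
        # report the unknown word instead of silently waiting on input()
        print('Skipping {!r}: {}'.format(split_inp[1], ke))
    return out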
With Booboo's assistance, I was able to update the code to include ProcessPoolExecutor. Here is my updated code. Overall, processing has been sped up by more than 60%.
I did run into a processing issue and found this topic on BrokenProcessPool that addresses it.
output = {}
thePool = {}

def main(labelled_data, dictionaryRevised):
    args = sys.argv[1:]

    with ProcessPoolExecutor(max_workers=None) as executor:
        for i in range(len(labelled_data)):
            text = labelled_data['QueryText'][i]
            arg = 'most_similar'+ ' '+ text
            output = winprocess.submit(
                executor, calculateScore, arg
            )
            thePool[output] = i #original index for future to request

        for output in as_completed(thePool): # results as they become available, not necessarily in the order of submission
            i = thePool[output] # the original index
            text = labelled_data['QueryText'][i]
            result = output.result() # the result

            maximumKey = max(result.items(), key=operator.itemgetter(1))[0]
            maximumValue = result.get(maximumKey)

            labelled_data['SimilarText'][i] = maximumKey
            labelled_data['SimilarityScore'][i] = maximumValue

    return labelled_data, dictionaryRevised

if __name__ == "__main__":
    start = time.perf_counter()
    print("Starting to evaluate Query Text for labelling...")

    output_Labelled_Data, output_dictionary_revised = preProcessor()
    output, dictionary = main(output_Labelled_Data, output_dictionary_revised)

    finish = time.perf_counter()
    print(f'Finished in {round(finish-start, 2)} second(s)')
The program is supposed to use the translate-python API to translate ENGLISH_DICT into 60+ languages (ENGLISH_DICT has been shortened a lot, and so has LANG_CODES). Translating a huge dictionary into 60+ languages takes close to 2 hours with synchronous code, which is why I wanted to use threads.
My thread pool is supposed to be size 4, but I sometimes get 10 threads running without the previous threads completing (I found this out by putting a print statement on the first and last line of the thread handler). Also, the pool will run multiple threads, but as soon as a few threads complete, the entire program terminates and I get a 0 exit code. Lastly, if my max pool size is 10 and fewer than 10 threads join, the program terminates immediately.
Screenshots of the runs showed:
More than 4 threads running without previous threads completing
Only 8 threads finished running out of 65 that were scheduled to run
9 threads were created even though the max thread pool size is 10; the threads started to run but the main program exited with a 0 exit code
import copy
import os
import json
import threading
from multiprocessing.dummy import Pool
from queue import Queue
from translate import Translator
LANG_CODES = {"af", "ar", "bn", "bs", "bg", "yue", "ca", "fi", "fr"}
VERIFIED_LANGUAGES = {'en', 'es', 'zh'}
TOTAL_TRANSLATIONS = len(LANG_CODES) - len(VERIFIED_LANGUAGES)
NUM_OF_THREADS = 100
DIR_NAME = 'translations'
#Iterate through nested dictionaries and translate string values
#Then prints the final dictionary as JSON
def translate(english_words: dict, dest_lang: str) -> str:
    stack = []
    cache = {}
    T = Translator(provider='microsoft', from_lang='en', to_lang=dest_lang, secret_access_key=API_SECRET1)
    translated_words = copy.deepcopy(english_words)

    ##Populate dictionary with top-level keys or translate top-level words
    for key in translated_words.keys():
        value = translated_words[key]
        if type(value) == dict:
            stack.append(value)
        else:
            if value in cache:
                translated_words[key] = cache[key]
            else:
                translation = T.translate(value)
                translated_words[key] = translation
                cache[translation] = translation

    while len(stack):
        dic = stack.pop()
        for key in dic.keys():
            value = dic[key]
            if type(value) == dict:
                stack.append(value)
            else:
                if value in cache:
                    dic[key] = cache[value]
                else:
                    # print('Translating "' + value +'" for', dest_lang)
                    translation = T.translate(value)
                    # print('Done translating "' + value +'" for', dest_lang)
                    # print('Translated', value, '->', translation)
                    cache[translation] = translation
                    dic[key] = translation

    return json.dumps(translated_words, indent=4)
##GENERATES A FOLDER CALLED 'translations' WITH LOCALE JSON FILES IN THE WORKING DIRECTORY THE SCRIPT IS LAUNCHED IN WITH MULTIPLE THREADS WORKING ON DIFFERENT LANGUAGES
def generate_translations(english_dict: dict):
    if not os.path.exists(DIR_NAME):
        os.mkdir(DIR_NAME)

    finished_langs = set(map(lambda file_name: file_name.split('.json')[0], os.listdir(DIR_NAME)))
    LANG_CODES.difference_update(finished_langs)

    pool = Pool(NUM_OF_THREADS)
    thread_params = [(english_dict, lang_code) for lang_code in sorted(LANG_CODES) if not lang_code.split('-')[0] in VERIFIED_LANGUAGES]
    pool.map_async(thread_handler, thread_params)
    pool.close()
    pool.join()
    print('DONE GENERATING')

##TRANSLATES AN ENTIRE DICTIONARY AND THEN WRITES IT TO A FILE IN THE TRANSLATION FOLDER
def thread_handler(params: tuple):
    english_dict, lang_code = params
    print('Translating for lang_code: ', lang_code)

    translated_string_json = translate(english_dict, lang_code)
    print('done getting string for', lang_code)

    file = open(DIR_NAME + '/' + lang_code + '.json', 'w')
    file.write(translated_string_json)
    file.close()

    num_of_langs_remaining = TOTAL_TRANSLATIONS - len(os.listdir(DIR_NAME))
    print('Done translating for lang_code: ' + lang_code +'.', num_of_langs_remaining, 'remaining.\n\n')
ENGLISH_DICT = {
    "changePassword": {
        "yourCurrentPassword": "Your current password",
        "newPassword": "New password",
        "reenterNewPassword": "Re-enter new password",
        "changePassword": "Change Password",
        "yourProfile": "Your Profile",
        "emptyFieldAlert": {
            "header": "All fields must not be empty",
            "body": "Please fill in all the fields"
        }
    }
}

if __name__ == '__main__':
    generate_translations(ENGLISH_DICT)
Threads in Python are not equal to threads in, say, Java.
They don't actually use multiple CPU cores to execute your code in parallel; multiprocessing is used for that instead.
multiprocessing.dummy just exposes the API of multiprocessing but is actually a wrapper around threading.
You should use
from multiprocessing import Pool
instead for actual parallelization and better performance.
You can count the number of workers with
print(len(active_children()))
If you don't use the AsyncResult or a callback then you should just use
map(thread_handler, thread_params)
instead of
map_async(thread_handler, thread_params)
because it runs in parallel anyway and blocks until the given tasks are complete.
The context manager protocol also works with Pool:
with Pool(NUM_OF_THREADS) as pool:
    thread_params = [(english_dict, lang_code) for lang_code in sorted(LANG_CODES) if not lang_code.split('-')[0] in VERIFIED_LANGUAGES]
    pool.map(thread_handler, thread_params)
print('DONE GENERATING')
After some searching on Google and reading posts on Stack Overflow and other sites, I'm still confused about how I can apply a queue and threading to my code:
import psycopg2
import sys
import re
# for threading and queue
import multiprocessing
from multiprocessing import Queue
# for threading and queue
import time
from datetime import datetime
class Database_connection():
    def db_call(self,query,dbHost,dbName,dbUser,dbPass):
        try:
            con = None
            con = psycopg2.connect(host=dbHost,database=dbName,
                user=dbUser,password=dbPass)
            cur = con.cursor()
            cur.execute(query)
            data = cur.fetchall()
            resultList = []
            for data_out in data:
                resultList.append(data_out)
            return resultList
        except psycopg2.DatabaseError, e:
            print 'Error %s' % e
            sys.exit(1)
        finally:
            if con:
                con.close()
w = Database_connection()
sql = "select stars from galaxy"
startTime = datetime.now()
for result in w.db_call(sql, "x", "x", "x", "x"):
    print result[0]
print "Runtime: " + str(datetime.now()-startTime)
Let's suppose the result will be 100+ values. How can I put those 100+ results on a queue and process them (print, for example) 5 at a time using the queue and multiprocessing modules?
What do you want this code to do?
You get no output from this code because get() returns and removes the next item from the queue (doc). You are putting the letters from the SQL response into the queue one letter at a time: the i in for i... is looping over the list returned by w.db_call, and those items are (I assume) strings, which you are then iterating over and adding one character at a time to the queue. The next thing you do is remove the element you just added, which leaves the queue unchanged over each pass through the loop. If you put a print statement in the loop, it prints the letter it just got from the queue.
Queues are used to pass information between processes. I think you are trying to set up a producer/consumer pattern, where one process adds things to the queue and multiple other processes consume things from it. See this working example of multiprocessing.Queue and the links contained therein (example, main documentation).
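For illustration, a minimal producer/consumer sketch with multiprocessing.Queue (the names and the fake rows are made up; each of the five workers pulls items until it sees a sentinel):

from multiprocessing import Process, Queue

def consumer(q):
    while True:
        item = q.get()
        if item is None:          # sentinel: no more work
            break
        print('processing ' + str(item))

if __name__ == '__main__':
    q = Queue()
    workers = [Process(target=consumer, args=(q,)) for _ in range(5)]
    for proc in workers:
        proc.start()
    for row in ['star1', 'star2', 'star3']:   # e.g. rows from the SQL call
        q.put(row)
    for _ in workers:
        q.put(None)               # one sentinel per worker
    for proc in workers:
        proc.join()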
Probably the simplest way to get this working, as long as you don't need it to run in an interactive shell, is to use Pool (lifted almost verbatim from the multiprocessing documentation):
from multiprocessing import Pool

p = Pool(5) # sets the number of worker processes you want

def f(res):
    # put whatever you want to do with each of the query results in here
    return res

result_lst = w.db_call(sql, "x", "x", "x", "x")
proced_results = p.map(f, result_lst)
This applies whatever you want to do to each result (written into the function f) and returns the results of that manipulation as a list. The number of sub-processes to use is set by the argument to Pool.
This is my suggestion...
import Queue
from threading import Thread

class Database_connection:
    def db_call(self,query,dbHost,dbName,dbUser,dbPass):
        # your code here
        return

# in this example each thread will execute this function
def processFtpAddrMt(queue):
    # loop will continue until the queue containing FTP addresses is empty
    while True:
        # get an ftp address; an exception will be raised when the
        # queue is empty and the loop will break
        try: ftp_addr = queue.get()
        except: break

        # put code to process the ftp address here

        # let the queue know this task is done
        queue.task_done()

w = Database_connection()
sql = "select stars from galaxy"
ftp_addresses = w.db_call(sql, "x", "x", "x", "x")

# put each result of the SQL call in a Queue class
ftp_addr_queue = Queue.Queue()
for addr in ftp_addresses:
    ftp_addr_queue.put(addr)

# create five threads where each one will run processFtpAddrMt
# pass the queue to the processFtpAddrMt function
for x in range(0,5):
    t = Thread(target=processFtpAddrMt,args=(ftp_addr_queue,))
    t.setDaemon(True)
    t.start()

# blocks further execution of the script until all queue items have been processed
ftp_addr_queue.join()
It uses the Queue class to store your SQL results and then the Thread class to process the queue. Five threads are created, and each one runs the processFtpAddrMt function, which takes FTP addresses from the queue until the queue is empty. All you have to do is add the code for processing the FTP address. Hope this helps.
I was able to solve the problem with the following:
def worker():
    w = Database_connection()
    sql = "select stars from galaxy"
    for result in w.db_call(sql, "x", "x", "x", "x"):
        if result:
            pass  # process the result here

jobs = []
startTime = datetime.now()
for i in range(1):
    p = multiprocessing.Process(target=worker)
    jobs.append(p)
    p.start()
print "Runtime: " + str(datetime.now()-startTime)
I believe it is not the best way to do it, but for now it solved my problem :)