I have a process that loops over a list of IP addresses and returns some information about them. The simple for loop works great; my issue is running this at scale, due to Python's Global Interpreter Lock (GIL).
My goal is to run this function in parallel and make full use of my 4 cores, so that when I run 100K of these it won't take 24 hours via a normal for loop.
After reading other answers on here, particularly this one, How do I parallelize a simple Python loop?, I decided to use joblib. When I ran 10 records through it (example above), it took over 10 minutes to run. That doesn't sound right. I know there is something I'm doing wrong or not understanding. Any help is greatly appreciated!
import pandas as pd
import numpy as np
import os as os
from ipwhois import IPWhois
from joblib import Parallel, delayed
import multiprocessing
num_core = multiprocessing.cpu_count()
iplookup = ['174.192.22.197',\
'70.197.71.201',\
'174.195.146.248',\
'70.197.15.130',\
'174.208.14.133',\
'174.238.132.139',\
'174.204.16.10',\
'104.132.11.82',\
'24.1.202.86',\
'216.4.58.18']
Normal for loop which works fine!
asn=[]
asnid=[]
asncountry=[]
asndesc=[]
asnemail = []
asnaddress = []
asncity = []
asnstate = []
asnzip = []
asndesc2 = []
ipaddr=[]
b=1
totstolookup=len(iplookup)
for i in iplookup:
    i = str(i)
    print("Running #{} out of {}".format(b, totstolookup))
    try:
        obj = IPWhois(i, timeout=15)
        result = obj.lookup_whois()
        asn.append(result['asn'])
        asnid.append(result['asn_cidr'])
        asncountry.append(result['asn_country_code'])
        asndesc.append(result['asn_description'])
        try:
            asnemail.append(result['nets'][0]['emails'])
            asnaddress.append(result['nets'][0]['address'])
            asncity.append(result['nets'][0]['city'])
            asnstate.append(result['nets'][0]['state'])
            asnzip.append(result['nets'][0]['postal_code'])
            asndesc2.append(result['nets'][0]['description'])
            ipaddr.append(i)
        except:
            asnemail.append(0)
            asnaddress.append(0)
            asncity.append(0)
            asnstate.append(0)
            asnzip.append(0)
            asndesc2.append(0)
            ipaddr.append(i)
    except:
        pass
    b += 1
Function to pass to joblib to run on all cores!
def run_ip_process(iplookuparray):
    asn = []
    asnid = []
    asncountry = []
    asndesc = []
    asnemail = []
    asnaddress = []
    asncity = []
    asnstate = []
    asnzip = []
    asndesc2 = []
    ipaddr = []
    b = 1
    totstolookup = len(iplookuparray)

    for i in iplookuparray:
        i = str(i)
        print("Running #{} out of {}".format(b, totstolookup))
        try:
            obj = IPWhois(i, timeout=15)
            result = obj.lookup_whois()
            asn.append(result['asn'])
            asnid.append(result['asn_cidr'])
            asncountry.append(result['asn_country_code'])
            asndesc.append(result['asn_description'])
            try:
                asnemail.append(result['nets'][0]['emails'])
                asnaddress.append(result['nets'][0]['address'])
                asncity.append(result['nets'][0]['city'])
                asnstate.append(result['nets'][0]['state'])
                asnzip.append(result['nets'][0]['postal_code'])
                asndesc2.append(result['nets'][0]['description'])
                ipaddr.append(i)
            except:
                asnemail.append(0)
                asnaddress.append(0)
                asncity.append(0)
                asnstate.append(0)
                asnzip.append(0)
                asndesc2.append(0)
                ipaddr.append(i)
        except:
            pass
        b += 1

    ipdataframe = pd.DataFrame({'ipaddress': ipaddr,
                                'asn': asn,
                                'asnid': asnid,
                                'asncountry': asncountry,
                                'asndesc': asndesc,
                                'emailcontact': asnemail,
                                'address': asnaddress,
                                'city': asncity,
                                'state': asnstate,
                                'zip': asnzip,
                                'ipdescrip': asndesc2})
    return ipdataframe
Run process using all cores via joblib
Parallel(n_jobs=num_core)(delayed(run_ip_process)(iplookuparray) for i in iplookup)
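For reference (and not necessarily the cause of the slowdown): joblib's Parallel/delayed pattern is usually handed one item of work per delayed(...) call, whereas the call above asks for a full run of run_ip_process for every element of iplookup. A minimal sketch of the one-item-per-call shape, assuming a hypothetical per-IP helper lookup_one_ip and reusing num_core and iplookup from above:
from joblib import Parallel, delayed
from ipwhois import IPWhois
import pandas as pd

def lookup_one_ip(ip):
    # Hypothetical helper: look up a single address and return one row as a dict.
    try:
        result = IPWhois(str(ip), timeout=15).lookup_whois()
        return {'ipaddress': ip, 'asn': result.get('asn'), 'asndesc': result.get('asn_description')}
    except Exception:
        return {'ipaddress': ip, 'asn': None, 'asndesc': None}

# One delayed(...) call per IP, so joblib can spread the items across workers.
rows = Parallel(n_jobs=num_core)(delayed(lookup_one_ip)(ip) for ip in iplookup)
ipdataframe = pd.DataFrame(rows)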
I use Python multiprocessing to compute some sort of score on DNA sequences from a large file.
For that I wrote and use the script below.
I use a Linux machine with 48 CPUs in a Python 3.8 environment.
The code works fine, terminates correctly, and prints the processing time at the end.
Problem: when I use the htop command, I find that all 48 processes are still alive.
I don't know why, and I don't know what to add to my script to avoid this.
import csv
import sys
import datetime
import time
import concurrent.futures
from itertools import combinations
import psutil

nb_cpu = psutil.cpu_count(logical=False)

def fun_job(seq_1, seq_2):  # seq_i : (id, string)
    start = time.time()
    score_dist = compute_score_dist(seq_1[1], seq_2[1])
    end = time.time()
    return seq_1[0], seq_2[0], score_dist, end - start  # id seq1, id seq2, score, time

def help_fun_job(nested_pair):
    return fun_job(nested_pair[0], nested_pair[1])

def compute_using_multi_processing(list_comb_ids, dict_ids_seqs):
    start = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor(max_workers=nb_cpu) as executor:
        results = executor.map(help_fun_job,
                               [((pair_ids[0], dict_ids_seqs[pair_ids[0]]), (pair_ids[1], dict_ids_seqs[pair_ids[1]]))
                                for pair_ids in list_comb_ids])
        save_results_to_csv(results)
    finish = time.perf_counter()
    processing_time = str(datetime.timedelta(seconds=round(finish - start, 2)))
    print(f' Processing time Finished in {processing_time} hh:mm:ss')

def main():
    print("nb_cpu in this machine : ", nb_cpu)
    file_path = sys.argv[1]
    dict_ids_seqs = get_dict_ids_seqs(file_path)
    list_ids = list(dict_ids_seqs)  # This will convert the dict_keys to a list
    list_combined_ids = list(combinations(list_ids, 2))
    compute_using_multi_processing(list_combined_ids, dict_ids_seqs)

if __name__ == '__main__':
    main()
Thank you for your help.
Edit: added the complete code for fun_job (after @Booboo's answer)
from Bio import Align

def fun_job(seq_1, seq_2):  # seq_i : (id, string)
    start = time.time()
    aligner = Align.PairwiseAligner()
    aligner.mode = 'global'
    score_dist = aligner.score(seq_1[1], seq_2[1])
    end = time.time()
    return seq_1[0], seq_2[0], score_dist, end - start  # id seq1, id seq2, score, time
When the with ... as executor: block exits, there is an implicit call to executor.shutdown(wait=True). This will wait for all pending futures to be done executing "and the resources associated with the executor have been freed", which presumably includes terminating the processes in the pool (if possible?). Why your program terminates (or does it?), or at least why you say all the futures have completed executing while the processes have not terminated, is a bit of a mystery. But you haven't provided the code for fun_job, so who can say why this is so?
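For reference, a minimal sketch of what that implicit call amounts to (work_items here stands in for the list comprehension built in compute_using_multi_processing):
executor = concurrent.futures.ProcessPoolExecutor(max_workers=nb_cpu)
try:
    results = executor.map(help_fun_job, work_items)
    save_results_to_csv(results)
finally:
    # This is what leaving the `with` block does implicitly:
    executor.shutdown(wait=True)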
One thing you might try is to switch to using the multiprocessing.pool.Pool class from the multiprocessing module. It supports a terminate method, which is implicitly called when its context manager with block exits, and which explicitly attempts to terminate all processes in the pool:
# import concurrent.futures
import multiprocessing
...  # etc.

def compute_using_multi_processing(list_comb_ids, dict_ids_seqs):
    start = time.perf_counter()
    with multiprocessing.Pool(processes=nb_cpu) as executor:
        results = executor.map(help_fun_job,
                               [((pair_ids[0], dict_ids_seqs[pair_ids[0]]), (pair_ids[1], dict_ids_seqs[pair_ids[1]]))
                                for pair_ids in list_comb_ids])
        save_results_to_csv(results)
    finish = time.perf_counter()
    processing_time = str(datetime.timedelta(seconds=round(finish - start, 2)))
    print(f' Processing time Finished in {processing_time} hh:mm:ss')
I've heard that Python multi-threading is a bit tricky, and I am not sure what is the best way to go about implementing what I need. Let's say I have a function called IO_intensive_function that does some API call which may take a while to get a response.
Say the process of queuing jobs can look something like this:
import _thread  # the Python 2 "thread" module is "_thread" in Python 3

for job_args in jobs:
    _thread.start_new_thread(IO_intensive_function, (job_args,))

Would IO_intensive_function now just execute its task in the background and allow me to queue up more jobs?
I also looked at this question, where the approach seems to be just the following:
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(2)
results = pool.map(IO_intensive_function, jobs)
As I don't need those tasks to communicate with each other, the only goal is to send my API requests as fast as possible. Is this the most efficient way? Thanks.
Edit:
The way I am making the API request is through a Thrift service.
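For comparison with the pool.map snippet above, the same fan-out can also be written with the standard-library concurrent.futures thread pool; a minimal sketch, reusing the IO_intensive_function and jobs names from above:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as executor:
    # map() submits every job to the thread pool and yields results in input order.
    results = list(executor.map(IO_intensive_function, jobs))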
I had to create code to do something similar recently. I've tried to make it generic below. Note I'm a novice coder, so please forgive the inelegance. What you may find valuable, however, is some of the error processing I found it necessary to embed to capture disconnects, etc.
I also found it valuable to perform the JSON processing in a threaded manner. You have the threads working for you, so why go "serial" again for a processing step when you can extract the info in parallel?
It is possible I have mis-coded something in making it generic. Please don't hesitate to ask follow-ups and I will clarify.
import time
import requests
from multiprocessing.dummy import Pool as ThreadPool
from src_code.config import Config

with open(Config.API_PATH + '/api_security_key.pem') as f:
    my_key = f.read().rstrip("\n")
    f.close()

base_url = "https://api.my_api_destination.com/v1"
headers = {"Authorization": "Bearer %s" % my_key}

itm = list()
itm.append(base_url)
itm.append(headers)

def call_API(call_var):
    base_url = call_var[0]
    headers = call_var[1]
    call_specific_tag = call_var[2]
    endpoint = f'/api_path/{call_specific_tag}'

    connection_tries = 0
    for i in range(3):
        try:
            dat = requests.get((base_url + endpoint), headers=headers).json()
        except:
            connection_tries += 1
            print(f'Call for {call_specific_tag} failed after {i} attempt(s). Pausing for 240 seconds.')
            time.sleep(240)
        else:
            break

    tag = list()
    vars_to_capture_01 = list()
    vars_to_capture_02 = list()
    connection_tries = 0

    try:
        if 'record_id' in dat:
            vars_to_capture_01.append(dat['record_id'])
            vars_to_capture_02.append(dat['second_item_of_interest'])
        else:
            vars_to_capture_01.append(call_specific_tag)
            print(f'Call specific tag {call_specific_tag} is unavailable. Successful pull.')
            vars_to_capture_02.append(-1)
    except:
        print(f'{call_specific_tag} is unavailable. Unsuccessful pull.')
        vars_to_capture_01.append(call_specific_tag)
        vars_to_capture_02.append(-1)
        time.sleep(240)

    pack = list()
    pack.append(vars_to_capture_01)
    pack.append(vars_to_capture_02)
    return pack

vars_to_capture_01 = list()
vars_to_capture_02 = list()

i = 0
max_i = len(all_tags)
while i < max_i:
    ind_rng = range(i, min((i + 10), (max_i)), 1)
    itm_lst = (itm.copy())
    call_var = [itm_lst + [all_tags[q]] for q in ind_rng]

    # packed = call_API(call_var[0])  # for testing of function without pooling
    pool = ThreadPool(len(call_var))
    packed = pool.map(call_API, call_var)
    pool.close()
    pool.join()

    for pack in packed:
        try:
            vars_to_capture_01.append(pack[0][0])
        except:
            print(f'Unpacking error for {all_tags[i]}.')
        vars_to_capture_02.append(pack[1][0])

    i += 10  # advance to the next batch of (up to) 10 tags
For network API requests you can use asyncio. Have a look at this article https://realpython.com/python-concurrency/#asyncio-version for an example of how to implement it.
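A minimal sketch of that style, assuming the third-party aiohttp package is installed and using placeholder URLs:
import asyncio
import aiohttp

async def fetch(session, url):
    # Issue one GET request and return the parsed JSON body.
    async with session.get(url) as resp:
        return await resp.json()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # Schedule all requests concurrently and wait for them all to finish.
        return await asyncio.gather(*(fetch(session, u) for u in urls))

results = asyncio.run(main(["https://example.com/api/1", "https://example.com/api/2"]))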
I have a list containing ID numbers. I want to use each unique ID number in an API call on its own process, while running the same corresponding functions and applying the same conditional statements in each process, etc. I have tried to make sense of it, but there is not a lot online about this procedure.
I thought about using a for loop, but I don't want every process running this for loop and picking up every item in the list. I just need each item to be associated with its own process.
I was thinking something like this:
import multiprocessing
import requests, json

ID_NUMBERS = ["ID 1", "ID 2", "ID 3".... ETC]
BASE_URL = "www.api.com"
KEY = {"KEY": "12345"}

a = 0
for x in ID_NUMBERS:
    def[a]():
        while Active_live_data == True:
            # continuously loops over, requesting data from the website
            unique_api_call = "{}/livedata[{}]".format(BASE_URL, x)
            request_it = requests.get(unique_api_call, headers=KEY)
            show_it = (json.loads(request_it.content))
            #some extra conditional code...
    a += 1

processes = []
b = 0
for _ in range(len(ID_NUMBERS)):
    p = multiprocessing.Process(target = b)
    p.start()
    processes.append(p)
    b += 1
Any help would be greatly appreciated!
Kindest regards,
Andrew
You can use the map function:
import multiprocessing as mp
num_cores = mp.cpu_count()
pool = mp.Pool(processes=num_cores)
results = pool.map(your_function, list_of_IDs)
This will execute the function your_function, each time with a different item from the list list_of_IDs, and the values returned by your_function will be stored in a list of values (results).
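A minimal self-contained sketch of the same idea (your_function and the IDs below are placeholders standing in for the poster's API call):
import multiprocessing as mp
import requests, json

BASE_URL = "www.api.com"
KEY = {"KEY": "12345"}

def your_function(id_number):
    # One API call per ID; each call runs in a worker process.
    unique_api_call = "{}/livedata[{}]".format(BASE_URL, id_number)
    request_it = requests.get(unique_api_call, headers=KEY)
    return json.loads(request_it.content)

if __name__ == '__main__':
    list_of_IDs = ["ID 1", "ID 2", "ID 3"]
    with mp.Pool(processes=mp.cpu_count()) as pool:
        results = pool.map(your_function, list_of_IDs)
    print(results)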
Same approach as @AlessiaM, but using the high-level API in the concurrent.futures module.
import concurrent.futures as mp
import requests, json

BASE_URL = ''
KEY = {"KEY": "12345"}
ID_NUMBERS = ["ID 1", "ID 2", "ID 3"]

def job(id):
    unique_api_call = "{}/livedata[{}]".format(BASE_URL, id)
    request_it = requests.get(unique_api_call, headers=KEY)
    show_it = (json.loads(request_it.content))
    return show_it

# Default to as many workers as there are processors,
# but since your job is IO bound (vs CPU bound),
# you could increase this to an even bigger figure by giving the `max_workers` parameter
with mp.ProcessPoolExecutor() as pool:
    results = pool.map(job, ID_NUMBERS)
    # Process results here
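Since the work here is network I/O rather than computation, the same code also works with a thread pool, which avoids process start-up and pickling overhead; a sketch reusing the mp alias and job function from above, where only the executor class (and optionally max_workers) changes:
with mp.ThreadPoolExecutor(max_workers=32) as pool:
    # Threads are usually enough here because the work is waiting on the network, not computing.
    results = pool.map(job, ID_NUMBERS)
    # Process results here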
Is there a possible way to speed up my code using the multiprocessing interface?
I have a data array that includes passwords, and I would like to run several requests together.
import requests

data = ['test','test1','test2']
counter = 0
for x in data:
    counter += 1
    burp0_data = "<methodCall>\r\n<methodName>wp.getUsersBlogs</methodName>\r\n<params>\r\n<param><value>zohar</value></param>\r\n<param><value>" + x + "</value></param>\r\n</params>\r\n</methodCall>\r\n"
    s = requests.post(burp0_url, headers=burp0_headers, data=burp0_data)
    if not (s.text.__contains__("403")):
        print(s.text)
        print(x)
        exit()
The Python multiprocessing module is what you are looking for. For instance, it has a parallel map function, which will run all the requests in parallel. Here is roughly what your code would look like:
import requests
from multiprocessing import Pool

def post(x):
    burp0_data = "<methodCall>\r\n<methodName>wp.getUsersBlogs</methodName>\r\n<params>\r\n<param><value>zohar</value></param>\r\n<param><value>" + x + "</value></param>\r\n</params>\r\n</methodCall>\r\n"
    s = requests.post(burp0_url, headers=burp0_headers, data=burp0_data)
    if not (s.text.__contains__("403")):
        return s.text, x
    return None, None

if __name__ == '__main__':
    data = ['test','test1','test2']
    counter = 0
    with Pool(processes=len(data)) as pool:
        results = pool.map(post, data, 1)
    for res in results:
        if res[0] is not None:
            print(res[0])
            print(res[1])
            exit()
For more information please refer to the Python docs on multiprocessing.
I store QueryText within a pandas dataframe. Once I've loaded all the queries in, I want to conduct an analysis against each query. Currently, I have ~50k to evaluate, so doing it one by one will take a long time.
So, I wanted to implement concurrent.futures. How do I take the individual QueryText stored within fullAnalysis, pass it to concurrent.futures, and return the output as a variable?
Here is my entire code:
import pandas as pd
import time
import gensim
import sys
import warnings
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

fullAnalysis = pd.DataFrame()

def fetch_data(jFile = 'ProcessingDetails.json'):
    print("Fetching data...please wait")

    #read JSON file for latest dictionary file name
    baselineDictionaryFileName = 'Dictionary/Dictionary_05-03-2020.json'

    #copy data to pandas dataframe
    labelled_data = pd.read_json(baselineDictionaryFileName)

    #Add two more columns to get the most similar text and score
    labelled_data['SimilarText'] = ''
    labelled_data['SimilarityScore'] = float()

    print("Data fetched from " + baselineDictionaryFileName + " and there are " + str(labelled_data.shape[0]) + " rows to be evaluated")
    return labelled_data

def calculateScore(inputFunc):
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    model = gensim.models.Word2Vec.load('w2v_model_bigdata')

    inp = inputFunc
    print(inp)
    out = dict()

    strEvaluation = inp.split("most_similar ",1)[1]

    #while inp != 'quit':
    split_inp = inp.split()
    try:
        if split_inp[0] == 'help':
            pass
        elif split_inp[0] == 'similarity' and len(split_inp) >= 3:
            pass
        elif split_inp[0] == 'most_similar' and len(split_inp) >= 2:
            for pair in model.most_similar(positive=[split_inp[1]]):
                out.update({pair[0]: pair[1]})
    except KeyError as ke:
        #print(str(ke) + "\n")
        inp = input()

    return out

def main():
    with ThreadPoolExecutor(max_workers=5) as executor:
        for i in range(len(fullAnalysis)):
            text = fullAnalysis['QueryText'][i]
            arg = 'most_similar'+ ' ' + text

            #for item in executor.map(calculateScore, arg):
            output = executor.map(calculateScore, arg)
    return output

if __name__ == "__main__":
    fullAnalysis = fetch_data()
    results = main()
    print(f'results: {results}')
The Python Global Interpreter Lock or GIL allows only one thread to hold control of the Python interpreter. Since your function calculateScore might be cpu-bound and requires the interpreter to execute its byte code, you may be gaining little by using threading. If, on the other hand, it were doing mostly I/O operations, it would be giving up the GIL for most of its running time allowing other threads to run. But that does not seem to be the case here. You probably should be using the ProcessPoolExecutor from concurrent.futures (try it both ways and see):
def main():
    with ProcessPoolExecutor(max_workers=None) as executor:
        the_futures = {}
        for i in range(len(fullAnalysis)):
            text = fullAnalysis['QueryText'][i]
            arg = 'most_similar'+ ' ' + text
            future = executor.submit(calculateScore, arg)
            the_futures[future] = i # map future to request
        for future in as_completed(the_futures): # results as they become available not necessarily the order of submission
            i = the_futures[future] # the original index
            result = future.result() # the result
If you omit the max_workers parameter (or specify a value of None) from the ProcessPoolExecutor constructor, the default will be the number of processors you have on your machine (not a bad default). There is no point in specifying a value larger than the number of processors you have.
If you do not need to tie the future back to the original request, then the_futures can just be a list to which the futures are appended. But simplest yet is to not even bother to use the as_completed method:
def main():
    with ProcessPoolExecutor(max_workers=5) as executor:
        the_futures = []
        for i in range(len(fullAnalysis)):
            text = fullAnalysis['QueryText'][i]
            arg = 'most_similar'+ ' ' + text
            future = executor.submit(calculateScore, arg)
            the_futures.append(future)
        # wait for the completion of all the results and return them all:
        results = [f.result() for f in the_futures] # results in creation order
        return results
It should be mentioned that code that launches the ProcessPoolExecutor functions should be in a block governed by if __name__ == '__main__':. If it isn't, you will get into a recursive loop with each subprocess launching the ProcessPoolExecutor. But that does seem to be the case here. Perhaps you meant to use the ProcessPoolExecutor all along?
Also:
I don't know what the line ...
model = gensim.models.Word2Vec.load('w2v_model_bigdata')
... in function calculateScore does. It may be the one I/O-bound statement. But this appears to be something that does not vary from call to call. If that is the case and model is not being modified in the function, shouldn't this statement be moved out of the function and computed just once? Then this function would clearly run faster (and be clearly CPU-bound).
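For example (a sketch, not the poster's code; _model and _init_worker are names introduced here for illustration), the model could be loaded once per worker process via the executor's initializer argument (available since Python 3.7), so every call to calculateScore reuses it:
import gensim
from concurrent.futures import ProcessPoolExecutor

_model = None

def _init_worker():
    # Runs once in each worker process, so the model is loaded once per worker, not once per call.
    global _model
    _model = gensim.models.Word2Vec.load('w2v_model_bigdata')

def calculateScore(inputFunc):
    # ... use the already-loaded _model here instead of reloading it ...
    ...

executor = ProcessPoolExecutor(max_workers=None, initializer=_init_worker)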
Also:
The exception block ...
except KeyError as ke:
#print(str(ke) + "\n")
inp = input()
... is puzzling. You are inputting a value that will never be used right before returning. If this is to pause execution, there is no error message being output.
With Booboo's assistance, I was able to update the code to include ProcessPoolExecutor. Here is my updated code. Overall, processing has been sped up by more than 60%.
I did run into a processing issue and found this topic, BrokenPoolProcess, which addresses the issue.
output = {}
thePool = {}

def main(labelled_data, dictionaryRevised):
    args = sys.argv[1:]

    with ProcessPoolExecutor(max_workers=None) as executor:
        for i in range(len(labelled_data)):
            text = labelled_data['QueryText'][i]
            arg = 'most_similar'+ ' '+ text

            output = winprocess.submit(
                executor, calculateScore, arg
            )
            thePool[output] = i  #original index for future to request

        for output in as_completed(thePool): # results as they become available not necessarily the order of submission
            i = thePool[output] # the original index
            text = labelled_data['QueryText'][i]

            result = output.result() # the result

            maximumKey = max(result.items(), key=operator.itemgetter(1))[0]
            maximumValue = result.get(maximumKey)

            labelled_data['SimilarText'][i] = maximumKey
            labelled_data['SimilarityScore'][i] = maximumValue

    return labelled_data, dictionaryRevised

if __name__ == "__main__":
    start = time.perf_counter()

    print("Starting to evaluate Query Text for labelling...")

    output_Labelled_Data, output_dictionary_revised = preProcessor()

    output, dictionary = main(output_Labelled_Data, output_dictionary_revised)

    finish = time.perf_counter()
    print(f'Finished in {round(finish-start, 2)} second(s)')