I have a large customer data set (10 million+ rows) on which I run a loop calculation. I am trying to speed it up with multiprocessing by splitting data1 into chunks and running it in SageMaker Studio, but the calculation takes longer with multiprocessing than without. I am not sure what I am doing wrong; please help.
input data example:
state_list = ['A','B','C','D','E'] #possible states
data1 = pd.DataFrame({"cust_id": ['x111','x112'], #customer data
"state": ['B','E'],
"amount": [1000,500],
"year":[3,2],
"group":[10,10],
"loan_rate":[0.12,0.13]})
data1['state'] = pd.Categorical(data1['state'],
categories=state_list,
ordered=True).codes
lookup1 = pd.DataFrame({'year': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'lim %': [0.1, 0.1, 0.1, 0.1, 0.1,0.1, 0.1, 0.1, 0.1, 0.1]}).set_index(['year'])
matrix_data = np.arange(250).reshape(10,5,5) #3d matrix by state(A-E) and year(1-10)
end = pd.Timestamp(year=2021, month=9, day=1) # creating a list of dates
df = pd.DataFrame({"End": pd.date_range(end, periods=10, freq="M")})
df['End']=df['End'].dt.day
End=df.values
end_dates = End.reshape(-1) # array([30, 31, 30, 31, 31, 28, 31, 30, 31, 30]); just to simplify access to the end date values
calculation:
num_processes = 4
# Split the customer data into chunks
chunks = np.array_split(data1, num_processes)
queue = mp.Queue()

def calc(chunk):
    results1 = {}
    # iterate over the rows of this chunk (not over the list of all chunks)
    for cust_id, state, amount, start, group, loan_rate in chunk.itertuples(name=None, index=False):
        res1 = [amount * matrix_data[start-1, state, :]]
        for year in range(start+1, len(matrix_data)+1):
            res1.append(lookup1.loc[year].iat[0] * np.array(res1[-1]))
            res1.append(res1[-1] * loan_rate * end_dates[year-1]/365)  # year - 1 here
            res1.append(res1[-1] + 100)
            res1.append(np.linalg.multi_dot([res1[-1], matrix_data[year-1]]))
        results1[cust_id] = res1
    queue.put(results1)

processes = [mp.Process(target=calc, args=(chunk,)) for chunk in chunks]
for p in processes:
    p.start()
for p in processes:
    p.join()
results1 = {}
while not queue.empty():
    results1.update(queue.get())
I think it would be easier to use a multiprocessing pool with the map method. map submits tasks in chunks anyway, but your worker function calc only needs to deal with individual tuples, since the chunking is done transparently. The pool computes what it thinks is an optimal number of rows to chunk together, based on the total number of rows and the number of processes in the pool, but you can override this. A solution would look something like the following. Since you have not tagged your question with the OS you are running under, the code below is written to run under Windows, Linux or macOS in the most efficient way for that platform. But as I mentioned in a comment, multiprocessing may actually slow down getting your results if calc is not sufficiently CPU-intensive.
from multiprocessing import Pool
import pandas as pd
import numpy as np

def init_pool_processes(*args):
    global lookup1, matrix_data, end_dates
    lookup1, matrix_data, end_dates = args # unpack

def calc(t):
    cust_id, state, amount, start, group, loan_rate = t # unpack
    results1 = {}
    res1 = [amount * matrix_data[start-1, state, :]]
    for year in range(start+1, len(matrix_data)+1):
        res1.append(lookup1.loc[year].iat[0] * np.array(res1[-1]))
        res1.append(res1[-1] * loan_rate * end_dates[year-1]/365) # year - 1 here
        res1.append(res1[-1] + 100)
    return (cust_id, res1) # return tuple

def main():
    state_list = ['A','B','C','D','E'] #possible states
    data1 = pd.DataFrame({"cust_id": ['x111','x112'], #customer data
                          "state": ['B','E'],
                          "amount": [1000,500],
                          "year":[3,2],
                          "group":[10,10],
                          "loan_rate":[0.12,0.13]})
    data1['state'] = pd.Categorical(data1['state'],
                                    categories=state_list,
                                    ordered=True).codes
    lookup1 = pd.DataFrame({'year': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                            'lim %': [0.1, 0.1, 0.1, 0.1, 0.1,0.1, 0.1, 0.1, 0.1, 0.1]}).set_index(['year'])
    matrix_data = np.arange(250).reshape(10,5,5) #3d matrix by state(A-E) and year(1-10)
    end = pd.Timestamp(year=2021, month=9, day=1) # creating a list of dates
    df = pd.DataFrame({"End": pd.date_range(end, periods=10, freq="M")})
    df['End']=df['End'].dt.day
    End=df.values
    end_dates = End.reshape(-1) # array([30, 31, 30, 31, 31, 28, 31, 30, 31, 30]); just to simplify access to the end date values
    with Pool(initializer=init_pool_processes, initargs=(lookup1, matrix_data, end_dates)) as pool:
        results = {cust_id: arr for cust_id, arr in pool.map(calc, data1.itertuples(name=None, index=False))}
    for cust_id, arr in results.items():
        print(cust_id, arr)

if __name__ == '__main__':
    main()
Prints:
x111 [array([55000, 56000, 57000, 58000, 59000]), array([5500., 5600., 5700., 5800., 5900.]), array([56.05479452, 57.0739726 , 58.09315068, 59.11232877, 60.13150685]), array([156.05479452, 157.0739726 , 158.09315068, 159.11232877,
160.13150685]), array([15.60547945, 15.70739726, 15.80931507, 15.91123288, 16.01315068]), array([0.15904763, 0.16008635, 0.16112507, 0.1621638 , 0.16320252]), array([100.15904763, 100.16008635, 100.16112507, 100.1621638 ,
100.16320252]), array([10.01590476, 10.01600864, 10.01611251, 10.01621638, 10.01632025]), array([0.09220121, 0.09220216, 0.09220312, 0.09220407, 0.09220503]), array([100.09220121, 100.09220216, 100.09220312, 100.09220407,
100.09220503]), array([10.00922012, 10.00922022, 10.00922031, 10.00922041, 10.0092205 ]), array([0.10201178, 0.10201178, 0.10201178, 0.10201178, 0.10201178]), array([100.10201178, 100.10201178, 100.10201178, 100.10201178,
100.10201178]), array([10.01020118, 10.01020118, 10.01020118, 10.01020118, 10.01020118]), array([0.09873075, 0.09873075, 0.09873075, 0.09873075, 0.09873075]), array([100.09873075, 100.09873075, 100.09873075, 100.09873075,
100.09873075]), array([10.00987308, 10.00987308, 10.00987308, 10.00987308, 10.00987308]), array([0.10201843, 0.10201843, 0.10201843, 0.10201843, 0.10201843]), array([100.10201843, 100.10201843, 100.10201843, 100.10201843,
100.10201843]), array([10.01020184, 10.01020184, 10.01020184, 10.01020184, 10.01020184]), array([0.09873076, 0.09873076, 0.09873076, 0.09873076, 0.09873076]), array([100.09873076, 100.09873076, 100.09873076, 100.09873076,
100.09873076])]
x112 [array([22500, 23000, 23500, 24000, 24500]), array([2250., 2300., 2350., 2400., 2450.]), array([24.04109589, 24.57534247, 25.10958904, 25.64383562, 26.17808219]), array([124.04109589, 124.57534247, 125.10958904, 125.64383562,
126.17808219]), array([12.40410959, 12.45753425, 12.5109589 , 12.56438356, 12.61780822]), array([0.13695496, 0.13754483, 0.1381347 , 0.13872456, 0.13931443]), array([100.13695496, 100.13754483, 100.1381347 , 100.13872456,
100.13931443]), array([10.0136955 , 10.01375448, 10.01381347, 10.01387246, 10.01393144]), array([0.11056217, 0.11056282, 0.11056347, 0.11056413, 0.11056478]), array([100.11056217, 100.11056282, 100.11056347, 100.11056413,
100.11056478]), array([10.01105622, 10.01105628, 10.01105635, 10.01105641, 10.01105648]), array([0.09983629, 0.09983629, 0.09983629, 0.09983629, 0.09983629]), array([100.09983629, 100.09983629, 100.09983629, 100.09983629,
100.09983629]), array([10.00998363, 10.00998363, 10.00998363, 10.00998363, 10.00998363]), array([0.11052119, 0.11052119, 0.11052119, 0.11052119, 0.11052119]), array([100.11052119, 100.11052119, 100.11052119, 100.11052119,
100.11052119]), array([10.01105212, 10.01105212, 10.01105212, 10.01105212, 10.01105212]), array([0.10696741, 0.10696741, 0.10696741, 0.10696741, 0.10696741]), array([100.10696741, 100.10696741, 100.10696741, 100.10696741,
100.10696741]), array([10.01069674, 10.01069674, 10.01069674, 10.01069674, 10.01069674]), array([0.11052906, 0.11052906, 0.11052906, 0.11052906, 0.11052906]), array([100.11052906, 100.11052906, 100.11052906, 100.11052906,
100.11052906]), array([10.01105291, 10.01105291, 10.01105291, 10.01105291, 10.01105291]), array([0.10696741, 0.10696741, 0.10696741, 0.10696741, 0.10696741]), array([100.10696741, 100.10696741, 100.10696741, 100.10696741,
100.10696741])]
If you wish to save memory, you could use the imap_unordered method, which yields results as they complete instead of building the full result list first:
def main():
    ... # code omitted

    def compute_chunksize(iterable_size, pool_size):
        chunksize, remainder = divmod(iterable_size, 4 * pool_size)
        if remainder:
            chunksize += 1
        return chunksize

    from multiprocessing import cpu_count
    pool_size = cpu_count()
    iterable_size = 100_000 # Your best estimate
    chunksize = compute_chunksize(iterable_size, pool_size)
    with Pool(pool_size, initializer=init_pool_processes, initargs=(lookup1, matrix_data, end_dates)) as pool:
        it = pool.imap_unordered(calc, data1.itertuples(name=None, index=False), chunksize=chunksize)
        """
        # Create dictionary in memory:
        results = {cust_id: arr for cust_id, arr in it}
        """
        # Or to save memory, iterate the results:
        for cust_id, arr in it:
            print(cust_id, arr)

if __name__ == '__main__':
    main()
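As a quick check of the chunk-size arithmetic used above, here is a small standalone sketch of compute_chunksize. The row counts and worker counts are hypothetical values chosen only for illustration; they are not figures from the question.

```python
def compute_chunksize(iterable_size, pool_size):
    """Ceiling of iterable_size / (4 * pool_size), i.e. about 4 chunks per worker."""
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

# Hypothetical (rows, workers) pairs, for illustration only
samples = [(100_000, 8), (100_001, 8), (10, 4)]
for rows, workers in samples:
    print(rows, workers, compute_chunksize(rows, workers))
```

A larger chunksize means fewer, bigger messages between the main process and the workers, which usually matters more as the per-row work gets cheaper.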
I'm a bit stuck. I'm trying to pass the thread names assigned by the system to my function so that I can print the start and end time of the thread currently working in the function; I'm using the global variable name for that. The user has to input a number in the given interval. The thread names come out right when I input 1001, but if I input numbers like 1200 or 10001 the names no longer match. I have included examples of the output: output 1 is not what I'm looking for, output 2 is what I need. I'm not sure what is causing the name change. If any additional information is needed, I'm happy to provide it.
import os
from posixpath import abspath
import time
import sys
import signal
import threading
import platform
import subprocess
from pathlib import Path
import math

lokot = threading.Lock()
lista = []
name = 0
name3 = 0

def divisor(start, end):
    lokot.acquire()
    # timestamps get their own names so they do not overwrite the start/end arguments
    t_start = time.time()
    print('{} started working at {}'.format(name, t_start))
    for i in range(int(start), int(end)+1):
        if int(end) % i == 0:
            lista.append(i)
    t_end = time.time()
    print('{} ended working at {}'.format(name, t_end))
    lokot.release()

def new_lista():
    lokot.acquire()
    start = time.time()
    nlista = []
    for i in lista:
        if i % 2 == 0:
            nlista.append(i)
    print(nlista)
    print('{} was executed in time frame {}'.format(name3, time.time()-start))
    lokot.release()

def f4():
    while True:
        print('Input a non negative number in given range <1000,200000>')
        number = input()
        if number.isalpha() or not number or int(number) not in range(1000, 200000):
            print('Number entered is not in the interval <1000,200000>')
            continue
        else:
            global name
            global name3
            x = int(number) / 2
            t1 = threading.Thread(target=divisor, args=(1, x))
            t2 = threading.Thread(target=divisor, args=(1, number))
            t3 = threading.Thread(target=new_lista)
            name = t1.name
            t1.start()
            name = t2.name
            t2.start()
            name3 = t3.name
            t3.start()
            t1.join()
            t2.join()
            t3.join()
            break
Input 1:
100001
Output 1:
Thread-1 started working at 1624538800.4813018
Thread-2 ended working at 1624538800.4887686
Thread-2 started working at 1624538800.4892647
Thread-2 ended working at 1624538800.5076165
[2, 4, 8, 10, 16, 20, 40, 50, 80, 100, 200, 250, 400, 500, 1000, 1250, 2000, 2500, 5000, 6250, 10000, 12500, 25000, 50000]
Thread-3 was executed in time frame 0.0
Input 2:
1001
Output 2:
Thread-1 started working at 1624538882.90607
Thread-1 ended working at 1624538882.9070616
Thread-2 started working at 1624538882.9074266
Thread-2 ended working at 1624538882.9089162
[2, 4, 8, 10, 20, 40, 50, 100, 200, 250, 500, 1000, 1250, 2500, 5000]
Thread-3 was executed in time frame 0.0
This won't necessarily work:
name = t1.name
t1.start()
name = t2.name
Nothing prevents the second assignment from happening before the t1 thread accesses the name variable.
Q: Why don't you just assign names when you create the threads instead of letting the threading library assign them? E.g.:
t1 = threading.Thread(target=divisor, args=(1, x), name="t1")
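One way to sidestep the shared global entirely is to have each thread look up its own name while it runs. The following is a minimal sketch, assuming a simplified worker called divisors and a results dictionary, both of which are hypothetical names introduced here for illustration; threading.current_thread() is the standard-library call that returns the thread executing the function.

```python
import threading

def divisors(n, results):
    # Ask the running thread for its own name instead of reading a global
    # that the main thread may already have overwritten.
    name = threading.current_thread().name
    results[name] = [i for i in range(1, n + 1) if n % i == 0]

results = {}
threads = [
    threading.Thread(target=divisors, args=(12, results), name="t1"),
    threading.Thread(target=divisors, args=(18, results), name="t2"),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for name in sorted(results):
    print(name, results[name])
```

Because the name is read inside the thread, it no longer matters how quickly the main thread moves on to create and start the next one.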
I need to store input from the system and trigger a code block when a command given via input matches a condition. The commands are randomly produced by the system and are not the same each time the code is executed. What I do below is store the input in a list until the input is a blank line, which signals that the commands are over; the problem statement specifically says the commands end with a blank line after the last command. Then I read the commands and their values from that list until there is no command left to perform. I know this is bad practice. Since I am new to this language, I need some advice on how to change my code. Thanks in advance. By the way, I can't change the conditions in the if statements, because the commands given via input are not always the same but look like these, and more:
append_it 15
insert_it 0 25
remove_it 30
The code works just fine; I need advice on turning it into good practice so that I can improve my Python.
i = 0
command_list = []
while True:
    command = input('')
    if command == '':
        break
    command_list.append(command)
    i += 1
b = 0
arr = []
while i != b:
    command1 = command_list[b]
    b += 1
    # every command name below is 9 characters long, so the slices use [0:9]
    if command1[0:9] == "append_it":
        value = int(command1[10:])
        arr.append(value)
    elif command1[0:9] == "insert_it":
        index = int(command1[10:12])
        value = int(command1[12:])
        arr.insert(index, value)
    elif command1[0:9] == "remove_it":
        value = int(command1[10:])
        if value in arr:
            arr.remove(value)
    elif command1[0:] == "print_it":
        print(arr)
    elif command1[0:] == "reverse_it":
        arr.reverse()
    elif command1[0:] == "sort_it":
        arr.sort()
    elif command1[0:] == "pop_it":
        arr.pop()
You can improve this by defining the actions in a dictionary, storing each input line as a split list, and calling the appropriate function for each command:
def appendit(a, *prms):
    v = int(prms[0])
    a.append(v)

def insertit(a, *prms):
    i = int(prms[0])
    v = int(prms[1])
    a.insert(i, v)

def removeit(a, *prms):
    v = int(prms[0])
    if v in a:  # list.remove raises ValueError if the value is missing
        a.remove(v)

def reverseit(a): a.reverse()
def sortit(a): a.sort()
def popit(a): a.pop()

# define what command to run for what input
cmds = {"append_it" : appendit,
        "insert_it" : insertit,
        "remove_it" : removeit,
        "print_it" : print, # does not need any special function
        "reverse_it": reverseit,
        "sort_it" : sortit,
        "pop_it" : popit}

command_list = []
while True:
    command = input('')
    if command == '':
        break
    c = command.split() # split the command already
    # only allow commands you know into your list - they still might have the
    # wrong amount of params given - you should check that in the functions
    if c and c[0] in cmds:
        command_list.append(c)

arr = []
for (command, *prms) in command_list:
    # call the correct function with/without params
    if prms:
        cmds[command](arr, *prms)
    else:
        cmds[command](arr)
Output:
# inputs from user:
append_it 42
append_it 32
append_it 52
append_it 62
append_it 82
append_it 12
append_it 22
append_it 33
append_it 12
print_it # 1st printout
sort_it
print_it # 2nd printout sorted
reverse_it
print_it # 3rd printout reversed sorted
pop_it
print_it # one elem popped
insert_it 4 99
remove_it 42
print_it # 99 inserted and 42 removed
# print_it - outputs
[42, 32, 52, 62, 82, 12, 22, 33, 12]
[12, 12, 22, 32, 33, 42, 52, 62, 82]
[82, 62, 52, 42, 33, 32, 22, 12, 12]
[82, 62, 52, 42, 33, 32, 22, 12]
[82, 62, 52, 99, 33, 32, 22, 12]
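The dictionary approach can also carry the expected number of arguments for each command, so malformed lines are skipped rather than raising an error. This is only a sketch of one possible design: run_commands, the handlers table and the choice to silently skip bad lines are assumptions made here, not part of the original code, and print_it records a snapshot instead of printing so that the result is easy to inspect.

```python
def run_commands(lines):
    """Apply list commands; skip unknown names, wrong argument counts, non-integers."""
    arr = []
    printed = []  # snapshots taken by print_it
    # command name -> (expected number of integer arguments, action)
    handlers = {
        "append_it": (1, lambda v: arr.append(v)),
        "insert_it": (2, lambda i, v: arr.insert(i, v)),
        "remove_it": (1, lambda v: arr.remove(v) if v in arr else None),
        "print_it": (0, lambda: printed.append(list(arr))),
        "reverse_it": (0, lambda: arr.reverse()),
        "sort_it": (0, lambda: arr.sort()),
        "pop_it": (0, lambda: arr.pop() if arr else None),
    }
    for line in lines:
        name, *params = line.split() or [""]
        if name not in handlers:
            continue  # unknown command
        arity, action = handlers[name]
        if len(params) != arity:
            continue  # wrong number of arguments
        try:
            args = [int(p) for p in params]
        except ValueError:
            continue  # argument is not an integer
        action(*args)
    return arr, printed

final, snapshots = run_commands(
    ["append_it 15", "insert_it 0 25", "remove_it 30", "append_it", "bogus 1", "print_it"]
)
print(final, snapshots)
```

Keeping the argument count next to the action means a new command only needs one new dictionary entry.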
I'm writing my first multiprocessing program in Python.
I want to create a list of values to be processed, and 8 processes (the number of CPU cores) will consume and process the list of values.
I wrote the following Python code:
__author__ = 'Rui Martins'
from multiprocessing import cpu_count, Process, Lock, Value

def proc(lock, number_of_active_processes, valor):
    lock.acquire()
    number_of_active_processes.value+=1
    print("Active processes:", number_of_active_processes.value)
    lock.release()
    # DO SOMETHING ...
    for i in range(1, 100):
        valor=valor**2
    # (...)
    lock.acquire()
    number_of_active_processes.value-=1
    lock.release()

if __name__ == '__main__':
    proc_number=cpu_count()
    number_of_active_processes=Value('i', 0)
    lock = Lock()
    values=[11, 24, 13, 40, 15, 26, 27, 8, 19, 10, 11, 12, 13]
    values_processed=0
    processes=[]
    for i in range(proc_number):
        processes+=[Process()]
    while values_processed<len(values):
        while number_of_active_processes.value < proc_number and values_processed<len(values):
            for i in range(proc_number):
                if not processes[i].is_alive() and values_processed<len(values):
                    processes[i] = Process(target=proc, args=(lock, number_of_active_processes, values[values_processed]))
                    values_processed+=1
                    processes[i].start()
        while number_of_active_processes.value == proc_number:
            # BUG: always number_of_active_processes.value == 8 :(
            print("Active processes:", number_of_active_processes.value)
    print("")
    print("Active processes at END:", number_of_active_processes.value)
I have the following problems:
The program never stops.
I run out of RAM.
Simplifying your code to the following:
from multiprocessing import cpu_count, Process, Lock, Value

def proc(lock, number_of_active_processes, valor):
    lock.acquire()
    number_of_active_processes.value += 1
    print("Active processes:", number_of_active_processes.value)
    lock.release()
    # DO SOMETHING ...
    for i in range(1, 100):
        print(valor)
        valor = valor **2
    # (...)
    lock.acquire()
    number_of_active_processes.value -= 1
    lock.release()

if __name__ == '__main__':
    proc_number = cpu_count()
    number_of_active_processes = Value('i', 0)
    lock = Lock()
    values = [11, 24, 13, 40, 15, 26, 27, 8, 19, 10, 11, 12, 13]
    values_processed = 0
    processes = [Process() for _ in range(proc_number)]
    while values_processed < len(values):
        for p in processes:
            # the bounds check keeps values[values_processed] from running past the list
            if not p.is_alive() and values_processed < len(values):
                p = Process(target=proc,
                            args=(lock, number_of_active_processes, values[values_processed]))
                values_processed += 1
                p.start()
If you run it as above, with the added print(valor), you see exactly what is happening: valor is squared on every iteration, so its size roughly doubles each time until you run out of memory. You don't get stuck in the while loop; you get stuck in the for loop.
This is the output of the 12th process after adding print(len(str(valor))), captured after a fraction of a second, and it just keeps on going:
2
3
6
11
21
.........
59185
70726
68249
73004
77077
83805
93806
92732
90454
104993
118370
136498
131073
Just changing your loop to the following:
for i in range(1, 100):
    print(valor)
    valor = valor *2
The last number created is:
6021340351084089657109340225536
Using your own code, you seem to get stuck in the while loop, but what is actually happening is that valor keeps growing in the for loop, to numbers with as many digits as:
167609
180908
185464
187612
209986
236740
209986
And on....
The problem is not your multiprocessing code. It's the pow operator in the for loop:
for i in range(1, 100):
    valor=valor**2
The loop squares valor 99 times, so the final result would be pow(valor, 2**99). That is far too big: calculating it would cost too much time and memory, which is why you eventually run out of memory.
4 GB = 4 * pow(2, 10) * pow(2, 10) * pow(2, 10) * 8 bits = 2**35 bits
and for your smallest number, 8:
pow(8, 2**99) = pow(2**3, 2**99) = pow(2, 3*pow(2, 99))
Storing that number takes about 3*pow(2, 99) bits, and 3*pow(2, 99) bits / 4 GB = 3*pow(2, 99-35) = 3*pow(2, 64)
So it needs 3*pow(2, 64) times 4 GB of memory.
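The growth can be observed directly for a small number of squarings. The sketch below uses a hypothetical helper, bits_after_squarings, together with int.bit_length() to show that each squaring roughly doubles the number of bits needed to store the value; it deliberately stops at a handful of iterations so that it finishes instantly.

```python
def bits_after_squarings(value, squarings):
    """Square value the given number of times and return its size in bits."""
    for _ in range(squarings):
        value = value ** 2
    return value.bit_length()

# 8 == 2**3, so after k squarings the value is 2**(3 * 2**k),
# which needs 3 * 2**k + 1 bits.
for k in range(6):
    print(k, bits_after_squarings(8, k))
```

Extrapolating the same doubling to the full loop is what leads to a storage requirement far beyond any available memory.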