Serialize iterator object to be passed between processes in Python

I have a Python script that calculates the eigenvalues of each matrix in a list. I would like to insert these eigenvalues into another collection, in the same order as the original matrices, and to do this by spawning multiple processes.
Here is my code:
import time
import collections
import numpy as NP
from scipy import linalg as LA
from joblib import Parallel, delayed

def computeEigenV(unit_of_work):
    current_index = unit_of_work[0]
    current_matrix = unit_of_work[1]
    e_vals, e_vecs = LA.eig(current_matrix)
    # lowEV is not defined in the snippet as posted; e_vals is assumed here
    finished_unit = (current_index, e_vals[::-1])
    return finished_unit

def run(work_list):
    pool = Parallel(n_jobs=-1, verbose=1, pre_dispatch='all')
    results = pool(delayed(computeEigenV)(unit_of_work) for unit_of_work in work_list)
    return results

if __name__ == '__main__':
    # create original array of matrices
    original_matrix_list = []
    work_list = []
    # basic setup so we can run this test
    for i in range(0, 100):
        # generate the matrix & unit of work
        matrix = NP.random.random_integers(0, 100, (500, 500))
        # insert into respective resources
        original_matrix_list.append(matrix)
    for i, matrix in enumerate(original_matrix_list):
        unit_of_work = [i, matrix]
        work_list.append(unit_of_work)
    work_result = run(work_list)
work_result should hold all the eigenvalues from each matrix after all processes finish. The iterator I am using is unit_of_work, which is a list containing the index of the matrix (from original_matrix_list) and the matrix itself.
The weird thing is that if I run this code with python matrix.py, everything works perfectly. But when I use auto (a program that does calculations for differential equations?) to run my script, typing auto matrix.py gives me the following error:
Traceback (most recent call last):
  File "matrix.py", line 50, in <module>
    work_result = run(work_list)
  File "matrix.py", line 27, in run
    results = pool(delayed(computeEigenV)(unit_of_work) for unit_of_work in work_list)
  File "/Library/Python/2.7/site-packages/joblib/parallel.py", line 805, in __call__
    while self.dispatch_one_batch(iterator):
  File "/Library/Python/2.7/site-packages/joblib/parallel.py", line 658, in dispatch_one_batch
    tasks = BatchedCalls(itertools.islice(iterator, batch_size))
  File "/Library/Python/2.7/site-packages/joblib/parallel.py", line 69, in __init__
    self.items = list(iterator_slice)
  File "matrix.py", line 27, in <genexpr>
    results = pool(delayed(computeEigenV)(unit_of_work) for unit_of_work in work_list)
  File "/Library/Python/2.7/site-packages/joblib/parallel.py", line 162, in delayed
    pickle.dumps(function)
TypeError: expected string or Unicode object, NoneType found
Note: when I ran this with auto I had to change if __name__ == '__main__': to if __name__ == '__builtin__':
I looked up this error and it seems like I am not serializing the iterator unit_of_work correctly when passing it around to different processes. I have then tried to use serialized_unit_of_work = pickle.dumps(unit_of_work), pass that around, and do pickle.loads when I need to use the iterator, but I still get the same error.
Can someone please help point me in the right direction as to how I can fix this? I hesitate to use pickle.dump(obj, file[, protocol]) because eventually I will be running this to calculate eigenvalues of thousands of matrices and I don't really want to create that many files to store the serialized iterator if possible.
Thanks!! :)

You can't pickle an iterator in Python 2.7 (but you can from 3.4 onward).
Also, pickling works differently inside __main__ than outside it, and it would seem that auto is doing something odd with __main__. When pickling fails on a particular object, you will often observe that if, instead of directly running the script that contains the object, you run a separate script as main which imports the portion of the script with the "difficult-to-serialize" object, then pickling succeeds. This is because the object is pickled by reference, at a namespace level above where the "difficult" object lives, so it is never pickled directly.
So you can probably get away with pickling what you want by adding a reference layer: a file import or a class. But if you want to pickle an iterator, you are out of luck unless you move to at least Python 3.4.
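To make the "pickle by reference" idea concrete, here is a minimal sketch, assuming a recent Python 3 and only the standard library. It uses math.sqrt as a stand-in for an importable function (not the asker's computeEigenV) and a hypothetical generator named count_up. It is an illustration of the behaviour described above, not a fix for the auto program.

```python
import math
import pickle

# A function is pickled by reference: the payload stores the module and
# name needed to look it up again, not the function's code.
payload = pickle.dumps(math.sqrt)
restored_function = pickle.loads(payload)

# A list iterator can be pickled on recent Python 3 versions, and the
# restored copy resumes from where the original left off.
iterator = iter([10, 20, 30])
next(iterator)  # consume the first item
remaining = list(pickle.loads(pickle.dumps(iterator)))

# A generator cannot be pickled by the standard library pickle module.
def count_up(limit):
    for value in range(limit):
        yield value

try:
    pickle.dumps(count_up(3))
    generator_pickled = True
except TypeError:
    generator_pickled = False
```

Because the payload only names the module and function, unpickling depends on the receiving process being able to import that module under the same name, which is why an unusual __main__ can break it.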

Related

ERROR : When trying deployment of model in ML I get-'dict' object is not callable

I have used the code below for my Python Streamlit deployment of an ML model.
import streamlit as st
import pickle
import numpy as np
import pandas as pd

similarity=pickle.load(open(r'C:\Users\nikso\OneDrive\Desktop\mlproject\similarity.pkl','rb'),buffers=None)
list=pickle.load(open(r'C:\Users\nikso\OneDrive\Desktop\mlproject\movies_dict.pkl','rb'),buffers=None)
movies=pd.DataFrame.from_dict(list)

def recomm(movie):
    mov_index=movies[movies['title']==movie].index[0]
    sim=similarity[mov_index]
    movlist=sorted(list(enumerate(sim)),reverse=True,key=lambda x:x[1])[1:6]
    rec_movie=[]
    for i in movlist:
        # print(i[0])
        rec_movie.append(movies.iloc[i[0]]['title'])
    return rec_movie

st.title('Movie Recommender System')
selected_movie_name = st.selectbox(
    'How would you like to be contacted?',
    movies['title'].values)
if st.button('Recommend'):
    recom=recomm(selected_movie_name)
    # recom=np.array(recom)
    for i in recom:
        st.write(i)
On Colab the code works fine, but in VS Code it shows this error:
  File "C:\Users\anaconda3\envs\Streamlit\lib\site-packages\streamlit\scriptrunner\script_runner.py", line 554, in _run_script
    exec(code, module.__dict__)
  File "C:\Users\OneDrive\Desktop\mlproject\app.py", line 30, in <module>
    recom=recomm(selected_movie_name)
  File "C:\Users\OneDrive\Desktop\mlproject\app.py", line 15, in recomm
    movlist=sorted(list(enumerate(sim)),reverse=True,key=lambda x:x[1])[1:6]
Now I had to use different IDEs for deployment. But when I removed the keyword 'list' in the given line 15, it worked fine. What can be the reason behind it? I am a beginner and really curious about it. Thank you.

But when I removed the keyword 'list' in the given line 15 it worked fine. What can be the reason behind it?
TL;DR: sorted accepts iterables, and enumerate is already an iterable
Long answer:
When you define list as
list=pickle.load(open(r'C:\Users\nikso\OneDrive\Desktop\mlproject\movies_dict.pkl','rb'),buffers=None)
you're shadowing Python's built-in list type. Python lets you do this without issuing any warnings, but as a result, list now refers to a dictionary object in your script. When you call list(enumerate(sim)) later on, you're treating that dictionary as a callable, which it is not.
The solution? Avoid shadowing Python built-ins whenever you can.
import streamlit as st
import pickle
import numpy as np
import pandas as pd

similarity=pickle.load(open(r'C:\Users\nikso\OneDrive\Desktop\mlproject\similarity.pkl','rb'),buffers=None)
movies_dict=pickle.load(open(r'C:\Users\nikso\OneDrive\Desktop\mlproject\movies_dict.pkl','rb'),buffers=None)
movies=pd.DataFrame.from_dict(movies_dict)

def recomm(movie):
    mov_index=movies[movies['title']==movie].index[0]
    sim=similarity[mov_index]
    movlist=sorted(list(enumerate(sim)),reverse=True,key=lambda x:x[1])[1:6]
    rec_movie=[]
    for i in movlist:
        # print(i[0])
        rec_movie.append(movies.iloc[i[0]]['title'])
    return rec_movie

st.title('Movie Recommender System')
selected_movie_name = st.selectbox(
    'How would you like to be contacted?',
    movies['title'].values)
if st.button('Recommend'):
    recom=recomm(selected_movie_name)
    # recom=np.array(recom)
    for i in recom:
        st.write(i)
To answer specifically why removing "list" on line 15 seemed to fix the issue, though: sorted accepts any iterable, and enumerate already returns one. All list does on line 15 is gather the results of enumerate before passing them to sorted. But the fundamental reason removing list fixed things is that the name list no longer referred to Python's built-in, and replacing a built-in like that is something you probably want to avoid.
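The effect can be reproduced without Streamlit or pickle files. The following is a minimal sketch with made-up similarity scores and two hypothetical helper functions; the shadowing is confined to one function so the built-in list stays usable elsewhere.

```python
scores = [0.2, 1.0, 0.7, 0.4]  # made-up similarity values

def rank_with_shadowed_list(sim):
    # Rebinding the name "list" to a dict hides the built-in type
    # inside this function.
    list = {"title": ["a", "b"]}
    try:
        return sorted(list(enumerate(sim)), reverse=True, key=lambda x: x[1])
    except TypeError as exc:
        return str(exc)

def rank(sim):
    # sorted() consumes the enumerate object directly, so no list() call
    # (and no lookup of the name "list") is needed.
    return sorted(enumerate(sim), reverse=True, key=lambda x: x[1])

error_message = rank_with_shadowed_list(scores)
ranking = rank(scores)
```

Here error_message holds the same "'dict' object is not callable" text as in the question title, while rank succeeds because it never looks up the name list.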

Factorializing Medium Numpy Ints Creates Runtime Warning

I am checking run times on factorials (have to use the user-defined function), but I receive an odd error. Code I'm working with is as follows:
import numpy as np
import time

np.random.seed(14)
nums = list(np.random.randint(low=100, high=500, size=10))
# nums returns as [207, 444, 368, 427, 349, 458, 334, 256, 238, 308]

def fact(x):
    if x == 1:
        return 1
    else:
        return x * fact(x-1)

recursion_times = []
recursion_factorials = []
for i in nums:
    t1 = time.perf_counter()
    factorial = fact(i)
    t2 = time.perf_counter()
    execution = t2-t1
    recursion_factorials.append(factorial)
    recursion_times.append(execution)
    print(execution)
When I run the above, I get the following:
RuntimeWarning: overflow encountered in long_scalars
But when I run it as below, I get no warnings.
recursion_times = []
recursion_factorials = []
for i in [207, 444, 368, 427, 349, 458, 334, 256, 238, 308]:
    t1 = time.perf_counter()
    factorial = fact(i)
    t2 = time.perf_counter()
    execution = t2-t1
    recursion_factorials.append(factorial)
    recursion_times.append(execution)
    print(execution)
I know it's a bit of extra overhead to call the list nums, but why would it trigger a runtime warning? I've tried digging around but I only get dynamically-named variable threads and warning suppression libraries - I'm looking for why this might happen.
For what it's worth, I'm running Python3 in a jupyter notebook. Glad to answer any other questions if it will help.
Thanks in advance for the help!

If (as in the current version of your post) you created nums by calling list on a NumPy array, but wrote an explicit list literal with no NumPy for the second test, then the second test gives no warning because it's not using NumPy. nums is a list of NumPy fixed-width integers, while the other list is a list of ordinary Python ints. Ordinary Python ints don't overflow.
(If you want to create a list of ordinary Python scalars from a NumPy array, the way to do that is with array.tolist(). This is usually undesirable due to performance implications, but it is occasionally necessary to interoperate with code that chokes on NumPy types.)
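As a rough illustration of the difference, here is a sketch assuming NumPy is installed. It uses a small hand-written array rather than the seeded random one from the question, and it silences the overflow warning on purpose so the wrapped result can be compared with the exact one.

```python
import math
import warnings

import numpy as np

arr = np.array([207, 444, 368])
as_numpy_scalars = list(arr)    # elements are NumPy fixed-width integers
as_python_ints = arr.tolist()   # elements are ordinary Python ints

def fact(x):
    return 1 if x <= 1 else x * fact(x - 1)

# Ordinary Python ints have arbitrary precision, so the result is exact.
exact = fact(as_python_ints[0])

# NumPy integers wrap around on overflow; the warning is suppressed here
# only to keep the demonstration quiet.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    wrapped = fact(as_numpy_scalars[0])
```

The two results differ even though both lists print the same numbers, which is why the list built with list(...) triggers the warning and the literal list does not.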
There would usually be an additional effect due to the default Python warning handling. By default, Python only emits a warning once per code location per Python process. In the original version of your question, it looked like this was causing the difference.
Using a variable or not using a variable has no effect on this warning.
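The once-per-location behaviour mentioned above can be observed with the standard library alone. This is a small sketch using a hypothetical noisy function and the warnings module's own filters, not NumPy's warning machinery.

```python
import warnings

def noisy():
    # Every call issues the warning from the same source line.
    warnings.warn("overflow encountered", RuntimeWarning)

with warnings.catch_warnings(record=True) as shown_by_default:
    warnings.simplefilter("default")  # report once per code location
    for _ in range(3):
        noisy()

with warnings.catch_warnings(record=True) as shown_always:
    warnings.simplefilter("always")   # report every occurrence
    for _ in range(3):
        noisy()
```

With the "default" action only the first of the three calls is recorded, while "always" records all three, so a warning that appears to vanish on a second run may simply have been reported already.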

matplotlib error when running plotting in multiprocess

I am using Python's multiprocessing.Pool to plot some data using multiple processes, as follows:
from multiprocessing import Pool

class PlotDriver:
    def plot(self, parameterList):
        numberOfWorkers = len(parameterList)
        pool = Pool(numberOfWorkers)
        pool.map(plotWorkerFunction, parameterList)
        pool.close()
        pool.join()
This is a simplified version of my class; the driver also contains other things that I chose to omit. plotWorkerFunction is a single-threaded function that imports matplotlib, does all the plotting and figure styling, and saves the plots to one PDF file. The workers do not interact with each other.
I need to call this plot function multiple times, since I have many parameter lists, like the following:
parameters = [parameterList0, parameterList1, ... parameterListn]
for param in parameters:
    driver = PlotDriver()
    driver.plot(param)
If parameters contains only one parameterList (the for loop runs only once), the code seems to work fine. But it consistently fails whenever parameters contains more than one element, with the following error message appearing on the second pass through the loop.
Traceback (most recent call last):
  File "plot.py", line 59, in <module>
    plottingDriver.plot(outputFile_handle)
  File "/home/yingryic/PlotDriver.py", line 69, in plot
    pool.map(plotWrapper, workerParamList)
  File "/home/yingryic/.conda/envs/pp/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/home/yingryic/.conda/envs/pp/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
RuntimeError: In set_text: could not load glyph
X Error: BadIDChoice (invalid resource ID chosen for this connection) 14
  Extension: 138 (RENDER)
  Minor opcode: 17 (RenderCreateGlyphSet)
  Resource id: 0xe00002
: Fatal IO error: client killed
Any idea what is going wrong and how I should fix it?

You can try placing import matplotlib into plotWorkerFunction() so that child processes will have their own copy of the module.

Pythons multiprocess module (with dill) gives an unhelpful AssertionError

I have installed dill/pathos and its dependencies (with some difficulty) and I'm trying to run a function over several processes. The class/attribute Model(self.xml,self.exp_data,i).SSR is custom made and depends on loads of other custom functions, so I apologize in advance for not being able to provide 'runnable' code. In brief, however, it takes some experimental data, integrates a system of ODEs with Python's pysces module and calculates the sum of squared residuals (SSR). The purpose of parallelizing this code is to speed up this calculation across multiple parameter sets.
The code:
import multiprocess

def evaluate_chisq(pop):
    p = multiprocess.Pool(8)
    res = p.map(lambda i: Model(self.xml, self.exp_data, i).SSR, pop)  # calculate SSR with this parameter set
    return res
The error message I get is:
  File "C:\Anaconda1\lib\site-packages\multiprocess\pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "C:\Anaconda1\lib\site-packages\multiprocess\pool.py", line 567, in get
    raise self._value
AssertionError
Then I tried using map_async:
def evaluate_chisq(pop):
    p = multiprocess.Pool(8)
    res = p.map_async(lambda i: Model(self.xml, self.exp_data, i).SSR, pop)  # calculate SSR with this parameter set
    return res
which returns a <multiprocess.pool.MapResult object at 0x0000000014AF8C18> object that gives me the same error when I attempt to use the MapResult's get method:
  File "C:\Anaconda1\lib\site-packages\multiprocess\pool.py", line 567, in get
    raise self._value
AssertionError
Does anybody know what I'm doing wrong?

On Windows you need to use freeze_support from __main__.
See https://docs.python.org/2/library/multiprocessing.html#multiprocessing.freeze_support.
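For reference, the usual shape of that idiom looks roughly like the sketch below. It uses the standard library multiprocessing module and a hypothetical square function rather than the asker's Model class, on the assumption that the multiprocess fork exposes the same Pool and freeze_support names.

```python
from multiprocessing import Pool, freeze_support

def square(value):
    # Defined at module level so that worker processes can import it by
    # name when the standard library pickle is used.
    return value * value

def main():
    with Pool(2) as pool:
        return pool.map(square, [1, 2, 3])

if __name__ == "__main__":
    # freeze_support() has no effect unless the program has been frozen
    # into a Windows executable.
    freeze_support()
    print(main())
```

The important part on Windows is that the pool is only created under the __main__ guard, because each child process re-imports the main module when it starts.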

Create SAFEARRAY of Strings in Python

I'm trying to call a COM method that requires a SafeArray of Strings to be passed by reference, which is then filled with the method's results. This is the code in VBA, which works flawlessly:
Dim RC As New RAS41.HECRASController
RC.Project_Open "c:\myProj.prj"
Dim numMessages As Long
Dim messages() As String
RC.Compute_CurrentPlan( numMessages, messages())
Now I'm trying to do the same from within Python 3.4, using the win32com module. However, I'm stuck trying to create the second parameter with the correct type, which according to combrowse.py should be "Pointer SafeArray String".
This was my first attempt:
import win32com.client

RC = win32com.client.Dispatch("RAS41.HECRASController")
RC.Project_Open("c:\\myProj.prj")
numMessages = 0
messages = []
RC.Compute_CurrentPlan(numMessages, messages)
I also tried constructing that variable as
messages = win32com.client.VARIANT(pythoncom.VT_ARRAY | pythoncom.VT_BSTR, [])
but it didn't work either. Error messages look like this:
Traceback (most recent call last):
  File "<pyshell#101>", line 1, in <module>
    print(o.Compute_CurrentPlan(1,b))
  File "<COMObject RAS41.HECRASController>", line 3, in Compute_CurrentPlan
  File "C:\Python34\lib\site-packages\win32com\client\dynamic.py", line 282, in _ApplyTypes_
    result = self._oleobj_.InvokeTypes(*(dispid, LCID, wFlags, retType, argTypes) + args)
TypeError: Objects for SAFEARRAYS must be sequences (of sequences), or a buffer object.

Make sure that your Python variables are in the right format (Long and String). Try something like the following to get the variable types into shape:
messages = ['']
RC.Compute_CurrentPlan(long(numMessages), messages)
To make your program more flexible, you should check the variable types before the win32 call.

I realize this is an old question, but I ran into this issue and wanted to share the resolution. I was having trouble defining the data types for the first two arguments, but simply setting them to None works, and the number of messages and the compute messages are still reported (I checked by assigning test = hec.Compute_CurrentPlan(None, None, True) and then calling print(test)). The third argument is Blocking Mode; setting it to True means that the RAS computation will complete before execution moves to the next line of code. I am using Python 3.10.4 and HEC-RAS version 6.3.
import win32com.client
hec = win32com.client.Dispatch('RAS630.HECRASController')
hec.Project_Open(r"C:\myproj.prj")
hec.ShowRAS()
hec.Compute_CurrentPlan(None, None, True)
hec.QuitRAS()
