I have Python 3.10 installed in my D disk (D:/Python/Python310)
If I run:
import os
import sys
os.path.dirname(sys.executable) # return 'D:\\Python\\Python310'
but when I run python and use numpy to generate a big random array (x.shape = (8000, 200, 200, 3)), its "running" in the C: disk (which has only 16Go free for 727Go for D:).
So executing this line, reduce the free space of C to only 10Go: it's not viable for other opperation like running my model.
I'm facing this problem for my ML model who loads a lot of data
How can I change this: use D's disk free space instead of disk C
If it's not clear, tell me :)
Thanks
Related
I am writing a genetic optimization algorithm based on the deap package in python 2.7 (goal is to migrate to python 3 soon). As it is a pretty heavy process, some parts of the optimisation are processed using the multiprocessing package. Here is a summary outline of my program:
Configurations are read in and saved in a config object
Some additional pre-computations are made and saved as well in the config object
The optimisation starts (population is initialized randomly and mutations, crossover is applied to find a better solution) and some parts of it (evaluation function) are executed in multiprocessing
The results are saved
For the evaluation function, we need to have access to some parts of the config object (which after phase 2 stays a constant). Therefore we make it accessible to the different cores using a global (constant) variable:
from deap import base
import multiprocessing
toolbox = base.Toolbox()
def evaluate(ind):
# compute evaluation using config object
return(obj1,obj2)
toolbox.register('evaluate',evaluate)
def init_pool_global_vars(self, _config):
global config
config = _config
...
# setting up multiprocessing
pool = multiprocessing.Pool(processes=72, initializer=self.init_pool_global_vars,
initargs=[config])
toolbox.register('map', pool.map_async)
...
while tic < max_time:
# creating new individuals
# computing in optimisation the objective function on the different individuals
jobs = toolbox.map(toolbox.evaluate, ind)
fits = jobs.get()
# keeping best individuals
We basically make different iterations (big for loop) until a maximum time is reached. I have noticed that if I make the config object bigger (i.e. add big attributes to it, like a big numpy array) even if the code is still same it runs much slower (fewer iterations for the same timespan). So I thought I would make a specific config_multiprocessing object that contains only the attributes needed in the multiprocessing part and pass that as a global variable, but when I run it on 3 cores it is slower than with the big config object and on 72 cores, it is slightly faster, but not much.
What should I do in order to make sure my loops don't suffer in speed from the config object or from any other data manipulations I make before launching the multiprocessing loops?
Running in a Linux docker image on a linux VM in the cloud.
The joblib package is designed to handle cases where you have large numpy arrays to distribute to workers with shared memory. This is especially useful if you are treating the data in shared memory as "read-only" like what you describe in your scenario. You can also create writable shared memory as described in the docs.
Your code might look something like:
import os
import numpy as np
from joblib import Parallel, delayed
from joblib import dump, load
folder = './joblib_memmap'
try:
os.mkdir(folder)
except FileExistsError:
pass
def evaluate(ind, data):
# compute evaluation using shared memory data
return(obj1, obj2)
# just used to initialize memory mapped data
def init_memmap_data(original_data):
data_filename_memmap = os.path.join(folder, 'data_memmap')
dump(original_data, data_filename_memmap)
shared_data = load(data_filename_memmap, mmap_mode='r')
return shared_data
...
# however you set up indices needs to be changed here
indexes = range(10)
# however you load your numpy data needs to be done here
shared_data = init_memmap_data(numpy_array_to_share)
# change n_jobs as appropriate
results = Parallel(n_jobs=2)(delayed(evaluate)(ind, shared_data) for ind in indexes)
# get index of the maximum as the "best" individual
best_fit_individual = indexes[results.argmax()]
Additionally, joblib supports a threading backend that may be faster than the process based one. It will be easy to test both with joblib.
My python script passes changing inputs to a program called "Dymola", which in turn performs a simulation to generate outputs. Those outputs are stored as numpy arrays "out1.npy".
for i in range(0,100):
#code to initiate simulation
print(startValues, 'ParameterSet:', ParameterSet,'time:', stoptime)
np.save('out1.npy', output_data)
Unfortunately, Dymola crashes very often, which makes it necessary to rerun the loop from the time displayed in the console when it has crashed (e.g.: 50) and increase the number of the output file by 1. Otherwise the data from the first set would be overwritten.
for i in range(50,100):
#code to initiate simulation
print(startValues, 'ParameterSet:', ParameterSet,'time:', stoptime)
np.save('out2.npy', output_data)
Is there any way to read out the 'stoptime' value (e.g. 50) out of the console after Dymola has crashed?
I'm assuming dymola is a third-party entity that you cannot change.
One possibility is to use the subprocess module to start dymola and read its output from your program, either line-by-line as it runs, or all after the created process exits. You also have access to dymola's exit status.
If it's a Windows-y thing that doesn't do stream output but manipulates a window GUI-style, and if it doesn't generate a useful exit status code, your best bet might be to look at what files it has created while or after it has gone. sorted( glob.glob("somepath/*.out")) may be useful?
I assume you're using the dymola interface to simulate your model. If so, why don't you use the return value of the dymola.simulate() function and check for errors.
E.g.:
crash_counter = 1
from dymola.dymola_interface import DymolaInterface
dymola = DymolaInterface()
for i in range(0,100):
res = dymola.simulate("myModel")
if not res:
crash_counter += 1
print(startValues, 'ParameterSet:', ParameterSet,'time:', stoptime)
np.save('out%d.npy'%crash_counter, output_data)
As it is sometimes difficult to install the DymolaInterface on your machine, here is a useful link.
Taken from there:
The Dymola Python Interface comes in the form of a few modules at \Dymola 2018\Modelica\Library\python_interface. The modules are bundled within the dymola.egg file.
To install:
The recommended way to use the package is to append the \Dymola 2018\Modelica\Library\python_interface\dymola.egg file to your PYTHONPATH environment variable. You can do so from the Windows command line via set PYTHONPATH=%PYTHONPATH%;D:\Program Files (x86)\Dymola 2018\Modelica\Library\python_interface\dymola.egg.
If this does not work, append the following code before instantiating the interface:
import os
import sys
sys.path.insert(0, os.path.join('PATHTODYMOLA',
'Modelica',
'Library',
'python_interface',
'dymola.egg'))
Running Python 3.4 on Windows 7, the close function of Gio.MemoryInputStream does not free the memory, as it should. The test code is :
from gi.repository import Gio
import os, psutil
process = psutil.Process(os.getpid())
for i in range (1,10) :
input_stream = Gio.MemoryInputStream.new_from_data(b"x" * 10**7)
x = input_stream.close_async(2)
y = int(process.memory_info().rss / 10**6) # Get the size of memory used by the program
print (x, y)
This returns :
True 25
True 35
True 45
True 55
True 65
True 75
True 85
True 95
True 105
This shows that on each loop, the memory used by the program increases of 10 MB, even if the close function returned True.
How is it possible to free the memory, once the Stream is closed ?
Another good solution would be to reuse the stream. But set_data or replace_data raises the following error :
'Data access methods are unsupported. Use normal Python attributes instead'
Fine, but which property ?
I need a stream in memory in Python 3.4. I create a Pdf File with PyPDF2, and then I want to preview it with Poppler. Due to a bug in Poppler (see Has anyone been able to use poppler new_from_data in python?) I cannot use the new_from_data function and would like to use the new_from_stream function.
This is a bug in GLib’s Python bindings which can’t be trivially fixed.
Instead, you should use g_memory_input_stream_new_from_bytes(), which handles freeing memory differently, and shouldn’t suffer from the same bug.
In more detail, the bug with new_from_data() is caused by the introspection annotations, which GLib uses to allow language bindings to automatically expose all of its API, not supporting the GDestroyNotify parameter for new_from_data() which needs to be set to a non-NULL function to free the allocated memory which is passed in to the other arguments. Running your script under gdb shows that pygobject passes NULL to the GDestroyNotify parameter. It can’t do any better, since there is currently no way of expressing that the memory management semantics of the data parameter depend on what’s passed to destroy.
Thanks for your answer, #Philip Withnall. I tested the solution you propose, and it works. To help others to understand, here is my test code :
from gi.repository import Gio, GLib
import os, psutil
process = psutil.Process(os.getpid())
for i in range (1,10) :
input_stream = Gio.MemoryInputStream.new_from_bytes(GLib.Bytes(b"x" * 10**7))
x = input_stream.close()
y = int(process.memory_info().rss / 10**6) # Get the size of memory used by the program
print (x, y)
Now y non longer grows.
I get an IOError: bad message length when passing large arguments to the map function. How can I avoid this?
The error occurs when I set N=1500 or bigger.
The code is:
import numpy as np
import multiprocessing
def func(args):
i=args[0]
images=args[1]
print i
return 0
N=1500 #N=1000 works fine
images=[]
for i in np.arange(N):
images.append(np.random.random_integers(1,100,size=(500,500)))
iter_args=[]
for i in range(0,1):
iter_args.append([i,images])
pool=multiprocessing.Pool()
print pool
pool.map(func,iter_args)
In the docs of multiprocessing there is the function recv_bytes that raises an IOError. Could it be because of this? (https://python.readthedocs.org/en/v2.7.2/library/multiprocessing.html)
EDIT
If I use images as a numpy array instead of a list, I get a different error: SystemError: NULL result without error in PyObject_Call.
A bit different code:
import numpy as np
import multiprocessing
def func(args):
i=args[0]
images=args[1]
print i
return 0
N=1500 #N=1000 works fine
images=[]
for i in np.arange(N):
images.append(np.random.random_integers(1,100,size=(500,500)))
images=np.array(images) #new
iter_args=[]
for i in range(0,1):
iter_args.append([i,images])
pool=multiprocessing.Pool()
print pool
pool.map(func,iter_args)
EDIT2 The actual function that I use is:
def func(args):
i=args[0]
images=args[1]
image=np.mean(images,axis=0)
np.savetxt("image%d.txt"%(i),image)
return 0
Additionally, the iter_args do not contain the same set of images:
iter_args=[]
for i in range(0,1):
rand_ind=np.random.random_integers(0,N-1,N)
iter_args.append([i,images[rand_ind]])
You're creating a pool and sending all the images at once to func(). If you can get away with working on a single image at once, try something like this, which runs to completion with N=10000 in 35s with Python 2.7.10 for me:
import numpy as np
import multiprocessing
def func(args):
i = args[0]
img = args[1]
print "{}: {} {}".format(i, img.shape, img.sum())
return 0
N=10000
images = ((i, np.random.random_integers(1,100,size=(500,500))) for i in xrange(N))
pool=multiprocessing.Pool(4)
pool.imap(func, images)
pool.close()
pool.join()
The key here is to use iterators so you don't have to hold all the data in memory at once. For instance I converted images from an array holding all the data to a generator expression to create the image only when needed. You could modify this to load your images from disk or whatever. I also used pool.imap instead of pool.map.
If you can, try to load the image data in the worker function. Right now you have to serialize all the data and ship it across to another process. If your image data is larger, this might be a bottleneck.
[update now that we know func has to handle all images at once]
You could do an iterative mean on your images. Here's a solution without using multiprocessing. To use multiprocessing, you could divide your images into chunks, and farm those chunks out to the pool.
import numpy as np
N=10000
shape = (500,500)
def func(images):
average = np.full(shape, 0)
for i, img in images:
average += img / N
return average
images = ((i, np.full(shape,i)) for i in range(N))
print func(images)
Python is likely to load your data in your RAM memory and you need this memory to be available. Have you checked your computer memory usage ?
Also as Patrick mentioned, you're loading 3GB of data, make sure you use the 64 bits version of Python as you are reaching the 32 bits memory contraint. This could cause your process to crash : 32 vs 64 bits Python
Another improvement would be to use python 3.4 instead of 2.7. Python 3 implementation seems to be optimized for very large ranges, see Python3 vs Python2 list/generator range performance
When running your program it actually gives me an clear error:
OSError: [Errno 12] Cannot allocate memory
Like mentioned by other users, the solution to your problem is simple add memory(a lot) or change the way your program is handling the images.
The reason it's using so much memory is because you allocate your memory for your images on a module level. So when multiprocess forks your process it's also copying all the images (which isn't free according to Shared-memory objects in python multiprocessing), this is not necessary because you are also giving the images as an argument to the function which the multiprocess module also copies using ipc and pickle, this would still likely result in a lack of memory. Try one of the proposed solutions given by the other users.
This is what solved the problem: declaring the images global.
import numpy as np
import multiprocessing
N=1500 #N=1000 works fine
images=[]
for i in np.arange(N):
images.append(np.random.random_integers(1,100,size=(500,500)))
def func(args):
i=args[0]
images=images
print i
return 0
iter_args=[]
for i in range(0,1):
iter_args.append([i])
pool=multiprocessing.Pool()
print pool
pool.map(func,iter_args)
The reason why you get IOError: bad message length when passing around large objects is due to a hard coded limit in older CPython versions (3.2 and earlier) of 0x7fffffff Bytes or around 2.1GB: https://github.com/python/cpython/blob/v2.7.5/Modules/_multiprocessing/multiprocessing.h#L182
This CPython changeset (which is in CPython 3.3 and later) removed the hard coded limit: https://github.com/python/cpython/commit/87cf220972c9cb400ddcd577962883dcc5dca51a#diff-4711c9abeca41b149f648d4b3c15b6a7d2baa06aa066f46359e4498eb8e39f60L182
I am developing an application for color blind people to enable them smoothly surf the Internet. I have a set of colors, lets say A , which consists of all the colors seen by a color blind person. Set A is calculated using a big calculation involving millions of colors. Set A is independent of inputs taken in my application i.e set A is like a 'constant' to me (just like 'pi' in mathematics). Now I want to store set A so that whenever I run my application, it is available without any added computational cost i.e i don't have to calculate A every time I run my application.
My Try:
I think this can be done by building a class having one constant but can it be done without creating any special class for just a constant?
I am using Python!
No need for a class. You want to store the calculated values on disk and load them back again on startup: for that you will want to look into the shelve or pickle libraries.
Yes, you can certainly do this with Python
If your constant was just a number -- say, you had just discovered tau -- then you would just declare it in a module, and import that module in all of your other source files:
constants.py:
# Define my new super-useful number
TAU = 6.28318530718
everywhere else:
from constants import TAU # Look, no calculations!
Expanding a bit, if you had a more complicated structure, like a dictionary, that took you a long time to compute, then you could just declare that in your module instead:
constants.py:
# Verified results of the national survey
PEPSI_CHALLENGE = {
'Pepsi': 0.57,
'Coke': 0.43,
}
And you can do this for more and more complicated data. The problem, eventually, is that just writing your constants module gets harder and harder, the more complex your data is, and it can be especially hard to update if you occasionally recompute the value you want to cache. In that case, you want to look at pickling the data, possibly as the final step of a python script which calculates it, and then load that data in a module that you import.
To do that, import pickle, and dump a single object out to a disk file:
recalculate.py:
# Here is the script that computes a small value from the hugely complicated domain:
import random
from itertools import groupby
import pickle
# Collect all of the random numbers
random_numbers = [random.randint(0,10) for r in xrange(1000000)]
# TODO: Check this -- this should definitely be 7
most_popular = max(groupby(sorted(random_numbers)),
key=lambda(x, v):(len(list(v)),-L.index(x)))[0]
# Now save the most common random number to disk, using pickle
# Almost any object is picklable like this, but check the docs for the exact details
pickle.dump(most_popular, open('data_cache','w'))
Now, in your constants file, you can simply read the pickled data from the file on disk, and have it available without recalculating it:
constants.py:
import pickle
most_popular = pickle.load(open('data_cache'))
everywhere else:
from constants import most_popular