What are the different use cases of joblib versus pickle? - python

Background: I'm just getting started with scikit-learn, and read at the bottom of the page about joblib versus pickle:
"it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string"
I read this Q&A on pickle, Common use-cases for pickle in Python, and wonder if the community here can share the differences between joblib and pickle. When should one be used over the other?

joblib is usually significantly faster on large numpy arrays because it has special handling for the array buffers of the numpy data structure. To find out about the implementation details you can have a look at the source code. It can also compress that data on the fly while pickling, using zlib or lz4.
joblib also makes it possible to memory-map the data buffer of an uncompressed joblib-pickled numpy array when loading it, which makes it possible to share memory between processes.
If you don't pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the pickle module of the standard library is implemented in C while joblib is pure Python.
Since PEP 574 (pickle protocol 5) was merged in Python 3.8, it is now much more efficient (memory-wise and CPU-wise) to pickle large numpy arrays using the standard library. Large arrays in this context means 4 GB or more.
But joblib can still be useful with Python 3.8 to load objects that have nested numpy arrays in memory-mapped mode with mmap_mode="r".
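Here is a minimal sketch of the points above; the file names, array size, and compression level are arbitrary examples, not anything prescribed by joblib or pickle:
import pickle
import numpy as np
import joblib

arr = np.random.rand(2000, 2000)  # stand-in for a large numpy array

# joblib: compress on the fly while dumping to disk
joblib.dump(arr, 'arr_compressed.joblib', compress=('zlib', 3))
arr_back = joblib.load('arr_compressed.joblib')

# joblib: memory-map an *uncompressed* dump so several processes can share the buffer
joblib.dump(arr, 'arr_raw.joblib')
arr_mmap = joblib.load('arr_raw.joblib', mmap_mode='r')

# standard library, Python 3.8+: pickle protocol 5 handles the array buffer efficiently
with open('arr.pkl', 'wb') as f:
    pickle.dump(arr, f, protocol=5)
with open('arr.pkl', 'rb') as f:
    arr_back2 = pickle.load(f)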

Thanks to Gunjan for giving us this script! I modified it for Python 3; here are the results.
# compare pickle loaders
from time import time
import pickle
import os
import _pickle as cPickle
from sklearn.externals import joblib

# note: os.path.getsize returns bytes, despite the "KB" label in the messages below
file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'database.clf')

t1 = time()
d = pickle.load(open(file, "rb"))
print("time for loading file size with pickle", os.path.getsize(file), "KB =>", time() - t1)

t1 = time()
cPickle.load(open(file, "rb"))
print("time for loading file size with cpickle", os.path.getsize(file), "KB =>", time() - t1)

t1 = time()
joblib.load(file)
print("time for loading file size joblib", os.path.getsize(file), "KB =>", time() - t1)
time for loading file size with pickle 79708 KB => 0.16768312454223633
time for loading file size with cpickle 79708 KB => 0.0002372264862060547
time for loading file size joblib 79708 KB => 0.0006849765777587891

I came across the same question, so I tried this (with Python 2.7), as I need to load a large pickle file
# compare pickle loaders
from time import time
import pickle
import os
try:
    import cPickle
except ImportError:
    print "Cannot import cPickle"
import joblib

t1 = time()
d = pickle.load(open("classi.pickle", "rb"))
print "time for loading file size with pickle", os.path.getsize("classi.pickle"), "KB =>", time() - t1

t1 = time()
cPickle.load(open("classi.pickle", "rb"))
print "time for loading file size with cpickle", os.path.getsize("classi.pickle"), "KB =>", time() - t1

t1 = time()
joblib.load("classi.pickle")
print "time for loading file size joblib", os.path.getsize("classi.pickle"), "KB =>", time() - t1
The output for this is:
time for loading file size with pickle 1154320653 KB => 6.75876188278
time for loading file size with cpickle 1154320653 KB => 52.6876490116
time for loading file size joblib 1154320653 KB => 6.27503800392
According to this, joblib works better than the cPickle and pickle modules out of these three. Thanks

Just a humble note ...
Pickle is better for fitted scikit-learn estimators / trained models. In ML applications, trained models are saved and loaded back mainly for prediction.
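As a small illustration of that save-then-predict workflow (the estimator, dataset, and file name are arbitrary choices for the example, not part of the original answer):
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# persist the fitted estimator once training is done
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# later (e.g. in a prediction service), load it back and predict
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
print(model.predict(X[:5]))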

Related

Python fast way to save huge numpy array as lossless image (tiff)

I have a program that processes huge RGB images in the range of 30000x30000 px.
To load them I use Pillow, which works well.
Then I process them with NumPy, and then I need to save them losslessly as TIFF.
However, whether I'm using Pillow or OpenCV, this takes very long compared to the runtime of everything else. I think this is because of the image compression. Without compression the saving does not take long at all, but then my files are >2 GB.
I found the module tifffile, but it takes just as long as OpenCV, unless I missed a parameter.
Is there a module that can compress faster? The ones I tried only use one CPU core.
It also seems to be faster on an Intel machine (i7-9700K, 16 GB) than on my PC with an AMD Ryzen 5600X and 32 GB?
Here is the code I used to test:
from PIL import Image
import cv2
import tifffile
import numpy as np
import time
arr = np.random.default_rng().integers(0, 255, size=(30000,30000,3), endpoint=True, dtype=np.uint8)
st = time.time()
Image.fromarray(arr).save("test_pil.tiff", compression="tiff_adobe_deflate")
print(f"Pil took {time.time()-st} s")
st = time.time()
cv2.imwrite("test_cv2.tiff", arr, params=(cv2.IMWRITE_TIFF_COMPRESSION, 32946))
print(f"Opencv took {time.time()-st} s")
st = time.time()
tifffile.imwrite("test_tifff.tiff", arr, compression="zlib", compressionargs={'level':5}, predictor=True, tile=(64,64))
print(f"Tifffile took {time.time()-st} s")
I know these also use different compression algorithms, but I haven't found matching parameters. This feature is generally very poorly documented.
Result (intel):
Pil took 32.01173210144043 s
Opencv took 60.46461296081543 s
Tifffile took 59.410102128982544 s
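One knob that might address the single-core bottleneck described above is tifffile's maxworkers argument, which, as far as I know, lets recent tifffile versions compress tiles on several threads. Treat this as a hedged sketch reusing the settings from the question, not a verified benchmark:
import time
import numpy as np
import tifffile

arr = np.random.default_rng().integers(0, 255, size=(30000, 30000, 3), endpoint=True, dtype=np.uint8)

st = time.time()
# same zlib/tile settings as in the question, plus maxworkers to compress tiles in parallel
tifffile.imwrite("test_tifff_mt.tiff", arr, compression="zlib",
                 compressionargs={'level': 5}, predictor=True,
                 tile=(64, 64), maxworkers=8)
print(f"Tifffile (maxworkers=8) took {time.time()-st} s")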

How to pickle files > 2 GiB by splitting them into smaller fragments [duplicate]

This question already has an answer here: Export machine learning model (1 answer)
Closed 5 years ago.
I have a classifier object that is larger than 2 GiB and I want to pickle it, but I got this:
cPickle.dump(clf, fo, protocol = cPickle.HIGHEST_PROTOCOL)
OverflowError: cannot serialize a string larger than 2 GiB
I found this question that has the same problem and it was suggested there to either
use Python 3 protocol 4 - Not acceptable as I need to use Python 2
use from pyocser import ocdumps, ocloads - Not acceptable as I can't use other (non-trivial) modules
break the object into bytes and pickle each fragment
Is there a way to do so with my classifier? i.e. turn it into bytes, split, pickle, unpickle, concatenate the bytes, and use the classifier?
My code:
from sklearn.svm import SVC
import cPickle
import time

def train_clf(X, y, clf_name):
    start_time = time.time()
    # after many tests, this was found to be the best classifier
    clf = SVC(C=0.01, kernel='poly')
    clf.fit(X, y)
    print 'fit done... {} seconds'.format(time.time() - start_time)
    with open(clf_name, "wb") as fo:
        cPickle.dump(clf, fo, protocol=cPickle.HIGHEST_PROTOCOL)
        # cPickle.HIGHEST_PROTOCOL == 2
        # the error occurs inside the dump method
    return time.time() - start_time
After this, I want to unpickle it and use it:
with open(clf_name, 'rb') as fo:
    clf, load_time = cPickle.load(fo), time.time()
You can use sklearn.externals.joblib, which automatically splits the model into multiple pickled numpy array files if the model is large:
from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl')
Update: newer versions of sklearn will show
DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
So use this instead:
import joblib
joblib.dump(clf, 'filename.pkl')
which can be unpickled later using:
clf = joblib.load('filename.pkl')

Load .npy file with np.load progress bar

I have a really large .npy file (previously saved with np.save) and I am loading it with:
np.load(open('file.npy'))
Is there any way to see the progress of the loading process? I know tqdm and some other libraries for monitoring progress, but I don't know how to use them for this problem.
Thank you!
As far as I am aware, np.load does not provide any callbacks or hooks to monitor progress. However, there is a workaround which may do the job: np.load can open the file as a memory-mapped file, which means the data stays on disk and is loaded into memory only on demand. We can abuse this machinery to manually copy the data from the memory-mapped file into actual memory using a loop whose progress can be monitored.
Here is an example with a crude progress monitor:
import numpy as np

x = np.random.randn(8096, 4096)
np.save('file.npy', x)

blocksize = 1024  # tune this for performance/granularity
try:
    mmap = np.load('file.npy', mmap_mode='r')
    y = np.empty_like(mmap)
    n_blocks = int(np.ceil(mmap.shape[0] / blocksize))
    for b in range(n_blocks):
        print('progress: {}/{}'.format(b, n_blocks))  # use any progress indicator
        y[b*blocksize : (b+1) * blocksize] = mmap[b*blocksize : (b+1) * blocksize]
finally:
    del mmap  # make sure the file is closed again

assert np.all(y == x)
Plugging any progress-bar library into the loop should be straightforward.
I was unable to test this with exceptionally large arrays due to memory constraints, so I can't really tell if this approach has any performance issues.
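For example, with tqdm (one of the libraries mentioned in the question), the print call can simply be replaced by wrapping the loop; this is just a sketch reusing the same block-copy idea and 'file.npy' from above:
import numpy as np
from tqdm import tqdm

blocksize = 1024
mmap = np.load('file.npy', mmap_mode='r')
y = np.empty_like(mmap)
n_blocks = int(np.ceil(mmap.shape[0] / blocksize))
for b in tqdm(range(n_blocks), desc='loading file.npy'):
    y[b*blocksize : (b+1)*blocksize] = mmap[b*blocksize : (b+1)*blocksize]
del mmap  # close the memory-mapped file again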

How to store a python ndarray on disk?

I have a pkl file containing an ndarray that I originally dumped using a GPU. I unpickle it with the GPU, and now I want to store it in some format that I can later read back using only a CPU. I run everything on a supercomputer, and later I just want to have access to the ndarrays on a normal computer without a fancy GPU. I looked into functions such as
np.save()
np.savez()
but with save() I can't set allow_pickle=False, and when I load the array stored with savez() it comes back empty.
This is how I save things:
I run THEANO_FLAGS="device=gpu,floatX=float32" srun -u python deep_q_rl/unpicklestuff.py
unpicklestuff.py:
import sys
import cPickle
import lasagne.layers
import os
import numpy as np

for i in os.listdir(path):
    net_file = open(path + str(i), 'r')
    network = cPickle.load(net_file)
    q_layers = lasagne.layers.get_all_layers(network.l_out)
    np.savez(savepath + str(i), q_layers)
And this is how I load them later:
q_layers = np.load(path)
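For reference, np.savez writes an archive of named arrays rather than a single array, so np.load returns a dict-like NpzFile that has to be indexed by key before you see any data; a minimal sketch with made-up arrays and file names:
import numpy as np

layers = [np.arange(10), np.ones((3, 3))]  # stand-ins for the layer arrays

# savez stores each positional argument under a default name arr_0, arr_1, ...
np.savez('layers.npz', *layers)

data = np.load('layers.npz')   # NpzFile, behaves like a dict of arrays
print(data.files)              # ['arr_0', 'arr_1']
q_layers = [data[name] for name in data.files]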

Not sure why memory usage seen in `top` is unintuitive

I am working with a simple numpy.dtype array, and I am using the numpy.savez and numpy.load methods to save the array to disk and read it back. During both storing and loading of the array, the memory usage shown by 'top' is not what I would expect. Below is sample code that demonstrates this.
import sys
import numpy as np
import time

RouteEntryNP = np.dtype([('final', 'u1'), ('prefix_len', 'u1'),
                         ('output_idx', '>u4'), ('children', 'O')])

a = np.zeros(1000000, RouteEntryNP)
time.sleep(10)

print(sys.getsizeof(a))

with open('test.np.npz', 'wb+') as f:
    np.savez(f, a=a)

while True:
    time.sleep(10)
The program starts with a memory usage of about 25M, which is roughly what intuition suggests: each RouteEntryNP record is 14 bytes, so the array itself should be around 14M. But as the data is being written to the file, the memory usage shoots up to approximately 250M.
Similar behavior is observed when loading the file; in this case the memory usage shoots up to approximately 160M, and an explicit gc.collect() doesn't seem to help either. The way I am reading the file is as follows:
import numpy as np
np.load('test.np.npz')
import gc
gc.collect()
The memory usage stays at about 160M. I am not sure why this is happening. Is there a way to 'reclaim' this memory?
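For what it's worth, the per-record and whole-array sizes mentioned above can be checked directly from numpy, independently of what top reports; this is just a sanity-check sketch, not an explanation of the savez/load behavior:
import numpy as np

RouteEntryNP = np.dtype([('final', 'u1'), ('prefix_len', 'u1'),
                         ('output_idx', '>u4'), ('children', 'O')])
a = np.zeros(1000000, RouteEntryNP)

print(RouteEntryNP.itemsize)  # 14 bytes per record (1 + 1 + 4 + 8 on a 64-bit build)
print(a.nbytes)               # 14000000 bytes, i.e. ~14M for the raw buffer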
