Use fasttext model (gensim) with threading - python

Is it possible to access a fasttext model (gensim) using multithreading?
Currently, I'm trying to load the model once (due to its size and loading time) so it stays in memory, and then call its similarity functions many thousands of times in a row. I want to do that in parallel, and my current approach uses a wrapper class that loads the model and is then passed to the workers. But it looks like it does not return any results.
The wrapper class, instantiated once:
from os import path

from gensim.models.fasttext import load_facebook_model

class FastTextLocalModel:
    def __init__(self):
        self.model_name = "cc.de.300.bin"
        self.model_path = path.join("data", "models", self.model_name)
        self.fast_text = None

    def load_model(self):
        self.fast_text = load_facebook_model(self.model_path)

    def similarity(self, word1: str = None, word2: str = None):
        return self.fast_text.wv.similarity(word1, word2)
And the Processor class makes use of the FastTextLocalModel methods above:
import concurrent.futures
import multiprocessing

fast_text_instance = FastTextLocalModel()
fast_text_instance.load_model()

with concurrent.futures.ThreadPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
    docs = corpus.get_documents()  # docs is iterable
    processor = ProcessorClass(model=fast_text_instance)
    executor.map(processor.process, docs)
Using max_workers=1 seems to work.
I have to mention that I have no expertise in python multithreading.

There may be useful ideas for you in this prior answer, which may need adaptation for FastText & latest versions of gensim:
https://stackoverflow.com/a/43067907/130288
(The key points are:
even redundantly loading the model in different processes may not use redundant memory, if the big memory-consuming arrays are mmapped and thus automatically shared at the OS level; and
you have to do a little extra trickery to prevent the usual recalculation of normed vectors, which normally happens after loading and before similarity operations, and which would destroy the sharing
...but messiness in the FastText code might make these a bit harder there.)
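For reference, here is a minimal sketch of that idea using processes rather than threads (assuming gensim 4.x; the file paths, the convert_once helper and the word pairs are purely illustrative). The vectors are saved once in gensim's native format and then memory-mapped read-only in each worker, so the OS can share the pages instead of duplicating them:

from multiprocessing import Pool

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

VECTORS_PATH = "data/models/cc.de.300.kv"   # illustrative path

def convert_once():
    # One-time preparation: convert the Facebook .bin model to gensim's
    # native format so the big arrays can be memory-mapped afterwards.
    wv = load_facebook_vectors("data/models/cc.de.300.bin")
    wv.save(VECTORS_PATH)

_worker_wv = None

def init_worker():
    global _worker_wv
    # mmap='r' maps the large arrays read-only; the OS shares those pages
    # between worker processes instead of duplicating them per process.
    _worker_wv = KeyedVectors.load(VECTORS_PATH, mmap="r")

def similarity_job(pair):
    word1, word2 = pair
    return _worker_wv.similarity(word1, word2)

if __name__ == "__main__":
    convert_once()
    pairs = [("haus", "wohnung"), ("katze", "hund")]
    with Pool(initializer=init_worker) as pool:
        print(pool.map(similarity_job, pairs))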

Related

Is Klepto (Python module) file_archive supposed to be 10x slower than Pickle?

Paging @mike-mckerns I suppose, but I'd be grateful for answers from anyone with experience with the Klepto module for Python (https://pypi.org/project/klepto/).
My situation is that I'm running a simulation which involves generating and logging several tens of thousands of objects which are each a combination of strings and numerical values. (By which I just meant to say that these objects cannot be trivially abstracted into e.g. a numpy array.)
Due to the sheer number of objects generated, I ran into memory issues. My solution so far has been to use pickle to individually dump each instance to its own pickle file.
The problem is that this leaves my simulation data spread across a good 30k individual files (of about ~5kb size each). This is cumbersome when trying to move or share the data from past simulations, the total size is manageable but the number of individual files has been a problem.
This is why I ended up with Klepto as a possible solution. The file_archive function I thought would let me use a single file as my 'external' dictionary instead of needing to give every instance its own pickle file.
I don't understand the module very well yet, so I tried to implement it as simply as possible. My code basically works as follows:
from klepto.archives import file_archive

ExternalObjectDictionary = file_archive('data/EOD.pkl', {}, serialized=True, cached=False)
ObjectCounter = 0

class SimObject:
    def __init__(self):
        self.name = 'name'
        self.data1 = 100
        self.data2 = ['pear', 'apple', 'banana']
        # (Above values would be passed as arguments by the simulation)
        global ExternalObjectDictionary
        global ObjectCounter
        ObjectCounter += 1
        self.ID = ObjectCounter
        ExternalObjectDictionary[self.ID] = ObjectData(self.name, self.data1, self.data2)
        self.clear_data()

    def load_data(self):
        global ExternalObjectDictionary
        ObjData = ExternalObjectDictionary[self.ID]
        self.name = ObjData.name
        self.data1 = ObjData.data1
        self.data2 = ObjData.data2

    def clear_data(self):
        self.name = None
        self.data1 = None
        self.data2 = None

class ObjectData:
    def __init__(self, name, data1, data2):
        self.name = name
        self.data1 = data1
        self.data2 = data2

# The simulation would call the data in a sequence as follows:
Obj1 = SimObject()
Obj1.load_data()
print(Obj1.name)
Obj1.clear_data()
When an Object is no longer needed, I destroy it simply with del ExternalObjectDictionary[x].
By itself, the implementation seems to work fine, EXCEPT that it ends up being something like a factor of 10 or 20 slower than when I simply used pickle.dump() and pickle.load() on individual pickle files.
Am I using Klepto wrong, or is trying to dump/load from a single file simply inherently going to be this much slower than using individual files? I looked at a number of options, and Klepto seemed to offer the most straightforward read-from-file dictionary functionality for my needs, but perhaps I misunderstood how to use it?
Apologies if my code examples are oversimplified; I hope I've explained the issue clearly enough for someone to respond and clear things up! If need be, I can continue using my current solution of tens of thousands of individual pickle files, but an alternative method would be great!
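For context, one mitigation often suggested for this pattern (a hedged sketch, not a verified fix for this exact workload) is to keep the archive cached in memory and flush it to the single backing file explicitly, so that each item assignment does not trigger a re-serialization of the whole file:

from klepto.archives import file_archive

# cached=True keeps entries in an in-memory dict; dump() pushes pending
# entries to the single backing file in one batch instead of per assignment.
archive = file_archive('data/EOD.pkl', {}, serialized=True, cached=True)

for i in range(1000):
    archive[i] = {'name': 'name', 'data1': 100, 'data2': ['pear', 'apple']}

archive.dump()    # one write for everything currently in the cache
archive.clear()   # optionally drop the in-memory copies afterwards

archive.load(42)  # pull a single key back from the file archive into the cache
print(archive[42])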

Share Python dict across many processes

I am developing a heuristic algorithm to find "good" solutions to an NP (hence CPU-intensive) problem.
I am implementing my solution using Python (I agree it is not the best choice when speed is a concern, but so it is) and I am splitting the workload across many subprocesses, each one in charge of exploring a branch of the space of possible solutions.
To improve performance I would like to share some information gathered during the execution of each subprocess among all subprocesses.
The "obvious" way to gather such information is gathering them inside a dictionary whose keys are (frozen)sets of integers and values are lists (or sets) of integers.
Hence the shared dictionary must both be readable and writable from each subprocess, but I can safely expect that reads will be far more frequent than writes because a subprocess will write to the shared dict only when it finds something "interesting" and will read the dict far more frequently to know if a certain solution has already been evaluated by other processes (to avoid exploring the same branch twice or more).
I do not expect the dimension of such dictionary to exceed 10 MB.
At the moment I implemented the shared dict using an instance of multiprocessing.Manager() that takes care of handling concurrent accesses to the shared dictionary out of the box.
However (according to what I have found) this way of sharing data is implemented using pipes between processes which are a lot slower than plain and simple shared memory (moreover the dictionary must be pickled before being sent through the pipe and unpickled when it is received).
So far my code looks like this:
# main.py
import multiprocessing as mp
import os

def worker(a, b, c, shared_dict):
    while condition:
        # do things
        # sometimes read from shared_dict to check if a candidate solution
        # has already been evaluated by another process;
        # if not, evaluate it and store it inside shared_dict together with some related info
        ...
    return worker_result

def main():
    with mp.Manager() as manager:
        # setup params a, b, c, ...
        # ...
        shared_dict = manager.dict()
        n_processes = os.cpu_count()
        with mp.Pool(processes=n_processes) as pool:
            async_results = [pool.apply_async(worker, (a, b, c, shared_dict))
                             for _ in range(n_processes)]
            results = [res.get() for res in async_results]
    # gather the overall result from the 'results' list

if __name__ == '__main__':
    main()
To avoid the overhead due to pipes I would like to use shared memory, but it doesn't seem that the Python standard library offers a straightforward way to handle a dictionary in shared memory.
As far as I know the Python standard library offers helpers to store data in shared memory only for standard ctypes (with multiprocessing.Value and multiprocessing.Array) or gives you access to raw areas of shared memory.
I do not want to implement my own hash table in a raw area of shared memory, since I am an expert in neither hash tables nor concurrent programming; instead I am wondering whether there are other, faster solutions to my needs that don't require writing everything from scratch.
For example, I have seen that the ray library allows reading data written to shared memory far faster than using pipes; however, it seems that you cannot modify a dictionary once it has been serialized and written to a shared-memory area.
Any help?
Unfortunately shared memory in Ray must be immutable. Typically, it is recommended that you use actors for mutable state. (see here).
You can do a couple of tricks with actors. For example, you can store object references in your dict if the values are immutable. Then the dict itself won't be in shared memory, but all of its objects would be.
import ray
import numpy as np

ray.init()

@ray.remote
class DictActor:
    def __init__(self):
        self._dict = {}

    def put(self, key, value):
        # Store only an object reference; the value itself lives in
        # Ray's shared-memory object store.
        self._dict[key] = ray.put(value)

    def get(self, key):
        return self._dict[key]

d = DictActor.remote()
ray.get(d.put.remote("a", np.zeros(100)))
ray.get(d.get.remote("a"))  # This result is in shared memory.

How does one pickle arbitrary pytorch models that use lambda functions?

I currently have a neural network module:
import torch.nn as nn
class NN(nn.Module):
def __init__(self,args,lambda_f,nn1, loss, opt):
super().__init__()
self.args = args
self.lambda_f = lambda_f
self.nn1 = nn1
self.loss = loss
self.opt = opt
# more nn.Params stuff etc...
def forward(self, x):
#some code using fields
return out
I am trying to checkpoint it, but because PyTorch saves using state_dicts, I can't save the lambda functions I was actually using if I checkpoint with the standard torch.save etc. I literally want to save everything without issue and re-load it to train on GPUs later. I am currently using this:
def save_ckpt(path_to_ckpt):
    from pathlib import Path
    import dill as pickle
    ## Make dir. Throw no exceptions if it already exists
    path_to_ckpt.mkdir(parents=True, exist_ok=True)
    ckpt_path_plus_path = path_to_ckpt / Path('db')
    ## Pickle args (db is a dict, crazy_mdl is the NN instance defined elsewhere)
    db['crazy_mdl'] = crazy_mdl
    with open(ckpt_path_plus_path, 'ab') as db_file:
        pickle.dump(db, db_file)
Currently it throws no errors when I checkpoint it, and it saves correctly.
I am worried that when I train there might be a subtle bug even if no exceptions/errors are raised, or that something unexpected might happen (e.g. weird saving to disk on the cluster, who knows).
Is this safe to do with pytorch classes/nn models? Especially if we want to resume training with GPUs?
Cross posted:
How does one pickle arbitrary pytorch models that use lambda functions?
https://discuss.pytorch.org/t/how-does-one-pickle-arbitrary-pytorch-models-that-use-lambda-functions/79026
https://www.reddit.com/r/pytorch/comments/gagpjg/how_does_one_pickle_arbitrary_pytorch_models_that/?
https://www.quora.com/unanswered/How-does-one-pickle-arbitrary-PyTorch-models-that-use-lambda-functions
I'm the dill author. I use dill (and klepto) to save classes that contain trained ANNs inside of lambda functions. I tend to use combinations of mystic and sklearn, so I can't speak directly to pytorch, but I can assume it works the same. The place where you have to be careful is if you have a lambda that contains a pointer to an object external to the lambda... so for example y = 4; f = lambda x: x+y. This might seem obvious, but dill will pickle the lambda, and depending on the rest of the code and the serialization variant, may not serialize the value of y. So, I've seen many cases where people serialize a trained estimator inside some function (or lambda, or class) and then the results aren't "correct" when they restore the function from serialization. The overarching cause is because the function wasn't encapsulated so all objects required for the function to yield the correct results are stored in the pickle. However, even in that case you can get the "correct" results back, but you'd just need to create the same environment you had when you pickled the estimator (i.e. all the same values it depends on in the surrounding namespace). The takeaway should be, try to make sure that all variables used in the function are defined within the function. Here's a portion of a class I've recently started to use myself (should be in the next release of mystic):
class Estimator(object):
    "a container for a trained estimator and transform (not a pipeline)"
    def __init__(self, estimator, transform):
        """a container for a trained estimator and transform

        Input:
            estimator: a fitted sklearn estimator
            transform: a fitted sklearn transform
        """
        self.estimator = estimator
        self.transform = transform
        self.function = lambda *x: float(self.estimator.predict(self.transform.transform(np.array(x).reshape(1,-1))).reshape(-1))

    def __call__(self, *x):
        "f(*x) for x of xtest and predict on fitted estimator(transform(xtest))"
        import numpy as np
        return self.function(*x)
Note when the function is called, everything that it uses (including np) is defined in the surrounding namespace. As long as pytorch estimators serialize as expected (without external references), then you should be fine if you follow the above guidelines.
Yes, I think it is safe to use dill to pickle lambda functions etc. I have been using torch.save with dill to save state dict and have had no problems resuming training over GPU as well as CPU unless the model class was changed. Even if the model class was changed (adding/deleting some parameters), I could load state dict, modify it, and load to the model.
Also, usually, people don't save the model objects but only state dicts, i.e. parameter values, to resume training, along with hyperparameters/model arguments so the same model object can be rebuilt later.
Saving the model object can sometimes be problematic, as changes to the model class (code) can make the saved object useless. If you don't plan on changing your model class/code at all, and hence the model object won't change, then saving objects can work well; but generally it is not recommended to pickle a module object.
This is not a good idea. If you do this and your code later changes or moves to a different GitHub repo, it will be hard to restore models that took a lot of time to train. The cycles spent recovering those, or retraining, are not worth it. I recommend instead doing it the PyTorch way and only saving the weights, as recommended in the PyTorch docs.
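A minimal sketch of that recommended pattern (hypothetical names such as make_model and the hparams dict; the optimizer handling is illustrative) saves the weights plus the arguments needed to rebuild the model, rather than the module object itself:

import torch
import torch.nn as nn

def make_model(hidden_dim):
    # stand-in for however the real model is constructed
    return nn.Sequential(nn.Linear(10, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

# --- saving ---
hparams = {"hidden_dim": 32}
model = make_model(**hparams)
optimizer = torch.optim.Adam(model.parameters())

torch.save(
    {
        "hparams": hparams,                       # enough info to rebuild the model
        "model_state": model.state_dict(),        # only the weights
        "optimizer_state": optimizer.state_dict()
    },
    "checkpoint.pt",
)

# --- resuming (possibly on a GPU later) ---
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model = make_model(**ckpt["hparams"])
model.load_state_dict(ckpt["model_state"])
optimizer = torch.optim.Adam(model.parameters())
optimizer.load_state_dict(ckpt["optimizer_state"])
# model.to("cuda") here if a GPU is available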

how to combine dask and classes?

I am trying to rewrite an entire project that has been developed with classes. Little by little, the heaviest computational chunks should be parallelized; we clearly have a lot of independent sequential loops. An example with classes that mimics the behaviour is this toy problem (I'm a mathematician obsessed with the p-sums):
class Summer:
    def __init__(self, p):
        self.p = p

    def sum(self):
        return sum(pow(i, -self.p) for i in range(1, 1000000))

total = sum([Summer(p).sum() for p in range(2, 20)])
If I replace the last line by:
from dask.distributed import Client

def psum(p):
    return Summer(p).sum()

client = Client()
A = client.map(psum, range(2, 20))
total = client.submit(sum, A).result()
My runtime is cut by a factor of 4 (the number of cores available on my machine). This ideal behaviour does NOT persist if I use my real classes, which are data-intensive (big pandas structures taking up memory). Is there a recommended alternative to dask.distributed? I'm seeing bad slowdowns, which I attribute to data being passed around.
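For what it's worth, one pattern that often helps when big objects are involved (a hedged sketch; big_data, the extra big_frame parameter and the DataFrame contents are purely illustrative) is to scatter the heavy data to the workers once and pass the resulting future to the tasks, instead of letting every task re-serialize it:

import pandas as pd
from dask.distributed import Client

class Summer:
    def __init__(self, p, big_frame):
        self.p = p
        self.big_frame = big_frame  # stand-in for the heavy pandas data

    def sum(self):
        return sum(pow(i, -self.p) for i in range(1, 1000000))

def psum(p, big_frame):
    return Summer(p, big_frame).sum()

if __name__ == "__main__":
    big_data = pd.DataFrame({"x": range(1_000_000)})  # placeholder for the real structure

    client = Client()
    # Ship the large object to the workers once; tasks receive the future and
    # Dask resolves it worker-side, avoiding repeated serialization per task.
    big_future = client.scatter(big_data, broadcast=True)
    A = client.map(psum, range(2, 20), big_frame=big_future)
    total = client.submit(sum, A).result()
    print(total)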

Best way to initialize variable in a module?

Let's say I need to write incoming data into a dataset on the cloud.
When, where, and whether I will need the dataset in my code depends on the data coming in.
I only want to get a reference to the dataset once.
What is the best way to achieve this?
Initialize as global variable at start and access through global variable
if __name__ == "__main__":
    dataset = ...  # get dataset from internet
This seems like the simplest way, but initializes the variable even if it is never needed.
Get reference first time the dataset is needed, save in global variable, and access with get_dataset() method
dataset = None

def get_dataset():
    global dataset
    if dataset is None:
        dataset = ...  # get dataset from internet
    return dataset
Get reference first time the dataset is needed, save as function attribute, and access with get_dataset() method
def get_dataset():
    if not hasattr(get_dataset, 'dataset'):
        get_dataset.dataset = ...  # get dataset from internet
    return get_dataset.dataset
Any other way
The typical way to do what you want is to wrap the service call that fetches the data in a class:
class MyService():
    dataset = None

    def get_data(self):
        if self.dataset is None:
            self.dataset = get_my_data()  # fetch the dataset here
        return self.dataset
Then you instantiate it once in your main and use it wherever you need it.
if __name__ == "__main__":
    data_service = MyService()
    data = data_service.get_data()
    # or pass the service to whoever needs it
    my_function_that_uses_data(data_service)
The dataset variable is internal but accessible through a discoverable function. You could also use a property on the instance of the class.
Also, using objects and classes makes it much more clear in a large project, as the functionality should be self-explanatory from the classname and methods.
Note that you can easily make this a generic service too, by passing it the way to fetch the data at initialization (like a URL, or a fetch function), so it can be re-used with different endpoints; a sketch of that idea follows.
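A hedged sketch of such a generic service (the fetch_fn callable and the placeholder data are illustrative, not from the original answer):

class DataService:
    def __init__(self, fetch_fn):
        # fetch_fn is any zero-argument callable that retrieves the dataset,
        # e.g. a lambda wrapping a URL download.
        self._fetch_fn = fetch_fn
        self._dataset = None

    def get_data(self):
        if self._dataset is None:          # fetch lazily, only on first use
            self._dataset = self._fetch_fn()
        return self._dataset

if __name__ == "__main__":
    service = DataService(lambda: {"rows": [1, 2, 3]})  # stand-in fetcher
    data = service.get_data()   # fetched here, cached for later calls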
One caveat: avoid instantiating the class multiple times in your submodules, as opposed to once in the main module. If you did, the data would be fetched and stored once per instance. On the other hand, you can pass the instance of the class to a submodule and only fetch the data when it's needed (i.e., it may never be fetched if your submodule never needs it), whereas with all your proposed options the dataset needs to be fetched first in order to be passed somewhere else.
Note about your proposed options:
Initializing in the if __name__ == '__main__' section:
It is not initialized at all if the module is imported rather than run as a script (it would only be initialized when the module is run from the shell).
You need to fetch the data to pass it somewhere else, even if you don't need it in main.
Set a global within a function.
The use of global is generally discouraged, as it is in any programming language. Modifying variables out of scope is a recipe for encountering odd behaviors. It also tends to make the code harder to test if you rely on this global which is only set in a specific workflow.
Attribute on a function
This one is a bit of an eye-sore: it would certainly work, and the functionality is very similar to the class pattern I propose, but you have to admit that attributes on functions are not very pythonic. The advantage of the class is that you can initialize it in many ways, you can subclass it, etc., and yet not fetch the data until you need it. Using a plain function is 'simpler' but much more limited.
You can also use the lru_cache decorator from the functools module to achieve the goal of running an expensive operation only once.
As long as the parameters are the same, calling the function again and again returns the same object.
https://docs.python.org/3/library/functools.html#functools.lru_cache
from functools import lru_cache

@lru_cache
def fun(input1, input2):
    ...  # expensive operation
    return result
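As a hedged illustration of that behaviour (the zero-argument getter and the placeholder dictionary are hypothetical), a cached getter behaves like the lazy-global pattern above without needing the global statement:

from functools import lru_cache

@lru_cache(maxsize=None)
def get_dataset():
    print("fetching...")           # runs only on the first call
    return {"rows": [1, 2, 3]}     # stand-in for the real download

get_dataset()   # prints "fetching..." and returns the dict
get_dataset()   # returns the same cached object without fetching again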
Similar to MrE's answer, it is best to encapsulate the data with a wrapper.
However, I would recommend using a Python closure instead of a class.
A class should be used to encapsulate data and relevant functions that are closely related to the data. A class should be something that you will instantiate objects of, and those objects will retain individuality. You can read more about this here.
You can use closures in the following way
def get_dataset_wrapper():
    dataset = None
    def get_dataset():
        nonlocal dataset
        if dataset is None:
            dataset = ...  # get dataset from internet
        return dataset
    return get_dataset
You can use this in the following way
dataset = get_dataset_wrapper()()
If the ()() syntax bothers you, you can do this:
def wrapper():
    return get_dataset_wrapper()()
