I am trying to do something like
resource = MyResource()

def fn(x):
    something = do_something(x, resource)
    return something
client = Client()
results = client.map(fn, data)
The issue is that resource is not serializable and is expensive to construct.
Therefore I would like to construct it once on each worker and have it available for fn to use.
How do I do this?
Or is there some other way to make resource available on all workers?
You can always construct a lazy resource, something like
class GiveAResource():
    resource = [None]

    def get_resource(self):
        if self.resource[0] is None:
            self.resource[0] = MyResource()
        return self.resource[0]
An instance of this will serialise between processes fine, so you can include it as an input to any function to be executed on workers; calling .get_resource() on it will then get your local expensive resource, which will be remade on any worker that appears later on.
This class would be best defined in a module rather than dynamic code.
There is no locking here, so if several threads ask for the resource at the same time when it has not been needed so far, you will get redundant work.
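For example, here is a minimal usage sketch under the same assumptions (do_something and data are the placeholders from the question; Client.map forwards extra keyword arguments to the mapped function):

giver = GiveAResource()   # cheap to serialise; the real resource is built lazily on each worker

def fn(x, giver):
    resource = giver.get_resource()   # constructed once per worker, then reused
    return do_something(x, resource)

client = Client()
futures = client.map(fn, data, giver=giver)
results = client.gather(futures)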
I have a Flask-RESTful API that acts as a gateway to TCP devices that cannot handle concurrent calls to them.
Since Resource objects are simply spawned per request, I cannot queue and manage them from a single point.
I tried to create a decorator to be used by Resources that need synchronization. In this decorator, I append the id of the TCP device (load_id) to a list in the global scope, and remove it after the request is handled.
The problem is that when concurrent requests are made, the first Resource gets an empty list and appends to it, and while it is still executing a second Resource is created for the second request. This second Resource instance also gets an empty list, so I cannot actually make Resource instances share a list.
I tried this without a decorator, within the get and put methods explicitly, with locks defined on database model objects, and with a common handler object that manages locks on objects uniquely identified by load_id, but to no avail; I always get a list that is outdated.
Here is the stripped down version of one:
from functools import wraps

loads_with_query_in_progress = []  # Global scope

def disallow_async_calls(func):
    @wraps(func)
    def decorator(*args, **kwargs):
        global loads_with_query_in_progress
        load_id = kwargs.get("load_id", None)
        load = Load.query.get(load_id)
        if load in loads_with_query_in_progress:  # Load is in the list. Aborting.
            raise Exception
        else:
            loads_with_query_in_progress.append(load)  # APPEND
        try:
            decorated_function_output = func(*args, **kwargs)
        except Exception as e:
            loads_with_query_in_progress.remove(load)  # Exception handling cleanup
            raise e
        loads_with_query_in_progress.remove(load)  # Remove lock
        return decorated_function_output
    return decorator
class LoadStateAPI(Resource):
    decorators = [auth.login_required,
                  disallow_async_calls]
    ...

    def get(self, load_id):
        load = Load.query.get(load_id)
        try:
            rqObj = RelayQueryObject(load)
            rqObj.execute()
        except:
            raise
        if(rqObj.fsmState == CommState.COMPLETED):
            return {'state': rqObj.response}, 200
Here, in the first request, the line commented with # APPEND changes loads_with_query_in_progress in its scope. But when another request is spawned, the variable loads_with_query_in_progress is fetched unmodified.
Is there any way to resolve this async-sync conversion?
The discrepancy was due to the fact that production uses uWSGI, which runs several worker processes. Because each process has its own copy of the "shared" objects, we were effectively using different objects in different processes, and the phantom differences were masked during debugging because all processes were logging to the same file.
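If the lock has to be visible across all uWSGI worker processes, one option is to keep it in an external store instead of a Python global. Below is only a rough sketch of that idea (not from the original post), using Redis via the redis-py package and an arbitrary key-naming scheme; the decorator above would call these helpers instead of mutating the global list.

import redis

r = redis.Redis()  # assumes a Redis instance reachable by every worker

def acquire_load_lock(load_id, timeout=30):
    # SET ... NX succeeds only if the key does not already exist; EX adds an expiry as a safety net.
    return r.set("load-lock:{}".format(load_id), "1", nx=True, ex=timeout)

def release_load_lock(load_id):
    r.delete("load-lock:{}".format(load_id))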
I'm wondering where the best place to instantiate a boto3 s3 client is so that it can be reused during the duration of a request in django.
I have a django model with a computed property that returns a signed s3 url:
@property
def url(self):
    client = boto3.client('s3')
    params = {
        'Bucket': settings.BUCKET,
        'Key': self.frame.s3_key,
        'VersionId': self.key
    }
    return client.generate_presigned_url('get_object', Params=params)
The object is serialized as JSON and returned in a list that can contain hundreds of these objects.
Even though boto3.client('s3') does not perform any network requests when instantiated, I've found that it is slow.
Placing S3_CLIENT = boto3.client('s3') into settings.py and then using that instead of instantiating a new client per object reduced the response time by ~3x with 100 results. However, I know it is bad practice to place global variables in settings.py.
My question is where to instantiate this client so that it can be reused at least at the request level?
If your code runs in AWS Lambda, go with a global. Lambda reuses execution environments, so a client created outside the handler is reused across invocations, which has cost and performance savings:
Take advantage of execution environment reuse to improve the performance of your function. Initialize SDK clients and database connections outside of the function handler
https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
Otherwise I think this is a stylistic choice dependent on your app.
If your client will never change, a global seems like a safe way to do it. The drawback is that since it's effectively a constant, you shouldn't change it during runtime; this has consequences, e.g. it makes changing the Session hard. You could use a singleton, but the code would become more verbose.
If you instantiate clients everywhere, you run the risk of making a change to the client() call signature a large effort, e.g. if you need to pass client('s3', verify=True), you'd have to add verify=True everywhere, which is a pain. It's unlikely you'd do this, though. The only param you're likely to override is config, which you can pass through the session using set_default_config.
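One way to keep that risk small is to funnel every client creation through a single helper, so a signature change lives in one place; a small sketch of the idea (not from the original answer, and the verify example is illustrative only):

import boto3

_DEFAULT_KWARGS = {}  # e.g. {'verify': True} could later be added here, in one place

def s3_client(**overrides):
    kwargs = dict(_DEFAULT_KWARGS)
    kwargs.update(overrides)
    return boto3.client('s3', **kwargs)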
You could make it its own module, eg
foo.bar.aws.clients
session = None
ecs_client = None
eks_client = None

def init_session(new_session):
    global session, ecs_client, eks_client
    session = new_session
    ecs_client = session.client('ecs')
    eks_client = session.client('eks')
You can call init_session from an appropriate place or have defaults and an import hook to auto-instantiate. This file will get larger as you use more clients, but at least the ugliness is contained. You could also do a hack like
def init_session(s):
    global session
    session = s
    clients = ['ecs', 'iam', 'eks', …]
    for c in clients:
        globals()[f'{c}_client'] = session.client(c)
The problem is the indirection this hack adds, e.g. IntelliJ is not smart enough to figure out where your clients came from and will say you are using an undefined variable.
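For completeness, here is a sketch of how the clients module above might be wired up at startup (the region name is just an example value):

import boto3
from foo.bar.aws import clients

clients.init_session(boto3.Session(region_name='us-east-1'))
clients.ecs_client.list_clusters()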
My best approach is to use functools.partial and have all the constant variables, such as bucket and other metadata, frozen in a partial, and then just pass in the variable data. However, boto3 is still painfully slow at creating the signed URLs; compared to simple string formatting it is roughly 100x slower.
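A rough sketch of that functools.partial idea, freezing the constant arguments of generate_presigned_url and passing only the per-object data (settings.BUCKET comes from the question; the other names and the ExpiresIn value are assumptions):

from functools import partial

import boto3
from django.conf import settings

client = boto3.client('s3')  # created once and reused

presign = partial(client.generate_presigned_url, 'get_object', ExpiresIn=3600)

def signed_url(key, version_id):
    return presign(Params={'Bucket': settings.BUCKET,
                           'Key': key,
                           'VersionId': version_id})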
I have a module which wraps a JSON API for querying song cover/remix data, with limits on the number of requests per hour/minute. I'd like to keep an optional cache of JSON responses without forcing users to adjust a cache/context parameter every time. What is a good way of initializing a library/module in Python? Or would you recommend I just do the explicit thing and use a cache named parameter in every call that eventually requests JSON data?
I was thinking of doing
_cache = None

class LFU(object):
    ...

NO_CACHE, LFU = "NO_CACHE", "LFU"

def set_cache_strategy(strategy):
    global _cache
    if strategy == NO_CACHE:
        _cache = None
    else:
        _cache = LFU()
import second_hand_songs_wrapper as s

s.set_cache_strategy(s.LFU)
l1 = s.ShsLabel.get_from_resource_id(123)
l2 = s.ShsLabel.get_from_resource_id(123, use_cache=False)
edit:
I'm probably only planning on having two strategies: one with and one without a cache.
Other possible initialization schemes I can think of off the top of my head include using environment variables, initializing _cache by hand in the user code to None/LFU(), and using an explicit cache parameter everywhere (possibly defaulting to having a cache).
Note the reason I don't set the cache on an instance of the class is that I currently use a never-instantiated class (class functions + class state as a singleton) to abstract downloading the JSON data, along with some convenience methods to download certain URLs. I could instantiate the downloader class, but then I'd have to pass the instance explicitly to each function, or use another global variable for a convenience version of the class. The downloader class also keeps track of the number of requests (the website has a limit per minute/hour), so having multiple downloader objects would cause more trouble.
There's nothing wrong with setting a default, even if that default is None. I would note, though, that having the pseudo-constants as well as a conditional (provided that's all you use the values for) is redundant. Try:
caching_strategies = {'NO_CACHE': lambda: None,
                      'LFU': LFU}

_cache = caching_strategies['NO_CACHE']()

def set_cache_strategy(strategy):
    global _cache
    _cache = caching_strategies[strategy]()
If you want to provide a convenience method for the available strategies, just wrap caching_strategies.keys(). Really though, as far as your strategies go, you should probably have all your strategies inherit from some base strategy, and just create a no_cache strategy class that inherits from it and stubs out all the methods of your standardized caching interface.
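A minimal sketch of that base-class idea (the class and method names here are placeholders, not an existing interface):

class CacheStrategy(object):
    def get(self, key):
        raise NotImplementedError
    def put(self, key, value):
        raise NotImplementedError

class NoCache(CacheStrategy):
    def get(self, key):
        return None   # always a miss
    def put(self, key, value):
        pass          # silently drop

class LFU(CacheStrategy):
    ...               # real eviction logic goes here

With this, the calling code never branches on the strategy; it just calls _cache.get and _cache.put unconditionally.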
I am building a messaging app in Python that interfaces with Twitter. I want to create a global FIFO queue that the TwitterConnector object can access to insert new messages.
This is the main section of the app:
#!/usr/bin/python
from twitterConnector import TwitterConnector
from collections import deque

# instantiate the request queue
requestQueue = deque()

# instantiate the twitter connector
twitterInterface = TwitterConnector()

# Get all new mentions
twitterInterface.GetMentions()
Specifically I want to be able to call the .GetMentions() method of the TwitterConnector class and process any new mentions and put those messages into the queue for processing separately.
This is the twitterConnector class so far:
class TwitterConnector:
    def __init__(self):
        self.connection = twitter.Api(consumer_key=self.consumer_key,
                                      consumer_secret=self.consumer_secret,
                                      access_token_key=self.access_token_key,
                                      access_token_secret=self.access_token_secret)
        self.mentions_since_id = None

    def VerifyCredentials(self):
        print self.connection.VerifyCredentials()

    def GetMentions(self):
        mentions = self.connection.GetMentions(since_id=self.mentions_since_id)
        for mention in mentions:
            print mention.text
            global requestQueue.add(mention)  # <- error
Any assistance would be greatly appreciated.
Update: let me see if I can clarify the use case a bit. My TwitterConnector is intended to retrieve messages and will eventually transform the Twitter status object into a simplified object that contains the information needed for downstream processing. In the same method it will perform some other calculations to try to determine other needed information. It is actually the transformed objects that I want to put into the queue. Hopefully that makes sense. As a simplified example, this method would map the Twitter status object to a new object with properties such as: origin network (twitter), username, message, and location if available. This new object would then be put into the queue.
Eventually there will be other connectors for other messaging systems that will perform similar actions. They will receive a message, populate the new message object, and place it into the same queue for further processing. Once the message has been completely processed, a response would be formulated and then added to an appropriate queue for transmittal. Using the new object described above, once the actual processing of the tweet content has taken place, a response might be returned to the originator via Twitter based on the origin network and username.
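As an illustration only, the simplified object described in this update might look something like the sketch below; the field names are guesses, not from the original code.

class InboundMessage(object):
    def __init__(self, network, username, text, location=None):
        self.network = network      # e.g. "twitter"
        self.username = username
        self.text = text
        self.location = location

def mention_to_message(mention):
    # python-twitter Status objects expose .user.screen_name and .text
    return InboundMessage("twitter", mention.user.screen_name, mention.text,
                          getattr(mention, "location", None))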
I had thought of passing a reference to the queue in the constructor (or as an argument to the method). I had also thought of modifying the method to return a list of the new objects and iterating over them to put them into the queue, but I wanted to make sure there wasn't a better or more efficient way to handle this. I also wish to be able to do the same thing with a logging object and a database handler. Thoughts?
The problem is that "global" should appear on a line of its own
global requestQueue
requestQueue.add(mention)
Moreover, requestQueue must be defined in the module that defines the class.
If you are importing the requestQueue symbol from another class, you don't need the global at all.
from some_module import requestQueue

class A(object):
    def foo(self, o):
        requestQueue.append(o)  # Should work
It is generally a good idea to avoid globals; a better design often exists. For details on the issue of globals see, for instance, ([1], [2], [3], [4]).
If by using globals you wish to share a single requestQueue between multiple TwitterConnector instances, you could also pass the queue to the connector in its constructor:
class TwitterConnector:
    def __init__(self, requestQueue):
        self.requestQueue = requestQueue
        ...

    def GetMentions(self):
        mentions = self.connection.GetMentions(since_id=self.mentions_since_id)
        for mention in mentions:
            print mention.text
            self.requestQueue.append(mention)
Correspondingly, you need to provide your requestQueue to the constructor as:
twitterInterface = TwitterConnector(requestQueue)
I love CherryPy's API for sessions, except for one detail. Instead of saying cherrypy.session["spam"] I'd like to be able to just say session["spam"].
Unfortunately, I can't simply have a global from cherrypy import session in one of my modules, because the cherrypy.session object isn't created until the first time a page request is made. Is there some way to get CherryPy to initialize its session object immediately instead of on the first page request?
I have two ugly alternatives if the answer is no:
First, I can do something like this
from time import sleep
from threading import Thread

def import_session():
    global session
    while not hasattr(cherrypy, "session"):
        sleep(0.1)
    session = cherrypy.session

Thread(target=import_session).start()
This feels like a big kludge, but I really hate writing cherrypy.session["spam"] every time, so to me it's worth it.
My second solution is to do something like
class SessionKludge:
def __getitem__(self, name):
return cherrypy.session[name]
def __setitem__(self, name, val):
cherrypy.session[name] = val
session = SessionKludge()
but this feels like an even bigger kludge and I'd need to do more work to implement the other dictionary functions such as .get
So I'd definitely prefer a simple way to initialize the object myself. Does anyone know how to do this?
For CherryPy 3.1, you would need to find the right subclass of Session, run its 'setup' classmethod, and then set cherrypy.session to a ThreadLocalProxy. That all happens in cherrypy.lib.sessions.init, in the following chunks:
# Find the storage class and call setup (first time only).
storage_class = storage_type.title() + 'Session'
storage_class = globals()[storage_class]
if not hasattr(cherrypy, "session"):
    if hasattr(storage_class, "setup"):
        storage_class.setup(**kwargs)

# Create cherrypy.session which will proxy to cherrypy.serving.session
if not hasattr(cherrypy, "session"):
    cherrypy.session = cherrypy._ThreadLocalProxy('session')
Reducing (replace FileSession with the subclass you want):
FileSession.setup(**kwargs)
cherrypy.session = cherrypy._ThreadLocalProxy('session')
The "kwargs" consist of "timeout", "clean_freq", and any subclass-specific entries from tools.sessions.* config.