On my Django app I have a report (a CSV download) that can take some time to run. When a user runs the report they are redirected to a 'processing' page where a JavaScript function checks the server every second to see if the CSV has been created (the file name is included in the HttpResponse object).
What I'm looking for is a way of identifying the thread that's creating the CSV. That way I can add an estimated_time_to_completion attribute to the thread and include this info in the holding page. In fact I could stop checking for the existence of the (unlocked) CSV; I could just ask the thread if it's finished.
My csv building thread looks something like -
import csv
import threading

class CsvBuilder(threading.Thread):
    def __init__(self, file_name, parameters):
        self.file_name = file_name
        self.parameters = parameters
        threading.Thread.__init__(self)

    def run(self):
        # ...
        # text mode with newline='' is what csv.writer expects on Python 3
        with open(self.file_name, 'w', newline='') as file:
            writer = csv.writer(file)
            for patient in patients:
                writer.writerow(['some data'])
                self.time_remaining = ...  # a timedelta object
And then my django requests will look something like -
def create_csv(request):
    '''
    Standard django view to create a csv
    '''
    # get filename and parameters from request
    thread = CsvBuilder(file_name, parameters)
    thread.start()
    return render_to_response('processing.html', {"thread_id": thread.thread_id})

def check_progress(request):
    '''
    An ajax call to check the progress on a report
    '''
    thread_id = request.GET['thread_id']
    # find the thread
    return HttpResponse(thread.time_remaining)
Is this possible? Or should I be going about this a different way?
It's probably easiest and safest to use a dedicated background task library; they are designed for use cases like this. The most common one for Python is Celery. It has good Django support and is very easy to use.
I'd suggest you have your writer function update a memcached key/value for time_remaining calculations.
If it were me, I'd probably have used Celery for the long-running job. Starting a thread from Django seems like it could have pitfalls, but nothing specific springs to mind.
I am working on a Python Flask app, and the main method start() calls an external API (third_party_api_wrapper()). That external API has an associated webhook (webhook()) that receives the output of that external API call (note that the output that webhook() receives is actually different from the response returned in third_party_api_wrapper()).
The main method start() needs the result of webhook(). How do I make start() wait for webhook() to be executed? And how do we pass the returned value of webhook() back to start()?
Here is a minimal code snippet to capture the scenario.
@app.route('/webhook', methods=['POST'])
def webhook():
    return "webhook method has executed"

# this method has a webhook that calls webhook() after this method has executed
def third_party_api_wrapper():
    url = 'https://api.thirdparty.com'
    response = requests.post(url)
    return response

# this is the main entry point
@app.route('/start', methods=['POST'])
def start():
    third_party_api_wrapper()
    # The rest of this code depends on the output of webhook().
    # How do we wait until webhook() is called, and how do we access the returned value?
The answer to this question really depends on how you plan on running your app in production. It's much simpler if we make the assumption that you only plan to have a single instance of your app running at once (as opposed to multiple behind a load balancer, for example), so I'll make that assumption first to give you a place to start, and comment on a more "production-ready" solution afterwards.
A big thing to keep in mind when writing a web application is that you have to understand how you want the outside world to interact with your app. Do you expect to have the /start endpoint called only once at the beginning of your app's lifetime, or is this a generic endpoint that may start any number of background processes that you want the caller of each to wait for? Or, do you want the behavior where any caller after the first one will wait for the same process to complete as the first one? I can't answer these questions for you, it depends on the use-case you're trying to implement. I'll give you a relatively simple solution that you should be able to modify to fulfill any of the ones I mentioned though.
This solution will use the Event class from the threading standard library module; I added some comments to clarify which parts you may have to change depending on the specifics of the API you're calling and stuff like that.
import threading
import uuid
from typing import Any

import requests
from flask import Flask, Response, request

# The base URL for your app, if you're running it locally this should be fine
# however external providers can't communicate with your `localhost` so you'll
# need to change this for your app to work end-to-end.
BASE_URL = "http://localhost:5000"

app = Flask(__name__)


class ThirdPartyProcessManager:
    def __init__(self) -> None:
        self.events = {}
        self.values = {}

    def wait_for_request(self, request_id: str) -> Any:
        event = threading.Event()
        actual_event = self.events.setdefault(request_id, event)
        if actual_event is not event:
            raise ValueError(f"Request {request_id} already exists.")
        event.wait()
        return self.values.pop(request_id)

    def finish_request(self, request_id: str, value: Any) -> None:
        event = self.events.pop(request_id, None)
        if event is None:
            raise ValueError(f"Request {request_id} does not exist.")
        self.values[request_id] = value
        event.set()


MANAGER = ThirdPartyProcessManager()


# This is assuming that you can specify the callback URL per-request, otherwise
# you may have to get the request ID from the body of the request or something
@app.route('/webhook/<request_id>', methods=['POST'])
def webhook(request_id: str) -> Response:
    MANAGER.finish_request(request_id, request.json)
    return "webhook method has executed"


# Somehow in here you need to create or generate a unique identifier for this
# request--this may come from the third-party provider, or you can generate one
# yourself. There are two main paths I see here:
# - If you can specify the callback/webhook URL in each call, you can just pass them
#   <base>/webhook/<request_id> and use that to identify which request is being
#   responded to in the webhook.
# - If the provider gives you a request ID, you can return it from this function
#   then retrieve it from the request body in the webhook route
# For now, I'll assume the first situation but you should be able to implement the second
# with minimal changes
def third_party_api_wrapper() -> str:
    request_id = uuid.uuid4().hex
    url = 'https://api.thirdparty.com'
    # Just an example, I don't know how the third party API you're working with works
    response = requests.post(
        url,
        json={"callback_url": f"{BASE_URL}/webhook/{request_id}"}
    )
    # NOTE: unrelated to the problem at hand, you should always check for errors
    # in HTTP responses. This method is an easy way provided by requests to raise
    # for non-success status codes.
    response.raise_for_status()
    return request_id


@app.route('/start', methods=['POST'])
def start() -> Response:
    request_id = third_party_api_wrapper()
    result = MANAGER.wait_for_request(request_id)
    return result
If you want to run the example fully locally to test it, do the following:
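Comment out the requests.post(...) call and the response.raise_for_status() line in third_party_api_wrapper(), which actually make the external API call
Add a print statement in start(), right after request_id = third_party_api_wrapper(), so that you can get the ID of the "in flight" request. E.g. print("Request ID", request_id)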
In one terminal, run the app by pasting the above code into an app.py file and running flask run in that directory.
In another terminal, start the process via:
curl -XPOST http://localhost:5000/start
Copy the request ID that will be logged in the first terminal that's running the server.
In a third terminal, complete the process by calling the webhook:
curl -XPOST http://localhost:5000/webhook/<your_request_id> -H Content-Type:application/json -d '{"foo":"bar"}'
You should see {"foo":"bar"} as the response in the second terminal that made the /start request.
I hope that's enough to help you get started w/ whatever problem you're trying to solve.
There are a couple of design-y comments I have based on the information provided as well:
As I mentioned before, this will not work if you have more than one instance of the app running at once. This works by storing the state of in-flight requests in a global state inside your python process, so if you have more than one process, they won't all be working and modifying the same state. If you need to run more than one instance of your process, I would use a similar approach with some database backend to store the shared state (assuming your requests are pretty short-lived, Redis might be a good choice here, but once again it'll depend on exactly what you're trying to do).
Even if you do only have one instance of the app running, flask is capable of being run in a variety of different server contexts--for example, the server might be using threads (the default), greenlets via gevent or a similar library, or multiple processes, or maybe some other approach entirely in order to handle multiple requests concurrently. If you're using an approach that creates multiple processes, you should be able to use the utilities provided by the multiprocessing module to implement the same approach as I've given above.
This approach probably will work just fine for something where the difference in time between the API call and the webhook response is small (on the order of a couple of seconds at most I'd say), but you should be wary of using this approach for something where the difference in time can be quite large. If the connection between the client and your server fails, they'll have to make another request and run the long-running process that your third party is completing for you again. Some proxies and load balancers may also have timeout behavior that could terminate the request after a certain amount of time even if nothing goes wrong in the connection between your server and the client making a request to it. An alternative approach would be for your /start endpoint to return quickly and give the client a request_id that they could poll for updates. As an example, AWS Athena's API is structured like this--there is a StartQueryExecution method, and separate GetQueryExecution and GetQueryResults methods that the client calls to check the status of a query and retrieve the results respectively (there are also other methods like StopQueryExecution and GetQueryRuntimeStatistics available as well). The AWS Athena API reference documents these methods.
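To make the polling alternative a bit more concrete, here's a rough sketch of the bookkeeping it needs, with Flask left out so it runs on its own. JobStore and its method names are hypothetical; the idea is that /start would call create() and return the ID straight away, the webhook would call complete(), and a status endpoint would call status(). Like the Event-based version, it keeps state in memory, so it only works with a single process.

```python
import threading
import uuid
from typing import Any, Dict, Optional


class JobStore:
    """In-memory record of in-flight jobs (single-process only; hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: Dict[str, Dict[str, Any]] = {}

    def create(self) -> str:
        """Register a new pending job and return its ID (what /start would return)."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "pending", "result": None}
        return job_id

    def complete(self, job_id: str, result: Any) -> None:
        """Record the webhook payload for a job (what the webhook route would call)."""
        with self._lock:
            if job_id not in self._jobs:
                raise KeyError(f"Unknown job {job_id}")
            self._jobs[job_id] = {"status": "done", "result": result}

    def status(self, job_id: str) -> Optional[Dict[str, Any]]:
        """Return a copy of the job record, or None if the ID is unknown."""
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job is not None else None
```

Because no request handler blocks while waiting, a dropped client connection costs nothing: the client simply polls again with the same ID.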
I know that's a lot of info, but I hope it helps. Happy to update the answer w/ more specific info if you'll provide some more details about your use-case.
I'm working on a Django application which reads a CSV file from Dropbox, parses the data and stores it in a database. For this purpose I need a background task which checks whether the file has been modified (updated) and then updates the database.
I tried Celery but failed to configure it with Django. Then I found django-background-tasks, which is much simpler to configure than Celery.
My question here is how to initialize repeating tasks?
It is described in the documentation, but I'm unable to find any example which explains how to use repeat, repeat_until or the other constants mentioned there.
Can anyone explain the following with examples, please?
notify_user(user.id, repeat=<number of seconds>, repeat_until=<datetime or None>)
repeat is given in seconds. The following constants are provided:
Task.NEVER (default), Task.HOURLY, Task.DAILY, Task.WEEKLY,
Task.EVERY_2_WEEKS, Task.EVERY_4_WEEKS.
You have to call the function (notify_user() in the documentation's example) at the point where you actually want the task to be scheduled.
Suppose you need to schedule the task when a request comes to the server; then it would look like this:
from background_task import background

@background(schedule=60)
def get_csv(creds):
    # read csv from Dropbox with credentials, "creds"
    # then update the DB
    pass

def myview(request):
    # do something with my view
    get_csv(creds, repeat=100)
    return SomeHttpResponse
Execution procedure
1. A request comes to the URL and is dispatched to the corresponding view, here myview().
2. The line get_csv(creds, repeat=100) executes and creates an async task in the DB (it won't execute the function now).
3. The HTTP response is returned to the user.
60 seconds after the task is created, get_csv(creds) will execute, and it will then repeat every 100 seconds.
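To illustrate the timing described above, here is a small stand-alone calculation of when the runs would fall. It is an idealised sketch, not the library's code: planned_run_times is a made-up helper, repeat=0 stands in for "never repeat", and in practice the actual run times also depend on when the task-processing worker picks the task up.

```python
from datetime import datetime, timedelta


def planned_run_times(created_at, schedule, repeat, repeat_until=None, limit=5):
    """List up to `limit` idealised run times for a task created at `created_at`.

    schedule: delay in seconds before the first run
    repeat: seconds between runs (0 means the task runs only once)
    repeat_until: optional datetime after which no further run is planned
    """
    run_times = []
    next_run = created_at + timedelta(seconds=schedule)
    while len(run_times) < limit:
        if repeat_until is not None and next_run > repeat_until:
            break
        run_times.append(next_run)
        if repeat <= 0:
            break
        next_run += timedelta(seconds=repeat)
    return run_times


if __name__ == "__main__":
    created = datetime(2024, 1, 1, 12, 0, 0)
    for run_time in planned_run_times(created, schedule=60, repeat=100, limit=3):
        print(run_time.time())
```

With schedule=60 and repeat=100, a task created at 12:00:00 would first run at 12:01:00, then at 12:02:40, then at 12:04:20, and so on.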
For example, suppose you have the function from the documentation
@background(schedule=60)
def notify_user(user_id):
    # lookup user by id and send them a message
    user = User.objects.get(pk=user_id)
    user.email_user('Here is a notification', 'You have been notified')
Suppose you want to repeat this task daily until New Years day of 2019 you would do the following
import datetime
from background_task.models import Task

new_years_2019 = datetime.datetime(2019, 1, 1)
notify_user(some_id, repeat=Task.DAILY, repeat_until=new_years_2019)
I'm implementing a simple upload handler in Python which reads an uploaded file in chunks into memory, GZips and signs them, and reuploads them to another server for long term storage. I've already devised a way to read the upload in chunks with my web server, and essentially I have a workflow like this:
class MyUploadHandler:
    def on_file_started(self, file_name):
        pass

    def on_file_chunk(self, chunk):
        pass

    def on_file_finished(self, file_size):
        pass
This part works great.
Now I need to upload the file in chunks to the final destination after performing my modifications to them. I'm looking for a workflow somewhat like this:
import requests

class MyUploadHandler:
    def on_file_started(self, file_name):
        self.request = requests.put("http://secondaryuploadlocation.com/upload/%s" %
                                    (file_name,), streaming_upload = True)

    def on_file_chunk(self, chunk):
        self.request.write_body(transform_chunk(chunk))

    def on_file_finished(self, file_size):
        self.request.finish()
Is there a way to do this using the Python requests library? It seems that they allow for file-like upload objects which can be read, but I'm not sure exactly what that means and how to apply it for my situation. How can I provide a streaming upload request like this?
I would suggest using Python's multiprocessing module. You can use the apply_async method of multiprocessing.Pool to upload each chunk as it is completed, without affecting the other uploads. You can then put the chunks in a temporary folder and, once the upload has completed, stitch them together.
The following answer to a similar question should solve your problem:
Q: "How to stream POST data into Python requests?"
A: Example code using queue, threading and iter() with sentinel
https://stackoverflow.com/a/40018547/19163
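The queue / threading / iter()-with-sentinel idea mentioned above can be sketched roughly as follows. ChunkStream and consume are hypothetical names, and consume only collects chunks into a list so the example runs without a network; in a real handler that role would be played by the upload call running in a background thread.

```python
import queue
import threading


class ChunkStream:
    """Iterable that yields chunks as they are pushed from another thread."""

    _SENTINEL = object()

    def __init__(self, maxsize=8):
        # Bounded queue: write() blocks when the consumer falls behind.
        self._queue = queue.Queue(maxsize=maxsize)

    def write(self, chunk):
        """Called from on_file_chunk with the (already transformed) chunk."""
        self._queue.put(chunk)

    def close(self):
        """Called from on_file_finished to end the iteration."""
        self._queue.put(self._SENTINEL)

    def __iter__(self):
        # iter(callable, sentinel) keeps calling get() until it returns the sentinel.
        return iter(self._queue.get, self._SENTINEL)


def consume(stream, out):
    """Stand-in for the uploader: drains the stream into a list."""
    for chunk in stream:
        out.append(chunk)


if __name__ == "__main__":
    stream = ChunkStream()
    received = []
    worker = threading.Thread(target=consume, args=(stream, received))
    worker.start()
    stream.write(b"first")
    stream.write(b"second")
    stream.close()
    worker.join()
    print(received)
```

With requests, an iterator like this can be passed as the request body, e.g. requests.put(url, data=iter(stream)) in the background thread, which requests sends using chunked transfer encoding. That assumes the receiving server accepts chunked uploads, which is worth checking first.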
I have a Python HTTP server. On a certain GET request a file is created (or an existing file is updated), and that file is then returned as the response. Creating or updating the file might take a second.
Hence, I cannot return the file immediately as the response. How do I approach such a problem? Currently I have a solution like this:
while not os.path.isfile('myfile'):
    time.sleep(0.1)
return myfile
This seems very clumsy. Is there a better way?
A simple notification would do, but I don't have control over the process which creates/updates the files.
You could use Watchdog for a nicer way to watch the file system?
Something like this will remove the os call:
while updating:
    time.sleep(0.1)
return myfile
...
def updateFile():
    global updating
    # updating file
    updating = False
Implementing blocking I/O operations in synchronous HTTP requests is a bad approach. If many people run the same procedure simultaneously you may soon run out of threads (if there is a limited thread pool). I'd do the following:
A client requests the file creation URI. A file generating procedure is started in a background process (some asynchronous task system), and the user gets a file id / name in the HTTP response. Next, the client makes AJAX calls every once in a while (polling) to check if the file has been created/modified (a separate file serve/check-if-exists URI). When the file is finally created, the user is redirected (js window.location) to the file serving URI.
This approach will require a bit more work, but eventually it will pay off.
You can try using os.path.getmtime: this checks the modification time of the file, and you return the file if it was modified less than 1 second ago. I also suggest you only make a limited number of tries, or you will be stuck in an infinite loop if the file doesn't get created/modified. And as @Krzysztof Rosiński pointed out, you should probably think about doing it in a non-blocking way.
import os
import time
from datetime import datetime

for i in range(10):
    try:
        dif = datetime.now() - datetime.fromtimestamp(os.path.getmtime(file_path))
        if dif.total_seconds() < 1:
            return file
    except OSError:
        pass  # file does not exist yet
    # wait before the next try, whether the file was missing or just stale
    time.sleep(0.1)
I created a new Pylons project, and would like to use Cassandra as my database server. I plan on using Pycassa to be able to use cassandra 0.7beta.
Unfortunately, I don't know where to instantiate the connection to make it available in my application.
The goal would be to:
Create a pool when the application is launched
Get a connection from the pool for each request, and make it available to my controllers and libraries (in the context of the request). The best would be to get a connection from the pool "lazily", i.e. only if needed
If a connection has been used, release it when the request has been processed
Additionally, is there something important I should know about it? When I see comments like "Be careful when using a QueuePool with use_threadlocal=True, especially with retries enabled. Synchronization may be required to prevent the connection from changing while another thread is using it.", what does that mean exactly?
Thanks.
--
Pierre
Well. I worked a little more. In fact, using a connection manager was probably not a good idea as this should be the template context. Additionally, opening a connection for each thread is not really a big deal. Opening a connection per request would be.
I ended up with just pycassa.connect_thread_local() in app_globals, and there I go.
Okay.
I worked a little, I learned a lot, and I found a possible answer.
Creating the pool
The best place to create the pool seems to be in the app_globals.py file, which is basically a container for objects which will be accessible "throughout the life of the application". Exactly what I want for a pool, in fact.
I just added my init code at the end of the file; it takes its settings from the Pylons configuration file:
"""Creating an instance of the Pycassa Pool"""
kwargs = {}
# Parsing servers
if 'cassandra.servers' in config['app_conf']:
    servers = config['app_conf']['cassandra.servers'].split(',')
    if len(servers):
        kwargs['server_list'] = servers
# Parsing timeout
if 'cassandra.timeout' in config['app_conf']:
    try:
        kwargs['timeout'] = float(config['app_conf']['cassandra.timeout'])
    except ValueError:
        # ignore a timeout setting that is not a number
        pass
# Finally creating the pool
self.cass_pool = pycassa.QueuePool(keyspace='Keyspace1', **kwargs)
I could have done better, like moving that in a function, or supporting more parameters (pool size, ...). Which I'll do.
Getting a connection at each request
Well. There seems to be a simple way: in the file base.py, add something like c.conn = g.cass_pool.get() before calling WSGIController, and something like c.conn.return_to_pool() after. This is simple, and it works. But it gets a connection from the pool even when the controller doesn't need one. I have to dig a little deeper.
Creating a connection manager
I had the simple idea to create a class which would be instantiated at each request in the base.py file, and which would automatically grab a connection from the pool when requested (and release it after). This is a really simple class:
class CassandraLocalManager:
    '''Requests a connection from a Pycassa Pool when needed, and releases it at the end of the object's life'''
    def __init__(self, pool):
        '''Class constructor'''
        assert isinstance(pool, Pool)
        self._pool = pool
        self._conn = None
    def get(self):
        '''Grabs a connection from the pool if not already done, and returns it'''
        if self._conn is None:
            self._conn = self._pool.get()
        return self._conn
    def __getattr__(self, key):
        '''It's cooler to write "c.conn" than "c.get()" in the code, isn't it?'''
        if key == 'conn':
            return self.get()
        else:
            # unknown attribute: behave like a normal object
            raise AttributeError(key)
    def __del__(self):
        '''Releases the connection, if needed'''
        if self._conn is not None:
            self._conn.return_to_pool()
Just added c.cass = CassandraLocalManager(g.cass_pool) before calling WSGIController in base.py, del(c.cass) after, and I'm all done.
And it works :
conn = c.cass.conn
cf = pycassa.ColumnFamily(conn, 'TestCF')
print cf.get('foo')
\o/
I don't know if this is the best way to do this. If not, please let me know =)
Plus, I still don't understand the "synchronization" part in the Pycassa source code: whether it is needed in my case, and what I should do to avoid problems.
Thanks.
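As a possible variation on the connection manager above: Python does not guarantee that __del__ runs promptly, so one option is to release the connection explicitly in a try/finally around the request. The sketch below is written for Python 3 and uses a made-up FakePool in place of the real pool, so none of its methods should be read as the Pycassa API; LazyConnection and request_scope are likewise hypothetical names.

```python
import queue
from contextlib import contextmanager


class FakePool:
    """Hypothetical stand-in for a connection pool (not the Pycassa API)."""

    def __init__(self, size=2):
        self._free = queue.Queue()
        for n in range(size):
            self._free.put(f"conn-{n}")
        self.checked_out = 0

    def get(self):
        self.checked_out += 1
        return self._free.get()

    def put(self, conn):
        self.checked_out -= 1
        self._free.put(conn)


class LazyConnection:
    """Borrows a connection on first access to .conn; release() returns it."""

    def __init__(self, pool):
        self._pool = pool
        self._conn = None

    @property
    def conn(self):
        if self._conn is None:
            self._conn = self._pool.get()
        return self._conn

    def release(self):
        if self._conn is not None:
            self._pool.put(self._conn)
            self._conn = None


@contextmanager
def request_scope(pool):
    """Wraps one request: the connection is released even if the handler raises."""
    lazy = LazyConnection(pool)
    try:
        yield lazy
    finally:
        lazy.release()
```

In base.py the equivalent would be a try/finally around the call to WSGIController, with the release in the finally clause. A request that never touches .conn never takes a connection from the pool.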