I'm looking for a powerful and fast way to handle processing of large files in Google App Engine.
It works as follows (simplified workflow at the end):
The customer sends a CSV file that our server will process, line by line.
Once the file is uploaded, an entry is added to the NDB datastore model Uploads with the CSV name, the file path (to Google Storage) and some basic information. Then a task called "pre-processing" is created.
The pre-processing task loops over all the lines of the CSV file (could be millions) and, for each line, adds an NDB entry to the UploadEntries model with the CSV id, the line, the data to extract/process, and boolean indicators showing whether the line has started and finished processing ("is_treating", "is_done").
Once the pre-processing task has ended, it informs the client that "XXX lines will be processed".
A call to Uploads.next() is made. The next method will:
Search for the UploadEntries that have is_treating and is_done at false,
Add a task in a Redis datastore for the next line found. (The Redis datastore is used because the work here is done on servers not managed by Google.)
Also create a new entry in the task Process-healthcheck. (This task is run after 5 minutes and checks that 7), the outside-server processing, has been correctly executed. If not, it considers that the Redis/outside server has failed and does the same as 7), but with an "error" result instead of the real one.)
Then, it updates UploadEntries.is_treating to True for that entry.
The outside server processes the data and returns the results by making a POST request to an endpoint on the main server.
That endpoint updates the UploadEntries entry in the datastore (including "is_treating" and "is_done") and calls Uploads.next() to start the next line.
In Uploads.next(), when searching for the next entries returns nothing, I consider the file to be fully processed, and I call the post-process task that rebuilds the CSV with the processed data and returns it to the customer. (A rough sketch of this next() logic follows below.)
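To make the selection step concrete, the Uploads.next() described above roughly amounts to something like this. This is a simplified sketch, not my actual code: the field names are illustrative, and the push_to_redis / create_healthcheck_task / start_post_process helpers are placeholders for the real dispatch logic.

```python
from google.appengine.ext import ndb

class UploadEntries(ndb.Model):
    upload_id   = ndb.IntegerProperty()
    line_number = ndb.IntegerProperty()
    data        = ndb.JsonProperty()
    is_treating = ndb.BooleanProperty(default=False)
    is_done     = ndb.BooleanProperty(default=False)

def push_to_redis(entry):
    pass  # placeholder: enqueue the line for the outside servers

def create_healthcheck_task(entry):
    pass  # placeholder: schedule the 5-minute healthcheck task

def start_post_process(upload_id):
    pass  # placeholder: launch the post-process task that rebuilds the CSV

def next_entries(upload_id, limit=5):
    # Pick the lines nobody has started working on yet.
    entries = UploadEntries.query(
        UploadEntries.upload_id == upload_id,
        UploadEntries.is_treating == False,
        UploadEntries.is_done == False,
    ).fetch(limit)

    if not entries:
        # Nothing left: the whole file has been handled.
        start_post_process(upload_id)
        return

    for entry in entries:
        push_to_redis(entry)            # hand the line to the outside servers
        create_healthcheck_task(entry)  # schedule the 5-minute healthcheck
        entry.is_treating = True
    ndb.put_multi(entries)              # this is the write that sometimes lags behind
```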
Here are a few things to keep in mind:
The servers that do the real work are outside of Google App Engine; that's why I had to come up with Redis.
The current way of doing things gives me flexibility in the number of entries processed in parallel: in 5), the Uploads.next() method takes a limit argument that lets me search for n lines to process in parallel. It can be 1, 5, 20, 50.
I can't just add all the lines from the pre-processing task directly to Redis, because in that case the next customer would have to wait for the first file to finish processing, and the backlog would pile up and take too long.
But this system has various issues, and that's why I'm turning to you for help:
Sometimes this system is so fast that the Datastore is not yet updated correctly, and when Uploads.next() is called, the entries returned are already being processed (it's just that entry.is_treating = True has not yet been pushed to the database).
Redis or my server (I don't really know which) sometimes loses the task, or the POST request after processing is never made, so the task never reaches is_done = True. That's why I had to implement a healthcheck system, to ensure the line is correctly processed no matter what. This has a double advantage: the name of that task contains the CSV id and the line, making it unique per file. If the Datastore is not up to date and the same task is run twice, the creation of the healthcheck will fail because the same name already exists, letting me know there is a concurrency issue; I then ignore that task because it means the Datastore is not yet up to date. (This trick is sketched below.)
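For illustration, the named-task trick is roughly the following, assuming the standard App Engine task queue API; the handler URL and the name format here are just examples.

```python
from google.appengine.api import taskqueue

def create_healthcheck_task(csv_id, line_number):
    try:
        taskqueue.add(
            url='/tasks/process-healthcheck',                  # example handler URL
            name='healthcheck-%s-%s' % (csv_id, line_number),  # unique per CSV + line
            countdown=300,                                     # run 5 minutes later
            params={'csv_id': csv_id, 'line': line_number},
        )
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        # The same line was dispatched twice: the Datastore is not up to date yet,
        # so this duplicate dispatch is ignored.
        return False
    return True
```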
I initially thought about running the file in one independent process, line by line, but that has the big disadvantage of not being able to run multiple lines in parallel. Moreover, Google limits a task to 24 hours of running time for dedicated targets (not default), and when the file is really big, it can run for longer than 24h.
For information, if it helps, I'm using Python.
And to simplify the workflow, here's what I'm trying to achieve in the best way possible :
Process a large file, running multiple parallel processes, one per line.
Send the work to an outside server using Redis. Once done, that outside server returns the result via a POST request to the main server.
The main server then updates the information about that line and moves on to the next line.
I'd really appreciate it if someone had a better way of doing this. I really believe I'm not the first to do this kind of work, and I'm pretty sure I'm not doing it correctly.
(I believe Stack Overflow is the best Stack Exchange site for this kind of question since it's an algorithm question, but it's also possible I didn't see a better network for it. If so, I'm sorry about that.)
The servers that do the real work are outside of Google App Engine
Have you considered using Google Cloud Dataflow for processing large files instead?
It is a managed service that will handle the file splitting and processing for you.
Based on initial thoughts, here is an outline process:
The user uploads files directly to Google Cloud Storage, using signed URLs or the Blobstore API.
A request from App Engine launches a small Compute Engine instance that initiates a blocking request (BlockingDataflowPipelineRunner) to launch the Dataflow task. (I'm afraid it needs to be a Compute Engine instance because of sandbox and blocking I/O issues.)
When the Dataflow task is finished, the Compute Engine instance is unblocked and posts a message to Pub/Sub.
The Pub/Sub message invokes a webhook on the App Engine service that changes the task's state from 'in progress' to 'complete' so the user can fetch their results.
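For illustration only, a minimal Dataflow pipeline in the Python SDK (Apache Beam) could look like the sketch below; the project, bucket and per-line function are assumptions. In the current Python SDK, leaving the with block waits for the job to finish, which plays the role of the BlockingDataflowPipelineRunner mentioned above.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def process_line(line):
    # Placeholder for whatever per-line work is needed.
    return line.upper()

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                  # assumed project id
    region='us-central1',
    temp_location='gs://my-bucket/tmp',    # assumed staging bucket
)

with beam.Pipeline(options=options) as p:  # blocks until the Dataflow job completes
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.csv')
     | 'Process' >> beam.Map(process_line)
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output'))
```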
Related
I want to create a Python 3 program that takes in MySQL data, holds it temporarily, and can then pass this data on to a cloud MySQL database.
The idea is that it acts as a buffer for entries in the event that my local network goes down; the buffer would then be able to pass those entries on at a later date, theoretically providing fault tolerance.
I have done some research into replication and GTIDs and I'm currently in the process of learning these concepts. However, I would like to write my own solution, or at least have it be a smaller program rather than a full server-side implementation of replication.
I already have a program that generates some MySQL data to fill my DB; the key part I need help with is the buffer aspect/implementation (the code I currently have isn't important, as I can rework it later on).
I would greatly appreciate any good resources or help, thank you!
I would implement what you describe using a message queue.
Example: https://hevodata.com/learn/python-message-queue/
The idea is to run a message queue service on your local computer. Your Python application pushes items into the MQ instead of committing directly to the database.
Then you need another background task, called a worker, which you may also write in Python or another language, that consumes items from the MQ and writes them to the cloud database when it's available. If the cloud database is not available, the background worker pauses.
The data in the MQ can grow while the background worker is paused. If this goes on too long, you may run out of space. But hopefully the rate of growth is slow enough, and the cloud database available regularly enough, that the risk of this happening is low.
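A minimal sketch of both sides, assuming a local RabbitMQ broker (via pika) and pymysql for the cloud database; the queue name, table, host and credentials are all made up for illustration:

```python
import json
import time

import pika
import pymysql

QUEUE = "pending_rows"   # assumed queue name

def push_row(row):
    """Called by the main application instead of writing to the cloud DB directly."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=QUEUE, durable=True)           # survive broker restarts
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE,
        body=json.dumps(row),
        properties=pika.BasicProperties(delivery_mode=2),      # persist the message to disk
    )
    conn.close()

def run_worker():
    """Background worker: drain the MQ into the cloud database when it is reachable."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=QUEUE, durable=True)

    while True:
        method, properties, body = channel.basic_get(queue=QUEUE)
        if method is None:          # queue is empty
            time.sleep(5)
            continue
        try:
            db = pymysql.connect(host="cloud-db.example.com",   # assumed cloud host
                                 user="app", password="secret", database="buffer")
            with db.cursor() as cur:
                cur.execute("INSERT INTO entries (payload) VALUES (%s)", (body.decode(),))
            db.commit()
            db.close()
            channel.basic_ack(delivery_tag=method.delivery_tag)  # only ack once it is stored
        except pymysql.MySQLError:
            # Cloud DB unreachable: requeue the item and pause before trying again.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
            time.sleep(30)
```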
Re your comment about performance.
This is a different application architecture, so there are pros and cons.
On the one hand, if your application is "writing" to a local MQ instead of the remote database, it's likely to appear to the app as if writes have lower latency.
On the other hand, posting to the MQ does not write to the database immediately. There still needs to be a step of the worker pulling an item and initiating its own write to the database. So from the application's point of view, there is a brief delay before the data appears in the database, even when the database seems available.
So the app can't depend on the data being ready to be queried immediately after the app pushes it to the MQ. That is, it might be pretty prompt, under 1 second, but that's not the same as writing to the database directly, which ensures that the data is ready to be queried immediately after the write.
The performance of the worker writing the item to the database should be identical to that of the app writing that same item to the same database. From the database perspective, nothing has changed.
I have two Lambda functions written in Python:
Lambda function 1: gets 'new' data from an API, gets 'old' data from an S3 bucket (if it exists), compares the new to the old, and creates 3 different lists of dictionaries: inserts, updates, and deletes. Each list is passed to the next Lambda function in batches (~6 MB) via Lambda invocation using RequestResponse. The full datasets can vary in size from millions of records to 1 or 2.
Lambda function 2: handles each type of data (insert, update, delete) separately; specific things happen for each type, but eventually each batch is written to MySQL using pymysql's executemany.
I can't figure out the best way to handle errors. For example, let's say one of the batches being written contains a single record that has a NULL value for a field that is not allowed to be NULL in the database. That entire batch fails, and I have no way of figuring out what was written to the database and what wasn't for that batch. Ideally, a notification would be triggered and the rogue record would be written somewhere for human review; all other records would be successfully written.
Ideally, I could use something like the Bisect Batch on Function Failure feature available for Kinesis event sources. It recursively splits failed batches into smaller batches and retries them until it has isolated the problematic records, which are then sent to a DLQ if one is configured. However, I don't think Kinesis Firehose will work for me because it doesn't write to RDS and therefore doesn't know which records fail.
This person https://stackoverflow.com/a/58384445/6669829 suggested using execute if executemany fails. I'm not sure that will work for the larger batches. But perhaps if I stream the data from S3 instead of invoking via RequestResponse, it could work?
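For reference, the fallback pattern I understand that answer to suggest would look roughly like this; the table and columns are made up, and in reality this would live inside Lambda function 2.

```python
import pymysql

def write_batch(conn, rows):
    """Try the fast path; on failure, retry row by row and collect the bad ones."""
    sql = "INSERT INTO target_table (id, name, value) VALUES (%s, %s, %s)"  # made-up schema
    failed = []
    try:
        with conn.cursor() as cur:
            cur.executemany(sql, rows)
        conn.commit()
        return failed
    except pymysql.MySQLError:
        conn.rollback()

    # Slow path: isolate the rogue record(s) one insert at a time.
    for row in rows:
        try:
            with conn.cursor() as cur:
                cur.execute(sql, row)
            conn.commit()
        except pymysql.MySQLError as exc:
            conn.rollback()
            failed.append((row, str(exc)))   # send these somewhere for human review
    return failed
```

That at least tells me which rows never made it in, though I'm still not sure it scales to the bigger batches.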
This article (AWS Lambda batching) talks about going from Lambda to SQS to Lambda to RDS, but I'm not sure how exactly you handle errors in that situation. Do you have to send one record at a time?
This blog uses something similar, but I'm still not sure how to adapt this for my use case or if this is even the best solution.
Looking for help in any form I can get; ideas, blog posts, tutorials, videos, etc.
Thank you!
I do have a few suggestions focused on organization, debugging, and resiliency; please keep in mind that I'm making assumptions about your architecture.
Organization
You currently have multiple dependent Lambdas processing data. When you have a processing flow like this, the complexity of what you're trying to process dictates whether you need an orchestration tool.
I would suggest orchestrating your Lambdas via AWS Step Functions.
Debugging
At the application level - log anything that isn't PII
Now that you're using an orchestration tool, use the error handling of Step Functions, along with any business logic in the application, to fail appropriately when the conditions for the next step aren't met.
Resiliency
Life happens, things break, incorrect code gets pushed.
Design your orchestration to put the failing event(s) your Lambdas receive into a processing queue (AWS SQS, Kafka, etc.); you can reprocess those events or, if the events themselves are at fault, DLQ them.
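As a sketch of that idea (not a drop-in solution), a Lambda could catch per-record failures and park them on an assumed SQS queue via boto3; the event shape, queue URL and process() logic are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")
FAILED_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/failed-events"  # assumed

def process(record):
    # Placeholder for the existing insert/update/delete handling.
    if record.get("value") is None:
        raise ValueError("NULL value not allowed")

def handler(event, context):
    records = event.get("records", [])           # the event shape here is an assumption
    failed = []
    for record in records:
        try:
            process(record)
        except Exception as exc:
            failed.append({"record": record, "error": str(exc)})

    if failed:
        # Park the failing events so they can be reprocessed or DLQ'd later.
        sqs.send_message(QueueUrl=FAILED_QUEUE_URL, MessageBody=json.dumps(failed))

    return {"processed": len(records) - len(failed), "failed": len(failed)}
```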
Here's a nice article on the use of orchestration with a design use case - give it a read
I'm currently writing an application in Python that mainly works with files. The algorithm works like this:
A user submits a file through an API; imagine a POST request with the file and some data.
Then the program works with the file and extracts some conclusions.
After that, those conclusions are stored inside a DB.
Then the user is able to query the db and ask for conclusions.
Since a file can be submitted through the API by many users on many systems simultaneously, and processing a file may take some time, I want to explore a way to implement a work queue such that:
Only one file can be processed at a time, so when a user submits a file, that file is put into a work queue and has to wait before reaching the "processing function".
How can I do that? Any reference or tutorial?
Thanks
Check out Celery; there are many good tutorials online.
It works with workers, so it also doesn't block your API from listening.
It could also give you the option to process multiple files concurrently later, if you ever want that.
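A minimal sketch of that setup, assuming Redis as the broker; all names here are illustrative and the two helpers are placeholders for your own logic:

```python
# tasks.py
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")   # assumed broker URL

def extract_conclusions(path, metadata):
    # Placeholder: open the file and compute whatever "conclusions" means for you.
    return {"path": path, "meta": metadata}

def save_to_db(conclusions):
    # Placeholder: store the conclusions with your DB layer of choice.
    print(conclusions)

@app.task
def process_file(path, metadata):
    # The heavy file work happens here, in a worker process, not in the API process.
    save_to_db(extract_conclusions(path, metadata))
```

The API view only enqueues and returns immediately, e.g. process_file.delay("/uploads/report.pdf", {"user": 42}), and running the worker with celery -A tasks worker --concurrency=1 gives you exactly one file processed at a time; raise the concurrency later if you want parallelism.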
I am currently developing a data-processing web server (Linux) using Python Flask.
The general workflow is:
Get an input file from the user (handled by Flask).
Flask passes this input file to a Java program.
The Java program processes the input file and saves the outputs (multiple files) on the server.
Flask calls another Python script which processes these outputs to get the final result and returns the result to the client.
The problem is: between step 3 and step 4 there are some intermediate files. This would not have been a problem at all for a local program, but as a server program, when more than one client accesses it, a client could get unexpected results generated from input provided by another user who is using the web program at the same time.
The way I see it, this is a kind of mutual exclusion problem on file access. I have had mutual exclusion problems with threads before, and I solved some of them using thread locks, such as synchronization in Java and locks in Python, but I am not sure what to do when it comes to files instead of threads.
It occurred to me that maybe I could spawn different copies of the files for different clients. But as I understand it, HTTP is stateless, so you can't really know who is accessing the server. I don't want to add a login system and a user database for this purpose, as I sense there is a much simpler and better way to solve the problem.
I have been looking for a good solution these past few days but haven't found an ideal one, so I am looking for advice here. Any suggestions will be highly appreciated. If you can suggest a viable solution, please feel free to provide your name so I can add you to the thank-you list of the digital and paper publications about this tool when it's published.
As a systems kind of person, I suggest something like this:
https://docs.python.org/3/library/fcntl.html#fcntl.lockf
This is how I would solve it; there are many ways to solve this problem, and which one is best is of course up for debate.
Assume the output file is where the conflict happens.
So you lock the file and keep polling until the resource is released (the user has to wait), forcing one user at a time to access the file: poll with time.sleep for 2-3 seconds, wrapped in a try/except, with the lock held on the output file only. When the resource is released, the next user's process goes through normally.
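A rough sketch of that polling approach with fcntl.lockf; the output path is just an example, and this only works when all requests are served from the same machine:

```python
import fcntl
import time

OUTPUT_PATH = "/tmp/shared_output.dat"   # example path for the contested output file

def with_output_lock(work):
    """Poll until we hold an exclusive lock on the output file, then run work(f)."""
    with open(OUTPUT_PATH, "a+") as f:
        while True:
            try:
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)   # non-blocking attempt
                break
            except OSError:
                time.sleep(2)     # someone else holds it: wait a couple of seconds and retry
        try:
            return work(f)        # only one request gets past this point at a time
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```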
Another easy way is to dump the data into an RDS like MySQL or Postgres; it will handle all the file-access nightmares that come from concurrent requests (put the output file in a DB).
I'm looking to write a daemon that:
reads a message from a queue (SQS, RabbitMQ, whatever...) containing a path to a zip file
updates a record in the database saying something like "this job is processing"
reads the aforementioned archive's contents and, for each file found, inserts a row into a database with information culled from the file's metadata
duplicates each file to S3
deletes the zip file
marks the job as "complete"
reads the next message in the queue, and repeats
This should run as a service, initiated by a message queued when someone uploads a file via the web frontend. The uploader doesn't need to see the results immediately, but the upload should be processed in the background fairly expediently.
I'm fluent with Python, so the very first thing that comes to mind is writing a simple server with Twisted to handle each request and carry out the process mentioned above. But I've never written anything like this that would run in a multi-user context. It's not going to service hundreds of uploads per minute or hour, but it'd be nice if it could handle several at a time, reasonably. I'm also not terribly familiar with writing multi-threaded applications and dealing with issues like blocking.
How have people solved this in the past? What are some other approaches I could take?
Thanks in advance for any help and discussion!
I've used Beanstalkd as a queueing daemon to very good effect (some near-time processing and image resizing - over 2 million so far in the last few weeks). Throw a message into the queue with the zip filename (maybe from a specific directory) [I serialise a command and parameters in JSON], and when you reserve the message in your worker-client, no one else can get it, unless you allow it to time out (when it goes back to the queue to be picked up).
The rest is the unzipping and uploading to S3, for which there are other libraries.
If you want to handle several zip files at once, run as many worker processes as you want.
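A rough sketch of such a worker-client using the beanstalkc library; the tube name, the message shape and the handle_zip placeholder are whatever you choose:

```python
import json
import beanstalkc

def handle_zip(path):
    # Placeholder for the real work: unzip, record metadata rows, copy files to S3,
    # delete the zip, mark the job complete.
    pass

beanstalk = beanstalkc.Connection(host="localhost", port=11300)
beanstalk.watch("zip-jobs")              # assumed tube name

while True:
    job = beanstalk.reserve()            # blocks; no other worker can grab this job now
    msg = json.loads(job.body)           # e.g. {"command": "import_zip", "path": "/uploads/x.zip"}
    try:
        handle_zip(msg["path"])
        job.delete()                     # done: remove it from the queue
    except Exception:
        job.release(delay=60)            # put it back to be retried later
```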
I would avoid doing anything multi-threaded and instead use the queue and the database to synchronize as many worker processes as you care to start up.
For this application I think Twisted, or any framework for creating server applications, is going to be overkill.
Keep it simple: a Python script starts up, checks the queue, does some work, then checks the queue again. If you want a proper background daemon, you might want to just make sure you detach from the terminal as described here: How do you create a daemon in Python?
Add some logging, and maybe a try/except block to email failures out to you.
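Something along these lines, using SQS via boto3 purely as an example (any queue with receive/delete semantics works the same way); the queue URL and the handle() body are placeholders:

```python
import json
import logging
import time

import boto3

logging.basicConfig(filename="worker.log", level=logging.INFO)
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/zip-jobs"   # assumed queue

def handle(message):
    # Placeholder: unzip, insert metadata rows, copy files to S3, mark the job complete.
    logging.info("processing %s", message)

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        try:
            handle(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        except Exception:
            logging.exception("job failed")   # or email the failure to yourself here
    time.sleep(1)
```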
I opted to use a combination of Celery (http://ask.github.com/celery/introduction.html), RabbitMQ, and a simple Django view to handle uploads. The workflow looks like this:
The Django view accepts and stores the upload.
A Celery Task is dispatched to process the upload. All work is done inside the Task.
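Roughly, the shape of it is the following (simplified, with names changed, and written against the current Celery API rather than exactly what I had at the time):

```python
# views.py
from django.http import JsonResponse
from tasks import process_upload        # assumed module layout

def store_upload(f):
    # Save the uploaded file somewhere the worker can read it.
    dest = "/tmp/uploads/" + f.name
    with open(dest, "wb") as out:
        for chunk in f.chunks():
            out.write(chunk)
    return dest

def upload(request):
    path = store_upload(request.FILES["archive"])    # the zip from the web frontend
    job = process_upload.delay(path)                 # dispatch to Celery, return right away
    return JsonResponse({"task_id": job.id})

# tasks.py
from celery import Celery

app = Celery("tasks", broker="amqp://guest@localhost//")   # RabbitMQ broker

@app.task
def process_upload(path):
    # All the heavy lifting lives here: unzip, record metadata, copy to S3, clean up.
    pass
```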