I have two Lambda functions written in Python:
Lambda function 1: Gets 'new' data from an API, gets 'old' data from an S3 bucket (if exists), compares the new to the old and creates 3 different lists of dictionaries: inserts, updates, and deletes. Each list is passed to the next lambda function in batches (~6MB) via Lambda invocation using RequestResponse. The full datasets can vary in size from millions of records to 1 or 2.
Lambda function 2: Handles each type of data (insert, update, delete) separately - specific things happen for each type, but eventually each batch is written to MySQL using pymysql executemany.
I can't figure out the best way to handle errors. For example, let's say one of the batches being written contains a single record that has a NULL value for a field that is not allowed to be NULL in the database. That entire batch fails, and I have no way of figuring out what was written to the database and what wasn't for that batch. Ideally, a notification would be triggered and the rogue record would be written somewhere for human review, while all other records would be successfully written.
Ideally, I could use something like the Bisect Batch on Function Failure feature available in Kinesis Firehose. It recursively splits failed batches into smaller batches and retries them until it has isolated the problematic records, which are then sent to a DLQ if one is configured. However, I don't think Kinesis Firehose will work for me because it doesn't write to RDS and therefore doesn't know which records fail.
This person https://stackoverflow.com/a/58384445/6669829 suggested falling back to execute() if executemany() fails. I'm not sure if that will work for the larger batches. But perhaps if I stream the data from S3 instead of invoking via RequestResponse this could work?
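Roughly, I imagine that fallback looking something like the sketch below (assuming a pymysql connection; the write_batch name and the idea of returning failed records for quarantine are mine):

```python
import pymysql

def write_batch(connection, sql, batch):
    """Fast path: executemany. If the whole batch fails, retry record by record
    so a single rogue record doesn't block the rest. `sql` is a parameterized
    statement, e.g. "INSERT INTO t (a, b) VALUES (%s, %s)"."""
    failed = []
    try:
        with connection.cursor() as cursor:
            cursor.executemany(sql, batch)
        connection.commit()
    except pymysql.MySQLError:
        connection.rollback()
        for record in batch:
            try:
                with connection.cursor() as cursor:
                    cursor.execute(sql, record)
                connection.commit()
            except pymysql.MySQLError as exc:
                connection.rollback()
                # These are the records to write somewhere for human review.
                failed.append({"record": record, "error": str(exc)})
    return failed
```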
This article (AWS Lambda batching) talks about going from Lambda to SQS to Lambda to RDS, but I'm not sure how specifically you can handle errors in that situation. Do you have to send one record at a time?
This blog uses something similar, but I'm still not sure how to adapt this for my use case or if this is even the best solution.
Looking for help in any form I can get: ideas, blog posts, tutorials, videos, etc.
Thank you!
I do have a few suggestions focused on organization, debugging, and resiliency - please keep in mind that I'm making assumptions about your architecture.
Organization
You currently have multiple dependent Lambdas processing data. When you have a processing flow like this, the complexity of what you're trying to process dictates whether you need an orchestration tool.
I would suggest orchestrating your Lambdas via AWS Step Functions.
Debugging
At the application level, log anything that isn't PII.
Now that you're using an orchestration tool, use the error handling of Step Functions, along with business logic in the application, to fail appropriately when the conditions for the next step aren't met.
Resiliency
Life happens, things break, incorrect code gets pushed.
Design your orchestration to put the failing event(s) your Lambdas receive into a processing queue (AWS SQS, Kafka, etc.) - you can then reprocess the events, or, if the events themselves are at fault, DLQ them.
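For the SQS variant, the error path could look something like this rough sketch (the queue URL, message shape, and process callback are assumptions, not your actual code):

```python
import json
import boto3

sqs = boto3.client("sqs")
# Placeholder; in practice this would come from an environment variable.
FAILED_EVENTS_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/failed-events"

def handle_batch(records, process):
    """Process records individually; park anything that fails for review/replay."""
    for record in records:
        try:
            process(record)
        except Exception as exc:  # deliberately broad: anything unprocessable goes to the queue
            sqs.send_message(
                QueueUrl=FAILED_EVENTS_QUEUE_URL,
                MessageBody=json.dumps({"record": record, "error": str(exc)}),
            )
```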
Here's a nice article on the use of orchestration with a design use case - give it a read.
Related
I have code that contains a variable I want to change manually, whenever I want, without stopping or pausing the main loop (e.g. with input()). I can't find a library that lets me set it manually at runtime, or a way to access the process memory to change that value.
For now I have set up a file watcher that reads the parameters every minute, but I presume this is an inefficient way to do it.
I guess you just want to expose an API. You did it with files, which works but is less common. You can use common best practices such as:
An HTTP web server. You can do it quickly with Flask/Bottle.
gRPC
A pub/sub mechanism - Redis, Kafka (more complicated; requires another storage solution - the DB itself).
I guess there are tons of other solutions, but you get the idea. I hope that's what you're looking for.
With those solutions you can manually trigger your endpoint and change whatever you want in your application.
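For example, with Flask you could run a tiny web server alongside the main loop and have the loop read a shared value (a minimal sketch; the parameter name, port, and the use of a background thread are my assumptions):

```python
import threading
import time

from flask import Flask, request

app = Flask(__name__)
params = {"speed": 1.0}  # the value you want to tweak while the loop is running

@app.route("/param/<name>", methods=["POST"])
def set_param(name):
    params[name] = float(request.form["value"])
    return {"name": name, "value": params[name]}

def main_loop():
    while True:
        # ... your existing work, reading the live value on every iteration ...
        print("current speed:", params["speed"])
        time.sleep(1)

if __name__ == "__main__":
    threading.Thread(target=main_loop, daemon=True).start()
    app.run(port=8000)
    # then: curl -X POST -d "value=2.5" http://localhost:8000/param/speed
```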
I am working with a legacy Kubeflow project; the pipelines have a few components that apply various filters to a data frame.
To do this, each component downloads the data frame from S3, applies its filter, and uploads the result to S3 again.
The components where the data frame is used for training or validating the models also download it from S3.
The question is: is this a best practice, or is it better to share the data frame directly between components, since the upload to S3 can fail and then fail the pipeline?
Thanks
As always with questions asking for "best" or "recommended" method, the primary answer is: "it depends".
However, there are certain considerations worth spelling out in your case.
Saving to S3 in between pipeline steps.
This stores the intermediate results of the pipeline, and as long as the steps take a long time and are restartable it may be worth doing. What "a long time" means depends on your use case, though.
Passing the data directly from component to component. This saves you storage throughput and, very likely, the not insignificant time to store and retrieve the data to/from S3. The downside: if you fail mid-way through the pipeline, you have to start from scratch.
So the questions are:
Are the steps idempotent (restartable)?
How often does the pipeline fail?
Is it easy to restart the processing from some mid-point?
Do you care about the processing time more than the risk of losing some work?
Do you care about the incurred cost of S3 storage/transfer?
The question is: is this a best practice
The best practice is to use the file-based I/O and built-in data-passing features. The current implementation uploads the output data to storage in upstream components and downloads the data in downstream components. This is the safest and most portable option and should be used until you see that it no longer works for you (100GB datasets will probably not work reliably).
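As an illustration, with the KFP v1 SDK the file-based data passing looks roughly like this (a sketch; the filter logic and the 'value' column are made up):

```python
from kfp.components import InputPath, OutputPath, create_component_from_func

def filter_dataframe(input_csv_path: InputPath('CSV'),
                     output_csv_path: OutputPath('CSV'),
                     min_value: float = 0.0):
    # The SDK materializes the upstream artifact at input_csv_path and
    # uploads whatever is written to output_csv_path for downstream steps.
    import pandas as pd
    df = pd.read_csv(input_csv_path)
    df[df['value'] >= min_value].to_csv(output_csv_path, index=False)

filter_op = create_component_from_func(filter_dataframe,
                                       packages_to_install=['pandas'])
```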
or is it better to share the data frame directly between components
How can you "directly share" in-memory python object between different python programs running in containers on different machines?
since the upload to S3 can fail and then fail the pipeline.
The failed pipeline can just be restarted. The caching feature will make sure that already finished tasks won't be re-executed.
Anyways, what is the alternative? How can you send the data between distributed containerized programs without sending it over the network?
My software goal is to automate the preprocessing pipeline; the pipeline has three code blocks:
Fetching the data - either via an API or by the client uploading a CSV to an S3 bucket.
Processing the data - my goal is to unify the data from the different clients into a single end schema.
Storing the data - the unified schema is stored in a database.
I know it is a very common system, but I failed to find the best design for it.
The requirements are:
The system is not real time; for each client I plan to fetch the new data every X days, and it does not matter if it finishes even a day later.
The processing part is unique per client's data; there are some common features, but also a lot of client-specific features and manipulation.
I want the system to be automated.
I thought of the following:
The Lambda solution:
Schedule a Lambda for each client which will fetch the data every X days; that Lambda will trigger another Lambda which does the processing. But if I have 100 clients, it will be awful to handle 200 Lambdas.
2.1 Make a project called API with a different script for each client, and schedule each script on EC2 or ECS.
2.2 Have another project called Processing, where a parent class holds the common code and the client-specific subclasses inherit from it; the API script activates the relevant processing script.
In the end I am very confused about what the best practice is; I have only found examples which handle one client, or general scheme approaches / block diagrams which are too broad.
Because I know it is such a common system, I would appreciate learning from others' experience.
I would appreciate any reference links or wisdom.
Take a look at Step Functions; it will allow you to decouple the execution of each stage and to reuse your Lambdas.
By passing input into the Step Function, the top Lambda may be able to make decisions which feed the others.
To schedule this, use a scheduled CloudWatch event.
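A rough sketch of the glue, assuming the scheduled event invokes a small Lambda that starts one state machine execution per client (the ARN, client list shape, and field names are placeholders):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:client-etl"  # placeholder

def handler(event, context):
    # The client list could come from the scheduled event's input or from a config table.
    for client in event.get("clients", []):
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            name=f"{client['id']}-{context.aws_request_id}",  # unique execution name
            input=json.dumps({"client_id": client["id"], "source": client["source"]}),
        )
```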
I'm looking for a powerful and fast way to handle processing of large files in Google App Engine.
It works as follows (simplified workflow at the end):
The customer sends a CSV file that our server will treat line by line.
Once the file is uploaded, an entry is added to the NDB datastore model Uploads with the CSV name, the file path (in Google Storage) and some basic information. Then, a task called "pre-processing" is created.
The pre-processing task will loop over all the lines of the CSV file (could be millions) and, for each line, will add an NDB entry to the UploadEntries model with the CSV id, the line, the data to extract/treat, and some boolean indicators of whether the line has started and finished processing ("is_treating", "is_done").
Once the pre-processing task has ended, it informs the client that "XXX lines will be processed".
A call to Uploads.next() is made. The next method will:
Search for the UploadEntries that have is_treating and is_done at false,
Will add a task in a Redis datastore for the next line found. (The Redis datastore is used because the work here is done on servers not managed by Google.)
Will also create a new entry in the task Process-healthcheck (this task is run after 5 minutes and checks that 7) has been correctly executed. If not, it considers that the Redis/outside server has failed and does the same as 7), but with "error" instead of the result).
Then, it updates UploadEntries.is_treating to True for that entry.
The outside server will process the data and return the results by making a POST request to an endpoint on the server.
That endpoint updates the UploadEntries entry in the datastore (including "is_treating" and "is_done"), and calls Uploads.next() to start the next line.
In Uploads.next(), when searching for the next entries returns nothing, I consider the file to be fully treated, and call the post-process task that rebuilds the CSV with the treated data and returns it to the customer.
Here's a few things to keep in mind :
The servers that do the real work are outside of Google AppEngine; that's why I had to come up with Redis.
The current way of doing things gives me flexibility on the number of entries to process in parallel: in 5), the Uploads.next() method contains a limit argument that lets me search for n lines to process in parallel. It can be 1, 5, 20, 50.
I can't just add all the lines from the pre-processing task directly to Redis, because in that case the next customer would have to wait for the first file to finish processing, and this would pile up and take too long.
But this system has various issues, and that's why I'm turning to your help :
Sometimes this system is so fast that the Datastore is not yet updated correctly, and when calling Uploads.next() the entries returned are already being processed (it's just that entry.is_treating = True has not yet been pushed to the database). A sketch of one fix I'm considering is below.
Redis or my server (I don't really know which) sometimes loses the task, or the POST request after the processing is not made, so the task never goes to is_done = True. That's why I had to implement the healthcheck system, to ensure the line is correctly treated no matter what. This has a double advantage: the name of that task contains the CSV id and the line, making it unique per file. If the Datastore is not up to date and the same task is run twice, the creation of the healthcheck will fail because the same name already exists, letting me know that there is a concurrency issue, so I ignore that task because it means the Datastore is not yet up to date.
I initially thought about running the file in one independent process, line by line, but this has the big disadvantage of not being able to run multiple lines in parallel. Moreover, Google limits the running of a task to 24h for dedicated targets (not default), and when the file is really big, it can run for longer than 24h.
For information, if it helps, I'm using Python.
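For the first issue (the stale reads when claiming the next entry), one thing I'm considering is claiming each entry inside an NDB transaction so two callers can't grab the same line. A rough sketch, using the model and field names described above (the legacy GAE ndb API and the upload_id field name are my assumptions):

```python
from google.appengine.ext import ndb

@ndb.transactional()
def claim_entry(entry_key):
    """Atomically flip is_treating; return the entry only if this caller won the race."""
    entry = entry_key.get()
    if entry is None or entry.is_treating or entry.is_done:
        return None
    entry.is_treating = True
    entry.put()
    return entry

def next_entries(upload_id, limit=5):
    # UploadEntries is the model described above; upload_id is my name for the CSV id field.
    # Over-fetch candidate keys, then claim them one by one inside transactions.
    candidate_keys = UploadEntries.query(
        UploadEntries.upload_id == upload_id,
        UploadEntries.is_treating == False,
        UploadEntries.is_done == False,
    ).fetch(limit * 2, keys_only=True)
    claimed = []
    for key in candidate_keys:
        entry = claim_entry(key)
        if entry is not None:
            claimed.append(entry)
        if len(claimed) >= limit:
            break
    return claimed
```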
And to simplify the workflow, here's what I'm trying to achieve in the best way possible:
Process a large file, running multiple parallel processes, one per line.
Send the work to an outside server using Redis. Once done, that outside server returns the result via a POST request to the main server.
The main server then updates the information about that line, and goes on to the next line.
I'd really appreciate if someone had a better way of doing this. I really believe I'm not the first to do that kind of work and I'm pretty sure I'm not doing it correctly.
(I believe Stack Overflow is the best Stack Exchange site to post that kind of question on, since it's an algorithm question, but it's also possible I didn't see a better network for it. If so, I'm sorry about that.)
The servers that do the real work are outside of Google AppEngine
Have you considered using Google Cloud Dataflow for processing large files instead?
It is a managed service that will handle the file splitting and processing for you.
Based on initial thoughts, here is an outline process:
User uploads files directly to Google Cloud Storage, using signed URLs or the Blobstore API.
A request from AppEngine launches a small Compute Engine instance that initiates a blocking request (BlockingDataflowPipelineRunner) to launch the Dataflow task. (I'm afraid it needs to be a Compute Engine instance because of sandbox and blocking I/O issues.)
When the Dataflow task is finished, the Compute Engine instance is unblocked and posts a message to Pub/Sub.
The Pub/Sub message invokes a webhook on the AppEngine service that changes the task's state from 'in progress' to 'complete' so the user can fetch their results.
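For the Dataflow task itself, a minimal Apache Beam sketch of the per-line processing (bucket paths and process_line are placeholders, and this uses the current Beam SDK rather than the older API named above):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def process_line(line):
    # Placeholder for the real per-line treatment.
    return line.upper()

def run(argv=None):
    # Runner, project, region and temp_location are passed as command-line flags,
    # e.g. --runner=DataflowRunner --project=my-project --temp_location=gs://my-bucket/tmp
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://my-bucket/uploads/input.csv")   # placeholder path
         | "Treat" >> beam.Map(process_line)
         | "Write" >> beam.io.WriteToText("gs://my-bucket/results/output"))     # placeholder path

if __name__ == "__main__":
    run()
```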
I'm looking to convert a large directory of high resolution images (several million) into thumbnails using Python. I have a DynamoDB table that stores the location of each image in S3.
Instead of processing all these images on one EC2 instance (would take weeks) I'd like to write a distributed application using a bunch of instances.
What techniques could I use to write a queue that would allow a node to "check out" an image from the database, resize it, and update the database with the new dimensions of the generated thumbnails?
Specifically I'm worried about atomicity and concurrency -- how can I prevent two nodes from checking out the same job at the same time with DynamoDB?
One approach you could take would be to use Amazon's Simple Queue Service (SQS) in conjunction with DynamoDB. What you could do is write messages to the queue that contain something like the hash key of the image entry in DynamoDB. Each instance would periodically check the queue and pull messages off. When an instance pulls a message off the queue, the message becomes invisible to other instances for a given amount of time. You can then look up and process the image and delete the message from the queue. If for some reason something goes wrong while processing the image, the message will not be deleted and it will become visible for other instances to grab.
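A rough sketch of that worker loop with boto3 (the queue URL, message shape, and resize_image callback are placeholders):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/thumbnail-jobs"  # placeholder

def worker_loop(resize_image):
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)  # long polling
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])      # e.g. {"image_id": "..."}
            resize_image(job["image_id"])      # look up in DynamoDB, resize, write back
            # Delete only after successful processing; otherwise the message
            # reappears once the visibility timeout expires.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```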
Another, probably more complicated, approach would be to use DynamoDB's conditional update mechanism to implement a locking scheme. For example, you could add something like a 'beingProcessed' attribute to your data model that is either 0 or 1. The first thing an instance could do is perform a conditional update on this attribute, changing the value to 1 iff the initial value is 0. There is probably more to do here around making it a proper/robust locking mechanism....
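A hedged sketch of that conditional update with boto3 (the table name, key, and attribute names are assumptions; the same pattern underlies the optimistic-locking answer below):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("images")  # placeholder table name

def try_claim(image_id):
    """Return True if this node won the job, False if another node already claimed it."""
    try:
        table.update_item(
            Key={"image_id": image_id},
            UpdateExpression="SET beingProcessed = :one",
            ConditionExpression="beingProcessed = :zero",
            ExpressionAttributeValues={":one": 1, ":zero": 0},
        )
        return True
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```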
Using DynamoDB's optimistic locking with versioning would allow a node to "check out" a job by updating a status field to "InProgress". If a different node tried checking out the same job by updating the status field, it would receive an error and would know to retrieve a different job.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaVersionSupportHLAPI.html
I know this is an old question, so this answer is more for the community than the original poster.
A good/cool approach is to use EMR for this. There is an inter-connection layer in EMR that connects Hive to DynamoDB. You can then walk through your table almost as you would a SQL table and perform your operations.
There is a pretty good guide for it here: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html
It is for import/export, but it can easily be adapted.
Recently, DynamoDB released parallel scan:
http://aws.typepad.com/aws/2013/05/amazon-dynamodb-parallel-scans-and-other-good-news.html
Now, 10 hosts can read from the same table at the same time, and DynamoDB guarantees they won't see the same items.
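With boto3, splitting the work looks roughly like this (each host scans its own disjoint segment; the table name is a placeholder):

```python
import boto3

table = boto3.resource("dynamodb").Table("images")  # placeholder table name

def scan_segment(segment, total_segments=10):
    """Iterate over this worker's disjoint slice of the table."""
    kwargs = {"Segment": segment, "TotalSegments": total_segments}
    while True:
        page = table.scan(**kwargs)
        for item in page["Items"]:
            yield item
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# e.g. worker number 3 of 10 iterates over scan_segment(3)
```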