I'm looking to convert a large directory of high resolution images (several million) into thumbnails using Python. I have a DynamoDB table that stores the location of each image in S3.
Instead of processing all these images on one EC2 instance (would take weeks) I'd like to write a distributed application using a bunch of instances.
What techniques could I use to write a queue that would allow a node to "check out" an image from the database, resize it, and update the database with the new dimensions of the generated thumbnails?
Specifically I'm worried about atomicity and concurrency -- how can I prevent two nodes from checking out the same job at the same time with DynamoDB?
One approach you could take would be to use Amazon's Simple Queue Service (SQS) in conjunction with DynamoDB. You could write messages to the queue that contain, say, the hash key of the image entry in DynamoDB. Each instance would periodically check the queue and grab messages off it. When an instance grabs a message off the queue, the message becomes invisible to other instances for a given amount of time. You can then look up and process the image and delete the message from the queue. If for some reason something goes wrong while processing the image, the message will not be deleted and it will become visible again for other instances to grab.
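For illustration, a minimal sketch of that loop with boto3 (the queue URL, table/key names, and the resize_image() helper are placeholders):

import boto3

sqs = boto3.client('sqs')
table = boto3.resource('dynamodb').Table('images')   # assumed table name
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/thumbnail-jobs'

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        VisibilityTimeout=300,   # hidden from other workers for 5 minutes
        WaitTimeSeconds=20,      # long polling
    )
    for msg in resp.get('Messages', []):
        image_id = msg['Body']   # hash key of the image entry
        item = table.get_item(Key={'imageId': image_id})['Item']
        width, height = resize_image(item['s3Key'])   # hypothetical helper
        table.update_item(
            Key={'imageId': image_id},
            UpdateExpression='SET thumbWidth = :w, thumbHeight = :h',
            ExpressionAttributeValues={':w': width, ':h': height},
        )
        # Delete only after the work succeeded; otherwise the message
        # reappears once the visibility timeout expires.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])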
Another, probably more complicated, approach would be to use DynamoDB's conditional update mechanism to implement a locking scheme. For example, you could add something like a 'beingProcessed' attribute to your data model that is either 0 or 1. The first thing an instance would do is perform a conditional update on this attribute, changing the value to 1 if and only if the initial value is 0. There is probably more to do here around making it a proper/robust locking mechanism....
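A rough sketch of that conditional update with boto3 (table and attribute names are just examples):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('images')

def try_check_out(image_id):
    # Atomically flip beingProcessed from 0 to 1; returns False if another node won.
    try:
        table.update_item(
            Key={'imageId': image_id},
            UpdateExpression='SET beingProcessed = :one',
            ConditionExpression='beingProcessed = :zero',
            ExpressionAttributeValues={':one': 1, ':zero': 0},
        )
        return True
    except ClientError as err:
        if err.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False   # someone else already holds this job
        raise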
Using DynamoDB's optimistic locking with versioning would allow a node to "check out" a job by updating a status field to "InProgress". If a different node tried checking out the same job by updating the status field, it would receive an error and would know to retrieve a different job.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaVersionSupportHLAPI.html
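The link describes the Java mapper's @DynamoDBVersionAttribute; a rough Python equivalent (attribute names are illustrative) does the same thing by hand with a conditional write on a version number:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('images')

def check_out(image_id):
    item = table.get_item(Key={'imageId': image_id})['Item']
    try:
        table.update_item(
            Key={'imageId': image_id},
            UpdateExpression='SET jobStatus = :s, #v = #v + :one',
            ConditionExpression='#v = :expected',
            ExpressionAttributeNames={'#v': 'version'},
            ExpressionAttributeValues={
                ':s': 'InProgress',
                ':one': 1,
                ':expected': item['version'],
            },
        )
        return item        # checked out; safe to start processing
    except ClientError as err:
        if err.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return None    # another node got there first; pick a different job
        raise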
I know this is an old question, so this answer is more for the community than the original poster.
A good/cool approach is to use EMR for this. EMR provides an integration layer that connects Hive to DynamoDB, so you can walk through your table almost as you would a SQL table and perform your operations.
There is a pretty good guide for it here: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html
It is aimed at import/export but can easily be adapted.
Recently, DynamoDB released parallel scan:
http://aws.typepad.com/aws/2013/05/amazon-dynamodb-parallel-scans-and-other-good-news.html
Now, 10 hosts can read from the same table at the same time, and DynamoDB guarantees they won't see the same items.
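For example, with boto3 each worker passes its own Segment number, and together the workers cover the table without overlap (a sketch; the table name is a placeholder):

import boto3

table = boto3.resource('dynamodb').Table('images')
TOTAL_SEGMENTS = 10

def scan_segment(segment):
    # Run by worker number `segment` (0-9); yields only that worker's share of items.
    kwargs = {'Segment': segment, 'TotalSegments': TOTAL_SEGMENTS}
    while True:
        page = table.scan(**kwargs)
        for item in page['Items']:
            yield item
        if 'LastEvaluatedKey' not in page:
            break
        kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']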
I have two Lambda functions written in Python:
Lambda function 1: Gets 'new' data from an API, gets 'old' data from an S3 bucket (if exists), compares the new to the old and creates 3 different lists of dictionaries: inserts, updates, and deletes. Each list is passed to the next lambda function in batches (~6MB) via Lambda invocation using RequestResponse. The full datasets can vary in size from millions of records to 1 or 2.
Lambda function 2: Handles each type of data (insert, update, delete) separately - specific things happen for each type, but eventually each batch is written to MySQL using pymysql executemany.
I can't figure out the best way to handle errors. For example, let's say one of the batches being written contains a single record that has a NULL value for a field that is not allowed to be NULL in the database. That entire batch fails, and I have no way of figuring out what was written to the database and what wasn't for that batch. Ideally, a notification would be triggered and the rogue record would be written somewhere where it could be human-reviewed, while all other records would be successfully written.
Ideally, I could use something like the Bisect Batch on Function Failure available in Kinesis Firehose. It will recursively split failed batches into smaller batches and retry them until it has isolated the problematic records. These will then be sent to a DLQ if one is configured. However, I don't think Kinesis Firehose will work for me because it doesn't write to RDS and therefore doesn't know which records fail.
This person https://stackoverflow.com/a/58384445/6669829 suggested using execute if executemany fails. Not sure if that will work for the larger batches. But perhaps if I stream the data from S3 instead of invoking via RequestResponse this could work?
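For reference, here is a sketch of what that fallback might look like with pymysql (untested; the surrounding batching is simplified):

import pymysql

def write_batch(connection, sql, rows):
    # Returns the rows that could not be written, so they can be parked for review.
    failed = []
    try:
        with connection.cursor() as cur:
            cur.executemany(sql, rows)
        connection.commit()
        return failed
    except pymysql.MySQLError:
        connection.rollback()
    # Fall back to row-by-row inserts so only the offending record(s) are rejected.
    for row in rows:
        try:
            with connection.cursor() as cur:
                cur.execute(sql, row)
            connection.commit()
        except pymysql.MySQLError:
            connection.rollback()
            failed.append(row)   # e.g. send to SQS/S3 for human review
    return failed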
This article (AWS Lambda batching) talks about going from Lambda to SQS to Lambda to RDS, but I'm not sure how specifically you can handle errors in that situation. Do you have to send one record at a time?
This blog uses something similar, but I'm still not sure how to adapt this for my use case or if this is even the best solution.
Looking for help in any form I can get; ideas, blog posts, tutorials, videos, etc.
Thank you!
I do have a few suggestions focused on organization, debugging, and resiliency - please keep in mind that I'm making assumptions about your architecture.
Organization
You currently have multiple dependent Lambdas processing data. When you have a processing flow like this, the complexity of what you're trying to process dictates whether you need an orchestration tool.
I would suggest orchestrating your Lambdas via AWS Step Functions.
Debugging
At the application level - log anything that isn't PII
Now that you're using an orchestration tool, utilize the error handling of Step Functions along with any business logic in the application to error appropriately if conditions for the next step aren't met
Resiliency
Life happens, things break, incorrect code gets pushed.
Design your orchestration to put the failing event(s) your Lambdas receive into a processing queue (AWS SQS, Kafka, etc.) - you can reprocess the events or, if the events themselves are at fault, send them to a DLQ.
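As a sketch (the queue URL and event shape are illustrative), parking failed events with boto3 could look like this:

import json
import boto3

sqs = boto3.client('sqs')
FAILED_QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/failed-records'

def park_failed_records(records, error):
    # Push records that could not be processed onto a queue for reprocessing or review.
    for record in records:
        sqs.send_message(
            QueueUrl=FAILED_QUEUE_URL,
            MessageBody=json.dumps({'record': record, 'error': str(error)}),
        )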
Here's a nice article on the use of orchestration with a design use case - give it a read
I'm looking to schedule a task for when a document's datetime field hits that time, and I've set that up using a TTL index. The problem is that when the delete event arrives on the change stream cursor, the original document is not returned to the program. I still need the (now deleted) document on the Python side, since it contains other properties that are important for executing the task. Is there some kind of workaround where I can get the document via a change event without deleting it, or get the deleted document without having to do a query?
There is no workaround. The tools you've chosen are not sufficient for the job.
Replace the TTL index with a regular one and replace your ChangeStream listener with a cron job to run a worker every minute.
The worker will get all expired documents, do the job, and delete the documents from the collection either one by one or in batches.
It is a more reliable, flexible, and scalable approach compared to TTL + ChangeStream.
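A minimal sketch of that worker with pymongo (the collection, the runAt field, and perform_task() are placeholders):

from datetime import datetime, timezone
from pymongo import MongoClient

tasks = MongoClient()['mydb']['tasks']

def run_due_tasks():
    now = datetime.now(timezone.utc)
    # The regular index on runAt keeps this query cheap.
    for doc in tasks.find({'runAt': {'$lte': now}}):
        perform_task(doc)                       # the full document is still available here
        tasks.delete_one({'_id': doc['_id']})   # delete only once the work is done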
This is a rather specific question for advanced users of Celery. Let me explain the use case I have:
Use case
I have to run ~1k-100k tasks that will run a simulation (movie) and return the data of the simulation as a rather large list of smaller objects (frames), say 10k-100k per frame and 1k frames. So the total amount of data produced will be very large, but assume that I have a database that can handle this. Speed is not a key factor here. Later I need to compute features from each frame, which can be done completely independently.
The frames look like dicts that point to some numpy arrays and other simple data like strings and numbers, and each has a unique identifier (UUID).
Important is that the final objects of interest are arbitrary joins and splits of these generated lists. As a metaphor, consider the resulting movies being chopped up and recombined into new movies. These final lists (movies) are then basically lists of references to the frames via their UUIDs.
Now, I am considering using Celery to generate these first movies, and since they will end up in the backend DB anyway, I might just keep the results indefinitely, at least the ones I specify to keep.
My question
Can I configure a backend, preferably a NoSQL DB, in a way that keeps the results and lets me access them later, independently of Celery, using the objects' UUIDs? And if so, does that make sense given the overhead, performance, etc.?
Another possibility would be to not return anything and let the worker store the result in a DB. Is that preferred? It seems unnecessary to have a second channel of communication to another DB when Celery can do this already.
I am also interested in comments on using Celery in general for highly independent tasks that run long (>1h) and return large result objects. A failed task is not a problem and can just be restarted. The resulting movies are stochastic, so functional approaches can be problematic; even storing the random seed might not guarantee reproducible results, although I do not have side effects. I just might have lots of workers available that are widely distributed. Imagine lots of desktop machines in a closed environment where every worker helps even if it is slow. Network speed and security are not an issue here. I know that this is not Celery's original use case, but it seemed very easy to use it for these cases. The best analogy I found are projects like Folding@home.
Can I configure a backend, preferably a NoSQL DB, in a way that keeps the results and lets me access them later, independently of Celery, using the objects' UUIDs?
Yes, you can configure Celery to store its results in a NoSQL database such as Redis for access by UUID later. The two settings that control the behavior of interest for you are result_expires and result_backend.
result_backend specifies which database you want to store your results in (e.g., Elasticsearch or Redis), while result_expires specifies how long after a task completes its result remains available for access.
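A minimal configuration sketch (the broker/backend URLs are placeholders; setting result_expires to None keeps results until you remove them yourself):

from celery import Celery

app = Celery(
    'simulations',
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/1',   # result_backend: where task results live
)
app.conf.result_expires = None            # never expire results automatically
# or, e.g., keep them for 30 days:
# app.conf.result_expires = 60 * 60 * 24 * 30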
After the task completes, you can access the results in Python like this:
from celery.result import AsyncResult

# kick off the task and note its id (a UUID string)
result = task_name.delay()
print(result.id)
task_uuid = result.id

# later, even from a different process, look the result up by that UUID
checked_result = AsyncResult(task_uuid)
# and you can access the result output here however you'd like, e.g.:
# checked_result.get()
And if so, does that make sense given the overhead, performance, etc.?
I think this strategy makes perfect sense. I have used it a number of times when generating long-running reports for web users. The initial POST returns the UUID of the Celery task. The web client can poll the app server via JavaScript, using the UUID, to see if the task is ready/complete. Once the report is ready, the page can redirect the user to the route that lets them download or view the report by passing in the UUID.
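A sketch of that polling route, assuming Flask and that the configured Celery app is importable (the module name 'tasks' is a placeholder):

from flask import Flask, jsonify
from tasks import app as celery_app   # hypothetical module holding the Celery app

flask_app = Flask(__name__)

@flask_app.route('/results/<uuid>')
def result_status(uuid):
    result = celery_app.AsyncResult(uuid)
    if not result.ready():
        return jsonify(status='pending')
    return jsonify(status='done', result=result.get())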
I'm looking for a way to get the size of my database. Right now, I've got a class method in my datastore model called query_all which does exactly that: it queries all entities of that model and returns them.
In my code, I call that like this:
query = MyModel.query_all(ndb.Key('MyModel', '*defaultMyModel'))
count = query.count()
return 'Database size: {}'.format(count)
Now this works, but because my database is pretty big (it has ~10,000 entities), every time I call this it takes away 20% of my Datastore Small Operations quota (I can do 0.05 million ops per day).
Is there any way for me to do this more efficiently? I think it's the .count() that's causing the problem and not the query_all(...), but I'm not sure.
Any ideas?
Thanks.
Don't use the datastore for reporting:
Roll up the data in real time and use Memcache.
When I want to track some consumable resource:
What I have done in the past is create a QuotaManagementService that I call when I create or destroy some resource.
On every insert/delete operation you increment/decrement the count for that resource.
One easy way to do this with GAE is with the memcache service, which even has incr()/decr() methods for you to manipulate the counts. You can also do this asynchronously so that it doesn't affect latency.
Since Memcache is only an in-memory cache, you will want to persist the information somehow. You might want to store the value in the datastore periodically, maybe from a background task. Or you could just live with recreating it when your application is updated.
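A sketch of that pattern on the Python runtime (model and key names are illustrative):

from google.appengine.api import memcache
from google.appengine.ext import ndb

COUNTER_KEY = 'mymodel_count'

class CounterSnapshot(ndb.Model):
    count = ndb.IntegerProperty(default=0)

def on_insert():
    memcache.incr(COUNTER_KEY, initial_value=0)

def on_delete():
    memcache.decr(COUNTER_KEY, initial_value=0)

def persist_count():
    # Run from a cron/background task: copy the cached value into the datastore.
    count = memcache.get(COUNTER_KEY)
    if count is not None:
        CounterSnapshot(id='mymodel', count=int(count)).put()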
Google Analytics
You could also pump create/delete events into Google Analytics, so you can just go to the dashboard and see how many calls have been made along with other detailed stats. You wouldn't have to write much code for this either.
Either way, this can be pretty transparent in Python.
You could probably just create a decorator to automatically increment/decrement the counts and not even manage it with explicit code.
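For example, a decorator along these lines (names are illustrative) could wrap your create/delete functions:

import functools
from google.appengine.api import memcache

def counts(resource_name, delta=1):
    # Wrap a create (delta=+1) or delete (delta=-1) function to keep the count current.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if delta >= 0:
                memcache.incr(resource_name, delta=delta, initial_value=0)
            else:
                memcache.decr(resource_name, delta=-delta, initial_value=0)
            return result
        return wrapper
    return decorator

@counts('image_count', delta=1)
def create_image(data):
    pass   # your existing create logic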
Memcache writes are free so I don't think I would write every event to a DataStore counter if they were frequent.
Here is some example code for using the DataStore directly instead of Memcache for high volume writes to a counter.
Our situation is as follows:
We are working on a school project where the intention is that multiple teams walk around in a city with smartphones and play a city game while walking.
As such, we can have 10 active smartphones walking around in the city, all posting their location and requesting data from Google App Engine.
Someone is behind a web browser, watching all these teams walk around and sending them messages, etc.
We are using the datastore Google App Engine provides to store all the data these teams send and request, to store the messages and retrieve them, etc.
However, we soon found out we were at our maximum limit of reads and writes, so we searched for a solution that lets us retrieve periodic updates (which cost the most reads and writes) without using any of the limited resources Google provides. And obviously, because it's a school project, we don't want to pay for more reads and writes.
Storing this information in global variables seemed an easy and quick solution, which it was... but when we started to truly test, we noticed some of our data was missing and then reappearing. This turned out to be because there were so many requests being made to the cloud that a new instance was spun up, and instances don't keep these global variables persistent.
So our question is:
Can we somehow make sure these global variables are always the same on every running instance of Google App Engine?
OR
Can we limit the number of instances ever running to 1, no matter how many requests are made?
OR
Is there perhaps another, better way to store this data, without using the datastore and without using globals?
You should be using memcache. If you use the ndb (new database) library, you can automatically cache the results of queries. Obviously this won't improve your writes much, but it should significantly improve the numbers of reads you can do.
You need to back it with the datastore as data can be ejected from memcache at any time. If you're willing to take the (small) chance of losing updates you could just use memcache. You could do something like store just a message ID in the datastore and have the controller periodically verify that every message ID has a corresponding entry in memcache. If one is missing the controller would need to reenter it.
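A rough sketch of that idea (model and key names are illustrative; rebuild_payload stands in for however you regenerate a message body):

from google.appengine.api import memcache
from google.appengine.ext import ndb

class MessageId(ndb.Model):
    pass   # the entity's key id doubles as the message id

def post_message(message_id, payload):
    MessageId(id=message_id).put()                # tiny datastore write
    memcache.set('msg:%s' % message_id, payload)  # full body lives only in memcache

def verify_messages(rebuild_payload):
    # Controller task: re-create any memcache entries that were evicted.
    for key in MessageId.query().fetch(keys_only=True):
        cache_key = 'msg:%s' % key.id()
        if memcache.get(cache_key) is None:
            memcache.set(cache_key, rebuild_payload(key.id()))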
Interesting question. Some bad news first: I don't think there's a better way of storing the data; no, you won't be able to stop new instances from spawning; and no, you cannot make separate instances always have the same data.
What you could do is have the instances periodically sync themselves with a master record in the datastore. By choosing the frequency of this intelligently and downloading/uploading the information in one lump, you could limit the number of reads/writes to a level that works for you. This is firmly in kludge territory, though.
Despite finding the quota for just about everything else, I can't find the limits for free reads/writes, so it is possible that they're ludicrously small, but the fact that you're hitting them with a mere 10 smartphones raises a red flag to me. Are you certain that the smartphones are being polled (or calling in) at a sensible frequency? It sounds like you might be hammering them unnecessarily.
Consider the Jabber (XMPP) protocol for communication between peers. The free limits for it are quite high.
First, definitely use memcache as Tim Delaney said. That alone will probably solve your problem.
If not, you should consider a push model. The advantage is that your clients won't be asking you for new data all the time, only when something has actually changed. If the update is small enough that you can deliver it in the push message, you won't need to worry about datastore reads on memcache misses, or any other duplicate work, for all those clients: you read the data once when it changes and push it out to everyone.
The first option for push is C2DM (Android) or APNs (iOS). These are limited in the amount of data they can send and in the frequency of updates.
If you want to get fancier you could use XMPP instead. This would let you do more frequent updates with (I believe) bigger payloads but might require more engineering. For a starting point, see Stack Overflow questions about Android and iOS.
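On App Engine, the built-in XMPP service can do the sending; a minimal sketch (the team JIDs are whatever accounts your phones sign in with):

from google.appengine.api import xmpp

def push_update(team_jids, update_json):
    for jid in team_jids:
        if xmpp.get_presence(jid):          # only push to clients that are connected
            xmpp.send_message(jid, update_json)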
Have fun!