My goal is to automate a preprocessing pipeline. The pipeline has three code blocks:
Fetching the data - either via an API or via the client uploading a CSV to an S3 bucket.
Processing the data - unifying the data from the different clients into a single end schema.
Storing the data - the destination is a database.
I know it is a very common system, but I have failed to find the best design for it.
The requirements are:
The system is not real time. For each client I plan to fetch the new data every X days, and it does not matter if it only finishes even a day later.
The processing part is unique per client's data; of course there are some common features, but also a lot of client-specific features and manipulations.
I wish the system to be automated.
I thought of the following :
The Lambda solution:
Schedule a Lambda for each client which will fetch the data every X days; that Lambda will trigger another Lambda which will do the processing. But if I have 100 clients, that will be awful to handle - 200 Lambdas.
2.1 Make a project called API with a different script for each client, and set a schedule for each script on EC2 or ECS.
2.2 Have another project called Processing, where a parent class holds the common code and each client's subclass inherits from it; the API script activates the relevant processing script.
In the end I am very confused about what the best practice is. I have only found examples that handle a single client, or general schematic approaches / block diagrams that are too broad.
Because I know it is such a common system, I would appreciate learning from others' experience.
I would appreciate any reference links or wisdom.
Take a look at Step Functions; it will allow you to decouple the execution of each stage and let you reuse your Lambdas.
By passing input into the Step Function, the top Lambda can make decisions which feed into the others.
To schedule this, use a scheduled CloudWatch Events rule.
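For example (the state machine name, ARNs, and role below are made up, not from your setup), each client could get one scheduled rule that starts the same state machine with client-specific input:

```python
import json
import boto3

events = boto3.client("events")
# Hypothetical ARN of the shared pipeline state machine
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:preprocess-pipeline"

def schedule_client(client_id: str, every_days: int, events_role_arn: str) -> None:
    """Create one scheduled rule per client that starts the shared state
    machine with client-specific input the first Lambda can branch on."""
    rule_name = f"fetch-{client_id}"
    events.put_rule(
        Name=rule_name,
        ScheduleExpression=f"rate({every_days} days)",
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "start-pipeline",
            "Arn": STATE_MACHINE_ARN,
            "RoleArn": events_role_arn,  # role allowed to call states:StartExecution
            "Input": json.dumps({"client_id": client_id}),
        }],
    )
```

This keeps a single state machine and one set of Lambdas; only the schedule and the execution input vary per client.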
I have to process requests published to a Pub/Sub topic by the users of my "service" in a Python application that has a main loop.
Each instance of the service will only be able to process one request at a time due to a limitation of the application. Bursts of requests (~10 at the beginning, growing to ~10^6-10^7) will alternate with mostly idle time. The cold-start time of the application is very high compared to the processing time for a single request.
My code will be a plug-in, which polls the subscription, calls methods in the application based on it and then saves data in Cloud Storage and Big Query.
I have read through the cloud documentation, and it seems that Cloud Run is the right solution, together with the synchronous subscription API in Python. I excluded Cloud Functions because it seems better suited to asynchronous work.
However, I did not fully understand how the auto-scaling works. Basically it would have to do this based on the average processing time for each request and the length of the queue, considering the average start-up time of a container.
Unfortunately I did not find a tutorial or example for such a use-case, especially not really explaining how the auto-scaling happens in detail.
Does anyone have something like that, or could someone explain it here?
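For reference, the polling plug-in I have in mind would look roughly like this (the project, subscription, and handler names are placeholders):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical project and subscription names
SUBSCRIPTION = subscriber.subscription_path("my-project", "my-subscription")

def poll_once(process_request) -> None:
    """Pull at most one message, hand it to the application, then ack it."""
    response = subscriber.pull(
        request={"subscription": SUBSCRIPTION, "max_messages": 1},
        timeout=30,
    )
    for received in response.received_messages:
        process_request(received.message.data)  # application-specific handler
        subscriber.acknowledge(
            request={"subscription": SUBSCRIPTION, "ack_ids": [received.ack_id]}
        )
```

The main loop would call poll_once repeatedly between other work.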
I have two Lambda functions written in Python:
Lambda function 1: Gets 'new' data from an API, gets 'old' data from an S3 bucket (if one exists), compares the new to the old, and creates 3 different lists of dictionaries: inserts, updates, and deletes. Each list is passed to the next Lambda function in batches (~6MB) via Lambda invocation using RequestResponse. The full datasets can vary in size from millions of records to 1 or 2.
Lambda function 2: Handles each type of data (insert, update, delete) separately - specific things happen for each type, but eventually each batch is written to MySQL using pymysql executemany.
I can't figure out the best way to handle errors. For example, let's say one of the batches being written contains a single record that has a NULL value for a field that is not allowed to be NULL in the database. That entire batch fails, and I have no way of figuring out what was written to the database and what wasn't for that batch. Ideally, a notification would be triggered and the rogue record would be written somewhere it could be reviewed by a human - all other records would be written successfully.
Ideally, I could use something like the Bisect Batch on Function Failure available in Kinesis Firehose. It recursively splits failed batches into smaller batches and retries them until it has isolated the problematic records; those are then sent to a DLQ if one is configured. However, I don't think Kinesis Firehose will work for me because it doesn't write to RDS and therefore doesn't know which records fail.
This person https://stackoverflow.com/a/58384445/6669829 suggested falling back to execute if executemany fails. I'm not sure that will work for the larger batches. But perhaps if I stream the data from S3 instead of invoking via RequestResponse, this could work?
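To make that fallback concrete, here is roughly what I imagine with pymysql (the SQL statement and connection are placeholders); rows that still fail are collected for review instead of failing the whole batch:

```python
import pymysql

def write_batch(conn, sql: str, rows: list) -> list:
    """Fast path: executemany the whole batch. On failure, retry row by row
    and return the rows that still fail so they can be reviewed by a human."""
    failed = []
    try:
        with conn.cursor() as cur:
            cur.executemany(sql, rows)
        conn.commit()
        return failed
    except pymysql.MySQLError:
        conn.rollback()
    # Slow path: isolate the offending records one at a time.
    with conn.cursor() as cur:
        for row in rows:
            try:
                cur.execute(sql, row)
            except pymysql.MySQLError:
                failed.append(row)  # e.g. write to S3 / trigger a notification
    conn.commit()
    return failed
```

Whether the row-by-row path is fast enough for million-record batches is exactly what I'm unsure about.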
This article (AWS Lambda batching) talks about going from Lambda to SQS to Lambda to RDS, but I'm not sure how specifically you can handle errors in that situation. Do you have to send one record at a time?
This blog uses something similar, but I'm still not sure how to adapt this for my use case or if this is even the best solution.
Looking for help in any form I can get; ideas, blog posts, tutorials, videos, etc.
Thank you!
I do have a few suggestions focused on organization, debugging, and resiliency - please keep in mind assumptions are being made about your architecture
Organization
You currently have multiple dependent Lambdas that are processing data. When you have a processing flow like this, the complexity of what you're trying to process dictates whether you need an orchestration tool.
I would suggest orchestrating your Lambdas via AWS Step Functions
Debugging
At the application level - log anything that isn't PII
Now that you're using an orchestration tool, utilize the error handling of Step Functions along with any business logic in the application to error appropriately if conditions for the next step aren't met
Resiliency
Life happens, things break, incorrect code gets pushed
Design your orchestration to put the failing event(s) your Lambdas receive into a processing queue (AWS SQS, Kafka, etc.) - you can reprocess the events, or if the events themselves are at fault, DLQ them.
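As an illustration only (the ARNs, queue URL, and state names are placeholders), a task state in your Step Functions definition can retry a failing Lambda and then catch the error and route the original input to an SQS queue for review or reprocessing; here the Amazon States Language fragment is written as a Python dict:

```python
# Fragment of a Step Functions definition (Amazon States Language) as a Python dict.
# Lambda ARN, queue URL, and state names are placeholders.
states_fragment = {
    "ProcessBatch": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-batch",
        "Retry": [
            {"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 30, "MaxAttempts": 2}
        ],
        "Catch": [
            {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "SendToReviewQueue"}
        ],
        "Next": "Done",
    },
    "SendToReviewQueue": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sqs:sendMessage",
        "Parameters": {
            "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/failed-batches",
            "MessageBody.$": "$",  # forward the failing input (plus error info) for review
        },
        "End": True,
    },
    "Done": {"Type": "Succeed"},
}
```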
Here's a nice article on the use of orchestration with a design use case - give it a read
I am working with a legacy Kubeflow project; the pipelines have a few components that apply filters to a data frame.
In order to do this, each component downloads the data frame from S3, applies the filter, and uploads it to S3 again.
The components where the data frame is used for training or validating the models also download it from S3.
The question is whether this is a best practice, or whether it is better to share the data frame directly between components, because the upload to S3 can fail and then fail the pipeline.
Thanks
As always with questions asking for "best" or "recommended" method, the primary answer is: "it depends".
However, there are certain considerations worth spelling out in your case.
Saving to S3 in between pipeline steps.
This stores intermediate results of the pipeline, and as long as the steps take a long time and are restartable, it may be worth doing. What "a long time" means depends on your use case, though.
Passing the data directly from component to component. This saves you storage throughput and very likely the not-insignificant time it takes to store and retrieve the data to/from S3. The downside: if you fail midway through the pipeline, you have to start from scratch.
So the questions are:
Are the steps idempotent (restartable)?
How often does the pipeline fail?
Is it easy to restart the processing from some mid-point?
Do you care about the processing time more than the risk of losing some work?
Do you care about the incurred cost of S3 storage/transfer?
The question is about if this is a best practice
The best practice is to use the file-based I/O and built-in data-passing features. The current implementation uploads the output data to storage in upstream components and downloads the data in downstream components. This is the safest and most portable option and should be used until you see that it no longer works for you (100GB datasets will probably not work reliably).
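As a sketch of what that file-based data passing looks like with the KFP SDK (the component, filter, and base image below are illustrative, not taken from your pipeline):

```python
from kfp.components import InputPath, OutputPath, create_component_from_func

def filter_dataframe(input_csv_path: InputPath("CSV"), output_csv_path: OutputPath("CSV")):
    """Read the upstream data frame, apply a filter, and write the result.
    KFP moves the files between components; the component code never talks to S3 directly."""
    import pandas as pd
    df = pd.read_csv(input_csv_path)
    df = df[df["value"].notna()]  # placeholder filter
    df.to_csv(output_csv_path, index=False)

filter_op = create_component_from_func(
    filter_dataframe, base_image="python:3.9", packages_to_install=["pandas"]
)
```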
or is better to share the data frame directly between components
How can you "directly share" an in-memory Python object between different Python programs running in containers on different machines?
because the upload to the S3 can fail, and then fail the pipeline.
The failed pipeline can just be restarted. The caching feature will make sure that already finished tasks won't be re-executed.
Anyways, what is the alternative? How can you send the data between distributed containerized programs without sending it over the network?
I work on a project for our clients which is heavily ML-based and computationally intensive (as in complex and multi-level similarity scores, NLP, etc.). For the prototype, we delivered a Django REST Framework app where the API has access to a database from the client, and with every request at specific endpoints it literally does all the ML work on the fly (in the backend).
Now that we are scaling and more user activity is taking place in production, the app seems to lag a lot. Simple profiling shows that a single POST request can take up to 20 seconds to respond. So no matter how much I optimize in terms of horizontal scaling, I can't get rid of the bottleneck of all the calculations happening with the API calls. I have a hunch that caching could be a kind of solution, but I am not sure. I can imagine a lot of 'theoretical' solutions, but I don't want to reinvent the wheel (or shall I say, re-discover the wheel).
Are there specific design architectures for ML or computationally intensive REST API calls that I can refer to in redesigning my project?
Machine learning & natural language processing systems are often resource-hungry and in many cases there is not much one can do about it directly. Some operations simply take longer than others but this is actually not the main problem in your case. The main problem is that the user doesn't get any feedback while the backend does its job which is not a good user experience.
Therefore, it is not recommended to perform resource-heavy computation within the traditional HTTP request-response cycle. Instead of calling the ML logic within the API view and waiting for it to finish, consider setting up an asynchronous task queue to perform the heavy lifting independently of the synchronous request-response cycle.
In the context of Django, the standard task queue implementation would be Celery. Setting it up will require some learning and additional infrastructure (e.g. a Redis instance and worker servers), but there is really no other way to avoid breaking the user experience.
Once you have set up everything, you can then start an asynchronous task whenever your API endpoint receives a request and immediately inform the user that their request is being carried out via a normal view response. Once the ML task has finished and its results have been written to the database (using a Django model, of course), you can then notify the user (e.g. via mail or directly in the browser via WebSockets) to view the analysis results in a dedicated results view.
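A minimal sketch of that pattern, assuming Celery is already wired into the Django project (the task, model, and endpoint names are hypothetical):

```python
# tasks.py
import time
from celery import shared_task
from .models import AnalysisResult  # hypothetical model holding the ML output

@shared_task
def run_analysis(document_id: int) -> None:
    """Heavy ML work executed by a Celery worker, outside the request cycle."""
    time.sleep(20)  # stand-in for the real similarity/NLP computation
    AnalysisResult.objects.create(document_id=document_id, payload={"score": 0.9})

# views.py
from rest_framework.views import APIView
from rest_framework.response import Response

class AnalyzeView(APIView):
    def post(self, request):
        document_id = request.data["document_id"]
        task = run_analysis.delay(document_id)  # enqueue and return immediately
        return Response({"task_id": task.id}, status=202)
```

The client can then poll a results endpoint (or be notified) once the task has written its output.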
Server A has a process that exports n database tables as flat files. Server B contains a utility that loads the flat files into a DW appliance database.
A process runs on server A that exports and compresses about 50-75 tables. Each time a table is exported and a file produced, a .flag file is also generated.
Server B has a bash process that repeatedly checks for each .flag file produced by server A. It does this by connecting to A and checking for the existence of a file. If the flag file exists, Server B will scp the file from Server A, uncompress it, and load it into an analytics database. If the file doesn't yet exist, it will sleep for n seconds and try again. This process is repeated for each table/file that Server B expects to be found on Server A. The process executes serially, processing a single file at a time.
Additionally: The process that runs on Server A cannot 'push' the file to Server B. Because of file-size and geographic concerns, Server A cannot load the flat file into the DW Appliance.
I find this process cumbersome, and it just so happens to be up for a rewrite/revamp. I'm proposing a messaging-based solution. I initially thought this would be a good candidate for RabbitMQ (or the like), where
Server A would write a file, compress it and then produce a message for a queue.
Server B would subscribe to the queue and would process files named in the message body.
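In rough pika terms (the queue name, host, and callback are made up), I am picturing something like this:

```python
import pika

QUEUE = "table-exports"  # hypothetical queue name

def publish_export(host: str, file_path: str) -> None:
    """Server A: announce that a compressed export file is ready to be fetched."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_publish(exchange="", routing_key=QUEUE, body=file_path)
    connection.close()

def consume_exports(host: str, load_file) -> None:
    """Server B: fetch, uncompress, and load each file named in a message."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)

    def on_message(ch, method, properties, body):
        load_file(body.decode())  # scp + uncompress + load happens in load_file
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()
```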
I feel that a messaging-based approach would not only save time as it would eliminate the check-wait-repeat cycle for each table, but also permit us to run processes in parallel (as there are no dependencies).
I showed my team a proof-of-concept using RabbitMQ and they were all receptive to using messaging. A number of them quickly identified other opportunities where we would benefit from message-based processing. One such area would be populating our DW dimensions in real time rather than through batch.
It then occurred to me that an MQ-based solution might be overkill given the low volume (50-75 tasks). This might be overkill given that our operations team would have to install RabbitMQ (and its dependencies, including Erlang), and it would introduce new administration headaches.
I then realized this could be made simpler with a REST-based solution. Server A could produce a file and then make an HTTP call to a simple (web.py) web service on Server B. Server B could then initiate the transfer-and-load process based on the URL that is called. Given the time it takes to transfer, uncompress, and load each file, I would likely use Python's multiprocessing to create a subprocess that loads each file.
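As a sketch of that idea (the endpoint path and handler are invented), the web.py service on Server B would accept a file name and hand the transfer-and-load off to a subprocess so the HTTP call returns immediately:

```python
import multiprocessing
import web

urls = ("/load/(.+)", "LoadFile")  # hypothetical endpoint: /load/<filename>

def transfer_and_load(filename: str) -> None:
    """scp the file from Server A, uncompress it, and load it into the DW appliance."""
    ...  # the existing transfer/uncompress/load logic would go here

class LoadFile:
    def POST(self, filename):
        # Kick off the long-running work and return right away.
        multiprocessing.Process(target=transfer_and_load, args=(filename,)).start()
        return "accepted"

if __name__ == "__main__":
    app = web.application(urls, globals())
    app.run()
```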
I'm thinking that the REST-based solution is ideal given the fact that it's simpler. In my opinion, using an MQ would be more appropriate for higher-volume tasks, but we're only talking (for now) about 50-75 operations, with potentially more to come.
Would REST-based be a good solution given my requirements and volume? Are there other frameworks or OSS products that already do this? I'm looking to add messaging without creating other administration and development headaches.
Message brokers such as Rabbit contain practical solutions for a number of problems:
multiple producers and consumers are supported without risk of duplication of messages
atomicity and unit-of-work logic provide transactional integrity, preventing duplication and loss of messages in the event of failure
horizontal scaling--most mature brokers can be clustered so that a single queue exists on multiple machines
no-rendezvous messaging--it is not necessary for sender and receiver to be running at the same time, so one can be brought down for maintenance without affecting the other
preservation of FIFO order
Depending on the particular web service platform you are considering, you may find that you need some of these features and must implement them yourself if not using a broker. The web service protocols and formats such as HTTP, SOAP, JSON, etc. do not solve these problems for you.
In my previous job the project management passed on using message brokers early on, but later the team ended up implementing quick-and-dirty logic meant to solve some of the same issues as above in our web service architecture. We had less time to provide business value because we were fixing so many concurrency and error-recovery issues.
So while a message broker may seem on its face like a heavyweight solution, and may actually be more than you need right now, it does have a lot of benefits that you may need later without yet realizing it.
As wberry alluded to, a REST or web-hook based solution can be functional but will not be very tolerant to failure. Paying the operations price up front for messaging will pay long term dividends as you will find additional problems which are a natural fit for the messaging model.
Regarding other OSS options: if you are considering stream-based processing in addition to this specific use case, I would recommend taking a look at Apache Kafka. Kafka provides similar messaging semantics to RabbitMQ, but is tightly focused on processing message streams (not to mention that it has been battle-tested in production at LinkedIn).