I can't seem to get around the error below while trying to upload a function to AWS Lambda:
The Code tab failed to save. Reason: Unzipped size must be smaller than 262144000 bytes
I've zipped the function and all of its dependencies, uploaded the zipped file to S3, and pasted the file's S3 URL at the Lambda prompt (upload a file from Amazon S3).
Any leads would be appreciated. Thanks
Adding to Entropic's answer, what about using something like pyminifier? This could be a very simple solution if the minification it performs is enough to get you under the 250 MB limit.
Also, if you are using the AWS SDK, you do not need to include it in your package, as it is already part of the Lambda execution environment. If that applies to you, it could save some space as well.
As kosa mentioned, there is a hard limit of 250 MB. This Reddit thread had a few good ideas:
https://www.reddit.com/r/aws/comments/4qrw9m/how_to_work_around_aws_lambdas_250mb_limit/
Most solutions are along the lines of: 1) load more code at runtime, thus getting around the 250 MB limit (see the sketch below), 2) split the code into smaller pieces, which is more AWS-Lambda-ish anyway, and 3) use the strip command like this guy: https://serverlesscode.com/post/scikitlearn-with-amazon-linux-container/
Option 2 is probably the best way to go, if you can split it up.
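As a rough illustration of option 1, here is a minimal sketch of pulling extra dependencies from S3 at runtime. The bucket, key, and some_big_library names are hypothetical, and the layout of the zip is assumed:

import sys
import zipfile

import boto3  # already available in the Lambda Python runtime


def _load_extra_deps():
    # Download a zip of additional packages into /tmp, the only writable
    # path inside Lambda, and make them importable.
    s3 = boto3.client("s3")
    s3.download_file("my-deps-bucket", "deps.zip", "/tmp/deps.zip")
    with zipfile.ZipFile("/tmp/deps.zip") as zf:
        zf.extractall("/tmp/deps")
    sys.path.insert(0, "/tmp/deps")


_load_extra_deps()


def handler(event, context):
    # Heavy imports work only after _load_extra_deps() has run.
    import some_big_library  # hypothetical dependency shipped in deps.zip
    return {"status": "ok"}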
I want to fetch all metadata for a bucket with a prefix via Boto. There are a few SO questions that imply this isn't possible via the AWS API. So, two questions:
Is there a good reason this shouldn't be possible via the AWS API?
Although I can't find one in the docs, is there a convenience method for this in Boto?
I'm currently doing this using multithreading, but that seems like overkill, and I'd really rather avoid it if at all possible.
While there isn't a way to do this directly through boto, you could add an inventory configuration on the bucket(s), which generates a daily CSV/ORC file with metadata for all objects.
Once this has been generated, you can process that output rather than using multithreading or any other approach that requires a huge number of requests.
See: put_bucket_inventory_configuration
It's worth noting that it can take up to 48 hours for the first report to be generated.
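For reference, a hedged sketch of what that call might look like with boto3; the bucket names, inventory Id, and prefix are placeholders:

import boto3

s3 = boto3.client("s3")

# Generate a daily CSV inventory of everything under the given prefix.
s3.put_bucket_inventory_configuration(
    Bucket="my-source-bucket",
    Id="daily-inventory",
    InventoryConfiguration={
        "Id": "daily-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Filter": {"Prefix": "my/prefix/"},
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["Size", "LastModifiedDate", "ETag", "StorageClass"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::my-inventory-bucket",
                "Format": "CSV",
                "Prefix": "inventory-reports",
            }
        },
    },
)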
I have a Python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files anew every time, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but I only see the error message after several minutes, because the reading takes so long.
Are there any tricks to do something like this?
(If necessary, it is feasible for me to create smaller test files.)
I'm not good at Python, but it seems Python is able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably only a small part of your input file is relevant.
Secondly, are these particular files required, or will the problem show up with any large amount of data? If it shows up only with particular files, then once again it is most probably related to some feature of those files and will also show up with a smaller file that has the same feature. If the main reason is just the sheer amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the file? Is it just a hard drive performance issue, or do you do some heavy processing of the data in your script before actually reaching the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then have your script load the processed data instead of redoing the processing each time (see the sketch below).
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
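A minimal sketch of the "process once, then reuse" idea from the third point; read_and_process and the file names are placeholders for your own code:

import os
import pickle

CACHE_FILE = "preprocessed.pkl"


def read_and_process(path):
    # Placeholder for the slow reading/processing you want to run only once.
    with open(path, "rb") as f:
        return f.read()


def load_data():
    if os.path.exists(CACHE_FILE):
        # Fast path: reuse the cached result on later runs.
        with open(CACHE_FILE, "rb") as f:
            return pickle.load(f)
    data = read_and_process("huge_input_file.dat")  # the slow part
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(data, f)
    return data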
I am using Ink (Filepicker.io) to perform multi-file uploads and it is working brilliantly.
However, a quick look around the internet shows that multi-file downloads are more complicated. While I know it is possible to spin up an EC2 instance to zip on the fly, this would entail some wait time on the user's part, and the newly created file would also not be available on my CloudFront distribution immediately.
Has anybody done this before, and what are the practical UX implications - is the wait time significant enough to negatively affect the user experience?
The obvious solution would be to create the zipped files ahead of time, but this would result in some (unnecessary?) redundancy.
What is the best way to avoid redundant storage while reducing wait times for on-the-fly folder compression?
You can create the ZIP archive on the client side using JavaScript. Check out:
http://stuk.github.io/jszip/
I have a script that downloads a lot of fairly large (20 MB+) files. I would like to be able to check whether the copy I have locally is identical to the remote version. I realize I can just use a combination of modified date and length, but is there something more accurate (and also available via paramiko) that I can use to ensure this? Ideally some sort of checksum.
I should add that the remote system is Windows and I have SFTP access only, no shell access.
I came across a similar scenario. The solution I currently use is to compare the remote file's size, using item.st_size for item in sftp.listdir_attr(remote_dir), with the local file's size, using os.path.getsize(local_file). When the files are around 1 MB or smaller, this solution works fine. However, a weird thing can happen when the files are around 10 MB or larger: the two sizes might differ slightly, e.g. one is 10000 bytes and the other is 10003 bytes.
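To make that concrete, here is a rough sketch of the size comparison with paramiko; the host, credentials, and directories are placeholders:

import os

import paramiko

transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="user", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)

remote_dir = "/remote/data"
local_dir = "/local/data"

for attr in sftp.listdir_attr(remote_dir):
    local_path = os.path.join(local_dir, attr.filename)
    if not os.path.exists(local_path):
        print("{0}: missing locally".format(attr.filename))
    elif os.path.getsize(local_path) != attr.st_size:
        # Report any size mismatch between the local and remote copies.
        print("{0}: size mismatch (local {1}, remote {2})".format(
            attr.filename, os.path.getsize(local_path), attr.st_size))

sftp.close()
transport.close()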
From the sample code, I can upload 64 MB without any problem:
myblob = open(r'task1.txt', 'r').read()
blob_service.put_blob('mycontainer', 'myblob', myblob, x_ms_blob_type='BlockBlob')
What if I want to upload a larger file?
Thank you
I ran into the same problem a few days ago, and was lucky enough to find this. It breaks up the file into chunks and uploads it for you.
I hope this helps. Cheers!
I'm not a Python programmer, but there are a few extra tips I can offer (my stuff is all in C):
Use HTTP PUT operations (the comp=block option) for as many blocks (4 MB each) as required for your file, and then a final Put Block List (the comp=blocklist option) that coalesces the blocks. If your block uploads fail or you need to abort, the cleanup for deleting the partial set of previously uploaded blocks is a DELETE command for the blob you were creating, but this appears to be supported only by the 2013-08-15 version (someone from Azure support should confirm this).
If you need to add meta information, an additional PUT operation (with the comp=metadata option) is what I do when using the Block List method. There might be a more efficient way to tag the meta information without requiring an additional PUT, but I'm not aware of it.
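A rough sketch of that raw REST flow, assuming a pre-signed SAS URL so no Authorization header has to be built by hand; the URL, chunks, and metadata are placeholders:

import base64
from urllib.parse import quote

import requests

# Pre-signed SAS URL for the blob you want to create (placeholder).
sas_url = "https://myaccount.blob.core.windows.net/mycontainer/myblob?<sas-token>"

# In real code each chunk would be up to 4 MB read from your file.
chunks = [b"first chunk of data", b"second chunk of data"]

block_ids = []
for i, chunk in enumerate(chunks):
    # Block IDs must be Base64 strings of equal length within a blob.
    block_id = base64.b64encode("block-{0:08d}".format(i).encode("ascii")).decode("ascii")
    # comp=block uploads one block of the blob.
    requests.put(sas_url + "&comp=block&blockid=" + quote(block_id, safe=""), data=chunk)
    block_ids.append(block_id)

# comp=blocklist commits the blocks in order (check each response's status
# code in real code).
body = ("<?xml version='1.0' encoding='utf-8'?><BlockList>"
        + "".join("<Latest>{0}</Latest>".format(b) for b in block_ids)
        + "</BlockList>")
requests.put(sas_url + "&comp=blocklist", data=body)

# Optional: attach metadata with a separate PUT, as described above.
requests.put(sas_url + "&comp=metadata", headers={"x-ms-meta-author": "example"})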
This is a good question. Unfortunately, I don't see a real implementation for uploading arbitrarily large files. So, from what I see, there is much more work to do on the Python SDK, unless I am missing something really crucial.
The sample code provided in the documentation indeed uses just a single text file and uploads it at once. There is no real code implemented yet (from what I see in the SDK source code) to support the upload of larger files.
So, to work with blobs from Python, you need to understand how Azure Blob Storage works. Start here.
Then take a quick look at the REST API documentation for the Put Blob operation. The remarks mention:
The maximum upload size for a block blob is 64 MB. If your blob is larger than 64 MB, you must upload it as a set of blocks. For more information, see the Put Block (REST API) and Put Block List (REST API) operations. It's not necessary to call Put Blob if you upload the blob as a set of blocks.
The good news is that Put Block and Put Block List are implemented in the Python SDK, but no sample is provided for how to use them. What you have to do is manually split your file into chunks (blocks) of up to 4 MB each, and then use the put_block(self, container_name, blob_name, block, blockid, content_md5=None, x_ms_lease_id=None): function from the Python SDK to upload the blocks. You can even upload the blocks in parallel. Do not forget, however, that at the end you also have to call put_block_list(self, container_name, blob_name, block_list, content_md5=None, x_ms_blob_cache_control=None... to commit all the uploaded blocks.
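To illustrate, here is a minimal sketch under those assumptions; the account credentials and names are placeholders, the BlobService import path varies by SDK version, and it assumes put_block_list accepts the list of block IDs per the signature quoted above:

import base64

from azure.storage import BlobService  # legacy SDK, as used in the question

blob_service = BlobService(account_name="myaccount", account_key="mykey")

container_name = "mycontainer"
blob_name = "mylargeblob"
chunk_size = 4 * 1024 * 1024  # 4 MB per block

block_ids = []
with open("large_file.bin", "rb") as f:
    index = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        # Use Base64 IDs of equal length; the same IDs are repeated later in
        # put_block_list so the service knows which blocks to commit.
        block_id = base64.b64encode(
            "block-{0:08d}".format(index).encode("ascii")).decode("ascii")
        blob_service.put_block(container_name, blob_name, chunk, block_id)
        block_ids.append(block_id)
        index += 1

# Commit all uploaded blocks so they become the blob's content.
blob_service.put_block_list(container_name, blob_name, block_ids)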
Unfortunately I'm not enough of a Python expert to help you further, but at least this gives you a good picture of the situation.