copy multiple files in storage bucket using list - python

The gsutil cp command has the -I option to copy multiple files using a list of file names from stdin. Is there a way to do the same in Python, preferably using a library like the official Google Storage client or gcsfs/fsspec? Or is it only possible to iterate over all the file names and copy each one?

Using the version currently in gcsfs master (to be released soon), you can copy files to GCS from memory or local files with a list:
gcs.pipe({path1: content1, path2: content2}) # for in-memory bytes
gcs.put([local_path1, local_path2], [remote_path1, remote_path2]) # files
For the latter, you can give just one remote path, which will be assumed to be a directory, and the remote files will get the same basenames as local.
The calls will be processed concurrently, which may be much faster than sequential uploads, especially for small transfers.
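For illustration, a minimal sketch of that list-based upload with gcsfs (the project, bucket and file names are placeholders, and the filesystem is assumed to be authenticated with default credentials):
import gcsfs

# Placeholder project name; relies on default Google credentials.
gcs = gcsfs.GCSFileSystem(project="my-project")

# Upload in-memory bytes to several remote paths in one call.
gcs.pipe({
    "my-bucket/data/a.txt": b"contents of a",
    "my-bucket/data/b.txt": b"contents of b",
})

# Upload local files; a single remote path is treated as a directory,
# so the uploaded objects keep their local basenames.
gcs.put(["a.csv", "b.csv"], "my-bucket/archive/")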

Related

How does a Python AWS Lambda interact specifically with the uploaded file?

I'm trying to do the following:
When I upload a file to my S3 storage, the Lambda picks up this JSON file and converts it into a CSV file.
How can I specify in the Lambda code which file it must pick?
Example of my code running locally:
import pandas as pd
df = pd.read_json('movies.json')
df.to_csv('csv-movies.csv')
In this example, I provide the name of the file... but how can I manage that in a Lambda?
I think I don't understand how Lambda works... could you give me an example?
Lambda spins up execution environments to handle your requests. When it initialises these environments, it'll pull the code you uploaded, and execute it when invoked.
Execution environments have a concept of ephemeral (temporary) storage with a default size of 512 MB.
Lambda doesn't have access to your files in S3 by default. You'd first need to download your file from S3 using something like the AWS SDK for Python. You can store it in the /tmp directory to make use of the ephemeral storage I mentioned earlier.
Once you've downloaded the file using the SDK, you can interact with it as you would if you were running this locally, like in your example.
On the flip side, you'd also need to use the SDK to upload the CSV back to S3 if you want to keep it beyond the lifecycle of that execution environment.
Something else you might want to explore in future is reading that file into memory and doing away with storing it in ephemeral storage altogether.
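A rough sketch of the /tmp workflow this answer describes, with placeholder bucket and key names (the next answer shows how to get them from the triggering event):
import boto3
import pandas as pd

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Placeholder names; in practice these come from the S3 event (see below).
    bucket = "my-bucket"
    key = "movies.json"

    # Download into ephemeral storage, convert, then upload the result back.
    s3.download_file(bucket, key, "/tmp/movies.json")
    pd.read_json("/tmp/movies.json").to_csv("/tmp/csv-movies.csv", index=False)
    s3.upload_file("/tmp/csv-movies.csv", bucket, "csv-movies.csv")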
In order to achieve this you will need to use S3 as the event source for your Lambda. There's a useful tutorial for this provided by AWS themselves, which has some sample Python code to assist you; you can view it here.
To break it down slightly further and answer how you get the name of the file: the Lambda handler will look similar to the following:
def lambda_handler(event, context):
What is important here is the event object. When your event source is the S3 bucket, you will be given the name of the bucket and the S3 key of the object, which is effectively the path to the file in the S3 bucket. With this information you can apply some logic to decide whether you want to download the file from that path. If you do, you can use the S3 get_object() API call as shown in the tutorial.
Once this file is downloaded it can be used like any other file on your local machine, so you can then proceed to convert the JSON to a CSV. Once it is converted you will presumably want to put it back in S3; for this you can use the S3 put_object() call and reuse the information in the event object to specify the path.
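A minimal sketch of such a handler, reading the bucket and key from the S3 event and converting entirely in memory (the destination key derived below is an assumption):
import io
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Bucket name and key of the object that triggered the function.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the JSON body and convert it to CSV without touching /tmp.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_json(io.BytesIO(body))

    # Write the CSV back next to the original (destination key is a placeholder).
    csv_buffer = io.StringIO()
    df.to_csv(csv_buffer, index=False)
    s3.put_object(Bucket=bucket,
                  Key=key.replace(".json", ".csv"),
                  Body=csv_buffer.getvalue().encode("utf-8"))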

move files from one aws s3 location to another aws s3 location using NiFi ExecuteScript processor

I'm getting files into an S3 location on a weekly basis and I need to move these files, after processing, to another S3 location to archive them. I have Cloudera NiFi hosted on AWS. I can't use PutS3Object + DeleteS3Object processors at the end of the flow because in this NiFi process I'm decompressing the file and adding an additional column (re-compressing the file and dropping the column hits performance). I need a Python/Groovy script to move files from one S3 location to another. Is there any other way to do this?
I need a Python/Groovy script to move files from one S3 location to another. Is there any other way to do this?
No, you don't. You can use the record processors or a script to update the files and push them to S3. We pull, mutate and re-upload data like this all the time without having to control the upload with a script.
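For reference, if you do still want to script the archive step, a "move" in S3 is just a copy followed by a delete. A minimal boto3 sketch, with placeholder bucket names and keys:
import boto3

s3 = boto3.client("s3")

def move_object(src_bucket, src_key, dst_bucket, dst_key):
    # S3 has no rename: copy to the archive location, then delete the original.
    s3.copy_object(Bucket=dst_bucket, Key=dst_key,
                   CopySource={"Bucket": src_bucket, "Key": src_key})
    s3.delete_object(Bucket=src_bucket, Key=src_key)

move_object("incoming-bucket", "weekly/file.csv",
            "archive-bucket", "weekly/file.csv")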

using aws lambda /tmp - python

for k in <some-keys>:
g.to_csv(f'/tmp/{k}.csv')
This example makes use of /tmp/. When /tmp/ is not used in g.to_csv(f'/tmp/{k}.csv'), it gives a "Read-only file system" error (see https://stackoverflow.com/a/42002539/13016237). So the question is: does AWS Lambda clear /tmp/ on its own, or does it have to be done manually? Is there any workaround for this within the scope of boto3? Thanks!
/tmp, as the name suggests, is only temporary storage. It should not be relied upon for any long-term data storage. The files in /tmp persist only for as long as the Lambda execution context is kept alive; that time is not defined and varies.
To overcome the size limitation (512 MB) and to ensure long-term data storage, there are two solutions commonly employed:
Using Amazon EFS with Lambda
Using AWS Lambda with Amazon S3
The use of EFS is easier (but not cheaper), as it presents a regular filesystem to your function which you can read and write directly. You can also re-use the same filesystem across multiple Lambda functions, instances, containers and more.
S3 will be cheaper, but there is some extra work required to use it seamlessly in Lambda. Pandas does support S3, but for seamless integration you would have to include s3fs in your deployment package (or layer) if it is not already present. S3 can also be accessed from different functions, instances and containers.
g.to_csv('s3://my_bucket/my_data.csv') should work if you package s3fs with your Lambda.
Another option is to write the CSV into memory and use boto3 to create an object in S3.
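A minimal sketch of that in-memory approach, reusing g and k from the question's loop (the bucket name is a placeholder):
import io
import boto3

s3 = boto3.client("s3")

# Render the DataFrame into a string buffer instead of /tmp, then upload it.
buffer = io.StringIO()
g.to_csv(buffer, index=False)
s3.put_object(Bucket="my-bucket", Key=f"{k}.csv",
              Body=buffer.getvalue().encode("utf-8"))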

Can we increase the Lambda /tmp folder size? Is there any method to append a file to S3?

I have a Lambda function that does some work and generates a file. I have some append operations on this file, so I write the file to the /tmp folder. After the process I upload it to S3. Sometimes the file size is bigger than 512 MB, so the function fails. So is there any method by which I can write the file directly to S3? S3 does not support appending. I used Python in the Lambda.
After a lot of searching: there is a Python package, smart_open, that allows you to write directly to S3.
smart_open
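A minimal sketch with smart_open, streaming lines straight into an S3 object so nothing has to fit in /tmp (the bucket, key, and generate_lines() source are placeholders):
from smart_open import open as s3_open

# Append-style writing: each write() call streams into the same S3 object.
with s3_open("s3://my-bucket/output/big-file.csv", "w") as fout:
    for line in generate_lines():   # hypothetical source of your rows
        fout.write(line + "\n")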
As described here, the storage of the /tmp folder cannot exceed 512 MB. It would be better to rework your logic so that, at the end of your process, you just upload the file to S3 using boto3.
But if you are working with image files, you can use Pillow to reduce the file size first.

Split S3 file into smaller files of 1000 lines

I have a text file on S3 with around 300 million lines. I'm looking to split this file into smaller files of 1,000 lines each (with the last file containing the remainder), which I'd then like to put into another folder or bucket on S3.
So far, I've been running this on my local drive using the linux command:
split -l 1000 file
which splits the original file into smaller files of 1,000 lines. However, with a larger file like this, it seems inefficient to download and then re-upload from my local drive back up to S3.
What would be the most efficient way to split this S3 file, ideally using Python (in a Lambda function) or using other S3 commands? Is it faster to just run this on my local drive?
Anything that you do will have to download the file, split it, and re-upload it. The only question is where, and whether local disk is involved.
John Rotenstein gave you an example using local disk on an EC2 instance. This has the benefit of running in the AWS datacenters, so it gets a high-speed connection, but has the limitations that (1) you need disk space to store the original file and its pieces, and (2) you need an EC2 instance where you can do this.
One small optimization is to avoid the local copy of the big file, by using a hyphen as the destination of the s3 cp: this will send the output to standard out, and you can then pipe it into split (here I'm also using a hyphen to tell split to read from standard input):
aws s3 cp s3://my-bucket/big-file.txt - | split -l 1000 - output.
aws s3 cp . s3://dest-bucket/ --recursive --exclude "*" --include "output.*"
Again, this requires an EC2 instance to run it on, and the storage space for the output files. There is, however, a flag to split that will let you run a shell command for each file in the split:
aws s3 cp s3://src-bucket/src-file - | split -l 1000 --filter 'aws s3 cp - s3://dst-bucket/result.$FILE' -
So now you've eliminated the issue of local storage, but are left with the issue of where to run it. My recommendation would be AWS Batch, which can spin up an EC2 instance for just the time needed to perform the command.
You can, of course, write a Python script to do this on Lambda, and that would have the benefit of being triggered automatically when the source file has been uploaded to S3. I'm not that familiar with the Python SDK (boto), but it appears that get_object will return the original file's body as a stream of bytes, which you can then iterate over as lines, accumulating however many lines you want into each output file.
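A rough sketch of that Lambda approach, streaming the object line by line and uploading each 1,000-line chunk (the destination bucket and key prefix are assumptions):
import boto3

s3 = boto3.client("s3")
LINES_PER_FILE = 1000

def lambda_handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    chunk, part = [], 0
    for line in body.iter_lines():   # streams without loading the whole file
        chunk.append(line)
        if len(chunk) == LINES_PER_FILE:
            upload_chunk(chunk, part, key)
            chunk, part = [], part + 1
    if chunk:                        # last file holds the remainder
        upload_chunk(chunk, part, key)

def upload_chunk(lines, part, key):
    # Destination bucket and prefix are placeholders.
    s3.put_object(Bucket="dest-bucket",
                  Key=f"split/{key}.{part:05d}",
                  Body=b"\n".join(lines) + b"\n")
Note that a 300-million-line file may well exceed the Lambda timeout, which is one reason the answer above suggests AWS Batch for jobs of this size.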
Your method seems sound (download, split, upload).
You should run the commands from an Amazon EC2 instance in the same region as the Amazon S3 bucket.
Use the AWS Command-Line Interface (CLI) to download/upload the files:
aws s3 cp s3://my-bucket/big-file.txt .
aws s3 cp --recursive folder-with-files s3://my-bucket/destination-folder/
