I am trying to upload an in-memory zip file to an S3 bucket (in order to avoid temporary files on my server).
Here's my code:
import zipfile
import io
import boto3

zip_bytes_io = io.BytesIO()
zip_archive = zipfile.ZipFile(zip_bytes_io, mode='a',
                              compression=zipfile.ZIP_DEFLATED)
zip_archive.writestr('test.txt', b'My string')

s3_client = boto3.client('s3')
zip_bytes_io.seek(0)  # So that bytes are read from beginning
s3_client.upload_fileobj(zip_bytes_io, test_bucket, 'test.zip')
When I run the above code, it does upload the file to an S3 bucket. However, when I download the file, I am unable to unzip it. I get the error: "Error 1 - operation not permitted."
I assume this is because I'm not uploading the file correctly in the first place. Any help appreciated :)
Searching for the "Error 1 - operation not permitted" error online, it looks like it might be a macOS issue with opening zip files.
Have you tried opening the file differently?
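For reference, a minimal sketch of the in-memory upload with one change worth checking (my assumption, not something confirmed in the thread): close the ZipFile before uploading the buffer, since zipfile only writes the archive's end-of-central-directory record on close(), and an unclosed archive downloads as a corrupt zip.
import io
import zipfile

import boto3

test_bucket = 'your-bucket-name'  # placeholder bucket name

zip_bytes_io = io.BytesIO()
# The context manager calls close(), which finalizes the archive.
with zipfile.ZipFile(zip_bytes_io, mode='w',
                     compression=zipfile.ZIP_DEFLATED) as zip_archive:
    zip_archive.writestr('test.txt', b'My string')

zip_bytes_io.seek(0)  # rewind so upload_fileobj reads from the beginning
s3_client = boto3.client('s3')
s3_client.upload_fileobj(zip_bytes_io, test_bucket, 'test.zip')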
I am getting a "[Errno 2] No such file or directory" error when trying to download files from an FTP server to a Pandas dataframe. The files are in the root directory of the FTP.
I am guessing that the pd.read_csv() function is looking at my local file system, i.e. at the local path where the script resides, but I do not understand how to change this.
import ftplib

import pandas as pd

def fetch_files(site, username, password, directory: str = '/', filematch: str = '*.csv'):
    # pass the url without protocol
    with ftplib.FTP(site) as ftp:
        # pass credentials if anonymous access is not allowed
        ftp.login(username, password)
        ftp.cwd(directory)

        list_ = []
        for file_ in ftp.nlst(filematch):
            print(file_)  # This works
            df = pd.read_csv(file_, index_col=None, header=0)  # This fails
            list_.append(df)
Or would I have to use the ftp.retrlines() method? If so, what is the difference between the LIST and MLSD commands?
On a side note: the CSV files have HTML code in them, like &, which breaks the SQL bulk insert. That's the other reason I am reading them into a dataframe: to fix the encoding and merge the individual files. Is there a faster way to do this directly via the Python csv module?
Thank you in advance
Use FTP.retrbinary and BytesIO to download the file to memory and then pass the in-memory file-like object to read_csv:
from io import BytesIO

flo = BytesIO()
ftp.retrbinary('RETR ' + file_, flo.write)
flo.seek(0)
pd.read_csv(flo, ...)
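A sketch of how this might slot into the loop from the question (same names as the question; the final concat is my assumption about what the merging step looks like):
import ftplib
from io import BytesIO

import pandas as pd

def fetch_files(site, username, password, directory: str = '/', filematch: str = '*.csv'):
    with ftplib.FTP(site) as ftp:
        ftp.login(username, password)
        ftp.cwd(directory)

        frames = []
        for file_ in ftp.nlst(filematch):
            flo = BytesIO()
            ftp.retrbinary('RETR ' + file_, flo.write)  # download into memory
            flo.seek(0)                                 # rewind before parsing
            frames.append(pd.read_csv(flo, index_col=None, header=0))

    # merge the individual files into one dataframe
    return pd.concat(frames, ignore_index=True)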
Similar question: Reading files from FTP server to DataFrame in Python
The above loads the whole CSV file into memory and only then parses it. If you want to parse the file as it downloads, that would probably require implementing a smart custom file-like object, which is not easy.
For a question that does something similar, see my answer to:
Get files names inside a zip file on FTP server without downloading whole archive.
Around 60 CSV files are generated daily in my S3 bucket, with an average size of around 500 MB each. I want to zip all these files with a Lambda function on the fly (without downloading the files inside the Lambda execution) and upload the zipped files to another S3 bucket. I came across these solutions 1 and 2 but I am still having issues with the implementation. Right now, I am trying to stream CSV file data into a zip file (this zip file is being created in the Lambda /tmp directory) and then upload it to S3, but I am getting this error message while writing into the zip file:
[Errno 36] File name too long
This is my test Lambda function where I am trying with just one file, but in the actual case I need to zip 50-60 CSV files individually:
import boto3
import zipfile

def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    iterator = s3.Object('bucket-name', 'file-name').get()['Body'].iter_lines()

    my_zip = zipfile.ZipFile('/tmp/test.zip', 'w')
    for line in iterator:
        my_zip.write(line)  # raises "[Errno 36] File name too long" here

    s3.meta.client.upload_fileobj('file-name', "another-bucket-name", "object-name")
Also, is there a way I can stream data from my CSV file, zip it, and upload it to another S3 bucket without actually holding the full zip file in Lambda's memory?
After a lot of research and trials, I was able to make it work. I used the smart_open library and managed to compress a 550 MB file with just 150 MB of memory usage in my Lambda. To use the external library, I had to use Layers in Lambda. Here is my code:
from smart_open import open, register_compressor
import lzma, os

def lambda_handler(event, context):
    with open('s3://bucket-name-where-large-file/file-key-name') as fin:
        with open('s3://bucket-name-to-put-zip-file/zip-file-key-name', 'w') as fout:
            for line in fin:
                fout.write(line)
Please note, smart_open supports .gz and .bz2 compression out of the box. If you want to compress files in other formats, you can create your own compressor using this library's register_compressor method.
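For illustration, a sketch of what registering a custom compressor could look like, following the register_compressor(extension, handler) pattern from smart_open's documentation (the .xz handler and the key names are my own example, not part of the original answer):
import lzma

from smart_open import open, register_compressor

def _handle_xz(file_obj, mode):
    # Wrap the underlying (S3) stream in an LZMA compressor/decompressor
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

register_compressor('.xz', _handle_xz)

# Any key ending in .xz is now compressed/decompressed transparently
with open('s3://bucket-name-where-large-file/file-key-name') as fin:
    with open('s3://bucket-name-to-put-zip-file/file-key-name.xz', 'w') as fout:
        for line in fin:
            fout.write(line)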
I have a really large 7z file in an S3 bucket, say s3://tempbucket1/Test_For7zip.7z, that runs into several tens of GB. I do not want to download it, unzip it, and re-upload it back to S3. I want to use Boto3 to unzip it on the fly and save it into S3.
I tried to solve this using the lzma package, based on a previous SO answer which dealt with on-the-fly unzipping of *.gz files using the fileobj option of gzip.GzipFile.
from io import BytesIO
import gzip
import lzma
import boto3

# setup constants
bucket = 'tempbucket1'
gzipped_key = 'Test_For7zip.7z'
uncompressed_key = 'Test_Unzip7zip'

# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False)

s3.upload_fileobj(                  # upload a new obj to s3
    Fileobj=lzma.LZMAFile(
        BytesIO(s3.get_object(Bucket=bucket,
                              Key=gzipped_key)['Body'].read()),
        'rb'),                      # read binary
    Bucket=bucket,                  # target bucket, writing to
    Key=uncompressed_key)           # target key, writing to
However, this throws the following error:
LZMAError: Input format not supported by decoder
Is there a Python package that can decode 7z files from a BytesIO object, or is there a better way of achieving this?
I never tried this, but Googling gave me this as a possible solution. Please reach out through this post if this solves your problem.
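For what it's worth, one package that can read 7z archives from a file-like object is py7zr (my suggestion; not necessarily the solution the answer links to). A rough sketch, reusing the bucket and key names from the question and assuming the archive fits in memory:
from io import BytesIO

import boto3
import py7zr

bucket = 'tempbucket1'
archive_key = 'Test_For7zip.7z'

s3 = boto3.client('s3')

# Read the 7z archive into memory (an archive of several tens of GB will
# not fit, so truly huge files would still need a streaming approach)
body = s3.get_object(Bucket=bucket, Key=archive_key)['Body'].read()

with py7zr.SevenZipFile(BytesIO(body), mode='r') as archive:
    # readall() returns a dict mapping member names to in-memory file objects
    for name, member in archive.readall().items():
        member.seek(0)
        s3.upload_fileobj(member, bucket, 'Test_Unzip7zip/' + name)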
I have a direct connection to an SFTP server – the connection works and I can list files from the selected directory without any major problem. There are different files on the server and I have several functions to read them; below is a piece of code concerning .pdf files – I use pdfplumber to read PDF files:
# SSH.connect configuration
sftp = ssh.open_sftp()

path = "/server_path/.."
for filename in sftp.listdir(path):
    fullpath = path + "/" + filename
    if filename.endswith('.pdf'):
        # fullpath - full server path with filename - like /server_path/../file.pdf
        # filename - filename without path - like file.pdf
        with sftp.open(fullpath, 'rb') as fl:
            pdf = pdfplumber.open(fl)
In this for loop I want to read all the .pdf files in the chosen directory - and it works for me on localhost without any problem.
I tried to solve it this way, with sftp.open(fullpath, 'rb') as fl:, but this solution doesn't work and the following error appears:
Traceback (most recent call last):
pdf = pdfplumber.open(fl)
return cls(open(path, "rb"), **kwargs)
TypeError: expected str, bytes or os.PathLike object, not SFTPFile
pdfplumber.open takes as an argument the exact path to the file with its name – in this case fullpath. How can I solve this problem so that it works directly from the server? And how should I manage memory in this case, since I understand these files are somehow pulled into memory? Please give me some hints.
Paramiko SFTPClient.open returns a file-like object.
To use a file-like object with pdfplumber, it seems that you can use the load function:
pdf = pdfplumber.load(fl)
You will also want to read this:
Reading file opened with Python Paramiko SFTPClient.open method is slow
As the Paramiko file-like object seems to work suboptimally when combined with the pdfplumber.load function, as a workaround, you can download the file to memory instead:
from io import BytesIO

flo = BytesIO()
sftp.getfo(fullpath, flo)
flo.seek(0)
pdf = pdfplumber.load(flo)
See How to use Paramiko getfo to download file from SFTP server to memory to process it
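For context, the workaround dropped into the loop from the question might look roughly like this (a sketch reusing the question's names; the ssh connection setup is assumed to already exist):
from io import BytesIO

import pdfplumber

# SSH.connect configuration is assumed to have been done already
sftp = ssh.open_sftp()

path = "/server_path/.."
for filename in sftp.listdir(path):
    if filename.endswith('.pdf'):
        fullpath = path + "/" + filename

        # Download the whole file into memory instead of reading it
        # through the (slow) SFTP file-like object
        flo = BytesIO()
        sftp.getfo(fullpath, flo)
        flo.seek(0)

        pdf = pdfplumber.load(flo)
        # ... process pdf.pages here ...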
I'm currently working on an Amazon Web Services Lambda function, where I need to utilize sklearn files that I've created locally but are now stored in S3. I've been stumped on how to actually do this; I've tried a few different methods from boto3's docs, but this results in a timeout error:
import boto3
import pickle
s3 = boto3.client("s3", region_name="us-west-1")
response = s3.get_object(Bucket="mybucket", Key="bin_files/scaler.pkl")
loaded_file = pickle.load(response["Body"])
Doing this, on the other hand, results in a file-not-found error:
with open(key, 'wb') as data:
    s3.Bucket(event['bucket']).download_fileobj(key, data)
"errorMessage": "[Errno 2] No such file or directory:
'bin_files/scaler.pkl'"
Does anyone have any thoughts or tips on how to do this?
Correct code:
with open('scaler.pkl', 'wb') as data:
    s3.Bucket('bucket').download_fileobj(key, data)
If you want to download the S3 file to the current directory, then you should just pass the filename to the open() call and NOT the entire key. In your case you are passing bin_files/scaler.pkl as the filename.
With that, the Python code will look for a bin_files directory in the current working directory (the directory where the boto script is executed) and try to write scaler.pkl inside that bin_files directory. Since the directory doesn't exist, it throws the [Errno 2] No such file or directory error.
If you add a path to the filename, then you should ensure the entire path exists with appropriate directory permissions.
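Since this runs in Lambda, where only /tmp is writable, here is a sketch of how the download-then-unpickle step could look (the bucket and key are taken from the question; writing to /tmp and the unpickling step are my additions):
import pickle

import boto3

s3 = boto3.resource('s3')

# Lambda only allows writes under /tmp
local_path = '/tmp/scaler.pkl'
with open(local_path, 'wb') as data:
    s3.Bucket('mybucket').download_fileobj('bin_files/scaler.pkl', data)

with open(local_path, 'rb') as f:
    scaler = pickle.load(f)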