I have not been able to find a clear solution, so hopefully this question can be of use to others in the future.
Basically I am creating an h5py file as follows:
keys_d = [matrix_l2T.data, matrix_l2T.indices, matrix_l2T.indptr]
keys_n = ['data', 'indices', 'indptr']
file_to_save = h5py.File(FILENAME)
for i, j in zip(keys_n, keys_d):
    file_to_save.create_dataset(i, data=j)
Uploading to AWS as follows:
s3 = boto3.client('s3')
s3.upload_file(Filename = FILENAME, Key = KEY, Bucket = BUCKET)
Downloading as follows:
s3 = boto3.resource('s3')
try:
    s3.Bucket(BUCKET_NAME).download_file(KEY, FILENAME)
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
Then I have tried to open this file in a variety of ways, but none of them work without an error.
I have tried:
h5py.File(FILENAME,'r')
And I got:
OSError: Unable to open file (bad object header version number)
I also tried:
with open(FILENAME, 'rb') as f:
    a = f.read()
I need the rb because without it I get 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte. The last piece of code just returns raw binary data, and I am not really sure what to do next.
I resolved the issue: I wasn't closing the h5py file. One needs to add
file_to_save.close()
before uploading to AWS.
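For completeness, writing the datasets inside a context manager avoids the problem entirely, since the file is guaranteed to be flushed and closed before the upload. A minimal sketch reusing FILENAME, keys_n, keys_d, KEY and BUCKET from above:
import h5py
import boto3

# The with-block closes (and flushes) the HDF5 file automatically, so the
# object header on disk is complete before boto3 reads the file.
with h5py.File(FILENAME, 'w') as file_to_save:
    for i, j in zip(keys_n, keys_d):
        file_to_save.create_dataset(i, data=j)

s3 = boto3.client('s3')
s3.upload_file(Filename=FILENAME, Key=KEY, Bucket=BUCKET)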
I have a task requirement of reading a tar.gz file from an FTP server and storing it in blob storage.
The way I think I can accomplish this is to first create a temp file in the Azure Function temp directory, write all the content to it, close it, and then upload it to blob storage.
What I have done so far is:
fp = tempfile.NamedTemporaryFile()
filesDirListInTemp = listdir(tempFilePath)
logging.info(filesDirListInTemp)

try:
    with open('/tmp/fp', 'w+') as fp:
        data = BytesIO()
        save_file = ftp.retrbinary('RETR ' + filesDirListInTemp, data.write, 1024)
        data.seek(0)
        blobservice = BlobClient.from_connection_string(
            conn_str=connection_string,
            container_name=container_name,
            blob_name=filename,
            max_block_size=4*1024*1024,
            max_single_put_size=16*1024*1024)
        blobservice.upload_blob(gzip.decompress(data.read()))
        print("File Uploaded!")
except Exception as X:
    logging.info(X)
But I am getting this error: expected str, bytes or os.PathLike object, not list.
Please tell me what I am doing wrong here.
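For reference, here is a minimal sketch of the flow described above, assuming ftp is an already-connected ftplib.FTP instance, the remote files are listed with ftp.nlst(), and connection_string / container_name are defined as in the question. The error message suggests a whole list is being passed somewhere that expects a single filename string, so each file is handled one at a time:
from io import BytesIO
from azure.storage.blob import BlobClient

for remote_name in ftp.nlst():
    buffer = BytesIO()
    # 'RETR' expects a single filename string, not a list of names.
    ftp.retrbinary('RETR ' + remote_name, buffer.write, 1024)
    buffer.seek(0)

    blob_client = BlobClient.from_connection_string(
        conn_str=connection_string,
        container_name=container_name,
        blob_name=remote_name)
    # Upload the raw bytes as-is; keep the gzip.decompress(...) step from the
    # question instead if the blob should hold the decompressed content.
    blob_client.upload_blob(buffer.read(), overwrite=True)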
I have an issue that I hope someone can help me with.
So I have a process that saves some images into an S3 bucket.
Then I have a Lambda process that, using Python, is supposed to create a PDF file displaying these images.
I'm using the xhtml2pdf library for that, which I've uploaded to my Lambda environment as a layer.
My first approach was to download the image from the S3 bucket and save it into the Lambda '/tmp' directory, but I was getting this error from xhtml2pdf:
Traceback (most recent call last):
File "/opt/python/xhtml2pdf/xhtml2pdf_reportlab.py", line 359, in __init__
raise RuntimeError('Imaging Library not available, unable to import bitmaps only jpegs')
RuntimeError: Imaging Library not available, unable to import bitmaps only jpegs fileName=
<_io.BytesIO object at 0x7f1eaabe49a0>
Then I thought that converting the image to base64 would solve the issue, but I got the same error.
Can anybody here please give me some guidance on the best way to do this?
Thank you
This is a small piece of my lambda code:
from xhtml2pdf import pisa

def getFileFromS3(fileKey, fileName):
    try:
        localFileName = f'/tmp/{fileName}'
        bot_utils.log(f'fileKey : {fileKey}')
        bot_utils.log(f'fileName : {fileName}')
        bot_utils.log(f'localFileName : {localFileName}')
        s3 = boto3.client('s3')
        bucketName = 'fileholder'
        s3.download_file(bucketName, fileKey, localFileName)
        return 'data:image/jpeg;base64,' + getImgBase64(localFileName)
    except botocore.exceptions.ClientError as e:
        raise

htmlText = '<table>'
for i in range(0, len(shoppingLines), 2):
    product = shoppingLines[i]
    text = product['text']
    folderName = product['folder']
    tmpFile = getFileFromS3(f"pannings/{folderName}/{product['photo_id']}.jpg", f"{product['photo_id']}.jpg")
    htmlText += f"""<tr><td align="center"><img src="{tmpFile}" width="40" height="55"></td><td>{text}</td></tr>"""
htmlText += '</table>'

result_file = open('/tmp/file.pdf', "w+b")
pisa_status = pisa.CreatePDF(htmlText, dest=result_file)
result_file.close()
For future Google searches:
it seems the issue is with the PIL/Pillow library.
I found a working build of the library in this GitHub repo (https://github.com/keithrozario/Klayers).
When I use that version as the layer, it works...
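A quick way to confirm the layer is actually picked up at runtime is to import Pillow before xhtml2pdf needs it. A minimal sketch (the handler name is a placeholder, and it assumes the Pillow layer from the Klayers repo is attached to the function):
import PIL
from PIL import Image  # raises ImportError if the layer is missing or incompatible

def handler(event, context):
    # Log which Pillow build the layer provides, so CloudWatch confirms it loaded.
    print(f"Pillow available: {PIL.__version__}")
    # ... build the PDF with pisa.CreatePDF as in the question ...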
I'm trying to upload an object to Oracle Object Storage with the OCI Python SDK. When I try using the command line:
oci os object put -ns grddddaaaZZ -bn dev.bucket --name processed/2020-11 --file /path/to/my/file/image.tif
I actually get a response like:
Upload ID: 4f...78f0fdc5
Split file into 2 parts for upload.
Uploading object [------------------------------------] 0%
...
but when I try using the SDK:
try:
    namespace = 'grddddaaaZZ'
    bucket = 'dev.bucket'
    object_path = 'processed/2020-11/image.tif'
    with open('/path/to/my/file/image.tif', "rb") as image:
        publish_payload = image.read()
        response = object_storage.put_object(namespace, bucket, object_path, publish_payload)
except (InvalidConfig, BaseConnectTimeout, ConfigFileNotFound, ServiceError) as error:
    logging.error(">>>>>>>> Something went wrong when try to list bucket {} objects. Error {}".
                  format(bucket, error))
the upload does not complete:
...
response = object_storage.put_object(namespace, bucket, object_path, publish_payload)
File ".../.venv/lib/python3.8/site-packages/oci/object_storage/object_storage_client.py", line 4113, in put_object
return self.base_client.call_api(
File ".../.venv/lib/python3.8/site-packages/oci/base_client.py", line 272, in call_api
response = self.request(request)
File ".../.venv/lib/python3.8/site-packages/oci/base_client.py", line 378, in request
raise exceptions.RequestException(e)
oci.exceptions.RequestException: ('Connection aborted.', timeout('The write operation timed out'))
I thought it could be the size of the file (around 208 MB), but the put_object documentation says there is a 5 GB limit, so I don't think that is the issue. My last resort would be to shell out to the CLI with os.system(), but that is not what I really want.
Any clue as to what could be missing in this second approach?
You could try uploading some other data first, to see if it's the payload:
namespace = 'grddddaaaZZ'
bucket_name = 'dev.bucket'
object_name = 'processed/2020-11/test.txt'
test_data = b"Hello, World!"

obj = object_storage.put_object(
    namespace,
    bucket_name,
    object_name,
    test_data)
or you could try it without reading the file contents, just passing the file object:
namespace = 'grddddaaaZZ'
bucket = 'dev.bucket'
object_path = 'processed/2020-11/image.tif'

with open('/path/to/my/file/image.tif', 'rb') as f:
    obj = object_storage.put_object(namespace, bucket, object_path, f)
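Another option worth trying (a sketch, not from the original answer): since the CLI output shows the file being split into parts, the SDK's UploadManager performs the equivalent multipart upload instead of pushing 208 MB through a single put_object call. Namespace, bucket and paths below are the ones from the question; the part size is illustrative:
import oci

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)

# UploadManager splits large files into parts and uploads them separately,
# which is what the CLI does under the hood.
upload_manager = oci.object_storage.UploadManager(object_storage, allow_parallel_uploads=True)
response = upload_manager.upload_file(
    'grddddaaaZZ',                   # namespace
    'dev.bucket',                    # bucket name
    'processed/2020-11/image.tif',   # object name
    '/path/to/my/file/image.tif',    # local file path
    part_size=50 * 1024 * 1024)      # 50 MiB parts (illustrative)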
with open('tomcat_access_log_20221118-231901.log.zip', 'rb') as filePtr:
    upload_resp = object_storage_client.put_object(nameSpace, bucket_name='my-Test-Bucket', object_name=file_to_upload, put_object_body=filePtr)
Note: file_to_upload = 'empty_folder_for_testing/tomcat-admin-server/tomcat_access_log_20221118-231901.log.zip'
The above code gets stuck for a very long time and eventually times out, yet I can see that the file was actually uploaded properly. Any idea why the call hangs until the timeout?
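One assumption worth checking, since the snippet doesn't show how object_storage_client was built: the SDK's default read timeout may simply be shorter than the upload takes, so the PUT completes server-side while the client gives up waiting for the response. A minimal sketch of constructing the client with a longer timeout (values are illustrative):
import oci

config = oci.config.from_file()
# timeout can be a single float or a (connect, read) tuple, in seconds;
# the read timeout is the one that matters for a long-running PUT.
object_storage_client = oci.object_storage.ObjectStorageClient(config, timeout=(10, 600))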
I'm getting an error trying to read a pandas pickle (written with df.to_pickle()) that is stored in Google Cloud Storage. I'm trying to do the following:
path_to_gcs_file = 'gs://xxxxx'
f = file_io.FileIO(path_to_gcs_file, mode='r').read()
train_df = pd.read_pickle(f)
f.close()
I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Alternatively I tried:
f = BytesIO(file_io.read_file_to_string(path_to_gcs_file, binary_mode=True))
train_df = pd.read_pickle(f)
Which works locally but not on CloudML!
f = file_io.read_file_to_string(path_to_gcs_file, binary_mode=True)
train_df = pd.read_pickle(f)
Gives me an error:
AttributeError: 'bytes' object has no attribute 'seek'
You should be able to get away with using a context manager, but I think you're fighting the API by going that route, so instead download the file through the google-cloud-storage client:
pip install --upgrade google-cloud-storage
Then
import pickle
from google.cloud import storage

# Initialise a client
storage_client = storage.Client("[Your project name here]")
# Create a bucket object for our bucket
bucket = storage_client.get_bucket(bucket_name)
# Create a blob object from the filepath
blob = bucket.blob("folder_one/foldertwo/filename.extension")
# Download the file to a local destination (not the gs:// path)
local_path = "/tmp/filename.extension"
blob.download_to_filename(local_path)

with open(local_path, "rb") as f:
    train_df = pickle.load(f)
Much was taken from this answer:
Downloading a file from google cloud storage inside a folder
pandas.read_pickle accepts a path as the first argument; you are passing a file object (file_io.FileIO) in one case and a bytes object (from read_file_to_string) in the other.
So far I have not found a way to read a pickle object directly from GCS using pandas, so you will have to copy it to the machine. You could use file_io.copy for that:
file_io.copy('gs://xxxx', '/tmp/x.pkl')
train_df = pd.read_pickle('/tmp/x.pkl')
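If copying to local disk is not desirable, an alternative sketch (assuming the google-cloud-storage client is available and a pandas version whose read_pickle accepts file-like objects; the bucket and object names are placeholders) pulls the bytes straight into memory:
from io import BytesIO

import pandas as pd
from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-bucket').blob('path/to/train.pkl')  # placeholder names
# download_as_bytes() returns the object contents as bytes
# (older client versions call this download_as_string()).
train_df = pd.read_pickle(BytesIO(blob.download_as_bytes()))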
In AWS, I'm trying to save a file to S3 in Python using a Lambda function. While this works on my local computer, I am unable to get it to work in Lambda. I've been working on this problem for most of the day and would appreciate help. Thank you.
def pdfToTable(PDFfilename, apiKey, fileExt, bucket, key):

    # parsing a PDF using an API
    fileData = (PDFfilename, open(PDFfilename, "rb"))
    files = {"f": fileData}
    postUrl = "https://pdftables.com/api?key={0}&format={1}".format(apiKey, fileExt)
    response = requests.post(postUrl, files=files)
    response.raise_for_status()

    # this code is probably the problem!
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('transportation.manifests.parsed')
    with open('/tmp/output2.csv', 'rb') as data:
        data.write(response.content)
        key = 'csv/' + key
        bucket.upload_fileobj(data, key)

    # FYI, on my own computer, this saves the file
    with open('output.csv', "wb") as f:
        f.write(response.content)
In S3, there is a bucket transportation.manifests.parsed containing the folder csv where the file should be saved.
The type of response.content is bytes.
From AWS, the error from the current set-up above is [Errno 2] No such file or directory: '/tmp/output2.csv': FileNotFoundError. In fact, my goal is to save the file to the csv folder under a unique name, so tmp/output2.csv might not be the best approach. Any guidance?
In addition, I've tried using wb and w instead of rb, also to no avail. The error with wb is Input <_io.BufferedWriter name='/tmp/output2.csv'> of type: <class '_io.BufferedWriter'> is not supported. The documentation suggests that 'rb' is the recommended usage, but I do not understand why that would be the case.
Also, I've tried s3_client.put_object(Key=key, Body=response.content, Bucket=bucket) but receive An error occurred (404) when calling the HeadObject operation: Not Found.
Assuming Python 3.6: the way I usually do this is to wrap the bytes content in a BytesIO wrapper to create a file-like object. And, per the boto3 docs, you can use the transfer manager for a managed transfer:
from io import BytesIO
import boto3
s3 = boto3.client('s3')
fileobj = BytesIO(response.content)
s3.upload_fileobj(fileobj, 'mybucket', 'mykey')
If that doesn't work I'd double check all IAM permissions are correct.
You have a writable stream that you're asking boto3 to use as a readable stream, which won't work.
Write the file first, and then simply use bucket.upload_file() afterwards, like so:
s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')

# 'wb' because response.content is bytes
with open('/tmp/output2.csv', 'wb') as data:
    data.write(response.content)

key = 'csv/' + key
bucket.upload_file('/tmp/output2.csv', key)
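And since the goal mentioned in the question is a unique name under the csv/ prefix, here is a small sketch of one way to do that (the uuid-based key is purely illustrative, not part of the original answer):
import uuid
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')

local_path = '/tmp/output2.csv'
with open(local_path, 'wb') as data:
    data.write(response.content)

# Build a unique object key under the csv/ prefix, e.g. csv/3f2b9c....csv
key = 'csv/{}.csv'.format(uuid.uuid4())
bucket.upload_file(local_path, key)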