Use pickled files stored on S3 in AWS Lambda - python

I'm currently working on an AWS Lambda function where I need to use sklearn files that I created locally but that are now stored in S3. I've been stumped on how to actually do this. I've tried a few different approaches from the boto3 docs, but this one results in a timeout error:
import boto3
import pickle
s3 = boto3.client("s3", region_name="us-west-1")
response = s3.get_object(Bucket="mybucket", Key="bin_files/scaler.pkl")
loaded_file = pickle.load(response["Body"])
Doing this instead, on the other hand, results in a file-not-found error:
with open(key, 'wb') as data:
    s3.Bucket(event['bucket']).download_fileobj(key, data)
"errorMessage": "[Errno 2] No such file or directory:
'bin_files/scaler.pkl'"
Does anyone have any thoughts or tips on how to do this?

Correct code:
with open('scaler.pkl', 'wb') as data:
    s3.Bucket('bucket').download_fileobj(key, data)
If you want to download the S3 file to the current directory, then you should pass just the filename to the open() method and NOT the entire key. In your case you are passing bin_files/scaler.pkl as the filename.
As a result, the Python code will look for a bin_files directory in the current working directory (the directory where the boto script is executed) and try to write scaler.pkl inside that bin_files directory. Since the directory doesn't exist, it throws the [Errno 2] No such file or directory error.
If you include a path in the filename, then you should ensure the entire path exists with appropriate directory permissions.
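Note that inside AWS Lambda the only writable location is /tmp, so if you download the file at all (rather than streaming it from get_object as in the first snippet), the filename should point there. A minimal sketch, treating the bucket name and key from the question as placeholders:

import pickle

import boto3

s3 = boto3.client("s3", region_name="us-west-1")

def lambda_handler(event, context):
    # /tmp is the only writable directory inside the Lambda execution environment
    local_path = "/tmp/scaler.pkl"
    s3.download_file("mybucket", "bin_files/scaler.pkl", local_path)

    with open(local_path, "rb") as fh:
        scaler = pickle.load(fh)

    # ... use the unpickled scaler here ...
    return {"status": "loaded"}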

Related

Use pdfplumber and Paramiko to read a PDF file from an SFTP server

I have a direct connection to an SFTP server; the connection works without any problem and I can list the files in the selected directory. There are different kinds of files on the server, and I have several functions to read them. Below is a piece of code for .pdf files, which I read with pdfplumber:
# SSH.connect configuration
sftp = ssh.open_sftp()
path = "/server_path/.."
for filename in sftp.listdir(path):
    fullpath = path + "/" + filename
    if filename.endswith('.pdf'):
        # fullpath - full server path with filename - like /server_path/../file.pdf
        # filename - filename without path - like file.pdf
        with sftp.open(fullpath, 'rb') as fl:
            pdf = pdfplumber.open(fl)
In this for loop I want to read all the .pdf files in the chosen directory, and reading files this way works for me on localhost without any problem.
I tried to solve it this way with sftp.open(path, 'rb') as fl:, but in that case the solution doesn't work and the following error appears:
Traceback (most recent call last):
pdf = pdfplumber.open(fl)
return cls(open(path, "rb"), **kwargs)
TypeError: expected str, bytes or os.PathLike object, not SFTPFile
pdfplumber.open takes as its argument the exact path to the file including its name, in this case fullpath. How can I solve this so that it works directly from the server? And how should memory be managed in this case, since I understand these files are somehow pulled into memory? Please give me some hints.
Paramiko SFTPClient.open returns a file-like object.
To use a file-like object with pdfplumber, it seems that you can use its load function:
pdf = pdfplumber.load(fl)
You will also want to read this:
Reading file opened with Python Paramiko SFTPClient.open method is slow
As the Paramiko file-like object seems to perform suboptimally when combined with the pdfplumber.load function, as a workaround you can download the file to memory instead:
from io import BytesIO

flo = BytesIO()
sftp.getfo(fullpath, flo)
flo.seek(0)
pdf = pdfplumber.load(flo)
See How to use Paramiko getfo to download file from SFTP server to memory to process it
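For reference, here is a minimal end-to-end sketch of the in-memory variant. The host, credentials, and remote path are placeholders, and it assumes a pdfplumber version that still exposes load:

from io import BytesIO

import paramiko
import pdfplumber

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("sftp.example.com", username="user", password="secret")  # placeholders
sftp = ssh.open_sftp()

path = "/server_path/.."
for filename in sftp.listdir(path):
    if not filename.endswith('.pdf'):
        continue
    fullpath = path + "/" + filename

    # Download the whole file into memory, then hand the buffer to pdfplumber
    flo = BytesIO()
    sftp.getfo(fullpath, flo)
    flo.seek(0)
    pdf = pdfplumber.load(flo)
    print(filename, len(pdf.pages))
    pdf.close()

sftp.close()
ssh.close()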

zipfile write doesn't find files in gcloud

I'm trying to zip a few files from Google Cloud Storage.
Python's zipfile doesn't find the files in gcloud, only the ones in the project.
How can I make my code find the files in gcloud?
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
    for revenue in revenues:
        # queryset with a lot of files, so add each file to the zip
        t = tempfile.NamedTemporaryFile()
        t.write(revenue.revenue.name)
        if revenue.revenue.name:
            t.seek(0)
            with default_storage.open(revenue.revenue.name, "r") as file_data:
                zip_file.write(file_data.name, compress_type=zipfile.ZIP_DEFLATED)
                # the code doesn't get past this part
        t.close()

response = HttpResponse(content_type='application/x-zip-compressed')
response['Content-Disposition'] = 'attachment; filename=my_zip.zip'
response.write(zip_buffer.getvalue())
return response
In this part, I write the file that I opened from gcloud, but it stops inside this function:
def write(self, filename, arcname=None, compress_type=None):
    """Put the bytes from filename into the archive under the name
    arcname."""
    if not self.fp:
        raise RuntimeError(
            "Attempt to write to ZIP archive that was already closed")
    st = os.stat(filename)
    # when I try to find the file, os.stat searches in the project, not in gcloud
os.stat(filename) searches for the file in the project; how can I make it find the file in gcloud?
I will post my findings as an answer, since I would like to comment on a few things.
This is what I have understood:
You have a Python library zipfile that is used to work with ZIP files.
You are looking for files locally and add one by one into the ZIP file.
You would like to do this as well for files located in Google Cloud Storage bucket. But it is failing to find the files.
If I have misunderstood the use-case scenario, please elaborate further in a comment.
However, if this is exactly what you are trying to do, then it is not supported. In the StackOverflow question Compress files saved in Google cloud storage, it is stated that compressing files that are already in Google Cloud Storage is not possible. The solution in that question is to subscribe to newly created files, download them locally, compress them, and overwrite them in GCS. As you can see, you can list or iterate through the files stored in GCS, but you first need to download them to be able to process them.
Workaround
Therefore, in your use-case scenario, I would recommend the following workaround, by using the Python client API:
You can use Listing objects Python API, to get all the objects from GCS.
Then you can use Downloading objects Python API, to download the objects locally.
As soon as the objects are in a local directory, you can use the zipfile Python library to zip them together, as you are already doing.
Once the objects are zipped, if you no longer need the downloaded copies, you can delete them with os.remove("downloaded_file.txt").
In case you need to have the compressed ZIP file in the Google Cloud Storage bucket, then you can use the Uploading objects Python API to upload the ZIP file in the GCS bucket.
As I have mentioned above, processing files (e.g. adding them to a ZIP archive) directly in a Google Cloud Storage bucket is not supported. You first need to download them locally in order to do so. I hope that this workaround is going to be helpful to you; a sketch of the steps follows below.
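A minimal sketch of those steps using the google-cloud-storage client library; the bucket name, local paths, and output object name are placeholders, and error handling is omitted:

import os
import zipfile

from google.cloud import storage

BUCKET_NAME = "my-bucket"           # placeholder
DOWNLOAD_DIR = "/tmp/gcs_download"  # placeholder
ZIP_PATH = "/tmp/archive.zip"       # placeholder

client = storage.Client()
os.makedirs(DOWNLOAD_DIR, exist_ok=True)

# 1. List the objects in the bucket and download each one locally
local_files = []
for blob in client.list_blobs(BUCKET_NAME):
    if blob.name.endswith("/"):
        continue  # skip "directory" placeholder objects
    local_path = os.path.join(DOWNLOAD_DIR, os.path.basename(blob.name))
    blob.download_to_filename(local_path)
    local_files.append(local_path)

# 2. Zip the downloaded files
with zipfile.ZipFile(ZIP_PATH, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for local_path in local_files:
        zf.write(local_path, arcname=os.path.basename(local_path))

# 3. Optionally upload the archive back to the bucket, then clean up
bucket = client.bucket(BUCKET_NAME)
bucket.blob("archive.zip").upload_from_filename(ZIP_PATH)
for local_path in local_files:
    os.remove(local_path)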
UPDATE
As I have mentioned above, zipping files while they are in GCS bucket is not supported. Therefore I have prepared for you a working example in Python on how to use the workaround.
NOTE: As I am not an expert in operating-system commands with Python or in the zipfile library, there is probably a better and more efficient way of achieving this. However, the code that can be found in this GitHub link does the following:
Under the #Public variables: section, change BUCKET_NAME to your corresponding bucket name and execute the Python script in Google Cloud Shell.
Now my bucket structure is as follows:
gs://my-bucket/test.txt
gs://my-bucket/test1.txt
gs://my-bucket/test2.txt
gs://my-bucket/directory/test4.txt
When executing the command, the app does the following:
It will get the path where the script is executed, e.g. /home/username/myapp.
It will create a temporary directory within this directory, e.g. /home/username/myapp/temp.
It will iterate through all the files located in the bucket that you have specified and download them locally into that temp directory.
NOTE: If a file in the bucket is under a directory, it will simply download the file instead of recreating that sub-directory. You can modify the code later to make it work as you wish.
So the new downloaded files will look like this:
/home/username/myapp/temp/test.txt
/home/username/myapp/temp/test1.txt
/home/username/myapp/temp/test2.txt
/home/username/myapp/temp/test4.txt
After that, the code will zip all those files into a new zipedFile.zip located in the same directory as the main.py script that you have executed.
When this step is done as well, the script will delete the directory /home/username/myapp/temp/ with all of its contents.
As I have mentioned above, after executing the script locally, you should be able to see main.py and a zipedFile.zip file with all the zipped files from the GCS bucket. Now you can take the idea of the implementation and modify it according to your project's needs.
The final code:
zip_buffer = io.BytesIO()
base_path = '/home/everton/compressedfiles/'
fiscal_compentecy_month = datetime.date(int(year), int(month), 1)
revenues = CompanyRevenue.objects.filter(company__pk=company_id, fiscal_compentecy_month=fiscal_compentecy_month)
if revenues.count() > 0:
    path = base_path + str(revenues.first().company.user.pk) + "/"
    zip_name = "{}-{}-{}-{}".format(revenues.first().company.external_id, revenues.first().company.external_name, month, year)
    for revenue in revenues:
        filename = revenue.revenue.name.split('revenues/')[1]
        if not os.path.exists(path):
            os.makedirs(path)
        with open(path + filename, 'wb+') as file:
            file.write(revenue.revenue.read())
            file.close()
    with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
        for file in os.listdir(path):
            zip_file.write(path + file, compress_type=zipfile.ZIP_DEFLATED)
        zip_file.close()
    response = HttpResponse(content_type='application/x-zip-compressed')
    response['Content-Disposition'] = 'attachment; filename={}.zip'.format(zip_name)
    response.write(zip_buffer.getvalue())
    shutil.rmtree(path)
    return response

How to upload in-memory zip file to S3-bucket

I am trying to upload an in-memory zip file to an S3-bucket (in order to avoid temporary files on my server).
Here's my code:
import io
import zipfile

import boto3

zip_bytes_io = io.BytesIO()
zip_archive = zipfile.ZipFile(zip_bytes_io, mode='a',
                              compression=zipfile.ZIP_DEFLATED)
zip_archive.writestr('test.txt', b'My string')

s3_client = boto3.client('s3')
zip_bytes_io.seek(0)  # So that bytes are read from the beginning
s3_client.upload_fileobj(zip_bytes_io, test_bucket, 'test.zip')
When I run the above code, it does upload the file to an S3 bucket. However, when I download the file, I am unable to unzip it. I get the error: "Error 1 - operation not permitted."
I assume this is because I'm not uploading the file correctly in the first place. Any help appreciated :)
When looking for the "Error 1 - operation not permitted." error online, it looks like it might be a macOS issue with opening the zip file.
Try to open the file differently?
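One thing worth checking, as an assumption rather than something confirmed above: the ZipFile in the question is never closed before the buffer is uploaded, so the zip's central directory may never get written, which can also produce archives that unzip tools reject. A minimal sketch that closes the archive before uploading (the bucket name is a placeholder):

import io
import zipfile

import boto3

zip_bytes_io = io.BytesIO()
# The context manager guarantees the central directory is written on exit
with zipfile.ZipFile(zip_bytes_io, mode='w',
                     compression=zipfile.ZIP_DEFLATED) as zip_archive:
    zip_archive.writestr('test.txt', b'My string')

zip_bytes_io.seek(0)  # rewind so upload_fileobj reads from the beginning
s3_client = boto3.client('s3')
s3_client.upload_fileobj(zip_bytes_io, 'test-bucket', 'test.zip')  # placeholder bucket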

PySpark: how to resolve path of a resource file present inside the dependency zip file

I have a mapPartitions on an RDD, and within each partition a resource file has to be opened. The module that contains the method invoked by mapPartitions, together with the resource file, is passed to each executor as a zip file via the --py-files argument.
To make it clear:
rdd = rdd.mapPartitions(work_doing_method)

def work_doing_method(rows):
    for row in rows:
        resource_file_path = os.path.join(os.path.dirname(__file__), "resource.json")
        with open(resource_file_path) as f:
            resource = json.loads(f.read())
            ...
When I do this after passing the zip file, which includes all of this, via the --py-files parameter to the spark-submit command,
I get IOError: [Errno 20] Not a directory: /full/path/to/the/file/within/zip/file
I do not understand how Spark uses the zip file to read the dependencies. The os.path.dirname utility returns the full path including the zip file, e.g. /spark/dir/my_dependency_file.zip/path/to/the/resource/file. I believe this is the problem. I tried many combinations to resolve the path of the file. Any help is appreciated.
Thanks!
I think when you add a file to a Spark job, it will be copied to the working directory of each executor. I've used the SparkFiles API to get absolute paths to files on the executors.
You can also use the --archives flag to pass in arbitrary data archives such as zipfiles. What's the difference between --archives, --files, py-files in pyspark job arguments
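If the resource file is shipped separately with --files rather than bundled inside the --py-files zip, the SparkFiles lookup looks roughly like this sketch (the file name is a placeholder):

import json

from pyspark import SparkFiles

# spark-submit ... --files resource.json
def work_doing_method(rows):
    # SparkFiles.get returns the absolute local path of the file on this executor
    resource_file_path = SparkFiles.get("resource.json")
    with open(resource_file_path) as f:
        resource = json.loads(f.read())
    for row in rows:
        # ... use `resource` to process each row ...
        yield row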
We get the path to a resource file within an egg/zip file (inside the executor working directory) when we look for the absolute path. I ended up using the zipfile module in Python and actually opening the archive, like here.
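A rough sketch of that zipfile-based lookup; the way the path is split on ".zip" and the resource name are assumptions about how the --py-files archive is laid out:

import json
import os
import zipfile

def load_resource(resource_name="resource.json"):
    # With --py-files, __file__ can look like
    # /spark/dir/my_dependency_file.zip/path/to/module.py
    module_dir = os.path.dirname(os.path.abspath(__file__))
    if ".zip" in module_dir:
        archive_path, _, inner_dir = module_dir.partition(".zip")
        archive_path += ".zip"
        inner_dir = inner_dir.lstrip("/")
        inner_path = (inner_dir + "/" + resource_name) if inner_dir else resource_name
        # Read the resource directly out of the dependency archive
        with zipfile.ZipFile(archive_path) as zf:
            with zf.open(inner_path) as f:
                return json.loads(f.read().decode("utf-8"))
    # Fallback: the module sits on a normal filesystem path
    with open(os.path.join(module_dir, resource_name)) as f:
        return json.load(f)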

Kodi saving via special:// protocol gives ERRNO 2 (No such file or directory)

I have a problem while writing a Kodi plugin.
I am listing an entry to view a stream that provides a preview image. But since Kodi caches the images, I thought of requesting the image manually every time. To achieve that, I want to save the image to the resources/cache directory of my plugin.
But I get the following error:
Error Contents: [Errno 2] No such file or directory: 'special://home/addon_data/[plugin]/resources/caches/preview_de.png'
My code is
f = urlopen(Request(url))
local_file = open(local, 'w'+mode)
local_file.write(f.read())
local_file.close()
I guess the special:// protocol is the problem, but what can I do so that this doesn't only work on one machine?
You need to call translatePath() and use the returned string as the local path before you can use it.
Example:
local = xbmc.translatePath('special://home/addon_data/[plugin]/resources/caches/preview_de.png')
f = urlopen(Request(url))
local_file = open(local, 'w'+mode)
local_file.write(f.read())
local_file.close()
PS. To avoid caching of images, you might be able to achieve that by adding random GET data to your request.
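A small sketch of that cache-busting idea; the query parameter name is arbitrary and the urllib import assumes Python 3 (older Kodi/Python 2 add-ons would use urllib2 instead):

import random
from urllib.request import Request, urlopen

import xbmc

# Resolve the special:// path to a real filesystem path first
local = xbmc.translatePath('special://home/addon_data/[plugin]/resources/caches/preview_de.png')

# Append throwaway GET data so each request for the preview image looks unique
cache_buster = 'nocache=' + str(random.randint(0, 10**8))
busted_url = url + ('&' if '?' in url else '?') + cache_buster

f = urlopen(Request(busted_url))
with open(local, 'wb') as local_file:
    local_file.write(f.read())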
