zipfile write doesn't find files in gcloud - Python

I'm trying to zip a few files from Google Storage.
Python's zipfile doesn't find the files in gcloud, only the ones in the project.
How can I make my code find the files in gcloud?
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
    for revenue in revenues:
        # the queryset holds a lot of files, so add each file to the zip
        t = tempfile.NamedTemporaryFile()
        t.write(revenue.revenue.name)
        if revenue.revenue.name:
            t.seek(0)
            with default_storage.open(revenue.revenue.name, "r") as file_data:
                zip_file.write(file_data.name, compress_type=zipfile.ZIP_DEFLATED)
                # the code doesn't get past this point
        t.close()
response = HttpResponse(content_type='application/x-zip-compressed')
response['Content-Disposition'] = 'attachment; filename=my_zip.zip'
response.write(zip_buffer.getvalue())
return response
In this part I write the file that I opened from gcloud, but it stops inside this function:
def write(self, filename, arcname=None, compress_type=None):
    """Put the bytes from filename into the archive under the name
    arcname."""
    if not self.fp:
        raise RuntimeError(
            "Attempt to write to ZIP archive that was already closed")
    st = os.stat(filename)
    # when I try to find the file, os.stat looks in the project, not in gcloud
the "os.stat(filename)" search for a file in project, how can I do for find in the gcloud?

I will post my findings as an answer, since I would like to comment on a few things.
I have understood:
You have a Python library zipfile that is used to work with ZIP files.
You are looking for files locally and add one by one into the ZIP file.
You would like to do this as well for files located in Google Cloud Storage bucket. But it is failing to find the files.
If I have misunderstood the use-case scenario, please elaborate further in a comment.
However, if this is exactly what you are trying to do, then it is not supported. In the StackOverflow question Compress files saved in Google cloud storage, it is stated that compressing files that are already in Google Cloud Storage is not possible. The solution in that question is to subscribe to newly created files, download them locally, compress them, and overwrite them in GCS. As you can see, you can list or iterate through the files stored in GCS, but you first need to download them to be able to process them.
Workaround
Therefore, in your use-case scenario, I would recommend the following workaround, using the Python client API:
You can use the Listing objects Python API to get all the objects from GCS.
Then you can use the Downloading objects Python API to download the objects locally.
As soon as the objects are in a local directory, you can use the zipfile Python library to ZIP them together, as you are already doing.
Once the objects are zipped, and if you no longer need the downloaded copies, you can delete them with os.remove("downloaded_file.txt").
In case you need the compressed ZIP file in the Google Cloud Storage bucket, you can use the Uploading objects Python API to upload the ZIP file to the GCS bucket.
As I have mentioned above, processing files (e.g. adding them to a ZIP file) directly in a Google Cloud Storage bucket is not supported. You first need to download them locally in order to do so. I hope that my workaround is going to be helpful to you.
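For reference, a minimal sketch of that list → download → zip → upload flow with the google-cloud-storage client could look like the following; the bucket name and local paths here are placeholders, not taken from the question:
import os
import zipfile
from google.cloud import storage

BUCKET_NAME = "my-bucket"            # placeholder, use your own bucket
DOWNLOAD_DIR = "/tmp/gcs_download"   # placeholder local working directory

client = storage.Client()
os.makedirs(DOWNLOAD_DIR, exist_ok=True)

# 1. List the objects in the bucket and download each one locally.
local_files = []
for blob in client.list_blobs(BUCKET_NAME):
    if blob.name.endswith("/"):      # skip "directory" placeholder objects
        continue
    local_path = os.path.join(DOWNLOAD_DIR, os.path.basename(blob.name))
    blob.download_to_filename(local_path)
    local_files.append(local_path)

# 2. Zip the downloaded copies.
zip_path = os.path.join(DOWNLOAD_DIR, "archive.zip")
with zipfile.ZipFile(zip_path, "w") as zip_file:
    for local_path in local_files:
        zip_file.write(local_path, arcname=os.path.basename(local_path),
                       compress_type=zipfile.ZIP_DEFLATED)

# 3. Optionally upload the ZIP back to the bucket and clean up the copies.
client.bucket(BUCKET_NAME).blob("archive.zip").upload_from_filename(zip_path)
for local_path in local_files:
    os.remove(local_path)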
UPDATE
As I have mentioned above, zipping files while they are in a GCS bucket is not supported. Therefore I have prepared a working example in Python of how to use the workaround.
NOTE: As I am not an expert at running os commands from Python and I am not familiar with the zipfile library, there is probably a better and more efficient way of achieving this. However, the code that can be found in this GitHub link does the following:
Under the # Public variables: section, change BUCKET_NAME to your corresponding bucket name and execute the Python script in Google Cloud Shell.
Now my bucket structure is as follows:
gs://my-bucket/test.txt
gs://my-bucket/test1.txt
gs://my-bucket/test2.txt
gs://my-bucket/directory/test4.txt
When you execute the script, it does the following:
It gets the path where the script is executed, e.g. /home/username/myapp.
It creates a temporary directory within this directory, e.g. /home/username/myapp/temp.
It iterates through all the files located in the bucket that you have specified and downloads them locally into that temp directory.
NOTE: If a file in the bucket is under a directory, the script simply downloads the file instead of recreating that sub-directory. You can modify the code later to make it work as you desire.
So the new downloaded files will look like this:
/home/username/myapp/temp/test.txt
/home/username/myapp/temp/test1.txt
/home/username/myapp/temp/test2.txt
/home/username/myapp/temp/test4.txt
After that, the code zips all those files into a new zipedFile.zip located in the same directory as the main.py script that you executed.
When this step is done as well, the script deletes the directory /home/username/myapp/temp/ with all of its contents.
As I have mentioned above, after executing the script locally, you should be able to see main.py and a zipedFile.zip file containing all the zipped files from the GCS bucket. Now you can take the idea of the implementation and modify it according to your project's needs.

The final code:
zip_buffer = io.BytesIO()
base_path = '/home/everton/compressedfiles/'
fiscal_compentecy_month = datetime.date(int(year), int(month), 1)
revenues = CompanyRevenue.objects.filter(company__pk=company_id, fiscal_compentecy_month=fiscal_compentecy_month)
if revenues.count() > 0:
    path = base_path + str(revenues.first().company.user.pk) + "/"
    zip_name = "{}-{}-{}-{}".format(revenues.first().company.external_id, revenues.first().company.external_name, month, year)
    for revenue in revenues:
        filename = revenue.revenue.name.split('revenues/')[1]
        if not os.path.exists(path):
            os.makedirs(path)
        with open(path + filename, 'wb+') as file:
            file.write(revenue.revenue.read())
            file.close()
    with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
        for file in os.listdir(path):
            zip_file.write(path + file, compress_type=zipfile.ZIP_DEFLATED)
        zip_file.close()
    response = HttpResponse(content_type='application/x-zip-compressed')
    response['Content-Disposition'] = 'attachment; filename={}.zip'.format(zip_name)
    response.write(zip_buffer.getvalue())
    shutil.rmtree(path)
    return response

Related

How to Create list of filenames in an S3 directory using pyspark and/or databricks utils

I have a need to move files from one S3 bucket directory to two others. I have to do this from a Databricks notebook. If the file has a json extension, I will move it into jsonDir. Otherwise, I will move it into otherDir. Presumably I would do this with pyspark and databricks utils (dbutils).
I do not know the name of the S3 bucket, only the relative path off of it (call it MYPATH). For instance, I can do:
dbutils.fs.ls(MYPATH)
and it lists all the files in the S3 directory. Unfortunately with dbutils, you can move one file at a time or all of them (no wildcards). The bulk of my program is:
for file in fileList:
    if file.endswith("json"):
        dbutils.fs.mv(file, jsonDir)
        continue
    if not file.endswith("json"):
        dbutils.fs.mv(file, otherDir)
        continue
My Problem: I do not know how to retrieve the list of files from MYPATH to put them in array "fileList". I would be grateful for any ideas. Thanks.
I think your code runs if you do these minor changes:
fileList = dbutils.fs.ls(MYPATH)
for file in fileList:
    if file.name.endswith("/"):  # Don't copy dirs
        continue
    if file.name.endswith("json"):
        dbutils.fs.mv(file.path, jsonDir + file.name)
        continue
    if not file.name.endswith("json"):
        dbutils.fs.mv(file.path, otherDir + file.name)
        continue
Here, file.name is appended to keep the name of the file in the new dir. I need this on Azure dbfs-backed storage; otherwise everything gets moved to the same blob.
It is critical that jsonDir and otherDir end with a / character.

How to create directories in Azure storage container without creating extra files?

I've created Python code to create a range of folders and subfolders (for a data lake) in an Azure storage container. The code works and is based on the documentation on Microsoft Azure. One thing, though, is that I'm creating a dummy 'txt' file in the folders in order to create the directory (which I can clean up later). I was wondering if there's a way to create the folders and subfolders without creating a file. I understand that the folders in Azure container storage are not hierarchical and are instead metadata, so what I'm asking for may not be possible?
connection_string = config['azure_storage_connectionstring']
gen2_container_name = config['gen2_container_name']
container_client = ContainerClient.from_connection_string(connection_string, gen2_container_name)
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
# blob_service_client.create_container(gen2_container_name)

def create_folder(folder, sub_folder):
    blob_client = container_client.get_blob_client('{}/{}/start_here.txt'.format(folder, sub_folder))
    with open('test.txt', 'rb') as data:
        blob_client.upload_blob(data)

def create_all_folders():
    config = load_config()
    folder_list = config['folder_list']
    sub_folder_list = config['sub_folder_list']
    for folder in folder_list:
        for sub_folder in sub_folder_list:
            try:
                create_folder(folder, sub_folder)
            except Exception as e:
                print('Looks like something went wrong here trying to create this folder structure {}/{}. Maybe the structure already exists?'.format(folder, sub_folder))
No, for Blob Storage this is not possible; there is no way to create so-called "folders" on their own.
But you can use the Data Lake SDK to create a directory, like this:
from azure.storage.filedatalake import DataLakeServiceClient
connect_str = "DefaultEndpointsProtocol=https;AccountName=0730bowmanwindow;AccountKey=xxxxxx;EndpointSuffix=core.windows.net"
datalake_service_client = DataLakeServiceClient.from_connection_string(connect_str)
myfilesystem = "test"
myfolder = "test1111111111"
myfile = "FileName.txt"
file_system_client = datalake_service_client.get_file_system_client(myfilesystem)
directory_client = file_system_client.create_directory(myfolder)
Just to add some context, the reason this is not possible in Blob Storage is that folders/directories are not "real". Folders do not exist as standalone objects, they are only defined as part of a blob name.
For example, if you have a folder "mystuff" with a file (blob) "somefile.txt", the blob name actually includes the folder name and "/" character like mystuff/somefile.txt. The blob exists directly inside the container, not inside a folder. This naming convention can be nested many times over in a blob name like folder1/folder2/mystuff/anotherfolder/somefile.txt, but that blob still only exists directly in the container.
Folders can appear to exist in certain tooling (like Azure Storage Explorer) because the SDK permits blob name filtering: if you do so on the "/" character, you can mimic the appearance of a folder and its contents. But in order for a folder to even appear to exist, there must be blob in the container with the appropriate name. If you want to "force" a folder to exist, you can create a 0-byte blob with the correct folder path in the name, but the blob artifact will still need to exist.
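A small hedged sketch of that 0-byte marker trick with the azure-storage-blob SDK (the connection string, container, and blob names below are placeholders):
from azure.storage.blob import ContainerClient

# Placeholders - substitute your own connection string and container name.
container_client = ContainerClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net",
    "mycontainer")

# Upload an empty blob whose name carries the "folder" path; tools like
# Azure Storage Explorer will then show mystuff/ as if it were a folder.
container_client.upload_blob(name="mystuff/.placeholder", data=b"")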
The exception is Azure Data Lake Storage (ADLS) Gen 2, which is Blob Storage that implements a Hierarchical Namespace. This makes it more like a file system and so respects the concept of Directories as standalone objects. ADLS is built on Blob Storage, so there is a lot of parity between the two. If you absolutely must have empty directories, then ADLS is the way to go.

Use pickled files stored on S3 in AWS Lambda

I'm currently working on an Amazon Web Services Lambda function, where I need to utilize sklearn files that I've created locally but are now stored in S3. I've been stumped on how to actually do this: I've tried a few different methods from boto3's docs, but this results in a timeout error:
import boto3
import pickle
s3 = boto3.client("s3", region_name="us-west-1")
response = s3.get_object(Bucket="mybucket", Key="bin_files/scaler.pkl")
loaded_file = pickle.load(response["Body"])
While doing this on the other hand results in a file not found error:
with open(key, 'wb') as data:
    s3.Bucket(event['bucket']).download_fileobj(key, data)
"errorMessage": "[Errno 2] No such file or directory:
'bin_files/scaler.pkl'"
Does anyone have any thoughts or tips on how to do this?
Correct code:
with open('scaler.pkl', 'wb') as data:
    s3.Bucket('bucket').download_fileobj(key, data)
If you want to download the s3 file to current directory, then you should just pass the Filename to the file open() method and NOT the entire key. In your case you are passing bin_files/scaler.pkl as the Filename.
In your case, the Python code will look for the bin_files directory in the current working directory (the directory where the boto script is executed) and try to write scaler.pkl inside the bin_files directory. Since the directory doesn't exist, it throws the [Errno 2] No such file or directory error.
If you add a PATH to the Filename, then you should ensure the entire PATH exists with appropriate directory permissions.
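For instance, here is a hedged sketch that keeps the key's directory structure by creating the directory first; the bucket name and paths are illustrative, and in Lambda /tmp is normally the only writable location:
import os
import pickle
import boto3

s3 = boto3.resource("s3")
bucket_name = "mybucket"             # placeholder
key = "bin_files/scaler.pkl"

# Recreate the key's directory structure locally before downloading.
local_path = os.path.join("/tmp", key)
os.makedirs(os.path.dirname(local_path), exist_ok=True)
s3.Bucket(bucket_name).download_file(key, local_path)

with open(local_path, "rb") as f:
    scaler = pickle.load(f)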

Using python to open various links

This is my first post here and I hope I get my answers.
I want to open various links from my FTP server and do some stuff in them. My links are http://mypage/photos0001/, /photos002/, /photos003/, etc.
How can i write a script to open all of them and do the same job in all of them?
I tried:
Link = 'http://mypage/photos0001/' + 1
to do something like a loop, but this doesn't work, of course.
Any help?
Without being able to see your actual FTP directory tree, this may be a little difficult, but hopefully the following can get you started.
Consider reading up on ftplib for more information (see Docs)
import ftplib

ftp = ftplib.FTP('mypage')
ftp.login()
for dir in ftp.nlst():
    if 'photos' in dir:
        ftp.cwd('/mypage/{}'.format(dir))
        for file in ftp.nlst():
            if file.endswith('.jpg'):
                try:
                    print('Attempting to download {}...'.format(file), end=' ')
                    with open(file, 'wb') as f:
                        ftp.retrbinary('RETR ' + file, f.write, 8*1024)
                    print('[SUCCESS]')
                except Exception as e:
                    print('[FAILED]')
                    print(e)
ftp.close()
So let's try and run through what is going on here:
Log in to your FTP server mypage.
List all the directories found in the root directory of your server.
If the folder name contains 'photos' then change working directory into that folder.
List all the files in this photos sub-folder.
If the file ends in .jpg it's probably a picture we want.
Create a file on your system with the same name, and download the picture into it.
Repeat.
Now, expect to run into problems when your directory tree turns out to be slightly different than what you've described to us here; however, you should be able to modify the example to fit your server. I do know this code works, as I have been able to use it to recursively download .html files from ftp.debian.org.
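Separately, if the folders really are numbered sequentially and you just want to build the paths yourself (the 'photos0001' + 1 idea from the question), string formatting is the way to do it. A small sketch, assuming consistently zero-padded names:
import ftplib

ftp = ftplib.FTP('mypage')
ftp.login()

# Build the numbered directory names instead of adding an int to a string.
for i in range(1, 4):                       # adjust the range to your folders
    dir_name = 'photos{:04d}'.format(i)     # -> photos0001, photos0002, ...
    ftp.cwd('/mypage/{}'.format(dir_name))
    print(dir_name, ftp.nlst())             # do the same job in each directory

ftp.close()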

Opening a dynamic file in Django not in the project directory

So I am working on an app that allows users to upload CSV files and then generate graph data corresponding to their files. I have it working in development only when the files that are uploaded are sitting in the project's current working directory.
I discovered that the reason behind this is because in my graph view I am opening the file like so:
data_file = open(new_file, 'rb')
Open expects to find the file within the project directory. If I attempt to upload a file outside of that directory it throws this error:
[Errno 2] No such file or directory: 'test.CSV'
I've read about os.path.expanduser and have tried:
data_file = open(os.path.expanduser('~' + new_file), 'rb')
but without success. The above code tries to find the file in C:/Users/test.csv.
Any suggestions on how I can achieve this would be greatly appreciated.
EDIT
My current attempt is now:
file_upload_dir = os.path.join(settings.MEDIA_ROOT, 'Data_Files')
data_file = open(os.path.join(file_upload_dir, new_file), 'rb')
And the error is:
File b'test.CSV' does not exist
Data_Files is a folder within my Media folder.
You need to join ~ and the file name with the directory separator (os.sep); using os.path.join will do it for you.
data_file = open(os.path.expanduser(os.path.join('~', new_file)), 'rb')
Uploaded files, as well as files generated by your application, do not belong in the project's directory (i.e. the source code). You have a setting for where to store them (settings.MEDIA_ROOT), and you have a models.FileField to remember where they are stored and how to access them.
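A minimal sketch of what that looks like; the model and field names below are made up for illustration:
# models.py - hypothetical model; the names are illustrative only.
from django.db import models

class DataUpload(models.Model):
    # Stored under MEDIA_ROOT/Data_Files/ automatically.
    data_file = models.FileField(upload_to='Data_Files/')

# views.py - open the uploaded file through the field instead of a hand-built path.
def read_upload(upload):
    with upload.data_file.open('rb') as f:
        return f.read()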
