Google Storage // Cloud Function // Python: Modify CSV file in the Bucket

Thanks for reading.
I'm having trouble modifying a CSV file in a bucket. I know how to copy/rename/move a file, but I have no idea how to modify a file without downloading it to the local machine.
My general idea is to download the blob (CSV file) as bytes, modify it, and upload it back to the bucket as bytes. But I don't understand how to modify the bytes.
What I need to do to the CSV: add a new header column, date, and add a value (today's date) to each row of the CSV.
---INPUT---
CSV file in the Bucket:
a,b
1,2
---OUTPUT---
updated CSV file in the Bucket:
a,b,date
1,2,today
my code:
from datetime import date
from google.cloud import storage

storage_client = storage.Client()

def addDataToCsv(bucket, fileName):
    today = str(date.today())
    bucket = storage_client.get_bucket(bucket)
    blob = bucket.blob(fileName)
    fileNameText = blob.download_as_string()
    # This is where the "magic" bytes modification should happen
    blobNew = bucket.blob(path + '/' + 'mod.csv')  # path is defined elsewhere
    blobNew.upload_from_string(fileNameText, content_type='text/csv')
Please help, and thank you for your time and effort.

If I understand correctly, you want to modify the CSV file in the bucket without downloading it to the local machine's file system.
You cannot directly edit a file in a Cloud Storage bucket, aside from its metadata, so you will need to download it to your local machine somehow and push the changes back to the bucket.
Objects are immutable, which means that an uploaded object cannot change throughout its storage lifetime.
However, one approach would be to use Cloud Storage FUSE, which mounts a Cloud Storage bucket as a file system, so you can edit any file from there and the changes are applied to your bucket.
Still, if this is not a suitable solution for you, the bytes can be downloaded and modified as you propose: decode the bytes object (commonly using UTF-8, although that depends on your characters) and re-encode it before uploading it.
# Create a list with every line of the CSV file
csv_array = fileNameText.decode("utf-8").split("\n")
# Add the new header column
csv_array[0] = csv_array[0] + ",date\n"
# Add the date to each data row (skip any empty trailing line)
for i in range(1, len(csv_array)):
    if csv_array[i]:
        csv_array[i] = csv_array[i] + "," + today + "\n"
# Re-encode the list back into bytes for the upload
fileNameText = ''.join(csv_array).encode("utf-8")
Take into account that if your local machine has serious storage or performance limitations, if your CSV is large enough that handling it as above might cause problems, or just for reference, you could use the compose command. For this you would need to modify the code above so that only a section of the CSV file is edited and uploaded at a time, and the pieces are then joined by gsutil compose in Cloud Storage.
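A minimal sketch of that compose step, assuming the edited pieces have already been uploaded under hypothetical names like parts/part-0.csv and parts/part-1.csv (the Python client exposes the same operation as Blob.compose):

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('my-bucket')  # hypothetical bucket name

# The individually edited pieces that were uploaded one at a time
parts = [bucket.blob('parts/part-0.csv'), bucket.blob('parts/part-1.csv')]

# Server-side concatenation: no CSV data passes through the local machine
composed = bucket.blob('mod.csv')
composed.compose(parts)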

Sorry, I know I'm not in your shoes, but if I were you I would try to keep things simple. Indeed, most systems work best when they are kept simple, and they are easier to maintain and share (KISS principle). Given that you are using your local machine, I assume you have generous network bandwidth and enough disk space and memory, so I would not hesitate to download the file, modify it, and upload it again, even when dealing with big files.
Then, if you are willing to use another file format:
download blob (csv file) as bytes
In this case, a better solution for both size and code simplicity is to convert your file to the Parquet or Avro format. These formats will drastically reduce your file size, especially if you add compression. They also let you keep a structure for your data, which makes modifications much simpler. Finally, there are many resources on the net on how to use these formats with Python, and comparisons between CSV, Avro, and Parquet.
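As a minimal sketch of that conversion, assuming pandas with pyarrow and gcsfs installed and a hypothetical bucket path:

import pandas as pd
from datetime import date

# Read the CSV straight from the bucket (gcsfs handles the gs:// URL)
df = pd.read_csv('gs://my-bucket/data.csv')  # hypothetical path

# Add the date column to every row
df['date'] = str(date.today())

# Write it back as compressed Parquet
df.to_parquet('gs://my-bucket/data.parquet', compression='snappy')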

Related

Force overwrite when saving to GCP bucket (pd.to_csv)

I have a little problem that I cannot solve and would really appreciate any input on this. Whenever I've used pd.to_csv() to write a file to a folder, it has done what I expected, namely overwrite any file with the same name.
I am now working solely on GCP and I need to refresh tokens to access some APIs using existing ones. Doing that is not a problem, but what I've done is save copies of all the API tokens in a dataframe that I store in a bucket. All my files have a datestamp:
Saving
base = datetime.date.today()
bucket_name_tokens = 'project-tokens'
bucket_path_tokens = 'gs://' + bucket_name_tokens
tokens.to_csv(bucket_path_tokens +'/tokens_{}.csv'.format(base))
Reading
bucket_name_tokens = 'project-tokens'
bucket_path_tokens = 'gs://' + bucket_name_tokens
tokens_new = pd.read_csv(bucket_path_tokens +'/tokens_{}.csv'.format(base))
I discovered that if I (for some reason) refresh my token and save the new one using the code above (on the same day, obviously), it does not overwrite the older one with the same name.
I've looked everywhere for an explanation, or for an option to pandas.to_csv() that I wasn't aware of, without any luck.
Any idea how to ensure that files are overwritten? Have I missed something?
NOTE: This is also true of all the files I save in any of the buckets I have created.
Thanks in advance
Don't create a storage bucket with a retention policy.
Retention is nice, but while it is in effect it won't allow overwriting or deleting items in the bucket.
Personally, I find it better not to have any retention.
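A minimal sketch of checking, and clearing if it is not locked, a bucket's retention policy with the google-cloud-storage client (bucket name taken from the question):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('project-tokens')

print('retention period (seconds):', bucket.retention_period)
print('policy locked:', bucket.retention_policy_locked)

# A retention policy can only be removed while it is not locked
if bucket.retention_period and not bucket.retention_policy_locked:
    bucket.retention_period = None
    bucket.patch()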

How can one use the StorageStreamDownloader to stream download from a blob and stream upload to a different blob?

I believe I have a very simple requirement for which a solution has befuddled me. I am new to the azure-python-sdk and have had little success with its new blob streaming functionality.
Some context
I have used the Java SDK for several years now. Each CloudBlockBlob object has a BlobInputStream and a BlobOutputStream object. When a BlobInputStream is opened, one can invoke its many functions (most notably its read() function) to retrieve data in a true-streaming fashion. A BlobOutputStream, once retrieved, has a write(byte[] data) function where one can continuously write data as frequently as they want until the close() function is invoked. So, it was very easy for me to:
Get a CloudBlockBlob object, open its BlobInputStream, and essentially get back an InputStream that was 'tied' to the CloudBlockBlob. It usually maintained 4MB of data - at least, that's what I understood. When some amount of data is read from its buffer, new data of the same amount is loaded in, so it always has approximately 4MB of new data (until all data is retrieved).
Perform some operations on that data.
Retrieve the CloudBlockBlob object that I am uploading to, get its BlobOutputStream, and write to it the data I did some operations on.
A good example of this is if I wanted to compress a file. I had a GzipStreamReader class that would accept a BlobInputStream and a BlobOutputStream. It would read data from the BlobInputStream and, whenever it had compressed some amount of data, write to the BlobOutputStream. It could call write() as many times as it wished; when it finished reading all the data, it would close both Input and Output streams, and all was good.
Now for Python
Now, the Python SDK is a little different, and obviously for good reason; the io module works differently from Java's InputStream and OutputStream classes (which the Blob{Input/Output}Stream classes inherit from). I have been struggling to understand how streaming truly works in Azure's Python SDK. To start out, I am just trying to see how the StorageStreamDownloader class works. It seems like the StorageStreamDownloader is what holds the 'connection' to the BlockBlob object I am reading data from. If I want to put the data in a stream, I would make a new io.BytesIO() and pass that stream to the StorageStreamDownloader's readinto method.
For uploads, I would call the BlobClient's upload method. The upload method accepts a data parameter that is of type Union[Iterable[AnyStr], IO[AnyStr]].
I don't want to go into too much detail about what I understand, because what I understand and what I have done have gotten me nowhere. I am suspicious that I am expecting something that only the Java SDK offers. But, overall, here are the problems I am having:
When I call download_blob, I get back a StorageStreamDownloader with all the data in the blob. Some investigation has shown that I can use the offset and length parameters to download only the amount of data I want. Perhaps I can call it once with download_blob(offset=0, length=4MB), process the data I get back, then call download_blob(offset=4MB, length=4MB) again, process the data, and so on. This is unfavorable. The other thing I could do is utilize the max_chunk_get_size parameter for the BlobClient and turn on the validate_content flag (set it to true) so that the StorageStreamDownloader only downloads 4 MB at a time. But this all results in several problems: that's not really streaming from a stream object. I'll still have to call download and readinto several times. And fine, I would do that, if it weren't for the second problem:
How the heck do I stream an upload? The upload can take a stream. But if the stream doesn't auto-update itself, then I can only upload once, because all the blobs I deal with must be BlockBlobs. The docs for the upload_blob function say that I can provide an overwrite param that does:
keyword bool overwrite: Whether the blob to be uploaded should overwrite the current data.
If True, upload_blob will overwrite the existing data. If set to False, the
operation will fail with ResourceExistsError. The exception to the above is with Append
blob types: if set to False and the data already exists, an error will not be raised
and the data will be appended to the existing blob. If set overwrite=True, then the existing
append blob will be deleted, and a new one created. Defaults to False.
And this makes sense because BlockBlobs, once written to, cannot be written to again. So AFAIK, you can't 'stream' an upload. If I can't have a stream object that is directly tied to the blob, or holds all the data, then the upload() function will terminate as soon as it finishes, right?
Okay. I am certain I am missing something important. I am also somewhat ignorant when it comes to the io module in Python. Though I have developed in Python for a long time, I never really had to deal with that module too closely. I am sure I am missing something, because this functionality is very basic and exists in all the other azure SDKs I know about.
To recap
Everything I said above can honestly be ignored, and only this portion read; I am just trying to show I've done some due diligence. I want to know how to stream data from a blob, process the data I get in a stream, then upload that data. I cannot be receiving all the data in a blob at once. Blobs are likely to be over 1GB and all that pretty stuff. I would honestly love some example code that shows:
Retrieving some data from a blob (the data received in one call should not be more than 10MB) in a stream.
Compressing the data in that stream.
Upload the data to a blob.
This should work for blobs of all sizes; whether it's 1MB or 10MB or 10GB should not matter. Step 2 can be anything really; it can also be nothing. Just as long as data is being downloaded, inserted into a stream, then uploaded, that would be great. Of course, the other extremely important constraint is that the data per 'download' shouldn't be more than 10MB.
I hope this makes sense! I just want to stream data. This shouldn't be that hard.
Edit:
Some people may want to close this and claim the question is a duplicate. I have forgotten to include something very important: I am currently using the newest, most up-to-date azure-sdk version. My azure-storage-blob package's version is 12.5.0. There have been other questions similar to mine, but for severely outdated versions. I have searched for other answers, but haven't found any for 12+ versions.
If you want to download an Azure blob in chunks, process each chunk of data, and upload each chunk to an Azure blob, please refer to the following code:
import uuid
from azure.storage.blob import BlobClient, BlobBlock

key = '<account key>'

source_blob_client = BlobClient(account_url='https://andyprivate.blob.core.windows.net',
                                container_name='<container name>',
                                blob_name='<blob name>',
                                credential=key,
                                max_chunk_get_size=4*1024*1024,  # the size of each chunk is 4 MB
                                max_single_get_size=4*1024*1024)

des_blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                             container_name='<container name>',
                             blob_name='<blob name>',
                             credential=key)

stream = source_blob_client.download_blob()
block_list = []

# read the data in chunks
for chunk in stream.chunks():
    # process your data, then stage the result as a block
    processed = chunk  # placeholder for the data after you process it

    # use the Put Block REST API to upload the chunk to Azure Storage
    blk_id = str(uuid.uuid4())
    des_blob_client.stage_block(block_id=blk_id, data=processed)
    block_list.append(BlobBlock(block_id=blk_id))

# use the Put Block List REST API to assemble the staged blocks into one blob
des_blob_client.commit_block_list(block_list)
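If the per-chunk processing is compression (as in the question's Java example), note that gzipping each chunk independently would not produce one valid gzip stream. A minimal sketch, reusing the source_blob_client and des_blob_client from above, is to feed the chunks through a single incremental compressor and stage whatever it emits:

import uuid
import zlib
from azure.storage.blob import BlobBlock

# wbits=31 makes zlib emit a gzip-formatted stream incrementally
compressor = zlib.compressobj(wbits=31)
block_list = []

for chunk in source_blob_client.download_blob().chunks():
    out = compressor.compress(chunk)
    if out:  # the compressor may buffer small inputs and emit nothing yet
        blk_id = str(uuid.uuid4())
        des_blob_client.stage_block(block_id=blk_id, data=out)
        block_list.append(BlobBlock(block_id=blk_id))

# flush the remaining compressed bytes as a final block
tail = compressor.flush()
if tail:
    blk_id = str(uuid.uuid4())
    des_blob_client.stage_block(block_id=blk_id, data=tail)
    block_list.append(BlobBlock(block_id=blk_id))

des_blob_client.commit_block_list(block_list)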
Besides, if you just want to copy one blob from one storage location to another, you can directly use the start_copy_from_url method.

Identify new files in FTP and write them to AWS S3

I'm currently using ftplib in Python to get some files and write them to S3.
The approach I'm using is to use with open as shown below:
with open('file-name', 'wb') as fp:
    ftp.retrbinary('RETR file-name', fp.write)
to download files from the FTP server and save them in a temporary folder, then upload them to S3.
I wonder if this is best practice, because the shortcoming of this approach is:
if the files are too many and too big, I can download them, upload them to S3, and then delete them from the temp folder,
but the question is: if I run this script once a day, I have to download everything again. So how can I check whether a file has already been downloaded and exists in S3, so that the script only processes the newly added files on the FTP server?
Hope this makes sense; it would be great if anyone has an example or something. Many thanks.
You cache the fact that you processed a given file path to persistent storage (say, a SQLite database). If a file may change after you processed it, you may be able to detect this by also caching the timestamp from FTP.dir() and/or the size from FTP.size(filename). If that doesn't work, you can also cache a checksum (say, SHA256) of the file, then download the file again to recalculate the checksum and see whether the file changed. S3 might support a conditional upload (ETag), in which case you would calculate the ETag of the file and upload it with that header set, ideally with an 'Expect: 100-continue' header, to see whether S3 already has the file before you try to upload the data.
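A minimal sketch of the simplest variant of that idea, skipping FTP files whose keys already exist in S3 and uploading the rest from memory, assuming boto3 and hypothetical bucket and FTP host names:

import io
from ftplib import FTP

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET = 'my-bucket'  # hypothetical bucket name

def already_uploaded(key):
    # HEAD the object; a 404 means it has not been uploaded yet
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == '404':
            return False
        raise

ftp = FTP('ftp.example.com')  # hypothetical FTP host
ftp.login()

for name in ftp.nlst():
    if already_uploaded(name):
        continue
    # download into memory and upload without touching the local disk
    buf = io.BytesIO()
    ftp.retrbinary('RETR ' + name, buf.write)
    buf.seek(0)
    s3.upload_fileobj(buf, BUCKET, name)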

Access HDF files stored on s3 in pandas

I'm storing pandas data frames dumped in HDF format on S3. I'm pretty much stuck, as I can't pass a file pointer, a URL, an s3 URL, or a StringIO object to read_hdf. If I understand correctly, the file must be present on the filesystem.
Source: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L315
It looks like it's implemented for CSV but not for HDF. Is there any better way to open those HDF files than copying them to the filesystem?
For the record, these HDF files are being handled on a web server, which is why I don't want a local copy.
If I need to stick with the local file: is there any way to emulate that file on the filesystem (with a real path) that can be destroyed after the reading is done?
I'm using Python 2.7 with Django 1.9 and pandas 0.18.1.
Newer versions of pandas allow reading an HDF5 file directly from S3, as mentioned in the read_hdf documentation, so perhaps you should upgrade pandas if you can. This of course assumes you've set the right access rights to read those files: either with a credentials file or with public ACLs.
Regarding your last comment, I am not sure why storing several HDF5 files per dataframe would necessarily be an argument against using HDF5. Pickle should be much slower than HDF5, though joblib.dump might partially improve on this.
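If upgrading is not an option, a minimal sketch of the "real path that is destroyed after reading" approach, assuming boto3 and hypothetical bucket and key names:

import tempfile

import boto3
import pandas as pd

s3 = boto3.client('s3')

# Download the object into a temporary file that is deleted when the block exits
with tempfile.NamedTemporaryFile(suffix='.h5') as tmp:
    s3.download_fileobj('my-bucket', 'frames/df.h5', tmp)  # hypothetical bucket/key
    tmp.flush()
    df = pd.read_hdf(tmp.name)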

How do i combine wav files (created using recorder.js) in python?

I am building a client-server application using Python and JavaScript.
In the frontend, I'm recording audio using recorder.js.
After some fixed interval, I use exportWav() and send the audio file to the server.
At the backend I now need to concatenate these files to make the bigger audio file again.
I saw this question, but I don't have actual .wav files, just the blobs returned by exportWav.
I'm also using App Engine, so I cannot write the output to a wav file. I need to create another audio blob that I can store in the datastore.
Any ideas?
Is each segment the complete binary data for a wav file? You'll need to use some kind of format-aware library to concatenate the wavs (a sketch using the standard library's wave module is at the end of this answer). The implementation you choose is up to you, but of course it will need to be in Python. On the other hand, you could use a Compute Engine instance to run a binary that concatenates the wavs, using the Cloud Storage client library to ultimately put those wav files in the bucket, cleaning up any temporary files afterward.
If they're just segments of a single wav's binary, you can simply transfer the data and use the Cloud Storage client library to open the relevant Cloud Storage blob for writing, appending the new portion to the end of the "file".
It really comes down to the fact that you yourself need to understand what's being returned by exportWav.
If you're set on using blob properties in Datastore, you can of course do this; just look up the relevant documentation for storing blobs in Datastore, and be aware that you can't "update" objects or "concatenate" to their properties. If you put a wav today and want to concatenate to it in 3 months, you'll need to grab the full entity and blob, delete it, concatenate the new portion in memory, and then put it back.
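For the first case, where each segment is a complete wav file with identical audio parameters, a minimal sketch using the standard library's wave module, done entirely in memory so no local wav file is ever written:

import io
import wave

def concat_wav_blobs(blobs):
    # Concatenate complete wav segments (bytes) into one wav file (bytes)
    out_buf = io.BytesIO()
    out_wav = None
    for blob in blobs:
        seg = wave.open(io.BytesIO(blob), 'rb')
        if out_wav is None:
            # Copy channel count, sample width and frame rate from the first segment
            out_wav = wave.open(out_buf, 'wb')
            out_wav.setparams(seg.getparams())
        out_wav.writeframes(seg.readframes(seg.getnframes()))
        seg.close()
    if out_wav is not None:
        out_wav.close()
    return out_buf.getvalue()

# combined = concat_wav_blobs([blob1, blob2])  # bytes ready to store as a new blob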
