Write a list directly to gcs file - python

Currently I have a simple Python script that writes the elements of a list to a file. How can I do the same, but write to a file in Google Cloud Storage?
Current code:
with open('/home/nitin/temp.txt', 'w') as f:
    for item in ["Nitin", "Agarwal"]:
        f.write(item + '\n')

I tend to find the blob's "upload_from_string" method preferable when dumping some in-memory data to GCS, instead of dumping to a local file and then uploading that file to GCS.
Would this work for you?
from google.cloud import storage
from google.cloud.storage import Blob

data = ["Nitin", "Agarwal"]
client = storage.Client(project="my-project")
bucket = client.get_bucket("my-bucket")
blob = Blob("data", bucket)
blob.upload_from_string("\n".join(data))
https://googleapis.dev/python/storage/latest/blobs.html#google.cloud.storage.blob.Blob.upload_from_string

Google Cloud Storage (GCS) is also known as blob storage, as opposed to file storage. Don't think of GCS as holding files; think of it as holding blobs of data. This means that you can't use file system APIs. Instead, build the complete blob of data that you want to write locally and then write it into GCS as a unit. Once written, data in GCS is immutable: you can't change it (you can only delete it and rewrite a new copy), which also means that you can't append to a blob.
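In practice, this means the usual pattern is read, modify in memory, and rewrite the whole object rather than append. A minimal sketch, assuming the google-cloud-storage client library (a recent version, for download_as_text) and hypothetical project, bucket, and object names:

from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.get_bucket("my-bucket")
blob = bucket.blob("data")

# "Appending" means downloading the whole object, extending it in memory,
# and uploading the complete new blob again.
current = blob.download_as_text()
blob.upload_from_string(current + "\nnew line")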
Here is a good example:
Using Cloud Storage with Python

Related

How to read and update a field in JSON file that's in Google cloud using Python script

I need to read and update a field in a JSON file that's stored in Google Cloud Storage, using a Python script.
I've tried using
with open(filepath, 'r')
where filepath was a gs:// path such as
gs://bucket-name/path
but I was getting a FileNotFoundError.
Is there any other way to read and update a field in a JSON file that's in Google Cloud Storage using Python?
If you want to read or write a file in Cloud Storage you will need to use the Google Cloud Storage client library. It provides several methods for interacting with objects in Cloud Storage, including methods for reading and updating them.
You can check the document Reading and writing to Cloud Storage for more information and code samples.
Also check this similar example:
How to Read .json file in python code from google cloud storage bucket
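As a rough sketch of that approach, using the google-cloud-storage client library; the bucket name, object path, and field name below are placeholders:

import json
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("bucket-name")      # placeholder bucket
blob = bucket.blob("path/config.json")         # placeholder object path

# Read the JSON object, update one field, and write the whole object back
data = json.loads(blob.download_as_text())
data["some_field"] = "new value"               # placeholder field
blob.upload_from_string(json.dumps(data), content_type="application/json")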

Apache Beam on Dataflow - Load external file

I am running an Apache Beam ETL job on Dataflow from a template I created. I want to add an additional step where I upload the processed data to BigQuery and create the table if needed, so I need to declare the table schema.
The schema is actually pretty large, so maybe the best option is to write it on a .json file and import it with a simple:
with open('filename.json', 'r') as f:
    schema = json.load(f)
However, how can I declare this external file on my template so that when the job gets deployed and run on the workers they can find the file? Can I set any option like that on my setup.py so that the file gets saved with the template?
Other alternatives I have thought of are:
Saving the file to a GCS bucket, reading it with the Cloud Storage client, and passing it to my functions as a side input
Writing the schema manually in my pipeline file, even if it gets messy
What would be the best way to proceed?
For big files I would store the schema in a GCS bucket (alternative 1). If that is not applicable, you can create a Flex Template for your pipeline and add the schema file to the container.
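A rough sketch of alternative 1, assuming the schema JSON has the {"fields": [...]} shape that WriteToBigQuery accepts as a dict, and that the project, bucket, object, and table names are placeholders; the schema is read once with the Cloud Storage client when the pipeline is constructed:

import json

import apache_beam as beam
from google.cloud import storage

# Download the schema JSON once, at pipeline-construction time
client = storage.Client()
blob = client.get_bucket("my-bucket").blob("schemas/table_schema.json")
schema = json.loads(blob.download_as_text())   # expected shape: {"fields": [...]}

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create([{"name": "Nitin"}])   # stand-in for the real ETL steps
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema=schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )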

How to upload HDF5 file directly to S3 bucket in Python

I want to upload an HDF5 file created with h5py to an S3 bucket without saving it locally, using boto3.
This solution uses pickle.dumps and pickle.loads, and the other solutions I have found store the file locally, which I would like to avoid.
You can use io.BytesIO() together with put_object, as illustrated here. Hope this helps. Even in this case, you'd have to 'store' the data locally (though 'in memory'). You could also create a tempfile.TemporaryFile and then upload it with put_object. I don't think you can stream to an S3 bucket in the sense that the local data would be discarded as it is uploaded to the bucket.
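A minimal sketch of the BytesIO approach, assuming h5py >= 2.9 (which can write to Python file-like objects) and a placeholder bucket name:

import io

import boto3
import h5py
import numpy as np

buffer = io.BytesIO()
# Build the HDF5 file entirely in memory
with h5py.File(buffer, "w") as f:
    f.create_dataset("data", data=np.arange(10))

# Upload the in-memory bytes to S3 without touching the local disk
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="data.h5", Body=buffer.getvalue())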

Storing multiple files with the same name in Google Cloud Storage?

So I am trying to port a Python webapp written with Flask to Google App Engine. The app hosts user-uploaded files up to 200 MB in size, and for non-image files the original name of the file needs to be retained. To prevent filename conflicts, e.g. two people uploading stuff.zip, each containing completely different and unrelated contents, the app creates a UUID-named folder on the filesystem, stores the file within it, and serves the files to users from there. Google Cloud Storage, which I was planning to use to store the user files, has according to its documentation "no notion of folders". What is the best way to get this same functionality with their system?
The current method, just for demonstration:
# generates a new folder with a shortened UUID name to save files
# other than images to avoid filename conflicts
else:
    # if there is a better way of doing this i'm not clever enough
    # to figure it out
    new_folder_name = shortuuid.uuid()[:9]
    os.mkdir(
        os.path.join(app.config['FILE_FOLDER'], new_folder_name))
    file.save(
        os.path.join(os.path.join(app.config['FILE_FOLDER'], new_folder_name), filename))
    new_folder_path = os.path.join(
        app.config['FILE_FOLDER'], new_folder_name)
    return url_for('uploaded_file', new_folder_name=new_folder_name)
From the Google Cloud Storage Client Library Overview documentation:
GCS and "subdirectories"
Google Cloud Storage documentation refers to "subdirectories" and the GCS client library allows you to supply subdirectory delimiters when you create an object. However, GCS does not actually store the objects into any real subdirectory. Instead, the subdirectories are simply part of the object filename. For example, if I have a bucket my_bucket and store the file somewhere/over/the/rainbow.mp3, the file rainbow.mp3 is not really stored in the subdirectory somewhere/over/the/. It is actually a file named somewhere/over/the/rainbow.mp3. Understanding this is important for using listbucket filtering.
While Cloud Storage does not support subdirectories per se, it allows you to use subdirectory delimiters inside filenames. This basically means that the path to your file will still look exactly as if it was inside a subdirectory, even though it is not. This apparently should concern you only when you're iterating over the entire contents of the bucket.
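Applied to the original question, the UUID "folder" simply becomes a prefix in the object name. A rough sketch, assuming the google-cloud-storage client library and a placeholder bucket name; file is Flask's file-like FileStorage object:

import shortuuid
from google.cloud import storage

def save_upload(file, filename):
    client = storage.Client()
    bucket = client.get_bucket("my-bucket")            # placeholder bucket
    # The "/" is just part of the object name, not a real directory
    object_name = shortuuid.uuid()[:9] + "/" + filename
    blob = bucket.blob(object_name)
    blob.upload_from_file(file)                        # FileStorage is file-like
    return object_name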
From the Request URIs documentation:
URIs for Standard Requests
For most operations you can use either of the following URLs to access objects:
storage.googleapis.com/<bucket>/<object>
<bucket>.storage.googleapis.com/<object>
This means that the public URL for their example would be http://storage.googleapis.com/my_bucket/somewhere/over/the/rainbow.mp3. Their service would interpret this as bucket=my_bucket and object=somewhere/over/the/rainbow.mp3 (i.e. no notion of subdirectories, just an object name with embedded slashes in it); the browser however will just see the path /my_bucket/somewhere/over/the/rainbow.mp3 and will interpret it as if the filename is rainbow.mp3.

Writing files to google app engine blobstore as the methods are going to be deprecated

I want to save some data fetched from the web to blobstore, but the google doc says that
Deprecated: The Files API feature used here to write files to Blobstore is going to be removed at some time in the future, in favor of writing files to Google Cloud Storage and using Blobstore to serve them.
The code in Python is as follows:
from __future__ import with_statement
from google.appengine.api import files
# Create the file
file_name = files.blobstore.create(mime_type='application/octet-stream')
# Open the file and write to it
with files.open(file_name, 'a') as f:
    f.write('data')
# Finalize the file. Do this before attempting to read it.
files.finalize(file_name)
# Get the file's blob key
blob_key = files.blobstore.get_blob_key(file_name)
I am wondering if there is another way to write to blobstore instead of the official upload method.
If you want to use a file-like API, you have to go with GCS.
Blobstore is for uploading more-or-less static images and serving them.
If you want to write using a file-like API and then serve from Blobstore, you can write to GCS and get a BlobKey for the file.
https://cloud.google.com/appengine/docs/python/blobstore/#Python_Using_the_Blobstore_API_with_Google_Cloud_Storage
But writing to Blobstore the way you want is deprecated. Stop trying to do it that way.
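A minimal sketch of that write-to-GCS-then-serve-via-Blobstore pattern on the legacy App Engine runtime, using the cloudstorage client library; the bucket and object names are placeholders:

import cloudstorage as gcs
from google.appengine.ext import blobstore

filename = '/my-bucket/my-object.txt'   # placeholder bucket/object

# Write the data to Cloud Storage through a file-like API
with gcs.open(filename, 'w', content_type='application/octet-stream') as f:
    f.write('data')

# Get a BlobKey so the object can be served with the Blobstore API
blob_key = blobstore.create_gs_key('/gs' + filename)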
An option may be to put the data in the datastore using a TextProperty