read hdf file from google cloud storage using pandas - python

Greetings, coders, Google Cloud developers and professionals.
I am trying to read a list of HDF files from Google Cloud Storage with the built-in pandas method pd.read_hdf(), where the file names look like "client1.h".
My problem is that I always get this error:
NotImplementedError: Support for generic buffers has not been implemented.
After deep searching in different forums and sites, I realized that many people have encountered the same problem, but no solution was provided.
The code I have used is below:
import pandas as pd
from google.cloud import storage

storage_client = storage.Client.from_service_account_json('file___.json')
bucket = storage_client.get_bucket('my_bucket_name')
blob = bucket.blob("data1.h")
df = pd.read_hdf(blob, mode='r+')  # this line raises the NotImplementedError
print(df)
I tried as well with the code below and got the same error:
import io

blob = bucket.blob("data1.h")
data = blob.download_as_string()  # also tried the as_bytes / as_text variants
df = pd.read_hdf(io.BytesIO(data), mode='r+')  # same NotImplementedError
When I download the file to my local environment and read it using its path, it works well with no problem. Unfortunately, I have a huge number of files in Cloud Storage, so I can't download all of them to work with.
If anyone has a solution or a suggestion, please share it.

The feature doesn't seem to be implemented yet.
As you mentioned, downloading the file to your local file system first will let you use read_hdf(). This is a working workaround.
For read_hdf() to work, one needs to pass a string for which os.path.exists(path_or_buf) returns True. You may want to help the pandas developers implement the feature. If that is the case, see the current implementation here.
The issue you are facing is already open in the issues section of the pandas GitHub repo; however, users there only mention that the problem happens with data in S3 (see here). You may want to share your problem in that issue or open a new one. To open a new issue, go here.
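In the meantime, here is a minimal sketch of that workaround, assuming the same credentials file, bucket, and blob names as in the question, that writing to a local temporary file is acceptable, and that the tables package needed by read_hdf() is installed:
import tempfile

import pandas as pd
from google.cloud import storage

# same credentials file and bucket/blob names as in the question
storage_client = storage.Client.from_service_account_json('file___.json')
bucket = storage_client.get_bucket('my_bucket_name')
blob = bucket.blob('data1.h')

# download the HDF file to a local temporary file, then read it by path
with tempfile.NamedTemporaryFile(suffix='.h') as tmp:
    blob.download_to_filename(tmp.name)
    df = pd.read_hdf(tmp.name, mode='r')
    print(df)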

Related

Data Bricks Community Edition Cannot Read Uploaded Table (Was Working Before)

So when I was trying to work with some ideas on Databricks Community Edition today, I suddenly found that pandas can no longer read an existing uploaded table. The directories were working before, but none of my previously uploaded tables can be read any more; they all return "File ... Not Exist". Does anyone know what we can do?
Command used:
import pandas as pd
df = pd.read_csv('/dbfs/FileStore/tables/iris.csv')
The file is definitely there, and this also affects all my other previously uploaded files.
Any suggestion would be appreciated.
How did you load the CSV? From a blob? If from a blob, you probably have not mounted it.
Can you try reading it with Spark? If you can't read it, the file is corrupt. If you can, try applying sdf.toPandas():
sdf = spark.read.csv('/FileStore/tables/iris.csv', header="true", inferSchema="true")
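If the Spark read works, a minimal sketch of the conversion back to pandas suggested above (assuming the same file path and an active Spark session, as on Databricks):
# read with Spark first; if this fails, the file itself is likely the problem
sdf = spark.read.csv('/FileStore/tables/iris.csv', header="true", inferSchema="true")

# convert the Spark DataFrame to a pandas DataFrame
df = sdf.toPandas()
print(df.head())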

Is there a way to unpickle data pickled by another program in GAE?

I'm setting up an App Engine app on Google Cloud Platform. I would like to get information contained in a particular file in a Cloud Storage bucket.
This particular file is a binary one with the extension ".gst", and it contains a Python dictionary. It is available at "/bucket_name/dir_name/sub_dir_name/binary_file.gst".
I have already tried some basic things like this:
with cloudstorage.open(file_path) as cloudstorage_file:
    contents = pickle.load(cloudstorage_file)
or this:
with cloudstorage.open(file_path) as cloudstorage_file:
    contents = cloudstorage_file.read()
It doesn't work, and I don't know the right way to do this, so any help will be appreciated. I hope I've made it clear enough to get help.
If I understood correctly, you're trying to download and unpickle a file you have stored in a Cloud Storage bucket.
You can download objects (no matter the file extension or type of file) using the Python Client Library.
After downloading it to a file, you can unpickle it like you normally would.
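A minimal sketch of that approach with the google-cloud-storage client library; the bucket name, object path, and local file path below are placeholders, and default application credentials are assumed:
import pickle
from google.cloud import storage

client = storage.Client()  # assumes default application credentials are configured
bucket = client.get_bucket('bucket_name')  # placeholder bucket name
blob = bucket.blob('dir_name/sub_dir_name/binary_file.gst')

# download the object to a local file, then unpickle it as usual
blob.download_to_filename('/tmp/binary_file.gst')
with open('/tmp/binary_file.gst', 'rb') as f:
    contents = pickle.load(f)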
Pickle can only deserialize objects if every module can be accessed from the same import path, so it is never a good idea to use pickle between different applications. If the file was generated from AppEngine, then it should work correctly.
Also, even with the same classes, if the producing application runs on Python 3, the default pickle protocol is not compatible with Python 2.
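If you control the producing application, one option is to pin the pickle protocol to one Python 2 understands when writing the file; a quick sketch with a hypothetical payload:
import pickle

data = {'example': 'dictionary'}  # hypothetical payload
with open('binary_file.gst', 'wb') as f:
    # protocol 2 is the highest pickle protocol Python 2 can read
    pickle.dump(data, f, protocol=2)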
Not sure about GAE, but you can use the Google Cloud Client Libraries:
# pip install google-cloud-storage
import pickle
from google.cloud import storage

# project_name and bucket_name are placeholders for your own values
blob_name = "dir_name/sub_dir_name/binary_file.gst"
blob = storage.Client(project=project_name).get_bucket(bucket_name).blob(blob_name)
contents = pickle.loads(blob.download_as_string())
Note that GCS doesn't actually have "folders" so everything that isn't the project_name or bucket_name is the blob_name in this example.

Dropbox Python API Upload multiple files

I'm trying to upload a set of pd.DataFrames as CSVs to a folder in Dropbox using the Dropbox Python SDK (v2). The files are not particularly big, but they are numerous. Using batches will help reduce the number of API calls and comply with the developer recommendations outlined in the documentation:
"The idea is to group concurrent file uploads into batches, where files
in each batch are uploaded in parallel via multiple API requests to
maximize throughput, but the whole batch is committed in a single,
asynchronous API call to allow Dropbox to coordinate the acquisition
and release of namespace locks for all files in the batch as
efficiently as possible."
Following several answers on SO (see the most relevant to my problem here) and this answer from the SDK maintainers in the Dropbox forum, I tried the following code:
commit_info = []
for df in list_pandas_df:
    df_raw_str = df.to_csv(index=False)
    upload_session = dbx.upload_session_start(df_raw_str.encode())
    commit_info.append(
        dbx.files.CommitInfo(path="/path/to/db/folder.csv")
    )
dbx.files_upload_finish_batch(commit_info)
Nonetheless, when reading the files_upload_finish_batch docstring, I noticed that the function only takes a list of CommitInfo objects as an argument (documentation), which is confusing, since the non-batch version (files_upload_session_finish) does take a CommitInfo object with a path plus a cursor object with data about the session.
I'm fairly lost in the documentation, and even the source code is not very helpful for understanding how batches work for uploading several files (as opposed to uploading single heavy files). What am I missing here?
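For reference, here is a hedged sketch of how the batch flow appears to be intended, based on my reading of the SDK: each file gets its own upload session and cursor, and the batch finish call takes UploadSessionFinishArg entries (a cursor plus a CommitInfo) rather than bare CommitInfo objects. The access token, DataFrame list, and target paths below are placeholders.
import dropbox
import pandas as pd
from dropbox.files import CommitInfo, UploadSessionCursor, UploadSessionFinishArg

dbx = dropbox.Dropbox('ACCESS_TOKEN')  # placeholder token
list_pandas_df = [pd.DataFrame({'a': [1, 2]})]  # placeholder for the question's DataFrames

finish_entries = []
for i, df in enumerate(list_pandas_df):
    data = df.to_csv(index=False).encode()
    # one upload session per file; close=True because each small file fits in one request
    session = dbx.files_upload_session_start(data, close=True)
    cursor = UploadSessionCursor(session_id=session.session_id, offset=len(data))
    commit = CommitInfo(path='/path/to/db/file_{}.csv'.format(i))  # placeholder target path
    finish_entries.append(UploadSessionFinishArg(cursor=cursor, commit=commit))

# commit the whole batch in one asynchronous call
dbx.files_upload_session_finish_batch(finish_entries)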

How to load a .pickle file on python from google cloud storage

My problem is the following:
I have two .p (pickle) files in my bucket in Google Cloud Storage, and I would like to load them in my Jupyter notebook (where I run my code on a VM instance).
I tried the following:
url = "http:\\<localhost>\tree\storage.googleapis.com\<bucket-name>"
data_x = pickle.load(open(url + "\<filename>.p", "rb" ))
but it said that there is no such file or directory. I have also tried creating a public link in Google Storage, with no success. I managed to load .mat files and .pts files that way, but I can't seem to be able to load the .p file.
There was a similar question posted here: pickling python objects to google cloud storage
but it is deprecated and I did not manage to use the answer.
This is also a resource that could be useful: https://pypi.python.org/pypi/google-cloud-storage
Thanks a lot for your help!
For Python >= 3.5 users:
One thing you can do is use the gcsfs library from the Dask creator, which works very similarly to the built-in open(), almost just replacing open with fs.open.
Example reading:
import gcsfs, pickle

fs = gcsfs.GCSFileSystem(project='my-google-project')
fs.ls('my-bucket')
>>> ['my-file.txt']

with fs.open('my-bucket/my-file.txt', 'rb') as file:
    print(pickle.load(file))
To write, it is similar: you just need to replace rb with wb, and load with dump.
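For completeness, a minimal writing sketch under the same assumptions (project, bucket, and file names are placeholders):
import gcsfs, pickle

fs = gcsfs.GCSFileSystem(project='my-google-project')
data = {'example': 'dictionary'}  # hypothetical object to store

# same pattern as reading: 'wb' instead of 'rb', pickle.dump instead of pickle.load
with fs.open('my-bucket/my-file.p', 'wb') as file:
    pickle.dump(data, file)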

Reading data from bucket in Google ml-engine (tensorflow)

I am having issues reading data from a bucket hosted by Google.
I have a bucket containing ~1000 files I need to access, held at (for example)
gs://my-bucket/data
Using gsutil from the command line or one of Google's Python API clients, I can access the data in the bucket; however, importing these APIs is not supported by default on google-cloud-ml-engine.
I need a way to access both the data and the names of the files, either with a default Python library (i.e. os) or using TensorFlow. I know TensorFlow has this functionality built in somewhere, but it has been hard for me to find.
Ideally I am looking for replacements for one command such as os.listdir() and another for open()
train_data = [read_training_data(filename) for filename in os.listdir('gs://my-bucket/data/')]
Where read_training_data uses a tensorflow reader object
Thanks for any help! (Also, P.S.: my data is binary.)
If you just want to read data into memory, then this answer has the details you need, namely, to use the file_io module.
That said, you might want to consider using built-in reading mechanisms for TensorFlow as they can be more performant.
Information on reading can be found here. The latest and greatest (but not yet part of official "core" TensorFlow) is the Dataset API (more info here).
Some things to keep in mind:
Are you using a format TensorFlow can read? Can it be converted to that format?
Is the overhead of "feeding" high enough to affect training performance?
Is the training set too big to fit in memory?
If the answer is yes to one or more of the questions, especially the latter two, consider using readers.
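For the in-memory route, a minimal sketch of the file_io-backed approach, assuming TensorFlow 1.x (where tf.gfile understands gs:// paths) and the bucket path from the question:
import tensorflow as tf

# tf.gfile understands gs:// paths, so it can stand in for os.listdir() and open()
filenames = tf.gfile.ListDirectory('gs://my-bucket/data/')

train_data = []
for name in filenames:
    with tf.gfile.Open('gs://my-bucket/data/' + name, 'rb') as f:
        train_data.append(f.read())  # raw bytes; parse with your own reader as needed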
For what it's worth, I also had problems reading files, in particular binary files, from Google Cloud Storage inside a Datalab notebook. The first way I managed to do it was by copying the files to my local filesystem using gsutil and reading them with TensorFlow normally. This is demonstrated below, after the file copy was done.
Here is my setup cell
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
Here is a cell for reading the file locally as a sanity check.
# this works for reading local file
audio_binary_local = tf.read_file("100852.mp3")
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_local, file_format='mp3',
                                           samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)
Here is reading the file from gs:// directly as a binary file.
# this works for remote files in gs:
gsfilename = 'gs://proj-getting-started/UrbanSound/data/air_conditioner/100852.mp3'
# python 2
# audio_binary_remote = tf.gfile.Open(gsfilename).read()
# python 3
audio_binary_remote = tf.gfile.Open(gsfilename, 'rb').read()
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_remote, file_format='mp3', samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)
