Streaming and caching tabular data with fsspec, Parquet and PyArrow

I'm trying to stream data from Parquet files stored in Dropbox (but it could be somewhere else: S3, Google Drive, etc.) and read it into pandas, while caching it. For that I'm trying to use fsspec for Python.
Following these instructions, this is what I'm trying right now:
from fsspec.implementations.arrow import ArrowFSWrapper
from fsspec.implementations.cached import CachingFileSystem
import pandas as pd
cfs = CachingFileSystem(target_protocol="http", cache_storage="cache_fs")
cfs_arrow = ArrowFSWrapper(cfs)
url = "https://www.dropbox.com/s/…./myfile.parquet?dl=0"
f = cfs_arrow.open(url, "rb")
df = pd.read_parquet(f)
but this raises the following error at cfs_arrow.open(url, "rb"):
AttributeError: type object 'HTTPFileSystem' has no attribute 'open_input_stream'
I've used fsspec's CachingFileSystem before to stream HDF5 data from S3, so I presumed it would work out of the box, but I'm probably doing something wrong.
Can someone help me with that? Or suggest another way to accomplish the goal of streaming my tabular data while keeping a cache for fast later access in the same session?

The convenient way to open and pass a file-like object using fsspec alone would be:
import fsspec
import pandas as pd

with fsspec.open(
    "blockcache::https://www.dropbox.com/s/…./myfile.parquet?dl=0",
    blockcache={"cache_storage": "cache_fs"},
) as f:
    df = pd.read_parquet(f)
Of course, instantiating your own filesystem instance is fine too. You may be interested to know that there is a Dropbox backend for fsspec too, useful for finding and manipulating files. Also, there is an fsspec.parquet module for optimising Parquet access when you need only some of the row groups or columns of the target.
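As a rough sketch of that fsspec.parquet route, assuming the question's placeholder URL and hypothetical column names (open_parquet_file fetches only the byte ranges needed for the requested columns):

from fsspec.parquet import open_parquet_file
import pandas as pd

with open_parquet_file(
    "https://www.dropbox.com/s/…./myfile.parquet?dl=0",  # placeholder URL from the question
    columns=["col_a", "col_b"],  # hypothetical column names
) as f:
    df = pd.read_parquet(f, columns=["col_a", "col_b"])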

My understanding is that there are 2 types of file systems:
fsspec
arrow
You need an arrow file system if you are going to call pyarrow functions directly. If you have an fsspec file system (e.g. CachingFileSystem) and want to use pyarrow, you need to wrap your fsspec file system using this:
from pyarrow.fs import PyFileSystem, FSSpecHandler
pa_fs = PyFileSystem(FSSpecHandler(fs))
ArrowFSWrapper is for going the other way around (from a pyarrow file system to an fsspec file system).
But in your case, since you are ultimately passing a file object (not a file system) to pd.read_parquet, you can use your fsspec file system (i.e. CachingFileSystem) directly.
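For example, a minimal sketch of that direct route, reusing the CachingFileSystem and placeholder URL from the question:

import pandas as pd
from fsspec.implementations.cached import CachingFileSystem

# HTTP-backed filesystem with an on-disk cache in ./cache_fs
cfs = CachingFileSystem(target_protocol="http", cache_storage="cache_fs")
url = "https://www.dropbox.com/s/…./myfile.parquet?dl=0"  # placeholder from the question
with cfs.open(url, "rb") as f:  # plain fsspec file-like object, no arrow wrapper needed
    df = pd.read_parquet(f)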

Related

Reading a pickle file in a cloud Jupyter instance from a GCP stream (SList)

I am working with some large data in Google Cloud Platform storage, using a Jupyterlab notebook in GCP Vertex AI Workbench in order to avoid local storage and data transfer.
Some of my problems are solved by using gcloud pipes to run some useful operations in the style of Linux command lines. For example:
s_path_final = 'gs://bucket_name/filename.txt'
s_pattern = 'search_target_text'
!gsutil cp {s_path_final} - | egrep -m 1 '{s_pattern}'
finds the first occurrence of the search text in the text file as desired.
What isn't working is reading a Python pickle file streamed from the GCP bucket. For example,
import io
import pandas as pd
s_stream_out = !gsutil cp {GS_path_to_pickle} -
df = pd.read_pickle(io.StringIO(s_stream_out.n))
errors with the message a bytes-like object is required, not 'str'.
s_stream_out seems to be an object of type SList (cf. https://gist.github.com/parente/b6ee0efe141822dfa18b6feeda0a45e5) that I don't know what to do with. Is there a way to reassemble it appropriately? Simple-minded solutions like running a string join on it didn't help.
I don't really understand pickle, I'm afraid, but I gather it's a sort of serialized format for saving Python objects. So in the best case, a solution to all this would allow some kind of looping through its serial structure and pulling the items one by one directly back into Python memory, without trying to save or re-create the whole pickle file locally or in memory.
I suspect that you're going to need to use a Google Client Library directly.
Here's a Python code sample to stream a download to a file or stream that should meet your needs.
I'm unfamiliar with Jupyter/IPython, but I suspect that its string lists (SList) are only suitable for non-binary data. This is supported by the error message you're receiving too.
I think you could pickle.load the file_obj that's created in the sample.
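A minimal sketch of that suggestion, assuming the google-cloud-storage client library is available in the notebook and using hypothetical bucket/blob names; pandas can then unpickle straight from the in-memory stream:

import io
import pandas as pd
from google.cloud import storage

client = storage.Client()  # uses the notebook's default GCP credentials
blob = client.bucket("bucket_name").blob("path/to/frame.pkl")  # hypothetical names
buf = io.BytesIO()
blob.download_to_file(buf)  # streams raw bytes, unlike the SList of decoded text lines
buf.seek(0)
df = pd.read_pickle(buf)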

read hdf file from google cloud storage using pandas

Greetings, coders and Google Cloud developers and professionals.
I am trying to read a list of HDF files from Google Cloud Storage with the built-in method pd.read_hdf() provided by pandas, where a file name looks like "client1.h".
My problem is that I always get this error:
NotImplementedError: Support for generic buffers has not been implemented.
After searching through different forums and sites, I realized that many have encountered the same problem, but no solution was provided.
The code I have used is below:
from google.cloud.storage import blob, bucket
import pandas as pd
from google.cloud import storage
storage_client = storage.Client.from_service_account_json('file___.json')
bucket = storage_client.get_bucket('my_bucket_name')
blob = bucket.blob("data1.h")
df = pd.read_hdf(blob, mode='r+')
print(df)
I tried the code below as well and got the same error:
import io
blob = bucket.blob("data1.h")
data = blob.download_as_string()  # or download_as_bytes / download_as_text
df = pd.read_hdf(io.BytesIO(data), mode='r+')
When I download the file to my local environment and read it using its path, it works well and there is no problem. But unfortunately I have a huge number of files in Cloud Storage, so I can't download all of them to work with.
Please, if anyone has a solution or a suggestion, I ask them to share it.
The feature doesn't seem to be implemented yet.
As you mentioned, downloading the file to your local file system first will let you use read_hdf(). This is a working workaround.
For read_hdf() to work, one needs to pass a string for which os.path.exists(path_or_buf) returns True. You may want to help the pandas developers implement the feature. If that is the case, see the current implementation here.
The issue you are running into is already open in the issues section of the pandas GitHub repo; however, users there only mention that the problem happens with data in S3 (see here). You may want to share your problem in that issue or open a new one. To open a new issue, go here.
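A sketch of that workaround in code: download the blob to a temporary local path, which satisfies the os.path.exists() check, and delete it after reading (the credentials file, bucket and blob names are the ones from the question):

import os
import tempfile
import pandas as pd
from google.cloud import storage

storage_client = storage.Client.from_service_account_json('file___.json')
bucket = storage_client.get_bucket('my_bucket_name')
blob = bucket.blob('data1.h')

tmp = tempfile.NamedTemporaryFile(suffix='.h', delete=False)
tmp.close()  # only the path is needed
blob.download_to_filename(tmp.name)  # copy the object to the local path
try:
    df = pd.read_hdf(tmp.name, mode='r')
finally:
    os.remove(tmp.name)  # clean up the temporary copy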

Is there a way to unpickle data pickled by another program in GAE?

I'm setting up an App Engine app on Google Cloud Platform. I would like to get information contained in a particular file in a Cloud Storage bucket.
This particular file is binary, has a ".gst" extension, and contains a Python dictionary. It is available at "/bucket_name/dir_name/sub_dir_name/binary_file.gst".
I have already tried some basic things like this:
with cloudstorage.open(file_path) as cloudstorage_file:
    contents = pickle.load(cloudstorage_file)
or this:
with cloudstorage.open(file_path) as cloudstorage_file:
    contents = cloudstorage_file.read()
It doesn't work, and I don't know the right way to do this, so any help will be appreciated. I hope I've made it clear enough to get help.
If I understood correctly, you're trying to download and unpickle a file you have stored in a Cloud Storage bucket.
You can download objects (no matter what file extension or type of file) using the Python Client Library.
After downloading it to a file, you could unpickle it like you normally would.
Pickle can only deserialize objects if every module can be accessed from the same import path, so it is never a good idea to use pickle between different applications. If the file was generated from AppEngine, then it should work correctly.
Also, even with the same classes, if the other application was running on Python 3, the default pickle protocol is not compatible with Python 2.
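If the producing application is under your control, one hedged workaround is to have it pin a protocol that Python 2 can still read, for example:

import pickle

data = {"example": 1}  # hypothetical dictionary to serialize
with open("binary_file.gst", "wb") as f:
    pickle.dump(data, f, protocol=2)  # protocol 2 is the highest protocol Python 2 understands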
Not sure about GAE specifically, but you can use the Google Cloud Client Libraries:
#pip install google-cloud-storage
import pickle
from google.cloud import storage
blob_name = "dir_name/sub_dir_name/binary_file.gst"
blob = storage.Client(project=project_name).get_bucket(bucket_name).blob(blob_name)
contents = pickle.loads(blob.download_as_string())
Note that GCS doesn't actually have "folders" so everything that isn't the project_name or bucket_name is the blob_name in this example.
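To illustrate that point, here is a hedged sketch of listing objects under such a "folder", which is really just a name prefix (project_name and bucket_name are the same placeholders as in the snippet above):

from google.cloud import storage
client = storage.Client(project=project_name)
for b in client.list_blobs(bucket_name, prefix="dir_name/sub_dir_name/"):
    print(b.name)  # full blob names, e.g. dir_name/sub_dir_name/binary_file.gst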

Reading data from bucket in Google ml-engine (tensorflow)

I am having issues reading data from a bucket hosted by Google.
I have a bucket containing ~1000 files I need to access, held at (for example)
gs://my-bucket/data
Using gsutil from the command line or one of Google's other Python API clients, I can access the data in the bucket; however, importing these APIs is not supported by default on google-cloud-ml-engine.
I need a way to access both the data and the names of the files, either with a default Python library (i.e. os) or using TensorFlow. I know TensorFlow has this functionality built in somewhere; it has just been hard for me to find.
Ideally I am looking for a replacement for one command such as os.listdir() and another for open():
train_data = [read_training_data(filename) for filename in os.listdir('gs://my-bucket/data/')]
where read_training_data uses a TensorFlow reader object.
Thanks for any help! (Also, P.S.: my data is binary.)
If you just want to read data into memory, then this answer has the details you need, namely, to use the file_io module.
That said, you might want to consider using built-in reading mechanisms for TensorFlow as they can be more performant.
Information on reading can be found here. The latest and greatest (but not yet part of official "core" TensorFlow) is the Dataset API (more info here).
Some things to keep in mind:
Are you using a format TensorFlow can read? Can it be converted to that format?
Is the overhead of "feeding" high enough to affect training performance?
Is the training set too big to fit in memory?
If the answer is yes to one or more of the questions, especially the latter two, consider using readers.
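For the os.listdir() / open() replacements the question asks about, here is a minimal sketch using tf.gfile (which sits on top of the file_io module mentioned above); it assumes a TF 1.x environment where tf.gfile is available and that reading the raw bytes is enough:

import tensorflow as tf

# tf.gfile understands gs:// paths, so these mirror os.listdir() and open()
filenames = tf.gfile.ListDirectory('gs://my-bucket/data/')
train_data = []
for name in filenames:
    with tf.gfile.Open('gs://my-bucket/data/' + name, 'rb') as f:
        train_data.append(f.read())  # raw bytes; parse with your reader as needed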
For what it's worth, I also had problems reading files, in particular binary files, from Google Cloud Storage inside a Datalab notebook. The first way I managed to do it was by copying files using gsutil to my local filesystem and using TensorFlow to read the files normally. This is demonstrated here, after the file copy was done.
Here is my setup cell:
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
Here is a cell for reading the file locally as a sanity check.
# this works for reading local file
audio_binary_local = tf.read_file("100852.mp3")
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_local, file_format='mp3',
                                           samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)
Here is reading the file from gs: directly as a binary file.
# this works for remote files in gs:
gsfilename = 'gs://proj-getting-started/UrbanSound/data/air_conditioner/100852.mp3'
# python 2
#audio_binary_remote = tf.gfile.Open(gsfilename).read()
# python 3
audio_binary_remote = tf.gfile.Open(gsfilename, 'rb').read()
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_remote, file_format='mp3',
                                           samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)

Access HDF files stored on s3 in pandas

I'm storing pandas DataFrames dumped in HDF format on S3. I'm pretty much stuck, as I can't pass a file pointer, a URL, an S3 URL or a StringIO object to read_hdf. If I understand correctly, the file must be present on the filesystem.
Source: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L315
It looks like it's implemented for CSV but not for HDF. Is there any better way to open those HDF files than copying them to the filesystem?
For the record, these HDF files are being handled on a web server, that's why I don't want a local copy.
If I need to stick with a local file: is there any way to emulate that file on the filesystem (with a real path) that can be destroyed after the reading is done?
I'm using Python 2.7 with Django 1.9 and pandas 0.18.1.
Newer versions of pandas allow reading an HDF5 file directly from S3, as mentioned in the read_hdf documentation. Perhaps you should upgrade pandas if you can. This of course assumes you've set the right access rights to read those files: either with a credentials file or with public ACLs.
Regarding your last comment, I am not sure why storing several HDF5 files per DataFrame would necessarily be contra-indicated for the use of HDF5. Pickle should be much slower than HDF5, though joblib.dump might partially improve on this.
