Reading data from bucket in Google ml-engine (tensorflow) - python

I am having issues reading data from a bucket hosted by Google.
I have a bucket containing ~1000 files I need to access, held at (for example)
gs://my-bucket/data
Using gsutil from the command line, or one of Google's Python API clients, I can access the data in the bucket; however, importing these APIs is not supported by default on google-cloud-ml-engine.
I need a way to access both the data and the names of the files, either with a default Python library (e.g. os) or using TensorFlow. I know TensorFlow has this functionality built in somewhere, but it has been hard for me to find.
Ideally I am looking for replacements for two commands: one for os.listdir() and another for open(), as in
train_data = [read_training_data(filename) for filename in os.listdir('gs://my-bucket/data/')]
Where read_training_data uses a tensorflow reader object
Thanks for any help! (P.S. my data is binary.)

If you just want to read data into memory, then this answer has the details you need, namely, to use the file_io module.
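For example, a minimal sketch of that approach, using the question's example bucket path and the file_io module as gs://-aware stand-ins for os.listdir() and open():
from tensorflow.python.lib.io import file_io

data_dir = 'gs://my-bucket/data/'
# list_directory is the os.listdir() equivalent that understands gs:// paths
filenames = file_io.list_directory(data_dir)

train_data = []
for name in filenames:
    # FileIO behaves like open() but also works on gs:// paths
    with file_io.FileIO(data_dir + name, mode='rb') as f:
        train_data.append(f.read())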
That said, you might want to consider using built-in reading mechanisms for TensorFlow as they can be more performant.
Information on reading can be found here. The latest and greatest (but not yet part of official "core" TensorFlow) is the Dataset API (more info here).
Some things to keep in mind:
Are you using a format TensorFlow can read? Can it be converted to that format?
Is the overhead of "feeding" high enough to affect training performance?
Is the training set too big to fit in memory?
If the answer is yes to one or more of the questions, especially the latter two, consider using readers.
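As a rough sketch of the reader/Dataset route (TF 1.x style; in older releases the Dataset API lives under tf.contrib.data, and the parse step below is just a placeholder for your binary format):
import tensorflow as tf

# Glob understands gs:// paths, so we can build a Dataset over the files
filenames = tf.gfile.Glob('gs://my-bucket/data/*')

def parse_record(raw_bytes):
    # placeholder: decode one file's raw bytes into whatever your format needs
    return tf.decode_raw(raw_bytes, tf.uint8)

dataset = (tf.data.Dataset.from_tensor_slices(filenames)
           .map(tf.read_file)      # read each gs:// file's contents
           .map(parse_record))

iterator = dataset.make_one_shot_iterator()
next_example = iterator.get_next()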

For what it's worth, I also had problems reading files, in particular binary files, from Google Cloud Storage inside a Datalab notebook. The first way I managed to do it was by copying the files to my local filesystem with gsutil and then using TensorFlow to read them normally. The reading part is demonstrated below, after the file copy was done.
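The copy step itself was just a gsutil call from the notebook, roughly like this (using the same example file as further down):
!gsutil cp gs://proj-getting-started/UrbanSound/data/air_conditioner/100852.mp3 .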
Here is my setup cell
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
Here is a cell for reading the file locally as a sanity check.
# this works for reading local file
audio_binary_local = tf.read_file("100852.mp3")
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_local, file_format='mp3',
                                           samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)
Here is how to read the file from gs:// directly as a binary file.
# this works for remote files in gs:
gsfilename = 'gs://proj-getting-started/UrbanSound/data/air_conditioner/100852.mp3'
# python 2
#audio_binary_remote = tf.gfile.Open(gsfilename).read()
# python 3
audio_binary_remote = tf.gfile.Open(gsfilename, 'rb').read()
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_remote, file_format='mp3', samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)

Related

Reading a pickle file in a cloud Jupyter instance from a GCP stream (SList)

I am working with some large data in Google Cloud Platform storage, using a Jupyterlab notebook in GCP Vertex AI Workbench in order to avoid local storage and data transfer.
Some of my problems are solved by piping gsutil output into Linux-style command-line tools. For example:
s_path_final = 'gs://bucket_name/filename.txt'
s_pattern = 'search_target_text'
!gsutil cp {s_path_final} - | egrep -m 1 '{s_pattern}'
finds the first occurrence of the search text in the text file as desired.
What isn't working is reading a Python pickle file streaming from the GCP bucket. For example,
import io
s_stream_out = !gsutil cp {GS_path_to_pickle} -
df = pd.read_pickle(io.StringIO(s_stream_out.n))
errors with the message a bytes-like object is required, not 'str'.
s_stream_out seems to be an object of type SList (cf. https://gist.github.com/parente/b6ee0efe141822dfa18b6feeda0a45e5) that I don't know what to do with. Is there a way to reassemble it appropriately? Simple-minded solutions like running a string join on it didn't help.
I don't really understand pickle, I'm afraid, but I gather it's a serialized format for saving Python objects. In the best case, a solution would allow some kind of looping through its serial structure and pulling the items one by one directly back into Python memory, without saving or re-creating the whole pickle file locally or in memory.
I suspect that you're going to need to use a Google Client Library directly.
Here's a Python code sample to stream a download to a file|stream that should meet your needs.
I'm unfamiliar with Jupyter/IPython, but I suspect that its string lists (SList) are only suitable for non-binary data. The error message you're receiving supports this too.
I think you could pickle.load the file_obj that's created in the sample.
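A sketch of that idea, assuming the google-cloud-storage client library is available; the bucket and object names below are placeholders:
import io
import pickle
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('bucket_name')
blob = bucket.blob('filename.pkl')

# stream the object into an in-memory buffer instead of a local file
file_obj = io.BytesIO()
blob.download_to_file(file_obj)
file_obj.seek(0)

obj = pickle.load(file_obj)   # or: df = pd.read_pickle(file_obj)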

read hdf file from google cloud storage using pandas

Greetings, coders and Google Cloud developers and professionals.
I am trying to read a list of HDF files from Google Cloud Storage with pandas' built-in method pd.read_hdf(), where the file names look like "client1.h".
My problem is that I always get this error:
NotImplementedError: Support for generic buffers has not been implemented.
After deep searching in different forums and sites, I realized that many people have encountered the same problem, but no solution is provided.
The code I have used is below:
from google.cloud.storage import blob, bucket
import pandas as pd
from google.cloud import storage
storage_client = storage.Client.from_service_account_json('file___.json')
bucket = storage_client.get_bucket('my_bucket_name')
blob = bucket.blob("data1.h")
df = pd.read_hdf(blob, mode='r+')
print(df)
I tried the code below as well and got the same error:
blob = bucket.blob("data1.h")
data = download_as_string() #as_bytes as_text
df = pd.read_hdf(io.BytesIO(data), mode='r+')
When I download the file to my local environment and read it using its path, it works well and there is no problem. Unfortunately, in Cloud Storage I have a huge number of files, so I can't download all of them to work with.
Please, if anyone has a solution or a suggestion, I ask them to share it.
The feature doesn't seem to be implemented yet.
As you mentioned, downloading the file to your local file system first will let you use read_hdf(). This is a working workaround.
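A sketch of that workaround, reusing the names from your snippet and a temporary file so nothing permanent is stored locally:
import tempfile
import pandas as pd
from google.cloud import storage

storage_client = storage.Client.from_service_account_json('file___.json')
bucket = storage_client.get_bucket('my_bucket_name')
blob = bucket.blob('data1.h')

# read_hdf needs a real path, so download to a temporary local file first
with tempfile.NamedTemporaryFile(suffix='.h') as tmp:
    blob.download_to_filename(tmp.name)
    df = pd.read_hdf(tmp.name, mode='r')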
For read_hdf() to work, one needs to pass a path string for which os.path.exists(path_or_buf) returns True. You may want to help the pandas developers implement the feature. If that is the case, see the current implementation here.
The issue you are running into is already open in the issues section of the pandas GitHub repo; however, users there only mention that the problem happens with data in S3 (see here). You may want to share your problem in that issue or open a new one. To open a new issue, go here.

How to pass data generated by a Databricks notebook to a Python step?

I am building an Azure Data Factory v2 pipeline, which comprises:
A Databricks step to query large tables from Azure Blob storage and generate a tabular intermediate result, intermediate_table;
A Python step (which does several things and would be cumbersome to put in a single notebook) to read that intermediate result and generate the final output.
The notebook generates a pyspark.sql.dataframe.DataFrame which I tried to save into parquet format with attempts like
processed_table.write.format("parquet").saveAsTable("intermediate_table", mode='overwrite')
or
processed_table.write.parquet("intermediate_table", mode='overwrite')
Now, I would like the Python step to re-read the intermediate result, ideally with a postprocess.py file with a syntax like
import pandas as pd
intermediate = pd.read_parquet("intermediate_table")
after having installed fastparquet inside my Databricks cluster.
This is (not surprisingly...) failing with errors like
FileNotFoundError: [Errno 2] No such file or directory:
'./my_processed_table'
I assume the file is not found because the Python file is not accessing the data in the right context/path.
How should I amend the code above, and what would be the best/canonical ways to pass data across such steps in a pipeline? (any other advice on common/best practices to do this are welcome)
One way to run the pipeline successfully is to have in the Databricks notebook a cell like
%python
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
import pandas as pd
processed_table.toPandas().to_parquet("/dbfs/intermediate", engine="fastparquet", compression = None)
and then have in postprocess.py
import pandas as pd
intermediate = pd.read_parquet("/dbfs/intermediate")
I'm not sure if that's good practice (it works, though).
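An alternative sketch, if the Python step can see the same DBFS mount, is to skip the pandas round-trip and let Spark write the intermediate result itself (the path and the pyarrow engine are assumptions on my part):
# in the Databricks notebook
processed_table.write.mode("overwrite").parquet("dbfs:/intermediate")

# in postprocess.py, read the directory of part files through the /dbfs mount
import pandas as pd
intermediate = pd.read_parquet("/dbfs/intermediate", engine="pyarrow")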

Get a massive csv file from GCS to BQ

I have a very large CSV file (let's say 1 TB) that I need to get from GCS into BQ. While BQ does have a CSV loader, the CSV files that I have are pretty non-standard and don't end up loading properly into BQ without reformatting.
Normally I would download the CSV file onto a server to 'process it' and save it either directly to BQ or to an Avro file that can be ingested easily by BQ. However, the file(s) are quite large, and it's quite possible (and probable) that I wouldn't have the storage/memory to do the batch processing without writing a lot of code to optimize/stream it.
Is this a good use case for Cloud Dataflow? Are there any tutorials or ways to go about getting a file of format "X" from GCS into BQ? Any tutorial pointers or example scripts to do so would be great.
I'd personally use Dataflow (not Dataprep) and write a simple pipeline to read the file in parallel, clean/transform it, and finally write it to BigQuery. It's pretty straightforward. Here's an example of one in my GitHub repo. Although it's in Java, you could easily port it to Python. Note: it uses the "templates" feature in Dataflow, but this can be changed with one line of code.
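For a feel of the Python side, a stripped-down Beam pipeline could look roughly like this (bucket, table, delimiter and schema are placeholders, not part of the original example):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # hypothetical cleanup: split on the file's non-standard delimiter and
    # map the pieces onto the destination schema
    parts = line.split('|')
    return {'id': parts[0], 'name': parts[1]}

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/big-file.csv')
     | 'Clean' >> beam.Map(parse_line)
     | 'Write' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_table',
           schema='id:STRING,name:STRING',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))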
If Dataflow is off the table, another option could be to use a weird/unused delimiter and read the entire row into BigQuery. Then use SQL/Regex/UDFs to clean/transform/parse it. See here (suggestion from Felipe). We've done this lots of times in the past, and because you're in BigQuery it scales really well.
I would consider using Cloud Dataprep.
Dataprep can import data from GCS, clean / modify the data and export to BigQuery. One of the features that I like is that everything can be done visually / interactively so that I can see how the data transforms.
Start with a subset of your data to see what transformations are required and to give yourself some practice before loading and processing a TB of data.
You can always transfer from a storage bucket directly into a BQ table:
bq --location=US load --[no]replace --source_format=CSV dataset.table gs://bucket/file.csv [schema]
Here, [schema] can be an inline schema of your csv file (like id:int,name:string,..) or a path to a JSON schema file (available locally).
As per the BQ documentation, it tries to parallelize large CSV loads into tables. Of course, there is an upper bound involved: the maximum size of an uncompressed CSV file to be loaded from GCS into BQ must be <= 5 TB, which is well above your requirements. I think you should be good with this.

Image dataframe from HDFS for Image Classification

I'm trying to write an image classification algorithm using Python and Spark. I'm following this tutorial, which is taken from the official Databricks documentation and works perfectly when running locally.
My problem now, moving the algorithm to a cluster, is that I have to load my images in .jpg format from two folders on HDFS, and I can't find a way to create a dataframe the way it's done locally in the examples.
I'm looking for a substitute for this code:
from sparkdl import readImages
jobs_df = readImages(img_dir + "/jobs").withColumn("label", lit(1))
It should be pretty much the same as reading the files locally.
Below is the implementation from the library. It internally uses the binaryFiles API to load binary files. The API documentation (binaryFiles) says it supports the Hadoop filesystem too.
rdd = sc.binaryFiles(path, minPartitions=numPartitions).repartition(numPartitions)
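In other words, the question's snippet should work unchanged with an HDFS path, for example (the path is a placeholder):
from pyspark.sql.functions import lit
from sparkdl import readImages

img_dir = "hdfs:///user/me/images"
jobs_df = readImages(img_dir + "/jobs").withColumn("label", lit(1))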
Hope this helps.
