Image dataframe from HDFS for Image Classification - python

I'm trying to write an image classification algorithm using Python and Spark. I'm following this tutorial, which is taken from the official Databricks documentation and works perfectly when running locally.
My problem now, moving the algorithm to a cluster, is that I have to load my images from two folders on HDFS in .jpg format, and I can't find a way to create a dataframe the way it's done locally in the examples.
I'm looking for a substitute for this code:
from sparkdl import readImages
jobs_df = readImages(img_dir + "/jobs").withColumn("label", lit(1))

It should be pretty much the same as reading the files locally.
Below is the implementation from the library. It internally uses the binaryFiles API to load binary files, and the API documentation (binaryFiles) says it supports Hadoop filesystems too.
rdd = sc.binaryFiles(path, minPartitions=numPartitions).repartition(numPartitions)
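In practice this means readImages can be pointed straight at an HDFS URI. A minimal sketch, assuming a hypothetical namenode address and folder layout (adjust both to your cluster):
from pyspark.sql.functions import lit
from sparkdl import readImages

# Hypothetical HDFS location; the label mirrors the local example
img_dir = "hdfs://namenode:8020/user/me/images"
jobs_df = readImages(img_dir + "/jobs").withColumn("label", lit(1))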
Hope this helps.

Related

using python to download sentinel imagery directly

I'm trying to download Sentinel satellite images directly using Python.
The idea is to use the sentinelsat API and a GeoJSON polygon to download them.
However, it downloads the entire image and not only the polygon.
Is there a way to make it download only the polygon, or to automatically crop the wanted area?
Thank you in advance.
There are a few ways you can go about it, but, based on the documentation, sentinelsat doesn't support such an operation. The easiest would be to use GDAL and a GeoJSON or shapefile; how to do that is answered here.
The much more complicated way, which also gives you immensely more control over what data you can download, is with the Sentinel-2 AWS S3 buckets and the GDAL Python API. Specifically, GDAL has a virtual filesystem driver for S3 (/vsis3/) that allows you to open a raster without downloading it locally. Then you can use the ReadAsArray function to load specific parts of the image. You can look these bits up in the GDAL docs.
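A minimal sketch of that approach, assuming the public sentinel-s2-l1c bucket (which is requester-pays) and an illustrative tile key; the window offsets and sizes are placeholders:
from osgeo import gdal

# The Sentinel-2 bucket is requester-pays, so tell GDAL to accept the charges;
# AWS credentials still need to be configured in the environment
gdal.SetConfigOption("AWS_REQUEST_PAYER", "requester")

# Open the raster through GDAL's /vsis3/ virtual filesystem (no local download)
ds = gdal.Open("/vsis3/sentinel-s2-l1c/tiles/31/U/FT/2021/5/8/0/B04.jp2")
band = ds.GetRasterBand(1)

# Read only a 512x512 pixel window instead of the whole image
window = band.ReadAsArray(xoff=1000, yoff=1000, win_xsize=512, win_ysize=512)
print(window.shape)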

read hdf file from google cloud storage using pandas

Greetings, coders and Google Cloud developers and professionals.
I am trying to read a list of HDF files from Google Cloud Storage with pandas' built-in method pd.read_hdf(), where a file name looks like "client1.h".
My problem is that I always get this error:
NotImplementedError: Support for generic buffers has not been implemented.
After deep searching in different forums and sites, I realized that many have encountered the same problem, but there is no solution provided.
The code I have used is below:
import pandas as pd
from google.cloud import storage
storage_client = storage.Client.from_service_account_json('file___.json')
bucket = storage_client.get_bucket('my_bucket_name')
blob = bucket.blob("data1.h")
df = pd.read_hdf(blob, mode='r+')
print(df)
I tried as well with the code below and I got the same error:
import io

blob = bucket.blob("data1.h")
data = blob.download_as_string()  # also tried download_as_bytes / download_as_text
df = pd.read_hdf(io.BytesIO(data), mode='r+')
When I download the file to my local environment and read it using its path, it works well and there is no problem. Unfortunately, I have a huge number of files in Cloud Storage, so I can't download all of them to work with.
Please, if anyone has a solution or a suggestion, share it.
The feature doesn't seem to be implemented yet.
As you mentioned, downloading the file to your local file system first will let you use read_hdf(). This is a working workaround.
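A minimal sketch of that workaround, downloading each blob to a temporary local file with download_to_filename() and handing the path to pd.read_hdf() (bucket and file names are the placeholders from the question):
import os
import tempfile

import pandas as pd
from google.cloud import storage

storage_client = storage.Client.from_service_account_json('file___.json')
bucket = storage_client.get_bucket('my_bucket_name')

# Download to a temp file, read it, then delete it to keep local disk usage small
tmp = tempfile.NamedTemporaryFile(suffix='.h', delete=False)
tmp.close()
bucket.blob('data1.h').download_to_filename(tmp.name)
df = pd.read_hdf(tmp.name, mode='r')
os.remove(tmp.name)
print(df)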
For read_hdf() to work, one needs to pass a string path for which os.path.exists(path_or_buf) returns True. You may want to help the pandas developers implement the feature; if that is the case, see the current implementation here.
The issue you are running into is already open in the issues section of the pandas GitHub repo, although users there only mention that the problem happens with data in S3 (see here). You may want to share your problem in that issue or open a new one. To open a new issue, go here.

Static dataflow graph generator for Python?

I've been struggling for quite some time to find a static dataflow graph generator for Python.
This is my ideal:
Given a small python script example.py, (written in Python3), return some representation of the data flow graph.
I was able to achieve this result using IBM's pyflowgraph (https://github.com/IBM/pyflowgraph), which outputs data in graph.ml format; unfortunately, this package only performs dynamic analysis.
I'm wondering if anyone knows of a DFG tool that could do this type of static dataflow analysis for Python?
I just found this open source project focused on dataflow analysis for Python. Check it out!
https://github.com/SMAT-Lab/Scalpel/
It's written in Python too; I haven't used it, but it looks very interesting!
This is the pre-print of their paper:
https://arxiv.org/pdf/2202.11840.pdf

Get a massive csv file from GCS to BQ

I have a very large CSV file (let's say 1 TB) that I need to get from GCS into BQ. While BQ does have a CSV loader, the CSV files that I have are pretty non-standard and don't load properly into BQ without formatting them first.
Normally I would download the CSV file onto a server to 'process it' and save it either directly to BQ or to an Avro file that can be ingested easily by BQ. However, the files are quite large and it's quite possible (and probable) that I wouldn't have the storage/memory to do the batch processing without writing a lot of code to optimize/stream it.
Is this a good use case for Cloud Dataflow? Are there any tutorials or ways to go about getting a file of format "X" from GCS into BQ? Any tutorial pointers or example scripts to do so would be great.
I'd personally use Dataflow (not Dataprep) and write a simple pipeline to read the file in parallel, clean/transform it, and finally write it to BigQuery. It's pretty straightforward. Here's an example of one in my GitHub repo. Although it's in Java, you could easily port it to Python. Note: it uses the "templates" feature in Dataflow, but this can be changed with one line of code.
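A rough Python sketch of such a pipeline with the Apache Beam SDK (bucket, project, table, schema and the parse_line() logic are all illustrative placeholders):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # Clean/transform one raw CSV row into a dict that matches the BQ schema
    fields = line.split(',')
    return {'id': int(fields[0]), 'name': fields[1].strip()}

options = PipelineOptions(
    runner='DataflowRunner', project='my-project',
    temp_location='gs://my-bucket/tmp', region='us-central1')

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/file.csv')
     | 'Parse' >> beam.Map(parse_line)
     | 'Write' >> beam.io.WriteToBigQuery(
           'my-project:dataset.table',
           schema='id:INTEGER,name:STRING',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))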
If Dataflow is off the table, another option could be to use a weird/unused delimiter and read the entire row into BigQuery. Then use SQL/Regex/UDFs to clean/transform/parse it. See here (suggestion from Felipe). We've done this lots of times in the past, and because you're in BigQuery it scales really well.
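A minimal sketch of that trick with the BigQuery Python client, loading each row into a single STRING column by picking a delimiter that never appears in the data (project, bucket and table names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=u'\u00fe',  # 'thorn' character, assumed absent from the data
    schema=[bigquery.SchemaField('raw', 'STRING')],
)

# Every CSV row lands intact in the 'raw' column; parse it later with SQL/Regex/UDFs
client.load_table_from_uri(
    'gs://my-bucket/file.csv',
    'my-project.dataset.raw_table',
    job_config=job_config,
).result()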
I would consider using Cloud Dataprep.
Dataprep can import data from GCS, clean / modify the data and export to BigQuery. One of the features that I like is that everything can be done visually / interactively so that I can see how the data transforms.
Start with a subset of your data to see what transformations are required and to give yourself some practice before loading and processing a TB of data.
You can always transfer from a storage bucket directly into a BQ table:
bq --location=US load --[no]replace --source_format=CSV dataset.table gs://bucket/file.csv [schema]
Here, [schema] can be an inline schema of your csv file (like id:int,name:string,..) or a path to a JSON schema file (available locally).
As per the BQ documentation, BigQuery tries to parallelize large CSV loads into tables. Of course, there is an upper bound involved: the maximum size of an uncompressed CSV file to be loaded from GCS into BQ is 5 TB, which is way above your requirements. I think you should be good with this.

Reading data from bucket in Google ml-engine (tensorflow)

I am having issues reading data from a bucket hosted by Google.
I have a bucket containing ~1000 files I need to access, held at (for example)
gs://my-bucket/data
Using gsutil from the command line or one of Google's Python API clients, I can access the data in the bucket; however, importing these APIs is not supported by default on google-cloud-ml-engine.
I need a way to access both the data and the names of the files, either with a default Python library (i.e. os) or using TensorFlow. I know TensorFlow has this functionality built in somewhere, but it has been hard for me to find.
Ideally I am looking for replacements for one command such as os.listdir() and another for open():
train_data = [read_training_data(filename) for filename in os.listdir('gs://my-bucket/data/')]
Where read_training_data uses a tensorflow reader object
Thanks for any help! (Also, P.S.: my data is binary.)
If you just want to read data into memory, then this answer has the details you need, namely, to use the file_io module.
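A minimal sketch of the file_io route, giving rough drop-in replacements for os.listdir() and open() (the bucket path and read_training_data() come from the question; the module path is the TensorFlow 1.x one, and the body shown here simply returns raw bytes):
from tensorflow.python.lib.io import file_io

# List the objects under the bucket prefix (replacement for os.listdir)
filenames = file_io.list_directory('gs://my-bucket/data/')

# Open each object like a regular file (replacement for open); the data is binary
def read_training_data(filename):
    with file_io.FileIO('gs://my-bucket/data/' + filename, mode='rb') as f:
        return f.read()

train_data = [read_training_data(filename) for filename in filenames]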
That said, you might want to consider using built-in reading mechanisms for TensorFlow as they can be more performant.
Information on reading can be found here. The latest and greatest (but not yet part of official "core" TensorFlow) is the Dataset API (more info here).
Some things to keep in mind:
Are you using a format TensorFlow can read? Can it be converted to that format?
Is the overhead of "feeding" high enough to affect training performance?
Is the training set too big to fit in memory?
If the answer is yes to one or more of those questions, especially the latter two, consider using readers; a rough sketch follows below.
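A rough sketch of the reader route with the Dataset API, assuming fixed-size binary records read straight from GCS (record_bytes, the decoding and the batch size are assumptions about your format; this targets TensorFlow 1.4+):
import tensorflow as tf

# Glob the GCS objects, then stream fixed-length records directly from the bucket
filenames = tf.gfile.Glob('gs://my-bucket/data/*')
dataset = (tf.data.FixedLengthRecordDataset(filenames, record_bytes=1024)
           .map(lambda record: tf.decode_raw(record, tf.uint8))
           .batch(32))

iterator = dataset.make_one_shot_iterator()
batch = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(batch).shape)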
For what it's worth, I also had problems reading files, in particular binary files, from Google Cloud Storage inside a Datalab notebook. The first way I managed to do it was by copying files with gsutil to my local filesystem and using TensorFlow to read the files normally. This is demonstrated below, after the file copy was done.
Here is my setup cell:
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
Here is a cell for reading the file locally as a sanity check.
# this works for reading a local file
audio_binary_local = tf.read_file("100852.mp3")
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_local, file_format='mp3',
                                          samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)
Here is a cell for reading the file directly from gs:// as a binary file.
# this works for remote files in gs://
gsfilename = 'gs://proj-getting-started/UrbanSound/data/air_conditioner/100852.mp3'
# python 2
#audio_binary_remote = tf.gfile.Open(gsfilename).read()
# python 3
audio_binary_remote = tf.gfile.Open(gsfilename, 'rb').read()
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_remote, file_format='mp3',
                                          samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)
