Summary:
I am working on a use case where I want to write images via cv2 to ADLS from within a PySpark streaming job in Databricks; however, it fails if the target directory doesn't exist.
I want to store each image in a specific directory structure that depends on the image's attributes,
so I basically need to check at runtime whether the directory exists and create it if it doesn't.
Initially I tried using dbutils, but dbutils can't be used inside the PySpark API (see the issue below).
https://github.com/MicrosoftDocs/azure-docs/issues/28070
Expected Results:
To be able to create a directory in ADLS Gen2 from within a PySpark streaming job at runtime.
Reproducible code:
# Read images in batch for simplicity
df = (spark.read.format('binaryFile')
      .option('recursiveFileLookup', True)
      .option('pathGlobFilter', '*.jpg')
      .load(path_to_source))

# Get the necessary columns
df = (df.withColumn('ingestion_timestamp', F.current_timestamp())
        .withColumn('source_ingestion_date', F.to_date(F.split('path', '/')[10]))
        .withColumn('source_image_path', F.regexp_replace(F.col('path'), 'dbfs:', '/dbfs/'))
        .withColumn('source_image_time', F.substring(F.split('path', '/')[12], 0, 8))
        .withColumn('year', F.date_format(F.to_date(F.col('source_ingestion_date')), 'yyyy'))
        .withColumn('month', F.date_format(F.to_date(F.col('source_ingestion_date')), 'MM'))
        .withColumn('day', F.date_format(F.to_date(F.col('source_ingestion_date')), 'dd'))
        .withColumn('base_path', F.concat(F.lit('/dbfs/mnt/development/testing'),
                                          F.lit('/year='), F.col('year'),
                                          F.lit('/month='), F.col('month'),
                                          F.lit('/day='), F.col('day'))))
# Function to be called from foreach
def processRow(row):
    source_image_path = row['source_image_path']
    base_path = row['base_path']
    source_image_time = row['source_image_time']
    # CheckPathExists is a custom helper that checks whether the directory exists
    if not CheckPathExists(base_path):
        dbutils.fs.mkdirs(base_path)
    full_path = f"{base_path}/{source_image_time}.jpg"
    im = cv2.imread(source_image_path)
    cv2.imwrite(full_path, im)
# This fails
df.foreach(processRow)
# due to the block below inside processRow:
if not CheckPathExists(base_path):
    dbutils.fs.mkdirs(base_path)
full_path = f"{base_path}/{source_image_time}.jpg"
im = cv2.imread(source_image_path)
cv2.imwrite(full_path, im)
Does anyone have any suggestions, please?
AFAIK, dbutils.fs.mkdirs(base_path) works for paths of the form dbfs:/mnt/mount_point/folder.
I have reproduced this, and when I pass a path like /dbfs/mnt/mount_point/folder to mkdirs, the folder is not created in ADLS even though the call returned True in Databricks.
But for dbfs:/mnt/mount_point/folder it works fine.
This might be the issue here. So first check whether the path exists using the /dbfs/mnt/mount_point/folder form, and if it doesn't, create the directory using the dbfs:/ form.
Example:
import os

base_path = "/dbfs/mnt/data/folder1"
print("before : ", os.path.exists(base_path))
if not os.path.exists(base_path):
    base_path2 = "dbfs:" + base_path[5:]
    dbutils.fs.mkdirs(base_path2)
print("after : ", os.path.exists(base_path))
You can see the folder is created.
If you don't want to use os directly, check whether the path exists with a listing call such as dbutils.fs.ls and create the directory if it is missing.
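A minimal sketch of such a listing-based check (path_exists is a hypothetical helper name; dbutils.fs.ls raises an exception when the path does not exist, so the try/except doubles as the existence test):

def path_exists(path):
    # dbutils.fs.ls throws for missing paths, so catching the error is the check
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

base_path = "dbfs:/mnt/data/folder1"  # same example mount path as above, in dbfs:/ form
if not path_exists(base_path):
    dbutils.fs.mkdirs(base_path)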
Related
I have a Python script that loads files into a folder hierarchy in an Azure storage container/blob storage. The folder hierarchy structure is High-level-folder_name/Year/Month/Day/filename.json, for example: data/Scanner/2022/07/15/scanner23_30_45_21.json
New folders are added to the hierarchy based on the current date. My code works, but for some reason a folder with a link to the top of the container hierarchy is also created at each level of the folder structure. See image below.
My code is below. Any ideas on what is causing the links to the upper levels of the hierarchy would be appreciated.
I have a feeling this issue is linked to the Blob service being based on a flat storage scheme (rather than a hierarchical one). Related link: With help of Python, how can I create dynamic subfolder under Azure Blobstorage?
import json
from datetime import datetime

from azure.storage.blob import BlobClient

# File name
datetime_string = datetime.utcnow().strftime("%Y_%m_%d-%I_%M_%S_%p")

# Folders that the file goes in: Year -> Month -> filename.json
year_utc = datetime.utcnow().strftime("%Y")
month_utc = datetime.utcnow().strftime("%m")
filename_prefix = f"Scanner/{year_utc}/{month_utc}/{datetime_string}.json"

data = json.dumps("test_content_of_file", indent=4)

blob = BlobClient.from_connection_string(conn_str="Connectionstringetc",
                                         container_name="data",
                                         blob_name=filename_prefix)
blob.upload_blob(data, overwrite=False)
When you create a folder hierarchy in Blob storage, this is the default underlying structure of Azure Blob storage.
The links are part of the UI design Azure provides to help you move one level up in the folder hierarchy: the portal resolves the previous path and navigates to it. This is the default view and cannot be changed or altered.
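To illustrate the flat-namespace point: blobs only appear to sit in folders because their names contain '/' separators, so a "folder" can be listed simply by filtering on that prefix. A minimal sketch, assuming azure-storage-blob v12 and the same placeholder connection string and container as in the question:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(conn_str="Connectionstringetc",
                                                   container_name="data")

# "Folders" are just name prefixes on a flat namespace
for blob in container.list_blobs(name_starts_with="Scanner/2022/07/"):
    print(blob.name)  # e.g. Scanner/2022/07/15/scanner23_30_45_21.json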
I am using the code below to access sub-folders inside a folder named 'dataset' in VSCode; however, I am getting an empty list for dataset in the output and hence cannot get the JSON and image files stored inside that folder. The same code works in Google Colab.
Code:
import os
from glob import glob

if __name__ == "__main__":
    """Dataset path"""
    dataset_path = "vehicle images/dataset"
    dataset = glob(os.path.join(dataset_path, "*"))

    for data in dataset:
        image_path = glob(os.path.join(data, "*.jpg"))
        json_file = glob(os.path.join(data, "*.json"))
File Structure in VSCode:
File Structure in Google Colab:
Any suggestions would be helpful.
It looks like you used a space in the folder name. This can be solved in two ways: either rename the vehicle images folder to vehicle_images, or use a raw string like
dataset_path = r"vehicle images/dataset"
I have a pandas dataframe that consists of tens of thousands of image names, and these images are in a local folder.
I want to filter that dataframe to pick certain images (in the thousands) and copy those images from the aforementioned local folder to another local folder.
Is there a way to do this in Python?
I have tried to do it using glob but couldn't make much sense out of it.
I will create a sample example here. I have the following df:
img_name
2014.png
2015.png
2016.png
2021.png
2022.png
2023.png
I have a folder, for example "my_images", and I wish to move "2015.png" and "2022.png" to another folder called "proc_images".
Thanks
import os
import shutil

path_to_your_files = '../my_images'
copy_to_path = '../proc_images'

files_list = sorted(os.listdir(path_to_your_files))
file_names = ["2015.png", "2022.png"]

for curr_file in file_names:
    shutil.copyfile(os.path.join(path_to_your_files, curr_file),
                    os.path.join(copy_to_path, curr_file))
Something like this?
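If the file names come from the dataframe rather than a hard-coded list, here is a small hedged sketch (assuming the column is called img_name as in the question, and an arbitrary placeholder filter):

import os
import shutil
import pandas as pd

df = pd.DataFrame({"img_name": ["2014.png", "2015.png", "2016.png",
                                "2021.png", "2022.png", "2023.png"]})

# Placeholder filter: select only the images you want to copy
wanted = df[df["img_name"].isin(["2015.png", "2022.png"])]["img_name"]

src_dir = "../my_images"
dst_dir = "../proc_images"
os.makedirs(dst_dir, exist_ok=True)

for name in wanted:
    shutil.copyfile(os.path.join(src_dir, name), os.path.join(dst_dir, name))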
I am looking for a working example of how to access data on an Azure Machine Learning managed datastore from within a train.py script. I followed the instructions in the link, and my script is able to resolve the datastore.
However, whatever I tried (as_download(), as_mount()), the only thing I ever got back was a DataReference object. Or maybe I just don't understand how to actually read data from a file with that.
from azureml.core import Run, Datastore

run = Run.get_context()
exp = run.experiment
ws = run.experiment.workspace

ds = Datastore.get(ws, datastore_name='mydatastore')
data_folder_mount = ds.path('mnist').as_mount()
# So far this all works. But how do I go from here?
You can pass the DataReference object you created as an input to your training run (ScriptRun/Estimator/HyperDrive/Pipeline). Then, in your training script, you can access the mounted path via a script argument.
Full tutorial: https://learn.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml
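Following that tutorial, here is a minimal sketch of the submission side, assuming the v1 azureml-sdk Estimator API; the compute target and experiment names are placeholders:

from azureml.core import Workspace, Datastore, Experiment
from azureml.train.estimator import Estimator

ws = Workspace.from_config()
ds = Datastore.get(ws, datastore_name='mydatastore')

est = Estimator(source_directory='.',
                entry_script='train.py',
                compute_target='my-compute',  # placeholder compute target name
                script_params={'--data-folder': ds.path('mnist').as_mount()})
run = Experiment(ws, 'my-experiment').submit(est)  # placeholder experiment name

Inside train.py the mounted path then arrives as an ordinary string argument:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder')
args = parser.parse_args()

# By this point the DataReference has been resolved to a local mount path
print(os.listdir(args.data_folder))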
I have a Databricks notebook setup that works as follows:
pyspark connection details to the Blob storage account
read the file into a Spark dataframe
convert to a pandas df
data modelling on the pandas df
convert back to a Spark df
write to Blob storage as a single file
My problem is that you cannot name the output file, and I need a static CSV filename.
Is there a way to rename this in PySpark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""

## File location and file type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"

## Connection string to connect to blob storage
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
This is followed by outputting the file after the data transformation:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
The file is then written as "part-00000-tid-336943946930983.....csv",
whereas the goal is to have "Output.csv".
Another approach I looked at was just recreating this in plain Python, but I have not yet found in the documentation how to output the file back to Blob storage.
I know the method to retrieve a file from Blob storage is .get_blob_to_path, via microsoft.docs.
Any help here is greatly appreciated.
Hadoop/Spark writes the compute result in parallel, one file per partition, so you will see many part-<number>-... files in an HDFS output path such as the Output/ directory you named.
If you want to output all results of a computation into one file, you can merge them with the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the number of partitions to 1, for example with the coalesce(1) function.
So in your scenario, you only need to make sure coalesce(1) is called on the dataframe before the save function, as below.
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
coalesce and repartition do not help with saving the dataframe into one normally named file.
I ended up just renaming the single CSV file and deleting the folder with the log files:
import os

def save_csv(df, location, filename):
    outputPath = os.path.join(location, filename + '_temp.csv')
    df.repartition(1).write.format("com.databricks.spark.csv") \
        .mode("overwrite").options(header="true", inferSchema="true") \
        .option("delimiter", "\t").save(outputPath)

    # Move the part-xxxxx temp CSV file to a normally named one
    csv_files = os.listdir(os.path.join('/dbfs', outputPath))
    for file in csv_files:
        if file[-4:] == '.csv':
            dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))
    dbutils.fs.rm(outputPath, True)

# Using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')
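If you instead go the pure-Python route mentioned in the question (write the single CSV locally and push it back to Blob storage yourself), here is a minimal, hedged sketch using the v12 azure-storage-blob client rather than the older .get_blob_to_path API; the connection string, container name, and paths are placeholders:

from azure.storage.blob import BlobClient

# Placeholder: path of the renamed CSV as seen through the /dbfs FUSE mount
local_csv = "/dbfs/mnt/...../name.csv"

# Placeholder connection string and container; blob_name decides the final file name
blob = BlobClient.from_connection_string(conn_str="Connectionstringetc",
                                         container_name="data",
                                         blob_name="Databricks_Files/out/Output.csv")

with open(local_csv, "rb") as fh:
    blob.upload_blob(fh, overwrite=True)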