Unable to access sub-folders in flask - python

I am using the code below to access sub-folders inside a folder named 'dataset' in VSCode. However, I am getting an empty list for dataset in the output and hence cannot reach the json and image files stored inside that folder. The same code works in Google Colab.
Code:
import os
from glob import glob

if __name__ == "__main__":
    """Dataset path"""
    dataset_path = "vehicle images/dataset"
    dataset = glob(os.path.join(dataset_path, "*"))

    for data in dataset:
        image_path = glob(os.path.join(data, "*.jpg"))
        json_file = glob(os.path.join(data, "*.json"))
File Structure in VSCode:
File Structure in Google Colab:
Any suggestions would be helpful.

It looks like you used a space in the folder name. This can be solved in two ways: either rename the vehicle images folder to vehicle_images, or use a raw string like
dataset_path = r"vehicle images/dataset"
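A minimal sketch that combines the question's snippet with the two suggestions above; the print is only there to confirm that the glob result is no longer empty:

import os
from glob import glob

# Option 1: rename the folder so the path contains no space
dataset_path = "vehicle_images/dataset"
# Option 2: keep the original folder name and use a raw string instead
# dataset_path = r"vehicle images/dataset"

dataset = glob(os.path.join(dataset_path, "*"))
print(dataset)   # should now list the sub-folders instead of being empty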

Related

How to create directory in ADLS gen2 from pyspark databricks

Summary:
I am working on a use case where I want to write images via cv2 to ADLS from within a pyspark streaming job in Databricks; however, it doesn't work if the directory doesn't exist.
But I want to store each image in a specific structure depending on the image attributes,
so basically I need to check at runtime whether the directory exists and create it if it doesn't.
Initially I tried using dbutils, but dbutils can't be used inside the pyspark API.
https://github.com/MicrosoftDocs/azure-docs/issues/28070
Expected Results:
To be able to create a directory in ADLS Gen2 from within a pyspark streaming job at runtime.
Reproducible code:
from pyspark.sql import functions as F

# Read images in batch for simplicity
df = (spark.read.format('binaryFile')
      .option('recursiveFileLookup', True)
      .option('pathGlobFilter', '*.jpg')
      .load(path_to_source))  # source path placeholder

# Get the necessary columns
df = (df.withColumn('ingestion_timestamp', F.current_timestamp())
        .withColumn('source_ingestion_date', F.to_date(F.split('path', '/')[10]))
        .withColumn('source_image_path', F.regexp_replace(F.col('path'), 'dbfs:', '/dbfs/'))
        .withColumn('source_image_time', F.substring(F.split('path', '/')[12], 0, 8))
        .withColumn('year', F.date_format(F.to_date(F.col('source_ingestion_date')), 'yyyy'))
        .withColumn('month', F.date_format(F.to_date(F.col('source_ingestion_date')), 'MM'))
        .withColumn('day', F.date_format(F.to_date(F.col('source_ingestion_date')), 'dd'))
        .withColumn('base_path', F.concat(F.lit('/dbfs/mnt/development/testing/'),
                                          F.lit('/year='), F.col('year'),
                                          F.lit('/month='), F.col('month'),
                                          F.lit('/day='), F.col('day'))))
import cv2

# function to be called in the foreach call
def processRow(row):
    source_image_path = row['source_image_path']
    base_path = row['base_path']
    source_image_time = row['source_image_time']

    if not CheckPathExists(base_path):
        dbutils.fs.mkdirs(base_path)

    full_path = f"{base_path}/{source_image_time}.jpg"
    im = cv2.imread(source_image_path)
    cv2.imwrite(full_path, im)

# This fails
df.foreach(processRow)
# Due to the code block below
if not CheckPathExists(base_path):
    dbutils.fs.mkdirs(base_path)

full_path = f"{base_path}/{source_image_time}.jpg"
im = cv2.imread(source_image_path)
cv2.imwrite(full_path, im)
Does anyone have any suggestions, please?
AFAIK, dbutils.fs.mkdirs(base_path) works for paths like dbfs:/mnt/mount_point/folder.
I have reproduced this, and when I pass a path like /dbfs/mnt/mount_point/folder to the mkdirs function, the folder is not created in ADLS even though Databricks returns True.
But for dbfs:/mnt/mount_point/folder it works fine.
This might be the issue here. So, first check whether the path exists using the /dbfs/mnt/mount_point/folder form, and if it doesn't, create the directory using the dbfs:/ form of the path.
Example:
import os

base_path = "/dbfs/mnt/data/folder1"
print("before : ", os.path.exists(base_path))

if not os.path.exists(base_path):
    base_path2 = "dbfs:" + base_path[5:]
    dbutils.fs.mkdirs(base_path2)

print("after : ", os.path.exists(base_path))
You can see the folder is created.
If you don't want to use os directly, check whether the path exists by listing it (see the sketch below) and then create the directory.
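For example, a minimal sketch (not from the original answer) that does the existence check with dbutils.fs.ls instead of os; dbutils is the Databricks-provided utility object, and the path below is only illustrative:

# check for the folder by trying to list it; dbutils.fs.ls raises an
# exception when the path does not exist
def path_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

base_path = "dbfs:/mnt/data/folder1"   # illustrative path
if not path_exists(base_path):
    dbutils.fs.mkdirs(base_path)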

Copy images from one folder to another using their names on a pandas dataframe

I have a pandas dataframe that consists of 10000s of image names, and these images are in a folder locally.
I want to filter that dataframe to pick certain images (in the 1000s) and copy those images from the aforementioned local folder to another local folder.
Is there a way this can be done in python?
I have tried to do it using glob but couldn't make much sense out of it.
I will create a sample example here. I have the following df:
img_name
2014.png
2015.png
2016.png
2021.png
2022.png
2023.png
I have a folder, for example "my_images", and I wish to move "2015.png" and "2022.png" to another folder called "proc_images".
Thanks
import os
import shutil

path_to_your_files = '../my_images'
copy_to_path = '../proc_images'

files_list = sorted(os.listdir(path_to_your_files))
file_names = ["2015.png", "2022.png"]

for curr_file in file_names:
    shutil.copyfile(os.path.join(path_to_your_files, curr_file),
                    os.path.join(copy_to_path, curr_file))
Something like this?
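To drive the copy from the dataframe instead of a hand-written list, you could build the file names from the filtered img_name column. A minimal sketch, where the dataframe contents and the filter condition are only assumptions:

import os
import shutil
import pandas as pd

# hypothetical dataframe matching the example above
df = pd.DataFrame({'img_name': ['2014.png', '2015.png', '2016.png',
                                '2021.png', '2022.png', '2023.png']})

path_to_your_files = '../my_images'
copy_to_path = '../proc_images'

# hypothetical filter: keep only the rows whose images should be copied
wanted = df[df['img_name'].isin(['2015.png', '2022.png'])]

for curr_file in wanted['img_name']:
    shutil.copyfile(os.path.join(path_to_your_files, curr_file),
                    os.path.join(copy_to_path, curr_file))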

Loading images for multi-class from csv file

I have Train and Test folders, and inside each folder there are many folders with images inside each. In the .csv file there is a label for each folder and class.
Here is the csv file:
https://i.imgur.com/qMLGOpC.png
and the folders:
https://i.imgur.com/RrYBZxG.png
How can I load these folders with their labels in Keras?
I tried to make a dataframe with the folders and labels like below:
pth = 'C:/Users/Documents/train/'
folders = os.listdir(pth)

filepath = 'C:/Users//Documents//keras/labels.csv'
metadf = pd.read_csv(filepath)
metadf.index = metadf.Class

videos = pd.DataFrame([])
for folder in folders:
    pth_upd = pth + folder + '/'
    allfiles = os.listdir(pth_upd)  # files inside this class folder
    for file in allfiles:
        videos = pd.DataFrame(metadf.values, index=folders)
The output is:
https://i.imgur.com/CsXAE8f.png
Is that the correct way of doing it? How can I load each folder of images with the corresponding labels?
You can use the built-in flow_from_directory() method.
Check out this link: keras docs
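A minimal sketch of what that could look like, assuming the images are arranged as train/<class_name>/<image files> so the labels can be inferred from the folder names; the target size and batch size below are assumptions:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_directory(
    'C:/Users/Documents/train/',   # train path from the question
    target_size=(224, 224),        # assumed image size
    batch_size=32,
    class_mode='categorical')      # multi-class labels inferred from folder names

# train_gen can then be passed to model.fit(...)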

Renaming spark output csv in azure blob storage

I have a Databricks notebook setup that works as follows:
pyspark connection details to the Blob storage account
read the file into a spark dataframe
convert to a pandas Df
data modelling on the pandas Df
convert back to a spark Df
write to blob storage as a single file
My problem is that you cannot name the output file, and I need a static csv filename.
Is there a way to rename this in pyspark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""
## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"
## Connection string to connect to blob storage
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
Followed by outputting the file after the data transformation:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
The file is then written as "part-00000-tid-336943946930983.....csv", whereas the goal is to have "Output.csv".
Another approach I looked at was just recreating this in python, but I have not yet come across anything in the documentation about how to output the file back to blob storage.
I know the method to retrieve from Blob storage is .get_blob_to_path via microsoft.docs
Any help here is greatly appreciated.
Hadoop/Spark writes the compute result of each partition in parallel into its own file, so you will see many part-<number>-.... files in an HDFS output path like the Output/ directory you named.
If you want to output all results of a computation into one file, you can merge them via the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or set the number of reduce processes to 1, for example with the coalesce(1) function.
So in your scenario, you only need to make sure coalesce(1) is applied to the dataframe before the write/save is called, as below.
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
The coalesce and repartition do not help with saving the dataframe into one normally named file.
I ended up just renaming the single csv file and deleting the folder with the logs:
def save_csv(df, location, filename):
    outputPath = os.path.join(location, filename + '_temp.csv')
    df.repartition(1).write.format("com.databricks.spark.csv") \
        .mode("overwrite").options(header="true", inferSchema="true") \
        .option("delimiter", "\t").save(outputPath)

    csv_files = os.listdir(os.path.join('/dbfs', outputPath))
    # moving the parquet-like temp csv file into normally named one
    for file in csv_files:
        if file[-4:] == '.csv':
            dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))
    dbutils.fs.rm(outputPath, True)

# using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')

Reading .dcm files from a nested directory

This is the form of my nested directory:
/data/patient_level/study_level/series_level/
For each patient_level folder, I need to read the ".dcm" files from the corresponding "series_level" folder.
How can I access the ".dcm" file in the "series_level" folders?
I need 3 features from a DICOM object.
This is my source code:
import dicom
record = dicom.read_file("/data/patient_level/study_level/series_level/000001.dcm")
doc = {"PatientID": record.PatientID, "Manufacturer": record.Manufacturer, "SeriesTime": record.SeriesTime}
Then, I will insert this doc to a Mongo DB.
Any suggestions are appreciated.
Thanks in advance.
It is not quite clear what problem you are trying to solve, but if all you want is a list of all .dcm files under the data directory, you can use pathlib.Path():
from pathlib import Path
data = Path('/data')
list(data.rglob('*.dcm'))
To split a file in its components, you can use something like this:
dcm_file = '/patient_level/study_level/series_level/000001.dcm'
_, patient_level, study_level, series_level, filename = dcm_file.split('/')
Which would give you patient_level, study_level, series_level, and filename from dcm_file.
But it would be better to stick with Path() and its methods, like this:
dcm_file = Path('/patient_level/study_level/series_level/000001.dcm')
dcm_file.parts
Which would give you something like this:
('/', 'patient_level', 'study_level', 'series_level', '000001.dcm')
Those are just starting points anyway.
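Putting the pieces together, a minimal sketch (under the assumption that pydicom and pymongo are available; the database and collection names are hypothetical) that walks the nested directory, reads each .dcm file, and inserts the three features into MongoDB:

from pathlib import Path

import pydicom                    # newer name of the dicom package used in the question
from pymongo import MongoClient

client = MongoClient()            # assumes a local MongoDB instance
collection = client['dicom_db']['series']    # hypothetical database/collection names

data = Path('/data')
for dcm_path in data.rglob('*.dcm'):
    record = pydicom.dcmread(str(dcm_path))  # dicom.read_file(...) in older versions
    doc = {
        "PatientID": record.PatientID,
        "Manufacturer": record.Manufacturer,
        "SeriesTime": record.SeriesTime,
    }
    collection.insert_one(doc)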