Unexpected links generated when creating a folder hierarchy in blob storage - python

I have a Python script that loads files into a folder hierarchy in an Azure storage container (blob storage). The folder hierarchy is High-level-folder_name/Year/Month/Day/filename.json, for example data/Scanner/2022/07/15/scanner23_30_45_21.json.
New folders are added to the hierarchy based on the current date. My code works, but for some reason a folder with a link to the top of the container hierarchy is also created at each level of the folder structure (see image below).
My code is below. Any ideas on what is causing the links to the upper levels of the hierarchy would be appreciated.
I have a feeling this issue is linked to the Blob service using a flat storage scheme (rather than a hierarchical one). Related link: With help of Python, how can I create dynamic subfolder under Azure Blobstorage?
import json
from datetime import datetime
from azure.storage.blob import BlobClient

# file name
datetime_string = datetime.utcnow().strftime("%Y_%m_%d-%I_%M_%S_%p")
# folders that the file goes in: Year -> Month -> filename.json
year_utc = datetime.utcnow().strftime("%Y")
month_utc = datetime.utcnow().strftime("%m")
filename_prefix = f"Scanner/{year_utc}/{month_utc}/{datetime_string}.json"
data = json.dumps("test_content_of_file", indent=4)
blob = BlobClient.from_connection_string(conn_str="Connectionstringetc",
                                         container_name="data",
                                         blob_name=filename_prefix)
blob.upload_blob(data, overwrite=False)

When you create a folder hierarchy in blob storage, this is the default behaviour of Azure Blob Storage: the namespace is flat, and the "folders" are simply prefixes in the blob names.
The link to the upper level that appears at each level of the hierarchy is part of the Azure portal UI. It is generated from the blob name prefixes so you can navigate back up the virtual folder path; it is not an extra blob created by your code, and this default view cannot be changed or removed.
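For illustration only (the connection string, container name and prefix below are the placeholders from the question), listing blobs by prefix shows that each "folder" level is just part of the blob name rather than a stored object:

from azure.storage.blob import ContainerClient

# placeholder connection string / container, as in the question
container = ContainerClient.from_connection_string(
    conn_str="Connectionstringetc", container_name="data")

# the "folders" are only name prefixes; nothing separate is stored per level
for blob in container.list_blobs(name_starts_with="Scanner/"):
    print(blob.name)  # e.g. Scanner/2022/07/2022_07_15-11_30_45_PM.json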


How to create directory in ADLS gen2 from pyspark databricks

Summary:
I am working on a use case where I want to write images via cv2 to ADLS from within a PySpark streaming job in Databricks; however, it doesn't work if the directory doesn't exist.
But I want to store each image in a specific structure depending on the image attributes,
so basically I need to check at runtime whether the directory exists and create it if it doesn't.
Initially I tried using dbutils, but dbutils can't be used inside the PySpark API.
https://github.com/MicrosoftDocs/azure-docs/issues/28070
Expected results:
To be able to create a directory in ADLS Gen2 from within a PySpark streaming job at runtime.
Reproducible code:
# Read images in batch for simplicity
import cv2
import pyspark.sql.functions as F

df = (spark.read.format('binaryFile')
      .option('recursiveFileLookup', True)
      .option('pathGlobFilter', '*.jpg')
      .load('path_to_source'))

# Get necessary columns
df = (df.withColumn('ingestion_timestamp', F.current_timestamp())
        .withColumn('source_ingestion_date', F.to_date(F.split('path', '/')[10]))
        .withColumn('source_image_path', F.regexp_replace(F.col('path'), 'dbfs:', '/dbfs/'))
        .withColumn('source_image_time', F.substring(F.split('path', '/')[12], 0, 8))
        .withColumn('year', F.date_format(F.to_date(F.col('source_ingestion_date')), 'yyyy'))
        .withColumn('month', F.date_format(F.to_date(F.col('source_ingestion_date')), 'MM'))
        .withColumn('day', F.date_format(F.to_date(F.col('source_ingestion_date')), 'dd'))
        .withColumn('base_path', F.concat(F.lit('/dbfs/mnt/development/testing/'),
                                          F.lit('/year='), F.col('year'),
                                          F.lit('/month='), F.col('month'),
                                          F.lit('/day='), F.col('day'))))
# function to be called in foreach call
def processRow(row):
    source_image_path = row['source_image_path']
    base_path = row['base_path']
    source_image_time = row['source_image_time']
    if not CheckPathExists(base_path):
        dbutils.fs.mkdirs(base_path)
    full_path = f"{base_path}/{source_image_time}.jpg"
    im = cv2.imread(source_image_path)
    cv2.imwrite(full_path, im)

# This fails
df.foreach(processRow)

# Due to below code block
if not CheckPathExists(base_path):
    dbutils.fs.mkdirs(base_path)
full_path = f"{base_path}/{source_image_time}.jpg"
im = cv2.imread(source_image_path)
cv2.imwrite(full_path, im)
Does anyone have any suggestions, please?
AFAIK, dbutils.fs.mkdirs(base_path) works for paths like dbfs:/mnt/mount_point/folder.
I have reproduced this, and when I pass a path of the form /dbfs/mnt/mount_point/folder to mkdirs, the folder is not created in ADLS even though the call returns True in Databricks.
But for dbfs:/mnt/mount_point/folder it works fine.
This might be the issue here. So first check whether the path exists using the /dbfs/mnt/mount_point/folder form, and if it does not, create the directory using the dbfs:/ form.
Example:
import os

base_path = "/dbfs/mnt/data/folder1"
print("before : ", os.path.exists(base_path))
if not os.path.exists(base_path):
    base_path2 = "dbfs:" + base_path[5:]
    dbutils.fs.mkdirs(base_path2)
print("after : ", os.path.exists(base_path))
You can see the folder is created.
If you don't want to use os directly, you can instead check whether the path exists by listing it and create the directory only if it is missing, as sketched below.
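A minimal sketch of that alternative, assuming dbutils.fs.ls raises an exception when the path does not exist (the mount point name here is hypothetical):

def path_exists(path):
    # dbutils.fs.ls raises an exception when the path does not exist
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

base_path = "dbfs:/mnt/data/folder1"  # hypothetical mounted path
if not path_exists(base_path):
    dbutils.fs.mkdirs(base_path)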

Using Python and Docxtpl to automate reports: purposefully missing images are breaking / halting code

Context
I have been working for some time on a Python script that uses the docxtpl package (and Jinja2 for managing tags and templates) to automate the creation of MS Word reports.
My script (see below) is located in a base directory, along with an Excel document for auto-filling tags and a template Word document that is referenced. Within the base directory there is a sub-directory (Image_loop) that contains a further directory for each placeholder image that must be replaced. The images are replaced using the alt-text assigned to each placeholder image in the template document, which matches the name of the corresponding directory within Image_loop (Image1, Image2, etc.). My directory setup can be seen in the photos below.
Directory 1
Directory 2
My Code
import jinja2
import json
import numpy as np
from pathlib import Path
import pandas as pd
import glob
import os, sys
from docxtpl import DocxTemplate, InlineImage  # pip install docxtpl
from docx.shared import Cm, Inches, Mm, Emu  # pip install python-docx

base_dir = Path('//mnt//c//Users//XXX//Desktop//AUTOMATED_REPORTING')  # make sure the base directory is your own, the one you are working out of, in Ubuntu directory format
word_template_path = base_dir / "Template1.docx"  # set your Word document template
excel_path = base_dir / "Book1.xlsx"  # set the reference Excel document
output_dir = base_dir / "OUTPUT"  # set a directory for all outputs
output_dir.mkdir(exist_ok=True)  # creates the directory if it does not already exist

df = pd.read_excel(excel_path, sheet_name="Sheet1", dtype=str)  # read the Excel reference document as a pandas dataframe, as strings to avoid formatting issues
df2 = df.fillna(value='')  # turn N/A values into blanks, as a pandas dataframe cannot have empty cells but we want no value displayed in some instances
doc = DocxTemplate(word_template_path)
context = {}
image_filepath = Path('//mnt//c//Users//XXX//Desktop//AUTOMATED_REPORTING//Image_loop')

for record in df2.to_dict(orient="records"):  # render values from the Excel spreadsheet into the template document
    output_path = output_dir / f"{record['Catchment']}-Test_document.docx"
    for address, dirs, files in os.walk(image_filepath):  # walk 'image_filepath' to find the relevant sub-directories and the images within, to replace the placeholder images in the template Word document
        i = 0
        while i < len(dirs):
            dir_int = [*dirs[i][-1]]
            directory = str(dirs[i])
            if os.path.exists(image_filepath / f"{directory}/{record['Catchment']}.png"):
                doc.replace_pic(f"{directory}", image_filepath / f"{directory}/{record['Catchment']}.png")
            i += 1
    doc.render(record)
    doc.save(output_path)
Problem (help please)
My problem is that for some of my reports, there are no images for some of the placeholders. So for the sub-directories within Image_loop (Image1, Image 2, etc.), there is no image that corresponds to the template image number for that specific report.
So whilst the sub-directory 'Image_1' may contain, for reports A, B, C and D:
Map_A.png (for report A)
Map_B.png (for report B)
Map_C.png (for report C)
Map_D.png (for report D)
i.e. a map for every report,
the sub-directory 'Image_2' only contains, for reports A, B, C and D:
Graph_A (for report A)
Graph_B (for report B)
Graph_D (for report D)
i.e. there is no graph for report C.
I am able to prevent bullet points or tables in the template document from being printed when there is no corresponding value from the Excel document for a specific report. This is done directly in the template document, using a 'new paragraph if statement' in Jinja2 (https://jinja.palletsprojects.com/en/3.0.x/templates/). It looks something like this:
{%p if <TEMPLATE_VALUE> != '' %}
{%p endif %}
(i.e. don't print the bullet points, table, etc ,if there is no value to fill them with)
BUT if I wrap this same if statement around the start and end of a template image within the template document, I get an error when running the code in Linux Ubuntu: ValueError: Picture ImageXYZ not found in the docx template
The error is attributed to the last line of my code: doc.save(output_path). I assume this is because the Jinja2 '%p if' statement removes the placeholder image when there is no replacement image to be found, and this creates a problem when saving the report documents that are outliers (with no actual image to replace the placeholder image). When the code is run, reports are generated for the records that have images for all placeholders, but not for the 'outlier' document.
I'm sure there is a way to modify my code to generate the outlier reports even though the placeholder image is not going to be replaced. Perhaps with a 'try/except' statement?
But I'm a bit stuck...
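One way to follow that try/except idea is sketched below, under two assumptions that are not confirmed by the question: a fresh DocxTemplate is built for each report (so earlier renders and registered replacements don't leak into later ones), and the ValueError surfaces during render/save so it can be caught there.

for record in df2.to_dict(orient="records"):
    doc = DocxTemplate(word_template_path)  # fresh template for each report
    output_path = output_dir / f"{record['Catchment']}-Test_document.docx"
    for address, dirs, files in os.walk(image_filepath):
        for directory in dirs:
            image_png = image_filepath / f"{directory}/{record['Catchment']}.png"
            if os.path.exists(image_png):
                # only register a replacement when the image actually exists
                doc.replace_pic(directory, image_png)
    try:
        doc.render(record)
        doc.save(output_path)
    except ValueError as err:
        # e.g. "Picture ImageXYZ not found in the docx template"
        print(f"Could not fully generate {output_path}: {err}")

Whether the outlier report can still be produced in full depends on how docxtpl applies replace_pic internally, so treat this only as a starting point rather than a verified fix.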

Unable to access sub-folders in flask

I am using the code below to access sub-folders inside a folder named 'dataset' in VSCode; however, I am getting an empty list for dataset in the output and hence am unable to get the JSON and image files stored inside that folder. The same code works in Google Colab.
Code:
import os
from glob import glob

if __name__ == "__main__":
    """Dataset path"""
    dataset_path = "vehicle images/dataset"
    dataset = glob(os.path.join(dataset_path, "*"))
    for data in dataset:
        image_path = glob(os.path.join(data, "*.jpg"))
        json_file = glob(os.path.join(data, "*.json"))
File Structure in VSCode:
File Structure in Google Colab:
Any suggestions would be helpful.
It looks like you used a space in the folder name. This can be solved in two ways: either rename the vehicle images folder to vehicle_images, or use a raw string like
dataset_path = r"vehicle images/dataset"

Using openpyxl with lambda

Python rookie here. I have a requirement which I have been researching for a couple of days now. The requirement goes as below.
I have an S3 location with a few Excel sheets containing unformatted data. I am writing a Lambda function to format them and convert them to CSV. I already have the code for this, but it works on a local machine, where I pick the Excel files from a local directory, format/transform them and put them in a target folder. We are using the openpyxl package for the transformation. Now I am migrating this to AWS, and there comes the problem: instead of local directories, the source and target will be S3 locations.
The data transformation logic is far too lengthy and I really don't want to rewrite it.
Is there a way I can handle these Excel files just like I do on a local machine?
For instance,
wb = openpyxl.load_workbook('C:\User\test.xlsx', data_only=True)
How can I recreate this statement, or what it does, in Lambda with Python?
You can do this with BytesIO like so:
from io import BytesIO

file = readS3('test.xlsx')  # load the file from S3 with Boto3
wb = openpyxl.load_workbook(BytesIO(file), data_only=True)
With readS3() being implemented for example like this:
import boto3

bucket = "your-bucket-name"  # bucket name

def readS3(file):
    s3 = boto3.client('s3')
    s3_data = s3.get_object(Bucket=bucket, Key=file)
    return s3_data['Body'].read()
Configure Boto3 like so: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
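Since the target is also an S3 location, the transformed output can be uploaded the same way; here is a minimal sketch (the bucket name, key and CSV content are placeholders, not from the question):

import boto3

s3 = boto3.client('s3')

# hypothetical: csv_text holds the transformed CSV content as a string
csv_text = "col_a,col_b\n1,2\n"
s3.put_object(Bucket='your-target-bucket',  # placeholder bucket
              Key='formatted/test.csv',     # placeholder key
              Body=csv_text.encode('utf-8'))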

Copy images from one folder to another using their names on a pandas dataframe

I have a pandas dataframe that consists of tens of thousands of image names, and these images sit in a local folder.
I want to filter that dataframe to pick certain images (in the thousands) and copy those images from the aforementioned local folder to another local folder.
Is there a way that this can be done in Python?
I have tried to do it using glob but couldn't make much sense of it.
I will create a sample example here: I have the following df:
img_name
2014.png
2015.png
2016.png
2021.png
2022.png
2023.png
I have a folder for ex. "my_images" and I wish to move "2015.png" and "2022.png" to another folder called "proc_images".
Thanks
import os
import shutil

path_to_your_files = '../my_images'
copy_to_path = '../proc_images'

files_list = sorted(os.listdir(path_to_your_files))
file_names = ["2015.png", "2022.png"]
for curr_file in file_names:
    shutil.copyfile(os.path.join(path_to_your_files, curr_file),
                    os.path.join(copy_to_path, curr_file))
Something like this ?
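If the file names should come from the filtered dataframe rather than a hard-coded list, something along these lines should work (a sketch; df is the dataframe from the question and the filter condition is a placeholder):

import os
import shutil

path_to_your_files = '../my_images'
copy_to_path = '../proc_images'
os.makedirs(copy_to_path, exist_ok=True)  # make sure the target folder exists

# hypothetical filter: replace with whatever condition selects your images
selected = df[df['img_name'].isin(['2015.png', '2022.png'])]

for curr_file in selected['img_name']:
    shutil.copyfile(os.path.join(path_to_your_files, curr_file),
                    os.path.join(copy_to_path, curr_file))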
