I'm trying to grab a blob storage file into my python code on databricks only if it exists. How can I check if it exists through pyspark?
I don't think there is a dedicated PySpark method to check whether a blob exists, but with the code below you can read the blob before you write to it.
At the application level, as in any Spark application, you first need to grab a Spark session:
session = SparkSession.builder.getOrCreate()
Then you need to set up an account key:
session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)
Or a SAS token for a container:
session.conf.set(
    "fs.azure.sas.<container-name>.blob.core.windows.net",
    "<sas-token>"
)
Once an account access key or a SAS token is set up, you're ready to read from and write to Azure Blob storage:
sdf = session.read.parquet(
    "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<prefix>"
)
That said, from plain Python you can easily check whether a blob exists with the storage SDK's exists method (the sample below also uses get_blob_reference to build the blob name):
def blob_exists(self):
    container_name = self._create_container()
    blob_name = self._get_blob_reference()

    # Basic
    exists = self.service.exists(container_name, blob_name)  # False
    self.service.create_blob_from_text(container_name, blob_name, u'hello world')
    exists = self.service.exists(container_name, blob_name)  # True

    self.service.delete_container(container_name)
You can find the reference here:
https://github.com/Azure/azure-storage-python/blob/master/samples/blob/block_blob_usage.py
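The sample above uses the older azure-storage package. With the current azure-storage-blob v12 SDK (installable with pip on Databricks), an existence check looks roughly like this; the connection string, container and blob names are placeholders:

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<your-storage-connection-string>",
    container_name="<container-name>",
    blob_name="<blob-name>",
)
if blob.exists():
    print("Blob exists, safe to read it")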
My case is the following:
Two Azure Storage Accounts (Source/Destination)
Source Account may contain multiple containers, folders, blobs, etc.
All of the above needs to be copied exactly in the same structure to the DESTINATION account.
If any element already exists in the destination account and is older than the corresponding element in the SOURCE storage account, it needs to be overwritten.
What I have so far:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, BlobLeaseClient, BlobPrefix, ContentSettings

# Set the connection string for the source and destination storage accounts
SOURCE_CONNECTION_STRING = "your SOURCE connection string"
DESTINATION_CONNECTION_STRING = "your DESTINATION connection string"

# Create the BlobServiceClient objects for the source and destination storage accounts
source_blob_service_client = BlobServiceClient.from_connection_string(SOURCE_CONNECTION_STRING)
destination_blob_service_client = BlobServiceClient.from_connection_string(DESTINATION_CONNECTION_STRING)

# List all containers in the source storage account
source_containers = source_blob_service_client.list_containers()

# Iterate through each container in the source storage account
for source_container in source_containers:
    print(f"Processing container '{source_container.name}'...")

    # Create a new container in the destination storage account (if it doesn't exist already)
    destination_container = destination_blob_service_client.get_container_client(source_container.name)
    if not destination_container.exists():
        print(f"Creating container '{source_container.name}' in the destination storage account...")
        destination_container.create_container()

    # Get a list of all blobs in the current source container
    source_container_client = source_blob_service_client.get_container_client(source_container.name)
    source_blobs = source_container_client.list_blobs()
    #source_blobs = source_blob_service_client.list_blobs(source_container.name)

    # Iterate through each blob in the current source container
    for source_blob in source_blobs:
        # Check if the blob already exists in the destination container
        destination_blob = destination_blob_service_client.get_blob_client(source_container.name, source_blob.name)
        print(source_blob)
        if not destination_blob.exists() or source_blob.last_modified > destination_blob.get_blob_properties().last_modified:
            # Copy the blob to the destination container (with the same directory structure as in the source)
            source_blob_client = BlobClient.from_blob_url(source_blob.url)
            destination_blob.start_copy_from_url(source_url=source_blob.url)
            print(f"Copied blob '{source_blob.name}' to container '{source_container.name}' in the destination storage account.")
However I get an error -- AttributeError: 'BlobProperties' object has no attribute 'url' -- even though in this notebook https://github.com/Azure-Samples/AzureStorageSnippets/blob/master/blobs/howto/python/blob-devguide-py/blob-devguide-blobs.py and in https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#azure-storage-blob-blobclient-start-copy-from-url I see it being used.
Can someone suggest what I am doing wrong? I opted for Python because of the iterative requirement (going down to the most granular level of each container), which did not seem doable in Synapse via pipeline activities.
I tried this in my environment and got the results below.
Initially, I got the same error in my environment:
I got an error -- AttributeError: 'BlobProperties' object has no
attribute 'url'
The error occurs because the source_blob object is of type BlobProperties, which doesn't have a url attribute. Instead, you should use a BlobClient obtained from the service client to get the source blob URL.
Code:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, BlobLeaseClient, BlobPrefix, ContentSettings

# Set the connection string for the source and destination storage accounts
SOURCE_CONNECTION_STRING = "<src_connect_strng>"
DESTINATION_CONNECTION_STRING = "<dest_connect_strng>"

# Create the BlobServiceClient objects for the source and destination storage accounts
source_blob_service_client = BlobServiceClient.from_connection_string(SOURCE_CONNECTION_STRING)
destination_blob_service_client = BlobServiceClient.from_connection_string(DESTINATION_CONNECTION_STRING)

# List all containers in the source storage account
source_containers = source_blob_service_client.list_containers()

# Iterate through each container in the source storage account
for source_container in source_containers:
    print(f"Processing container '{source_container.name}'...")

    # Create a new container in the destination storage account (if it doesn't exist already)
    destination_container = destination_blob_service_client.get_container_client(source_container.name)
    if not destination_container.exists():
        print(f"Creating container '{source_container.name}' in the destination storage account...")
        destination_container.create_container()

    # Get a list of all blobs in the current source container
    source_container_client = source_blob_service_client.get_container_client(source_container.name)
    source_blobs = source_container_client.list_blobs()

    # Iterate through each blob in the current source container
    for source_blob in source_blobs:
        # Get clients for the destination and source blobs
        destination_blob = destination_blob_service_client.get_blob_client(source_container.name, source_blob.name)
        print(source_blob.name)
        source_blob_client = source_blob_service_client.get_blob_client(source_container.name, source_blob.name)
        print(source_blob_client.url)

        # Copy the blob to the destination container (with the same directory structure as in the source)
        destination_blob.start_copy_from_url(source_url=source_blob_client.url)
        print(f"Copied blob '{source_blob.name}' to container '{source_container.name}' in the destination storage account.")
Console:
The code above executed successfully and copied the same structure from one storage account to the other using Synapse.
Portal:
In the portal I can see that the destination account has the same structure as the source account.
I am using the code below to append data to an Azure blob using Python.
from azure.storage.blob import AppendBlobService
append_blob_service = AppendBlobService(account_name='myaccount', account_key='mykey')
# The same containers can hold all types of blobs
append_blob_service.create_container('mycontainer')
# Append blobs must be created before they are appended to
append_blob_service.create_blob('mycontainer', 'myappendblob')
append_blob_service.append_blob_from_text('mycontainer', 'myappendblob', u'Hello, world!')
append_blob = append_blob_service.get_blob_to_text('mycontainer', 'myappendblob')
The above code works fine, but when I tried to insert new data, the old data gets overwritten.
Is there any way I can append data to 'myappendblob'?
Considering you are calling the same code to append the data, the issue is with the following line of code:
append_blob_service.create_blob('mycontainer', 'myappendblob')
If you read the documentation for the create_blob method, you will notice the following:
Creates a blob or overrides an existing blob. Use if_none_match=* to
prevent overriding an existing blob.
So essentially you are overriding the blob every time you call your code.
You should call this method with if_none_match="*" as the documentation suggests. If the blob exists, your code will throw an exception which you will need to handle.
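With the legacy SDK used in the question, that would look roughly like this (a sketch, not tested against your account; AzureConflictHttpError is the 409 the service raises when the blob already exists):

from azure.common import AzureConflictHttpError
from azure.storage.blob import AppendBlobService

append_blob_service = AppendBlobService(account_name='myaccount', account_key='mykey')
try:
    # Only create the blob if it does not exist yet
    append_blob_service.create_blob('mycontainer', 'myappendblob', if_none_match='*')
except AzureConflictHttpError:
    pass  # the blob already exists, keep its contents
append_blob_service.append_blob_from_text('mycontainer', 'myappendblob', u'More data!')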
Try this code, which is taken from the documentation and was given by @Harsh Jain:
from azure.storage.blob import AppendBlobService

def append_data_to_blob(data):
    service = AppendBlobService(account_name="<storage account name>",
                                account_key="<storage account key>")
    try:
        # Append to the blob if it already exists
        service.append_blob_from_text(container_name="<container name>", blob_name="<file name>", text=data)
    except:
        # Otherwise create the append blob first, then append
        service.create_blob(container_name="<container name>", blob_name="<file name>")
        service.append_blob_from_text(container_name="<container name>", blob_name="<file name>", text=data)
    print('Data got appended')

append_data_to_blob('Hi blob')
Reference:
https://www.educative.io/answers/how-to-append-data-in-blob-storage-in-azure-using-python
I've just uploaded 5 GB of data and would like to verify that the MD5 sums match. I've calculated this for my local copy of the files but am having problems fetching ContentMD5 from Azure. So far, I get an empty dict, but I can see the blob names. I've limited it to the first 10 items at the moment, just for debugging. I'm aware that MD5 on Azure differs from a typical md5sum call and have allowed for that locally. But currently I cannot see any blob properties. The properties are there when I browse via the Azure console (as is the ContentMD5 property).
Where am I going wrong?
Here's my code at the moment:
import os
from os import sys

from azure.storage.blob import BlobServiceClient

def remote_check(connection_str):
    blob_service_client = BlobServiceClient.from_connection_string(connection_str)
    container_name = "global"
    container = blob_service_client.get_container_client(container=container_name)
    blob_list = container.list_blobs()
    count = 0
    for blob in blob_list:
        if count < 10:
            blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob)
            a = blob_client.get_blob_properties()
            print(a.metadata)
            print("Blob name: " + str(blob_client.blob_name))
            count = count + 1
        else:
            break

def main():
    try:
        CONNECTION_STRING = os.environ['AZURE_STORAGE_CONNECTION_STRING']
        remote_check(CONNECTION_STRING)
    except KeyError:
        print("AZURE_STORAGE_CONNECTION_STRING must be set.")
        sys.exit(1)

if __name__ == '__main__':
    main()
Please make sure you're using the latest version of package azure-storage-blob 12.6.0.
Some properties live under content_settings; for example, to get content_md5, use the following code:
a = blob_client.get_blob_properties()
print(a.content_settings.content_md5)
Here is my test result:
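Keep in mind that content_md5 comes back as a bytearray, while md5sum prints hex, so compare raw digests rather than strings; a minimal sketch:

import hashlib

def md5_matches(local_path, blob_client, chunk_size=4 * 1024 * 1024):
    # Compute the local file's MD5 in chunks (the files are large)
    md5 = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    # content_md5 is a bytearray, or None if the service never stored a hash for the blob
    remote_md5 = blob_client.get_blob_properties().content_settings.content_md5
    return remote_md5 is not None and bytes(remote_md5) == md5.digest()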
Alternatively, you can check the blob properties with a REST call (e.g. with a REST client like Postman) as described here:
https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-properties
The Content-MD5 is returned as an HTTP response header.
I'm trying to use the sample provided by Microsoft to connect to an Azure Storage table using Python. The code below fails because the module tablestorageaccount is not found. What am I missing? I installed the azure package, but it still complains that the module is not found.
import azure.common
from azure.storage import CloudStorageAccount
from tablestorageaccount import TableStorageAccount

print('Azure Table Storage samples for Python')

# Create the storage account object and specify its credentials
# to either point to the local Emulator or your Azure subscription
if IS_EMULATED:
    account = TableStorageAccount(is_emulated=True)
else:
    account_connection_string = STORAGE_CONNECTION_STRING
    # Split into key=value pairs removing empties, then split the pairs into a dict
    config = dict(s.split('=', 1) for s in account_connection_string.split(';') if s)

    # Authentication
    account_name = config.get('AccountName')
    account_key = config.get('AccountKey')

    # Basic URL Configuration
    endpoint_suffix = config.get('EndpointSuffix')
    if endpoint_suffix == None:
        table_endpoint = config.get('TableEndpoint')
        table_prefix = '.table.'
        start_index = table_endpoint.find(table_prefix)
        end_index = table_endpoint.endswith(':') and len(table_endpoint) or table_endpoint.rfind(':')
        endpoint_suffix = table_endpoint[start_index+len(table_prefix):end_index]

    account = TableStorageAccount(account_name=account_name, connection_string=account_connection_string, endpoint_suffix=endpoint_suffix)
I found the source sample code, and in the sample there is a custom module, tablestorageaccount.py, which is only used to return a TableService. If you already have the storage connection string and just want a quick test, you can connect to the table directly.
Sample:
from azure.storage.table import TableService, Entity
account_connection_string = 'DefaultEndpointsProtocol=https;AccountName=account name;AccountKey=account key;EndpointSuffix=core.windows.net'
tableservice=TableService(connection_string=account_connection_string)
You could also use the newer SDK to connect to tables. Here is the official tutorial: Get started with Azure Table storage.
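For reference, the current replacement for azure.storage.table is the azure-data-tables package; a minimal sketch using the same kind of connection string (the table name is a placeholder):

from azure.data.tables import TableServiceClient

connection_string = "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;EndpointSuffix=core.windows.net"
table_service = TableServiceClient.from_connection_string(conn_str=connection_string)

table_client = table_service.create_table_if_not_exists(table_name="mytable")
table_client.upsert_entity({"PartitionKey": "pk1", "RowKey": "rk1", "value": 42})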
I'm using azure-sdk-for-python to create and delete VMs.
https://github.com/Azure/azure-sdk-for-python
http://azure-sdk-for-python.readthedocs.io/en/latest/
I've successfully managed to write the code to create and delete my VMs using the resource manager approach (not the classic).
The basics of creating a VM can be seen here:
http://azure-sdk-for-python.readthedocs.io/en/latest/resourcemanagementcomputenetwork.html
I'm not worried about deleting the resource group and the storage account, as I'm using the same for all my VMs.
To delete the created VM I have something like this:
# 1. Delete the virtual machine
result = compute_client.virtual_machines.delete(
    group_name,
    vm_name
)
result.wait()

# 2. Delete the network interface
result = network_client.network_interfaces.delete(
    group_name,
    network_interface_name
)
result.wait()

# 3. Delete the ip
result = network_client.public_ip_addresses.delete(
    group_name,
    public_ip_address_name
)
result.wait()
As some know, the data disks are not deleted along with their VM.
I know it can be done with the Azure CLI:
https://azure.microsoft.com/en-us/documentation/articles/storage-azure-cli/
azure storage blob delete -a <storage_account_name> -k <storage_account_key> -q vhds <data_disk>.vhd
But I don't know how to do it programmatically with azure-sdk-for-python. And I didn't want to depend on the Azure CLI as the rest of my code is using the python sdk.
I would appreciate some help on how to do it.
Thanks
You can use the Azure ComputeManagementClient's disks APIs to obtain the list of disks associated with a VM and then iterate over them to delete each disk. Here's some sample code:
def delete_vm(self, vm_name, nic_name, group_name):
# Delete VM
print('Delete VM {}'.format(vm_name))
try:
async_vm_delete = self.compute_client.virtual_machines.delete(group_name, vm_name)
async_vm_delete.wait()
net_del_poller = self.network_client.network_interfaces.delete(group_name, nic_name)
net_del_poller.wait()
disks_list = self.compute_client.disks.list_by_resource_group(group_name)
disk_handle_list = []
for disk in disks_list:
if vm_name in disk.name:
async_disk_delete = self.compute_client.disks.delete(self.group_name, disk.name)
async_disk_handle_list.append(async_disk_delete)
print("Queued disks will be deleted now...")
for async_disk_delete in disk_handle_list:
async_disk_delete.wait()
except CloudError:
print('A VM delete operation failed: {}'.format(traceback.format_exc()))
return False
print("Deleted VM {}".format(vm_name))
return True
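Note that the disk filter relies on the managed-disk names containing the VM name, which is the default for disks created with the VM. A hypothetical call, assuming vm_manager is whatever instance holds compute_client and network_client:

vm_manager.delete_vm(vm_name="my-vm", nic_name="my-vm-nic", group_name="my-resource-group")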
You can use the Storage Management SDK to get the storage_account_key without writing it explicitly:
http://azure-sdk-for-python.readthedocs.io/en/latest/resourcemanagementstorage.html#get-storage-account-keys
To delete a VHD inside a storage account, you have to use the Storage Data SDK here:
https://github.com/Azure/azure-storage-python
You have samples in the "samples" folder or here:
https://github.com/Azure-Samples/storage-python-getting-started
Hope it helps :)
Here's a little more code:
from azure.mgmt.storage import StorageManagementClient
from azure.storage.blob import BlockBlobService  # legacy azure-storage package

storage_account = '<name of storage account>'

storage_client = StorageManagementClient(...)
keys = storage_client.storage_accounts.list_keys(...)
for key in keys.keys:
    # Use the first key; adjust accordingly if your setup is different
    break

block_blob_service = BlockBlobService(
    account_name=storage_account, account_key=key.value)

for blob in block_blob_service.list_blobs(container):
    print(blob.name)
I hope you find this useful. Thanks to Laurent for the pointers.
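To actually remove a data-disk VHD found in the listing, the same legacy client exposes delete_blob; a sketch, assuming container is the vhds container and blob is the disk you matched by name:

# Delete the matched VHD blob (irreversible unless soft delete is enabled)
block_blob_service.delete_blob(container, blob.name)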
To delete the OS disk, one simple way is to query for the OS disk name before deleting the VM, and delete the OS disk after the VM is deleted.
Here is my version of a function that deletes the VM alongside network and storage resources:
def az_delete_vm(resource_group_name, vm_name, delete_os_storage=True, remove_default_network=True):
    os_disk_name = None
    if delete_os_storage:
        vm = compute_client.virtual_machines.get(resource_group_name, vm_name)
        os_disk_name = vm.storage_profile.os_disk.name

    logger.info("Deleting VM %s", vm_name)
    delete_op1 = compute_client.virtual_machines.delete(
        resource_group_name, vm_name)
    delete_op1.wait()

    if delete_os_storage:
        delete_op2 = compute_client.disks.delete(resource_group_name, os_disk_name)
        delete_op2.wait()

    if remove_default_network:
        logger.info("Removing VM network components")
        vnet_name = "{}-vnet".format(vm_name)
        subnet_name = "{}-subnet".format(vm_name)
        nic_name = "{}-nic".format(vm_name)
        public_ip_name = "{}-public-ip".format(vm_name)

        logger.debug("Removing NIC %s", nic_name)
        delete_op3 = network_client.network_interfaces.delete(resource_group_name, nic_name)
        delete_op3.wait()

        # logger.debug("Removing subnet %s", subnet_name)
        # network_client.subnets.delete(resource_group_name, subnet_name)

        logger.debug("Removing vnet %s", vnet_name)
        delete_op4 = network_client.virtual_networks.delete(resource_group_name, vnet_name)

        logger.debug("Removing public IP %s", public_ip_name)
        delete_op5 = network_client.public_ip_addresses.delete(resource_group_name, public_ip_name)

        delete_op4.wait()
        delete_op5.wait()

    logger.info("Done deleting VM")