Pyspark: loading a zip file from blob storage - python

I'm using Pyspark to try to read a zip file from blob storage. I want to unzip the file once loaded, and then write the unzipped CSVs back to blob storage.
I'm following this guidance, which explains how to unzip the file once read: https://docs.databricks.com/_static/notebooks/zip-files-python.html
But it doesn't explain how to read the zip from blob storage in the first place. I have the following code:
file_location = "path_to_my.zip"
df = sqlContext.read.format("file_location").load
I expected this to load the zip into Databricks as df, and from there I could follow the advice from the article to unzip, load the CSVs into a dataframe and then write the dataframes back to blob.
Any ideas on how to initially read the zip file from blob storage using PySpark?
Thanks,

As shown in the first cell of the Databricks notebook you linked, you need to download the zip file and decompress it somehow. Your case is different because you are using Azure Blob Storage and you want to do everything in Python (without any other shell application).
This page documents the process for accessing files in Azure Blob Storage. You need to follow these steps:
Install the package azure-storage-blob.
Import the SDK modules and set the necessary credentials (reference).
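A minimal sketch of that step, assuming the connection string is kept in an environment variable (the variable name below follows the quickstart samples and is an assumption):
import os
from azure.storage.blob import BlobServiceClient

# Connection string for the storage account, read from an environment variable
connect_str = os.getenv("AZURE_STORAGE_CONNECTION_STRING")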
Create an instance of BlobServiceClient using a connection string:
# Create the BlobServiceClient object which will be used to create a container client
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
Create an instance of BlobClient for the file you want:
blob_client = blob_service_client.get_blob_client(container="container", blob="path_to_my.zip")
Download the blob (the zip file) and extract it. Since the file is a zip archive rather than a gzip stream, the zipfile module is the right tool; I would write something like this:
from pathlib import Path
import io
import zipfile

# Download the blob into memory; a .zip archive needs zipfile, not gzip
archive = zipfile.ZipFile(io.BytesIO(blob_client.download_blob().readall()))
# Write the first (or only) member of the archive to a local CSV path
Path("./my/local/filepath.csv").write_bytes(archive.read(archive.namelist()[0]))
Use "./my/local/filepath.csv" to create the DataFrame.
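From there, a minimal sketch of loading the extracted CSV into a Spark DataFrame and writing it back to blob storage; the reader options, the note about the file: prefix, and the wasbs output path are assumptions about your Databricks setup:
# Read the extracted CSV into a Spark DataFrame
# (on Databricks, driver-local files may need an absolute path with a file:/ prefix)
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("./my/local/filepath.csv"))

# Write the DataFrame back to blob storage as CSV; the wasbs:// path is a placeholder
# and assumes the storage account credentials are configured in Spark
df.write.mode("overwrite").csv("wasbs://<container>@<account>.blob.core.windows.net/output/")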

Related

How to integrate Python Code in Azure Data Factory

I have a text-delimited file in my Azure Data Factory. I have to convert it to a JSON file.
I also have a Python script that converts my text-delimited file to a JSON file.
Now, how do I integrate this Python code into Azure Data Factory? How do I run this code from ADF?
You can use an Azure Data Factory Data Flow to do a lot of transforms like CSV to JSON without Python (see this answer: Convert csv files,text files,pdf files into json using Azure Data Factory).
If you need the control Python offers, you can use Azure Batch to run your Python file. In your Python script, you can grab the CSV from a blob using blob_client.download_blob() to bring it down to a local file on the Batch VM. Then you can load the file normally (pd.read_csv()) and do your transform. After you write your JSON file locally, you can upload it back to the blob and then do whatever other ADF actions you want.
I used this set of instructions to get started running Python on Azure Batch: https://learn.microsoft.com/en-us/azure/batch/quick-run-python
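A rough sketch of that flow on the Batch VM, assuming azure-storage-blob v12 and pandas are installed; the container, blob names, delimiter, and connection-string variable are placeholders:
import os
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Connect using a connection string stored in an environment variable (assumed name)
service = BlobServiceClient.from_connection_string(os.getenv("AZURE_STORAGE_CONNECTION_STRING"))

# Download the delimited file to the local disk of the Batch VM
blob_client = service.get_blob_client(container="mycontainer", blob="input/data.txt")
with open("data.txt", "wb") as f:
    f.write(blob_client.download_blob().readall())

# Transform: delimited text -> JSON (the delimiter is an assumption)
df = pd.read_csv("data.txt", sep="|")
df.to_json("data.json", orient="records")

# Upload the JSON back to the container for downstream ADF activities
out_client = service.get_blob_client(container="mycontainer", blob="output/data.json")
with open("data.json", "rb") as f:
    out_client.upload_blob(f, overwrite=True)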

How to update an existing csv in google drive without using google sheets using python

I have a csv file (data.csv) in my Google Drive.
I want to append new data (df) to that csv file (stacking the new rows below the old, since both CSVs have the same column headers) and save it back to the existing data.csv without using Google Sheets.
How can I do this?
Open Google Colab and mount your Google Drive in Colab using:
from google.colab import drive
drive.mount('/content/drive')
Provide your authorization and the file will be available in your current session under the mounted drive directory. You can then edit the file by importing pandas or csv, then:
# myCsvRow is the new row text to append (including a trailing newline);
# adjust the path to wherever data.csv lives inside your mounted Drive
with open('/content/drive/MyDrive/data.csv', 'a') as fd:
    fd.write(myCsvRow)
Opening a file with the 'a' parameter allows you to append to the end of the file instead of simply overwriting the existing content. Any changes you make will be saved to your original file.
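Since the new data is already a dataframe, a minimal sketch of the append using pandas (the Drive path is an assumption about where data.csv lives):
import pandas as pd

# Path to the existing CSV inside the mounted Drive; adjust to your folder layout
path = '/content/drive/MyDrive/data.csv'

# Append df below the existing rows; header=False avoids writing the column
# headers a second time, since both CSVs share the same columns
df.to_csv(path, mode='a', header=False, index=False)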

Is it possible to read in all the files from an Azure Blob Storage container, and deleting the files after reading with Python?

I would like to read all files from an Azure Blob Storage container and save them on my local computer. After saving them locally, I would like to delete them from Blob Storage.
I found this solution on Stack Overflow. However, it only reads one single blob file (where you need to specify the name of the file).
# The example from stack, to read in 1 file.
from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')
block_blob_service.get_blob_to_path('mycontainer', 'myblockblob', 'out-sunset.png')
It's actually pretty straightforward. You just need to list blobs and then download each blob individually.
Something like:
from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')

# List blobs in the container
blobs = block_blob_service.list_blobs(container_name='mycontainer')

for blob in blobs:
    # Download each blob
    block_blob_service.get_blob_to_path(container_name='mycontainer', blob_name=blob.name, file_path=blob.name)
    # Delete blob
    # block_blob_service.delete_blob(container_name='mycontainer', blob_name=blob.name)
Please note that the above code assumes you don't have blobs inside virtual folders in your blob container; it will fail for such blobs because the matching directories don't exist on the local file system, so you would need to create those directories first.
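If you do have blobs inside virtual folders, a minimal sketch of handling them with the same legacy SDK, by recreating the folder structure locally before each download (account and container names are placeholders):
import os
from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')

for blob in block_blob_service.list_blobs(container_name='mycontainer'):
    # Blob names may contain '/' separators for virtual folders; mirror them locally
    local_path = os.path.join(*blob.name.split('/'))
    local_dir = os.path.dirname(local_path)
    if local_dir:
        os.makedirs(local_dir, exist_ok=True)
    block_blob_service.get_blob_to_path(container_name='mycontainer',
                                        blob_name=blob.name,
                                        file_path=local_path)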

export dataframe excel directly to sharepoint (or a web page)

I created a dataframe to jupyter notebook and I would like to export this dataframe directly to sharepoint as an excel file. Is there a way to do that?
As per the other answers, there doesn't seem to be a way to write a dataframe directly to SharePoint at the moment, but rather than saving it to a file locally, you can create a BytesIO object in memory and write that to SharePoint using the Office365-REST-Python-Client.
from io import BytesIO
buffer = BytesIO() # Create a buffer object
df.to_excel(buffer, index=False) # Write the dataframe to the buffer
buffer.seek(0)
file_content = buffer.read()
Referencing the example script upload_file.py, you would just replace the lines reading in the file with the code above.
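For illustration, the upload step might look roughly like the following, modeled on that example script; the site URL, credentials, folder, and file name are all placeholders:
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext

# Placeholder site URL and credentials
ctx = ClientContext("https://yourtenant.sharepoint.com/sites/yoursite").with_credentials(
    UserCredential("user@yourtenant.com", "password")
)

# Upload the in-memory Excel content (file_content from the buffer above) to a library folder
target_folder = ctx.web.get_folder_by_server_relative_url("Shared Documents")
target_folder.upload_file("dataframe.xlsx", file_content).execute_query()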
You can add the SharePoint site as a network drive on your local computer and then use a file path that way. Here's a link showing how to map SharePoint to a network drive. From there, just use the file path for the drive you mapped.
df.to_excel(r'Y:\Shared Documents\file_name.xlsx')
I think there is no way to export a pandas dataframe directly to SharePoint, but you can export your data to an Excel file and then upload that file to SharePoint.
Export the Covid data to an Excel file:
df.to_excel(r'/home/noh/Desktop/covid19.xls', index=False)

Is there any way to read numpy array in Azure Blob Storage in python?

I would like to read a numpy array stored in Azure Blob Storage from Python code in Azure Functions, but I have not been able to do it. I tried with BlockBlobService but couldn't succeed.
Looking for help/suggestions.
The simple solution is to use the Azure Storage SDK for Python to download the blob content into memory, then use numpy.frombuffer to load it as a 1-dimensional array and continue with whatever operations you want, such as reshape.
For reference, there is an SO thread, numpy.load from io.BytesIO stream, which is the same case as yours; I answered it with three sample codes for loading npy or npz content from a blob.
If you want to read a CSV/JSON file from Azure Blob Storage, you can also generate a blob URL with a SAS token, then pass that URL to pandas.read_csv or pandas.read_json to read the content directly into a pandas dataframe and convert it to a numpy array.
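A minimal sketch of the download-to-memory approach, assuming the current azure-storage-blob v12 SDK and a blob that was written with numpy.save (connection string, container, and blob names are placeholders; the linked thread also covers numpy.frombuffer variants):
import io
import numpy as np
from azure.storage.blob import BlobClient

# Placeholder connection details; the blob is assumed to be a .npy file
blob_client = BlobClient.from_connection_string(
    conn_str="<connection string>",
    container_name="mycontainer",
    blob_name="array.npy",
)

# Download the blob bytes into memory and load them as a numpy array
data = blob_client.download_blob().readall()
arr = np.load(io.BytesIO(data))
print(arr.shape)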
