How to integrate Python Code in Azure Data Factory - python

I have a text-delimited file in my Azure Data Factory. I have to convert it to a JSON file.
I also have a Python script that converts my text-delimited file to a JSON file.
Now how do I integrate this Python code into Azure Data Factory? How do I run this code from ADF?

You can use the Azure Data Factory Data Flow to do a lot of transforms like CSV to JSON without Python (see this answer: Convert csv files,text files,pdf files into json using Azure Data Factory).
If you need the control Python offers, you can use Azure Batch to run your Python file. In your Python script, you can grab the CSV from a blob using blob_client.download_blob() to bring it down to a local file on the Batch VM. Then you can load the file normally (pd.read_csv()) and do your transform. After you write your JSON file locally, you can upload it back to the blob and then do whatever other ADF actions you want.
I used this set of instructions to get started running Azure Batch python: https://learn.microsoft.com/en-us/azure/batch/quick-run-python
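As a rough sketch of that flow (the container, blob, and file names below are placeholders, connect_str is your storage account connection string, and it assumes azure-storage-blob and pandas are installed on the Batch node):
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Connect to the storage account and download the input blob to the Batch VM
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
blob_client = blob_service_client.get_blob_client(container="input", blob="data.csv")
with open("data.csv", "wb") as f:
    f.write(blob_client.download_blob().readall())

# Transform: text-delimited file -> JSON (adjust the delimiter to your file)
df = pd.read_csv("data.csv", sep="|")
df.to_json("data.json", orient="records")

# Upload the result so later ADF activities can pick it up
out_client = blob_service_client.get_blob_client(container="output", blob="data.json")
with open("data.json", "rb") as f:
    out_client.upload_blob(f, overwrite=True)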

Related

Loading csv.gz from url to bigquery

I am trying to load all the csv.gz files from this URL into Google BigQuery. What is the best way to do this?
I tried using PySpark to read the csv.gz files (as I need to perform some data cleaning on them), but I realized that PySpark doesn't support reading files directly from a URL. Would it make sense to load the cleaned versions of the csv.gz files into BigQuery, or should I dump the raw, original csv.gz files into BigQuery and perform my cleaning process in BigQuery itself?
I was reading the "Google BigQuery: The Definitive Guide" book and it suggests loading the data into Google Cloud Storage. Do I have to load each csv.gz file into Google Cloud Storage individually, or is there an easier way to do this?
Thanks for your help!
As @Samuel mentioned, you can use the curl command to download the files from the URL and then copy them to a GCS bucket.
If you have heavy transformations to be done on the data, I would recommend using Cloud Dataflow; otherwise you can go for a Cloud Dataprep workflow and finally export your clean data to a BigQuery table.
Choosing BigQuery for transformations depends entirely on your use case, data size, and budget, i.e., if you have a high volume of data then direct transformations could be costly.
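A minimal Python sketch of that first step (the URL, bucket, and object names are placeholders, and it assumes the requests and google-cloud-storage packages are installed):
import requests
from google.cloud import storage

# Download one csv.gz file from the source URL
url = "https://example.com/data/file1.csv.gz"
resp = requests.get(url)
resp.raise_for_status()

# Upload it as-is to a GCS bucket so Dataflow/Dataprep or BigQuery can read it from there
client = storage.Client()
bucket = client.bucket("my-bucket")
bucket.blob("raw/file1.csv.gz").upload_from_string(resp.content)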

Pyspark: loading a zip file from blob storage

I'm using Pyspark to try to read a zip file from blob storage. I want to unzip the file once loaded, and then write the unzipped CSVs back to blob storage.
I'm following this guidance which explains how to unzip the file once read: https://docs.databricks.com/_static/notebooks/zip-files-python.html
But it doesn't explain how to read the zip from blob storage. I have the following code:
file_location = "path_to_my.zip"
df = sqlContext.read.format("file_location").load
I expected this to load the zip into Databricks as df, and from there I could follow the advice in the article to unzip it, load the CSVs into a dataframe and then write the dataframes back to blob storage.
Any ideas on how to initially read the zip file from blob using pyspark?
Thanks,
As shown in the first cell of your Databricks notebook, you need to download the zip file and decompress it somehow. Your case is different because you are using Azure Blob Storage and you want to do everything in Python (no other shell application).
This page documents the process for accessing files in Azure Blob Storage. You need to follow these steps:
Install the package azure-storage-blob.
Import the SDK modules and set the necessary credentials (reference).
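For example, a minimal sketch of that step (it assumes the connection string is stored in an environment variable; the variable name is only illustrative):
import os
from azure.storage.blob import BlobServiceClient

# Connection string for the storage account, read here from an environment variable
connect_str = os.getenv("AZURE_STORAGE_CONNECTION_STRING")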
Create an instance of BlobServiceClient using a connection string:
# Create the BlobServiceClient object which will be used to create a container client
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
Create an instance of BlobClient for the file you want:
blob_client = blob_service_client.get_blob_client(container="container", blob="path_to_my.zip")
Download the blob (the zip archive) and extract it. Since the file is a .zip archive rather than a .gz file, use the zipfile module (gzip only handles .gz compression). I would write something like this:
import io
import zipfile

# Download the zip archive into memory and extract its members to a local folder
with zipfile.ZipFile(io.BytesIO(blob_client.download_blob().readall())) as zf:
    zf.extractall("./my/local")
Use an extracted CSV path (for example "./my/local/filepath.csv") to create the DataFrame.

Is there any way to read numpy array in Azure Blob Storage in python?

I would like to read a numpy array stored in Azure Blob Storage from Python code in Azure Functions. I am not able to do it. I tried with BlockBlobService but couldn't succeed.
Looking for help/suggestions.
The simple solution is to use the Azure Storage SDK for Python to download the blob content into memory, then use the function numpy.frombuffer to load the blob content as a 1-dimensional array, and continue with any operations you want, such as reshape.
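As a rough sketch of that approach using the newer azure-storage-blob package (the container and blob names, the dtype, and the shape are placeholders; they depend on how the array was originally saved):
import numpy as np
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connect_str)
blob_client = blob_service_client.get_blob_client(container="data", blob="array.bin")

# Download the raw bytes, reinterpret them as a 1-D array, then reshape
raw = blob_client.download_blob().readall()
arr = np.frombuffer(raw, dtype=np.float64).reshape(100, 100)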
As a reference, there is an SO thread, numpy.load from io.BytesIO stream, which is similar to yours; I answered it with three code samples for loading npy or npz format content from a blob.
If you want to read a csv/json file from Azure Blob Storage, you can also generate a blob URL with a SAS token, then use the function pandas.read_csv or pandas.read_json with <blob url with sas token> as the URL parameter to read the csv or json content directly as a pandas DataFrame, and then convert it to a numpy array.
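For example (the blob URL with SAS token below is just a placeholder):
import pandas as pd

# Read the CSV directly over HTTPS using a blob URL that includes a SAS token
blob_url_with_sas = "https://myaccount.blob.core.windows.net/mycontainer/data.csv?<sas-token>"
df = pd.read_csv(blob_url_with_sas)
arr = df.to_numpy()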

Python code to load CSV data from Google Storage to Bigquery?

I am pretty new to this, so I wanted the code and process to load data from a CSV file (placed in Google Cloud Storage) into a BigQuery table using Python code and Dataflow.
Thanks in advance.
There are different BigQuery client libraries depending on the language. For Python you would use this one.
But if what you want is exactly the piece of code to upload a CSV from Google Cloud Storage to BigQuery, this example might work for you: "Loading CSV data into a new table".
You can also see, on the same documentation page, "Appending to or overwriting a table with CSV data".
You can also go to the GitHub repository to check all the methods available for Python.
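A minimal sketch of that documented pattern with the google-cloud-bigquery client (the bucket, file, and table names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)
# Load the CSV straight from Google Cloud Storage into the destination table
load_job = client.load_table_from_uri(
    "gs://my-bucket/my-data.csv",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish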

Python dropbox - Opening spreadsheets

I was testing with the Dropbox-provided API for Python. My goal was to read a spreadsheet in my Dropbox without downloading it to my local storage.
import dropbox
dbx = dropbox.Dropbox('my-token')
print(dbx.users_get_current_account())
fl = dbx.files_get_preview('/CGPA.xlsx')[1] # returns a Response object
After the above code, accessing fl.text gives HTML output which shows the preview that would be seen if the file were opened in a browser, and the data can be parsed from it.
My question is whether there is a built-in method of the SDK for getting any particular info from the spreadsheet, like the data of a row or a cell, preferably in JSON format. I previously used butterdb for extracting data from a Google Drive spreadsheet. Is there such functionality for Dropbox? I could not figure it out by reading the docs: http://dropbox-sdk-python.readthedocs.io/en/master/
No, the Dropbox API doesn't offer the ability to selectively query parts of a spreadsheet file like this without downloading the whole file, but we'll consider it a feature request.
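The usual workaround is to download the whole file and parse it locally. A short sketch of that (assuming pandas and an Excel engine such as openpyxl are installed):
import io
import dropbox
import pandas as pd

dbx = dropbox.Dropbox('my-token')
# files_download returns (metadata, HTTP response); the response body holds the raw .xlsx bytes
metadata, res = dbx.files_download('/CGPA.xlsx')
df = pd.read_excel(io.BytesIO(res.content))
print(df.to_json(orient='records'))  # e.g. dump the rows as JSON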
