Azure Data Lake - read using Python

I am trying to read a file from Azure Data Lake using Python in a Databricks notebook.
This is the code I used:
from azure.storage.filedatalake import DataLakeFileClient
file = DataLakeFileClient.from_connection_string("DefaultEndpointsProtocol=https;AccountName=mydatalake;AccountKey=******;EndpointSuffix=core.windows.net",file_system_name="files", file_path="/2020/50002")
with open("./sample.txt", "wb") as my_file:
    download = file.download_file()
    content = download.readinto(my_file)
    print(content)
The output I get is 0. Can someone point out what I am doing wrong? My expectation is to print the file content.

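The from_connection_string method returns a DataLakeFileClient, and readinto() returns the number of bytes it wrote to the stream rather than the file content, so print(content) shows a byte count.
If you want to download a file to local storage, you can refer to my code below, which starts from a DataLakeServiceClient and reads the content with readall().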
from azure.storage.filedatalake import DataLakeServiceClient
service_client = DataLakeServiceClient.from_connection_string("DefaultEndpointsProtocol=https;AccountName=***;AccountKey=*****;EndpointSuffix=core.windows.net")
# Walk down from the account to the file: file system -> directory -> file
file_system_client = service_client.get_file_system_client(file_system="test")
directory_client = file_system_client.get_directory_client("testdirectory")
file_client = directory_client.get_file_client("test.txt")
# Download the whole file into memory as bytes
download = file_client.download_file()
downloaded_bytes = download.readall()
# The with block closes the file, so no explicit close() is needed
with open("./sample.txt", "wb") as my_file:
    my_file.write(downloaded_bytes)
If you want more sample code, you can refer to this doc: Azure Data Lake Storage Gen2.
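Since the goal in the question was to print the file content, here is a minimal sketch of decoding the downloaded bytes into text. The read_text helper and the two Fake* classes are hypothetical stand-ins added so the sketch runs without a storage account; with the real SDK you would pass the file_client from the code above, and UTF-8 is an assumed encoding.

```python
class FakeDownload:
    """Hypothetical stand-in for the object returned by download_file()."""
    def __init__(self, payload):
        self._payload = payload

    def readall(self):
        # Mirrors readall(): return the whole content as bytes
        return self._payload


class FakeFileClient:
    """Hypothetical stand-in for a DataLakeFileClient."""
    def __init__(self, payload):
        self._payload = payload

    def download_file(self):
        return FakeDownload(self._payload)


def read_text(file_client, encoding="utf-8"):
    """Download a file and decode its bytes to a string (encoding is an assumption)."""
    return file_client.download_file().readall().decode(encoding)


content = read_text(FakeFileClient("line 1\nline 2\n".encode("utf-8")))
print(content)
```

Printing the decoded string shows the content itself, whereas the return value of readinto() is only a byte count.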

Related

Read pdf file from storage account (Azure Data lake) without downloading it using python

I am trying to read a pdf file which I have uploaded on an Azure storage account. I am trying to do this using python.
I have tried using the SAS token/URL of the file and passing it through PDFMiner, but I am not able to get a path to the file that PDFMiner will accept. I am using something like the code below:
from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import generate_file_sas
import os
storage_account_name = "mystorageaccount"
storage_account_key = "mystoragekey"
container_name = "mycontainer"
directory_name = 'mydirectory'
service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
"https", storage_account_name), credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
directory_client = file_system_client.get_directory_client(directory_name)
file_client = directory_client.get_file_client('XXX.pdf')
download = file_client.download_file()
downloaded_bytes = download.readall()
file_sas = generate_file_sas(account_name=storage_account_name, file_system_name=container_name, directory_name=directory_name, file_name=dir_name, credential=storage_account_key)
# Option 1: pass the downloaded bytes to open()
from pdfminer.pdfpage import PDFPage
with open(downloaded_bytes, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)
# Option 2: pass the SAS token to open()
from pdfminer.pdfpage import PDFPage
with open(file_sas, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)
Neither option works.
Initially the input_dir was set up locally, so the code was able to fetch the pdf file and read it.
Is there a different way to pass the URL/path of the file from the storage account to the pdf's read function?
Any help is appreciated.
I tried this in my environment and got the results below.
Initially, I tried the same process, without downloading the PDF file from the Azure Data Lake storage account, and got no results. As far as I know, downloading the file first and then reading it is a workable approach.
I used the code below to read the PDF file with the PyPDF2 module, and it printed the content successfully.
Code:
from azure.storage.filedatalake import DataLakeFileClient
import PyPDF2
file_client = DataLakeFileClient.from_connection_string("<your storage connection string>", file_system_name="test", file_path="dem.pdf")
# Download the PDF to a local file
with open("dem.pdf", 'wb') as file:
    data = file_client.download_file()
    data.readinto(file)
# Read the local copy; PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.x
with open("dem.pdf", 'rb') as pdf_file:
    pdfread = PyPDF2.PdfReader(pdf_file)
    print("Number of pages:", len(pdfread.pages))
    pageObj = pdfread.pages[0]
    print(pageObj.extract_text())
You can also read the pdf file in a browser using the file URL followed by the SAS token:
https://<storage account name>.dfs.core.windows.net/test/dem.pdf?<sas-token>
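On the original goal of avoiding a local file: open() expects a path, which is why passing downloaded bytes or a SAS token to it fails. A possible alternative, sketched below, is to wrap the downloaded bytes in io.BytesIO so that a parser accepting file-like objects can read them from memory. The as_file_object helper and the sample bytes are assumptions for illustration; in practice the bytes would come from download.readall().

```python
import io


def as_file_object(downloaded_bytes):
    """Wrap raw bytes in a seekable, file-like object (hypothetical helper)."""
    return io.BytesIO(downloaded_bytes)


# Stand-in for file_client.download_file().readall(); not a real PDF
downloaded_bytes = b"%PDF-1.4 example bytes"

infile = as_file_object(downloaded_bytes)
header = infile.read(5)   # read like any file opened in 'rb' mode
infile.seek(0)            # rewind before handing the object to a parser
print(header)

# With pdfminer installed, the in-memory object could be passed where the
# question passed an open file, for example:
# PDFPage.get_pages(infile, check_extractable=False)
```

This keeps the whole PDF in memory, so it suits files that comfortably fit in RAM.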

Download Blob To Local Storage using Python

I'm trying to download a blob file & store it locally on my machine. The file format is HDF5 (a format I have limited/no experience of so far).
So far I've been successful in downloading something using the script below. The key issue is that it doesn't seem to be the full file. When I download the file directly from Storage Explorer it is about 4,000 KB. The HDF5 file I save is 2 KB.
What am I doing wrong? Am I missing a readall() somewhere?
This is my first time working with blob storage and HDF5 files, so I am a little stuck right now. A lot of the old questions seem to use deprecated commands, as the azure.storage.blob module has been updated.
from azure.storage.blob import BlobServiceClient
from io import StringIO, BytesIO
import h5py
# Initialise client
blob_service_client = BlobServiceClient.from_connection_string("my_conn_str")
# Initialise container
blob_container_client = blob_service_client.get_container_client("container_name")
# Get blob
blob_client = blob_container_client.get_blob_client("file_path")
# Download
download_stream = blob_client.download_blob()
# Create empty stream
stream = BytesIO()
# Read downloaded blob into stream
download_stream.readinto(stream)
# Create new empty hdf5 file
hf = h5py.File('data.hdf5', 'w')
# Write stream into empty HDF5
hf.create_dataset('dataset_1',stream)
# Close Blob (& save)
hf.close()
I reproduced the scenario on my system and faced the same issue with the code you tried.
So I tried another solution: read the HDF5 file as a stream and write its contents into another HDF5 file.
Try this solution. I used some dummy data for testing purposes.
from azure.storage.blob import BlobServiceClient
from io import BytesIO
import numpy as np
import h5py
# Initialise client
blob_service_client = BlobServiceClient.from_connection_string("Connection String")
# Initialise container ("test" is the container name)
blob_container_client = blob_service_client.get_container_client("test")
# Get blob ("test.hdf5" is the blob name)
blob_client = blob_container_client.get_blob_client("test.hdf5")
# Download the entire blob into memory here;
# this can be a big problem if the file is many gigabytes
stream = BytesIO()
downloader = blob_client.download_blob()
downloader.readinto(stream)
print("downloaded the blob")
# dummy data (overwritten below by the data read from the blob)
data_matrix = np.random.uniform(-1, 1, size=(10, 3))
# h5py can open the in-memory stream and read data from it
with h5py.File(stream, "r") as f:
    # List all groups
    print("Keys: %s" % f.keys())
    a_group_key = list(f.keys())[0]
    # Get the data
    data = list(f[a_group_key])
    data_matrix = data
    print(data)
# Write the data into a new local HDF5 file
with h5py.File("file1.hdf5", "w") as data_file:
    data_file.create_dataset("group_name", data=data_matrix)
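If the goal is only to store the blob locally, byte for byte, a simpler route may be to write the download stream straight to a file opened in binary mode instead of re-creating an HDF5 dataset. The sketch below assumes an object with a readinto(stream) method, as returned by blob_client.download_blob(); save_download and FakeDownloader are hypothetical names, and the fake exists only so the example runs without a storage account.

```python
import os
import tempfile


def save_download(downloader, local_path):
    """Write a download stream to local_path and return the file size in bytes."""
    with open(local_path, "wb") as fh:
        downloader.readinto(fh)
    return os.path.getsize(local_path)


class FakeDownloader:
    """Hypothetical stand-in for the object returned by download_blob()."""
    def __init__(self, payload):
        self._payload = payload

    def readinto(self, stream):
        stream.write(self._payload)
        return len(self._payload)


payload = b"\x89HDF\r\n\x1a\n" + b"\x00" * 100  # placeholder bytes, not a valid HDF5 file
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.hdf5")
    size = save_download(FakeDownloader(payload), target)
    with open(target, "rb") as fh:
        saved = fh.read()
print(size)
```

Comparing the returned size with the size shown in Storage Explorer is a quick way to check that the full file arrived.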

Stream Files to Zip File in Azure Blob Storage using Python?

I have the following problem in Python:
I am looking to create a zipfile in Blob Storage consisting of files from an array of URLs but I don't want to create the entire zipfile in memory and then upload it. I ideally want to stream the files to the zipfile in blob storage. I found this write up for C# https://andrewstevens.dev/posts/stream-files-to-zip-file-in-azure-blob-storage/
as well as this answer also in C# https://stackoverflow.com/a/54767264/10550055 .
I haven't been able to find equivalent functionality in the python azure blob SDK and python zipfile library.
Try this:
from zipfile import ZipFile
from azure.storage.blob import BlobServiceClient
import os, requests
tempPath = '<temp path>'
if not os.path.isdir(tempPath):
    os.mkdir(tempPath)
zipFileName = 'test.zip'
# Join paths explicitly so tempPath does not need a trailing separator
zipFilePath = os.path.join(tempPath, zipFileName)
storageConnstr = ''
container = ''
blob = BlobServiceClient.from_connection_string(storageConnstr).get_container_client(container).get_blob_client(zipFileName)
fileURLs = {'https://cdn.pixabay.com/photo/2015/04/23/22/00/tree-736885__480.jpg',
            'http://1812.img.pp.sohu.com.cn/images/blog/2009/11/18/18/8/125b6560a6ag214.jpg',
            'http://513.img.pp.sohu.com.cn/images/blog/2009/11/18/18/27/125b6541abcg215.jpg'}
def download_url(url, save_path, chunk_size=128):
    # Stream the response to disk in small chunks
    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)
# download each file and write it to the zip
with ZipFile(zipFilePath, 'w') as zipObj:
    for url in fileURLs:
        localFilePath = os.path.join(tempPath, os.path.basename(url))
        download_url(url, localFilePath)
        zipObj.write(localFilePath)
# upload zip
with open(zipFilePath, 'rb') as stream:
    blob.upload_blob(stream)
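The answer above stages every file on disk before zipping. As a partial step toward streaming, the sketch below writes each entry into the archive chunk by chunk with ZipFile.open(name, "w"), so no single source file has to be held in memory or saved separately. The zip_from_chunks helper and the sample entries are assumptions for illustration; the chunks could come from r.iter_content(...), and the archive itself still lands in a buffer or temp file before upload, so this is not full end-to-end streaming to Blob Storage.

```python
import io
import zipfile


def zip_from_chunks(entries, out_stream):
    """Write (name, iterable_of_byte_chunks) pairs into a zip on out_stream."""
    with zipfile.ZipFile(out_stream, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, chunks in entries:
            # Open the archive member for writing and feed it chunk by chunk
            with zf.open(name, "w") as member:
                for chunk in chunks:
                    member.write(chunk)


# Hypothetical sample data standing in for downloaded chunks
entries = [
    ("a.txt", [b"hello ", b"world"]),
    ("b.txt", [b"1", b"2", b"3"]),
]
buffer = io.BytesIO()
zip_from_chunks(entries, buffer)

# Read the archive back to check its contents
buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    first = zf.read("a.txt")
print(names, first)
```

After buffer.seek(0), the buffer could be handed to blob.upload_blob(buffer) in place of the file opened in the answer.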

How to convert .docx to .txt in Python

I would like to convert a large batch of MS Word files into the plain text format. I have no idea how to do it in Python. I found the following code online. My path is local and all file names are like cx-xxx (i.e. c1-000, c1-001, c2-000, c2-001 etc.):
from docx import Document
import io
import os
def convertDocxToText(path):
    for d in os.listdir(path):
        fileExtension = d.split(".")[-1]
        if fileExtension == "docx":
            docxFilename = os.path.join(path, d)
            print(docxFilename)
            document = Document(docxFilename)
            # Write the text next to the source file, with a .txt extension
            textFilename = os.path.splitext(docxFilename)[0] + ".txt"
            with io.open(textFilename, "w", encoding="utf-8") as textFile:
                for para in document.paragraphs:
                    # One paragraph per line; unicode() is not needed in Python 3
                    textFile.write(para.text + "\n")
path = "/home/python/resumes/"
convertDocxToText(path)
Convert docx to txt with pypandoc:
import pypandoc
# Example file:
docxFilename = 'somefile.docx'
output = pypandoc.convert_file(docxFilename, 'plain', outputfile="somefile.txt")
assert output == ""
See the official documentation here:
https://pypi.org/project/pypandoc/
You can also use the docx2txt library in Python. Here's an example.
I use glob to iterate over all DOCX files in the folder.
Note: I slice the original name to drop the .docx extension so that I can reuse it in the TXT filename.
If there's anything I've forgotten to explain, tag me and I'll edit it in.
import docx2txt
import glob
directory = glob.glob('C:/folder_name/*.docx')
for file_name in directory:
    # docx2txt.process takes the path to the .docx file and returns its text
    doc = docx2txt.process(file_name)
    # file_name[:-5] strips the ".docx" extension
    with open(file_name[:-5] + '.txt', 'w', encoding='utf-8') as outfile:
        outfile.write(doc)
print("=========")
print("All done!")
GroupDocs.Conversion Cloud SDK for Python supports conversion between 50+ file formats. Its free plan provides 150 free API calls per month.
# Import module
import groupdocs_conversion_cloud
from shutil import copyfile
# Get your client_id and client_key at https://dashboard.groupdocs.cloud (free registration is required).
client_id = "xxxxx-xxxx-xxxx-xxxx-xxxxxxxx"
client_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(client_id, client_key)
try:
    # Convert DOCX to txt
    # Prepare request
    request = groupdocs_conversion_cloud.ConvertDocumentDirectRequest("txt", "C:/Temp/sample.docx")
    # Convert
    result = convert_api.convert_document_direct(request)
    copyfile(result, 'C:/Temp/sample.txt')
except groupdocs_conversion_cloud.ApiException as e:
    print("Exception when calling convert_document_direct: {0}".format(e.message))
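For cases where installing a third-party package is not an option, here is a rough standard-library sketch. It relies on the fact that a .docx file is a zip archive whose main text lives in word/document.xml, with paragraphs in w:p elements and text runs in w:t elements. The docx_to_text helper and the tiny in-memory sample are assumptions for illustration; the sketch ignores headers, footnotes, and formatting, and is not a replacement for the libraries above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_file):
    """Return the paragraph text of a .docx (path or file-like), one paragraph per line."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs that belong to this paragraph
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


# Build a minimal stand-in archive in memory (not a complete Word document)
document_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>First line</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>line</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as zf:
    zf.writestr("word/document.xml", document_xml)
sample.seek(0)

text = docx_to_text(sample)
print(text)
```

For a batch, the same function could be called on each path returned by glob.glob, writing the result to a .txt file with the same base name.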

Uploading CSV file to Azure Data Lake Store(ADLS) Gen 2 using Python SDK

[UPDATE - 5/15/2020 - I got this code and the entire flow working with the parquet file format. However, I would still be interested in an approach using CSV.]
I am trying to upload a csv file from a local machine to ADLS Gen 2 storage using the below command. This works fine, but the resulting csv file in ADLS is a continuous text with no new line character to separate each row. This CSV file cannot be loaded into Azure Synapse as is using Polybase.
Input CSV -
"col1","col2","col3"
"NJ","1","1/3/2020"
"NY","1","1/4/2020"
...
Output CSV that I get is like this -
"col1","col2","col3""NJ","1","1/3/2020""NY","1","1/4/2020"...
How do I make sure my final CSV has a newline character after each row? There are a few hundred thousand records in each CSV.
import os, uuid, sys
from azure.storage.filedatalake import DataLakeServiceClient
from azure.core._match_conditions import MatchConditions
from azure.storage.filedatalake._models import ContentSettings
try:
    global service_client
    service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
        "https", "<storage-account>"), credential="<secret>")
    file_system_client = service_client.get_file_system_client(file_system="import")
    dest_directory_client = file_system_client.get_directory_client("Destination")
    f = open("file-path/cashreceipts.csv",'r')
    dest_file_client = dest_directory_client.create_file("cashreceipts.csv")
    file_contents = f.read()
    dest_file_client.upload_data(file_contents, overwrite=True)
    f.close()
except Exception as e:
    print(e)
I tried this approach as well -
dest_file_client.append_data(data=file_contents, offset=0, length=len(file_contents))
dest_file_client.flush_data(len(file_contents))
I am referring to the Microsoft documentation here which describes the approach for a text file - https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python
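One thing that may be worth checking, offered as a sketch and not as a confirmed fix: reading the local file in binary mode ('rb') passes the bytes through unchanged, including the original line terminators, whereas text mode ('r') can translate them. The snippet below writes a small sample CSV, reads it back as bytes, and counts the line terminators so they can be verified before upload. read_csv_bytes and the temp-file setup are hypothetical; the rows are taken from the question.

```python
import os
import tempfile


def read_csv_bytes(path):
    """Read a file as raw bytes so line terminators are preserved exactly."""
    with open(path, "rb") as fh:
        return fh.read()


rows = ['"col1","col2","col3"', '"NJ","1","1/3/2020"', '"NY","1","1/4/2020"']
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "cashreceipts.csv")
    # newline="" stops Python from translating the \r\n written below
    with open(csv_path, "w", newline="") as fh:
        fh.write("\r\n".join(rows) + "\r\n")
    payload = read_csv_bytes(csv_path)

line_count = payload.count(b"\r\n")
print(line_count)

# The bytes could then be uploaded in place of the text read in the question:
# dest_file_client.upload_data(payload, overwrite=True)
```

If the count matches the number of rows, the line terminators are present in the data being uploaded, and any remaining problem would lie in how the file is viewed or loaded downstream.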
