I've set up a script to process Excel files uploaded by a user. The script works fine when the file is stored on the local disk.
from openpyxl import load_workbook
wb = load_workbook("file_path.xlsx") # Load workbook from disk works fine
ws = wb.worksheets[0]
I've then set up django-storages so that user-uploaded files are stored on DigitalOcean Spaces.
My problem now is how to access and process the cloud-stored file. For the record, if I pass the file URL to load_workbook it fails with the error No such file or directory: file_url.
Do I have to download the file using requests and then process it as a local file? That feels inefficient. What options do I have?
You can get the byte content of the file, wrap it in a ContentFile and pass that to openpyxl. Assuming your model is FileContainer and the field name is file:
from django.core.files.base import ContentFile
from openpyxl import load_workbook
fc = FileContainer.objects.first()
bytefile = fc.file.read()
wb = load_workbook(ContentFile(bytefile))
ws = wb.worksheets[0]
I checked it with S3 and it works just fine.
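As a variant, load_workbook accepts any file-like object, so the raw bytes can also be wrapped in a plain io.BytesIO with no Django import needed. A self-contained sketch, where an in-memory workbook stands in for the bytes you would get from fc.file.read():

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook

# Build a workbook in memory to stand in for the bytes fetched
# from the storage backend (fc.file.read() in the answer above)
wb_out = Workbook()
wb_out.active.append(["hello", "world"])
buffer = BytesIO()
wb_out.save(buffer)
raw_bytes = buffer.getvalue()

# load_workbook takes any file-like object, so wrapping the raw
# bytes in BytesIO is enough; no ContentFile or temp file required
wb = load_workbook(BytesIO(raw_bytes))
ws = wb.worksheets[0]
print(ws["A1"].value)
```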
If you want to actually save the file locally, you can try this:
from django.core.files.base import ContentFile
from django.core.files.storage import FileSystemStorage
from openpyxl import load_workbook
fc = FileContainer.objects.first()
local_storage = FileSystemStorage()
bytefile = fc.file.read()
newfile = ContentFile(bytefile)
relative_path = local_storage.save(fc.file.name, newfile)
wb = load_workbook(local_storage.path(relative_path))
ws = wb.worksheets[0]
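If the local copy is only needed for the duration of one request, a tempfile.NamedTemporaryFile can stand in for FileSystemStorage so nothing lingers in MEDIA_ROOT; a sketch where file_bytes is a placeholder for fc.file.read():

```python
import os
import tempfile

file_bytes = b"fake xlsx content"  # stands in for fc.file.read()

# delete=False so the file can be reopened by path after the with-block;
# it must then be removed manually
with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as tmp:
    tmp.write(file_bytes)
    tmp_path = tmp.name

try:
    # wb = load_workbook(tmp_path)  # process like any local file
    size = os.path.getsize(tmp_path)
finally:
    os.remove(tmp_path)

print(size)
```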
Related
I am trying to read a PDF file which I have uploaded to an Azure storage account, using Python.
I have tried using the SAS token/URL of the file and passing it through PDFMiner, but I am not able to get a path to the file that PDFMiner will accept. I am using something like the code below:
from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import generate_file_sas
import os
storage_account_name = "mystorageaccount"
storage_account_key = "mystoragekey"
container_name = "mycontainer"
directory_name = 'mydirectory'
service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
"https", storage_account_name), credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
directory_client = file_system_client.get_directory_client(directory_name)
file_client = directory_client.get_file_client('XXX.pdf')
download = file_client.download_file()
downloaded_bytes = download.readall()
file_sas = generate_file_sas(account_name=storage_account_name,
                             file_system_name=container_name,
                             directory_name=directory_name,
                             file_name=dir_name,
                             credential=storage_account_key)
from pdfminer.pdfpage import PDFPage
with open(downloaded_bytes, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)
from pdfminer.pdfpage import PDFPage
with open(file_sas, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)
Neither of these options works.
Initially the input_dir was set up locally, so the code was able to fetch the PDF file and read it.
Is there a different way to pass the URL/path of the file in the storage account to the PDF read function?
Any help is appreciated.
I tried this in my environment and got the results below.
Initially I tried the same process without downloading the PDF files from the Azure Data Lake storage account and got no results. As far as I know, downloading the file first is the workable approach.
I tried the code below to read the PDF file with the PyPDF2 module, and it executed successfully:
Code:
from azure.storage.filedatalake import DataLakeFileClient
import PyPDF2
service_client = DataLakeFileClient.from_connection_string("<your storage connection string>",file_system_name="test",file_path="dem.pdf")
with open("dem.pdf", 'wb') as file:
    data = service_client.download_file()
    data.readinto(file)
with open("dem.pdf", 'rb') as pdf_file:
    pdfread = PyPDF2.PdfFileReader(pdf_file)
    print("Number of pages:", pdfread.numPages)
    pageObj = pdfread.getPage(0)
    print(pageObj.extractText())
You can also read the PDF file in the browser using the file URL with a SAS token appended:
https://<storage account name>.dfs.core.windows.net/test/dem.pdf?<sas-token>
I am trying to get the genres of the songs in regional-us-daily-latest and output the genres and other data as a CSV file. But Colab said:
FileNotFoundError: [Errno 2] No such file or directory: 'regional-us-daily-latest.csv'
I mounted My Drive, but it still didn't work.
Could you shed some light on this?
!pip3 install spotipy
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import json
from google.colab import drive
client_id = 'ID'
client_secret = 'SECRET'
client_credentials_manager = spotipy.oauth2.SpotifyClientCredentials(client_id, client_secret)
spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
import csv
csvfile = open('/content/drive/MyDrive/regional-us-daily-latest.csv', encoding='utf-8')
csvreader = csv.DictReader(csvfile)
us = ("regional-us-daily-latest.csv", "us.csv")
for region in (us,):
    inputfile = region[0]
    outputfile = region[1]
    songs = pd.read_csv(inputfile, index_col=0, header=1)
    songs = songs.assign(Genre=0)
    for index, row in songs.iterrows():
        artist = row["Artist"]
        result = spotify.search(artist, limit=1, type="artist")
        genre = result["artists"]["items"][0]["genres"]
        songs['Genre'][index] = genre
    songs.head(10)
    songs.to_csv(outputfile)
    files.download(outputfile)  # requires: from google.colab import files
Save the CSV file in your Google Drive, go to your notebook, click on Drive, and search for your file in the Drive. Then copy the path of the CSV file into a variable and pass that variable to the read_csv() method.
First, mount the drive:
from google.colab import drive
drive.mount('/content/drive')
Change the directory to MyDrive and check the current directory:
import os
os.chdir("drive/My Drive/")
print(os.getcwd())
!ls
Set the path of the file, and use the source_file variable wherever the file name is required:
source_file = os.path.join(os.getcwd(), "regional-us-daily-latest.csv")
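Putting the steps above together (the /content/drive prefix is Colab's standard mount point from drive.mount):

```python
import os

# After drive.mount('/content/drive'), files in "My Drive" appear here
drive_dir = "/content/drive/MyDrive"
source_file = os.path.join(drive_dir, "regional-us-daily-latest.csv")

# songs = pd.read_csv(source_file, index_col=0, header=1)
print(source_file)
```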
For purposes of security, I need to move a file to Azure Data Lake storage without writing the file locally. This is an Excel workbook that is being created with the xlsxwriter package. Here is what I have tried, which returns ValueError: Seek only available in read mode.
import pandas as pd
from azure.datalake.store import core, lib, multithread
import xlsxwriter as xl
# Dataframes have undergone manipulation not listed in this code and come from a DB connection
matrix = pd.DataFrame(Database_Query1)
raw = pd.DataFrame(Database_Query2)
# Name datalake path for workbook
dlpath = '/datalake/file/path/file_name.xlsx'
# List store name
store_name = 'store_name_here'
# Create auth token
token = lib.auth(tenant_id= 'tenant_id_here',
client_id= 'client_id_here',
client_secret= 'client_secret_here')
# Create management file system client object
adl = core.AzureDLFileSystem(token, store_name= store_name)
# Create workbook structure
writer = pd.ExcelWriter(adl.open(dlpath, 'wb'), engine= 'xlsxwriter')
matrix.to_excel(writer, sheet_name= 'Compliance')
raw.to_excel(writer, sheet_name= 'Raw Data')
writer.save()
Any ideas? Thanks in advance.
If the data is not monstrously huge, you might consider keeping the bytes in memory and dump the stream back to your adl:
from io import BytesIO
xlb = BytesIO()
# ... do what you need to do ... #
writer = pd.ExcelWriter(xlb, engine= 'xlsxwriter')
matrix.to_excel(writer, sheet_name= 'Compliance')
raw.to_excel(writer, sheet_name= 'Raw Data')
writer.save()
# Set the cursor of the stream back to the beginning
xlb.seek(0)
with adl.open(dlpath, 'wb') as fl:
    # This part I'm not entirely sure about - consult your adl write methods
    fl.write(xlb.read())
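One small simplification to the stream handling: BytesIO.getvalue() returns the entire buffer regardless of where the cursor is, so the seek(0)/read() pair can be collapsed. A sketch with a stand-in payload:

```python
from io import BytesIO

xlb = BytesIO()
xlb.write(b"workbook bytes")  # stands in for writer.save() output

# getvalue() ignores the cursor position, so no seek(0) is needed
payload = xlb.getvalue()

# with adl.open(dlpath, 'wb') as fl:   # as in the answer above
#     fl.write(payload)
print(len(payload))
```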
In my Flask web app, I am writing data from Excel to a temporary file which I then parse in memory. This method works fine with xlrd, but it does not with openpyxl.
Here is how I am writing to a temporary file which I then parse with xlrd.
xls_str = request.json.get('file')

try:
    xls_str = xls_str.split('base64,')[1]
    xls_data = b64decode(xls_str)
except IndexError:
    return 'Invalid form data', 406

save_path = os.path.join(tempfile.gettempdir(), random_alphanum(10))
with open(save_path, 'wb') as f:
    f.write(xls_data)

try:
    bundle = parse(save_path, current_user)
except UnsupportedFileException:
    return 'Unsupported file format', 406
except IncompatibleExcelException as ex:
    return str(ex), 406
finally:
    os.remove(save_path)
When I use openpyxl with the code above, it complains about an unsupported type. That is because I'm using a temporary file to parse the data, so it doesn't have an ".xlsx" extension; and even if I added one, it would not work, because it's not really an Excel file after all.
openpyxl.utils.exceptions.InvalidFileException: openpyxl does not support file format,
please check you can open it with Excel first. Supported formats are: .xlsx,.xlsm,.xltx,.xltm
What should I do?
Why not create a temporary Excel file with openpyxl instead? Give this example a try; I did something similar in the past.
from io import BytesIO

from flask import send_file
from openpyxl import Workbook
from openpyxl.writer.excel import save_virtual_workbook

def create_xlsx():
    wb = Workbook()
    ws = wb.active
    row = ('Hello', 'Boosted_d16')
    ws.append(row)
    return wb

@app.route('/', methods=['GET'])
def main():
    xlsx = create_xlsx()
    filename = BytesIO(save_virtual_workbook(xlsx))
    return send_file(
        filename,
        attachment_filename='test.xlsx',
        as_attachment=True
    )
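For what it's worth, save_virtual_workbook has since been deprecated in openpyxl; Workbook.save accepts a file-like object, so saving straight into a BytesIO gives the same stream. A sketch of the replacement (note that in Flask 2.x the send_file keyword is download_name rather than attachment_filename):

```python
from io import BytesIO

from openpyxl import Workbook

wb = Workbook()
wb.active.append(('Hello', 'Boosted_d16'))

stream = BytesIO()
wb.save(stream)   # writes the .xlsx bytes directly into the stream
stream.seek(0)    # rewind so send_file reads from the start

# return send_file(stream, download_name='test.xlsx', as_attachment=True)
print(stream.getvalue()[:2])  # .xlsx is a zip archive, so this prints b'PK'
```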
I am trying the following code.
from zipfile import ZipFile
from openpyxl import load_workbook
from io import BytesIO
zip_path = r"path/to/zipfile.zip"
with ZipFile(zip_path) as myzip:
    with myzip.open(myzip.namelist()[0]) as myfile:
        wb = load_workbook(filename=BytesIO(myfile.read()))

data_sheet = wb.worksheets[1]
for row in data_sheet.iter_rows(min_row=3, min_col=3):
    print(row[0].value)
It shows:
ValueError: stat: path too long for Windows
Is this possible?
I am trying the logic from Using openpyxl to read file from memory.
With xlrd following code works fine.
with ZipFile(zip_path) as myzip:
    with myzip.open(myzip.namelist()[0]) as myfile:
        book = xlrd.open_workbook(file_contents=myfile.read())
        sh = book.sheet_by_index(0)
        # your code here
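For comparison, the BytesIO pattern from the question does work with current openpyxl; a self-contained check that builds a small .xlsx, zips it in memory, and loads it back the same way:

```python
from io import BytesIO
from zipfile import ZipFile

from openpyxl import Workbook, load_workbook

# Build a small workbook and a zip archive entirely in memory
xlsx_buf = BytesIO()
wb_out = Workbook()
wb_out.active.append(["a", "b"])
wb_out.save(xlsx_buf)

zip_buf = BytesIO()
with ZipFile(zip_buf, "w") as zf:
    zf.writestr("inner.xlsx", xlsx_buf.getvalue())

# Same pattern as in the question: read the first member's bytes
# and hand them to load_workbook wrapped in BytesIO
with ZipFile(zip_buf) as myzip:
    with myzip.open(myzip.namelist()[0]) as myfile:
        wb = load_workbook(filename=BytesIO(myfile.read()))

print(wb.worksheets[0]["A1"].value)
```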