How to compress a JSON lines file and upload it to an Azure container? - python

I am working in Databricks and have a PySpark dataframe that I am converting to pandas, and then to a JSON lines file, which I want to upload to an Azure container (ADLS Gen2). The file is large, and I want to compress it prior to uploading.
I first convert the PySpark dataframe to pandas:
pandas_df = df.select("*").toPandas()
Then I convert it to newline-delimited JSON:
json_lines_data = pandas_df.to_json(orient='records', lines=True)
Then I write it to blob storage with the following function:
from azure.storage.blob import BlobServiceClient

def upload_blob(json_lines_data, connection_string, container_name, blob_name):
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    try:
        # delete the existing blob, if any, before re-uploading
        blob_client.get_blob_properties()
        blob_client.delete_blob()
    except Exception:
        # nothing to delete
        pass
    blob_client.upload_blob(json_lines_data)
This works fine, but the data is around 3 GB per file and takes a long time to download, so I would rather compress the files. Can anyone help with how to compress the JSON lines file and upload it to the Azure container? I have tried a lot of different things, and nothing is working.
If there is a better way to do this in Databricks, I can change it. I did not write the file with Databricks directly because I need to output a single file and control the filename.

You can compress the JSON data before uploading it to blob storage.
Here is code that serializes the data to JSON, encodes it to bytes (UTF-8), and finally compresses it with gzip.
I would suggest calling this before your upload function.
import json
import gzip

def compress_data(data):
    # Convert to JSON
    json_data = json.dumps(data, indent=2)
    # Convert to bytes
    encoded = json_data.encode('utf-8')
    # Compress with gzip
    compressed = gzip.compress(encoded)
    return compressed
Reference: https://gist.github.com/LouisAmon/4bd79b8ab80d3851601f3f9016300ac4#file-json_to_gzip-py
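To tie this back to the question's code, here is a minimal sketch (assuming the same json_lines_data string and connection details as above; the .json.gz blob name is my own choice). Since json_lines_data is already a newline-delimited JSON string, it only needs to be encoded and gzipped before uploading:
import gzip
from azure.storage.blob import BlobServiceClient

def upload_compressed_blob(json_lines_data, connection_string, container_name, blob_name):
    # json_lines_data is already newline-delimited JSON, so just encode and gzip it
    compressed = gzip.compress(json_lines_data.encode('utf-8'))
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    # overwrite=True replaces any existing blob, avoiding the delete-then-upload step
    blob_client.upload_blob(compressed, overwrite=True)

# hypothetical usage: name the blob with a .json.gz suffix so consumers expect gzip
# upload_compressed_blob(json_lines_data, connection_string, container_name, "output.json.gz")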

Related

Unable to read parquet file using StreamingBody from S3 without holding in memory

I'm trying to read a parquet file from S3 and dump its contents onto a Kafka topic.
This isn't too difficult when you hold the entire file in memory, but for large files this isn't feasible.
# using .read() pulls the entire file into memory - not ideal
df = pd.read_parquet(io.BytesIO(s3_response['Body'].read()), columns=columns)
Instead, I'm trying to take advantage of file-like objects in order to stream the parquet file.
My issue is that it seems impossible to do this with Parquet, because Parquet encodes metadata in the footer of the file as well as the header.
Here's an example of my code:
import io

import boto3
import pandas as pd

session = boto3.session.Session()
s3_client = session.client(
    service_name='s3',
    endpoint_url=s3_url,
)
obj = s3_client.get_object(Bucket=s3_bucket, Key=key)
for line in obj['Body'].iter_lines():
    pq_file = io.BytesIO(line)
    df = pd.read_parquet(pq_file, columns=columns)
    # At this point I'd want to iterate over the DF rows
    # and send them to kafka
    print(df)
This results in the following error:
OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
Is it at all possible to do what I'm trying to do? Or due to the nature of parquet files is this impossible?
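As a hedged sketch of one common workaround (not taken from this thread): Parquet cannot be consumed line by line, because the row-group index lives in the footer, but pyarrow can work against a seekable file-like object and iterate over record batches, so only one batch is materialized at a time. The use of s3fs below is an assumption on my part:
import pyarrow.parquet as pq
import s3fs

# s3_url, s3_bucket, key and columns as in the question
fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": s3_url})

# the handle is seekable, so pyarrow reads the footer first,
# then pulls row groups on demand instead of the whole object
with fs.open(f"{s3_bucket}/{key}") as f:
    parquet_file = pq.ParquetFile(f)
    for batch in parquet_file.iter_batches(batch_size=10_000, columns=columns):
        df = batch.to_pandas()
        # iterate over the df rows and send them to Kafka here
        print(df)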

How to read large gzip csv files from azure storage in memory in aws lambda?

I'm trying to transform large gzip CSV files (>3 GB) from an Azure storage blob by loading them into a pandas dataframe in an AWS Lambda function.
My local machine with 16 GB of RAM can process the files, but in AWS Lambda it fails with the memory error "errorType": "MemoryError", since Lambda's maximum memory is 3 GB.
Is there any way to read and process that much data in memory in AWS Lambda? Just to mention, I tried a streaming approach as well, but no luck.
Below is my sample code:
from io import StringIO
import gzip

import pandas as pd
from azure.storage.blob import *

blob = BlobClient(account_url="https://" + SOURCE_ACCOUNT + ".blob.core.windows.net",
                  container_name=SOURCE_CONTAINER,
                  blob_name=blob_name,
                  credential=SOURCE_ACCT_KEY)
data = blob.download_blob()
data = data.readall()
unzipped_bytes = gzip.decompress(data)
unzipped_str = unzipped_bytes.decode("utf-8")
df = pd.read_csv(StringIO(unzipped_str), usecols=req_columns, low_memory=False)
Try using chunks and reading the data n rows at a time:
for chunk in pd.read_csv("large.csv", chunksize=10_000):
    print(chunk)
I'm not sure about compressed files, though. In the worst case you can decompress the data first and then read it in chunks.
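Expanding on that last point with a hedged sketch (assuming the gzip CSV is reachable as a path or URL that pandas can open; "large.csv.gz" is a placeholder): compression="gzip" lets pandas decompress on the fly, while chunksize bounds how many rows are held in a dataframe at once.
import pandas as pd

# decompress-and-read-in-chunks: only ~100k rows live in a dataframe at a time
for chunk in pd.read_csv("large.csv.gz", compression="gzip",
                         usecols=req_columns, chunksize=100_000):
    # aggregate or transform each chunk here, then discard it
    print(chunk.shape)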

Reading excel files from "input" blob storage container and exporting to csv in "output" container with python

I'm trying to develop a Python script that reads an .xlsx file from a blob storage container called "source", converts it to .csv, and stores it in a new container (I'm testing the script locally; if it works I'll include it in an ADF pipeline). So far I've managed to access the blob storage, but I'm having problems reading the file content.
from azure.storage.blob import BlobServiceClient, ContainerClient, BlobClient
import pandas as pd

conn_str = "DefaultEndpointsProtocol=https;AccountName=XXXXXX;AccountKey=XXXXXX;EndpointSuffix=core.windows.net"
container = "source"
blob_name = "prova.xlsx"

container_client = ContainerClient.from_connection_string(
    conn_str=conn_str,
    container_name=container
)

# Download blob as StorageStreamDownloader object (stored in memory)
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_excel(downloaded_blob)
print(df)
I get the following error:
ValueError: Invalid file path or buffer object type: <class 'azure.storage.blob._download.StorageStreamDownloader'>
I tried with a .csv file as input, writing the parsing code as follows:
df = pd.read_csv(StringIO(downloaded_blob.content_as_text()))
and it works.
Any suggestions on how to modify the code so that the Excel file becomes readable?
To summarize the solution:
When we use pd.read_excel() from pandas, we need to provide a path or bytes-like input. But when we use download_blob to download the Excel file from Azure blob storage, we get an azure.storage.blob.StorageStreamDownloader. So we need to call readall() or content_as_bytes() on it to get the bytes. For more details, please refer to the pandas and azure-storage-blob documentation.
Change
df = pd.read_excel(downloaded_blob)
to
df = pd.read_excel(downloaded_blob.content_as_bytes())
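The question also asks about exporting the result as .csv to an "output" container. A minimal sketch under the same connection string (the container and blob names below are assumptions, and pd.read_excel needs the openpyxl engine installed for .xlsx files):
import pandas as pd
from azure.storage.blob import ContainerClient

source_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name="source")
output_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name="output")

# read the xlsx as bytes, as in the accepted fix above
downloaded_blob = source_client.download_blob("prova.xlsx")
df = pd.read_excel(downloaded_blob.content_as_bytes())

# serialize to CSV in memory and upload it under the desired name
csv_bytes = df.to_csv(index=False).encode("utf-8")
output_client.upload_blob(name="prova.csv", data=csv_bytes, overwrite=True)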

Transforming a pandas df to a parquet-file-bytes-object

I have a pandas dataframe and want to write it as a parquet file to the Azure file storage.
So far I have not been able to transform the dataframe directly into a bytes object that I can then upload to Azure.
My current workaround is to save it as a parquet file to the local drive, then read it as a bytes object which I can upload to Azure.
Can anyone tell me how to transform a pandas dataframe directly into a "parquet file" bytes object without writing it to disk? The I/O operation is really slowing things down, and it feels like really ugly code...
# Transform the data_frame into a parquet file on the local drive
data_frame.to_parquet('temp_p.parquet', engine='auto', compression='snappy')

# Read the parquet file back as bytes
with open("temp_p.parquet", mode='rb') as f:
    fileContent = f.read()

# Upload the bytes object to Azure
service.create_file_from_bytes(share_name, file_path, file_name, fileContent, index=0, count=len(fileContent))
I'm looking to implement something like this, where the transform_functionality returns a bytes object:
my_bytes = data_frame.transform_functionality()
service.create_file_from_bytes(share_name, file_path, file_name, my_bytes, index=0, count=len(my_bytes))
I have found a solution; I will post it here in case anyone needs to do the same task. After writing the dataframe to a buffer with to_parquet, I get the bytes object out of the buffer with .getvalue() as follows:
from io import BytesIO

buffer = BytesIO()
data_frame.to_parquet(buffer, engine='auto', compression='snappy')
service.create_file_from_bytes(share_name, file_path, file_name, buffer.getvalue(), index=0, count=buffer.getbuffer().nbytes)

Renaming spark output csv in azure blob storage

I have a Databricks notebook set up that works as follows:
1. PySpark connection details to the Blob storage account
2. Read the file into a Spark dataframe
3. Convert to a pandas df
4. Data modelling on the pandas df
5. Convert back to a Spark df
6. Write to blob storage as a single file
My problem is that you cannot name the output file, and I need a static CSV filename.
Is there a way to rename this in PySpark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""

## File location and file type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"

## Connection string to connect to blob storage
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
Followed by outputting the file after data transformation:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
The file is then written as "part-00000-tid-336943946930983.....csv", whereas the goal is to have "Output.csv".
Another approach I looked at was recreating this in plain Python, but I have not yet found in the documentation how to output the file back to blob storage.
I know the method to retrieve a file from Blob storage is .get_blob_to_path, via microsoft.docs.
Any help here is greatly appreciated.
Hadoop/Spark writes the result of each partition to its own file in parallel, so you will see many part-<number>-... files in the HDFS output path (e.g. Output/) that you named.
If you want all results of a computation in one file, you can merge them with the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the number of partitions to 1, e.g. with the coalesce(1) function.
So in your scenario, you only need to make sure coalesce is called on the dataframe before write/save, as below.
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
coalesce and repartition do not help with saving the dataframe as a single, normally named file.
I ended up just renaming the single CSV file and deleting the folder with the log files:
def save_csv(df, location, filename):
    outputPath = os.path.join(location, filename + '_temp.csv')
    df.repartition(1).write.format("com.databricks.spark.csv").mode("overwrite").options(header="true", inferSchema="true").option("delimiter", "\t").save(outputPath)
    csv_files = os.listdir(os.path.join('/dbfs', outputPath))
    # move the part-xxxx temp csv file to the normally named one
    for file in csv_files:
        if file[-4:] == '.csv':
            dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))
    dbutils.fs.rm(outputPath, True)

# using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')
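As for the pure-Python route the question mentions, a hedged sketch using the current azure-storage-blob SDK (rather than the legacy get_blob_to_path client) is to collect the final dataframe to pandas and upload the CSV text under whatever name you want; the container and blob names below are assumptions:
from azure.storage.blob import BlobServiceClient

connection_string = ""  # connection string for the storage account
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client(container="Databricks_Files", blob="out/Output.csv")

# collect the (already modelled) Spark dataframe to pandas and serialize it as CSV text;
# this only works if the final result comfortably fits in driver memory
csv_text = dfspark.toPandas().to_csv(index=False)
blob_client.upload_blob(csv_text, overwrite=True)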
