Transforming a pandas df to a parquet-file-bytes-object - python

I have a pandas dataframe and want to write it as a parquet file to the Azure file storage.
So far I have not been able to transform the dataframe directly into a bytes object which I can then upload to Azure.
My current workaround is to save it as a parquet file to the local drive, then read it as a bytes object which I can upload to Azure.
Can anyone tell me how I can transform a pandas dataframe directly into a "parquet file" bytes object without writing it to disk? The I/O operation is really slowing things down, and it feels like really ugly code...
# Transform the data_frame into a parquet file on the local drive
data_frame.to_parquet('temp_p.parquet', engine='auto', compression='snappy')
# Read the parquet file as bytes.
with open("temp_p.parquet", mode='rb') as f:
fileContent = f.read()
# Upload the bytes object to Azure
service.create_file_from_bytes(share_name, file_path, file_name, fileContent, index=0, count=len(fileContent))
I'm looking to implement something like this, where the transform_functionality returns a bytes object:
my_bytes = data_frame.transform_functionality()
service.create_file_from_bytes(share_name, file_path, file_name, my_bytes, index=0, count=len(my_bytes))

I have found a solution; I will post it here in case anyone needs to do the same task. After writing the dataframe to a buffer with to_parquet, I get the bytes object out of the buffer with .getvalue(), as follows:
from io import BytesIO
buffer = BytesIO()
data_frame.to_parquet(buffer, engine='auto', compression='snappy')
service.create_file_from_bytes(share_name, file_path, file_name, buffer.getvalue(), index=0, count=buffer.getbuffer().nbytes)
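As a quick sanity check (my addition, not part of the original answer), the bytes can be read straight back into a dataframe from the same buffer, confirming that no disk I/O is involved:
import pandas as pd
# Rewind the in-memory buffer and read the parquet bytes back
buffer.seek(0)
round_trip = pd.read_parquet(buffer)
print(round_trip.head())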

Related

How to compress a JSON Lines file and upload it to an Azure container?

I am working in Databricks and have a PySpark dataframe that I am converting to pandas and then to a JSON Lines file, which I want to upload to an Azure container (ADLS Gen2). The file is large, and I want to compress it prior to uploading.
I first convert the PySpark dataframe to pandas:
pandas_df = df.select("*").toPandas()
Then I convert it to newline-delimited JSON:
json_lines_data = pandas_df.to_json(orient='records', lines=True)
Then I write it to blob storage with the following function:
def upload_blob(json_lines_data, connection_string, container_name, blob_name):
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    try:
        # Delete the blob if it already exists
        blob_client.get_blob_properties()
        blob_client.delete_blob()
    except:
        # No delete necessary if the blob does not exist
        pass
    blob_client.upload_blob(json_lines_data)
This is working fine, but the data is around 3 GB per file and takes a long time to download, so I would rather compress the files. Can anyone here help with how to compress the JSON Lines file and upload it to the Azure container? I have tried a lot of different things, and nothing is working.
If there is a better way to do this in Databricks, I can change it. I did not write the file using Databricks itself because I need to output a single file and control the filename.
There is a way to compress the JSON file before uploading it to blob storage.
Here is code that converts the data to JSON, encodes it to bytes (UTF-8), and finally compresses it.
I would suggest adding this code before the upload function.
import json
import gzip
def compress_data(data):
    # Convert to JSON
    json_data = json.dumps(data, indent=2)
    # Convert to bytes
    encoded = json_data.encode('utf-8')
    # Compress
    compressed = gzip.compress(encoded)
    return compressed
Reference: https://gist.github.com/LouisAmon/4bd79b8ab80d3851601f3f9016300ac4#file-json_to_gzip-py
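For context, here is a minimal sketch (my own, not part of the answer) of how the compressed bytes could be handed to the upload_blob function from the question. Since json_lines_data produced by to_json is already a serialized string, it can be encoded and gzipped directly, without another json.dumps:
import gzip
# json_lines_data is the string from pandas_df.to_json(orient='records', lines=True)
compressed = gzip.compress(json_lines_data.encode('utf-8'))
# The blob now contains gzip bytes; a .gz suffix makes that obvious to downstream readers
upload_blob(compressed, connection_string, container_name, blob_name + '.gz')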

Attach a CSV to an email in Python without writing the CSV to disk

I've seen a number of solutions about sending a CSV as an attachment in an email via Python.
In my case, my Python code needs to extract data from a Snowflake view and send it to a user group as a CSV attachment. While I know how I can do it using to_csv out of a Pandas dataframe, my question is: Do I have to create the CSV file externally at all? Can I simply run the Pandas DF through MIMEText? If so, what file name do I use in the header?
You don't have to create a temporary CSV file on disk, but you also can't just "attach a dataframe" since it'd have no specified on-wire format. Use a BytesIO to have Pandas serialize the CSV to memory:
import io
import pandas as pd
df = pd.DataFrame(...)
bio = io.BytesIO()
df.to_csv(bio, mode="wb", ...)
bio.seek(0)  # rewind the in-memory file
# use the file object wherever a file object is supported
# or extract the binary data with `bio.getvalue()`
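To answer the filename part of the question: since no file ever exists on disk, the name in the attachment header is whatever you choose. A minimal sketch, assuming the message is assembled with the standard email.mime classes and that "report.csv" is a made-up name:
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
msg = MIMEMultipart()
# Wrap the in-memory CSV bytes as an attachment; "report.csv" is only a label
part = MIMEApplication(bio.getvalue(), _subtype="csv")
part.add_header("Content-Disposition", "attachment", filename="report.csv")
msg.attach(part)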

Unable to read parquet file using StreamingBody from S3 without holding in memory

I'm trying to read a parquet file from S3, and dump the contents of it on to a Kafka topic.
This isn't too difficult when you hold the entire file in memory, but for large files this isn't feasible.
# using .read() holds the entire file in memory - not ideal
df = pd.read_parquet(io.BytesIO(s3_response['Body'].read()), columns=columns)
Instead, I'm trying to take advantage of file-like-objects in order to stream the parquet file.
My issue is that it seems that it's impossible to do this with Parquet, because Parquet encodes data in the footer of the file as well as the header.
Here's an example of my code:
session = boto3.session.Session()
s3_client = session.client(
    service_name='s3',
    endpoint_url=s3_url,
)
obj = s3_client.get_object(Bucket=s3_bucket, Key=key)
for line in obj['Body'].iter_lines():
    pq_file = io.BytesIO(line)
    df = pd.read_parquet(pq_file, columns=columns)
    # At this point I'd want to iterate over the DF rows
    # and send them to kafka
    print(df)
This results in the following error:
OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
Is it at all possible to do what I'm trying to do? Or due to the nature of parquet files is this impossible?
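As a hedged sketch (not a confirmed answer to the question above): pyarrow can open the object through s3fs as a seekable file and read it one record batch at a time, so only a batch lives in memory at once. The use of s3fs and the batch size are assumptions here; s3_bucket, key and columns come from the question:
import pyarrow.parquet as pq
import s3fs
fs = s3fs.S3FileSystem()  # credentials/endpoint resolved the usual s3fs way
with fs.open(f"{s3_bucket}/{key}", "rb") as f:
    parquet_file = pq.ParquetFile(f)
    # iter_batches reads one record batch at a time instead of the whole file
    for batch in parquet_file.iter_batches(batch_size=10_000, columns=columns):
        df = batch.to_pandas()
        print(df)  # send these rows to Kafka here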

Is there a way to also set AWS metadata values when saving a data frame as a parquet file to S3 using df.to_parquet?

When I put an object with boto3 in Python, I can set the metadata at the same time. Example:
self.s3_client.put_object(
    Bucket=self._bucket,
    Key=key,
    Body=body,
    ContentEncoding=self._compression,
    ContentType="application/json",
    ContentLanguage="en-US",
    Metadata={'other-key': 'value'}
)
It seems like both pyarrow and fastparquet don't let me pass those particular keywords despite the pandas documentation saying that extra keywords are passed.
This saves the data the way I want it, but I can't seem to attach the metadata with any syntax that I try:
df.to_parquet(s3_path, compression='gzip')
It would also help if there were an easy way to compress the parquet file and convert it to a byte stream.
I would rather not write the file twice (either locally and then transferring to AWS, or twice on AWS).
Ok. Found it quicker than I thought.
import pandas as pd
import io
import boto3
# Read the data into a df.
df = pd.read_csv('file.csv')
# Serialize the dataframe as gzip-compressed parquet into an in-memory buffer
body = io.BytesIO()
df.to_parquet(
    path=body,
    compression="gzip",
    engine="pyarrow",
)
bucket = 'MY_BUCKET'
key = 'prefix/key'
s3_client = boto3.client('s3')
s3_client.put_object(
    Bucket=bucket,
    Key=key,
    Body=body.getvalue(),
    ContentEncoding='gzip',
    ContentType="application/x-parquet",
    ContentLanguage="en-US",
    Metadata={'user-key': 'value'},
)
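As a quick check (my addition, not part of the answer above), the metadata and headers can be read back with head_object:
response = s3_client.head_object(Bucket=bucket, Key=key)
print(response['Metadata'])         # expected: {'user-key': 'value'}
print(response['ContentEncoding'])  # expected: gzip
print(response['ContentType'])      # expected: application/x-parquet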

How to load pickled dataframes from GCS into App Engine

I'm trying to load a pickled pandas dataframe from Google Cloud Storage into App Engine.
I have been using blob.download_to_file() to read the bytestream into pandas; however, I encounter the following error:
UnpicklingError: invalid load key, m
I have tried seeking to the beginning to no avail and am pretty sure something fundamental is missing from my understanding.
When attempting to pass an open file object and read from there, I get an UnsupportedOperation: write error.
from io import BytesIO
from google.cloud import storage
from google.oauth2 import service_account
import pandas as pd
def get_byte_fileobj(project, bucket, path) -> BytesIO:
    blob = _get_blob(bucket, path, project)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream
def _get_blob(bucket_name, path, project):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
pd.read_pickle(fileobj)
Ideally pandas would just read from pickle since all of my GCS backups are in that format, but I'm open to suggestions.
The pandas.read_pickle() method takes as argument a file path string, not a file handler/object:
pandas.read_pickle(path, compression='infer')
Load pickled pandas object (or any object) from file.
path : str
File path where the pickled object will be loaded.
If you're in the 2nd generation standard environment or in the flexible environment, you could try to use a real /tmp file instead of BytesIO.
Otherwise you'd have to figure out another method of loading the data into pandas, one that supports a file object/descriptor. In general the approach is described in How to restore Tensorflow model from Google bucket without writing to filesystem? (the context is different, but the general idea is the same).
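For example, a minimal sketch of one such method (my assumption, not from the answer above): since the blob is a pickled DataFrame, the standard pickle module can read it straight from the in-memory stream returned by get_byte_fileobj:
import pickle
fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
# pickle.load accepts any readable binary file object, including BytesIO
df = pickle.load(fileobj)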
