In my Azure Function, I'm currently writing a data frame to CSV using df.to_csv() and passing the result to the append blob function.
output_data = df.to_csv(index=False, encoding="utf-8")
blob_client.append_block(output_data)
Now I want to store the data in .xlsx format and add an autofilter to the Excel file using xlsxwriter.
This is what I tried, but I was unable to understand what is wrong here:
writer = io.BytesIO()
df.to_excel(writer, index=False)
writer.seek(0)
blob_client.upload_blob(writer.getvalue())
I have already tried the following solutions, but they didn't work for me.
Either the file is created but empty, or the file is not readable in Excel apps:
Azure Function - Pandas Dataframe to Excel is Empty
Writing pandas dataframe as xlsx file to an azure blob storage without creating a local file
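For reference, here is a minimal sketch of one common approach, assuming df and blob_client are defined as above and that uploading the result as a regular block blob (rather than an append blob) is acceptable:

import io

import pandas as pd

# build the workbook entirely in memory using the xlsxwriter engine
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
    worksheet = writer.sheets["Sheet1"]
    # apply an autofilter over the header row plus all data rows and columns
    worksheet.autofilter(0, 0, len(df), len(df.columns) - 1)

# upload the finished bytes; overwrite=True replaces any existing blob
blob_client.upload_blob(buffer.getvalue(), overwrite=True)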
I'm new to Dremio, and I was following this SQS + S3 + Dremio tutorial to learn more about Dremio. One of the code snippets mentions that get_messages_from_queue will create a CSV file, which is later used in the upload_file method to upload to S3.
However, I'm missing the portion of the command that converts the messages into CSV. Can anyone help me create the CSV using pandas? I'm new to Pandas and still learning.
The SQS message body looks like this:
"Body": "{\"holiday\":\"None\",\"temp\":288.28,\"rain_1h\":0.0,\"snow_1h\":0.0,\"clouds_all\":40,\"weather_main\":\"Clouds\",\"weather_description\":\"scattered clouds\",\"date_time\":\"2012-10-02 09:00:00\",\"traffic_volume\":5545}"
Parse the SQS message body into a dictionary (e.g. with json.loads), load it into a pandas DataFrame, and finally use to_csv to export the CSV file.
import pandas as pd

# sqs_message is the message body parsed into a dict (e.g. via json.loads)
df = pd.DataFrame([sqs_message])            # wrap the dict in a list so it becomes one row
df.to_csv('sqs_messages.csv', index=False)  # pass path and file name of csv
I have read a CSV file using Pandas, and I need to re-save it from code instead of opening the file and manually saving it.
Is it possible?
There must be something I'm missing in the question. Why not simply:
import pandas as pd

df = pd.read_csv('file.csv', ...)
# any changes
df.to_csv('file.csv', index=False)  # index=False avoids writing the row index as an extra column
?
I am trying to write a pandas dataframe to the parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?
The write syntax is:
df.to_parquet(path, mode='append')
The read syntax is:
pd.read_parquet(path)
It looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't implement it.
Below is from the pandas docs:
DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)
We have to pass in both engine and **kwargs:
engine : {'auto', 'pyarrow', 'fastparquet'}
**kwargs : additional arguments passed to the parquet library.
Here, the **kwargs argument we need to pass is append=True (from fastparquet).
import pandas as pd
import os.path

file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

if not os.path.isfile(file_path):
    # first run: create the file
    df.to_parquet(file_path, engine='fastparquet')
else:
    # later runs: append a new row group to the existing file
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, you will see the error below:
AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times, I have the following data in the parquet file.
If I inspect the metadata, I can see that this resulted in 3 row groups.
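For example, one way to confirm this (a sketch assuming the same file_path as above) is to open the file with fastparquet's ParquetFile and look at its row groups:

from fastparquet import ParquetFile

pf = ParquetFile(file_path)
print(len(pf.row_groups))  # 3 after running the script three times
print(pf.to_pandas())      # the combined data from all row groups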
Note:
Append could be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups: compression will work better, since compression operates within a row group only, and there will be less overhead spent on storing statistics, since each row group stores its own statistics.
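A hypothetical way to follow that advice is to buffer the small frames and flush them as one larger row group, for example:

import os.path
import pandas as pd

# hypothetical batching sketch: collect small DataFrames in a list and
# flush them as one larger row group instead of appending each one
pending = []          # small DataFrames waiting to be written
pending.append(df)    # ...called each time new data arrives

if sum(len(p) for p in pending) >= 100_000:
    batch = pd.concat(pending, ignore_index=True)
    batch.to_parquet(file_path, engine='fastparquet',
                     append=os.path.isfile(file_path))
    pending.clear()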
To append, do this:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa

dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"

# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)

# Write directly into your parquet dataset directory (root_path)
pq.write_to_dataset(table, root_path=output)
This will automatically append to your table: each call to write_to_dataset() adds a new file under the dataset directory, so readers see the combined data.
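A quick way to check the result (assuming the same output path as above) is to read the dataset directory back into pandas:

import pandas as pd

# every write_to_dataset() call adds a new file under `output`;
# read_parquet reads them all back as a single DataFrame
combined = pd.read_parquet(output)
print(len(combined))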
I used the AWS Wrangler library. It works like a charm.
Below are the reference docs:
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream and used the kinesis-python library to consume the messages and write to S3. The JSON processing logic is not included, as this post deals with the problem of being unable to append data to S3. This was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler
import awswrangler as wr
import pandas as pd
evet_data=pd.DataFrame({'a': [a], 'b':[b],'c':[c],'d':[d],'e': [e],'f':[f],'g': [g]},columns=['a','b','c','d','e','f','g'])
#print(evet_data)
s3_path="s3://<your bucket>/table/temp/<your folder name>/e="+e+"/f="+str(f)
try:
    wr.s3.to_parquet(
        df=evet_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as e:
    print(str(e))
I'm happy to help with any clarifications. In a few other posts I have read that you can read the data and overwrite it again, but as the data gets larger that slows down the process; it is inefficient.
There is no append mode in pandas.to_parquet(). What you can do instead is read the existing file, change it, and write it back, overwriting the original.
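A minimal sketch of that read-modify-overwrite approach, where 'data.parquet' and new_rows are placeholders:

import pandas as pd

# read the existing file, add the new rows, then overwrite the file
existing = pd.read_parquet('data.parquet')
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_parquet('data.parquet', index=False)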
Use the fastparquet write function
from fastparquet import write
write(file_name, df, append=True)
The file must already exist as I understand it.
API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
Pandas to_parquet() can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if it already exists. To append to a parquet dataset, just add a new file to the same parquet directory.
import datetime
import os
import pandas as pd

# `path` is the dataset directory and `df` is the DataFrame to add
os.makedirs(path, exist_ok=True)
# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))
# read
pd.read_parquet(path)
I have an extremely large dataframe saved as a gzip file. The data also needs a good deal of manipulation before being saved.
One could try to decompress the entire gzip file into text, hold it in a variable, parse/clean the data, load it with pandas.read_csv(), and then save it as a .csv file. However, this is extremely memory intensive.
I would like to read/decompress this file line by line (as this would be the most memory-efficient solution, I think), parse each line (e.g. with the regex module re, or perhaps a pandas solution), and then append it to a pandas dataframe.
Python has a gzip library for this:
import csv
import gzip
import pandas as pd

with gzip.open('filename.gzip', 'rt') as input_file:  # 'rt' decompresses to text
    reader = csv.reader(input_file, delimiter="\t")
    data = [row for row in reader]
df = pd.DataFrame(data)
However, this reads all of the rows into memory before parsing them into the dataframe. How can one do this in a more memory-efficient manner?
Should I be using a different library instead of gzip?
It's not quite clear what you want to do with your huge GZIP file. IIUC you can't read the whole data set into memory because the GZIP file is huge, so the only option you have is to process your data in chunks.
Assuming that you want to read your data from the GZIP file, process it and write it to compressed HDF5 file:
import pandas as pd

hdf_key = 'my_hdf_ID'
cols_to_index = ['colA','colZ']  # list of indexed columns, use `cols_to_index=True` if you want to index ALL columns
store = pd.HDFStore('/path/to/filename.h5')
chunksize = 10**5

for chunk in pd.read_csv('filename.gz', sep=r'\s*', chunksize=chunksize):
    # process data in the `chunk` DF
    # don't index data columns in each iteration - we'll do it later
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False, complib='blosc', complevel=4)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
Perhaps extract your data with gunzip -c, pipe it to your Python script and work with standard input there:
$ gunzip -c source.gz | python ./line_parser.py | gzip -c - > destination.gz
In the Python script line_parser.py:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    sys.stdout.write(line)
Replace sys.stdout.write(line) with code to process each line in your custom way.
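For example, a hypothetical cleanup that collapses runs of whitespace into single tabs could look like this:

#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    # hypothetical cleanup: collapse whitespace runs into single tabs
    cleaned = re.sub(r'\s+', '\t', line.strip())
    sys.stdout.write(cleaned + '\n')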
Have you considered using HDFStore?
HDFStore is a dict-like object which reads and writes pandas using the high performance HDF5 format using the excellent PyTables library. See the cookbook for some advanced strategies
Create the store, save the DataFrame, and close the store.
import pandas as pd

# Note compression.
store = pd.HDFStore('my_store.h5', mode='w', complevel=9, complib='blosc')
with store:
    store['my_dataframe'] = df
Reopen the store, retrieve the dataframe, and close the store.
with pd.HDFStore('my_store.h5', mode='r') as store:
    df = store.get('my_dataframe')