How can you read a gzipped parquet file in Python

I need to open a gzipped file that has a parquet file inside with some data. I am having a lot of trouble trying to print/read what is inside the file. I tried the following:
import gzip

with gzip.open("myFile.parquet.gzip", "rb") as f:
    data = f.read()
This does not seem to work, as I get an error that my file is not a gz file. Thanks!

You can use the read_parquet function from the pandas module.
Install pandas and pyarrow:
pip install pandas pyarrow
Then use read_parquet, which returns a DataFrame:
import pandas as pd

data = pd.read_parquet("myFile.parquet.gzip")
print(data.count())  # example of an operation on the returned DataFrame
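A file named myFile.parquet.gzip is usually a parquet file whose column chunks are gzip-compressed internally (for example, one written with to_parquet(..., compression='gzip')), not a gzip archive, which is why gzip.open rejects it. If the file really were a gzip-wrapped parquet file, a sketch along these lines should work (the file name is just an example):
import gzip
import io

import pandas as pd

# Hypothetical case: a parquet file that was wrapped in gzip after being written.
with gzip.open("myFile.parquet.gz", "rb") as f:
    df = pd.read_parquet(io.BytesIO(f.read()))  # read_parquet accepts a file-like object

print(df.head())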

Related

How to resave a csv file using pandas in Python?

I have read a csv file using Pandas and I need to resave the csv file using code instead of opening the csv file and manually saving it.
Is it possible?
There must be something I'm missing in the question. Why not simply:
df = pd.read_csv('file.csv', ...)
# any changes
df.to_csv('file.csv')
?
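One small pitfall when round-tripping like this: to_csv writes the row index by default, so repeated resaves can accumulate an extra 'Unnamed: 0' column. A minimal sketch that avoids this (assuming the file is named file.csv):
import pandas as pd

df = pd.read_csv('file.csv')
# ... make your changes to df here ...
df.to_csv('file.csv', index=False)  # skip the index so a later read_csv does not pick it up as a column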

Python - read parquet data from a variable

I am reading a parquet file and transforming it into a dataframe.
from fastparquet import ParquetFile
pf = ParquetFile('file.parquet')
df = pf.to_pandas()
Is there a way to read a parquet file from a variable (one that was previously read and now holds the parquet data)?
Thanks.
In pandas there is a method to deal with parquet. Here is a reference to the docs. Something like this:
import pandas as pd
pd.read_parquet('file.parquet')
should work. Also please read this post for engine selection.
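If both backends are installed and you want to pin one explicitly, read_parquet accepts an engine argument; a small sketch (the file name is just an example):
import pandas as pd

df = pd.read_parquet('file.parquet', engine='pyarrow')        # force the pyarrow backend
# df = pd.read_parquet('file.parquet', engine='fastparquet')  # or force the fastparquet backend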
You can also read a parquet file from a variable using pandas.read_parquet with the following code. I tested this with the pyarrow backend, but it should also work with the fastparquet backend.
import pandas as pd
import io
with open("file.parquet", "rb") as f:
data = f.read()
buf = io.BytesIO(data)
df = pd.read_parquet(buf)

I am trying to load a csv file in Python (Azure) but am running into a file IO error saying the file does not exist

My code is:
import pandas as pd
df = pd.read_csv('Project_Wind_Data.csv', usecols=['U100', 'V100'])
with open('Project_Wind_Data.csv', "r") as csvfile:
    ...
I am trying to access certain columns within the csv file. I receive an error message saying that the data file does not exist.
This must be a trivial issue, but help would be much appreciated.
If your csv file is in the same working directory as your .py code, you can use it directly:
import pandas as pd
df = pd.read_csv('Project_Wind_Data.csv', usecols=['U100', 'V100'])
If the file is in another directory, replace 'Project_Wind_Data.csv' with the full path to the file, like C:/Users/Documents/file.txt.
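If it is still unclear where Python is looking, a quick check of the working directory and the resolved path usually pinpoints the problem; a sketch (the file name mirrors the question):
from pathlib import Path

import pandas as pd

csv_path = Path('Project_Wind_Data.csv')
print(Path.cwd())           # the directory relative paths are resolved against
print(csv_path.resolve())   # the absolute path being looked up
print(csv_path.exists())    # False means the name or the location is wrong

if csv_path.exists():
    df = pd.read_csv(csv_path, usecols=['U100', 'V100'])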

pandas write dataframe to parquet format with append

I am trying to write a pandas dataframe to the parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?
The write syntax is:
df.to_parquet(path, mode='append')
and the read syntax is:
pd.read_parquet(path)
It looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't implement it.
Below is from the pandas docs:
DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)
We have to pass in both engine and **kwargs:
engine : {'auto', 'pyarrow', 'fastparquet'}
**kwargs - additional arguments passed to the parquet library.
**kwargs - here we need to pass append=True (from fastparquet).
import pandas as pd
import os.path
file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2,], 'col2': [3, 4]})
if not os.path.isfile(file_path):
    df.to_parquet(file_path, engine='fastparquet')
else:
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, then you will see the error below:
AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times writes the data 3 times to the parquet file.
If I inspect the metadata, I can see that this resulted in 3 row groups.
Note:
Appending could be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups: compression will work better, since compression operates within a row group only, and there will be less overhead spent on storing statistics, since each row group stores its own statistics.
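To verify the row-group count mentioned above, the parquet metadata can be inspected directly; a sketch using pyarrow (the path matches the example above):
import pyarrow.parquet as pq

meta = pq.ParquetFile("D:\\dev\\output.parquet").metadata
print(meta.num_row_groups)  # 3 after running the script three times
print(meta.num_rows)        # total rows across all row groups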
To append, do this:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"
# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)
# Write direct to your parquet file
pq.write_to_dataset(table, root_path=output)
This will automatically append into your table.
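A note on how this append works: each call to write_to_dataset writes a new parquet file into the root_path directory, and readers treat the directory as a single dataset. Reading it back (same example path):
import pandas as pd

df_all = pd.read_parquet("/Users/myTable.parquet")  # with the pyarrow engine this reads every file in the dataset directory
print(len(df_all))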
I used the AWS Data Wrangler library. It works like a charm.
Below are the reference docs
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream and used the kinesis-python library to consume the messages and write to S3. The JSON processing logic is not included, as this post deals with the problem of being unable to append data to S3. This was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler
import awswrangler as wr
import pandas as pd
event_data = pd.DataFrame({'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
                          columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
#print(event_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)
try:
    wr.s3.to_parquet(
        df=event_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as e:
    print(str(e))
I'm ready to help with any clarifications. In a few other posts I have read the suggestion to read the data and overwrite it again, but as the data gets larger that will slow the process down; it is inefficient.
There is no append mode in pandas.to_parquet(). What you can do instead is read the existing file, change it, and write it back, overwriting it.
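A minimal sketch of that read, modify, overwrite approach (the path and the new rows are just examples):
import pandas as pd

path = "data.parquet"                                # hypothetical existing parquet file
new_rows = pd.DataFrame({'col1': [5], 'col2': [6]})  # rows you want to "append"

existing = pd.read_parquet(path)
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_parquet(path, index=False)               # overwrite the file with the combined data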
Use the fastparquet write function
from fastparquet import write
write(file_name, df, append=True)
The file must already exist as I understand it.
The API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
Pandas to_parquet() can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if it is already there. To append to a parquet dataset, just add a new file to the same parquet directory.
import os
import datetime

import pandas as pd

os.makedirs(path, exist_ok=True)
# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))
# read
pd.read_parquet(path)

How to write in ARFF file using LIAC-ARFF package in Python?

I want to load an ARFF file in Python, then change some of its values, and then save the changes to a file. I'm using the LIAC-ARFF package (https://pypi.python.org/pypi/liac-arff). I loaded the ARFF file with the following lines of code:
import arff
data = arff.load(open(FILE_NAME, 'rb'))
After manipulating some values inside data, I want to write data to another ARFF file. Any solution?
Use the following code:
import arff
data = arff.load(open(FILE_NAME, 'rb'))
f = open(outputfilename, 'wb')
arff.dump(data, f)
f.close()
In the LIAC-ARFF description you see a dump method that is said to serialize to the file, but that wording is misleading. It just writes the object as a text file. Serializing usually means saving the whole object, so the output would be a binary file, not a text file.
We can load ARFF data into Python using scipy:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('dataset.arff')
df = pd.DataFrame(data[0])
df.head()
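One thing to watch for with scipy: loadarff returns nominal (string) attributes as byte strings, and scipy.io.arff only reads ARFF files, so for writing you would still need liac-arff's dump shown above. A small sketch for decoding the byte columns (the column handling is generic, not tied to a specific dataset):
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff('dataset.arff')
df = pd.DataFrame(data)

# Nominal attributes are loaded as bytes (e.g. b'yes'); decode them to plain strings.
byte_cols = df.select_dtypes(include=[object]).columns
df[byte_cols] = df[byte_cols].apply(lambda col: col.str.decode('utf-8'))
print(df.head())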
