Python Pandas to convert CSV to Parquet using Fastparquet

I am using Python 3.6 interpreter in my PyCharm venv, and trying to convert a CSV to Parquet.
import pandas as pd
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet')
Error-1
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support
Solution-1
Installed fastparquet 0.2.1
Error-2
File "/Users/python parquet/venv/lib/python3.6/site-packages/fastparquet/compression.py", line 131, in compress_data
(algorithm, sorted(compressions)))
RuntimeError: Compression 'snappy' not available. Options: ['GZIP', 'UNCOMPRESSED']
I installed python-snappy 0.5.3 but am still getting the same error. Do I need to install any other library?
If I use the PyArrow 0.12.0 engine, I don't experience the issue.

In fastparquet, snappy compression is an optional feature.
To quickly check a conversion from CSV to Parquet, you can execute the following script (it only requires pandas and fastparquet):
import pandas as pd
from fastparquet import write, ParquetFile
df = pd.DataFrame({"col1": [1,2,3,4], "col2": ["a","b","c","d"]})
# df.head() # Test your initial value
df.to_csv("/tmp/test_csv", index=False)
df_csv = pd.read_csv("/tmp/test_csv")
df_csv.head() # Test your intermediate value
df_csv.to_parquet("/tmp/test_parquet", compression="GZIP")
df_parquet = ParquetFile("/tmp/test_parquet").to_pandas()
df_parquet.head() # Test your final value
However, if you need to write or read using snappy compression, you might follow this answer about installing the snappy library on Ubuntu.
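Once python-snappy (and the underlying system snappy library) is installed, a minimal sketch to verify that fastparquet can actually use snappy might look like this (the /tmp path and sample frame are just illustrative):
import pandas as pd
from fastparquet import write, ParquetFile

# Tiny sample frame; any data will do for this check.
df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "d"]})

# If python-snappy is missing, this is where fastparquet raises
# "Compression 'snappy' not available".
write("/tmp/test_snappy.parquet", df, compression="SNAPPY")

# Read it back to confirm the round trip.
print(ParquetFile("/tmp/test_snappy.parquet").to_pandas())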

I've used the following versions:
Python 3.10.9, fastparquet==2022.12.0, pandas==1.5.2
This code works seamlessly for me:
import pandas as pd
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet', engine="fastparquet")
I'd recommend you move away from Python 3.6, as it has reached end of life and is no longer supported.
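If you are unsure which versions are active in the interpreter you are running, a quick check (assuming Python 3.8+ for importlib.metadata) is:
import sys
from importlib.metadata import version

print(sys.version)             # interpreter version
print(version("pandas"))       # installed pandas version
print(version("fastparquet"))  # installed fastparquet version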

Related

Dask df.to_parquet can't find pyarrow. RuntimeError: `pyarrow` not installed

Environment:
macOS Big Sur v 11.6.1
Python 3.7.7
pyarrow==5.0.0 (from pip freeze)
From the terminal:
>>> import pyarrow
>>> pyarrow
<module 'pyarrow' from '/Users/garyb/Develop/DS/tools-pay-data-pipeline/env/lib/python3.7/site-packages/pyarrow/__init__.py'>
So I confirmed that I have pyarrow installed. But when I try to write a Dask dataframe to parquet I get:
def make_parquet_file(filepath):
    parquet_path = f'{PARQUET_DIR}/{company}_{table}_{batch}.parquet'
    df.to_parquet(parquet_path, engine='pyarrow')
ModuleNotFoundError: No module named pyarrow
The exception detail:
~/Develop/DS/research-dask-parquet/env/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in get_engine(engine)
970 elif engine in ("pyarrow", "arrow", "pyarrow-legacy", "pyarrow-dataset"):
971
--> 972 pa = import_required("pyarrow", "`pyarrow` not installed")
973 pa_version = parse_version(pa.__version__)
974
This function works; it uses a much smaller CSV file just to confirm that df.to_parquet itself works:
def make_parquet_file():
    csv_file = f'{CSV_DATA_DIR}/diabetes.csv'
    parquet_file = f'{PARQUET_DIR}/diabetes.parquet'
    # Just to prove I can read the csv file
    p_df = pd.read_csv(csv_file)
    print(p_df.shape)
    d_df = dd.read_csv(csv_file)
    d_df.to_parquet(parquet_file)
Is it looking in the right place for the package? I'm stuck.
It does seem that Dask and plain Python are using different environments.
In the first example the path is:
~/Develop/DS/tools-pay-data-pipeline/env/lib/python3.7
In the traceback the path is:
~/Develop/DS/research-dask-parquet/env/lib/python3.7
So a quick fix is to install pyarrow in the second environment. Another fix is to install the packages on workers (this might help).
A more robust fix is to use environment files.
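As a quick sanity check (nothing here is specific to this project), you can print which interpreter and which pyarrow installation a given notebook or script actually resolves to:
import sys
import pyarrow

print(sys.executable)      # interpreter the current kernel is running
print(pyarrow.__file__)    # environment pyarrow was imported from
print(pyarrow.__version__)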
OK, problem solved - maybe. During development and experimentation I use Jupyter to test and debug. Later, all the functions get moved into scripts which can then be imported into any notebook that needs them. The role of the notebook at that point in this project is to demo and document, i.e. a better alternative to the command line for a demo.
In this case I'm still in experiment mode, so the problem was the Jupyter kernel. I had inadvertently recycled the name, and the virtual environments in the two projects also have the same name - "env". See a pattern here? Laziness bit me.
I deleted the kernel that was being used and created a new one with a unique name. PyArrow was then pulled from the correct virtual environment and worked as expected.

Writing a Dask dataframe to parquet using to_parquet() results in "RuntimeError: file metadata is only available after writer close"

I am trying to store a Dask dataframe in Parquet files. I have the pyarrow library installed.
import numpy as np
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame(np.random.randint(100,size=(100000, 20)),columns=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T'])
ddf = dd.from_pandas(df, npartitions=10)
ddf.to_parquet('saved_data_prqt', compression='snappy')
However, I get this error as a result of my code
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pyarrow\parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, compression_level, use_byte_stream_split, data_page_version, use_compliant_nested_type, **kwargs)
~\anaconda3\lib\site-packages\pyarrow\parquet.py in close(self)
682 self.is_open = False
683 if self._metadata_collector is not None:
--> 684 self._metadata_collector.append(self.writer.metadata)
685 if self.file_handle is not None:
686 self.file_handle.close()
.................. it's a long error description, which I have shortened. If the whole error text is required, please let me know in the comments section and I'll try to add the full version.
~\anaconda3\lib\site-packages\pyarrow\_parquet.pyx in pyarrow._parquet.ParquetWriter.metadata.__get__()
RuntimeError: file metadata is only available after writer close
Does anybody know how to debug this error and what its cause is?
Thank you!
I ran your exact code snippets and the Parquet files were written without any error. This code snippet also works:
ddf.to_parquet("saved_data_prqt", compression="snappy", engine="pyarrow")
I'm using Python 3.9.7, Dask 2021.8.1, and pyarrow 5.0.0. What versions are you using?
Here's the notebook I ran and here's the environment if you'd like to replicate my computations exactly.
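If the error persists, it may help to print the versions from inside the notebook where the failure happens, for example:
import sys
import dask
import pyarrow

print(sys.version)          # Python version of the active kernel
print(dask.__version__)     # Dask version
print(pyarrow.__version__)  # pyarrow version used as the parquet engine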
I fixed the error by creating an isolated virtual environment with Python 3.9 and pyarrow 5.0 in conda, followed by installing the corresponding Python kernel in Jupyter Notebook.
It's important to activate the environment in conda and then launch Jupyter Notebook from conda; otherwise (for an unknown reason), if I open Jupyter Notebook from the Windows start menu, the error persists.

Rasterio unable to open .jp2 files

I'm beginning to play with GeoPySpark and am implementing an example notebook.
I successfully retrieved the images:
!curl -o /tmp/B01.jp2 http://sentinel-s2-l1c.s3.amazonaws.com/tiles/32/T/NM/2017/1/4/0/B01.jp2
!curl -o /tmp/B09.jp2 http://sentinel-s2-l1c.s3.amazonaws.com/tiles/32/T/NM/2017/1/4/0/B09.jp2
!curl -o /tmp/B10.jp2 http://sentinel-s2-l1c.s3.amazonaws.com/tiles/32/T/NM/2017/1/4/0/B10.jp2
Here is the script:
import rasterio
import geopyspark as gps
import numpy as np
from pyspark import SparkContext
conf = gps.geopyspark_conf(master="local[*]", appName="sentinel-ingest-example")
pysc = SparkContext(conf=conf)
jp2s = ["/tmp/B01.jp2", "/tmp/B09.jp2", "/tmp/B10.jp2"]
arrs = []
for jp2 in jp2s:
    with rasterio.open(jp2) as f:  # CRASHES HERE
        arrs.append(f.read(1))
data = np.array(arrs, dtype=arrs[0].dtype)
data
The script crashes where I placed the marker here, with the following error:
RasterioIOError: '/tmp/B01.jp2' not recognized as a supported file format.
I copy-pasted the example code exactly, and the Rasterio docs even use .jp2 files in their examples.
I'm using the following version of Rasterio, installed with pip3. I do not have Anaconda installed (it messes up my Python environments) and do not have GDAL installed (it refuses to install; that would be the topic of another question if it turns out to be my only solution).
Name: rasterio
Version: 1.1.0
Summary: Fast and direct raster I/O for use with Numpy and SciPy
Home-page: https://github.com/mapbox/rasterio
Author: Sean Gillies
Author-email: sean@mapbox.com
License: BSD
Location: /usr/local/lib/python3.6/dist-packages
Requires: click-plugins, snuggs, numpy, click, attrs, cligj, affine
Required-by:
Why does it refuse to read .jp2 files? Is there maybe a way to convert them to something usable? Or do you know of any example files similar to these ones in an acceptable format?
I was stuck in the same situation.
I used the pyvips package and it resolved the issue.
import pyvips
image = pyvips.Image.new_from_file("000240.jp2")
image.write_to_file("000240.jpg")
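Applied to the Sentinel-2 bands from the question, a rough sketch of the same workaround could look like this (assuming your libvips build can decode JPEG2000; note that a plain TIFF written this way may not preserve the georeferencing):
import numpy as np
import pyvips
import rasterio

jp2s = ["/tmp/B01.jp2", "/tmp/B09.jp2", "/tmp/B10.jp2"]
arrs = []
for jp2 in jp2s:
    tif = jp2.replace(".jp2", ".tif")
    pyvips.Image.new_from_file(jp2).write_to_file(tif)  # convert via libvips
    with rasterio.open(tif) as f:                        # read the converted TIFF
        arrs.append(f.read(1))
data = np.array(arrs, dtype=arrs[0].dtype)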

How can I open a .snappy.parquet file in python?

How can I open a .snappy.parquet file in python 3.5? So far, I used this code:
import numpy
import pyarrow
filename = "/Users/T/Desktop/data.snappy.parquet"
df = pyarrow.parquet.read_table(filename).to_pandas()
But, it gives this error:
AttributeError: module 'pyarrow' has no attribute 'compat'
P.S. I installed pyarrow this way:
pip install pyarrow
I got the same issue and managed to solve it by following the solution proposed in https://github.com/dask/fastparquet/issues/366.
1) Install python-snappy using conda install (for some reason I couldn't install it with pip).
2) Add the snappy_decompress function.
from fastparquet import ParquetFile
import snappy

def snappy_decompress(data, uncompressed_size):
    return snappy.decompress(data)

pf = ParquetFile('filename')  # filename includes .snappy.parquet extension
dff = pf.to_pandas()
The error AttributeError: module 'pyarrow' has no attribute 'compat' is sadly a bit misleading. To execute the to_pandas() function on a pyarrow.Table instance, you need pandas installed. The above error is a symptom of that missing requirement.
pandas is not a hard requirement of pyarrow, as most of its functionality is usable with just Python built-ins and NumPy. Thus users of pyarrow that don't need pandas can work with it without having pandas pre-installed.
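With pandas installed alongside pyarrow, a minimal sketch of the read might look like this (note the explicit import of the pyarrow.parquet submodule):
import pandas as pd           # needed for to_pandas()
import pyarrow.parquet as pq  # import the parquet submodule explicitly

filename = "/Users/T/Desktop/data.snappy.parquet"
df = pq.read_table(filename).to_pandas()
print(df.head())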
You can use pandas to read .snappy.parquet files into a pandas DataFrame.
import pandas as pd
filename = "/Users/T/Desktop/data.snappy.parquet"
df = pd.read_parquet(filename)

IndexError: pop from empty stack (python)

I tried to import the Excel file with the following code. The file is about 71 MB, and when running the code it shows "IndexError: pop from empty stack". Kindly help me with this.
Code:
import pandas as pd
df1 = pd.read_excel('F:/Test PCA/Week-7-MachineLearning/weather.xlsx',
                    sheetname='Sheet1', header=0)
Data: https://www.dropbox.com/s/zyrry53li55hvha/weather.xlsx?dl=0
Using the latest pandas and xlrd, this works fine to read the "weather.xlsx" file you provided:
df1 = pd.read_excel('weather.xlsx',sheet_name='Sheet1')
Can you try running:
pip install --upgrade pandas
pip install --upgrade xlrd
to ensure you have the latest versions of the modules used to read the file?
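After upgrading, you can confirm from inside Python which versions are actually being picked up, e.g.:
import pandas as pd

print(pd.__version__)  # pandas version in the active environment
pd.show_versions()     # also lists optional dependencies such as xlrd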
I tried the same code you provided with the versions of pandas and xlrd below, and it works fine; I just changed the sheetname argument to sheet_name:
pandas==0.22.0
xlrd==1.1.0
df=pd.read_excel('weather.xlsx',sheet_name='Sheet1',header=0)
