How can I open a .snappy.parquet file in python?

How can I open a .snappy.parquet file in python 3.5? So far, I used this code:
import numpy
import pyarrow
filename = "/Users/T/Desktop/data.snappy.parquet"
df = pyarrow.parquet.read_table(filename).to_pandas()
But, it gives this error:
AttributeError: module 'pyarrow' has no attribute 'compat'
P.S. I installed pyarrow this way:
pip install pyarrow

I had the same issue and managed to solve it by following the solution proposed in https://github.com/dask/fastparquet/issues/366.
1) Install python-snappy using conda install (for some reason I couldn't download it with pip install).
2) Add the snappy_decompress function:
from fastparquet import ParquetFile
import snappy
def snappy_decompress(data, uncompressed_size):
    return snappy.decompress(data)
pf = ParquetFile('filename')  # filename includes the .snappy.parquet extension
dff = pf.to_pandas()

The error AttributeError: module 'pyarrow' has no attribute 'compat' is sadly a bit misleading. To call the to_pandas() function on a pyarrow.Table instance you need pandas installed; the error is a symptom of that missing requirement.
pandas is not a hard requirement of pyarrow, as most of its functionality is usable with just Python built-ins and NumPy. This way, users of pyarrow who don't need pandas can work with it without having pandas pre-installed.

You can use pandas to read .snappy.parquet files into a pandas DataFrame.
import pandas as pd
filename = "/Users/T/Desktop/data.snappy.parquet"
df = pd.read_parquet(filename)

Related

Python - Unable to find assembly 'OSIsoft.AFSDK'

I am trying to import "clr" in a Python script, and I get an error whether or not "clr" is installed. If "clr" is installed, I get the error:
AttributeError: module 'clr' has no attribute 'AddReference'
If I remove "clr" and install pythonnet (as suggested to fix the "clr" error), then I get this error:
FileNotFoundException: Unable to find assembly 'OSIsoft.AFSDK'.
at Python.Runtime.CLRModule.AddReference(String name)
My imports look like this:
import sys
sys.path.append('C:\\Program Files (x86)\\PIPC\\AF\\PublicAssemblies\\4.0\\')
import clr
clr.AddReference('OSIsoft.AFSDK')
from OSIsoft.AF.PI import *
from OSIsoft.AF.Search import *
from OSIsoft.AF.Asset import *
from OSIsoft.AF.Data import *
from OSIsoft.AF.Time import *
import pandas as pd
from datetime import datetime
It seems like I'm missing something. I have installed the latest Oracle client 14.1, and that folder resides in my Python script's working environment. Thank you for any help!
Try removing the wrong 'clr' module, which gets mixed up with the one provided by pythonnet, then reinstall pythonnet:
pip uninstall clr
pip uninstall pythonnet
pip install pythonnet

How can I solve "NameError: name 'ArffDecoder' is not defined"?

Hi, I have written the following code:
import pandas as pd
import numpy as np
import arff as arf
file = open("final-dataset.arff")
decoder = arf.ArffDecoder()
data = decoder.decode(file, encode_nominal=True)
There are multiple arff-related packages. It's likely you installed arff (project source, PyPI) when, based on your use of the class ArffDecoder(), the one you are looking for is liac-arff.
To fix your issue, uninstall arff and install liac-arff instead. arff must be removed before you can use liac-arff, because both packages register the same module name when you import arff.

AttributeError: module 'fsspec' has no attribute 'utils' when using to_csv in python

I'm trying to save my dataframe to a CSV file:
import pandas as pd

# Create a DataFrame:
new_df = pd.DataFrame({'id': [1,2,3,4,5], 'LETTERS': ['A','B','C','D','E'], 'letters': ['a','b','c','d','e']})
# Save it as CSV in your folder:
new_df.to_csv('C:\\my\\file\\location\\new_df.csv')
This is the error I'm getting:
AttributeError: module 'fsspec' has no attribute 'utils'
First of all, check your versions with pd.show_versions(); the most likely cause is that your fsspec version is old and therefore incompatible with the other pre-installed packages. If pd.show_versions() itself raises an error, try updating pandas: pip install pandas --upgrade. If none of this works, a package is probably missing, but I can't tell which one.
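If pd.show_versions() itself fails, you can still read installed versions from package metadata without importing anything; a small hypothetical helper:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {name: version string, or None if not installed},
    without importing the packages themselves."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result

print(installed_versions(["pandas", "fsspec"]))
```

Comparing these against the versions pandas expects usually pinpoints the stale package.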

Python Pandas to convert CSV to Parquet using Fastparquet

I am using Python 3.6 interpreter in my PyCharm venv, and trying to convert a CSV to Parquet.
import pandas as pd
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet')
Error-1
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support
Solution-1
Installed fastparquet 0.2.1
Error-2
File "/Users/python parquet/venv/lib/python3.6/site-packages/fastparquet/compression.py", line 131, in compress_data
(algorithm, sorted(compressions)))
RuntimeError: Compression 'snappy' not available. Options: ['GZIP', 'UNCOMPRESSED']
I installed python-snappy 0.5.3 but I'm still getting the same error. Do I need to install any other library?
If I use PyArrow 0.12.0 engine, I don't experience the issue.
In fastparquet snappy compression is an optional feature.
To quickly check a conversion from csv to parquet, you can execute the following script (only requires pandas and fastparquet):
import pandas as pd
from fastparquet import write, ParquetFile
df = pd.DataFrame({"col1": [1,2,3,4], "col2": ["a","b","c","d"]})
# df.head() # Test your initial value
df.to_csv("/tmp/test_csv", index=False)
df_csv = pd.read_csv("/tmp/test_csv")
df_csv.head() # Test your intermediate value
df_csv.to_parquet("/tmp/test_parquet", compression="GZIP")
df_parquet = ParquetFile("/tmp/test_parquet").to_pandas()
df_parquet.head() # Test your final value
However, if you need to write or read using snappy compression, you might follow this answer about installing the snappy library on Ubuntu.
I've used the following versions: python 3.10.9, fastparquet==2022.12.0, pandas==1.5.2.
This code works seamlessly for me:
import pandas as pd
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet', engine="fastparquet")
I'd recommend you move away from Python 3.6, as it has reached end of life and is no longer supported.

IndexError: pop from empty stack (python)

I tried to import an Excel file (about 71 MB) with the following code, and when running it I get "IndexError: pop from empty stack". Kindly help me with this.
Code:
import pandas as pd
df1 = pd.read_excel('F:/Test PCA/Week-7-MachineLearning/weather.xlsx',
                    sheetname='Sheet1', header=0)
Data: https://www.dropbox.com/s/zyrry53li55hvha/weather.xlsx?dl=0
Using the latest pandas and xlrd, reading the "weather.xlsx" file you provided works fine:
df1 = pd.read_excel('weather.xlsx',sheet_name='Sheet1')
Can you try running:
pip install --upgrade pandas
pip install --upgrade xlrd
to ensure you have the latest versions of the modules for reading the file?
I tried the same code you provided with the versions of pandas and xlrd below, and it works fine; I just changed the sheetname argument to sheet_name:
pandas==0.22.0
xlrd==1.1.0
df = pd.read_excel('weather.xlsx', sheet_name='Sheet1', header=0)
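Note that on current versions the landscape has shifted: xlrd 2.0+ no longer reads .xlsx files, and pandas uses openpyxl for them instead. A sketch assuming openpyxl is installed and using an example temp path:

```python
import pandas as pd

# Modern pandas reads .xlsx via openpyxl (xlrd 2.0+ only handles .xls).
df = pd.DataFrame({"temp": [21.5, 19.0]})
df.to_excel("/tmp/weather.xlsx", sheet_name="Sheet1", index=False)

df1 = pd.read_excel("/tmp/weather.xlsx", sheet_name="Sheet1", engine="openpyxl")
```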
