I have a parquet file called data.parquet, and I'm using the dask library in Python. When I run
import dask.dataframe as dd
df = dd.read_parquet('data.parquet',engine='pyarrow')
I get the error
TypeError Traceback (most recent call last)
<ipython-input-22-807fa43763c1> in <module>
----> 1 df = dd.read_parquet('data.parquet',engine='pyarrow')
~/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, infer_divisions)
1395 categories=categories,
1396 index=index,
-> 1397 infer_divisions=infer_divisions,
1398 )
1399
~/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet.py in _read_pyarrow(fs, fs_token, paths, columns, filters, categories, index, infer_divisions)
858 _open = lambda fn: pq.ParquetFile(fs.open(fn, mode="rb"))
859 for piece in dataset.pieces:
--> 860 pf = piece.get_metadata(_open)
861 # non_empty_pieces.append(piece)
862 if pf.num_row_groups > 0:
TypeError: get_metadata() takes 1 positional argument but 2 were given
I just don't understand why this happens, since this is how it is implemented here.
Any help will be appreciated!
I faced the same problem. I resolved it by upgrading dask to version 2.30.0.
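For reference, a minimal sketch of that fix, assuming you can upgrade the environment (the version pin is only an example; the traceback shows an older dask calling piece.get_metadata(_open) against a pyarrow whose get_metadata() no longer accepts an argument):
# In a shell, upgrade dask so it matches the installed pyarrow:
#   pip install --upgrade "dask[dataframe]>=2.30.0" pyarrow
import dask.dataframe as dd
# After upgrading, the original call should work unchanged.
df = dd.read_parquet('data.parquet', engine='pyarrow')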
I am loading multiple parquet files containing time series data together, but the loaded dask dataframe has unknown divisions, because of which I can't apply various time series operations to it.
df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
For instance, df_resampled = df.resample('1T').mean().compute() gives the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
1 df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self, rule, closed, label)
2627 from .tseries.resample import Resampler
2628
-> 2629 return Resampler(self, rule, closed=closed, label=label)
2630
2631 @derived_from(pd.DataFrame)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self, obj, rule, **kwargs)
118 "for more information."
119 )
--> 120 raise ValueError(msg)
121 self.obj = obj
122 self._rule = pd.tseries.frequencies.to_offset(rule)
ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
I went to the link https://docs.dask.org/en/latest/dataframe-design.html#partitions and it says:
In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).
I then tried the following, but with no success.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')
This step throws the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
1 df = dd.read_parquet(os.path.join(OUTPUT_DATA_DIR, '20*.gzip'))
----> 2 df.set_index('Timestamps')
3 # df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
3915 npartitions=npartitions,
3916 divisions=divisions,
-> 3917 **kwargs,
3918 )
3919
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
483 if divisions is None:
484 sizes = df.map_partitions(sizeof) if repartition else []
--> 485 divisions = index2._repartition_quantiles(npartitions, upsample=upsample)
486 mins = index2.map_partitions(M.min)
487 maxes = index2.map_partitions(M.max)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
3755 return self[key]
3756 else:
-> 3757 raise AttributeError("'DataFrame' object has no attribute %r" % key)
3758
3759 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'
Can anybody suggest the right way to load multiple time series files as a dask dataframe so that pandas time series operations can be applied to it?
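For what it's worth, here is a minimal sketch of the approach the linked docs describe, assuming the parquet files are already globally sorted by Timestamps (the column name is taken from the question; if the data is not pre-sorted, drop sorted=True and accept a full shuffle):
import dask.dataframe as dd
# Read without an index, then index on the timestamp column so dask learns the divisions.
df = dd.read_parquet('/path/to/*.parquet')
# sorted=True derives divisions from per-partition min/max instead of the
# quantile-based shuffle that raised the AttributeError above.
df = df.set_index('Timestamps', sorted=True)
# With known divisions, pandas-style time series operations work.
df_resampled = df.resample('1T').mean().compute()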
I would like to iterate through groups in a dataframe. This is possible in pandas, but when I port this to koalas, I get an error.
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
for a in df.groupby('x'):
print(a)
Here is the error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-35-d4164d1f71e0> in <module>
----> 1 for a in df.groupby('x'):
2 print(a)
/opt/conda/lib/python3.7/site-packages/databricks/koalas/groupby.py in __getitem__(self, item)
2630 if self._as_index and is_name_like_value(item):
2631 return SeriesGroupBy(
-> 2632 self._kdf._kser_for(item if is_name_like_tuple(item) else (item,)),
2633 self._groupkeys,
2634 dropna=self._dropna,
/opt/conda/lib/python3.7/site-packages/databricks/koalas/frame.py in _kser_for(self, label)
721 Name: id, dtype: int64
722 """
--> 723 return self._ksers[label]
724
725 def _apply_series_op(self, op, should_resolve: bool = False):
KeyError: (0,)
Is this kind of group iteration possible in koalas? The koalas documentation seems to imply it is possible: https://koalas.readthedocs.io/en/latest/reference/groupby.html
Groupby iteration is not yet implemented:
https://github.com/databricks/koalas/issues/2014
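Until that is implemented, one possible workaround is sketched below (assuming either that the data fits in memory or that filtering one key at a time is acceptable; this is not an official koalas API for group iteration):
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})
df = ks.from_pandas(pdf)
# Option 1: collect to pandas and iterate there (only viable for small data).
for key, group in df.to_pandas().groupby('x'):
    print(key, group)
# Option 2: stay distributed and filter one group key at a time.
for key in df['x'].unique().to_pandas():
    group = df[df['x'] == key]
    print(key, group.to_pandas())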
I have directories with the following names:
s3://bucket/elig_date=2020-06-01/
s3://bucket/elig_date=2020-06-02/
....
s3://bucket/elig_date=2020-09-30/
s3://bucket/elig_date=2020-10-01/
...
s3://bucket/elig_date=2020-12-31/
When I want to read all files inside all directories from 2020-06-01 to 2020-09-30, I use the following and it works:
import dask.dataframe as dd
all_data = dd.read_parquet("s3://bucket/elig_date=2020-0[6-9]-*/*")
But I want to extend this up to the directory 2020-12-31. I tried the following, and it doesn't work:
all_data = dd.read_parquet("s3://bucket/elig_date=2020-0[6-9]|1[0-2]-*/*")
This throws the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-61-60da829cf51e> in <module>
----> 1 all_data = dd.read_parquet("s3://bucket/elig_date=2020-0[6-9]|1[0-2]-*/*")
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, read_from_paths, chunksize, **kwargs)
333 index = [index]
334
--> 335 meta, statistics, parts, index = engine.read_metadata(
336 fs,
337 paths,
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, read_from_paths, engine, **kwargs)
497 split_row_groups,
498 gather_statistics,
--> 499 ) = cls._gather_metadata(
500 paths,
501 fs,
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(cls, paths, fs, split_row_groups, gather_statistics, filters, index, read_from_paths, dataset_kwargs)
1647
1648 # Step 1: Create a ParquetDataset object
-> 1649 dataset, base, fns = _get_dataset_object(paths, fs, filters, dataset_kwargs)
1650 if fns == [None]:
1651 # This is a single file. No danger in gathering statistics
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in _get_dataset_object(paths, fs, filters, dataset_kwargs)
1600 if proxy_metadata:
1601 dataset.metadata = proxy_metadata
-> 1602 elif fs.isdir(paths[0]):
1603 # This is a directory. We can let pyarrow do its thing.
1604 # Note: In the future, it may be best to avoid listing the
IndexError: list index out of range
I only tested it on RegExr because I do not have your files, but this worked there:
s3://bucket/elig_date=2020-(0[6-9])|(1[0-2])-*/*
It's the same as you had, just with parentheses around the two alternatives.
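If the combined pattern still does not match, another option is to reuse the character-class style that already works once per range and concatenate the results (a sketch, with the bucket and prefix names taken from the question):
import dask.dataframe as dd
# Read June-September and October-December separately, using the working pattern style.
jun_sep = dd.read_parquet("s3://bucket/elig_date=2020-0[6-9]-*/*")
oct_dec = dd.read_parquet("s3://bucket/elig_date=2020-1[0-2]-*/*")
# Stitch the two ranges together into a single dataframe.
all_data = dd.concat([jun_sep, oct_dec])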
I am doing an online tutorial in a Jupyter notebook with Python and pandas, and when I run the following code, I run into this error.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# reading the csv file
titanic = pd.read_csv("titanic.csv")
titanic_class = titanic.groupby("Pclass")
titanic_class.get_group(1)
titanic_class.max()
AssertionError Traceback (most recent call last)
<ipython-input-26-4d1be28a55cb> in <module>
1 #max ticket fare paid
----> 2 titanic_class.max()
~\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in f(self, **kwargs)
1369 # try a cython aggregation if we can
1370 try:
-> 1371 return self._cython_agg_general(alias, alt=npfunc, **kwargs)
1372 except DataError:
1373 pass
~\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
992 ) -> DataFrame:
993 agg_blocks, agg_items = self._cython_agg_blocks(
--> 994 how, alt=alt, numeric_only=numeric_only, min_count=min_count
995 )
996 return self._wrap_agged_blocks(agg_blocks, items=agg_items)
~\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1098 # Clean up the mess left over from split blocks.
1099 for locs, result in zip(split_items, split_frames):
-> 1100 assert len(locs) == result.shape[1]
1101 for i, loc in enumerate(locs):
1102 new_items.append(np.array([loc], dtype=locs.dtype))
AssertionError:
Can someone tell me what's wrong? Both titanic_class.sum() and titanic_class.mean() work without any error.
The last column of the CSV file has letters. Once I removed them, the max function worked.
This happens when any column has empty (NaN) values. Try to remove those columns before using max.
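Building on these answers, one way to sidestep the failure is to restrict the aggregation to numeric columns (a sketch; the column selection is an assumption, not part of the original tutorial):
import pandas as pd
titanic = pd.read_csv("titanic.csv")
# Keep only numeric columns, then group by passenger class and aggregate.
numeric_cols = titanic.select_dtypes(include="number").columns
titanic_class = titanic[numeric_cols].groupby(titanic["Pclass"])
print(titanic_class.max())
# Alternatively, ask the groupby to skip non-numeric columns directly.
print(titanic.groupby("Pclass").max(numeric_only=True))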
I'm trying to write a pandas DataFrame to a Stata .dta file. Following the advice given in Save .dta files in python, I wrote:
import pandas as pd
df.to_stata(workdir+' generosity.dta')
and I got the error message TypeError: object of type 'float' has no len(), and I'm not sure what this means.
Most columns in df are objects, but there are three columns that are float64.
I tried following another method (as described in this post: Convert .CSV files to .DTA files in Python) via rpy2, but when I tried to install it, I received the error message "Error: tried to guess R's home but no R command in the path", so I've given up on it (I have R on my computer but have not used it once).
Thank you very much.
Edit: here is the full traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-140-7a8f8bc8d446> in <module>()
1 #write the dataframe as a Stata file
----> 2 df.to_stata(workdir+group+' generosity.dta')
C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in to_stata(self, fname, convert_dates, write_index, encoding, byteorder, time_stamp, data_label)
1262 time_stamp=time_stamp, data_label=data_label,
1263 write_index=write_index)
-> 1264 writer.write_file()
1265
1266 @Appender(fmt.docstring_to_string, indents=1)
C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\stata.pyc in write_file(self)
1245 self._write(_pad_bytes("", 5))
1246 if self._convert_dates is None:
-> 1247 self._write_data_nodates()
1248 else:
1249 self._write_data_dates()
C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\stata.pyc in _write_data_nodates(self)
1327 if var is None or var == np.nan:
1328 var = _pad_bytes('', typ)
-> 1329 if len(var) < typ:
1330 var = _pad_bytes(var, typ)
1331 if compat.PY3:
TypeError: object of type 'float' has no len()
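For what it's worth, the check var == np.nan in the traceback never matches (NaN compares unequal to itself), so a NaN in a string column reaches len() as a float. A possible workaround, assuming the object columns are meant to hold strings, is to replace missing values before writing:
import pandas as pd
# Replace missing values in object (string) columns so the Stata writer
# never calls len() on a float NaN.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna("").astype(str)
df.to_stata(workdir + ' generosity.dta')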