I am loading multiple parquet files containing timeseries data together, but the loaded Dask dataframe has unknown partitions, so I can't apply various time series operations to it.
df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
For instance, df_resampled = df.resample('1T').mean().compute() gives the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
1 df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self, rule, closed, label)
2627 from .tseries.resample import Resampler
2628
-> 2629 return Resampler(self, rule, closed=closed, label=label)
2630
2631 @derived_from(pd.DataFrame)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self, obj, rule, **kwargs)
118 "for more information."
119 )
--> 120 raise ValueError(msg)
121 self.obj = obj
122 self._rule = pd.tseries.frequencies.to_offset(rule)
ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
I went to the link https://docs.dask.org/en/latest/dataframe-design.html#partitions and it says:
In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).
I then tried the following, but with no success.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')
This step throws the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
1 df = dd.read_parquet(os.path.join(OUTPUT_DATA_DIR, '20*.gzip'))
----> 2 df.set_index('Timestamps')
3 # df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
3915 npartitions=npartitions,
3916 divisions=divisions,
-> 3917 **kwargs,
3918 )
3919
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
483 if divisions is None:
484 sizes = df.map_partitions(sizeof) if repartition else []
--> 485 divisions = index2._repartition_quantiles(npartitions, upsample=upsample)
486 mins = index2.map_partitions(M.min)
487 maxes = index2.map_partitions(M.max)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
3755 return self[key]
3756 else:
-> 3757 raise AttributeError("'DataFrame' object has no attribute %r" % key)
3758
3759 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'
Can anybody suggest the right way to load multiple timeseries files as a Dask dataframe so that pandas-style timeseries operations can be applied to it?
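One approach worth trying (just a sketch; it assumes 'Timestamps' is stored as an ordinary datetime column in the files) is to read without inferring an index and then set the index explicitly, so that Dask sorts the data and computes known divisions. A possible cause of the AttributeError above is that 'Timestamps' ends up both as the inferred index and as a regular column, so df['Timestamps'] selects a DataFrame instead of a Series; reading with index=False sidesteps that.

import dask.dataframe as dd

# Read without inferring an index so 'Timestamps' stays an ordinary column,
# then set it explicitly; set_index performs a sort and records the divisions,
# after which time-based operations such as resample work.
df = dd.read_parquet('/path/to/*.parquet', index=False)
df = df.set_index('Timestamps')  # expensive: triggers a full shuffle/sort
df_resampled = df.resample('1T').mean().compute()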
I have a dataframe df created through an import from a MySQL database:
ID CONFIG
0 276 {"pos":[{"type":"geo...
1 349 {"pos":[{"type":"geo...
2 378 {"pos":[{"type":"geo...
3 381 {"pos":[{"type":"geo...
4 385 {"pos":[{"type":"geo...
where the elements in the CONFIG column all have the form:
{"posit":[{"type":"geo_f","priority":1,"validity":0},{"type":"geo_m","priority":2,"validity":0},{"type":"geo_i","priority":3,"validity":0},{"type":"geo_c","priority":4,"validity":0}]}
Now, I was convinced these elements were JSON-type elements, and tried the following method to transform them into columns:
df_new = pd.json_normalize(df['CONFIG'])
However, this returns the following error:
AttributeError: 'str' object has no attribute 'values'
What am I missing? Thankful for any help!
EDIT: Full Traceback
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-23db4c0afdab> in <module>
----> 1 df_new = pd.json_normalize(df['CONFIG'])
c:\users\s-degossondevarennes\appdata\local\programs\python\python37\lib\site-packages\pandas\io\json\_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
c:\users\s-degossondevarennes\appdata\local\programs\python\python37\lib\site-packages\pandas\io\json\_normalize.py in <genexpr>(.0)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
AttributeError: 'str' object has no attribute 'values'
The first issue is that the values in the CONFIG column are strings in disguise, so literal_eval can turn them into true dictionaries. They are then all wrapped under the "posit" key, which we'd better strip away. That leaves us with lists, so explode comes in. Overall:
from ast import literal_eval
import pandas as pd

pd.json_normalize(df['CONFIG'].apply(lambda x: literal_eval(x)["posit"]).explode())
For a 1-row sample of the data, I get:
type priority validity
0 geo_f 1 0
1 geo_m 2 0
2 geo_i 3 0
3 geo_c 4 0
I would like to iterate through groups in a dataframe. This is possible in pandas, but when I port this to koalas, I get an error.
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
for a in df.groupby('x'):
    print(a)
Here is the error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-35-d4164d1f71e0> in <module>
----> 1 for a in df.groupby('x'):
2 print(a)
/opt/conda/lib/python3.7/site-packages/databricks/koalas/groupby.py in __getitem__(self, item)
2630 if self._as_index and is_name_like_value(item):
2631 return SeriesGroupBy(
-> 2632 self._kdf._kser_for(item if is_name_like_tuple(item) else (item,)),
2633 self._groupkeys,
2634 dropna=self._dropna,
/opt/conda/lib/python3.7/site-packages/databricks/koalas/frame.py in _kser_for(self, label)
721 Name: id, dtype: int64
722 """
--> 723 return self._ksers[label]
724
725 def _apply_series_op(self, op, should_resolve: bool = False):
KeyError: (0,)
Is this kind of group iteration possible in koalas? The koalas documentation seems to imply it is possible: https://koalas.readthedocs.io/en/latest/reference/groupby.html
Groupby iteration is not yet implemented:
https://github.com/databricks/koalas/issues/2014
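Until it is, a possible workaround (just a sketch; it assumes the number of distinct group keys is small enough to collect to the driver, and note that each iteration triggers its own Spark job) is to loop over the unique keys and filter the Koalas DataFrame per key:

import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})
df = ks.from_pandas(pdf)

# Collect the distinct keys to the driver, then filter the Koalas frame per key.
for key in df['x'].unique().to_pandas():
    group = df[df['x'] == key]
    print(key)
    print(group.to_pandas())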
I have a parquet file called data.parquet. I'm using the dask library in Python. When I run the line
import dask.dataframe as dd
df = dd.read_parquet('data.parquet',engine='pyarrow')
I get the error
TypeError Traceback (most recent call last)
<ipython-input-22-807fa43763c1> in <module>
----> 1 df = dd.read_parquet('data.parquet',engine='pyarrow')
~/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, infer_divisions)
1395 categories=categories,
1396 index=index,
-> 1397 infer_divisions=infer_divisions,
1398 )
1399
~/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet.py in _read_pyarrow(fs, fs_token, paths, columns, filters, categories, index, infer_divisions)
858 _open = lambda fn: pq.ParquetFile(fs.open(fn, mode="rb"))
859 for piece in dataset.pieces:
--> 860 pf = piece.get_metadata(_open)
861 # non_empty_pieces.append(piece)
862 if pf.num_row_groups > 0:
TypeError: get_metadata() takes 1 positional argument but 2 were given
I just don't understand why this happens, since this is how it is implemented here.
Any help will be appreciated!
I faced the same problem. I resolved it by upgrading dask to version 2.30.0.
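For reference, you can check the installed version before and after upgrading (the exact upgrade command depends on your environment, e.g. pip or conda):

import dask
print(dask.__version__)  # the fix reported above was moving to 2.30.0 (or newer)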
I am trying to create a function that returns either the mean, median, or standard deviation of all columns in a Pandas DataFrame using NumPy functions.
It is for a school assignment, so there's no reason for using NumPy other than it is what is being asked of me. I am struggling to figure out how to use a NumPy function with a Pandas DataFrame for this problem.
Here is the text of the problem.
The code cell below contains a function called comp_sample_stat that accepts 2 parameters: "df", which contains data from the Dow Jones for a particular company, and "stat", which will contain 1 of the 3 strings: "mean", "std", or "median".
For this problem:
if the stat is equal to "mean" return the mean of the dataframe columns using numpy's mean function
if the stat is equal to "median" return the median of the dataframe columns using numpy's median function
if the stat is equal to "std" return the std of the dataframe columns using numpy's std function
Here is the function I have written.
def comp_sample_stat(df, stat='mean'):
    '''
    Computes a sample statistic for any dataframe passed in

    Parameters
    ----------
    df: Pandas dataframe

    Returns
    -------
    a pandas dataframe
    '''
    df_mean = df.apply(np.mean(df))
    df_median = df.apply(np.median(df))
    df_std = df.apply(np.std(df))

    if stat is str('std'):
        return df_std
    elif stat is str('median'):
        return df_median
    else:
        return df_mean
df is a DataFrame that has been defined previously in my assignment as follows:
def read_data(file_path):
    '''
    Reads in a dataset using pandas.

    Parameters
    ----------
    file_path : string containing path to a file

    Returns
    -------
    pandas dataframe with data read in from the file path
    '''
    read_file = pd.read_csv(file_path)
    new_df = pd.DataFrame(read_file)
    return new_df

df = read_data('data/dow_jones_index.data')
The variable df_AA has also been previously defined as follows:
def select_stock(df, symbol):
    '''
    Selects data only containing a particular stock symbol.

    Parameters
    ----------
    df: dataframe containing data from the dow jones index
    stock: string containing the stock symbol to select

    Returns
    -------
    dataframe containing a particular stock
    '''
    stock = df[df.stock == symbol]
    return stock

df_AA = select_stock(df.copy(), 'AA')
When I call the function within a Jupyter Notebook as follows:
comp_sample_stat(df_AA)
I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-a2bcbeedcc56> in <module>()
22 return df_mean
23
---> 24 comp_sample_stat(df_AA)
<ipython-input-17-a2bcbeedcc56> in comp_sample_stat(df, stat)
11 a pandas dataframe
12 '''
---> 13 df_mean = df.apply(np.mean(df))
14 df_median = df.apply(np.median(df))
15 df_std = df.apply(np.std(df))
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6012 args=args,
6013 kwds=kwds)
-> 6014 return op.get_result()
6015
6016 def applymap(self, func):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
316 *self.args, **self.kwds)
317
--> 318 return super(FrameRowApply, self).get_result()
319
320 def apply_broadcast(self):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
TypeError: ("'Series' object is not callable", 'occurred at index
quarter')
DataFrame.apply expects you to pass it a function, not a dataframe. So you should be passing np.mean without arguments.
That is, you should be doing something like this:
df_mean = df.apply(np.mean)
The docs.
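Putting that together, a sketch of a corrected version of the whole function (two assumptions on my part: the statistics should only be computed over the numeric columns, so string columns such as stock are excluded first, and stat is compared with == rather than is, since identity comparison of strings is unreliable):

import numpy as np

def comp_sample_stat(df, stat='mean'):
    '''
    Computes a sample statistic over the numeric columns of a dataframe.
    '''
    numeric = df.select_dtypes(include='number')  # drop string columns like 'stock'
    if stat == 'std':
        return numeric.apply(np.std)
    elif stat == 'median':
        return numeric.apply(np.median)
    else:
        return numeric.apply(np.mean)

comp_sample_stat(df_AA, stat='median')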
I'm trying to create a function where the user puts in a year and the output is the top ten countries by expenditure, using this Lynda class as a model.
Here's the data frame
df.dtypes
Country Name object
Country Code object
Year int32
CountryYear object
Population int32
GDP float64
MilExpend float64
Percent float64
dtype: object
Country Name Country Code Year CountryYear Pop GDP Expend Percent
0 Aruba ABW 1960 ABW-1960 54208 0.0 0.0 0.0
I've tried this code and got errors:
Code:
def topten(Year):
    simple = df_details_merged.loc[Year].sort('MilExpend', ascending=False).reset_index()
    simple = simple.drop(['Country Code', 'CountryYear'], axis=1).head(10)
    simple.index = simple.index + 1
    return simple
topten(1990)
This is the rather big error I received:
Can I get some assistance? I can't even figure out what the error is. :-(
C:\Users\mycomputer\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: sort is deprecated, use sort_values(inplace=True) for INPLACE sorting
from ipykernel import kernelapp as app
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in _try_kind_sort(arr)
1738 # if kind==mergesort, it can fail for object dtype
-> 1739 return arr.argsort(kind=kind)
1740 except TypeError:
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-105-0c974c6a1b44> in <module>()
----> 1 topten(1990)
<ipython-input-104-b8c336014d5b> in topten(Year)
1 def topten(Year):
----> 2 simple = df_details_merged.loc[Year].sort('MilExpend',ascending=False).reset_index()
3 simple = simple.drop(['Country Code', 'CountryYear'],axis=1).head(10)
4 simple.index = simple.index + 1
5
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in sort(self, axis, ascending, kind, na_position, inplace)
1831
1832 return self.sort_values(ascending=ascending, kind=kind,
-> 1833 na_position=na_position, inplace=inplace)
1834
1835 def order(self, na_last=None, ascending=True, kind='quicksort',
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in sort_values(self, axis, ascending, inplace, kind, na_position)
1751 idx = _default_index(len(self))
1752
-> 1753 argsorted = _try_kind_sort(arr[good])
1754
1755 if not ascending:
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in _try_kind_sort(arr)
1741 # stable sort not available for object dtype
1742 # uses the argsort default quicksort
-> 1743 return arr.argsort(kind='quicksort')
1744
1745 arr = self._values
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
The first argument to .loc is the row label.
When you call df_details_merged.loc[1990], pandas will find the row with the label 1990 and return that row as a Series. So you get back a Series whose index is Country Name, Country Code, ..., with the values being the values from that row. Your code then tries to sort this by MilExpend, and that's where it fails.
What you need isn't loc, but a simple condition: df[df.Year == Year]. That says "give me the whole dataframe, but only the rows where the 'Year' column contains whatever I've specified in the 'Year' variable" (1990 in your example).
sort will still work for the time being, but is being deprecated, so use sort_values instead. Putting that together:
simple = df_details_merged[df_details_merged.Year == Year].sort_values(by='MilExpend', ascending=False).reset_index()
Then you can go ahead and drop the columns, and fetch the top 10 rows as you're doing now.
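A sketch of the full function with those changes applied (assuming, as in the question, that df_details_merged is the merged dataframe and 'MilExpend' is the expenditure column):

def topten(year):
    # Keep only the rows for the requested year, then rank by expenditure.
    simple = (df_details_merged[df_details_merged.Year == year]
              .sort_values(by='MilExpend', ascending=False)
              .reset_index(drop=True))
    simple = simple.drop(['Country Code', 'CountryYear'], axis=1).head(10)
    simple.index = simple.index + 1  # number the rows 1 to 10
    return simple

topten(1990)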