How can I iterate through elements of a Koalas groupby? - python

I would like to iterate through groups in a dataframe. This is possible in pandas, but when I port this to koalas, I get an error.
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
for a in df.groupby('x'):
    print(a)
Here is the error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-35-d4164d1f71e0> in <module>
----> 1 for a in df.groupby('x'):
2 print(a)
/opt/conda/lib/python3.7/site-packages/databricks/koalas/groupby.py in __getitem__(self, item)
2630 if self._as_index and is_name_like_value(item):
2631 return SeriesGroupBy(
-> 2632 self._kdf._kser_for(item if is_name_like_tuple(item) else (item,)),
2633 self._groupkeys,
2634 dropna=self._dropna,
/opt/conda/lib/python3.7/site-packages/databricks/koalas/frame.py in _kser_for(self, label)
721 Name: id, dtype: int64
722 """
--> 723 return self._ksers[label]
724
725 def _apply_series_op(self, op, should_resolve: bool = False):
KeyError: (0,)
Is this kind of group iteration possible in koalas? The koalas documentation kind of implies it is possible - https://koalas.readthedocs.io/en/latest/reference/groupby.html

Groupby iteration is not yet implemented:
https://github.com/databricks/koalas/issues/2014
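In the meantime, one possible workaround (a minimal sketch, assuming the grouping column has a manageable number of distinct keys and that collecting those keys to the driver is acceptable) is to gather the unique keys and filter per key:
import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})
df = ks.from_pandas(pdf)

# Collect the distinct group keys to the driver, then filter once per key.
for key in df['x'].unique().to_pandas():
    group = df[df['x'] == key]
    print(key)
    print(group)
If the data fits in driver memory, converting first with df.to_pandas() and iterating a plain pandas groupby is simpler still.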

Related

Splitting dataframe column into multiple columns using json_normalize does not work

I have a dataframe df created through an import from a MySQL database:
ID CONFIG
0 276 {"pos":[{"type":"geo...
1 349 {"pos":[{"type":"geo...
2 378 {"pos":[{"type":"geo...
3 381 {"pos":[{"type":"geo...
4 385 {"pos":[{"type":"geo...
where the elements in the CONFIG column all have the form:
{"posit":[{"type":"geo_f","priority":1,"validity":0},{"type":"geo_m","priority":2,"validity":0},{"type":"geo_i","priority":3,"validity":0},{"type":"geo_c","priority":4,"validity":0}]}
Now, I was convinced these elements were JSON-type objects and tried the following to transform them into columns:
df_new = pd.json_normalize(df['CONFIG'])
However, this returns the following error:
AttributeError: 'str' object has no attribute 'values'
What am I missing? Thankful for any help!
EDIT: Full Traceback
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-23db4c0afdab> in <module>
----> 1 df_new = pd.json_normalize(df['CONFIG'])
c:\users\s-degossondevarennes\appdata\local\programs\python\python37\lib\site-packages\pandas\io\json\_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
c:\users\s-degossondevarennes\appdata\local\programs\python\python37\lib\site-packages\pandas\io\json\_normalize.py in <genexpr>(.0)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
AttributeError: 'str' object has no attribute 'values'
The first issue is that the values in the CONFIG column are strings in disguise, so literal_eval can turn them into true dictionaries. They are then all nested under the "posit" key, which we'd better strip off. That leaves lists, so explode comes in. Overall:
from ast import literal_eval
import pandas as pd

pd.json_normalize(df['CONFIG'].apply(lambda x: literal_eval(x)["posit"]).explode())
For a one-row data sample, I get
type priority validity
0 geo_f 1 0
1 geo_m 2 0
2 geo_i 3 0
3 geo_c 4 0
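For reference, here is a self-contained version of the same approach; the one-row frame below is made up to mirror the structure described in the question:
import pandas as pd
from ast import literal_eval

# Hypothetical one-row frame mirroring the question's CONFIG column.
df = pd.DataFrame({
    "ID": [276],
    "CONFIG": ['{"posit":[{"type":"geo_f","priority":1,"validity":0},'
               '{"type":"geo_m","priority":2,"validity":0}]}'],
})

# Parse the strings, pull out the "posit" lists, explode to one dict per row,
# and let json_normalize turn those dicts into columns.
flat = pd.json_normalize(df['CONFIG'].apply(lambda x: literal_eval(x)["posit"]).explode())
print(flat)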

Dask Dataframe: Resample partitioned data loaded from multiple parquet files

I am loading multiple parquet files containing time-series data together, but the resulting dask dataframe has unknown divisions, so I can't apply various time-series operations to it.
df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
For instance, df_resampled = df.resample('1T').mean().compute() gives the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
1 df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self, rule, closed, label)
2627 from .tseries.resample import Resampler
2628
-> 2629 return Resampler(self, rule, closed=closed, label=label)
2630
2631 #derived_from(pd.DataFrame)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self, obj, rule, **kwargs)
118 "for more information."
119 )
--> 120 raise ValueError(msg)
121 self.obj = obj
122 self._rule = pd.tseries.frequencies.to_offset(rule)
ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
I went to the link: https://docs.dask.org/en/latest/dataframe-design.html#partitions and it says,
In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).
I then tried the following, but with no success.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')
This step throws the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
1 df = dd.read_parquet(os.path.join(OUTPUT_DATA_DIR, '20*.gzip'))
----> 2 df.set_index('Timestamps')
3 # df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
3915 npartitions=npartitions,
3916 divisions=divisions,
-> 3917 **kwargs,
3918 )
3919
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
483 if divisions is None:
484 sizes = df.map_partitions(sizeof) if repartition else []
--> 485 divisions = index2._repartition_quantiles(npartitions, upsample=upsample)
486 mins = index2.map_partitions(M.min)
487 maxes = index2.map_partitions(M.max)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
3755 return self[key]
3756 else:
-> 3757 raise AttributeError("'DataFrame' object has no attribute %r" % key)
3758
3759 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'
Can anybody suggest the right way to load multiple time-series files as a dask dataframe so that pandas-style time-series operations can be applied?
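One route that follows the linked documentation (a sketch only, assuming the Timestamps column is globally sorted across the files; if it is not, drop sorted=True and let dask shuffle the data) is to set the index explicitly so dask learns the divisions:
import dask.dataframe as dd

# Read without an index, then build the index so dask computes divisions,
# which resample and other time-series operations require.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps', sorted=True)  # sorted=True skips the shuffle,
                                              # valid only if the data really is sorted

df_resampled = df.resample('1T').mean().compute()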

TypeError in read_parquet Dask

I have a parquet file called data.parquet. I'm using the dask library in Python. When I run the line
import dask.dataframe as dd
df = dd.read_parquet('data.parquet',engine='pyarrow')
I get the error
TypeError Traceback (most recent call last)
<ipython-input-22-807fa43763c1> in <module>
----> 1 df = dd.read_parquet('data.parquet',engine='pyarrow')
~/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, infer_divisions)
1395 categories=categories,
1396 index=index,
-> 1397 infer_divisions=infer_divisions,
1398 )
1399
~/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet.py in _read_pyarrow(fs, fs_token, paths, columns, filters, categories, index, infer_divisions)
858 _open = lambda fn: pq.ParquetFile(fs.open(fn, mode="rb"))
859 for piece in dataset.pieces:
--> 860 pf = piece.get_metadata(_open)
861 # non_empty_pieces.append(piece)
862 if pf.num_row_groups > 0:
TypeError: get_metadata() takes 1 positional argument but 2 were given
I just don't understand why this happens, since this is how it is implemented here.
Any help will be appreciated!
I faced the same problem. I resolved it by upgrading dask to version 2.30.0.
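A quick way to check which version is installed before upgrading (the pip line in the comment is just one possible way to pin the version mentioned above):
import dask

# If this prints something older than 2.30.0, upgrade, for example with:
#   pip install --upgrade "dask[dataframe]==2.30.0"
print(dask.__version__)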

using negative indices with Pathlib Parents method

I'm trying to use groupby on a pandas data structure containing pathlib information and file sizes from a particular drive. I want to total up the storage used at a particular depth of the file directory structure to see which directories are the most full.
I was trying to do a summation groupby on the Pathlib Parent value for each file but that still doesn't tell you what your total storage is at a particular depth. Pathlib "Parents" looked promising but it starts with the full path and works backwards, so I tried reverse indexing but it doesn't seem to work.
From what I read in the documentation Pathlib Parents are supposed to be sequences, which are supposed to support reverse indexes, but the error messages seem to imply they don't do negatives.
Here is the code I've been using (with help from http://pbpython.com/pathlib-intro.html)
import pandas as pd
from pathlib import Path
import time
dir_to_scan = "c:/Program Files"
p = Path(dir_to_scan)
all_files = []
for i in p.rglob('*.*'):
    all_files.append((i.name, i.parent, i.stat().st_size))

columns = ["File_Name", "Parent", "Size"]
df = pd.DataFrame.from_records(all_files, columns=columns)
df["path_stem"] = df['Parent'].apply(lambda x: x.parent if len(x.parents) < 3 else x.parents[-2])
The error trace is as follows:
IndexError Traceback (most recent call last)
<ipython-input-3-5748b1f0a9ee> in <module>()
1 #df.groupby('Parent')['Size'].sum()
2
----> 3 df["path_stem"]=df['Parent'].apply(lambda x: x.parent if len(x.parents)<3 else x.parents[-1] )
4
5 #df([apps])=df([Parent]).parents
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2549 else:
2550 values = self.asobject
-> 2551 mapped = lib.map_infer(values, f, convert=convert_dtype)
2552
2553 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-3-5748b1f0a9ee> in <lambda>(x)
1 #df.groupby('Parent')['Size'].sum()
2
----> 3 df["path_stem"]=df['Parent'].apply(lambda x: x.parent if len(x.parents)<3 else x.parents[-1] )
4
5 #df([apps])=df([Parent]).parents
C:\ProgramData\Anaconda3\lib\pathlib.py in __getitem__(self, idx)
592 def __getitem__(self, idx):
593 if idx < 0 or idx >= len(self):
--> 594 raise IndexError(idx)
595 return self._pathcls._from_parsed_parts(self._drv, self._root,
596 self._parts[:-idx - 1])
IndexError: -1
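A small sketch of the behaviour involved, with a made-up path and depth: on the Python version in the traceback, PurePath.parents only accepts non-negative indices (negative indices and slices were added in Python 3.10), but the parts tuple can be sliced freely to build a path at a fixed depth from the drive:
from pathlib import PureWindowsPath

# Made-up example path for illustration.
p = PureWindowsPath("c:/Program Files/Vendor/App/bin/tool.exe")

# p.parents runs from the immediate parent back up to the drive:
#   p.parents[0]                  -> c:/Program Files/Vendor/App/bin
#   p.parents[len(p.parents) - 1] -> c:/
# p.parents[-1] raises IndexError here, as in the traceback above.

# Slicing p.parts instead gives a path at a fixed depth from the drive.
depth = 2
stem = PureWindowsPath(*p.parts[:depth + 1])
print(stem)  # c:\Program Files\Vendor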

Python - Top Ten Function

I'm trying to create a function where the user puts in the year and the output is the top ten countries by expenditures using this Lynda class as a model.
Here's the data frame
df.dtypes
Country Name object
Country Code object
Year int32
CountryYear object
Population int32
GDP float64
MilExpend float64
Percent float64
dtype: object
Country Name Country Code Year CountryYear Pop GDP Expend Percent
0 Aruba ABW 1960 ABW-1960 54208 0.0 0.0 0.0
I've tried this code and got errors:
Code:
def topten(Year):
    simple = df_details_merged.loc[Year].sort('MilExpend', ascending=False).reset_index()
    simple = simple.drop(['Country Code', 'CountryYear'], axis=1).head(10)
    simple.index = simple.index + 1
    return simple

topten(1990)
This is the rather big error I received:
Can I get some assistance? I can't even figure out what the error is. :-(
C:\Users\mycomputer\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: sort is deprecated, use sort_values(inplace=True) for INPLACE sorting
from ipykernel import kernelapp as app
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in _try_kind_sort(arr)
1738 # if kind==mergesort, it can fail for object dtype
-> 1739 return arr.argsort(kind=kind)
1740 except TypeError:
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-105-0c974c6a1b44> in <module>()
----> 1 topten(1990)
<ipython-input-104-b8c336014d5b> in topten(Year)
1 def topten(Year):
----> 2 simple = df_details_merged.loc[Year].sort('MilExpend',ascending=False).reset_index()
3 simple = simple.drop(['Country Code', 'CountryYear'],axis=1).head(10)
4 simple.index = simple.index + 1
5
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in sort(self, axis, ascending, kind, na_position, inplace)
1831
1832 return self.sort_values(ascending=ascending, kind=kind,
-> 1833 na_position=na_position, inplace=inplace)
1834
1835 def order(self, na_last=None, ascending=True, kind='quicksort',
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in sort_values(self, axis, ascending, inplace, kind, na_position)
1751 idx = _default_index(len(self))
1752
-> 1753 argsorted = _try_kind_sort(arr[good])
1754
1755 if not ascending:
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in _try_kind_sort(arr)
1741 # stable sort not available for object dtype
1742 # uses the argsort default quicksort
-> 1743 return arr.argsort(kind='quicksort')
1744
1745 arr = self._values
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
The first argument to .loc is the row label.
When you call df_details_merged.loc[1960], pandas will find the row with the label 1960 and return that row as a Series. So you get back a Series with the index Country Name, Country Code, ..., with the values being the values from that row. Then your code tries to sort this by MilExpend, and that's where it fails.
What you need isn't loc, but a simple condition: df[df.Year == Year]. That is, "give me the whole dataframe, but only the rows where the Year column contains whatever I've specified in the Year variable" (1960 in your example).
sort will still work for the time being, but is being deprecated, so use sort_values instead. Putting that together:
simple = df_details_merged[df_details_merged.Year == Year].sort_values(by='MilExpend', ascending=False).reset_index()
Then you can go ahead and drop the columns, and fetch the top 10 rows as you're doing now.
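Putting it all together, a minimal sketch of the corrected function (same dataframe and column names as in the question; drop=True in reset_index is a small optional tweak that discards the old row labels):
def topten(Year):
    # Keep only the rows for the requested year, then rank by spending.
    simple = (
        df_details_merged[df_details_merged.Year == Year]
        .sort_values(by='MilExpend', ascending=False)
        .reset_index(drop=True)
    )
    simple = simple.drop(['Country Code', 'CountryYear'], axis=1).head(10)
    simple.index = simple.index + 1  # number the rows 1-10 instead of 0-9
    return simple

topten(1990)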
