Groupby.transform doesn't work in dask dataframe

I'm using the following dask dataframe, AID:
AID FID ANumOfF
0 1 X 1
1 1 Y 5
2 2 Z 6
3 2 A 1
4 2 X 11
5 2 B 18
I know in a pandas dataframe I could use:
AID.groupby('AID')['ANumOfF'].transform('sum')
to get:
0 6
1 6
2 36
3 36
4 36
5 36
I want to do the same with a dask dataframe, which usually supports the same functions as a pandas dataframe, but in this instance it gives me the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'transform'
It could be one of two things: either dask doesn't support it, or it's because I'm using Python 3?
I tried the following code:
AID.groupby('AID')['ANumOfF'].sum()
but that just gives me the sum of each group like this:
AID
1 6
2 36
I need it as above, where the sum is repeated in each row. My question is: if transform isn't supported, is there another way I could achieve the same result?

I think you can use join:
s = AID.groupby('AID')['ANumOfF'].sum()
AID = AID.set_index('AID').drop('ANumOfF', axis=1).join(s).reset_index()
print(AID)
AID FID ANumOfF
0 1 X 6
1 1 Y 6
2 2 Z 36
3 2 A 36
4 2 X 36
5 2 B 36
Or a faster solution: map with the aggregated Series or a dict:
s = AID.groupby('AID')['ANumOfF'].sum()
#a bit faster
#s = AID.groupby('AID')['ANumOfF'].sum().to_dict()
AID['ANumOfF'] = AID['AID'].map(s)
print(AID)
AID FID ANumOfF
0 1 X 6
1 1 Y 6
2 2 Z 36
3 2 A 36
4 2 X 36
5 2 B 36
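
Note that the two solutions above are written against pandas. A rough sketch of the same map idea for a dask dataframe, assuming AID is a dask.dataframe (the meta argument here is my assumption, passed to help dask's metadata inference):
# Compute the per-group sums eagerly, then map them back onto the lazy 'AID' column
sums = AID.groupby('AID')['ANumOfF'].sum().compute().to_dict()
AID['ANumOfF'] = AID['AID'].map(sums, meta=('ANumOfF', 'int64'))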

Currently Dask supports transform, however there may be issues with indexes (depending on the original dataframe); see this PR #5327.
So your code should work:
AID.groupby('AID')['ANumOfF'].transform('sum')
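For completeness, a minimal end-to-end sketch, assuming a dask version that includes the PR above (npartitions=2 is arbitrary):
import pandas as pd
import dask.dataframe as dd

# Rebuild the example frame and partition it into a dask dataframe
pdf = pd.DataFrame({'AID': [1, 1, 2, 2, 2, 2],
                    'FID': ['X', 'Y', 'Z', 'A', 'X', 'B'],
                    'ANumOfF': [1, 5, 6, 1, 11, 18]})
AID = dd.from_pandas(pdf, npartitions=2)

# transform is lazy like everything else in dask; compute() materializes it
print(AID.groupby('AID')['ANumOfF'].transform('sum').compute())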

Related

How do I transpose columns into rows of a Pandas DataFrame?

My current data frame has 10 rows and thousands of columns. The setup currently looks similar to this:
A B A B
1 2 3 4
5 6 7 8
But I want something more like the below, where essentially the columns are transposed into rows once the headers start repeating:
A B
1 2
5 6
3 4
7 8
I've been trying df.reshape but perhaps can't get the syntax right. Any suggestions on how best to transpose the data like this?
I'd probably go for stacking, grouping and then building a new DataFrame from scratch, e.g.:
pd.DataFrame({col: vals for col, vals in df.stack().groupby(level=1).agg(list).items()})
That'll also give you:
A B
0 1 2
1 3 4
2 5 6
3 7 8
Try with stack, groupby and pivot:
stacked = df.T.stack().to_frame().assign(idx=df.T.stack().groupby(level=0).cumcount()).reset_index()
output = stacked.pivot(index="idx", columns="level_0", values=0).rename_axis(None, axis=1).rename_axis(None, axis=0)
>>> output
A B
0 1 2
1 5 6
2 3 4
3 7 8
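
Since the question mentions df.reshape: if the repeating columns always come in complete A/B pairs, a plain numpy reshape is arguably the simplest route. A sketch (the pair width of 2 is an assumption):
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=['A', 'B', 'A', 'B'])

# Flatten row-wise, then cut the values into width-2 rows, one per A/B pair
out = pd.DataFrame(df.to_numpy().reshape(-1, 2), columns=['A', 'B'])
print(out)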

Create DataFrame from subarrays of existing Series object

Could you suggest a method to create a DataFrame from a Series like the one I have described below:
Input Series
s = pd.Series([1,2,3,4,5,6])
Wanted DataFrame:
x y z
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
Of course I could do it using a loop, but I hope there is a way to do it more elegantly.
I'm not certain that's what you're looking for, but here's a pretty trivial way to do that:
df = pd.DataFrame({"x": s[:-2].values, "y": s[1:-1].values, "z": s[2:].values})
Output:
x y z
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
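
For windows of arbitrary length, numpy's sliding_window_view (available in numpy >= 1.20) generalizes the slicing trick; a sketch:
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])

# Every length-3 window of the series becomes one row of the frame
windows = np.lib.stride_tricks.sliding_window_view(s.to_numpy(), 3)
df = pd.DataFrame(windows, columns=["x", "y", "z"])
print(df)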

How to modify values which are one row below the values that meet a condition?

Is there an efficient way to change the value of a previous row whenever a condition is met in a subsequent entry? Specifically, I am wondering if there is any way to adapt pandas.where to modify the entry in a row prior or subsequent to the conditional test. Suppose:
Data={'Energy':[12,13,14,12,15,16],'Time':[2,3,4,2,5,6]}
DF = pd.DataFrame(Data)
DF
Out[123]:
Energy Time
0 12 2
1 13 3
2 14 4
3 12 2
4 15 5
5 16 6
If I wanted to change the value of Energy to 'X' whenever Time <= 2, I could just do something like:
DF['Energy'] = DF['Energy'].where(DF['Time'] > 2, 'X')
or
DF.loc[DF['Time']<=2,'Energy']='X'
Which would output
Energy Time
0 X 2
1 13 3
2 14 4
3 X 2
4 15 5
5 16 6
But what if I want to change the value of 'Energy' in the row after Time <= 2, so that the output would actually be:
Energy Time
0 12 2
1 X 3
2 14 4
3 12 2
4 X 5
5 16 6
Is there an easy modification for a vectorized approach to this?
Shift the values one row down using Series.shift and then compare:
df.loc[df['Time'].shift() <= 2, 'Energy'] = 'X'
df
Energy Time
0 12 2
1 X 3
2 14 4
3 12 2
4 X 5
5 16 6
Side note: I assume 'X' is actually something else here, but FYI, mixing strings and numeric data leads to object type columns which is a known pandas anti-pattern.
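
Since the question asked specifically about adapting where: the same shifted condition plugs straight into Series.mask (the inverse of where), e.g.:
# mask replaces values where the condition is True, so the shifted
# comparison marks the row *after* each Time <= 2
df['Energy'] = df['Energy'].mask(df['Time'].shift() <= 2, 'X')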

Pandas: remove old DataFrame from memory after groupby

value Group something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
I want to select the last 3 rows of each group (from the above df) like the following, but performing the operation in place. I want to ensure that I am keeping only the new df object in memory after the assignment. What would be an efficient way of doing it?
df = df.groupby('Group').tail(3)
The result should look like the following:
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
N.B.: This question is related to Keeping the last N duplicates in pandas.
df = df.groupby('Group').tail(3) is already an efficient way of doing it. Because you are overwriting the df variable, Python will take care of releasing the memory of the old dataframe, and you will only have access to the new one.
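A minimal sketch of that reassignment pattern; the gc.collect() is purely illustrative, since CPython's reference counting frees the old frame as soon as it is rebound:
import gc
import pandas as pd

df = pd.DataFrame({'value': ['a', 'b', 'c', 'c', 'b', 'x', 'd', 'e', 'd', 'a'],
                   'Group': [1, 1, 1, 2, 2, 2, 2, 3, 2, 3],
                   'something': [1, 2, 4, 9, 10, 5, 3, 5, 10, 5]})

df = df.groupby('Group').tail(3)  # rebinding drops the last reference to the old frame
gc.collect()                      # optional; shown only to make the point explicit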
Trying way too hard to guess what you want.
NOTE: using Pandas inplace argument where it is available is NO guarantee that a new DataFrame won't be created in memory. In fact, it may very well create a new DataFrame in memory and replace the old one behind the scenes.
from collections import defaultdict

def f(s):
    # count occurrences of each group value from the bottom up and
    # yield the index labels beyond the last 3 per group
    c = defaultdict(int)
    for i, x in zip(s.index[::-1], s.values[::-1]):
        c[x] += 1
        if c[x] > 3:
            yield i

df.drop([*f(df.Group)], inplace=True)
df
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
Your answer is already in the post. However, as said earlier in the comments, you are overwriting the existing df; to avoid that, assign the result to a new variable like below:
new_df = df.groupby('Group').tail(3)
However, out of curiosity, if you are not concerned about the groupby and are only looking for the last N lines of the df, you can do it like below:
df[-2:] # last 2 rows
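(For what it's worth, df.tail(2) is the more idiomatic equivalent of df[-2:].)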

Concatenating by looping through rows in Python

I have been looking for a way to do this but haven't been successful in finding something that will work in python/pandas.
I am looking to loop through the rows, concatenating the values of column B until a 1 is found again in column A, and place that running concatenation in a third column.
For Example:
df
'A' 'B' 'C'
1 4 4
2 3 43
3 1 431
4 2 4312
1 5 5
2 4 54
1 2 2
2 2 22
3 4 224
if df['A'] == 1:
    df['C'] = df.concat['B']
else:
    df['C'] = df.concat['B'] + df.concat['B'+1]
If you can't tell, this is my first time trying to write a loop.
Any help generating column C from columns A and B using python code would be well appreciated.
Thank you,
James
This can achieve what you need: create a new key by using cumsum, then groupby the key we created, using cumsum again:
df.groupby(df.A.eq(1).cumsum()).B.apply(lambda x: x.astype(str).cumsum())
Out[838]:
0 4
1 43
2 431
3 4312
4 5
5 54
6 2
7 22
8 224
Name: B, dtype: object
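
To put this into the third column the question asked for, transform (rather than apply) keeps the result aligned with the original index, so it assigns cleanly; a sketch:
# A new group starts at every 1 in column A; transform preserves the index
key = df.A.eq(1).cumsum()
df['C'] = df.groupby(key).B.transform(lambda x: x.astype(str).cumsum())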
