I have a df that is the result of a join:
ID count
0 A 30
1 A 30
2 B 5
3 C 44
4 C 44
5 C 44
I would like to increment the count column within each ID group, so that repeated IDs get successively larger counts. Here is an example of the desired result:
ID count
0 A 30
1 A 31
2 B 5
3 C 44
4 C 45
5 C 46
I know there are non-pythonic ways to do this via loops, but I am wondering if there is a more intelligent (and time efficient, as this table is large) way to do this.
Use groupby with cumcount to get a cumulative count per group and add it to count, e.g.:
df['count'] += df.groupby('ID')['count'].cumcount()
Gives you:
ID count
0 A 30
1 A 31
2 B 5
3 C 44
4 C 45
5 C 46
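For reference, a minimal, self-contained sketch that reproduces the example above (the DataFrame constructor is an assumption about how the joined df was built):
import pandas as pd

# Rebuild the joined dataframe from the question
df = pd.DataFrame({'ID': ['A', 'A', 'B', 'C', 'C', 'C'],
                   'count': [30, 30, 5, 44, 44, 44]})

# cumcount numbers rows within each group 0, 1, 2, ...,
# so adding it to 'count' bumps each duplicate by its position
df['count'] += df.groupby('ID')['count'].cumcount()
print(df)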
I have two dataframes A and B
A = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [11, 22, 33, 44, 55]})
B = pd.DataFrame({'a': [7, 2, 3, 4, 9], 'b': [123, 234, 456, 789, 1122]})
I want to merge A and B such that, for values of column 'a' common to both, the rows come from A; from B, only the rows with non-intersecting values in column 'a' should be taken. The final dataframe should look like:
a     b
1    11
2    22
3    33
4    44
5    55
7   123
9  1122
If a is unique-valued in both A and B (some sort of unique ID for example), you can try with concat and drop_duplicates:
pd.concat([A,B]).drop_duplicates('a')
Output:
a b
0 1 11
1 2 22
2 3 33
3 4 44
4 5 55
0 7 123
4 9 1122
In the general case, use isin to check for existence of B['a'] in A['a']:
pd.concat([A, B[~B['a'].isin(A['a'])]])
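For completeness, a self-contained sketch of the general case (the reset_index call is an optional addition, not part of the answer above):
import pandas as pd

A = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [11, 22, 33, 44, 55]})
B = pd.DataFrame({'a': [7, 2, 3, 4, 9], 'b': [123, 234, 456, 789, 1122]})

# Keep all of A, plus only the rows of B whose 'a' is not already in A
result = pd.concat([A, B[~B['a'].isin(A['a'])]])

# Optional: rebuild a clean 0..n-1 index instead of the duplicated one
result = result.reset_index(drop=True)
print(result)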
I want to get the percentage/weighting of a row per group. An example of the dataframe is seen below.
Place District Count
A 1 12
B 1 13
C 1 34
D 2 56
E 2 1
F 3 23
I need to group by the District but get a percentage or weighting on the Count for each row Place. For example, the calculation for Place A would be 12/(12+13+34) and B would be 13/(12+13+34).
The expected outcome would be:
Place District Count Weighting
A     1        12    0.203389831
B     1        13    0.220338983
C     1        34    0.576271186
D     2        56    0.98245614
E     2        1     0.01754386
F     3        23    1
I am using pandas dataframes.
IIUC, use GroupBy.transform to broadcast each group's sum back to its rows:
df['Weighting'] = df['Count'].div(df.groupby('District')['Count'].transform('sum'))
Output
Place District Count Weighting
0 A 1 12 0.203390
1 B 1 13 0.220339
2 C 1 34 0.576271
3 D 2 56 0.982456
4 E 2 1 0.017544
5 F 3 23 1.000000
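An equivalent spelling, shown here as a sketch rather than as part of the answer above, does the division inside transform with a lambda:
# Divide each group's Count by that group's own total in one step
df['Weighting'] = df.groupby('District')['Count'].transform(lambda s: s / s.sum())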
After a groupby, when using agg, if a dict of column: function pairs is passed, each function is applied to its corresponding column. However, this syntax doesn't work with transform. Is there another way to apply several functions with transform?
Let's give an example:
import numpy as np
import pandas as pd
df_test = pd.DataFrame([[1, 2, 3], [1, 20, 30], [2, 30, 50], [1, 2, 33], [2, 4, 50]], columns=['a', 'b', 'c'])
Out[1]:
a b c
0 1 2 3
1 1 20 30
2 2 30 50
3 1 2 33
4 2 4 50
def my_fct1(series):
    return series.mean()

def my_fct2(series):
    return series.std()
df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2})
Out[2]:
c b
a
1 16.522712 8
2 0.000000 17
The previous example shows how to apply different functions to different columns with agg, but if we want to transform the columns without aggregating them, agg can't be used anymore. Therefore:
df_test.groupby('a').transform({'b':np.cumsum,'c':np.cumprod})
Out[3]:
TypeError: unhashable type: 'dict'
How can we perform such an action with the following expected output:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
You can still use a dict, but with a bit of a hack:
df_test.groupby('a').transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])
Out[427]:
b c
0 2 3
1 22 90
2 30 50
3 24 2970
4 34 2500
If you need to keep column a, you can do:
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])\
       .reset_index()
Out[429]:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
Another way is to use an if/else to check the column name:
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: x.cumsum() if x.name == 'b' else x.cumprod())\
       .reset_index()
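If you'd rather avoid dispatching on x.name, a plain loop over a dict of column: function-name pairs also works (a sketch, assuming string names that transform accepts, such as 'cumsum' and 'cumprod'):
funcs = {'b': 'cumsum', 'c': 'cumprod'}
out = df_test.copy()  # copying keeps column 'a' untouched
for col, func in funcs.items():
    out[col] = df_test.groupby('a')[col].transform(func)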
I think transform (as of pandas 0.20.2) is not implemented with a dict of column names to functions the way agg is.
If the functions return Series of the same length as the input, agg can be used instead:
df1 = df_test.set_index('a').groupby('a').agg({'b':np.cumsum,'c':np.cumprod}).reset_index()
print (df1)
a c b
0 1 3 2
1 1 90 22
2 2 50 30
3 1 2970 24
4 2 2500 34
But if the aggregation returns a different length, you need a join:
df2 = df_test[['a']].join(df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2}), on='a')
print (df2)
a c b
0 1 16.522712 8
1 1 16.522712 8
2 2 0.000000 17
3 1 16.522712 8
4 2 0.000000 17
With more recent versions of pandas, you can use the assign method along with transform to either append new columns or replace existing columns with new values:
grouper = df_test.groupby("a")
df_test.assign(b=grouper["b"].transform("cumsum"),
               c=grouper["c"].transform("cumprod"))
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
I want to know how to run a function per group that returns a vector, not a single value, in pandas.
I have a dataset with one value column and one group column.
x group order
1 22 a 1
2 33 a 2
3 11 a 3
4 4 b 1
5 88 b 2
6 77 b 3
7 43 b 4
8 9 b 5
I want to analyse the value column per group. For example, I want to apply an FFT. How can I run a function over each group that returns a sequence rather than a single value (for each group, the FFT produces a vector) and get the result back per row?
I expect something like
y group order
1 21 a 1
2 62 a 2
3 83 a 3
4 4 a 4
6 46 b 1
7 17 b 2
as output.
I would like to have this done in pandas. Extra points if it can be done with https://github.com/kieferk/dfply
Use apply and wrap the result in a pd.Series:
df.groupby('group').x.apply(lambda x: pd.Series(np.random.choice(x, 2)))
group
a 0 22
1 33
b 0 88
1 43
Name: x, dtype: int64
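Applied to the FFT case from the question, the same pattern might look like the sketch below (np.fft.fft returns complex coefficients, and the sample data is taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [22, 33, 11, 4, 88, 77, 43, 9],
                   'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']})

# One FFT per group; wrapping in pd.Series keeps one row per input row
out = df.groupby('group')['x'].apply(lambda s: pd.Series(np.fft.fft(s.values)))
print(out)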
I'm working with a pandas dataframe that looks like this:
[screenshot: current pandas table]
I want to sum all of the times for each individual property in a given week; my idea is to append this to the dataframe like this:
[screenshot: dataframe with the weekly totals appended]
Then, to simplify things, I'd create a new dataframe that looks like this:
Property Name Week Total_weekly_time
A 1 60
A 2 xx
B 1 xx
etc. etc.
I'm new to pandas and trying to learn the ins and outs. Any answers are much appreciated, as well as references for learning pandas better.
I think you need transform if you want a new column with the same dimension as the original df after the groupby:
df['Total_weekly_time'] = df.groupby(['Property Name', 'Week #'])['Duration'].transform('sum')
print (df)
Property Name Week # Duration Total_weekly_time
0 A 1 10 60
1 A 1 10 60
2 A 2 5 5
3 B 1 20 70
4 B 1 20 70
5 B 1 20 70
6 C 2 10 10
7 C 3 30 50
8 A 1 40 60
9 A 4 40 40
10 B 1 5 70
11 B 1 5 70
12 C 3 10 50
13 C 3 10 50
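For the simplified summary table from the question (one row per property and week), an aggregation rather than a transform might look like this sketch, reusing the column names above:
summary = (df.groupby(['Property Name', 'Week #'], as_index=False)['Duration']
             .sum()
             .rename(columns={'Duration': 'Total_weekly_time'}))
print(summary)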
See the pandas docs on groupby for more.