I have a dataframe:
Id age number
1 35 7
5 76 9
1 15 0
2 10 4
5 43 8
What I need to get is:
Id age number freq
1 35 7 2
5 76 9 2
1 15 0 1
2 10 4 1
5 43 8 1
Add a new column freq: for each row, take all rows with the same value in Id and count that row together with the rows that follow it.
If you need a counter in descending order, use GroupBy.cumcount:
df['freq'] = df.groupby('Id').cumcount(ascending=False).add(1)
But if you need the count of values per Id, use GroupBy.transform with DataFrameGroupBy.size; the output is different:
df['freq'] = df.groupby('Id')['Id'].transform('size')
Alternative with Series.map and Series.value_counts:
df['freq'] = df['Id'].map(df['Id'].value_counts())
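A minimal runnable sketch (recreating the sample frame above, Ids here as integers) showing the difference between the two:
import pandas as pd

df = pd.DataFrame({'Id': [1, 5, 1, 2, 5],
                   'age': [35, 76, 15, 10, 43],
                   'number': [7, 9, 0, 4, 8]})

# Descending counter per Id: the first row of each Id gets the group size, the last gets 1
df['freq'] = df.groupby('Id').cumcount(ascending=False).add(1)
# -> [2, 2, 1, 1, 1], matching the expected output above

# Count per Id instead: every row of the same Id gets the same value
df['freq'] = df.groupby('Id')['Id'].transform('size')
# -> [2, 2, 2, 1, 2]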
Let's take these sample dataframes:
df = pd.DataFrame({'Id':['1','2','3','4','5'], 'Value':[9,8,7,6,5]})
Id Value
0 1 9
1 2 8
2 3 7
3 4 6
4 5 5
df_name = pd.DataFrame({'Id':['1','2','4'], 'Name':['Andrew','Jason','John']})
Id Name
0 1 Andrew
1 2 Jason
2 4 John
I would like to add, in the Id column of df, the Name of the person (taken from df_name) in brackets, if it exists. I know how to do this with a for loop over the Id column of df, but it is inefficient with large dataframes. Is there a better way to do this?
Expected output:
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
Use Series.map to look up matching values, add the parentheses, and replace non-matched values with the original column via Series.fillna:
df['Id'] = ((df['Id'] + ' (' + df['Id'].map(df_name.set_index('Id')['Name']) + ')')
.fillna(df['Id']))
print (df)
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
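For reference, a sketch of the intermediate step (same sample frames): the map yields NaN for Ids missing from df_name, the NaN propagates through the string concatenation, and fillna then falls back to the original Id:
names = df['Id'].map(df_name.set_index('Id')['Name'])
print(names.tolist())
# ['Andrew', 'Jason', nan, 'John', nan]
df['Id'] = (df['Id'] + ' (' + names + ')').fillna(df['Id'])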
I have the sorted dataframe below and I want to set the value of the last row of each id to 0:
id value
1 500
1 50
1 36
2 45
2 150
2 70
2 20
2 10
I am able to set the very last value of the dataframe to 0 using df['value'].iloc[-1] = 0. How can I set the last value for both id 1 and id 2 to get the output below?
id value
1 500
1 50
1 0
2 45
2 150
2 70
2 20
2 0
You can use drop_duplicates with keep='last' to get the last row of each id, then use the index of those rows to set the value to 0:
df.loc[df['id'].drop_duplicates(keep='last').index, 'value'] = 0
print(df)
id value
0 1 500
1 1 50
2 1 0
3 2 45
4 2 150
5 2 70
6 2 20
7 2 0
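A quick look at the intermediate step (same frame as above) shows which rows the indexer targets:
last_rows = df['id'].drop_duplicates(keep='last')
print(last_rows.index.tolist())
# [2, 7]  -> the last row of id 1 and the last row of id 2, which then get value 0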
df.loc[~df.id.duplicated(keep='last'), 'value'] = 0
Broken down
m = df.id.duplicated(keep='last')
df.loc[~m, 'value'] = 0
id value
0 1 500
1 1 50
2 1 0
3 2 45
4 2 150
5 2 70
6 2 20
7 2 0
How it works
m = df.id.duplicated(keep='last')  # True for every row except the last occurrence of each id
~m reverses that, so the last row of each id becomes True
df.loc[~m, 'value']  # the loc accessor reaches exactly those rows in the value column, so we can write 0 to them
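A sketch of the mask itself (same frame as above) makes this concrete:
m = df.id.duplicated(keep='last')
print(m.tolist())
# [True, True, False, True, True, True, True, False]
# ~m is True only at index 2 (last row of id 1) and index 7 (last row of id 2)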
If you are willing to use numpy, here is a fast solution:
import numpy as np
import pandas as pd

# Recreate the example
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 2],
    "value": [500, 50, 36, 45, 150, 70, 20, 10]
})

# Solution: keep each value, except on the last row of every id, which becomes 0
df["value"] = np.where(~df.id.duplicated(keep="last"), 0, df["value"].to_numpy())
I have a df that is the result of a join:
ID count
0 A 30
1 A 30
2 B 5
3 C 44
4 C 44
5 C 44
I would like to be able to iterate the count column based on the ID column. Here is an example of the desired result:
ID count
0 A 30
1 A 31
2 B 5
3 C 44
4 C 45
5 C 46
I know there are non-pythonic ways to do this via loops, but I am wondering if there is a more intelligent (and time efficient, as this table is large) way to do this.
Get the cumulative count within each ID group and add it to count, e.g.:
df['count'] += df.groupby('ID')['count'].cumcount()
Gives you:
ID count
0 A 30
1 A 31
2 B 5
3 C 44
4 C 45
5 C 46
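The intermediate cumcount shows why this works: it is 0 for the first row of each ID and counts up within the group, so the first occurrence keeps its original count and later occurrences get incremented:
print(df.groupby('ID')['count'].cumcount().tolist())
# [0, 1, 0, 0, 1, 2]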
I am trying to find the average monthly cost per user_id, but I am only able to get the average cost per user or the monthly cost per user.
Because I group by user and month, there is no way to get the average over the second groupby level (month) unless I transform the groupby output into something else.
This is my df:
df = pd.DataFrame({'id': pd.Series([1, 1, 1, 1, 2, 2, 2, 2]),
                   'cost': pd.Series([10, 20, 30, 40, 50, 60, 70, 80]),
                   'mth': pd.Series([3, 3, 4, 5, 3, 4, 4, 5])})
cost id mth
0 10 1 3
1 20 1 3
2 30 1 4
3 40 1 5
4 50 2 3
5 60 2 4
6 70 2 4
7 80 2 5
I can get the monthly sum, but I want the average over the months for each user_id.
df.groupby(['id','mth'])['cost'].sum()
id mth
1 3 30
4 30
5 40
2 3 50
4 130
5 80
I want something like this:
id average_monthly
1 (30+30+40)/3
2 (50+130+80)/3
Resetting the index should work. Try this:
In [19]: df.groupby(['id', 'mth']).sum().reset_index().groupby('id').mean()
Out[19]:
mth cost
id
1 4.0 33.333333
2 4.0 86.666667
You can just drop mth if you want. The logic is that after the sum part, you have this:
In [20]: df.groupby(['id', 'mth']).sum()
Out[20]:
cost
id mth
1 3 30
4 30
5 40
2 3 50
4 130
5 80
Resetting the index at this point will give you unique months.
In [21]: df.groupby(['id', 'mth']).sum().reset_index()
Out[21]:
id mth cost
0 1 3 30
1 1 4 30
2 1 5 40
3 2 3 50
4 2 4 130
5 2 5 80
It's just a matter of grouping it again, this time using mean instead of sum. This should give you the averages.
Let us know if this helps.
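If you prefer to skip the reset_index, an equivalent sketch groups the summed Series by its id index level directly:
monthly = df.groupby(['id', 'mth'])['cost'].sum()
print(monthly.groupby(level='id').mean())
# id
# 1    33.333333
# 2    86.666667
# Name: cost, dtype: float64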
The same pattern as a single chained expression, using this question's columns:
df_monthly_average = (
    df.groupby(['id', 'mth'])['cost']
      .sum()
      .reset_index()
      .groupby('id')['cost']
      .mean()
      .reset_index()
)
Suppose I have a dataframe that looks like this:
group level
0 1 10
1 1 10
2 1 11
3 2 5
4 2 5
5 3 9
6 3 9
7 3 9
8 3 8
The desired output is this:
group level
0 1 10
5 3 9
Namely, this is the logic: look inside each group; if there is more than one distinct value present in the level column, return the first row of that group. For example, no row from group 2 is selected, because the only value present in the level column is 5.
In addition, how does the situation change if I want the last, instead of the first row of such groups?
What I have attempted is combining groupby statements with building sets from the entries in the level column, but I failed to produce anything even nearly sensible.
This can be done with groupby and using apply to run a simple function on each group:
def get_first_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].first_valid_index()]
    else:
        return None
df.groupby('group').apply(get_first_val).dropna()
Out[8]:
group
1 10
3 9
dtype: float64
There's also a last_valid_index() method, so you wouldn't have to
make any huge changes to get the last row instead.
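For example, a sketch of that variant (same df as above), where only the index lookup changes:
def get_last_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].last_valid_index()]
    else:
        return None

df.groupby('group').apply(get_last_val).dropna()
# returns level 11 for group 1 and level 8 for group 3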
If you have other columns that you want to keep, you just need a slight tweak:
import numpy as np
df['col1'] = np.random.randint(10, 20, 9)
df['col2'] = np.random.randint(20, 30, 9)
df
Out[17]:
group level col1 col2
0 1 10 19 21
1 1 10 18 24
2 1 11 14 23
3 2 5 14 26
4 2 5 10 22
5 3 9 13 27
6 3 9 16 20
7 3 9 18 26
8 3 8 11 2
def get_first_val_keep_cols(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group.loc[group['level'].first_valid_index(), :]
    else:
        return None
df.groupby('group').apply(get_first_val_keep_cols).dropna()
Out[20]:
group level col1 col2
group
1 1 10 19 21
3 3 9 13 27
This would be simpler:
In [121]:
print(df.groupby('group')
        .agg(lambda x: x.values[0] if (x.values != x.values[0]).any() else np.nan)
        .dropna())
level
group
1 10
3 9
For each group, if any of the values are not the same as the first value, aggregate that group into its first value; otherwise, aggregate it to nan.
Finally, dropna().
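And if you want the last row of such groups with this approach, a minimal tweak (a sketch on the same data) returns values[-1] while keeping the same test:
import numpy as np

print(df.groupby('group')
        .agg(lambda x: x.values[-1] if (x.values != x.values[0]).any() else np.nan)
        .dropna())
# group 1 keeps level 11 (its last row), group 3 keeps level 8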