Change value of column based on specific id in pandas dataframe - python

I have the sorted dataframe below and I want to set the last value of the value column for each id to 0
id value
1 500
1 50
1 36
2 45
2 150
2 70
2 20
2 10
I am able to set the very last value of the value column to 0 using df['value'].iloc[-1] = 0. How can I set the last value for each of id 1 and id 2 to get the output below?
id value
1 500
1 50
1 0
2 45
2 150
2 70
2 20
2 0

You can use drop_duplicates with keep='last' to get the last row of each id, then use the index of those rows to set value to 0:
df.loc[df['id'].drop_duplicates(keep='last').index, 'value'] = 0
print(df)
id value
0 1 500
1 1 50
2 1 0
3 2 45
4 2 150
5 2 70
6 2 20
7 2 0
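Not taken from the answers here, but an equivalent groupby-based sketch for the same df: tail(1) keeps the last row of every id group, and loc writes 0 into those rows.
df.loc[df.groupby('id').tail(1).index, 'value'] = 0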

df.loc[~df.id.duplicated(keep='last'), 'value'] = 0
Broken down
m = df.id.duplicated(keep='last')
df.loc[~m, 'value'] = 0
id value
0 1 500
1 1 50
2 1 0
3 2 45
4 2 150
5 2 70
6 2 20
7 2 0
How it works
m = df.id.duplicated(keep='last')  # marks every row whose id occurs again later; the last occurrence of each id is False
~m reverses that, so the last occurrence of each id becomes True
df.loc[~m, 'value']  # loc selects those True rows in the value column so we can write 0 to them
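For the example data the mask looks like this (a quick check):
m = df.id.duplicated(keep='last')
print(m.tolist())     # [True, True, False, True, True, True, True, False]
print((~m).tolist())  # [False, False, True, False, False, False, False, True]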

If you are willing to use numpy, here is a fast solution:
import numpy as np
import pandas as pd

# Recreate example
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 2],
    "value": [500, 50, 36, 45, 150, 70, 20, 10]
})
# Solution: keep the existing value except on the last row of each id, where we write 0
df["value"] = np.where(~df.id.duplicated(keep="last"), 0, df["value"].to_numpy())

Related

Adding multiple columns randomly to a dataframe from columns in another dataframe

I've looked everywhere but can't find a solution.
Let's say I have two tables:
Year
1
2
3
4
and
ID Value
1 10
2 50
3 25
4 20
5 40
I need to pick rows randomly from the 2nd table and add both of its columns to the first table - so if ID=3 is picked randomly, I also add Value=25, i.e. end up with something like:
Year ID Value
1 3 25
2 1 10
3 1 10
4 5 40
5 2 50
IIUC, is this what you want?
df_year[['ID', 'Value']] = df_id.sample(n=len(df_year), replace=True).to_numpy()
Output:
Year ID Value
0 1 4 20
1 2 4 20
2 3 2 50
3 4 3 25
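A minimal reproducible sketch of the same idea (df_year and df_id are the assumed names from the answer; random_state is added here only for repeatability):
import pandas as pd

df_year = pd.DataFrame({'Year': [1, 2, 3, 4]})
df_id = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'Value': [10, 50, 25, 20, 40]})
# Sample one row of df_id (with replacement) for every row of df_year and assign both columns at once
df_year[['ID', 'Value']] = df_id.sample(n=len(df_year), replace=True, random_state=0).to_numpy()
print(df_year)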

For similar values in a column, add a new frequency column

I have a dataframe :
Id age number
1 35 7
5 76 9
1 15 0
2 10 4
5 43 8
What I need to get is:
Id age number freq
1 35 7 2
5 76 9 2
1 15 0 1
2 10 4 1
5 43 8 1
Add a new column freq: for each row, take all rows with the same value in Id and count how many of them remain from that row to the end of the dataframe (a per-Id counter in descending order).
If you need a counter in descending order, use GroupBy.cumcount:
df['freq'] = df.groupby('Id').cumcount(ascending=False).add(1)
But if you need the total number of rows for each Id, use GroupBy.transform with DataFrameGroupBy.size; the output is different:
df['freq'] = df.groupby('Id')['Id'].transform('size')
Alternative with Series.map and Series.value_counts:
df['freq'] = df['Id'].map(df['Id'].value_counts())
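A quick sketch on the sample data showing how the two approaches differ (the cumcount version matches the expected freq column):
import pandas as pd

df = pd.DataFrame({'Id': [1, 5, 1, 2, 5], 'age': [35, 76, 15, 10, 43], 'number': [7, 9, 0, 4, 8]})
# Descending counter per Id -> [2, 2, 1, 1, 1]
print(df.groupby('Id').cumcount(ascending=False).add(1).tolist())
# Total occurrences of each Id -> [2, 2, 2, 1, 2]
print(df.groupby('Id')['Id'].transform('size').tolist())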

Determine size within each group having the same value in another column

I have dataframe like so,
ID,CLASS_ID,ACTIVE
1,123,0
2,123,0
3,456,1
4,123,0
5,456,1
11,123,1
18,123,0
7,456,0
19,123,0
8,456,1
I'm trying to get the cumulative counts of rows where each CLASS_ID has the same value for ACTIVE. In the dataframe above, CLASS_ID 123 has ACTIVE continuously equal to 0 up to the 4th record, after which its next value is 1, so the count for those rows should be 3. This process has to be continued, and the count has to be reset every time the value of ACTIVE changes for a CLASS_ID. The expected output is as follows:
ID,CLASS_ID,ACTIVE,ACTIVE_COUNT
1,123,0,3
2,123,0,3
3,456,1,2
4,123,0,3
5,456,1,2
11,123,1,1
18,123,0,2
7,456,0,1
19,123,0,2
8,456,1,1
I tried using df.groupby(..).transform(..) but it's not working out for me. Could someone help me out a bit?
You can do this with groupby:
ind = df.groupby('CLASS_ID').ACTIVE.apply(
    lambda x: x.ne(x.shift()).cumsum()
)
df['ACTIVE_COUNT'] = df.groupby(['CLASS_ID', ind]).ACTIVE.transform('count')
df
ID CLASS_ID ACTIVE ACTIVE_COUNT
0 1 123 0 3
1 2 123 0 3
2 3 456 1 2
3 4 123 0 3
4 5 456 1 2
5 11 123 1 1
6 18 123 0 2
7 7 456 0 1
8 19 123 0 2
9 8 456 1 1
Details
First, create an indicator series that starts a new block number every time ACTIVE changes within a CLASS_ID group, so consecutive rows with the same value share a block number:
ind = df.groupby('CLASS_ID').ACTIVE.apply(
    lambda x: x.ne(x.shift()).cumsum()
)
ind
0 1
1 1
2 1
3 1
4 1
5 2
6 3
7 2
8 3
9 3
Name: ACTIVE, dtype: int64
We then use ind as a grouper argument to df.groupby along with "CLASS_ID", and then compute the size of each group using transform.
df.groupby(['CLASS_ID', ind]).ACTIVE.transform('count')
0 3
1 3
2 2
3 3
4 2
5 1
6 2
7 1
8 2
9 1
Name: ACTIVE, dtype: int64
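For reference, a minimal sketch that recreates the question's dataframe so the snippets above can be run as-is:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 11, 18, 7, 19, 8],
    'CLASS_ID': [123, 123, 456, 123, 456, 123, 123, 456, 123, 456],
    'ACTIVE': [0, 0, 1, 0, 1, 1, 0, 0, 0, 1],
})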

creating dataframe efficiently without for loop

I am working with some advertising data, such as email data. I have two data sets:
one at the mail level that states, for each person, on which days they were emailed.
import pandas as pd
df_emailed=pd.DataFrame()
df_emailed['person']=['A','A','A','A','B','B','B']
df_emailed['day']=[2,4,8,9,1,2,5]
df_emailed
print(df_emailed)
person day
0 A 2
1 A 4
2 A 8
3 A 9
4 B 1
5 B 2
6 B 5
I have a summary dataframe that says whether someone converted, and which day they converted.
df_summary=pd.DataFrame()
df_summary['person']=['A','B']
df_summary['days_max']=[10,5]
df_summary['convert']=[1,0]
print(df_summary)
person days_max convert
0 A 10 1
1 B 5 0
I would like to combine these into a final dataframe that has, for each person, one row per day from 1 to days_max, with a column saying whether they were emailed that day (0/1) and a column saying whether they converted (0/1).
We are assuming they convert on the last day in the dataframe.
I know how to do this using a nested for loop, but I think that is incredibly inefficient. Does anyone know an efficient way of getting this done?
Desired result
df_final=pd.DataFrame()
df_final['person']=['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B']
df_final['day']=[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5]
df_final['emailed']=[0,1,0,1,0,0,0,1,1,0,1,1,0,0,1]
df_final['convert']=[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
print(df_final)
person day emailed convert
0 A 1 0 0
1 A 2 1 0
2 A 3 0 0
3 A 4 1 0
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 1 0
8 A 9 1 0
9 A 10 0 1
10 B 1 1 0
11 B 2 1 0
12 B 3 0 0
13 B 4 0 0
14 B 5 1 0
Thank you and happy holidays!
A high level approach involves modifying df_summary (alias df2) to get our output. We'll need to
set_index on the days_max column of df2 and rename the index to day (which will help later on)
groupby to group on person
apply a reindex operation on the index (so we get a row for each day leading up to the last day)
fillna to fill the NaNs in the convert column generated as a result of the reindex
assign to create a dummy column for emailed that we'll set later.
Next, index into the result of the previous operation using df_emailed. We'll use those values to set the corresponding emailed cells to 1. This is done by MultiIndexing with loc.
Finally, use reset_index to bring the index out as columns.
import numpy as np

def f(x):
    return x.reindex(np.arange(1, x.index.max() + 1))

df = df2.set_index('days_max')\
        .rename_axis('day')\
        .groupby('person')['convert']\
        .apply(f)\
        .fillna(0)\
        .astype(int)\
        .to_frame()\
        .assign(emailed=0)

df.loc[df1[['person', 'day']].apply(tuple, 1).values, 'emailed'] = 1
df.reset_index()
person day convert emailed
0 A 1 0 0
1 A 2 0 1
2 A 3 0 0
3 A 4 0 1
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 0 1
8 A 9 0 1
9 A 10 1 0
10 B 1 0 1
11 B 2 0 1
12 B 3 0 0
13 B 4 0 0
14 B 5 0 1
Where
df1 = df_emailed
and,
df2 = df_summary
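Not part of the original answer, but a merge-based sketch of the same idea using the original names df_emailed and df_summary (it assumes the (person, day) pairs in df_emailed are unique):
# Repeat each summary row days_max times, then number the days within each person
grid = df_summary.loc[df_summary.index.repeat(df_summary['days_max'])].copy()
grid['day'] = grid.groupby('person').cumcount() + 1
# Flag the emailed days, defaulting to 0 where no email was sent
grid = grid.merge(df_emailed.assign(emailed=1), on=['person', 'day'], how='left')
grid['emailed'] = grid['emailed'].fillna(0).astype(int)
# Conversion happens only on the last day, and only if the person converted
grid['convert'] = ((grid['day'] == grid['days_max']) & (grid['convert'] == 1)).astype(int)
df_final = grid[['person', 'day', 'emailed', 'convert']]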

Python Pandas add column to multi-index GroupBy DataFrame

I am trying to add a column to a Pandas GroupBy DataFrame with a multi-index. The column is the difference between the max and mean value of a common key after grouping.
Here is the input DataFrame:
Main Reads Test Subgroup
0 1 5 54 1
1 2 2 55 1
2 1 10 56 2
3 2 20 57 3
4 1 7 58 3
Here is the code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Main': [1, 2, 1, 2, 1], 'Reads': [5, 2, 10, 20, 7],\
'Test':range(54,59), 'Subgroup':[1,1,2,3,3]})
result = df.groupby(['Main','Subgroup']).agg({'Reads':[np.max,np.mean]})
Here is the variable result before performing the calculation of diff:
Reads
amax mean
Main Subgroup
1 1 5 5
2 10 10
3 7 7
2 1 2 2
3 20 20
Next, I calculate the diff column with:
result['Reads']['diff'] = result['Reads']['amax'] - result['Reads']['mean']
but here is the output:
/home/userd/test.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result['Reads']['diff'] = result['Reads']['amax'] - result['Reads']['mean']
I would like the diff column to be at the same level as amax and mean.
Is there a way to add a column to the innermost (bottom) column index of a multi-index GroupBy() object in Pandas?
You can access the multi-index column using a tuple:
result[('Reads','diff')] = result[('Reads','amax')] - result[('Reads','mean')]
You get
Reads
amax mean diff
Main Subgroup
1 1 5 5 0
2 10 10 0
3 7 7 0
2 1 2 2 0
3 20 20 0
# You can use a lambda to build diff directly.
df.groupby(['Main','Subgroup']).agg({'Reads':[np.max,np.mean,lambda x: np.max(x)-np.mean(x)]}).rename(columns={'<lambda>':'diff'})
Out[2360]:
Reads
amax mean diff
Main Subgroup
1 1 5 5 0
2 10 10 0
3 7 7 0
2 1 2 2 0
3 20 20 0
Try this:
In [8]: result = df.groupby(['Main','Subgroup']).agg({'Reads':[np.max,np.mean, lambda x: x.max()-x.mean()]})
In [9]: result
Out[9]:
Reads
amax mean <lambda>
Main Subgroup
1 1 5 5 0
2 10 10 0
3 7 7 0
2 1 2 2 0
3 20 20 0
In [10]: result = result.rename(columns={'<lambda>':'diff'})
In [11]: result
Out[11]:
Reads
amax mean diff
Main Subgroup
1 1 5 5 0
2 10 10 0
3 7 7 0
2 1 2 2 0
3 20 20 0
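Not from the answers above: on newer pandas (0.25+), named aggregation gives the same numbers with a flat column index, avoiding both the MultiIndex assignment and the '<lambda>' rename (a sketch, assuming a flat index is acceptable):
result = df.groupby(['Main', 'Subgroup']).agg(amax=('Reads', 'max'), mean=('Reads', 'mean'))
result['diff'] = result['amax'] - result['mean']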
