I have a DataFrame with some ids, day numbers, and a running sum:
data = {'id': [0, 0, 0, 1, 1, 2, 1], 'day' : [0, 2, 1, 1, 4, 2, 2], 'running_sum': [1,4,2,1,6,6,3]}
df_1 = pd.DataFrame(data)
id day running_sum
0 0 0 1
1 0 2 4
2 0 1 2
3 1 1 1
4 1 4 6
5 2 2 6
6 1 2 3
I want a DataFrame of all days for each id, with the correct running sum:
id day running_sum
0 0 0 1
1 0 1 2
2 0 2 4
3 0 3 4
4 0 4 4
5 1 0 0
6 1 1 1
7 1 2 3
8 1 3 3
9 1 4 6
10 2 0 0
11 2 1 0
12 2 2 6
13 2 3 6
14 2 4 6
Thanks for the help.
Let's see if this logic is what you have in mind:
Set id and day as index:
df_1 = df_1.set_index(['id', 'day'])
Build a new index to reindex df_1 against, introducing the missing (id, day) combinations; luckily the existing index is unique, so reindex works fine:
new_index = df_1.index.get_level_values('id').unique()
new_index = pd.MultiIndex.from_product([new_index, range(5)],
                                       names=['id', 'day'])
df_1 = df_1.reindex(new_index)
Group by id and forward-fill within each group; the remaining nulls will be replaced with zero:
(df_1.assign(running_sum=df_1.groupby('id')
                             .running_sum
                             .ffill()
                             .fillna(0))
     .reset_index()
)
id day running_sum
0 0 0 1.0
1 0 1 2.0
2 0 2 4.0
3 0 3 4.0
4 0 4 4.0
5 1 0 0.0
6 1 1 1.0
7 1 2 3.0
8 1 3 3.0
9 1 4 6.0
10 2 0 0.0
11 2 1 0.0
12 2 2 6.0
13 2 3 6.0
14 2 4 6.0
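Putting the reindex steps together, a self-contained sketch (assuming days run 0 through 4, as in the expected output):

```python
import pandas as pd

data = {'id': [0, 0, 0, 1, 1, 2, 1],
        'day': [0, 2, 1, 1, 4, 2, 2],
        'running_sum': [1, 4, 2, 1, 6, 6, 3]}
df_1 = pd.DataFrame(data).set_index(['id', 'day'])

# full (id, day) grid, then reindex, forward-fill per id, zero the rest
new_index = pd.MultiIndex.from_product(
    [df_1.index.get_level_values('id').unique(), range(5)],
    names=['id', 'day'])
out = (df_1.reindex(new_index)
           .groupby('id').ffill()   # carry the last known sum forward
           .fillna(0)               # days before the first observation -> 0
           .reset_index())
print(out)
```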
If you are not averse to using an add-on library, the complete function/method from pyjanitor could help abstract the process:
# pip install pyjanitor
import janitor  # note: installed as pyjanitor, but imported as janitor
df = df_1.complete('id', {'day':range(5)}) # explicitly expose the missing values
df.assign(running_sum = df.groupby('id').running_sum.ffill().fillna(0))
id day running_sum
0 0 0 1.0
1 0 1 2.0
2 0 2 4.0
3 0 3 4.0
4 0 4 4.0
5 1 0 0.0
6 1 1 1.0
7 1 2 3.0
8 1 3 3.0
9 1 4 6.0
10 2 0 0.0
11 2 1 0.0
12 2 2 6.0
13 2 3 6.0
14 2 4 6.0
All this is premised on the assumption that I got the logic right.
You can unstack/stack and ffill. The tricky part is to get the missing 'days':
missing = set(range(0, df_1['day'].max()+1)).difference(df_1['day'].unique())
(pd.concat([df_1,
            pd.DataFrame({'id': 0, 'day': list(missing)})])
   .set_index(['id', 'day'])
   .unstack()
   .stack(dropna=False)  # adds the missing values
   .sort_index()
   .groupby('id')
   .ffill()
   .fillna(0, downcast='infer')
   .reset_index()
)
output:
id day running_sum
0 0 0 1
1 0 1 2
2 0 2 4
3 0 3 4
4 0 4 4
5 1 0 0
6 1 1 1
7 1 2 3
8 1 3 3
9 1 4 6
10 2 0 0
11 2 1 0
12 2 2 6
13 2 3 6
14 2 4 6
Below is a toy Pandas dataframe that has three columns: 'id' (group id), 'b' (for condition), and 'c' (target):
df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [3,4,5,'A',3,4,'A',1,'A',1,3,'A',2,3],
'c' : [1,0,1,10,1,1,20,1,10,0,1,20,1,1]})
print(df)
id b c
0 1 3 1
1 1 4 0
2 1 5 1
3 1 A 10
4 1 3 1
5 1 4 1
6 1 A 20
7 2 1 1
8 2 A 10
9 2 1 0
10 2 3 1
11 2 A 20
12 2 2 1
13 2 3 1
For each group, I want to replace the values in column 'c' with nan (i.e., np.nan) before the first occurrence of 'A' in column 'b'.
The desired output is the following:
desired_output_df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [3,4,5,'A',3,4,'A',1,'A',1,3,'A',2,3],
'c' : [np.nan,np.nan,np.nan,10,1,1,20,np.nan,10,0,1,20,1,1]})
print(desired_output_df)
id b c
0 1 3 NaN
1 1 4 NaN
2 1 5 NaN
3 1 A 10.0
4 1 3 1.0
5 1 4 1.0
6 1 A 20.0
7 2 1 NaN
8 2 A 10.0
9 2 1 0.0
10 2 3 1.0
11 2 A 20.0
12 2 2 1.0
13 2 3 1.0
I am able to get the index of the values of column c that I want to change using the following command: df.groupby('id').apply(lambda x: x.loc[:(x.b == 'A').idxmax()-1]).index. But the result is a "MultiIndex" and I can't seem to use it to replace the values.
MultiIndex([(1, 0),
(1, 1),
(1, 2),
(2, 7)],
names=['id', None])
Thanks in advance.
Try:
df['c'] = np.where(df.groupby('id').apply(lambda x: x['b'].eq('A').cumsum()) > 0, df['c'], np.nan)
print(df)
Prints:
id b c
0 1 3 NaN
1 1 4 NaN
2 1 5 NaN
3 1 A 10.0
4 1 3 1.0
5 1 4 1.0
6 1 A 20.0
7 2 1 NaN
8 2 A 10.0
9 2 1 0.0
10 2 3 1.0
11 2 A 20.0
12 2 2 1.0
13 2 3 1.0
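An apply-free variant (my own sketch, not part of the answer above): a group-wise cumulative max of the 'A' flag marks every row from the first 'A' onward, and where keeps only those rows:

```python
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b': [3,4,5,'A',3,4,'A',1,'A',1,3,'A',2,3],
                   'c': [1,0,1,10,1,1,20,1,10,0,1,20,1,1]})

# True from the first 'A' in each id group onward
mask = df['b'].eq('A').astype(int).groupby(df['id']).cummax().astype(bool)
df['c'] = df['c'].where(mask)   # rows before the first 'A' become NaN
```

This avoids groupby.apply entirely, which tends to matter on larger frames.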
I want an easy subtraction of two values: I want to replace the value at [10, 150] with the result of ([10, 150] - [9, 150]).
Somehow the code does not like the rows-1:
for columns in listofcolumns:
    rows = 0
    while rows < row_count:
        column = all_columns.index(columns)
        df_merged.iloc[rows, column] = (df_merged.iloc[rows, column] - df_merged.iloc[rows - 1, column])
        rows = rows + 1
It seems to be the case that the df_merged.iloc[rows-1, column] takes the last value of the column.
I used the exact same line in another script before and it worked.
This would be an example of some columns
Col1 Col2
0 2
0 3
0 4
0 4
1 5
1 7
1 8
1 8
2 8
The output dataframe I want would look like this:
Col1 Col2
NaN NaN
0 1
0 1
0 0
1 1
0 2
0 1
0 1
1 1
If I understood what you want to do, this would be the solution:
data = {'A': [5,7,9,3,2], 'B': [1,4,6,1,2]}
df = pd.DataFrame(data)
df["A"] = df["A"] - df["B"]
DataFrame at the start
A B
0 5 1
1 7 4
2 9 6
3 3 1
4 2 2
DataFrame at the end
A B
0 4 1
1 3 4
2 3 6
3 2 1
4 0 2
Simply use df.diff(1):
df.diff(1)
Col1 Col2
0 NaN NaN
1 0.0 1.0
2 0.0 1.0
3 0.0 0.0
4 1.0 1.0
5 0.0 2.0
6 0.0 1.0
7 0.0 0.0
8 1.0 0.0
The above is based on the following data:
Col1 Col2
0 0 2
1 0 3
2 0 4
3 0 4
4 1 5
5 1 7
6 1 8
7 1 8
8 2 8
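For the record, df.diff(1) is just a vectorized version of subtracting the previous row, which is what the rows-1 loop in the question was trying to do; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'Col1': [0, 0, 0, 0, 1, 1, 1, 1, 2],
                   'Col2': [2, 3, 4, 4, 5, 7, 8, 8, 8]})

out = df - df.shift(1)   # identical to df.diff(1); the first row becomes NaN
```

Unlike iloc[rows - 1, ...], shift(1) never wraps around to the last row at row 0; it produces NaN there instead.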
In the given DataFrame, I am trying to perform a row-wise replace operation where every 1 should be replaced by the value in the Values column.
Input:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1,1,1,2,3,3,4,5,6,7],
'A': [0,1,0,1,0,0,1,0,np.nan,0],
'B': [0,0,0,0,1,1,0,0,0,0],
'C': [1,0,1,0,0,0,0,0,1,1],
'Values': [10, 2, 3,4,9,3,4,5,2,3]})
Expected Output:
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Note: the data is very large.
Use df.where:
df[['A', 'B', 'C']] = df[['A', 'B', 'C']].where(df[['A', 'B', 'C']].ne(1), df['Values'], axis=0)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Or
df[['A', 'B', 'C']] = df[['A', 'B', 'C']].mask(df[['A', 'B', 'C']].eq(1), df['Values'], axis=0)
My data is really large and it is very slow.
If we exploit the nature of your dataset (the A, B, C columns contain only 1s, 0s, or NaNs), you simply have to multiply df['Values'] with each column independently. This should be super fast as it is vectorized.
df['A'] = df['A']*df['Values']
df['B'] = df['B']*df['Values']
df['C'] = df['C']*df['Values']
print(df)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
If you want to explicitly check the condition that the values of A, B, C equal 1 (because those columns could hold values other than NaNs or 0s), then you can use this:
df[['A','B','C']] = (df[['A','B','C']] == 1)*df[['Values']].values
This replaces the columns A, B, C in the original data, but it also replaces NaNs with 0.
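A quick check of that equality-based variant on the sample data from the question (note how the NaN in A is turned into 0, as mentioned):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 3, 3, 4, 5, 6, 7],
                   'A': [0, 1, 0, 1, 0, 0, 1, 0, np.nan, 0],
                   'B': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
                   'C': [1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
                   'Values': [10, 2, 3, 4, 9, 3, 4, 5, 2, 3]})

# boolean grid of "equals 1", multiplied row-wise by the Values column
df[['A', 'B', 'C']] = (df[['A', 'B', 'C']] == 1) * df[['Values']].values
```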
I have a dataframe that looks like below:
L_Type L_ID C_Type E_Code
0 1 1 9
0 1 2 9
0 1 3 9
0 1 4 9
0 2 1 2
0 2 2 2
0 2 3 2
0 2 4 2
0 3 1 3
0 3 2 3
0 3 3 3
0 3 4 3
I need to insert a new row after every 4 rows and increment the value in the third column (C_Type) by 1, as in the table below, keeping the first two columns the same and leaving the last column empty:
L_Type L_ID C_Type E_Code
0 1 1 9
0 1 2 9
0 1 3 9
0 1 4 9
0 1 5
0 2 1 2
0 2 2 2
0 2 3 2
0 2 4 2
0 2 5
0 3 1 3
0 3 2 3
0 3 3 3
0 3 4 3
0 3 5
I have searched other threads but could not figure out the exact solution:
How to insert n DataFrame to another every nth row in Pandas?
Insert new rows in pandas dataframe
You can select rows by slicing, add 1 to column C_Type, and add 0.5 to the index for 100% correct placement, because the default sorting method in DataFrame.sort_index is quicksort, which is not stable, and the distinct 0.5 offsets guarantee the new rows land right after their groups. Finally, join together, sort the index, and restore a default index by concat with DataFrame.reset_index and drop=True:
df['C_Type'] = df['C_Type'].astype(int)
df2 = (df.iloc[3::4]
         .assign(C_Type=lambda x: x['C_Type'] + 1, E_Code=np.nan)
         .rename(lambda x: x + .5))
df1 = pd.concat([df, df2], sort=False).sort_index().reset_index(drop=True)
print (df1)
L_Type L_ID C_Type E_Code
0 0 1 1 9.0
1 0 1 2 9.0
2 0 1 3 9.0
3 0 1 4 9.0
4 0 1 5 NaN
5 0 2 1 2.0
6 0 2 2 2.0
7 0 2 3 2.0
8 0 2 4 2.0
9 0 2 5 NaN
10 0 3 1 3.0
11 0 3 2 3.0
12 0 3 3 3.0
13 0 3 4 3.0
14 0 3 5 NaN
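A self-contained version of the above, building the sample frame first (the E_Code values come from the question's table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'L_Type': [0] * 12,
                   'L_ID': [1] * 4 + [2] * 4 + [3] * 4,
                   'C_Type': [1, 2, 3, 4] * 3,
                   'E_Code': [9] * 4 + [2] * 4 + [3] * 4})

# every 4th row: bump C_Type, blank E_Code, shift the index label by 0.5
df2 = (df.iloc[3::4]
         .assign(C_Type=lambda x: x['C_Type'] + 1, E_Code=np.nan)
         .rename(lambda x: x + .5))

df1 = pd.concat([df, df2], sort=False).sort_index().reset_index(drop=True)
```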
For example, I have a table
A
id price sum
1 2 0
1 6 0
1 4 0
2 2 0
2 10 0
2 1 0
2 5 0
3 1 0
3 5 0
What I want is the following (the last row of the sum column in each group should be the sum of that group's price):
id price sum
1 2 0
1 6 0
1 4 12
2 2 0
2 10 0
2 1 0
2 5 18
3 1 0
3 5 6
What I can do is find the sums using:
A['price'].groupby(A['id']).transform('sum')
However, I don't know how to assign this to the sum column (last row of each group only).
Thanks
Use last_valid_index to locate the rows to fill:
g = df.groupby('id')
l = pd.DataFrame.last_valid_index
df.loc[g.apply(l), 'sum'] = g.price.sum().values
df
id price sum
0 1 2 0
1 1 6 0
2 1 4 12
3 2 2 0
4 2 10 0
5 2 1 0
6 2 5 18
7 3 1 0
8 3 5 6
You could do this:
df.assign(sum=df.groupby('id')['price'].transform('sum').drop_duplicates(keep='last')).fillna(0)
OR
df['sum'] = (df.groupby('id')['price']
               .transform('sum')
               .mask(df.id.duplicated(keep='last'), 0))
Output:
id price sum
0 1 2 0.0
1 1 6 0.0
2 1 4 12.0
3 2 2 0.0
4 2 10 0.0
5 2 1 0.0
6 2 5 18.0
7 3 1 0.0
8 3 5 6.0
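A related variant (my own sketch, not from either answer): flag each group's last row with duplicated and assign the group sums only there, leaving zeros elsewhere:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3],
                   'price': [2, 6, 4, 2, 10, 1, 5, 1, 5],
                   'sum': 0})

last = ~df['id'].duplicated(keep='last')   # True on each group's last row
df.loc[last, 'sum'] = df.groupby('id')['price'].transform('sum')[last]
```

This keeps the sum column integer-typed, since only the selected rows are assigned and no NaN intermediates appear.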