Subtraction between rows in a DataFrame - Python

I want a simple subtraction of two values: replace the value at [10, 150] with the result of ([10, 150] - [9, 150]).
Somehow the code does not like the "rows - 1":
for columns in listofcolumns:
    rows = 0
    while rows < row_count:
        column = all_columns.index(columns)
        df_merged.iloc[rows, column] = (df_merged.iloc[rows, column] - df_merged.iloc[rows - 1, columns])
        rows = rows + 1
It seems that df_merged.iloc[rows - 1, column] takes the last value of the column. I used the exact same line in another script before and it worked.
This is an example with some of the columns:
Col1 Col2
0 2
0 3
0 4
0 4
1 5
1 7
1 8
1 8
2 8
The output DataFrame I want would look like this (the first row has no predecessor, so it is NaN):
Col1 Col2
NaN NaN
0 1
0 1
0 0
1 1
0 2
0 1
0 0
1 0
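The observed behaviour has a simple cause: when rows is 0, rows - 1 is -1, and .iloc treats -1 as the last position, so the first row is differenced against the last one. (The second indexer in the subtraction, columns, also looks like a typo for column.) A minimal sketch of the wraparound, using nothing beyond plain pandas:
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]})
print(df.iloc[-1, 0])  # 30 -- position -1 counts from the end, i.e. the last row
A second pitfall: the loop overwrites rows in place from the top down, so row rows - 1 already holds a difference by the time row rows is processed. The vectorised answers below sidestep both problems.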

If I understood what you want to do, this would be the solution:
import pandas as pd

data = {'A': [5, 7, 9, 3, 2], 'B': [1, 4, 6, 1, 2]}
df = pd.DataFrame(data)
df["A"] = df["A"] - df["B"]
DataFrame at the start
A B
0 5 1
1 7 4
2 9 6
3 3 1
4 2 2
DataFrame at the end
A B
0 4 1
1 3 4
2 3 6
3 2 1
4 0 2
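The question asks for row-wise differences, though; the same vectorised idea carries over with shift. A sketch (my adaptation, reusing listofcolumns from the question):
df_merged[listofcolumns] = (df_merged[listofcolumns]
                            - df_merged[listofcolumns].shift(1))  # row i minus row i-1; the first row becomes NaN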

df.diff(1) computes exactly this: each row minus the previous one.
df.diff(1)
Col1 Col2
0 NaN NaN
1 0.0 1.0
2 0.0 1.0
3 0.0 0.0
4 1.0 1.0
5 0.0 2.0
6 0.0 1.0
7 0.0 0.0
8 1.0 0.0
The above is based on the following data:
Col1 Col2
0 0 2
1 0 3
2 0 4
3 0 4
4 1 5
5 1 7
6 1 8
7 1 8
8 2 8
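For reference, a self-contained version (my reconstruction of the sample data):
import pandas as pd

df = pd.DataFrame({'Col1': [0, 0, 0, 0, 1, 1, 1, 1, 2],
                   'Col2': [2, 3, 4, 4, 5, 7, 8, 8, 8]})
print(df.diff(1))  # each row minus the previous row; the first row is NaN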

Related

Python Pandas DataFrames compare with next rows

I have a DataFrame like this:
col1
0 1
1 3
2 3
3 1
4 2
5 3
6 2
7 2
I want to create a column out by comparing each row with the next: if row 0 is less than row 1, then out is 1; if row 1 is not less than row 2, then out is 0, and so on, like this sample.
col1 out
0 1 1 # 1<3 = 1
1 3 0 # 3<3 = 0
2 3 0 # 3<1 = 0
3 1 1 # 1<2 = 1
4 2 1 # 2<3 = 1
5 3 0 # 3<2 = 0
6 2 0 # 2<2 = 0
7 2 -
I tried this code:
def comp_out(a):
    return np.concatenate(([1], a[1:] > a[2:]))

df['out'] = comp_out(df.col1.values)
It shows an error like this:
ValueError: operands could not be broadcast together with shapes (11,) (10,)
Let's use shift(-1) instead to shift the column up so that each row is aligned with the next one, then use lt for the less-than comparison and astype to convert the booleans to 1/0:
df['out'] = df['col1'].lt(df['col1'].shift(-1)).astype(int)
col1 out
0 1 1
1 3 0
2 3 0
3 1 1
4 2 1
5 3 0
6 2 0
7 2 0
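To see what the shift aligns (a quick illustration, not part of the original answer), shift(-1) moves every value up one row, so each row sits beside its successor:
df['col1'].shift(-1)
0    3.0
1    3.0
2    1.0
3    2.0
4    3.0
5    2.0
6    2.0
7    NaN
Name: col1, dtype: float64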
If the last row should end up as NaN instead (there is nothing to compare it with), strip the last value with iloc before assigning:
df['out'] = df['col1'].lt(df['col1'].shift(-1)).iloc[:-1].astype(int)
df:
col1 out
0 1 1.0
1 3 0.0
2 3 0.0
3 1 1.0
4 2 1.0
5 3 0.0
6 2 0.0
7 2 NaN
If we want to use the function, we should make sure both operands have the same length (the original compared a[1:], of length n-1, with a[2:], of length n-2, hence the broadcast error). Comparing a[:-1] with a[1:] and appending NaN for the final row fixes it:
def comp_out(a):
    return np.concatenate([a[:-1] < a[1:], [np.nan]])

df['out'] = comp_out(df['col1'].to_numpy())
df:
col1 out
0 1 1.0
1 3 0.0
2 3 0.0
3 1 1.0
4 2 1.0
5 3 0.0
6 2 0.0
7 2 NaN

pd dataframe adding rows by id

I have a df with some ids, day numbers and a running sum:
import pandas as pd

data = {'id': [0, 0, 0, 1, 1, 2, 1], 'day': [0, 2, 1, 1, 4, 2, 2], 'running_sum': [1, 4, 2, 1, 6, 6, 3]}
df_1 = pd.DataFrame(data)
id day running_sum
0 0 0 1
1 0 2 4
2 0 1 2
3 1 1 1
4 1 4 6
5 2 2 6
6 1 2 3
I want a DataFrame with all days for each id and the correct running sum:
id day running_sum
0 0 0 1
1 0 1 2
2 0 2 4
3 0 3 4
4 0 4 4
5 1 0 0
6 1 1 1
7 1 2 3
8 1 3 3
9 1 4 6
10 2 0 0
11 2 1 0
12 2 2 6
13 2 3 6
14 2 4 6
Thanks for the help.
Let's see if this logic is what you have in mind:
Set id and day as index:
df_1 = df_1.set_index(['id', 'day'])
Build a new index to reindex df_1 with, introducing the missing days; luckily the existing index is unique, so reindex works fine:
new_index = df_1.index.get_level_values('id').unique()
new_index = pd.MultiIndex.from_product([new_index, range(5)],
                                       names=['id', 'day'])
df_1 = df_1.reindex(new_index)
Group by id and fill down within each group; the remaining nulls are then replaced with zero:
(df_1.assign(running_sum = df_1.groupby('id')
                               .running_sum
                               .ffill()
                               .fillna(0))
     .reset_index()
)
id day running_sum
0 0 0 1.0
1 0 1 2.0
2 0 2 4.0
3 0 3 4.0
4 0 4 4.0
5 1 0 0.0
6 1 1 1.0
7 1 2 3.0
8 1 3 3.0
9 1 4 6.0
10 2 0 0.0
11 2 1 0.0
12 2 2 6.0
13 2 3 6.0
14 2 4 6.0
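Chained end to end, the same steps read (my consolidation, assuming days always run 0 through 4 as in the sample):
idx = pd.MultiIndex.from_product([df_1['id'].unique(), range(5)],
                                 names=['id', 'day'])
out = (df_1.set_index(['id', 'day'])
           .reindex(idx)  # expose missing (id, day) pairs as NaN rows
           .assign(running_sum=lambda d: d.groupby('id').running_sum.ffill().fillna(0))
           .reset_index())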
If you are not averse to using an add-on library, the complete function/method from pyjanitor could help abstract the process:
# pip install pyjanitor
import pyjanitor
df = df_1.complete('id', {'day':range(5)}) # explicitly expose the missing values
df.assign(running_sum = df.groupby('id').running_sum.ffill().fillna(0))
id day running_sum
0 0 0 1.0
1 0 1 2.0
2 0 2 4.0
3 0 3 4.0
4 0 4 4.0
5 1 0 0.0
6 1 1 1.0
7 1 2 3.0
8 1 3 3.0
9 1 4 6.0
10 2 0 0.0
11 2 1 0.0
12 2 2 6.0
13 2 3 6.0
14 2 4 6.0
All this is premised on the assumption that I got the logic right.
You can unstack/stack and ffill. The tricky part is to get the missing 'days':
missing = set(range(0, df_1['day'].max() + 1)).difference(df_1['day'].unique())
(pd.concat([df_1,
            pd.DataFrame({'id': 0, 'day': list(missing)})])
   .set_index(['id', 'day'])
   .unstack()
   .stack(dropna=False)  # adds the missing values back as NaN rows
   .sort_index()
   .groupby('id')
   .ffill()
   .fillna(0, downcast='infer')
   .reset_index()
)
output:
id day running_sum
0 0 0 1
1 0 1 2
2 0 2 4
3 0 3 4
4 0 4 4
5 1 0 0
6 1 1 1
7 1 2 3
8 1 3 3
9 1 4 6
10 2 0 0
11 2 1 0
12 2 2 6
13 2 3 6
14 2 4 6

Row-wise replace operation in pandas dataframe

In the given data frame, I am trying to perform a row-wise replace operation where every 1 should be replaced by the value in Values.
Input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 3, 3, 4, 5, 6, 7],
                   'A': [0, 1, 0, 1, 0, 0, 1, 0, np.nan, 0],
                   'B': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
                   'C': [1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
                   'Values': [10, 2, 3, 4, 9, 3, 4, 5, 2, 3]})
Expected Output:
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Note: the data is very large.
Use df.where:
df[['A','B','C']] = df[['A','B','C']].where(df[['A','B','C']].ne(1), df['Values'], axis=0)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Or with mask, the inverse of where (it replaces where the condition is True):
df[['A','B','C']] = df[['A','B','C']].mask(df[['A','B','C']].eq(1), df['Values'], axis=0)
My data is really large and it is very slow.
If we exploit the nature of your dataset (the A, B, C columns hold only 1s, 0s or NaNs), you simply have to multiply df['Values'] by each column independently. This should be very fast, as it is vectorized.
df['A'] = df['A']*df['Values']
df['B'] = df['B']*df['Values']
df['C'] = df['C']*df['Values']
print(df)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
If you want to explicitly check the condition where the values of A, B, C are 1 (maybe because those columns could hold values other than NaNs or 0s), then you can use this:
df[['A','B','C']] = (df[['A','B','C']] == 1)*df[['Values']].values
This will replace the columns A, B, C in the original data, but it also replaces NaNs with 0.
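As an aside (not part of the original answers), the three per-column multiplications can be collapsed into a single vectorised call with DataFrame.mul, broadcasting Values along the rows; like the loop-free version above, it keeps the NaN in A intact:
df[['A','B','C']] = df[['A','B','C']].mul(df['Values'], axis=0)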

Pandas Insert a new row after every nth row

I have a dataframe that looks like the one below:
L_Type L_ID C_Type E_Code
0 1 1 9
0 1 2 9
0 1 3 9
0 1 4 9
0 2 1 2
0 2 2 2
0 2 3 2
0 2 4 2
0 3 1 3
0 3 2 3
0 3 3 3
0 3 4 3
I need to insert a new row after every 4 rows, incrementing the value in the third column (C_Type) by 1, as in the table below, keeping the first two columns the same and leaving the last column empty:
L_Type L_ID C_Type E_Code
0 1 1 9
0 1 2 9
0 1 3 9
0 1 4 9
0 1 5
0 2 1 2
0 2 2 2
0 2 3 2
0 2 4 2
0 2 5
0 3 1 3
0 3 2 3
0 3 3 3
0 3 4 3
0 3 5
I have searched other threads but could not figure out the exact solution:
How to insert n DataFrame to another every nth row in Pandas?
Insert new rows in pandas dataframe
You can select every fourth row by slicing, add 1 to column C_Type, and add 0.5 to the index so that the new rows sort directly after the originals (distinct index values guarantee correct placement, because the default method of DataFrame.sort_index is quicksort, which is not stable). Finally join the pieces with concat, sort the index, and rebuild a default index via DataFrame.reset_index with drop=True:
import numpy as np
import pandas as pd

df['C_Type'] = df['C_Type'].astype(int)
df2 = (df.iloc[3::4]
         .assign(C_Type=lambda x: x['C_Type'] + 1, E_Code=np.nan)
         .rename(lambda x: x + .5))

df1 = pd.concat([df, df2], sort=False).sort_index().reset_index(drop=True)
print(df1)
L_Type L_ID C_Type E_Code
0 0 1 1 9.0
1 0 1 2 9.0
2 0 1 3 9.0
3 0 1 4 9.0
4 0 1 5 NaN
5 0 2 1 2.0
6 0 2 2 2.0
7 0 2 3 2.0
8 0 2 4 2.0
9 0 2 5 NaN
10 0 3 1 3.0
11 0 3 2 3.0
12 0 3 3 3.0
13 0 3 4 3.0
14 0 3 5 NaN
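For completeness, the sample frame above can be rebuilt like this (my reconstruction of the question's table):
import pandas as pd

df = pd.DataFrame({'L_Type': [0] * 12,
                   'L_ID': [1] * 4 + [2] * 4 + [3] * 4,
                   'C_Type': list(range(1, 5)) * 3,
                   'E_Code': [9] * 4 + [2] * 4 + [3] * 4})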

Pandas assign the groupby sum value to the last row in the original table

For example, I have a table A:
id price sum
1 2 0
1 6 0
1 4 0
2 2 0
2 10 0
2 1 0
2 5 0
3 1 0
3 5 0
What I want is this (the last row of sum in each group should hold the sum of price for that group):
id price sum
1 2 0
1 6 0
1 4 12
2 2 0
2 10 0
2 1 0
2 5 18
3 1 0
3 5 6
I can find the group sums using
A['price'].groupby(A['id']).transform('sum')
However, I don't know how to assign this to the sum column (last row of each group only).
Thanks.
Use last_valid_index to locate the rows to fill:
g = df.groupby('id')
l = pd.DataFrame.last_valid_index  # gives a frame's last index label; with no NaNs this is each group's last row
df.loc[g.apply(l), 'sum'] = g.price.sum().values
df
id price sum
0 1 2 0
1 1 6 0
2 1 4 12
3 2 2 0
4 2 10 0
5 2 1 0
6 2 5 18
7 3 1 0
8 3 5 6
You could do this:
df.assign(sum=df.groupby('id')['price'].transform('sum').drop_duplicates(keep='last')).fillna(0)
(Note that drop_duplicates acts on the summed values here, so this relies on different groups having different sums.)
Or, without that restriction:
df['sum'] = (df.groupby('id')['price']
               .transform('sum')
               .mask(df.id.duplicated(keep='last'), 0))
Output:
id price sum
0 1 2 0.0
1 1 6 0.0
2 1 4 12.0
3 2 2 0.0
4 2 10 0.0
5 2 1 0.0
6 2 5 18.0
7 3 1 0.0
8 3 5 6.0
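Another spelling of the same idea (my addition, not from the original answers): groupby(...).tail(1) returns the last row of each group, and its index can drive the assignment directly; the right-hand Series aligns on those positions:
df.loc[df.groupby('id').tail(1).index, 'sum'] = df.groupby('id')['price'].transform('sum')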
