retain function in python

I am currently converting from SAS to Python pandas. One question I have is: does pandas have something like the RETAIN functionality in SAS, so that I can dynamically reference the last record? In the following code, I have to manually loop through each line and reference the previous record. It seems pretty slow compared to the equivalent SAS program. Is there any way to make it more efficient in pandas? Thank you.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [0, 0, 1, 0]})
df['C'] = np.nan
df['lag_C'] = np.nan
for row in df.index:
    if row == df.head(1).index:
        df.loc[row, 'C'] = (df.loc[row, 'A'] == 0) + 0
    else:
        if df.loc[row, 'B'] == 1:
            df.loc[row, 'C'] = 1
        elif df.loc[row, 'lag_C'] == 0:
            df.loc[row, 'C'] = 0
        elif df.loc[row, 'lag_C'] != 0:
            df.loc[row, 'C'] = df.loc[row, 'lag_C'] + 1
    if row != df.tail(1).index:
        df.loc[row + 1, 'lag_C'] = df.loc[row, 'C']

This is a fairly complicated algorithm, but I will try a vectorized approach.
If I understand it correctly, a cumulative sum can be used, as in this question. The last column lag_C is simply column C shifted by one row.
But this cannot be applied directly to the first rows of df, because those rows are computed from the first value of column A and sometimes column B. So I created a helper column D that marks the distinguished rows; its values are later copied to the output column C when the conditions are True.
I changed the input data to test the problematic first rows. I tested all three possibilities for the first rows of column B combined with the first row of column A.
My input conditions are:
Columns A and B contain only 1 or 0. Columns C and lag_C are helper columns containing only NaN.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0], 'B': [0,0,1,1,0,0,0,1,0,1,0]})
# cumulative sum of column B
df1['C'] = df1['B'].cumsum()
df1['lag_C'] = 1
# the first 'group' (minimal value of C) is problematic, copy it to column D for later use
df1.loc[df1['C'] == df1['C'].min(), 'D'] = df1['B']
# cumulative sums within groups back to column C
df1['C'] = df1.groupby(['C'])['lag_C'].cumsum()
# correct the problematic states in column C, using the values saved in D
if df1['A'].loc[0] == 1:
    df1.loc[df1['D'].notnull(), 'C'] = df1['D']
if (df1['A'].loc[0] == 1) and (df1['B'].loc[0] == 1):
    df1.loc[df1['D'].notnull(), 'C'] = 0
del df1['D']
# lag_C is column C shifted down by one row
df1['lag_C'] = df1['C'].shift(1)
print(df1)
# A B C lag_C
#0 1 0 0 NaN
#1 1 0 0 0
#2 1 1 1 0
#3 1 1 1 1
#4 1 0 2 1
#5 0 0 3 2
#6 0 0 4 3
#7 1 1 1 4
#8 1 0 2 1
#9 0 1 1 2
#10 0 0 2 1
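For what it's worth, the same result on this sample can be sketched more compactly with groupby and cumcount. This only covers the sample's starting state (A[0] == 1, B[0] == 0); the other start states still need the corrections shown above:

import pandas as pd

df = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0], 'B': [0,0,1,1,0,0,0,1,0,1,0]})
g = df['B'].cumsum()                    # a new group starts at every B == 1
df['C'] = df.groupby(g).cumcount() + 1  # running count within each group
df.loc[g == 0, 'C'] = 0                 # rows before the first B == 1 stay 0 when A[0] == 1
df['lag_C'] = df['C'].shift(1)
print(df)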

Related

Set values in a column based on the values of other columns as a group

I have a df that looks something like this:
  name  A  B  C  D
1 bar   1  0  1  1
2 foo   0  0  0  1
3 cat   1  0 -1  0
4 pet   0  0  0  1
5 ser   0  0 -1  0
6 chet  0  0  0  1
I need to use the loc method to add values to a new column ('E') based on the values of the other columns as a group; for instance, if the values are [1, 0, 0, 0], the value in column E should be 1. I've tried this:
d = {'A': 1, 'B': 0, 'C': 0, 'D': 0}
A = pd.Series(data=d, index=['A', 'B', 'C', 'D'])
df.loc[df.iloc[:, 1:] == A, 'E'] = 1
It didn't work. I need to use the loc method or another numpy-based method since the dataset is huge. If it were possible to avoid creating a Series for the comparison, that would also be great: somehow extracting the values of columns A, B, C and D and comparing them as a group for each row.
You can compare the rows with A and test whether all columns match using DataFrame.all. Selecting df[A.index] restricts the comparison to the columns present in A, so the name column does not break the alignment:
df.loc[(df[A.index] == A).all(axis=1), 'E'] = 1
For a 0/1 column:
df['E'] = (df[A.index] == A).all(axis=1).astype(int)
df['E'] = np.where((df[A.index] == A).all(axis=1), 1, 0)
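A minimal end-to-end check on the sample data from the question (note that none of the sample rows actually equals [1, 0, 0, 0], so E comes out 0 everywhere here):

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['bar', 'foo', 'cat', 'pet', 'ser', 'chet'],
                   'A': [1, 0, 1, 0, 0, 0],
                   'B': [0, 0, 0, 0, 0, 0],
                   'C': [1, 0, -1, 0, -1, 0],
                   'D': [1, 1, 0, 1, 0, 1]})
A = pd.Series({'A': 1, 'B': 0, 'C': 0, 'D': 0})
df['E'] = (df[A.index] == A).all(axis=1).astype(int)
print(df)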

Python. Change numeric data into categorical [duplicate]

I have a DataFrame df:
   A  B
a  2  2
b  3  1
c  1  3
I want to create a new column based on the following criteria:
if row A == B: 0
if row A > B: 1
if row A < B: -1
so given the above table, it should be:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
For typical if/else cases I do np.where(df.A > df.B, 1, -1). Does pandas provide a special syntax for solving my problem in one step (without the necessity of creating 3 new columns and then combining the result)?
To formalize some of the approaches laid out above:
Create a function that operates on the rows of your dataframe like so:
def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val
Then apply it to your dataframe passing in the axis=1 option:
In [1]: df['C'] = df.apply(f, axis=1)
In [2]: df
Out[2]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
Of course, this is not vectorized, so performance may not be as good when scaled to a large number of records. Still, I think it is much more readable, especially coming from a SAS background.
Edit
Here is the vectorized version
df['C'] = np.where(
    df['A'] == df['B'], 0, np.where(
        df['A'] > df['B'], 1, -1))
df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1
df.loc[df['A'] < df['B'], 'C'] = -1
This is easy to solve using indexing. The first line of code reads: where column A equals column B, create and set column C to 0.
For this particular relationship, you could use np.sign:
>>> df["C"] = np.sign(df.A - df.B)
>>> df
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
When you have multiple if conditions, numpy.select is the way to go:
In [4102]: import numpy as np
In [4098]: conditions = [df.A.eq(df.B), df.A.gt(df.B), df.A.lt(df.B)]
In [4096]: choices = [0, 1, -1]
In [4100]: df['C'] = np.select(conditions, choices)
In [4101]: df
Out[4101]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
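np.select also accepts a default value, which is handy when the conditions do not cover every row (a small sketch; here the three conditions happen to be exhaustive, so the default never fires):

df['C'] = np.select(conditions, choices, default=0)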
Let's say the above is your original dataframe and you want to add a new column 'elderly'.
If age is greater than or equal to 50, we mark the row "yes"; otherwise "no".
Step 1: get the indexes of rows whose age is greater than or equal to 50:
row_indexes = df[df['age'] >= 50].index
Step 2: using .loc we can assign a new value to the column:
df.loc[row_indexes, 'elderly'] = "yes"
The same for ages below 50:
row_indexes = df[df['age'] < 50].index
df.loc[row_indexes, 'elderly'] = "no"
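A minimal runnable version of the two-step approach, with made-up sample data:

import pandas as pd

df = pd.DataFrame({'name': ['ann', 'bob', 'cal'], 'age': [62, 34, 50]})
df.loc[df[df['age'] >= 50].index, 'elderly'] = "yes"
df.loc[df[df['age'] < 50].index, 'elderly'] = "no"
print(df)
#   name  age elderly
# 0  ann   62     yes
# 1  bob   34      no
# 2  cal   50     yes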
You can use the method mask:
df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)

trim last rows of a pandas dataframe based on a condition

let's assume a dataframe like this:
idx x y
0 a 3
1 b 2
2 c 0
3 d 2
4 e 5
how can I trim the bottom rows, based on a condition, so that any row after the last one matching the condition would be removed?
for example:
with the following condition: y == 0
the output would be
idx x y
0 a 3
1 b 2
2 c 0
the condition can happen many times, but the last one is the one that triggers the cut.
Method 1:
Using index.max & iloc:
index.max to get the last row matching the condition y == 0
iloc to slice the dataframe up to the index found
idx = df.query('y.eq(0)').index.max()+1
# idx = df.query('y==0').index.max()+1  -- if pandas < 0.25
df.iloc[:idx]
Output
x y
0 a 3
1 b 2
2 c 0
Method 2:
Using np.where
idx = np.where(df['y'].eq(0), df.index, 0).max()+1
df.iloc[:idx]
Output
x y
0 a 3
1 b 2
2 c 0
You could also do the following. Here np.where returns a tuple, so we access the indexes as the first element of that tuple with np.where(df.y == 0)[0]; the last occurrence is then the last element of this vector, and finally we add 1 to the index so that slicing includes the last occurrence:
df_cond = df.iloc[:np.where(df.y == 0)[0][-1]+1, :]
or you could do:
df_cond = df[:df.y.eq(0).cumsum().idxmax() + 1]
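To see why the cumsum/idxmax trick lands on the last match, a short trace on hypothetical data:

import pandas as pd

s = pd.Series([3, 0, 2, 0, 5])   # two matches, at index 1 and 3
mask = s.eq(0)                   # [False, True, False, True, False]
print(mask.cumsum().tolist())    # [0, 1, 1, 2, 2]
print(mask.cumsum().idxmax())    # 3 -- first index of the maximum, i.e. the last match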
Set up your dataframe:
data = [
    ['a', 3],
    ['b', 2],
    ['c', 0],
    ['d', 2],
    ['e', 5],
]
df = pd.DataFrame(data, columns=['x', 'y']).reset_index().rename(columns={'index': 'idx'}).sort_values('idx')
Then find your cutoff (assuming the idx column is already sorted):
cutoff = df[df['y'] == 0].idx.max()
The df['y'] == 0 is your condition. Then get the max idx that meets that condition and save it as our cutoff; since the condition can happen many times, the max gives the last occurrence, which is the one that triggers the cut.
Finally, create a new dataframe using your cutoff:
df_new = df[df.idx <= cutoff].copy()
Output:
df_new
idx x y
0 0 a 3
1 1 b 2
2 2 c 0
I would do something like this:
df.iloc[:df['y'].eq(0).idxmax()+1]
Just look for the largest index where your condition is true.
EDIT
So the above code won't work, because idxmax() only takes the first index where the value is true. So we can do the following to trick it:
df.iloc[:df['y'].eq(0).sort_index(ascending=False).idxmax()+1]
Flip the index, so the last index is the first index that idxmax picks up.
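A quick check with made-up data where the condition occurs twice, to confirm the second zero triggers the cut:

import pandas as pd

df = pd.DataFrame({'x': list('abcdef'), 'y': [3, 0, 2, 0, 5, 1]})
cut = df['y'].eq(0).sort_index(ascending=False).idxmax() + 1
print(df.iloc[:cut])
#    x  y
# 0  a  3
# 1  b  0
# 2  c  2
# 3  d  0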

Is there any way to replace values with different values in the same row if they meet a certain condition using a for loop?

I need to replace the values of a certain cell with values from another cell if a certain condition is met.
for r in df:
    if df['col1'] > 1:
        df['col2']
    else:
I am hoping for every value in column 1 to be replaced with its respective value in column 2 if the value of the row in column 1 is greater than 1.
No need to loop through the entire dataframe:
idx = df['col1'] > 1
df.loc[idx, 'col1'] = df.loc[idx, 'col2']
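Equivalently, the same replacement as a single vectorized assignment with numpy.where:

import numpy as np

# keep col1 where the condition fails, take col2 where it holds
df['col1'] = np.where(df['col1'] > 1, df['col2'], df['col1'])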
Using a for loop: note that the row yielded by iterrows is a copy, so write the result back with df.at rather than assigning to the row itself:
for i, row in df.iterrows():
    if row['col1'] > 1:
        df.at[i, 'col1'] = row['col2']
    elif condition:
        pass  # put assignment here
    else:
        pass  # put assignment here
Here is an example
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 4, 6]})
print(df)
print('----------')
# the condition here is A^2 == B
df.loc[df['A'] * df['A'] == df['B'], 'A'] = df['B']
print(df)
output
A B
0 1 4
1 2 4
2 3 6
----------
A B
0 1 4
1 4 4
2 3 6
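For reference, the same replacement can also be written with Series.mask, which replaces values wherever the condition holds:

df['A'] = df['A'].mask(df['A'] * df['A'] == df['B'], df['B'])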
