Is there a simple way to reference the previous row when iterating through a dataframe?
In the following dataframe I would like column B to change to 1 when A > 1 and remain at 1 until A < -1, when it changes to -1.
In [11]: df
Out[11]:
                   A  B
2000-01-01 -0.182994  0
2000-01-02  1.290203  0
2000-01-03  0.245229  0
2000-01-08 -1.230742  0
2000-01-09  0.534939  0
2000-01-10  1.324027  0
This is what I've tried to do, but clearly you can't just subtract 1 from the index:
for idx, row in df.iterrows():
    if df["A"][idx] < -1:
        df["B"][idx] = -1
    elif df["A"][idx] > 1:
        df["B"][idx] = 1
    else:
        df["B"][idx] = df["B"][idx-1]
I also tried using get_loc but got completely lost, I'm sure I'm missing a very simple solution!
Is this what you are trying to do?
In [38]: df = DataFrame(randn(10,2),columns=list('AB'))
In [39]: df['B'] = np.nan
In [40]: df.loc[df.A<-1,'B'] = -1
In [41]: df.loc[df.A>1,'B'] = 1
In [42]: df.ffill()
Out[42]:
          A  B
0 -1.186808 -1
1 -0.095587 -1
2 -1.921372 -1
3 -0.772836 -1
4  0.016883 -1
5  0.350778 -1
6  0.165055 -1
7  1.101561  1
8 -0.346786  1
9 -0.186263  1
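For completeness, a minimal self-contained version of the same idea, with imports spelled out (note that In [42] above displays the ffill result without assigning it back):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
df['B'] = np.nan                 # start with no signal
df.loc[df.A < -1, 'B'] = -1      # signal switches to -1
df.loc[df.A > 1, 'B'] = 1        # signal switches to 1
df['B'] = df['B'].ffill()        # carry the last signal forward
# rows before the first signal remain NaN; fill them if you need a default:
# df['B'] = df['B'].fillna(0)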
Similar question here: Reference values in the previous row with map or apply.
My impression is that pandas should handle iterations and we shouldn't have to do it on our own... Therefore, I chose to use the DataFrame 'apply' method.
Here is the same answer I posted on the other question linked above...
You can use the dataframe 'apply' function and leverage the otherwise unused 'kwargs' parameter to store the previous row.
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 10, 20]})
new_col = 'c'

def apply_func_decorator(func):
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper

@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)

df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
#    a   b   c
# 0  0   0   0
# 1  1  10  11
# 2  2  20  33
This example uses a decorator to store the previous row in a dictionary and then pass it to the function when Pandas calls it on the next row.
Disclaimer 1: The 'prev_row' variable starts off empty for the first row, so when using it in the apply function I had to supply a default value to avoid a 'KeyError'.
Disclaimer 2: I am fairly certain this will be slower than a plain apply operation, but I did not run any tests to figure out how much.
Try this. If the first value is neither >= 1 nor < -1, it is set to 0 (or whatever you like):
df["B"] = None
df["B"] = np.where(df['A'] >= 1, 1,df['B'])
df["B"] = np.where(df['A'] < -1, -1,df['B'])
df = df.ffill().fillna(0)
This solves the problem as stated, but the general way to reference the previous row is .shift() (or looking up the label at the previous integer position).
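For illustration, a minimal sketch of .shift() on data like the question's ('rose' is just an illustrative column name):
import pandas as pd

df = pd.DataFrame({'A': [-0.18, 1.29, 0.25, -1.23, 0.53, 1.32]})
prev_A = df['A'].shift(1)      # previous row's A; NaN for the first row
df['rose'] = df['A'] > prev_A  # compare each row with the one before it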
Related
I want to get the information in which row the value 1 occurs last for each column of my dataframe. Given this last row index I want to calculate the "recency" of the occurrence. Like so:
>> df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
>> df
   a  b  c  d
0  0  1  1  0
1  0  1  0  0
2  1  1  0  0
3  0  1  0  0
4  0  1  1  0
Desired result:
>> calculate_recency_vector(df)
[3,1,1,None]
The desired result shows for each column "how many rows ago" the value 1 appeared for the last time. E.g. for column a, the value 1 appears last in the 3rd-last row, hence the recency of 3 in the result vector. Any ideas how to implement this?
Edit: to avoid confusion, I changed the desired output for the last column from 0 to None. This column has no recency because the value 1 does not occur at all.
Edit II: Thanks for the great answers! I have to calculate this recency vector approx. 150k times on dataframes shaped (42,250). A more efficient solution would be much appreciated.
A loop-less solution which is faster & cleaner:
def calculate_recency_for_one_column(column: pd.Series) -> int:
    non_zero_values_of_col = column[column.astype(bool)]
    if non_zero_values_of_col.empty:
        return 0
    return len(column) - non_zero_values_of_col.index[-1]

df = pd.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda column: calculate_recency_for_one_column(column), axis=0)
a 3
b 1
c 1
d 0
dtype: int64
Sidenote: Using pd.apply() is slow (SO explanation). There exist faster solutions like using np.where or using apply(...,raw=True). See this question for details.
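Given the efficiency requirement in Edit II, here is a hedged, fully loop-free NumPy sketch (assumes the frame holds only 0/1 values; returns NaN where 1 never occurs, mirroring None):
import numpy as np
import pandas as pd

def recency_vector_np(df: pd.DataFrame) -> np.ndarray:
    arr = df.to_numpy().astype(bool)
    # argmax on the row-reversed array finds the last True per column
    last_from_end = arr[::-1].argmax(axis=0)
    recency = (last_from_end + 1).astype(float)
    recency[~arr.any(axis=0)] = np.nan  # column never contains a 1
    return recency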
This
df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda x: ([df.shape[0] - i for i, v in x.items() if v == 1] or [None])[-1], axis=0)
produces the desired output as a pd.Series, with the only difference that the result is float and None is replaced by pandas NaN; you could then take the desired column.
With this example dataframe, you can define a function as follows:
def calculate_recency_vector(df: pd.DataFrame, condition: int) -> list:
    recency_vector = []
    for col in df.columns:
        last = None  # row position of the last occurrence of `condition`
        for i, y in enumerate(df[col].to_list()):
            if y == condition:
                last = i
        # None when `condition` never occurs in the column
        recency = None if last is None else len(df[col]) - last
        recency_vector.append(recency)
    return recency_vector
Running the function, it will return this:
calculate_recency_vector(df, 1)
[3, 1, 1, None]
How do I set the values of a pandas dataframe slice, where the rows are chosen by a boolean expression and the columns are chosen by position?
I have done it in the following way so far:
>>> vals = [5,7]
>>> df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,5,7,7]})
>>> df
   a  b
0  1  5
1  2  5
2  3  7
3  4  7
>>> df.iloc[:,1][df.iloc[:,1] == vals[0]] = 0
>>> df
   a  b
0  1  0
1  2  0
2  3  7
3  4  7
This works as expected on this small sample, but gives me the following warning on my real life dataframe:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
What is the recommended way to achieve this?
Use DataFrame.columns and DataFrame.loc:
col = df.columns[1]
df.loc[df.loc[:,col] == vals[0], col] = 0
One way is to use index of column header and loc (label based indexing):
df.loc[df.iloc[:, 1] == vals[0], df.columns[1]] = 0
Another way is to use np.where with iloc (integer position indexing); np.where returns a tuple of arrays holding the index positions where the condition is True:
df.iloc[np.where(df.iloc[:, 1] == vals[0])[0], 1] = 0
A combination of loc and iloc may look like it works too:
df.loc[df.iloc[:, 1] == vals[0]].iloc[:, 1] = 0
but this is chained indexing again: the assignment targets a temporary copy, so it triggers the same SettingWithCopyWarning and may leave df unchanged.
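For reference, a minimal self-contained sketch of the recommended single-.loc pattern on the question's data:
import pandas as pd

vals = [5, 7]
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 5, 7, 7]})
col = df.columns[1]                  # resolve the column label from its position
df.loc[df[col] == vals[0], col] = 0  # one .loc call: no chained indexing, no warning
print(df)                            # b is 0 where it used to be 5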
I'm trying to avoid for loops when applying a function row by row to a pandas df. I have looked at many vectorization examples but have not come across anything that works completely. Ultimately I am trying to add a column to the df containing, for each row, the sum of the points awarded by every condition that the row satisfies.
I have looked at np.apply_along_axis, but that's just a hidden loop, and at np.where, but I could not see it working for the 25 conditions that I am checking.
          A         B         C  ...         R         S         T
0  0.279610  0.307119  0.553411  ...  0.897890  0.757151  0.735718
1  0.718537  0.974766  0.040607  ...  0.470836  0.103732  0.322093
2  0.222187  0.130348  0.894208  ...  0.480049  0.348090  0.844101
3  0.834743  0.473529  0.031600  ...  0.049258  0.594022  0.562006
4  0.087919  0.044066  0.936441  ...  0.259909  0.979909  0.403292

[5 rows x 20 columns]
def point_calc(row):
    points = 0
    if row[2] >= row[13]:
        points += 1
    if row[2] < 0:
        points -= 3
    if row[4] >= row[8]:
        points += 2
    if row[4] < row[12]:
        points += 1
    if row[16] == row[18]:
        points += 4
    return points
points_list = []
for indx, row in df.iterrows():
    value = point_calc(row)
    points_list.append(value)
df['points'] = points_list
This is obviously not efficient but I am not sure how I can vectorize my code since it requires the values per row for each column in the df to get a custom summation of the conditions.
Any help in pointing me in the right direction would be much appreciated.
Thank you.
UPDATE:
I was able to get a little more speed replacing the df.iterrows section with df.apply.
df['points'] = df.apply(lambda row: point_calc(row), axis=1)
UPDATE2:
I updated the function as follows and substantially decreased the run time, with a 10x speed increase over using df.apply and the initial function.
def point_calc(row):
    a1 = np.where(row[:, 2] >= row[:, 13], 1, 0)
    a2 = np.where(row[:, 2] < 0, -3, 0)
    a3 = np.where(row[:, 4] >= row[:, 8], 2, 0)
    # etc.
    all_points = a1 + a2 + a3  # + etc.
    return all_points

df['points'] = point_calc(df.to_numpy())
What I am still working on is using np.vectorize on the function itself to see if that can be improved upon as well.
You can try it the following way:
# this is a small version of your dataframe
df = pd.DataFrame(np.random.random((10,4)), columns=list('ABCD'))
It looks like that:
          A         B         C         D
0  0.724198  0.444924  0.554168  0.368286
1  0.512431  0.633557  0.571369  0.812635
2  0.680520  0.666035  0.946170  0.652588
3  0.467660  0.277428  0.964336  0.751566
4  0.762783  0.685524  0.294148  0.515455
5  0.588832  0.276401  0.336392  0.997571
6  0.652105  0.072181  0.426501  0.755760
7  0.238815  0.620558  0.309208  0.427332
8  0.740555  0.566231  0.114300  0.353880
9  0.664978  0.711948  0.929396  0.014719
You can create a Series which counts your points and is initialized with zeros:
points = pd.Series(0, index=df.index)
It looks like that:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
Afterwards you can add and subtract values line by line if you want:
The condition within the brackets selects the rows where the condition is true.
Therefore -= and += are only applied in those rows.
points.loc[df.A < df.C] += 1
points.loc[df.B < 0] -= 3
At the end you can extract the values of the series as numpy array if you want (optional):
point_list = points.values
Does this solve your problem?
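For completeness, a hedged sketch translating the original point_calc conditions into this pattern (assumes the OP's 20-column frame; positions are resolved via iloc since the real column names are not shown):
points = pd.Series(0, index=df.index)
points.loc[df.iloc[:, 2] >= df.iloc[:, 13]] += 1
points.loc[df.iloc[:, 2] < 0] -= 3
points.loc[df.iloc[:, 4] >= df.iloc[:, 8]] += 2
points.loc[df.iloc[:, 4] < df.iloc[:, 12]] += 1
points.loc[df.iloc[:, 16] == df.iloc[:, 18]] += 4
df['points'] = points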
I want to delete rows in a pandas dataframe where the second column = 0.
So this ...
  Code  Int
0    A    0
1    A    1
2    B    1
Would turn into this ...
  Code  Int
0    A    1
1    B    1
Any help greatly appreciated!
Find the rows you want to delete, and use drop.
delete_row = df[df["Int"]==0].index
df = df.drop(delete_row)
print(df)
  Code  Int
1    A    1
2    B    1
Furthermore, you can use iloc to find the rows if you only know the position of the column:
delete_row = df[df.iloc[:,1]==0].index
df = df.drop(delete_row)
You could use loc and drop in one line of code.
df = df.drop(df["Int"].loc[df["Int"]==0].index)
You could use this as well!
df = df[df.Int != 0]
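Note that all of these keep the original index labels (1 and 2 here); to get the renumbered index shown in the desired output, add reset_index:
df = df[df.Int != 0].reset_index(drop=True)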
I am trying to create a function that uses df.iterrows() and Series.nlargest. I want to iterate over each row and find the largest number and then mark it as a 1. This is the data frame:
A B C
9 6 5
3 7 2
Here is the output I wish to have:
A B C
1 0 0
0 1 0
This is the function I wish to use here:
def get_top_n(df, top_n):
    """
    Parameters
    ----------
    df : DataFrame
    top_n : int
        The top number to get

    Returns
    -------
    top_numbers : DataFrame
        Returns the top number marked with a 1
    """
    # Implement Function
    for row in df.iterrows():
        top_numbers = row.nlargest(top_n).sum()
    return top_numbers
I get the following error:
AttributeError: 'tuple' object has no attribute 'nlargest'
Help would be appreciated on how to re-write my function in a neater way that actually works! Thanks in advance.
Add an i variable, because iterrows returns a tuple of (index, Series) for each row:
for i, row in df.iterrows():
    top_numbers = row.nlargest(top_n).sum()
General solution with numpy.argsort for positions in descending order, then compare and convert boolean array to integers:
def get_top_n(df, top_n):
    if not isinstance(top_n, int):
        raise ValueError("Value is not an integer")
    elif top_n > len(df.columns):
        raise ValueError("Value is higher than the number of columns")
    else:
        # double argsort turns values into descending ranks (0 = largest);
        # a single argsort only happens to work on some inputs
        arr = ((-df.values).argsort(axis=1).argsort(axis=1) < top_n).astype(int)
        df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
        return df1
df1 = get_top_n(df, 2)
print (df1)
   A  B  C
0  1  1  0
1  1  1  0
df1 = get_top_n(df, 1)
print (df1)
   A  B  C
0  1  0  0
1  0  1  0
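To see why the double argsort inside get_top_n yields ranks, here is a quick illustration on a row where a single argsort would mis-mark the columns (values chosen arbitrarily):
import numpy as np

row = np.array([5, 9, 6])
order = (-row).argsort()   # [1, 2, 0]: column indices, largest value first
ranks = order.argsort()    # [2, 0, 1]: rank of each column (0 = largest)
print(ranks < 2)           # [False  True  True]: correct top-2 mask
print(order < 2)           # [ True False  True]: single argsort gets it wrong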
EDIT:
Solution with iterrows is possible, but not recommended, because it is slow:
top_n = 2
for i, row in df.iterrows():
    top = row.nlargest(top_n).index
    df.loc[i] = 0
    df.loc[i, top] = 1
print (df)
   A  B  C
0  1  1  0
1  1  1  0
For context, the dataframe consists of stock return data for the S&P500 over approximately 4 years
def get_top_n(prev_returns, top_n):
    # generate dataframe populated with zeros for merging
    top_stocks = pd.DataFrame(0, columns=prev_returns.columns, index=prev_returns.index)
    # find top_n largest entries by row
    df = prev_returns.apply(lambda x: x.nlargest(top_n), axis=1)
    # merge dataframes
    top_stocks = top_stocks.merge(df, how='right').set_index(df.index)
    # return dataframe replacing non-zero answers with a 1
    return (top_stocks.notnull()) * 1
Alternatively, the 2-line solution could be:
def get_top_n(df, top_n):
    # find top_n largest entries by stock
    df = df.apply(lambda x: x.nlargest(top_n), axis=1)
    # convert dataframe NaN/float entries to False/True, then to 0/1
    top_numbers = df.notnull().astype(int)  # np.int was removed from NumPy; use int
    return top_numbers
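For reference, a quick run on the question's frame. One caveat (an observation, not part of the original answer): apply drops any column that never appears in a top-n result, so reindexing against the original columns keeps the output shape stable:
df = pd.DataFrame({'A': [9, 3], 'B': [6, 7], 'C': [5, 2]})
out = get_top_n(df, 1).reindex(columns=df.columns, fill_value=0)
print(out)
#    A  B  C
# 0  1  0  0
# 1  0  1  0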