drop rows after sum condition reached - python

I want to drop rows from my data frame after I hit some value.
example data set:
num value
1 2000
2 3000
3 2000
x = 5000 # my limiter
y = 0 # my bucket for values
# I want to do something like...
for row in df:
    if y <= x:
        y =+ df["Values"]
    elif y > x:
        df.drop(row)
        continue
The elif might not make sense, but it expresses the idea; it is the parsing I am more concerned with. I can't seem to use df["Values"] inside my embedded if statement.
I get the error:
ValueError: The truth value of a Series is ambiguous.
which is odd, because I can run this line by itself outside of the if statement.

Use boolean indexing with cumsum:
x = 5000
df = df[df['value'].cumsum() <= x]
print (df)
num value
0 1 2000
1 2 3000
Detail:
print (df['value'].cumsum())
0 2000
1 5000
2 7000
Name: value, dtype: int64
print (df['value'].cumsum() <= x)
0 True
1 True
2 False
Name: value, dtype: bool
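If you also want to keep the row on which the limit is first reached (which is what the loop in the question effectively does), a small variation is to compare the shifted cumulative sum instead; this is a sketch I am adding, not part of the original answer:
# keep rows whose cumulative sum *before* the current row is still below x,
# i.e. include the row that first crosses the limit
df = df[df['value'].cumsum().shift(fill_value=0) < x]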

You get this error message because you assign the whole column to your variable y. Instead, you want to take only the value from the value column of the current row and add it to your variable.
#print(df)
#num value
#1 2000
#2 3000
#3 2000
#4 4000
#5 1000
x = 5000
y = 0
#iterate over rows
for index, row in df.iterrows():
    if y < x:
        # add the value to y
        y += row["value"]
    elif y >= x:
        # drop the rest of the dataframe
        df = df.drop(df.index[index:])
        break
#output from print(df)
# num value
#0 1 2000
#1 2 3000
But it would be faster if you just used pandas' built-in cumsum function (see jezrael's answer for details).

Related

Pandas get position of last value based on condition for each column (efficiently)

I want to know in which row the value 1 occurs last for each column of my dataframe. Given this last row index, I want to calculate the "recency" of the occurrence. Like so:
>> df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
>> df
a b c d
0 0 1 1 0
1 0 1 0 0
2 1 1 0 0
3 0 1 0 0
4 0 1 1 0
Desired result:
>> calculate_recency_vector(df)
[3,1,1,None]
The desired result shows for each column "how many rows ago" the value 1 appeared for the last time. Eg for the column a the value 1 appears last in the 3rd-last row, hence the recency of 3 in the result vector. Any ideas how to implement this?
Edit: to avoid confusion, I changed the desired output for the last column from 0 to None. This column has no recency because the value 1 does not occur at all.
Edit II: Thanks for the great answers! I have to calculate this recency vector approx. 150k times on dataframes shaped (42,250). A more efficient solution would be much appreciated.
A loop-less solution which is faster & cleaner:
def calculate_recency_for_one_column(column: pd.Series) -> int:
    non_zero_values_of_col = column[column.astype(bool)]
    if non_zero_values_of_col.empty:
        return 0
    return len(column) - non_zero_values_of_col.index[-1]

df = pd.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda column: calculate_recency_for_one_column(column), axis=0)
a 3
b 1
c 1
d 0
dtype: int64
Sidenote: Using pd.apply() is slow (SO explanation). There exist faster solutions like using np.where or using apply(...,raw=True). See this question for details.
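As an illustration of the np.where route mentioned in the sidenote, a loop-free NumPy version could look roughly like this (a sketch; the function name is mine, not from the linked posts):
import numpy as np
import pandas as pd
def recency_vector_numpy(df: pd.DataFrame) -> pd.Series:
    mask = df.to_numpy() == 1           # where does the value 1 occur?
    # argmax over the reversed rows gives, per column, the offset of the last
    # occurrence counted from the bottom; +1 turns that offset into a recency
    recency = mask[::-1].argmax(axis=0) + 1
    has_one = mask.any(axis=0)          # columns without any 1 get no recency
    return pd.Series(np.where(has_one, recency, np.nan), index=df.columns)
df = pd.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
print(recency_vector_numpy(df))         # a 3.0, b 1.0, c 1.0, d NaN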
This
df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda x: ([df.shape[0] - i for i, v in x.items() if v == 1] or [None])[-1], axis=0)
produces the desired output as a pd.Series, with the only difference that the result is float and None is replaced by pandas NaN. You could then take the desired column.
With this example dataframe, you can define a function as follow:
def calculate_recency_vector(df: pd.DataFrame, condition: int) -> list:
    recency_vector = []
    for col in df.columns:
        last = 0
        for i, y in enumerate(df[col].to_list()):
            if y == condition:
                last = i
        recency = len(df[col].to_list()) - last
        if recency == len(df[col].to_list()):
            recency = None
        recency_vector.append(recency)
    return recency_vector
Running the function, it will return this:
calculate_recency_vector(df, 1)
[3, 1, 1, None]

Group by a column ('tenant') and get the max consecutive 1s in ('value') column

I have a df shown below:
Tenant  Value
x       1
x       1
x       0
x       1
y       1
y       0
Results:
Tenant x should be 2 and tenant y should be 1.
I am trying to get the maximum number of consecutive 1s in Value per group. If there is a 0 in between, the count starts over.
I am new to Python and not sure where to start. Thank you
You could try:
def max_ones(col):
    return col.groupby(col.diff().ne(0).cumsum()).sum().max()

result = df.groupby("Tenant").Value.agg(max_ones)
If the Value column can contain values other than 0 or 1, you should rather use:
result = df.assign(Value=df.Value.eq(1)).groupby("Tenant").Value.agg(max_ones)
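To see why the grouping key inside max_ones works, here is an illustrative walkthrough with tenant x's values (example values taken from the table above):
import pandas as pd
col = pd.Series([1, 1, 0, 1])            # Value column for tenant x
runs = col.diff().ne(0).cumsum()         # run labels: 1, 1, 2, 3
# summing within each run gives 2, 0, 1; the maximum (2) is the longest streak of 1s
print(col.groupby(runs).sum().max())     # 2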

iterating over a dataframe

I put this dataframe as an example:
import pandas as pd
df = pd.DataFrame({'country':['china','canda','usa' ], 'value':[1000, 850, 1100], 'fact':[1000,200,850]})
df.index=df['country']
df = df.drop('country', axis=1)
I want to iterate over the GDP value of each country and, within this iteration, create a new column filled with 1 or 0 depending on a condition:
for x in df['value']:
    if x > 900:
        df['answer'] = 1
    else:
        df['answer'] = 0
I would expect a column with the following values:
[1,0,1]
Because Canada has a value lower than 900.
But instead I have a column full of ones.
What is wrong?
Use np.where
df["answer"] = np.where(df["value"]> 900, 1,0)
Or
df["answer"] = (df["value"]> 900).astype(int)
Output:
value fact answer
country
china 1000 1000 1
canda 850 200 0
usa 1100 850 1
What's wrong with your code
When you do df['answer'] = 1, the expression assigns 1 to all the rows in the answer column.
So the last evaluated value is assigned to the whole column.
It can even be done without iterating over each row, using:
df['answer'] = df['value'].apply(lambda value: 1 if value > 900 else 0)
EDIT: You are assigning df['answer'] to some value. The last value assigned is 1, which is why the entire answer column ends up as 1 rather than just a particular row.
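If you do want to keep an explicit loop (slower, and the vectorized answers above are preferable), a minimal sketch that assigns per row label instead of to the whole column would be:
df['answer'] = 0                         # initialize the column
for idx, value in df['value'].items():
    if value > 900:
        df.loc[idx, 'answer'] = 1        # assign only to the current row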

vectorizing a pandas df by row with multiple conditional statements

I'm trying to avoid for loops when applying a function on a per-row basis to a pandas df. I have looked at many vectorization examples but have not come across anything that works completely. Ultimately I am trying to add a df column containing, for each row, the sum of the points awarded for the conditions that are satisfied.
I have looked at np.apply_along_axis, but that's just a hidden loop, and at np.where, but I could not see it working for the 25 conditions that I am checking.
A B C ... R S T
0 0.279610 0.307119 0.553411 ... 0.897890 0.757151 0.735718
1 0.718537 0.974766 0.040607 ... 0.470836 0.103732 0.322093
2 0.222187 0.130348 0.894208 ... 0.480049 0.348090 0.844101
3 0.834743 0.473529 0.031600 ... 0.049258 0.594022 0.562006
4 0.087919 0.044066 0.936441 ... 0.259909 0.979909 0.403292
[5 rows x 20 columns]
def point_calc(row):
    points = 0
    if row[2] >= row[13]:
        points += 1
    if row[2] < 0:
        points -= 3
    if row[4] >= row[8]:
        points += 2
    if row[4] < row[12]:
        points += 1
    if row[16] == row[18]:
        points += 4
    return points
points_list = []
for indx, row in df.iterrows():
    value = point_calc(row)
    points_list.append(value)
df['points'] = points_list
This is obviously not efficient but I am not sure how I can vectorize my code since it requires the values per row for each column in the df to get a custom summation of the conditions.
Any help in pointing me in the right direction would be much appreciated.
Thank you.
UPDATE:
I was able to get a little more speed replacing the df.iterrows section with df.apply.
df['points'] = df.apply(lambda row: point_calc(row), axis=1)
UPDATE2:
I updated the function as follows and substantially decreased the run time, with a 10x speed increase over using df.apply with the initial function.
def point_calc(row):
    a1 = np.where(row[:, 2] >= row[:, 13], 1, 0)
    a2 = np.where(row[:, 2] < 0, -3, 0)
    a3 = np.where(row[:, 4] >= row[:, 8], 2, 0)
    # etc.
    all_points = a1 + a2 + a3  # + etc.
    return all_points

df['points'] = point_calc(df.to_numpy())
What I am still working on is using np.vectorize on the function itself to see if that can be improved upon as well.
You can try it the following way:
# this is a small version of your dataframe
df = pd.DataFrame(np.random.random((10,4)), columns=list('ABCD'))
It looks like that:
A B C D
0 0.724198 0.444924 0.554168 0.368286
1 0.512431 0.633557 0.571369 0.812635
2 0.680520 0.666035 0.946170 0.652588
3 0.467660 0.277428 0.964336 0.751566
4 0.762783 0.685524 0.294148 0.515455
5 0.588832 0.276401 0.336392 0.997571
6 0.652105 0.072181 0.426501 0.755760
7 0.238815 0.620558 0.309208 0.427332
8 0.740555 0.566231 0.114300 0.353880
9 0.664978 0.711948 0.929396 0.014719
You can create a Series which counts your points and is initialized with zeros:
points = pd.Series(0, index=df.index)
It looks like that:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
Afterwards you can add and subtract values line by line if you want:
The condition within the brackets selects the rows where the condition is true, so -= and += are only applied to those rows.
points.loc[df.A < df.C] += 1
points.loc[df.B < 0] -= 3
At the end you can extract the values of the Series as a NumPy array if you want (optional):
point_list = points.values
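For example, the conditions from point_calc in the question could be expressed with this Series approach roughly as follows (a sketch assuming the question's 20-column frame, hence the positional iloc access):
points = pd.Series(0, index=df.index)
points.loc[df.iloc[:, 2] >= df.iloc[:, 13]] += 1
points.loc[df.iloc[:, 2] < 0] -= 3
points.loc[df.iloc[:, 4] >= df.iloc[:, 8]] += 2
points.loc[df.iloc[:, 4] < df.iloc[:, 12]] += 1
points.loc[df.iloc[:, 16] == df.iloc[:, 18]] += 4
df['points'] = points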
Does this solve your problem?

Using .iterrows() with series.nlargest() to get the highest number in a row in a Dataframe

I am trying to create a function that uses df.iterrows() and Series.nlargest. I want to iterate over each row and find the largest number and then mark it as a 1. This is the data frame:
A B C
9 6 5
3 7 2
Here is the output I wish to have:
A B C
1 0 0
0 1 0
This is the function I wish to use here:
def get_top_n(df, top_n):
    """
    Parameters
    ----------
    df : DataFrame
    top_n : int
        The top number to get

    Returns
    -------
    top_numbers : DataFrame
        Returns the top number marked with a 1
    """
    # Implement Function
    for row in df.iterrows():
        top_numbers = row.nlargest(top_n).sum()
    return top_numbers
I get the following error:
AttributeError: 'tuple' object has no attribute 'nlargest'
Help would be appreciated on how to rewrite my function in a neater way so that it actually works! Thanks in advance.
Add an i variable, because iterrows returns the index together with a Series for each row:
for i, row in df.iterrows():
    top_numbers = row.nlargest(top_n).sum()
General solution with numpy.argsort for ranks in descending order, then compare with top_n and convert the boolean array to integers:
def get_top_n(df, top_n):
    if top_n > len(df.columns):
        raise ValueError("Value is higher than the number of columns")
    elif not isinstance(top_n, int):
        raise ValueError("Value is not an integer")
    else:
        arr = ((-df.values).argsort(axis=1).argsort(axis=1) < top_n).astype(int)
        df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
        return df1
df1 = get_top_n(df, 2)
print (df1)
A B C
0 1 1 0
1 1 1 0
df1 = get_top_n(df, 1)
print (df1)
A B C
0 1 0 0
1 0 1 0
EDIT:
Solution with iterrows is possible, but not recommended because it is slow:
top_n = 2
for i, row in df.iterrows():
    top = row.nlargest(top_n).index
    df.loc[i] = 0
    df.loc[i, top] = 1
print (df)
A B C
0 1 1 0
1 1 1 0
For context, the dataframe consists of stock return data for the S&P500 over approximately 4 years
def get_top_n(prev_returns, top_n):
    # generate dataframe populated with zeros for merging
    top_stocks = pd.DataFrame(0, columns=prev_returns.columns, index=prev_returns.index)
    # find top_n largest entries by row
    df = prev_returns.apply(lambda x: x.nlargest(top_n), axis=1)
    # merge dataframes
    top_stocks = top_stocks.merge(df, how='right').set_index(df.index)
    # return dataframe replacing non_zero answers with a 1
    return (top_stocks.notnull()) * 1
Alternatively, the 2-line solution could be
def get_top_n(df, top_n):
    # find top_n largest entries by stock
    df = df.apply(lambda x: x.nlargest(top_n), axis=1)
    # convert dataframe NaN or float entries to True and False, then to 0 and 1
    top_numbers = (df.notnull()).astype(int)
    return top_numbers
