Compare values between rows in pandas (Python)

I am new to pandas, and I want to compare rows and only then enter another for loop:
from datetime import timedelta

for i in node:
    temp_df = df[df['NODE'] == i]
    min_time = min(temp_df['time1'])
    max_time = max(temp_df['time1'])
    while min_time <= max_time:
        print(min_time)
        count = temp_df['time1'].between(min_time, min_time + timedelta(minutes=5)).sum()
        df['No.Of_CellDown'] = count
        print(count)
        min_time = min_time + timedelta(minutes=5)
I want to add a condition that checks whether the Tech and Issue columns have the same values in the current row and the previous row, and only then execute the for loop in the code above.

Try:
df = df.assign(
    same_as_previous_row=lambda x:
        (x['Tech'] == x['Tech'].shift(1))
        & (x['Issue'] == x['Issue'].shift(1))
)
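
For illustration, this flag could then gate the per-node processing from the question. A sketch only: it assumes the same_as_previous_row column created above and the NODE/time1 columns from the original code.

# keep only rows whose Tech and Issue both match the previous row
matching = df[df['same_as_previous_row']]

for i in matching['NODE'].unique():
    temp_df = matching[matching['NODE'] == i]
    # ... run the while-loop over time1 from the original code here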

Try this:

for index, row in temp_df.iterrows():
    if index - 1 >= 0:
        if temp_df['Tech'][index - 1] == row['Tech'] and temp_df['Issue'][index - 1] == row['Issue']:
            # do your thing here
            pass
        else:
            print('different')

Related

Create a new dataframe column and generate values depending on the previous row's value of this same column

I want to make a loop that adds new columns to a dataframe.
Each time it adds a new column, I want to generate the column's values using a lambda function.
The function I wish to pass in the lambda is calcOnOff(). It has 4 parameters:
v3proba, the value of another column of this same row
on_to_off, the current value of the inner loop iterator
off_to_on, the current value of the outer loop iterator
prevOnOff, the value of this same column on the previous row
Here is my code:
import pandas as pd

# I create a simple dataframe
dataExample = {'Name': ['Karan', 'Rohit', 'Sahil', 'Aryan', 'dex'], 'v3proba': [0.23, 0.42, 0.51, 0.4, 0.7]}
dfExample = pd.DataFrame(dataExample)

# func to be applied on each new column of the dataframe
def calcOnOff(v3proba, on_to_off, off_to_on, prevOnOff):
    if prevOnOff == "OFF" and (v3proba * 100) >= off_to_on:
        return "ON"
    elif prevOnOff == "OFF" and (v3proba * 100) < off_to_on:
        return "OFF"
    elif prevOnOff == "ON" and (v3proba * 100) < on_to_off:
        return "OFF"
    elif prevOnOff == "ON" and (v3proba * 100) >= on_to_off:
        return "ON"
    else:
        return "ERROR"

# my iterators
off_to_on = 50
on_to_off = 49

# loops to generate new columns and populate col values
for off_to_on in range(50, 90):
    for on_to_off in range(10, 49):
        dfExample[str(off_to_on) + '-' + str(on_to_off)] = dfExample.apply(
            lambda row: calcOnOff(row['v3proba'], on_to_off, off_to_on,
                                  row[str(off_to_on) + '-' + str(on_to_off)].shift()),
            axis=1)

dfExample
The expected output would be a table with around 1500 columns, one per (off_to_on, on_to_off) pair.
I think the problem in my algorithm is how to handle the first row, as .shift() will look for a nonexistent row.
Any idea what I am doing wrong?
Preliminary remarks
You can't address the field before it's created, so the code row[f'{off_to_on}-{on_to_off}'].shift() won't work; you'll get a KeyError here.
I guess you want to shift down one row along the column with the expression row[...].shift(). It doesn't work like that: row[...] returns the value contained in a single cell, not the column.
It's also not clear what the previous state should be for the very first row. What is the value of the prevOnOff parameter in this case?
How to fill in the column taking into account previous calculations
Let's use generators for this purpose. They can keep the inner state, so we can reuse a previously calculated value to get the next one.
But first, I'm gonna clarify the logic of calcOnOff. As far as I can see, it returns On if proba >= threshold and Off otherwise, where threshold is on_off if previous == On, and off_on otherwise. So we can rewrite it like this:
def calcOnOff(proba, on_off, off_on, previous):
    threshold = on_off if previous == 'On' else off_on
    return 'On' if proba >= threshold else 'Off'
Next, let's transform previous to boolean and calcOnOff into a generator:
def calc_on_off(on_off, off_on, prev='Off'):
    prev = prev == 'On'
    proba = yield
    while True:
        proba = yield 'On' if (prev := proba >= (on_off if prev else off_on)) else 'Off'
Here I made the assumption that the initial state is Off (the default value of prev), and I treat the previous value as On if prev == True and as Off otherwise.
Now, I suggest using itertools.product to generate the parameter pairs on_off and off_on. For each pair we create an individual generator:
calc = calc_on_off(on_off, off_on).send
calc(None)  # push calc to the first yield
This we can apply to 100 * df['v3proba']:
proba = 100 * df['v3proba']
df[...] = proba.apply(calc)
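
As a quick illustration of the generator's statefulness (the threshold values here are made up, not from the question):

demo = calc_on_off(on_off=30, off_on=60).send
demo(None)       # prime the generator up to the first yield
print(demo(70))  # 'On'  -> previous state Off, so threshold is off_on=60 and 70 >= 60
print(demo(40))  # 'On'  -> previous state On,  so threshold is on_off=30 and 40 >= 30
print(demo(20))  # 'Off' -> previous state On,  so threshold is on_off=30 and 20 <  30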
Full code
import pandas as pd
from itertools import product

data = {
    'Name': ['Karan', 'Rohit', 'Sahil', 'Aryan', 'dex'],
    'v3proba': [0.23, 0.42, 0.51, 0.4, 0.7]
}
df = pd.DataFrame(data)

def calc_on_off(on_off, off_on, prev='Off'):
    prev = prev == 'On'
    proba = yield
    while True:
        prev = proba >= (on_off if prev else off_on)
        proba = yield 'On' if prev else 'Off'

proba = 100 * df.v3proba
on_off = range(10, 50)
off_on = range(50, 90)

for state in product(on_off, off_on):
    calc = calc_on_off(*state).send
    calc(None)
    name = '{1}-{0}'.format(*state)  # 0: on_off, 1: off_on
    df[name] = proba.apply(calc)
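
With these ranges there are 40 * 40 = 1600 parameter pairs, so the frame ends up with 1600 generated columns plus Name and v3proba (a quick check, assuming the code above ran as-is):

print(df.shape)  # expected: (5, 1602)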
Update: comparing with the provided expected result (screenshot omitted)
P.S. No Generators
What if I don't want to use generators? Then we have to somehow keep the intermediate output outside the function. Let's do it with globals:
def calc_on_off(proba):
    # get data from outside
    global prev, on_off, off_on
    threshold = on_off if (prev == 'On') else off_on
    # save data outside
    prev = 'On' if proba >= threshold else 'Off'
    return prev

default_state = 'Off'
proba = 100 * df.v3proba
r_on_off = range(10, 50)
r_off_on = range(50, 90)

for on_off, off_on in product(r_on_off, r_off_on):
    prev = default_state
    df[f'{off_on}-{on_off}'] = proba.apply(calc_on_off)

Problem with for-if loop statement operation on pandas dataframe

I have a dataset for which I want to create a new column based on a division of two other columns, using a for loop with if conditions.
The dataset (not shown here) has an empty 'solo_fare' column created beforehand.
The task is to loop through each row and divide 'Fare' by 'relatives' to get the per-passenger fare. However, there are certain conditions to follow (passengers in this category should see per-passenger fares of between 3 and 8).
The code I have tried doesn't seem to fill in the 'solo_fare' rows at all; it returns an empty column (the same as the original df).
for i in range(0, len(fare_result)):
    p = fare_result.iloc[i]['Fare'] / fare_result.iloc[i]['relatives']
    q = fare_result.iloc[i]['Fare']
    r = fare_result.iloc[i]['relatives']
    # if relatives == 0, return original Fare amount
    if r == 0:
        fare_result.iloc[i]['solo_fare'] = q
    # if the divided fare is below 3 or more than 8, return original Fare amount again
    elif (p < 3) or (p > 8):
        fare_result.iloc[i]['solo_fare'] = q
    # else, return the divided fare to get solo_fare
    else:
        fare_result.iloc[i]['solo_fare'] = p
How can I get this to work?
You should probably not use a loop for this, but instead just use loc.
If you first create the 'solo_fare' column and give every row the default value from Fare, you can then change the value for the rows that meet the conditions you have set out:
fare_result['solo_fare'] = fare_result['Fare']

fare_result.loc[
    ((fare_result.Fare / fare_result.relatives) >= 3)
    & ((fare_result.Fare / fare_result.relatives) <= 8),
    'solo_fare'] = fare_result.Fare / fare_result.relatives
Did you try to initialize that new column first?
By that I mean that the statement fare_result.iloc[i]['solo_fare'] = q
only assigns the value q to the solo_fare field of row i.
The issue is that, at this moment, row i does not have any solo_fare key. Hence, you are only filling the last value of your table here.
To solve this issue, try declaring the solo_fare column before the for loop, like:

import numpy as np

fare_result['solo_fare'] = np.nan
One way to do it is to define a row-wise function and apply it to the dataframe:
# row-wise function (mock-up)
def foo(fare, relative):
    # your logic here; mine just serves as an example
    if relative > 100:
        res = fare / relative
    elif relative < 10:
        res = fare
    else:
        res = 10
    return res
Then apply it to the dataframe (row-wise):
fare_result['solo_fare'] = fare_result.apply(lambda row: foo(row['Fare'], row['relatives']) , axis=1)
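
If you'd rather avoid apply as well, the question's actual conditions (keep the original Fare when relatives == 0 or when the divided fare falls outside 3-8, otherwise use the divided fare) can be expressed in a vectorized way. A sketch, assuming the same Fare/relatives/solo_fare column names:

import numpy as np

# divide, treating relatives == 0 as "no division possible"
divided = fare_result['Fare'] / fare_result['relatives'].replace(0, np.nan)

# keep the original fare when relatives == 0 or the per-passenger fare is outside [3, 8]
keep_original = fare_result['relatives'].eq(0) | (divided < 3) | (divided > 8)

fare_result['solo_fare'] = np.where(keep_original, fare_result['Fare'], divided)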

Remove the following rows that are above or below the current row['x'] by X amount

I am calculating correlations, and the data frame I have needs to be filtered.
I am looking to remove the rows below the current row that are above or below it by X amount, starting with the first row and looping through the dataframe all the way to the last row.
Example:
df['y'] has the values 50, 51, 52, 53, 54, 55, 70, 71, 72, 73, 74, 75.
If X = 10, it would start at 50, see 51, 52, 53, 54, 55 as within that +-10 range, and delete those rows. 70 would stay, as it is not within that range, and the same test would start again at 70, where 71, 72, 73, 74, 75 and their respective rows would be deleted.
The filter with X = 10 would thus leave us with the rows containing 50 and 70 in df.
That would leave me with a clean dataframe that drops the instances linked to the first instance of what is essentially the same observed period. I tried coding a loop to do that, but I am left with the wrong result and I'm desperate at this point. Hopefully someone can correct the mistake or point me in the right direction.
df6['index'] = df6.index
df6.sort_values('index')
boom = len(dataframe1.index) / 3

# Taking initial comparison values from first row
c = df6.iloc[0]['index']
# Including first row in result
filters = [True]
# Skipping first row in comparisons
for index, row in df6.iloc[1:].iterrows():
    if c - boom <= row['index'] <= c + boom:
        filters.append(False)
    else:
        filters.append(True)
        # Updating values to compare based on latest accepted row
        c = row['index']

df2 = df6.loc[filters].sort_values('correlation').drop('index', 1)
df2
Output before and output after: (screenshots omitted)
IIUC, your main issue is filtering consecutive values within a threshold.
You can use a custom function for that, which acts on a Series (= a column) and returns the list of valid indices:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = []
    # .items() replaces iteritems(), which was removed in pandas 2.0
    for i, val in s.items():
        if val - prev > threshold:
            idx.append(i)
            prev = val
    return idx
Example of use:
import pandas as pd

df = pd.DataFrame({'y': [50, 51, 52, 53, 54, 55, 70, 71, 72, 73, 74, 75]})
df2 = df.loc[consecutive(df['y'])]
Output:
    y
0  50
6  70
Variant
If you prefer the function to return a boolean indexer, here is a variant:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = [False] * len(s)
    for i, val in s.items():
        if val - prev > threshold:
            idx[i] = True  # assumes a default RangeIndex, so the label doubles as a position
            prev = val
    return idx
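
Usage is then plain boolean indexing; a quick sketch on the same example dataframe as above:

mask = consecutive(df['y'])  # boolean list, True for rows to keep
df2 = df[mask]               # keeps the rows with y = 50 and y = 70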

pandas calculate/show dataframe cumsum() only for positive values and other condition

How can I make this cumsum() calculate and show values in the new column only for rows where df.col_2 == 'closed' and df.col_values > 0?
df['new_col'] = df.groupby('col_1')['col_values'].cumsum()
Here is a solution (but there might be a more elegant one):
indexes = (df.col_2 == 'closed') & (df.col_values > 0)
df.loc[indexes, 'new_col'] = df.loc[indexes].groupby('col_1')['col_values'].cumsum()
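
A quick sanity check on made-up data (the column names come from the question; the values are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    'col_1':      ['a', 'a', 'a', 'b'],
    'col_2':      ['closed', 'open', 'closed', 'closed'],
    'col_values': [5, 7, -2, 3],
})

indexes = (df.col_2 == 'closed') & (df.col_values > 0)
df.loc[indexes, 'new_col'] = df.loc[indexes].groupby('col_1')['col_values'].cumsum()

print(df['new_col'].tolist())  # [5.0, nan, nan, 3.0] -- only the qualifying rows get a cumulative sum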

Is there a way to format cells in a data frame based on conditions on multiple columns?

I have a dataframe with a few columns (one boolean and one numeric). I want to apply conditional formatting using pandas styling, since I am going to output my dataframe as HTML in an email, based on the following conditions: 1. the boolean column = Y and 2. the numeric column > 0.
For example,
col1  col2
Y     15
N     0
Y     0
N     40
Y     20
In the example above, I want to highlight the first and last row since they meet those conditions.
Yes, there is a way. Use lambda expressions to apply the conditions and the dropna() function to exclude None/NaN values:
df["col2"] = df["col2"].apply(lambda x: x if x > 0 else None)
df["col1"] = df["col1"].apply(lambda x: x if x == 'Y' else None)
df = df.dropna()
I used the following and it worked:
def highlight_color(s):
    if s.Col2 > 0 and s.Col1 == "N":
        return ['background-color: red'] * 7
    else:
        return ['background-color: white'] * 7

df.style.apply(highlight_color, axis=1).render()
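
For the two-column example as stated in the question (highlight rows where col1 == 'Y' and col2 > 0), here is a sketch that sizes the style list to the actual number of columns instead of hard-coding it; the highlight colour is an arbitrary choice:

import pandas as pd

df = pd.DataFrame({'col1': ['Y', 'N', 'Y', 'N', 'Y'], 'col2': [15, 0, 0, 40, 20]})

def highlight_row(row):
    style = 'background-color: yellow' if (row['col1'] == 'Y' and row['col2'] > 0) else ''
    return [style] * len(row)

html = df.style.apply(highlight_row, axis=1).to_html()  # to_html() is the non-deprecated replacement for render()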
