Numpy Conditional Max of Range

Numpy Conditional Max of Range - python

I'm trying to make a version of my program faster using as much Pandas and Numpy as possible. I am new to Numpy but have been grasping most of it, but I am having trouble with conditional formatting a column with the max of a range. This is the code I am trying to use to achieve this:
x=3
df1['Max']=numpy.where(df1.index>=x,max(df1.High[-x:],0))
Basically, I am trying to conditionally put the maximum value over the last 3 entries into a cell and repeat down the column. Any and all help is appreciated.

Use Scipy's maximum_filter -
from scipy.ndimage.filters import maximum_filter1d
df['max'] = maximum_filter1d(df.High,size=3,origin=1,mode='nearest')
Basically, maximum_filter operates in a sliding window looking for maximum in that window. Now, by default each such max computation would be performed with window being centered at the index itself. Since, we are looking to go three elements before and ending at the current one, we need to change that centeredness with the parameter origin. Therefore, we have it set at 1.
Sample run -
In [21]: df
Out[21]:
High max
0 13 13
1 77 77
2 16 77
3 30 77
4 25 30
5 98 98
6 79 98
7 58 98
8 51 79
9 23 58
Runtime test
Got me interested to see how this Scipy's sliding max operation performs against Pandas's rolling max method on performance. Here's some results on big datasizes -
In [55]: df = pd.DataFrame(np.random.randint(0,99,(10000)),columns=['High'])
In [56]: %%timeit # #Merlin's rolling based solution :
...: df['max'] = df.High.rolling(window=3, min_periods=1).max()
...:
1000 loops, best of 3: 1.35 ms per loop
In [57]: %%timeit # Using Scipy's max filter :
...: df['max1'] = maximum_filter1d(df.High,size=3,\
...: origin=1,mode='nearest')
...:
1000 loops, best of 3: 487 µs per loop

Here is the logic on np.where
numpy.where('test something,if true ,if false)
I think you need below.
dd= {'to': [100, 200, 300, 400, -500, 600, 700,800, 900, 1000]}
df = pd.DataFrame(dd)
df
to
0 100
1 200
2 300
3 400
4 -500
5 600
6 700
7 800
8 900
9 1000
df['Max'] = df.rolling(window=3, min_periods=1).max()
to Max
0 100 100.0
1 200 200.0
2 300 300.0
3 400 400.0
4 -500 400.0
5 600 600.0
6 700 700.0
7 800 800.0
8 900 900.0
9 1000 1000.0

Related

calculate new column values based on conditions in pandas

I have columns in the pandas dataframe df_profit:
profit_date profit
0 01.04 70
1 02.04 80
2 03.04 80
3 04.04 100
4 05.04 120
5 06.04 120
6 07.04 120
7 08.04 130
8 09.04 140
9 10.04 140
And I have the second dataframe df_deals:
deals_date
0 03.04
1 05.04
2 06.04
I want to create a new column 'delta' in the df_profit and let it be equal to delta between current value and previous value in 'profit' column. But I want the delta to be calculated only after the first date in the 'profit_date' is equal to the date in the column 'deal_date' of df_deals dataframe and previous value in the delta calculation to be always the same and equal to the value when the first date in 'profit_date' was equal to the first date in 'deals_date'.
So, the result would look like:
profit_date profit delta
0 01.04 70
1 02.04 80
2 03.04 80
3 04.04 100 20
4 05.04 120 40
5 06.04 120 40
6 07.04 120 40
7 08.04 130 50
8 09.04 140 60
9 10.04 140 60

For the next time you should provide better data to make it easier to help (dataframe creation so that we can copy paste your code).
I think this codes does what you want:
import pandas as pd
df_profit = pd.DataFrame(columns=["profit_date", "profit"],
data=[
["01.04", 70],
["02.04", 80],
["03.04", 80],
["04.04", 100],
["05.04", 120],
["06.04", 120],
["07.04", 120],
["08.04", 130],
["09.04", 140],
["10.04", 140]])
df_deals = pd.DataFrame(columns=["deals_date"], data=["03.04", "05.04", "06.04"])
# combine both dataframes, based on date columns
df = df_profit.merge(right=df_deals, left_on="profit_date", right_on="deals_date", how="left")
# find the first value (first row with deals date) and set it to 'base'
df["base"] = df.loc[df["deals_date"].first_valid_index()]["profit"]
# calculate delta
df["delta"] = df["profit"] - df["base"]
# Remove unused values
df.loc[:df["deals_date"].first_valid_index(), "delta"] = None
# remove temporary cols
df.drop(columns=["base", "deals_date"], inplace=True)
print(df)
output is:
profit_date profit delta
0 01.04 70 NaN
1 02.04 80 NaN
2 03.04 80 NaN
3 04.04 100 20.0
4 05.04 120 40.0
5 06.04 120 40.0
6 07.04 120 40.0
7 08.04 130 50.0
8 09.04 140 60.0
9 10.04 140 60.0

You can try this one for don't get NaN values
start_profit = df_profit.loc[(df_profit["profit_date"] == df_deals.iloc[0][0])]
start_profit = start_profit.iloc[0][1]
for i in range(len(df_profit)):
if int(str(df_profit.iloc[i][0]).split(".")[0]) > 3 and int(str(df_profit.iloc[i][0]).split(".")[1]) >= 4:
df_profit.loc[i,"delta"] = df_profit.iloc[i][1]-start_profit
Hope it helps

group by pandas dataframe and select maximun value within sequence

I have a pandas dataframe that represents elevation differences between points every 10 degrees for several target Turbines. I have selected the elevation differences that follow a criteria and I have added a column that represents if they are consecutive or not (metDegDiff = 10 represents consecutive points).
How can I select the maximum value of elevDif by targTurb in 3 or more consecutive 10 degree points?
ridgeDF2 = pd.DataFrame(data = {
'MetID':['A06_40','A06_50','A06_60','A06_70','A06_80','A06_100','A06_110','A06_140','A07_110','A07_130','A07_140','A08_100','A08_110','A08_120','A08_130','A08_220'],
'targTurb':['A06','A06','A06','A06','A06','A06','A06','A06','A07','A07','A07','A08','A08','A08','A08','A08'],
'metDeg':[30,50,60,70,80,100,110,140,110,130,140,100,110,120,130,220],
'elevDif':[1.433234, 1.602997,3.227997,2.002991,2.414001,2.96402,1.513,1.793976,1.612,2.429993,1.639008,1.500977,3.048004,2.174011,1.813995,1.527008],
'metDegDiff':[20,10,10,10,10,20,10,30,-30,20,10,-40,10,10,10,30]})
[Dbg]>>> ridgeDF2
MetID targTurb metDeg elevDif metDegDiff
0 A06_40 A06 30 1.433234 20
1 A06_50 A06 50 1.602997 10
2 A06_60 A06 60 3.227997 10
3 A06_70 A06 70 2.002991 10
4 A06_80 A06 80 2.414001 10
5 A06_100 A06 100 2.964020 20
6 A06_110 A06 110 1.513000 10
7 A06_140 A06 140 1.793976 30
8 A07_110 A07 110 1.612000 -30
9 A07_130 A07 130 2.429993 20
10 A07_140 A07 140 1.639008 10
11 A08_100 A08 100 1.500977 -40
12 A08_110 A08 110 3.048004 10
13 A08_120 A08 120 2.174011 10
14 A08_130 A08 130 1.813995 10
15 A08_220 A08 220 1.527008 30
In the example, for A06 there are 4 rows that have consecutive 10 metDeg values (rows 1, 2, 3, and 4) and for A8 there are 3 rows (rows 12, 13 and 14). Note that those 2 series have a length of 3 or more.
So, the output would be the maximum elevDif inside those two selected series. Like this:
MetID targTurb metDeg elevDif metDegDiff
A06_60 A06 60 3.227997 10
A08_110 A08 110 3.048004 10

The code below should work. You can run each line separately to see what is happening.
ridgeDF2['t/f'] = ridgeDF2['metDegDiff'] != 10
ridgeDF2['t/f'] = ridgeDF2['t/f'].shift().fillna(0).cumsum()
ridgeDF2['count'] = ridgeDF2.groupby('t/f')['t/f'].transform(len)-1
ridgeDF2['count'] = np.where(ridgeDF2['count'] >= 3,True,False)
ridgeDF2.loc[ridgeDF2['metDegDiff'] != 10,'count'] = False
highest = ridgeDF2.loc[ridgeDF2['count'] == True]
highest = highest.loc[highest.groupby(['targTurb','metDegDiff','t/f'])['elevDif'].idxmax()]
highest.drop(columns = ['t/f','count'])

Chained solution
ridgeDF2.loc[ridgeDF2[((ridgeDF2.assign(group=(ridgeDF2.metDegDiff!=10).cumsum())).groupby('group')['metDegDiff'].transform(lambda x: (x==10)& (x.count()>=3)))].groupby('targTurb')['elevDif'].idxmax()]
Step by step solution
.cumsum() metDegDiff to create groups where the first element is not 10.
ridgeDF2=ridgeDF2.assign(group=(ridgeDF2.metDegDiff!=10).cumsum())
Apply multiple filter to get rid of metDegDiff not equal to 10 in groups generated above and to retain groups where count of consecutive values=10 is equal or more than 3. I chain groupby() ,.transform() and boolean selection to a achieve this
g=ridgeDF2[ridgeDF2.groupby('group')['metDegDiff'].transform(lambda x: (x==10)& (x.count()>=3))]
From what remains above, select indexes with maximum values
g.loc[g.groupby('targTurb')['elevDif'].idxmax()]
Outcome
MetID targTurb metDeg elevDif metDegDiff
2 A06_60 A06 60 3.227997 10
12 A08_110 A08 110 3.048004 10
Timing
%timeit ridgeDF2.loc[ridgeDF2[((ridgeDF2.assign(group=(ridgeDF2.metDegDiff!=10).cumsum())).groupby('group')['metDegDiff'].transform(lambda x: (x==10)& (x.count()>=3)))].groupby('targTurb')['elevDif'].idxmax()]
9.01 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

What you can do is to create a group column for the same consecutive value in metDegDiff and same targTurb, using shift and cumsum. Then you can use this group column to select where the group has more or equal (ge) 3 values obtained by map the group number with the value_counts of the group number, and where the value in metDegDiff is equal (eq) to 10. Now that you have only the groups of interest, you can sort_values on elevDif and drop_duplicates on the group column to keep the maximum value per group. You end with drop the column gr and sort_values per targTurb if necessary.
ridgeDF2['metDegDiff'] = ridgeDF2['metDeg'].diff() #I assume calculated this way
#create a group number with same consecutive values and same targTurb
ridgeDF2['gr'] = (ridgeDF2['metDegDiff'].ne(ridgeDF2['metDegDiff'].shift())
|(ridgeDF2['targTurb'].ne(ridgeDF2['targTurb'].shift()))
).cumsum()
#get the result dataframe
res_ = (ridgeDF2.loc[ridgeDF2['metDegDiff'].eq(10) #row with 10 in metDegDiff
&ridgeDF2['gr'].map(ridgeDF2['gr'].value_counts()).ge(3)] #and row with group of greater equal 3 values
.sort_values(by='elevDif') # ascending sort of the elevDif
.drop_duplicates('gr', keep='last') #keep the last row pergroup having higher number
.drop('gr', axis=1) #remove the extra group column
.sort_values('targTurb') #if you need
)
and you get the rows you want
print (res_)
MetID targTurb metDeg elevDif metDegDiff
2 A06_60 A06 60 3.227997 10.0
12 A08_110 A08 110 3.048004 10.0

How to assign a value from a range in a list?

I want to assign a value to a list from another list with values. I'm trying to find a in a list where does it belong in a range from another list
I tried .merge but it didn't work, I tried a for loop to go through all the list but I was not able to connect all the pieces.
I have two list and i want to do the 3rd table
import numpy as np
import pandas as pd
s = pd.Series([0,1001,2501])
t = pd.Series([1000,2500,4000])
u=pd.Series([6.5,8.5,10])
df = pd.DataFrame(s,columns = ["LRange"])
df["uRange"] =t
df["Cost"]=u
print (df)
p=pd.Series([550,1240,2530,230])
dp=pd.DataFrame(p,columns = ["Power"])
print (dp)
LRange uRange Cost
0 0 1000 6.5
1 1001 2500 8.5
2 2501 4000 10
Power
1 550
2 1240
3 2530
4 230
I want my result to be:
Power Cost p/kW
1 550 6.5
2 1240 8.5
3 2530 10.0
4 230 6.5

Reset Cumulative sum base on condition Pandas

I have a data frame like:
customer spend hurdle
A 20 50
A 31 50
A 20 50
B 50 100
B 51 100
B 30 100
I want to calculate additional column for Cumulative which will reset base on the same customer when the Cumulative sum greater or equal to the hurdle like following :
customer spend hurdle Cumulative
A 20 50 20
A 31 50 51
A 20 50 20
B 50 100 50
B 51 100 101
B 30 100 30
I used the cumsum and groupby in pandas to but I do not know how to reset it base on the condition.
Following are the code I am currently using:
df1['cum_sum'] = df1.groupby(['customer'])['spend'].apply(lambda x: x.cumsum())
which I know it is just a normal cumulative sum. I very appreciate for your help.

There could be faster, efficient way. Here's one inefficient apply way to do would be.
In [3270]: def custcum(x):
...: total = 0
...: for i, v in x.iterrows():
...: total += v.spend
...: x.loc[i, 'cum'] = total
...: if total >= v.hurdle:
...: total = 0
...: return x
...:
In [3271]: df.groupby('customer').apply(custcum)
Out[3271]:
customer spend hurdle cum
0 A 20 50 20.0
1 A 31 50 51.0
2 A 20 50 20.0
3 B 50 100 50.0
4 B 51 100 101.0
5 B 30 100 30.0
You may consider using cython or numba to speed up the custcum
[Update]
Improved version of Ido s answer.
In [3276]: s = df.groupby('customer').spend.cumsum()
In [3277]: np.where(s > df.hurdle.shift(-1), s, df.spend)
Out[3277]: array([ 20, 51, 20, 50, 101, 30], dtype=int64)

One way would be the below code. But it's a really inefficient and inelegant one-liner.
df1.groupby('customer').apply(lambda x: (x['spend'].cumsum() *(x['spend'].cumsum() > x['hurdle']).astype(int).shift(-1)).fillna(x['spend']))

Pandas: Iterate over rows and find frequency of occurances

I have a dataframe with 2 columns and 3000 rows.
First column is representing time in time-steps. For example first row is 0, second is 1, ..., last one is 2999.
Second column is representing pressure. The pressure changes as we iterate over the rows, but shows a repetitive behaviour. So every few steps we see that it goes to its minimum value (which is 375), then goes up again, then again at 375 etc.
What I want to do in Python, is to iterate over the rows and see:
1) at which time-steps we see pressure is at its minimum
2)Find the frequency between the minimum values.
import numpy as np
import pandas as pd
import numpy.random as rnd
import scipy.linalg as lin
from matplotlib.pylab import *
import re
from pylab import *
import datetime
df = pd.read_csv('test.csv')
row = next(df.iterrows())[0]
dataset = np.loadtxt(df, delimiter=";")
df.columns = ["Timestamp", "Pressure"]
print(df[[0, 1]])

You don't need to iterate row-wise, you can compare the entire column against the min value to mask it, you can then use the mask to find the timestep diff:
Data setup:
In [44]:
df = pd.DataFrame({'timestep':np.arange(20), 'value':np.random.randint(375, 400, 20)})
df
Out[44]:
timestep value
0 0 395
1 1 377
2 2 392
3 3 396
4 4 377
5 5 379
6 6 384
7 7 396
8 8 380
9 9 392
10 10 395
11 11 393
12 12 390
13 13 393
14 14 397
15 15 396
16 16 393
17 17 379
18 18 396
19 19 390
mask the df by comparing the column against the min value:
In [45]:
df[df['value']==df['value'].min()]
Out[45]:
timestep value
1 1 377
4 4 377
We can use the mask with loc to find the corresponding 'timestep' value and use diff to find the interval differences:
In [48]:
df.loc[df['value']==df['value'].min(),'timestep'].diff()
Out[48]:
1 NaN
4 3.0
Name: timestep, dtype: float64
You can divide the above by 1/60 to find frequency wrt to 1 minute or whatever frequency unit you desire

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Numpy Conditional Max of Range - python

Related

calculate new column values based on conditions in pandas

group by pandas dataframe and select maximun value within sequence

How to assign a value from a range in a list?

Reset Cumulative sum base on condition Pandas

Pandas: Iterate over rows and find frequency of occurances

Categories

Resources