I have two original dataframes.
One contains limits: df_limits
feat_1 feat_2 feat_3
target 12 9 90
UL 15 10 120
LL 9 8 60
where target is the ideal value,
UL is the upper limit,
LL is the lower limit.
And the other one contains the original data: df_to_check
ID feat_1 feat_2 feat_3
123 12.5 9.6 100
456 18 3 100
789 9 11 100
I'm writing a function whose desired output is the ID and the features that are below or above the threshold (the limits from the first dataframe). So far I'm able to recognise which features are out of limits, but I'm getting the full output of the original dataframe...
def table(df_limits, df_to_check, column):
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    if UL_index is not None:
        above_limit = {'ID': df_to_check['ID'],
                       'column': df_to_check[column],
                       'target': df_limits[column].loc['target']}
        return pd.DataFrame(above_limit)
What should I change so that my desired output would be as below
(showing only the ID and the column where observations are out of limits)?
Ideally it would also show by how many percent the value deviates from the ideal value target (I would be glad for advice on how to add such a column):
ID column target value deviate(%)
456 feat_1 12 18 50
456 feat_2 9 3 ...
789 feat_2 9 11 ...
Right now the function returns the whole dataset because the statement says if UL_index is not None: and the index is never None, so the condition is always true. I understand why I have this issue, but I don't know how to change it; the problem is that statement, and I'm looking for a way to replace it.
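For what it's worth, here is a rough sketch of how one could sidestep that check entirely (my addition, not taken from the answers below): an Index is never None, only possibly empty, so select the offending rows with a boolean mask instead. I am also assuming the UL/LL rows are absolute limits rather than offsets from the target, since that matches the desired output above.

import pandas as pd

def out_of_limits(df_limits, df_to_check, column):
    # illustrative sketch only: flag rows outside [LL, UL] for one column
    target = df_limits[column].loc['target']
    UL = df_limits[column].loc['UL']   # assumed absolute upper limit
    LL = df_limits[column].loc['LL']   # assumed absolute lower limit
    mask = (df_to_check[column] > UL) | (df_to_check[column] < LL)
    return pd.DataFrame({
        'ID': df_to_check.loc[mask, 'ID'],
        'column': column,
        'target': target,
        'value': df_to_check.loc[mask, column],
        'deviate(%)': (df_to_check.loc[mask, column] - target) / target * 100,
    })

Calling this once per feature and concatenating the results would give the ID / column / target / value / deviate(%) layout asked for; the answers below show more general approaches.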
Approach: reshape, merge, calculate.
new_df = (df_to_check.set_index("ID").unstack().reset_index()
          .rename(columns={"level_0": "column", 0: "value"})
          .merge(df_limits.T, left_on="column", right_index=True)
          .assign(deviate=lambda dfa: (dfa.value - dfa.target) / dfa.target)
          )
column   ID   value  target  UL   LL   deviate
feat_1   123  12.5   12      15   9    0.0416667
feat_1   456  18     12      15   9    0.5
feat_1   789  9      12      15   9    -0.25
feat_2   123  9.6    9       10   8    0.0666667
feat_2   456  3      9       10   8    -0.666667
feat_2   789  11     9       10   8    0.222222
feat_3   123  100    90      120  60   0.111111
feat_3   456  100    90      120  60   0.111111
feat_3   789  100    90      120  60   0.111111
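To end up with only the out-of-limit rows (which is what the question asked for), a possible follow-up on new_df could look like the sketch below; this is my addition, and it assumes the UL/LL columns are absolute bounds, as in the table above.

# keep only rows whose value falls outside [LL, UL]
out_of_limits = new_df.query("value > UL or value < LL").copy()
# express the deviation as a percentage of the target
out_of_limits['deviate_pct'] = out_of_limits['deviate'] * 100
print(out_of_limits[['ID', 'column', 'target', 'value', 'deviate_pct']])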
First of all, you have not provided a reproducible example (https://stackoverflow.com/help/minimal-reproducible-example), because you have not shared the code that produces the two initial dataframes. Please keep that in mind the next time you ask a question. Without it, I made a toy example with my own (random) data.
I start by unpivoting what you call dataframe_to_check: that's because, if you want to check each feature independently, then that dataframe is not normalised (you might want to look up what database normalisation means).
The next step is a left outer join between the unpivoted dataframe you want to check and the (transposed) dataframe with the limits.
Once you have that, you can easily calculate whether a row is within the range, the deviation between value and target, etc., and you can of course group this however you want.
My code is below. It should be easy enough to customise it to your case.
import pandas as pd
import numpy as np

df_limits = pd.DataFrame(index=['min val', 'max val', 'target'])
df_limits['a'] = [2, 4, 3]
df_limits['b'] = [3, 5, 4.5]

df = pd.DataFrame(columns=df_limits.columns, data=np.random.rand(100, 2) * 6)

df_unpiv = pd.melt(df.reset_index().rename(columns={'index': 'id'}),
                   id_vars='id', var_name='feature', value_name='value')

# I reset the index because I couldn't get a join on a column and index,
# but there is probably a better way to do it
df_joined = pd.merge(df_unpiv,
                     df_limits.transpose().reset_index().rename(columns={'index': 'feature'}),
                     how='left', on='feature')

df_joined['abs diff from target'] = abs(df_joined['value'] - df_joined['target'])
df_joined['outside range'] = (df_joined['value'] < df_joined['min val']) | (df_joined['value'] > df_joined['max val'])

df_outside_range = df_joined.query("`outside range` == True")
df_inside_range = df_joined.query("`outside range` == False")
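Building on df_joined / df_outside_range above, one possible grouping (my addition, just to illustrate the point about grouping however you want):

# count and average absolute deviation of the out-of-range observations, per feature
summary = (df_outside_range
           .groupby('feature')['abs diff from target']
           .agg(['count', 'mean']))
print(summary)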
I solved my issue, maybe in a bit of a clumsy way, but it works as desired...
If someone has a better answer I will still appreciate it.
Here is an example of how to get only the observations above the limits; to get both, just concatenate the observations from UL_index and LL_index (see the sketch after the function below):
def table(df_limits, df_to_check, column):
    above_limit = []
    df_above_limit = pd.DataFrame()
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    df_to_check_UL = df_to_check.loc[UL_index]
    df_to_check_LL = df_to_check.loc[LL_index]
    above_limit = {
        'ID': df_to_check_UL['ID'],
        'feature value': df_to_check[column],
        'target': df_limits[column].loc['target']
    }
    df_above_limit = pd.DataFrame(above_limit, index=df_to_check_UL.index)
    return df_above_limit
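A rough sketch of the concatenation mentioned above (my addition, reusing the same variable names): inside the function you could build the below-limit frame the same way and return both parts stacked.

    # build the below-limit frame analogously and return both parts stacked
    below_limit = {
        'ID': df_to_check_LL['ID'],
        'feature value': df_to_check[column],
        'target': df_limits[column].loc['target']
    }
    df_below_limit = pd.DataFrame(below_limit, index=df_to_check_LL.index)
    return pd.concat([df_above_limit, df_below_limit])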
I have been struggling with an optimization problem with Pandas.
I had developed a script to apply a computation on every line of a relatively small DataFrame (a few thousand lines, a few dozen columns).
I relied heavily on the apply() function, which was obviously a poor choice in most cases.
After a round of optimization, only one method still takes a long time, and I haven't found an easy solution for it.
Basically, my dataframe contains video viewing statistics with the number of people who watched the video up to every quartile (how many watched 0%, 25%, 50%, etc.), such as:
video_name  video_length  video_0  video_25  video_50  video_75  video_100
video_1     6             1000     500       300       250       5
video_2     30            1000     500       300       250       5
I am trying to interpolate the statistics to be able to answer "how many people would have watched each quartile of the video if it lasted X seconds"
Right now my function takes the dataframe and a "new_length" parameter, and calls apply() on each line.
The function which handles each line computes the time marks for each quartile (so 0, 7.5, 15, 22.5 and 30 for the 30s video), and time marks for each quartile given the new length (so to reduce the 30s video to 6s, the new time marks would be 0, 1.5, 3, 4.5 and 6).
I build a dataframe containing the time marks as index, and the stats as values in the first column:
index (time marks)  view_stats
0                   1000
7.5                 500
15                  300
22.5                250
30                  5
1.5                 NaN
3                   NaN
4.5                 NaN
I then call DataFrame.interpolate(method="index") to fill the NaN values.
It works and gives me the result I expect, but it takes a whopping 11s for a 3k-line dataframe, and I believe that has to do with the use of apply() combined with the creation of a new dataframe to interpolate the data for each line.
Is there an obvious way to achieve the same result "in place", e.g. by avoiding the apply / new-dataframe approach and working directly on the original dataframe?
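For reference, here is a minimal standalone sketch of the per-row step described above, with my own illustrative numbers for a 30s video cut to 6s (this is not the asker's actual code, which is shown further down):

import numpy as np
import pandas as pd

# old quartile time marks and view counts for a single 30s video
old_marks = [0.0, 7.5, 15.0, 22.5, 30.0]
stats = [1000, 500, 300, 250, 5]
# new quartile time marks if the video lasted 6s
new_marks = [0.0, 1.5, 3.0, 4.5, 6.0]

marks = pd.DataFrame({'view_stats': stats}, index=old_marks)
marks = marks.combine_first(pd.DataFrame({'view_stats': np.nan}, index=new_marks))
marks = marks.interpolate(method='index')   # linear interpolation along the index values
print(marks.loc[new_marks, 'view_stats'])   # 1000, 900, 800, 700, 600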
EDIT: The expected output when calling the function with 6 as the new length parameter would be :
video_name  video_length  video_0  video_25  video_50  video_75  video_100  new_video_0  new_video_25  new_video_50  new_video_75  new_video_100
video_1     6             1000     500       300       250       5          1000         500           300           250           5
video_2     6             1000     500       300       250       5          1000         900           800           700           600
The first line would be untouched because the video is already 6s long.
In the second line, the video would be cut from 30s to 6s so the new quartiles would be at 0, 1.5, 3, 4.5, 6s and the stats would be interpolated between 1000 and 500, which were the values at the old 0% and 25% time marks
EDIT2: I do not care if I need to add temporary columns, time is an issue, memory is not.
As a reference, this is my code :
import math

import pandas
from numpy import nan as NaN


def get_value(marks, asset, mark_index) -> int:
    value = marks["count"][asset["new_length_marks"][mark_index]]
    if isinstance(value, pandas.Series):
        res = value.iloc[0]
    else:
        res = value
    return math.ceil(res)


def length_update_row(row, assets, **kwargs):
    asset_name = row["asset_name"]
    asset = assets[asset_name]
    # assets is a dict containing the list of files and the old and "new" video marks,
    # pre-calculated
    marks = pandas.DataFrame(data=[int(row["video_start"]), int(row["video_25"]), int(row["video_50"]),
                                   int(row["video_75"]), int(row["video_completed"])],
                             columns=["count"],
                             index=asset["old_length_marks"])
    marks = marks.combine_first(pandas.DataFrame(data=NaN, columns=["count"], index=asset["new_length_marks"][1:]))
    marks = marks.interpolate(method="index")
    row["video_25"] = get_value(marks, asset, 1)
    row["video_50"] = get_value(marks, asset, 2)
    row["video_75"] = get_value(marks, asset, 3)
    row["video_completed"] = get_value(marks, asset, 4)
    return row


def length_update_stats(report: pandas.DataFrame,
                        assets: dict) -> pandas.DataFrame:
    new_report = report.apply(lambda row: length_update_row(row, assets), axis=1)
    return new_report
IIUC, you could use np.interp:
import numpy as np
import pandas as pd

# get the old x values (quartile time marks for each video)
xs = df['video_length'].values[:, None] * [0, 0.25, 0.50, 0.75, 1]
# the corresponding y values
ys = df.iloc[:, 2:].values
# note that 6 is the new value (and 2 is the number of rows in df)
nxs = np.repeat(np.array(6), 2)[:, None] * [0, 0.25, 0.50, 0.75, 1]
res = pd.DataFrame(data=np.array([np.interp(nxi, xi, yi) for nxi, xi, yi in zip(nxs, xs, ys)]),
                   columns="new_" + df.columns[2:])
print(res)
Output
new_video_0 new_video_25 new_video_50 new_video_75 new_video_100
0 1000.0 500.0 300.0 250.0 5.0
1 1000.0 900.0 800.0 700.0 600.0
And then concat across the second axis:
output = pd.concat((df, res), axis=1)
print(output)
Output (concat)
video_name video_length video_0 ... new_video_50 new_video_75 new_video_100
0 video_1 6 1000 ... 300.0 250.0 5.0
1 video_2 30 1000 ... 800.0 700.0 600.0
[2 rows x 12 columns]
I have a dataset with date, engine, energy and max power columns. Let's say the dataset is composed of 2 machines with a depth of one month. Each machine has a maximum power (say 100 for simplicity) and 3 operating states: nominal power between Pmax and 80% of Pmax, reduced load between 80% and 20% of Pmax, and stop below 20% of Pmax (below 20% we consider that the machine is stopped).
The idea is to know, per period and machine, the number of times the machine operated in the 2nd interval (between 80% and 20% of Pmax). If a machine drops through that interval on its way to a stop it should not be counted, and if it passes through it on its way back from a stop it should not be counted either.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy.ma.extras import _ezclump as ez
data = {'date': ['01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020',
'01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020'],
'engine': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
'b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b',],
'energy': [100,100,100,100,100,80,80,60,60,60,60,60,90,100,100,50,50,40,20,0,0,0,20,50,60,100,100,50,50,50,50,
50,50,100,100,100,80,80,60,60,60,60,60,0,0,0,50,50,100,90,50,50,50,50,50,60,100,100,50,50,100,100],
'pmax': [100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,
100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100]
}
df = pd.DataFrame(data, columns = ['date', 'engine', 'energy', 'pmax'])
df['date'] = df['date'].astype('datetime64[ns]')
df = df.set_index('date')
df['inter'] = df['energy'].apply(lambda x: 2 if x >= 80 else (1 if x < 80 and x >= 20 else 0))

liste = []
engine_off = ez((df['inter'] == 1).to_numpy())

for i in engine_off:
    if df.iloc[(i.start) - 1, 3] == 0:
        engine_off.remove(i)
    elif df.iloc[(i.stop), 3] == 0:
        engine_off.remove(i)
    else:
        liste.append([df['engine'][i.start], df.index[i.start], df.index[i.stop], i.stop - i.start])

dfend = pd.DataFrame(liste, columns=['engine', 'begin', 'end', 'nb_heure'])
dfend['month'] = dfend['begin'].dt.month_name()
dfgroupe = dfend.set_index('begin').groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum']).fillna(1)
First I load my data into a DataFrame and classify the energy of each row into an interval (2 for nominal operation, 1 for intermediate and 0 for stop).
Then I check which rows have inter == 1, which gives me a list of slices with the start and end of each run.
Then I loop to check that the element before and after each slice is different from 0, to exclude the drops towards a stop and the returns from a stop.
Then I create a dataframe from the list, and compute the mean, sum, etc.
The problem is that my list has only 4 drops while there are 5 drops. This comes from the 4th slice (27.33).
Can someone help me?
Thank you
Here is one way to do it. I tried to use your approach with groups but ended up doing it slightly differently.
# another way to create inter, probably faster on big dataframe
df['inter'] = pd.cut(df['energy']/df['pmax'], [-1,0.2, 0.8, 1.01],
labels=[0,1,2], right=False)
# mask if inter is equal to 1 and groupby engine
gr = df['inter'].mask(df['inter'].eq(1)).groupby(df['engine'])
# create a mask to get True for the rows you want
m = (df['inter'].eq(1) # the row are 1s
& ~gr.ffill().eq(0) # the row before 1s is not 0
& ~gr.bfill().eq(0) # the row after 1s is not 0
)
#create dfend with similar shape to yours
dfend = (df.assign(date=df.index) #create a column date for the agg
.where(m) # replace the rows not interesting by nan
.groupby(['engine', #groupby per engine
m.ne(m.shift()).cumsum()]) # and per group of following 1s
.agg(begin=('date','first'), #agg date with both start date
end = ('date','last')) # and end date
)
# create the colum nb_hours (although here it seems to be nb_days)
dfend['nb_hours'] = (dfend['end'] - dfend['begin']).dt.days+1
print (dfend)
begin end nb_hours
engine inter
a 2 2020-01-08 2020-01-12 5
4 2020-01-28 2020-01-31 4
b 4 2020-01-01 2020-01-02 2
6 2020-01-20 2020-01-25 6
8 2020-01-28 2020-01-29 2
And you get the three segments for engine b as required. Then you can:
#create dfgroupe
dfgroupe = (dfend.groupby(['engine', #groupby engine
dfend['begin'].dt.month_name()]) #and month name
.agg(['mean','max','min','std','count','sum']) #agg
.fillna(1)
)
print (dfgroupe)
nb_hours
mean max min std count sum
engine begin
a January 4.500000 5 4 0.707107 2 9
b January 3.333333 6 2 2.309401 3 10
I am assuming the following terminology:
- 80 <= energy <= 100 ---> df['inter'] == 2, normal mode.
- 20 <= energy < 80 ---> df['inter'] == 1, intermediate mode.
- 20 > energy ---> df['inter'] == 0, stop mode.
I reckon you want to find those periods of time in which:
1) the machine is operating in intermediate mode, and
2) the period is not a transition from intermediate mode to stop mode, or from stop mode back to intermediate mode.
# df['before']: this is to compare each row of df['inter'] with the previous row
# df['after']: this is to compare each row of df['inter'] with the next row
# df['target'] == 1 is when both above mentioned conditions (conditions 1 and 2) are met.
# In the next step we mask the original df and keep those rows where conditions 1 and 2 are met, then we group by machine and month, and after that obtain the min, max, mean, and so on.
df['before'] = df['inter'].shift(periods=1, fill_value=0)
df['after'] = df['inter'].shift(periods=-1, fill_value=0)
df['target'] = np.where((df['inter'] == 1) & (np.sum(df[['inter', 'before', 'after']], axis=1) > 2), 1, 0)
df['month'] = df.index.month  # 'date' was set as the index in the question's code, so take the month from the index
mask = df['target'] == 1
df_group = df[mask].groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum'])
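As a rough usage sketch (my addition), the number of qualifying intermediate rows per engine and month can then be read from the aggregated frame:

# 'target' is the 0/1 indicator aggregated above; 'count' is one of the agg functions
print(df_group['target']['count'])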