Python - Count rows between intervals in a dataframe
I have a dataset with date, engine, energy and max power columns. Let's say the dataset covers 2 machines over one month. Each machine has a maximum power (say 100 for simplicity) and 3 operating states: between Pmax and 80% of Pmax (nominal power), between 80% and 20% of Pmax (drop in load), and below 20% of Pmax (below 20% we consider that the machine is stopped).
The idea is to know, per period and per machine, how many times the machine operated in the 2nd interval (between 80% and 20% of Pmax). If a machine is dropping towards a stop it should not be counted, and if it is returning from a stop it should not be counted either.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy.ma.extras import _ezclump as ez
data = {'date': ['01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020',
'01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020'],
'engine': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
'b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b',],
'energy': [100,100,100,100,100,80,80,60,60,60,60,60,90,100,100,50,50,40,20,0,0,0,20,50,60,100,100,50,50,50,50,
50,50,100,100,100,80,80,60,60,60,60,60,0,0,0,50,50,100,90,50,50,50,50,50,60,100,100,50,50,100,100],
'pmax': [100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,
100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100]
}
df = pd.DataFrame(data, columns = ['date', 'engine', 'energy', 'pmax'])
df['date'] = df['date'].astype('datetime64[ns]')
df = df.set_index('date')
df['inter'] = df['energy'].apply(lambda x: 2 if x >= 80 else (1 if x < 80 and x >= 20 else 0 ))
liste = []
engine_off = ez((df['inter'] == 1).to_numpy())
for i in engine_off:
    if df.iloc[(i.start) - 1, 3] == 0:
        engine_off.remove(i)
    elif df.iloc[(i.stop), 3] == 0:
        engine_off.remove(i)
    else:
        liste.append([df['engine'][i.start], df.index[i.start], df.index[i.stop], i.stop - i.start])
dfend = pd.DataFrame(liste, columns=['engine','begin','end','nb_heure'])
dfend['month'] = dfend['begin'].dt.month_name()
dfgroupe = dfend.set_index('begin').groupby(['engine','month']).agg(['mean','max','min','std','count','sum']).fillna(1)
First I load my data into a DataFrame and classify the energy of each row into an interval (2 for nominal operation, 1 for intermediate and 0 for stop).
Then I check which rows have inter == 1, which gives me a list of slices with the start and end of each run of 1s.
Then I loop over the slices and check that the element just before and just after each slice is different from 0, in order to exclude drops towards a stop and returns from a stop.
Finally I create a dataframe from the list and compute the mean, sum, etc.
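For readers unfamiliar with _ezclump, here is a minimal sketch of what the slice step returns, assuming the helper behaves as it is used above (one slice per contiguous run of True values):

import numpy as np
from numpy.ma.extras import _ezclump as ez

# stand-in for the boolean mask (df['inter'] == 1).to_numpy()
mask = np.array([False, True, True, True, False, False, True, False])

# one slice per contiguous run of True values
print(ez(mask))  # expected: [slice(1, 4, None), slice(6, 7, None)]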
The problem is that my list contains only 4 drops while there are 5. This comes from the 4th slice, slice(27, 33): since the slices are computed without grouping by engine, it merges the last days of engine a with the first days of engine b into a single drop.
Can someone help me?
Thank you
Here is one way to do it. I tried to use your approach with groups but ended up doing it slightly differently.
# another way to create inter, probably faster on big dataframe
df['inter'] = pd.cut(df['energy']/df['pmax'], [-1, 0.2, 0.8, 1.01],
                     labels=[0, 1, 2], right=False)
# mask if inter is equal to 1 and groupby engine
gr = df['inter'].mask(df['inter'].eq(1)).groupby(df['engine'])
# create a mask to get True for the rows you want
m = (df['inter'].eq(1)      # the rows that are 1s
     & ~gr.ffill().eq(0)    # the row before the run of 1s is not 0
     & ~gr.bfill().eq(0)    # the row after the run of 1s is not 0
    )
#create dfend with similar shape to yours
dfend = (df.assign(date=df.index)                # create a column date for the agg
           .where(m)                             # replace the rows not interesting by nan
           .groupby(['engine',                   # groupby per engine
                     m.ne(m.shift()).cumsum()])  # and per group of following 1s
           .agg(begin=('date', 'first'),         # agg date with both start date
                end=('date', 'last'))            # and end date
        )
# create the column nb_hours (although here it seems to be nb_days)
dfend['nb_hours'] = (dfend['end'] - dfend['begin']).dt.days+1
print (dfend)
                   begin        end  nb_hours
engine inter
a      2      2020-01-08 2020-01-12         5
       4      2020-01-28 2020-01-31         4
b      4      2020-01-01 2020-01-02         2
       6      2020-01-20 2020-01-25         6
       8      2020-01-28 2020-01-29         2
and you get the three segments for engine b as required. Then you can
# create dfgroupe
dfgroupe = (dfend.groupby(['engine',                          # groupby engine
                           dfend['begin'].dt.month_name()])   # and month name
                 .agg(['mean', 'max', 'min', 'std', 'count', 'sum'])  # agg
                 .fillna(1)
            )
print (dfgroupe)
               nb_hours
                   mean max min       std count sum
engine begin
a      January 4.500000   5   4  0.707107     2   9
b      January 3.333333   6   2  2.309401     3  10
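If the m.ne(m.shift()).cumsum() grouper looks opaque, here is a small standalone sketch of the trick on a toy mask (not the question's data): the cumulative sum increments every time the boolean value changes, so each contiguous run gets its own group id.

import pandas as pd

m = pd.Series([False, True, True, False, True, True, True])
group_id = m.ne(m.shift()).cumsum()
print(group_id.tolist())  # [1, 2, 2, 3, 4, 4, 4]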
I am assuming the following terminology:
- 80 <= energy <= 100 ---> df['inter'] == 2, normal mode.
- 20 <= energy < 80 ---> df['inter'] == 1, intermediate mode.
- 20 > energy ---> df['inter'] == 0, stop mode.
I reckon you want to find those periods of time in which:
1) The machine is operating in intermediate mode.
2) You don't want to count if the status is changing from intermediate to stop mode or from stop to intermediate mode.
# df['before']: this is to compare each row of df['inter'] with the previous row
# df['after']: this is to compare each row of df['inter'] with the next row
# df['target'] == 1 is when both above mentioned conditions (conditions 1 and 2) are met.
# Next we mask the original df, keep the times where conditions 1 and 2 are met, then group by machine and month, and obtain the min, max, mean, and so on.
df['before'] = df['inter'].shift(periods=1, fill_value=0)
df['after'] = df['inter'].shift(periods=-1, fill_value=0)
df['target'] = np.where((df['inter'] == 1) & (np.sum(df[['inter', 'before', 'after']], axis=1) > 2), 1, 0)
df['month'] = df.index.month  # 'date' was set as the index earlier, so take the month from the index
mask = df['target'] == 1
df_group = df[mask].groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum'])
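As a quick illustration of how the shift-based neighbour check behaves, here is a toy sketch on a made-up inter sequence (separate from the dataframe above); the intermediate rows adjacent to a stop are excluded:

import pandas as pd
import numpy as np

inter = pd.Series([2, 1, 1, 0, 0, 1, 1, 2])
before = inter.shift(1, fill_value=0)
after = inter.shift(-1, fill_value=0)
# count a row only if it is intermediate (1) and its neighbours keep the sum above 2
target = np.where((inter == 1) & (inter + before + after > 2), 1, 0)
print(target.tolist())  # [0, 1, 0, 0, 0, 0, 1, 0]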
Related
Concatenate arrays into a single table using pandas
I have a .csv file which I group by year so that it gives me the maximum, minimum and average values:

import pandas as pd

DF = pd.read_csv("PJME_hourly.csv")
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))

The output is as follows:

2002            PJME_MW
max   55934.000000
min   19247.000000
mean  31565.617106
2003            PJME_MW
max   53737.000000
min   19414.000000
mean  31698.758621
2004            PJME_MW
max   51962.000000
min   19543.000000
mean  32270.434867

I would like to know how I can join it all into a single column (PJME_MW), but with each group of operations (max, min, mean) identified by the year it corresponds to.
If you convert the dates with to_datetime(), you can group them using the dt.year accessor:

df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])

Toy example:

df = pd.DataFrame({'Datetime': ['2019-01-01', '2019-02-01', '2020-01-01', '2020-02-01', '2021-01-01'],
                   'PJME_MV': [3, 5, 30, 50, 100]})
#     Datetime  PJME_MV
# 0 2019-01-01        3
# 1 2019-02-01        5
# 2 2020-01-01       30
# 3 2020-02-01       50
# 4 2021-01-01      100

df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
#          PJME_MV
#              min  max  mean
# Datetime
# 2019           3    5     4
# 2020          30   50    40
# 2021         100  100   100
The code could be optimized, but this is how it works for now. Change this part of your code:

for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))

Use this instead:

aggs = ['max', 'min', 'mean']
df_group = df.groupby('Datetime')['PJME_MW'].agg(aggs).reset_index()
out_columns = ['agg_year', 'PJME_MW']
out = []
aux = pd.DataFrame(columns=out_columns)
for agg in aggs:
    aux['agg_year'] = agg + '_' + df_group['Datetime']
    aux['PJME_MW'] = df_group[agg]
    out.append(aux)
df_out = pd.concat(out)

Edit: the concatenation form has been changed.

Final edit: I didn't understand the whole problem, sorry. You don't need the code after the groupby function.
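For what it's worth, a loop-free sketch of the same reshaping (assuming the same PJME_hourly.csv file and columns as in the question, with Datetime parsed): stack() puts all three statistics into a single PJME_MW column keyed by year and statistic name.

import pandas as pd

df = pd.read_csv('PJME_hourly.csv', parse_dates=['Datetime'])

# one row per (year, statistic), single PJME_MW column
out = (df.groupby(df['Datetime'].dt.year)['PJME_MW']
         .agg(['max', 'min', 'mean'])
         .stack()
         .rename('PJME_MW'))
print(out.head())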
Python/Pandas For Loop Time Series
I am working with panel time-series data and am struggling to create a fast for loop that sums up the past 50 numbers at the current i. The data is about 600k rows, and it starts to churn around 30k. Is there a way to use pandas or NumPy to do the same in a fraction of the time? The Change column is of type float, with 4 decimals.

Index    Change
0        0.0410
1        0.0000
2        0.1201
...         ...
74327    0.0000
74328    0.0231
74329    0.0109
74330    0.0462

SEQ_LEN = 50
for i in range(SEQ_LEN, len(df)):
    df.at[i, 'Change_Sum'] = sum(df['Change'][i-SEQ_LEN:i])

Any help would be highly appreciated! Thank you!
I tried this with 600k rows and the average time was 20.9 ms ± 1.35 ms. This will return a series with the rolling sum of the last 50 Change values in the df:

df['Change'].rolling(50).sum()

You can add it to a new column like so:

df['change50'] = df['Change'].rolling(50).sum()
Disclaimer: this solution cannot compete with .rolling(). Plus, for a .groupby() case, just do df.groupby("group")["Change"].rolling(50).sum() and then reset the index. Therefore please accept the other answer.

The explicit for loop can be avoided by translating your recursive partial sum into a difference of cumulative sums (cumsum). The formula:

Sum[x-50:x] = Sum[:x] - Sum[:x-50] = Cumsum[x] - Cumsum[x-50]

Code

For showcase purposes, I have shortened len(df["Change"]) to 10 and SEQ_LEN to 5. A million records completed almost immediately this way.

import pandas as pd
import numpy as np

# data
SEQ_LEN = 5
np.random.seed(111)  # reproducibility
df = pd.DataFrame(
    data={
        "Change": np.random.normal(0, 1, 10)  # a million rows
    }
)

# step 1. Do cumsum
df["Change_Cumsum"] = df["Change"].cumsum()

# Step 2. calculate diff of cumsum: Sum[x-50:x] = Sum[:x] - Sum[:x-50]
df["Change_Sum"] = np.nan  # or zero as you wish
df.loc[SEQ_LEN:, "Change_Sum"] = df["Change_Cumsum"].values[SEQ_LEN:] - df["Change_Cumsum"].values[:(-SEQ_LEN)]
# add idx=SEQ_LEN-1
df.at[SEQ_LEN-1, "Change_Sum"] = df.at[SEQ_LEN-1, "Change_Cumsum"]

Output

df
Out[30]:
     Change  Change_Cumsum  Change_Sum
0 -1.133838      -1.133838         NaN
1  0.384319      -0.749519         NaN
2  1.496554       0.747035         NaN
3 -0.355382       0.391652         NaN
4 -0.787534      -0.395881   -0.395881
5 -0.459439      -0.855320    0.278518
6 -0.059169      -0.914489   -0.164970
7 -0.354174      -1.268662   -2.015697
8 -0.735523      -2.004185   -2.395838
9 -1.183940      -3.188125   -2.792244
How to get the datetime index corresponding to the n smallest values of a column in python
I am creating a variable 'spike' as an indicator variable that is 1 for the dates corresponding to the n smallest values of the Cost column and 0 otherwise. The code illustrated below is part of a larger for loop. I can only get results using the idxmin() function. I would like help in getting the index for the n smallest values.

import pandas as pd
import numpy as np

df3 = pd.DataFrame({'Dept': ['A', 'A', 'B', 'B'],
                    'Benefit': [2000, 25, 55, 400],
                    'Cost': [1000, 500, 1500, 2000]})

# Let's create an index using Timestamps
index_ = [pd.Timestamp('01-06-2018'), pd.Timestamp('04-06-2018'),
          pd.Timestamp('07-06-2018'), pd.Timestamp('10-06-2018')]
df3.index = index_
print(df3)

df3['spike'] = np.where(df3.index.isin(lookup), 1, 0)
If you sort, then you can get the top 3 with standard Python / numpy array slicing.

low_cost = df3.sort_values('Cost')[:3]
low_cost
#            Dept  Benefit  Cost
# 2018-04-06    A       25   500
# 2018-01-06    A     2000  1000
# 2018-07-06    B       55  1500

To get the spike column, for efficiency I would recommend a join.

spikes = low_cost.assign(spike=1)[['spike']]
spikes
#             spike
# 2018-04-06      1
# 2018-01-06      1
# 2018-07-06      1

df3.join(spikes, how='left').fillna(0)
#            Dept  Benefit  Cost  spike
# 2018-01-06    A     2000  1000    1.0
# 2018-04-06    A       25   500    1.0
# 2018-07-06    B       55  1500    1.0
# 2018-10-06    B      400  2000    0.0
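An alternative sketch using nsmallest, which plugs directly into the isin line from the question (reusing df3 from above; n = 3 is assumed here):

import numpy as np

# index labels (Timestamps) of the 3 smallest Cost values
lookup = df3.nsmallest(3, 'Cost').index
df3['spike'] = np.where(df3.index.isin(lookup), 1, 0)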
Pandas: How to round the values which are closest to the whole number for the values more than 1?
Following is the dataframe. I would like to round the values in 'period' which are closest to the whole numbers. For example: 1.005479452 rounded to 1.0000, 2.002739726 rounded to 2.0000, 3.002739726 rounded to 3.0000, 5.005479452 rounded to 5.0000, 12.01369863 rounded to 12.0000 and so on. I have a big list. I am trying to do this because later in the program I have to concatenate this dataframe with other dataframes based on the 'period' column.

df =
period        rate
0.931506849   -0.001469
0.994520548    0.008677
1.005479452    0.11741125
1.008219178    0.073975
1.010958904    0.147474833
1.994520548   -0.007189219
2.002739726    0.1160815
2.005479452    0.06995
2.008219178    0.026808
2.010958904    0.1200695
2.980821918   -0.007745727
3.002739726    0.192208333
3.010958904    0.119895833
3.019178082    0.151857267
3.021917808    0.016165
3.863013699    0.005405321
4              0.06815
4.002739726    0.1240695
4.016438356    0.2410323
4.019178082    0.0459375
4.021917808    0.03161
4.997260274    0.0682
5.005479452    0.1249955
5.01369863     0.03260875
5.016438356    0.238069083
5.019178082    0.04590625
5.021917808    0.0120625
12.01369863    0.136991
12.01643836    0.053327917
12.01917808    0.2309365

I am trying to do something like below but couldn't move further.

df['period'] = np.where(df.period > 1, df.period.round(), df.period.round(decimals=4))
You can apply a lambda function. This one will check if the value is greater than one before rounding it to a whole number, otherwise rounding to 4 decimal places for values less than one. I think that's what you seem to want?

df['period'] = df['period'].apply(lambda x: round(x, 0) if x > 1 else round(x, 4))
I built a function that basically iterates from 1 to whatever the max whole value should be in the dataframe. This should be faster than a solution that just iterates row-by-row, though it does assume that the dataframe is sorted (like in your example).

import pandas as pd

df = pd.DataFrame(
    {
        "period": [0.931506849, 0.994520548, 1.005479452, 1.008219178, 1.010958904,
                   1.994520548, 2.002739726, 2.005479452, 2.008219178, 2.010958904,
                   2.980821918, 3.002739726, 3.010958904, 3.019178082, 3.021917808,
                   3.863013699, 4, 4.002739726, 4.016438356, 4.019178082, 4.021917808,
                   4.997260274, 5.005479452, 5.01369863, 5.016438356, 5.019178082,
                   5.021917808, 12.01369863, 12.01643836, 12.01917808]
    }
)

print(df.head())
"""
     period
0  0.931507
1  0.994521
2  1.005479
3  1.008219
4  1.010959
"""

def process_df(df: pd.DataFrame) -> pd.DataFrame:
    df_range_vals = [round(period) for period in df['period'].tolist()]
    out_df = df.loc[df['period'] < 1]
    for base in range(1, max(df_range_vals) + 1):
        # only keep the ones in the range we want
        temp_df = df.loc[(df['period'] >= base) & (df['period'] < base + 1)]
        # if there's nothing to change, then just skip
        if temp_df.empty:
            continue
        temp_df.loc[temp_df.first_valid_index(), 'period'] = temp_df.loc[temp_df.first_valid_index(), 'period'].round(0)
        out_df = out_df.append(temp_df, ignore_index=True)
    return out_df

df = process_df(df)
print(df.head())
"""
     period
0  0.931507
1  0.994521
2  1.000000
3  1.008219
4  1.010959
"""
Try:

# Sort so that we know what is closest to a whole number
df.sort_values(by=['period'])

# Create a new column and round everything. This is done to do
# the partition effectively
df['round_period'] = df['period'].round()
df_of_values_close_to_whole_number = list(df.groupby('round_period').tail(1)['period'])

def round_func(x, df_of_val_close_to_whole_number):
    return '{:.5f}'.format(round(x)) if x in df_of_val_close_to_whole_number and x > 1 else x

# Apply round only to values closer to a whole number.
df['period'].apply(round_func, args=(df_of_values_close_to_whole_number,))

Output:

0      0.931507
1      0.994521
2       1.00548
3       1.00822
4       1.00000
5       1.99452
6       2.00274
7       2.00548
8       2.00822
9       2.00000
10      2.98082
11      3.00274
12      3.01096
13      3.01918
14      3.00000
15      3.86301
16            4
17      4.00274
18      4.01644
19      4.01918
20      4.00000
21      4.99726
22      5.00548
23       5.0137
24      5.01644
25      5.01918
26      5.00000
27      12.0137
28      12.0164
29     12.00000
Name: period, dtype: object
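Another possible sketch, assuming the goal is to round exactly one value per whole number (the one nearest to it, for values greater than 1) and leave everything else untouched; this is only one interpretation of the question and may differ from the answers above:

import numpy as np

# distance of each period from its nearest whole number
dist = (df['period'] - df['period'].round()).abs()
# index label of the row nearest to each whole number
closest_idx = dist.groupby(df['period'].round()).idxmin()
mask = df.index.isin(closest_idx) & (df['period'] > 1)
df['period'] = np.where(mask, df['period'].round(), df['period'])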
How to set a minimum value when performing cumsum on a dataframe column (physical inventory cannot go below 0)
How to perform a cumulative sum with a minimum value in python/pandas? In the table below:
- the "change in inventory" column reflects the daily sales/new stock purchases.
- data entry/human errors mean that applying cumsum shows a negative inventory level of -5, which is not physically possible.
- as shown by the "inventory" column, the data entry errors continue to be a problem at the end (100 vs 95).

            change in inventory  inventory  cumsum
2015-01-01                  100        100     100
2015-01-02                  -20         80      80
2015-01-03                  -30         50      50
2015-01-04                  -40         10      10
2015-01-05                  -15          0      -5
2015-01-06                  100        100      95

One way to achieve this would be to use loops, however it would be messy and there probably is a more efficient way to do this. Here is the code to generate the dataframe:

import pandas as pd

df = pd.DataFrame.from_dict({'change in inventory': {'2015-01-01': 100,
                                                     '2015-01-02': -20,
                                                     '2015-01-03': -30,
                                                     '2015-01-04': -40,
                                                     '2015-01-05': -15,
                                                     '2015-01-06': 100},
                             'inventory': {'2015-01-01': 100,
                                           '2015-01-02': 80,
                                           '2015-01-03': 50,
                                           '2015-01-04': 10,
                                           '2015-01-05': 0,
                                           '2015-01-06': 100}})
df['cumsum'] = df['change in inventory'].cumsum()
df

How to apply a cumulative sum with a minimum value in python/pandas to produce the values shown in the "inventory" column?
Depending on the data, it can be far more efficient to loop over blocks with the same sign, e.g. with large running sub-blocks all positive or negative. You only have to be careful going back to positive values after a run of negative values. With a minimum limiting value minS, summing over vector:

import numpy as np

i_sign = np.append(np.where(np.diff(np.sign(vector)) > 0)[0], [len(vector)])
i0 = 1
csum = np.maximum(minS, vector[:1])
for i1 in i_sign:
    tmp_csum = np.maximum(minS, csum[-1] + np.cumsum(vector[i0:i1+1]))
    csum = np.append(csum, tmp_csum)
    i0 = i1

The final output is in csum.
You can use looping, unfortunately:

lastvalue = 0
newcum = []
for row in df['change in inventory']:
    thisvalue = row + lastvalue
    if thisvalue < 0:
        thisvalue = 0
    newcum.append(thisvalue)
    lastvalue = thisvalue

print(pd.Series(newcum, index=df.index))

2015-01-01    100
2015-01-02     80
2015-01-03     50
2015-01-04     10
2015-01-05      0
2015-01-06    100
dtype: int64
A very ugly solution:

start = df.index[0]
df['cumsum'] = [max(df['change in inventory'].loc[start:end].sum(), 0) for end in df.index]
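One more compact sketch (reusing the df built in the question; the new column name is arbitrary), using itertools.accumulate with a clamping function so the running total never drops below 0:

from itertools import accumulate

changes = df['change in inventory'].tolist()
# the first value is taken as-is; every later step is clamped at a floor of 0
df['inventory_recomputed'] = list(accumulate(changes, lambda total, change: max(total + change, 0)))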