Python - Count row between interval in dataframe - python

I have a dataset with date, engine, energy and max power (pmax) columns. Say the dataset covers 2 machines over one month, and each machine has a maximum power Pmax (100 for simplicity). Each machine has 3 operating states: nominal power (between 80% of Pmax and Pmax), reduced load (between 20% and 80% of Pmax), and stopped (below 20% of Pmax, which we treat as a stop).
The idea is to count, per period and per machine, the number of times the machine operated in the 2nd interval (between 20% and 80% of Pmax). A drop that leads to a stop should not be counted, and neither should a return from a stop.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy.ma.extras import _ezclump as ez
data = {'date': ['01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020',
'01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020'],
'engine': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
'b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b',],
'energy': [100,100,100,100,100,80,80,60,60,60,60,60,90,100,100,50,50,40,20,0,0,0,20,50,60,100,100,50,50,50,50,
50,50,100,100,100,80,80,60,60,60,60,60,0,0,0,50,50,100,90,50,50,50,50,50,60,100,100,50,50,100,100],
'pmax': [100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,
100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100]
}
df = pd.DataFrame(data, columns = ['date', 'engine', 'energy', 'pmax'])
df['date'] = df['date'].astype('datetime64[ns]')
df = df.set_index('date')
df['inter'] = df['energy'].apply(lambda x: 2 if x >= 80 else (1 if x < 80 and x >= 20 else 0 ))
liste = []
engine_off = ez((df['inter'] == 1).to_numpy())
for i in engine_off:
    if df.iloc[(i.start)-1, 3] == 0:
        engine_off.remove(i)
    elif df.iloc[(i.stop), 3] == 0:
        engine_off.remove(i)
    else:
        liste.append([df['engine'][i.start],df.index[i.start],df.index[i.stop], i.stop - i.start])
dfend = pd.DataFrame(liste, columns=['engine','begin','end','nb_heure'])
dfend['month'] = dfend['begin'].dt.month_name()
dfgroupe = dfend.set_index('begin').groupby(['engine','month']).agg(['mean','max','min','std','count','sum']).fillna(1)
First I load my data into a DataFrame and classify each row's energy into an interval (2 for nominal operation, 1 for intermediate and 0 for stop).
Then I check which rows have inter == 1, which gives me a list of slices with the start and end of each run.
Then I loop over the slices to check that the elements just before and just after each slice are different from 0, to exclude drops to a stop and returns from a stop.
Then I create a DataFrame from the list and compute the mean, sum, etc.
The problem is that my list has only 4 drops while there are 5. This comes from the 4th slice (27, 33).
Can someone help me?
Thank you

Here is one way to do it. I tried to use your approach with groups but ended up doing it slightly differently.
# another way to create inter, probably faster on a big dataframe
df['inter'] = pd.cut(df['energy']/df['pmax'], [-1, 0.2, 0.8, 1.01],
                     labels=[0, 1, 2], right=False)
# mask the rows where inter is equal to 1 and group by engine
gr = df['inter'].mask(df['inter'].eq(1)).groupby(df['engine'])
# create a mask that is True for the rows you want
m = (df['inter'].eq(1)       # the rows are 1s
     & ~gr.ffill().eq(0)     # the row before the 1s is not 0
     & ~gr.bfill().eq(0)     # the row after the 1s is not 0
    )
# create dfend with a shape similar to yours
dfend = (df.assign(date=df.index)   # create a column date for the agg
           .where(m)                # replace the uninteresting rows by NaN
           .groupby(['engine',                      # group by engine
                     m.ne(m.shift()).cumsum()])     # and by run of consecutive 1s
           .agg(begin=('date', 'first'),    # agg date to get the start date
                end=('date', 'last'))       # and the end date
        )
# create the column nb_hours (although here it seems to be nb_days)
dfend['nb_hours'] = (dfend['end'] - dfend['begin']).dt.days + 1
print (dfend)
                  begin        end  nb_hours
engine inter
a      2     2020-01-08 2020-01-12         5
       4     2020-01-28 2020-01-31         4
b      4     2020-01-01 2020-01-02         2
       6     2020-01-20 2020-01-25         6
       8     2020-01-28 2020-01-29         2
This gives the three segments for engine b as required. Then you can:
# create dfgroupe
dfgroupe = (dfend.groupby(['engine',                          # group by engine
                           dfend['begin'].dt.month_name()])   # and month name
                 .agg(['mean', 'max', 'min', 'std', 'count', 'sum'])
                 .fillna(1)
           )
print (dfgroupe)
                nb_hours
                    mean max min       std count sum
engine begin
a      January  4.500000   5   4  0.707107     2   9
b      January  3.333333   6   2  2.309401     3  10

I am assuming the following terminology:
- 80 <= energy <= 100 ---> df['inter'] == 2, normal mode.
- 20 <= energy < 80 ---> df['inter'] == 1, intermediate mode.
- 20 > energy ---> df['inter'] == 0, stop mode.
I reckon you want to find those periods of time in which:
1) The machine is operating in intermediate mode.
2) You don't want to count if the status is changing from intermediate to stop mode or from stop to intermediate mode.
# df['before']: this is to compare each row of df['inter'] with the previous row
# df['after']: this is to compare each row of df['inter'] with the next row
# df['target'] == 1 is when both above mentioned conditions (conditions 1 and 2) are met.
# In the next we mask the original df and keep those times that conditions 1 and 2 are met, then we group by machine and month, and after that obtain the min, max, mean, and so on.
df['before'] = df['inter'].shift(periods=1, fill_value=0)
df['after'] = df['inter'].shift(periods=-1, fill_value=0)
df['target'] = np.where((df['inter'] == 1) & (np.sum(df[['inter', 'before', 'after']], axis=1) > 2), 1, 0)
df['month'] = df.index.month  # 'date' is the index in the question's DataFrame, not a column
mask = df['target'] == 1
df_group = df[mask].groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum'])
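A minimal sketch of the run-based idea, on made-up data: label consecutive runs of the same state within each engine, then keep only the intermediate runs whose neighbouring runs are not stops. The toy `inter` values and the helper column names (`run`, `state`, `length`, `prev`, `next`) are assumptions for illustration and do not come from the question's dataset.

```python
import pandas as pd

# Hypothetical toy data: 2 = nominal, 1 = intermediate, 0 = stop
df = pd.DataFrame({
    "engine": list("aaaaaaaaaa") + list("bbbbb"),
    "inter": [2, 1, 1, 2, 1, 0, 0, 1, 1, 2,
              1, 1, 2, 1, 1],
})

# A new run starts whenever the state differs from the previous row of the
# same engine (the first row of each engine always starts a run).
change = df["inter"].ne(df.groupby("engine")["inter"].shift())
df["run"] = change.cumsum()

# One row per run: its state and its length in rows
runs = (df.groupby(["engine", "run"], sort=False)["inter"]
          .agg(state="first", length="count")
          .reset_index())

# State of the neighbouring runs within the same engine (NaN at the edges)
runs["prev"] = runs.groupby("engine")["state"].shift()
runs["next"] = runs.groupby("engine")["state"].shift(-1)

# Keep intermediate runs that neither follow nor precede a stop
keep = runs["state"].eq(1) & runs["prev"].ne(0) & runs["next"].ne(0)
result = runs.loc[keep, ["engine", "length"]]
print(result)
```

Grouping the shift by engine keeps a run at the end of one machine from being compared with the first rows of the next machine, which a plain `shift()` over the whole column would do.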

Related

Concatenate arrays into a single table using pandas

I have a .csv file that I filter year by year to get the maximum, minimum and average values for each year:
import pandas as pd
DF = pd.read_csv("PJME_hourly.csv")
for i in range(2002,2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW" : ['max','min','mean']})
    print(i , pd.concat([dateframe],axis=0,sort= False))
The output is as follows:
2002 PJME_MW
max 55934.000000
min 19247.000000
mean 31565.617106
2003 PJME_MW
max 53737.000000
min 19414.000000
mean 31698.758621
2004 PJME_MW
max 51962.000000
min 19543.000000
mean 32270.434867
I would like to know how to combine everything into a single column (PJME_MW), with each group of statistics (max, min, mean) labelled by the year it belongs to.
If you convert the dates with to_datetime(), you can group them using the dt.year accessor:
df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)
# select PJME_MW so the Datetime column itself is not aggregated
df.groupby(df.Datetime.dt.year)[['PJME_MW']].agg(['min', 'max', 'mean'])
Toy example:
df = pd.DataFrame({'Datetime': ['2019-01-01','2019-02-01','2020-01-01','2020-02-01','2021-01-01'], 'PJME_MW': [3,5,30,50,100]})
#      Datetime  PJME_MW
# 0  2019-01-01        3
# 1  2019-02-01        5
# 2  2020-01-01       30
# 3  2020-02-01       50
# 4  2021-01-01      100
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year)[['PJME_MW']].agg(['min', 'max', 'mean'])
#          PJME_MW
#              min  max   mean
# Datetime
# 2019           3    5    4.0
# 2020          30   50   40.0
# 2021         100  100  100.0
The code could be optimized, but here is how to make it work as it is now. Change this part of your code:
for i in range(2002,2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW" : ['max','min','mean']})
    print(i , pd.concat([dateframe],axis=0,sort= False))
Use this instead:
aggs = ['max','min','mean']
# group by year; this assumes the Datetime strings start with the 4-digit year
df_group = DF.groupby(DF['Datetime'].str[:4])['PJME_MW'].agg(aggs).reset_index()
out = []
for agg in aggs:
    # build a fresh frame on every pass; reusing one frame would append the same object three times
    aux = pd.DataFrame({'agg_year': agg + '_' + df_group['Datetime'],
                        'PJME_MW': df_group[agg]})
    out.append(aux)
df_out = pd.concat(out, ignore_index=True)
Edit: Concatenation form has been changed
Final edit: I didn't understand the whole problem, sorry. You don't need the code after groupby function
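A minimal sketch of one way to get the single PJME_MW column asked for in the question, with each statistic labelled by its year: aggregate per year, then stack the statistics into the index. The toy values and the index level names `year` and `stat` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for PJME_hourly.csv
df = pd.DataFrame({
    "Datetime": pd.to_datetime(["2019-01-01", "2019-02-01",
                                "2020-01-01", "2020-02-01"]),
    "PJME_MW": [3.0, 5.0, 30.0, 50.0],
})

# One row per year, one column per statistic
stats = df.groupby(df["Datetime"].dt.year)["PJME_MW"].agg(["max", "min", "mean"])

# Move the statistic names into the index: one PJME_MW column,
# indexed by (year, stat)
long = stats.stack().rename("PJME_MW").to_frame()
long.index.names = ["year", "stat"]
print(long)
```

A single value can then be looked up with `long.loc[(2019, "max"), "PJME_MW"]`.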

Python/Pandas For Loop Time Series

I am working with panel time-series data and am struggling to write a fast for loop that sums the 50 values preceding the current index i. The data has about 600k rows, and the loop starts to churn around row 30k. Is there a way to do the same with pandas or NumPy in a fraction of the time?
The change column is of type float, with 4 decimals.
Index Change
0 0.0410
1 0.0000
2 0.1201
... ...
74327 0.0000
74328 0.0231
74329 0.0109
74330 0.0462
SEQ_LEN = 50
for i in range(SEQ_LEN, len(df)):
    df.at[i, 'Change_Sum'] = sum(df['Change'][i-SEQ_LEN:i])
Any help would be highly appreciated! Thank you!
I tried this with 600k rows and the average time was
20.9 ms ± 1.35 ms
This will return a series with the rolling sum for the last 50 Change in the df:
df['Change'].rolling(50).sum()
you can add it to a new column like so:
df['change50'] = df['Change'].rolling(50).sum()
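One detail worth checking: `.rolling(n).sum()` includes the current row in its window, while the loop in the question sums the n rows before index i. A minimal sketch on toy data (the values and the small window size are made up for illustration) shows that adding `.shift(1)` reproduces the loop:

```python
import numpy as np
import pandas as pd

SEQ_LEN = 3  # small window so the result is easy to check by hand
df = pd.DataFrame({"Change": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Loop from the question: sum of the SEQ_LEN values *before* row i
loop_sum = [np.nan] * SEQ_LEN + [
    sum(df["Change"][i - SEQ_LEN:i]) for i in range(SEQ_LEN, len(df))
]

# Vectorised equivalent: the rolling sum includes the current row,
# so shift it down by one row to exclude it
df["Change_Sum"] = df["Change"].rolling(SEQ_LEN).sum().shift(1)
print(df)
```

Without the shift, the result is the sum of the current row and the n - 1 rows before it, which may or may not be what is wanted.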
Disclaimer: This solution cannot compete with .rolling(). Also, for a grouped case, just use df.groupby("group")["Change"].rolling(50).sum() and then reset the index. Therefore please accept the other answer.
An explicit for loop can be avoided by rewriting the windowed partial sum as a difference of cumulative sums (cumsum). The formula:
Sum[x-50:x] = Sum[:x] - Sum[:x-50] = Cumsum[x] - Cumsum[x-50]
Code
For demonstration purposes, I have shortened len(df["Change"]) to 10 and SEQ_LEN to 5. A million records complete almost immediately this way.
import pandas as pd
import numpy as np
# data
SEQ_LEN = 5
np.random.seed(111) # reproducibility
df = pd.DataFrame(
    data={
        "Change": np.random.normal(0, 1, 10)  # 10 rows here; the timing test used a million
    }
)
# step 1. Do cumsum
df["Change_Cumsum"] = df["Change"].cumsum()
# Step 2. calculate diff of cumsum: Sum[x-50:x] = Sum[:x] - Sum[:x-50]
df["Change_Sum"] = np.nan # or zero as you wish
df.loc[SEQ_LEN:, "Change_Sum"] = df["Change_Cumsum"].values[SEQ_LEN:] - df["Change_Cumsum"].values[:(-SEQ_LEN)]
# add idx=SEQ_LEN-1
df.at[SEQ_LEN-1, "Change_Sum"] = df.at[SEQ_LEN-1, "Change_Cumsum"]
Output
df
Out[30]:
     Change  Change_Cumsum  Change_Sum
0 -1.133838      -1.133838         NaN
1  0.384319      -0.749519         NaN
2  1.496554       0.747035         NaN
3 -0.355382       0.391652         NaN
4 -0.787534      -0.395881   -0.395881
5 -0.459439      -0.855320    0.278518
6 -0.059169      -0.914489   -0.164970
7 -0.354174      -1.268662   -2.015697
8 -0.735523      -2.004185   -2.395838
9 -1.183940      -3.188125   -2.792244

How to get the datetime index corresponding to the n smallest values of a column in python

I am creating an indicator variable 'spike' that is 1 for the dates corresponding to the n smallest values of the Cost column and 0 otherwise. The code below is part of a larger for loop.
So far I can only get a result using the idxmin() function. I would like help getting the index of the n smallest values.
import pandas as pd
import numpy as np
df3 = pd.DataFrame({'Dept':['A', 'A', 'B', 'B'],
                    'Benefit':[2000,25,55,400],
                    'Cost':[1000, 500, 1500, 2000]})
# Let's create an index using Timestamps
index_ = [pd.Timestamp('01-06-2018'), pd.Timestamp('04-06-2018'),
          pd.Timestamp('07-06-2018'), pd.Timestamp('10-06-2018')]
df3.index = index_
print(df3)
df3['spike'] = np.where(df3.index.isin(lookup), 1, 0)
If you sort, then you can get the three smallest with standard Python / NumPy slicing.
low_cost = df3.sort_values('Cost')[:3]
low_cost
# Dept Benefit Cost
# 2018-04-06 A 25 500
# 2018-01-06 A 2000 1000
# 2018-07-06 B 55 1500
To get the spike column, for efficiency I would recommend a join.
spikes = low_cost.assign(spike=1)[['spike']]
spikes
# spike
# 2018-04-06 1
# 2018-01-06 1
# 2018-07-06 1
df3.join(spikes, how='left').fillna(0)
# Dept Benefit Cost spike
# 2018-01-06 A 2000 1000 1.0
# 2018-04-06 A 25 500 1.0
# 2018-07-06 B 55 1500 1.0
# 2018-10-06 B 400 2000 0.0
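A minimal sketch of another route, assuming the goal is an integer 0/1 column: `Series.nsmallest` returns the n smallest values together with their index labels, and those labels can serve as the `lookup` that the question's `np.where` line expects. The value of `n` here is an assumption for illustration.

```python
import pandas as pd

df3 = pd.DataFrame(
    {"Dept": ["A", "A", "B", "B"],
     "Benefit": [2000, 25, 55, 400],
     "Cost": [1000, 500, 1500, 2000]},
    index=pd.to_datetime(["2018-01-06", "2018-04-06",
                          "2018-07-06", "2018-10-06"]),
)

n = 3  # number of smallest Cost values to flag (illustrative)

# Index labels (dates) of the n smallest costs
lookup = df3["Cost"].nsmallest(n).index

# 1 where the row's date is among those labels, 0 otherwise
df3["spike"] = df3.index.isin(lookup).astype(int)
print(df3)
```

This keeps the original row order and yields integers rather than the floats produced by the join followed by `fillna(0)`.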

Pandas: How to round the values which are closest to the whole number for the values more than 1?

The dataframe is shown below. I would like to round the values in 'period' that are closest to whole numbers. For example: 1.005479452 rounded to 1.0000, 2.002739726 rounded to 2.0000, 3.002739726 rounded to 3.0000, 5.005479452 rounded to 5.0000, 12.01369863 rounded to 12.0000 and so on. I have a big list. I need this because, later in the program, I have to concatenate this dataframe with other dataframes based on the 'period' column.
df = period rate
0.931506849 -0.001469
0.994520548 0.008677
1.005479452 0.11741125
1.008219178 0.073975
1.010958904 0.147474833
1.994520548 -0.007189219
2.002739726 0.1160815
2.005479452 0.06995
2.008219178 0.026808
2.010958904 0.1200695
2.980821918 -0.007745727
3.002739726 0.192208333
3.010958904 0.119895833
3.019178082 0.151857267
3.021917808 0.016165
3.863013699 0.005405321
4 0.06815
4.002739726 0.1240695
4.016438356 0.2410323
4.019178082 0.0459375
4.021917808 0.03161
4.997260274 0.0682
5.005479452 0.1249955
5.01369863 0.03260875
5.016438356 0.238069083
5.019178082 0.04590625
5.021917808 0.0120625
12.01369863 0.136991
12.01643836 0.053327917
12.01917808 0.2309365
I am trying to do something like below but couldn't move further.
df['period'] = np.where(df.period>1, df.period.round(), df.period.round(decimals = 4))
You can apply a lambda function. This one checks whether the value is greater than one and, if so, rounds it to a whole number; otherwise it rounds to 4 decimal places. I think that's what you want?
df['period'] = df['period'].apply(lambda x: round(x, 0) if x > 1 else round(x, 4))
I built a function that basically iterates from 1 to whatever the max whole value should be in the dataframe. This should be faster than a solution that just iterates row-by-row, though it does assume that the dataframe is sorted (like in your example).
import pandas as pd
df = pd.DataFrame(
    {
        "period": [0.931506849, 0.994520548, 1.005479452, 1.008219178, 1.010958904, 1.994520548, 2.002739726, 2.005479452, 2.008219178, 2.010958904, 2.980821918, 3.002739726, 3.010958904, 3.019178082, 3.021917808, 3.863013699, 4, 4.002739726, 4.016438356, 4.019178082, 4.021917808, 4.997260274, 5.005479452, 5.01369863, 5.016438356, 5.019178082, 5.021917808, 12.01369863, 12.01643836, 12.01917808]
    }
)
print(df.head())
"""
period
0 0.931507
1 0.994521
2 1.005479
3 1.008219
4 1.010959
"""
def process_df(df: pd.DataFrame) -> pd.DataFrame:
    df_range_vals = [round(period) for period in df['period'].tolist()]
    # values below 1 are kept unchanged
    parts = [df.loc[df['period'] < 1]]
    for base in range(1, max(df_range_vals) + 1):
        # only keep the ones in the range we want (copy so the slice can be modified safely)
        temp_df = df.loc[(df['period'] >= base) & (df['period'] < base + 1)].copy()
        # if there's nothing to change, then just skip
        if temp_df.empty:
            continue
        # round only the first (smallest) value of the range
        first = temp_df.first_valid_index()
        temp_df.loc[first, 'period'] = temp_df.loc[first, 'period'].round(0)
        parts.append(temp_df)
    # DataFrame.append was removed in pandas 2.0, so collect the parts and concatenate once
    return pd.concat(parts, ignore_index=True)
df = process_df(df)
print(df.head())
"""
period
0 0.931507
1 0.994521
2 1.000000
3 1.008219
4 1.010959
"""
Try:
# Sort so that we know what is closest to a whole number
# (sort_values returns a new frame, so assign the result)
df = df.sort_values(by=['period'])
# Create a new column and round everything. This is done to
# partition effectively
df['round_period'] = df['period'].round()
df_of_values_close_to_whole_number = list(df.groupby('round_period').tail(1)['period'])
def round_func(x, df_of_val_close_to_whole_number):
    return '{:.5f}'.format(round(x)) if x in df_of_val_close_to_whole_number and x > 1 else x
# Apply round only to the values selected above.
df['period'].apply(round_func, args=(df_of_values_close_to_whole_number,))
Output
0 0.931507
1 0.994521
2 1.00548
3 1.00822
4 1.00000
5 1.99452
6 2.00274
7 2.00548
8 2.00822
9 2.00000
10 2.98082
11 3.00274
12 3.01096
13 3.01918
14 3.00000
15 3.86301
16 4
17 4.00274
18 4.01644
19 4.01918
20 4.00000
21 4.99726
22 5.00548
23 5.0137
24 5.01644
25 5.01918
26 5.00000
27 12.0137
28 12.0164
29 12.00000
Name: period, dtype: object
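A minimal vectorised sketch under one reading of the question: within each group of values that round to the same whole number, only the value with the smallest distance to that number is rounded, and values below 1 are left alone. That reading, the shortened sample and the names `nearest`, `dist` and `closest_idx` are assumptions for illustration.

```python
import pandas as pd

# Shortened, hypothetical sample of the 'period' column
df = pd.DataFrame({"period": [0.931506849, 1.005479452, 1.008219178,
                              1.994520548, 2.002739726, 2.005479452,
                              12.01369863, 12.01643836]})

nearest = df["period"].round()           # whole number each value rounds to
dist = (df["period"] - nearest).abs()    # distance to that whole number

# For every whole number, the index label of the closest value
closest_idx = dist.groupby(nearest).idxmin()

# Round only those rows, and only when the value is above 1
mask = df.index.isin(closest_idx) & (df["period"] > 1)
df.loc[mask, "period"] = nearest[mask]
print(df)
```

If several values are equally close to the same whole number, `idxmin` picks the first one, so a tie-breaking rule would need to be decided separately.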

How to set a minimum value when performing cumsum on a dataframe column (physical inventory cannot go below 0)

How to perform a cumulative sum with a minimum value in python/pandas?
In the table below:
- The "change in inventory" column reflects the daily sales/new stock purchases.
- Data entry/human errors mean that applying cumsum shows a negative inventory level of -5, which is not physically possible.
- As shown by the "inventory" column, the data entry errors continue to be a problem at the end (100 vs 95).
dataframe
            change in inventory  inventory  cumsum
2015-01-01                  100        100     100
2015-01-02                  -20         80      80
2015-01-03                  -30         50      50
2015-01-04                  -40         10      10
2015-01-05                  -15          0      -5
2015-01-06                  100        100      95
One way to achieve this would be to use loops however it would be messy and there probably is a more efficient way to do this.
Here is the code to generate the dataframe:
import pandas as pd
df = pd.DataFrame.from_dict({'change in inventory': {'2015-01-01': 100,
                                                     '2015-01-02': -20,
                                                     '2015-01-03': -30,
                                                     '2015-01-04': -40,
                                                     '2015-01-05': -15,
                                                     '2015-01-06': 100},
                             'inventory': {'2015-01-01': 100,
                                           '2015-01-02': 80,
                                           '2015-01-03': 50,
                                           '2015-01-04': 10,
                                           '2015-01-05': 0,
                                           '2015-01-06': 100}})
df['cumsum'] = df['change in inventory'].cumsum()
df
How to apply a cumulative sum with a minimum value in python/pandas to produce the values shown in the "inventory" column?
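Depending on the data, it can be far more efficient to loop over blocks with the same sign, e.g. when there are long runs of all-positive or all-negative values. You only have to be careful when going back to positive values after a run of negative values.
With minS as the minimum limiting value, summing over vector:
import numpy as np
vector = df['change in inventory'].to_numpy()  # values to accumulate
minS = 0  # the running total may not go below this value
# indices of the last element before each rise in sign, plus the end of the data
i_sign = np.append(np.where(np.diff(np.sign(vector)) > 0)[0],[len(vector)])
i0 = 1
csum = np.maximum(minS, vector[:1])
for i1 in i_sign:
    tmp_csum = np.maximum(minS, csum[-1] + np.cumsum(vector[i0:i1+1]))
    csum = np.append(csum, tmp_csum)
    i0 = i1 + 1  # the next block starts after the last element just consumed
The final output is in csum.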
You can use looping, unfortunately:
lastvalue = 0
newcum = []
for row in df['change in inventory']:
    thisvalue = row + lastvalue
    if thisvalue < 0:
        thisvalue = 0
    newcum.append( thisvalue )
    lastvalue = thisvalue
print(pd.Series(newcum, index=df.index))
2015-01-01 100
2015-01-02 80
2015-01-03 50
2015-01-04 10
2015-01-05 0
2015-01-06 100
dtype: int64
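A very ugly solution (note that it only clips the plain cumulative sum at 0, so the last value is 95 rather than 100):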
start = df.index[0]
df['cumsum'] = [max(df['change in inventory'].loc[start:end].sum(), 0)
for end in df.index]
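A minimal sketch of the same floor-limited running total written with `itertools.accumulate`, as a compact alternative to the explicit loop shown above. The `initial` argument needs Python 3.8 or later, and the names `floor`, `running` and `result` are assumptions for illustration.

```python
from itertools import accumulate

import pandas as pd

changes = pd.Series(
    [100, -20, -30, -40, -15, 100],
    index=pd.date_range("2015-01-01", periods=6),
    name="change in inventory",
)

floor = 0  # inventory cannot go below this level

# Each step adds the change to the previous clipped total and clips again.
# Starting from `initial=floor` also clips the first element; the extra
# leading value that `initial` produces is dropped with [1:].
running = accumulate(changes,
                     lambda total, change: max(total + change, floor),
                     initial=floor)
result = pd.Series(list(running)[1:], index=changes.index, name="inventory")
print(result)
```

This is still a Python-level loop under the hood, so it is a readability choice rather than a speed-up over the explicit version.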
