I have a dataset with a date, engine, energy and max power column. Say the dataset is composed of 2 machines over a depth of one month. Each machine has a maximum power (say 100 for simplicity) and 3 operating states: nominal power between Pmax and 80% of Pmax, drop in load between 80% and 20% of Pmax, and stop below 20% of Pmax (below 20% we consider that the machine is stopped).
The idea is to count, by period and machine, the number of times the machine has operated in the 2nd interval (between 80% and 20% of Pmax). A drop that ends in a stop should not be counted, and neither should a drop that comes back from a stop.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy.ma.extras import _ezclump as ez
data = {'date': ['01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020',
'01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020'],
'engine': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
'b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b',],
'energy': [100,100,100,100,100,80,80,60,60,60,60,60,90,100,100,50,50,40,20,0,0,0,20,50,60,100,100,50,50,50,50,
50,50,100,100,100,80,80,60,60,60,60,60,0,0,0,50,50,100,90,50,50,50,50,50,60,100,100,50,50,100,100],
'pmax': [100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,
100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100]
}
df = pd.DataFrame(data, columns = ['date', 'engine', 'energy', 'pmax'])
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df = df.set_index('date')
df['inter'] = df['energy'].apply(lambda x: 2 if x >= 80 else (1 if x >= 20 else 0))
liste = []
# _ezclump returns a list of slices of consecutive True values
engine_off = ez((df['inter'] == 1).to_numpy())
for i in engine_off:
    if df.iloc[i.start - 1, 3] == 0:      # slice is preceded by a stop
        engine_off.remove(i)
    elif df.iloc[i.stop, 3] == 0:         # slice is followed by a stop
        engine_off.remove(i)
    else:
        liste.append([df['engine'].iloc[i.start], df.index[i.start], df.index[i.stop], i.stop - i.start])
dfend = pd.DataFrame(liste, columns=['engine','begin','end','nb_heure'])
dfend['month'] = dfend['begin'].dt.month_name()
dfgroupe = dfend.set_index('begin').groupby(['engine','month']).agg(['mean','max','min','std','count','sum']).fillna(1)
First I load my data into a DataFrame and classify each row's energy into an interval (2 for nominal operation, 1 for intermediate and 0 for stop).
Then I check which rows have inter == 1, which gives me a list of slices with the start and end of each run of 1s.
Then I loop over the slices to check that the element before and after each slice is different from 0, to exclude drops towards a stop or returns from a stop.
Then I create a dataframe from the list, then I average, sum, etc.
The problem is that my list has only 4 drops while there should be 5. This comes from the 4th slice, (27, 33): the clumps are computed on the whole inter column without grouping by engine, so the last run of engine a gets merged with the first run of engine b.
Can someone help me?
Thank you
Here is one way to do it. I tried to use your approach with groups but ended up doing it slightly differently.
# another way to create inter, probably faster on big dataframe
df['inter'] = pd.cut(df['energy']/df['pmax'], [-1, 0.2, 0.8, 1.01],
                     labels=[0, 1, 2], right=False)
# mask if inter is equal to 1 and groupby engine
gr = df['inter'].mask(df['inter'].eq(1)).groupby(df['engine'])
# create a mask to get True for the rows you want
m = (df['inter'].eq(1)        # the rows are 1s
     & ~gr.ffill().eq(0)      # the value before the run of 1s is not 0
     & ~gr.bfill().eq(0)      # the value after the run of 1s is not 0
    )
#create dfend with similar shape to yours
dfend = (df.assign(date=df.index)               # create a column date for the agg
           .where(m)                            # replace the uninteresting rows with nan
           .groupby(['engine',                  # groupby per engine
                     m.ne(m.shift()).cumsum()]) # and per group of consecutive 1s
           .agg(begin=('date', 'first'),        # agg date with both start date
                end=('date', 'last'))           # and end date
        )
# create the column nb_hours (although here it seems to be nb_days)
dfend['nb_hours'] = (dfend['end'] - dfend['begin']).dt.days+1
print (dfend)
                   begin        end  nb_hours
engine inter
a      2      2020-01-08 2020-01-12         5
       4      2020-01-28 2020-01-31         4
b      4      2020-01-01 2020-01-02         2
       6      2020-01-20 2020-01-25         6
       8      2020-01-28 2020-01-29         2
and you get the three segments for engine b as required. Then you can
#create dfgroupe
dfgroupe = (dfend.groupby(['engine',                          # groupby engine
                           dfend['begin'].dt.month_name()])   # and month name
                 .agg(['mean', 'max', 'min', 'std', 'count', 'sum'])  # agg
                 .fillna(1)
            )
print (dfgroupe)
                nb_hours
                    mean max min       std count sum
engine begin
a      January  4.500000   5   4  0.707107     2   9
b      January  3.333333   6   2  2.309401     3  10
I am assuming the following terminology:
- 80 <= energy <= 100 ---> df['inter'] == 2, normal mode.
- 20 <= energy < 80 ---> df['inter'] == 1, intermediate mode.
- energy < 20 ---> df['inter'] == 0, stop mode.
I reckon you want to find those periods of time in which:
1) The machine is operating in intermediate mode.
2) You don't want to count if the status is changing from intermediate to stop mode or from stop to intermediate mode.
# df['before']: compare each row of df['inter'] with the previous row
# df['after']: compare each row of df['inter'] with the next row
# df['target'] == 1 when both above-mentioned conditions (conditions 1 and 2) are met.
# Next we mask the original df and keep the times at which conditions 1 and 2 are met,
# then we group by machine and month, and after that obtain the min, max, mean, and so on.
# Shift per engine so that values do not leak from one machine into the other.
df['before'] = df.groupby('engine')['inter'].shift(periods=1, fill_value=0)
df['after'] = df.groupby('engine')['inter'].shift(periods=-1, fill_value=0)
df['target'] = np.where((df['inter'] == 1) & (np.sum(df[['inter', 'before', 'after']], axis=1) > 2), 1, 0)
# date is the index of df here, so take the month from the index
df['month'] = df.index.month
mask = df['target'] == 1
df_group = df[mask].groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum'])
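As a sketch of how you might inspect the result (aggregating only the energy column instead of every numeric column; adjust to whichever columns and stats you need):
print(df[mask].groupby(['engine', 'month'])['energy'].agg(['mean', 'min', 'max', 'count']))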
I am working on a project for my thesis, which has to do with the capitalization of Research & Development (R&D) expenses for a data set of companies that I have.
For those who are not familiar with financial terminology: I am trying to accumulate each year's R&D expenses with those of the following years, decaying ("depreciating") the value every time period.
For example if we have Apple's R&D expenses for 5 years at a constant depreciation rate of 20%:
year  r&d_exp  dep_rate  r&d_capital
1999       10       0.2           10
2000        8       0.2           16
2001       12       0.2         24.4
2002        7       0.2         25.4
2003       15       0.2           33
In case it is not clear, r&d_capital is computed the following way:
2000 = 10*(1-0.2) + 8
2001 = 10*(1-0.4) + 8*(1-0.2) + 12
2002 = 10*(1-0.6) + 8*(1-0.4) + 12*(1-0.2) + 7
2003 = 10*(1-0.8) + 8*(1-0.6) + 12*(1-0.4) + 7*(1-0.2) + 15
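In other words (my reading of the example), the general pattern is
r&d_capital(T) = sum over years t <= T of r&d_exp(t) * (1 - dep_rate * (T - t)),
assuming a constant rate and that no expense is held long enough to depreciate below zero.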
How can I automate this calculation in a pandas DataFrame?
Also considering that I have more than one firm in my dataframe.
Thank you in advance for the help :)
I'm sure there's a better way to do it, but using a for loop and indexing you can combine 'r&d_exp' and 'dep_rate' appropriately:
import pandas as pd
import numpy as np
df = pd.DataFrame(((1999, 10, 0.2, 10),
                   (2000,  8, 0.2, 16),
                   (2001, 12, 0.2, 24.4),
                   (2002,  7, 0.2, 25.4),
                   (2003, 15, 0.2, 33)),
                  columns=('year', 'r&d_exp', 'dep_rate', 'r&d_capital'))
we can use indexing and list comprehension to sum for each value up to each year:
# set to zero to show that the correct values are recovered
df['r&d_capital'] = 0
df['r&d_capital'].values
>>> array([0, 0, 0, 0, 0])
df['r&d_capital'] = [(df['r&d_exp'].iloc[:i] * (1 - df['dep_rate'].iloc[:i]*np.arange(i)[::-1])).sum()
                     for i in range(1, len(df)+1)]
df['r&d_capital'].values
>>> array([10. , 16. , 24.4, 25.4, 33. ])
We use df['r&d_exp'].iloc[:i] to extract the series up to index i, and an array np.arange(i)[::-1] of integer ages to build the total depreciation factor for each expense at the year in question. Importantly, this array is reversed so that earlier expenses receive more periods of depreciation. Each term is the remaining value of the original investment after depreciation at the year in question, and all of these contributions are then summed to get the total capital. This method already handles different depreciation rates.
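For instance, taking i = 3 (the year 2001) with the data above:
i = 3
ages = np.arange(i)[::-1]                        # array([2, 1, 0]): periods of depreciation per expense
weights = 1 - df['dep_rate'].iloc[:i] * ages     # [0.6, 0.8, 1.0]
print((df['r&d_exp'].iloc[:i] * weights).sum())  # 10*0.6 + 8*0.8 + 12*1.0 = 24.4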
In principle this can be extended to other firms easily.
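For example, here is a minimal sketch, assuming a 'firm' column (a placeholder name) identifies each company and rows are sorted by year within each firm:
def rd_capital(group):
    # same list comprehension as above, applied to one firm's rows
    exp = group['r&d_exp'].to_numpy()
    rate = group['dep_rate'].to_numpy()
    capital = [(exp[:i] * (1 - rate[:i] * np.arange(i)[::-1])).sum()
               for i in range(1, len(group) + 1)]
    return pd.Series(capital, index=group.index)
df['r&d_capital'] = df.groupby('firm', group_keys=False).apply(rd_capital)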
I hope this helps.
I have a timeseries dataframe that is similar to:
ts = pd.DataFrame([['Jan 2000', 'WidgetCo', 0.5, 2], ['Jan 2000', 'GadgetCo', 0.3, 3], ['Jan 2000', 'SnazzyCo', 0.2, 4],
                   ['Feb 2000', 'WidgetCo', 0.4, 2], ['Feb 2000', 'GadgetCo', 0.5, 2.5], ['Feb 2000', 'SnazzyCo', 0.1, 4],
                  ], columns=['month', 'company', 'share', 'price'])
Which looks like:
      month   company  share  price
0  Jan 2000  WidgetCo    0.5    2.0
1  Jan 2000  GadgetCo    0.3    3.0
2  Jan 2000  SnazzyCo    0.2    4.0
3  Feb 2000  WidgetCo    0.4    2.0
4  Feb 2000  GadgetCo    0.5    2.5
5  Feb 2000  SnazzyCo    0.1    4.0
I can pivot this table like so:
pd.pivot_table(ts,index='month', columns='company')
Which gets me:
            share                   price
company  GadgetCo SnazzyCo WidgetCo GadgetCo SnazzyCo WidgetCo
month
Feb 2000      0.5      0.1      0.4      2.5        4        2
Jan 2000      0.3      0.2      0.5      3.0        4        2
This is what I want except that I need to collapse the MultiIndex so that the company is used as a prefix for share and price like so:
          WidgetCo_share  WidgetCo_price  GadgetCo_share  GadgetCo_price  ...
month
Jan 2000             0.5               2             0.3             3.0
Feb 2000             0.4               2             0.5             2.5
I came up with this function to do just that but it seems like a poor solution:
def pivot_table_to_flat(df, column, index):
    res = df.set_index(index)
    cols = res.drop(column, axis=1).columns.values
    resulting_cols = []
    for prefix in res[column].unique():
        for col in cols:
            new_col_name = prefix + '_' + col
            res[new_col_name] = res[res[column] == prefix][col]
            resulting_cols.append(new_col_name)
    return res[resulting_cols]
pivot_table_to_flat(ts, index='month', column='company')
What is a better way of accomplishing a pivot resulting in a columns with prefixes as opposed to a MultiIndex?
This seems even simpler:
df.columns = [' '.join(col).strip() for col in df.columns.values]
It takes a df with a multiindex column and flattens the column labels, with the df remaining in place.
(ref: Andy Hayden, "Python Pandas - How to flatten a hierarchical index in columns")
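If you want the company as the prefix joined with an underscore, as in the question, a small variation of the same idea should work (swaplevel puts the company level first before joining):
pivoted = pd.pivot_table(ts, index='month', columns='company').swaplevel(axis=1)
pivoted.columns = ['_'.join(col) for col in pivoted.columns.values]
# columns are now e.g. 'GadgetCo_share', 'GadgetCo_price', ...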
I figured it out. Using the data on the MultiIndex makes for a pretty clean solution:
def flatten_multi_index(df):
    mi = df.columns
    suffixes, prefixes = mi.levels
    # note: in recent pandas versions, MultiIndex.labels was renamed to MultiIndex.codes
    col_names = [prefixes[i_p] + '_' + suffixes[i_s] for (i_s, i_p) in zip(*mi.labels)]
    df.columns = col_names
    return df
flatten_multi_index(pd.pivot_table(ts,index='month', columns='company'))
The above version only handles a 2D MultiIndex but it could be generalized if needed.
An update (as of early 2017 and pandas 0.19.2). You can use .values on a MultiIndex, so this snippet should flatten MultiIndexes for those in need. The snippet is both too clever and not clever enough: it can handle either the row index or the column names of the DataFrame, but it will blow up if the result of getattr(df, way) isn't nested (i.e., a MultiIndex).
def flatten_multi(df, way='index'):  # or way='columns'
    assert way in {'index', 'columns'}, "I'm sorry Dave."
    mi = getattr(df, way)
    flat_names = ["_".join(s) for s in mi.values]
    setattr(df, way, flat_names)
    return df
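For example, on the pivot from the question:
flat = flatten_multi(pd.pivot_table(ts, index='month', columns='company'), way='columns')
print(flat.columns)  # e.g. ['price_GadgetCo', ..., 'share_WidgetCo'] (value name first, company second)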