Python - add column and calculate value based on condition

I have a dataset that looks as follows:
data = {'Year': [2012, 2013, 2012, 2013, 2014, 2013],
        'Quarter': [2, 2, 2, 2, 3, 1],
        'ID': ['CH7744', 'US4652', 'CA47441', 'CH1147', 'DE7487', 'US5174'],
        'MC': [3348.22, 8542.55, 11851.2, 15718.1, 29914.7, 8731.78],
        'PB': [2.74, 0.95, 1.57, 2.13, 0.54, 5.32]}
df = pd.DataFrame(data)
Now what I aim to do is add a new column "SMB" and calculate it as follows:
Subset the data based on year and quarter, e.g. get all values where year = 2012 and quarter = 2.
Sort the subset by column MC and split it by size into small and big (at the 0.5 quantile).
If the value in MC is lower than the 0.5 quantile, add the value "Small" to the newly created column "SMB"; if it is higher than the 0.5 quantile, add "Big".
Repeat the process for all rows where quarter = 2.
For all other rows add np.nan.
So the output should look like this:
data = {'Year': [2012, 2013, 2012, 2013, 2014, 2013],
        'Quarter': [2, 2, 2, 2, 3, 1],
        'ID': ['CH7744', 'US4652', 'CA47441', 'CH1147', 'DE7487', 'US5174'],
        'MC': [3348.22, 8542.55, 11851.2, 15718.1, 29914.7, 8731.78],
        'PB': [2.74, 0.95, 1.57, 2.13, 0.54, 5.32],
        'SMB': ['Small', 'Small', 'Big', 'Big', np.nan, np.nan]}
df = pd.DataFrame(data)
I tried to create a loop, but I was unable to properly merge the result back into the original dataframe, as I need the other quarters' values for further calculations. Using the code below I sort of achieved what I wanted, but I had to merge the data back into the original dataset.
I'm sure there is a much nicer way to achieve this.
# Quantile 0.5 for MC sorting (small & big)
smbQuantile = 0.5
Years = df['Year'].unique()
dataframes_list = []

# Calculate Small and Big and merge back into the dataframe
for i in Years:
    df_temp = df.loc[(df['Year'] == i) & (df['Quarter'] == 2)].copy()
    df_temp['SMB'] = ''
    # Assign factor size based on market cap
    df_temp.loc[df_temp.MC <= df_temp.MC.quantile(smbQuantile), 'SMB'] = 'Small'
    df_temp.loc[df_temp.MC >= df_temp.MC.quantile(smbQuantile), 'SMB'] = 'Big'
    dataframes_list.append(df_temp)

df = pd.concat(dataframes_list)

You can use groupby.rank and groupby.transform('size') combined with numpy.select:
g = df.groupby(['Year', 'Quarter'])['MC']

df['SMB'] = np.select([g.rank(pct=True).le(0.5),
                       g.transform('size').ge(2)],
                      ['Small', 'Big'], np.nan)
output:
   Year  Quarter       ID        MC    PB    SMB
0  2012        2   CH7744   3348.22  2.74  Small
1  2013        2   US4652   8542.55  0.95  Small
2  2012        2  CA47441  11851.20  1.57    Big
3  2013        2   CH1147  15718.10  2.13    Big
4  2014        3   DE7487  29914.70  0.54    nan
5  2013        1   US5174   8731.78  5.32    nan
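For a more explicit variant, here is a sketch of the same idea using groupby.transform, comparing each MC to its group median and blanking out single-row groups (ties at the median go to 'Small' here):
g = df.groupby(['Year', 'Quarter'])['MC']
median = g.transform('median')  # per-group 0.5 quantile
size = g.transform('size')      # per-group row count

df['SMB'] = np.where(df['MC'] <= median, 'Small', 'Big')
df.loc[size < 2, 'SMB'] = np.nan  # quarters with a single row get NaN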

Related

Issue in executing a specific type of nested 'for' loop on columns of a pandas dataframe

I have a pandas dataframe with values like the ones below, though in reality I am working with many more columns and historical data:
   AUD  USD  JPY   EUR
0  0.67    1  140  1.05
I want to iterate over the columns to create a dataframe with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR,
where, for example, AUDUSD is calculated as the product of the AUD and USD columns.
I tried the following:
for col in df:
    for cols in df:
        cf[col+cols] = df[col]*df[cols]
But it generates a table with unnecessary values like AUDAUD and USDUSD, and duplicate values like AUDUSD and USDAUD. I think if I could somehow set "cols = col+1 till end of df" in the second for loop I should be able to resolve the issue, but I don't know how to do that.
The result I am looking for is a table with the below columns and their values:
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this :
from itertools import combinations

combos = list(combinations(df.columns, 2))

out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)

out.columns = out.columns.map("".join)
# Output :
print(out)
   AUDUSD  AUDJPY  AUDEUR  USDJPY  USDEUR  JPYEUR
0    0.67    93.8  0.7035     140    1.05   147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
Your first instinct was to use an inner/outer loop, which is intuitive, and I think this solution works in the same spirit:
# Added a second row for testing
df = pd.DataFrame(
    {'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)

# Instantiate the second DataFrame
cf = pd.DataFrame()

# Iterate over the column positions as integers
for i in range(len(df.columns)):
    # Start the inner index at i + 1 so you aren't looking at the same column
    # twice, and limit the range to the number of columns
    for j in range(i+1, len(df.columns)):
        print(f'{df.columns[i]}{df.columns[j]}')  # VERIFY
        # Build the combined name from the two column names
        combine = f'{df.columns[i]}{df.columns[j]}'
        # Assign the rows to be a product of the two column series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]

print(cf)  # VERIFY
The console log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
   AUDUSD  AUDJPY  AUDEUR  USDJPY  USDEUR  JPYEUR
0    0.67    93.8  0.7035     140    1.05   147.0
1    0.91   118.3  0.9100     130    1.00   130.0

Python - Count row between interval in dataframe

I have a dataset with date, engine, energy and max power columns. Let's say the dataset is composed of 2 machines and a depth of one month. Each machine has a maximum power (say 100 for simplicity) and 3 operating states: nominal power between Pmax and 80% of Pmax, reduced load between 80% and 20% of Pmax, and stop below 20% of Pmax (we consider the machine stopped below 20%).
The idea is to count, per period and machine, the number of times the machine operated in the 2nd interval (between 80% and 20% of Pmax). If a machine drops into a stop it should not be counted, and if it returns from a stop it should not be counted either.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy.ma.extras import _ezclump as ez
data = {'date': ['01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020',
'01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020'],
'engine': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
'b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b',],
'energy': [100,100,100,100,100,80,80,60,60,60,60,60,90,100,100,50,50,40,20,0,0,0,20,50,60,100,100,50,50,50,50,
50,50,100,100,100,80,80,60,60,60,60,60,0,0,0,50,50,100,90,50,50,50,50,50,60,100,100,50,50,100,100],
'pmax': [100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,
100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100]
}
df = pd.DataFrame(data, columns = ['date', 'engine', 'energy', 'pmax'])
df['date'] = df['date'].astype('datetime64[ns]')
df = df.set_index('date')
df['inter'] = df['energy'].apply(lambda x: 2 if x >= 80 else (1 if x < 80 and x >= 20 else 0 ))
liste = []
engine_off = ez((df['inter'] == 1).to_numpy())

for i in engine_off:
    if df.iloc[(i.start)-1, 3] == 0:
        engine_off.remove(i)
    elif df.iloc[(i.stop), 3] == 0:
        engine_off.remove(i)
    else:
        liste.append([df['engine'][i.start], df.index[i.start], df.index[i.stop], i.stop - i.start])

dfend = pd.DataFrame(liste, columns=['engine','begin','end','nb_heure'])
dfend['month'] = dfend['begin'].dt.month_name()
dfgroupe = dfend.set_index('begin').groupby(['engine','month']).agg(['mean','max','min','std','count','sum']).fillna(1)
First I load my data into a DataFrame and classify each row's energy into an interval (2 for nominal operation, 1 for intermediate and 0 for stop).
Then I check for runs where the inter column == 1, which gives me a list of slices with the start and end of each run.
Then I loop to check that the element before and after each slice is different from 0, to exclude drops into a stop or returns from a stop.
Then I create a dataframe from the list, and compute the mean, sum, etc.
The problem is that my list has only 4 drops while there are 5. This comes from the 4th slice (27, 33).
Can someone help me?
Thank you
Here is one way to do it. I tried to use your approach with groups but ended up doing it slightly differently:
# another way to create inter, probably faster on a big dataframe
df['inter'] = pd.cut(df['energy']/df['pmax'], [-1, 0.2, 0.8, 1.01],
                     labels=[0, 1, 2], right=False)

# mask where inter is equal to 1 and group by engine
gr = df['inter'].mask(df['inter'].eq(1)).groupby(df['engine'])

# create a mask that is True for the rows you want
m = (df['inter'].eq(1)      # the rows are 1s
     & ~gr.ffill().eq(0)    # the row before the 1s is not 0
     & ~gr.bfill().eq(0)    # the row after the 1s is not 0
    )

# create dfend with a similar shape to yours
dfend = (df.assign(date=df.index)                 # create a column date for the agg
           .where(m)                              # replace the uninteresting rows by nan
           .groupby(['engine',                    # group by engine
                     m.ne(m.shift()).cumsum()])   # and by group of consecutive 1s
           .agg(begin=('date', 'first'),          # agg date with both the start date
                end=('date', 'last'))             # and the end date
        )

# create the column nb_hours (although here it seems to be nb_days)
dfend['nb_hours'] = (dfend['end'] - dfend['begin']).dt.days + 1
print(dfend)
                  begin        end  nb_hours
engine inter
a      2     2020-01-08 2020-01-12         5
       4     2020-01-28 2020-01-31         4
b      4     2020-01-01 2020-01-02         2
       6     2020-01-20 2020-01-25         6
       8     2020-01-28 2020-01-29         2
And you get the three segments for engine b as required. Then you can
# create dfgroupe
dfgroupe = (dfend.groupby(['engine',                        # group by engine
                           dfend['begin'].dt.month_name()]) # and month name
                 .agg(['mean','max','min','std','count','sum'])  # agg
                 .fillna(1)
            )
print(dfgroupe)
                nb_hours
                    mean max min       std count sum
engine begin
a      January  4.500000   5   4  0.707107     2   9
b      January  3.333333   6   2  2.309401     3  10
I am assuming the following terminology:
- 80 <= energy <= 100 ---> df['inter'] == 2, normal mode.
- 20 <= energy < 80 ---> df['inter'] == 1, intermediate mode.
- 20 > energy ---> df['inter'] == 0, stop mode.
I reckon you want to find those periods of time in which:
1) the machine is operating in intermediate mode, and
2) the status is not changing from intermediate to stop mode or from stop to intermediate mode.
# df['before']: compares each row of df['inter'] with the previous row
# df['after']: compares each row of df['inter'] with the next row
# df['target'] == 1 when both of the above conditions (1 and 2) are met.
# Next we mask the original df and keep the times when conditions 1 and 2 are
# met, then group by machine and month, and obtain the min, max, mean, and so on.
df['before'] = df['inter'].shift(periods=1, fill_value=0)
df['after'] = df['inter'].shift(periods=-1, fill_value=0)
df['target'] = np.where((df['inter'] == 1) & (np.sum(df[['inter', 'before', 'after']], axis=1) > 2), 1, 0)
df['month'] = df.index.month  # 'date' is the index after set_index above
mask = df['target'] == 1
df_group = df[mask].groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum'])

Capitalising Research & Development Expenses with a Pandas df

I am working on a project for my thesis which has to do with capitalising Research & Development (R&D) expenses for a data set of companies that I have.
For those who are not familiar with the financial terminology, I am trying to accumulate each year's R&D expenses with those of the following years, decaying ("depreciating") each value every time period.
For example, here are Apple's R&D expenses for 5 years at a constant depreciation rate of 20%:
year  r&d_exp  dep_rate  r&d_capital
1999       10       0.2           10
2000        8       0.2           16
2001       12       0.2         24.4
2002        7       0.2         25.4
2003       15       0.2           33
If it was not clear, r&d_capital is retrieved the following way:
2000 = 10*(1-0.2) + 8
2001 = 10*(1-0.4) + 8*(1-0.2) + 12
2002 = 10*(1-0.6) + 8*(1-0.4) + 12*(1-0.2) + 7
2003 = 10*(1-0.8) + 8*(1-0.6) + 12*(1-0.4) + 7*(1-0.2) + 15
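Reading off the pattern, each earlier year's expense contributes whatever fraction of it has not yet been depreciated, so in general (with a constant rate):
r&d_capital(t) = sum over all years k <= t of r&d_exp(k) * (1 - dep_rate * (t - k))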
How can I automate this calculation in a pandas DataFrame?
Note also that I have more than one firm in my dataframe.
Thank you in advance for the help :)
I'm sure there's a better way to do it, but using a for loop and indexing you can add the 'r&d_exp' and 'dep_rate' appropriately:
import pandas as pd
import numpy as np

df = pd.DataFrame(((1999, 10, 0.2, 10),
                   (2000, 8, 0.2, 16),
                   (2001, 12, 0.2, 24.4),
                   (2002, 7, 0.2, 25.4),
                   (2003, 15, 0.2, 33)),
                  columns=('year', 'r&d_exp', 'dep_rate', 'r&d_capital'))
We can use indexing and a list comprehension to sum the contributions for each year:
# set to zero to show that the correct values are recovered
df['r&d_capital'] = 0
print(df['r&d_capital'].values)
>>> array([0, 0, 0, 0, 0])

df['r&d_capital'] = [(df['r&d_exp'].iloc[:i] * (1 - df['dep_rate'].iloc[:i]*np.arange(i)[::-1])).sum()
                     for i in range(1, len(df)+1)]

df['r&d_capital'].values
>>> array([10. , 16. , 24.4, 25.4, 33. ])
We use df['r&d_exp'].iloc[:i] to extract the series up to index i, and a reversed array of indices np.arange(i)[::-1] to generate the total depreciation applied to each expense at the year in question. Importantly this array is reversed, so that earlier values accumulate more periods of depreciation. This gives the value of each initial investment after depreciation at the year in question, and summing all of these contributions gives the total capital. This method already handles differing depreciation rates.
In principle this can be extended to other firms easily.
I hope this helps.
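For the multi-firm case, one hedged sketch is to wrap the same comprehension in a function and apply it per group; the 'firm' column here is hypothetical (not in the original data):
def rd_capital(group):
    exp = group['r&d_exp']
    rate = group['dep_rate']
    # same cumulative, depreciation-weighted sum as above, per firm
    capital = [(exp.iloc[:i] * (1 - rate.iloc[:i] * np.arange(i)[::-1])).sum()
               for i in range(1, len(group) + 1)]
    return pd.Series(capital, index=group.index)

df['r&d_capital'] = df.groupby('firm', group_keys=False).apply(rd_capital)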

How to pick values from column and add them to a mathematical function (Pandas)

I'm trying to pick all the values from a dataframe column I have and apply them to a mathematical function.
Here's how the data looks:
Year % PM
1 2002 3
2 2002 2.3
I am trying to apply this function:
M = 100000
t = (THE PERCENTAGE FROM THE DATAFRAME)/12
n = 15*12
PM = M*((t*(1+t)**n)/(((1+t)**n)-1))
print(PM)
My goal is to do this for every row and append each result to a PM column in the dF.
You can just add the formula as a column directly to the DF, creating t_div_12 as a vector from the column as below:
M = 100000
n = 15*12
t_div_12 = df["%"]/12
df["PM"] = M*((t_div_12*(1+t_div_12)**n)/(((1+t_div_12)**n)-1))
First, I would avoid defining named constants that are used only once. You can apply the function to your dataframe with this snippet:
dF = pd.DataFrame([[2002, 3], [2002, 2.3]], columns=["Year", "%"])
dF['PM'] = 100000*((dF["%"]/12*(1+dF["%"]/12)**(15*12))/(((1+dF["%"]/12)**(15*12))-1))
It will give you:
   Year    %            PM
0  2002  3.0  25000.000000
1  2002  2.3  19166.666667
Alternatively, with Series.map (reusing M and n as defined in the question):
df['PM'] = df['%'].map(lambda t: M*(((t/12)*(1+(t/12))**n)/(((1+(t/12))**n)-1)))
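For completeness, here is a minimal run of the map version against the question's data, with M and n as defined in the question:
import pandas as pd

M = 100000  # loan amount, from the question
n = 15*12   # number of monthly payments, from the question

dF = pd.DataFrame([[2002, 3], [2002, 2.3]], columns=["Year", "%"])
dF['PM'] = dF['%'].map(lambda t: M*(((t/12)*(1+(t/12))**n)/(((1+(t/12))**n)-1)))
print(dF)   # matches the output shown above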

pivoting pandas dataframe into prefixed cols, not a MultiIndex

I have a timeseries dataframe that is similar to:
ts = pd.DataFrame([['Jan 2000','WidgetCo',0.5, 2], ['Jan 2000','GadgetCo',0.3, 3], ['Jan 2000','SnazzyCo',0.2, 4],
                   ['Feb 2000','WidgetCo',0.4, 2], ['Feb 2000','GadgetCo',0.5, 2.5], ['Feb 2000','SnazzyCo',0.1, 4],
                  ], columns=['month','company','share','price'])
Which looks like:
      month   company  share  price
0  Jan 2000  WidgetCo    0.5    2.0
1  Jan 2000  GadgetCo    0.3    3.0
2  Jan 2000  SnazzyCo    0.2    4.0
3  Feb 2000  WidgetCo    0.4    2.0
4  Feb 2000  GadgetCo    0.5    2.5
5  Feb 2000  SnazzyCo    0.1    4.0
I can pivot this table like so:
pd.pivot_table(ts,index='month', columns='company')
Which gets me:
            share                       price
company  GadgetCo SnazzyCo WidgetCo  GadgetCo SnazzyCo WidgetCo
month
Feb 2000      0.5      0.1      0.4       2.5        4        2
Jan 2000      0.3      0.2      0.5       3.0        4        2
This is what I want except that I need to collapse the MultiIndex so that the company is used as a prefix for share and price like so:
WidgetCo_share WidgetCo_price GadgetCo_share GadgetCo_price ...
month
Jan 2000 0.5 2 0.3 3.0
Feb 2000 0.4 2 0.5 2.5
I came up with this function to do just that but it seems like a poor solution:
def pivot_table_to_flat(df, column, index):
    res = df.set_index(index)
    cols = res.drop(column, axis=1).columns.values
    resulting_cols = []
    for prefix in res[column].unique():
        for col in cols:
            new_col_name = prefix + '_' + col
            res[new_col_name] = res[res[column] == prefix][col]
            resulting_cols.append(new_col_name)
    return res[resulting_cols]

pivot_table_to_flat(ts, index='month', column='company')
What is a better way of accomplishing a pivot resulting in a columns with prefixes as opposed to a MultiIndex?
This seems even simpler:
df.columns = [' '.join(col).strip() for col in df.columns.values]
It takes a df with a multiindex column and flattens the column labels, with the df remaining in place.
(ref: Andy Hayden, "Python Pandas - How to flatten a hierarchical index in columns")
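Applied to the pivot above, a quick sketch (note the flattened names come out field-first, e.g. 'share GadgetCo', rather than company-first as requested):
df = pd.pivot_table(ts, index='month', columns='company')
df.columns = [' '.join(col).strip() for col in df.columns.values]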
I figured it out. Using the data on the MultiIndex makes for a pretty clean solution:
def flatten_multi_index(df):
    mi = df.columns
    suffixes, prefixes = mi.levels
    col_names = [prefixes[i_p] + '_' + suffixes[i_s] for (i_s, i_p) in zip(*mi.labels)]
    df.columns = col_names
    return df

flatten_multi_index(pd.pivot_table(ts, index='month', columns='company'))
The above version only handles a 2D MultiIndex but it could be generalized if needed.
An update (as of early 2017 and pandas 0.19.2). You can use .values on a MultiIndex, so this snippet should flatten MultiIndexes for those in need. The snippet is both too clever and not clever enough: it can handle either the row index or the column names of the DataFrame, but it will blow up if the result of getattr(df, way) isn't nested (i.e., isn't a MultiIndex).
def flatten_multi(df, way='index'):  # or way='columns'
    assert way in {'index', 'columns'}, "I'm sorry Dave."
    mi = getattr(df, way)
    flat_names = ["_".join(s) for s in mi.values]
    setattr(df, way, flat_names)
    return df
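On recent pandas versions (where MultiIndex.labels has been removed in favor of codes), a minimal sketch of the same flattening that also puts the company prefix first, as the question asked:
pivoted = pd.pivot_table(ts, index='month', columns='company')
# pivot_table yields (field, company) column pairs; swap them when joining
pivoted.columns = [f'{company}_{field}' for field, company in pivoted.columns]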
