I am working on a project for my thesis, which has to do with the capitalization of Research & Development (R&D) expenses for a data set of companies that I have.
For those not familiar with the financial terminology: I am trying to accumulate each year's R&D expense into a running capital stock, where every prior year's expense is decayed (or "depreciated") by a fixed amount in each subsequent period.
For example if we have Apple's R&D expenses for 5 years at a constant depreciation rate of 20%:
year  r&d_exp  dep_rate  r&d_capital
1999  10       0.2       10
2000  8        0.2       16
2001  12       0.2       24.4
2002  7        0.2       25.4
2003  15       0.2       33
If it is not clear, r&d_capital is computed the following way:
2000 = 10*(1-0.2) + 8
2001 = 10*(1-0.4) + 8*(1-0.2) + 12
2002 = 10*(1-0.6) + 8*(1-0.4) + 12*(1-0.2) + 7
2003 = 10*(1-0.8) + 8*(1-0.6) + 12*(1-0.4) + 7*(1-0.2) + 15
How can I automate this calculation in a pandas DataFrame?
Also consider that I have more than one firm in my dataframe.
Thank you in advance for the help :)
I'm sure there's a better way to do it, but using a for loop and indexing you can add the 'r&d_exp' and 'dep_rate' appropriately:
import pandas as pd
import numpy as np
df = pd.DataFrame(((1999, 10, 0.2, 10),
                   (2000, 8,  0.2, 16),
                   (2001, 12, 0.2, 24.4),
                   (2002, 7,  0.2, 25.4),
                   (2003, 15, 0.2, 33)),
                  columns=('year', 'r&d_exp', 'dep_rate', 'r&d_capital'))
We can use indexing and a list comprehension to sum the contributions up to each year:
# set to zero to show that correct values are recovered
df['r&d_capital'] = 0
df['r&d_capital'].values
>>> array([0, 0, 0, 0, 0])
df['r&d_capital'] = [(df['r&d_exp'].iloc[:i] * (1 - df['dep_rate'].iloc[:i]*np.arange(i)[::-1])).sum()
                     for i in range(1, len(df)+1)]
df['r&d_capital'].values
>>> array([10. , 16. , 24.4, 25.4, 33. ])
We use df['r&d_exp'].iloc[:i] to extract the expenses up to index i, and an array of elapsed-year counts, np.arange(i)[::-1], to generate the total depreciation applied to each of them at the year in question. Importantly, this array is reversed so that earlier expenses accumulate more years of depreciation. This gives the remaining value of each initial investment after depreciation at the year in question; all of these contributions are then summed to get the total capital. This method already handles row-specific depreciation rates.
In principle this can be extended to other firms easily.
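For example, with a hypothetical 'firm' column (an assumption; the sample above only has one firm) and rows sorted by year within each firm, a grouped version of the same list comprehension might look like this sketch:
def rnd_capital(g):
    # g is the sub-frame for one firm, already ordered by year
    exp = g['r&d_exp'].to_numpy()
    dep = g['dep_rate'].to_numpy()
    # same accumulation as above: each past expense loses dep_rate per elapsed year
    return pd.Series([(exp[:i] * (1 - dep[:i] * np.arange(i)[::-1])).sum()
                      for i in range(1, len(g) + 1)],
                     index=g.index)

df = df.sort_values(['firm', 'year'])
df['r&d_capital'] = df.groupby('firm', group_keys=False).apply(rnd_capital)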
I hope this helps.
Related
I have a dataset that looks as follows:
data = {'Year': [2012, 2013, 2012, 2013, 2014, 2013],
        'Quarter': [2, 2, 2, 2, 3, 1],
        'ID': ['CH7744', 'US4652', 'CA47441', 'CH1147', 'DE7487', 'US5174'],
        'MC': [3348.22, 8542.55, 11851.2, 15718.1, 29914.7, 8731.78],
        'PB': [2.74, 0.95, 1.57, 2.13, 0.54, 5.32]}
df = pd.DataFrame(data)
Now what I aim to do is to add a new column "SMB" and calculate it as follows:
1. Subset the data based on year and quarter, e.g. get all values where year = 2012 and quarter = 2.
2. Sort the subset by column MC and split it by size into small and big (0.5 quantile).
3. If the value in MC is lower than the 0.5 quantile, add the value "Small" to the newly created column "SMB"; if it is higher than the 0.5 quantile, add the value "Big".
4. Repeat the process for all rows where quarter = 2.
5. For all other rows add np.nan.
So the output should look like this:
data = {'Year': [2012, 2013, 2012, 2013, 2014, 2013],
        'Quarter': [2, 2, 2, 2, 3, 1],
        'ID': ['CH7744', 'US4652', 'CA47441', 'CH1147', 'DE7487', 'US5174'],
        'MC': [3348.22, 8542.55, 11851.2, 15718.1, 29914.7, 8731.78],
        'PB': [2.74, 0.95, 1.57, 2.13, 0.54, 5.32],
        'SMB': ['Small', 'Small', 'Big', 'Big', np.NaN, np.NaN]}
df = pd.DataFrame(data)
I tried to create a loop, but I was unable to properly merge it back into the previous dataframe, as I need the other quarters' values for further calculation. Using the code below I sort of achieved what I wanted, but I had to merge the data back into the original dataset.
I'm sure there is a much nicer way to achieve this.
# Quantile 0.5 for MC sorting (small & big)
smbQuantile = 0.5
Years = df['Year'].unique()
dataframes_list = []

# Calculate Small and Big and merge back into the dataframe
for i in Years:
    df_temp = df.loc[(df['Year'] == i) & (df['Quarter'] == 2)]
    df_temp['SMB'] = ''
    # Assign factor size based on market cap
    df_temp.SMB[df_temp.MC <= df_temp.MC.quantile(smbQuantile)] = 'Small'
    df_temp.SMB[df_temp.MC >= df_temp.MC.quantile(smbQuantile)] = 'Big'
    dataframes_list.append(df_temp)

df = pd.concat(dataframes_list)
You can use groupby.rank and groupby.transform('size') combined with numpy.select:
g = df.groupby(['Year', 'Quarter'])['MC']
df['SMB'] = np.select([g.rank(pct=True).le(0.5),
                       g.transform('size').ge(2)],
                      ['Small', 'Big'], np.nan)
output:
Year Quarter ID MC PB SMB
0 2012 2 CH7744 3348.22 2.74 Small
1 2013 2 US4652 8542.55 0.95 Small
2 2012 2 CA47441 11851.20 1.57 Big
3 2013 2 CH1147 15718.10 2.13 Big
4 2014 3 DE7487 29914.70 0.54 nan
5 2013 1 US5174 8731.78 5.32 nan
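One caveat: because the choices are strings, np.select stores the np.nan default as the literal string 'nan' (as you can see in the output above). If you want a real missing value instead, a small variant of the same idea (just a sketch) labels every row first and then blanks out the single-row groups:
g = df.groupby(['Year', 'Quarter'])['MC']
# label every row by its rank within the (Year, Quarter) group ...
df['SMB'] = np.where(g.rank(pct=True).le(0.5), 'Small', 'Big')
# ... then reset groups with fewer than 2 observations to an actual NaN
df.loc[g.transform('size').lt(2), 'SMB'] = np.nan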
I have started to use Python and I am trying to find outliers per year using quantiles.
My data is organized as follows:
columns of years, and for each year I have months and their corresponding salinity and temperature values.
year=[1997:2021]
month=[1,2...]
SAL=[33,32,50,......,35,...]
Following is my code:
#1st quartile
Q1 = DF['SAL'].quantile(0.25)
#3rd quartile
Q3 = DF['SAL'].quantile(0.75)
#calculate IQR
IQR = Q3 - Q1
print(IQR)
df_out = DF['SAL'][((DF['SAL'] < (Q1 - 1.5 * IQR)) |(DF['SAL'] > (Q3 + 1.5 * IQR)))]
I want to identify the month and year of the outlier and replace it with nan.
To get the outliers per year, you need to compute the quartiles for each year via groupby. Other than that, there's not much to change in your code, but I recently learned about between which seems useful here:
import numpy as np
clean_data = list()
for year, group in DF.groupby('year'):
    Q1 = group['SAL'].quantile(0.25)
    Q3 = group['SAL'].quantile(0.75)
    IQR = Q3 - Q1
    # set all values to np.nan that are not (~) in between the two bounds
    # (in newer pandas versions, use inclusive='neither' instead of inclusive=False)
    group.loc[~group['SAL'].between(Q1 - 1.5 * IQR,
                                    Q3 + 1.5 * IQR,
                                    inclusive=False),
              'SAL'] = np.nan
    clean_data.append(group)
clean_df = pd.concat(clean_data)
You can use the following function. It defines an outlier as a value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, as is classically done for boxplots.
import pandas as pd
import numpy as np
df = pd.DataFrame({'year': np.repeat(range(1997, 2022), 12),
                   'month': np.tile(range(12), 25) + 1,
                   'SAL': (np.random.randint(20, 40, size=12*25)
                           + np.random.choice([0, -20, 20], size=12*25, p=[0.9, 0.05, 0.05])),
                   })
def outliers(s, replace=np.nan):
    Q1, Q3 = np.percentile(s, [25, 75])
    IQR = Q3 - Q1
    return s.where((s > (Q1 - 1.5 * IQR)) & (s < (Q3 + 1.5 * IQR)), replace)
# add new column with excluded outliers
df['SAL_excl'] = df.groupby('year')['SAL'].apply(outliers)
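Depending on your pandas version, groupby(...).apply with a function that returns a Series can come back with an extra index level; groupby(...).transform is a drop-in alternative here that always aligns with the original index (a sketch using the same outliers function):
df['SAL_excl'] = df.groupby('year')['SAL'].transform(outliers)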
Checking that it works:
with outliers:
import seaborn as sns
sns.boxplot(data=df, x='year', y='SAL')
without outliers:
sns.boxplot(data=df, x='year', y='SAL_excl')
NB: it is possible that new outliers appear after filtering, as the remaining data now has new Q1/Q3/IQR values.
How to retrieve rows with outliers:
df[df['SAL_excl'].isna()]
output:
year month SAL SAL_excl
28 1999 5 53 NaN
33 1999 10 7 NaN
94 2004 11 52 NaN
100 2005 5 38 NaN
163 2010 8 6 NaN
182 2012 3 25 NaN
188 2012 9 22 NaN
278 2020 3 53 NaN
294 2021 7 9 NaN
I am working with panel time-series data and am struggling to create a fast for loop that sums up the past 50 values at the current index i. The data is about 600k rows, and the loop starts to churn at around 30k. Is there a way to use pandas or NumPy to do the same in a fraction of the time?
The Change column is of type float, with 4 decimals.
Index Change
0 0.0410
1 0.0000
2 0.1201
... ...
74327 0.0000
74328 0.0231
74329 0.0109
74330 0.0462
SEQ_LEN = 50

for i in range(SEQ_LEN, len(df)):
    df.at[i, 'Change_Sum'] = sum(df['Change'][i-SEQ_LEN:i])
Any help would be highly appreciated! Thank you!
I tried this with 600k rows and the average time was
20.9 ms ± 1.35 ms
This returns a series with the rolling sum of the last 50 Change values in the df:
df['Change'].rolling(50).sum()
you can add it to a new column like so:
df['change50'] = df['Change'].rolling(50).sum()
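One subtlety worth noting: the loop in the question sums the 50 values before row i (it excludes the current row), while rolling(50).sum() includes the current row in the window. If you need exactly the same alignment as the original loop, a shift fixes it (a sketch):
# sum of the previous 50 values, excluding the current row, mirroring the loop in the question
df['Change_Sum'] = df['Change'].rolling(50).sum().shift(1)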
Disclaimer: this solution cannot compete with .rolling(). Also, if it is a .groupby() case, just do df.groupby("group")["Change"].rolling(50).sum() and then reset the index. So please accept the other answer.
The explicit for loop can be avoided by translating your running partial sum into a difference of cumulative sums (cumsum). The formula:
Sum[x-50:x] = Sum[:x] - Sum[:x-50] = Cumsum[x] - Cumsum[x-50]
Code
For showcase purposes, I have shortened len(df["Change"]) to 10 and SEQ_LEN to 5. A million records complete almost immediately this way.
import pandas as pd
import numpy as np
# data
SEQ_LEN = 5
np.random.seed(111) # reproducibility
df = pd.DataFrame(
    data={
        "Change": np.random.normal(0, 1, 10)  # 10 rows for the showcase; works the same for a million
    }
)

# Step 1. Compute the cumulative sum
df["Change_Cumsum"] = df["Change"].cumsum()

# Step 2. Calculate the difference of cumsums: Sum[x-50:x] = Cumsum[x] - Cumsum[x-50]
df["Change_Sum"] = np.nan  # or zero as you wish
df.loc[SEQ_LEN:, "Change_Sum"] = df["Change_Cumsum"].values[SEQ_LEN:] - df["Change_Cumsum"].values[:(-SEQ_LEN)]

# the row at idx = SEQ_LEN-1 is simply the cumulative sum of the first SEQ_LEN values
df.at[SEQ_LEN-1, "Change_Sum"] = df.at[SEQ_LEN-1, "Change_Cumsum"]
Output
df
Out[30]:
Change Change_Cumsum Change_Sum
0 -1.133838 -1.133838 NaN
1 0.384319 -0.749519 NaN
2 1.496554 0.747035 NaN
3 -0.355382 0.391652 NaN
4 -0.787534 -0.395881 -0.395881
5 -0.459439 -0.855320 0.278518
6 -0.059169 -0.914489 -0.164970
7 -0.354174 -1.268662 -2.015697
8 -0.735523 -2.004185 -2.395838
9 -1.183940 -3.188125 -2.792244
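As a quick sanity check on the same showcase frame, the Change_Sum column (from index SEQ_LEN-1 onward) should match pandas' built-in rolling sum, which likewise includes the current row in its window:
# NaN for the first SEQ_LEN-1 rows, then identical to Change_Sum above
df['Change'].rolling(SEQ_LEN).sum()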
Noob here (trying to learn data science) with a simple portfolio in a dataframe. I want to sell a certain number of shares of each company, multiply the number of shares sold by the price, and add the proceeds to the existing cash value (15000), rounding to 2 decimal places. Briefly:
new_port_df =
Name Price Starting Number_of_shares
0 MMM 10.00 50
1 AXP 20.00 100
2 AAPL 30.00 1000
3 Cash 1.00 15000
shares_sold = [[ 5.] [ 15.] [75.] [ 0.]] #(numpy.ndarray, shape (4,1))
new_port_df['Price'] =
0 10.00
1 20.00
2 30.00
3 1.00
Name: Low, dtype: float64 # pandas.core.series.Series
so basically Cash += 5 * 10 + 15 * 20 + 75 * 30 + 0 * 1 or 15000 + 2600 = 17600
As an intermediate step (after googling and reading other posts on here), I've tried:
cash_proceeds = np.dot(shares_sold, new_port_df['Price'])
ValueError: shapes (4,1) and (4,) not aligned: 1 (dim 1) != 4 (dim 0). I think I should be reshaping, but haven't had any luck.
Desired result is below (all working except for the 17600 cell)
updated_port_df =
Name Price Starting Number_of_shares
0 MMM 10.00 45
1 AXP 20.00 85
2 AAPL 30.00 925
3 Cash 1.00 17600 # only the 17600 not working
Simple answers I can understand are preferred to complex ones I can't. Thanks for any help.
Rather than initializing shares_sold as a list of lists, i.e. [[],[],[]], you can just create a flat list of numbers to resolve your np.dot() error:
shares_sold = [5,15,75,0]
cash_proceeds = np.dot(new_port_df['Price'], shares_sold)
Or, as Andy pointed out, if shares_sold is already initialized as a list of lists, you can convert it to an array, flatten it, and proceed from there. My answer won't address the change of approach that entails.
You can then change the last item in your shares_sold list/array to reflect the change in cash from the sale of stock (note that it is saved as a negative because these values will be subtracted from your Number_of_shares column):
shares_sold[3] = -cash_proceeds
Now you can subtract the shares sold from the Number_of_shares column to reflect the change (you indicate you want updated_port_df to house this information, so I first duplicate the initial portfolio and then make the change):
updated_port_df = new_port_df.copy()
updated_port_df['Number_of_shares'] = updated_port_df['Number_of_shares'] - shares_sold
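Putting the pieces together on the sample portfolio (a sketch; it assumes the shares column is simply named Number_of_shares, as in the snippets above):
import numpy as np
import pandas as pd

new_port_df = pd.DataFrame({'Name': ['MMM', 'AXP', 'AAPL', 'Cash'],
                            'Price': [10.0, 20.0, 30.0, 1.0],
                            'Number_of_shares': [50, 100, 1000, 15000]})
shares_sold = [5, 15, 75, 0]

cash_proceeds = np.dot(new_port_df['Price'], shares_sold)  # 2600.0
shares_sold[3] = -cash_proceeds                            # cash goes up rather than down

updated_port_df = new_port_df.copy()
updated_port_df['Number_of_shares'] = (updated_port_df['Number_of_shares'] - shares_sold).round(2)
# the Cash row now shows 17600.0 and the other rows show the reduced share counts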
You may use pandas' dot instead of np.dot. You need a 1-d numpy array to use dot on a Series, so you need to convert shares_sold to 1-d:
shares_sold = np.array([[ 5.], [ 15.], [75.] ,[ 0.]])
shares_sold_1d = shares_sold.flatten()
cash_proceeds = new_port_df['Price'].dot(shares_sold_1d)
In [226]: print(cash_proceeds)
2600.0
To get your desired output, simply use .loc assignment and subtraction:
new_port_df.loc[new_port_df.Name.eq('Cash'), 'Starting_Number_of_shares'] = (
    new_port_df.loc[new_port_df.Name.eq('Cash'), 'Starting_Number_of_shares']
    + cash_proceeds)

new_port_df['Starting_Number_of_shares'] = new_port_df['Starting_Number_of_shares'] - shares_sold_1d
Out[235]:
Name Price Starting_Number_of_shares
0 MMM 10.0 45.0
1 AXP 20.0 85.0
2 AAPL 30.0 925.0
3 Cash 1.0 17600.0
Note: if you really want to use np.dot, you need to swap the order as follows:
In [237]: np.dot(new_port_df['Price'], shares_sold)
Out[237]: array([2600.])
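The result is a length-1 array rather than a scalar because shares_sold keeps its (4, 1) shape; if you want the plain number, one option is to call .item() on it:
cash_proceeds = np.dot(new_port_df['Price'], shares_sold).item()  # 2600.0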
I have a data frame, from which I can select a column (series) as follows:
df:
value_rank
275488 90
275490 35
275491 60
275492 23
275493 23
275494 34
275495 75
275496 40
275497 69
275498 14
275499 83
... ...
value_rank is a previously created percentile rank from a larger dataset. What I am trying to do is to create bins of this dataset, e.g. quintiles:
pd.qcut(df.value_rank, 5, labels=False)
275488 4
275490 1
275491 3
275492 1
275493 1
275494 1
275495 3
275496 2
... ...
This appears fine, as expected, but it isn't.
In fact, I have 1569 rows. The nearest number divisible by 5 is 1565, which would give 1565 / 5 = 313 observations in each bin. Since there are 4 extra records, I would expect 4 bins with 314 observations and one with 313. Instead, I get this:
obs = pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()
0 329
3 314
1 313
4 311
2 302
I have no nans in df, and cannot think of any reason why this is happening. Literally beginning to tear my hair out!
Here is a small example:
df:
value_rank
286742 11
286835 53
286865 40
286930 31
286936 45
286955 27
287031 30
287111 36
287269 30
287310 18
pd.qcut gives this:
pd.qcut(df.value_rank, 5, labels = False).value_counts()
bin count
1 3
4 2
3 2
0 2
2 1
There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!
qcut is trying to compensate for repeating values. This is easier to visualize if you return the bin limits along with your qcut results:
In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')
In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
array([ 11. , 25.2, 30. , 33. , 41. , 53. ]))
You can see that there was no choice but to set a bin limit at 30, so qcut had to "steal" one of the expected values from the third bin and place it in the second. I'm thinking that this is just happening at a larger scale with your percentiles, since you're basically condensing their ranks onto a 1 to 100 scale. Any reason not to just run qcut directly on the data instead of the percentiles, or to return percentiles with greater precision?
Just try the code below:
pd.qcut(df.rank(method='first'), nbins)
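For example, on the small value_rank sample above, ranking with method='first' breaks the ties, so every quintile ends up with exactly two observations (a quick sketch):
import pandas as pd

test_series = pd.Series([11, 18, 27, 30, 30, 31, 36, 40, 45, 53], name='value_rank')
bins = pd.qcut(test_series.rank(method='first'), 5, labels=False)
bins.value_counts().sort_index()
# each of the 5 bins now contains exactly 2 observations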
If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.
test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
np.random.seed(42) #set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list)) #add noise to data
test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)
Output:
(0 0
1 0
2 1
3 2
4 1
5 2
6 3
7 3
8 4
9 4
Name: value_rank, dtype: int64,
array([ 11.37454012, 25.97573801, 30.42160255, 33.11683016,
41.81316392, 53.70807258]))
So now we have two 0's, two 1's, two 2's, two 3's and two 4's!
Disclaimer
Obviously, use this at your discretion, because results can vary based on your data, e.g. how large your dataset is and/or the spacing of the values. The above "trick" works well for integers because even though we are "salting" the test_list, it will still rank-order correctly, in the sense that there won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be tricky and you may have to reduce the size of the noise accordingly. For instance, if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.
This problem arises from duplicate values. A possible solution to force equal sized bins is to use the index as the input for pd.qcut after sorting the dataframe:
import random
df = pd.DataFrame({'A': [random.randint(3, 9) for x in range(20)]}).sort_values('A').reset_index()
del df['index']
df = df.reset_index()
df['A'].plot.hist(bins=30);
picture: https://i.stack.imgur.com/ztjzn.png
df.head()
df['qcut_v1'] = pd.qcut(df['A'], q=4)
df['qcut_v2'] = pd.qcut(df['index'], q=4)
df
picture: https://i.stack.imgur.com/RB4TN.png
df.groupby('qcut_v1').count().reset_index()
picture: https://i.stack.imgur.com/IKtsW.png
df.groupby('qcut_v2').count().reset_index()
picture: https://i.stack.imgur.com/4jrkU.png
Sorry, I cannot post inline images since I don't have at least 10 reputation on Stack Overflow.