How to calculate with previous values in a Pandas MultiIndex DataFrame?

I have the following MultiIndex dataframe.
                   Close  ATR
Date       Symbol
1990-01-01 A          24    2
1990-01-01 B          72    7
1990-01-01 C          40  3.4
1990-01-02 A          21  1.5
1990-01-02 B          65    6
1990-01-02 C          45  4.2
1990-01-03 A          19  2.5
1990-01-03 B          70  6.3
1990-01-03 C          51    5
I want to calculate three columns:
Shares = previous day's Equity * 0.02 / ATR, rounded down to a whole number
Profit = Shares * Close
Equity = previous day's Equity + the sum of Profit across all Symbols for that day
Equity has an initial value of 10,000.
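For example, on 1990-01-02 for symbol A: Shares = floor(10000 * 0.02 / 1.5) = 133 and Profit = 133 * 21 = 2793; the day's Profit across all three symbols (2793 + 2145 + 2115 = 7053) brings Equity to 17053.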
The expected output is:
                   Close  ATR  Shares  Profit  Equity
Date       Symbol
1990-01-01 A          24    2       0       0   10000
1990-01-01 B          72    7       0       0   10000
1990-01-01 C          40  3.4       0       0   10000
1990-01-02 A          21  1.5     133    2793   17053
1990-01-02 B          65    6      33    2145   17053
1990-01-02 C          45  4.2      47    2115   17053
1990-01-03 A          19  2.5     136    2584   26885
1990-01-03 B          70  6.3      54    3780   26885
1990-01-03 C          51    5      68    3468   26885
I suppose I need a for loop or a function applied to each row. With these I have two issues. First, I'm not sure how to write a for loop for this logic in the case of a MultiIndex dataframe. Second, my dataframe is pretty large (something like 10 million rows), so I'm not sure a for loop would be a good idea. But then how can I create these columns?

This solution can surely be cleaned up, but will produce your desired output. I've included your initial conditions in the construction of your sample dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['1990-01-01', '1990-01-01', '1990-01-01',
                            '1990-01-02', '1990-01-02', '1990-01-02',
                            '1990-01-03', '1990-01-03', '1990-01-03'],
                   'Symbol': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'Close': [24, 72, 40, 21, 65, 45, 19, 70, 51],
                   'ATR': [2, 7, 3.4, 1.5, 6, 4.2, 2.5, 6.3, 5],
                   'Shares': [0, 0, 0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'Profit': [0, 0, 0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
Gives:
         Date Symbol  Close  ATR  Shares  Profit
0  1990-01-01      A     24  2.0     0.0     0.0
1  1990-01-01      B     72  7.0     0.0     0.0
2  1990-01-01      C     40  3.4     0.0     0.0
3  1990-01-02      A     21  1.5     NaN     NaN
4  1990-01-02      B     65  6.0     NaN     NaN
5  1990-01-02      C     45  4.2     NaN     NaN
6  1990-01-03      A     19  2.5     NaN     NaN
7  1990-01-03      B     70  6.3     NaN     NaN
8  1990-01-03      C     51  5.0     NaN     NaN
Then use groupby() with apply() and track your Equity globally. Took me a second to realize that the nature of this problem requires you to group on two separate columns individually (Symbol and Date):
start = 10000
Equity = 10000

def calcs(x):
    global Equity
    if x.index[0] == 0:
        return x  # skip the first group (no previous day's Equity to trade on)
    x['Shares'] = np.floor(Equity * 0.02 / x['ATR'])
    x['Profit'] = x['Shares'] * x['Close']
    Equity += x['Profit'].sum()
    return x
df = df.groupby('Date').apply(calcs)
df['Equity'] = df.groupby('Date')['Profit'].transform('sum')
df['Equity'] = df.groupby('Symbol')['Equity'].cumsum()+start
This yields:
         Date Symbol  Close  ATR  Shares  Profit   Equity
0  1990-01-01      A     24  2.0     0.0     0.0  10000.0
1  1990-01-01      B     72  7.0     0.0     0.0  10000.0
2  1990-01-01      C     40  3.4     0.0     0.0  10000.0
3  1990-01-02      A     21  1.5   133.0  2793.0  17053.0
4  1990-01-02      B     65  6.0    33.0  2145.0  17053.0
5  1990-01-02      C     45  4.2    47.0  2115.0  17053.0
6  1990-01-03      A     19  2.5   136.0  2584.0  26885.0
7  1990-01-03      B     70  6.3    54.0  3780.0  26885.0
8  1990-01-03      C     51  5.0    68.0  3468.0  26885.0
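If you'd rather avoid the global, a minimal sketch of the same sequential logic that carries the equity forward explicitly (assuming the df built above, where day-one Shares/Profit are pre-seeded to 0):
import numpy as np
import pandas as pd

equity = 10000
frames = []
for date, g in df.groupby('Date', sort=True):  # process days in order
    g = g.copy()
    if frames:  # the first day has no previous equity to trade on
        g['Shares'] = np.floor(equity * 0.02 / g['ATR'])
        g['Profit'] = g['Shares'] * g['Close']
        equity += g['Profit'].sum()  # equity grows by the day's total profit
    g['Equity'] = equity
    frames.append(g)
result = pd.concat(frames)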

Can you try using shift() with groupby()? Once you have the previous row's value, all the column operations are straightforward.
table2['previous'] = table2['close'].groupby('symbol').shift(1)
table2
                   close  atr  previous
date       symbol
1990-01-01 A          24    2       NaN
           B          72    7       NaN
           C          40  3.4       NaN
1990-01-02 A          21  1.5        24
           B          65    6        72
           C          45  4.2        40
1990-01-03 A          19  2.5        21
           B          70  6.3        65
           C          51    5        45
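For instance, once previous exists, the per-symbol operations are plain vectorized arithmetic. A minimal illustration (change is a hypothetical column name):
table2['change'] = table2['close'] - table2['previous']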

Related

How to melt multiple columns into one without the column names

I have a pandas dataframe with multiple columns and I would like to create a new dataframe by flattening all columns into one using the melt function. But I do not want the column names from the original dataframe to be a part of the new dataframe.
Below is the sample dataframe and code. Is there a way to make it more concise?
                 date  Col1  Col2  Col3  Col4
  1990-01-02 12:00:00    24    24  24.8  24.8
  1990-01-02 01:00:00    59    58    60  60.3
  1990-01-02 02:00:00  43.7  43.9    48    49
The output desired:
Rates
0 24
1 59
2 43.7
3 24
4 58
5 43.9
6 24.8
7 60
8 48
9 24.8
10 60.3
11 49
Code:
df = df.melt(var_name='ColumnNames', value_name='Rates')  # using melt to flatten the columns
df.drop(['ColumnNames'], axis=1, inplace=True)  # dropping 'ColumnNames'
Set value_name and value_vars params for your purpose:
In [137]: pd.melt(df, value_name='Price', value_vars=df.columns[1:]).drop('variable', axis=1)
Out[137]:
Price
0 24.0
1 59.0
2 43.7
3 24.0
4 58.0
5 43.9
6 24.8
7 60.0
8 48.0
9 24.8
10 60.3
11 49.0
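If you would rather keep the Rates header from the question instead of Price, the same call only needs a different value_name:
out = pd.melt(df, value_name='Rates', value_vars=df.columns[1:]).drop('variable', axis=1)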
As an alternative you can use stack() and transpose():
dfx = df.T.stack().reset_index(drop=True) #date must be index.
Output:
0
0 24.0
1 59.0
2 43.7
3 24.0
4 58.0
5 43.9
6 24.8
7 60.0
8 48.0
9 24.8
10 60.3
11 49.0
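Since df.T would otherwise sweep the date column into the stacked values, the comment's "date must be index" can be made explicit, and naming the Series restores a header. A sketch:
dfx = df.set_index('date').T.stack().reset_index(drop=True).rename('Rates')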

Stack the dataframe rows and columns into one row + replace the NaN values with the day before or after

I have a df whose values I want to stack into one row. First I want to select a specific time window, and replace the NaN values with the value from the day before. Here is a simple example: I only want to choose the values in 2020, stack them ordered by time, and also replace each NaN value with the value from the day before.
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['day'] = ['2020-01-01', '2019-01-01', '2020-01-02', '2020-01-03', '2018-01-01',
             '2020-01-15', '2020-03-01', '2020-02-01', '2017-01-01']
df['value_1'] = [1, np.nan, 32, 48, 5, -1, 5, 10, 2]
df['value_2'] = [np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
day value_1 value_2
0 2020-01-01 1.0 NaN
1 2019-01-01 NaN 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
4 2018-01-01 5.0 15.0
5 2020-01-15 -1.0 21.0
6 2020-03-01 5.0 15.0
7 2020-02-01 10.0 12.0
8 2017-01-01 2.0 39.0
The output:
_1 _2 _3 _4 _5 _6 _7 _8 _9 _10 _11 _12
0 1 121 1 23 48 34 -1 21 10 12 -1 21
I have tried to use the following code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols)
            .groupby(level=0, axis=1)
            .apply(lambda x: x.ffill(axis=1).bfill(axis=1))
            .sort_index(axis=1, level=1))
I don't know exactly what the output is supposed to be, but I think this should do at least part of what you're trying to do:
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:,val_cols].ffill().loc[filter_2020]
print(df)
day value_1 value_2
8 2017-01-01 2.0 39.0
4 2018-01-01 5.0 15.0
1 2019-01-01 NaN 121.0
0 2020-01-01 1.0 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
5 2020-01-15 -1.0 21.0
7 2020-02-01 10.0 12.0
6 2020-03-01 5.0 15.0
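For the remaining "stack into one row" step, a sketch that assumes the flattened row should be the 2020 rows ordered by day, value_1 then value_2 per day (the _1, _2, ... headers are rebuilt to match the question's expected layout; filter_2020 and val_cols come from the code above):
vals = df.loc[filter_2020].sort_values('day')[val_cols].to_numpy().ravel()
one_row = pd.DataFrame([vals], columns=[f'_{i + 1}' for i in range(vals.size)])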

Averaging column values by several intervals in Python

I have a dataframe with depth and other value columns:
import pandas as pd

data = {'Depth': [1.0, 1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0, 4.0, 5.0, 5.5, 6.0],
        'Value1': [44, 46, 221, 12, 47, 44, 67, 90, 100, 111, 112, 120, 122],
        'Value2': [55, 65, 76, 45, 55, 58, 23, 12, 32, 20, 22, 26, 36]}
df = pd.DataFrame(data)
As you can see, sometimes there are repetitions in Depth.
I'd like to be able to somehow groupby intervals and average over them.
For example an output I desire would be:
intervals = [1.0, 2.0]
Taking a list of intervals and breaking up the data set on those intervals to average per value (Value1, Value2) to get:
Depth Value1 Value2 Avg1_1 Avg2_1 Avg1_2 Avg2_2
0 1.0 44 55 80.75 60.25 78.2 .
1 1.0 46 65 80.75 60.25 78.2 .
2 1.5 221 76 80.75 60.25 78.2 .
3 2.0 12 45 80.75 60.25 78.2
4 2.5 47 55 52.67 . 78.2
5 2.5 44 58 52.67 . 78.2
6 3.0 67 23 52.67 . 78.2
7 3.5 90 12 100.33 78.2
8 4.0 100 32 100.33 78.2
9 4.0 111 20 100.33 78.2
10 5.0 112 22 112 .
11 5.5 120 26 121 .
12 6.0 122 36 121 .
Where Avg1_1 is the average of Value1 over every interval of 1.0 (which includes 1.0-2.0, 2.5-3.0, ...etc.).
Is there an easy way to do this using groupby in a loop?
You can accomplish this with the dataframe's apply method, using boolean indexing to select the rows (and associated values) that satisfy a condition such as Depth <= depth + 1.0 or Depth <= depth + 2.0.
df['avg1_1'] = df.apply(lambda x: (df[df['Depth'] <= x['Depth'] + 1.0]['Value1'].values.sum() /
                                   len(df[df['Depth'] <= x['Depth'] + 1.0]['Value1'].values)),
                        axis=1)
df['avg2_1'] = df.apply(lambda x: (df[df['Depth'] <= x['Depth'] + 1.0]['Value2'].values.sum() /
                                   len(df[df['Depth'] <= x['Depth'] + 1.0]['Value2'].values)),
                        axis=1)
df['avg1_2'] = df.apply(lambda x: (df[df['Depth'] <= x['Depth'] + 2.0]['Value1'].values.sum() /
                                   len(df[df['Depth'] <= x['Depth'] + 2.0]['Value1'].values)),
                        axis=1)
df['avg2_2'] = df.apply(lambda x: (df[df['Depth'] <= x['Depth'] + 2.0]['Value2'].values.sum() /
                                   len(df[df['Depth'] <= x['Depth'] + 2.0]['Value2'].values)),
                        axis=1)
This would return:
    Depth  Value1  Value2     avg1_1     avg2_1     avg1_2     avg2_2
0     1.0      44      55  80.750000  60.250000  68.714286  53.857143
1     1.0      46      65  80.750000  60.250000  68.714286  53.857143
2     1.5     221      76  69.000000  59.000000  71.375000  48.625000
3     2.0      12      45  68.714286  53.857143  78.200000  44.100000
4     2.5      47      55  71.375000  48.625000  78.200000  44.100000
5     2.5      44      58  71.375000  48.625000  78.200000  44.100000
6     3.0      67      23  78.200000  44.100000  81.272727  42.090909
7     3.5      90      12  78.200000  44.100000  84.500000  40.750000
8     4.0     100      32  81.272727  42.090909  87.384615  40.384615
9     4.0     111      20  81.272727  42.090909  87.384615  40.384615
10    5.0     112      22  87.384615  40.384615  87.384615  40.384615
11    5.5     120      26  87.384615  40.384615  87.384615  40.384615
12    6.0     122      36  87.384615  40.384615  87.384615  40.384615
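The question's expected numbers (80.75, 52.67, 100.33, ...) look like fixed, non-overlapping bins rather than the rolling windows above. If that is the intent, a pd.cut-based sketch reproduces them for an interval width of 1.0 (the bin edges and the Avg1_1/Avg2_1 names follow the question; this is an assumption about the desired grouping):
import numpy as np

# fixed Depth bins of width 1.0: [1, 2], (2, 3], (3, 4], (4, 5], (5, 6]
bins = pd.cut(df['Depth'],
              bins=np.arange(1.0, df['Depth'].max() + 1.0, 1.0),
              include_lowest=True)
df['Avg1_1'] = df.groupby(bins)['Value1'].transform('mean')
df['Avg2_1'] = df.groupby(bins)['Value2'].transform('mean')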

Pandas dataframe Groupby and retrieve date range

Here is my dataframe that I am working on. There are two pay periods defined:
first 15 days and last 15 days for each month.
date employee_id hours_worked id job_group report_id
0 2016-11-14 2 7.50 385 B 43
1 2016-11-15 2 4.00 386 B 43
2 2016-11-30 2 4.00 387 B 43
3 2016-11-01 3 11.50 388 A 43
4 2016-11-15 3 6.00 389 A 43
5 2016-11-16 3 3.00 390 A 43
6 2016-11-30 3 6.00 391 A 43
I need to group by employee_id and job_group, but at the same time
I have to preserve the pay-period date range for each grouped row.
For example, the grouped results would be like the following (shown here for employee_id 2 and 3):
Expected Output:
date employee_id hours_worked job_group report_id
1 2016-11-15 2 11.50 B 43
2 2016-11-30 2 4.00 B 43
4 2016-11-15 3 17.50 A 43
5 2016-11-16 3 9.00 A 43
Is this possible using pandas dataframe groupby?
Use the semi-month frequency 'SM' with Grouper, and at the end add SemiMonthEnd:
df['date'] = pd.to_datetime(df['date'])
d = {'hours_worked': 'sum', 'report_id': 'first'}
df = (df.groupby(['employee_id', 'job_group',
                  pd.Grouper(freq='SM', key='date', closed='right')])
        .agg(d)
        .reset_index())
df['date'] = df['date'] + pd.offsets.SemiMonthEnd(1)
print(df)
employee_id job_group date hours_worked report_id
0 2 B 2016-11-15 11.5 43
1 2 B 2016-11-30 4.0 43
2 3 A 2016-11-15 17.5 43
3 3 A 2016-11-30 9.0 43
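For intuition, SemiMonthEnd(1) rolls a date forward to the next semi-month boundary, i.e. the 15th or the month end. A quick check on two hypothetical dates:
print(pd.Timestamp('2016-11-10') + pd.offsets.SemiMonthEnd(1))  # 2016-11-15
print(pd.Timestamp('2016-11-16') + pd.offsets.SemiMonthEnd(1))  # 2016-11-30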
a. First, (for each employee_id) use multiple Groupers with .sum() on the hours_worked column. Second, use DateOffset to obtain a semi-monthly date column. After these two steps, assign the date in the grouped DF based on two brackets (date ranges): if the day of month (from the date column) is <= 15, set the day in date to 15; otherwise set it to the month-end day (computed with MonthEnd in the code below). This day is then used to assemble a new date.
b. (For each employee_id) get the .last() record for the job_group and report_id columns.
c. Merge a. and b. on the employee_id key.
# a.
hours = (df.groupby([pd.Grouper(key='employee_id'),
                     pd.Grouper(key='date', freq='SM')])['hours_worked']
           .sum()
           .reset_index())
hours['date'] = pd.to_datetime(hours['date'])
hours['date'] = hours['date'] + pd.DateOffset(days=14)

# Assign day based on bracket (date range) 0-15 or bracket (date range) >15
from pandas.tseries.offsets import MonthEnd
hours['bracket'] = hours['date'] + MonthEnd(0)
hours['bracket'] = pd.to_datetime(hours['bracket']).dt.day
hours.loc[hours['date'].dt.day <= 15, 'bracket'] = 15
hours['date'] = pd.to_datetime(dict(year=hours['date'].dt.year,
                                    month=hours['date'].dt.month,
                                    day=hours['bracket']))
hours.drop('bracket', axis=1, inplace=True)

# b.
others = (df.groupby('employee_id')[['job_group', 'report_id']]
            .last()
            .reset_index())

# c.
merged = hours.merge(others, how='inner', on='employee_id')
Raw data for employee_id==1 and employee_id==3:
df.sort_values(by=['employee_id','date'], inplace=True)
print(df[df.employee_id.isin([1,3])])
index date employee_id hours_worked id job_group report_id
0 0 2016-11-14 1 7.5 481 A 43
10 10 2016-11-21 1 6.0 491 A 43
11 11 2016-11-22 1 5.0 492 A 43
15 15 2016-12-14 1 7.5 496 A 43
25 25 2016-12-21 1 6.0 506 A 43
26 26 2016-12-22 1 5.0 507 A 43
6 6 2016-11-02 3 6.0 487 A 43
4 4 2016-11-08 3 6.0 485 A 43
3 3 2016-11-09 3 11.5 484 A 43
5 5 2016-11-11 3 3.0 486 A 43
20 20 2016-11-12 3 3.0 501 A 43
21 21 2016-12-02 3 6.0 502 A 43
19 19 2016-12-08 3 6.0 500 A 43
18 18 2016-12-09 3 11.5 499 A 43
Output
print(merged)
employee_id date hours_worked job_group report_id
0 1 2016-11-15 7.5 A 43
1 1 2016-11-30 11.0 A 43
2 1 2016-12-15 7.5 A 43
3 1 2016-12-31 11.0 A 43
4 2 2016-11-15 31.0 B 43
5 2 2016-12-15 31.0 B 43
6 3 2016-11-15 29.5 A 43
7 3 2016-12-15 23.5 A 43
8 4 2015-03-15 5.0 B 43
9 4 2016-02-29 5.0 B 43
10 4 2016-11-15 5.0 B 43
11 4 2016-11-30 15.0 B 43
12 4 2016-12-15 5.0 B 43
13 4 2016-12-31 15.0 B 43

pandas cumulative sum of stock in warehouse

Consider the warehouse stocks on different days:
day action quantity symbol
0 1 40 a
1 1 53 b
2 -1 21 a
3 1 21 b
4 -1 2 a
5 1 42 b
Here, day represents the time series, and action represents a buy (1) or sell (-1) of the given quantity of a specific product (symbol).
For this dataframe, how do I calculate the cumulative sum daily, for each product?
Basically, a resultant dataframe as below:
days a b
0 40 0
1 40 53
2 19 53
3 19 64
4 17 64
5 17 106
I have tried cumsum() with groupby and was unsuccessful with it.
Using pivot_table
In [920]: dff = df.pivot_table(index=['day', 'action'], columns='symbol',
     ...:                      values='quantity').reset_index()
In [921]: dff
Out[921]:
symbol day action a b
0 0 1 40.0 NaN
1 1 1 NaN 53.0
2 2 -1 21.0 NaN
3 3 1 NaN 21.0
4 4 -1 2.0 NaN
5 5 1 NaN 42.0
Then, mul the action, take cumsum, forward fill missing values, and finally replace NaNs with 0
In [922]: dff[['a', 'b']].mul(df.action, 0).cumsum().ffill().fillna(0)
Out[922]:
symbol a b
0 40.0 0.0
1 40.0 53.0
2 19.0 53.0
3 19.0 74.0
4 17.0 74.0
5 17.0 116.0
Final result
In [926]: dff[['a', 'b']].mul(df.action, 0).cumsum().ffill().fillna(0).join(df.day)
Out[926]:
a b day
0 40.0 0.0 0
1 40.0 53.0 1
2 19.0 53.0 2
3 19.0 74.0 3
4 17.0 74.0 4
5 17.0 116.0 5
Nevermind, didn't see pandas tag. This is just plain Python.
Try this:
# `data` is the question's table as a list of dicts
data = [{'action': a, 'quantity': q, 'symbol': s}
        for a, q, s in [(1, 40, 'a'), (1, 53, 'b'), (-1, 21, 'a'),
                        (1, 21, 'b'), (-1, 2, 'a'), (1, 42, 'b')]]

sums = []
currentsums = {'a': 0, 'b': 0}
for i in data:
    currentsums[i['symbol']] += i['action'] * i['quantity']
    sums.append({'a': currentsums['a'], 'b': currentsums['b']})
Note that it gives a different result than you posted because you calculated wrong.
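For completeness, a compact pandas sketch of the same running totals (column names as in the question; it reproduces the corrected numbers above, e.g. 74 and 116 for b):
out = (df.assign(signed=df['action'] * df['quantity'])
         .pivot_table(index='day', columns='symbol', values='signed',
                      aggfunc='sum', fill_value=0)
         .cumsum())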
