I have a pandas dataframe with multiple columns and I would like to create a new dataframe by flattening all columns into one using the melt function. But I do not want the column names from the original dataframe to be a part of the new dataframe.
Below is the sample dataframe and code. Is there a way to make it more concise?
date Col1 Col2 Col3 Col4
1990-01-02 12:00:00 24 24 24.8 24.8
1990-01-02 01:00:00 59 58 60 60.3
1990-01-02 02:00:00 43.7 43.9 48 49
The output desired:
Rates
0 24
1 59
2 43.7
3 24
4 58
5 43.9
6 24.8
7 60
8 48
9 24.8
10 60.3
11 49
Code :
df = df.melt(var_name='ColumnNames', value_name='Rates')  # flatten all columns with melt
df.drop(['ColumnNames'], axis=1, inplace=True)            # drop the 'ColumnNames' column
Set the value_name and value_vars parameters for your purpose, then drop the helper variable column:
In [137]: pd.melt(df, value_name='Rates', value_vars=df.columns[1:]).drop('variable', axis=1)
Out[137]:
    Rates
0 24.0
1 59.0
2 43.7
3 24.0
4 58.0
5 43.9
6 24.8
7 60.0
8 48.0
9 24.8
10 60.3
11 49.0
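If you only need the flattened values and not melt's bookkeeping, here is a minimal NumPy-based sketch of my own (assuming the value columns are everything except date); ravel(order='F') flattens column by column, which matches the order melt produces:
import numpy as np
import pandas as pd

rates = pd.DataFrame({'Rates': df.drop(columns='date').to_numpy().ravel(order='F')})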
As an alternative you can use stack() and transpose():
dfx = df.T.stack().reset_index(drop=True)  # the date column must be set as the index first
Output:
0
0 24.0
1 59.0
2 43.7
3 24.0
4 58.0
5 43.9
6 24.8
7 60.0
8 48.0
9 24.8
10 60.3
11 49.0
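For completeness, a minimal end-to-end sketch of the stack() route (my own assembly, assuming date is still a regular column and that the single output column should be named Rates as in the question):
dfx = (df.set_index('date')   # date must be the index before transposing
         .T
         .stack()
         .reset_index(drop=True)
         .rename('Rates')
         .to_frame())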
Related problem:
For each row of a DataFrame, I want to find the nearest prior row where the 'Datetime' value is at least 20 seconds before the current 'Datetime' value.
For example: if the previous 'Datetime' (at index i-1) is at least 20s earlier than the current one - it will be chosen. Otherwise (e.g. only 5 seconds earlier), move to i-2 and see if it is at least 20s earlier. Repeat until the condition is met, or no such row has been found.
The expected result is a concatenation of the original df and the rows that were found. When no row at least 20s before the current Datetime exists, the new columns are null (NaT or NaN, depending on the type).
Example data
import pandas as pd

df = pd.DataFrame({
    'Datetime': pd.to_datetime([
        f'2016-05-15 08:{M_S}+06:00'
        for M_S in ['36:21', '36:41', '36:50', '37:10', '37:19', '37:39']]),
    'A': [21, 43, 54, 2, 54, 67],
    'B': [3, 3, 45, 23, 8, 6],
})
Example result:
>>> res
Datetime A B Datetime_nearest A_nearest B_nearest
0 2016-05-15 08:36:21+06:00 21 3 NaT NaN NaN
1 2016-05-15 08:36:41+06:00 43 3 2016-05-15 08:36:21+06:00 21.0 3.0
2 2016-05-15 08:36:50+06:00 54 45 2016-05-15 08:36:21+06:00 21.0 3.0
3 2016-05-15 08:37:10+06:00 2 23 2016-05-15 08:36:50+06:00 54.0 45.0
4 2016-05-15 08:37:19+06:00 54 8 2016-05-15 08:36:50+06:00 54.0 45.0
5 2016-05-15 08:37:39+06:00 67 6 2016-05-15 08:37:19+06:00 54.0 8.0
The last three columns are the newly created columns, and the first three columns are the original dataset.
Two vectorized solutions
Note: we assume that the rows are sorted by Datetime. If that is not the case, then sort them first (O[n log n]).
For 10,000 rows:
3.3 ms, using Numpy's searchsorted.
401 ms, using a rolling window of 20s, left-open.
1. Using np.searchsorted
We use np.searchsorted to find in one call the indices of all matching rows:
import numpy as np

min_dt = '20s'  # the minimum required gap
s = df['Datetime']
z = np.searchsorted(s, s - (pd.Timedelta(min_dt) - pd.Timedelta('1ns'))) - 1
E.g., for the OP's data, these indices are:
>>> z
array([-1, 0, 0, 2, 2, 4])
I.e.: z[0] == -1: no matching row; z[1] == 0: row 0 (08:36:21) is the nearest that is 20s or more before row 1 (08:36:41). z[2] == 0: row 0 is the nearest match for row 2 (row 1 is too close). Etc.
Why subtracting 1? We use np.searchsorted to select the first row in the exclusion zone (i.e., too close); then we subtract 1 to get the correct row (the first one at least 20s before).
Why - 1ns? This is to make the search window left-open. A row at exactly 20s before the current one will not be in the exclusion zone, and thus will end up being the one selected as the match.
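To see why the 1ns matters, here is a tiny integer illustration of my own (not the answer's data): with times [0, 20, 25] and a 20-unit threshold, searching for t - 20 directly lands on a row exactly 20 units earlier, so the subsequent - 1 would skip that valid match; subtracting 20 - 1 instead (the integer analogue of 20s - 1ns) keeps it selectable. Using side='right' would achieve the same thing.
import numpy as np

times = np.array([0, 20, 25])
naive     = np.searchsorted(times, times - 20) - 1        # [-1, -1, 0]: the exact 20-unit match is skipped
left_open = np.searchsorted(times, times - (20 - 1)) - 1  # [-1,  0, 0]: row 0 is matched for t == 20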
We then use z to select the matching rows (or nulls) and concatenate into the result. Putting it all in a function:
def select_np(df, min_dt='20s'):
    newcols = [f'{k}_nearest' for k in df.columns]
    s = df['Datetime']
    z = np.searchsorted(s, s - (pd.Timedelta(min_dt) - pd.Timedelta('1ns'))) - 1
    return pd.concat([
        df,
        df.iloc[z].set_axis(newcols, axis=1).reset_index(drop=True).where(pd.Series(z >= 0))
    ], axis=1)
On the OP's example
>>> select_np(df[['Datetime', 'A', 'B']])
Datetime A B Datetime_nearest A_nearest B_nearest
0 2016-05-15 08:36:21+06:00 21 3 NaT NaN NaN
1 2016-05-15 08:36:41+06:00 43 3 2016-05-15 08:36:21+06:00 21.0 3.0
2 2016-05-15 08:36:50+06:00 54 45 2016-05-15 08:36:21+06:00 21.0 3.0
3 2016-05-15 08:37:10+06:00 2 23 2016-05-15 08:36:50+06:00 54.0 45.0
4 2016-05-15 08:37:19+06:00 54 8 2016-05-15 08:36:50+06:00 54.0 45.0
5 2016-05-15 08:37:39+06:00 67 6 2016-05-15 08:37:19+06:00 54.0 8.0
2. Using a rolling window (pure Pandas)
This was our original solution and uses pandas rolling with a Timedelta(20s) window size, left-open. It is still more optimized than a naive (O[n^2]) search, but is roughly 100x slower than select_np(), as pandas uses explicit loops in Python to find the window bounds for .rolling(): see get_window_bounds(). There is also some overhead due to having to make sub-frames, applying a function or aggregate, etc.
def select_pd(df, min_dt='20s'):
    newcols = [f'{k}_nearest' for k in df.columns]
    z = (
        df.assign(rownum=range(len(df)))
        .rolling(pd.Timedelta(min_dt), on='Datetime', closed='right')['rownum']
        .apply(min).astype(int) - 1
    )
    return pd.concat([
        df,
        df.iloc[z].set_axis(newcols, axis=1).reset_index(drop=True).where(z >= 0)
    ], axis=1)
3. Testing
First, we write an arbitrary-size test data generator:
def gen(n):
    return pd.DataFrame({
        'Datetime': pd.Timestamp('2020')
            + np.random.randint(0, 30, n).cumsum() * pd.Timedelta('1s'),
        'A': np.random.randint(0, 100, n),
        'B': np.random.randint(0, 100, n),
    })
Example
np.random.seed(0)
tdf = gen(10)
>>> select_np(tdf)
Datetime A B Datetime_nearest A_nearest B_nearest
0 2020-01-01 00:00:12 21 87 NaT NaN NaN
1 2020-01-01 00:00:27 36 46 NaT NaN NaN
2 2020-01-01 00:00:48 87 88 2020-01-01 00:00:27 36.0 46.0
3 2020-01-01 00:00:48 70 81 2020-01-01 00:00:27 36.0 46.0
4 2020-01-01 00:00:51 88 37 2020-01-01 00:00:27 36.0 46.0
5 2020-01-01 00:01:18 88 25 2020-01-01 00:00:51 88.0 37.0
6 2020-01-01 00:01:21 12 77 2020-01-01 00:00:51 88.0 37.0
7 2020-01-01 00:01:28 58 72 2020-01-01 00:00:51 88.0 37.0
8 2020-01-01 00:01:37 65 9 2020-01-01 00:00:51 88.0 37.0
9 2020-01-01 00:01:56 39 20 2020-01-01 00:01:28 58.0 72.0
Speed
tdf = gen(10_000)
%timeit select_np(tdf)
3.31 ms ± 6.79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit select_pd(tdf)
401 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> select_np(tdf).equals(select_pd(tdf))
True
Scale sweep
We can now compare speed over a range of sizes, using the excellent perfplot package:
import perfplot
perfplot.plot(
    setup=gen,
    kernels=[select_np, select_pd],
    n_range=[2**k for k in range(4, 16)],
    equality_check=lambda a, b: a.equals(b),
)
Focusing on select_np:
perfplot.plot(
    setup=gen,
    kernels=[select_np],
    n_range=[2**k for k in range(4, 24)],
)
The following solution is memory-efficient, but it is not the fastest one, because it iterates over rows.
The fully vectorized version I could come up with would be faster, but it would use O(n^2) memory.
Example dataframe:
import numpy as np
import pandas as pd

timestamps = [pd.Timestamp('2016-01-01 00:00:00'),
              pd.Timestamp('2016-01-01 00:00:19'),
              pd.Timestamp('2016-01-01 00:00:20'),
              pd.Timestamp('2016-01-01 00:00:21'),
              pd.Timestamp('2016-01-01 00:00:50')]
df = pd.DataFrame({'Datetime': timestamps,
                   'A': np.arange(10, 15),
                   'B': np.arange(20, 25)})
   Datetime             A   B
0  2016-01-01 00:00:00  10  20
1  2016-01-01 00:00:19  11  21
2  2016-01-01 00:00:20  12  22
3  2016-01-01 00:00:21  13  23
4  2016-01-01 00:00:50  14  24
Solution:
times = df['Datetime'].to_numpy() # it's convenient to have it as an `ndarray`
shifted_times = times - pd.Timedelta(20, unit='s')
useful is an array of the "useful" indices of df, i.e. the rows for which the appended values will NOT be NaN:
useful = np.nonzero(shifted_times >= times[0])[0]
# useful == [2, 3, 4]
Truncate shifted_times from the beginning - to iterate through useful elements only:
if len(useful) == 0:
    # all new columns will be `nan`s
    first_i = 0  # this value will never actually be used
    useful_shifted_times = np.array([], dtype=shifted_times.dtype)
else:
    first_i = useful[0]  # first_i == 2
    useful_shifted_times = shifted_times[first_i:]
Find the corresponding index positions of df for each "useful" value.
(these index positions are essentially the indices of times
that are selected for each element of useful_shifted_times):
selected_indices = []
# Iterate through `useful_shifted_times` one by one:
# (`i` starts at `first_i`)
for i, shifted_time in enumerate(useful_shifted_times, first_i):
    selected_index = np.nonzero(times[:i] <= shifted_time)[0][-1]
    selected_indices.append(selected_index)
# selected_indices == [0, 0, 3]
Selected rows:
df_nearest = df.iloc[selected_indices].add_suffix('_nearest')
   Datetime_nearest     A_nearest  B_nearest
0  2016-01-01 00:00:00  10         20
0  2016-01-01 00:00:00  10         20
3  2016-01-01 00:00:21  13         23
Replace the index of df_nearest so it matches the corresponding rows of df
(basically, these are the last len(selected_indices) index labels):
df_nearest.index = df.index[len(df) - len(selected_indices) : ]
   Datetime_nearest     A_nearest  B_nearest
2  2016-01-01 00:00:00  10         20
3  2016-01-01 00:00:00  10         20
4  2016-01-01 00:00:21  13         23
Append the selected rows to the original dataframe to get the final result:
new_df = df.join(df_nearest)
   Datetime             A   B   Datetime_nearest     A_nearest  B_nearest
0  2016-01-01 00:00:00  10  20  NaT                  nan        nan
1  2016-01-01 00:00:19  11  21  NaT                  nan        nan
2  2016-01-01 00:00:20  12  22  2016-01-01 00:00:00  10         20
3  2016-01-01 00:00:21  13  23  2016-01-01 00:00:00  10         20
4  2016-01-01 00:00:50  14  24  2016-01-01 00:00:21  13         23
Note: NaT stands for 'Not a Time'. It is the equivalent of nan for time values.
Note: it also works as expected even when the last 'Datetime' minus 20 seconds is still before the very first 'Datetime'; in that case all the new columns are NaN.
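Putting the steps above together into a single function (a sketch of my own assembly, assuming the frame is already sorted by 'Datetime'):
import numpy as np
import pandas as pd

def select_iterative(df, min_dt=pd.Timedelta(20, unit='s')):
    times = df['Datetime'].to_numpy()
    shifted_times = times - min_dt
    useful = np.nonzero(shifted_times >= times[0])[0]
    selected_indices = []
    if len(useful) > 0:
        first_i = useful[0]
        for i, shifted_time in enumerate(shifted_times[first_i:], first_i):
            # last row before position i that is at least min_dt earlier
            selected_indices.append(np.nonzero(times[:i] <= shifted_time)[0][-1])
    df_nearest = df.iloc[selected_indices].add_suffix('_nearest')
    df_nearest.index = df.index[len(df) - len(selected_indices):]
    return df.join(df_nearest)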
First, I want to forward fill my data for EACH UNIQUE VALUE in Group_Id by 1S; basically, group by Group_Id and then resample with ffill.
Here is the data:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
I previously did this:
def daily_average_temperature(dfdf):
    INDEX = dfdf[['Group_Id','Timestamp','Data']]
    INDEX['Timestamp'] = pd.to_datetime(INDEX['Timestamp'])
    INDEX = INDEX.set_index('Timestamp')
    INDEX1 = INDEX.resample('1S').last().fillna(method='ffill')
    return INDEX1
This is wrong, as it did not group the data by the different values of Group_Id first; it simply ignored that column.
Second, I would like to spread the Data values so that each row is one Group_Id, with positional indices as columns instead of Timestamp, looking something like this:
x0 x1 x2 x3 x4 x5 ... Group_Id
0 40 31.05 25.5 25.5 25.5 25 ... 1
1 35 35.75 36.5 36.5 36.5 36.5 ... 2
2 25.5 25.5 25.5 25.5 25.5 25.5 ... 3
3 25.5 25.5 25.5 25.5 25.5 25.5 ... 4
4 25 25 25 25 25 25 ... 5
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
Please note that this table above is not related to the previous dataset but just used to show the format.
Thanks
Use DataFrame.groupby with DataFrameGroupBy.resample:
def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .ffill()
                .reset_index())
    return dfdf
print (daily_average_temperature(dfdf))
Group_Id Timestamp Data
0 50 2018-01-03 00:00:15 125.5
1 52 2018-01-02 00:00:09 127.0
2 52 2018-01-02 00:00:10 127.0
3 52 2018-01-02 00:00:11 127.0
4 52 2018-01-02 00:00:12 127.0
5 52 2018-01-02 00:00:13 126.5
6 101 2018-01-01 00:00:05 125.0
7 120 2018-01-01 00:00:07 125.5
8 120 2018-01-01 00:00:08 125.0
9 300 2018-01-04 00:00:14 127.0
10 300 2018-01-04 00:00:15 126.5
11 350 2018-01-05 00:00:19 125.5
EDIT: This solution uses the minimal and maximal datetimes of the data for DataFrame.reindex, with a date_range as the DatetimeIndex in the columns (after reshaping with Series.unstack); back filling is also added where necessary:
def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    #remove ms for minimal and maximal seconds in data
    s = dfdf['Timestamp'].dt.floor('S')
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range(s.min(), s.max(), freq='S'),
                         axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data')
            )
    return dfdf
df = daily_average_temperature(dfdf)
print (df['Group_Id'].value_counts())
350 345615
300 345615
120 345615
101 345615
52 345615
50 345615
Name: Group_Id, dtype: int64
Another solution is similar; only the date_range is specified by string values instead of dynamically from the min and max:
def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    #remove ms for minimal and maximal seconds in data
    s = dfdf['Timestamp'].dt.floor('S')
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range('2018-01-01', '2018-01-08', freq='S'),
                         axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data')
            )
    return dfdf
df = daily_average_temperature(dfdf)
print (df['Group_Id'].value_counts())
350 604801
300 604801
120 604801
101 604801
52 604801
50 604801
Name: Group_Id, dtype: int64
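For the second part of the question (one row per Group_Id, with the resampled values spread across positional columns x0, x1, ...), here is a sketch of my own built on the same resample logic; groups with shorter time spans end up padded with NaN on the right:
import pandas as pd

def spread_by_group(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    resampled = (dfdf.set_index('Timestamp')
                     .groupby('Group_Id')['Data']
                     .resample('1S')
                     .last()
                     .ffill())
    # one row per Group_Id, one column per second within that group's own span
    wide = pd.DataFrame({gid: s.reset_index(drop=True)
                         for gid, s in resampled.groupby(level='Group_Id')}).T
    wide.columns = [f'x{i}' for i in wide.columns]
    return wide.rename_axis('Group_Id').reset_index()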
Here is my dataframe that I am working on. There are two pay periods defined:
first 15 days and last 15 days for each month.
date employee_id hours_worked id job_group report_id
0 2016-11-14 2 7.50 385 B 43
1 2016-11-15 2 4.00 386 B 43
2 2016-11-30 2 4.00 387 B 43
3 2016-11-01 3 11.50 388 A 43
4 2016-11-15 3 6.00 389 A 43
5 2016-11-16 3 3.00 390 A 43
6 2016-11-30 3 6.00 391 A 43
I need to group by employee_id and job_group, but at the same time
I have to produce the semi-monthly date range for each grouped row.
For example, the grouped results would look like the following:
Expected Output:
date employee_id hours_worked job_group report_id
1 2016-11-15 2 11.50 B 43
2 2016-11-30 2 4.00 B 43
4 2016-11-15 3 17.50 A 43
5 2016-11-16 3 9.00 A 43
Is this possible using pandas dataframe groupby?
Use a Grouper with the 'SM' (semi-month) frequency and, at the end, add SemiMonthEnd:
df['date'] = pd.to_datetime(df['date'])
d = {'hours_worked':'sum','report_id':'first'}
df = (df.groupby(['employee_id', 'job_group',
                  pd.Grouper(freq='SM', key='date', closed='right')])
        .agg(d)
        .reset_index())
df['date'] = df['date'] + pd.offsets.SemiMonthEnd(1)
print (df)
employee_id job_group date hours_worked report_id
0 2 B 2016-11-15 11.5 43
1 2 B 2016-11-30 4.0 43
2 3 A 2016-11-15 17.5 43
3 3 A 2016-11-30 9.0 43
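As a cross-check, the same bucketing can be spelled out without the 'SM' frequency. This is my own sketch, starting again from the original df and assuming the two pay periods are days 1-15 and day 16 through the end of the month:
import numpy as np
import pandas as pd

d = pd.to_datetime(df['date'])
# day 1-15 -> the 15th of that month; day 16 onwards -> the month's last day
period_end = pd.to_datetime(np.where(d.dt.day <= 15,
                                     d.dt.to_period('M').dt.start_time + pd.Timedelta(days=14),
                                     d + pd.offsets.MonthEnd(0)))
out = (df.assign(date=period_end)
         .groupby(['employee_id', 'job_group', 'date'], as_index=False)
         .agg({'hours_worked': 'sum', 'report_id': 'first'}))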
a. First, (for each employee_id) use multiple Grouper with .sum() on the hours_worked column. Second, use DateOffset to obtain a bi-weekly date column. After these two steps, the date in the grouped DF is assigned to one of two brackets (date ranges): if the day of the month (from the date column) is <= 15, the day in date is set to 15; otherwise it is set to the month's last day (computed with MonthEnd). This day is then used to assemble the new date.
b. (For each employee_id) get the .last() record for the job_group and report_id columns
c. merge a. and b. on the employee_id key
# a.
hours = (df.groupby([pd.Grouper(key='employee_id'),
                     pd.Grouper(key='date', freq='SM')])['hours_worked']
           .sum()
           .reset_index())
hours['date'] = pd.to_datetime(hours['date'])
hours['date'] = hours['date'] + pd.DateOffset(days=14)
# Assign day based on bracket (date range) 0-15 or bracket (date range) >15
from pandas.tseries.offsets import MonthEnd
hours['bracket'] = hours['date'] + MonthEnd(0)
hours['bracket'] = pd.to_datetime(hours['bracket']).dt.day
hours.loc[hours['date'].dt.day <= 15, 'bracket'] = 15
hours['date'] = pd.to_datetime(dict(year=hours['date'].dt.year,
                                    month=hours['date'].dt.month,
                                    day=hours['bracket']))
hours.drop('bracket', axis=1, inplace=True)
# b.
others = (df.groupby('employee_id')[['job_group', 'report_id']]
            .last()
            .reset_index())
# c.
merged = hours.merge(others, how='inner', on='employee_id')
Raw data for employee_id==1 and employee_id==3
df.sort_values(by=['employee_id','date'], inplace=True)
print(df[df.employee_id.isin([1,3])])
index date employee_id hours_worked id job_group report_id
0 0 2016-11-14 1 7.5 481 A 43
10 10 2016-11-21 1 6.0 491 A 43
11 11 2016-11-22 1 5.0 492 A 43
15 15 2016-12-14 1 7.5 496 A 43
25 25 2016-12-21 1 6.0 506 A 43
26 26 2016-12-22 1 5.0 507 A 43
6 6 2016-11-02 3 6.0 487 A 43
4 4 2016-11-08 3 6.0 485 A 43
3 3 2016-11-09 3 11.5 484 A 43
5 5 2016-11-11 3 3.0 486 A 43
20 20 2016-11-12 3 3.0 501 A 43
21 21 2016-12-02 3 6.0 502 A 43
19 19 2016-12-08 3 6.0 500 A 43
18 18 2016-12-09 3 11.5 499 A 43
Output
print(merged)
employee_id date hours_worked job_group report_id
0 1 2016-11-15 7.5 A 43
1 1 2016-11-30 11.0 A 43
2 1 2016-12-15 7.5 A 43
3 1 2016-12-31 11.0 A 43
4 2 2016-11-15 31.0 B 43
5 2 2016-12-15 31.0 B 43
6 3 2016-11-15 29.5 A 43
7 3 2016-12-15 23.5 A 43
8 4 2015-03-15 5.0 B 43
9 4 2016-02-29 5.0 B 43
10 4 2016-11-15 5.0 B 43
11 4 2016-11-30 15.0 B 43
12 4 2016-12-15 5.0 B 43
13 4 2016-12-31 15.0 B 43
I have the following MultiIndex dataframe.
Close ATR
Date Symbol
1990-01-01 A 24 2
1990-01-01 B 72 7
1990-01-01 C 40 3.4
1990-01-02 A 21 1.5
1990-01-02 B 65 6
1990-01-02 C 45 4.2
1990-01-03 A 19 2.5
1990-01-03 B 70 6.3
1990-01-03 C 51 5
I want to calculate three columns:
Shares = previous day's Equity * 0.02 / ATR, rounded down to whole number
Profit = Shares * Close
Equity = previous day's Equity + sum of Profit for each Symbol
Equity has an initial value of 10,000.
The expected output is:
Close ATR Shares Profit Equity
Date Symbol
1990-01-01 A 24 2 0 0 10000
1990-01-01 B 72 7 0 0 10000
1990-01-01 C 40 3.4 0 0 10000
1990-01-02 A 21 1.5 133 2793 17053
1990-01-02 B 65 6 33 2145 17053
1990-01-02 C 45 4.2 47 2115 17053
1990-01-03 A 19 2.5 136 2584 26885
1990-01-03 B 70 6.3 54 3780 26885
1990-01-03 C 51 5 68 3468 26885
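To make the rules concrete with the numbers above: on 1990-01-02 the previous day's Equity is 10000, so for symbol A, Shares = floor(10000 * 0.02 / 1.5) = 133 and Profit = 133 * 21 = 2793; Equity then becomes 10000 + (2793 + 2145 + 2115) = 17053, and that value is shared by all three symbols on that date.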
I suppose I need a for loop or a function applied to each row, and with these I have two issues. One is that I'm not sure how to write such a loop for this logic over a MultiIndex dataframe. The second is that my dataframe is pretty large (around 10 million rows), so I'm not sure a for loop is a good idea. But then how can I create these columns?
This solution can surely be cleaned up, but will produce your desired output. I've included your initial conditions in the construction of your sample dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['1990-01-01','1990-01-01','1990-01-01','1990-01-02','1990-01-02','1990-01-02','1990-01-03','1990-01-03','1990-01-03'],
                   'Symbol': ['A','B','C','A','B','C','A','B','C'],
                   'Close': [24, 72, 40, 21, 65, 45, 19, 70, 51],
                   'ATR': [2, 7, 3.4, 1.5, 6, 4.2, 2.5, 6.3, 5],
                   'Shares': [0, 0, 0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'Profit': [0, 0, 0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
Gives:
Date Symbol Close ATR Shares Profit
0 1990-01-01 A 24 2.0 0.0 0.0
1 1990-01-01 B 72 7.0 0.0 0.0
2 1990-01-01 C 40 3.4 0.0 0.0
3 1990-01-02 A 21 1.5 NaN NaN
4 1990-01-02 B 65 6.0 NaN NaN
5 1990-01-02 C 45 4.2 NaN NaN
6 1990-01-03 A 19 2.5 NaN NaN
7 1990-01-03 B 70 6.3 NaN NaN
8 1990-01-03 C 51 5.0 NaN NaN
Then use groupby() with apply() and track your Equity globally. Took me a second to realize that the nature of this problem requires you to group on two separate columns individually (Symbol and Date):
start = 10000
Equity = 10000
def calcs(x):
    global Equity
    if x.index[0] == 0:
        return x  # skip the first date group
    x['Shares'] = np.floor(Equity*0.02/x['ATR'])
    x['Profit'] = x['Shares']*x['Close']
    Equity += x['Profit'].sum()
    return x
df = df.groupby('Date').apply(calcs)
df['Equity'] = df.groupby('Date')['Profit'].transform('sum')
df['Equity'] = df.groupby('Symbol')['Equity'].cumsum()+start
This yields:
Date Symbol Close ATR Shares Profit Equity
0 1990-01-01 A 24 2.0 0.0 0.0 10000.0
1 1990-01-01 B 72 7.0 0.0 0.0 10000.0
2 1990-01-01 C 40 3.4 0.0 0.0 10000.0
3 1990-01-02 A 21 1.5 133.0 2793.0 17053.0
4 1990-01-02 B 65 6.0 33.0 2145.0 17053.0
5 1990-01-02 C 45 4.2 47.0 2115.0 17053.0
6 1990-01-03 A 19 2.5 136.0 2584.0 26885.0
7 1990-01-03 B 70 6.3 54.0 3780.0 26885.0
8 1990-01-03 C 51 5.0 68.0 3468.0 26885.0
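If the global variable feels fragile for a 10-million-row frame, the same logic can be written as a plain loop over the date groups. This is a sketch of my own, assuming the sample frame constructed above and a starting equity of 10,000; it reproduces the same numbers:
import numpy as np
import pandas as pd

equity = 10_000
pieces = []
for date, grp in df.groupby('Date', sort=True):
    grp = grp.copy()
    if pieces:                                   # every day after the first
        grp['Shares'] = np.floor(equity * 0.02 / grp['ATR'])
        grp['Profit'] = grp['Shares'] * grp['Close']
        equity += grp['Profit'].sum()
    else:                                        # first day: everything stays 0
        grp[['Shares', 'Profit']] = 0.0
    grp['Equity'] = equity
    pieces.append(grp)
result = pd.concat(pieces)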
Can you try using shift with groupby? Once you have the previous row's value, all column operations are straightforward.
table2['previous'] = table2['close'].groupby('symbol').shift(1)
table2
date symbol close atr previous
1990-01-01 A 24 2 NaN
B 72 7 NaN
C 40 3.4 NaN
1990-01-02 A 21 1.5 24
B 65 6 72
C 45 4.2 40
1990-01-03 A 19 2.5 21
B 70 6.3 65
C 51 5 45
I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the TIMESTAMP and DATE columns are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to be able to do is add a RANK column to df, filled with the corresponding day's rank from ranked based on the TIMESTAMP (so within a single day all the 5-minute timesteps get the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['TIMESTAMP'].dt.normalize()
df['TIME'] = df['TIMESTAMP'].dt.time
df = df[['DATE', 'TIME', 'TEMP']]
df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE'] == row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with the indexes to align the values:
df = df.set_index(df.TIMESTAMP.dt.date)\
       .assign(RANK=ranked.set_index('DATE').RANK)\
       .set_index(df.index)
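If aligning on plain date objects ever proves brittle, a hedged alternative of my own is to map normalized timestamps onto the ranked table (this assumes ranked['DATE'] is already at midnight):
df['RANK'] = df['TIMESTAMP'].dt.normalize().map(ranked.set_index('DATE')['RANK'])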