Repeat rows and iterate time - python

I need help with this. I want to repeat the previous row, incrementing its time one second at a time up to one second before the next row, based on a condition: when the Indicator >= 1, the previous row is repeated and its time stepped forward until the next timestamp, as shown in the desired output. Below are my input data and my code.
The Input:
 b0     b1      time  Indicator
325    350  10:59:40          0
200  42333  10:59:45          1
This is what I was thinking...
Code:
To sort and filter the data
df_new = df  # new dataframe
new_index = 0  # to keep track of index in df_new inside the loop
for i, row in df.iterrows():
    new_row = {}
    if (row['indicator'] > 1) and (i != 0):
        for column in df.columns:
            if column == 'time':
                new_row[column] = row[column] + datetime.timedelta(seconds=-1)
            else:
                new_row[column] = prev_row[column]
The problem I'm having is iterating the time. At the moment, it is only doing it for the second before the next timestamp.
The Output:
 b0     b1      time  Indicator
325    350  10:59:40          0
325    350  10:59:41          0
325    350  10:59:42          0
325    350  10:59:43          0
325    350  10:59:44          0
200  42333  10:59:45          1

This can be achieved with a few pandas functions instead of loop processing. First, asfreq() fills in the gaps at one-second intervals with forward-filled values. Then the rows with Indicator == 0 from the resampled frame are concatenated with the rows with Indicator == 1 from the original frame, and the result is sorted back into a time series.
import pandas as pd
import numpy as np
import io
data = '''
b0 b1 time Indicator
325 350 10:59:40 0
200 42333 10:59:45 1
424 236 11:00:00 0
525 361 11:00:10 0
623 896 11:00:20 1
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
df['time'] = pd.to_datetime(df['time'])
df.set_index('time', inplace=True)
dfs = df.asfreq('1s', method='ffill')
dfs = pd.concat([dfs[dfs['Indicator'] == 0], df[df['Indicator'] == 1]], axis=0)
dfs.sort_values(by='time', ascending=True, inplace=True)
dfs
b0 b1 Indicator
time
2021-02-06 10:59:40 325 350 0
2021-02-06 10:59:41 325 350 0
2021-02-06 10:59:42 325 350 0
2021-02-06 10:59:43 325 350 0
2021-02-06 10:59:44 325 350 0
2021-02-06 10:59:45 200 42333 1
2021-02-06 11:00:00 424 236 0
2021-02-06 11:00:01 424 236 0
2021-02-06 11:00:02 424 236 0
2021-02-06 11:00:03 424 236 0
2021-02-06 11:00:04 424 236 0
2021-02-06 11:00:05 424 236 0
2021-02-06 11:00:06 424 236 0
2021-02-06 11:00:07 424 236 0
2021-02-06 11:00:08 424 236 0
2021-02-06 11:00:09 424 236 0
2021-02-06 11:00:10 525 361 0
2021-02-06 11:00:11 525 361 0
2021-02-06 11:00:12 525 361 0
2021-02-06 11:00:13 525 361 0
2021-02-06 11:00:14 525 361 0
2021-02-06 11:00:15 525 361 0
2021-02-06 11:00:16 525 361 0
2021-02-06 11:00:17 525 361 0
2021-02-06 11:00:18 525 361 0
2021-02-06 11:00:19 525 361 0
2021-02-06 11:00:20 623 896 1

Here's another way to do it.
Step 1: convert time column to datetime data type
df['time'] = pd.to_datetime(df['time'],format='%H:%M:%S')
Step 2: Get the time difference between next row and current row.
Convert NaN to 0, and finally convert the value to integer.
df['time_diff'] = (df.time.shift(-1) - df.time).dt.seconds.fillna(0).astype(int)
Step 3: Get the next row's Indicator using shift(-1), replacing NaN with 0.
df['next_ind'] = df.Indicator.shift(-1).fillna(0).astype(int)
Step 4: If the current row's Indicator is >= 1, do not repeat this row: set its time_diff to 1.
df.loc[df.Indicator >= 1, 'time_diff'] = 1
Step 5: Similarly, if the current row's Indicator is 0 and the next row's is also 0, this row should not be repeated either: set its time_diff to 1.
df.loc[(df.Indicator == 0) & (df.next_ind == 0), 'time_diff'] = 1
Step 6: time_diff now holds either 1, or the number of seconds between a row with Indicator 0 and the next row with Indicator >= 1. Use it as the number of times to repeat each row by building a list of timestamps that can later be exploded.
df['time'] = df.apply(lambda x: list(pd.date_range(x['time'], periods=x['time_diff'], freq=pd.DateOffset(seconds=1))),axis=1)
Step 7: Now explode the dataframe on column time as it contains
lists.
df = df.explode('time')
Step 8: Print the final dataframe for desired results
Putting all this together, the code is as shown below.
c = ['b0','b1','time','Indicator']
d = [[325,350,'10:59:40',0],
[200,42333,'10:59:45',1],
[300,1234,'10:59:52',0],
[400,2345,'10:59:55',0],
[500,3456,'10:59:58',1],
[600,4567,'11:00:03',2]]
import pandas as pd
df = pd.DataFrame(d,columns=c)
print (df)
df['time'] = pd.to_datetime(df['time'],format='%H:%M:%S')
df['time_diff'] = (df.time.shift(-1) - df.time).dt.seconds.fillna(0).astype(int)
df['next_ind'] = df.Indicator.shift(-1).fillna(0).astype(int)
df.loc[df.Indicator >= 1, 'time_diff'] = 1
df.loc[(df.Indicator == 0) & (df.next_ind == 0), 'time_diff'] = 1
df['time'] = df.apply(lambda x: list(pd.date_range(x['time'], periods=x['time_diff'], freq=pd.DateOffset(seconds=1))),axis=1)
df = df.explode('time')
df.drop(columns=['time_diff','next_ind'],inplace=True)
print (df)
Output of this will be:
Original DataFrame:
b0 b1 time Indicator
0 325 350 10:59:40 0
1 200 42333 10:59:45 1
2 300 1234 10:59:52 0
3 400 2345 10:59:55 0
4 500 3456 10:59:58 1
5 600 4567 11:00:03 2
Updated DataFrame:
b0 b1 time Indicator
0 325 350 1900-01-01 10:59:40 0
0 325 350 1900-01-01 10:59:41 0
0 325 350 1900-01-01 10:59:42 0
0 325 350 1900-01-01 10:59:43 0
0 325 350 1900-01-01 10:59:44 0
1 200 42333 1900-01-01 10:59:45 1
2 300 1234 1900-01-01 10:59:52 0
3 400 2345 1900-01-01 10:59:55 0
3 400 2345 1900-01-01 10:59:56 0
3 400 2345 1900-01-01 10:59:57 0
4 500 3456 1900-01-01 10:59:58 1
5 600 4567 1900-01-01 11:00:03 2
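If the 1900-01-01 placeholder date that to_datetime adds is not wanted in the final result, a possible follow-up step (a small sketch, assuming the time column as above) is to format it back to plain times:
df['time'] = df['time'].dt.strftime('%H:%M:%S')  # keep only HH:MM:SS
print(df)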

Related

Aggregate functions on a 3-level pandas groupby object

I want to make a new df with simple metrics like mean, sum, min, max calculated on the Value column in the df visible below, grouped by ID, Date and Key.
   index   ID  Key        Date  Value    x    y       z
       0  655  321  2021-01-01     50  546  235  252345
       1  675  321  2021-01-01     50  345  345   34545
       2  654  356  2021-02-02     70  345  346     543
I am doing it like this:
final = df.groupby(['ID','Date','Key'])['Value'].first().mean(level=[0,1]).reset_index().rename(columns={'Value':'Value_Mean'})
I use .first() because one Key can occur multiple times in the df but they all have the same Value. I want to aggregate on ID and Date so I am using level=[0,1].
and then I add the next metric with a pandas merge:
final = final.merge(df.groupby(['ID','Date','Key'])['Value'].first().max(level=[0,1]).reset_index().rename(columns={'Value':'Value_Max'}), on=['ID','Date'])
And I go on like that with the other metrics. I wonder if there is a more sophisticated way to do it than repeating this over multiple lines. I know you can use .agg() and pass a dict of functions, but it seems that way it isn't possible to specify the level, which is important here.
Use DataFrame.drop_duplicates with named aggregation:
df = pd.DataFrame({'ID':[655,655,655,675,654], 'Key':[321,321,333,321,356],
'Date':['2021-01-01','2021-01-01','2021-01-01','2021-01-01','2021-02-02'],
'Value':[50,30,10,50,70]})
print (df)
ID Key Date Value
0 655 321 2021-01-01 50
1 655 321 2021-01-01 30
2 655 333 2021-01-01 10
3 675 321 2021-01-01 50
4 654 356 2021-02-02 70
final = (df.drop_duplicates(['ID','Date','Key'])
           .groupby(['ID','Date'], as_index=False).agg(Value_Mean=('Value','mean'),
                                                        Value_Max=('Value','max')))
print (final)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
An alternative is to take the first value per ('ID','Date','Key') group first and then aggregate:
final = (df.groupby(['ID','Date','Key'], as_index=False)
           .first()
           .groupby(['ID','Date'], as_index=False).agg(Value_Mean=('Value','mean'),
                                                        Value_Max=('Value','max')))
print (final)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
Or, with a plain list aggregation and add_prefix:
df = (df.groupby(['ID','Date','Key'], as_index=False)
        .first()
        .groupby(['ID','Date'])['Value']
        .agg(['mean', 'max'])
        .add_prefix('Value_')
        .reset_index())
print (df)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
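Since the question also mentions sum and min, the named aggregation can simply be extended with more keyword arguments; for example (Value_Min and Value_Sum are illustrative names):
final = (df.drop_duplicates(['ID','Date','Key'])
           .groupby(['ID','Date'], as_index=False)
           .agg(Value_Mean=('Value','mean'),
                Value_Max=('Value','max'),
                Value_Min=('Value','min'),
                Value_Sum=('Value','sum')))
print (final)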

How to merge multiple sheets and rename column names with the names of the sheet names?

I have the following data. It is all in one Excel file.
Sheet name: may2019
Productivity Count
Date : 01-Apr-2020 00:00 to 30-Apr-2020 23:59
Date Type: Finalized Date Modality: All
Name MR DX CT US MG BMD TOTAL
Svetlana 29 275 101 126 5 5 541
Kate 32 652 67 171 1 0 923
Andrew 0 452 0 259 1 0 712
Tom 50 461 61 104 4 0 680
Maya 0 353 0 406 0 0 759
Ben 0 1009 0 143 0 0 1152
Justin 0 2 9 0 1 9 21
Total 111 3204 238 1209 12 14 4788
Sheet Name: June 2020
Productivity Count
Date : 01-Jun-2019 00:00 to 30-Jun-2019 23:59
Date Type: Finalized Date Modality: All
NAme US DX CT MR MG BMD TOTAL
Svetlana 4 0 17 6 0 4 31
Kate 158 526 64 48 1 0 797
Andrew 154 230 0 0 0 0 384
Tom 1 0 19 20 2 8 50
Maya 260 467 0 0 1 1 729
Ben 169 530 59 40 3 0 801
Justin 125 164 0 0 4 0 293
Alvin 0 1 0 0 0 0 1
Total 871 1918 159 114 11 13 3086
I want to merge all the sheets into one sheet, drop the first 3 rows of every sheet, and this is the output I am looking for:
Sl.No Name US_jun2019 DX_jun2019 CT_jun2019 MR_jun2019 MG_jun2019 BMD_jun2019 TOTAL_jun2019 MR_may2019 DX_may2019 CT_may2019 US_may2019 MG_may2019 BMD_may2019 TOTAL_may2019
1 Svetlana 4 0 17 6 0 4 31 29 275 101 126 5 5 541
2 Kate 158 526 64 48 1 0 797 32 652 67 171 1 0 923
3 Andrew 154 230 0 0 0 0 384 0 353 0 406 0 0 759
4 Tom 1 0 19 20 2 8 50 0 2 9 0 1 9 21
5 Maya 260 467 0 0 1 1 729 0 1009 0 143 0 0 1152
6 Ben 169 530 59 40 3 0 801 50 461 61 104 4 0 680
7 Justin 125 164 0 0 4 0 293 0 452 0 259 1 0 712
8 Alvin 0 1 0 0 0 0 1 #N/A #N/A #N/A #N/A #N/A #N/A #N/A
I tried the following code, but the output is not the one I am looking for.
df=pd.concat(df,sort=False)
df= df.drop(df.index[[0,1]])
df=df.rename(columns=df.iloc[0])
df= df.drop(df.index[[0]])
df=df.drop(['Sl.No'], axis = 1)
print(df)
First, read both Excel sheets with header=None, since the real column names sit inside the data rather than in the first row.
>>> df1 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="may2019", header=None)
>>> df2 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="jun2019", header=None)
Drop the first three rows.
>>> df1.drop(index=range(3), inplace=True)
>>> df2.drop(index=range(3), inplace=True)
Rename the columns to the values in the first remaining row, then drop that row.
>>> df1.rename(columns=dict(zip(df1.columns, df1.iloc[0])), inplace=True)
>>> df1.drop(index=df1.index[0], inplace=True)
>>> df2.rename(columns=dict(zip(df2.columns, df2.iloc[0])), inplace=True)
>>> df2.drop(index=df2.index[0], inplace=True)
Remove the duplicate name column from the second DataFrame before adding the suffixes.
>>> df2.drop(columns=['Name'], inplace=True)
Add suffixes to the columns.
>>> df1.rename(columns=lambda col_name: col_name + '_may2019', inplace=True)
>>> df2.rename(columns=lambda col_name: col_name + '_jun2019', inplace=True)
Concatenate both dataframes.
>>> df = pd.concat([df2, df1], axis=1)
All the code in one place:
import pandas as pd
df1 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="may2019", header=None)
df2 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="jun2019", header=None)
df1.drop(index=range(3), inplace=True)
df2.drop(index=range(3), inplace=True)
df1.rename(columns=dict(zip(df1.columns, df1.iloc[0])), inplace=True)
df1.drop(index=df1.index[0], inplace=True)
df2.rename(columns=dict(zip(df2.columns, df2.iloc[0])), inplace=True)
df2.drop(index=df2.index[0], inplace=True)
df2.drop(columns=['Name'], inplace=True)
df1.rename(columns=lambda col_name: col_name + '_may2019', inplace=True)
df2.rename(columns=lambda col_name: col_name + '_jun2019', inplace=True)
df = pd.concat([df2, df1], axis=1)
print(df)
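If the workbook has more than two sheets, a more general sketch along the same lines could read them all at once with sheet_name=None and loop over the resulting dict (this assumes every sheet has the same three title rows and a 'Name' column, and the file path is a placeholder as above):
import pandas as pd

# Read every sheet into a dict of {sheet_name: DataFrame}, with no header row
sheets = pd.read_excel('path/to/excel/file.xlsx', sheet_name=None, header=None)

frames = []
for name, sheet in sheets.items():
    sheet = sheet.drop(index=range(3))                                   # drop the 3 title rows
    sheet = sheet.rename(columns=dict(zip(sheet.columns, sheet.iloc[0])))
    sheet = sheet.drop(index=sheet.index[0])                             # drop the header row itself
    sheet = sheet.set_index('Name')                                      # align sheets by Name
    frames.append(sheet.add_suffix('_' + name))                          # e.g. MR_may2019

# Outer join on Name, so a person missing from one sheet gets NaN there
df = pd.concat(frames, axis=1).reset_index()
print(df)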

Generate column of unique ID in pandas

I have a dataframe with three columns, bins_x, bins_y and z. I wish to add a new column unique that is an "index" of sorts for that unique combination of bins_x and bins_y. Below is an example of what I would like to append.
Note that I ordered the dataframe for clarity, but order does not matter in this context.
import numpy as np
import pandas as pd
np.random.seed(12)
n = 1000
height = 20
width = 20
bins_x = np.random.randint(1, width, size=n)
bins_y = np.random.randint(1, height, size=n)
z = np.random.randint(1, 500, size=n)
df = pd.DataFrame({'bins_x': bins_x, 'bins_y': bins_y, 'z': z})
print(df.sort_values(['bins_x', 'bins_y']))
bins_x bins_y z unique
23 0 0 462 0
531 0 0 199 1
665 0 0 176 2
363 0 1 219 0
468 0 1 450 1
593 0 1 385 2
609 0 1 74 3
663 0 1 46 4
14 0 2 242 0
208 0 2 381 1
600 0 2 445 2
865 0 2 221 3
400 0 3 178 0
75 0 4 281 0
140 0 4 205 1
282 0 4 47 2
838 0 4 212 3
Use groupby and cumcount:
df['unique'] = df.groupby(['bins_x','bins_y']).cumcount()
>>> df.sort_values(['bins_x', 'bins_y']).head(10)
bins_x bins_y z unique
207 1 1 4 0
259 1 1 313 1
327 1 1 300 2
341 1 1 64 3
440 1 1 398 4
573 1 1 96 5
174 1 2 219 0
563 1 2 398 1
796 1 2 417 2
809 1 2 167 3
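As an aside, if a single identifier shared by every row of the same (bins_x, bins_y) combination were wanted instead of a within-group counter, groupby().ngroup() would give that (the column name group_id is just illustrative):
df['group_id'] = df.groupby(['bins_x', 'bins_y']).ngroup()  # one integer per unique combination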

How to do Stock Trading Back Test Using Pandas and Basic Iteration?

I want to evaluate a trading strategy, given the amount I invest in a particular stock. Basically, when "K_Class" is 1 I buy, and when "K_Class" is 0 I sell. To keep it simple enough, we can ignore the open, high and low values and just use the close price.
We do want to iterate over the whole Series, following 1 = buy and 0 = sell, no matter whether the signal is right or wrong.
I have a pandas DataFrame with a column called "K_Class", a boolean flag: 1 (buy) and 0 (sell).
From the first day "K_Class" shows 1, I buy; if the next day it is 0, I sell immediately at the close price.
How can I write a for loop (using pandas and Python techniques) to work out the final invested money and the time invested?
Please feel free to add more variables.
So far I have:
invest_amount = 10000
stock_owned = 10000 / p1  # p1 = the close price on the first day the indicator shows 1
invest_time = 0
Time Close K_Class
0 2017/03/06 31.72 0
1 2017/03/08 33.99 0
2 2017/03/09 32.02 0
3 2017/03/10 30.66 0
4 2017/03/13 30.94 1
5 2017/03/15 32.56 1
6 2017/03/17 33.31 0
7 2017/03/20 34.07 1
8 2017/03/22 34.40 0
9 2017/03/24 32.98 1
10 2017/03/27 33.26 0
11 2017/03/28 31.60 0
12 2017/03/29 30.36 0
13 2017/03/30 28.83 0
14 2017/04/11 27.01 0
15 2017/04/12 24.31 0
16 2017/04/14 24.22 0
17 2017/04/17 21.80 0
18 2017/04/18 21.20 1
19 2017/04/19 23.32 1
20 2017/04/20 24.43 0
21 2017/04/24 23.85 1
22 2017/04/26 23.97 1
23 2017/04/27 24.31 1
24 2017/04/28 23.50 1
25 2017/05/02 22.57 1
26 2017/05/03 22.67 1
27 2017/05/04 22.11 1
28 2017/05/05 21.26 1
29 2017/05/08 19.37 1
.. ... ... ...
275 2018/08/01 13.38 0
276 2018/08/03 12.49 0
277 2018/08/06 12.50 0
278 2018/08/07 12.78 0
279 2018/08/09 12.93 0
280 2018/08/10 13.15 0
281 2018/08/13 13.14 1
282 2018/08/14 13.15 0
283 2018/08/15 12.80 0
284 2018/08/17 12.29 0
285 2018/08/21 12.39 0
286 2018/08/22 12.15 0
287 2018/08/23 12.27 0
288 2018/08/24 12.31 0
289 2018/08/27 12.47 0
290 2018/08/29 12.31 0
291 2018/08/30 12.13 0
292 2018/08/31 11.69 0
293 2018/09/03 11.60 1
294 2018/09/04 11.65 0
295 2018/09/05 11.45 0
296 2018/09/07 11.42 0
297 2018/09/10 10.71 0
298 2018/09/11 10.76 1
299 2018/09/12 10.74 0
300 2018/09/13 10.85 1
301 2018/09/14 10.79 0
302 2018/09/18 10.58 1
303 2018/09/19 10.65 1
304 2018/09/21 10.73 1
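A minimal loop-based sketch of such a back test (assuming df holds the Time/Close/K_Class columns shown above; the variable names are illustrative and the position is valued at the last close if still held at the end):
invest_amount = 10000       # starting cash
cash = float(invest_amount)
shares = 0.0
invest_time = 0             # number of bars spent holding the stock

for _, row in df.iterrows():
    price = row['Close']
    if row['K_Class'] == 1 and shares == 0:     # buy with all available cash
        shares = cash / price
        cash = 0.0
    elif row['K_Class'] == 0 and shares > 0:    # sell everything at the close
        cash = shares * price
        shares = 0.0
    if shares > 0:
        invest_time += 1

# If still holding at the end, value the position at the last close
final_value = cash + shares * df['Close'].iloc[-1]
print(final_value, invest_time)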

exclude row for rolling mean calculation in pandas

I am looking for a pandas way to solve this. I have a DataFrame:
df
A RM
0 384 NaN
1 376 380.0
2 399 387.5
3 333 366.0
4 393 363.0
5 323 358.0
6 510 416.5
7 426 468.0
8 352 389.0
I want to check whether the value in df['A'] is greater than the previous row's RM value; if so, the new Status column should be 0, otherwise 1:
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
I suppose I need to use shift with numpy.where, but I am not getting the desired result.
import pandas as pd
import numpy as np
df=pd.DataFrame([384,376,399,333,393,323,510,426,352], columns=['A'])
df['RM']=df['A'].rolling(window=2,center=False).mean()
df['Status'] = np.where((df.A > df.RM.shift(1).rolling(window=2,center=False).mean()) , 0, 1)
Finally, applying rolling mean
df.AverageMean=df[df['Status'] == 1]['A'].rolling(window=2,center=False).mean()
Just simple shift
df['Status']=(df.A<=df.RM.fillna(9999).shift()).astype(int)
df
Out[347]:
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
I assume that a row whose previous RM is NaN should always get Status 1, so the NaN is filled with a value larger than any A before shifting:
df['Status'] = (df.A < df.RM.fillna(df.A.max()+1).shift(1)).astype(int)
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
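With Status in place, the final step the question aims at (a rolling mean over only the Status == 1 rows) could be written as a proper column assignment, for example ('AverageMean' is just an illustrative column name):
mask = df['Status'] == 1
df.loc[mask, 'AverageMean'] = df.loc[mask, 'A'].rolling(window=2, center=False).mean()
print(df)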
