I am trying to re-sequence an already sequenced dataset based on some conditions. The dataframe looks like the following (the intended_cycle column is my desired output):
    id activity  duration_sec  cycle  intended_cycle
1    1    start           0.7      1               1
2    1        a           0.3      1               1
3    1        b           0.4      1               1
4    1        c           0.5      1               1
5    1        c           0.5      1               2
6    1        d           0.4      1               2
7    1        e           0.6      1               3
8    1     stop           2.0      1               3
9    1    start           0.1      2               4
10   1        b           0.3      2               4
11   1     stop           0.2      2               4
12   1        f           0.3      3               5
13   2     stop          40.0      4               6
14   2    start           3.0      5               7
15   2        a           0.7      5               7
16   2        a           3.0      5               7
17   2        b           0.2      5               7
18   2     stop           0.2      5               7
19   2    start           0.1      6               8
20   2        f           0.4      6               8
21   2        g           0.2      6               8
22   2        h           0.5      6               8
23   2        h           6.0      6               8
24   2     stop           9.0      6               8
25   2    start           0.2      7               9
26   2        e           0.3      7               9
27   2        f           0.4      7              10
28   2     stop           0.7      7              10
The letters represent activity names. I'm hoping to sequence as per the intended_cycle column, based on the following conditions within the current sequence:
- the first activity with duration > 0.5 s after the "start" is considered one sequence (or a sub-sequence, if you will);
- consecutive duplicate activities are considered as one if either value is > 0.5 s;
- the next activity > 0.5 s after that first one is considered the next sequence;
- if there is no further activity > 0.5 s, the sequence runs until the cycle column value changes;
- if no activity within the current cycle is > 0.5 s, the activities are considered independent (e.g. rows 25-28).
I'd really appreciate any help; a rough sketch of my attempt so far is shown after the sample code below. Big thanks to there being a community for this!
Sample code for the df:
import pandas as pd
data = {'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28},
'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2, 20: 2, 21: 2, 22: 2, 23: 2, 24: 2, 25: 2, 26: 2, 27: 2},
'activity': {0: 'start', 1: 'a', 2: 'b', 3: 'c', 4: 'c', 5: 'd', 6: 'e', 7: 'stop', 8: 'start', 9: 'b', 10: 'stop', 11: 'f', 12: 'stop', 13: 'start', 14: 'a', 15: 'a', 16: 'b', 17: 'stop', 18: 'start', 19: 'f', 20: 'g', 21: 'h', 22: 'h', 23: 'stop', 24: 'start', 25: 'e', 26: 'f', 27: 'stop'},
'duration_sec': {0: 0.7, 1: 0.3, 2: 0.4, 3: 0.5, 4: 0.5, 5: 0.4, 6: 0.6, 7: 2.0, 8: 0.1, 9: 0.3, 10: 0.2, 11: 0.3, 12: 40.0, 13: 3.0, 14: 0.7, 15: 3.0, 16: 0.2, 17: 0.2, 18: 0.1, 19: 0.4, 20: 0.2, 21: 0.5, 22: 6.0, 23: 9.0, 24: 0.2, 25: 0.3, 26: 0.4, 27: 0.7}, 'cycle': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 3, 12: 4, 13: 5, 14: 5, 15: 5, 16: 5, 17: 5, 18: 6, 19: 6, 20: 6, 21: 6, 22: 6, 23: 6, 24: 7, 25: 7, 26: 7, 27: 7}, 'intended_cycle': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4, 10: 4, 11: 5, 12: 6, 13: 7, 14: 7, 15: 7, 16: 7, 17: 7, 18: 8, 19: 8, 20: 8, 21: 8, 22: 8, 23: 8, 24: 9, 25: 9, 26: 10, 27: 10}}
df = pd.DataFrame.from_dict(data)
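My rough attempt so far, continuing from the df above. It is only a sketch of the "flag durations above the 0.5 s threshold and cumulatively count them per cycle" idea; it does not yet handle the consecutive-duplicate rule or the global renumbering, and `subseq` is just a name I made up:
# Flag activities longer than the 0.5 s threshold ("start" rows excluded,
# since the rules count long activities *after* the start).
over = ((df['duration_sec'] > 0.5) & (df['activity'] != 'start')).astype(int)

# A new sub-sequence starts on the row *after* each long activity;
# the numbering restarts inside every (id, cycle) group.
grp = [df['id'], df['cycle']]
df['subseq'] = over.groupby(grp).shift(1).fillna(0).astype(int).groupby(grp).cumsum() + 1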
Here is the dataframe that I'm working with in Python.
{'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30, 30: 31, 31: 32}, 'car': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710', 3: 'Hornet 4 Drive', 4: 'Hornet Sportabout', 5: 'Valiant', 6: 'Duster 360', 7: 'Merc 240D', 8: 'Merc 230', 9: 'Merc 280', 10: 'Merc 280C', 11: 'Merc 450SE', 12: 'Merc 450SL', 13: 'Merc 450SLC', 14: 'Cadillac Fleetwood', 15: 'Lincoln Continental', 16: 'Chrysler Imperial', 17: 'Fiat 128', 18: 'Honda Civic', 19: 'Toyota Corolla', 20: 'Toyota Corona', 21: 'Dodge Challenger', 22: 'AMC Javelin', 23: 'Camaro Z28', 24: 'Pontiac Firebird', 25: 'Fiat X1-9', 26: 'Porsche 914-2', 27: 'Lotus Europa', 28: 'Ford Pantera L', 29: 'Ferrari Dino', 30: 'Maserati Bora', 31: 'Volvo 142E'}, 'mpg': {0: 21.0, 1: 21.0, 2: 22.8, 3: 21.4, 4: 18.7, 5: 18.1, 6: 14.3, 7: 24.4, 8: 22.8, 9: 19.2, 10: 17.8, 11: 16.4, 12: 17.3, 13: 15.2, 14: 10.4, 15: 10.4, 16: 14.7, 17: 32.4, 18: 30.4, 19: 33.9, 20: 21.5, 21: 15.5, 22: 15.2, 23: 13.3, 24: 19.2, 25: 27.3, 26: 26.0, 27: 30.4, 28: 15.8, 29: 19.7, 30: 15.0, 31: 21.4}, 'cyl': {0: 6, 1: 6, 2: 4, 3: 6, 4: 8, 5: 6, 6: 8, 7: 4, 8: 4, 9: 6, 10: 6, 11: 8, 12: 8, 13: 8, 14: 8, 15: 8, 16: 8, 17: 4, 18: 4, 19: 4, 20: 4, 21: 8, 22: 8, 23: 8, 24: 8, 25: 4, 26: 4, 27: 4, 28: 8, 29: 6, 30: 8, 31: 4}, 'disp': {0: 160.0, 1: 160.0, 2: 108.0, 3: 258.0, 4: 360.0, 5: 225.0, 6: 360.0, 7: 146.7, 8: 140.8, 9: 167.6, 10: 167.6, 11: 275.8, 12: 275.8, 13: 275.8, 14: 472.0, 15: 460.0, 16: 440.0, 17: 78.7, 18: 75.7, 19: 71.1, 20: 120.1, 21: 318.0, 22: 304.0, 23: 350.0, 24: 400.0, 25: 79.0, 26: 120.3, 27: 95.1, 28: 351.0, 29: 145.0, 30: 301.0, 31: 121.0}, 'hp': {0: 110, 1: 110, 2: 93, 3: 110, 4: 175, 5: 105, 6: 245, 7: 62, 8: 95, 9: 123, 10: 123, 11: 180, 12: 180, 13: 180, 14: 205, 15: 215, 16: 230, 17: 66, 18: 52, 19: 65, 20: 97, 21: 150, 22: 150, 23: 245, 24: 175, 25: 66, 26: 91, 27: 113, 28: 264, 29: 175, 30: 335, 31: 109}, 'drat': {0: 3.9, 1: 3.9, 2: 3.85, 3: 3.08, 4: 3.15, 5: 2.76, 6: 3.21, 7: 3.69, 8: 3.92, 9: 3.92, 10: 3.92, 11: 3.07, 12: 3.07, 13: 3.07, 14: 2.93, 15: 3.0, 16: 3.23, 17: 4.08, 18: 4.93, 19: 4.22, 20: 3.7, 21: 2.76, 22: 3.15, 23: 3.73, 24: 3.08, 25: 4.08, 26: 4.43, 27: 3.77, 28: 4.22, 29: 3.62, 30: 3.54, 31: 4.11}, 'wt': {0: 2.62, 1: 2.875, 2: 2.32, 3: 3.215, 4: 3.44, 5: 3.46, 6: 3.57, 7: 3.19, 8: 3.15, 9: 3.44, 10: 3.44, 11: 4.07, 12: 3.73, 13: 3.78, 14: 5.25, 15: 5.424, 16: 5.345, 17: 2.2, 18: 1.615, 19: 1.835, 20: 2.465, 21: 3.52, 22: 3.435, 23: 3.84, 24: 3.845, 25: 1.935, 26: 2.14, 27: 1.513, 28: 3.17, 29: 2.77, 30: 3.57, 31: 2.78}, 'qsec': {0: 16.46, 1: 17.02, 2: 18.61, 3: 19.44, 4: 17.02, 5: 20.22, 6: 15.84, 7: 20.0, 8: 22.9, 9: 18.3, 10: 18.9, 11: 17.4, 12: 17.6, 13: 18.0, 14: 17.98, 15: 17.82, 16: 17.42, 17: 19.47, 18: 18.52, 19: 19.9, 20: 20.01, 21: 16.87, 22: 17.3, 23: 15.41, 24: 17.05, 25: 18.9, 26: 16.7, 27: 16.9, 28: 14.5, 29: 15.5, 30: 14.6, 31: 18.6}, 'vs': {0: 0, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1, 8: 1, 9: 1, 10: 1, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 1, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 0, 27: 1, 28: 0, 29: 0, 30: 0, 31: 1}, 'am': {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 0, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1}, 'gear': {0: 4, 1: 4, 2: 4, 3: 3, 4: 3, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 
4, 11: 3, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 4, 18: 4, 19: 4, 20: 3, 21: 3, 22: 3, 23: 3, 24: 3, 25: 4, 26: 5, 27: 5, 28: 5, 29: 5, 30: 5, 31: 4}, 'carb': {0: 4, 1: 4, 2: 1, 3: 1, 4: 2, 5: 1, 6: 4, 7: 2, 8: 2, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 14: 4, 15: 4, 16: 4, 17: 1, 18: 2, 19: 1, 20: 1, 21: 2, 22: 2, 23: 4, 24: 2, 25: 1, 26: 2, 27: 2, 28: 4, 29: 6, 30: 8, 31: 2}}
Here is the code that I'm using; I got the subplot part from a DataCamp module.
fig, ax = plt.subplot()
plt.show()
But when I go to plot the mtcars dataset, one variable against the other, I get a blank canvas. Why is that? I don't see how the code is different from what I am looking at on DataCamp.
ax.plot(mtcars['cyl'], mtcars['mpg'])
plt.show()
The answer below is helpful and gets me closer to a solution, but it gives me lines instead of a scatter plot:
import matplotlib.pyplot as plt
plt.plot(df['cyl'], df['mpg'])
plt.show()
or:
ax = plt.subplot(2, 1, 1)
ax.plot(df['cyl'], df['mpg'])
plt.show()
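For what it's worth, a minimal sketch of getting a scatter plot instead of connected lines (`data` is my assumed name for the mtcars dict shown above):
import matplotlib.pyplot as plt
import pandas as pd

mtcars = pd.DataFrame(data)  # assuming the mtcars dict above is stored in `data`

fig, ax = plt.subplots()                  # note: subplots(), not subplot()
ax.scatter(mtcars['cyl'], mtcars['mpg'])  # scatter() draws points, not lines
ax.set_xlabel('cyl')
ax.set_ylabel('mpg')
plt.show()
Equivalently, plt.plot(mtcars['cyl'], mtcars['mpg'], 'o') draws markers only.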
By default, pd.merge(df1, df2, on='col_4') performs an inner join, which pairs up every combination of rows whose values in the shared column match exactly.
Question: say we have 4 rows in df1 and 3 rows in df2, and all values in the shared column are the same. Then each row of df1 is matched 3 times, once per row of df2, so we get 12 rows in total.
Problem: is there a way to stop once we find the first match between the first and the second dataframe, and then move on to the next row of the first dataframe? A match already used for row 1 of df1 must not be reused: if row 1 of df1 is matched to row 1 of df2 on the shared column col_4, then row 2 of df1 must be matched with row 2 of df2.
Code:
import pandas as pd

df1 = pd.DataFrame(
    {
        'ID': [1, 2, 3, 5, 9],
        'col_1': [1, 2, 3, 4, 5],
        'col_2': [6, 7, 8, 9, 10],
        'col_3': [11, 12, 13, 14, 15],
        'col_4': ['apple', 'apple', 'apple', 'apple', 'apple']
    }
)
df2 = pd.DataFrame(
    {
        'ID': [1, 1, 3, 5],
        'col_1': [8, 9, 10, 11],
        'col_2': [12, 13, 15, 17],
        'col_3': [12, 13, 14, 15],
        'col_4': ['apple', 'apple', 'apple', 'apple']
    }
)
pd.merge(df1, df2, on='col_4')
So, how can the merge stop at the first match? (In the original figure, red rectangles marked where to stop once a match from df1 to df2 on the shared column col_4 was found.) For reference, the full result of the plain merge in dictionary format:
{'ID_x': {0: 1,
1: 1,
2: 1,
3: 1,
4: 2,
5: 2,
6: 2,
7: 2,
8: 3,
9: 3,
10: 3,
11: 3,
12: 5,
13: 5,
14: 5,
15: 5,
16: 9,
17: 9,
18: 9,
19: 9},
'col_1_x': {0: 1,
1: 1,
2: 1,
3: 1,
4: 2,
5: 2,
6: 2,
7: 2,
8: 3,
9: 3,
10: 3,
11: 3,
12: 4,
13: 4,
14: 4,
15: 4,
16: 5,
17: 5,
18: 5,
19: 5},
'col_2_x': {0: 6,
1: 6,
2: 6,
3: 6,
4: 7,
5: 7,
6: 7,
7: 7,
8: 8,
9: 8,
10: 8,
11: 8,
12: 9,
13: 9,
14: 9,
15: 9,
16: 10,
17: 10,
18: 10,
19: 10},
'col_3_x': {0: 11,
1: 11,
2: 11,
3: 11,
4: 12,
5: 12,
6: 12,
7: 12,
8: 13,
9: 13,
10: 13,
11: 13,
12: 14,
13: 14,
14: 14,
15: 14,
16: 15,
17: 15,
18: 15,
19: 15},
'col_4': {0: 'apple',
1: 'apple',
2: 'apple',
3: 'apple',
4: 'apple',
5: 'apple',
6: 'apple',
7: 'apple',
8: 'apple',
9: 'apple',
10: 'apple',
11: 'apple',
12: 'apple',
13: 'apple',
14: 'apple',
15: 'apple',
16: 'apple',
17: 'apple',
18: 'apple',
19: 'apple'},
'ID_y': {0: 1,
1: 1,
2: 3,
3: 5,
4: 1,
5: 1,
6: 3,
7: 5,
8: 1,
9: 1,
10: 3,
11: 5,
12: 1,
13: 1,
14: 3,
15: 5,
16: 1,
17: 1,
18: 3,
19: 5},
'col_1_y': {0: 8,
1: 9,
2: 10,
3: 11,
4: 8,
5: 9,
6: 10,
7: 11,
8: 8,
9: 9,
10: 10,
11: 11,
12: 8,
13: 9,
14: 10,
15: 11,
16: 8,
17: 9,
18: 10,
19: 11},
'col_2_y': {0: 12,
1: 13,
2: 15,
3: 17,
4: 12,
5: 13,
6: 15,
7: 17,
8: 12,
9: 13,
10: 15,
11: 17,
12: 12,
13: 13,
14: 15,
15: 17,
16: 12,
17: 13,
18: 15,
19: 17},
'col_3_y': {0: 12,
1: 13,
2: 14,
3: 15,
4: 12,
5: 13,
6: 14,
7: 15,
8: 12,
9: 13,
10: 14,
11: 15,
12: 12,
13: 13,
14: 14,
15: 15,
16: 12,
17: 13,
18: 14,
19: 15}}
Edit: if the first row of df1 is matched with the first row of df2, then the second row of df1 cannot be matched with the first row of df2 again; it should be matched with the second row of df2 if there is a match.
You can add a serial-number column serial for each group of equal col_4 values in both df1 and df2, then merge on col_4 and serial.
Generate the serial numbers with .groupby() + .cumcount():
df1['serial'] = df1.groupby('col_4').cumcount()
df2['serial'] = df2.groupby('col_4').cumcount()
df1.merge(df2, on=['col_4', 'serial'])
Result:
ID_x col_1_x col_2_x col_3_x col_4 serial ID_y col_1_y col_2_y col_3_y
0 1 1 6 11 apple 0 1 8 12 12
1 2 2 7 12 apple 1 1 9 13 13
2 3 3 8 13 apple 2 3 10 15 14
3 5 4 9 14 apple 3 5 11 17 15
Optionally, you can then drop the serial column, as follows:
df1.merge(df2, on=['col_4', 'serial']).drop('serial', axis=1)
Result:
ID_x col_1_x col_2_x col_3_x col_4 ID_y col_1_y col_2_y col_3_y
0 1 1 6 11 apple 1 8 12 12
1 2 2 7 12 apple 1 9 13 13
2 3 3 8 13 apple 3 10 15 14
3 5 4 9 14 apple 5 11 17 15
Edit
You can also simplify the code by generating the serial numbers inline in the .merge() call, as follows (thanks to #HenryEcker for the suggestion):
df1.merge(df2,
left_on=['col_4', df1.groupby('col_4').cumcount()],
right_on=['col_4', df2.groupby('col_4').cumcount()]
).drop('key_1', axis=1)
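For what it's worth, key_1 here is the name pandas auto-assigns to the second, unnamed merge key (the cumcount result), which is why it can simply be dropped at the end.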
dic = {'distinct_id': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5},
'first_name': {0: 'Joe',
1: 'Barry',
2: 'David',
3: 'Marcus',
4: 'Anthony'},
'activity': {0: 'Jump',
1: 'Jump',
2: 'Run',
3: 'Run',
4: 'Climb'},
'tasks_completed': {0: 3, 1: 3, 2: 3, 3: 3, 4: 1},
'tasks_available': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3}}
tasks = pd.DataFrame(dic)
I'm trying to make every id/name pair have a row for every unique activity. For example, I want "Joe" to have rows where the activity column is "Run" and "Climb", but with a 0 in the tasks_completed column (those rows being absent currently means he hasn't done those activity tasks). I have tried using df.iterrows(), making a list of the unique ids and activity names and checking whether both are present, but it didn't work. Any help is very appreciated!
This is what I am hoping to have:
tasks_new = {'distinct_id': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 1,
6: 1,
7: 2,
8: 2,
9: 3,
10: 3,
11: 4,
12: 4,
13: 5,
14: 5},
'email': {0: 'Joe',
1: 'Barry',
2: 'David',
3: 'Marcus',
4: 'Anthony',
5: 'Joe',
6: 'Joe',
7: 'Barry',
8: 'Barry',
9: 'David',
10: 'David',
11: 'Marcus',
12: 'Marcus',
13: 'Anthony',
14: 'Anthony'},
'activity': {0: 'Jump',
1: 'Jump',
2: 'Run',
3: 'Run',
4: 'Climb',
5: 'Run',
6: 'Climb',
7: 'Run',
8: 'Climb',
9: 'Jump',
10: 'Climb',
11: 'Climb',
12: 'Jump',
13: 'Run',
14: 'Jump'},
'tasks_completed': {0: 3,
1: 3,
2: 3,
3: 3,
4: 1,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0,
10: 0,
11: 0,
12: 0,
13: 0,
14: 0},
'tasks_available': {0: 3,
1: 3,
2: 3,
3: 3,
4: 3,
5: 3,
6: 3,
7: 3,
8: 3,
9: 3,
10: 3,
11: 3,
12: 3,
13: 3,
14: 3}}
pd.DataFrame(tasks_new)
You can reshape with .set_index() + .unstack(fill_value=0) + .stack():
idx_cols = ['distinct_id', 'first_name', 'activity']
tasks.set_index(idx_cols).unstack(fill_value=0).stack().reset_index()
distinct_id first_name activity tasks_completed tasks_available
0 1 Joe Climb 0 0
1 1 Joe Jump 3 3
2 1 Joe Run 0 0
3 2 Barry Climb 0 0
4 2 Barry Jump 3 3
5 2 Barry Run 0 0
6 3 David Climb 0 0
7 3 David Jump 0 0
8 3 David Run 3 3
9 4 Marcus Climb 0 0
10 4 Marcus Jump 0 0
11 4 Marcus Run 3 3
12 5 Anthony Climb 1 3
13 5 Anthony Jump 0 0
14 5 Anthony Run 0 0
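One caveat: fill_value=0 zeroes tasks_available for the filled-in rows too, whereas the desired output keeps it at 3. A sketch of one way to patch that afterwards, assuming tasks_available is constant per activity as in the sample:
out = tasks.set_index(idx_cols).unstack(fill_value=0).stack().reset_index()
# Restore tasks_available from the per-activity maximum (3 for every activity in the sample).
out['tasks_available'] = out.groupby('activity')['tasks_available'].transform('max')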
After reading a CSV into a dataframe, I am trying to resample my "Value" column to 5 seconds, starting from the first rounded second of the time value. I would like the mean of all the values within each 5-second window, starting from 46:19.6 (format %M:%S.%f). So the code would give me the mean for 46:20, then 46:25, and so on... Does anybody know how to do this? Thank you!
input:
df = pd.DataFrame({'Time': {0: '46:19.6',
1: '46:20.7',
2: '46:21.8',
3: '46:22.9',
4: '46:24.0',
5: '46:25.1',
6: '46:26.2',
7: '46:27.6',
8: '46:28.7',
9: '46:29.8',
10: '46:30.9',
11: '46:32.0',
12: '46:33.2',
13: '46:34.3',
14: '46:35.3',
15: '46:36.5',
16: '46:38.8',
17: '46:40.0'},
'Value': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15,
15: 17,
16: 19,
17: 20}})
Assuming your Time field is in datetime64[ns] format, you can simply use pd.Grouper with freq='5S':
# The next line is optional: convert to datetime if the `Time` field is an object, i.e. a string.
# df['Time'] = pd.to_datetime('00:' + df['Time'])
df1 = df.groupby(pd.Grouper(key='Time', freq='5S'))['Value'].mean().reset_index()
# Depending on what you want, you can replace the line above with one of the two below:
# df1 = df.groupby(pd.Grouper(key='Time', freq='5S'))['Value'].mean().reset_index().iloc[1:]
# df1 = df.groupby(pd.Grouper(key='Time', freq='5S', base=4.6))['Value'].mean().reset_index()
# In the last line, 4.6 s can be adjusted to any number between 0 and 5.
df1
output:
Time Value
0 2020-07-07 00:46:15 0.0
1 2020-07-07 00:46:20 2.5
2 2020-07-07 00:46:25 7.6
3 2020-07-07 00:46:30 12.5
4 2020-07-07 00:46:35 17.0
5 2020-07-07 00:46:40 20.0
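Side note: in newer pandas versions (1.1+) the base parameter is deprecated in favour of offset; a sketch of what I believe is the equivalent call there (the '4.6S' offset string is my assumption for the 4.6 s shift):
# Equivalent to base=4.6 on pandas >= 1.1, where base is deprecated:
df1 = df.groupby(pd.Grouper(key='Time', freq='5S', offset='4.6S'))['Value'].mean().reset_index()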
Full reproducible code from an example DataFrame I created:
import pandas as pd
df = pd.DataFrame({'Time': {0: '46:19.6',
1: '46:20.7',
2: '46:21.8',
3: '46:22.9',
4: '46:24.0',
5: '46:25.1',
6: '46:26.2',
7: '46:27.6',
8: '46:28.7',
9: '46:29.8',
10: '46:30.9',
11: '46:32.0',
12: '46:33.2',
13: '46:34.3',
14: '46:35.3',
15: '46:36.5',
16: '46:38.8',
17: '46:40.0'},
'Value': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15,
15: 17,
16: 19,
17: 20}})
df['Time'] = pd.to_datetime('00:'+df['Time'])
df1 = df.groupby(pd.Grouper(key='Time', freq='5S'))['Value'].mean().reset_index()
df1
I want to compare average revenue "in offer" vs average revenue "out of offer" for each SKU.
When I merge the two dataframes below on sku, I get multiple rows for each entry because sku is not unique in the second dataframe. For example, every instance of sku = 1 gets two entries, because test_offer contains 2 separate offers for sku 1. However, only one offer can be live for a SKU at any time, which should satisfy the condition:
(test_ga['day'] >= test_offer['start_day']) & (test_ga['day'] <= test_offer['end_day'])
dataset 1
test_ga = pd.DataFrame( {'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7, 15: 8, 16: 1, 17: 2, 18: 3, 19: 4, 20: 5, 21: 6, 22: 7, 23: 8},
'sku': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 3, 17: 3, 18: 3, 19: 3, 20: 3, 21: 3, 22: 3, 23: 3},
'revenue': {0: 12, 1: 34, 2: 28, 3: 76, 4: 30, 5: 84, 6: 55, 7: 78, 8: 23, 9: 58, 10: 11, 11: 15, 12: 73, 13: 9, 14: 69, 15: 34, 16: 71, 17: 69, 18: 90, 19: 93, 20: 43, 21: 45, 22: 57, 23: 89}} )
dataset 2
test_offer = pd.DataFrame( {'sku': {0: 1, 1: 1, 2: 2},
'offer_number': {0: 5, 1: 6, 2: 7},
'start_day': {0: 2, 1: 6, 2: 4},
'end_day': {0: 4, 1: 7, 2: 8}} )
Expected Output
expected_output = pd.DataFrame( {'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7, 15: 8},
'sku': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2},
'offer': {0: float('nan'), 1: '5', 2: '5', 3: '5', 4: float('nan'), 5: '6', 6: '6', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '7', 12: '7', 13: '7', 14: '7', 15: '7'},
'start_day': {0: float('nan'), 1: '2', 2: '2', 3: '2', 4: float('nan'), 5: '6', 6: '6', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '4', 12: '4', 13: '4', 14: '4', 15: '4'},
'end_day': {0: float('nan'), 1: '4', 2: '4', 3: '4', 4: float('nan'), 5: '7', 6: '7', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '8', 12: '8', 13: '8', 14: '8', 15: '8'},
'revenue': {0: 12, 1: 34, 2: 28, 3: 76, 4: 30, 5: 84, 6: 55, 7: 78, 8: 23, 9: 58, 10: 11, 11: 15, 12: 73, 13: 9, 14: 69, 15: 34}} )
I did actually find a solution, based on this SO answer, but it took me a while and that question is not very clear.
I thought it could still be useful to create this question even though I found a solution. Besides, there are probably better ways to achieve this that do not require creating a dummy variable and sorting the dataframe?
If this question is a duplicate, let me know and I will delete it.
One possible solution:
import numpy as np
import pandas as pd

test_data = pd.merge(test_ga, test_offer, on='sku')
# Flag whether each row falls inside its offer window.
test_data['is_offer'] = np.where((test_data['day'] >= test_data['start_day']) & (test_data['day'] <= test_data['end_day']), 1, 0)
expected_output = test_data.sort_values(['sku', 'day', 'is_offer']).groupby(['day', 'sku']).tail(1)
and then clean up the data, setting NaN in the offer columns for rows not in an offer:
expected_output['start_day'] = np.where(expected_output['is_offer'] == 0, np.NAN, expected_output['start_day'])
expected_output['end_day'] = np.where(expected_output['is_offer'] == 0, np.NAN, expected_output['end_day'])
expected_output['offer_number'] = np.where(expected_output['is_offer'] == 0, np.NAN, expected_output['offer_number'])
expected_output
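For completeness, a slightly more compact sketch of the same idea (not necessarily better): flag in-offer rows with between(), blank out the offer columns on the rest, then keep one row per (sku, day), preferring the in-offer one:
merged = pd.merge(test_ga, test_offer, on='sku')
in_offer = merged['day'].between(merged['start_day'], merged['end_day'])

# Blank out the offer columns for rows outside the offer window.
merged.loc[~in_offer, ['offer_number', 'start_day', 'end_day']] = np.nan

# Keep one row per (sku, day); sorting by the flag means keep='last'
# prefers the in-offer row when one exists.
out = (merged.assign(is_offer=in_offer)
             .sort_values('is_offer')
             .drop_duplicates(['sku', 'day'], keep='last')
             .sort_values(['sku', 'day'])
             .drop(columns='is_offer'))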