I have data that is essentially a series of laps where each lap has its own elapsed time, but I am trying to calculate the total elapsed time.
Here's some code that has similar data:
import pandas as pd
import numpy as np
laptime = pd.Series([1,2,3,4,5,1,2,3,4,5,1,2,3,4,5])
lap = pd.Series([1,1,1,1,1,2,2,2,2,2,3,3,3,3,3])
timeblocks = pd.DataFrame({'laptime': laptime, 'lap': lap})
timeblocks['timediff'] = timeblocks.laptime.diff()
timeblocks['elapsed'] = pd.Series([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
timeblocks
The resulting data looks like:
lap laptime timediff elapsed
0 1 1 NaN 1
1 1 2 1.0 2
2 1 3 1.0 3
3 1 4 1.0 4
4 1 5 1.0 5
5 2 1 -4.0 6
6 2 2 1.0 7
7 2 3 1.0 8
8 2 4 1.0 9
9 2 5 1.0 10
10 3 1 -4.0 11
11 3 2 1.0 12
12 3 3 1.0 13
13 3 4 1.0 14
14 3 5 1.0 15
The elapsed time is what I actually need to calculate. I have tried various combinations of the time differentials and cumsum, but I'm stuck.
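One hedged pandas sketch for the toy data above (assuming laptime counts up from the start of each lap, so each lap's maximum laptime is its duration; the column name elapsed_calc is just illustrative): offset every row by the summed duration of all previous laps.
# Per-lap duration = max laptime within the lap (toy-data assumption);
# elapsed = current laptime + total duration of all earlier laps.
lap_totals = timeblocks.groupby('lap')['laptime'].max()
offset = lap_totals.cumsum().shift(fill_value=0)
timeblocks['elapsed_calc'] = timeblocks['laptime'] + timeblocks['lap'].map(offset)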
Real-world data looks more like the following:
113.81201171875 1
113.86206054688 1
113.912109375 1
113.96215820313 1
0.05126953125 2
0.101318359375 2
0.1513671875 2
In the case of the real world data, the sample rate is about 0.05 sec.
import io, operator, itertools
Assuming the data is in a text file or multiline string:
s = '''113.81201171875 1
113.86206054688 1
113.912109375 1
113.96215820313 1
0.05126953125 2
0.101318359375 2
0.1513671875 2'''
f = io.StringIO(s)
Gather the data into a list; sort the list by lap then time; group the data by lap and extract the largest and smallest time; calculate the elapsed lap time; accumulate.
data = []
for line in f:
    time, lap = map(float, line.strip().split())
    data.append((time, lap))

lap = operator.itemgetter(1)
time = operator.itemgetter(0)
data.sort(key=operator.itemgetter(1, 0))

total = 0
for el, times in itertools.groupby(data, lap):
    low, *_, high = map(time, times)
    elapsed = high - low
    print(f'lap {el}, elapsed time: {elapsed}')
    total += elapsed
print(f'total elapsed time: {total}')
>>>
lap 1.0, elapsed time: 0.15014648438000222
lap 2.0, elapsed time: 0.10009765625
total elapsed time: 0.2502441406300022
>>>
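If you would rather stay in pandas, here is a hedged sketch of the same idea (group by lap, take max minus min, then sum); the column names time and lap are assumptions for the two whitespace-separated fields.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(s), sep=r'\s+', names=['time', 'lap'])
per_lap = df.groupby('lap')['time'].agg(lambda t: t.max() - t.min())  # elapsed time per lap
print(per_lap)
print(f'total elapsed time: {per_lap.sum()}')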
Related
I have a dataframe df with two columns, df["Period"] and df["Returns"]. df["Period"] contains numbers 1, 2, 3, ..., n and is increasing. I want to calculate new columns using .cumprod of df["Returns"] where df["Period"] >= 1, 2, 3, etc. Note that the number of rows for each unique period is different and not systematic.
So I get n new columns
df["M_1]: is cumprod of df["Return"] for rows df["Period"] >= 1
df["M_2]: is cumprod of df["Return"] for rows df["Period"] >= 2
...
Below is my example, which works. The implementation has two drawbacks:
it is very slow for large number of unique periods
it does not work well with pandas method chaining
Any hint on how to speed this up and/or vectorize it is appreciated.
import numpy as np
import pandas as pd

# Create sample data
n = 10
data = {"Period": np.sort(np.random.randint(1, 5, n)),
        "Returns": np.random.randn(n) / 100}
df = pd.DataFrame(data)

# Slow implementation
periods = set(df["Period"])
for period in periods:
    cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
    df[f"M_{period}"] = cumret
df.head()
This is the expected output:
   Period       Returns          M_1          M_2          M_3         M_4
0       1   -0.0268917   -0.0268917          nan          nan         nan
1       1     0.018205  -0.00917625          nan          nan         nan
2       2   0.00505662  -0.00416604   0.00505662          nan         nan
3       2 -8.28544e-05  -0.00424855   0.00497334          nan         nan
4       2   0.00127519  -0.00297878   0.00625488          nan         nan
5       3  -0.00224315  -0.00521524    0.0039977  -0.00224315         nan
6       3   -0.0197291   -0.0248414   -0.0158103    -0.021928         nan
7       3   0.00136592   -0.0235094   -0.0144659    -0.020592         nan
8       4   0.00582897   -0.0178175  -0.00872129   -0.0148831  0.00582897
9       4   0.00260425   -0.0152597  -0.00613975   -0.0123176   0.0084484
Here is how your code performs on my machine (Python 3.10.7, Pandas 1.4.3), averaged over 10,000 iterations:
import statistics
import time

import numpy as np
import pandas as pd

elapsed_time = []
for _ in range(10_000):
    start_time = time.time()
    periods = set(df["Period"])
    for period in periods:
        cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
        df[f"M_{period}"] = cumret
    elapsed_time.append(time.time() - start_time)

print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds on average ---")
print(df)
Output:
--- 0.00298 seconds on average ---
Period Returns M_1 M_2 M_4
0 1 -0.008427 -0.008427 NaN NaN
1 1 0.019699 0.011106 NaN NaN
2 2 0.012661 0.023908 0.012661 NaN
3 2 -0.005059 0.018728 0.007538 NaN
4 4 0.025452 0.044657 0.033182 0.025452
5 4 0.010808 0.055948 0.044349 0.036535
6 4 0.004843 0.061062 0.049407 0.041555
7 4 0.005791 0.067207 0.055484 0.047587
8 4 -0.001816 0.065269 0.053568 0.045685
9 4 0.014102 0.080291 0.068425 0.060431
With some minor modifications, you can get a ~3x speed improvement:
elapsed_time = []
for _ in range(10_000):
    start_time = time.time()
    for period in df["Period"].unique():
        df[f"M_{period}"] = (
            1 + df.loc[df["Period"].ge(period), "Returns"]
        ).cumprod() - 1
    elapsed_time.append(time.time() - start_time)

print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds on average ---")
print(df)
Output:
--- 0.001052 seconds on average ---
Period Returns M_1 M_2 M_4
0 1 -0.008427 -0.008427 NaN NaN
1 1 0.019699 0.011106 NaN NaN
2 2 0.012661 0.023908 0.012661 NaN
3 2 -0.005059 0.018728 0.007538 NaN
4 4 0.025452 0.044657 0.033182 0.025452
5 4 0.010808 0.055948 0.044349 0.036535
6 4 0.004843 0.061062 0.049407 0.041555
7 4 0.005791 0.067207 0.055484 0.047587
8 4 -0.001816 0.065269 0.053568 0.045685
9 4 0.014102 0.080291 0.068425 0.060431
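For a fully vectorized variant with no Python loop over periods, here is a hedged sketch: because Period is sorted ascending, each M_k is the global cumulative product of 1 + Returns rescaled by its value just before period k starts. It assumes the df from the question; the names growth, starts, base, wide, and cols are illustrative. Wrapped in a small function, it can also be used with .pipe() for method chaining.
import numpy as np
import pandas as pd

growth = (1 + df["Returns"]).cumprod().to_numpy()     # global running product of 1 + Returns
periods = df["Period"].to_numpy()
uniq = np.unique(periods)                             # sorted unique periods
starts = np.searchsorted(periods, uniq)               # first row of each period (Period is sorted)
base = np.where(starts > 0, growth[starts - 1], 1.0)  # running product just before each period starts
wide = growth[:, None] / base[None, :] - 1            # candidate M_k values, one column per period
mask = periods[:, None] >= uniq[None, :]              # M_k is only defined where Period >= k
cols = pd.DataFrame(np.where(mask, wide, np.nan),
                    columns=[f"M_{p}" for p in uniq], index=df.index)
df = pd.concat([df, cols], axis=1)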
QUESTION: How do I find all rows in a pandas DataFrame which have the minimum time difference compared to the advice time?
Example:
Advicenr Advicehour Setdownnr Zone Setdownhour
0 A 1 A 16 **2** <-- zone 16 is closest to advicehour of A
1 A 1 A 16 **3**
2 A 2 A 18 5
3 A 2 A 18 8
4 B 4 B 19 18
5 B 8 B 20 **12** <-- zone 20 is closest to advicehour of B
Expected output:
Advicenr Advicehour Setdownnr Zone Setdownhour
0 A 1 A 16 3
1 A 1 A 16 2
5 B 8 B 20 12
It is not possible for a setdown to occur before its advice, and an advice for a different zone should also not have a timestamp before the previous one ended.
First create a column for the absolute differences between the columns, then get the Zone with the minimal difference per group, and select all matching rows:
df['diff'] = df['Setdownhour'].sub(df['Advicehour']).abs()
s = df.set_index('Zone').groupby('Advicenr', sort=False)['diff'].transform('idxmin')
df = df[(s == s.index).to_numpy()]
print (df)
Advicenr Advicehour Setdownnr Zone Setdownhour diff
0 A 1 A 16 2 1
1 A 1 A 16 3 2
5 B 8 B 20 12 4
A solution without the helper column in the output:
s = df['Setdownhour'].sub(df['Advicehour']).abs()
s1 = df.assign(s = s).set_index('Zone').groupby('Advicenr')['s'].transform('idxmin')
df = df[(s1 == s1.index).to_numpy()]
print (df)
Advicenr Advicehour Setdownnr Zone Setdownhour
0 A 1 A 16 2
1 A 1 A 16 3
5 B 8 B 20 12
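A hedged alternative without set_index (the names diff, winner, and out are illustrative): pick, per Advicenr, the row with the smallest difference, read off its Zone, and keep every row in that zone.
diff = df['Setdownhour'].sub(df['Advicehour']).abs()
winner = df.loc[diff.groupby(df['Advicenr']).idxmin()].set_index('Advicenr')['Zone']  # winning Zone per Advicenr
out = df[df['Zone'].eq(df['Advicenr'].map(winner))]
print(out)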
Thanks to the advice from Jezrael, I ended up doing:
df['diff'] = df['Setdownhour'].sub(inner_join_tote_nr['Advicehour']).abs()
df['avg_diff'] = df.groupby(['Setdownnr', 'Advicehour', 'Zone'])['diff'].transform('min')
s = df.groupby(['Advicenr', 'Advicehour'], sort=False)['avg_diff'].min().reset_index()
selected = pd.merge(s, inner_join_tote_nr, left_on=['Advicenr','Advicehour', 'avg_diff'], right_on = ['Advicenr','Advicehour', 'avg_diff'])
I am having trouble adding a couple of new calculated columns to my DataFrame. Here is what I'm looking for:
Original DF:
Col_IN Col_OUT
5 2
1 2
2 2
3 0
3 1
What I want to add is two columns. One is a 'running end of day total' that takes the net of the current day plus the total of the day before. The second is 'Available Units', which is the previous day's end total plus the incoming units. Desired result:
Desired DF:
Col_IN Available_Units Col_OUT End_Total
5 5 2 3
1 4 2 2
2 4 2 2
3 5 0 5
3 8 1 7
It's a weird one - anybody have an idea? Thanks.
For the End_Total you can use np.cumsum and for Available Units you can use shift.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col_IN': [5, 1, 2, 3, 3],
    'Col_OUT': [2, 2, 2, 0, 1]
})
df['End_Total'] = np.cumsum(df['Col_IN'] - df['Col_OUT'])
df['Available_Units'] = df['End_Total'].shift().fillna(0) + df['Col_IN']
print(df)
will output
Col_IN Col_OUT End_Total Available_Units
0 5 2 3 5.0
1 1 2 2 4.0
2 2 2 2 4.0
3 3 0 5 5.0
4 3 1 7 8.0
Running totals are also known as cumulative sums, for which pandas has the cumsum() function.
The end totals can be calculated through the cumulative sum of incoming minus the cumulative sum of outgoing:
df["End_Total"] = df["Col_IN"].cumsum() - df["Col_OUT"].cumsum()
The available units can be calculated in the same way, if you shift the outgoing column one down:
df["Available_Units"] = df["Col_IN"].cumsum() - df["Col_OUT"].shift(1).fillna(0).cumsum()
I have the following df
id xx yy time
0 1 553343.041098 4.178420e+06 1
1 1 553343.069815 4.178415e+06 2
2 1 553343.069815 4.178415e+06 3
3 2 553343.950755 4.178415e+06 1
4 2 553341.343829 4.178410e+06 6
The xx and yy are the position of each id at a certain point in time.
I would like to create an extra column in this df which is the distance travelled from one point in time to the next (going from the smallest value of time to the next bigger one, and so on), within each id group.
Is there a Pythonic way of doing so?
You can do it as below.
I didn't include df['distance_meters'] here because it is straightforward.
df['xx_diff']=df.groupby('id')['xx'].diff()**2
df['yy_diff']=df.groupby('id')['yy'].diff()**2
If you don't need ['xx_diff'] & ['yy_diff'] columns in your dataframe, you can directly use the code below.
df['distance']= np.sqrt(df.groupby('id')['xx'].diff()**2+df.groupby('id')['yy'].diff()**2)
Output
id xx yy time xx_diff yy_diff distance
0 1 553343.041098 4178420.0 1 NaN NaN NaN
1 1 553343.069815 4178415.0 2 0.000825 25.0 5.000082
2 1 553343.069815 4178415.0 3 0.000000 0.0 0.000000
3 2 553343.950755 4178415.0 1 NaN NaN NaN
4 2 553341.343829 4178410.0 6 6.796063 25.0 5.638800
I don't know if there is a more efficient way to do this, but here is a solution:
import numpy as np
# raw=True passes each window as a plain array, so x[0] and x[1] index by position
df['xx_diff'] = df.groupby('id')['xx'].rolling(window=2).apply(lambda x: (x[1] - x[0])**2, raw=True).reset_index(drop=True)
df['yy_diff'] = df.groupby('id')['yy'].rolling(window=2).apply(lambda x: (x[1] - x[0])**2, raw=True).reset_index(drop=True)
df['distance_meters'] = np.sqrt(df['xx_diff'] + df['yy_diff'])
A more pythonic answer will be accepted :)
Try This:
import pandas as pd
import math

def calc_distance(values):
    values.sort_values('time', inplace=True)   # order each id group by time
    values['distance_diff'] = 0
    values.reset_index(drop=True, inplace=True)
    for i in range(values.shape[0] - 1):
        p1 = list(values.loc[i, ['xx', 'yy']])
        p2 = list(values.loc[i + 1, ['xx', 'yy']])
        values.loc[i, 'distance_diff'] = math.sqrt(((p1[0] - p2[0]) ** 2) + ((p1[1] - p2[1]) ** 2))
    return values
lt = []
lt.append(df.groupby(['id']).apply(calc_distance))
print(pd.concat(lt, ignore_index=True))
Output:
id xx yy time distance_diff
0 1 553343.041098 4178420.0 1 5.000082
1 1 553343.069815 4178415.0 2 0.000000
2 1 553343.069815 4178415.0 3 0.000000
3 2 553343.950755 4178415.0 1 5.638800
4 2 553341.343829 4178410.0 6 0.000000
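As a compact hedged variant of the diff-based answer above, np.hypot combines the two squared differences in one call (assuming the same df):
import numpy as np

g = df.groupby('id')
df['distance'] = np.hypot(g['xx'].diff(), g['yy'].diff())  # NaN for the first row of each id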
I have a dataframe as follows:
d = {'item': [1, 2, 3, 4, 5, 6], 'time': [1297468800, 1297468809, 1297468801, 1297468890, 1297468820, 1297468805]}
df = pd.DataFrame(data=d)
The output of df is as follows:
item time
0 1 1297468800
1 2 1297468809
2 3 1297468801
3 4 1297468890
4 5 1297468820
5 6 1297468805
The time here is Unix time. My goal is to recode the time column in the dataframe. For example:
mintime = 1297468800
maxtime = 1297468890
I want to split the time range into 10 intervals (this should be adjustable via a parameter, e.g. 20 intervals) and recode the time column in df, like this:
item time
0 1 1
1 2 1
2 3 1
3 4 9
4 5 3
5 6 1
What is the most efficient way to do this, since I have billions of records? Thanks
You can use pd.cut with np.linspace to specify the bins. This encodes your column categorically, from which you can then extract the codes in order:
bins = np.linspace(df.time.min() - 1, df.time.max(), 10)
df['time'] = pd.cut(df.time, bins=bins, right=True).cat.codes + 1
df
item time
0 1 1
1 2 1
2 3 1
3 4 9
4 5 3
5 6 1
Alternatively, depending on how you treat the interval edges, you could also do
bins = np.linspace(df.time.min(), df.time.max() + 1, 10)
pd.cut(df.time, bins=bins, right=False).cat.codes + 1
0 1
1 1
2 1
3 9
4 2
5 1
dtype: int8
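If pd.cut becomes a bottleneck at billions of rows, here is a hedged numpy-only sketch using the same bin edges as the first example: np.searchsorted returns the 1-based interval codes directly, without building a Categorical.
import numpy as np

t = df['time'].to_numpy()
bins = np.linspace(t.min() - 1, t.max(), 10)  # same edges as the pd.cut call above
df['time'] = np.searchsorted(bins, t)         # 1-based interval codes, same as cat.codes + 1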