I have some trips, and each trip contains several steps. The data frame looks like the following:
 tripId  duration (s)  distance (m)  speed Km/h
1819714           NaN           NaN         NaN
1819714           6.0      8.511452    5.106871
1819714          10.0      6.908963    2.487227
1819714           5.0     15.960625   11.491650
1819714           6.0     26.481649   15.888989
    ...           ...           ...         ...
1865507           6.0     16.280313    9.768188
1865507           5.0     17.347482   12.490187
1865507           5.0     14.266625   10.271970
1865507           6.0     22.884008   13.730405
1865507           5.0     21.565655   15.527271
I want to know whether, on a trip X, the cyclist has braked (the speed has decreased by at least 30%).
The problem is that the duration between two consecutive steps differs each time.
For example, if the speed of a person X decreased from 28 km/h to 15 km/h in 6 seconds, we can say that they braked; if the duration was much longer, we cannot say that.
My question is whether there is a sensible way to detect braking across the whole data frame.
Braking is measured by the change in speed relative to the change in time. From your data, I created a column 'acceleration', which is the change in speed (km/h) divided by the duration (seconds). The final column flags braking when that value is less than -1 (km/h per second).
Note that you need to decide whether a reduction of 1 km/h per second is a suitable threshold for braking.
df['speedChange'] = df['speedKm/h'].diff()
df['acceleration'] = df['speedChange'] / df['duration(s)']
df['braking'] = df['acceleration'].apply(lambda x: 'yes' if x<-1 else 'no')
print(df)
Output:
      tripId  duration(s)  distance(m)  speedKm/h  speedChange  acceleration  braking
0  1819714.0          6.0     8.511452   5.106871          NaN           NaN       no
1  1819714.0         10.0     6.908963   2.487227    -2.619644     -0.261964       no
2  1819714.0          5.0    15.960625  11.491650     9.004423      1.800885       no
3  1819714.0          6.0    26.481649  15.888989     4.397339      0.732890       no
4  1865507.0          6.0    16.280313   9.768188    -6.120801     -1.020134      yes
5  1865507.0          5.0    17.347482  12.490187     2.721999      0.544400       no
6  1865507.0          5.0    14.266625  10.271970    -2.218217     -0.443643       no
7  1865507.0          6.0    22.884008  13.730405     3.458435      0.576406       no
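Note that in the output above, row 4 is the first step of trip 1865507, yet its speed change is computed against the last step of trip 1819714, because diff() runs over the whole column. A minimal sketch of a per-trip variant follows. It also adds the relative (30%) drop from the question. The toy values, the column name relChange, and the two thresholds MIN_DROP and MAX_DURATION are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Toy data using the column names from the answer above (values are made up).
df = pd.DataFrame({
    "tripId": [1, 1, 1, 2, 2, 2],
    "duration(s)": [6.0, 6.0, 5.0, 6.0, 5.0, 20.0],
    "speedKm/h": [28.0, 15.0, 16.0, 5.0, 12.0, 8.0],
})

# Previous speed within the same trip; the first step of each trip gets NaN.
prev_speed = df.groupby("tripId")["speedKm/h"].shift(1)
df["speedChange"] = df["speedKm/h"] - prev_speed
df["acceleration"] = df["speedChange"] / df["duration(s)"]  # km/h per second
df["relChange"] = df["speedChange"] / prev_speed            # -0.3 means a 30% drop

MIN_DROP = 0.30      # assumed threshold: relative drop of at least 30%
MAX_DURATION = 10.0  # assumed threshold: the step lasts at most 10 seconds

# Comparisons with NaN evaluate to False, so first steps are never flagged.
df["braking"] = (df["relChange"] <= -MIN_DROP) & (df["duration(s)"] <= MAX_DURATION)
print(df)
```

With these toy values only the second row is flagged: the drop from 28 to 15 km/h happens within 6 seconds, whereas the drop from 12 to 8 km/h in the second trip is spread over 20 seconds.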
I have time-series data in a dataframe. Is there any way to calculate for each day the percent change of that day's value from the average of the previous 7 days?
I have tried
df['Change'] = df['Column'].pct_change(periods=7)
However, this only computes the percent change between day t and day t-7. I need something like:
For each value of Ti, find the average of the previous 7 days, and subtract from Ti
Sure, you can for example use:
s = df['Column']
n = 7
mean = s.rolling(n, closed='left').mean()
df['Change'] = (s - mean) / mean
Note on closed='left'
There was a bug prior to pandas=1.2.0 that caused incorrect handling of closed for fixed windows. Make sure you have pandas>=1.2.0; for example, pandas=1.1.3 will not give the result below.
As described in the docs:
closed: Make the interval closed on the ‘right’, ‘left’, ‘both’ or ‘neither’ endpoints. Defaults to ‘right’.
A simple way to understand this is to try it with some very simple data and a small window:
import pandas as pd

a = pd.DataFrame(range(5), index=pd.date_range('2020', periods=5))
b = a.assign(
    sum_left=a.rolling(2, closed='left').sum(),
    sum_right=a.rolling(2, closed='right').sum(),
    sum_both=a.rolling(2, closed='both').sum(),
    sum_neither=a.rolling(2, closed='neither').sum(),
)
>>> b
            0  sum_left  sum_right  sum_both  sum_neither
2020-01-01  0       NaN        NaN       NaN          NaN
2020-01-02  1       NaN        1.0       1.0          NaN
2020-01-03  2       1.0        3.0       3.0          NaN
2020-01-04  3       3.0        5.0       6.0          NaN
2020-01-05  4       5.0        7.0       9.0          NaN
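If upgrading pandas is not an option, one possible workaround is to shift the series by one row and then take an ordinary rolling mean. For a fixed integer window this should cover the same rows as closed='left'. The sketch below uses made-up values and a window of 3 (instead of 7) so the numbers can be checked by hand.

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 14.0, 20.0, 10.0],
              index=pd.date_range("2020-01-01", periods=5))
n = 3  # short window for illustration; the question uses 7

# Mean of the n values before each row: shift by one, then a regular rolling mean.
prev_mean = s.shift(1).rolling(n).mean()

# Relative change of each value against the mean of the previous n values.
change = (s - prev_mean) / prev_mean
print(change)
```

The first three entries are NaN because fewer than three earlier values exist. On 2020-01-04 the previous mean is (10 + 12 + 14) / 3 = 12, so the change is (20 - 12) / 12, roughly 0.667.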
I have the following dataframe:
              COD   CHM        DATE
0            5713   0.0  2020-07-16
1            5713   1.0  2020-08-11
2            5713   2.0  2020-06-20
3            5713   3.0  2020-06-19
4            5713   4.0  2020-06-01
...           ...   ...         ...
2135283  73306036   0.0  2020-09-30
2135284  73306055  12.0  2020-09-30
2135285  73306479   9.0  2020-09-30
2135286  73306656   3.0  2020-09-30
2135287  73306676   1.0  2020-09-30
I want to calculate the mean and the standard deviation for each COD throughout the dates (time).
For this, I am doing:
traf_user_chm_med =traf_user_chm_med.groupby(['COD', 'DATE'])['CHM'].sum().reset_index()
dates = pd.date_range(start=traf_user_chm_med.DATE.min(), end=traf_user_chm_med.DATE.max(), freq='MS', closed='left').sort_values(ascending=False)
clients = traf_user_chm_med['COD'].unique()
idx = pd.MultiIndex.from_product((clients, dates), names=['COD', 'DATE'])
M0 = pd.to_datetime('2020-08')
M1 = M0-pd.DateOffset(month=M0.month-1)
M2 = M0-pd.DateOffset(month=M0.month-2)
M3 = M0-pd.DateOffset(month=M0.month-3)
M4 = M0-pd.DateOffset(month=M0.month-4)
M5 = M0-pd.DateOffset(month=M0.month-5)
def filter_dates(grp):
    grp.set_index('YEAR_MONTH', inplace=True)
    grp=grp[M0:M5].reset_index()
    return grp
traf_user_chm_med = traf_user_chm_med.groupby('COD').apply(filter_dates)
Not sure why it doesn't work; it returns an empty dataframe.
After this I would unstack to get the activity in the several months and calculate the mean and standard deviation for each COD.
This is a long process, and I am not sure if there is a faster way to get the values I want.
Still, if anyone can help me get this one working, that would be awesome!
If I understand correctly, you simply need this:
df.groupby("COD")["CHM"].agg(["mean", "std"])
As a general principle, there is almost always a "pythonic" way to do these things that takes fewer lines and is easy to understand!
You can use transform to broadcast your mean and std
...
df['mean'] = df.groupby('COD')['CHM'].transform('mean')
df['std'] = df.groupby('COD')['CHM'].transform('std')
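If the statistics should be taken over monthly totals, as the question's unstack plan suggests, a minimal sketch could look like the following. The sample rows and the names monthly and stats are assumptions for illustration. Months in which a COD has no rows are filled with 0, which may or may not be what you want.

```python
import pandas as pd

df = pd.DataFrame({
    "COD": [5713, 5713, 5713, 5713, 9999],
    "CHM": [0.0, 1.0, 2.0, 3.0, 4.0],
    "DATE": pd.to_datetime(["2020-07-16", "2020-08-11", "2020-06-20",
                            "2020-06-19", "2020-08-01"]),
})

# Total CHM per COD and calendar month, one column per month.
# Months without any rows for a COD become 0 (an assumption about the intent).
monthly = (
    df.groupby(["COD", df["DATE"].dt.to_period("M")])["CHM"]
      .sum()
      .unstack(fill_value=0)
)

# Mean and sample standard deviation across the months, one row per COD.
stats = pd.DataFrame({"mean": monthly.mean(axis=1), "std": monthly.std(axis=1)})
print(stats)
```

To restrict the calculation to a date range such as the last six months, filter df on DATE before grouping rather than slicing inside a groupby-apply.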
I have the following data frame in pandas:
I want to add the Avg Price column in the original data frame, after grouping by (Date,Issuer) and then taking the dot product of weights and price, so that it is something like:
Is there a way to do it without using merge or join? What would be the simplest way to do it?
One way using pandas.DataFrame.prod:
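# Row-wise product Weights * Price, stored temporarily in the target column
df["Avg Price"] = df[["Weights", "Price"]].prod(axis=1)
# Sum the products within each (Date, Issuer) group and broadcast the result back to every row
df["Avg Price"] = df.groupby(["Date", "Issuer"])["Avg Price"].transform("sum")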
print(df)
Output:
         Date  Issuer  Weights  Price  Avg Price
0  2019-11-12       A      0.4    100      120.0
1  2019-15-12       B      0.5    100      100.0
2  2019-11-12       A      0.2    200      120.0
3  2019-15-12       B      0.3    100      100.0
4  2019-11-12       A      0.4    100      120.0
5  2019-15-12       B      0.2    100      100.0
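A variant of the same idea, sketched here as an alternative rather than as the answer's own method, keeps the products in a separate Series so the target column is written only once. The frame is rebuilt from the output shown above, and the name weighted is just illustrative.

```python
import pandas as pd

# Data reconstructed from the printed output; dates are kept as plain strings.
df = pd.DataFrame({
    "Date": ["2019-11-12", "2019-15-12", "2019-11-12",
             "2019-15-12", "2019-11-12", "2019-15-12"],
    "Issuer": ["A", "B", "A", "B", "A", "B"],
    "Weights": [0.4, 0.5, 0.2, 0.3, 0.4, 0.2],
    "Price": [100, 100, 200, 100, 100, 100],
})

# Element-wise product, then a grouped sum broadcast back to the original rows.
weighted = df["Weights"] * df["Price"]
df["Avg Price"] = weighted.groupby([df["Date"], df["Issuer"]]).transform("sum")
print(df)
```

Because transform returns a result aligned with the original index, no merge or join is needed.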
I would like to perform the following task. Given two columns (good and bad), I would like to replace rows of the two columns with a running total. Here is an example of the current dataframe along with the desired dataframe.
EDIT: I should have added what my intentions are. I am trying to create an equally binned variable (in this case 20 bins) using a continuous variable as the input. I know the pandas cut and qcut functions are available; however, the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value). Zeros in either the numerator or the denominator will not allow the mathematical calculations to work.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, whenever I encounter a zero in either column, I need to keep a running total of the non-zero column up to the next row that has a non-zero value in the column that contained the zero.
Here is the desired output:
dd={'AAA':range(0,16),
'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df=pd.DataFrame(data=dd)
print(desired_df)
The basic idea of my solution is to create a grouping column from a cumsum over the non-zero values, so that each run of zero values lands in the same group as the next non-zero value. Then you can use groupby + sum to get the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
This code treats your edge case of trailing zeros differently from your desired output: it simply cuts it off. You would have to add some extra code to catch that case with different logic.
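One way the trailing-zero case could be handled is sketched below: accumulate rows until both columns have a non-zero total, emit the combined bin, and fold any unfinished remainder into the last emitted bin. The helper merge_zero_bins is hypothetical and is not the groupby approach above. Like that approach, it merges zeros forward, so rows 2 and 3 come out as 28 and 91 rather than the 60 and 59 in the desired output.

```python
import pandas as pd

def merge_zero_bins(df, cols=("good", "bad")):
    """Hypothetical helper: merge consecutive rows until every column in
    `cols` has a non-zero running total, then emit one combined row."""
    merged = []
    acc = dict.fromkeys(cols, 0)
    for values in df[list(cols)].itertuples(index=False):
        for col, value in zip(cols, values):
            acc[col] += value
        if all(acc[col] != 0 for col in cols):
            merged.append(acc)
            acc = dict.fromkeys(cols, 0)
    # Trailing rows that never completed a bin are folded into the last bin.
    if merged and any(acc.values()):
        for col in cols:
            merged[-1][col] += acc[col]
    return pd.DataFrame(merged, columns=list(cols))

d = {'AAA': range(0, 20),
     'good': [3, 3, 13, 20, 28, 32, 59, 72, 64, 52, 38, 24, 17, 19, 12, 5, 7, 6, 2, 0],
     'bad': [0, 0, 1, 1, 1, 0, 6, 8, 10, 6, 6, 10, 5, 8, 2, 2, 1, 3, 1, 1]}
result = merge_zero_bins(pd.DataFrame(data=d))
print(result)
```

For this data the last row becomes good=2, bad=2, which matches the final row of the desired output.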
P.Tillmann, I appreciate your assistance with this. I assume more advanced readers will find this code appalling, as I do. I would be more than happy to take any recommendation that makes it more streamlined.
import pandas as pd

d={'AAA':range(0,20),
   'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
   'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
# Collect output rows in a list; DataFrame.append was removed in pandas 2.0
crappy_rows=[]
for index,row in df.iterrows():
    if row['good']==0 or row['bad']==0:
        # Current row contains a zero: keep accumulating
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count += 1
        row_good_zero_count += 1
        output_ind='1'
        row_out='NO'
    elif index+1 < len(df) and (df.loc[index+1,'good']==0 or df.loc[index+1,'bad']==0):
        # Next row contains a zero: start a new running total here
        row_bad=row['bad']
        row_good=row['good']
        output_ind='2'
        row_out='NO'
    elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
        # First non-zero row after several zero rows: close the running total
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='3'
    else:
        # Ordinary row: output it unchanged
        row_bad=row['bad']
        row_good=row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='4'
    # A zero row that follows a non-zero row is output once both totals are non-zero
    if ((row['good']==0 or row['bad']==0)
        and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
        and row_good != 0 and row_bad != 0):
        row_out='YES'
    if row_out=='YES':
        temp_dict={'AAA':row['AAA'],
                   'good':row_good,
                   'bad':row_bad}
        crappy_rows.append(temp_dict)
    # Debug trace of the state after each row
    print(str(row['AAA']),'-',
          str(row['good']),'-',
          str(row['bad']),'-',
          str(row_good),'-',
          str(row_bad),'-',
          str(row_good_zero_count),'-',
          str(row_bad_zero_count),'-',
          row_out,'-',
          output_ind)
crappy_fix=pd.DataFrame(crappy_rows)
print(crappy_fix)