Sum rows where the index step is not bigger than 1 (pandas, Python)

I have foot sensor data and I want to calculate the standard deviation (std) of the swing times.
The dataframe looks like this:
Time Force
83 0.83 80
84 0.84 60
85 0.85 40
86 0.86 20
87 0.87 0
88 0.88 0
89 0.89 20
90 0.90 40
91 0.91 60
92 0.92 40
93 0.93 0
94 0.94 0
95 0.95 0
96 0.96 20
So to get the times for when the force ==0, I did:
df[(df['Force']==0)]
Resulting in:
Time Force
87 0.87 0
88 0.88 0
93 0.93 0
94 0.94 0
95 0.95 0
Now I want to sum the Time per swing.
swing 1 = index 87 + 88, swing 2 = index 93 + 94 + 95
How can I achieve this? How can I sum the rows in each run where the index step is not bigger than 1?
(Imagine I have thousands of rows to sum)
I tried complicated loops like:
swing_durations = []
start = []
start.append(0)
swings_left = swing_times_left.reset_index(drop = True)
for subject in swings_left[['filename']]:
    i = 1
    for time in swings_left['Time'][1:-1]:
        j = i - 1
        k = swings_left.where(swings_left['Time'].loc[i] - swings_left['Time'].loc[j] > 0.01)
        if k == True:
            start.append(time)
            swing_durations.append(swings_left[['Time']].loc[j] - start[j])
        i = i + 1
    totalswingtime_l['filename'== subject]['Variance'] = swing_durations.std()
resulting in an error
Thanks for the help!

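One solution is to create an ID for each run of consecutive equal Force values.
This is what (df.Force.shift()!=(df.Force)).cumsum() does: the ID increases by 1 every time Force changes from one row to the next.
Afterwards np.where keeps the ID only for rows where Force is 0 and sets it to NaN everywhere else, so each group of consecutive 0s ends up with its own ID.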
In [83]: df["swing_id"] = np.where(df.Force==0, (df.Force.shift()!=(df.Force)).cumsum(),np.nan)
...: df
Out[83]:
Time Force swing_id
0 0.83 80 NaN
1 0.84 60 NaN
2 0.85 40 NaN
3 0.86 20 NaN
4 0.87 0 5.0
5 0.88 0 5.0
6 0.89 20 NaN
7 0.90 40 NaN
8 0.91 60 NaN
9 0.92 40 NaN
10 0.93 0 10.0
11 0.94 0 10.0
12 0.95 0 10.0
13 0.96 20 NaN
In [84]: df.groupby("swing_id")["Time"].sum()
Out[84]:
swing_id
5.0 1.75
10.0 2.82
Name: Time, dtype: float64
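The question's title asks about grouping by gaps in the index, which is a slightly different route to the same groups. Below is a minimal sketch of that variant. It assumes the original integer index (83 to 96) is kept and that a swing is a run of zero-force rows whose index labels increase by exactly 1. The names zeros, swing_id, time_sums and samples are illustrative and do not come from the answer above.

```python
import pandas as pd

# Rebuild the sample data from the question, keeping the original index labels.
df = pd.DataFrame(
    {
        "Time": [round(0.83 + 0.01 * i, 2) for i in range(14)],
        "Force": [80, 60, 40, 20, 0, 0, 20, 40, 60, 40, 0, 0, 0, 20],
    },
    index=range(83, 97),
)

zeros = df[df["Force"] == 0]

# A new swing starts wherever the index label does not directly follow
# the previous one (the first row always starts a swing).
swing_id = (zeros.index.to_series().diff() != 1).cumsum()

# Sum of Time per swing, plus the number of samples in each swing.
time_sums = zeros.groupby(swing_id)["Time"].sum()
samples = zeros.groupby(swing_id).size()
print(time_sums)
print(samples)
```

Calling .std() on whichever per-swing quantity you settle on (for example samples multiplied by the sampling interval) would then give the spread across swings.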

Related

How to Change the Structure of Pandas Dataframe in Python?

I have a current Pandas DataFrame in the format below (see Current DataFrame) but I want to change the structure of it to look like the Desired DataFrame below. The top row of titles is longitudes and the first column of titles is latitudes.
Current DataFrame:
E0 E1 E2 E3 E4
LAT
89 0.01 0.01 0.02 0.01 0.00
88 0.01 0.00 0.00 0.01 0.00
87 0.00 0.02 0.01 0.02 0.01
86 0.02 0.00 0.03 0.02 0.00
85 0.00 0.00 0.00 0.01 0.03
Code to build it:
df = pd.DataFrame({
'LAT': [89, 88, 87, 86, 85],
'E0': [0.01, 0.01, 0.0, 0.02, 0.0],
'E1': [0.01, 0.0, 0.02, 0.0, 0.0],
'E2': [0.02, 0.0, 0.01, 0.03, 0.0],
'E3': [0.01, 0.01, 0.02, 0.02, 0.01],
'E4': [0.0, 0.0, 0.01, 0.0, 0.03]
}).set_index('LAT')
Desired DataFrame:
LAT LON R
89 0 0.01
89 1 0.01
89 2 0.02
89 3 0.01
89 4 0.00
88 0 0.01
88 1 0.00
88 2 0.00
88 3 0.01
88 4 0.00
87 0 0.00
87 1 0.02
87 2 0.01
87 3 0.02
87 4 0.01
86 0 0.02
86 1 0.00
86 2 0.03
86 3 0.02
86 4 0.00
85 0 0.00
85 1 0.00
85 2 0.00
85 3 0.01
85 4 0.03
Try with stack + str.extract:
new_df = (
df.stack()
.reset_index(name='R')
.rename(columns={'level_1': 'LON'})
)
new_df['LON'] = new_df['LON'].str.extract(r'(\d+$)').astype(int)
Or with pd.wide_to_long + reindex:
new_df = df.reset_index()
new_df = (
pd.wide_to_long(new_df, stubnames='E', i='LAT', j='LON')
.reindex(new_df['LAT'], level=0)
.rename(columns={'E': 'R'})
.reset_index()
)
new_df:
LAT LON R
0 89 0 0.01
1 89 1 0.01
2 89 2 0.02
3 89 3 0.01
4 89 4 0.00
5 88 0 0.01
6 88 1 0.00
7 88 2 0.00
8 88 3 0.01
9 88 4 0.00
10 87 0 0.00
11 87 1 0.02
12 87 2 0.01
13 87 3 0.02
14 87 4 0.01
15 86 0 0.02
16 86 1 0.00
17 86 2 0.03
18 86 3 0.02
19 86 4 0.00
20 85 0 0.00
21 85 1 0.00
22 85 2 0.00
23 85 3 0.01
24 85 4 0.03
You could solve it with pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index = None,
names_to = 'LON',
values_to = "R",
names_pattern = r".(.)",
sort_by_appearance = True,
ignore_index = False).reset_index()
LAT LON R
0 89 0 0.01
1 89 1 0.01
2 89 2 0.02
3 89 3 0.01
4 89 4 0.00
5 88 0 0.01
6 88 1 0.00
7 88 2 0.00
8 88 3 0.01
9 88 4 0.00
10 87 0 0.00
11 87 1 0.02
12 87 2 0.01
13 87 3 0.02
14 87 4 0.01
15 86 0 0.02
16 86 1 0.00
17 86 2 0.03
18 86 3 0.02
19 86 4 0.00
20 85 0 0.00
21 85 1 0.00
22 85 2 0.00
23 85 3 0.01
24 85 4 0.03
Here we are only interested in the numbers that are at the end of the columns - we get this by passing a regular expression to names_pattern.
You can avoid pyjanitor altogether by using melt and rename:
(df.rename(columns=lambda col: col[-1])
.melt(var_name='LON', value_name='R', ignore_index=False)
)
LON R
LAT
89 0 0.01
88 0 0.01
87 0 0.00
86 0 0.02
85 0 0.00
89 1 0.01
88 1 0.00
87 1 0.02
86 1 0.00
85 1 0.00
89 2 0.02
88 2 0.00
87 2 0.01
86 2 0.03
85 2 0.00
89 3 0.01
88 3 0.01
87 3 0.02
86 3 0.02
85 3 0.01
89 4 0.00
88 4 0.00
87 4 0.01
86 4 0.00
85 4 0.03
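The melt output above is ordered column by column and keeps LAT as the index, so it does not yet match the desired layout. A minimal sketch of one way to finish the job, assuming the latitudes should run from high to low as in the sample and that every column name is the letter E followed by an integer (long_df is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'LAT': [89, 88, 87, 86, 85],
    'E0': [0.01, 0.01, 0.0, 0.02, 0.0],
    'E1': [0.01, 0.0, 0.02, 0.0, 0.0],
    'E2': [0.02, 0.0, 0.01, 0.03, 0.0],
    'E3': [0.01, 0.01, 0.02, 0.02, 0.01],
    'E4': [0.0, 0.0, 0.01, 0.0, 0.03]
}).set_index('LAT')

long_df = (
    df.reset_index()
      .melt(id_vars='LAT', var_name='LON', value_name='R')
      # Strip the leading 'E' and convert the rest to an integer longitude.
      .assign(LON=lambda d: d['LON'].str[1:].astype(int))
      # Assumption: latitudes run from high to low, longitudes from low to high.
      .sort_values(['LAT', 'LON'], ascending=[False, True])
      .reset_index(drop=True)
)
print(long_df.head(6))
```

Slicing with .str[1:] rather than taking the last character also copes with longitudes of two or more digits.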
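Another approach uses pd.wide_to_long and then sorts by LAT and LON (note that this sorts LAT in ascending order and leaves the value column named E):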
pd.wide_to_long(df.reset_index(), ['E'], i = 'LAT', j = 'LON').reset_index().sort_values(by = ['LAT','LON'])
LAT LON E
4 85 0 0.00
9 85 1 0.00
14 85 2 0.00
19 85 3 0.01
24 85 4 0.03
3 86 0 0.02
8 86 1 0.00
13 86 2 0.03
18 86 3 0.02
23 86 4 0.00
2 87 0 0.00
7 87 1 0.02
12 87 2 0.01
17 87 3 0.02
22 87 4 0.01
1 88 0 0.01
6 88 1 0.00
11 88 2 0.00
16 88 3 0.01
21 88 4 0.00
0 89 0 0.01
5 89 1 0.01
10 89 2 0.02
15 89 3 0.01
20 89 4 0.00
Quick and dirty.
Pad your LAT with your LON in a list of tuple pairs.
[
(89.0, 0.01),
(89.1, 0.01),
(89.2, 0.02)
]
I'm sure someone can break down a way to organize it like you want, but from what I know you need a unique ID data point for most data in a query structure.
OR:
If you aren't putting this back into a db, then maybe you can use a dict something like this:
data = { '89' : { '0' : 0.01,
                  '1' : 0.01,
                  '2' : 0.02 .....
                }
       }
You can then get the data with
dpoint = data['89']['0']
assert dpoint == 0.01
# passes
dpoint = data['89']['2']
assert dpoint == 0.02
# passes

Applying cumulative correction factor across dataframe

I'm fairly new to Pandas so please forgive me if the answer to my question is rather obvious. I've got a dataset like this
Data Correction
0 100 Nan
1 104 Nan
2 108 Nan
3 112 Nan
4 116 Nan
5 120 0.5
6 124 Nan
7 128 Nan
8 132 Nan
9 136 0.4
10 140 Nan
11 144 Nan
12 148 Nan
13 152 0.3
14 156 Nan
15 160 Nan
What I want to do is calculate a correction factor for the data that accumulates upwards.
By that I mean that elements from 13 and below should have the factor 0.3 applied, with 9 and below applying 0.3*0.4 and 5 and below 0.3*0.4*0.5.
So the final correction column should look like this
Data Correction Factor
0 100 Nan 0.06
1 104 Nan 0.06
2 108 Nan 0.06
3 112 Nan 0.06
4 116 Nan 0.06
5 120 0.5 0.06
6 124 Nan 0.12
7 128 Nan 0.12
8 132 Nan 0.12
9 136 0.4 0.12
10 140 Nan 0.3
11 144 Nan 0.3
12 148 Nan 0.3
13 152 0.3 0.3
14 156 Nan 1
15 160 Nan 1
How can I do this?
I think you are looking for cumprod() after reversing the Correction column:
df=df.assign(Factor=df.Correction[::-1].cumprod().ffill().fillna(1))
Data Correction Factor
0 100 NaN 0.06
1 104 NaN 0.06
2 108 NaN 0.06
3 112 NaN 0.06
4 116 NaN 0.06
5 120 0.5 0.06
6 124 NaN 0.12
7 128 NaN 0.12
8 132 NaN 0.12
9 136 0.4 0.12
10 140 NaN 0.30
11 144 NaN 0.30
12 148 NaN 0.30
13 152 0.3 0.30
14 156 NaN 1.00
15 160 NaN 1.00
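For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample data and uses a closely related variant: fill the missing corrections with a neutral factor of 1 first, then take the cumulative product from the bottom up. The variable name correction is illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data: corrections at rows 5, 9 and 13, NaN elsewhere.
correction = [np.nan] * 16
correction[5], correction[9], correction[13] = 0.5, 0.4, 0.3
df = pd.DataFrame({'Data': range(100, 164, 4), 'Correction': correction})

# Treat missing corrections as a neutral factor of 1, accumulate the
# product from the last row upwards, then restore the original order.
df['Factor'] = df['Correction'].fillna(1)[::-1].cumprod()[::-1]
print(df)
```

On this data it should give the same Factor column as the cumprod().ffill().fillna(1) chain above; which one reads better is a matter of taste.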
I can't think of a good pandas function that does this. However, you can use a for loop to multiply an array by each correction value and then add it as a column.
import numpy as np
import pandas as pd
lst = [np.nan,np.nan,np.nan,np.nan,np.nan,0.5,np.nan,np.nan,np.nan,np.nan,0.4,np.nan,np.nan,np.nan,0.3,np.nan,np.nan]
lst1 = [i + 100 for i in range(len(lst))]
# Start with a factor of 1.0 for every row
newcol = np.ones(len(lst))
df = pd.DataFrame({'Data' : lst1,'Correction' : lst})
for i in range(len(df['Correction'])):
    if not np.isnan(df.Correction[i]):
        # Apply this correction to the current row and every row above it
        newcol[0:i+1] = newcol[0:i+1] * df.Correction[i]
df['Factor'] = newcol
print(df)
This code prints
Data Correction Factor
0 100 NaN 0.06
1 101 NaN 0.06
2 102 NaN 0.06
3 103 NaN 0.06
4 104 NaN 0.06
5 105 0.5 0.06
6 106 NaN 0.12
7 107 NaN 0.12
8 108 NaN 0.12
9 109 NaN 0.12
10 110 0.4 0.12
11 111 NaN 0.30
12 112 NaN 0.30
13 113 NaN 0.30
14 114 0.3 0.30
15 115 NaN 1.00
16 116 NaN 1.00

Python Pandas-retrieving values in one column while they are less than the value of a second column

Suppose I have a df that looks like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 31.0 0.90
2 30 0.03 41.0 0.70
3 40 0.72 51.0 0.08
4 50 0.09 81.0 0.78
5 60 0.09 NaN NaN
6 70 0.01 NaN NaN
7 80 0.09 NaN NaN
8 90 0.08 NaN NaN
9 100 0.02 NaN NaN
In the posR column, we see that it jumps from 11 to 31, and there is not a value in the "20's". I want to insert a value to fill that space, which would essentially just be the posF value, and NA, so my resulting df would look like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 20 NaN
2 30 0.03 31.0 0.90
3 40 0.72 41.0 0.70
4 50 0.09 50 NaN
5 60 0.09 60 NaN
6 70 0.01 70 NaN
7 80 0.09 80 NaN
8 90 0.08 81.0 0.78
9 100 0.02 100 NaN
So I want to fill the NaN values in the position with the values from posF that are in between the values in posR.
What I have tried to do is just make a dummy list and add values to the list based on if they were less than a (I see the flaw here but I don't know how to fix it).
insert_rows = []
for x in df['posF']:
    for a,b in zip(df['posR'], df['rfreq']):
        if x<a:
            insert_rows.append([x, 'NA'])
print(len(insert_rows))#21, should be 5
I realize that it is appending x several times until it reaches the condition of being >a.
After this I will just create a new df and add these values to the original 2 columns so they are the same length.
If you can think of a better title, feel free to edit.
My first thought was to retrieve the new indices for the entries in posR by interpolating with posF and then put the values to their new positions - but as you want to have 81 one row later than here, I'm afraid this is not exactly what you're searching for and I still don't really get the logic behind your task.
However, perhaps this is a starting point, let's see...
This approach would work like the following:
Retrieve the new index positions of the values in posR according to their order in posF:
import numpy as np
idx = np.interp(df.posR, df.posF, df.index).round()
Get rid of nan entries and cast to int:
idx = idx[np.isfinite(idx)].astype(int)
Create a new column by copying posF in the first step, and set newrfreq to nan respectively:
df['newposR'] = df.posF
df['newrfreq'] = np.nan
Then overwrite with the values from posR and rfreq, but now at the updated positions:
df.loc[idx, 'newposR'] = df.posR[:len(idx)].values
df.loc[idx, 'newrfreq'] = df.rfreq[:len(idx)].values
Result:
posF ffreq posR rfreq newposR newrfreq
0 10 0.50 11.0 0.08 11.0 0.08
1 20 0.20 31.0 0.90 20.0 NaN
2 30 0.03 41.0 0.70 31.0 0.90
3 40 0.72 51.0 0.08 41.0 0.70
4 50 0.09 81.0 0.78 51.0 0.08
5 60 0.09 NaN NaN 60.0 NaN
6 70 0.01 NaN NaN 70.0 NaN
7 80 0.09 NaN NaN 81.0 0.78
8 90 0.08 NaN NaN 90.0 NaN
9 100 0.02 NaN NaN 100.0 NaN
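The steps above can be put together into one runnable sketch. It makes a few assumptions that are not in the answer: the frame has a default 0..n-1 index, posF is sorted in increasing order, and no two posR values land on the same row (if they did, the later one would overwrite the earlier one). It also selects the non-missing posR rows with a mask (valid, an illustrative name) instead of slicing the first len(idx) rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'posF': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'ffreq': [0.50, 0.20, 0.03, 0.72, 0.09, 0.09, 0.01, 0.09, 0.08, 0.02],
    'posR': [11, 31, 41, 51, 81] + [np.nan] * 5,
    'rfreq': [0.08, 0.90, 0.70, 0.08, 0.78] + [np.nan] * 5,
})

# Rows that actually carry a posR value.
valid = df['posR'].notna()

# Map each posR value to the row position whose posF is nearest.
idx = np.interp(df.loc[valid, 'posR'], df['posF'], np.arange(len(df)))
idx = idx.round().astype(int)

# Start from posF / NaN, then overwrite the matched rows.
df['newposR'] = df['posF'].astype(float)
df['newrfreq'] = np.nan
df.loc[idx, 'newposR'] = df.loc[valid, 'posR'].to_numpy()
df.loc[idx, 'newrfreq'] = df.loc[valid, 'rfreq'].to_numpy()
print(df[['posF', 'newposR', 'newrfreq']])
```

As in the result shown above, 51 stays in the row for posF 50 and 81 lands in the row for posF 80, so this still differs from the output requested in the question.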

sorting a column with missing values

There are 6 columns of data. The 4th column has the same values as the first one, but some values are missing. I would like to know how to rearrange the 4th column (together with the 5th and 6th) so that equal values fall on the same row, using Python.
Sample data
255 12 0.1 255 12 0.1
256 13 0.1 259 15 0.15
259 15 0.15 272 18 0.12
272 18 0.12
290 19 0.09
Desired output
255 12 0.1 255 12 0.1
256 13 0.1
259 15 0.15 259 15 0.15
272 18 0.12 272 18 0.12
290 19 0.09
You can try merge:
print(df)
a b c d e f
0 255 12 0.10 255.0 12.0 0.10
1 256 13 0.10 259.0 15.0 0.15
2 259 15 0.15 272.0 18.0 0.12
3 272 18 0.12 NaN NaN NaN
4 290 19 0.09 NaN NaN NaN
print(pd.merge(df[['a','b','c']],
               df[['d','e','f']],
               left_on=['a','b'],
               right_on=['d','e'],
               how='left'))
a b c d e f
0 255 12 0.10 255.0 12.0 0.10
1 256 13 0.10 NaN NaN NaN
2 259 15 0.15 259.0 15.0 0.15
3 272 18 0.12 272.0 18.0 0.12
4 290 19 0.09 NaN NaN NaN
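A self-contained Python 3 sketch of the same left merge, for reference. The column names a to f follow the answer above. Dropping the all-NaN padding rows from the right-hand side before merging is an extra step that is not in the answer, and the names left, right and aligned are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [255, 256, 259, 272, 290],
    'b': [12, 13, 15, 18, 19],
    'c': [0.10, 0.10, 0.15, 0.12, 0.09],
    'd': [255, 259, 272, np.nan, np.nan],
    'e': [12, 15, 18, np.nan, np.nan],
    'f': [0.10, 0.15, 0.12, np.nan, np.nan],
})

left = df[['a', 'b', 'c']]
# Drop the empty padding rows so only real records take part in the join.
right = df[['d', 'e', 'f']].dropna(how='all')

# A left merge keeps every row of the first block, in its original order,
# and fills d, e, f with NaN where there is no matching record.
aligned = left.merge(right, left_on=['a', 'b'], right_on=['d', 'e'], how='left')
print(aligned)
```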

pandas why does int64 - float64 column subtraction yield NaN's

I am confused by the results of subtracting two pandas columns. When I subtract an int64 column from a float64 column, the result contains many NaN entries. Why is this happening? What could be the cause of this strange behavior?
Final Update: As N.Wouda pointed out, my problem was that the indexes did not match.
Y_predd.reset_index(drop=True,inplace=True)
Y_train_2.reset_index(drop=True,inplace=True)
solved my problem
Update 2: It seems like my indexes don't match, which makes sense because they are both sampled from the same data frame. How can I "start fresh" with new indexes?
Update: Y_predd- Y_train_2.astype('float64') also yields NaN values. I am confused why this did not raise an error. They are the same size. Why could this be yielding NaN?
In [48]: Y_predd.size
Out[48]: 182527
In [49]: Y_train_2.astype('float64').size
Out[49]: 182527
Original documentation of error:
In [38]: Y_train_2
Out[38]:
66419 0
2319 0
114195 0
217532 0
131687 0
144024 0
94055 0
143479 0
143124 0
49910 0
109278 0
215905 1
127311 0
150365 0
117866 0
28702 0
168111 0
64625 0
207180 0
14555 0
179268 0
22021 1
120169 0
218769 0
259754 0
188296 1
63503 1
175104 0
218261 0
35453 0
..
112048 0
97294 0
68569 0
60333 0
184119 1
57632 0
153729 1
155353 0
114979 1
180634 0
42842 0
99979 0
243728 0
203679 0
244381 0
55646 0
35557 0
148977 0
164008 0
53227 1
219863 0
4625 0
155759 0
232463 0
167807 0
123638 0
230463 1
198219 0
128459 1
53911 0
Name: objective_for_classifier, dtype: int64
In [39]: Y_predd
Out[39]:
0 0.00
1 0.48
2 0.04
3 0.00
4 0.48
5 0.58
6 0.00
7 0.00
8 0.02
9 0.06
10 0.22
11 0.32
12 0.12
13 0.26
14 0.18
15 0.18
16 0.28
17 0.30
18 0.52
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 0.64
26 0.30
27 0.76
28 0.10
29 0.42
...
182497 0.60
182498 0.00
182499 0.06
182500 0.12
182501 0.00
182502 0.40
182503 0.70
182504 0.42
182505 0.54
182506 0.24
182507 0.56
182508 0.34
182509 0.10
182510 0.18
182511 0.06
182512 0.12
182513 0.00
182514 0.22
182515 0.08
182516 0.22
182517 0.00
182518 0.42
182519 0.02
182520 0.50
182521 0.00
182522 0.08
182523 0.16
182524 0.00
182525 0.32
182526 0.06
Name: prediction_method_used, dtype: float64
In [40]: Y_predd - Y_train_2
Out[41]:
0 NaN
1 NaN
2 0.04
3 NaN
4 0.48
5 NaN
6 0.00
7 0.00
8 NaN
9 NaN
10 NaN
11 0.32
12 -0.88
13 -0.74
14 0.18
15 NaN
16 NaN
17 NaN
18 NaN
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 NaN
26 0.30
27 NaN
28 0.10
29 0.42
...
260705 NaN
260706 NaN
260709 NaN
260710 NaN
260711 NaN
260713 NaN
260715 NaN
260716 NaN
260718 NaN
260721 NaN
260722 NaN
260723 NaN
260724 NaN
260725 NaN
260726 NaN
260727 NaN
260731 NaN
260735 NaN
260737 NaN
260738 NaN
260739 NaN
260740 NaN
260742 NaN
260743 NaN
260745 NaN
260748 NaN
260749 NaN
260750 NaN
260751 NaN
260752 NaN
dtype: float64
Posting here so we can close the question, from the comments:
Are you sure each dataframe has the same index range?
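You can reset the index on both objects with reset_index(drop=True) (assign the result back, or pass inplace=True) and then subtract them as you were already doing. pandas aligns on index labels when subtracting, so once both indexes run from 0 to n-1 the result should no longer contain NaN caused by mismatched labels.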
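A small sketch of why the NaN values appear, using made-up stand-ins for the two series; y_true, y_pred and their values are hypothetical, not the asker's data:

```python
import pandas as pd

# Hypothetical small stand-ins for Y_train_2 (sampled labels) and Y_predd.
y_true = pd.Series([0, 1, 0, 1], index=[66419, 2, 114195, 0])
y_pred = pd.Series([0.00, 0.48, 0.04, 0.58])

# Arithmetic aligns on index labels: the result covers the union of both
# indexes and is NaN wherever a label exists on only one side.
misaligned = y_pred - y_true

# Resetting both indexes makes the subtraction positional.
aligned = y_pred.reset_index(drop=True) - y_true.reset_index(drop=True)
print(misaligned)
print(aligned)
```

This also explains why the two series can have the same size and still give NaN, and why the result in the question is longer than either input: only labels present in both series produce a number.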
