This is my dataframe:
date number name di t
0 2008-01-01 150 서울역(150) 승차 379
1 2008-01-01 150 서울역(150) 하차 145
2 2008-01-01 151 시청(151) 승차 131
3 2008-01-01 151 시청(151) 하차 35
4 2008-01-01 152 종각(152) 승차 1287
5 2008-01-01 152 종각(152) 하차 74
6 2008-01-01 153 종로3가(153) 승차 484
7 2008-01-01 153 종로3가(153) 하차 28
8 2008-01-01 154 종로5가(154) 승차 89
9 2008-01-01 154 종로5가(154) 하차 14
10 2008-01-01 155 동대문(155) 승차 190
11 2008-01-01 155 동대문(155) 하차 23
12 2008-01-01 156 신설동(156) 승차 65
13 2008-01-01 156 신설동(156) 하차 15
14 2008-01-01 157 제기동(157) 승차 156
15 2008-01-01 157 제기동(157) 하차 16
I want the result of the subtraction in t (승차 - 하차, i.e. get-on minus get-off) for each station, like this:
date number name di t
0 2008-01-01 150 서울역(150) 승차 234
2 2008-01-01 151 시청(151) 승차 96
4 2008-01-01 152 종각(152) 승차 1213
6 2008-01-01 153 종로3가(153) 승차 456
8 2008-01-01 154 종로5가(154) 승차 75
10 2008-01-01 155 동대문(155) 승차 167
12 2008-01-01 156 신설동(156) 승차 50
14 2008-01-01 157 제기동(157) 승차 140
How can I get this dataframe?
I did a Google search for "dataframe subtraction", but it doesn't show the result I want. What am I missing in my search?
We can do the following:
Group by number and take the diff within each group
Merge the result back into our original dataframe on the index
Remove the unwanted column
# Within each group, diff() yields 하차 - 승차 on the second row;
# abs() flips the sign and dropna() drops the NaN on the 승차 rows
group = abs(df.groupby('number')['t'].diff().dropna())
# shift the index back by 1 so the result lines up with the 승차 rows
group.index = group.index - 1
df_merge = df.merge(group,
                    left_index=True,
                    right_index=True,
                    suffixes=['_1', ''])
# drop the original t column (suffixed '_1' by the merge)
df_merge.drop('t_1', axis=1, inplace=True)
print(df_merge)
date number name di t
0 2008-01-01 150 서울역(150) 승차 234.0
2 2008-01-01 151 시청(151) 승차 96.0
4 2008-01-01 152 종각(152) 승차 1213.0
6 2008-01-01 153 종로3가(153) 승차 456.0
8 2008-01-01 154 종로5가(154) 승차 75.0
10 2008-01-01 155 동대문(155) 승차 167.0
12 2008-01-01 156 신설동(156) 승차 50.0
14 2008-01-01 157 제기동(157) 승차 140.0
IIUC, take first within each group, then assign the negated diff after dropna:
g = df.groupby(['date', 'number', 'name'])
# one row per group, keeping di from the first (승차) row
yourdf = g.di.first().reset_index()
# diff() gives 하차 - 승차 within each group; negate it for 승차 - 하차
yourdf['t'] = -g.t.diff().dropna().values
yourdf
Out[648]:
date number name di t
0 2008-01-01 150 서울역(150) 승차 234.0
1 2008-01-01 151 시청(151) 승차 96.0
2 2008-01-01 152 종각(152) 승차 1213.0
3 2008-01-01 153 종로3가(153) 승차 456.0
4 2008-01-01 154 종로5가(154) 승차 75.0
5 2008-01-01 155 동대문(155) 승차 167.0
6 2008-01-01 156 신설동(156) 승차 50.0
7 2008-01-01 157 제기동(157) 승차 140.0
Or push it into one line:
df.groupby(['date', 'number', 'name']).\
    agg({'di': 'first', 't': lambda x: x.iloc[0] - x.iloc[1]}).reset_index()
Out[665]:
date number name di t
0 2008-01-01 150 서울역(150) 승차 234
1 2008-01-01 151 시청(151) 승차 96
2 2008-01-01 152 종각(152) 승차 1213
3 2008-01-01 153 종로3가(153) 승차 456
4 2008-01-01 154 종로5가(154) 승차 75
5 2008-01-01 155 동대문(155) 승차 167
6 2008-01-01 156 신설동(156) 승차 50
7 2008-01-01 157 제기동(157) 승차 140
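If you prefer reshaping over aggregating, a pivot also works. This is a minimal sketch (assuming each (date, number, name) group has exactly one 승차 row and one 하차 row):
# Sketch: pivot 승차/하차 into columns, subtract, then rebuild the frame
wide = df.pivot_table(index=['date', 'number', 'name'],
                      columns='di', values='t', aggfunc='first')
out = (wide['승차'] - wide['하차']).rename('t').reset_index()
out.insert(3, 'di', '승차')  # restore di to match the requested layout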
If the rows are always paired and ordered as shown in your sample, then just do the simple math and then drop_duplicates(). The calculation on the odd-indexed rows has no influence on the result (they are all discarded).
df2 = df.copy()
# subtract the following row's t (하차) from each row's t (승차)
df2['t'] = df2.t - df2.t.shift(-1)
# keep only the first (승차) row of each pair
df2 = df2.drop_duplicates(['date', 'number', 'name'])
df2
# date number name di t
#0 2008-01-01 150 서울역(150) 승차 234.0
#2 2008-01-01 151 시청(151) 승차 96.0
#4 2008-01-01 152 종각(152) 승차 1213.0
#6 2008-01-01 153 종로3가(153) 승차 456.0
#8 2008-01-01 154 종로5가(154) 승차 75.0
#10 2008-01-01 155 동대문(155) 승차 167.0
#12 2008-01-01 156 신설동(156) 승차 50.0
#14 2008-01-01 157 제기동(157) 승차 140.0
Update: just a follow-up to this old question. The approach I suggested above had one issue for groups with only one row (i.e. no paired row), but this can be overcome by using another drop_duplicates():
# define columns to group rows
uniq_cols = ['date', 'number', 'name']
# find all groups/rows which do NOT have any paired rows
# and save them in a separate dataframe
# Here you can set their value to NULL if needed
u = df.drop_duplicates(uniq_cols, keep=False)
# calculate the difference
df['t'] = df.t - df.t.shift(-1)
# concat the two dataframes and then drop_duplicates
# make sure `u` comes before `df`, so that its values are kept
# while the ones in `df` are discarded
# sort_index() restores the original order
pd.concat([u, df]).drop_duplicates(uniq_cols).sort_index()
Note: rows need to be sorted so that rows in the same group line up consecutively.
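If the input might not already be ordered, a sketch of a way to restore that assumption is to sort first (this relies on 승차 sorting before 하차, which holds in Unicode code-point order):
# line up each group's rows consecutively, 승차 before 하차
df = df.sort_values(['date', 'number', 'name', 'di']).reset_index(drop=True)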
Related
I have a dataframe that looks like this:
Path_Version commitdates Year-Month API Age api_spec_id
168 NaN 2018-10-19 2018-10 39 521
169 NaN 2018-10-19 2018-10 39 521
170 NaN 2018-10-12 2018-10 39 521
171 NaN 2018-10-12 2018-10 39 521
172 NaN 2018-10-12 2018-10 39 521
173 NaN 2018-10-11 2018-10 39 521
174 NaN 2018-10-11 2018-10 39 521
175 NaN 2018-10-11 2018-10 39 521
176 NaN 2018-10-11 2018-10 39 521
177 NaN 2018-10-11 2018-10 39 521
178 NaN 2018-09-26 2018-09 39 521
179 NaN 2018-09-25 2018-09 39 521
I want to calculate the days elapsed from the first commitdate till the last, after sorting the commit dates first, so something like this:
Path_Version commitdates Year-Month API Age api_spec_id Days_difference
168 NaN 2018-10-19 2018-10 39 521 25
169 NaN 2018-10-19 2018-10 39 521 25
170 NaN 2018-10-12 2018-10 39 521 18
171 NaN 2018-10-12 2018-10 39 521 18
172 NaN 2018-10-12 2018-10 39 521 18
173 NaN 2018-10-11 2018-10 39 521 16
174 NaN 2018-10-11 2018-10 39 521 16
175 NaN 2018-10-11 2018-10 39 521 16
176 NaN 2018-10-11 2018-10 39 521 16
177 NaN 2018-10-11 2018-10 39 521 16
178 NaN 2018-09-26 2018-09 39 521 1
179 NaN 2018-09-25 2018-09 39 521 0
I tried first sorting the commitdates by api_spec_id since it is unique for every API, and then calculating the diff
final_api['commitdates'] = final_api.groupby('api_spec_id')['commitdates'].apply(lambda x: x.sort_values())
final_api['diff'] = final_api.groupby('api_spec_id')['commitdates'].diff() / np.timedelta64(1, 'D')
final_api['diff'] = final_api['diff'].fillna(0)
It just returns zero for the entire column. I don't want to group them; I only want to calculate the difference based on the sorted commitdates, from the first commitdate to the last in the entire dataset, in days.
Any idea how I can achieve this?
Use pandas.to_datetime, sub, min and dt.days (no sorting is needed: the minimum is the same regardless of row order):
# parse the dates once, subtract the earliest one, convert to whole days
t = pd.to_datetime(df['commitdates'])
df['Days_difference'] = t.sub(t.min()).dt.days
If you need to group per API:
t = pd.to_datetime(df['commitdates'])
# same idea, but the reference minimum is computed per api_spec_id
df['Days_difference'] = t.sub(t.groupby(df['api_spec_id']).transform('min')).dt.days
Output:
Path_Version commitdates Year-Month API Age api_spec_id Days_difference
168 NaN 2018-10-19 2018-10 39 521 24
169 NaN 2018-10-19 2018-10 39 521 24
170 NaN 2018-10-12 2018-10 39 521 17
171 NaN 2018-10-12 2018-10 39 521 17
172 NaN 2018-10-12 2018-10 39 521 17
173 NaN 2018-10-11 2018-10 39 521 16
174 NaN 2018-10-11 2018-10 39 521 16
175 NaN 2018-10-11 2018-10 39 521 16
176 NaN 2018-10-11 2018-10 39 521 16
177 NaN 2018-10-11 2018-10 39 521 16
178 NaN 2018-09-26 2018-09 39 521 1
179 NaN 2018-09-25 2018-09 39 521 0
I have a dataset with an uneven sample frequency as seen on this subset:
time date x y id nn1 nn2
0 2019-09-17 08:43:06 234 236 4909 22.02271554554524 38.2099463490856
0 2019-09-17 08:43:06 251 222 4911 22.02271554554524 46.57252408878007
1 2019-09-17 08:43:07 231 244 4909 30.4138126514911 41.617304093369626
1 2019-09-17 08:43:07 252 222 4911 30.4138126514911 46.57252408878007
1 2019-09-17 08:43:07 207 210 4900 41.617304093369626 46.57252408878007
2 2019-09-17 08:43:08 234 250 4909 33.28663395418648 48.82622246293481
2 2019-09-17 08:43:08 206 210 4900 47.53945729601885 48.82622246293481
3 2019-09-17 08:43:09 252 222 4911 38.28837943815329 47.53945729601885
3 2019-09-17 08:43:09 206 210 4900 40.718546143004666 47.53945729601885
3 2019-09-17 08:43:09 223 247 4909 38.28837943815329 40.718546143004666
4 2019-09-17 08:43:10 206 210 4900 35.4682957019364 47.53945729601885
4 2019-09-17 08:43:10 229 237 4909 27.459060435491963 35.4682957019364
4 2019-09-17 08:43:10 252 222 4911 27.459060435491963 47.53945729601885
5 2019-09-17 08:43:12 226 241 4909 30.805843601498726 38.01315561749642
5 2019-09-17 08:43:12 251 223 4911 30.805843601498726 44.94441010848846
5 2019-09-17 08:43:12 209 207 4900 38.01315561749642 44.94441010848846
6 2019-09-17 08:43:13 251 222 4911 34.20526275297414 44.598206241955516
6 2019-09-17 08:43:13 224 243 4909 34.20526275297414 39.0
6 2019-09-17 08:43:13 209 207 4900 39.0 44.598206241955516
7 2019-09-17 08:43:14 251 222 4911 33.421549934136806 45.5411901469428
7 2019-09-17 08:43:14 225 243 4909 33.421549934136806 39.81205847478876
8 2019-09-17 08:43:15 225 245 4909 34.713109915419565 41.23105625617661
8 2019-09-17 08:43:15 209 207 4900 41.23105625617661 44.598206241955516
8 2019-09-17 08:43:15 251 222 4911 34.713109915419565 44.598206241955516
9 2019-09-17 08:43:16 209 207 4900 37.20215047547655 48.46648326421054
9 2019-09-17 08:43:16 254 225 4911 25.942243542145693 48.46648326421054
10 2019-09-17 08:43:18 206 207 4900 41.182520563948 67.26812023536856
10 2019-09-17 08:43:18 242 227 4909 30.805843601498726 41.182520563948
10 2019-09-17 08:43:18 272 220 4911 30.805843601498726 67.26812023536856
I want to resample the data set into even 0.25 second intervals (increasing the sample frequency to 4 fps) and fill the NaN values with the average values of the given second. I'm failing with interpolating and reshaping; can anyone help? Also, the id has to stay the same. I deeply appreciate it!
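A possible sketch: parse date into real timestamps, upsample each id to a 250 ms grid with resample, then fill the newly created gaps with the mean of the second they fall in. Treating "average values of the given second" as a per-second mean is an assumption, as is that the value columns have numeric dtypes:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])

def upsample(g):
    # regular 250 ms grid for this id's track (4 samples per second)
    r = g.drop(columns='id').set_index('date').resample('250ms').mean()
    # fill the newly created NaN slots with the mean of their second
    return r.groupby(r.index.floor('s')).transform(lambda s: s.fillna(s.mean()))

# the id comes back as an index level, so it stays unchanged per track
out = df.groupby('id', group_keys=True).apply(upsample).reset_index()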
My dataframe looks like below (날짜 = date, 역번호 = station number, 역명 = station name, 구분 = 승차 get-on / 하차 get-off):
날짜 역번호 역명 구분 a b c d e f ... k l m n o p q r s t
2008-01-01 150 서울역(150) 승차 379 287 371 876 965 1389 ... 2520 3078 3495 3055 2952 2726 3307 2584 1059 264
2008-01-01 150 서울역(150) 하차 145 707 689 1037 1170 1376 ... 1955 2304 2203 2128 1747 1593 1078 744 406 558
2008-01-01 151 시청(151) 승차 131 131 101 152 191 202 ... 892 900 1154 1706 1444 1267 928 531 233 974
2008-01-01 151 시청(151) 하차 35 158 203 393 375 460 ... 1157 1153 1303 1190 830 454 284 141 107 185
2008-01-01 152 종각(152) 승차 1287 867 400 330 345 338 ... 1867 2269 2777 2834 2646 2784 2920 2290 802 1559
I have a dataframe like the above, and I want to reshape the a~t columns (20 of them) into a single column, i.e. to shape (len(df) * 20, 1).
I want to reshape the dataframe like below:
날짜 역번호 역명 구분 a
2018-01-01 150 서울역 승차 379
2018-01-01 150 서울역 승차 287
2018-01-01 150 서울역 승차 371
2018-01-01 150 서울역 승차 876
2018-01-01 150 서울역 승차 965
....
2008-01-01 152 종각 승차 802
2008-01-01 152 종각 승차 1559
something like df = df.reshape(len(data2) * 20, 1)
How can I do this?
import numpy as np
import pandas as pd

# A sample dataframe with 5 columns
df = pd.DataFrame(np.random.randn(100, 5))
# Columns 0 and 1 are retained as id_vars; the remaining columns are
# melted into rows with their corresponding values. Finally we drop
# the 'variable' column.
df = df.melt([0, 1], value_name='A').drop('variable', axis=1)
The wide columns are converted into a single long value column.
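Applied to the frame above, the same pattern would look like this sketch, where the four Korean-named columns are the id_vars and the remaining a~t columns supply the values:
# the first four columns identify each row; everything else melts into 'a'
id_cols = ['날짜', '역번호', '역명', '구분']
long_df = df.melt(id_vars=id_cols, value_name='a').drop('variable', axis=1)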
I have a dataframe like this:
date number name div a b c d e f ... k l m n o p q r s t
0 2008-01-01 150 A get_on 379 287 371 876 965 1389 ... 2520 3078 3495 3055 2952 2726 3307 2584 1059 264
1 2008-01-01 150 A get_off 145 707 689 1037 1170 1376 ... 1955 2304 2203 2128 1747 1593 1078 744 406 558
2 2008-01-01 151 B get_on 131 131 101 152 191 202 ... 892 900 1154 1706 1444 1267 928 531 233 974
3 2008-01-01 151 B get_off 35 158 203 393 375 460 ... 1157 1153 1303 1190 830 454 284 141 107 185
4 2008-01-01 152 C get_on 1287 867 400 330 345 338 ... 1867 2269 2777 2834 2646 2784 2920 2290 802 1559
5 2008-01-01 152 C get_off 74 147 261 473 597 698 ... 2161 2298 2360 1997 1217 799 461 271 134 210
to
date number name div a
2008-01-01 150 A get_on 379
2008-01-01 150 A get_on 287
2008-01-01 150 A get_on 371
2008-01-01 150 A get_on 876
2008-01-01 150 A get_on 965
2008-01-01 150 A get_on 1389
....
2008-01-01 152 C get_off 2161
2008-01-01 152 C get_off 2298
2008-01-01 152 C get_off 2360
2008-01-01 152 C get_off 1997
2008-01-01 152 C get_off 1217
2008-01-01 152 C get_off 799
2008-01-01 152 C get_off 461
2008-01-01 152 C get_off 271
2008-01-01 152 C get_off 134
2008-01-01 152 C get_off 210
I tried the melt method like
df.melt(id_vars=df.columns.tolist()[0:4], value_name='a').drop('variable', axis=1)
but the b~t columns are deleted... I want the b~t values to go under the 'a' column.
It's not working on my dataframe...
How can I get that result?
number is the train number
name is the train name
div is get_on or get_off
dataset is https://drive.google.com/open?id=1Upb5PgymkPB5TXuta_sg6SijwzUuEkfl
Use DataFrame.sort_values after melt:
df = df.melt(id_vars=df.columns[:4], value_name='a').drop('variable', axis=1)
df = df.sort_values(['date','number', 'div'], ascending=[True, True, False])
print (df.head())
date number name div a
0 2008-01-01 150 A get_on 379
6 2008-01-01 150 A get_on 287
12 2008-01-01 150 A get_on 371
18 2008-01-01 150 A get_on 876
24 2008-01-01 150 A get_on 965
print (df.tail())
date number name div a
71 2008-01-01 152 C get_off 799
77 2008-01-01 152 C get_off 461
83 2008-01-01 152 C get_off 271
89 2008-01-01 152 C get_off 134
95 2008-01-01 152 C get_off 210
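An alternative sketch: set_index plus stack walks each row's a~t values left to right, so the desired order falls out without an explicit sort:
out = (df.set_index(['date', 'number', 'name', 'div'])
         .stack()                 # one output row per (row, column) pair
         .reset_index(name='a')   # the unnamed stacked level becomes 'level_4'
         .drop(columns='level_4'))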
I have a dataframe as follows:
A B
zDate
01-JAN-17 100 200
02-JAN-17 111 203
03-JAN-17 NaN 202
04-JAN-17 109 205
05-JAN-17 101 211
06-JAN-17 105 NaN
07-JAN-17 104 NaN
What is the best way to fill the missing values using the last available ones?
Following is the intended result:
A B
zDate
01-JAN-17 100 200
02-JAN-17 111 203
03-JAN-17 111 202
04-JAN-17 109 205
05-JAN-17 101 211
06-JAN-17 105 211
07-JAN-17 104 211
Use the ffill function, which is the same as fillna with method='ffill':
df = df.ffill()
print (df)
A B
zDate
01-JAN-17 100.0 200.0
02-JAN-17 111.0 203.0
03-JAN-17 111.0 202.0
04-JAN-17 109.0 205.0
05-JAN-17 101.0 211.0
06-JAN-17 105.0 211.0
07-JAN-17 104.0 211.0
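Two related options worth knowing (standard pandas parameters, beyond what the question asks for): limit caps how many consecutive NaNs get filled, and bfill() covers leading NaNs that ffill cannot reach:
df = df.ffill(limit=2)  # forward-fill at most 2 consecutive NaNs per gap
df = df.bfill()         # then back-fill anything still missing at the top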