I am trying to concat two dataframes:
DataFrame 1 'AB1'
AB_BH AB_CA
Date
2007-01-05 305 324
2007-01-12 427 435
2007-01-19 481 460
2007-01-26 491 506
2007-02-02 459 503
2007-02-09 459 493
2007-02-16 450 486
DataFrame 2 'ABFluid'
Obj Total Rigs
Date
2007-01-03 312
2007-01-09 412
2007-01-16 446
2007-01-23 468
2007-01-30 456
2007-02-06 465
2007-02-14 456
2007-02-20 435
2007-02-27 440
Using the following code:
rigdata = pd.concat([AB1, ABFluid['Total Rigs']], axis=1)
Which results in this:
AB_BH AB_CA Total Rigs
Date
2007-01-03 NaN NaN 312
2007-01-05 305 324 NaN
2007-01-09 NaN NaN 412
2007-01-12 427 435 NaN
2007-01-16 NaN NaN 446
2007-01-19 481 460 NaN
2007-01-23 NaN NaN 468
2007-01-26 491 506 NaN
But I am looking to force the 'Total Rigs' dataframe to have the same dates as the AB1 frame like this:
AB_BH AB_CA Total Rigs
Date
2007-01-05 305 324 312
2007-01-12 427 435 412
2007-01-19 481 460 446
2007-01-26 491 506 468
Which is just aligning them column-wise and re-indexing the dates.
Any suggestions?
You could do ABFluid.index = AB1.index before the concat, to make the second DataFrame have the same index as the first.
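A minimal sketch of that index-overwrite approach (toy data copied from the tables above; it assumes both frames are sorted by date and have the same number of rows):

```python
import pandas as pd

AB1 = pd.DataFrame(
    {"AB_BH": [305, 427, 481, 491], "AB_CA": [324, 435, 460, 506]},
    index=pd.to_datetime(["2007-01-05", "2007-01-12", "2007-01-19", "2007-01-26"]),
)
ABFluid = pd.DataFrame(
    {"Total Rigs": [312, 412, 446, 468]},
    index=pd.to_datetime(["2007-01-03", "2007-01-09", "2007-01-16", "2007-01-23"]),
)

# Overwrite the second frame's index so the rows line up positionally,
# then concat aligns on the now-identical index
ABFluid.index = AB1.index
rigdata = pd.concat([AB1, ABFluid["Total Rigs"]], axis=1)
```

If the two frames could ever have different row counts, assigning the index would raise, so something like `pd.merge_asof` on the dates may be a safer alternative.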
I have a multi-index dataframe (df):
contract A B Total
sex Male Female Male Female TotalMale TotalFemale
grade2
B1 948 467 408 835 1356 1302
B2 184 863 515 359 699 1222
B3 241 351 907 360 1148 711
B4 472 175 809 555 1281 730
B5 740 563 606 601 1346 1164
B6 435 780 295 392 730 1172
Total 3020 3199 3540 3102 6560 6301
I am trying to drop all indexes so my output is:
0 1 2 3 4 5
0 948 467 408 835 1356 1302
1 184 863 515 359 699 1222
2 241 351 907 360 1148 711
3 472 175 809 555 1281 730
4 740 563 606 601 1346 1164
5 435 780 295 392 730 1172
6 3020 3199 3540 3102 6560 6301
I have tried:
df= df.reset_index()
and
df= df.reset_index(drop=True)
without success
Try building a new DataFrame:
df = pd.DataFrame(df.to_numpy())
You can use set_axis for the columns:
df.set_axis(range(df.shape[1]), axis=1).reset_index(drop=True)
If you need to use it in a pipeline, combine it with pipe:
(df
.pipe(lambda d: d.set_axis(range(d.shape[1]), axis=1))
.reset_index(drop=True)
)
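The `set_axis` answer can be checked on a small runnable sketch (a cut-down version of the multi-index frame above, illustrative values only):

```python
import pandas as pd

# Small frame with a MultiIndex on the columns and a named row index,
# mimicking the contract/sex layout above
cols = pd.MultiIndex.from_tuples(
    [("A", "Male"), ("A", "Female")], names=["contract", "sex"]
)
df = pd.DataFrame(
    [[948, 467], [184, 863]],
    index=pd.Index(["B1", "B2"], name="grade2"),
    columns=cols,
)

# Replace the column MultiIndex with 0..n-1 and drop the row index
out = df.set_axis(range(df.shape[1]), axis=1).reset_index(drop=True)
```

The `pd.DataFrame(df.to_numpy())` route gives the same shape, but it collapses mixed dtypes into a single array dtype, while `set_axis` preserves each column's dtype.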
say that i have a df in the following format:
year 2016 2017 2018 2019 2020 min max avg
month
2021-01-01 284 288 311 383 476 284 476 357.4
2021-02-01 301 315 330 388 441 301 441 359.6
2021-03-01 303 331 341 400 475 303 475 375.4
2021-04-01 283 300 339 419 492 283 492 372.6
2021-05-01 287 288 346 420 445 287 445 359.7
2021-06-01 283 292 340 424 446 283 446 359.1
2021-07-01 294 296 360 444 452 294 452 370.3
2021-08-01 294 315 381 445 451 294 451 375.9
2021-09-01 288 331 405 464 459 288 464 385.6
2021-10-01 327 349 424 457 453 327 457 399.1
2021-11-01 316 351 413 469 471 316 471 401.0
2021-12-01 259 329 384 467 465 259 467 375.7
and I would like to get the difference of the 2020 column by using df['delta'] = df['2020'].diff().
This will obviously return NaN for the first value in the column. How can I make it so that it automatically interprets that diff as the difference between the FIRST value of 2020 and the LAST value of 2019?
If you want only for 2020:
df["delta"] = pd.concat([df["2019"], df["2020"]]).diff().tail(len(df))
Prints:
year 2016 2017 2018 2019 2020 min max avg delta
0 2021-01-01 284 288 311 383 476 284 476 357.4 9.0
1 2021-02-01 301 315 330 388 441 301 441 359.6 -35.0
2 2021-03-01 303 331 341 400 475 303 475 375.4 34.0
3 2021-04-01 283 300 339 419 492 283 492 372.6 17.0
4 2021-05-01 287 288 346 420 445 287 445 359.7 -47.0
5 2021-06-01 283 292 340 424 446 283 446 359.1 1.0
6 2021-07-01 294 296 360 444 452 294 452 370.3 6.0
7 2021-08-01 294 315 381 445 451 294 451 375.9 -1.0
8 2021-09-01 288 331 405 464 459 288 464 385.6 8.0
9 2021-10-01 327 349 424 457 453 327 457 399.1 -6.0
10 2021-11-01 316 351 413 469 471 316 471 401.0 18.0
11 2021-12-01 259 329 384 467 465 259 467 375.7 -6.0
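A self-contained sketch of that concat-then-diff trick (toy numbers, not the table above):

```python
import pandas as pd

df = pd.DataFrame({"2019": [383, 388, 400, 467], "2020": [476, 441, 475, 465]})

# Stack 2019 on top of 2020 so diff() sees last-of-2019 -> first-of-2020,
# then keep only the rows that belong to 2020
df["delta"] = pd.concat([df["2019"], df["2020"]]).diff().tail(len(df))
```

The `tail(len(df))` slice keeps the second half of the concatenated series, whose index matches the original frame, so the assignment aligns row by row.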
You can try unstack, then do the diff; note that the first item in 2016 will still be NaN:
out = df.drop(['min','max','avg'], axis=1).unstack().diff().unstack(0)
2016 2017 2018 2019 2020
2021-01-01 NaN 29.0 -18.0 -1.0 9.0
2021-02-01 17.0 27.0 19.0 5.0 -35.0
2021-03-01 2.0 16.0 11.0 12.0 34.0
2021-04-01 -20.0 -31.0 -2.0 19.0 17.0
2021-05-01 4.0 -12.0 7.0 1.0 -47.0
2021-06-01 -4.0 4.0 -6.0 4.0 1.0
2021-07-01 11.0 4.0 20.0 20.0 6.0
2021-08-01 0.0 19.0 21.0 1.0 -1.0
2021-09-01 -6.0 16.0 24.0 19.0 8.0
2021-10-01 39.0 18.0 19.0 -7.0 -6.0
2021-11-01 -11.0 2.0 -11.0 12.0 18.0
2021-12-01 -57.0 -22.0 -29.0 -2.0 -6.0
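The unstack chain can likewise be sketched on a tiny two-column frame (illustrative values):

```python
import pandas as pd

df = pd.DataFrame(
    {"2019": [383, 388], "2020": [476, 441]},
    index=pd.to_datetime(["2021-01-01", "2021-02-01"]),
)

# unstack() lays the columns end-to-end into one long Series (column by
# column), diff() then runs across that continuous series so the first
# 2020 value is differenced against the last 2019 value, and unstack(0)
# restores the year columns
out = df.unstack().diff().unstack(0)
```

Only the very first value of the first year column remains NaN, since it has nothing before it in the long series.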
Ok, I tried to figure this one out but I couldn't do it and I couldn't find any other questions quite like it...
Using pandas and a dataframe, I need to match data from an earlier part of a dataframe and put it in a later part of the dataframe, based on a matching value.
The data looks like this:
nc Date oldval lor
508 508 2019-07-08 296.820007 500
509 509 2019-07-17 297.73999 502
510 510 2019-07-19 297.170013 502
511 511 2019-07-25 300 504
512 512 2019-08-05 283.820007 505
513 513 2019-08-12 288.070007 506
514 514 2019-08-14 283.899994 506
515 515 2019-08-23 284.850006 507
516 516 2019-09-03 290.73999 508
517 517 2019-09-16 300.160004 510
518 518 2019-09-24 295.869995 511
519 519 2019-09-27 295.399994 511
520 520 2019-10-02 288.059998 512
521 521 2019-10-08 288.529999 513
522 522 2019-10-18 297.970001 514
523 523 2019-11-21 310.269989 518
524 524 2019-12-03 309.549988 520
What I need to do is look at the column 'lor', compare it to all previous rows on the column 'nc', and if 'nc' has a matching value, then put the date into a new column 'xDate' and the 'oldval' into a new column 'xval' on the same line as 'lor'. The numbers in column 'nc' will be unique and increasing in value, while the numbers in the column 'lor' may or may not duplicate.
The final data should look like this:
nc Date oldval lor xdate xval
508 508 2019-07-08 296.820007 500 np.nan np.nan
509 509 2019-07-17 297.73999 502 np.nan np.nan
510 510 2019-07-19 297.170013 502 np.nan np.nan
511 511 2019-07-25 300 504 np.nan np.nan
512 512 2019-08-05 283.820007 505 np.nan np.nan
513 513 2019-08-12 288.070007 506 np.nan np.nan
514 514 2019-08-14 283.899994 506 np.nan np.nan
515 515 2019-08-23 284.850006 507 np.nan np.nan
516 516 2019-09-03 290.73999 508 2019-07-08 296.820007
517 517 2019-09-16 300.160004 510 2019-07-19 297.170013
518 518 2019-09-24 295.869995 511 2019-07-25 300
519 519 2019-09-27 295.399994 511 2019-07-25 300
520 520 2019-10-02 288.059998 512 2019-08-05 283.820007
521 521 2019-10-08 288.529999 513 2019-08-12 288.070007
522 522 2019-10-18 297.970001 514 2019-08-14 283.899994
523 523 2019-11-21 310.269989 518 2019-09-24 295.869995
524 524 2019-12-03 309.549988 520 2019-10-02 288.059998
You can use apply to find the matching values, then convert them to columns with apply(pd.Series).
s = df['lor'].apply(lambda x: df.loc[df['nc'] == x, ['Date', 'oldval']].values).explode()
df[['xdate','xval']] = s.apply(pd.Series)
Pandas' merge could help with your use case:
#reset index
#initialize index to 0, 1, ...
df = df.reset_index(drop=True)
#merge dataframe on itself
res = (df[['lor','Date','oldval']]
.merge(df, left_on='lor',right_on='nc',how='left')
.filter(['Date_y','oldval_y'])
.set_axis(['xdate','xval'],axis=1)
)
#concatenate df, res on columns
pd.concat([df,res],axis=1)
nc Date oldval lor xdate xval
0 508 2019-07-08 296.820007 500 NaN NaN
1 509 2019-07-17 297.739990 502 NaN NaN
2 510 2019-07-19 297.170013 502 NaN NaN
3 511 2019-07-25 300.000000 504 NaN NaN
4 512 2019-08-05 283.820007 505 NaN NaN
5 513 2019-08-12 288.070007 506 NaN NaN
6 514 2019-08-14 283.899994 506 NaN NaN
7 515 2019-08-23 284.850006 507 NaN NaN
8 516 2019-09-03 290.739990 508 2019-07-08 296.820007
9 517 2019-09-16 300.160004 510 2019-07-19 297.170013
10 518 2019-09-24 295.869995 511 2019-07-25 300.000000
11 519 2019-09-27 295.399994 511 2019-07-25 300.000000
12 520 2019-10-02 288.059998 512 2019-08-05 283.820007
13 521 2019-10-08 288.529999 513 2019-08-12 288.070007
14 522 2019-10-18 297.970001 514 2019-08-14 283.899994
15 523 2019-11-21 310.269989 518 2019-09-24 295.869995
16 524 2019-12-03 309.549988 520 2019-10-02 288.059998
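Since 'nc' is stated to be unique, a `Series.map` lookup is another sketch of the same idea (a trimmed-down version of the question's frame, illustrative rows only):

```python
import pandas as pd

df = pd.DataFrame({
    "nc":     [508, 509, 516],
    "Date":   ["2019-07-08", "2019-07-17", "2019-09-03"],
    "oldval": [296.82, 297.74, 290.74],
    "lor":    [500, 502, 508],
})

# Build lookup tables keyed by nc, then map each lor to the matching
# earlier row; unmatched values become NaN automatically
lookup = df.set_index("nc")
df["xdate"] = df["lor"].map(lookup["Date"])
df["xval"] = df["lor"].map(lookup["oldval"])
```

`map` relies on the lookup index being unique, which the question guarantees for 'nc'; if it were not, the merge-based answer above would be the safer route.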
I have this data frame:
ID Date X 123_Var 456_Var 789_Var
A 16-07-19 3 777 250 810
A 17-07-19 9 637 121 529
A 20-07-19 2 295 272 490
A 21-07-19 3 778 600 544
A 22-07-19 6 741 792 907
A 25-07-19 6 435 416 820
A 26-07-19 8 590 455 342
A 27-07-19 6 763 476 753
A 02-08-19 6 717 211 454
A 03-08-19 6 152 442 475
A 05-08-19 6 564 340 302
A 07-08-19 6 105 929 633
A 08-08-19 6 948 366 586
B 07-08-19 4 509 690 406
B 08-08-19 2 413 725 414
B 12-08-19 2 170 702 912
B 13-08-19 3 851 616 477
B 14-08-19 9 475 447 555
B 15-08-19 1 412 403 708
B 17-08-19 2 299 537 321
B 18-08-19 4 310 119 125
I want to show the mean value of the last n days (using the Date column), excluding the value of the current day.
I'm using this code (what should I do to fix this?):
n = 4
cols = list(df.filter(regex='Var').columns)
df = df.set_index('Date')
df[cols] = (df.groupby('ID').rolling(window=f'{n}D')[cols].mean()
.reset_index(0,drop=True).add_suffix(f'_{n}'))
df.reset_index(inplace=True)
Expected result:
ID Date X 123_Var 456_Var 789_Var 123_Var_4 456_Var_4 789_Var_4
A 16-07-19 3 777 250 810 NaN NaN NaN
A 17-07-19 9 637 121 529 777.000000 250.000000 810.0
A 20-07-19 2 295 272 490 707.000000 185.500000 669.5
A 21-07-19 3 778 600 544 466.000000 196.500000 509.5
A 22-07-19 6 741 792 907 536.500000 436.000000 517.0
A 25-07-19 6 435 416 820 759.500000 696.000000 725.5
A 26-07-19 8 590 455 342 588.000000 604.000000 863.5
A 27-07-19 6 763 476 753 512.500000 435.500000 581.0
A 02-08-19 6 717 211 454 NaN NaN NaN
A 03-08-19 6 152 442 475 717.000000 211.000000 454.0
A 05-08-19 6 564 340 302 434.500000 326.500000 464.5
A 07-08-19 6 105 929 633 358.000000 391.000000 388.5
A 08-08-19 6 948 366 586 334.500000 634.500000 467.5
B 07-08-19 4 509 690 406 NaN NaN NaN
B 08-08-19 2 413 725 414 509.000000 690.000000 406.0
B 12-08-19 2 170 702 912 413.000000 725.000000 414.0
B 13-08-19 3 851 616 477 291.500000 713.500000 663.0
B 14-08-19 9 475 447 555 510.500000 659.000000 694.5
B 15-08-19 1 412 403 708 498.666667 588.333333 648.0
B 17-08-19 2 299 537 321 579.333333 488.666667 580.0
B 18-08-19 4 310 119 125 395.333333 462.333333 528.0
Note: the dataframe has changed.
I changed unutbu's solution to work with rolling: since the rolling sum equals count * mean, subtracting the current value and dividing by count - 1 gives the mean of the window excluding the current day.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
n = 5
cols = df.filter(regex='Var').columns
df = df.set_index('Date')
df_ = df.set_index('ID', append=True).swaplevel(1,0)
df1 = df.groupby('ID').rolling(window=f'{n}D')[cols].count()
df2 = df.groupby('ID').rolling(window=f'{n}D')[cols].mean()
df3 = (df1.mul(df2)
.sub(df_[cols])
.div(df1[cols].sub(1)).add_suffix(f'_{n}')
)
df4 = df_.join(df3)
print (df4)
X 123_Var 456_Var 789_Var 123_Var_5 456_Var_5 789_Var_5
ID Date
A 2019-07-16 3 777 250 810 NaN NaN NaN
2019-07-17 9 637 121 529 777.000000 250.000000 810.0
2019-07-20 2 295 272 490 707.000000 185.500000 669.5
2019-07-21 3 778 600 544 466.000000 196.500000 509.5
2019-07-22 6 741 792 907 536.500000 436.000000 517.0
2019-07-25 6 435 416 820 759.500000 696.000000 725.5
2019-07-26 8 590 455 342 588.000000 604.000000 863.5
2019-07-27 6 763 476 753 512.500000 435.500000 581.0
2019-08-02 6 717 211 454 NaN NaN NaN
2019-08-03 6 152 442 475 717.000000 211.000000 454.0
2019-08-05 6 564 340 302 434.500000 326.500000 464.5
2019-08-07 6 105 929 633 358.000000 391.000000 388.5
2019-08-08 6 948 366 586 334.500000 634.500000 467.5
B 2019-08-07 4 509 690 406 NaN NaN NaN
2019-08-08 2 413 725 414 509.000000 690.000000 406.0
2019-08-12 2 170 702 912 413.000000 725.000000 414.0
2019-08-13 3 851 616 477 170.000000 702.000000 912.0
2019-08-14 9 475 447 555 510.500000 659.000000 694.5
2019-08-15 1 412 403 708 498.666667 588.333333 648.0
2019-08-17 2 299 537 321 579.333333 488.666667 580.0
2019-08-18 4 310 119 125 395.333333 462.333333 528.0
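The identity behind that answer (rolling sum = count * mean, then remove the current value) can be sketched on a tiny single-variable frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": ["A"] * 3, "Var": [10.0, 20.0, 60.0]},
    index=pd.to_datetime(["2019-07-16", "2019-07-17", "2019-07-18"]),
)

n = 5
g = df.groupby("ID").rolling(window=f"{n}D")["Var"]
cnt = g.count().reset_index(0, drop=True)
mean = g.mean().reset_index(0, drop=True)

# rolling sum = count * mean; subtract the current value and divide by
# count - 1 to get the mean of the window excluding the current day
# (the first row divides 0 by 0 and comes out NaN, as desired)
df["Var_excl"] = (cnt * mean - df["Var"]) / (cnt - 1)
```

The `reset_index(0, drop=True)` calls strip the group level that `groupby(...).rolling(...)` adds, so the results align back onto the original date index.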
Dataframe looks like below
날짜 역번호 역명 구분 a b c d e f ... k l m n o p q r s t
2008-01-01 150 서울역(150) 승차 379 287 371 876 965 1389 ... 2520 3078 3495 3055 2952 2726 3307 2584 1059 264
2008-01-01 150 서울역(150) 하차 145 707 689 1037 1170 1376 ... 1955 2304 2203 2128 1747 1593 1078 744 406 558
2008-01-01 151 시청(151) 승차 131 131 101 152 191 202 ... 892 900 1154 1706 1444 1267 928 531 233 974
2008-01-01 151 시청(151) 하차 35 158 203 393 375 460 ... 1157 1153 1303 1190 830 454 284 141 107 185
2008-01-01 152 종각(152) 승차 1287 867 400 330 345 338 ... 1867 2269 2777 2834 2646 2784 2920 2290 802 1559
I have a dataframe like the one above (날짜 = date, 역번호 = station number, 역명 = station name, 구분 = type), and I want to reshape the a~t columns into a single column.
I want to reshape dataframe like below
날짜 역번호 역명 구분 a
2018-01-01 150 서울역 승차 379
2018-01-01 150 서울역 승차 287
2018-01-01 150 서울역 승차 371
2018-01-01 150 서울역 승차 876
2018-01-01 150 서울역 승차 965
....
2008-01-01 152 종각 승차 802
2008-01-01 152 종각 승차 1559
something like df = df.reshape(len(data2) * 20, 1), where 20 is the number of a~t columns.
How can I do this?
# A sample dataframe with 5 columns
df = pd.DataFrame(np.random.randn(100,5))
# Columns 0 and 1 are retained; the remaining columns are melted into rows
# with their corresponding values. Finally we drop the 'variable' column
df = df.melt([0,1],value_name='A').drop('variable', axis=1)
so the wide columns are converted into rows, leaving a single value column 'A'.
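Applied to the question's own columns, a melt sketch might look like this (only two of the a~t columns shown; the value column name 'A' follows the sample answer):

```python
import pandas as pd

# Toy version of the frame with just two of the a~t value columns
df = pd.DataFrame({
    "날짜": ["2008-01-01", "2008-01-01"],
    "역번호": [150, 150],
    "역명": ["서울역(150)", "서울역(150)"],
    "구분": ["승차", "하차"],
    "a": [379, 145],
    "b": [287, 707],
})

# Keep the four descriptor columns as id_vars and stack a~t into one
# value column, dropping the generated 'variable' column
long = df.melt(
    id_vars=["날짜", "역번호", "역명", "구분"], value_name="A"
).drop("variable", axis=1)
```

With all twenty a~t columns present, this yields len(df) * 20 rows, matching the reshape the question describes.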