How to convert columns into multiple rows in a complex dataframe? - python

I have a dataframe like this:
date number name div a b c d e f ... k l m n o p q r s t
0 2008-01-01 150 A get_on 379 287 371 876 965 1389 ... 2520 3078 3495 3055 2952 2726 3307 2584 1059 264
1 2008-01-01 150 A get_off 145 707 689 1037 1170 1376 ... 1955 2304 2203 2128 1747 1593 1078 744 406 558
2 2008-01-01 151 B get_on 131 131 101 152 191 202 ... 892 900 1154 1706 1444 1267 928 531 233 974
3 2008-01-01 151 B get_off 35 158 203 393 375 460 ... 1157 1153 1303 1190 830 454 284 141 107 185
4 2008-01-01 152 C get_on 1287 867 400 330 345 338 ... 1867 2269 2777 2834 2646 2784 2920 2290 802 1559
5 2008-01-01 152 C get_off 74 147 261 473 597 698 ... 2161 2298 2360 1997 1217 799 461 271 134 210
and I want to convert it to:
date number name div a
2008-01-01 150 A get_on 379
2008-01-01 150 A get_on 287
2008-01-01 150 A get_on 371
2008-01-01 150 A get_on 876
2008-01-01 150 A get_on 965
2008-01-01 150 A get_on 1389
....
2008-01-01 152 C get_off 2161
2008-01-01 152 C get_off 2298
2008-01-01 152 C get_off 2360
2008-01-01 152 C get_off 1997
2008-01-01 152 C get_off 1217
2008-01-01 152 C get_off 799
2008-01-01 152 C get_off 461
2008-01-01 152 C get_off 271
2008-01-01 152 C get_off 134
2008-01-01 152 C get_off 210
I tried the melt method like this:
df.melt(id_vars=df.columns.tolist()[0:4], value_name='a').drop('variable', 1)
but the 'b' to 't' columns are deleted. I want the values of columns 'b' to 't' to go under the 'a' column.
It's not working on my dataframe.
How can I get a result like the one above?
number is the train number
name is the train name
div is get_on or get_off
dataset is https://drive.google.com/open?id=1Upb5PgymkPB5TXuta_sg6SijwzUuEkfl

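Use DataFrame.sort_values after melt. Pass axis=1 to drop as a keyword (pandas 2.0 removed the positional form), and melt into the default 'value' column before renaming it to 'a', because recent pandas versions raise an error when value_name matches an existing column:
df = df.melt(id_vars=df.columns[:4]).drop('variable', axis=1).rename(columns={'value': 'a'})
df = df.sort_values(['date','number', 'div'], ascending=[True, True, False])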
print (df.head())
date number name div a
0 2008-01-01 150 A get_on 379
6 2008-01-01 150 A get_on 287
12 2008-01-01 150 A get_on 371
18 2008-01-01 150 A get_on 876
24 2008-01-01 150 A get_on 965
print (df.tail())
date number name div a
71 2008-01-01 152 C get_off 799
77 2008-01-01 152 C get_off 461
83 2008-01-01 152 C get_off 271
89 2008-01-01 152 C get_off 134
95 2008-01-01 152 C get_off 210
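As a complement to the answer above, here is a minimal sketch that makes the row order explicit instead of relying on how a multi-column sort treats ties. It assumes pandas 1.1 or later (for the ignore_index parameter of melt); the toy frame, the id_cols list, and the long_df name are made up for illustration.

```python
import pandas as pd

# Toy frame with the same layout as in the question (values are made up).
df = pd.DataFrame({
    "date": ["2008-01-01"] * 4,
    "number": [150, 150, 151, 151],
    "name": ["A", "A", "B", "B"],
    "div": ["get_on", "get_off", "get_on", "get_off"],
    "a": [1, 2, 3, 4],
    "b": [5, 6, 7, 8],
    "c": [9, 10, 11, 12],
})

id_cols = ["date", "number", "name", "div"]

long_df = (
    df.melt(id_vars=id_cols, ignore_index=False)  # keep the original row labels
      .sort_index(kind="mergesort")               # stable: a, b, c stay in order per row
      .drop(columns="variable")
      .rename(columns={"value": "a"})
      .reset_index(drop=True)
)
print(long_df)
```

Because melt emits one block per value column, a stable sort on the preserved row labels puts all values of the first original row first (in a, b, c order), then all values of the second row, and so on, which matches the layout asked for in the question.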

Related

How do I use pandas to organize dataframe by both row and column?

I'm learning python and pandas, and I know how to do basic operations like groupby() and sum(). But I'm trying to do more complex operations like categorizing using rows and columns, but I'm not sure how to begin the problem below.
Here's the dataset from GitHub:
https://github.com/KeithGalli/pandas/blob/master/pokemon_data.csv
Here's what I'm trying to produce:
Generation  Fire A-M  Fire N-Z  Water A-M  Water N-Z  Grass A-M  Grass N-Z
1           #pokemon
2
3
4
5
6
Here's my approach:
import pandas as pd
df = pd.read_csv('pokemon_data.csv', header=0)
fire = df.loc[df['Type 1'] == 'Fire']
water = df.loc[df['Type 1'] == 'Water']
grass = df.loc[df['Type 1'] == 'Grass']
# Trim down columns to only related data
fire = fire[['Name', 'Type 1', 'Generation']]
water = water[['Name', 'Type 1', 'Generation']]
grass = grass[['Name', 'Type 1', 'Generation']]
Next steps: Should I begin to sort by Generation first, or by alphabetical range (A-M and N-Z)? I can't wrap my head around this.
An explanation of your work is much appreciated. Thank you!
Create a helper column that holds the column labels of the final DataFrame by comparing the first letter of Name with 'M', then use DataFrame.pivot_table. To aggregate the strings in Name, pass a join function as aggfunc:
import numpy as np
df['cat'] = df['Type 1'] + ' ' + np.where(df['Name'].str[0].gt('M'), 'N-Z','A-M')
print (df)
# Name Type 1 Type 2 HP Attack Defense \
0 1 Bulbasaur Grass Poison 45 49 49
1 2 Ivysaur Grass Poison 60 62 63
2 3 Venusaur Grass Poison 80 82 83
3 3 VenusaurMega Venusaur Grass Poison 80 100 123
4 4 Charmander Fire NaN 39 52 43
.. ... ... ... ... .. ... ...
795 719 Diancie Rock Fairy 50 100 150
796 719 DiancieMega Diancie Rock Fairy 50 160 110
797 720 HoopaHoopa Confined Psychic Ghost 80 110 60
798 720 HoopaHoopa Unbound Psychic Dark 80 160 60
799 721 Volcanion Fire Water 80 110 120
Sp. Atk Sp. Def Speed Generation Legendary cat
0 65 65 45 1 False Grass A-M
1 80 80 60 1 False Grass A-M
2 100 100 80 1 False Grass N-Z
3 122 120 80 1 False Grass N-Z
4 60 50 65 1 False Fire A-M
.. ... ... ... ... ... ...
795 100 150 50 6 True Rock A-M
796 160 110 110 6 True Rock A-M
797 150 130 70 6 True Psychic A-M
798 170 130 80 6 True Psychic A-M
799 130 90 70 6 True Fire N-Z
df = df.pivot_table(index='Generation', columns='cat', values='Name', aggfunc=','.join)
# print (df)
Create your column names first then pivot your dataframe:
df['Group'] = df['Type 1'] + ' ' + np.where(df['Name'].str[0].between('A', 'M'), 'A-M', 'N-Z')
out = df.astype({'#': str}).pivot_table('#', 'Generation', 'Group', aggfunc=' '.join)
Output
>>> out
Group Bug A-M Bug N-Z Dark A-M ... Steel N-Z Water A-M Water N-Z
Generation ...
1 10 11 12 14 15 15 13 46 47 48 49 123 127 127 NaN ... NaN 9 9 55 87 91 98 99 116 118 129 130 130 131 7 8 54 60 61 62 72 73 79 80 80 86 90 117 119 1...
2 165 166 168 205 214 214 167 193 204 212 212 213 198 228 229 229 ... 208 208 227 159 160 170 171 183 184 222 226 230 158 186 194 195 199 211 223 224 245
3 267 268 269 284 314 265 266 283 290 291 292 313 262 359 359 ... 379 258 259 270 271 272 318 339 341 342 349 350 36... 260 260 278 279 319 319 320 321 340 369
4 401 402 412 414 415 413 413 413 416 469 430 491 ... NaN 395 418 419 423 456 457 458 490 393 394 422 484 489
5 542 557 558 588 589 595 596 617 632 636 649 540 541 543 544 545 616 637 510 625 630 633 635 ... NaN 502 550 565 580 592 593 594 647 647 501 503 515 516 535 536 537 564 581
6 NaN 664 665 666 686 687 ... NaN 656 657 658 692 693 NaN
[6 rows x 35 columns]
Transposed view for readability:
>>> out.T
Generation 1 2 3 4 5 6
Group
Bug A-M 10 11 12 14 15 15 165 166 168 205 214 214 267 268 269 284 314 401 402 412 414 415 542 557 558 588 589 595 596 617 632 636 649 NaN
Bug N-Z 13 46 47 48 49 123 127 127 167 193 204 212 212 213 265 266 283 290 291 292 313 413 413 413 416 469 540 541 543 544 545 616 637 664 665 666
Dark A-M NaN 198 228 229 229 262 359 359 430 491 510 625 630 633 635 686 687
Dark N-Z NaN 197 215 261 302 302 461 509 559 560 570 571 624 629 634 717
Dragon A-M 147 148 149 NaN 334 334 371 380 380 381 381 443 444 445 445 610 611 612 621 646 646 646 704 706
Dragon N-Z NaN NaN 372 373 373 384 384 NaN 643 644 705 718
Electric A-M 81 82 101 125 135 179 180 181 181 239 309 310 310 312 404 405 462 466 522 587 603 604 694 695 702
Electric N-Z 25 26 100 145 172 243 311 403 417 479 479 479 479 479 479 523 602 642 642 NaN
Fairy A-M 35 36 173 210 NaN NaN NaN 669 670 671 683
Fairy N-Z NaN 175 176 209 NaN 468 NaN 682 684 685 700 716
Fighting A-M 56 66 67 68 106 107 237 296 297 307 308 308 448 448 533 534 619 620 701
Fighting N-Z 57 236 NaN 447 532 538 539 674 675
Fire A-M 4 5 6 6 6 58 59 126 136 146 155 219 240 244 250 256 257 257 323 323 390 391 392 467 485 500 554 555 555 631 653 654 655 662 667
Fire N-Z 37 38 77 78 156 157 218 255 322 324 NaN 498 499 513 514 663 668 721
Flying N-Z NaN NaN NaN NaN 641 641 714 715
Ghost A-M 92 93 94 94 200 354 354 355 356 425 426 429 477 487 487 563 607 608 609 711 711 711 711
Ghost N-Z NaN NaN 353 442 562 708 709 710 710 710 710
Grass A-M 1 2 44 69 102 103 152 153 154 182 187 189 253 286 331 332 388 406 420 421 455 460 460 470 546 549 556 590 591 597 598 650 652 673
Grass N-Z 3 3 43 45 70 71 114 188 191 192 252 254 254 273 274 275 285 315 357 387 389 407 459 465 492 492 495 496 497 511 512 547 548 640 651 672
Ground A-M 50 51 104 105 207 232 330 343 344 383 383 449 450 472 529 530 552 553 622 623 645 645 NaN
Ground N-Z 27 28 111 112 231 328 329 464 551 618 NaN
Ice A-M 124 144 225 362 362 471 473 478 613 614 615 712 713
Ice N-Z NaN 220 221 238 361 363 364 365 378 NaN 582 583 584 NaN
Normal A-M 22 39 52 83 84 85 108 113 115 115 132 133 162 163 174 190 203 206 241 242 264 294 295 298 301 351 352 399 400 424 427 428 428 431 440 441 446 463 493 506 507 531 531 572 573 585 626 628 648 648 659 660 661 676
Normal N-Z 16 17 18 18 19 20 21 40 53 128 137 143 161 164 216 217 233 234 235 263 276 277 287 288 289 293 300 327 333 335 396 397 398 432 474 486 504 505 508 519 520 521 586 627 NaN
Poison A-M 23 24 42 88 89 109 169 316 452 453 569 691
Poison N-Z 29 30 31 32 33 34 41 110 NaN 317 336 434 435 451 454 568 690
Psychic A-M 63 64 65 65 96 97 122 150 150 150 151 196 249 251 281 282 282 326 358 386 386 386 386 433 439 475 475 481 482 488 517 518 574 575 576 578 605 606 677 678 678 720 720
Psychic N-Z NaN 177 178 201 202 280 325 360 480 494 527 528 561 577 579 NaN
Rock A-M 74 75 76 140 141 142 142 246 337 345 346 347 348 408 411 438 525 526 566 567 688 689 698 699 703 719 719
Rock N-Z 95 138 139 185 247 248 248 299 338 377 409 410 476 524 639 696 697
Steel A-M NaN NaN 303 303 304 305 306 306 374 375 376 376 385 436 437 483 599 600 601 638 679 680 681 681 707
Steel N-Z NaN 208 208 227 379 NaN NaN NaN
Water A-M 9 9 55 87 91 98 99 116 118 129 130 130 131 159 160 170 171 183 184 222 226 230 258 259 270 271 272 318 339 341 342 349 350 36... 395 418 419 423 456 457 458 490 502 550 565 580 592 593 594 647 647 656 657 658 692 693
Water N-Z 7 8 54 60 61 62 72 73 79 80 80 86 90 117 119 1... 158 186 194 195 199 211 223 224 245 260 260 278 279 319 319 320 321 340 369 393 394 422 484 489 501 503 515 516 535 536 537 564 581 NaN
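If "#pokemon" in the question means the number of Pokémon per cell (an assumption; the answers above join names or numbers instead), the same helper-column idea can be combined with pd.crosstab. A minimal sketch on made-up rows, with hypothetical names first, Group, and counts:

```python
import pandas as pd

# Tiny made-up sample with the three columns the approach needs.
df = pd.DataFrame({
    "Name": ["Bulbasaur", "Venusaur", "Charmander",
             "Squirtle", "Chikorita", "Totodile"],
    "Type 1": ["Grass", "Grass", "Fire", "Water", "Grass", "Water"],
    "Generation": [1, 1, 1, 1, 2, 2],
})

# Helper column: type plus the alphabetical half of the first letter of Name.
first = df["Name"].str[0].str.upper()
half = first.le("M").map({True: "A-M", False: "N-Z"})
df["Group"] = df["Type 1"] + " " + half

# Count rows per (Generation, Group); combinations that never occur become 0.
counts = pd.crosstab(df["Generation"], df["Group"])
print(counts)
```

To restrict the table to Fire, Water, and Grass as in the question, the frame could be filtered with df[df['Type 1'].isin([...])] before the crosstab.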

Filtering columns based on row values in Pandas

I am trying to create dataframes from this "master" dataframe based on the unique entries in row 2.
DATE PROP1 PROP1 PROP1 PROP1 PROP1 PROP1 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2
1 DAYS MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN
2 UNIT1 UNIT2 UNIT3 UNIT4 UNIT5 UNIT6 UNIT7 UNIT8 UNIT3 UNIT4 UNIT11 UNIT12 UNIT1 UNIT2
3
4 1/1/2020 677 92 342 432 878 831 293 88 69 621 586 576 972 733
5 2/1/2020 515 11 86 754 219 818 822 280 441 11 123 36 430 272
6 3/1/2020 253 295 644 401 574 184 354 12 680 729 823 822 174 602
7 4/1/2020 872 568 505 652 366 982 159 131 218 961 52 85 679 923
8 5/1/2020 93 58 864 682 346 19 293 19 206 500 793 962 630 413
9 6/1/2020 696 262 833 418 876 695 900 781 179 138 143 526 9 866
10 7/1/2020 810 58 579 244 81 858 362 440 186 425 55 920 345 596
11 8/1/2020 834 609 618 214 547 834 301 875 783 216 834 609 550 274
12 9/1/2020 687 935 976 380 885 246 339 904 627 460 659 352 361 793
13 10/1/2020 596 300 810 248 475 718 350 574 825 804 245 209 212 925
14 11/1/2020 584 984 711 879 916 107 277 412 122 683 151 811 129 4
15 12/1/2020 616 515 101 743 650 526 475 991 796 227 880 692 734 799
16 1/1/2021 106 441 305 964 452 249 282 486 374 620 652 793 115 697
17 2/1/2021 969 504 936 678 67 42 985 791 709 689 520 503 102 731
18 3/1/2021 823 169 412 177 783 601 613 251 533 463 13 127 516 15
19 4/1/2021 348 588 140 966 143 576 419 611 128 830 68 209 952 935
20 5/1/2021 96 711 651 121 708 360 159 229 552 951 79 665 709 165
21 6/1/2021 805 657 729 629 249 547 581 583 236 828 636 248 412 535
22 7/1/2021 286 320 908 765 336 286 148 168 821 567 63 908 248 320
23 8/1/2021 707 975 565 699 47 712 700 439 497 106 288 105 872 158
24 9/1/2021 346 523 142 181 904 266 28 740 125 64 287 707 553 437
25 10/1/2021 245 42 773 591 492 512 846 487 983 180 372 306 785 691
26 11/1/2021 785 577 448 489 425 205 672 358 868 637 104 422 873 919
so the output will look something like this
df_unit1
DATE PROP1 PROP2
1 DAYS MEAN MEAN
2 UNIT1 UNIT1
3
4 1/1/2020 677 972
5 2/1/2020 515 430
6 3/1/2020 253 174
7 4/1/2020 872 679
8 5/1/2020 93 630
9 6/1/2020 696 9
10 7/1/2020 810 345
11 8/1/2020 834 550
12 9/1/2020 687 361
13 10/1/2020 596 212
14 11/1/2020 584 129
15 12/1/2020 616 734
16 1/1/2021 106 115
17 2/1/2021 969 102
18 3/1/2021 823 516
19 4/1/2021 348 952
20 5/1/2021 96 709
21 6/1/2021 805 412
22 7/1/2021 286 248
23 8/1/2021 707 872
24 9/1/2021 346 553
25 10/1/2021 245 785
26 11/1/2021 785 873
df_unit2
DATE PROP1 PROP2
1 DAYS MEAN MEAN
2 UNIT2 UNIT2
3
4 1/1/2020 92 733
5 2/1/2020 11 272
6 3/1/2020 295 602
7 4/1/2020 568 923
8 5/1/2020 58 413
9 6/1/2020 262 866
10 7/1/2020 58 596
11 8/1/2020 609 274
12 9/1/2020 935 793
13 10/1/2020 300 925
14 11/1/2020 984 4
15 12/1/2020 515 799
16 1/1/2021 441 697
17 2/1/2021 504 731
18 3/1/2021 169 15
19 4/1/2021 588 935
20 5/1/2021 711 165
21 6/1/2021 657 535
22 7/1/2021 320 320
23 8/1/2021 975 158
24 9/1/2021 523 437
25 10/1/2021 42 691
26 11/1/2021 577 919
I have extracted the unique units from the row
unitName = pd.Series(pd.Series(df[2,:]).unique(), name = "Unit Names")
unitName = unitName.tolist()
Next, I was planning to loop through this list of unique units and create a dataframe for each unit:
for unit in unitName:
    df_unit = df.iloc[[df.iloc[2:,:].str.match(unit)],:]
    print(df_unit)
I am getting an error that 'DataFrame' object has no attribute 'str'. So my plan was to match all the cells in row2 that matches a given unit and then extract the entire column for the matched row cell.
This response has two parts:
Solution 1: Strip columns based on common name in dataframe
With the assumption that your dataframe columns look as follows:
['DATE DAYS', 'PROP1 MEAN UNIT1', 'PROP1 MEAN UNIT2', 'PROP1 MEAN UNIT3', 'PROP1 MEAN UNIT4', 'PROP1 MEAN UNIT5', 'PROP1 MEAN UNIT6', 'PROP2 MEAN UNIT7', 'PROP2 MEAN UNIT8', 'PROP2 MEAN UNIT3', 'PROP2 MEAN UNIT4', 'PROP2 MEAN UNIT11', 'PROP2 MEAN UNIT12', 'PROP2 MEAN UNIT1', 'PROP2 MEAN UNIT2']
and the first few records of your dataframe looks like this...
DATE DAYS PROP1 MEAN UNIT1 ... PROP2 MEAN UNIT1 PROP2 MEAN UNIT2
0 1/1/2020 677 ... 972 733
1 2/1/2020 515 ... 430 272
2 3/1/2020 253 ... 174 602
3 4/1/2020 872 ... 679 923
4 5/1/2020 93 ... 630 413
5 6/1/2020 696 ... 9 866
6 7/1/2020 810 ... 345 596
The following lines of code should give you what you want:
cols = df.columns.tolist()
units = sorted(set(x[x.rfind('UNIT'):] for x in cols[1:]))
s_units = sorted(cols[1:],key = lambda x: x.split()[2])
for i in units:
    unit_sublist = ['DATE DAYS'] + [j for j in s_units if j[-6:].strip() == i]
    print ('df_' + i.lower())
    print (df[unit_sublist])
I got the following:
df_unit1
DATE DAYS PROP1 MEAN UNIT1 PROP2 MEAN UNIT1
0 1/1/2020 677 972
1 2/1/2020 515 430
2 3/1/2020 253 174
3 4/1/2020 872 679
4 5/1/2020 93 630
5 6/1/2020 696 9
6 7/1/2020 810 345
df_unit11
DATE DAYS PROP2 MEAN UNIT11
0 1/1/2020 586
1 2/1/2020 123
2 3/1/2020 823
3 4/1/2020 52
4 5/1/2020 793
5 6/1/2020 143
6 7/1/2020 55
df_unit12
DATE DAYS PROP2 MEAN UNIT12
0 1/1/2020 576
1 2/1/2020 36
2 3/1/2020 822
3 4/1/2020 85
4 5/1/2020 962
5 6/1/2020 526
6 7/1/2020 920
df_unit2
DATE DAYS PROP1 MEAN UNIT2 PROP2 MEAN UNIT2
0 1/1/2020 92 733
1 2/1/2020 11 272
2 3/1/2020 295 602
3 4/1/2020 568 923
4 5/1/2020 58 413
5 6/1/2020 262 866
6 7/1/2020 58 596
df_unit3
DATE DAYS PROP1 MEAN UNIT3 PROP2 MEAN UNIT3
0 1/1/2020 342 69
1 2/1/2020 86 441
2 3/1/2020 644 680
3 4/1/2020 505 218
4 5/1/2020 864 206
5 6/1/2020 833 179
6 7/1/2020 579 186
df_unit4
DATE DAYS PROP1 MEAN UNIT4 PROP2 MEAN UNIT4
0 1/1/2020 432 621
1 2/1/2020 754 11
2 3/1/2020 401 729
3 4/1/2020 652 961
4 5/1/2020 682 500
5 6/1/2020 418 138
6 7/1/2020 244 425
df_unit5
DATE DAYS PROP1 MEAN UNIT5
0 1/1/2020 878
1 2/1/2020 219
2 3/1/2020 574
3 4/1/2020 366
4 5/1/2020 346
5 6/1/2020 876
6 7/1/2020 81
df_unit6
DATE DAYS PROP1 MEAN UNIT6
0 1/1/2020 831
1 2/1/2020 818
2 3/1/2020 184
3 4/1/2020 982
4 5/1/2020 19
5 6/1/2020 695
6 7/1/2020 858
df_unit7
DATE DAYS PROP2 MEAN UNIT7
0 1/1/2020 293
1 2/1/2020 822
2 3/1/2020 354
3 4/1/2020 159
4 5/1/2020 293
5 6/1/2020 900
6 7/1/2020 362
df_unit8
DATE DAYS PROP2 MEAN UNIT8
0 1/1/2020 88
1 2/1/2020 280
2 3/1/2020 12
3 4/1/2020 131
4 5/1/2020 19
5 6/1/2020 781
6 7/1/2020 440
Solution 2: Create column names based on first 3 rows in the source data
Let us assume the first 6 rows of your dataframe looks like this.
DATE PROP1 PROP1 PROP1 PROP1 PROP1 PROP1 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2
DAYS MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN
UNIT1 UNIT2 UNIT3 UNIT4 UNIT5 UNIT6 UNIT7 UNIT8 UNIT3 UNIT4 UNIT11 UNIT12 UNIT1 UNIT2
4 1/1/2020 677 92 342 432 878 831 293 88 69 621 586 576 972 733
5 2/1/2020 515 11 86 754 219 818 822 280 441 11 123 36 430 272
6 3/1/2020 253 295 644 401 574 184 354 12 680 729 823 822 174 602
Then you can write the below code to create the dataframe.
data = '''DATE PROP1 PROP1 PROP1 PROP1 PROP1 PROP1 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2
DAYS MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN
UNIT1 UNIT2 UNIT3 UNIT4 UNIT5 UNIT6 UNIT7 UNIT8 UNIT3 UNIT4 UNIT11 UNIT12 UNIT1 UNIT2
4 1/1/2020 677 92 342 432 878 831 293 88 69 621 586 576 972 733
5 2/1/2020 515 11 86 754 219 818 822 280 441 11 123 36 430 272
6 3/1/2020 253 295 644 401 574 184 354 12 680 729 823 822 174 602
7 4/1/2020 872 568 505 652 366 982 159 131 218 961 52 85 679 923
8 5/1/2020 93 58 864 682 346 19 293 19 206 500 793 962 630 413
9 6/1/2020 696 262 833 418 876 695 900 781 179 138 143 526 9 866
10 7/1/2020 810 58 579 244 81 858 362 440 186 425 55 920 345 596
11 8/1/2020 834 609 618 214 547 834 301 875 783 216 834 609 550 274
12 9/1/2020 687 935 976 380 885 246 339 904 627 460 659 352 361 793
13 10/1/2020 596 300 810 248 475 718 350 574 825 804 245 209 212 925
14 11/1/2020 584 984 711 879 916 107 277 412 122 683 151 811 129 4
15 12/1/2020 616 515 101 743 650 526 475 991 796 227 880 692 734 799
16 1/1/2021 106 441 305 964 452 249 282 486 374 620 652 793 115 697
17 2/1/2021 969 504 936 678 67 42 985 791 709 689 520 503 102 731
18 3/1/2021 823 169 412 177 783 601 613 251 533 463 13 127 516 15
19 4/1/2021 348 588 140 966 143 576 419 611 128 830 68 209 952 935
20 5/1/2021 96 711 651 121 708 360 159 229 552 951 79 665 709 165
21 6/1/2021 805 657 729 629 249 547 581 583 236 828 636 248 412 535
22 7/1/2021 286 320 908 765 336 286 148 168 821 567 63 908 248 320
23 8/1/2021 707 975 565 699 47 712 700 439 497 106 288 105 872 158
24 9/1/2021 346 523 142 181 904 266 28 740 125 64 287 707 553 437
25 10/1/2021 245 42 773 591 492 512 846 487 983 180 372 306 785 691
26 11/1/2021 785 577 448 489 425 205 672 358 868 637 104 422 873 919'''
data_list = data.split('\n')
data_line1 = data_list[0].split()
data_line2 = data_list[1].split()
data_line3 = [''] + data_list[2].split()
data_header = [' '.join([data_line1[i],data_line2[i],data_line3[i]]) for i in range(len(data_line1))]
data_header[0] = data_header[0][:-1]
new_data= data_list[3:]
import pandas as pd
df = pd.DataFrame(data = None,columns=data_header)
# iterate over every data row (using len(new_data)-1 would skip the last one)
for i in range(len(new_data)):
    df.loc[i] = new_data[i].split()[1:]
print (df)
Here is what worked for me.
#Assign unique column names to the dataframe
df.columns = range(df.shape[1])
#Get all the unique units in the dataframe
unitName = pd.Series(pd.Series(df.loc[2,:]).unique(), name = "Unit Names")
#Convert them to a list to loop through
unitName = unitName.tolist()
for var in unitName:
    #this looks for an exact match for the unit in row index 2 and
    #extracts the entire column with the match
    df_item = df[df.columns[df.iloc[3].str.fullmatch(var)]]
    print (df_item)
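Another way to look at this problem, offered as a sketch and not as what either answer above does, is to treat the three header rows as a column MultiIndex and select on the unit level. The miniature table and the level names prop, stat, and unit below are made up for illustration.

```python
import pandas as pd

# Made-up miniature of the "master" table: three header levels per column.
columns = pd.MultiIndex.from_tuples(
    [
        ("PROP1", "MEAN", "UNIT1"),
        ("PROP1", "MEAN", "UNIT2"),
        ("PROP2", "MEAN", "UNIT1"),
        ("PROP2", "MEAN", "UNIT2"),
    ],
    names=["prop", "stat", "unit"],
)
df = pd.DataFrame(
    [[677, 92, 972, 733], [515, 11, 430, 272]],
    index=pd.Index(["1/1/2020", "2/1/2020"], name="DATE"),
    columns=columns,
)

# One sub-frame per unit, selected on the third header level.
# drop_level=False keeps all three header rows in each result.
frames = {
    unit: df.xs(unit, axis=1, level="unit", drop_level=False)
    for unit in df.columns.get_level_values("unit").unique()
}
print(frames["UNIT1"])
```

When the data comes from a file, pd.read_csv accepts header=[0, 1, 2] to build such a column MultiIndex, although the blank row and the DATE/DAYS column in the original layout would probably need some extra handling.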

sum of occurrences in bins

I'm looking at popularity of food stalls in a pop-up market:
Unnamed: 0 Shop1 Shop2 Shop3 ... shop27 shop28 shop29 shop30 shop31 shop32 shop33 shop34
0 0 484 516 484 ... 348 146 1445 1489 623 453 779 694
1 1 276 564 941 ... 1463 178 700 996 1151 364 111 1243
2 2 74 1093 961 ... 1260 1301 1151 663 1180 723 1477 1198
3 3 502 833 22 ... 349 1105 835 1 938 921 745 14
4 4 829 983 952 ... 568 1435 518 807 874 197 81 573
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
114 114 1 187 706 ... 587 1239 1413 850 1324 788 687 687
115 115 398 733 298 ... 864 981 100 80 1322 381 430 349
116 116 11 312 904 ... 34 508 850 1278 432 395 601 213
117 117 824 261 593 ... 1026 147 488 69 25 286 1229 1028
118 118 461 966 183 ... 850 817 1411 863 950 987 415 130
I then summarize the overall visits and split into bins (pd.cut(df.sum(axis=0),5,labels=['lowest','lower','medium','higher','highest'])):
Unnamed: 0 lowest
Shop1 medium
Shop2 medium
Shop3 lower
Shop4 lower
... ...
shop31 higher
shop32 medium
shop33 higher
shop34 higher
I then want to see popularity of each category over time, manual example:
6891-33086 33087-59151 59152-85216 85217-111281 111282-137346
0 0 1373 3546 13999 1238
How can I do this with python?
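One possible reading of the question is: for each time step (row), add up the visits of all shops that fall into the same popularity category. Under that assumption, a minimal sketch could group the transposed frame by the pd.cut result. The shop values below are made up, and by_category is a hypothetical name; an index-like column such as 'Unnamed: 0' would need to be dropped first.

```python
import pandas as pd

# Made-up visit counts: rows are time steps, columns are shops.
df = pd.DataFrame({
    "Shop1": [10, 20, 30],
    "Shop2": [1, 2, 3],
    "Shop3": [100, 100, 100],
    "Shop4": [5, 5, 5],
})

labels = ["lowest", "lower", "medium", "higher", "highest"]

# Bin each shop by its total visits, as in the question.
category = pd.cut(df.sum(axis=0), 5, labels=labels)

# Sum the visits of all shops in the same category, separately per time step.
# observed=False keeps categories that contain no shop (their sums are 0).
by_category = df.T.groupby(category, observed=False).sum().T
print(by_category)
```

Each row of by_category is a time step and each column is a category, so the frame can be plotted or inspected directly to follow the categories over time.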

how can I subtract in dataframe?

this is my dataframe
date number name di t
0 2008-01-01 150 서울역(150) 승차 379
1 2008-01-01 150 서울역(150) 하차 145
2 2008-01-01 151 시청(151) 승차 131
3 2008-01-01 151 시청(151) 하차 35
4 2008-01-01 152 종각(152) 승차 1287
5 2008-01-01 152 종각(152) 하차 74
6 2008-01-01 153 종로3가(153) 승차 484
7 2008-01-01 153 종로3가(153) 하차 28
8 2008-01-01 154 종로5가(154) 승차 89
9 2008-01-01 154 종로5가(154) 하차 14
10 2008-01-01 155 동대문(155) 승차 190
11 2008-01-01 155 동대문(155) 하차 23
12 2008-01-01 156 신설동(156) 승차 65
13 2008-01-01 156 신설동(156) 하차 15
14 2008-01-01 157 제기동(157) 승차 156
15 2008-01-01 157 제기동(157) 하차 16
and I want a result like this, subtracting by di (승차 minus 하차, i.e. boarding minus alighting):
date number name di t
0 2008-01-01 150 서울역(150) 승차 234
2 2008-01-01 151 시청(151) 승차 96
4 2008-01-01 152 종각(152) 승차 1213
6 2008-01-01 153 종로3가(153) 승차 456
8 2008-01-01 154 종로5가(154) 승차 75
10 2008-01-01 155 동대문(155) 승차 167
12 2008-01-01 156 신설동(156) 승차 50
14 2008-01-01 157 제기동(157) 승차 140
How can I get this dataframe?
I did a google search of "dataframe subtraction" but it’s not showing the result I want, what is wrong with my search?
We can do the following:
Groupby on number and get the diff of each group
Merge back to our original dataframe based on index
Remove unwanted columns
group = abs(df.groupby('number')['t'].diff().dropna())
group.index = group.index-1
df_merge = df.merge(group,
left_index=True,
right_index=True,
suffixes=['_1', ''])
df_merge.drop('t_1', axis=1, inplace=True)
print(df_merge)
date number name di t
0 2008-01-01 150 서울역(150) 승차 234.0
2 2008-01-01 151 시청(151) 승차 96.0
4 2008-01-01 152 종각(152) 승차 1213.0
6 2008-01-01 153 종로3가(153) 승차 456.0
8 2008-01-01 154 종로5가(154) 승차 75.0
10 2008-01-01 155 동대문(155) 승차 167.0
12 2008-01-01 156 신설동(156) 승차 50.0
14 2008-01-01 157 제기동(157) 승차 140.0
IIUC, take first within the groupby, then assign the diff with dropna:
g=df.groupby(['date','number','name'])
yourdf=g.di.first().reset_index()
yourdf['t']=-g.t.diff().dropna().values
yourdf
Out[648]:
date number name di t
0 2008-01-01 150 서울역(150) 승차 234.0
1 2008-01-01 151 시청(151) 승차 96.0
2 2008-01-01 152 종각(152) 승차 1213.0
3 2008-01-01 153 종로3가(153) 승차 456.0
4 2008-01-01 154 종로5가(154) 승차 75.0
5 2008-01-01 155 동대문(155) 승차 167.0
6 2008-01-01 156 신설동(156) 승차 50.0
7 2008-01-01 157 제기동(157) 승차 140.0
Push into one line
df.groupby(['date','number','name']).\
agg({'di':'first','t':lambda x : x.iloc[0]-x.iloc[1]}).reset_index()
Out[665]:
date number name di t
0 2008-01-01 150 서울역(150) 승차 234
1 2008-01-01 151 시청(151) 승차 96
2 2008-01-01 152 종각(152) 승차 1213
3 2008-01-01 153 종로3가(153) 승차 456
4 2008-01-01 154 종로5가(154) 승차 75
5 2008-01-01 155 동대문(155) 승차 167
6 2008-01-01 156 신설동(156) 승차 50
7 2008-01-01 157 제기동(157) 승차 140
If the rows are always paired and ordered as shown in your sample, just do the simple math and then call drop_duplicates(). The calculation on the rows with an odd index has no influence on the result (they are all discarded).
df2 = df.copy()
df2['t'] = df2.t - df2.t.shift(-1)
df2 = df2.drop_duplicates(['date','number','name'])
df2
# date number name di t
#0 2008-01-01 150 서울역(150) 승차 234.0
#2 2008-01-01 151 시청(151) 승차 96.0
#4 2008-01-01 152 종각(152) 승차 1213.0
#6 2008-01-01 153 종로3가(153) 승차 456.0
#8 2008-01-01 154 종로5가(154) 승차 75.0
#10 2008-01-01 155 동대문(155) 승차 167.0
#12 2008-01-01 156 신설동(156) 승차 50.0
#14 2008-01-01 157 제기동(157) 승차 140.0
Update: Just a follow-up to this old question. The approach I suggested above has an issue with groups that contain only one row (i.e. no paired row), but this can be overcome with another drop_duplicates():
# define columns to group rows
uniq_cols = ['date', 'number', 'name']
# find all groups/rows which do NOT have any paired rows
# and save them in a separate dataframe
# Here you can setup their value to NULL if needed
u = df.drop_duplicates(uniq_cols, keep=False)
# calculate the difference
df['t'] = df.t - df.t.shift(-1)
# concat the two data-frames and then drop_duplicates
# make sure `u` is before `df`, so that its values will be kept
# while the ones in `df` will be discarded
# sort_index() to get back to its original order.
pd.concat([u, df]).drop_duplicates(uniq_cols).sort_index()
Note: Rows need to be sorted so that rows in the same group are lined up consecutively.
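When the boarding and alighting rows are not guaranteed to be adjacent or in a fixed order, one option is to reshape first and subtract named columns. This is a sketch on made-up data; the labels "on" and "off" stand in for 승차 and 하차, and wide and result are hypothetical names.

```python
import pandas as pd

# Made-up rows in the question's layout; "on"/"off" stand in for 승차/하차.
df = pd.DataFrame({
    "date": ["2008-01-01"] * 4,
    "number": [150, 150, 151, 151],
    "name": ["Seoul", "Seoul", "City Hall", "City Hall"],
    "di": ["off", "on", "on", "off"],  # deliberately not in a fixed order
    "t": [145, 379, 131, 35],
})

# One row per station, one column per di value.
wide = df.set_index(["date", "number", "name", "di"])["t"].unstack("di")

# Boarding minus alighting, back to a flat frame.
result = (wide["on"] - wide["off"]).rename("t").reset_index()
print(result)
```

A station with only one of the two rows ends up with NaN in the missing column, so the difference is NaN for that station and the gap stays visible.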

how to reshape in pandas dataframe

Dataframe looks like below
날짜 역번호 역명 구분 a b c d e f ... k l m n o p q r s t
2008-01-01 150 서울역(150) 승차 379 287 371 876 965 1389 ... 2520 3078 3495 3055 2952 2726 3307 2584 1059 264
2008-01-01 150 서울역(150) 하차 145 707 689 1037 1170 1376 ... 1955 2304 2203 2128 1747 1593 1078 744 406 558
2008-01-01 151 시청(151) 승차 131 131 101 152 191 202 ... 892 900 1154 1706 1444 1267 928 531 233 974
2008-01-01 151 시청(151) 하차 35 158 203 393 375 460 ... 1157 1153 1303 1190 830 454 284 141 107 185
2008-01-01 152 종각(152) 승차 1287 867 400 330 345 338 ... 1867 2269 2777 2834 2646 2784 2920 2290 802 1559
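I have a dataframe like the one above (the columns 날짜, 역번호, 역명, 구분 are date, station number, station name, and boarding/alighting category). I want to reshape columns a~t into a single column, i.e. shape (a~t, 1).
I want to reshape the dataframe like below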
날짜 역번호 역명 구분 a
2018-01-01 150 서울역 승차 379
2018-01-01 150 서울역 승차 287
2018-01-01 150 서울역 승차 371
2018-01-01 150 서울역 승차 876
2018-01-01 150 서울역 승차 965
....
2008-01-01 152 종각 승차 802
2008-01-01 152 종각 승차 1559
something like df = df.reshape(len(data2)*a~t, 1)
How can I do this?
import numpy as np
import pandas as pd
# A sample dataframe with 5 columns
df = pd.DataFrame(np.random.randn(100,5))
# The first two columns (0 and 1) are retained and the rest of the columns are turned into rows
# with their corresponding values. Finally, we drop the 'variable' column
df = df.melt([0,1],value_name='A').drop('variable', axis=1)
is converted to
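For a frame shaped like the one in the question, set_index followed by stack is an alternative that yields the values row by row, in the order the question asks for. This is a sketch under assumptions: the column names date and number and the values are made up, and the identifier columns are assumed to contain no duplicates that matter.

```python
import pandas as pd

# Made-up miniature of the question's frame: two id columns, three value columns.
df = pd.DataFrame({
    "date": ["2008-01-01", "2008-01-01"],
    "number": [150, 151],
    "a": [379, 131],
    "b": [287, 131],
    "c": [371, 101],
})

id_cols = ["date", "number"]

long_df = (
    df.set_index(id_cols)  # everything left over is a value column
      .stack()             # value columns become an inner index level
      .droplevel(-1)       # forget which column each value came from
      .rename("a")
      .reset_index()
)
print(long_df)
```

Unlike melt, which emits one block per original column, stack walks each row in turn, so no extra sorting step is needed to get all values of one station together.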
