I am following a Lynda tutorial where they use the following code:
import pandas as pd
import seaborn
flights = seaborn.load_dataset('flights')
flights_indexed = flights.set_index(['year','month'])
flights_unstacked = flights_indexed.unstack()
flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
and it works perfectly. However, in my case the code does not run; the last line keeps raising an error:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
I know in the video they are using Python 2, however I have Python 3 since I am learning for work (which uses Python 3). Most of the differences I have been able to figure out, however I cannot figure out how to create this new column called 'total' with the sums of the passengers.
The root cause of this error message is the categorical nature of the month column:
In [42]: flights.dtypes
Out[42]:
year int64
month category
passengers int64
dtype: object
In [43]: flights.month.cat.categories
Out[43]: Index(['January', 'February', 'March', 'April', 'May', 'June', 'July',
                'August', 'September', 'October', 'November', 'December'],
               dtype='object')
and you are trying to add a category total - Pandas doesn't like that, because a CategoricalIndex only accepts labels that are already categories.
Workaround:
In [45]: flights['month'] = flights.month.cat.add_categories('total')
In [46]: x = flights.pivot(index='year', columns='month', values='passengers')
In [47]: x['total'] = x.sum(axis=1)
In [48]: x
Out[48]:
month January February March April May June July August September October November December total
year
1949 112.0 118.0 132.0 129.0 121.0 135.0 148.0 148.0 136.0 119.0 104.0 118.0 1520.0
1950 115.0 126.0 141.0 135.0 125.0 149.0 170.0 170.0 158.0 133.0 114.0 140.0 1676.0
1951 145.0 150.0 178.0 163.0 172.0 178.0 199.0 199.0 184.0 162.0 146.0 166.0 2042.0
1952 171.0 180.0 193.0 181.0 183.0 218.0 230.0 242.0 209.0 191.0 172.0 194.0 2364.0
1953 196.0 196.0 236.0 235.0 229.0 243.0 264.0 272.0 237.0 211.0 180.0 201.0 2700.0
1954 204.0 188.0 235.0 227.0 234.0 264.0 302.0 293.0 259.0 229.0 203.0 229.0 2867.0
1955 242.0 233.0 267.0 269.0 270.0 315.0 364.0 347.0 312.0 274.0 237.0 278.0 3408.0
1956 284.0 277.0 317.0 313.0 318.0 374.0 413.0 405.0 355.0 306.0 271.0 306.0 3939.0
1957 315.0 301.0 356.0 348.0 355.0 422.0 465.0 467.0 404.0 347.0 305.0 336.0 4421.0
1958 340.0 318.0 362.0 348.0 363.0 435.0 491.0 505.0 404.0 359.0 310.0 337.0 4572.0
1959 360.0 342.0 406.0 396.0 420.0 472.0 548.0 559.0 463.0 407.0 362.0 405.0 5140.0
1960 417.0 391.0 419.0 461.0 472.0 535.0 622.0 606.0 508.0 461.0 390.0 432.0 5714.0
UPDATE: alternatively, if you don't want to touch the original DF, you can get rid of the categorical columns in the flights_unstacked DF:
In [76]: flights_unstacked.columns = \
    ...:     flights_unstacked.columns.set_levels(
    ...:         flights_unstacked.columns.get_level_values(1).categories,
    ...:         level=1)
In [77]: flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
In [78]: flights_unstacked
Out[78]:
passengers
month January February March April May June July August September October November December total
year
1949 112 118 132 129 121 135 148 148 136 119 104 118 1520
1950 115 126 141 135 125 149 170 170 158 133 114 140 1676
1951 145 150 178 163 172 178 199 199 184 162 146 166 2042
1952 171 180 193 181 183 218 230 242 209 191 172 194 2364
1953 196 196 236 235 229 243 264 272 237 211 180 201 2700
1954 204 188 235 227 234 264 302 293 259 229 203 229 2867
1955 242 233 267 269 270 315 364 347 312 274 237 278 3408
1956 284 277 317 313 318 374 413 405 355 306 271 306 3939
1957 315 301 356 348 355 422 465 467 404 347 305 336 4421
1958 340 318 362 348 363 435 491 505 404 359 310 337 4572
1959 360 342 406 396 420 472 548 559 463 407 362 405 5140
1960 417 391 419 461 472 535 622 606 508 461 390 432 5714
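A shorter variation on the same idea (my sketch, not part of the original answer) is to cast the categorical month level to plain strings in one step, after which any new label can be inserted:
flights_unstacked.columns = flights_unstacked.columns.set_levels(
    flights_unstacked.columns.levels[1].astype(str), level=1)  # months become plain strings
flights_unstacked['passengers', 'total'] = flights_unstacked.sum(axis=1)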
Related
Say that I have a df in the following format:
year 2016 2017 2018 2019 2020 min max avg
month
2021-01-01 284 288 311 383 476 284 476 357.4
2021-02-01 301 315 330 388 441 301 441 359.6
2021-03-01 303 331 341 400 475 303 475 375.4
2021-04-01 283 300 339 419 492 283 492 372.6
2021-05-01 287 288 346 420 445 287 445 359.7
2021-06-01 283 292 340 424 446 283 446 359.1
2021-07-01 294 296 360 444 452 294 452 370.3
2021-08-01 294 315 381 445 451 294 451 375.9
2021-09-01 288 331 405 464 459 288 464 385.6
2021-10-01 327 349 424 457 453 327 457 399.1
2021-11-01 316 351 413 469 471 316 471 401.0
2021-12-01 259 329 384 467 465 259 467 375.7
and I would like to get the difference of the 2020 column by using df['delta'] = df['2020'].diff().
This will obviously return NaN for the first value in the column. How can I make it so that it automatically interprets that diff as the difference between the FIRST value of 2020 and the LAST value of 2019?
If you want it only for 2020: stack the 2019 column on top of the 2020 column so that diff crosses the year boundary (the first 2020 value is diffed against the last 2019 value), then keep just the trailing rows that correspond to 2020:
df["delta"] = pd.concat([df["2019"], df["2020"]]).diff().tail(len(df))
Prints:
year 2016 2017 2018 2019 2020 min max avg delta
0 2021-01-01 284 288 311 383 476 284 476 357.4 9.0
1 2021-02-01 301 315 330 388 441 301 441 359.6 -35.0
2 2021-03-01 303 331 341 400 475 303 475 375.4 34.0
3 2021-04-01 283 300 339 419 492 283 492 372.6 17.0
4 2021-05-01 287 288 346 420 445 287 445 359.7 -47.0
5 2021-06-01 283 292 340 424 446 283 446 359.1 1.0
6 2021-07-01 294 296 360 444 452 294 452 370.3 6.0
7 2021-08-01 294 315 381 445 451 294 451 375.9 -1.0
8 2021-09-01 288 331 405 464 459 288 464 385.6 8.0
9 2021-10-01 327 349 424 457 453 327 457 399.1 -6.0
10 2021-11-01 316 351 413 469 471 316 471 401.0 18.0
11 2021-12-01 259 329 384 467 465 259 467 375.7 -6.0
You can try unstack, then do the diff, then unstack back. Unstacking yields one long series ordered year by year, so each January is diffed against the previous December; notice the first item in 2016 will still be NaN:
out = df.drop(columns=['min','max','avg']).unstack().diff().unstack(0)
2016 2017 2018 2019 2020
2021-01-01 NaN 29.0 -18.0 -1.0 9.0
2021-02-01 17.0 27.0 19.0 5.0 -35.0
2021-03-01 2.0 16.0 11.0 12.0 34.0
2021-04-01 -20.0 -31.0 -2.0 19.0 17.0
2021-05-01 4.0 -12.0 7.0 1.0 -47.0
2021-06-01 -4.0 4.0 -6.0 4.0 1.0
2021-07-01 11.0 4.0 20.0 20.0 6.0
2021-08-01 0.0 19.0 21.0 1.0 -1.0
2021-09-01 -6.0 16.0 24.0 19.0 8.0
2021-10-01 39.0 18.0 19.0 -7.0 -6.0
2021-11-01 -11.0 2.0 -11.0 12.0 18.0
2021-12-01 -57.0 -22.0 -29.0 -2.0 -6.0
I have a long df from 07:00:00 to 20:00:00 (df1) and a short df covering only fractions of the long one (df2), with identical datetime index values.
I would like to compare the groupsize (gs) values of the two data frames.
The datetime index, id, x, and y values should be identical.
How can I do this?
df1:
Out[180]:
date id gs x y
2019-10-09 07:38:22.139 3166 nan 248 233
2019-10-09 07:38:25.259 3166 nan 252 235
2019-10-09 07:38:27.419 3166 nan 253 231
2019-10-09 07:38:30.299 3166 nan 251 232
2019-10-09 07:38:32.379 3166 nan 251 233
2019-10-09 07:38:37.179 3166 nan 228 245
2019-10-09 07:39:49.498 3167 nan 289 253
2019-10-09 07:40:19.099 3168 nan 288 217
2019-10-09 07:40:38.779 3169 nan 278 139
2019-10-09 07:40:39.899 3169 nan 279 183
...
2019-10-09 19:52:53.959 5725 nan 190 180
2019-10-09 19:52:56.439 5725 nan 193 185
2019-10-09 19:52:58.919 5725 nan 204 220
2019-10-09 19:53:06.440 5804 nan 190 198
2019-10-09 19:53:08.919 5804 nan 200 170
2019-10-09 19:53:11.419 5804 nan 265 209
2019-10-09 19:53:16.460 5789 nan 292 218
2019-10-09 19:53:36.460 5806 nan 284 190
2019-10-09 19:54:08.939 5807 nan 404 226
2019-10-09 19:54:23.979 5808 nan 395 131
df2:
Out[181]:
date id gs x y
2019-10-09 11:20:01.418 3479 2.0 353 118.0
2019-10-09 11:20:01.418 3477 2.0 315 92.0
2019-10-09 11:20:01.418 3473 2.0 351 176.0
2019-10-09 11:20:01.418 3476 2.0 318 176.0
2019-10-09 11:20:01.418 3386 0.0 148 255.0
2019-10-09 11:20:01.418 3390 0.0 146 118.0
2019-10-09 11:20:01.418 3447 0.0 469 167.0
2019-10-09 11:20:03.898 3447 0.0 466 169.0
2019-10-09 11:20:03.898 3390 0.0 139 119.0
2019-10-09 11:20:03.898 3477 2.0 316 93.0
Expected output should be a dataframe with columns "date", "id", "x", "y", "gs(df1)", "gs(df2)"
Do a merge where everything is equal, but make sure to reset the index so it's part of the merge condition:
df1_t = df1.reset_index()
df2_t = df2.reset_index()
results = df1_t.merge(df2_t, on=['date', 'id', 'x', 'y'],
                      indicator=True)
print(results)
results will contain the rows of df1 that are also in df2, with the gs columns from both frames (suffixed _x and _y by default).
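If you want the exact column layout from the question, here is a minimal sketch (the suffix strings are my choice, not part of the answer above):
out = df1.reset_index().merge(df2.reset_index(),
                              on=['date', 'id', 'x', 'y'],
                              suffixes=('(df1)', '(df2)'))  # label each frame's gs column
print(out[['date', 'id', 'x', 'y', 'gs(df1)', 'gs(df2)']])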
I have this data frame:
ID Date X 123_Var 456_Var 789_Var
A 16-07-19 3 777 250 810
A 17-07-19 9 637 121 529
A 20-07-19 2 295 272 490
A 21-07-19 3 778 600 544
A 22-07-19 6 741 792 907
A 25-07-19 6 435 416 820
A 26-07-19 8 590 455 342
A 27-07-19 6 763 476 753
A 02-08-19 6 717 211 454
A 03-08-19 6 152 442 475
A 05-08-19 6 564 340 302
A 07-08-19 6 105 929 633
A 08-08-19 6 948 366 586
B 07-08-19 4 509 690 406
B 08-08-19 2 413 725 414
B 12-08-19 2 170 702 912
B 13-08-19 3 851 616 477
B 14-08-19 9 475 447 555
B 15-08-19 1 412 403 708
B 17-08-19 2 299 537 321
B 18-08-19 4 310 119 125
I want to show the mean value of the last n days (using the Date column), excluding the value of the current day.
I'm using this code (what should I do to fix it?):
n = 4
cols = list(df.filter(regex='Var').columns)
df = df.set_index('Date')
df[cols] = (df.groupby('ID').rolling(window=f'{n}D')[cols].mean()
.reset_index(0,drop=True).add_suffix(f'_{n}'))
df.reset_index(inplace=True)
Expected result:
ID Date X 123_Var 456_Var 789_Var 123_Var_4 456_Var_4 789_Var_4
A 16-07-19 3 777 250 810 NaN NaN NaN
A 17-07-19 9 637 121 529 777.000000 250.000000 810.0
A 20-07-19 2 295 272 490 707.000000 185.500000 669.5
A 21-07-19 3 778 600 544 466.000000 196.500000 509.5
A 22-07-19 6 741 792 907 536.500000 436.000000 517.0
A 25-07-19 6 435 416 820 759.500000 696.000000 725.5
A 26-07-19 8 590 455 342 588.000000 604.000000 863.5
A 27-07-19 6 763 476 753 512.500000 435.500000 581.0
A 02-08-19 6 717 211 454 NaN NaN NaN
A 03-08-19 6 152 442 475 717.000000 211.000000 454.0
A 05-08-19 6 564 340 302 434.500000 326.500000 464.5
A 07-08-19 6 105 929 633 358.000000 391.000000 388.5
A 08-08-19 6 948 366 586 334.500000 634.500000 467.5
B 07-08-19 4 509 690 406 NaN NaN NaN
B 08-08-19 2 413 725 414 509.000000 690.000000 406.0
B 12-08-19 2 170 702 912 413.000000 725.000000 414.0
B 13-08-19 3 851 616 477 291.500000 713.500000 663.0
B 14-08-19 9 475 447 555 510.500000 659.000000 694.5
B 15-08-19 1 412 403 708 498.666667 588.333333 648.0
B 17-08-19 2 299 537 321 579.333333 488.666667 580.0
B 18-08-19 4 310 119 125 395.333333 462.333333 528.0
Note: dataframe has changed.
I changed unutbu's solution to work with rolling windows. The idea: rolling count times rolling mean gives the window sum, so subtracting the current row's value and dividing by count - 1 yields the mean of the window excluding the current day:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
n = 5
cols = df.filter(regex='Var').columns
df = df.set_index('Date')
df_ = df.set_index('ID', append=True).swaplevel(1,0)
df1 = df.groupby('ID').rolling(window=f'{n}D')[cols].count()
df2 = df.groupby('ID').rolling(window=f'{n}D')[cols].mean()
df3 = (df1.mul(df2)
          .sub(df_[cols])
          .div(df1[cols].sub(1))
          .add_suffix(f'_{n}'))
df4 = df_.join(df3)
print (df4)
X 123_Var 456_Var 789_Var 123_Var_5 456_Var_5 789_Var_5
ID Date
A 2019-07-16 3 777 250 810 NaN NaN NaN
2019-07-17 9 637 121 529 777.000000 250.000000 810.0
2019-07-20 2 295 272 490 707.000000 185.500000 669.5
2019-07-21 3 778 600 544 466.000000 196.500000 509.5
2019-07-22 6 741 792 907 536.500000 436.000000 517.0
2019-07-25 6 435 416 820 759.500000 696.000000 725.5
2019-07-26 8 590 455 342 588.000000 604.000000 863.5
2019-07-27 6 763 476 753 512.500000 435.500000 581.0
2019-08-02 6 717 211 454 NaN NaN NaN
2019-08-03 6 152 442 475 717.000000 211.000000 454.0
2019-08-05 6 564 340 302 434.500000 326.500000 464.5
2019-08-07 6 105 929 633 358.000000 391.000000 388.5
2019-08-08 6 948 366 586 334.500000 634.500000 467.5
B 2019-08-07 4 509 690 406 NaN NaN NaN
2019-08-08 2 413 725 414 509.000000 690.000000 406.0
2019-08-12 2 170 702 912 413.000000 725.000000 414.0
2019-08-13 3 851 616 477 170.000000 702.000000 912.0
2019-08-14 9 475 447 555 510.500000 659.000000 694.5
2019-08-15 1 412 403 708 498.666667 588.333333 648.0
2019-08-17 2 299 537 321 579.333333 488.666667 580.0
2019-08-18 4 310 119 125 395.333333 462.333333 528.0
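A quick sanity check of that formula on one cell, using values from the table above (a sketch):
# the 5D window ending 2019-07-27 for ID A covers 2019-07-25, 26 and 27
count = 3
mean = (435 + 590 + 763) / 3                   # rolling mean including the current day
current = 763                                  # 123_Var on 2019-07-27
print((count * mean - current) / (count - 1))  # 512.5, matching 123_Var_5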
I'm having issues with scraping basketball-reference.com. I'm trying to access the "Team Per Game Stats" table but can't seem to target the correct div/table. I'm trying to capture the table and bring it into a dataframe using pandas.
I've tried using soup.find and soup.find_all to find all the tables, but when I search the results I do not see the ID of the table I am looking for. See below.
x = soup.find("table", id="team-stats-per_game")
import csv, time, sys, math
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
#NBA season
year = 2019
# Basketball Reference URL of the page we will be scraping
url = "https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base".format(year)
html = urlopen(url)
soup = BeautifulSoup(html,'lxml')
x = soup.find("table", id="team-stats-per_game")
print(x)
Result:
None
I expect the output to list the table elements, specifically tr and th tags to target and bring into a pandas df.
As Jarett mentioned above, BeautifulSoup can't parse your tag. In this case it's because it's commented out in the source.
While this is admittedly an amateurish approach, it works for your data.
# html is assumed to be a requests response here, e.g. html = requests.get(url)
table_src = html.text.split('<div class="overthrow table_container" id="div_team-stats-per_game">')[1].split('</table>')[0] + '</table>'
table = BeautifulSoup(table_src, 'lxml')
The tables are rendered after the page loads, so you'd need to use Selenium to let them render - but that isn't necessary here, because most of the tables are embedded within HTML comments in the source. You can use BeautifulSoup to pull out the comments, then search through those for the table tags.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
#NBA season
year = 2019
url = 'https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base'.format(year)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except ValueError:
            # the comment mentioned 'table' but contained no parseable table
            continue
This will return you a list of dataframes, so just pull out the table you want from wherever it is located by its index position:
Output:
print (tables[3])
Rk Team G MP FG ... STL BLK TOV PF PTS
0 1.0 Milwaukee Bucks* 82 19780 3555 ... 615 486 1137 1608 9686
1 2.0 Golden State Warriors* 82 19805 3612 ... 625 525 1169 1757 9650
2 3.0 New Orleans Pelicans 82 19755 3581 ... 610 441 1215 1732 9466
3 4.0 Philadelphia 76ers* 82 19805 3407 ... 606 432 1223 1745 9445
4 5.0 Los Angeles Clippers* 82 19830 3384 ... 561 385 1193 1913 9442
5 6.0 Portland Trail Blazers* 82 19855 3470 ... 546 413 1135 1669 9402
6 7.0 Oklahoma City Thunder* 82 19855 3497 ... 766 425 1145 1839 9387
7 8.0 Toronto Raptors* 82 19880 3460 ... 680 437 1150 1724 9384
8 9.0 Sacramento Kings 82 19730 3541 ... 679 363 1095 1751 9363
9 10.0 Washington Wizards 82 19930 3456 ... 683 379 1154 1701 9350
10 11.0 Houston Rockets* 82 19830 3218 ... 700 405 1094 1803 9341
11 12.0 Atlanta Hawks 82 19855 3392 ... 675 419 1397 1932 9294
12 13.0 Minnesota Timberwolves 82 19830 3413 ... 683 411 1074 1664 9223
13 14.0 Boston Celtics* 82 19780 3451 ... 706 435 1052 1670 9216
14 15.0 Brooklyn Nets* 82 19980 3301 ... 539 339 1236 1763 9204
15 16.0 Los Angeles Lakers 82 19780 3491 ... 618 440 1284 1701 9165
16 17.0 Utah Jazz* 82 19755 3314 ... 663 483 1240 1728 9161
17 18.0 San Antonio Spurs* 82 19805 3468 ... 501 386 992 1487 9156
18 19.0 Charlotte Hornets 82 19830 3297 ... 591 405 1001 1550 9081
19 20.0 Denver Nuggets* 82 19730 3439 ... 634 363 1102 1644 9075
20 21.0 Dallas Mavericks 82 19780 3182 ... 533 351 1167 1650 8927
21 22.0 Indiana Pacers* 82 19705 3390 ... 713 404 1122 1594 8857
22 23.0 Phoenix Suns 82 19880 3289 ... 735 418 1279 1932 8815
23 24.0 Orlando Magic* 82 19780 3316 ... 543 445 1082 1526 8800
24 25.0 Detroit Pistons* 82 19855 3185 ... 569 331 1135 1811 8778
25 26.0 Miami Heat 82 19730 3251 ... 627 448 1208 1712 8668
26 27.0 Chicago Bulls 82 19905 3266 ... 603 351 1159 1663 8605
27 28.0 New York Knicks 82 19780 3134 ... 557 422 1151 1713 8575
28 29.0 Cleveland Cavaliers 82 19755 3189 ... 534 195 1106 1642 8567
29 30.0 Memphis Grizzlies 82 19880 3113 ... 684 448 1147 1801 8490
30 NaN League Average 82 19815 3369 ... 626 406 1155 1714 9119
[31 rows x 25 columns]
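If you'd rather not rely on the index position, here is a small sketch that pulls the per-game table out of the comments by the table id from the question (assuming the id stays 'team-stats-per_game'):
per_game = None
for each in comments:
    if 'team-stats-per_game' in each:
        # read_html can filter on the table's HTML attributes
        per_game = pd.read_html(each, attrs={'id': 'team-stats-per_game'})[0]
        break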
As other answers mentioned, this is basically because the content of the page is loaded by JavaScript, and fetching the source code with urlopen or requests will not load that dynamic part.
So here is a way around it: you can use Selenium to let the dynamic content load, then get the page source from there and find the table.
Here is code that actually gives the result you expected, but you will need to set up the Selenium web driver first:
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver

def parse(url):
    driver = webdriver.Firefox()
    driver.get(url)
    sleep(3)  # give the JavaScript time to render the tables
    source_code = driver.page_source
    driver.quit()
    return source_code

year = 2019
soup = BeautifulSoup(parse("https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base".format(year)), 'lxml')
x = soup.find("table", id="team-stats-per_game")
print(x)
Hope this helps with your problem, and feel free to ask if you have any further doubts.
Happy coding :)
I am trying to concat two dataframes:
DataFrame 1 'AB1'
AB_BH AB_CA
Date
2007-01-05 305 324
2007-01-12 427 435
2007-01-19 481 460
2007-01-26 491 506
2007-02-02 459 503
2007-02-09 459 493
2007-02-16 450 486
DataFrame 2 'ABFluid'
Obj Total Rigs
Date
2007-01-03 312
2007-01-09 412
2007-01-16 446
2007-01-23 468
2007-01-30 456
2007-02-06 465
2007-02-14 456
2007-02-20 435
2007-02-27 440
Using the following code:
rigdata = pd.concat([AB1, ABFluid['Total Rigs']], axis=1)
Which results in this:
AB_BH AB_CA Total Rigs
Date
2007-01-03 NaN NaN 312
2007-01-05 305 324 NaN
2007-01-09 NaN NaN 412
2007-01-12 427 435 NaN
2007-01-16 NaN NaN 446
2007-01-19 481 460 NaN
2007-01-23 NaN NaN 468
2007-01-26 491 506 NaN
But I am looking to force the 'Total Rigs' dataframe to have the same dates as the AB1 frame like this:
AB_BH AB_CA Total Rigs
Date
2007-01-03 305 324 312
2007-01-12 427 435 412
2007-01-19 481 460 446
2007-01-26 491 506 468
Which is just aligning them by column and re-indexing the dates.
Any suggestions?
You could do ABFluid.index = AB1.index before the concat, to make the second DataFrame have the same index as the first.
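A minimal sketch of that idea (note the index lengths must match, so the longer ABFluid is trimmed to AB1's length first; the trimming step is my addition, not part of the answer above):
ABFluid = ABFluid.iloc[:len(AB1)].copy()  # keep only as many rows as AB1 has
ABFluid.index = AB1.index                 # adopt AB1's dates
rigdata = pd.concat([AB1, ABFluid['Total Rigs']], axis=1)
print(rigdata)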