Web Scraping NBA Reference for Misc Stats - python

I'm new to web scraping and am trying to retrieve the miscellaneous stats table from https://www.basketball-reference.com/leagues/NBA_2021.html using BeautifulSoup. I have some code written, but I'm unable to print the required table; soup.find just returns None.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
url = "http://www.basketball-reference.com/leagues/NBA_2021.html"
data = urlopen(url)
soup = BeautifulSoup(data)
table = soup.find('table', id='misc_stats')
print(table)
Any help would be appreciated. Thank you

The sports-reference.com sites keep some of those tables inside comments in the source html, so you need to pull out the comments first, then parse the tables found in them:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}

url = "http://www.basketball-reference.com/leagues/NBA_2021.html"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# pull every comment node out of the parsed document
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(str(each), attrs={'id': 'misc_stats'}, header=1)[0])
        except ValueError:
            # this comment holds no table with id='misc_stats'; skip it
            continue

df = tables[0]
Output of print(df.to_string()):
Rk Team Age W L PW PL MOV SOS SRS ORtg DRtg NRtg Pace FTr 3PAr TS% eFG% TOV% ORB% FT/FGA eFG%.1 TOV%.1 DRB% FT/FGA.1 Arena Attend. Attend./G
0 1.0 Los Angeles Lakers 29.0 16.0 6.0 16 6 7.73 -0.08 7.65 112.6 104.8 7.8 99.3 0.257 0.352 0.578 0.547 13.1 23.3 0.193 0.513 12.9 80.5 0.152 STAPLES Center 0 0
1 2.0 Utah Jazz 28.5 16.0 5.0 15 6 7.57 -0.20 7.38 115.9 108.2 7.7 98.2 0.245 0.485 0.587 0.557 13.1 26.2 0.185 0.507 10.4 79.7 0.152 Vivint Smart Home Arena 21290 1935
2 3.0 Milwaukee Bucks 28.0 12.0 8.0 15 5 8.15 -0.79 7.36 118.4 110.4 8.0 101.8 0.225 0.425 0.598 0.576 12.5 24.7 0.164 0.532 11.7 79.7 0.161 Fiserv Forum 0 0
3 4.0 Los Angeles Clippers 29.1 16.0 6.0 16 6 7.23 -1.13 6.10 117.8 110.4 7.4 97.4 0.235 0.413 0.603 0.565 12.1 22.2 0.200 0.537 13.3 80.0 0.189 STAPLES Center 0 0
4 5.0 Denver Nuggets 26.5 12.0 8.0 13 7 4.95 -0.19 4.76 117.0 112.0 5.0 97.1 0.244 0.383 0.587 0.556 12.5 26.1 0.187 0.551 13.8 78.7 0.204 Ball Arena 0 0
5 6.0 Brooklyn Nets 28.1 14.0 9.0 14 9 4.48 -0.56 3.92 117.9 113.5 4.4 101.9 0.264 0.415 0.620 0.584 13.4 20.5 0.217 0.524 10.6 77.6 0.192 Barclays Center 0 0
6 7.0 Phoenix Suns 26.6 11.0 8.0 11 8 2.84 0.24 3.08 110.8 108.0 2.8 97.5 0.214 0.428 0.572 0.537 12.3 18.9 0.179 0.521 12.4 80.0 0.193 Phoenix Suns Arena 0 0
7 8.0 Philadelphia 76ers 26.7 15.0 6.0 13 8 4.19 -1.13 3.06 111.5 107.4 4.1 101.6 0.299 0.351 0.576 0.538 13.8 23.7 0.228 0.515 13.4 77.9 0.199 Wells Fargo Center 0 0
8 9.0 Atlanta Hawks 24.3 10.0 10.0 12 8 2.50 0.26 2.76 112.2 109.7 2.5 99.2 0.298 0.396 0.564 0.517 12.9 25.1 0.243 0.506 11.5 77.1 0.203 State Farm Arena 3529 353
9 10.0 Boston Celtics 25.5 11.0 8.0 11 8 2.53 -0.03 2.50 112.4 109.8 2.6 99.3 0.236 0.359 0.570 0.541 13.4 25.3 0.178 0.536 13.8 77.9 0.209 TD Garden 0 0
10 11.0 Memphis Grizzlies 24.8 9.0 7.0 9 7 1.31 1.15 2.47 108.9 107.6 1.3 100.6 0.192 0.327 0.551 0.523 12.4 23.0 0.149 0.530 14.6 77.5 0.190 FedEx Forum 410 51
11 12.0 Indiana Pacers 26.8 12.0 9.0 12 9 2.71 -0.33 2.38 113.0 110.3 2.7 99.9 0.238 0.381 0.583 0.553 12.6 20.4 0.182 0.533 13.3 76.9 0.194 Bankers Life Fieldhouse 0 0
12 13.0 Houston Rockets 28.4 10.0 9.0 11 8 2.95 -0.97 1.98 109.4 106.5 2.9 102.1 0.255 0.445 0.573 0.541 13.7 19.3 0.193 0.512 13.5 76.8 0.195 Toyota Center 28141 3127
13 14.0 Toronto Raptors 27.2 9.0 12.0 12 9 1.67 -1.33 0.34 111.6 109.9 1.7 100.2 0.238 0.479 0.570 0.532 12.9 22.0 0.195 0.533 14.9 77.4 0.234 Amalie Arena 10989 999
14 15.0 Dallas Mavericks 26.4 8.0 13.0 9 12 -2.00 2.00 0.00 109.6 111.6 -2.0 98.7 0.264 0.411 0.559 0.525 11.3 18.5 0.199 0.530 12.7 76.7 0.216 American Airlines Center 0 0
15 16.0 San Antonio Spurs 26.9 11.0 10.0 10 11 -1.05 0.92 -0.13 110.3 111.3 -1.0 100.3 0.224 0.331 0.550 0.516 10.0 19.9 0.175 0.547 12.5 78.8 0.156 AT&T Center 0 0
16 17.0 Golden State Warriors 26.7 11.0 10.0 10 11 -1.05 0.77 -0.28 108.6 109.6 -1.0 103.2 0.262 0.417 0.563 0.527 12.7 18.4 0.201 0.514 13.5 75.8 0.249 Chase Center 0 0
17 18.0 Charlotte Hornets 24.8 10.0 11.0 10 11 -0.62 -0.41 -1.03 110.2 110.8 -0.6 99.0 0.247 0.414 0.560 0.529 13.1 23.3 0.185 0.544 14.1 75.0 0.166 Spectrum Center 0 0
18 19.0 New York Knicks 24.4 9.0 13.0 10 12 -2.00 0.53 -1.47 107.1 109.2 -2.1 95.4 0.264 0.319 0.538 0.500 12.6 23.8 0.203 0.503 10.7 76.9 0.198 Madison Square Garden (IV) 0 0
19 20.0 Portland Trail Blazers 27.3 11.0 9.0 9 11 -1.65 -0.07 -1.72 115.0 116.6 -1.6 99.8 0.229 0.460 0.567 0.529 10.1 21.5 0.190 0.560 12.2 78.0 0.209 Moda Center 0 0
20 21.0 Chicago Bulls 24.9 8.0 11.0 8 11 -2.26 0.36 -1.90 110.9 113.1 -2.2 103.4 0.246 0.413 0.590 0.556 15.2 20.8 0.196 0.553 12.9 80.0 0.217 United Center 0 0
21 22.0 New Orleans Pelicans 25.1 7.0 12.0 8 11 -2.58 -0.17 -2.75 110.3 112.8 -2.5 99.6 0.284 0.365 0.558 0.526 13.4 25.6 0.203 0.549 12.8 79.9 0.193 Smoothie King Center 8820 980
22 23.0 Detroit Pistons 26.3 5.0 16.0 7 14 -4.67 1.82 -2.85 107.7 112.4 -4.7 98.5 0.273 0.408 0.544 0.501 13.0 22.6 0.215 0.558 14.2 76.6 0.194 Little Caesars Arena 0 0
23 24.0 Cleveland Cavaliers 24.7 10.0 11.0 8 13 -4.19 0.04 -4.15 104.9 109.1 -4.2 97.2 0.254 0.309 0.536 0.505 14.4 25.9 0.181 0.537 14.9 75.3 0.170 Quicken Loans Arena 12564 1142
24 25.0 Miami Heat 26.7 7.0 13.0 7 13 -5.45 0.29 -5.16 106.9 112.3 -5.4 98.9 0.263 0.452 0.581 0.547 15.7 17.0 0.204 0.543 13.3 76.6 0.183 AmericanAirlines Arena 0 0
25 26.0 Sacramento Kings 25.7 9.0 11.0 7 13 -5.80 0.45 -5.35 112.7 118.4 -5.7 100.1 0.283 0.377 0.576 0.546 13.0 23.5 0.203 0.558 11.6 75.8 0.194 Golden 1 Center 0 0
26 27.0 Washington Wizards 26.2 4.0 13.0 6 11 -5.29 -0.85 -6.14 112.1 117.2 -5.1 104.4 0.282 0.374 0.569 0.534 11.6 20.7 0.212 0.565 12.8 78.9 0.251 Capital One Arena 0 0
27 28.0 Oklahoma City Thunder 23.7 8.0 11.0 5 14 -8.26 0.61 -7.66 105.2 113.3 -8.1 101.3 0.243 0.446 0.556 0.527 12.9 15.7 0.176 0.537 10.9 77.7 0.157 Chesapeake Energy Arena 0 0
28 29.0 Orlando Magic 26.2 8.0 14.0 6 16 -6.82 -1.40 -8.22 105.5 112.3 -6.8 98.9 0.220 0.358 0.526 0.490 12.2 24.1 0.174 0.547 12.4 79.7 0.173 Amway Center 35768 3252
29 30.0 Minnesota Timberwolves 23.5 5.0 15.0 5 15 -9.30 0.55 -8.76 104.6 113.7 -9.1 101.1 0.230 0.377 0.530 0.497 12.7 23.3 0.174 0.539 13.3 75.0 0.217 Target Center 0 0
30 NaN League Average 26.3 NaN NaN 10 10 0.00 0.00 0.00 111.1 111.1 NaN 99.8 0.250 0.396 0.568 0.534 12.8 22.2 0.193 0.534 12.8 77.8 0.193 NaN 4050 400
If you look at the source html, you'll see the tables in the comments start with <!--, and BeautifulSoup skips over those. Hence, you need the part of the code that specifically finds the comments: comments = soup.find_all(string=lambda text: isinstance(text, Comment)). Once you have all the comments, you can iterate through each one to see if there's a table in it. If there is, parse it like you normally would parse the non-commented <table> tags.
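If you only need this one table, a slightly more direct variant of the same idea (a sketch reusing the soup and comments from above) is to parse each comment and look the table up by its id before handing it to pandas:
for comment in comments:
    comment_soup = BeautifulSoup(comment, 'html.parser')
    table = comment_soup.find('table', id='misc_stats')
    if table is not None:
        # pd.read_html returns a list of frames; the attrs filter is no longer needed
        df = pd.read_html(str(table), header=1)[0]
        break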

Related

pd.concat ends up with ValueError: no types given

I'm trying to merge two dfs (basically the same df captured at different times) using pd.concat.
Here is my code:
Aujourdhui = datetime.datetime.now()
Aujourdhui = (Aujourdhui.strftime("%X"))
PerfsL1 = pd.read_html('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard', header=1)[0]
PerfsL1.columns = ['Équipe', 'Used_players', 'age', 'Possesion', "nb_matchs", "Starts", "Min",
'90s','Buts','Assists', 'No_penaltis', 'Penaltis', 'Penaltis_tentes',
'Cartons_jaunes', 'Cartons_rouges', 'Buts/90mn','Assists/90mn', 'B+A /90mn',
'NoPenaltis/90mn', 'B+A+P/90mn','Exp_buts','Exp_NoPenaltis', 'Exp_Assists', 'Exp_NP+A',
'Exp_buts/90mn', 'Exp_Assists/90mn','Exp_B+A/90mn','Exp_NoPenaltis/90mn', 'Exp_NP+A/90mn']
PerfsL1.insert(0, "Date", Aujourdhui)
print(PerfsL1)
PerfsL12 = pd.read_csv('Ligue_1_Perfs.csv', index_col=0)
print(PerfsL12)
PerfsL1 = pd.concat([PerfsL1, PerfsL12], ignore_index = True)
print (PerfsL1)
I successfully managed to get both dfs individually, and they share the same columns, but I can't merge them; I get
ValueError: no types given.
Do you have an idea where it could be coming from?
EDIT
Here are both dataframes:
'Ligue_1.csv'
Date Équipe Used_players age Possesion nb_matchs ... Exp_NP+A Exp_buts/90mn Exp_Assists/90mn Exp_B+A/90mn Exp_NoPenaltis/90mn Exp_NP+A/90mn
0 00:37:48 Ajaccio 18 29.1 34.5 2 ... 1.6 0.97 0.24 1.20 0.57 0.81
1 00:37:48 Angers 18 26.8 55.0 2 ... 5.9 1.78 1.18 2.96 1.78 2.96
2 00:37:48 Auxerre 15 29.4 39.5 2 ... 3.3 0.83 0.80 1.63 0.83 1.63
3 00:37:48 Brest 18 26.8 42.5 2 ... 5.0 1.67 1.23 2.90 1.28 2.51
4 00:37:48 Clermont Foot 18 27.8 48.5 2 ... 1.8 0.89 0.38 1.27 0.50 0.88
5 00:37:48 Lens 16 26.2 63.0 2 ... 5.6 1.92 1.29 3.21 1.53 2.82
6 00:37:48 Lille 18 27.2 65.0 2 ... 7.3 2.02 1.65 3.66 2.02 3.66
7 00:37:48 Lorient 14 25.8 36.0 1 ... 0.6 0.37 0.26 0.63 0.37 0.63
8 00:37:48 Lyon 15 26.0 68.0 1 ... 1.2 1.52 0.49 2.00 0.73 1.22
9 00:37:48 Marseille 17 26.9 55.0 2 ... 4.9 1.40 1.03 2.43 1.40 2.43
10 00:37:48 Monaco 19 24.8 40.5 2 ... 7.1 2.74 1.19 3.93 2.35 3.54
11 00:37:48 Montpellier 19 25.5 47.5 2 ... 3.2 0.93 0.66 1.59 0.93 1.59
12 00:37:48 Nantes 16 26.9 40.5 2 ... 3.9 1.37 0.60 1.97 1.37 1.97
13 00:37:48 Nice 18 25.9 54.0 2 ... 3.1 1.25 0.69 1.94 0.86 1.55
14 00:37:48 Paris S-G 18 27.6 60.0 2 ... 8.1 3.05 1.76 4.81 2.27 4.03
PerfsL1 (freshly scraped and renamed as in the code above):
print(PerfsL1)
Date Équipe Used_players age Possesion nb_matchs ... Exp_NP+A Exp_buts/90mn Exp_Assists/90mn Exp_B+A/90mn Exp_NoPenaltis/90mn Exp_NP+A/90mn
0 09:56:18 Ajaccio 18 29.1 34.5 2 ... 1.6 0.97 0.24 1.20 0.57 0.81
1 09:56:18 Angers 18 26.8 55.0 2 ... 5.9 1.78 1.18 2.96 1.78 2.96
2 09:56:18 Auxerre 15 29.4 39.5 2 ... 3.3 0.83 0.80 1.63 0.83 1.63
3 09:56:18 Brest 18 26.8 42.5 2 ... 5.0 1.67 1.23 2.90 1.28 2.51
4 09:56:18 Clermont Foot 18 27.8 48.5 2 ... 1.8 0.89 0.38 1.27 0.50 0.88
5 09:56:18 Lens 16 26.2 63.0 2 ... 5.6 1.92 1.29 3.21 1.53 2.82
6 09:56:18 Lille 18 27.2 65.0 2 ... 7.3 2.02 1.65 3.66 2.02 3.66
7 09:56:18 Lorient 14 25.8 36.0 1 ... 0.6 0.37 0.26 0.63 0.37 0.63
8 09:56:18 Lyon 15 26.0 68.0 1 ... 1.2 1.52 0.49 2.00 0.73 1.22
9 09:56:18 Marseille 17 26.9 55.0 2 ... 4.9 1.40 1.03 2.43 1.40 2.43
10 09:56:18 Monaco 19 24.8 40.5 2 ... 7.1 2.74 1.19 3.93 2.35 3.54
11 09:56:18 Montpellier 19 25.5 47.5 2 ... 3.2 0.93 0.66 1.59 0.93 1.59
12 09:56:18 Nantes 16 26.9 40.5 2 ... 3.9 1.37 0.60 1.97 1.37 1.97
13 09:56:18 Nice 18 25.9 54.0 2 ... 3.1 1.25 0.69 1.94 0.86 1.55
Thank you for your support, and have a great day!
Your code should work.
Nevertheless, try this before the concat:
PerfsL1["Date"] = pd.to_datetime(PerfsL1["Date"], format="%X", errors=‘coerce’)
I finally managed to concat both tables.
The solution was to round-trip both through CSV first (note that pd.read_html returns a list, hence the [0]):
table1 = pd.read_html('http://.......1........com')[0]
table1.to_csv('C://.....1........')
table1 = pd.read_csv('C://.....1........')
table2 = pd.read_html('http://.......2........com')[0]
table2.to_csv('C://.....2........')
table2 = pd.read_csv('C://.....2........')
x = pd.concat([table2, table1])
And now it works perfectly !
Thanks for your help !
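For what it's worth, the "no types given" ValueError appears to come from deep inside pandas' dtype resolution (it is raised when an internal list of types ends up empty), which usually points at an empty or misaligned frame rather than at the values themselves. A minimal sanity check before the concat, assuming both frames look as printed above, would be:
assert list(PerfsL1.columns) == list(PerfsL12.columns), "column names differ"
assert len(PerfsL1) > 0 and len(PerfsL12) > 0, "one of the frames is empty"
PerfsL1 = pd.concat([PerfsL1, PerfsL12], ignore_index=True)
The CSV round-trip most likely worked because writing and re-reading normalizes the column index and dtypes of the scraped frame.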

scrape from descendant tags within div using bs4 in python

My code only gets as far as finding div class="league-player-tracking-shots". I've used descendants, children and contents, but I can't get to the bottom of the tree. I need the value in the td tag. Please help.
URL: https://stats.nba.com/players/bio/?sort=PLAYER_NAME&dir=-1
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'https://stats.nba.com/players/bio/?sort=PLAYER_NAME&dir=-1'
##url = input('Enter -')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
player = soup.find('nba-stat-table')
stat_table = player.find(class_='nba-stat-table__overlay')
for child in stat_table.children:
    c = child.findAll('td')
    # print(c)
print(player)
print(stat_table)
The page loads the data from an external source, so you can use the requests module to simulate that request directly.
For example:
import json
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Referer': 'https://stats.nba.com/players/bio/?sort=PLAYER_NAME&dir=-1',
    'x-nba-stats-token': 'true',
}

url = 'https://stats.nba.com/stats/leaguedashplayerbiostats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&Season=2019-20&SeasonSegment=&SeasonType=Regular Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight='

data = requests.get(url, headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for row in data['resultSets'][0]['rowSet']:
    print(('{:<20}'*len(row)).format(*row))
Prints:
203932 Aaron Gordon 1610612753 ORL 24.0 6-8 80 235 Arizona USA 2014 1 4 58 14.4 7.6 3.7 -2.0 0.051 0.175 0.203 0.511 0.164
1628988 Aaron Holiday 1610612754 IND 23.0 6-0 72 185 UCLA USA 2018 1 23 58 9.4 2.3 3.3 1.9 0.015 0.076 0.188 0.517 0.191
1627846 Abdel Nader 1610612760 OKC 26.0 6-5 77 225 Iowa State Egypt 2016 2 58 48 6.0 1.9 0.7 -3.4 0.018 0.095 0.16 0.58 0.07
1629690 Adam Mokoka 1610612741 CHI 21.0 6-5 77 190 None France Undrafted Undrafted Undrafted 11 2.9 0.9 0.4 17.1 0.057 0.029 0.11 0.538 0.043
1629678 Admiral Schofield 1610612764 WAS 23.0 6-5 77 241 Tennessee United Kingdom 2019 2 42 27 3.1 1.3 0.5 -4.9 0.018 0.092 0.123 0.514 0.068
201143 Al Horford 1610612755 PHI 34.0 6-9 81 240 Florida Dominican Republic 2007 1 3 60 12.0 6.9 4.1 3.5 0.05 0.17 0.174 0.526 0.185
202329 Al-Farouq Aminu 1610612753 ORL 29.0 6-8 80 220 Wake Forest USA 2010 1 8 18 4.3 4.8 1.2 -5.4 0.053 0.158 0.127 0.395 0.088
202692 Alec Burks 1610612755 PHI 28.0 6-6 78 214 Colorado USA 2011 1 12 59 15.1 4.4 2.9 -8.3 0.025 0.134 0.23 0.549 0.178
1629346 Alen Smailagic 1610612744 GSW 19.0 6-10 82 215 None Serbia 2019 2 39 14 4.2 1.9 0.9 -3.0 0.077 0.11 0.175 0.61 0.133
1627936 Alex Caruso 1610612747 LAL 26.0 6-5 77 186 Texas A&M USA Undrafted Undrafted Undrafted 58 5.4 1.9 1.8 10.3 0.014 0.088 0.136 0.538 0.128
203458 Alex Len 1610612758 SAC 27.0 7-0 84 250 Maryland Ukraine 2013 1 5 49 8.3 6.0 1.0 -5.6 0.096 0.195 0.178 0.596 0.085
1628035 Alfonzo McKinnie 1610612739 CLE 27.0 6-7 79 215 None USA Undrafted Undrafted Undrafted 40 4.6 2.8 0.4 -7.4 0.061 0.131 0.147 0.493 0.032
1628993 Alize Johnson 1610612754 IND 24.0 6-7 79 212 Missouri State USA 2018 2 50 13 1.4 1.4 0.2 4.7 0.119 0.204 0.142 0.51 0.045
203459 Allen Crabbe 1610612750 MIN 28.0 6-5 77 212 California USA 2013 2 31 37 4.6 2.1 0.9 -15.0 0.014 0.098 0.122 0.47 0.073
1629019 Allonzo Trier 1610612752 NYK 24.0 6-4 76 200 Arizona USA Undrafted Undrafted Undrafted 24 6.5 1.2 1.2 -11.1 0.018 0.078 0.208 0.62 0.164
1628518 Amile Jefferson 1610612753 ORL 27.0 6-9 81 222 Duke USA Undrafted Undrafted Undrafted 18 0.8 1.3 0.2 8.5 0.112 0.163 0.129 0.372 0.078
...and so on.
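If a DataFrame is more convenient than raw rows, the endpoint's JSON carries its own column names alongside the data. A sketch using the response loaded above ('headers' and 'rowSet' are the field names this response format uses):
import pandas as pd

result = data['resultSets'][0]
df = pd.DataFrame(result['rowSet'], columns=result['headers'])
print(df.head())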

Weird bug when changing boolean number to classify dataframe

I have a dataframe that I am trying to filter with boolean constraints built from numeric comparisons.
hrw_hotdry=combined_hrw[(combined_hrw['June_anom']<0) & (combined_hrw['June_anom_t'])>0]
hrw_hotdry.head()
Year June_val June_anom July_val July_anom June_val_t June_anom_t July_val_t July_anom_t
0 1980 2.14 -1.40 0.99 -2.11 76.7 2.6 83.7 5.0
1 1981 2.85 -0.69 4.01 0.91 75.5 1.4 79.1 0.4
8 1988 2.08 -1.46 3.22 0.12 76.2 2.1 77.5 -1.2
10 1990 1.88 -1.66 3.16 0.06 77.3 3.2 76.7 -2.0
11 1991 3.13 -0.41 2.69 -0.41 75.1 1.0 78.4 -0.3
However, when I change the second constraint to 1 like this:
hrw_hotdry=combined_hrw[(combined_hrw['June_anom']<0) & (combined_hrw['June_anom_t'])>1]
hrw_hotdry.head()
Year June_val June_anom July_val July_anom June_val_t June_anom_t July_val_t July_anom_t
There is no output. How does this make sense?
The parentheses were incorrect: your closing parenthesis lands before the comparison, and since & binds more tightly than >, the original expression is evaluated as (mask & column) > 1 rather than mask & (column > 1). Parenthesize each comparison separately:
hrw_hotdry = combined_hrw[(combined_hrw['June_anom']<0) & (combined_hrw['June_anom_t']>1.0)]
Year June_val June_anom July_val July_anom June_val_t June_anom_t July_val_t July_anom_t
0 1980 2.14 -1.40 0.99 -2.11 76.7 2.6 83.7 5.0
1 1981 2.85 -0.69 4.01 0.91 75.5 1.4 79.1 0.4
8 1988 2.08 -1.46 3.22 0.12 76.2 2.1 77.5 -1.2
10 1990 1.88 -1.66 3.16 0.06 77.3 3.2 76.7 -2.0

Split data frame based on specific value in Python

I have a dataframe as below:
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
10000 .90 1.10 1.30 1.50 2.10 3.10 5.60 8.40 15.80
15000 1.35 1.65 1.95 2.25 3.15 4.65 8.40 12.60 23.70
20000 1.80 2.20 2.60 3.00 4.20 6.20 11.20 16.80 31.60
25000 2.25 2.75 3.25 3.75 5.25 7.75 14.00 21.00 39.50
30000 2.70 3.30 3.90 4.50 6.30 9.30 16.80 25.20 47.40
35000 3.15 3.85 4.55 5.25 7.35 10.85 19.60 29.40 55.30
40000 3.60 4.40 5.20 6.00 8.40 12.40 22.40 33.60 63.20
45000 4.05 4.95 5.85 6.75 9.45 13.95 25.20 37.80 71.10
50000 4.50 5.50 6.50 7.50 10.50 15.50 28.00 42.00 79.00
10000 .60 .80 1.00 1.20 1.80 2.80 5.30 8.10 15.50
15000 .90 1.20 1.50 1.80 2.70 4.20 7.95 12.15 23.25
20000 1.20 1.60 2.00 2.40 3.60 5.60 10.60 16.20 31.00
25000 1.50 2.00 2.50 3.00 4.50 7.00 13.25 20.25 38.75
30000 1.80 2.40 3.00 3.60 5.40 8.40 15.90 24.30 46.50
35000 2.10 2.80 3.50 4.20 6.30 9.80 18.55 28.35 54.25
40000 2.40 3.20 4.00 4.80 7.20 11.20 21.20 32.40 62.00
45000 2.70 3.60 4.50 5.40 8.10 12.60 23.85 36.45 69.75
50000 3.00 4.00 5.00 6.00 9.00 14.00 26.50 40.50 77.50
1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
4000 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
5000 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98
6000 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17
7000 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37
8000 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56
9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95
Now I would like to split it into 3 dataframes based on 'Size':
df1: from the first 10000 up to the next occurrence of 10000
df2: from the second 10000 up to the 1000
df3: from 1000 to the end
Otherwise, it is fine to have a temporary variable (temp column) in the same dataframe specifying categories like S1, S2 and S3 respectively for the above ranges.
Could anyone guide me how to go about this?
Regards
Assuming that you want to break on the decreases, you could use the compare-cumsum-groupby pattern:
parts = list(df.groupby((df["Size"].diff() < 0).cumsum()))
which gives me (suppressing boring rows in the middle)
>>> for key, group in parts:
... print(key)
... print(group)
... print("----")
...
0
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
0 10000 0.90 1.10 1.30 1.50 2.10 3.10 5.6 8.4 15.8
1 15000 1.35 1.65 1.95 2.25 3.15 4.65 8.4 12.6 23.7
2 20000 1.80 2.20 2.60 3.00 4.20 6.20 11.2 16.8 31.6
[...]
7 45000 4.05 4.95 5.85 6.75 9.45 13.95 25.2 37.8 71.1
8 50000 4.50 5.50 6.50 7.50 10.50 15.50 28.0 42.0 79.0
----
1
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
9 10000 0.6 0.8 1.0 1.2 1.8 2.8 5.30 8.10 15.50
10 15000 0.9 1.2 1.5 1.8 2.7 4.2 7.95 12.15 23.25
11 20000 1.2 1.6 2.0 2.4 3.6 5.6 10.60 16.20 31.00
[...]
16 45000 2.7 3.6 4.5 5.4 8.1 12.6 23.85 36.45 69.75
17 50000 3.0 4.0 5.0 6.0 9.0 14.0 26.50 40.50 77.50
----
2
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
18 1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
19 2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
20 3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
[...]
26 9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
27 10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95
----
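The question also says a temporary category column (S1, S2, S3) would do instead of separate frames; the same compare-cumsum trick can supply that label directly (a sketch using the frame above):
df["Segment"] = "S" + ((df["Size"].diff() < 0).cumsum() + 1).astype(str)
parts = {key: group for key, group in df.groupby("Segment")}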
Not so elegant, but this works:
In [259]:
ranges = []
first = df.index[0]
criteria = df.index[df['Size'].diff() < 0]
for idx in criteria:
    ranges.append((first, idx))
    first = idx  # the next range starts where this one ended
ranges
Out[259]:
[(0, 9), (9, 18)]
In [261]:
splits = []
for r in ranges:
    splits.append(df.iloc[r[0]:r[1]])
# the final split runs from the last break to the end of the frame
splits.append(df.iloc[ranges[-1][1]:])
splits
Out[261]:
[ Size C1 C2 C3 C4 C5 C6 C7 C8 C9
0 10000 0.90 1.10 1.30 1.50 2.10 3.10 5.6 8.4 15.8
1 15000 1.35 1.65 1.95 2.25 3.15 4.65 8.4 12.6 23.7
2 20000 1.80 2.20 2.60 3.00 4.20 6.20 11.2 16.8 31.6
3 25000 2.25 2.75 3.25 3.75 5.25 7.75 14.0 21.0 39.5
4 30000 2.70 3.30 3.90 4.50 6.30 9.30 16.8 25.2 47.4
5 35000 3.15 3.85 4.55 5.25 7.35 10.85 19.6 29.4 55.3
6 40000 3.60 4.40 5.20 6.00 8.40 12.40 22.4 33.6 63.2
7 45000 4.05 4.95 5.85 6.75 9.45 13.95 25.2 37.8 71.1
8 50000 4.50 5.50 6.50 7.50 10.50 15.50 28.0 42.0 79.0,
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
9 10000 0.6 0.8 1.0 1.2 1.8 2.8 5.30 8.10 15.50
10 15000 0.9 1.2 1.5 1.8 2.7 4.2 7.95 12.15 23.25
11 20000 1.2 1.6 2.0 2.4 3.6 5.6 10.60 16.20 31.00
12 25000 1.5 2.0 2.5 3.0 4.5 7.0 13.25 20.25 38.75
13 30000 1.8 2.4 3.0 3.6 5.4 8.4 15.90 24.30 46.50
14 35000 2.1 2.8 3.5 4.2 6.3 9.8 18.55 28.35 54.25
15 40000 2.4 3.2 4.0 4.8 7.2 11.2 21.20 32.40 62.00
16 45000 2.7 3.6 4.5 5.4 8.1 12.6 23.85 36.45 69.75
17 50000 3.0 4.0 5.0 6.0 9.0 14.0 26.50 40.50 77.50,
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
18 1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
19 2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
20 3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
21 4000 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
22 5000 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98
23 6000 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17
24 7000 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37
25 8000 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56
26 9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
27 10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95]
So, first, this looks to see where the size stops increasing:
df['Size'].diff() < 0
and we use that to mask the index. We then iterate over those break positions to build a list of (start, stop) tuples, and in the last step we iterate over the ranges to slice the df.

Interpolating a large dataframe onto a sparse, irregular index

I've got one dataframe containing several years of data sampled at 30 min intervals (7 parameters from a continuous water quality sensor), and I've got another dataframe containing data at a few hundred random points in time, with one minute precision. I'd like to find the interpolated values of the 7 parameters at the few hundred random points in time.
So here's a few lines of what these dataframes look like:
print df1
Temp SpCond Sal DO_pct DO_mgl Depth pH Turb
2002-07-16 14:00:00 26.0 45.31 29.3 71.6 4.9 0.95 7.9 -5
2002-07-16 14:30:00 25.9 45.22 29.2 70.4 4.9 0.98 7.9 -6
2002-07-16 15:00:00 26.0 44.92 29.0 76.2 5.3 1.02 7.9 -6
2002-07-16 15:30:00 26.0 45.06 29.1 77.9 5.4 1.06 7.9 -5
2002-07-16 16:00:00 25.9 45.23 29.2 67.0 4.6 1.11 7.8 -6
2002-07-16 16:30:00 25.9 45.33 29.3 72.9 5.0 1.17 7.9 -6
2002-07-16 17:00:00 25.9 45.46 29.4 65.8 4.5 1.21 7.9 -6
2002-07-16 17:30:00 25.9 45.40 29.4 70.5 4.9 1.19 7.9 -6
2002-07-16 18:00:00 25.9 45.27 29.3 74.3 5.1 1.15 7.9 -6
2002-07-16 18:30:00 25.8 45.57 29.5 67.6 4.7 1.11 7.8 -6
...
print df2
PO4F NH4F NO2F NO3F NO23F CHLA_N
DateTimeStamp
2002-07-16 14:01:00 0.053 0.073 0.005 0.021 0.026 8.6
2002-07-16 16:05:00 0.029 0.069 0.002 0.016 0.018 9.6
2002-07-16 18:09:00 0.023 0.073 0.000 NaN 0.014 5.8
...
I want to find the values of df1 at the index values of df2, but the only way I can figure out from reading the docs and other stackoverflow answers is by putting df1 on a one minute time base (which will generate a bunch of nans), then filling the nans using Series.interpolate, and then pulling out the one minute values at the discrete times of df2. That seems incredibly wasteful. There must be another way, right?
If you want interpolation, I think you're stuck with the method you describe, or something approximately as "wasteful." If you can settle for taking the most recent value or the next value, use ffill or bfill respectively.
In [34]: df1.reindex(df2.index, method='ffill')
Out[34]:
Temp SpCond Sal DO_pct DO_mgl Depth pH Turb
DateTimeStamp
2002-07-16 14:01:00 26.0 45.31 29.3 71.6 4.9 0.95 7.9 -5
2002-07-16 16:05:00 25.9 45.23 29.2 67.0 4.6 1.11 7.8 -6
2002-07-16 18:09:00 25.9 45.27 29.3 74.3 5.1 1.15 7.9 -6
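For what it's worth, the full one-minute grid can be avoided if time-weighted linear interpolation is acceptable: reindex df1 onto the union of the two indexes (which only adds the few hundred extra timestamps), interpolate by time, then pull out df2's rows. A sketch of that idea:
combined = df1.reindex(df1.index.union(df2.index))
interpolated = combined.interpolate(method='time').loc[df2.index]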
Here's a way to do what I think you want
Starting frame df1 and df2
In [100]: df1
Out[100]:
Temp SpCond Sal DO_pct DO_mgl Depth pH Turb
time
2002-07-16 14:00:00 26.0 45.31 29.3 71.6 4.9 0.95 7.9 -5
2002-07-16 14:30:00 25.9 45.22 29.2 70.4 4.9 0.98 7.9 -6
2002-07-16 15:00:00 26.0 44.92 29.0 76.2 5.3 1.02 7.9 -6
2002-07-16 15:30:00 26.0 45.06 29.1 77.9 5.4 1.06 7.9 -5
2002-07-16 16:00:00 25.9 45.23 29.2 67.0 4.6 1.11 7.8 -6
2002-07-16 16:30:00 25.9 45.33 29.3 72.9 5.0 1.17 7.9 -6
2002-07-16 17:00:00 25.9 45.46 29.4 65.8 4.5 1.21 7.9 -6
2002-07-16 17:30:00 25.9 45.40 29.4 70.5 4.9 1.19 7.9 -6
2002-07-16 18:00:00 25.9 45.27 29.3 74.3 5.1 1.15 7.9 -6
2002-07-16 18:30:00 25.8 45.57 29.5 67.6 4.7 1.11 7.8 -6
In [101]: df2
Out[101]:
P04F NH4F N02F N03F NO23F CHLA_N
time
2002-07-16 14:01:00 0.053 0.073 0.005 0.021 0.026 8.6
2002-07-16 16:05:00 0.029 0.069 0.002 0.016 0.018 9.6
2002-07-16 18:09:00 0.023 0.073 0.000 NaN 0.014 5.8
Calculate a rounded time: convert the index to integer nanoseconds, then round to the nearest 30*60 seconds. You may have to adjust this if you want to round up or down (to the next 1/2 hour).
In [102]: new_index = pd.DatetimeIndex(int(1e9*30*60)*(np.round(df2.index.asi8/(1e9*30*60))).astype(np.int64)).values
In [104]: new_index
Out[104]:
array(['2002-07-16T10:00:00.000000000-0400',
'2002-07-16T12:00:00.000000000-0400',
'2002-07-16T14:00:00.000000000-0400'], dtype='datetime64[ns]')
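As an aside, more recent pandas (0.18+) can do this rounding step directly on the index, which reads more easily than the nanosecond arithmetic; a sketch:
new_index = df2.index.round('30min')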
Copying just to avoid modifying the original frame. Set the new index
In [105]: df3 = df2.copy()
In [106]: df3.index = new_index
Subselect and join
In [107]: df1.loc[df3.index].join(df3)
Out[107]:
Temp SpCond Sal DO_pct DO_mgl Depth pH Turb P04F NH4F N02F N03F NO23F CHLA_N
2002-07-16 14:00:00 26.0 45.31 29.3 71.6 4.9 0.95 7.9 -5 0.053 0.073 0.005 0.021 0.026 8.6
2002-07-16 16:00:00 25.9 45.23 29.2 67.0 4.6 1.11 7.8 -6 0.029 0.069 0.002 0.016 0.018 9.6
2002-07-16 18:00:00 25.9 45.27 29.3 74.3 5.1 1.15 7.9 -6 0.023 0.073 0.000 NaN 0.014 5.8
