I have a dataframe where I am trying to filter rows using boolean constraints built from numeric comparisons.
hrw_hotdry=combined_hrw[(combined_hrw['June_anom']<0) & (combined_hrw['June_anom_t'])>0]
hrw_hotdry.head()
Year June_val June_anom July_val July_anom June_val_t June_anom_t July_val_t July_anom_t
0 1980 2.14 -1.40 0.99 -2.11 76.7 2.6 83.7 5.0
1 1981 2.85 -0.69 4.01 0.91 75.5 1.4 79.1 0.4
8 1988 2.08 -1.46 3.22 0.12 76.2 2.1 77.5 -1.2
10 1990 1.88 -1.66 3.16 0.06 77.3 3.2 76.7 -2.0
11 1991 3.13 -0.41 2.69 -0.41 75.1 1.0 78.4 -0.3
However, when I change the second constraint to 1 like this:
hrw_hotdry=combined_hrw[(combined_hrw['June_anom']<0) & (combined_hrw['June_anom_t'])>1]
hrw_hotdry.head()
Year June_val June_anom July_val July_anom June_val_t June_anom_t July_val_t July_anom_t
There is no output. How does this make sense?
The parentheses were misplaced. In Python, & binds more tightly than >, so your expression is evaluated as ((combined_hrw['June_anom']<0) & combined_hrw['June_anom_t']) > 0, meaning the comparison is applied to the combined True/False mask rather than to the column. Comparing a boolean mask with > 0 happens to reproduce the mask, but comparing it with > 1 is always False, which is why the second version returns nothing. Move the comparison inside the parentheses:
hrw_hotdry = combined_hrw[(combined_hrw['June_anom']<0) & (combined_hrw['June_anom_t']>1.0)]
Year June_val June_anom July_val July_anom June_val_t June_anom_t July_val_t July_anom_t
0 1980 2.14 -1.40 0.99 -2.11 76.7 2.6 83.7 5.0
1 1981 2.85 -0.69 4.01 0.91 75.5 1.4 79.1 0.4
8 1988 2.08 -1.46 3.22 0.12 76.2 2.1 77.5 -1.2
10 1990 1.88 -1.66 3.16 0.06 77.3 3.2 76.7 -2.0
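A minimal standalone sketch (not your data) of why a boolean mask compared with 0 behaves differently than compared with 1:
import pandas as pd

mask = pd.Series([True, False, True])
print(mask > 0)  # True, False, True   -- reproduces the mask, so the >0 version appeared to work
print(mask > 1)  # False, False, False -- nothing passes, so the >1 version selects no rows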
I want to create a new column diff equal to the difference between consecutive values of a series in another column.
The following is my dataframe:
import pandas as pd

df = pd.DataFrame({
    'series_1': [10.1, 15.3, 16, 12, 14.5, 11.8, 2.3, 7.7, 5, 10],
    'series_2': [9.6, 10.4, 11.2, 3.3, 6, 4, 1.94, 15.44, 6.17, 8.16]
})
It has the following display:
series_1 series_2
0 10.1 9.60
1 15.3 10.40
2 16.0 11.20
3 12.0 3.30
4 14.5 6.00
5 11.8 4.00
6 2.3 1.94
7 7.7 15.44
8 5.0 6.17
9 10.0 8.16
Goal
To get the following output:
series_1 series_2 diff_2
0 10.1 9.60 NaN
1 15.3 10.40 0.80
2 16.0 11.20 0.80
3 12.0 3.30 -7.90
4 14.5 6.00 2.70
5 11.8 4.00 -2.00
6 2.3 1.94 -2.06
7 7.7 15.44 13.50
8 5.0 6.17 -9.27
9 10.0 8.16 1.99
My code
To reach the desired output I used the following code and it worked:
import numpy as np

diff_2 = [np.nan]  # the first row has no previous value
for i in range(1, len(df)):
    diff_2.append(df['series_2'][i] - df['series_2'][i - 1])
df['diff_2'] = diff_2
Issue with my code
I replicated a simplified dataframe here; the real one I am working on is extremely large, and my code took almost 9 minutes to run!
I want an alternative that gets the same output quickly.
Any suggestion from your side will be highly appreciated, thanks.
Here is one way to do it, using diff:
# create a new column by taking the difference between consecutive rows using diff
df['diff_2'] = df['series_2'].diff()
df
series_1 series_2 diff_2
0 10.1 9.60 NaN
1 15.3 10.40 0.80
2 16.0 11.20 0.80
3 12.0 3.30 -7.90
4 14.5 6.00 2.70
5 11.8 4.00 -2.00
6 2.3 1.94 -2.06
7 7.7 15.44 13.50
8 5.0 6.17 -9.27
9 10.0 8.16 1.99
You might want to add the following line of code:
df["diff_2"] = df["series_2"].sub(df["series_2"].shift(1))
to achieve your goal output:
series_1 series_2 diff_2
0 10.1 9.60 NaN
1 15.3 10.40 0.80
2 16.0 11.20 0.80
3 12.0 3.30 -7.90
4 14.5 6.00 2.70
5 11.8 4.00 -2.00
6 2.3 1.94 -2.06
7 7.7 15.44 13.50
8 5.0 6.17 -9.27
9 10.0 8.16 1.99
That is a built-in pandas feature, so it should be optimized for good performance.
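A rough benchmark sketch to confirm the speed difference (the size here is illustrative, not from the question):

import time
import numpy as np
import pandas as pd

big = pd.DataFrame({'series_2': np.random.rand(200_000)})

# the original Python-level loop
start = time.perf_counter()
diff_loop = [np.nan]
for i in range(1, len(big)):
    diff_loop.append(big['series_2'].iloc[i] - big['series_2'].iloc[i - 1])
print(f"loop: {time.perf_counter() - start:.3f}s")

# the vectorized call
start = time.perf_counter()
diff_vec = big['series_2'].diff()
print(f"diff: {time.perf_counter() - start:.3f}s")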
I'm trying to merge two dfs (basically the same df at two different times) using pd.concat.
Here is my code:
import datetime
import pandas as pd

Aujourdhui = datetime.datetime.now().strftime("%X")
PerfsL1 = pd.read_html('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard', header=1)[0]
PerfsL1.columns = ['Équipe', 'Used_players', 'age', 'Possesion', "nb_matchs", "Starts", "Min",
'90s','Buts','Assists', 'No_penaltis', 'Penaltis', 'Penaltis_tentes',
'Cartons_jaunes', 'Cartons_rouges', 'Buts/90mn','Assists/90mn', 'B+A /90mn',
'NoPenaltis/90mn', 'B+A+P/90mn','Exp_buts','Exp_NoPenaltis', 'Exp_Assists', 'Exp_NP+A',
'Exp_buts/90mn', 'Exp_Assists/90mn','Exp_B+A/90mn','Exp_NoPenaltis/90mn', 'Exp_NP+A/90mn']
PerfsL1.insert(0, "Date", Aujourdhui)
print(PerfsL1)
PerfsL12 = pd.read_csv('Ligue_1_Perfs.csv', index_col=0)
print(PerfsL12)
PerfsL1 = pd.concat([PerfsL1, PerfsL12], ignore_index = True)
print (PerfsL1)
I successfully managed to build both dfs individually, and they share the same columns, but I can't concatenate them; I get
ValueError: no types given.
Do you have an idea where it could be coming from?
EDIT
Here are both dataframes:
'Ligue_1.csv'
Date Équipe Used_players age Possesion nb_matchs ... Exp_NP+A Exp_buts/90mn Exp_Assists/90mn Exp_B+A/90mn Exp_NoPenaltis/90mn Exp_NP+A/90mn
0 00:37:48 Ajaccio 18 29.1 34.5 2 ... 1.6 0.97 0.24 1.20 0.57 0.81
1 00:37:48 Angers 18 26.8 55.0 2 ... 5.9 1.78 1.18 2.96 1.78 2.96
2 00:37:48 Auxerre 15 29.4 39.5 2 ... 3.3 0.83 0.80 1.63 0.83 1.63
3 00:37:48 Brest 18 26.8 42.5 2 ... 5.0 1.67 1.23 2.90 1.28 2.51
4 00:37:48 Clermont Foot 18 27.8 48.5 2 ... 1.8 0.89 0.38 1.27 0.50 0.88
5 00:37:48 Lens 16 26.2 63.0 2 ... 5.6 1.92 1.29 3.21 1.53 2.82
6 00:37:48 Lille 18 27.2 65.0 2 ... 7.3 2.02 1.65 3.66 2.02 3.66
7 00:37:48 Lorient 14 25.8 36.0 1 ... 0.6 0.37 0.26 0.63 0.37 0.63
8 00:37:48 Lyon 15 26.0 68.0 1 ... 1.2 1.52 0.49 2.00 0.73 1.22
9 00:37:48 Marseille 17 26.9 55.0 2 ... 4.9 1.40 1.03 2.43 1.40 2.43
10 00:37:48 Monaco 19 24.8 40.5 2 ... 7.1 2.74 1.19 3.93 2.35 3.54
11 00:37:48 Montpellier 19 25.5 47.5 2 ... 3.2 0.93 0.66 1.59 0.93 1.59
12 00:37:48 Nantes 16 26.9 40.5 2 ... 3.9 1.37 0.60 1.97 1.37 1.97
13 00:37:48 Nice 18 25.9 54.0 2 ... 3.1 1.25 0.69 1.94 0.86 1.55
14 00:37:48 Paris S-G 18 27.6 60.0 2 ... 8.1 3.05 1.76 4.81 2.27 4.03
print(PerfsL1)  # the freshly scraped frame
Date Équipe Used_players age Possesion nb_matchs ... Exp_NP+A Exp_buts/90mn Exp_Assists/90mn Exp_B+A/90mn Exp_NoPenaltis/90mn Exp_NP+A/90mn
0 09:56:18 Ajaccio 18 29.1 34.5 2 ... 1.6 0.97 0.24 1.20 0.57 0.81
1 09:56:18 Angers 18 26.8 55.0 2 ... 5.9 1.78 1.18 2.96 1.78 2.96
2 09:56:18 Auxerre 15 29.4 39.5 2 ... 3.3 0.83 0.80 1.63 0.83 1.63
3 09:56:18 Brest 18 26.8 42.5 2 ... 5.0 1.67 1.23 2.90 1.28 2.51
4 09:56:18 Clermont Foot 18 27.8 48.5 2 ... 1.8 0.89 0.38 1.27 0.50 0.88
5 09:56:18 Lens 16 26.2 63.0 2 ... 5.6 1.92 1.29 3.21 1.53 2.82
6 09:56:18 Lille 18 27.2 65.0 2 ... 7.3 2.02 1.65 3.66 2.02 3.66
7 09:56:18 Lorient 14 25.8 36.0 1 ... 0.6 0.37 0.26 0.63 0.37 0.63
8 09:56:18 Lyon 15 26.0 68.0 1 ... 1.2 1.52 0.49 2.00 0.73 1.22
9 09:56:18 Marseille 17 26.9 55.0 2 ... 4.9 1.40 1.03 2.43 1.40 2.43
10 09:56:18 Monaco 19 24.8 40.5 2 ... 7.1 2.74 1.19 3.93 2.35 3.54
11 09:56:18 Montpellier 19 25.5 47.5 2 ... 3.2 0.93 0.66 1.59 0.93 1.59
12 09:56:18 Nantes 16 26.9 40.5 2 ... 3.9 1.37 0.60 1.97 1.37 1.97
13 09:56:18 Nice 18 25.9 54.0 2 ... 3.1 1.25 0.69 1.94 0.86 1.55
Thank you for your support and have a great day!
Your code should work.
Nevertheless, try this before the concat:
PerfsL1["Date"] = pd.to_datetime(PerfsL1["Date"], format="%X", errors=‘coerce’)
I finally managed to concat both tables.
The solution was to save both as CSV first:
table1 = pd.read_html('http://.......1........com')[0]  # read_html returns a list of tables
table1.to_csv('C://.....1........')
table1 = pd.read_csv('C://.....1........')
table2 = pd.read_html('http://.......2........com')[0]
table2.to_csv('C://.....2........')
table2 = pd.read_csv('C://.....2........')
x = pd.concat([table2, table1])
And now it works perfectly !
Thanks for your help !
I'm new to web scraping and trying to retrieve the miscellaneous stats table from https://www.basketball-reference.com/leagues/NBA_2021.html using BeautifulSoup. I have some code written, but I'm unable to print the required table; the lookup just returns None.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
url = "http://www.basketball-reference.com/leagues/NBA_2021.html"
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
table = soup.find('table', id='misc_stats')
print(table)
Any help would be appreciated. Thank you
The sports-reference.com sites have some of those tables within the comments of the source html. So you need to pull out the comments, then parse the tables in there:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "http://www.basketball-reference.com/leagues/NBA_2021.html"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(str(each), attrs={'id': 'misc_stats'}, header=1)[0])
        except ValueError:
            # this comment contained a table tag but not the one we want
            continue
df = tables[0]
Output:
print(df.to_string())
Rk Team Age W L PW PL MOV SOS SRS ORtg DRtg NRtg Pace FTr 3PAr TS% eFG% TOV% ORB% FT/FGA eFG%.1 TOV%.1 DRB% FT/FGA.1 Arena Attend. Attend./G
0 1.0 Los Angeles Lakers 29.0 16.0 6.0 16 6 7.73 -0.08 7.65 112.6 104.8 7.8 99.3 0.257 0.352 0.578 0.547 13.1 23.3 0.193 0.513 12.9 80.5 0.152 STAPLES Center 0 0
1 2.0 Utah Jazz 28.5 16.0 5.0 15 6 7.57 -0.20 7.38 115.9 108.2 7.7 98.2 0.245 0.485 0.587 0.557 13.1 26.2 0.185 0.507 10.4 79.7 0.152 Vivint Smart Home Arena 21290 1935
2 3.0 Milwaukee Bucks 28.0 12.0 8.0 15 5 8.15 -0.79 7.36 118.4 110.4 8.0 101.8 0.225 0.425 0.598 0.576 12.5 24.7 0.164 0.532 11.7 79.7 0.161 Fiserv Forum 0 0
3 4.0 Los Angeles Clippers 29.1 16.0 6.0 16 6 7.23 -1.13 6.10 117.8 110.4 7.4 97.4 0.235 0.413 0.603 0.565 12.1 22.2 0.200 0.537 13.3 80.0 0.189 STAPLES Center 0 0
4 5.0 Denver Nuggets 26.5 12.0 8.0 13 7 4.95 -0.19 4.76 117.0 112.0 5.0 97.1 0.244 0.383 0.587 0.556 12.5 26.1 0.187 0.551 13.8 78.7 0.204 Ball Arena 0 0
5 6.0 Brooklyn Nets 28.1 14.0 9.0 14 9 4.48 -0.56 3.92 117.9 113.5 4.4 101.9 0.264 0.415 0.620 0.584 13.4 20.5 0.217 0.524 10.6 77.6 0.192 Barclays Center 0 0
6 7.0 Phoenix Suns 26.6 11.0 8.0 11 8 2.84 0.24 3.08 110.8 108.0 2.8 97.5 0.214 0.428 0.572 0.537 12.3 18.9 0.179 0.521 12.4 80.0 0.193 Phoenix Suns Arena 0 0
7 8.0 Philadelphia 76ers 26.7 15.0 6.0 13 8 4.19 -1.13 3.06 111.5 107.4 4.1 101.6 0.299 0.351 0.576 0.538 13.8 23.7 0.228 0.515 13.4 77.9 0.199 Wells Fargo Center 0 0
8 9.0 Atlanta Hawks 24.3 10.0 10.0 12 8 2.50 0.26 2.76 112.2 109.7 2.5 99.2 0.298 0.396 0.564 0.517 12.9 25.1 0.243 0.506 11.5 77.1 0.203 State Farm Arena 3529 353
9 10.0 Boston Celtics 25.5 11.0 8.0 11 8 2.53 -0.03 2.50 112.4 109.8 2.6 99.3 0.236 0.359 0.570 0.541 13.4 25.3 0.178 0.536 13.8 77.9 0.209 TD Garden 0 0
10 11.0 Memphis Grizzlies 24.8 9.0 7.0 9 7 1.31 1.15 2.47 108.9 107.6 1.3 100.6 0.192 0.327 0.551 0.523 12.4 23.0 0.149 0.530 14.6 77.5 0.190 FedEx Forum 410 51
11 12.0 Indiana Pacers 26.8 12.0 9.0 12 9 2.71 -0.33 2.38 113.0 110.3 2.7 99.9 0.238 0.381 0.583 0.553 12.6 20.4 0.182 0.533 13.3 76.9 0.194 Bankers Life Fieldhouse 0 0
12 13.0 Houston Rockets 28.4 10.0 9.0 11 8 2.95 -0.97 1.98 109.4 106.5 2.9 102.1 0.255 0.445 0.573 0.541 13.7 19.3 0.193 0.512 13.5 76.8 0.195 Toyota Center 28141 3127
13 14.0 Toronto Raptors 27.2 9.0 12.0 12 9 1.67 -1.33 0.34 111.6 109.9 1.7 100.2 0.238 0.479 0.570 0.532 12.9 22.0 0.195 0.533 14.9 77.4 0.234 Amalie Arena 10989 999
14 15.0 Dallas Mavericks 26.4 8.0 13.0 9 12 -2.00 2.00 0.00 109.6 111.6 -2.0 98.7 0.264 0.411 0.559 0.525 11.3 18.5 0.199 0.530 12.7 76.7 0.216 American Airlines Center 0 0
15 16.0 San Antonio Spurs 26.9 11.0 10.0 10 11 -1.05 0.92 -0.13 110.3 111.3 -1.0 100.3 0.224 0.331 0.550 0.516 10.0 19.9 0.175 0.547 12.5 78.8 0.156 AT&T Center 0 0
16 17.0 Golden State Warriors 26.7 11.0 10.0 10 11 -1.05 0.77 -0.28 108.6 109.6 -1.0 103.2 0.262 0.417 0.563 0.527 12.7 18.4 0.201 0.514 13.5 75.8 0.249 Chase Center 0 0
17 18.0 Charlotte Hornets 24.8 10.0 11.0 10 11 -0.62 -0.41 -1.03 110.2 110.8 -0.6 99.0 0.247 0.414 0.560 0.529 13.1 23.3 0.185 0.544 14.1 75.0 0.166 Spectrum Center 0 0
18 19.0 New York Knicks 24.4 9.0 13.0 10 12 -2.00 0.53 -1.47 107.1 109.2 -2.1 95.4 0.264 0.319 0.538 0.500 12.6 23.8 0.203 0.503 10.7 76.9 0.198 Madison Square Garden (IV) 0 0
19 20.0 Portland Trail Blazers 27.3 11.0 9.0 9 11 -1.65 -0.07 -1.72 115.0 116.6 -1.6 99.8 0.229 0.460 0.567 0.529 10.1 21.5 0.190 0.560 12.2 78.0 0.209 Moda Center 0 0
20 21.0 Chicago Bulls 24.9 8.0 11.0 8 11 -2.26 0.36 -1.90 110.9 113.1 -2.2 103.4 0.246 0.413 0.590 0.556 15.2 20.8 0.196 0.553 12.9 80.0 0.217 United Center 0 0
21 22.0 New Orleans Pelicans 25.1 7.0 12.0 8 11 -2.58 -0.17 -2.75 110.3 112.8 -2.5 99.6 0.284 0.365 0.558 0.526 13.4 25.6 0.203 0.549 12.8 79.9 0.193 Smoothie King Center 8820 980
22 23.0 Detroit Pistons 26.3 5.0 16.0 7 14 -4.67 1.82 -2.85 107.7 112.4 -4.7 98.5 0.273 0.408 0.544 0.501 13.0 22.6 0.215 0.558 14.2 76.6 0.194 Little Caesars Arena 0 0
23 24.0 Cleveland Cavaliers 24.7 10.0 11.0 8 13 -4.19 0.04 -4.15 104.9 109.1 -4.2 97.2 0.254 0.309 0.536 0.505 14.4 25.9 0.181 0.537 14.9 75.3 0.170 Quicken Loans Arena 12564 1142
24 25.0 Miami Heat 26.7 7.0 13.0 7 13 -5.45 0.29 -5.16 106.9 112.3 -5.4 98.9 0.263 0.452 0.581 0.547 15.7 17.0 0.204 0.543 13.3 76.6 0.183 AmericanAirlines Arena 0 0
25 26.0 Sacramento Kings 25.7 9.0 11.0 7 13 -5.80 0.45 -5.35 112.7 118.4 -5.7 100.1 0.283 0.377 0.576 0.546 13.0 23.5 0.203 0.558 11.6 75.8 0.194 Golden 1 Center 0 0
26 27.0 Washington Wizards 26.2 4.0 13.0 6 11 -5.29 -0.85 -6.14 112.1 117.2 -5.1 104.4 0.282 0.374 0.569 0.534 11.6 20.7 0.212 0.565 12.8 78.9 0.251 Capital One Arena 0 0
27 28.0 Oklahoma City Thunder 23.7 8.0 11.0 5 14 -8.26 0.61 -7.66 105.2 113.3 -8.1 101.3 0.243 0.446 0.556 0.527 12.9 15.7 0.176 0.537 10.9 77.7 0.157 Chesapeake Energy Arena 0 0
28 29.0 Orlando Magic 26.2 8.0 14.0 6 16 -6.82 -1.40 -8.22 105.5 112.3 -6.8 98.9 0.220 0.358 0.526 0.490 12.2 24.1 0.174 0.547 12.4 79.7 0.173 Amway Center 35768 3252
29 30.0 Minnesota Timberwolves 23.5 5.0 15.0 5 15 -9.30 0.55 -8.76 104.6 113.7 -9.1 101.1 0.230 0.377 0.530 0.497 12.7 23.3 0.174 0.539 13.3 75.0 0.217 Target Center 0 0
30 NaN League Average 26.3 NaN NaN 10 10 0.00 0.00 0.00 111.1 111.1 NaN 99.8 0.250 0.396 0.568 0.534 12.8 22.2 0.193 0.534 12.8 77.8 0.193 NaN 4050 400
If you look at the source html, you'll see the tables in the comments start with <!--
BeautifulSoup skips over those. Hence, you need the part of the code that specifically finds the comments: comments = soup.find_all(string=lambda text: isinstance(text, Comment)). Once you have all the comments, you can iterate through each one to check whether there's a table in it. If there is, parse it as you normally would with non-commented table tags.
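An alternative sketch that avoids iterating over the comments: strip the comment markers from the raw HTML so the hidden tables become plain markup, then hand the whole page to pandas (same URL and table id as above; one possible approach, not the only one):

import requests
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0'}
url = 'http://www.basketball-reference.com/leagues/NBA_2021.html'
html = requests.get(url, headers=headers).text
html = html.replace('<!--', '').replace('-->', '')  # un-hide the commented-out tables
df = pd.read_html(html, attrs={'id': 'misc_stats'}, header=1)[0]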
I have a dataframe as below:
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
10000 .90 1.10 1.30 1.50 2.10 3.10 5.60 8.40 15.80
15000 1.35 1.65 1.95 2.25 3.15 4.65 8.40 12.60 23.70
20000 1.80 2.20 2.60 3.00 4.20 6.20 11.20 16.80 31.60
25000 2.25 2.75 3.25 3.75 5.25 7.75 14.00 21.00 39.50
30000 2.70 3.30 3.90 4.50 6.30 9.30 16.80 25.20 47.40
35000 3.15 3.85 4.55 5.25 7.35 10.85 19.60 29.40 55.30
40000 3.60 4.40 5.20 6.00 8.40 12.40 22.40 33.60 63.20
45000 4.05 4.95 5.85 6.75 9.45 13.95 25.20 37.80 71.10
50000 4.50 5.50 6.50 7.50 10.50 15.50 28.00 42.00 79.00
10000 .60 .80 1.00 1.20 1.80 2.80 5.30 8.10 15.50
15000 .90 1.20 1.50 1.80 2.70 4.20 7.95 12.15 23.25
20000 1.20 1.60 2.00 2.40 3.60 5.60 10.60 16.20 31.00
25000 1.50 2.00 2.50 3.00 4.50 7.00 13.25 20.25 38.75
30000 1.80 2.40 3.00 3.60 5.40 8.40 15.90 24.30 46.50
35000 2.10 2.80 3.50 4.20 6.30 9.80 18.55 28.35 54.25
40000 2.40 3.20 4.00 4.80 7.20 11.20 21.20 32.40 62.00
45000 2.70 3.60 4.50 5.40 8.10 12.60 23.85 36.45 69.75
50000 3.00 4.00 5.00 6.00 9.00 14.00 26.50 40.50 77.50
1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
4000 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
5000 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98
6000 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17
7000 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37
8000 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56
9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95
Now I would like to split them into 3 dataframes based on the 'Size'
df1: from the first 10000 up to the next occurrence of 10000
df2: from the second 10000 up to 1000
df3: from 1000 to the end
Otherwise, it is fine to have a temporary variable (temp column) in the same dataframe specifying categories like S1, S2 and S3 respectively for the above ranges.
Could anyone guide me how to go about this?
Regards
Assuming that you want to break on the decreases, you could use the compare-cumsum-groupby pattern:
parts = list(df.groupby((df["Size"].diff() < 0).cumsum()))
which gives me (suppressing boring rows in the middle)
>>> for key, group in parts:
... print(key)
... print(group)
... print("----")
...
0
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
0 10000 0.90 1.10 1.30 1.50 2.10 3.10 5.6 8.4 15.8
1 15000 1.35 1.65 1.95 2.25 3.15 4.65 8.4 12.6 23.7
2 20000 1.80 2.20 2.60 3.00 4.20 6.20 11.2 16.8 31.6
[...]
7 45000 4.05 4.95 5.85 6.75 9.45 13.95 25.2 37.8 71.1
8 50000 4.50 5.50 6.50 7.50 10.50 15.50 28.0 42.0 79.0
----
1
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
9 10000 0.6 0.8 1.0 1.2 1.8 2.8 5.30 8.10 15.50
10 15000 0.9 1.2 1.5 1.8 2.7 4.2 7.95 12.15 23.25
11 20000 1.2 1.6 2.0 2.4 3.6 5.6 10.60 16.20 31.00
[...]
16 45000 2.7 3.6 4.5 5.4 8.1 12.6 23.85 36.45 69.75
17 50000 3.0 4.0 5.0 6.0 9.0 14.0 26.50 40.50 77.50
----
2
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
18 1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
19 2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
20 3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
[...]
26 9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
27 10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95
----
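If you would rather have the temporary label column mentioned in the question (S1, S2, S3) instead of separate frames, the same cumulative-sum mask gives it directly; a sketch:

df['group'] = 'S' + ((df['Size'].diff() < 0).cumsum() + 1).astype(str)
list(df.groupby('group'))  # the same three groups as above, now labelled S1..S3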
Not so elegant but this works:
In [259]:
ranges = []
first = df.index[0]
criteria = df.index[df['Size'].diff() < 0]
for idx in criteria:
    ranges.append((first, idx))
    first = idx  # the next range starts where this one ends
ranges
Out[259]:
[(0, 9), (9, 18)]
In [261]:
splits = []
for r in ranges:
    splits.append(df.iloc[r[0]:r[1]])
splits.append(df.iloc[ranges[-1][1]:])  # everything after the last break
splits
Out[261]:
[ Size C1 C2 C3 C4 C5 C6 C7 C8 C9
0 10000 0.90 1.10 1.30 1.50 2.10 3.10 5.6 8.4 15.8
1 15000 1.35 1.65 1.95 2.25 3.15 4.65 8.4 12.6 23.7
2 20000 1.80 2.20 2.60 3.00 4.20 6.20 11.2 16.8 31.6
3 25000 2.25 2.75 3.25 3.75 5.25 7.75 14.0 21.0 39.5
4 30000 2.70 3.30 3.90 4.50 6.30 9.30 16.8 25.2 47.4
5 35000 3.15 3.85 4.55 5.25 7.35 10.85 19.6 29.4 55.3
6 40000 3.60 4.40 5.20 6.00 8.40 12.40 22.4 33.6 63.2
7 45000 4.05 4.95 5.85 6.75 9.45 13.95 25.2 37.8 71.1
8 50000 4.50 5.50 6.50 7.50 10.50 15.50 28.0 42.0 79.0,
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
9 10000 0.6 0.8 1.0 1.2 1.8 2.8 5.30 8.10 15.50
10 15000 0.9 1.2 1.5 1.8 2.7 4.2 7.95 12.15 23.25
11 20000 1.2 1.6 2.0 2.4 3.6 5.6 10.60 16.20 31.00
12 25000 1.5 2.0 2.5 3.0 4.5 7.0 13.25 20.25 38.75
13 30000 1.8 2.4 3.0 3.6 5.4 8.4 15.90 24.30 46.50
14 35000 2.1 2.8 3.5 4.2 6.3 9.8 18.55 28.35 54.25
15 40000 2.4 3.2 4.0 4.8 7.2 11.2 21.20 32.40 62.00
16 45000 2.7 3.6 4.5 5.4 8.1 12.6 23.85 36.45 69.75
17 50000 3.0 4.0 5.0 6.0 9.0 14.0 26.50 40.50 77.50,
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
18 1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
19 2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
20 3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
21 4000 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
22 5000 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98
23 6000 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17
24 7000 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37
25 8000 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56
26 9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
27 10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95]
So first this looks at where the size stops increasing:
df['Size'].diff() < 0
We use this to mask the index, then iterate over the break points to build a list of (start, end) range tuples.
Finally we iterate over these ranges to slice the df.
I've got one dataframe containing several years of data sampled at 30 min intervals (7 parameters from a continuous water quality sensor), and I've got another dataframe containing data at a few hundred random points in time, with one minute precision. I'd like to find the interpolated values of the 7 parameters at the few hundred random points in time.
So here's a few lines of what these dataframes look like:
print df1
Temp SpCond Sal DO_pct DO_mgl Depth pH Turb
2002-07-16 14:00:00 26.0 45.31 29.3 71.6 4.9 0.95 7.9 -5
2002-07-16 14:30:00 25.9 45.22 29.2 70.4 4.9 0.98 7.9 -6
2002-07-16 15:00:00 26.0 44.92 29.0 76.2 5.3 1.02 7.9 -6
2002-07-16 15:30:00 26.0 45.06 29.1 77.9 5.4 1.06 7.9 -5
2002-07-16 16:00:00 25.9 45.23 29.2 67.0 4.6 1.11 7.8 -6
2002-07-16 16:30:00 25.9 45.33 29.3 72.9 5.0 1.17 7.9 -6
2002-07-16 17:00:00 25.9 45.46 29.4 65.8 4.5 1.21 7.9 -6
2002-07-16 17:30:00 25.9 45.40 29.4 70.5 4.9 1.19 7.9 -6
2002-07-16 18:00:00 25.9 45.27 29.3 74.3 5.1 1.15 7.9 -6
2002-07-16 18:30:00 25.8 45.57 29.5 67.6 4.7 1.11 7.8 -6
...
print df2
PO4F NH4F NO2F NO3F NO23F CHLA_N
DateTimeStamp
2002-07-16 14:01:00 0.053 0.073 0.005 0.021 0.026 8.6
2002-07-16 16:05:00 0.029 0.069 0.002 0.016 0.018 9.6
2002-07-16 18:09:00 0.023 0.073 0.000 NaN 0.014 5.8
...
I want to find the values of df1 at the index values of df2, but the only way I can figure out from reading the docs and other stackoverflow answers is by putting df1 on a one minute time base (which will generate a bunch of nans), then filling the nans using Series.interpolate, and then pulling out the one minute values at the discrete times of df2. That seems incredibly wasteful. There must be another way, right?
If you want interpolation, I think you're stuck with the method you describe, or something approximately as "wasteful." If you can settle for taking the most recent value or the next value, use ffill or bfill respectively.
In [34]: df1.reindex(df2.index, method='ffill')
Out[34]:
Temp SpCond Sal DO_pct DO_mgl Depth pH Turb
DateTimeStamp
2002-07-16 14:01:00 26.0 45.31 29.3 71.6 4.9 0.95 7.9 -5
2002-07-16 16:05:00 25.9 45.23 29.2 67.0 4.6 1.11 7.8 -6
2002-07-16 18:09:00 25.9 45.27 29.3 74.3 5.1 1.15 7.9 -6
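A less wasteful variant of the interpolation idea is to reindex df1 onto the union of the two indexes (only a few hundred extra timestamps rather than a full one-minute grid), interpolate, and then pull out df2's times. A sketch, assuming both frames have a DatetimeIndex:

combined = df1.reindex(df1.index.union(df2.index))
interpolated = combined.interpolate(method='time').loc[df2.index]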
Here's a way to do what I think you want
Starting frame df1 and df2
In [100]: df1
Out[100]:
Temp SpCond Sal DO_pct DO_mgl Depth pH Turb
time
2002-07-16 14:00:00 26.0 45.31 29.3 71.6 4.9 0.95 7.9 -5
2002-07-16 14:30:00 25.9 45.22 29.2 70.4 4.9 0.98 7.9 -6
2002-07-16 15:00:00 26.0 44.92 29.0 76.2 5.3 1.02 7.9 -6
2002-07-16 15:30:00 26.0 45.06 29.1 77.9 5.4 1.06 7.9 -5
2002-07-16 16:00:00 25.9 45.23 29.2 67.0 4.6 1.11 7.8 -6
2002-07-16 16:30:00 25.9 45.33 29.3 72.9 5.0 1.17 7.9 -6
2002-07-16 17:00:00 25.9 45.46 29.4 65.8 4.5 1.21 7.9 -6
2002-07-16 17:30:00 25.9 45.40 29.4 70.5 4.9 1.19 7.9 -6
2002-07-16 18:00:00 25.9 45.27 29.3 74.3 5.1 1.15 7.9 -6
2002-07-16 18:30:00 25.8 45.57 29.5 67.6 4.7 1.11 7.8 -6
In [101]: df2
Out[101]:
P04F NH4F N02F N03F NO23F CHLA_N
time
2002-07-16 14:01:00 0.053 0.073 0.005 0.021 0.026 8.6
2002-07-16 16:05:00 0.029 0.069 0.002 0.016 0.018 9.6
2002-07-16 18:09:00 0.023 0.073 0.000 NaN 0.014 5.8
Calculate a rounded time: convert the times to integers in nanoseconds, then round to the nearest 30*60 seconds. You may have to adjust if you want to round strictly up or down (to the next 1/2 hour).
In [102]: new_index = pd.DatetimeIndex(int(1e9*30*60)*(np.round(df2.index.asi8/(1e9*30*60))).astype(np.int64)).values
In [104]: new_index
Out[104]:
array(['2002-07-16T10:00:00.000000000-0400',
'2002-07-16T12:00:00.000000000-0400',
'2002-07-16T14:00:00.000000000-0400'], dtype='datetime64[ns]')
Copying just to avoid modifying the original frame. Set the new index
In [105]: df3 = df2.copy()
In [106]: df3.index = new_index
Subselect and join
In [107]: df1.loc[df3.index].join(df3)
Out[107]:
Temp SpCond Sal DO_pct DO_mgl Depth pH Turb P04F NH4F N02F N03F NO23F CHLA_N
2002-07-16 14:00:00 26.0 45.31 29.3 71.6 4.9 0.95 7.9 -5 0.053 0.073 0.005 0.021 0.026 8.6
2002-07-16 16:00:00 25.9 45.23 29.2 67.0 4.6 1.11 7.8 -6 0.029 0.069 0.002 0.016 0.018 9.6
2002-07-16 18:00:00 25.9 45.27 29.3 74.3 5.1 1.15 7.9 -6 0.023 0.073 0.000 NaN 0.014 5.8
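In more recent pandas, the rounding step can be replaced by a nearest-key join via merge_asof; a sketch, assuming both indexes are sorted DatetimeIndexes:

result = pd.merge_asof(df2, df1, left_index=True, right_index=True, direction='nearest')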