Python: Parse multiple tables from webpage and group data in CSV

I'm a total newbie at Python and have what I think is a pretty complex problem. I'd like to parse two tables from a website for about 80 URLs, example of one of the pages: https://www.sports-reference.com/cfb/players/sam-darnold-1.html
I need the first table, "Passing", and the second table, "Rushing and Receiving", from each of the 80 URLs (I know how to get the first and second tables). The problem is that I need the results for all 80 URLs in one CSV.
This is my code so far and how the data looks:
import requests
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ['School', 'Conf', 'Class', 'Pos', 'G', 'Cmp', 'Att', 'Pct', 'Yds', 'Y/A', 'AY/A', 'TD', 'Int', 'Rate']
urls = ['https://www.sports-reference.com/cfb/players/russell-wilson-1.html',
        'https://www.sports-reference.com/cfb/players/cam-newton-1.html',
        'https://www.sports-reference.com/cfb/players/peyton-manning-1.html']

# scrape elements
dataframes = []
try:
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        table = soup.find_all('table')[0]  # find the first "table" tag in the page ("Passing")
        rows = table.find_all("tr")
        cy_data = []
        for row in rows:
            cells = row.find_all("td")[0:14]
            cy_data.append([cell.text for cell in cells])  # for each "td" tag, get the text inside it
        dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0))
except:
    pass  # note: a bare except silently hides any scraping error

data = pd.concat(dataframes)
data.to_csv('testcsv3.csv', sep=',')
+---+--+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+
| | | School | Conf | Class | Pos | G | Cmp | Att | Pct | Yds | Y/A | AY/A | TD | Int | Rate |
+---+--+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+
| 1 | | | | | | | | | | | | | | | |
| 2 | | North Carolina State | ACC | FR | QB | 11 | 150 | 275 | 54.5 | 1955 | 7.1 | 8.2 | 17 | 1 | 133.9 |
| 3 | | North Carolina State | ACC | SO | QB | 12 | 224 | 378 | 59.3 | 3027 | 8 | 8.3 | 31 | 11 | 147.8 |
| 4 | | North Carolina State | ACC | JR | QB | 13 | 308 | 527 | 58.4 | 3563 | 6.8 | 6.6 | 28 | 14 | 127.5 |
| 5 | | Wisconsin | Big Ten | SR | QB | 14 | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 |
| 6 | | Overall | | | | | 907 | 1489 | 60.9 | 11720 | 7.9 | 8.4 | 109 | 30 | 147.2 |
| 7 | | North Carolina State | | | | | 682 | 1180 | 57.8 | 8545 | 7.2 | 7.5 | 76 | 26 | 135.5 |
| 8 | | Wisconsin | | | | | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 |
| 1 | | | | | | | | | | | | | | | |
| 2 | | Florida | SEC | FR | QB | 5 | 5 | 10 | 50 | 40 | 4 | 4 | 0 | 0 | 83.6 |
| 3 | | Florida | SEC | SO | QB | 1 | 1 | 2 | 50 | 14 | 7 | 7 | 0 | 0 | 108.8 |
| 4 | | Auburn | SEC | JR | QB | 14 | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 |
| 5 | | Overall | | | | | 191 | 292 | 65.4 | 2908 | 10 | 10.9 | 30 | 7 | 178.2 |
| 6 | | Florida | | | | | 6 | 12 | 50 | 54 | 4.5 | 4.5 | 0 | 0 | 87.8 |
| 7 | | Auburn | | | | | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 |
+---+--+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+
And this is how I'd like the data to look. Note that the player name, which can ideally be derived from the URL, is missing from each grouping, and I've appended the columns I need from the second table:
+---+----------------+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+----------------------+---------+-------+-----+----+-----+-----+-----+----+
| | | School | Conf | Class | Pos | G | Cmp | Att | Pct | Yds | Y/A | AY/A | TD | Int | Rate | School | Conf | Class | Pos | G | Att | Yds | Avg | TD |
+---+----------------+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+----------------------+---------+-------+-----+----+-----+-----+-----+----+
| 1 | | | | | | | | | | | | | | | | | | | | | | | | |
| 2 | Russell Wilson | North Carolina State | ACC | FR | QB | 11 | 150 | 275 | 54.5 | 1955 | 7.1 | 8.2 | 17 | 1 | 133.9 | North Carolina State | ACC | FR | QB | 11 | 150 | 467 | 6.7 | 3 |
| 3 | Russell Wilson | North Carolina State | ACC | SO | QB | 12 | 224 | 378 | 59.3 | 3027 | 8 | 8.3 | 31 | 11 | 147.8 | North Carolina State | ACC | SO | QB | 12 | 129 | 300 | 6.8 | 2 |
| 4 | Russell Wilson | North Carolina State | ACC | JR | QB | 13 | 308 | 527 | 58.4 | 3563 | 6.8 | 6.6 | 28 | 14 | 127.5 | North Carolina State | ACC | JR | QB | 13 | 190 | 560 | 7.1 | 5 |
| 5 | Russell Wilson | Wisconsin | Big Ten | SR | QB | 14 | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 | Wisconsin | Big Ten | SR | QB | 14 | 210 | 671 | 7.3 | 7 |
| 6 | Russell Wilson | Overall | | | | | 907 | 1489 | 60.9 | 11720 | 7.9 | 8.4 | 109 | 30 | 147.2 | Overall | | | | | | | | |
| 7 | Russell Wilson | North Carolina State | | | | | 682 | 1180 | 57.8 | 8545 | 7.2 | 7.5 | 76 | 26 | 135.5 | North Carolina State | | | | | | | | |
| 8 | Russell Wilson | Wisconsin | | | | | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 | Wisconsin | | | | | | | | |
| 1 | | | | | | | | | | | | | | | | | | | | | | | | |
| 2 | Cam Newton | Florida | SEC | FR | QB | 5 | 5 | 10 | 50 | 40 | 4 | 4 | 0 | 0 | 83.6 | Florida | SEC | FR | QB | 5 | 210 | 456 | 7.1 | 2 |
| 3 | Cam Newton | Florida | SEC | SO | QB | 1 | 1 | 2 | 50 | 14 | 7 | 7 | 0 | 0 | 108.8 | Florida | SEC | SO | QB | 1 | 212 | 478 | 4.5 | 5 |
| 4 | Cam Newton | Auburn | SEC | JR | QB | 14 | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 | Auburn | SEC | JR | QB | 14 | 219 | 481 | 6.7 | 6 |
| 5 | Cam Newton | Overall | | | | | 191 | 292 | 65.4 | 2908 | 10 | 10.9 | 30 | 7 | 178.2 | Overall | | | | | | | 3.4 | 7 |
| 6 | Cam Newton | Florida | | | | | 6 | 12 | 50 | 54 | 4.5 | 4.5 | 0 | 0 | 87.8 | Florida | | | | | | | | |
| 7 | Cam Newton | Auburn | | | | | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 | Auburn | | | | | | | | |
+---+----------------+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+----------------------+---------+-------+-----+----+-----+-----+-----+----+
So basically I want to append the second table (only the columns shown) to the end of the first table and add the player name (read from the URL) to each row:

import requests
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ['School', 'Conf', 'Class', 'Pos', 'G', 'Cmp', 'Att', 'Pct', 'Yds', 'Y/A', 'AY/A', 'TD', 'Int', 'Rate']
COLUMNS2 = ['School', 'Conf', 'Class', 'Pos', 'G', 'Att', 'Yds', 'Avg', 'TD', 'Rec', 'Yds', 'Avg', 'TD', 'Plays', 'Yds', 'Avg', 'TD']
urls = ['https://www.sports-reference.com/cfb/players/russell-wilson-1.html',
        'https://www.sports-reference.com/cfb/players/cam-newton-1.html',
        'https://www.sports-reference.com/cfb/players/peyton-manning-1.html']

# scrape elements
dataframes = []
dataframes2 = []
for url in urls:
    print(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    table = soup.find_all('table')[0]  # the first "table" tag in the page ("Passing")
    rows = table.find_all("tr")
    cy_data = []
    for row in rows:
        cells = row.find_all("td")[0:len(COLUMNS)]  # keep as many cells as there are column names
        cy_data.append([cell.text for cell in cells])
    cy_data = pd.DataFrame(cy_data, columns=COLUMNS)
    # create a Player column up front and derive the player name from the URL slug
    cy_data.insert(0, 'Player', url)
    cy_data['Player'] = (cy_data['Player'].str.split('/').str[5].str.split('-').str[0].str.title()
                         + ' ' + cy_data['Player'].str.split('/').str[5].str.split('-').str[1].str.title())
    dataframes.append(cy_data)

    table2 = soup.find_all('table')[1]  # the second "table" tag in the page ("Rushing and Receiving")
    rows2 = table2.find_all("tr")
    cy_data2 = []
    for row2 in rows2:
        cells2 = row2.find_all("td")[0:len(COLUMNS2)]  # COLUMNS2 has 17 names, so take 17 cells
        cy_data2.append([cell.text for cell in cells2])
    cy_data2 = pd.DataFrame(cy_data2, columns=COLUMNS2)
    cy_data2.insert(0, 'Player', url)
    cy_data2['Player'] = (cy_data2['Player'].str.split('/').str[5].str.split('-').str[0].str.title()
                          + ' ' + cy_data2['Player'].str.split('/').str[5].str.split('-').str[1].str.title())
    dataframes2.append(cy_data2)

data = pd.concat(dataframes).reset_index()
data2 = pd.concat(dataframes2).reset_index()  # concatenate the second-table frames, not the first again
data3 = data.merge(data2, on=['index', 'Player'], suffixes=('', ' '))
# filter out None rows
data3 = data3.loc[data3['School'].notnull()].drop('index', axis=1)
display(data, data2, data3)  # display() is available in Jupyter/IPython
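For what it's worth, pd.read_html can parse every table on the page in one call and skip the manual cell handling. The following is only a sketch under a couple of assumptions: that the two tables of interest are the first two frames returned (they sit in the static HTML on these player pages, while some other sports-reference tables are embedded in HTML comments and would be missed), and that the "Rushing and Receiving" table comes back with a grouped two-level header:

import pandas as pd

urls = ['https://www.sports-reference.com/cfb/players/russell-wilson-1.html',
        'https://www.sports-reference.com/cfb/players/cam-newton-1.html']

combined = []
for url in urls:
    # derive "Russell Wilson" from the trailing ".../russell-wilson-1.html" slug
    slug = url.rsplit('/', 1)[-1].replace('.html', '')
    player = ' '.join(slug.split('-')[:-1]).title()

    tables = pd.read_html(url)                # one DataFrame per <table> in the static HTML
    passing, rushing = tables[0], tables[1]   # assumed order: "Passing", "Rushing and Receiving"
    if isinstance(rushing.columns, pd.MultiIndex):
        rushing.columns = rushing.columns.droplevel(0)  # drop the grouped header level

    merged = pd.concat([passing, rushing], axis=1)      # season rows side by side
    merged.insert(0, 'Player', player)
    combined.append(merged)

pd.concat(combined).to_csv('all_players.csv', index=False)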

Related

Running Hyperopt in Freqtrade and getting crazy results

I ran hyperopt for 5000 iterations and got the following results:
2022-01-10 19:38:31,370 - freqtrade.optimize.hyperopt - INFO - Best result:
1101 trades. Avg profit 0.23%. Total profit 25.48064438 BTC (254.5519Σ%). Avg duration 888.1 mins.
with values:
{ 'roi_p1': 0.011364434095803464,
  'roi_p2': 0.04123147845715937,
  'roi_p3': 0.10554480985209454,
  'roi_t1': 105,
  'roi_t2': 47,
  'roi_t3': 30,
  'rsi-enabled': True,
  'rsi-value': 9,
  'sell-rsi-enabled': True,
  'sell-rsi-value': 94,
  'sell-trigger': 'sell-bb_middle1',
  'stoploss': -0.42267640639979365,
  'trigger': 'bb_lower2'}
2022-01-10 19:38:31,371 - freqtrade.optimize.hyperopt - INFO - ROI table:
{ 0: 0.15814072240505736,
  30: 0.05259591255296283,
  77: 0.011364434095803464,
  182: 0}
Result for strategy BBRSI
================================================== BACKTESTING REPORT =================================================
| pair | buy count | avg profit % | cum profit % | total profit BTC | avg duration | profit | loss |
|:----------|------------:|---------------:|---------------:|-------------------:|:----------------|---------:|-------:|
| ETH/BTC | 11 | -1.30 | -14.26 | -1.42732928 | 3 days, 4:55:00 | 0 | 1 |
| LUNA/BTC | 17 | 0.60 | 10.22 | 1.02279906 | 15:46:00 | 9 | 0 |
| SAND/BTC | 37 | 0.30 | 11.24 | 1.12513532 | 6:16:00 | 14 | 1 |
| MATIC/BTC | 24 | 0.47 | 11.35 | 1.13644340 | 12:20:00 | 10 | 0 |
| ADA/BTC | 24 | 0.24 | 5.68 | 0.56822170 | 21:05:00 | 5 | 0 |
| BNB/BTC | 11 | -1.09 | -11.96 | -1.19716109 | 3 days, 0:44:00 | 2 | 1 |
| XRP/BTC | 20 | -0.39 | -7.71 | -0.77191523 | 1 day, 5:48:00 | 1 | 1 |
| DOT/BTC | 9 | 0.50 | 4.54 | 0.45457736 | 4 days, 1:13:00 | 4 | 0 |
| SOL/BTC | 19 | -0.38 | -7.16 | -0.71688463 | 22:47:00 | 3 | 1 |
| MANA/BTC | 29 | 0.38 | 11.16 | 1.11753320 | 10:25:00 | 9 | 1 |
| AVAX/BTC | 27 | 0.30 | 8.15 | 0.81561432 | 16:36:00 | 11 | 1 |
| GALA/BTC | 26 | -0.52 | -13.45 | -1.34594702 | 15:48:00 | 9 | 1 |
| LINK/BTC | 21 | 0.27 | 5.68 | 0.56822170 | 1 day, 0:06:00 | 5 | 0 |
| TOTAL | 275 | 0.05 | 13.48 | 1.34930881 | 23:42:00 | 82 | 8 |
================================================== SELL REASON STATS ==================================================
| Sell Reason | Count |
|:--------------|--------:|
| roi | 267 |
| force_sell | 8 |
=============================================== LEFT OPEN TRADES REPORT ===============================================
| pair | buy count | avg profit % | cum profit % | total profit BTC | avg duration | profit | loss |
|:---------|------------:|---------------:|---------------:|-------------------:|:------------------|---------:|-------:|
| ETH/BTC | 1 | -14.26 | -14.26 | -1.42732928 | 32 days, 4:00:00 | 0 | 1 |
| SAND/BTC | 1 | -4.65 | -4.65 | -0.46588544 | 17:00:00 | 0 | 1 |
| BNB/BTC | 1 | -14.23 | -14.23 | -1.42444977 | 31 days, 13:00:00 | 0 | 1 |
| XRP/BTC | 1 | -8.85 | -8.85 | -0.88555957 | 18 days, 4:00:00 | 0 | 1 |
| SOL/BTC | 1 | -10.57 | -10.57 | -1.05781765 | 5 days, 14:00:00 | 0 | 1 |
| MANA/BTC | 1 | -3.17 | -3.17 | -0.31758065 | 17:00:00 | 0 | 1 |
| AVAX/BTC | 1 | -12.58 | -12.58 | -1.25910300 | 7 days, 9:00:00 | 0 | 1 |
| GALA/BTC | 1 | -23.66 | -23.66 | -2.36874608 | 7 days, 12:00:00 | 0 | 1 |
| TOTAL | 8 | -11.50 | -91.97 | -9.20647144 | 12 days, 23:15:00 | 0 | 8 |
I have followed the tutorial accurately and don't know what I am doing wrong here.

How to get the column values of a Dataframe into another dataframe as a new column after matching the values in columns that both dataframes have?

I'm trying to create a new column in a DataFrame and storing it with values stored in a different dataframe by first comparing the values of columns that both dataframes have. For example:
df1 >>>
| name | team | week | dates | interceptions | pass_yds | rating |
| ---- | ---- | -----| ---------- | ------------- | --------- | -------- |
| maho | KC | 1 | 2020-09-10 | 0 | 300 | 105 |
| went | PHI | 1 | 2020-09-13 | 2 | 225 | 74 |
| lock | DEN | 1 | 2020-09-14 | 0 | 150 | 89 |
| dris | DEN | 2 | 2020-09-20 | 1 | 220 | 95 |
| went | PHI | 2 | 2020-09-20 | 2 | 250 | 64 |
| maho | KC | 2 | 2020-09-21 | 1 | 245 | 101 |
df2 >>>
| name | team | week | catches | rec_yds | rec_tds |
| ---- | ---- | -----| ------- | ------- | ------- |
| ertz | PHI | 1 | 5 | 58 | 1 |
| fant | DEN | 2 | 6 | 79 | 0 |
| kelc | KC | 2 | 8 | 105 | 1 |
| fant | DEN | 1 | 3 | 29 | 0 |
| kelc | KC | 1 | 6 | 71 | 1 |
| ertz | PHI | 2 | 7 | 91 | 2 |
| goed | PHI | 2 | 2 | 15 | 0 |
I want to create a dates column in df2 with the values of the dates stored in the dates column in df1 after matching the teams and the weeks columns. After the matching, df2 in this example should look something like this:
df2 >>>
| name | team | week | catches | rec_yds | rec_tds | dates |
| ---- | ---- | -----| ------- | ------- | ------- | ---------- |
| ertz | PHI | 1 | 5 | 58 | 1 | 2020-09-13 |
| fant | DEN | 2 | 6 | 79 | 0 | 2020-09-20 |
| kelc | KC | 2 | 8 | 105 | 1 | 2020-09-20 |
| fant | DEN | 1 | 3 | 29 | 0 | 2020-09-14 |
| kelc | KC | 1 | 6 | 71 | 1 | 2020-09-10 |
| ertz | PHI | 2 | 7 | 91 | 2 | 2020-09-20 |
| goed | PHI | 2 | 2 | 15 | 0 | 2020-09-20 |
I'm looking for an optimal solution. I've already tried nested for loops and comparing the week and team columns from both dataframes together but that hasn't worked. At this point I'm all out of ideas. Please help!
Disclaimer: The actual DataFrames I'm working with are a lot larger. They have a lot more rows, columns, and values (i.e. a lot more teams in the team columns, a lot more dates in the dates columns, and a lot more weeks in the week columns)
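A minimal sketch of the standard approach, reconstructed from the example rows above: build a (team, week) → dates lookup from df1, deduplicate it so the merge cannot multiply df2's rows, and left-merge it into df2:

import pandas as pd

# a few of the example rows from above
df1 = pd.DataFrame({'name': ['maho', 'went', 'lock'],
                    'team': ['KC', 'PHI', 'DEN'],
                    'week': [1, 1, 1],
                    'dates': ['2020-09-10', '2020-09-13', '2020-09-14']})
df2 = pd.DataFrame({'name': ['ertz', 'fant', 'kelc'],
                    'team': ['PHI', 'DEN', 'KC'],
                    'week': [1, 1, 1],
                    'catches': [5, 3, 6]})

# each (team, week) pair maps to one date, so deduplicate before merging
lookup = df1[['team', 'week', 'dates']].drop_duplicates()
df2 = df2.merge(lookup, on=['team', 'week'], how='left')
print(df2)

A merge on the shared keys is vectorized, so it stays fast on large frames where the nested loops mentioned above become the bottleneck.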

Pandas cant print the list of objects collected from web using xpath in Jupyter notebook

This is the code I used. I am using the Jupyter Notebook web version. I upgraded lxml, and my Python version is 3.8.
import numpy as np
import requests
from lxml import html
import csv
import pandas as pd
# getting the web content
r = requests.get('http://www.pro-football-reference.com/years/2017/draft.htm')
data = html.fromstring(r.text)
# collecting specific data
pick = data.xpath('//td[#data_stat="draft_pick"]//text()')
player = data.xpath('//td[#data_stat="player"]//text()')
position = data.xpath('//td[#data_stat="pos"]//text()')
age= data.xpath('//td[#data_stat="age"]//text()')
games_played = data.xpath('//td[#data_stat="g"]//text()')
cmp = data.xpath('//td[#data_stat="pass_cmp"]//text()')
att = data.xpath('//td[#data_stat="pass_att"]//text()')
college = data.xpath('//td[#data_stat="college_id"]//text()')
data = list(zip(pick,player,position,age,games_played,cmp,att,college))
df = pd.DataFrame(data)
df
These are the two errors showing in the two separate notebooks I tried:
<module 'pandas' from 'C:\Users\anaconda3\lib\site-packages\pandas\__init__.py'>
AttributeError: 'list' object has no attribute 'xpath'
The code is not giving me the list of data I wanted from the webpage. Can anyone help me out with this? Thank you in advance.
You can load html tables directly into a dataframe using read_html:
import pandas as pd
df = pd.read_html('http://www.pro-football-reference.com/years/2017/draft.htm')[0]
df.columns = df.columns.droplevel(0) # drop top header row
df = df[df['Rnd'].ne('Rnd')] # remove mid-table header rows
Output:
| | Rnd | Pick | Tm | Player | Pos | Age | To | AP1 | PB | St | CarAV | DrAV | G | Cmp | Att | Yds | TD | Int | Att | Yds | TD | Rec | Yds | TD | Solo | Int | Sk | College/Univ | Unnamed: 28_level_1 |
|---:|------:|-------:|:-----|:------------------|:------|------:|-----:|------:|-----:|-----:|--------:|-------:|----:|------:|------:|------:|-----:|------:|------:|------:|-----:|------:|------:|-----:|-------:|------:|------:|:---------------|:----------------------|
| 0 | 1 | 1 | CLE | Myles Garrett | DE | 21 | 2020 | 1 | 2 | 4 | 35 | 35 | 51 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 107 | nan | 42.5 | Texas A&M | College Stats |
| 1 | 1 | 2 | CHI | Mitchell Trubisky | QB | 23 | 2020 | 0 | 1 | 3 | 33 | 33 | 51 | 1010 | 1577 | 10609 | 64 | 37 | 190 | 1057 | 8 | 0 | 0 | 0 | nan | nan | nan | North Carolina | College Stats |
| 2 | 1 | 3 | SFO | Solomon Thomas | DE | 22 | 2020 | 0 | 0 | 2 | 15 | 15 | 48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 73 | nan | 6 | Stanford | College Stats |
| 3 | 1 | 4 | JAX | Leonard Fournette | RB | 22 | 2020 | 0 | 0 | 3 | 25 | 20 | 49 | 0 | 0 | 0 | 0 | 0 | 763 | 2998 | 23 | 170 | 1242 | 2 | nan | nan | nan | LSU | College Stats |
| 4 | 1 | 5 | TEN | Corey Davis | WR | 22 | 2020 | 0 | 0 | 4 | 25 | 25 | 56 | 0 | 0 | 0 | 0 | 0 | 6 | 55 | 0 | 207 | 2851 | 11 | nan | nan | nan | West. Michigan | College Stats |
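For reference, the lxml approach from the question can also be made to work with two fixes: XPath attribute tests use @ and the attribute on this site is spelled with a hyphen (@data-stat, not #data_stat), and the parsed tree should keep its own name, since reassigning data = list(zip(...)) and re-running the cell is what triggers the AttributeError: 'list' object has no attribute 'xpath'. A hedged sketch:

import requests
from lxml import html

r = requests.get('http://www.pro-football-reference.com/years/2017/draft.htm')
tree = html.fromstring(r.text)  # keep the tree separate from the zipped row data

# cells in these tables carry a hyphenated data-stat attribute
pick = tree.xpath('//td[@data-stat="draft_pick"]/text()')
player = tree.xpath('//td[@data-stat="player"]//text()')

Even then, zipping separately extracted columns misaligns rows whenever a cell is empty, which is why read_html is the more robust route here.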

Binning Pandas value_counts

I have a Pandas Series produced by df.column.value_counts().sort_index().
| N Months | Count |
|------|------|
| 0 | 15 |
| 1 | 9 |
| 2 | 78 |
| 3 | 151 |
| 4 | 412 |
| 5 | 181 |
| 6 | 543 |
| 7 | 175 |
| 8 | 409 |
| 9 | 594 |
| 10 | 137 |
| 11 | 202 |
| 12 | 170 |
| 13 | 446 |
| 14 | 29 |
| 15 | 39 |
| 16 | 44 |
| 17 | 253 |
| 18 | 17 |
| 19 | 34 |
| 20 | 18 |
| 21 | 37 |
| 22 | 147 |
| 23 | 12 |
| 24 | 31 |
| 25 | 15 |
| 26 | 117 |
| 27 | 8 |
| 28 | 38 |
| 29 | 23 |
| 30 | 198 |
| 31 | 29 |
| 32 | 122 |
| 33 | 50 |
| 34 | 60 |
| 35 | 357 |
| 36 | 329 |
| 37 | 457 |
| 38 | 609 |
| 39 | 4744 |
| 40 | 1120 |
| 41 | 591 |
| 42 | 328 |
| 43 | 148 |
| 44 | 46 |
| 45 | 10 |
| 46 | 1 |
| 47 | 1 |
| 48 | 7 |
| 50 | 2 |
my desired output is
| bin | Total |
|-------|--------|
| 0-13 | 3522 |
| 14-26 | 793 |
| 27-50 | 9278 |
I tried df.column.value_counts(bins=3).sort_index() but got
| bin | Total |
|---------------------------------|-------|
| (-0.051000000000000004, 16.667] | 3634 |
| (16.667, 33.333] | 1149 |
| (33.333, 50.0] | 8810 |
I can get the correct result with
a = df.column.value_counts().sort_index()[:14].sum()
b = df.column.value_counts().sort_index()[14:27].sum()
c = df.column.value_counts().sort_index()[27:].sum()
print(a, b, c)
Output: 3522 793 9278
But I am wondering if there is a pandas method that can do what I want. Any advice is very welcome. :-)
You can use pd.cut:
pd.cut(df['N Months'], [0,13, 26, 50], include_lowest=True).value_counts()
Update: you should be able to pass custom bins to value_counts directly:
df['N Months'].value_counts(bins = [0,13, 26, 50])
Output:
N Months
(-0.001, 13.0] 3522
(13.0, 26.0] 793
(26.0, 50.0] 9278
Name: Count, dtype: int64
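To get the exact row labels from the desired output instead of interval notation, the same pd.cut call accepts a labels argument; a small sketch, assuming the same df['N Months'] column as above:

import pandas as pd

binned = pd.cut(df['N Months'], bins=[0, 13, 26, 50],
                labels=['0-13', '14-26', '27-50'], include_lowest=True)
print(binned.value_counts().sort_index())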

Numpy version of finding the highest and lowest value locations within an interval of another column?

Given the following numpy array, how can I find the locations of the highest and lowest values of column 0 within each interval marked in column 1, using numpy?
import numpy as np
data = np.array([
[1879.289,np.nan],[1879.281,np.nan],[1879.292,1],[1879.295,1],[1879.481,1],[1879.294,1],[1879.268,1],
[1879.293,1],[1879.277,1],[1879.285,1],[1879.464,1],[1879.475,1],[1879.971,1],[1879.779,1],
[1879.986,1],[1880.791,1],[1880.29,1],[1879.253,np.nan],[1878.268,np.nan],[1875.73,1],[1876.792,1],
[1875.977,1],[1876.408,1],[1877.159,1],[1877.187,1],[1883.164,1],[1883.171,1],[1883.495,1],
[1883.962,1],[1885.158,1],[1885.974,1],[1886.479,np.nan],[1885.969,np.nan],[1884.693,1],[1884.977,1],
[1884.967,1],[1884.691,1],[1886.171,1],[1886.166,np.nan],[1884.476,np.nan],[1884.66,1],[1882.962,1],
[1881.496,1],[1871.163,1],[1874.985,1],[1874.979,1],[1871.173,np.nan],[1871.973,np.nan],[1871.682,np.nan],
[1872.476,np.nan],[1882.361,1],[1880.869,1],[1882.165,1],[1881.857,1],[1880.375,1],[1880.66,1],
[1880.891,1],[1880.377,1],[1881.663,1],[1881.66,1],[1877.888,1],[1875.69,1],[1875.161,1],
[1876.697,np.nan],[1876.671,np.nan],[1879.666,np.nan],[1877.182,np.nan],[1878.898,1],[1878.668,1],[1878.871,1],
[1878.882,1],[1879.173,1],[1878.887,1],[1878.68,1],[1878.872,1],[1878.677,1],[1877.877,1],
[1877.669,1],[1877.69,1],[1877.684,1],[1877.68,1],[1877.885,1],[1877.863,1],[1877.674,1],
[1877.676,1],[1877.687,1],[1878.367,1],[1878.179,1],[1877.696,1],[1877.665,1],[1877.667,np.nan],
[1878.678,np.nan],[1878.661,1],[1878.171,1],[1877.371,1],[1877.359,1],[1878.381,1],[1875.185,1],
[1875.367,np.nan],[1865.492,np.nan],[1865.495,1],[1866.995,1],[1866.672,1],[1867.465,1],[1867.663,1],
[1867.186,1],[1867.687,1],[1867.459,1],[1867.168,1],[1869.689,1],[1869.693,1],[1871.676,1],
[1873.174,1],[1873.691,np.nan],[1873.685,np.nan]
])
In the third column below you can see where the max and min are for each interval.
+-------+----------+-----------+---------+
| index | Value | Intervals | Min/Max |
+-------+----------+-----------+---------+
| 0 | 1879.289 | np.nan | |
| 1 | 1879.281 | np.nan | |
| 2 | 1879.292 | 1 | |
| 3 | 1879.295 | 1 | |
| 4 | 1879.481 | 1 | |
| 5 | 1879.294 | 1 | |
| 6 | 1879.268 | 1 | -1 | min
| 7 | 1879.293 | 1 | |
| 8 | 1879.277 | 1 | |
| 9 | 1879.285 | 1 | |
| 10 | 1879.464 | 1 | |
| 11 | 1879.475 | 1 | |
| 12 | 1879.971 | 1 | |
| 13 | 1879.779 | 1 | |
| 17 | 1879.986 | 1 | |
| 18 | 1880.791 | 1 | 1 | max
| 19 | 1880.29 | 1 | |
| 55 | 1879.253 | np.nan | |
| 56 | 1878.268 | np.nan | |
| 57 | 1875.73 | 1 | -1 |min
| 58 | 1876.792 | 1 | |
| 59 | 1875.977 | 1 | |
| 60 | 1876.408 | 1 | |
| 61 | 1877.159 | 1 | |
| 62 | 1877.187 | 1 | |
| 63 | 1883.164 | 1 | |
| 64 | 1883.171 | 1 | |
| 65 | 1883.495 | 1 | |
| 66 | 1883.962 | 1 | |
| 67 | 1885.158 | 1 | |
| 68 | 1885.974 | 1 | 1 | max
| 69 | 1886.479 | np.nan | |
| 70 | 1885.969 | np.nan | |
| 71 | 1884.693 | 1 | |
| 72 | 1884.977 | 1 | |
| 73 | 1884.967 | 1 | |
| 74 | 1884.691 | 1 | -1 | min
| 75 | 1886.171 | 1 | 1 | max
| 76 | 1886.166 | np.nan | |
| 77 | 1884.476 | np.nan | |
| 78 | 1884.66 | 1 | 1 | max
| 79 | 1882.962 | 1 | |
| 80 | 1881.496 | 1 | |
| 81 | 1871.163 | 1 | -1 | min
| 82 | 1874.985 | 1 | |
| 83 | 1874.979 | 1 | |
| 84 | 1871.173 | np.nan | |
| 85 | 1871.973 | np.nan | |
| 86 | 1871.682 | np.nan | |
| 87 | 1872.476 | np.nan | |
| 88 | 1882.361 | 1 | 1 | max
| 89 | 1880.869 | 1 | |
| 90 | 1882.165 | 1 | |
| 91 | 1881.857 | 1 | |
| 92 | 1880.375 | 1 | |
| 93 | 1880.66 | 1 | |
| 94 | 1880.891 | 1 | |
| 95 | 1880.377 | 1 | |
| 96 | 1881.663 | 1 | |
| 97 | 1881.66 | 1 | |
| 98 | 1877.888 | 1 | |
| 99 | 1875.69 | 1 | |
| 100 | 1875.161 | 1 | -1 | min
| 101 | 1876.697 | np.nan | |
| 102 | 1876.671 | np.nan | |
| 103 | 1879.666 | np.nan | |
| 111 | 1877.182 | np.nan | |
| 112 | 1878.898 | 1 | |
| 113 | 1878.668 | 1 | |
| 114 | 1878.871 | 1 | |
| 115 | 1878.882 | 1 | |
| 116 | 1879.173 | 1 | 1 | max
| 117 | 1878.887 | 1 | |
| 118 | 1878.68 | 1 | |
| 119 | 1878.872 | 1 | |
| 120 | 1878.677 | 1 | |
| 121 | 1877.877 | 1 | |
| 122 | 1877.669 | 1 | |
| 123 | 1877.69 | 1 | |
| 124 | 1877.684 | 1 | |
| 125 | 1877.68 | 1 | |
| 126 | 1877.885 | 1 | |
| 127 | 1877.863 | 1 | |
| 128 | 1877.674 | 1 | |
| 129 | 1877.676 | 1 | |
| 130 | 1877.687 | 1 | |
| 131 | 1878.367 | 1 | |
| 132 | 1878.179 | 1 | |
| 133 | 1877.696 | 1 | |
| 134 | 1877.665 | 1 | -1 | min
| 135 | 1877.667 | np.nan | |
| 136 | 1878.678 | np.nan | |
| 137 | 1878.661 | 1 | 1 | max
| 138 | 1878.171 | 1 | |
| 139 | 1877.371 | 1 | |
| 140 | 1877.359 | 1 | |
| 141 | 1878.381 | 1 | |
| 142 | 1875.185 | 1 | -1 | min
| 143 | 1875.367 | np.nan | |
| 144 | 1865.492 | np.nan | |
| 145 | 1865.495 | 1 | -1 | min
| 146 | 1866.995 | 1 | |
| 147 | 1866.672 | 1 | |
| 148 | 1867.465 | 1 | |
| 149 | 1867.663 | 1 | |
| 150 | 1867.186 | 1 | |
| 151 | 1867.687 | 1 | |
| 152 | 1867.459 | 1 | |
| 153 | 1867.168 | 1 | |
| 154 | 1869.689 | 1 | |
| 155 | 1869.693 | 1 | |
| 156 | 1871.676 | 1 | |
| 157 | 1873.174 | 1 | 1 | max
| 158 | 1873.691 | np.nan | |
| 159 | 1873.685 | np.nan | |
+-------+----------+-----------+---------+
I must specify upfront that this question has already been answered here with a pandas solution. That solution performs reasonably, at about 300 seconds for a table of around 1 million rows. But after some more testing, I see that if the table is over 3 million rows, the execution time increases dramatically to over 2,500 seconds. This is obviously too long for such a simple task. How would the same problem be solved with numpy?
Here's one NumPy approach -
mask = ~np.isnan(data[:,1])                    # True on rows that belong to an interval
s0 = np.flatnonzero(mask[1:] > mask[:-1])+1    # start index of each interval
s1 = np.flatnonzero(mask[1:] < mask[:-1])+1    # stop index (exclusive) of each interval
lens = s1 - s0                                 # length of each interval
tags = np.repeat(np.arange(len(lens)), lens)   # interval id for every in-interval row
idx = np.lexsort((data[mask,0], tags))         # sort values within each interval
starts = np.r_[0,lens.cumsum()]                # interval boundaries inside idx
offsets = np.r_[s0[0], s0[1:] - s1[:-1]]       # gap sizes between consecutive intervals
offsets_cumsum = offsets.cumsum()              # maps masked positions back to original rows
min_ids = idx[starts[:-1]] + offsets_cumsum    # original row of each interval's minimum
max_ids = idx[starts[1:]-1] + offsets_cumsum   # original row of each interval's maximum
out = np.full(data.shape[0], np.nan)
out[min_ids] = -1                              # mark minima
out[max_ids] = 1                               # mark maxima
So this is a bit of a cheat since it uses scipy:
import numpy as np
from scipy import ndimage

markers = np.isnan(data[:, 1])   # NaN rows separate the intervals
groups = np.cumsum(markers)      # label is constant within each interval
mins, maxs, min_idx, max_idx = ndimage.measurements.extrema(
    data[:, 0], labels=groups, index=range(2, groups.max(), 2))
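A possible follow-up, not part of the original answer: convert the returned positions into the same -1/1 marker column the NumPy version builds. ndimage reports positions as coordinate tuples, so they are flattened first:

out = np.full(data.shape[0], np.nan)
out[np.asarray(min_idx, dtype=int).ravel()] = -1  # interval minima
out[np.asarray(max_idx, dtype=int).ravel()] = 1   # interval maxima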
