I tried using pd.read_html to scrape a table, but the last 3 columns come back as NaN. Here is the code I used:
import pandas as pd
url = 'https://www.actionnetwork.com/mlb/public-betting'
todays_games = pd.read_html(url)[0]
There are 7 columns in total, and it grabs all of the headers, but not the data in the last 3 columns. I also tried parsing this using BeautifulSoup, but got the same result.
print(todays_games)
Scheduled Open ... Diff Bets
0 5:05 PM 951MarlinsMIA952NationalsWSH -118+100 ... NaN NaN
1 5:10 PM 979BrewersMIL980TigersDET -227+188 ... NaN NaN
2 7:07 PM 965RaysTB966Blue JaysTOR +150-175 ... NaN NaN
3 8:10 PM 967Red SoxBOS968MarinersSEA -125+105 ... NaN NaN
4 10:35 PM 953RedsCIN954PiratesPIT -154+135 ... NaN NaN
5 11:05 PM 955CubsCHC956PhilliesPHI +170-200 ... NaN NaN
6 11:05 PM 969YankeesNYY970OriolesBAL -227+188 ... NaN NaN
7 11:10 PM 957CardinalsSTL958MetsNYM +135-154 ... NaN NaN
8 11:20 PM 959RockiesCOL960BravesATL +170-200 ... NaN NaN
9 11:40 PM 971IndiansCLE972TwinsMIN +100-118 ... NaN NaN
10 Thu 9/16, 12:05 AM 973AstrosHOU974RangersTEX -213+175 ... NaN NaN
11 Thu 9/16, 12:10 AM 975AngelsLAA976White SoxCWS +160-189 ... NaN NaN
12 Thu 9/16, 12:10 AM 977AthleticsOAK978RoyalsKC -149+125 ... NaN NaN
13 Thu 9/16, 1:45 AM 961PadresSD962GiantsSF +103-120 ... NaN NaN
14 Thu 9/16, 2:10 AM 963DiamondbacksARI964DodgersLAD -185+155 ... NaN NaN
I'm assuming the problem has something to do with the HTML code. Can anyone help me solve this?
The last columns of that table are filled in client-side after the page loads, so they aren't in the static HTML that pd.read_html (or BeautifulSoup) sees. Instead, send an HTTP GET request to https://api.actionnetwork.com/web/v1/scoreboard/mlb?bookIds=15,30,68,75,69,76,71,79,247,123,263&date=20210915 and you will get the data you are looking for:
import requests
r = requests.get(
'https://api.actionnetwork.com/web/v1/scoreboard/mlb?bookIds=15,30,68,75,69,76,71,79,247,123,263&date=20210915')
print(r.json())
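If you want that JSON as a DataFrame rather than raw output, a minimal sketch like the one below should get you started. Note that the 'games' key passed to pd.json_normalize is an assumption about the response layout, so inspect r.json().keys() first and adjust:
import pandas as pd
import requests
url = ('https://api.actionnetwork.com/web/v1/scoreboard/mlb'
       '?bookIds=15,30,68,75,69,76,71,79,247,123,263&date=20210915')
payload = requests.get(url).json()
# 'games' is an assumed top-level key -- confirm it with payload.keys()
games = pd.json_normalize(payload.get('games', []))
print(games.head())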
I'm trying to get a list of the major world indices in Yahoo Finance at this URL: https://finance.yahoo.com/world-indices.
I tried first to get the indices in a table by just running
major_indices=pd.read_html("https://finance.yahoo.com/world-indices")[0]
In this case the error was:
ValueError: No tables found
So I read a solution using selenium at pandas read_html - no tables found
The solution they came up with is (with some adjustments):
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.keys import Keys
from webdrivermanager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().download_and_install())
driver.get("https://finance.yahoo.com/world-indices")
html = driver.page_source
tables = pd.read_html(html)
data = tables[1]
Again this code gave me another error:
ValueError: No tables found
I don't know whether to keep using Selenium or whether pd.read_html is just fine. Either way, I'm trying to get this data and don't know how to proceed. Can anyone help me?
You don't need Selenium here; you just have to set the euConsentId cookie:
import pandas as pd
import requests
import uuid
url = 'https://finance.yahoo.com/world-indices'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
df = pd.read_html(html)[0]
Output:
>>> df
Symbol Name Last Price Change % Change Volume Intraday High/Low 52 Week Range Day Chart
0 ^GSPC S&P 500 4023.89 93.81 +2.39% 2.545B NaN NaN NaN
1 ^DJI Dow 30 32196.66 466.36 +1.47% 388.524M NaN NaN NaN
2 ^IXIC Nasdaq 11805.00 434.04 +3.82% 5.15B NaN NaN NaN
3 ^NYA NYSE COMPOSITE (DJ) 15257.36 326.26 +2.19% 0 NaN NaN NaN
4 ^XAX NYSE AMEX COMPOSITE INDEX 4025.81 122.66 +3.14% 0 NaN NaN NaN
5 ^BUK100P Cboe UK 100 739.68 17.83 +2.47% 0 NaN NaN NaN
6 ^RUT Russell 2000 1792.67 53.28 +3.06% 0 NaN NaN NaN
7 ^VIX CBOE Volatility Index 28.87 -2.90 -9.13% 0 NaN NaN NaN
8 ^FTSE FTSE 100 7418.15 184.81 +2.55% 0 NaN NaN NaN
9 ^GDAXI DAX PERFORMANCE-INDEX 14027.93 288.29 +2.10% 0 NaN NaN NaN
10 ^FCHI CAC 40 6362.68 156.42 +2.52% 0 NaN NaN NaN
11 ^STOXX50E ESTX 50 PR.EUR 3703.42 89.99 +2.49% 0 NaN NaN NaN
12 ^N100 Euronext 100 Index 1211.74 28.89 +2.44% 0 NaN NaN NaN
13 ^BFX BEL 20 3944.56 14.35 +0.37% 0 NaN NaN NaN
14 IMOEX.ME MOEX Russia Index 2307.50 9.61 +0.42% 0 NaN NaN NaN
15 ^N225 Nikkei 225 26427.65 678.93 +2.64% 0 NaN NaN NaN
16 ^HSI HANG SENG INDEX 19898.77 518.43 +2.68% 0 NaN NaN NaN
17 000001.SS SSE Composite Index 3084.28 29.29 +0.96% 3.109B NaN NaN NaN
18 399001.SZ Shenzhen Component 11159.79 64.92 +0.59% 3.16B NaN NaN NaN
19 ^STI STI Index 3191.16 25.98 +0.82% 0 NaN NaN NaN
20 ^AXJO S&P/ASX 200 7075.10 134.10 +1.93% 0 NaN NaN NaN
21 ^AORD ALL ORDINARIES 7307.70 141.10 +1.97% 0 NaN NaN NaN
22 ^BSESN S&P BSE SENSEX 52793.62 -136.69 -0.26% 0 NaN NaN NaN
23 ^JKSE Jakarta Composite Index 6597.99 -1.85 -0.03% 0 NaN NaN NaN
24 ^KLSE FTSE Bursa Malaysia KLCI 1544.41 5.61 +0.36% 0 NaN NaN NaN
25 ^NZ50 S&P/NZX 50 INDEX GROSS 11168.18 -9.18 -0.08% 0 NaN NaN NaN
26 ^KS11 KOSPI Composite Index 2604.24 54.16 +2.12% 788539 NaN NaN NaN
27 ^TWII TSEC weighted index 15832.54 215.86 +1.38% 0 NaN NaN NaN
28 ^GSPTSE S&P/TSX Composite index 20099.81 400.76 +2.03% 294.637M NaN NaN NaN
29 ^BVSP IBOVESPA 106924.18 1236.54 +1.17% 0 NaN NaN NaN
30 ^MXX IPC MEXICO 49579.90 270.58 +0.55% 212.868M NaN NaN NaN
31 ^IPSA S&P/CLX IPSA 5058.88 0.00 0.00% 0 NaN NaN NaN
32 ^MERV MERVAL 38390.84 233.89 +0.61% 0 NaN NaN NaN
33 ^TA125.TA TA-125 1964.95 23.38 +1.20% 0 NaN NaN NaN
34 ^CASE30 EGX 30 Price Return Index 10642.40 -213.50 -1.97% 36.837M NaN NaN NaN
35 ^JN0U.JO Top 40 USD Net TRI Index 4118.19 65.63 +1.62% 0 NaN NaN NaN
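Without that cookie, requests from some regions get redirected to Yahoo's GDPR consent page, whose HTML contains no tables, which is most likely why both pd.read_html and the Selenium attempt raised ValueError: No tables found.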
I want to merge the following 2 data frames in Pandas, but the result doesn't contain all the relevant columns:
L1aIn[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L1aIn
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a 2021 3 29 1
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a 2021 3 29 1
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b 2021 3 29 1
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a 2021 3 29 1
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a 2021 3 29 1
L2Std[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L2Std
0 oco2_L2StdGL_35861a_210329_B10206r_21042704283... 35861 GL a 2021 3 29 1
1 oco2_L2StdXS_35860a_210329_B10206r_21042700342... 35860 XS a 2021 3 29 1
2 oco2_L2StdND_35852a_210329_B10206r_21042622540... 35852 ND a 2021 3 29 1
3 oco2_L2StdGL_35862a_210329_B10206r_21042622403... 35862 GL a 2021 3 29 1
4 oco2_L2StdTG_35856a_210329_B10206r_21042622422... 35856 TG a 2021 3 29 1
>>> df = L1aIn.copy(deep=True)
>>> df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a ... NaN NaN NaN NaN
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a ... NaN NaN NaN NaN
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b ... NaN NaN NaN NaN
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a ... NaN NaN NaN NaN
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a ... NaN NaN NaN NaN
5 NaN 35861 GL a ... 2021.0 3.0 29.0 1.0
6 NaN 35860 XS a ... 2021.0 3.0 29.0 1.0
7 NaN 35852 ND a ... 2021.0 3.0 29.0 1.0
8 NaN 35862 GL a ... 2021.0 3.0 29.0 1.0
9 NaN 35856 TG a ... 2021.0 3.0 29.0 1.0
[10 rows x 13 columns]
>>> df.columns
Index(['Filename', 'OrbitNumber', 'OrbitMode', 'OrbitModeCounter', 'Year',
'Month', 'Day', 'L1aIn'],
dtype='object')
I want the resulting merged table to include both the "L1aIn" and "L2Std" columns but as you can see it doesn't and only picks up the original columns from L1aIn.
I'm also puzzled about why it seems to be returning a dataframe object rather than None.
A toy example works fine for me, but the real-life one does not. What circumstances provoke this kind of behavior for merge?
Seems to me that you just need to assign the output of the merge to a variable:
merged_df = df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
print(merged_df.columns)
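DataFrame.merge never modifies the frame in place: it always returns a new DataFrame and leaves df untouched. That is why the interactive call printed the merged result instead of None, while df.columns still shows only the original L1aIn columns.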
Hello everyone, I have the following problem:
I have panel data for 400,000 objects and I want to drop an object if it contains more than 40% NaNs.
For example:
inn time_reg revenue1 balans1 equity1 opprofit1 \
0 0101000021 2006 457000.0 115000.0 28000.0 29000.0
1 0101000021 2007 1943000.0 186000.0 104000.0 99000.0
2 0101000021 2008 2812000.0 318000.0 223000.0 127000.0
3 0101000021 2009 2673000.0 370000.0 242000.0 39000.0
4 0101000021 2010 3240000.0 435000.0 45000.0 NaN
... ... ... ... ... ... ...
4081810 9909403758 2003 6943000.0 2185000.0 2136000.0 -97000.0
4081811 9909403758 2004 6504000.0 2245000.0 2196000.0 -34000.0
4081812 9909403758 2005 NaN NaN NaN NaN
4081813 9909403758 2006 NaN NaN NaN NaN
4081814 9909403758 2007 NaN NaN NaN NaN
grossprofit1 netprofit1 currentassets1 stliabilities1
0 92000.0 18000.0 105000.0 87000.0
1 189000.0 76000.0 176000.0 82000.0
2 472000.0 119000.0 308000.0 95000.0
3 483000.0 29000.0 360000.0 128000.0
4 NaN 35000.0 NaN NaN
... ... ... ... ...
4081810 2365000.0 -59000.0 253000.0 49000.0
4081811 2278000.0 60000.0 425000.0 49000.0
4081812 NaN NaN NaN NaN
4081813 NaN NaN NaN NaN
4081814 NaN NaN NaN NaN
I have such a dataframe, and for each sub-dataframe grouped by (inn, time_reg) I need to drop it if the total NaNs in the columns (revenue1, balans1, equity1, opprofit1, grossprofit1, netprofit1, currentassets1, stliabilities1) exceed 40%.
I have an idea of how to do it in a loop, but that takes a lot of time.
For example:
inn time_reg revenue1 balans1 equity1 opprofit1 \
4081809 9909403758 2002 6078000.0 2270000.0 2195000.0 -32000.0
4081810 9909403758 2003 6943000.0 2185000.0 2136000.0 -97000.0
4081811 9909403758 2004 6504000.0 2245000.0 2196000.0 -34000.0
4081812 9909403758 2005 NaN NaN NaN NaN
4081813 9909403758 2006 NaN NaN NaN NaN
4081814 9909403758 2007 NaN NaN NaN NaN
grossprofit1 netprofit1 currentassets1 stliabilities1
4081809 1324000.0 NaN 234000.0 75000.0
4081810 2365000.0 -59000.0 253000.0 49000.0
4081811 2278000.0 60000.0 425000.0 49000.0
4081812 NaN NaN NaN NaN
4081813 NaN NaN NaN NaN
4081814 NaN NaN NaN NaN
This sub-dataframe should be dropped because it contains more than 40% NaNs.
inn time_reg revenue1 balans1 equity1 opprofit1 \
0 0101000021 2006 457000.0 115000.0 28000.0 29000.0
1 0101000021 2007 1943000.0 186000.0 104000.0 99000.0
2 0101000021 2008 2812000.0 318000.0 223000.0 127000.0
3 0101000021 2009 2673000.0 370000.0 242000.0 39000.0
4 0101000021 2010 3240000.0 435000.0 45000.0 NaN
5 0101000021 2011 3480000.0 610000.0 71000.0 NaN
6 0101000021 2012 4820000.0 710000.0 139000.0 149000.0
7 0101000021 2013 5200000.0 790000.0 148000.0 170000.0
8 0101000021 2014 5450000.0 830000.0 155000.0 180000.0
9 0101000021 2015 5620000.0 860000.0 164000.0 189000.0
10 0101000021 2016 5860000.0 885000.0 175000.0 200000.0
11 0101000021 2017 15112000.0 1275000.0 298000.0 323000.0
grossprofit1 netprofit1 currentassets1 stliabilities1
0 92000.0 18000.0 105000.0 87000.0
1 189000.0 76000.0 176000.0 82000.0
2 472000.0 119000.0 308000.0 95000.0
3 483000.0 29000.0 360000.0 128000.0
4 NaN 35000.0 NaN NaN
5 NaN 61000.0 NaN NaN
6 869000.0 129000.0 700000.0 571000.0
7 1040000.0 138000.0 780000.0 642000.0
8 1090000.0 145000.0 820000.0 675000.0
9 1124000.0 154000.0 850000.0 696000.0
10 1172000.0 165000.0 875000.0 710000.0
11 3023000.0 288000.0 1265000.0 977000.0
This sub-dataframe contains less than 40% NaNs and must be kept in the final dataframe.
Would a loop still be too slow if you used a numpy/pandas function for the counting? You could use someDataFrame.isnull().sum().sum().
Probably a lot faster than writing your own loop to go over all the values in a dataframe, since those libraries tend to have very efficient implementations of those kinds of functions.
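As a rough sketch of that idea (assuming df is your full dataframe and using the eight financial columns from the question), the per-group NaN fraction can be computed without an explicit Python loop:
import pandas as pd
value_cols = ["revenue1", "balans1", "equity1", "opprofit1",
              "grossprofit1", "netprofit1", "currentassets1", "stliabilities1"]
# Fraction of NaN cells per inn group across the financial columns
nan_frac = df[value_cols].isnull().groupby(df["inn"]).mean().mean(axis=1)
# Keep only the inn groups whose NaN fraction is at most 40%
keep = nan_frac[nan_frac <= 0.4].index
filtered = df[df["inn"].isin(keep)]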
You can use the filter method of pd.DataFrame.groupby.
This allows you to pass a function that indicates whether a subframe should be filtered or not (in this case if it contains over 40% NaNs in the relevant columns). To get that information, you can use numpy to count the nans as in getNanFraction:
import numpy as np

def getNanFraction(df):
    # NaN count over the value columns, divided by the total number of cells
    values = df.drop(["inn", "time_reg"], axis=1).values
    return np.isnan(values).sum() / values.size

df.groupby("inn").filter(lambda x: getNanFraction(x) < 0.4)
I'm trying to collect the items from column 'data' that immediately precede the values I collected in column 'min', and put them in a new column.
Here is the data (imported with pd.read_csv):
time,data
12/15/18 01:10 AM,130352.146180556
12/16/18 01:45 AM,130355.219097222
12/17/18 01:47 AM,130358.223263889
12/18/18 02:15 AM,130361.281701389
12/19/18 03:15 AM,130364.406597222
12/20/18 03:25 AM,130352.427430556
12/21/18 03:27 AM,130355.431597222
12/22/18 05:18 AM,130358.663541667
12/23/18 06:44 AM,130361.842430556
12/24/18 07:19 AM,130364.915243056
12/25/18 07:33 AM,130352.944409722
12/26/18 07:50 AM,130355.979826389
12/27/18 09:13 AM,130359.153472222
12/28/18 11:53 AM,130362.4871875
12/29/18 01:23 PM,130365.673263889
12/30/18 02:17 PM,130353.785763889
12/31/18 02:23 PM,130356.798263889
01/01/19 04:41 PM,130360.085763889
01/02/19 05:01 PM,130363.128125
and my code:
import pandas as pd
import numpy as np
from scipy import signal
from scipy.signal import argrelextrema
import datetime
diff=pd.DataFrame()
df=pd.read_csv('saw_data2.csv')
df['time']=pd.to_datetime(df['time'])
print(df.head())
n=2 # number of points to be checked before and after
# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal, order=n)[0]]['data']
If you plot the data, you'll see it is similar to a sawtooth. The 'data' element just before each value I get in 'min' is the element I want to put in a new column, df['new_col'].
I've tried many things, like:
df['new_col']=df.index.get_loc(df['min'].df['data'])
and,
df['new_col']=df['min'].shift() #obviously wrong
IIUC, you can do the shift before selecting the rows with a value in min:
df['new_col'] = df.shift().loc[df['min'].notna(), 'data']
print (df)
time data min new_col
0 12/15/18 01:10 AM 130352.146181 130352.146181 NaN
1 12/16/18 01:45 AM 130355.219097 NaN NaN
2 12/17/18 01:47 AM 130358.223264 NaN NaN
3 12/18/18 02:15 AM 130361.281701 NaN NaN
4 12/19/18 03:15 AM 130364.406597 NaN NaN
5 12/20/18 03:25 AM 130352.427431 130352.427431 130364.406597
6 12/21/18 03:27 AM 130355.431597 NaN NaN
7 12/22/18 05:18 AM 130358.663542 NaN NaN
8 12/23/18 06:44 AM 130361.842431 NaN NaN
9 12/24/18 07:19 AM 130364.915243 NaN NaN
10 12/25/18 07:33 AM 130352.944410 130352.944410 130364.915243
11 12/26/18 07:50 AM 130355.979826 NaN NaN
12 12/27/18 09:13 AM 130359.153472 NaN NaN
13 12/28/18 11:53 AM 130362.487187 NaN NaN
14 12/29/18 01:23 PM 130365.673264 NaN NaN
15 12/30/18 02:17 PM 130353.785764 130353.785764 130365.673264
16 12/31/18 02:23 PM 130356.798264 NaN NaN
17 01/01/19 04:41 PM 130360.085764 NaN NaN
18 01/02/19 05:01 PM 130363.128125 NaN NaN
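Shifting the whole frame first and then selecting the rows where 'min' is not NaN means each local minimum picks up the 'data' value from the row directly above it, which is exactly the preceding element you are after.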
I'm using code to shift time series data that looks somewhat similar to this:
Year Player PTSN AVGN
2018 Aaron Donald 280.60 17.538
2018 J.J. Watt 259.80 16.238
2018 Danielle Hunter 237.60 14.850
2017 Aaron Donald 181.0 12.929
2016 Danielle Hunter 204.6 12.788
with the intent of getting it into something like this:
AVGN PTSN AVGN_prev PTSN_prev
Player Year
Aaron Donald 2016 NaN NaN NaN NaN
2017 12.929 181.0 NaN NaN
2018 17.538 280.6 12.929 181.0
Danielle Hunter 2016 12.788 204.6 NaN NaN
2017 8.325 133.2 12.788 204.6
2018 14.850 237.6 8.325 133.2
J.J. Watt 2016 NaN NaN NaN NaN
2017 NaN NaN NaN NaN
2018 16.238 259.8 NaN NaN
I'm using this code to make that happen:
res = df.set_index(['player', 'Year'])
idx = pd.MultiIndex.from_product([df['player'].unique(),
df['Year'].unique()],
names=['Player', 'Year'])
res = res.groupby(['player', 'Year']).apply(sum)
res = res.reindex(idx).sort_index()
res[columns] = res.groupby('Player')[list(res.columns)].shift(1)
with the addition of a groupby.sum() because some players in the dataframe moved from one team to another within the same season and I want to combine those numbers. However, the data I have is actually coming out extremely wrong. The data has too many columns to post, but it seems like the data from the previous year (_prev) is placed into random columns. It doesn't change and will always place it into the same wrong columns. Is this an issue caused by the groupby.sum()? Is it because I'm using a columns variable (containing all the same names as res.columns with the string '_prev' attached to them) and a list(res.columns)? And regardless of which it is, how do I solve this?
Here are the outputs of columns and res.columns:
columns:
['player_id_prev', 'position_prev', 'player_game_count_prev', 'team_name_prev', 'snap_counts_total_prev', 'snap_counts_pass_rush_prev', 'snap_counts_run_defense_prev', 'snap_counts_coverage_prev', 'grades_defense_prev', 'grades_run_defense_prev', 'grades_tackle_prev', 'grades_pass_rush_defense_prev', 'grades_coverage_defense_prev', 'total_pressures_prev', 'sacks_prev', 'hits_prev', 'hurries_prev', 'batted_passes_prev', 'tackles_prev', 'assists_prev', 'missed_tackles_prev', 'stops_prev', 'forced_fumbles_prev', 'targets_prev', 'receptions_prev', 'yards_prev', 'yards_per_reception_prev', 'yards_after_catch_prev', 'longest_prev', 'touchdowns_prev', 'interceptions_prev', 'pass_break_ups_prev', 'qb_rating_against_prev', 'penalties_prev', 'declined_penalties_prev']
res.columns:
['player_id', 'position', 'player_game_count', 'team_name',
'snap_counts_total', 'snap_counts_pass_rush', 'snap_counts_run_defense',
'snap_counts_coverage', 'grades_defense', 'grades_run_defense',
'grades_tackle', 'grades_pass_rush_defense', 'grades_coverage_defense',
'total_pressures', 'sacks', 'hits', 'hurries', 'batted_passes',
'tackles', 'assists', 'missed_tackles', 'stops', 'forced_fumbles',
'targets', 'receptions', 'yards', 'yards_per_reception',
'yards_after_catch', 'longest', 'touchdowns', 'interceptions',
'pass_break_ups', 'qb_rating_against', 'penalties',
'declined_penalties']
both are length 35 when tested.
I suggest using:
# First aggregate so the (Player, Year) index is unique
res = df.groupby(['Player', 'Year']).sum()
# Build the full MultiIndex of every Player/Year combination
idx = pd.MultiIndex.from_product(res.index.levels,
                                 names=['Player', 'Year'])
# Add the missing years
res = res.reindex(idx).sort_index()
# Shift all columns within each player, add a suffix and join to the original
res = res.join(res.groupby('Player').shift().add_suffix('_prev'))
print (res)
PTSN AVGN PTSN_prev AVGN_prev
Player Year
Aaron Donald 2016 NaN NaN NaN NaN
2017 181.0 12.929 NaN NaN
2018 280.6 17.538 181.0 12.929
Danielle Hunter 2016 204.6 12.788 NaN NaN
2017 NaN NaN 204.6 12.788
2018 237.6 14.850 NaN NaN
J.J. Watt 2016 NaN NaN NaN NaN
2017 NaN NaN NaN NaN
2018 259.8 16.238 NaN NaN
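Deriving the _prev names directly from the shifted frame with add_suffix('_prev') and attaching it with join keeps every shifted value paired with its own column, so there is no separate columns list that has to stay in the same order as res.columns.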