Scraping non-interactable table from dynamic webpage - python

I've seen a couple of posts with this same question, but their scripts usually wait until one of the elements (buttons) is clickable. Here is the table I'm trying to scrape:
https://ropercenter.cornell.edu/presidential-approval/highslows
On my first couple of tries the code returned all the rows except both Polling Organization columns. Without changing anything, it now only scrapes the table headers and the tbody tag (no table rows).
url = "https://ropercenter.cornell.edu/presidential-approval/highslows"
driver = webdriver.Firefox()
driver.get(url)
driver.implicitly_wait(12)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
approvalData = pd.read_html(str(table[0]))
approvalData = pd.DataFrame(approvalData[0], columns = ['President', 'Highest %', 'Polling Organization & Dates H' 'Lowest %', 'Polling Organization & Dates L'])
Should I use an explicit wait? If so, which condition should I wait for, since the dynamic table is not interactable?
Also, why did the output of my code change after running it multiple times?

Maybe more of a cheat, but an easier solution that indeed solves your problem, just in another way, is to look at what the frontend does (using your browser's developer tools) and discover that it calls an API which returns JSON. So no Selenium is really needed; requests and pandas are enough.
import requests
import pandas as pd

url = "https://ropercenter.cornell.edu/presidential-approval/api/presidents/highlow"
data = requests.get(url).json()
df = pd.json_normalize(data)  # pd.io.json.json_normalize is deprecated
>>> df
president.id president.active president.surname president.givenname president.shortname ... low.approve low.disapprove low.noOpinion low.sampleSize low.presidentName
0 e9c0d19b-dfe9-49cf-9939-d06a0f256e57 True Biden Joe None ... 33 53 13 1313.0 Joe Biden
1 bc9855d5-8e97-4448-b62e-1fb2865c79e6 True Trump Donald None ... 29 68 3 5360.0 Donald Trump
2 1c49881f-0f0c-4a53-9b2c-0dd6540f88e4 True Obama Barack None ... 37 57 5 1017.0 Barack Obama
3 ceda6415-5975-404d-8049-978758a7d1f8 True Bush George W. W. Bush ... 19 77 4 1100.0 George W. Bush
4 4f7344de-a7bd-4bc6-9147-87963ae51095 True Clinton Bill None ... 36 50 14 800.0 Bill Clinton
5 116721f1-f947-4c14-b0b5-d521ed5a4c8b True Bush George H.W. H.W. Bush ... 29 60 11 1001.0 George H.W. Bush
6 43720f8f-0b9f-43b0-8c0d-63da059e7a57 True Reagan Ronald None ... 35 56 9 1555.0 Ronald Reagan
7 7aa76fd3-e1bc-4e9a-b13c-463a64e0c864 True Carter Jimmy None ... 28 59 13 1542.0 Jimmy Carter
8 6255dd77-531d-46c6-bb26-627e2a4b3654 True Ford Gerald None ... 37 39 24 1519.0 Gerald Ford
9 f1a23b06-4200-41e6-b137-dd46260ac4d8 True Nixon Richard None ... 23 55 22 1589.0 Richard Nixon
10 772aabfd-289b-4f10-aaae-81a82dd3dbc6 True Johnson Lyndon B. None ... 35 52 13 1526.0 Lyndon B. Johnson
11 d849b5a8-f711-4ac9-9728-c3915e17bb6a True Kennedy John F. None ... 56 30 14 1550.0 John F. Kennedy
12 e22fd64a-cf20-4bc4-8db6-b4e71dc4483d True Eisenhower Dwight D. None ... 48 36 16 NaN Dwight D. Eisenhower
13 ab0bfa04-61da-49d1-8069-6992f6124f17 True Truman Harry S. None ... 22 65 13 NaN Harry S. Truman
14 11edf04f-9d8d-4678-976d-b9339b46705d True Roosevelt Franklin D. None ... 48 43 8 NaN Franklin D. Roosevelt
[15 rows x 41 columns]
>>> df.columns
Index(['president.id', 'president.active', 'president.surname',
'president.givenname', 'president.shortname', 'president.fullname',
'president.number', 'president.terms', 'president.ratings',
'president.termCount', 'president.ratingCount', 'high.id',
'high.active', 'high.organization.id', 'high.organization.active',
'high.organization.name', 'high.organization.ratingCount',
'high.pollingStart', 'high.pollingEnd', 'high.updated',
'high.president', 'high.approve', 'high.disapprove', 'high.noOpinion',
'high.sampleSize', 'high.presidentName', 'low.id', 'low.active',
'low.organization.id', 'low.organization.active',
'low.organization.name', 'low.organization.ratingCount',
'low.pollingStart', 'low.pollingEnd', 'low.updated', 'low.president',
'low.approve', 'low.disapprove', 'low.noOpinion', 'low.sampleSize',
'low.presidentName'],
dtype='object')
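From there, reproducing something close to the page's table is just a column selection. For example (the column names come straight from the Index above; the renamed labels are only illustrative):
cols = {
    'president.fullname': 'President',
    'high.approve': 'Highest %',
    'high.organization.name': 'Polling Organization (High)',
    'low.approve': 'Lowest %',
    'low.organization.name': 'Polling Organization (Low)',
}
approval = df[list(cols)].rename(columns=cols)
print(approval)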

Using only Selenium, GeckoDriver, and Firefox to extract the table contents from the website, you need to induce WebDriverWait for visibility_of_element_located(); then, with pandas' read_html(), you can use the following locator strategy:
Code Block:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\geckodriver.exe')
driver = webdriver.Firefox(service=s, options=options)
driver.get('https://ropercenter.cornell.edu/presidential-approval/highslows')
tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped']"))).get_attribute("outerHTML")
tabledf = pd.read_html(tabledata)
print(tabledf)
driver.quit()
Console Output:
[ President Highest % ... Lowest % Polling Organization & Dates.1
0 Joe Biden 63% ... 33% Quinnipiac UniversityJan 7th, 2022 - Jan 10th,...
1 Donald Trump 49% ... 29% PewJan 8th, 2021 - Jan 12th, 2021
2 Barack Obama 76% ... 37% Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011
3 George W. Bush 92% ... 19% American Research GroupFeb 16th, 2008 - Feb 19...
4 Bill Clinton 73% ... 36% Yankelovich Partners / TIME / CNNMay 26th, 199...
5 George H.W. Bush 89% ... 29% Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992
6 Ronald Reagan 68% ... 35% Gallup OrganizationJan 28th, 1983 - Jan 31st, ...
7 Jimmy Carter 75% ... 28% Gallup OrganizationJun 29th, 1979 - Jul 2nd, 1979
8 Gerald Ford 71% ... 37% Gallup OrganizationJan 10th, 1975 - Jan 13th, ...
9 Richard Nixon 70% ... 23% Gallup OrganizationJan 4th, 1974 - Jan 7th, 1974
10 Lyndon B. Johnson 80% ... 35% Gallup OrganizationAug 7th, 1968 - Aug 12th, 1968
11 John F. Kennedy 83% ... 56% Gallup OrganizationSep 12th, 1963 - Sep 17th, ...
12 Dwight D. Eisenhower 78% ... 48% Gallup OrganizationMar 27th, 1958 - Apr 1st, 1958
13 Harry S. Truman 87% ... 22% Gallup OrganizationFeb 9th, 1952 - Feb 14th, 1952
14 Franklin D. Roosevelt 84% ... 48% Gallup OrganizationAug 18th, 1939 - Aug 24th, ...
[15 rows x 5 columns]]
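One caveat: pandas 2.1+ deprecates passing a literal HTML string to read_html(), so if you see a FutureWarning, wrap the markup in StringIO first:
from io import StringIO

tabledf = pd.read_html(StringIO(tabledata))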

Related

Unable to do web scraping from URL using Python Alchemy

I have a script where I'm trying to scrape web data into a table, but I'm getting this error:
raise exc.with_traceback(traceback)
ValueError: No tables found
Script:
import pandas as pd
import logging
from sqlalchemy import create_engine
from urllib.parse import quote
db_connection = f"mysql://{username}:{quote('pwd')}@{DB}:{port}"
ds_connection = create_engine(db_connection)
a = pd.read_html("https://www.centralbank.ae/en/forex-eibor/exchange-rates/")
df = pd.DataFrame(a[0])
df_final = df.loc[:,['Currency','Rate']]
df_final.to_sql('rate_table', ds_connection, if_exists='append', index=False)
Can anyone suggest a fix for this?
One easy way to obtain those exchange rates is to scrape the API the page calls to retrieve its data (check the Network tab in your browser's dev tools):
import pandas as pd
import requests

headers = {'Accept-Language': 'en-US,en;q=0.9',
           'Referer': 'https://www.centralbank.ae/en/forex-eibor/exchange-rates/'}
r = requests.post('https://www.centralbank.ae/umbraco/Surface/Exchange/GetExchangeRateAllCurrency', headers=headers)
dfs = pd.read_html(r.text)
print(dfs[0].loc[:,['Currency','Rates']])
This returns:
                   Currency     Rates
0                 US Dollar    3.6725
1            Argentine Peso  0.026993
2         Australian Dollar   2.52753
3           Bangladesh Taka  0.038508
4             Bahrani Dinar   9.74293
5             Brunei Dollar   2.64095
6            Brazilian Real  0.706549
7             Botswana Pula  0.287552
8            Belarus Rouble   1.45526
9           Canadian Dollar   2.82565
10              Swiss Franc   3.83311
11             Chilean Peso  0.003884
12  Chinese Yuan - Offshore  0.536978
13             Chinese Yuan  0.538829
14           Colombian Peso  0.000832
15             Czech Koruna  0.149763
16             Danish Krone  0.496304
17           Algerian Dinar  0.025944
18              Egypt Pound  0.191775
19                     Euro   3.69096
20                 GB Pound   4.34256
21          Hongkong Dollar  0.468079
22         Hungarian Forint  0.009112
23         Indonesia Rupiah  0.000248
24             Indian Rupee  0.045976
25            Iceland Krona  0.026232
26             Jordan Dinar   5.17472
27             Japanese Yen  0.026818
28           Kenya Shilling  0.030681
29               Korean Won  0.002746
30            Kuwaiti Dinar   11.9423
31         Kazakhstan Tenge  0.007704
32            Lebanon Pound  0.002418
33          Sri Lanka Rupee  0.010201
34          Moroccan Dirham  0.353346
35          Macedonia Denar  0.059901
36             Mexican Peso  0.181874
37         Malaysia Ringgit  0.820395
38           Nigerian Naira  0.008737
39          Norwegian Krone   0.37486
40        NewZealand Dollar   2.27287
41               Omani Rial   9.53921
42                 Peru Sol  0.952659
43          Philippine Piso  0.065562
44           Pakistan Rupee  0.017077
45             Polish Zloty  0.777446
46             Qatari Riyal   1.00254
47            Serbian Dinar  0.031445
48            Russia Rouble   0.06178
49              Saudi Riyal  0.977847
50           Sudanese Pound  0.006479
51            Swedish Krona  0.347245
52         Singapore Dollar   2.64038
53                Thai Baht  0.102612
54           Tunisian Dinar    1.1505
55             Turkish Lira   0.20272
56          Trin Tob Dollar  0.541411
57            Taiwan Dollar  0.121961
58        Tanzania Shilling  0.001575
59          Uganda Shilling  0.000959
60             Vietnam Dong  0.000157
61               Yemen Rial   0.01468
62        South Africa Rand  0.216405
63           Zambian Kwacha  0.227752
64         Azerbaijan manat   2.16157
65            Bulgarian lev    1.8873
66            Croatian kuna  0.491344
67           Ethiopian birr  0.069656
68              Iraqi dinar  0.002516
69       Israeli new shekel   1.12309
70             Libyan dinar  0.752115
71          Mauritian rupee  0.079837
72             Romanian leu  0.755612
73             Syrian pound  0.001462
74            Turkmen manat   1.05079
75          Uzbekistani som  0.000336
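Since the original script ultimately wanted the rates in MySQL, here is a minimal sketch of that last step with a corrected SQLAlchemy setup (the connection details are placeholders, and the pymysql driver is just one option):
from sqlalchemy import create_engine
from urllib.parse import quote_plus

# Placeholder credentials/host -- substitute your own.
engine = create_engine(f"mysql+pymysql://user:{quote_plus('p@ssw0rd')}@localhost:3306/mydb")
dfs[0].loc[:, ['Currency', 'Rates']].to_sql('rate_table', engine, if_exists='append', index=False)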

Web scraping a table through multiple pages with a single link

I am trying to scrape a table on a webpage as part of an assignment using Python. I want to scrape all 618 records of the table, which are scattered across 13 pages under the same URL. However, my program only scrapes the first page of the table and its records. The URL is in my code, which can be found below:
from bs4 import BeautifulSoup as bs
import requests as r

base_URL = 'https://www.nba.com/players'

def scrape_webpage(URL):
    player_names = []
    page = r.get(URL)
    print(f'{page.status_code}')
    soup = bs(page.content, 'html.parser')
    raw_player_names = soup.find_all('div', class_='flex flex-col lg:flex-row')
    for name in raw_player_names:
        player_names.append(name.get_text().strip())
    print(player_names)

scrape_webpage(base_URL)
The player data is embedded inside a <script> element in the page. You can extract and decode it with this example:
import re
import json
import requests
import pandas as pd
url = "https://www.nba.com/players"
data = re.search(r'({"props":.*})', requests.get(url).text).group(0)
data = json.loads(data)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
df = pd.DataFrame(data["props"]["pageProps"]["players"])
print(df.head().to_markdown())
Prints:
|    | PERSON_ID | PLAYER_LAST_NAME | PLAYER_FIRST_NAME | PLAYER_SLUG | TEAM_ID | TEAM_SLUG | IS_DEFUNCT | TEAM_CITY | TEAM_NAME | TEAM_ABBREVIATION | JERSEY_NUMBER | POSITION | HEIGHT | WEIGHT | COLLEGE | COUNTRY | DRAFT_YEAR | DRAFT_ROUND | DRAFT_NUMBER | ROSTER_STATUS | FROM_YEAR | TO_YEAR | PTS | REB | AST | STATS_TIMEFRAME | PLAYER_LAST_INITIAL | HISTORIC |
|---:|----------:|:-----------------|:------------------|:------------------|-----------:|:----------|-----------:|:----------|:----------|:------------------|--------------:|:---------|:-------|-------:|:----------------|:------------|-----------:|------------:|-------------:|--------------:|----------:|--------:|-----:|-----:|----:|:----------------|:--------------------|:---------|
|  0 |   1630173 | Achiuwa          | Precious          | precious-achiuwa  | 1610612761 | raptors   |          0 | Toronto   | Raptors   | TOR               |             5 | F        | 6-8    |    225 | Memphis         | Nigeria     |       2020 |           1 |           20 |             1 |      2020 |    2021 |  9.1 |  6.5 | 1.1 | Season          | A                   | False    |
|  1 |    203500 | Adams            | Steven            | steven-adams      | 1610612763 | grizzlies |          0 | Memphis   | Grizzlies | MEM               |             4 | C        | 6-11   |    265 | Pittsburgh      | New Zealand |       2013 |           1 |           12 |             1 |      2013 |    2021 |  6.9 |   10 | 3.4 | Season          | A                   | False    |
|  2 |   1628389 | Adebayo          | Bam               | bam-adebayo       | 1610612748 | heat      |          0 | Miami     | Heat      | MIA               |            13 | C-F      | 6-9    |    255 | Kentucky        | USA         |       2017 |           1 |           14 |             1 |      2017 |    2021 | 19.1 | 10.1 | 3.4 | Season          | A                   | False    |
|  3 |   1630583 | Aldama           | Santi             | santi-aldama      | 1610612763 | grizzlies |          0 | Memphis   | Grizzlies | MEM               |             7 | F-C      | 6-11   |    215 | Loyola-Maryland | Spain       |       2021 |           1 |           30 |             1 |      2021 |    2021 |  4.1 |  2.7 | 0.7 | Season          | A                   | False    |
|  4 |    200746 | Aldridge         | LaMarcus          | lamarcus-aldridge | 1610612751 | nets      |          0 | Brooklyn  | Nets      | BKN               |            21 | C-F      | 6-11   |    250 | Texas-Austin    | USA         |       2006 |           1 |            2 |             1 |      2006 |    2021 | 12.9 |  5.5 | 0.9 | Season          | A                   | False    |
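Because the JSON blob holds every player, there is no pagination to deal with. For instance, to build the list of names the original function was collecting (column names taken from the table above):
# Combine first and last names into plain strings, as in the original script.
player_names = (df["PLAYER_FIRST_NAME"] + " " + df["PLAYER_LAST_NAME"]).tolist()
print(len(player_names))  # all records, not just the first page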

Parsing html with correct encoding

I'm trying to use the basketball-reference API using python with the requests and bs4 libraries.
from requests import get
from bs4 import BeautifulSoup
Here's a minimal example of what I'm trying to do:
# example request
r = get(f'https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fteams%2FMIL%2F2015.html&div=div_roster')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find('table')
It all works well: I can then feed this table to pandas with its read_html and get the data I need nicely packed into a DataFrame.
The problem I have is the encoding.
In this particular request I got two NBA player names with special characters: Ersan İlyasova and Jorge Gutiérrez. In the current code they come out as "Ersan Ä°lyasova" and "Jorge GutiÃ©rrez", which is obviously not what I want.
So the question is -- how do I fix it? This website seems to suggest they have the windows-1251 encoding, but I'm not sure how to use that information (in fact I'm not even sure if that's true).
I know I'm missing something fundamental here as I'm a bit confused how these encodings work at which point they are being "interpreted" etc, so I'll be grateful if you help me with this!
I really don't know why you are using a format string, and your question isn't entirely clear: you've just copied and pasted the URL from the network traffic, and you're mixing up quoted strings with encoding.
With the code below you should be able to get it done.
import pandas as pd
df = pd.read_html("https://www.basketball-reference.com/teams/MIL/2015.html")
print(df)
Output:
[ No. Player Pos ... Unnamed: 6 Exp College
0 34 Giannis Antetokounmpo SG ... gr 1 NaN
1 19 Jerryd Bayless PG ... us 6 Arizona
2 5 Michael Carter-Williams PG ... us 1 Syracuse
3 9 Jared Dudley SG ... us 7 Boston College
4 11 Tyler Ennis PG ... ca R Syracuse
5 13 Jorge Gutiérrez PG ... mx 1 California
6 31 John Henson C ... us 2 UNC
7 7 Ersan İlyasova PF ... tr 6 NaN
8 23 Chris Johnson SF ... us 2 Dayton
9 11 Brandon Knight PG ... us 3 Kentucky
10 5 Kendall Marshall PG ... us 2 UNC
11 6 Kenyon Martin PF ... us 14 Cincinnati
12 0 O.J. Mayo SG ... us 6 USC
13 22 Khris Middleton SF ... us 2 Texas A&M
14 3 Johnny O'Bryant PF ... us R LSU
15 27 Zaza Pachulia C ... ge 11 NaN
16 12 Jabari Parker PF ... us R Duke
17 21 Miles Plumlee C ... us 2 Duke
18 8 Larry Sanders C ... us 4 Virginia Commonwealth
19 6 Nate Wolters PG ... us 1 South Dakota State
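If you would rather keep the original requests + BeautifulSoup approach, the mojibake happens because requests falls back to Latin-1 for text/* responses that don't declare a charset, while the page bytes are actually UTF-8. Setting the encoding before reading r.text fixes it; a minimal sketch:
from requests import get
from bs4 import BeautifulSoup

r = get('https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fteams%2FMIL%2F2015.html&div=div_roster')
r.encoding = 'utf-8'  # override the Latin-1 fallback before touching r.text
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('table')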

How to pull only certain fields with BeautifulSoup

I'm trying to print all the fields that have England in them. The current code I have prints all the nationalities into a txt file for me, but I want just the England fields to print. The page I'm pulling from is https://www.premierleague.com/players
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.premierleague.com/players")
c = r.content
soup = BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class": "playerCountry"}):
    players.write(playerCountry.text.strip())
    players.write("\n")
You just need to check whether the text is not equal to 'England' and, if so, skip to the next item in the loop:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.premierleague.com/players")
c = r.content
soup = BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class": "playerCountry"}):
    if playerCountry.text.strip() != 'England':
        continue
    players.write(playerCountry.text.strip())
    players.write("\n")
Or, you could just use pandas.read_html() and a couple of lines of code (shown here with the inverse filter, listing everyone except England):
import pandas as pd
df = pd.read_html("https://www.premierleague.com/players")[0]
print(df.loc[df['Nationality'] != 'England'])
Prints:
Player Position Nationality
2 Charlie Adam Midfielder Scotland
3 Adrián Goalkeeper Spain
4 Adrien Silva Midfielder Portugal
5 Ibrahim Afellay Midfielder Netherlands
6 Benik Afobe Forward The Democratic Republic Of Congo
7 Sergio Agüero Forward Argentina
9 Soufyan Ahannach Midfielder Netherlands
10 Ahmed Hegazi Defender Egypt
11 Nathan Aké Defender Netherlands
14 Toby Alderweireld Defender Belgium
15 Aleix García Midfielder Spain
17 Ali Gabr Defender Egypt
18 Allan Nyom Defender Cameroon
19 Allan Souza Midfielder Brazil
20 Joe Allen Midfielder Wales
22 Marcos Alonso Defender Spain
23 Paulo Alves Midfielder Portugal
24 Daniel Amartey Midfielder Ghana
25 Jordi Amat Defender Spain
27 Ethan Ampadu Defender Wales
28 Nordin Amrabat Forward Morocco
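And to get only the England players, as the question actually asked, just flip the comparison:
print(df.loc[df['Nationality'] == 'England'])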

Read excel file in spyder but some data missing

I am new to Python and am trying to read my Excel file in Spyder (Anaconda). However, when I run it, some rows are missing and replaced with '...'. I have seven columns and 100 rows in my Excel file. The column arrangement also looks quite weird.
This is my code:
import pandas as pd

print(" Comparing within 100 Airline \n\n")

def view():
    airlines = pd.ExcelFile('Airline_final.xlsx')
    df1 = pd.read_excel("Airline_final.xlsx", sheet_name=2)
    print("\n\n 1: list of all Airlines \n")
    print(df1)

view()
Here is what I get:
18 #051 Cubana Cuba
19 #003 Aigle Azur France
20 #011 Air Corsica France
21 #012 Air France France
22 #019 Air Mediterranee France
23 #050 Corsair France
24 #072 HOP France
25 #087 Joon France
26 #006 Air Berlin Germany
27 #049 Condor Flugdienst Germany
28 #057 Eurowings Germany
29 #064 Germania Germany
.. ... ... ...
70 #018 Air Mandalay Myanmar
71 #020 Air KBZ Myanmar
72 #067 Golden Myanmar Airlines Myanmar
73 #017 Air Koryo North Korea
74 #080 Jetstar Asia Singapore
75 #036 Binter Canarias Spain
76 #040 Canaryfly Spain
77 #073 Iberia and Iberia Express Spain
To print the whole dataframe, use:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df1)
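Alternatively, set the options globally once; this affects every subsequent print in the session:
import pandas as pd

# Show every row and column instead of the truncated view.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(df1)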
