I have some questions regarding web scraping with Selenium for Python. I attempted to scrape a table of Pokémon names and stats from pokemondb.net, and I saved that data into a pandas DataFrame in my Jupyter notebook. The problem is that it takes 2-3 minutes to scrape all the data, which seems too time-consuming. I was wondering if I did a poor job of coding my web scraping program. I also programmed it to scrape the table data one column at a time, and I believe this may be one reason it is not as efficient as it could be. I would appreciate it if anyone could take a look and offer suggestions.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import os
import numpy as np
import pandas as pd
import matplotlib as plt
driver = webdriver.Chrome('drivers/chromedriver.exe') # create the Chrome driver from the chromedriver executable path
driver.get('https://pokemondb.net/pokedex/all') # get request - opens chrome browser and navigates to URL
driver.minimize_window() # minimize window
pokemon_id = []
pokemon_id_html = driver.find_elements(By.CLASS_NAME, 'infocard-cell-data') # retrieve the pokemon id column from pokemondb.net
for poke_id in pokemon_id_html:
    pokemon_id.append(poke_id.text)
pokemon_name = []
pokemon_name_html = driver.find_elements(By.CLASS_NAME, 'ent-name') # retrieve the pokemon name column
for name in pokemon_name_html:
    pokemon_name.append(name.text)
pokemon_type = []
pokemon_type_html = driver.find_elements(By.CLASS_NAME, 'cell-icon') # retrieve pokemon type
for p_type in pokemon_type_html:
    pokemon_type.append(p_type.text)
pokemon_total = []
pokemon_total_html = driver.find_elements(By.CLASS_NAME, 'cell-total') # retrieve pokemon total stats
for total in pokemon_total_html:
    pokemon_total.append(total.text)
pokemon_hp = []
pokemon_hp_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][1]") # retrieve pokemon hp stat
for hp in pokemon_hp_html:
    pokemon_hp.append(hp.text)
pokemon_attack = []
pokemon_attack_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][2]") # retrieve pokemon attack stat
for attack in pokemon_attack_html:
    pokemon_attack.append(attack.text)
pokemon_defense = []
pokemon_defense_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][3]") # retrieve pokemon defense stat
for defense in pokemon_defense_html:
    pokemon_defense.append(defense.text)
pokemon_special_attack = []
pokemon_special_attack_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][4]") # retrieve pokemon sp. attack stat
for special_attack in pokemon_special_attack_html:
    pokemon_special_attack.append(special_attack.text)
pokemon_special_defense = []
pokemon_special_defense_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][5]") # retrieve pokemon sp. defense stat
for special_defense in pokemon_special_defense_html:
    pokemon_special_defense.append(special_defense.text)
pokemon_speed = []
pokemon_speed_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][6]") # retrieve pokemon speed stat
for speed in pokemon_speed_html:
    pokemon_speed.append(speed.text)
driver.close() # close driver, end session
columns = ['id', 'name', 'type', 'total', 'hp', 'attack', 'defense', 'special-attack', 'special-defense', 'speed'] # column names (labels) for dataset
attributes = [pokemon_id, pokemon_name, pokemon_type, pokemon_total, pokemon_hp, pokemon_attack, pokemon_defense, pokemon_special_attack, pokemon_special_defense, pokemon_speed] # list of values for each column (rows) for dataset
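For reference, I then combine these lists into the DataFrame roughly like this (a minimal sketch, assuming all ten lists come back the same length):
# hypothetical final step: map each column label onto its scraped list and build the frame
pokemon_df = pd.DataFrame(dict(zip(columns, attributes)))
print(pokemon_df.head())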
Though @platipus_on_fire_333's answer using page_source works perfectly, as an alternative you can also canonically identify the <table> element and achieve a similar result.
Solution
To scrape the table of pokemon names and stats from pokemondb.net, you need to induce WebDriverWait for visibility_of_element_located() and pass the table HTML to a pandas DataFrame, using the following locator strategy:
Code Block:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
driver.execute("get", {'url': 'https://pokemondb.net/pokedex/all'})
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#pokedex"))).get_attribute("outerHTML")
df = pd.read_html(data)
print(df)
Console Output:
[ # Name Type Total HP Attack Defense Sp. Atk Sp. Def Speed
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80
3 3 Venusaur Mega Venusaur Grass Poison 625 80 100 123 122 120 80
4 4 Charmander Fire 309 39 52 43 60 50 65
... ... ... ... ... ... ... ... ... ... ...
1070 902 Basculegion Female Water Ghost 530 120 92 65 100 75 78
1071 903 Sneasler Poison Fighting 510 80 130 60 40 80 120
1072 904 Overqwil Dark Poison 510 85 115 95 65 65 85
1073 905 Enamorus Incarnate Forme Fairy Flying 580 74 115 70 135 80 106
1074 905 Enamorus Therian Forme Fairy Flying 580 74 115 110 135 100 46
[1075 rows x 10 columns]]
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
# chrome_options.add_argument("--headless")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
browser.get("https://pokemondb.net/pokedex/all")
dfs = pd.read_html(str(browser.page_source))
dfs[0]
This returns a dataframe with 1075 rows × 10 columns:
# Name Type Total HP Attack Defense Sp. Atk Sp. Def Speed
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80
3 3 Venusaur Mega Venusaur Grass Poison 625 80 100 123 122 120 80
4 4 Charmander Fire 309 39 52 43 60 50 65
... ... ... ... ... ... ... ... ...
I have the following df of Premier League players (ROI_top_players):
player team position cost_2223 total_points ROI
0 Mohamed Salah Liverpool FWD 13.0 259 29.77
1 Trent Alexander Liverpool DEF 8.4 206 24.52
2 Jarrod Bowen West Ham MID 8.5 204 23.56
3 Kevin De Bruyne Man City MID 12.0 190 15.70
4 Virgil van Dijk Liverpool DEF 6.5 183 14.91
... ... ... ... ... ... ...
151 Jamaal Lascelles Newcastle DEF 4.5 45 10.22
152 Ben Godfrey Everton GKP 4.5 45 9.57
153 Aaron Wan-Bissaka Man Utd DEF 4.5 41 8.03
154 Brandon Williams Norwich DEF 4.0 36 7.23
I want to create a list of 15 players (must be 15 - not more, not less), with the highest ROI possible, and it has to fulfill certain conditions:
Position constraints: it must have 2 GKP, 5 DEF, 5 MID, and 3 FWD
Budget constraint: I have a budget of $100, so for each player I add to the list, I must subtract the player's cost (cost_2223) from the budget.
Team constraint: It can't have more than 3 players per club.
Here's my current code:
def get_ideal_team_ROI(budget = 100, star_player_limit = 3, gk = 2, df = 5, md = 5, fwd = 3):
    money_team = []
    budget = budget
    positions = {'GK': gk, 'DEF': df, 'MID': md, 'FWD': fwd}
    for index, row in ROI_top_players.iterrows():
        if (budget >= row['cost_2223'] and positions[row['position']] > 0):
            money_team.append(row['player'])
            budget -= row['cost_2223']
            positions[row['position']] = positions[row['position']] - 1
    return money_team
This code has two problems:
It creates the list BUT, the list does not end up with 15 players.
It doesn't fulfill the team constraint (I have more than 3 players per team).
How should I tackle this? I want my code to make sure that I always have enough budget to buy 15 players and that I always have at maximum 3 players per team.
I do not need all possible combinations. Just ONE team with the highest possible ROI.
As the OP did not provide the data, I went and scraped the first 'Fantasy Football players list' I could find. There is no ROI in that data, but there are 'Points', which we will try to maximize, so the OP can apply the same approach to maximize the ROI in their data.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
import pandas as pd
from pulp import *
## get some data approximating OP's data
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
big_df = pd.DataFrame()
url = 'https://fantasy.premierleague.com/player-list/'
browser.get(url)
try:
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Accept All Cookies']"))).click()
    print('cookies accepted')
except Exception as e:
    print('no cookies for you!')
tables_divs = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//table/parent::div/parent::div")))
for t in tables_divs:
    category = t.find_element(By.TAG_NAME, 'h3')
    print(category.text)
    WebDriverWait(t, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//table")))
    dfs = pd.read_html(t.get_attribute('outerHTML'))
    for df in dfs:
        df['Type'] = category.text
        big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
big_df.to_json('f_footie.json')
browser.quit()
footie_df = pd.read_json('f_footie.json')
footie_df.columns = ['Player', 'Team', 'Points', 'Cost', 'Position']
footie_df['Player'] = footie_df.apply( lambda row: row.Player.replace(' ', '_').strip(), axis=1)
footie_df['Cost'] = footie_df.apply( lambda row: row.Cost.split('£')[1], axis=1)
footie_df['Cost'] = footie_df['Cost'].astype('float')
footie_df['Points'] = footie_df['Points'].astype('int')
print(footie_df)
## constraining variables
positions = footie_df.Position.unique()
clubs = footie_df.Team.unique()
budget = 100
available_roles = {
'Goalkeepers': 2,
'Defenders': 5,
'Midfielders': 5,
'Forwards': 3
}
names = [footie_df.Player[i] for i in footie_df.index]
teams = [footie_df.Team[i] for i in footie_df.index]
roles = [footie_df.Position[i] for i in footie_df.index]
costs = [footie_df.Cost[i] for i in footie_df.index]
points = [footie_df.Points[i] for i in footie_df.index]
players = [LpVariable("player_" + str(i), cat="Binary") for i in footie_df.index]
prob = LpProblem("Secret Fantasy Player Choices", LpMaximize)
## define the objective -> maximize the points
prob += lpSum(players[i] * points[i] for i in range(len(footie_df)))
## define budget constraint
prob += lpSum(players[i] * footie_df.Cost[footie_df.index[i]] for i in range(len(footie_df))) <= budget
for pos in positions:
    prob += lpSum(players[i] for i in range(len(footie_df)) if roles[i] == pos) <= available_roles[pos]
## add max 3 per team constraint
for club in clubs:
    prob += lpSum(players[i] for i in range(len(footie_df)) if teams[i] == club) <= 3
prob.solve()
df_list = []
for variable in prob.variables():
    if variable.varValue != 0:
        name = footie_df.Player[int(variable.name.split("_")[1])]
        club = footie_df.Team[int(variable.name.split("_")[1])]
        role = footie_df.Position[int(variable.name.split("_")[1])]
        points = footie_df.Points[int(variable.name.split("_")[1])]
        cost = footie_df.Cost[int(variable.name.split("_")[1])]
        df_list.append((name, club, role, points, cost))
        # print(name, club, position, points, cost)
result_df = pd.DataFrame(df_list, columns = ['Name', 'Club', 'Role', 'Points', 'Cost'])
result_df.to_csv('win_at_fantasy_football.csv')
print(result_df)
This will display some control printouts, the scraped data, the long printout from the PuLP solver, and finally the result dataframe, which looks like this:
    Name              Club         Role         Points  Cost
0   Alisson           Liverpool    Goalkeepers  176     5.5
1   Lloris            Spurs        Goalkeepers  158     5.5
2   Bowen             West Ham     Midfielders  206     8.5
3   Saka              Arsenal      Midfielders  179     8
4   Maddison          Leicester    Midfielders  181     8
5   Ward-Prowse       Southampton  Midfielders  159     6.5
6   Gallagher         Chelsea      Midfielders  140     6
7   Antonio           West Ham     Forwards     140     7.5
8   Toney             Brentford    Forwards     139     7
9   Mbeumo            Brentford    Forwards     119     6
10  Alexander-Arnold  Liverpool    Defenders    208     7.5
11  Robertson         Liverpool    Defenders    186     7
12  Cancelo           Man City     Defenders    201     7
13  Gabriel           Arsenal      Defenders    146     5
14  Cash              Aston Villa  Defenders    147     5
For PuLP documentation, visit https://coin-or.github.io/pulp/
I'm trying to get all the price in the table at this URL:
https://www.skyscanner.it/trasporti/voli/bud/rome/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27539793&inboundaltsenabled=true&infants=0&iym=2208&originentityid=27539604&outboundaltsenabled=true&oym=2208&preferdirects=false&ref=home&rtn=1&selectedoday=01&selectediday=01
The table elements are the days with the related price.
This is what I'm trying to do to get the table:
#Attempt 1
week = table.find_element(By.CLASS_NAME, "BpkCalendarGrid_bpk-calendar-grid__NzBmM month-view-grid--data-loaded")
#Attempt 2
table = driver.find_element(by=By.XPATH, value="XPath copied using Chrome inspector")
However I cannot get it.
What is the correct way to extract all the prices from this table? Thanks!
You can grab the table data, meaning all the prices, using Selenium with a pandas DataFrame. There are two tables containing the price data.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
#chrome to stay open
option.add_experimental_option("detach", True)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=option)
driver.get('https://www.skyscanner.it/trasporti/voli/bud/rome/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27539793&inboundaltsenabled=true&infants=0&iym=2208&originentityid=27539604&outboundaltsenabled=true&oym=2208&preferdirects=false&ref=home&rtn=1&selectedoday=01&selectediday=01')
table = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '(//table)[1]'))).get_attribute("outerHTML")
table_2 = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '(//table)[2]'))).get_attribute("outerHTML")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="acceptCookieButton"]'))).click()
df1 = pd.read_html(table)[0]
print(df1)
df2 = pd.read_html(table_2)[0]
print(df2)
Output:
lun mar mer gio ven sab dom
0 1€ 40 2€ 28 3€ 32 4€ 37 5€ 34 6€ 35 7€ 34
1 8€ 34 9€ 28 10€ 27 11€ 26 12€ 26 13€ 46 14€ 35
2 15€ 35 16€ 40 17€ 36 18€ 51 19€ 28 20€ 33 21€ 36
3 22€ 38 23€ 38 24€ 30 25€ 50 26€ 43 27€ 50 28€ 51
4 29€ 38 30€ 36 31€ 58 1- 2- 3- 4-
5 5- 6- 7- 8- 9- 10- 11-
lun mar mer gio ven sab dom
0 1€ 40 2€ 28 3€ 32 4€ 37 5€ 34 6€ 35 7€ 34
1 8€ 34 9€ 28 10€ 27 11€ 26 12€ 26 13€ 46 14€ 35
2 15€ 35 16€ 40 17€ 36 18€ 51 19€ 28 20€ 33 21€ 36
3 22€ 38 23€ 38 24€ 30 25€ 50 26€ 43 27€ 50 28€ 51
4 29€ 38 30€ 36 31€ 58 1- 2- 3- 4-
5 5- 6- 7- 8- 9- 10- 11-
Alternative solution (Table 1): this way you can extract the prices from table two as well.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
#chrome to stay open
option.add_experimental_option("detach", True)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=option)
driver.get('https://www.skyscanner.it/trasporti/voli/bud/rome/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27539793&inboundaltsenabled=true&infants=0&iym=2208&originentityid=27539604&outboundaltsenabled=true&oym=2208&preferdirects=false&ref=home&rtn=1&selectedoday=01&selectediday=01')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="acceptCookieButton"]'))).click()
table = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, '(//table)[1]/tbody/tr/td')))
for i in table:
    price = i.find_element(By.XPATH, './/div[@class="price"]').text.replace('€', '').strip()
    print(price)
Output:
39
30
32
37
34
35
34
34
28
27
26
26
46
35
35
40
36
52
29
34
37
39
39
30
50
44
50
52
38
36
58
I've read through countless other posts and tried a lot of techniques, but I can't seem to get the data I want from a table on the below website. I can only return other divs and their classes, but not the values.
I am looking to get all the rows from the three columns (by airline, by origin airport, by destination airport) here:
https://flightaware.com/live/cancelled
I've tried searching for the 'th class' but it only returns the div information and not the data.
Any help is appreciated
Thank you
my attempt:
rows = soup.findAll('table', attrs={'class': 'cancellation_boards'})
for r in rows:
    t = r.find_all_next('div', attrs={'class': 'cancellation_board'})
    for r in rows:
        r.text
The data you see is loaded via an Ajax request, so BeautifulSoup doesn't see it. You can simulate the request with requests. To load the data into one big dataframe, you can use the next example:
import requests
import pandas as pd
url = "https://flightaware.com/ajax/airport/cancelled_count.rvt"
params = {
    "type": "airline",
    "timeFilter": "((b.sch_block_out BETWEEN '2022-04-02 8:00' AND '2022-04-03 8:00') OR (b.sch_block_out IS NULL AND b.filed_departuretime BETWEEN '2022-04-02 8:00' AND '2022-04-03 8:00'))",
    "timePeriod": "today",
    "airportFilter": "",
}
all_dfs = []
for params["type"] in ("airline", "destination", "origin"):
    df = pd.read_html(requests.get(url, params=params).text)[0]
    df["type"] = params["type"]
    all_dfs.append(df)
df_final = pd.concat(all_dfs)
print(df_final)
df_final.to_csv("data.csv", index=False)
Prints:
Airline Airport Cancelled Delayed type
Airline Airport # % # %
0 China Eastern NaN 509 45% 28 2% airline
1 Spring Airlines NaN 443 82% 5 0% airline
2 Southwest NaN 428 12% 1369 39% airline
3 American Airlines NaN 317 10% 472 16% airline
4 Delta NaN 229 8% 444 16% airline
5 Spirit NaN 190 23% 207 26% airline
6 Hainan Airlines NaN 167 41% 9 2% airline
7 JetBlue NaN 144 14% 494 48% airline
8 Lion Air NaN 129 20% 53 8% airline
9 easyJet NaN 121 8% 471 32% airline
...
and saves the data to data.csv.
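Since each chunk is tagged via the type column, here is a quick follow-up sketch (assuming the code above has run) to pull out just one of the three boards:
# e.g. keep only the per-airline cancellation board
airline_df = df_final[df_final["type"] == "airline"]
print(airline_df.head())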
As the URL is dynamic, you can also grab the table data with pandas via Selenium.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
driver = webdriver.Chrome(ChromeDriverManager().install())
url ="https://flightaware.com/live/cancelled"
driver.maximize_window()
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
table=soup.select_one('table.cancellation_boards')
#driver.close()
df = pd.read_html(str(table),header=0)[0]
print(df)
Output:
By airline Unnamed: 1 By origin airport Unnamed: 3 By destination airport
0 Cancelled Cancelled Delayed Delayed Airline
1 # % # % Airline
2 Cancelled Cancelled Delayed Delayed Airport
3 # % # % Airport
4 Cancelled Cancelled Delayed Delayed Airport
.. ... ... ... ... ...
308 10 3% 55 17% Luis Munoz Marin Intl (SJU)
309 10 3% 126 39% Geneva Cointrin Int'l (GVA)
310 9 2% 33 10% Sydney (SYD)
311 9 11% 17 21% Punta Gorda (PGD)
312 9 3% 10 3% Chengdu Shuangliu Int'l (CTU)
[313 rows x 5 columns]
Good day everyone:
I'd like to get the basketball game data from the web, including league, date, time and score.
The first-level for loop works fine to get every league title:
for league in leagues:
But with the second-level for loop
for row in _rows:
I always get every league's rows; I just need the data league by league.
What should I do to fix it?
Any help will be greatly appreciated.
from selenium import webdriver
#from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Chrome()
driver.set_window_size(1500,1350)
# open url (sorry for the url , cause system always report its a spam)
driver.get("https://"+"we"+"b2."+"sa8"+"8"+"88.n"+"et"+"/sp"+"ort/Ga"+"mes.aspxdevice=pc")
# jump to basketball
locator = (By.XPATH, '//*[@id="menuList"]/div/ul/li[3]/div[2]/a[1]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
time.sleep(1)
# date menu
locator = (By.XPATH, '//*[@id="chooseDate"]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
# jump to date 1
locator = (By.XPATH, '//*[@id="dateOption"]/a[1]/span[1]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
# close AD by double click
locator = (By.ID, 'btn_close')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
actions = ActionChains(driver)
actions.click(pointer).perform()
# list all leagues schedule
leagues = []
leagues = driver.find_elements(By.XPATH, '//*[@id="scheduleBottom"]/table[*]')
for league in leagues:
    #print("Block.text=",Block.text,"\n")
    #_rows = Block.find_elements(By.TAG_NAME, "tr")
    league_Title = league.find_element(By.TAG_NAME, 'caption')
    _rows = []
    _rows = league.find_elements(By.XPATH, "//*[contains(@id, '_mainRow') or contains(@id, '_secondRow')]")
    print("\nleague : ", league_Title.text, 'len(_rows)=', len(_rows))
    for row in _rows:
        print(league_Title, row.text)  # ," / _rows=",_rows)
        # first_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_mainRow')]")
        # second_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_secondRow')]")
        print("\trow : ", row.text)
        time.sleep(1)
time.sleep(120)
driver.quit()
I can't run the code because the page shows Error 404.
EDIT: in the question you forgot the ? in Games.aspx?device=pc, and that caused the 404 problem.
You have to use a dot . at the beginning of the XPath to make the path relative to league:
_rows = league.find_elements(By.XPATH, ".//...rest...") # <-- dot before `//`
You used an absolute XPath, so it searches the full HTML.
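A small illustration of the difference (the tr tag here is just an example, assuming league is one of the table elements found above):
# absolute XPath: searches the whole document, even when called on `league`
rows_in_whole_page = league.find_elements(By.XPATH, "//tr")
# relative XPath (leading dot): searches only inside this `league` element
rows_in_this_league = league.find_elements(By.XPATH, ".//tr")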
EDIT:
Partial result with dot in xpath:
I used lang=3 to get English text, and a[2] to select the second date (03 / 06 (Sun)) because the first date (03 / 05 (Sat)) was empty (no matches).
url: https://web2.sa8888.net/sport/Games.aspx?lang=3&device=pc
len(leagues): 115
league: NBA len(_rows)= 12
row: 06:05 Finished Dallas Mavericks 26 25 34 29 114 | Live Update
row: Sacramento Kings 36 29 27 21 113
row: 08:05 Finished Charlotte Hornets 31 31 37 24 123 | Live Update
row: San Antonio Spurs 30 30 37 20 117
row: 09:05 Finished Miami Heat 22 32 19 26 99 | Live Update
row: Philadelphia 76ers 14 26 28 14 82
row: 09:05 Finished Memphis Grizzlies 31 37 29 27 124 | Live Update
row: Orlando Magic 29 16 29 22 96
row: 09:05 Finished Minnesota Timberwolves 32 31 46 26 135 | Live Update
row: Portland Trail Blazers 34 30 37 20 121
row: 09:35 Finished Los Angeles Lakers 32 30 27 35 124 | Live Update
row: Golden State Warriors 25 42 27 22 116
---
league: NBA GATORADE LEAGUE len(_rows)= 8
row: 08:00 Finished Delaware Blue Coats 42 34 37 33 146 | Live Update
row: Westchester Knicks 28 28 24 31 111
row: 09:00 Finished Austin Spurs 35 21 23 31 110 | Live Update
row: Salt Lake City Stars 30 32 21 17 100
row: 09:00 Finished Wisconsin Herd 26 30 20 38 114 | Live Update
row: Capital City Go-Go 27 31 32 38 128
row: 11:00 Finished Santa Cruz Warriors 36 19 17 27 99 | Live Update
row: Memphis Hustle 26 29 22 30 107
---
league: CHINA PROFESSIONAL BASKETBALL LEAGUE len(_rows)= 12
row: 11:00 Finished Fujian Sturgeons 37 21 27 32 117 | Live Update
row: Ningbo Rockets 24 28 34 25 111
row: 11:00 Finished Sichuan Blue Whales 12 21 27 20 80 | Live Update
row: Zhejiang Lions 23 27 35 25 110
row: 15:00 Finished Shenzhen Leopards 23 32 30 33 118 | Live Update
row: Shandong Hi Speed 29 25 32 29 115
row: 15:30 Finished Jilin Northeast Tigers 36 39 25 18 118 | Live Update
row: Shanghai Sharks 15 25 32 36 108
row: 19:35 Finished Beijing Ducks 24 20 17 22 83 | Live Update
row: Beijing Royal Fighters 18 18 21 22 79
row: 20:00 Finished Nanjing Monkey King 23 24 23 25 95 | Live Update
row: Jiangsu Dragons 18 17 21 24 80
---
Full working code:
I also added WebDriverWait to wait for the leagues, and row.text.replace('\n', ' | ') to display each row on one line.
from selenium import webdriver
#from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Chrome()
driver.set_window_size(1500, 1350)
# open url (sorry for the url , cause system always report its a spam)
driver.get("https://web2.sa8888.net/sport/Games.aspx?lang=3&device=pc") # lang=3 for English
# jump to basketball
locator = (By.XPATH, '//*[@id="menuList"]/div/ul/li[3]/div[2]/a[1]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
time.sleep(1)
# date menu
locator = (By.XPATH, '//*[@id="chooseDate"]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
# jump to date 1
locator = (By.XPATH, '//*[@id="dateOption"]/a[2]/span[1]') # a[2] for second date, because first has no matches
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
# close AD by double click
locator = (By.ID, 'btn_close')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
actions = ActionChains(driver)
actions.click(pointer).perform()
# wait for leagues
locator = (By.XPATH, '//*[@id="scheduleBottom"]/table[*]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
# list all leagues schedule
leagues = driver.find_elements(By.XPATH, '//*[@id="scheduleBottom"]/table[*]')
print('len(leagues):', len(leagues))
for league in leagues:
    #print("Block.text=",Block.text,"\n")
    #_rows = Block.find_elements(By.TAG_NAME, "tr")
    league_Title = league.find_element(By.TAG_NAME, 'caption')
    _rows = league.find_elements(By.XPATH, ".//*[contains(@id, '_mainRow') or contains(@id, '_secondRow')]")
    print("\nleague:", league_Title.text, 'len(_rows)=', len(_rows))
    for row in _rows:
        #print(league_Title, row.text) #," / _rows=",_rows)
        # first_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_mainRow')]")
        # second_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_secondRow')]")
        print("\trow:", row.text.replace('\n', ' | ')) # <- clean text
        time.sleep(1)
    print('---')
time.sleep(120)
driver.quit()
find_element() (or BeautifulSoup's find()) returns only one element: the first match on the page, even when multiple elements match.
find_elements() (or findAll()) returns all matching elements on the page, as a list.
Hope this helps you some.
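A rough sketch of the difference (assuming a driver that is already on a page containing a table):
from selenium.webdriver.common.by import By
# find_element: only the first matching element (raises NoSuchElementException if none)
first_row = driver.find_element(By.CSS_SELECTOR, "table tr")
# find_elements: every matching element, returned as a (possibly empty) list
all_rows = driver.find_elements(By.CSS_SELECTOR, "table tr")
print(first_row.text, len(all_rows))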
I'm trying to click the "Show more" button, but I can't.
Any help? Thank you very much.
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('https://www.scorespro.com/basketball/china/cba/results/')
time.sleep(2)
showmore = driver.find_element_by_link_text("Show more")
showmore.click()
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")``
Dividing the height by some number decreases the scroll distance, so the page stops where "Show more" is visible:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('https://www.scorespro.com/basketball/china/cba/results/')
driver.execute_script("window.scrollTo(0, document.body.scrollHeight/1.35);")
time.sleep(2)
driver.find_element_by_class_name("show_more").click()
time.sleep(2)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
If you are after the tables, you don't need to use Selenium. You can pull the data straight away with requests and parse it with pandas.
To find the request, go to the page, right-click and choose 'Inspect' (or Shift-Ctrl-I). This opens a side panel; go to Network and then XHR, and browse those requests (click on Preview to see what each one returns). You may need to 1) reload the page, and 2) click around the table. For example, once I clicked "show more" at the bottom of the table, the request popped up.
Once you find it, click on Headers and you'll see the url, payload, etc.
import requests
import pandas as pd
url = 'https://www.scorespro.com/basketball/ajaxdata_more.php'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Mobile Safari/537.36'}
dfList = []
for season in ['2018-2019', '2019-2020']:
    continueLoop = True
    page = 1
    while continueLoop == True:
        print('%s Collecting Page: %s' % (season, page))
        payload = {
            'country': 'china',
            'comp': 'cba',
            'season': season,
            'status': 'results',
            'league': '',
            'page': '%s' % page}
        response = requests.get(url, headers=headers, params=payload)
        try:
            dfs = pd.read_html(response.text)
            dfList.extend(dfs)
        except ValueError:
            print('No more tables found.')
            continueLoop = False
        page += 1
dfList_single = []
cols = ['Date','Final Time', 'Team', 'Final Score','Q1','Q2','Q3','Q4','OT','Half Time Score','Combined Total Score']
for each in dfList:
    each = each.loc[:, [0, 1, 2, 5, 6, 7, 8, 9, 10, 11, 12]]
    each.columns = cols
    teamA = each.iloc[0, :]
    teamB = each.iloc[1, 2:]
    temp_df = pd.concat([teamA, teamB], axis=0).to_frame().T
    dfList_single.append(temp_df)
df = pd.concat(dfList_single)
df = df.reset_index(drop=True)
Output:
print(df.head(10).to_string())
Date Final Time Team Final Score Q1 Q2 Q3 Q4 Half Time Score Combined Total Score
0 15.08.20 15:00 FT Guandong 123 26 28 41 28 54 238
1 15.08.20 15:00 FT Liaoning 115 29 18 39 29 47 238
2 13.08.20 15:00 FT Liaoning 115 34 24 23 34 58 228
3 13.08.20 15:00 FT Guandong 113 38 28 27 20 66 228
4 11.08.20 15:00 FT Guandong 110 25 30 25 30 55 198
5 11.08.20 15:00 FT Liaoning 88 24 26 23 15 50 198
6 08.08.20 15:00 FT Guandong 88 16 21 26 25 37 173
7 08.08.20 15:00 FT Beijing 85 13 24 20 28 37 173
8 07.08.20 15:00 FT Liaoning 119 22 40 29 28 62 232
9 07.08.20 15:00 FT Xinjiang 113 33 22 34 24 55 232
Two issues with your code:
First, you are trying to scroll after clicking, whereas it should be before. Second, you are using the screen height, which may work on one device but not on another if the size varies.
A better way is to scroll to the element itself and then click. See the code below; it worked fine:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('..\drivers\chromedriver')
driver.get("https://www.scorespro.com/basketball/china/cba/results/")
driver.maximize_window()
# I have accept cookies on page , so below step
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//a[text()='Agree']"))).click()
showMore = driver.find_element_by_link_text("Show more")
driver.execute_script("arguments[0].scrollIntoView();", showMore)
time.sleep(2)
showMore.click()