I'm trying to get all the prices in the table at this URL:
https://www.skyscanner.it/trasporti/voli/bud/rome/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27539793&inboundaltsenabled=true&infants=0&iym=2208&originentityid=27539604&outboundaltsenabled=true&oym=2208&preferdirects=false&ref=home&rtn=1&selectedoday=01&selectediday=01
The table elements are the days with their related prices.
This is what I'm trying to do to get the table:
#Attempt 1
week = table.find_element(By.CLASS_NAME, "BpkCalendarGrid_bpk-calendar-grid__NzBmM month-view-grid--data-loaded")
#Attempt 2
table = driver.find_element(by=By.XPATH, value="XPath copied using Chrome inspector")
However, I cannot get it.
What is the correct way to extract all the prices from this table? Thanks!
You can grab the table data, i.e. all the prices, using Selenium together with a pandas DataFrame. There are two tables on the page that contain the price data.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
#chrome to stay open
option.add_experimental_option("detach", True)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=option)
driver.get('https://www.skyscanner.it/trasporti/voli/bud/rome/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27539793&inboundaltsenabled=true&infants=0&iym=2208&originentityid=27539604&outboundaltsenabled=true&oym=2208&preferdirects=false&ref=home&rtn=1&selectedoday=01&selectediday=01')
table = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '(//table)[1]'))).get_attribute("outerHTML")
table_2 = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '(//table)[2]'))).get_attribute("outerHTML")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="acceptCookieButton"]'))).click()
df1 = pd.read_html(table)[0]
print(df1)
df2 = pd.read_html(table_2)[0]
print(df2)
Output:
lun mar mer gio ven sab dom
0 1€ 40 2€ 28 3€ 32 4€ 37 5€ 34 6€ 35 7€ 34
1 8€ 34 9€ 28 10€ 27 11€ 26 12€ 26 13€ 46 14€ 35
2 15€ 35 16€ 40 17€ 36 18€ 51 19€ 28 20€ 33 21€ 36
3 22€ 38 23€ 38 24€ 30 25€ 50 26€ 43 27€ 50 28€ 51
4 29€ 38 30€ 36 31€ 58 1- 2- 3- 4-
5 5- 6- 7- 8- 9- 10- 11-
lun mar mer gio ven sab dom
0 1€ 40 2€ 28 3€ 32 4€ 37 5€ 34 6€ 35 7€ 34
1 8€ 34 9€ 28 10€ 27 11€ 26 12€ 26 13€ 46 14€ 35
2 15€ 35 16€ 40 17€ 36 18€ 51 19€ 28 20€ 33 21€ 36
3 22€ 38 23€ 38 24€ 30 25€ 50 26€ 43 27€ 50 28€ 51
4 29€ 38 30€ 36 31€ 58 1- 2- 3- 4-
5 5- 6- 7- 8- 9- 10- 11-
Alternative solution (Table 1): this way you can extract the prices from table two as well (a short sketch for table two follows after the output below).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
#chrome to stay open
option.add_experimental_option("detach", True)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=option)
driver.get('https://www.skyscanner.it/trasporti/voli/bud/rome/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27539793&inboundaltsenabled=true&infants=0&iym=2208&originentityid=27539604&outboundaltsenabled=true&oym=2208&preferdirects=false&ref=home&rtn=1&selectedoday=01&selectediday=01')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="acceptCookieButton"]'))).click()
table = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, '(//table)[1]/tbody/tr/td')))
for i in table:
    price = i.find_element(By.XPATH, './/div[@class="price"]').text.replace('€', '').strip()
    print(price)
Output:
39
30
32
37
34
35
34
34
28
27
26
26
46
35
35
40
36
52
29
34
37
39
39
30
50
44
50
52
38
36
58
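As referenced above, the same loop works for the second table as well. Here is a minimal sketch (not part of the original answer) that pairs each day with its price from (//table)[2], assuming the driver session from the block above and that each calendar cell still contains a div with class "price" as in the first table:
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Assumes `driver` is the session created in the block above.
cells = WebDriverWait(driver, 20).until(
    EC.visibility_of_all_elements_located((By.XPATH, '(//table)[2]/tbody/tr/td')))

rows = []
for cell in cells:
    # Cells without a price (e.g. days of the following month) are skipped.
    prices = cell.find_elements(By.XPATH, './/div[@class="price"]')
    if prices:
        day = cell.text.split('\n')[0]  # the day number is the first line of the cell text
        rows.append({'day': day, 'price': prices[0].text.replace('€', '').strip()})

df = pd.DataFrame(rows)
print(df)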
I have some questions regarding web scraping with Selenium for Python. I attempted to scrape a table of pokemon names and stats from pokemondb.net, and I saved that data into a pandas DataFrame in my Jupyter notebook. The problem is that it takes 2-3 minutes to scrape all the data, which seems a bit too time-consuming. I was wondering if maybe I did a poor job of coding my web-scraping program? I also programmed it to scrape the table data one column at a time, and I believe that this may be one reason why it is not as efficient as possible. I would appreciate it if anyone could take a look and offer any suggestions.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import os
import numpy as np
import pandas as pd
import matplotlib as plt
driver = webdriver.Chrome('drivers/chromedriver.exe') # assign the driver path to variable
driver.get('https://pokemondb.net/pokedex/all') # get request - opens chrome browser and navigates to URL
driver.minimize_window() # minimize window
pokemon_id = []
pokemon_id_html = driver.find_elements(By.CLASS_NAME, 'infocard-cell-data') # retrieve the pokemon id column from pokemondb.net
for poke_id in pokemon_id_html:
    pokemon_id.append(poke_id.text)
pokemon_name = []
pokemon_name_html = driver.find_elements(By.CLASS_NAME, 'ent-name') # retrieve the pokemon name column
for name in pokemon_name_html:
    pokemon_name.append(name.text)
pokemon_type = []
pokemon_type_html = driver.find_elements(By.CLASS_NAME, 'cell-icon') # retrieve pokemon type
for p_type in pokemon_type_html:
    pokemon_type.append(p_type.text)
pokemon_total = []
pokemon_total_html = driver.find_elements(By.CLASS_NAME, 'cell-total') # retrieve pokemon total stats
for total in pokemon_total_html:
    pokemon_total.append(total.text)
pokemon_hp = []
pokemon_hp_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][1]") # retrieve pokemon hp stat
for hp in pokemon_hp_html:
    pokemon_hp.append(hp.text)
pokemon_attack = []
pokemon_attack_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][2]") # retrieve pokemon attack stat
for attack in pokemon_attack_html:
    pokemon_attack.append(attack.text)
pokemon_defense = []
pokemon_defense_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][3]") # retrieve pokemon defense stat
for defense in pokemon_defense_html:
    pokemon_defense.append(defense.text)
pokemon_special_attack = []
pokemon_special_attack_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][4]") # retrieve pokemon sp. attack stat
for special_attack in pokemon_special_attack_html:
    pokemon_special_attack.append(special_attack.text)
pokemon_special_defense = []
pokemon_special_defense_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][5]") # retrieve pokemon sp. defense stat
for special_defense in pokemon_special_defense_html:
    pokemon_special_defense.append(special_defense.text)
pokemon_speed = []
pokemon_speed_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][6]") # retrieve pokemon speed stat
for speed in pokemon_speed_html:
    pokemon_speed.append(speed.text)
driver.close() # close driver, end session
columns = ['id', 'name', 'type', 'total', 'hp', 'attack', 'defense', 'special-attack', 'special-defense', 'speed'] # column names (labels) for dataset
attributes = [pokemon_id, pokemon_name, pokemon_type, pokemon_total, pokemon_hp, pokemon_attack, pokemon_defense, pokemon_special_attack, pokemon_special_defense, pokemon_speed] # list of values for each column (rows) for dataset
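For reference, here is a minimal sketch (not part of the original question) of how the columns and attributes lists above could be combined into a DataFrame, assuming all the lists end up with one entry per row:
import pandas as pd

# Minimal sketch: zip the column labels with the per-column value lists,
# assuming every list collected above has the same length.
pokemon_df = pd.DataFrame(dict(zip(columns, attributes)))
print(pokemon_df.head())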
Though @platipus_on_fire_333's answer using page_source is perfect, as an alternative you can also canonically identify the <table> element and achieve a similar result.
Solution
To web scrape the table of pokemon names and stats from pokemondb.net you need to induce WebDriverWait for visibility_of_element_located(), and using DataFrame from pandas you can use the following Locator Strategy:
Code Block:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
driver.execute("get", {'url': 'https://pokemondb.net/pokedex/all'})
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#pokedex"))).get_attribute("outerHTML")
df = pd.read_html(data)
print(df)
Console Output:
[ # Name Type Total HP Attack Defense Sp. Atk Sp. Def Speed
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80
3 3 Venusaur Mega Venusaur Grass Poison 625 80 100 123 122 120 80
4 4 Charmander Fire 309 39 52 43 60 50 65
... ... ... ... ... ... ... ... ... ... ...
1070 902 Basculegion Female Water Ghost 530 120 92 65 100 75 78
1071 903 Sneasler Poison Fighting 510 80 130 60 40 80 120
1072 904 Overqwil Dark Poison 510 85 115 95 65 65 85
1073 905 Enamorus Incarnate Forme Fairy Flying 580 74 115 70 135 80 106
1074 905 Enamorus Therian Forme Fairy Flying 580 74 115 110 135 100 46
[1075 rows x 10 columns]]
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
# chrome_options.add_argument("--headless")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
browser.get("https://pokemondb.net/pokedex/all")
dfs = pd.read_html(str(browser.page_source))
dfs[0]
This returns a dataframe with 1075 rows × 10 columns:
# Name Type Total HP Attack Defense Sp. Atk Sp. Def Speed
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80
3 3 Venusaur Mega Venusaur Grass Poison 625 80 100 123 122 120 80
4 4 Charmander Fire 309 39 52 43 60 50 65
... ... ... ... ... ... ... ... ...
Good day everyone:
I'd like to get the basketball game data from the web, including league, date, time and score.
The first-level for loop works fine to get every league title:
for league in leagues:
But with the second-level for loop
for row in _rows:
I always get all leagues' rows; I just need the data league by league.
What should I do to fix it?
Any help will be greatly appreciated.
from selenium import webdriver
#from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Chrome()
driver.set_window_size(1500,1350)
# open url (sorry for the url , cause system always report its a spam)
driver.get("https://"+"we"+"b2."+"sa8"+"8"+"88.n"+"et"+"/sp"+"ort/Ga"+"mes.aspxdevice=pc")
# jump to basketball
locator = (By.XPATH, '//*[#id="menuList"]/div/ul/li[3]/div[2]/a[1]')
pointer = WebDriverWait(driver, 10).until(
EC.presence_of_element_located(locator),
"element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
time.sleep(1)
# date menu
locator = (By.XPATH, '//*[#id="chooseDate"]')
pointer = WebDriverWait(driver, 10).until(
EC.presence_of_element_located(locator),
"element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
# jump to date 1
locator = (By.XPATH, '//*[#id="dateOption"]/a[1]/span[1]')
pointer = WebDriverWait(driver, 10).until(
EC.presence_of_element_located(locator),
"element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
# close AD by double click
locator = (By.ID, 'btn_close')
pointer = WebDriverWait(driver, 10).until(
EC.presence_of_element_located(locator),
"element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
actions = ActionChains(driver)
actions.click(pointer).perform()
# list all leagues schedule
leagues = []
leagues = driver.find_elements(By.XPATH, '//*[@id="scheduleBottom"]/table[*]')
for league in leagues:
    #print("Block.text=",Block.text,"\n")
    #_rows = Block.find_elements(By.TAG_NAME, "tr")
    league_Title = league.find_element(By.TAG_NAME, 'caption')
    _rows = []
    _rows = league.find_elements(By.XPATH, "//*[contains(@id, '_mainRow') or contains(@id, '_secondRow')]")
    print("\nleague : ", league_Title.text, 'len(_rows)=', len(_rows))
    for row in _rows:
        print(league_Title, row.text) #," / _rows=",_rows)
        # first_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_mainRow')]")
        # second_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_secondRow')]")
        print("\trow : ", row.text)
    time.sleep(1)
time.sleep(120)
driver.quit()
I can't run the code because the page shows Error 404.
EDIT: in the question you forgot the ? in Games.aspx?device=pc and this caused the 404 problem.
You have to use a dot . at the beginning of the XPath to make the path relative to league:
_rows = league.find_elements(By.XPATH, ".//...rest...") # <-- dot before `//`
You use an absolute XPath and it searches the full HTML.
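For example, a minimal sketch of the difference (assuming leagues and league are the table WebElements from the loop in the question):
from selenium.webdriver.common.by import By

for league in leagues:  # `leagues` as obtained in the question
    # Absolute XPath: ignores `league` and searches the whole page,
    # so every iteration returns the same rows.
    all_rows = league.find_elements(By.XPATH, "//tr")
    # Relative XPath (leading dot): searches only inside this league's table.
    own_rows = league.find_elements(By.XPATH, ".//tr")
    print(len(all_rows), len(own_rows))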
EDIT:
Partial result with dot in xpath:
I used lang=3 to get English text,
And I used a[2] to select second date (03 / 06 (Sun)) because first date (03 / 05 (Sat)) was empty (no matches)
url: https://web2.sa8888.net/sport/Games.aspx?lang=3&device=pc
len(leagues): 115
league: NBA len(_rows)= 12
row: 06:05 Finished Dallas Mavericks 26 25 34 29 114 | Live Update
row: Sacramento Kings 36 29 27 21 113
row: 08:05 Finished Charlotte Hornets 31 31 37 24 123 | Live Update
row: San Antonio Spurs 30 30 37 20 117
row: 09:05 Finished Miami Heat 22 32 19 26 99 | Live Update
row: Philadelphia 76ers 14 26 28 14 82
row: 09:05 Finished Memphis Grizzlies 31 37 29 27 124 | Live Update
row: Orlando Magic 29 16 29 22 96
row: 09:05 Finished Minnesota Timberwolves 32 31 46 26 135 | Live Update
row: Portland Trail Blazers 34 30 37 20 121
row: 09:35 Finished Los Angeles Lakers 32 30 27 35 124 | Live Update
row: Golden State Warriors 25 42 27 22 116
---
league: NBA GATORADE LEAGUE len(_rows)= 8
row: 08:00 Finished Delaware Blue Coats 42 34 37 33 146 | Live Update
row: Westchester Knicks 28 28 24 31 111
row: 09:00 Finished Austin Spurs 35 21 23 31 110 | Live Update
row: Salt Lake City Stars 30 32 21 17 100
row: 09:00 Finished Wisconsin Herd 26 30 20 38 114 | Live Update
row: Capital City Go-Go 27 31 32 38 128
row: 11:00 Finished Santa Cruz Warriors 36 19 17 27 99 | Live Update
row: Memphis Hustle 26 29 22 30 107
---
league: CHINA PROFESSIONAL BASKETBALL LEAGUE len(_rows)= 12
row: 11:00 Finished Fujian Sturgeons 37 21 27 32 117 | Live Update
row: Ningbo Rockets 24 28 34 25 111
row: 11:00 Finished Sichuan Blue Whales 12 21 27 20 80 | Live Update
row: Zhejiang Lions 23 27 35 25 110
row: 15:00 Finished Shenzhen Leopards 23 32 30 33 118 | Live Update
row: Shandong Hi Speed 29 25 32 29 115
row: 15:30 Finished Jilin Northeast Tigers 36 39 25 18 118 | Live Update
row: Shanghai Sharks 15 25 32 36 108
row: 19:35 Finished Beijing Ducks 24 20 17 22 83 | Live Update
row: Beijing Royal Fighters 18 18 21 22 79
row: 20:00 Finished Nanjing Monkey King 23 24 23 25 95 | Live Update
row: Jiangsu Dragons 18 17 21 24 80
---
Full working code:
I also added WebDriverWait to wait for the leagues.
And row.text.replace('\n', ' | ') to display each row on one line.
from selenium import webdriver
#from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Chrome()
driver.set_window_size(1500, 1350)
# open url (sorry for the url , cause system always report its a spam)
driver.get("https://web2.sa8888.net/sport/Games.aspx?lang=3&device=pc") # lang=3 for English
# jump to basketball
locator = (By.XPATH, '//*[#id="menuList"]/div/ul/li[3]/div[2]/a[1]')
pointer = WebDriverWait(driver, 10).until(
EC.presence_of_element_located(locator),
"element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
time.sleep(1)
# date menu
locator = (By.XPATH, '//*[#id="chooseDate"]')
pointer = WebDriverWait(driver, 10).until(
EC.presence_of_element_located(locator),
"element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
# jump to date 1
locator = (By.XPATH, '//*[#id="dateOption"]/a[2]/span[1]') # a[2] for second date, because first has no matches
pointer = WebDriverWait(driver, 10).until(
EC.presence_of_element_located(locator),
"element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
# close AD by double click
locator = (By.ID, 'btn_close')
pointer = WebDriverWait(driver, 10).until(
EC.presence_of_element_located(locator),
"element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
actions = ActionChains(driver)
actions.click(pointer).perform()
# wait for leagues
locator = (By.XPATH, '//*[#id="scheduleBottom"]/table[*]')
pointer = WebDriverWait(driver, 10).until(
EC.presence_of_element_located(locator),
"element not found"
)
# list all leagues schedule
leagues = driver.find_elements(By.XPATH, '//*[@id="scheduleBottom"]/table[*]')
print('len(leagues):', len(leagues))
for league in leagues:
    #print("Block.text=",Block.text,"\n")
    #_rows = Block.find_elements(By.TAG_NAME, "tr")
    league_Title = league.find_element(By.TAG_NAME, 'caption')
    _rows = league.find_elements(By.XPATH, ".//*[contains(@id, '_mainRow') or contains(@id, '_secondRow')]")
    print("\nleague:", league_Title.text, 'len(_rows)=', len(_rows))
    for row in _rows:
        #print(league_Title, row.text) #," / _rows=",_rows)
        # first_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_mainRow')]")
        # second_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_secondRow')]")
        print("\trow:", row.text.replace('\n', ' | ')) # <- clean text
    time.sleep(1)
    print('---')
time.sleep(120)
driver.quit()
I think find_element() (or BeautifulSoup's find()) is for only one element on the page. You will get just the first element of the list of elements if you use find_element() when there are multiple matching elements on the page.
And find_elements() (or findAll()) is for all matching elements on the page. This function will return the data as a list.
Hope this helps you some.
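For instance, a minimal sketch of the difference (assuming driver is already on a page containing a table, e.g. the results page from the question):
from selenium.webdriver.common.by import By

# find_element() returns only the first matching WebElement
# (and raises NoSuchElementException if nothing matches).
first_row = driver.find_element(By.CSS_SELECTOR, "table tr")

# find_elements() returns a list of all matching WebElements
# (an empty list if nothing matches).
all_rows = driver.find_elements(By.CSS_SELECTOR, "table tr")
print(len(all_rows))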
I tried to use the following code but it doesn't find the table, despite this having worked on other web pages.
from bs4 import BeautifulSoup
from selenium import webdriver
chromedriver = (r'C:\Users\c\chromedriver.exe')
driver = webdriver.Chrome(chromedriver)
driver.get("https://isodzz.nafta.sk/yCapacity/#/?nav=ss.od.nom.c&lng=EN")
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
table = soup.find_all('table', {'id':'nominations_point_data_c'})
print(table)
Do it like this. First you need to wait for the table to appear. This site is awfully slow to load. Since there is a table element in the HTML we can use pandas for a neat print.
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import pandas as pd
driver = webdriver.Chrome(executable_path='C:/bin/chromedriver.exe')
driver.get("https://isodzz.nafta.sk/yCapacity/#/?nav=ss.od.nom.c&lng=EN")
element = WebDriverWait(driver, 25).until(EC.visibility_of_element_located((By.CLASS_NAME, "MobileOverflow"))) #Element is present now
page = driver.page_source #Get the HTML of the page
df = pd.read_html(page) #Make pandas read the HTML
table = df[0] #Get the first table on the page
print(table)
Output:
Date: Confirmed Nomination
Date: Injection [MWh] Withdrawal [MWh]
0 01.11.2020 13 410.490 11 626.856
1 02.11.2020 11 874.096 12 227.510
2 03.11.2020 0.000 0.000
3 04.11.2020 0.000 0.000
4 05.11.2020 0.000 0.000
5 06.11.2020 0.000 0.000
6 07.11.2020 0.000 0.000
7 08.11.2020 0.000 0.000
8 09.11.2020 0.000 0.000
9 10.11.2020 0.000 0.000
10 11.11.2020 34 201.032 37 624.672
11 12.11.2020 54 427.560 27 940.872
12 13.11.2020 49 069.584 21 538.372
13 14.11.2020 54 361.138 15 312.000
14 15.11.2020 57 592.332 15 804.000
15 16.11.2020 57 515.424 20 280.000
16 17.11.2020 53 315.328 29 432.000
17 18.11.2020 48 960.672 26 192.000
18 19.11.2020 46 716.561 33 873.233
19 20.11.2020 43 852.200 43 806.382
20 21.11.2020 29 639.328 33 888.000
21 22.11.2020 0.000 0.000
I'm trying to click the "Show more" button, but I can't.
Any help? Thank you very much.
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('https://www.scorespro.com/basketball/china/cba/results/')
time.sleep(2)
showmore = driver.find_element_by_link_text("Show more")
showmore.click()
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
Dividing the height by some number decreases the scroll distance so that it stops where "Show more" is visible:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('https://www.scorespro.com/basketball/china/cba/results/')
driver.execute_script("window.scrollTo(0, document.body.scrollHeight/1.35);")
time.sleep(2)
driver.find_element_by_class_name("show_more").click()
time.sleep(2)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
If you are after the tables, you don't need to use Selenium. You can pull the data straight away with requests and parse it with pandas.
To find that, go to the page, right-click and choose 'Inspect' (or Shift-Ctrl-I). This will open a side panel. When it opens, go to Network and XHR, and browse those requests (you can click on Preview to see what each one returns). You may need to 1) reload the page; and 2) click around the table. For example, once I clicked "show more" at the bottom of the table, it popped up.
Once you find it, click on Headers and you'll see the url, payload, etc.
import requests
import pandas as pd
url = 'https://www.scorespro.com/basketball/ajaxdata_more.php'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Mobile Safari/537.36'}
dfList = []
for season in ['2018-2019', '2019-2020']:
    continueLoop = True
    page = 1
    while continueLoop == True:
        print('%s Collecting Page: %s' % (season, page))
        payload = {
            'country': 'china',
            'comp': 'cba',
            'season': season,
            'status': 'results',
            'league': '',
            'page': '%s' % page}
        response = requests.get(url, headers=headers, params=payload)
        try:
            dfs = pd.read_html(response.text)
        except ValueError:
            print('No more tables found.')
            continueLoop = False
            continue
        page += 1
        dfList.extend(dfs)

dfList_single = []
cols = ['Date', 'Final Time', 'Team', 'Final Score', 'Q1', 'Q2', 'Q3', 'Q4', 'OT', 'Half Time Score', 'Combined Total Score']
for each in dfList:
    each = each.loc[:, [0, 1, 2, 5, 6, 7, 8, 9, 10, 11, 12]]
    each.columns = cols
    teamA = each.iloc[0, :]
    teamB = each.iloc[1, 2:]
    temp_df = pd.concat([teamA, teamB], axis=0).to_frame().T
    dfList_single.append(temp_df)

df = pd.concat(dfList_single)
df = df.reset_index(drop=True)
Output:
print(df.head(10).to_string())
Date Final Time Team Final Score Q1 Q2 Q3 Q4 Half Time Score Combined Total Score
0 15.08.20 15:00 FT Guandong 123 26 28 41 28 54 238
1 15.08.20 15:00 FT Liaoning 115 29 18 39 29 47 238
2 13.08.20 15:00 FT Liaoning 115 34 24 23 34 58 228
3 13.08.20 15:00 FT Guandong 113 38 28 27 20 66 228
4 11.08.20 15:00 FT Guandong 110 25 30 25 30 55 198
5 11.08.20 15:00 FT Liaoning 88 24 26 23 15 50 198
6 08.08.20 15:00 FT Guandong 88 16 21 26 25 37 173
7 08.08.20 15:00 FT Beijing 85 13 24 20 28 37 173
8 07.08.20 15:00 FT Liaoning 119 22 40 29 28 62 232
9 07.08.20 15:00 FT Xinjiang 113 33 22 34 24 55 232
Two issues with your code:
First, you are trying to scroll after clicking, whereas it should be before. Second, you are using the height of the screen, which may work on one device but not on another if the size varies.
A better way is to scroll to the element itself and then click. See the code below; it worked fine:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('..\drivers\chromedriver')
driver.get("https://www.scorespro.com/basketball/china/cba/results/")
driver.maximize_window()
# I have to accept cookies on the page, so the step below
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//a[text()='Agree']"))).click()
showMore = driver.find_element_by_link_text("Show more")
driver.execute_script("arguments[0].scrollIntoView();", showMore)
time.sleep(2)
showMore.click()
I want to web-scrape the information on:
https://rotogrinders.com/resultsdb/date/2019-01-13/sport/4/slate/5c3c66edb1699a43c0d7bba7/contest/5c3c66f2b1699a43c0d7bd0d
There is a main table with a User column. When you click on a user, another table appears beside it showing the team that user entered in the contest. I want to extract the team of every user. Therefore, I need to be able to go through all the users by clicking on them and then extract the information from the second table. Here is my code to extract the team of the first user:
from selenium import webdriver
import csv
from selenium.webdriver.support.ui import Select
from datetime import date, timedelta
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chromedriver =("C:/Users/Michel/Desktop/python/package/chromedriver_win32/chromedriver.exe")
driver = webdriver.Chrome(chromedriver)
DFSteam = []
driver.get("https://rotogrinders.com/resultsdb/date/2019-01- 13/sport/4/slate/5c3c66edb1699a43c0d7bba7/contest/5c3c66f2b1699a43c0d7bd0d")
Team1=driver.find_element_by_css_selector("table.ant-table-fixed")
driver.close
print(Team1.text)
However, I am not able to iterate through the different users. I noticed that when I click on a user, the tr class of that row switches from inactive to active in the page source, but I do not know how to use that. Moreover, I would like to store the extracted team in a data frame. I am not sure if it is better to do it at the same time or afterwards.
The data frame would look like this:
RANK(team) / C / C / W / W / W / D / D /G/ UTIL/ TOTAL($) / Total Points
1 / Mark Scheifel/ Mickael Backlund/ Artemi Panarin / Nick Foligno / Michael Frolik / Mark Giordano / Zach Werenski / CConnor Hellebuyck / Brandon Tanev / 50 000 / 54.60
You have the right idea. It's just a matter of finding the username element to click on, then grabbing the lineup table and reformatting it to combine everything into one results dataframe.
The user name text is tagged with <a>. You just need to find the <a> tag that matches the user name.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd
url = 'https://rotogrinders.com/resultsdb/date/2019-01-13/sport/4/slate/5c3c66edb1699a43c0d7bba7/contest/5c3c66f2b1699a43c0d7bd0d'
# Open Browser and go to site
driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
driver.get(url)
# Waits until tables are loaded and has text. Timeouts after 60 seconds
WebDriverWait(driver, 60).until(ec.presence_of_element_located((By.XPATH, './/tbody//tr//td//span//a[text() != ""]')))
# Get tables to get the user names
tables = pd.read_html(driver.page_source)
users_df = tables[0][['Rank','User']]
users_df['User'] = users_df['User'].str.replace(' Member', '')
# Initialize results dataframe and iterate through users
results = pd.DataFrame()
for i, row in users_df.iterrows():
    rank = row['Rank']
    user = row['User']

    # Find the user name and click on the name
    user_link = driver.find_elements(By.XPATH, "//a[text()='%s']" % (user))[0]
    user_link.click()

    # Get the lineup table after clicking on the user name
    tables = pd.read_html(driver.page_source)
    lineup = tables[1]
    #print (user)
    #print (lineup)

    # Restructure to put into results dataframe
    lineup.loc[9, 'Name'] = lineup.iloc[9]['Salary']
    lineup.loc[10, 'Name'] = lineup.iloc[9]['Pts']
    temp_df = pd.DataFrame(lineup['Name'].values.reshape(-1, 11),
                           columns=lineup['Pos'].iloc[:9].tolist() + ['Total_$', 'Total_Pts'])
    temp_df.insert(loc=0, column='User', value=user)
    temp_df.insert(loc=0, column='Rank', value=rank)
    results = results.append(temp_df)

results = results.reset_index(drop=True)
driver.close()
Output:
print (results)
Rank User ... Total_$ Total_Pts
0 1 Canadaman101 ... $50,000.00 54.6
1 2 MayhemLikeMe27 ... $50,000.00 53.9
2 2 gunslinger58 ... $50,000.00 53.9
3 4 oilkings ... $48,600.00 53.6
4 5 TTB19 ... $50,000.00 53.4
5 6 Adamjloder ... $49,800.00 53.1
6 7 DollarBillW ... $49,900.00 52.6
7 8 Biglarry696 ... $49,900.00 52.4
8 8 tical1994 ... $49,900.00 52.4
9 8 rollem02 ... $49,900.00 52.4
10 8 kchoban ... $50,000.00 52.4
11 8 TBirdSCIL ... $49,900.00 52.4
12 13 manny716 ... $49,900.00 52.1
13 14 JayKooks ... $50,000.00 51.9
14 15 Cambie19 ... $49,900.00 51.4
15 16 mjh6588 ... $50,000.00 51.1
16 16 shanefriesen ... $50,000.00 51.1
17 16 mnfish42 ... $50,000.00 51.1
18 19 Pugsly55 ... $49,900.00 50.9
19 19 volpez7 ... $50,000.00 50.9
20 19 Scherr47 ... $49,900.00 50.9
21 19 Testosterown ... $50,000.00 50.9
22 23 markm22 ... $49,700.00 50.6
23 23 foreveryoung12 ... $49,800.00 50.6
24 23 STP_Picks ... $49,900.00 50.6
25 26 jibbinghippo ... $49,800.00 50.4
26 26 loumister35 ... $49,900.00 50.4
27 26 creels3 ... $50,000.00 50.4
28 26 JayKooks ... $50,000.00 51.9
29 26 mmeiselman731 ... $49,900.00 50.4
30 26 volpez7 ... $50,000.00 50.9
31 26 tommienation1 ... $49,900.00 50.4
32 26 jibbinghippo ... $49,800.00 50.4
33 26 Testosterown ... $50,000.00 50.9
34 35 nut07 ... $50,000.00 49.9
35 35 volpez7 ... $50,000.00 50.9
36 35 durfdurf ... $50,000.00 49.9
37 35 chupacabra21 ... $50,000.00 49.9
38 39 Mbermes01 ... $50,000.00 49.6
39 40 suerte41 ... $50,000.00 49.4
40 40 spliksskins77 ... $50,000.00 49.4
41 42 Andrewskoff ... $49,600.00 49.1
42 42 Alky14 ... $49,800.00 49.1
43 42 bretned ... $50,000.00 49.1
44 42 bretned ... $50,000.00 49.1
45 42 gehrig38 ... $49,700.00 49.1
46 42 d-train_91 ... $49,500.00 49.1
47 42 DiamondDallas ... $50,000.00 49.1
48 49 jdmre ... $50,000.00 48.9
49 49 Devosty ... $50,000.00 48.9
[50 rows x 13 columns]