Element not clickable at point (Selenium) - Python

I'm trying to click the "Show more" button, but I can't.
Any help? Thank you very much.
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('https://www.scorespro.com/basketball/china/cba/results/')
time.sleep(2)
showmore = driver.find_element_by_link_text("Show more")
showmore.click()
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

Dividing the height by some number decreases the scroll distance, so the page stops where the "Show more" button is visible:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('https://www.scorespro.com/basketball/china/cba/results/')
driver.execute_script("window.scrollTo(0, document.body.scrollHeight/1.35);")
time.sleep(2)
driver.find_element_by_class_name("show_more").click()
time.sleep(2)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

If you are after the tables, you don't need Selenium. You can pull the data straight away with requests and parse it with pandas.
To find the request, go to the page, right-click and choose 'Inspect' (or Shift+Ctrl+I). This opens a side panel; go to Network and then XHR, and browse those requests (click on Preview to see what each one returns). You may need to 1) reload the page; and 2) click around the table. For example, once I clicked "Show more" at the bottom of the table, the request popped up.
Once you find it, click on Headers and you'll see the url, payload, etc.:
import requests
import pandas as pd

url = 'https://www.scorespro.com/basketball/ajaxdata_more.php'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Mobile Safari/537.36'}

dfList = []
for season in ['2018-2019', '2019-2020']:
    continueLoop = True
    page = 1
    while continueLoop == True:
        print('%s Collecting Page: %s' % (season, page))
        payload = {
            'country': 'china',
            'comp': 'cba',
            'season': season,
            'status': 'results',
            'league': '',
            'page': '%s' % page}
        response = requests.get(url, headers=headers, params=payload)
        try:
            dfs = pd.read_html(response.text)
        except ValueError:
            print('No more tables found.')
            continueLoop = False
            continue
        page += 1
        dfList.extend(dfs)

dfList_single = []
cols = ['Date', 'Final Time', 'Team', 'Final Score', 'Q1', 'Q2', 'Q3', 'Q4', 'OT', 'Half Time Score', 'Combined Total Score']
for each in dfList:
    each = each.loc[:, [0, 1, 2, 5, 6, 7, 8, 9, 10, 11, 12]]
    each.columns = cols
    teamA = each.iloc[0, :]
    teamB = each.iloc[1, 2:]
    temp_df = pd.concat([teamA, teamB], axis=0).to_frame().T
    dfList_single.append(temp_df)

df = pd.concat(dfList_single)
df = df.reset_index(drop=True)
Output:
print(df.head(10).to_string())
Date Final Time Team Final Score Q1 Q2 Q3 Q4 Half Time Score Combined Total Score
0 15.08.20 15:00 FT Guandong 123 26 28 41 28 54 238
1 15.08.20 15:00 FT Liaoning 115 29 18 39 29 47 238
2 13.08.20 15:00 FT Liaoning 115 34 24 23 34 58 228
3 13.08.20 15:00 FT Guandong 113 38 28 27 20 66 228
4 11.08.20 15:00 FT Guandong 110 25 30 25 30 55 198
5 11.08.20 15:00 FT Liaoning 88 24 26 23 15 50 198
6 08.08.20 15:00 FT Guandong 88 16 21 26 25 37 173
7 08.08.20 15:00 FT Beijing 85 13 24 20 28 37 173
8 07.08.20 15:00 FT Liaoning 119 22 40 29 28 62 232
9 07.08.20 15:00 FT Xinjiang 113 33 22 34 24 55 232

Two issues with your code:
First, you are trying to scroll after clicking, whereas it should happen before. Second, you are using the height of the screen, which may work on one device but not on another if the size varies.
A better way is to scroll to the element itself and then click. See the code below; it worked fine:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(r'..\drivers\chromedriver')
driver.get("https://www.scorespro.com/basketball/china/cba/results/")
driver.maximize_window()
# I had to accept cookies on the page, hence the step below
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//a[text()='Agree']"))).click()
showMore = driver.find_element_by_link_text("Show more")
driver.execute_script("arguments[0].scrollIntoView();", showMore)
time.sleep(2)
showMore.click()
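As an optional refinement (not part of the original answer), the fixed time.sleep(2) could be replaced with an explicit wait until the button is actually clickable; a minimal sketch, reusing the imports above:

showMore = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.LINK_TEXT, "Show more"))
)
driver.execute_script("arguments[0].scrollIntoView();", showMore)
showMore.click()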

Related

Efficiently web scrape tables with Selenium (Python) and Pandas

I have some questions regarding web scraping with Selenium for Python. I attempted to scrape a table of pokemon names and stats from pokemondb.net, and I saved that data into a pandas dataframe in my Jupyter notebook. The problem is that it takes 2-3 minutes to scrape all the data, which I assume is a bit too time-consuming. I was wondering if maybe I did a poor job of coding my web scraping program? I also programmed it to scrape the table data one column at a time, and I believe this may be one reason why it is not as efficient as possible. I would appreciate it if anyone could take a look and offer suggestions.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import os
import numpy as np
import pandas as pd
import matplotlib as plt

driver = webdriver.Chrome('drivers/chromedriver.exe') # assign the driver path to variable
driver.get('https://pokemondb.net/pokedex/all') # get request - opens chrome browser and navigates to URL
driver.minimize_window() # minimize window

pokemon_id = []
pokemon_id_html = driver.find_elements(By.CLASS_NAME, 'infocard-cell-data') # retrieve the pokemon id column from pokemondb.net
for poke_id in pokemon_id_html:
    pokemon_id.append(poke_id.text)

pokemon_name = []
pokemon_name_html = driver.find_elements(By.CLASS_NAME, 'ent-name') # retrieve the pokemon name column
for name in pokemon_name_html:
    pokemon_name.append(name.text)

pokemon_type = []
pokemon_type_html = driver.find_elements(By.CLASS_NAME, 'cell-icon') # retrieve pokemon type
for p_type in pokemon_type_html:
    pokemon_type.append(p_type.text)

pokemon_total = []
pokemon_total_html = driver.find_elements(By.CLASS_NAME, 'cell-total') # retrieve pokemon total stats
for total in pokemon_total_html:
    pokemon_total.append(total.text)

pokemon_hp = []
pokemon_hp_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][1]") # retrieve pokemon hp stat
for hp in pokemon_hp_html:
    pokemon_hp.append(hp.text)

pokemon_attack = []
pokemon_attack_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][2]") # retrieve pokemon attack stat
for attack in pokemon_attack_html:
    pokemon_attack.append(attack.text)

pokemon_defense = []
pokemon_defense_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][3]") # retrieve pokemon defense stat
for defense in pokemon_defense_html:
    pokemon_defense.append(defense.text)

pokemon_special_attack = []
pokemon_special_attack_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][4]") # retrieve pokemon sp. attack stat
for special_attack in pokemon_special_attack_html:
    pokemon_special_attack.append(special_attack.text)

pokemon_special_defense = []
pokemon_special_defense_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][5]") # retrieve pokemon sp. defense stat
for special_defense in pokemon_special_defense_html:
    pokemon_special_defense.append(special_defense.text)

pokemon_speed = []
pokemon_speed_html = driver.find_elements(By.XPATH, "//*[@class='cell-num'][6]") # retrieve pokemon speed stat
for speed in pokemon_speed_html:
    pokemon_speed.append(speed.text)

driver.close() # close driver, end session

columns = ['id', 'name', 'type', 'total', 'hp', 'attack', 'defense', 'special-attack', 'special-defense', 'speed'] # column names (labels) for dataset
attributes = [pokemon_id, pokemon_name, pokemon_type, pokemon_total, pokemon_hp, pokemon_attack, pokemon_defense, pokemon_special_attack, pokemon_special_defense, pokemon_speed] # list of values for each column (rows) for dataset
Though @platipus_on_fire_333's answer using page_source is perfect, as an alternative you can also canonically identify the <table> element and achieve a similar result.
Solution
To web scrape the table of pokemon names and stats from pokemondb.net, you need to induce WebDriverWait for visibility_of_element_located() and, using DataFrame from Pandas, you can use the following Locator Strategy:
Code Block:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
driver.execute("get", {'url': 'https://pokemondb.net/pokedex/all'})
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#pokedex"))).get_attribute("outerHTML")
df = pd.read_html(data)
print(df)
Console Output:
[ # Name Type Total HP Attack Defense Sp. Atk Sp. Def Speed
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80
3 3 Venusaur Mega Venusaur Grass Poison 625 80 100 123 122 120 80
4 4 Charmander Fire 309 39 52 43 60 50 65
... ... ... ... ... ... ... ... ... ... ...
1070 902 Basculegion Female Water Ghost 530 120 92 65 100 75 78
1071 903 Sneasler Poison Fighting 510 80 130 60 40 80 120
1072 904 Overqwil Dark Poison 510 85 115 95 65 65 85
1073 905 Enamorus Incarnate Forme Fairy Flying 580 74 115 70 135 80 106
1074 905 Enamorus Therian Forme Fairy Flying 580 74 115 110 135 100 46
[1075 rows x 10 columns]]
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
# chrome_options.add_argument("--headless")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
browser.get("https://pokemondb.net/pokedex/all")
dfs = pd.read_html(str(browser.page_source))
dfs[0]
This returns a dataframe with 1075 rows × 10 columns:
# Name Type Total HP Attack Defense Sp. Atk Sp. Def Speed
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80
3 3 Venusaur Mega Venusaur Grass Poison 625 80 100 123 122 120 80
4 4 Charmander Fire 309 39 52 43 60 50 65
... ... ... ... ... ... ... ... ...
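For context on the original speed question: both answers above avoid the thousands of per-element .text calls (each one a separate WebDriver round trip) and instead hand the whole page HTML to pandas in one go, which is where the time saving comes from. A small hedged follow-up, assuming the dfs list from the previous block, if you want to persist the result:

df = dfs[0]                            # pd.read_html returns a list; the Pokédex table is the first entry
df.to_csv("pokedex.csv", index=False)  # the filename is just an example
print(df.shape)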

How to get a table and its elements with Python/Selenium

I'm trying to get all the price in the table at this URL:
https://www.skyscanner.it/trasporti/voli/bud/rome/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27539793&inboundaltsenabled=true&infants=0&iym=2208&originentityid=27539604&outboundaltsenabled=true&oym=2208&preferdirects=false&ref=home&rtn=1&selectedoday=01&selectediday=01
The table elements are the days with the related price.
This is what I'm trying to do to get the table:
#Attempt 1
week = table.find_element(By.CLASS_NAME, "BpkCalendarGrid_bpk-calendar-grid__NzBmM month-view-grid--data-loaded")
#Attempt 2
table = driver.find_element(by=By.XPATH, value="XPath copied using Chrome inspector")
However, I cannot get it to work.
What is the correct way to extract all the prices from this table? Thanks!
You can grab the table data, meaning all the prices, using selenium with a pandas DataFrame. There are two tables that contain the price data.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
#chrome to stay open
option.add_experimental_option("detach", True)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=option)
driver.get('https://www.skyscanner.it/trasporti/voli/bud/rome/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27539793&inboundaltsenabled=true&infants=0&iym=2208&originentityid=27539604&outboundaltsenabled=true&oym=2208&preferdirects=false&ref=home&rtn=1&selectedoday=01&selectediday=01')
table = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '(//table)[1]'))).get_attribute("outerHTML")
table_2 = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '(//table)[2]'))).get_attribute("outerHTML")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="acceptCookieButton"]'))).click()
df1 = pd.read_html(table)[0]
print(df1)
df2 = pd.read_html(table_2)[0]
print(df2)
Output:
lun mar mer gio ven sab dom
0 1€ 40 2€ 28 3€ 32 4€ 37 5€ 34 6€ 35 7€ 34
1 8€ 34 9€ 28 10€ 27 11€ 26 12€ 26 13€ 46 14€ 35
2 15€ 35 16€ 40 17€ 36 18€ 51 19€ 28 20€ 33 21€ 36
3 22€ 38 23€ 38 24€ 30 25€ 50 26€ 43 27€ 50 28€ 51
4 29€ 38 30€ 36 31€ 58 1- 2- 3- 4-
5 5- 6- 7- 8- 9- 10- 11-
lun mar mer gio ven sab dom
0 1€ 40 2€ 28 3€ 32 4€ 37 5€ 34 6€ 35 7€ 34
1 8€ 34 9€ 28 10€ 27 11€ 26 12€ 26 13€ 46 14€ 35
2 15€ 35 16€ 40 17€ 36 18€ 51 19€ 28 20€ 33 21€ 36
3 22€ 38 23€ 38 24€ 30 25€ 50 26€ 43 27€ 50 28€ 51
4 29€ 38 30€ 36 31€ 58 1- 2- 3- 4-
5 5- 6- 7- 8- 9- 10- 11-
(The code above uses webdriver-manager, via ChromeDriverManager().install(), so the matching ChromeDriver binary is downloaded automatically.)
Alternative solution (Table 1): in the same way you can extract the prices from table two as well.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
#chrome to stay open
option.add_experimental_option("detach", True)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=option)
driver.get('https://www.skyscanner.it/trasporti/voli/bud/rome/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27539793&inboundaltsenabled=true&infants=0&iym=2208&originentityid=27539604&outboundaltsenabled=true&oym=2208&preferdirects=false&ref=home&rtn=1&selectedoday=01&selectediday=01')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="acceptCookieButton"]'))).click()
table = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, '(//table)[1]/tbody/tr/td')))
for i in table:
    price = i.find_element(By.XPATH, './/div[@class="price"]').text.replace('€', '').strip()
    print(price)
Output:
39
30
32
37
34
35
34
34
28
27
26
26
46
35
35
40
36
52
29
34
37
39
39
30
50
44
50
52
38
36
58
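If you prefer the DataFrame route instead of looping over cells, note that each cell in df1 mixes the day number and the price (e.g. '1€ 40', or '1-' when there is no price). A hedged sketch for splitting them, assuming only the cell format shown in the output above:

# reshape the calendar grid into one row per day, then split day / price on the euro sign
long_df = df1.melt(var_name="weekday", value_name="cell").dropna()
split = long_df["cell"].str.split("€", n=1, expand=True)
long_df["day"] = split[0].str.rstrip("-").str.strip()   # "1€ 40" -> "1", "1-" -> "1"
long_df["price"] = split[1].str.strip()                 # NaN where the cell had no price
print(long_df.head())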

Python find_element in find_elements problem?

Good day everyone:
I'd like to get the basketball game data from the web, including league, date, time and score.
The first-level for loop works fine to get every league title:
for league in leagues:
But with the second-level for loop
for row in _rows:
I always get the rows of all leagues; I just need the data league by league.
What should I do to fix it?
Any help will be greatly appreciated.
from selenium import webdriver
#from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.set_window_size(1500, 1350)

# open url (sorry for the url, cause system always reports it as spam)
driver.get("https://"+"we"+"b2."+"sa8"+"8"+"88.n"+"et"+"/sp"+"ort/Ga"+"mes.aspxdevice=pc")

# jump to basketball
locator = (By.XPATH, '//*[@id="menuList"]/div/ul/li[3]/div[2]/a[1]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
time.sleep(1)

# date menu
locator = (By.XPATH, '//*[@id="chooseDate"]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()

# jump to date 1
locator = (By.XPATH, '//*[@id="dateOption"]/a[1]/span[1]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()

# close AD by double click
locator = (By.ID, 'btn_close')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
actions = ActionChains(driver)
actions.click(pointer).perform()

# list all leagues schedule
leagues = []
leagues = driver.find_elements(By.XPATH, '//*[@id="scheduleBottom"]/table[*]')
for league in leagues:
    #print("Block.text=",Block.text,"\n")
    #_rows = Block.find_elements(By.TAG_NAME, "tr")
    league_Title = league.find_element(By.TAG_NAME, 'caption')
    _rows = []
    _rows = league.find_elements(By.XPATH, "//*[contains(@id, '_mainRow') or contains(@id, '_secondRow')]")
    print("\nleague : ", league_Title.text, 'len(_rows)=', len(_rows))
    for row in _rows:
        print(league_Title, row.text)  #," / _rows=",_rows)
        # first_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_mainRow')]")
        # second_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_secondRow')]")
        print("\trow : ", row.text)
    time.sleep(1)

time.sleep(120)
driver.quit()
I can't run the code because the page shows Error 404.
EDIT: in the question you forgot the ? in Games.aspx?device=pc, and this caused the 404 problem.
You have to use a dot . at the beginning of the XPath to make the path relative to league:
_rows = league.find_elements(By.XPATH, ".//...rest...") # <-- dot before `//`
You used an absolute XPath, which searches the full HTML.
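For illustration, a minimal sketch of the difference, using the locators from the question:

league = driver.find_element(By.XPATH, '//*[@id="scheduleBottom"]/table[1]')

# absolute XPath: the leading // ignores `league` and searches the whole document
all_rows = league.find_elements(By.XPATH, "//tr")

# relative XPath: the leading .// searches only inside this one league's table
league_rows = league.find_elements(By.XPATH, ".//tr")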
EDIT:
Partial result with the dot in the XPath:
I used lang=3 to get English text, and I used a[2] to select the second date (03 / 06 (Sun)) because the first date (03 / 05 (Sat)) was empty (no matches).
url: https://web2.sa8888.net/sport/Games.aspx?lang=3&device=pc
len(leagues): 115
league: NBA len(_rows)= 12
row: 06:05 Finished Dallas Mavericks 26 25 34 29 114 | Live Update
row: Sacramento Kings 36 29 27 21 113
row: 08:05 Finished Charlotte Hornets 31 31 37 24 123 | Live Update
row: San Antonio Spurs 30 30 37 20 117
row: 09:05 Finished Miami Heat 22 32 19 26 99 | Live Update
row: Philadelphia 76ers 14 26 28 14 82
row: 09:05 Finished Memphis Grizzlies 31 37 29 27 124 | Live Update
row: Orlando Magic 29 16 29 22 96
row: 09:05 Finished Minnesota Timberwolves 32 31 46 26 135 | Live Update
row: Portland Trail Blazers 34 30 37 20 121
row: 09:35 Finished Los Angeles Lakers 32 30 27 35 124 | Live Update
row: Golden State Warriors 25 42 27 22 116
---
league: NBA GATORADE LEAGUE len(_rows)= 8
row: 08:00 Finished Delaware Blue Coats 42 34 37 33 146 | Live Update
row: Westchester Knicks 28 28 24 31 111
row: 09:00 Finished Austin Spurs 35 21 23 31 110 | Live Update
row: Salt Lake City Stars 30 32 21 17 100
row: 09:00 Finished Wisconsin Herd 26 30 20 38 114 | Live Update
row: Capital City Go-Go 27 31 32 38 128
row: 11:00 Finished Santa Cruz Warriors 36 19 17 27 99 | Live Update
row: Memphis Hustle 26 29 22 30 107
---
league: CHINA PROFESSIONAL BASKETBALL LEAGUE len(_rows)= 12
row: 11:00 Finished Fujian Sturgeons 37 21 27 32 117 | Live Update
row: Ningbo Rockets 24 28 34 25 111
row: 11:00 Finished Sichuan Blue Whales 12 21 27 20 80 | Live Update
row: Zhejiang Lions 23 27 35 25 110
row: 15:00 Finished Shenzhen Leopards 23 32 30 33 118 | Live Update
row: Shandong Hi Speed 29 25 32 29 115
row: 15:30 Finished Jilin Northeast Tigers 36 39 25 18 118 | Live Update
row: Shanghai Sharks 15 25 32 36 108
row: 19:35 Finished Beijing Ducks 24 20 17 22 83 | Live Update
row: Beijing Royal Fighters 18 18 21 22 79
row: 20:00 Finished Nanjing Monkey King 23 24 23 25 95 | Live Update
row: Jiangsu Dragons 18 17 21 24 80
---
Full working code:
I also added WebDriverWait to wait for the leagues, and row.text.replace('\n', ' | ') to display each row on one line.
from selenium import webdriver
#from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.set_window_size(1500, 1350)

# open url
driver.get("https://web2.sa8888.net/sport/Games.aspx?lang=3&device=pc") # lang=3 for English

# jump to basketball
locator = (By.XPATH, '//*[@id="menuList"]/div/ul/li[3]/div[2]/a[1]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
time.sleep(1)

# date menu
locator = (By.XPATH, '//*[@id="chooseDate"]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()

# jump to date 1
locator = (By.XPATH, '//*[@id="dateOption"]/a[2]/span[1]') # a[2] for second date, because first has no matches
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()

# close AD by double click
locator = (By.ID, 'btn_close')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)
actions = ActionChains(driver)
actions.click(pointer).perform()
actions = ActionChains(driver)
actions.click(pointer).perform()

# wait for leagues
locator = (By.XPATH, '//*[@id="scheduleBottom"]/table[*]')
pointer = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(locator),
    "element not found"
)

# list all leagues schedule
leagues = driver.find_elements(By.XPATH, '//*[@id="scheduleBottom"]/table[*]')
print('len(leagues):', len(leagues))

for league in leagues:
    #print("Block.text=",Block.text,"\n")
    #_rows = Block.find_elements(By.TAG_NAME, "tr")
    league_Title = league.find_element(By.TAG_NAME, 'caption')
    _rows = league.find_elements(By.XPATH, ".//*[contains(@id, '_mainRow') or contains(@id, '_secondRow')]")
    print("\nleague:", league_Title.text, 'len(_rows)=', len(_rows))
    for row in _rows:
        #print(league_Title, row.text) #," / _rows=",_rows)
        # first_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_mainRow')]")
        # second_rows = Block.find_element(By.XPATH, "//*[contains(@id, '_secondRow')]")
        print("\trow:", row.text.replace('\n', ' | ')) # <- clean text
    time.sleep(1)
    print('---')

time.sleep(120)
driver.quit()
I think find_element() (or find() in BeautifulSoup) is for only one element on the page: you will get just the first element even if there are multiple matches.
find_elements() (or findAll()) is for all matching elements on the page and returns them as a list.
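A tiny hedged illustration of the difference, against an arbitrary page (example.com is only a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")

first_p = driver.find_element(By.TAG_NAME, "p")   # a single WebElement: only the first match
all_p = driver.find_elements(By.TAG_NAME, "p")    # a list of WebElements: every match
print(first_p.text, len(all_p))
driver.quit()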
Hope this helps you some.

How to scrape a non-tabulated list from Wikipedia and create a dataframe?

en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul
In the link above there is un-tabulated data for the Istanbul neighbourhoods.
I want to fetch these neighbourhoods into a data frame with this code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('a',{'class':"new"})

neighborhoods=[]
for item in tocList:
    text = item.get_text()
    neighborhoods.append(text)

df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
print(df)
and I got this output:
Neighborhood
0 Maden
1 Nizam
2 Anadolu
3 Arnavutköy İmrahor
4 Arnavutköy İslambey
... ...
705 Seyitnizam
706 Sümer
707 Telsiz
708 Veliefendi
709 Yeşiltepe
710 rows × 1 columns
but some data are not fetched; check the data below and compare it to the output:
Adalar
Burgazada
Heybeliada
Kınalıada
Maden
Nizam
findAll() is not fetching the neighbourhoods that don't carry that class (they appear as plain list items or ordinary links), i.e.
<ol><li>Burgazada</li>
<li>Heybeliada</li>
Also, can I develop the code into two columns, 'Neighborhood' and its 'District'?
Are you trying to fetch this list from the Table of Contents?
Please check if this solves your problem:
import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('span',{'class':"toctext"})

districts=[]
blocked_words = ['Neighbourhoods by districts','Further reading', 'External links']
for item in tocList:
    text = item.get_text()
    if text not in blocked_words:
        districts.append(text)

df = pd.DataFrame(districts, columns=['districts'])
print(df)
Output:
districts
0 Adalar
1 Arnavutköy
2 Ataşehir
3 Avcılar
4 Bağcılar
5 Bahçelievler
6 Bakırköy
7 Başakşehir
8 Bayrampaşa
9 Beşiktaş
10 Beykoz
11 Beylikdüzü
12 Beyoğlu
13 Büyükçekmece
14 Çatalca
15 Çekmeköy
16 Esenler
17 Esenyurt
18 Eyüp
19 Fatih
20 Gaziosmanpaşa
21 Güngören
22 Kadıköy
23 Kağıthane
24 Kartal
25 Küçükçekmece
26 Maltepe
27 Pendik
28 Sancaktepe
29 Sarıyer
30 Silivri
31 Sultanbeyli
32 Sultangazi
33 Şile
34 Şişli
35 Tuzla
36 Ümraniye
37 Üsküdar
38 Zeytinburnu
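Regarding the second part of the question (a 'Neighborhood' plus 'District' pair per row), here is a hedged sketch, not verified against the live page: it assumes each district is a heading whose following <ol>/<ul> lists contain the neighbourhood items.

import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
soup = BeautifulSoup(requests.get(wikiurl).text, 'html.parser')

rows = []
for headline in soup.find_all('span', {'class': 'mw-headline'}):
    district = headline.get_text()
    heading = headline.find_parent(['h2', 'h3'])
    if heading is None:
        continue
    # walk the siblings after the heading until the next heading, collecting list items
    for sib in heading.find_next_siblings():
        if sib.name in ('h2', 'h3'):
            break
        if sib.name in ('ol', 'ul'):
            for li in sib.find_all('li'):
                rows.append((li.get_text().strip(), district))

df = pd.DataFrame(rows, columns=['Neighborhood', 'District'])
print(df)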

How to web scrape from two websites in one script?

I am currently working on a model and need to gather information not just on game results
(this link: https://www.hltv.org/stats/teams/matches/4991/fnatic?startDate=2019-01-01&endDate=2019-12-31),
but I would also like the script to open another link within the HTML source. The link is available in the source and takes me to a page that details each match's result
(as in who won which round: https://www.hltv.org/stats/matches/mapstatsid/89458/cr4zy-vs-fnatic?startDate=2019-01-01&endDate=2019-12-31&contextIds=4991&contextTypes=team). The main objective is that I want to know who won the match (from the first link) and who won the first round of each individual match (in the second link). Is this possible? This is my current script:
import requests
r = requests.get('https://www.hltv.org/stats/teams/maps/6665/Astralis')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('tr')

AstralisResults = []
for result in results[1:]:
    date = result.contents[1].text
    event = result.contents[3].text
    opponent = result.contents[7].text
    Map = result.contents[9].text
    Score = "'" + result.contents[11].text
    WinorLoss = result.contents[13].text
    AstralisResults.append((date,event,opponent,Map,Score,WinorLoss))

import pandas as pd
df5 = pd.DataFrame(AstralisResults,columns=['date','event','opponent','Map','Score','WinorLoss'])
df5.to_csv('AstralisResults.csv',index=False,encoding='utf-8')
So I would be looking for the following information:
Date | Opponent | Map | Score | Result | Round1Result |
It looks like the site blocks you if you scrape too fast, so I had to put in a time delay. There are ways to make this code more efficient, but overall I think it gets what you asked for:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

r = requests.get('https://www.hltv.org/stats/teams/matches/4991/fnatic?startDate=2019-01-01&endDate=2019-12-31', headers=headers)
print (r)

soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('tr')

df5 = pd.DataFrame()
cnt=1
for result in results[1:]:
    print ('%s of %s' %(cnt, len(results)-1))
    date = result.contents[1].text
    event = result.contents[3].text
    opponent = result.contents[7].text
    Map = result.contents[9].text
    Score = "'" + result.contents[11].text
    WinorLoss = result.contents[13].text

    round_results = result.find('td', {'class':'time'})
    link = round_results.find('a')['href']

    r2 = requests.get('https://www.hltv.org' + link, headers=headers)
    soup2 = BeautifulSoup(r2.text, 'html.parser')
    round_history = soup2.find('div', {'class':'standard-box round-history-con'})
    teams = round_history.find_all('img', {'class':'round-history-team'})
    teams_list = [ x['title'] for x in teams ]

    rounds_winners = {}
    n = 1
    row = round_history.find('div',{'class':'round-history-team-row'})
    for each in row.find_all('img',{'class':'round-history-outcome'}):
        if 'emptyHistory' in each['src']:
            winner = teams_list[1]
            loser = teams_list[0]
        else:
            winner = teams_list[0]
            loser = teams_list[1]
        rounds_winners['Round%02dResult' %n] = winner
        n+=1

    round_row_df = pd.DataFrame.from_dict(rounds_winners,orient='index').T
    temp_df = pd.DataFrame([[date,event,opponent,Map,Score,WinorLoss]],columns=['date','event','opponent','Map','Score','WinorLoss'])
    temp_df = temp_df.merge(round_row_df, left_index=True, right_index=True)

    df5 = df5.append(temp_df, sort=True).reset_index(drop=True)
    time.sleep(.5)
    cnt+=1

df5 = df5[['date','event','opponent','Map','Score','WinorLoss', 'Round01Result']]
df5 = df5.rename(columns={'date':'Date',
                          'event':'Event',
                          'WinorLoss':'Result',
                          'Round01Result':'Round1Result'})
df5.to_csv('AstralisResults.csv',index=False,encoding='utf-8')
Output:
print (df5.head(10).to_string())
Date Event opponent Map Score Result Round1Result
0 20/07/19 Europe Minor - StarLadder Major 2019 CR4ZY Dust2 '13 - 16 L fnatic
1 20/07/19 Europe Minor - StarLadder Major 2019 CR4ZY Train '13 - 16 L fnatic
2 19/07/19 Europe Minor - StarLadder Major 2019 mousesports Inferno '8 - 16 L mousesports
3 19/07/19 Europe Minor - StarLadder Major 2019 mousesports Dust2 '13 - 16 L fnatic
4 17/07/19 Europe Minor - StarLadder Major 2019 North Train '16 - 9 W fnatic
5 17/07/19 Europe Minor - StarLadder Major 2019 North Nuke '16 - 2 W fnatic
6 17/07/19 Europe Minor - StarLadder Major 2019 Ancient Mirage '16 - 7 W fnatic
7 04/07/19 ESL One Cologne 2019 Vitality Overpass '17 - 19 L Vitality
8 04/07/19 ESL One Cologne 2019 Vitality Mirage '16 - 19 L fnatic
9 03/07/19 ESL One Cologne 2019 Astralis Nuke '6 - 16 L fnatic
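One hedged refinement on top of the answer's fixed time.sleep(.5): since the site blocks requests that come too fast, the per-match request could be wrapped in a small retry-with-backoff helper (the status-code handling here is an assumption, not something HLTV documents):

import time
import requests

def get_with_backoff(url, headers, max_tries=5):
    """Retry a GET with exponentially growing pauses until it returns HTTP 200."""
    delay = 1
    for _ in range(max_tries):
        resp = requests.get(url, headers=headers)
        if resp.status_code == 200:
            return resp
        time.sleep(delay)   # back off before the next attempt
        delay *= 2
    resp.raise_for_status()  # give up and surface the last error
    return resp

# usage inside the loop above (hypothetical drop-in for the second requests.get):
# r2 = get_with_backoff('https://www.hltv.org' + link, headers=headers)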
