Apologies in advance if this long question seems quite basic!
Given:
search query link in a library website:
url = 'https://digi.kansalliskirjasto.fi/search?query=economic%20crisis&orderBy=RELEVANCE'
I'd like to extract all the useful information for each individual search result (20 in total on one page) of this specific query, as depicted by the red rectangles in this figure:
Currently, I have the following code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
def run_selenium(URL):
    options = Options()
    options.add_argument("--remote-debugging-port=9222")
    options.headless = True
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get(URL)
    pt = "//app-digiweb/ng-component/section/div/div/app-binding-search-results/div/div"
    medias = driver.find_elements(By.XPATH, pt)  # expect to obtain a list with 20 elements!!
    print(medias)  # >>>>>> result: []
    print("#"*100)
    for i, v in enumerate(medias):
        print(i, v.get_attribute("innerHTML"))

if __name__ == '__main__':
    url = 'https://digi.kansalliskirjasto.fi/search?query=economic%20crisis&orderBy=RELEVANCE'
    run_selenium(URL=url)
Problem:
Looking at part of the markup in Chrome's inspector:
I have tried several XPaths generated by the Chrome extensions XPath Helper and SelectorsHub and used them as the pt variable in my Python code for this library search engine, but the result is [] or simply nothing.
Using SelectorsHub and hovering the mouse over Rel XPath, I get this warning: id & class both look dynamic. Uncheck id & class checkbox to generate rel xpath without them if it is generated with them.
Question:
Assuming Selenium is the right tool for scraping a page with dynamic attributes (rather than BeautifulSoup, as recommended here and here), shouldn't driver.find_elements() return a list of 20 elements, each of which contains all the info to be extracted?
>>>>> UPDATE <<<<< Working Solution (although time inefficient)
As recommended by @JaSON in the answer below, I now use WebDriverWait in a try/except block as follows:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common import exceptions
def get_all_search_details(URL):
    st_t = time.time()
    SEARCH_RESULTS = {}
    options = Options()
    options.headless = True
    options.add_argument("--remote-debugging-port=9222")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-extensions")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get(URL)
    print(f"Scraping {driver.current_url}")
    try:
        medias = WebDriverWait(driver, timeout=10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'result-row')))
        for media_idx, media_elem in enumerate(medias):
            outer_html = media_elem.get_attribute('outerHTML')
            result = scrap_newspaper(outer_html)  # some function to retrieve results
            SEARCH_RESULTS[f"result_{media_idx}"] = result
    # the specific Selenium exceptions are all subclasses of WebDriverException,
    # so they can be grouped into a single handler
    except (exceptions.StaleElementReferenceException,
            exceptions.NoSuchElementException,
            exceptions.TimeoutException,
            exceptions.SessionNotCreatedException,
            exceptions.WebDriverException) as e:
        print(f"Selenium: {type(e).__name__}: {e.args}")
        return
    except Exception as e:
        print(f"Selenium: {type(e).__name__} line {e.__traceback__.tb_lineno} of {__file__}: {e.args}")
        return
    print(f"\t\tFound {len(medias)} media(s) => {len(SEARCH_RESULTS)} search result(s)\tElapsed_t: {time.time()-st_t:.2f} s")
    return SEARCH_RESULTS

if __name__ == '__main__':
    url = 'https://digi.kansalliskirjasto.fi/search?query=economic%20crisis&orderBy=RELEVANCE'
    get_all_search_details(URL=url)
This approach works but seems very time-consuming and inefficient (see the sketch after the output below):
Found 20 media(s) => 20 search result(s) Elapsed_t: 15.22 s
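A possible way to cut the per-element overhead (an assumption about the bottleneck, not verified against this site): every get_attribute('outerHTML') call is a separate round-trip to the driver, so waiting once and then parsing the whole rendered page with BeautifulSoup (already imported above) in a single pass may be faster. A sketch reusing the result-row class, the imports, and the scrap_newspaper placeholder from the code above:
def get_all_search_details_bs4(driver, URL, timeout=10):
    driver.get(URL)
    # wait once for the Angular app to render the result rows
    WebDriverWait(driver, timeout).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'result-row')))
    # one round-trip: pull the whole rendered DOM, then parse it locally
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    rows = soup.find_all(class_='result-row')  # same class the Selenium wait targets
    return {f"result_{i}": scrap_newspaper(str(row)) for i, row in enumerate(rows)}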
This is an answer for Question #2 only, since #1 and #3 (as @Prophet has already said in a comment) are not valid for SO.
Since you're dealing with dynamic content, find_elements alone is not what you need. Try waiting for the required data to appear:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
medias = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'media')))
On top of the search results there is an option to download the search results as Excel; that file contains the newspaper/journal metadata and the text surrounding each hit. Could it be easier to use that than to scrape the individual elements? (The Excel file contains only the first 10,000 hits, though...)
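If you'd rather automate that export, a rough sketch (the locator here is purely hypothetical; inspect digi.kansalliskirjasto.fi for the real export control, and note that Chrome needs a download directory configured):
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def download_results_as_excel(url, download_dir="downloads"):
    os.makedirs(download_dir, exist_ok=True)
    options = Options()
    # route downloads to a known folder instead of the browser default
    options.add_experimental_option("prefs", {"download.default_directory": os.path.abspath(download_dir)})
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    # HYPOTHETICAL locator -- adapt to the page's actual export button
    export_btn = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'Excel')]")))
    export_btn.click()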
Related
I'm trying to create a script that shows only Pikachus on the Singapore Poke Map; the rest of the code goes over the elements, gets their coordinates, and prints the list.
I've been trying many suggestions I've seen here for a long time, but I'm still unable to get the checkbox set with the latest code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
def find_pokemon():
    links = []
    service = ChromeService(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)
    driver.get('https://sgpokemap.com/index.html?fbclid=IwAR2p_93Ll6K9b923VlyfaiTglgeog4uWHOsQksvzQejxo2fkOj4JN_t-MN8')
    driver.find_element(By.ID, 'filter_link').click()
    driver.find_element(By.ID, 'deselect_all_btn').click()
    driver.find_element(By.ID, 'search_pokemon').send_keys("pika")
    driver.switch_to.frame(driver.find_elements(By.ID, "filter"))
    driver.find_element(By.ID, 'checkbox_25').click()
The second part of the code works when I check the box manually after putting a breakpoint and ignoring the checkbox click() exception.
Do you have any suggestions for what I can try?
Bonus question: how can I detect and close the donate view:
There are several problems with your code:
There is no element with ID = 'search_pokemon'
There is no frame there to switch into.
You need to use WebDriverWait expected_conditions to wait for elements to be clickable.
And generally you need to learn how to create correct locators.
The following code works:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')  # raw string so the backslashes are not treated as escapes
driver = webdriver.Chrome(options=options, service=webdriver_service)
wait = WebDriverWait(driver, 30)
url = "https://sgpokemap.com/index.html?fbclid=IwAR2p_93Ll6K9b923VlyfaiTglgeog4uWHOsQksvzQejxo2fkOj4JN_t-MN8"
driver.get(url)
try:
    wait.until(EC.element_to_be_clickable((By.ID, 'close_donation_button'))).click()
except:
    pass
wait.until(EC.element_to_be_clickable((By.ID, 'filter_link'))).click()
wait.until(EC.element_to_be_clickable((By.ID, "deselect_all_btn"))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[name='search_pokemon']"))).send_keys("pika")
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='filter_checkbox'][not(@style)]//label"))).click()
The result is:
UPD
This time I saw the donation dialog so I added the mechanism to close it.
I still can't see an element with ID = 'search_pokemon' there, as you mentioned.
As for the XPath to find the relevant checkbox: when the pokemon name is entered, you can see in the dev tools that there are a lot of checkboxes, but all of them are invisible while only one (in our case) is visible. The invisible elements all have the attribute style="display: none;" while the enabled element has no style attribute. This is why [not(@style)] comes in. So, I'm looking for a parent element //div[@class='filter_checkbox'] that also has no style attribute, in XPath words //div[@class='filter_checkbox'][not(@style)], and then I'm just looking for its label child to click it. This can also be done with CSS Selectors as well; see the sketch below.
The list of invisible elements with the enabled one:
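For reference, a minimal sketch of the same locator as a CSS selector, reusing the wait object from the code above (and assuming the DOM structure described: a div.filter_checkbox without a style attribute wrapping the label):
# :not([style]) mirrors the XPath [not(@style)] predicate
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div.filter_checkbox:not([style]) label"))).click()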
With the help and answers from @Prophet, here is the current code for crawling the map and getting all of the Pikachu coordinates:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from keep import saveToKeep
def find_pokemon():
    links = []
    options = Options()
    options.add_argument("--headless")
    options.add_argument("disable-infobars")
    webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')
    driver = webdriver.Chrome(options=options, service=webdriver_service)
    wait = WebDriverWait(driver, 30)
    driver.get('https://sgpokemap.com')
    try:
        wait.until(EC.element_to_be_clickable((By.ID, 'close_donation_button'))).click()
    except:
        pass
    wait.until(EC.element_to_be_clickable((By.ID, 'filter_link'))).click()
    wait.until(EC.element_to_be_clickable((By.ID, "deselect_all_btn"))).click()
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[name='search_pokemon']"))).send_keys("pika")
    wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='filter_checkbox'][not(@style)]//label"))).click()
    # count = 0
    wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'pokemon_icon_img')))
    pokeList = driver.find_elements(By.CLASS_NAME, 'pokemon_icon_img')
    for poke in pokeList:
        # count += 1
        try:
            poke.click()
            links.append(driver.find_element(By.LINK_TEXT, "Maps").get_attribute('href'))
        except Exception:
            pass
        # if count > 300:
        #     break
    res = []
    for link in links:
        res.append(link.split("=")[1].replace("'", ""))
    # for item in res:
    #     print(item)
    if len(res) > 1:
        saveToKeep(res)
        print("success")
    else:
        print("unsuccessful")
        find_pokemon()  # retry recursively if this run was unsuccessful

if __name__ == '__main__':
    find_pokemon()
Used the headless Chrome option in the hope of achieving better performance.
Commented out 'count' in case I want to limit the list of results (currently I'm getting about 15 results tops when unlimited, although there are many more... weird :( ).
The line wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'pokemon_icon_img'))) is needed for now, since the icons don't always show right away; so it's either that or adding a constant time delay.
Made this method recursive in case it's unsuccessful (sometimes it still throws exceptions); a capped-retry alternative is sketched after these notes.
Lastly, saveToKeep(res) is a simple method I'm using to open and write the results into my Google Keep notes. I needed to get an app password within my Google security settings, and I'm using it with my Google account credentials for login.
Any comments or suggestions for improvement are welcome :D
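A minimal sketch (my suggestion, not part of the original code) of replacing the unbounded recursion with a capped retry loop, so repeated failures cannot recurse forever; scrape_once is a hypothetical helper wrapping one pass of the scraping logic above and returning res:
def find_pokemon_with_retries(max_attempts=3):
    # run one pass of the scraping logic up to max_attempts times
    for attempt in range(1, max_attempts + 1):
        try:
            res = scrape_once()  # hypothetical: one pass of the logic above
            if len(res) > 1:
                saveToKeep(res)
                print("success")
                return res
        except Exception as e:
            print(f"attempt {attempt} failed: {type(e).__name__}: {e.args}")
    print(f"unsuccessful after {max_attempts} attempts")
    return []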
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import time
from twilio.rest import Client
from datetime import datetime
import datefinder
import os
chrome_options = Options()
chrome_options.add_experimental_option("detach", True)
url = 'https://www.recreation.gov/permits/233262/registration/detailed-availability?type=overnight-permit'
driver = webdriver.Chrome(options=chrome_options)  # use the options defined above
driver.get(url)
title = ""
text = ""
campsites = ""
waiter = WebDriverWait(driver, 30)
waiter.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="per-availability-main"]/div[1]/div[1]/div/div/div/div/div[3]/div/div[2]/button[2]')))
element = driver.find_element(By.XPATH, '//*[@id="per-availability-main"]/div[1]/div[1]/div/div/div/div/div[3]/div/div[2]/button[2]')
element.click()
time.sleep(0.3)
element.click()
time.sleep(0.4)
waiter.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="per-availability-main"]/div[1]/div[3]/div/fieldset/div/div[2]/label/span')))
element = driver.find_element(By.XPATH, '//*[@id="per-availability-main"]/div[1]/div[3]/div/fieldset/div/div[2]/label/span')
element.click()
time.sleep(0.5)
waiter.until(EC.visibility_of_element_located((By.CLASS_NAME, 'rec-grid-grid-cell available')))
elements = driver.find_elements(By.CLASS_NAME, 'rec-grid-grid-cell available')
time.sleep(4)
So this code is meant to eventually compile a list of available permits for a given date, so I can quickly find out which one I want. It clicks 2 users and selects "no" for the guided trip. This reveals a grid, which shows the available sites. The first 2 steps work completely fine; it stops working when it tries to work with the grid.
I'm trying to locate available sites with the class name "rec-grid-grid-cell available".
I have also tried locating anything on that grid by XPath and it can't seem to find anything. Is there a special way to deal with grids that appear after a few clicks?
If you need more information, please ask.
Unfortunately you cannot pass multiple CSS class names to By.CLASS_NAME; the value is treated as a single class name, and "rec-grid-grid-cell available" contains a space.
So you can do either:
available_cells_css = ".rec-grid-grid-cell.available"
available_cells = waiter.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, available_cells_css)))
or
available_cells_xpath = "//div[@class='rec-grid-grid-cell available']"
available_cells = waiter.until(EC.visibility_of_all_elements_located((By.XPATH, available_cells_xpath)))
I want to create a script that grabs the info from the website using Selenium.
However, if it doesn't find the info and shows an error message, it should skip that request and continue to the next one.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException  # needed for the except clause below
import pandas as pd
import undetected_chromedriver as uc

list1 = [6019306, 6049500, 6051161, 6022230, 5776662, 6151430]
for x in range(0, len(list1)):
    while True:
        try:
            options = webdriver.ChromeOptions()
            options.add_argument("start-maximized")
            driver = uc.Chrome(options=options)
            url = 'https://www.axie.tech/axie-pricing/' + str(list1[x])
            driver.get(url)
            driver.implicitly_wait(10)
            test = driver.find_element_by_xpath('//*[@id="root"]/div[1]/div[2]/div[2]/div/div/div[1]/div/div[1]/div[4]/div/div[3]/div/span').text
            test = float(test[1:])
            print(test)
            driver.close()
            break  # done with this id, move on to the next one
        except NoSuchElementException:
            # this value doesn't exist; skip to the next id
            driver.close()
            break
It's a bit unclear what exactly you are trying to do with the line test = float(test[1:]).
However, to extract the desired text from the list of websites you need to induce WebDriverWait for visibility_of_element_located(), and you can use the following locator strategy:
Code Block:
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = uc.Chrome(options=options)
list1 = [6019306, 6049500, 6051161, 6022230, 5776662, 6151430]
for x in range(0, len(list1)):
    try:
        url = 'https://www.axie.tech/axie-pricing/' + str(list1[x])
        driver.get(url)
        print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Reasonable']//following::div[1]//span[contains(@class, 'MuiTypography-root')]"))).text)
    except TimeoutException:
        continue
driver.quit()
Console Output:
Ξ0.01
Ξ0.012
Ξ0.0162
Ξ0.026
Note: You have to add the following imports:
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
I'm trying to scrape data from a public Microsoft Power BI dashboard (4th page).
But unfortunately, I can't figure out how to change the time period with Selenium.
Tell me, please, is this even possible using Python + Selenium? Maybe by moving these sliders, or entering dates into the inputs.
Thanks.
Dashboard screen
Here is my code to load the dashboard page:
from selenium import webdriver
import time
fp = webdriver.FirefoxProfile()
url='https://app.powerbi.com/view?r=eyJrIjoiNjIwNzg5NzQtNzRlYS00YzFmLWJiNTUtOTM2MGEwY2FjOGJlIiwidCI6ImE3NWRkYWZlLWQ2MmYtNGIxOS04NThhLTllYzFhYjI1NDdkNCIsImMiOjl9'
driver = webdriver.Firefox(firefox_profile=fp)
driver.get(url)
time.sleep(5)
driver.find_element_by_xpath('//*[@id="embedWrapperID"]/div[2]/logo-bar/div/div/div/logo-bar-navigation/span/button[2]/i').click()
driver.find_element_by_xpath('//*[@id="embedWrapperID"]/div[2]/logo-bar/div/div/div/logo-bar-navigation/span/button[2]/i').click()
driver.find_element_by_xpath('//*[@id="embedWrapperID"]/div[2]/logo-bar/div/div/div/logo-bar-navigation/span/button[2]/i').click()
I was able to get it to set the date by clearing the existing date and then sending a date string + TAB, which refreshes the graph; is that what you are looking for?
I also wrapped my attempts to click on the various items in a wait function, so I'm not waiting an unnecessary amount of time before something shows up on the screen.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common import exceptions
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
def InputByXPATH(NameOfObject, WhatToSend):
    try:
        item = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, NameOfObject)))
        item.click()
        item.send_keys(Keys.CONTROL, 'a')
        item.send_keys(WhatToSend)
    except TimeoutException as e:
        print("InputByXPATH Error: Couldn't input by XPATH on: " + str(NameOfObject))
        pass

def ClickByXPATH(NameOfObject):
    try:
        item = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, NameOfObject)))
        item.click()
    except TimeoutException as e:
        print("ClickByXpath Error: Couldn't Click by XPATH on: " + str(NameOfObject))
        pass
driver = webdriver.Chrome()
driver.maximize_window()
url='https://app.powerbi.com/view?r=eyJrIjoiNjIwNzg5NzQtNzRlYS00YzFmLWJiNTUtOTM2MGEwY2FjOGJlIiwidCI6ImE3NWRkYWZlLWQ2MmYtNGIxOS04NThhLTllYzFhYjI1NDdkNCIsImMiOjl9'
driver.get(url)
ClickByXPATH('//*[@id="embedWrapperID"]/div[2]/logo-bar/div/div/div/logo-bar-navigation/span/button[2]/i')
ClickByXPATH('//*[@id="embedWrapperID"]/div[2]/logo-bar/div/div/div/logo-bar-navigation/span/button[2]/i')
InputByXPATH('/html/body/div[1]/report-embed/div/div/div[1]/div/div/div/exploration-container/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[8]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div/div[1]/div/div[1]/input', '2/1/2022' + Keys.TAB )
time.sleep(10)
import time
import sys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver', options=options)
driver.get("https://rentry.co/wftw8/edit")
driver.implicitly_wait(20)
#print (driver.page_source)
try:
    # here I selected the **span** element that I talked about above
    span = driver.find_element_by_xpath('//*[@id="text"]/div/div[5]/div[1]/div/div/div/div[5]/pre/span')
    # change the innerText through js
    driver.execute_script('arguments[0].innerText="Hello boy"', span)
    # just wait for the id_edit_code to be present
    edit = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "id_edit_code")))
    edit.send_keys("iRfiNq6M")
    # and same as you had it
    # driver.find_element_by_id("submitButton").send_keys(Keys.ENTER)
    # driver.find_element_by_link_text("Save").click()
    driver.find_element_by_id("submitButton").click()
except:
    print("Oops!", sys.exc_info()[0], "occurred.")
finally:
    driver.close()
    print("done")
There is no exception, but the text update is still not reflected at the URL.
Even though there is a timer that is long enough for the whole code to be processed, there is still no update.
Currently the text is "Hello" if you visit the URL, but I want it to be "Hello boy" using Selenium, which is done by the code lines below:
span = driver.find_element_by_xpath('//*[@id="text"]/div/div[5]/div[1]/div/div/div/div[5]/pre/span')
# change the innerText through js
driver.execute_script('arguments[0].innerText="Hello boy"', span)
But there is no update!?
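A plausible explanation (an assumption based on the nested div/pre/span structure in that XPath, not verified): the rentry edit box looks like a CodeMirror editor, and overwriting a rendered span's innerText only changes the display layer, not the editor's internal document, so the original text is what gets submitted. If it is a CodeMirror 5 instance, a sketch that updates the text through the editor's own API instead:
# assumes a CodeMirror 5 editor (hypothesis): the wrapper element with class
# 'CodeMirror' exposes the editor object as its 'CodeMirror' JS property
driver.execute_script(
    "var cm = document.querySelector('.CodeMirror').CodeMirror;"
    " cm.setValue('Hello boy');"  # setValue replaces the editor's whole document
)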