I am using Selenium with Python to search for a keyword, click the top 5 URLs in the search results, grab the text from each page's p tags, and then go back, so that I end up storing the data from these 5 sites. But after searching the keyword I am unable to click the URLs and get the data. I don't know what's wrong. This is the code I have written. Please help.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe")
driver.get("https://www.google.com/")
print(driver.title)
driver.maximize_window()
time.sleep(2)
driver.find_element(By.XPATH, "//input[#name='q']").send_keys('selenium')
driver.find_element(By.XPATH, "//div[#class='FPdoLc tfB0Bf']//input[#name='btnK']").send_keys(Keys.ENTER)
a = driver.find_elements_by_xpath("//div[@class='g']/a[@href]")
links = []
for x in a:
    links.append(x.get_attribute('href'))
link_data = []
for new_url in links:
    print('new url : ', new_url)
    driver.get(new_url)
    link_data.append(driver.page_source)
    b = driver.find_elements(By.TAG_NAME, "p")
    for data in b:
        print(data.text)
    driver.back()
driver.close()
EDIT :
While navigating through the links it is also including links from the "People also ask" box. I don't want to navigate through this box. How can I do that?
If you want the 16 or so links, use:
driver.get("https://www.google.com/")
print(driver.title)
driver.maximize_window()
time.sleep(2)
driver.find_element(By.XPATH, "//input[#name='q']").send_keys('selenium')
driver.find_element(By.XPATH, "//input[#name='btnK']").send_keys(Keys.ENTER)
a = driver.find_elements_by_xpath("//div[#class='g']/div/div/a")
links = []
for x in a:
    links.append(x.get_attribute('href'))
link_data = []
for new_url in links:
    print('new url : ', new_url)
    driver.get(new_url)
    link_data.append(driver.page_source)
    b = driver.find_elements(By.TAG_NAME, "p")
    for data in b:
        print(data.text)
    driver.back()
You have the wrong XPath for the links; it should be:
"//div[#class='yuRUbf']/a[#href]"
If you look at the relevant part of the page source, you'll see the <a> tag is not a child of <div class="g">, but of <div class="yuRUbf">:
<div class="g"><!--m-->
<div class="tF2Cxc" data-hveid="CAkQAA" data-ved="2ahUKEwjphfjOoazuAhUO1VkKHVSkA_oQFSgAMAp6BAgJEAA">
<div class="yuRUbf"><a href="https://www.healthline.com/nutrition/selenium-benefits"
data-ved="2ahUKEwjphfjOoazuAhUO1VkKHVSkA_oQFjAKegQICRAC"
ping="/url?sa=t&source=web&rct=j&url=https://www.healthline.com/nutrition/selenium-benefits&ved=2ahUKEwjphfjOoazuAhUO1VkKHVSkA_oQFjAKegQICRAC"><br>
<h3 class="LC20lb DKV0Md"><span>7 Science-Based Health Benefits of Selenium - Healthline</span></h3>
<div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">www.healthline.com<span
class="dyjrff qzEoUe"><span> › nutrition › selenium-benefits</span></span></cite></div>
</a>
...
</div>
</div>
</div>
You can also tighten your search lines a bit, though it doesn't change the overall effect:
driver.find_element_by_xpath("//input[#name='q']").send_keys('selenium', Keys.ENTER)
Related
Using Selenium (Python) to avoid spoilers of a soccer game
I am trying to grab the URL for a video of a soccer match replay from a dynamically changing webpage. The webpage shows the score, and I'd rather get the link directly than visit the website, which will almost certainly show me the score. There are other related videos of the match, like a 10-minute highlight reel, but I would like the full replay only.
There is a list of videos on the page to choose from, but the h1 heading indicating it's a full replay is wrapped inside the a tag (see below). There are ~10 of these list items on the page and they are distinguished only by the content of the h1, buried as a child. The text I'm after is Brentford v LFC : Full match; the "Full match" part is the giveaway.
My problem is: how do I get the link when the important information comes in a later child?
<li data-sidebar-video="0_5de4sioh" class="js-subscribe-entitlement">
<a class="" href="//video.liverpoolfc.com/player/0_5de4sioh/">
<article class="video-thumb video-thumb--fade-in js-thumb video-thumb--no-duration video-thumb--sidebar">
<figure class="video-thumb__img">
<div class="site-loader">
<ul>
<li></li>
<li></li>
<li></li>
</ul>
</div> <img class="video-thumb__img-container loaded" data-src="//open.http.mp.streamamg.com/p/101/thumbnail/entry_id/0_5de4sioh/width/150/height/90/type/3" alt="Brentford v LFC : Full match" onerror="PULSE.app.common.VideoThumbError(this)" onload="PULSE.app.common.VideoThumbLoaded(this)"
src="//open.http.mp.streamamg.com/p/101/thumbnail/entry_id/0_5de4sioh/width/150/height/90/type/3" data-image-initialised="true"> <span class="video-thumb__premium">Premium</span> <i class="video-thumb__play-btn"></i> <span class="video-thumb__time"> <i class="video-thumb__icon"></i> 1:45:07 </span> </figure>
<div class="video-thumb__txt-container"> <span class="video-thumb__tag js-video-tag">Match Action</span>
<h1 class="video-thumb__heading">Brentford v LFC : Full match</h1> <time class="video-thumb__date">25th Sep 2021</time> </div>
</article>
</a>
</li>
My code looks like this at the moment. It gives me a list of the links but I don't know which one is which.
from selenium import webdriver
#------------------------Account login---------------------------#
#I have to login to my account first.
#----------------------------------------------------------------#
username = "<my username goes here>"
password = "<my password goes here>"
username_object_id = "login_form_username"
password_object_id = "login_form_password"
login_button_name = "submitBtn"
login_url = "https://video.liverpoolfc.com/mylfctvgo"
driver = webdriver.Chrome("/usr/local/bin/chromedriver")
driver.get(login_url)
driver.implicitly_wait(10)
driver.find_element_by_id(username_object_id).send_keys(username)
driver.find_element_by_id(password_object_id).send_keys(password)
driver.find_element_by_name(login_button_name).click()
#--------------Find most recent game played----------------#
#I have to go to the matches section of my account and click on the most recent game
#----------------------------------------------------------------#
matches_url = "https://video.liverpoolfc.com/matches"
driver.get(matches_url)
driver.implicitly_wait(10)
latest_game = driver.find_element_by_xpath("/html/body/div[2]/section/ul/li[1]/section/div/div[1]/a").get_attribute('href')
driver.get(latest_game)
driver.implicitly_wait(10)
#--------------Find the full replay video----------------#
#There are many videos to choose from but I only want the full replay.
#--------------------------------------------------#
#prints all the videos in the list. They all have the same "data-sidebar-video" attribute
web_element1 = driver.find_elements_by_css_selector('li[data-sidebar-video*=""] > a')
print(web_element1)
for i in web_element1:
    print(i.get_attribute('href'))
You can do this with a simple XPath locator since you are searching based on contained text.
//a[.//h1[contains(text(),'Full match')]]
^ an A tag
^ that has an H1 descendant
^ that contains the text "Full match"
NOTE: You can't just take the href from the A tag since it isn't a complete URL, e.g. //video.liverpoolfc.com/player/0_5de4sioh/. I would suggest you just click on the link. If you want to write it to a file, you'll have to prepend "https:" to these partial URLs to make them usable.
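For example, a small sketch of that prefixing step (element here stands for whatever the locator above returned; if Selenium already hands back an absolute URL, the check is a no-op):

raw_href = element.get_attribute('href')  # e.g. //video.liverpoolfc.com/player/0_5de4sioh/
full_url = 'https:' + raw_href if raw_href.startswith('//') else raw_href
print(full_url)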
You can try the approach below: extract the list of videos via their li tags, check whether the h1 inside each list item contains Full match, and if so get the a tag and its href.
# Imports Required:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver.get("https://video.liverpoolfc.com/player/0_5j5fsdzg/?contentReferences=FOOTBALL_FIXTURE%3Ag2210322&page=0&pageSize=20&sortOrder=desc&title=Highlights%3A%20Brentford%203-3%20LFC&listType=LIST-DEFAULT")
wait = WebDriverWait(driver,30)
wait.until(EC.visibility_of_element_located((By.XPATH,"//ul[contains(#class,'related-videos')]/li")))
videos = driver.find_elements_by_xpath("//ul[contains(#class,'related-videos')]/li")
for video in videos:
    option = video.find_element_by_tag_name("h1").get_attribute("innerText")
    if "Full match" in option:
        link = video.find_element_by_tag_name("a").get_attribute("href")
        print(f"{option} : {link}")
Brentford v LFC : Full match : https://video.liverpoolfc.com/player/0_5de4sioh/
You can use driver.execute_script to grab only the links that have the "Full match" designation as a child:
links = driver.execute_script('''
    var links = [];
    for (var i of document.querySelectorAll('li[data-sidebar-video*=""] > a')){
        if (i.querySelector('h1.video-thumb__heading').textContent.endsWith('Full match')){
            links.push(i.getAttribute('href'));
        }
    }
    return links;
''')
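Unlike get_attribute('href') in Python, which the browser resolves to an absolute URL, getAttribute in JavaScript returns the raw attribute value, so the scheme-relative links need the same https: prefix treatment:

# links is the plain Python list of strings returned by execute_script above
for href in links:
    print('https:' + href if href.startswith('//') else href)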
This is what worked. I used both @JeffC's and @pmadhu's responses to get stable, working code. I also added a headless option so you can run the code without having to view the webpages, which might inadvertently show you the score you're trying to avoid! As a result I had to remove the two lines of wait code, which I've just commented out in case you want to keep them.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
#------------------------Account login---------------------------#
#Logs into my account
#----------------------------------------------------------------#
username = "" #<----my username goes here
password = "" #<----my password goes here
username_object_id = "login_form_username"
password_object_id = "login_form_password"
login_button_name = "submitBtn"
login_url = "https://video.liverpoolfc.com/mylfctvgo"
#headless option is added so that this can operate in the background.
headless_option = webdriver.ChromeOptions()
headless_option.add_argument("headless")
driver = webdriver.Chrome("/usr/local/bin/chromedriver", options=headless_option)
driver.get(login_url)
driver.implicitly_wait(10)
driver.find_element_by_id(username_object_id).send_keys(username)
driver.find_element_by_id(password_object_id).send_keys(password)
driver.find_element_by_name(login_button_name).click()
#--------------Find most recent game played----------------#
#Clicks on the match section of my account and clicks on the most recent game
#----------------------------------------------------------------#
matches_url = "https://video.liverpoolfc.com/matches"
driver.get(matches_url)
driver.implicitly_wait(10)
latest_game = driver.find_element_by_xpath("/html/body/div[2]/section/ul/li[1]/section/div/div[1]/a").get_attribute('href')
driver.get(latest_game)
driver.implicitly_wait(10)
#--------------Find the full replay video----------------#
#There are many videos to choose from but I only want the full replay of the most recent game.
#--------------------------------------------------#
#institutes a maximum wait time for the page to load; I could have a slow connection one day.
#wait = WebDriverWait(driver,30)
#wait.until(EC.visibility_of_element_located((By.XPATH,"//a[.//h1[contains(text(),'Full match')]]")))
#finds the full match link using an xpath search term, which is in the brackets
full_replay_xpath_element = driver.find_element_by_xpath("//a[.//h1[contains(text(),'Full match')]]")
#gets the value from the 'href' attribute
full_match_link = full_replay_xpath_element.get_attribute('href')
#finds the game title so I know which match the link relates to.
match_title = driver.find_element_by_xpath("//h1[contains(text(),'Full match')]")
#gets the value using innerText
match_title_innertext = match_title.get_attribute("innerText")
#prints both the game title and the link.
print(f"{match_title_innertext} : {full_match_link}")
#An example output is:
#Porto v LFC: Full match : https://video.liverpoolfc.com/player/0_i6064wb1/
I am entirely new to webpage scraping and have been looking at a few YouTube videos and online resources to get me started.
So far, I have been trying to get all the webpage elements from the following website: https://www.letsride.co.uk/routes/search?sort_by=rating
Here is what I have so far:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
s = HTMLSession()
url = 'https://www.letsride.co.uk/routes/search?sort_by=rating'
def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

# for i in range(1, 103):

def getnextpage(soup):
    page = soup.find('ul', {'class': 'pagination'})
    return page

soup = getdata(url)
print(getnextpage(soup))
This prints:
<ul class="pagination">
<li class="disabled"><span>«</span></li>
<li class="active"><span>1</span></li>
<li>2</li>
<li>3</li>
<li>4</li>
<li>5</li>
<li>6</li>
<li>7</li>
<li>8</li>
<li class="disabled"><span>...</span></li>
<li>101</li>
<li>102</li>
<li>»</li>
</ul>
Which is not exactly what I am looking for; I wanted to return the URLs from the first page to the last page, for example:
https://www.letsride.co.uk/routes/search?sort_by=rating&page=1
https://www.letsride.co.uk/routes/search?sort_by=rating&page=2
...
..
.
https://www.letsride.co.uk/routes/search?sort_by=rating&page=102
You can use Selenium with Python to simulate a browser, open the site, and then click the button as many times as you want, or until the button is no longer there. I chose to do it only 10 times because the list seems to be almost infinite.
Then I printed out all the URLs on the site, but you can just as easily store them in a list instead.
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver import ActionChains
import time
options = Options()
options.headless = False
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get("https://www.letsride.co.uk/routes/search?sort_by=rating")
load_more = True
#while load_more:
for i in range(10):
    time.sleep(0.2)
    try:
        load_more_btn = driver.find_element_by_xpath('/html/body/div[2]/section/div[2]/div/div[3]/div/a')
        load_more_btn.click()
    except:
        load_more = False

links = driver.find_elements_by_xpath("//a[@href]")
for link in links:
    print(link.get_attribute('href'))
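As noted, storing the URLs instead of printing them is a one-line change:

hrefs = [link.get_attribute('href') for link in links]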
You could use a cleaning function to get rid of non-URL elements: basically, you need to check each element against a variable that holds the canonical URL prefix (https://...).
I haven't tested this against your code exactly, sorry; I hope you'll be able to adapt it accordingly.
tester = "https://www.letsride.co.uk" #modify this var accordingly to your needs
def cleaner(data):
clean_data = []
for items in data:
if items[0:len(tester)] == tester:
clean_data.append(items)
return clean_data
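Usage would then look something like this (hrefs being the list collected in the previous answer; the name is an assumption):

clean_data = cleaner(hrefs)
print(len(clean_data), "letsride.co.uk URLs kept")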
My code navigates to a website, and on the page there is an article which contains its own link/URL/href.
I want to print this field.
My current code locates the container it is in, and then I try a for loop to get the href.
from selenium import webdriver
driver = webdriver.Chrome()
import time
url = 'https://library.ehaweb.org/eha/#!*menu=6*browseby=8*sortby=2*media=3*ce_id=2035*label=21986*ot_id=25553*marker=1283*featured=17286'
driver.get(url)
time.sleep(3)
page_source = driver.page_source
container = driver.find_element_by_xpath("//div[@class='list-box col-md-6 col-lg-6 col-xl-4 test']")
for j in container:
    link = j.find_element_by_css_selector('a').get_attribute('href')
    print(link)
If I correctly understand what you want, you just need to print the element's child (a) attribute:
link = driver.find_element_by_xpath("//div[@class='list-box col-md-6 col-lg-6 col-xl-4 test']/a").get_attribute("href")
print(link)
This prints:
https://library.ehaweb.org/eha/2021/eha2021-virtual-congress/324511/hanny.al-samkari.pazopanib.for.severe.bleeding.and.transfusion-dependent.html?f=menu%3D6%2Abrowseby%3D8%2Asortby%3D2%2Amedia%3D3%2Ace_id%3D2035%2Alabel%3D21986%2Aot_id%3D25553%2Amarker%3D1283%2Afeatured%3D17286
If you want to use a loop, then change container = driver.find_element_by_xpath("//div[@class='list-box col-md-6 col-lg-6 col-xl-4 test']") to
container = driver.find_elements_by_xpath("//div[@class='list-box col-md-6 col-lg-6 col-xl-4 test']")
For exactly this element the following locator would be enough:
//div[contains(@class, 'test')]/a
With the following code:
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
url = 'https://library.ehaweb.org/eha/#!*menu=6*browseby=8*sortby=2*media=3*ce_id=2035*label=21986*ot_id=25553*marker=1283*featured=17286'
driver.get(url)
driver.implicitly_wait(10)
container = driver.find_elements_by_xpath("//div[contains(@class, 'test')]")
for j in container:
    link = j.find_element_by_css_selector('a').get_attribute('href')
    print(link)
driver.close()
That page contains lots of inner URLs. To click on the EHA 2021 virtual container, you can use the code below.
eha_2021 = driver.find_element_by_css_selector('div#listing-main a')
eha_2021_link = eha_2021.get_attribute('href')
print(eha_2021_link)
Just in case you want to click on COVID-19 Outbreak, you may try the code below.
Code :
covid_19_element = driver.find_element(By.ID, 'menu-8')
covid_19_url = covid_19_element.get_attribute('href')
print(covid_19_url)
Suggestion :
Try to avoid this kind of XPath: //div[@class='list-box col-md-6 col-lg-6 col-xl-4 test']. It looks a bit dynamic and may change region-wise. Always prefer locators in the order below:
ID
Name
TagName
Class Name
Link Text
Partial Link Text
CSS selector
XPATH
This works for me in getting the href:
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
Loop through the list, take each element and fetch the required attribute value you want from it (in this case href).
I am trying to use Selenium and BeautifulSoup to extract some information from https://superbet.ro/pariuri-sportive/live.
I created the URLs for the live matches, and now I'm iterating through them to extract some statistics. But the STATISTICS tab is not loading when I use this code:
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def get_soup(url):
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    time.sleep(3)
    page = driver.execute_script('return document.body.innerHTML')
    driver.quit()
    soup = BeautifulSoup(page, 'html.parser')
    print(soup)
    return soup
So I'm trying to click the Statistics tab to find the divs I need, because the HTML obtained in my script is only partially loaded and differs from the original one in the Chrome developer tools.
Here are the differences between what I get and what I need:
<div class="statistics__content">
<div class="sa-sdk-v5">
<div class="sa-sdk-unknown-tab" eventdetails="[object Object]">
Here are the divs I need
I don't know exactly how to click on Statistics because I don't have any button tag.
Here are the tabs
Finally, I solved the problem by clicking on that tab.
I would say stick with Selenium for this process.
You will need to locate the element using Selenium: in your case, grab all the matches and then follow each one's relative path to find the box you can click on. I don't think it has to be a button.
Then you can use something like this:
def wait_for_field(self, xpath, driver, interval=10):
    try:
        element = WebDriverWait(driver, interval).until(EC.presence_of_element_located((By.XPATH, xpath)))
    except Exception as e:
        raise CrawlerException(xpath + " failed, " + str(e))
    return element

def click_on_match(self, page_browser):
    try:
        element_to_click = self.wait_for_field("**fill out here**", page_browser, interval=5)
        print("found match to click")
        page_browser.execute_script("arguments[0].click()", element_to_click)
    except:
        pass
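For completeness, the snippet above assumes these imports plus a custom CrawlerException class that isn't shown; a minimal stand-in:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

class CrawlerException(Exception):
    """Stand-in for the crawler's custom exception (assumed, not shown in the answer)."""
    pass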
Try this:
import requests
url = "https://old.superbet.ro/rest/SBWeb.Models.Casino/getAllGames"
r = requests.get(url)
json_data = r.json()
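The structure of this endpoint's JSON isn't documented, so it's worth peeking at what actually comes back before relying on any keys:

print(r.status_code)
if isinstance(json_data, dict):
    print(list(json_data.keys()))   # peek at the top-level keys
elif isinstance(json_data, list):
    print(len(json_data))           # or the number of records returned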
My main HTML page has an iframe in it, and I need to get the text Code: LWBAD that lives there.
Check the picture for a better understanding:
Below is my main HTML page source, which has an iframe in it:
<td class="centerdata flag"><iframe style="width: 200px; height: 206px;" scrolling="no" src="https://www.example.com/test/somewhere" ></iframe></td>
The redirect link (the iframe page) has this HTML source:
<body>
<a href="http://www.test2.com" target="_blank">
<img src="https://img2.test2.com/LWBAD-1.jpg"></a>
<br/>Code: LWBAD
So far I can get the complete page source of my main HTML page.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import html5lib
driver_path = '/usr/local/bin/chromedriver 2'
driver = webdriver.Chrome(driver_path)
driver.implicitly_wait(10)
driver.get('http://example.com')
try:
    time.sleep(4)
    iframe = driver.find_elements_by_tag_name('iframe')
    driver.switch_to_default_content()
    output = driver.page_source
    print(output)
finally:
    driver.quit()
*URLs are not accessible from outside of my network; that's why I used example.com
You should use:
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to.frame(iframe)
# your work to extract link
driver.switch_to_default_content()
For multiple iframes: find_elements_by_tag_name will return a list, so use a for loop.
iframe = driver.find_elements_by_tag_name('iframe')
for i in iframe:
    driver.switch_to.frame(i)
    # your work to extract link
    driver.switch_to_default_content()
To get only the text, use
text = driver.find_element_by_tag_name('body').text
after driver.switch_to.frame(i).
Try this:
iframe = driver.find_elements_by_tag_name('iframe')
for i in range(0, len(iframe)):
    # re-find the frames on each pass to avoid stale element references
    f = driver.find_elements_by_tag_name('iframe')[i]
    driver.switch_to.frame(f)
    # your work to extract link
    text = driver.find_element_by_tag_name('body').text
    print(text)
    driver.switch_to_default_content()
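One small note: newer Selenium releases deprecate the switch_to_default_content() shorthand in favour of the switch_to object, so the equivalent modern call is:

driver.switch_to.default_content()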