I'm fairly new to the web scraping world but I really need to do some web scraping on the Thesaurus website for a project I'm working on. I have successfully created a program using beautifulsoup4 that asks the user for a word, then returns the most likely synonyms based on Thesaurus. However, I would like to not only have those synonyms but also the synonyms of every sense of the word (which is depicted on Thesaurus by a list of buttons above the synonyms). I noticed that when clicking a button, the name of the classes also change, so I did a little digging and decided to go with Selenium instead of beautifulsoup.
I now have code that types a word into the search bar and submits it; however, I'm unable to get the synonyms or the aforementioned buttons, simply because find_element finds nothing, and being new to this, I'm afraid I'm using the wrong syntax.
This is my code at the moment (it looks for synonyms of "good"):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time
PATH = r"C:\Program Files (x86)\chromedriver_win32\chromedriver.exe"  # raw string so the backslashes aren't treated as escapes
driver = webdriver.Chrome(PATH)
driver.get("https://thesaurus.com")
search = driver.find_element_by_id("searchbar_input")
search.send_keys('good')
search.send_keys(Keys.RETURN)
try:
    headword = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "headword"))
    )
    print(headword.text)
    #buttons = headword.find_element_by_class_name("css-bjn8wh e1br8a1p0")
    #print(buttons.text)
    meanings = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "meanings"))
    )
    print(meanings.text)
    #words = meanings.find_elements_by_class_name("css-1kg1yv8 eh475bn0")
    #print(words.text)
except:
    print('failed')
driver.quit()
For the first part, I want to access the buttons. The headword is simply the element that contains all the buttons I want to press. This is the headword element according to the inspect tool:
<div id="headword" class="css-bjn8wh e1br8a1p0">
<div class="css-vw3jp5 e1ibdjtj4">
*unnecessary stuff*
<div class="css-bjn8wh e1br8a1p0">
<div class="postab-container css-cthfds ew5makj3">
<ul class="css-gap396 ew5makj2">
<li data-test-pos-tab="true" class="active-postab css-kgfkmr ew5makj4">
<a class="css-sc11zf ew5makj1">
<em class="css-1v93s5a ew5makj0">adj.</em>
<strong>pleasant, fine</strong>
</a>
</li>
<li data-test-pos-tab="true" class=" css-1ha4k0a ew5makj4">
*similar stuff*
<li data-test-pos-tab="true" class=" css-1ha4k0a ew5makj4">
...
where each one of these <li data-test-pos-tab="true" class=" css-1ha4k0a ew5makj4"> elements is a button I want to click. So far I have tried a bunch of things like the one shown in the code, and also things like:
buttons = headword.find_elements_by_class_name("css-1ha4k0a ew5makj4")
buttons = headword.find_elements_by_css_selector("css-1ha4k0a ew5makj4")
buttons = headword.find_elements_by_class_name("postab-container css-cthfds ew5makj3")
buttons = headword.find_elements_by_css_selector("postab-container css-cthfds ew5makj3")
but in none of these cases can Selenium find the elements.
For the second part I want the synonyms. Here is the meaning element:
<div id="meanings" class="css-16lv1yi e1qo4u831">
<div class="css-1f3egm3 efhksxz0">
*unnecessary stuff*
<div data-testid="word-grid-container" class="css-ixatld e1cc71bi0">
<ul class="css-1ngwve3 e1ccqdb60">
<li>
<a font-weight="inherit" href="/browse/acceptable" data-linkid="nn1ov4" class="css-1kg1yv8 eh475bn0">
</a>
</li>
<li>
<a font-weight="inherit" href="/browse/bad" data-linkid="nn1ov4" class="css-1kg1yv8 eh475bn0">
...
where each of these elements is a synonym I want to get. Similarly to the previous case I tried several things such as:
synGrid = meanings.find_element_by_class_name("css-ixatld e1cc71bi0")
synGrid = meanings.find_element_by_css_selector("css-ixatld e1cc71bi0")
words = meanings.find_elements_by_class_name("css-1kg1yv8 eh475bn0")
words = meanings.find_elements_by_css_selector("css-1kg1yv8 eh475bn0")
And again Selenium cannot find these elements...
I would really appreciate some help in order to achieve this, even if it is just a push in the right direction instead of giving a full solution.
Hope I wrote all the needed information, if not, please let me know.
If you use a CSS selector then you have to use a dot for a class
css_selector(".css-ixatld.e1cc71bi0")
and a hash for an id
css_selector("#headword")
just like you would in a .css file.
CSS selectors also give you the other matching methods available in CSS. See the CSS selector reference on w3schools.com.
Selenium converts class_name to a CSS selector internally, but class_name() expects a single name, and Selenium has problems when there are two or more. When it converts class_name to a CSS selector it adds a dot only before the first name, while the second and subsequent names also need one, so you have to add the extra dots manually:
class_name("css-ixatld.e1cc71bi0")
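Applied to the locators from the question, the rule means joining every class with a dot, with no spaces. A throwaway helper (my own, not part of Selenium's API) makes the conversion explicit:

```python
def classes_to_css(class_attr):
    """Convert a multi-class attribute value such as 'css-ixatld e1cc71bi0'
    into the CSS selector form Selenium needs: '.css-ixatld.e1cc71bi0'."""
    return "".join("." + name for name in class_attr.split())

print(classes_to_css("css-ixatld e1cc71bi0"))                  # .css-ixatld.e1cc71bi0
print(classes_to_css("postab-container css-cthfds ew5makj3"))  # .postab-container.css-cthfds.ew5makj3
```

So the failing call would become meanings.find_element_by_css_selector(".css-ixatld.e1cc71bi0"), keeping in mind that these css-* names are auto-generated and may change between deployments.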
See if this works:
meanings = driver.find_elements_by_xpath(".//div[@id='meanings']/div[@data-testid='word-grid-container']/ul/li")
for e in meanings:
    e.find_element_by_tag_name("a").click()
    # add an implicit wait here if you need one
    driver.back()
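For the buttons themselves, the data-test-pos-tab attribute from the question's markup looks hand-written rather than auto-generated, so it should be a more stable hook than the css-* class names. The locator can be sanity-checked offline against a reduced copy of the snippet using the standard library's limited XPath support (attributes are taken from the question; the tab texts beyond the first are placeholders):

```python
import xml.etree.ElementTree as ET

# Reduced copy of the markup shown in the question.
snippet = """
<div id="headword">
  <ul>
    <li data-test-pos-tab="true"><a><em>adj.</em><strong>pleasant, fine</strong></a></li>
    <li data-test-pos-tab="true"><a><em>adj.</em><strong>placeholder</strong></a></li>
  </ul>
</div>
"""
root = ET.fromstring(snippet)
# Same predicate Selenium would use in find_elements_by_xpath.
tabs = root.findall(".//li[@data-test-pos-tab='true']")
print(len(tabs))  # 2
```

In Selenium the equivalent would be driver.find_elements_by_xpath("//li[@data-test-pos-tab='true']"), after which each tab element can be clicked in turn.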
Using Selenium (Python) to avoid spoilers of a soccer game
I am trying to grab the URL for a video of a soccer match replay from a dynamically changing webpage. The webpage shows the score, and I'd rather get the link directly than visit the website, which will almost certainly show me the score. There are other related videos of the match, like a 10-minute highlight reel, but I would like the full replay only.
There is a list of videos on the page to choose from, but the h1 heading indicating it's a full replay is wrapped inside the a tag (see below). There are ~10 of these list items on the page, and they are distinguished only by the content of the h1, buried as a child. The text I'm after is Brentford v LFC : Full match; the "full match" part is the giveaway.
My problem is: how do I get the link when the important information sits in a descendant element?
<li data-sidebar-video="0_5de4sioh" class="js-subscribe-entitlement">
<a class="" href="//video.liverpoolfc.com/player/0_5de4sioh/">
<article class="video-thumb video-thumb--fade-in js-thumb video-thumb--no-duration video-thumb--sidebar">
<figure class="video-thumb__img">
<div class="site-loader">
<ul>
<li></li>
<li></li>
<li></li>
</ul>
</div> <img class="video-thumb__img-container loaded" data-src="//open.http.mp.streamamg.com/p/101/thumbnail/entry_id/0_5de4sioh/width/150/height/90/type/3" alt="Brentford v LFC : Full match" onerror="PULSE.app.common.VideoThumbError(this)" onload="PULSE.app.common.VideoThumbLoaded(this)"
src="//open.http.mp.streamamg.com/p/101/thumbnail/entry_id/0_5de4sioh/width/150/height/90/type/3" data-image-initialised="true"> <span class="video-thumb__premium">Premium</span> <i class="video-thumb__play-btn"></i> <span class="video-thumb__time"> <i class="video-thumb__icon"></i> 1:45:07 </span> </figure>
<div class="video-thumb__txt-container"> <span class="video-thumb__tag js-video-tag">Match Action</span>
<h1 class="video-thumb__heading">Brentford v LFC : Full match</h1> <time class="video-thumb__date">25th Sep 2021</time> </div>
</article>
</a>
</li>
My code looks like this at the moment. It gives me a list of the links but I don't know which one is which.
from selenium import webdriver
#------------------------Account login---------------------------#
#I have to login to my account first.
#----------------------------------------------------------------#
username = "<my username goes here>"
password = "<my password goes here>"
username_object_id = "login_form_username"
password_object_id = "login_form_password"
login_button_name = "submitBtn"
login_url = "https://video.liverpoolfc.com/mylfctvgo"
driver = webdriver.Chrome("/usr/local/bin/chromedriver")
driver.get(login_url)
driver.implicitly_wait(10)
driver.find_element_by_id(username_object_id).send_keys(username)
driver.find_element_by_id(password_object_id).send_keys(password)
driver.find_element_by_name(login_button_name).click()
#--------------Find most recent game played----------------#
#I have to go to the matches section of my account and click on the most recent game
#----------------------------------------------------------------#
matches_url = "https://video.liverpoolfc.com/matches"
driver.get(matches_url)
driver.implicitly_wait(10)
latest_game = driver.find_element_by_xpath("/html/body/div[2]/section/ul/li[1]/section/div/div[1]/a").get_attribute('href')
driver.get(latest_game)
driver.implicitly_wait(10)
#--------------Find the full replay video----------------#
#There are many videos to choose from but I only want the full replay.
#--------------------------------------------------#
#prints all the videos in the list. They all have the same "data-sidebar-video" attribute
web_element1 = driver.find_elements_by_css_selector('li[data-sidebar-video*=""] > a')
print(web_element1)
for i in web_element1:
    print(i.get_attribute('href'))
You can do this with a simple XPath locator since you are searching based on contained text.
//a[.//h1[contains(text(),'Full match')]]
^ an A tag
^ that has an H1 descendant
^ that contains the text "Full match"
NOTE: You can't just get the href from the A tag since it isn't a complete URL, e.g. //video.liverpoolfc.com/player/0_5de4sioh/. I would suggest you just click on the link. If you want to write it to a file, you'll have to append "https:" to the front of these partial URLs to make them usable.
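Since the href values are protocol-relative, urljoin against the page URL is a convenient way to complete them if you do want to write the links to a file:

```python
from urllib.parse import urljoin

# The hrefs on the page are protocol-relative ("//video.liverpoolfc.com/...");
# urljoin against the page URL fills in the missing scheme.
page_url = "https://video.liverpoolfc.com/matches"
href = "//video.liverpoolfc.com/player/0_5de4sioh/"
full = urljoin(page_url, href)
print(full)  # https://video.liverpoolfc.com/player/0_5de4sioh/
```

In Selenium this would be applied to the value returned by get_attribute('href'); note that Selenium often returns an already-absolute URL here, in which case urljoin simply leaves it unchanged.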
You can try it like below.
Extract the list of videos via their li tags; check whether the h1 tag inside each list item contains Full match, and if so, get the a tag and its href.
# Imports Required:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver.get("https://video.liverpoolfc.com/player/0_5j5fsdzg/?contentReferences=FOOTBALL_FIXTURE%3Ag2210322&page=0&pageSize=20&sortOrder=desc&title=Highlights%3A%20Brentford%203-3%20LFC&listType=LIST-DEFAULT")
wait = WebDriverWait(driver,30)
wait.until(EC.visibility_of_element_located((By.XPATH,"//ul[contains(@class,'related-videos')]/li")))
videos = driver.find_elements_by_xpath("//ul[contains(@class,'related-videos')]/li")
for video in videos:
    option = video.find_element_by_tag_name("h1").get_attribute("innerText")
    if "Full match" in option:
        link = video.find_element_by_tag_name("a").get_attribute("href")
        print(f"{option} : {link}")
Brentford v LFC : Full match : https://video.liverpoolfc.com/player/0_5de4sioh/
You can use driver.execute_script to grab only the links that have the "Full match" designation as a child:
links = driver.execute_script('''
    var links = [];
    for (var i of document.querySelectorAll('li[data-sidebar-video*=""] > a')){
        if (i.querySelector('h1.video-thumb__heading').textContent.endsWith('Full match')){
            links.push(i.getAttribute('href'));
        }
    }
    return links;
''')
This is what worked. I used both @JeffC's and @pmadhu's responses to get stable, working code. I also added a headless option so you can run the code without having to view the webpages, which inadvertently might show you the score you're trying to avoid! As a result I had to remove the two lines of wait code, which I've commented out in case you want to keep them.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
#------------------------Account login---------------------------#
#Logs into my account
#----------------------------------------------------------------#
username = "" #<----my username goes here
password = "" #<----my password goes here
username_object_id = "login_form_username"
password_object_id = "login_form_password"
login_button_name = "submitBtn"
login_url = "https://video.liverpoolfc.com/mylfctvgo"
#headless option is added so that this can operate in the background.
headless_option = webdriver.ChromeOptions()
headless_option.add_argument("headless")
driver = webdriver.Chrome("/usr/local/bin/chromedriver", options=headless_option)
driver.get(login_url)
driver.implicitly_wait(10)
driver.find_element_by_id(username_object_id).send_keys(username)
driver.find_element_by_id(password_object_id).send_keys(password)
driver.find_element_by_name(login_button_name).click()
#--------------Find most recent game played----------------#
#Clicks on the match section of my account and clicks on the most recent game
#----------------------------------------------------------------#
matches_url = "https://video.liverpoolfc.com/matches"
driver.get(matches_url)
driver.implicitly_wait(10)
latest_game = driver.find_element_by_xpath("/html/body/div[2]/section/ul/li[1]/section/div/div[1]/a").get_attribute('href')
driver.get(latest_game)
driver.implicitly_wait(10)
#--------------Find the full replay video----------------#
#There are many videos to choose from but I only want the full replay of the most recent game.
#--------------------------------------------------#
#institutes a maximum wait time for the page to load; I could have a slow connection one day.
#wait = WebDriverWait(driver,30)
#wait.until(EC.visibility_of_element_located((By.XPATH,"//a[.//h1[contains(text(),'Full match')]]")))
#finds the full match link using an xpath search term, which is in the brackets
full_replay_xpath_element = driver.find_element_by_xpath("//a[.//h1[contains(text(),'Full match')]]")
#gets the value from the 'href' attribute
full_match_link = full_replay_xpath_element.get_attribute('href')
#finds the game title so I know what match relates to link I'm getting.
match_title = driver.find_element_by_xpath("//h1[contains(text(),'Full match')]")
#gets the value using innerText
match_title_innertext = match_title.get_attribute("innerText")
#prints both the game title and the link.
print(f"{match_title_innertext} : {full_match_link}")
#An example output is:
#Porto v LFC: Full match : https://video.liverpoolfc.com/player/0_i6064wb1/
I am trying to select the title of posts that are loaded in a webpage by integrating multiple css selectors. See below my process:
Load relevant libraries
import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Then load the content I wish to analyse
options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)
browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development", Keys.ENTER)
time.sleep(5)
scrolls = 2
while True:
    scrolls -= 1
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break
Then to get the content for each selector separately, call for css_selector
titles = browser.find_elements_by_css_selector("h3[class^='graf']")
TitlesList = []
for names in titles:
    TitlesList.append(names.text)
times = browser.find_elements_by_css_selector("time[datetime^='2016']")
Times = []
for names in times:
    Times.append(names.text)
It all works so far... Now I am trying to bring them together with the aim of identifying only choices from 2016:
choices = browser.find_elements_by_css_selector("time[datetime^='2016'] and h3[class^='graf']")
browser.quit()
On this last snippet, I always get an empty list.
So I wonder: 1) how I can select multiple elements by treating different css_selectors as simultaneous conditions for selection; 2) whether the syntax for finding elements under multiple conditions is the same when linking elements via different approaches like css_selector or XPath; and 3) whether there is a way to get the text of elements identified by multiple css selectors, along the lines of:
[pair.text for pair in browser.find_elements_by_css_selector("h3[class^='graf']") if pair.text]
Thanks
Firstly, I think what you're trying to do is to get every title posted in 2016, right?
You're using the CSS selector "time[datetime^='2016'] and h3[class^='graf']", but this will not work because the syntax is not valid ("and" is not valid in a CSS selector). Besides, these are two different elements, and a CSS selector can only target one element. In your case, to add a condition from another element, anchor on a common element, such as a shared parent.
I've checked the site; here's the HTML you need to look at (if you're trying to get the titles published in 2016). This is the minimal HTML that can help you identify what you need to get.
<div class="postArticle postArticle--short js-postArticle js-trackPostPresentation" data-post-id="d17220aecaa8"
data-source="search_post---------2">
<div class="u-clearfix u-marginBottom15 u-paddingTop5">
<div class="postMetaInline u-floatLeft u-sm-maxWidthFullWidth">
<div class="u-flexCenter">
<div class="postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis">
<div
class="ui-caption u-fontSize12 u-baseColor--textNormal u-textColorNormal js-postMetaInlineSupplemental">
<a class="link link--darken"
href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action="open-post"
data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action-source="preview-listing">
<time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="postArticle-content">
<a href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action="open-post" data-action-source="search_post---------2"
data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action-index="2" data-post-id="d17220aecaa8">
<section class="section section--body section--first section--last">
<div class="section-divider">
<hr class="section-divider">
</div>
<div class="section-content">
<div class="section-inner sectionLayout--insetColumn">
<h3 name="5910" id="5910" class="graf graf--h3 graf--leading graf--title">Reimagining
International Development for the 21st Century.</h3>
</div>
</div>
</section>
</a>
</div>
</div>
Both time and h3 are inside a big div with a class of postArticle. The article contains both the publication time and the title, so it makes sense to grab each whole article div published in 2016.
Using XPath is much more powerful and easier to write:
This will get all article divs whose class contains postArticle--short: article_xpath = '//div[contains(@class, "postArticle--short")]'
This will get all time tags whose datetime attribute contains 2016: //time[contains(@datetime, "2016")]
Let's combine both of them. I want to get the article divs that contain a time tag with a datetime in 2016:
article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'
article_element_list = driver.find_elements_by_xpath(article_2016_xpath)
# now let's get the title
for article in article_element_list:
    title = article.find_element_by_tag_name("h3").text
I haven't tested the code yet, only the xpath. You might need to adapt the code to work on your side.
By the way, reaching straight for find_element... is not a good idea; try using explicit waits: https://selenium-python.readthedocs.io/waits.html
This will help you avoid clumsy time.sleep calls, improve your app's performance, and let you handle errors properly.
Only use find_element... when you have already located an element and need to find a child element inside it. For example, in this case I would find the articles with an explicit wait, and once each element is located, use find_element... to find the child h3.
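Under the hood, an explicit wait is just a poll-until-truthy loop with a deadline. A minimal sketch of the idea (an illustration only, not Selenium's actual WebDriverWait implementation):

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    """Poll condition() until it returns something truthy or timeout elapses.
    WebDriverWait.until() follows the same pattern, with richer error handling."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll)

# Toy usage: the "element" only turns up on the third poll.
state = {"tries": 0}
def fake_element_located():
    state["tries"] += 1
    return "element" if state["tries"] >= 3 else None

print(wait_until(fake_element_located, timeout=2.0, poll=0.01))  # element
```

This is why an explicit wait returns as soon as the condition holds, instead of always sleeping for the full duration the way time.sleep does.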
So, I'm currently using Python to import data from an Excel sheet and then using that information to fill out a form on a webpage.
The problem I'm having is selecting a profile of the drop-down menu.
I've been using the Selenium library, and I can actually select the element using find_element_by_xpath, but that assumes I know the data value; since the data value is auto-generated for each new profile that's added, I can't rely on it.
Profile = Browser.find_element_by_xpath("/html/something/something/.....")
Profile.click()
time.sleep(0.75) #allowing time for link to be clickable
The_Guy = Browser.find_element_by_xpath("/html/something/something/...")
The_Guy.click()
This works only on known paths I would like to do something like this
Profile = Browser.find_element_by_xpath("/html/something/something/.....")
Profile.click()
time.sleep(0.75) #allowing time for link to be clickable
The_Guy = Browser.find_element_by_id("Caption.A")
The_Guy.click()
EXAMPLE OF HTML
<ul class="list">
  <li class="option" data-value="XXXXX-XXXXX-XXXXX-XX-XXX">
    Thor
  </li>
  <li class="option" data-value="XXXXX-XXXXX-XXXXX-XX-XXX">
    IronMan
  </li>
  <li class="option" data-value="XXXXX-XXXXX-XXXXX-XX-XXX">
    Caption.A
  </li>
  ....
</ul>
What I'd like to be able to do is search by name (like Caption.A) and then step back to select the parent li. Thanks in advance.
Try using the following XPath to find the li containing the desired text, and then click on it. Sample code:
driver.find_element_by_xpath("//li[contains(text(), 'Caption.A')]").click()
Hope it helps :)
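The same idea can be sanity-checked offline with the standard library's ElementTree, which supports only a small XPath subset (no contains()), so the text filter is done in Python here; the data-value strings below are placeholders:

```python
import xml.etree.ElementTree as ET

# Reduced copy of the question's list; data-value strings are placeholders.
snippet = """
<ul class="list">
  <li class="option" data-value="xxx-1">Thor</li>
  <li class="option" data-value="xxx-2">IronMan</li>
  <li class="option" data-value="xxx-3">Caption.A</li>
</ul>
"""
root = ET.fromstring(snippet)
# ElementTree's XPath subset lacks contains(), so filter by text in Python;
# in Selenium, //li[contains(text(), 'Caption.A')] does this in one step.
target = next(li for li in root.findall(".//li") if "Caption.A" in (li.text or ""))
print(target.get("data-value"))  # xxx-3
```

Matching by text then reading the li's attributes gets you the auto-generated data-value without hard-coding it, which is exactly what the question was after.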
I am trying to print off some housing prices and am having trouble using Xpath. Here's my code:
from selenium import webdriver
driver = webdriver.Chrome("my/path/here")
driver.get("https://www.realtor.com/realestateandhomes-search/?pgsz=10")
for house_number in range(1, 11):
    try:
        price = driver.find_element_by_xpath('//*[@id="{}"]/div[2]/div[1]'.format(house_number))
        print(price.text)
    except:
        print('couldnt find')
I am on this website, trying to print off the housing prices of the first ten houses.
My output is that for all the houses that say "NEW", that gets taken as the price instead of the actual price. But for the bottom two, which don't have that NEW sticker, the actual price is recorded.
How do I make my Xpath selector so it selects the numbers and not NEW?
You can write it like this, skipping image loading, which can increase your fetching speed:
from selenium import webdriver
# Unloaded image
chrome_opt = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_opt.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(chrome_options=chrome_opt,executable_path="my/path/here")
driver.get("https://www.realtor.com/realestateandhomes-search/Bladen-County_NC/sby-6/pg-1?pgsz=10")
for house_number in range(1, 11):
    try:
        price = driver.find_element_by_xpath('//*[@id="{}"]/div[2]/div[@class="srp-item-price"]'.format(house_number))
        print(price.text)
    except:
        print('couldnt find')
You're on the right track; you've just made an XPath that is too brittle. I would try making it a little more verbose, without relying on indices and wildcards.
Here's your XPath (I used id="1" for example purposes):
//*[@id="1"]/div[2]/div[1]
And here's the HTML (some attributes/elements removed for brevity):
<li id="1">
<div></div>
<div class="srp-item-body">
<div>New</div><!-- this is optional! -->
<div class="srp-item-price">$100,000</div>
</div>
</li>
First, replace the * wildcard with the element that you are expecting to contain the id="1". This simply serves as a way to help "self-document" the XPath a little bit better:
//li[@id="1"]/div[2]/div[1]
Next, you want to target the second <div>, but instead of searching by index, try to use the element's attributes if applicable, such as class:
//li[@id="1"]/div[@class="srp-item-body"]/div[1]
Lastly, you want to target the <div> with the price. Since the "New" text was in its own <div>, your XPath was targeting the first <div> ("New"), not the <div> with the price. Your XPath would have worked, however, if the "New" text <div> did not exist.
We can use a similar method as the previous step, targeting by attribute. This forces the XPath to always target the <div> with the price:
//li[@id="1"]/div[@class="srp-item-body"]/div[@class="srp-item-price"]
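The finished locator can be sanity-checked offline against the reduced listing markup above, using ElementTree's XPath subset from the standard library:

```python
import xml.etree.ElementTree as ET

# The reduced listing markup from this answer.
snippet = """
<li id="1">
  <div></div>
  <div class="srp-item-body">
    <div>New</div>
    <div class="srp-item-price">$100,000</div>
  </div>
</li>
"""
li = ET.fromstring(snippet)
# Same steps as the final XPath, run relative to the li element:
price = li.find("./div[@class='srp-item-body']/div[@class='srp-item-price']")
print(price.text)  # $100,000
```

Because the locator targets the class rather than an index, it keeps matching the price whether or not the optional "New" div is present.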
Hope this helps!
And so... having said all of that, if you are just interested in the prices and nothing else, this would probably also work :)
for price in driver.find_elements_by_class_name('srp-item-price'):
    print(price.text)
Can you try this code:
from selenium import webdriver
driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://www.realtor.com/realestateandhomes-search/Bladen-County_NC/sby-6/pg-1?pgsz=10")
prices = driver.find_elements_by_xpath('//*[@class="data-price-display"]')
for price in prices:
    print(price.text)
It will print
$39,900
$86,500
$39,500
$40,000
$179,000
$31,000
$104,900
$94,900
$54,900
$19,900
Do let me know if any other details are also required
I am working with a web page that needs some automation and having trouble interacting with certain elements due to their structure. Brief example:
<ul>
<li data-title="Search" data-action="search">
<li class="disabled" data-title="Ticket Grid" data-action="ticket-grid">
<li data-title="Create Ticket" data-action="create">
<li data-title="Settings" data-action="settings">
</ul>
I am aware of all the locator strategies like id and name listed here:
http://selenium-python.readthedocs.org/en/latest/locating-elements.html
However, is there a way to specify finding something by a custom value like in this example "data-title"?
You can use CSS to select any attribute, this is what the formula looks like:
element[attribute(*|^|$|~)='value']
Per your example, it would be:
li[data-title='Ticket Grid']
(source http://ddavison.io/css/2014/02/18/effective-css-selectors.html)
If there are multiple possibilities, it is also worth knowing the following option:
from selenium.webdriver import Firefox
driver = Firefox()
driver.get(<your_html>)
li_list = driver.find_elements_by_tag_name('li')
for li in li_list:
    if li.get_attribute('data-title') == '<wanted_value>':
        <do_your_thing>
You can use:
"//li[@data-title='Ticket Grid']"
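An attribute-based locator like this can be sanity-checked offline against the question's snippet with ElementTree's XPath subset (via the XPath form, since the standard library cannot evaluate CSS selectors):

```python
import xml.etree.ElementTree as ET

# Reduced copy of the markup from the question.
snippet = """
<ul>
  <li data-title="Search" data-action="search"/>
  <li class="disabled" data-title="Ticket Grid" data-action="ticket-grid"/>
  <li data-title="Create Ticket" data-action="create"/>
  <li data-title="Settings" data-action="settings"/>
</ul>
"""
root = ET.fromstring(snippet)
grid = root.find(".//li[@data-title='Ticket Grid']")
print(grid.get("data-action"))  # ticket-grid
```

In Selenium this corresponds to find_element_by_css_selector("li[data-title='Ticket Grid']") or find_element_by_xpath("//li[@data-title='Ticket Grid']"); custom data-* attributes work with the standard locator strategies just like any other attribute.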