I am trying to write code with Selenium in Python.
I am working with a site like https://www.thewatchcartoononline.tv/anime/south-park-season-1. As you can see, this is the page for the series, with links to all of its episodes.
I want to get the link of a given episode (the user chooses which one).
It is important to note that not every series page uses the same naming format for the episodes: some series have only "Episode 1" in the link text, while others may have "South Park season 1 episode 1", so I can't count on the naming format of the link's text.
This is the code I used to get the link to the episode (episode_num is given by the user):
from selenium.webdriver import Chrome

episode_num = 1
chrome_driver = Chrome()
chrome_driver.get("https://www.thewatchcartoononline.tv/anime/south-park-season-1")
# This XPath goes to the div of the episode list, then searches for a
# link whose text contains the given episode string
links = chrome_driver.find_elements_by_xpath(
    f"//*[@id='sidebar_right3']//"
    f"a[contains(text(), 'Episode {episode_num}')]"
)
However, when I check links I see more than one result: I get both episode 1 and episode 10 (since both of them contain the string "Episode 1").
Is there a way to get only the link I want (maybe by making Selenium take the link that doesn't have any digit after the text I want)?
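One way to express "no digit right after the number" is to over-match with the XPath above and then filter on the Python side with a regex negative lookahead. A rough sketch of that idea (the filtering step is illustrative, not part of the original code):
import re

# Keep only links whose text contains "Episode <num>" NOT followed by
# another digit, so "Episode 1" matches but "Episode 10" does not
pattern = re.compile(rf"Episode {episode_num}(?!\d)")
exact_links = [link for link in links if pattern.search(link.text)]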
EDIT:
Well, kind of ugly, but in XPath 1.0 I think this is the best you can do:
links = chrome_driver.find_elements_by_xpath(
    f"//*[@id='sidebar_right3']//a["
    f"contains(., 'Episode {episode_num} ')"
    f" or substring(text(), string-length(text()) - string-length('Episode {episode_num}') + 1) = 'Episode {episode_num}'"
    f" or contains(., 'Episode {episode_num}-')"
    f"]"
)
Finds Episode 10-11 for episode_num = 10 but not for episode_num = 11.
Checks for:
- 'Episode x ' (with a trailing space) is in the text()
- the text() ends with 'Episode x'
- 'Episode x-' is in the text()
I was checking the URLs of the episodes. Wouldn't it be a better approach to rely on the @href instead of the text()? This is a bit shorter:
links = chrome_driver.find_elements_by_xpath(
    f"//*[@id='sidebar_right3']//a["
    f"contains(@href, 'episode-{episode_num}-')"
    f" or substring(@href, string-length(@href) - string-length('episode-{episode_num}') + 1) = 'episode-{episode_num}'"
    f"]"
)
Checks for:
- 'episode-x-' is in the URL
- the URL ends with 'episode-x'
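As a quick usage sketch (assuming the page is loaded and the XPath above matched at least one link), the episode URL can then be read off the first match:
if links:
    episode_url = links[0].get_attribute("href")
    print(episode_url)
else:
    print(f"No link found for episode {episode_num}")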
Try the following XPath. It uses last(), which should give a count of 1:
links = chrome_driver.find_elements_by_xpath(
    f"(//*[@id='sidebar_right3']//a[contains(text(), 'Episode {episode_num}')])[last()]"
)
print(len(links))
As an exercise for learning Python and Selenium, I'm trying to write a script that checks a web page with all kinds of commercial deals, finds all the food deals (class name 'tag-food'), puts them in a list (elem), checks which ones contain the text 'sushi', and for those elements extracts the HTML element which contains the price, then prints the results.
I have:
elem = driver.find_elements_by_class_name('tag-food')
i = 0
while i < len(elem):
    source_code = elem[i].get_attribute("innerHTML")
    # ?? how to check if source_code contains 'sushi'?
    # ?? if true how to extract price data?
    i = i + 1
driver.quit()
What's the best and most direct way to do these checks? Thanks! 🙏
I don't think you need a while loop for this. Also, you would be looking for a text value, not innerHTML.
You can make it simpler, like this:
for row in driver.find_elements_by_class_name('tag-food'):
    if "sushi" in row.get_attribute("innerText"):
        print("Yes this item has sushi")
        # find element to grab price, store in variable to do something else with
    else:
        print("No sushi in this item")
Or even just this, depending on how the text in the HTML is structured:
for row in driver.find_elements_by_class_name('tag-food'):
    if "sushi" in row.text:
        print("Yes this item has sushi")
        # find element to grab price, store in variable to do something else with
    else:
        print("No sushi in this item")
I am running a scraping script made with Beautiful Soup. I scrape results from Google News, and I want to add only the first n results to a variable as tuples.
Each tuple is made of a news title and a news link. In the full script I have a list of keywords like ['crisis', 'finance'] and so on; you can disregard that part.
This is the code:
import re
import bs4, requests

# keyword_list is defined elsewhere in the full script
articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')
for i in webcontent.findAll('div', {'jslog': '93789'}):
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
Written like this, it appends every news title and link that fulfills the if statement, which may result in a long list. I'd like to take only the first n news items, let's suppose five, and then I would like the script to stop.
I tried:
for _ in range(5):
but I don't understand where to add it exactly, because either the code does not run or it appends the same news item 5 times.
I also tried:
while len(articles_list) < 5:
but as the statement is part of a for loop and the variable articles_list is global, it also stops appending for the next objects of the scraping.
And finally I tried:
for tuples in articles_list[0:5]:  # Iterate over the first five tuples
    for element in tuples:  # Print title and link
        print(element)
    print('-' * 80)  # Print a divider
I am OK with this last one if there are no alternatives, but I'd rather avoid it, as the variable articles_list would still contain more elements than I need.
Can you please help me understand what I am missing?
Thanks!
You have a double loop in your code. To exit both of them, you will need to use break twice, once for each loop. You can break on the same condition in both loops.
Try this code:
import re
import bs4, requests

keyword_list = ['health', 'Coronavirus', 'travel']
articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')
maxcnt = 5  # max number of articles
for i in webcontent.findAll('div', {'jslog': '93789'}):
    if len(articles_list) == maxcnt: break  # exit outer loop
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
        if len(articles_list) == maxcnt: break  # exit inner loop
print(str(len(articles_list)), 'articles')
print('\n'.join(['> ' + a[0] for a in articles_list]))  # article titles
Output
5 articles
> Why Coronavirus Tests Come With Surprise Bills
> It’s Not Easy to Get a Coronavirus Test for a Child
> Britain’s health secretary says the asymptomatic don’t need tests. Critics say that sends a mixed message.
> Coronavirus testing shifts focus from precision to rapidity
> Coronavirus testing at Boston lab suspended after nearly 400 false positives
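For what it's worth, the same stop-after-n behaviour can also be written with a generator and itertools.islice, which avoids the double break. A sketch (not from the original answer) reusing webcontent and keyword_list from above:
from itertools import islice

def matching_articles(webcontent, keyword_list):
    # Yield (title, url) tuples for articles whose headline contains any keyword
    for i in webcontent.findAll('div', {'jslog': '93789'}):
        for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
            title = i.select_one('h3').getText()
            if any(keyword in title for keyword in keyword_list):
                yield (title, "https://news.google.com" + str(link.get('href')))

# islice stops pulling from the generator after 5 items, so no break is needed
articles_list = list(islice(matching_articles(webcontent, keyword_list), 5))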
I'm trying to fetch web-table data using a for loop, and the table has pagination up to 42 pages. Here is my code:
driver.get()
# Identification and locators
stack = driver.find_elements_by_xpath("//*[@id='container']/div/div[4]/table/tbody/tr/td[10]/div/ul/li")
quant = driver.find_elements_by_xpath("//*[@class='admin__data-grid-wrap']/table/tbody/tr/td[7]/div")
link = driver.find_elements_by_xpath("//*[@class='admin__data-grid-wrap']/table/tbody/tr/td[15]/a")
# Start the procedure
for i in driver.find_elements_by_xpath("//*[@id='container']/div/div[2]/div[2]/div[2]/div/div[2]/div/div[2]/button[2]"):
    for steck, quanty, links in zip(stack, quant, link):
        stuck = steck.text
        quantity = quanty.text
        linkes = links.get_attribute("href")
        if stuck != 'No manage stock':
            word = "Default Stock: "
            stock = stuck.replace(word, '')
            stocks = int(stock)
            quanties = int(float(quantity))
            if stocks < 0:
                print(stocks, quanties, linkes)
                stacks = abs(stocks)
                total = stacks + quanties + 1
                print(total)
    i.click()
    driver.implicitly_wait(10)
    print("Next Page")
This code fetches data from the 1st page, but after clicking through to the next page, the second for loop didn't fetch the 2nd page's data from the web table.
Most likely your query driver.find_elements_by_xpath("//*[@id='container']/div/div[2]/div[2]/div[2]/div/div[2]/div/div[2]/button[2]") only returns one element (the actual button to go to the next page), so I guess you should read the number of pages and use it for an outer loop (or at least, you might have to rebind the selection on the HTML element representing the clickable button, because it might change when a new page of the table is loaded):
driver.get()
# Read the number of pages and store it as an integer
nb_pages = int(driver.find_element_by_id('someId').text)
# Repeat your code (and rebind your selections, notably the one
# on the button to go to the next page) on each page of the table
for page in range(nb_pages):
    # The lines below are adapted from your code; I notably removed your first loop
    stack = driver.find_elements_by_xpath("//*[@id='container']/div/div[4]/table/tbody/tr/td[10]/div/ul/li")
    quant = driver.find_elements_by_xpath("//*[@class='admin__data-grid-wrap']/table/tbody/tr/td[7]/div")
    link = driver.find_elements_by_xpath("//*[@class='admin__data-grid-wrap']/table/tbody/tr/td[15]/a")
    # I also split the next XPath string over two lines for readability
    # (it doesn't change the actual string value)
    i = driver.find_elements_by_xpath(
        "//*[@id='container']/div/div[2]/div[2]/div[2]"
        "/div/div[2]/div/div[2]/button[2]")[0]
    for steck, quanty, links in zip(stack, quant, link):
        # your logic ...
        # ...
        pass
    # Load the next page:
    i.click()
If you can't read the number of pages, you may also use a while loop and exit it when you can't find a button to load the next page, with something like:
while True:
    i = driver.find_elements_by_xpath(
        "//*[@id='container']/div/div[2]/div[2]/div[2]"
        "/div/div[2]/div/div[2]/button[2]")
    if not i:
        break
    i = i[0]
    # the rest of your logic
    # ...
    i.click()
This is only a guess (as we don't have sample HTML of the page/table structure that you are trying to use).
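If you do rebind the button on every page, an explicit wait is usually more reliable than implicitly_wait when clicking through a paginated table. A sketch (assuming the same next-button XPath as above) using Selenium's WebDriverWait:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

NEXT_BUTTON_XPATH = ("//*[@id='container']/div/div[2]/div[2]/div[2]"
                     "/div/div[2]/div/div[2]/button[2]")

# Wait up to 10 seconds for the next-page button to become clickable
# again before acting on it, so the table has time to reload
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, NEXT_BUTTON_XPATH))
).click()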
Until now I have used a for loop to get all the elements on a page under a certain path with this script:
for username in range(range_for_like):
    link_username_like = "//article/div[2]/div[2]/ul/div/li[" + str(num) + "]/div/div[1]/div/div[1]/a[contains(@class, 'FPmhX notranslate zsYNt ')]"
    user = browser.find_element_by_xpath(link_username_like).get_attribute("title")
    num += 1
    sleep(0.3)
But sometimes my CPU usage will exceed 100%, which is not ideal.
My solution was to find all the elements in one line using find_elements_by_xpath, but in doing so I can't figure out how to get all the "title" attributes.
I know that the path changes for every title, //article/div[2]/div[2]/ul/div/li[" + str(num) + "]/div/div[1]/div/div[1]/a, and that's why I kept increasing the num variable, but how can I use this technique without a for loop?
What's the most efficient way in terms of performance to get all the attributes? I don't mind if it takes 2 minutes or more.
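For reference, a single-call version of the approach in the question could gather all the anchors at once and read every title with a list comprehension. A sketch that reuses the question's XPath with the li index removed, so it matches every row:
# One find_elements call collects every matching <a>; the list
# comprehension then reads each title attribute without a counter
users = [
    a.get_attribute("title")
    for a in browser.find_elements_by_xpath(
        "//article/div[2]/div[2]/ul/div/li"
        "/div/div[1]/div/div[1]/a[contains(@class, 'FPmhX notranslate zsYNt ')]")
]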
Here is how you can get all the people that liked your photo by XPath:
//div[text()='Likes']/..//a[@title]
The code below gets the first 12 likers:
likes = browser.find_elements_by_xpath("//div[text()='Likes']/..//a[@title]")
for like in likes:
    user = like.get_attribute("title")
To get all likes you have to scroll; for that, you can get the total number of likes and then scroll until they are all loaded. To get the total you can use the //a[contains(.,'likes')]/span XPath and convert its text to an integer.
To scroll, use the JavaScript .scrollIntoView() on the last like; the final code would look like:
totalLikes = int(browser.find_element_by_xpath("//a[contains(.,'likes')]/span").text)
browser.find_element_by_xpath("//a[contains(.,'likes')]/span").click()
while True:
    likes = browser.find_elements_by_xpath("//div[text()='Likes']/..//a[@title]")
    likesLen = len(likes)
    if likesLen == totalLikes - 1:
        break
    browser.execute_script("arguments[0].scrollIntoView()", likes[likesLen - 1])
for like in likes:
    user = like.get_attribute("title")
How it works:
With //div[text()='Likes'] I find the unique div of the window that contains the likes. Then, to get all the likes (each one is an li), I go up to the parent div with the /.. selector and take every a with a title attribute. Because the likes do not all load immediately, you have to scroll down. For that, I read the total number of likes before clicking on the likes link. Then I scroll to the last like (a[@title]) to force Instagram to load more data, until the total I read equals the length of the likes list. When the scrolling completes, I just iterate through all the likes collected in the while loop and get the titles.
On a typical eBay search query where more than 50 listings are returned, such as this one, eBay displays the results in a grid format (whether you have it set up to display as a grid or a list).
I'm using class name to pull out the prices using WebDriver:
prices = webdriver.find_elements_by_class_name("bidsold")
The challenge: although all prices on the page look identical in structure, the ones that are crossed out (where Buy It Now is not available and a Best Offer was accepted) are actually contained within a child span of the above span.
I could pull these out separately by repeating the find_elements_by_class_name method with class sboffer, but (i) I would lose track of the order, and more importantly (ii) it would roughly double the time it takes to extract the prices.
The CSS selectors for the two types of prices also differ, as do the XPaths.
How do we catch all prices in one go?
Try this:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Columbia+Hiking+Pants&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000&_pgn=2')
prices_list = driver.find_elements_by_css_selector('span.amt')
prices_on_page = []
for span in prices_list:
    unsold_item = span.find_elements_by_css_selector('span.bidsold.bold')
    sold_item = span.find_elements_by_css_selector('span.sboffer')
    if len(sold_item):
        prices_on_page.append(sold_item[0].text)
    elif len(unsold_item):
        prices_on_page.append(unsold_item[0].text)
    elif span.text:
        prices_on_page.append(span.text)
print(prices_on_page)
driver.quit()
In this case, you keep track of the order, and you only query within each specific span element instead of the entire page. This should improve performance.
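If the nesting described in the question holds (the crossed-out sboffer price sits inside the bidsold span), another option is a single XPath union, which Selenium returns in document order. A sketch, not verified against eBay's current markup:
# Take every sboffer span, plus every bidsold span that does NOT
# contain an sboffer child, so each listing contributes one price
# and document order is preserved
price_spans = driver.find_elements_by_xpath(
    "//span[contains(@class, 'sboffer')]"
    " | //span[contains(@class, 'bidsold')]"
    "[not(.//span[contains(@class, 'sboffer')])]")
prices_on_page = [span.text for span in price_spans]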
I would go for XPath; the code below worked for me. It grabbed 50 prices!
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Columbia+Hiking+Pants&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000&_pgn=2')
my_prices = []
itms = driver.find_elements_by_xpath("//div[@class='bin']")
for itm in itms:
    prices = itm.find_elements_by_xpath(".//span[contains(text(),'$')]")
    val = ','.join(p.text for p in prices)
    my_prices.append([val])
print(my_prices)
driver.quit()
The result is:
[[u'$64.95'], [u'$59.99'], [u'$49.95'], [u'$46.89,$69.99'], [u'$44.98'], [u'$42.95'], [u'$39.99'], [u'$39.99'], [u'$37.95'], [u'$36.68'], [u'$35.96,$44.95'], [u'$34.99'], [u'$34.99'], [u'$34.95'], [u'$30.98'], [u'$29.99'], [u'$29.99'], [u'$29.65,$32.95'], [u'$29.00'], [u'$27.96,$34.95'], [u'$27.50'], [u'$27.50'], [u'$26.99,$29.99'], [u'$26.95'], [u'$26.55,$29.50'], [u'$24.99'], [u'$24.99'], [u'$24.99'], [u'$24.99'], [u'$24.98'], [u'$24.98'], [u'$24.98'], [u'$24.98'], [u'$24.98'], [u'$22.00'], [u'$22.00'], [u'$22.00'], [u'$22.00'], [u'$18.00'], [u'$18.00'], [u'$17.95'], [u'$11.99'], [u'$9.99'], [u'$6.00']]