How to Handle Nested Loops in Selenium - Python

I am hoping someone can help me handle a nested loop in Selenium. I am trying to scrape a website using Selenium, and it happens that I have to scrape multiple pieces of information from different links.
So I collected all the links and looped through each one, but in the process only the first link displayed the items I needed, and then the code breaks.
def get_financial_info(self):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=1920x1080")
    driver = webdriver.Chrome(chrome_options=chrome_options, executable_path='/home/miracle/chromedriver')
    driver.get("https://www.financialjuice.com")
    try:
        WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='trendWrap']")))
    except TimeoutException:
        driver.quit()
    category_url = driver.find_elements_by_xpath("//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a[@href]")
    for record in category_url:
        driver.get(record.get_attribute("href"))
        news = {}
        title_element = driver.find_elements_by_xpath("//p[@class='headline-title']")
        for news_record in title_element:
            news['title'] = news_record.text
            print(news)

Your category_url elements are valid only on the page where you located them; after the first navigation to another page they become stale.
You need to replace
category_url = driver.find_elements_by_xpath("//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a[@href]")
with
category_url = [a.get_attribute("href") for a in driver.find_elements_by_xpath("//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a")]
and then loop through the list of links as
for record in category_url:
    driver.get(record)
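Putting it together, a minimal sketch of the corrected loop (same imports, driver setup and XPaths as in the question; collecting the hrefs as plain strings means the list survives every page change):

# Collect plain href strings before navigating anywhere.
category_urls = [a.get_attribute("href")
                 for a in driver.find_elements_by_xpath(
                     "//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a")]

for url in category_urls:
    driver.get(url)
    # Re-locate the headlines on the freshly loaded category page.
    for headline in driver.find_elements_by_xpath("//p[@class='headline-title']"):
        print({'title': headline.text})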

Related

How to get hidden element value in selenium?

I want to get the product_id from this website but this code doesn't work.
url = 'https://beyours.vn/products/mia-circle-mirror-white'
driver.get(url)
driver.implicitly_wait(10)
sku = driver.find_elements_by_xpath('//div[@class="product__id hide"]')[0].text
print(sku)
The result I expect is '1052656577'.
It took me a minute but I found this post and it answers your question.
All you have to do is replace the text attribute with the get_attribute method and pass 'innerText' to it. As the post explains, you need that method when dealing with hidden HTML.
Here's my code. I used find_elements(By.XPATH, '//div[@class="product__id hide"]') instead of the method you used because of the version of Selenium I have (3.141.0), but the solution should still work. You can check your version with print(selenium.__version__).
url = 'https://beyours.vn/products/mia-circle-mirror-white'
driver = webdriver.Chrome()
driver.get(url)
sleep(3)
sku = driver.find_elements(By.XPATH, '//div[@class="product__id hide"]')[0].get_attribute('innerText')
driver.quit()
print(sku)
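If you'd rather not rely on a fixed sleep(3), an explicit wait does the same job; a minimal sketch, assuming the same XPath and driver setup:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://beyours.vn/products/mia-circle-mirror-white'
driver = webdriver.Chrome()
driver.get(url)
# Wait up to 10 s for the hidden div to be present, then read its innerText.
sku = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//div[@class="product__id hide"]'))
).get_attribute('innerText')
driver.quit()
print(sku)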

Web Scraping - empty list

I am trying to scrape data from the e-commerce site below, but I am getting an empty list when trying to append names to the Brand_Name list. Below is my code for the same. I am performing this task using Selenium. I have not included the code for importing the required libraries; the rest of the code is pasted below.
driver = webdriver.Chrome('chromedriver.exe')
time.sleep(2)
url = 'https://www.amazon.in/'
driver.get(url)
time.sleep(1)
driver.maximize_window()
# Searching element for search field
search_field = driver.find_element_by_xpath('//div[@class="nav-search-field "]/input')
user_inp = input("Enter your search value: ")
search_field.send_keys(user_inp)
time.sleep(2)
search_btn = driver.find_element_by_xpath('//div[@class="nav-search-submit nav-sprite"]')
search_btn.click()
URLs = []
for page in range(0, 3):
    links = driver.find_elements_by_xpath('//h2[@class="a-size-mini a-spacing-none a-color-base s-line-clamp-2"]//a')
    for i in links:
        URLs.append(i.get_attribute('href'))
    nxt_button = driver.find_element_by_xpath('//*[@class="s-pagination-item s-pagination-next s-pagination-button s-pagination-separator"]')
    nxt_button.click()
    time.sleep(2)
Brand_Name = []
Product_Name = []
Price = []
Expected_Delivery = []
for i in URLs:
    driver.get(url)
    time.sleep(2)
    try:
        brands = driver.find_element_by_xpath('//a[@id="bylineInfo"]')
        Brand_Name.append(brands.text)
    except NoSuchElementException:
        Brand_Name.append('-')
The middle part of your code is where the bugs lie:
URLs = []
time.sleep(2)
for page in range(0, 3):
    URL = [my_elem.get_attribute("href") for my_elem in driver.find_elements(By.XPATH, "//a[@class='a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal']")]
    URLs.extend(URL)
    nxt_button = driver.find_element_by_xpath('//*[@class="s-pagination-item s-pagination-next s-pagination-button s-pagination-separator"]')
    nxt_button.click()
    time.sleep(2)
Brand_Name = []
Product_Name = []
Price = []
Expected_Delivery = []
for i in URLs:
    driver.get(i)
I've fixed the 1st loop so the URLs are fetched properly. Remember to import the relevant libraries, e.g. from selenium.webdriver.common.by import By. The 2nd loop begins with a bug (it reloads url, the search page, instead of the collected link i), but the rest of your code seems fine.
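For completeness, a minimal sketch of the second loop with that fix applied (same XPath and exception handling as in the question):

Brand_Name = []
for link in URLs:
    driver.get(link)   # open the collected product page, not the search page
    time.sleep(2)
    try:
        brand = driver.find_element_by_xpath('//a[@id="bylineInfo"]')
        Brand_Name.append(brand.text)
    except NoSuchElementException:
        Brand_Name.append('-')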

Inconsistent results for iframe with selenium

I am trying to scrape the Twitter username of cryptocurrencies from CoinMarketCap (https://coinmarketcap.com/currencies/ethereum/social/). Some of them don't have the Twitter iframe, like (https://coinmarketcap.com/currencies/bitcoin/social/).
The problem is that the iframe takes around 3 seconds to load, but after testing my program many times I found that the iframe does not always load, even after waiting 5 seconds. Sometimes when I opened the page manually it didn't appear on the screen either (but that is very rare).
I expected it to work reliably and scrape everything, but it seems prone to errors because it depends on loading time and server response.
Is there a better, more stable way of doing this? This is my first web scraping project and this seems like the only solution that could work.
Is there another method I could use while waiting?
I know that you can get the source from the iframe and scrape it, but I was not able to find it.
Here is my function:
def get_crypto_currency_social(slug):
    url = "https://coinmarketcap.com/currencies/"+slug+"/social/"
    browser = webdriver.Chrome('./chromedriver')
    # .add_argument('headless')
    browser.get(url)
    try:
        wait(browser, 5).until(EC.presence_of_element_located((By.ID, "twitter-widget-0")))
    except:
        pass
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    market_cap = soup.find('div', {'class': 'statsValue___2iaoZ'}).text.split('$')[-1]
    coin_name = soup.find('small', {'class': 'nameSymbol___1arQV'}).text
    coin_rank = soup.find('div', {'class': 'namePillPrimary___2-GWA'}).text.split('#')[-1]
    try:
        iframe = browser.find_elements_by_tag_name('iframe')[0]
        browser.switch_to.frame(iframe)
        twitter_username = browser.find_element_by_class_name("customisable-highlight").text
    except NoSuchElementException:
        twitter_username = ""
    except:
        print("Error getting twitter username")
    finally:
        browser.quit()
    return {
        "coin_rank": coin_rank,
        "market_cap": market_cap,
        "coin_name": coin_name,
        "twitter_username": twitter_username
    }
If the delay varies between loads, you can make use of the WebDriverWait class from Selenium.
Sample code :
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"YOUR IFRAME XPATH")))
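Applied to your function, the presence wait and the manual switch_to.frame could be collapsed into that single condition; a minimal sketch, assuming the widget keeps the id twitter-widget-0 and the same class name for the username element:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

twitter_username = ""
try:
    # Wait up to 10 s for the iframe to exist, then switch into it in one step.
    WebDriverWait(browser, 10).until(
        EC.frame_to_be_available_and_switch_to_it((By.ID, "twitter-widget-0"))
    )
    twitter_username = browser.find_element_by_class_name("customisable-highlight").text
    browser.switch_to.default_content()  # back to the main page for any further scraping
except TimeoutException:
    pass  # the coin has no Twitter widget; leave the username empty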

BeautifulSoup Python Selenium - Wait for tweet to load before scraping website

I am trying to scrape a website to extract tweet links (specifically DW in this case), but I am unable to get any data because the tweets do not load immediately, so the request finishes before they have had time to load. I have tried using a requests timeout as well as time.sleep(), but without luck. After trying those two options I turned to Selenium to load the webpage locally and give it time to load, but I can't seem to make it work. I believe this can be done with Selenium. Here is what I have tried so far:
links = 'https://www.dw.com/en/vaccines-appear-effective-against-india-covid-variant/a-57344037'
driver.get(links)
delay = 30  # seconds
try:
    WebDriverWait(driver, delay).until(EC.visibility_of_all_elements_located((By.ID, "twitter-widget-0")))
except:
    pass
tweetSource = driver.page_source
tweetSoup = BeautifulSoup(tweetSource, features='html.parser')
linkTweets = tweetSoup.find_all('a')
for linkTweet in linkTweets:
    try:
        tweetURL = linkTweet.attrs['href']
    except:  # pass on KeyError or any other error
        pass
    if "twitter.com" in tweetURL and "status" in tweetURL:
        # Run getTweetID function
        tweetID = getTweetID(tweetURL)
        newdata = [tweetID, date_tag, "DW", links, title_tag, "News", ""]
        # Write to dataframe
        df.loc[len(df)] = newdata
        print("working on tweetID: " + str(tweetID))
If anyone could get Selenium to find the tweet that would be great!
It's an iframe; first you need to switch to that iframe:
iframe = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "twitter-widget-0"))
)
driver.switch_to.frame(iframe)
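Once you are inside the frame, driver.page_source returns the iframe's own document, so your BeautifulSoup filtering can run on it directly; a rough sketch, keeping the approach from the question:

from bs4 import BeautifulSoup

# Parse only the iframe's document, then return to the main page.
frame_soup = BeautifulSoup(driver.page_source, 'html.parser')
tweet_urls = [a['href'] for a in frame_soup.find_all('a', href=True)
              if "twitter.com" in a['href'] and "status" in a['href']]
driver.switch_to.default_content()
print(tweet_urls)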

Waiting for a page to load in Selenium, Python. All the pages have the same structure

I am trying to scrape some data using Selenium and Python. I have a list with some links and I have to go through every link. What I do now is the following:
for link in links:
    self.page_driver.get(link)
    time.sleep(5)
    # scrape data
It works just fine; the problem is that I have a lot of links, and waiting 5 seconds for each one is a waste of time. That's why I decided to try something like:
self.driver.get(link)
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'cell-box'))
    WebDriverWait(self.driver, 10).until(element_present)
except TimeoutException:
    logging.info("Timed out waiting for page to load")
The problem is that every link has the exact same structure inside; only the data change, so the element is found even if the page hasn't changed yet. What I would like to do is save the name of the product on the current page in a variable, change page, and wait until the product name differs from the one saved, which means the new page has loaded. Any help would be really appreciated.
You can add the staleness_of expected condition:
wait = WebDriverWait(self.driver, 10)
element = None
for link in links:
    self.page_driver.get(link)
    if element is not None:
        wait.until(EC.staleness_of(element))
    try:
        element = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'cell-box')))
    except TimeoutException:
        logging.info("Timed out waiting for page to load")
    # scrape data
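staleness_of resolves once the element located on the previous page has been detached from the DOM, so the wait only passes after navigation has actually replaced the old document; there is no need to compare product names yourself.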
