Inconsistent results for iframe with selenium - python

I am trying to scrape the Twitter username of cryptocurrencies from CoinMarketCap (https://coinmarketcap.com/currencies/ethereum/social/). Some of them don't have the Twitter iframe, like https://coinmarketcap.com/currencies/bitcoin/social/.
The problem is that the iframe takes around 3 seconds to load, but after testing my program many times I found that it does not always appear even after waiting for 5 seconds. Occasionally, when I opened the page manually, the widget did not show up on screen at all (though that is rare).
I expected it to work reliably and scrape everything, but it seems prone to error because it depends on loading time and server response.
Is there a better, more stable way of doing this? This is my first web scraping project and this seems like the only solution that could work.
Is there another method I could use while waiting?
I know that you can get the source from the iframe and scrape it directly, but I was not able to find it.
Here is my function:
# Imports required by the function below ('wait' is assumed to be an alias for WebDriverWait).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup

def get_crypto_currency_social(slug):
    url = "https://coinmarketcap.com/currencies/" + slug + "/social/"
    browser = webdriver.Chrome('./chromedriver')
    # .add_argument('headless')
    browser.get(url)
    try:
        wait(browser, 5).until(EC.presence_of_element_located((By.ID, "twitter-widget-0")))
    except:
        pass
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    market_cap = soup.find('div', {'class': 'statsValue___2iaoZ'}).text.split('$')[-1]
    coin_name = soup.find('small', {'class': 'nameSymbol___1arQV'}).text
    coin_rank = soup.find('div', {'class': 'namePillPrimary___2-GWA'}).text.split('#')[-1]
    twitter_username = ""  # default so the return value is always defined
    try:
        iframe = browser.find_elements_by_tag_name('iframe')[0]
        browser.switch_to.frame(iframe)
        twitter_username = browser.find_element_by_class_name("customisable-highlight").text
    except NoSuchElementException:
        twitter_username = ""
    except:
        print("Error getting twitter username")
    finally:
        browser.quit()
    return {
        "coin_rank": coin_rank,
        "market_cap": market_cap,
        "coin_name": coin_name,
        "twitter_username": twitter_username
    }

If there is a random delay between load times, you can make use of the WebDriverWait class from Selenium.
Sample code:
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"YOUR IFRAME XPATH")))
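Applied to the question's page, a minimal sketch could look like the following; the widget id twitter-widget-0 and the customisable-highlight class are taken from the question, the 10-second timeouts are arbitrary, and nothing here is verified against the live site:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

browser = webdriver.Chrome('./chromedriver')
browser.get("https://coinmarketcap.com/currencies/ethereum/social/")

twitter_username = ""
try:
    # Wait for the iframe and switch into it in a single step.
    WebDriverWait(browser, 10).until(
        EC.frame_to_be_available_and_switch_to_it((By.ID, "twitter-widget-0"))
    )
    # Inside the frame, wait for the username element instead of assuming it is already there.
    element = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "customisable-highlight"))
    )
    twitter_username = element.text
except TimeoutException:
    # Coins without a Twitter widget (e.g. the bitcoin page) just time out and fall through.
    pass
finally:
    browser.switch_to.default_content()
    browser.quit()

print(twitter_username)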

Related

How to scrape website if it has load more button to load more content on the page?

from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path=r'C:\Users\gkhat\Downloads\chromedriver.exe')
driver.get('https://www.allrecipes.com/recipes/233/world-cuisine/asian/indian/')
card_titles = driver.find_elements_by_class_name('card__detailsContainer')
button = driver.find_element_by_id('category-page-list-related-load-more-button')
for card_title in card_titles:
    rname = card_title.find_element_by_class_name('card__title').text
    print(rname)
time.sleep(3)
driver.execute_script("arguments[0].scrollIntoView(true);", button)
driver.execute_script("arguments[0].click();", button)
time.sleep(3)
driver.quit()
The website loads more food cards after clicking the "Load More" button. The above code scrapes the recipe titles; I want it to keep scraping titles even after clicking the Load More button.
I tried going to the Network tab and clicking on XHR, but none of the requests shows the JSON. What should I do?
I tried the code below for that. It works, but I am not sure if this is the best way to do it. FYI, I handled the email pop-ups manually; you will need to find a way to handle them.
from selenium import webdriver
import time
from selenium.common.exceptions import StaleElementReferenceException

driver = webdriver.Chrome(executable_path="path")
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.allrecipes.com/recipes/233/world-cuisine/asian/indian/")

receipes = driver.find_elements_by_class_name("card__detailsContainer")
for rec in receipes:
    name = rec.find_element_by_tag_name("h3").get_attribute("innerText")
    print(name)

loadmore = driver.find_element_by_id("category-page-list-related-load-more-button")
j = 0
try:
    while loadmore.is_displayed():
        loadmore.click()
        time.sleep(5)
        lrec = driver.find_elements_by_class_name("recipeCard__detailsContainer")
        newlist = lrec[j:]
        for rec in newlist:
            name = rec.find_element_by_tag_name("h3").get_attribute("innerText")
            print(name)
        j = len(lrec)+1
        time.sleep(5)
except StaleElementReferenceException:
    pass
driver.quit()
Actually there is a JSON endpoint that returns the data. However, the JSON carries the cards as HTML, so you just need to parse that.
Note: you can change the chunk size so you can get more than 24 items per "page".
import requests
from bs4 import BeautifulSoup

size = 24
page = 0
hasNext = True
while hasNext == True:
    page += 1
    print('\tPage: %s' % page)
    url = 'https://www.allrecipes.com/element-api/content-proxy/aggregate-load-more?sourceFilter%5B%5D=alrcom&id=cms%2Fonecms_posts_alrcom_2007692&excludeIds%5B%5D=cms%2Fallrecipes_recipe_alrcom_142967&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_231026&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_247233&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_246179&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_256599&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_247204&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_34591&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_245131&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_220560&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_212721&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_236563&excludeIds%5B%5D=cms%2Fallrecipes_recipe_alrcom_14565&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_8189766&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_8188886&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_8189135&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_2052087&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_7986932&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_2040338&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_280310&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_142967&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_14565&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_228957&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_46822&excludeIds%5B%5D=cms%2Fonecms_posts_alrcom_72349&page={page}&orderBy=Popularity30Days&docTypeFilter%5B%5D=content-type-recipe&docTypeFilter%5B%5D=content-type-gallery&size={size}&pagesize={size}&x-ssst=iTv629LHnNxfbQ1iVslBTZJTH69zVWEa&variant=food'.format(size=size, page=page)
    jsonData = requests.get(url).json()
    hasNext = jsonData['hasNext']
    soup = BeautifulSoup(jsonData['html'], 'html.parser')
    cardTitles = soup.find_all('h3', {'class': 'recipeCard__title'})
    for title in cardTitles:
        print(title.text.strip())

After using Selenium to load the whole page, why does my code stop scraping after the first 100 items?

I am trying to get a list of all the movies/series on my personal IMDb watchlist. I am using Selenium to click the load more button so everything shows up in the HTML code. However, when I try to scrape that data, only the first 100 movies show up.
Nothing past 'page3' shows up.
The image below shows the part of the HTML that marks page 3.
After clicking the load button with Selenium, all the movies are shown in the Chrome window. However, only the first 100 of 138 are printed to my console.
Here is the URL: https://www.imdb.com/user/ur130279232/watchlist
This is my current code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import time

URL = "https://www.imdb.com/user/ur130279232/watchlist"
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 20)
driver.get(URL)

while True:
    try:
        watchlist = driver.find_element_by_xpath("//div[@class='lister-list mode-detail']")
        watchlistHTML = watchlist.get_attribute('innerHTML')
        loadMoreButton = driver.find_element_by_xpath("//button[@class='load-more']")
        soup = BeautifulSoup(watchlistHTML, 'html.parser')
        content = soup.find_all('h3', class_='lister-item-header')
        # pdb.set_trace()
        print('length: ', len(content))
        for elem in content:
            print(elem.find('a').contents[0])
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
Even though, after clicking the load more button, 'lister-list mode-detail' includes everything up to Sound of Music?
The rest of the data is returned by an HTTP GET (triggered when you scroll down and hit Load more) to
https://www.imdb.com/title/data?ids=tt0144117,tt0116996,tt0106179,tt0118589,tt11252090,tt13132030,tt6083778,tt0106611,tt0115685,tt1959563,tt8385148,tt0118971,tt0340855,tt8629748,tt13932270,tt11185940,tt5580390,tt4975722,tt2024544,tt1024648,tt1504320,tt1010048,tt0169547,tt0138097,tt0112573,tt0109830,tt0108052,tt0097239,tt0079417,tt0071562,tt0068646,tt0070735,tt0067116,tt0059742,tt0107207,tt0097937&tracking_tag=&pageId=ls089853956&pageType=list&subpageType=watchlist
What @balderman mentioned works if you can access that HTTP GET.
The main thing is that the titles are lazy-loaded: the later ones don't load until the earlier ones have loaded. I don't know if they only load when you're in the right region, but a janky way to get around it is to programmatically scroll through the page and let it load, as sketched below.
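A minimal sketch of that scroll-and-wait approach (the pause length is arbitrary, and the lister-item-header class is taken from the question's code, not re-checked here):

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://www.imdb.com/user/ur130279232/watchlist")

# Keep scrolling to the bottom until the page height stops growing,
# i.e. nothing new gets lazy-loaded any more.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the lazy loader time to fetch the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

titles = driver.find_elements_by_xpath("//h3[@class='lister-item-header']/a")
print('total titles:', len(titles))
for t in titles:
    print(t.text)
driver.quit()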

BeautifulSoup Python Selenium - Wait for tweet to load before scraping website

I am trying to scrape a website to extract tweet links (specifically DW in this case), but I am unable to get any data because the tweets do not load immediately, so the request executes before there is time for them to load. I have tried using a requests timeout as well as time.sleep(), but without luck. After those two options I tried using Selenium to load the webpage locally and give it time to load, but I can't seem to make it work. I believe this can be done with Selenium. Here is what I tried so far:
links = 'https://www.dw.com/en/vaccines-appear-effective-against-india-covid-variant/a-57344037'
driver.get(links)
delay = 30  # seconds
try:
    WebDriverWait(driver, delay).until(EC.visibility_of_all_elements_located((By.ID, "twitter-widget-0")))
except:
    pass

tweetSource = driver.page_source
tweetSoup = BeautifulSoup(tweetSource, features='html.parser')
linkTweets = tweetSoup.find_all('a')

for linkTweet in linkTweets:
    try:
        tweetURL = linkTweet.attrs['href']
    except:  # pass on KeyError or any other error
        pass
    if "twitter.com" in tweetURL and "status" in tweetURL:
        # Run getTweetID function
        tweetID = getTweetID(tweetURL)
        newdata = [tweetID, date_tag, "DW", links, title_tag, "News", ""]
        # Write to dataframe
        df.loc[len(df)] = newdata
        print("working on tweetID: " + str(tweetID))
If anyone could get Selenium to find the tweet, that would be great!
It's an iframe, so first you need to switch to that iframe:
iframe = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "twitter-widget-0"))
)
driver.switch_to.frame(iframe)
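Once inside the frame, page_source reflects the widget's own document, so you can parse it for the tweet links and then switch back out; a short sketch, assuming the driver and BeautifulSoup setup from the question are already in place:

from bs4 import BeautifulSoup

# After switch_to.frame, page_source is the iframe's document, not the outer page.
frameSoup = BeautifulSoup(driver.page_source, 'html.parser')
for a in frameSoup.find_all('a', href=True):
    if "twitter.com" in a['href'] and "status" in a['href']:
        print(a['href'])

# Return to the main document before touching anything outside the widget.
driver.switch_to.default_content()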

Blocking login overlay window when scraping web page using Selenium

I am trying to scrape a long list of books across 10 web pages. When the loop clicks on the next > button for the first time, the website displays a login overlay, so Selenium cannot find the target elements.
I have tried all the solutions I could think of:
Use some Chrome options.
Use try-except to click the X button on the overlay. But it only appears once (when clicking next > for the first time). The problem is that when I put this try-except block at the end of the while True: loop, the loop becomes infinite because I use continue in except, as I do not want to break the loop.
Add some popup-blocker extensions to Chrome, but they do not work when I run the code, although I add the extension using options.add_argument('load-extension=' + ExtensionPath).
This is my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from bs4 import BeautifulSoup
import pandas as pd

options = Options()
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('disable-avfoundation-overlays')
options.add_argument('disable-internal-flash')
options.add_argument('no-proxy-server')
options.add_argument("disable-notifications")
options.add_argument("disable-popup")
Extension = (r'C:\Users\DELL\AppData\Local\Google\Chrome\User Data\Profile 1\Extensions\ifnkdbpmgkdbfklnbfidaackdenlmhgh\1.1.9_0')
options.add_argument('load-extension=' + Extension)
options.add_argument('--disable-overlay-scrollbar')

driver = webdriver.Chrome(options=options)
driver.get('https://www.goodreads.com/list/show/32339._50_?page=')
wait = WebDriverWait(driver, 2)

review_dict = {'title': [], 'author': [], 'rating': []}
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('table', class_='tableList js-dataTooltip')

while True:
    table = driver.find_element_by_xpath('//*[@id="all_votes"]/table')
    for product in table.find_elements_by_xpath(".//tr"):
        for td in product.find_elements_by_xpath('.//td[3]/a'):
            title = td.text
            review_dict['title'].append(title)
        for td in product.find_elements_by_xpath('.//td[3]/span[2]'):
            author = td.text
            review_dict['author'].append(author)
        for td in product.find_elements_by_xpath('.//td[3]/div[1]'):
            rating = td.text[0:4]
            review_dict['rating'].append(rating)
    try:
        close = wait.until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[3]/div/div/div[1]/button')))
        close.click()
    except NoSuchElementException:
        continue
    try:
        element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'next_page')))
        element.click()
    except TimeoutException:
        break

df = pd.DataFrame.from_dict(review_dict)
df
Any help would be appreciated: can I change the while loop to a for loop that clicks the next > button until the end, where should I put the try-except block that closes the overlay, or is there a Chrome option that can disable the overlay?
Thanks in advance.
Thank you for sharing your code and the website that you are having trouble with. I was able to close the login modal by using XPath. I took this challenge and broke the code up using class objects: one object wraps the selenium.webdriver.chrome.webdriver and the other represents the page that you wanted to scrape ( https://www.goodreads.com/list/show/32339 ). In the following methods, I used the JavaScript return arguments[0].scrollIntoView(); call to scroll to the last book displayed on the page. After I did that, I was able to click the next button.
def scroll_to_element(self, xpath: str):
    element = self.chrome_driver.find_element(By.XPATH, xpath)
    self.chrome_driver.execute_script("return arguments[0].scrollIntoView();", element)

def get_book_count(self):
    return self.chrome_driver.find_elements(By.XPATH, "//div[@id='all_votes']//table[contains(@class, 'tableList')]//tbody//tr").__len__()

def click_next_page(self):
    # Scroll to last record and click "next page"
    xpath = "//div[@id='all_votes']//table[contains(@class, 'tableList')]//tbody//tr[{0}]".format(self.get_book_count())
    self.scroll_to_element(xpath)
    self.chrome_driver.find_element(By.XPATH, "//div[@id='all_votes']//div[@class='pagination']//a[@class='next_page']").click()
Once I clicked on the "Next" button, I saw the modal appear. I found the XPath for the modal and was able to close it.
def is_displayed(self, xpath: str, int = 5):
    try:
        webElement = DriverWait(self.chrome_driver, int).until(
            DriverConditions.presence_of_element_located(locator=(By.XPATH, xpath))
        )
        return True if webElement != None else False
    except:
        return False

def is_modal_displayed(self):
    return self.is_displayed("//body[@class='modalOpened']")

def close_modal(self):
    self.chrome_driver.find_element(By.XPATH, "//div[@class='modal__content']//div[@class='modal__close']").click()
    if(self.is_modal_displayed()):
        raise Exception("Modal Failed To Close")
I hope this helps you to solve your problem.
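Putting the two ideas together with plain Selenium calls rather than the class objects, one possible shape for the loop is sketched below; the modal XPath is taken from the answer above, the next_page class from the question, and the timeouts are arbitrary:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('https://www.goodreads.com/list/show/32339._50_?page=')

while True:
    # ... scrape the current page here ...
    try:
        next_link = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.CLASS_NAME, 'next_page'))
        )
        driver.execute_script("arguments[0].scrollIntoView();", next_link)
        next_link.click()
    except TimeoutException:
        break  # no next page left
    # Dismiss the login overlay if it appeared after the click.
    try:
        close = WebDriverWait(driver, 3).until(
            EC.element_to_be_clickable((By.XPATH, "//div[@class='modal__content']//div[@class='modal__close']"))
        )
        close.click()
    except TimeoutException:
        pass  # no overlay this time

driver.quit()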

Is the Python Selenium browser.quit() method process-safe?

I am running multiple browsers using a multiprocessing Pool. Each process opens and closes browsers over and over as well. The reason each process closes the browser and opens a new one is that the sites I am visiting will otherwise block with a captcha on the second visit. If each process calls browser.quit(), will it affect Chrome instances running in other processes? I am having trouble with some sites failing when I know they are good URLs.
EDIT:
Let me explain further. Selenium visits the site and I am returning the HTML for scraping. The error I receive in the log file:
Failed While scraping
object of type 'NoneType' has no len()
example selenium code:
from selenium import webdriver
import time

def get_page(url):
    browser = webdriver.Chrome(executable_path=r'chromedriver.exe')
    time.sleep(2)
    browser.get(url)
    # verify page
    if browser.current_url[-19:] == 'noResultsFound=true' or browser.current_url[-13:] == 'error404.html':
        browser.quit()
        return None
    else:
        html = browser.page_source
        browser.quit()
        return html
example scraping code:
from bs4 import BeautifulSoup

def scrape(html):
    soup = BeautifulSoup(html, 'html.parser')
    search_items = soup.find('div', {'class': 'row product-grid results'})
    if search_items is not None:
        search_items = search_items.find_all('div', {'class': 'col-xs-12 col-xs-6 col-sm-4 col-md-3 text-center'})
    for i in range(len(search_items)):
        # scrape each search result
I can visit the URL and verify that the <div>s do exist, but I am failing on the for loop when taking the length. My thought was that either the JavaScript is not fully loaded before page_source is returned, or another process calling browser.quit() affects the other processes.
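One way to test the first theory is to replace the fixed sleep with an explicit wait on the results grid before grabbing page_source; a minimal sketch, assuming the 'row product-grid results' class from the scraping code marks the loaded results:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def get_page(url):
    browser = webdriver.Chrome(executable_path=r'chromedriver.exe')
    browser.get(url)
    try:
        # Wait until the results grid is actually in the DOM instead of sleeping a fixed amount.
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.row.product-grid.results"))
        )
        html = browser.page_source
    except TimeoutException:
        html = None  # the grid never appeared; treat it like the no-results case
    finally:
        browser.quit()
    return html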
