I'm curious whether anyone has found a workaround for the random "Please verify you are human" popup in Firefox when using Selenium and BeautifulSoup. Currently it appears about every 500 to 1,000 URL requests, and I'd love an automated workaround.
My driver is just the default, driver = webdriver.Firefox(), with Selenium. The popup is a press-and-hold button (picture below), which I've been handling manually whenever I see it. Any info would be great, thanks!
So I've figured out a workaround for this. Since the URL doesn't actually change or redirect when the 'Please verify you are human' popup occurs, I've added a step before getting the elements with BeautifulSoup.
For each URL in the list being scraped, I do a time.sleep(5.5) to allow the page to fully load or the verify popup to appear. Then I parse the page and look for the verify indicator. For StockX it works like this: in a loop, try soup.find('div', class_='page-title').text; if it returns '\nPlease verify you are a human\n', close the browser and sleep (driver.quit() and time.sleep(20)); otherwise scrape the elements.
I don't have the full code written up and working, but I do know I can detect whether it's a verify page as described above. Something like this, maybe:
    for url in url_list:
        for attempt in range(5):
            # driver and soup are assumed to be rebuilt for this url on each attempt
            title = soup.find('div', class_='page-title')
            if title is not None and 'Please verify you are a human' in title.text:
                print(f'Hit Verify Page Attempt Num.: {attempt}')
                driver.quit()
                time.sleep(20)
            else:
                scrape_everything()
                break
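The retry idea described above can also be sketched as a small helper. This is only a sketch under assumptions: `load_title` and `scrape` are hypothetical callables you would supply yourself (for example, `load_title` could open the URL in a fresh driver, wait, and return the text of the 'page-title' div, or None when that div is absent), and the 20-second cooldown is simply the value mentioned above.

```python
import time

VERIFY_TEXT = "Please verify you are a human"


def is_verify_page(title_text):
    """Return True when the page title text looks like the verify popup."""
    # title_text may be None when the page has no 'page-title' div.
    return title_text is not None and VERIFY_TEXT in title_text


def scrape_with_retry(url, load_title, scrape, max_attempts=5,
                      cooldown=20, sleep=time.sleep):
    """Try a URL up to max_attempts times, backing off on the verify page.

    load_title(url) and scrape(url) are hypothetical callables supplied by
    the caller; they are not part of Selenium or BeautifulSoup.
    Returns scrape(url), or None if every attempt hit the verify page.
    """
    for attempt in range(max_attempts):
        if is_verify_page(load_title(url)):
            print(f"Hit verify page, attempt {attempt}")
            sleep(cooldown)  # back off before trying again
            continue
        return scrape(url)
    return None
```

Checking for the phrase with `in` instead of comparing against the exact '\n...\n' string keeps the detection working if the surrounding whitespace changes.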
I am trying to get a list of all the movies and series on my personal IMDb watchlist. I am using Selenium to click the "load more" button so that everything shows up in the HTML. However, when I scrape that data, only the first 100 movies show up.
Nothing past 'page3' shows up.
The image below shows the part of the HTML that denotes page 3:
After clicking the load button with Selenium, all the movies are shown in the Chrome window. However, only the first 100 of 138 are printed to my console.
Here is the URL: https://www.imdb.com/user/ur130279232/watchlist
This is my current code:
    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    URL = "https://www.imdb.com/user/ur130279232/watchlist"
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 20)
    driver.get(URL)
    while True:
        try:
            watchlist = driver.find_element_by_xpath("//div[@class='lister-list mode-detail']")
            watchlistHTML = watchlist.get_attribute('innerHTML')
            loadMoreButton = driver.find_element_by_xpath("//button[@class='load-more']")
            soup = BeautifulSoup(watchlistHTML, 'html.parser')
            content = soup.find_all('h3', class_='lister-item-header')
            #pdb.set_trace()
            print('length: ', len(content))
            for elem in content:
                print(elem.find('a').contents[0])
            time.sleep(2)
            loadMoreButton.click()
            time.sleep(5)
        except Exception as e:
            print(e)
            break
Why does this happen even though, after clicking the load more button, "lister-list mode-detail" includes everything up to Sound of Music?
The rest of the data is returned by an HTTP GET to the following URL (scroll down and hit "Load more" to see the request):
https://www.imdb.com/title/data?ids=tt0144117,tt0116996,tt0106179,tt0118589,tt11252090,tt13132030,tt6083778,tt0106611,tt0115685,tt1959563,tt8385148,tt0118971,tt0340855,tt8629748,tt13932270,tt11185940,tt5580390,tt4975722,tt2024544,tt1024648,tt1504320,tt1010048,tt0169547,tt0138097,tt0112573,tt0109830,tt0108052,tt0097239,tt0079417,tt0071562,tt0068646,tt0070735,tt0067116,tt0059742,tt0107207,tt0097937&tracking_tag=&pageId=ls089853956&pageType=list&subpageType=watchlist
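To work with a request like the one above, it can help to pull the title ids out of the URL or to rebuild a URL of the same shape. The sketch below uses only the standard library. The endpoint and its parameters are taken from the URL observed above and are not a documented API, so they may change or require session cookies; `extract_title_ids` and `build_title_data_url` are hypothetical helper names.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def extract_title_ids(data_url):
    """Return the list of title ids carried in the 'ids' query parameter."""
    query = parse_qs(urlsplit(data_url).query)
    ids = query.get("ids", [""])[0]
    return [title_id for title_id in ids.split(",") if title_id]


def build_title_data_url(title_ids, page_id):
    """Rebuild a URL of the same shape as the one observed above."""
    query = {
        "ids": ",".join(title_ids),
        "tracking_tag": "",
        "pageId": page_id,
        "pageType": "list",
        "subpageType": "watchlist",
    }
    # safe="," keeps the commas between ids unescaped, as in the observed URL.
    return "https://www.imdb.com/title/data?" + urlencode(query, safe=",")
```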
What @balderman mentioned works if you can access the HTTP GET.
The main thing is that the titles load lazily: the later ones aren't loaded until the earlier ones are. I don't know whether they only load if you're in the right region, but a janky way around it is to programmatically scroll through the page and let it load.
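The scrolling idea can be sketched as follows. It relies only on `driver.execute_script`, which Selenium WebDriver provides; the pause length and the round limit are assumptions you would tune, and `scroll_until_stable` is a hypothetical helper name. Whether scrolling alone triggers the extra loads depends on how the page is built.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=50, sleep=time.sleep):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give lazily loaded items time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded in this round
        last_height = new_height
    return max_rounds
```

After it returns, the page source can be parsed once, instead of re-parsing inside the load loop.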
I am doing some web scraping, and I want to wait indefinitely on a page until an item becomes available to purchase. My code is the following.
    def searchPage(input_url):
        currentUrl = input_url
        driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
        driver.get(currentUrl)
        while True:
            try:
                shipIt = driver.find_element_by_css_selector('[data-test="shippingBlock"]')
                alertUsers(currentUrl)
                time.sleep(5)
            except NoSuchElementException:
                continue
Do I need to call .get(currentUrl) inside the while True loop to keep getting the updated site, or will the information update on its own so that the condition is eventually met?
If the webpage updates itself via JavaScript (Ajax or similar) calls, then you don't need to reload or re-get the page.
Hint:
You can check this by opening the webpage in a browser and watching whether the values/fields update automatically without reloading.
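Both cases can be covered by one polling helper, sketched below under assumptions: `wait_for_element` is a hypothetical name, the 5-second poll interval mirrors the sleep in the question, and the `refresh` flag is only for pages that do not update themselves. It uses `driver.find_elements`, which returns an empty list instead of raising when nothing matches, and `driver.refresh`.

```python
import time

# Plain-string locator strategy; same value as selenium's By.CSS_SELECTOR.
CSS_SELECTOR = "css selector"


def wait_for_element(driver, css, poll_seconds=5.0, refresh=False,
                     max_polls=None, sleep=time.sleep):
    """Poll until an element matching css exists, then return it.

    With max_polls=None this waits indefinitely; otherwise it returns
    None after max_polls unsuccessful checks.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        matches = driver.find_elements(CSS_SELECTOR, css)
        if matches:
            return matches[0]
        polls += 1
        sleep(poll_seconds)  # avoid a busy loop between checks
        if refresh:
            driver.refresh()  # only needed when the page does not update itself
    return None
```

Sleeping between checks also avoids the tight loop that `continue` inside `except NoSuchElementException` would otherwise produce.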
I'm making a program which goes to a URL, clicks a button, checks whether the page gets forwarded, and if it does, saves that URL to a file.
However, after a couple of entries the page blocks you from doing anything. When this happens the URL changes and you get this: Block.aspx?c=475412
How would I check whether the URL contains Block.aspx?c=475412 after each try?
I've tried searching for this, but I could only find people asking how to get the current URL, which is not what I'm looking for; I need to check what the URL contains.
Here is my code.
    import selenium
    from selenium import webdriver

    url_list = open("path")
    try:
        driver = webdriver.Chrome("C:\\python\\chromedriver")
        for url in url_list:
            driver.get(url)
            send = driver.find_element_by_id("NextButton")
            send.click()
            if (driver.find_elements_by_css_selector("a[class='Error']")):
                print("Error class found")
    except ValueError:
        print("Something went wrong checking the URL.")
I suppose I'd add an if statement checking whether the URL contains Block.aspx?c=475412. If anyone could help me out, I'd greatly appreciate it.
If you want to check what the URL contains, you can just use Python's built-in in operator on strings.
    if "Block.aspx?c=475412" in driver.current_url:  # check if "Block.aspx?c=475412" is in URL
        print("Block.aspx is in the URL")
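If the number after c= can vary between blocks, matching only the page name may be more robust. A minimal sketch, assuming the block page is always named Block.aspx; `is_block_page` is a hypothetical helper, and you would pass it `driver.current_url` after each click.

```python
from urllib.parse import urlsplit


def is_block_page(url, page_name="Block.aspx"):
    """Return True when the last path segment of url is the block page."""
    path = urlsplit(url).path
    # Compare only the final path segment, ignoring case and the query string.
    return path.rsplit("/", 1)[-1].lower() == page_name.lower()
```

Because only the path is inspected, a URL that merely mentions Block.aspx in its query string is not treated as a block.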
I'm using chromedriver + selenium to try to loop through a website with the same url structure for all their pages like so:
    for i in range(1,3):
        #iterate pages and get url
        ureq = "https://somewebsite.com/#page=" + str(i)
        driver.get(ureq.strip())
        #create
        soup = []
        #main code here
        try:
            #Wait for page to load
            WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH,"some element in the DOM")))
            src = driver.page_source
            #Parse page with bs
            soup = bs(src, "lxml")
        except TimeoutException:
            print("Timed out")
            driver.quit()
        #main code
    driver.quit()
The problem is that when the loop runs a second time and the URL changes to "#page=2", I can see that the webpage and URL have changed in the webdriver, but the script just hangs. There is no timeout or error message; the script just freezes.
I've also tried placing a print statement before "WebDriverWait" to see where the program hangs, but that doesn't fire either. I think that, for some reason, the second get request is the culprit.
Why is that, or is something else here the issue?
If you can obtain the URL directly from the href attribute of a link element, the URL should work when you enter it into the address bar directly. But this doesn't always work; see the explanation of click events below.
If instead you obtain the URL from the address bar after clicking on some element, opening the destination page by entering that URL into the address bar may fail.
That's because clicking on the element may trigger a click event that runs JavaScript in the background to fetch data from the backend, or something similar.
You can't trigger that background work by entering the URL in the address bar, so the safe way is to click on the element.
I know AngularJS apps behave this way most of the time.
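A related check, offered as an assumption that is not confirmed in this thread: the URLs in the question differ only in the part after `#`, and a browser treats such a change as navigation within the same document, so no new page load happens and the page's own JavaScript has to react to it. The sketch below only detects that situation; `differs_only_by_fragment` is a hypothetical helper name.

```python
from urllib.parse import urldefrag


def differs_only_by_fragment(current_url, next_url):
    """Return True when the two URLs are identical except for the #fragment."""
    current_base, current_fragment = urldefrag(current_url)
    next_base, next_fragment = urldefrag(next_url)
    return current_base == next_base and current_fragment != next_fragment
```

When this returns True for `driver.current_url` and the next URL, clicking the page's own pagination control, as suggested above, may be a more reliable way to trigger the data fetch than calling `driver.get` again.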
I am trying to get the video URL from links on this page. The video link can be seen at https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html (open in Chrome).
For that I wrote the Chrome webdriver code below:
    import os
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from pyvirtualdisplay import Display

    chromedriver = '/usr/local/bin/chromedriver'
    os.environ['webdriver.chrome.driver'] = chromedriver
    display = Display(visible=0, size=(800,600))
    display.start()
    driver = webdriver.Chrome(chromedriver)
    driver.get('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html')
    # the rest runs inside a method; item is defined elsewhere in that method
    try:
        element = WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_class_name('yvp-main'))
        self.yahoo_video_trend = []
        for s in driver.find_elements_by_class_name('yvp-main'):
            print("Processing link - ", item['link'])
            trend = item
            print(item['description'])
            trend['video_link'] = s.find_element_by_tag_name('video').get_attribute('src')
            print()
            print(s.find_element_by_tag_name('video').get_attribute('src'))
            self.yahoo_video_trend.append(trend)
    except:
        return
This works fine on my local system, but when I run it on my Azure server, s.find_element_by_tag_name('video').get_attribute('src') does not return any result.
I have installed Chrome on my Azure server.
Update:
Please note that I already tried requests and BeautifulSoup, but since Yahoo loads the HTML content dynamically from JSON, I could not get it with them.
And yes, the Azure server is a plain Linux system with command-line access, not an application.
I tried to reproduce your issue using your code. However, I found there was no tag named video on that page ('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html'), testing with both IE and Chrome.
I used the developer tools to check the HTML code, like this picture:
It seems that this page uses the Flash player to play video, not the HTML5 video control.
For this reason, I suggest checking whether your code uses the right tag name.
If you have any concerns, please feel free to let me know.
We tried to reproduce the error on our side. I was not able to get the Chrome driver to work, but I did try the Firefox driver and it worked fine. It was able to load the page and get the link via the URL.
Can you change your code to print the exception and send it to us, so we can see where the script is failing?
Change your code from:
    except:
        return
to:
    try:
        ...  # your existing code
    except Exception as e:
        print(str(e))
Send us the exception, so we can take a look.
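If the exception message alone is not enough to locate the failure, the full traceback is usually more useful. A minimal sketch using the standard library's traceback module; `run_and_report` is a hypothetical wrapper, and `step` stands for whatever function contains the scraping code.

```python
import traceback


def run_and_report(step, *args, **kwargs):
    """Call step(*args, **kwargs) and print the full traceback if it raises.

    Returns (result, None) on success or (None, traceback_text) on failure.
    """
    try:
        return step(*args, **kwargs), None
    except Exception:
        report = traceback.format_exc()  # includes file names and line numbers
        print(report)
        return None, report
```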