I'm making a program which goes to a URL, clicks a button, checks if the page gets forwarded, and if it does, saves that URL to a file.
However, after a couple of entries the page blocks you from doing anything. When this happens the URL changes and you'll get something like Block.aspx?c=475412
How would I be able to check whether the URL contains Block.aspx?c=475412 after each try?
I've tried searching for this, but I could only find people asking how to get the current URL, which isn't what I'm looking for: I need to check what the URL contains.
Here is my code.
import selenium
from selenium import webdriver

url_list = open("path")

try:
    driver = webdriver.Chrome("C:\\python\\chromedriver")
    for url in url_list:
        driver.get(url)
        send = driver.find_element_by_id("NextButton")
        send.click()
        if driver.find_elements_by_css_selector("a[class='Error']"):
            print("Error class found")
except ValueError:
    print("Something went wrong checking the URL.")
I suppose I'd add an if statement checking whether the URL contains Block.aspx?c=475412. If anyone is able to help me out, I'd greatly appreciate it.
If you want to check what the URL contains, you can just use the in operator built into Python strings.
if "Block.aspx?c=475412" in driver.current_url: # check if "Block.aspx?c=475412" is in URL
print("Block.aspx is in the URL")
Related
I want to check whether websites (coming from a database) have a captcha or not. The problem is that for every website the captcha ID, class, and XPath are different, so I want a smarter way to check whether a captcha is available, and if it is, store that either in the database or in a CSV. Second, the URLs I currently have are in the form google.com; if I give that to Selenium, it does not pick the URL up. From what I understand, it needs a complete URL, e.g. https://www.google.com. The www part is common to all of them, but http vs. https differs. How can this be solved? For the www part I wrote the code below, but for the rest I'm not able to work out the logic.
db_conn = base_class.table_selected_urls()  # data coming from the database
db_conn.execute("SELECT DISTINCT Domain,Site_name FROM URLS_INFO")
urls = db_conn.fetchall()  # all distinct urls are saved in urls

for url in urls:
    url_name = url[0]      # splitting the url to add www. to it
    url_source = url[1]
    if url_name.split('.')[0] != "www":
        url_name = "www." + url_name
    print(url_name)
After this code, the data from the database comes out like www.google.com. Now how do I check whether the website is actually http or https? Selenium can't process the URL without the scheme. (One possible approach is sketched below.)
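One possible approach (a sketch, not from the original post) is to prepend www. where it is missing and then probe https first, falling back to http, for example with a quick HEAD request using the requests library:

import requests

def build_full_url(domain):
    # prepend www. if it is missing, matching the logic above
    if not domain.startswith("www."):
        domain = "www." + domain
    # try https first, then http; return the first scheme that answers
    for scheme in ("https://", "http://"):
        candidate = scheme + domain
        try:
            requests.head(candidate, timeout=5, allow_redirects=True)
            return candidate
        except requests.RequestException:
            continue
    return None  # neither scheme responded

print(build_full_url("google.com"))  # typically https://www.google.com

The resulting full URL can then be handed to driver.get().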
I'm curious if anyone has found a workaround for handling the random "Please Verify you are human" pop-up in Firefox when using Selenium and BeautifulSoup. Currently, it pops up about every 500 or 1,000 URL requests, but I'd love an automated workaround.
My driver is just the default driver = webdriver.Firefox() with Selenium. The pop-up is a press-and-hold button (pictured in the original post), which I've just been handling manually whenever I've seen it appear. Any info would be great, thanks!
So I've figured out a workaround for this. Since the URL doesn't actually change or redirect when the 'Please verify you are human' popup occurs, I've added a step prior to getting the elements with BeautifulSoup.
For each URL in the list being scraped I do a time.sleep(5.5) to allow the page to fully load or for the verify popup to appear. Then I interact with the page and look for the verify indicator. For StockX it works like this: while true, try soup.find('div', class_='page-title').text, and if it finds '\nPlease verify you are a human\n' then close the browser and sleep (driver.quit() and time.sleep(20)), else scrape the elements.
I don't have the full code written up yet, but I do know I can detect whether it's a verify page as mentioned above. Something like this, maybe:
for url in url_list:
    for attempt in range(5):
        try:
            if soup.find('div', class_='page-title').text == '\nPlease verify you are a human\n':
                driver.quit()
                time.sleep(20)
            else:
                scrape_everything()
        except:
            print(f'Hit Verify Page Attempt Num.: {attempt}')
        else:
            break
    else:
        continue
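A slightly fuller sketch of the same idea (with some gaps filled in as assumptions: scrape_everything stays a placeholder, the soup is rebuilt from driver.page_source on each attempt, and the driver is re-created after each quit):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

def is_verify_page(driver):
    # return True if the 'verify you are a human' page title is showing
    soup = BeautifulSoup(driver.page_source, "html.parser")
    title = soup.find("div", class_="page-title")
    return title is not None and "verify you are a human" in title.text.lower()

driver = webdriver.Firefox()
for url in url_list:                      # url_list defined elsewhere
    for attempt in range(5):
        driver.get(url)
        time.sleep(5.5)                   # let the page or the verify popup load
        if is_verify_page(driver):
            print(f"Hit verify page, attempt {attempt}")
            driver.quit()
            time.sleep(20)
            driver = webdriver.Firefox()  # start a fresh session before retrying
        else:
            scrape_everything()           # placeholder from the original sketch
            break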
This is my first question, so please bear with me (I have googled this and did not find anything).
I'm making a program which goes to a url, clicks a button, checks if the page gets forwarded and if it does saves that url to a file.
So far I've got the first two steps done but I'm having some issues.
I want Selenium to repeat this process with multiple urls (if possible, multiple at a time).
I have all the URLs in a txt file called output.txt
At first I did
url_list = "https://example.com"
to see if my program even worked, and it did. However, I am stuck on how to get it to go to the next URL in the list, and I am unable to find anything on the internet that helps me.
This is my code so far
import selenium
from selenium import webdriver

url_list = "C\\user\\python\\output.txt"

def site():
    driver = webdriver.Chrome("C:\\python\\chromedriver")
    driver.get(url_list)
    send = driver.find_element_by_id("NextButton")
    send.click()
    if driver.find_elements_by_css_selector("a[class='Error']"):
        print("Error class found")
I have no idea how I'd get Selenium to go to the first URL in the list, then on to the second one, and so forth.
If anyone would be able to help me I'd be very grateful.
I think the problem is that you assumed the path of the file containing the URLs is itself a URL. You need to open the file first and build the URL list.
According to the docs https://selenium.dev/documentation/en/webdriver/browser_manipulation/, get expects a URL, not a file path.
import selenium
from selenium import webdriver

with open("C\\user\\python\\output.txt") as f:
    url_list = f.read().split('\n')

def site():
    driver = webdriver.Chrome("C:\\python\\chromedriver")
    for url in url_list:
        driver.get(url)
        send = driver.find_element_by_id("NextButton")
        send.click()
        if driver.find_elements_by_css_selector("a[class='Error']"):
            print("Error class found")
I've written a script in Python in combination with Selenium to parse the emails of some companies out of a webpage. The problem is that the emails are either within span[data-mail] or span[data-mail-e-contact-mail]. If I try the two conditions separately, I can get all the emails. However, when I wrap them in a try:except:else block, they no longer work. Where am I going wrong?
website link
This is the script:
from selenium import webdriver
from bs4 import BeautifulSoup

url = "replace with the link above"
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')

for links in soup.select("article.vcard"):
    try:    # the following works when tried individually
        email = links.select_one(".hit-footer-wrapper span[data-mail]").get("data-mail")
    except:  # the following works as well when tried individually
        email = links.select_one(".hit-footer-wrapper span[data-mail-e-contact-mail]").get("data-mail-e-contact-mail")
    else:
        email = ""
    print(email)

driver.quit()
When I execute the above script, it prints nothing, although both selectors work if tried individually.
Note that no exception will be raised by your code, as both get("data-mail") and get("data-mail-e-contact-mail") return a value (empty or not) rather than raising an exception.
Try the code below to get the required output:
for links in soup.select("article.vcard"):
    email = (links.select_one(".hit-footer-wrapper span[data-mail]").get("data-mail")
             or links.select_one(".hit-footer-wrapper span[data-mail-e-contact-mail]").get("data-mail-e-contact-mail"))
    print(email)
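Worth noting: since no exception is raised, the else: branch of the original try/except/else runs for every card and overwrites email with an empty string, which is why nothing useful was printed. A minimal illustration of that control flow:

try:
    value = "found"          # no exception is raised here
except AttributeError:
    value = "fallback"
else:
    value = ""               # this branch runs because the try block succeeded
print(value)                 # prints an empty line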
I'm fairly new to coding and Python, so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search-result pages and scrapes each page for all of the URLs. I've got all of the scraping working, but I can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just loop over the URL to capture each search result, but that's not possible here. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report
This is the script I have so far:
with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    snippet = soup.find_all('a', attrs={'item-title'})
    for a in snippet:
        logfile.write("http://www.heritage.org" + a.get('href') + "\n")
print "Done collecting urls"
Obviously, it scrapes the first page of results and nothing more.
And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.
For the sake of completeness: while you may try inspecting the POST request and finding a way to access the next page directly, as I suggested in my comment, if an alternative is acceptable, Selenium makes it quite easy to achieve what you want.
Here is a simple solution using Selenium for your question:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

# uncomment if using Firefox web browser
driver = webdriver.Firefox()
# uncomment if using Phantomjs
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1

with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for page load
            sleep(5)
            # grab the Next button if it exists (find_elements returns an
            # empty list instead of raising when the button is missing)
            next_buttons = driver.find_elements_by_class_name('next')
            btn_next = next_buttons[0] if next_buttons else None
            # find all item-title a href and write to file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href') + '\n')
            # Exit if no more Next button is found, ie. last page
            if btn_next is None:
                print "crawling completed."
                exit(-1)
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            btn_next.send_keys(Keys.RETURN)
        # you should specify the exception here
        except:
            print "Error found, crawling stopped"
            exit(-1)
Hope this helps.
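One more note: if the fixed sleep(5) ever proves flaky, an explicit wait is a common alternative. A minimal sketch, assuming the item-title links are a reasonable signal that the page has loaded:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 10 seconds for at least one result link instead of sleeping a fixed amount
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'item-title'))
)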