Cycle through URLs from a txt - python

This is my first question, so please bear with me (I have googled this and did not find anything).
I'm making a program which goes to a URL, clicks a button, checks if the page gets forwarded, and if it does, saves that URL to a file.
So far I've got the first two steps done, but I'm having some issues.
I want Selenium to repeat this process with multiple URLs (if possible, multiple at a time).
I have all the URLs in a text file called output.txt.
At first I did
url_list = "https://example.com"
to see if my program even worked, and it did. However, I am stuck on how to get it to go to the next URL in the list, and I am unable to find anything on the internet that helps me.
This is my code so far:
import selenium
from selenium import webdriver

url_list = "C:\\user\\python\\output.txt"

def site():
    driver = webdriver.Chrome("C:\\python\\chromedriver")
    driver.get(url_list)
    send = driver.find_element_by_id("NextButton")
    send.click()
    if driver.find_elements_by_css_selector("a[class='Error']"):
        print("Error class found")
I have no idea how to get Selenium to go to the first URL in the list, then on to the second one, and so forth.
If anyone is able to help me, I'd be very grateful.

I think the problem is that you assumed the name of the file containing the URLs is itself a URL. You need to open the file first and build the URL list.
According to the docs (https://selenium.dev/documentation/en/webdriver/browser_manipulation/), get expects a URL, not a file path.
import selenium
from selenium import webdriver

# open the file and build the list of URLs, one per line
with open("C:\\user\\python\\output.txt") as f:
    url_list = f.read().split('\n')

def site():
    driver = webdriver.Chrome("C:\\python\\chromedriver")
    for url in url_list:
        driver.get(url)
        send = driver.find_element_by_id("NextButton")
        send.click()
        if driver.find_elements_by_css_selector("a[class='Error']"):
            print("Error class found")

Related

How do I make the driver navigate to a new page in Selenium Python

I am trying to write a script to automate job applications on LinkedIn using Selenium and Python.
The steps are simple:
open the LinkedIn page, enter ID and password, and log in
open https://linkedin.com/jobs, enter the search keyword and location, and click Search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably due to the lack of some POST information from the previous page)
The click opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source
stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)

def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()

def get_experience():
    return "1%C22"

login()
jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
#     a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the page content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried to find elements from the search page, like the experience filter and the Easy Apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in advance.
UPDATE
It works if I directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia, but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia.
The difference between these links is that one of them takes only one value for experience level while the other one takes two values. This means it's probably not a POST-values issue.
You are getting and printing the current URL immediately after clicking the search button, before the page has changed in response to the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class.
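To actually block until the navigation happens, the wait object has to be given a condition via .until(). A minimal sketch using Selenium's standard expected_conditions module (the "/jobs/search" fragment is an assumption based on the URL the asker expects to land on):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 10 seconds until the browser URL contains the
# search path, then read the (now updated) current URL
WebDriverWait(driver, 10).until(EC.url_contains("/jobs/search"))
print(driver.current_url)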

How to check if URL contains something

I'm making a program which goes to a URL, clicks a button, checks if the page gets forwarded, and if it does, saves that URL to a file.
However, after a couple of entries the page blocks you from doing anything. When this happens, the URL changes and you'll get this: Block.aspx?c=475412.
Now how would I be able to check if the URL contains Block.aspx?c=475412 after each try?
I've tried looking for this, but I could only find people asking how to get the current URL, which is not what I'm looking for; I need to check what the URL contains.
Here is my code:
import selenium
from selenium import webdriver

url_list = open("path")

try:
    driver = webdriver.Chrome("C:\\python\\chromedriver")
    for url in url_list:
        driver.get(url)
        send = driver.find_element_by_id("NextButton")
        send.click()
        if driver.find_elements_by_css_selector("a[class='Error']"):
            print("Error class found")
except ValueError:
    print("Something went wrong checking the URL.")
I suppose I'd add an if statement checking whether the URL contains Block.aspx?c=475412. If anyone is able to help me out, I'd greatly appreciate it.
If you want to check what the URL contains, you can just use Python's built-in in operator on strings.

if "Block.aspx?c=475412" in driver.current_url:  # check if "Block.aspx?c=475412" is in the URL
    print("Block.aspx is in the URL")

How to get the .pdf download at the end of a redirect chain with Selenium?

I have tried every method I can think of for getting the PDF from the link: http://apps.colorado.gov/dora/licensing/Lookup/LicenseLookup.aspx?docExternal=926241&docGuid=8DC9BB72-A921-45E7-9BCD-358846FCE54D
I have tried:
Clicking the button for this link
Opening the href manually in the webdriver
Using WebDriverWait and various commands to wait for URL switches or the appearance of certain URLs
Sleeping and re-getting the page_source
Using a try statement to override the TimeoutException and trying to issue more commands from there
Every attempt at opening this link results in a timeout exception, even though it works just fine manually.
It looks like it runs through two(?) redirects before landing on the PDF file I'd like to grab. Is there anyone out there with Selenium experience who can point me in the right direction for getting this PDF? I'm running Selenium with ChromeDriver in a Python script.
ANSWER:

download_buttons = self.browser.find_elements_by_link_text("External Document")
for button in download_buttons:
    new_file_path = f'{blah}.pdf'
    link = button.get_attribute("href")
    download_link = requests.get(link, allow_redirects=True)
    try:
        with open(new_file_path, 'wb') as new_file:
            new_file.write(download_link.content)
    except Exception as e:
        self.print_error(f"Failed to write file: {e}")
I cannot comment yet as I'm new. When you find the URL that should contain the document, you can call the requests library, similar to how this person answered: Get a file from an ASPX webpage using Python
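If the ASPX endpoint only serves the file to an authenticated browser session, it may also be necessary to copy Selenium's cookies into the requests session before fetching. A sketch under that assumption, reusing self.browser and link from the answer above (whether this site actually requires it is not confirmed; "document.pdf" is a placeholder filename):

import requests

session = requests.Session()
# carry the browser's session cookies over to requests
for cookie in self.browser.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

response = session.get(link, allow_redirects=True)  # follows the redirect chain
with open("document.pdf", 'wb') as f:
    f.write(response.content)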

Scrape with BeautifulSoup from site that uses AJAX pagination using Python

I'm fairly new to coding and Python, so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search results pages and scrapes each page for all of the URLs. I've got all of the scraping working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just make a loop over the URL to capture each search result, but that's not possible here. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report
This is the script I have so far:
with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    snippet = soup.find_all('a', attrs={'item-title'})
    for a in snippet:
        logfile.write("http://www.heritage.org" + a.get('href') + "\n")

print "Done collecting urls"
Obviously, it scrapes the first page of results and nothing more.
And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.
For the sake of completeness: while you may try accessing the POST request and finding a way to reach the next page, as I suggested in my comment, if an alternative is acceptable, using Selenium makes it quite easy to achieve what you want.
Here is a simple solution using Selenium for your question:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

# uncomment if using Firefox web browser
driver = webdriver.Firefox()

# uncomment if using Phantomjs
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1

with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for page load
            sleep(5)
            # grab the Next button if it exists
            btn_next = driver.find_element_by_class_name('next')
            # find all item-title a href and write to file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href') + '\n')
            # Exit if no more Next button is found, ie. last page
            if btn_next is None:
                print "crawling completed."
                exit(-1)
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            btn_next.send_keys(Keys.RETURN)
        # you should specify the exception here
        except:
            print "Error found, crawling stopped"
            exit(-1)
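One caveat about the loop above: find_element_by_class_name raises NoSuchElementException rather than returning None, so the if btn_next is None check never fires; in practice it is the bare except that ends the crawl on the last page. A sketch of a more explicit exit, assuming the same driver, sleep, and Keys imports as above:

from selenium.common.exceptions import NoSuchElementException

while True:
    sleep(5)
    try:
        btn_next = driver.find_element_by_class_name('next')
    except NoSuchElementException:
        # no Next button left: last page reached
        print "crawling completed."
        break
    # ... collect and write the current page's links as above ...
    btn_next.send_keys(Keys.RETURN)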
Hope this helps.

Any advice for sending a request to a website from Python?

def align_sequences(IDs):
    import webbrowser
    import urllib, urllib2
    url = 'http://www.uniprot.org/align/'
    params = {'query': IDs}
    data = urllib.urlencode(params)
    request = urllib2.Request(url, data)
    response = urllib2.urlopen(request)
    job_url = response.geturl()
    webbrowser.open(job_url)

align_sequences('Q4PRD1 Q7LZ61')
With this function I want to open http://www.uniprot.org/align/, request that the protein sequences with IDs Q4PRD1 and Q7LZ61 be aligned, and then open the website in my browser.
Initially it seems to work fine: running the script opens the website and shows the alignment job being run. However, it keeps going forever and never actually finishes, even if I refresh the page. If I input the IDs in the browser and hit 'Align', it works just fine, taking about 8 seconds to align.
I am not familiar with the differences between running something directly from a browser and running it from Python. Do any of you have an idea of what might be going wrong?
Thank you :-)
~Max
You have to click the Align button. You can't do this with webbrowser, though. One option is to use Selenium:
from selenium import webdriver
url = 'http://www.uniprot.org/align/'
ids = 'Q4PRD1 Q7LZ61'
driver = webdriver.Firefox()
driver.get(url)
q = driver.find_element_by_id('alignQuery')
q.send_keys(ids)
btn = driver.find_element_by_id("sequence-align-submit")
btn.click()
I think this is done in JavaScript. If you look at the HTML code of the Align button, you can see:
onclick="UniProt.analytics('AlignmentSubmissionPage', 'click', 'Submit align'); submitAlignForm();"
UniProt.analytics() and submitAlignForm() are JavaScript functions; the magic lives in the js-compr.js2013_11 file. You can view that file using http://jsbeautifier.org/ and then do in Python what the JavaScript does.
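To get back to the asker's original goal of watching the job in a normal browser, one option is to let Selenium submit the form, wait for the redirect to the job page, and then hand the resulting URL to webbrowser. A sketch, reusing the driver and url variables from the snippet above (the 60-second polling limit is an arbitrary choice):

import time
import webbrowser

# poll until the submission redirects away from the align form
for _ in range(60):
    if driver.current_url != url:
        break
    time.sleep(1)

webbrowser.open(driver.current_url)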
