I have created a screen scraping program using selenium, which prints out a few variables. I want to take the numbers it spits out and compare them to numbers in a text document. I am unsure how to go about this. What would be the best way? The text file will contain 3 numbers, which will be compared to the 3 numbers that were screen scraped.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
#The imports above load the modules this script needs
chrome_path = r"C:\Users\ashabandha\Downloads\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://signin.acellus.com/SignIn/index.html")
time.sleep(2)
username = driver.find_element_by_id("Name")
password = driver.find_element_by_id("Psswrd")
username.send_keys("my login")
password.send_keys("my password")
time.sleep(2)
driver.find_element_by_xpath("""//*[@id="loginform"]/table[2]/tbody/tr/td[2]/input""").click()
#The program has now signed in and is going to navigate to the progress tab
time.sleep(2)
driver.get("https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=484")
time.sleep(2)
#now we are on the progress tab
posts = driver.find_elements_by_class_name("Object7069")
time.sleep(2)
for post in posts:
    print(post.text)
#this gives me the first class log
time.sleep(2)
driver.get("https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=326")
#This gives me second class log
time.sleep(2)
posts = driver.find_elements_by_class_name("Object7069")
time.sleep(2)
for post in posts:
    print(post.text)
time.sleep(2)
driver.get("https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=292")
posts = driver.find_elements_by_class_name("Object7069")
time.sleep(2)
for post in posts:
    print(post.text)
Save the selenium output in a data structure, like a list or dictionary, then open the file, extract the numbers you want to compare against, and apply whatever comparison or expression you need: https://www.python.org/doc/
Check out the section on working with files.
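For example, a minimal sketch of that idea, assuming the scraped values are numeric text and that the text file (expected.txt is a hypothetical name) holds one number per line:
# Collect the scraped numbers into a list (posts comes from find_elements above).
scraped = [float(post.text) for post in posts]

# Read the expected numbers from the text file, one per line.
with open("expected.txt") as f:
    expected = [float(line.strip()) for line in f if line.strip()]

# Compare the two lists pairwise.
for got, want in zip(scraped, expected):
    print(got, "matches" if got == want else "does not match", want)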
I am trying to write a script to automate job applications on Linkedin using selenium and python.
The steps are simple:
Open the LinkedIn page, enter the id and password, and log in.
Open https://linkedin.com/jobs, enter the search keyword and location, and click search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably due to missing POST data from the previous page).
The click opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source
stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)
def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()
def get_experience():
    return "1%C22"
login()
jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
# a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the page content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried to find elements that only exist on the search page, like the experience filter and the Easy Apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in advance.
UPDATE
It works if I directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia.
The difference between these links is that one of them takes only one value for experience level while the other takes two values. This means it's probably not a POST-values issue.
You are getting and printing the current URL immediately after clicking the search button, before the page has changed with the response received from the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class.
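To get an actual wait, pair the WebDriverWait object with an expected condition. A minimal sketch (waiting on the URL is an assumption; any condition that marks the results page would do):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Poll for up to 10 seconds until the browser URL contains "/jobs/search";
# raises TimeoutException if the navigation never happens.
wait = WebDriverWait(driver, 10)
wait.until(EC.url_contains("/jobs/search"))
print(driver.current_url)  # now reflects the search results page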
I wrote a short program to automate the process of clicking and saving profiles on LinkedIn.
Brief:
The program reads from a txt file with a large amount of LI URLs.
Using Selenium, it opens them one by one, then hits the "Open in Sales Navigator" button.
A new tab opens, and in it the program needs to click the "Save" button and choose the relevant list to save to.
I have two main problems:
LinkedIn has 3 versions of the same page. How can I use a condition to check which page version it is (meaning: if you can't find this button, move on to the next version)? From what I've seen, you can't really use an "if" with selenium, because it causes trouble. Any other suggestions?
More important, and the reason I opened this thread: I want to monitor the "failed" links. Say I have a list of 1000 LI URLs and I run the program to save them to my account. I want to monitor the ones it didn't save or failed to open (broken links, page unavailable, etc.). To do that, I used a CSV file and had the program record the pages that were already saved on this account, but that doesn't solve my problem. How can I make it record all of the failures, not just the ones that were already saved? (I find this hard to do because when a page appears as "Unavailable", the program jumps to the next one, and I couldn't find a way to make it record that link.)
This makes the program harder to work with, because when I feed in 500 or 1000 URLs, I can't tell which ones were saved and which weren't.
Here's the code:
import selenium.webdriver as webdriver
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
import csv
import random
options = webdriver.ChromeOptions()
options.add_argument('--lang=EN')
options.add_argument("--start-maximized")
prefs = {"profile.default_content_setting_values.notifications" : 2}
options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome(executable_path=r'assets\chromedriver', chrome_options=options)
driver.get("https://www.linkedin.com/login?fromSignIn=true")
minDelay=input("\n Please provide min delay in seconds : ")
maxDelay=input("\n Please provide max delay in seconds : ")
listNumber=input("\n Please provide list number : ")
outputFile=input('\n save skipped as?: ')
count=0
closed=2
with open("links.txt", "r") as links:
for link in links:
try:
driver.get(link.strip())
sleep(3)
driver.find_element_by_xpath("//button[#class='save-to-list-dropdown__trigger ph5 artdeco-button artdeco-button--primary artdeco-button--3 artdeco-button--pro artdeco-dropdown__trigger artdeco-dropdown__trigger--placement-bottom ember-view']").click()
sleep(2)
count+=1
if count==1:
driver.find_element_by_xpath("//ul[#class='save-to-list-dropdown__content']//ul//li["+str(listNumber)+"]").click()
else:
driver.find_element_by_xpath("//ul[#class='save-to-list-dropdown__content']//ul//li[1]").click()
sleep(2)
sleep(random.randint(int(minDelay), int(maxDelay)))
except:
if closed==0:
driver.close()
sleep(1)
fileOutput=open(outputFile+".csv", mode='a', newline='', encoding='utf-8')
file_writer = csv.writer(fileOutput, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
file_writer.writerow([link.strip()])
fileOutput.close()
print("Finished.")
The common approach for having different sorts of listeners is to use EventFiringWebDriver. See the example here:
from selenium import webdriver
from selenium.webdriver.support.abstract_event_listener import AbstractEventListener
from selenium.webdriver.support.event_firing_webdriver import EventFiringWebDriver
class EventListener(AbstractEventListener):
    def before_click(self, element, driver):
        if element.tag_name == 'a':
            print('Clicking link:', element.get_attribute('href'))

if __name__ == '__main__':
    driver = EventFiringWebDriver(driver=webdriver.Firefox(), event_listener=EventListener())
    driver.get("https://webelement.click/en/welcome")
    link = driver.find_element_by_xpath('//a[text()="All Posts"]')
    link.click()
    driver.quit()
UPD:
Basically your case does not really need that listener. However, you can use it. Say you have a links file like:
https://google.com
https://invalid.url
https://duckduckgo.com/
https://sadfsdf.sdf
https://stackoverflow.com
Then the way with EventFiringWebDriver would be:
from selenium import webdriver
from selenium.webdriver.support.abstract_event_listener import AbstractEventListener
from selenium.webdriver.support.event_firing_webdriver import EventFiringWebDriver
broken_urls = []
class EventListener(AbstractEventListener):
    def on_exception(self, exception, drv):
        broken_urls.append(drv.current_url)

if __name__ == '__main__':
    driver = EventFiringWebDriver(driver=webdriver.Firefox(), event_listener=EventListener())
    with open("links.txt", "r") as links:
        for link in links:
            try:
                driver.get(link.strip())
            except:
                print('Cannot reach the link', link.strip())
    print("Finished.")
    driver.quit()

    import csv
    with open('broken_urls.csv', 'w', newline='') as broken_urls_csv:
        wr = csv.writer(broken_urls_csv, quoting=csv.QUOTE_ALL)
        wr.writerow(broken_urls)
and without EventFiringWebDriver would be:
broken_urls = []

if __name__ == '__main__':
    from selenium import webdriver

    driver = webdriver.Firefox()
    with open("links.txt", "r") as links:
        for link in links:
            stripped_link = link.strip()
            try:
                driver.get(stripped_link)
            except:
                print('Cannot reach the link', stripped_link)
                broken_urls.append(stripped_link)
    print("Finished.")
    driver.quit()

    import csv
    with open('broken_urls.csv', 'w', newline='') as broken_urls_csv:
        wr = csv.writer(broken_urls_csv, quoting=csv.QUOTE_ALL)
        wr.writerow(broken_urls)
This is my code:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
# loading the webpage
browser = webdriver.Chrome()
browser.get("https://instagram.com")
time.sleep(1)
# finding essential requirements
user_name = browser.find_element_by_name("username")
password = browser.find_element_by_name("password")
login_button = browser.find_element_by_xpath("//button [@type = 'submit']")
# filling out the user name box
user_name.click()
user_name.clear()
user_name.send_keys("username")
# filling out the password box
password.click()
password.clear()
password.send_keys("password")
# clicking on the login button
login_button.click()
time.sleep(3)
# information save permission denial
not_now_button = browser.find_element_by_xpath("//button [@class = 'sqdOP yWX7d y3zKF ']")
not_now_button.click()
time.sleep(3)
# notification permission denial
not_now_button_2 = browser.find_element_by_xpath("//button [@class = 'aOOlW HoLwm ']")
not_now_button_2.click()
time.sleep(3)
# finding search box and searching + going to the page
search_box = browser.find_element_by_xpath('//input [@placeholder="Search"]')
search_box.send_keys("sb else's page")
time.sleep(3)
search_box.send_keys(Keys.RETURN)
search_box.send_keys(Keys.RETURN)
time.sleep(3)
# opening ((followers)) list
followers = browser.find_element_by_xpath('//a [@class="-nal3 "]')
followers.click()
time.sleep(10)
# following each follower
follower = browser.find_elements_by_xpath('//button [@class="sqdOP L3NKy y3zKF "]')
browser.close()
In this code, I simulate what a normal person does to follow another person.
I want to follow each follower of a page. I have thought about it all day long, but couldn't come up with an algorithm.
I got some good ideas, but just realized I don't know how to scroll down to the end of the list to get the entire list of followers. Can you help? (If you don't get me, try running the code and then extracting the list of followers.)
# following each follower
get the list of followers
for each follower, click 'follow' if possible
if the button text hasn't changed, it means you have reached the follow limit, or may be banned
Also, be sure to rate-limit your actions; Instagram has a follow limit (it used to be 30 per hour).
You can also get the followers directly through the Instagram API.
And don't forget to unfollow them later, because unfollowing has limits too. The cap on total follows is 7500 (at least it used to be; not sure about now).
First you need to get a list of the users that follow someone; then you just execute the same code in a loop. You can either scrape the users separately or within Selenium. Then run the code needed to follow a given person in, for example, a for loop. Step 6: profit.
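For the scrolling problem, a minimal sketch that keeps scrolling the followers dialog via JavaScript until its height stops growing (the CSS selector div[role="dialog"] ul is an assumption; Instagram's markup changes often):
import time

# Assumption: the followers dialog is already open and this selector
# matches its scrollable list; adjust it to the current Instagram markup.
scroll_box = browser.find_element_by_css_selector('div[role="dialog"] ul')

last_height = 0
while True:
    # Scroll the list container to its bottom, return the new height,
    # and give newly loaded followers time to render.
    height = browser.execute_script(
        'arguments[0].scrollTop = arguments[0].scrollHeight;'
        ' return arguments[0].scrollHeight;', scroll_box)
    time.sleep(2)
    if height == last_height:  # nothing new loaded: end of the list
        break
    last_height = height

follower_rows = scroll_box.find_elements_by_tag_name('li')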
I have decided to attempt to create a simple web scraper script in python. As a small challenge I decided to create a script which will be able to log me into facebook and fetch the current birthdays displayed in the sidebar. I have managed to write a script which is able to log me into my facebook; however, I have no idea how to fetch the birthdays displayed.
This is my script.
from selenium import webdriver
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
usr = 'EMAIL'
pwd = 'PASSWORD'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.facebook.com/')
print ("Opened facebook")
sleep(1)
username_box = driver.find_element_by_id('email')
username_box.send_keys(usr)
print ("Email Id entered")
sleep(1)
password_box = driver.find_element_by_id('pass')
password_box.send_keys(pwd)
print ("Password entered")
login_box = driver.find_element_by_id('u_0_b')
login_box.click()
print ("Login Sucessfull")
print ("Fetched needed data")
input('Press anything to quit')
driver.quit()
print("Finished")
This is my first time creating a script of this type. My assumption is that I am supposed to traverse through the children of the "jsc_c_3d" div element until I get to the displayed birthdays. Furthermore, the id of this element changes every time the page is refreshed. Can anyone tell me how this is done, or whether this is the right way to go about solving this problem?
The div for the birthdays, after inspecting elements:
<div class="" id="jsc_c_3d">
<div class="j83agx80 cbu4d94t ew0dbk1b irj2b8pg">
<div class="qzhwtbm6 knvmm38d"><span class="oi732d6d ik7dh3pa d2edcug0 qv66sw1b c1et5uql
a8c37x1j muag1w35 enqfppq2 jq4qci2q a3bd9o3v knj5qynh oo9gr5id hzawbc8m" dir="auto">
<strong>Bobi Mitrevski</strong>
and
<strong>Trajce Tusev</strong> have birthdays today.</span></div></div></div>
You are correct that you would need to traverse through the inner elements of jsc_c_3d to extract the birthdays that you want. However, that kind of automated web scraping is a problem when the id value is dynamic and changes on every page load. In this case, a text parser such as bs4 would do the job.
With the bs4 approach you simply extract the relevant div tags from the DOM and then parse them to obtain the required contents.
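A minimal sketch of that idea, keying on the stable phrase "have birthdays today" instead of the dynamic id (the phrasing is taken from the snippet above and may differ with locale or the number of birthdays):
from bs4 import BeautifulSoup

# Parse the page source Selenium already fetched, then look for the span
# whose text mentions birthdays rather than relying on the "jsc_c_3d" id.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for span in soup.find_all('span'):
    if 'have birthdays today' in span.get_text():
        # The names sit in <strong> tags inside that span.
        names = [strong.get_text() for strong in span.find_all('strong')]
        print('Birthdays today:', names)
        break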
More generally, this problem is solvable using the Facebook API, which could be as simple as:
import facebook
token = 'a token' # token omitted here, this is the same token when I use in https://developers.facebook.com/tools/explorer/
graph = facebook.GraphAPI(token)
args = {'fields' : 'birthday,name' }
friends = graph.get_object("me/friends",**args)
I have already written several lines of code to pull URLs from this website:
http://www.worldhospitaldirectory.com/United%20States/hospitals
The code is below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv
driver = webdriver.Firefox()
driver.get('http://www.worldhospitaldirectory.com/United%20States/hospitals')
url = []
pagenbr = 1
while pagenbr <= 115:
    current = driver.current_url
    driver.get(current)
    lks = driver.find_elements_by_xpath('//*[@href]')
    for ii in lks:
        link = ii.get_attribute('href')
        if '/info' in link:
            url.append(link)
    print('page ' + str(pagenbr) + ' is done.')
    if pagenbr <= 114:
        elm = driver.find_element_by_link_text('Next')
        driver.implicitly_wait(10)
        elm.click()
        time.sleep(2)
    pagenbr += 1

ls = list(set(url))
with open('US_GeneralHospital.csv', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for u in ls:
        wr.writerow([u])
And it worked very well to pull each individual link from this website.
But the problem is that I need to change the number of pages to loop over manually every time.
I want to upgrade this code so it iterates by calculating how many pages it needs, not by manual input.
Thank you very much.
It is a bad idea to hardcode the number of pages in your script. Try just clicking the "Next" button while it is enabled:
from selenium.common.exceptions import NoSuchElementException
while True:
    try:
        # do whatever you need to do on the page
        driver.find_element_by_xpath('//li[not(@class="disabled")]/span[text()="Next"]').click()
    except NoSuchElementException:
        break
This should let you keep scraping pages until the last page is reached.
Also note that the lines current = driver.current_url and driver.get(current) make no sense at all (they just reload the page you are already on), so you can skip them.
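Put together with the link collection from the question, a sketch of the whole loop (the '/info' filter comes from the original code):
from selenium.common.exceptions import NoSuchElementException

url = []
while True:
    # Collect the hospital links on the current page.
    for ii in driver.find_elements_by_xpath('//*[@href]'):
        link = ii.get_attribute('href')
        if '/info' in link:
            url.append(link)
    try:
        # Advance to the next page; the loop ends when no enabled "Next" button is left.
        driver.find_element_by_xpath('//li[not(@class="disabled")]/span[text()="Next"]').click()
        time.sleep(2)
    except NoSuchElementException:
        break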