Trying to scrape a specific element using Selenium but getting no results - python

I cobbled together some code to login to a website and navigate to the specific pages I want to scrape from. This part works fine. Now, however, I'm searching for a specific element titled 'tspan' and I'm getting an error that reads:
AttributeError: 'str' object has no attribute 'descendants'
If I go to the URL, right-click the element I want to grab, and click 'Inspect Element' I see the code behind the page, and it looks like this.
It looks like querying by 'g id' may work too.
So, I thought I could get all 'tspan' items, load all into a list, and write the list to a text file. However, I'm getting no 'tspan' elements at all. If I right-click the page and click 'View Page Source', I see no 'tspan' elements. This is very weird! The code behind the page is definitely different than what's rendered on the page itself. Here's my code. What am I doing wrong here?
from bs4 import BeautifulSoup as bs
import webbrowser
import requests
from lxml import html
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True
import time
# selenium
wd = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe", firefox_profile=profile)
url = "https://corp-internal.com/admin/?page=0"
wd.get(url)
# set username
time.sleep(2)
username = wd.find_element_by_id("identifierId")
username.send_keys("my_email#email.com")
wd.find_element_by_id("identifierNext").click()
# set password
time.sleep(2)
password = wd.find_element_by_name("password")
password.send_keys("my_pswd")
wd.find_element_by_id("passwordNext").click()
all_text = []
# list of URLs
url_list = ['https://corp-internal.com/admin/graph?dag_id=emm1_daily_legacy',
'https://corp-internal.com/admin/graph?dag_id=eemm1_daily_legacy_history']
for link in url_list:
#File = webbrowser.open(link)
#File = requests.get(link)
#data = File.text
for link in bs.findAll('tspan'):
alldata = all_text.append(link.get('tspan'))
outF = open('C:/Users/ryans/OneDrive/Desktop/test.txt', 'w')
outF.writelines(alldata)
outF.close()

Related

How do I make the driver navigate to new page in selenium python

I am trying to write a script to automate job applications on Linkedin using selenium and python.
The steps are simple:
open the LinkedIn page, enter id password and log in
open https://linkedin.com/jobs and enter the search keyword and location and click search(directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia get stuck as loading, probably due to lack of some post information from the previous page)
the click opens the job search page but this doesn't seem to update the driver as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source
stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)
def login():
username = driver.find_element_by_id("session_key")
password = driver.find_element_by_id("session_password")
username.send_keys(details["login_details"]["id"])
password.send_keys(details["login_details"]["password"])
driver.find_element_by_class_name("sign-in-form__submit-button").click()
def get_experience():
return "1%C22"
login()
jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(#id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(#id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
# a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried to print the content to a file, it is indeed from the previous jobs page and not from the search page. I have also tried to search elements from page like experience and easy apply button but the search results in a not found error.
I am not sure why this isn't working.
Any ideas? Thanks in Advance
UPDATE
It works if try to directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia
the difference in both these links is that one of them takes only one value for experience level while the other one takes two values. This means it's probably not a post values issue.
You are getting and printing the current URL immediately after clicking on the search button, before the page changed with the response received from the server.
This is why it outputs you with https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay like time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, instance of WebDriverWait module / class

Web scraping dynamic content with Splinter module

I'm struggling with scraping a table (from steamcommunity) that is dynamically loaded through js. I'm using a combination of python Splinter and headless browser Phantomjs.
Here is what I already came up with:
from splinter import Browser
import time
import sys
browser = Browser('phantomjs')
url = 'https://steamcommunity.com/market/listings/730/%E2%98%85%20Karambit%20%7C%20Blue%20Steel%20(Battle-Scarred)'
browser.visit(url)
print browser.is_element_present_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]', wait_time = 5)
price_table = browser.find_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]/table/tbody/tr')
print price_table
print price_table.first
print price_table.first.text
print price_table.first.value
browser.quit()
The first method is_element_present_by_xpath() ensures that the table I'm interested in is loaded. Then I try to access the rows of that table.
As I understood from Splinter documentation the .find_by_xpath() method returns ElementList, which is essentially a normal list with some aliases provided.
Price_table is an ElementList of all rows of table. The last two prints give out empty results, and I can't find any reason why text-method returns an empty string.
How could the elements of that table be accessed?
I tried code with different browsers and always got empty text but I found expected data in html. Maybe it is only mistake in splinter.
from splinter import Browser
#browser = Browser('firefox')
#browser = Browser('phantomjs')
#browser = Browser('chrome') # executable_path='/usr/bin/chromium-browser' ??? error !!!
browser = Browser('chrome') # executable_path='/usr/bin/chromedriver' OK
url = 'https://steamcommunity.com/market/listings/730/%E2%98%85%20Karambit%20%7C%20Blue%20Steel%20(Battle-Scarred)'
browser.visit(url)
print(browser.is_element_present_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]', wait_time = 5))
price_table = browser.find_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]/table/tbody/tr')
for row in price_table:
print('row html:', row.html)
print('row text:', row.text) # empty ???
for col in row.find_by_tag('td'):
print(' col html:', col.html)
print(' col text:', col.text) # empty ???
browser.quit()
Have you tried doing for i in price_table yet? From the code it states that ElementList element extends python list. I am sure you can just iterate over price_table.
Edit: Also this is the first time I have heard of splinter, it looks like it is just an abstraction over the selenium python package. Maybe if you are stuck you can look at selenium docs. They are better written.
from splinter import Browser
import time
import sys
browser = Browser('phantomjs')
url = 'https://steamcommunity.com/market/listings/730/%E2%98%85%20Karambit%20%7C%20Blue%20Steel%20(Battle-Scarred)'
browser.visit(url)
print browser.is_element_present_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]', wait_time = 5)
price_table = browser.find_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]/table/tbody/tr')
for i in price_table:
print i
print i.text
browser.quit()

Checking the clickability of an element in selenium using python

I've been trying to write a script which will give me all the links to the episodes present on this page :- http://www.funimation.com/shows/assassination-classroom/videos/episodes
As you can see that the links can be seen in 'Outer HTML', I used selenium and PhantomJS with python.
Link Example: http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time
However, I can't seem to get my code right. I do have a basic Idea of what I want to do. Here's the process :-
1.) Copy the Outer HTML of the very first page and then save it as 'Source_html' file.
2.) Look for links inside this file.
3.) Move to the next page to see rest of the videos and their links.
4.) Repeat the step 2.
This is what my code looks like :
from selenium import webdriver
from selenium import selenium
from bs4 import BeautifulSoup
import time
# ---------------------------------------------------------------------------------------------
driver = webdriver.PhantomJS()
driver.get('http://www.funimation.com/shows/assassination-classroom/videos/episodes')
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
f = open('source_code.html', 'w')
f.write(source_code.encode('utf-8'))
f.close()
print 'Links On First Page Are : \n'
soup = BeautifulSoup('source_code.html')
subtitles = soup.find_all('div',{'class':'popup-heading'})
official = 'something'
for official in subtitles:
x = official.findAll('a')
for a in x:
print a['href']
sbtn = driver.find_element_by_link_text(">"):
print sbtn
print 'Entering The Loop Now'
for driver.find_element_by_link_text(">"):
sbtn.click()
time.sleep(3)
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
f = open('source_code1.html', 'w')
f.write(source_code.encode('utf-8'))
f.close()
Things I already know :-
soup = BeautifulSoup('source_code.html') won't work, because I need to open this file via python and feed it into BS after that. That I can manage.
That official variable isn't really doing anything. Just helping me start a loop.
for driver.find_element_by_link_text(">"):
Now, this is what I need to fix somehow. I'm not sure how to check if this thing is still clickable or not. If yes, then proceed to next page, get the links, click this again to go to page 3 and repeat the process.
Any help would be appreciated.
You don't need to use BeautifulSoup here at all. Just grab all the links via selenium. Proceed to next page only if the > link is visible. Here is the complete implementation including gathering the links, necessary waits. It should work for any page count:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get("http://www.funimation.com/shows/assassination-classroom/videos/episodes")
wait = WebDriverWait(driver, 10)
links = []
while True:
# wait for the page to load
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.item-title")))
# wait until the loading circle becomes invisible
wait.until(EC.invisibility_of_element_located((By.ID, "loadingCircle")))
links.extend([link.get_attribute("href") for link in driver.find_elements_by_css_selector("a.item-title")])
print("Parsing page number #" + driver.find_element_by_css_selector("a.jp-current").text)
# click next
next_link = driver.find_element_by_css_selector("a.next")
if not next_link.is_displayed():
break
next_link.click()
time.sleep(1) # hardcoded delay
print(len(links))
print(links)
For the mentioned in the question URL, it prints:
Parsing page number #1
Parsing page number #2
93
['http://www.funimation.com/shows/assassination-classroom/videos/official/assassination-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assassination-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assassination-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/baseball-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/baseball-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/baseball-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/grown-up-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/grown-up-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/grown-up-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assembly-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assembly-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assembly-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/test-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/test-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/test-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time1st-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time1st-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time1st-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/l-and-r-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/l-and-r-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/l-and-r-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/ball-game-tournament-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/ball-game-tournament-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/ball-game-tournament-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/talent-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/talent-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/talent-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/vision-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/vision-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/vision-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/end-of-term-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/end-of-term-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/end-of-term-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/schools-out1st-term', 'http://www.funimation.com/shows/assassination-classroom/videos/official/schools-out1st-term', 'http://www.funimation.com/shows/assassination-classroom/videos/official/schools-out1st-term', 'http://www.funimation.com/shows/assassination-classroom/videos/official/island-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/island-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/island-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/action-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/action-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/action-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/pandemonium-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/pandemonium-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/pandemonium-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time2nd-period', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/soul-eater', 'http://www.funimation.com/shows/soul-eater', 'http://www.funimation.com/shows/assassination-classroom/videos/official/xx-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/xx-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/xx-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/nagisa-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/nagisa-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/nagisa-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/summer-festival-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/summer-festival-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/summer-festival-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/kaede-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/kaede-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/kaede-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/itona-horibe-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/itona-horibe-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/itona-horibe-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/spinning-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/spinning-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/spinning-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/leader-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/leader-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/leader-time', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/soul-eater', 'http://www.funimation.com/shows/soul-eater']
Basically, I use webelement.is_displayed() to check if it is clickable or not.
isLinkDisplay = driver.find_element_by_link_text(">").is_displayed()

Using Python to enter data into form and get data from results page

I am using Python 2.7 on a 32 bit Windows machine.
I am trying to enter species data into http://explorer.natureserve.org and retrieve the results, but am having difficulty understanding how to do it. Needless to say I am relatively new to Python.
I have the following code:
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://explorer.natureserve.org")
assert "NatureServe" in driver.title
SciName = driver.find_element_by_name('searchSciOrCommonName')
SciName.send_keys("Arabis georgiana")
SciName.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
The above works, but now I need to select the element Arabis georgiana on the results page, which will take me to another page. How do I get the results page back into Python and redirect to the page that I actually want?
You need to set the searchSciOrCommonName field value this way:
br.form = list(br.forms())[0]
br.form['searchSciOrCommonName'] = 'butterfly'
response = br.submit()
Then, you can parse the HTML response via, for example, BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response)
for item in soup.select('table[border="1"] > tr i')[1:]:
print(item.text.strip())
which would print:
Aglais io
Callophrys mossii hidakupa
Callophrys mossii marinensis
Cercyonis pegala incana
...
Psora nipponica
Flowering Plants
Asclepias tuberosa

Parse resulting webpage Python Selenium

I used Selenium webdriver in Python to input some text in search field and look for it. I'd now like to parse that page/ use something like BeautifulSoup on it. But I'm confused about how to call the resulting page.
My code so far:
textinput = open("1.txt", "r").read()
url = "http://www.example.com"
driver = webdriver.Chrome(executable_path='path/chromedriver.exe')
driver.get(url)
sbox = driver.find_element_by_name("a")
sbox.send_keys(textinput)
submit = driver.find_element_by_xpath('//*[#id="maincontent"]/form/input[5]')
submit.click()
Once you have clicked on the submit button using:
submit.click()
It automatically goes to the next page. So, to parse the resulting page, just make another:
whatimlookingfor = driver.find_element_by_id("myid")
submit = driver.find_element_by_xpath('//*[#id="maincontent"]/form/input[5]')
# You are still on the first page
submit.click()
# You are now on the second page
whatimlookingfor = driver.find_element_by_id("myid")

Categories