Web scraping dynamic content with Splinter module - python

I'm struggling with scraping a table (from steamcommunity) that is dynamically loaded through js. I'm using a combination of python Splinter and headless browser Phantomjs.
Here is what I already came up with:
from splinter import Browser
import time
import sys
browser = Browser('phantomjs')
url = 'https://steamcommunity.com/market/listings/730/%E2%98%85%20Karambit%20%7C%20Blue%20Steel%20(Battle-Scarred)'
browser.visit(url)
print browser.is_element_present_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]', wait_time = 5)
price_table = browser.find_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]/table/tbody/tr')
print price_table
print price_table.first
print price_table.first.text
print price_table.first.value
browser.quit()
The first method is_element_present_by_xpath() ensures that the table I'm interested in is loaded. Then I try to access the rows of that table.
As I understood from Splinter documentation the .find_by_xpath() method returns ElementList, which is essentially a normal list with some aliases provided.
Price_table is an ElementList of all rows of table. The last two prints give out empty results, and I can't find any reason why text-method returns an empty string.
How could the elements of that table be accessed?

I tried code with different browsers and always got empty text but I found expected data in html. Maybe it is only mistake in splinter.
from splinter import Browser
#browser = Browser('firefox')
#browser = Browser('phantomjs')
#browser = Browser('chrome') # executable_path='/usr/bin/chromium-browser' ??? error !!!
browser = Browser('chrome') # executable_path='/usr/bin/chromedriver' OK
url = 'https://steamcommunity.com/market/listings/730/%E2%98%85%20Karambit%20%7C%20Blue%20Steel%20(Battle-Scarred)'
browser.visit(url)
print(browser.is_element_present_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]', wait_time = 5))
price_table = browser.find_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]/table/tbody/tr')
for row in price_table:
print('row html:', row.html)
print('row text:', row.text) # empty ???
for col in row.find_by_tag('td'):
print(' col html:', col.html)
print(' col text:', col.text) # empty ???
browser.quit()

Have you tried doing for i in price_table yet? From the code it states that ElementList element extends python list. I am sure you can just iterate over price_table.
Edit: Also this is the first time I have heard of splinter, it looks like it is just an abstraction over the selenium python package. Maybe if you are stuck you can look at selenium docs. They are better written.
from splinter import Browser
import time
import sys
browser = Browser('phantomjs')
url = 'https://steamcommunity.com/market/listings/730/%E2%98%85%20Karambit%20%7C%20Blue%20Steel%20(Battle-Scarred)'
browser.visit(url)
print browser.is_element_present_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]', wait_time = 5)
price_table = browser.find_by_xpath('//*[#id="market_commodity_buyreqeusts_table"]/table/tbody/tr')
for i in price_table:
print i
print i.text
browser.quit()

Related

How do I make the driver navigate to new page in selenium python

I am trying to write a script to automate job applications on Linkedin using selenium and python.
The steps are simple:
open the LinkedIn page, enter id password and log in
open https://linkedin.com/jobs and enter the search keyword and location and click search(directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia get stuck as loading, probably due to lack of some post information from the previous page)
the click opens the job search page but this doesn't seem to update the driver as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source
stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)
def login():
username = driver.find_element_by_id("session_key")
password = driver.find_element_by_id("session_password")
username.send_keys(details["login_details"]["id"])
password.send_keys(details["login_details"]["password"])
driver.find_element_by_class_name("sign-in-form__submit-button").click()
def get_experience():
return "1%C22"
login()
jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(#id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(#id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
# a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried to print the content to a file, it is indeed from the previous jobs page and not from the search page. I have also tried to search elements from page like experience and easy apply button but the search results in a not found error.
I am not sure why this isn't working.
Any ideas? Thanks in Advance
UPDATE
It works if try to directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia
the difference in both these links is that one of them takes only one value for experience level while the other one takes two values. This means it's probably not a post values issue.
You are getting and printing the current URL immediately after clicking on the search button, before the page changed with the response received from the server.
This is why it outputs you with https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay like time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, instance of WebDriverWait module / class

Trying to scrape a specific element using Selenium but getting no results

I cobbled together some code to login to a website and navigate to the specific pages I want to scrape from. This part works fine. Now, however, I'm searching for a specific element titled 'tspan' and I'm getting an error that reads:
AttributeError: 'str' object has no attribute 'descendants'
If I go to the URL, right-click the element I want to grab, and click 'Inspect Element' I see the code behind the page, and it looks like this.
It looks like querying by 'g id' may work too.
So, I thought I could get all 'tspan' items, load all into a list, and write the list to a text file. However, I'm getting no 'tspan' elements at all. If I right-click the page and click 'View Page Source', I see no 'tspan' elements. This is very weird! The code behind the page is definitely different than what's rendered on the page itself. Here's my code. What am I doing wrong here?
from bs4 import BeautifulSoup as bs
import webbrowser
import requests
from lxml import html
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True
import time
# selenium
wd = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe", firefox_profile=profile)
url = "https://corp-internal.com/admin/?page=0"
wd.get(url)
# set username
time.sleep(2)
username = wd.find_element_by_id("identifierId")
username.send_keys("my_email#email.com")
wd.find_element_by_id("identifierNext").click()
# set password
time.sleep(2)
password = wd.find_element_by_name("password")
password.send_keys("my_pswd")
wd.find_element_by_id("passwordNext").click()
all_text = []
# list of URLs
url_list = ['https://corp-internal.com/admin/graph?dag_id=emm1_daily_legacy',
'https://corp-internal.com/admin/graph?dag_id=eemm1_daily_legacy_history']
for link in url_list:
#File = webbrowser.open(link)
#File = requests.get(link)
#data = File.text
for link in bs.findAll('tspan'):
alldata = all_text.append(link.get('tspan'))
outF = open('C:/Users/ryans/OneDrive/Desktop/test.txt', 'w')
outF.writelines(alldata)
outF.close()

Python and Selenium: I am automating web scraping among pages. How can I loop by Next button?

I already written several lines of codes to pull url from this website.
http://www.worldhospitaldirectory.com/United%20States/hospitals
code is below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv
driver = webdriver.Firefox()
driver.get('http://www.worldhospitaldirectory.com/United%20States/hospitals')
url = []
pagenbr = 1
while pagenbr <= 115:
current = driver.current_url
driver.get(current)
lks = driver.find_elements_by_xpath('//*[#href]')
for ii in lks:
link = ii.get_attribute('href')
if '/info' in link:
url.append(link)
print('page ' + str(pagenbr) + ' is done.')
if pagenbr <=114:
elm = driver.find_element_by_link_text('Next')
driver.implicitly_wait(10)
elm.click()
time.sleep(2)
pagenbr += 1
ls = list(set(url))
with open('US_GeneralHospital.csv', 'wb') as myfile:
wr = csv.writer(myfile,quoting=csv.QUOTE_ALL)
for u in ls:
wr.writerow([u])
And it worked very well to pull each individual links from this website.
But the problem is I need to change the page number I need to loop by myself every time.
I want to let this code upgrade to iterate by calculating how many time it need. Not by manually inputting.
Thank you very much.
This is bad idea to hardcode the number of pages in your script. Try just to click "Next" button while it is enabled:
from selenium.common.exceptions import NoSuchElementException
while True:
try:
# do whatever you need to do on page
driver.find_element_by_xpath('//li[not(#class="disabled")]/span[text()="Next"]').click()
except NoSuchElementException:
break
This should allow you to execute page scraping until the last page reached
Also note that using lines current = driver.current_url and driver.get(current) makes no sense at all, so you might skip them

Checking the clickability of an element in selenium using python

I've been trying to write a script which will give me all the links to the episodes present on this page :- http://www.funimation.com/shows/assassination-classroom/videos/episodes
As you can see that the links can be seen in 'Outer HTML', I used selenium and PhantomJS with python.
Link Example: http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time
However, I can't seem to get my code right. I do have a basic Idea of what I want to do. Here's the process :-
1.) Copy the Outer HTML of the very first page and then save it as 'Source_html' file.
2.) Look for links inside this file.
3.) Move to the next page to see rest of the videos and their links.
4.) Repeat the step 2.
This is what my code looks like :
from selenium import webdriver
from selenium import selenium
from bs4 import BeautifulSoup
import time
# ---------------------------------------------------------------------------------------------
driver = webdriver.PhantomJS()
driver.get('http://www.funimation.com/shows/assassination-classroom/videos/episodes')
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
f = open('source_code.html', 'w')
f.write(source_code.encode('utf-8'))
f.close()
print 'Links On First Page Are : \n'
soup = BeautifulSoup('source_code.html')
subtitles = soup.find_all('div',{'class':'popup-heading'})
official = 'something'
for official in subtitles:
x = official.findAll('a')
for a in x:
print a['href']
sbtn = driver.find_element_by_link_text(">"):
print sbtn
print 'Entering The Loop Now'
for driver.find_element_by_link_text(">"):
sbtn.click()
time.sleep(3)
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
f = open('source_code1.html', 'w')
f.write(source_code.encode('utf-8'))
f.close()
Things I already know :-
soup = BeautifulSoup('source_code.html') won't work, because I need to open this file via python and feed it into BS after that. That I can manage.
That official variable isn't really doing anything. Just helping me start a loop.
for driver.find_element_by_link_text(">"):
Now, this is what I need to fix somehow. I'm not sure how to check if this thing is still clickable or not. If yes, then proceed to next page, get the links, click this again to go to page 3 and repeat the process.
Any help would be appreciated.
You don't need to use BeautifulSoup here at all. Just grab all the links via selenium. Proceed to next page only if the > link is visible. Here is the complete implementation including gathering the links, necessary waits. It should work for any page count:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get("http://www.funimation.com/shows/assassination-classroom/videos/episodes")
wait = WebDriverWait(driver, 10)
links = []
while True:
# wait for the page to load
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.item-title")))
# wait until the loading circle becomes invisible
wait.until(EC.invisibility_of_element_located((By.ID, "loadingCircle")))
links.extend([link.get_attribute("href") for link in driver.find_elements_by_css_selector("a.item-title")])
print("Parsing page number #" + driver.find_element_by_css_selector("a.jp-current").text)
# click next
next_link = driver.find_element_by_css_selector("a.next")
if not next_link.is_displayed():
break
next_link.click()
time.sleep(1) # hardcoded delay
print(len(links))
print(links)
For the mentioned in the question URL, it prints:
Parsing page number #1
Parsing page number #2
93
['http://www.funimation.com/shows/assassination-classroom/videos/official/assassination-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assassination-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assassination-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/baseball-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/baseball-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/baseball-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/grown-up-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/grown-up-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/grown-up-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assembly-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assembly-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assembly-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/test-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/test-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/test-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time1st-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time1st-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time1st-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/l-and-r-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/l-and-r-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/l-and-r-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/ball-game-tournament-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/ball-game-tournament-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/ball-game-tournament-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/talent-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/talent-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/talent-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/vision-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/vision-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/vision-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/end-of-term-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/end-of-term-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/end-of-term-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/schools-out1st-term', 'http://www.funimation.com/shows/assassination-classroom/videos/official/schools-out1st-term', 'http://www.funimation.com/shows/assassination-classroom/videos/official/schools-out1st-term', 'http://www.funimation.com/shows/assassination-classroom/videos/official/island-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/island-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/island-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/action-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/action-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/action-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/pandemonium-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/pandemonium-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/pandemonium-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time2nd-period', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/soul-eater', 'http://www.funimation.com/shows/soul-eater', 'http://www.funimation.com/shows/assassination-classroom/videos/official/xx-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/xx-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/xx-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/nagisa-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/nagisa-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/nagisa-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/summer-festival-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/summer-festival-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/summer-festival-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/kaede-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/kaede-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/kaede-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/itona-horibe-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/itona-horibe-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/itona-horibe-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/spinning-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/spinning-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/spinning-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/leader-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/leader-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/leader-time', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/soul-eater', 'http://www.funimation.com/shows/soul-eater']
Basically, I use webelement.is_displayed() to check if it is clickable or not.
isLinkDisplay = driver.find_element_by_link_text(">").is_displayed()

Scrape with BeautifulSoup from site that uses AJAX pagination using Python

I'm fairly new to coding and Python so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search results pages and scrapes each page for all of the urls. I've got all of the scrapping working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just make a loop with the url to capture each search result but that's not possible. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report
This is the script I have so far:
with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
soup = BeautifulSoup(page)
snippet = soup.find_all('a', attrs={'item-title'})
for a in snippet:
logfile.write ("http://www.heritage.org" + a.get('href') + "\n")
print "Done collecting urls"
Obviously, it scrapes the first page of results and nothing more.
And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.
For the sake of completeness, while you may try accessing the POST request and to find a way round to access to next page, like I suggested in my comment, if an alternative is possible, using Selenium will be quite easy to achieve what you want.
Here is a simple solution using Selenium for your question:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
# uncomment if using Firefox web browser
driver = webdriver.Firefox()
# uncomment if using Phantomjs
#driver = webdriver.PhantomJS()
url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)
# set initial page count
pages = 1
with open('heritageURLs.txt', 'w') as f:
while True:
try:
# sleep here to allow time for page load
sleep(5)
# grab the Next button if it exists
btn_next = driver.find_element_by_class_name('next')
# find all item-title a href and write to file
links = driver.find_elements_by_class_name('item-title')
print "Page: {} -- {} urls to write...".format(pages, len(links))
for link in links:
f.write(link.get_attribute('href')+'\n')
# Exit if no more Next button is found, ie. last page
if btn_next is None:
print "crawling completed."
exit(-1)
# otherwise click the Next button and repeat crawling the urls
pages += 1
btn_next.send_keys(Keys.RETURN)
# you should specify the exception here
except:
print "Error found, crawling stopped"
exit(-1)
Hope this helps.

Categories