I'm trying to get the entire page content from this page: https://www.csaregistries.ca/GHG_VR_Listing/CleanProjectProjects
It seems that the page automatically limits the displayed content while scrolling.
For instance, the 1st code returns names starting with A - C, whereas the 2nd code returns L - W. Is there any way to pull out the entire page content? (I'm not asking about page_source, scrolling down, or time.sleep; I'm asking how to override the page's unknown limitation with Selenium.)
The code I used is below:
1st (w/o going to the page bottom):
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

url = "https://www.csaregistries.ca/GHG_VR_Listing/CleanProjectProjects"
browser = webdriver.Chrome(ChromeDriverManager().install())
browser.maximize_window()
browser.get(url)
time.sleep(10)
content = browser.page_source.encode('utf-8')
with open('result1.html', 'wb') as file_:
    file_.write(content)
browser.close()
2nd (w/ going to the page bottom):
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager
import time

url = "https://www.csaregistries.ca/GHG_VR_Listing/CleanProjectProjects"
browser = webdriver.Chrome(ChromeDriverManager().install())
browser.maximize_window()
browser.get(url)
time.sleep(10)
elem = browser.find_element_by_xpath(
    '/html/body/div/div/div/div/div/div/div[2]/footer/div/div/div/div[3]')
actions = ActionChains(browser)
actions.click(elem).perform()
time.sleep(5)
content = browser.page_source.encode('utf-8')
with open('result1.html', 'wb') as file_:
    file_.write(content)
browser.close()
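For what it's worth, if the page is virtualizing its rows (rendering only those near the viewport), there may be no Selenium-side switch to disable that; one hedged workaround is to scroll in steps and accumulate whatever rows are rendered at each position, de-duplicating as you go. A minimal sketch, assuming the listing rows are plain tr elements (that selector is a guess at the markup):

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

browser = webdriver.Chrome(ChromeDriverManager().install())
browser.get("https://www.csaregistries.ca/GHG_VR_Listing/CleanProjectProjects")
time.sleep(10)

seen = set()
stagnant = 0
while stagnant < 3:  # stop after a few scrolls that add no new rows
    before = len(seen)
    for row in browser.find_elements_by_css_selector("tr"):  # selector is an assumption
        seen.add(row.text)
    stagnant = stagnant + 1 if len(seen) == before else 0
    browser.execute_script("window.scrollBy(0, 800);")
    time.sleep(1)

print("collected %d unique rows" % len(seen))
browser.close()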
Related
I want to request this url:
https://www.codal.ir/CompanyList.aspx
This URL contains tables spread across 110 pages; when the page changes, neither the URL nor the request changes.
This is my code:
import requests

session = requests.Session()
isics = session.get("https://www.codal.ir/CompanyList.aspx")
print(isics.text)
but I only get the first page's information. I intend to extract the required information from the tables with requests and regex, but if you have another way I will be happy to hear it. Thanks for helping me get all the pages.
I used Selenium to navigate the table. You can't do that with requests, because there are no links that redirect to the next page of the table. You can find the code below.
from bs4 import BeautifulSoup
from selenium import webdriver
import time

def get_company_links(links, driver):
    soup = BeautifulSoup(driver.page_source, "html.parser")
    rows = soup.select("table.companies-table tr")
    for row in rows:
        link = row.select_one("a")
        if link:
            links.append("https://www.codal.ir/" + link['href'])

options = webdriver.ChromeOptions()
# options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://www.codal.ir/CompanyList.aspx")

links = []  # the original snippet never initialized this list
current_page_button = driver.find_element_by_css_selector('input[type="submit"].normal.selected')
page_number = int(current_page_button.get_attribute('value'))
while True:
    get_company_links(links, driver)
    next_page_button = driver.find_element_by_css_selector('input#ctl00_ContentPlaceHolder1_ucPager1_btnNext')
    next_page_button.click()
    time.sleep(2)
    previous_page_number = page_number
    current_page_button = driver.find_element_by_css_selector('input[type="submit"].normal.selected')
    page_number = int(current_page_button.get_attribute('value'))
    if previous_page_number == page_number:
        break  # no more pages left
print(links)
The main working principle is to navigate through the table and collect the links to the company pages. We use the next button to navigate, and we stop when the page index no longer changes after a click, which indicates that we have reached the end of the table.
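If the fixed time.sleep(2) ever proves flaky, a possible refinement (a sketch reusing the same selectors, which remain assumptions about the page's markup) is to wait explicitly until the pager's value changes, and treat a timeout as the last-page signal:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def wait_for_page_change(driver, old_number, timeout=10):
    # Block until the highlighted pager button shows a different page number.
    def page_changed(d):
        button = d.find_element_by_css_selector('input[type="submit"].normal.selected')
        return int(button.get_attribute('value')) != old_number
    try:
        WebDriverWait(driver, timeout).until(page_changed)
        return True
    except TimeoutException:
        return False  # the number never changed, so we are on the last page

In the loop above, calling wait_for_page_change(driver, page_number) right after next_page_button.click() could replace the sleep, with a False return serving as the break condition.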
I am trying to write a script to automate job applications on LinkedIn using Selenium and Python.
The steps are simple:
open the LinkedIn page, enter the ID and password, and log in
open https://linkedin.com/jobs, enter the search keyword and location, and click Search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably because some POST data from the previous page is missing)
the click opens the job search page, but the driver doesn't seem to update: it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source
stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)

def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()

def get_experience():
    return "1%C22"

login()
jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
#     a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried to find elements from the results page, such as the experience filter and the Easy Apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in advance.
UPDATE
It works if I directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia
The difference between these links is that one takes only one value for the experience level while the other takes two values. This suggests it's probably not an issue with POST values.
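(For reference, %2C is just the percent-encoding of a comma, so f_E=1%2C2 means f_E=1,2, i.e. two experience values; the standard library confirms this:)

from urllib.parse import quote, unquote

print(unquote("1%2C2"))       # -> 1,2
print(quote("1,2", safe=""))  # -> 1%2C2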
You are getting and printing the current URL immediately after clicking the search button, before the page has changed with the response received from the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a WebDriverWait object; on its own it waits for nothing. You have to call its until() (or until_not()) method with a condition for it to actually block.
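A minimal sketch of the fix (waiting for the URL to change is one option; the "jobs/search" fragment is an assumption about what the results URL contains):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)
# Blocks until the URL contains the fragment, or raises TimeoutException after 20s
wait.until(EC.url_contains("jobs/search"))
print(driver.current_url)  # now reflects the search results page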
I'm facing a problem: when I click on a button, JavaScript handles the action and redirects to a new page in a new window (similar to clicking an <a> with target="_blank"). In Scrapy/Splash I don't know how to get content from the new page (i.e., I don't know how to control that new page).
Can anyone help?
script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(0.5)
    local element = splash:select('div.result-content-columns div.result-title')
    local bounds = element:bounds()
    element:mouse_click{x=bounds.width/2, y=bounds.height/2}
    return splash:html()
end
"""
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': self.script})
Issue:
The problem is that you can't scrape HTML that is outside your selection scope. When a new link is clicked, if an iframe is involved, it is rarely brought into scope for scraping.
Solution:
Choose a method of selecting the new iframe, and then parse its HTML.
The Scrapy-Splash method
(This is an adaptation of Mikhail Korobov's solution from this answer)
If you are able to get the src link of the new page that pops up, that may be the most reliable approach; however, you can also try selecting the iframe this way:
import parsel  # for parsel.Selector below

# ... inside your spider's method:
yield SplashRequest(url, self.parse_result, endpoint='render.json',
                    args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }
The Selenium method
(Requires pip install selenium and bs4, and possibly a chromedriver download from here for your OS: Selenium Chromedrivers.) Supports JavaScript parsing! Woohoo!
With the following code, this will switch scopes to the new frame:
# Goes at the top
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

# Your paths depend on where you downloaded/located your chromedriver.exe
CHROME_PATH = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
CHROMEDRIVER_PATH = 'chromedriver.exe'
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
chrome_options.add_argument("--log-level=3")
chrome_options.add_argument("--headless")  # Speeds things up if you don't need a GUI
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH

browser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, chrome_options=chrome_options)
url = "example_js_site.com"  # Your site goes here
browser.get(url)
time.sleep(3)  # An unsophisticated way to wait for the new page to load.
browser.switch_to.frame(0)
soup = BeautifulSoup(browser.page_source.encode('utf-8').strip(), 'lxml')

# This will return any content found in '<table>' tags
table = soup.find_all('table')
My favorite of the two options is Selenium, but try the first solution if you are more comfortable with it!
I'm trying to fetch all the movie posters from the left-side area of this site, but my script only parses the first one and quits.
How can I get all the movie poster links ending with the .jpg extension?
from selenium import webdriver

def fetch_image_links(driver, link):
    driver.get(link)
    for item in driver.find_elements_by_css_selector("a[href^='/title/'] img.loadlate[src$='.jpg']"):
        print(item.get_attribute("src"))

if __name__ == '__main__':
    url = "https://www.imdb.com/list/ls006385184/"
    driver = webdriver.Chrome()
    try:
        fetch_image_links(driver, url)
    finally:
        driver.quit()
When the page is opened, only the first couple of movies have posters; all the others show a default image.
You need to scroll down the page and wait until no default images are displayed (i.e., posters have loaded for all movies):
from selenium.webdriver.common.keys import Keys

default_img = "https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png"

def fetch_image_links(driver, link):
    driver.get(link)
    # Keep paging down while any poster still shows the default placeholder
    while driver.find_elements_by_css_selector("a>img[src='%s']" % default_img):
        driver.find_element_by_tag_name('a').send_keys(Keys.PAGE_DOWN)
    for item in driver.find_elements_by_css_selector("a[href^='/title/'] img.loadlate[src$='.jpg']"):
        print(item.get_attribute("src"))
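If a poster never loads, the while loop above would spin forever; a bounded variant (a sketch reusing the same default_img check) caps the number of scroll attempts:

import time

def fetch_image_links_bounded(driver, link, max_scrolls=50):
    driver.get(link)
    for _ in range(max_scrolls):
        # Stop early once no placeholder images remain
        if not driver.find_elements_by_css_selector("a>img[src='%s']" % default_img):
            break
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        time.sleep(0.5)
    for item in driver.find_elements_by_css_selector("a[href^='/title/'] img.loadlate[src$='.jpg']"):
        print(item.get_attribute("src"))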
I'm trying to get the number of plays for the top songs from a number of artists on Spotify using python and splinter.
If you fill in the username and password below with yours, you should be able to run the code.
from splinter import Browser
import time
from bs4 import BeautifulSoup
browser = Browser()
url = 'http://play.spotify.com'
browser.visit(url)
time.sleep(2)
button = browser.find_by_id('has-account')
button.click()
time.sleep(1)
browser.fill('username', 'your_username')
browser.fill('password', 'your_password')
buttons = browser.find_by_css('button')
visible_buttons = [button for button in buttons if button.visible]
login_button = visible_buttons[-1]
login_button.click()
time.sleep(1)
browser.visit('https://play.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6')
time.sleep(10)
So far, so good. If you open up Firefox, you can see Miley Cyrus's artist page, including the number of plays for top tracks.
If you open up the Firefox Developer Tools Inspector and hover, you can see the name of the song in .tl-highlight elements and the number of plays in .tl-listen-count elements. However, I've found it impossible (at least on my machine) to access these elements using splinter. Moreover, when I try to get the source for the entire page, the elements that I can see by hovering my mouse over them in Firefox don't show up in what is ostensibly the page source.
html = browser.html
soup = BeautifulSoup(html)
output = soup.prettify()
with open('miley_cyrus_artist_page.html', 'w') as output_f:
    output_f.write(output)
browser.quit()
I don't think I know enough about web programming to know what the issue is here: Firefox sees all the DOM elements clearly, but splinter, which is driving Firefox, does not.
The key problem is that there is an iframe containing the artist's page with the list of tracks. You need to switch into its context before searching for elements:
frame = browser.driver.find_element_by_css_selector("iframe[id^=browse-app-spotify]")
browser.driver.switch_to.frame(frame)
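Once you are done inside the frame, you can switch back to the top-level document before touching anything outside it:

# Return from the iframe to the main document
browser.driver.switch_to.default_content()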
Many thanks to @alecxe; the following code works to pull the information on the artist.
from splinter import Browser
import time
from bs4 import BeautifulSoup
import codecs
browser = Browser()
url = 'http://play.spotify.com'
browser.visit(url)
time.sleep(2)
button = browser.find_by_id('has-account')
button.click()
time.sleep(1)
browser.fill('username', 'your_username')
browser.fill('password', 'your_password')
buttons = browser.find_by_css('button')
visible_buttons = [button for button in buttons if button.visible]
login_button = visible_buttons[-1]
login_button.click()
time.sleep(1)
browser.visit('https://play.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6')
time.sleep(30)
CORRECT_FRAME_INDEX = 6
with browser.get_iframe(CORRECT_FRAME_INDEX) as iframe:
    html = iframe.html
    soup = BeautifulSoup(html)
    output = soup.prettify()
    with codecs.open('test.html', 'w', 'utf-8') as output_f:
        output_f.write(output)
browser.quit()