Selenium not getting HTML of PDF links - python

I'm trying to download the PDF slides off this website using Python and Selenium, but I think the links to the slides only appear after a script runs. I tried waiting for the JavaScript to load, but it's still not finding anything. Any ideas?
import os, sys, time, random
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://mila.umontreal.ca/en/cours/deep-learning-summer-school-2017/slides'
browser = webdriver.Chrome()
browser.get(url)
browser.implicitly_wait(3)
html = browser.page_source
links = browser.find_elements_by_class_name('flip-entry')
print(links)
browser.quit()

The reason is that there are no links on the main page; they are inside an IFrame. This IFrame points to https://drive.google.com/embeddedfolderview?hl=fr&id=0ByUKRdiCDK7-c0k1TWlLM1U1RXc#list
You can either browse that URL directly in your code instead of the main page, or switch to the frame:
browser.switch_to_frame(browser.find_element_by_class_name("iframe-class"))
links = browser.find_elements_by_css_selector('.flip-entry a')
for link in links:
    print(link.get_attribute("href"))

from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://mila.umontreal.ca/en/cours/deep-learning-summer-school-2017/slides'
browser = webdriver.Chrome()
browser.get(url)
browser.switch_to_frame(browser.find_element_by_class_name('iframe-class'))
links = browser.find_elements_by_css_selector('.flip-entry a')
for link in links:
    print(link.get_attribute("href"))
browser.quit()
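If you prefer the first option, fetching the IFrame URL directly without a browser, here is a minimal sketch with requests and BeautifulSoup. It assumes the embedded folder view serves the .flip-entry anchors as static HTML, which may not hold if Google renders that page with JavaScript:
import requests
from bs4 import BeautifulSoup

# URL the IFrame points to (taken from the answer above).
frame_url = ('https://drive.google.com/embeddedfolderview'
             '?hl=fr&id=0ByUKRdiCDK7-c0k1TWlLM1U1RXc#list')

html = requests.get(frame_url).text
soup = BeautifulSoup(html, 'lxml')

# Each slide entry is a .flip-entry element wrapping an anchor.
for a in soup.select('.flip-entry a'):
    print(a.get('href'))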

Related

How do I go about scraping some data from chrome browser?

The webpage I am trying to scrape can only be seen after login, so using a direct URL won't work. I need to scrape the data while I am logged in through my Chrome browser.
Then I need to get the value of the element from the page (the calculatedMinRate span in the code below).
I have tried the following code.
import requests
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
lastdatadate=[]
lastprocesseddate=[]
source = requests.get('webpage.com').text
content = driver.page_source
soup = bs(content, 'lxml')
#print(soup.prettify())
price = soup.find('span', attrs={'id':'calculatedMinRate'})
print(price.text)
You could still perform a login on the opened webdriver and fill in the input fields, as explained here: How to locate and insert a value in a text box (input) using Python Selenium?
Steps:
1. Fill in the input fields.
2. Find the submit button and trigger a click event.
3. Add a sleep command afterwards; a few seconds should be enough.
Afterwards you should be able to get the data.
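A rough sketch of those steps, reusing the setup from the question; the login URL, the field names, and the submit-button selector are placeholders for whatever your site actually uses:
import time
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://webpage.com/login')  # placeholder login URL

# Fill in the input fields (field names are placeholders).
driver.find_element_by_name('username').send_keys('my_user')
driver.find_element_by_name('password').send_keys('my_password')

# Find the submit button and trigger a click event.
driver.find_element_by_css_selector('button[type="submit"]').click()

# Give the site a few seconds to finish logging in.
time.sleep(5)

# Load the protected page in the same (logged-in) session and parse it.
driver.get('https://webpage.com/protected-page')  # placeholder URL
soup = bs(driver.page_source, 'lxml')
price = soup.find('span', attrs={'id': 'calculatedMinRate'})
print(price.text if price else 'calculatedMinRate not found')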

How to Get the Webpage Title of Chrome?

Using python, is there a way to obtain the title of the current active tab in Chrome?
If it is impossible, getting the list of titles of all tabs also works for my purpose.
Thanks.
There are multiple ways of getting the title of the current tab using Python.
You could use BeautifulSoup:
from urllib.request import urlopen
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen("https://www.google.com"), "html.parser")
print(soup.title.string)
Or, if you're using Selenium:
from selenium import webdriver
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
driver.maximize_window()
driver.get(YOUR_URL)
print(driver.title)
print(driver.current_url)
driver.refresh()
driver.close()
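If you also need the list of titles of all tabs (the second part of the question), you can loop over the driver's window handles; a short sketch using the same Selenium session:
from selenium import webdriver

driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
# ... open or switch to whatever tabs you need here ...

titles = []
for handle in driver.window_handles:
    driver.switch_to.window(handle)  # focus each tab in turn
    titles.append(driver.title)
print(titles)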

Missing Elements from HTML File Using BeautifulSoup

I'm very new to the web-scraping world, and I'm trying to scrape the names of shoes from a website. When I use inspect on the website, there's a div tag that has basically the entire webpage inside it, but when I print out the html code, the div tag is completely empty! Here's my current code:
from bs4 import BeautifulSoup
import requests
import time
def findShoeNames():
    html_file = requests.get('https://www.goat.com/sneakers/brand/air-jordan').text
    soup = BeautifulSoup(html_file, 'lxml')
    print(soup)

if __name__ == "__main__":
    findShoeNames()
When I call my function and print(soup), the div tag looks like this:
<div id="root"></div>
But as previously mentioned, when I hit inspect on the website, this div tag has basically the entire webpage inside it. So I'm unable to scrape any data from the website.
Please help! Thanks
The website uses JavaScript to load its content, so you should use Selenium and ChromeDriver.
Install Selenium.
Install ChromeDriver from here (unzip it and copy it to your Python folder).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.goat.com/sneakers/brand/air-jordan"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get(url)
time.sleep(1)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())
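If the fixed time.sleep(1) turns out to be flaky, you can swap it for an explicit wait on something inside the #root div mentioned in the question. A sketch that drops into the snippet above; '#root a' is an assumed selector, so inspect the rendered page for a more specific one:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one link to render inside #root.
# '#root a' is a placeholder selector; tighten it to the real markup.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#root a'))
)
page = driver.page_source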

Access all href-links in a deep-class hierarchy

I am trying to access all href-links from a website, the search results to be precise. My first intention is to get all the links and then look at them further. The problem is that I get some links from the website, but not the links of the search results. Here is one version of my code.
from selenium import webdriver
from htmldom import htmldom
dom = htmldom.HtmlDom("myWebsite")
dom = dom.createDom()
p_links = dom.find("a")
for link in p_links:
    print("URL: " + link.attr("href"))
Here is a screenshot of the HTML of that particular website. In the screenshot, I marked the href-link I am trying to access. I am open to any help, be it with Selenium, htmldom, BeautifulSoup, etc.
The data you are after is loaded with AJAX requests, so you can't scrape it directly from the initial page source. The AJAX request is sent to this URL:
https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20
which returns the data in JSON format. You can use the requests module to fetch it.
import requests
BASE_URL = 'https://open.nrw/dataset/'
r = requests.get('https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20')
data = r.json()
for item in data['response']['docs']:
    print(BASE_URL + item['name'])
Output:
https://open.nrw/dataset/mags-90-10-dezilsverhaeltnis-der-aequivalenzeinkommen-1512029759099
https://open.nrw/dataset/alkis-nutzungsarten-pro-baublock-wuppertal-w
https://open.nrw/dataset/allgemein-bildende-schulen-am-1510-nach-schulformen-schulen-schueler-und-lehrerbestand-w
https://open.nrw/dataset/altersgruppen-in-meerbusch-gesamt-meerb
https://open.nrw/dataset/amtliche-stadtkarte-wuppertal-raster-w
https://open.nrw/dataset/mais-anteil-abhaengig-erwerbstaetiger-mit-geringfuegiger-beschaeftigung-1477312040433
https://open.nrw/dataset/mags-anteil-der-stillen-reserve-nach-geschlecht-und-altersgruppen-1512033735012
https://open.nrw/dataset/mags-anteil-der-vermoegenslosen-in-nrw-nach-beruflicher-stellung-1512032087083
https://open.nrw/dataset/anzahl-kinderspielplatze-meerb
https://open.nrw/dataset/anzahl-der-sitzungen-von-rat-und-ausschussen-meerb
https://open.nrw/dataset/anzahl-medizinischer-anwendungen-den-oeffentlichen-baedern-duesseldorfs-seit-2006-d
https://open.nrw/dataset/arbeitslose-den-wohnquartieren-duesseldorf-d
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-rechtskreisen-des-sgb-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-stadtteilen-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sgb-ii-rechtskreis-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sozialversicherungspflichtige-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/verkehrszentrale-arbeitsstellen-in-nordrhein-westfalen-1476688294843
https://open.nrw/dataset/mags-arbeitsvolumen-nach-wirtschaftssektoren-1512025235377
https://open.nrw/dataset/mais-armutsrisikoquoten-nach-geschlecht-und-migrationsstatus-der-personen-1477313317038
As you can see, this returned the first 20 URLs. When the page first loads, only 20 items are present, but more are loaded as you scroll down. To get more items, change the rows query-string parameter at the end of the URL (rows=20) to the number of results you want, as in the sketch below.
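For example, here is a sketch that rebuilds the same Solr query with requests' params argument so the row count is easy to change. Trimming fl down to the fields used here is an assumption; you can keep the full field list from the URL above:
import requests

BASE_URL = 'https://open.nrw/dataset/'
SOLR_URL = 'https://open.nrw/solr/collection1/select'

params = {
    'q': '*:*',
    'fl': 'name title',        # assumed to be enough for printing the links
    'wt': 'json',
    'fq': '-type:harvest',
    'sort': 'title_string asc',
    'indent': 'true',
    'rows': 100,               # raise this to fetch more results per request
}

data = requests.get(SOLR_URL, params=params).json()
for item in data['response']['docs']:
    print(BASE_URL + item['name'])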
Results appear after the initial page load because of the AJAX request.
I managed to get the links with Selenium; however, I had to wait for the .ckantitle a elements to load (these are the links you want).
I should mention that the webdriver waits for a page to load by default. It does not wait for loading inside frames or for AJAX requests. That means when you use .get('url'), your browser waits until the page is completely loaded and then moves on to the next command in the code. But when an AJAX request is posted, webdriver does not wait, and it is your responsibility to wait an appropriate amount of time for the page, or a part of the page, to load; that is why there is a module named expected_conditions.
Code:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
url = 'https://open.nrw/suche'
html = None
browser = webdriver.Chrome()
browser.get(url)
delay = 3 # seconds
try:
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.ckantitle a'))
    )
    html = browser.page_source
except TimeoutException:
    print('Loading took too much time!')
finally:
    browser.quit()

if html:
    soup = BeautifulSoup(html, 'lxml')
    links = soup.select('.ckantitle a')
    for link in links:
        print(urljoin(url, link['href']))
You need to install selenium:
pip install selenium
and get a driver here.

Unable to Identify Webpage in BeautifulSoup by URL

I am using Python and Selenium to attempt to scrape all of the links from the results page of a certain search page.
No matter what I search for in the previous screen, the URL for any search on the results page is: "https://chem.nlm.nih.gov/chemidplus/ProxyServlet"
If I use Selenium to autosearch, then try to read this URL into BeautifulSoup, I get HTTPError: HTTP Error 404: Not Found
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
# create a new Firefox session
driver = webdriver.Firefox()
# wait 3 seconds for the page to load
driver.implicitly_wait(3)
# navigate to ChemIDPlus Website
driver.get("https://chem.nlm.nih.gov/chemidplus/")
#implicit wait 10 seconds for drop-down menu to load
driver.implicitly_wait(10)
#open drop-down menu QV7 ("Route:")
select=Select(driver.find_element_by_name("QV7"))
#select "inhalation" in QV7
select.select_by_visible_text("inhalation")
#identify submit button
search="/html/body/div[2]/div/div[2]/div/div[2]/form/div[1]/div/span/button[1]"
#click submit button
driver.find_element_by_xpath(search).click()
#increase the number of results per page
select=Select(driver.find_element_by_id("selRowsPerPage"))
select.select_by_visible_text("25")
#wait 3 seconds
driver.implicitly_wait(3)
#identify current search page...HERE IS THE ERROR, I THINK
url1="https://chem.nlm.nih.gov/chemidplus/ProxyServlet"
page1=urlopen(url1)
#read the search page
soup=BeautifulSoup(page1.content, 'html.parser')
I suspect this has something to do with the proxy server, and that Python is not receiving the necessary info to identify the page, but I'm not sure how to work around this.
Thanks in advance!
I used Selenium to identify the new URL as a work-around for identifying the proper search page:
url1=driver.current_url
Next, I used requests to get the content and feed it into BeautifulSoup.
All together, I added:
#Added to the top of the script
import requests
...
#identify the current search page with Selenium
url1 = driver.current_url
#scrape the content of the results page
r = requests.get(url1)
soup = BeautifulSoup(r.content, 'html.parser')
...
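An alternative that skips the second HTTP request altogether is to feed Selenium's rendered page source straight into BeautifulSoup, since the driver is already sitting on the results page:
# Parse the page the driver is currently displaying; no extra request needed.
soup = BeautifulSoup(driver.page_source, 'html.parser')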