Access all href-links in a deep-class hierarchy - python

I am trying to access all href-links from a website, the search results to be precise. My first intention is to get all the links and then to look further into them. The problem is that I get some links from the website, but not the links of the search results. Here is one version of my code.
from selenium import webdriver
from htmldom import htmldom
dom = htmldom.HtmlDom("myWebsite")
dom = dom.createDom()
p_links = dom.find("a")
for link in p_links:
print("URL: " +link.attr("href"))
Here is a screenshot of the HTML of that particular website. In the screenshot, I marked the href-link I am trying to access. I am open to any help given, be it with Selenium, htmldom, BeautifulSoup, etc.

The data you are after is loaded with AJAX requests, so you can't scrape it directly from the initial page source. But the AJAX request is sent to this URL:
https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20
which returns the data in JSON format. You can use the requests module to scrape this data.
import requests
BASE_URL = 'https://open.nrw/dataset/'
r = requests.get('https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20')
data = r.json()
for item in data['response']['docs']:
    print(BASE_URL + item['name'])
Output:
https://open.nrw/dataset/mags-90-10-dezilsverhaeltnis-der-aequivalenzeinkommen-1512029759099
https://open.nrw/dataset/alkis-nutzungsarten-pro-baublock-wuppertal-w
https://open.nrw/dataset/allgemein-bildende-schulen-am-1510-nach-schulformen-schulen-schueler-und-lehrerbestand-w
https://open.nrw/dataset/altersgruppen-in-meerbusch-gesamt-meerb
https://open.nrw/dataset/amtliche-stadtkarte-wuppertal-raster-w
https://open.nrw/dataset/mais-anteil-abhaengig-erwerbstaetiger-mit-geringfuegiger-beschaeftigung-1477312040433
https://open.nrw/dataset/mags-anteil-der-stillen-reserve-nach-geschlecht-und-altersgruppen-1512033735012
https://open.nrw/dataset/mags-anteil-der-vermoegenslosen-in-nrw-nach-beruflicher-stellung-1512032087083
https://open.nrw/dataset/anzahl-kinderspielplatze-meerb
https://open.nrw/dataset/anzahl-der-sitzungen-von-rat-und-ausschussen-meerb
https://open.nrw/dataset/anzahl-medizinischer-anwendungen-den-oeffentlichen-baedern-duesseldorfs-seit-2006-d
https://open.nrw/dataset/arbeitslose-den-wohnquartieren-duesseldorf-d
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-rechtskreisen-des-sgb-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-stadtteilen-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sgb-ii-rechtskreis-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sozialversicherungspflichtige-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/verkehrszentrale-arbeitsstellen-in-nordrhein-westfalen-1476688294843
https://open.nrw/dataset/mags-arbeitsvolumen-nach-wirtschaftssektoren-1512025235377
https://open.nrw/dataset/mais-armutsrisikoquoten-nach-geschlecht-und-migrationsstatus-der-personen-1477313317038
As you can see, this returned the first 20 URLs. When you first load the page, only 20 items are present; if you scroll down, more are loaded. To get more items, you can change the query string parameters in the URL. The URL ends with rows=20, and you can change this number to get the desired number of results.
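For example, here is a minimal sketch of the same request with the query string passed as a params dict, so the rows value is easy to adjust (the reduced fl field list and the higher rows value are the only assumptions beyond the URL above):
import requests
BASE_URL = 'https://open.nrw/dataset/'
SOLR_URL = 'https://open.nrw/solr/collection1/select'
params = {
    'q': '*:*',
    'fl': 'name title',         # only the fields needed to build the links
    'wt': 'json',
    'fq': '-type:harvest',
    'sort': 'title_string asc',
    'rows': 100,                # raise this to get more results per request
}
r = requests.get(SOLR_URL, params=params)
data = r.json()
for item in data['response']['docs']:
    print(BASE_URL + item['name'])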

Results appear after the initial page load due to the AJAX request.
I managed to get the links with Selenium; however, I had to wait for the .ckantitle a elements to be loaded (these are the links you want to get).
I should mention that the webdriver will wait for a page to load by default. It does not wait for loading inside frames or for AJAX requests. It means that when you use .get('url'), your browser will wait until the page is completely loaded and then go to the next command in the code. But when the page posts an AJAX request, the webdriver does not wait, and it's your responsibility to wait an appropriate amount of time for the page, or a part of the page, to load; for this there is a module named expected_conditions.
Code:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
url = 'https://open.nrw/suche'
html = None
browser = webdriver.Chrome()
browser.get(url)
delay = 3 # seconds
try:
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.ckantitle a'))
    )
    html = browser.page_source
except TimeoutException:
    print('Loading took too much time!')
finally:
    browser.quit()
if html:
    soup = BeautifulSoup(html, 'lxml')
    links = soup.select('.ckantitle a')
    for link in links:
        print(urljoin(url, link['href']))
You need to install selenium:
pip install selenium
and get a driver here.

Related

How do I go about scraping some data from the Chrome browser?

The webpage I am trying to scrape can only be seen after login, so using a direct URL won't work. I need to scrape data while I am logged in using my Chrome browser.
Then I need to get the value of the element (the span with id calculatedMinRate in the code below).
I have tried using the following code.
import requests
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
lastdatadate=[]
lastprocesseddate=[]
source = requests.get('webpage.com').text
content = driver.page_source
soup = bs(content, 'lxml')
#print(soup.prettify())
price = soup.find('span', attrs={'id':'calculatedMinRate'})
print(price.text)
You could still perform a login on the opened webdriver and fill in the input fields, as explained here: How to locate and insert a value in a text box (input) using Python Selenium?
Steps:
Fill in the input fields
Find the submit button and trigger a click event
Afterwards add a sleep command; a few seconds should be enough
Afterwards you should be able to get the data, as in the sketch below.
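A rough sketch of these steps, building on the code above (the login URL and the username, password and loginBtn element ids are hypothetical placeholders; use the ones from the actual login page):
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup as bs
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://example.com/login')  # hypothetical login URL
# Fill in the input fields (element ids are placeholders)
driver.find_element_by_id('username').send_keys('my_user')
driver.find_element_by_id('password').send_keys('my_password')
# Find the submit button and trigger a click event
driver.find_element_by_id('loginBtn').click()
# Give the page a few seconds to load after the login
sleep(5)
# Now the logged-in page source can be parsed
soup = bs(driver.page_source, 'lxml')
price = soup.find('span', attrs={'id': 'calculatedMinRate'})
print(price.text if price else 'element not found')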

Why is HTML returned by requests different from the real page HTML?

I'm trying to scrape a webpage to get some data to work with. One of the web pages I want to scrape is https://www.etoro.com/people/sparkliang/portfolio. The problem comes when I scrape the web page using:
import requests
h=requests.get('https://www.etoro.com/people/sparkliang/portfolio')
h.content
This gives me HTML that is completely different from the original, for example with a lot of added meta tags and without the text or elements I am searching for.
For example imagine I want to scrape:
<p ng-if=":: item.IsStock" class="i-portfolio-table-hat-fullname ng-binding ng-scope">Shopify Inc.</p>
I use a command like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.etoro.com/people/sparkliang/portfolio').text
print(html_text)
soup = BeautifulSoup(html_text,'lxml')
job = soup.find('p', class_='i-portfolio-table-hat-fullname ng-binding ng-scope').text
This should return Shopify Inc.
But it doesn't, because the HTML I load from the web page with the requests library is completely different from the original.
I want to know how to get the original HTML code from the web page.
If you use Ctrl-F to search for a keyword like Shopify Inc, it won't even be in the code I get from the requests library.
It happens because the page uses JavaScript to create the DOM elements dynamically, so you won't be able to accomplish it using requests. Instead you should use Selenium with a webdriver and wait for the elements to be created before scraping.
You can try downloading the ChromeDriver executable here. If you paste it in the same folder as your script, you can run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920x1080")
chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe" # CHANGE THIS IF NOT SAME FOLDER
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)
url = 'https://www.etoro.com/people/sparkliang/portfolio'
driver.get(url)
html_text = driver.page_source
jobs = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.i-portfolio-table-hat-fullname'))
)
for job in jobs:
    print(job.text)
Here we use Selenium with WebDriverWait and EC to ensure that all the elements will exist when we try to scrape the info we're looking for.
Outputs
Facebook
Apple
Walt Disney
Alibaba
JD.com
Mastercard
...

I need to find a way to make my code give time for the page to load, and only then grab the HTML code

So I wanted to grab a real-time value from a website displaying "the real-time evolution of the earth's population", except when I run the code:
import requests
import urllib.request
from bs4 import BeautifulSoup
url = 'https://www.theworldcounts.com/counters/shocking_environmental_facts_and_statistics/world_population_clock_live'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
population = soup.findAll('p', attrs={'class':'counter'})
print(population[0])
my output is:
<p class="counter" id="counters_number_interveal_5">loading...</p>
The number I am looking for is replaced by "loading...", so I am trying to find a way to actually get the value, or an alternative to get the same result.
You can wait for the page to load explicitly using time.sleep(), which will probably get the end-result you want. However, this isn't best practice and could end up waiting longer than the page needed to load.
I would recommend using Selenium instead, which has a multitude of useful features related to this; specifically it can implicitly wait.
The following is how you could use Selenium to wait until the counter is loaded, and not wait any longer.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
url = 'https://www.theworldcounts.com/counters/shocking_environmental_facts_and_statistics/world_population_clock_live'
driver = webdriver.Firefox()
driver.get(url)
try:
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.XPATH, "//*[@id='counters_number_interval_5']"))
    )
    counter = driver.find_element_by_xpath("//*[@id='counters_number_interval_5']").text
    print(counter)
except TimeoutException:
    print("Timed out, couldn't load the page in time")
driver.quit()
You will need to install Selenium, but it's like installing BeautifulSoup - just use pip install selenium
The website is still loading; perhaps use the time module to make the script wait for the value to appear.
import time
time.sleep(5)
#Wait 5 seconds for the answer
This should be added between the requests.get and the parsing with BeautifulSoup.
EDIT
Rereading your question, the problem is actually in the usage of requests: since it downloads the HTML immediately, you need to add the timeout argument for the proper loading of the HTML:
response = requests.get(url, timeout = 5)
It's because you are targeting the wrong element. You can find the desired result in the second element with the same class name, counter. Try either of the two approaches below; one is commented out and the other is active. They both produce the desired result.
import requests
from bs4 import BeautifulSoup
url = 'https://www.theworldcounts.com/counters/shocking_environmental_facts_and_statistics/world_population_clock_live'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
population = soup.find(class_='item-content').find(class_='counter').text
# population = soup.select_one('.item-content > p.counter').text
print(population)

Selenium does not get elements loaded later

I have been trying to make a Python script that will log into my router's page, log all the connected devices, and then display them by their connected names.
I have made a script that looks like this:
from requests import session
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
url = "http://192.168.1.1/login/login.html"
browser = webdriver.Chrome()
wait = WebDriverWait(browser, 100)
browser.get(url)
kad = browser.find_element_by_id("AuthName")
password = browser.find_element_by_id("AuthPassword")
kad.send_keys("MyRouterLoginName")
password.send_keys("MyRouterLoginPassword")
buton = browser.find_element_by_xpath("/html/body/div[2]/div[2]/div[2]/div/ul/li/div[3]/form/fieldset/ul/li[6]/input") #this is my login button
buton.click()
homepage = "http://192.168.1.1/index.html"
browser.get(homepage) # The router asks to change the default password; I skip it like that
sleep(5)
verify = browser.find_element_by_css_selector('body')
print(verify.text) #see my later explanation
xpathmethod = browser.find_element_by_xpath("/html/body/div[3]/div/div/div/div[3]/div/table/tbody/tr/td[3]/div/ul[1]/li[1]/div[2]/a")
print(xpathmethod.text)
print("Finding by css")
content = browser.find_element_by_css_selector('.addmenu')
print(content.text)
The verify line was there to make sure the webpage was fully loaded, but here is the problem: while the webpage loads, it first loads the default menu items (such as connection status, networking settings, troubleshooting, etc.) and only then loads the devices that are currently connected. The webdriver somehow does not recognize the connected-devices section and gives an "unable to locate element" error.
I have tried the XPath and CSS selector methods, but both give me the same result.
Sorry, I can't paste the HTML in full, but here is the path that Chrome gives me when I inspect the element:
html body div.index div div #mainframe html body div div #contentPanel #mapView div div table tbody tr td div #wlInfo li div a
You need something like this:
try:
    # Load page
    browser.get("http://192.168.1.1/index.html")
    # Wait up to 10 seconds until the element can be located on the page
    # For example, <div class="example"> -- first div with class "example"
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.XPATH, ".//div[@class='example'][1]"))
    )
except Exception as e:
    # Catch an exception in case of element unavailability
    print("Page load error: {}".format(e))
I found another way to get the connected-devices list.
The modem has a ConnectionStatus page that gives me the full list, including MAC addresses and other details, in a single string.
Now I need to parse them; I will create another question about that.
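If the string contains standard colon-separated MAC addresses, a minimal parsing sketch could look like this (the raw_status string is a hypothetical example of such a dump):
import re
# Hypothetical example of the single status string returned by the modem
raw_status = "PC-1 aa:bb:cc:dd:ee:01 192.168.1.10; Phone aa:bb:cc:dd:ee:02 192.168.1.11"
# Match standard six-octet MAC addresses separated by colons or dashes
mac_pattern = re.compile(r'(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}')
for mac in mac_pattern.findall(raw_status):
    print(mac)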

Unable to Identify Webpage in BeautifulSoup by URL

I am using Python and Selenium to attempt to scrape all of the links from the results page of a certain search page.
No matter what I search for in the previous screen, the URL for any search on the results page is: "https://chem.nlm.nih.gov/chemidplus/ProxyServlet"
If I use Selenium to autosearch, then try to read this URL into BeautifulSoup, I get HTTPError: HTTP Error 404: Not Found
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
# create a new Firefox session
driver = webdriver.Firefox()
# wait 3 seconds for the page to load
driver.implicitly_wait(3)
# navigate to ChemIDPlus Website
driver.get("https://chem.nlm.nih.gov/chemidplus/")
#implicit wait 10 seconds for drop-down menu to load
driver.implicitly_wait(10)
#open drop-down menu QV7 ("Route:")
select=Select(driver.find_element_by_name("QV7"))
#select "inhalation" in QV7
select.select_by_visible_text("inhalation")
#identify submit button
search="/html/body/div[2]/div/div[2]/div/div[2]/form/div[1]/div/span/button[1]"
#click submit button
driver.find_element_by_xpath(search).click()
#increase the number of results per page
select=Select(driver.find_element_by_id("selRowsPerPage"))
select.select_by_visible_text("25")
#wait 3 seconds
driver.implicitly_wait(3)
#identify current search page...HERE IS THE ERROR, I THINK
url1="https://chem.nlm.nih.gov/chemidplus/ProxyServlet"
page1=urlopen(url1)
#read the search page
soup=BeautifulSoup(page1.content, 'html.parser')
I suspect this has something to do with the proxy server and that Python is not receiving the necessary info to identify the website, but I'm not sure how to work around this.
Thanks in advance!
I used Selenium to identify the new URL as a work-around for identifying the proper search page:
url1=driver.current_url
Next, I used requests to get the content and feed it into BeautifulSoup.
All together, I added:
#Added to the top of the script
import requests
...
#identify the current search page with Selenium
url1=driver.current_url
#scrape the content of the results page
r = requests.get(url1)
soup=BeautifulSoup(r.content, 'html.parser')
...
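An alternative, consistent with the answers above, is to skip the second HTTP request entirely and parse the page source that the Selenium browser has already rendered:
#parse the results page that the browser is already showing
soup = BeautifulSoup(driver.page_source, 'html.parser')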
