How do I go about scraping some data from chrome browser? - python

The webpage I am trying to scrape can only be seen after login so using a direct url won't work. I need to scrape data while I am logged in using my chrome browser.
Then I need to get the value of the the element from
I have tried using the following code.
import requests
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
lastdatadate=[]
lastprocesseddate=[]
source = requests.get('webpage.com').text
content = driver.page_source
soup = bs(content, 'lxml')
#print(soup.prettify())
price = soup.find('span', attrs={'id':'calculatedMinRate'})
print(price.text)

You could still perform a login on the opened webdriver and fill in the input fields, as explained here: How to locate and insert a value in a text box (input) using Python Selenium?
Steps:
Fill in the input fields
Find the submit button and trigger a click event
Afterwards add a sleep command, few seconds should be enough
Afterwards you should be able to get the data.

Related

BeautifulSoup sports scraper gives back empty list

I am trying to scrape the results of tennis matches from this website using Python's BeautifulSoup. I have tried a lot of things but I always get back an empty list. Is there an obvious mistake I am making? There are multiple instances of this class on the website when I inspect it, but it does not seem to find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.flashscore.com/tennis/atp-singles/french-open/results/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
match_container = soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine')
print(match_container)
Results table is loaded via javascript and BeautifulSoup does not find it, because it's not loaded yet at the moment of parsing. To solve this problem you'll need to use selenium. Here is link for chromedriver.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('<PATH_TO_CHROMEDRIVER>',chrome_options=chrome_options)
# load page via selenium
wd.get("https://www.flashscore.com/tennis/atp-singles/french-open/results/")
# wait 5 seconds until results table will be loaded
table = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.ID, 'live-table')))
# parse content of the grid
soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
# access grid cells, your logic should be here
for tag in soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine'):
print(tag)
The score data is pulled into the page dynamically, and you're only getting the initial HTML with requests.
As user70 suggested in the comments, the way to do this is to use a tool like Selenium first so you get all the dynamic content you see in your web browser's inspection tool.
There are few guides online showing how this works - you could start with this one maybe:
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25

Access all href-links in a deep-class hierarchy

I am trying to access all href-links from a website, the search-results to be precise. My first intention is to get all the links, and then to look further on it. The problem is --> I get some links from the website, but not the links of the search-results. Here is one version of my code.
from selenium import webdriver
from htmldom import htmldom
dom = htmldom.HtmlDom("myWebsite")
dom = dom.createDom()
p_links = dom.find("a")
for link in p_links:
print("URL: " +link.attr("href"))
Here is screen of the HTML of that particular website. In the screen, I marked the href-link I try to access in the future. I am open for any help given, be it in Selenium, htmldom, b4soup, etc.
The data you are after, is loaded with AJAX requests. So, you can't scrape them directly after getting the page source. But, the AJAX request is sent to this URL:
https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20
which returns the data in JSON format. You can use requests module to scrape this data.
import requests
BASE_URL = 'https://open.nrw/dataset/'
r = requests.get('https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20')
data = r.json()
for item in data['response']['docs']:
print(BASE_URL + item['name'])
Output:
https://open.nrw/dataset/mags-90-10-dezilsverhaeltnis-der-aequivalenzeinkommen-1512029759099
https://open.nrw/dataset/alkis-nutzungsarten-pro-baublock-wuppertal-w
https://open.nrw/dataset/allgemein-bildende-schulen-am-1510-nach-schulformen-schulen-schueler-und-lehrerbestand-w
https://open.nrw/dataset/altersgruppen-in-meerbusch-gesamt-meerb
https://open.nrw/dataset/amtliche-stadtkarte-wuppertal-raster-w
https://open.nrw/dataset/mais-anteil-abhaengig-erwerbstaetiger-mit-geringfuegiger-beschaeftigung-1477312040433
https://open.nrw/dataset/mags-anteil-der-stillen-reserve-nach-geschlecht-und-altersgruppen-1512033735012
https://open.nrw/dataset/mags-anteil-der-vermoegenslosen-in-nrw-nach-beruflicher-stellung-1512032087083
https://open.nrw/dataset/anzahl-kinderspielplatze-meerb
https://open.nrw/dataset/anzahl-der-sitzungen-von-rat-und-ausschussen-meerb
https://open.nrw/dataset/anzahl-medizinischer-anwendungen-den-oeffentlichen-baedern-duesseldorfs-seit-2006-d
https://open.nrw/dataset/arbeitslose-den-wohnquartieren-duesseldorf-d
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-rechtskreisen-des-sgb-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-stadtteilen-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sgb-ii-rechtskreis-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sozialversicherungspflichtige-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/verkehrszentrale-arbeitsstellen-in-nordrhein-westfalen-1476688294843
https://open.nrw/dataset/mags-arbeitsvolumen-nach-wirtschaftssektoren-1512025235377
https://open.nrw/dataset/mais-armutsrisikoquoten-nach-geschlecht-und-migrationsstatus-der-personen-1477313317038
As you can see, this returned the first 20 URLs. When you first load the page only 20 items are present. But, if you scroll down, more are loaded. To get more items, you can change the Query String Parameter in the URL. The URL ends with rows=20. You can change this number to get the desired number of results.
Results appear after the initial page load due to the AJAX request.
I managed to get the links with Selenium, however I had to wait for .ckantitle a elements to be loaded (these are the links you want to get).
I should mention that the webdriver will wait for a page to load by
default. It does not wait for loading inside frames or for ajax
requests. It means when you use .get('url'), your browser will wait
until the page is completely loaded and then go to the next command in
the code. But when you are posting an ajax request, webdriver does not
wait and it's your responsibility to wait an appropriate amount of
time for the page or a part of page to load; so there is a module
named expected_conditions.
Code:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
url = 'https://open.nrw/suche'
html = None
browser = webdriver.Chrome()
browser.get(url)
delay = 3 # seconds
try:
WebDriverWait(browser, delay).until(
EC.presence_of_element_located((By.CSS_SELECTOR, '.ckantitle a'))
)
html = browser.page_source
except TimeoutException:
print('Loading took too much time!')
finally:
browser.quit()
if html:
soup = BeautifulSoup(html, 'lxml')
links = soup.select('.ckantitle a')
for link in links:
print(urljoin(url, link['href']))
You need to install selenium:
pip install selenium
and get a driver here.

Fetching table data from webpage using selenium in python

I am very new to web scraping. I have the following url:
https://www.bloomberg.com/markets/symbolsearch
So, I use Selenium to enter the Symbol Textbox and press Find Symbols to get the details. This is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://www.bloomberg.com/markets/symbolsearch/")
element = driver.find_element_by_id("query")
element.send_keys("WMT:US")
driver.find_element_by_name("commit").click()
It returns the table. How can I retrieve that? I am pretty clueless.
Second question,
Can I do this without Selenium as it is slowing down things? Is there a way to find an API which returns a JSON?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
import requests
driver = webdriver.Firefox()
driver.get("https://www.bloomberg.com/markets/symbolsearch/")
element = driver.find_element_by_id("query")
element.send_keys("WMT:US")
driver.find_element_by_name("commit").click()
time.sleep(5)
url = driver.current_url
time.sleep(5)
parsed = requests.get(url)
soup = BeautifulSoup(parsed.content,'html.parser')
a = soup.findAll("table", { "class" : "dual_border_data_table" })
print(a)
here is the total code by which you can get the table you are looking for. now do what you need to do after getting the table. hope it helps

Unable to Identify Webpage in BeautifulSoup by URL

I am using Python and Selenium to attempting to scrape all of the links from the results page of a certain search page.
No matter what I search for in the previous screen, the URL for any search on the results page is: "https://chem.nlm.nih.gov/chemidplus/ProxyServlet"
If I use Selenium to autosearch, then try to read this URL into BeautifulSoup, I get HTTPError: HTTP Error 404: Not Found
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
# create a new Firefox session
driver = webdriver.Firefox()
# wait 3 seconds for the page to load
driver.implicitly_wait(3)
# navigate to ChemIDPlus Website
driver.get("https://chem.nlm.nih.gov/chemidplus/")
#implicit wait 10 seconds for drop-down menu to load
driver.implicitly_wait(10)
#open drop-down menu QV7 ("Route:")
select=Select(driver.find_element_by_name("QV7"))
#select "inhalation" in QV7
select.select_by_visible_text("inhalation")
#identify submit button
search="/html/body/div[2]/div/div[2]/div/div[2]/form/div[1]/div/span/button[1]"
#click submit button
driver.find_element_by_xpath(search).click()
#increase the number of results per page
select=Select(driver.find_element_by_id("selRowsPerPage"))
select.select_by_visible_text("25")
#wait 3 seconds
driver.implicitly_wait(3)
#identify current search page...HERE IS THE ERROR, I THINK
url1="https://chem.nlm.nih.gov/chemidplus/ProxyServlet"
page1=urlopen(url1)
#read the search page
soup=BeautifulSoup(page1.content, 'html.parser')
I suspect this has something to do with the proxyserver and Python is not receiving the necessary info to identify the website, but I'm not sure how to work around this.
Thanks in advance!
I used Selenium to identify the new URL as a work-around for identifying the proper search page:
url1=driver.current_url
Next, I used requests to get the content and feed it into beautifulsoup.
All together, I added:
#Added to the top of the script
import requests
...
#identify the current search page with Selenium
url1=driver.current_url
#scrape the content of the results page
r=requests.get(url)
soup=BeautifulSoup(r.content, 'html.parser')
...

Submitting value in a Google powered form while scraping

I am trying to scrape an online food-ordering website using Mechanize & BS4. The problem I'm facing is that the website has a form that takes location as input powered by Google. When I try filling it using this method:
from bs4 import BeautifulSoup as bs
import requests, lxml, mechanize
url = raw_input("Enter URL: ")
browser = mechanize.Browser()
browser.open(url)
# 'placeSelectionForm' is the name of the input-field
browser.select_form(name='placeSelectionForm')
control1 = browser.form.controls[0]
control1._value = 'Koramangala'
browser.submit()
soup = bs(browser.response().read(), "lxml")
print soup.prettify()
The script works fine for a normal django form that I have made. But the problem here is that the Google powered form is using auto-complete api like this:
So when I type initials of some location, there are auto-complete suggestions and as soon as I select one option the form auto-submits and I'm taken to a new URL.
Now, the problem with the URL of the new page is that no matter what option I chose in the form, the URL remains the same, and the values that come with the response vary accordingly with the option I chose on previous page.
How can I fill this form (powered by Google maps api) using tools like Mechanize or BS4 or any other such?
This is a quite Javascript "heavy" website which you may find difficult to automate with mechanize. Here is how you can do make a search, choose one of the suggestions and wait for results to load with selenium:
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.maximize_window()
driver.get('http://www.swiggy.com/bangalore')
# wait for input to appear and make a search
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "pac-input"))).send_keys("Koramangala")
# wait for suggestions to appear
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.pac-container div.pac-item")))
# choose the first suggestion
suggestions = driver.find_elements_by_css_selector("div.pac-container div.pac-item")
suggestions[0].click()
# wait for results to load
wait.until(EC.visibility_of_element_located((By.ID, "restaurants")))
# TODO: extract results
I've added comments to make things clear. Let me know if you want me to expand on any of the parts of the code.

Categories