I am trying to scrape an online food-ordering website using Mechanize & BS4. The problem I'm facing is that the website has a form that takes location as input powered by Google. When I try filling it using this method:
from bs4 import BeautifulSoup as bs
import requests, lxml, mechanize
url = raw_input("Enter URL: ")
browser = mechanize.Browser()
browser.open(url)
# 'placeSelectionForm' is the name of the input-field
browser.select_form(name='placeSelectionForm')
control1 = browser.form.controls[0]
control1._value = 'Koramangala'
browser.submit()
soup = bs(browser.response().read(), "lxml")
print soup.prettify()
The script works fine for a normal django form that I have made. But the problem here is that the Google powered form is using auto-complete api like this:
So when I type initials of some location, there are auto-complete suggestions and as soon as I select one option the form auto-submits and I'm taken to a new URL.
Now, the problem with the URL of the new page is that no matter what option I chose in the form, the URL remains the same, and the values that come with the response vary accordingly with the option I chose on previous page.
How can I fill this form (powered by Google maps api) using tools like Mechanize or BS4 or any other such?
This is a quite Javascript "heavy" website which you may find difficult to automate with mechanize. Here is how you can do make a search, choose one of the suggestions and wait for results to load with selenium:
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.maximize_window()
driver.get('http://www.swiggy.com/bangalore')
# wait for input to appear and make a search
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "pac-input"))).send_keys("Koramangala")
# wait for suggestions to appear
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.pac-container div.pac-item")))
# choose the first suggestion
suggestions = driver.find_elements_by_css_selector("div.pac-container div.pac-item")
suggestions[0].click()
# wait for results to load
wait.until(EC.visibility_of_element_located((By.ID, "restaurants")))
# TODO: extract results
I've added comments to make things clear. Let me know if you want me to expand on any of the parts of the code.
Related
The webpage I am trying to scrape can only be seen after login so using a direct url won't work. I need to scrape data while I am logged in using my chrome browser.
Then I need to get the value of the the element from
I have tried using the following code.
import requests
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
lastdatadate=[]
lastprocesseddate=[]
source = requests.get('webpage.com').text
content = driver.page_source
soup = bs(content, 'lxml')
#print(soup.prettify())
price = soup.find('span', attrs={'id':'calculatedMinRate'})
print(price.text)
You could still perform a login on the opened webdriver and fill in the input fields, as explained here: How to locate and insert a value in a text box (input) using Python Selenium?
Steps:
Fill in the input fields
Find the submit button and trigger a click event
Afterwards add a sleep command, few seconds should be enough
Afterwards you should be able to get the data.
I am able to get cookies from the website just fine. But I am interested in the cookies which the Chatbot is using for example there are chatbot websites like: <www.kinguin.net> or <www.multibankfx.com> or <coschedule.com>
If we go on to these websites and 'inspect element' them and then see under the cookies for secure.livechat.inc (this is the chatbot) there will be 1 or 2 cookies as shown in the figure below
Here in this image, I am looking into the cookies of the chatbot on the website called <www.kinguin.net> and we can see one cookie there, i.e. "__livechat"
So this cookie is what I want to automate and extract using the selenium.
my following code return all cookies on the website but "_livechat" is missing
import os, sys, json, codecs, subprocess, requests, time, string
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as bs
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Chrome()
host = 'kinguin.net'
driver.get("https://"+host)
cookies = driver.get_cookies()
driver.switch_to.default_content()
cookies = driver.get_cookies()
for item in cookies:
print(item['name'])
taking it further, my following code goes into the iFrame of the chatbot and get cookies but return null
driver.switch_to.default_content()
elementID = driver.find_element_by_id('chat-widget')
driver.switch_to.frame(0)
cookies = driver.get_cookies()
for item in cookies:
print(item['name'])
#ble Thanks alot - the way you suggest is helpful only for this particular website which is not what I want. I am sorry if I could not explain it clearly in my earlier query but I want a generic solution for a large scale website dataset.
For example, if we look at <www.ebanx.com> here the chatbot is different and hence I will search it by
elementID = driver.find_element_by_id('hubspot-messages-iframe-container')
and if I use your code after this driver.switch_to.frame(elementID)
it gives me the error
NoSuchFrameException: Message: no such frame: element is not a frame
With this line of code you found iframe element:
elementID = driver.find_element_by_id('chat-widget')
Use this to switch to that iframe and you will be able to collect cookies with code you wrote
driver.switch_to.frame(elementID)
After you finish, switch to default content with
driver.switch_to.default_content()
There are more iframes on that page. The easiest approach is to find element by using unique identifier, such as 'id' or 'name' and store it in variable e.g. 'elementID'. I suggest renaming it to 'iframe_element' because it is not ID, you just got the element by ID.
Also, avoid search by index (driver.switch_to.frame(0)) if there are not so many iframes on page (https://www.guru99.com/handling-iframes-selenium.html)
I am trying to access all href-links from a website, the search-results to be precise. My first intention is to get all the links, and then to look further on it. The problem is --> I get some links from the website, but not the links of the search-results. Here is one version of my code.
from selenium import webdriver
from htmldom import htmldom
dom = htmldom.HtmlDom("myWebsite")
dom = dom.createDom()
p_links = dom.find("a")
for link in p_links:
print("URL: " +link.attr("href"))
Here is screen of the HTML of that particular website. In the screen, I marked the href-link I try to access in the future. I am open for any help given, be it in Selenium, htmldom, b4soup, etc.
The data you are after, is loaded with AJAX requests. So, you can't scrape them directly after getting the page source. But, the AJAX request is sent to this URL:
https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20
which returns the data in JSON format. You can use requests module to scrape this data.
import requests
BASE_URL = 'https://open.nrw/dataset/'
r = requests.get('https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20')
data = r.json()
for item in data['response']['docs']:
print(BASE_URL + item['name'])
Output:
https://open.nrw/dataset/mags-90-10-dezilsverhaeltnis-der-aequivalenzeinkommen-1512029759099
https://open.nrw/dataset/alkis-nutzungsarten-pro-baublock-wuppertal-w
https://open.nrw/dataset/allgemein-bildende-schulen-am-1510-nach-schulformen-schulen-schueler-und-lehrerbestand-w
https://open.nrw/dataset/altersgruppen-in-meerbusch-gesamt-meerb
https://open.nrw/dataset/amtliche-stadtkarte-wuppertal-raster-w
https://open.nrw/dataset/mais-anteil-abhaengig-erwerbstaetiger-mit-geringfuegiger-beschaeftigung-1477312040433
https://open.nrw/dataset/mags-anteil-der-stillen-reserve-nach-geschlecht-und-altersgruppen-1512033735012
https://open.nrw/dataset/mags-anteil-der-vermoegenslosen-in-nrw-nach-beruflicher-stellung-1512032087083
https://open.nrw/dataset/anzahl-kinderspielplatze-meerb
https://open.nrw/dataset/anzahl-der-sitzungen-von-rat-und-ausschussen-meerb
https://open.nrw/dataset/anzahl-medizinischer-anwendungen-den-oeffentlichen-baedern-duesseldorfs-seit-2006-d
https://open.nrw/dataset/arbeitslose-den-wohnquartieren-duesseldorf-d
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-rechtskreisen-des-sgb-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-stadtteilen-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sgb-ii-rechtskreis-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sozialversicherungspflichtige-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/verkehrszentrale-arbeitsstellen-in-nordrhein-westfalen-1476688294843
https://open.nrw/dataset/mags-arbeitsvolumen-nach-wirtschaftssektoren-1512025235377
https://open.nrw/dataset/mais-armutsrisikoquoten-nach-geschlecht-und-migrationsstatus-der-personen-1477313317038
As you can see, this returned the first 20 URLs. When you first load the page only 20 items are present. But, if you scroll down, more are loaded. To get more items, you can change the Query String Parameter in the URL. The URL ends with rows=20. You can change this number to get the desired number of results.
Results appear after the initial page load due to the AJAX request.
I managed to get the links with Selenium, however I had to wait for .ckantitle a elements to be loaded (these are the links you want to get).
I should mention that the webdriver will wait for a page to load by
default. It does not wait for loading inside frames or for ajax
requests. It means when you use .get('url'), your browser will wait
until the page is completely loaded and then go to the next command in
the code. But when you are posting an ajax request, webdriver does not
wait and it's your responsibility to wait an appropriate amount of
time for the page or a part of page to load; so there is a module
named expected_conditions.
Code:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
url = 'https://open.nrw/suche'
html = None
browser = webdriver.Chrome()
browser.get(url)
delay = 3 # seconds
try:
WebDriverWait(browser, delay).until(
EC.presence_of_element_located((By.CSS_SELECTOR, '.ckantitle a'))
)
html = browser.page_source
except TimeoutException:
print('Loading took too much time!')
finally:
browser.quit()
if html:
soup = BeautifulSoup(html, 'lxml')
links = soup.select('.ckantitle a')
for link in links:
print(urljoin(url, link['href']))
You need to install selenium:
pip install selenium
and get a driver here.
I'm currently working on a research project in which we are trying to collect saved image files from Brazil's Hemeroteca database. I've done web scraping on PHP pages before using C/C++ with HTML forms, but as this is a shared script, I need to switch to python such that everyone in the group can use this tool.
The page which I'm trying to scrape is: http://bndigital.bn.gov.br/hemeroteca-digital/
There are three forms which populate, the first being the newspaper/journal. Upon selecting this, the available times populate, and the final field is the search term. I've inspected the HTML page here and the three IDs of these are respectively: 'PeriodicoCmb1_Input', 'PeriodoCmb1_Input', and 'PesquisaTxt1'.
Some google searches on this topic led me to the Selenium package, and I've put together this sample code to attempt to read the page:
import webbrowser
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
print("Begin...")
browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)
print("Waiting to load page... (Delay 3 seconds)")
time.sleep(3)
print("Searching for elements")
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
print(journal)
print("Set fields, delay 3 seconds between input")
search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"
journal.send_keys(search_journal)
time.sleep(3)
timeRange.send_keys(search_timeRange)
time.sleep(3)
searchTerm.send_keys(search_text)
print("Perform search")
submitButton = button.find_element_by_id("PesquisarBtn1_input")
submitButton.click()
The script runs to the print(journal) statement, where an error is thrown saying the element cannot be found.
Can anyone take a quick sweep of the page in question and make sure I've got the general premise of this script in line correctly, or point me towards some examples to get me running on this problem?
Thanks!
Your DOM elements you are trying to find are located in iframe. So before using find_element_by_id API you should switch to iframe context.
Here is a code how to switch to iframe context:
# add your code
frame_ref = browser.find_elements_by_tag_name("iframe")[0]
iframe = browser.switch_to.frame(frame_ref)
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
# add your code
Here is a link describing switching to iframe context.
I am trying to ask Google to pull up a query's relevant Search Links, in this case I am using Wikipedia, and then parse the urls of the first three via Selenium. So far I have only been able to do the first part, Googling. Here's my code:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0
query = raw_input("What do you wish to search on Wikipedia?\n")
query = " " + query
# Create a new instance of the Firefox driver
driver = webdriver.Firefox()
# go to the google home page
driver.get("https://www.google.com/search?q=site%3Awikipedia.com&ie=utf-8&oe=utf-8")
# the page is ajaxy so the title is originally this:
print driver.title
# find the element that's name attribute is q (the google search box)
inputElement = driver.find_element_by_name("q")
# type in the search
inputElement.send_keys(query)
# submit the form (although google automatically searches now without submitting)
inputElement.submit()
try:
# we have to wait for the page to refresh, the last thing that seems to be updated is the title
# You should see "cheese! - Google Search"
print driver.title
driver.find_element_by_xpath("//h3[contains(text(),'Wikipedia')]").click()
finally:
driver.quit()
I am trying to use the example from Selenium's documentation, so please excuse the comments and, at times, unnecessary code.
The line of code I'm having trouble with is:
driver.find_element_by_xpath("//h3[contains(text(),'Wikipedia')]").click()
What I'm attempting to do is obtain the relevant Wikipedia link, or, more specifically, the link that the H3 'r' path directs to.
Here's a picture of a Google page that I'm describing.
In this instance, I wish to pull the link http://en.wikipedia.com/wiki/salary
Sorry for the wall of text, I'm trying to be as specific as possible. Anyways, thank you for the assistance in advance.
Best Regards!
The problem is that this XPath is not correct - there is an a element that has "Wikipedia" inside the text, not the h3 element. Fix it:
driver.find_element_by_xpath("//a[contains(text(), 'Wikipedia')]").click()
You can even go further and simplify it using:
driver.find_element_by_partial_link_text("Wikipedia").click()