I need to scrape this page (which has a form): http://kllads.kar.nic.in/MLAWise_reports.aspx, with Python preferably (if not Python, then JavaScript). I was looking at libraries like RoboBrowser (which is basically Mechanize + BeautifulSoup) and (maybe) Selenium, but I'm not quite sure how to go about it. From inspecting the element, it seems to be a WebForm that I need to fill in. After filling that in, the webpage generates some data that I need to store. How should I do this?
You can interact with JavaScript web forms relatively easily in Selenium. You may need to install a webdriver first, but beyond that all you need to do is find the form using its XPath and then have Selenium select an option from the dropdown menu using the option's XPath. For the web page provided, that would look something like this:
#import functions from selenium module
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# open chrome browser using webdriver
path_to_chromedriver = '/Users/Michael/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
# open web page using browser
browser.get('http://kllads.kar.nic.in/MLAWise_reports.aspx')
# wait for page to load then find 'Constituency Name' dropdown and select 'Aland (46)'
const_name = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ddlconstname"]')))
browser.find_element_by_xpath('//*[@id="ddlconstname"]/option[2]').click()
# wait for the page to load then find 'Select Status' dropdown and select 'OnGoing'
sel_status = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ddlstatus1"]')))
browser.find_element_by_xpath('//*[@id="ddlstatus1"]/option[2]').click()
# wait for browser to load then click 'Generate Report'
gen_report = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="BtnReport"]')))
browser.find_element_by_xpath('//*[@id="BtnReport"]').click()
Between each interaction, you are just giving the browser some time to load before attempting to click the next element. Once all the forms are filled out, the page will display the data based on the options selected, and you should be able to scrape the table data. I had a few issues when attempting to load data for the first Constituency Name option, but the others seemed to work fine.
You should also be able to loop through all the dropdown options available under each web form to collect all the data, as sketched below.
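For example, here is a minimal sketch of that loop, reusing the element IDs from the code above. The generic 'table tr' row locator is an assumption, so inspect the generated report for the real table markup before relying on it:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

num_options = len(Select(browser.find_element_by_id('ddlconstname')).options)
for i in range(1, num_options):  # index 0 is usually a '--Select--' placeholder
    # re-find the dropdown on every pass: the ASP.NET postback replaces the DOM
    dropdown = WebDriverWait(browser, 20).until(
        EC.element_to_be_clickable((By.ID, 'ddlconstname')))
    Select(dropdown).select_by_index(i)
    WebDriverWait(browser, 20).until(
        EC.element_to_be_clickable((By.ID, 'BtnReport'))).click()
    # the row locator below is a guess -- check the report's actual table id
    for row in browser.find_elements_by_css_selector('table tr'):
        print(row.text)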
Hope that helps!
I am trying to scrape the blue highlighted portion (the hotel's rating) from Google search results, shown in a screenshot that is not reproduced here.
When I use inspect element, it shows: span class="YhemCb". I have tried various soup.find and soup.find_all commands, but everything I have tried has produced no output so far. What command should I use to scrape this part?
Google uses JavaScript to display most of its web elements, so using something like requests and BeautifulSoup is unfortunately not enough.
Instead, use Selenium! It essentially allows you to control a browser using code.
First, you will need to navigate to the Google page you wish to scrape:
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
Then, you have to wait until the review info loads in the browser.
This is done using WebDriverWait: you have to specify an element that needs to appear on the page. The [data-attrid="kc:/local:one line summary"] span CSS selector allows me to select the review info about the hotel.
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
And finally, print the rating:
print(review_element.get_attribute('innerHTML'))
Here's the full code in case you want to play around with it:
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
# navigate to google
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
# wait until the page loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
# print the rating
print(review_element.get_attribute('innerHTML'))
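One note on the headless setup: options.headless = True works on older Selenium releases, but newer ones deprecated and eventually removed that property, so if it raises an error, use options.add_argument('--headless=new') for Chrome instead (the exact cutover depends on your Selenium version).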
Note: Google is notoriously defensive against anyone trying to scrape it. On the first few attempts you might be successful, but eventually you will have to deal with Google's CAPTCHA.
To work around that, I would suggest using a dedicated search engine scraper; something like its quickstart guide should get you started!
Disclaimer: I work at Oxylabs.io
I am trying to download the daily report from the NSE India website using Selenium and Python.
Approach to download the daily report:
Website loads with no data
After X time, the page is loaded with the report information
Once the page is loaded with the report data, "//table[@id='etfTable']" appears
An explicit wait is added in the code, to wait till "//table[@id='etfTable']" loads
Code for explicit wait
element=WebDriverWait(driver,50).until(EC.visibility_of_element_located(By.xpath,"//table[@id='etfTable']"))
Extract the onclick event using xpath
downloadcsv= driver.find_element_by_xpath("//div[@id='esw-etf']/div[2]/div/div[3]/div/ul/li/a")
Trigger the click to download the file
Full code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options =webdriver.ChromeOptions();
prefs={"download.default_directory":"/Volumes/Project/WebScraper/downloadData"};
options.binary_location=r'/Applications/Google Chrome 2.app/Contents/MacOS/Google Chrome'
chrome_driver_binary =r'/usr/local/Caskroom/chromedriver/94.0.4606.61/chromedriver'
options.add_experimental_option("prefs",prefs)
driver =webdriver.Chrome(chrome_driver_binary,options=options)
try:
    #driver.implicity_wait(10)
    driver.get('https://www.nseindia.com/market-data/exchange-traded-funds-etf')
    element =WebDriverWait(driver,50).until(EC.visibility_of_element_located(By.xpath,"//table[@id='etfTable']"))
    downloadcsv= driver.find_element_by_xpath("//div[@id='esw-etf']/div[2]/div/div[3]/div/ul/li/a")
    print(downloadcsv)
    downloadcsv.click()
    time.sleep(5)
    driver.close()
except:
    print("Invalid URL")
The issue I am facing:
The page keeps on loading through Selenium, but when the site is opened without Selenium, the daily report loads fine (screenshots of the normal load and of loading via Selenium omitted). As a result, I am not able to download the daily report.
There are some syntax errors in the program, like the semicolons on a few lines, and the brackets are missing around the locator tuple when finding the element via WebDriverWait.
Try like below and confirm.
You can use JavaScript to click on that element.
driver.get("https://www.nseindia.com/market-data/exchange-traded-funds-etf")
element =WebDriverWait(driver,50).until(EC.visibility_of_element_located((By.XPATH,"//table[@id='etfTable']/tbody/tr[2]")))
downloadcsv= driver.find_element_by_xpath("//img[@title='csv']/parent::a")
print(downloadcsv)
driver.execute_script("arguments[0].click();",downloadcsv)
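Clicking through execute_script bypasses Selenium's visibility and clickability checks, which helps on pages like this where the element is present in the DOM but a normal .click() gets intercepted or refused.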
It's not an issue with your code, it's an issue with the website. When I checked it, most of the time it did not allow me to click on the CSV file. Instead of downloading the CSV file, you can scrape the table.
from time import sleep
from bs4 import BeautifulSoup

# when going direct to the page, deleting cookies is very important, otherwise it will deny access
browser.delete_all_cookies()
browser.get('https://www.nseindia.com/market-data/exchange-traded-funds-etf')
sleep(5)
soup = BeautifulSoup(browser.page_source, 'html.parser')
# scrape the table from the soup
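For instance, here is a minimal sketch of that last step, assuming the table keeps the etfTable id mentioned in the question:
# pull each row of the ETF table out of the parsed page
table = soup.find('table', id='etfTable')  # id taken from the question
if table:
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all(['th', 'td'])]
        if cells:
            print(cells)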
This is in the same project I asked about here.
However, this time I encounter another issue. Basically, I am trying to get the 2 fields, Updated and Published, under the More information toggle link (the HTML to select this toggle is "//a[@class='toggle_info_btn']").
On one page, https://thehive.itch.io/promnesia, I am able to retrieve the 2 fields. But on another page, https://dmullinsgames.itch.io/paper-jekyll, I cannot, even though both have the same HTML.
Here is my code (as suggested by Yosuva A in the previous question):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome('chromedriver') # Optional argument, if not specified will search path.
driver.implicitly_wait(15)
driver.get("https://dmullinsgames.itch.io/paper-jekyll")
driver.find_element(By.XPATH, "//a[@class='toggle_info_btn']").click()
time.sleep(2)
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.XPATH, "//div[@class='game_info_panel_widget']/table//tr//td"))) # Wait for specific element
table_rows = driver.find_elements(By.XPATH, "//div[@class='game_info_panel_widget']/table//tr//td")
for rows in table_rows:
print(rows.text)
driver.quit()
When running this, I see chromedriver open a Chrome window with the page, but I don't see the 2 fields Updated and Published there. Screenshots comparing what chromedriver sees against what is actually on the page are omitted here.
Please let me know what issue this is...
As answered by D.Weltrowski in the comments, some fields on the page are only visible when logged in. Furthermore, the same field can be visible on one page but invisible on another. Therefore, the solution is to have Scrapy log in before crawling, after which it will be able to scrape those data. Information on authenticated crawls is here.
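A minimal sketch of such an authenticated crawl with Scrapy's FormRequest follows; the login URL and the form field names are assumptions for illustration, so inspect itch.io's actual login form before using it:
import scrapy

class ItchSpider(scrapy.Spider):
    name = 'itch'
    # NOTE: login URL and field names below are hypothetical placeholders
    start_urls = ['https://itch.io/login']

    def parse(self, response):
        # fill in the login form found on the page and submit it
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'YOUR_USER', 'password': 'YOUR_PASS'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # now crawl the game page with an authenticated session
        yield response.follow('https://dmullinsgames.itch.io/paper-jekyll',
                              callback=self.parse_game)

    def parse_game(self, response):
        # same table cells the Selenium code targeted
        for cell in response.xpath("//div[@class='game_info_panel_widget']/table//tr//td/text()"):
            yield {'value': cell.get()}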
I am trying to have Selenium download the URLs of a webpage as PDFs in Safari. So far, I have been able to open the URL, but I can't get Safari to download it. All the solutions I found so far were either for another browser, or they didn't work. Ideally I would like it to download all links of one page and then move on to the next page.
At first I thought that clicking on each hyperlink and then downloading it was the way to go. But that would require switching windows each time, so then I tried to find a way to download without having to click, but nothing worked.
I am quite new at programming so I am sure that I am missing something.
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pdfkit
browser = webdriver.Safari()
browser.get(a_base_url)
username = browser.find_element_by_name("tb_LoginName")
password = browser.find_element_by_name("tb_Password")
submit = browser.find_element_by_id("btn_Login")
username.send_keys(username)
password.send_keys(password)
submit.click()
element=WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[#id="maincolumn"]/div/table/tbody/tr[2]/td[9]/a[2]'))).click()
browser.switch_to_window(browser.window_handles[0])
url=browser.current_url
I would go for the following approach (a minimal sketch follows the list):
Get the href attribute of the link you want to download via the WebElement.get_attribute() function
Use the urllib or requests library to retrieve the URL from step 1 without using the browser
Most probably you will also need to get the browser cookies via the WebDriver.get_cookies() function and add them to the Cookie header of your download request
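Here is a minimal sketch of those three steps, assuming the link locator from your question and that the server returns the PDF directly at that href (both assumptions):
import requests

# step 1: pull the href off the link instead of clicking it
link = browser.find_element_by_xpath(
    '//*[@id="maincolumn"]/div/table/tbody/tr[2]/td[9]/a[2]')  # locator from the question
pdf_url = link.get_attribute('href')

# step 3: copy the authenticated session cookies from Selenium into requests
cookies = {c['name']: c['value'] for c in browser.get_cookies()}

# step 2: download the file without the browser
response = requests.get(pdf_url, cookies=cookies)
with open('report.pdf', 'wb') as f:  # output filename is an assumption
    f.write(response.content)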
I am trying to automate logging into a website (http://www.phptravels.net/) using Selenium - Python on Chrome. This is an open website used for automation tutorials.
I am trying to click an element to open a drop-down (My Account button at the top navbar) which will then give me the option to login and redirect me to the login page.
The HTML is nested with many div and ul/li tags.
I have tried various lines of code but haven't been able to make much progress.
driver.find_element_by_id('li_myaccount').click()
driver.find_element_by_link_text(' Login').click()
driver.find_element_by_xpath("//*[@id='li_myaccount']/ul/li[1]/a").click()
These are some of the examples that I tried out. All of them failed with the error "element not visible".
How do I find those elements? Even the xpath function is throwing this error.
I have not added any time wait in my code.
Any ideas how to proceed further?
Hope this code will help:
from selenium import webdriver
from selenium.webdriver.common.by import By
url="http://www.phptravels.net/"
d=webdriver.Chrome()
d.maximize_window()
d.get(url)
d.find_element(By.LINK_TEXT,'MY ACCOUNT').click()
d.find_element(By.LINK_TEXT,'Login').click()
d.find_element(By.NAME,"username").send_keys("Test")
d.find_element(By.NAME,"password").send_keys("Test")
d.find_element(By.XPATH,"//button[text()='Login']").click()
Use the best available locator on your HTML page, so that you need not create XPath or CSS selectors for simple operations.
You may be having issues with the page not being fully loaded when you try to find the element of interest. You should use the WebDriverWait class to wait until a given element is present on the page.
Adapted from the docs:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Set up your driver here....
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'li_myaccount'))
    )
    element.click()
except:
    pass  # Handle any exceptions here
finally:
    driver.quit()