Web Scraping an XML File That's Downloaded After Submitting a Form (Python)

How do you scrape an XML file that automatically downloads to your computer after submitting a form? I haven't seen any examples that handle a file produced from submitted data, and I couldn't find anything in the Python documentation: https://docs.python.org/3/library/xml.etree.elementtree.html. Any suggestions or links would be very much appreciated.
from selenium import webdriver
from selenium.webdriver.support.select import Select

url = 'https://oui.doleta.gov/unemploy/claims.asp'
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe")
driver.implicitly_wait(10)
driver.get(url)
driver.find_element_by_css_selector('input[name="level"][value="state"]').click()
Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
Select(driver.find_element_by_name('enddate')).select_by_value('2022')
driver.find_element_by_css_selector('input[name="filetype"][value="xml"]').click()
select = Select(driver.find_element_by_id('states'))
# Iterate through and select all states
for opt in select.options:
    opt.click()
input('Press ENTER to submit the form')
driver.find_element_by_css_selector('input[name="submit"][value="Submit"]').click()

This Stack Overflow post seems to be in the direction of what I'm looking for: "How to download XML files avoiding the popup 'This type of file may harm your computer' through ChromeDriver and Chrome using Selenium in Python".
The next link helped the most for my needs, so I'll add it here; the first link is very helpful as well, so I want to leave both:
How to control the download of files with Selenium + Python bindings in Chrome
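For reference, here is a minimal sketch of the approach from that last link, assuming a download directory of your own choosing and that the site serves the XML as a regular file download (the folder path, the wait loop, and the XML layout below are illustrative assumptions, not taken from the question):

import glob
import os
import time
import xml.etree.ElementTree as ET
from selenium import webdriver

download_dir = r"C:\Users\me\claims_xml"  # hypothetical download folder

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": download_dir,  # save downloads here
    "download.prompt_for_download": False,       # skip the Save As dialog
    "safebrowsing.enabled": True,                # avoid the "may harm your computer" popup
})
driver = webdriver.Chrome(options=options)

# ... fill in and submit the form as in the script above ...

# Wait up to ~60 seconds for the .xml file to land in the folder.
for _ in range(60):
    files = glob.glob(os.path.join(download_dir, "*.xml"))
    if files:
        break
    time.sleep(1)
else:
    raise TimeoutError("XML download did not appear")

# Parse with ElementTree; the element layout of the report is an assumption.
tree = ET.parse(files[0])
for record in tree.getroot():
    print([child.text for child in record])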

Related

Download file from linked HTML ref, use in Selenium python script

I am trying to automate downloading updated versions of VS Code Marketplace extensions. I have a Selenium Python script that takes a list of extension hosting pages and names, navigates to each extension's page, clicks the Version History tab, and clicks the top (most recent) download link. I change the driver's Chrome options to point Chrome's default download directory at a folder created under that extension's name (e.g. the download process from the Marketplace).
This all works, but it is extremely time consuming, because a new window has to be opened for each extension: the driver settings must be reset to change the Chrome download location. Furthermore, the Selenium guidance recommends against download clicks, suggesting instead that you capture the URL and hand it to an HTTP request library.
To solve this, I am trying to use urllib to download from an HTTP link to a specified path. That would let me avoid resetting the driver settings on every iteration, so I could run the driver in a single window and just open new tabs, saving time overall. (urllib documentation)
However, when I inspect the download button on an extension, the only link I can find is the href link which has a format like:
https://marketplace.visualstudio.com/_apis/public/gallery/publishers/grimmer/vsextensions/vscode-back-forward-button/0.1.6/vspackage (raw html)
In examples in the documentation the links have a format like:
https://www.facebook.com/favicon.ico
with the filename on the end.
I have tried multiple functions from urllib to download from that href link, but none of them seem to recognize it, so I'm not sure whether there's a way to get a link that matches the format from the documentation, or some other solution.
Also, urllib seems to require the file name (i.e. extensionversionnumber.vsix) at the end of the path to download to a specified location, but I can't seem to pull the file name from the HTML either.
import os
import pandas as pd
import urllib.request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

inputLocation = input("Enter csv file path: ")
fileLocation = os.path.abspath(inputLocation)
inputPath = input("Enter path to where packages will be stored: ")
workingPath = os.path.abspath(inputPath)
df = pd.read_csv(fileLocation)
hostingPages = df['Hosting Page'].tolist()
packageNames = df['Package Name'].tolist()
chrome_options = webdriver.ChromeOptions()

def downloadExtension(url, folderName):
    os.chdir(workingPath)
    if not os.path.exists(folderName):
        os.makedirs(folderName)
    filepath = os.path.join(workingPath, folderName)
    chrome_options.add_experimental_option("prefs", {
        "download.default_directory": filepath,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True
    })
    driver = webdriver.Chrome(options=chrome_options)
    wait = WebDriverWait(driver, 20)
    driver.get(url)
    wait.until(lambda d: d.find_element(By.ID, "versionHistory"))
    driver.find_element(By.ID, "versionHistory").click()
    wait.until(lambda d: d.find_element(By.LINK_TEXT, "Download"))
    #### attempt to use urllib to download by html request rather than click ####
    link = driver.find_element(By.LINK_TEXT, "Download").get_attribute('href')
    urllib.request.urlretrieve(link, filepath)
    #### above line does not work ####
    driver.quit()

for i in range(len(hostingPages)):
    downloadExtension(hostingPages[i], packageNames[i])
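For what it's worth, one likely reason the marked line fails is that urllib.request.urlretrieve() expects a full destination file path, while filepath above is a directory. A hedged sketch of a fix, deriving the file name from the vspackage URL segments (the publisher/name/version layout is an assumption read off the sample URL above, not documented Marketplace behavior):

import os
import urllib.request

def download_vsix(link, dest_dir):
    # The vspackage URL looks like:
    #   .../publishers/<publisher>/vsextensions/<name>/<version>/vspackage
    # so a file name can be assembled from those segments (assumed layout).
    parts = link.rstrip("/").split("/")
    publisher, name, version = parts[-5], parts[-3], parts[-2]
    target = os.path.join(dest_dir, f"{publisher}.{name}-{version}.vsix")
    # urlretrieve needs the destination file, not just the folder.
    urllib.request.urlretrieve(link, target)
    return target

If the gallery rejects the bare request or returns a compressed payload, the usual next step is requests with a browser-like User-Agent header, plus the driver's cookies as described in the answer further down.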

Use Selenium to download multiple URLs as PDFs

I am trying to have Selenium download the URLs of a webpage as PDFs on Safari. So far I have been able to open the URL, but I can't get Safari to download it. All the solutions I've found so far were either for another browser or didn't work. Ideally, I would like it to download all the links on one page and then move on to the next page.
At first I thought that clicking on each hyperlink and then downloading it was the way to go. But that would require switching windows each time, so then I tried to find a way to download it without having to click on it, but nothing worked.
I am quite new at programming so I am sure that I am missing something.
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pdfkit
browser = webdriver.Safari()
browser.get(a_base_url)
username_field = browser.find_element_by_name("tb_LoginName")
password_field = browser.find_element_by_name("tb_Password")
submit = browser.find_element_by_id("btn_Login")
username_field.send_keys(username)  # username/password strings assumed defined elsewhere
password_field.send_keys(password)
submit.click()
element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="maincolumn"]/div/table/tbody/tr[2]/td[9]/a[2]'))).click()
browser.switch_to_window(browser.window_handles[0])
url = browser.current_url
I would go for the following approach:
Get the href attribute of the link you want to download via the WebElement.get_attribute() function
Use the urllib or requests library to retrieve the URL from step 1 without going through the browser
Most probably you will also need to get the browser cookies via the WebDriver.get_cookies() function and add them to the Cookie header of your download request
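A minimal sketch of those three steps, assuming the link text is literally "Download" and an output name of report.pdf (both are assumptions):

import requests

# Step 1: grab the href from the element.
link = browser.find_element_by_link_text("Download").get_attribute("href")

# Step 3: copy Selenium's session cookies into a requests session so the
# download request is authenticated the same way the browser is.
session = requests.Session()
for cookie in browser.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])

# Step 2: fetch the URL outside the browser and write it to disk.
response = session.get(link)
response.raise_for_status()
with open("report.pdf", "wb") as f:
    f.write(response.content)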

Post data when opening a browser in Python

Is there a way to open a browser from a Python application, while posting data to it in POST method?
I tried using webbrowser.open(url) to open the browser window, but I also need to post some data to it (via POST variables) to make sure the page opens with the proper information.
As an example, I would like to open http://duckduckgo.com, while posting "q=mysearchterm" as POST data to it, so the page will open with pre-filled data.
I believe you should be using a Web Driver such as Selenium. Here is an example:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome(r"PATHTO\chromedriver.exe") # download the chromedriver and provide the full path here
browser.get('http://duckduckgo.com/')
search = browser.find_element_by_name('q')
search.send_keys("mysearchterm")
search.send_keys(Keys.RETURN)
time.sleep(15)
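If an actual POST request (rather than driving the browser) is the hard requirement, another hedged workaround is to have webbrowser open a temporary, self-submitting HTML form; whether DuckDuckGo's /html/ endpoint accepts a POST of q like this is an assumption:

import tempfile
import webbrowser

# webbrowser.open() can only issue GETs, but the opened page can
# immediately re-submit itself as a POST via an auto-submitting form.
form_html = """<html><body onload="document.forms[0].submit()">
  <form action="https://duckduckgo.com/html/" method="post">
    <input type="hidden" name="q" value="mysearchterm">
  </form>
</body></html>"""

with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
    f.write(form_html)
    path = f.name

webbrowser.open("file://" + path)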

Using Python to Scrape a JS Form

I'm currently working on a research project in which we are trying to collect saved image files from Brazil's Hemeroteca database. I've done web scraping on PHP pages before using C/C++ with HTML forms, but since this is a shared script, I need to switch to Python so that everyone in the group can use this tool.
The page which I'm trying to scrape is: http://bndigital.bn.gov.br/hemeroteca-digital/
There are three form fields which populate, the first being the newspaper/journal. Upon selecting this, the available time periods populate, and the final field is the search term. I've inspected the HTML of the page, and the three IDs are, respectively: 'PeriodicoCmb1_Input', 'PeriodoCmb1_Input', and 'PesquisaTxt1'.
Some google searches on this topic led me to the Selenium package, and I've put together this sample code to attempt to read the page:
import time
from selenium import webdriver
print("Begin...")
browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)
print("Waiting to load page... (Delay 3 seconds)")
time.sleep(3)
print("Searching for elements")
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
print(journal)
print("Set fields, delay 3 seconds between input")
search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"
journal.send_keys(search_journal)
time.sleep(3)
timeRange.send_keys(search_timeRange)
time.sleep(3)
searchTerm.send_keys(search_text)
print("Perform search")
submitButton = browser.find_element_by_id("PesquisarBtn1_input")
submitButton.click()
The script runs to the print(journal) statement, where an error is thrown saying the element cannot be found.
Can anyone take a quick sweep of the page in question and make sure I've got the general premise of this script in line correctly, or point me towards some examples to get me running on this problem?
Thanks!
The DOM elements you are trying to find are located inside an iframe, so before using the find_element_by_id API you need to switch into the iframe's context.
Here is code showing how to switch to the iframe context:
# add your code
frame_ref = browser.find_elements_by_tag_name("iframe")[0]
browser.switch_to.frame(frame_ref)
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
# add your code
Here is a link describing how to switch to an iframe context.
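And once you are done inside the frame, you would switch back before touching anything in the top-level document:

# Return to the main document after working inside the iframe.
browser.switch_to.default_content()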

Why is it required to open the Google home page in Chrome before entering a search with Selenium

I have a doubt about my code. Why is it required to open the Google web page to search for something in Selenium? Why can't we search directly from the Chrome browser? Below is my code:
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
chrome_options=webdriver.ChromeOptions()
chrome_options.add_argument("--disable-infobars")
browser=webdriver.Chrome('C:/Users/chromedriver.exe',chrome_options=chrome_options)
page=browser.get('http://www.google.com')
search=browser.find_element_by_name('q')
search.send_keys('aditya')
search.send_keys(Keys.RETURN)
time.sleep(5)
Why is the line page=browser.get('http://www.google.com') required? I am using browser to send the keys anyway, so why is page needed?
When you first instantiate a new browser driver, it is just a blank browser. You need to navigate to a URL before you can interact with any objects on that page. In your case, you are trying to do a google search, so you must navigate to the page first before being able to find the search element.
You shouldn't have to assign it to the page variable, though. You should be able to just write:
browser.get('http://www.google.com')
