I am trying to download the PDF grant reports from a state's department of education public database here: https://mdoe.state.mi.us/cms/grantauditorreport.aspx
I'd like to write a Python script that downloads all of the reports as PDFs for a particular range of dates. There is no option in the page's interface to download reports for all recipients (schools) at once, so I'm hoping a Python script could loop through all the available selections and download each report individually.
I am very new to Python and have tried to follow resources posted here for people asking similar things, but I have been unsuccessful, so I do not have any starter code. If someone could give me a start on this, I would greatly appreciate it.
Thanks!
I would recommend Selenium; it can be used for web scraping in Python.
You will need to install Selenium using the instructions provided in the above link. You will also need to install pyautogui (pip install pyautogui should work).
Note that I ran into many issues getting Selenium to work in Internet Explorer; if you have problems, check out here and here. Because of these issues I had to add a number of capabilities and define the location of the IEDriverServer when initializing the Selenium webdriver; you will need to change these to match your system. I had originally hoped to use the Chrome or Firefox browsers, but the website being scraped only generated the report in Internet Explorer. As has been noted on other Stack Exchange boards, Selenium executes commands much more slowly in Internet Explorer.
Here is code that works with my system and selenium versions:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
import time
url = "https://mdoe.state.mi.us/cms/grantauditorreport.aspx"
# LOAD WEBPAGE
capabilities = DesiredCapabilities.INTERNETEXPLORER
capabilities['ignoreProtectedModeSettings'] = True
capabilities['ignoreZoomSetting'] = True
capabilities.setdefault("nativeEvents", False)
driver = webdriver.Ie(capabilities=capabilities,executable_path="C:\\Users\\amcmu\\Downloads\\IEDriverServer_Win32_2.42.0\\IEDriverServer.exe") #include path to IEdriver
driver.get(url)
# SEARCH
search_key = 'public school'
search = driver.find_element_by_name("ctl00$cphMain$txtAgency")
search.send_keys(search_key) # this is the search term
driver.find_element_by_css_selector('.ButtonStandard').click()
# DROPDOWN LIST
list_item = 'Alpena Public Schools'
select = Select(driver.find_element_by_id('ctl00_cphMain_ddlAgency'))
# select by visible text
select.select_by_visible_text(list_item)
# If you want to iterate through every item in the list use select_by_value
#select.select_by_value('1')
# DATES
start='4/2/2018'
end='4/9/2021'
start_date = driver.find_element_by_name("ctl00$cphMain$txtBeginDate")
driver.execute_script("arguments[0].value = ''", start_date)
start_date.send_keys(start)
end_date = driver.find_element_by_name("ctl00$cphMain$txtEndDate")
driver.execute_script("arguments[0].value = ''", end_date)
end_date.send_keys(end)
# PRESS ENTER TO GENERATE PDF
driver.find_element_by_name('ctl00$cphMain$btnSearch').click()
time.sleep(30) # wait while the server generates the file
print('I hope I waited long enough to generate the file.')
# SAVE FILE
import pyautogui
pyautogui.hotkey('alt','n')
pyautogui.press('tab')
pyautogui.press('enter')
time.sleep(3)
driver.quit()
quit()
When the file is being generated you need to wait while their server does its thing; this seemed to take a while (on the order of 20 seconds). I added the time.sleep(30) to give it 30 seconds to generate the file. You will need to play with this value, or figure out a way to detect when the file has actually been generated.
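If a fixed time.sleep(30) turns out to be flaky, one alternative is to poll the download folder until a new PDF actually appears. This is just a minimal sketch under the assumption that the report lands in your Downloads folder; adjust the path and timeout for your machine:
import glob
import os
import time
def wait_for_new_pdf(download_dir, already_there, timeout=120):
    # poll the folder until a PDF that wasn't there before shows up, or give up
    deadline = time.time() + timeout
    while time.time() < deadline:
        current = set(glob.glob(os.path.join(download_dir, "*.pdf")))
        new_files = current - already_there
        if new_files:
            return new_files.pop()
        time.sleep(2)
    raise TimeoutError("No new PDF appeared in {}".format(download_dir))
# usage (the download folder path is an assumption; point it at wherever IE saves files)
download_dir = r"C:\Users\amcmu\Downloads"
before = set(glob.glob(os.path.join(download_dir, "*.pdf")))
# ... trigger the download with pyautogui as above ...
# new_pdf = wait_for_new_pdf(download_dir, before)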
I am not sure how you want to iterate through the schools (i.e. do you have a list of schools with their exact names?). If you don't have the list of schools, you might want to use something like this pseudocode:
select = Select(driver.find_element_by_name("ctl00$cphMain$ddlAgency"))
options = select.options
for index in range(len(options)): # range(len(options) - 1) would skip the last school
    # DROPDOWN LIST
    select.select_by_index(index)
    # DATES
    # ... clear and fill the start and end dates as above ...
    # PRESS ENTER TO GENERATE PDF
    # ... click the search button and wait as above ...
    # SAVE FILE
    # ... save the PDF with pyautogui as above ...
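One caveat with the loop above: this is an ASP.NET page that posts back when a report is generated, so element references from a previous iteration can go stale. A safer pattern (a sketch based on that assumption, not something I have verified against this site) is to re-find the dropdown on every pass:
select = Select(driver.find_element_by_name("ctl00$cphMain$ddlAgency"))
num_options = len(select.options)
for index in range(num_options):
    # re-locate the dropdown each pass in case the previous postback invalidated it
    select = Select(driver.find_element_by_name("ctl00$cphMain$ddlAgency"))
    select.select_by_index(index)
    # ... fill the dates, click search, wait, and save the PDF as above ...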
I'm trying to pull some data from an online music database. In particular, I want to pull this data that you can find with CTRL+F -- "isrc":"GB-FFM-19-0853."
view-source:https://www.audionetwork.com/browse/m/track/purple-beat_1008534
I'm using Python and Selenium and have tried to locate the data via things like tag, xpath and id, but nothing seems to be working.
I haven't seen this x:y format before and some searching makes me think it's in a JSON format.
Is there a way to grab that isrc data via Selenium? I'd need the approach to be generic (i.e. work for pages with different isrc values, as each music track has a different one).
My code so far ...
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
import os
# Access AudioNetwork and search for tracks.
path = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(path)
driver.get("https://www.audionetwork.com/track/searchkeyword")
search = driver.find_element(By.ID, "js-keyword")
search.send_keys("ANW3175_001_Purple-Beat.wav")
search.send_keys(Keys.RETURN)
sleep(5)
music_link = driver.find_element(By.CSS_SELECTOR, "a.track__title")
music_link.click()
I know I need to make better waits / probably other issues with the code, but any ideas on how to grab that ISRC number?
You want to extract the entire script as JSON data (which can be read as a dictionary in Python) and look up the "isrc" key.
The following code uses Selenium to extract the script content from the page, parses it as JSON, and prints the "isrc" value to the terminal.
from selenium import webdriver
from selenium.webdriver.common.by import By
import json
driver = webdriver.Chrome()
driver.get("https://www.audionetwork.com/browse/m/track/purple-beat_1008534")
# the first <script> tag in the body holds the page data as a JSON blob
search = driver.find_element(By.XPATH, "/html/body/script[1]")
content = search.get_attribute('innerHTML')
# parse the JSON text into a Python dictionary and walk down to the ISRC value
content_as_dict = json.loads(content)
print(content_as_dict['props']['pageProps']['track']['isrc'])
driver.quit()
Yes, this is JSON format. It's actually JSON wrapped inside an HTML script tag. It's essentially a "key": "value" pair - so the specific thing you outlined ("isrc":"GB-FFM-19-08534") has a key of isrc with a value of GB-FFM-19-08534.
Python has a built-in library for parsing JSON; I think this is what you want - https://www.w3schools.com/python/gloss_python_json_parse.asp. Let me know if that works for you.
If you wanted to find the value of isrc, you could do:
import json
... # your code here
jsonString = json.loads(someValueHere)
isrcValue = jsonString["isrc"]
Replace someValueHere with the JSON string that you're parsing, and that should help. I think isrc is nested, though, so it might not be quite that simple. I don't think you can just do jsonString["track.isrc"] in Python, but I'm not sure... the path you're looking for is props.pageProps.track.isrc. You may have to assign a variable per layer...
jsonString = json.loads(someValueHere)
propsValue = jsonString["props"]
pagePropsValue = propsValue["pageProps"]
trackValue = pagePropsValue["track"]
isrcValue = trackValue["isrc"]
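Since json.loads already gives you nested dictionaries, you can also chain the lookups in a single line instead of assigning a variable per layer:
isrcValue = jsonString["props"]["pageProps"]["track"]["isrc"]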
I have an Excel file with thousands of rows of data. I am trying to find materials that could help me automate the process of pulling data from Excel and filling it in on a website I have. I am only able to find videos and instructions that do it by opening a new browser. Can I do it with a browser that is already open and a page that is already loaded? Anything on VBA or Python would help, thanks.
Your question is pretty broad, but in general terms what I think you're looking for is essentially Python, Pandas and Selenium.
You could install Selenium, use Pandas to read your CSV file, and loop over it so it inputs the data for you. It might take a while, since it imitates human interaction, so to speak, but it should help you.
Edit - Handling Selenium and Pseudo-Code
1. Building blocks
For this pseudo-walkthrough, here's what we are going to use:
BeautifulSoup: Python library used to obtain HTML and XML information
Pandas: Python library for reading and manipulating data. This is what we're using to handle the Excel file.
Selenium: a web browser automation tool that will handle the grunt work for us.
Google Chrome: needs no introduction
2. Installations
2.1. Python Libraries
Run the following commands to install the necessary libraries (the leading ! is only needed if you run them inside a Jupyter notebook; otherwise drop it and run them in your terminal):
!pip install bs4
!pip install pandas
!pip install selenium
2.2. Selenium Driver
Now that you have the libraries installed, let's handle the last part of the setup. We will download a Google Chrome driver, which will be the basis of operation for the Selenium library.
Follow these steps:
Open Google Chrome and click the three dots on the upper right corner
Go to Help > About Google Chrome
Check what your Chrome version is
Access http://chromedriver.chromium.org/downloads and download the driver corresponding to your version
Make sure you save the driver to a folder you'll use in the future
3. Getting stuff done
So, I haven't exactly run this code just yet, so sorry for any syntax mistakes I might've overlooked. However, this is what it's going to look like:
from bs4 import BeautifulSoup
from requests import get
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
# using Pandas to read the csv file
source_information = pd.read_csv('my_file.csv', header=None, skiprows=[0])
# setting the URL for BeautifulSoup to operate in
url = "yourwebform.com"
my_web_form = get(url).content
soup = BeautifulSoup(my_web_form, 'html.parser')
# Setting parameters for selenium to work
path = r'C:/Users/itsme/Desktop/Util/chromedriver.exe' # make sure you insert the path to your driver here!
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(path, options=options)
driver.get(url)
# creating a procedure to fill the form
def fulfill_form(username, user_email):
    # use Chrome Dev Tools to find the names or IDs for the fields in the form
    input_customer_name = driver.find_element_by_name('username')
    input_customer_email = driver.find_element_by_name('email')
    submit = driver.find_element_by_name('Submit')
    # input the values and hold a bit for the next action
    input_customer_name.send_keys(username)
    time.sleep(1)
    input_customer_email.send_keys(user_email)
    time.sleep(1)
    submit.click()
    time.sleep(7)
# creating a list to hold any entries should they result in error
failed_attempts = []
# creating a loop to run the procedure and append failed cases to the list
# (this example assumes column 0 holds the name and column 1 holds the email)
for _, row in source_information.iterrows():
    try:
        fulfill_form(row[0], row[1])
    except:
        failed_attempts.append(row[0])
        pass
if len(failed_attempts) > 0:
    print("{} cases have failed".format(len(failed_attempts)))
print("Procedure concluded")
Let me know if you run into any trouble and we'll try and fix it together!
I used Rodrigo Castilho's solution but had to place all the code that creates and opens the driver inside the function. With the driver outside the function it would iterate through the first row of data and then fail at the second, because there was no longer a connection to the driver. By placing the connection code inside the function, the browser is closed and reopened for each row. I'm happy to explain the other steps to anyone having similar problems.
def fulfill_form(first, last, email, phone, zip):
    # Setting parameters for selenium to work
    path = r'chromedriver' # make sure you insert the path to your driver here!
    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    driver = webdriver.Chrome(path, options=options)
    driver.get(url)
    # use Chrome Dev Tools to find the names or IDs for the fields in the form
    input_first = driver.find_element_by_id('form-first_name')
    input_last = driver.find_element_by_id('form-last_name')
    input_email = driver.find_element_by_id('form-email')
    input_phone = driver.find_element_by_id('form-phone')
    input_zip = driver.find_element_by_id('form-zip_code')
    submit = driver.find_element_by_name('commit')
    # input the values and hold a bit for the next action
    driver.delete_all_cookies()
    input_first.send_keys(first)
    time.sleep(1)
    input_last.send_keys(last)
    time.sleep(1)
    input_email.send_keys(email)
    time.sleep(1)
    input_phone.send_keys(phone)
    time.sleep(1)
    input_zip.send_keys(zip)
    time.sleep(1)
    submit.click()
    time.sleep(1)
    driver.close()
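To drive this once per row of the spreadsheet, you can loop over a Pandas DataFrame and call the function for each record. A minimal sketch, assuming your file has columns named first, last, email, phone and zip (adjust to your real headers):
import pandas as pd
source_information = pd.read_excel('my_file.xlsx')  # or pd.read_csv for a CSV export
for _, row in source_information.iterrows():
    fulfill_form(row['first'], row['last'], row['email'], row['phone'], row['zip'])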
I manage around 150 email accounts for my company, and I wrote a Python script for Selenium WebDriver that automates actions (deleting spam, emptying the trash, ...) one account after the other, several times a day. It is way too slow and crashes all the time. I read that Selenium Grid with Docker on Amazon AWS could do the job, and it seems that the "parallel" option for Selenium WebDriver could too.
What I need to do simultaneously :
1) Login (all the accounts)
2) Perform actions (delete spams, empty the trash,...)
3) Close the Chrome instances
I currently use a for loop to create the same instructions 150 times and store them in lists, which is not optimized at all and makes my computer crash... In a nutshell, I know it's not the way to go, and I would like to have this running simultaneously in parallel.
Here is a shortened version of the code I am using:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
## GET THE USERS' EMAILS
emails = []
pwds = []
with open("users.txt", "r") as users_file: # Open the text file containing emails and pwds
for line in users_file:
email_and_pwd = line.split() # Extract the line of the .txt file as a list
emails.append(email_and_pwd[0]) # Add the email in the emails list
pwds.append(email_and_pwd[1]) # Add the pwd in the pwds list
nbr_users = len(emails) # number of users
## CREATE LISTS OF INSTRUCTIONS THAT WILL BE EXECUTED LATER AS CODE
create_new_driver = []
go_to_email_box = []
fill_username_box = []
fill_password_box = []
# Here I have the same type of lines to create lists that will contain the instructions to click on the Login button, then delete spams, then empty the trash, etc...
## FILL THE LISTS WITH INSTRUCTIONS AS STRINGS
for i in range(nbr_users):
    create_new_driver_element = 'driver' + str(i) + ' = webdriver.Chrome(options = chrome_options)'
    create_new_driver.append(create_new_driver_element)
    # Here I have the same type of lines to create the rest of the instructions to fill the lists
## EXECUTE THE INSTRUCTIONS SAVED IN THE LISTS
for user in range(nbr_users):
    exec(create_new_driver[user])
    # Here I have the same type of lines to execute the code contained in the previously created lists
After browsing the internet for days with no results, any kind of help is appreciated.
Thank you very much!
I tried to make a comment on #Jetro Cao's answer but it was too much to fit in a comment. I'll base my response off his work.
Why is your program slow?
As #Jetro Cao mentioned, you are running everything sequentially rather than in parallel.
Will another tool make this easier?
Yes. If possible, you should use an email administration tool (ex: G Suite for gmail) rather than logging into these accounts individually through python scripts.
Is it safe to store my 150 account credentials in a text file?
No. Unless these are meaningless bot accounts without any sensitive emails.
Code Advice:
Overall your code is off to a great start. A few specific comments:
Avoid using exec
Use threading as recommended by #Jetro Cao
You haven't provided us with a large portion of your code. When using selenium, some tricks can be used to help speed things up. One example is loading the login page directly rather than loading a home page and then "clicking" login.
import threading
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
## GET THE USERS' EMAILS
emails = []
pwds = []
with open("users.txt", "r") as users_file: # Open the text file containing emails and pwds
for line in users_file:
email, pwd= line.split() # Extract the line of the .txt file as a list
emails.append(email) # Add the email in the emails list
pwds.append(pwd) # Add the pwd in the pwds list
nbr_users = len(emails) # number of users
## CREATE THE PER-USER LISTS (the exec-style instruction strings are no longer needed)
drivers = []
go_to_email_box = []
fill_username_box = []
fill_password_box = []
# Here I have the same type of lines to create lists for clicking the Login button, then deleting spams, then emptying the trash, etc...
## OPEN ONE CHROME INSTANCE PER USER, IN PARALLEL
def create_driver():
    # each thread opens its own browser and keeps a reference so it can be used later
    drivers.append(webdriver.Chrome(options=chrome_options))
threads = []
for i in range(nbr_users):
    t = threading.Thread(target=create_driver)
    t.start()
    threads.append(t)
for t in threads:
    t.join()
My hunch is your program is slow because you execute the set of instructions for one user after another. And although I don't know the specifics of how Selenium works, I'd imagine there's quite a bit of network I/O going on, so you end up waiting for each user's instructions to complete before starting the next user's.
As you correctly surmised, what you need is to be able to parallelize them more. Python's threading module can help you do that, in particular what you need is the Thread class.
Using just your create_new_driver as an example, you can try something like the following:
# ...
# your other imports
# ...
import threading
# ...
# setting up your user info and instructions lists
# ...
# create a function that executes the instructions for a single user,
# which will be used as the argument for the target param of the Thread constructor
def manage_single_user(i):
    exec(create_new_driver[i])
    # execute the rest of the instructions for the same user
threads = []
for i in range(nbr_users):
    t = threading.Thread(target=manage_single_user, args=(i,))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
The above will start running the set of instructions for each user right away, so you'll be waiting on the network I/O for all of them simultaneously. As long as executing the set of instructions for a single user is sufficiently fast, you should see a significant improvement in speed.
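One caveat: starting 150 Chrome instances at once can exhaust memory on a single machine, which may be part of why it crashes today. If that happens, a thread pool keeps the parallelism while capping how many browsers run at a time. A sketch reusing the manage_single_user function above; the max_workers value is only an example to tune:
from concurrent.futures import ThreadPoolExecutor
MAX_PARALLEL_BROWSERS = 10  # tune to what your machine can handle
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_BROWSERS) as pool:
    # the with-block waits for every submitted task to finish before continuing
    pool.map(manage_single_user, range(nbr_users))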
Hey, I am a Python scripting novice. I was wondering if someone could help me understand how I could use Python, or any convenient scripting language, to cherry-pick specific data values (just a few arbitrary columns on the Excel sheet) and enter that data into a web browser like Chrome. Just a general idea of how this should work would be extremely helpful, and it would also be great to know whether there is an API available for this.
Any advice is appreciated. Thanks.
Okay. Let's get started.
Getting values out of an excel document
This page is a great place to start as far as reading excel values goes.
Directly from the site:
import pandas as pd
table = pd.read_excel('sales.xlsx',
                      sheetname='Month 1',
                      header=0,
                      index_col=0,
                      parse_cols="A, C, G",
                      convert_float=True)
print(table)
Inputting values into browser
Inputting values into the browser can be done quite easily through Selenium which is a python library for controlling browsers via code.
Also directly from the site:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
How to start your project
Now that you have the building blocks, you can put them together.
Example steps:
Open file, open browser and other initialization stuff you need to do.
Read values from excel document
Send values to browser
Repeat steps 2 & 3 forever
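Putting those steps together might look roughly like the sketch below. It is illustrative only: the file name, sheet name, column letters, and the form field "q" are placeholders to replace with your own, and note that newer pandas versions use sheet_name and usecols instead of the sheetname and parse_cols shown above:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
# Step 1: open the file and the browser
table = pd.read_excel('sales.xlsx', sheet_name='Month 1', usecols='A,C,G')
driver = webdriver.Firefox()
driver.get("http://www.python.org")  # placeholder URL
# Steps 2 and 3: read a value from each row and send it to the browser
for _, row in table.iterrows():
    elem = driver.find_element_by_name("q")  # placeholder form field
    elem.clear()
    elem.send_keys(str(row.iloc[0]))  # first selected column of this row
    elem.send_keys(Keys.RETURN)
# Step 4: the loop above repeats for every row; close the browser when done
driver.close()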
One really newbie question.
I'm working on a small Python script for my home use that will collect data on a specific air ticket.
I want to extract the data from Skyscanner (using BeautifulSoup and urllib). Example:
http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html
And I'm interested in all the data that is stored in this kind of element, especially the price: http://shrani.si/f/1w/An/1caIzEzT/capture.png
Since these values are not located in the HTML, can I still extract them?
I believe the problem is that these values are rendered by JavaScript code which your browser runs and urllib doesn't - you should use a library that can execute JavaScript code.
I just googled crawler python javascript and got some Stack Overflow questions and answers which recommend the use of Selenium or WebKit. You can use those libraries through Scrapy. Here are two snippets:
Rendered/interactive javascript with gtk/webkit/jswebkit
Rendered Javascript Crawler With Scrapy and Selenium RC
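If you just want the gist without pulling in Scrapy, the pattern is roughly the following (a minimal sketch: let Selenium render the page, then hand the resulting HTML to a parser):
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html")
html = driver.page_source  # the HTML after the JavaScript has run
soup = BeautifulSoup(html, 'html.parser')
driver.quit()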
I have been working on this exact same issue. I was introduced to BeautifulSoup and have since learned about Scrapy. BeautifulSoup is very easy to use, especially if you're new at this. Scrapy apparently has more "features", but I believe you can accomplish your needs with BeautifulSoup.
I had the same issue of not being able to access a website that loaded information through JavaScript, and thankfully Selenium was the savior.
A great introduction to Selenium can be found here.
Install: pip install selenium
Below is a simple class I put together. You can save it as a .py file and import it into your project. If you call the method retrieve_source_code(self, domain) with the hyperlink that you are trying to parse, it will return the source code of the fully loaded page, which you can then feed into BeautifulSoup and find the information you're looking for!
Ex:
airfare_url = 'http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html'
scraper = SeleniumWebScraper()
soup = BeautifulSoup(scraper.retrieve_source_code(airfare_url), 'html.parser')
Now you can parse soup like you normally would with Beautifulsoup.
I hope that helps you!
from selenium import webdriver
class SeleniumWebScraper():
    def __init__(self):
        self.source_code = ''
        self.is_page_loaded = 0
        self.driver = webdriver.Firefox()
        self.is_browser_closed = 0
        # To ensure the page has fully loaded we will 'implicitly' wait
        self.driver.implicitly_wait(10) # Seconds
    def close(self):
        self.driver.close()
        self.clear_source_code()
        self.is_page_loaded = 0
        self.is_browser_closed = 1
    def clear_source_code(self):
        self.source_code = ''
        self.is_page_loaded = 0
    def retrieve_source_code(self, domain):
        if self.is_browser_closed:
            self.driver = webdriver.Firefox()
            self.is_browser_closed = 0 # the browser is open again
        # The driver.get method will navigate to a page given by the URL.
        # WebDriver will wait until the page has fully loaded (that is, the "onload" event has fired)
        # before returning control to your test or script.
        # It's worth noting that if your page uses a lot of AJAX on load then
        # WebDriver may not know when it has completely loaded.
        self.driver.get(domain)
        self.is_page_loaded = 1
        self.source_code = self.driver.page_source
        return self.source_code
You don't even need BeautifulSoup to extract the data.
Just do this and your response is converted to a dictionary, which is very easy to handle:
text = json.loads("Your text of the main response content")
You can now print any key-value pair from the dictionary.
Give it a try. It is super easy.
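For completeness, a minimal sketch of the end-to-end idea; the URL is only a placeholder for whatever endpoint actually returns the JSON:
import json
import requests  # assuming the content comes from a plain HTTP request
response = requests.get("https://example.com/api/track")  # placeholder URL
data = json.loads(response.text)  # response.json() is a shortcut for the same thing
print(data)  # a regular Python dict; index into it like data["some_key"] (hypothetical key)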