Scraping a website that requires authentication - python

I know this question might seem quite straightforward, but I have tried every suggestion and none has worked.
I want to build a Python script that checks my school website to see if new grades have been put up. However, I cannot for the life of me figure out how to scrape it.
The website redirects to a different page to log in. I have tried all the scripts and answers I could find, but I am lost.
I use Python 3, and the website is in the format https://blah.schooldomain.state.edu.country/website/grades/summary.aspx
The username section contains the following:
<input class="txt" id="username" name="username" type="text" autocomplete="off" style="cursor: auto;">
The password field is the same, except it contains an onfocus HTML attribute.
Once successfully authenticated, I am automatically redirected to the correct page.
I have tried:
using Python 2's cookielib and Mechanize
using HTTPBasicAuth
passing the information as a dict to requests.get()
trying out many different people's code, including answers I found on this site

You can try with requests:
http://docs.python-requests.org/en/master/
From the website:
import requests
r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
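That example uses HTTP basic auth. If the school site uses a login form instead (which the redirect to a separate login page suggests), a requests.Session that POSTs the form fields and keeps the resulting cookies is closer to what is needed. A minimal sketch: the username field name comes from the question's HTML, while the password field name and login URL are assumptions to check against the real page.

import requests

session = requests.Session()
# Placeholder login URL and password field name; take the real ones from the login page's HTML.
login_url = 'https://blah.schooldomain.state.edu.country/website/login.aspx'
session.post(login_url, data={'username': 'YOUR_USERNAME', 'password': 'YOUR_PASSWORD'})

# The session now carries the authentication cookies, so the grades page can be fetched.
grades = session.get('https://blah.schooldomain.state.edu.country/website/grades/summary.aspx')
print(grades.status_code)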

Maybe you can use the Selenium library.
Here is my code example:
from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def login():
    browser = webdriver.Firefox()
    browser.get("https://www.your_url.com")
    # Edit the XPATH of the login INPUT username
    xpath_username = "//input[@class='username']"
    # Edit the XPATH of the login INPUT password
    xpath_password = "//input[@class='password']"
    # This will write YOUR_USERNAME/YOUR_PASSWORD into the matching inputs (custom function below)
    click_xpath(browser, xpath_username, "YOUR_USERNAME")
    click_xpath(browser, xpath_password, "YOUR_PASSWORD")
    # THEN SCRAPE WHAT YOU NEED

# Here is the custom function.
# If no input is given, it will only click on the element (on a button, for example).
def click_xpath(browser, xpath, input="", time_wait=10):
    try:
        browser.implicitly_wait(time_wait)
        wait = WebDriverWait(browser, time_wait)
        search = wait.until(EC.element_to_be_clickable((By.XPATH, xpath)))
        search.click()
        sleep(1)
        # Write in the element
        if input:
            search.send_keys(str(input) + Keys.RETURN)
        return search
    except Exception:
        # print("ERROR-click_xpath: " + xpath)
        return False

Related

Fetch current birthdays after logging into facebook with python

I have decided to attempt to create a simple web scraper script in Python. As a small challenge I decided to create a script which will be able to log me into Facebook and fetch the current birthdays displayed in the sidebar. I have managed to write a script which is able to log me into my Facebook account; however, I have no idea how to fetch the birthdays displayed.
This is my script.
from selenium import webdriver
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
usr = 'EMAIL'
pwd = 'PASSWORD'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.facebook.com/')
print ("Opened facebook")
sleep(1)
username_box = driver.find_element_by_id('email')
username_box.send_keys(usr)
print ("Email Id entered")
sleep(1)
password_box = driver.find_element_by_id('pass')
password_box.send_keys(pwd)
print ("Password entered")
login_box = driver.find_element_by_id('u_0_b')
login_box.click()
print ("Login Sucessfull")
print ("Fetched needed data")
input('Press anything to quit')
driver.quit()
print("Finished")
This is my first time creating a script of this type. My assumption is that I am supposed to traverse through the children of the "jsc_c_3d" div element until I get to the displayed birthdays. Furthermore, the id of this element changes every time the page is refreshed. Can anyone tell me how this is done, or whether this is the right way to go about solving this problem?
The div for the birthdays, after inspecting elements:
<div class="" id="jsc_c_3d">
<div class="j83agx80 cbu4d94t ew0dbk1b irj2b8pg">
<div class="qzhwtbm6 knvmm38d"><span class="oi732d6d ik7dh3pa d2edcug0 qv66sw1b c1et5uql
a8c37x1j muag1w35 enqfppq2 jq4qci2q a3bd9o3v knj5qynh oo9gr5id hzawbc8m" dir="auto">
<strong>Bobi Mitrevski</strong>
and
<strong>Trajce Tusev</strong> have birthdays today.</span></div></div></div>
You are correct that you would need to traverse through the inner elements of jsc_c_3d to extract the birthdays that you want. However this whole automated web-scraping is a problem if the id value is dynamic, such that it changes on each occasion. In this case, text parsers such as bs4 would do the job.
With the bs4 approach you simply have to extract the relevant div tags from the DOM and then you can parse the data to obtain the required contents.
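A minimal sketch of that bs4 approach, assuming the rendered page source is taken from the Selenium driver and searching by the visible text rather than the dynamic id:

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')

# Find the span whose text mentions birthdays, ignoring the dynamic id values.
birthday_span = next(
    (s for s in soup.find_all('span') if 'birthdays today' in s.get_text()),
    None,
)
if birthday_span:
    names = [strong.get_text() for strong in birthday_span.find_all('strong')]
    print(names)  # e.g. ['Bobi Mitrevski', 'Trajce Tusev']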
More generally, this problem is solvable using the Facebook API, which could be as simple as:
import facebook

token = 'a token'  # token omitted here; this is the same token I use in https://developers.facebook.com/tools/explorer/
graph = facebook.GraphAPI(token)
args = {'fields' : 'birthday,name' }
friends = graph.get_object("me/friends",**args)

Selenium with Python: collecting an email from a form with read only

I am trying to collect email addresses from a form field on a website that has the readonly attribute inside of it.
<input name="email" id="email" type="text" class="form-control" value="example@gmail.com" readonly="">
I want to be able to get the email address (example@gmail.com) but everything I try returns "unable to locate element".
Everything is configured properly as the rest of the script is working fine, which I have left out.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

x = 0
all_volunteers = driver.find_elements_by_xpath('//*[@title="View volunteer record"]')
for volunteer in all_volunteers:
    volunteer.click()
    driver.implicitly_wait(3)
    # email_add = driver.find_element_by_id('emaillabel')
    # email_add = driver.switch_to_frame(driver.find_element_by_name('email'))
    # print(email_add.get_attribute('email'))
    # email_add = driver.find_element_by_css_selector('input id')
    # email_add = driver.find_element_by_xpath('//input[@name="email"]')
    # email_add = driver.find_element_by_tag_name('Email Address')
    email_add = driver.find_element_by_xpath('//*[@id="email"]')
    print(email_add.get_attribute('value'))
    # back button
    driver.execute_script("window.history.go(-1)")
    # increase counter by 1
    x += 1
Everything commented out (followed by #) is what I have tried.
Is anyone able to tell me what I am doing wrong or missing?
I have a debugging approach to locate the element:
In the browser, open the web page containing the email input.
Open the developer tools (F12).
Open the Console tab in the developer tools.
Type $x('//input[@id="email"]') and see if the element is located. This is the browser's native XPath locator.
You can also try document.getElementById('email') in the console.
If the element is still not found, check whether the input sits inside an iframe; if so, you have to switch to that frame before locating the element.
If more than one element is returned, you might have to modify the selector to find a unique element.
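If an iframe does turn out to be involved, here is a minimal Selenium sketch for switching into it and reading the readonly value (the frame locator below is an assumption; use the real frame's id or name):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)
# Switch into the first iframe on the page (adjust the locator to the actual frame).
wait.until(EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, 'iframe')))
email_add = wait.until(EC.presence_of_element_located((By.ID, 'email')))
print(email_add.get_attribute('value'))  # readonly inputs still expose their value attribute
driver.switch_to.default_content()  # switch back out of the frame afterwards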

Python w/ Selenium Gmail Email Send Automating To: Field is Giving Me Trouble

The end goal is to send myself an email if my public ip address changes, as I don't have dynamic dns and have to manually enter the ip addresses myself for my web server. I've done all I possibly can to try and get bash utilities to do the job for me, but CenturyLink is unfortunately out to block me no matter how I configure my outbound mail.
So I've turned to graphical python/selenium web page automation, which will sign into my gmail account for me, click the 'compose' button, then enter in the To:, Subject:, and text segments and hit send. Everything is working except for one small part - the To: field. The html/css is different for this than all the others and no matter how I try to select the field using
driver.find_element_by_class_name()
or
driver.find_element_by_id()
I just can't seem to fill out the field. Bash will give me an error like
:lo cannot be reached by keyboard
or
textarea#:lo.vO is not a valid selector
When I did an inspect element, the element looked like this:
<textarea rows="1" id=":lo" class="vO" name="to" spellcheck="false" autocomplete="false" autocapitalize="off" autocorrect="off" tabindex="1" dir="ltr" aria-label="To" role="combobox" aria-autocomplete="list" style="width: 462px;"></textarea>
My code so far is this (note: it does not include getting the IP info yet, just the Gmail login/manipulation):
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('https://www.gmail.com')
username = driver.find_element_by_id('identifierId')
username.send_keys("EMAIL")
driver.find_elements_by_class_name('RveJvd.snByac')[1].click()
time.sleep(2)  # password not entered in username field
password = driver.find_element_by_class_name('whsOnd.zHQkBf')
password.send_keys("PASSWORD")
driver.find_elements_by_class_name('RveJvd.snByac')[0].click()
# end login, start composing
time.sleep(5)  # wait for sign in
driver.find_element_by_class_name('T-I.J-J5-Ji.T-I-KE.L3').click()
to = driver.find_element_by_class_name('textarea#:lo.vO')  # incorrect
to.send_keys("EMAIL")
subject = driver.find_element_by_id(':l6')
subject.send_keys("IP Address changed")
content = driver.find_element_by_id(':m9')
content.send_keys("Test Test\n")
There seems to be dynamic variation in the element ids across browser sessions. When I composed a mail manually to fetch the XPath, I noted it was //*[@id=":oa"], but while the script ran it was //*[@id=":my"].
To accommodate this I have queried the element with the XPath //textarea[1], as the Recipients section is always the first textarea. This has proved to work consistently across different browser sessions.
Code Snippet
>>> d = webdriver.Chrome()
[14424:7728:0809/135301.805:ERROR:install_util.cc(597)] Unable to read registry value HKLM\SOFTWARE\Policies\Google\Chrome\MachineLevelUserCloudPolicyEnrollmentToken for writing result=2
DevTools listening on ws://127.0.0.1:12582/devtools/browser/31a5ab42-a4d2-46f3-95c6-a0c9ddc129d7
>>> d.get('https://www.gmail.com')
>>> d.find_element_by_xpath(xpath)
<selenium.webdriver.remote.webelement.WebElement (session="6072286733856e53b69af89ea981001c", element="0.42218760484088036-1")>
>>> d.find_element_by_xpath('//textarea[1]').send_keys('cswadhikar@gmail.com')
Have you tried to use the Gmail API?
It's easier, faster, and more efficient than using Selenium.
Here's the quickstart: https://developers.google.com/gmail/api/quickstart/python
(I'm writing an answer because I don't have the reputation to just comment)
You can also use Python's built-in email package:
https://docs.python.org/3/library/email.examples.html
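The email package only builds the message, though; sending it still needs smtplib. A minimal sketch of that standard-library route, assuming an app password for the Gmail account (the SMTP host and port are Gmail's standard values, not taken from this thread):

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg['From'] = 'you@gmail.com'      # placeholder addresses
msg['To'] = 'you@gmail.com'
msg['Subject'] = 'IP Address changed'
msg.set_content('Test Test\n')

with smtplib.SMTP_SSL('smtp.gmail.com', 465) as smtp:
    smtp.login('you@gmail.com', 'APP_PASSWORD')  # placeholder credentials
    smtp.send_message(msg)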
Try this code to send an email with Gmail.
It has To, Subject, and Send button functionality:
import time
from selenium.webdriver.common.by import By

# Assumes `driver` is already signed in to Gmail.
driver.find_element(By.XPATH, '//*[@id=":k2"]/div/div').click()  # Compose button
time.sleep(5)
driver.find_element(By.NAME, 'to').send_keys("Enter the email address of recipients")  # To field in compose
time.sleep(2)
driver.find_element(By.NAME, 'subjectbox').send_keys("This email is sent using selenium")  # Subject field in compose
time.sleep(2)
driver.find_element(By.XPATH, '//*[@id=":p3"]').click()  # click on Send button
time.sleep(5)
driver.close()

How to login to a website and scrape data using python

I want to create a program where I can check my grades using Python, and I have the code to web scrape data, but I do not know how to log into this specific website. The website is https://hac.chicousd.org/LoginParent.aspx?page=Default.aspx and if you need it I can give my username and password. I have tried using requests and urllib and neither works. I appreciate any help given.
Try using MechanicalSoup. It allows you to navigate a website just as you normally would.
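A minimal MechanicalSoup sketch, assuming the username and password sit on a single login form; the field names are taken from the other answers here, while the form selector and post-login URL are assumptions to check against the page's actual HTML:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://hac.chicousd.org/LoginParent.aspx?page=Default.aspx")
browser.select_form('form')  # pick the login <form>; refine the selector if there are several forms
browser["portalAccountUsername"] = "yourstudentemail@school.com"
browser["portalAccountPassword"] = "PASSWORD"
browser.submit_selected()

# The browser now keeps the login cookies, so protected pages can be fetched directly.
grades_page = browser.open("https://hac.chicousd.org/Default.aspx")  # placeholder page
print(grades_page.text[:200])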
As pointed out in the comments, a possibility is to use Selenium, a browser manipulation tool. However, you can also use a requests.Session to send a POST request with a payload of the email, and then a GET request for whatever portal page you wish to view after:
import requests
r = requests.Session()
payload = {'portalAccountUsername': 'yourstudentemail@school.com'}
r.post('https://hac.chicousd.org/LoginParent.aspx?page=Default.aspx', data = payload)
Then, with r instance, you can send a GET request to a page on the portal that is only visible to authenticated users:
data = r.get('https://hac.chicousd.org/some_student_only_page').text
Note that the keys of the payload dictionary must all be valid <input> "name" values from the site's HTML.
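For .aspx logins in particular, the form usually also includes hidden fields such as __VIEWSTATE that need to be posted back. A minimal sketch (not from the answer above) that builds the payload from the login page's own <input> tags; the password field name is an assumption:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
login_url = 'https://hac.chicousd.org/LoginParent.aspx?page=Default.aspx'
soup = BeautifulSoup(session.get(login_url).text, 'html.parser')

# Start from every named <input> on the page so hidden fields keep their values.
payload = {tag['name']: tag.get('value', '') for tag in soup.find_all('input', attrs={'name': True})}
payload['portalAccountUsername'] = 'yourstudentemail@school.com'
payload['portalAccountPassword'] = 'PASSWORD'

session.post(login_url, data=payload)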
As others have said, you can use Selenium. You should also use time to pause the program for a few seconds before entering your password. First install Selenium from your command prompt with pip install selenium, plus a webdriver (for Chrome: pip install chromedriver_installer). Then you can use them in your code.
import selenium
from selenium import webdriver
import time
from time import sleep
Then, you should open the web page with the web driver
browser = webdriver.Chrome('C:\\Users...\\chromedriver.exe')
browser.get('The website address')
The next step is to find the names of the elements on the web page for your username and password, and the XPaths for the buttons:
username = browser.find_element_by_id('portalAccountUsername')
username.send_keys('your email')
next_button = browser.find_element_by_xpath('//*[@id="next"]')
next_button.click()
password = browser.find_element_by_id('portalAccountPassword')
time.sleep(2)
password.send_keys('your password')
sign_in = browser.find_element_by_xpath('//*[@id="LoginButton"]')
sign_in.click()
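Once signed in, the grades can be read from the same browser session. A minimal sketch, with the parsing left generic because the page's element ids are not shown here:

from bs4 import BeautifulSoup

time.sleep(2)  # give the post-login redirect a moment to finish
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup.get_text())  # or pick out specific elements once their ids/classes are known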

Interacting with website forms

I'm trying to connect to a school URL and automate the process with Selenium. Originally I tried using Splinter, but ran into similar problems. I can't seem to interact with the username and password fields. I realized a little way in that it is an iframe that I need to interact with. Currently I have:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://my.oregonstate.edu/webapps/login/")
driver.switch_to.frame('Content') #I tried contentFrame and content as well
loginid = driver.find_elements_by_id('user_id')
loginid.send_keys("***")
passwd = driver.find_elements_by_id('password')
passwd.send_keys("***")
sub = driver.find_elements_by_id('login')
sub.click()
time.sleep(5)
driver.close()
Here is the HTML that I am trying to interact with:
The Website: https://my.oregonstate.edu/webapps/portal/frameset.jsp
The iframe:
<iframe id="contentFrame" style="height: 593px;" name="content" title="Content" src="/webapps/portal/execute/tabs/tabAction?tab_tab_group_id=_1_1" frameborder="0"></iframe>
The forms:
Username:
<input name="user_id" id="user_id" size="25" maxlength="50" type="text">
Password:
<input size="25" name="password" id="password" autocomplete="off" type="password">
It seems that Selenium can locate the elements just fine, but I am unable to input any information into these fields; I got the error 'List object has no attribute'. When I realized it was the iframe, I tried to navigate into that, but it says 'Unable to locate frame: Content'. Is there another iframe that I am missing? Or something obvious? This is my first time here, so sorry if I messed something up with the code linking.
Thanks for the help.
driver.switch_to.frame() takes the frame's id or name; your frame has id = contentFrame and name = content. (The reason they didn't work is probably a different issue; please read through.)
First, please try using either one of them, not Content (which has an upper-case C).
Once you have fixed the issue above, there will be another error in your code.
loginid = driver.find_elements_by_id('user_id')
loginid.send_keys("***")
driver.find_elements_by_id finds all matching elements and returns a list, so you can't call send_keys on it. Please use driver.find_element_by_id('user_id').
Here is the code I tested and confirmed working.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://my.oregonstate.edu/webapps/login/")
driver.switch_to.frame('content') # all lower case to match your actual frame name
loginid = driver.find_element_by_id('user_id')
loginid.send_keys("***")
passwd = driver.find_element_by_id('password')
passwd.send_keys("***")
Regarding the issue in your follow-up comments:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://my.oregonstate.edu/webapps/login/?action=relogin")
loginid = driver.find_element_by_id('user_id')
loginid.send_keys("***")
passwd = driver.find_element_by_id('password')
passwd.send_keys("***")
driver.find_element_by_css_selector('.submit.button-1').click()
