How to call Javascript function in Python (web-crawler)?

HTML Login button :
HTML User_id : <input name="userid" type="text" tabindex="1" class="login_input" value="" onfocus="check_userid_on()" onclick="check_userid_on()" onblur="check_userid_off()">
I attached the picture of user_id, user_pw, and sign-in button page source for better understanding.
https://i.stack.imgur.com/kr8Nf.png
https://i.stack.imgur.com/irZJz.png
In Python, I want to fill in user_id and user_pw and then log in using the login button, which calls a JavaScript function named loginsendit().
So far my code starts like this:
LOGIN_INFO = {
    'userId': 'myidid',
    'userPassword': 'mypassword123'
}
user_id = soup.find('input', {'name': 'userid'})
user_id['value'] = LOGIN_INFO['userId']
user_pw = soup.find('input', {'name': 'userpw'})
user_pw['value'] = LOGIN_INFO['userPassword']
login_req = s.post('url', data='loginSendit()')
print(login_req.status_code)
But it prints 200 even when the username or password is wrong, which means my code is not actually logging me in.
How can I call this loginsendit() JavaScript function from Python?

If you want to execute JavaScript, look into PhantomJS (general tutorial) and these threads: "Running javascript in Selenium using Python" and "Executing Javascript on Selenium/PhantomJS".
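As a rough sketch of what those threads describe: a browser driven by Selenium can fill in the fields and then run the page's own function through execute_script. The helper name login_with_js is hypothetical, the field names userid and userpw come from the question, and the exact spelling of the JavaScript function (loginSendit versus loginsendit) is an assumption to check against the page source.

```python
def login_with_js(driver, user_id, user_pw, js_call="loginSendit();"):
    """Fill the login fields in a real browser, then run the page's login function.

    `driver` is expected to be a Selenium WebDriver. The locator strategy
    string "name" is the value of Selenium's By.NAME constant.
    """
    # Field names taken from the question; adjust them to the real page.
    driver.find_element("name", "userid").send_keys(user_id)
    driver.find_element("name", "userpw").send_keys(user_pw)
    # Runs inside the page, so the site's own JavaScript does the submitting.
    # js_call is an assumption about how the function is spelled.
    driver.execute_script(js_call)
```

With Selenium installed, this would be called with a driver such as webdriver.Firefox() after driver.get() has loaded the login page. Afterwards, inspect the resulting page content rather than a status code, since a failed login can still return 200.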

Related

Web-scraping a password protected website using Ghost.py

I'm trying to get the HTML content of a password protected site using Ghost.py.
The web server I have to access serves the following HTML (I trimmed it to the important parts):
URL: http://192.168.1.60/PAGE.htm
<html>
<head>
<script language="JavaScript">
function DoHash()
{
    var psw = document.getElementById('psw_id');
    var hpsw = document.getElementById('hpsw_id');
    var nonce = hpsw.value;
    hpsw.value = MD5(nonce.concat(psw.value));
    psw.value = '';
    return true;
}
</script>
</head>
<body>
<form action="PAGE.HTM" name="" method="post" onsubmit="DoHash();">
    Access code <input id="psw_id" type="password" maxlength="15" size="20" name="q" value="">
    <br>
    <input type="submit" value="" name="q" class="w_bok">
    <br>
    <input id="hpsw_id" type="hidden" name="pA" value="180864D635AD2347">
</form>
</body>
</html>
The value of "#hpsw_id" changes every time you load the page.
On a normal browser, once you type the correct password and press enter or click the "submit" button, you land on the same page but now with the real contents.
URL: http://192.168.1.60/PAGE.htm
<html>
<head>
<!-- javascript is gone -->
</head>
<body>
Welcome to PAGE.htm content
</body>
</html>
First I tried with mechanize but failed, as I need JavaScript. So now I'm trying to solve it using Ghost.py.
My code so far:
import ghost
g = ghost.Ghost()
with g.start(wait_timeout=20) as session:
    page, extra_resources = session.open("http://192.168.1.60/PAGE.htm")
    if page.http_status == 200:
        print("Good!")
    session.evaluate("document.getElementById('psw_id').value='MySecretPassword';")
    session.evaluate("document.getElementsByClassName('w_bok')[0].click();", expect_loading=True)
    print(session.content)
This code is not loading the contents correctly, in the console I get:
Traceback (most recent call last):
  File "", line 8, in
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 181, in wrapper
    timeout=kwargs.pop('timeout', None))
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 1196, in wait_for_page_loaded
    'Unable to load requested page', timeout)
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 1174, in wait_for
    raise TimeoutError(timeout_message)
ghost.ghost.TimeoutError: Unable to load requested page
Two questions...
1) How can I successfully log in to the password-protected site and get the real content of PAGE.htm?
2) Is this the best way to go? Or am I missing something that would make this work more efficiently?
I'm using Ubuntu Mate.

This is not the answer I was looking for, just a workaround that makes it work (in case someone else has a similar issue in the future).
To skip the JavaScript part (which was stopping me from using Python's requests), I decided to compute the expected hash in Python (rather than in the browser) and send it the way the normal web form would.
The JavaScript basically concatenates the hidden hpsw_id value and the password and takes the MD5 of the result.
The Python now looks like this:
import requests
from hashlib import md5
from re import search
url = "http://192.168.1.60/PAGE.htm"
with requests.Session() as s:
    # Get the hpsw_id nonce from the login page
    r = s.get(url)
    hpsw_id = search('name="pA" value="([A-Z0-9]*)"', r.text)
    hpsw_id = hpsw_id.group(1)
    # Hash the nonce followed by the password (md5 needs bytes in Python 3)
    m = md5()
    m.update((hpsw_id + 'MySecretPassword').encode())
    pA = m.hexdigest()
    # Post the same fields the browser form sends
    r = s.post(url, data=[('q', ''), ('q', ''), ('pA', pA)])
    print(r.content)
Note: q, q and pA are the fields that the form sends (q=&q=&pA=f08b97e5e3f472fdde4280a9aa408aaa) when I log in normally using a web browser.
If someone knows the answer to my original question, however, I would appreciate it if you posted it here.
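To see why the note above lists q twice, here is a small sketch that builds the request body offline, with no network access. The function name build_login_body is hypothetical, and it assumes the page's MD5() returns lowercase hex of the ASCII text, which matches the sample value in the note.

```python
from hashlib import md5
from urllib.parse import urlencode


def build_login_body(nonce: str, password: str) -> str:
    """Return the form body a browser would send after DoHash() has run."""
    # Same order as the JavaScript: MD5(nonce.concat(password)), hex-encoded.
    digest = md5((nonce + password).encode()).hexdigest()
    # A list of pairs (not a dict) keeps both "q" fields; DoHash() blanks
    # the password field, so both are sent empty.
    return urlencode([("q", ""), ("q", ""), ("pA", digest)])


print(build_login_body("180864D635AD2347", "MySecretPassword"))
```

A dict could not hold the duplicate q key, which is presumably why the workaround passes a list of tuples to requests.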

selenium in python browser.find_element_by_name('submit').click() not working

I'm trying to scrape a site that needs login information, and after hours of trying to figure out why I keep getting "Login failed", I believe it is simply because the "Log in" or "Submit" button is not actually getting clicked. I realized this by saving a screenshot of the browser right when it "fails". My username and password are filled into the fields.
I've tried things like wait, elementScrollBehavior, nothing seems to work. I'd really appreciate some help with this! Code below.
def load(self):
    global browser
    DesiredCapabilities.PHANTOMJS["elementScrollBehavior"] = 1
    #browser field
    browser = webdriver.PhantomJS()
    wait = WebDriverWait(browser, 10)
    #browser = webdriver.Firefox()
    #browser = webdriver.Chrome()
    loginId = self.id
    password = self.pw
    browser.get('https://link.example.com')
    browser.find_element_by_id('cf-login').send_keys(loginId)
    browser.find_element_by_id('password').send_keys(password)
    browser.find_element_by_name('submit').click()
    #wait.until(EC.presence_of_element_located((By.ID, "crefli_HC_SSS_STUDENT_CENTER")))
    try:
        if browser.find_element_by_id('crefli_HC_SSS_STUDENT_CENTER'):
            #return login status
            return True
        else:
            return False
    except:
        print('element not found on page')
        print(browser.current_url)
        #browser.save_screenshot('~/Desktop/screen2.png')
HTML of form:
<form name="loginform" action="/oam/server/auth_cred_submit" method="post">
    <div class="nonfloat-box">
        Username:
        <input type="text" id="cf-login" name="username" class="username inputbox" autocomplete="OFF">
    </div>
    <div class="float-box">
        Password:
        <input id="password" name="password" type="password" class="password inputbox" autocomplete="OFF">
    </div>
    <input type="image" src="https://www.cuny.edu/site/citizencuny/cunyfirst-login/loginbutton.jpg" onclick="javascript: return signon_validate()" alt="Submit" name="submit">
</form>
I believe I need to SOMEHOW get that bit of javascript to run. But HOW?
UPDATE: Selenium has a submit() method that automatically submits the <form> in HTML. Even using this, it does not work. As you can see in the HTML, it IS a form. At this point I do not know what else to try.

Please try this, hope it helps
from selenium.webdriver.common.keys import Keys
driver.find_element_by_name('submit').send_keys(Keys.RETURN)
(or)
driver.find_element_by_name('submit').send_keys(Keys.ENTER)

I have a feeling 'submit' is not being found by find_element_by_name. Try find_element_by_xpath("//*[@name='submit']")

The HTML is needed for a more accurate diagnosis, but I usually use the Enter key to submit forms. Sometimes JavaScript interferes with submitting by click, and a simple Enter usually does the trick.
from selenium.webdriver.common.keys import Keys
def load(self):
    (...)
    browser.find_element_by_id('password').send_keys(password)
    browser.find_element_by_id('password').send_keys(Keys.ENTER)
    (...)
Otherwise, make sure you mean find_element_by_name and not find_element_by_tag_name.

Use XPath. Please try this:
//input[@name='submit']
or
//input[contains(@name,'submit') and contains(@alt,'Submit')]
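XPath attribute tests are written with @, and a simple one can be sanity-checked offline before involving a browser. The sketch below uses the standard library's ElementTree on a simplified copy of the question's form; the inputs are self-closed so that it parses as XML. ElementTree implements only a small XPath subset, so this covers the plain attribute test and not contains().

```python
import xml.etree.ElementTree as ET

# Simplified copy of the form from the question, made well-formed XML
# (self-closed <input/> tags) so the standard-library parser accepts it.
FORM = """
<form name="loginform" action="/oam/server/auth_cred_submit" method="post">
  <input type="text" id="cf-login" name="username"/>
  <input id="password" name="password" type="password"/>
  <input type="image" alt="Submit" name="submit"/>
</form>
""".strip()

root = ET.fromstring(FORM)
# ElementTree paths are relative, hence ".//" where Selenium would take "//".
matches = root.findall(".//input[@name='submit']")
print([m.get("alt") for m in matches])  # ['Submit']
```

If the expression matches here but Selenium still cannot click the element, the locator is probably not the problem.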

You can use submit() to submit the form. It needs to be called on the <form> element:
browser.find_element_by_id('cf-login').send_keys(loginId)
browser.find_element_by_id('password').send_keys(password)
browser.find_element_by_name('loginform').submit()
If this doesn't work, you can use a JavaScript click as a workaround:
submit = browser.find_element_by_name('submit')
browser.execute_script("arguments[0].click();", submit)
You can also try sending Enter
from selenium.webdriver.common.keys import Keys
browser.find_element_by_id('cf-login').send_keys(loginId)
password_field = browser.find_element_by_id('password')
password_field.send_keys(password)
password_field.send_keys(Keys.RETURN)
#OR
password_field.send_keys(Keys.ENTER)

When you use Selenium for automated testing or scraping, I suggest using the method
webdriver.find_element_by_xpath(xpathString)
because you can check the XPath in the web browser's console.
Try this command in the console:
$x('xpathString')

Scraping a website that requires authentication

I know this question might seem quite straightforward, but I have tried every suggestion and none has worked.
I want to build a Python script that checks my school website to see if new grades have been put up. However, I cannot for the life of me figure out how to scrape it.
The website redirects to a different page to log in. I have tried all the scripts and answers I could find, but I am lost.
I use Python 3; the website URL has the format https://blah.schooldomate.state.edu.country/website/grades/summary.aspx
The username section contains the following:
<input class="txt" id="username" name="username" type="text" autocomplete="off" style="cursor: auto;">
The password field is the same, except that it has an onfocus attribute.
Once successfully authenticated, I am automatically redirected to the correct page.
I have tried:
Using Python 2's cookielib and Mechanize
Using HTTPBasicAuth
Passing the information as a dict to a requests.get()
Trying out many different people's code, including answers I found on this site

You can try with requests:
http://docs.python-requests.org/en/master/
from the web site:
import requests
r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
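Note that auth=('user', 'pass') selects HTTP Basic authentication, which only adds an Authorization header to the request. The sketch below shows what that header contains; the helper name basic_auth_header is hypothetical. A site with an HTML username field like the one in the question more likely expects a form POST, which may be why HTTPBasicAuth did not help the asker, though that is an assumption about the school's site.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value used by HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode("ascii")
    return f"Basic {token}"


print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

Nothing in this header touches the login form's fields, so a server that checks posted form data will ignore it.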

Maybe you can use the Selenium library.
Here is my code example:
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
def loging():
    browser = webdriver.Firefox()
    browser.get("http://www.your_url.com")
    #Edit the XPATH of Loging INPUT username
    xpath_username = "//input[@class='username']"
    #Edit the XPATH of Loging INPUT password
    xpath_password = "//input[@class='password']"
    #THIS will write the YOUR_USERNAME/pass in the xpath (Custom function)
    click_xpath(browser, xpath_username, "YOUR_USERNAME")
    click_xpath(browser, xpath_password, "YOUR_PASSWORD")
    #THEN SCRAPE WHAT YOU NEED
#Here is the custom function
#If NO input, will only click on the element (on a button for example)
def click_xpath(browser, xpath, input="", time_wait=10):
    try:
        browser.implicitly_wait(time_wait)
        wait = WebDriverWait(browser, time_wait)
        search = wait.until(EC.element_to_be_clickable((By.XPATH, xpath)))
        search.click()
        sleep(1)
        #Write in the element
        if input:
            search.send_keys(str(input) + Keys.RETURN)
        return search
    except Exception as e:
        #print("ERROR-click_xpath: "+xpath)
        return False

How to call a postback in ASP.Net with Python

I am trying to web-scrape some elements and their values off a page with Python. However, to get more elements, I need to simulate a click on the next button. There is a postback tied to these buttons, so I am trying to call it. Unfortunately, Python prints the same values over and over again, meaning the postback for the next button isn't being called. I am using requests to do my POST/GET.
import re
import time
import requests
TARGET_GROUP_ID = 778092
SESSION = requests.Session()
REQUEST_HEADERS = {"Accept-Encoding": "gzip,deflate"}
GROUP_URL = "http://roblox.com/groups/group.aspx?gid=%d"%(TARGET_GROUP_ID)
POST_BUTTON_HTML = 'pagerbtns next'
EVENTVALIDATION_REGEX = re.compile(r'id="__EVENTVALIDATION" value="(.+)"').search
VIEWSTATE_REGEX = re.compile(r'id="__VIEWSTATE" value="(.+)"').search
VIEWSTATEGENERATOR_REGEX = re.compile(r'id="__VIEWSTATEGENERATOR" value="(.+)"').search
TITLE_REGEX = re.compile(r'<a id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_ctrl\d+_hlAvatar".*?title="(\w+)".*?ID=(\d+)"')
page = SESSION.get(GROUP_URL, headers = REQUEST_HEADERS).text
while 1:
    if POST_BUTTON_HTML in page:
        for (ids,names) in re.findall(TITLE_REGEX, page):
            print(ids, names)
        postData = {
            "__EVENTVALIDATION": EVENTVALIDATION_REGEX(page).group(1),
            "__VIEWSTATE": VIEWSTATE_REGEX(page).group(1),
            "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR_REGEX(page).group(1),
            "__ASYNCPOST": True,
            "ct1000_cphRoblox_rbxGroupRoleSetMembersPane_currentRoleSetID": "4725789",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton": "",
            "ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox": "3"
        }
        page=SESSION.post(GROUP_URL, data = postData, stream = True).text
        time.sleep(2)
How can I properly call the post back in ASP.NET from Python to fix this issue? As stated before, it's only printing out the same values each time.
This is the HTML Element of the button
<a class="pagerbtns next" href="javascript:__doPostBack('ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00','')"> </a>
And this is the div it is in:
<div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_MembersPagerPanel" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton')">
    <div id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_Div1" class="paging_wrapper">
        Page <input name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$PageTextBox" type="text" value="1" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_PageTextBox" class="paging_input"> of
        <div class="paging_pagenums_container">125</div>
        <input type="submit" name="ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl01$HiddenInputButton" value="" onclick="loading('members');" id="ctl00_cphRoblox_rbxGroupRoleSetMembersPane_dlUsers_Footer_ctl01_HiddenInputButton" class="pagerbtns translate" style="display:none;">
    </div>
</div>
I was thinking of using a JS library and executing the JS __doPostBack function; however, I would first like to see whether this can be achieved in pure Python.

Yes, it should be achievable; you just have to submit the correct values in the correct fields. But I assume the page you are trying to parse uses ASP.NET Web Forms, so finding those values could be really time-consuming. I suggest looking into Selenium; with it you can easily click elements and trigger events on a web page without writing much code.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://site you are trying to parse")
driver.find_element_by_id("button").click()
# then get the data you want
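For the pure-Python route, it may help to know what __doPostBack does in the browser: it writes its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the form. The sketch below builds such a payload from the link's href. The function name build_postback and the sample page are hypothetical, the hidden-field regex assumes the attribute order shown in the sample, and whether this particular site accepts the result is untested.

```python
import re

# Matches href="javascript:__doPostBack('target','argument')"
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")
# Matches hidden inputs such as __VIEWSTATE; assumes type comes first,
# then name, with value somewhere later in the same tag.
HIDDEN_RE = re.compile(r'<input type="hidden" name="([^"]+)"[^>]*?value="([^"]*)"')


def build_postback(page_html: str, link_href: str) -> dict:
    """Build POST data that mimics clicking a __doPostBack link."""
    match = POSTBACK_RE.search(link_href)
    if match is None:
        raise ValueError("href does not contain a __doPostBack call")
    target, argument = match.groups()
    # Carry over every hidden field the server rendered (state, validation).
    data = dict(HIDDEN_RE.findall(page_html))
    # These two fields are what __doPostBack fills in before submitting.
    data["__EVENTTARGET"] = target
    data["__EVENTARGUMENT"] = argument
    return data


# Made-up page fragment for illustration; real values are much longer.
SAMPLE_PAGE = (
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />\n'
    '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />'
)
NEXT_HREF = ("javascript:__doPostBack("
             "'ctl00$cphRoblox$rbxGroupRoleSetMembersPane$dlUsers_Footer$ctl02$ctl00','')")
print(build_postback(SAMPLE_PAGE, NEXT_HREF))
```

The resulting dict could be passed as data= to a requests POST. The hidden fields have to be re-read from each response, because the server issues new state values on every page.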

Trying to understand what auth a website is using to login through python

I am trying to login into this website but when I look at the source, I cannot ascertain how its login works. The file is a ".page" which confuses me and the code surrounding login is:
<input type="text" id="screenName" name="screenName" tabindex="1"/>
and
<input type="password" id="password" name="kclq" maxlength="104" tabindex="2"/>
but when I got to the part that submits the password, it was not as clear:
Sign In
I tried using the modules requests and mechanize but neither seemed to work out:
import requests
USERNAME = 'username'
PASSWORD = 'password'
URL = 'http://edline.net/Index.page'
def main():
    session = requests.Session()
    login_data = {
        'screenName': USERNAME,
        'btnSignIn' : 'signIn',
        'password' : PASSWORD,
    }
    r = session.post(URL,data=login_data)
    r = session.get("https://www.edline.net/UserDocList.page?")
    #access a page requiring login
    print(r.text)
if __name__ == '__main__':
    main()
How exactly would I go about doing this? The website itself seems to use javascript when logged in, but I just want to be able to access pages while logged in at the moment.
Also, could I be directed to a good starting place for becoming proficient at reading HTML and knowing what kind of website I am dealing with?
Thanks

Use requests to log in:
import requests
url = 'url'
data = {
    "screenName" : "username",
    "password" : "password",
    "btnSignIn" : "signIn",
}
r = requests.post(url, data=data)
It's that easy.
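One thing to check before posting: a browser submits each field under its name attribute, not its id. In the question's HTML the password input has id="password" but name="kclq", so a "password" key may simply be ignored by the server. The sketch below lists the names from the two inputs quoted in the question, using only the standard library. The class name InputCollector is hypothetical, and whether kclq stays the same between page loads is unknown.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect the name attribute of every <input> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}  # maps name -> id (None if the input has no id)

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if "name" in attributes:
                self.fields[attributes["name"]] = attributes.get("id")


# The two inputs quoted in the question.
LOGIN_HTML = """
<input type="text" id="screenName" name="screenName" tabindex="1"/>
<input type="password" id="password" name="kclq" maxlength="104" tabindex="2"/>
"""

collector = InputCollector()
collector.feed(LOGIN_HTML)
print(collector.fields)  # {'screenName': 'screenName', 'kclq': 'password'}
```

The keys of that dict are the ones to use in the data passed to requests.post, ideally read fresh from the login page in the same session.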
