Link Checker (Spider Crawler) - python

I am looking for a link checker to spider my website and log invalid links. The problem is that the site starts with a login page, which is required. What I want is a link checker that can be given the login details on the command line, POST them, and then spider the rest of the website.
Any ideas will be appreciated.

I've just recently solved a similar problem, like this:
import urllib
import urllib2
import cookielib

login = 'user@host.com'
password = 'secret'

# An opener that stores and resends cookies across requests
cookiejar = cookielib.CookieJar()
urlOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))

# adjust these keys to match the login form's field names
values = {'username': login, 'password': password}
data = urllib.urlencode(values)
request = urllib2.Request('http://target.of.POST-method', data)
response = urlOpener.open(request)

# from now on we're authenticated and can access the rest of the site
response = urlOpener.open('http://rest.of.user.area')

You want to look at the cookielib module: http://docs.python.org/library/cookielib.html. It provides a full implementation of HTTP cookies, which will let you persist the login session. Once you're using a CookieJar, you just have to get the login details from the user (say, from the console) and submit a proper POST request.
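For example, a minimal sketch of that approach, assuming a login form at http://example.com/login with fields named username and password (both the URL and the field names are placeholders to adjust):

import urllib
import urllib2
import cookielib
import getpass

# Ask the user for their login details on the console
login = raw_input('Username: ')
password = getpass.getpass('Password: ')

# An opener whose CookieJar keeps the session cookie for later requests
cookiejar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))

# Field names and URL are assumptions; match them to the real form
data = urllib.urlencode({'username': login, 'password': password})
opener.open('http://example.com/login', data)

# The same opener can now spider the protected pages
page = opener.open('http://example.com/protected/page')
print page.read()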

Related

Logging into Flask Web App from a Python script [duplicate]

I am trying to post a request to log in to a website using the Requests module in Python, but it's not really working. I'm new to this... so I can't figure out if I should make my username and password cookies or use some type of HTTP authorization thing I found (??).
from pyquery import PyQuery
import requests
url = 'http://www.locationary.com/home/index2.jsp'
So now, I think I'm supposed to use "post" and cookies....
ck = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
r = requests.post(url, cookies=ck)
content = r.text
q = PyQuery(content)
title = q("title").text()
print title
I have a feeling that I'm doing the cookies thing wrong...I don't know.
If it doesn't log in correctly, the title of the home page should come out to "Locationary.com" and if it does, it should be "Home Page."
If you could maybe explain a few things about requests and cookies to me and help me out with this, I would greatly appreciate it. :D
Thanks.
EDIT: It still didn't really work. Okay... so this is what the home page HTML says before you log in:
</td><td><img src="http://www.locationary.com/img/LocationaryImgs/icons/txt_email.gif"> </td>
<td><input class="Data_Entry_Field_Login" type="text" name="inUserName" id="inUserName" size="25"></td>
<td><img src="http://www.locationary.com/img/LocationaryImgs/icons/txt_password.gif"> </td>
<td><input class="Data_Entry_Field_Login" type="password" name="inUserPass" id="inUserPass"></td>
So I think I'm doing it right, but the output is still "Locationary.com"
2nd EDIT:
I want to be able to stay logged in for a long time and whenever I request a page under that domain, I want the content to show up as if I were logged in.
I know you've found another solution, but for those like me who find this question, looking for the same thing, it can be achieved with requests as follows:
Firstly, as Marcus did, check the source of the login form to get three pieces of information - the url that the form posts to, and the name attributes of the username and password fields. In his example, they are inUserName and inUserPass.
Once you've got that, you can use a requests.Session() instance to make a POST request to the login URL with your login details as a payload. Making requests from a session instance is essentially the same as using requests normally; it simply adds persistence, allowing you to store and use cookies etc.
Assuming your login attempt was successful, you can simply use the session instance to make further requests to the site. The cookie that identifies you will be used to authorise the requests.
Example
import requests

# Fill in your details here to be posted to the login form.
payload = {
    'inUserName': 'username',
    'inUserPass': 'password'
}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('LOGIN_URL', data=payload)
    # print the html returned or something more intelligent to see if it's a successful login page.
    print p.text

    # An authorised request.
    r = s.get('A protected web page url')
    print r.text
    # etc...
If the information you want is on the page you are directed to immediately after login...
Let's call your ck variable payload instead, like in the python-requests docs:
payload = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
url = 'http://www.locationary.com/home/index2.jsp'
requests.post(url, data=payload)
Otherwise...
See https://stackoverflow.com/a/17633072/111362 below.
Let me try to make it simple. Suppose the URL of the site is http://example.com/, and suppose you need to sign in by filling in a username and password. We go to the login page, say http://example.com/login.php, view its source code, and search for the action URL. It will be in a form tag, something like
<form name="loginform" method="post" action="userinfo.php">
Now take userinfo.php and make it an absolute URL, which will be 'http://example.com/userinfo.php'. Now run a simple Python script:
import requests

url = 'http://example.com/userinfo.php'
values = {'username': 'user',
          'password': 'pass'}

r = requests.post(url, data=values)
print r.content
I hope this helps someone somewhere someday.
The requests.Session() solution above helped with logging into a form with CSRF protection (as used in Flask-WTF forms). Check whether a csrf_token is required as a hidden field and, if so, add it to the payload along with the username and password:
import requests
from bs4 import BeautifulSoup

server_name = 'https://example.com'  # base URL of the site

payload = {
    'email': 'email@example.com',
    'password': 'passw0rd'
}

with requests.Session() as sess:
    # Fetch the login page first to obtain the CSRF token
    res = sess.get(server_name + '/signin')
    signin = BeautifulSoup(res.text, 'html.parser')
    payload['csrf_token'] = signin.find('input', id='csrf_token')['value']
    res = sess.post(server_name + '/auth/login', data=payload)
Find out the names of the inputs used on the website's form for usernames (<...name=username.../>) and passwords (<...name=password../>) and replace them in the script below. Also replace the URL to point at the desired site to log in to.
login.py
#!/usr/bin/env python

import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

payload = { 'username': 'user@email.com', 'password': 'blahblahsecretpassw0rd' }
url = 'https://website.com/login.html'
requests.post(url, data=payload, verify=False)
The use of disable_warnings(InsecureRequestWarning) silences the warnings urllib3 would otherwise print when logging into sites with unverified SSL certificates.
Extra:
To run this script from the command line on a UNIX-based system, place it in a directory, e.g. ~/home/scripts, and add that directory to your PATH in ~/.bash_profile or a similar file used by your terminal:
# Custom scripts
export CUSTOM_SCRIPTS=~/home/scripts
export PATH=$CUSTOM_SCRIPTS:$PATH
Then make the script executable and create a symlink to it so it can be invoked simply as login:
chmod +x ~/home/scripts/login.py
ln -s ~/home/scripts/login.py ~/home/scripts/login
Close your terminal, start a new one, and run login.
Some pages may require more than a login/password. There may even be hidden fields. The most reliable way is to use your browser's inspect tool and look at the network tab while logging in, to see exactly what data is being passed on.
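As a sketch of that idea with requests, the snippet below scrapes every input of the login form, including the hidden ones, and posts them all back with the credentials filled in (the URL and field names are placeholders, not any particular site's):

import requests
from bs4 import BeautifulSoup

LOGIN_URL = 'https://example.com/login'  # placeholder

with requests.Session() as s:
    # Collect every named input the form would send, hidden fields included
    soup = BeautifulSoup(s.get(LOGIN_URL).text, 'html.parser')
    form = soup.find('form')
    data = {i['name']: i.get('value', '')
            for i in form.find_all('input') if i.has_attr('name')}

    # Overwrite the credential fields (names are assumptions; check the form)
    data['username'] = 'user@example.com'
    data['password'] = 'secret'

    # Post to the form's action URL, resolved against the login page
    action = requests.compat.urljoin(LOGIN_URL, form.get('action', ''))
    r = s.post(action, data=data)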

Not able to get another page when using the Python requests session module to log in

I am trying to log in to LinkedIn using the Python requests session module, but I am not able to access other pages. Please help me out.
My code is like this:
import requests
from bs4 import BeautifulSoup

# Get login form
URL = 'https://www.linkedin.com/uas/login'
session = requests.session()
login_response = session.get('https://www.linkedin.com/uas/login')
login = BeautifulSoup(login_response.text, "lxml")

# Get hidden form inputs
inputs = login.find('form', {'name': 'login'}).findAll('input', {'type': ['hidden', 'submit']})

# Create POST data
post = {input.get('name'): input.get('value') for input in inputs}
post['session_key'] = 'username'
post['session_password'] = 'password'

# Post login
post_response = session.post('https://www.linkedin.com/uas/login-submit', data=post)
notify_response = session.get('https://www.linkedin.com/company-beta/3067/')
notify = BeautifulSoup(notify_response.text, "lxml")
print notify.title
Well, I hope I'm not saying the wrong thing, but I had to crawl LinkedIn some weeks ago and found that LinkedIn is pretty good at spotting bots. I'm almost sure that is your issue here (you should try to print the output of post_response; you will surely see you are on a captcha page or something like that).
Plot twist: I succeeded in logging into LinkedIn by running Selenium, logging in by hand, and using pickle to save the cookies as a text file.
Then, instead of using the login form, I just loaded the cookies into Selenium and refreshed the page; tadam, logged in. I think this can be done with requests too.
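A rough sketch of that last idea, assuming the cookies were saved from Selenium with pickle.dump(driver.get_cookies(), open('cookies.pkl', 'wb')); get_cookies() returns a list of dicts with 'name' and 'value' keys:

import pickle
import requests

# Load the cookies previously saved from a hand-authenticated Selenium run
with open('cookies.pkl', 'rb') as f:
    selenium_cookies = pickle.load(f)

session = requests.Session()
for cookie in selenium_cookies:
    # Copy each Selenium cookie dict into the requests cookie jar
    session.cookies.set(cookie['name'], cookie['value'],
                        domain=cookie.get('domain'))

# The session should now be authenticated
r = session.get('https://www.linkedin.com/feed/')
print(r.status_code)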

python-requests - can't login

I'm trying to scrape some data, but first I need to log in. I am trying to use python-requests, and here is my code so far:
import requests

login_url = "https://www.wehelpen.nl/login/"
users_url = "https://www.wehelpen.nl/ik-zoek-hulp/hulpprofielen/"
profile_url = "https://www.wehelpen.nl/profiel/01136/hulpvragen/"
uname = "****"
pword = "****"

def main():
    s = login(uname, pword, login_url)
    page = s.get(users_url)
    print makeUTF8(page.text)  # grab html and grep for logged-in text to make sure!

def login(uname, pword, url):
    s = requests.session()
    s.get(url, auth=(uname, pword))
    csrftoken = s.cookies['csrftoken']
    login_data = dict(username=uname, password=pword,
                      csrfmiddlewaretoken=csrftoken, next='/')
    s.post(url, data=login_data, headers=dict(Referer=url))
    return s

def makeUTF8(text):
    return text.encode('utf-8')
Basically, I need to log in at login_url with a POST request (using a CSRF token, because I get an error otherwise). Then, using the session object passed back from login(), I want to check that I am logged in by making a GET request to a user page. On the returned page.text I can then run a grep command to check for a certain href which tells me whether I am logged in or not.
So far I am unable to log in and keep a working session object. Can anyone help me? This has been the most tedious Python experience of my life.
EDIT: I have searched, searched and searched SO for answers and nothing is working...
You need to use the correct names for the dictionary keys: they must match the HTML name attributes of the form's input fields, since that is what the server matches the POST data against. In your case those names are identification and password:
login_data = dict(identification=uname, password=pword,
                  csrfmiddlewaretoken=csrftoken, next='/')
There are lots of options, but I have had success using cookielib instead of trying to "manually" handle the cookies.
import urllib2
import cookielib
cookiejar = cookielib.CookieJar()
cookiejar.clear()
urlOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
# ...etc...
Some potentially relevant answers on getting this set up are on SO, including: https://stackoverflow.com/a/5826033/1681480
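To tie that back to this question's site, a sketch of the full cookielib flow (the identification/password field names come from the other answer and the csrftoken cookie from the question itself; verify both against the actual form):

import urllib
import urllib2
import cookielib

login_url = "https://www.wehelpen.nl/login/"

cookiejar = cookielib.CookieJar()
urlOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))

# The first GET fills the jar with the csrftoken cookie
urlOpener.open(login_url)
csrftoken = next(c.value for c in cookiejar if c.name == 'csrftoken')

# Field names are assumptions taken from the answer above
login_data = urllib.urlencode({
    'identification': 'user',
    'password': 'pass',
    'csrfmiddlewaretoken': csrftoken,
    'next': '/',
})
request = urllib2.Request(login_url, login_data)
request.add_header('Referer', login_url)
response = urlOpener.open(request)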

Login to website using python

I am trying to log in to this page using Python.
I tried using the steps described on this other Stack Overflow post, and got the following code:
import urllib, urllib2, cookielib
username = 'username'
password = 'password'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'j_password' : password})
opener.open('http://friends.cisv.org/index.cfm', login_data)
resp = opener.open('http://friends.cisv.org/index.cfm?fuseaction=activities.list')
print resp.read()
but that gave me the following output:
<SCRIPT LANGUAGE="JavaScript">
alert('Sorry. You need to log back in to continue. You will be returned to the home page when you click on OK.');
document.location.href='index.cfm';
</SCRIPT>
What am I doing wrong?
I would recommend using the wonderful requests module.
The code below will get you logged into the site and persist the cookies for the duration of the session.
import requests
import sys

EMAIL = ''
PASSWORD = ''
URL = 'http://friends.cisv.org'

def main():
    # Start a session so we can have persistent cookies
    session = requests.session(config={'verbose': sys.stderr})

    # This is the form data that the page sends when logging in
    login_data = {
        'loginemail': EMAIL,
        'loginpswd': PASSWORD,
        'submit': 'login',
    }

    # Authenticate
    r = session.post(URL, data=login_data)

    # Try accessing a page that requires you to be logged in
    r = session.get('http://friends.cisv.org/index.cfm?fuseaction=user.fullprofile')

if __name__ == '__main__':
    main()
The term "login" is unfortunately very vague. The code given here obviously tried to log in using HTTP basic authentication. I'd wager a guess that this site wants you to send it a username and password in some kind of POST form (that's how most web-based login forms work). In this case, you'd need to send the proper POST request, and keep whatever cookies it sent back to you for future requests. Unfortunately I don't know what this would be, it depends on the site. You'll need to figure out how it normally logs a user in and try to follow that pattern.

python authenticate with urllib2 and cookielib and get login passed/failed result

Here is a cut from the code I use to log in to a remote site.
My problem is that I don't know how to handle the authentication pass/fail result.
def prepareLoginData(self):
    self.post_login_data = urllib.urlencode({
        'login': self.user,
        'password': self.password,
        'Login': 'Login'
    })
    return self.post_login_data

def prepareOpener(self):
    cj = cookielib.CookieJar()
    self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    for header in self.headers:
        self.opener.addheaders.append(header)
    return self.opener
Then I login like below:
self.resp = self.opener.open(self.login_page, self.post_login_data)
and parse the response self.resp.read() to check if login passed with regular expression.
How can I get the login result based on a cookie value? Or maybe there is another way?
Whether auth passes or fails, the only thing I see in cj is a SESSID, which gives no information about the auth result.
Thanks in advance!
Maybe you can look at the source code of the returned page.
On a lot of modern websites, every XHTML element of the page has an id or a class, or is a child of an element with an id or a class, so I think you can use an XHTML parser like BeautifulSoup to extract error messages from the page.
BeautifulSoup is very easy to learn and to use, so you'll probably find a solution; but if not, give me the URL of the website and I'll try to write working code...
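As a sketch of that, assuming the site marks failures with an element like <div class="error"> (the class name is a guess; inspect the real page's markup):

from bs4 import BeautifulSoup

def get_login_error(html):
    # Parse the returned page and look for an error message element.
    # The 'error' class is an assumption; adjust it to the real markup.
    soup = BeautifulSoup(html, 'html.parser')
    error = soup.find(class_='error')
    return error.get_text(strip=True) if error else None

Called as get_login_error(self.resp.read()), it would return the message text on a failed login and None when the login went through.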
