Authentication results in 404 code - python

There is a website I need to scrape, but before I do that I need to log in.
There seem to be three things I need to get in: the username, the password, and an authenticity token. I know the username and password, but I am not sure how to access the token.
This is what I have tried:
import requests
from lxml import html

login_url = "https://urs.earthdata.nasa.gov/home"

session_requests = requests.session()
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='authenticity_token']/@value")))[0]

payload = {"username": "my_name",
           "password": "my_password",
           "authenticity_token": authenticity_token}

result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)
print(result)
This results in:
<Response [404]>
My username and password are entered correctly, so it must be the token that is going wrong. I think the problem is this line:
authenticity_token = list(set(tree.xpath("//input[@name='authenticity_token']/@value")))[0]
or this line:
payload = {"username": "my_name",
           "password": "my_password",
           "authenticity_token": authenticity_token}
Looking at the source code of the webpage, I noticed there is an authenticity_token, a csrf-token, and a csrf-param. So it's possible these are in the wrong order, but I tried all the combinations.
EDIT:
Here is a BeautifulSoup approach that results in a 404 again.
from bs4 import BeautifulSoup

s = requests.session()
response = s.get(login_url)
soup = BeautifulSoup(response.text, "lxml")
for n in soup('input'):
    if n['name'] == 'authenticity_token':
        token = n['value']
    if n['name'] == 'utf8':
        utf8 = n['value']
        break

auth = {
    'username': 'my_username',
    'password': 'my_password',
    'authenticity_token': token,
    'utf8': utf8
}
s.post(login_url, data=auth)

If you inspect the page you'll notice that the form's action value is '/login', so you have to submit your data to https://urs.earthdata.nasa.gov/login.
import requests
from bs4 import BeautifulSoup

login_url = "https://urs.earthdata.nasa.gov/login"
home_url = "https://urs.earthdata.nasa.gov/home"

s = requests.session()
soup = BeautifulSoup(s.get(home_url).text, "lxml")
data = {i['name']: i.get('value', '') for i in soup.find_all('input')}
data['username'] = 'my_username'
data['password'] = 'my_password'
result = s.post(login_url, data=data)
print(result)
<Response [200]>
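The key move in that snippet is the dict comprehension, which harvests every input in the form (utf8, authenticity_token, and any other hidden fields) so nothing is missing from the POST body. A self-contained check on invented markup (the real page has more fields):

```python
from bs4 import BeautifulSoup

# Illustrative form markup, not the real Earthdata page
html_doc = """
<form action="/login">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="hidden" name="utf8" value="y">
  <input type="text" name="username">
</form>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# Inputs without a value attribute default to an empty string
data = {i['name']: i.get('value', '') for i in soup.find_all('input')}
print(data)
```

Note that this raises a KeyError on any input with no name attribute; `i.get('name')` would be the defensive variant.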
A quick example with selenium:
from selenium import webdriver
driver = webdriver.Firefox()
url = 'https://n5eil01u.ecs.nsidc.org/MOST/MOD10A1.006/'
driver.get(url)
driver.find_element_by_name('username').send_keys('my_username')
driver.find_element_by_name('password').send_keys('my_password')
driver.find_element_by_id('login').submit()
html = driver.page_source
driver.quit()
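If the login succeeds in Selenium but the actual scraping should happen with requests, the authenticated cookies can be copied across before quitting the driver. A small helper sketch (assumes `driver.get_cookies()` returns dicts with at least name and value keys, which it does):

```python
import requests

def cookies_to_session(selenium_cookies):
    """Copy cookies from Selenium's driver.get_cookies() into a requests.Session."""
    s = requests.Session()
    for c in selenium_cookies:
        s.cookies.set(c['name'], c['value'], domain=c.get('domain', ''))
    return s

# Usage with the driver above (call before driver.quit()):
# s = cookies_to_session(driver.get_cookies())
# html = s.get(url).text
```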

Related

Python beautifulsoup4 - Web scraping (pfSense) - Create a user

I'm trying to log into the website and create a user in pfSense.
Currently I can log in and browse pfSense, but I cannot create users.
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    # First CSRF token, for login
    URL1 = "http://myipaddress/index.php"
    html_text1 = requests.get(URL1).text
    soup1 = BeautifulSoup(html_text1, 'lxml')
    token1 = soup1.find('input', {'name': '__csrf_magic'})['value']
    payload1 = {'__csrf_magic': token1, 'usernamefld': 'myusername', 'passwordfld': 'mypassword', 'login': 'Sign In'}
    s.post(URL1, data=payload1)

    # Second CSRF token, for creating a user
    URL2 = "http://myipaddress/system_usermanager.php?act=new"
    html_text2 = requests.get(URL2).text
    soup2 = BeautifulSoup(html_text2, 'lxml')
    token2 = soup2.find('input', {'name': '__csrf_magic'})['value']
    payload2 = {'__csrf_magic': token2, 'usernamefld': 'Robert', 'passwordfld1': 'root', 'passwordfld2': 'root', 'descr': 'Robert DELAFONDU', 'groups[]': 'Simple', 'utype': 'user', 'save': 'Save'}
    s.post(URL2, data=payload2)

    # Print group users
    URL = "http://myipaddress/system_groupmanager.php?act=edit&groupid=0"
    html_text = s.get(URL).text
    soup = BeautifulSoup(html_text, 'lxml')
    users = soup.find_all("option")
    for user in users:
        print(user.text)
This script prints all of the users.
I also tried adding every payload field for creating a user (payload image), but that didn't work.
I managed to log in to pfSense and get the CSRF token, and I expected to be able to create a user, but I don't know how.
Can anyone help me understand how this works?
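One likely culprit in the code above: the token pages are fetched with requests.get(...) instead of s.get(...), so each __csrf_magic token comes from a fresh anonymous session rather than the logged-in one, and the POST is rejected. Every request should go through s. The token extraction itself can be isolated and checked offline; here is a standard-library-only sketch (a swap for the BeautifulSoup lookup, with invented sample markup):

```python
from html.parser import HTMLParser

class CsrfFinder(HTMLParser):
    """Collect the value of the first <input name="__csrf_magic"> encountered."""
    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == 'input' and d.get('name') == '__csrf_magic' and self.token is None:
            self.token = d.get('value')

def get_csrf(html_text):
    finder = CsrfFinder()
    finder.feed(html_text)
    return finder.token

# Sample markup in pfSense's style (invented for illustration)
sample = '<form><input type="hidden" name="__csrf_magic" value="sid:abc123"></form>'
print(get_csrf(sample))  # -> sid:abc123
```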

BeautifulSoup issues logging in to a website with a requests session

I am using the code below to log in to a website, so that I can scrape data from my own profile page.
However, even after I GET the profile-page URL, the soup selector only returns data from the login page.
I still haven't been able to find a reason for that.
import requests
from bs4 import BeautifulSoup

login_url = 'https://caicara.pizzanapoles.com.br/Account/Login'
url_perfil = 'https://caicara.pizzanapoles.com.br/AdminCliente'

payload = {
    'username': 'MY_USERNAME',
    'password': 'MY_PASSWORD'
}

with requests.session() as s:
    s.post(login_url, data=payload)
    r = requests.get(url_perfil)
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup.title)
Firstly you need to use your session object s for all the requests.
r = requests.get(url_perfil)
changes to
r = s.get(url_perfil)
A __RequestVerificationToken is sent in the POST data when you try to login - you may need to send it too.
It is present inside the HTML of the login_url
<input name="__RequestVerificationToken" value="..."
This means you .get() the login page - extract the token - then send your .post()
r = s.get(login_url)
soup = BeautifulSoup(r.content, 'html.parser')
token = soup.find('input', {'name': '__RequestVerificationToken'})['value']
payload['__RequestVerificationToken'] = token
r1 = s.post(login_url, data=payload)
r2 = s.get(url_perfil)
You may want to save each request into its own variable for further debugging.
Thank you Karl for your reply, but it didn't work.
I changed my code using the tips you mentioned above.
import requests
from bs4 import BeautifulSoup

login_url = 'https://caicara.pizzanapoles.com.br/Account/Login'
url = 'https://caicara.pizzanapoles.com.br/AdminCliente'

data = {
    'username': 'myuser',
    'password': 'mypass',
}

with requests.session() as s:
    r = s.get(login_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    token = soup.find('input', name='__RequestVerificationToken')['value_of_my_token']
    payload['__RequestVerificationToken'] = token
    r1 = s.post(login_url, data=payload)
    r2 = s.get(url_perfil)
However it returns the error below.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-140-760e35f7b327> in <module>
13
14 soup = BeautifulSoup(r.content, 'html.parser')
---> 15 token = soup.find('input', name='__RequestVerificationToken')['QHlUQaro9sNo4lefL59lQRtbuziHnHtolV7Xm_Et_3tvnZKZnS4gjBBJZakw7crW0dyXy_lok44RozrMAvWm61XXGla5tC3AuZlgXC4GukA1']
16
17 payload['__RequestVerificationToken'] = token
TypeError: find() got multiple values for argument 'name'
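The TypeError happens because find()'s first parameter is itself called name (it holds the tag name, 'input' here), so find('input', name=...) hands the name argument two values. Filter the attribute through a dict (or attrs=...), and subscript with the literal string 'value', not the token's value. A minimal offline check (markup invented for illustration):

```python
from bs4 import BeautifulSoup

html_doc = '<input name="__RequestVerificationToken" value="tok123">'
soup = BeautifulSoup(html_doc, 'html.parser')

# Wrong: soup.find('input', name='__RequestVerificationToken')
#   -> TypeError: find() got multiple values for argument 'name'
# Right: pass the attribute filter as a dict, subscript with 'value'
token = soup.find('input', {'name': '__RequestVerificationToken'})['value']
print(token)  # -> tok123
```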

log into stubborn webpage via Python

I am not very experienced with this type of thing, but I cannot seem to log into this webpage via Python: https://ravenpack.com/discovery/login/
I have tried solutions from other StackOverflow posts, but nothing seems to work. It could be that it is not possible, or I just do not know what I'm doing - either is entirely possible.
I have tried:
import requests

LOGIN_URL = 'https://ravenpack.com/discovery/login/'
DATA_URL = 'https://ravenpack.com/discovery/news_analytics_story/FFF4BFD4F4D4FF803852899BD1F02077/'

payload = {
    'username': 'uname',
    'password': 'pword'
}

with requests.Session() as s:
    s.post(LOGIN_URL, data=payload)
    r = s.get(DATA_URL)
    print r.text
this:
from twill.commands import *
go('https://ravenpack.com/discovery/login/')
fv("2", "username", "uname")
fv("2", "password", "pword")
submit('1')
this:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("https://ravenpack.com/discovery/login/") #Url that contains signin form
br.select_form()
br['username'] = "uname" #see what is the name of txt input in form
br['password'] = 'pword'
result = br.submit().read()
f=file('s.html', 'w')
f.write(result)
f.close()
and this:
from robobrowser import RoboBrowser
browser = RoboBrowser(history=True,user_agent='Mozilla/5.0')
login_url = 'https://ravenpack.com/discovery/login/'
browser.open(login_url)
form = browser.get_form(id='login_form')
form['username'].value = 'uname'
form['password'].value = 'pword'
browser.submit_form(form)
Any help is appreciated.
import requests
from requests.auth import HTTPBasicAuth

LOGIN_URL = 'https://ravenpack.com/discovery/login/'
DATA_URL = 'https://ravenpack.com/discovery/news_analytics_story/FFF4BFD4F4D4FF803852899BD1F02077/'

username = 'user'
password = 'password'

with requests.Session() as s:
    s.post(LOGIN_URL, auth=HTTPBasicAuth(username, password))
    r = s.get(DATA_URL)
    print r.text

I am trying to scrape HTML from a site that requires a login but am not getting any data

I am following this tutorial, but I can't seem to get any data when I run the Python script. I get an HTTP status code of 200, and status.ok returns True. Any help would be great. This is what my response looks like in the terminal:
[]
200
True
import requests
from lxml import html

USERNAME = "username@email.com"
PASSWORD = "legitpassword"
LOGIN_URL = "https://bitbucket.org/account/signin/?next=/"
URL = "https://bitbucket.org/dashboard/overview"

def main():
    session_requests = requests.session()

    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]

    # Create payload
    payload = {
        "username": USERNAME,
        "password": PASSWORD,
        "csrfmiddlewaretoken": authenticity_token
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload, headers=dict(referer=LOGIN_URL))

    # Scrape url
    result = session_requests.get(URL, headers=dict(referer=URL))
    tree = html.fromstring(result.content)
    bucket_elems = tree.findall(".//span[@class='repo-name']")
    bucket_names = [bucket_elem.text_content().replace("\n", "").strip() for bucket_elem in bucket_elems]
    print bucket_names
    print result.status_code

if __name__ == '__main__':
    main()
The xpath is wrong, there is no span with the class repo-name, you can get the repo names from the anchor tags with:
bucket_elems = tree.xpath("//a[#class='execute repo-list--repo-name']")
bucket_names = [bucket_elem.text_content().strip() for bucket_elem in bucket_elems]
The html has obviously changed since the tutorial was written.
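When a tutorial's selectors go stale like this, it is worth testing the XPath against HTML you control before pointing it at the live response. A self-contained check of the corrected selector (the markup is a made-up approximation of Bitbucket's repo list):

```python
from lxml import html

# Invented approximation of the dashboard markup
snippet = """
<div>
  <a class="execute repo-list--repo-name" href="/team/alpha">alpha</a>
  <a class="execute repo-list--repo-name" href="/team/beta">beta</a>
</div>
"""

tree = html.fromstring(snippet)
# Exact-match on the full class attribute string, as in the answer above
bucket_elems = tree.xpath("//a[@class='execute repo-list--repo-name']")
bucket_names = [e.text_content().strip() for e in bucket_elems]
print(bucket_names)  # -> ['alpha', 'beta']
```

In real scraping it also helps to save result.content to a file and run the XPath against that, so you see the HTML the server actually sent to requests rather than what the browser renders.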

how to pass search key and get result through bs4

import bs4
import requests
from urllib.parse import urljoin

# The original signature had a string literal where the first parameter belongs;
# presumably that parameter is the search-page URL.
def get_main_page_url(urlparameter, strDestPath, strMD5):
    # e.g. urlparameter = "https://malwr.com/analysis/search/"
    base_url = 'https://malwr.com/'
    url = 'https://malwr.com/account/login/'
    username = 'myname'
    password = 'pswd'

    session = requests.Session()

    # getting csrf value
    response = session.get(url)
    soup = bs4.BeautifulSoup(response.content)
    form = soup.form
    csrf = form.find('input', attrs={'name': 'csrfmiddlewaretoken'}).get('value')

    # logging in
    data = {
        'username': username,
        'password': password,
        'csrfmiddlewaretoken': csrf
    }
    session.post(url, data=data)

    # getting analysis data
    response = session.get(urlparameter)
    soup = bs4.BeautifulSoup(response.content)
    form = soup.form
    csrf = form.find('input', attrs={'name': 'csrfmiddlewaretoken'}).get('value')
    data = {
        'search': strMD5,
        'csrfmiddlewaretoken': csrf
    }
    session.post(urlparameter, data=data)
    response = session.get(urlparameter)
    soup = bs4.BeautifulSoup(response.content)
    print(soup)
    if soup.find('section', id='file').find('table')('tr')[-1].a is not None:
        link = soup.find('section', id='file').find('table')('tr')[-1].a.get('href')
        link = urljoin(base_url, link)
        webFile = session.get(link)
        filename = link.split('/')[-2]
        filename = strDestPath + filename
        localFile = open(filename, 'wb')
        localFile.write(webFile.content)
        webFile.close()
        localFile.close()
I am able to log in by extracting the csrftoken. Then I try to submit the MD5 to search on malwr.com, but I am not able to get the page with the search results for the MD5 I sent.
I want to search for the MD5 that I pass along with the csrftoken.
Please let me know what is wrong in the code.
You've done almost everything correctly, except that you need to pass the result of the POST request to BeautifulSoup. Replace:
session.post(urlparameter, data = data)
response = session.get(urlparameter)
with:
response = session.post(urlparameter, data=data)
Worked for me (I had an account at malwr).
