Python Requests: Accepting a TOS before accessing a page

I am a newbie trying to write a python script to scrape some information from a website. I need to get to the search page of the website, but on a new session it will redirect you to a TOS acceptance page. You click yes or no to accept, and then it will move you to the search page. Here is my code:
import requests
s = requests.Session()
page = s.get("http://probate.cuyahogacounty.us/pa/CaseSearch.aspx")
if 'TOS.aspx' in page.url:
    print("Attempt to agree to TOS")
    yesBtn = {'ctl00$mpContentPH$btnYes': 'Yes'}
    r = s.post(page.url, data=yesBtn)
    r2 = s.get("http://probate.cuyahogacounty.us/pa/CaseSearch.aspx")
    print(r.url)
    print(r2.url)
Both r and r2 kick me back to the TOS URL. Help!!

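This kind of website needs a cookie jar or some other object to store the session. The TOS form also carries hidden fields (__VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION) that have to be sent back together with the button value.
Try this: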
import requests
import lxml.html
base_url = 'http://probate.cuyahogacounty.us'
with requests.Session() as s:
    url = base_url + '/pa/CaseSearch.aspx'
    # Do not follow the redirect, so the TOS location can be read from the header
    resp = s.get(url, allow_redirects=False)
    url_tos = base_url + resp.headers['Location']
    resp = s.get(url_tos)
    # Pull the hidden ASP.NET state fields out of the TOS page
    root = lxml.html.fromstring(resp.text)
    vgenerator = root.xpath('//*[@id="__VIEWSTATEGENERATOR"]//@value')[0]
    viewstate = root.xpath('//*[@id="__VIEWSTATE"]//@value')[0]
    eventvalidation = root.xpath('//*[@id="__EVENTVALIDATION"]//@value')[0]
    data = {
        'ajax_HiddenField': '',
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__VIEWSTATE': viewstate,
        '__VIEWSTATEGENERATOR': vgenerator,
        '__EVENTVALIDATION': eventvalidation,
        'ctl00$mpContentPH$btnYes': 'Yes'
    }
    # Send the state fields back together with the "Yes" button
    r = s.post(url_tos, data=data)
    print(r.text)
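The answer above hard-codes three hidden field names. As a rough sketch of the same idea, the helper below collects every hidden input from a page using only the standard library, so the field names do not have to be known in advance. The names `HiddenFieldParser` and `hidden_fields`, and the sample HTML, are made up for illustration; they are not part of requests or lxml.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # A missing value attribute is submitted as an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html_text):
    """Return a dict of all hidden form fields found in html_text."""
    parser = HiddenFieldParser()
    parser.feed(html_text)
    return parser.fields


# Made-up sample standing in for the TOS page.
sample = """
<form method="post" action="TOS.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="submit" name="ctl00$mpContentPH$btnYes" value="Yes" />
</form>
"""

data = hidden_fields(sample)
data["ctl00$mpContentPH$btnYes"] = "Yes"  # the button that was "clicked"
print(data)
```

With a real session, the dict would be built from `hidden_fields(resp.text)` and passed as `data=` to `s.post(url_tos, data=data)`.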

Related

Python beautifulsoup4 - Web scraping (Pfsense) - Create an user

I'm trying to log in to the pfSense web interface and create a user.
Currently I can log in and read pages in pfSense, but I cannot create users.
import requests
from bs4 import BeautifulSoup
with requests.Session() as s:
    # First CSRF token, for login
    URL1 = "http://myipaddress/index.php"
    html_text1 = requests.get(URL1).text
    soup1 = BeautifulSoup(html_text1, 'lxml')
    token1 = soup1.find('input', {'name':'__csrf_magic'})['value']
    payload1 = {'__csrf_magic': token1,'usernamefld':'myusername','passwordfld':'mypassword','login':'Sign In'}
    s.post(URL1, data=payload1)
    # CSRF token for creating a user
    URL2 = "http:///myipaddress/system_usermanager.php?act=new"
    html_text2 = requests.get(URL2).text
    soup2 = BeautifulSoup(html_text2, 'lxml')
    token2 = soup2.find('input', {'name':'__csrf_magic'})['value']
    payload2 = {'__csrf_magic': token2, 'usernamefld' : 'Robert', 'passwordfld1' : 'root', 'passwordfld2' : 'root', 'descr':'Robert DELAFONDU', 'groups[]':'Simple','utype':'user','save':'Save'}
    s.post(URL2, data=payload2)
    # Print the users in the group
    URL = "http://myipaddress/system_groupmanager.php?act=edit&groupid=0"
    html_text = s.get(URL).text
    soup = BeautifulSoup(html_text, 'lxml')
    users = soup.find_all("option")
    for user in users:
        print(user.text)
This script prints all of the users.
I also tried sending every field from the create-user form in the payload, but it didn't work.
I log in to pfSense and fetch a CSRF token for the login; I then expect to be able to create a user, but I don't know how.
Can anyone help me understand how this works?

BeautifulSoup issues logging in to a website with a requests session

I am using the code below to log in to a website so that I can scrape data from my own profile page.
However, even after I make a GET request to the profile URL, the soup only contains data from the login page.
I still haven't been able to find the reason for that.
import requests
from requests import session
from bs4 import BeautifulSoup
login_url = 'https://caicara.pizzanapoles.com.br/Account/Login'
url_perfil = 'https://caicara.pizzanapoles.com.br/AdminCliente'
payload = {
    'username': 'MY_USERNAME',
    'password': 'MY_PASSWORD'
}
with requests.session() as s:
    s.post(login_url, data=payload)
    r = requests.get(url_perfil)
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup.title)

Firstly you need to use your session object s for all the requests.
r = requests.get(url_perfil)
changes to
r = s.get(url_perfil)
A __RequestVerificationToken is sent in the POST data when you try to login - you may need to send it too.
It is present inside the HTML of the login_url
<input name="__RequestVerificationToken" value="..."
This means you .get() the login page - extract the token - then send your .post()
r = s.get(login_url)
soup = BeautifulSoup(r.content, 'html.parser')
token = soup.find('input', {'name': '__RequestVerificationToken'})['value']
payload['__RequestVerificationToken'] = token
r1 = s.post(login_url, data=payload)
r2 = s.get(url_perfil)
You may want to save each request into its own variable for further debugging.

Thank you, Karl, for your reply,
but it didn't work.
I changed my code using the tips you mentioned above.
import requests
from bs4 import BeautifulSoup
login_url = 'https://caicara.pizzanapoles.com.br/Account/Login'
url = 'https://caicara.pizzanapoles.com.br/AdminCliente'
data = {
    'username': 'myuser',
    'password': 'mypass',
}
with requests.session() as s:
    r = s.get(login_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    token = soup.find('input', name='__RequestVerificationToken')['value_of_my_token']
    payload['__RequestVerificationToken'] = token
    r1 = s.post(login_url, data=payload)
    r2 = s.get(url_perfil)
However, it returns the error below.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-140-760e35f7b327> in <module>
13
14 soup = BeautifulSoup(r.content, 'html.parser')
---> 15 token = soup.find('input', name='__RequestVerificationToken')['QHlUQaro9sNo4lefL59lQRtbuziHnHtolV7Xm_Et_3tvnZKZnS4gjBBJZakw7crW0dyXy_lok44RozrMAvWm61XXGla5tC3AuZlgXC4GukA1']
16
17 payload['__RequestVerificationToken'] = token
TypeError: find() got multiple values for argument 'name'
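The traceback comes from how the call is written rather than from the token itself. In BeautifulSoup, the first positional parameter of `find()` is itself called `name` (the tag name), so passing `'input'` positionally and `name=...` as a keyword supplies the same parameter twice. The stand-in function below is hypothetical: it only mimics the leading parameters of `find()` so the error can be reproduced without the site or the library.

```python
def find(name=None, attrs=None, **kwargs):
    """Hypothetical stand-in mimicking the leading parameters of BeautifulSoup's find()."""
    attrs = dict(attrs or {})
    attrs.update(kwargs)  # extra keywords are treated as attribute filters
    return name, attrs


# 'input' already fills the `name` parameter, so name=... collides with it.
try:
    find('input', name='__RequestVerificationToken')
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Passing the HTML attribute through attrs avoids the clash.
tag_name, filters = find('input', attrs={'name': '__RequestVerificationToken'})
print(tag_name, filters)
```

With the real library, the line from the earlier answer avoids both problems: `token = soup.find('input', {'name': '__RequestVerificationToken'})['value']`. The subscript is the attribute name `'value'`, not the token string itself.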

python requests POST error, session issue?

I am trying to mimic the following browser actions via python's requests:
Land on https://www.bundesanzeiger.de/pub/en/to_nlp_start
Click "More search options"
Click checkbox "Also find historicised data" (corresponds to POST param: isHistorical: true)
Click button "Search net short positions"
Click button "Als CSV herunterladen" to download csv file
This is the code I have to simulate this:
import requests
import re
s = requests.Session()
r = s.get("https://www.bundesanzeiger.de/pub/en/to_nlp_start", verify=False, allow_redirects=True)
matches = re.search(
    r'form class="search-form" id=".*" method="post" action="\.(?P<appendtxt>.*)"',
    r.text
)
request_url = f"https://www.bundesanzeiger.de/pub/en{matches.group('appendtxt')}"
# Post through the session object s created above
sr = s.post(request_url, data={'isHistorical': 'true', 'nlp-search-button': 'Search net short positions'}, allow_redirects=True)
However, even though sr gives me a status_code 200, it's really an error when I check sr.url, which shows https://www.bundesanzeiger.de/pub/en/error-404?9
Digging a bit deeper, I noticed that request_url above resolves to something like
https://www.bundesanzeiger.de/pub/en/nlp;wwwsid=EFEB15CD4ADC8932A91BA88B561A50E9.web07-pub?0-1.-nlp~filter~form~panel-form
but when I check the request url in Chrome, it's actually
https://www.bundesanzeiger.de/pub/en/nlp?87-1.-nlp~filter~form~panel-form
The 87 here seems to change, suggesting it's some session ID, but when I'm doing this using requests it doesn't appear to resolve properly.
Any idea what I'm missing here?

You can try this script to download the CSV file:
import requests
from bs4 import BeautifulSoup
url = 'https://www.bundesanzeiger.de/pub/en/to_nlp_start'
data = {
    'fulltext': '',
    'positionsinhaber': '',
    'ermittent': '',
    'isin': '',
    'positionVon': '',
    'positionBis': '',
    'datumVon': '',
    'datumBis': '',
    'isHistorical': 'true',
    'nlp-search-button': 'Search+net+short+positions'
}
headers = {
    'Referer': 'https://www.bundesanzeiger.de/'
}
with requests.session() as s:
    # Load the search page and locate the filter form's relative action
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    action = soup.find('form', action=lambda t: 'nlp~filter~form~panel-for' in t)['action']
    u = 'https://www.bundesanzeiger.de/pub/en' + action.strip('.')
    # Submit the search, then follow the CSV download link on the results page
    soup = BeautifulSoup(s.post(u, data=data, headers=headers).content, 'html.parser')
    a = soup.select_one('a[title="Download as CSV"]')['href']
    a = 'https://www.bundesanzeiger.de/pub/en' + a.strip('.')
    print(s.get(a, headers=headers).content.decode('utf-8-sig'))
Prints:
"Positionsinhaber","Emittent","ISIN","Position","Datum"
"Citadel Advisors LLC","LEONI AG","DE0005408884","0,62","2020-08-21"
"AQR Capital Management, LLC","Evotec SE","DE0005664809","1,10","2020-08-21"
"BlackRock Investment Management (UK) Limited","thyssenkrupp AG","DE0007500001","1,50","2020-08-21"
"BlackRock Investment Management (UK) Limited","Deutsche Lufthansa Aktiengesellschaft","DE0008232125","0,75","2020-08-21"
"Citadel Europe LLP","TAG Immobilien AG","DE0008303504","0,70","2020-08-21"
"Davidson Kempner European Partners, LLP","TAG Immobilien AG","DE0008303504","0,36","2020-08-21"
"Maplelane Capital, LLC","VARTA AKTIENGESELLSCHAFT","DE000A0TGJ55","1,15","2020-08-21"
...and so on.
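The answer builds the POST URL by stripping the leading dot from the form's relative `action` and prefixing the site path. A slightly more general way to do that step, sketched here under the assumption that the action is relative to the page it came from, is `urllib.parse.urljoin` from the standard library. The `action` value below is shaped like the URL quoted in the question; the numeric prefix changes per session, so in practice it has to be read from the page each time.

```python
from urllib.parse import urljoin

page_url = "https://www.bundesanzeiger.de/pub/en/to_nlp_start"

# Example relative action, shaped like the one quoted in the question.
action = "./nlp?87-1.-nlp~filter~form~panel-form"

# urljoin resolves "./" against the directory of page_url.
post_url = urljoin(page_url, action)
print(post_url)
```

The same call also handles absolute paths such as `/login` and fully qualified URLs, which plain string concatenation does not.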

If you check https://www.bundesanzeiger.de/robots.txt, you'll see that this website does not like to be indexed. The website could be denying access to the default user agent used by bots. This might help: Python requests vs. robots.txt
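To check this kind of thing from a script, the standard library's `urllib.robotparser` can evaluate robots.txt rules for a given user agent. The rules below are an invented example, not the actual contents of the site's robots.txt, and `my-scraper` and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only.
rules = """
User-agent: *
Disallow: /pub/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a given user agent may fetch a given URL under these rules.
allowed_search = parser.can_fetch("my-scraper", "https://www.example.com/pub/en/search")
allowed_home = parser.can_fetch("my-scraper", "https://www.example.com/index.html")
print(allowed_search, allowed_home)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` fetches the file instead of calling `parse()`. A custom `User-Agent` can be sent with requests through the `headers` argument, in the same way the answer above sends `Referer`.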

Authentication results in 404 code

There is a website I need to scrape, but before I do I need to login.
There seem to be three things I need in order to log in: the username, the password and the authenticity token. I know the username and password, but I am not sure how to access the token.
This is what I have tried:
import requests
from lxml import html
login_url = "https://urs.earthdata.nasa.gov/home"
session_requests = requests.session()
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='authenticity_token']/@value")))[0]
payload = {"username": "my_name",
           "password": "my_password",
           "authenticity_token": authenticity_token}
result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)
print(result)
This results in :
<Response [404]>
My name and password are entered correctly, so it must be the token that is going wrong. I think the problem is this line:
authenticity_token = list(set(tree.xpath("//input[@name='authenticity_token']/@value")))[0]
or this line:
payload = {"username": "my_name",
           "password": "my_password",
           "authenticity_token": authenticity_token}
Looking at the source code of the webpage, I noticed there is an authenticity_token, a csrf-token and a csrf-param. So it's possible these are in the wrong order, but I tried all the combinations.
EDIT:
Here is a beautiful soup approach that results in 404 again.
s = requests.session()
response = s.get(login_url)
soup = BeautifulSoup(response.text, "lxml")
for n in soup('input'):
    if n['name'] == 'authenticity_token':
        token = n['value']
    if n['name'] == 'utf8':
        utf8 = n['value']
        break
auth = {
    'username': 'my_username'
    , 'password': 'my_password'
    , 'authenticity_token': token
    , 'utf8' : utf8
}
s.post(login_url, data=auth)

If you inspect the page you'll notice that the form's action value is '/login', so you have to submit your data to https://urs.earthdata.nasa.gov/login.
import requests
from bs4 import BeautifulSoup
login_url = "https://urs.earthdata.nasa.gov/login"
home_url = "https://urs.earthdata.nasa.gov/home"
s = requests.session()
soup = BeautifulSoup(s.get(home_url).text, "lxml")
# Collect every input field (including the hidden tokens) with its default value
data = {i['name']: i.get('value', '') for i in soup.find_all('input')}
data['username'] = 'my_username'
data['password'] = 'my_password'
result = s.post(login_url, data=data)
print(result)
<Response [200]>
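The form's target can also be read from the HTML rather than by inspecting the page by hand. The sketch below uses the standard library's `html.parser` to find the first `<form>` and resolve its `action` against the page URL. The class name `FormActionParser`, the function `form_action` and the sample markup are assumptions made for the example; the real page may contain several forms, in which case the first one is not necessarily the login form.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionParser(HTMLParser):
    """Remember the action attribute of the first <form> encountered."""

    def __init__(self):
        super().__init__()
        self.action = None

    def handle_starttag(self, tag, attrs):
        if tag == "form" and self.action is None:
            # A form without an action submits to the page's own URL.
            self.action = dict(attrs).get("action") or ""


def form_action(page_url, html_text):
    """Return the absolute URL the first form on the page submits to."""
    parser = FormActionParser()
    parser.feed(html_text)
    if parser.action is None:
        raise ValueError("no <form> found")
    return urljoin(page_url, parser.action)


# Made-up markup standing in for the login page.
sample = '<form action="/login" method="post"><input name="username"></form>'
login_url = form_action("https://urs.earthdata.nasa.gov/home", sample)
print(login_url)
```

In a real script the second argument would be the text of the page fetched with the session, for example `s.get(home_url).text`.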
A quick example with selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
url = 'https://n5eil01u.ecs.nsidc.org/MOST/MOD10A1.006/'
driver.get(url)
# Selenium 4 removed the find_element_by_* helpers; use find_element with By
driver.find_element(By.NAME, 'username').send_keys('my_username')
driver.find_element(By.NAME, 'password').send_keys('my_password')
driver.find_element(By.ID, 'login').submit()
html = driver.page_source
driver.quit()

Unable to create Wiki page using python requests

I am trying to edit a page on a wiki that uses the MediaWiki software, but it is not working. I am able to log in successfully, but I am unable to edit pages. I am unsure what is causing this problem, as I included the edit token in my request. Here is my code:
import requests
from bs4 import BeautifulSoup as bs
def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n.get('value', '') for n in soup.find_all('input')
             if n.get('name', '') == 'wpLoginToken']
    return token[0]
def get_edit_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n.get('value', '') for n in soup.find_all('input')
             if n.get('name', '') == 'wpEditToken']
    return token[0]
# login
s = requests.Session()
values = {'wpName' : 'username',
          'wpPassword' : 'password',
          'wpLoginAttempt' : 'Log in',
          'wpForceHttps' : '1',
          'wpLoginToken' : ''
          }
url = '.....'
resp = s.get(url)
values['wpLoginToken'] = get_login_token(resp)
req = s.post(url, values)
# edit page
url1 = '.....'
editing = {'wpTextbox1' : 'hi there',
           'wpSave' : 'Save page',
           'wpSummary' : 'hi',
           'wpEditToken' : ''}
resp = s.get(url1)
editing['wpEditToken'] = get_edit_token(resp)
edit = s.post(url1, editing)
print(edit.url)
print(edit.content)
