python requests POST error, session issue?

I am trying to mimic the following browser actions via python's requests:
Land on https://www.bundesanzeiger.de/pub/en/to_nlp_start
Click "More search options"
Click checkbox "Also find historicised data" (corresponds to POST param: isHistorical: true)
Click button "Search net short positions"
Click button "Als CSV herunterladen" to download csv file
This is the code I have to simulate this:
import requests
import re
s = requests.Session()
r = s.get("https://www.bundesanzeiger.de/pub/en/to_nlp_start", verify=False, allow_redirects=True)
matches = re.search(
r'form class="search-form" id=".*" method="post" action="\.(?P<appendtxt>.*)"',
r.text
)
request_url = f"https://www.bundesanzeiger.de/pub/en{matches.group('appendtxt')}"
sr = s.post(request_url, data={'isHistorical': 'true', 'nlp-search-button': 'Search net short positions'}, allow_redirects=True)
However, even though sr gives me a status_code of 200, it's really an error: checking sr.url shows https://www.bundesanzeiger.de/pub/en/error-404?9
Digging a bit deeper, I noticed that request_url above resolves to something like
https://www.bundesanzeiger.de/pub/en/nlp;wwwsid=EFEB15CD4ADC8932A91BA88B561A50E9.web07-pub?0-1.-nlp~filter~form~panel-form
but when I check the request url in Chrome, it's actually
https://www.bundesanzeiger.de/pub/en/nlp?87-1.-nlp~filter~form~panel-form
The 87 here seems to change, suggesting it's some session ID, but when I'm doing this using requests it doesn't appear to resolve properly.
Any idea what I'm missing here?

You can try this script to download the CSV file:
import requests
from bs4 import BeautifulSoup

url = 'https://www.bundesanzeiger.de/pub/en/to_nlp_start'

data = {
    'fulltext': '',
    'positionsinhaber': '',
    'ermittent': '',
    'isin': '',
    'positionVon': '',
    'positionBis': '',
    'datumVon': '',
    'datumBis': '',
    'isHistorical': 'true',
    'nlp-search-button': 'Search+net+short+positions'
}

headers = {
    'Referer': 'https://www.bundesanzeiger.de/'
}

with requests.Session() as s:
    # load the search page and pull the form's POST action URL out of it
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    action = soup.find('form', action=lambda t: 'nlp~filter~form~panel-form' in t)['action']
    u = 'https://www.bundesanzeiger.de/pub/en' + action.strip('.')

    # submit the search form, then follow the "Download as CSV" link
    soup = BeautifulSoup(s.post(u, data=data, headers=headers).content, 'html.parser')
    a = soup.select_one('a[title="Download as CSV"]')['href']
    a = 'https://www.bundesanzeiger.de/pub/en' + a.strip('.')
    print(s.get(a, headers=headers).content.decode('utf-8-sig'))
Prints:
"Positionsinhaber","Emittent","ISIN","Position","Datum"
"Citadel Advisors LLC","LEONI AG","DE0005408884","0,62","2020-08-21"
"AQR Capital Management, LLC","Evotec SE","DE0005664809","1,10","2020-08-21"
"BlackRock Investment Management (UK) Limited","thyssenkrupp AG","DE0007500001","1,50","2020-08-21"
"BlackRock Investment Management (UK) Limited","Deutsche Lufthansa Aktiengesellschaft","DE0008232125","0,75","2020-08-21"
"Citadel Europe LLP","TAG Immobilien AG","DE0008303504","0,70","2020-08-21"
"Davidson Kempner European Partners, LLP","TAG Immobilien AG","DE0008303504","0,36","2020-08-21"
"Maplelane Capital, LLC","VARTA AKTIENGESELLSCHAFT","DE000A0TGJ55","1,15","2020-08-21"
...and so on.
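If you'd rather save the CSV to disk than print it, you could swap the final print for something like this (a small variation on the last line above; the filename short_positions.csv is just an example):

# inside the same with-block: write the raw CSV bytes to a local file
with open('short_positions.csv', 'wb') as f:
    f.write(s.get(a, headers=headers).content)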

If you check https://www.bundesanzeiger.de/robots.txt, this website does not like to be indexed. The site could be denying access to the default user agent used by bots. This might help: Python requests vs. robots.txt
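If that is the culprit, one sketch is to set a browser-like User-Agent on the session so every request carries it (the exact UA string below is arbitrary):

import requests

s = requests.Session()
# some sites block the default 'python-requests/x.y.z' user agent,
# so pretend to be a regular browser
s.headers['User-Agent'] = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/85.0.4183.83 Safari/537.36')
r = s.get('https://www.bundesanzeiger.de/pub/en/to_nlp_start')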

Related

BeautifulSoup issues logging in to a website with a requests session

I am using the code below to log in to a website and scrape data from my own profile page.
However, even after I GET the URL of the profile page, the selector (soup) only returns data from the login page.
I still haven't been able to find the reason for that.
import requests
from bs4 import BeautifulSoup

login_url = 'https://caicara.pizzanapoles.com.br/Account/Login'
url_perfil = 'https://caicara.pizzanapoles.com.br/AdminCliente'

payload = {
    'username': 'MY_USERNAME',
    'password': 'MY_PASSWORD'
}

with requests.session() as s:
    s.post(login_url, data=payload)
    r = requests.get(url_perfil)  # note: this bypasses the session (see the answer below)
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup.title)
Firstly you need to use your session object s for all the requests.
r = requests.get(url_perfil)
changes to
r = s.get(url_perfil)
A __RequestVerificationToken is sent in the POST data when you try to log in - you may need to send it too.
It is present inside the HTML of the login_url
<input name="__RequestVerificationToken" value="..."
This means you .get() the login page - extract the token - then send your .post()
r = s.get(login_url)
soup = BeautifulSoup(r.content, 'html.parser')
token = soup.find('input', {'name': '__RequestVerificationToken'})['value']
payload['__RequestVerificationToken'] = token
r1 = s.post(login_url, data=payload)
r2 = s.get(url_perfil)
You may want to save each request into its own variable for further debugging.
Thank you Karl for your reply, but it didn't work.
I changed my code using the tips you mentioned above.
import requests
from bs4 import BeautifulSoup

login_url = 'https://caicara.pizzanapoles.com.br/Account/Login'
url = 'https://caicara.pizzanapoles.com.br/AdminCliente'

data = {
    'username': 'myuser',
    'password': 'mypass',
}

with requests.session() as s:
    r = s.get(login_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    token = soup.find('input', name='__RequestVerificationToken')['value_of_my_token']
    payload['__RequestVerificationToken'] = token
    r1 = s.post(login_url, data=payload)
    r2 = s.get(url_perfil)
However, it returns the error below.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-140-760e35f7b327> in <module>
13
14 soup = BeautifulSoup(r.content, 'html.parser')
---> 15 token = soup.find('input', name='__RequestVerificationToken')['QHlUQaro9sNo4lefL59lQRtbuziHnHtolV7Xm_Et_3tvnZKZnS4gjBBJZakw7crW0dyXy_lok44RozrMAvWm61XXGla5tC3AuZlgXC4GukA1']
16
17 payload['__RequestVerificationToken'] = token
TypeError: find() got multiple values for argument 'name'
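The TypeError occurs because find()'s first positional parameter is already called name (the tag name), so passing name= as a keyword argument collides with it. Passing the attribute filter as a dict, as the answer above does, avoids the clash; roughly:

# pass the attribute filter as a dict so it doesn't collide with
# find()'s own 'name' parameter, then read the tag's 'value' attribute
token = soup.find('input', {'name': '__RequestVerificationToken'})['value']
data['__RequestVerificationToken'] = token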

How to get the search result from BeautifulSoup?

I am not super used to BeautifulSoup yet (even though it is super useful). My question is: if I have a website like this
https://bigd.big.ac.cn/dogsdv2/pages/modules/indsnp/indsnp_search.jsp
and I were to get results by passing P2RY12 into the "gene name" input box, what do I need to do?
Also, in general, if I want to get a search result from a certain website, what do I need to do?
If you open the Firefox/Chrome developer tools, you can observe where the page is making requests. When you type P2RY12 into the search box and click the submit button, the page makes a POST request to http://bigd.big.ac.cn/dogsdv2/indsnp/searchIndSNPSingle.action.
In general, you need to know the URL and parameters sent to the URL to get any information back.
This example grabs some information from the first page of results:
import requests
from bs4 import BeautifulSoup

url = 'http://bigd.big.ac.cn/dogsdv2/indsnp/searchIndSNPSingle.action'

data = {
    'totalCount': -1,
    'searchForm.chrom': 0,
    'searchForm.start': '',
    'searchForm.end': '',
    'searchForm.rsid': '',
    'searchForm.popu': 0,
    'searchForm.geneid': '',
    'searchForm.genename': 'P2RY12',
    'searchForm.goterm': '',
    'searchForm.gokeyword': '',
    'searchForm.limitFlag': 1,
    'searchForm.numlimit': 1000
}

headers = {
    'Referer': 'https://bigd.big.ac.cn/dogsdv2/pages/modules/indsnp/indsnp_search.jsp',
}

soup = BeautifulSoup(requests.post(url, data=data, headers=headers).text, 'html.parser')

# each result row keeps the SNP ID, position, locations and genotype
# inside its third <td> cell
for td in soup.select('table.table7 tr > td:nth-child(3)'):
    a = td.select_one('a')
    print('SNP ID:', a.get_text(strip=True))
    t1 = a.find_next_sibling('br').find_next_sibling(text=True)
    print('Position:', t1.strip())
    print('Location:', ', '.join(l.get_text(strip=True) for l in t1.find_next_siblings('a')))
    print('Genotype:', a.find_next_siblings('br')[2].find_next_sibling(text=True).strip())
    print('-' * 80)
Prints:
SNP ID: cfa19627795
Position: Chr23:45904511
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: G
--------------------------------------------------------------------------------
SNP ID: cfa19627797
Position: Chr23:45904579
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: C
--------------------------------------------------------------------------------
SNP ID: cfa19627803
Position: Chr23:45904842
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: C
--------------------------------------------------------------------------------
...and so on.

Unable to let my script generate a few values automatically to be used within the payload

I've created a script to get the html elements from a target page by sending two https requests in sequence. My script can do this flawlessly. However, I had to copy four values from Chrome dev tools to fill in the four keys within payload in order to send the final http request that reaches the target page. This is the starting link, and the following describes how I can reach the target page.
Click on the Find Hotel button (no need to change dates if the check-out date is, by default, at least one day later than the check-in date).
Tick the box like the image below and press the Book Now button just above it. Now it should lead you to the target page automatically.
Upon reaching the target page, titled Enter Guest Details, parse the html elements from there.
Here is what I've tried (the working version):
import requests
from bs4 import BeautifulSoup

url = 'https://booking.discoverqatar.qa/SearchHandler.aspx?'
second_url = 'https://booking.discoverqatar.qa/PassengerDetails.aspx?'

params = {
    'Module': 'H', 'txtCity': '', 'hdnCity': '2947', 'txtHotel': '', 'hdnHotel': '',
    'fromDate': '05/11/2019', 'toDate': '07/11/2019', 'selZone': '', 'minSelPrice': '',
    'maxSelPrice': '', 'roomConfiguration': '2|0|', 'noOfRooms': '1',
    'hotelStandardArray': '63,60,54,50,52,51', 'CallFrom': '', 'DllNationality': '-1',
    'HdnNoOfRooms': '-1', 'SourceXid': 'MTEzNzg=', 'mdx': ''
}

payload = {
    'CallFrom': 'MToxNjozOCBQTXxCMkN8MToxNjozOCBQTQ==',
    'Btype': 'MToxNjozOCBQTXxBfDE6MTY6MzggUE0=',
    'PaxConfig': 'MToxNjozOCBQTXwyfDB8MnwwfHwxOjE2OjM4IFBN',
    'usid': 'MToxNjozOCBQTXxoZW54dmkzcWVnc3J3cXpld2lsa2ZwMm18MToxNjozOCBQTQ=='
}

with requests.Session() as s:
    r = s.get(url, params=params, headers={"User-agent": "Mozilla/5.0"})
    res = s.get(second_url, params=payload, headers={
        "User-agent": "Mozilla/5.0",
        "Referer": r.url
    })
    soup = BeautifulSoup(res.text, 'lxml')
    print(soup)
In the above script I've copied and pasted the values of CallFrom, Btype, PaxConfig and usid from dev tools to use within payload.
How can I fill in these values automatically?
The params sent in the second request are Base64 encoded; after decoding, they are:
'CallFrom':'1:16:38 PM|B2C|1:16:38 PM',
'Btype':'1:16:38 PM|A|1:16:38 PM',
'PaxConfig':'1:16:38 PM|2|0|2|0||1:16:38 PM',
'usid':'1:16:38 PM|henxvi3qegsrwqzewilkfp2m|1:16:38 PM'
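(You can reproduce the decoding yourself with Python's standard base64 module, using one of the captured values above:)

import base64

# decode one captured parameter to inspect its structure
print(base64.b64decode('MToxNjozOCBQTXxCMkN8MToxNjozOCBQTQ==').decode('utf-8'))
# -> 1:16:38 PM|B2C|1:16:38 PM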
At first glance, you can already see that they follow the pattern:
$date|$param|$date
where $date is the current UTC time in the format utc_ts_now.strftime("%I:%M:%S %p").
As for the $param section of these four parameters, I guess it is fixed for CallFrom and Btype; usid is the session key, which you can find easily in the previous response.
PaxConfig is the guest count; it's related to the roomConfiguration you sent in the first request.
To automate the second request, generate the decoded value for each parameter first, then encode it with Base64.
Update:
#!/usr/bin/env python3.7
import base64
from datetime import datetime

import requests


def first_request(session, params):
    url = 'https://booking.discoverqatar.qa/SearchHandler.aspx'
    r = session.get(url, params=params)
    return r


def second_request(session, params):
    url = 'https://booking.discoverqatar.qa/PassengerDetails.aspx'
    r = session.get(url, params=params)
    return r


def main():
    params1 = {
        'Module': 'H',
        'txtCity': '',
        'hdnCity': '2947',
        'txtHotel': '',
        'hdnHotel': '',
        'fromDate': '05/11/2019',
        'toDate': '07/11/2019',
        'selZone': '',
        'minSelPrice': '',
        'maxSelPrice': '',
        'roomConfiguration': '2|0|',
        'noOfRooms': '1',
        'hotelStandardArray': '63,60,54,50,52,51',
        'CallFrom': '',
        'DllNationality': '-1',
        'HdnNoOfRooms': '-1',
        'SourceXid': 'MTEzNzg=',
        'mdx': ''
    }

    session = requests.Session()
    _ = first_request(session, params1)
    asp_session = session.cookies.get("ASP.NET_SessionId")

    params2 = {
        # Could be related to options "Available" / "On Request"
        "Btype": "A",
        # Try out other guest counts to make sure
        "PaxConfig": params1["roomConfiguration"] * 2,
        "CallFrom": "B2C",
        "usid": asp_session
    }

    # wrap every value in the $date|$param|$date pattern, then Base64 encode it
    date = datetime.utcnow().strftime("%I:%M:%S %p")
    for k, v in params2.items():
        v = "|".join([date, v, date])
        v = base64.b64encode(bytes(v, "utf-8")).decode("utf-8")
        params2[k] = v

    r = second_request(session, params2)
    print(r.text)


if __name__ == '__main__':
    main()

How to select value from dropdown item using requests in Python 3?

I want to scrape data from the website https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx. I have to select the State as Delhi and the District as Adarsh Nagar (4), click the Search button, and scrape all the information.
So far I have tried the code below.
import requests
from bs4 import BeautifulSoup
# An 'HTTPS 443 SSL' error was coming up, which I resolved using verify=False
resp = requests.get('https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx',verify=False)
soup = BeautifulSoup(resp.text,"lxml")
dictinfo = {i['name']:i.get('value','') for i in soup.select('input[name]')}
dictinfo['ddlState']='Delhi'
dictinfo['ddldistrict']='Adarsh Nagar (4)'
dictinfo['__EVENTTARGET']='btnSearch'
dictinfo = {k:(None,str(v)) for k,v in dictinfo.items()}
r=requests.post('https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx',verify=False,files=dictinfo)
Checking r gives Error: Response [500], and soup2 shows this error:
Invalid postback or callback argument. Event validation is enabled using <pages enableEventValidation="true"/> in configuration or <%@ Page EnableEventValidation="true" %> in a page. For security purposes, this feature verifies that arguments to postback or callback events originate from the server control that originally rendered them. If the data is valid and expected, use the ClientScriptManager.RegisterForEventValidation method in order to register the postback or callback data for validation.
Can someone please help me scrape it or get it done?
(I can only use the requests & BeautifulSoup libraries; no Selenium, Mechanize, etc.)
Try the script below to get the tabular results that are populated by choosing the two dropdown items you stated above. It turns out that you have to make two subsequent POST requests to populate the results.
import requests
from bs4 import BeautifulSoup
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = 'https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'

    # first POST: select the state, carrying over all hidden form fields
    resp = s.get(url, verify=False)
    soup = BeautifulSoup(resp.text, "lxml")
    dictinfo = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    dictinfo['ddlState'] = 'DL'
    res = s.post(url, data=dictinfo)

    # second POST: select the district, using the refreshed hidden fields
    soup_obj = BeautifulSoup(res.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup_obj.select('input[name]')}
    payload['ddldistrict'] = 'ADN'
    r = s.post(url, data=payload)

    sauce = BeautifulSoup(r.text, "lxml")
    for items in sauce.select("#dgDisplay tr"):
        data = [item.get_text(strip=True) for item in items.select("td")]
        print(data)
Output you may see in the console:
['Firm Name', 'City', 'Licences', 'Reg. Pharmacists / Comp. Person']
['A ONE MEDICOS', 'DELHI-251/1, GALI NO.1, KH, NO, 739/251/1, NEAR HIMACHAL BHAWAN,SARAI PIPAL THALA, VILLAGE AZAD PUR,', 'R - 2', 'virender kumar, DPH, [22295-17/10/2013]']
['AAROGYAM', 'DELHI-PVT. SHOP NO. 1, GF, 121,VILLAGE BHAROLA', 'R - 2', 'avinesh bhadoriya, DPH, [27033-]']
['ABCO INDIA', 'DELHI-SHOP NO-452/22,BHUSHAN BHAWAN RING ROAD,FLYOVER AZAD PUR', 'W - 2', 'sanjay dubey , SSC, [C-P-03/01/1997]']
['ADARSH MEDICOS', 'DELHI-NORTHERN SIDE B-107, GALI NO. 1,,MAJLIS PARK, VILLAGE BHAROLA,', 'R - 2', 'dilip kumar, BPH, [28036-11/01/2018]']
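The key point is that each POST re-submits every hidden input (__VIEWSTATE, __EVENTVALIDATION, and friends) scraped from the previous response, which is exactly what the event-validation error quoted in the question demands.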

Python Requests: Accepting a TOS before accessing a page

I am a newbie trying to write a python script to scrape some information from a website. I need to get to the search page of the website, but on a new session it redirects you to a TOS acceptance page. You click Yes to accept, and then it moves you to the search page. Here is my code:
import requests

s = requests.Session()
page = s.get("http://probate.cuyahogacounty.us/pa/CaseSearch.aspx")

if 'TOS.aspx' in page.url:
    print("Attempt to agree to TOS")
    yesBtn = {'ctl00$mpContentPH$btnYes': 'Yes'}
    r = s.post(page.url, data=yesBtn)
    r2 = s.get("http://probate.cuyahogacounty.us/pa/CaseSearch.aspx")

print(r.url)
print(r2.url)
Both r and r2 kick me back to the TOS URL. Help!!
This kind of website needs a cookiejar or some "object" to store the session state.
Try this:
import requests
import lxml.html

base_url = 'http://probate.cuyahogacounty.us'

with requests.Session() as s:
    # follow the redirect to the TOS page manually
    url = base_url + '/pa/CaseSearch.aspx'
    resp = s.get(url, allow_redirects=False)
    url_tos = base_url + resp.headers['Location']
    resp = s.get(url_tos)

    # pull the ASP.NET hidden fields out of the TOS form
    root = lxml.html.fromstring(resp.text)
    vgenerator = root.xpath('//*[@id="__VIEWSTATEGENERATOR"]//@value')[0]
    viewstate = root.xpath('//*[@id="__VIEWSTATE"]//@value')[0]
    eventvalidation = root.xpath('//*[@id="__EVENTVALIDATION"]//@value')[0]

    data = {
        'ajax_HiddenField': '',
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__VIEWSTATE': viewstate,
        '__VIEWSTATEGENERATOR': vgenerator,
        '__EVENTVALIDATION': eventvalidation,
        'ctl00$mpContentPH$btnYes': 'Yes'
    }

    r = s.post(url_tos, data=data)
    print(r.text)
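From here the session s should carry the accepted-TOS cookie, so a follow-up s.get(base_url + '/pa/CaseSearch.aspx') should land on the actual search page rather than bouncing back to the TOS.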
