I'm using Python 3.3 and the Requests library to do a basic POST request.
I want to simulate what happens if you manually enter information into the browser from the webpage:
https://www.dspayments.com/FAIRFAX. For example, at that URL, enter "x" for the license plate and Virginia as the state. The URL then changes to https://www.dspayments.com/FAIRFAX/Home/PayOption and it displays the desired information (I care about the source code of this second page).
I looked through the source code of the two URLs above. Using "inspect element" on the text boxes of the first page, I found some fields that need to be included in the POST request: {'Plate':"x", 'PlateStateProv':"VA", "submit":"Search"}.
The second page (ending in /PayOption) had the raw HTML:
<form action="/FAIRFAX/Home/PayOption" method="post"><input name="__RequestVerificationToken" type="hidden" value="6OBKbiFcSa6tCqU8k75uf00m_byjxANUbacPXgK2evexESNDz_1cwkUpVVePA2czBLYgKvdEK-Oqk4WuyREi9advmDAEkcC2JvfG2VaVBWkvF3O48k74RXqx7IzwWqSB5PzIJ83P7C5EpTE1CwuWM9MGR2mTVMWyFfpzLnDfFpM1" /><div class="validation-summary-valid" data-valmsg-summary="true">
I then used the name/value pairs from the above HTML as the keys and values in the payload dictionary of my POST request. I think the problem is that the second URL contains the "__RequestVerificationToken", which seems to get a randomly generated value every time.
How can I properly POST to this website? A "correct" answer would be one that produces the same source code on the page ending in "/PayOption" as if you manually enter "x" as the plate number, choose Virginia as the state, and click Search on the first URL.
My code is:
import requests
url1 = r'https://www.dspayments.com/FAIRFAX'
url2 = r'https://www.dspayments.com/FAIRFAX/Home/PayOption'
s = requests.Session()
#GET request
r = s.get(url1)
text1 = r.text
startstr = '<input name="__RequestVerificationToken" type="hidden" value="'
start_ind = text1.find(startstr)+len(startstr)
end_ind = text1.find('"',start_ind)
auth_string = text1[start_ind:end_ind]
#POST request
payload = {'Plate':'x', 'PlateStateProv':'VA',"submit":"Search",
"__RequestVerificationToken":auth_string,"validation-summary-valid":"true"}
user_agent = {'User-Agent': 'Mozilla/5.0'}  # headers for the POST request
post = s.post(url2, headers=user_agent, data=payload)
source_code = post.text
Thanks, -K.
You should only need the data from the first page, and as you say, the __RequestVerificationToken changes with each request.
You'll have to do something like:
GET request to https://www.dspayments.com/FAIRFAX
harvest __RequestVerificationToken value (Requests Session will take care of any associated cookies)
POST using the data you scraped from the GET request
extract whatever you need from the 2nd page
So, just focus on building a payload that matches the form on the first page exactly. Have a stab at it (a rough sketch is below), and if you're still struggling I can help dig into the particulars.
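For example, a minimal sketch of that flow, assuming the fields you quoted from the form (Plate, PlateStateProv, submit, plus the hidden token) are the only ones it submits:

import requests
from bs4 import BeautifulSoup

url1 = 'https://www.dspayments.com/FAIRFAX'
url2 = 'https://www.dspayments.com/FAIRFAX/Home/PayOption'

with requests.Session() as s:
    # GET the search page; the Session keeps any cookies the site sets
    soup = BeautifulSoup(s.get(url1).text, 'html.parser')
    # pull the fresh hidden token out of the form
    token = soup.find('input', {'name': '__RequestVerificationToken'})['value']
    # replicate every field the form would submit
    payload = {
        '__RequestVerificationToken': token,
        'Plate': 'x',
        'PlateStateProv': 'VA',
        'submit': 'Search',
    }
    result = s.post(url2, data=payload)
    print(result.text)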
Good evening,
I'm trying to write a programme that extracts the sell price of certain stocks and shares from a website called hl.co.uk.
As you can imagine, you have to search for the stock whose sale price you want to see.
My code so far is as follows:
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.hl.co.uk/shares"
page = requests.get(url)
parsed_html = soup(page.content, 'html.parser')
form = parsed_html.find('form', id="stock_search")
input_tag = form.find('input').get('name')
submit = form.find('input', id="stock_search_submit").get('alt')
post_data = {input_tag: "fgt", "alt": submit}
I have been able to extract the correct form tag and the input names I require, but the website has multiple forms on this page.
How can I submit a POST request to this website, using the data I have in "post_data", to that specific form so that it searches for the stock/share I want and returns the next page?
Thanks in advance.
Actually, when you submit the form from the homepage, it redirects you to the target page with a URL looking like this: "https://www.hl.co.uk/shares/search-for-investments?stock_search_input=abc&x=56&y=35&category_list=CEHGINOPW". So in my opinion, instead of submitting the homepage form, you should call the target page directly with your own GET parameters; the URL you call will look like https://www.hl.co.uk/shares/search-for-investments?stock_search_input=[your_keywords].
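For instance, a minimal sketch of that approach (keeping only the stock_search_input parameter; the other values in the example URL above are assumed optional here):

import requests
from bs4 import BeautifulSoup

# call the results page directly instead of submitting the homepage form
search_url = "https://www.hl.co.uk/shares/search-for-investments"
response = requests.get(search_url, params={"stock_search_input": "fgt"})

# then parse the results page as usual
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title)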
Hope this helps.
This is a pretty general problem that you can solve with Google Chrome's DevTools. Basically:
1- Navigate to the page that has the form and its fields (in your case, the hl.co.uk shares page).
2- Choose the XHR tab under the Network tab, which filters out all Fetch and XHR requests. These requests are generally sent after a form submission, and most of the time they return JSON with the resulting data.
3- Make sure you enable the Preserve log checkbox at the top left so the list doesn't refresh when the form is submitted.
4- Submit the form; you'll then see a bunch of requests being made. Inspect them to hopefully find what you're looking for.
In this case I found this URL endpoint, which returns the results in its response:
https://www.hl.co.uk/ajax/funds/fund-search/search?investment=&companyid=1324&sectorid=132&wealth=&unitTypePref=&tracker=&payment_frequency=&payment_type=&yield=&standard_ocf=&perf12m=&perf36m=&perf60m=&fund_size=&num_holdings=&start=0&rpp=20&lo=0&sort=fd.full_description&sort_dir=asc&
You can see all the query parameters here, such as companyid and sectorid. What you need to do is change those and make a request to that URL; then you'll get the relevant information.
To retrieve the companyid and sectorid values, you can send a GET request to the page https://www.hl.co.uk/shares/search-for-investments?stock_search_input=ftg&x=17&y=23&category_list=CEHGINOPW, which has those dropdowns, and filter its HTML to find the values.
See the BS4 documentation on finding tags inside the HTML source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
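As a rough illustration of calling that endpoint with requests (a sketch only; the parameter names are copied from the URL above, and the companyid/sectorid values are just the example ones):

import requests

# the Fetch/XHR endpoint spotted in DevTools
endpoint = "https://www.hl.co.uk/ajax/funds/fund-search/search"
params = {
    "investment": "",
    "companyid": 1324,   # swap in the company id scraped from the dropdowns
    "sectorid": 132,     # likewise for the sector id
    "start": 0,
    "rpp": 20,
    "sort": "fd.full_description",
    "sort_dir": "asc",
    # only a subset of the query parameters from the URL above; add others as needed
}
response = requests.get(endpoint, params=params)
print(response.json())   # endpoints like this usually answer with JSON; fall back to response.text otherwise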
I'm new to Python, so I'm still getting used to some of the different libraries it offers. I'm currently trying to use urllib to access the HTML of the website so that I can eventually scrape data from a table in the account I want to log in as.
import urllib.request
link = "websiteurl.com"
login = "email#address.com"
password = "password"
#Access the website of the given address, returns back an HTML file
def access_website(address):
    return urllib.request.urlopen(address).read()
html = access_website(link)
print(html)
This function returns:
b'<!DOCTYPE html>\n<html lang="en">\n <head>\n <meta charset="utf-8">\n <meta http-equiv="X-UA-Compatible" content="IE=edge">\n
<meta name="viewport" content="width=device-width, initial-scale=1">\n <title>Festival Manager</title>\n
<link href="bundle.css" rel="stylesheet">\n <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->\n
<!-- WARNING: Respond.js doesn\'t work if you view the page via file:// -->\n <!--[if lt IE 9]>\n <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>\n
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>\n <![endif]-->\n </head>\n <body>\n
<script src="vendor.js"></script>\n <script src="login.js"></script>\n </body>\n</html>\n'
And the thing is, I'm not really sure why it's giving me the part about the HTML5 shim and Respond.js, because when I go to the actual website and inspect the JavaScript it doesn't look like this. So it doesn't seem to be returning the HTML that I see when I actually visit the website.
Also, when I check what kind of request is sent when I submit login information, the Network tab of inspect element doesn't show me a POST request. So I'm not really sure how I would even send the login information through Python as a POST request in order to log in.
Here is my take on it for Python 3, done without any external libraries. After login you can use BeautifulSoup, or any other kind of scraping; if you managed to log in without third-party libraries/modules, you can do the scraping that way as well.
Likewise, the script is on my GitHub here.
The whole script is replicated below, per StackOverflow guidelines:
# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar
def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    # adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at the end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak a couple of things, maybe regarding the "token" logic
    # (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers = {"Content-Type": "application/x-www-form-urlencoded",
               "User-agent": "Mozilla/5.0 Chrome/81.0.4044.92",  # Chrome 80+ as per web search
               "Host": base_url,
               "Origin": https_base_url,
               "Referer": https_base_url}

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get the login page and parse out the token
    # (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page; we look for the token, e.g. on my page it was something like this:
    # <input type="hidden" name="_token" value="random1234567890qwertzstring">
    # this can probably be done better with regex and similar,
    # but I'm a newb, so bear with me
    html = contents.decode("utf-8")

    # text just before the start and just after the end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'

    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)

    # and the text between them is our token; store it for the second step, the actual login
    token = html[start_index:end_index]

    # here we craft our payload: it's all the form fields, including HIDDEN fields!
    # that includes the token we scraped earlier, as that's usually in hidden fields
    # make sure the left side is from the "name" attributes of the form,
    # and the right side is what you want to post as the "value"
    # and for hidden fields make sure you replicate the expected answer,
    # eg. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token': token,
        # 'name':'value',  # make sure this is the format of all additional fields!
        'login': username,
        'password': password
    }

    # now we prepare all we need for login
    # data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')

    # and put the URL + encoded data + correct headers into our POST request
    # btw, despite what I thought, it is automatically treated as POST
    # (I guess because of the byte-encoded data field), so you don't need to say it like this:
    # urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element of the page that's secure behind the login
    # we use a particular string we know only occurs after login,
    # like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)

    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not log in?!")

scraper_login()
A bit of additional info regarding your original code...
That is usually good enough if you do NOT have a login page. But with modern logins you usually have cookies, referrer checking, user-agent checking, tokens, if not more (like CAPTCHAs). Websites don't like to be scraped, and they fight it. It's also called good security.
So in addition to just doing the request as you initially did, you have to:
- take the cookie of the page, and serve it back during login
- know the page's referrer; usually you can just pass the login page as the referrer for the login-action page
- fake the agent; if you announce yourself with the default "Python 3" agent you may just get blocked right away
- scrape the token (as in my case) and serve it back in the login
- package your payload (user, pass, token, and other stuff), encode it properly, and submit it as DATA to trigger the POST method
- etc.
So yeah, with the built-in libraries the code balloons a bit as soon as a login page is involved.
With third-party libraries it's somewhat shorter, but as far as I researched, you still have to think about referrers, agents, scraping tokens, etc. No library will do that automagically, as each page works slightly differently (some need a fake agent, some don't; some have tokens, some not, and some name them differently; etc.).
If you strip my code of comments and extras, and shorten it a bit, you can make it a function that takes in 5 arguments and has 15 lines or less.
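For instance, a condensed sketch along those lines (same assumptions as the full script above, e.g. that the hidden field is called _token):

import urllib.parse, urllib.request, http.cookiejar

def login(base_url, auth_url, username, password, check_string):
    # cookie-aware opener, token scrape, then the POST - same logic as above, minus the commentary
    jar = http.cookiejar.CookieJar()
    urllib.request.install_opener(urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar)))
    html = urllib.request.urlopen('https://' + base_url).read().decode('utf-8')
    mark = '<input type="hidden" name="_token" value="'
    start = html.find(mark) + len(mark)
    token = html[start:html.find('"', start)]
    headers = {"Content-Type": "application/x-www-form-urlencoded",
               "User-agent": "Mozilla/5.0 Chrome/81.0.4044.92",
               "Host": base_url, "Origin": 'https://' + base_url, "Referer": 'https://' + base_url}
    data = urllib.parse.urlencode({'_token': token, 'login': username, 'password': password}).encode('UTF-8')
    page = urllib.request.urlopen(urllib.request.Request(auth_url, data, headers)).read().decode('utf-8')
    return check_string in page   # True if the post-login marker string is present

# example call with the same placeholder values as above
ok = login('www.example.com', 'https://www.example.com/login', 'yourusername', 'SoMePassw0rd!', 'Logout')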
Cheers!
I'm scraping mortgage data from the official mortgage registry. The problem is that I can't extract the HTML of a particular document. Everything happens via POST. I have all of the data required to build a precise POST request, but when I print the request URL it still shows me the welcome screen page, whereas it should retrieve the HTML of the particular document. All the data, like the mortgage number or the current page, is listed in dev tools > Network > Form Data, so I bet it must be possible. I'm quite new to web work in Python, so I will appreciate any help.
My code:
import requests
data = {
    'kodWydzialu': 'PT1R',
    'nrKw': '00037314',
    'cyfraK': '9',
}
r = requests.post('https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW', data=data)
print(r.url)
print(r.content)
You are getting the welcome screen because you aren't sending all the requests required to view the next page.
Go to Chrome > Network tab, and you will see that when you click the submit/search button, a bunch of other GET requests are sent to different URLs after that first POST request.
You need to replicate that in your script. Depending on the website it can be tough to get the response, so you should also consider using Selenium.
That said, it's not impossible to do this with requests:
session = requests.Session()
You need to send the POST request, and all other GET requests that follow in the same session.
data = {
'kodWydzialu':'PT1R',
'nrKw':'00037314',
'cyfraK':'9',
}
session.post(URL, headers=headers, params=data)
# Start sending the GET requests
session.get(URL_1, headers=headers)
session.get(URL_2, headers=headers)
I am trying to programmatically download (open) data from a website using BeautifulSoup.
The website uses a PHP form where you need to submit input data, and it then apparently outputs the resulting links within this form.
My approach was as follows:
Step 1: post form data via request
Step 2: parse resulting links via BeautifulSoup
However, it seems like this is not working / I am doing something wrong, as the POST method seems not to work, and Step 2 is not even possible because no results are available.
Here is my code:
from bs4 import BeautifulSoup
import requests
def get_text_link(soup):
    'Returns list of links to individual legal texts'
    ergebnisse = soup.findAll(attrs={"class": "einErgebnis"})
    if ergebnisse:
        links = [el.find("a", href=True).get("href") for el in ergebnisse]
    else:
        links = []
    return links
url = "https://www.justiz.nrw.de/BS/nrwe2/index.php#solrNrwe"
# Post specific day to get one day of data
params ={'von':'01.01.2018',
'bis': '31.12.2018',
"absenden":"Suchen"}
response = requests.post(url,data=params)
content = response.content
soup = BeautifulSoup(content,"lxml")
resultlinks_to_parse = get_text_link(soup) # is always an empty list
# proceed from here....
Can someone tell me what I am doing wrong? I am not really familiar with POSTing via requests. The form field for "bis", for example, looks as follows:
<input id="bis" type="text" name="bis" size="10" value="">
If my approach is flawed, I would appreciate any hint on how to deal with this kind of site.
Thanks!
I've found what the issue is in your request.
My investigation shows that the following params are available:
gerichtstyp:
gerichtsbarkeit:
gerichtsort:
entscheidungsart:
date:
von: 01.01.2018
bis: 31.12.2018
validFrom:
von2:
bis2:
aktenzeichen:
schlagwoerter:
q:
method: stem
qSize: 10
sortieren_nach: relevanz
absenden: Suchen
advanced_search: true
I think the qSize param is mandatory for your POST request.
So, you have to replace your params by:
params = {
'von':'01.01.2018',
'bis': '31.12.2018',
'absenden': 'Suchen',
'qSize': 10
}
Doing this, here are my results when I print resultlinks_to_parse
print(resultlinks_to_parse)
OUTPUT:
[
'http://www.justiz.nrw.de/nrwe/lgs/detmold/lg_detmold/j2018/03_S_69_18_Urteil_20181031.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1122_17_Urteil_20180126.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/13_TaBV_10_18_Beschluss_20181123.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1810_17_Urteil_20180629.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1811_17_Urteil_20180629.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_Sa_1196_17_Urteil_20180118.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_Sa_1775_17_Urteil_20180614.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_SaGa_9_18_Urteil_20180712.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/12_Sa_748_18_Urteil_20181009.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/12_Sa_755_18_Urteil_20181106.html'
]
I've been trying to learn to use the urllib2 package in Python. I tried to log in as a student (the left form) on a signup page for maths students: http://reg.maths.lth.se/. I have inspected the code (using Firebug), and the left form should obviously be called using POST with a key called pnr whose value should be a string 10 characters long (the last part can perhaps not be seen from the HTML code, but it is basically my social security number, so I know how long it should be). Note that the action in the header for the appropriate POST method is another URL, namely http://reg.maths.lth.se/login/student.
I tried (with a fake pnr in the example below, but I used my real number in my own code).
import urllib
import urllib2
url = 'http://reg.maths.lth.se/'
values = dict(pnr='0000000000')
data = urllib.urlencode(values)
req = urllib2.Request(url,data)
resp = urllib2.urlopen(req)
page = resp.read()
print page
While this executes, what is printed is the source code of the original page http://reg.maths.lth.se/, so it doesn't seem like I logged in. Also, I can add any key/value pairs to the values dictionary and it doesn't produce any error, which seems strange to me.
Also, if I go to the page http://reg.maths.lth.se/login/student, there is clearly no POST method for submitting data.
Any suggestions?
If you inspect what request is sent to the server when you enter the number and submit the form, you will notice that it is a POST request with pnr and _token parameters.
You are missing the _token parameter which you need to extract from the HTML source of the page. It is a hidden input element:
<input name="_token" type="hidden" value="WRbJ5x05vvDlzMgzQydFxkUfcFSjSLDhknMHtU6m">
I suggest looking into tools like Mechanize, MechanicalSoup or RoboBrowser, which would ease the form submission. You may also parse the HTML yourself with an HTML parser like BeautifulSoup, extract the token, and send it via urllib2 or requests:
import requests
from bs4 import BeautifulSoup
PNR = "00000000"
url = "http://reg.maths.lth.se/"
login_url = "http://reg.maths.lth.se/login/student"
with requests.Session() as session:
    # extract token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("input", {"name": "_token"})["value"]

    # submit form
    session.post(login_url, data={
        "_token": token,
        "pnr": PNR
    })

    # navigate to the main page again (should be logged in)
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.title)
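For comparison, a minimal MechanicalSoup sketch of the same idea (assuming the student login form is the first form on the page; MechanicalSoup submits hidden inputs such as _token automatically):

import mechanicalsoup

PNR = "00000000"
browser = mechanicalsoup.StatefulBrowser()
browser.open("http://reg.maths.lth.se/")

# select the login form and fill in only the visible field
browser.select_form("form")   # picks the first <form>; adjust the selector if needed
browser["pnr"] = PNR
browser.submit_selected()

# revisit the main page, now with the session cookie set
browser.open("http://reg.maths.lth.se/")
print(browser.get_current_page().title)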