I am trying to programmatically download (open) data from a website using BeautifulSoup.
The website uses a PHP form where you submit input data, and the resulting links are then apparently rendered within that form.
My approach was as follows:
Step 1: post the form data via requests
Step 2: parse the resulting links via BeautifulSoup
However, this is not working, or I am doing something wrong: the POST does not seem to take effect, and Step 2 is not even possible because no results are available.
Here is my code:
from bs4 import BeautifulSoup
import requests

def get_text_link(soup):
    """Returns a list of links to the individual legal texts."""
    ergebnisse = soup.findAll(attrs={"class": "einErgebnis"})
    if ergebnisse:
        links = [el.find("a", href=True).get("href") for el in ergebnisse]
    else:
        links = []
    return links

url = "https://www.justiz.nrw.de/BS/nrwe2/index.php#solrNrwe"

# Post a specific date range to get one year of data
params = {'von': '01.01.2018',
          'bis': '31.12.2018',
          'absenden': 'Suchen'}

response = requests.post(url, data=params)
content = response.content
soup = BeautifulSoup(content, "lxml")

resultlinks_to_parse = get_text_link(soup)  # is always an empty list

# proceed from here....
Can someone tell me what I am doing wrong? I am not really familiar with requests.post. The form field for "bis", for example, looks as follows:
<input id="bis" type="text" name="bis" size="10" value="">
If my approach is flawed, I would appreciate any hint on how to deal with this kind of site.
Thanks!
I've found the issue with your request.
My investigation shows that the following params are available:
gerichtstyp:
gerichtsbarkeit:
gerichtsort:
entscheidungsart:
date:
von: 01.01.2018
bis: 31.12.2018
validFrom:
von2:
bis2:
aktenzeichen:
schlagwoerter:
q:
method: stem
qSize: 10
sortieren_nach: relevanz
absenden: Suchen
advanced_search: true
I think the qSize param is mandatory for your POST request.
So, you have to replace your params with:
params = {
    'von': '01.01.2018',
    'bis': '31.12.2018',
    'absenden': 'Suchen',
    'qSize': 10
}
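For completeness, here is a minimal end-to-end sketch that combines the corrected params with the get_text_link helper from your question (same URL as before, only the qSize field added):

import requests
from bs4 import BeautifulSoup

def get_text_link(soup):
    """Return the href of each result entry (class 'einErgebnis')."""
    ergebnisse = soup.findAll(attrs={"class": "einErgebnis"})
    return [el.find("a", href=True).get("href") for el in ergebnisse] if ergebnisse else []

url = "https://www.justiz.nrw.de/BS/nrwe2/index.php#solrNrwe"
params = {
    'von': '01.01.2018',
    'bis': '31.12.2018',
    'absenden': 'Suchen',
    'qSize': 10
}

# Post the search form and parse the result page for links
response = requests.post(url, data=params)
soup = BeautifulSoup(response.content, "lxml")
resultlinks_to_parse = get_text_link(soup)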
Doing this, here are my results when I print resultlinks_to_parse
print(resultlinks_to_parse)
OUTPUT:
[
'http://www.justiz.nrw.de/nrwe/lgs/detmold/lg_detmold/j2018/03_S_69_18_Urteil_20181031.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1122_17_Urteil_20180126.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/13_TaBV_10_18_Beschluss_20181123.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1810_17_Urteil_20180629.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1811_17_Urteil_20180629.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_Sa_1196_17_Urteil_20180118.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_Sa_1775_17_Urteil_20180614.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_SaGa_9_18_Urteil_20180712.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/12_Sa_748_18_Urteil_20181009.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/12_Sa_755_18_Urteil_20181106.html'
]
Good evening,
I'm trying to write a program that extracts the sell price of certain stocks and shares from a website called hl.co.uk.
As you can imagine, you have to search for the stock whose sale price you want to see.
My code so far is as follows:
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.hl.co.uk/shares"
page = requests.get(url)
parsed_html = soup(page.content, 'html.parser')
form = parsed_html.find('form', id="stock_search")
input_tag = form.find('input').get('name')
submit = form.find('input', id="stock_search_submit").get('alt')
post_data = {input_tag: "fgt", "alt": submit}
I have been able to extract the correct form tag and the input names I require, but the website has multiple forms on this page.
How can I submit a POST request to that specific form using the data I have in "post_data", so that it searches for the stock/share I want and gives me the next page?
Thanks in advance.
Actually, when you submit the form from the homepage, it redirects you to the target page with a URL looking like this: "https://www.hl.co.uk/shares/search-for-investments?stock_search_input=abc&x=56&y=35&category_list=CEHGINOPW". So in my opinion, instead of submitting the homepage form, you should call the target page directly with your own GET parameters; the URL you're supposed to call will look like this: https://www.hl.co.uk/shares/search-for-investments?stock_search_input=[your_keywords]
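For example, a minimal sketch of that direct GET call (the search keyword here is just a placeholder; parse the returned page however you need):

import requests
from bs4 import BeautifulSoup as soup

# Call the search results page directly instead of posting the homepage form
search_url = "https://www.hl.co.uk/shares/search-for-investments"
params = {"stock_search_input": "fgt"}  # your search keyword

page = requests.get(search_url, params=params)
results = soup(page.content, "html.parser")
# extract the prices / links you need from `results` here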
Hope this helped you
This is a pretty general problem that you can solve with Google Chrome's DevTools. Basically:
1- Navigate to the page where you have the form and its fields (in your case, the stock search page).
2- Then choose the XHR tab under the Network tab, which filters out everything except Fetch and XHR requests. These requests are generally sent after a form submission, and most of the time they return JSON with the resulting data.
3- Make sure you enable the Preserve log checkbox at the top left so the list doesn't clear when the form is submitted.
4- Submit the form; you'll then see a bunch of requests being made. Inspect them to hopefully find the one you're looking for.
In this case I found this URL endpoint, which returns the results as its response:
https://www.hl.co.uk/ajax/funds/fund-search/search?investment=&companyid=1324&sectorid=132&wealth=&unitTypePref=&tracker=&payment_frequency=&payment_type=&yield=&standard_ocf=&perf12m=&perf36m=&perf60m=&fund_size=&num_holdings=&start=0&rpp=20&lo=0&sort=fd.full_description&sort_dir=asc&
You can see all the query parameters here, such as companyid and sectorid; what you need to do is change those and just make a request to that URL. Then you'll get the relevant information.
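A minimal sketch of calling that endpoint with requests (the parameter values below are simply the ones from the captured URL; adjust them to your own query):

import requests

# AJAX endpoint discovered via the Network tab
ajax_url = "https://www.hl.co.uk/ajax/funds/fund-search/search"
params = {
    "companyid": 1324,   # value taken from the captured URL
    "sectorid": 132,     # value taken from the captured URL
    "start": 0,          # offset for paging
    "rpp": 20,           # results per page
    "lo": 0,
    "sort": "fd.full_description",
    "sort_dir": "asc",
}

response = requests.get(ajax_url, params=params)
print(response.json())  # the endpoint returns JSON, as observed in DevTools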
To retrieve those companyid and sectorid values, you can send a GET request to the page https://www.hl.co.uk/shares/search-for-investments?stock_search_input=ftg&x=17&y=23&category_list=CEHGINOPW, which has those dropdowns, and filter the HTML to find the values in the dropdown options.
You can refer to the BS4 documentation on finding tags inside HTML source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
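As a sketch, assuming the dropdowns are ordinary select elements on that page (their exact name attributes may differ), something like this could pull out the available option values:

import requests
from bs4 import BeautifulSoup

page_url = ("https://www.hl.co.uk/shares/search-for-investments"
            "?stock_search_input=ftg&x=17&y=23&category_list=CEHGINOPW")

page = requests.get(page_url)
parsed = BeautifulSoup(page.content, "html.parser")

# Collect the (value, text) pairs of every dropdown on the page,
# keyed by the select's name attribute
dropdowns = {}
for select in parsed.find_all("select"):
    dropdowns[select.get("name")] = [
        (opt.get("value"), opt.get_text(strip=True))
        for opt in select.find_all("option")
    ]

print(dropdowns)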
So, recently I've been trying to get some marks from a results website (http://tnresults.nic.in/rgnfs.htm) for my school results. A friend challenged me to get his marks, for which I only know his DOB and not his register number. How do I write a Python program that solves this by trying register numbers from a predefined range (I know his DOB, btw)?
I tried using requests, but it doesn't let me enter the register number and DOB.
It creates a POST request with the following format after pushing the Submit button:
https://dge3.tn.nic.in/plusone/plusoneapi/marks/{registration number}/{DOB}
Sample (with 112231 as registration number and 01-01-2000 as DOB):
https://dge3.tn.nic.in/plusone/plusoneapi/marks/112231/01-01-2000
You can then iterate over different registration numbers with a predefined array.
Note: it has to be a POST request, not a regular GET request.
You probably have to do something like the following:
import requests
from bs4 import BeautifulSoup

DOB = '01-01-2000'
REGISTRATION_NUMBERS = ['1', '2']

for reg_number in REGISTRATION_NUMBERS:
    result = requests.post(f"https://dge3.tn.nic.in/plusone/plusoneapi/marks/{reg_number}/{DOB}")
    content = result.content
    print(content)
    ## BeautifulSoup logic
Update 2019-07-09:
Since you said the page is not working anymore and the website changed, I took a look.
It seems that some things have changed: you now have to make a POST request to http://tnresults.nic.in/rgnfs.asp. The fields 'regno', 'dob' and 'B1' (optional?) should be sent as x-www-form-urlencoded.
Since that will return an 'Access Denied', you should set the 'Referer' header to 'http://tnresults.nic.in/rgnfs.htm'. So:
import requests
from bs4 import BeautifulSoup

DOB = '23-10-2002'
REGISTRATION_NUMBERS = ['5709360']

headers = requests.utils.default_headers()
headers.update({'Referer': 'http://tnresults.nic.in/rgnfs.htm'})

for reg_number in REGISTRATION_NUMBERS:
    post_data = {'regno': reg_number, 'dob': DOB}
    result = requests.post("http://tnresults.nic.in/rgnfs.asp", data=post_data, headers=headers)
    content = result.content
    print(content)
    ## BeautifulSoup logic
Tested it myself successfully, now that you've provided a valid DOB and registration number.
I am using Python requests to get information from the mobile website of the German railway company (https://mobile.bahn.de/bin/mobil/query.exe/dox).
For instance:
import requests
query = {'S':'Stuttgart Hbf', 'Z':'München Hbf'}
rsp = requests.get('https://mobile.bahn.de/bin/mobil/query.exe/dox', params=query)
which in this case gives the correct page.
However, using the following query:
query = {'S':'Cottbus', 'Z':'München Hbf'}
It gives a different response, in which the user is required to choose one of several options (the server is unsure about the starting station, since there are many stations beginning with 'Cottbus').
Now, my question is: given this response, how can I choose one of the given options and then repeat the request without getting this error?
I tried looking at the cookies and using a session instead of a simple GET request, but nothing has worked so far.
I hope you can help me.
Thanks.
You can use BeautifulSoup to parse the response and get the options, if there is a select element in the response:
import requests
from bs4 import BeautifulSoup

query = {'S': u'Cottbus', 'Z': u'München Hbf'}
rsp = requests.get('https://mobile.bahn.de/bin/mobil/query.exe/dox', params=query)
soup = BeautifulSoup(rsp.content, 'lxml')

# check if the page has a choice dropdown
if soup.find('select'):
    # list of (value, text) tuples you will need to use in the next request
    options_value = [(option['value'], option.text) for option in soup.find_all('option')]
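To then pick one of those options and repeat the request, one possible continuation is sketched below. It assumes the server accepts the chosen option's text (or value) as a more precise 'S' parameter, which you would need to verify against the actual form; choosing the first option is just illustrative:

# Hypothetical continuation: pick the first suggested station and retry.
# Whether the server expects the option's text or its value as the 'S'
# parameter is an assumption you should confirm against the real form.
if soup.find('select'):
    chosen_value, chosen_text = options_value[0]
    query['S'] = chosen_text  # or chosen_value, depending on the form
    rsp = requests.get('https://mobile.bahn.de/bin/mobil/query.exe/dox', params=query)
    soup = BeautifulSoup(rsp.content, 'lxml')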
I"m trying to scrape http://www.sgx.com/wps/portal/sgxweb/home/company_disclosure/stockfacts for this table, and return the Company Name, Code and industry into a list, for all 15 pages of it.
And i've been trying to work with lxml.html, xpath and Beautifulsoup to try and get this information, but i'm stuck.
I realised that this information seems to be a #html embedded within the website, but i'm not sure how I can build a module to retrieve it.
Any thoughts? Or if I should be using a different module/technique?
Edit
I found out that this link was embedded into the website, which consist of the #html that I was talking about previously: http://sgx.wealthmsi.com/index.html#http%3A%2F%2Fwww.sgx.com%2Fwps%2Fportal%2Fsgxweb%2Fhome%2Fcompany_disclosure%2Fstockfacts
When I tried to use BeautifulSoup to pull the data out:
r = requests.get('http://sgx.wealthmsi.com/index.html#http%3A%2F%2Fwww.sgx.com%2Fwps%2Fportal%2Fsgxweb%2Fhome%2Fcompany_disclosure%2Fstockfacts')
wb = BeautifulSoup(r.text, "html.parser")
print(wb.findAll('div', attrs={'class': 'table-wrapper results-display'}))
It returns the result below:
[<div class="table-wrapper results-display">
<table>
<thead>
<tr></tr>
</thead>
<tbody></tbody>
</table>
</div>]
But that's different from what is in the website. Any thoughts?
You might want to address this problem another way.
By looking at the server calls (Chrome -> F12 -> Network tab), you can figure out which URL you should actually call to get a JSON response instead.
Apparently, you could use a URL that starts like this:
http://sgx-api-lb-195267723.ap-southeast-1.elb.amazonaws.com/sgx/search?callback=json&json=???? (you'll need to do some reverse engineering to figure out the actual json query but it doesn't look too difficult)
Sorry I did not look much further into the json query but I hope this helps you keep going :)
Note: I based my answer on this url
#!/usr/bin/env python
import requests
url = "http://sgx-api-lb-195267723.ap-southeast-1.elb.amazonaws.com/sgx/search"
params = {
    'callback': 'json',
    'json': {
        # key / value pairs defining your actual query to the server.
        # You need to figure this out yourself depending on the data you
        # want to retrieve.
        # I usually look at Chrome's Network tab (F12), find the proper URL
        # that queries for the data, and reverse engineer the key/value pairs.
    }
}

response = requests.get(url, params=params)
print(response.json())
I'm using Python 3.3 and the Requests library to do a basic POST request.
I want to simulate what happens if you manually enter information into the browser from the webpage:
https://www.dspayments.com/FAIRFAX. For example, at that url, enter "x" for the license plate and Virginia as the state. Then the url changes to: https://www.dspayments.com/FAIRFAX/Home/PayOption, and it displays the desired information (I care about the source code of this second webpage).
I looked through the source code of the above two URLs. Doing "inspect element" on the text boxes of the first URL, I found some things that need to be included in the POST request: {'Plate': "x", 'PlateStateProv': "VA", "submit": "Search"}.
The second page (ending in /PayOption) had this raw HTML:
<form action="/FAIRFAX/Home/PayOption" method="post"><input name="__RequestVerificationToken" type="hidden" value="6OBKbiFcSa6tCqU8k75uf00m_byjxANUbacPXgK2evexESNDz_1cwkUpVVePA2czBLYgKvdEK-Oqk4WuyREi9advmDAEkcC2JvfG2VaVBWkvF3O48k74RXqx7IzwWqSB5PzIJ83P7C5EpTE1CwuWM9MGR2mTVMWyFfpzLnDfFpM1" /><div class="validation-summary-valid" data-valmsg-summary="true">
I then used the name:value pairs from the above html as keys and values in my payload dictionary of the post request. I think the problem is that in the second url, there is the "__RequestVerificationToken" which seems to have a randomly generated value every time.
How can I properly POST to this website? A "correct" answer would be one that produces the same source code on the website ending in "/PayOption" as if you manually enter "x" as the plate number and Virginia as the state and click submit on the first url.
My code is:
import requests

url1 = r'https://www.dspayments.com/FAIRFAX'
url2 = r'https://www.dspayments.com/FAIRFAX/Home/PayOption'

# a minimal User-Agent header to send with the requests
user_agent = {'User-Agent': 'Mozilla/5.0'}

s = requests.Session()

# GET request
r = s.get(url1)
text1 = r.text

startstr = '<input name="__RequestVerificationToken" type="hidden" value="'
start_ind = text1.find(startstr) + len(startstr)
end_ind = text1.find('"', start_ind)
auth_string = text1[start_ind:end_ind]

# POST request
payload = {'Plate': 'x', 'PlateStateProv': 'VA', 'submit': 'Search',
           '__RequestVerificationToken': auth_string,
           'validation-summary-valid': 'true'}

post = s.post(url2, headers=user_agent, data=payload)
source_code = post.text
Thanks, -K.
You should only need the data from the first page, and as you say, the __RequestVerificationToken changes with each request.
You'll have to do something like:
GET request to https://www.dspayments.com/FAIRFAX
harvest __RequestVerificationToken value (Requests Session will take care of any associated cookies)
POST using the data you scraped from the GET request
extract whatever you need from the 2nd page
So, just focus on creating a form payload that's exactly like the one on the first page. Have a stab at it, and if you're still struggling I can help dig into the particulars.
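As a rough sketch of those steps, here is one way to do it, using BeautifulSoup to harvest the token instead of string slicing (the field names are the ones from the question; whether additional hidden fields are required is something you'd have to check on the live page):

import requests
from bs4 import BeautifulSoup

url1 = 'https://www.dspayments.com/FAIRFAX'
url2 = 'https://www.dspayments.com/FAIRFAX/Home/PayOption'

with requests.Session() as s:
    # 1. GET the search page (the session keeps any associated cookies)
    r = s.get(url1)
    soup = BeautifulSoup(r.text, 'html.parser')

    # 2. Harvest the anti-forgery token from the hidden input
    token = soup.find('input', {'name': '__RequestVerificationToken'})['value']

    # 3. POST the same fields the browser form would send
    payload = {
        'Plate': 'x',
        'PlateStateProv': 'VA',
        'submit': 'Search',
        '__RequestVerificationToken': token,
    }
    post = s.post(url2, data=payload)

    # 4. Extract whatever you need from the result page
    source_code = post.text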