Scrape a website that loads a table based on dynamically selected dropdown values - python

I am trying to scrape table values from this website. There are multiple dropdowns whose options change based on the selection in the previous dropdown. Upon inspecting the source, the page appears to send HTTP requests and render the results. However, I can't seem to write a script that sends those requests myself and scrapes the data.
This is what I tried:
import requests

URL = 'https://voterlist.election.gov.np/bbvrs1/index_process_1.php'  # API URL
payload = 'vdc=5298&ward=1&list_type=reg_centre'  # unique payload taken from the network request
response = requests.post(URL, data=payload, verify=False)  # POST request to fetch the data using the URL and payload
print(response.text)
This is not giving me the expected response, which should be a table containing the values. What would be the best approach to take in this case?

So after a few hours of digging in, I found an answer. All I had to do was send a cookie while making the request and also change the format in which the payload was sent.
import requests

session = requests.Session()
url = 'https://voterlist.election.gov.np/bbvrs1/index_process_1.php'  # the API URL from the question
number = '...'  # the session id value, copied from the browser's request

headers = {
    'num': number,  # this is the cookie that has to be sent; in my case it happens to be a session id
}
payload = {
    'state': "1",
    'district': "14",
    'vdc_mun': "5132",
    'ward': "3",
    'reg_centre': "",
}
# Then send a POST request to the URL
res = session.post(url, data=payload, headers=headers, verify=False)

Related

How do I return all pages from a rest API request without manually specifying the page number?

I'm retrieving data from a paginated API and converting it to JSON, and I'd like to retrieve all pages in the response without having to specify the page number in the URL. The API accepts a page number and results per page (max. 250) as inputs.
I understand that the typical solution is to loop through pages using a key that specifies the address of the next page. However, it appears this API doesn't include a next-page parameter in its output (see the example response below). I can only think that the last page (i.e. total pages) parameter could be useful here. How can I scrape all of the pages without specifying the page number?
My script:
import requests

url = "https://api-v2.pitchbook.com/deals/search?keywords=machine learning accelerator&perPage=250"
headers = {
    'Authorization': 'PB-Token 1234567'
}
response = requests.get(url, headers=headers)
data = response.json()
print(data)
Example response
{'stats': {'total': 2, 'perPage': 250, 'page': 1, 'lastPage': 1},
 'items': [{'dealId': '98982-28T', 'companyId': '162120-79', 'companyName': 'companyA'},
           {'dealId': '112532-05T', 'companyId': '233527-87', 'companyName': 'companyB'}]}
without having to specify the page number in the URL
Unless you can pass a page number in the request, that's not possible. You could pass a very large number to the perPage parameter, but the server could still hold more data than one page returns, or the client could fail to deserialize a very large payload.
appears as though this API doesn't include a next page parameter in the output
It doesn't need to. You have your current page and the number of results per page. As long as the "total" has not yet been read into your local results and "lastPage" is greater than your current "page", you should make a new request for page + 1.
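Putting that together, a minimal sketch of the loop, assuming the API accepts a page query parameter alongside perPage (the question says it accepts a page number as input) and returns the stats shape shown above:

import requests

url = "https://api-v2.pitchbook.com/deals/search"
headers = {'Authorization': 'PB-Token 1234567'}
params = {"keywords": "machine learning accelerator", "perPage": 250, "page": 1}

items = []
while True:
    data = requests.get(url, headers=headers, params=params).json()
    items.extend(data["items"])  # collect this page's results
    if params["page"] >= data["stats"]["lastPage"]:
        break  # no more pages to fetch
    params["page"] += 1  # request the next page

print(len(items))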

Python Requests: PHP-Form login is not being sent

I'm trying to scrape this website: https://gb4.typewriter.at/. I'm struggling with logging in. The weird thing is: the form is being filled out but it is not sent, and therefore I don't get any cookies.
Here is what I tried:
import requests

url = "https://gb4.typewriter.at"
s = requests.Session()
s.get(url)  # fetch the page first so the session picks up the initial cookies
payload = {
    "LoginForm[username]": "some_username",
    "LoginForm[pw]": "some_password",
}
r = s.post(
    f"{url}/index.php?r=site/login",
    data=payload,
)
print(r.content)  # this prints the HTML with both form fields filled
print(r.cookies)  # gives me a PHPSESSID cookie, not the cookie I need; the site uses another one (some random characters) as the authentication cookie
I also tried to send a post request with Postman, which worked completely fine and logged me in.
I would really appreciate some help!
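No answer is recorded in the thread, but one thing worth checking with PHP form logins like this one: the login form often carries hidden fields (a CSRF token, for example) that must be posted back together with the credentials, and Postman may have picked those up automatically. A sketch of that idea, assuming such hidden fields exist on the page (the use of BeautifulSoup here is purely illustrative):

import requests
from bs4 import BeautifulSoup

url = "https://gb4.typewriter.at"
s = requests.Session()
html = s.get(url).text  # fetch the login page to get cookies and the form's hidden fields

# Copy every named hidden input from the form into the payload
soup = BeautifulSoup(html, "html.parser")
payload = {
    inp["name"]: inp.get("value", "")
    for inp in soup.select("form input[type=hidden]")
    if inp.get("name")
}
payload["LoginForm[username]"] = "some_username"
payload["LoginForm[pw]"] = "some_password"

r = s.post(f"{url}/index.php?r=site/login", data=payload)
print(r.status_code, s.cookies.get_dict())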

Updating a webpage using Python requests returns error response code 415

I am trying to update an existing page in Atlassian Confluence through the Python requests module. I am using the requests.put() method to send the HTTP request that updates my page. The page already has the title "Update Status", and I am trying to enter one line as the content of the page. The page id and the other information in the JSON payload were copied by me directly from the rest/api/content... output of the page I am trying to access.
Note: I am already able to read information from the page through Python requests.get, but I am not able to post information to it.
Method used to access information from the webpage which works:
response = requests.get('https://confluence.ai.com/rest/api/content/525424594?expand=body.storage',
                        auth=HTTPBasicAuth('svc-Automation#ai.com', 'AIengineering1#ai')).json()
The method used to update the page, which does not work (the response is an HTTP 415 error):
import requests
from requests.auth import HTTPBasicAuth
import json

url = "https://confluence.ai.com/rest/api/content/525424594"
payload = {
    "id": "525424594",
    "type": "page",
    "title": "new page-Update Status",
    "space": {"key": "TST"},
    "body": {"storage": {"value": "<p>This is the updated text for the new page</p>", "representation": "storage"}},
    "version": {"number": 2},
}
result = requests.put(url, data=payload, auth=HTTPBasicAuth('svc-Automation#ai.com', 'AIengineering1#ai'))
print(result)
I am guessing that the payload is not in the right format. Any suggestions?
Note: The link, username and password shown here are all fictional.
Try sending the data with the json named argument instead of data, so the requests module sets the Content-Type header to application/json.
result = requests.put(url, json=payload, auth=HTTPBasicAuth('svc-Automation#ai.com', 'AIengineering1#ai'))
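If you prefer to keep the data argument, the equivalent by hand is to serialize the payload yourself and set the header explicitly; a small sketch, reusing url and payload from the question:

import json

result = requests.put(
    url,
    data=json.dumps(payload),  # serialize the dict to a JSON string
    headers={"Content-Type": "application/json"},  # tell the server the body is JSON
    auth=HTTPBasicAuth('svc-Automation#ai.com', 'AIengineering1#ai'),
)
print(result.status_code)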

How do I insert API pagination using Python 3 & requests?

I make an API call using requests and get a JSON response back. The API limit is 20 results per page. I can access the first page without a problem; however, I cannot figure out how to include pagination in the query. At the bottom of the JSON response, it gives me the following information.
},
"_links": {
    "first": {
        "href": "https://search-lastSale?date=20190723-20190823&page=0&size=20"
    },
    "last": {
        "href": "https://search-lastSale?date=20190723-20190823&page=4&size=20"
    },
    "next": {
        "href": "https://search-lastSale?date=20190723-20190823&page=1&size=20"
    },
    "self": {
        "href": "https://search-lastSale?date=20190723-20190823&page=0&size=20"
    }
},
"page": {
    "number": 0,
    "size": 20,
    "totalElements": 77,
    "totalPages": 4
}
I've read the docs at https://2.python-requests.org//en/latest/user/advanced/#link-headers and various other articles and posts, but everything seems very specific to people's own APIs.
I've taken my code back to just a single URL request and an old auth token, just so I can get a grasp of it before scaling back up to my existing project. The code is below:
url = "https://search-api.corelogic.asia/search/au/property/postcode/401249/lastSale"
querystring = {"page":"0","size":"20","date":"20190723-20190823"}
headers = {
'Content-Type': "application/JSON",
'Authorization': "My Token"}
response = requests.request("GET", url, headers=headers, params=querystring)
print(response.text)
As far as I can tell from the docs and reading, what I should be doing is either:

Get the page count ("totalPages") from the JSON response and then build a list of URLs that reference it, i.e.

if totalPages == 4:
    https://www.search/page0
    https://www.search/page1
    https://www.search/page2
    https://www.search/page3
    https://www.search/page4

then loop through each URL and append the results; or

Use the 'next' link in the JSON response to grab the next URL, until there is no 'next' link left in the response, i.e.

while the response contains a 'next' link:
    keep getting data
    append it to the collected results

Both methods make sense; however, I cannot see where this pagination would fit into my code.
As you can see, the response you receive contains a _links dict; you can use the href inside next to get the next page.
Or you can generate those URLs manually:
>>> def paged_url(page: int=0, size=20) -> str:
... return ("https://search-lastSale?date=20190723-20190823&"
... f"page={page}&size={size}")
...
>>> paged_url(1)
'https://search-lastSale?date=20190723-20190823&page=1&size=20'
>>> paged_url(2)
'https://search-lastSale?date=20190723-20190823&page=2&size=20'
>>> paged_url(3, 10)
'https://search-lastSale?date=20190723-20190823&page=3&size=10'
Those URLs contain the next page you should fetch.
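For the _links approach, a minimal sketch of the loop, assuming the records of each page live under a key such as "content" (a guess; substitute whatever key holds them in the real response) and that "next" disappears on the last page:

import requests

headers = {'Authorization': "My Token"}
url = ("https://search-api.corelogic.asia/search/au/property/postcode/401249/"
       "lastSale?date=20190723-20190823&page=0&size=20")

results = []
while url:
    data = requests.get(url, headers=headers).json()
    results.extend(data.get("content", []))  # "content" is an assumption; check the actual response
    url = data.get("_links", {}).get("next", {}).get("href")  # None once there is no next page

print(len(results))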

I want to send a Python request to an ASP site but the site shows access denied

The site URL is http://rajresults.nic.in/resbserx18.htm. When the data is sent, the URL of the response changes to an ASP page. So which URL should the request be sent to, the ASP one or the HTML one?
Request:
import requests

# data for getting the result
para = {'roll_no': '2000000', 'B1': 'Submit'}
# this is the URL where the data is entered and the ASP response comes from
url = 'http://rajresults.nic.in/resbserx18.htm'
result = requests.post(url, data=para)
print(result.text)
Response
'The page you are looking for cannot be displayed because an invalid method (HTTP verb) is being used.'
Okay, after a little bit of work, I found it's an issue with the headers.
I did some trial and error and found that the service checks that the Host header is set.
To debug this, I incrementally removed Chrome's request headers and found which one this web service was particular about.
import requests

headers = {
    "Host": "rajresults.nic.in"
}
r = requests.post('http://rajresults.nic.in/resbserx18.asp',
                  headers=headers,
                  data={'roll_no': 2000000, 'B1': 'Submit'})
print(r.text)
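That elimination process can itself be scripted; a rough sketch, assuming chrome_headers holds the full set of headers copied from the browser's network tab:

import requests

url = 'http://rajresults.nic.in/resbserx18.asp'
data = {'roll_no': 2000000, 'B1': 'Submit'}
chrome_headers = {
    "Host": "rajresults.nic.in",
    # ... the rest of the headers copied from the browser go here ...
}

# Drop one header at a time; a change in status code or body points at the header that matters
for name in list(chrome_headers):
    trimmed = {k: v for k, v in chrome_headers.items() if k != name}
    r = requests.post(url, headers=trimmed, data=data)
    print(f"without {name}: {r.status_code}")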
