Fixing error 408 on first POST request when scraping data - python

I'm trying to scrape a website with BS4. This is the website:
https://www.wsl.ch/de/ueber-die-wsl/news/alle-news.html
I want to scrape the URLs of all the news articles on this page. If I just pass the URL to the requests library, I don't get the article URLs. But if I open Inspect -> Network in the browser, there is one POST request that returns HTML containing all the URLs (hrefs).
So I have to use a POST request in order to get all the URLs, but the problem is that I always get error 408.
import requests
from bs4 import BeautifulSoup

url = 'https://www.wsl.ch/de/ueber-die-wsl/news/alle-news.filter.html?tx_wslfilter_filter%5Baction%5D=ajax&tx_wslfilter_filter%5Bcontroller%5D=Filter&cHash=88a50dfb12c7c7e03ce68f244dbfda20'
headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
    'Content-Length': '757',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'www.wsl.ch',
    'Origin': 'https://www.wsl.ch',
    'Referer': 'https://www.wsl.ch/de/ueber-die-wsl/news/alle-news.html',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'Server-Timing': 'miss, db;dur=63, app;dur=55.2'}
response = requests.post(url, headers=headers)
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)
I have tried with and without headers, but it's the same.
What should I do?

You are not sending a body in your POST request. Your headers hard-code Content-Length: 757, so the server waits for a 757-byte body that never arrives and eventually gives up, which is exactly what 408 (Request Timeout) means.
I have corrected your code; with the body included you will no longer get the 408:
from bs4 import BeautifulSoup
import requests

url = 'https://www.wsl.ch/de/ueber-die-wsl/news/alle-news.filter.html?tx_wslfilter_filter%5Baction%5D=ajax&tx_wslfilter_filter%5Bcontroller%5D=Filter&cHash=88a50dfb12c7c7e03ce68f244dbfda20'
headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
    'Content-Length': '757',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'www.wsl.ch',
    'Origin': 'https://www.wsl.ch',
    'Referer': 'https://www.wsl.ch/de/ueber-die-wsl/news/alle-news.html',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'Server-Timing': 'miss, db;dur=63, app;dur=55.2'}
data = 'tx_wslfilter_filter%5Btype%5D=news&tx_wslfilter_filter%5Bslf%5D=0&tx_wslfilter_filter%5Blang%5D=0&tx_wslfilter_filter%5Bpage%5D=1&tx_wslfilter_filter%5Bperpage%5D=10&tx_wslfilter_filter%5Bkeyword%5D=&tx_wslfilter_filter%5Ball%5D=1&tx_wslfilter_filter%5Bcategory%5D%5B10%5D=10&tx_wslfilter_filter%5Bcategory%5D%5B11%5D=11&tx_wslfilter_filter%5Bcategory%5D%5B12%5D=12&tx_wslfilter_filter%5Bcategory%5D%5B13%5D=13&tx_wslfilter_filter%5Bcategory%5D%5B1%5D=1&tx_wslfilter_filter%5Btag%5D%5B76%5D=76&tx_wslfilter_filter%5Btag%5D%5B1%5D=1&tx_wslfilter_filter%5Btag%5D%5B11%5D=11&tx_wslfilter_filter%5Btag%5D%5B7%5D=7&tx_wslfilter_filter%5Btag%5D%5B9%5D=9&tx_wslfilter_filter%5Btag%5D%5B8%5D=8&tx_wslfilter_filter%5Btag%5D%5B52%5D=52&tx_wslfilter_filter%5Byear%5D=0'
response = requests.post(url, data=data, headers=headers)
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)
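Once the POST succeeds, the returned HTML fragment contains the article links the question asked for. A minimal sketch to pull them out of the soup above (the bare href filter is an assumption; narrow it to the article markup if needed):
# Collect every link in the returned fragment; tighten the filter
# if only the news-article anchors are wanted.
for a in soup.find_all('a', href=True):
    print(a['href'])
As a side note, the form fields could also be passed to requests.post as a plain dict via data=, in which case requests URL-encodes the body and computes Content-Length itself, making the hard-coded Content-Length header unnecessary.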

Scrape data from API that returns 500 in Beautifulsoup

I've been stuck here for hours now. I'm trying to scrape data from this link.
The primary aim is to get the school names and their email IDs. To do that, I first need the list of schools, which does not load here because of the API.
The data loads from an API whose link is this.
The request method is POST, and sending a POST request returns 500.
Proof of work:
import requests

def data_fetch(url="https://scholenopdekaart.nl/api/v1/search/"):
    headers = {
        'accept': 'application/json, text/plain, */*',
        'content-type': 'application/json',
        'referer': 'https://scholenopdekaart.nl/zoeken/middelbare-scholen/?zoektermen=rotterdam&weergave=Lijst',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    }
    response = requests.post(url, headers=headers)
    print(response)  # <Response [500]>
I was able to scrape data from the other links that use GET requests, while the POST request returns nothing. Am I missing anything here? I even tried putting all of these in the headers and still got 500.
headers = {
    'accept': 'application/json, text/plain, */*',
    'accept-encoding': 'gzip, deflate, br',
    'content-type': 'application/json',
    'dnt': '1',
    'origin': 'https://scholenopdekaart.nl',
    'referer': 'https://scholenopdekaart.nl/zoeken/middelbare-scholen/?zoektermen=rotterdam&weergave=Lijst',
    'sec-ch-ua': '"Chromium";v="106", "Brave";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': "Windows",
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'sec-gpc': '1'
}
I'd recommend using Google Chrome's Network tab to copy the request as a cURL command. Then, using Postman, you can import that cURL command and generate Python requests code with all the data.
Since the response is pure JSON, there's no need for bs4.
Consider the following example:
import requests
import json

url = "https://scholenopdekaart.nl/api/v1/search/"
payload = json.dumps({
    "zoekterm": "rotterdam",
    "sectorKeuze": 1,
    "weergave": "Lijst"
})
headers = {
    'authority': 'scholenopdekaart.nl',
    'accept': 'application/json, text/plain, */*',
    'cache-control': 'no-cache',
    'content-type': 'application/json',
    'origin': 'https://scholenopdekaart.nl',
    'pragma': 'no-cache'
}
response = requests.request("POST", url, headers=headers, data=payload)
data = json.loads(response.text)
for school in data['scholen']:
    print(f"{school['bisId']} \t\t {school['naam']}")
This will output:
25592 VSO Op Zuid
25938 Op Noord
569 Accent Praktijkonderwijs Centrum
572 Accent PRO Delfshaven
574 Marnix Gymnasium
578 Portus Zuidermavo-havo
579 Portus Juliana
580 CSG Calvijn vestiging Meerpaal
582 CBSplus, school voor havo en mavo
588 Melanchthon Schiebroek
589 Melanchthon Wilgenplaslaan
4452 Melanchthon Mavo Schiebroek
594 Melanchthon Kralingen
604 Comenius Dalton Rotterdam
26152 Zuider Gymnasium
... and some more ...
Expanding on @0stone0's answer, this works with even less code:
import requests
import json
import pandas as pd

json_data = {
    'zoekterm': 'rotterdam',
    'sectorKeuze': 1,
    'weergave': 'Lijst',
}
response = requests.post('https://scholenopdekaart.nl/api/v1/search/',
                         json=json_data)
data = json.loads(response.content)
df = pd.DataFrame(data["scholen"])
df[["bisId", "naam"]].head()
Output:
bisId naam
0 25592 VSO Op Zuid
1 25938 Op Noord
2 569 Accent Praktijkonderwijs Centrum
3 572 Accent PRO Delfshaven
4 574 Marnix Gymnasium
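As a side note, since the endpoint returns JSON, json.loads(response.content) can be replaced with the built-in shortcut response.json(), which removes the need to import json at all.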

Can't understand request response JSON

I am sending requests to Discord, but the JSON response is encoded. I don't know how to convert it to the JSON that is shown in the developer tools.
cookies = {
    '__dcfduid': '9cdb771aa91811ecbbb166a2644e1ebd',
    '__sdcfduid': '9cdb771aa91811ecbbb166a2644e1ebd4139917683b03f98166d7f12ed245c6b4eec20a521219fa14bc3dcad9fc83f17',
    '__cf_bm': 'XATJg9tmbXKxU0XYhhk_NEc7jJI3G9cezBLutkAPL14-1647867777-0-AVUnjtyleUf5uH4NZKnrirKJ67tGnkxs3rdmrUmPM7jxGtiu7AV0DEfzOThZyTXG+6WwhYAvb4vecRMRQLixl7sX5hKh05wldjuEukidOaruFgJ0EFkBwt9f3/fv678s1g==',
    'locale': 'en-US',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Content-Type': 'application/json',
    'Authorization': 'Token here',
    'X-Super-Properties': 'eyJvcyI6IldpbmRvd3MiLCJicm93c2VyIjoiRmlyZWZveCIsImRldmljZSI6IiIsInN5c3RlbV9sb2NhbGUiOiJlbi1VUyIsImJyb3dzZXJfdXNlcl9hZ2VudCI6Ik1vemlsbGEvNS4wIChXaW5kb3dzIE5UIDEwLjA7IFdpbjY0OyB4NjQ7IHJ2Ojk4LjApIEdlY2tvLzIwMTAwMTAxIEZpcmVmb3gvOTguMCIsImJyb3dzZXJfdmVyc2lvbiI6Ijk4LjAiLCJvc192ZXJzaW9uIjoiMTAiLCJyZWZlcnJlciI6IiIsInJlZmVycmluZ19kb21haW4iOiIiLCJyZWZlcnJlcl9jdXJyZW50IjoiIiwicmVmZXJyaW5nX2RvbWFpbl9jdXJyZW50IjoiIiwicmVsZWFzZV9jaGFubmVsIjoic3RhYmxlIiwiY2xpZW50X2J1aWxkX251bWJlciI6MTE5ODgwLCJjbGllbnRfZXZlbnRfc291cmNlIjpudWxsfQ==',
    'X-Discord-Locale': 'en-US',
    'X-Debug-Options': 'bugReporterEnabled',
    'Origin': 'https://discord.com',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Referer': 'https://discord.com/channels/#me/948159391838904380',
    # Requests sorts cookies= alphabetically
    # 'Cookie': '__dcfduid=9cdb771aa91811ecbbb166a2644e1ebd; __sdcfduid=9cdb771aa91811ecbbb166a2644e1ebd4139917683b03f98166d7f12ed245c6b4eec20a521219fa14bc3dcad9fc83f17; __cf_bm=XATJg9tmbXKxU0XYhhk_NEc7jJI3G9cezBLutkAPL14-1647867777-0-AVUnjtyleUf5uH4NZKnrirKJ67tGnkxs3rdmrUmPM7jxGtiu7AV0DEfzOThZyTXG+6WwhYAvb4vecRMRQLixl7sX5hKh05wldjuEukidOaruFgJ0EFkBwt9f3/fv678s1g==; locale=en-US',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    # Requests doesn't support trailers
    # 'TE': 'trailers',
}
data = {
    'content': 'hello',
    'nonce': 955456750633353216,
    'tts': False,
}
response = requests.post('https://discord.com/api/v9/channels/948159391838904380/messages', headers=headers, cookies=cookies, json=data)
print(response.status_code)
print(response.content)
The response is
b'\x03\xfe\x00\x00dS\x97\xd3\xaa\xd3\xfa$\x03.s7vc}\x1bR\xf2D\xeb\x8e\xfc#\xda\xfa!\xe2\x96\x05Q\x8b\xc20\r0\x0b(\xf8!\x080\x93\x0b\xa6\xa7\xfa\xf6\xecd\x08\xbd;\xf1\xcf\x03\xdb\x8a\x1909\x17bJ\xde\xc4\x98\x8c\xf3Q\x13.\x01\x8f\xb7\x991\x03-\x01G\xbf\xed3\xe0?\x0f\xc3dq\t\xc8\xdd\xa8\xbe\xe30\xd6ze\xb5O\x9a\xbcsQ\x90\xe7\xd0fH\xc7\xd9`}H\xc6{\x1d\x83#-\xf2\xe9\xc0{)#c\x06\xdc\x16i[\x7f8\x97c\xb9\x90\xd30,\x01k{h\xf6\xed\xd8J9\x86]\xa1I1\xe1\x12pyzM\xf3\xbd\x1d\xca\xdf\x013\xd0k\t\xd8\xfc\xc3\x013\xbc\x7f.\x01y\xfc\xe5\xaa\xa1\xd9j\xc5\x13\xbe\xf7\xd3\xc0\x06\xe5\xdc\x8ap\xc5\x0c\xdb2\x1c\xf8e\xb6D}?\t\x9b\xdd1\x1bi^\xb6#\x1f\x8ee\x9c1\x03j\xd2zEf\xa5\xf5\x9bRY\xc7Ln\xad\x14\x11\xd1\x82(S\xc7g\\\xdb#\xd7o_[;\xd5f\xc0\x18\xc6y\x92\xebe\x92\x86\x81\x14\xa2w\xd6yR\xc1\x07R\x16(\xfcI>bi\xb8~\x13X\x12\xcc \xa7ax\x19\x03'
Does anyone know how to get the JSON as shown in the developer tools? I have never seen this type of response and know nothing about it. I want to understand this for educational purposes: the message is sent fine, but I can't read the response.
You should use response.json() to get it as JSON; response.content returns the raw binary body.
In this case the body is also most likely Brotli-compressed: your Accept-Encoding header advertises br, and requests only decompresses Brotli automatically when a Brotli package (brotli or brotlicffi) is installed.
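The simplest fix is to stop advertising Brotli, so the server falls back to gzip or deflate, which requests always decompresses on its own. A minimal sketch, reusing the url, headers, cookies, and data from the question:
# Drop 'br' from Accept-Encoding; gzip/deflate responses are decoded
# automatically by requests, so the body arrives as plain JSON.
headers['Accept-Encoding'] = 'gzip, deflate'
response = requests.post('https://discord.com/api/v9/channels/948159391838904380/messages',
                         headers=headers, cookies=cookies, json=data)
print(response.json())  # now a readable dict
Alternatively, run pip install brotli once; requests (via urllib3) then decodes br responses transparently, and response.json() works without changing any headers.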

How do I web-scrape a website that uses iframes?

I am trying to scrape this website: 'https://swimming.org.nz/results.html'. In the form that comes up, I am filling in only the Age column as 8 to 8. I am using the following code to scrape the table, as suggested elsewhere on Stack Overflow, but I am unable to get the table. How do I get all the tables for this age group (8 to 8)?
import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get("https://swimming.org.nz/results.html")
soup = BeautifulSoup(r.content, "html.parser")
iframe_src = soup.select_one("x-MS_FIELD_AGE.FROM.L").attrs["src"]
r = s.get(f"https:{iframe_src}")
soup = BeautifulSoup(r.content, "html.parser")
for row in soup.select("x-form-text x-form-field"):
    print("\t".join([e.text for e in row.select("th, td")]))
If you look at the developer tools in your browser, you will see that BeautifulSoup is not necessary: the request to send is below, and the response type is XML, so you don't need any scraping tool. You can get all of the data by changing StartRowIndex and MaximumRowCount.
import requests
url = "https://connect.swimming.org.nz/snz-wrap-public/pages/pageRequestHandler?tunnelTarget=tableData%2F%3F&data_file=MS.COMP.RESULTS&dict_file=MS.COMP.RESULTS&doGet=true"
payload="StartRowIndex=0&MaximumRowCount=100&sort=BY-DSND%20COMP.DATE%20BY-DSND%20STAGE&dir=ASC&tid=extTable1620108707767_4352538&selectCriteria=GET-LIST%20CMS_TABLE_19483_65507_076_184811_&extraColumns=%3CColumns%20DynamicLinkRoot%3D%22https%3A%2F%2Fconnect.swimming.org.nz%3A443%2Fsnz-wrap-public%2Fworkflows%2F%22%3E%3CColumn%3E%3CColumnName%3EExpander%3C%2FColumnName%3E%3CField%3EFRAGMENT_DISPLAY.SPLITS%3C%2FField%3E%3CShowInExpander%3Etrue%3C%2FShowInExpander%3E%3C%2FColumn%3E%3CColumn%3E%3CFieldExpression%3E%7BMEMBER.FORE1%7D%20%7BMEMBER.SURNAME%7D%3C%2FFieldExpression%3E%3CField%3EEXPRESSION_FIELD_1%3C%2FField%3E%3CColumnName%3EName%2520%3C%2FColumnName%3E%3CWidth%3E130%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXGENDER%3C%2FField%3E%3CColumnName%3EGender%3C%2FColumnName%3E%3CWidth%3E50%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EENTRANT.AGE%3C%2FField%3E%3CColumnName%3EAge%3C%2FColumnName%3E%3CWidth%3E35%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CFieldExpression%3E%7BXCATEGORY2%7D%3C%2FFieldExpression%3E%3CField%3ECATEGORY2.NUM%24%24SNZ%3C%2FField%3E%3CColumnName%3EDistance%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXCATEGORY1%3C%2FField%3E%3CColumnName%3EStroke%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CFieldExpression%3E%7BTIME%24%24SNZ%7D%3C%2FFieldExpression%3E%3CField%3ERESULT.TIME.MILLISECONDS%3C%2FField%3E%3CColumnName%3ETime%2520%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.POINTS%24%24SNZ%3C%2FField%3E%3CColumnName%3EFINA%2520Points%3C%2FColumnName%3E%3CWidth%3E85%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.YEAR%24%24SNZ%3C%2FField%3E%3CColumnName%3EPoints%2520Year%3C%2FColumnName%3E%3CWidth%3E80%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3E%24DATE%24COMP.DATE%3C%2FField%3E%3CColumnName%3EDate%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXEVENT.CODE%3C%2FField%3E%3CColumnName%3EMeet%3C%2FColumnName%3E%3CWidth%3E190%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EPARAMETER1%3C%2FField%3E%3CColumnName%3ECourse%3C%2FColumnName%3E%3CWidth%3E50%3C%2FWidth%3E%3C%2FColumn%3E%3C%2FColumns%3E&extraColumnsDownload=%3CDownloadColumns%20DynamicLinkRoot%3D%22https%3A%2F%2Fconnect.swimming.org.nz%3A443%2Fsnz-wrap-public%2Fworkflows%2F%22%3E%3CColumn%3E%3CField%3EXGENDER%3C%2FField%3E%3CColumnName%3EGender%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EENTRANT.AGE%3C%2FField%3E%3CColumnName%3EAge%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXCATEGORY2%3C%2FField%3E%3CColumnName%3EDistance%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXCATEGORY1%3C%2FField%3E%3CColumnName%3EStroke%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3ETIME%24%24SNZ%3C%2FField%3E%3CColumnName%3ETime%2520%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.POINTS%24%24SNZ%3C%2FField%3E%3CColumnName%3EFINA%2520Points%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.YEAR%24%24SNZ%3C%2FField%3E%3CColumnName%3EPoints%2520Year%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3E%24DATE%24COMP.DATE%3C%2FField%3E%3CColumnName%3EDate%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXEVENT.CODE%3C%2FField%3E%3CColumnName%3EMeet%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EPARAMETER1%3C%2FField%3E%3CColumnName%3ECourse%3C%2FColumnName%3E%3C%2FC
olumn%3E%3C%2FDownloadColumns%3E"
headers = {
    'Connection': 'keep-alive',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
    'accept': '*/*',
    'x-requested-with': 'XMLHttpRequest',
    'accept-language': 'tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7,ru;q=0.6',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Origin': 'https://connect.swimming.org.nz',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://connect.swimming.org.nz/snz-wrap-public/workflows/COMP.RESULTS.FIND',
    'Cookie': 'JSESSIONID=93F2FEA63BA41ECB2505E2D1CD76374D; _ga=GA1.3.1735786808.1620106921; _gid=GA1.3.1806138988.1620106921'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
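Since the response body is XML, the standard library is enough to pull the rows out rather than printing the raw text. A minimal sketch that makes no assumptions about the tag names; it just walks the tree so you can discover the structure and then select the elements you need:
import xml.etree.ElementTree as ET

# Parse the raw bytes (fromstring on bytes also copes with an XML
# declaration that names an encoding).
root = ET.fromstring(response.content)

# Print every element's tag and text to map out the document first,
# then filter for the elements that hold the result rows.
for elem in root.iter():
    print(elem.tag, (elem.text or '').strip())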

Python POST to check a box on a webpage

I am trying to scrape a webpage that posts prices for the Mexico power market. The webpage has checkboxes that need to be checked for the file with prices to show up. Once I get the relevant box checked, I want to pull the links on the page and check whether the particular file I am looking for is posted. I am getting stuck in the first part, where I select the checkbox using requests.post. I used Fiddler to track the changes when I post and passed those arguments through requests.post.
I was expecting to be able to parse out all the 'href' links in the response, but I didn't get any. Any help in redirecting me toward a solution would be greatly appreciated.
Below is the relevant portion of the code I am using:
data = {
    "ctl00$ContentPlaceHolder1$toolkit": "ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$treePrincipal",
    "__EVENTTARGET": "ctl00$ContentPlaceHolder1$treePrincipal",
    "__EVENTARGUMENT": {"commandName": "Check", "index": "0:0:0:0"},
    "__VIEWSTATE": "/verylongstringhere",
    "__VIEWSTATEGENERATOR": "6B88769A",
    "__EVENTVALIDATION": "/wEdAAPhpIpHlL5kdIfX6MRCtKcRwfFVx5pEsE3np13JV2opXVEvSNmVO1vU+umjph0Dtwe41EcPKcg0qvxOp6m6pWTIV4q0ZOXSBrDwJTrxjo3dZg==",
    "ctl00_ContentPlaceHolder1_treePrincipal_ClientState": {"expandedNodes": [], "collapsedNodes": [], "logEntries": [], "selectedNodes": [], "checkedNodes": ["0", "0:0", "0:0:0", "0:0:0:0"], "scrollPosition": 0},
    "ctl00_ContentPlaceHolder1_ListViewNodos_ClientState": "",
    "ctl00_ContentPlaceHolder1_NotifAvisos_ClientState": "",
    "ctl00$ContentPlaceHolder1$NotifAvisos$hiddenState": "",
    "ctl00_ContentPlaceHolder1_NotifAvisos_XmlPanel_ClientState": "",
    "ctl00_ContentPlaceHolder1_NotifAvisos_TitleMenu_ClientState": "",
    "__ASYNCPOST": "true"
}
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Length': '26255',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': '_ga=GA1.3.1966843891.1571403663; _gid=GA1.3.1095695800.1571665852',
    'Host': 'www.cenace.gob.mx',
    'Origin': 'https://www.cenace.gob.mx',
    'Referer': 'https://www.cenace.gob.mx/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
    'X-MicrosoftAjax': 'Delta=true',
    'X-Requested-With': 'XMLHttpRequest'
}
url = "https://www.cenace.gob.mx/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx"
r = requests.post(url, data=data, headers=headers, verify=False)
This is what Fiddler showed on the POST.
Maybe you have incorrect __EVENTVALIDATION or __VIEWSTATE fields. You can fetch the initial page and scrape all the inputs with their initial values.
The following code grabs the inputs on the first request, edits them as you did, and then sends the POST request, scraping all the href values:
import requests
import json
from bs4 import BeautifulSoup

base_url = "https://www.cenace.gob.mx"
url = "{}/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx".format(base_url)

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

payload = dict([
    (t['name'], t.get('value', ''))
    for t in soup.select("input")
    if t.has_attr('name')
])
payload['ctl00$ContentPlaceHolder1$toolkit'] = 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$treePrincipal'
payload['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$treePrincipal'
payload['__ASYNCPOST'] = 'true'
payload['__EVENTARGUMENT'] = json.dumps({
    "commandName": "Check",
    "index": "0:1:1:0"
})
payload['ctl00_ContentPlaceHolder1_treePrincipal_ClientState'] = json.dumps({
    "expandedNodes": [], "collapsedNodes": [],
    "logEntries": [], "selectedNodes": [],
    "checkedNodes": ["0", "0:1", "0:1:1", "0:1:1:0"],
    "scrollPosition": 0
})

r = requests.post(url, data=payload, headers={
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"
})
soup = BeautifulSoup(r.text, "html.parser")
print([
    "{}/{}".format(base_url, t["href"])
    for t in soup.findAll('a')
    if not t["href"].startswith('javascript')
])

Forbidden response when logging into website via requests

I am trying to login via requests and receiving a forbidden response.
import requests
from bs4 import BeautifulSoup

header = {
    'Host': 'fetlife.com',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    #'Cookie': 'language=en; _fl_sessionid=???',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'If-None-Match': 'W/"7d905c0faa3450522096dfbfaea7558a"',
    'Cache-Control': 'max-age=0',
}
login_post_url = 'https://fetlife.com/users/sign_in'
internal_url = 'https://fetlife.com/home'

with requests.Session() as sesh:
    response = sesh.get(login_post_url, headers=header)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    hidden_tags = soup.find_all("input", type="hidden")
    payload = {
        'utf8': '/',
        'authenticity_token': soup.find(attrs={'name': 'authenticity_token'})['value'],
        'user[otp_attempt]': 'step_1',
        'user_locale': 'en',
        'user[login]': 'un',
        'user[password]': 'pw',
    }
    sesh.post(login_post_url, data=payload, headers=header)
    response = sesh.get(internal_url)
    html = response.text
    print(html)
