I am trying to scrape some data from a Korean website for goods.
The website displays general data for cargo ships, such as arrival date, departure date, and mother ship name.
Website Link
The black button on the right is the search button.
In order to obtain data from it, some radio buttons have to be set and then the search button clicked.
So I thought I could send a POST request to the website and extract the data from the response.
Unfortunately, the response was just a plain page without the requested data.
This is the POST request:
POST /Berth_status_text_servlet_sw_kr HTTP/1.1
Accept: text/html, application/xhtml+xml, image/jxr, */*
Referer: http://info.bptc.co.kr:9084/content/sw/frame/berth_status_text_frame_sw_kr.jsp
Accept-Language: en-US,en;q=0.7,ko;q=0.3
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
Content-Length: 40
Host: info.bptc.co.kr:9084
Pragma: no-cache
Connection: close
v_time=month&ROCD=ALL&ORDER=item2&v_gu=S
And this is what I did in Python:

from bs4 import BeautifulSoup
import requests

url = "http://info.bptc.co.kr:9084/Berth_status_text_servlet_sw_kr"
params = {'v_time': 'month',
          'ROCD': 'ALL',
          'ORDER': 'item2',
          'v_gu': 'S'}
response = requests.post(url, data=params)
soup = BeautifulSoup(response.content, "html.parser")
print(soup)
I did try to put the encoding and other things in the headers, like below:

response = requests.post(
    url,
    data=params,
    headers={
        'Accept': 'text/html, application/xhtml+xml, image/jxr, */*',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Content-type': 'application/x-www-form-urlencoded; text/html; charset=euc-kr',
        'Accept-Language': 'en-US,en;q=0.7,ko;q=0.3',
    },
)
It did not work either.
The code works fine on other websites, so I guess it is something related to the Korean characters.
I tried to search for solutions to the issue, but I didn't have any luck.
Would you mind helping me?
Thanks!
Your approach is correct. The response returns HTML, and you need to parse it into a more usable format. The following code converts the table from the HTML response into a list of dicts:
from bs4 import BeautifulSoup
import requests

params = {"v_time": "month", "ROCD": "ALL", "ORDER": "item2", "v_gu": "S"}
response = requests.post(
    "http://info.bptc.co.kr:9084/Berth_status_text_servlet_sw_kr", data=params
)
soup = BeautifulSoup(response.content, features="html.parser")

keys = [th.get_text(strip=True) for th in soup("th")]
data = [
    {key: value.get_text(strip=True) for key, value in zip(keys, row("td"))}
    for row in soup("tr")
]
print(data)
Prints:
[
    {
        "S/H": "0",
        "모선항차": "DPYT-21",
        "반입 마감일시": "",
        "선박명": "PEGASUS YOTTA",
        "선사": "DYS",
        "선석": "2",
        "선적": "0",
        "양하": "0",
        "입항 예정일시": "2020/06/08 21:00",
        "입항일시": "",
        "전배": "",
        "접안": "P",
        "출항 예정일시": "2020/06/09 11:00",
        "출항일시": "",
        "항로": "NCK",
    }
    ...
]
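If you want to keep the result, the list of dicts can be written straight to disk, for example with the standard csv module. A minimal sketch, reusing the data and keys variables from the code above (the file name is mine, and the empty dict produced by the header row is filtered out):

import csv

rows = [row for row in data if row]  # the header <tr> has no <td> cells, so it yields an empty dict

with open("berth_status.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=keys)
    writer.writeheader()
    writer.writerows(rows)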
Related
I'm trying to log in to a web application with Python, but every attempt ends with a 500 error, and the HTML body shows the error [HttpAntiForgeryException]. I tried to apply a few solutions from other questions here, but nothing helped. So now I'm stuck at the first request, whose response gives me a 500.
import requests
from bs4 import BeautifulSoup

url = "http://localhost:52053/Account/Login"
username = "test#test.sk"
user_password = "pass"

session = requests.Session()
response = session.get(url)
soup = BeautifulSoup(response.content, features="html.parser")
#print(soup)

states = ["__RequestVerificationToken", "Email", "RememberMe"]
login_data = {"username": username, "password": user_password, "Login": "submit"}
headers = {"Host": "localhost:52053",
           "Content-Type": "application/x-www-form-urlencoded",
           "Connection": "close",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0)",
           "Cookie": str(session.cookies.get_dict())}

for state in states:  # search for existing aspnet states and get its values
    result = soup.find('input', {'name': state})
    if not (result is None):  # when existent (some may not be needed!)
        if state == "Email":
            login_data.update({state: login_data["username"]})
        else:
            login_data.update({state: result['value']})

post_request = session.post(url, headers=headers, data=login_data)
A successful login attempt looks like this:
POST /Account/Login HTTP/1.1
Host: localhost:52053
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: sk,en-US;q=0.7,en;q=0.3
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded
Content-Length: 193
Origin: http://localhost:52053
Connection: close
Referer: http://localhost:52053/Account/Login
Cookie: __RequestVerificationToken=j9yFGpTFSlH5_aQt0k-Gvz10I16TVXbDk31NKPm1HkcWsksUfKXkjL567yFplCS_VovTR7lVuEgNjwgp-EO3RjNj4gQOvNUXnPkjymZx_jA1
Upgrade-Insecure-Requests: 1
__RequestVerificationToken=LjHuOdKSCr1A7KRDNie4GUnCZ3qRwUCdHyLlPYT40DsEB-GNUvEKxe5nvZWf5gZ4ZflwI43xGWPyYu8GI15wroEg9WRRVtSzZ9-KY9Mu_JA1&Email=test%40test.sk&Password=pass&RememberMe=false
The following response is:
HTTP/1.1 302 Found
Cache-Control: no-cache
Pragma: no-cache
Content-Type: text/html; charset=utf-8
Expires: -1
Location: /
Server: Microsoft-IIS/10.0
X-AspNetMvc-Version: 5.2
X-AspNet-Version: 4.0.30319
Set-Cookie: .AspNet.ExternalCookie=; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: .AspNet.ApplicationCookie=KCLm03FHj8v_6rIpTzBTm7EzEtzpKmIz1Z9_z29wycUSqUVyKbGEmptXUwG41MqNOMR7Vbeq2u576ijazupNLffLP-Ua0n60aLmnVSDsLsdTqYT7jjqyGPw1Ppp8AnIDs3sdefmksazX2UvKTxzxRBufFCoxtCJx51mWtBv7v0JzUeC1hnfu1AIJ7GH_8T59KD3iv0hRSHDqlWHlkWzyN1Xt0m5ixC14e4eC2YxEm3_acy96atB2Jv5u0HREPzssLmywuzj6sLa9cHCllTG2gMVWvHA3IDhCWu7Ojf8BO02Eml3pPM5QTJ-sq540fcj9QyELayUOwBZWffSgsJeq8mlt3FupQcJ-JTJxDzAsDc4Cmk-BcvYSfpAJq4SdR-Y4mTN_6vu-wwAOLZPSgh-5K7guWmZ3VfRitZHXd_rvTEmMiVrgHFTEQAkUYu4zTSupxRplTtKb1VSDs0Nc1uEos2z0_aw-nBbRBrTPpvmqGok
The auth flow continues with this request. I'm not trying to send this request yet (I put it here just for better context):
GET / HTTP/1.1
Host: localhost:52053
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: sk,en-US;q=0.7,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://localhost:52053/Account/Login
Connection: close
Cookie: __RequestVerificationToken=j9yFGpTFSlH5_aQt0k-Gvz10I16TVXbDk31NKPm1HkcWsksUfKXkjL567yFplCS_VovTR7lVuEgNjwgp-EO3RjNj4gQOvNUXnPkjymZx_jA1;
.AspNet.ApplicationCookie=gvv113IJhtdaOhdc0Rz2N--5Ob18W6gS64J3wtOJggRTqE70h-8HyBGQAmLvSM2qCV2e-dXR2Uto-BktD6NmNz6dJtxckIYasPOfqodDNZX33YJxNEDg7a64LPi1bNnmrnvQcOHAceQNqZDykXrhFm55dqoo1oZnJHfZQnltwqAdg7DGO31PZpzu-GAZh2_gzuxd_saJdS09ZZQrc9h7WiU2ONqeya87pSAN7ZyHQ_XvsU5cUwDGq7FWLpzlIeeZWkay6iWVmCSwNEofpdVsb880P3XZnFKEj2SW2PfazdNLfgy86YNjkoD6_3Vb1BLirRoSP0XIQMcs2F_CzgXkxD5GvDray8TPYqcQJ4L2fikReUJHadx9fFnslF2BFcnKYC8D-Xusrda_5r-CQoQ4SzAe2Cqn0h1NYHxS1wsxt35neC5RuQ3geadAEEghjrSSVhSl8jCfACtQtcBeNL2x_m6I9L3XJCjMpzJjtP6up3E
Upgrade-Insecure-Requests: 1
The next response is just a 200 - you are in.
So my problem is that the response from the first request is failing. Is someone able to see a mistake, or did I forget something?
Failed response from the first request call:
HTTP/1.1 500 Internal Server Error
Cache-Control: private
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/10.0
X-AspNet-Version: 4.0.30319
X-SourceFiles: =?UTF-8?B?QzpcVXNlcnNcUENBZG1pbmlzdHJhdG9yXERlc2t0b3BccGVuIHRlc3RpbmdcU2VjdXJpdHlXb3Jrc2hvcC1EVldBLW1hc3RlclxkdndhLXRyYWluaW5nXGR2d2EtdHJhaW5pbmdcQWNjb3VudFxMb2dpbg==?=
If I try to print the request headers and login_data, the result is:
print(post_request.request.headers)
{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0)', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'close', 'Host': 'localhost:52053', 'Content-Type': 'application/x-www-form-urlencoded', 'Cookie': "{'__RequestVerificationToken': 'yg-7mFRyZiONwsZ2dIVkIIW5tB7gSL_sazgphg-VuW2OpNNRRkxmLH-9SZJXiN9whUC_BYTo8RgsiDrVjcYtLEf9anW56rVwZ2RQPzxHA481'}", 'Content-Length': '249'}
print(login_data)
{'username': 'test#test.sk', 'password': 'pass', 'Login': 'submit', '__RequestVerificationToken': '14OuwaRqldlGKi93C91zf6QD_ouOorHBDe63s4KgfP3gbt85V0QMy2X5OMwWAo1TUrD8zJ-zoZbXLPpgDI_wrxVZv3ceYNos_e5_elFhVt01', 'Email': 'test#test.sk', 'RememberMe': 'true', 'Password': 'pass'}
I just found the solution.
Requests can handle all the headers by itself (my headers were, for some reason, causing errors), and it was following the redirect through to a 200 response, so I didn't immediately see that the login was actually working and returning a 302 response.
I found out by printing:
print(post_request.history)
which gave me [<Response [302]>]
Now that I know there is a redirection, I just have to set allow_redirects=False, and I'm able to catch my Set-Cookie header.
The full code, which gets the expected responses, is:
import requests
from bs4 import BeautifulSoup

url = "http://localhost:52053/Account/Login"
username = "test#test.sk"
user_password = "pass"

session = requests.Session()
response = session.get(url)
soup = BeautifulSoup(response.content, features="html.parser")
#print(soup)

states = ["__RequestVerificationToken", "Email", "RememberMe"]
login_data = {"username": username, "password": user_password, "Login": "submit"}

for state in states:  # search for existing aspnet states and get its values
    result = soup.find('input', {'name': state})
    if not (result is None):  # when existent (some may not be needed!)
        if state == "Email":
            login_data.update({state: login_data["username"]})
        else:
            login_data.update({state: result['value']})

post_request = session.post(url, data=login_data, allow_redirects=False)
print(login_data)

# the code below tests whether the HttpAntiForgeryException is in the response
if "HttpAntiForgeryException" not in post_request.text:
    print(post_request.headers)
else:
    print("antiforgery")
This is the URL: https://www.lowes.com/store/AK-Anchorage/2955. When we reach this URL there is a button named "Shop this store". The request made by clicking the button and the request made by following the link are the same, but after clicking the button one gets a different page than when using the link directly. I need to make the same request as the button is making.
I need to make a request to "https://www.lowes.com/store/AK-Anchorage/2955" and then make the same request that clicking the button makes.
I have tried making the request two consecutive times to get the desired page, but no luck.
from fake_useragent import UserAgent
import requests

url = 'https://www.lowes.com/store/AK-Anchorage/2955'
ua = UserAgent()
header = {'User-Agent': str(ua.chrome)}
response = requests.get(url, headers=header)
response = requests.get(url, headers=header)
So, this seems to work. I get a 200 OK response both times, and the content isn't the same length.
For what it's worth, in Firefox, when I click the blue "Shop this store" button, it takes me to what appears to be the exact same page, but without the blue button I just clicked. In Chrome (Beta), when I click the blue button, I get a 403 Access denied page. Their server isn't playing nice. You might struggle to achieve what you want to achieve.
If I call session.get without my headers, I never get a response at all. So they're obviously checking the user-agent, possibly cookies, etc.
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "Accept-Language": "en-US,en;q=0.5",
           "Accept-Encoding": "gzip, deflate, br",
           "Upgrade-Insecure-Requests": "1"}

session = requests.Session()
url = "https://www.lowes.com/store/AK-Anchorage/2955"

response1 = session.get(url, headers=headers)
print(response1, len(response1.content))

response2 = session.get(url, headers=headers)
print(response2, len(response2.content))
Output:
<Response [200]> 56282
<Response [200]> 56323
I've done some more testing. The server times out if you don't change the user-agent from the default Python Requests one. Even changing it to "" seems to be enough for the server to give you a response.
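For example, a quick test along these lines should show the difference (a sketch based on the observation above):

import requests

# an empty User-Agent string appears to be accepted; only the default
# "python-requests/x.y" value seems to make the server hang
r = requests.get("https://www.lowes.com/store/AK-Anchorage/2955",
                 headers={"User-Agent": ""}, timeout=10)
print(r.status_code, len(r.content))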
You can get product information, including description, specifications, and price, without selecting a specific store. Take a look at this GET request, with no cookies, and no session:
import requests, json

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0"}
url = "https://www.lowes.com/pd/Google-Nest-Learning-Thermostat-3rd-Gen-Thermostat-and-Room-Sensor-with-with-Wi-Fi-Compatibility/1001080012"

r = requests.get(url, headers=headers, timeout=5)
print("return code:", r)
print("content length:", len(r.content))

for line in r.text.splitlines():
    if "window.digitalData.products = [" in line:
        print("This line includes the 'sellingPrice' and the 'retailPrice'. After some splicing, we can treat it as JSON.")
        left = line.find(" = ") + 3
        right = line.rfind(";")
        print(json.dumps(json.loads(line[left:right]), indent=True))
        break
Output:
return code: <Response [200]>
content length: 107134
This line includes the 'sellingPrice' and the 'retailPrice'. After some splicing, we can treat it as JSON.
[
 {
  "productId": [
   "1001080012"
  ],
  "productName": "Nest_Learning_Thermostat_3rd_Gen_Thermostat_and_Room_Sensor_with_with_Wi-Fi_Compatibility",
  "ivm": "753160-83910-T3007ES",
  "itemNumber": "753160",
  "vendorNumber": "83910",
  "modelId": "T3007ES",
  "type": "ANY",
  "brandName": "Google",
  "superCategory": "Heating & Cooling",
  "quantity": 1,
  "sellingPrice": 249,
  "retailPrice": 249
 }
]
The product description and specification can be found in this element:
<section class="pd-information met-product-information grid-100 grid-parent v-spacing-jumbo">
(It's ~300 lines, so I'm just going to copy the parent tag.)
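If you only want that block, you can pull it out of the same response with BeautifulSoup. A sketch, assuming r is the response from the product-page request above (the class name is taken from the tag just shown):

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, features="html.parser")
section = soup.find("section", class_="pd-information")
if section is not None:
    # print only the first few hundred characters of the description/specification text
    print(section.get_text(" ", strip=True)[:500])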
There's an API that takes a product id and store number, and returns the pricing information:
import requests, json
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0"}
url = "https://www.lowes.com/PricingServices/price/balance?productId=1001080012&storeNumber=1955"
r = requests.get(url, headers=headers, timeout=5)
print("return code:", r)
print("content length:", len(r.content))
print(json.dumps(json.loads(r.text), indent=True))
Output:
return code: <Response [200]>
content length: 768
[
 {
  "productId": 1001080012,
  "storeNumber": 1955,
  "isSosVendorDirect": true,
  "price": {
   "selling": "249.00",
   "retail": "249.00",
   "typeCode": 1,
   "typeIndicator": "Regular Price"
  },
  "availability": [
   {
    "availabilityStatus": "Available",
    "productStockType": "STK",
    "availabileQuantity": 822,
    "deliveryMethodId": 1,
    "deliveryMethodName": "Parcel Shipping",
    "storeNumber": 907
   },
   {
    "availabilityStatus": "Available",
    "productStockType": "STK",
    "availabileQuantity": 8,
    "leadTime": 1570529161540,
    "deliveryMethodId": 2,
    "deliveryMethodName": "Store Pickup",
    "storeNumber": 1955
   },
   {
    "availabilityStatus": "Available",
    "productStockType": "STK",
    "availabileQuantity": 1,
    "leadTime": 1570529161540,
    "deliveryMethodId": 3,
    "deliveryMethodName": "Truck Delivery",
    "storeNumber": 1955
   }
  ],
  "#type": "item"
 }
]
It can take multiple product numbers. For example:
https://www.lowes.com/PricingServices/price/balance?productId=1001080046%2C1001135076%2C1001091656%2C1001086418%2C1001143824%2C1001094006%2C1000170557%2C1000920864%2C1000338547%2C1000265699%2C1000561915%2C1000745998&storeNumber=1564
You can get information on every store by using this API, which returns a 1.6 MB JSON file. maxResults is normally set to 30, and query is your longitude and latitude. I would suggest saving this to disk; I doubt it changes much.
https://www.lowes.com/wcs/resources/store/10151/storelocation/v1_0?maxResults=2000&query=0%2C0
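For example, the store list could be fetched once and cached locally; a sketch (the file name is mine):

import requests, json

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0"}
url = "https://www.lowes.com/wcs/resources/store/10151/storelocation/v1_0?maxResults=2000&query=0%2C0"

stores = requests.get(url, headers=headers, timeout=30)
with open("lowes_stores.json", "w") as f:
    json.dump(stores.json(), f)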
Keep in mind the PricingServices/price/balance endpoint can take multiple values for storeNumber separated by %2C (a comma), so you won't need 1763 separate GET requests. I still made multiple requests using a requests.Session (so it reuses the underlying connection).
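A sketch of what that batching might look like; the store numbers below are just the ones that appear in the examples above, not a real list:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0"}
store_numbers = ["907", "1564", "1955"]   # example values; in practice, read them from the store list
product_id = "1001080012"

session = requests.Session()               # reuses the underlying connection
session.headers.update(headers)

# several store numbers go into a single request, joined by commas (encoded as %2C)
r = session.get("https://www.lowes.com/PricingServices/price/balance",
                params={"productId": product_id, "storeNumber": ",".join(store_numbers)},
                timeout=5)
print(r.status_code, len(r.json()))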
It depends on what you want to do with the data. In the URL you already have the shop ID.
When clicking the button, it issues a request to https://www.lowes.com/store/api/2955 to get the shop information. Is that what you're looking for?
If so, you don't need two requests, but rather just one to get the needed shop information.
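A single request along those lines might look like this; a sketch that assumes the endpoint answers with JSON (and adds a User-Agent because, as noted in the other answer, the server ignores the default one):

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0"}
r = requests.get("https://www.lowes.com/store/api/2955", headers=headers, timeout=5)
print(r.status_code)
print(r.json())   # assumption: shop information comes back as JSON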
I can't figure out how to correctly set up a POST request with the following data:
General
Request URL: https://myurl.com/install/index.cgi
Request Method: POST
Request Headers
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en
Cache-Control: max-age=0
Connection: keep-alive
Content-Length: 48
Content-Type: application/x-www-form-urlencoded
Host: myurl.com
Origin: https://myurl.com
Referer: https://myurl.com/install/
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Form Data
page: install
state: STATUS
I can do the following:
import requests

path = "https://myurl.com/install/index.cgi"
headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
           "Accept-Encoding": "gzip,deflate,br",
           "Accept-Language": "en-US,en;q=0.8",
           "Cache-Control": "max-age=0",
           "Connection": "keep-alive",
           "Content-Length": "48",
           "Content-Type": "application/x-www-form-urlencoded",
           "Host": "myurl.com",
           "Origin": "https://myurl.com",
           "Referer": "https://myurl.com/install/?s=ROM",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}

f = requests.put(path, headers=headers)
But how do I handle the form data? Under the form data there are page: install and state: STATUS.
How do I include these in my POST request?
Just add data= to your request:
import requests

path = ...
headers = ...

form_data = {
    "page": "install",
    "state": "STATUS",
}

f = requests.post(path, headers=headers, data=form_data)
I presume you know how to use the developer tools in the browser of your choice. The following is a template I follow (a requests-based sketch of it comes right after this list):
Load the page (GET)
Use XPath to find the element I'm targeting, e.g. the username field
Set the value of that element
Post the page (POST)
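Translated into requests and BeautifulSoup (instead of XPath/Selenium calls), that template might look roughly like this; the URLs are placeholders, and the form values come from the Form Data shown in the question:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# 1. Load the page (GET)
page = session.get("https://example.com/install/")          # placeholder URL
soup = BeautifulSoup(page.content, features="html.parser")

# 2./3. Collect the form's input fields and set the values I'm targeting
form_data = {inp["name"]: inp.get("value", "")
             for inp in soup.find_all("input", attrs={"name": True})}
form_data["page"] = "install"
form_data["state"] = "STATUS"

# 4. Post the page (POST)
response = session.post("https://example.com/install/index.cgi", data=form_data)
print(response.status_code)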
I am using the requests module for Python to try to log in to a webpage. I open up a requests.Session(), then I get the cookie and the CSRF token, which is included in a meta tag. I build up my payload with the username, password, a hidden input field and the CSRF token from the meta tag. After that I use the post method, passing the login URL, the cookie, the payload and the headers. But after that I can't access a page behind the login page.
What am I doing wrong?
This is the request header when I perform a login:
Request Headers:
:authority: www.die-staemme.de
:method: POST
:path: /page/auth
:scheme: https
accept: application/json, text/javascript, */*; q=0.01
accept-encoding: gzip, deflate, br
accept-language: de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7
content-length: 50
content-type: application/x-www-form-urlencoded
cookie: cid=261197879; remember_optout=0; ref=start;
PHPSESSID=3eb4f503f38bfda1c6f48b8f9036574a
origin: https://www.die-staemme.de
referer: https://www.die-staemme.de/
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
x-csrf-token: 3c49b84153f91578285e0dc4f22491126c3dfecdabfbf144
x-requested-with: XMLHttpRequest
This is my code so far:
import requests
from bs4 import BeautifulSoup as bs
import lxml
# Page header
head = {'Content-Type': 'application/x-www-form-urlencoded',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
        }
# Start Page
url = 'https://www.die-staemme.de/'
# Login URL
login_url = 'https://www.die-staemme.de/page/auth'
# URL behind the login page
url2= 'https://de159.die-staemme.de/game.php?screen=overview&intro'
# Open up a session
s = requests.session()
# Open the login page
r = s.get(url)
# Get the csrf-token from meta tag
soup = bs(r.text,'lxml')
csrf_token = soup.select_one('meta[name="csrf-token"]')['content']
# Get the page cookie
cookie = r.cookies
# Set CSRF-Token
head['X-CSRF-Token'] = csrf_token
head['X-Requested-With'] = 'XMLHttpRequest'
# Build the login payload
payload = {
    'username': '',  # <-- your username
    'password': '',  # <-- your password
    'remember': '1'
}
# Try to login to the page
r = s.post(login_url, cookies=cookie, data=payload, headers=head)
# Try to get a page behind the login page
r = s.get(url2)
# Check if login was successful, if so there have to be an element with the id menu_row2
soup = bs(r.text, 'lxml')
element = soup.select('#menu_row2')
print(element)
It's worth noting that your request, when using the Python Requests module, will not be exactly the same as a standard user request. In order to fully mimic a realistic request, and thus not be blocked by any firewall or security measures on the site, you will need to copy all POST parameters, GET parameters and, finally, headers.
You can use a tool such as Burp Suite to intercept the login request. Copy the URL it is being sent to, copy all POST parameters, and finally copy all headers. You should be using requests.Session() in order to store cookies. You may also want to do an initial session GET request to the homepage in order to pick up cookies, as it is not realistic for a user to send a login request without first visiting the homepage.
I hope that makes sense. Header parameters can be passed like so:
import requests

headers = {
    'User-Agent': 'My User Agent (copy your real one for a realistic request).'
}

data = {
    'username': 'John',
    'password': 'Doe'
}

s = requests.Session()
s.get("https://mywebsite.com/")
s.post("https://mywebsite.com/", data=data, headers=headers)
I also had the same issue. What did it for me was to add
s.headers.update(headers)
before the first GET request in Cillian Collins' example.
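That is, roughly (reusing headers, data and the URLs from the example above):

s = requests.Session()
s.headers.update(headers)                    # default headers for every request in the session
s.get("https://mywebsite.com/")              # initial GET to pick up cookies
s.post("https://mywebsite.com/", data=data)  # headers no longer need to be passed per request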
I want to retrieve atmospheric particulate matter values from a table (sadly the site is not in English, so feel free to ask about anything). I failed with the combination of BeautifulSoup and a GET request sent with requests, since the table is filled dynamically (Bootstrap) and a parser like BeautifulSoup can't find values that still have to be inserted.
With Firebug I checked every angle of the page, and I found out that by selecting a different day of the table, a POST request is sent (the site, as you can see in Referer, is http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/, where the table is):
POST /temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini HTTP/1.1
Host: www.arpat.toscana.it
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/26-12-2016
Content-Length: 114
Cookie: [...]
DNT: 1
Connection: keep-alive
With the following params:
v_data_osservazione=26-12-2016&v_tipo_bollettino=regionale&v_zona=&csrf_test_name=b88d2517c59809a529b6f8141256e6ca
The data in the response are in JSON format.
So I started to craft my own POST request, in order to directly get the JSON data that fills the table.
In the params, in addition to the date, a csrf_test_name is required: here I discovered the site is protected against CSRF; in order to build correct params, I need a CSRF token, so I perform a GET request to the site (see the Referer in the POST request for the URL) and get the CSRF token from the cookie like this:
r = get(url)
csrf_token = r.cookies["csrf_cookie_name"]
At the end of the day, with my CSRF token and POST request ready, I send it... and, with status code 200, I always get Disallowed Key Characters. in the response!
Looking up this error, I only find posts about CodeIgniter, which (I think) is not what I need: I tried every combination of headers and parameters, yet nothing changed. Before giving up on BeautifulSoup and requests and starting to learn Selenium, I'd like to figure out what the problem is: Selenium is too high level, while low-level libraries like BeautifulSoup and requests let me learn a lot of useful things, so I'd prefer to continue learning with these two.
Here's the code:
from requests import get, post
from bs4 import BeautifulSoup
import datetime
import json
url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/" # + %d-%m-%Y
yesterday = datetime.date.today() - datetime.timedelta(1)
date_object = datetime.datetime.strptime(str(yesterday), '%Y-%m-%d')
yesterday_string = str(date_object.strftime('%d-%m-%Y'))
full_url = url + yesterday_string
print("REFERER " + full_url)
r = get(url)
csrf_token = r.cookies["csrf_cookie_name"]
print(csrf_token)
# preparing headers for POST request
headers = {
    "Host": "www.arpat.toscana.it",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "X-Requested-With": "XMLHttpRequest",  # XHR
    "Referer": full_url,
    "DNT": "1",
    "Connection": "keep-alive"
}

# preparing POST parameters (to be inserted in request's body)
payload_string = "v_data_osservazione=" + yesterday_string + "&v_tipo_bollettino=regionale&v_zona=&csrf_test_name=" + csrf_token
print(payload_string)

# data -- (optional) Dictionary, bytes, or file-like object to send in the body of the Request.
# json -- (optional) json data to send in the body of the Request.
req = post("http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini",
           headers=headers, json=payload_string)
print("URL " + req.url)
print("RESPONSE:")
print('\t'+str(req.status_code))
print("\tContent-Encoding: " + req.headers["Content-Encoding"])
print("\tContent-type: " + req.headers["Content-type"])
print("\tContent-Length: " + req.headers["Content-Length"])
print('\t'+req.text)
This code works for me:
I use requests.Session() and it keeps all the cookies
I use data= instead of json=
finally, I don't need all the commented-out elements
to compare browser requests and code requests I used the Charles web debugging proxy application
code:
import requests
import datetime
#proxies = {
# 'http': 'http://localhost:8888',
# 'https': 'http://localhost:8888',
#}
s = requests.Session()
#s.proxies = proxies # for test only
date = datetime.datetime.today() - datetime.timedelta(days=1)
date = date.strftime('%d-%m-%Y')
# --- main page ---
url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/"
print("REFERER:", url+date)
r = s.get(url)
# --- data ---
csrf_token = s.cookies["csrf_cookie_name"]
#headers = {
#'User-Agent': 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
#"Host": "www.arpat.toscana.it",
#"Accept" : "*/*",
#"Accept-Language" : "en-US,en;q=0.5",
#"Accept-Encoding" : "gzip, deflate",
#"Content-Type" : "application/x-www-form-urlencoded; charset=UTF-8",
#"X-Requested-With" : "XMLHttpRequest", # XHR
#"Referer" : url,
#"DNT" : "1",
#"Connection" : "keep-alive"
#}
payload = {
    'csrf_test_name': csrf_token,
    'v_data_osservazione': date,
    'v_tipo_bollettino': 'regionale',
    'v_zona': None,
}
url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini"
r = s.post(url, data=payload) #, headers=headers)
print('Status:', r.status_code)
print(r.json())