I want to retrieve atmospheric particulate matter values from a table (sadly the site is not in English, so feel free to ask about anything). I failed with the combination of BeautifulSoup and a GET request sent with requests, since the table is filled dynamically by Bootstrap, and a parser like BeautifulSoup can't find values that still have to be inserted.
With Firebug I checked every angle of the page, and I found out that selecting a different day in the table sends a POST request (the site, as you can see in the Referer, is http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/, where the table is):
POST /temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini HTTP/1.1
Host: www.arpat.toscana.it
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/26-12-2016
Content-Length: 114
Cookie: [...]
DNT: 1
Connection: keep-alive
With the following params:
v_data_osservazione=26-12-2016&v_tipo_bollettino=regionale&v_zona=&csrf_test_name=b88d2517c59809a529b6f8141256e6ca
The data in the response are in JSON format.
So I started to craft my own POST request, in order to directly get the JSON data that fills the table.
In the params, in addition to the date, a csrf_test_name is required: here I discovered that this site is protected against CSRF attacks, so in order to build a correct query I need a CSRF token. That's why I perform a GET request to the site (see the Referer in the POST request for the URL) and read the CSRF token from the cookie like this:
r = get(url)
csrf_token = r.cookies["csrf_cookie_name"]
At the end of the day, with my CSRF token and POST request ready, I send it... and, with status code 200, I always get Disallowed Key Characters. as the response!
Searching for this error, I only find posts about CodeIgniter, which (I think) is not what I need. I tried every combination of headers and parameters, yet nothing changed. Before giving up on BeautifulSoup and requests and starting to learn Selenium, I'd like to figure out what the problem is: Selenium is too high level, while low-level libraries like BeautifulSoup and requests let me learn a lot of useful things, so I'd prefer to continue learning with these two.
Here's the code:
from requests import get, post
from bs4 import BeautifulSoup
import datetime
import json
url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/" # + %d-%m-%Y
yesterday = datetime.date.today() - datetime.timedelta(1)
date_object = datetime.datetime.strptime(str(yesterday), '%Y-%m-%d')
yesterday_string = str(date_object.strftime('%d-%m-%Y'))
full_url = url + yesterday_string
print("REFERER " + full_url)
r = get(url)
csrf_token = r.cookies["csrf_cookie_name"]
print(csrf_token)
# preparing headers for POST request
headers = {
"Host": "www.arpat.toscana.it",
"Accept" : "*/*",
"Accept-Language" : "en-US,en;q=0.5",
"Accept-Encoding" : "gzip, deflate",
"Content-Type" : "application/x-www-form-urlencoded; charset=UTF-8",
"X-Requested-With" : "XMLHttpRequest", # XHR
"Referer" : full_url,
"DNT" : "1",
"Connection" : "keep-alive"
}
# preparing POST parameters (to be inserted in request's body)
payload_string = "v_data_osservazione="+yesterday_string+"&v_tipo_bollettino=regionale&v_zona=&csrf_test_name="+csrf_token
print(payload_string)
# data -- (optional) Dictionary, bytes, or file-like object to send in the body of the Request.
# json -- (optional) json data to send in the body of the Request.
req = post("http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini",
headers = headers, json = payload_string
)
print("URL " + req.url)
print("RESPONSE:")
print('\t'+str(req.status_code))
print("\tContent-Encoding: " + req.headers["Content-Encoding"])
print("\tContent-type: " + req.headers["Content-type"])
print("\tContent-Length: " + req.headers["Content-Length"])
print('\t'+req.text)
This code works for me:
I use requests.Session() and it keeps all cookies
I use data= instead of json=
finally, I don't need all the commented-out elements
to compare the browser's requests with my code's requests I used the Charles web debugging proxy
code:
import requests
import datetime
#proxies = {
# 'http': 'http://localhost:8888',
# 'https': 'http://localhost:8888',
#}
s = requests.Session()
#s.proxies = proxies # for test only
date = datetime.datetime.today() - datetime.timedelta(days=1)
date = date.strftime('%d-%m-%Y')
# --- main page ---
url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/"
print("REFERER:", url+date)
r = s.get(url)
# --- data ---
csrf_token = s.cookies["csrf_cookie_name"]
#headers = {
#'User-Agent': 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
#"Host": "www.arpat.toscana.it",
#"Accept" : "*/*",
#"Accept-Language" : "en-US,en;q=0.5",
#"Accept-Encoding" : "gzip, deflate",
#"Content-Type" : "application/x-www-form-urlencoded; charset=UTF-8",
#"X-Requested-With" : "XMLHttpRequest", # XHR
#"Referer" : url,
#"DNT" : "1",
#"Connection" : "keep-alive"
#}
payload = {
'csrf_test_name': csrf_token,
'v_data_osservazione': date,
'v_tipo_bollettino': 'regionale',
'v_zona': None,
}
url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini"
r = s.post(url, data=payload) #, headers=headers)
print('Status:', r.status_code)
print(r.json())
Related
I am trying to scrape some data from a Korean website for goods.
The website displays general data about cargo ships, such as arrival date, departure date and mother ship name.
Website Link
The black button on the right is the search button.
In order to obtain data from it, some radio buttons have to be set and then the search button clicked.
So I thought I could send a POST request to the website and extract the data from the response.
Unfortunately, the response was just the plain page, as if the POST parameters had not been applied.
This is the POST request:
POST /Berth_status_text_servlet_sw_kr HTTP/1.1
Accept: text/html, application/xhtml+xml, image/jxr, */*
Referer: http://info.bptc.co.kr:9084/content/sw/frame/berth_status_text_frame_sw_kr.jsp
Accept-Language: en-US,en;q=0.7,ko;q=0.3
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
Content-Length: 40
Host: info.bptc.co.kr:9084
Pragma: no-cache
Connection: close
v_time=month&ROCD=ALL&ORDER=item2&v_gu=S
And this is what I did in Python:
from bs4 import BeautifulSoup
import requests

# URL built from the Host and path in the request dump above
url = "http://info.bptc.co.kr:9084/Berth_status_text_servlet_sw_kr"
params = {'v_time': 'month',
          'ROCD': 'ALL',
          'ORDER': 'item2',
          'v_gu': 'S'}
response = requests.post(url, data=params)
soup = BeautifulSoup(response.content, "html.parser")
print(soup)
I did try putting the encoding and other things in the headers, like below:
response = requests.post(url, data = params,
headers={'Accept': 'text/html, application/xhtml+xml, image/jxr, */*',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Content-type': 'application/x-www-form-urlencoded; text/html; charset=euc-kr',
'Accept-Language': 'en-US,en;q=0.7,ko;q=0.3'
})
It did not work either.
The code works fine on other websites, so I guess it is something specific to the Korean site.
I tried to search for solutions to the issue, but I didn't have any luck.
Would you mind helping me?
Thanks!
Your approach is correct. The response returns HTML and you need to parse it into a more usable format. The following code converts the table from the HTML response into a list of dicts:
from bs4 import BeautifulSoup
import requests
params = {"v_time": "month", "ROCD": "ALL", "ORDER": "item2", "v_gu": "S"}
response = requests.post(
"http://info.bptc.co.kr:9084/Berth_status_text_servlet_sw_kr", data=params
)
soup = BeautifulSoup(response.content, features="html.parser")
keys = [th.get_text(strip=True) for th in soup("th")]
data = [
{key: value.get_text(strip=True) for key, value in zip(keys, row("td"))}
for row in soup("tr")
]
print(data)
Prints:
[
{
"S/H": "0",
"모선항차": "DPYT-21",
"반입 마감일시": "",
"선박명": "PEGASUS YOTTA",
"선사": "DYS",
"선석": "2",
"선적": "0",
"양하": "0",
"입항 예정일시": "2020/06/08 21:00",
"입항일시": "",
"전배": "",
"접안": "P",
"출항 예정일시": "2020/06/09 11:00",
"출항일시": "",
"항로": "NCK",
}
...
]
I am trying to fill in a form like that and submit it automatically. To do that, I sniffed the packets while logging in.
POST /?pg=ogrgiris HTTP/1.1
Host: xxx.xxx.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded
Origin: http://xxx.xxx.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15
Referer: http://xxx.xxx.com/?pg=ogrgiris
Upgrade-Insecure-Requests: 1
DNT: 1
Content-Length: 60
Connection: close
seviye=700&ilkodu=34&kurumkodu=317381&ogrencino=40&isim=ahm
I replayed that packet with Burp Suite and saw that it works properly: the response was the HTML of the member page.
Now I tried to do the same in Python. The code is below:
import requests
url = 'http://xxx.xxx.com/?pg=ogrgiris'
headers = {'Host':'xxx.xxx.com',
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate',
'Content-Type':'application/x-www-form-urlencoded',
'Referer':'http://xxx.xxx.com/?pg=ogrgiris',
'Content-Lenght':'60','Connection':'close'}
credentials = {'seviye': '700','ilkodu': '34','kurumkodu': '317381','ogrecino': '40','isim': 'ahm'}
r = requests.post(url,headers=headers, data=credentials)
print(r.content)
The problem is that this code prints the HTML of the login page, even though I send all of the credentials needed to log in. How can I get the member page? Thanks.
If the POST request displays a page with the content you want, then the problem is only that you are sending data as JSON, not in "form" data format (application/x-www-form-urlencoded).
If that request creates a session and you have to make another request for the data you want, then you also have to deal with cookies.
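For the cookie part, here is a minimal sketch with requests.Session(), which keeps the cookies set by the login response and sends them on later requests (the member-page URL below is a hypothetical placeholder; the form fields are the ones from the question):

import requests

s = requests.Session()

# placeholder URL and form fields, modeled on the question
login_url = 'http://xxx.xxx.com/?pg=ogrgiris'
credentials = {'seviye': '700', 'ilkodu': '34', 'kurumkodu': '317381',
               'ogrencino': '40', 'isim': 'ahm'}

# cookies set by the login response are stored on the session...
r = s.post(login_url, data=credentials)

# ...and sent automatically on any follow-up request
r = s.get('http://xxx.xxx.com/?pg=memberpage')  # hypothetical member-page URL
print(r.text)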
Problem with data format:
r = requests.post(url, headers=headers, data=credentials)
The kwarg json= creates a request body as follows:
{"ogrecino": "40", "ilkodu": "34", "isim": "ahm", "kurumkodu": "317381", "seviye": "700"}
While data= creates a request body like this:
seviye=700&ilkodu=34&kurumkodu=317381&ogrencino=40&isim=ahm
You can try https://httpbin.org:
from requests import post
msg = {"a": 1, "b": True}
print(post("https://httpbin.org/post", data=msg).json())  # sent as form data: look at the `form` key, the values come back as a JSON object
print(post("https://httpbin.org/post", json=msg).json())  # sent as JSON: look at the `data` key, the body comes back as a string
If your goal is to replicate the sample request, you are missing a lot of the headers; this one in particular is very important: Content-Type: application/x-www-form-urlencoded, because it tells your HTTP client how to format/encode the payload.
Check the documentation for requests to see how these form posts work.
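One thing worth knowing: when you pass a dict via data=, requests form-encodes it and sets the Content-Type: application/x-www-form-urlencoded header for you, so you mainly need to set that header yourself when sending a pre-encoded string. A quick check against httpbin (not specific to the site in the question):

from requests import post

r = post("https://httpbin.org/post", data={"seviye": "700", "ilkodu": "34"})
# httpbin echoes back the headers and the form-decoded body it received
print(r.json()["headers"]["Content-Type"])  # application/x-www-form-urlencoded
print(r.json()["form"])                     # {'ilkodu': '34', 'seviye': '700'}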
I am using the requests module for Python to try to log in to a webpage. I open a requests.Session(), then I get the cookie and the csrf-token that is included in a meta tag. I build my payload with the username, the password, a hidden input field and the csrf-token from the meta tag. After that I use the post method, passing the login URL, the cookie, the payload and the header. But after that I still can't access a page behind the login page.
What am I doing wrong?
This is the request header when I perform a login:
Request Headers:
:authority: www.die-staemme.de
:method: POST
:path: /page/auth
:scheme: https
accept: application/json, text/javascript, */*; q=0.01
accept-encoding: gzip, deflate, br
accept-language: de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7
content-length: 50
content-type: application/x-www-form-urlencoded
cookie: cid=261197879; remember_optout=0; ref=start;
PHPSESSID=3eb4f503f38bfda1c6f48b8f9036574a
origin: https://www.die-staemme.de
referer: https://www.die-staemme.de/
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
x-csrf-token: 3c49b84153f91578285e0dc4f22491126c3dfecdabfbf144
x-requested-with: XMLHttpRequest
This is my code so far:
import requests
from bs4 import BeautifulSoup as bs
import lxml
# Page header
head= { 'Content-Type':'application/x-www-form-urlencoded',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
# Start Page
url = 'https://www.die-staemme.de/'
# Login URL
login_url = 'https://www.die-staemme.de/page/auth'
# URL behind the login page
url2= 'https://de159.die-staemme.de/game.php?screen=overview&intro'
# Open up a session
s = requests.session()
# Open the login page
r = s.get(url)
# Get the csrf-token from meta tag
soup = bs(r.text,'lxml')
csrf_token = soup.select_one('meta[name="csrf-token"]')['content']
# Get the page cookie
cookie = r.cookies
# Set CSRF-Token
head['X-CSRF-Token'] = csrf_token
head['X-Requested-With'] = 'XMLHttpRequest'
# Build the login payload
payload = {
'username': '', #<-- your username
'password': '', #<-- your password
'remember':'1'
}
# Try to login to the page
r = s.post(login_url, cookies=cookie, data=payload, headers=head)
# Try to get a page behind the login page
r = s.get(url2)
# Check if login was successful, if so there have to be an element with the id menu_row2
soup = bs(r.text, 'lxml')
element = soup.select('#menu_row2')
print(element)
It's worth noting that your request, when made with the Python requests module, will not be exactly the same as a standard user request. In order to fully mimic a realistic request, and thus not be blocked by any firewall or security measures on the site, you will need to copy all POST parameters, all GET parameters and, finally, the headers.
You can use a tool such as Burp Suite to intercept the login request. Copy the URL it is sent to, copy all POST parameters, and finally copy all headers. You should use requests.Session() in order to store cookies. You may also want to do an initial session GET request to the homepage in order to pick up cookies, since it is not realistic for a user to send a login request without first visiting the homepage.
I hope that makes sense. Header parameters can be passed like so:
import requests
headers = {
'User-Agent': 'My User Agent (copy your real one for a realistic request).'
}
data = {
'username': 'John',
'password': 'Doe'
}
s = requests.Session()
s.get("https://mywebsite.com/")
s.post("https://mywebsite.com/", data=data, headers=headers)
I also had the same issue. What did it for me was to add
s.headers.update(headers)
before the first GET request in Cillian Collins' example.
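In other words, something like this (a minimal sketch building on the example above; the URL and header values are placeholders):

import requests

headers = {
    'User-Agent': 'My User Agent (copy your real one for a realistic request).'
}

s = requests.Session()
# make these headers the session defaults, so every request made through the
# session (including the initial GET that collects cookies) sends them
s.headers.update(headers)

s.get("https://mywebsite.com/")
s.post("https://mywebsite.com/", data={'username': 'John', 'password': 'Doe'})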
I want to communicate with a website. I successfully log in to the site, but I can't send the query below. The parameters are clear in the images below, but I don't know why my code gets response code 400.
Header: (screenshot not included)
Params: (screenshot not included)
Here is my code in Python:
# note: c is the requests session from the (not shown) login step; instagram_url should be "https://www.instagram.com/"
# init User-Agent header for performance and compatibility
heads = {'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"}
user = "XXXXXXXXXXX"
user_id="4013370545"
#update session
user_url = instagram_url + user + "/"
result=c.get(user_url,headers=heads)
# first request, for followers
qdata= "ig_user("+user_id+")+{++followed_by.first(10)+{++++count,++++page_info+{++++++end_cursor,++++++has_next_page++++},++++nodes+{++++++id,++++++is_verified,++++++followed_by_viewer,++++++requested_by_viewer,++++++full_name,++++++profile_pic_url,++++++username++++}++}}"
queryid="17851938028087704"
query_data=dict(q=qdata,ref="relationships::follow_list",query_id=queryid)
query_data= {'q': qdata, 'ref': "relationships::follow_list", 'query_id': queryid}
#query_data="q="+qdata+"ref="+"relationships::follow_list"+"query_id="+queryid
# set a number of headers required for the request
heads['X-Requested-With']="XMLHttpRequest"
heads['X-Instagram-AJAX']="1"
heads['Referer']='https://www.instagram.com/'+user+'/'
heads['Host']= 'www.instagram.com'
heads['X-CSRFToken']=result.cookies['csrftoken']
#heads['Accept-Encoding']="gzip, deflate, br"
#heads['Accept-Language']="en-US,en;q=0.5"
#heads['Accept']='*/*'
#heads['Content-Type']='application/x-www-form-urlencoded'
# send the followers query to Instagram using query_data and the prepared headers
result =c.post(instagram_url+"query/", data=query_data, headers=heads)
but the result is:
<Response [400]>
Where is my mistake? Any suggestions?
I have this URL, whose content is produced in this way (PHP; it is supposed to generate a random cookie on every request):
setcookie('token', md5(time()), time()+99999);
if(isset($_COOKIE['token'])) {
echo 'Cookie: ' .$_COOKIE['token'];
die();
}
echo 'Cookie not set yet';
As you can see, the cookie changes on every reload/refresh of the page. Now I have a Python 3 script with three requests that are completely independent from each other:
import requests
def get_req_data(req):
print('\n\ntoken: ', req.cookies['token'])
print('headers we sent: ', req.request.headers)
print('headers server sent back: ', req.headers)
url = 'http://migueldvl.com/heya/login/tests2.php'
headers = {
"User-agent" : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
"Referer": 'https://www.google.com'
}
req1 = requests.get(url, headers=headers)
get_req_data(req1)
req2 = requests.get(url, headers=headers)
get_req_data(req2)
req3 = requests.get(url, headers=headers)
get_req_data(req3)
How can it be that we sometimes get the same cookie on different requests, when it is clearly programmed to change on every request?
If we:
import time
and add a
time.sleep(1) # wait one second before the next request
between requests, the cookie changes every time, which is the right and expected behaviour. But my question is: why do we need this (time.sleep(1)) to be certain that the cookie changes? Wouldn't separate requests be enough?
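For what it's worth, a small sketch for checking whether this is just a timing effect: assuming the cookie really is md5(time()) as in the PHP above, PHP's time() has one-second resolution, so two requests handled within the same second would get the same token. Logging an approximate timestamp next to each token makes that visible:

import time
import requests

url = 'http://migueldvl.com/heya/login/tests2.php'

for i in range(3):
    r = requests.get(url)
    # time.time() on the client only approximates the server's time(), but it is
    # enough to see whether two requests fall into the same second
    print(int(time.time()), r.cookies['token'])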