I tried to scrape some data from a national betting site, https://tippmix.hu.
I wanted to get the data from this specific page: https://www.tippmix.hu/sportfogadas#?sportid=999&countryid=99999999&competitionid=45975&page=1. The data is dynamically loaded, so I inspected the page and found the specific POST request that is responsible for loading the data I need as JSON.
I opened Python and used the requests library to make a POST request to https://api.tippmix.hu/tippmix/search with the exact same headers and form data that the browser sends (captured in the developer tools). Unfortunately, my code returned the whole JSON file, as if I had not specified any parameters.
Here is my code (the header_convert function converts the copied header string into a dictionary):
import requests
from header_convert import header_convert
events_url = "https://api.tippmix.hu/tippmix/search"
data = {"fieldValue": "",
        "sportId": "999",
        "competitionGroupId": "99999999",
        "competitionId": "45975",
        "type": "0",
        "date": "0001-01-01T00:00:00.000Z",
        "hitsPerPage": "20",
        "page": "1",
        "minOdds": "null",
        "maxOdds": "null"}
raw_headers = """Accept: application/json, text/plain, */*
Content-Type: application/x-www-form-urlencoded
Origin: https://www.tippmix.hu
Content-Length: 182
Accept-Language: en-us
Host: api.tippmix.hu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15
Referer: https://www.tippmix.hu/
Accept-Encoding: gzip, deflate, br
Connection: keep-alive"""
headers = header_convert.header_convert(raw_headers)
print(headers)
page = requests.post(events_url, data=data, headers=headers)
print(page.content)
Here are my headers:
{'Accept': 'application/json, text/plain, */*', 'Content-Type': 'application/x-www-form-urlencoded', 'Origin': 'https://www.tippmix.hu', 'Content-Length': '182', 'Accept-Language': 'en-us', 'Host': 'api.tippmix.hu', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15', 'Referer': 'https://www.tippmix.hu/', 'Accept-Encoding': 'gzip, deflate, br', 'Connection': 'keep-alive'}
I wonder if someone could help me.
Thank you!
You need to post the data as JSON. Updated, working code:
import requests
import json
url = 'https://api.tippmix.hu/tippmix/search'
data = {"fieldValue": "",
        "sportId": 999,
        "competitionGroupId": 99999999,
        "competitionId": 45975,
        "type": 0,
        "date": "0001-01-01T00:00:00.000Z",
        "hitsPerPage": 20,
        "page": 1,
        "minOdds": "null",
        "maxOdds": "null"}
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'api.tippmix.hu',
    'Origin': 'https://www.tippmix.hu',
    'Referer': 'https://www.tippmix.hu/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}
resp = requests.post(url, headers=headers, data=json.dumps(data)).json()
print(resp)
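A side note on the answer above (my addition, not part of the original answer): requests can serialize the payload itself via its json keyword argument, which also sets a matching Content-Type: application/json header, so the explicit json.dumps call is not strictly needed. This is easy to verify offline with a PreparedRequest, no network call required:

```python
import json

import requests

url = "https://api.tippmix.hu/tippmix/search"
data = {"fieldValue": "", "sportId": 999, "competitionId": 45975, "page": 1}

# json=data serializes the dict for us and sets the JSON content type
prepared = requests.Request("POST", url, json=data).prepare()
print(prepared.headers["Content-Type"])   # application/json
print(json.loads(prepared.body) == data)  # True: body round-trips to the dict
```

Assuming the API accepts a proper JSON content type, the final line of the answer could equally be written as requests.post(url, headers=headers, json=data).json(), with the Content-Type entry dropped from headers.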
Related
I have a scraper that worked without issue for 18 months, until today. Now I get a 403 response from hltv.org and can't seem to fix the issue. My code is below, so the answer is not the usual "just add headers". If I print response.text, it says something about captchas. So I assume I'd have to bypass the captcha, or my IP is blocked? Please help :)
import requests
url = 'https://www.hltv.org/matches'
headers = {
    "Accept-Language": "en-US,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer": "http://thewebsite.com",
    "Connection": "keep-alive"}
response = requests.get(url, headers=headers)
print(response)
EDIT: This remains a mystery to me, but today my code started working again on my main PC. Did not make any changes to the code.
KokoseiJ could not reproduce the problem, but Booboo did. The code also worked on my old PC, which I dug from storage, but not on my main PC. Anyways, thanks to all who tried to help me with this issue.
I am posting this not as a solution, but as something that did not work and may still be useful information.
I went to https://www.hltv.org/matches, brought up Chrome's Inspector, reloaded the page, and looked at the request headers Chrome (supposedly) used for the GET request. Some of the header names begin with a ':', which requests considers illegal. But looking around Stack Overflow, I found a way to get around that (supposedly for Python 3.7 and greater). See the accepted answer and comments here for details.
This still resulted in a 403 error. Perhaps somebody might spot an error in this (or not).
These were the headers shown by the Inspector:
:authority: www.hltv.org
:method: GET
:path: /matches
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: no-cache
cookie: MatchFilter={%22active%22:false%2C%22live%22:false%2C%22stars%22:1%2C%22lan%22:false%2C%22teams%22:[]}
dnt: 1
pragma: no-cache
sec-ch-ua: " Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36
And the code:
import requests
import http.client
import re
# Patch http.client so header names requests would normally reject
# (e.g. the ':'-prefixed pseudo-headers below) are allowed through
http.client._is_legal_header_name = re.compile(rb'\S[^:\r\n]*').fullmatch
url = 'https://www.hltv.org/matches'
headers = {
    ':authority': 'www.hltv.org',
    ':method': 'GET',
    ':path': '/matches',
    ':scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'cookie': 'MatchFilter={%22active%22:false%2C%22live%22:false%2C%22stars%22:1%2C%22lan%22:false%2C%22teams%22:[]}',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.text)
print(response)
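One extra observation (mine, not from the original post): the header names beginning with ':' are HTTP/2 pseudo-headers. requests speaks HTTP/1.1, where the same information travels in the request line and the Host header, so rather than patching _is_legal_header_name they can simply be filtered out before the dict is handed to requests:

```python
# Captured Chrome headers, abbreviated; the ':'-prefixed entries are HTTP/2
# pseudo-headers and have no place in an HTTP/1.1 request
captured = {
    ":authority": "www.hltv.org",
    ":method": "GET",
    ":path": "/matches",
    ":scheme": "https",
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
}

# Keep only the real header fields
headers = {k: v for k, v in captured.items() if not k.startswith(":")}
print(sorted(headers))  # ['accept-language', 'user-agent']
```

This does not fix the 403 by itself, as the post above found; it just avoids sending illegal header names.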
Also came across this issue recently.
My solution was using the js-fetch library (see answer).
I assume Cloudflare and others have found some way to detect whether a request is made by a browser (JS) or by another programming language.
I'm trying to submit info to this site > https://cxkes.me/xbox/xuid
The info: e = {'gamertag' : "Xi Fall iX"}
Every time I try, I get WinError 10054. I can't seem to find a fix for this.
My Code:
import urllib.parse
import urllib.request
import json
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
url = "https://cxkes.me/xbox/xuid"
e = {'gamertag' : "Xi Fall iX"}
f = {'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
     'accept-encoding': "gzip, deflate, br",
     'accept-language': "en-GB,en-US;q=0.9,en;q=0.8",
     'cache-control': "max-age=0",
     'content-length': "76",
     'content-type': "application/x-www-form-urlencoded",
     'cookie': "__cfduid=d2f371d250727dc4858ad1417bdbcfba71593253872; XSRF-TOKEN=eyJpdiI6IjVcL2dHMGlYSGYwd3ZPVEpTRGlsMnFBPT0iLCJ2YWx1ZSI6InA4bDJ6cEtNdzVOT3UxOXN4c2lcLzlKRTlYaVNvZjdpMkhqcmllSWN3eFdYTUxDVHd4Y2NiS0VqN3lDSll4UDhVMHM1TXY4cm9lNzlYVGE0dkRpVWVEZz09IiwibWFjIjoiYjdlNjU3ZDg3M2Y0MDBlZDY3OWE5YTdkMWUwNGRiZTVkMTc5OWE1MmY1MWQ5OTQ2ODEzNzlhNGFmZGNkZTA1YyJ9; laravel_session=eyJpdiI6IjJTdlFhK0dacFZ4cFI5RFFxMHgySEE9PSIsInZhbHVlIjoia2F6UTJXVmNSTEt1M3lqekRuNVFqVE5ZQkpDang4WWhraEVuNm0zRmlVSjVTellNTDRUb1wvd1BaKzNmV2lISGNUQ0l6Z21jeFU3VlpiZzY0TzFCOHZ3PT0iLCJtYWMiOiIwODU3YzMxYzg2N2UzMjdkYjcxY2QyM2Y4OTVmMTY1YTcxZTAxZWI0YTExZDE0ZjFhYWI2NzRlODcyOTg3MjIzIn0%3D",
     'origin': "https://cxkes.me",
     'referer': "https://cxkes.me/xbox/xuid",
     'sec-fetch-dest': "document",
     'sec-fetch-mode': "navigate",
     'sec-fetch-site': "same-origin",
     'sec-fetch-user': "?1",
     'upgrade-insecure-requests': "1",
     'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}
data = urllib.parse.urlencode(e)
data = json.dumps(data)
data = str(data)
data = data.encode('ascii')
req = urllib.request.Request(url, data, f)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
    print(the_page)
Having run the code, I get the following error:
[WinError 10054] An existing connection was forcibly closed by the remote host
That could be caused by any of the following:
The network link between server and client temporarily going down.
Running out of system resources.
Sending malformed data.
I am not sure what you are trying to achieve here, entirely. But if your aim is simply to read the XUID of a gamertag, then use a web automator like Selenium to retrieve that value.
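Another detail worth flagging in the question's code (my observation, not part of the answer above): the form body is URL-encoded and then additionally run through json.dumps, so the server receives a JSON-quoted string instead of a plain application/x-www-form-urlencoded body. A minimal sketch of the difference, using the gamertag from the question:

```python
import json
import urllib.parse

e = {"gamertag": "Xi Fall iX"}

# One urlencode pass yields a well-formed form body
body = urllib.parse.urlencode(e).encode("ascii")
print(body)  # b'gamertag=Xi+Fall+iX'

# The question's extra json.dumps wraps that string in quotes,
# which no form parser will accept
broken = json.dumps(urllib.parse.urlencode(e)).encode("ascii")
print(broken)  # b'"gamertag=Xi+Fall+iX"'
```

Whether fixing this resolves the WinError 10054 depends on the server, but "sending malformed data" is one of the listed causes.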
I have a problem on a site where the email is behind an obfuscator.
When I run my program I get this output:
The email is:{"success":"","code":1,"msg":"ReCAPTCHA"}
But when I click 'watch email' in the browser, everything is fine and I get:
The email is:{"success":"","code":0,"msg":"xxxx@gmail.com"}
The captured POST request:
REQUEST HEADERS
Accept: application/json, text/javascript, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7
Connection: keep-alive
Content-Length: 141
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: SOOOME COOKIES.
Host: https://xxxxxxx.com
Origin: https://xxxxxxx.com
Referer: https://xxxxxxx.com/asas
User-Agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36
X-Requested-With: XMLHttpRequest
QUERY STRING PARAMETERS
decode:
FORM DATA
hash: YToyOntpOjA7czo0NDoidHh3VFlXck83eFdza1FRUWgydUlvb0MveHRRemNLaCtNa3BuenVJU0VmUT0iO2k6MTtzOjE2OiK3SJ7OlhTa5DgPfA1YqCfRIjt9
type: ademail
And here is my code:
import requests
url = "https://xxxxx.com/_ajax/obfuscator/?decode"
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Content-Length': '141',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'https://xxxxx.com/',
    'Origin': 'https://xxxxx.com/',
    'Referer': 'https://xxxxx.com/asd',
    'X-Requested-With': 'XMLHttpRequest'}
data = {'hash': 'YToyOntpOjA7czo0NDoiQStHbXkrY2p1dllrUmlXSWdWTjdNbHF2Y3cyak13QU5GeUtaQXZReFcrbz0iO2k6MTtzOjE2OiJ7Byq7O88ydxCtVWgoEETOIjt9',
        'type': 'adsemail'}
r = requests.post(url, data, headers)
pastebin_url = r.text
print("The email is:%s"%pastebin_url)
I also tried doing the same with Selenium WebDriver:
driver = webdriver.Chrome("C:/Users/User/Desktop/Email/chromedriver.exe")
driver.set_page_load_timeout(5000)
driver.get("https://xxxx.com/asd")
driver.implicitly_wait(3000)
sleep(1)
RODO = "//input[@class='btn btn-confirm']"
driver.find_element_by_xpath(RODO).click()
sleep(7)
email = "//span[@class='click_to_show']"
driver.find_element_by_xpath(email).click()
But then I just get a ReCAPTCHA to solve... ;/
Where is the problem?
I also tried:
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('user-agent={0}'.format(user_agent))
driver = webdriver.Chrome("C:/Users/User/Desktop/Email/chromedriver.exe")
driver.set_page_load_timeout(5000)
But it is not working; the site still wants a captcha ;/
I have bought a little WiFi relay module. Though its interface is in Chinese, which I do not read, I have worked out how to open and close the relay from the buttons on the home page of the embedded web server.
I then used Postman's interceptor to capture the 'open' and 'close' actions, and I can now click the 'post' button in Postman to make the action happen.
However, the 'generate code' Python script doesn't work, and from my limited understanding it doesn't contain the right info.
import requests
url = "http://192.168.4.1/"
payload = ""
headers = {
    'origin': "http://192.168.4.1",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
    'content-type': "application/x-www-form-urlencoded",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    'dnt': "1",
    'referer': "http://192.168.4.1/",
    'accept-encoding': "gzip, deflate",
    'accept-language': "en-US,en;q=0.8",
    'cache-control': "no-cache",
    'postman-token': "bece04e7-ee50-3764-ca50-e86d07ebc0f3"
}
response = requests.request("POST", url, data=payload, headers=headers)
print(response.text)
The output when I select HTTP instead of Python Requests is
POST / HTTP/1.1
Host: 192.168.4.1
Origin: http://192.168.4.1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
DNT: 1
Referer: http://192.168.4.1/
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8
Cache-Control: no-cache
Postman-Token: 0bd42b4f-067d-b5be-dd1c-b7e689196043
open_relay=%EF%BF%BD%F2%BF%AA%BC%CC%B5%EF%BF%BD%EF%BF%BD%EF%BF%BD
Could someone suggest how to modify the Python code so it correctly sends the POST that works from within Postman?
Your Python code is missing the POST data that contains the command for the piece of equipment; it is listed at the bottom of the HTTP request.
Put open_relay=%EF%BF%BD%F2%BF%AA%BC%CC%B5%EF%BF%BD%EF%BF%BD%EF%BF%BD into the payload variable in the Python code:
import requests
url = "http://192.168.4.1/"
payload = "open_relay=%EF%BF%BD%F2%BF%AA%BC%CC%B5%EF%BF%BD%EF%BF%BD%EF%BF%BD"
headers = {
    'origin': "http://192.168.4.1",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
    'content-type': "application/x-www-form-urlencoded",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    'dnt': "1",
    'referer': "http://192.168.4.1/",
    'accept-encoding': "gzip, deflate",
    'accept-language': "en-US,en;q=0.8",
    'cache-control': "no-cache",
    'postman-token': "bece04e7-ee50-3764-ca50-e86d07ebc0f3"
}
response = requests.request("POST", url, data=payload, headers=headers)
print(response.text)
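One caveat about the captured payload (my observation, not part of the answer): %EF%BF%BD is the percent-encoding of U+FFFD, the Unicode replacement character, which suggests the capture tool could not decode some of the device's original (likely Chinese-encoded) bytes and substituted placeholders. A quick check:

```python
import urllib.parse

payload = "open_relay=%EF%BF%BD%F2%BF%AA%BC%CC%B5%EF%BF%BD%EF%BF%BD%EF%BF%BD"

# Recover the raw bytes of the form value from the percent-escapes
raw = urllib.parse.unquote_to_bytes(payload.split("=", 1)[1])

# b'\xef\xbf\xbd' is UTF-8 for U+FFFD; each occurrence marks a spot where
# the original byte sequence was lost during capture
print(raw.count(b"\xef\xbf\xbd"))  # 4
```

If the relay only keys off the open_relay field name, replaying the mangled value still works; otherwise the raw bytes would need to be re-captured with a tool that does not transcode them.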
I'm trying to simulate a request that has various headers and bracketed form data.
Form Data:
{"username": "MY_USERNAME", "pass": "MY_PASS", "AUTO": "true"}
That is the form data shown in Chrome's console.
So I tried putting it together with Python's requests library:
import requests
reqUrl = 'http://website.com/login'
postHeaders = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip,deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Content-Length': '68',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'website.com',
    'Origin': 'http://www.website.com',
    'Referer': 'http://www.website.com/',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36'
}
payload = {"username": "MY_USERNAME",
           "pass": "MY_PASS",
           "AUTO": "true"}
session = requests.Session()
response = session.post(reqUrl, data=payload, headers=postHeaders)
I'm receiving a response but it shows:
{"status":"failure","error":"Invalid request data"}
Am I going about implementing the form data wrong? I was also thinking it could have to do with modifying the Content-Length?
Yes: you are setting a Content-Length header yourself, overriding anything requests might set. You are also setting too many headers in general; leave most of those to the library instead:
postHeaders = {
    'Accept-Language': 'en-US,en;q=0.8',
    'Origin': 'http://www.website.com',
    'Referer': 'http://www.website.com/',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36'
}
is plenty. All the others will be generated for you.
However, from your description of the form data, it looks like you are posting JSON instead. In that case, use the json keyword argument instead of data; it will encode your payload to JSON and set the Content-Type header to application/json:
response = session.post(reqUrl, json=payload, headers=postHeaders)
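The difference between the two calls can be inspected without any network traffic using a PreparedRequest (the URL below is the placeholder from the question); it also shows that requests fills in Content-Length and Content-Type on its own:

```python
import json

import requests

payload = {"username": "MY_USERNAME", "pass": "MY_PASS", "AUTO": "true"}

# data= produces a form-encoded body with the matching content type
form = requests.Request("POST", "http://website.com/login", data=payload).prepare()
print(form.headers["Content-Type"])    # application/x-www-form-urlencoded
print(form.headers["Content-Length"])  # computed from the actual body

# json= serializes the payload and switches the content type to JSON
js = requests.Request("POST", "http://website.com/login", json=payload).prepare()
print(js.headers["Content-Type"])      # application/json
print(json.loads(js.body) == payload)  # True
```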