Crawler gets identified even when headers are provided and a request interval is set - python

Steam provides an endpoint which shows market information for a specific item. Its parameters include country, currency, language, item_nameid, and two_factor.
For example,
https://steamcommunity.com/market/itemordershistogram?country=TW&language=english&currency=30&item_nameid=176185978&two_factor=0
means my country, currency, and language are TW, 30 (TWD), and English. The ID of the item I'm interested in is 176185978.
Normally, you can access this endpoint continually without any problem after logging in. I tried accessing this endpoint thirty times within one minute and no problem occurred.
But when using the requests library to access this endpoint, Steam quickly identified the crawling, even with the user-agent, cookies, and other headers provided and a request interval set. I got IP banned after 8 to 12 requests.
My code:
import random
import time
from requests import Session

url = 'https://steamcommunity.com/market/itemordershistogram'
payload = {'country': 'TW', 'language': 'english', 'currency': 30, 'item_nameid': 176185978, 'two_factor': 0}
# headers copied from the browser's network tab
temp = '''Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: no-cache
Connection: keep-alive
Host: steamcommunity.com
Pragma: no-cache
sec-ch-ua: "Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'''
# convert the headers from a string literal to a dict
headers = {key: value for key, value in [line.split(': ') for line in temp.split('\n')]}
temp = 'cookies copied from network tab'
# convert the cookies from a string literal to a dict
cookies = {key: value for key, value in [pair.split('=') for pair in temp.split('; ')]}
session = Session()
for i in range(100):
    print(f'{i}:')
    res = session.get(url, headers=headers, params=payload, cookies=cookies)
    print(res.status_code)
    print(res.text)
    sleep_time = 5 + random.random()
    time.sleep(sleep_time)
I did some research on how to deal with anti-scraping measures; most articles say that setting the user-agent and cookies, or extending the request interval, would help. But in my case, Steam is still able to identify the crawling.
Some articles say that adding a Referer header would help, but I see no Referer header in the browser's network tab, so I assume adding one wouldn't help.
So I'm wondering: besides the user-agent, cookies, other headers, and the request interval, how can Steam know I'm using a crawler?

Related

Requests: Python Post Request only returns error code 400

(I'm new to posting on Stack Overflow, so tell me what I can improve on, thanks.)
I recently got into Python requests and I am a bit new to web-related stuff in Python, so if it's something stupid, don't flame me. I've been trying a lot of ways to get one of my POST requests to not return error 400. I messed with headers, data, and a bunch of other stuff, and I would either get code 404 or 400.
I have tried adding a bunch of headers to my project. I'm trying to log in to the site "Roblox" to make an auto-purchase bot. The CSRF token was not in the cookies, so I found a loophole: getting the token by sending a request to a logout URL (in case anyone gets confused). I double-checked the data I entered as well, and it is indeed correct. I researched for multiple hours and couldn't find a way to fix it, so I came to Stack Overflow for the first time.
import requests

RobloxSignInDeatils = {
    "cvalue": "TestWebDevAcc",  #Username
    "ctype": "Username",        #Leave Alone
    "password": "Test123123"    #Password
}
def GetToken(Session, Headers):
    LogoutUrl = "https://api.roblox.com/sign-out/v1"  #Url to get CSRF
    Request = Session.post(LogoutUrl, Headers)  #Sent Post
    print("Token: ", Request.headers["X-CSRF-TOKEN"])  #Steal the token
    return Request.headers["X-CSRF-TOKEN"]  #Return Token
with requests.session() as CurrentSession:
    Header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36"
    }  #Get the UserAgent Header Only
    CurrentSession.get("https://www.roblox.com/login", headers=Header)  #Send to the login page
    CsrfToken = GetToken(CurrentSession, Header)  #Get The Token
    Header["X-CSRF-TOKEN"] = CsrfToken  #Set the token
    SignIn = CurrentSession.post("https://auth.roblox.com/v2/login",
                                 data=RobloxSignInDeatils, headers=Header)  #Send Post to Sign in
    print(SignIn.status_code)  #Returns 400 ?
Printing "print(SignIn.status_code)" just returns 400 nothing else to explain
EDIT: If this helps heres a list of ALL the headers:
Request URL: https://auth.roblox.com/v2/login
Request Method: OPTIONS
Status Code: 200 OK
Remote Address: 128.116.112.44:443
Referrer Policy: no-referrer-when-downgrade
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: X-CSRF-TOKEN,Content-Type,Pragma,Cache-Control,Expires
Access-Control-Allow-Methods: OPTIONS, TRACE, GET, HEAD, POST, DELETE, PATCH
Access-Control-Allow-Origin: https://www.roblox.com
Access-Control-Max-Age: 600
Cache-Control: max-age=120, private
Content-Length: 0
Date: Thu, 14 Feb 2019 04:04:13 GMT
P3P: CP="CAO DSP COR CURa ADMa DEVa OUR IND PHY ONL UNI COM NAV INT DEM PRE"
Roblox-Machine-Id: CHI1-WEB2875
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Access-Control-Request-Headers: content-type,x-csrf-token
Access-Control-Request-Method: POST
Connection: keep-alive
Host: auth.roblox.com
Origin: https://www.roblox.com
Referer: https://www.roblox.com/
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36
Payload:
{cvalue: "TestWebDevAcc", ctype: "Username", password: "Test123123"}

Login with python requests and csrf-token

I am using the requests module for Python to try to log in to a webpage. I open up a requests.session(), then I get the cookie and the csrf-token, which is included in a meta tag. I build up my payload with username, password, a hidden input field, and the csrf-token from the meta tag. After that I use the post method, passing in the login URL, the cookie, the payload, and the header. But after that I can't access a page behind the login page.
What am I doing wrong?
This is the request header when I perform a login:
Request Headers:
:authority: www.die-staemme.de
:method: POST
:path: /page/auth
:scheme: https
accept: application/json, text/javascript, */*; q=0.01
accept-encoding: gzip, deflate, br
accept-language: de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7
content-length: 50
content-type: application/x-www-form-urlencoded
cookie: cid=261197879; remember_optout=0; ref=start; PHPSESSID=3eb4f503f38bfda1c6f48b8f9036574a
origin: https://www.die-staemme.de
referer: https://www.die-staemme.de/
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
x-csrf-token: 3c49b84153f91578285e0dc4f22491126c3dfecdabfbf144
x-requested-with: XMLHttpRequest
This is my code so far:
import requests
from bs4 import BeautifulSoup as bs
import lxml
# Page header
head = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
# Start Page
url = 'https://www.die-staemme.de/'
# Login URL
login_url = 'https://www.die-staemme.de/page/auth'
# URL behind the login page
url2= 'https://de159.die-staemme.de/game.php?screen=overview&intro'
# Open up a session
s = requests.session()
# Open the login page
r = s.get(url)
# Get the csrf-token from meta tag
soup = bs(r.text,'lxml')
csrf_token = soup.select_one('meta[name="csrf-token"]')['content']
# Get the page cookie
cookie = r.cookies
# Set CSRF-Token
head['X-CSRF-Token'] = csrf_token
head['X-Requested-With'] = 'XMLHttpRequest'
# Build the login payload
payload = {
    'username': '',  # <-- your username
    'password': '',  # <-- your password
    'remember': '1'
}
# Try to login to the page
r = s.post(login_url, cookies=cookie, data=payload, headers=head)
# Try to get a page behind the login page
r = s.get(url2)
# Check if the login was successful; if so, there has to be an element with the id menu_row2
soup = bs(r.text, 'lxml')
element = soup.select('#menu_row2')
print(element)
It's worth noting that your request, when using the Python requests module, will not be exactly the same as a standard user request. In order to fully mimic a realistic request, and thus not be blocked by any firewall or security measures on the site, you will need to copy all POST parameters, GET parameters, and headers.
You can use a tool such as Burp Suite to intercept the login request. Copy the URL it is sent to, copy all POST parameters, and finally copy all headers. You should be using requests.Session() in order to store cookies. You may also want to do an initial session GET request to the homepage in order to pick up cookies, as it is not realistic for a user to send a login request without first visiting the homepage.
I hope that makes sense; header parameters can be passed like so:
import requests

headers = {
    'User-Agent': 'My User Agent (copy your real one for a realistic request).'
}
data = {
    'username': 'John',
    'password': 'Doe'
}

s = requests.Session()
s.get("https://mywebsite.com/")
s.post("https://mywebsite.com/", data=data, headers=headers)
I also had the same issue. What did it for me was to add
s.headers.update(headers)
before the first GET request in Cillian Collins' example (see the sketch below).
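For clarity, a minimal sketch of where that line goes, reusing the placeholder URL and credentials from the example above:

import requests

headers = {'User-Agent': 'My User Agent (copy your real one for a realistic request).'}
data = {'username': 'John', 'password': 'Doe'}

s = requests.Session()
s.headers.update(headers)        # session-level headers, sent with every request
s.get("https://mywebsite.com/")  # the first GET now carries those headers too
s.post("https://mywebsite.com/", data=data)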

How to correctly form a POST request to this website with python requests

The URL I'd like to send a POST request to is http://www.hkexnews.hk/sdw/search/searchsdw.aspx
The search I'd like to do (manually) is simply to input "1" in "Stock Code" and click "Search".
I have tried many times, with Python and the Chrome extension "Postman", sending a POST request with the following headers:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: max-age=0
Connection: keep-alive
Content-Length: 1844
Content-Type: application/x-www-form-urlencoded
Cookie: TS0161f2e5=017038eb490da17e158ec558c902f520903c36fad91e96a3b9ca79b098f2d191e3cac56652
Host: www.hkexnews.hk
Origin: http://www.hkexnews.hk
Referer: http://www.hkexnews.hk/sdw/search/searchsdw.aspx
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36
and the following as params:
today: 20180624
sortBy:
selPartID:
alertMsg:
ddlShareholdingDay: 23
ddlShareholdingMonth: 06
ddlShareholdingYear: 2018
txtStockCode: 00001
txtStockName:
txtParticipantID:
txtParticipantName:
btnSearch.x: 35
btnSearch.y: 8
but it doesn't work.
Try the approach below. It should fetch the required response along with the tabular data that the site generates according to the search criteria.
import requests
from bs4 import BeautifulSoup

URL = "http://www.hkexnews.hk/sdw/search/searchsdw.aspx"

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    res = s.get(URL)
    soup = BeautifulSoup(res.text, "lxml")
    payload = {item['name']: item.get('value', '') for item in soup.select("input[name]")}
    payload['__EVENTTARGET'] = 'btnSearch'
    payload['txtStockCode'] = '00001'
    payload['txtParticipantID'] = 'A00001'
    req = s.post(URL, data=payload, headers={"User-Agent": "Mozilla/5.0"})
    soup_obj = BeautifulSoup(req.text, "lxml")
    for items in soup_obj.select("#pnlResultSummary .ccass-search-datarow"):
        data = [item.get_text(strip=True) for item in items.select("div")]
        print(data)
If the site provides a search API and you have access, then you can use something like Postman to get the search results. Otherwise, you will have to scrape the results.
The use case you mentioned is typical of scraping. See if there is a search API; if not, use something like Selenium to scrape the results, as sketched below.
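A rough Selenium sketch along those lines. The element names (txtStockCode, btnSearch) are taken from the POST payload shown in the question and are assumed, not verified against the live page markup:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.hkexnews.hk/sdw/search/searchsdw.aspx")
# field names assumed from the question's payload
driver.find_element(By.NAME, "txtStockCode").send_keys("00001")
driver.find_element(By.NAME, "btnSearch").click()
print(driver.page_source)  # the rendered page now contains the results table
driver.quit()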

POST request failing after migrating from requests to urllib2

I have moved away from using the Python requests library as it was a bit fiddly to get working on Google App Engine. Instead I'm using urllib2, which has better support. Unfortunately, the POST request that previously worked with the requests library no longer works with urllib2.
With requests the code was as follows
values = { 'somekey' : 'somevalue'}
r = requests.post(some_url, data=values)
With urllib2, the code is as follows
values = { 'somekey' : 'somevalue'}
data = urllib.urlencode(values)
req = urllib2.Request(some_url, data)
response = urllib2.urlopen(req)
Unfortunately the latter raises the following error
HTTP Error 405: Method Not Allowed
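For debugging, the HTTPError can be caught and inspected, since it also behaves like a response object. A minimal sketch along the lines of the urllib2 snippet above (some_url as defined below):

import urllib
import urllib2

values = {'somekey': 'somevalue'}
data = urllib.urlencode(values)
req = urllib2.Request(some_url, data)
try:
    response = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # HTTPError exposes the status code and the body the server sent back
    print e.code    # 405
    print e.read()  # often contains the server's explanation of the rejection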
The url I'm posting to has the following form:
some_url = 'http://ec2-11-111-111-1.compute-1.amazonaws.com'
I have read that there is an issue with trailing slashes with urllib2; however, when I add the slash as follows
some_url = 'http://ec2-11-111-111-1.compute-1.amazonaws.com/'
I still get the same error.
There is no redirection at the destination - so the request shouldn't be transformed to a GET. Even if it were, the URL actually accepts GET and POST requests. The EC2 Linux instance is running Django and I can see that it successfully receives the request and makes use of it - so it doesn't return a 405. For some reason urllib2 is picking up a 405 and throwing the exception.
Any ideas as to what may be going wrong?
Edit 1:
As per #philip-tzou's good call, the following information might help:
print req.get_method()
yields POST
print req.header_items()
yields []
Edit 2
Adding a user-agent header (as #padraic-cunningham suggests) unfortunately didn't solve it. I added the same header shown in the urllib2 example:
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(some_url, data, headers)
Edit 3
As #furas suggested, I've sent the request I'm making to requestb.in to double-check what's being sent. It is indeed a POST request, with the following headers:
Connection: close
Total-Route-Time: 0
X-Cloud-Trace-Context: ed073df03ccd05657<removed>2a203/<removed>129639701404;o=5
Connect-Time: 1
Cf-Ipcountry: US
Cf-Ray: <removed>1d155ac-ORD
Cf-Visitor: {"scheme":"http"}
Content-Length: 852
Content-Type: application/x-www-form-urlencoded
Via: 1.1 vegur
X-Request-Id: 20092050-5df4-42f8-8fe0-<removed>
Accept-Encoding: gzip
Cf-Connecting-Ip: <removed>
Host: requestb.in
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppEngine-Google; (+http://code.google.com/appengine; appid: s~<removed>)
and now req.header_items() yields [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')]
by way of comparison, the headers from the requests POST are
Cf-Connecting-Ip: <removed>
Accept-Encoding: gzip
Host: requestb.in
Cf-Ipcountry: US
Total-Route-Time: 0
Cf-Ray: 2f42a45185<removed>-ORD
Connect-Time: 1
Connection: close
Cf-Visitor: {"scheme":"http"}
Content-Type: application/x-www-form-urlencoded
Content-Length: 852
X-Cloud-Trace-Context: 8932f0f9b165d9a0f698<removed>/1740038835186<removed>;o=5
Accept: */*
X-Request-Id: c88a29e1-660e-4961-8112-<removed>
Via: 1.1 vegur
User-Agent: python-requests/2.11.1 AppEngine-Google; (+http://code.google.com/appengine; appid: s~<removed>)

Sending HTTP POST requests works in Fiddler but not in Python

I am sending several POST requests in Fiddler2 to check my site and make sure it is working properly. However, I want to automate this in Python to simulate it over several hours (I really don't want to spend 7 hours hitting space!).
This works in Fiddler: I can create the account and perform the related API commands. However, in Python, nothing happens with this code:
def main():
    import socket
    from time import sleep
    x = raw_input("Points: ")
    x = int(x)
    x = int(x/150)
    for y in range(x):
        new = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        new.connect(('example.com', 80))
        mydata = """POST http://www.example.com/api/site/register/ HTTP/1.1
Host: www.example.com
Connection: keep-alive
Content-Length: 191
X-NewRelic-ID: UAIFVlNXGwEFV1hXAwY=
Origin: http://www.example.com
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
X-CSRFToken: CEC9EzYaQOGBdO9HGPVVt3Fg66SVWVXg
DNT: 1
Referer: http://www.example.com/signup
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-GB,en;q=0.8
Cookie: sessionid=sessionid; sb-closed=true; arp_scroll_position=600; csrftoken=2u92jo23g929gj2; __utma=912.1.1.2.5.; __utmb=9139i91; __utmc=2019199; __utmz=260270731.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)
username=user&password=password&moredata=here """
        new.send(mydata.encode('hex'))
        print "Sent", y, "of", x
        sleep(1)
    print "Sent all!"
    print "restarting"
    main()
main()
I know I could use while True, but I intend to add more functions later to test more sites.
Why does this program not do anything to the API, when Fiddler2 can? I know it is my program, as I can send the exact same packet in Fiddler (obviously pointing to the right place) and it works.
PS - If anyone does fix this, as it's probably something really obvious, please only use modules that are bundled with Python. I cannot install modules from other places. Thanks!
HTTP requests are not as easy as you think they are. First of all this is wrong:
"""POST http://www.example.com/api/site/register/ HTTP/1.1
Host: www.example.com
Connection: keep-alive
...
"""
Each line in an HTTP request has to end with CRLF (in Python, \r\n), i.e. it should be:
"""POST http://www.example.com/api/site/register/ HTTP/1.1\r
Host: www.example.com\r
Connection: keep-alive\r
...
"""
Note: LF = line feed = \n is there implicitly. Also, you didn't see the CR in Fiddler because it's whitespace, but it has to be there (a simple copy-paste won't work).
Also, HTTP specifies that after the headers there has to be a CRLF as well, i.e. your entire request should be:
mydata = """POST http://www.example.com/api/site/register/ HTTP/1.1\r
Host: www.example.com\r
Connection: keep-alive\r
Content-Length: 191\r
X-NewRelic-ID: UAIFVlNXGwEFV1hXAwY=\r
Origin: http://www.example.com\r
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36\r
Content-Type: application/x-www-form-urlencoded; charset=UTF-8\r
Accept: application/json, text/javascript, */*; q=0.01\r
X-Requested-With: XMLHttpRequest\r
X-CSRFToken: CEC9EzYaQOGBdO9HGPVVt3Fg66SVWVXg\r
DNT: 1\r
Referer: http://www.example.com/signup\r
Accept-Encoding: gzip,deflate,sdch\r
Accept-Language: en-GB,en;q=0.8\r
Cookie: sessionid=sessionid; sb-closed=true; arp_scroll_position=600; csrftoken=2u92jo23g929gj2; __utma=912.1.1.2.5.; __utmb=9139i91; __utmc=2019199; __utmz=260270731.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)\r
\r
username=user&password=password&moredata=here"""
Warning: it should be exactly as I've written. There can't be any spaces in front of each line, i.e. this:
mydata = """POST http://www.example.com/api/site/register/ HTTP/1.1\r
Host: www.example.com\r
Connection: keep-alive\r
...
"""
is wrong.
Side note: you can move mydata to the top, outside of the loop. Unimportant optimization but makes your code cleaner.
Now, you've said that the site you are using wants you to hex-encode the HTTP request? That's hard for me to believe (HTTP is a raw string by definition). Don't do that (and ask them to specify what exactly this hex-encoding means). Possibly they meant that the URL should be hex-encoded (since that is the only hex-encoding actually used in HTTP)? In your case there is nothing to encode, so don't worry about it. Just remove the .encode('hex') call.
Also, the Content-Length header is messed up. It should be the actual length of the content. So if, for example, the body is username=user&password=password&moredata=here, then it should be Content-Length: 45.
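In Python that's simply the length of the body string, for example:

body = "username=user&password=password&moredata=here"
print len(body)  # 45 -> the value the Content-Length header should carry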
Next, the server might not allow you to make multiple requests without reading a response. You should use new.recv(b), where b is the number of bytes you want to read. But how many should you read? Well, this might be problematic, and that's where the Content-Length header comes in. First you have to read the headers (i.e. read until you hit \r\n\r\n, which marks the end of the headers) and then read the body (based on the Content-Length header). As you can see, things are becoming messy (see the final section of my answer); a rough sketch of that read loop follows.
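A minimal sketch of such a read loop, assuming a plain (non-chunked) response on the socket from the question:

def recv_response(sock):
    # Read until the blank line (\r\n\r\n) that terminates the headers.
    data = ''
    while '\r\n\r\n' not in data:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    head, _, body = data.partition('\r\n\r\n')

    # Pull Content-Length out of the header lines, then read the rest of the body.
    content_length = 0
    for line in head.split('\r\n')[1:]:
        name, _, value = line.partition(':')
        if name.strip().lower() == 'content-length':
            content_length = int(value.strip())
    while len(body) < content_length:
        body += sock.recv(4096)
    return head, body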
There are probably more issues with your code. For example, X-CSRFToken suggests that the site uses a CSRF prevention mechanism. In that case your request might not work at all (you have to get the value of the X-CSRFToken header from the server first).
Finally: don't use sockets directly. Httplib (http://docs.python.org/2/library/httplib.html) is a great (and built-in) library for making HTTP requests which will deal with all the funky and tricky HTTP stuff for you. Your code for example may look like this:
import httplib

headers = {
    "Host": "www.example.com",
    "X-NewRelic-ID": "UAIFVlNXGwEFV1hXAwY=",
    "Origin": "http://www.example.com",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "X-CSRFToken": "CEC9EzYaQOGBdO9HGPVVt3Fg66SVWVXg",
    "DNT": "1",
    "Referer": "http://www.example.com/signup",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "en-GB,en;q=0.8",
    "Cookie": "sessionid=sessionid; sb-closed=true; arp_scroll_position=600; csrftoken=2u92jo23g929gj2; __utma=912.1.1.2.5.; __utmb=9139i91; __utmc=2019199; __utmz=260270731.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)"
}

body = "username=user&password=password&moredata=here"

conn = httplib.HTTPConnection("example.com")
conn.request("POST", "http://www.example.com/api/site/register/", body, headers)
res = conn.getresponse()
Note that you don't need to specify the Content-Length header; httplib fills it in for you when you pass a body.
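To actually inspect the result, read the response object that getresponse() returns; a small follow-up to the example above:

print res.status, res.reason  # e.g. 200 OK, or whatever status the server returned
print res.read()              # the response body
conn.close()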
