I built a scraper that works up to a point: it navigates to a list of records, parses the list to pick out key records for further crawling, and visits those individual records, but it can't parse the tables inside them because they are loaded via JavaScript. The page's JavaScript issues a POST request (XHR) to populate the tables, so with JavaScript disabled the response just says something like 'No records found.'
So I read this question: Link
I inspected the request in the browser dev tools and copied it as a fetch call; it looks like this:
fetch("https://example.com/Search/GridQuery?query=foo", {
"headers": {
"accept": "text/plain, */*; q=0.01",
"accept-language": "en-US,en;q=0.9,es;q=0.8",
"cache-control": "no-cache",
"content-type": "application/x-www-form-urlencoded",
"pragma": "no-cache",
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"x-requested-with": "XMLHttpRequest"
},
"referrer": "https://example.com/SiteSearch/Search?query=bar",
"referrerPolicy": "no-referrer-when-downgrade",
"body": "page=1&size=10&useFilters=false",
"method": "POST",
"mode": "cors",
"credentials": "include"
});
The browser dev tools do show a cookie being sent, although it is not included in the copied fetch snippet above...
I then tried this:
url = response.urljoin(response.css('div#Foo a::attr(href)').get())
yield Request(url=url,
              method='POST',
              body='{"filters": ["page": "1", "size": "10", "useFilters": "False"]}',
              headers={'x-requested-with': 'XMLHttpRequest'},
              callback=self.parse_table)
I get a response but it still says 'No records found'. So the POST request is not working right.
Do I need to put everything in the request header? How do I determine what must be included? Are cookies required?
I did not test this, since you didn't provide a real URL, but I see a couple of problems there.
Note that the content type is application/x-www-form-urlencoded, but you are sending a JSON object in the body (that would be for application/json).
Instead, you should be sending a FormRequest:
url = "https://example.com/Search/GridQuery?query=foo"
form_data = {"page": "1", "size": "10", "useFilters": "False"}
yield FormRequest(url, formdata=form_data, callback=self.parse_table)
Or simply pass them as query parameters in the URL (still a POST request, just with an empty body):
url = "https://example.com/Search/GridQuery?query=foo&page=1&size=10&useFilters=false"
Either way, you do not need that "filters": [...] wrapper; just use a simple key-value object.
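For reference, here is roughly how that would look inside a spider, keeping your X-Requested-With header. This is a sketch only; I have not run it against your site (no real URL), so the selectors, URLs, and field values are just the ones from your question:

import scrapy
from scrapy import FormRequest


class GridSpider(scrapy.Spider):
    name = "grid"
    start_urls = ["https://example.com/SiteSearch/Search?query=bar"]

    def parse(self, response):
        # Build the XHR URL the page's JavaScript would call.
        url = response.urljoin(response.css("div#Foo a::attr(href)").get())
        yield FormRequest(
            url,
            formdata={"page": "1", "size": "10", "useFilters": "false"},
            headers={"X-Requested-With": "XMLHttpRequest"},
            callback=self.parse_table,
        )

    def parse_table(self, response):
        # The response should now contain the server-rendered table rows.
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}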
So, basically, it seems that requests.get(url) never completes for TikTok user profile URLs:
import requests
url = "http://tiktok.com/#malopedia"
rep = requests.get(url) #<= will never complete
As I don't get any error message, I have no idea what's going on. Why is it not completing? How do I get it to complete?
TikTok is quite strict when it comes to automated connections, so you need to provide headers in your request, like this:
import requests
url = "http://tiktok.com/#malopedia"
rep = requests.get(
    url,
    headers={
        "Accept": "*/*",
        "Accept-Encoding": "identity;q=1, *;q=0",
        "Accept-Language": "en-US;en;q=0.9",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "User-Agent": "Mozilla/5.0",
    },
)
print(rep)
This should respond with 200.
However, if you plan on doing some heavy lifting with your code, consider using one of the unofficial API wrappers, like this one.
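If you want the request to fail fast instead of hanging while you experiment, you can also pass a timeout. A small sketch (the timeout value is arbitrary):

import requests

url = "http://tiktok.com/#malopedia"
try:
    # Same request as above, but a stalled connection now raises instead of hanging forever.
    rep = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    print(rep.status_code)
except requests.exceptions.Timeout:
    print("Timed out - the server is probably ignoring this client")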
Before I start, let me point out that I have almost no clue what I'm doing. Imagine a cat that tries to do some coding. I'm writing some Python code using PyCharm on Ubuntu 22.04.1 LTS, and I also used Insomnia, if that makes any difference. Here is the code:
# sad_scrape_code_attempt.py
import time
import httpx
from playwright.sync_api import sync_playwright

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://shop.metro.bg/shop/cart",
    "CallTreeId": "||BTOC-1BF47A0C-CCDD-47BB-A9DA-592009B5FB38",
    "Content-Type": "application/json; charset=UTF-8",
    "x-timeout-ms": "5000",
    "DNT": "1",
    "Connection": "keep-alive",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin"
}

def get_cookie_playwright():
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=False, slow_mo=50)
        context = browser.new_context()
        page = context.new_page()
        page.goto('https://shop.metro.bg/shop/cart')
        page.fill('input#user_id', 'the_sad_cat_username')
        page.fill('input#password', 'the_sad_cat_password')
        page.click('button[type=submit]')
        page.click('button.btn-primary.accept-btn.field-accept-button-name')
        # Keep scrolling to the bottom until the page height stops changing.
        page.evaluate(
            """
            var intervalID = setInterval(function () {
                var scrollingElement = (document.scrollingElement || document.body);
                scrollingElement.scrollTop = scrollingElement.scrollHeight;
            }, 200);
            """
        )
        prev_height = None
        while True:
            curr_height = page.evaluate('(window.innerHeight + window.scrollY)')
            if not prev_height:
                prev_height = curr_height
                time.sleep(1)
            elif prev_height == curr_height:
                page.evaluate('clearInterval(intervalID)')
                break
            else:
                prev_height = curr_height
                time.sleep(1)
        # print(context.cookies())
        # Grab the BIGipServer... load-balancer cookie by its position in the list.
        cookie_for_requests = context.cookies()[11]['value']
        browser.close()
        return cookie_for_requests

def req_with_cookie(cookie_for_requests):
    cookies = dict(
        Cookie=f'BIGipServerbetty.metrosystems.net-80={cookie_for_requests};')
    r = httpx.get('https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current', cookies=cookies)
    return r.text

if __name__ == '__main__':
    data = req_with_cookie(get_cookie_playwright())
    print(data)

# Used packages:
# Playwright
# PyTest
# PyTest-Playwright
# JavaScript
# TypeScript
# httpx
So basically I copy-pasted the code from two tutorials by John Watson Rooney, called:
The Biggest Mistake Beginners Make When Web Scraping
Login and Scrape Data with Playwright and Python
Then I combined them and added some JavaScript to scroll to the bottom of the page. Then I found an article called "How Headers Are Used to Block Web Scrapers and How to Fix It",
so I replaced "import requests" with "import httpx" and added the HEADERS as given by Insomnia. From what I understand, browsers send headers in a certain order, and this is an often overlooked way of identifying web scrapers, mainly because many HTTP clients in various programming languages implement their own header ordering, which makes identifying scrapers very easy! If this is true, I need to figure out a way to send my Cookie header in the correct position, which I have no clue how to determine, but I believe it's #11 or #3 judging by the code generated by Insomnia:
import requests
url = "https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current"
querystring = {"customerId":"1001100022726355","cardholderNumber":"1","storeId":"00022","country":"BG","locale":"bg-BG","fsdAddressId":"1001100022726355994-AD0532EI","__t":"1668082324830"}
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
"Accept": "*/*",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://shop.metro.bg/shop/cart",
"CallTreeId": "||BTOC-1BF47A0C-CCDD-47BB-A9DA-592009B5FB38",
"Content-Type": "application/json; charset=UTF-8",
"x-timeout-ms": "5000",
"DNT": "1",
"Connection": "keep-alive",
"Cookie": "selectedLocale_BG=bg-BG; BIGipServerbetty.metrosystems.net-80=!DHrH53oKfz3YHEsEdKzHuTxiWd+ak6uA3C+dv7oHRDuEk+ScE0MCf7DPAzLTCmE+GApsIOFM2GKufYk=; anonymousUserId=24EE2F84-55B5-4F94-861E-33C4EB770DC6; idamUserIdToken=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IktfYWE1NTAxNWEtMjA2YS0xMWVkLTk4ZDUtZTJjYzEyYjBkYzUwIn0.eyJleHAiOjE2NjgwODQxMjIsImlhdCI6MTY2ODA4MjMyMiwiYXVkIjoiQlRFWCIsInNlc3Npb25fc3RhdGUiOiJPTnJweFVhOG12WHRJeDR0c3pIZ09GR296WHUyeHZVVzVvNnc3eW1lLUdZLnJMRU1EWGFGIiwiaXNzIjoiaHR0cHM6Ly9pZGFtLm1ldHJvLmJnIiwiZW1haWwiOiJvZmZpY2VAdGVydmlvbi5iZyIsIm5vbmNlIjoiZjg3ZDMyYzEyYTRkNDY1ZGEzYjQwMTQ3OTlkYzc4NzMiLCJjX2hhc2giOiIiLCJzdWIiOiJVX2Y0MjBhY2E4LWY2OTMtNGMxNS1iOTIzLTc1NWY5NTc3ZTIwMCIsImF0X2hhc2giOiJlbkFGRFNJdUdmV0wzNnZ0UnJEQ253IiwicmVhbG0iOiJTU09fQ1VTVF9CRyIsImF1dGhfdGltZSI6MTY2ODA4MjMyMiwiYW1yIjpbIlVTRVJfQ1JFREVOVElBTFMiXX0.AC9vccz5PBe0d2uD6tHV5KdQ8_zbZvdARGUqo5s8KpJ0bGw97vm3xadF5TTHBUwkXX3oyJsbygC1tKvQInycU-zE0sqycIDtjP_hAGf6tUG-VV5xvtRsxBkacTBMy8OmbNHi5oncko7-dZ_tSOzQwSclLZKgKaqBcCqPBQVF0ug4pvbbqyZcw6D-MH6_T5prF7ppyqY11w9Ps_c7pFCciFR965gsO3Q-zr8CjKq1qGJeEpBFMKF0vfwinrc4wDpC5zd0Vgyf4ophzo6JkzA8TiWOGou5Z0khIpl435qUzxzt-WPFwPsPefhg_X9fYHma_OqQIpNjnV2tQwHqBD1qMTGXijtfOFQ; USER_TYPE=CUST; compressedJWT=eNpVUtlyozAQ/CJvcdrhEZtLGGGbQ4BeUlwGiTMhMeCvX5Fkt3YfVKrqme6eaalc7Tozc3Ihth8+Ae8SW/lVrvaYi3AD7Vx0eRwD4pzsuohuG3YsIm/EkUyxDybQjVzqgz2gqnBrj0ZpthNEtzUNqjWl3uqb4xkxA8Z/FhHY+ATHHld+cdFnYbZcZqIPpsflK9PpsBbw4LvfVFYcsh6LzdLJfGbOE+hR8B9ObOmG4FTqLgz4InCs+hhw81Q0BnQsHIQGmBLe3TR/7nzC7fHqmBh6uuIDMpMCuVwm2u2Xf2NbngbWDc9NQ85MpcYnhvcfOejtB5s1B3TMQefyueg9sgit8QlM8cnmc1P+rlF9hpq+QE2dIQUipMnTDRiPLBuvzjtvyISlwbF9KSKe5WH/8Izvnt5rE6FGuYDWsFMmjOa/+zMfLmWegYkEHC0/PO+P9qPYcuzbb5ztwvqVr1061LHzTHX8yDu33XbCnTHlQsgydcesK5iPO2JBvmbk3xpmH6RtNt00YnNQXXBpNV+0UIYU8lCD2ztKOdODQSJcNFVyg2aF60zS2GVvjvQk9lpAh8WliQS1aoVPwPJQn/fbr0vdxRiDJLh7d8pJhzVeNIW+75QK7H0zFVp9Z3BeGmZlA17s5LAcHDgjmc8vO/QiqorcSOenYVEx0/HJATQIqDJxAS7qsKnGQqrrXf5qNaf9GyRl3emruki8vxg0It5IhsxSfI8lGkvl+72qsoNMjhUp75xzR7NRq83w0Pp6oRqg74eq65zPaD/H9TX6GIyDfmFccfA8/fVtkPe7y5AUosA+fpZWBO0l9QzSZIfuoeG2n8aJNKG0WMfoap2XOcVJKT0ex9ep0m9vZv0gJwkqKue+Xb0TZ0Bjz+HMqi9W6Z81h+8PCaRZTJtoFYOun46FkQiPyFmGF65/VX33RdKl+ZYcXDvs7/Nv6PdLkg==; SES2_customerAdr_1001100022726355={%22addressId%22:%221001100022726355994-AD0532EI%22%2C%22addressHash%22:%221001100022726355994-AD0532EI%22%2C%22storeId%22:%2200022%22}; SES2_customerAdr_={%22addressId%22:null%2C%22addressHash%22:null%2C%22storeId%22:%2200022%22}; UserSettings=SelectedStore=1b1fc6ac-2ad6-4243-806e-a4a28c96dff4&SelectedAddress=1001100022726355994-ad0532ei",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin"
}
response = requests.request("GET", url, headers=headers, params=querystring)
print(response.text)
So I'm stuck. Any help or ideas will be greatly appreciated.
The page you're navigating to shows this on a GET request:
HTTP ERROR 401 You must provide a http header 'JWT'
This means that this page requires a level of authorization to be accessed.
See JWTs.
"Authorization: This is the most common scenario for using JWT. Once the user is logged in, each subsequent request will include the JWT, allowing the user to access routes, services, and resources that are permitted with that token."
You can access the root page just fine, but once you navigate to more user-specific pages or "routes", you will need to provide a JWT to access that page's content.
There is a way to get past this when scraping: log in to the site with your scraper, collect the JWT that the server creates and sends back to your client, and then use that JWT in your request's headers:
token = "randomjwtgibberishshahdahdahwdwa"
HEADERS = {
"Authorization": "Bearer " + token
}
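Here is a rough sketch of that flow using the Playwright and httpx setup from your code. This is untested: the selectors and URLs are taken from your question, and whether the site wants the token in an Authorization header or in a header literally named JWT (as the 401 message hints) is something you will have to confirm:

import httpx
from playwright.sync_api import sync_playwright


def get_jwt():
    captured = {}

    def on_request(request):
        # Watch outgoing requests for a bearer token once we are logged in.
        auth = request.headers.get("authorization", "")
        if auth.startswith("Bearer "):
            captured["jwt"] = auth.split(" ", 1)[1]

    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)
        page = browser.new_page()
        page.on("request", on_request)
        page.goto("https://shop.metro.bg/shop/cart")
        page.fill("input#user_id", "the_sad_cat_username")
        page.fill("input#password", "the_sad_cat_password")
        page.click("button[type=submit]")
        page.wait_for_load_state("networkidle")
        browser.close()
    return captured.get("jwt")


token = get_jwt()
r = httpx.get(
    "https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current",
    headers={"Authorization": f"Bearer {token}"},  # or {"JWT": token} if that is what the 401 asks for
)
print(r.status_code, r.text)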
I am trying to access the data on this Binance website. It is the P2P: https://p2p.binance.com/en/trade/buy/USDT.
For BUY I am using this in python3 (I am getting the data correctly for this section):
import requests
headers = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"Content-Length": "123",
"content-type": "application/json",
"Host": "p2p.binance.com",
"Origin": "https://p2p.binance.com",
"Pragma": "no-cache",
"TE": "Trailers",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"
}
data = {
"asset": "USDT",
"fiat": "ARS",
"merchantCheck": False,
"page": 1,
"payTypes": [],
"publisherType": None,
"rows": 50,
"tradeType": "BUY"
}
r = requests.post('https://p2p.binance.com/bapi/c2c/v2/friendly/c2c/adv/search', headers=headers, json=data)
print(r.text)
But when I want to access this part of the page (to SELL): https://p2p.binance.com/en/trade/sell/USDT, I can't. When I change "tradeType" in the data to "SELL", it still returns the same values as BUY; it never returns the SELL data.
And I have not figured out why yet.
Instead of sending requests to the website itself, you can send a request directly to its data source. If you check the network tab of your browser's dev tools while the website loads results, you will notice it sends a request to https://p2p.binance.com/bapi/c2c/v2/friendly/c2c/adv/search,
which returns an array of trade details. I believe this is what you need.
Expanding on the accepted answer
https://p2p.binance.com/bapi/c2c/v2/friendly/c2c/adv/search
POST fields:
{
    "asset": "USDT",
    "fiat": "NGN",
    "merchantCheck": true,
    "page": 1,
    "payTypes": ["BANK"],
    "publisherType": null,
    "rows": 20,
    "tradeType": "SELL",
    "transAmount": "5000"
}
asset: Currently available assets are USDT, BTC, BNB, BUSD, ETH, DAI.
fiat: The list is long; visit sanchezmarcos.
merchantCheck: I don't know its exact use, but the value can be null, true, or false.
page: The endpoint is paginated.
payTypes: An array of payment types, for example BANK, GoMoney, CashDeposit, etc.
The available payTypes depend on the fiat used, so you might not see some of these, but there are a lot of payment types.
publisherType: I'm only aware of merchant.
rows: Number of rows to return, from 1 to 20.
tradeType: BUY or SELL.
transAmount: Filter merchants by amount.
Note: the API is for Binance's internal operation, which means it can change at any time.
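Putting those fields together, a minimal sketch of a SELL query against the endpoint (the values are just examples, and since the API is internal the response shape is not guaranteed):

import requests

url = "https://p2p.binance.com/bapi/c2c/v2/friendly/c2c/adv/search"
payload = {
    "asset": "USDT",
    "fiat": "NGN",
    "merchantCheck": True,
    "page": 1,
    "payTypes": ["BANK"],
    "publisherType": None,
    "rows": 20,
    "tradeType": "SELL",
    "transAmount": "5000",
}
# json= sets the application/json content type for us.
r = requests.post(url, json=payload)
adverts = r.json().get("data", [])
print(r.status_code, len(adverts), "adverts returned")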
I did check the result of the API call and can confirm the data is correct except for tradeType; you did nothing wrong.
You can use client.get_c2c_trade_history() and read its ['data'] key.
The next step is to work with it as JSON or a DataFrame; the column you need has the header 'tradeType'.
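A sketch of what that could look like with the python-binance client; the tradeType parameter and the exact response layout are assumptions you should check against the wrapper's documentation:

import pandas as pd
from binance.client import Client

# Your own API credentials are required for the C2C history endpoint.
client = Client("YOUR_API_KEY", "YOUR_API_SECRET")
history = client.get_c2c_trade_history(tradeType="SELL")  # tradeType assumed here

# Work with the ['data'] part as a DataFrame and look at the 'tradeType' column.
df = pd.DataFrame(history["data"])
print(df["tradeType"].value_counts())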
I inspected how a request was sent to a website in Firefox.
(Unfortunately I had to change the website URL to a fake one to prevent the server from being requested too much.)
I tried to do this request in python:
import requests
import json
seq = 'ATGGCAGACTCTATTGAGGTC'
url = 'http://www.test.com'
body = {'QUERY': seq}
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
r = requests.post(url, data=json.dumps(body), headers=headers)
print(r.text)
However, when doing this the website says: 'Empty gene sequence passed for blast analysis. Please enter a valid gene sequence.' So that means the sequence (i.e. QUERY) is not being sent correctly to the server. What am I missing here?
(P.S. Hopefully the missing website is not a problem for answering this question; if it is, please let me know and maybe I can ask for permission to mention their website.)
I am guessing the string/sequence that you are submitting to that particular website is the problem. I ran your sample code against a website that accepts POSTs:
import requests
import json
seq = 'ATGGCAGACTCTATTGAGGTC'
url = 'http://httpbin.org/post'
body = {'QUERY': seq}
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
r = requests.post(url, data=json.dumps(body), headers=headers)
print(r.text)
And got this result, which shows your query properly formed:
{
"args": {},
"data": "{\"QUERY\": \"ATGGCAGACTCTATTGAGGTC\"}",
"files": {},
"form": {},
"headers": {
"Accept": "text/plain",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "34",
"Content-Type": "application/json",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.22.0"
},
"json": {
"QUERY": "ATGGCAGACTCTATTGAGGTC"
},
"origin": "2.122.222.8, 2.122.222.8",
"url": "https://httpbin.org/post"
}
Are you sure it's supposed to be "QUERY=<>"? It could be incorrect formatting of the body. Normally it's in JSON format, as in "title": "information". Note the ':' rather than the '='.
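For example, if the server actually expects a classic HTML form submission rather than JSON, a sketch of the form-encoded variant would be (field name QUERY taken from your snippet, URL still the placeholder):

import requests

seq = 'ATGGCAGACTCTATTGAGGTC'
url = 'http://www.test.com'  # placeholder URL from the question

# Passing a dict to data= sends application/x-www-form-urlencoded
# (QUERY=ATGG...), which is what a plain HTML form would submit.
r = requests.post(url, data={'QUERY': seq})
print(r.status_code)
print(r.text[:500])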
I'm trying to grab a cookie from a POST request. Previously I used urllib2, which still works fine, but I wanted to switch to the clearer python-requests library. Unfortunately, I get an error on the page.
Since the requests are over HTTPS, I can't sniff them to locate the difference.
urllib2 code:
NINTENDO_LOGIN_PAGE = "https://id.nintendo.net/oauth/authorize/"
MIIVERSE_CALLBACK_URL = "https://miiverse.nintendo.net/auth/callback"
parameters = {'client_id': 'ead88d8d450f40ada5682060a8885ec0',
'response_type': 'code',
'redirect_uri': MIIVERSE_CALLBACK_URL,
'username': MIIVERSE_USERNAME,
'password': miiverse_password}
data = urlencode(parameters)
self.logger.debug(data)
req = urllib2.Request(NINTENDO_LOGIN_PAGE, data)
page = urllib2.urlopen(req).read()
self.logger.debug(page)
Result (good):
[...]
<div id="main-body">
<div id="try-miiverse">
<p class="try-miiverse-catch">A glimpse at some of the posts that are currently popular on Miiverse.</p>
<h2 class="headline">Miiverse Sampler</h2>
<div id="slide-post-container" class="list post-list">
[...]
Requests code:
req = requests.post(NINTENDO_LOGIN_PAGE, data=parameters)
self.logger.debug(req.text)
Result (bad):
[...]
<div id="main-body">
<h2 class="headline">Activity Feed</h2>
<div class="activity-feed content-loading-window">
<div>
<img src="https://d13ph7xrk1ee39.cloudfront.net/img/loading-image-green.gif" alt=""></img>
<p class="tleft"><span>Loading activity feed...</span></p>
</div>
</div>
<div class="activity-feed content-load-error-window none"><div>
<p>The activity feed could not be loaded. Check your Internet connection, wait a moment and then try reloading.</p>
<div class="buttons-content">Reload</div>
</div>
</div>
[...]
Thanks in advance for any hints towards solving this.
Update 1: Thank you all for your responses!
As suggested by @abarnert, I checked the redirects.
resp = urllib2.urlopen(req)
print(resp.geturl()) # https://miiverse.nintendo.net/
req = requests.post(NINTENDO_LOGIN_PAGE, data=parameters)
print(req.url) # https://miiverse.nintendo.net/
print(req.history) # (<Response [303]>, <Response [302]>)
It seems they did both follow a redirect, but ended up in the same place.
@sigmavirus24, very useful website, thank you for making me discover it. Here are the results (I edited the order of parameters so they are easily comparable):
urllib2:
{
"args": {},
"data": "",
"files": {},
"form": {
"client_id": "ead88d8d450f40ada5682060a8885ec0",
"response_type": "code",
"redirect_uri": "https://miiverse.nintendo.net/auth/callback",
"username": "Wiwiweb",
"password": "password"
},
"headers": {
"Accept-Encoding": "identity",
"Connection": "close",
"Content-Length": "170",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "Python-urllib/2.7"
},
"json": null,
"origin": "24.85.129.188",
"url": "http://httpbin.org/post"
}
requests:
{
"args": {},
"data": "",
"files": {},
"form": {
"client_id": "ead88d8d450f40ada5682060a8885ec0",
"response_type": "code",
"redirect_uri": "https://miiverse.nintendo.net/auth/callback",
"username": "Wiwiweb",
"password": "password"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, compress",
"Connection": "close",
"Content-Length": "170",
"Content-Type": "application/x-www-form-urlencoded"
"Host": "httpbin.org",
"User-Agent": "python-requests/1.2.3 CPython/2.7.5 Windows/7"
},
"json": null,
"origin": "24.85.129.188"
"url": "http://httpbin.org/post",
}
Looks like some headers are slightly different. I don't have any other ideas, so I might as well try to copy the urllib2 headers completely. Spoofing the user agent might be it.
Update 2: I have added these headers to the "requests" request:
headers = {'User-Agent': 'Python-urllib/2.7',
'Accept-Encoding': 'identity'}
I am still getting the same results... The only difference between the requests now is the "requests" one has an extra header: "Accept": "*/*". I'm not sure this is the problem.
Could it be coming from the redirect?
Well, I didn't quite solve "why" the redirects are different, but I found out where to get my cookie using requests.
I figured the difference between the two libraries had something to do with the way they handle redirects. So I checked the history of both requests. For 'requests' that's as easy as doing req.history, but for urllib2, I used this bit of code:
class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        print("New request:")
        print(headers)
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)

opener = urllib2.build_opener(MyHTTPRedirectHandler, urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
Checking the history allowed me to see that the 'requests' request had the 'set-cookie' header during its first redirect (so the second request out of three), but not at the end. That's good enough for me, because I know where to get it now: req.history[1].cookies['ms']
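For anyone landing here later, a small sketch of scanning the redirect history for the cookie instead of hard-coding history[1] (cookie name 'ms' as above, using the NINTENDO_LOGIN_PAGE and parameters from the question):

import requests

resp = requests.post(NINTENDO_LOGIN_PAGE, data=parameters)

ms_cookie = None
for hop in resp.history:  # intermediate responses in the redirect chain
    if 'ms' in hop.cookies:
        ms_cookie = hop.cookies['ms']
        break
print(ms_cookie)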
As a curious note, because of that bit that I added to the urllib2 request, it started returning the same thing as the 'requests' request! Even changing it to that:
class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    pass

opener = urllib2.build_opener(MyHTTPRedirectHandler, urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
is enough to make it completely change its response to the same thing 'requests' returned all along (the bit I marked as 'bad' in the question).
I'm stumped, but knowing where to find the cookie is good enough for me. Maybe someone curious and bored will be interested in trying to find the cause.
Thank you all for your help :)