How to fix 403 response in Scrapy - python

http://prntscr.com/o56670
Please check the screenshot
I am using python 3 and using scrapy in my terminal.
fetch("https://angel.co/adil-wali")
When the link is requested, it responses with 403.
so i have changed and rotated user agent and robots obey false but still showing 403 response so this time i buy crawlera plan but crawlera still saying 523 response
Do you have any idea about why the request returns 403 instead of 200 response in scrapy shell

Try adding headers to your request:
fetch(
"https://angel.co/adil-wali",
headers={
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"accept-language": "en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7",
"cache-control": "no-cache",
"pragma": "no-cache",
"upgrade-insecure-requests": "1"
}
)
With this approach, I was able to get Response 200 from the mentioned URL.

Related

requests.get() not completing with Tiktok user profile

So, basically, it seems that requests.get(url) can't complete with Tiktok user profiles url:
import requests
url = "http://tiktok.com/#malopedia"
rep = requests.get(url) #<= will never complete
As I don't get any error message, I have no idea what's going on. Why is it not completing? How do I get it to complete?
TikTok is quite strict when it comes to automated connections so you need to provide headers in your request, like this:
import requests
url = "http://tiktok.com/#malopedia"
rep = requests.get(
url,
headers={
"Accept": "*/*",
"Accept-Encoding": "identity;q=1, *;q=0",
"Accept-Language": "en-US;en;q=0.9",
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"Pragma": "no-cache",
"User-Agent": "Mozilla/5.0",
},
)
print(rep)
This should respond with 200.
However, if you plan on doing some heavy lifting with your code, consider using one of the unofficial API wrappers, like this one.

Python web-scraping: error 401 You must provide a http header

Before I start let me point out that I have almost no clue wtf I'm doing. Like imagine a cat that tries to do some coding. I try to write some Python code using Pycharm on Ubuntu 22.04.1 LTS and also used Insomnia if this makes any difference. Here is the code:
`
# sad_scrape_code_attempt.py
import time
import httpx
from playwright.sync_api import sync_playwright
HEADERS = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
"Accept": "*/*",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://shop.metro.bg/shop/cart",
"CallTreeId": "||BTOC-1BF47A0C-CCDD-47BB-A9DA-592009B5FB38",
"Content-Type": "application/json; charset=UTF-8",
"x-timeout-ms": "5000",
"DNT": "1",
"Connection": "keep-alive",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin"
}
def get_cookie_playwright():
with sync_playwright() as p:
browser = p.firefox.launch(headless=False, slow_mo=50)
context = browser.new_context()
page = context.new_page()
page.goto('https://shop.metro.bg/shop/cart')
page.fill('input#user_id', 'the_sad_cat_username')
page.fill('input#password', 'the_sad_cat_password')
page.click('button[type=submit]')
page.click('button.btn-primary.accept-btn.field-accept-button-name')
page.evaluate(
"""
var intervalID = setInterval(function () {
var scrollingElement = (document.scrollingElement || document.body);
scrollingElement.scrollTop = scrollingElement.scrollHeight;
}, 200);
"""
)
prev_height = None
while True:
curr_height = page.evaluate('(window.innerHeight + window.scrollY)')
if not prev_height:
prev_height = curr_height
time.sleep(1)
elif prev_height == curr_height:
page.evaluate('clearInterval(intervalID)')
break
else:
prev_height = curr_height
time.sleep(1)
# print(context.cookies())
cookie_for_requests = context.cookies()[11]['value']
browser.close()
return cookie_for_requests
def req_with_cookie(cookie_for_requests):
cookies = dict(
Cookie=f'BIGipServerbetty.metrosystems.net-80={cookie_for_requests};')
r = httpx.get('https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current', cookies=cookies)
return r.text
if __name__ == '__main__':
data = req_with_cookie(get_cookie_playwright())
print(data)
# Used packages
#Playwright
#PyTest
#PyTest-Playwirght
#JavaScript
#TypeScript
#httpx
`
so basically I copy paste the code of 2 tutorials made by John Watson Rooney called:
The Biggest Mistake Beginners Make When Web Scraping
Login and Scrape Data with Playwright and Python
Than combined them and added some JavaScript to scroll to the bottom of the page. Than I found an article called: How Headers Are Used to Block Web Scrapers and How to Fix It
thus replacing "import requests" with "import httpx" and added the HEADERS as per given from Insomnia. From what I understand browsers return headers in certain order and this is an often overlooked web scraper identification method. Primarily because many http clients in various programming languages implement their own header ordering - making identification of web scrapers very easy! If this is true I need to figure out a way to return my cookies header following the correct order, which by the way I have no clue how to figure out but I believe its #11 or #3 judging by the code generated by Insomnia:
`
import requests
url = "https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current"
querystring = {"customerId":"1001100022726355","cardholderNumber":"1","storeId":"00022","country":"BG","locale":"bg-BG","fsdAddressId":"1001100022726355994-AD0532EI","__t":"1668082324830"}
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
"Accept": "*/*",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://shop.metro.bg/shop/cart",
"CallTreeId": "||BTOC-1BF47A0C-CCDD-47BB-A9DA-592009B5FB38",
"Content-Type": "application/json; charset=UTF-8",
"x-timeout-ms": "5000",
"DNT": "1",
"Connection": "keep-alive",
"Cookie": "selectedLocale_BG=bg-BG; BIGipServerbetty.metrosystems.net-80=!DHrH53oKfz3YHEsEdKzHuTxiWd+ak6uA3C+dv7oHRDuEk+ScE0MCf7DPAzLTCmE+GApsIOFM2GKufYk=; anonymousUserId=24EE2F84-55B5-4F94-861E-33C4EB770DC6; idamUserIdToken=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IktfYWE1NTAxNWEtMjA2YS0xMWVkLTk4ZDUtZTJjYzEyYjBkYzUwIn0.eyJleHAiOjE2NjgwODQxMjIsImlhdCI6MTY2ODA4MjMyMiwiYXVkIjoiQlRFWCIsInNlc3Npb25fc3RhdGUiOiJPTnJweFVhOG12WHRJeDR0c3pIZ09GR296WHUyeHZVVzVvNnc3eW1lLUdZLnJMRU1EWGFGIiwiaXNzIjoiaHR0cHM6Ly9pZGFtLm1ldHJvLmJnIiwiZW1haWwiOiJvZmZpY2VAdGVydmlvbi5iZyIsIm5vbmNlIjoiZjg3ZDMyYzEyYTRkNDY1ZGEzYjQwMTQ3OTlkYzc4NzMiLCJjX2hhc2giOiIiLCJzdWIiOiJVX2Y0MjBhY2E4LWY2OTMtNGMxNS1iOTIzLTc1NWY5NTc3ZTIwMCIsImF0X2hhc2giOiJlbkFGRFNJdUdmV0wzNnZ0UnJEQ253IiwicmVhbG0iOiJTU09fQ1VTVF9CRyIsImF1dGhfdGltZSI6MTY2ODA4MjMyMiwiYW1yIjpbIlVTRVJfQ1JFREVOVElBTFMiXX0.AC9vccz5PBe0d2uD6tHV5KdQ8_zbZvdARGUqo5s8KpJ0bGw97vm3xadF5TTHBUwkXX3oyJsbygC1tKvQInycU-zE0sqycIDtjP_hAGf6tUG-VV5xvtRsxBkacTBMy8OmbNHi5oncko7-dZ_tSOzQwSclLZKgKaqBcCqPBQVF0ug4pvbbqyZcw6D-MH6_T5prF7ppyqY11w9Ps_c7pFCciFR965gsO3Q-zr8CjKq1qGJeEpBFMKF0vfwinrc4wDpC5zd0Vgyf4ophzo6JkzA8TiWOGou5Z0khIpl435qUzxzt-WPFwPsPefhg_X9fYHma_OqQIpNjnV2tQwHqBD1qMTGXijtfOFQ; USER_TYPE=CUST; compressedJWT=eNpVUtlyozAQ/CJvcdrhEZtLGGGbQ4BeUlwGiTMhMeCvX5Fkt3YfVKrqme6eaalc7Tozc3Ihth8+Ae8SW/lVrvaYi3AD7Vx0eRwD4pzsuohuG3YsIm/EkUyxDybQjVzqgz2gqnBrj0ZpthNEtzUNqjWl3uqb4xkxA8Z/FhHY+ATHHld+cdFnYbZcZqIPpsflK9PpsBbw4LvfVFYcsh6LzdLJfGbOE+hR8B9ObOmG4FTqLgz4InCs+hhw81Q0BnQsHIQGmBLe3TR/7nzC7fHqmBh6uuIDMpMCuVwm2u2Xf2NbngbWDc9NQ85MpcYnhvcfOejtB5s1B3TMQefyueg9sgit8QlM8cnmc1P+rlF9hpq+QE2dIQUipMnTDRiPLBuvzjtvyISlwbF9KSKe5WH/8Izvnt5rE6FGuYDWsFMmjOa/+zMfLmWegYkEHC0/PO+P9qPYcuzbb5ztwvqVr1061LHzTHX8yDu33XbCnTHlQsgydcesK5iPO2JBvmbk3xpmH6RtNt00YnNQXXBpNV+0UIYU8lCD2ztKOdODQSJcNFVyg2aF60zS2GVvjvQk9lpAh8WliQS1aoVPwPJQn/fbr0vdxRiDJLh7d8pJhzVeNIW+75QK7H0zFVp9Z3BeGmZlA17s5LAcHDgjmc8vO/QiqorcSOenYVEx0/HJATQIqDJxAS7qsKnGQqrrXf5qNaf9GyRl3emruki8vxg0It5IhsxSfI8lGkvl+72qsoNMjhUp75xzR7NRq83w0Pp6oRqg74eq65zPaD/H9TX6GIyDfmFccfA8/fVtkPe7y5AUosA+fpZWBO0l9QzSZIfuoeG2n8aJNKG0WMfoap2XOcVJKT0ex9ep0m9vZv0gJwkqKue+Xb0TZ0Bjz+HMqi9W6Z81h+8PCaRZTJtoFYOun46FkQiPyFmGF65/VX33RdKl+ZYcXDvs7/Nv6PdLkg==; SES2_customerAdr_1001100022726355={%22addressId%22:%221001100022726355994-AD0532EI%22%2C%22addressHash%22:%221001100022726355994-AD0532EI%22%2C%22storeId%22:%2200022%22}; SES2_customerAdr_={%22addressId%22:null%2C%22addressHash%22:null%2C%22storeId%22:%2200022%22}; UserSettings=SelectedStore=1b1fc6ac-2ad6-4243-806e-a4a28c96dff4&SelectedAddress=1001100022726355994-ad0532ei",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin"
}
response = requests.request("GET", url, headers=headers, params=querystring)
print(response.text)
`
So I'm stuck. Any help or ideas will be greatly appreciated.
The page you're navigating to shows this on a GET request:
HTTP ERROR 401 You must provide a http header 'JWT'
This means that this page requires a level of authorization to be accessed.
See JWTs.
"Authorization: This is the most common scenario for using JWT. Once the user is logged in, each subsequent request will include the JWT, allowing the user to access routes, services, and resources that are permitted with that token."
You can access the root page just fine, but once you navigate to more user specific pages or "routes", you will need to provide a JWT to access that page's content.
There is a way to get past this using scraping. You will have to log in to the site as a user using your scraper and collect the JWT which is created by the server and sent back to your client. Then use that JWT in your request's headers:
token = "randomjwtgibberishshahdahdahwdwa"
HEADERS = {
"Authorization": "Bearer " + token
}

JSON web scraping query encounters 'forbidden' error

I encounter an error "403 Client Error: Forbidden for url:" when running the following code.
import requests
url = "https://www.marinetraffic.com/map/gettrackjson/shipid:5630138/stdate:2022-05-18%2015:07/endate:2022-05-19%2015:07/trackorigin:livetrack"
headers = {
"accept": "application/json",
"accept-encoding": "gzip, deflate",
"user-agent": "Mozilla/5.0",
"x-requested-with": "XMLHttpRequest"
}
response = requests.get(url, headers=headers)
response.raise_for_status()
print(response.json())
On the other hand, replacing with following url works:
url = "https://www.marinetraffic.com/vesselDetails/latestPosition/shipid:5630138"
Would anyone know why the first url doesn't work and if there's a way to make it work? The original page is at https://www.marinetraffic.com/en/ais/home/centerx:79.3/centery:6.9/zoom:10.

Cannot get cookies with python requests while postman, curl, and wget work

I'm trying to authenticate on a French water provider website to get my water consumption data. The website does not provide any api and I'm trying to make a python script that authenticates on the website and crawls the data. My work is based on a working Domoticz python script and a shell script.
The workflow is the following:
Get a token from the website
Authenticate with login, password, and token get at step 1
Get 1 or more cookies from step 2
Get data using the cookie(s) from 3
I'm stuck at step 2 where I can't get the cookies with my python script. I tried with postman, curl, and wget and it is working. I even used the python code generated by postman and I still get no cookies.
Heres is a screenshot of my postman post request
which gives two cookies in the response.
And here is my python code:
import requests
url = "https://www.toutsurmoneau.fr/mon-compte-en-ligne/je-me-connecte"
querystring = {"_username":"mymail#gmail.com","_password":"mypass","_csrf_token":"knfOIFZNhiCVxHS0U84GW5CrfMt36eLvqPPYGDSsOww","signin[username]":"mymail#gmail.com","signin[password]":"mypass","tsme_user_login[_username]":"mymail#gmail.com","tsme_user_login[_password]":"mypass"}
payload = ""
headers = {
'Accept': "application/json, text/javascript, */*; q=0.01",
'Content-Type': "application/x-www-form-urlencoded",
'Accept-Language': "fr,fr-FR;q=0.8,en;q=0.6",
'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Mobile Safari/537.36",
'Connection': "keep-alive",
'cache-control': "no-cache",
'Postman-Token': "c7e5f7ca-abea-4161-999a-3c28ec979628"
}
response = requests.request("POST", url, data=payload, headers=headers, params=querystring)
print(response.cookies.get_dict())
The output is {}.
I cannot figure out what I'm doing wrong.
If you have any help to provide, I'll be happy to get it.
Thanks for reading.
Edit:
Some of my assumptions were wrong. The shell script was indeed working but not Postman. I was confused because of response 200 I receive.
So I answer my own question.
First, when getting the token at step 1, I receive a cookie. I'm supposed to use this cookie when logging in which I did not do before.
Then, when using this cookie and the token to log in step 2, I was not able to see any cookie in the response I receive while I was well connected (I find in the content a "disconnect" string which is here only if well logged in). That's a normal behavior since cookies are not sent in the response of a post request.
I had to create a requests.session to post my log in form, and the session stores the cookie.
Now, I'm able to use this information to grab the data from the server.
Hope that will help others.

Python requests returning succesful POST request, but not redirecting page on Twitter

I've got a script that is meant to at first login to Twitter. I get a 200 response if I check it, but I don't redirect to a logged in Twitter account after succeeding, instead it stays on the same page.
url = 'https://twitter.com/login/error?redirect_after_login=%2F'
r = requests.session()
# Need the headers below as so Twitter doesn't reject the request.
headers = {
'Host': "twitter.com",
'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
'Accept-Language': "en-US,en;q=0.5",
'Accept-Encoding': "gzip, deflate, br",
'Referer': "https://twitter.com/login/error?redirect_after_login=%2F",
'Upgrade-Insecure-Requests': "1",
'Connection': "keep-alive"
}
login_data = {"session[username_or_email]":"Username", "session[password]":"Password"}
response = r.post(url, data=login_data, headers=headers, allow_redirects=True)
How do I go about redirecting to my account upon successful POST request to the logged in state. Am I not using the correct headers or something like that? I've not done a huge amount of web stuff before, so I'm sorry if it's something really obvious.
Note: (I cannot use Twitter API to do this) & The referrer is the error page because that's where I'm logging in from - unless of course I'm wrong in doing that.
Perhaps the GET parameter redirect_after_login will use kind of javascript or html meta refresh redirection instead of HTTP redirection, so if it's the case, the requests python module will not handle it correctly.
So once you retrieve your authentication token from your first request, you could make again the second request to https://twiter.com/ without to forget your specify your security token from your HTTP request fields. You can find more information about REST API of twitter here: https://dev.twitter.com/overview/api
But the joy of python is to have libraries for everything, so I suggest you to take a look here:
https://github.com/bear/python-twitter
It's a library to communicate with the REST API of twitter.

Categories