JSON web scraping query encounters 'forbidden' error - python

I encounter an error "403 Client Error: Forbidden for url:" when running the following code.
import requests
url = "https://www.marinetraffic.com/map/gettrackjson/shipid:5630138/stdate:2022-05-18%2015:07/endate:2022-05-19%2015:07/trackorigin:livetrack"
headers = {
    "accept": "application/json",
    "accept-encoding": "gzip, deflate",
    "user-agent": "Mozilla/5.0",
    "x-requested-with": "XMLHttpRequest"
}
response = requests.get(url, headers=headers)
response.raise_for_status()
print(response.json())
On the other hand, replacing it with the following url works:
url = "https://www.marinetraffic.com/vesselDetails/latestPosition/shipid:5630138"
Would anyone know why the first url doesn't work and if there's a way to make it work? The original page is at https://www.marinetraffic.com/en/ais/home/centerx:79.3/centery:6.9/zoom:10.
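One thing that is often worth trying with endpoints like this (untested here, just a sketch) is to reuse a requests.Session that first loads the map page, so the JSON endpoint sees the site's cookies and a Referer. MarineTraffic may still refuse the request if the endpoint expects additional tokens:
import requests

session = requests.Session()
session.headers.update({
    "accept": "application/json",
    "user-agent": "Mozilla/5.0",
    "x-requested-with": "XMLHttpRequest",
})

# Visit the page that normally issues the XHR, so the session picks up its cookies.
session.get("https://www.marinetraffic.com/en/ais/home/centerx:79.3/centery:6.9/zoom:10")

url = ("https://www.marinetraffic.com/map/gettrackjson/shipid:5630138"
       "/stdate:2022-05-18%2015:07/endate:2022-05-19%2015:07/trackorigin:livetrack")
response = session.get(url, headers={"referer": "https://www.marinetraffic.com/"})
print(response.status_code)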

Related

requests.get() not completing with Tiktok user profile

So, basically, it seems that requests.get(url) never completes with TikTok user profile urls:
import requests
url = "http://tiktok.com/#malopedia"
rep = requests.get(url) #<= will never complete
As I don't get any error message, I have no idea what's going on. Why is it not completing? How do I get it to complete?
TikTok is quite strict when it comes to automated connections so you need to provide headers in your request, like this:
import requests
url = "http://tiktok.com/#malopedia"
rep = requests.get(
    url,
    headers={
        "Accept": "*/*",
        "Accept-Encoding": "identity;q=1, *;q=0",
        "Accept-Language": "en-US;en;q=0.9",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "User-Agent": "Mozilla/5.0",
    },
)
print(rep)
This should respond with 200.
However, if you plan on doing some heavy lifting with your code, consider using one of the unofficial API wrappers, like this one.

Python web-scraping: error 401 You must provide a http header

Before I start, let me point out that I have almost no clue what I'm doing; imagine a cat trying to do some coding. I'm writing some Python code using PyCharm on Ubuntu 22.04.1 LTS, and I also used Insomnia, if that makes any difference. Here is the code:
`
# sad_scrape_code_attempt.py
import time
import httpx
from playwright.sync_api import sync_playwright

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://shop.metro.bg/shop/cart",
    "CallTreeId": "||BTOC-1BF47A0C-CCDD-47BB-A9DA-592009B5FB38",
    "Content-Type": "application/json; charset=UTF-8",
    "x-timeout-ms": "5000",
    "DNT": "1",
    "Connection": "keep-alive",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin"
}

def get_cookie_playwright():
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=False, slow_mo=50)
        context = browser.new_context()
        page = context.new_page()
        page.goto('https://shop.metro.bg/shop/cart')
        page.fill('input#user_id', 'the_sad_cat_username')
        page.fill('input#password', 'the_sad_cat_password')
        page.click('button[type=submit]')
        page.click('button.btn-primary.accept-btn.field-accept-button-name')
        page.evaluate(
            """
            var intervalID = setInterval(function () {
                var scrollingElement = (document.scrollingElement || document.body);
                scrollingElement.scrollTop = scrollingElement.scrollHeight;
            }, 200);
            """
        )
        prev_height = None
        while True:
            curr_height = page.evaluate('(window.innerHeight + window.scrollY)')
            if not prev_height:
                prev_height = curr_height
                time.sleep(1)
            elif prev_height == curr_height:
                page.evaluate('clearInterval(intervalID)')
                break
            else:
                prev_height = curr_height
                time.sleep(1)
        # print(context.cookies())
        cookie_for_requests = context.cookies()[11]['value']
        browser.close()
        return cookie_for_requests

def req_with_cookie(cookie_for_requests):
    cookies = dict(
        Cookie=f'BIGipServerbetty.metrosystems.net-80={cookie_for_requests};')
    r = httpx.get('https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current', cookies=cookies)
    return r.text

if __name__ == '__main__':
    data = req_with_cookie(get_cookie_playwright())
    print(data)

# Used packages
# Playwright
# PyTest
# PyTest-Playwright
# JavaScript
# TypeScript
# httpx
`
So basically I copy-pasted the code from two tutorials made by John Watson Rooney, called:
The Biggest Mistake Beginners Make When Web Scraping
Login and Scrape Data with Playwright and Python
Then I combined them and added some JavaScript to scroll to the bottom of the page. Then I found an article called How Headers Are Used to Block Web Scrapers and How to Fix It,
so I replaced "import requests" with "import httpx" and added the HEADERS as given by Insomnia. From what I understand, browsers send headers in a certain order, and this is an often overlooked web-scraper identification method, mainly because many HTTP clients in various programming languages implement their own header ordering, which makes identifying web scrapers very easy. If this is true, I need to figure out a way to send my Cookie header in the correct position, which I have no clue how to figure out, but I believe it's #11 or #3 judging by the code generated by Insomnia (see the header-order sketch after this snippet):
`
import requests
url = "https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current"
querystring = {"customerId":"1001100022726355","cardholderNumber":"1","storeId":"00022","country":"BG","locale":"bg-BG","fsdAddressId":"1001100022726355994-AD0532EI","__t":"1668082324830"}
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://shop.metro.bg/shop/cart",
    "CallTreeId": "||BTOC-1BF47A0C-CCDD-47BB-A9DA-592009B5FB38",
    "Content-Type": "application/json; charset=UTF-8",
    "x-timeout-ms": "5000",
    "DNT": "1",
    "Connection": "keep-alive",
"Cookie": "selectedLocale_BG=bg-BG; BIGipServerbetty.metrosystems.net-80=!DHrH53oKfz3YHEsEdKzHuTxiWd+ak6uA3C+dv7oHRDuEk+ScE0MCf7DPAzLTCmE+GApsIOFM2GKufYk=; anonymousUserId=24EE2F84-55B5-4F94-861E-33C4EB770DC6; idamUserIdToken=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IktfYWE1NTAxNWEtMjA2YS0xMWVkLTk4ZDUtZTJjYzEyYjBkYzUwIn0.eyJleHAiOjE2NjgwODQxMjIsImlhdCI6MTY2ODA4MjMyMiwiYXVkIjoiQlRFWCIsInNlc3Npb25fc3RhdGUiOiJPTnJweFVhOG12WHRJeDR0c3pIZ09GR296WHUyeHZVVzVvNnc3eW1lLUdZLnJMRU1EWGFGIiwiaXNzIjoiaHR0cHM6Ly9pZGFtLm1ldHJvLmJnIiwiZW1haWwiOiJvZmZpY2VAdGVydmlvbi5iZyIsIm5vbmNlIjoiZjg3ZDMyYzEyYTRkNDY1ZGEzYjQwMTQ3OTlkYzc4NzMiLCJjX2hhc2giOiIiLCJzdWIiOiJVX2Y0MjBhY2E4LWY2OTMtNGMxNS1iOTIzLTc1NWY5NTc3ZTIwMCIsImF0X2hhc2giOiJlbkFGRFNJdUdmV0wzNnZ0UnJEQ253IiwicmVhbG0iOiJTU09fQ1VTVF9CRyIsImF1dGhfdGltZSI6MTY2ODA4MjMyMiwiYW1yIjpbIlVTRVJfQ1JFREVOVElBTFMiXX0.AC9vccz5PBe0d2uD6tHV5KdQ8_zbZvdARGUqo5s8KpJ0bGw97vm3xadF5TTHBUwkXX3oyJsbygC1tKvQInycU-zE0sqycIDtjP_hAGf6tUG-VV5xvtRsxBkacTBMy8OmbNHi5oncko7-dZ_tSOzQwSclLZKgKaqBcCqPBQVF0ug4pvbbqyZcw6D-MH6_T5prF7ppyqY11w9Ps_c7pFCciFR965gsO3Q-zr8CjKq1qGJeEpBFMKF0vfwinrc4wDpC5zd0Vgyf4ophzo6JkzA8TiWOGou5Z0khIpl435qUzxzt-WPFwPsPefhg_X9fYHma_OqQIpNjnV2tQwHqBD1qMTGXijtfOFQ; USER_TYPE=CUST; compressedJWT=eNpVUtlyozAQ/CJvcdrhEZtLGGGbQ4BeUlwGiTMhMeCvX5Fkt3YfVKrqme6eaalc7Tozc3Ihth8+Ae8SW/lVrvaYi3AD7Vx0eRwD4pzsuohuG3YsIm/EkUyxDybQjVzqgz2gqnBrj0ZpthNEtzUNqjWl3uqb4xkxA8Z/FhHY+ATHHld+cdFnYbZcZqIPpsflK9PpsBbw4LvfVFYcsh6LzdLJfGbOE+hR8B9ObOmG4FTqLgz4InCs+hhw81Q0BnQsHIQGmBLe3TR/7nzC7fHqmBh6uuIDMpMCuVwm2u2Xf2NbngbWDc9NQ85MpcYnhvcfOejtB5s1B3TMQefyueg9sgit8QlM8cnmc1P+rlF9hpq+QE2dIQUipMnTDRiPLBuvzjtvyISlwbF9KSKe5WH/8Izvnt5rE6FGuYDWsFMmjOa/+zMfLmWegYkEHC0/PO+P9qPYcuzbb5ztwvqVr1061LHzTHX8yDu33XbCnTHlQsgydcesK5iPO2JBvmbk3xpmH6RtNt00YnNQXXBpNV+0UIYU8lCD2ztKOdODQSJcNFVyg2aF60zS2GVvjvQk9lpAh8WliQS1aoVPwPJQn/fbr0vdxRiDJLh7d8pJhzVeNIW+75QK7H0zFVp9Z3BeGmZlA17s5LAcHDgjmc8vO/QiqorcSOenYVEx0/HJATQIqDJxAS7qsKnGQqrrXf5qNaf9GyRl3emruki8vxg0It5IhsxSfI8lGkvl+72qsoNMjhUp75xzR7NRq83w0Pp6oRqg74eq65zPaD/H9TX6GIyDfmFccfA8/fVtkPe7y5AUosA+fpZWBO0l9QzSZIfuoeG2n8aJNKG0WMfoap2XOcVJKT0ex9ep0m9vZv0gJwkqKue+Xb0TZ0Bjz+HMqi9W6Z81h+8PCaRZTJtoFYOun46FkQiPyFmGF65/VX33RdKl+ZYcXDvs7/Nv6PdLkg==; SES2_customerAdr_1001100022726355={%22addressId%22:%221001100022726355994-AD0532EI%22%2C%22addressHash%22:%221001100022726355994-AD0532EI%22%2C%22storeId%22:%2200022%22}; SES2_customerAdr_={%22addressId%22:null%2C%22addressHash%22:null%2C%22storeId%22:%2200022%22}; UserSettings=SelectedStore=1b1fc6ac-2ad6-4243-806e-a4a28c96dff4&SelectedAddress=1001100022726355994-ad0532ei",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin"
}
response = requests.request("GET", url, headers=headers, params=querystring)
print(response.text)
`
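Whether header order is actually the problem here is unclear (the answer below points to a missing JWT header instead), but as a minimal, untested sketch: Python dicts keep insertion order, so if you clear a requests session's default headers and add your own, they are sent in the order you wrote them (lower-level headers such as Host are still added by the HTTP stack itself):
# Untested sketch: control the order in which requests emits headers by replacing
# the session's default header set. The cookie value below is a placeholder.
import requests

session = requests.Session()
session.headers.clear()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Cookie": "BIGipServerbetty.metrosystems.net-80=...",  # placeholder value
    "Connection": "keep-alive",
})

r = session.get("https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current")
print(r.status_code)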
So I'm stuck. Any help or ideas will be greatly appreciated.
The page you're navigating to shows this on a GET request:
HTTP ERROR 401 You must provide a http header 'JWT'
This means that this page requires a level of authorization to be accessed.
See JWTs.
"Authorization: This is the most common scenario for using JWT. Once the user is logged in, each subsequent request will include the JWT, allowing the user to access routes, services, and resources that are permitted with that token."
You can access the root page just fine, but once you navigate to more user specific pages or "routes", you will need to provide a JWT to access that page's content.
There is a way to get past this using scraping. You will have to log in to the site as a user using your scraper and collect the JWT which is created by the server and sent back to your client. Then use that JWT in your request's headers:
token = "randomjwtgibberishshahdahdahwdwa"
HEADERS = {
    "Authorization": "Bearer " + token
}
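For example, here is a rough, untested sketch that combines the Playwright login from the question with this advice. The cookie name idamUserIdToken is taken from the cookie dump above and is only an assumption about where the token ends up; the site may also expect the header to be called JWT rather than Authorization.
# Hypothetical sketch: log in with Playwright, read the JWT from a cookie, then
# send it as an Authorization header with httpx. Names and selectors come from
# the question and may need adjusting.
import httpx
from playwright.sync_api import sync_playwright

def get_jwt_playwright():
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=False, slow_mo=50)
        page = browser.new_page()
        page.goto("https://shop.metro.bg/shop/cart")
        page.fill("input#user_id", "the_sad_cat_username")
        page.fill("input#password", "the_sad_cat_password")
        page.click("button[type=submit]")
        page.wait_for_load_state("networkidle")
        # Assumption: the JWT lands in the 'idamUserIdToken' cookie seen above.
        token = next((c["value"] for c in page.context.cookies()
                      if c["name"] == "idamUserIdToken"), None)
        browser.close()
        return token

token = get_jwt_playwright()
r = httpx.get(
    "https://shop.metro.bg/ordercapture.customercart.v1/carts/alias/current",
    headers={"Authorization": "Bearer " + token},
)
print(r.status_code, r.text)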

Access problem on a website with login, web scraping Python

I created a web scraping program to scrape information from the platform Plus500 (a trading platform) in order to get the real-time value of the market index. The problem is that when I run the program:
import requests
from pprint import pprint
from Config import username, password
def main():
    url = 'https://app.plus500.com/trade?innerTags=_cc_&webvisitid=d9cf772d-6ad5-492c-b782-e3fbeaf7863d&page=login' \
          '&_ga=2.35401569.1585895796.1661533386-1432537898.1661336007 '
    with requests.session() as session:
        response = session.post(url, auth=(username, password))
        pprint(response.text)

if __name__ == '__main__':
    main()
the result I get is this:
('{\n'
' "status": "Rejected",\n'
' "statusCode": "406",\n'
' "supportID": "11920948162926473185252678965843397577",\n'
' "ipAddress": "my IP",\n'
' "timeStamp": "2022-08-27 12:30:47"\n'
'}')
Process finished with exit code 0
As you can see, the post request is sent but I get back the status "Rejected" and I don't know why. I created a dummy account for you (email: myrandomcode#gmail.com, Plus500 password: MyRandomCode87). Can you help me please?
It's a problem with the headers. If you add
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive"
}
and change
response = session.post(url, auth=(username, password))
to
response = session.post(url, auth=(username, password), headers=headers)
it works.
You could check, if you want, which headers are necessary and which are not.
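One low-tech way to do that check (just a sketch, using the URL and headers from the question; the credentials are placeholders) is to drop one header at a time and see whether the status code changes:
# Sketch: find out which headers matter by removing them one by one.
import requests

url = ("https://app.plus500.com/trade?innerTags=_cc_&webvisitid=d9cf772d-6ad5-492c-b782-e3fbeaf7863d&page=login"
       "&_ga=2.35401569.1585895796.1661533386-1432537898.1661336007")
base_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
}

for name in list(base_headers):
    trimmed = {k: v for k, v in base_headers.items() if k != name}  # drop one header
    resp = requests.post(url, auth=("username", "password"), headers=trimmed)
    print(f"without {name}: {resp.status_code}")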

Requests not working in Python 3 - only in Python 2

I'm using Requests to handle my post requests and ran into a situation where if I run the same exact code in Python 3 I get an invalid response, but if I run it in Python 2 it works!
import requests
url = "https://creator.zoho.com/api/xml/write"
querystring = {"authtoken":"token"}
payload = (
    "------WebKitFormBoundary7MA4YWxkTrZu0gW\r\n"
    "Content-Disposition: form-data; name=\"XMLString\"\r\n\r\n"
    "<ZohoCreator>\n"
    "<applicationlist>\n"
    "... content ...\n"
    "</applicationlist>\n"
    "</ZohoCreator>"
    "\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW--"
)
headers = {
    'content-type': "multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW",
    'Content-Type': "application/xml",
    'cache-control': "no-cache",
    'Postman-Token': "03197e8c-2aef-4ac4-829d-f7dca06a14be",
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
}
session = requests.Session()
response = session.request("POST", url, data=payload, headers=headers, params=querystring)
print(response.text)
Python 3 Response:
{"code":2945,"message":"LESS_THAN_MIN_OCCURANCE"}
Python 2 Response:
<response><result>
... content ...
<status>Success</status></add></form></result></response>
I'm positive the request is fine as it works in Postman, and this is the code it generated. Am I missing something when it comes to Python 3?
I don't really know how it works with Python 2, but the error shows that it happens due to an invalid ticket. Refer to the links below to generate an auth token and the POST URL to insert data into Zoho Creator.
https://www.zoho.com/creator/help/api/prerequisites/generate-auth-token.html
https://www.zoho.com/creator/help/script/post-url.html#Example
I changed the structure and Content-Type. It now posts successfully with Python 3.
payload = "XMLString=<ZohoCreator> ... </ZohoCreator>"
headers = {
    'Content-Type': "application/x-www-form-urlencoded",
    'cache-control': "no-cache",
}
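Put together, a minimal sketch of that working variant might look like this (the "... content ..." placeholder and the auth token are stand-ins from the question, not real values):
import requests

url = "https://creator.zoho.com/api/xml/write"
querystring = {"authtoken": "token"}  # stand-in token from the question
payload = "XMLString=<ZohoCreator>... content ...</ZohoCreator>"
headers = {
    'Content-Type': "application/x-www-form-urlencoded",
    'cache-control': "no-cache",
}

# Send the XML as a URL-encoded form field rather than a multipart body.
response = requests.post(url, data=payload, headers=headers, params=querystring)
print(response.text)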

Error while uploading picture with the requests library

I'm trying to implement the Yandex OCR translator tool into my code. With the help of Burp Suite, I managed to find that the following request is the one that is used to send the image:
I'm trying to emulate this request with the following code:
import requests
from requests_toolbelt import MultipartEncoder
files = {
    'file': ("blob", open("image_path", 'rb'), "image/jpeg")
}
# (<filename>, <file object>, <content type>, <per-part headers>)
burp0_url = "https://translate.yandex.net:443/ocr/v1.1/recognize?srv=tr-image&sid=9b58493f.5c781bd4.7215c0a0&lang=en%2Cru"
m = MultipartEncoder(files, boundary='-----------------------------7652580604126525371226493196')
burp0_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0", "Accept": "*/*", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "Referer": "https://translate.yandex.com/", "Content-Type": "multipart/form-data; boundary=-----------------------------7652580604126525371226493196", "Origin": "https://translate.yandex.com", "DNT": "1", "Connection": "close"}
print(requests.post(burp0_url, headers=burp0_headers, files=m.to_string()).text)
though sadly it yields the following output:
{"error":"BadArgument","description":"Bad argument: file"}
Does anyone know how this could be solved?
Many thanks in advance!
You are passing the MultipartEncoder.to_string() result to the files parameter. You are now asking requests to encode the result of the multipart encoder to a multipart component. That's one time too many.
You don't need to replicate every byte here, just post the file, and perhaps set the user agent, referer, and origin:
files = {
    'file': ("blob", open("image_path", 'rb'), "image/jpeg")
}
url = "https://translate.yandex.net:443/ocr/v1.1/recognize?srv=tr-image&sid=9b58493f.5c781bd4.7215c0a0&lang=en%2Cru"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0",
    "Referer": "https://translate.yandex.com/",
    "Origin": "https://translate.yandex.com",
}
response = requests.post(url, headers=headers, files=files)
print(response.status_code)
print(response.json())
The Connection header is best left to requests, it can control when a connection should be kept alive just fine. The Accept* headers are there to tell the server what your client can handle, and requests sets those automatically too.
I get a 200 OK response with that code:
200
{'data': {'blocks': []}, 'status': 'success'}
However, if you don't set additional headers (remove the headers=headers argument), the request also works, so Yandex doesn't appear to be filtering for robots here.
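And if you do want to keep requests_toolbelt, the encoder is meant to be passed as the request body rather than to files=, with its content type copied into the headers, roughly like this:
# Alternative sketch: use MultipartEncoder as the request body (data=) and take
# the Content-Type, including the generated boundary, from the encoder itself.
import requests
from requests_toolbelt import MultipartEncoder

m = MultipartEncoder(fields={
    'file': ("blob", open("image_path", 'rb'), "image/jpeg"),
})
url = "https://translate.yandex.net:443/ocr/v1.1/recognize?srv=tr-image&sid=9b58493f.5c781bd4.7215c0a0&lang=en%2Cru"
response = requests.post(url, data=m, headers={"Content-Type": m.content_type})
print(response.status_code)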
