POST request (Python) - invalid request

I'm trying to use the API of a media ID registry, the EIDR, to download tv show information. I'd like to be able to query many shows automatically. I'm not experienced in how to use APIs, and the documentation for this specific one is very opaque. I'm using python 3 (requests library) in Ubuntu 16.04.
I tried sending a request for a specific tv show. I took the headers and parameters information from the browser, as in, I did the query from the browser (I looked up 'anderson cooper 360' from this page) and then looked at the information in the "network" tab of the browser's page inspector. I used the following code:
import requests

url = 'https://resolve.eidr.org/EIDR/query/'
headers = {'Host': 'ui.eidr.org',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0',
           'Accept': '*/*',
           'Accept-Language': 'en-US,en;q=0.5',
           'Accept-Encoding': 'gzip, deflate, br',
           'Referer': 'https://ui.eidr.org/search/results/19c70c63d73790b86f3fb385f2a9b3f4',
           'Cookie': 'ci_session=f4tnbi8qm7oaq30agjtn8p69j91s4li4; _ga=GA1.2.1738620664.1519337357; _gid=GA1.2.1368695940.1519337357; _gat=1',
           'Connection': 'keep-alive'}
params = {'search_page_size': 25,
          'CustomAsciiSearch[_v]': 1,
          'search_type': 'content',
          'ResourceName[_v]': 'anderson cooper 360',
          'AlternateResourceNameAddition[_v]': 1,
          'AssociatedOrgAlternateNameAddition[_v]': 1,
          'Status[_v]': 'valid'}
r = requests.post(url, data=params, headers=headers)
print(r.text)
I get this response that basically says it's an invalid request:
<?xml version="1.0" encoding="UTF-8"?><Response xmlns="http://www.eidr.org/schema" version="2.1.0"><Status><Code>3</Code><Type>invalid request</Type></Status></Response>
Now, I read in an answer to this Stack Overflow question that I should somehow use a session object. The code suggested in Padraic Cunningham's answer was this:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:46.0) Gecko/20100101 Firefox/46.0',
           'X-Requested-With': 'XMLHttpRequest',
           'referer': 'https://www.hackerearth.com/challenges/'}
with requests.Session() as s:
    s.get("https://www.hackerearth.com")
    headers["X-CSRFToken"] = s.cookies["csrftoken"]
    r = s.post("https://www.hackerearth.com/AJAX/filter-challenges/?modern=true",
               headers=headers, files={'submit': (None, 'True')})
    print(r.json())
So I understand that I should somehow use this, but I don't fully understand why or how.
So my question(s) would be:
1) What does 'invalid request' mean in this case?
2) Do you have any suggestions for how to write the request in a way that I can iterate it many times for different items I want to look up?
3) Do you know what I should do to properly use a session object here?
Thank you!

You probably need this documentation.
1) From the documentation:
invalid request: An API (URI) that does not exist including missing a required
parameter. May also include an incorrect HTTP operation on a valid
URI (such as a GET on a registration). Could also be POST multipart
data that is syntactically invalid such as missing required headers or
if the end-of-line characters are not CR-LF.
2) As far as I understand, this API accepts XML requests. See what appears after clicking 'View XML' on the results page (https://ui.eidr.org/search/results). For 'anderson cooper 360', you can use the XML data in Python like this:
import requests
import xml.etree.ElementTree as ET

url = 'https://resolve.eidr.org/EIDR/query/'
headers = {'Content-Type': 'text/xml',
           'Authorization': 'Eidr 10.5238/webui:10.5237/D4C9-7E59:9kDMO4+lpsZGUIl8doWMdw==',
           'EIDR-Version': '2.1'}
xml_query = """<Request xmlns="http://www.eidr.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Operation>
    <Query>
      <Expression><![CDATA[ASCII(((/FullMetadata/BaseObjectData/ResourceName "anderson" AND /FullMetadata/BaseObjectData/ResourceName "cooper" AND /FullMetadata/BaseObjectData/ResourceName "360") OR (/FullMetadata/BaseObjectData/AlternateResourceName "anderson" AND /FullMetadata/BaseObjectData/AlternateResourceName "cooper" AND /FullMetadata/BaseObjectData/AlternateResourceName "360")) AND /FullMetadata/BaseObjectData/Status "valid")]]></Expression>
      <PageNumber>1</PageNumber>
      <PageSize>25</PageSize>
    </Query>
  </Operation>
</Request>"""
r = requests.post(url, data=xml_query, headers=headers)
root = ET.fromstring(r.text)
ns = '{http://www.eidr.org/schema}'
for sm in root.findall('.//' + ns + 'SimpleMetadata'):
    # Element.getchildren() was removed in Python 3.9; iterate the element directly
    print({ch.tag.replace(ns, ''): ch.text for ch in sm})
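On question 2 (iterating over many titles): since the query is just an XML string, you can wrap its construction in a small helper and loop. A minimal sketch, assuming the expression format shown above stays fixed (the `build_eidr_query` helper name is mine, not part of the API):

```python
def build_eidr_query(title, page_number=1, page_size=25):
    """Build an EIDR XML query for a title, mirroring the expression above.
    Hypothetical helper; assumes titles contain no ']]>' (would break CDATA)."""
    words = title.split()

    def field_expr(field):
        # AND together one clause per word, as in the web UI's generated query
        return ' AND '.join(
            '/FullMetadata/BaseObjectData/{} "{}"'.format(field, w) for w in words)

    expression = ('ASCII((({}) OR ({})) AND '
                  '/FullMetadata/BaseObjectData/Status "valid")').format(
                      field_expr('ResourceName'),
                      field_expr('AlternateResourceName'))
    return ("""<Request xmlns="http://www.eidr.org/schema">
  <Operation>
    <Query>
      <Expression><![CDATA[{}]]></Expression>
      <PageNumber>{}</PageNumber>
      <PageSize>{}</PageSize>
    </Query>
  </Operation>
</Request>""").format(expression, page_number, page_size)

# Loop over as many titles as you like:
for title in ['anderson cooper 360', 'the daily show']:
    xml_query = build_eidr_query(title)
    # r = requests.post(url, data=xml_query, headers=headers)  # as in the answer
```

Be polite about request rate if you iterate over many titles.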
3) I don't think you need the session object.

Related

python requests get aspx issue

Please look over my code. What am I doing wrong that makes my request's response empty? Any pointers?
URL in question (it should generate a results page):
https://www.ucr.gov/enforcement/343121222
But I cannot replicate it with Python requests. Why?
import requests

headers = {'Host': 'www.ucr.gov',
           'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:105.0) Gecko/20100101 Firefox/105.0',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
           'Accept-Language': 'en-US,en;q=0.5',
           'Accept-Encoding': 'gzip, deflate, br',
           'Connection': 'keep-alive'}
data = {'scheme': 'https;',
        'host': 'www.ucr.gov',
        'filename': '/enforcement/3431212'}
url = "https://www.ucr.gov/enforcement/3431212"
result = requests.get(url, params=data, headers=headers)
print(result.status_code)
print(result.text)
The page at the link you provided is fully rendered on the client side with JavaScript, which means you won't be able to obtain the same content with a simple HTTP request.
A common solution in this case is headless scraping: automating a headless browser so that it accesses the website's content just as a regular client would. In Python, headless scraping can be implemented with several libraries, including Selenium and Pyppeteer (the Python port of Puppeteer).
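Before reaching for a headless browser, it can help to confirm that the data really is absent from the server-supplied HTML. A rough offline sketch (the helper name and the heuristic are mine, not part of the answer):

```python
def content_is_server_rendered(html, expected_text):
    """Rough heuristic: if text you saw in the browser is missing from the
    raw HTML, the page is probably rendered client-side by JavaScript."""
    return expected_text.lower() in html.lower()

# With requests you would fetch the raw HTML first, e.g.:
# html = requests.get("https://www.ucr.gov/enforcement/3431212").text

# Offline demonstration with a stub of what a JS-only page body looks like:
stub = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(content_is_server_rendered(stub, 'USDOT'))  # False: the data is not in the HTML
```

If the check comes back False for text you can clearly see in the browser, a plain `requests.get` will never return it, and a headless browser is the way to go.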

Requests Python - Cross-Origin API access denied even with headers, user-agent, and token. How do I get authorized?

Hello, I'm new to dealing with servers and requests, so bear with me.
I am trying to send GET requests to an API hosted on a different server than the site itself (soldhistory.ca) so that I can get a JSON response listing property IDs, and then manually send individual requests to the API server for the details of each property. The latter part works fine with my authorization token (and fails when I use a random string), so I know the issue isn't the token; it's the initial JSON of property IDs that I can't get.
I keep getting an 'Access Denied' 403 message even when I send the headers, authorization token, and user-agent. When I send the wrong token, I get a different message: 'Access Denied: Token not found or Expired'. When I send no token, it says 'Signature Required'. I think I am missing something trivial here. I have also noticed that the JSON response to the login contains an access token plus a second, different token called 'token', which may have something to do with the problem, but I have no idea what to do with it. All in all, how do I get authorized to send requests to the API server?
I have included a dummy account with fake credentials in the code below if anyone wants to send requests. If you visit the site, zoom out of the map entirely, and filter to show any price and only sold properties, you will see there is data on roughly 450,000 past properties sold in Canada that I would like to get. My end goal is to get this data. If anyone can help me out I would greatly appreciate it.
It is worth noting that I have also tried using Selenium to visit the homepage first and then transfer the cookies to the requests session, but that didn't work either. I also tried selenium-requests with no luck, though maybe I didn't implement it correctly.
Also, if you look at the site's XMLHttpRequests, you will see that an initial request called 'properties' is made (it uses the SEARCH method), and the subsequent GET requests are generated from its JSON response. That 'properties' JSON response is what I am trying to get.
Code:
import json
import requests
s = requests.Session()
s.get('https://www.soldhistory.ca')
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0',
'Accept': '*/*',
'Accept-Language': 'en-CA,en-US;q=0.7,en;q=0.3',
'Access-Control-Request-Method': 'SEARCH',
'Access-Control-Request-Headers': 'access-token,access_token,content-type',
'Referer': 'https://www.soldhistory.ca/mapsearchapp/search/eyJzZWFyY2giOnsic2VhcmNoVHlwZSI6InJlc2lkZW50aWFsIiwibGlzdGluZ1R5cGUiOlsiU29sZCJdLCJvcGVuSG91c2UiOnsiZnJvbSI6MCwidG8iOjB9LCJiZWQiOjAsIm1hcmtldGRheXMiOjAsImJhdGgiOjAsInNlYXJjaEJ5Ijoic2VhcmNoYWxsIiwic2VhcmNoQnlUZXh0IjoiIiwicHJpY2VSYW5nZSI6eyJtaW4iOjAsIm1heCI6MH0sImZlZXRSYW5nZSI6eyJtaW4iOjAsIm1heCI6MH0sInNob3dPbmx5IjpbXSwicHJpY2VEcm9wIjpmYWxzZSwic29ydGJ5IjoibmV3ZXN0IiwiY29uZG9UeXBlIjoiIiwiY29uZG9PY2N1cGFuY3kiOiIiLCJjb25kb1N0YXR1cyI6IiIsImNvbmRvQnVpbGRlciI6IiIsImtleXdvcmRzIjpbXSwiUG9zdGFsQ29kZSI6ZmFsc2UsIlByb3ZpbmNlIjpmYWxzZSwiQ2l0eSI6ZmFsc2UsImNpdHlOYW1lIjoiTWVuZXNldCJ9LCJsb2NhdGlvbiI6eyJMb25naXR1ZGUiOi04Ni43Njc2MTkyMDA0ODM5MiwiTGF0aXR1ZGUiOjUzLjIzNjIzOTgyNTA1NjUxLCJab29tIjoyLCJtYXBWaWV3VHlwZSI6InJvYWRtYXAiLCJtYXBJbmZvVHlwZSI6W10sInNlbGVjdGVkUGF0aElEIjoiIiwiQm91bmRzIjp7InNvdXRoIjotNC41NDgwMzU0MjY0NTgxNzQsIndlc3QiOi0xODAsIm5vcnRoIjo3OC4zNTI5NDI4MzEyNjQ2MywiZWFzdCI6MTgwfX0sImNvbnRyb2xTcGVjaWFsIjp7fX0=',
'Origin': 'https://www.soldhistory.ca',
'Connection': 'keep-alive',
}
response = s.options('https://api.mapsearch.vps-private.net/properties', headers=headers)
headers2 = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0',
'Accept': 'application/json',
'Accept-Language': 'en-CA,en-US;q=0.7,en;q=0.3',
'Content-Type': 'application/json',
'Origin': 'https://www.soldhistory.ca',
'Connection': 'keep-alive',
'Referer': 'https://www.soldhistory.ca/',
}
data2 = '{"mail":"robbydummy123@gmail.com","pass":"helloworld"}'
response2 = s.post('https://www.soldhistory.ca/mapsearchapp/visitor/login', headers=headers2, data=data2, verify=True)
parsed = response2.json()
print(json.dumps(parsed, indent=1, sort_keys=True))
accessToken = parsed['accessToken']
headers3 = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0',
'Accept': 'application/json',
'Accept-Language': 'en-CA,en-US;q=0.7,en;q=0.3',
'access_token': accessToken,
'Access-Token': accessToken,
'Content-Type': 'application/json',
'Content-Length': '317',
'Origin': 'https://www.soldhistory.ca',
'Connection': 'keep-alive',
'Referer': 'https://www.soldhistory.ca/mapsearchapp/search/eyJzZWFyY2giOnsic2VhcmNoVHlwZSI6InJlc2lkZW50aWFsIiwibGlzdGluZ1R5cGUiOlsiU29sZCJdLCJvcGVuSG91c2UiOnsiZnJvbSI6MCwidG8iOjB9LCJiZWQiOjAsIm1hcmtldGRheXMiOjAsImJhdGgiOjAsInNlYXJjaEJ5Ijoic2VhcmNoYWxsIiwic2VhcmNoQnlUZXh0IjoiIiwicHJpY2VSYW5nZSI6eyJtaW4iOjAsIm1heCI6MH0sImZlZXRSYW5nZSI6eyJtaW4iOjAsIm1heCI6MH0sInNob3dPbmx5IjpbXSwicHJpY2VEcm9wIjpmYWxzZSwic29ydGJ5IjoibmV3ZXN0IiwiY29uZG9UeXBlIjoiIiwiY29uZG9PY2N1cGFuY3kiOiIiLCJjb25kb1N0YXR1cyI6IiIsImNvbmRvQnVpbGRlciI6IiIsImtleXdvcmRzIjpbXSwiUG9zdGFsQ29kZSI6ZmFsc2UsIlByb3ZpbmNlIjpmYWxzZSwiQ2l0eSI6ZmFsc2UsImNpdHlOYW1lIjoiTWVuZXNldCJ9LCJsb2NhdGlvbiI6eyJMb25naXR1ZGUiOi04Ni43Njc2MTkyMDA0ODM5MiwiTGF0aXR1ZGUiOjUzLjIzNjIzOTgyNTA1NjUxLCJab29tIjoyLCJtYXBWaWV3VHlwZSI6InJvYWRtYXAiLCJtYXBJbmZvVHlwZSI6W10sInNlbGVjdGVkUGF0aElEIjoiIiwiQm91bmRzIjp7InNvdXRoIjotNC41NDgwMzU0MjY0NTgxNzQsIndlc3QiOi0xODAsIm5vcnRoIjo3OC4zNTI5NDI4MzEyNjQ2MywiZWFzdCI6MTgwfX0sImNvbnRyb2xTcGVjaWFsIjp7fX0=',
}
data3 = '{"query":{"coordinates":{"$geoWithin":{"$box":[[160.3305414532229,35.087235763335656],[2.6547602032228923,71.87799155489013]]}},"searchType":"residential","listingType":{"$in":["Sale","Sold","Rent"]}},"fields":["Latitude","Longitude","listingType","searchType","Price"],"sort":{"Price":1},"limit":20,"soldData":false}'
response3 = s.post('https://api.mapsearch.vps-private.net/properties', headers=headers3, data=data3)
parsed = response3.json()
print(json.dumps(parsed, indent=1, sort_keys=True))
print(response3.status_code)

Python requests.get fails with 403 forbidden, even after using headers and Session object

I'm making a GET request to fetch JSON, which works absolutely fine from any browser on any device, but not with Python requests:
url = 'https://angel.co/autocomplete/new_tags'
params = {'query': 'sci', 'tag_type': 'MarketTag'}
resp = requests.get(url,params=params)
resp.raise_for_status()
gives HTTPError: 403 Client Error: Forbidden for url: https://angel.co/autocomplete/new_tags?query=ab&tag_type=MarketTag
So I tried:
Python requests. 403 Forbidden - I tried not only the User-Agent but also all the other headers I found in the Request Headers section in Firefox for the JSON response, but still got 403!
Python requests - 403 forbidden - despite setting `User-Agent` headers - even making the request through a Session object, I still get 403!
What can be the possible cause? Is there something else I could try using?
EDIT: Request headers (from inspecting the JSON request's headers in Firefox) that I used in the headers attribute:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 'Accept-Encoding': 'gzip, deflate, br',
 'Accept-Language': 'en-US,en;q=0.5',
 'Connection': 'keep-alive',
 'Host': 'angel.co',
 'If-None-Match': 'W/"5857a9eac987138be074e7bdd4537df8"',
 'TE': 'Trailers',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0'}
If a GET request returns 403 Forbidden even after adding a user-agent to the headers, you may need to add more headers, like this:
headers = {
    'user-agent': "Mozilla/5.0 ...",
    'accept': 'text/html,application...',
    'referer': 'https://...',
}
r = requests.get(url, headers=headers)
In Chrome, the request headers can be found under Network > Headers > Request Headers in the Developer Tools (press F12 to toggle them).
I assume the website detects when a request isn't sent from a browser (i.e., made with JavaScript).
I had a similar issue recently, and this answer worked for me.

Python requests returning succesful POST request, but not redirecting page on Twitter

I've got a script that is meant to log in to Twitter. I get a 200 response if I check it, but after succeeding I'm not redirected to my logged-in Twitter account; instead it stays on the same page.
url = 'https://twitter.com/login/error?redirect_after_login=%2F'
r = requests.session()
# Need the headers below as so Twitter doesn't reject the request.
headers = {
'Host': "twitter.com",
'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
'Accept-Language': "en-US,en;q=0.5",
'Accept-Encoding': "gzip, deflate, br",
'Referer': "https://twitter.com/login/error?redirect_after_login=%2F",
'Upgrade-Insecure-Requests': "1",
'Connection': "keep-alive"
}
login_data = {"session[username_or_email]":"Username", "session[password]":"Password"}
response = r.post(url, data=login_data, headers=headers, allow_redirects=True)
How do I go about getting redirected to my account, in the logged-in state, after a successful POST request? Am I not using the correct headers or something like that? I've not done a huge amount of web work before, so I'm sorry if it's something really obvious.
Note: I cannot use the Twitter API for this. The referrer is the error page because that's where I'm logging in from - unless of course I'm wrong in doing that.
Perhaps the GET parameter redirect_after_login triggers a JavaScript or HTML meta-refresh redirection rather than an HTTP redirection; if so, the Python requests module will not handle it.
So once you retrieve your authentication token from the first request, you can make a second request to https://twitter.com/, not forgetting to include your security token in the HTTP request fields. You can find more information about Twitter's REST API here: https://dev.twitter.com/overview/api
But the joy of Python is having libraries for everything, so I suggest you take a look here:
https://github.com/bear/python-twitter
It's a library for communicating with Twitter's REST API.
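If you do want to stay with raw requests, the token-retrieval step can be sketched like this. Note the `authenticity_token` field name is an assumption based on how Rails-style login forms usually work, not something confirmed by the answer:

```python
import re

def extract_token(html, field='authenticity_token'):
    """Pull a hidden form token out of a login page's HTML.
    Hypothetical helper; the field name may differ on the real site."""
    m = re.search(r'name="%s"\s+value="([^"]+)"' % re.escape(field), html)
    return m.group(1) if m else None

# Offline demonstration with a stub login form:
stub = '<form><input type="hidden" name="authenticity_token" value="abc123"></form>'
print(extract_token(stub))  # abc123

# With a real session you would then include the token in the POST data:
# token = extract_token(r.get(url).text)
# login_data['authenticity_token'] = token
# response = r.post(url, data=login_data, headers=headers)
```

A regex is fine for a quick sketch; for anything more robust, parse the form with an HTML parser instead.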

Python httplib2: HTTPS login fails

I am trying to use httplib2 to log in to a web page. I am able to log in to the page by simply opening the following URL in a Chrome incognito window:
https://domain.com/auth?name=USERNAME&pw=PASSWORD
I tried the following code to emulate this login with httplib2:
from httplib2 import Http
h = Http(disable_ssl_certificate_validation=True)
resp, content = h.request('https://domain.com/auth?name=USERNAME&pw=PASSWORD')
Unfortunately, this request does not lead to a successful login.
I tried changing the request headers to match those provided by Chrome:
headers = {
'Host': 'domain.com',
'Connection': 'keep-alive',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'en-US,en;q=0.8'
}
resp, content = h.request('https://domain.com/auth?name=USERNAME&pw=PASSWORD', 'GET', headers=headers)
This changes the response slightly, but still does not lead to a successful login.
I tried inspecting the actual network traffic with Wireshark but since it's HTTPS and thus encrypted, I can't see the actual traffic.
Does anybody know what the difference in requests between Chrome and httplib2 could be? Maybe httplib2 changes some of my headers?
Following Games Brainiac's comment, I ended up simply using Python Requests instead of httplib2. The following requests code works out of the box:
import requests
session = requests.Session()
response = session.get('https://domain.com/auth?name=USERNAME&pw=PASSWORD')
Further requests with the same username/password can simply be performed on the Session object:
...
next_response = session.get('https://domain.com/someOtherPage')
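Rather than hand-building the query string, you can also let requests encode the parameters for you. A small sketch using the question's placeholder values (domain.com, USERNAME, PASSWORD) to show that the prepared URL comes out the same:

```python
import requests

# Let requests encode the credentials instead of embedding them in the URL.
req = requests.Request('GET', 'https://domain.com/auth',
                       params={'name': 'USERNAME', 'pw': 'PASSWORD'})
prepared = req.prepare()
print(prepared.url)  # https://domain.com/auth?name=USERNAME&pw=PASSWORD

# In a Session, the equivalent call is:
# response = session.get('https://domain.com/auth',
#                        params={'name': 'USERNAME', 'pw': 'PASSWORD'})
```

This also takes care of percent-encoding if a password contains special characters.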
