Blob URLs with python requests

I'm trying to figure out why the request I send gives me this error:
requests.exceptions.InvalidSchema: No connection adapters were found for 'blob:https://192.168.56.108/7020557a-95f0-4560-a3d4-94e23bc3db4a'
In another thread, I read that this error comes from a missing https, but my URL does have it. Here is the code I wrote to send the request:
import requests

s = requests.Session()
url_image = 'blob:https://192.168.56.108/7020557a-95f0-4560-a3d4-94e23bc3db4a'
headers = {'Origin': 'https://192.168.56.108',
           'Referer': '',
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27'}
response = s.get(url_image, headers=headers, stream=True, verify=False)
print(response.url)
I also read in another thread that blob URLs are generated by the browser once the page is loaded. So I tried doing a GET request to the page where I would usually download first, and then sending the POST request, but it still doesn't work. I thought it could be because the blob URL was no longer the one associated with the page I loaded (a new one would have been generated).
For a bit more context, I load a page on which there is a graphic that I can download. To check what happens, I use the network console. Each time I click and download that graphic, a GET request is made with a blob URL that changes with every download.
So my question is: how do I get the correct URL with python requests, and why do I get the first error when sending the request to the blob URL?
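For the error itself: a requests Session only mounts connection adapters for the http:// and https:// prefixes, and 'blob:https://...' matches neither, hence InvalidSchema. A blob URL is an in-browser object reference created by the page's JavaScript, so it cannot be fetched from outside the browser at all; the graphic has to come from whatever ordinary HTTP request the page uses to build that blob (visible as a separate entry in the network tab). A quick way to confirm the adapter part from the library itself:
import requests

s = requests.Session()
print(s.adapters)  # only the 'https://' and 'http://' prefixes are mounted by default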

Related

Getting a URL with an authenticity token using python

I am trying to read a web page using a GET request in Python.
The original URL is given here. I found out that the information I am interested in is in a subpage with this URL (I replaced the authenticity token with XXX).
I tried using the second URL in my script but I get a 406 error. Can you suggest what I am doing wrong? Is the authenticity token there to prevent scraping? If so, can I work around it?
import urllib.request

url = ...
agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = urllib.request.Request(url, headers=agent)
data = urllib.request.urlopen(req)
Thanks!
PS, This is how I get the URL using Chrome:
First I browse to https://www.goodreads.com/book/show/385228.On_Liberty
Then I open Chrome's developer tools: three dots -> more tools -> developer tools. Choose the network tab.
Then I go to the bottom of the page (just after the last review) and click "next".
In the tools window I choose the request, and in its headers I see the Request URL: https://www.goodreads.com/book/reviews/385228?csm_scope=&hide_last_page=true&language_code=en&page=2&authenticity_token=XXX
Can you try to update your headers to include one more item, like:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3',
    'X-Requested-With': 'XMLHttpRequest',
}
req = urllib.request.Request(url, headers=headers)
I managed to get 200 OK back when adding that header. However, the response you get from this endpoint might not really be what you need in the end, since it is a piece of JavaScript code which in turn updates the HTML page. You can still use it in some way, but it's a very dirty approach and might complicate things a lot.
What information do you need exactly? There might be a different approach than using that "problematic" response from your second URL.
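For reference, a minimal sketch of the full request with that extra header (the authenticity_token value stays elided as XXX, exactly as in the question, so the body shown is whatever the endpoint returns for it):
import urllib.request

url = ('https://www.goodreads.com/book/reviews/385228'
       '?csm_scope=&hide_last_page=true&language_code=en&page=2&authenticity_token=XXX')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3',
    'X-Requested-With': 'XMLHttpRequest',
}
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as resp:
    print(resp.status)                  # 200 once the header and a valid token are in place
    body = resp.read().decode('utf-8')  # JavaScript that rewrites the review list, not plain HTML
print(body[:200])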

Python urllib or requests library gets stuck opening certain URLs?

I am trying to send an HTTP GET request to a certain website, for example https://www.united.com, but it gets stuck with no response.
Here is the code:
from urllib.request import urlopen

url = 'https://www.united.com'
resp = urlopen(url, timeout=10)
Every time, it times out. But the same code works for other URLs, for example https://www.aa.com.
So I wonder what is behind https://www.united.com that keeps me from getting the HTTP request through. Thank you!
Update:
Adding a request header still doesn't work for this site:
from urllib.request import Request, urlopen

url = 'https://www.united.com'
req = Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
    }
)
resp = urlopen(req, timeout=3)
The united.com server is probably responding only to certain user-agent strings or request headers and blocking everything else. You have to send headers and a user-agent string that their server allows. This varies from website to website: sites that want to add extra security to their applications can be very specific about which user-agents are allowed to access a resource.
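As a sketch of that advice, the snippet below sends a fuller set of browser-like headers with requests (these are ordinary browser defaults, not values confirmed to unblock united.com; a site with stricter bot detection may still hold or drop the connection):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
}
try:
    resp = requests.get('https://www.united.com', headers=headers, timeout=10)
    print(resp.status_code, len(resp.text))
except requests.exceptions.Timeout:
    print('Still timing out; the block is deeper than plain request headers.')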

403 Forbidden Error when scraping a site, user-agents already used and updated. Any ideas?

As the title above states, I am getting a 403 error. The URLs generated are valid; I can print them and then open them in my browser just fine.
I've got a user agent; it's the exact same one my browser sends when accessing the page I want to scrape, pulled straight from Chrome DevTools. I've tried using sessions instead of a plain request, I've tried using urllib, and I've tried using a generic requests.get.
Here's the code I'm using that gets the 403. Same result with requests.get etc.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}
session = requests.Session()
req = session.get(URL, headers=headers)
So yeah, I assume I'm not setting the user agent right, and that's how the site can tell I'm scraping. But I'm not sure what I'm missing, or how to find that out.
I copied all the headers from DevTools and started removing them one by one. It turns out the site needs only Accept-Language; it needs neither User-Agent nor a Session. (A sketch of that elimination approach follows the result below.)
import requests

url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'
headers = {
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
}
r = requests.get(url, headers=headers)
data = r.json()
print(data['docs'][0]['name'])
Result:
The Elder Scrolls V: Skyrim Special Edition Steam Key GLOBAL
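The elimination approach mentioned above can be automated along these lines (a rough sketch: full_headers stands for whatever you copy out of DevTools, and it only removes headers one at a time, so combinations that only matter together are not caught):
import requests

def find_required_headers(url, full_headers):
    """Drop headers one at a time and report which ones the server insists on."""
    required = []
    for name in full_headers:
        trimmed = {k: v for k, v in full_headers.items() if k != name}
        if requests.get(url, headers=trimmed).status_code != 200:
            required.append(name)  # removing this header broke the request
    return required

print(find_required_headers(url, {'User-Agent': '...', 'Accept-Language': 'en-US;q=0.7,en;q=0.3'}))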

Getting 404 despite supplying required data using POST when trying to log in to a website

I need to log in to [a website] and use the session to scrape some data. However, when using POST, I always get status 404.
Here is what I have already tried:
import requests

PW = "password"
UN = "username"
payload = {"Login": UN, "Password": PW, "submit": "Kirjaudu+sisään"}
url = "[a website]"
s = requests.session()
data = s.post(url, data=payload)
print(data)
The output is:
<Response [404]>
I have also tried supplying a Firefox user agent for the site:
s.post(url,data=payload,headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"})
It did not make a difference.
Firstly, the POST request should go to https://wilma-lukiot.gradia.fi/login
Secondly, there is a fourth field in the form, a SESSIONID; you need to send that too.
Probably the best way to get it is to first load https://wilma-lukiot.gradia.fi, parse it to grab the SESSIONID, and only then send a POST (in the same session) to the login endpoint.
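A minimal sketch of that flow, assuming the SESSIONID is exposed as a hidden <input name="SESSIONID"> on the front page (the exact markup is not shown in the thread, so the selector below is a guess):
import requests
from bs4 import BeautifulSoup

s = requests.Session()
# 1. Load the front page so the session picks up its cookies and the hidden field.
front = s.get('https://wilma-lukiot.gradia.fi')
soup = BeautifulSoup(front.text, 'html.parser')
token = soup.find('input', {'name': 'SESSIONID'})['value']  # assumed hidden input
# 2. Send all four form fields to the login endpoint, in the same session.
payload = {
    'Login': 'username',
    'Password': 'password',
    'SESSIONID': token,
    'submit': 'Kirjaudu+sisään',
}
resp = s.post('https://wilma-lukiot.gradia.fi/login', data=payload)
print(resp.status_code)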

How to make an XML HTTP POST request from the information provided by the Google Chrome inspector

This question is not a duplicate; adding a user-agent to the headers doesn't fix anything.
I've been trying to get the response from this URL. It's an XML feed, not an HTML file; it's a live feed updated every second from cashpoint.com's live soccer page. I can get the HTML page from the last-mentioned page just fine, but from the first-mentioned URL I can't retrieve the XML data. I can inspect it with the Google Chrome inspector and see the response just fine, but my request returns b''. I've tried both GET and POST.
EDIT: I tried adding some more headers but it still doesn't work.
Shouldn't it be possible to retrieve this information if the inspector can see it?
Below is my code.
import requests

class GetFeed():
    def __init__(self):
        pass

    def live_odds(self):
        live_index_page = 'https://www.cashpoint.dk/en/live/index.html'
        live_oddsupdate = 'https://www.cashpoint.dk/index.php?r=games/oddsupdate'
        r = requests.get(live_oddsupdate)
        print(r.text)

feed = GetFeed()
feed.live_odds()
Well, for one, in the Chrome console you can see that it's a POST request, and it appears you're performing a GET request in your Python code.
You need to send some data and some headers with a POST request. Try this:
import requests

url = 'https://www.cashpoint.dk/index.php?r=games/oddsupdate'
headers = {
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Cookie": "_ga=GA1.2.517291307.1531264976; _gid=GA1.2.1421702183.1531264976; _pk_id.155.9810=7984b0a0e139caba.1531264979.1.1531264979.1531264979.; cookieConsent=1; cpLanguage=en; langid=2; ad_network=DIRECT; PHPSESSID=f4mbbfd8adb3m59gfc1pelo126"
}
data = "parameters%5Baction%5D=odds_update&parameters%5Bgame_category%5D=live&parameters%5Bsport_id%5D=&parameters%5Btimestamp%5D=1531268162&parameters%5Bgameids%5D=%5B905814%2C905813%2C905815%2C905818%2C905792%5D&formToken=c3fed3ea6b46dae171a6f1a6d21db14fcc21474c"
response = requests.post(url, data=data, headers=headers)
print(response.content)
Just tested this and it works. The point here is that ALL of this information can be found in the exact same XHR network inspection in Google Chrome. Next time, please read up on XMLHttpRequests before you post a question.
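For readability, the URL-encoded data string above decodes to a handful of form fields, so requests can be left to do the encoding (same payload as the answer; the timestamp, game IDs, formToken and the Cookie header are the stale values captured in that inspector session and would need to be refreshed from a current XHR):
import requests

url = 'https://www.cashpoint.dk/index.php?r=games/oddsupdate'
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    # The Cookie header from the answer above may also be required; it is omitted here.
}
form = {
    'parameters[action]': 'odds_update',
    'parameters[game_category]': 'live',
    'parameters[sport_id]': '',
    'parameters[timestamp]': '1531268162',
    'parameters[gameids]': '[905814,905813,905815,905818,905792]',
    'formToken': 'c3fed3ea6b46dae171a6f1a6d21db14fcc21474c',
}
# Passing a dict as data= lets requests set Content-Type and do the URL-encoding.
response = requests.post(url, data=form, headers=headers)
print(response.content)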
