I have a URL of the following form: https://website.com/chill & relax/folder/file.txt?a=1&b=2 (the link is a dummy example; it's not meant to work).
When I paste this URL in Firefox, I can fetch the wanted file.txt, but when I try to retrieve the file using python and requests, it doesn't work:
>>> import requests
>>> url = "https://website.com/chill & relax/folder/file.txt?a=1&b=2"
>>> requests.get(url)
Traceback (most recent call last):
[...]
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x00982B70>:
Failed to establish a new connection: [WinError 10060] [...]
How come Firefox can request the file but requests can't? How can I replace the spaces and "&" signs only in the path part of my URL?
EDIT: I now believe that requests can actually perform requests with URLs that contain spaces. I think this issue is linked to my proxy: Firefox works with my proxy, but requests calls executed within PyCharm are blocked by it.
For path encoding you can use this:
from requests.utils import requote_uri
url = requote_uri("https://website.com/chill & relax/folder/file.txt?a=1&b=2")
However, the link doesn't work for me anyway, even in a browser. Is the link actually valid?
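For reference, a quick check of what requote_uri produces for the URL above (assuming requests is installed): spaces become %20, but reserved characters such as '&' are left alone, so you may still need to encode the path separately.

```python
from requests.utils import requote_uri

# requote_uri percent-encodes unsafe characters (like spaces) but leaves
# reserved characters such as '&' untouched.
url = requote_uri("https://website.com/chill & relax/folder/file.txt?a=1&b=2")
print(url)
```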
Simply URL-encode your path, then append the parameters.
import urllib.parse
import requests

# Keep ':' and '/' unescaped so the scheme and path separators survive.
url = urllib.parse.quote("https://website.com/chill & relax/folder/file.txt", safe=":/")
url += "?a=1&b=2"
r = requests.get(url)
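Alternatively, a sketch that splits the URL first and quotes only the path component, so the '&' inside the path is escaped while the query string is preserved as-is:

```python
from urllib.parse import urlsplit, urlunsplit, quote

raw = "https://website.com/chill & relax/folder/file.txt?a=1&b=2"
parts = urlsplit(raw)          # splits path and query at the first '?'
safe_path = quote(parts.path)  # ' ' and '&' in the path become %20 and %26
url = urlunsplit((parts.scheme, parts.netloc, safe_path, parts.query, parts.fragment))
print(url)
```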
I use chrome Momentum extension to customize my browser new tab and would like to write a python script to get its daily dashboard wallpaper
For now, I know I can reach the desired page through the URL
chrome-extension://laookkfknpbbblfpciffpaejjkokdgca/dashboard.html
However, when I try to call urllib.request.urlopen with this url the following error is raised:
urllib.error.URLError: <urlopen error unknown url type: chrome-extension>
Is it possible to register custom protocols to be opened by urllib?
Or would there be another way to get the page's HTML?
If the file exists locally and not on the web, then using urllib won't do you much good, since the path is not a URL.
Use the webbrowser module instead and provide a path to your file:
import webbrowser

def auto_open():
    """
    Take the absolute path to the HTML file and open it directly in the browser.
    """
    html_page = 'path/to/your/file'
    new = 2  # open in a new tab
    webbrowser.open(html_page, new=new)
Background:
Typically, if I want to see what type of requests a website is getting, I would open up chrome developer tools (F12), go to the Network tab and filter the requests I want to see.
Example:
Once I have the request URL, I can simply parse the URL for the query string parameters I want.
This is a very manual task and I thought I could write a script that does this for any URL I provide. I thought Python would be great for this.
Task:
I have found a library called requests that I use to validate the URL before opening it.
import requests
from urllib.request import urlopen

testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urlopen(validatedRequest)
However, I am unsure of how to get the requests that the URL I enter receives. Is this possible in python? A point in the right direction would be great. Once I know how to access these request headers, I can easily parse through.
Thank you.
You can use the urlparse method to fetch the query params
Demo:
import requests
import urllib
from urlparse import urlparse
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urllib.urlopen(validatedRequest)
print urlparse(page.url).query
Result:
gfe_rd=cr&dcr=0&ei=ISdiWuOLJ86dX8j3vPgI
Tested in python2.7
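For Python 3, an equivalent sketch with urllib.parse, using a made-up URL instead of a live request:

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.google.com/?gfe_rd=cr&dcr=0"
query = urlparse(url).query
print(query)            # gfe_rd=cr&dcr=0
print(parse_qs(query))  # {'gfe_rd': ['cr'], 'dcr': ['0']}
```

parse_qs turns the raw query string into a dict mapping each parameter name to a list of values.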
I was trying to install urllib for Python 3.6.1 using pip, but I am unable to fix the error output.
The error appears to be like this:
I first searched online and found that one possible reason is that Python 3 is unable to identify 0 and I need to change the last digit to something else, so I tried to open the setup.py file in the folder.
I tried to access the hidden folders on my Mac following the path listed in the error, but I was unable to find any pip-build-zur37k_r folder, even after making all hidden files visible.
I want to extract information using urllib.request library and BeautifulSoup, and when I run the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())
print(bsObj.h1)
The error appears to be like:
The code should return the following:
<h1> An Interesting Title </h1>
Your error says certificate verification failed. So it is a problem with the website, not your code. The call to urlopen() works for me, but maybe you have a proxy server that is fussier about certificates.
The URL you are hitting does not have a valid SSL certificate, so when you request such a site you'll need to skip the SSL check, as below:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
html = urlopen("https://www.pythonscraping.com/pages/page1.html", context=ctx)
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
So you'll get the end result as expected.
I'm trying to open a local file using urllib2. How can I go about doing this? When I try the following line with urllib:
resp = urllib.urlopen(url)
it works correctly, but when I switch it to:
resp = urllib2.urlopen(url)
I get:
ValueError: unknown url type: /path/to/file
where that file definitely does exist.
Thanks!
Just put "file://" in front of the path
>>> import urllib2
>>> urllib2.urlopen("file:///etc/debian_version").read()
'wheezy/sid\n'
With urllib.urlopen, if the URL parameter does not have a scheme identifier, it opens a local file; urllib2 does not behave like this.
So the urllib2 method can't process the path.
It's always good to include the 'file://' scheme identifier in the URL parameter for both methods.
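To illustrate, a small self-contained sketch in Python 3 (where urllib.request behaves like urllib2 here), building the file:// URL portably with pathlib; the temp file and its contents are made up for the example:

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Write a scratch file, then read it back through a file:// URL.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello")
    tmp = f.name

url = Path(tmp).as_uri()  # e.g. file:///tmp/tmpabc123.txt
data = urlopen(url).read().decode()
print(data)               # hello
os.remove(tmp)
```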
I had the same issue, and I just realized that if you download the source of the page and then open it in Chrome, the browser will show you the exact local path in the URL bar. Good luck!
Basically, I am trying to download a URL using urllib2 in python.
The code is the following:
import urllib2
req = urllib2.Request('www.tattoo-cover.co.uk')
req.add_header('User-agent','Mozilla/5.0')
result = urllib2.urlopen(req)
It raises a ValueError and the program crashes for the URL in the example.
When I access the url in a browser, it works fine.
Any ideas how to handle the problem?
UPDATE:
Thanks to Ben James and sth, the problem is identified: add 'http://'.
Now the question is refined:
Is it possible to handle such cases automatically with some built-in function, or do I have to do error handling with subsequent string concatenation?
When you enter a URL in a browser without the protocol, it defaults to HTTP. urllib2 won't make that assumption for you; you need to prefix it with http://.
You have to use a complete URL including the protocol, not just specify a host name.
The correct URL would be http://www.tattoo-cover.co.uk/.
You can use the urlparse method from urllib.parse (Python 3) to check for the presence of an addressing scheme (http, https, ftp) and prepend a scheme in case it is not present:
In [1]: from urllib.parse import urlparse
   ...: url = 'www.myurl.com'
   ...: if not urlparse(url).scheme:
   ...:     url = 'http://' + url
   ...: url
Out[1]: 'http://www.myurl.com'
You can use the urlparse function for that, I think; see the Python documentation.