This might be a stupid question, but can I fetch a URL with urllib2 without declaring the URL scheme, like http or https?
To clarify: instead of writing 'http://blahblah.com' I just want to write 'blahblah.com'. Is this possible?
import urllib2

def open_url_with_default_protocol(*args, **kwargs):
    # Use the HTTP scheme by default if none is given;
    # pass through all other arguments to urllib2.urlopen
    default_scheme = 'http://'
    url = args[0]
    scheme, address = urllib2.splittype(url)
    if not scheme:
        # Replace the url in the args tuple by a URL with the default scheme
        args = (default_scheme + args[0],) + args[1:]
    return urllib2.urlopen(*args, **kwargs)
So you can do:
>>> open_url_with_default_protocol('http://google.com')
<addinfourl at 4496800872 whose fp = <socket._fileobject object at 0x10bd92b50>>
>>> open_url_with_default_protocol('google.com')
<addinfourl at 4331750464 whose fp = <socket._fileobject object at 0x1027960d0>>
Note that this function will still fail if you pass it a URL of the form '//google.com', because it assumes that if there is no scheme, there is no leading double forward slash.
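If you also want to accept scheme-relative URLs like '//google.com', a small variation (my own sketch, not part of the function above; the helper name is made up) could strip the leading slashes before prepending the default scheme:

import urllib2

def open_url_with_default_scheme(url, *args, **kwargs):
    # Sketch: same idea as above, but also tolerates '//example.com'
    scheme, address = urllib2.splittype(url)
    if not scheme:
        url = 'http://' + url.lstrip('/')
    return urllib2.urlopen(url, *args, **kwargs)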
I'll try to clarify what I mean.
Let's say I have this url:
https://test-api.com/users?age=17&sex=M
There are two fields: age and sex. The age field is required, but the sex field is optional.
Let's say I want to make a bunch of tests and I use this code:
import requests

def testUserQuery(user_age, user_sex):
    url = f'https://test-api.com/users?age={user_age}&sex={user_sex}'
    response = requests.get(url)
test_query = testUserQuery(17)
Now, assuming that I can't go into the actual code of the API itself and change how empty fields are interpreted...
How can I make this test and leave the user_sex field blank?
In other words, is there a special universal symbol (like "&" which means "and" for every URL in the world) that I can put for user_sex that'll force the API to ignore the field and not cause errors?
Otherwise, I would have to do this:
import requests

def testUserQuery(user_age, user_sex=None):
    if user_sex is None:
        url = f'https://test-api.com/users?age={user_age}'
    elif user_sex is not None:
        url = f'https://test-api.com/users?age={user_age}&sex={user_sex}'
    response = requests.get(url)
test_query = testUserQuery(17)
Imagine if I'm dealing with 10 optional fields. I don't think it would be very efficient to make multiple elif statements to change the URL for every single possible case where an optional field is empty.
I hope I'm being clear, sorry if this sounds confusing.
Here's a simple way to do this by utilising the params parameter:
import requests

URL = 'https://test-api.com/users'

def testUserQuery(**params):
    return requests.get(URL, params=params)

testUserQuery(age=21, sex='male')
testUserQuery(age=21)
In other words, all you have to do is match the parameter names with those that are understood by the API. There is no need to manipulate the URL yourself.
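As a side note, requests (to the best of my knowledge) drops parameters whose value is None when it builds the query string, so if you prefer explicit keyword arguments over **params you could also write something like this sketch:

import requests

URL = 'https://test-api.com/users'

def testUserQuery(user_age, user_sex=None):
    # 'sex' is omitted from the query string when user_sex stays None,
    # because requests skips None-valued params
    return requests.get(URL, params={'age': user_age, 'sex': user_sex})

testUserQuery(17)        # GET https://test-api.com/users?age=17
testUserQuery(17, 'M')   # GET https://test-api.com/users?age=17&sex=M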
One way to achieve this dynamically is to change testUserQuery to accept its arguments as **kwargs and then use urllib.parse.urlencode to build the query string.
from urllib.parse import urlencode

def testUserQuery(base_url='https://test-api.com/users', **kwargs):
    params = urlencode({k: v for k, v in kwargs.items() if v is not None})
    url = f"{base_url}{'?' + params if params else ''}"
    print(url)

testUserQuery()
testUserQuery(a=1)
testUserQuery(a=1, b=2)
This outputs
https://test-api.com/users
https://test-api.com/users?a=1
https://test-api.com/users?a=1&b=2
I have a requests.cookies.RequestsCookieJar object which contains multiple cookies from different domains/paths. How can I extract a cookie string for a particular domain/path, following the rules mentioned here?
For example
>>> r = requests.get("https://stackoverflow.com")
>>> print(r.cookies)
<RequestsCookieJar[<Cookie prov=4df137f9-848e-01c3-f01b-35ec61022540 for .stackoverflow.com/>]>
# the function I expect
>>> getCookies(r.cookies, "stackoverflow.com")
"prov=4df137f9-848e-01c3-f01b-35ec61022540"
>>> getCookies(r.cookies, "meta.stackoverflow.com")
"prov=4df137f9-848e-01c3-f01b-35ec61022540"
# meta.stackoverflow.com is also satisfied as it is a subdomain of .stackoverflow.com
>>> getCookies(r.cookies, "google.com")
""
# r.cookies does not contain any cookie for google.com, so it returns an empty string
I think you need to work with a Python dictionary of the cookies.
def getCookies(cookie_jar, domain):
    cookie_dict = cookie_jar.get_dict(domain=domain)
    found = ['%s=%s' % (name, value) for (name, value) in cookie_dict.items()]
    return ';'.join(found)
Your example:
>>> r = requests.get("https://stackoverflow.com")
>>> getCookies(r.cookies, ".stackoverflow.com")
"prov=4df137f9-848e-01c3-f01b-35ec61022540"
NEW ANSWER
Ok, so I still don't get exactly what it is you are trying to achieve.
If you want to extract the originating url from a requests.RequestCookieJar object (so that you could then check if there is a match with a given subdomain) that is (as far as I know) impossible.
However, you could of course do something like:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests
import re

class getCookies():
    def __init__(self, url):
        self.cookiejar = requests.get(url).cookies
        self.url = url

    def check_domain(self, domain):
        try:
            # reduce e.g. "meta.stackoverflow.com" to "stackoverflow.com"
            base_domain = re.compile(r"(?<=\.).+\..+$").search(domain).group()
        except AttributeError:
            base_domain = domain
        if base_domain in self.url:
            print("\"prov=" + str(dict(self.cookiejar)["prov"]) + "\"")
        else:
            print("No cookies for " + domain + " in this jar!")
Then if you do:
new_instance = getCookies("https://stackoverflow.com")
You could then do:
new_instance.check_domain("meta.stackoverflow.com")
Which would give the output:
"prov=5d4fda78-d042-2ee9-9a85-f507df184094"
While:
new_instance.check_domain("google.com")
Would output:
"No cookies for google.com in this jar!"
Then, if needed, you could fine-tune the regex, create a list of URLs, and loop through it to create many instances, saving them in e.g. a list or dict. In a second loop you could check another list of URLs to see whether their cookies are present in any of the instances, as sketched below.
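For example, a rough sketch of that two-loop idea, using the getCookies class defined above (the URL lists are made up for illustration):

source_urls = ["https://stackoverflow.com"]        # first loop: build instances
jars = [getCookies(url) for url in source_urls]

domains_to_check = ["meta.stackoverflow.com", "google.com"]
for domain in domains_to_check:                    # second loop: check the domains
    for jar in jars:
        jar.check_domain(domain)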
OLD ANSWER
The docs you link to explain:
items()
    Dict-like items() that returns a list of name-value tuples from the jar. Allows client-code to call dict(RequestsCookieJar) and get a vanilla python dict of key value pairs.
I think what you are looking for is:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

def getCookies(url):
    r = requests.get(url)
    print("\"prov=" + str(dict(r.cookies)["prov"]) + "\"")
Now I can run it like this:
>>> getCookies("https://stackoverflow.com")
"prov=f7712c78-b489-ee5f-5e8f-93c85ca06475"
Actually, I had the same problem as you. When I looked at the class definition
class RequestsCookieJar(cookielib.CookieJar, MutableMapping):
I found a method called def get_dict(self, domain=None, path=None).
If you have a raw cookie string instead, you can simply write code like this:
raw = "rawCookide"
print(len(cookie))
mycookie = SimpleCookie()
mycookie.load(raw)
UCookie={}
for key, morsel in mycookie.items():
UCookie[key] = morsel.value
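For completeness, a short sketch of the get_dict method mentioned above, reusing the stackoverflow example from the question:

import requests

r = requests.get("https://stackoverflow.com")
# get_dict accepts optional domain/path filters
cookies_for_domain = r.cookies.get_dict(domain=".stackoverflow.com")
cookie_string = "; ".join("%s=%s" % (k, v) for k, v in cookies_for_domain.items())
print(cookie_string)  # e.g. prov=4df137f9-848e-01c3-f01b-35ec61022540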
The following code is not promised to be "forward compatible" because I am accessing attributes of classes that were intentionally hidden (kind of) by their authors; however, if you must get into the attributes of a cookie, take a look here:
import requests

aresponse = requests.get('https://www.att.com')
requestscookiejar = aresponse.cookies

for cdomain, cooks in requestscookiejar._cookies.items():
    for cpath, cookgrp in cooks.items():
        for cname, cattribs in cookgrp.items():
            print(cattribs.version)
            print(cattribs.name)
            print(cattribs.value)
            print(cattribs.port)
            print(cattribs.port_specified)
            print(cattribs.domain)
            print(cattribs.domain_specified)
            print(cattribs.domain_initial_dot)
            print(cattribs.path)
            print(cattribs.path_specified)
            print(cattribs.secure)
            print(cattribs.expires)
            print(cattribs.discard)
            print(cattribs.comment)
            print(cattribs.comment_url)
            print(cattribs.rfc2109)
            print(cattribs._rest)
When you only need the simple attributes of the cookies, it is likely less complicated to go the following way. This avoids the use of RequestsCookieJar: here we construct a single SimpleCookie instance by reading from the headers attribute of a response object instead of the cookies attribute. The name SimpleCookie would seem to imply a single cookie, but that isn't what a simple cookie is. Try it out:
import http.cookies
import requests

def parse_cookies(http_response):
    cookie_grp = http.cookies.SimpleCookie()
    for h, v in http_response.headers.items():
        if 'set-cookie' in h.lower():
            # requests joins multiple Set-Cookie headers into one comma-separated value
            for cook in v.split(','):
                cookie_grp.load(cook)
    return cookie_grp

aresponse = requests.get('https://www.att.com')
cookies = parse_cookies(aresponse)
print(str(cookies))
You can get the list of domains in a RequestsCookieJar and then dump the cookies for each domain with the following code:
import requests

response = requests.get("https://stackoverflow.com")
cjar = response.cookies

for domain in cjar.list_domains():
    print(f'Cookies for {domain}: {cjar.get_dict(domain=domain)}')
Outputs:
Cookies for .stackoverflow.com: {'prov': 'efe8c1b7-ddbd-4ad5-9060-89ea6c29479e'}
In this example, only one domain is listed. There would be multiple lines in the output if there were cookies for multiple domains in the jar.
For many use cases, the cookie jar can be serialized by simply ignoring domains, calling:
dCookies = cjar.get_dict()
We can easily extract a cookie string for a particular domain/path using functions already available in the requests lib.
import requests
from requests.models import Request
from requests.cookies import get_cookie_header
session = requests.session()
r1 = session.get("https://www.google.com")
r2 = session.get("https://stackoverflow.com")
cookie_header1 = get_cookie_header(session.cookies, Request(method="GET", url="https://www.google.com"))
# '1P_JAR=2022-02-19-18; NID=511=Hz9Mlgl7DtS4uhTqjGOEolNwzciYlUtspJYxQ0GWOfEm9u9x-_nJ1jpawixONmVuyua59DFBvpQZkPzNAeZdnJjwiB2ky4AEFYVV'
cookie_header2 = get_cookie_header(session.cookies, Request(method="GET", url="https://stackoverflow.com"))
# 'prov=883c41a4-603b-898c-1d14-26e30e3c8774'
Request is used to prepare a PreparedRequest, which is sent to the server.
What you need is the get_dict() method:
import json
import requests

a_session = requests.Session()
a_session.get('https://google.com/')

session_cookies = a_session.cookies
cookies_dictionary = session_cookies.get_dict()

# Now just print it or convert to json
as_string = json.dumps(cookies_dictionary)
print(cookies_dictionary)
I have URLs formatted as:
google.com
www.google.com
http://google.com
http://www.google.com
I would like to convert all types of links to a uniform format, starting with http://
http://google.com
How can I prepend URLs with http:// using Python?
Python does have built-in functions to handle that correctly, for example:
p = urlparse.urlparse(my_url, 'http')
netloc = p.netloc or p.path
path = p.path if p.netloc else ''
if not netloc.startswith('www.'):
    netloc = 'www.' + netloc
p = urlparse.ParseResult('http', netloc, path, *p[3:])
print(p.geturl())
If you want to remove (or add) the www part, you have to edit the .netloc field of the resulting object before calling .geturl().
Because ParseResult is a namedtuple, you cannot edit it in-place, but have to create a new object.
PS:
For Python 3, it should be urllib.parse.urlparse; a sketch follows below.
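For instance, a Python 3 sketch of the same approach (the wrapper name normalize_url is my own, not part of the original answer):

from urllib.parse import urlparse, ParseResult

def normalize_url(my_url):
    # default to the http scheme when none is present
    p = urlparse(my_url, 'http')
    netloc = p.netloc or p.path
    path = p.path if p.netloc else ''
    if not netloc.startswith('www.'):
        netloc = 'www.' + netloc
    p = ParseResult('http', netloc, path, *p[3:])
    return p.geturl()

print(normalize_url('google.com'))         # http://www.google.com
print(normalize_url('www.google.com'))     # http://www.google.com
print(normalize_url('http://google.com'))  # http://www.google.com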
I found it easy to detect the protocol with a regex and then prepend it if missing:
import re

def formaturl(url):
    if not re.match('(?:http|ftp|https)://', url):
        return 'http://{}'.format(url)
    return url

url = 'test.com'
print(formaturl(url))  # http://test.com

url = 'https://test.com'
print(formaturl(url))  # https://test.com
I hope it helps!
For the formats that you mention in your question, you can do something as simple as:
def convert(url):
    if url.startswith('http://www.'):
        return 'http://' + url[len('http://www.'):]
    if url.startswith('www.'):
        return 'http://' + url[len('www.'):]
    if not url.startswith('http://'):
        return 'http://' + url
    return url
But please note that there are probably other formats that you are not anticipating. In addition, keep in mind that the output URL (according to your definitions) will not necessarily be a valid one (i.e., DNS may not be able to translate it into a valid IP address).
If your URLs are strings, you could just concatenate:
one = "https://"
two = "www.privateproperty.co.za"
link = "".join((one, two))
from urllib.parse import urlsplit, urlunsplit, SplitResult

def fix_url(orig_link):
    # force scheme
    split_comps = urlsplit(orig_link, scheme='https')
    # fix netloc (can happen when there is no scheme)
    if not len(split_comps.netloc):
        if len(split_comps.path):
            # override components with fixed netloc and path
            split_comps = SplitResult(scheme='https', netloc=split_comps.path, path='',
                                      query=split_comps.query, fragment=split_comps.fragment)
        else:  # no netloc, no path
            raise ValueError
    return urlunsplit(split_comps)
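For example, a quick usage check of the function above (my own sketch, assuming the imports added at the top):

print(fix_url('google.com'))         # https://google.com
print(fix_url('http://google.com'))  # http://google.com
print(fix_url('//google.com'))       # https://google.com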
I need to get the content-type of an internet (intranet) resource, not a local file. How can I get the MIME type of a resource behind a URL?
I tried this:
res = urllib.urlopen("http://www.iana.org/assignments/language-subtag-registry")
http_message = res.info()
message = http_message.getplist()
I get:
['charset=UTF-8']
How can I get the Content-Type? Can it be done using urllib, and how? If not, what is the other way?
A Python 3 solution to this:
import urllib.request

with urllib.request.urlopen('http://www.google.com') as response:
    info = response.info()
    print(info.get_content_type())      # -> text/html
    print(info.get_content_maintype())  # -> text
    print(info.get_content_subtype())   # -> html
res = urllib.urlopen("http://www.iana.org/assignments/language-subtag-registry" )
http_message = res.info()
full = http_message.type # 'text/plain'
main = http_message.maintype # 'text'
Update: since the info() function is deprecated as of Python 3.9, you can read about the preferred headers attribute here
import urllib.request

url = 'http://www.iana.org/assignments/language-subtag-registry'  # example URL from the question
r = urllib.request.urlopen(url)

header = r.headers  # an http.client.HTTPMessage (email.message.Message subclass)
contentType = header.get_content_type()  # or header.get('content-type')
contentLength = header.get('content-length')
filename = header.get_filename()
Also, a good way to quickly get the mimetype without actually loading the URL:
import mimetypes
contentType, encoding = mimetypes.guess_type(url)
The second method does not guarantee an answer but is a quick and dirty trick since it's just looking at the URL string rather than actually opening the URL.
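For instance (an illustrative sketch; the example URLs are made up or taken from the question):

import mimetypes

# guessing works only from the file extension in the URL string
print(mimetypes.guess_type('http://example.com/report.pdf'))
# ('application/pdf', None)

print(mimetypes.guess_type('http://www.iana.org/assignments/language-subtag-registry'))
# (None, None) -- no extension, so nothing to guess from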
I need to add custom parameters to a URL query string using Python.
Example:
This is the URL that the browser is fetching (GET):
/scr.cgi?q=1&ln=0
then some python commands are executed, and as a result I need to set the following URL in the browser:
/scr.cgi?q=1&ln=0&SOMESTRING=1
Is there some standard approach?
You can use urlsplit() and urlunsplit() to break apart and rebuild a URL, then use urlencode() on the parsed query string:
from urllib import urlencode
from urlparse import parse_qs, urlsplit, urlunsplit

def set_query_parameter(url, param_name, param_value):
    """Given a URL, set or replace a query parameter and return the
    modified URL.

    >>> set_query_parameter('http://example.com?foo=bar&biz=baz', 'foo', 'stuff')
    'http://example.com?foo=stuff&biz=baz'
    """
    scheme, netloc, path, query_string, fragment = urlsplit(url)
    query_params = parse_qs(query_string)
    query_params[param_name] = [param_value]
    new_query_string = urlencode(query_params, doseq=True)
    return urlunsplit((scheme, netloc, path, new_query_string, fragment))
Use it as follows:
>>> set_query_parameter("/scr.cgi?q=1&ln=0", "SOMESTRING", 1)
'/scr.cgi?q=1&ln=0&SOMESTRING=1'
Use urlsplit() to extract the query string, parse_qsl() to parse it (or parse_qs() if you don't care about argument order), add the new argument, urlencode() to turn it back into a query string, urlunsplit() to fuse it back into a single URL, then redirect the client.
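A minimal Python 3 sketch of that sequence (the helper name add_query_param is mine, and the redirect step is left out):

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def add_query_param(url, name, value):
    # break the URL apart, extend the query, then put it back together
    scheme, netloc, path, query, fragment = urlsplit(url)
    params = parse_qsl(query)          # keeps the original argument order
    params.append((name, str(value)))
    return urlunsplit((scheme, netloc, path, urlencode(params), fragment))

print(add_query_param('/scr.cgi?q=1&ln=0', 'SOMESTRING', 1))
# /scr.cgi?q=1&ln=0&SOMESTRING=1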
You can use python's url manipulation library furl.
import furl
f = furl.furl("/scr.cgi?q=1&ln=0")
f.args['SOMESTRING'] = 1
print(f.url)
import urllib

url = "/scr.cgi?q=1&ln=0"
param = urllib.urlencode({'SOMESTRING': 1})
url = url + param if url.endswith('&') else url + '&' + param
the docs