I have a list of URLs in a format such as ["www.bol.com", "www.dopper.com"].
To use them as start URLs in Scrapy, I need to know the correct HTTP protocol.
For example:
["https://www.bol.com/nl/nl/", "https://dopper.com/nl"]
As you can see, the protocol may be https or http, and the domain may appear with or without www.
I'm not sure if there are any other variations.
Is there any Python tool that can determine the right protocol?
If not, and I have to build the logic myself, what cases should I take into account?
For option 2, this is what I have so far:
import requests

def identify_protocol(url):
    try:
        r = requests.get("https://" + url + "/", timeout=10)
        return r.url, r.status_code
    except requests.RequestException:
        pass
    try:
        r = requests.get("http://" + url + "/", timeout=10)
        return r.url, r.status_code
    except requests.RequestException:
        pass
    try:
        r = requests.get("https://" + url.replace("www.", "") + "/", timeout=10)
        return r.url, r.status_code
    except requests.RequestException:
        return None, None
Is there any other possibility I should take into account?
There is no way to determine the protocol or the full domain from the fragment directly; the information simply isn't there. In order to find it you would need either:
1. a database of the correct protocols/domains, which you can look your domain fragment up in, or
2. to make the request and see what the server tells you.
If you do (2), you can of course gradually build your own database to avoid needing the request in the future.
On many https servers, if you attempt an http connection you will be redirected to https. If you are not, then you can reliably use http. If the http request fails, you can try again with https and see if that works.
The same applies to the domain: if the site usually redirects, you can perform the request using the original domain and see where you are redirected.
An example using requests:
>>> import requests
>>> r = requests.get('http://bol.com')
>>> r
<Response [200]>
>>> r.url
'https://www.bol.com/nl/nl/'
As you can see the request object url parameter has the final destination URL, plus protocol.
As I understand the question, you need to retrieve the final URL after all possible redirections. This can be done with the built-in urllib.request. If the provided URL has no scheme, you can use http as the default. To parse the input URL I used a combination of urlsplit() and urlunsplit().
Code:
import urllib.request as request
import urllib.parse as parse

def find_redirect_location(url, proxy=None):
    parsed_url = parse.urlsplit(url.strip())
    url = parse.urlunsplit((
        parsed_url.scheme or "http",
        parsed_url.netloc or parsed_url.path,
        parsed_url.path.rstrip("/") + "/" if parsed_url.netloc else "/",
        parsed_url.query,
        parsed_url.fragment
    ))

    if proxy:
        handler = request.ProxyHandler(dict.fromkeys(("http", "https"), proxy))
        opener = request.build_opener(handler, request.ProxyBasicAuthHandler())
    else:
        opener = request.build_opener()

    with opener.open(url) as response:
        return response.url
Then you can just call this function on every URL in the list:
urls = ["bol.com ","www.dopper.com", "https://google.com"]
final_urls = list(map(find_redirect_location, urls))
You can also use proxies:
from itertools import cycle
urls = ["bol.com ","www.dopper.com", "https://google.com"]
proxies = ["http://localhost:8888"]
final_urls = list(map(find_redirect_location, urls, cycle(proxies)))
To make it a bit faster you can run the checks in parallel threads using ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor
urls = ["bol.com ","www.dopper.com", "https://google.com"]
final_urls = list(ThreadPoolExecutor().map(find_redirect_location, urls))
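Note that opener.open() raises on network failures and HTTP error codes, so a tolerant wrapper may be useful when mapping over many URLs (a minimal sketch; the safe_find name is just illustrative):

import urllib.error

def safe_find(url, proxy=None):
    # Return None instead of raising when a URL cannot be resolved
    try:
        return find_redirect_location(url, proxy)
    except (urllib.error.URLError, ValueError):
        return None

final_urls = list(map(safe_find, urls))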
I'm a little confused about the requests module, especially proxies.
From the documentation:
PROXIES
Dictionary mapping protocol to the URL of the proxy (e.g. {'http': 'foo.bar:3128'}) to be used on each Request.
Can there be more than one proxy of the same type in the dictionary? I mean, is it possible to put a list of proxies there so that the requests module tries them and uses only those which are working?
Or can there be only one proxy address, for example, for http?
Using the proxies parameter is limited by the very nature of a Python dictionary (i.e. each key must be unique).
import requests

url = 'http://google.com'

proxies = {'https': '84.22.41.1:3128',
           'http': '185.26.183.14:80',
           'http': '178.33.230.114:3128'}

if __name__ == '__main__':
    print url
    print proxies
    response = requests.get(url, proxies=proxies)
    if response.status_code == 200:
        print response.text
    else:
        print 'Response ERROR', response.status_code
outputs
http://google.com
{'http': '178.33.230.114:3128', 'https': '84.22.41.1:3128'}
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
...more html...
As you can see, the value of the http protocol key in the proxies dictionary corresponds to the last one encountered in the assignment (i.e. 178.33.230.114:3128). Try swapping the http entries around.
So, the answer is no, you cannot specify multiple proxies for the same protocol using a simple dictionary.
I have tried using an iterable as a value, which would make sense to me
proxies = {'https': '84.22.41.1:3128',
           'http': ('178.33.230.114:3128', '185.26.183.14:80', )}
but with no luck, it produces an error
Well, actually you can; I've done this with a few lines of code and it works pretty well.
import requests

class Client:

    def __init__(self):
        self._session = requests.Session()
        self.proxies = None

    def set_proxy_pool(self, proxies, auth=None, https=True):
        """Randomly choose a proxy for every GET/POST request
        :param proxies: list of proxies, like ["ip1:port1", "ip2:port2"]
        :param auth: if proxy needs auth
        :param https: default is True, pass False if you don't need https proxy
        """
        from random import choice

        if https:
            self.proxies = [{'http': p, 'https': p} for p in proxies]
        else:
            self.proxies = [{'http': p} for p in proxies]

        def get_with_random_proxy(url, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_get(url, **kwargs)

        def post_with_random_proxy(url, *args, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_post(url, *args, **kwargs)

        self._session.original_get = self._session.get
        self._session.get = get_with_random_proxy
        self._session.original_post = self._session.post
        self._session.post = post_with_random_proxy

    def remove_proxy_pool(self):
        self.proxies = None
        self._session.get = self._session.original_get
        self._session.post = self._session.original_post
        del self._session.original_get
        del self._session.original_post

    # You can define whatever operations using self._session
I use it like this:
client = Client()
client.set_proxy_pool(['112.25.41.136', '180.97.29.57'])
It's simple, but actually works for me.
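After that, every GET or POST issued through the wrapped session goes out via a randomly chosen proxy from the pool, for example (the URL here is just a hypothetical test endpoint):

response = client._session.get('http://httpbin.org/ip')
print(response.text)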
I have a simple website crawler; it works fine, but sometimes it gets stuck because of large content such as ISO images, .exe files, and other large files. Guessing the content type from the file extension is probably not the best idea.
Is it possible to get the content type and content length/size without fetching the whole content/page?
Here is my code:
requests.adapters.DEFAULT_RETRIES = 2

url = url.decode('utf8', 'ignore')
urlData = urlparse.urlparse(url)
urlDomain = urlData.netloc
session = requests.Session()
customHeaders = {}

if maxRedirects == None:
    session.max_redirects = self.maxRedirects
else:
    session.max_redirects = maxRedirects

self.currentUserAgent = self.userAgents[random.randrange(len(self.userAgents))]
customHeaders['User-agent'] = self.currentUserAgent

try:
    response = session.get(url, timeout=self.pageOpenTimeout, headers=customHeaders)
    currentUrl = response.url
    currentUrlData = urlparse.urlparse(currentUrl)
    currentUrlDomain = currentUrlData.netloc
    domainWWW = 'www.' + str(urlDomain)
    headers = response.headers
    contentType = str(headers['content-type'])
except:
    logging.basicConfig(level=logging.DEBUG, filename=self.exceptionsFile)
    logging.exception("Get page exception:")
    response = None
Yes.
You can use the Session.head method to create HEAD requests:
response = session.head(url, timeout=self.pageOpenTimeout, headers=customHeaders)
contentType = response.headers['content-type']
A HEAD request is similar to a GET request, except that the message body is not sent.
Here is a quote from Wikipedia:
HEAD
Asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.
Use requests.head() for this. It will not return the message body. You should use the head method if you are interested only in the headers. Check this link for details.
h = requests.head(some_link)
header = h.headers
content_type = header.get('content-type')
Sorry, my mistake, I should have read the documentation better. Here is the answer:
http://docs.python-requests.org/en/latest/user/advanced/#advanced (Body Content Workflow)
tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, stream=True)
if int(r.headers['content-length']) > TOO_LONG:
    r.connection.close()
    # log request too long
Because requests.head() does NOT follow redirects automatically, if a URL is redirected, requests.head() will get 0 for Content-Length. So make sure allow_redirects=True is added.
r = requests.head(url, allow_redirects=True)
length = r.headers['Content-Length']
Refer to Requests Redirection And History
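Putting those pieces together for the crawler use case, here is a minimal sketch that decides whether to fetch the body at all (the TOO_LONG threshold, the fetch_if_small name, and the skipped content type are illustrative assumptions, not from the original code):

import requests

TOO_LONG = 50 * 1024 * 1024  # hypothetical 50 MB limit

def fetch_if_small(url, timeout=10):
    # HEAD first (following redirects), then GET only if the reported size looks reasonable
    head = requests.head(url, allow_redirects=True, timeout=timeout)
    length = int(head.headers.get('Content-Length', 0))
    content_type = head.headers.get('Content-Type', '')
    if length > TOO_LONG or content_type.startswith('application/octet-stream'):
        return None  # skip ISO images, .exe files and other large binaries
    return requests.get(url, timeout=timeout)

Keep in mind that some servers omit Content-Length from HEAD responses; in that case this sketch falls through to fetching the page.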
There's a lot of stuff out there on urllib2 and POST calls, but I'm stuck on a problem.
I'm trying to do a simple POST call to a service:
url = 'http://myserver/post_service'
data = urllib.urlencode({'name' : 'joe',
'age' : '10'})
content = urllib2.urlopen(url=url, data=data).read()
print content
I can see in the server logs that I'm doing GET calls, even though I'm sending the data argument to urlopen.
The library is raising a 404 error (not found), which is correct for a GET call; POST calls are processed fine (I've also tried a POST from within an HTML form).
Do it in stages, and modify the object, like this:
# make a string with the request type in it:
method = "POST"
# create a handler. you can specify different handlers here (file uploads etc)
# but we go for the default
handler = urllib2.HTTPHandler()
# create an openerdirector instance
opener = urllib2.build_opener(handler)
# build a request
data = urllib.urlencode(dictionary_of_POST_fields_or_None)
request = urllib2.Request(url, data=data)
# add any other information you want
request.add_header("Content-Type", 'application/json')
# overload the get method function with a small anonymous function...
request.get_method = lambda: method
# try it; don't forget to catch the result
try:
    connection = opener.open(request)
except urllib2.HTTPError, e:
    connection = e

# check. Substitute with appropriate HTTP code.
if connection.code == 200:
    data = connection.read()
else:
    # handle the error case. connection.read() will still contain data
    # if any was returned, but it probably won't be of any use
    pass
This way allows you to extend to making PUT, DELETE, HEAD and OPTIONS requests too, simply by substituting the value of method or even wrapping it up in a function. Depending on what you're trying to do, you may also need a different HTTP handler, e.g. for multi file upload.
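For instance, here is a minimal sketch of wrapping that pattern in a helper (the open_with_method name and the example DELETE call are just illustrative, not part of urllib2):

import urllib
import urllib2

def open_with_method(method, url, data=None, headers=None):
    # Issue an HTTP request with an arbitrary verb using urllib2 (Python 2)
    handler = urllib2.HTTPHandler()
    opener = urllib2.build_opener(handler)
    body = urllib.urlencode(data) if data else None
    request = urllib2.Request(url, data=body)
    for key, value in (headers or {}).items():
        request.add_header(key, value)
    request.get_method = lambda: method
    try:
        return opener.open(request)
    except urllib2.HTTPError, e:
        return e

# e.g. a DELETE against a hypothetical resource:
# connection = open_with_method("DELETE", "http://myserver/post_service/1")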
This may have been answered before: Python URLLib / URLLib2 POST.
Your server is likely performing a 302 redirect from http://myserver/post_service to http://myserver/post_service/. When the 302 redirect is performed, the request changes from POST to GET (see Issue 1401). Try changing url to http://myserver/post_service/.
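A minimal sketch of that fix, reusing the code from the question (the only change is the trailing slash, which avoids the redirect):

url = 'http://myserver/post_service/'  # note the trailing slash
data = urllib.urlencode({'name' : 'joe',
                         'age' : '10'})
content = urllib2.urlopen(url=url, data=data).read()
print content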
Have a read of the urllib Missing Manual. Pulled from there is the following simple example of a POST request.
url = 'http://myserver/post_service'
data = urllib.urlencode({'name' : 'joe', 'age' : '10'})
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
print response.read()
As suggested by @Michael Kent, do consider requests; it's great.
EDIT: That said, I do not know why passing data to urlopen() does not result in a POST request; it should. I suspect your server is redirecting, or misbehaving.
The requests module may ease your pain.
url = 'http://myserver/post_service'
data = dict(name='joe', age='10')
r = requests.post(url, data=data, allow_redirects=True)
print r.content
It should be sending a POST if you provide a data parameter (as you are doing):
from the docs:
"the HTTP request will be a POST instead of a GET when the data parameter is provided"
So, add some debug output to see what's going on from the client side.
You can modify your code to this and try again:
import urllib
import urllib2
url = 'http://myserver/post_service'
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
data = urllib.urlencode({'name' : 'joe',
'age' : '10'})
content = opener.open(url, data=data).read()
Try this instead:
url = 'http://myserver/post_service'
data = urllib.urlencode({'name' : 'joe',
'age' : '10'})
req = urllib2.Request(url=url,data=data)
content = urllib2.urlopen(req).read()
print content
url="https://myserver/post_service"
data["name"] = "joe"
data["age"] = "20"
data_encoded = urllib2.urlencode(data)
print urllib2.urlopen(url + "?" + data_encoded).read()
May be this can help
I'm playing around trying to write a client for a site which provides data as an HTTP stream (aka HTTP server push). However, urllib2.urlopen() grabs the stream in its current state and then closes the connection. I tried skipping urllib2 and using httplib directly, but this seems to have the same behaviour.
The request is a POST request with a set of five parameters. There are no cookies or authentication required, however.
Is there a way to keep the stream open, so it can be checked on each program loop for new content, rather than waiting for the whole thing to be re-downloaded every few seconds, which introduces lag?
You could try the requests lib.
import requests
r = requests.get('http://httpbin.org/stream/20', stream=True)
for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print line
You could also add parameters:
import requests
settings = { 'interval': '1000', 'count':'50' }
url = 'http://agent.mtconnect.org/sample'
r = requests.get(url, params=settings, stream=True)
for line in r.iter_lines():
    if line:
        print line
Do you need to actually parse the response headers, or are you mainly interested in the content? And is your HTTP request complex, making you set cookies and other headers, or will a very simple request suffice?
If you only care about the body of the HTTP response and don't have a very fancy request, you should consider simply using a socket connection:
import socket
SERVER_ADDR = ("example.com", 80)
sock = socket.create_connection(SERVER_ADDR)
f = sock.makefile("r+", bufsize=0)
f.write("GET / HTTP/1.0\r\n"
+ "Host: example.com\r\n" # you can put other headers here too
+ "\r\n")
# skip headers
while f.readline() != "\r\n":
pass
# keep reading forever
while True:
line = f.readline() # blocks until more data is available
if not line:
break # we ran out of data!
print line
sock.close()
One way to do it using urllib2 is (assuming this site also requires Basic Auth):
import urllib2
p_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
url = 'http://streamingsite.com'
p_mgr.add_password(None, url, 'login', 'password')
auth = urllib2.HTTPBasicAuthHandler(p_mgr)
opener = urllib2.build_opener(auth)
urllib2.install_opener(opener)
f = opener.open('http://streamingsite.com')
while True:
    data = f.readline()
I'm trying to send a SOAP request using SOAPpy as the client. I've found some documentation stating how to add a cookie by extending SOAPpy.HTTPTransport, but I can't seem to get it to work.
I tried to use the example here,
but the server I'm trying to send the request to started throwing 415 errors, so I'm trying to accomplish this without using ClientCookie, or by figuring out why the server is throwing 415's when I do use it. I suspect it might be because ClientCookie uses urllib2 & http/1.1, whereas SOAPpy uses urllib & http/1.0
Does anyone know how to make ClientCookie use HTTP/1.0, if that is even the problem, or a way to add a cookie to the SOAPpy headers without using ClientCookie? I've tried this code with other services, and it only seems to throw errors when sending requests to Microsoft servers.
I'm still finding my footing with python, so it could just be me doing something dumb.
import sys, os, string
from SOAPpy import WSDL, HTTPTransport, Config, SOAPAddress, Types
import ClientCookie

Config.cookieJar = ClientCookie.MozillaCookieJar()

class CookieTransport(HTTPTransport):

    def call(self, addr, data, namespace, soapaction=None, encoding=None,
             http_proxy=None, config=Config):

        if not isinstance(addr, SOAPAddress):
            addr = SOAPAddress(addr, config)

        cookie_cutter = ClientCookie.HTTPCookieProcessor(config.cookieJar)
        hh = ClientCookie.HTTPHandler()
        hh.set_http_debuglevel(1)

        # TODO proxy support
        opener = ClientCookie.build_opener(cookie_cutter, hh)

        t = 'text/xml';
        if encoding != None:
            t += '; charset="%s"' % encoding
        opener.addheaders = [("Content-Type", t),
                             ("Cookie", "Username=foobar"),  # ClientCookie should handle
                             ("SOAPAction", "%s" % (soapaction))]

        response = opener.open(addr.proto + "://" + addr.host + addr.path, data)
        data = response.read()

        # get the new namespace
        if namespace is None:
            new_ns = None
        else:
            new_ns = self.getNS(namespace, data)

        print '\n' * 4, '-' * 50
        # return response payload
        return data, new_ns

url = 'http://www.authorstream.com/Services/Test.asmx?WSDL'
proxy = WSDL.Proxy(url, transport=CookieTransport)
print proxy.GetList()
Error 415 is caused by an incorrect Content-Type header.
Install HttpFox for Firefox or another tool (Wireshark, Charles, or Fiddler) to track what headers you are sending. Try Content-Type: application/xml.
...
t = 'application/xml';
if encoding != None:
    t += '; charset="%s"' % encoding
...
If you are trying to send a file to the web server, use Content-Type: application/x-www-form-urlencoded.
A nice hack to use cookies with SOAPpy calls: Using Cookies with SOAPpy calls