Name resolving in http requests

Name resolving in http requests - python

I am trying to make a simple HTTP request to a server inside my company, from a dev server. I figured out that, depending on the origin and destination servers, I may or may not be forced to use the qualified name of the destination server, like srvdestination.com.company.world instead of just srvdestination.
I am OK with this, but I don't understand why my DB connection works.
Let's say I have srvorigin. To make an HTTP request, I must use the qualified name srvdestination.com.company.world. However, for the database connection, a connection string with the unqualified name is enough: psycopg2.connect(host='srvdestination', ...). I understand that the protocols are different, but how does psycopg2 resolve the real name?

First, it all depends on how the name resolution subsystem of your OS is configured. If you are on Unix (you did not specify), this is governed by /etc/resolv.conf. There you can provide the OS with a search list: if a name does not have "enough" dots (the number is configurable), then suffixes from that list are appended when resolving it.
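To make the search-list behaviour concrete, here is a minimal sketch in Python that mimics how a typical Unix resolver expands a name. It is an illustration only: the function names parse_resolv_conf and candidate_names are made up for this example, the sample configuration is invented, and real resolvers have more options than the two handled here.

```python
def parse_resolv_conf(text):
    """Extract the search list and the ndots option from resolv.conf text."""
    search, ndots = [], 1  # ndots defaults to 1 on common resolvers
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] in ("search", "domain"):
            search = fields[1:]
        elif fields[0] == "options":
            for option in fields[1:]:
                if option.startswith("ndots:"):
                    ndots = int(option.split(":", 1)[1])
    return search, ndots


def candidate_names(name, search, ndots=1):
    """Return the names a resolver would try, in order."""
    if name.endswith("."):
        return [name]  # a trailing dot marks the name as fully qualified
    qualified = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        return [name] + qualified  # "enough" dots: try the name as given first
    return qualified + [name]  # too few dots: try the search list first


# Hypothetical configuration, matching the names used in the question
sample = "search com.company.world\noptions ndots:1\nnameserver 10.0.0.1\n"
search, ndots = parse_resolv_conf(sample)
print(candidate_names("srvdestination", search, ndots))
```

With such a search list in place, the short name srvdestination is first tried as srvdestination.com.company.world, which is why a client going through the OS resolver can get away with the short form.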
The library you use to make the HTTP request may not query the OS for name resolution and may do its DNS resolution itself. In that case, it can only work with the information you give it (though it could just as well reuse /etc/resolv.conf and the information in it), hence the need to use the full name.
By contrast, psycopg2 may use the OS resolution mechanism and hence deal with "short" names just fine.
Both libraries should have documentation on how they handle hostnames; otherwise you need to study their source code. I guess psycopg2 is a wrapper around libpq, the standard PostgreSQL client library, written in C if I am not mistaken, which hence most likely uses the standard OS resolution process.
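A quick way to see what the OS resolution process returns for a name, independently of any HTTP or database library, is to call it directly from Python. This is only a diagnostic sketch: os_resolve is a name invented for this example, and whether a given library goes through the same path is something to verify in its documentation or source.

```python
import socket


def os_resolve(hostname):
    """Resolve a name through the OS resolver; return sorted addresses or None."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return None  # the OS resolver could not resolve this name
    # Each entry is (family, type, proto, canonname, sockaddr); keep the address
    return sorted({info[4][0] for info in infos})


# Compare the short and the qualified form on the origin server, e.g.:
# print(os_resolve("srvdestination"))
# print(os_resolve("srvdestination.com.company.world"))
```

If both forms resolve here but the HTTP library still rejects the short one, the difference lies in the library (or in a proxy it uses), not in the OS configuration.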
I can understand the curiosity about this difference, but in any case my advice is to keep short names for when you type commands in the shell and the like (and even there they can be a problem), and to always use FQDNs (Fully Qualified Domain Names) in your programs and configuration files. You will avoid a lot of problems.

Related

How to reliably check if a domain has been registered or is available?

Objective
I need a reliable way to check in Python if a domain of any TLD has been registered or is available. The bold phrases are the key points that I'm struggling with.
What I tried?
WHOIS is the obvious way to do the check and an existing Python library like the popular python-whois was my first try. The problem is that it doesn't seem to be able to retrieve information for some of the TLDs, e.g. .run, while it works mostly fine for older ones, e.g. .com.
So if python-whois is not reliable, maybe just a wrapper for the Linux's whois would be better. I tried whois library and unfortunately it supports only a limited set of TLDs, apparently to make sure it can always parse the results.
As I don't really need to parse the results, I ripped the code out of the whois library and tried to do the query by calling Linux's whois myself:
import subprocess
# run the system whois client and capture stdout and stderr together
p = subprocess.Popen(['whois', 'example.com'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
r = p.communicate()[0]
print(r.decode())
That works much better. Except it's not that reliable either. I tried one particular domain and got "Your connection limit exceeded. Please slow down and try again later." Well, it's not me who is exceeding the limit. Being behind a single IP in a huge office means that somebody else might hit the limit before I make a query.
Another thought was not to use WHOIS and instead do a DNS lookup. However, I need to deal with domains that are registered or in the protected phase after expiry and don't have DNS records so this is apparently not possible.
Last idea was to do the queries via an API of some 3rd party service. The problem is trust in those services as they might snatch an available domain that I check.
Similar questions
There are already similar questions:
a stable way to check domain availability with pywhois
Testing domain-name availability with pythonwhois
...but they either deal only with a limited set of TLDs or are not that bothered by reliability.

If you do not have specific access (like being a registrar), and if you do not target a specific TLD (as some TLDs do have a specific public service called domain availability), the only tool that makes sense is to query whois servers.
You have then at least the following two problems:
Use the appropriate whois server based on the given domain name.
Take into account that whois servers are rate-limited, so if you bulk query them without care you will first hit delays and then even risk having your IP blacklisted for some time.
For the second point the usual methods apply (handling delays on your side, using multiple endpoints, etc.)
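As an illustration of "handling delays on your side", here is a minimal per-server throttle in Python. The class name Throttle and the chosen interval are assumptions made for this sketch; whois servers do not publish a common limit, so the right value has to be found per server.

```python
import time


class Throttle:
    """Enforce a minimum delay between two queries to the same server."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable for testing
        self._sleep = sleep
        self._last = {}  # server name -> time of the last query

    def wait(self, server):
        """Block until it is acceptable to query this server again."""
        now = self._clock()
        last = self._last.get(server)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last[server] = now


# Usage: call throttle.wait(server) right before each whois query
throttle = Throttle(min_interval=2.0)  # 2 seconds is an arbitrary example value
```

This does not help when other people behind the same IP consume the quota, as described in the question; in that case only retries with back-off or a different endpoint remain.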
For the first point, in another reply of mine here: https://unix.stackexchange.com/a/407030/211833 you can find some explanations of what you observe depending on the wrapper around whois you use, and some countermeasures. See also my other reply here: https://webmasters.stackexchange.com/a/111639/75842 and specifically point 2.
Note that depending on your specific requirements, and if you are able to change at least part of them, you may have other solutions. For example, for gTLDs, if you tolerate a 24-hour delay, you may use the zonefiles published by registries to find registered domain names (only those that are published, so not all of them).
Also, while you are right in a generic sense that using a third party has its weaknesses, if you find a worthy registrar that both has access to many registries and provides you with an API, you could then use it for your needs.
In short, I do not believe you can achieve this task with all cases (100% reliability, 100% TLDs, etc.). You will need some compromises but they depend on your initial needs.
Also very important: do not shell out to run a whois command, as this creates many security and performance problems. Use the appropriate libraries in your programming language to do whois queries, or just open a TCP socket on port 43, send your query on one line terminated by CR+LF, and read back a blob of text. This is basically all that RFC 3912 defines.
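The socket approach can be sketched in a few lines of Python using only the standard library. The function names are invented for this example, and the caller has to supply the right whois server for the TLD, which this sketch does not attempt to discover.

```python
import socket


def build_whois_query(domain):
    """Encode a query as RFC 3912 expects: one line terminated by CR+LF."""
    return domain.strip().encode("idna") + b"\r\n"


def whois_query(domain, server, port=43, timeout=10.0):
    """Send one whois query to the given server and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(build_whois_query(domain))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    # RFC 3912 does not define an encoding, so decode leniently
    return b"".join(chunks).decode("utf-8", errors="replace")


# Example (requires network access; the server must match the TLD queried):
# print(whois_query("example.com", "whois.iana.org"))
```

The reply is free-form text, so deciding "registered" versus "available" from it still depends on each registry's wording.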

Access a page that requires a Safenet USB Token from urllib2 or httplib

When I have a software certificate I do like this.
import httplib
CLIENT_CERT_FILE = '/path/to/certificate.pem'
connection = httplib.HTTPSConnection('url-to-open', cert_file=CLIENT_CERT_FILE)
connection.request('GET', '/')
response = connection.getresponse()
print response.status
data = response.read()
print data
How can I do the same with a Safenet USB Token ?

TL;DR there are significant caveats and security issues with doing this in Python.
A working "solution" involves using a PKCS#11 library to read the certificate from the key, then somehow persisting the certificate on the disk, and finally passing the resulting file path to the request object.
There will also be differences with each security stick's particularities. Some sticks do not offer to store a certificate along with its private key (aka a .pfx or .p12 file) which will essentially make this solution unworkable. I didn't have access to a Safenet stick, so used my own, please bear this in mind.
A solution for this requires quite a bit of work. Your use of a security dongle means that your client certificates are located onto the dongle itself. So, in order to achieve the same level of functionality, you need to write code to extract the certificate from there and feed it to your request object.
1. HTTPS-capable libraries in Python
Your requirement of using httplib (http.client for Python 3.x) or urllib introduces a big caveat: the certificate used in the request has to be a file on the disk (and the same can be said of all libraries building on top of them, e.g. requests). See cnelson's answer to How to open ssl socket using certificate stored in string variables in python for the reason (in short: it's because Python's ssl library makes use of a native C library which does not offer passing in-memory objects as the certificate). Also see the next answer from Dima Tisnek detailing possible workarounds with varying degrees of hackmanship.
If writing your certificate (even temporarily) on the disk is a non-starter for you, as it may very well be since you use a security stick, then it's not starting off looking good.
2. Getting the certificate from the security stick
Your biggest challenge is to get your hands on the certificate, which is currently nestled inside the security stick. Safenet sticks, like many others, are at the core a PKCS#11 capable SmartCard. I suggest you familiarise yourself with the concepts, but in essence, SmartCard is a standardised chip design, and PKCS#11 is a standardised protocol to interface with it. "Standardised" comes with caveats of course, since many vendors come up with their own implementations, but it could probably be standardised enough for your purpose. The trick here will be to use the available PKCS#11 interfaces on the stick to extract the certificate's attributes. This is essentially what web browsers do when using the stick to authenticate on websites with the stored certificate, so you need to have your Python program do a similar thing.
2.1 Selecting a PKCS#11 library
Unfortunately, there are only a few libraries that come up when searching for "python pkcs11". I have no vested interest in either of them, and there may exist other less prominent ones.
python-pkcs11 (pypi, github, reference) offers a "high level, pythonic implementation of PKCS#11". It may be easier to use overall, but may lack compatibility and/or features depending on what you want to do, however I suspect simply retrieving certificates may be alright.
PyKCS11 (pypi, github, reference) on the other hand is a wrapper around a native PKCS#11 library, to which it will defer the calls. This one is lower-level, but looks more complete, plus may have the advantage to offer using your particular vendor's implementation if relevant.
2.2 Example code
For the example, I'll be using the user-friendlier API of python-pkcs11. Please bear in mind that this code is not thoroughly tested (and has been simplified in parts) and serves as illustrating the general idea.
import os
import ssl
import tempfile
import urllib.request

import asn1crypto.pem
import pkcs11

# this is OpenSC's implementation of PKCS#11
# other security sticks may come with another implementation.
# choose the most appropriate one
lib = pkcs11.lib('/usr/lib/pkcs11/opensc-pkcs11.so')
# tokens may be identified with various names, ids...
# it's probably rare that more than one at a time would be plugged in
token = lib.get_token(token_serial='<token_serial_value>')
with token.open() as sess:
    # wrap in list() so len() works even if the search result is a lazy iterator
    pkcs11_certificates = list(sess.get_objects(
        {
            pkcs11.Attribute.CLASS: pkcs11.ObjectClass.CERTIFICATE,
            pkcs11.Attribute.LABEL: "Cardholder certificate"
        }))
    # hopefully the selector above is sufficient
    assert len(pkcs11_certificates) == 1
    pkcs11_cert = pkcs11_certificates[0]
    der_encoded_certificate = pkcs11_cert[pkcs11.Attribute.VALUE]
# the ssl library expects to be given PEM armored certificates
pem_armored_certificate = asn1crypto.pem.armor("CERTIFICATE",
                                               der_encoded_certificate)
# this is the ugly part: persisting the certificate on disk
# I deliberately did not go with a sophisticated solution here since it's
# such a big caveat to have to do this...
fd, certpath = tempfile.mkstemp()
try:
    # armor() returns bytes, so write through the descriptor in binary mode
    with os.fdopen(fd, 'wb') as certfile_handle:
        certfile_handle.write(pem_armored_certificate)
    # this will instruct the ssl library to provide the certificate
    # if asked by the server.
    sslctx = ssl.create_default_context()
    sslctx.load_cert_chain(certfile=certpath)
    # if your certificate does not contain the private key, find it elsewhere
    # sslctx.load_cert_chain(certfile=certpath,
    #                        keyfile="/path/to/privatekey.pem",
    #                        password="<private_key_password_if_applicable>")
    response = urllib.request.urlopen("https://ssl_website", context=sslctx)
finally:
    # Cleanup and delete the "temporary" certificate from disk
    os.remove(certpath)
3. Conclusion
I'd say that Python is not going to be the best bet for doing ssl client authentication using security sticks. The fact that most ssl libraries require the certificate to be present on the disk works directly against the benefits (and sometimes, requirements) of the use of a security stick in the first place. I'm well aware that this answer does not provide a full solution to this problem, but hopefully exposes the challenges in enough detail to make an educated decision on whether to pursue this further or to find another way.
In any case, good luck.

Does twisted epollreactor use non-blocking dns lookup?

It seems obvious that it would use the twisted names api and not any blocking way to resolve host names.
However, digging in the source code, I have been unable to find the place where the name resolution occurs. Could someone point me to the relevant source code where the host resolution occurs (when trying to do a connectTCP, for example)?
I really need to be sure that connectTCP won't use blocking DNS resolution.

It seems obvious, doesn't it?
Unfortunately:
Name resolution is not always configured in the obvious way. You think you just have to read /etc/resolv.conf? Even in the specific case of Linux and DNS, you might have to look in an arbitrary number of files looking for name servers.
Name resolution is much more complex than just DNS. You have to do mDNS resolution, possibly look up some LDAP computer records, and then you have to honor local configuration dictating the ordering between these such as /etc/nsswitch.conf.
Name resolution is not exposed via a standard or useful non-blocking API. Even the glibc-specific getaddrinfo_a exposes its non-blockingness via SIGIO, not just a file descriptor you can watch. Which means that, like POSIX AIO, it's probably just a kernel thread behind your back anyway.
For these reasons, among others, Twisted defaults to using a resolver that just calls gethostbyname in a thread.
However, if you know that for your application it is appropriate to have DNS-only hostname resolution, and you'd like to use twisted.names rather than your platform resolver - in other words, if scale matters more to you than esoteric name-resolution use-cases - that is supported. You can install a resolver from twisted.names.client onto the reactor, appropriately configured for your application and all future built-in name resolutions will be made with that resolver.

I'm not massively familiar with Twisted; I only recently started using it. It looks like it doesn't block, though, but only on platforms that support threading.
In twisted.internet.base, ReactorBase appears to do the resolving through its resolve method, which returns a Deferred from self.resolver.getHostByName.
self.resolver is an instance of BlockingResolver by default, which does block, but it looks like, if the platform supports threading, the resolver instance is replaced by a ThreadedResolver in the ReactorBase._initThreads method.
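The idea behind a threaded resolver can be sketched with the Python standard library alone. To be clear, this is not Twisted's code: it only illustrates the technique of pushing the blocking gethostbyname call onto a worker thread so the caller gets a future-like object back immediately. The names resolve_in_thread and _pool are invented for this example.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# A small pool of worker threads dedicated to blocking lookups
_pool = ThreadPoolExecutor(max_workers=4)


def resolve_in_thread(hostname):
    """Return a Future; the blocking lookup runs on a worker thread."""
    return _pool.submit(socket.gethostbyname, hostname)


future = resolve_in_thread("127.0.0.1")
# The calling thread is free to do other work here while the lookup runs
print(future.result(timeout=5))
```

The lookup itself still blocks, just not on the thread that runs the event loop, which is the same trade-off the answers above describe.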

python: how to tell socket.gethostbyaddr() which dns server to use

Is there any way to specify which DNS server should be used by socket.gethostbyaddr()?

Please correct me if I'm wrong, but isn't this the operating system's responsibility? gethostbyaddr is just a part of libc and, according to the man page:
The gethostbyname(), gethostbyname2() and gethostbyaddr() functions each return a
pointer to an object with the following structure describing an internet host referenced
by name or by address, respectively. This structure contains either the information
obtained from the name server, named(8), or broken-out fields from a line in
/etc/hosts. If the local name server is not running these routines do a lookup in
/etc/hosts.
So I would say there's no simple way of telling Python (from the code's point of view) to use a particular DNS server, since that is part of the system's configuration.

Take a look at PyDNS.
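To show what such a library does under the hood, here is a rough standard-library sketch that builds a reverse (PTR) query and sends it to one specific DNS server, bypassing the system configuration. The function names are invented for this example, and parsing the reply is deliberately left out, since that is the tedious part a DNS library handles for you.

```python
import ipaddress
import socket
import struct


def build_ptr_query(ip, query_id=0x1234):
    """Build a raw DNS PTR query packet for the given IP address."""
    # e.g. 192.0.2.1 becomes 1.2.0.192.in-addr.arpa
    qname = ipaddress.ip_address(ip).reverse_pointer
    # Header: id, flags (recursion desired), 1 question, no other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is encoded as a length byte followed by the label itself
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in qname.split(".")
    )
    # Root label, then QTYPE=12 (PTR) and QCLASS=1 (IN)
    question = labels + b"\x00" + struct.pack("!HH", 12, 1)
    return header + question


def send_query(packet, server, port=53, timeout=3.0):
    """Send the packet over UDP to one specific DNS server; return the raw reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, port))
        reply, _ = sock.recvfrom(4096)
    return reply


# Example (requires network access; pass the address of your own DNS server):
# raw_reply = send_query(build_ptr_query("192.0.2.1"), "<dns_server_ip>")
```

The key point is that the server address is a plain argument here, which is exactly what socket.gethostbyaddr() does not offer.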

Twisted FTPFileListProtocol and file names with spaces

I am using Python and the Twisted framework to connect to an FTP site to perform various automated tasks. Our FTP server happens to be Pure-FTPd, if that's relevant.
When connecting and calling the list method on an FTPClient, the resulting FTPFileListProtocol's files collection does not contain any directories or file names that contain a space (' ').
Has anyone else seen this? Is the only solution to create a sub-class of FTPFileListProtocol and override its unknownLine method, parsing the file/directory names manually?

Firstly, if you're performing automated tasks on a retrieved FTP listing then you should probably be looking at NLST rather than LIST, as noted in RFC 959 section 4.1.3:
NAME LIST (NLST)
...
This command is intended to return information that
can be used by a program to further process the
files automatically.
The Twisted documentation for LIST says:
It can cope with most common file listing formats.
This makes me suspicious; I do not like solutions that "cope". LIST was intended for human consumption, not machine processing.
If your target server supports them then you should prefer MLST and MLSD as defined in RFC 3659 section 7:
7. Listings for Machine Processing (MLST and MLSD)
The MLST and MLSD commands are intended to standardize the file and
directory information returned by the server-FTP process. These
commands differ from the LIST command in that the format of the
replies is strictly defined although extensible.
However, these newer commands may not be available on your target server and I don't see them in Twisted. Therefore NLST is probably your best bet.
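To show why a strictly defined format is easier on programs, here is a small sketch that parses one MLSD entry as laid out in RFC 3659 (facts separated by semicolons, then a single space, then the path name). The function name parse_mlsd_line and the sample entry are made up for this illustration. Outside Twisted, Python's standard ftplib also offers FTP.nlst() and FTP.mlsd() if the server supports those commands.

```python
def parse_mlsd_line(line):
    """Split one MLSD entry into (name, facts)."""
    # Everything after the first space is the name, spaces included
    facts_part, _, name = line.rstrip("\r\n").partition(" ")
    facts = {}
    for fact in facts_part.split(";"):
        if fact:  # the fact list ends with ';', which yields an empty item
            key, _, value = fact.partition("=")
            facts[key.lower()] = value  # fact names are case-insensitive
    return name, facts


name, facts = parse_mlsd_line(
    "type=file;size=1024;modify=20200101120000; my report.txt"
)
print(name, facts["type"], facts["size"])
```

Because the name is simply "the rest of the line", file names containing spaces need no special handling.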
As to the nub of your problem, there are three likely causes:
1. The processing of the returned results is incorrect (Twisted may be at fault, as you suggest, or perhaps something elsewhere).
2. The server is buggy and not sending a correct (complete) response.
3. The wrong command is being sent (unlikely with straight NLST/LIST, but some servers react differently if arguments are supplied to these commands).
You can eliminate (2) and (3) and prove that the cause is (1) by looking at what is sent over the wire. If this option is not available to you as part of the Twisted API or the Pure-FTPd server logging configuration, then you may need to break out a network sniffer such as tcpdump, snoop or Wireshark (assuming you're allowed to do this in your environment). Note that you will need to trace not only the control connection (port 21) but also the data connection (since that carries the results of the LIST/NLST command). Wireshark is nice since it will perform the protocol-level analysis for you.
Good luck.

This is somewhat expected. FTPFileListProtocol isn't able to understand every FTP output because, well, some are wacky. As explained in the docstring:
If you need different evil for a wacky FTP server, you can
override either C{fileLinePattern} or C{parseDirectoryLine()}.
In this case, it may be a bug: maybe you can improve fileLinePattern and make it understand file names with spaces. If so, you're welcome to open a bug in the Twisted tracker.
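As a starting point for such an improvement, here is a hypothetical pattern for Unix-style LIST lines that accepts spaces in the file name. It is not Twisted's actual fileLinePattern, and its group names are chosen for this example only; it would need to be adapted and tested against what your server really sends.

```python
import re

# Hypothetical pattern for "ls -l" style lines; the file name is everything
# after the date, so embedded spaces are preserved.
LIST_LINE = re.compile(
    r"^(?P<filetype>[-dlbcps])(?P<perms>[-rwxsStT]{9})\s+"
    r"(?P<nlinks>\d+)\s+(?P<owner>\S+)\s+(?P<group>\S+)\s+"
    r"(?P<size>\d+)\s+"
    r"(?P<date>\w{3}\s+\d{1,2}\s+(?:\d{1,2}:\d{2}|\d{4}))\s"
    r"(?P<filename>.+?)(?: -> (?P<linktarget>.+))?\r?$"
)


def parse_list_line(line):
    """Return a dict of fields for a recognised line, or None otherwise."""
    match = LIST_LINE.match(line)
    return match.groupdict() if match else None


entry = parse_list_line(
    "-rw-r--r--   1 ftp      ftp          1024 Jan 15 10:30 my report.txt"
)
print(entry["filename"])
```

A pattern like this still only "copes" with one listing style, which is the reason the other answer recommends NLST or MLSD for automated processing.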
