Does twisted epollreactor use non-blocking dns lookup? - python

It seems obvious that it would use the twisted names api and not any blocking way to resolve host names.
However, digging into the source code, I have been unable to find the place where the name resolution occurs. Could someone point me to the relevant source code where the host resolution happens (when doing a connectTCP, for example)?
I really need to be sure that connectTCP won't use blocking DNS resolution.

It seems obvious, doesn't it?
Unfortunately:
Name resolution is not always configured in the obvious way. You think you just have to read /etc/resolv.conf? Even in the specific case of Linux and DNS, you might have to look in an arbitrary number of files looking for name servers.
Name resolution is much more complex than just DNS. You have to do mDNS resolution, possibly look up some LDAP computer records, and then you have to honor local configuration dictating the ordering between these such as /etc/nsswitch.conf.
Name resolution is not exposed via a standard or useful non-blocking API. Even the glibc-specific getaddrinfo_a exposes its non-blockingness via SIGIO, not just a file descriptor you can watch. Which means that, like POSIX AIO, it's probably just a kernel thread behind your back anyway.
For these reasons, among others, Twisted defaults to using a resolver that just calls gethostbyname in a thread.
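That default can be sketched with only the standard library: the same blocking gethostbyname call is handed off to a worker thread, so the caller gets a future instead of stalling (Twisted returns a Deferred rather than a Future, but the idea is the same):

```python
import socket
from concurrent.futures import Future, ThreadPoolExecutor

# A small worker pool stands in for Twisted's reactor thread pool.
_pool = ThreadPoolExecutor(max_workers=4)

def resolve_async(hostname: str) -> Future:
    """Run the blocking gethostbyname call on a worker thread; return a
    Future that fires with the IPv4 address (or raises socket.gaierror)."""
    return _pool.submit(socket.gethostbyname, hostname)
```

Usage: `resolve_async("localhost").result()` blocks only the waiter, not the event loop thread that submitted it.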
However, if you know that DNS-only hostname resolution is appropriate for your application, and you'd like to use twisted.names rather than your platform resolver - in other words, if scale matters more to you than esoteric name-resolution use cases - that is supported. You can install a resolver from twisted.names.client onto the reactor, appropriately configured for your application, and all future built-in name resolutions will be made with that resolver.

I'm not massively familiar with Twisted; I only recently started using it. It looks like it doesn't block, though, but only on platforms that support threading.
In twisted.internet.base, ReactorBase does the resolving through its resolve method, which returns a Deferred from self.resolver.getHostByName.
self.resolver is an instance of BlockingResolver by default, which does block, but if the platform supports threading the resolver instance is replaced by ThreadedResolver in the ReactorBase._initThreads method.

Related

Name resolving in http requests

I am trying to make a simple HTTP request to a server inside my company, from a dev server. I figured out that depending on the origin/destination server, I may or may not be forced to use the qualified name of the destination server, like srvdestination.com.company.world instead of just srvdestination.
I am OK with this, but I don't understand why my DB connection works.
Let's say I have srvorigin. Now, to make an HTTP request, I must use the qualified name srvdestination.com.company.world. However, for the database connection, a connection string with the unqualified name is enough: psycopg.connect(host='srvdestination', ...). I understand that the protocols are different, but what does psycopg2 do to resolve the real name?
First of all, it depends on how the name resolution subsystem of your OS is configured. If you are on Unix (you did not specify), this is governed by /etc/resolv.conf. There you can provide the OS with a search list: if a name does not have "enough" dots (the number is configurable), a suffix is appended and resolution is retried.
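To make the search-list mechanism concrete, here is a toy parser for the `search`/`domain` directives of a Unix-style resolv.conf (a real resolver also honors `options ndots:n` and other directives, which this sketch ignores):

```python
def parse_search_domains(resolv_conf_text):
    """Return the list of suffixes the resolver would append to short names."""
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        # "search" takes a list of suffixes; the older "domain" takes one.
        if parts and parts[0] in ("search", "domain"):
            return parts[1:]
    return []
```

With a search list like `com.company.world`, a lookup of `srvdestination` is retried as `srvdestination.com.company.world`, which is exactly why a short name can work for one library and not another.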
The library you use to do the HTTP request may not query the OS for name resolution and do its DNS resolution itself. In which case, it can only work with the information you give it (but it could as well re-use the OS /etc/resolv.conf and information in it), hence the need to use the full name.
psycopg2, on the other hand, may use the OS resolution mechanism and hence deal with "short" names just fine.
Both libraries should document how they handle hostnames... or otherwise you need to study their source code. I guess psycopg2 is a wrapper around the standard libpq library, written in C if I am not mistaken, which certainly uses the standard OS resolution process.
I can understand the curiosity around this difference, but my advice anyway is: keep short names when you type commands in the shell and the like (and even there they can be a problem), but always use FQDNs (Fully Qualified Domain Names) in your programs and configuration files. You will avoid a lot of problems.

How to reliably check if a domain has been registered or is available?

Objective
I need a reliable way to check in Python if a domain of any TLD has been registered or is available. The bold phrases are the key points that I'm struggling with.
What I tried?
WHOIS is the obvious way to do the check and an existing Python library like the popular python-whois was my first try. The problem is that it doesn't seem to be able to retrieve information for some of the TLDs, e.g. .run, while it works mostly fine for older ones, e.g. .com.
So if python-whois is not reliable, maybe just a wrapper for Linux's whois would be better. I tried the whois library, and unfortunately it supports only a limited set of TLDs, apparently to make sure it can always parse the results.
As I don't really need to parse the results, I ripped the code out of the whois library and tried to do the query by calling Linux's whois myself:
import subprocess

p = subprocess.Popen(['whois', 'example.com'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
r = p.communicate()[0]
print(r.decode())
That works much better. Except it's not that reliable either. I tried one particular domain and got "Your connection limit exceeded. Please slow down and try again later." Well, it's not me who is exceeding the limit. Being behind a single IP in a huge office means that somebody else might hit the limit before I make a query.
Another thought was not to use WHOIS and instead do a DNS lookup. However, I need to deal with domains that are registered or in the protected phase after expiry and don't have DNS records so this is apparently not possible.
Last idea was to do the queries via an API of some 3rd party service. The problem is trust in those services as they might snatch an available domain that I check.
Similar questions
There are already similar questions:
a stable way to check domain availability with pywhois
Testing domain-name availability with pythonwhois
...but they either deal only with a limited set of TLDs or are not that bothered by reliability.
If you do not have special access (like being a registrar), and if you do not target a specific TLD (some TLDs do have a dedicated public domain-availability service), the only tool that makes sense is querying whois servers.
You have then at least the following two problems:
choosing the appropriate whois server based on the given domain name;
taking into account that whois servers are rate-limited, so if you bulk-query them carelessly you will first hit delays and then risk your IP being blacklisted for some time.
For the second point the usual methods apply (handling delays on your side, using multiple endpoints, etc.)
For the first point, in another reply of mine here: https://unix.stackexchange.com/a/407030/211833 you can find some explanation of what you observe depending on the whois wrapper you use, along with some countermeasures. See also my other reply here: https://webmasters.stackexchange.com/a/111639/75842, specifically point 2.
Note that depending on your specific requirements, and if you are able to change at least part of them, you may have other solutions. For example, for gTLDs, if you can tolerate a 24-hour delay, you may use the registries' published zone files to find registered domain names (only those published, so not all of them).
Also, while you are right in a general sense that using a third party has its weaknesses, if you find a trustworthy registrar that both has access to many registries and provides you with an API, you could use it for your needs.
In short, I do not believe you can achieve this task with all cases (100% reliability, 100% TLDs, etc.). You will need some compromises but they depend on your initial needs.
Also, very important: do not shell out to run a whois command; this creates many security and performance problems. Use an appropriate library from your programming language to do whois queries, or just open a TCP socket on port 43, send your query as one line terminated by CR+LF, and read back a blob of text; that is basically all that RFC 3912 defines.
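That last option is small enough to sketch with the standard library. The whois server name is a required argument because each TLD has its own; the Verisign server mentioned in the usage note below is just an example for .com:

```python
import socket

def build_query(domain):
    # RFC 3912: a query is a single line terminated by CR+LF.
    return domain.encode("ascii") + b"\r\n"

def whois_query(domain, server, port=43, timeout=10):
    """Send one WHOIS query to `server` and read the text blob back."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(build_query(domain))
        chunks = []
        while True:  # the server closes the connection when it is done
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Usage: something like `whois_query("example.com", "whois.verisign-grs.com")`. The rate limits discussed above still apply, so add your own delays and retries on top.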

Django: Concurrent access to settings.py

I am not sure whether I have to care about concurrency, but I didn't find any documentation about it.
I have some data, such as IP addresses, stored in my settings.py, and each user can take one or give one back. So I have read and write operations, and I want only one user to access the file at any given moment.
How could I handle this?
And yes, I want to store the data in settings.py. I also found the module django-concurrency, but I couldn't find anything about this in its documentation.
As e4c5 mentioned, conventionally settings.py is pretty light on logic. The loading mechanism for settings is pretty obscure and I, personally, like to stay away from things that are difficult to understand and interact with :)
You absolutely have to care about concurrency. How are you running your application? It's tricky because in the dev environment you have a simple server that usually handles only a handful of requests at the same time (and a couple of years ago the dev server was single-threaded).
If you're running your application under a forking server, how will you share data between processes? One process won't even see another process's changes to settings.py. I'm not even sure what it would look like with a threading server, but it would probably at least require a source-code audit of your web server to understand the specifics of how requests are handled and how memory is shared.
Using a DB is by far the easiest solution (an in-memory store such as memcached or Redis is an option too). Databases provide concurrency support out of the box, are a lot easier to reason about, and offer primitives for concurrent access to data. And in the case of Redis, which is single-threaded, you won't even have to worry about concurrent accesses to your shared IP addresses.
And yes, I want to store the data at the settings.py.
No, you definitely don't want to do that. The settings.py file configures Django and any pluggable apps you may use with it; it is not intended as a place for dumping data. Data goes into a database.
And don't forget that the settings.py file is usually read only once.
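A minimal sketch of the database approach, using stdlib sqlite3 so it stays self-contained. The table name, schema, and BEGIN IMMEDIATE locking strategy are illustrative choices, not from the question; in Django you would more likely use a model and select_for_update() inside transaction.atomic():

```python
import sqlite3

def init_pool(conn, addresses):
    """Create the pool table (hypothetical schema) and seed it with addresses."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ip_pool ("
        "ip TEXT PRIMARY KEY, taken INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO ip_pool (ip) VALUES (?)",
        [(a,) for a in addresses],
    )
    conn.commit()

def take_ip(conn):
    """Atomically claim a free address; return None if the pool is exhausted."""
    # BEGIN IMMEDIATE takes the write lock up front, so two processes
    # cannot both claim the same row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT ip FROM ip_pool WHERE taken = 0 ORDER BY ip LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE ip_pool SET taken = 1 WHERE ip = ?", (row[0],))
        return row[0]
    finally:
        conn.execute("COMMIT")

def release_ip(conn, ip):
    """Give an address back to the pool."""
    conn.execute("UPDATE ip_pool SET taken = 0 WHERE ip = ?", (ip,))
    conn.commit()
```

Every worker process opens its own connection to the same database file, and the database, not the Python code, arbitrates who gets which address.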

How to get more search results than the server's sizelimit with Python LDAP?

I am using the python-ldap module to (amongst other things) search for groups, and am running into the server's size limit and getting a SIZELIMIT_EXCEEDED exception. I have tried both synchronous and asynchronous searches and hit the problem both ways.
You are supposed to be able to work around this by setting a paging control on the search, but according to the python-ldap docs these controls are not yet implemented for search_ext(). Is there a way to do this in Python? If the python-ldap library does not support it, is there another Python library that does?
Here are some links related to paging in python-ldap.
Documentation: http://www.python-ldap.org/doc/html/ldap-controls.html#ldap.controls.SimplePagedResultsControl
Example code using paging: http://www.novell.com/coolsolutions/tip/18274.html
More example code: http://google-apps-for-your-domain-ldap-sync.googlecode.com/svn/trunk/ldap_ctxt.py
After some discussion on the python-ldap-dev mailing list, I can answer my own question.
Paged controls ARE supported by the python-ldap module, but the docs had not been updated to show that for search_ext. The example linked by Gorgapor shows how to use ldap.controls.SimplePagedResultsControl to read the results in pages.
However, there is a gotcha. This works with Microsoft Active Directory servers, but not with OpenLDAP servers (and possibly others, such as Sun's). The LDAP controls RFC is ambiguous as to whether paged controls should be allowed to override the server's sizelimit setting. On Active Directory servers they can by default, while on OpenLDAP they cannot, although I think there is a server setting that will allow them to.
So even if you implement the paged control, there is still no guarantee that it will get all the objects that you want. Sigh
Also, paged controls are only available with LDAP v3, but I doubt there are many v2 servers still in use.

Twisted FTPFileListProtocol and file names with spaces

I am using Python and the Twisted framework to connect to an FTP site to perform various automated tasks. Our FTP server happens to be Pure-FTPd, if that's relevant.
When connecting and calling the list method on an FTPClient, the resulting FTPFileListProtocol's files collection does not contain any directories or file names that contain a space (' ').
Has anyone else seen this? Is the only solution to create a sub-class of FTPFileListProtocol and override its unknownLine method, parsing the file/directory names manually?
Firstly, if you're performing automated tasks on a retrieved FTP listing then you should probably be looking at NLST rather than LIST, as noted in RFC 959, section 4.1.3:
NAME LIST (NLST)
...
This command is intended to return information that
can be used by a program to further process the
files automatically.
The Twisted documentation for LIST says:
It can cope with most common file listing formats.
This makes me suspicious; I do not like solutions that "cope". LIST was intended for human consumption, not machine processing.
If your target server supports them then you should prefer MLST and MLSD as defined in RFC 3659 section 7:
7. Listings for Machine Processing (MLST and MLSD)
The MLST and MLSD commands are intended to standardize the file and
directory information returned by the server-FTP process. These
commands differ from the LIST command in that the format of the
replies is strictly defined although extensible.
However, these newer commands may not be available on your target server and I don't see them in Twisted. Therefore NLST is probably your best bet.
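The NLST route needs very little code with the standard library's ftplib (host and credentials below are placeholders):

```python
from ftplib import FTP

def list_names(host, user="anonymous", passwd=""):
    """Fetch a machine-friendly listing via NLST (FTP.nlst): one bare
    file name per line, instead of LIST's human-oriented columns."""
    with FTP(host) as ftp:
        ftp.login(user, passwd)
        return ftp.nlst()
```

Names containing spaces come back verbatim, since there are no other columns to confuse the parser.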
As to the nub of your problem, there are three likely causes:
The processing of the returned results is incorrect (Twisted may be at fault, as you suggest, or the fault may lie elsewhere)
The server is buggy and not sending a correct (complete) response
The wrong command is being sent (unlikely with straight NLST/LIST, but some servers react differently if arguments are supplied to these commands)
You can eliminate (2) and (3) and prove that the cause is (1) by looking at what is sent over the wire. If this option is not available to you as part of the Twisted API or the Pure-FTPd server's logging configuration, then you may need to break out a network sniffer such as tcpdump, snoop or Wireshark (assuming you're allowed to do this in your environment). Note that you will need to trace not only the control connection (port 21) but also the data connection (since that carries the results of the LIST/NLST command). Wireshark is nice since it will perform the protocol-level analysis for you.
Good luck.
This is somewhat expected. FTPFileListProtocol isn't able to understand every FTP output because, well, some are wacky. As explained in the docstring:
If you need different evil for a wacky FTP server, you can
override either C{fileLinePattern} or C{parseDirectoryLine()}.
In this case, it may be a bug: maybe you can improve fileLinePattern and make it understand filenames with spaces. If so, you're welcome to open a ticket in the Twisted tracker.
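For illustration, here is the kind of pattern such a parser needs: a hypothetical regex (not Twisted's actual fileLinePattern, which uses its own group names, so check the source before overriding) for unix-style `ls -l` lines, whose greedy final group keeps names containing spaces intact:

```python
import re

# Hypothetical pattern modeled on unix "ls -l" output. The final (?P<name>.+)
# group is greedy, so "My Folder" is captured whole rather than split at the
# first space.
LINE_RE = re.compile(
    r'^(?P<perms>[-dlbcps][-rwxsStT]{9})\s+(?P<nlinks>\d+)\s+'
    r'(?P<owner>\S+)\s+(?P<group>\S+)\s+(?P<size>\d+)\s+'
    r'(?P<date>\w{3}\s+\d{1,2}\s+[\d:]{4,5})\s+(?P<name>.+)$'
)
```

A pattern along these lines is what a FTPFileListProtocol subclass would supply; the hard part is exactly what this question ran into, namely that every field separator is also a legal filename character.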
