I am using Python and the Twisted framework to connect to an FTP site to perform various automated tasks. Our FTP server happens to be Pure-FTPd, if that's relevant.
When connecting and calling the list method on an FTPClient, the resulting FTPFileListProtocol's files collection does not contain any directories or file names that contain a space (' ').
Has anyone else seen this? Is the only solution to create a sub-class of FTPFileListProtocol and override its unknownLine method, parsing the file/directory names manually?
Firstly, if you're performing automated tasks on a retrieved FTP listing then you should probably be looking at NLST rather than LIST, as noted in RFC 959 section 4.1.3:
NAME LIST (NLST)
...
This command is intended to return information that
can be used by a program to further process the
files automatically.
The Twisted documentation for LIST says:
It can cope with most common file listing formats.
This makes me suspicious; I do not like solutions that "cope". LIST was intended for human consumption, not machine processing.
If your target server supports them then you should prefer MLST and MLSD as defined in RFC 3659 section 7:
7. Listings for Machine Processing (MLST and MLSD)
The MLST and MLSD commands are intended to standardize the file and
directory information returned by the server-FTP process. These
commands differ from the LIST command in that the format of the
replies is strictly defined although extensible.
However, these newer commands may not be available on your target server and I don't see them in Twisted. Therefore NLST is probably your best bet.
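For illustration, here is a rough, hedged sketch of fetching an NLST listing with Twisted's FTPClient; the host, credentials and the little buffering protocol are placeholders for your actual setup, so treat it as a starting point rather than a drop-in solution:

from twisted.internet import reactor
from twisted.internet.protocol import ClientCreator, Protocol
from twisted.protocols.ftp import FTPClient

class NLSTBuffer(Protocol):
    """Accumulates the raw NLST output sent over the data connection."""
    def __init__(self):
        self.data = b""
    def dataReceived(self, data):
        self.data += data

def connected(client):
    buf = NLSTBuffer()
    d = client.nlst('.', buf)                     # NLST of the current directory
    d.addCallback(lambda _: print(buf.data.splitlines()))
    d.addBoth(lambda _: reactor.stop())
    return d

creator = ClientCreator(reactor, FTPClient, 'anonymous', 'guest@example.com')
creator.connectTCP('ftp.example.com', 21).addCallback(connected)
reactor.run()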
As to the nub of your problem, there are three likely causes:
The processing of the returned results is incorrect (Twisted may be at fault, as you suggest, or the fault may lie elsewhere)
The server is buggy and not sending a correct (complete) response
The wrong command is being sent (unlikely with straight NLST/LIST, but some servers react differently if arguments are supplied to these commands)
You can eliminate (2) and (3), and prove that the cause is (1), by looking at what is sent over the wire. If this option is not available to you as part of the Twisted API or the Pure-FTPd server logging configuration, then you may need to break out a network sniffer such as tcpdump, snoop or Wireshark (assuming you're allowed to do this in your environment). Note that you will need to trace not only the control connection (port 21) but also the data connection (since that carries the results of the LIST/NLST command). Wireshark is nice since it will perform the protocol-level analysis for you.
Good luck.
This is somewhat expected. FTPFileListProtocol isn't able to understand every FTP output, because, well, some are wacky. As explained in the docstring:
If you need different evil for a wacky FTP server, you can
override either C{fileLinePattern} or C{parseDirectoryLine()}.
In this case, it may be a bug: maybe you can improve fileLinePattern and make it understand filenames with spaces. If so, you're welcome to open a bug in the Twisted tracker.
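As a hedged sketch of what such a subclass could look like: the named groups mirror what the stock parseDirectoryLine expects, and only the filename group is relaxed so names containing spaces are captured. The regular expression below is illustrative and will likely need tuning against your server's actual listing format.

import re
from twisted.protocols.ftp import FTPFileListProtocol

class SpaceTolerantFileListProtocol(FTPFileListProtocol):
    # Illustrative pattern: same group names as the default, but the
    # filename group accepts embedded spaces up to an optional "-> link".
    fileLinePattern = re.compile(
        r'^(?P<filetype>.)(?P<perms>.{9})\s+(?P<nlinks>\d*)\s*'
        r'(?P<owner>\S+)\s+(?P<group>\S+)\s+(?P<size>\d+)\s+'
        r'(?P<date>\S+\s+\S+\s+\S+)\s+(?P<filename>.*?)'
        r'( -> (?P<linktarget>[^\r]*))?\r?$'
    )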
I've setup a running Pymodbus server based on the 'Updating Server' (v2.5.3) example.
https://pymodbus.readthedocs.io/en/v2.5.3/source/example/updating_server.html
Everything works ok.
Now I want to trigger a function (which will simply increment a value by 1) whenever any client requests to read/poll the contents of the holding registers.
Console output, when client requests function code 3
My knowledge of Pymodbus is limited, so any help would be great.
There is likely a way to overload the StartTCPServer function, though looking through the source code it seems tricky since it's built on asyncio (not like that's bad, just less conducive to easily overloading this one thing you're trying to do).
However, there are several Modbus libraries you can use, and sourceperl's pyModbusTCP makes it very easy to do exactly what you're trying to do. Check out the example https://github.com/sourceperl/pyModbusTCP/blob/master/examples/server_change_log.py
There they use the built-in hooks for doing just that, but you can easily overload any part of the server if you wanted to, for example to capture the incoming Modbus message before it gets processed.
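For example, a hedged sketch along those lines: a DataBank subclass whose holding-register read path bumps a counter every time a client polls (function code 3). The constructor and method names follow the pyModbusTCP 0.2.x API, so double-check them against your installed version.

from pyModbusTCP.server import ModbusServer, DataBank

class CountingDataBank(DataBank):
    """DataBank that counts every holding-register read (FC 3)."""
    def __init__(self):
        super(CountingDataBank, self).__init__()
        self.poll_count = 0

    def get_holding_registers(self, address, number=1, srv_info=None):
        # Called on every read of holding registers by any client.
        self.poll_count += 1
        print("holding registers polled %d times" % self.poll_count)
        return super(CountingDataBank, self).get_holding_registers(
            address, number, srv_info)

server = ModbusServer(host="0.0.0.0", port=502, data_bank=CountingDataBank())
server.start()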
Objective
I need a reliable way to check in Python whether a domain of any TLD has been registered or is available. The key points I'm struggling with are covering any TLD and reliably telling registered from available.
What have I tried?
WHOIS is the obvious way to do the check and an existing Python library like the popular python-whois was my first try. The problem is that it doesn't seem to be able to retrieve information for some of the TLDs, e.g. .run, while it works mostly fine for older ones, e.g. .com.
So if python-whois is not reliable, maybe just a wrapper for Linux's whois would be better. I tried the whois library, and unfortunately it supports only a limited set of TLDs, apparently to make sure it can always parse the results.
As I don't really need to parse the results, I ripped the code out of the whois library and tried to do the query by calling Linux's whois myself:
import subprocess

# Run the system whois client, capturing stdout and stderr together.
p = subprocess.Popen(['whois', 'example.com'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
r = p.communicate()[0]
print(r.decode())
That works much better. Except it's not that reliable either. I tried one particular domain and got "Your connection limit exceeded. Please slow down and try again later." Well, it's not me who is exceeding the limit. Being behind a single IP in a huge office means that somebody else might hit the limit before I make a query.
Another thought was not to use WHOIS and instead do a DNS lookup. However, I need to deal with domains that are registered (or in the protected phase after expiry) but don't have DNS records, so this is apparently not possible.
Last idea was to do the queries via an API of some 3rd party service. The problem is trust in those services as they might snatch an available domain that I check.
Similar questions
There are already similar questions:
a stable way to check domain availability with pywhois
Testing domain-name availability with pythonwhois
...but they either deal only with a limited set of TLDs or are not that bothered by reliability.
If you do not have specific access (like being a registrar), and if you do not target a specific TLD (some TLDs do have a dedicated public domain-availability service), the only tool that makes sense is querying whois servers.
You then have at least the following two problems:
use the appropriate whois server based on the given domain name
taking into account that whois servers are rate-limited, so if you bulk-query them without care you will first hit delays and then even risk having your IP blacklisted for some time.
For the second point, the usual methods apply (handling delays on your side, using multiple endpoints, etc.).
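As a purely illustrative sketch of the client-side delay handling (do_query here stands for whatever query function you end up using; it is not a real library call):

import time

def query_with_backoff(domain, do_query, max_tries=5, base_delay=2.0):
    delay = base_delay
    for _ in range(max_tries):
        text = do_query(domain)
        if "connection limit exceeded" not in text.lower():
            return text
        time.sleep(delay)   # back off before retrying
        delay *= 2          # exponential backoff
    raise RuntimeError("whois rate limit persisted for %s" % domain)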
For the first point, in another of my replies here: https://unix.stackexchange.com/a/407030/211833 you can find some explanation of what you observe depending on the wrapper around whois you use, along with some countermeasures. See also my other reply here: https://webmasters.stackexchange.com/a/111639/75842, specifically point 2.
Note that depending on your specific requirements, and if you are able to change at least part of them, you may have other solutions. For example, for gTLDs, if you can tolerate a 24-hour delay, you may use the registries' published zone files to find registered domain names (only those published, so not all of them).
Also, while you are right in a generic sense that using a third party has its weaknesses, if you find a trustworthy registrar that both has access to many registries and provides you with an API, you could then use it for your needs.
In short, I do not believe you can achieve this task in all cases (100% reliability, 100% of TLDs, etc.). You will need some compromises, but what they are depends on your initial needs.
Also, very important: do not shell out to run a whois command; this creates many security and performance problems. Use an appropriate library from your programming language to do whois queries, or just open a TCP socket to port 43, send your query on one line terminated by CR+LF, and read back a blob of text; that is basically all that RFC 3912 defines.
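A minimal sketch of such a raw RFC 3912 query (whois.verisign-grs.com is the .com registry's server; for other TLDs you would first ask whois.iana.org which server is authoritative):

import socket

def whois_query(domain, server, port=43, timeout=10):
    # RFC 3912: send the query on one line terminated by CR LF,
    # then read the response until the server closes the connection.
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

print(whois_query("example.com", "whois.verisign-grs.com"))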
Later note: the issues in the original posting below have been largely resolved.
Here's the background: For an introductory comp sci course, students develop html and server-side Python 2.7 scripts using a server provided by the instructors. That server is based on CGIHTTPRequestHandler, like the one at pointlessprogramming. When the students' html and scripts seem correct, they port those files to a remote, slow Apache server. Why support two servers? Well, the initial development using a local server has the benefit of reducing network issues and dependency on the remote, weak machine that is running Apache. Eventually porting to the Apache-running machine has the benefit of publishing their results for others to see.
For the local development to be most useful, the local server should closely resemble the Apache server. Currently there is an important difference: Apache requires that a script start its response with headers that include a content-type; if the script fails to provide such a header, Apache sends the client a 500 error ("Internal Server Error"), which is too generic to help the students, who cannot use the server logs. CGIHTTPRequestHandler imposes no similar requirement. So it is common for a student to write header-free scripts that work with the local server, but get the baffling 500 error after copying files to the Apache server. It would be helpful to have a version of the local server that checks for a content-type header and gives a good error if there is none.
I seek advice about creating such a server. I am new to Python and to writing servers. Here are the issues that occur to me, but any helpful advice would be appreciated.
Is a content-type header required by the CGI standard? If so, other people might benefit from an answer to the main question here. Also, if so, I despair of finding a way to disable Apache's requirement. Maybe the relevant part of the CGI RFC is section 6.3.1 (CGI Response, Content-Type): "If an entity body is returned, the script MUST supply a Content-Type field in the response."
To make a local server that checks for the content-type header, perhaps I should sub-class CGIHTTPServer.CGIHTTPRequestHandler, to override run_cgi() with a version that issues an error for a missing header. I am looking at CGIHTTPServer.py __version__ = "0.4", which was installed with Python 2.7.3. But run_cgi() does a lot of processing, so it is a little unappealing to copy all its code, just to add a couple calls to a header-checking routine. Is there a better way?
If the answer to (2) is something like "No, overriding run_cgi() is recommended," I anticipate writing a version that invokes the desired script, then checks the script's output for headers before that output is sent to the client. There are apparently two places in the existing run_cgi() where the script is invoked:
3a. When run_cgi() is executed on a non-Unix system, the script is executed using Python's subprocess module. As a result, the standard output from the script will be available as an in-memory string, which I can presumably check for headers before the call to self.wfile.write. Does this sound right?
3b. But when run_cgi() is executed on a *nix system, the script is executed by a forked process. I think the child's stdout will write directly to self.wfile (I'm a little hazy on this), so I see no opportunity for the code in run_cgi() to check the output. Ugh. Any suggestions?
If analyzing the script's output is recommended, is email.parser the standard way to recognize whether there is a content-type header? Is another standard module recommended instead? (A rough sketch of such a check follows after this list of questions.)
Is there a more appropriate forum for asking the main question ("How can a CGI server based on CGIHTTPRequestHandler require...")? It seems odd to ask if there is a better forum for asking programming questions than Stack Overflow, but I guess anything is possible.
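For reference, a rough sketch (Python 2.7) of the kind of header check I have in mind, using email.parser on the script's captured stdout (passed in here as output); the split naively accepts either CRLF or LF blank lines between headers and body:

from email.parser import Parser

def has_content_type(output):
    # CGI responses separate headers from the body with a blank line.
    for sep in ('\r\n\r\n', '\n\n'):
        if sep in output:
            head = output.split(sep, 1)[0]
            break
    else:
        head = output
    headers = Parser().parsestr(head, headersonly=True)
    return headers['Content-Type'] is not None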
Thanks for any help.
It seems obvious that Twisted would use its twisted.names API and not any blocking way to resolve host names.
However, digging in the source code, I have been unable to find the place where the name resolution occurs. Could someone point me to the relevant source code where the host resolution occurs (when trying to do a connectTCP, for example)?
I really need to be sure that connectTCP won't use blocking DNS resolution.
It seems obvious, doesn't it?
Unfortunately:
Name resolution is not always configured in the obvious way. You think you just have to read /etc/resolv.conf? Even in the specific case of Linux and DNS, you might have to look in an arbitrary number of files to find name servers.
Name resolution is much more complex than just DNS. You have to do mDNS resolution, possibly look up some LDAP computer records, and then you have to honor local configuration dictating the ordering between these such as /etc/nsswitch.conf.
Name resolution is not exposed via a standard or useful non-blocking API. Even the glibc-specific getaddrinfo_a exposes its non-blockingness via SIGIO, not just a file descriptor you can watch. Which means that, like POSIX AIO, it's probably just a kernel thread behind your back anyway.
For these reasons, among others, Twisted defaults to using a resolver that just calls gethostbyname in a thread.
However, if you know that for your application it is appropriate to have DNS-only hostname resolution, and you'd like to use twisted.names rather than your platform resolver - in other words, if scale matters more to you than esoteric name-resolution use-cases - that is supported. You can install a resolver from twisted.names.client onto the reactor, appropriately configured for your application and all future built-in name resolutions will be made with that resolver.
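As a hedged sketch of that installation step (createResolver and installResolver exist in current Twisted releases, but check the API of the version you are running):

from twisted.internet import reactor
from twisted.names import client

# Build a DNS-only resolver (by default it reads resolv.conf and the hosts
# file) and make the reactor use it for future built-in name lookups,
# e.g. the resolution done by connectTCP.
resolver = client.createResolver()
reactor.installResolver(resolver)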
I'm not massively familiar with Twisted; I only recently started using it. It looks like it doesn't block, though only on platforms that support threading.
In twisted.internet.base, ReactorBase looks like it does the resolving through its resolve method, which returns a Deferred from self.resolver.getHostByName.
self.resolver is an instance of BlockingResolver by default, which does block, but it looks like, if the platform supports threading, the resolver instance is replaced by a ThreadedResolver in the ReactorBase._initThreads method.
So, I want to develop a proxy server that, when contacted, checks the size of what it will be downloading (using HEAD, most likely) and, if it's over a set size, splits the download across pipelined Range requests into reasonably sized segments (1 megabyte, or possibly set via a config file). Then, as it downloads and rotates the pipes, I want it to feed what it gets back to its client in order, so that if it is, say, a stream of media it will play easily. The goal is to split the too-large downloads into pipelines and leave the smaller ones alone. I am sorta unsure where to start. I did find other proxy servers (polipo) that could do pipelining/multiplexing as mentioned, but none worked as outlined above. So A. does anything exist that does this, and B. how would I get started? (I would prefer to work in Python if possible.)
I would look at Twisted (http://twistedmatrix.com/trac/); it is a great event-based networking library for Python. It takes a bit of getting used to, but it does this kind of thing very well.
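To give a feel for it, here is a rough, hedged sketch with twisted.web.client.Agent of the HEAD-then-ranged-GET part only; the URL is a placeholder, error handling is omitted, and a real proxy would stream each chunk straight to its client instead of collecting them in memory:

from twisted.internet import defer, reactor
from twisted.web.client import Agent, readBody
from twisted.web.http_headers import Headers

CHUNK = 1024 * 1024                       # 1 MiB segments
URL = b"http://example.com/big-file"      # placeholder target

@defer.inlineCallbacks
def fetch_in_ranges(agent, url):
    head = yield agent.request(b"HEAD", url)
    size = int(head.headers.getRawHeaders(b"content-length")[0])
    parts = []
    for start in range(0, size, CHUNK):
        end = min(start + CHUNK, size) - 1
        headers = Headers({b"Range": [b"bytes=%d-%d" % (start, end)]})
        resp = yield agent.request(b"GET", url, headers)
        body = yield readBody(resp)
        parts.append(body)                # a proxy would forward this chunk now
    defer.returnValue(b"".join(parts))

d = fetch_in_ranges(Agent(reactor), URL)
d.addBoth(lambda _: reactor.stop())
reactor.run()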