python ftp retrieve lines -- performance issues

I am trying to retrieve lines from a file over an FTP connection using Python's ftplib module. It takes about 10 minutes to read a 1 GB file. I was wondering if there are any other ways to read the lines faster.
I should have included some code to show what I am doing:
ftp = ftplib.FTP('xxx')                     # host
ftp.login('xxx', 'xxx')                     # user id, and so on
ftp.retrlines('RETR ' + fileName, process)

Retrieving remote resources is usually bound by your bandwidth, and the FTP protocol does a decent job of using all of it.
Are you sure you aren't saturating your network connection? (What is the network link between the client running ftplib and the server you are downloading from?)
Back-of-the-envelope calculation:
1 GB / 10 min =~ 1.7 MB/s =~ 13 Mbps
So you are downloading at roughly 13 megabits per second. That's a decent speed for a remote DSL/cable/WAN connection, but obviously pretty low if this is all on a local network.

Can you show a minimal code sample of what you are doing? FTP is for transporting files,
so retrieving lines from a remote file isn't necessarily as efficient as transferring the whole file once and reading it locally.
Aside from that, have you verified that this connection can go any faster at all?
EDIT: if you try the following and it is not a bit faster, then you are limited by your OS or your connection:
ftp.retrbinary('RETR ' + fileName, open(temp_file_name, 'wb').write)
The assumption here is that FTP text mode might be somewhat less efficient (on the server side), which might be false or of minuscule relevance.
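As a rough way to test that, here is a minimal sketch (plain ftplib; the host, credentials and fileName placeholders are taken from the question) that times a binary download so you can compare it against retrlines:

import ftplib
import time

ftp = ftplib.FTP('xxx')        # placeholder host
ftp.login('xxx', 'xxx')        # placeholder credentials

start = time.time()
with open('local_copy', 'wb') as f:
    # A blocksize larger than the 8 KB default can help on fast links.
    ftp.retrbinary('RETR ' + fileName, f.write, blocksize=64 * 1024)
print('binary transfer took %.1f s' % (time.time() - start))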

Related

Python FTP and Streams

I need to create a "turning table" platform. My server must be able to take a file from FTP A and send it to FTP B. I have built a lot of file transfer systems, so I have no problem with ftplib, Aspera, S3 and other transfer protocols.
The thing is that I have big files (150 GB) on FTP A, and many transfers will occur at the same time, from and to many FTP servers and other endpoints.
I don't want my platform to actually store these files in order to send them to another location, and I don't want to load everything into memory either. I need to "stream" binary data from A to B, with minimal load on my transfer platform.
I am looking at https://docs.python.org/2/library/io.html with ReadBuffer and WriteBuffer, but I can't find examples and the documentation is somewhat cryptic to me.
Does anyone have a starting point? Here is what I have so far:
import io

buff = io.open('/var/tmp/test', 'wb')

def loadbuff(data):
    buff.write(data)

# blocksize=8 means the callback receives 8-byte chunks
self.ftp.retrbinary('RETR ' + name, loadbuff, blocksize=8)
So my data is coming into buff, which is a <_io.BufferedWriter name='/var/tmp/test'> object, but how can I start reading from it while ftplib keeps downloading?
I hope I'm clear enough; any idea is welcome.
Thanks
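One commonly suggested approach is to skip the local file entirely and connect the two data streams, so the destination's storbinary reads straight from the source's data socket. A minimal sketch, assuming two already-logged-in ftplib connections; ftp_src, ftp_dst and the relay helper are illustrative names, not part of the question's code:

import ftplib

def relay(ftp_src, ftp_dst, name, blocksize=64 * 1024):
    # Open the source data connection ourselves so we get the raw socket.
    src_sock = ftp_src.transfercmd('RETR ' + name)
    try:
        with src_sock.makefile('rb') as src_fp:
            # storbinary() reads blocksize-sized chunks from src_fp and writes
            # them straight to the destination's data connection, so only one
            # chunk at a time is held in memory.
            ftp_dst.storbinary('STOR ' + name, src_fp, blocksize=blocksize)
    finally:
        src_sock.close()
        ftp_src.voidresp()  # consume the source's '226 Transfer complete' reply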

Get large files from FTP with python lib

I need to download some large files (>30 GB per file) from an FTP server. I'm using ftplib from the Python standard library, but there are some pitfalls: if I download a large file, I cannot use the connection anymore once the file finishes. I get an EOFError afterwards, so the connection is closed (due to a timeout?) and for each succeeding file I get error 421.
From what I read, there are two connections: the data channel and the control channel. The data channel seems to work correctly (I can download the file completely), but the control channel times out in the meantime.
I also read that ftplib (and other Python FTP libraries) are not suited for large files and may only support files up to around 1 GB.
There is a similar question on this topic here: How to download big file in python via ftp (with monitoring & reconnect)? It is not quite the same, because my files are huge in comparison.
My current code looks like this:
import ftplib
import tempfile

ftp = ftplib.FTP_TLS()
ftp.connect(host=server, port=port)
ftp.login(user=user, passwd=password)
ftp.prot_p()
ftp.cwd(folder)

for file in ftp.nlst():
    fd, local_filename = tempfile.mkstemp()
    f = open(fd, "wb")
    ftp.retrbinary('RETR %s' % file, callback=f.write, blocksize=8192)
    f.close()
Is there any tweak to it, or another library that I can use, which does support huge files?
If you experience issues with standard FTP, you can try using a different protocol that is specifically designed to handle such large files.
A number of suitable solutions exist. Rsync would probably be a good way to start.
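For example, if the same host is also reachable over SSH/rsync, a minimal sketch of shelling out to rsync from Python (the user, host and folder paths are placeholders, not from the question):

import subprocess

# --partial keeps partially transferred files, so an interrupted 30 GB
# download can be resumed instead of restarted from zero.
subprocess.run(
    ["rsync", "--partial", "--progress",
     "user@host:/remote/folder/", "/local/folder/"],
    check=True,
)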

python ftpclient limit connections

I have a bit of a problem with ftplib from Python. It seems that it uses, by default, two connections (one for sending commands, one for data transfer?). However, my FTP server only accepts one connection at any given time. Since the only file that needs to be transferred is only about 1 MB, the reasoning of being able to abort in-flight commands does not apply here.
Previously the same job was done by the Windows command-line ftp client, so I could just call that client from Python, but I would really prefer a pure Python solution.
Is there a way to tell ftplib that it should limit itself to a single connection? In FileZilla I'm able to "limit the maximum number of simultaneous connections"; ideally I would like to reproduce this functionality.
Thanks for your help.
It seems that it uses, by default, two connections (one for sending commands, one for data transfer?).
That's how FTP works. You have a control connection (usually port 21) for commands, and a data connection, on a dynamically negotiated port, for data transfers, file listings and so on.
However, my FTP server only accepts one connection at any given time.
The FTP server might have a limit on multiple control connections, but it must still accept data connections. Could you please show, from tcpdump, Wireshark, log files, etc., why you think multiple connections are the problem?
In FileZilla I'm able to "limit the maximum number of simultaneous connections"
This is for the number of control connections only. Does it work with FileZilla? Because I doubt that ftplib opens multiple control connections.
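If the server turns out to be picky about how the data connection is set up, one knob ftplib does expose is passive vs. active mode; a minimal sketch with placeholder host and credentials:

import ftplib

ftp = ftplib.FTP('ftp.example.com')   # placeholder host
ftp.login('user', 'password')         # placeholder credentials

# ftplib keeps exactly one control connection; each transfer opens one
# short-lived data connection. In passive mode (the default) the client
# opens that data connection; in active mode the server connects back.
ftp.set_pasv(False)                   # try active mode if passive connections are refused

ftp.retrlines('LIST')                 # even a directory listing uses a data connection
ftp.quit()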

pysftp and paramiko stop uploading files after a few seconds

I am using Python 3.4 and pysftp (pysftp is supposed to work on 3.4).
pysftp is a wrapper over paramiko.
I have no problem downloading files.
I can also upload small files.
When I am uploading files that take longer than a few seconds to complete, however, I get an error.
I monitored my internet connection: after about 3 seconds there is no more uploading taking place.
After ~5 minutes I receive an EOFError.
I also experimented with the paramiko module directly, with the same results.
I can upload files using OpenSSH as well as FileZilla without problems.
with pysftp.Connection(host="host", username="python",
                       password="pass", port=2222) as srv:
    print('server connected')
    srv.put(file_name)
I would like to be able to upload files larger than a few KB... What am I missing?
It seems that paramiko is not adjusting the window during file uploads. You can increase window_size manually:
import os
import pysftp

with pysftp.Connection(host="host", username="python",
                       password="pass", port=2222) as srv:
    print('server connected')
    channel = srv.sftp_client.get_channel()
    channel.lock.acquire()
    # Grow the outgoing flow-control window by the size of the file to send.
    channel.out_window_size += os.stat(file_name).st_size
    channel.out_buffer_cv.notifyAll()
    channel.lock.release()
    srv.put(file_name)
It works for me, but sometimes it is not enough for large files, so I add some extra bytes. I think some packets may be lost, and that it depends on the connection.
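A rough alternative sketch is to ask for a bigger window when the SFTP session is opened, rather than patching the channel afterwards. This assumes a paramiko version (1.15+) where SFTPClient.from_transport accepts window_size and max_packet_size; the host, port and credentials are placeholders:

import paramiko

transport = paramiko.Transport(("host", 2222))           # placeholder host/port
transport.connect(username="python", password="pass")    # placeholder credentials

sftp = paramiko.SFTPClient.from_transport(
    transport,
    window_size=2 ** 27,        # ~134 MB flow-control window
    max_packet_size=2 ** 15,    # 32 KB packets
)
sftp.put(file_name, file_name)  # local path, remote path
sftp.close()
transport.close()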

python s3 boto connection.close causes an error

I have code that writes files to S3. The code was working fine:
conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(BUCKET, validate=False)
k = Key(bucket)
k.key = self.filekey
k.set_metadata('Content-Type', 'text/javascript')
k.set_contents_from_string(json.dumps(self.output))
k.set_acl(FILE_ACL)
This was working just fine. Then I noticed I wasn't closing my connection so I added this line at the end:
conn.close()
Now the file writes as before, but I'm seeing this error in my logs:
S3Connection instance has no attribute '_cache', unable to write file
Does anyone see what I'm doing wrong here, or know what's causing this? I noticed that none of the boto tutorials show people closing connections, but I know you should close your connections for I/O operations as a general rule...
EDIT
A note about this: when I comment out conn.close(), the error disappears.
I can't find that error message in the latest boto source code, so unfortunately I can't tell you what caused it. Recently, we had problems when we were NOT calling conn.close(), so there definitely is at least one case where you must close the connection. Here's my understanding of what's going on:
S3Connection (well, its parent class) handles almost all connectivity details transparently, and you shouldn't have to think about closing resources, reconnecting, and so on. This is why most tutorials and docs don't mention closing resources. In fact, I only know of one situation where you should close resources explicitly, which I describe at the bottom. Read on!
Under the covers, boto uses httplib. This client library supports HTTP 1.1 Keep-Alive, so it can and should keep the socket open so that it can perform multiple requests over the same connection.
AWS will close your connection (socket) for two reasons:
According to the boto source code, "AWS starts timing things out after three minutes." Presumably "things" means "idle connections."
According to Best Practices for Using Amazon S3, "S3 will accept up to 100 requests before it closes a connection (resulting in 'connection reset')."
Fortunately, boto works around the first case by recycling stale connections well before three minutes are up. Unfortunately, boto doesn't handle the second case quite so transparently:
When AWS closes a connection, your end of the connection goes into CLOSE_WAIT, which means that the socket is waiting for the application to execute close(). S3Connection handles connectivity details so transparently that you cannot actually do this directly! It's best to prevent it from happening in the first place.
So, circling back to the original question of when you need to close explicitly: if your application runs for a long time, keeps a reference to (reuses) a boto connection for a long time, and makes many boto S3 requests over that connection (thus triggering a "connection reset" on the socket by AWS), then you may find that more and more sockets end up in CLOSE_WAIT. You can check for this condition on Linux by running netstat | grep CLOSE_WAIT. To prevent it, make an explicit call to boto's connection.close before you've made 100 requests. We make hundreds of thousands of S3 requests in a long-running process, and we call connection.close after every, say, 80 requests.
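As an illustration of that pattern, a rough sketch of recycling the connection every ~80 requests (items, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and BUCKET are placeholders, not from the question):

from boto.s3.connection import S3Connection

conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(BUCKET, validate=False)

for i, (key_name, payload) in enumerate(items, start=1):
    bucket.new_key(key_name).set_contents_from_string(payload)
    if i % 80 == 0:
        # Recycle the connection well before AWS resets it at ~100 requests.
        conn.close()
        conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
        bucket = conn.get_bucket(BUCKET, validate=False)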
