Get large files from FTP with python lib - python

I need to download some large files (>30 GB per file) from an FTP server. I'm using ftplib from the Python standard library, but there are some pitfalls: if I download a large file, I can no longer use the connection after the file finishes. I get an EOFError afterwards, so the connection is closed (due to a timeout?), and for each subsequent file I get error 421.
From what I read, there are two connections: the data channel and the control channel. The data channel seems to work correctly (I can download the file completely), but the control channel times out in the meantime.
I also read that ftplib (and other Python FTP libraries) are not suited for large files and may only support files up to around 1 GB.
There is a similar question on this topic here: How to download big file in python via ftp (with monitoring & reconnect)? It is not quite the same, because my files are huge in comparison.
My current code looks like this:
import ftplib
import tempfile

ftp = ftplib.FTP_TLS()
ftp.connect(host=server, port=port)
ftp.login(user=user, passwd=password)
ftp.prot_p()
ftp.cwd(folder)

for file in ftp.nlst():
    fd, local_filename = tempfile.mkstemp()
    f = open(fd, "wb")
    ftp.retrbinary('RETR %s' % file, callback=f.write, blocksize=8192)
    f.close()
Is there any tweak to this, or another library that I can use, which does support huge files?

If you experience issues with standard FTP, you can try using a different protocol that is specifically designed to handle such large files.
A number of suitable solutions exist. Rsync would probably be a good way to start.
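For instance, if the files are also reachable over SSH (an assumption, not stated in the question), rsync can resume interrupted transfers on its own. A minimal sketch of invoking it from Python, with the host, paths, and flags chosen for illustration:

import subprocess

# Hypothetical host and paths; rsync over SSH keeps partial files and can
# resume them, which suits multi-GB downloads over flaky connections.
subprocess.run(
    [
        "rsync",
        "--partial",         # keep partially transferred files so a retry can resume
        "--append-verify",   # resume by appending, then verify the whole file
        "--progress",
        "-e", "ssh",
        "user@example.com:/remote/folder/bigfile.bin",
        "/local/download/dir/",
    ],
    check=True,
)

Each interrupted run can simply be repeated; rsync picks up where the previous transfer stopped instead of starting over.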

Related

How to take data from SFTP server without downloading files in Python?

I have an SFTP server. I can take data by transferring/downloading files. Is there a way to do this without downloading the files?
My code is as below:
import pysftp

# Connection to the SFTP server
with pysftp.Connection(hostname, username=username, password=password, port=port) as sftp:
    with sftp.cd('directory'):
        sftp.get('filename.txt')
This code downloads the file to my local machine.
Yes and no. You can use the data from the remote (SFTP) server without storing the files to a local disk.
But you cannot use the data locally without downloading it. That's impossible. You have to transfer the data to use it, at least into the memory of the local machine.
See A way to load big data on Python from SFTP server, not using my hard disk.
My answer there talks about Paramiko. But pysftp is just a thin wrapper around Paramiko. Its Connection.open maps directly to the underlying Paramiko SFTPClient.open. So you can keep using pysftp:
with sftp.open('filename.txt', bufsize=32768) as f:
    # use f as if you had opened a local file with open()
Though I'd recommend you not to: pysftp vs. Paramiko.
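To make that concrete, a small sketch of processing the remote file line by line without ever writing it to local disk (the file name and the process() call are placeholders, not from the original post):

import pysftp

with pysftp.Connection(hostname, username=username, password=password, port=port) as sftp:
    with sftp.cd('directory'):
        # The returned file-like object streams data over SFTP on demand;
        # nothing is stored on the local disk.
        with sftp.open('filename.txt', bufsize=32768) as f:
            for line in f:
                process(line)   # hypothetical processing function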

Python FTP and Streams

I need to create a "turning table" platform. My server must be able to take a file from FTP A and send it to FTP B. I have built a lot of file transfer systems, so I have no problem with ftplib, Aspera, S3 and other transfer protocols.
The thing is that I have big files (150 GB) on FTP A, and many transfers will occur at the same time, from and to many FTP servers and other endpoints.
I don't want my platform to actually store these files in order to send them to another location. I don't want to load everything into memory either... I need to "stream" binary data from A to B, with minimal load on my transfer platform.
I am looking at https://docs.python.org/2/library/io.html with ReadBuffer and WriteBuffer, but I can't find examples and the documentation is somewhat cryptic to me...
Does anyone have a starting point?
buff = io.open('/var/tmp/test', 'wb')

def loadbuff(data):
    buff.write(data)

self.ftp.retrbinary('RETR ' + name, loadbuff, blocksize=8)
So my data is coming into buff, which is a <_io.BufferedWriter name='/var/tmp/test'> object, but how can I start reading from it while ftplib keeps downloading?
Hope I'm clear enough; any idea is welcome.
Thanks
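One possible starting point, sketched here under assumptions rather than taken from the thread: feed ftplib's RETR callback into a bounded queue that a second connection drains with STOR, so only a handful of chunks are ever held in memory. Host names, credentials, and paths below are placeholders, and error handling is omitted:

import ftplib
import queue
import threading

CHUNK = 64 * 1024
q = queue.Queue(maxsize=16)          # bounds memory to ~16 chunks

class QueueReader:
    """File-like object whose read() hands storbinary the queued chunks."""
    def __init__(self):
        self.buf = b''
    def read(self, size):
        while not self.buf:
            chunk = q.get()
            if chunk is None:        # sentinel: download finished
                return b''
            self.buf += chunk
        data, self.buf = self.buf[:size], self.buf[size:]
        return data

def download(src, src_path):
    # q.put blocks when the queue is full, so the download slows down
    # instead of buffering the whole 150 GB file.
    src.retrbinary('RETR ' + src_path, q.put, blocksize=CHUNK)
    q.put(None)                      # signal end of stream

ftp_a = ftplib.FTP('ftp-a.example.com', 'user_a', 'password_a')
ftp_b = ftplib.FTP('ftp-b.example.com', 'user_b', 'password_b')

t = threading.Thread(target=download, args=(ftp_a, 'source/bigfile.bin'))
t.start()
ftp_b.storbinary('STOR ' + 'dest/bigfile.bin', QueueReader(), blocksize=CHUNK)
t.join()

ftp_a.quit()
ftp_b.quit()

The queue's maxsize is the knob for how much data sits on the platform at any moment: larger values smooth out speed differences between A and B, smaller values keep the memory footprint tighter.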

pysftp and paramiko stop uploading files after a few seconds

I am using Python 3.4 and pysftp (pysftp is supposed to work on 3.4).
Pysftp is a wrapper over paramiko.
I have no problem downloading files.
I can also upload small files.
However, when I upload files that take longer than a few seconds to complete, I get an error.
I monitored my internet connection: after about 3 seconds no more uploading takes place,
and after ~5 minutes I receive an EOFError.
I also experimented with the paramiko module, with the same results.
I can upload files using OpenSSH as well as FileZilla without problems.
with pysftp.Connection(host="host", username="python",
                       password="pass", port=2222) as srv:
    print('server connected')
    srv.put(file_name)
I would like to be able to upload files larger than a few kB... what am I missing?
It seems that paramiko is not adjusting the window during file uploading. You can increase window_size manually:
with pysftp.Connection(host="host", username="python",
                       password="pass", port=2222) as srv:
    print('server connected')
    channel = srv.sftp_client.get_channel()
    channel.lock.acquire()
    channel.out_window_size += os.stat(file_name).st_size
    channel.out_buffer_cv.notifyAll()
    channel.lock.release()
    srv.put(file_name)
It works for me, but sometimes it is not enough for large files, so I add some extra bytes. I think some packets may be lost, and it depends on the connection.
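If you are open to using Paramiko directly, another option (a sketch under assumptions, not from the original answer) is to request a larger window when the Transport is created, instead of patching the channel afterwards. The host, port, credentials, and sizes below are placeholders:

import paramiko

# Ask for a bigger SSH window and packet size up front, when the transport
# is created, rather than resizing the channel after the connection exists.
transport = paramiko.Transport(("host", 2222),
                               default_window_size=2147483647,
                               default_max_packet_size=32768)
transport.connect(username="python", password="pass")

sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put(file_name, file_name)   # file_name as in the question

sftp.close()
transport.close()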

Python FTP "ERRNO 10054" Sequential File Download

I've written some code to log on to an AS/400 FTP site, move to a certain directory, and locate files I need to download. It works, but it seems like when there are MANY files to download I receive:
socket.error: [Errno 10054] An existing connection was
forcibly closed by the remote host
I log on and navigate to the appropriate directory successfully:
try:
    newSession = ftplib.FTP(URL,username,password)
    newSession.set_debuglevel(3)
    newSession.cwd("SOME DIRECTORY")
except ftplib.all_errors, e:
    print str(e).split(None,1)
    sys.exit(0)
I grab a list of the files I need:
filesToDownload= filter(lambda x: "SOME_FILE_PREFIX" in x, newSession.nlst())
And here is where it is dying (specifically the newSession.retrbinary('RETR '+f,tempFileVar.write)):
for f in filesToDownload:
    newLocalFileName = f + ".edi"
    newLocalFilePath = os.path.join(directory,newLocalFileName)
    tempFileVar = open(newLocalFilePath,'wb')
    newSession.retrbinary('RETR '+f,tempFileVar.write)
    tempFileVar.close()
It downloads upwards of 85% of the files I need before I'm hit with Errno 10054, and I'm confused as to why it seems to die arbitrarily when it is so close to completion. My honest guess right now is that I'm making too many requests to the FTP server when trying to pull these files.
Any advice or pointers would be awesome. I'm still trying to troubleshoot this.
There's no real answer to this, I suppose; it seems the client's FTP server is at fault here, as it's incredibly unstable. The best I can do is a hacky workaround: catch the thrown socket error and resume where I left off in the previous session before being forcibly disconnected. The client's IT team is finally looking into the problem on their end.
Sigh.
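For reference, a sketch of that kind of workaround, assuming the server supports resuming via the REST command (the helper name, retry count, and arguments are illustrative, not from the original code): reconnect after the error and restart each interrupted download from the size of the partial local file, using the rest argument of retrbinary.

import ftplib
import os

def download_with_resume(host, username, password, directory, remote_name, local_path, retries=5):
    """Retry an FTP download, resuming from where the partial local file ends."""
    for attempt in range(retries):
        offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
        session = ftplib.FTP(host, username, password)
        session.cwd(directory)
        try:
            with open(local_path, 'ab') as local_file:
                # rest asks the server to start sending at byte `offset`
                session.retrbinary('RETR ' + remote_name, local_file.write,
                                   rest=offset if offset else None)
            return
        except ftplib.all_errors:
            continue  # connection dropped; reconnect and resume on the next attempt
        finally:
            try:
                session.quit()
            except Exception:
                session.close()
    raise IOError("giving up on %s after %d attempts" % (remote_name, retries))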

python ftp retrieve lines -- performance issues

I am trying to retrieve lines from a file over an FTP connection using the ftplib module of Python. It takes about 10 minutes to read a file of size 1 GB. I was wondering if there are any other ways to read the lines faster.
I should have included some code to show what I am doing:
ftp.HostName = 'xxx'
ftp.Userid = 'xxx' #so on
ftp.conn.retrlines('RETR ' + fileName, process)
Retrieving remote resources is usually bound by your bandwidth, and the FTP protocol does a decent job of using it all.
Are you sure you aren't saturating your network connection? (what is the network link between client running ftplib and server you are downloading from?)
Back of the envelope calc:
1GB/10mins =~ 1.7 MB/sec =~ 13 Mbps
So you are downloading at about 13 megabits per second. That's a decent speed for a remote DSL/cable/WAN connection, but obviously pretty low if this is all on a local network.
Can you show a minimal code sample of what you are doing? FTP is for transporting files,
so retrieving lines from a remote file isn't necessarily as efficient as transferring the whole file once and reading it locally.
Aside from that, have you verified that this connection can be any faster?
EDIT: if you try the following and it is not any faster, then you are limited by your OS or your connection:
ftp.conn.retrbinary('RETR ' + fileName, open(temp_file_name, 'wb').write)
The assumption here is that FTP text mode might be somewhat less efficient (on the server side), which might be false or of minuscule relevance.
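To make that concrete, a rough sketch of downloading the file in binary with a larger block size and then reading the lines from the local copy (the local file name and block size are illustrative; fileName and process are taken from the question):

import ftplib

ftp = ftplib.FTP('host', 'user', 'password')   # placeholder credentials

# Transfer the whole file in binary mode with a larger block size ...
with open('local_copy.txt', 'wb') as local_file:
    ftp.retrbinary('RETR ' + fileName, local_file.write, blocksize=1024 * 1024)

# ... then process the lines from the local copy.
with open('local_copy.txt') as local_file:
    for line in local_file:
        process(line)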
