I am using Python 3.4 and pysftp (pysftp is supposed to work on 3.4).
Pysftp is a wrapper over paramiko.
I have no problem downloading files.
I can also upload small files.
However, when I upload files that take longer than a few seconds to complete, I get an error.
I monitored my internet connection: after about 3 seconds no more uploading takes place,
and after ~5 minutes I receive an EOFError.
I also experimented with the paramiko module with the same results.
I can upload files using OpenSSH as well as FileZilla without a problem.
import pysftp

with pysftp.Connection(host="host", username="python",
                       password="pass", port=2222) as srv:
    print('server connected')
    srv.put(file_name)
I would like to be able to upload files larger than a few KB... what am I missing?
It seems that Paramiko is not adjusting the window size during the file upload. You can increase window_size manually:
import os
import pysftp

with pysftp.Connection(host="host", username="python",
                       password="pass", port=2222) as srv:
    print('server connected')
    channel = srv.sftp_client.get_channel()
    channel.lock.acquire()
    channel.out_window_size += os.stat(file_name).st_size
    channel.out_buffer_cv.notifyAll()
    channel.lock.release()
    srv.put(file_name)
It works for me, but sometimes it is not enough for large files, so I add some extra bytes. I think some packets may be lost, and it depends on the connection.
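For example, here is a minimal sketch of that variant; the MARGIN value is an arbitrary assumption that you would need to tune for your connection:

import os
import pysftp

MARGIN = 5 * 1024 * 1024  # hypothetical safety margin (5 MB) on top of the file size

with pysftp.Connection(host="host", username="python",
                       password="pass", port=2222) as srv:
    channel = srv.sftp_client.get_channel()
    with channel.lock:
        # grow the outgoing window by the file size plus the margin
        channel.out_window_size += os.stat(file_name).st_size + MARGIN
        channel.out_buffer_cv.notifyAll()
    srv.put(file_name)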
I have an SFTP server. I can get the data by transferring/downloading the files. Is there a way I can use the data without downloading the files?
My code is as below:
import pysftp

# Connection to the SFTP server
with pysftp.Connection(hostname, username=username, password=password, port=port) as sftp:
    with sftp.cd('directory'):
        sftp.get('filename.txt')
This code downloads the file to my local machine.
Yes and no. You can use the data from the remote (SFTP) server without storing the files to a local disk.
But you cannot use the data locally without downloading them. That's impossible. You have to transfer the data to use them, at least into the memory of the local machine.
See A way to load big data on Python from SFTP server, not using my hard disk.
My answer there talks about Paramiko. But pysftp is just a thin wrapper around Paramiko. Its Connection.open is directly mapped to the underlying Paramiko SFTPClient.open. So you can keep using pysftp:
with sftp.open('filename.txt', bufsize=32768) as f:
    # use f as if you had opened a local file with open()
Though I'd recommend you not to: pysftp vs. Paramiko.
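For illustration, a minimal sketch of processing the remote file line by line without saving it to disk; process is a hypothetical function standing in for whatever you do with the data:

import pysftp

with pysftp.Connection(hostname, username=username, password=password, port=port) as sftp:
    with sftp.cd('directory'):
        # stream the remote file; nothing is written to the local disk
        with sftp.open('filename.txt', bufsize=32768) as f:
            for line in f:
                process(line)  # hypothetical processing of each line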
I need to download some large files (>30 GB per file) from an FTP server. I'm using ftplib from the Python standard library, but there are some pitfalls: if I download a large file, I cannot use the connection anymore once the file finishes. I get an EOFError afterwards, so the connection is closed (due to a timeout?), and for each subsequent file I get error 421.
From what I read, there are two connections: the data channel and the control channel. The data channel seems to work correctly (I can download the file completely), but the control channel times out in the meantime.
I also read that ftplib (and other Python FTP libraries) is not suited for large files and may only support files up to around 1 GB.
There is a similar question on this topic here: How to download big file in python via ftp (with monitoring & reconnect)? It is not quite the same, because my files are huge in comparison.
My current code looks like this:
import ftplib
import tempfile

ftp = ftplib.FTP_TLS()
ftp.connect(host=server, port=port)
ftp.login(user=user, passwd=password)
ftp.prot_p()
ftp.cwd(folder)

for file in ftp.nlst():
    fd, local_filename = tempfile.mkstemp()
    f = open(fd, "wb")
    ftp.retrbinary('RETR %s' % file, callback=f.write, blocksize=8192)
    f.close()
Is there any tweak to it, or another library that I can use, which does support huge files?
If you experience issues with standard FTP, you can try using a different protocol that is specifically designed to handle such large files.
A number of suitable solutions exist. Rsync would probably be a good way to start.
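As a sketch of what that could look like from Python, assuming rsync and SSH access to the server are available (the host, paths and credentials are placeholders):

import subprocess

# --partial lets an interrupted transfer resume instead of restarting,
# which matters a lot for files of 30 GB and more
subprocess.run(
    ["rsync", "--partial", "--progress",
     "user@server:/remote/folder/bigfile.bin",
     "/local/folder/"],
    check=True,
)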
I have code that writes files to S3. The code was working fine:
import json

from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(BUCKET, validate=False)
k = Key(bucket)
k.key = self.filekey
k.set_metadata('Content-Type', 'text/javascript')
k.set_contents_from_string(json.dumps(self.output))
k.set_acl(FILE_ACL)
This was working just fine. Then I noticed I wasn't closing my connection so I added this line at the end:
conn.close()
Now the file writes as before, but I'm seeing this error in my logs:
S3Connection instance has no attribute '_cache', unable to write file
Anyone see what I'm doing wrong here or know what's causing this? I noticed that none of the tutorials on boto show people closing connections but I know you should close your connections for IO operations as a general rule...
EDIT
A note about this: when I comment out conn.close(), the error disappears.
I can't find that error message in the latest boto source code, so unfortunately I can't tell you what caused it. Recently, we had problems when we were NOT calling conn.close(), so there definitely is at least one case where you must close the connection. Here's my understanding of what's going on:
S3Connection (well, its parent class) handles almost all connectivity details transparently, and you shouldn't have to think about closing resources, reconnecting, etc. This is why most tutorials and docs don't mention closing resources. In fact, I only know of one situation where you should close resources explicitly, which I describe at the bottom. Read on!
Under the covers, boto uses httplib. This client library supports HTTP 1.1 Keep-Alive, so it can and should keep the socket open so that it can perform multiple requests over the same connection.
AWS will close your connection (socket) for two reasons:
1. According to the boto source code, "AWS starts timing things out after three minutes." Presumably "things" means "idle connections."
2. According to Best Practices for Using Amazon S3, "S3 will accept up to 100 requests before it closes a connection (resulting in 'connection reset')."
Fortunately, boto works around the first case by recycling stale connections well before three minutes are up. Unfortunately, boto doesn't handle the second case quite so transparently:
When AWS closes a connection, your end of the connection goes into CLOSE_WAIT, which means that the socket is waiting for the application to execute close(). S3Connection handles connectivity details so transparently that you cannot actually do this directly! It's best to prevent it from happening in the first place.
So, circling back to the original question of when you need to close explicitly: if your application runs for a long time, keeps a reference to (reuses) a boto connection for a long time, and makes many boto S3 requests over that connection (thus triggering a "connection reset" on the socket by AWS), then you may find that more and more sockets are in CLOSE_WAIT. You can check for this condition on Linux by running netstat | grep CLOSE_WAIT. To prevent it, make an explicit call to boto's connection.close before you've made 100 requests. We make hundreds of thousands of S3 requests in a long-running process, and we call connection.close after every, say, 80 requests.
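A minimal sketch of that pattern, assuming boto 2; the counter, threshold and helper function are hypothetical, with the threshold of 80 taken from the paragraph above:

from boto.s3.connection import S3Connection

MAX_REQUESTS_PER_CONNECTION = 80  # stay safely under AWS's ~100-request limit

conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
requests_made = 0

def get_connection():
    """Return an S3 connection, recycling it before AWS resets the socket."""
    global conn, requests_made
    if requests_made >= MAX_REQUESTS_PER_CONNECTION:
        conn.close()
        conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
        requests_made = 0
    requests_made += 1
    return conn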
I am trying to retrieve lines from a file through an FTP connection using the ftplib module of Python. It takes about 10 minutes to read a file of size 1 GB. I was wondering if there are any other ways to read the lines in a faster manner.
I should have included some code to show what I am doing:
import ftplib

ftp = ftplib.FTP('xxx')
ftp.login('xxx', 'xxx')  # and so on
ftp.retrlines('RETR ' + fileName, process)
Retrieving remote resources is usually bound by your bandwidth, and the FTP protocol does a decent job of using it all.
Are you sure you aren't saturating your network connection? (What is the network link between the client running ftplib and the server you are downloading from?)
Back of the envelope calc:
1GB/10mins =~ 1.7 MB/sec =~ 13 Mbps
So you are downloading at about 13 megabits per second. That's decent speed for a remote DSL/cable/WAN connection, but obviously pretty low if this is all on a local network.
Can you show a minimal code sample of what you are doing? FTP is for transporting files, so retrieving lines from a remote file isn't necessarily as efficient as transferring the whole file once and reading it locally.
Aside from that, have you verified that you can be faster on this connection?
EDIT: if you try the following and it is not a bit faster, then you are limited by your OS or your connection:
ftp.retrbinary('RETR ' + fileName, open(temp_file_name, 'wb').write)
The assumption here is that FTP text mode might be somewhat less efficient (on the server side), which might be false or of minuscule relevance.
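For comparison, a minimal sketch of the "transfer the whole file once, then read it locally" approach mentioned above; temp_file_name and process are placeholders:

import ftplib

ftp = ftplib.FTP('xxx')
ftp.login('xxx', 'xxx')

# download the whole file in binary mode...
with open(temp_file_name, 'wb') as f:
    ftp.retrbinary('RETR ' + fileName, f.write)

# ...then parse it locally, line by line
with open(temp_file_name) as f:
    for line in f:
        process(line)  # hypothetical per-line processing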
I'm using paramiko to connect to an SFTP server on which I have to download and process some files.
The server has a timeout set to 5 minutes, but some days the processing of the files takes longer than the timeout. So, when I then want to change the working directory on the server to process some other files (sftp.chdir(target_dir)), I get an exception that the connection has timed out:
File "build\bdist.win32\egg\paramiko\sftp.py", line 138, in _write_all
    raise EOFError()
To counter this I thought that activating the keep alive would be the best option so I used the "set_keepalive" on the transport to set it to 30 seconds:
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(ssh_server, port=ssh_port, username=ssh_user, password=password)
transport = ssh.get_transport()
transport.set_keepalive(30)
sftp = transport.open_sftp_client()
But nothing changes at all; the change has absolutely no effect. I don't know if I'm misunderstanding the concept of set_keepalive here, or maybe the server (to which I have no access) ignores the keep-alive packets.
Isn't this the right way to counter this problem or should I try a different approach? I don't like the idea of "manually" sending some ls command to the server to keep the session alive.
If the server is timing you out for inactivity, there's not much you can do from the client-side (other than perhaps send a simple command every now and again to keep your session from timing out).
Have you considered breaking apart your download and processing steps, so that you can download everything you need to start with, then process it either asynchronously, or after all downloads have completed?
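For example, a minimal sketch of that split, reusing the connection code from the question; remote_dir, local_dir and process_file are placeholders:

import os
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(ssh_server, port=ssh_port, username=ssh_user, password=password)
sftp = ssh.open_sftp()

# Phase 1: download everything while the session is still active
local_paths = []
for name in sftp.listdir(remote_dir):
    local_path = os.path.join(local_dir, name)
    sftp.get(remote_dir + '/' + name, local_path)
    local_paths.append(local_path)

sftp.close()
ssh.close()

# Phase 2: process the local copies; the server-side timeout no longer matters
for path in local_paths:
    process_file(path)  # hypothetical processing step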