Download a chunk of a large file using pysftp in Python

I have a use case where I want to read only the top 5 rows of a large CSV file that sits on my SFTP server, and I don't want to download the complete file just to read those rows. I am using pysftp in Python to interact with the server. Is there a way to download only a chunk of the file instead of the complete file with pysftp?
If there are any other libraries in Python or any other technique I can use, please guide me. Thanks.

First, do not use pysftp. It's a dead, unmaintained project. Use Paramiko instead. See pysftp vs. Paramiko.
If you want to read data from a specific point in the file, you can open a file-like object representing the remote file using the Paramiko SFTPClient.open method (or the equivalent pysftp Connection.open) and then use it as if you were accessing any local file:
Use .seek to set the read pointer to the desired offset.
Use .read to read data.
with sftp.open("/remote/path/file", "r", bufsize=32768) as f:
    f.seek(offset)
    data = f.read(count)
For the purpose of bufsize, see:
Writing to a file on SFTP server opened using Paramiko/pysftp "open" method is slow
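Putting that together for the "top 5 rows" use case, a minimal sketch with Paramiko (host, credentials and path are placeholders):
import paramiko

# Hypothetical connection details; adjust host, credentials and path.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("sftp.example.com", username="user", password="secret")

sftp = ssh.open_sftp()
with sftp.open("/remote/path/data.csv", "r", bufsize=32768) as f:
    # SFTPFile is a file-like object, so we can iterate over its lines;
    # only the bytes actually read are transferred over the wire.
    for i, line in enumerate(f):
        print(line.rstrip())
        if i == 4:  # stop after the first five rows
            break

sftp.close()
ssh.close()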

Related

Transfer file from S3 to Windows server

I have just been introduced to Python (PySpark). I have a requirement to achieve the following steps:
Extract data from a Hive table (on EMR) into a csv file on AWS S3
Transfer the csv file created on S3 (EMR cluster running Spark on YARN) to a remote Windows server (at a certain folder path)
Once the file has been transferred, trigger a batch file that exists on the Windows server at a certain folder path
The Windows batch script, when triggered, updates/enriches the transferred file with additional information, so transfer/copy the updated csv file back to S3
Load the updated file into a Hive table once it is transferred back to S3
I have figured out how to extract the data from the table into a csv file on S3 and also how to load the file into the table. However, I am struggling to get a bearing on how to perform the file transfer/copy between the servers and, most importantly, how to trigger the Windows batch script on the remote machine.
Could someone please point me in the right direction and hint at where I should start? I searched the internet but couldn't find a concrete answer. I understand that I have to use the Boto3 library to interact with S3; however, if there is any other established solution, please share it with me (code snippets, articles, etc.), along with any specific configurations that I might have to incorporate to achieve the result.
Thanks
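One common pattern for the middle steps is boto3 for the S3 legs plus Paramiko for both the copy and the remote trigger. A rough sketch, assuming the Windows server runs an SSH server (e.g. OpenSSH for Windows); every bucket, host, path and credential below is a placeholder:
import boto3
import paramiko

# 1. Pull the extracted csv from S3 to the machine running this script.
s3 = boto3.client("s3")
s3.download_file("my-bucket", "exports/data.csv", "/tmp/data.csv")

# 2. Copy the file to the Windows server over SFTP.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("winserver.example.com", username="user", password="secret")
sftp = ssh.open_sftp()
sftp.put("/tmp/data.csv", "C:/inbound/data.csv")
sftp.close()

# 3. Trigger the batch script remotely and wait for it to finish.
stdin, stdout, stderr = ssh.exec_command(r"C:\scripts\enrich.bat")
exit_status = stdout.channel.recv_exit_status()  # blocks until the script exits
ssh.close()
The reverse leg is the same calls in the opposite direction: sftp.get the enriched file, then s3.upload_file back to the bucket.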

How to preserve file mtime when uploading with Python ftplib

I am working on a Python tool to synchronize files between my local machines and a remote server. When I upload a file to the server, the modification time of that file on the server is set to the time of the upload and not to the mtime of the source file, which I want to preserve. I am using FTP.storbinary() from the Python ftplib to perform the upload. My question: is there a simple way to preserve the mtime when uploading, or to set it after the upload? Thanks.
Short answer: no. The Python ftplib module offers no option to transfer the time of a file. Furthermore, the FTP protocol as defined by RFC 959 has no provision to directly get or set the mtime of a file. It may be possible on some servers through SITE commands, but this is server-dependent.
If it is possible for you, you should be able to send a SITE command with the sendcmd method of a connection object. For example, if the server accepts a special SITE SETDATE filename iso-8601-date-string, you could use:
resp = ftp.sendcmd(f'SITE SETDATE {file_name} {date_string}')
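Alternatively, many servers implement the non-standard MFMT extension, which sets the modification time directly. A sketch with hypothetical connection details (check the server's FEAT response before relying on it):
import os
import ftplib
from datetime import datetime, timezone

# Hypothetical host/credentials; MFMT is widespread but not part of RFC 959.
ftp = ftplib.FTP("ftp.example.com")
ftp.login("user", "secret")

local_path = "report.csv"
with open(local_path, "rb") as f:
    ftp.storbinary("STOR report.csv", f)

# MFMT expects the timestamp as YYYYMMDDHHMMSS in UTC.
mtime = datetime.fromtimestamp(os.path.getmtime(local_path), tz=timezone.utc)
try:
    ftp.sendcmd(f"MFMT {mtime:%Y%m%d%H%M%S} report.csv")
except ftplib.error_perm:
    pass  # server does not implement MFMT

ftp.quit()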

Python: How to know if file is locked in FTP [duplicate]

My application keeps watch on a set of folders where users can upload files. When a file upload is finished, I have to apply a treatment, but I don't know how to detect that a file has not finished uploading.
Is there any way to detect that a file has not yet been released by the FTP server?
There's no generic solution to this problem.
Some FTP servers lock the file being uploaded, preventing you from accessing it while the file is still being uploaded. For example, the IIS FTP server does that. Most other FTP servers do not. See my answer to Prevent file from being accessed as it's being uploaded.
There are some common workarounds to the problem (originally posted in SFTP file lock mechanism, but relevant for FTP too):
You can have the client upload a "done" file once the upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look in the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after upload to its final name. Make your automated system ignore the ".filepart" files.
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach; a minimal client-side sketch also follows this list.
Also, some FTP servers have this functionality built in. For example, ProFTPD with its HiddenStores directive.
A gross hack is to periodically check the file attributes (size and time) and consider the upload finished if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have a clear end-of-file marker (like XML or ZIP), so you can tell that such a file is incomplete.
Some FTP servers allow you to configure a hook to be called when an upload is finished. You can make use of that. For example, ProFTPD has a mod_exec module (see the ExecOnCommand directive).
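For illustration, a minimal client-side sketch of that temporary-name convention, with hypothetical host and file names:
import ftplib

# Hypothetical connection details.
ftp = ftplib.FTP("ftp.example.com")
ftp.login("user", "secret")

# Upload under a temporary name that the watcher is configured to ignore.
with open("report.csv", "rb") as f:
    ftp.storbinary("STOR report.csv.filepart", f)

# Rename only after the upload completed; rename is atomic on most servers.
ftp.rename("report.csv.filepart", "report.csv")
ftp.quit()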
I use ftputil to implement this work-around:
connect to ftp server
list all files of the directory
call stat() on each file
wait N seconds
For each file, call stat() again. If the result differs, skip the file: it was modified during the wait.
If the stat() result is unchanged, download the file.
This whole FTP-fetching approach is old and obsolete technology. I hope the customer will use a modern HTTP API next time :-)
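A minimal sketch of that polling idea using plain ftplib (connection details are placeholders; ftputil works the same way, though note that it caches stat() results between calls):
import time
import ftplib

# Hypothetical connection details.
ftp = ftplib.FTP("ftp.example.com")
ftp.login("user", "secret")
ftp.voidcmd("TYPE I")  # many servers only answer SIZE reliably in binary mode

names = ftp.nlst("/upload")
sizes = {name: ftp.size(name) for name in names}

time.sleep(30)  # wait N seconds

for name in names:
    if ftp.size(name) != sizes[name]:
        continue  # still growing: it was modified during the wait
    # Size unchanged: assume the upload finished and download the file.
    with open(name.rsplit("/", 1)[-1], "wb") as local:
        ftp.retrbinary(f"RETR {name}", local.write)

ftp.quit()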
If you are watching for files of particular extensions, use WinSCP for the transfer. It creates a temporary file with the extension .filepart and renames it to the actual file name once the file has been fully transferred.
I hope it helps someone.
This is a classic problem with FTP transfers. The only mostly reliable method I've found is to send a file, then send a second short "marker" file just to tell the recipient the transfer of the first is complete. You can use a file naming convention and just check for existence of the second file.
You might get fancy and make the content of the second file a checksum of the first file. Then you could verify the first file. (You don't have the problem with the second file because you just wait until file size = checksum size).
And of course this only works if you can get the sender to send a second file.

Can I stream a GZIP dataset directly from a server through HTTP using python?

I want to read data from a GZIP dataset file directly from the internet without downloading the complete file. Considering the size of the dataset, is it possible in Python to stream the data directly from the server over HTTP and read it? I took a look at the zlib and gzip packages. I'm a beginner in Python; I want to know whether this is possible in Python or any other language, and if possible, any references to such code. Thanks in advance!
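Yes, this is possible: you can wrap the HTTP response's raw stream in gzip.GzipFile and decompress incrementally, so only the bytes you actually read are downloaded. A sketch using the requests library, with a placeholder URL:
import gzip
import requests

# Hypothetical URL of a gzip-compressed dataset served as-is.
url = "https://example.com/dataset.csv.gz"

with requests.get(url, stream=True) as r:
    r.raise_for_status()
    # r.raw is a file-like object over the socket; GzipFile decompresses
    # it incrementally as we iterate over lines.
    with gzip.GzipFile(fileobj=r.raw) as gz:
        for i, line in enumerate(gz):
            print(line.decode("utf-8").rstrip())
            if i == 4:  # read the first five lines, then stop
                break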

How to get certain types of files from one server to another in Python?

Suppose I have one server called server 1. On this server there's a directory called dir1. dir1 has 3 files in it called neh_iu.dat_1, neh_hj.dat_2, jen_ak.dat_1.
I need to get ONLY the 'neh' files from server 1 to another server called server 2. server 2 is where I will be performing certain modifications on these files.
How do I get ONLY the 'neh' files in Python? I'm new to Python. I'm aware of a module called Paramiko which allows for file transfers, but assuming there are millions of 'neh' files in dir1 and I don't know the full names of all of them, how can I automate this in Python?
If you really need to use Python instead of bash (assuming you're on Unix):
>>> import subprocess
>>> subprocess.call("tar cvzf /path/to/ftp-or-static-http/foo.tgz /path/to/dir/neh*", shell=True)
Note that shell=True is needed here so the shell expands the neh* glob. This creates a tar file with all the neh* files, which is easy to transfer between servers (it's one file instead of millions).
Use FTP, SFTP, HTTP or any transfer protocol supported by your server and perform a request from the other server (curl or ftp).
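Since the question mentions Paramiko: an alternative is to list the remote directory, filter names that start with 'neh', and fetch only those over SFTP. A sketch with placeholder connection details and paths, run from server 2:
import stat
import paramiko

# Hypothetical connection details.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("server1.example.com", username="user", password="secret")
sftp = ssh.open_sftp()

# listdir_attr returns one SFTPAttributes entry per directory item,
# so we can filter on the name prefix without knowing the full names.
for entry in sftp.listdir_attr("/path/to/dir1"):
    if entry.filename.startswith("neh") and stat.S_ISREG(entry.st_mode):
        sftp.get(f"/path/to/dir1/{entry.filename}", entry.filename)

sftp.close()
ssh.close()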
