Downloading partial file with ftplib - python

Is there a way to download only part of a file, starting from the end? For example, if a file is over 40 MB, I would like to retrieve only the last block of, say, 2042 bytes. Is there a way to do this using Python 3 with ftplib?

Try using the FTP.retrbinary() method and supply the rest argument, which is an offset into the requested file. Since the offset is measured from the beginning of the file, you will need to calculate it from the size of the file and the desired number of bytes. Here's an example using Debian's FTP server:
from ftplib import FTP

hostname = 'ftp.debian.org'
filename = 'README'
num_bytes = 500  # how many bytes to retrieve from the end of the file

ftp = FTP(hostname)
ftp.login()
ftp.cwd('debian')
cmd = 'RETR {}'.format(filename)
# rest is measured from the start of the file, so clamp at 0
# in case the file is smaller than num_bytes.
offset = max(ftp.size(filename) - num_bytes, 0)
with open(filename, 'wb') as f:
    ftp.retrbinary(cmd, f.write, rest=offset)
ftp.quit()
This will retrieve the last num_bytes bytes of the requested file and write them to a file of the same name in the current directory.
The second argument to retrbinary() is a callback function; in this case it's the write() method of a writable file. You can write your own callback to process the retrieved data, as in the sketch below.
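For instance, here is a minimal sketch of a custom callback that accumulates the retrieved bytes in memory instead of writing them to disk (accumulate and chunks are illustrative names, not part of ftplib):
from ftplib import FTP

chunks = []

def accumulate(data):
    # retrbinary() calls this once per received block.
    chunks.append(data)

ftp = FTP('ftp.debian.org')
ftp.login()
ftp.cwd('debian')
offset = max(ftp.size('README') - 500, 0)
ftp.retrbinary('RETR README', accumulate, rest=offset)
ftp.quit()

tail = b''.join(chunks)  # the last 500 bytes of the file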

Just use the rest argument of retrbinary() to tell the server at which file offset it should start transferring data. From the documentation:
FTP.retrbinary(command, callback[, maxblocksize[, rest]])
... rest means the same thing as in the transfercmd() method.
FTP.transfercmd(cmd[, rest])
... If optional rest is given, a REST command is sent to the server, passing rest as an argument. rest is usually a byte offset into the requested file, telling the server to restart sending the file’s bytes at the requested offset, skipping over the initial bytes.
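For completeness, here is a minimal sketch that drives the REST/RETR sequence through transfercmd() directly; the host and filename are the same Debian-server examples as above, and error handling is omitted:
from ftplib import FTP

ftp = FTP('ftp.debian.org')
ftp.login()
ftp.cwd('debian')
ftp.voidcmd('TYPE I')  # binary mode; many servers require this before SIZE
offset = max(ftp.size('README') - 500, 0)

conn = ftp.transfercmd('RETR README', rest=offset)
data = b''
while True:
    block = conn.recv(8192)
    if not block:
        break
    data += block
conn.close()
ftp.voidresp()  # consume the end-of-transfer response
ftp.quit()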

Related

Why do these Python send/receive socket functions work if invoked slowly, but fail if invoked quickly in a row?

I have a client and a server, where the server needs to send a number of text files to the client.
The send file function receives the socket and the path of the file to send:
import os

CHUNKSIZE = 1_000_000

def send_file(sock, filepath):
    with open(filepath, 'rb') as f:
        sock.sendall(f'{os.path.getsize(filepath)}'.encode() + b'\r\n')
        # Send the file in chunks so large files can be handled.
        while True:
            data = f.read(CHUNKSIZE)
            if not data:
                break
            sock.send(data)
And the receive file function receives the client socket and the path where to save the incoming file:
CHUNKSIZE = 1_000_000

def receive_file(sock, filepath):
    with sock.makefile('rb') as file_socket:
        length = int(file_socket.readline())
        # Read the data in chunks so it can handle large files.
        with open(filepath, 'wb') as f:
            while length:
                chunk = min(length, CHUNKSIZE)
                data = file_socket.read(chunk)
                if not data:
                    break
                f.write(data)
                length -= len(data)
        if length != 0:
            print('Invalid download.')
        else:
            print('Done.')
It works by sending the file size as the first line, then sending the file contents in chunks.
Both are invoked in loops in the client and the server, so that files are sent and saved one by one.
It works fine if I put a breakpoint and invoke these functions slowly. But if I let the program run uninterrupted, it fails when reading the size of the second file:
File "/home/stark/Work/test/networking.py", line 29, in receive_file
length = int(file_socket.readline())
ValueError: invalid literal for int() with base 10: b'00,1851,-34,-58,782,-11.91,13.87,-99.55,1730,-16,-32,545,-12.12,19.70,-99.55,1564,-8,-10,177,-12.53,24.90,-99.55,1564,-8,-5,88,-12.53,25.99,-99.55,1564,-8,-3,43,-12.53,26.54,-99.55,0,60,0\r\n'
Clearly a lot more data is being received by that length = int(file_socket.readline()) line.
My questions: why is that? Shouldn't that line read only the size given that it's always sent with a trailing \n?
How can I fix this so that multiple files can be sent in a row?
Thanks!
It seems like you're reusing the same connection, and the problem is that file_socket is buffered: the file-like wrapper reads ahead, so it has actually consumed more bytes from the socket than your read loop used.
That is, the receiver pulls extra data out of the socket, and the next time you attempt to readline() you end up reading the rest of the previous file up to whatever newline it happens to contain, or a fragment of the next length header.
This also means part of the data has effectively been skipped over, so the next line read is not the int you expected, hence the observed failure.
You can write:
with sock.makefile('rb', buffering=0) as file_socket:
instead, to force the file-like wrapper to be unbuffered. Or handle the receiving, buffering, and parsing of the incoming bytes yourself (working out where one file ends and the next begins), instead of relying on the file-like wrapper and readline().
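As a sketch of the do-it-yourself approach, here are hypothetical recv_exactly and read_length_line helpers (the names are illustrative, not stdlib functions) that never consume bytes belonging to the next file:
def recv_exactly(sock, n):
    # Read exactly n bytes from the socket, or raise if it closes early.
    buf = b''
    while len(buf) < n:
        data = sock.recv(min(n - len(buf), 1_000_000))
        if not data:
            raise ConnectionError('socket closed mid-transfer')
        buf += data
    return buf

def read_length_line(sock):
    # Read one byte at a time up to \n so nothing past the header is consumed.
    line = b''
    while not line.endswith(b'\n'):
        ch = sock.recv(1)
        if not ch:
            raise ConnectionError('socket closed before header ended')
        line += ch
    return int(line)

def receive_file(sock, filepath):
    length = read_length_line(sock)
    with open(filepath, 'wb') as f:
        f.write(recv_exactly(sock, length))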
You have to understand that socket communication is based on TCP/IP; it does not matter whether it's the same machine (loopback is used in that case) or different machines. You've got IP addresses between which the connection is established, the traffic goes through the network stack, and that takes relatively long compared to accessing, e.g., RAM. Additionally, the adapter itself manages when to send particular data frames (lower ISO/OSI layers). TCP does require ACKs, but a standard PC is not some industrial, real-time Ethernet setup.
So, in your code, you've got a while True loop without any sleep, and you don't check what sock.send returns. send() may transmit fewer bytes than you passed in; if you ignore its return value, the remainder is silently dropped and you move on to the next chunk. At first glance it looks like something was cached and the receiver got whatever was flushed once the connection settled.
So the first thing you should do is check whether sock.send really returned the number of bytes you gave it; if not, the remaining bytes have to be re-sent (sock.sendall does exactly that for you). Another thing I strongly recommend in such cases is designing a small custom protocol (usually called the application layer in the OSI/ISO stack). For example, you might have 4 frame types: START, FILESIZE, DATA, END; assign each a unique ID and begin every frame with that identifier. START would be empty, FILESIZE would contain a single uint16, DATA would contain {FILE NUMBER, LINE NUMBER, LINE_LENGTH, LINE}, and END would be empty. Then, once you've got an entire frame on the client, you can safely assemble the information you received.
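As a rough sketch of that idea (the frame-type IDs and the struct layout here are invented for illustration, not a standard):
import struct

# Hypothetical frame-type IDs for the protocol described above.
START, FILESIZE, DATA, END = 0, 1, 2, 3

def send_frame(sock, frame_type, payload=b''):
    # Each frame: 1-byte type, 4-byte big-endian payload length, payload.
    sock.sendall(struct.pack('>BI', frame_type, len(payload)) + payload)

def recv_frame(sock):
    frame_type, length = struct.unpack('>BI', _read_exact(sock, 5))
    return frame_type, _read_exact(sock, length)

def _read_exact(sock, n):
    # Keep calling recv() until exactly n bytes have arrived.
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('peer closed mid-frame')
        buf += chunk
    return buf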

Why does python-magic return the wrong mime-type if the file size is too small?

This happens when the file size is under 5000 bytes (an InMemoryUploadedFile).
This code doesn't work
mime_type = magic.from_buffer(file.read(), mime=True)
It returns wrong mime_type.
For example, I have a file cv.docx that is 4074 bytes.
It returns a mime_type:
'application/x-empty'
instead of
'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
Could you please suggest how to solve this?
I had this problem as well. It's very likely nothing to do with the file size, because I have tested magic.from_buffer on 90-byte text/plain files as well and it returned the right value.
The problem is that the file has somehow become empty. In my case, this is because the file was a stream and I had already read from the stream (remember if you read from a stream and read again, the second read will start where the first read finished -- unlike reading from the start of a file each time).
This example is from flask
mime_type1 = magic.from_buffer(request.stream.read(2048), mime=True)  # returns text/plain
mime_type = magic.from_buffer(request.files["file"].stream.read(2048), mime=True)  # returns application/x-empty because the stream has already been read from
It's hard to exactly diagnose without seeing your earlier code but check where else you are working with the file and comment those out.
You might need to do something like
file.seek(0)
mime_type = magic.from_buffer(file.read(), mime=True)

writing a decompressed file fetched from a web server to disk

I can get a file that has content-encoding of gzip.
Does that mean the server is storing it as a compressed file, and is that also true for files stored as compressed zip or 7z archives?
and if so (where durl is a zip file)
>>> durl = 'https://db.tt/Kq0byWzW'
>>> dresp = requests.get(durl, allow_redirects=True, stream=True)
>>> dresp.headers['content-encoding']
'gzip'
>>> r = requests.get(durl, stream=True)
>>> data = r.raw.read(decode_content=True)
but data comes out empty, while I want to extract the zip file to disk on the fly!
So first of all, durl is not a zip file; it is a Dropbox landing page. What you are looking at is HTML being sent with gzip encoding. If you were to decode the data from the raw socket using gzip, you would simply get the HTML. So the use of raw is really just hiding that you accidentally got a different file from the one you thought.
Based on https://plus.google.com/u/0/100262946444188999467/posts/VsxftxQnRam where you ask
Does anyone has any idea about writing compressed file directy to disk to decompressed state?
I take it you are really trying to fetch a zip and decompress it directly to a directory without first storing it. To do this you need to use https://docs.python.org/2/library/zipfile.html
Though at this point the problem becomes that the response from requests isn't actually seekable, which zipfile requires in order to work (one of the first things it will do is seek to the end of the file to determine how long it is).
To get around this you need to wrap the response in a file like object. Personally I would recommend using tempfile.SpooledTemporaryFile with a max size set. This way your code would switch to writing things to disk if the file was bigger than you expected.
import requests
import tempfile
import zipfile

KB = 1 << 10
MB = 1 << 20

url = '...'  # Set url to the download link.
resp = requests.get(url, stream=True)
with tempfile.SpooledTemporaryFile(max_size=500*MB) as tmp:
    # Spools to memory, falling back to a real file past max_size.
    for chunk in resp.iter_content(4*KB):
        tmp.write(chunk)
    archive = zipfile.ZipFile(tmp)
    archive.extractall(path)
Same code using io.BytesIO:
import io

resp = requests.get(url, stream=True)
tmp = io.BytesIO()
for chunk in resp.iter_content(4*KB):
    tmp.write(chunk)
archive = zipfile.ZipFile(tmp)
archive.extractall(path)
You need the decoded content from the requests response in order to write it out.
Confirmed working:
import requests

durl = 'https://db.tt/Kq0byWzW'
dresp = requests.get(durl, allow_redirects=True, stream=True)
print(dresp.headers['content-encoding'])  # 'gzip'
with open('test.html', 'w') as f:
    f.write(dresp.text)
You have to differentiate between content-encoding (not to be confused with transfer-encoding) and content-type.
The gist of it is that content-type is the media type (the real file type) of the resource you are trying to get, while content-encoding is any kind of modification applied to it before sending it to the client.
So let's assume you'd like to get a resource named "foo.txt". It will probably have a content-type of text/plain. In addition to that, the data can be modified when sent over the wire; this is the content-encoding. So, with the above example, you can have a content-type of text/plain and a content-encoding of gzip. This means that before the server sends the file out onto the wire, it compresses it with gzip on the fly, so the only bytes which traverse the net are compressed. Not the raw bytes of the original file (foo.txt).
It is the job of the client to process these headers accordingly.
Now, I am not 100% sure whether requests or the underlying Python libs do this, but chances are they do. If not, Python ships with a gzip module in the standard library, so you could do it on your own without a problem.
With the above in mind, to answer your question: no, having a content-encoding of gzip does not mean that the remote resource is a zip file. The field containing that information is content-type (based on your question this probably has a value of application/zip or application/x-7z-compressed, depending on the actual compression algorithm used).
If you cannot determine the real file type from the content-type field (e.g. if it is application/octet-stream), you could just save the file to disk and open it up with a hex editor. In the case of a 7z file you should see the byte sequence 37 7a bc af 27 1c somewhere, most likely at the beginning of the file or at EOF-112 bytes. In the case of a gzip file, it should be 1f 8b at the beginning of the file.
Given that you have gzip in the content-encoding field: if you get a 7z file, you can be certain that requests has parsed content-encoding and properly decoded it for you. If you get a gzip file, it could mean two things: either requests has not decoded anything, or the file is indeed a gzip file, as a gzip file could itself be sent with gzip encoding. That would mean it's doubly compressed, which would not make any sense, but, depending on the server, it could still happen.
You could simply try to run gunzip on it from the console and see what you get.
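As a quick sketch of both checks described above (the URL is a placeholder; the byte sequences are the magic numbers for 7z and gzip quoted in the answer):
import requests

resp = requests.get('https://example.com/archive.bin')  # placeholder URL
print(resp.headers.get('content-type'))      # the real media type of the resource
print(resp.headers.get('content-encoding'))  # on-the-wire modification, e.g. 'gzip'

head = resp.content[:6]  # requests has already undone any content-encoding here
if head == b'7z\xbc\xaf\x27\x1c':
    print('looks like a 7z archive')
elif head[:2] == b'\x1f\x8b':
    print('looks like a gzip file')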

http post, byte size limitation on post - python

I have been reading the contents of a file which is continuously updated. I'm trying something like this:
from datetime import datetime

offset = 0
now = datetime.now()
FileName = str(now.date())
logfile = open(FileName, "r")
logfile.seek(offset)
data = logfile.read()
try:
    ...  # http post goes here
except Exception:
    ...  # handle exceptions
Now I want to read only a specific number of bytes from the file, because if I lose the Ethernet connection and reconnect, it takes a long time to read the whole file. Can someone help me with this?
You can use .read() with a numeric argument to read only a specific number of bytes, e.g. read(10) will read 10 bytes from the current position in the file.
http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
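A minimal sketch of that idea for a growing log file, tracking the offset between passes so each pass reads only new data (FileName is reused from the question's code; post_data is a hypothetical stand-in for the HTTP POST step):
import time

MAX_CHUNK = 64 * 1024  # cap how much is read (and posted) per pass

def read_new_bytes(path, offset):
    # Read at most MAX_CHUNK new bytes starting at offset and
    # return them together with the offset to resume from.
    with open(path, 'rb') as f:
        f.seek(offset)
        data = f.read(MAX_CHUNK)
    return data, offset + len(data)

offset = 0
while True:
    data, offset = read_new_bytes(FileName, offset)
    if data:
        try:
            post_data(data)      # hypothetical HTTP POST helper
        except ConnectionError:
            offset -= len(data)  # rewind so this chunk is retried
    else:
        time.sleep(1)            # wait for the file to grow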

twisted image transfer from client to server gives bad format error

I was trying to send an image file over TCP from server to client. I tried opening the file, reading it, and then transporting it with self.transport.write. On the client side, when I receive data, I open a file named image.png in append mode and write to it.
client:
class EchoClient(protocol.Protocol):
    def dataReceived(self, data):
        print 'writing to file'
        f = open('image.png', 'a')
        f.write(data)
        f.close()
server (inherits Protocol):
# somewhere in the code
image = open(self.newdict[device_str] + attribute_str + '.png')
data = image.read()
image.close()
self.comm_protocol.transport.write(data)
Opening the file on the client side gives a bad format error. Any ideas what I am doing wrong? Is the idea of streaming the image as a string bad? If so, is there some other way I can transfer the data to the client?
You have to open the file in binary mode, with the 'b' flag, like open(..., 'wb').
The reason that the file gets corrupted is that "text mode" does one of two things:
on UNIX, it does nothing.
on Windows, it just replaces \n with \r\n.
Now, if it's a text file, you can hardly tell the difference. But if it's a binary file, that byte might not mean "newline" at all. Generally, binary files are constructed from fixed-length structures, so sticking two bytes in where one is expected will cause all kinds of havoc.
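For example, a minimal sketch of the corrected code in the same Python 2 / Twisted style as the question, with binary mode on both sides:
class EchoClient(protocol.Protocol):
    def dataReceived(self, data):
        # 'ab' appends in binary mode, so bytes pass through unmodified.
        f = open('image.png', 'ab')
        f.write(data)
        f.close()

and on the server:
image = open(self.newdict[device_str] + attribute_str + '.png', 'rb')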
