I have a simple server on my Windows PC written in python that reads files from a directory and then sends the file to the client via TCP.
Files like HTML and Javascript are received by the client correctly (sent and original file match).
The issue is that image data is truncated.
Oddly, different images are truncated at different lengths, but it's consistent per image.
For example, a specific 1MB JPG is always received as 95 bytes. Another image that should be 7KB is received as 120 bytes.
Opening the truncated image files in notepad++, the data that is there is correct. (The only issue is that the file ends too soon).
I do not see a pattern for where the files end. The chars/bytes immediately before and after truncation are different per image.
I've tried three different ways for the server to read the files, but they all have the same result.
Here is a snippet of the reading and sending of files:
print ("Cache size=" + str(os.stat(filename).st_size))
#1st attempt, using readlines
fileobj = open(filename, "r")
cacheBuffer = fileobj.readlines()
for i in range(0, len(cacheBuffer)):
    tcpCliSock.send(cacheBuffer[i])
#2nd attempt, using line, same result
with open(filename) as f:
    for line in f:
        tcpCliSock.send(line)
#3rd attempt, using f.read(), same result
with open(filename) as f:
    tcpCliSock.send(f.read())
The script prints to the console the size of the file read, and the number of bytes matches the original image. So this proves the problem is in sending, right?
If the issue is with sending, what can I change to have the whole image sent properly?
Since you're dealing with images, which are binary files, you need to open the files in binary mode.
open(filename, 'rb')
From the Python documentation for open():
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.)
Since your server is running on Windows, as you read the file, Python converts every \r\n it sees to \n. For text files this is nice: you can write platform-independent code that only deals with \n characters. For binary files it silently corrupts your data. (On Windows, text mode can even treat a stray 0x1A (Ctrl-Z) byte in the data as end-of-file, which would explain why each image stops early at a different point.) That's why it's important to use 'b' when dealing with binary files, and equally important to leave it off when dealing with text files.
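To see the difference concretely, here is a minimal sketch (throwaway file name, and the text-mode behaviour described is Windows-specific):

data = b'\x89PNG\r\n\x1a\n'                 # 8 bytes that happen to contain \r\n
with open('blob.bin', 'wb') as f:
    f.write(data)

print(len(open('blob.bin', 'rb').read()))   # 8: binary mode hands the bytes back unchanged
print(len(open('blob.bin', 'r').read()))    # fewer than 8 on Windows: text mode mangles them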
Also, as TCP is a stream protocol, it's better to stream the data into the socket in smaller pieces. This avoids the need to read an entire file into memory, which will keep your memory usage down. Like this:
with open(filename, 'rb') as f:
    while True:
        data = f.read(4096)
        if len(data) == 0:
            break
        tcpCliSock.send(data)
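One extra hedge: send() is allowed to transmit fewer bytes than it was given, so if you want a guarantee that each chunk goes out completely, socket.sendall() is a small variation on the same loop:

with open(filename, 'rb') as f:
    while True:
        data = f.read(4096)
        if not data:
            break
        tcpCliSock.sendall(data)   # sendall() keeps sending until the whole chunk is out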
Currently, I can't get the data received from my other client software to write into a file that appends, with a space added after each dump. I've tried quite a few different approaches, but I'm left with this now and I'm a bit stumped.
At the moment I can no longer get a file to write and I'm not sure what I've done to destroy that part of my code.
while True:
    data = s.recv(1024).decode('utf-8')
    if data:
        with open("data.txt", 'w') as f:
            json.dump(data, f, ensure_ascii=False)
I am expecting a file will appear that will not be overwritten each time I receive new data, allowing me to develop my search and table features of my application.
What you are currently doing for each block:
Decode the block as UTF-8
Open a file, truncating the previous contents ('w' mode)
Re-encode the data
Dump it to the file
Why this is a bad way to do it:
Your blocks are not necessarily going to respect UTF-8 code point boundaries. You need to accumulate all the data before you decode.
Not only are you truncating the existing file by using 'w' instead of 'a' mode, but opening and closing a file over and over is very inefficient and generally a bad idea.
You are not going to get the same bytes back if a block boundary fell in the middle of a UTF-8 code point. Worst case, your whole dataset will be trash.
You have no way of ending the stream. You probably want to close the file eventually and decode it.
How you should do it:
Open an output file (in binary mode)
Loop until the stream ends
Dump all your raw binary packets to a file
Close the file
Decode the file when you read it
Sample code:
with open('data.txt', 'wb') as file:
    while True:
        data = s.recv(1024)
        if not data:
            break
        file.write(data)
If the binary stream contains UTF-8 encoded JSON data, that's what you will get in your file.
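As a hedged sketch of the read-back step (assuming the sender produced a single UTF-8 encoded JSON document):

import json

with open('data.txt', 'rb') as f:
    payload = f.read().decode('utf-8')   # decode once, after the whole stream has arrived
obj = json.loads(payload)                # parse the accumulated JSON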
I want to open a file, decode the format of the data (from base64 to ASCII), and rewrite or save the decoded string, either back to the same file or to a new one.
I have it opening, reading, decoding (and printing as a test) the decoded base64 string into readable format (ASCII, I believe).
My goal is to now save this output to: either a "newfile.txt" document or back to the original "test.mcz" file ready for the next steps of my mission...
I know there are great online base64 decoders and they do work well for what I am doing - I use them often, but my goal is to write my own program as a learning exercise more than anything (also when my internet plays up I need an offline program)
Here's where I am so far (the original file is .mcz format it is a game save)
# PYTHON 3
import base64
f = open('test.mcz', 'r')
f_read = f.read()
# print(f_read) # was just as a test
new_f_read = base64.b64decode(f_read)
print (new_f_read)
This prints a butt-load of readable code, which is what I need, but I don't want to have to copy and paste this output from the Python shell into another editor; I want to save it to a file, for convenience.
Either back into the same test.mcz (I will be re-encoding to base64 again later on anyway) or to a new file - thus leaving my original as it was.
The problem arises when I want to save/write this decoded output that is stored within the new_f_read variable; it's just been a headache. Before I started I could visualise how it needed to be written, but I got tripped up when I had to switch it all over to Python 3 for some reason (don't ask...), and I have tried so many variations from online examples that I wouldn't know where to start explaining what I've tried so far. I can't open the original file as both "r" AND "w" together, so once I've opened and decoded I can't reopen the original file as "w", because that just wipes the contents (which are still encoded anyway).
I think I need to write functions to handle:
1. Open, read, save string to a variable
2. Manipulate string - decode
3. Write the new string to new or existing file
Sounds easy, I know, but I am stuck... so here I am. If anyone shows examples, please take the time to explain what is going on; it seems pointless to me to have code I don't understand. Apologies if this seems like a simple thing; help would be appreciated. Thanks.
First, you can absolutely open a file for both reading and writing without truncating the contents. That's what the r+ mode is for (see https://docs.python.org/3/library/functions.html#open). If you do this, the model is (a) open the file, (b) read it, (c) seek back to the beginning with e.g. f.seek(0), (d) write it.
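A minimal sketch of that model, using the binary variant 'rb+' since the decoded output is raw bytes, and adding a truncate() because the decoded data is shorter than the base64 text:

import base64

with open('test.mcz', 'rb+') as f:   # read and write, without truncating on open
    decoded = base64.b64decode(f.read())
    f.seek(0)                        # seek back to the beginning
    f.write(decoded)
    f.truncate()                     # drop the leftover tail of the old, longer contents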
Secondly, you can simply open the file, read it, then close the file, and then reopen it, write it, and close it again, like this:
# open the file for reading, read the data, then close the file
with open('test.mcz', 'rb') as f:
    f_read = f.read()
new_f_read = base64.b64decode(f_read)

# open the file for writing, write the data, then close the file
with open('test.mcz', 'wb') as f:
    f.write(new_f_read)
This is probably the easiest solution.
The easiest thing is to open a read handle first, close it, and then open a write handle. Read/write handles are more complicated because they have to keep a pointer to where you are in the file, and that adds overhead you don't need here. You could do it if you wanted, but it's a waste of time in this case.
Using the with statement to open files is recommended, since the file is automatically closed when you leave the with block.
import base64
with open('test.mcz', 'r') as f:
    encode = base64.b64decode(f.read())
with open('test.mcz', 'wb') as f:
    f.write(encode)
This is the same as
import base64
f = open('test.mcz', 'r')
encode = base64.b64decode(f.read())
f.close()
f = open('test.mcz', 'wb')
f.write(encode)
f.close()
I'm actually working on a project to send a file using UDP, and since this protocol is not reliable I added some information to each packet, namely the index of the data, so I can write the received data in the correct order.
I'm having problems writing bytes at a specific position in a file.
This is the part of my code that handles writing new data:
while i < packet_num:
    buf, address = recieve_packet(s, data_size + 10)
    i += 1
    if buf:
        print(buf)
        index = int(buf[0:10].decode())
        data = buf[10:]
        f.seek(seek_pointer + index*data_size, 0)
        f.write(data)
        list_index.append(index)
in this case the seek function has no effect and the data is just appended to the file. I'm using "a+b" mode to open the file.
Quoting from tutorialspoint.com,
Note that if the file is opened for appending using either 'a' or 'a+', any seek() operations will be undone at the next write.
"a" mode write operations append to the end of the file. What seek does is it sets the write/read pointer to a specific location in the file.
Therefore, when a write is called, it will write to the end of file, regardless of the read/write pointer.
However, because you've opened the file in a+b, you would be able to seek to a specific location and read it.
If you open using 'append' mode, all writes go to the end of the file. If you are already keeping track of where each piece of received data belongs, then opening in w+b mode is all you need to do.
wb creates (or empties) the file and allows writing (in binary, rather than text, mode). w+b does the same, but allows reading as well. If you want to open an existing file without truncating it, mode r+b will allow both reading and writing while preserving the existing data (again, the b is for binary mode, which I expect is correct for your use).
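A tiny illustration of the difference (hypothetical file name):

with open('received.bin', 'wb') as f:   # 'wb' here, or 'r+b' for an existing file
    f.seek(4096)                        # position the next write at offset 4096
    f.write(b'chunk')                   # lands at offset 4096 instead of being appended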
Here is the code
def main():
    f = open("image.jpg", "rb")
    filedata = f.read()
    f.close()
    print "Creating Test Image"
    f = open("ftp_test.jpg", "w+")
    f.write(filedata)
    f.close()
    print "Done!"

if __name__ == '__main__':
    main()
I'm not sure why, but here is the original image,
and here is the resulting picture from the code.
I'm not sure what to do, so I decided to come to the experts, since I'm only 14. I am also adding more to it, like TCP communication, so I can send files over the internet.
You're reading the file in binary with rb, so write back in binary too, by using wb.
f = open("ftp_test.jpg", "wb+")
From the official docs:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn't hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.
My code looks like this:
log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))
However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?
I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?
EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.
As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.
The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.
import bz2

class BZ2StreamEncoder(object):
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()

    def write(self, data):
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        self.log_file.write(self.encoder.flush())
        self.log_file.flush()

    def close(self):
        self.flush()
        self.log_file.close()

log_file = BZ2StreamEncoder(archive_file, 'ab')
A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
The problem seems to be that output is being written on every write(). This causes each line to be compressed in its own bzip block.
I would try building a much larger string (or a list of strings, if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900K (or more), as that is the block size that bzip2 uses.
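A rough sketch of that batching idea, reusing archive_file and cursor from the question (same codecs-based writer, so treat it as illustrative rather than definitive):

import codecs

log_file = codecs.open(archive_file, 'w', 'bz2')
pending = []
pending_bytes = 0
for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3)
    pending.append(line)
    pending_bytes += len(line)
    if pending_bytes >= 900000:          # roughly one bzip2 block of text
        log_file.write(''.join(pending))
        pending = []
        pending_bytes = 0
if pending:                              # flush whatever is left over at the end
    log_file.write(''.join(pending))
log_file.close()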
The problem is due to your use of append mode, which results in files that contain multiple compressed blocks of data. Look at this example:
>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("ABCD")
On my system, this produces a file 12 bytes in size. Let's see what it contains:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
Okay, now let's do another write in append mode:
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("EFGH")
The file is now 24 bytes in size, and its contents are:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
What's happening here is that unzip expects a single zipped stream. You'll have to check the specs to see what the official behavior is with multiple concatenated streams, but in my experience they process the first one and ignore the rest of the data. That's what Python does.
I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
I'm not sure how different this is from the codecs way of doing it, but if you use GzipFile from the gzip module you can incrementally append to the file. It's not going to compress very well, though, unless you are writing large amounts of data at a time (maybe more than 1 KB); that is just the nature of the compression algorithms. If the data you are writing isn't super important (i.e. you can deal with losing it if your process dies), you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data, as sketched below.
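Here is a rough sketch of that wrapper (hypothetical class name; anything still sitting in the buffer is lost if the process dies before close()):

import gzip

class BufferedGzipWriter(object):
    def __init__(self, filename, buffer_size=64 * 1024):
        self.gz = gzip.open(filename, 'ab')        # appends a new compressed member
        self.pending = []
        self.pending_bytes = 0
        self.buffer_size = buffer_size

    def write(self, data):
        self.pending.append(data)
        self.pending_bytes += len(data)
        if self.pending_bytes >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.gz.write(b''.join(self.pending))  # one larger write compresses better
            self.pending = []
            self.pending_bytes = 0

    def close(self):
        self.flush()
        self.gz.close()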