I'm trying to decide on the best internal interface to use in my code, specifically around how to handle file contents. Really, the file contents are just binary data, so bytes is sufficient to represent them.
I'm storing files in different remote locations, so I have a couple of different classes for reading and writing. I'm trying to figure out the best interface to use for my functions. Originally I was using file paths, but that was suboptimal because it meant the disk was always used (which meant lots of clumsy tempfiles).
There are several areas of the code that have the same requirement, and would directly use whatever was returned from this interface. As a result whatever abstraction I choose will touch a fair bit of code.
What are the various tradeoffs to using BytesIO vs bytes?
def put_file(location, contents_as_bytes):
def put_file(location, contents_as_fp):
def get_file_contents(location):
def get_file_contents(location, fp):
Playing around I've found that using the file-like interfaces (BytesIO, etc.) requires a bit of administration overhead in terms of seek(0) etc. (there's a rough sketch of what I mean after this list). That raises questions like:
is it better to seek before you start, or after you've finished?
do you seek to the start or just operate from the position the file is in?
should you tell() to maintain the position?
looking at something like shutil.copyfileobj it doesn't do any seeking
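To make the bookkeeping concrete, this is roughly the pattern I mean (put_file is from above; the seek/tell choices are exactly the ones I'm unsure about):
def put_file(location, contents_as_fp):
    # Do I trust the caller's current position, or rewind first?
    pos = contents_as_fp.tell()
    contents_as_fp.seek(0)
    data = contents_as_fp.read()
    # ... write data to the remote location ...
    # And should the original position be restored afterwards?
    contents_as_fp.seek(pos)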
One advantage I've found with using file-like interfaces instead is that they allow passing in the fp to write into when you're retrieving data, which seems to give a good deal of flexibility.
def get_file_contents(location, write_into=None):
    if not write_into:
        write_into = io.BytesIO()
    # get the contents and put it into write_into
    return write_into
get_file_contents('blah', file_on_disk)
get_file_contents('blah', gzip_file)
get_file_contents('blah', temp_file)
get_file_contents('blah', bytes_io)
new_bytes_io = get_file_contents('blah')
# etc
Is there a good reason to prefer BytesIO over just using fixed bytes when designing an interface in python?
The benefit of io.BytesIO objects is that they implement a common-ish interface (commonly known as a 'file-like' object). BytesIO objects have an internal pointer (whose position is returned by tell()), and for every call to read(n) the pointer advances n bytes. For example:
import io
buf = io.BytesIO(b'Hello world!')
buf.read(1) # Returns b'H'
buf.tell() # Returns 1
buf.read(1) # Returns b'e'
buf.tell() # Returns 2
# Set the pointer to 0.
buf.seek(0)
buf.read(1) # Returns b'H' again, just like the first call.
In your use case, neither the bytes object nor the io.BytesIO object may be the best solution. They will read the complete contents of your files into memory.
Instead, you could look at tempfile.TemporaryFile (https://docs.python.org/3/library/tempfile.html).
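For instance, a minimal sketch of the retrieval side along those lines, where the data is streamed into a temporary file rather than held in memory (read_remote_chunks is a hypothetical helper standing in for however you fetch the remote data):
import tempfile

def get_file_contents(location, write_into=None):
    if write_into is None:
        write_into = tempfile.TemporaryFile()
    for chunk in read_remote_chunks(location):  # hypothetical remote reader
        write_into.write(chunk)
    write_into.seek(0)  # hand the stream back positioned at the start
    return write_into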
Related
I am currently working on a side project where I am converting some code from Python to Rust.
In python, we can do something like:
Python code->
data = b'commit'+b' '+b'\x00'
print(data)
Output->
b'commit \x00'
Is there any way to achieve this in Rust? I need to concatenate some b'' literals and store them in a file.
Thanks in advance.
I tried using the + operator, but it didn't work and showed an error like: cannot add &[u8; 6] with &[u8; 1]
You have a number of options for combining binary strings. However, it sounds like the best option for your use case is the write! macro. It lets you write bytes the same way you would use the format! and println! macros. This has the benefit of requiring no additional allocation. The write! macro uses UTF-8 encoding (the same as strings in Rust).
One thing to keep in mind, though, is that write! will perform lots of small calls through the std::io::Write trait. To prevent your program from slowing down due to frequent calls to the OS when writing files, you will want to buffer your data using a BufWriter.
As for zlib, this is definitely doable. In Rust, the most popular library for handling deflate, gzip, and zlib is the flate2 crate. Using it we can wrap a writer so the compression happens inline with the rest of the writing process. This lets us save memory and keeps our code tidy.
use std::fs::File;
use std::io::{self, BufWriter, Write};
use flate2::write::ZlibEncoder;
use flate2::Compression;
pub fn save_commits(path: &str, results: &[&str]) -> io::Result<()> {
    // Open the file and wrap it in a buffered writer
    let file = BufWriter::new(File::create(path)?);

    // We can then wrap the file in a zlib encoder so the data we write will
    // be compressed with zlib as we go
    let mut writer = ZlibEncoder::new(file, Compression::best());

    // Write our header
    write!(&mut writer, "commit {}\0", results.len())?;

    // Write commit hashes
    for hash in results {
        write!(&mut writer, "{}\0", hash)?;
    }

    // This step is not required, but it lets us propagate any error from
    // flushing the buffered bytes instead of having it silently discarded
    // when writer is dropped.
    writer.flush()?;
    Ok(())
}
Additionally, you may find it helpful to read this answer to learn more about how binary strings work in Rust: https://stackoverflow.com/a/68231883/5987669
I have a bunch of questions about file handling in Python. Please help me sort them out.
Suppose I create a file something like this.
>>> f = open("text.txt", "w+")
>>> f.tell()
0
f is a file object.
Can I assume it to be a file pointer?
If so, what is f pointing to? The empty space reserved for the first byte in the file structure?
Can I assume file structure to be zero indexed?
In microprocessors, what I learnt is that the pointer always points to the next instruction. How is it in Python? If I write a character, say 'b', to the file, will the file pointer point to the character 'b' or to the location after 'b'?
You don't specify a version, and file objects behave a little bit differently between Python 2 and Python 3. The general idea is the same, but some of the specific details are different. The following assumes you're using Python 3, or that you're using the version of open from the io module in Python 2.6 or 2.7 rather than Python 2's builtin open.
It isn't a file pointer, although there's a good chance it is implemented in terms of one behind the scenes. Unlike C, Python does not expose the concept of pointers.
However, what you seem to be thinking of is the 'stream position', which is kind of similar to a pointer. This is the number reported by tell(), and which can be fed into seek(). For binary files, it is a byte offset from the start of the file. In text files, it is just 'an offset' which is meaningful to the file object - the docs call it an "opaque number" (i.e. it has no defined physical meaning in terms of how the file is stored on disk). But in both cases, it is an offset from the start, and therefore the start is zero. This is only true if the file supports random access - which it usually will - but be prepared to eventually run into a situation where it doesn't, in which case seek and tell raise errors.
Like the instruction pointer in processors, the stream position is where the next operation will start from, rather than where the current one finished. So, yes, after you've written a string to the file, the current position will usually be just past the end of what you wrote.
When you've just opened a file, the offset will usually be zero or the end of the file (one higher than the maximum value you could read from without getting EOF). It will be zero if you've opened it in 'r' mode, the end if you've opened it in 'a' mode and the two are equivalent for 'w' and 'w+' modes since those truncate the file to zero bytes.
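A quick illustration of both points, using binary mode on a seekable file:
with open("example.bin", "wb+") as f:  # 'w+' truncates, so the position starts at 0
    print(f.tell())   # 0
    f.write(b"b")
    print(f.tell())   # 1 -- just past the byte that was written
    f.seek(0)
    print(f.read(1))  # b'b'

with open("example.bin", "ab") as f:   # append mode starts at the end of the file
    print(f.tell())   # 1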
The file object is implemented using the C standard library's stdio. So it contains a "file descriptor" (since it's based on stdio, "under the hood" it will contain a pointer to a struct FILE, which is what is commonly called a file pointer). And you can use tell and seek. On the other hand, it is also an iterator and a context manager, so it has more functionality.
It is not a pointer, but rather a reference. Keep in mind that in Python, f is a name that references a file object.
If you are using file.seek(), it uses 0-based absolute positioning by default.
You are confusing a processor register with file handling. The question makes no sense.
There's nothing special about a file object. Just think of it as an object
the name f points to the file object on the heap, just like in l = [1, 2, 3] the name l points to the list object on the heap
From the documentation, there is no __getitem__ member, so this is not a meaningful question
I am trying to read only one file from a tar.gz file. All operations on the tarfile object work fine, but when I read a specific member, StreamError is always raised; check this code:
import tarfile
fd = tarfile.open('file.tar.gz', 'r|gz')
for member in fd.getmembers():
    if not member.isfile():
        continue
    cfile = fd.extractfile(member)
    print cfile.read()
    cfile.close()
fd.close()
cfile.read() always causes "tarfile.StreamError: seeking backwards is not allowed"
I need to read the contents into memory, not dump them to a file (extractall works fine).
Thank you!
The problem is this line:
fd = tarfile.open('file.tar.gz', 'r|gz')
You don't want 'r|gz', you want 'r:gz'.
If I run your code on a trivial tarball, I can even print out the member and see test/foo, and then I get the same error on read that you get.
If I fix it to use 'r:gz', it works.
From the docs:
mode has to be a string of the form 'filemode[:compression]'
...
For special purposes, there is a second format for mode: 'filemode|[compression]'. tarfile.open() will return a TarFile object that processes its data as a stream of blocks. No random seeking will be done on the file… Use this variant in combination with e.g. sys.stdin, a socket file object or a tape device. However, such a TarFile object is limited in that it does not allow to be accessed randomly, see Examples.
'r|gz' is meant for when you have a non-seekable stream, and it only provides a subset of the operations. Unfortunately, it doesn't seem to document exactly which operations are allowed—and the link to Examples doesn't help, because none of the examples use this feature. So, you have to either read the source, or figure it out through trial and error.
But, since you have a normal, seekable file, you don't have to worry about that; just use 'r:gz'.
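For instance, a small sketch of reading a single member into memory with the seekable variant (the archive and member names are just the placeholders from above):
import tarfile

with tarfile.open('file.tar.gz', 'r:gz') as fd:
    member = fd.getmember('test/foo')         # or loop over fd.getmembers()
    contents = fd.extractfile(member).read()  # bytes in memory, nothing written to disk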
In addition to the file mode issue, in my case I was also attempting to seek on a network stream.
I had the same error when trying to requests.get the file, so I extracted all to a tmp directory:
import os
import tarfile
from lzma import LZMAFile

# stream == requests.get
inputs = [tarfile.open(fileobj=LZMAFile(stream), mode='r|')]
t = "/tmp"
for tarfileobj in inputs:
    tarfileobj.extractall(path=t, members=None)
for fn in os.listdir(t):
    with open(os.path.join(t, fn)) as payload:
        print(payload.read())
I need to import a binary file from Python -- the contents are signed 16-bit integers, big endian.
The following Stack Overflow questions suggest how to pull in several bytes at a time, but is this the way to scale up to read in a whole file?
Reading some binary file in Python
Receiving 16-bit integers in Python
I thought to create a function like:
from numpy import *
import os
import struct

def readmyfile(filename, bytes=2, endian='>h'):
    totalBytes = os.path.getsize(filename)
    values = empty(totalBytes/bytes)
    with open(filename, 'rb') as f:
        for i in range(len(values)):
            values[i] = struct.unpack(endian, f.read(bytes))[0]
    return values

filecontents = readmyfile('filename')
But this is quite slow (the file is 165924350 bytes). Is there a better way?
Use numpy.fromfile.
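For the layout in the question ('>h', i.e. big-endian signed 16-bit), something like this should be all you need (a sketch; the filename is a placeholder):
import numpy as np

# '>i2' is big-endian signed 16-bit, the same layout as struct's '>h'
filecontents = np.fromfile('filename', dtype='>i2')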
I would directly read until EOF (which means checking for an empty string), removing the need to use range() and getsize.
Alternatively, using xrange (instead of range) should improve things, especially for memory usage.
Moreover, as Falmarri suggested, reading more data at a time would improve performance quite a lot.
That said, I would not expect miracles, also because I am not sure a list is the most efficient way to store that amount of data.
What about using NumPy's Array, and its facilities to read/write binary files? In this link there is a section about reading raw binary files, using numpyio.fread. I believe this should be exactly what you need.
Note: personally, I have never used NumPy; however, its main raison d'etre is exactly handling of big sets of data - and this is what you are doing in your question.
You're reading and unpacking 2 bytes at a time
values[i] = struct.unpack(endian,f.read(bytes))[0]
Why don't you read like, 1024 bytes at a time?
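Something along these lines, for instance (a rough sketch: the chunk size is arbitrary, and it assumes the file length is a multiple of 2):
import struct

values = []
with open('filename', 'rb') as f:
    while True:
        chunk = f.read(1024)
        if not chunk:
            break
        # unpack every short in the chunk with a single call
        values.extend(struct.unpack('>%dh' % (len(chunk) // 2), chunk))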
I have had the same kind of problem, although in my particular case I had to convert a very strange binary format: a 500 MB file with interlaced blocks of 166 elements that were 3-byte signed integers, so I also had the problem of converting from 24-bit to 32-bit signed integers, which slows things down a little.
I've resolved it using NumPy's memmap (which is just a handy way of using Python's mmap) and struct.unpack on large chunks of the file.
With this solution I'm able to convert (read, do stuff, and write to disk) the entire file in about 90 seconds (timed with time.clock()).
I could upload part of the code.
I think the bottleneck you have here is twofold.
Depending on your OS and disc controller, the calls to f.read(2), with f being a biggish file, are usually efficiently buffered -- usually. In other words, the OS will read one or two sectors (disc sectors are usually several KB) off disc into memory because this is not a lot more expensive than reading 2 bytes from that file. The extra bytes are cached efficiently in memory, ready for the next call to read that file. Don't rely on that behavior -- it might be your bottleneck -- but I think there are other issues here.
I am more concerned about the one-value-at-a-time conversions to a short and the single-element calls to numpy. These are not cached at all. You can keep all the shorts in a Python list of ints and convert the whole list to numpy when (and if) needed. You can also make a single call to struct.unpack_from to convert everything in a buffer rather than one short at a time.
Consider:
#!/usr/bin/python
import random
import os
import struct
import numpy
import ctypes
def read_wopper(filename, bytes=2, endian='>h'):
    buf_size = 1024*2
    buf = ctypes.create_string_buffer(buf_size)
    new_buf = []

    with open(filename, 'rb') as f:
        while True:
            st = f.read(buf_size)
            l = len(st)
            if l == 0:
                break
            fmt = endian[0]+str(l/bytes)+endian[1]
            new_buf += (struct.unpack_from(fmt, st))

    na = numpy.array(new_buf)
    return na

fn = 'bigintfile'

def createmyfile(filename):
    bytes = 165924350
    endian = '>h'
    f = open(filename, "wb")
    count = 0
    try:
        for i in range(0, bytes/2):
            # The first 32,767 values are [0,1,2..0x7FFF]
            # to allow testing the read values with new_buf[value<0x7FFF]
            value = count if count < 0x7FFF else random.randint(-32767, 32767)
            count += 1
            f.write(struct.pack(endian, value & 0x7FFF))
    except IOError:
        print "file error"
    finally:
        f.close()

if not os.path.exists(fn):
    print "creating file, don't count this..."
    createmyfile(fn)
else:
    read_wopper(fn)
    print "Done!"
I created a file of random signed short ints, 165,924,350 bytes (158.24 MB), which corresponds to 82,962,175 signed 2-byte shorts. With this file, I ran the read_wopper function above and it ran in:
real 0m15.846s
user 0m12.416s
sys 0m3.426s
If you don't need the shorts to be numpy, this function runs in 6 seconds. All this on OS X, Python 2.6.1 64-bit, 2.93 GHz Core i7, 8 GB RAM. If you change buf_size=1024*2 in read_wopper to buf_size=2**16, the run time is:
real 0m10.810s
user 0m10.156s
sys 0m0.651s
So your main bottleneck, I think, is the one-short-at-a-time calls to unpack -- not your 2-byte reads from disc. You might want to make sure that your data files are not fragmented and, if you are using OS X, that your free disc space is not fragmented.
Edit: I posted the full code to create and then read a binary file of ints. On my iMac, I consistently get < 15 secs to read the file of random ints. It takes about 1:23 to create, since the creation is done one short at a time.
In WSGI, post data is consumed by reading the file-like object environ['wsgi.input']. If a second element in the stack also wants to read post data it may hang the program by reading when there's nothing more to read.
How should I copy the POST data so it can be processed multiple times?
You could try putting a file-like replica of the stream back in the environment:
from cStringIO import StringIO
length = int(environ.get('CONTENT_LENGTH', '0'))
body = StringIO(environ['wsgi.input'].read(length))
environ['wsgi.input'] = body
Needing to do this is a bit of a smell, though. Ideally only one piece of code should be parsing the query string and post body, and delivering the results to other components.
Go have a look at the WebOb package. It provides functionality that allows you to designate that wsgi.input should be made seekable. This has the effect of allowing you to rewind the input stream so that content can be replayed through a different handler. Even if you don't use WebOb, the way it does this should be instructive, as I would trust Ian to have done it in an appropriate way. For search results in the documentation, go here.
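Roughly along these lines (an untested sketch: make_body_seekable is WebOb's API for this, the middleware wrapper itself is just illustrative):
from webob import Request

def replayable_input(app):
    def middleware(environ, start_response):
        req = Request(environ)
        req.make_body_seekable()           # buffers wsgi.input into a seekable object
        _ = environ['wsgi.input'].read()   # consume it once here...
        environ['wsgi.input'].seek(0)      # ...then rewind for the wrapped app
        return app(environ, start_response)
    return middleware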
If you're going to read it in one fell swoop, you could always read it in, create a cStringIO file-like object from the stuff you've read, and then assign it back, like this:
import cStringIO
import copy
lines = []
for line in environ['wsgi.input']:
    lines.append(line)
newlines = copy.copy(lines)
environ['wsgi.input'] = cStringIO.StringIO(''.join(newlines))
There's most likely a more efficient way to do this, but in general I find WSGI's POST handling pretty brittle if you want to do anything non-trivial (like reading post data multiple times)...