Python like b-string concatenation in rust - python

I am currently working on a side project where I am converting some code from python to rust.
In python, we can do something like:
Python code->
data = b'commit'+b' '+b'\x00'
print(data)
Output->
b'commit \x00'
Is there any way to achieve this in rust? As I need to concatenate some b'' and store them in a file.
Thanks in advance.
I tried using + operator but didn't work and shown error like: cannot add &[u8;6] with &[u8;1]

You have a number of options for combining binary strings. However it sounds the best option for your use case is to use the write! macro. It lets you write bytes the same way you would use the format! and println! macros. This has the benefit of requiring no additional allocation. The write! macro uses UTF-8 encoding (the same as strings in Rust).
One thing to keep in mind though is that the write! will perform lots of small calls using the std::io::Write trait. To prevent your program from slowing down due to the frequent calls to the OS when writing files you will want to buffer your data using a BufWriter.
As for zlip this is defiantly doable. In Rust, the most popular library for handling deflate, gzip, and zlib is the flate2 crate. Using this we can wrap a writer to do the compression inline with the rest of the writing process. This lets us save memory and keeps our code tidy.
use std::fs::File;
use std::io::{self, BufWriter, Write};
use flate2::write::ZlibEncoder;
use flate2::Compression;
pub fn save_commits(path: &str, results: &[&str]) -> io::Result<()> {
// Open file and wrap it in a buffered writer
let file = BufWriter::new(File::create(path)?);
// We can then wrap the file in a Zlib encoder so the data we write will
// be compressed with zlib as we go
let mut writer = ZlibEncoder::new(file, Compression::best());
// Write our header
write!(&mut writer, "commit {}\0", results.len())?;
// Write commit hashes
for hash in results {
write!(&mut writer, "{}\0", hash);
}
// This step is not required, but it lets us propogate the error instead of
// letting the final bytes buffered in the BufferedWriter from being flushed
// when it gets dropped.writer is dropped.
writer.by_ref().flush()?;
Ok(())
}
Additionally, you may find it helpful to read this answer to learn more about how binary strings work in Rust: https://stackoverflow.com/a/68231883/5987669

Related

Prefer BytesIO or bytes for internal interface in Python?

I'm trying to decide on the best internal interface to use in my code, specifically around how to handle file contents. Really, the file contents are just binary data, so bytes is sufficient to represent them.
I'm storing files in different remote locations, so have a couple of different classes for reading and writing. I'm trying to figure out the best interface to use for my functions. Originally I was using file paths, but that was suboptimal because it meant that disk was always used (which meant lots of clumsy tempfiles).
There are several areas of the code that have the same requirement, and would directly use whatever was returned from this interface. As a result whatever abstraction I choose will touch a fair bit of code.
What are the various tradeoffs to using BytesIO vs bytes?
def put_file(location, contents_as_bytes):
def put_file(location, contents_as_fp):
def get_file_contents(location):
def get_file_contents(location, fp):
Playing around I've found that using the File-Like interfaces (BytesIO, etc) requires a bit of administration overhead in terms of seek(0) etc. That raises a questions like:
is it better to seek before you start, or after you've finished?
do you seek to the start or just operate from the position the file is in?
should you tell() to maintain the position?
looking at something like shutil.copyfileobj it doesn't do any seeking
One advantage I've found with using file-like interfaces instead is that it allows for passing in the fp to write into when you're retrieving data. Which seems to give a good deal of flexibility.
def get_file_contents(location, write_into=None):
if not write_into:
write_into = io.BytesIO()
# get the contents and put it into write_into
return write_into
get_file_contents('blah', file_on_disk)
get_file_contents('blah', gzip_file)
get_file_contents('blah', temp_file)
get_file_contents('blah', bytes_io)
new_bytes_io = get_file_contents('blah')
# etc
Is there a good reason to prefer BytesIO over just using fixed bytes when designing an interface in python?
The benefit of io.BytesIO objects is that they implement a common-ish interface (commonly known as a 'file-like' object). BytesIO objects have an internal pointer (whose position is returned by tell()) and for every call to read(n) the pointer advances n bytes. Ex.
import io
buf = io.BytesIO(b'Hello world!')
buf.read(1) # Returns b'H'
buf.tell() # Returns 1
buf.read(1) # Returns b'e'
buf.tell() # Returns 2
# Set the pointer to 0.
buf.seek(0)
buf.read() # This will return b'H', like the first call.
In your use case, both the bytes object and the io.BytesIO object are maybe not the best solutions. They will read the complete contents of your files into memory.
Instead, you could look at tempfile.TemporaryFile (https://docs.python.org/3/library/tempfile.html).

What is the proper way to get binary data from python 2.7 into a C module

I am reading binary data from a file in python and trying to send that data to a c module. In python the data is read like so
file = open("data", "rb")
data = file.read()
I want the data as a pointer to a buffer and a length in c if possible. I am using PyArg_ParseTuple to get the parameters in the c module. I noticed in python 3+ there is a y/y*/y# format specifier for binary data but I need the equivalent way of doing it in python 2.7.
Thanks
You should investigate the Buffer Api.
From the docs:
These functions can be used by an object to expose its data in a raw, byte-oriented format. Clients of the object can use the buffer interface to access the object data directly, without needing to copy it first.
Two examples of objects that support the buffer interface are strings and arrays. The string object exposes the character contents in the buffer interface’s byte-oriented form. An array can also expose its contents, but it should be noted that array elements may be multi-byte values.
For example (in C++):
void* ExtractBuffer(PyObject* bufferInterfaceObject, Py_buffer& bufferStruct)
{
if (PyObject_GetBuffer(bufferInterfaceObject, &bufferStruct, PyBUF_SIMPLE) == -1)
return 0;
return (void*)bufferStruct.buf;
}
Do not forget to release the bufferStruct when you no longer need it:
PyBuffer_Release(&bufferStruct);

How can I work with Gzip files which contain extra data?

I'm writing a script which will work with data coming from instrumentation as gzip streams. In about 90% of cases, the gzip module works perfectly, but some of the streams cause it to produce IOError: Not a gzipped file. If the gzip header is removed and the deflate stream fed directly to zlib, I instead get Error -3 while decompressing data: incorrect header check. After about half a day of banging my head against the wall, I discovered that the streams which are having problems contain a seemingly-random number of extra bytes (which are not part of the gzip data) appended to the end.
It strikes me as odd that Python cannot work with these files for two reasons:
Both Gzip and 7zip are able to open these "padded" files without issue. (Gzip produces the message decompression OK, trailing garbage ignored, 7zip succeeds silently.)
Both the Gzip and Python docs seem to indicate that this should work: (emphasis mine)
Gzip's format.txt:
It must be possible to
detect the end of the compressed data with any compression method,
regardless of the actual size of the compressed data. In particular,
the decompressor must be able to detect and skip extra data appended
to a valid compressed file on a record-oriented file system, or when
the compressed data can only be read from a device in multiples of a
certain block size.
Python's gzip.GzipFile`:
Calling a GzipFile object’s close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object’s getvalue() method.
Python's zlib.Decompress.unused_data:
A string which contains any bytes past the end of the compressed data. That is, this remains "" until the last byte that contains compression data is available. If the whole string turned out to contain compressed data, this is "", the empty string.
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
Here are the four approaches I've tried. (These examples are Python 3.1, but I've tested 2.5 and 2.7 and had the same problem.)
# approach 1 - gzip.open
with gzip.open(filename) as datafile:
data = datafile.read()
# approach 2 - gzip.GzipFile
with open(filename, "rb") as gzipfile:
with gzip.GzipFile(fileobj=gzipfile) as datafile:
data = datafile.read()
# approach 3 - zlib.decompress
with open(filename, "rb") as gzipfile:
data = zlib.decompress(gzipfile.read()[10:])
# approach 4 - zlib.decompressobj
with open(filename, "rb") as gzipfile:
decompressor = zlib.decompressobj()
data = decompressor.decompress(gzipfile.read()[10:])
Am I doing something wrong?
UPDATE
Okay, while the problem with gzip seems to be a bug in the module, my zlib problems are self-inflicted. ;-)
While digging into gzip.py I realized what I was doing wrong — by default, zlib.decompress et al. expect zlib-wrapped streams, not bare deflate streams. By passing in a negative value for wbits, you can tell zlib to skip the zlib header and decrompress the raw stream. Both of these work:
# approach 5 - zlib.decompress with negative wbits
with open(filename, "rb") as gzipfile:
data = zlib.decompress(gzipfile.read()[10:], -zlib.MAX_WBITS)
# approach 6 - zlib.decompressobj with negative wbits
with open(filename, "rb") as gzipfile:
decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
data = decompressor.decompress(gzipfile.read()[10:])
This is a bug. The quality of the gzip module in Python falls far short of the quality that should be required in the Python standard library.
The problem here is that the gzip module assumes that the file is a stream of gzip-format files. At the end of the compressed data, it starts from scratch, expecting a new gzip header; if it doesn't find one, it raises an exception. This is wrong.
Of course, it is valid to concatenate two gzip files, eg:
echo testing > test.txt
gzip test.txt
cat test.txt.gz test.txt.gz > test2.txt.gz
zcat test2.txt.gz
# testing
# testing
The gzip module's error is that it should not raise an exception if there's no gzip header the second time around; it should simply end the file. It should only raise an exception if there's no header the first time.
There's no clean workaround without modifying the gzip module directly; if you want to do that, look at the bottom of the _read method. It should set another flag, eg. reading_second_block, to tell _read_gzip_header to raise EOFError instead of IOError.
There are other bugs in this module. For example, it seeks unnecessarily, causing it to fail on nonseekable streams, such as network sockets. This gives me very little confidence in this module: a developer who doesn't know that gzip needs to function without seeking is badly unqualified to implement it for the Python standard library.
I had a similar problem in the past. I wrote a new module that works better with streams. You can try that out and see if it works for you.
I had exactly this problem, but none of this answers resolved my issue. So, here is what I did to solve the problem:
#for gzip files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS|16)
#for zlib files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS)
#automatic header detection (zlib or gzip):
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS|32)
Depending on your case, it might be necessary to decode your data, like:
unzipped = unzipped.decode()
https://docs.python.org/3/library/zlib.html
I couldn't make it to work with the above mentioned techniques. so made a work around using zipfile package
import zipfile
from io import BytesIO
mock_file = BytesIO(data) #data is the compressed string
z = zipfile.ZipFile(file = mock_file)
neat_data = z.read(z.namelist()[0])
Works perfect

Reading an entire binary file into Python

I need to import a binary file from Python -- the contents are signed 16-bit integers, big endian.
The following Stack Overflow questions suggest how to pull in several bytes at a time, but is this the way to scale up to read in a whole file?
Reading some binary file in Python
Receiving 16-bit integers in Python
I thought to create a function like:
from numpy import *
import os
def readmyfile(filename, bytes=2, endian='>h'):
totalBytes = os.path.getsize(filename)
values = empty(totalBytes/bytes)
with open(filename, 'rb') as f:
for i in range(len(values)):
values[i] = struct.unpack(endian, f.read(bytes))[0]
return values
filecontents = readmyfile('filename')
But this is quite slow (the file is 165924350 bytes). Is there a better way?
Use numpy.fromfile.
I would directly read until EOF (it means checking for receiving an empty string), removing then the need to use range() and getsize.
Alternatively, using xrange (instead of range) should improve things, especially for memory usage.
Moreover, as Falmarri suggested, reading more data at the same time would improve performance quite a lot.
That said, I would not expect miracles, also because I am not sure a list is the most efficient way to store all that amount of data.
What about using NumPy's Array, and its facilities to read/write binary files? In this link there is a section about reading raw binary files, using numpyio.fread. I believe this should be exactly what you need.
Note: personally, I have never used NumPy; however, its main raison d'etre is exactly handling of big sets of data - and this is what you are doing in your question.
You're reading and unpacking 2 bytes at a time
values[i] = struct.unpack(endian,f.read(bytes))[0]
Why don't you read like, 1024 bytes at a time?
I have had the same kind of problem, although in my particular case I have had to convert a very strange binary format (500 MB) file with interlaced blocks of 166 elements that were 3-bytes signed integers; so I've had also the problem of converting from 24-bit to 32-bit signed integers that slow things down a little.
I've resolved it using NumPy's memmap (it's just a handy way of using Python's memmap) and struct.unpack on large chunk of the file.
With this solution I'm able to convert (read, do stuff, and write to disk) the entire file in about 90 seconds (timed with time.clock()).
I could upload part of the code.
I think the bottleneck you have here is twofold.
Depending on your OS and disc controller, the calls to f.read(2) with f being a bigish file are usually efficiently buffered -- usually. In other words, the OS will read one or two sectors (with disc sectors usually several KB) off disc into memory because this is not a lot more expensive than reading 2 bytes from that file. The extra bytes are cached efficiently in memory ready for the next call to read that file. Don't rely on that behavior -- it might be your bottleneck -- but I think there are other issues here.
I am more concerned about the single byte conversions to a short and single calls to numpy. These are not cached at all. You can keep all the shorts in a Python list of ints and convert the whole list to numpy when (and if) needed. You can also make a single call struct.unpack_from to convert everything in a buffer vs one short at a time.
Consider:
#!/usr/bin/python
import random
import os
import struct
import numpy
import ctypes
def read_wopper(filename,bytes=2,endian='>h'):
buf_size=1024*2
buf=ctypes.create_string_buffer(buf_size)
new_buf=[]
with open(filename,'rb') as f:
while True:
st=f.read(buf_size)
l=len(st)
if l==0:
break
fmt=endian[0]+str(l/bytes)+endian[1]
new_buf+=(struct.unpack_from(fmt,st))
na=numpy.array(new_buf)
return na
fn='bigintfile'
def createmyfile(filename):
bytes=165924350
endian='>h'
f=open(filename,"wb")
count=0
try:
for int in range(0,bytes/2):
# The first 32,767 values are [0,1,2..0x7FFF]
# to allow testing the read values with new_buf[value<0x7FFF]
value=count if count<0x7FFF else random.randint(-32767,32767)
count+=1
f.write(struct.pack(endian,value&0x7FFF))
except IOError:
print "file error"
finally:
f.close()
if not os.path.exists(fn):
print "creating file, don't count this..."
createmyfile(fn)
else:
read_wopper(fn)
print "Done!"
I created a file of random shorts signed ints of 165,924,350 bytes (158.24 MB) which comports to 82,962,175 signed 2 byte shorts. With this file, I ran the read_wopper function above and it ran in:
real 0m15.846s
user 0m12.416s
sys 0m3.426s
If you don't need the shorts to be numpy, this function runs in 6 seconds. All this on OS X, python 2.6.1 64 bit, 2.93 gHz Core i7, 8 GB ram. If you change buf_size=1024*2 in read_wopper to buf_size=2**16 the run time is:
real 0m10.810s
user 0m10.156s
sys 0m0.651s
So your main bottle neck, I think, is the single byte calls to unpack -- not your 2 byte reads from disc. You might want to make sure that your data files are not fragmented and if you are using OS X that your free disc space (and here) is not fragmented.
Edit I posted the full code to create then read a binary file of ints. On my iMac, I consistently get < 15 secs to read the file of random ints. It takes about 1:23 to create since the creation is one short at a time.

How can I read a python pickle database/file from C?

I am working on integrating with several music players. At the moment my favorite is exaile.
In the new version they are migrating the database format from SQLite3 to an internal Pickle format. I wanted to know if there is a way to access pickle format files without having to reverse engineer the format by hand.
I know there is the cPickle python module, but I am unaware if it is callable directly from C.
http://www.picklingtools.com/
There is a library called the PicklingTools which I help maintain which might be useful: it allows you to form data structures in C++ that you can then pickle/unpickle ... it is C++, not C, but that shouldn't be a problem these days (assuming you are using the gcc/g++ suite).
The library is a plain C++ library (there are examples of C++ and Python within the distribution showing how to use the library over sockets and files from both C++ and Python), but in general, the basics of pickling to files is available.
The basic idea is that the PicklingTools library gives you "python-like" data structures from C++ so that you can then serialize and deserialize to/from Python/C++. All (?) the basic types: int, long int,string, None, complex, dictionarys, lists, ordered dictionaries and tuples are supported. There are few hooks to do custom classes, but that part is a bit immature: the rest of the library is pretty stable and has been active for 8 (?) years.
Simple example:
#include "chooseser.h"
int main()
{
Val a_dict = Tab("{ 'a':1, 'b':[1,2.2,'three'], 'c':None }");
cout << a_dict["b"][0]; // value of 1
// Dump to a file
DumpValToFile(a_dict, "example.p0", SERIALIZE_P0);
// .. from Python, can load the dictionary with pickle.load(file('example.p0'))
// Get the result back
Val result;
LoadValFromFile(result, "example.p0", SERIALIZE_P0);
cout << result << endl;
}
There is further documentation (FAQ and User's Guide) on the web site.
Hope this is useful:
Gooday,
Richie
http://www.picklingtools.com/
Like Cristian told, you can rather easily embed python code in your C code, see the example here.
Using cPickle is dead easy as well on python you could use something like:
import cPickle
f = open('dbfile', 'rb')
db = cPickle.load(f)
f.close()
# handle db integration
f = open('dbfile', 'wb')
cPickle.dump(db, f)
f.close()
You can embed a Python interpreter in a C program, but I think that the easiest solution is to write a Python script that converts "pickles" in another format, e.g. an SQLite database.

Categories