I am currently working on a side project where I am converting some code from python to rust.
In python, we can do something like:
Python code->
data = b'commit'+b' '+b'\x00'
print(data)
Output->
b'commit \x00'
Is there any way to achieve this in rust? As I need to concatenate some b'' and store them in a file.
Thanks in advance.
I tried using the + operator, but it didn't work and showed an error like: cannot add &[u8;6] with &[u8;1]
You have a number of options for combining binary strings. However, it sounds like the best option for your use case is the write! macro. It lets you write bytes the same way you would use the format! and println! macros, and it has the benefit of requiring no additional allocation. The write! macro uses UTF-8 encoding (the same as strings in Rust).
One thing to keep in mind, though, is that write! will perform lots of small calls through the std::io::Write trait. To prevent your program from slowing down due to frequent calls to the OS when writing files, you will want to buffer your data using a BufWriter.
As for zlib, this is definitely doable. In Rust, the most popular library for handling deflate, gzip, and zlib is the flate2 crate. Using it, we can wrap a writer so the compression happens inline with the rest of the writing process. This lets us save memory and keeps our code tidy.
use std::fs::File;
use std::io::{self, BufWriter, Write};

use flate2::write::ZlibEncoder;
use flate2::Compression;

pub fn save_commits(path: &str, results: &[&str]) -> io::Result<()> {
    // Open the file and wrap it in a buffered writer
    let file = BufWriter::new(File::create(path)?);

    // Wrap the buffered file in a zlib encoder so the data we write is
    // compressed with zlib as we go
    let mut writer = ZlibEncoder::new(file, Compression::best());

    // Write our header
    write!(&mut writer, "commit {}\0", results.len())?;

    // Write commit hashes
    for hash in results {
        write!(&mut writer, "{}\0", hash)?;
    }

    // Finish the zlib stream and flush the BufWriter explicitly. Dropping the
    // writers would also attempt this, but any error would be silently
    // ignored; doing it here lets us propagate it.
    writer.finish()?.flush()?;
    Ok(())
}
Additionally, you may find it helpful to read this answer to learn more about how binary strings work in Rust: https://stackoverflow.com/a/68231883/5987669
from cffi import FFI
ffi = FFI()
header_path = '/usr/include/libelf.h'
with open(header_path) as f:
    ffi.cdef(f.read())
lib = ffi.dlopen('/usr/local/lib/libelf.so')
The code above is the one I am actually struggling with. To use some functions of libelf, I need to wrap the library and the header. After a long time of research, this seems to be the right approach.
But I get a parsing error:
cannot parse "#ifndef _LIBELF_H"
It seems that all expressions of this kind cause parsing errors. How can I solve this problem? Or is there another approach to wrapping both the library and the header?
ffi.cdef() is not capable of handling preprocessor directives. The purpose of ffi.cdef() is to specify objects that are shared between python and C. It is not compiled (this example does not call any C compiler). Either you eliminate all preprocessor directives from your filestream f or you cherrypick those header parts that you actually need and copy-paste them into your ffi.cdef().
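As a rough illustration of the first option, the directives can be filtered out of the header text before it reaches ffi.cdef(). This is only a sketch: strip_preprocessor_directives is a hypothetical helper (not part of cffi), and the header text is made up. It is deliberately blunt: it keeps the contents of every #if/#else branch and drops all macros, so a real system header may still contain declarations that cdef() cannot parse. Cherry-picking the declarations you need is usually the more reliable route.

```python
def strip_preprocessor_directives(header_text):
    """Drop every preprocessor directive (including backslash-continued
    lines) from C header text. Hypothetical helper, not part of cffi."""
    kept = []
    in_directive = False
    for line in header_text.splitlines():
        if in_directive or line.lstrip().startswith("#"):
            # A trailing backslash continues the directive on the next line.
            in_directive = line.rstrip().endswith("\\")
            continue
        kept.append(line)
    return "\n".join(kept)


# Made-up header text used only for this illustration.
header = """#ifndef _DEMO_H
#define _DEMO_H
#define BIG_MACRO(x) \\
    ((x) + 1)
int demo_version(void);
#endif
"""
cleaned = strip_preprocessor_directives(header)
print(cleaned)
```

The cleaned string is what you would then pass to ffi.cdef() instead of f.read().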
I am reading binary data from a file in python and trying to send that data to a c module. In python the data is read like so
file = open("data", "rb")
data = file.read()
I want the data as a pointer to a buffer and a length in c if possible. I am using PyArg_ParseTuple to get the parameters in the c module. I noticed in python 3+ there is a y/y*/y# format specifier for binary data but I need the equivalent way of doing it in python 2.7.
Thanks
You should investigate the buffer API.
From the docs:
These functions can be used by an object to expose its data in a raw, byte-oriented format. Clients of the object can use the buffer interface to access the object data directly, without needing to copy it first.
Two examples of objects that support the buffer interface are strings and arrays. The string object exposes the character contents in the buffer interface’s byte-oriented form. An array can also expose its contents, but it should be noted that array elements may be multi-byte values.
For example (in C++):
void* ExtractBuffer(PyObject* bufferInterfaceObject, Py_buffer& bufferStruct)
{
    if (PyObject_GetBuffer(bufferInterfaceObject, &bufferStruct, PyBUF_SIMPLE) == -1)
        return 0;
    return (void*)bufferStruct.buf;
}
Do not forget to release the bufferStruct when you no longer need it:
PyBuffer_Release(&bufferStruct);
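The same buffer interface can be observed from the Python side through memoryview, which may be a quick way to check that an object exposes its bytes before writing the C code. This is only an illustration with made-up data, not a replacement for the C call above:

```python
data = bytearray(b"commit \x00")

# memoryview goes through the buffer interface, so no copy of the
# underlying bytes is made.
view = memoryview(data)
length = len(view)  # the length a C client would see

# Writing through the view changes the original object, which shows
# that both names refer to the same memory.
view[0:1] = b"C"
first_bytes = bytes(data[:6])
```

On the C side, the corresponding pointer and length are the buf and len fields of the Py_buffer struct filled in by PyObject_GetBuffer.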
How to get Python source code representation of in-memory Python dictionary?
I decided to ask this question after reading Thomas Kluyver's comment on Rob Galanakis' blog post titled Why bother with python and config files? In his comment Thomas states
But if you want any way to change settings inside the application
(like a preferences dialog), there’s no good way to automatically
write a correct Python file.
Assuming it uses only "basic" Python types, you can write out the repr() of the structure, and then use ast.literal_eval() to read it back in after.
As the article says, you're better off using JSON/YAML or other formats, but if you seriously wanted to use a Python dict and are only using basic Python types...
Writing out (attempt to use pformat to try and make it more human readable):
from pprint import pformat # instead of using repr()
d = dict(enumerate('abcdefghijklmnopqrstuvwxyz'))
with open('somefile.py', 'w') as f:  # the file must be opened for writing
    f.write(pformat(d))
Reading back:
from ast import literal_eval
d = literal_eval(open('somefile.py').read())
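Put together, the round trip can be checked end to end. The sketch below assumes a made-up settings dictionary and writes to a temporary directory; both are illustrative choices rather than anything from the question.

```python
import os
import tempfile
from ast import literal_eval
from pprint import pformat

# Made-up settings containing only basic Python types.
settings = {"name": "demo", "sizes": [1, 2, 3], "nested": {"debug": False, "ratio": 0.5}}

# Hypothetical file location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "settings.py")

with open(path, "w") as f:
    f.write(pformat(settings))

with open(path) as f:
    restored = literal_eval(f.read())

print(restored == settings)
```

literal_eval only accepts literals, so this stops working as soon as the dictionary holds anything other than basic types (custom objects, functions, and so on).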
I have a full inverted index in form of nested python dictionary. Its structure is :
{word : { doc_name : [location_list] } }
For example let the dictionary be called index, then for a word " spam ", entry would look like :
{ spam : { doc1.txt : [102,300,399], doc5.txt : [200,587] } }
I used this structure as python dict are pretty optimised and it makes programming easier.
For any word 'spam', the documents containing it can be given by:
index['spam'].keys()
and posting list for a document doc1 by:
index['spam']['doc1']
At present I am using cPickle to store and load this dictionary. But the pickled file is around 380 MB and takes a long time to load: about 112 seconds (timed with time.time()), and memory usage goes up to 1.2 GB (GNOME System Monitor). Once it loads, it's fine. I have 4 GB of RAM.
len(index.keys()) gives 229758
Code
import cPickle as pickle
f = open('full_index','rb')
print 'Loading index... please wait...'
index = pickle.load(f) # This takes ages
print 'Index loaded. You may now proceed to search'
How can I make it load faster? I only need to load it once, when the application starts. After that, the access time is important to respond to queries.
Should I switch to a database like SQLite and create an index on its keys? If yes, how do I store the values to have an equivalent schema, which makes retrieval easy. Is there anything else that I should look into ?
Addendum
Using Tim's answer pickle.dump(index, file, -1) the pickled file is considerably smaller - around 237 MB (took 300 seconds to dump)... and takes half the time to load now (61 seconds ... as opposed to 112 s earlier .... time.time())
But should I migrate to a database for scalability ?
As for now I am marking Tim's answer as accepted.
PS :I don't want to use Lucene or Xapian ...
This question refers to Storing an inverted index. I had to ask a new question because I wasn't able to delete the previous one.
Try the protocol argument when using cPickle.dump/cPickle.dumps. From cPickle.Pickler.__doc__:
Pickler(file, protocol=0) -- Create a pickler.
This takes a file-like object for writing a pickle data stream.
The optional proto argument tells the pickler to use the given
protocol; supported protocols are 0, 1, 2. The default
protocol is 0, to be backwards compatible. (Protocol 0 is the
only protocol that can be written to a file opened in text
mode and read back successfully. When using a protocol higher
than 0, make sure the file is opened in binary mode, both when
pickling and unpickling.)
Protocol 1 is more efficient than protocol 0; protocol 2 is
more efficient than protocol 1.
Specifying a negative protocol version selects the highest
protocol version supported. The higher the protocol used, the
more recent the version of Python needed to read the pickle
produced.
The file parameter must have a write() method that accepts a single
string argument. It can thus be an open file object, a StringIO
object, or any other custom object that meets this interface.
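To see what the protocol argument does to the size of the output, a small comparison can help. The sketch below is written for Python 3's pickle (on Python 2 the import would be cPickle, as in the question); the index is made-up data in the same shape as the question's, and pickled_sizes is a hypothetical helper.

```python
import pickle  # on Python 2, "import cPickle as pickle" is the fast variant


def pickled_sizes(obj):
    """Return {protocol: size_in_bytes} for protocol 0 and the highest one."""
    return {
        protocol: len(pickle.dumps(obj, protocol))
        for protocol in (0, pickle.HIGHEST_PROTOCOL)
    }


# Made-up index with the same {word: {doc_name: [locations]}} shape.
index = {
    "word%d" % w: {"doc%d.txt" % d: list(range(d, d + 200, 10)) for d in range(3)}
    for w in range(100)
}

sizes = pickled_sizes(index)
restored = pickle.loads(pickle.dumps(index, -1))  # -1 selects the highest protocol

for protocol, size in sorted(sizes.items()):
    print("protocol %d: %d bytes" % (protocol, size))
```

When writing to a file instead of a string, remember to open it in binary mode ('wb' and 'rb') for any protocol above 0, as the docstring says.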
Converting JSON or YAML will probably take longer than pickling most of the time - pickle stores native Python types.
Do you really need it to load all at once? If you don't need all of it in memory, but only the select parts you want at any given time, you may want to map your dictionary to a set of files on disk instead of a single file… or map the dict to a database table. So, if you are looking for something that saves large dictionaries of data to disk or to a database, and can utilize pickling and encoding (codecs and hashmaps), then you might want to look at klepto.
klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For large data, I often choose to represent the dictionary as a directory on my filesystem, and have each entry be a file. klepto also offers caching algorithms, so if you are using a filesystem backend for the dictionary you can avoid some speed penalty by utilizing memory caching.
>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True)
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # is set to cache to memory, so use 'dump' to dump to the filesystem
>>> demo.dump()
>>> del demo
>>>
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>
>>>
klepto also has other flags such as compression and memmode that can be used to customize how your data is stored (e.g. compression level, memory map mode, etc).
It's equally easy (the same exact interface) to use a (MySQL, etc) database as a backend instead of your filesystem. You can also turn off memory caching, so every read/write goes directly to the archive, simply by setting cached=False.
klepto provides access to customizing your encoding, by building a custom keymap.
>>> from klepto.keymaps import *
>>>
>>> s = stringmap(encoding='hex_codec')
>>> x = [1,2,'3',min]
>>> s(x)
'285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c29'
>>> p = picklemap(serializer='dill')
>>> p(x)
'\x80\x02]q\x00(K\x01K\x02U\x013q\x01c__builtin__\nmin\nq\x02e\x85q\x03.'
>>> sp = s+p
>>> sp(x)
'\x80\x02UT28285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c292c29q\x00.'
klepto also provides a lot of caching algorithms (like mru, lru, lfu, etc.) to help you manage your in-memory cache, and will use the algorithm to do the dump and load to the archive backend for you.
You can use the flag cached=False to turn off memory caching completely, and directly read and write to and from disk or database. If your entries are large enough, you might choose to write to disk, where you put each entry in its own file. Here's an example that does both.
>>> from klepto.archives import dir_archive
>>> # does not hold entries in memory, each entry will be stored on disk
>>> demo = dir_archive('demo', {}, serialized=True, cached=False)
>>> demo['a'] = 10
>>> demo['b'] = 20
>>> demo['c'] = min
>>> demo['d'] = [1,2,3]
However while this should greatly reduce load time, it might slow overall execution down a bit… it's usually better to specify the maximum amount to hold in memory cache and pick a good caching algorithm. You have to play with it to get the right balance for your needs.
Get klepto here: https://github.com/uqfoundation
A common pattern in Python 2.x is to have one version of a module implemented in pure Python, with an optional accelerated version implemented as a C extension; for example, pickle and cPickle. This places the burden of importing the accelerated version and falling back on the pure Python version on each user of these modules. In Python 3.0, the accelerated versions are considered implementation details of the pure Python versions. Users should always import the standard version, which attempts to import the accelerated version and falls back to the pure Python version. The pickle / cPickle pair received this treatment.
Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.
Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
If your dictionary is huge and should only be compatible with Python 3.4 or higher, use:
pickle.dump(obj, file, protocol=4)
pickle.load(file, encoding="bytes")
or:
Pickler(file, 4).dump(obj)
Unpickler(file).load()
That said, in 2010 the json module was 25 times faster at encoding and 15 times faster at decoding simple types than pickle. My 2014 benchmark says marshal > pickle > json, but marshal's coupled to specific Python versions.
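Since such rankings depend heavily on the Python version and on the shape of the data, it may be worth re-running the comparison on your own data. The sketch below is one way to do that; round_trip_seconds and the sample dictionary are made up for illustration, and no particular ordering of the results is implied.

```python
import json
import marshal
import pickle
import timeit


def round_trip_seconds(obj, repeat=3):
    """Time one dump+load cycle per serializer; returns {name: best_seconds}."""
    candidates = {
        "marshal": lambda: marshal.loads(marshal.dumps(obj)),
        "pickle": lambda: pickle.loads(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)),
        "json": lambda: json.loads(json.dumps(obj)),
    }
    return {
        name: min(timeit.repeat(func, number=1, repeat=repeat))
        for name, func in candidates.items()
    }


# Made-up sample data; replace it with a slice of the real index.
sample = {"word%d" % i: {"doc.txt": list(range(50))} for i in range(1000)}
timings = round_trip_seconds(sample)
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print("%-8s %.4f s" % (name, seconds))
```

Note that json only round-trips dictionaries with string keys, which happens to fit an index keyed by words and document names.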
Have you tried using an alternative storage format such as YAML or JSON? Python supports JSON natively from Python 2.6 using the json module I think, and there are third party modules for YAML.
You may also try the shelve module.
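A minimal sketch of how shelve could be used here, assuming each word becomes one key in the shelf. The index contents are the example from the question, and the temporary path is only for illustration.

```python
import os
import shelve
import tempfile

index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc2.txt": [7]},
}

# Hypothetical location; a real application would use a fixed path.
path = os.path.join(tempfile.mkdtemp(), "index_shelf")

# Build step: store each word's postings under its own key.
db = shelve.open(path)
for word, postings in index.items():
    db[word] = postings
db.close()

# Query step: only the requested entry is unpickled, not the whole index.
db = shelve.open(path)
spam_docs = sorted(db["spam"].keys())
db.close()

print(spam_docs)
```

The trade-off is that every lookup now reads from disk and unpickles one entry, so individual queries are slower than with an in-memory dict, while startup no longer has to load everything.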
Depending on how long 'long' is, you have to think about the trade-offs: either have all data ready in memory after a (long) startup, or load only partial data (then you need to split up the data into multiple files or use SQLite or something like it). I doubt that loading all data upfront from e.g. SQLite into a dictionary will bring any improvement.
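If you go the SQLite route and query per word instead of loading everything, one possible schema is a single table with one row per occurrence and an index on the word column. This is only a sketch: the table name, column names, and the lookup helper are assumptions, and the data is the example from the question.

```python
import sqlite3

index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc2.txt": [7]},
}

conn = sqlite3.connect(":memory:")  # a file path would be used in practice
conn.execute("CREATE TABLE postings (word TEXT, doc TEXT, position INTEGER)")
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?)",
    [
        (word, doc, position)
        for word, docs in index.items()
        for doc, positions in docs.items()
        for position in positions
    ],
)
# Index on word so lookups do not scan the whole table.
conn.execute("CREATE INDEX postings_word ON postings (word)")
conn.commit()


def lookup(word):
    """Rebuild the {doc_name: [locations]} mapping for a single word."""
    result = {}
    rows = conn.execute(
        "SELECT doc, position FROM postings WHERE word = ? ORDER BY doc, position",
        (word,),
    )
    for doc, position in rows:
        result.setdefault(doc, []).append(position)
    return result


print(lookup("spam"))
```

lookup('spam') plays the role of index['spam'], so the rest of the search code could stay mostly unchanged.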