I'm looking for a way to write some data (a List, Array, etc.) into a binary file. The collection to write into the binary file represents a list of points. Here is what I have tried so far:
Welcome to Scala 2.12.0-M3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40).
Type in expressions for evaluation. Or try :help.
scala> import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
scala> val oos = new ObjectOutputStream(new FileOutputStream("/tmp/f1.data"))
oos: java.io.ObjectOutputStream = java.io.ObjectOutputStream@13bc8645
scala> oos.writeObject(List(1,2,3,4))
scala> oos.close
scala> val ois = new ObjectInputStream(new FileInputStream("/tmp/f1.data"))
ois: java.io.ObjectInputStream = java.io.ObjectInputStream@392a04e7
scala> val s : List[Int] = ois.readObject().asInstanceOf[List[Int]]
s: List[Int] = List(1, 2, 3, 4)
OK, it's working well. The problem is that tomorrow I may need to read this binary file with another language such as Python. Is there a way to produce a more generic binary file that can be read by multiple languages?
Solution
For anyone in the same situation, you can do it like this:
import java.io.RandomAccessFile
import java.nio.ByteBuffer

// Writes the ints as raw 4-byte big-endian values (the JVM's default byte order).
def write2binFile(filename: String, a: Array[Int]) = {
  val channel = new RandomAccessFile(filename, "rw").getChannel
  val bbuffer = ByteBuffer.allocateDirect(a.length * 4)
  val ibuffer = bbuffer.asIntBuffer()
  ibuffer.put(a)
  channel.write(bbuffer)
  channel.close
}
Format for cross-platform sharing of point coordinates allowing selective access by RANGE
Your requirements are:
store data by Scala, read by Python (or other languages)
the data are lists of point coordinates
store the data on AWS S3
allow fetching only part of the data using RANGE request
Data structure to use
The data must be uniform in structure and size per element to allow calculating the position of a given part by means of a RANGE request.
If the Scala format for storing lists/arrays fulfils this requirement, and if the binary format is well defined, you may succeed. If not, you will have to find another format.
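For example, if each point is stored at a fixed size (say two 4-byte big-endian ints, as the write2binFile example above would produce for a flat array of coordinates), the byte range for a slice of points is simple arithmetic. A minimal sketch, assuming a hypothetical object URL and the third-party requests library:

import requests  # any HTTP client with Range support works

POINT_SIZE = 8  # two 4-byte ints per point (x, y) -- an assumption for this sketch

def fetch_points(url, first, count):
    """Fetch raw bytes for points [first, first+count) from a fixed-size binary object."""
    start = first * POINT_SIZE
    end = start + count * POINT_SIZE - 1          # Range header bounds are inclusive
    resp = requests.get(url, headers={'Range': 'bytes=%d-%d' % (start, end)})
    resp.raise_for_status()                       # expect 206 Partial Content
    return resp.content                           # decode with struct elsewhere

# raw = fetch_points('https://example-bucket.s3.amazonaws.com/points.bin', 100, 50)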
Reading binary data in Python
Assuming the format is known, use the Python struct module from the stdlib to read it.
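For example, the ints written by the write2binFile function above are plain 4-byte big-endian values, so they can be decoded like this (the file path is illustrative):

import struct

with open('/tmp/points.bin', 'rb') as f:
    raw = f.read()

count = len(raw) // 4
values = struct.unpack('>%di' % count, raw)  # '>' = big-endian, 'i' = 4-byte int
print(values)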
Alternative approach: split the data into smaller pieces
You are willing to access the data piece by piece, probably expecting one large object on S3 and using HTTP requests with RANGE.
An alternative solution is to split the data into smaller pieces of a reasonable size for fetching (e.g. 64 kB, but you know your use case better), and design a rule for storing them piece by piece on AWS S3 (a sketch follows the note below). You may even use a tree structure for this purpose.
There are some advantages with this approach:
use whatever format you like, e.g. XML, JSON, no need to deal with special binary formats
pieces can be compressed, you will save some costs
Note that AWS S3 will charge you not only for data transfer but also per request, so each HTTP request using RANGE counts as one.
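A minimal sketch of the splitting idea, assuming the points are serialized as gzipped JSON; the chunk size and key naming rule are made up:

import gzip
import json

CHUNK_POINTS = 4096  # tune so each piece lands near your target size (e.g. ~64 kB)

def make_chunks(points):
    """Yield (s3_key, payload_bytes) pairs for a list of (x, y) points."""
    for i in range(0, len(points), CHUNK_POINTS):
        chunk = points[i:i + CHUNK_POINTS]
        key = 'points/chunk-%06d.json.gz' % (i // CHUNK_POINTS)  # hypothetical naming rule
        payload = gzip.compress(json.dumps(chunk).encode('utf-8'))
        yield key, payload

# for key, payload in make_chunks([(1, 2), (3, 4)]):
#     upload each piece to S3 with your client of choice (e.g. boto3's put_object)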
Cross-platform binary formats to consider
Consider following formats:
BSON
(Google) Protocol Buffers
HDF5
If you visit Wikipedia page for any of those formats, you will find many links to other formats.
Anyway, I am not aware of any such format that uses a uniform size per element, as most of them try to keep the size as small as possible. For this reason they cannot be used in a scenario using RANGE unless some special index file is introduced (which is probably not very feasible).
On the other hand, using these formats with the alternative approach (splitting the data into smaller pieces) should work.
Note: I did some tests in the past regarding storage efficiency and speed of encoding/decoding. From a practical point of view, the best results were achieved using a simple JSON structure (possibly compressed). You find these options on every platform, it is very simple to use, and the speed of encoding/decoding is high (I do not say the highest).
Related
I'm trying to build a Python program that will parse the conversion codes from a Universe Database. These conversion codes are highly dense strings that encode a ton of information, and I'm struggling to get started. The different permutations of the conversion codes likely number in the hundreds of thousands (though I haven't done the math).
I'm familiar with argparse, however I can't come up with a way of handling this kind of parsing with argparse, and my Google-fu hasn't come up with any other solution.
Initially, my lazy work around was to just do a dictionary lookup for the most common conversion codes, but now that we're using this Python program for more data, it's becoming a huge chore to maintain each individual conversion code.
For example, date conversion codes may take forms like:
Date: D[n][*m][s][fmt[[f1, f2, f3, f4, f5]]][E][L], e.g. D2/ or D4-2/RM
Datetime: DT[4|D|4D|T|TS|Z][;timezone], e.g. DTZ or DT4;America/Denver
Datetime ISO: DTI[B][R|W][S][Z][2|1|0][;[timezone|offset]], e.g. DTIBZ2 or DTIR;America/Denver
And there are a bunch of other conversion codes, with equally complicated parameters.
My end goal is to be able to convert Universe's string data into the appropriate Python object and back again, and in order to do that, I need to understand these conversion codes.
If it helps, I do not need to validate these conversion codes. Once they are set in the database, they are validated there.
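For illustration, a rough sketch of how the DT form above might be decomposed with a regular expression; the pattern only covers the DT example line (not the D or DTI forms), and the grouping is an assumption:

import re

# Matches DT[4|D|4D|T|TS|Z][;timezone], e.g. 'DTZ' or 'DT4;America/Denver'.
DT_RE = re.compile(r'^DT(4D|TS|[4DTZ])?(?:;(.+))?$')

def parse_dt(code):
    m = DT_RE.match(code)
    if m is None:
        return None
    variant, timezone = m.groups()
    return {'variant': variant, 'timezone': timezone}

print(parse_dt('DTZ'))                 # {'variant': 'Z', 'timezone': None}
print(parse_dt('DT4;America/Denver'))  # {'variant': '4', 'timezone': 'America/Denver'}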
I recommend that you use the readnamedfields/writenamedfields methods; these will return OCONV data, and when you write back they handle the ICONV.
import u2py
help( u2py.File.readnamedfields)
Help on function readnamedfields in module u2py:
readnamedfields(self, *args)
F.readnamedfields(recordid, fieldnames, [lockflag]) -> new DynArray object -- read the specified fields by name of a record in the file
fieldnames is a u2py.DynArray object with each of its fields being a name defined in the dictionary file
lockflag is either 0 (default), or [LOCK_EXCLUSIVE or LOCK_SHARED] [ + LOCK_WAIT]
note: if fieldnames contains names that are not defined in the dictionary, these names are replaced by #ID and no exception is raised
I often keep track of duplicates with something like this:
processed = set()
for big_string in strings_generator:
    if big_string not in processed:
        processed.add(big_string)
        process(big_string)
I am dealing with massive amounts of data so don't want to maintain the processed set in memory. I have a version that uses sqlite to store the data on disk, but then this process runs much slower.
To cut down on memory use, what do you think of using hashes like this:
processed = set()
for big_string in string_generator:
    key = hash(big_string)
    if key not in processed:
        processed.add(key)
        process(big_string)
The drawback is I could lose data through occasional hash collisions.
1 collision in 1 billion hashes would not be a problem for my use.
I tried the md5 hash but found generating the hashes became a bottleneck.
What would you suggest instead?
I'm going to assume you are hashing web pages. You have to hash at most 55 billion web pages (and that measure almost certainly overlooks some overlap).
You are willing to accept a less than one in a billion chance of collision, which means that if we look at a hash function whose number of collisions is close to what we would get if the hash were truly random[^1], we want a hash range of size (55*10^9)*10^9. That is log2((55*10^9)*10^9) ≈ 66 bits.
[^1]: since the hash can be considered to be chosen at random for this purpose,
p(collision) = (occupied range)/(total range)
Since there is a speed issue but no real cryptographic concern, we can use a >66-bit non-cryptographic hash with the nice collision distribution property outlined above.
It looks like we are looking for the 128-bit version of the Murmur3 hash. People have been reporting speed increases upwards of 12x comparing Murmur3_128 to MD5 on a 64-bit machine. You can use this library to do your speed tests. See also this related answer, which:
shows speed test results in the range of Python's str hash, whose speed you have already deemed acceptable elsewhere – though Python's hash is a 32-bit hash, leaving you only 2^32/(10^9) (that is, only 4) values stored with a less than one in a billion chance of collision.
spawned a library of python bindings that you should be able to use directly.
Finally, I hope to have outlined the reasoning that could allow you to compare with other functions of varied size should you feel the need for it (e.g. if you up your collision tolerance, if the size of your indexed set is smaller than the whole Internet, etc, ...).
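A minimal sketch of the de-duplication loop from the question using a 128-bit Murmur3 digest, assuming the third-party mmh3 bindings and their hash128 function (the exact API is an assumption):

import mmh3  # third-party Murmur3 bindings, e.g. `pip install mmh3` (assumed API)

processed = set()
for big_string in strings_generator:
    key = mmh3.hash128(big_string)   # 128-bit integer digest
    if key not in processed:
        processed.add(key)
        process(big_string)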
You have to decide which is more important: space or time.
If time, then you need to create unique representations of your large_item which take as little space as possible (probably some str value) that is easy (i.e. quick) to calculate and will not have collisions, and store them in a set.
If space, find the quickest disk-backed solution you can and store the smallest possible unique value that will identify a large_item.
So either way, you want small unique identifiers -- depending on the nature of large_item this may be a big win, or not possible.
Update
they are strings of html content
Perhaps a hybrid solution then: Keep a set in memory of the normal Python hash, while also keeping the actual html content on disk, keyed by that hash; when you check to see if the current large_item is in the set and get a positive, double-check with the disk-backed solution to see if it's a real hit or not, then skip or process as appropriate. Something like this:
import dbf

on_disk = dbf.Table('/tmp/processed_items', 'hash N(17,0); value M')
index = on_disk.create_index(lambda rec: rec.hash)
fast_check = set()

def slow_check(hashed, item):
    matches = on_disk.search((hashed,))
    for record in matches:
        if item == record.value:
            return True
    return False

for large_item in many_items:
    hashed = hash(large_item)  # only calculate once
    if hashed not in fast_check or not slow_check(hashed, large_item):
        on_disk.append((hashed, large_item))
        fast_check.add(hashed)
        process(large_item)
FYI: dbf is a module I wrote which you can find on PyPI
If many_items already resides in memory, you are not creating another copy of the large_item. You are just storing a reference to it in the processed set.
If many_items is a file or some other generator, you'll have to look at other alternatives.
E.g. if many_items is a file, perhaps you can store a pointer to the item's position in the file instead of the actual item, as in the sketch below.
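A minimal sketch of that idea, assuming many_items lives in a plain text file with one item per line (the file name is made up, and process is the function from the question):

offsets = {}  # hash -> list of byte offsets of items seen so far

with open('many_items.txt', 'rb') as items, open('many_items.txt', 'rb') as lookup:
    while True:
        pos = items.tell()
        line = items.readline()
        if not line:
            break
        h = hash(line)
        duplicate = False
        for off in offsets.get(h, []):
            lookup.seek(off)                 # re-read the earlier candidate from disk
            if lookup.readline() == line:
                duplicate = True
                break
        if not duplicate:
            offsets.setdefault(h, []).append(pos)
            process(line)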
As you have already seen, there are a few options, but unfortunately none of them can fully address the situation, partly because of the following:
Memory constraints: storing the entire set of objects in memory is not possible.
There is no perfect hash function, and for a huge data set the chance of collision is real.
Better hash functions (md5) are slower.
Using a database like sqlite would actually make things slower.
As I read the following excerpt:
I have a version that uses sqlite to store the data on disk, but then this process runs much slower.
I feel that if you work on this, it might help you marginally. Here is how it should be done:
Use tmpfs to create a ramdisk. tmpfs has several advantages over other implementations because it supports swapping of less-used space to swap space.
Store the sqlite database on the ramdisk.
Change the size of the ramdisk and profile your code to check your performance.
I suppose you already have working code to save your data in sqlite. You only need to define a tmpfs and use its path to store your database.
Caveat: this is a Linux-only solution.
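A minimal sketch of the idea in Python, assuming a tmpfs is already mounted at /mnt/ramdisk (the mount point and table layout are assumptions):

import sqlite3

# The database file lives on the ramdisk, so reads/writes avoid the spinning disk.
conn = sqlite3.connect('/mnt/ramdisk/processed.db')
conn.execute('CREATE TABLE IF NOT EXISTS processed (big_string TEXT PRIMARY KEY)')

def seen_before(big_string):
    """Insert the string if unseen; return True if it was already present."""
    try:
        with conn:
            conn.execute('INSERT INTO processed VALUES (?)', (big_string,))
        return False
    except sqlite3.IntegrityError:
        return True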
a bloom filter? http://en.wikipedia.org/wiki/Bloom_filter
Well, you can always decorate large_item with a processed flag, or something similar.
You can give the str type's __hash__ function a try.
In [1]: hash('http://stackoverflow.com')
Out[1]: -5768830964305142685
It's definitely not a cryptographic hash function, but with a little luck you won't have too many collisions. It works as described here: http://effbot.org/zone/python-hash.htm.
I suggest you profile standard Python hash functions and choose the fastest: they are all "safe" against collisions enough for your application.
Here are some benchmarks for hash, md5 and sha1:
In [37]: very_long_string = 'x' * 1000000
In [39]: %timeit hash(very_long_string)
10000000 loops, best of 3: 86 ns per loop
In [40]: from hashlib import md5, sha1
In [42]: %timeit md5(very_long_string).hexdigest()
100 loops, best of 3: 2.01 ms per loop
In [43]: %timeit sha1(very_long_string).hexdigest()
100 loops, best of 3: 2.54 ms per loop
md5 and sha1 are comparable in speed. hash is 20k times faster for this string and it does not seem to depend much on the size of the string itself.
How does your sqlite version work? If you insert all your strings into a database table and then run the query "select distinct big_string from table_name", the database should optimize it for you.
Another option for you would be to use Hadoop.
Another option could be to split the strings into partitions such that each partition is small enough to fit in memory. Then you only need to check for duplicates within each partition. The formula you use to decide the partition will choose the same partition for each duplicate. The easiest way is to just look at the first few characters of the string, e.g.:
from collections import defaultdict

d = defaultdict(int)
for big_string in strings_generator:
    d[big_string[:4]] += 1
print d
Now you can decide on your partitions, go through the generator again, and write each big_string to a file that has the start of the big_string in its filename. Then you can just use your original method on each file and loop through all the files, as in the sketch below.
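A rough sketch of the two-pass partitioning idea, assuming each big_string is a single line and its first characters are safe to use in a filename (directory and naming scheme are made up):

import os
from collections import defaultdict

PART_DIR = 'partitions'          # hypothetical working directory
if not os.path.isdir(PART_DIR):
    os.makedirs(PART_DIR)

# Pass 1: write each string into the file named after its prefix.
files = {}
for big_string in strings_generator:
    prefix = big_string[:4]
    if prefix not in files:
        files[prefix] = open(os.path.join(PART_DIR, prefix), 'w')
    files[prefix].write(big_string + '\n')
for f in files.values():
    f.close()

# Pass 2: de-duplicate each partition independently; each one fits in memory.
for name in os.listdir(PART_DIR):
    processed = set()
    with open(os.path.join(PART_DIR, name)) as part:
        for big_string in part:
            big_string = big_string.rstrip('\n')
            if big_string not in processed:
                processed.add(big_string)
                process(big_string)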
This can be achieved much more easily by performing simpler checks first, then investigating those cases with more elaborate checks. The example below contains extracts of your code, but it performs the checks on much smaller sets of data. It does this by first matching on a simple case that is cheap to check, and if you find that (filesize, checksum) pairs are not discriminating enough, you can easily swap in a cheap yet more rigorous check.
# Need to define the following functions
def GetFileSize(filename):
    pass

def GenerateChecksum(filename):
    pass

def LoadBigString(filename):
    pass

# Returns a list of duplicate pairs.
def GetDuplicates(filename_list):
    duplicates = list()
    # Maps a quick hash to a list of filenames.
    filename_lists_by_quick_checks = dict()
    for filename in filename_list:
        quickcheck = GetQuickCheck(filename)
        if not filename_lists_by_quick_checks.has_key(quickcheck):
            filename_lists_by_quick_checks[quickcheck] = list()
        filename_lists_by_quick_checks[quickcheck].append(filename)
    for quickcheck, filename_list in filename_lists_by_quick_checks.iteritems():
        big_strings = GetBigStrings(filename_list)
        duplicates.extend(GetBigstringDuplicates(big_strings))
    return duplicates

def GetBigstringDuplicates(strings_generator):
    processed = set()
    for big_string in strings_generator:
        if big_string not in processed:
            processed.add(big_string)
            process(big_string)

# Returns a tuple containing (filesize, checksum).
def GetQuickCheck(filename):
    return (GetFileSize(filename), GenerateChecksum(filename))

# Returns a list of big_strings from a list of filenames.
def GetBigStrings(file_list):
    big_strings = list()
    for filename in file_list:
        big_strings.append(LoadBigString(filename))
    return big_strings
If I have a pair of floats, is it any more efficient (computationally or storage-wise) to store them as a GeoPtProperty than it would be pickle the tuple and store it as a BlobProperty?
If GeoPt is doing something more clever to keep multiple values in a single property, can it be leveraged for arbitrary data? Can I store the tuple ("Johnny", 5) in a single entity property in a similarly efficient manner?
Here are some empirical answers:
GeoPtProperty uses 31B of storage space.
Using BlobProperty varies based on what exactly you store:
struct.pack('>2f', lat, lon) => 21B.
Using pickle (v2) to pack a 2-tuple containing floats => 37B.
Using pickle (v0) to pack a 2-tuple containing floats => about 30B-32B (v0 uses a variable-length ASCII encoding for floats).
In short, it doesn't look like GeoPt is doing anything particularly clever. If you are going to be storing a lot of these, then you could use struct to pack your floats. Packing and unpacking them with struct will probably be unnoticeably different from the CPU cost associated with serializing/deserializing GeoPt.
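For reference, a minimal sketch of the struct-based packing described above (the helper names are made up; wiring the bytes into a BlobProperty is left to your model code):

import struct

def pack_latlon(lat, lon):
    """Pack a (lat, lon) pair into 8 bytes as big-endian 32-bit floats."""
    return struct.pack('>2f', lat, lon)

def unpack_latlon(blob):
    return struct.unpack('>2f', blob)

blob = pack_latlon(39.74, -104.99)
print(len(blob))            # 8 bytes before datastore overhead
print(unpack_latlon(blob))  # note: 32-bit floats lose some precision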
If you plan on storing lots of floats per entity and space is really important, then you might consider leveraging the CompressedBlobProperty in aetycoon.
Disclaimer: This is the minimum space required. Actual space will be slightly larger per property based on the length of the property's name. The model itself also adds overhead (for its name and key).
GeoPt itself is limited to (-90 to 90, -180 to 180); it can't be used to store any data that won't fit this model.
However, a custom tuple property shouldn't be too difficult to create yourself; take a look at how SetProperty and ArrayProperty are designed in aetycoon.
Hey. I have a function I want to memoize; however, it has too many possible values. Is there any convenient way to store the values in a text file and read from them? For example, something like storing a pre-computed list of primes up to 10^9 in a text file? I know it's slow to read from a text file, but there's no other option if the amount of data is really huge. Thanks!
For a list of primes up to 10**9, why do you need a hash? What would the KEYS be?! Sounds like a perfect opportunity for a simple, straightforward binary file! By the Prime Number Theorem, there are about 10**9/ln(10**9) such primes -- i.e. 50 million or a bit less. At 4 bytes per prime, that's only 200 MB or less -- perfect for an array.array("L") and its methods such as fromfile, etc. (see the docs). In many cases you could actually suck all of the 200 MB into memory, but, worst case, you can get a slice of those (e.g. via mmap and the fromstring method of array.array), do binary searches there (e.g. via bisect), etc.
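A minimal sketch of that binary-file approach using array.array and bisect (the file name is illustrative, and it assumes every prime fits in a 4-byte unsigned int):

import bisect
from array import array

primes = array('I', [2, 3, 5, 7, 11, 13])      # 'I' = unsigned int, typically 4 bytes

# Write the raw array to disk once...
with open('primes.bin', 'wb') as f:
    primes.tofile(f)

# ...and load it back later (fromfile needs the element count).
loaded = array('I')
with open('primes.bin', 'rb') as f:
    loaded.fromfile(f, len(primes))

def is_prime(n):
    """Membership test via binary search on the sorted array."""
    i = bisect.bisect_left(loaded, n)
    return i < len(loaded) and loaded[i] == n

print(is_prime(11), is_prime(12))  # True False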
When you DO need a huge key-values store -- gigabytes, not a paltry 200 MB!-) -- I used to recommend shelve but after unpleasant real-life experience with huge shelves (performance, reliability, etc), I currently recommend a database engine instead -- sqlite is good and comes with Python, PostgreSQL is even better, non-relational ones such as CouchDB can be better still, and so forth.
You can use the shelve module to store a dictionary like structure in a file. From the Python documentation:
import shelve
d = shelve.open(filename) # open -- file may get suffix added by low-level
# library
d[key] = data # store data at key (overwrites old data if
# using an existing key)
data = d[key] # retrieve a COPY of data at key (raise KeyError
# if no such key)
del d[key] # delete data stored at key (raises KeyError
# if no such key)
flag = key in d # true if the key exists
klist = list(d.keys()) # a list of all existing keys (slow!)
# as d was opened WITHOUT writeback=True, beware:
d['xx'] = [0, 1, 2] # this works as expected, but...
d['xx'].append(3) # *this doesn't!* -- d['xx'] is STILL [0, 1, 2]!
# having opened d without writeback=True, you need to code carefully:
temp = d['xx'] # extracts the copy
temp.append(5) # mutates the copy
d['xx'] = temp # stores the copy right back, to persist it
# or, d=shelve.open(filename,writeback=True) would let you just code
# d['xx'].append(5) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.
d.close() # close it
You could also just go with the ultimate brute force, and create a Python file with just a single statement in it:
seedprimes = [3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,
79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173, ...
and then just import it. (Here is a file with the primes up to 1e5: http://python.pastebin.com/f177ec30).
from primes_up_to_1e9 import seedprimes
For Project Euler, I stored a precomputed list of primes up to 10**8 in a text file just by writing them in comma-separated format. It worked well for that size, but it doesn't scale well to much larger limits.
If your "huge" is not really that huge, I would use something simple like that; otherwise I would go with shelve, as the others have said.
Just naively sticking a hash table onto disk will result in about 5 orders of magnitude performance loss compared to an in-memory implementation (or at least 3 if you have an SSD). When dealing with hard disks you'll want to extract every bit of data locality and caching you can get.
The correct choice will depend on details of your use case. How much performance do you need? What kind of operations do you need on data-structure? Do you need to only check if the table contains a key or do you need to fetch a value based on the key? Can you precompute the table or do you need to be able to modify it on the fly? What kind of hit rate are you expecting? Can you filter out a significant amount of the operations using a bloom filter? Are the requests uniformly distributed or do you expect some kind of temporal locality? Can you predict the locality clusters ahead of time?
If you don't need ultimate performance or can parallelize and throw hardware at the problem check out some distributed key-value stores.
You can also go one step down the ladder and use pickle. Shelve imports from pickle (link), so if you don't need the added functionality of shelve, this may spare you some clock cycles (although they really don't matter to you, as you have chosen Python to do large-number storing).
Let's see where the bottleneck is. When you're going to read a file, the hard drive has to turn enough to be able to read from it; then it reads a big block and caches the results.
So you want some method that will guess exactly what position in the file you're going to read from and then do it exactly once. I'm pretty sure the standard DB modules will work for you, but you can do it yourself -- just open the file in binary mode for reading/writing and store your values as, say, 30-digit (= 100-bit = 13-byte) numbers.
Then use the standard file methods, as sketched below.
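A minimal sketch of such a fixed-width record file, storing each value as 13 big-endian bytes so the n-th value can be read with a single seek (file name and helper names are made up):

RECORD_SIZE = 13  # 13 bytes ~= 100 bits per stored number

def write_values(path, values):
    with open(path, 'wb') as f:
        for v in values:
            f.write(v.to_bytes(RECORD_SIZE, 'big'))

def read_value(path, index):
    """Read the index-th value with one seek and one small read."""
    with open(path, 'rb') as f:
        f.seek(index * RECORD_SIZE)
        return int.from_bytes(f.read(RECORD_SIZE), 'big')

write_values('values.bin', [10**29, 12345, 2**99])
print(read_value('values.bin', 2))  # 2**99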
More specific dupe of 875228—Simple data storing in Python.
I have a rather large dict (6 GB) and I need to do some processing on it. I'm trying out several document clustering methods, so I need to have the whole thing in memory at once. I have other functions to run on this data, but the contents will not change.
Currently, every time I think of new functions, I have to write them and then re-generate the dict. I'm looking for a way to write this dict to a file, so that I can load it into memory instead of recalculating all its values.
To oversimplify things, it looks something like:
{((('word','list'),(1,2),(1,3)),(...)):0.0, ....}
I feel that Python must have a better way than me looping through some string looking for : and ( and trying to parse it into a dictionary.
Why not use Python pickle?
Python has a great serializing module called pickle. It is very easy to use.
import cPickle
cPickle.dump(obj, open('save.p', 'wb'))
obj = cPickle.load(open('save.p', 'rb'))
There are two disadvantages with pickle:
It's not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
The format is not human readable.
If you are using Python 2.6, there is a built-in module called json. It is as easy as pickle to use:
import json
encoded = json.dumps(obj)
obj = json.loads(encoded)
The JSON format is human readable and is very similar to the dictionary string representation in Python. It doesn't have any security issues like pickle, but it might be slower than cPickle.
I'd use shelve, json, yaml, or whatever, as suggested by other answers.
shelve is especially cool because you can have the dict on disk and still use it. Values will be loaded on demand.
But if you really want to parse the text of the dict, and it contains only strings, ints and tuples like you've shown, you can use ast.literal_eval to parse it. It is a lot safer, since you can't eval full expressions with it - It only works with strings, numbers, tuples, lists, dicts, booleans, and None:
>>> import ast
>>> print ast.literal_eval("{12: 'mydict', 14: (1, 2, 3)}")
{12: 'mydict', 14: (1, 2, 3)}
I would suggest that you use YAML for your file format so you can tinker with it on disk.
How does it look:
- It is indent based
- It can represent dictionaries and lists
- It is easy for humans to understand
An example (a dict holding a list and a string) is shown in the sketch at the end of this answer.
Full syntax: http://www.yaml.org/refcard.html
To get it in Python, just easy_install pyyaml. See http://pyyaml.org/
It comes with easy file save/load functions that I can't remember right this minute.
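For what it's worth, a minimal sketch of the save/load round trip with PyYAML; the example data (a dict holding a list and a string) matches the description above, and the file name is made up:

import yaml  # pip install pyyaml

data = {'words': ['word', 'list'], 'label': 'some text', 'score': 0.0}

# Save: safe_dump writes plain YAML that is easy to inspect and edit by hand.
with open('data.yaml', 'w') as f:
    yaml.safe_dump(data, f)

# Load it back.
with open('data.yaml') as f:
    restored = yaml.safe_load(f)

print(restored == data)  # True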
Write it out in a serialized format, such as pickle (a Python standard library module for serialization) or perhaps by using JSON (which is a representation that can be evaluated to produce the memory representation again).
This solution at SourceForge uses only standard Python modules:
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
The compression bonus will probably reduce your 6 GB dictionary to 1 GB. If you do not want to store a series of dictionaries, the module also contains a file.gz solution which might be more suitable given your dictionary size.
Here are a few alternatives depending on your requirements:
numpy stores your plain data in a compact form and performs group/mass operations well
shelve is like a large dict backed up by a file
some 3rd party storage module, e.g. stash, stores arbitrary plain data
proper database, e.g. mongodb for hairy data, or mysql or sqlite for plain data and faster retrieval
For Unicode characters use:
data = [{'key': 1, 'text': 'some text'}]
f = open(path_to_file, 'w', encoding='utf8')
json.dump(data, f, ensure_ascii=False)
f.close()
f = open(path_to_file, encoding="utf8")
data = json.load(f)
print(data)
[{'key': 1, 'text': 'some text'}]