I was trying to find a fast way to sort strings in Python and the locale is a non-concern i.e. I just want to sort the array lexically according to the underlying bytes. This is perfect for something like radix sort. Here is my MWE
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N) :
things = [f%x for x in range(numGrp)]
return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e7)
K=100
id3 = randChar("id%010d", N//K, N) # small groups (char)
timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds
As you can see it took 6.8 seconds which is almost 10x slower than R's radix sort below.
N = 1e7
K = 100
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE)
system.time(sort(id3,method="radix"))
I understand that Python's .sort() doesn't use radix sort, is there an implementation somewhere that allows me to sort strings as performantly as R?
AFAIK both R and Python "intern" strings so any optimisations in R can also be done in Python.
The top google result for "radix sort strings python" is this gist which produced an error when sorting on my test array.
It is true that R interns all strings, meaning it has a "global character cache" which serves as a central dictionary of all strings ever used by your program. This has its advantages: the data takes less memory, and certain algorithms (such as radix sort) can take advantage of this structure to achieve higher speed. This is particularly true for the scenarios such as in your example, where the number of unique strings is small relative to the size of the vector. On the other hand it has its drawbacks too: the global character cache prevents multi-threaded write access to character data.
In Python, afaik, only string literals are interned. For example:
>>> 'abc' is 'abc'
True
>>> x = 'ab'
>>> (x + 'c') is 'abc'
False
In practice it means that, unless you've embedded data directly into the text of the program, nothing will be interned.
Now, for your original question: "what is the fastest way to sort strings in python"? You can achieve very good speeds, comparable with R, with python datatable package. Here's the benchmark that sorts N = 10⁸ strings, randomly selected from a set of 1024:
import datatable as dt
import pandas as pd
import random
from time import time
n = 10**8
src = ["%x" % random.getrandbits(10) for _ in range(n)]
f0 = dt.Frame(src)
p0 = pd.DataFrame(src)
f0.to_csv("test1e8.csv")
t0 = time(); f1 = f0.sort(0); print("datatable: %.3fs" % (time()-t0))
t0 = time(); src.sort(); print("list.sort: %.3fs" % (time()-t0))
t0 = time(); p1 = p0.sort_values(0); print("pandas: %.3fs" % (time()-t0))
Which produces:
datatable: 1.465s / 1.462s / 1.460s (multiple runs)
list.sort: 44.352s
pandas: 395.083s
The same dataset in R (v3.4.2):
> require(data.table)
> DT = fread("test1e8.csv")
> system.time(sort(DT$C1, method="radix"))
user system elapsed
6.238 0.585 6.832
> system.time(DT[order(C1)])
user system elapsed
4.275 0.457 4.738
> system.time(setkey(DT, C1)) # sort in-place
user system elapsed
3.020 0.577 3.600
Jeremy Mets posted in the comments of this blog post that Numpy can sort string fairly by converting the array to np.araray. This indeed improve performance, however it is still slower than Julia's implementation.
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N) :
things = [f%x for x in range(numGrp)]
return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e7)
K=100
id3 = np.array(randChar("id%010d", N//K, N)) # small groups (char)
timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds
I've got a folder full of very large files that need to be byte flipped by a power of 4. So essentially, I need to read the files as a binary, adjust the sequence of bits, and then write a new binary file with the bits adjusted.
In essence, what I'm trying to do is read a hex string hexString that looks like this:
"00112233AABBCCDD"
And write a file that looks like this:
"33221100DDCCBBAA"
(i.e. every two characters is a byte, and I need to flip the bytes by a power of 4)
I am very new to python and coding in general, and the way I am currently accomplishing this task is extremely inefficient. My code currently looks like this:
import binascii
with open(myFile, 'rb') as f:
content = f.read()
hexString = str(binascii.hexlify(content))
flippedBytes = ""
inc = 0
while inc < len(hexString):
flippedBytes += file[inc + 6:inc + 8]
flippedBytes += file[inc + 4:inc + 6]
flippedBytes += file[inc + 2:inc + 4]
flippedBytes += file[inc:inc + 2]
inc += 8
..... write the flippedBytes to file, etc
The code I pasted above accurately accomplishes what I need (note, my actual code has a few extra lines of: "hexString.replace()" to remove unnecessary hex characters - but I've left those out to make the above easier to read). My ultimate problem is that it takes EXTREMELY long to run my code with larger files. Some of my files I need to flip are almost 2gb in size, and the code was going to take almost half a day to complete one single file. I've got dozens of files I need to run this on, so that timeframe simply isn't practical.
Is there a more efficient way to flip the HEX values in a file by a power of 4?
.... for what it's worth, there is a tool called WinHEX that can do this manually, and only takes a minute max to flip the whole file.... I was just hoping to automate this with python so we didn't have to manually use WinHEX each time
You want to convert your 4-byte integers from little-endian to big-endian, or vice-versa. You can use the struct module for that:
import struct
with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
while True:
d = infile.read(4)
if not d:
break
le = struct.unpack('<I', d)
be = struct.pack('>I', *le)
of.write(be)
Here is a little struct awesomeness to get you started:
>>> import struct
>>> s = b'\x00\x11\x22\x33\xAA\xBB\xCC\xDD'
>>> a, b = struct.unpack('<II', s)
>>> s = struct.pack('>II', a, b)
>>> ''.join([format(x, '02x') for x in s])
'33221100ddccbbaa'
To do this at full speed for a large input, use struct.iter_unpack
I want to build a small formatter in python giving me back the numeric
values embedded in lines of hex strings.
It is a central part of my formatter and should be reasonable fast to
format more than 100 lines/sec (each line about ~100 chars).
The code below should give an example where I'm currently blocked.
'data_string_in_orig' shows the given input format. It has to be
byte swapped for each word. The swap from 'data_string_in_orig' to
'data_string_in_swapped' is needed. In the end I need the structure
access as shown. The expected result is within the comment.
Thanks in advance
Wolfgang R
#!/usr/bin/python
import binascii
import struct
## 'uint32 double'
data_string_in_orig = 'b62e000052e366667a66408d'
data_string_in_swapped = '2eb60000e3526666667a8d40'
print data_string_in_orig
packed_data = binascii.unhexlify(data_string_in_swapped)
s = struct.Struct('<Id')
unpacked_data = s.unpack_from(packed_data, 0)
print 'Unpacked Values:', unpacked_data
## Unpacked Values: (46638, 943.29999999943209)
exit(0)
array.arrays have a byteswap method:
import binascii
import struct
import array
x = binascii.unhexlify('b62e000052e366667a66408d')
y = array.array('h', x)
y.byteswap()
s = struct.Struct('<Id')
print(s.unpack_from(y))
# (46638, 943.2999999994321)
The h in array.array('h', x) was chosen because it tells array.array to regard the data in x as an array of 2-byte shorts. The important thing is that each item be regarded as being 2-bytes long. H, which signifies 2-byte unsigned short, works just as well.
This should do exactly what unutbu's version does, but might be slightly easier to follow for some...
from binascii import unhexlify
from struct import pack, unpack
orig = unhexlify('b62e000052e366667a66408d')
swapped = pack('<6h', *unpack('>6h', orig))
print unpack('<Id', swapped)
# (46638, 943.2999999994321)
Basically, unpack 6 shorts big-endian, repack as 6 shorts little-endian.
Again, same thing that unutbu's code does, and you should use his.
edit Just realized I get to use my favorite Python idiom for this... Don't do this either:
orig = 'b62e000052e366667a66408d'
swap =''.join(sum([(c,d,a,b) for a,b,c,d in zip(*[iter(orig)]*4)], ()))
# '2eb60000e3526666667a8d40'
The swap from 'data_string_in_orig' to 'data_string_in_swapped' may also be done with comprehensions without using any imports:
>>> d = 'b62e000052e366667a66408d'
>>> "".join([m[2:4]+m[0:2] for m in [d[i:i+4] for i in range(0,len(d),4)]])
'2eb60000e3526666667a8d40'
The comprehension works for swapping byte order in hex strings representing 16-bit words. Modifying it for a different word-length is trivial. We can make a general hex digit order swap function also:
def swap_order(d, wsz=4, gsz=2 ):
return "".join(["".join([m[i:i+gsz] for i in range(wsz-gsz,-gsz,-gsz)]) for m in [d[i:i+wsz] for i in range(0,len(d),wsz)]])
The input params are:
d : the input hex string
wsz: the word-size in nibbles (e.g for 16-bit words wsz=4, for 32-bit words wsz=8)
gsz: the number of nibbles which stay together (e.g for reordering bytes gsz=2, for reordering 16-bit words gsz = 4)
import binascii, tkinter, array
from tkinter import *
infile_read = filedialog.askopenfilename()
with open(infile, 'rb') as infile_:
infile_read = infile_.read()
x = (infile_read)
y = array.array('l', x)
y.byteswap()
swapped = (binascii.hexlify(y))
This is a 32 bit unsigned short swap i achieved with code very much the same as "unutbu's" answer just a little bit easier to understand. And technically binascii is not needed for the swap. Only array.byteswap is needed.
I have this text :
2,3,5,1,13,7,17,11,89,1,233,29,61,47,1597,19,37,41,421,199,28657,23,3001,521,53,281,514229,31,557,2207,19801,3571,141961,107,73,9349,135721,2161,2789,211,433494437,43,109441,139,2971215073,1103,97,101,6376021,90481,953,5779,661,14503,797,59,353,2521,4513,3010349,35239681,1087,14736206161,9901,269,67,137,71,6673,103681,9375829,54018521,230686501,29134601,988681,79,157,1601,2269,370248451,99194853094755497,83,9521,6709,173,263,1069,181,741469,4969,4531100550901,6643838879,761,769,193,599786069,197,401,743519377,919,519121,103,8288823481,119218851371,1247833,11128427,827728777,331,1459000305513721,10745088481,677,229,1381,347,29717,709,159512939815855788121,
This are numbers generated from my generator program,now the problem has a source code limit so I can't use the above texts in my solution so I want to compress this and put it into a data-structure in python so that I can print them by indexing like:
F = [`compressed data`]
and F[0] would give 2 F[5] would give 7 like this ... Please suggest me a suitable compression technique.
PS: I am a very newbie to python so please explain your method.
Sure you can do this:
import base64
import zlib
compressed = 'eJwdktkNgDAMQxfqR+5j/8V4QUJQUttx3Nrzl0+f+uunPPpm+Tf3Z/tKX1DM5bXP+wUFA777bCob4HMRfUk14QwfDYPrrA5gcuQB49lQQxdZpdr+1oN2bEA3pW5Nf8NGOFsR19NBszyX7G2raQpkVUEBdbTLuwSRlcDCYiW7GeBaRYJrgImrM3lmI/WsIxFXNd+aszXoRXuZ1PnZRdwKJeqYYYKq6y1++PXOYdgM0TlZcymCOdKqR7HYmYPiRslDr2Sn6C0Wgw+a6MakM2VnBk6HwU6uWqDRz+p6wtKTCg2WsfdKJwfJlHNaFT4+Q7PGfR9hyWK3p3464nhFwpOd7kdvjmz1jpWcxmbG/FJUXdMZgrpzs+jxC11twrBo3TaNgvsf8oqIYwT4r9XkPnNC1XcP7qD5cW7UHSJZ3my5qba+ozncl5kz8gGEEYOQ'
data = zlib.decompress(base64.b64decode(compressed))
Note that this is only 139 characters shorter.
But it works:
>>> data
'2,3,5,1,13,7,17,11,89,1,233,29,61,47,1597,19,37,41,421,199,28657,23,3001,521,53,281,514229,31,557,2207,19801,3571,141961,107,73,9349,135721,2161,2789,211,433494437,43,109441,139,2971215073,1103,97,101,6376021,90481,953,5779,661,14503,797,59,353,2521,4513,3010349,35239681,1087,14736206161,9901,269,67,137,71,6673,103681,9375829,54018521,230686501,29134601,988681,79,157,1601,2269,370248451,99194853094755497,83,9521,6709,173,263,1069,181,741469,4969,4531100550901,6643838879,761,769,193,599786069,197,401,743519377,919,519121,103,8288823481,119218851371,1247833,11128427,827728777,331,1459000305513721,10745088481,677,229,1381,347,29717,709,159512939815855788121,'
If your code limit really is so short, maybe you are supposed to calculate this data or something? What is it?
zlib would get the job done, if you indeed want compression. If you don't want compression, then I'm afraid that my mind-reading skills are on the wane.
On Python 2.4-2.7, pypy, jython:
>>> enc = sdata.encode('zlib').encode('base64')
>>> print enc
eJwdktkNgDAMQxfqR+5j/8V4QUJQUttx3Nrzl0+f+uunPPpm+Tf3Z/tKX1DM5bXP+wUFA777bCob
4HMRfUk14QwfDYPrrA5gcuQB49lQQxdZpdr+1oN2bEA3pW5Nf8NGOFsR19NBszyX7G2raQpkVUEB
dbTLuwSRlcDCYiW7GeBaRYJrgImrM3lmI/WsIxFXNd+aszXoRXuZ1PnZRdwKJeqYYYKq6y1++PXO
YdgM0TlZcymCOdKqR7HYmYPiRslDr2Sn6C0Wgw+a6MakM2VnBk6HwU6uWqDRz+p6wtKTCg2WsfdK
JwfJlHNaFT4+Q7PGfR9hyWK3p3464nhFwpOd7kdvjmz1jpWcxmbG/FJUXdMZgrpzs+jxC11twrBo
3TaNgvsf8oqIYwT4r9XkPnNC1XcP7qD5cW7UHSJZ3my5qba+ozncl5kz8gGEEYOQ
>>> print enc.decode('base64').decode('zlib')[:79]
2,3,5,1,13,7,17,11,89,1,233,29,61,47,1597,19,37,41,421,199,28657,23,3001,521,53
>>> sdata == enc.decode('base64').decode('zlib')
True
>>> F = [int(s) for s in sdata.split(',') if s.strip()]
>>> F[0], F[5]
(2, 7)