I'm using an old version of Python on an embedded platform (Python 1.5.2+ on the Telit platform). My problem is that my function for converting a string to hex is very slow. Here is the function:
def StringToHexString(s):
    strHex = ''
    for c in s:
        strHex = strHex + hexLookup[ord(c)]
    return strHex
hexLookup is a lookup table (a python list) containing all the hex representation of each character.
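The table itself is not shown in the question. A minimal sketch of how such a table might be built on a current Python, assuming two uppercase hex digits per byte value (the digit case and the snake_case function name are assumptions, not the asker's code):

```python
# Hypothetical construction of the lookup table described above:
# entry i holds the two-digit hex representation of byte value i.
hexLookup = ['%02X' % i for i in range(256)]


def string_to_hex_string(s):
    # Collect the pieces in a list and join once at the end.
    parts = []
    for c in s:
        parts.append(hexLookup[ord(c)])
    return ''.join(parts)


print(string_to_hex_string('AB'))  # 4142
```

This uses a list comprehension, which Python 1.5.2 does not have; on that platform the table would have to be filled with an explicit loop.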
I am willing to try everything (a more compact function, some language tricks I don't know about). To be more clear here are the benchmarks (resolution is 1 second on that platform):
N is the number of input characters to be converted to hex and the time is in seconds.
N | Time (seconds)
50 | 1
150 | 3
300 | 4
500 | 8
1000 | 15
1500 | 23
2000 | 31
Yes, I know, it is very slow... but if I could gain something like 1 or 2 seconds, that would be progress.
So any solution is welcome, especially from people who know about Python performance.
Thanks,
Iulian
PS1: (after testing the suggestions offered - keeping the ord call):
def StringToHexString(s):
    hexList = []
    hexListAppend = hexList.append
    for c in s:
        hexListAppend(hexLookup[ord(c)])
    return ''.join(hexList)
With this function I obtained the following times: 1/2/3/5/12/19/27 (which is clearly better)
PS2 (I can't explain why, but it's blazingly fast). A BIG thank you to Sven Marnach for the idea!
def StringToHexString(s):
    return ''.join(map(lambda param: hexLookup[param], map(ord, s)))
Times: 1/1/2/3/6/10/12
Any other ideas/explanations are welcome!
Make your hexLoookup a dictionary indexed by the characters themselves, so you don't have to call ord each time.
Also, don't concatenate to build strings – that used to be slow. Use join on a list instead.
from string import join
def StringToHexString(s):
    strHex = []
    for c in s:
        strHex.append(hexLookup[c])
    return join(strHex, '')
Building on Petr Viktorin's answer, you could further improve the performance by avoiding global variable and attribute look-ups in favour of local variable look-ups. Local variables are optimized to avoid a dictionary look-up on each access. (They haven't always been, but I just double-checked that this optimization was already in place in 1.5.2, released in 1999.)
from string import join
def StringToHexString(s):
    strHex = []
    # Bind the method and the global table to local names once, outside the loop
    strHexappend = strHex.append
    _hexLookup = hexLookup
    for c in s:
        strHexappend(_hexLookup[c])
    return join(strHex, '')
Constantly reassigning and adding strings together using the + operator is very slow. I guess that Python 1.5.2 isn't yet optimizing for this. So using string.join() would be preferable.
Try
import string
def StringToHexString(s):
    listhex = []
    for c in s:
        listhex.append(hexLookup[ord(c)])
    return string.join(listhex, '')
and see if that is any faster.
Try:
from string import join
def StringToHexString(s):
    charlist = []
    for c in s:
        charlist.append(hexLookup[ord(c)])
    return join(charlist, '')
Each string addition takes time proportional to the length of the string built so far. join also takes time proportional to the length of the entire string, but you only have to do it once.
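To put numbers on that argument, here is a small sketch that counts characters copied rather than measuring time. It assumes every `+` builds a brand-new string (current CPython can sometimes extend a string in place, so this is the worst case) and, for simplicity, that each appended piece is one character long. The function names are made up for illustration.

```python
def chars_copied_by_concat(n):
    """Characters copied when a string grows one character at a time,
    assuming every `+` allocates a new string (no in-place shortcut)."""
    total = 0
    length = 0
    for _ in range(n):
        length += 1      # the new string is one character longer
        total += length  # and all of its characters are copied into it
    return total


def chars_copied_by_join(n):
    """join allocates the result once and copies each character once."""
    return n


print(chars_copied_by_concat(2000))  # 2001000
print(chars_copied_by_join(2000))    # 2000
```

The first count grows quadratically with n and the second linearly, which is the difference the answer describes.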
You could also make hexLookup a dict mapping characters to hex values, so you don't have to call ord for every character. It's a micro-optimization, so probably won't be significant.
def StringToHexString(s):
    return ''.join(map(lambda param: hexLookup[param], map(ord, s)))
Seems like this is the fastest! Thank you Sven Marnach!
I was trying to find a fast way to sort strings in Python and the locale is a non-concern i.e. I just want to sort the array lexically according to the underlying bytes. This is perfect for something like radix sort. Here is my MWE
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e7)
K=100
id3 = randChar("id%010d", N//K, N) # small groups (char)
timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds
As you can see it took 6.8 seconds which is almost 10x slower than R's radix sort below.
N = 1e7
K = 100
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE)
system.time(sort(id3,method="radix"))
I understand that Python's .sort() doesn't use radix sort. Is there an implementation somewhere that allows me to sort strings as performantly as R?
AFAIK both R and Python "intern" strings so any optimisations in R can also be done in Python.
The top google result for "radix sort strings python" is this gist which produced an error when sorting on my test array.
It is true that R interns all strings, meaning it has a "global character cache" which serves as a central dictionary of all strings ever used by your program. This has its advantages: the data takes less memory, and certain algorithms (such as radix sort) can take advantage of this structure to achieve higher speed. This is particularly true for the scenarios such as in your example, where the number of unique strings is small relative to the size of the vector. On the other hand it has its drawbacks too: the global character cache prevents multi-threaded write access to character data.
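When the number of unique strings is small relative to the size of the vector, that structure can also be exploited in plain Python without a radix sort. The sketch below is a counting approach, not R's algorithm: it sorts only the distinct values and then repeats each one as often as it occurred. The function name is made up for illustration.

```python
from collections import Counter


def sort_few_unique(strings):
    """Sort a list that contains many repeats of a few distinct strings."""
    counts = Counter(strings)        # one pass to count each distinct value
    out = []
    for s in sorted(counts):         # sort only the distinct values
        out.extend([s] * counts[s])  # emit each value as often as it occurred
    return out


data = ['id3', 'id1', 'id2', 'id1', 'id3', 'id1']
print(sort_few_unique(data))  # ['id1', 'id1', 'id1', 'id2', 'id3', 'id3']
```

Whether this beats list.sort() depends on the data, so it is worth timing on the real input.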
In Python, afaik, only string literals are interned. For example:
>>> 'abc' is 'abc'
True
>>> x = 'ab'
>>> (x + 'c') is 'abc'
False
In practice it means that, unless you've embedded data directly into the text of the program, nothing will be interned.
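Strings built at run time can still be interned explicitly. A minimal Python 3 sketch using sys.intern (the sample strings are made up):

```python
import sys

parts = ['id', '0000000042']
a = ''.join(parts)   # built at run time
b = ''.join(parts)   # equal value, built separately
print(a == b)        # True

ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)      # True: both names now refer to the single interned copy
```

Interning each of 10⁷ strings costs a dictionary lookup per string, so it only pays off if the identity is exploited afterwards.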
Now, for your original question: "what is the fastest way to sort strings in Python"? You can achieve very good speeds, comparable with R, with the Python datatable package. Here's a benchmark that sorts N = 10⁸ strings, randomly selected from a set of 1024:
import datatable as dt
import pandas as pd
import random
from time import time
n = 10**8
src = ["%x" % random.getrandbits(10) for _ in range(n)]
f0 = dt.Frame(src)
p0 = pd.DataFrame(src)
f0.to_csv("test1e8.csv")
t0 = time(); f1 = f0.sort(0); print("datatable: %.3fs" % (time()-t0))
t0 = time(); src.sort(); print("list.sort: %.3fs" % (time()-t0))
t0 = time(); p1 = p0.sort_values(0); print("pandas: %.3fs" % (time()-t0))
Which produces:
datatable: 1.465s / 1.462s / 1.460s (multiple runs)
list.sort: 44.352s
pandas: 395.083s
The same dataset in R (v3.4.2):
> require(data.table)
> DT = fread("test1e8.csv")
> system.time(sort(DT$C1, method="radix"))
user system elapsed
6.238 0.585 6.832
> system.time(DT[order(C1)])
user system elapsed
4.275 0.457 4.738
> system.time(setkey(DT, C1)) # sort in-place
user system elapsed
3.020 0.577 3.600
Jeremy Mets posted in the comments of this blog post that NumPy can sort strings fairly quickly if the list is converted to an np.array. This indeed improves performance; however, it is still slower than Julia's implementation.
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e7)
K=100
id3 = np.array(randChar("id%010d", N//K, N)) # small groups (char)
timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds
Hi everyone / Python Gurus
I would like to know how to accomplish the following task, which so far I've been unable to do.
Here's what I have:
Q1 = 20e-6
This is a number in exponential notation; if you print(Q1) as is, it shows 2e-05, which is fine, mathematically speaking.
However, here's what I want to do with it:
I want Q1 to print only the number 20, followed by a unit chosen from the exponent: uC if it is e-6, or nC if it is e-9.
Here's an example for better understanding:
Q1=20e-6
When I run print(Q1), it should show: 20uC.
Q2=20e-9
When I run print(Q2), it should show: 20nC.
Can you please help me figure this out?
just replace the exponent using str.replace:
q1 = 'XXXXXX'
q1 = q1.replace('e-9', 'nC').replace('e-6', 'uC')
print(q1)
I recommend using si-prefix.
You can install it using pip:
sudo pip install si-prefix
Then you can use something like this:
from si_prefix import si_format
# precision: number of digits after the decimal point
# char: the unit symbol to append after the SI prefix
def get_format(a, char='C', precision=2):
    temp = si_format(a, precision)
    try:
        num, prefix = temp.split()
    except ValueError:
        # no prefix was produced, only a number
        num, prefix = temp, ''
    if '.' in num:
        aa, bb = num.split('.')
        # drop the fractional part when it is all zeros
        if int(bb) == 0:
            num = aa
    if prefix:
        return num + ' ' + prefix + char
    else:
        return num

tests = [20e-6, 21.46e05, 33.32e-10, 0.5e03, 0.33e-2, 112.044e-6]
for k in tests:
    print(get_format(k))
Output:
20 uC
2.15 MC
3.33 nC
500
3.30 mC
112.04 uC
You can try by splitting the string:
'20e-9'.split('e')
gives
['20', '-9']
From there on, you can insert whatever you want in between those values:
('u' if int(a[1]) > 0 else 'n').join(a)
(with a = '20e-9'.split('e'))
You can not. The behaviour you are looking for is called "monkey patching". And this is not allowed for int and float.
You can refer to this stackoverflow question
The only way I can think of is to create a class that extends float and then implement a __str__ method that shows as per your requirement.
------- More explanation -----
if you type
Q1 = 20e-6
in python shell and then
type(Q1)
you will get
float
So basically your Q1 is considered a float by the Python type system.
When you type print(Q1),
the __str__ method of float is called.
Extending a core class is one example of a "monkey patch", and that is what I was referring to.
Now the problem is that you cannot "monkey patch" (or extend, if you prefer that) core classes in Python (which you can in some languages, like Ruby).
[int, float, etc. are core classes and are written in C in the most common Python distribution.]
So how do you solve it?
you need to create a new class like this
class Exponent(float):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return "ok"

x = Exponent(10.0)
print(x)  # prints: ok
hope this helps
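Going one step further than returning "ok", here is a rough sketch of a float subclass whose __str__ produces the 20uC / 20nC output asked for. The class name Charge, the prefix table, and the choice of %g formatting are all assumptions made for illustration.

```python
import math


class Charge(float):
    """Hypothetical float subclass that prints with an SI prefix and 'C'."""

    # Exponent (a multiple of 3) -> SI prefix; only small values are covered.
    _PREFIXES = {-12: 'p', -9: 'n', -6: 'u', -3: 'm', 0: ''}

    def __str__(self):
        value = float(self)
        if value == 0:
            return '0C'
        # Round the decimal exponent down to a multiple of 3.
        exp3 = int(math.floor(math.log10(abs(value)) / 3)) * 3
        # Clamp to the range covered by the prefix table.
        exp3 = max(min(exp3, 0), -12)
        scaled = value / 10.0 ** exp3
        # %g keeps 6 significant digits and drops trailing zeros.
        return '%g%sC' % (scaled, self._PREFIXES[exp3])


print(Charge(20e-6))  # 20uC
print(Charge(20e-9))  # 20nC
```

Arithmetic on a Charge returns a plain float, so the result has to be wrapped in Charge again before printing.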
I find hurry.filesize very useful, but it doesn't give output in decimals.
For example:
print size(4026, system=alternative) gives 3 KB.
But later, when I add all the values, I don't get the exact sum. For example, if the output of hurry.filesize is in 4 variables and each value is 3. If I add them all, I get output as 15.
I am looking for an alternative to hurry.filesize that gives output in decimals too.
This isn't really hard to implement yourself:
suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
def humansize(nbytes):
    i = 0
    while nbytes >= 1024 and i < len(suffixes)-1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])
Examples:
>>> humansize(131)
'131 B'
>>> humansize(1049)
'1.02 KB'
>>> humansize(58812)
'57.43 KB'
>>> humansize(68819826)
'65.63 MB'
>>> humansize(39756861649)
'37.03 GB'
>>> humansize(18754875155724)
'17.06 TB'
Disclaimer: I wrote the package I'm about to describe
The module bitmath supports the functionality you've described. It also addresses the comment made by @filmore that, semantically, we should be using NIST unit prefixes (not SI), that is to say, MiB instead of MB. Rounding is now supported as well.
You originally asked about:
print size(4026, system=alternative)
in bitmath the default prefix-unit system is NIST (1024 based), so, assuming you were referring to 4026 bytes, the equivalent solution in bitmath would look like any of the following:
In [1]: import bitmath
In [2]: print bitmath.Byte(bytes=4026).best_prefix()
3.931640625KiB
In [3]: human_prefix = bitmath.Byte(bytes=4026).best_prefix()
In [4]: print human_prefix.format("{value:.2f} {unit}")
3.93 KiB
I currently have an open task to allow the user to select a preferred prefix-unit system when using the best_prefix method.
Update: 2014-07-16 The latest package has been uploaded to PyPi, and it includes several new features (full feature list is on the GitHub page)
This is not necessarily faster than @nneonneo's solution, it's just a bit cooler, if I can say that :)
import math
suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
def human_size(nbytes):
    human = nbytes
    rank = 0
    if nbytes != 0:
        rank = int((math.log10(nbytes)) / 3)
        rank = min(rank, len(suffixes) - 1)
        human = nbytes / (1024.0 ** rank)
    f = ('%.2f' % human).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[rank])
This works because the integer part of the base-10 logarithm of any number is one less than its number of digits. The rest is pretty much straightforward.
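The digit-count property can be checked directly. A short sketch (digit_count and suffix_rank are illustrative names, and the sample values are taken from the earlier humansize examples):

```python
import math


def digit_count(n):
    """Number of decimal digits of a positive integer, via log10."""
    return int(math.log10(n)) + 1


def suffix_rank(n):
    """Index into the suffix list: one step per three decimal digits."""
    return int(math.log10(n) / 3)


for n in (131, 58812, 68819826):
    print(n, digit_count(n), len(str(n)), suffix_rank(n))
# 131 3 3 0
# 58812 5 5 1
# 68819826 8 8 2
```

Note that the rank steps at powers of 1000 while the value is divided by powers of 1024, so values just above a power of 1000 are shown as a fraction of the next unit.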
I used to reinvent the wheel every time I wrote a little script or ipynb or whatever. It got trite, so I wrote the datasize python module. I'm posting this here because I just updated it, and wow have the Python versions moved up!
It is a DataSize class, which subclasses int, so arithmetic just works, however it returns int from arithmetic because I use it with Pandas and some numpy, and I didn't want to slow things down when there is python<-->C++ translation for matrix math libraries.
You can construct a DataSize object using a string with either SI or NIST suffixes in either bits or bytes, and even weird word lengths if you need to work with data for embedded tech that uses those. The DataSize object has an intuitive format() code syntax for human-readable representation. Internally the value is just an integer count of 8-bit bytes.
eg.
>>> from datasize import DataSize
>>> 'My new {:GB} SSD really only stores {:.2GiB} of data.'.format(DataSize('750GB'),DataSize(DataSize('750GB') * 0.8))
'My new 750GB SSD really only stores 558.79GiB of data.'
I want to build a small formatter in python giving me back the numeric
values embedded in lines of hex strings.
It is a central part of my formatter and should be reasonably fast to
format more than 100 lines/sec (each line about ~100 chars).
The code below should give an example where I'm currently blocked.
'data_string_in_orig' shows the given input format. It has to be
byte swapped for each word. The swap from 'data_string_in_orig' to
'data_string_in_swapped' is needed. In the end I need the structure
access as shown. The expected result is within the comment.
Thanks in advance
Wolfgang R
#!/usr/bin/python
import binascii
import struct
## 'uint32 double'
data_string_in_orig = 'b62e000052e366667a66408d'
data_string_in_swapped = '2eb60000e3526666667a8d40'
print data_string_in_orig
packed_data = binascii.unhexlify(data_string_in_swapped)
s = struct.Struct('<Id')
unpacked_data = s.unpack_from(packed_data, 0)
print 'Unpacked Values:', unpacked_data
## Unpacked Values: (46638, 943.29999999943209)
exit(0)
array.arrays have a byteswap method:
import binascii
import struct
import array
x = binascii.unhexlify('b62e000052e366667a66408d')
y = array.array('h', x)
y.byteswap()
s = struct.Struct('<Id')
print(s.unpack_from(y))
# (46638, 943.2999999994321)
The h in array.array('h', x) was chosen because it tells array.array to regard the data in x as an array of 2-byte shorts. The important thing is that each item be regarded as being 2-bytes long. H, which signifies 2-byte unsigned short, works just as well.
This should do exactly what unutbu's version does, but might be slightly easier to follow for some...
from binascii import unhexlify
from struct import pack, unpack
orig = unhexlify('b62e000052e366667a66408d')
swapped = pack('<6h', *unpack('>6h', orig))
print unpack('<Id', swapped)
# (46638, 943.2999999994321)
Basically, unpack 6 shorts big-endian, repack as 6 shorts little-endian.
Again, same thing that unutbu's code does, and you should use his.
edit Just realized I get to use my favorite Python idiom for this... Don't do this either:
orig = 'b62e000052e366667a66408d'
swap =''.join(sum([(c,d,a,b) for a,b,c,d in zip(*[iter(orig)]*4)], ()))
# '2eb60000e3526666667a8d40'
The swap from 'data_string_in_orig' to 'data_string_in_swapped' may also be done with comprehensions without using any imports:
>>> d = 'b62e000052e366667a66408d'
>>> "".join([m[2:4]+m[0:2] for m in [d[i:i+4] for i in range(0,len(d),4)]])
'2eb60000e3526666667a8d40'
The comprehension works for swapping byte order in hex strings representing 16-bit words. Modifying it for a different word-length is trivial. We can make a general hex digit order swap function also:
def swap_order(d, wsz=4, gsz=2 ):
return "".join(["".join([m[i:i+gsz] for i in range(wsz-gsz,-gsz,-gsz)]) for m in [d[i:i+wsz] for i in range(0,len(d),wsz)]])
The input params are:
d : the input hex string
wsz: the word-size in nibbles (e.g for 16-bit words wsz=4, for 32-bit words wsz=8)
gsz: the number of nibbles which stay together (e.g for reordering bytes gsz=2, for reordering 16-bit words gsz = 4)
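For illustration, here is the same function spread over a few lines, followed by calls with different word and group sizes. The layout is a restatement of the one-liner above; the sample calls are my own.

```python
def swap_order(d, wsz=4, gsz=2):
    # Split the hex string into words of wsz nibbles each.
    words = [d[i:i + wsz] for i in range(0, len(d), wsz)]
    # Within each word, emit the groups of gsz nibbles in reverse order.
    return "".join(
        "".join(w[i:i + gsz] for i in range(wsz - gsz, -gsz, -gsz))
        for w in words
    )


d = 'b62e000052e366667a66408d'
print(swap_order(d))                # 2eb60000e3526666667a8d40
print(swap_order(d, wsz=8, gsz=2))  # 00002eb66666e3528d40667a
print(swap_order(d, wsz=8, gsz=4))  # 0000b62e666652e3408d7a66
```

The first call swaps bytes within 16-bit words, the second reverses bytes within 32-bit words, and the third swaps the two 16-bit halves of each 32-bit word.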
import array
import binascii
from tkinter import filedialog

# Ask the user for a file and read it as raw bytes
infile = filedialog.askopenfilename()
with open(infile, 'rb') as infile_:
    infile_read = infile_.read()
# 'l' is a C long: 4 bytes on some platforms, 8 on others (see y.itemsize)
y = array.array('l', infile_read)
y.byteswap()
swapped = binascii.hexlify(y)
This is a 32-bit swap that I achieved with code very similar to unutbu's answer, just a little bit easier to understand. Technically, binascii is not needed for the swap; only array.byteswap is. Note that the 'l' typecode is only 32 bits wide where a C long is 4 bytes, so check the array's itemsize if you need exactly 32 bits.
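If the word size has to be exactly 32 bits regardless of platform, one alternative is struct, whose standard-size 'I' format is always 4 bytes when a byte-order prefix is given. This is a sketch of a different technique from the array approach above, and swap32 is a made-up name.

```python
import struct


def swap32(data):
    """Reverse the byte order of every 4-byte word in `data` (a bytes object)."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    n = len(data) // 4
    # Read n big-endian 32-bit words, then write them back little-endian.
    words = struct.unpack('>%dI' % n, data)
    return struct.pack('<%dI' % n, *words)


print(swap32(bytes.fromhex('b62e0000')).hex())  # 00002eb6
```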
I know the easiest way is using a regular expression, but I wonder if there are other ways to do this check.
Why do I need this? I am writing a Python script that reads text messages (SMS) from a SIM card. In some situations, hex messages arrives and I need to do some processing for them, so I need to check if a received message is hexadecimal.
When I send following SMS:
Hello world!
And my script receives
00480065006C006C006F00200077006F0072006C00640021
But in some situations, I receive normal text messages (not hex). So I need to check whether a message is hex.
I am using Python 2.6.5.
UPDATE:
The reason for the problem is that (somehow) messages I send are received as hex, while messages sent by the operator (info messages and ads) are received as normal strings. So I decided to add a check to ensure that I have the message in the correct string format.
Some extra details: I am using a Huawei 3G modem and PyHumod to read data from the SIM card.
Possible best solution to my situation:
The best way to handle such strings is using a2b_hex (a.k.a. unhexlify) and the UTF-16 big-endian encoding (as @JonasWielicki mentioned):
from binascii import unhexlify # unhexlify is another name of a2b_hex
mystr = "00480065006C006C006F00200077006F0072006C00640021"
unhexlify(mystr).decode("utf-16-be")
>> u'Hello world!'
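On Python 3, the check and the decoding can be combined, since bytes.fromhex raises ValueError for anything that is not hex. A minimal sketch, assuming hex messages are always UTF-16-BE (decode_sms is a made-up name):

```python
def decode_sms(text):
    """Return the decoded text if `text` looks like hex-encoded UTF-16-BE,
    otherwise return `text` unchanged."""
    try:
        raw = bytes.fromhex(text)
    except ValueError:
        return text              # not hex: treat as a normal message
    if len(raw) % 2:
        return text              # UTF-16 needs an even number of bytes
    try:
        return raw.decode('utf-16-be')
    except UnicodeDecodeError:
        return text


print(decode_sms('00480065006C006C006F00200077006F0072006C00640021'))  # Hello world!
print(decode_sms('Hello world!'))                                      # Hello world!
```

A normal message that happens to consist only of hex digits, such as "1234", would be decoded as well, so this cannot tell the two cases apart with certainty.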
(1) Using int() works nicely for this, and Python does all the checking for you :)
int('00480065006C006C006F00200077006F0072006C00640021', 16)
6896377547970387516320582441726837832153446723333914657L
will work. In case of failure you will receive a ValueError exception.
Short example:
int('af', 16)
175
int('ah', 16)
...
ValueError: invalid literal for int() with base 16: 'ah'
(2) An alternative would be to traverse the data and make sure all characters fall within the range of 0..9 and a-f/A-F. string.hexdigits ('0123456789abcdefABCDEF') is useful for this as it contains both upper and lower case digits.
import string
all(c in string.hexdigits for c in s)
will return either True or False based on the validity of your data in string s.
Short example:
s = 'af'
all(c in string.hexdigits for c in s)
True
s = 'ah'
all(c in string.hexdigits for c in s)
False
Notes:
As @ScottGriffiths notes correctly in a comment below, the int() approach will work if your string contains 0x at the start, while the character-by-character check will fail with this. Also, checking against a set of characters is faster than checking against a string of characters, but it is doubtful this will matter with short SMS strings, unless you process many (many!) of them in sequence, in which case you could convert string.hexdigits to a set with set(string.hexdigits).
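The difference between the two approaches on a 0x-prefixed string can be seen side by side. The function names below are illustrative:

```python
import string

HEX_DIGITS = set(string.hexdigits)  # set membership is faster than string search


def is_hex_int(s):
    """int()-based check: accepts an optional 0x/0X prefix."""
    try:
        int(s, 16)
        return True
    except ValueError:
        return False


def is_hex_chars(s):
    """Character-by-character check: rejects the 'x' in a 0x prefix."""
    return all(c in HEX_DIGITS for c in s)


for s in ('af', 'ah', '0xaf'):
    print(s, is_hex_int(s), is_hex_chars(s))
# af True True
# ah False False
# 0xaf True False
```

The two also disagree on the empty string: int('', 16) raises ValueError, while all() over an empty sequence is True.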
You can:
test whether the string contains only hexadecimal digits (0…9,A…F)
try to convert the string to integer and see whether it fails.
Here is the code:
import string
def is_hex(s):
    hex_digits = set(string.hexdigits)
    # if s is long, then it is faster to check against a set
    return all(c in hex_digits for c in s)

def is_hex(s):
    try:
        int(s, 16)
        return True
    except ValueError:
        return False
I know the op mentioned regular expressions, but I wanted to contribute such a solution for completeness' sake:
import re

def is_hex(s):
    # '+' is required: without a quantifier the pattern matches exactly one digit
    return re.fullmatch(r"[0-9a-fA-F]+", s or "") is not None
Performance
In order to evaluate the performance of the different solutions proposed here, I used Python's timeit module. The input strings are generated randomly for three different lengths, 10, 100, 1000:
s=''.join(random.choice('0123456789abcdef') for _ in range(10))
Levon's solutions:
# int(s, 16)
10: 0.257451018987922
100: 0.40081690801889636
1000: 1.8926858339982573
# all(_ in string.hexdigits for _ in s)
10: 1.2884491360164247
100: 10.047717947978526
1000: 94.35805322701344
Other answers are variations of these two. Using a regular expression:
# re.fullmatch(r'^[0-9a-fA-F]$', s or '')
10: 0.725040541990893
100: 0.7184272820013575
1000: 0.7190397029917222
Picking the right solution thus depends on the length of the input string and whether exceptions can be handled safely. The regular expression certainly handles large strings much faster (and won't throw a ValueError on overflow), but int() is the winner for shorter strings.
One more simple and short solution based on transformation of string to set and checking for subset (doesn't check for '0x' prefix):
import string
def is_hex_str(s):
    return set(s).issubset(string.hexdigits)
More information here.
Another option:
def is_hex(s):
    hex_digits = set("0123456789abcdef")
    for char in s:
        if not (char in hex_digits):
            return False
    return True
Most of the solutions proposed above do not take into account that any decimal integer may be also decoded as hex because decimal digits set is a subset of hex digits set. So Python will happily take 123 and assume it's 0123 hex:
>>> int('123',16)
291
This may sound obvious but in most cases you'll be looking for something that was actually hex-encoded, e.g. a hash and not anything that can be hex-decoded. So probably a more robust solution should also check for an even length of the hex string:
In [1]: def is_hex(s):
   ...:     try:
   ...:         int(s, 16)
   ...:     except ValueError:
   ...:         return False
   ...:     return len(s) % 2 == 0
   ...:
In [2]: is_hex('123')
Out[2]: False
In [3]: is_hex('f123')
Out[3]: True
This will cover the case if the string starts with '0x' or '0X': [0x|0X][0-9a-fA-F]
d='0X12a'
all(c in 'xX' + string.hexdigits for c in d)
True
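The check above accepts an x anywhere in the string. A slightly stricter sketch that allows the prefix only at the start (is_hex_with_prefix is a made-up name):

```python
import string


def is_hex_with_prefix(s):
    """True for hex digits with an optional leading 0x or 0X."""
    if s[:2] in ('0x', '0X'):
        s = s[2:]                 # strip the prefix once, at the start only
    # Require at least one digit after the optional prefix.
    return bool(s) and all(c in string.hexdigits for c in s)


print(is_hex_with_prefix('0X12a'))  # True
print(is_hex_with_prefix('12x3'))   # False
print(is_hex_with_prefix('0x'))     # False
```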
In Python3, I tried:
def is_hex(s):
    try:
        # fromhex raises ValueError for non-hex input
        tmp = bytes.fromhex(s).decode('utf-8')
        return ''.join([i for i in tmp if i.isprintable()])
    except ValueError:
        return ''
This should work better than the int(x, 16) approach.
If you are looking to determine True or False in Python, I would use eumero's is_hex method over Levon's first method. The following code contains a gotcha...
if int(input_string, 16):
    print 'it is hex'
else:
    print 'it is not hex'
It incorrectly reports the string '00' as not hex because zero evaluates to False.
Since all the regular expression timings above were about the same, I would guess that most of the time went into compiling the string into a regular expression. Below is the data I got when pre-compiling the regular expression.
int_hex
0.000800 ms 10
0.001300 ms 100
0.008200 ms 1000
all_hex
0.003500 ms 10
0.015200 ms 100
0.112000 ms 1000
fullmatch_hex
0.001800 ms 10
0.001200 ms 100
0.005500 ms 1000
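The pre-compiled version is not shown above. A minimal sketch of what it might look like, assuming a pattern with a + quantifier so that strings of any length can match (HEX_RE and fullmatch_hex are illustrative names):

```python
import re

# Compile once at import time so each call skips the pattern lookup.
HEX_RE = re.compile(r'[0-9a-fA-F]+')


def fullmatch_hex(s):
    """True if s consists of one or more hex digits and nothing else."""
    return HEX_RE.fullmatch(s or '') is not None


print(fullmatch_hex('00480065006C'))  # True
print(fullmatch_hex('ah'))            # False
```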
Simple solution in case you need a pattern to validate prefixed hex or binary along with decimal
\b(0x[\da-fA-F]+|[\d]+|0b[01]+)\b
Sample: https://regex101.com/r/cN4yW7/14
Then doing int('0x00480065006C006C006F00200077006F0072006C00640021', 0) in python gives
6896377547970387516320582441726837832153446723333914657
The base 0 invokes prefix guessing behaviour.
This has saved me a lot of hassle. Hope it helps!
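A few more cases of the base-0 behaviour, as a short Python 3 sketch (parse_int_literal is a made-up wrapper around int):

```python
def parse_int_literal(text):
    """Parse a string the way Python parses an integer literal."""
    return int(text, 0)


for text in ('0x1f', '0b101', '0o17', '31'):
    print(text, parse_int_literal(text))
# 0x1f 31
# 0b101 5
# 0o17 15
# 31 31

# Without a prefix the string is read as decimal, so bare hex digits fail.
try:
    parse_int_literal('1f')
except ValueError:
    print('1f is rejected')
```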
Most of the solutions do not properly handle strings with the 0x prefix:
>>> is_hex_string("0xaaa")
False
>>> is_hex_string("0x123")
False
>>> is_hex_string("0xfff")
False
>>> is_hex_string("fff")
True
Here's my solution:
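import string

def to_decimal(s):
    '''Input should be an int, a decimal string, or a 0x-prefixed hex string.'''
    if isinstance(s, str):
        # Only treat the string as hex when it carries the 0x/0X prefix;
        # otherwise a decimal string such as '12' would be parsed as 0x12.
        is_hex = s[:2] in ('0x', '0X') and all(c in string.hexdigits for c in s[2:])
        return int(s, 16) if is_hex else int(s)
    else:
        return int(s)

a = to_decimal(12)
b = to_decimal(0x10)
c = to_decimal('12')
d = to_decimal('0x10')
print(a, b, c, d)  # 12 16 12 16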