Python string conversion: take out spaces, add hyphens

I have a column in a pandas data frame that is formatted like
f1 d3 a4 0a d0 6a 4b 4a 83 d4 4f c9 1f 15 11 17
and I want to convert it to look like:
f1d3a40a-d06a-4b4a-83d4-4fc91f151117
I know I can use replace(" ", "") to take the whitespace out, but I am not sure how to insert the hyphens in the exact spots that I need them.
I am also not sure how to apply it to a pandas series object.
Any help would be appreciated!

This looks like a UUID, so I'd just use that module
>>> import uuid
>>> s = 'f1 d3 a4 0a d0 6a 4b 4a 83 d4 4f c9 1f 15 11 17'
>>> uuid.UUID(''.join(s.split()))
UUID('f1d3a40a-d06a-4b4a-83d4-4fc91f151117')
>>> str(uuid.UUID(''.join(s.split())))
'f1d3a40a-d06a-4b4a-83d4-4fc91f151117'
EDIT:
df = pd.DataFrame({'col': ['f1 d3 a4 0a d0 6a 4b 4a 83 d4 4f c9 1f 15 11 17',
                           'f1 d3 a4 0a d0 6a 4b 4a 83 d4 4f c9 1f 15 11 17']})
df['col'] = df['col'].str.split().str.join('').apply(uuid.UUID)
print(df)
                                    col
0  f1d3a40a-d06a-4b4a-83d4-4fc91f151117
1  f1d3a40a-d06a-4b4a-83d4-4fc91f151117
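If you'd rather not depend on the uuid module, the hyphens can also be inserted by plain slicing at the standard 8-4-4-4-12 boundaries (format_uuid is a hypothetical helper name, not anything from the answers above):

```python
def format_uuid(s):
    h = "".join(s.split())  # drop all whitespace, leaving 32 hex digits
    # slice at the standard 8-4-4-4-12 UUID boundaries and rejoin with hyphens
    return "-".join((h[:8], h[8:12], h[12:16], h[16:20], h[20:]))

s = "f1 d3 a4 0a d0 6a 4b 4a 83 d4 4f c9 1f 15 11 17"
print(format_uuid(s))  # f1d3a40a-d06a-4b4a-83d4-4fc91f151117
```

In pandas this maps straight onto the column with df['col'].map(format_uuid).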

a = "f1 d3 a4 0a d0 6a 4b 4a 83 d4 4f c9 1f 15 11 17"
c = "f1d3a40a-d06a-4b4a-83d4-4fc91f151117"
b = [4,2,2,2,6]
def space_2_hyphens(s, num_list, hyphens="-"):
    sarr = s.split(" ")
    if len(sarr) != sum(num_list):
        raise ValueError("number of split parts must equal sum(num_list)")
    out = []
    k = 0
    for n in num_list:
        out.append("".join(sarr[k:k + n]))
        k += n
    return hyphens.join(out)
print(a)
print(space_2_hyphens(a,b))
print(c)

What's a less bad way to write this hex formatting function in Python?

With Python, I wanted to format a string of hex characters:
- spaces between each byte (easy enough): 2f2f -> 2f 2f
- line breaks at a specified max byte width (not hard): 2f 2f 2f 2f 2f 2f 2f 2f\n
- address ranges for each line (doable): 0x7f8-0x808: 2f 2f 2f 2f 2f 2f 2f 2f\n
- replace large ranges of sequential 00 bytes with: ... trimmed 35 x 00 bytes [0x7 - 0x2a] ...

It was at this point that I knew I was doing some bad coding. The function got bloated and hard to follow. Too many features piled up in a non-intuitive way.
Example output:
0x0-0x10: 5a b6 f7 6e 7c 65 45 a0 bc 6a e5 f5 77 2b 92 48
0x10-0x20: 47 d7 33 ea 40 15 44 ac 6b a4 50 78 6e f2 10 d4
0x20-0x30: 9c 7c c1 f7 5a bf ec 9f b0 2b b7 29 97 ee 56 31
0x30-0x40: ff 23 d9 1a 0b 4e fd 65 50 92 42 eb b2 77 7a 55
0x40-0x50:
I'm pretty sure the address ranges aren't correct anymore in certain cases (particularly when the 00 replacement occurs), the function just looks disgusting, and I'm embarrassed to even show it.
def pretty_print_hex(hex_str, byte_width=16, line_start=False, addr=0):
    out = ''
    condense_min = 12
    total_bytes = int(len(hex_str) / 2)
    line_width = False
    if byte_width is not False:
        line_width = byte_width * 2
    if line_start is not False:
        out += line_start
        end = addr + byte_width
        if end > addr + total_bytes:
            end = addr + total_bytes
        out += f"{hex(addr)}-{hex(end)}:\t"
        addr += byte_width
    i = 0
    if len(hex_str) == 1:
        print('Cannot pretty print < 1 byte', hex_str)
        return
    condensing = False
    cond_start_addr = 0
    cond_end_addr = 0
    condense_cache = []
    while i < len(hex_str):
        byte = hex_str[i] + hex_str[i + 1]
        i += 2
        if byte == '00':
            condensing = True
            cond_start_addr = (addr - byte_width) + ((i + 1) % byte_width)
            condense_cache.append(byte)
        else:
            if condensing is True:
                condensed_count = len(condense_cache)
                if condensed_count >= condense_min:
                    cond_end_addr = cond_start_addr + condensed_count
                    out += f"... trimmed {condensed_count} x 00 bytes [{hex(cond_start_addr)} - {hex(cond_end_addr)}] ..."
                else:
                    for byte in condense_cache:
                        out += f"{byte} "
                condense_cache = []
                condensing = False
        if condensing is False:
            out += byte + ' '
            if (line_width is not False) and i % line_width == 0:
                out += '\n'
                if line_start is not False:
                    out += line_start
                    end = addr + byte_width
                    if end > addr + total_bytes:
                        end = addr + total_bytes
                    if (addr - end) != 0:
                        out += f"{hex(addr)}-{hex(end)}:\t"
                    addr += byte_width
    if condensing is True:
        condensed_count = len(condense_cache)
        if condensed_count >= condense_min:
            cond_end_addr = cond_start_addr + condensed_count
            out += f"... trimmed {condensed_count} x 00 bytes [{hex(cond_start_addr)} - {hex(cond_end_addr)}] ..."
        else:
            for byte in condense_cache:
                out += f"{byte} "
    return out.rstrip()
Example input / output:
hex_str = 'c8d8fb631cc7d072b62aaf9cd47bc270d4341e35f23b7a94acf24f33397a6cb4145b6eacfd56653d79bea10d2842023155e5b14bec3b5851a0a58cb3a523c476b126486e1392bdd2e3bcb6cbc333b23de387ae8624123009'
byte_width = 16
line_start = '\t'
addr = 0
print(pretty_print_hex(hex_str, byte_width=16, line_start='\t', addr=0))
0x0-0x10: c8 d8 fb 63 1c c7 d0 72 b6 2a af 9c d4 7b c2 70
0x10-0x20: d4 34 1e 35 f2 3b 7a 94 ac f2 4f 33 39 7a 6c b4
0x20-0x30: 14 5b 6e ac fd 56 65 3d 79 be a1 0d 28 42 02 31
0x30-0x40: 55 e5 b1 4b ec 3b 58 51 a0 a5 8c b3 a5 23 c4 76
0x40-0x50: b1 26 48 6e 13 92 bd d2 e3 bc b6 cb c3 33 b2 3d
0x50-0x60: e3 87 ae 86 24 12 30 09
It gets much worse when you involve some 00 replacement, here's an example of that:
hex_str = 'c8000000000000000000000000000aaf9cd47bc270d4341e35f23b7a94acf24f33397a6cb4145b6eacfd56653d79bea10d2842023155e5b14bec3b5851a0a58cb3a523c476b126486e1392bdd2e3bcb6cbc333b23de387ae8624123009'
byte_width=16
line_start='\t'
addr=0
print(pretty_print_hex(hex_str, byte_width=16, line_start='\t', addr=0))
0x0-0x10: c8 ... trimmed 13 x 00 bytes [0xd - 0x1a] ...0a af
0x10-0x20: 9c d4 7b c2 70 d4 34 1e 35 f2 3b 7a 94 ac f2 4f
0x20-0x30: 33 39 7a 6c b4 14 5b 6e ac fd 56 65 3d 79 be a1
0x30-0x40: 0d 28 42 02 31 55 e5 b1 4b ec 3b 58 51 a0 a5 8c
0x40-0x50: b3 a5 23 c4 76 b1 26 48 6e 13 92 bd d2 e3 bc b6
0x50-0x60: cb c3 33 b2 3d e3 87 ae 86 24 12 30 09
It would also make more sense for the address range (0x0-0x10) to portray the true range, including the trimmed bytes on that line, but I couldn't even begin to think of how to add that in.
Rather than patch this bad looking function, I thought I might ask for a better approach entirely, if one exists.
I would suggest not starting a "trimmed 00 bytes" series in the middle of an output line, but only applying this compacting when it covers complete output lines containing only zeroes.
This means that you will still see non-compacted zeroes in a line that also contains non-zeroes, but in my opinion this results in a cleaner output format. For instance, if a line would end with just two 00 bytes, it really does not help to replace that last part of the line with the longer "trimmed 2 x 00 bytes" message. By only replacing complete 00-lines with this message, and compressing multiple such lines into one message, the output format seems cleaner.
To produce that output format, I would use the power of regular expressions:
- to identify a block of bytes to be output on one line: either a line with at least one non-zero, or a range of zero bytes which either runs to the end of the input, or else is a multiple of the "byte width" argument.
- to insert spaces in a line of bytes

All this can be done through iterations in one expression:
import re

def pretty_print_hex(hex_str, byte_width=16, line_start='\t', addr=0):
    return "\n".join(
        f"{hex(start)}-{hex(last)}:{line_start}{line}"
        for start, last, line in (
            (match.start() // 2,
             match.end() // 2 - 1,
             f"...trimmed {(match.end() - match.start()) // 2} x 00 bytes..."
             if match[1]
             else re.sub("(..)(?!$)", r"\1 ", match[0]))
            for match in re.finditer(
                f"(0+$|(?:(?:00){{{byte_width}}})+)|(?:..){{1,{byte_width}}}",
                hex_str
            )
        )
    )
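As a quick sanity check, running the regex version against the first example input from the question reproduces the expected layout (the function is repeated here so the snippet runs standalone); note the ranges come out inclusive, e.g. 0x0-0xf rather than 0x0-0x10:

```python
import re

def pretty_print_hex(hex_str, byte_width=16, line_start='\t', addr=0):
    # regex-driven version: each match is either a compressible zero run
    # (group 1) or a line's worth of byte pairs
    return "\n".join(
        f"{hex(start)}-{hex(last)}:{line_start}{line}"
        for start, last, line in (
            (match.start() // 2,
             match.end() // 2 - 1,
             f"...trimmed {(match.end() - match.start()) // 2} x 00 bytes..."
             if match[1]
             else re.sub("(..)(?!$)", r"\1 ", match[0]))
            for match in re.finditer(
                f"(0+$|(?:(?:00){{{byte_width}}})+)|(?:..){{1,{byte_width}}}",
                hex_str)))

hex_str = ('c8d8fb631cc7d072b62aaf9cd47bc270d4341e35f23b7a94acf24f33397a6cb4'
           '145b6eacfd56653d79bea10d2842023155e5b14bec3b5851a0a58cb3a523c476'
           'b126486e1392bdd2e3bcb6cbc333b23de387ae8624123009')
print(pretty_print_hex(hex_str))
```

The first line comes out as "0x0-0xf:" followed by the sixteen byte pairs, and the 13-byte zero run from the second example would not be trimmed here, since trimming only kicks in on full 16-byte zero blocks by design.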
If you want to use a tool rather than write one (not sure - tell me to delete this if required), you can use the excellent (I am not associated with it) hexdump package:
https://pypi.org/project/hexdump
python -m hexdump binary.dat
It is super cool - I guess you could also inspect the source for ideas.
It doesn't, however, look like it is still maintained...
I liked the challenge in this function, and this is what I could come up with this evening. It is somewhat shorter than your original one, but not as short as trincot's answer.
def hexpprint(
    hexstring: str,
    width: int = 16,
    hexsep: str = " ",
    addr: bool = False,
    addrstart: int = 0,
    linestart: str = "",
    compress: bool = False,
):
    # if printing addresses, compute the hex address field width
    if addr:
        addrlen = len(f"{addrstart+len(hexstring):x}")
    # compression buffer just counts hex '0' chars
    cbuf = 0
    for i in range(0, len(hexstring), width):
        j = i + width
        row = hexstring[i:j]
        # if using compression and the row is all zeroes, buffer it
        if compress and row.count("0") == len(row):
            cbuf += len(row)
            continue
        # if not compressible and cbuf is non-empty, flush it
        if cbuf:
            line = linestart
            if addr:
                beg = f"0x{addrstart+i-cbuf:0{addrlen}x}"
                end = f"0x{addrstart+i:0{addrlen}x}"
                line += f"{beg}-{end} "
            line += f"compressed {cbuf//2} NULL bytes"
            print(line)
            cbuf = 0
        # print formatted hex row
        line = linestart
        if addr:
            beg = f"0x{addrstart+i:0{addrlen}x}"
            end = f"0x{addrstart+i+len(row):0{addrlen}x}"
            line += f"{beg}-{end} "
        line += hexsep.join(row[k:k + 2] for k in range(0, width, 2))
        print(line)
    # flush cbuf if necessary
    if cbuf:
        line = linestart
        if addr:
            beg = f"0x{addrstart+i-cbuf:0{addrlen}x}"
            end = f"0x{addrstart+len(hexstring):0{addrlen}x}"
            line += f"{beg}-{end} "
        line += f"compressed {cbuf//2} NULL bytes"
        print(line)
PS: I don't really like the code repetition to print things, so I might come back and edit later.

How can I read 32-bit binary numbers from a binary file in python?

I have data files which contain series of 32-bit binary "numbers."
I say "numbers" because the 32 1/0's define what type of data sensors were picking up, when they were, which sensor,etc; so the decimal value of the numbers is of no concern to me. In particular some (most) of the data will begin with possibly up to 5 zeros.
I simply need a way in python to read these files, get a list containing each 32-bit number, and then I'll need to mess around with it a little (delete some events) and rewrite it to a new file.
Can anyone help me with this? I've tried the following so far but the numbers which should be corresponding to the time data we encode seem to be impossible.
with open(lm_file, mode='rb') as file:
    bytes_read = file.read(struct.calcsize("I"))
    while bytes_read:
        idList = struct.unpack("I", bytes_read)[0]
        idList = bin(idList)
        print(idList)
        bytes_read = file.read(struct.calcsize("=l"))
Output of hexdump:
00000000 80 0a 83 4d ba a5 80 0c c0 00 7b 42 cb 90 0f 41 |...M......{B...A|
00000010 98 c9 9c 53 4c 15 35 52 d8 54 f7 0a 5d 87 16 4d |...SL.5R.T..]..M|
00000020 89 6a 3f 04 f2 eb c4 4a e2 37 e6 08 23 5e ca 06 |.j?....J.7..#^..|
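One likely culprit in the attempt above is byte order: "I" with no prefix uses the machine's native endianness (and mixing "I" with "=l" doesn't help). A minimal sketch that reads fixed 4-byte words with an explicit byte order and zero-padded binary output (read_words is a hypothetical helper; ">I" assumes big-endian data, swap to "<I" if the sensors wrote little-endian):

```python
import struct

def read_words(path):
    """Return each 32-bit word in the file as a 32-char binary string."""
    words = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(4)
            if len(chunk) < 4:
                break  # EOF (a trailing partial word is ignored)
            (value,) = struct.unpack(">I", chunk)  # explicit big-endian unsigned 32-bit
            words.append(f"{value:032b}")          # zero-padded, keeps leading zeros
    return words
```

Unlike bin(), the "{value:032b}" format keeps the leading zeros visible, which matters here since the question notes most words begin with up to 5 zeros.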

Pythonic way to split a line into groups of four words? [duplicate]

This question already has answers here:
Alternative way to split a list into groups of n [duplicate]
(6 answers)
Closed 8 years ago.
Suppose I have strings that look like the following, of varying length, but with the number of "words" always equal to a multiple of 4.
9c 75 5a 62 32 3b 3a fe 40 14 46 1c 6e d5 24 de
c6 11 17 cc 3d d7 99 f4 a1 3f 7f 4c
I would like to chop them into strings like 9c 75 5a 62 and 32 3b 3a fe
I could use a regex to match the exact format, but I wonder if there is a more straightforward way to do it, because regex seems like overkill for what should be a simple problem.
A straightforward way of doing this is as follows:
wordlist = words.split()
for i in xrange(0, len(wordlist), 4):
    print ' '.join(wordlist[i:i+4])
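The answer above is Python 2 (xrange, print statement). For reference, the same idea in Python 3, returning a list rather than printing (chunks_of_four is just an illustrative name):

```python
def chunks_of_four(words):
    wordlist = words.split()
    # step through the word list four items at a time
    return [" ".join(wordlist[i:i + 4]) for i in range(0, len(wordlist), 4)]

print(chunks_of_four("9c 75 5a 62 32 3b 3a fe 40 14 46 1c"))
# ['9c 75 5a 62', '32 3b 3a fe', '40 14 46 1c']
```

A trailing group shorter than four words is kept as-is rather than padded.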
If for some reason you can't make a list of all words (e.g. an infinite stream), you could do this:
from itertools import groupby, izip

words = (''.join(g) for k, g in groupby(words, ' '.__ne__) if k)
for g in izip(*[iter(words)] * 4):
    print ' '.join(g)
Disclaimer: I didn't come up with this pattern; I found it in a similar topic a while back. It arguably relies on an implementation detail, but it would be quite a bit more ugly when done in a different way.
A slightly more functional way of doing it, based on the itertools grouper recipe:
for x in grouper(words.split(), 4):
    print ' '.join(x)
giantString = '9c 75 5a 62 32 3b 3a fe 40 14 46 1c 6e d5 24 de c6 11 17 cc 3d d7 99 f4 a1 3f 7f 4c'
splitGiant = giantString.split(' ')
stringHolders = []
for item in xrange(len(splitGiant)/4):
    stringHolders.append(splitGiant[item*4:item*4+4])
stringHolder2 = []
for item in stringHolders:
    stringHolder2.append(' '.join(item))
print stringHolder2
Probably the most long-winded, convoluted way to do this possible.
>>> words = '9c 75 5a 62 32 3b 3a fe 40 14 46 1c 6e d5 24 de'.split()
>>> [' '.join(words[i*4:i*4+4]) for i in range(len(words)/4)]
['9c 75 5a 62', '32 3b 3a fe', '40 14 46 1c', '6e d5 24 de']
Or, based on 1_CR's answer:
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

[' '.join(x) for x in grouper(words.split(), 4)]
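In Python 3, izip_longest became itertools.zip_longest; the recipe is otherwise unchanged:

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n  # n references to the SAME iterator
    return zip_longest(*args, fillvalue=fillvalue)

words = "9c 75 5a 62 32 3b 3a fe"
print([" ".join(x) for x in grouper(words.split(), 4)])
# ['9c 75 5a 62', '32 3b 3a fe']
```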

Diffing Binary Files In Python

I've got two binary files. They look something like this, but the data is more random:
File A:
FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF ...
File B:
41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37 ...
What I'd like is to call something like:
>>> someDiffLib.diff(file_a_data, file_b_data)
And receive something like:
[Match(pos=4, length=4)]
Indicating that in both files the bytes at position 4 are the same for 4 bytes. The sequence 44 43 42 41 would not match because they're not in the same positions in each file.
Is there a library that will do the diff for me? Or should I just write the loops to do the comparison?
You can use itertools.groupby() for this, here is an example:
from itertools import groupby
# this just sets up some byte strings to use, Python 2.x version is below
# instead of this you would use f1 = open('some_file', 'rb').read()
f1 = bytes(int(b, 16) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = bytes(int(b, 16) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())
matches = []
for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i]):
    if k:
        pos = next(g)
        length = len(list(g)) + 1
        matches.append((pos, length))
Or the same thing as above using a list comprehension:
matches = [(next(g), len(list(g)) + 1)
           for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i])
           if k]
Here is the setup for the example if you are using Python 2.x:
f1 = ''.join(chr(int(b, 16)) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = ''.join(chr(int(b, 16)) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())
The provided itertools.groupby solution works fine, but it's pretty slow.
I wrote a pretty naive attempt using numpy and tested it versus the other solution on a particular 16MB file I happened to have, and it was about 42x faster on my machine. Someone familiar with numpy could likely improve this significantly.
import numpy as np

def compare(path1, path2):
    x, y = np.fromfile(path1, np.int8), np.fromfile(path2, np.int8)
    length = min(x.size, y.size)
    x, y = x[:length], y[:length]
    z = np.where(x == y)[0]
    if z.size == 0:
        return z
    borders = np.append(np.insert(np.where(np.diff(z) != 1)[0] + 1, 0, 0), len(z))
    lengths = borders[1:] - borders[:-1]
    starts = z[borders[:-1]]
    return np.array([starts, lengths]).T
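A small sanity check of the numpy version against the byte strings from the groupby example (compare() is reproduced so the snippet runs standalone; it reads from paths, hence the temporary files):

```python
import tempfile

import numpy as np

def compare(path1, path2):
    x, y = np.fromfile(path1, np.int8), np.fromfile(path2, np.int8)
    length = min(x.size, y.size)
    x, y = x[:length], y[:length]
    z = np.where(x == y)[0]  # indices where the bytes agree
    if z.size == 0:
        return z
    # split the agreeing indices into consecutive runs
    borders = np.append(np.insert(np.where(np.diff(z) != 1)[0] + 1, 0, 0), len(z))
    lengths = borders[1:] - borders[:-1]
    starts = z[borders[:-1]]
    return np.array([starts, lengths]).T

f1 = bytes(int(b, 16) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = bytes(int(b, 16) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())
with tempfile.NamedTemporaryFile(delete=False) as a, tempfile.NamedTemporaryFile(delete=False) as b:
    a.write(f1)
    b.write(f2)
print(compare(a.name, b.name).tolist())  # [[4, 4]]: one run at position 4, length 4
```

This agrees with the groupby result above: the 44 43 42 41 sequence does not count because it sits at different offsets in the two files.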

String variable comparison in python

I have 2 variables from an XML file.
Edit: I'm sorry, I pasted the wrong value.
x = "00 25 9E B8 B9 19 "
y = "F0 00 00 25 9E B8 B9 19 "
When I use if x in y: nothing happens,
but if I use if "00 25 9E B8 B9 19 " in y: I get results.
Any ideas?
I am adding my full code:
import xml.etree.ElementTree as ET

tree = ET.parse('c:/sw_xml_test_4a.xml')
root = tree.getroot()
for sw in root.findall('switch'):
    for switch in root.findall('switch'):
        if sw[6].text.rstrip() in switch.find('GE01').text:
            print switch[0].text
        if sw[6].text.strip() in switch.find('GE02').text.strip():
            print switch[0].text
        if sw[6].text.strip() in switch.find('GE03').text.strip():
            print switch[0].text
        if sw[6].text.strip() in switch.find('GE04').text.strip():
            print switch[0].text
XML file detail:
<switch>
<ci_adi>"aaa_bbb_ccc"</ci_adi>
<ip_adress>10.10.10.10</ip_adress>
<GE01>"F0 00 00 25 9E 2C BC 98 "</GE01>
<GE02>"80 00 80 FB 06 C6 A1 2B "</GE02>
<GE03>"F0 00 00 25 9E B8 BB AA "</GE03>
<GE04>"F0 00 00 25 9E B8 BB AA "</GE04>
<bridge_id>"00 25 9E B8 BB AA "</bridge_id>
</switch>
>>> x = "00 25 9E 2C BC 8B"
>>> y = "F0 00 00 25 9E B8 B9 19"
>>> x in y
False
>>> "00 25 9E 2C BC 8B " in y
False
How exactly are you getting results?
Let me explain what in is checking.
in checks whether the entire value of x is contained anywhere within the value of y. As you can see, the entire value of x is NOT contained, in its entirety, in y.
However, some elements of x are, so maybe what you are trying to do is:
>>> x = ["00", "25", "9E", "2C", "BC", "8B"]
>>> y = "F0 00 00 25 9E B8 B9 19"
>>> for item in x:
...     if item in y:
...         print item + " is in " + y
...
00 is in F0 00 00 25 9E B8 B9 19
25 is in F0 00 00 25 9E B8 B9 19
9E is in F0 00 00 25 9E B8 B9 19
The operators in and not in test for collection membership. x in s evaluates to true if x is a member of the collection s, and false otherwise. For strings, this translates to return True if entire string x is a substring of y, else return False.
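For what it's worth, with the exact values posted after the edit, the substring test does succeed, so the original failure almost certainly came from the earlier, mixed-up values or from stray whitespace; a .strip() guards against the latter:

```python
x = "00 25 9E B8 B9 19 "
y = "F0 00 00 25 9E B8 B9 19 "
print(x in y)          # True: y is "F0 00 " followed by x
print(x.strip() in y)  # True as well; .strip() removes stray edge whitespace
```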
Other than a mix-up of values in your question, this seems to work the way you want:
import xml.etree.ElementTree as et

sxml = """\
<switch>
<ci_adi>"aaa_bbb_ccc"</ci_adi>
<ip_adress>10.10.10.10</ip_adress>
<GE01>"F0 00 00 25 9E 2C BC 98 "</GE01>
<GE02>"80 00 80 FB 06 C6 A1 2B "</GE02>
<GE03>"F0 00 00 25 9E B8 BB AA "</GE03>
<GE04>"F0 00 00 25 9E B8 BB AA "</GE04>
<bridge_id>"00 25 9E B8 BB AA "</bridge_id>
</switch>"""
tree = et.fromstring(sxml)
x = "80 00 80 FB 06 C6 A1 2B"  # Note: I used a value of x I could see in the data;
                               # your value of x = "00 25 9E B8 B9 19 " is not present...
for el in tree:
    print '{}: {}'.format(el.tag, el.text)
    if x in el.text:
        print 'I found "{}" by gosh at {}!!!\n'.format(x, el.tag)