Decode/reverse characters with shift in Python

I have a function. The input is a word, and each time through the loop a character is added to the shifted value of the result.
def magic2(b):
    res = 0
    for c in b:
        res = (res << 8) + ord(c)
        print(res)
    return res
Because it uses shifts, I will lose some data. I want to decode/reverse this and recover the exact letters of the input word.
For example, if the input is "saman", the result is 495555797358, and the step-by-step output is:
115
29537
7561581
1935764833
495555797358
How can I get back to the input word just with these outputs?

Consider what you're doing: for each character, you shift left by 8 bits, and then add on another 8 bits.[1]
So, how do you undo that? Well, for each character, you grab the rightmost 8 bits, then shift everything else right by 8 bits. How do you know when you're done? When shifting right by 8 bits leaves you with 0, you must have just gotten the leftmost character. So:
def unmagic2(n):
    while n > 0:
        c = chr(n & 0xff)  # 0xff is (1 << 8) - 1
        n = n >> 8
Now you just have to figure out what to do with each c to get your original string back. It's not quite as trivial as you might at first think, because we're getting the leftmost character last, not first. But you should be able to figure it out from here.
[1] If you're using the full gamut of Unicode, this is of course lossy, because you're shifting left by 8 bits and then adding on another 21 bits, so there's no way to reverse that. But I'll assume you're using Latin-1 strings here, or bytes, or Python 2 str.
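For completeness, here is one possible way to finish the exercise (a sketch, not the only way): collect each character as it comes off the right-hand end, then reverse at the end, since the leftmost character comes out last.

```python
def unmagic2(n):
    chars = []
    while n > 0:
        chars.append(chr(n & 0xff))  # grab the rightmost 8 bits
        n >>= 8                      # then drop them
    # Characters come out right-to-left, so reverse before joining.
    return ''.join(reversed(chars))

print(unmagic2(495555797358))  # saman
```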


Determine if all the characters in a string are unique using bit-manipulation

Implement an algorithm to determine if all the characters in a string are unique.
Can someone explain to me how to do it using bit manipulation? There is a solution using bits in CTCI, but I am unable to understand it.
Let's implement an algorithm to check whether a string contains duplicate characters, using bit manipulation in Python.
Return True for "abcdgt" and False for "bdb"
The bit-manipulation algorithm to check duplicates takes only O(n) time and O(1) space, since it uses only 4 bytes to check whether a string contains duplicate characters.
For this problem we assume the string contains only the characters a-z, since a 32-bit integer holds only 32 flag bits.
# 1<<val shifts 1 to the left by val positions, appending val zeros; so 1<<5 is 100000 in binary. This means that if a char is present in the string, we set a 1 at its position, else 0. If checker already has a 1 at that char's position and we AND it with 1<<val again (val is the character's value in the range 0-25), we get a nonzero result, which shows the char was already seen, i.e. the string contains duplicate characters.
Let's take the example str = 'bdb'.
checker = 0
1. c = 'b'
=> val = 98 - 97 = 1
=> 0 & 1<<1
=> 0 & 10
=> 0
checker = 0 | 10
checker = 10
2. c = 'd'
=> val = 100 - 97 = 3
=> 10 & 1<<3
=> 10 & 1000
=> 0
checker = 10 | 1000
checker = 1010
3. c = 'b'
=> val = 98 - 97 = 1
=> 1010 & 1<<1
=> 1010 & 10
=> 10 > 0; return False (correct, since the string contains duplicates)
Below is the code for the algorithm.
def checkDuplicates(str):
    checker = 0
    for c in str:
        # Gets the difference in ASCII between the character and 'a'
        val = ord(c) - ord('a')
        if (checker & 1 << val) > 0:
            return False
        checker |= 1 << val
    return True
I hope I was able to explain bit manipulation; a lot of my friends don't understand this concept and find it hard, but it's actually easy.
From what people are saying online, it seems to work like this:
Assuming you have a string (containing only the characters from either the set a-z, or from the set A-Z, but not both), you keep track of which characters have been seen before by using bitwise operations to flip bits in an integer. For example, if your string is "aza", you'd set the first bit in your integer to a 1, then you'd set the twenty-sixth bit to a 1, and then you'd attempt to set the first bit to a 1 again. Since it's already been flipped, you can assert that this character has been seen before, and is therefore not unique. The restriction on the character set (a-z or A-Z) seems to arise from the fact that there are 26 letters in the alphabet, and 32 bits in an integer in most languages, so you have enough bits to represent the entire alphabet.
However, I don't think this is a great solution to the problem using Python, since integers have arbitrary precision in Python. It's also not a great solution in general, in my opinion. You're better off using a set() to keep track of previously seen characters, and then asserting that the length of the string matches the length of your set.
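A minimal sketch of the set-based approach suggested above:

```python
def all_unique(s):
    # Unique iff no character repeats; a set drops duplicates,
    # so the lengths match exactly when every character is distinct.
    return len(set(s)) == len(s)

print(all_unique("abcdgt"))  # True
print(all_unique("bdb"))     # False
```

Unlike the bitmask version, this works for any characters, not just a-z.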

Need help understanding a short piece of Python code

I want to emphasize that this is not a request to complete my homework or job: I am studying the LZW algorithm for GIF file compression by reading someone's code on GitHub, and got confused by a code block here:
class DataBlock(object):
    def __init__(self):
        self.bitstream = bytearray()
        self.pos = 0

    def encode_bits(self, num, size):
        """
        Given a number *num* and a length in bits *size*, encode *num*
        as a *size* length bitstring at the current position in the bitstream.
        """
        string = bin(num)[2:]
        string = '0' * (size - len(string)) + string
        for digit in reversed(string):
            if len(self.bitstream) * 8 <= self.pos:
                self.bitstream.append(0)
            if digit == '1':
                self.bitstream[-1] |= 1 << self.pos % 8
            self.pos += 1
What I cannot understand is the for loop in the function encode_bits():
for digit in reversed(string):
    if len(self.bitstream) * 8 <= self.pos:
        self.bitstream.append(0)
    if digit == '1':
        self.bitstream[-1] |= 1 << self.pos % 8
    self.pos += 1
Here is my guess (based on his comment):
The function encode_bits() turns the input integer num into a binary string of length size (padding with zeroes at the left if needed), reverses the string, and appends the digits to bitstream one by one. Hence,
suppose s = DataBlock(); then s.encode_bits(3, 3) would first turn 3 into 011 (padding a zero at the left to make it length 3), reverse it to 110, and then append 110 to self.bitstream, so the result should be bytearray('110'). But when I run the code the result is bytearray(b'\x03'), not what I expected. Furthermore, \x03 is one byte, not 3 bits, which conflicts with his comment; I cannot understand why.
I forgot to add that his code runs and gives correct output, so there must be something wrong in my understanding.
Try looking at it this way:
You will be given a bytearray object (call it x for the moment).
How many bytes are in the object? (Obviously, it's just len(x).)
How many bits are in the object? (This is an exercise; please calculate the answer.)
Once you've done that, suppose we start with no (zero) bytes in x, i.e., x is a bytearray(b''). How many bytes do we need to add (x.append(...)) in order to store three bits? What if we want to store eight bits? What if we want to store ten bits?
Again, these are exercises. Calculate the answers and you should be enlightened.
(Incidentally, this technique, of compressing some number of sub-objects into some larger space, is called packing. In mathematics the problem is generalized, while in computers it is often more limited.)
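Working the exercises with the question's own class (reproduced below) makes the answer concrete: bits are packed least-significant-bit-first into whole bytes, so three bits fit in one byte, eight bits fill it exactly, and ten bits force a second byte.

```python
class DataBlock:
    def __init__(self):
        self.bitstream = bytearray()
        self.pos = 0

    def encode_bits(self, num, size):
        string = bin(num)[2:]
        string = '0' * (size - len(string)) + string
        for digit in reversed(string):
            if len(self.bitstream) * 8 <= self.pos:
                self.bitstream.append(0)  # grow by one whole byte
            if digit == '1':
                self.bitstream[-1] |= 1 << self.pos % 8
            self.pos += 1

s = DataBlock()
s.encode_bits(3, 3)       # the bits of 3 land in the low bits of byte 0
print(s.bitstream)        # bytearray(b'\x03') -- one byte holds the 3 bits
s.encode_bits(0, 7)       # now 10 bits total
print(len(s.bitstream))   # 2 -- ten bits need a second byte
```

This is why bytearray(b'\x03') is correct: the bitstring '110' is not stored as three text characters, but as bits 0-2 of a single byte.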

How to concatenate bits in Python

I have two bytes, e.g. 01010101 and 11110000. I need to concatenate the four most significant bits of the second byte, "1111", and the whole first byte, resulting in something like 0000010101011111; namely, four padding zeros, then the whole first byte, and finally the four most significant bits of the second byte.
Any idea?
Try this:
first = 0b01010101
second = 0b11110000
res = (first << 4) | (second >> 4)
print(bin(res))
By shifting the first byte 4 bits to the left (first<<4) you append 4 trailing zero bits. The second part (second>>4) shifts the 4 LSB bits of your second byte out to the right, discarding them, so only the 4 MSB bits remain; then you can bitwise-OR (| in Python) both partial results to combine them.
Splitting result back
To answer @JordanMackie's question, you can split res back into two variables; you will just lose the original 4 least significant bits of second.
first = 0b01010101
second = 0b11110000
res = (first << 4) | (second >> 4)
print("res : %16s" % bin(res))
first2 = (res >> 4) & 255
second2 = (res & 0b1111) << 4
print("first2 : %16s" % bin(first2))
print("second2: %16s" % bin(second2))
Output looks like this:
res : 0b10101011111
first2 : 0b1010101
second2: 0b11110000
The first command extracts the original first byte. It shifts the 4 LSB bits that came from second out to the right (operator >>), throwing them away. The bitwise AND (&) then keeps only the lowest 8 bits of the result, discarding any higher bits:
first2 = (res>>4) & 255
The second command can restore only the 4 MSB bits of second. It selects the 4 LSB bits of the result, which belong to second, using bitwise AND (anything & 1 = anything, anything & 0 = 0); higher bits are discarded because they are ANDed with 0 bits.
Then those 4 bits are shifted to the left, and zero bits fill the 4 least significant positions:
second2 = (res&0b1111)<<4
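The same shift-and-OR idea generalizes to any number of fields. As a sketch (the helper name concat_bits is made up here), you can fold (value, width) pairs into one integer MSB-first:

```python
def concat_bits(pairs):
    """Concatenate (value, width) pairs MSB-first into one integer."""
    res = 0
    for value, width in pairs:
        mask = (1 << width) - 1            # keep only `width` bits of value
        res = (res << width) | (value & mask)
    return res

first = 0b01010101
second = 0b11110000
# Whole first byte followed by the 4 MSB bits of the second byte:
print(bin(concat_bits([(first, 8), (second >> 4, 4)])))  # 0b10101011111
```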

Get the string that is the midpoint between two other strings

Is there a library or code snippet available that can take two strings and return the exact or approximate mid-point string between the two strings?
Preferably the code would be in Python.
Background:
This seems like a simple problem on the surface, but I'm kind of struggling with it:
Clearly, the midpoint string between "A" and "C" would be "B".
With base64 encoding, the midpoint string between "A" and "B" would probably be "Ag"
With UTF-8 encoding, I'm not sure what the valid midpoint would be because the middle character seems to be a control character: U+0088 c2 88 <control>
Practical Application:
The reason I am asking is that I was hoping to write a map-reduce-type algorithm to read all of the entries out of our database and process them. The primary keys in the database are UTF-8 encoded strings with random distributions of characters. The database we are using is Cassandra.
I was hoping to get the lowest key and the highest key out of the database, split that span into two ranges by finding the midpoint, then split those two ranges into smaller sections by finding each of their midpoints, until I had a few thousand sections that I could read asynchronously.
Example, if the strings were base-16 encoded (some of the midpoints are approximate), starting from the lowest and highest keys '000' and 'FFF':
Level 1: ('000', 'FFF')
Level 2: ('000', '8'), ('8', 'FFF')
Level 3: ('000', '4'), ('4', '8'), ('8', 'B8'), ('B8', 'FFF')
(after 3 levels of recursion)
Unfortunately not all sequences of bytes are valid UTF-8, so it's not trivial to just take the midpoint of the UTF-8 values, like the following.
def midpoint(s, e):
    '''Midpoint of start and end strings'''
    sb, eb = (int.from_bytes(bytes(x, 'utf-8'), byteorder='big') for x in (s, e))
    midpoint = (eb - sb) // 2 + sb
    midpoint_bytes = midpoint.to_bytes((midpoint.bit_length() // 8) + 1, byteorder='big')
    return midpoint_bytes.decode('utf-8')
Basically this code converts each string into an integer represented by the sequence of bytes in memory, finds the midpoint of those two integers, and attempts to interpret the "midpoint" bytes as UTF-8 again.
Depending on exactly what behavior you would like, the next step could be to replace the invalid bytes in midpoint_bytes with some kind of replacement character to form a valid UTF-8 string. For your problem it might not matter much exactly which character you use for the replacement so long as you're consistent.
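As a sketch of that replacement idea (the helper name midpoint_str is made up), decode's errors='replace' mode substitutes U+FFFD for every invalid byte sequence, which is consistent in the sense described above:

```python
def midpoint_str(s, e):
    """Approximate midpoint of two strings, with invalid UTF-8
    byte sequences replaced by U+FFFD."""
    sb, eb = (int.from_bytes(x.encode('utf-8'), 'big') for x in (s, e))
    m = (sb + eb) // 2
    mb = m.to_bytes((m.bit_length() // 8) + 1, 'big')
    # Any invalid byte sequence becomes the replacement character.
    return mb.decode('utf-8', errors='replace')

print(midpoint_str('a', 'c'))  # b
```

Note the caveat still applies: the byte-integer ordering only matches string ordering for keys of equal encoded length.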
However, since you're trying to partition the data and don't seem to care too much about the string representation of the midpoint, another option is to just leave the midpoint representation as an integer and convert the keys to integers while doing the partition. Depending on the scale of your problem this option may or may not be feasible.
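A rough sketch of that integer-only option (the function name partition_bounds is mine, and it assumes keys of equal encoded length, since integer order only matches lexicographic order for equal-length byte strings):

```python
def partition_bounds(low, high, levels):
    """Split the key range [low, high] in half `levels` times,
    returning the sorted list of integer boundaries."""
    lo = int.from_bytes(low.encode('utf-8'), 'big')
    hi = int.from_bytes(high.encode('utf-8'), 'big')
    bounds = [lo, hi]
    for _ in range(levels):
        new = [bounds[0]]
        for a, b in zip(bounds, bounds[1:]):
            new.extend([(a + b) // 2, b])  # insert each midpoint
        bounds = new
    return bounds

# 12 levels of halving yield a few thousand sections:
print(len(partition_bounds('000', 'FFF', 12)) - 1)  # 4096
```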
Here's a general solution that gives an approximate midpoint m between any two Unicode strings a and b, such that a < m < b if possible:
from os.path import commonprefix

# This should be set according to the range and frequency of
# characters used.
MIDCHAR = u'm'

def midpoint(a, b):
    prefix = commonprefix((a, b))
    p = len(prefix)
    # Find the code points at the position where the strings differ.
    ca = ord(a[p]) if len(a) > p else None
    cb = ord(b[p])
    # Find the approximate middle code point.
    cm = cb // 2 if ca is None else (ca + cb) // 2
    # If a middle code point was found, add it and return.
    if (ca is None or ca < cm) and cm < cb:
        return prefix + chr(cm)
    # If b still has more characters after this, then just use
    # b's code point and return.
    if len(b) > p + 1:
        return prefix + chr(cb)
    # Otherwise, if cb == 0, then a and b are consecutive, so there
    # is no midpoint. Return a.
    if cb == 0:
        return a
    # Otherwise, use part of a and an extra character so that
    # the result is greater than a.
    i = p + 1
    while i < len(a) and a[i] >= MIDCHAR:
        i += 1
    return a[:i] + MIDCHAR
The function assumes that a < b. Other than that, it should work with arbitrary Unicode strings, even ones containing u'\x00' characters. Note also that it may return strings containing u'\x00' or other nonstandard code points. If there is no midpoint due to b == a + u'\x00' then a is returned.

PNG Chunk type-code Bit #5

I'm trying to write my own little PNG reader in Python. There is something in the documentation I don't quite understand. In chapter 3.3 (where chunks are handled) it says:
Four bits of the type code, namely bit 5 (value 32) of each byte, are used to convey chunk properties. This
choice means that a human can read off the assigned properties according to whether each letter of the type
code is uppercase (bit 5 is 0) or lowercase (bit 5 is 1). However, decoders should test the properties of an unknown
chunk by numerically testing the specified bits; testing whether a character is uppercase or lowercase
is inefficient, and even incorrect if a locale-specific case definition is used.
OK, so it explicitly says that one should not test whether a byte is uppercase or lowercase. Then how do I check that bit 5?
Furthermore, the documentation states
Ancillary bit: bit 5 of first byte
0 (uppercase) = critical, 1 (lowercase) = ancillary.
I have the following function to convert an integer to a bit-stream:
def bits(x, n):
    """ Convert an integer value *x* to a sequence of *n* bits as a string. """
    return ''.join(str([0, 1][x >> i & 1]) for i in range(n - 1, -1, -1))
Just for example, take the sRGB chunk. The lowercase s denotes the chunk is ancillary. But comparing the bit-streams of an uppercase S and lowercase s
01110011
01010011
we can see that bit #5 is zero in both cases.
I think I have a wrong understanding of how the bits are counted. As the only bit that changes is the third one (i.e. indexed with 2), I assume this is the bit I'm searching for? It is also the 6th bit from the right, indexed with 5 (from the right, of course). Is this what I'm searching for?
Python does have bitwise manipulation. You are doing it the hard way, when they already gave you the bitmask (32 or 0x20).
is_critical = (type_code & 0x20) == 0
or, equivalently:
is_critical = (type_code & (0x1 << 5)) == 0
(with extra parentheses for clarity)
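Putting the answer together for all four property bits (the helper name chunk_properties is mine; the property names come from the PNG specification):

```python
def chunk_properties(type_code):
    """Decode the four PNG chunk-property bits: bit 5 (value 32)
    of each of the four type-code bytes."""
    ancillary    = bool(type_code[0] & 0x20)  # critical if 0
    private      = bool(type_code[1] & 0x20)  # public if 0
    reserved     = bool(type_code[2] & 0x20)  # must be 0 in the current spec
    safe_to_copy = bool(type_code[3] & 0x20)  # unsafe to copy if 0
    return ancillary, private, reserved, safe_to_copy

print(chunk_properties(b'sRGB'))  # (True, False, False, False)
```

Indexing a bytes object in Python 3 yields integers directly, so no ord() call is needed before masking.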
