Convert numpy binary string array back to string [duplicate] - python

This question already has answers here:
Convert binary to ASCII and vice versa
(8 answers)
Closed 6 years ago.
I have a numpy binary array like this:
np_bin_array = [0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Originally it held the 8-bit character codes of a word, starting from the left, with 0s padding it out.
I need to convert this back into a string to form the word again and strip the padding; the output for the above should be 'Hello'.
Thanks for your help!

You can first pack the bits into an array of bytes using numpy.packbits(), convert that to a bytearray(), then decode() it into a normal string.
The following code
import numpy
np_bin_array = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(bytearray(numpy.packbits(np_bin_array)).decode().strip("\x00"))
gives
Hello
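As a quick sanity check (assuming the same np_bin_array as above), the intermediate packbits result is simply the ASCII codes of the word followed by zero padding bytes:
packed = numpy.packbits(np_bin_array)
print(packed)         # [ 72 101 108 108 111   0   0   0] -> 'H', 'e', 'l', 'l', 'o' plus padding
print(bytes(packed))  # b'Hello\x00\x00\x00'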

This one works for me. I had a boolean array, though, so I needed an additional conversion.
def split_list(alist, max_size=1):
    """Yield successive max_size-sized chunks from alist."""
    for i in range(0, len(alist), max_size):
        yield alist[i:i + max_size]

# join each 8-bit chunk into a binary string, parse it as an integer, and keep only the non-zero characters
result = "".join([chr(i) for i in (int("".join([str(int(j)) for j in letter]), base=2) for letter in split_list(np_bin_array, 8)) if i != 0])

import numpy as np

np_bin_array = np.array([0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
bhello = ''.join(map(str, np_bin_array))   # one long binary string: '01001000...'
xhello = hex(int(bhello, 2)).strip("0x")   # hex digits; strip() drops the '0x' prefix and the trailing zero padding
print(''.join(chr(int(xhello[i:i+2], 16)) for i in range(0, len(xhello), 2)))  # 'Hello'
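One caveat (an aside, not part of the original answer): .strip("0x") strips characters, not a prefix, so it also removes trailing zeros. That happens to work here because the padding is all zero bytes, but a final character whose hex code ends in 0 (e.g. 'P' = 0x50) would be corrupted. A slightly safer variant of the same idea:
xhello = format(int(bhello, 2), 'x').zfill(len(bhello) // 4)  # hex digits without the '0x' prefix, leading zeros kept
word = ''.join(chr(int(xhello[i:i+2], 16)) for i in range(0, len(xhello), 2))
print(word.strip('\x00'))  # 'Hello'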

I got it working with this:
import numpy as np

np_bin_array = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
yy_word = ""
yy = np.packbits(np_bin_array)  # pack the bits into bytes: the ASCII codes plus zero padding
for i in yy:
    if i:                       # skip the zero padding bytes
        yy_word += chr(i)
print(yy_word)

Related

How to convert values to their index

I have a numpy array containing 1's and 0's:
a = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 1, 0, 0, 1, 1, 0, 0, 0]])
I'd like to convert each 1 to the index in the subarray that it's occurring at, to get this:
e = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 5, 0, 7, 0, 0],
[0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 2, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 0, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 0, 3, 0, 0, 6, 0, 8, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 9],
[0, 1, 2, 0, 0, 5, 6, 0, 0, 0]])
So far what I've done is multiply the array by a range:
a * np.arange(a.shape[0])
which is good, but I'm wondering if there's a better, simpler way to do it, like a single function call?
This modifies a in place:
In [4]: i, j = np.nonzero(a)
In [5]: a[i, j] = j
In [6]: a
Out[6]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 5, 0, 7, 0, 0],
[0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 2, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 0, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 0, 3, 0, 0, 6, 0, 8, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 9],
[0, 1, 2, 0, 0, 5, 6, 0, 0, 0]])
Make a copy if you don't want to modify a in place.
Or, this creates a new array (in one line):
In [8]: np.where(a, np.arange(a.shape[1]), 0)
Out[8]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 5, 0, 7, 0, 0],
[0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 2, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 0, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 0, 3, 0, 0, 6, 0, 8, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 9],
[0, 1, 2, 0, 0, 5, 6, 0, 0, 0]])
Your approach is as fast as it gets, but it uses the wrong dimension for the multiplication (it would fail if the matrix wasn't square).
Multiply the matrix by a range of column indexes:
import numpy as np
a = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]])
e = a * np.arange(a.shape[1])
print(e)
[[ 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 5 0 7 0 0 0]
[ 0 0 2 0 0 0 0 0 0 0 0]
[ 0 1 2 0 0 0 6 0 0 0 0]
[ 0 0 2 0 0 5 0 0 0 9 0]
[ 0 0 0 0 0 5 0 0 0 9 0]
[ 0 0 2 0 0 0 6 0 0 0 10]
[ 0 0 0 3 0 0 6 0 8 0 0]
[ 0 0 0 0 0 0 0 0 0 9 0]
[ 0 1 2 0 0 5 6 0 0 0 0]]
I benchmarked the obligatory np.einsum approach, which was ~1.29x slower than the corrected original solution for larger arrays (100_000, 1000). The in-place solution was ~8x slower than np.einsum.
np.einsum('ij,j->ij', a, np.arange(a.shape[1]))
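A rough sketch of how such a timing comparison might be reproduced (a smaller array than the (100_000, 1000) mentioned above is used to keep memory modest; exact ratios vary by machine and NumPy version):
import numpy as np
from timeit import timeit

a = (np.random.rand(10_000, 1000) < 0.1).astype(int)
cols = np.arange(a.shape[1])

print(timeit(lambda: a * cols, number=10))                        # corrected original solution
print(timeit(lambda: np.einsum('ij,j->ij', a, cols), number=10))  # einsum variant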

Store python array in each entry of a column

I have got an array 'multilabel' which looks like this:
[[0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0, 0],
...
[0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0]]
and want to store each of those arrays in my target variable as I am facing a multi-label classification task. How can I achieve that? My code:
pd.DataFrame(multilabel)
Outputs multiple columns:
0 1 2 3 4 5 6 7
0 0 0 0 0 1 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
Thanks in advance!
df = pd.DataFrame(list(multilabel))
list_column = df.apply(lambda row: row.values, axis=1)
pd.DataFrame(list_column, columns=['list_column'])
Result df: a single 'list_column' column with one array per row.
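An aside, not part of the original answer: if the columns argument does not end up labelling the unnamed Series, naming it explicitly via to_frame() gives the same single-column result:
list_column.to_frame('list_column')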
Have you considered using the following trick?
import pandas as pd
arr = [[0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0]]
pd.DataFrame([arr]).T
Output
0
0 [0, 0, 0, 1, 0, 0, 0, 0]
1 [1, 0, 0, 0, 0, 0, 0, 0]
2 [1, 0, 0, 1, 0, 0, 0, 0]
3 [0, 0, 0, 1, 0, 0, 0, 0]
4 [1, 0, 0, 0, 0, 0, 0, 0]
EDIT
In case you are using numpy arrays you can use the following
import numpy as np
pd.DataFrame(np.array(arr))\
.apply(lambda x: np.array(x), axis=1)
So, the real question is why... it doesn't seem like the most useful data structure.
That said, the one-dimensional data type in pandas is the Series:
>>> pd.Series(multilabel)
0 [0, 0, 0, 1, 0, 0, 0, 0]
1 [1, 0, 0, 0, 0, 0, 0, 0]
2 [1, 0, 0, 1, 0, 0, 0, 0]
3 [0, 0, 0, 1, 0, 0, 0, 0]
4 [1, 0, 0, 0, 0, 0, 0, 0]
dtype: object
You can then convert it further into a DataFrame:
>>> pd.DataFrame(pd.Series(multilabel))
0
0 [0, 0, 0, 1, 0, 0, 0, 0]
1 [1, 0, 0, 0, 0, 0, 0, 0]
2 [1, 0, 0, 1, 0, 0, 0, 0]
3 [0, 0, 0, 1, 0, 0, 0, 0]
4 [1, 0, 0, 0, 0, 0, 0, 0]
Edit: Per further discussion, this works if multilabel is a nested Python list, but not if it's a NumPy array.
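If multilabel is a 2-D NumPy array rather than a nested list, passing it straight to pd.Series fails; wrapping it in list() first splits it into one row-array per entry (a small sketch):
import numpy as np
import pandas as pd

multilabel = np.array([[0, 0, 0, 1, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0, 0, 0, 0]])
pd.Series(list(multilabel))  # object Series with one row-array per entry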

How to exceed limitation of numpy.array() to convert list of array to an array of array?

I have a list of arrays, each containing 16 ints:
ListOfArray=[array([0,1,....,15]), array([0,1,....,15]), array([0,1,....,15]),....,array([0,1,....,15])]
I want to convert it to an array of arrays.
So I use:
ListOfArray=numpy.array(ListOfArray)
or:
ListOfArray=numpy.asarray(ListOfArray)
or:
ArrayOfArray=numpy.asarray(ListOfArray)
Same result.
If my list of arrays contains fewer than 17716 arrays, I get the normal result:
[[0 0 0 ... 0 0 1]
[1 0 0 ... 0 1 0]
[0 0 0 ... 0 0 1]
...
[0 1 1 ... 1 0 0]
[0 1 1 ... 0 0 0]
[0 1 1 ... 0 0 1]]
But from 17716 arrays onward I get this:
[array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0])
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]) ...
array([0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0])
array([0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1])
array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
It seems that there is a limit somewhere. Why?
Can we exceed it?
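There is no hard size limit here: numpy.array() only stacks the list into a 2-D array when every element has the same length; a single element of a different length makes it fall back to a 1-D object array of arrays, which is what the output above shows. On recent NumPy versions you even have to pass dtype=object explicitly for ragged input. A minimal illustration (not tied to the 17716 figure):
import numpy as np

same = [np.zeros(16, dtype=int) for _ in range(3)]
ragged = same + [np.zeros(17, dtype=int)]
print(np.array(same).shape)                  # (3, 16): a regular 2-D array
print(np.array(ragged, dtype=object).shape)  # (4,): a 1-D object array of arrays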
Edit:
There is no problem with numpy.array; I just didn't intend to have an array containing 17 values. I converted WAV frames into binary and then into strings of fifteen 0s and 1s, prefixed with 0 for negative values and 1 for positive, and then converted that to a list and then an array. I didn't expect a value of -32768 (-0b1000000000000000), since I believed -32767 and 32767 (15 binary digits) were the extremes.
It's pretty ugly code and I'm not proud of it, but if you have advice for something less patchwork, here it is:
import numpy as np
import wave
import struct

f = wave.open('Test16PCM.wav', 'rb')
nf = f.getnframes()
frames = f.readframes(nf)
f.close()

L = []
# extract the sample values
for i in range(0, ((nf - 1) * 4) + 1, 4):
    L.append(struct.unpack('<h', frames[i:(i + 2)])[0])  # only the left track

Lbin = []
# convert int values to binary strings, prefixed with 0 for negative or 1 for positive
for i in L:
    a = str(bin(i))
    if a[0] == "-":  # something like "-0b00101101"
        a = a[3:]
        while len(a) < 16:  # pad to a fixed-length binary number (was 15 before correction)
            a = '0' + a
        Lbin.append('0' + a)
    else:  # something like "0b00101101"
        a = a[2:]
        while len(a) < 16:
            a = '0' + a
        Lbin.append('1' + a)

Lout = []
for i in Lbin:
    temp = []
    for j in i:
        temp.append(int(j))
    temp = np.array(temp)
    Lout.append(temp)
Lout = np.asarray(Lout)
print(Lout)
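Not part of the original post, but since the poster asked for less patchwork code: a possible vectorized sketch of the same idea (untested against a real file; it assumes the same 'Test16PCM.wav' and the same 16-bit interleaved stereo layout, keeps only the left track, and produces one sign column followed by 16 magnitude bits per sample):
import numpy as np
import wave

with wave.open('Test16PCM.wav', 'rb') as f:
    frames = f.readframes(f.getnframes())

samples = np.frombuffer(frames, dtype='<i2')[::2]      # left track of the interleaved stereo stream
signs = (samples >= 0).astype(np.uint8)                # 1 for positive, 0 for negative (as in the code above)
mags = np.abs(samples.astype(np.int32))                # int32 so abs(-32768) does not overflow
bits = ((mags[:, None] >> np.arange(15, -1, -1)) & 1).astype(np.uint8)
Lout = np.column_stack([signs, bits])                  # shape (n_samples, 17)
print(Lout)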

Tensorflow Initialize list with 0s and 1s similar to tf.one_hot

I have a list of values over a time sequence, say [1, 3, 5, 7, 3]. Currently I am using tf.one_hot to get a one-hot vector/tensor representation of each value in the list.
1 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
3 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
5 = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
7 = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
3 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
In TensorFlow, is there a way/function that would allow me to do something similar, but initialize all positions from 0 up to the value with 1s?
Desired Result:
1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
3 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
5 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
7 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
3 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
You are looking for tf.sequence_mask.
See the TensorFlow API documentation for tf.sequence_mask.
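A minimal sketch of how it might be applied to the example above (note the value + 1, since the desired rows are filled up to and including the index itself; depth 11 matches the one-hot example):
import tensorflow as tf

values = tf.constant([1, 3, 5, 7, 3])
mask = tf.sequence_mask(values + 1, maxlen=11, dtype=tf.int32)
# row for 3 -> [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# (evaluate with .numpy() in TF 2, or sess.run(mask) in TF 1)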

How to one-hot-encode sentences at the character level?

I would like to convert a sentence to an array of one-hot vectors.
These vectors would be the one-hot representation of the alphabet.
It would look like the following:
"hello" # h=7, e=4, l=11, o=14
would become
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Unfortunately, sklearn's OneHotEncoder does not accept strings as input.
Just compare the letters in your passed string to a given alphabet:
import string

def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    vector = [[0 if char != letter else 1 for char in alphabet]
              for letter in strng]
    return vector

Note that with a custom alphabet (e.g. "defbcazk"), the columns will be ordered as the elements appear in that alphabet string.
The output of string_vectorizer('hello'):
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
This is a common task in Recurrent Neural Networks and there's a specific function just for this purpose in tensorflow, if you'd like to use it.
import numpy as np
import tensorflow as tf

alphabets = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14}
idxs = [alphabets[ch] for ch in 'hello']
print(idxs)
# [7, 4, 11, 11, 14]

# @divakar's approach
idxs = np.fromstring("hello", dtype=np.uint8) - 97
# or, more readably:
idxs = np.fromstring('hello', dtype=np.uint8) - ord('a')

one_hot = tf.one_hot(idxs, 26, dtype=tf.uint8)
sess = tf.InteractiveSession()
In [15]: one_hot.eval()
Out[15]:
array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
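An aside: np.fromstring on text is deprecated in newer NumPy releases; an equivalent that avoids the warning is to work on bytes with np.frombuffer:
idxs = np.frombuffer(b"hello", dtype=np.uint8) - ord('a')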
With pandas, you can use pd.get_dummies by passing a categorical Series:
import pandas as pd
import string

s = 'hello'
low = string.ascii_lowercase
pd.get_dummies(pd.Series(list(s)).astype('category', categories=list(low)))
Out:
a b c d e f g h i j ... q r s t u v w x y z
0 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
[5 rows x 26 columns]
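On newer pandas versions the categories argument to astype was removed; a rough equivalent (assuming the same s and low as above) is to build the categorical explicitly:
pd.get_dummies(pd.Series(pd.Categorical(list(s), categories=list(low))))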
Here's a vectorized approach using NumPy broadcasting to give us a (N,26) shaped array -
import numpy as np
ints = np.fromstring("hello",dtype=np.uint8)-97
out = (ints[:,None] == np.arange(26)).astype(int)
If you are looking for performance, I would suggest creating a zero-initialized array and then assigning into it -
out = np.zeros((len(ints),26),dtype=int)
out[np.arange(len(ints)), ints] = 1
Sample run -
In [153]: ints = np.fromstring("hello",dtype=np.uint8)-97
In [154]: ints
Out[154]: array([ 7, 4, 11, 11, 14], dtype=uint8)
In [155]: out = (ints[:,None] == np.arange(26)).astype(int)
In [156]: print out
[[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]
You asked about "sentences" but your example provided only a single word, so I'm not sure what you wanted to do about spaces. But as far as single words are concerned, your example could be implemented with:
def onehot(ltr):
    return [1 if i == ord(ltr) else 0 for i in range(97, 123)]

def onehotvec(s):
    return [onehot(c) for c in list(s.lower())]

onehotvec("hello")
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
