One-hot encoding of categories - python

I have a list similar to this:
list = ['Opinion, Journal, Editorial',
        'Opinion, Magazine, Evidence-based',
        'Evidence-based']
where the commas separate the categories, e.g. Opinion and Journal are two separate categories. The real list is much larger and has more possible categories. I would like to use one-hot encoding to transform the list so that it can be used for machine learning. For example, from that list I would like to produce a sparse matrix containing data like:
list = [[1, 1, 1, 0, 0],
        [1, 0, 0, 0, 1],
        [0, 0, 0, 0, 1]]
Ideally, I would like to use scikit-learn's one-hot encoder, as I presume this would be the most efficient.
In response to #nbrayns' comment:
The idea is to transform the list of categories from text to a vector whereby, if an item belongs to a category, it is assigned 1, otherwise 0. For the above example, the headings would be:
headings = ['Opinion', 'Journal', 'Editorial', 'Magazine', 'Evidence-based']

If you are able to use Pandas, this functionality is essentially built-in there:
import pandas as pd
l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']
pd.Series(l).str.get_dummies(', ')
   Editorial  Evidence-based  Journal  Magazine  Opinion
0          1               0        1         0        1
1          0               1        0         1        1
2          0               1        0         0        0
If you'd like to stick with the sklearn ecosystem, you are looking for MultiLabelBinarizer, not for OneHotEncoder. As the name implies, OneHotEncoder only supports one level per sample per feature, while your samples carry multiple labels at once.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer() # pass sparse_output=True if you'd like
mlb.fit_transform(s.split(', ') for s in l)
[[1 0 1 0 1]
 [0 1 0 1 1]
 [0 1 0 0 0]]
To map the columns back to categorical levels, you can access mlb.classes_. For the above example, this gives ['Editorial' 'Evidence-based' 'Journal' 'Magazine' 'Opinion'].
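If a labeled view of that matrix helps, here is a small sketch (reusing l and mlb from above, and assuming pandas is available) that pairs the binarized output with the learned classes:
import pandas as pd
encoded = mlb.fit_transform(s.split(', ') for s in l)
pd.DataFrame(encoded, columns=mlb.classes_)
#    Editorial  Evidence-based  Journal  Magazine  Opinion
# 0          1               0        1         0        1
# 1          0               1        0         1        1
# 2          0               1        0         0        0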

One more way:
import numpy as np

l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']

# Get list of unique classes (note: set ordering is arbitrary, so the column order may vary between runs)
classes = list(set([j for i in l for j in i.split(', ')]))
=> ['Journal', 'Opinion', 'Editorial', 'Evidence-based', 'Magazine']

# Get the (row, column) indices to set in the matrix
indices = np.array([[k, classes.index(j)] for k, i in enumerate(l) for j in i.split(', ')])
=> array([[0, 1],
          [0, 0],
          [0, 2],
          [1, 1],
          [1, 4],
          [1, 3],
          [2, 3]])

# Generate output
output = np.zeros((len(l), len(classes)), dtype=int)
output[indices[:, 0], indices[:, 1]] = 1
=> array([[1, 1, 1, 0, 0],
          [0, 1, 0, 1, 1],
          [0, 0, 0, 1, 0]])

This may not be the most efficient method, but it is probably easy to grasp.
If you don't already have a list of all possible words, you need to create it. In the code below it's called unique. The columns of the output matrix s then correspond to those unique words; the rows correspond to the items in the list.
import numpy as np

lis = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']
unique = list(set(", ".join(lis).split(", ")))
print(unique)
# prints ['Opinion', 'Journal', 'Magazine', 'Editorial', 'Evidence-based']

s = np.zeros((len(lis), len(unique)))
for i, item in enumerate(lis):
    for j, notion in enumerate(unique):
        if notion in item:  # substring check; fine here because no category name is contained in another
            s[i, j] = 1
print(s)
# prints [[ 1.  1.  0.  1.  0.]
#         [ 1.  0.  1.  0.  1.]
#         [ 0.  0.  0.  0.  1.]]

Very easy in pandas:
import pandas as pd
s = pd.Series(['a','b','c'])
pd.get_dummies(s)
Output:
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1

Related

Formatting multidimensional arrays

How can I write a function that separates the values in the second column of this 2-dimensional array into two groups, according to the 0s and 1s in the first column?
[[     0  38846]
 [     1  51599]
 [     0  51599]
 [     1  52598]
 [     0 290480]
 [     0 360467]]
Expected Output:
Ones = 51599 ,52598
Zeroes = 38846, 51599, 290480, 360467
I am not sure I am following. Are you looking for a particular method, e.g. with NumPy, or is a list comprehension good enough? In any case, here are some examples:
import numpy as np

x = np.array([[     0,  38846],
              [     1,  51599],
              [     0,  51599],
              [     1,  52598],
              [     0, 290480],
              [     0, 360467]])

# List comprehension
ones = [v for b, v in x if b]
zeros = [v for b, v in x if b == 0]

# NumPy boolean indexing
ones = x[x[:, 0] == 1][:, 1]
zeros = x[x[:, 0] == 0][:, 1]
The following code should do the trick:
arr = np.array([[0, 38846],
                [1, 51599],
                [0, 51599],
                [1, 52598],
                [0, 290480],
                [0, 360467]])
col1 = arr[:, 0]
col2 = arr[:, 1]
zeros = col2[col1 == 0]
ones = col2[col1 == 1]
The lines
col1 = arr[:, 0]
col2 = arr[:, 1]
get the 1st column (containing the 0s and 1s) and the 2nd column (which contains the numbers you want).
col1 == 0 creates a boolean array that is True at the indices where col1 is 0 and False elsewhere. By doing col2[col1 == 0], we keep only the values of column 2 at the positions where the mask is True, i.e. the values in column 2 that correspond to 0s in column 1.
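To make the mask concrete, here is a quick sketch of the intermediate values for the arr above:
col1
# array([0, 1, 0, 1, 0, 0])
col1 == 0
# array([ True, False,  True, False,  True,  True])
col2[col1 == 0]
# array([ 38846,  51599, 290480, 360467])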

Convert a matrix of positive integer numbers into a boolean matrix without loops

I'm trying to write code in Python using NumPy. I'm not sure it's possible but here's what I'm trying to do:
I have a 2D matrix a of shape (rows, cols) with positive integer numbers and I want to define a matrix b such that if a[i,j]=x then b[i,j+1]=b[i,j+2]=...=b[i,j+x]=1 (b is initialized to a matrix of zeros).
You can assume that for every j,x: j+x<=cols-1.
For example, if a is:
[0 2 0 0]
[0 2 0 0]
[3 0 0 0]
[2 0 1 0]
Then b should be:
[0 0 1 1]
[0 0 1 1]
[0 1 1 1]
[0 1 1 1]
Is it possible to do the above in Python with NumPy without using loops?
If it's not possible to do it without loops, is there an efficient way to do it? (rows and cols can be big numbers.)
I'm sorry, I don't know a NumPy function which would help in your situation, but I think a regular loop and array indexing should be quite fast:
import numpy as np

a = np.array([
    [0, 2, 0, 0],
    [0, 2, 0, 0],
    [3, 0, 0, 0],
    [2, 0, 1, 0],
])
b = np.zeros(a.shape)
for i, x in enumerate(a.flat):
    b.flat[i + 1 : i + 1 + x] = 1
print(b)
Which prints your expected result:
[[0. 0. 1. 1.]
 [0. 0. 1. 1.]
 [0. 1. 1. 1.]
 [0. 1. 1. 1.]]
Here's a slightly optimized version of #finefoot's solution:
aa = a.ravel()
b = np.zeros_like(aa)
for i, x in enumerate(aa):
    if x != 0:
        b[i + 1 : i + 1 + x] = 1
b = b.reshape(a.shape)
And here's another solution which is slightly faster but less readable:
from itertools import chain

aa = a.ravel()
b = np.zeros_like(aa)
w = np.nonzero(aa)[0]
ranges = (range(s, e) for s, e in zip(w + 1, w + 1 + aa[w]))
for r in chain.from_iterable(ranges):
    b[r] = 1
b = b.reshape(a.shape)
Both give correct results under the assumption that for every j, x: j + x <= cols - 1. Both solutions still use a for-loop, though, and I am not sure it can be avoided entirely.
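That said, a loop-free variant seems possible with a difference-array plus cumulative-sum trick; here is a rough sketch of that idea (my own variation, not benchmarked, and relying on the same j + x <= cols - 1 assumption):
import numpy as np

aa = a.ravel()
delta = np.zeros(aa.size + 1, dtype=int)
w = np.nonzero(aa)[0]                  # positions i where a.flat[i] = x > 0
np.add.at(delta, w + 1, 1)             # each run of ones starts at i + 1 ...
np.add.at(delta, w + 1 + aa[w], -1)    # ... and stops x elements later
b = (np.cumsum(delta[:-1]) > 0).astype(int).reshape(a.shape)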

python - applying a mask to an array in a for loop

I have this code:
import numpy as np
result = {}
result['depth'] = [1,1,1,2,2,2]
result['generation'] = [1,1,1,2,2,2]
result['dimension'] = [1,2,3,1,2,3]
result['data'] = [np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0])]
for v in np.unique(result['depth']):
    temp_v = (result['depth'] == v)
    values_v = [result[string][temp_v] for string in result.keys()]
    this_v = dict(zip(result.keys(), values_v))
in which I want to create a new dict called 'this_v', with the same keys as the original dict result, but fewer values.
The line:
values_v = [result[string][temp_v] for string in result.keys()]
gives an error
TypeError: only integer scalar arrays can be converted to a scalar index
which I don't understand, since I can create ex = result[result.keys()[0]][temp_v] just fine. It just does not let me do this with a for loop so that I can fill the list.
Any idea as to why it does not work?
In order to solve your problem (finding and dropping duplicates) I encourage you to use pandas. It is a Python module that makes your life absurdly simple:
import numpy as np
result = {}
result['depth'] = [1,1,1,2,2,2]
result['generation'] = [1,1,1,2,2,2]
result['dimension'] = [1,2,3,1,2,3]
result['data'] = [np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]),\
np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0])]
# Here comes pandas!
import pandas as pd
# Converting your dictionary of lists into a beautiful dataframe
df = pd.DataFrame(result)
#>         data  depth  dimension  generation
# 0  [0, 0, 0]      1          1           1
# 1  [0, 0, 0]      1          2           1
# 2  [0, 0, 0]      1          3           1
# 3  [0, 0, 0]      2          1           2
# 4  [0, 0, 0]      2          2           2
# 5  [0, 0, 0]      2          3           2
# Dropping duplicates... in one single command!
df = df.drop_duplicates('depth')
#>         data  depth  dimension  generation
# 0  [0, 0, 0]      1          1           1
# 3  [0, 0, 0]      2          1           2
If you want your data back in the original format... you need yet again just one line of code!
df.to_dict('list')
#> {'data': [array([0, 0, 0]), array([0, 0, 0])],
# 'depth': [1, 2],
# 'dimension': [1, 1],
# 'generation': [1, 2]}
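As for the TypeError in the original loop: result['depth'] == v yields a NumPy boolean array (because v comes from np.unique), but result[string] is a plain Python list, and lists cannot be indexed with a boolean mask. A minimal sketch of one way to keep the original approach, wrapping each value in np.array before masking:
for v in np.unique(result['depth']):
    temp_v = np.array(result['depth']) == v                     # boolean mask as an ndarray
    values_v = [np.array(result[k])[temp_v] for k in result]    # ndarrays do support boolean indexing
    this_v = dict(zip(result.keys(), values_v))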

Editing a Python 2-dimensional array without a for-loop?

So, I have a given 2 dimensional matrix which is randomly generated:
a = np.random.randn(4,4)
which gives output:
array([[-0.11449491, -2.7777728 , -0.19784241,  1.8277976 ],
       [-0.68511473,  0.40855461,  0.06003551, -0.8779363 ],
       [-0.55650378, -0.16377137,  0.10348714, -0.53449633],
       [ 0.48248298, -1.12199767,  0.3541335 ,  0.48729845]])
I want to change all the negative values to 0 and all the positive values to 1.
How can I do this without a for loop?
You can use np.where()
import numpy as np
a = np.random.randn(4,4)
a = np.where(a<0, 0, 1)
print(a)
[[1 1 0 1]
 [1 0 1 0]
 [1 1 0 0]
 [0 1 1 0]]
(a >= 0).astype(int)
This is one possible solution - converting the array to a boolean array according to your condition and then converting it from boolean to integer. For example, with
a = np.array([[ 0.63694991, -0.02785534,  0.07505496,  1.04719295],
              [-0.63054947, -0.26718763,  0.34228736,  0.16134474],
              [ 1.02107383, -0.49594998, -0.11044738,  0.64459594],
              [ 0.41280766,  0.668819  , -1.0636972 , -0.14684328]])
the result is
(a >= 0).astype(int)
>>> array([[1, 0, 1, 1],
           [0, 0, 1, 1],
           [1, 0, 0, 1],
           [1, 1, 0, 0]])
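Note that both snippets above map exact zeros to 1. If zeros should instead stay 0, one possible variant (a small sketch) clips the sign:
np.clip(np.sign(a), 0, 1).astype(int)  # negative -> 0, zero -> 0, positive -> 1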

In order to generate all combinations of 1's and 0's we use a simple binary table. How can I easily create this binary table in an array?

For example the binary table for 3 bits:
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
And I want to store this into an n*n*2 array, so it would be:
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
For generating the combinations automatically, you can use itertools.product from the standard library, which generates all possible combinations of the supplied sequences, i.e. the Cartesian product of the input iterables. The repeat argument comes in handy since all of our sequences here are identical ranges.
from itertools import product
x = [i for i in product(range(2), repeat=3)]
Now if we want an array instead of a list of tuples from that, we can just pass this to numpy.array.
import numpy as np
x = np.array(x)
# [[0 0 0]
#  [0 0 1]
#  [0 1 0]
#  [0 1 1]
#  [1 0 0]
#  [1 0 1]
#  [1 1 0]
#  [1 1 1]]
If you want all elements in a single list, so you could index them with a single index, you could chain the iterable:
from itertools import chain, product
x = list(chain.from_iterable(product(range(2), repeat=3)))
result: [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1]
Most people would expect the result as a 2^n x n array, as in:
np.c_[tuple(i.ravel() for i in np.mgrid[:2,:2,:2])]
# array([[0, 0, 0],
#        [0, 0, 1],
#        [0, 1, 0],
#        [0, 1, 1],
#        [1, 0, 0],
#        [1, 0, 1],
#        [1, 1, 0],
#        [1, 1, 1]])
Explanation: np.mgrid as used here creates the coordinates of the corners of a unit cube, which happen to be all combinations of 0 and 1. The individual coordinates are then ravelled and joined as columns by np.c_.
Here's a recursive, native python (no libraries) version of it:
def allBinaryPossiblities(maxLength, s=""):
    if len(s) == maxLength:
        return s
    else:
        temp = allBinaryPossiblities(maxLength, s + "0") + "\n"
        temp += allBinaryPossiblities(maxLength, s + "1")
        return temp

print(allBinaryPossiblities(3))
It prints all possible combinations:
000
001
010
011
100
101
110
111
