Get log(n, 2) features after dummy encoding - python

During dummy encoding (for example with OneHotEncoder) we can drop the first feature (with the parameter drop='first'). This helps when we have 3 categories: 2 columns are enough to encode them, e.g. (0, 0), (0, 1), (1, 0). The same 2 columns can even encode 4 categories: (0, 0), (0, 1), (1, 0), (1, 1). So I noticed that to encode n categories it is enough to have math.ceil(log(n, 2)) features. But I can't find a function (in sklearn/pandas) that does this. Can you help?

What you're looking for is the bin() function, which is a standard Python built-in.
Suppose you happen to have a simple pandas df:
import pandas as pd
df = pd.DataFrame({"a":["a","b","c","x","a","c"]})
print(df)
a
0 a
1 b
2 c
3 x
4 a
5 c
Then you may proceed as follows:
df["enc"] = df["a"].astype("category").cat.codes
max_enc_length = len(bin(df["enc"].max())[2:])
df["enc"]=df["enc"].apply(lambda x: bin(x)[2:].zfill(max_enc_length))
df = pd.concat([df["a"], df["enc"].apply(lambda x: pd.Series(list(x)))], axis=1)
print(df)
a 0 1
0 a 0 0
1 b 0 1
2 c 1 0
3 x 1 1
4 a 0 0
5 c 1 0
Note that this type of encoding is a poor fit for linear models.
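The same idea can be sketched without pandas, which makes the ceil(log2(n)) column count explicit. This is only an illustration: the helper name binary_encode is hypothetical, and numbering the categories in sorted order is an assumption of this sketch.

```python
import math

def binary_encode(values):
    """Map each distinct value to a fixed-width tuple of bits (hypothetical helper)."""
    categories = sorted(set(values))
    # ceil(log2(n)) bits are enough to tell n categories apart; keep at least one bit.
    width = max(1, math.ceil(math.log2(len(categories))))
    codes = {cat: i for i, cat in enumerate(categories)}
    # format(code, "0<width>b") gives the zero-padded binary string of the code.
    return [tuple(int(b) for b in format(codes[v], "0{}b".format(width)))
            for v in values]

encoded = binary_encode(["a", "b", "c", "x", "a", "c"])
print(encoded)
# [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 0)]
```

With four categories the width is 2, so the rows agree with the two-column table above.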

Related

How does one interpret sklearn sparse matrix outputs?

I am trying to produce a bigram word co-occurrence matrix, indicating how many times one word follows another in a corpus.
As a test, I wrote the following (which I gathered from other SE questions):
from sklearn.feature_extraction.text import CountVectorizer
test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
bigram_vec = CountVectorizer(ngram_range=(1,2))
X = bigram_vec.fit_transform(test_sent)
Xc = (X.T * X)
print(Xc)
This should give the correct output. The matrix Xc is output like so:
(0, 0) 1
(1, 1) 2
(2, 2) 2
(3, 3) 1
(4, 4) 1
I have no idea how to interpret this. I attempted to make it dense to help with my interpretation using Xc.todense(), which got this:
[[1 0 0 0 0]
[0 2 0 0 0]
[0 0 2 0 0]
[0 0 0 1 0]
[0 0 0 0 1]]
Neither of these gives the correct word co-occurrence matrix, which should show how many times the row word follows the column word.
Could someone please explain how I can interpret/use the output? Why is it like that?
Addition to question
Here is another possible output with a different example using ngram_range=(2,2):
from sklearn.feature_extraction.text import CountVectorizer
test_sent = ['hello biggest awesome biggest biggest awesome today lively splendid awesome today']
bigram_vec = CountVectorizer(ngram_range=(2,2))
X = bigram_vec.fit_transform(test_sent)
print(bigram_vec.get_feature_names())
Xc = (X.T * X)
print(Xc)
print(' ')
print(Xc.todense())
(4, 0) 1
(2, 0) 2
(0, 0) 1
(3, 0) 1
(1, 0) 2
(7, 0) 1
(5, 0) 1
(6, 0) 1
(4, 1) 2
(2, 1) 4
(0, 1) 2
(3, 1) 2
(1, 1) 4
(7, 1) 2
(5, 1) 2
(6, 1) 2
(4, 2) 2
(2, 2) 4
(0, 2) 2
(3, 2) 2
(1, 2) 4
(7, 2) 2
(5, 2) 2
(6, 2) 2
(4, 3) 1
: :
(6, 4) 1
(4, 5) 1
(2, 5) 2
(0, 5) 1
(3, 5) 1
(1, 5) 2
(7, 5) 1
(5, 5) 1
(6, 5) 1
(4, 6) 1
(2, 6) 2
(0, 6) 1
(3, 6) 1
(1, 6) 2
(7, 6) 1
(5, 6) 1
(6, 6) 1
(4, 7) 1
(2, 7) 2
(0, 7) 1
(3, 7) 1
(1, 7) 2
(7, 7) 1
(5, 7) 1
(6, 7) 1
[[1 2 2 1 1 1 1 1]
[2 4 4 2 2 2 2 2]
[2 4 4 2 2 2 2 2]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]]
This one seems to tokenize by bigrams, since calling bigram_vec.get_feature_names() gives
[u'awesome biggest', u'awesome today', u'biggest awesome', u'biggest biggest', u'hello biggest', u'lively splendid', u'splendid awesome', u'today lively']
Some help interpreting this would be great. It's a symmetric matrix, so I'm thinking it might just be the number of occurrences?
First you need to check out the feature names which the CountVectorizer is using.
Do this:
bigram_vec.get_feature_names()
# Out: [u'am', u'dont', u'hello', u'to', u'want']
You see that the word "i" is not present. That's because the default tokenizer uses a pattern:
token_pattern : string
Regular expression denoting what constitutes a “token”, only used if
analyzer == 'word'. The default regexp select tokens of 2 or more
alphanumeric characters (punctuation is completely ignored and always
treated as a token separator).
And the actual output of the X should be interpreted as:
          [u'am', u'dont', u'hello', u'to', u'want']
'hello'  [[ 0 0 1 0 0]
'i'       [ 0 0 0 0 0]
'am'      [ 1 0 0 0 0]
'hello'   [ 0 0 1 0 0]
'i'       [ 0 0 0 0 0]
'dont'    [ 0 1 0 0 0]
'want'    [ 0 0 0 0 1]
'to'      [ 0 0 0 1 0]
'i'       [ 0 0 0 0 0]
'dont'    [ 0 1 0 0 0]]
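Now when you do X.T * X, entry (i, j) sums, over all documents, the count of term i times the count of term j. Each document here is a single word, so only the diagonal is nonzero, and the result should be interpreted as:
            u'am' u'dont' u'hello' u'to' u'want'
u'am'    [[1 0 0 0 0]
u'dont'   [0 2 0 0 0]
u'hello'  [0 0 2 0 0]
u'to'     [0 0 0 1 0]
u'want'   [0 0 0 0 1]]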
If you are expecting anything else, then you should add the details in the question.
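To see the effect of the default pattern without scikit-learn, and to count how often one word directly follows another (which is what the question was after), a small standard-library sketch may help. The regular expression below is assumed to match CountVectorizer's default token_pattern. The pair counting is a plain-Python substitute, not something CountVectorizer does for you.

```python
import re
from collections import Counter

words = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']

# Assumed default token_pattern: tokens of two or more word characters,
# so the one-letter word 'i' never becomes a feature.
default_pattern = re.compile(r"(?u)\b\w\w+\b")
tokens = default_pattern.findall(" ".join(words))
print(sorted(set(tokens)))
# ['am', 'dont', 'hello', 'to', 'want']

# Count how many times the second word directly follows the first one
# in the original list (single-letter words included).
follows = Counter(zip(words, words[1:]))
print(follows[('hello', 'i')])
# 2
```

The follows counter is keyed by (first, second) pairs, so a dense matrix could be filled from it if one is needed.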

Pandas: Conditionally Selecting Columns to perform Calculation based on Header of another Column

My dataframe looks like this:
(1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (1, 7) (1, 8) (1, 9) (1, 10) (1, 11) ... 2 3 4 5 6 7 8 9 10 11
0 0 1 0 1 1 1 1 0 1 0 ... 0.612544 0.727393 0.366578 0.631451 0.722980 0.772853 0.964982 0.549801 0.406692 0.798083
1 0 0 0 0 0 0 0 0 0 0 ... 0.583228 0.698729 0.343934 0.602037 0.694230 0.745422 0.954682 0.521298 0.382381 0.771640
2 1 0 0 1 0 1 1 0 0 0 ... 0.481291 0.593353 0.271028 0.498949 0.588807 0.641602 0.901779 0.424495 0.303309 0.669657
3 1 1 0 1 0 1 1 0 0 1 ... 0.583228 0.698729 0.343934 0.602037 0.694230 0.745422 0.954682 0.521298 0.382381 0.771640
4 0 0 0 1 1 1 1 1 1 1 ... 0.612544 0.727393 0.366578 0.631451 0.722980 0.772853 0.964982 0.549801 0.406692 0.798083
where I have column headers that are tuples, like (1, 2), and column headers that are single elements, like 1. I want to perform a calculation on the tuple columns based on the columns named by the elements of that tuple. For example, with the tuple (1, 2), I want to retrieve the columns 1 and 2, multiply them together, then subtract the result from the column (1, 2).
The solution that I thought of was to create (55) new columns that perform the first calculation from the columns that contain only a single element (e.g. 1 or 2), and then do some sort of identity match using the .where() and all() statements. However, this seems rather computationally inefficient since I'd be making a whole other set of data, rather than performing the calculation directly on the tuple column. How would I go about this?
Not sure if this is faster, but here's a solution without needing where()/all()
import pandas as pd
# create some sample data
arr = [[1, 2, 3, 4, 5, 6, 7],
[7, 6, 5, 4, 3, 2, 1]]
df = pd.DataFrame(arr, columns=[('a', 'b'), ('c','d'), ('a', 'd'), 'a', 'b', 'c', 'd'])
# get all tuple headers
tuple_columns = [col for col in df.columns if isinstance(col, tuple)]
# put the results into a list of series and concat into a DataFrame
results = pd.concat([df[col] - df[col[0]] * df[col[1]] for col in tuple_columns], axis=1)
# rename the columns
results.columns = tuple_columns

Determine list of all possible products from a list of integers in Python

In Python 2.7 I need a method that returns all possible products of a list or tuple of int. I.e., if the input is (2, 2, 3, 4), then I'd want an output like
(3, 4, 4), 2 * 2 = 4
(2, 4, 6), 2 * 3 = 6
(2, 3, 8), 2 * 4 = 8
(3, 4, 4), 2 * 2 = 4
(2, 2, 12), 3 * 4 = 12
(2, 24), 2 * 3 * 4 = 24
(3, 16), 2 * 2 * 4 = 16
(4, 12), 2 * 2 * 3 = 12
(48), 2 * 2 * 3 * 4 = 48
wrapped up in a list or tuple. I figure that a nice implementation is probably possible using combinations from itertools, but I'd appreciate any help. Note that I am only interested in distinct lists, where order of int plays no role.
EDIT
Some further explanation for clarification. Take the first output list. The input is (2, 2, 3, 4) (always). Then I take 2 and 2 out of the list and multiply them, so now I am left with the list (3, 4, 4): 3 and 4 from the input and the last 4 from the product.
I haven't tried anything yet since I just can't wrap my head around that kind of loop. But I can't stop thinking about the problem, so I'll add some code if I do get a suggestion.
Your problem is basically one of finding all subsets of a given set (a multiset in your case). Once you have the subsets, it's straightforward to construct the output you've asked for.
For a set A, find all the subsets [S0, S1, ..., Si]. For each subset Si, take (A - Si) | product(Si), where | is union and - is set difference. You might not be interested in subsets of size 0 and 1, so you can just exclude those.
Finding subsets is a well-known problem, so I'm sure you can find resources on how to do that. Keep in mind that there are 2**N subsets of a set with N elements.
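A minimal sketch of that recipe, assuming subsets are chosen by position so that the repeated 2 is handled as a multiset element. The function name merged_products is hypothetical.

```python
from functools import reduce
from itertools import combinations
import operator

def merged_products(numbers):
    """Replace each subset of size >= 2 by its product; return the distinct results."""
    results = set()
    indices = range(len(numbers))
    for size in range(2, len(numbers) + 1):
        # Choose positions, not values, so duplicates count as separate elements.
        for chosen in combinations(indices, size):
            product = reduce(operator.mul, (numbers[i] for i in chosen))
            rest = [numbers[i] for i in indices if i not in chosen]
            # Sorting makes lists that differ only in order compare equal.
            results.add(tuple(sorted(rest + [product])))
    return results

print(sorted(merged_products((2, 2, 3, 4))))
# [(2, 2, 12), (2, 3, 8), (2, 4, 6), (2, 24), (3, 4, 4), (3, 16), (4, 12), (48,)]
```

Collecting sorted tuples in a set removes the duplicates that arise from the two 2s, such as the repeated (3, 4, 4) in the question.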
Suppose you have a vector of 4 numbers (for instance (2,2,3,4)).
You can generate a grid (like the one shown below):
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
Now remove the rows with all '0' and the rows with only one '1'.
0 0 1 1
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
Now you can substitute the '1' with the respective element in the vector.
If your vector is (2,2,3,4) it becomes:
0 0 3 4
0 2 0 4
0 2 3 0
0 2 3 4
2 0 0 4
2 0 3 0
2 0 3 4
2 2 0 0
2 2 0 4
2 2 3 0
2 2 3 4
Try to implement this in Python.
Below is some pseudocode:
for i from 0 to 2^VECTOR_LEN (exclusive):
    bin = convert_to_binary(i)
    if sum_binary_digit(bin) > 1:
        print(exec_multiplication(bin, vector))
        # if you want, you can also use the bin vector as a mask to update
        # your tuple of int with the result of the product and append it
        # to a list (as in your example).
        # For example, if bin is (1 1 0 0) you can turn (2 2 3 4) into (4 3 4)
        # and append (4 3 4) to the list, or if it is (1 0 1 0) you can
        # turn (2 2 3 4) into (6 2 4)
WHERE:
vector is the vector containing the numbers
VECTOR_LEN is the length of vector
convert_to_binary(num) is a function that converts an integer (num) to binary
sum_binary_digit(bin) is a function that counts the 1s in your binary number (bin)
exec_multiplication(bin, vector) takes the binary (bin) and the vector (vector) as input and returns the value of the multiplication.
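One way the pseudocode could be turned into Python is sketched below. The generator name masked_products is hypothetical, and the mask is kept as a string of '0' and '1' characters rather than a tuple, which is a choice of this sketch.

```python
def masked_products(vector):
    """Yield (mask, product) for every bit mask that selects at least two elements."""
    n = len(vector)
    for i in range(2 ** n):
        # Zero-padded binary representation of i, one character per element.
        mask = format(i, "0{}b".format(n))
        if mask.count("1") > 1:
            product = 1
            for bit, value in zip(mask, vector):
                if bit == "1":
                    product *= value
            yield mask, product

for mask, product in masked_products((2, 2, 3, 4)):
    print(mask, product)
```

For the vector (2, 2, 3, 4) this visits the same 11 masks as the reduced grid above, starting at 0011 (product 12) and ending at 1111 (product 48).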
I can't give you the algorithm (as I don't know it myself), but there is a library that can achieve this task...
Looking at your input numbers, they seem to be factors, so if we multiply all of these factors we get a number (say x). Using sympy, we can then get all of the divisors of that number:
import numpy
ls = [2,2,3,4]
x = numpy.prod(ls)
from sympy import divisors
divisors_x = divisors(x)
Here you go! divisors_x is the list.
You can break this down into three steps:
get all the permutations of the list of numbers
for each of those permutations, create all the possible partitions
for each sublist in the partitions, calculate the product
For the permutations, you can use itertools.permutations, but as far as I know, there is no builtin function for partitions, but that's not too difficult to write (or to find):
def partitions(lst):
    if lst:
        for i in range(1, len(lst) + 1):
            for p in partitions(lst[i:]):
                yield [lst[:i]] + p
    else:
        yield []
For a list like (1,2,3,4), this will generate [(1),(2),(3),(4)], [(1),(2),(3,4)], [(1),(2,3),(4)], [(1),(2,3,4)], and so on, but not, e.g. [(1,3),(2),(4)]; that's why we also need the permutations. However, for all the permutations, this will create many partitions that are effectively duplicates, like [(1,2),(3,4)] and [(4,3),(1,2)] (182 for your data), but unless your lists are particularly long, this should not be too much of a problem.
We can combine the second and third step; this way we can weed out all the duplicates as soon as they arise:
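import itertools
import operator
from functools import reduce  # built in on Python 2, must be imported on Python 3
data = (2, 2, 3, 4)
res = {tuple(sorted(reduce(operator.mul, lst) for lst in partition))
       for permutation in itertools.permutations(data)
       for partition in partitions(permutation)}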
Afterwards, res is {(6, 8), (2, 4, 6), (2, 2, 3, 4), (2, 2, 12), (48,), (3, 4, 4), (4, 12), (3, 16), (2, 24), (2, 3, 8)}
Alternatively, you can combine it all in one, slightly more complex algorithm. This still generates some duplicates, due to the two 2 in your data set, that can again be removed by sorting and collecting in a set. The result is the same as above.
def all_partitions(lst):
    if lst:
        x = lst[0]
        for partition in all_partitions(lst[1:]):
            # x can either be a partition itself...
            yield [x] + partition
            # ... or part of any of the other partitions
            for i, _ in enumerate(partition):
                partition[i] *= x
                yield partition
                partition[i] //= x
    else:
        yield []
res = set(tuple(sorted(x)) for x in all_partitions(list(data)))

python- pandas- concatenate columns with a loop

I have a list of columns that I need to concatenate. An example table would be:
import random
import numpy as np
import pandas as pd
cats1=['T_JW', 'T_BE', 'T_FI', 'T_DE', 'T_AP', 'T_KI', 'T_HE']
data=np.array([random.sample(list(range(0,2))*7,7)]*3)
df_=pd.DataFrame(data, columns=cats1)
So I need to get the concatenation of each line (if it's possible with a blank space between each value). I tried:
listaFin=['']*1000
for i in cats1:
    lista=list(df_[i])
    listaFin=zip(listaFin,lista)
But I get a list of tuples:
listaFin:
[((((((('', 0), 0), 1), 0), 1), 0), 1),
((((((('', 0), 0), 1), 0), 1), 0), 1),
((((((('', 0), 0), 1), 0), 1), 0), 1)]
And I need to get something like
[0 0 1 0 1 0 1,
0 0 1 0 1 0 1,
0 0 1 0 1 0 1]
How can I do this using only one loop or less (I don't want to use a double loop)?
Thanks.
I don't think you can have a list of space delimited integers in Python without them being in a string (I might be wrong). Having said that, the answer I have is:
output = []
for i in range(0,df_.shape[0]):
    output.append(' '.join(str(x) for x in list(df_.loc[i])))
print(output)
output looks like this:
['1 0 0 0 1 0 1', '1 0 0 0 1 0 1', '1 0 0 0 1 0 1']
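The nested tuples in the question come from calling zip once per column, so each call pairs the previous tuples with the next column. Passing all columns to a single zip call transposes them into flat rows instead. The sketch below uses a plain dict with made-up values in place of the DataFrame. With the DataFrame from the question, iterating over (df_[name] for name in cats1) should behave the same way.

```python
# Made-up column data standing in for the DataFrame columns.
columns = {
    'T_JW': [0, 0, 0],
    'T_BE': [0, 0, 0],
    'T_FI': [1, 1, 1],
}
cats = ['T_JW', 'T_BE', 'T_FI']

# zip(*...) transposes the column lists into row tuples in a single pass.
rows = zip(*(columns[name] for name in cats))
joined = [' '.join(str(value) for value in row) for row in rows]
print(joined)
# ['0 0 1', '0 0 1', '0 0 1']
```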

Slice pandas series with elements not in the index

I have a pandas series indexed by tuples, like this:
from pandas import Series
s = Series({(0, 0): 1, (0, 1): 2, (0, 3): 3, (1, 0): 1, (1, 2): 4, (3, 0): 5})
I want to slice such a series by using indexes that are also tuples (using lexicographic ordering), but not necessarily in the index. Slicing seems to work when I pass an index that is on the series:
s[:(1,0)]
(0, 0) 1
(0, 1) 2
(0, 3) 3
(1, 0) 1
dtype: int64
but if I try slicing by an index which is not on the series there is an error:
s[:(1,1)]
...
ValueError: Index(...) must be called with a collection of some kind, 0 was passed
Ideally I'd like to get the series elements indexed by (0, 0), (0, 1), (0, 3), (1, 0), similar to what happens when slicing using dates in TimeSeries. Is there a simple way to achieve this?
This works if you have a MultiIndex rather than an index of tuples:
In [11]: s.index = pd.MultiIndex.from_tuples(s.index)
In [12]: s
Out[12]:
0  0    1
   1    2
   3    3
1  0    1
   2    4
3  0    5
dtype: int64
In [13]: s[:(1,1)]
Out[13]:
0  0    1
   1    2
   3    3
1  0    1
dtype: int64
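As a rough illustration of what slicing up to a key that is not present means under lexicographic ordering, here is a standard-library sketch using bisect on the sorted tuple keys. It is not how pandas implements the lookup. It only reproduces the expected result for the data in the question.

```python
from bisect import bisect_right

data = {(0, 0): 1, (0, 1): 2, (0, 3): 3, (1, 0): 1, (1, 2): 4, (3, 0): 5}
keys = sorted(data)  # tuples sort lexicographically

# Position just past the last key that compares <= (1, 1),
# even though (1, 1) itself is not one of the keys.
stop = bisect_right(keys, (1, 1))
selected = [(key, data[key]) for key in keys[:stop]]
print(selected)
# [((0, 0), 1), ((0, 1), 2), ((0, 3), 3), ((1, 0), 1)]
```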
In a previous edit I had suggested this could be a bug, and had created an awful hack...
