How does one interpret sklearn sparse matrix outputs? - python

I am trying to produce a bigram word co-occurrence matrix, indicating how many times one word follows another in a corpus.
As a test, I wrote the following (which I gathered from other SE questions):
from sklearn.feature_extraction.text import CountVectorizer
test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
bigram_vec = CountVectorizer(ngram_range=(1,2))
X = bigram_vec.fit_transform(test_sent)
Xc = (X.T * X)
print Xc
This should give the correct output. The matrix Xc is output like so:
(0, 0) 1
(1, 1) 2
(2, 2) 2
(3, 3) 1
(4, 4) 1
I have no idea how to interpret this. I attempted to make it dense to help with my interpretation using Xc.todense(), which got this:
[[1 0 0 0 0]
[0 2 0 0 0]
[0 0 2 0 0]
[0 0 0 1 0]
[0 0 0 0 1]]
Neither of these gives the word co-occurrence matrix I expected, which should show how many times the row word follows the column word.
Could someone please explain how I can interpret/use the output? Why is it like that?
Addition to question
Here is another possible output with a different example using ngram_range=(2,2):
from sklearn.feature_extraction.text import CountVectorizer
test_sent = ['hello biggest awesome biggest biggest awesome today lively splendid awesome today']
bigram_vec = CountVectorizer(ngram_range=(2,2))
X = bigram_vec.fit_transform(test_sent)
print bigram_vec.get_feature_names()
Xc = (X.T * X)
print Xc
print ' '
print Xc.todense()
(4, 0) 1
(2, 0) 2
(0, 0) 1
(3, 0) 1
(1, 0) 2
(7, 0) 1
(5, 0) 1
(6, 0) 1
(4, 1) 2
(2, 1) 4
(0, 1) 2
(3, 1) 2
(1, 1) 4
(7, 1) 2
(5, 1) 2
(6, 1) 2
(4, 2) 2
(2, 2) 4
(0, 2) 2
(3, 2) 2
(1, 2) 4
(7, 2) 2
(5, 2) 2
(6, 2) 2
(4, 3) 1
: :
(6, 4) 1
(4, 5) 1
(2, 5) 2
(0, 5) 1
(3, 5) 1
(1, 5) 2
(7, 5) 1
(5, 5) 1
(6, 5) 1
(4, 6) 1
(2, 6) 2
(0, 6) 1
(3, 6) 1
(1, 6) 2
(7, 6) 1
(5, 6) 1
(6, 6) 1
(4, 7) 1
(2, 7) 2
(0, 7) 1
(3, 7) 1
(1, 7) 2
(7, 7) 1
(5, 7) 1
(6, 7) 1
[[1 2 2 1 1 1 1 1]
[2 4 4 2 2 2 2 2]
[2 4 4 2 2 2 2 2]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]]
This one seems to tokenize by bigrams, since calling bigram_vec.get_feature_names() gives
[u'awesome biggest', u'awesome today', u'biggest awesome', u'biggest biggest', u'hello biggest', u'lively splendid', u'splendid awesome', u'today lively']
Some help interpreting this would be great. It's a symmetric matrix, so I'm thinking it might just be the number of occurrences?

First you need to check out the feature names which the CountVectorizer is using.
Do this:
bigram_vec.get_feature_names()
# Out: [u'am', u'dont', u'hello', u'to', u'want']
You see that the word "i" is not present. That's because the default tokenizer uses a pattern:
token_pattern : string
Regular expression denoting what constitutes a “token”, only used if
analyzer == 'word'. The default regexp select tokens of 2 or more
alphanumeric characters (punctuation is completely ignored and always
treated as a token separator).
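To see the effect of that pattern without sklearn, here is a minimal sketch that applies the same default regular expression with the standard re module. The names TOKEN_PATTERN, tokens_per_doc and vocabulary are only illustrative, and the sketch ignores the n-gram step, which adds nothing here because every document holds a single word.

```python
import re

# The documented default token_pattern: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']

# Each list element is treated as a separate document.
tokens_per_doc = [re.findall(TOKEN_PATTERN, doc.lower()) for doc in test_sent]

# The vocabulary is the sorted set of surviving tokens; 'i' is too short to match.
vocabulary = sorted({tok for doc in tokens_per_doc for tok in doc})
print(vocabulary)  # ['am', 'dont', 'hello', 'to', 'want']
```

If single-character words should be kept, passing a pattern such as token_pattern=r"(?u)\b\w+\b" to CountVectorizer is one option.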
The actual output X should be interpreted as follows: each row is one input document (here, one list element) and each column is one feature:
          [u'am', u'dont', u'hello', u'to', u'want']
'hello'  [[ 0 0 1 0 0]
'i'       [ 0 0 0 0 0]
'am'      [ 1 0 0 0 0]
'hello'   [ 0 0 1 0 0]
'i'       [ 0 0 0 0 0]
'dont'    [ 0 1 0 0 0]
'want'    [ 0 0 0 0 1]
'to'      [ 0 0 0 1 0]
'i'       [ 0 0 0 0 0]
'dont'    [ 0 1 0 0 0]]
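Now when you compute X.T * X, entry (i, j) sums, over all documents, the count of feature i times the count of feature j. Each document here contains a single token, so no two features ever share a document: only the diagonal is non-zero, and it holds each word's total count. The result should be interpreted as: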
           u'am' u'dont' u'hello' u'to' u'want'
u'am'    [[1 0 0 0 0]
u'dont'   [0 2 0 0 0]
u'hello'  [0 0 2 0 0]
u'to'     [0 0 0 1 0]
u'want'   [0 0 0 0 1]]
If you are expecting anything else, then you should add the details in the question.
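If what you are after is how many times one word directly follows another, one possible approach (a sketch using only the standard library, not CountVectorizer, and assuming the corpus is a flat list of tokens as in the question) is to count adjacent pairs. The names follows, vocab and matrix are illustrative.

```python
from collections import Counter

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']

# Count each (previous_word, next_word) pair of adjacent tokens.
follows = Counter(zip(test_sent, test_sent[1:]))

vocab = sorted(set(test_sent))

# matrix[r][c] = number of times vocab[r] directly follows vocab[c].
matrix = [[follows[(col, row)] for col in vocab] for row in vocab]

print(vocab)
for word, counts in zip(vocab, matrix):
    print(word, counts)
```

Unlike X.T * X, this matrix is generally not symmetric, because "i follows hello" and "hello follows i" are counted separately.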

Related

filter list dataframe by element

I have a list and dataframe (example below).
0 1
0 ((test1, AA), (1, 1)) 1
1 ((test2, BB), (1, 1)) 2
2 ((test1, CC), (1, 1)) 3
3 ((test1, DD), (2, 1)) 8
4 ((test3, EE), (3, 1)) 9
I need to keep only the rows whose first elements are test1 AND 1. Could you please help?
Expected output:
0 1
0 ((test1, AA), (1, 1)) 1
2 ((test1, CC), (1, 1)) 3
You can use boolean indexing:
v = df[0].apply(lambda i: i[0][0] == 'test1' and i[1][0] == 1)
df = df[v]
print(df)
Output
0 1
0 ((test1, AA), (1, 1)) 1
2 ((test1, CC), (1, 1)) 3

How to pad an array non-symmetrically (e.g., only from one side)?

There is an example in Numpy's documentation for padding 2D arrays with constants:
def pad_with(vector, pad_width, iaxis, kwargs):
    pad_value = kwargs.get('padder', 10)
    vector[:pad_width[0]] = pad_value
    vector[-pad_width[1]:] = pad_value
but it works for symmetric paddings only. For instance, np.pad(a, ((2, 2), (1, 1)), pad_with, padder=0) gives:
[[0 0 0 0 0]
[0 0 0 0 0]
[0 1 1 1 0]
[0 1 1 1 0]
[0 0 0 0 0]
[0 0 0 0 0]]
Question: How can I pad the array only from specific sides (i.e., only left and top sides)? Like this:
[[0 0 0 0]
[0 0 0 0]
[0 1 1 1]
[0 1 1 1]]
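It turns out that a simple change achieves that. When a pad width is 0, vector[-0:] selects the whole vector and overwrites it, so the trailing assignment has to be skipped in that case: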
def pad_with(vector, pad_width, iaxis, kwargs):
    pad_value = kwargs.get('padder', 0)
    vector[:pad_width[0]] = pad_value
    if pad_width[1] != 0:  # <-- the only change (0 indicates no padding)
        vector[-pad_width[1]:] = pad_value
Here are some examples:
Padding 1 row of zeros (only) to the top:
>>> np.pad(a, ((1, 0), (0, 0)), pad_with, padder=0)
[[0 0 0]
[1 1 1]
[1 1 1]]
Padding 2 columns of zeros, both to the left and right:
>>> np.pad(a, ((0, 0), (2, 2)), pad_with, padder=0)
[[0 0 1 1 1 0 0]
[0 0 1 1 1 0 0]]
and so on.
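For constant padding specifically, a custom function may not be needed at all: np.pad's built-in 'constant' mode accepts a different width for each side. A minimal sketch, assuming a is the 2x3 array of ones used in the examples above:

```python
import numpy as np

a = np.ones((2, 3), dtype=int)

# pad_width is ((top, bottom), (left, right)):
# 2 rows on top and 1 column on the left, nothing on the other sides.
padded = np.pad(a, ((2, 0), (1, 0)), mode='constant', constant_values=0)
print(padded)
```

The custom pad_with function is still useful when the padding rule is something the built-in modes do not cover.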

How to create matrix in python of repeating number?

I want to:
Create a vector list from 0 to 4, i.e. [0, 1, 2, 3, 4] and from that
Create a matrix containing a "tiered list" from 0 to 4, 3 times over, once for each dimension. The matrix has 4^3 = 64 rows, so for example
T = [0 0 0
0 0 1
0 0 2
0 0 3
0 0 4
0 1 0
0 1 1
0 1 2
0 1 3
0 1 4
0 2 0
...
1 0 0
...
1 1 0
....
4 4 4]
This is what I have so far:
n=5;
ind=list(range(0,n))
print(ind)
I am just getting started with Python so any help would be greatly appreciated!
The product() function in Python's itertools module can do this:
for code in itertools.product(range(5), repeat=3):
    print(code)
Giving the result:
(0, 0, 0)
(0, 0, 1)
(0, 0, 2)
(0, 0, 3)
...
(4, 4, 2)
(4, 4, 3)
(4, 4, 4)
So to make this into a matrix:
import itertools
matrix = []
for code in itertools.product(range(5), repeat=3):
    matrix.append(list(code))
list_ = []
for a in range(5):
    for b in range(5):
        for c in range(5):
            list_ += [[a, b, c]]  # add one row per (a, b, c) instead of flattening
print(list_)
Note, you really want the matrix to have 5^3 = 125 rows. The basic answer is to just iterate in nested for loops:
T = []
for a in range(5):
    for b in range(5):
        for c in range(5):
            T.append([a, b, c])
There are other, probably faster, ways of doing this, but for sheer get 'er done velocity, it's hard to beat this.
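As one example of a more compact form, here is a sketch that builds the same rows with itertools.product in a single list comprehension and confirms the row count. The variable n is taken from the question; the rest is illustrative.

```python
import itertools

n = 5

# Every combination of three values from 0..n-1, in the same order as the nested loops.
T = [list(row) for row in itertools.product(range(n), repeat=3)]

print(len(T))        # 125, i.e. 5**3
print(T[:3], T[-1])
```

Changing repeat gives the same construction for any number of columns without adding more nested loops.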

Determine list of all possible products from a list of integers in Python

In Python 2.7 I need a method that returns all possible products of a list or tuple of int. I.e., if the input is (2, 2, 3, 4), then I'd want an output like
(3, 4, 4), 2 * 2 = 4
(2, 4, 6), 2 * 3 = 6
(2, 3, 8), 2 * 4 = 8
(3, 4, 4), 2 * 2 = 4
(2, 2, 12), 3 * 4 = 12
(2, 24), 2 * 3 * 4 = 24
(3, 16), 2 * 2 * 4 = 16
(4, 12), 2 * 2 * 3 = 12
(48), 2 * 2 * 3 * 4 = 48
wrapped up in a list or tuple. I figure that a nice implementation is probably possible using combinations from itertools, but I'd appreciate any help. Note that I am only interested in distinct lists, where order of int plays no role.
EDIT
Some further explanation for clarification. Take the first output list. Input is (2, 2, 3, 4) (always). Then I take 2 and 2 out of the list and multiply them, so now I am left with a list (3, 4, 4): 3 and 4 from the input and the last 4 from the product.
I haven't tried anything yet since I just can't spin my head around that kind of loop. But I can't stop thinking about the problem, so I'll add some code if I do get a suggestion.
Your problem is basically one of finding all subsets of a given set (a multiset in your case). Once you have the subsets, it's straightforward to construct the output you've asked for.
For a set A find all the subsets [S0, S1, ..., Si]. For each subset Si, take (A - Si) | product(Si), where | is union and - is a set difference. You might not be interested in subsets of size 0 and 1, so you can just exclude those.
Finding subsets is a well-known problem, so I'm sure you can find resources on how to do that. Keep in mind that there are 2**N subsets of a set with N elements.
Suppose you have a vector of 4 numbers (for instance (2,2,3,4)).
You can generate a grid (like the one shown below):
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
Now remove the rows with all '0' and the rows with only one '1'.
0 0 1 1
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
Now you can substitute the '1' with the respective element in the vector.
If your vector is (2,2,3,4) it becomes:
0 0 3 4
0 2 0 4
0 2 3 0
0 2 3 4
2 0 0 4
2 0 3 0
2 0 3 4
2 2 0 0
2 2 0 4
2 2 3 0
2 2 3 4
Try to implement this in Python.
Below is some pseudocode:
for i from 0 to 2^VECTOR_LEN - 1:
    bin = convert_to_binary(i)
    if sum_binary_digit(bin) > 1:
        print(exec_multiplication(vector, bin))
        # if you want, you can also use bin as a mask to update your
        # tuple of ints with the result of the product and append it
        # to a list (as in your example).
        # For example, if bin is (1 1 0 0) you can turn (2 2 3 4) into (4 3 4)
        # and append (4 3 4) to the list, or if it is (1 0 1 0) you can
        # turn (2 2 3 4) into (6 2 4)
WHERE:
vector is the vector containing the numbers
VECTOR_LEN is the length of vector
convert_to_binary(num) is a function that converts an integer (num) to binary
sum_binary_digit(bin) is a function that sums the 1s in your binary number (bin)
exec_multiplication(vector, bin) takes as input the vector (vector) and the binary mask (bin) and returns the product of the selected elements.
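The pseudocode above could be turned into Python roughly as follows. This is only a sketch: the function name merged_products is hypothetical, the integer i itself serves as the bit mask instead of a separate binary conversion, and results are sorted and collected in a set so that duplicates caused by the two 2s disappear.

```python
from functools import reduce
import operator

def merged_products(vector):
    """For every subset of 2+ elements, replace the subset by its product."""
    n = len(vector)
    results = set()
    for mask in range(1, 2 ** n):
        # Bit i of mask decides whether vector[i] is multiplied or left alone.
        chosen = [vector[i] for i in range(n) if (mask >> i) & 1]
        if len(chosen) < 2:
            continue  # skip masks with a single 1 (and the empty mask is never generated)
        rest = [vector[i] for i in range(n) if not (mask >> i) & 1]
        product = reduce(operator.mul, chosen)
        # Sort so that order plays no role, as the question requires.
        results.add(tuple(sorted(rest + [product])))
    return results

print(sorted(merged_products((2, 2, 3, 4))))
```

For (2, 2, 3, 4) this yields the eight distinct lists shown in the question. It merges exactly one subset at a time, so a result such as (6, 8), which needs two merges, is not produced.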
I can't give you the algorithm (as I don't know it myself), but there is a library which can achieve this task.
Looking at your input numbers, they seem to be factors, so if we multiply all of these factors we get a number (say x). Using sympy, we can then get all of the divisors of that number:
import numpy
ls = [2,2,3,4]
x = numpy.prod(ls)
from sympy import divisors
divisors_x = divisors(x)
Here you go! This is the list (divisors_x).
You can break this down into three steps:
1. get all the permutations of the list of numbers
2. for each of those permutations, create all the possible partitions
3. for each sublist in the partitions, calculate the product
For the permutations, you can use itertools.permutations, but as far as I know, there is no builtin function for partitions, but that's not too difficult to write (or to find):
def partitions(lst):
    if lst:
        for i in range(1, len(lst) + 1):
            for p in partitions(lst[i:]):
                yield [lst[:i]] + p
    else:
        yield []
For a list like (1,2,3,4), this will generate [(1),(2),(3),(4)], [(1),(2),(3,4)], [(1),(2,3),(4)], [(1),(2,3,4)], and so on, but not, e.g. [(1,3),(2),(4)]; that's why we also need the permutations. However, for all the permutations, this will create many partitions that are effectively duplicates, like [(1,2),(3,4)] and [(4,3),(1,2)] (182 for your data), but unless your lists are particularly long, this should not be too much of a problem.
We can combine the second and third step; this way we can weed out all the duplicates as soon as they arise:
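import itertools
import operator
from functools import reduce  # built in on Python 2, required on Python 3
data = (2, 2, 3, 4)
res = {tuple(sorted(reduce(operator.mul, lst) for lst in partition))
       for permutation in itertools.permutations(data)
       for partition in partitions(permutation)}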
Afterwards, res is {(6, 8), (2, 4, 6), (2, 2, 3, 4), (2, 2, 12), (48,), (3, 4, 4), (4, 12), (3, 16), (2, 24), (2, 3, 8)}
Alternatively, you can combine it all in one, slightly more complex algorithm. This still generates some duplicates, due to the two 2 in your data set, that can again be removed by sorting and collecting in a set. The result is the same as above.
def all_partitions(lst):
    if lst:
        x = lst[0]
        for partition in all_partitions(lst[1:]):
            # x can either be a partition itself...
            yield [x] + partition
            # ... or part of any of the other partitions
            for i, _ in enumerate(partition):
                partition[i] *= x
                yield partition
                partition[i] //= x
    else:
        yield []
res = set(tuple(sorted(x)) for x in all_partitions(list(data)))
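Note that all_partitions modifies and re-yields the same list object, so a caller that stores the yielded lists without copying them would see them change later. A non-mutating variant might look like the sketch below. The name set_partitions is hypothetical, and it keeps the groups themselves rather than their running product, multiplying only at the end.

```python
from functools import reduce
import operator

def set_partitions(items):
    """Yield every way to split items into non-empty groups, as fresh lists."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # first forms a group of its own ...
        yield [[first]] + partition
        # ... or joins one of the existing groups
        for i, group in enumerate(partition):
            yield partition[:i] + [[first] + group] + partition[i + 1:]

data = (2, 2, 3, 4)
res = {tuple(sorted(reduce(operator.mul, group) for group in partition))
       for partition in set_partitions(list(data))}
print(sorted(res))
```

For four elements this enumerates 15 partitions, which collapse to the same 10 distinct results once the products are sorted and collected in a set.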

python- pandas- concatenate columns with a loop

I have a list of columns that I need to concatenate. An example table would be:
import random
import numpy as np
import pandas as pd
cats1=['T_JW', 'T_BE', 'T_FI', 'T_DE', 'T_AP', 'T_KI', 'T_HE']
data=np.array([random.sample([0, 1]*7,7)]*3)
df_=pd.DataFrame(data, columns=cats1)
So I need to get the concatenation of each line (if it's possible with a blank space between each value). I tried:
listaFin=['']*1000
for i in cats1:
    lista=list(df_[i])
    listaFin=zip(listaFin,lista)
But I get a list of tuples:
listaFin:
[((((((('', 0), 0), 1), 0), 1), 0), 1),
((((((('', 0), 0), 1), 0), 1), 0), 1),
((((((('', 0), 0), 1), 0), 1), 0), 1)]
And I need to get something like
[0 0 1 0 1 0 1,
0 0 1 0 1 0 1,
0 0 1 0 1 0 1]
How can I do this using only one loop or less (I don't want to use a double loop)?
Thanks.
I don't think you can have a list of space delimited integers in Python without them being in a string (I might be wrong). Having said that, the answer I have is:
output = []
for i in range(0,df_.shape[0]):
    output.append(' '.join(str(x) for x in list(df_.loc[i])))
print(output)
output looks like this:
['1 0 0 0 1 0 1', '1 0 0 0 1 0 1', '1 0 0 0 1 0 1']
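The explicit loop can also be avoided. One possible sketch, assuming the cats1 columns from the question and using fixed sample values instead of random ones so the result is reproducible:

```python
import pandas as pd

cats1 = ['T_JW', 'T_BE', 'T_FI', 'T_DE', 'T_AP', 'T_KI', 'T_HE']
df_ = pd.DataFrame([[0, 0, 1, 0, 1, 0, 1]] * 3, columns=cats1)

# Convert every cell to a string, then join the values of each row with a space.
output = df_[cats1].astype(str).apply(' '.join, axis=1).tolist()
print(output)
```

Selecting df_[cats1] first restricts the join to the listed columns, which matters if the frame has other columns that should not be concatenated.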
