I am trying to count how many consecutive TRUEs there are in each row. I solved that part myself, but I need a solution for one more requirement: if a row starts with FALSE, the result must be 0. There is a sample dataset below. Can you recommend tips on how to solve this?
PS. my original question is at the link below.
how to find number of consecutive decreases(increases)
Sample data, .csv file
idx,Expected Results,M_1,M_2,M_3,M_4,M_5,M_6,M_7,M_8,M_9,M_10,M_11,M_12
1001,0,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1002,3,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE
1003,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1004,4,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1005,0,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1006,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1007,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1008,1,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1009,0,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE
1010,1,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE
1011,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE
1013,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1014,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1015,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1016,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1017,2,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1018,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
After John's solution:
How can I count the TRUEs until I see the first FALSE?
result = df.copy()
result.loc[~df['M_1'], 'M_1':'M_12'] = 0  # rows starting with FALSE become all zeros
idx,M_1,M_2,M_3,M_4,M_5,M_6,M_7,M_8,M_9,M_10,M_11,M_12
1001,0,0,0,0,0,0,0,0,0,0,0,0
1002,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE
1003,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1004,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1005,0,0,0,0,0,0,0,0,0,0,0,0
1006,0,0,0,0,0,0,0,0,0,0,0,0
1007,0,0,0,0,0,0,0,0,0,0,0,0
1008,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1009,0,0,0,0,0,0,0,0,0,0,0,0
1010,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE
1011,0,0,0,0,0,0,0,0,0,0,0,0
1013,0,0,0,0,0,0,0,0,0,0,0,0
1014,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1015,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1016,0,0,0,0,0,0,0,0,0,0,0,0
1017,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1018,0,0,0,0,0,0,0,0,0,0,0,0
You can use np.argmin. You don't need to prefilter your df; it handles rows starting with False correctly, because the position of the first False in a row is exactly the count of leading Trues.
df.loc[:, 'M_1':'M_12'].values.argmin(1)
#array([0, 3, 1, 4, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 2, 0])
Note that this assumes there is at least one False in every row.
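If an all-True row can occur, a small patch fixes it, since argmin returns 0 for a row with no False. A sketch, assuming the same df as above:
import numpy as np

vals = df.loc[:, 'M_1':'M_12'].values
# argmin returns 0 for an all-True row, so patch those rows to the full width
counts = np.where(vals.all(axis=1), vals.shape[1], vals.argmin(axis=1))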
Another option: a cumulative AND along each row stays True only for the leading run of Trues, so summing each row counts them.
df.loc[:, 'M_1':'M_12'].apply(np.logical_and.accumulate, axis=1).sum(axis=1)
Invert the values of columns M_1 through M_12 with negation ('~'), i.e. True becomes False and vice versa. Then apply cummax along each row to separate out the first group of consecutive Trues (note: at this point True represents a False value and False a True value). Apply another negation to the result of cummax, and finally sum each row:
(~(~df.drop(columns=['idx'])).cummax(axis=1)).sum(axis=1)
Out[503]:
0 0
1 3
2 1
3 4
4 0
5 0
6 0
7 1
8 0
9 1
10 0
11 0
12 1
13 1
14 0
15 2
16 0
dtype: int64
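To see the mechanics on a single row, here is my own illustration using row 1002 from the sample:
import pandas as pd

row = pd.Series([True, True, True, False, False, False,
                 True, True, True, False, False, False])
inverted = ~row           # F F F T T T F F F T T T
mask = inverted.cummax()  # F F F T T T T T T T T T  (True from the first original False on)
leading = ~mask           # T T T F F F F F F F F F  (only the leading run survives)
print(leading.sum())      # 3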
I have a numpy.ndarray called grouping of size (S, N). Each row of grouping gives me the group labels of a sample of data. I run my algorithm S times and get new group labels in each iteration.
I want to determine how many times each sample of my data has the same group label as every other sample of my data across the S iterations in a fully vectorized way.
In a not-completely-vectorized way:
import numpy as np

sim_matrix = np.zeros((N, N))
for s in range(S):
    sim_matrix += np.equal.outer(grouping[s, :], grouping[s, :])
One vectorized approach would be with broadcasting -
(grouping[:,None,:] == grouping[:,:,None]).sum(0)
For performance, we can use np.count_nonzero -
np.count_nonzero(grouping[:,None,:] == grouping[:,:,None],axis=0)
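Both produce the same matrix as the loop version; a quick check with made-up data (the rng seed and sizes are my own choices):
import numpy as np

rng = np.random.default_rng(0)
S, N = 5, 4
grouping = rng.integers(0, 3, (S, N))

loop = np.zeros((N, N))
for s in range(S):
    loop += np.equal.outer(grouping[s, :], grouping[s, :])

vec = np.count_nonzero(grouping[:, None, :] == grouping[:, :, None], axis=0)
assert np.array_equal(loop, vec)  # (N, N) co-occurrence counts match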
The sum of equal.outer is a cryptic way of calculating all-pairs similarity of columns:
sum_i sum_{j,k} (A[i,j] == A[i,k]) is the same as
sum_{j,k} sum_i (A[i,j] == A[i,k])
where sum_i runs over rows and sum_{j,k} over all pairs of columns.
Comparing two vectors by counting the number of positions where they differ is called the Hamming distance. If we change == above to != and convert similarity to distance = nrows - similarity (most similar ⇔ distance 0), we get the problem: find the Hamming distance between all pairs of a bunch of vectors.
import numpy as np

def allpairs_hamming(A, dtype=np.uint32):
    """ -> Hamming distances between all pairs of rows of A """
    nrow, ncol = A.shape
    allpair_dist = np.zeros([nrow, nrow], dtype=dtype)
    for j in range(nrow):
        for k in range(j + 1, nrow):
            allpair_dist[j, k] = allpair_dist[k, j] = (A[j] != A[k]).sum()  # row diff
    return allpair_dist
allpairs_hamming: 30.7 sec, 3 ns per cmp Nvec 2000 Veclen 5000 A 10m pairdist uint32 15m
Almost all the cpu time is in the row diff, not in the outer loop for j ... for k -- 3 ns per scalar compare, on a stock mac, isn't bad.
However memory caching is much faster if each row A[j] is in contiguous memory,
as for numpy C-order arrays.
Apart from that, whether you do "all pairs of rows" or "all pairs of columns"
doesn't matter, as long as you're clear.
(Is it possible to find "nearby" pairs in time and space < O(npairs), here O(20000^2) ? Afaik there are more methods than test cases.)
See also:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html (bug: hamming .mean not .sum)
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
https://stats.stackexchange.com/search?q=[clustering]+pairwise
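On the scipy note above: pdist's 'hamming' metric returns the fraction of differing positions (a mean, not a sum), so scale by the number of columns to get counts. A short sketch with made-up demo data:
import numpy as np
from scipy.spatial.distance import pdist, squareform

A = np.random.randint(0, 3, (5, 7))
# scipy returns mean disagreement per position; multiply by ncol for counts
counts = squareform(pdist(A, metric='hamming')) * A.shape[1]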
You want to compare identical rows. One way to do that is to group each entire row into a single raw block:
import numpy as np

S, N = 12, 2
a = np.random.randint(0, 3, (S, N))  # 12 samples of two labels
#a
    0  1
0   2  2
1   2  0
2   1  2
3   0  0
4   0  1
5   1  1
6   0  1
7   0  1
8   0  1
9   0  0
10  2  2
11  0  0
samples = np.ascontiguousarray(a).view(np.dtype((np.void, a.strides[0])))
samples.shape is then (S, 1).
You can now inventory your samples with np.unique, and use a pandas DataFrame for a pretty report:
import pandas as pd

_, inds, invs = np.unique(samples, return_index=True, return_inverse=True)
df = pd.DataFrame(invs)
result = df.reset_index().groupby(0)['index'].apply(list).to_frame()
result['sample'] = [list(x) for x in a[inds]]
which gives:
          index  sample
0
0    [3, 9, 11]  [0, 0]
1  [4, 6, 7, 8]  [0, 1]
2           [5]  [1, 1]
3           [2]  [1, 2]
4           [1]  [2, 0]
5       [0, 10]  [2, 2]
This can be O(S ln S) if there are few matches between samples, whereas yours is O(N²S).
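As an aside (my addition, not part of the original answer): on NumPy 1.13+ you can skip the void view entirely, since np.unique accepts an axis argument:
# inventories identical rows directly, without the raw-block view
_, inds, invs = np.unique(a, axis=0, return_index=True, return_inverse=True)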
I have two equal-sized arrays (array1 and array2) of 0's and 1's. How do I get all the arrays whose bitwise union with array1 results in array2? For example, if array1 = [1, 1, 1] and array2 = [1, 1, 1], the output should be all eight arrays: [0, 0, 0], [1, 0, 0], ...., [1, 1, 1]. Are there efficient solutions, or is brute force the only way?
My try:
I tried to calculate the bitwise difference first; if any bit is negative, return false (it is not possible to combine array1 with any array to get array2). If all bits are non-negative, then: if a bit in the difference is 0, I assumed it could be either 0 or 1 in the required array (this assumption is wrong, though, and fails for array1 = [0,0], array2 = [0,0]); and if any bit in the difference is 1, the required array has to have a 1 at that place to make it 1.
Here's how I would go about solving this problem:
First, let's think about this. You need to find all arrays of binary values that, when combined (via some operator) with a known binary value, equal a new binary value. Don't try to solve the problem yet. Assume you need to go from 00 to 11. How many possible answers are there? Assume you need to go from 11 to 11. How many possible answers are there? Can you do any better (in the worst case) than a brute force approach? That'll give you a complexity bound.
With that rough bound in mind, tackle the parts of the question that are a bit curious. Drill down into the question a little more. What is the 'bitwise union operator'? Is it 'and'? Is it 'or'? Is it something more complicated? 'Bitwise union' sounds like B[i] = A[i] OR X[i], but anyone asking that question could mean something else.
Depending on the answers to questions 1 and 2, you have a lot to work with here. I can think of a few different options, but I think from here you can come up with an algorithm.
Once you have a solution, you need to ask: "Can I do a better job here?" A lot of that goes back to the initial impressions about the problem and how they're constructed, and what/how much you think you can optimize.
Note: I will explain the following with an example input:
A = [0 0 1 0 1 1], B = [1 1 1 0 1 1]
Assuming you want to calculate X for the equation A OR X = B, let us see what the options are for each choice of bit in A and B:
A    X      B (= A OR X)
------------------------
0    0      0
0    1      1
1    N.A.   0
1    (0,1)  1
If any bit in A is 1 and its corresponding bit in B is 0, no solutions are possible; return an empty set.
If the corresponding bits in A and B are both 1, the corresponding bit in X does not matter.
Now, see that one solution for X is B itself (if condition #1, as stated above, is satisfied). Hence, let's construct a number start_num = B. This will be one solution, and the other solutions will be constructed from it.
start_num = B = [1 1 1 0 1 1]
The 'choice' bits are those where X can take any value, i.e. those positions where A=1 and B=1. Let us make another number choice = A AND B, so that choice = 1 denotes those positions. Also notice that, if there are k positions where choice = 1, the total number of solutions is 2^k.
choice = A AND B = [0 0 1 0 1 1], hence k = 3
Store these 'choice' positions in an array (of length k), starting from the right (LSB = 0). Let us call this array pos_array.
pos_array = [0 1 3]
Notice that all the 'choice' bits in start_num are set to 1. Hence, all the other solutions will have some (1 <= p <= k) of these bits set to 0. Now that we know which bits are to be changed, we need to make these solutions in an efficient manner.
This can be done by generating the solutions in an order where consecutive solutions differ at exactly one position, making each new solution cheap to compute. For example, with two 'choice' bits, the following contrasts a 1-bit-change (Gray code) order with simply running through all combinations in decreasing arithmetic order:
1-bit-toggle order        decreasing order
----------------------    ----------------------
1 1   // start            1 1   // start
1 0   // toggle bit 0     1 0   // subtract 1
0 0   // toggle bit 1     0 1   // subtract 1
0 1   // toggle bit 0     0 0   // subtract 1
(We want to exploit the speed of bitwise operations, hence we will use the 1-bit-toggle order).
Now, we will build each solution (this is not actual C code, just an explanation):
addToSet(start_num);  // add the initial solution to the set
for (i = 1; i < 2^k; i++)
{
    // pos = number of trailing zero bits of i: the standard Gray-code
    // rule for which choice bit to toggle at step i
    pos = 0;
    count = i;
    while ((count & 1) == 0)
    {
        count = count >> 1;
        pos++;
    }
    toggle(start_num[pos_array[pos]]);  // update start_num by toggling the desired bit
    addToSet(start_num);                // add the updated vector to the set
}
If this code is run on the above example, the following toggle statements will be executed:
toggle(start_num[0])
toggle(start_num[1])
toggle(start_num[0])
toggle(start_num[3])
toggle(start_num[0])
toggle(start_num[1])
toggle(start_num[0])
, which will result in the following additions:
addToSet([1 1 1 0 1 0])
addToSet([1 1 1 0 0 0])
addToSet([1 1 1 0 0 1])
addToSet([1 1 0 0 0 1])
addToSet([1 1 0 0 0 0])
addToSet([1 1 0 0 1 0])
addToSet([1 1 0 0 1 1])
, which, in addition to the already-present initial solution [1 1 1 0 1 1], completes the set.
NOTE: I am not an expert in bitwise operations, among other things. I think there are better ways to write the algorithm, making better use of bit-access pointers and bitwise binary operations (and I will be glad if someone suggests improvements). What I am proposing with this solution is the general approach to this problem.
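For concreteness, here is a minimal Python sketch of the same approach; the function name or_preimages and the bit bookkeeping are my own choices, not part of the pseudocode above:
def or_preimages(A, B):
    """Return all bit lists X with A[i] OR X[i] == B[i] at every position."""
    n = len(A)
    if any(a == 1 and b == 0 for a, b in zip(A, B)):
        return []                        # condition #1 violated: no solutions
    start = list(B)                      # one known solution: B itself
    # 'choice' positions (A AND B == 1), indexed from the right (LSB = 0)
    pos_array = [p for p in range(n) if A[n - 1 - p] == 1 and B[n - 1 - p] == 1]
    solutions = [list(start)]
    for i in range(1, 2 ** len(pos_array)):
        pos = (i & -i).bit_length() - 1      # trailing zeros of i (Gray-code rule)
        start[n - 1 - pos_array[pos]] ^= 1   # toggle exactly one bit
        solutions.append(list(start))
    return solutions
Running or_preimages([0, 0, 1, 0, 1, 1], [1, 1, 1, 0, 1, 1]) reproduces the eight vectors listed above.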
You can construct the digit options for each slot i by evaluating:
for d in (0, 1):
    if (array1[i] or d) == array2[i]:
        digits[i].append(d)
Then you just need to iterate over i.
The objective is to construct a list of lists, e.g. [[0,1],[1],[0,1]], showing the valid digits for each slot. Then you can use itertools.product() to construct all of the valid arrays:
arrays = list(itertools.product(*digits))
You can put all this together using a list comprehension, which results in:
list(it.product(*[[d for d in (0, 1) if (x or d) == y] for x, y in zip(array1, array2)]))
In action:
>>> import itertools as it
>>> a1, a2 = [1,1,1], [1,1,1]
>>> list(it.product(*[[d for d in (0, 1) if (x or d) == y] for x, y in zip(a1, a2)]))
[(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
>>> a1, a2 = [1,0,0], [1,1,1]
>>> list(it.product(*[[d for d in (0, 1) if (x or d) == y] for x, y in zip(a1, a2)]))
[(0, 1, 1), (1, 1, 1)]
>>> a1, a2 = [1,0,0], [0,1,1]
>>> list(it.product(*[[d for d in (0, 1) if (x or d) == y] for x, y in zip(a1, a2)]))
[]