I want to convert a number (e.g. 3) to its logical array ([0 0 1 0 0 0 0 0 0 0]).
In MATLAB, we can use
a = 1:10
b = 3
a == b
and then we get 0 0 1 0 0 0 0 0 0 0.
How can I get this in Python? When I try it in Python, I get:
In [220]: import numpy as np
In [221]: a = np.arange(10)
In [222]: b = 3
In [223]: a == b
Out[223]: array([False, False, False, True, False, False, False, False, False, False], dtype=bool)
You could convert it to an integer array afterwards:
(a == b).astype(int)
Actually, if you want the same output as MATLAB, you need
np.asarray(a + 1 == b).astype(np.int32)
or to define a as
a = np.arange(1,11)
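Putting those pieces together, a minimal sketch that reproduces the MATLAB result exactly (using the 1-based range):
import numpy as np

a = np.arange(1, 11)          # 1..10, like MATLAB's 1:10
b = 3
print((a == b).astype(int))   # [0 0 1 0 0 0 0 0 0 0]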
You can replace a == b with np.array(a == b).astype(np.int32).
Changing the type will solve your problem.
Hello, I have a DataFrame like the following one:
df = pd.DataFrame({"a": [True, True, False, True, True], "b": [True, True, False, False, True]})
df
I would like to be able to transform the False values in between Trues to obtain a result like this (depending on a threshold).
# Threshold = 1
df = pd.DataFrame({"a": [True, True, True, True, True], "b": [True, True, False, False, True]})
df
# Threshold = 2
df = pd.DataFrame({"a": [True, True, True, True, True], "b": [True, True, True, True, True]})
df
Any suggestions to do this apart from a for loop?
Edit: The threshold value defines how many consecutive Falses you will take into account to do the transformation.
Edit 2: The beginning and end of the array should not be treated as special cases.
To replace groups of False values whose size is at most the Threshold: first label the separate False groups with DataFrame.cumsum masked by DataFrame.mask, then get each group's size with Series.map and Series.value_counts, and finally compare with DataFrame.le and pass the result to DataFrame.mask:
Threshold = 1
m = df.cumsum().mask(df).apply(lambda x: x.map(x.value_counts())).le(Threshold)
df = df.mask(m, True)
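As a quick sanity check, running this on the example frame from the question reproduces the expected Threshold = 1 result:
import pandas as pd

df = pd.DataFrame({"a": [True, True, False, True, True],
                   "b": [True, True, False, False, True]})

Threshold = 1
m = df.cumsum().mask(df).apply(lambda x: x.map(x.value_counts())).le(Threshold)
print(df.mask(m, True))
#       a      b
# 0  True   True
# 1  True   True
# 2  True  False
# 3  True  False
# 4  True   True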
If groups of False values at the start or end should not be replaced:
df = pd.DataFrame({"a": [False, False, True, False, True, False],
"b": [True, True, False, False, True, True]})
print (df)
a b
0 False True
1 False True
2 True False
3 False False
4 True True
5 False True
Threshold = 1
df1 = df.cumsum().mask(df)
m1 = df1.apply(lambda x: x.map(x.value_counts())).le(Threshold)
m2 = df1.ne(df1.iloc[0]) & df1.ne(df1.iloc[-1])
df = df.mask(m1 & m2, True)
print (df)
a b
0 False True
1 False True
2 True False
3 True False
4 True True
5 False True
One way would be to use itertools.groupby to generate counts for each group of adjacent identical items, although sadly it does involve a couple of loops:
from itertools import groupby

def how_many_identical_elements(itter):
    # for each element, the length of the run of identical values it belongs to
    return sum([[x]*x for x in [len(list(v)) for g, v in groupby(itter)]], [])

def fill_up_df(df, th):
    df = df.copy()
    for c in df.columns:
        df[f'{c}_count'] = how_many_identical_elements(df[c].values)
        # keep a False only if its run is longer than the threshold
        df[c] = [False if x[0] == False and x[1] > th else True
                 for x in zip(df[c], df[f'{c}_count'])]
    return df[[c for c in df.columns if 'count' not in c]]
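For intuition, how_many_identical_elements maps every element to the length of the run of identical values it belongs to (a quick illustration of the helper defined above):
how_many_identical_elements([True, True, False, True, True])
# [2, 2, 1, 2, 2]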
Then:
fill_up_df(df, 1)
      a      b
0  True   True
1  True   True
2  True  False
3  True  False
4  True   True
fill_up_df(df, 2)
      a     b
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
This code looks from -threshold to +threshold on a column-by-column basis and ORs the results together to create a masking DataFrame that meets your criteria. The last line is just the logical OR of your original data and the new mask, since we only need to fill False values. It should be one of the faster solutions if speed is an issue.
from functools import reduce

threshold = 2
filling_mask = reduce(
    lambda x, y: x | y,
    (
        df.shift(-i, fill_value=True) & df.shift(i, fill_value=True)
        for i in range(1, threshold + 1)
    )
)
df |= filling_mask
Threshold 1:
>>> df # Threshold 1
a b
0 True True
1 True True
2 True False
3 True False
4 True True
Threshold 2:
>>> df # Threshold 2
a b
0 True True
1 True True
2 True True
3 True True
4 True True
I'm dealing with this example DataFrame, groupDisk, which is the result of a grouping operation (by VM). I need to count how many True values appear in the list in each row of the Thin column:
VM Powerstate Thin
0 VIRTU1 [poweredOn] [False]
1 VIRTU2 [poweredOn, poweredOn] [False, False]
2 VIRTU3 [poweredOn, poweredOn] [False, False]
3 VIRTU4 [poweredOn, poweredOn] [True, True]
4 VIRTU5 [poweredOn, poweredOn, poweredOn] [False, True, False]
For this example the result should be 3.
The Thin column can contain 1, 2, or N elements.
Any clue will be appreciated.
Use Series.apply with sum if the values are lists of booleans:
df['new'] = df['Thin'].apply(sum)
print (df)
VM Powerstate Thin new
0 VIRTU1 [poweredOn] [False] 0
1 VIRTU2 [poweredOn,poweredOn] [False, False] 0
2 VIRTU3 [poweredOn,poweredOn] [False, False] 0
3 VIRTU4 [poweredOn,poweredOn] [True, True] 2
4 VIRTU5 [poweredOn,poweredOn,poweredOn] [False, True, False] 1
Or, if the values are strings, use Series.str.count:
df['new'] = df['Thin'].str.count('True')
print (df)
VM Powerstate Thin new
0 VIRTU1 [poweredOn] [False] 0
1 VIRTU2 [poweredOn,poweredOn] [False,False] 0
2 VIRTU3 [poweredOn,poweredOn] [False,False] 0
3 VIRTU4 [poweredOn,poweredOn] [True,True] 2
4 VIRTU5 [poweredOn,poweredOn,poweredOn] [False,True,False] 1
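If the single total from the question (3) is what is ultimately needed, the per-row counts can simply be summed afterwards (a small follow-up sketch, re-using df from above and assuming the list-of-booleans case):
total = df['Thin'].apply(sum).sum()
print(total)   # 3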
I want to compute the transitive closure of a sparse matrix in Python. Currently I am using scipy sparse matrices.
The matrix power (**12 in my case) works well on very sparse matrices, no matter how large they are, but for directed not-so-sparse cases I would like to use a smarter algorithm.
I have found the Floyd-Warshall algorithm (the German page has better pseudocode) in scipy.sparse.csgraph, which does a bit more than it needs to: there is no function for Warshall's algorithm alone - that is one issue.
The main problem is that I can pass a sparse matrix to the function, but doing so is pointless because the function will always return a dense matrix: what should be 0 in the transitive closure becomes a path of infinite length, and someone felt this needed to be stored explicitly.
So my question is: Is there any python module that allows computing the transitive closure of a sparse matrix and keeps it sparse?
I am not 100% sure that he works with the same matrices, but Gerald Penn shows impressive speed-ups in his comparison paper, which suggests that it is possible to solve the problem.
EDIT: As there were a number of confusions, I will point out the theoretical background:
I am looking for the transitive closure (not reflexive or symmetric).
I will make sure that my relation encoded in a boolean matrix has the properties that are required, i.e. symmetry or reflexivity.
I have two cases of the relation:
reflexive
reflexive and symmetric
I want to apply the transitive closure to those two relations. This works perfectly well with matrix power (except that in certain cases it is too expensive):
>>> reflexive
matrix([[ True, True, False, True],
[False, True, True, False],
[False, False, True, False],
[False, False, False, True]])
>>> reflexive**4
matrix([[ True, True, True, True],
[False, True, True, False],
[False, False, True, False],
[False, False, False, True]])
>>> reflexive_symmetric
matrix([[ True, True, False, True],
[ True, True, True, False],
[False, True, True, False],
[ True, False, False, True]])
>>> reflexive_symmetric**4
matrix([[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True]])
So in the first case, we get all the descendants of a node (including itself), and in the second, we get all the components, that is, all the nodes that are in the same component.
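For completeness, the matrix-power approach carries over directly to SciPy sparse matrices; here is a minimal sketch on the small reflexive example above (for a reflexive relation on n nodes the power n-1 suffices, since no path needs more than n-1 edges):
import numpy as np
import scipy.sparse as sparse

# the 4-node reflexive relation from the example above, stored sparsely
R = sparse.csr_matrix(np.array([[1, 1, 0, 1],
                                [0, 1, 1, 0],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]]))
n = R.shape[0]
closure = (R ** (n - 1)).astype(bool)   # nonzero exactly where some path exists
print(closure.toarray().astype(int))
# [[1 1 1 1]
#  [0 1 1 0]
#  [0 0 1 0]
#  [0 0 0 1]]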
This was brought up on the SciPy issue tracker. The problem is not so much the output format; the implementation of Floyd-Warshall begins with a matrix full of infinities and then inserts finite values when a path is found. Sparsity is lost immediately.
The networkx library offers an alternative with its all_pairs_shortest_path_length. Its output is an iterator which returns tuples of the form
(source, dictionary of reachable targets)
which takes a little work to convert to a SciPy sparse matrix (csr format is natural here). A complete example:
import numpy as np
import networkx as nx
import scipy.stats as stats
import scipy.sparse as sparse
A = sparse.random(6, 6, density=0.2, format='csr', data_rvs=stats.randint(1, 2).rvs).astype(np.uint8)
G = nx.DiGraph(A) # directed because A need not be symmetric
paths = nx.all_pairs_shortest_path_length(G)
indices = []
indptr = [0]
for row in paths:
    reachable = [v for v in row[1] if row[1][v] > 0]
    indices.extend(reachable)
    indptr.append(len(indices))
data = np.ones((len(indices),), dtype=np.uint8)
A_trans = A + sparse.csr_matrix((data, indices, indptr), shape=A.shape)
print(A, "\n\n", A_trans)
The reason for adding A back is as follows. Networkx output includes paths of length 0, which would immediately fill the diagonal. We don't want that to happen (you wanted transitive closure, not reflexive-and-transitive closure). Hence the line reachable = [v for v in row[1] if row[1][v] > 0]. But then we don't get any diagonal entries at all, even where A had them (the 0-length empty path beats 1-length path formed by self-loop). So I add A back to the result. It now has entries 1 or 2 but only the fact they are nonzero is of significance.
An example of running the above (I pick 6 by 6 size for readability of the output). Original matrix:
(0, 3) 1
(3, 2) 1
(4, 3) 1
(5, 1) 1
(5, 3) 1
(5, 4) 1
(5, 5) 1
Transitive closure:
(0, 2) 1
(0, 3) 2
(3, 2) 2
(4, 2) 1
(4, 3) 2
(5, 1) 2
(5, 2) 1
(5, 3) 2
(5, 4) 2
(5, 5) 1
You can see that this worked correctly: the added entries are (0, 2), (4, 2), and (5, 2), all acquired via the path (3, 2).
By the way, networkx also has floyd_warshall method but its documentation says
This algorithm is most appropriate for dense graphs. The running time is O(n^3), and running space is O(n^2) where n is the number of nodes in G.
The output is dense again. I get the impression that this algorithm is just considered dense by nature. It seems the all_pairs_shortest_path_length is a kind of Dijkstra's algorithm.
Transitive and Reflexive
If instead of the transitive closure (which is the smallest transitive relation containing the given one) you wanted the transitive and reflexive closure (the smallest transitive and reflexive relation containing the given one), the code simplifies, as we no longer worry about 0-length paths.
for row in paths:
    indices.extend(row[1])
    indptr.append(len(indices))
data = np.ones((len(indices),), dtype=np.uint8)
A_trans = sparse.csr_matrix((data, indices, indptr), shape=A.shape)
Transitive, Reflexive, and Symmetric
This means finding the smallest equivalence relation containing the given one. Equivalently, it means dividing the vertices into connected components. For this you don't need to go to networkx; there is the connected_components method in SciPy. Set directed=False there. Example:
import numpy as np
import scipy.stats as stats
import scipy.sparse as sparse
import scipy.sparse.csgraph  # make sparse.csgraph available below
import itertools
A = sparse.random(20, 20, density=0.02, format='csr', data_rvs=stats.randint(1, 2).rvs).astype(np.uint8)
components = sparse.csgraph.connected_components(A, directed=False)
nonzeros = []
for k in range(components[0]):
    idx = np.where(components[1] == k)[0]
    nonzeros.extend(itertools.product(idx, idx))
row = tuple(r for r, c in nonzeros)
col = tuple(c for r, c in nonzeros)
data = np.ones_like(row)
B = sparse.coo_matrix((data, (row, col)), shape=A.shape)
This is what the output print(B.toarray()) looks like for a random example, 20 by 20:
[[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]]
I have a column of a pandas DataFrame that looks like this:
1 False
2 False
3 False
4 True
5 True
6 False
7 False
8 False
9 False
10 False
11 True
12 False
I would like to get the count of False values between the True values. Something like this:
1 3
2 0
3 5
4 1
This is what I've done:
counts = []
count = 0
for k in df['result'].index:
    if df['result'].loc[k] == False:
        count += 1
    else:
        counts.append(count)
        count = 0
where counts would be the result. Is there a simpler way?
Group by the cumulative sum of the Series itself and then count the False values with sum:
s = pd.Series([False, False, False, True, True, False, False, False, False, False, True, False])
(~s).groupby(s.cumsum()).sum()
#0 3.0
#1 0.0
#2 5.0
#3 1.0
#dtype: float64
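If the integer counts shown in the question are preferred over the float output above, the result can simply be cast afterwards (a self-contained sketch of the same idea):
import pandas as pd

s = pd.Series([False, False, False, True, True, False, False,
               False, False, False, True, False])
counts = (~s).groupby(s.cumsum()).sum().astype(int)
print(counts.tolist())   # [3, 0, 5, 1]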
You can use the groupby function from the itertools package to group the False and True values together and append the count to a list.
s = pd.Series([False, False, False, True, True, False, False, False, False, False, True, False],
              index=range(1, 13))
from itertools import groupby
out = []
for v, g in groupby(s):
    if not v:  # v is False
        out.append(len(tuple(g)))
    else:      # v is True
        out.extend([0]*(len(tuple(g))-1))
out
[3, 0, 5, 1]
I want to implement the MATLAB commands listed below in Python. I am able to figure out the equivalent Python commands, but I am not getting the exact result. Can someone please help me achieve this?
MATLAB CODE:
n0 = 3
n1 = 1
n2 = 5
n = [n1:n2]
>> 1 2 3 4 5
x = [(n - n0) == 0]
>> 0 0 1 0 0
PYTHON CODE:
import numpy
n0 = 3
n1 = 1
n2 = 5
n = numpy.r_[n1:n2+1]
>> [1 2 3 4 5]
x = numpy.r_[(n - n0) == 0]
>> [False False True False False]
So x is my array with boolean data type: array([False, False, True, False, False], dtype=bool). How can I make my last command return the result as 0's and 1's so that it is exactly the same as in MATLAB?
Use a list comprehension to convert each bool to an int:
[int(val) for val in x]
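Alternatively, the conversion can stay vectorized in NumPy, avoiding the Python-level loop entirely; a small sketch using the question's values n0 = 3, n1 = 1, n2 = 5:
import numpy as np

n0 = 3
n = np.arange(1, 6)               # 1 2 3 4 5, like MATLAB's n1:n2
x = ((n - n0) == 0).astype(int)   # 0/1 integer array instead of booleans
print(x)                          # [0 0 1 0 0]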