Dealing with no-List in pandas Dataframe - python

I'm working with this example DataFrame, groupDisk, which is the result of a grouping operation (by VM). I need to count how many True values appear in the list in each row of the Thin column:
VM Powerstate Thin
0 VIRTU1 [poweredOn] [False]
1 VIRTU2 [poweredOn, poweredOn] [False, False]
2 VIRTU3 [poweredOn, poweredOn] [False, False]
3 VIRTU4 [poweredOn, poweredOn] [True, True]
4 VIRTU5 [poweredOn, poweredOn, poweredOn] [False, True, False]
The expected result is 3.
The lists in the Thin column can have 1, 2 or N elements.
Any clue will be appreciated

Use Series.apply with sum if the values are lists of booleans:
df['new'] = df['Thin'].apply(sum)
print (df)
VM Powerstate Thin new
0 VIRTU1 [poweredOn] [False] 0
1 VIRTU2 [poweredOn,poweredOn] [False, False] 0
2 VIRTU3 [poweredOn,poweredOn] [False, False] 0
3 VIRTU4 [poweredOn,poweredOn] [True, True] 2
4 VIRTU5 [poweredOn,poweredOn,poweredOn] [False, True, False] 1
Or, if the values are strings, use Series.str.count:
df['new'] = df['Thin'].str.count('True')
print (df)
VM Powerstate Thin new
0 VIRTU1 [poweredOn] [False] 0
1 VIRTU2 [poweredOn,poweredOn] [False,False] 0
2 VIRTU3 [poweredOn,poweredOn] [False,False] 0
3 VIRTU4 [poweredOn,poweredOn] [True,True] 2
4 VIRTU5 [poweredOn,poweredOn,poweredOn] [False,True,False] 1
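The question asks for a single total (3), while the snippets above give one count per row. A minimal sketch of how the two could be combined, assuming the Thin column holds real Python lists of booleans (the DataFrame below is rebuilt by hand from the question, and the names per_row and total are only illustrative):

```python
import pandas as pd

# Sample data rebuilt from the question (assumed to be lists of bools).
df = pd.DataFrame({
    "VM": ["VIRTU1", "VIRTU2", "VIRTU3", "VIRTU4", "VIRTU5"],
    "Thin": [[False], [False, False], [False, False],
             [True, True], [False, True, False]],
})

# Count of True values in each row's list.
per_row = df["Thin"].apply(sum)

# Overall number of True values across all rows.
total = int(per_row.sum())

# Alternative: flatten the lists first, then count.
total_via_explode = int(df["Thin"].explode().astype(bool).sum())

print(per_row.tolist())  # [0, 0, 0, 2, 1]
print(total)             # 3
```

If the column holds strings instead, the same total should follow from summing the result of Series.str.count('True').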

Related

Generate Boolean Matrix of overlapping intervals

I have a dataframe where two columns represent the start and end points of intervals on a real number line. I want to generate a third column holding, for each row, a list of the indices of the rows it overlaps with. I'm having difficulty creating an inequality boolean matrix for this natively in pandas. I assume logic like s1<=e2 and e1>=s2 will do the trick, but I don't know how to broadcast it effectively.
As a toy example, I'm hoping for a simple way to at least generate a 6x6 boolean matrix (with all True down the diagonal) given this dataframe:
import pandas as pd
intervals_df = pd.DataFrame({"Starts":[0,1,5,10,15,20],"Ends":[4,2,9,14,19,24]})
Starts Ends
0 0 4
1 1 2
2 5 9
3 10 14
4 15 19
5 20 24
The condition for two intervals (s1,e1) and (s2,e2) to intersect is max(s1,s2) <= min(e1,e2). So you can do a cross merge (this is the broadcast), calculate the condition, then pivot:
d = (intervals_df.reset_index()
     .merge(intervals_df.reset_index(), how='cross')
     .assign(cond=lambda x: x.filter(like='Starts').max(axis=1) <= x.filter(like='Ends').min(axis=1))
     # keyword arguments: recent pandas versions no longer accept these positionally
     .pivot(index='index_x', columns='index_y', values='cond')
)
You would get:
index_y 0 1 2 3 4 5
index_x
0 True True False False False False
1 True True False False False False
2 False False True False False False
3 False False False True False False
4 False False False False True False
5 False False False False False True
Or you can make do with numpy's broadcasting:
import numpy as np
starts = intervals_df[['Starts']].to_numpy()
ends = intervals_df[['Ends']].to_numpy()
np.maximum(starts, starts.T) <= np.minimum(ends, ends.T)
Output:
array([[ True, True, False, False, False, False],
[ True, True, False, False, False, False],
[False, False, True, False, False, False],
[False, False, False, True, False, False],
[False, False, False, False, True, False],
[False, False, False, False, False, True]])
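The question also asks for a third column listing the rows each interval overlaps with. A minimal sketch built on the broadcasting approach; the column name Overlaps is made up for illustration, and leaving each row out of its own list is an assumption, since the question does not say either way:

```python
import numpy as np
import pandas as pd

intervals_df = pd.DataFrame({"Starts": [0, 1, 5, 10, 15, 20],
                             "Ends": [4, 2, 9, 14, 19, 24]})

# Column vectors, so that x and x.T broadcast to an n x n matrix.
starts = intervals_df[["Starts"]].to_numpy()
ends = intervals_df[["Ends"]].to_numpy()

# overlap[i, j] is True when intervals i and j intersect.
overlap = np.maximum(starts, starts.T) <= np.minimum(ends, ends.T)

# Assumption: a row should not list itself, so clear the diagonal.
np.fill_diagonal(overlap, False)

# Turn each boolean row into the list of matching index labels.
intervals_df["Overlaps"] = [intervals_df.index[row].tolist() for row in overlap]

print(intervals_df["Overlaps"].tolist())  # [[1], [0], [], [], [], []]
```

Dropping the fill_diagonal line would keep each row's own index in its list, which matches the "all True down the diagonal" matrix described in the question.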

Transform false in between trues

Hello, I have a dataframe like the following one:
df = pd.DataFrame({"a": [True, True, False, True, True], "b": [True, True, False, False, True]})
df
I would like to be able to transform the False values in between Trues to obtain a result like this (depending on a threshold).
# Threshold = 1
df = pd.DataFrame({"a": [True, True, True, True, True], "b": [True, True, False, False, True]})
df
# Threshold = 2
df = pd.DataFrame({"a": [True, True, True, True, True], "b": [True, True, True, True, True]})
df
Any suggestions to do this apart from a for loop?
Edit: The threshold value defines how many consecutive Falses you will take into account to do the transformation.
Edit2: At the beginning and end of the array you should not consider any special case.
A possible solution for replacing groups of consecutive False values whose length is at most the Threshold value: first label the separate False groups with DataFrame.cumsum and DataFrame.mask, count the size of each group with Series.map and Series.value_counts, then compare the counts with DataFrame.le and pass the result to DataFrame.mask:
Threshold = 1
m = df.cumsum().mask(df).apply(lambda x: x.map(x.value_counts())).le(Threshold)
df = df.mask(m, True)
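To make the chained expression easier to follow, here is a sketch that runs the same steps one at a time on the question's sample data. The intermediate names (group_ids, run_lengths, short_runs) are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [True, True, False, True, True],
                   "b": [True, True, False, False, True]})
threshold = 1

# The cumulative sum stays constant across a run of False values, so after
# hiding the True positions each False run carries its own label.
group_ids = df.cumsum().mask(df)

# Replace each label with the number of times it occurs, i.e. the run length.
run_lengths = group_ids.apply(lambda col: col.map(col.value_counts()))

# True where a False value belongs to a run no longer than the threshold.
short_runs = run_lengths.le(threshold)

# Fill only those short runs with True.
result = df.mask(short_runs, True)

print(result["a"].tolist())  # [True, True, True, True, True]
print(result["b"].tolist())  # [True, True, False, False, True]
```

With threshold = 2 the two-element False run in column b is filled as well.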
If the False groups at the start or end should not be replaced:
df = pd.DataFrame({"a": [False, False, True, False, True, False],
"b": [True, True, False, False, True, True]})
print (df)
a b
0 False True
1 False True
2 True False
3 False False
4 True True
5 False True
Threshold = 1
df1 = df.cumsum().mask(df)
m1 = df1.apply(lambda x: x.map(x.value_counts())).le(Threshold)
m2 = df1.ne(df1.iloc[0]) & df1.ne(df1.iloc[-1])
df = df.mask(m1 & m2, True)
print (df)
a b
0 False True
1 False True
2 True False
3 True False
4 True True
5 False True
One way would be to use itertools.groupby to generate the length of each group of adjacent identical items, but sadly it does include a couple of loops:
from itertools import groupby
def how_many_identical_elements(itter):
    return sum([[x]*x for x in [len(list(v)) for g,v in groupby(itter)]], [])
def fill_up_df(df, th):
    df = df.copy()
    for c in df.columns:
        df[f'{c}_count'] = how_many_identical_elements(df[c].values)
        df[c] = [False if x[0]==False and x[1]>th else True for x in zip(df[c], df[f'{c}_count'])]
    return df[[c for c in df.columns if 'count' not in c]]
then
fill_up_df(df, 1)
a b
0 True True
1 True True
2 True False
3 True False
4 True True
fill_up_df(df, 2)
a b
0 True True
1 True True
2 True True
3 True True
4 True True
This code looks at offsets from -threshold to +threshold, column by column, and ORs the results together to create a masking dataframe that meets your criteria. The last line is just the logical OR of your original data and the new mask, as we only need to fill False values. It should be one of the faster solutions if speed is an issue.
from functools import reduce
threshold = 2
filling_mask = reduce(
    lambda x, y: x | y,
    (
        df.shift(-i, fill_value=True) & df.shift(i, fill_value=True)
        for i in range(1, threshold + 1)
    )
)
df |= filling_mask
Threshold 1:
>>> df # Threshold 1
a b
0 True True
1 True True
2 True False
3 True False
4 True True
Threshold 2:
>>> df # Threshold 2
a b
0 True True
1 True True
2 True True
3 True True
4 True True

Best solution for selecting the columns that contain at least one True value in a pandas DataFrame

In [4]: df = pd.DataFrame({'a': [True, False, True], 'b': [False, False, False],
...: 'c': [False, False, False], 'd': [False, True, False],
...: 'e': [False, False, False]})
In [5]: df
Out[5]:
a b c d e
0 True False False False False
1 False False False True False
2 True False False False False
In [6]: df[df.any()[df.any()].index]
Out[6]:
a d
0 True False
1 False True
2 True False
The code under [6] does what I want. My question, however, is: is there a better solution? That is, more concise and/or more elegant.
One direct method is using df.loc with the mask generated by df.any() as input:
df.loc[:, df.any()]
a d
0 True False
1 False True
2 True False
Another option is to index df.columns,
df[df.columns[df.any()]]
Or, df.keys():
df[df.keys()[df.any()]]
a d
0 True False
1 False True
2 True False

Is there an efficient way to create a random bit mask in Pytorch?

I want to have a random bit mask that has some specified percent of 0s. The function I devised is:
import operator
from functools import reduce
import torch
def create_mask(shape, rate):
    """
    The idea is, you take a random permutation of numbers. You then
    mod it by the [number of entries in the bitmask] / [percent of 0s you
    want]. The number of zeros will be exactly the rate of zeros needed. You
    can clamp the values for a bitmask.
    """
    mask = torch.randperm(reduce(operator.mul, shape, 1)).float().cuda()
    # Mod it by the percent to get an even dist of 0s.
    mask = torch.fmod(mask, reduce(operator.mul, shape, 1) / rate)
    # Anything not zero should be put to 1
    mask = torch.clamp(mask, 0, 1)
    return mask.view(shape)
To illustrate:
>>> x = create_mask((10, 10), 10)
>>> x
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1
0 1 1 1 1 0 1 1 1 1
0 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1 1 1
1 1 1 0 1 1 1 0 1 1
0 1 1 1 1 1 1 1 1 1
1 1 1 0 1 1 0 1 1 1
1 1 1 1 1 1 1 1 1 1
[torch.cuda.FloatTensor of size 10x10 (GPU 0)]
The main issue I have with this method is that it requires the rate to divide the shape. I want a function that accepts an arbitrary decimal and gives approximately rate percent of 0s in the bitmask. Furthermore, I am trying to find a relatively efficient way of doing so, so I would rather not move a numpy array from the CPU to the GPU. Is there an efficient way of doing this that allows for a decimal rate?
For anyone running into this, this will create a bitmask with approximately 80% zeros directly on the GPU. (PyTorch 0.3)
torch.cuda.FloatTensor(10, 10).uniform_() > 0.8
Use NumPy and cudamat:
import numpy as np
import cudamat as cm
gpuMask = cm.CUDAMatrix(np.random.choice([0, 1], size=(10,10), p=[1./2, 1./2]))
where the elements of the list p are the probabilities of your 0s and 1s, in that order, written as fractions.
The proper way to create a bit mask directly on GPU with Pytorch is:
import torch
tensor = torch.rand((10, 10), device=torch.device("cuda")) < 0.9
# tensor([[ True, True, False, True, True, True, True, True, True, False],
# [ True, True, True, True, True, True, True, False, False, True],
# [ True, False, False, True, True, True, True, True, True, False],
# [ True, True, True, True, True, True, True, True, True, True],
# [ True, True, False, True, True, True, True, False, True, True],
# [ True, True, False, False, True, True, True, False, True, True],
# [ True, True, True, False, True, True, True, True, True, True],
# [ True, True, True, True, True, True, False, False, True, True],
# [ True, False, True, True, True, True, True, True, True, True],
# [ True, True, True, True, True, True, True, True, False, True]],
# device='cuda:0')
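The comparison trick above can be wrapped in a small function that takes an arbitrary decimal rate, which is what the question asks for. This is only a sketch: random_bit_mask and its zero_fraction parameter are hypothetical names, the fall-back to the CPU when no GPU is present is an added assumption, and the share of zeros is approximate because every entry is drawn independently.

```python
import torch

def random_bit_mask(shape, zero_fraction, device=None):
    # Hypothetical helper: each entry is independently False (0) with
    # probability zero_fraction, so the share of zeros is only approximate.
    if device is None:
        # Assumption: fall back to the CPU when CUDA is unavailable.
        device = "cuda" if torch.cuda.is_available() else "cpu"
    # torch.rand draws from [0, 1), so P(value >= zero_fraction) = 1 - zero_fraction.
    return torch.rand(shape, device=device) >= zero_fraction

torch.manual_seed(0)
mask = random_bit_mask((1000, 1000), 0.3)

# Fraction of zeros actually drawn; it should be close to 0.3.
zero_share = 1.0 - mask.float().mean().item()
print(mask.dtype, zero_share)
```

The tensor is created on the target device directly, so nothing is copied from the CPU to the GPU. Calling .float() on the result gives a 0/1 tensor like the one in the question.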

how to convert integer to logical array in python

I want to convert a number (e.g. 3) to its logical array ([0 0 1 0 0 0 0 0 0 0]).
In matlab, we can use
a = 1:10
b = 3
a == b
then, we can get 0 0 1 0 0 0 0 0 0 0.
How can I get this in Python? When I try the same thing in Python, I get:
In [220]: import numpy as np
In [221]: a = np.arange(10)
In [222]: b = 3
In [223]: a == b
Out[223]: array([False, False, False, True, False, False, False, False, False, False], dtype=bool)
You could convert it to an integer afterwards:
(a == b).astype(int)
Actually, if you want the same output as MATLAB, you need
np.asarray(a + 1 == b).astype(np.int32)
or to define a as
a = np.arange(1,11)
You can replace a == b with np.array(a == b).astype(np.int32).
Changing the type will solve your problem.
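Putting the answers together, here is a minimal sketch of a helper that reproduces the MATLAB output. The name logical_array and its parameters are made up for illustration, and it assumes positions are 1-based, as in MATLAB's 1:10:

```python
import numpy as np

def logical_array(value, length):
    # Hypothetical helper: positions are 1-based, as in MATLAB's 1:length.
    return (np.arange(1, length + 1) == value).astype(int)

result = logical_array(3, 10)
print(result)  # [0 0 1 0 0 0 0 0 0 0]
```

The comparison produces a boolean array first. The astype(int) call is what turns True/False into 1/0.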
