I have a dataframe where two columns represent the start and end points of intervals on the real number line. I want to generate a third column containing, for each row, a list of the indices of all rows whose interval overlaps it. I'm having difficulty creating an inequality boolean matrix for this natively in pandas. I assume logic like s1 <= e2 and e1 >= s2 will do the trick, but I don't know how to broadcast it effectively.
As a toy example I'm hoping for a simple way to at least generate a 6x6 boolean matrix (with all True down the diagonal) given this dataframe:
import pandas as pd
intervals_df = pd.DataFrame({"Starts":[0,1,5,10,15,20],"Ends":[4,2,9,14,19,24]})
Starts Ends
0 0 4
1 1 2
2 5 9
3 10 14
4 15 19
5 20 24
The condition for two intervals (s1, e1) and (s2, e2) to intersect is max(s1, s2) <= min(e1, e2). So you can do a cross merge (this is the broadcast), compute the condition, then pivot:
d = (intervals_df.reset_index()
       .merge(intervals_df.reset_index(), how='cross')
       .assign(cond=lambda x: x.filter(like='Starts').max(axis=1) <= x.filter(like='Ends').min(axis=1))
       .pivot(index='index_x', columns='index_y', values='cond')
     )
You would get:
index_y 0 1 2 3 4 5
index_x
0 True True False False False False
1 True True False False False False
2 False False True False False False
3 False False False True False False
4 False False False False True False
5 False False False False False True
Or you can make do with numpy's broadcasting:
import numpy as np

starts = intervals_df[['Starts']].to_numpy()
ends = intervals_df[['Ends']].to_numpy()
np.maximum(starts, starts.T) <= np.minimum(ends, ends.T)
Output:
array([[ True, True, False, False, False, False],
[ True, True, False, False, False, False],
[False, False, True, False, False, False],
[False, False, False, True, False, False],
[False, False, False, False, True, False],
[False, False, False, False, False, True]])
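To go from the boolean matrix back to the third column the question asks for (a list of overlapping row indices per row), one possible sketch on top of the numpy result above; the "Overlaps" column name is just illustrative, and each list includes the row's own index since every interval overlaps itself:
overlap = np.maximum(starts, starts.T) <= np.minimum(ends, ends.T)
intervals_df["Overlaps"] = [list(np.flatnonzero(row)) for row in overlap]
Drop the row's own index from each list if self-overlap should be excluded.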
Hello, I have a dataframe like the following one:
df = pd.DataFrame({"a": [True, True, False, True, True], "b": [True, True, False, False, True]})
df
I would like to turn the False values that sit in between Trues into True, to obtain a result like this (depending on a threshold).
# Threshold = 1
df = pd.DataFrame({"a": [True, True, True, True, True], "b": [True, True, False, False, True]})
df
# Threshold = 2
df = pd.DataFrame({"a": [True, True, True, True, True], "b": [True, True, True, True, True]})
df
Any suggestions to do this apart from a for loop?
Edit: The threshold value defines how many consecutive Falses will be taken into account for the transformation.
Edit2: The beginning and end of the array do not need to be treated as a special case.
To replace groups of False values whose length is at most the Threshold: first label the separate False groups with DataFrame.cumsum masked by DataFrame.mask, count the group sizes with Series.map and Series.value_counts, and finally compare with DataFrame.le and pass the result to DataFrame.mask:
Threshold = 1
m = df.cumsum().mask(df).apply(lambda x: x.map(x.value_counts())).le(Threshold)
df = df.mask(m, True)
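For readability, here is the same chain unrolled step by step, using the question's first frame (the intermediate names are only for illustration):
groups = df.cumsum().mask(df)                             # label the False runs, NaN where True
sizes = groups.apply(lambda x: x.map(x.value_counts()))   # length of the run each False cell belongs to
m = sizes.le(Threshold)                                    # runs short enough to flip
df = df.mask(m, True)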
If False groups at the start or end of the frame should not be replaced:
df = pd.DataFrame({"a": [False, False, True, False, True, False],
"b": [True, True, False, False, True, True]})
print (df)
a b
0 False True
1 False True
2 True False
3 False False
4 True True
5 False True
Threshold = 1
df1 = df.cumsum().mask(df)
m1 = df1.apply(lambda x: x.map(x.value_counts())).le(Threshold)
m2 = df1.ne(df1.iloc[0]) & df1.ne(df1.iloc[-1])
df = df.mask(m1 & m2, True)
print (df)
a b
0 False True
1 False True
2 True False
3 True False
4 True True
5 False True
One way would be to use itertools.groupby to generate counts of each group of adjacent identical items, but sadly it does include a couple of loops:
from itertools import groupby

def how_many_identical_elements(itter):
    # length of the run each element belongs to, repeated once per element
    return sum([[x]*x for x in [len(list(v)) for g, v in groupby(itter)]], [])

def fill_up_df(df, th):
    df = df.copy()
    for c in df.columns:
        df[f'{c}_count'] = how_many_identical_elements(df[c].values)
        df[c] = [False if x[0]==False and x[1]>th else True for x in zip(df[c], df[f'{c}_count'])]
    return df[[c for c in df.columns if 'count' not in c]]
then
fill_up_df(df, 1)
      a      b
0  True   True
1  True   True
2  True  False
3  True  False
4  True   True

fill_up_df(df, 2)
      a     b
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
This code looks from -threshold to +threshold on a column-by-column basis and ORs the results together to create a masking dataframe that meets your criteria. The last line is just the logical OR of your original data and the new mask, as we only need to fill False values. It should be one of the faster solutions if speed is an issue.
from functools import reduce

threshold = 2
filling_mask = reduce(
    lambda x, y: x | y,
    (
        df.shift(-i, fill_value=True) & df.shift(i, fill_value=True)
        for i in range(1, threshold + 1)
    )
)
df |= filling_mask
Threshold 1:
>>> df # Threshold 1
a b
0 True True
1 True True
2 True False
3 True False
4 True True
Threshold 2:
>>> df # Threshold 2
a b
0 True True
1 True True
2 True True
3 True True
4 True True
I'm dealing with this example of DataFrame groupDisk, which is the result of a grouping operation (by VM). I need to count how many True values appear in the list in each row of the column Thin.
VM Powerstate Thin
0 VIRTU1 [poweredOn] [False]
1 VIRTU2 [poweredOn, poweredOn] [False, False]
2 VIRTU3 [poweredOn, poweredOn] [False, False]
3 VIRTU4 [poweredOn, poweredOn] [True, True]
4 VIRTU5 [poweredOn, poweredOn, poweredOn] [False, True, False]
For this example the result must be 3.
The Thin column can contain 1, 2 or N elements.
Any clue will be appreciated.
Use Series.apply with sum if the values are lists of booleans:
df['new'] = df['Thin'].apply(sum)
print (df)
VM Powerstate Thin new
0 VIRTU1 [poweredOn] [False] 0
1 VIRTU2 [poweredOn,poweredOn] [False, False] 0
2 VIRTU3 [poweredOn,poweredOn] [False, False] 0
3 VIRTU4 [poweredOn,poweredOn] [True, True] 2
4 VIRTU5 [poweredOn,poweredOn,poweredOn] [False, True, False] 1
Or, if the values are strings, use Series.str.count:
df['new'] = df['Thin'].str.count('True')
print (df)
VM Powerstate Thin new
0 VIRTU1 [poweredOn] [False] 0
1 VIRTU2 [poweredOn,poweredOn] [False,False] 0
2 VIRTU3 [poweredOn,poweredOn] [False,False] 0
3 VIRTU4 [poweredOn,poweredOn] [True,True] 2
4 VIRTU5 [poweredOn,poweredOn,poweredOn] [False,True,False] 1
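If a single grand total is needed (the 3 from the question), just sum the per-row counts; this assumes the list-of-booleans case from the first snippet:
total = df['Thin'].apply(sum).sum()
print (total)   # 3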
In [4]: df = pd.DataFrame({'a': [True, False, True], 'b': [False, False, False],
...: 'c': [False, False, False], 'd': [False, True, False],
...: 'e': [False, False, False]})
In [5]: df
Out[5]:
a b c d e
0 True False False False False
1 False False False True False
2 True False False False False
In [6]: df[df.any()[df.any()].index]
Out[6]:
a d
0 True False
1 False True
2 True False
The code under [6] does what I want. My question, however, is: is there a better solution? That is, more concise and/or more elegant.
One direct method is using df.loc with the mask generated by df.any() as input:
df.loc[:, df.any()]
a d
0 True False
1 False True
2 True False
Another option is to index df.columns,
df[df.columns[df.any()]]
Or, df.keys():
df[df.keys()[df.any()]]
a d
0 True False
1 False True
2 True False
I have two DataFrames where each column contains True/False values. I am looking for a way to test all possible combinations and find out where "True" for each row in df1 also is "True" in the corresponding row in df2.
In reference to the data below, the logic would be something like this:
For each row, starting with column "Main1", test whether the value is True and whether the value in column "Sub1" is also True. Next, test whether the value in "Main1" is True and whether the values in columns "Sub1" and "Sub2" are also True. In this case, if all values are True, the output would be True. Then repeat for all columns and all possible combinations.
df1:
Main1 Main2 Main3
0 True False True
1 False False False
2 False True True
3 False False True
4 False True True
5 True True True
6 True False False
df2:
Sub1 Sub2 Sub3
0 False False True
1 False True False
2 True False True
3 False False False
4 True True False
5 False False False
6 True True True
The output would be similar to something like this.
Of course, I could do this manually, but it would be time-consuming and there would be room for error.
Main1Sub1 Main1Sub1Sub2 ... Main3Sub2Sub3 Main3Sub3
0 False False ... False True
1 False False ... False False
2 False False ... False True
3 False False ... False False
4 False False ... False False
5 False False ... False False
6 True True ... False False
[7 rows x 18 columns]
Any help on how to tackle this problem is appreciated!
You can use the combinations() function in itertools to extract all the possible combinations of the columns of the 2 data frames, and then use the product() function in pandas to identify the rows where all the columns in the considered combination are equal to True. I included an example below, which considers all combinations of either 2 or 3 columns.
import pandas as pd
from itertools import combinations
df1 = pd.DataFrame({"Main1": [True, False, False, False, False, True, True],
"Main2": [False, False, True, False, True, True, False],
"Main3": [True, False, True, True, True, True, False]})
df2 = pd.DataFrame({"Sub1": [False, False, True, False, True, False, True],
"Sub2": [False, True, False, False, True, False, True],
"Sub3": [True, False, True, False, False, False, True]})
df3 = df1.join(df2)
all_combinations = list(combinations(df3.columns, 2)) + \
list(combinations(df3.columns, 3))
for combination in all_combinations:
    df3["".join(list(combination))] = df3[list(combination)].product(axis=1).astype(bool)
df3.drop(labels=["Main1", "Main2", "Main3", "Sub1", "Sub2", "Sub3"], axis=1, inplace=True)
df3
Main1Main2 Main1Main3 ... Main3Sub2Sub3 Sub1Sub2Sub3
0 False True ... False False
1 False False ... False False
2 False False ... False False
3 False False ... False False
4 False False ... False False
5 True True ... False False
6 False False ... False True
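If the output should contain only the 18 Main-with-Sub combinations shown in the question (each Main column combined with one or two Sub columns, never Main with Main or Sub with Sub), a possible variant reusing df1 and df2 from above; the out name is just illustrative:
sub_combos = list(combinations(df2.columns, 1)) + list(combinations(df2.columns, 2))
out = pd.DataFrame(index=df1.index)
for main in df1.columns:
    for subs in sub_combos:
        out["".join([main, *subs])] = df1[main] & df2[list(subs)].all(axis=1)
print(out.shape)  # (7, 18)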
I want to have a random bit mask that has some specified percent of 0s. The function I devised is:
import operator
from functools import reduce

import torch

def create_mask(shape, rate):
    """
    The idea is, you take a random permutation of numbers. You then mod it by
    [number of entries in the bitmask] / [percent of 0s you want]. The number
    of zeros will be exactly the rate of zeros needed. You can clamp the
    values to get a bitmask.
    """
    mask = torch.randperm(reduce(operator.mul, shape, 1)).float().cuda()
    # Mod it by the percent to get an even dist of 0s.
    mask = torch.fmod(mask, reduce(operator.mul, shape, 1) / rate)
    # Anything not zero should be put to 1
    mask = torch.clamp(mask, 0, 1)
    return mask.view(shape)
To illustrate:
>>> x = create_mask((10, 10), 10)
>>> x
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1
0 1 1 1 1 0 1 1 1 1
0 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1 1 1
1 1 1 0 1 1 1 0 1 1
0 1 1 1 1 1 1 1 1 1
1 1 1 0 1 1 0 1 1 1
1 1 1 1 1 1 1 1 1 1
[torch.cuda.FloatTensor of size 10x10 (GPU 0)]
The main issue I have with this method is that it requires the rate to divide the number of entries in the shape. I want a function that accepts an arbitrary decimal rate and gives approximately that percentage of 0s in the bitmask. Furthermore, I am trying to find a relatively efficient way of doing so, so I would rather not move a numpy array from the CPU to the GPU. Is there an efficient way of doing this that allows for a decimal rate?
For anyone running into this, this will create a bitmask with approximately 80% zeros directly on the GPU (PyTorch 0.3):
torch.cuda.FloatTensor(10, 10).uniform_() > 0.8
Use NumPy and cudamat:
import numpy as np
import cudamat as cm
gpuMask = cm.CUDAMatrix(np.random.choice([0, 1], size=(10,10), p=[1./2, 1./2]))
where the elements of the p list are the probabilities of drawing 0 and 1 respectively.
The proper way to create a bit mask directly on the GPU with PyTorch is:
import torch
tensor = torch.rand((10, 10), device=torch.device("cuda")) < 0.9
# tensor([[ True, True, False, True, True, True, True, True, True, False],
# [ True, True, True, True, True, True, True, False, False, True],
# [ True, False, False, True, True, True, True, True, True, False],
# [ True, True, True, True, True, True, True, True, True, True],
# [ True, True, False, True, True, True, True, False, True, True],
# [ True, True, False, False, True, True, True, False, True, True],
# [ True, True, True, False, True, True, True, True, True, True],
# [ True, True, True, True, True, True, False, False, True, True],
# [ True, False, True, True, True, True, True, True, True, True],
# [ True, True, True, True, True, True, True, True, False, True]],
# device='cuda:0')
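If an (almost) exact fraction of zeros is needed for an arbitrary decimal rate rather than an approximate one, a sketch along the lines of the original randperm idea (here rate is the fraction of zeros between 0 and 1, unlike the divisor-style rate in the question, and create_exact_mask is just an illustrative name):
import torch

def create_exact_mask(shape, rate, device="cuda"):
    n = 1
    for s in shape:
        n *= s
    k = int(round(n * rate))                            # exact number of zeros to place
    mask = torch.ones(n, device=device)
    mask[torch.randperm(n, device=device)[:k]] = 0.0    # zero k random positions
    return mask.view(shape)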