Make a Pandas mask based on a column vector - python

I have a dataframe and, for each row, I would like to select the values that are above that row's given percentile.
Let's consider this dataframe:
df = pd.DataFrame({'A' : [5,6,3,4, 0,5,9], 'B' : [1,2,3, 5,7,0,1]})
A B
0 5 1
1 6 2
2 3 3
3 4 5
4 0 7
5 5 0
6 9 1
And a given vector of the 0.2 quantile (20th percentile) of each row:
rowsQuantiles = df.quantile(0.2, axis=1)
0 1.8
1 2.8
2 3.0
3 4.2
4 1.4
5 1.0
6 2.6
I would like to filter out, for each row, the values that are below the row's quantile, in order to obtain the following result:
quantileMask = df > rowsQuantiles
A B
0 True False
1 True False
2 False False
3 False True
4 False True
5 True False
6 True False
EDIT:
I really liked both approaches by @andrew_reece and @Andy Hayden, so I decided to see which one was the fastest/best-implemented:
import random
import time
import pandas as pd

N = 10000000
df = pd.DataFrame({'A': [random.random() for i in range(N)], 'B': [random.random() for i in range(N)]})
rowsQuantiles = df.quantile(0.2, axis=1)
t0 = time.time()
mask = (df.T > rowsQuantiles).T
# mask = df.apply(lambda row: row > rowsQuantiles)
print(str(time.time() - t0))
Results are pretty straightforward (after several repeated tests):
220ms for mask=(df.T>rowsQuantiles).T
65ms for mask=df.apply(lambda row: row > rowsQuantiles)
21ms for df.gt(rowsQuantiles,0), the accepted answer.

You can also do this using only gt:
df.gt(rowsQuantiles,0)
Out[288]:
A B
0 True False
1 True False
2 False False
3 False True
4 False True
5 True False
6 True False
Using add
df.add(-rowsQuantiles,0).gt(0)
Out[284]:
A B
0 True False
1 True False
2 False False
3 False True
4 False True
5 True False
6 True False
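In both calls the second positional argument is the axis: passing 0 (or axis='index') aligns rowsQuantiles with the row index rather than the columns. A small sketch of the same comparison with the axis spelled out, plus the step of keeping only the values above the row quantile:
mask = df.gt(rowsQuantiles, axis=0)   # same as df.gt(rowsQuantiles, 0)
filtered = df.where(mask)             # values above the row quantile, NaN elsewhere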

There's a transpose error with your mask, but assuming you want to replace the values with NaN, the method you're looking for is where:
In [11]: df.T > rowsQuantiles
Out[11]:
0 1 2 3 4 5 6
A True True False False False True True
B False False False True True False False
In [12]: (df.T > rowsQuantiles).T
Out[12]:
A B
0 True False
1 True False
2 False False
3 False True
4 False True
5 True False
6 True False
In [13]: df.where((df.T > rowsQuantiles).T)
Out[13]:
A B
0 5.0 NaN
1 6.0 NaN
2 NaN NaN
3 NaN 5.0
4 NaN 7.0
5 5.0 NaN
6 9.0 NaN

df.apply(lambda row: row > rowsQuantiles)
A B
0 True False
1 True False
2 False False
3 False True
4 False True
5 True False
6 True False

An alternative I could get behind is np.where:
np.where(df.values > rowsQuantiles.values[:, None], True, False)
array([[ True, False],
[ True, False],
[False, False],
[False, True],
[False, True],
[ True, False],
[ True, False]], dtype=bool)
Which returns a numpy array, if you're okay with that.
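If you need a DataFrame mask again (for example to feed into df.where), you can wrap the array back up; a minimal sketch:
mask = pd.DataFrame(df.values > rowsQuantiles.values[:, None],
                    index=df.index, columns=df.columns)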
Timings
%timeit df.T > rowsQuantiles
1 loop, best of 3: 251 ms per loop
%timeit df.where((df.T > rowsQuantiles).T)
1 loop, best of 3: 583 ms per loop
%timeit np.where(df.values > rowsQuantiles.values[:, None], True, False)
10 loops, best of 3: 136 ms per loop
%timeit df.add(-rowsQuantiles,0).gt(0)
10 loops, best of 3: 141 ms per loop
%timeit df.gt(rowsQuantiles,0)
10 loops, best of 3: 25.4 ms per loop
%timeit df.apply(lambda row: row > rowsQuantiles)
10 loops, best of 3: 60.6 ms per loop

Related

Condition is true to start counting, until the next row is true to restart counting

expected result table
bool count
0 FALSE
1 FALSE
2 TRUE 0
3 FALSE 1
4 FALSE 2
5 FALSE 3
6 TRUE 0
7 FALSE 1
8 TRUE 0
9 TRUE 0
How can I calculate the values of the 'count' column?
Here you go:
import numpy as np
import pandas as pd

# create bool dataframe
df = pd.DataFrame(dict(bool_=[0, 0, 1, 0, 0, 1, 1, 0, 0, 0]), dtype=bool)
df.index = list("abcdefghij")
# create a new Series of unique integers to associate a group with the rows
# between True values
ix = pd.Series(range(df.shape[0])).where(df.bool_.values, np.nan).ffill().values
# if the first rows are False, they will be NaNs and shouldn't be
# counted so only perform groupby and cumcount() for what is notna
notna = pd.notna(ix)
df["count"] = df[notna].groupby(ix[notna]).cumcount()
>>> df
bool_ count
a False NaN
b False NaN
c True 0.0
d False 1.0
e False 2.0
f True 0.0
g True 0.0
h False 1.0
i False 2.0
j False 3.0
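As a side note, if you'd rather keep the count as integers with missing values instead of floats, one possible variation (assuming a pandas version with the nullable Int64 dtype):
df["count"] = (df[notna].groupby(ix[notna]).cumcount()
                        .astype("Int64")
                        .reindex(df.index))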
Use a GroupBy.cumcount and mask with where:
g = df['bool'].cumsum()
df['count'] = df['bool'].groupby(g).cumcount().where(g.gt(0))
Alternative:
g = df['bool'].cumsum()
df['count'] = (df['bool'].groupby(g).cumcount()
                         .where(df['bool'].cummax()))
Output:
bool count
0 False NaN
1 False NaN
2 True 0.0
3 False 1.0
4 False 2.0
5 True 0.0
6 True 0.0
7 False 1.0
8 False 2.0
9 False 3.0
You can try grouping by the cumsum of the bool column, then transforming with a custom function that checks whether the first element in each group is True:
df['m'] = df['bool'].cumsum()
df['out'] = (df.groupby(df['bool'].cumsum())['bool']
               .transform(lambda col: range(len(col)) if col.iloc[0] else [pd.NA]*len(col)))
print(df)
bool count m out
0 False NaN 0 <NA>
1 False NaN 0 <NA>
2 True 0.0 1 0
3 False 1.0 1 1
4 False 2.0 1 2
5 False 3.0 1 3
6 True 0.0 2 0
7 False 1.0 2 1
8 True 0.0 3 0
9 True 0.0 4 0
I think your question is not clear; we need a little more context and objectives to work with here.
Let's assume that you have a dataframe of Boolean values [True, False], and you wish to count how many are "True" and how many are "False":
import pandas as pd
import random
## Randomly generating Boolean values to populate a dataframe
choices = [ 'True', 'False' ]
df = pd.DataFrame(index = range(10), columns = ['boolean'])
df['boolean'] = df['boolean'].apply(lambda x: random.choice(choices))
Randomly generated data
boolean
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 True
8 False
9 False
## Reporting the count of True and False values
results = df.groupby('boolean').size()
print(results)
Results
boolean
False 8
True 2
If you want to obtain the count without the usual pandas idioms, you can try this plain loop:
result = []
count = np.nan
for i in df['bool']:
    if i == True:
        # a True row restarts the count at 0
        count = 0
        result.append(count)
    elif not np.isnan(count):
        # a False row after a True has been seen increments the count
        count += 1
        result.append(count)
    else:
        # a False row before any True has been seen: no count yet
        result.append(np.nan)

result
Out[4]: [nan, nan, 0, 1, 2, 3, 0, 1, 0, 0]
df['count'] = result
If you mean the sum of all the elements in 'count', then you can do it this way:
Count_Total = df['count'].sum()

(Python) Selecting rows containing a string in ANY column?

I am trying to iterate through a dataframe and return the rows that contain a string "x" in any column.
This is what I have been trying
for col in df:
    rows = df[df[col].str.contains(searchTerm, case=False, na=False)]
However, it only returns a couple of rows, even when I search for a string that I know exists in more rows than that.
How do I make sure it is searching every row of every column?
Edit: My end goal is to get the row and column of the cell containing the string searchTerm
Welcome!
I agree with all the comments. It's generally best practice to find a way to accomplish what you want in Pandas/Numpy without iterating over rows/columns.
If the objective is to "find rows where any column contains the value 'x'", life is a lot easier than you think.
Below is some data:
import pandas as pd
df = pd.DataFrame({
'a': range(10),
'b': ['x', 'b', 'c', 'd', 'x', 'f', 'g', 'h', 'i', 'x'],
'c': [False, False, True, True, True, False, False, True, True, True],
'd': [1, 'x', 3, 4, 5, 6, 7, 8, 'x', 10]
})
print(df)
a b c d
0 0 x False 1
1 1 b False x
2 2 c True 3
3 3 d True 4
4 4 x True 5
5 5 f False 6
6 6 g False 7
7 7 h True 8
8 8 i True x
9 9 x True 10
So clearly rows 0, 1, 4, 8 and 9 should be included.
If we just do df == 'x', pandas broadcasts the comparison across the whole dataframe:
df == 'x'
a b c d
0 False True False False
1 False False False True
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False False False False
7 False False False False
8 False False False True
9 False True False False
But pandas also has the handy .any method, to check for True along a given axis. So if we want to check across all columns, we want axis=1:
rows = (df == 'x').any(axis=1)
print(rows)
0 True
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 True
Note that if you want your solution to be truly case insensitive, like what you're doing with the .str method, you might need something more like:
rows = (df.applymap(lambda x: str(x).lower() == 'x')).any(axis=1)
The correct rows are flagged without any looping. And you get a series back that can be used for indexing the original dataframe:
df.loc[rows]
a b c d
0 0 x False 1
1 1 b False x
4 4 x True 5
8 8 i True x
9 9 x True 10
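Since the edit asks for the row and column of every matching cell, not just the rows, one possible sketch building on the same broadcasted idea (assumption: everything is compared as strings, case-insensitively, with a hypothetical searchTerm):
searchTerm = 'x'  # hypothetical search term

# cell-wise boolean mask over the whole frame
mask = df.apply(lambda col: col.astype(str).str.contains(searchTerm, case=False, na=False))

# stack() gives a Series indexed by (row, column); keep only the True entries
hits = mask.stack()
locations = hits[hits].index.tolist()
print(locations)  # [(0, 'b'), (1, 'd'), (4, 'b'), (8, 'd'), (9, 'b')] for the example df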

Is there a Pandas equivalent to tidyr's uncount?

Let's assume we have a table with groupings of variable and their frequencies:
In R:
> df
# A tibble: 3 x 3
Cough Fever cases
<lgl> <lgl> <dbl>
1 TRUE FALSE 1
2 FALSE FALSE 2
3 TRUE TRUE 3
Then we could use tidyr::uncount to get a dataframe with the individual cases:
> uncount(df, cases)
# A tibble: 6 x 2
Cough Fever
<lgl> <lgl>
1 TRUE FALSE
2 FALSE FALSE
3 FALSE FALSE
4 TRUE TRUE
5 TRUE TRUE
6 TRUE TRUE
Is there an equivalent in Python/Pandas?
You can take the row index and repeat it according to the counts; for example, in R you can do:
df[rep(1:nrow(df),df$cases),]
First, to get data like yours:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1,1,2,2,2,2], 'y': [0,1,0,1,1,1]})
counts = df.groupby(['x','y']).size().reset_index()
counts.columns = ['x','y','n']
x y n
0 1 0 1
1 1 1 1
2 2 0 1
3 2 1 3
Then:
counts.iloc[np.repeat(np.arange(len(counts)),counts.n),:2]
x y
0 1 0
1 1 1
2 2 0
3 2 1
3 2 1
3 2 1
I haven't found an equivalent function in Python, but this works:
df2 = df.pop('cases')
df = pd.DataFrame(df.values.repeat(df2, axis=0), columns=df.columns)
pop removes the 'cases' column from df and returns it as df2; a new DataFrame is then created with the rows of the original repeated according to the counts in df2. Please let me know if it helps.
In addition to the other solutions, you could combine take, repeat and drop:
import pandas as pd
df = pd.DataFrame({'Cough': [True, False, True],
'Fever': [False, False, True],
'cases': [1, 2, 3]})
df.take(df.index.repeat(df.cases)).drop(columns="cases")
Cough Fever
0 True False
1 False False
1 False False
2 True True
2 True True
2 True True
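If you prefer a fresh 0..n-1 index instead of the repeated original labels, a small follow-up sketch (assuming you don't need to trace rows back to their source row):
(df.take(df.index.repeat(df.cases))
   .drop(columns="cases")
   .reset_index(drop=True))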
It's as easy as using tidyr's API, via the datar package:
>>> from datar.all import f, tribble, uncount
>>> df = tribble(
... f.Cough, f.Fever, f.cases,
... True, False, 1,
... False, False, 2,
... True, True, 3
... )
>>> uncount(df, f.cases)
Cough Fever
<bool> <bool>
0 True False
1 False False
2 False False
3 True True
4 True True
5 True True
I am the author of the package. Feel free to submit issues if you have any questions.

All rows within a given column must match, for all columns

I have a Pandas DataFrame of data in which all rows within a given column must match:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
'B': [2,2,2,2,2,2,2,2,2,2],
'C': [3,3,3,3,3,3,3,3,3,3],
'D': [4,4,4,4,4,4,4,4,4,4],
'E': [5,5,5,5,5,5,5,5,5,5]})
In [10]: df
Out[10]:
A B C D E
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
I would like a quick way to know if there is any variance anywhere in the DataFrame. At this point, I don't need to know which values have varied, since I will be going in to handle those later. I just need a quick way to know if the DataFrame needs further attention or if I can ignore it and move on to the next one.
I can check any given column using
(df.loc[:,'A'] != df.loc[0,'A']).any()
but my Pandas knowledge limits me to iterating through the columns (I understand iteration is frowned upon in Pandas) to compare all of them:
A B C D E
0 1 2 3 4 5
1 1 2 9 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
for col in df.columns:
    if (df.loc[:, col] != df.loc[0, col]).any():
        print("Found a fail in col %s" % col)
        break
Out: Found a fail in col C
Is there an elegant way to return a boolean if any row within any column of a dataframe does not match all the values in the column... possibly without iteration?
Given your example dataframe:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
'B': [2,2,2,2,2,2,2,2,2,2],
'C': [3,3,3,3,3,3,3,3,3,3],
'D': [4,4,4,4,4,4,4,4,4,4],
'E': [5,5,5,5,5,5,5,5,5,5]})
You can use the following:
df.apply(pd.Series.nunique) > 1
Which gives you:
A False
B False
C False
D False
E False
dtype: bool
If we then force a couple of errors:
df.loc[3, 'C'] = 0
df.loc[5, 'B'] = 20
You then get:
A False
B True
C True
D False
E False
dtype: bool
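If you just need a single yes/no flag for the whole DataFrame (the "does this need further attention" check), you can collapse either result with any(); a small sketch:
# single flag: does any column contain more than one distinct value?
needs_attention = (df.nunique() > 1).any()
print(needs_attention)  # True once the two errors above have been introduced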
You can compare the entire DataFrame to the first row like this:
In [11]: df.eq(df.iloc[0], axis='columns')
Out[11]:
A B C D E
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True True
5 True True True True True
6 True True True True True
7 True True True True True
8 True True True True True
9 True True True True True
then test if all values are true:
In [13]: df.eq(df.iloc[0], axis='columns').all()
Out[13]:
A True
B True
C True
D True
E True
dtype: bool
In [14]: df.eq(df.iloc[0], axis='columns').all().all()
Out[14]: True
You can use apply to loop through columns and check if all the elements in the column are the same:
df.apply(lambda col: (col != col[0]).any())
# A False
# B False
# C False
# D False
# E False
# dtype: bool

How to nicely measure runs of same-data in a pandas dataframe

I want to give a function an arbitrary dataframe, dateindex, and column, and ask it to return how many continuous preceding rows (including itself) had the same value. I've been able to keep most of my pandas code vectorized, but I'm struggling to think how I can do this cleanly here.
Below is a small toy dataset and examples of what outputs I'd want from the function.
bar foo
2016-06-01 False True
2016-06-02 True False
2016-06-03 True True
2016-06-06 True False
2016-06-07 False False
2016-06-08 True False
2016-06-09 True False
2016-06-10 False True
2016-06-13 False True
2016-06-14 True True
import pandas as pd
rng = pd.bdate_range('6/1/2016', periods=10)
cola = [True, False, True, False, False, False,False, True, True, True]
colb = [False, True, True, True, False, True, True, False, False, True]
d = {'foo':pd.Series(cola, index =rng), 'bar':pd.Series(colb, index=rng)}
df = pd.DataFrame(d)
"""
consec('foo', '2016-06-09') => 4  # it's the fourth continuous False in a row
consec('foo', '2016-06-08') => 3  # it's the third continuous False in a row
consec('bar', '2016-06-02') => 1  # it's the first continuous True in a row
consec('foo', '2016-06-14') => 3  # it's the third continuous True in a row
"""
==================
I ended up using the itertools-answer below, with a small change, because it got me exactly what I wanted (slightly more involved than my original question spec). Thanks for the many suggestions.
import itertools
import pandas as pd

rng = pd.bdate_range('6/1/2016', periods=100)
cola = [True, False, True, False, False, False, False, True, True, True]*10
colb = [False, True, True, True, False, True, True, False, False, True]*10
d = {'foo': pd.Series(cola, index=rng), 'bar': pd.Series(colb, index=rng)}
df2 = pd.DataFrame(d)

def make_new_col_of_consec(df, col_list):
    for col_name in col_list:
        lst = []
        for state, repeat_values in itertools.groupby(df[col_name]):
            if state == True:
                lst.extend([i + 1 for i, v in enumerate(repeat_values)])
            elif state == False:
                lst.extend([0 for i, v in enumerate(repeat_values)])
        df[col_name + "_consec"] = lst
    return df

print(make_new_col_of_consec(df2, ["bar", "foo"]))
The output is as follows:
bar foo bar_consec foo_consec
2016-06-01 False True 0 1
2016-06-02 True False 1 0
2016-06-03 True True 2 1
2016-06-06 True False 3 0
2016-06-07 False False 0 0
2016-06-08 True False 1 0
2016-06-09 True False 2 0
2016-06-10 False True 0 1
2016-06-13 False True 0 2
2016-06-14 True True 1 3
2016-06-15 False True 0 4
2016-06-16 True False 1 0
2016-06-17 True True 2 1
2016-06-20 True False 3 0
2016-06-21 False False 0 0
2016-06-22 True False 1 0
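For comparison, a vectorized sketch that should produce the same "_consec" columns without an explicit Python loop (assumption: the goal is the 1-based position within each run of True values, with 0 for False rows, as in the table above):
for col_name in ["bar", "foo"]:
    runs = (df2[col_name] != df2[col_name].shift()).cumsum()  # label each run of equal values
    within_run = df2.groupby(runs).cumcount() + 1             # 1-based position inside the run
    df2[col_name + "_consec"] = within_run.where(df2[col_name], 0)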
Here's an alternative method which creates a new column with the relevant consecutive count for each row. I tested this when the dataframe has 10000 rows and it took 24 ms. It uses groupby from itertools. It takes advantage of the fact that a new group starts whenever the key value, in this case foo or bar, changes, so we can just use the enumeration index within each group.
rng = pd.bdate_range('6/1/2016', periods=10000)
cola = [True, False, True, False, False, False,False, True, True, True]*1000
colb = [False, True, True, True, False, True, True, False, False, True]*1000
d = {'foo':pd.Series(cola, index =rng), 'bar':pd.Series(colb, index=rng)}
df1 = pd.DataFrame(d)
import itertools

def make_new_col_of_consec(df, col_list):
    for col_name in col_list:
        lst = []
        for state, repeat_values in itertools.groupby(df[col_name]):
            lst.extend([i + 1 for i, v in enumerate(repeat_values)])
        df[col_name + "_consec"] = lst
    return df

print(make_new_col_of_consec(df1, ["bar", "foo"]))
Output:
bar foo bar_consec foo_consec
2016-06-01 False True 1 1
2016-06-02 True False 1 1
2016-06-03 True True 2 1
2016-06-06 True False 3 1
2016-06-07 False False 1 2
2016-06-08 True False 1 3
...
[10000 rows x 4 columns]
10 loops, best of 3: 24.1 ms per loop
try this:
In [135]: %paste
def consec(df, col, d):
    return (df[:d].groupby((df[col] != df[col].shift())
                           .cumsum())[col]
                  .transform('size').tail(1)[0])
## -- End pasted text --
In [137]: consec(df, 'foo', '2016-06-09')
Out[137]: 4
In [138]: consec(df, 'foo', '2016-06-08')
Out[138]: 3
In [139]: consec(df, 'bar', '2016-06-02')
Out[139]: 1
In [140]: consec(df, 'bar', '2016-06-14')
Out[140]: 1
Explanation:
In [141]: (df.foo != df.foo.shift()).cumsum()
Out[141]:
2016-06-01 1
2016-06-02 2
2016-06-03 3
2016-06-06 4
2016-06-07 4
2016-06-08 4
2016-06-09 4
2016-06-10 5
2016-06-13 5
2016-06-14 5
Freq: B, Name: foo, dtype: int32
In [142]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size')
Out[142]:
2016-06-01 1
2016-06-02 1
2016-06-03 1
2016-06-06 4
2016-06-07 4
2016-06-08 4
2016-06-09 4
2016-06-10 3
2016-06-13 3
2016-06-14 3
Freq: B, dtype: int64
In [143]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size').tail(1)
Out[143]:
2016-06-14 3
Freq: B, dtype: int64
You can use:
# reorder (reverse) the index of df
df = df[::-1]

def consec(col, date):
    # select df from the given date backwards
    df1 = df.loc[date:, :]
    # keep only the first group (== 1)
    colconsec = (df1[col] != df1[col].shift()).cumsum() == 1
    return 'Value is ' + str(df1[col].iloc[0]) + ', Len is: ' + str(len(df1[colconsec]))
print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08'))
print (consec('bar', '2016-06-02'))
print (consec('foo', '2016-06-14'))
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3
Another solution, working on the original (unreversed) df: find the last value of the Series colconsec with iat and use it to create the mask:
def consec(col, date):
    df1 = df.loc[:date, :]
    colconsec = (df1[col] != df1[col].shift()).cumsum()
    mask = colconsec == colconsec.iat[-1]
    return 'Value is ' + str(df1[col].iat[-1]) + ', Len is: ' + str(len(df1[mask]))
print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08'))
print (consec('bar', '2016-06-02'))
print (consec('foo', '2016-06-14'))
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3
