(Python) Selecting rows containing a string in ANY column? - python

I am trying to iterate through a dataframe and return the rows that contain a string "x" in any column.
This is what I have been trying
for col in df:
    rows = df[df[col].str.contains(searchTerm, case = False, na = False)]
However, it only returns up to 2 rows if I search for a string I know exists there and in more rows.
How do I make sure it is searching every row of every column?
Edit: My end goal is to get the row and column of the cell containing the string searchTerm

Welcome!
Agree with all the comments. It's generally best practice to find a way to accomplish what you want in Pandas/Numpy without iterating over rows/columns.
If the objective is to "find rows where any column contains the value 'x'", life is a lot easier than you think.
Below is some data:
import pandas as pd
df = pd.DataFrame({
'a': range(10),
'b': ['x', 'b', 'c', 'd', 'x', 'f', 'g', 'h', 'i', 'x'],
'c': [False, False, True, True, True, False, False, True, True, True],
'd': [1, 'x', 3, 4, 5, 6, 7, 8, 'x', 10]
})
print(df)
a b c d
0 0 x False 1
1 1 b False x
2 2 c True 3
3 3 d True 4
4 4 x True 5
5 5 f False 6
6 6 g False 7
7 7 h True 8
8 8 i True x
9 9 x True 10
So clearly rows 0, 1, 4, 8 and 9 should be included.
If we just do df == 'x', pandas broadcasts the comparison across the whole dataframe:
df == 'x'
a b c d
0 False True False False
1 False False False True
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False False False False
7 False False False False
8 False False False True
9 False True False False
But pandas also has the handy .any method, which checks for True along a given axis. To check across the columns of each row, we want axis 1:
rows = (df == 'x').any(axis=1)
print(rows)
0 True
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 True
Note that if you want the match to be case-insensitive, like the case=False you pass to .str.contains, you need something more like this (on pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map):
rows = (df.applymap(lambda x: str(x).lower() == 'x')).any(axis=1)
The correct rows are flagged without any looping. And you get a series back that can be used for indexing the original dataframe:
df.loc[rows]
a b c d
0 0 x False 1
1 1 b False x
4 4 x True 5
8 8 i True x
9 9 x True 10
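For the end goal in the question's edit (the row and column of each matching cell), one possible sketch is to build a boolean mask over the whole frame and read off the positions of its True cells. It assumes a case-insensitive substring match is wanted, as with .str.contains(..., case=False); search_term, mask and locations are illustrative names, not from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': range(10),
    'b': ['x', 'b', 'c', 'd', 'x', 'f', 'g', 'h', 'i', 'x'],
    'c': [False, False, True, True, True, False, False, True, True, True],
    'd': [1, 'x', 3, 4, 5, 6, 7, 8, 'x', 10],
})
search_term = 'X'  # hypothetical search string

# Cast every cell to str so .str.contains also works on non-string columns.
mask = df.astype(str).apply(
    lambda col: col.str.contains(search_term, case=False, regex=False)
)

# Rows with at least one matching cell.
matching_rows = df.loc[mask.any(axis=1)]

# (row label, column label) of every matching cell, in row-major order.
row_pos, col_pos = np.nonzero(mask.to_numpy(dtype=bool))
locations = list(zip(df.index[row_pos], df.columns[col_pos]))
print(locations)
```

Because every cell is cast to a string first, a search term such as 'true' would also match the boolean column; restrict the mask to the string columns if that is not wanted.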

Related

how to change suffix on new df column of df when merging iteratively

I have a temp df and a dflst.
temp has as its columns the unique column names from the dataframes in dflst.
dflst has a dynamic length; my problem arises when len(dflst)>=4.
All dfs (temp and all the ones in dflst) have columns with true/false values and a p column with numbers.
code to recreate data:
import itertools
import numpy as np
import pandas as pd
# making temp df
var_cols=['a', 'b', 'c', 'd']
temp = pd.DataFrame(list(itertools.product([False, True], repeat=len(var_cols))), columns=var_cols)
# making dflst
df0=pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'b']))), columns=['a', 'b'])
df0['p']= np.random.randint(1, 5, df0.shape[0])
df1=pd.DataFrame(list(itertools.product([False, True], repeat=len(['c', 'd']))), columns=['c', 'd'])
df1['p']= np.random.randint(1, 5, df1.shape[0])
df2=pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'c', ]))), columns=['a', 'c'])
df2['p']= np.random.randint(1, 5, df2.shape[0])
df3=pd.DataFrame(list(itertools.product([False, True], repeat=len(['d']))), columns=['d'])
df3['p']= np.random.randint(1, 5, df3.shape[0])
dflst=[df0, df1, df2, df3]
I want to merge the dfs in dflst into temp, so that the 'p' values from each df end up in the rows of temp whose values are compatible with that df.
I am currently doing it with pd.merge as follows:
for df in dflst:
    temp = temp.merge(df, on=list(df)[:-1], how='right')
but this results in a df that has the same names for different columns when dflst has 4 or more dfs. I understand that this is due to the suffixes that merge adds, but it creates problems with column indexing.
How can I get unique names for the new columns added to temp iteratively?
I don't fully understand what you want but IIUC:
for i, df in enumerate(dflst):
    temp = temp.merge(df.rename(columns={'p': f'p{i}'}),
                      on=df.columns[:-1].tolist(), how='right')
print(temp)
# Output:
a b c d p0 p1 p2 p3
0 False False False False 4 2 2 1
1 False True False False 3 2 2 1
2 False False True False 4 3 4 1
3 False True True False 3 3 4 1
4 True False False False 3 2 2 1
5 True True False False 3 2 2 1
6 True False True False 3 3 1 1
7 True True True False 3 3 1 1
8 False False False True 4 4 2 3
9 False True False True 3 4 2 3
10 False False True True 4 1 4 3
11 False True True True 3 1 4 3
12 True False False True 3 4 2 3
13 True True False True 3 4 2 3
14 True False True True 3 1 1 3
15 True True True True 3 1 1 3
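The same idea can be written without relying on p being the last column of each frame. This is only a sketch under assumptions: truth_table and merge_with_unique_p are hypothetical helper names, how='left' is chosen so that every row of temp is kept (the answer above uses how='right'), and a seeded generator stands in for np.random.randint so the run is repeatable.

```python
import itertools

import numpy as np
import pandas as pd


def truth_table(cols):
    """All True/False combinations for the given column names."""
    combos = list(itertools.product([False, True], repeat=len(cols)))
    return pd.DataFrame(combos, columns=cols)


def merge_with_unique_p(temp, dflst):
    """Merge each frame's 'p' column into temp as p0, p1, ... (hypothetical helper)."""
    for i, df in enumerate(dflst):
        keys = [c for c in df.columns if c != 'p']  # join on everything except 'p'
        temp = temp.merge(df.rename(columns={'p': f'p{i}'}), on=keys, how='left')
    return temp


rng = np.random.default_rng(0)
dflst = []
for cols in (['a', 'b'], ['c', 'd'], ['a', 'c'], ['d']):
    df = truth_table(cols)
    df['p'] = rng.integers(1, 5, len(df))
    dflst.append(df)

result = merge_with_unique_p(truth_table(['a', 'b', 'c', 'd']), dflst)
print(result.columns.tolist())
```

Renaming before the merge means the suffixes argument never comes into play, so the column names stay unique however long dflst gets.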

Create column indicating historical existence of specific value based on other column

Suppose I have df below:
df = pd.DataFrame({
'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'B': [False, True, False, False, True, False, False, True]
})
df is already sorted by A (obviously) and by time (descending). So for each group defined by A, the values in B are sorted by time in descending order. What I want to do is add a column C which, for each group, is True if there is a True value in B in the past. The result would look like:
A B C
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
I suspect I need to use groupby() and idxmax() somehow but haven't been able to make it work. Any ideas?
idxmax is the way to go, combined with transform:
df['New']=df.index<df.iloc[::-1].groupby('A').B.transform('idxmax').sort_index()
df
A B New
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
If a group can be all False, also check that the group contains any True:
s1=df.index<df.iloc[::-1].groupby('A').B.transform('idxmax').sort_index()
s2=df.groupby('A').B.transform('any')
df['New']=s1&s2
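IIUC, here's one way (group_keys=False keeps the original index on the result in pandas 2.0 and later, so the assignment below still aligns):
rev_cs = df[::-1].groupby('A', group_keys=False).B.apply(lambda x: x.cumsum().shift(fill_value=0.).gt(0))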
df['C'] = rev_cs[::-1]
print(df)
A B C
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
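A variant of the same cumulative-sum idea that avoids both apply and idxmax, sketched under the assumption that rows within each group are already sorted newest-first as in the question. It counts the True values seen so far while walking each group from the bottom up, excluding the current row; rev, flags and seen_before are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'B': [False, True, False, False, True, False, False, True],
})

# Walk the frame bottom-up so that "earlier in time" comes first.
rev = df.iloc[::-1]
flags = rev['B'].astype(int)

# Inclusive running count of True per group, minus the row's own value,
# gives the number of True values strictly in the past.
seen_before = flags.groupby(rev['A']).cumsum() - flags

# Assignment aligns on the index, so no explicit re-reversal is needed.
df['C'] = seen_before > 0
print(df)
```

A group with no True at all gets a running count of zero everywhere, so it comes out all False without a separate check.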

What happened to python's ~ when working with boolean?

In a pandas DataFrame, I have a series of boolean values. In order to filter to rows where the boolean is True, I can use: df[df.column_x]
I thought in order to filter to only rows where the column is False, I could use: df[~df.column_x]. I feel like I have done this before, and have seen it as the accepted answer.
However, this fails because ~df.column_x converts the values to integers. See below.
import pandas as pd  # version 0.24.2
a = pd.Series(['a', 'a', 'a', 'a', 'b', 'a', 'b', 'b', 'b', 'b'])
b = pd.Series([True, True, True, True, True, False, False, False, False, False], dtype=bool)
c = pd.DataFrame(data=[a, b]).T
c.columns = ['Classification', 'Boolean']
print(~c.Boolean)
0 -2
1 -2
2 -2
3 -2
4 -2
5 -1
6 -1
7 -1
8 -1
9 -1
Name: Boolean, dtype: object
print(~b)
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
dtype: bool
Basically, I can use c[~b], but not c[~c.Boolean]
Am I just dreaming that this used to work?
Ah, this happens because you created c with the DataFrame constructor and then transposed it with .T.
First, let us look at what we have before the transpose:
pd.DataFrame([a, b])
Out[610]:
0 1 2 3 4 5 6 7 8 9
0 a a a a b a b b b b
1 True True True True True False False False False False
Pandas keeps a single dtype per column, and a column holding mixed types is stored as object.
After the transpose, these are the dtypes of the columns in your c:
c.dtypes
Out[608]:
Classification object
Boolean object
The Boolean column became object dtype, which is why you get unexpected output from ~c.Boolean.
How to fix it? Use concat:
c=pd.concat([a,b],axis=1)
c.columns = ['Classification', 'Boolean']
~c.Boolean
Out[616]:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
Name: Boolean, dtype: bool
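Two other ways to end up with a real bool column, shown as a small sketch: build the frame from a dict so each Series keeps its dtype, or cast the existing object column back with astype(bool). The name fixed is illustrative, and the cast assumes the column holds only True/False with no missing values.

```python
import pandas as pd

a = pd.Series(['a', 'a', 'a', 'a', 'b', 'a', 'b', 'b', 'b', 'b'])
b = pd.Series([True] * 5 + [False] * 5, dtype=bool)

# Option 1: a dict of Series keeps each column's dtype.
fixed = pd.DataFrame({'Classification': a, 'Boolean': b})
print(fixed.dtypes)

# Option 2: repair the transposed frame by casting the object column back to bool.
c = pd.DataFrame(data=[a, b]).T
c.columns = ['Classification', 'Boolean']
c['Boolean'] = c['Boolean'].astype(bool)

# With a bool dtype, ~ is a logical NOT again.
print(c[~c['Boolean']])
```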

Condensing pandas dataframe by dropping missing elements

Problem
I have a dataframe that looks like this:
Key Var ID_1 Var_1 ID_2 Var_2 ID_3 Var_3
1 True 1.0 True NaN NaN 5.0 True
2 True NaN NaN 4.0 False 7.0 True
3 False 2.0 False 5.0 True NaN NaN
Each row has exactly 2 non-null sets of data (ID/Var), and the remaining third is guaranteed to be null. What I want to do is "condense" the dataframe by removing the missing elements.
Desired Output
Key Var First_ID First_Var Second_ID Second_Var
1 True 1 True 5 True
2 True 4 False 7 True
3 False 2 False 5 True
The ordering is not important, so long as the ID/Var pairs are maintained.
Current Solution
Below is a working solution that I have:
import pandas as pd
import numpy as np
data = pd.DataFrame({'Key': [1, 2, 3], 'Var': [True, True, False], 'ID_1':[1, np.nan, 2],
'Var_1': [True, np.nan, False], 'ID_2': [np.nan, 4, 5], 'Var_2': [np.nan, False, True],
'ID_3': [5, 7, np.nan], 'Var_3': [True, True, np.nan]})
sorted_columns = ['Key', 'Var', 'ID_1', 'Var_1', 'ID_2', 'Var_2', 'ID_3', 'Var_3']
data = data[sorted_columns]
output = np.empty(shape=[data.shape[0], 6], dtype=str)
for i, *row in data.itertuples():
    output[i] = [element for element in row if np.isfinite(element)]
print(output)
[['1' 'T' '1' 'T' '5' 'T']
['2' 'T' '4' 'F' '7' 'T']
['3' 'F' '2' 'F' '5' 'T']]
This is acceptable, but not ideal. I can live with not having the column names, but my big issue is having to cast the data inside the array into a string in order to avoid my booleans being converted to numeric.
Are there other solutions that do a better job at preserving the data? Bonus points if the result is a pandas dataframe.
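One simple solution is to push the NaNs to the right of each row and then drop the columns that contain NaN. Newer pandas versions need the axis passed by keyword, and result_type='broadcast' keeps the result a DataFrame with the original columns:
ndf = data.apply(lambda x: sorted(x, key=pd.isnull), axis=1, result_type='broadcast').dropna(axis=1)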
Output:
Key Var ID_1 Var_1 ID_2 Var_2
0 1 True 1 True 5 True
1 2 True 4 False 7 True
2 3 False 2 False 5 True
Hope it helps.
A NumPy solution from Divakar, for roughly a 10x speedup:
def mask_app(a):
    # Start from an all-NaN array, then copy each row's non-NaN values to its left side.
    out = np.full(a.shape, np.nan, dtype=a.dtype)
    mask = ~np.isnan(a.astype(float))
    out[np.sort(mask, 1)[:, ::-1]] = a[mask]
    return out
ndf = pd.DataFrame(mask_app(data.values), columns=data.columns).dropna(axis=1)
Key Var ID_1 Var_1 ID_2 Var_2
0 1 True 1 True 5 True
1 2 True 4 False 7 True
2 3 False 2 False 5 True
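If the column names from the desired output matter, a plain per-row version is easy to read, though it will be slower than the NumPy approach on large frames. This is a sketch that relies on the question's guarantee that every row has exactly six non-null values; condensed and the new column names are taken from the desired output, not from either answer.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Key': [1, 2, 3], 'Var': [True, True, False],
    'ID_1': [1, np.nan, 2], 'Var_1': [True, np.nan, False],
    'ID_2': [np.nan, 4, 5], 'Var_2': [np.nan, False, True],
    'ID_3': [5, 7, np.nan], 'Var_3': [True, True, np.nan],
})

# Keep each row's non-null values in their original left-to-right order,
# so every ID stays next to its Var.
condensed = pd.DataFrame(
    [row.dropna().tolist() for _, row in data.iterrows()],
    columns=['Key', 'Var', 'First_ID', 'First_Var', 'Second_ID', 'Second_Var'],
    index=data.index,
)

# IDs were floats only because of the NaNs; cast them back to integers.
condensed = condensed.astype({'First_ID': int, 'Second_ID': int})
print(condensed)
print(condensed.dtypes)
```

Building the frame from a list of rows lets pandas infer a dtype per column, so the booleans stay booleans instead of being cast to strings.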

All rows within a given column must match, for all columns

I have a Pandas DataFrame of data in which all rows within a given column must match:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
'B': [2,2,2,2,2,2,2,2,2,2],
'C': [3,3,3,3,3,3,3,3,3,3],
'D': [4,4,4,4,4,4,4,4,4,4],
'E': [5,5,5,5,5,5,5,5,5,5]})
In [10]: df
Out[10]:
A B C D E
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
I would like a quick way to know if there is any variance anywhere in the DataFrame. At this point, I don't need to know which values have varied, since I will be going in to handle those later. I just need a quick way to know if the DataFrame needs further attention or if I can ignore it and move on to the next one.
I can check any given column using
(df.loc[:,'A'] != df.loc[0,'A']).any()
but my Pandas knowledge limits me to iterating through the columns (I understand iteration is frowned upon in Pandas) to compare all of them:
A B C D E
0 1 2 3 4 5
1 1 2 9 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
for col in df.columns:
    if (df.loc[:,col] != df.loc[0,col]).any():
        print("Found a fail in col %s" % col)
        break
Out: Found a fail in col C
Is there an elegant way to return a boolean if any row within any column of a dataframe does not match all the values in the column... possibly without iteration?
Given your example dataframe:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
'B': [2,2,2,2,2,2,2,2,2,2],
'C': [3,3,3,3,3,3,3,3,3,3],
'D': [4,4,4,4,4,4,4,4,4,4],
'E': [5,5,5,5,5,5,5,5,5,5]})
You can use the following:
df.apply(pd.Series.nunique) > 1
Which gives you:
A False
B False
C False
D False
E False
dtype: bool
If we then force a couple of errors:
df.loc[3, 'C'] = 0
df.loc[5, 'B'] = 20
You then get:
A False
B True
C True
D False
E False
dtype: bool
You can compare the entire DataFrame to the first row like this:
In [11]: df.eq(df.iloc[0], axis='columns')
Out[11]:
A B C D E
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True True
5 True True True True True
6 True True True True True
7 True True True True True
8 True True True True True
9 True True True True True
then test if all values are true:
In [13]: df.eq(df.iloc[0], axis='columns').all()
Out[13]:
A True
B True
C True
D True
E True
dtype: bool
In [14]: df.eq(df.iloc[0], axis='columns').all().all()
Out[14]: True
You can use apply to loop through columns and check if all the elements in the column are the same:
df.apply(lambda col: (col != col[0]).any())
# A False
# B False
# C False
# D False
# E False
# dtype: bool
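To collapse any of these per-column checks into the single True/False the question asks for, one option is to compare against the first row and reduce twice. A minimal sketch; has_variation is a hypothetical helper name, and it assumes the frame is non-empty and free of NaN (NaN never compares equal to itself, so it would be reported as a variation).

```python
import pandas as pd


def has_variation(df):
    """Return True if any column holds a value different from its first row."""
    return bool(df.ne(df.iloc[0], axis='columns').any().any())


df = pd.DataFrame({'A': [1] * 10, 'B': [2] * 10, 'C': [3] * 10})
print(has_variation(df))   # False

df.loc[1, 'C'] = 9
print(has_variation(df))   # True
```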
