In a pandas DataFrame, I have a series of boolean values. In order to filter to rows where the boolean is True, I can use: df[df.column_x]
I thought in order to filter to only rows where the column is False, I could use: df[~df.column_x]. I feel like I have done this before, and have seen it as the accepted answer.
However, this fails because ~df.column_x converts the values to integers. See below.
import pandas as pd . # version 0.24.2
a = pd.Series(['a', 'a', 'a', 'a', 'b', 'a', 'b', 'b', 'b', 'b'])
b = pd.Series([True, True, True, True, True, False, False, False, False, False], dtype=bool)
c = pd.DataFrame(data=[a, b]).T
c.columns = ['Classification', 'Boolean']```
print(~c.Boolean)
0 -2
1 -2
2 -2
3 -2
4 -2
5 -1
6 -1
7 -1
8 -1
9 -1
Name: Boolean, dtype: object
print(~b)
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
dtype: bool
Basically, I can use c[~b], but not c[~c.Boolean]
Am I just dreaming that this use to work?
Ah , since you created the c by using DataFrame constructor , then T,
1st let us look at what we have before T:
pd.DataFrame([a, b])
Out[610]:
0 1 2 3 4 5 6 7 8 9
0 a a a a b a b b b b
1 True True True True True False False False False False
So pandas will make each columns only have one dtype, if not it will convert to object .
After T what data type we have for each columns
The dtypes in your c :
c.dtypes
Out[608]:
Classification object
Boolean object
Boolean columns became object type , that is why you get unexpected output for ~c.Boolean
How to fix it ? ---concat
c=pd.concat([a,b],1)
c.columns = ['Classification', 'Boolean']
~c.Boolean
Out[616]:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
Name: Boolean, dtype: bool
Related
I have a temp df and a a dflst.
the temp has as columns the unique col names from a dataframes in a dflst .
The dflst has a dynamic len, my problem arrises when len(dflst)>=4.
aLL DFs (temp and all the ones in dflst) have columns with true/false values and a p column with numbers
code to recreate data:
#making temp df
var_cols=['a', 'b', 'c', 'd']
temp = pd.DataFrame(list(itertools.product([False, True], repeat=len(var_cols))), columns=var_cols)
#makinf dflst
df0=pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'b']))), columns=['a', 'b'])
df0['p']= np.random.randint(1, 5, df0.shape[0])
df1=pd.DataFrame(list(itertools.product([False, True], repeat=len(['c', 'd']))), columns=['c', 'd'])
df1['p']= np.random.randint(1, 5, df1.shape[0])
df2=pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'c', ]))), columns=['a', 'c'])
df2['p']= np.random.randint(1, 5, df2.shape[0])
df3=pd.DataFrame(list(itertools.product([False, True], repeat=len(['d']))), columns=['d'])
df3['p']= np.random.randint(1, 5, df3.shape[0])
dflst=[df0, df1, df2, df3]
I want to merge the dfs in dflst, so that the 'p'col values from dfs in dflst into temp df, in the rows with compatible values between the two .
I am currently doing it with pd.merge as follows:
for df in dflst:
temp = temp.merge(df, on=list(df)[:-1], how='right')
but this results to a df that has same names for different columns, when dflst has 4 or more dfs.. I understand that that is due to suffix of merge. but it creates problems with column indexing.
How can I have unique names on the new columns added to temp iteratively?
I don't fully understand what you want but IIUC:
for i, df in enumerate(dflst):
temp = temp.merge(df.rename(columns={'p': f'p{i}'}),
on=df.columns[:-1].tolist(), how='right')
print(temp)
# Output:
a b c d p0 p1 p2 p3
0 False False False False 4 2 2 1
1 False True False False 3 2 2 1
2 False False True False 4 3 4 1
3 False True True False 3 3 4 1
4 True False False False 3 2 2 1
5 True True False False 3 2 2 1
6 True False True False 3 3 1 1
7 True True True False 3 3 1 1
8 False False False True 4 4 2 3
9 False True False True 3 4 2 3
10 False False True True 4 1 4 3
11 False True True True 3 1 4 3
12 True False False True 3 4 2 3
13 True True False True 3 4 2 3
14 True False True True 3 1 1 3
15 True True True True 3 1 1 3
I am trying to iterate through a dataframe and return the rows that contain a string "x" in any column.
This is what I have been trying
for col in df:
rows = df[df[col].str.contains(searchTerm, case = False, na = False)]
However, it only returns up to 2 rows if I search for a string I know exists there and in more rows.
How do I make sure it is searching every row of every column?
Edit: My end goal is to get the row and column of the cell containing the string searchTerm
Welcome!
Agree with all the comments. It's generally best practice to find a way to accomplish what you want in Pandas/Numpy without iterating over rows/columns.
If the objective is to "find rows where any column contains the value 'x'), life is a lot easier than you think.
Below is some data:
import pandas as pd
df = pd.DataFrame({
'a': range(10),
'b': ['x', 'b', 'c', 'd', 'x', 'f', 'g', 'h', 'i', 'x'],
'c': [False, False, True, True, True, False, False, True, True, True],
'd': [1, 'x', 3, 4, 5, 6, 7, 8, 'x', 10]
})
print(df)
a b c d
0 0 x False 1
1 1 b False x
2 2 c True 3
3 3 d True 4
4 4 x True 5
5 5 f False 6
6 6 g False 7
7 7 h True 8
8 8 i True x
9 9 x True 10
So clearly rows 0, 1, 4, 8 and 9 should be included.
If we just do df == 'x', pandas broadcasts the comparison across the whole dataframe:
df == 'x'
a b c d
0 False True False False
1 False False False True
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False False False False
7 False False False False
8 False False False True
9 False True False False
But pandas also has the handy .any method, to check for True in any dimension. So if we want to check across all columns, we want dimension 1:
rows = (df == 'x').any(axis=1)
print(rows)
0 True
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 True
Note that if you want your solution to be truly case sensitive like what you're using with the .str method, you might need something more like:
rows = (df.applymap(lambda x: str(x).lower() == 'x')).any(axis=1)
The correct rows are flagged without any looping. And you get a series back that can be used for indexing the original dataframe:
df.loc[rows]
a b c d
0 0 x False 1
1 1 b False x
4 4 x True 5
8 8 i True x
9 9 x True 10
How can I apply a pandas groupby to columns that are numerical and boolean? I want to sum over the numerical columns and I want the aggregation of the boolean values to be any, that is True if there are any Trues and False if there are only False.
Performing a sum aggregation will give the desired result as long as you cast the boolean columns back to boolean types. Example
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
'bool': [True, False, True, True, False, False],
'c': [10, 10, 15, 15, 20, 20]})
id bool c
0 1 True 10
1 1 False 10
2 2 True 15
3 2 True 15
4 3 False 20
5 3 False 20
df.groupby('id').sum()
bool c
id
1 1.0 20
2 2.0 30
3 0.0 40
As you can see when applying the sum True is cast as 1 and False is cast as zero. This effectively acts as the desired any operation. Casting back to boolean:
df['bool'] = df['bool'].astype(bool)
id bool c
0 1 True 10
1 1 False 10
2 2 True 15
3 2 True 15
4 3 False 20
5 3 False 20
You can choose the functions which you aggregate by with the following:
df.groupby("id").agg({
"bool":lambda arr: any(arr),
"c":sum,
})
Let's assume we have a table with groupings of variable and their frequencies:
In R:
> df
# A tibble: 3 x 3
Cough Fever cases
<lgl> <lgl> <dbl>
1 TRUE FALSE 1
2 FALSE FALSE 2
3 TRUE TRUE 3
Then we could use tidyr::uncount to get a dataframe with the individual cases:
> uncount(df, cases)
# A tibble: 6 x 2
Cough Fever
<lgl> <lgl>
1 TRUE FALSE
2 FALSE FALSE
3 FALSE FALSE
4 TRUE TRUE
5 TRUE TRUE
6 TRUE TRUE
Is there an equivalent in Python/Pandas?
You have a row index and repeat it according to the counts, for example in R you can do:
df[rep(1:nrow(df),df$cases),]
first to get a data like yours:
df = pd.DataFrame({'x':[1,1,2,2,2,2],'y':[0,1,0,1,1,1]})
counts = df.groupby(['x','y']).size().reset_index()
counts.columns = ['x','y','n']
x y n
0 1 0 1
1 1 1 1
2 2 0 1
3 2 1 3
Then:
counts.iloc[np.repeat(np.arange(len(counts)),counts.n),:2]
x y
0 1 0
1 1 1
2 2 0
3 2 1
3 2 1
3 2 1
I haven't found an equivalent function in Python, but this works
df2 = df.pop('cases')
df = pd.DataFrame(df.values.repeat(df2, axis=0), columns=df.columns)
df['cases'] is passed to df2, then you create a new DataFrame with the elements from the original DataFrame repeated according to the count in df2. Please let me know if it helps.
In addition to the other solutions, you could combine take, repeat and drop:
import pandas as pd
df = pd.DataFrame({'Cough': [True, False, True],
'Fever': [False, False, True],
'cases': [1, 2, 3]})
df.take(df.index.repeat(df.cases)).drop(columns="cases")
Cough Fever
0 True False
1 False False
1 False False
2 True True
2 True True
2 True True
As easy as you use tidyr's API with datar:
>>> from datar.all import f, tribble, uncount
>>> df = tribble(
... f.Cough, f.Fever, f.cases,
... True, False, 1,
... False, False, 2,
... True, True, 3
... )
>>> uncount(df, f.cases)
Cough Fever
<bool> <bool>
0 True False
1 False False
2 False False
3 True True
4 True True
5 True True
I am the author of the package. Feel free to submit issues if you have any questions.
Suppose I have df below:
df = pd.DataFrame({
'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'B': [False, True, False, False, True, False, False, True]
})
df is already sorted by A (obviously) and time (descending). So for each group defined by A, the vlues in B are time sorted descendingly. What I want to do is to add a columns C which, for each group, is True if there is a True value in B in the past. The result would look like:
A B C
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
I suspect I need to use groupby() and idxmax() somehow but haven't been able to make it work. Any ideas?
idxmax is the way with transform
df['New']=df.index<df.iloc[::-1].groupby('A').B.transform('idxmax').sort_index()
df
A B New
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
If all False
s1=df.index<df.iloc[::-1].groupby('A').B.transform('idxmax').sort_index()
s2=df.groupby('A').B.transform('any')
df['New']=s1&s2
IIUC here's one way:
rev_cs = df[::-1].groupby('A').B.apply(lambda x: x.cumsum().shift(fill_value=0.).gt(0))
df['C'] = rev_cs[::-1]
print(df)
A B C
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False