Querying pandas against each other - python

I am trying to filter a df against another, something like this:
df1:
myval
0 1.2
1 3.5
2 5.7
3 0.4
df2:
thrsh
0 0.4
1 5.5
2 1.0
3 0.0
I would love to query this way:
(df1['myval']>df2['thrsh'])
so to come out with a new df which would have all the combinations:
df3:
thrsh myval
0 0 true
1 true
2 true
3 false
1 0 false
1 false
2 true
3 false
Basically this creates a 3rd dimension out of the combination of the 2 dfs.
As of now the result is "ValueError: Can only compare identically-labeled Series objects".
Any idea?
Thank you so much!

Create a MultiIndex with MultiIndex.from_product and then reindex both columns by it, so both get the same MultiIndex and the comparison becomes possible:
mux = pd.MultiIndex.from_product([df1.index, df2.index], names=['thrsh','myval'])
m = df1['myval'].reindex(mux, level=1) > df2['thrsh'].reindex(mux, level=0)
print (m)
thrsh myval
0 0 True
1 True
2 True
3 False
1 0 False
1 False
2 True
3 False
2 0 True
1 True
2 True
3 False
3 0 True
1 True
2 True
3 True
dtype: bool
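If you prefer to avoid the MultiIndex reindexing, a sketch of the same all-pairs comparison can be written with plain numpy broadcasting (sample data reconstructed from the question; the stack at the end just reproduces the MultiIndex shape of the answer above):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'myval': [1.2, 3.5, 5.7, 0.4]})
df2 = pd.DataFrame({'thrsh': [0.4, 5.5, 1.0, 0.0]})

# Broadcast: rows are thrsh indices, columns are myval indices,
# so arr[i, j] == (myval[j] > thrsh[i]).
arr = df1['myval'].to_numpy()[None, :] > df2['thrsh'].to_numpy()[:, None]

# Stack into a Series with a (thrsh, myval) MultiIndex.
m = pd.DataFrame(arr, index=df2.index, columns=df1.index).stack()
m.index.names = ['thrsh', 'myval']
print(m)
```

This gives the same boolean Series as the reindex approach; broadcasting may be faster for large inputs since it skips the index alignment machinery.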

Convert first value from 0 to 1 based on group and mask columns in a Pandas Dataframe

I am trying to convert the first occurrence of 0 to 1 in a column in a pandas DataFrame. The column in question contains 1, 0 and null values. The sample data is as follows:
mask_col  categorical_col  target_col
TRUE      A                1
TRUE      A                1
FALSE     A
TRUE      A                0
FALSE     A
TRUE      A                0
TRUE      B                1
FALSE     B
FALSE     B
FALSE     B
TRUE      B                0
FALSE     B
I want rows 4 and 11 to change to 1 and to keep row 6 as 0.
How do I do this?
To set the first 0 per group of categorical_col, compare with 0 and use GroupBy.idxmax to locate it, then set 1:
df.loc[df['target_col'].eq(0).groupby(df['categorical_col']).idxmax(), 'target_col'] = 1
print (df)
mask_col categorical_col target_col
0 True A 1.0
1 True A 1.0
2 False A NaN
3 True A 1.0
4 False A NaN
5 True A 0.0
6 True B 1.0
7 False B NaN
8 False B NaN
9 False B NaN
10 True B 1.0
11 False B NaN
The logic is not fully clear, so here are two options:
option 1
Considering the stretches of True per group of categorical_col and assuming you want the first N stretches (here N=2) as 1, you can use a custom groupby.apply:
vals = (df.groupby('categorical_col', group_keys=False)['mask_col']
          .apply(lambda s: s.ne(s.shift())[s].cumsum())
        )
df.loc[vals[vals.le(2)].index, 'target_col'] = 1
option 2
If you literally want to match only the first 0 per group and replace it with 1, you can slice only the 0s and get the first value's index with groupby.idxmax:
df.loc[df['target_col'].eq(0).groupby(df['categorical_col']).idxmax(), 'target_col'] = 1
# variant with idxmin
idx = df[df['target_col'].eq(0)].groupby(df['categorical_col'])['mask_col'].idxmin()
df.loc[idx, 'target_col'] = 1
Output:
mask_col categorical_col target_col
0 True A 1.0
1 True A 1.0
2 False A NaN
3 True A 1.0
4 False A NaN
5 True A 0.0
6 True B 1.0
7 False B NaN
8 False B NaN
9 False B NaN
10 True B 1.0
11 False B NaN
You can update the first zero occurrence for each category with the following loop:
for category in df['categorical_col'].unique():
    index = df[(df['categorical_col'] == category) &
               (df['target_col'] == 0)].index[0]
    df.loc[index, 'target_col'] = 1
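As a self-contained check of the idxmax approach, here is a sketch with the sample data rebuilt from the question (column names as in the question): eq(0) is True only at the zeros, and idxmax per group returns the index of the first True, i.e. the first 0 in each group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mask_col': [True, True, False, True, False, True,
                 True, False, False, False, True, False],
    'categorical_col': list('AAAAAA') + list('BBBBBB'),
    'target_col': [1, 1, np.nan, 0, np.nan, 0,
                   1, np.nan, np.nan, np.nan, 0, np.nan],
})

# First 0 per group: index 3 in group A, index 10 in group B.
first_zero = df['target_col'].eq(0).groupby(df['categorical_col']).idxmax()
df.loc[first_zero, 'target_col'] = 1
print(df)
```

One caveat worth knowing: idxmax on a group with no 0 at all returns that group's first index (all values are False), so this pattern is only safe when every group contains at least one 0.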

Multiply column values but exclude zero values in Pandas

I'm looking for a way to multiply the values of all columns but to exclude cells that have a value of 0, so the result is not 0 (multiplication by 0). With this number of columns and rows it's easy, but what if there are 100 columns and 5000 rows?
import pandas as pd
df = pd.DataFrame({"Col1":[6,4,3,0],
                   "Col2":[1,0,0,3],
                   "Col3":[2,4,3,2]})
So result should look like this:
print(df)
# result should be multiplication of all column values, but not 0
# zeros should be excluded
6 * 1 * 2
4 * 4
3 * 3
3 * 2
df = pd.DataFrame({"Col1":[6,4,3,0],
                   "Col2":[1,0,0,3],
                   "Col3":[2,4,3,2],
                   "Result":[12,16,9,6]})
print(df)
I cannot change the data, so changing zeros to 1 does not work.
You could simply replace the 0s with 1s; replace returns a new DataFrame, so the original data is not changed.
df = pd.DataFrame({"Col1":[6,4,3,0],
                   "Col2":[1,0,0,3],
                   "Col3":[2,4,3,2]})
df['Result'] = df.replace(0,1).prod(axis=1)
Col1 Col2 Col3 Result
0 6 1 2 12
1 4 0 4 16
2 3 0 3 9
3 0 3 2 6
To get technical: in multiplication, 1 is the identity element. In addition, the identity element is 0. To way oversimplify it, an identity element is just a fancy way of saying "a value that leaves any other value unchanged when you combine the two".
To get non technical I think of the quote "Think Smart Not Hard"
Maybe just replace zero values by one and multiply the values (np.prod needs numpy imported):
import numpy as np
df['Result'] = df.replace(0,1).apply(np.prod,axis=1)
Simply mask 0 values to NaN and call prod:
df['Result'] = df.where(df.ne(0)).prod(1)
Out[1748]:
Col1 Col2 Col3 Result
0 6 1 2 12.0
1 4 0 4 16.0
2 3 0 3 9.0
3 0 3 2 6.0
Or mask 0 to 1
df['Result'] = df.where(df.ne(0), 1).prod(1)
Out[1754]:
Col1 Col2 Col3 Result
0 6 1 2 12
1 4 0 4 16
2 3 0 3 9
3 0 3 2 6
Step by step:
ne(0) returns a boolean mask where 0 is marked as False:
In [1755]: df.ne(0)
Out[1755]:
Col1 Col2 Col3 Result
0 True True True True
1 True False True True
2 True False True True
3 False True True True
where checks each location of the boolean mask. On True, it keeps the same value. On False, it turns the value to NaN when there is no 2nd parameter.
In [1756]: df.where(df.ne(0))
Out[1756]:
Col1 Col2 Col3 Result
0 6.0 1.0 2 12
1 4.0 NaN 4 16
2 3.0 NaN 3 9
3 NaN 3.0 2 6
prod(1) is the product along axis=1. prod ignores NaN by default, so it returns the product of each row without considering NaN.
In [1759]: df.where(df.ne(0)).prod(1)
Out[1759]:
0 12.0
1 16.0
2 9.0
3 6.0
dtype: float64
When the 2nd parameter of where is specified, it is used as the replacement wherever the mask is False.
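To illustrate that 2nd parameter in isolation, here is a minimal sketch on a small Series (values chosen arbitrarily):

```python
import pandas as pd

s = pd.Series([6, 0, 3, 0])
mask = s.ne(0)           # [True, False, True, False]

# Without a 2nd parameter, False positions become NaN
# (and the dtype is promoted to float to hold NaN).
print(s.where(mask))

# With a 2nd parameter, False positions take that value instead,
# so the dtype can stay integer.
print(s.where(mask, 1))
```

This dtype difference is why the prod result is 12.0/16.0/... in the NaN variant above, but plain integers in the mask-to-1 variant.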

How to add a new Boolean column if all values in a row are NULL?

I have a dataframe:
Out[8]:
0 1 2
0 0 1.0 2.0
1 NaN NaN NaN
2 0 0.0 NaN
3 0 1.0 2.0
4 0 1.0 2.0
I want to add a new Boolean column "abc": if the row has all NaN, "abc" is "true", otherwise "abc" is "false". For example:
0 1 2 abc
0 0 1.0 2.0 false
1 NaN NaN NaN true
2 0 0.0 NaN false
3 0 1.0 2.0 false
4 0 1.0 2.0 false
here is my code to check the row
def check_null(df):
    return df.isnull().all(axis=1)
it returns part of what I want:
check_null(df)
Out[10]:
0 false
1 true
2 false
3 false
4 false
dtype: bool
So, my question is, how can I add 'abc' as a new column in it?
I tried
df['abc'] = df.apply(check_null, axis=0)
it shows:
ValueError: ("No axis named 1 for object type ", 'occurred at index 0')
Using isna with all:
df.isna().all(axis = 1)
Out[121]:
0 False
1 True
2 False
3 False
4 False
dtype: bool
And you do not need to apply it:
def check_null(df):
    return df.isnull().all(axis=1)
check_null(df)
Out[123]:
0 False
1 True
2 False
3 False
4 False
dtype: bool
If you do want to apply it, you need to change your function to remove the axis=1:
def check_null(df):
    return df.isnull().all()
df.apply(check_null,1)
Out[133]:
0 False
1 True
2 False
3 False
4 False
dtype: bool
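Putting it together, the assignment the question actually asked for is a one-liner on the per-row test (sample frame reconstructed from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [0, np.nan, 0, 0, 0],
                   1: [1.0, np.nan, 0.0, 1.0, 1.0],
                   2: [2.0, np.nan, np.nan, 2.0, 2.0]})

# Assign the per-row "all null" test directly as a new column.
df['abc'] = df.isna().all(axis=1)
print(df)
```

No apply is needed; the vectorized isna().all(axis=1) already returns one boolean per row, which assignment broadcasts into the new column.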

How to separate null and non-null containing rows into two different DataFrames?

Say I have a big DataFrame (>10000 rows) that has some rows containing one or more nulls. How do I remove all the rows containing a null in one or more of their columns from the original DataFrame and put those rows into another DataFrame?
e.g.:
Original DataFrame:
a b c
1 "foo" 5 3
2 "bar" 9 1
3 NaN 5 4
4 "foo" NaN 1
Non-Null DataFrame:
a b c
1 "foo" 5 3
2 "bar" 9 1
Null containing DataFrame:
a b c
1 NaN 5 4
2 "foo" NaN 1
Use DataFrame.isna for checking missing values:
print (df.isna())
#print (df.isnull())
a b c
1 False False False
2 False False False
3 True False False
4 False True False
And test if there is at least one True per row with DataFrame.any:
mask = df.isna().any(axis=1)
# older pandas versions
mask = df.isnull().any(axis=1)
print (mask)
1 False
2 False
3 True
4 True
dtype: bool
Last, filter by boolean indexing; ~ inverts the boolean mask:
df1 = df[~mask]
df2 = df[mask]
print (df1)
a b c
1 foo 5.0 3
2 bar 9.0 1
print (df2)
a b c
3 NaN 5.0 4
4 foo NaN 1
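A shorter variant, as a sketch: DataFrame.dropna gives the non-null frame directly, and the null-containing rows are whatever is left, selectable via the index difference (sample frame rebuilt from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar', np.nan, 'foo'],
                   'b': [5, 9, 5, np.nan],
                   'c': [3, 1, 4, 1]},
                  index=[1, 2, 3, 4])

df1 = df.dropna()                             # rows without any NaN
df2 = df.loc[df.index.difference(df1.index)]  # the remaining rows
print(df1)
print(df2)
```

The explicit mask version above is still preferable when you need the mask for anything else, since dropna recomputes the null test internally.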

To display the none cell in the excel sheet by python

I'm dealing with a problem with an Excel sheet and Python. I have successfully retrieved specific columns and rows from Excel with pandas. Now I want to display only the rows and columns which have a "none" or "empty" value. [Sample image of Excel sheet]
In the above image, I need the rows and columns whose value is none. For example, the "estfix" column has several none values, so I need to check the column's values: if a value is none, I need to print its corresponding row and column. Hope you understand.
Code I tried:
import pandas as pd
wb = pd.ExcelFile(r"pathname details.xlsx")
sheet_1 = pd.read_excel(r"pathname details.xlsx", sheetname=0)
c = sheet_1[["bugid","sev","estfix","manager","director"]]
print(c)
I'm using python 3.6. Thanks in advance!
I'm expecting output like this:
Here NaN is considered as None.
Use isnull with any to check for at least one True:
a = df[df.isnull().any(axis=1)]
For filtering both rows and columns:
b = df.loc[df.isnull().any(axis=1), df.isnull().any()]
Sample:
import numpy as np
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,np.nan,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,np.nan,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4.0 7 1.0 5 a
1 b NaN 8 3.0 3 a
2 c 4.0 9 NaN 6 a
3 d 5.0 4 7.0 9 b
4 e 5.0 2 1.0 2 b
5 f 4.0 3 0.0 4 b
a = df[df.isnull().any(1)]
print (a)
A B C D E F
1 b NaN 8 3.0 3 a
2 c 4.0 9 NaN 6 a
b = df.loc[df.isnull().any(axis=1), df.isnull().any()]
print (b)
B D
1 NaN 3.0
2 4.0 NaN
Detail:
print (df.isnull())
A B C D E F
0 False False False False False False
1 False True False False False False
2 False False False True False False
3 False False False False False False
4 False False False False False False
5 False False False False False False
print (df.isnull().any(axis=1))
0 False
1 True
2 True
3 False
4 False
5 False
dtype: bool
print (df.isnull().any())
A False
B True
C False
D True
E False
F False
dtype: bool
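If you also need the exact (row, column) locations of the missing cells, rather than whole rows or columns, a sketch with stack on the boolean mask (same sample frame as above) gives them as MultiIndex pairs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, np.nan, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, np.nan, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})

# Stack the boolean mask into a Series and keep only the True
# entries: the MultiIndex pairs are the (row, column) of each NaN.
locs = df.isnull().stack()
nan_positions = locs[locs].index.tolist()
print(nan_positions)   # [(1, 'B'), (2, 'D')]
```

Each tuple can then be printed or used to look the cell up directly with df.loc[row, col].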
