Let's say I am updating my dataframe with another dataframe (df2):
import pandas as pd
import numpy as np
df = pd.DataFrame({'axis1': ['Unix','Window','Apple','Linux'],
                   'A': [1,np.nan,1,1],
                   'B': [1,np.nan,np.nan,1],
                   'C': [np.nan,1,np.nan,1],
                   'D': [1,np.nan,1,np.nan],
                   }).set_index(['axis1'])
print (df)
df2 = pd.DataFrame({'axis1': ['Unix','Window','Apple','Linux','A'],
                    'A': [1,1,np.nan,np.nan,np.nan],
                    'E': [1,np.nan,1,1,1],
                    }).set_index(['axis1'])
df = df.reindex(columns=df2.columns.union(df.columns),
                index=df2.index.union(df.index))
df.update(df2)
print (df)
Is there a command to get the number of cells that were updated (changed from NaN to 1)?
I want to use this to track changes to my dataframe.
There is no built-in method in pandas that I can think of; you would have to save the original df prior to the update and then compare. The trick is to handle the NaN comparisons correctly, because NaN != NaN evaluates to True and inflates the count. In the session below, df3 is a copy of df taken prior to the call to update.
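For reference, a minimal sketch of that snapshot step, taken just before calling update (DataFrame.update modifies df in place, so the copy must be made first):
# keep an unmodified copy, since DataFrame.update mutates df in place
df3 = df.copy()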
In [104]:
df.update(df2)
df
Out[104]:
         A   B   C   D   E
axis1
A      NaN NaN NaN NaN   1
Apple    1 NaN NaN   1   1
Linux    1   1   1 NaN   1
Unix     1   1 NaN   1   1
Window   1 NaN   1 NaN NaN
[5 rows x 5 columns]
In [105]:
df3
Out[105]:
         A   B   C   D   E
axis1
A      NaN NaN NaN NaN NaN
Apple    1 NaN NaN   1 NaN
Linux    1   1   1 NaN NaN
Unix     1   1 NaN   1 NaN
Window NaN NaN   1 NaN NaN
[5 rows x 5 columns]
In [106]:
# compare but notice that NaN comparison returns True
df!=df3
Out[106]:
            A      B      C      D     E
axis1
A        True   True   True   True  True
Apple   False   True   True  False  True
Linux   False  False  False   True  True
Unix    False  False   True  False  True
Window   True   True  False   True  True
[5 rows x 5 columns]
In [107]:
# use numpy count_nonzero for easy counting; note this gives the wrong result
np.count_nonzero(df!=df3)
Out[107]:
16
In [132]:
# treat aligned NaNs as equal, then negate to find genuine changes
~((df == df3) | (np.isnan(df) & np.isnan(df3)))
Out[132]:
            A      B      C      D      E
axis1
A       False  False  False  False   True
Apple   False  False  False  False   True
Linux   False  False  False  False   True
Unix    False  False  False  False   True
Window   True  False  False  False  False
[5 rows x 5 columns]
In [133]:
np.count_nonzero(~((df == df3) | (np.isnan(df) & np.isnan(df3))))
Out[133]:
5
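As a side note, a sketch assuming pandas >= 1.1: DataFrame.compare treats aligned NaNs as equal and keeps only the differing cells, so counting the positions where either side is non-NaN gives the same answer without the manual isnan handling:
# compare df3 (before) with df (after); columns become (name, 'self') / (name, 'other')
diff = df3.compare(df)
changed = diff.xs('self', axis=1, level=1).notna() | diff.xs('other', axis=1, level=1).notna()
print(changed.sum().sum())  # 5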
Related
I am trying to convert the first occurrence of 0 to 1 in a column of a Pandas DataFrame. The column in question contains 1, 0 and null values. The sample data is as follows:
mask_col  categorical_col  target_col
TRUE      A                1
TRUE      A                1
FALSE     A
TRUE      A                0
FALSE     A
TRUE      A                0
TRUE      B                1
FALSE     B
FALSE     B
FALSE     B
TRUE      B                0
FALSE     B
I want rows 4 and 11 to change to 1 and keep row 6 as 0.
How do I do this?
To set the first 0 per group of categorical_col, compare target_col with 0 and use DataFrameGroupBy.idxmax:
df.loc[df['target_col'].eq(0).groupby(df['categorical_col']).idxmax(), 'target_col'] = 1
print (df)
    mask_col categorical_col  target_col
0       True               A         1.0
1       True               A         1.0
2      False               A         NaN
3       True               A         1.0
4      False               A         NaN
5       True               A         0.0
6       True               B         1.0
7      False               B         NaN
8      False               B         NaN
9      False               B         NaN
10      True               B         1.0
11     False               B         NaN
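Why idxmax finds the first 0: eq(0) produces a boolean Series, False sorts below True, and idxmax returns the index label of the first maximum within each group - here rows 3 and 10:
# evaluated on the original frame, before the assignment above
s = df['target_col'].eq(0)
print(s.groupby(df['categorical_col']).idxmax())
# categorical_col
# A     3
# B    10
# Name: target_col, dtype: int64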
The logic is not fully clear, so here are two options:
option 1
Considering the stretches of True per group of categorical_col and assuming you want the first N stretches (here N=2) as 1, you can use a custom groupby.apply:
vals = (df.groupby('categorical_col', group_keys=False)['mask_col']
          .apply(lambda s: s.ne(s.shift())[s].cumsum())
        )
df.loc[vals[vals.le(2)].index, 'target_col'] = 1
option 2
If you literally want to match only the first 0 per group and replace it with 1, you can slice only the 0s and get the first value's index with groupby.idxmax:
df.loc[df['target_col'].eq(0).groupby(df['categorical_col']).idxmax(), 'target_col'] = 1
# variant with idxmin
idx = df[df['target_col'].eq(0)].groupby(df['categorical_col'])['mask_col'].idxmin()
df.loc[idx, 'target_col'] = 1
Output:
    mask_col categorical_col  target_col
0       True               A         1.0
1       True               A         1.0
2      False               A         NaN
3       True               A         1.0
4      False               A         NaN
5       True               A         0.0
6       True               B         1.0
7      False               B         NaN
8      False               B         NaN
9      False               B         NaN
10      True               B         1.0
11     False               B         NaN
You can update the first zero occurrence for each category with the following loop:
for category in df['categorical_col'].unique():
    index = df[(df['categorical_col'] == category) &
               (df['target_col'] == 0)].index[0]
    df.loc[index, 'target_col'] = 1
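One caveat, as a sketch (assuming your real data may contain categories without any 0, which the loop above does not handle): .index[0] raises an IndexError for such a category, so guard the lookup first:
for category in df['categorical_col'].unique():
    zeros = df[(df['categorical_col'] == category) &
               (df['target_col'] == 0)].index
    if len(zeros) > 0:  # skip categories that have no 0 at all
        df.loc[zeros[0], 'target_col'] = 1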
I'm still quite new to Python and programming in general. With luck, I have the right idea, but I can't quite get this to work.
With my example df, I want iteration to start when entry == 1.
import pandas as pd
import numpy as np
nan = np.nan
a = [0,0,4,4,4,4,6,6]
b = [4,4,4,4,4,4,4,4]
entry = [nan,nan,nan,nan,1,nan,nan,nan]
df = pd.DataFrame(columns=['a', 'b', 'entry'])
df = pd.DataFrame.assign(df, a=a, b=b, entry=entry)
I wrote a function, with little success. It returns an error, unhashable type: 'slice'. FWIW, I'm applying this function to groups of various lengths.
def exit_row(df):
    start = df.index[df.entry == 1]
    df.loc[start:, (df.a > df.b), 'exit'] = 1
    return df
Ideally, the result would be as below:
   a  b  entry  exit
0  0  4    NaN   NaN
1  0  4    NaN   NaN
2  4  4    NaN   NaN
3  4  4    NaN   NaN
4  4  4    1.0   NaN
5  4  4    NaN   NaN
6  6  4    NaN     1
7  6  4    NaN     1
Any advice much appreciated. I had wondered if I should attempt a For loop instead, though I often find them difficult to read.
You can use boolean indexing:
# what are the rows after entry?
m1 = df['entry'].notna().cummax()
# in which rows is a>b?
m2 = df['a'].gt(df['b'])
# set 1 where both conditions are True
df.loc[m1&m2, 'exit'] = 1
output:
   a  b  entry  exit
0  0  4    NaN   NaN
1  0  4    NaN   NaN
2  4  4    NaN   NaN
3  4  4    NaN   NaN
4  4  4    1.0   NaN
5  4  4    NaN   NaN
6  6  4    NaN   1.0
7  6  4    NaN   1.0
Intermediates:
   a  b  entry  notna     m1     m2  m1&m2  exit
0  0  4    NaN  False  False  False  False   NaN
1  0  4    NaN  False  False  False  False   NaN
2  4  4    NaN  False  False  False  False   NaN
3  4  4    NaN  False  False  False  False   NaN
4  4  4    1.0   True   True  False  False   NaN
5  4  4    NaN  False   True  False  False   NaN
6  6  4    NaN  False   True   True   True   1.0
7  6  4    NaN  False   True   True   True   1.0
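Since you mention applying this to groups of various lengths, here is a sketch of a grouped variant, assuming a hypothetical grouping column named 'group': replace the global cummax with a per-group cummax, so each group's entry flag only affects its own rows:
# 'group' is a hypothetical column; cast to int for cummax compatibility across pandas versions
m1 = df['entry'].notna().astype(int).groupby(df['group']).cummax().astype(bool)
m2 = df['a'].gt(df['b'])
df.loc[m1 & m2, 'exit'] = 1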
I have a dataframe:
Out[8]:
     0    1    2
0    0  1.0  2.0
1  NaN  NaN  NaN
2    0  0.0  NaN
3    0  1.0  2.0
4    0  1.0  2.0
I want to add a new Boolean column "abc" that if the row has all NaN, the "abc" is "true", otherwise the "abc" is "false", for example:
     0    1    2    abc
0    0  1.0  2.0  false
1  NaN  NaN  NaN   true
2    0  0.0  NaN  false
3    0  1.0  2.0  false
4    0  1.0  2.0  false
Here is my code to check the rows:
def check_null(df):
    return df.isnull().all(axis=1)
it returns part of what I want:
check_null(df)
Out[10]:
0    false
1     true
2    false
3    false
4    false
dtype: bool
So, my question is, how can I add 'abc' as a new column in it?
I tried
df['abc'] = df.apply(check_null, axis=0)
it shows:
ValueError: ("No axis named 1 for object type <class 'pandas.core.series.Series'>", 'occurred at index 0')
Using isna with all:
df.isna().all(axis = 1)
Out[121]:
0    False
1     True
2    False
3    False
4    False
dtype: bool
And you do not need apply; your function works when called directly:
def check_null(df):
    return df.isnull().all(axis=1)
check_null(df)
Out[123]:
0    False
1     True
2    False
3    False
4    False
dtype: bool
If you do want to apply it, you need to change your function to remove the axis=1:
def check_null(df):
    return df.isnull().all()
df.apply(check_null, axis=1)
Out[133]:
0    False
1     True
2    False
3    False
4    False
dtype: bool
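And to actually attach the mask as the new column the question asks for, a plain assignment is enough (no apply needed):
df['abc'] = df.isna().all(axis=1)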
A list holds the paths to several csv files.
How do I check, in each loop iteration, whether the csv has any empty columns, and delete them if it does?
Code:
for i in list1:
    if (list1.columns = '').any():
        i.remove that column
Hope this explains what I am talking about.
Sample:
df = pd.DataFrame({
    '':list('abcdef'),
    'B':[4,5,4,5,5,np.nan],
    'C':[''] * 6,
    'D':[np.nan] * 6,
    'E':[5,3,6,9,2,4],
    'F':list('aaabb') + ['']
})
print (df)
        B C    D  E  F
0  a  4.0    NaN  5  a
1  b  5.0    NaN  3  a
2  c  4.0    NaN  6  a
3  d  5.0    NaN  9  b
4  e  5.0    NaN  2  b
5  f  NaN    NaN  4
First, remove the column with the empty column name - that means keeping only columns whose name is non-empty, filtered with loc and boolean indexing:
df1 = df.loc[:, df.columns != '']
print (df1)
     B C    D  E  F
0  4.0    NaN  5  a
1  5.0    NaN  3  a
2  4.0    NaN  6  a
3  5.0    NaN  9  b
4  5.0    NaN  2  b
5  NaN    NaN  4
Next, remove column C because it contains only empty strings - compare all values against the empty string and require at least one True per column with DataFrame.any, again filtering with loc and boolean indexing:
df2 = df.loc[:, (df != '').any()]
print (df2)
        B    D  E  F
0  a  4.0  NaN  5  a
1  b  5.0  NaN  3  a
2  c  4.0  NaN  6  a
3  d  5.0  NaN  9  b
4  e  5.0  NaN  2  b
5  f  NaN  NaN  4
print ((df != ''))
           B      C     D     E      F
0   True  True  False  True  True   True
1   True  True  False  True  True   True
2   True  True  False  True  True   True
3   True  True  False  True  True   True
4   True  True  False  True  True   True
5   True  True  False  True  True  False
print ((df != '').any())
      True
B     True
C    False
D     True
E     True
F     True
dtype: bool
Finally, remove column D because it contains only missing values, using dropna:
df3 = df.dropna(axis=1, how='all')
print (df3)
        B C  E  F
0  a  4.0    5  a
1  b  5.0    3  a
2  c  4.0    6  a
3  d  5.0    9  b
4  e  5.0    2  b
5  f  NaN    4
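To tie the three cleanups back to the loop over csv paths in the question, a sketch, assuming list1 holds file paths and that overwriting each file in place is acceptable:
for path in list1:
    df = pd.read_csv(path)
    df = df.loc[:, df.columns != '']   # drop columns with an empty name
    df = df.loc[:, (df != '').any()]   # drop columns holding only empty strings
    df = df.dropna(axis=1, how='all')  # drop columns holding only NaN
    df.to_csv(path, index=False)       # write the cleaned frame back (assumption)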
I'm dealing with a problem involving an Excel sheet and Python. I have successfully retrieved the specific columns and rows from the Excel file with pandas. Now I want to display only the rows and columns which have a "none" or "empty" value. (Sample image of the Excel sheet.)
In the above image, I need the rows and columns whose value is None. For example, the "estfix" column has several None values, so I need to check each column value and, if it is None, print its corresponding row and column. Hope you understand.
Code I tried:
import pandas as pd
wb = pd.ExcelFile(r"pathname details.xlsx")
sheet_1 = pd.read_excel(r"pathname details.xlsx", sheetname=0)
c = sheet_1[["bugid","sev","estfix","manager","director"]]
print(c)
I'm using python 3.6. Thanks in advance!
I expect output like this:
Here NaN is considered as None.
Use isnull with any to check for at least one True per row:
a = df[df.isnull().any(axis=1)]
To filter the columns as well as the rows:
b = df.loc[df.isnull().any(axis=1), df.isnull().any()]
Sample:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,np.nan,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,np.nan,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
   A    B  C    D  E  F
0  a  4.0  7  1.0  5  a
1  b  NaN  8  3.0  3  a
2  c  4.0  9  NaN  6  a
3  d  5.0  4  7.0  9  b
4  e  5.0  2  1.0  2  b
5  f  4.0  3  0.0  4  b
a = df[df.isnull().any(axis=1)]
print (a)
   A    B  C    D  E  F
1  b  NaN  8  3.0  3  a
2  c  4.0  9  NaN  6  a
b = df.loc[df.isnull().any(axis=1), df.isnull().any()]
print (b)
     B    D
1  NaN  3.0
2  4.0  NaN
Detail:
print (df.isnull())
       A      B      C      D      E      F
0  False  False  False  False  False  False
1  False   True  False  False  False  False
2  False  False  False   True  False  False
3  False  False  False  False  False  False
4  False  False  False  False  False  False
5  False  False  False  False  False  False
print (df.isnull().any(axis=1))
0    False
1     True
2     True
3    False
4    False
5    False
dtype: bool
print (df.isnull().any())
A    False
B     True
C    False
D     True
E    False
F    False
dtype: bool