Drop rows with NaNs from pandas dataframe based on multiple conditions - python

I have a dataframe with a lot of NaNs.
The y columns are the count of events, the val columns are the value of each event in that year, and the total columns are the product of the two.
Many columns have zeros and many have NaNs because values are not available (up to 80% of the data is missing in 4 of the columns).
y17 y18 y19 y20 val17 val18 val19 val20 total17 total18 total19 total20
1 2 1 2 2 2 2 2 1 4 2 4
2 2 2 2 2 2 2 2 4 4 4 4
3 3 3 3 NaN NaN NaN NaN NaN NaN NaN NaN
0 0 0 0 1 2 3 4 0 0 0 0
0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN
I want to keep the rows where every value is a number (zeros included, no NaNs), AND the rows where the first four columns are all zeros (a multi-column condition).
Expected output
y17 y18 y19 y20 val17 val18 val19 val20 total17 total18 total19 total20
1 2 1 2 2 2 2 2 1 4 2 4
2 2 2 2 2 2 2 2 4 4 4 4
0 0 0 0 1 2 3 4 0 0 0 0
0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN
Thanks!

Just pass the combined condition, built with all:
out = df[df.iloc[:,:4].eq(0).all(1) | df.notna().all(1)]
Out[386]:
y17 y18 y19 y20 val17 ... val20 total17 total18 total19 total20
0 1 2 1 2 2.0 ... 2.0 1.0 4.0 2.0 4.0
1 2 2 2 2 2.0 ... 2.0 4.0 4.0 4.0 4.0
3 0 0 0 0 1.0 ... 4.0 0.0 0.0 0.0 0.0
4 0 0 0 0 NaN ... NaN NaN NaN NaN NaN
[4 rows x 12 columns]
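For reference, a minimal self-contained sketch of the same filter (the frame below just reconstructs the example from the question):
import numpy as np
import pandas as pd

# reconstruct the example frame from the question
df = pd.DataFrame({
    'y17':   [1, 2, 3, 0, 0], 'y18':   [2, 2, 3, 0, 0],
    'y19':   [1, 2, 3, 0, 0], 'y20':   [2, 2, 3, 0, 0],
    'val17': [2, 2, np.nan, 1, np.nan], 'val18': [2, 2, np.nan, 2, np.nan],
    'val19': [2, 2, np.nan, 3, np.nan], 'val20': [2, 2, np.nan, 4, np.nan],
    'total17': [1, 4, np.nan, 0, np.nan], 'total18': [4, 4, np.nan, 0, np.nan],
    'total19': [2, 4, np.nan, 0, np.nan], 'total20': [4, 4, np.nan, 0, np.nan],
})

# keep fully populated rows, or rows whose first four (y) columns are all zero
out = df[df.iloc[:, :4].eq(0).all(axis=1) | df.notna().all(axis=1)]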


I am trying to replace NaN values with mean values

I have to replace the NaN values in s_months and incidents with the corresponding means (working in a Jupyter notebook).
Input data :
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
I have tried the code below, but it does not seem to work; I have also tried different variations, such as replacing the transform.
df.fillna['s_months'] = df.fillna(df.grouby(['types' , 'o_periods']['s_months','incidents']).tranform('mean'),inplace = True)
s_months incidents
Types o_periods
1 1 911 3
2 1688 8
2 1 26851 36
2 14440 36
3 1 914 2
2 862 1
4 1 296 0
2 889 3
5 1 663 4
2 1046 6
From your DataFrame:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
Types,c_years,o_periods,s_months,incidents
0,1,1,1,127.0,0.0
1,1,1,2,63.0,0.0
2,1,2,1,1095.0,3.0
3,1,2,2,1095.0,4.0
4,1,3,1,1512.0,6.0
5,1,3,2,3353.0,18.0
6,1,4,1,NaN,NaN
7,1,4,2,2244.0,11.0
14,2,4,1,NaN,NaN"""), sep=',')
>>> df
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
>>> df[['c_years', 's_months', 'incidents']] = df.groupby(['Types', 'o_periods']).transform(lambda x: x.fillna(x.mean()))
>>> df
Types c_years o_periods s_months incidents
0 1 1 1 127.000000 0.0
1 1 1 2 63.000000 0.0
2 1 2 1 1095.000000 3.0
3 1 2 2 1095.000000 4.0
4 1 3 1 1512.000000 6.0
5 1 3 2 3353.000000 18.0
6 1 4 1 911.333333 3.0
7 1 4 2 2244.000000 11.0
14 2 4 1 NaN NaN
The last NaN is still there because it belongs to the last group (Types=2, o_periods=1), which contains no values in the s_months and incidents columns and therefore has no mean to fill with.
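If you also want that last row filled (an assumption beyond the question), one possible fallback is the overall column means after the group-wise fill, continuing from the df above:
# assumption: groups with no observations get the overall column mean instead
df[['s_months', 'incidents']] = df[['s_months', 'incidents']].fillna(
    df[['s_months', 'incidents']].mean())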
Try this: df['s_months'] = df['s_months'].fillna(df['s_months'].mean())
df['s_months'].mean() computes the mean while ignoring NaNs. Note that this fills with the overall column mean rather than the per-group mean.
Your code is close; you can modify it as follows to make it work:
df[['s_months','incidents']] = df[['s_months','incidents']].fillna(df.groupby(['Types' , 'o_periods'])[['s_months','incidents']].transform('mean'))
Data Input:
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
Output:
Types c_years o_periods s_months incidents
0 1 1 1 127.000000 0.0
1 1 1 2 63.000000 0.0
2 1 2 1 1095.000000 3.0
3 1 2 2 1095.000000 4.0
4 1 3 1 1512.000000 6.0
5 1 3 2 3353.000000 18.0
6 1 4 1 911.333333 3.0
7 1 4 2 2244.000000 11.0
14 2 4 1 NaN NaN

How to get the null rows of certain columns in python? [duplicate]

Duplicate of: How to select rows with one or more nulls from a pandas DataFrame without listing columns explicitly?
I am facing an issue with null rows. I want only the rows that contain nulls in certain columns of a data frame. Is it possible to get those rows?
In [57]: df
Out[57]:
a b c d e
0 0 1 2 3 4
1 0 NaN 0 1 5
2 0 0 NaN NaN 5
3 0 1 2 5 NaN
4 0 1 2 6 NaN
Now I want the rows with nulls in b, c, e; the result should be this one:
Out[57]:
a b c d e
1 0 NaN 0 1 5
2 0 0 NaN NaN 5
3 0 1 2 5 NaN
4 0 1 2 6 NaN
You could use isna() combined with any(axis=1).
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 0, 0, 0], "b": [1, np.nan, 0, 1, 1], "c": [2, 0, np.nan, 2, 2], "d": [3, 1, np.nan, 5, 6], "e": [4, 5, 5, np.nan, np.nan]})
>>> df[df.isna().any(axis=1)]
a b c d e
1 0 NaN 0.0 1.0 5.0
2 0 0.0 NaN NaN 5.0
3 0 1.0 2.0 5.0 NaN
4 0 1.0 2.0 6.0 NaN
The same could be done using the isnull() function, which is an alias of isna():
df[df.isnull().any(axis=1)]
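Since the question asks for nulls only in certain columns, the check can also be restricted to that subset; a sketch using the b, c, e columns named in the question:
# rows with a null in any of the selected columns only
df[df[['b', 'c', 'e']].isna().any(axis=1)]
With this data the result is the same four rows, because every null here happens to sit in b, c or e.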

Turn a column's values into column headers with values 1 and 0 (accordingly) [python]

I got a column of the form:
0 q4
1 4
2 3
3 1
4 2
5 1
6 5
7 1
8 3
The column represents the answers of users to a question of 5 choices (1-5).
I want to turn this into a matrix of 5 columns, where the columns correspond to the 5 possible answers and the values are 1 or 0 according to the user's given answer.
Visually I want a matrix of the form:
0 q4_1 q4_2 q4_3 q4_4 q4_5
1 NaN NaN NaN 1 NaN
2 NaN NaN 1 NaN NaN
3 1 NaN NaN NaN NaN
4 NaN 1 NaN NaN NaN
5 1 NaN NaN NaN NaN
for i in range(1, 6):
    df['q4_' + str(i)] = np.where(df.q4 == i, 1, 0)
del df['q4']
Output:
>>> print(df)
q4_1 q4_2 q4_3 q4_4 q4_5
0 0 0 0 1 0
1 0 0 1 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 0 0 0 0
5 0 0 0 0 1
6 1 0 0 0 0
7 0 0 1 0 0
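For comparison, pd.get_dummies builds the same indicator columns in one call (the prefix argument yields the q4_1 ... q4_5 names); a minimal sketch:
import pandas as pd

df = pd.DataFrame({'q4': [4, 3, 1, 2, 1, 5, 1, 3]})
# one column per distinct answer; astype(int) because newer pandas versions
# return boolean dummies by default
out = pd.get_dummies(df['q4'], prefix='q4').astype(int)
One caveat: get_dummies only creates columns for values that actually occur, so an answer choice nobody picked would be missing, whereas the np.where loop above always creates all five columns.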
I think pivot is the way to go. You'd have to prepopulate the df with the info you want in the new table.
Also, I don't understand why you want only 5 rows, but I added that as well via iloc. If you remove it, you will get this data for your entire index (up to 8).
import pandas as pd
df = pd.DataFrame({'q4': [4, 3, 1, 2, 1, 5, 1, 3]})
df.index += 1
df['values'] = 1
df = df.reset_index().pivot(index='q4', columns='index', values='values').T.iloc[:5]
prints
q4 1 2 3 4 5
index
1 NaN NaN NaN 1.0 NaN
2 NaN NaN 1.0 NaN NaN
3 1.0 NaN NaN NaN NaN
4 NaN 1.0 NaN NaN NaN
5 1.0 NaN NaN NaN NaN

Fast way to get the number of NaNs in a column counted from the last valid value in a DataFrame

Say I have a DataFrame like
A B
0 0.1880 0.345
1 0.2510 0.585
2 NaN NaN
3 NaN NaN
4 NaN 1.150
5 0.2300 1.210
6 0.1670 1.290
7 0.0835 1.400
8 0.0418 NaN
9 0.0209 NaN
10 NaN NaN
11 NaN NaN
12 NaN NaN
I want a new DataFrame of the same shape where each entry is the number of NaNs counted from the last valid value up to that position, as follows
A B
0 0 0
1 0 0
2 1 1
3 2 2
4 3 0
5 0 0
6 0 0
7 0 0
8 0 1
9 0 2
10 1 3
11 2 4
12 3 5
I wonder if this can be done efficiently by using some pandas/NumPy functions?
You can use:
a = df.isnull()
b = a.cumsum()
df1 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
print (df1)
A B
0 0 0
1 0 0
2 1 1
3 2 2
4 3 0
5 0 0
6 0 0
7 0 0
8 0 1
9 0 2
10 1 3
11 2 4
12 3 5
For better understanding:
#mask the cumulative count with NaN where a is True
a2 = b.mask(a)
#forward-fill the NaNs
a3 = b.mask(a).ffill()
#replace remaining NaN with 0, cast to int
a4 = b.mask(a).ffill().fillna(0).astype(int)
#subtract a4 from b
a5 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
df1 = pd.concat([a, b, a2, a3, a4, a5], axis=1,
                keys=['a', 'b', 'where', 'ffill nan', 'subtract', 'output'])
print(df1)
print (df1)
a b where ffill nan subtract output
A B A B A B A B A B A B
0 False False 0 0 0.0 0.0 0.0 0.0 0 0 0 0
1 False False 0 0 0.0 0.0 0.0 0.0 0 0 0 0
2 True True 1 1 NaN NaN 0.0 0.0 0 0 1 1
3 True True 2 2 NaN NaN 0.0 0.0 0 0 2 2
4 True False 3 2 NaN 2.0 0.0 2.0 0 2 3 0
5 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
6 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
7 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
8 False True 3 3 3.0 NaN 3.0 2.0 3 2 0 1
9 False True 3 4 3.0 NaN 3.0 2.0 3 2 0 2
10 True True 4 5 NaN NaN 3.0 2.0 3 2 1 3
11 True True 5 6 NaN NaN 3.0 2.0 3 2 2 4
12 True True 6 7 NaN NaN 3.0 2.0 3 2 3 5
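An equivalent formulation, for comparison: group each column's NaN mask by the running count of valid values, so every run of NaNs forms its own group, then take the cumulative sum inside each group (a sketch, reusing df from the question):
# 1 where NaN, 0 where valid
m = df.isna().astype(int)
# s.eq(0).cumsum() increments at every valid value, so each run of NaNs
# shares one group id; the cumulative sum inside the group counts the run
df1 = m.apply(lambda s: s.groupby(s.eq(0).cumsum()).cumsum())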

Reading multiple csv files into a big pandas data frame by appending columns of different sizes

So I am creating some data frames in a loop and saving them as csv files. The data frames have the same columns but different lengths. I would like to be able to concatenate these data frames into a single data frame that has all the columns, something like
df1
A B C
0 0 1 2
1 0 1 0
2 1.2 1 1
3 2 1 2
df2
A B C
0 0 1 2
1 0 1 0
2 0.2 1 2
df3
A B C
0 0 1 2
1 0 1 0
2 1.2 1 1
3 2 1 4
4 1 2 2
5 2.3 3 0
I would like to get something like
df_big
A B C A B C A B C
0 0 1 2 0 1 2 0 1 2
1 0 1 0 0 1 0 0 1 0
2 1.2 1 1 0.2 1 2 1.2 1 1
3 2 1 2 2 1 4
4 1 2 2
5 2.3 3 0
Is this something that can be done in pandas?
You could use pd.concat:
df_big = pd.concat([df1, df2, df3], axis=1)
yields
A B C A B C A B C
0 0.0 1 2 0.0 1 2 0.0 1 2
1 0.0 1 0 0.0 1 0 0.0 1 0
2 1.2 1 1 0.2 1 2 1.2 1 1
3 2.0 1 2 NaN NaN NaN 2.0 1 4
4 NaN NaN NaN NaN NaN NaN 1.0 2 2
5 NaN NaN NaN NaN NaN NaN 2.3 3 0
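Since the frames come from csv files written in a loop, the same side-by-side concat can also be driven from disk; a sketch where the glob pattern and file names are assumptions:
import glob
import pandas as pd

# read the saved frames back in; sorted() keeps a deterministic order
frames = [pd.read_csv(path) for path in sorted(glob.glob('df*.csv'))]

# axis=1 aligns on the row index and pads the shorter frames with NaN
df_big = pd.concat(frames, axis=1)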
