This question already has answers here:
How to select rows with one or more nulls from a pandas DataFrame without listing columns explicitly?
(6 answers)
Closed 2 years ago.
I am facing a issue with null rows. I want only the null rows of only certain columns of a data frame. Is it possible to get the null rows?
In [57]: df
Out[57]:
a b c d e
0 0 1 2 3 4
1 0 NaN 0 1 5
2 0 0 NaN NaN 5
3 0 1 2 5 Nan
4 0 1 2 6 Nan
Now I want nulls in b,c,e the result should be this one:
Out[57]:
a b c d e
1 0 NaN 0 1 5
2 0 0 NaN NaN 5
3 0 1 2 5 Nan
4 0 1 2 6 Nan
You could use isna() for axis=1.
df = pd.DataFrame({"a":[0,0,0,0,0], "b":[1,np.NaN,0,1,1], "c":[2,0,np.NaN,2,2], "d":[3,1,np.NaN,5,6], "e":[4,5,5,np.NaN,np.NaN]})
>>> df[df.isna().any(axis=1)]
a b c d e
1 0 NaN 0.0 1.0 5.0
2 0 0.0 NaN NaN 5.0
3 0 1.0 2.0 5.0 NaN
4 0 1.0 2.0 6.0 NaN
The same could be done using isnull() function
df[df.isnull().any(axis=1)]
Related
I have a dataframe with a lot of NaNs.
y columns mean the count of events, val means values of each event in that yeat, and total means a multiplication of both columns.
Many columns have zeros and many have NaNs because values are not available (up to 80% of data is missing) is 4 columns.
y17 y18 y19 y20 val17 va18 val19 val20 total17 total18 total19 total20
1 2 1 2 2 2 2 2 1 4 2 4
2 2 2 2 2 2 2 2 4 4 4 4
3 3 3 3 NaN NaN NaN NaN NaN NaN NaN NaN
0 0 0 0 1 2 3 4 0 0 0 0
0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN
I want to keep rows with all values with zeros and numbers AND I want to keep rows where first four columns (multiple condition) have zeros.
Expected output
y17 y18 y19 y20 val17 va18 val19 val20 total17 total18 total19 total20
1 2 1 2 2 2 2 2 1 4 2 4
2 2 2 2 2 2 2 2 4 4 4 4
0 0 0 0 1 2 3 4 0 0 0 0
0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN
Thanks!
Just pass the condition with all
out = df[df.iloc[:,:4].eq(0).all(1) | df.notna().all(1)]
Out[386]:
y17 y18 y19 y20 val17 ... val20 total17 total18 total19 total20
0 1 2 1 2 2.0 ... 2.0 1.0 4.0 2.0 4.0
1 2 2 2 2 2.0 ... 2.0 4.0 4.0 4.0 4.0
3 0 0 0 0 1.0 ... 4.0 0.0 0.0 0.0 0.0
4 0 0 0 0 NaN ... NaN NaN NaN NaN NaN
[4 rows x 12 columns]
I got a column of the form :
0 q4
1 4
2 3
3 1
4 2
5 1
6 5
7 1
8 3
The column represents the answers of users to a question of 5 choices (1-5).
I want to turn this into a matrix of 5 columns where the indexes are the 5 possible answers and the values are 1 or 0 according to the user's given answer.
Visualy i want a matrix of the form:
0 q4_1 q4_2 q4_3 q4_4 q4_5
1 Nan Nan Nan 1 Nan
2 Nan Nan 1 Nan Nan
3 1 Nan Nan Nan Nan
4 Nan 1 Nan Nan Nan
5 1 Nan Nan Nan Nan
for i in range(1,6):
df['q4_'+str(i)]=np.where(df.q4==i, 1, 0)
def df['q4']
Output:
>>> print(df)
q4_1 q4_2 q4_3 q4_4 q4_5
0 0 0 0 1 0
1 0 0 1 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 0 0 0 0
5 0 0 0 0 1
6 1 0 0 0 0
7 0 0 1 0 0
I think pivot is the way to go. You'd have to prepopulate the df with the info you want in the new table.
Also, I don't understand why you want only 5 rows but I added it as well in iloc. If you remove it, you will have this data for your entire index (up to 8).
import pandas as pd
df = pd.DataFrame({'q4': [4, 3, 1, 2, 1, 5, 1, 3]})
df.index += 1
df['values'] = 1
df = df.reset_index().pivot(index='q4', columns='index', values='values').T.iloc[:5]
prints
q4 1 2 3 4 5
index
1 NaN NaN NaN 1.0 NaN
2 NaN NaN 1.0 NaN NaN
3 1.0 NaN NaN NaN NaN
4 NaN 1.0 NaN NaN NaN
5 1.0 NaN NaN NaN NaN
I have this two Series in a DataFrame:
A B
1 2
2 3
2 1
4 3
5 2
and I would to create a new column df['C] that counts how many times the value in column df['A']is higher than the value in column df['B'] for a rolling window of the previous 2 (or more) rows.
The result would be something like this:
A B C
1 2 NaN
2 3 NaN
2 1 0
4 3 1
5 2 2
I would also like to create a column that sums the data in df['A'] higher than df['B'] always using a rolling window.
With the following result:
A B C D
1 2 NaN NaN
2 3 NaN NaN
2 1 0 0
4 3 1 2
5 2 2 6
Thanks in advance.
IIUC
df.assign(C=df.A.gt(df.B).rolling(2).sum().shift(),D=(df.A.gt(df.B)*df.A).rolling(2).sum().shift())
Out[1267]:
A B C D
0 1 2 NaN NaN
1 2 3 NaN NaN
2 2 1 0.0 0.0
3 4 3 1.0 2.0
4 5 2 2.0 6.0
I am trying to use the dropna function in pandas. I would like to use it for a specific column.
I can only figure out how to use it to drop NaN if ALL rows have ALL NaN values.
I have a dataframe (see below) that I would like to drop all rows after the first occurance of an NaN in a specific column, column "A"
current code, only works if all row values are NaN.
data.dropna(axis = 0, how = 'all')
data
Original Dataframe
data = pd.DataFrame({"A": (1,2,3,4,5,6,7,"NaN","NaN","NaN"),"B": (1,2,3,4,5,6,7,"NaN","9","10"),"C": range(10)})
data
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
4 5 5 4
5 6 6 5
6 7 7 6
7 NaN NaN 7
8 NaN 9 8
9 NaN 10 9
What I would like the output to look like:
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
4 5 5 4
5 6 6 5
6 7 7 6
Any help on this is appreciated.
Obviously I am would like to do it in the cleanest most efficient way possible.
Thanks!
use iloc + argmax
data.iloc[:data.A.isnull().values.argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 4 3
4 5.0 5 4
5 6.0 6 5
6 7.0 7 6
or with a different syntax
top_data = data[:data['A'].isnull().argmax()]
Re: accepted answer. If column in question has no NaNs, argmax returns 0 and thus df[:argmax] will return an empty dataframe.
Here's my workaround:
max_ = data.A.isnull().argmax()
max_ = len(data) if max_ == 0 else max_
top_data = data[:max_]
I have dataframe where column 1 should have all the values from 1 to 169. If a value doesnt exists, I'd like to add a new row to my dataframe which contains the said value (and some zeros).
I can't get the following code to work, even tho there are no errors:
for i in range(1,170):
if i in df.col1 is False:
df.loc[len(df)+1] = [i,0,0]
else:
continue
Any advices?
It would be better to do something like:
In [37]:
# create our test df, we have vales 1 to 9 in steps of 2
df = pd.DataFrame({'a':np.arange(1,10,2)})
df['b'] = np.NaN
df['c'] = np.NaN
df
Out[37]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
In [38]:
# now set the index to a, this allows us to reindex the values with optional fill value, then reset the index
df = df.set_index('a').reindex(index = np.arange(1,10), fill_value=0).reset_index()
df
Out[38]:
a b c
0 1 NaN NaN
1 2 0 0
2 3 NaN NaN
3 4 0 0
4 5 NaN NaN
5 6 0 0
6 7 NaN NaN
7 8 0 0
8 9 NaN NaN
So just to explain the above:
In [40]:
# set the index to 'a', this allows us to reindex and fill missing values
df = df.set_index('a')
df
Out[40]:
b c
a
1 NaN NaN
3 NaN NaN
5 NaN NaN
7 NaN NaN
9 NaN NaN
In [41]:
# now reindex and pass fill_value for the extra rows we want
df = df.reindex(index = np.arange(1,10), fill_value=0)
df
Out[41]:
b c
a
1 NaN NaN
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
6 0 0
7 NaN NaN
8 0 0
9 NaN NaN
In [42]:
# now reset the index
df = df.reset_index()
df
Out[42]:
a b c
0 1 NaN NaN
1 2 0 0
2 3 NaN NaN
3 4 0 0
4 5 NaN NaN
5 6 0 0
6 7 NaN NaN
7 8 0 0
8 9 NaN NaN
If you modified your loop to the following then it would work:
In [63]:
for i in range(1,10):
if any(df.a.isin([i])) == False:
df.loc[len(df)+1] = [i,0,0]
else:
continue
df
Out[63]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
6 2 0 0
7 4 0 0
8 6 0 0
9 8 0 0
EDIT
If you wanted the missing rows to appear at the end of the df then you could just create a temporary df with the full range of values and other columns set to zero and then filter this df based on the values that are missing in the other df and concatenate them:
In [70]:
df_missing = pd.DataFrame({'a':np.arange(10),'b':0,'c':0})
df_missing
Out[70]:
a b c
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
6 6 0 0
7 7 0 0
8 8 0 0
9 9 0 0
In [73]:
df = pd.concat([df,df_missing[~df_missing.a.isin(df.a)]], ignore_index=True)
df
Out[73]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
5 0 0 0
6 2 0 0
7 4 0 0
8 6 0 0
9 8 0 0
The expression if i in df.col1 is False always evaluates to false. I think it is looking in the index. Also I think you need to use pandas.concat in modern versions of pandas instead of assigning to df.loc[].
I would recommend gathering all missing values in a list then concatenating them to the dataframe at the end. For instance
>>> df = pd.DataFrame({'col1': range(5) + [i + 6 for i in range(5)], 'col2': range(10)})
>>> print df
col1 col2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 6 5
6 7 6
7 8 7
8 9 8
9 10 9
>>> to_add = []
>>> for i in range(11):
... if i not in df.col1.values:
... to_add.append([i, 0])
... else:
... continue
...
>>> pd.concat([df, pd.DataFrame(to_add, columns=['col1', 'col2'])])
col1 col2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 6 5
6 7 6
7 8 7
8 9 8
9 10 9
0 5 0
I assume you don't care about the index values of the rows you add.