How to select rows with NaN in particular column? - python

Given this dataframe, how to select only those rows that have "Col2" equal to NaN?
df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)], columns=["Col1", "Col2", "Col3"])
which looks like:
   Col1  Col2  Col3
0     0     1     2
1     0   NaN     0
2     0     0   NaN
3     0     1     2
4     0     1     2
The result should be this one:
   Col1  Col2  Col3
1     0   NaN     0

Try the following:
df[df['Col2'].isnull()]

@qbzenker provided the most idiomatic method, IMO.
Here are a few alternatives:
In [28]: df.query('Col2 != Col2') # Using the fact that: np.nan != np.nan
Out[28]:
Col1 Col2 Col3
1 0 NaN 0.0
In [29]: df[np.isnan(df.Col2)]
Out[29]:
Col1 Col2 Col3
1 0 NaN 0.0

If you want to select rows with at least one NaN value, then you could use isna + any on axis=1:
df[df.isna().any(axis=1)]
If you want to select rows with a certain number of NaN values, then you could use isna + sum on axis=1 + gt. For example, the following will fetch rows with at least 2 NaN values:
df[df.isna().sum(axis=1)>1]
If you want to limit the check to specific columns, you could select them first, then check:
df[df[['Col1', 'Col2']].isna().any(axis=1)]
If you want to select rows with all NaN values, you could use isna + all on axis=1:
df[df.isna().all(axis=1)]
If you want to select rows with no NaN values, you could use notna + all on axis=1:
df[df.notna().all(axis=1)]
This is equivalent to:
df[df['Col1'].notna() & df['Col2'].notna() & df['Col3'].notna()]
which could become tedious if there are many columns. Instead, you could use functools.reduce to chain & operators:
import functools, operator
df[functools.reduce(operator.and_, (df[i].notna() for i in df.columns))]
or numpy.logical_and.reduce:
import numpy as np
df[np.logical_and.reduce([df[i].notna() for i in df.columns])]
If you want to filter the rows where there is no NaN in some column using query, you can do so by passing the engine='python' parameter:
df.query('Col2.notna()', engine='python')
or use the fact that NaN != NaN, like @MaxU - stop WAR against UA:
df.query('Col2==Col2')
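To check several of the variants above against the question's data, here is a minimal sketch. The variable names (col2_missing, any_missing, and so on) are only illustrative, and the frame is rebuilt with literal lists and np.nan.

```python
import numpy as np
import pandas as pd

# Same data as in the question, written with literal lists.
df = pd.DataFrame(
    [[0, 1, 2], [0, np.nan, 0], [0, 0, np.nan], [0, 1, 2], [0, 1, 2]],
    columns=["Col1", "Col2", "Col3"],
)

# Rows where Col2 is NaN.
col2_missing = df[df["Col2"].isna()]

# Rows with at least one NaN in any column.
any_missing = df[df.isna().any(axis=1)]

# Rows with two or more NaN values (none in this data).
at_least_two = df[df.isna().sum(axis=1) > 1]

# Rows with no NaN at all.
complete = df[df.notna().all(axis=1)]

# query-based variant relying on NaN != NaN.
via_query = df.query("Col2 != Col2")

print(col2_missing.index.tolist())
print(any_missing.index.tolist())
print(complete.index.tolist())
```

Comparing the index lists makes it easy to see which rows each mask keeps.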

Related

How to convert categorial data into indices with nan values present in Python?

Context
I have created a function, that converts Categorial Data into its unique indices. This works great with all values except NaN.
It seems that the comparison with NaN does not work. This results in the two problems seen below.
Code
col1
0 male
1 female
2 NaN
3 female
def categorial(series: pandas.Series) -> pandas.Series:
    series = series.copy()
    for index, value in enumerate(series.unique()):
        # Problem 1: The output for the Value NaN is always 0.0 %, even though nan is present in the given series.
        print(index, value, round(series[series == value].count() / len(series) * 100, 2), '%')
    for index, value in enumerate(series.unique()):
        # Problem 2: Every unique Value is converted to its Index except NaN.
        series[series == value] = index
    return series.astype(pandas.Int64Dtype())
Question
How can I solve the two problems seen in the code above?
How should missing values (NaNs) be encoded?
In pandas, they are encoded as -1 by default:
print (pd.factorize(categorial(df['col1']))[0])
[ 0 1 -1 1]
print (df['col1'].astype('category').cat.codes)
0 1
1 0
2 -1
3 0
dtype: int8
You can use fillna with astype and factorize:
df['col1'] = df['col1'].fillna('nan').astype(str).factorize()[0]
Sample:
df = pd.DataFrame({'col1':['a','b',np.nan,'c']})
print (df)
col1
0 a
1 b
2 NaN
3 c
df['col1'] = df['col1'].fillna('nan').astype(str).factorize()[0]
print (df)
col1
0 0
1 1
2 2
3 3
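Both problems in the question come from the fact that NaN never compares equal to anything, itself included, so series == value cannot match the missing entries. As a rough sketch (not the asker's function, and with illustrative variable names), value_counts with dropna=False covers the percentages and factorize covers the codes:

```python
import numpy as np
import pandas as pd

s = pd.Series(["male", "female", np.nan, "female"], name="col1")

# NaN is not equal to itself, which is why `series == value` misses it.
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)

# Problem 1: percentage of each value, counting NaN as its own group.
shares = s.value_counts(normalize=True, dropna=False) * 100
print(shares)

# Problem 2: integer codes in order of first appearance;
# missing values receive the sentinel -1.
codes, uniques = pd.factorize(s)
print(codes)
print(list(uniques))
```

If NaN should get its own non-negative code instead of -1, the fillna('nan') approach shown above does that.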

If else function with NaN values

I imported an Excel file and now I need to multiply certain values from the list, but if the value in the first column is NaN, Python should use another column for the calculation. I have the following code:
if pd['Column1'] == 'NaN':
    pd['Column2'] * pd['Column3']
else:
    pd['Column1'] * pd['Column3']
Thank you for your help.
You can use isna() together with any() or all(). Here is an example:
import pandas as pd
import numpy as np
# generating test data assuming all the values in Col1 are 'NaN'
df = pd.DataFrame({'Col1':[np.nan,np.nan,np.nan,np.nan], 'Col2':[1,2,3,4], 'Col3':[2,3,4,5]})
if df['Col1'].isna().all():  # you can also use 'any()' instead of all()
    df['Col4'] = df['Col2']*df['Col3']
else:
    df['Col4'] = df['Col1']*df['Col3']
print(df)
Output:
Col1 Col2 Col3 Col4
0 NaN 1 2 2
1 NaN 2 3 6
2 NaN 3 4 12
3 NaN 4 5 20
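The example above picks one formula for the whole column. If the fallback should instead be decided row by row, as the question seems to ask, one possible sketch is to fill the gaps in Column1 from Column2 before multiplying. The sample values and the Result column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Column1 is missing in some rows only.
df = pd.DataFrame({
    "Column1": [2.0, np.nan, 4.0, np.nan],
    "Column2": [10, 20, 30, 40],
    "Column3": [1, 2, 3, 4],
})

# Use Column1 where it is present, otherwise fall back to Column2.
factor = df["Column1"].fillna(df["Column2"])
df["Result"] = factor * df["Column3"]

print(df)
```

The same result can be written with np.where(df["Column1"].isna(), df["Column2"], df["Column1"]) * df["Column3"] if you prefer an explicit if/else form.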

Get column name based on condition in pandas

I have a dataframe as below:
For each row, I want to get the name of the column that contains 1 in that row.
Use DataFrame.dot:
df1 = df.dot(df.columns)
If there is multiple 1 per row:
df2 = df.dot(df.columns + ';').str.rstrip(';')
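Since the question's dataframe was not posted, here is a small sketch of the dot approach on made-up 0/1 data. It compares with 1 first, which is an addition to the answer above, so that only exact 1s are picked up. Multiplying a boolean by a column name yields either the name or an empty string, and dot concatenates those per row.

```python
import pandas as pd

# Hypothetical 0/1 data, since the original frame was not shown.
df = pd.DataFrame(
    {"foo": [0, 0, 0, 0], "bar": [0, 1, 0, 0], "baz": [0, 0, 0, 0], "spam": [0, 1, 0, 1]},
    index=["a", "b", "c", "d"],
)

# True * "bar;" gives "bar;", False * "bar;" gives "", and dot joins them per row.
names = (df == 1).dot(df.columns + ";").str.rstrip(";")

print(names)
```

Rows without any 1 end up with an empty string.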
Firstly
Your question is very ambiguous, and I recommend reading the link in @sammywemmy's comment. If I understand your problem correctly, we'll talk about this mask first:
df.columns[
(df == 1) # mask
.any(axis=0) # mask
]
What's happening? Let's work our way outward, starting from within df.columns[**HERE**]:
(df == 1) makes a boolean mask of the df with True/False(1/0)
.any() as per the docs:
"Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent".
This gives us a handy Series to mask the column names with.
We will use this example to automate for your solution below
Next:
Automate to get an output of (<row index> ,[<col name>, <col name>,..]) where there is 1 in the row values. Although this will be slower on large datasets, it should do the trick:
import pandas as pd
data = {'foo':[0,0,0,0], 'bar':[0, 1, 0, 0], 'baz':[0,0,0,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data, index=['a','b','c','d'])
print(df)
foo bar baz spam
a 0 0 0 0
b 0 1 0 1
c 0 0 0 0
d 0 0 0 1
# group our df by index and create a dict with the index labels as keys and sub-dataframes as values
df_dict = dict(
    list(
        df.groupby(df.index)
    )
)
Next step is a for loop that iterates the contents of each df in df_dict, checks them with the mask we created earlier, and prints the intended results:
for k, v in df_dict.items():  # k: name of index, v: is a df
    check = v.columns[(v == 1).any()]
    if len(check) > 0:
        print((k, check.to_list()))
('b', ['bar', 'spam'])
('d', ['spam'])
Side note:
You see how I generated sample data that can be easily reproduced? In the future, please try to ask questions with posted sample data that can be reproduced. This way it helps you understand your problem better and it is easier for us to answer it for you.
Getting the column name splits into two cases.
First case: if you want the name in a new column, the condition must match a single column per row, because a new column can hold only one column name for each row.
data = {'foo':[0,0,3,0], 'bar':[0, 5, 0, 0], 'baz':[0,0,2,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data)
df=df.replace(0,np.nan)
df
foo bar baz spam
0 NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0
2 3.0 NaN 2.0 NaN
3 NaN NaN NaN 1.0
If you were looking for min or maximum
max= df.idxmax(1)
min = df.idxmin(1)
out= df.assign(max=max , min=min)
out
foo bar baz spam max min
0 NaN NaN NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0 bar spam
2 3.0 NaN 2.0 NaN foo baz
3 NaN NaN NaN 1.0 spam spam
Second case: if your condition is satisfied in multiple columns (for example, you are looking for the columns that contain 1), the result is a list of columns, because it is not possible to fit it into the same dataframe.
str_con= df.astype(str).apply(lambda x:x.str.contains('1.0',case=False, na=False)).any()
df.columns[str_con]
#output
Index(['spam'], dtype='object') #only spam contains 1
Or, for a numerical condition, you are looking for the columns that contain a value greater than 1:
num_con = df.apply(lambda x:x>1.0).any()
df.columns[num_con]
#output
Index(['foo', 'bar', 'baz'], dtype='object') #these col has higher value than 1
Happy learning

How can I match values on a matrix on python using pandas?

I'm trying to match values in a matrix on python using pandas dataframes. Maybe this is not the best way to express it.
Imagine you have the following dataset:
import pandas as pd
d = {'stores':['','','','',''],'col1': ['x','price','','',1],'col2':['y','quantity','',1,''], 'col3':['z','',1,'',''] }
df = pd.DataFrame(data=d)
stores col1 col2 col3
0 NaN x y z
1 NaN price quantity NaN
2 NaN NaN NaN 1
3 NaN NaN 1 NaN
4 NaN 1 NaN NaN
I'm trying to get the following:
stores col1 col2 col3
0 NaN x y z
1 NaN price quantity NaN
2 z NaN NaN 1
3 y NaN 1 NaN
4 x 1 NaN NaN
Any ideas how this might work? I've tried running loops on lists but I'm not quite sure how to do it.
This is what I have so far but it's just terrible (and obviously not working) and I am sure there is a much simpler way of doing this but I just can't get my head around it.
stores = ['x','y','z']
for i in stores:
    for v in df.iloc[0,:]:
        if i==v :
            df['stores'] = i
It yields the following:
stores col1 col2 col3
0 z x y z
1 z price quantity NaN
2 z NaN NaN 1
3 z NaN 1 NaN
4 z 1 NaN NaN
Thank you in advance.
You can complete this task with a loop as follows. It loops through each column except the first one ('stores'), which is where the data is written, takes the index values where the value is 1, and writes that column's value from the first row into 'stores'.
Be careful if a row has 1s in multiple columns: in that case 'stores' will be filled with the label of the last column that had a 1.
for col in df.columns[1:]:
    index_values = df[col][df[col]==1].index.tolist()
    df.loc[index_values, 'stores'] = df[col][0]
You can fill the whole column at once, like this:
df["stores"] = df[["col1", "col2", "col3"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
This first creates a version of the dataframe with the columns renamed "x", "y", and "z" after the values in the first row; then idxmax(axis=1) returns the column heading associated with the max value in each row (which is the True one).
However this adds an "x" in rows where none of the columns has a 1. If that is a problem you could do something like this:
df["NA"] = 1 # add a column of ones
df["stores"] = df[["col1", "col2", "col3", "NA"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
df["stores"] = df["stores"].replace(1, np.nan) # replace the 1s with NaNs
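As a possible alternative to the helper column, the result of idxmax can be blanked out wherever a row contains no 1 at all. This is only a sketch on the question's data; the names value_cols, labels, and mask are illustrative.

```python
import pandas as pd

d = {
    "stores": ["", "", "", "", ""],
    "col1": ["x", "price", "", "", 1],
    "col2": ["y", "quantity", "", 1, ""],
    "col3": ["z", "", 1, "", ""],
}
df = pd.DataFrame(data=d)

value_cols = ["col1", "col2", "col3"]
labels = df.loc[0, value_cols].tolist()  # ['x', 'y', 'z'] from the first row

# Boolean mask of cells equal to 1, with columns relabelled x / y / z.
mask = df[value_cols].eq(1)
mask.columns = labels

# idxmax returns the first True column per row; where() drops rows with no 1.
df["stores"] = mask.idxmax(axis=1).where(mask.any(axis=1))

print(df)
```

Rows 0 and 1 hold no 1, so their 'stores' value becomes NaN instead of a misleading "x".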

change nan values in pandas

In my code the df.fillna() method is not working, while the df.dropna() method is. I don't want to drop the column, though. What can I do so that the fillna() method works?
def preprocess_df(df):
    for col in df.columns:  # go through all of the columns
        if col != "target":  # normalize all ... except for the target itself!
            df[col] = df[col].pct_change()  # pct change "normalizes" the different currencies (each crypto coin has vastly diff values, we're really more interested in the other coin's movements)
            # df.dropna(inplace=True)  # remove the nas created by pct_change
            df.fillna(method="ffill", inplace=True)
            print(df)
            break
            df[col] = preprocessing.scale(df[col].values)  # scale between 0 and 1.
It should work, unless the problem is that it is called inside the loop, as mentioned.
You should consider filling before you start the loop or while constructing the DataFrame.
The example below clearly shows it working:
>>> df
col1
0 one
1 NaN
2 two
3 NaN
Works as expected:
>>> df['col1'].fillna( method ='ffill') # This is showing column specific to `col1`
0 one
1 one
2 two
3 two
Name: col1, dtype: object
Secondly, if you wish to change only a few selected columns, use the method below.
Let's suppose you have 3 columns and want to forward-fill only 2 of them.
>>> df
col1 col2 col3
0 one test new
1 NaN NaN NaN
2 two rest NaN
3 NaN NaN NaN
Define the columns to be changed:
cols = ['col1', 'col2']
>>> df[cols] = df[cols].fillna(method ='ffill')
>>> df
col1 col2 col3
0 one test new
1 one test NaN
2 two rest NaN
3 two rest NaN
If you want it to apply across the entire DataFrame, use it as follows:
>>> df
col1 col2
0 one test
1 NaN NaN
2 two rest
3 NaN NaN
>>> df.fillna(method ='ffill') # pass inplace=True if you want the change to be permanent.
col1 col2
0 one test
1 one test
2 two rest
3 two rest
The first value was a NaN, so I had to use the bfill method instead. Thanks, everyone.
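The reason forward fill appeared not to work is that pct_change() always leaves NaN in the first row, and a forward fill has nothing earlier to copy from. A minimal sketch with made-up prices (the btc and eth columns are hypothetical), using the ffill() and bfill() methods:

```python
import pandas as pd

# Hypothetical price data standing in for the asker's dataframe.
prices = pd.DataFrame({"btc": [100.0, 110.0, 99.0], "eth": [10.0, 10.0, 12.0]})

changes = prices.pct_change()  # the first row is NaN by construction

# Forward fill cannot fix the first row: there is no previous value.
forward = changes.ffill()

# Back fill (here chained after the forward fill) copies the next valid value upward.
filled = changes.ffill().bfill()

print(forward)
print(filled)
```

Whether copying the second row's change into the first row is acceptable depends on the use case; dropping that row is the other common choice.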
