I am trying to find the null values in a DataFrame. I reviewed the following Stack Overflow post, which describes how to count the null values, but I am having a hard time doing the same for my dataset:
How to count the Nan values in the column in Panda Data frame
Working code:
import pandas as pd
a = ['america','britain','brazil','','china','jamaica'] #I deliberately introduce a NULL value
a = pd.DataFrame(a)
a.isnull()
#Output:
       0
0  False
1  False
2  False
3  False
4  False
5  False
a.isnull().sum()
#Output
#0 0
#dtype: int64
What am I doing wrong?
If you want '', None, and NaN to all count as null, you can use applymap to test each value in the dataframe and then call .sum() on the result:
import pandas as pd
import numpy as np
a = ['america','britain','brazil',None,'', np.nan, 'china','jamaica'] #deliberately introduce several null-like values
a = pd.DataFrame(a)
a.applymap(lambda x: not x or pd.isnull(x)).sum()  # flags falsy values ('' and None) as well as real NaNs
# 0 3
# dtype: int64
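A vectorized alternative, if you prefer to avoid applymap (a sketch; it assumes the single column is labeled 0, as in the example above):
import numpy as np
import pandas as pd

a = pd.DataFrame(['america', 'britain', 'brazil', None, '', np.nan, 'china', 'jamaica'])

# .isna() catches None and NaN; .eq('') catches empty strings
(a[0].isna() | a[0].eq('')).sum()
# 3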
I hope this helps.
The '' in your list isn't a null value; it's an empty string. To get a null, use None instead. The pandas.isnull() documentation describes missing values as "NaN in numeric arrays, [or] None/NaN in object arrays".
import pandas as pd
a = ['america','britain','brazil',None,'china','jamaica']
a = pd.DataFrame(a)
a.isnull()
0
0 False
1 False
2 False
3 True
4 False
5 False
You can see the difference by printing the two dataframes. In the first case, the dataframe looks like:
pd.DataFrame(['america','britain','brazil','','china','jamaica'])
0
0 america
1 britain
2 brazil
3
4 china
5 jamaica
Notice that the value at index 3 is an empty string.
In the second case, you get:
pd.DataFrame(['america','britain','brazil',None,'china','jamaica'])
0
0 america
1 britain
2 brazil
3 None
4 china
5 jamaica
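At the scalar level the distinction is easy to check (a quick demo):
import pandas as pd
pd.isnull('')    # False - an empty string is not a missing value
pd.isnull(None)  # True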
The other posts addressed the fact that '' is not a null value and therefore isn't counted as one by the isnull method...
...However, '' does evaluate to False when interpreted as a bool.
a.astype(bool)
0
0 True
1 True
2 True
3 False
4 True
5 True
This might be useful if you have '' in your dataframe and want to process it this way.
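One caveat worth noting: this catches '' and None (both falsy), but not NaN, since bool(float('nan')) is True. A quick sketch on the question's original dataframe:
# count the falsy entries; '' and None are falsy, but NaN is truthy,
# so casting to bool alone will not flag real NaNs
(~a.astype(bool)).sum()
# 0    1
# dtype: int64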
Related
I'm trying to create a column that identifies whether a URL is present or not in an existing column called "links". In the new column, I'd like all NaN values to become 0 and any URLs to be denoted as 1. I tried the following but was unable to get the correct values.
def url(x):
    if x == 'NaN':
        return 0
    else:
        return 1
df['url1'] = df['links'].apply(url)
df.head()
You can use pd.isnull(x) instead of the x == 'NaN' comparison:
import pandas as pd
df['url1'] = df['links'].apply(lambda x: 0 if pd.isnull(x) else 1)
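A quick self-contained demo (the sample links here are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'links': ['http://example.com', np.nan, 'http://example.org']})
df['url1'] = df['links'].apply(lambda x: 0 if pd.isnull(x) else 1)
print(df)
#                 links  url1
# 0  http://example.com     1
# 1                 NaN     0
# 2  http://example.org     1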
See my comment, but the simplest and most performant thing you can do to get your desired output is to use a pandas method:
input:
import numpy as np
import pandas as pd
df = pd.DataFrame({'links' : [np.nan, 'a', 'b', np.nan]})
In [1]: df
Out[1]:
links
0 NaN
1 a
2 b
3 NaN
output:
df['url1'] = df['links'].notnull().astype(int)
In [2]: df
Out[2]:
links url1
0 NaN 0
1 a 1
2 b 1
3 NaN 0
notnull() returns True or False, and .astype(int) converts True to 1 and False to 0. Python booleans are integers underneath (True == 1, False == 0), so casting the column to int simply exposes that underlying value.
Related to my comment: the string 'True' is not equal to the boolean True, and the string 'False' is not equal to False, just as the string 'NaN' is not equal to NaN (note the quotes versus no quotes).
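A short demo of that point:
import numpy as np

'NaN' == np.nan   # False - a string is never equal to the float NaN
np.nan == np.nan  # also False! NaN never compares equal, which is why pd.isnull exists
'True' == True    # False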
I have the below code to read Excel values:
import pandas as pd
import numpy as np
import os
df = pd.read_excel(os.path.join(os.getcwd(), 'TestData.xlsx'))
print(df)
The Excel data is:
Employee ID First Name Last Name Contact Technology Comment
0 1 KARTHICK RAJU 9500012345 .NET
1 2 TEST 9840112345 JAVA
2 3 TEST 9145612345 AWS
3 4 9123498765 Python
4 5 TEST TEST 9156478965
The below code returns True for any cell that holds an empty value
print(df.isna())
like below
Employee ID First Name Last Name Contact Technology Comment
0 False False False False False True
1 False False True False False True
2 False True False False False True
3 False True True False False True
4 False False False False True True
I want to add a comment for each row, like below:
Comment
Last Name is empty
First Name is empty
First Name and Last Name are empty
Technology is empty
One way of doing this is to iterate over each row, find the empty indices, and update the comment based on them. But if the table holds a lot of data, iteration may not be a good idea.
Is there a way to achieve this in a more Pythonic way?
You can simplify the solution (using a '-empty' suffix instead of 'is'/'are'): use matrix multiplication, DataFrame.dot, between the boolean mask and the column names plus the new suffix, then remove the trailing separator with Series.str.rstrip:
# drop the Comment column first if it already exists
df = df.drop('Comment', axis=1)

df['Comment'] = df.isna().dot(df.columns + '-empty, ').str.rstrip(', ')
print(df)
Employee ID First Name Last Name Contact Technology \
0 1 KARTHICK RAJU 9500012345 .NET
1 2 TEST NaN 9840112345 JAVA
2 3 NaN TEST 9145612345 AWS
3 4 NaN NaN 9123498765 Python
4 5 TEST TEST 9156478965 NaN
Comment
0
1 Last Name-empty
2 First Name-empty
3 First Name-empty, Last Name-empty
4 Technology-empty
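If you need the exact wording from the question ('is empty' / '... and ... are empty'), here is a row-wise sketch; it trades the fully vectorized dot trick for an apply, and assumes the same df as above:
def describe_missing(row):
    cols = row.index[row].tolist()  # names of the columns that are empty in this row
    if not cols:
        return ''
    verb = 'are' if len(cols) > 1 else 'is'
    return ' and '.join(cols) + ' ' + verb + ' empty'

mask = df.drop(columns='Comment', errors='ignore').isna()
df['Comment'] = mask.apply(describe_missing, axis=1)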
I have an issue with a column in a pandas data frame. Due to data input errors, the column should contain only true and false, but it also contains around 71 decimal values.
I am trying to get rid of the decimals and turn them into nan so I can ignore those rows for further analysis.
When I try:
datafinal['any_misread'] = datafinal['any_misread'].where(datafinal['any_misread'] < 1, np.nan)
I get the error:
TypeError: unorderable types: str() < int()
I have also tried logics with .replace and with no success.
What am I missing here?
Let's try using where and astype:
df = pd.DataFrame({'col1':[True, False, 0.12, True, False, .3]})
df.where((df.col1.astype(str) == 'True') | (df.col1.astype(str) == 'False'))
Output:
col1
0 True
1 False
2 NaN
3 True
4 False
5 NaN
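The same mask reads a little cleaner with isin (a minor variant of the above):
df.where(df.col1.astype(str).isin(['True', 'False']))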
You can check if the type of each item in the column is not a bool and change the value.
import numpy as np

df = pd.DataFrame([[True],[True],[False],[10.2],[1.0],[False],[0]], columns=['misread'])
# use .loc to avoid chained assignment (pd.np was removed in pandas 2.0)
df.loc[df.misread.apply(lambda x: not isinstance(x, bool)), 'misread'] = np.nan
df
# returns
misread
0 True
1 True
2 False
3 NaN
4 NaN
5 False
6 NaN
Very simple question, everyone, but it's nearly impossible to find answers to basic questions in the official documentation.
I have a dataframe object in Pandas that has rows and columns.
One of the columns, named "CBSM", contains boolean values. I need to delete all rows from the dataframe where the value of the CBSM column = "Y".
I see that there is a method called dataframe.drop()
labels, axis, and level are three of the parameters that the drop() method takes. I have no clue what values to give these parameters to delete the rows in the fashion I described above. I have a feeling the drop() method is not the right way to do what I want.
Please advise, thanks.
This method is called boolean indexing.
You can try loc with str.contains:
df.loc[~df['CBSM'].str.contains('Y')]
Sample:
print(df)
A CBSM L
0 1 Y 4
1 1 N 6
2 2 N 3
print(df['CBSM'].str.contains('Y'))
0 True
1 False
2 False
Name: CBSM, dtype: bool
# inverted boolean Series
print(~df['CBSM'].str.contains('Y'))
0 False
1 True
2 True
Name: CBSM, dtype: bool
print(df.loc[~df['CBSM'].str.contains('Y')])
A CBSM L
1 1 N 6
2 2 N 3
Or:
print(df.loc[~(df['CBSM'] == 'Y')])
A CBSM L
1 1 N 6
2 2 N 3
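Since you asked about drop() specifically: you can use it for this, but you have to hand it the index labels of the rows you want removed (a sketch on the same sample data):
# select the rows where CBSM == 'Y', then drop them by their index labels
print(df.drop(df[df['CBSM'] == 'Y'].index))
#    A CBSM  L
# 1  1    N  6
# 2  2    N  3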
I am relatively new to Python/pandas and am struggling to extract the correct data from a pd.DataFrame. What I have is a DataFrame with 3 columns:
data =
Position Letter Value
1 a TRUE
2 f FALSE
3 c TRUE
4 d TRUE
5 k FALSE
What I want to do is put all of the TRUE rows into a new Dataframe so that the answer would be:
answer =
Position Letter Value
1 a TRUE
3 c TRUE
4 d TRUE
I know that you can access a particular column using
data['Value']
but how do I extract all of the TRUE rows?
Thanks for any help and advice,
Alex
You can test which Values are True:
In [11]: data['Value'] == True
Out[11]:
0 True
1 False
2 True
3 True
4 False
Name: Value, dtype: bool
and then use boolean indexing to pull out those rows:
In [12]: data[data['Value'] == True]
Out[12]:
Position Letter Value
0 1 a True
2 3 c True
3 4 d True
*Note: if the values are actually the strings 'TRUE' and 'FALSE' (they probably shouldn't be!) then use:
data['Value'] == 'TRUE'
You can wrap your value/values in a list and do the following:
new_df = df.loc[df['yourColumnName'].isin(['your', 'list', 'items'])]
This returns a new dataframe consisting of the rows where the values in your column match an item in your list.
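For this question's data, that might look like (assuming the Value column holds real booleans):
answer = data.loc[data['Value'].isin([True])]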