Pandas - Find empty columns of the row and update in one column - python

I have the below code to read the Excel values:
import pandas as pd
import numpy as np
import os
df = pd.read_excel(os.path.join(os.getcwd(), 'TestData.xlsx'))
print(df)
The Excel data is:
   Employee ID First Name Last Name     Contact Technology  Comment
0            1   KARTHICK      RAJU  9500012345       .NET      NaN
1            2       TEST       NaN  9840112345       JAVA      NaN
2            3        NaN      TEST  9145612345        AWS      NaN
3            4        NaN       NaN  9123498765     Python      NaN
4            5       TEST      TEST  9156478965        NaN      NaN
The below code returns True for each cell that holds an empty value:
print(df.isna())
like below
Employee ID First Name Last Name Contact Technology Comment
0 False False False False False True
1 False False True False False True
2 False True False False False True
3 False True True False False True
4 False False False False True True
I want to add a comment for each row, like below:
Comment
Last Name is empty
First Name is empty
First Name and Last Name are empty
Technology is empty
One way of doing this is iterating over each row to find the empty columns and updating the comment based on them. If the table holds a huge amount of data, iteration may not be a good idea.
Is there a way to achieve this in a more pythonic way?

You can simplify the solution: instead of the is empty / are empty wording use a -, and use matrix multiplication via DataFrame.dot with the boolean mask and the column names joined to the new suffix; finally strip the trailing separator with Series.str.rstrip:
# if the Comment column already exists, drop it first
df = df.drop('Comment', axis=1)
df['Comment'] = df.isna().dot(df.columns + '-empty, ').str.rstrip(', ')
print(df)
Employee ID First Name Last Name Contact Technology \
0 1 KARTHICK RAJU 9500012345 .NET
1 2 TEST NaN 9840112345 JAVA
2 3 NaN TEST 9145612345 AWS
3 4 NaN NaN 9123498765 Python
4 5 TEST TEST 9156478965 NaN
Comment
0
1 Last Name-empty
2 First Name-empty
3 First Name-empty, Last Name-empty
4 Technology-empty
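For intuition on why dot works here: multiplying a boolean by a string in Python yields either the string (True) or '' (False), and dot then sums those products per row, which for strings means concatenation. A minimal sketch on a small hypothetical two-column frame:
import numpy as np
import pandas as pd

# hypothetical two-column frame, just to illustrate the trick
small = pd.DataFrame({'First Name': [np.nan, 'TEST'],
                      'Last Name': ['RAJU', np.nan]})
mask = small.isna()                    # boolean matrix, one row per record
labels = small.columns + '-empty, '    # one label string per column
# True * label == label, False * label == '', and summing concatenates
print(mask.dot(labels).str.rstrip(', '))
# 0    First Name-empty
# 1     Last Name-empty
# dtype: object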

Related

Annotating maximum by iterating each row and making a new column with the resultant output

Annotating the maximum by iterating each row and making a new column with the resultant output: can anyone help, using pandas in Python, with how to get the result?
text A B C
index
0 Cool False False True
1 Drunk True False False
2 Study False True False
Output:
Text Result
index
0 Cool False
1 Drunk False
2 Study False
If the sum of a row is more than half the number of boolean columns, True is the more common value in that row.
Try:
df["Result"] = df.drop("text", axis=1).sum(axis=1)>=len(df.columns)//2+1
output = df[["text", "Result"]]
>>> df
text Result
0 Cool False
1 Drunk False
2 Study False
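If you prefer to express the majority vote directly, a rough equivalent is to compare the row mean against 0.5; a sketch reconstructing the hypothetical frame above:
import pandas as pd

# reconstruction of the example frame from the question
df = pd.DataFrame({"text": ["Cool", "Drunk", "Study"],
                   "A": [False, True, False],
                   "B": [False, False, True],
                   "C": [True, False, False]})
bools = df.drop("text", axis=1)
# a row is majority-True when more than half of its boolean cells are True
df["Result"] = bools.mean(axis=1) > 0.5
print(df[["text", "Result"]])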

Create function that applies flag based on other dataframe rows

I have a dataframe that looks like this
date id type
02/02/2020 2 A
29/02/2020 2 B
04/03/2020 2 B
02/01/2020 3 B
15/01/2020 3 A
19/01/2020 3 C
... ... ...
I want to create a new column, called flagged. For each row, I want the value of flagged to be equal to True if there exists another row with
The same id
Type A
A date for which the difference in days with the date of the current row is bigger than 0 and smaller than 30
I would want the dataframe above to be transformed to this
date id type flagged
02/02/2020 2 A False
29/02/2020 2 B True
04/03/2020 2 B False
02/01/2020 3 B False
15/01/2020 3 A False
19/01/2020 3 C True
... ... ... ...
My Approach:
I created the following function
import datetime

def check_type(id, date):
    if df[(df.id == id) & (df.type == 'A') & (date - df.date > datetime.timedelta(0)) & (date - df.date < datetime.timedelta(30))].empty:
        return False
    else:
        return True
so that if I run
df['flagged'] = df.apply(lambda x: check_type(x.id, x.date), axis = 1)
I get the desired result.
Questions:
How do I change the function check_type so that it is applicable to any dataframe, no matter its name? The current function only works if the dataframe it is used on is called df.
How do I make this process faster? I want to run this function on a large dataframe, and it's not performing as fast as I would like.
Thanks in advance!
I would find the last date with type A, propagate it through each id group with ffill, and then take the difference:
last_dates = df.date.where(df['type'].eq('A')).groupby(df['id']).ffill()
# this is the new column
df.date.sub(last_dates).lt(pd.to_timedelta('30D')) & df['type'].ne('A')
Output:
0 False
1 True
2 False
3 False
4 False
5 True
dtype: bool
Note: this works given that you always mask A with False.
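Putting it together as a runnable sketch: the frame below is a hypothetical reconstruction of the question's data, and since the dates are dd/mm/yyyy strings they have to be parsed with dayfirst=True first:
import pandas as pd

# hypothetical reconstruction of the question's frame
df = pd.DataFrame({
    "date": ["02/02/2020", "29/02/2020", "04/03/2020",
             "02/01/2020", "15/01/2020", "19/01/2020"],
    "id":   [2, 2, 2, 3, 3, 3],
    "type": ["A", "B", "B", "B", "A", "C"],
})
df["date"] = pd.to_datetime(df["date"], dayfirst=True)  # parse dd/mm/yyyy

# date of the most recent 'A' row, forward-filled within each id
last_dates = df["date"].where(df["type"].eq("A")).groupby(df["id"]).ffill()
# within 30 days of that date, and not an 'A' row itself
df["flagged"] = df["date"].sub(last_dates).lt(pd.to_timedelta("30D")) & df["type"].ne("A")
print(df)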

Create columns in dataframe based on csv field

I have a pandas dataframe with the column "Values" that has comma separated values:
Row|Values
1|1,2,3,8
2|1,4
I want to create columns based on the CSV, and assign a boolean indicating if the row has that value, as follows:
Row|1,2,3,4,8
1|true,true,true,false,true
2|true,false,false,true,false
How can I accomplish that?
Thanks in advance
Just use str.get_dummies, then astype(bool) to change 1 to True and 0 to False:
df.set_index('Row')['Values'].str.get_dummies(',').astype(bool)
Out[318]:
1 2 3 4 8
Row
1 True True True False True
2 True False False True False
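If you also want to join the flags back alongside other columns, a sketch along the same lines, assuming the frame from the question:
import pandas as pd

# hypothetical reconstruction of the question's frame
df = pd.DataFrame({"Row": [1, 2], "Values": ["1,2,3,8", "1,4"]})
flags = df.set_index("Row")["Values"].str.get_dummies(",").astype(bool)
# join the indicator columns back, dropping the original CSV column
out = df.set_index("Row").join(flags).drop(columns="Values")
print(out)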

Python Dataframe get null value counts

I am trying to find the null values in a DataFrame. Though I reviewed the following Stack Overflow post describing how to determine null values, I am having a hard time doing the same for my dataset.
How to count the Nan values in the column in Panda Data frame
Working code:
import pandas as pd
a = ['america','britain','brazil','','china','jamaica'] #I deliberately introduce a NULL value
a = pd.DataFrame(a)
a.isnull()
#Output:
       0
0  False
1  False
2  False
3  False
4  False
5  False
a.isnull().sum()
#Output
#0 0
#dtype: int64
What am I doing wrong?
If you want '', None and NaN to all count as null, you can use the applymap method with a predicate that flags falsy values and NaN, and then use .sum on the result:
import pandas as pd
import numpy as np
a = ['america','britain','brazil',None,'', np.nan, 'china','jamaica'] # deliberately introduce None, '' and NaN
a = pd.DataFrame(a)
a.applymap(lambda x: not x or pd.isnull(x)).sum()
# 0 3
# dtype: int64
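One caveat: in pandas 2.1+, DataFrame.applymap is deprecated in favour of DataFrame.map, which takes the same element-wise callable. A sketch assuming a recent pandas:
import pandas as pd
import numpy as np

a = pd.DataFrame(['america', 'britain', 'brazil', None, '', np.nan, 'china', 'jamaica'])
# DataFrame.map is the element-wise replacement for applymap in pandas >= 2.1
print(a.map(lambda x: not x or pd.isnull(x)).sum())
# 0    3
# dtype: int64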
I hope this helps.
The '' in your list isn't a null value, it's an empty string. To get a null, use None instead. The pandas.isnull() documentation describes missing values as "NaN in numeric arrays, [or] None/NaN in object arrays".
import pandas as pd
a = ['america','britain','brazil',None,'china','jamaica']
a = pd.DataFrame(a)
a.isnull()
0
0 False
1 False
2 False
3 True
4 False
5 False
You can see the difference by printing the two dataframes. In the first case, the dataframe looks like:
pd.DataFrame(['america','britain','brazil','','china','jamaica'])
0
0 america
1 britain
2 brazil
3
4 china
5 jamaica
Notice that the value at index 3 is an empty string.
In the second case, you get:
pd.DataFrame(['america','britain','brazil',None,'china','jamaica'])
0
0 america
1 britain
2 brazil
3 None
4 china
5 jamaica
The other posts addressed that '' is not a null value and therefore isn't counted as such with the isnull method...
...However, '' does evaluate to False when interpreted as a bool.
a.astype(bool)
0
0 True
1 True
2 True
3 False
4 True
5 True
This might be useful if you have '' in your dataframe and want to process it this way.
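Building on that, one sketch that counts '' together with None/NaN by combining the two checks (assuming a frame containing all three kinds of missing value):
import pandas as pd
import numpy as np

a = pd.DataFrame(['america', 'britain', 'brazil', None, '', np.nan, 'china', 'jamaica'])
# falsy values ('' and None) fail astype(bool); NaN is caught by isnull
print((~a.astype(bool) | a.isnull()).sum())
# 0    3
# dtype: int64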

Extracting all rows from pandas Dataframe that have certain value in a specific column

I am relatively new to Python/Pandas and am struggling with extracting the correct data from a pd.DataFrame. What I actually have is a DataFrame with 3 columns:
data =
Position Letter Value
1 a TRUE
2 f FALSE
3 c TRUE
4 d TRUE
5 k FALSE
What I want to do is put all of the TRUE rows into a new DataFrame so that the answer would be:
answer =
Position Letter Value
1 a TRUE
3 c TRUE
4 d TRUE
I know that you can access a particular column using
data['Value']
but how do I extract all of the TRUE rows?
Thanks for any help and advice,
Alex
You can test which Values are True:
In [11]: data['Value'] == True
Out[11]:
0 True
1 False
2 True
3 True
4 False
Name: Value, dtype: bool
and then use boolean indexing to pull out those rows:
In [12]: data[data['Value'] == True]
Out[12]:
Position Letter Value
0 1 a True
2 3 c True
3 4 d True
*Note: if the values are actually the strings 'TRUE' and 'FALSE' (they probably shouldn't be!) then use:
data['Value'] == 'TRUE'
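As an aside, when Value really is a boolean dtype, comparing with == True is redundant; the column itself is a valid mask. A small sketch with hypothetical data:
import pandas as pd

data = pd.DataFrame({"Position": [1, 2, 3, 4, 5],
                     "Letter": list("afcdk"),
                     "Value": [True, False, True, True, False]})
# a boolean column can be used directly as the row filter
answer = data[data["Value"]]
print(answer)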
You can wrap your value/values in a list and do the following:
new_df = df.loc[df['yourColumnName'].isin(['your', 'list', 'items'])]
This returns a new dataframe consisting of the rows where the value in your column matches one of the list items.
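For instance, on a hypothetical frame shaped like the question's data, selecting the rows whose Letter is one of a few values might look like:
import pandas as pd

# hypothetical frame matching the question's data
data = pd.DataFrame({"Position": [1, 2, 3, 4, 5],
                     "Letter": list("afcdk"),
                     "Value": [True, False, True, True, False]})
# keep only rows whose Letter appears in the list
subset = data.loc[data["Letter"].isin(["a", "c", "d"])]
print(subset)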
