Create function that applies flag based on other dataframe rows - python

I have a dataframe that looks like this:
date        id  type
02/02/2020  2   A
29/02/2020  2   B
04/03/2020  2   B
02/01/2020  3   B
15/01/2020  3   A
19/01/2020  3   C
...         ..  ...
I want to create a new column, called flagged. For each row, I want the value of flagged to be True if there exists another row with:
- the same id,
- type A,
- a date whose difference in days from the current row's date is greater than 0 and smaller than 30.
I would want the dataframe above to be transformed into this:
date        id  type  flagged
02/02/2020  2   A     False
29/02/2020  2   B     True
04/03/2020  2   B     False
02/01/2020  3   B     False
15/01/2020  3   A     False
19/01/2020  3   C     True
...         ..  ...   ...
My Approach:
I created the following function:
def check_type(id, date):
    if df[(df.id == id) & (df.type == 'A') &
          (date - df.date > datetime.timedelta(0)) &
          (date - df.date < datetime.timedelta(30))].empty:
        return False
    else:
        return True
so that if I run
df['flagged'] = df.apply(lambda x: check_type(x.id, x.date), axis=1)
I get the desired result.
Questions:
1. How do I change the function check_type so that it works on any dataframe, regardless of the variable name? The current function works only because it references the global df.
2. How do I make this process faster? I want to run this function on a large dataframe, and it is not performing as fast as I would like.
Thanks in advance!

I would find the last date with type A, propagate it within each id with ffill, and take the difference:
# the most recent type-A date seen so far within each id
last_dates = df.date.where(df['type'].eq('A')).groupby(df['id']).ffill()
# this is the new column: within 30 days of the last A date, and not itself a type-A row
df.date.sub(last_dates).lt(pd.to_timedelta('30D')) & df['type'].ne('A')
Output:
0 False
1 True
2 False
3 False
4 False
5 True
dtype: bool
Note: this works given that type-A rows themselves are always masked to False.
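A self-contained sketch of this approach (my reconstruction, not the answerer's exact code); it assumes the date column holds day-first strings as in the sample and that rows are already sorted by date within each id:
import pandas as pd

df = pd.DataFrame({
    'date': ['02/02/2020', '29/02/2020', '04/03/2020',
             '02/01/2020', '15/01/2020', '19/01/2020'],
    'id': [2, 2, 2, 3, 3, 3],
    'type': ['A', 'B', 'B', 'B', 'A', 'C'],
})
# parse the day-first date strings into real datetimes
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# forward-fill each id's most recent type-A date
last_dates = df['date'].where(df['type'].eq('A')).groupby(df['id']).ffill()
# flag rows within 30 days after that date, excluding type-A rows themselves
df['flagged'] = df['date'].sub(last_dates).lt(pd.to_timedelta('30D')) & df['type'].ne('A')
print(df)
As for the first question, the usual fix is simply to pass the dataframe in as a parameter, e.g. def check_type(frame, id, date) (frame being a name of my choosing), so the function no longer depends on a global named df.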

Related

A faster method than "for" to scan a DataFrame - Python

I'm looking for a way (using built-in pandas functions) to scan a column of a DataFrame, comparing its values across different indices.
Here is an example using a for loop. I have a dataframe with a single column col_1, and I want to create a column col_2 with True/False values in this way:
df["col_2"] = "False"
N=5
for idx in range(0,len(df)-N):
for i in range (idx+1,idx+N+1):
if(df["col_1"].iloc[idx]==df["col_1"].iloc[i]):
df["col_2"].iloc[idx]=True
What I'm trying to do is compare the value of col_1 at the i-th index with the next N indices. I'd like to do the same operation without a for loop. I've already tried shift and df.loc, but the computational time is similar.
Have you tried doing something like
df["col_1_shifted"] = df["col_1"].shift(N)
df["col_2"] = (df["col_1"] == df["col_1_shifted"])
Update: looking more carefully at your double loop, it seems you want to flag all duplicates except the last occurrence. As suggested by @QuangHoang in the comments, duplicated() works nicely for this; just pass keep='last' instead of the default 'first':
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
Example:
df = pd.DataFrame(np.random.randint(0, 5, 10), columns=['col_1'])
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
>>> newdf
col_1 col_2
0 2 False
1 0 True
2 1 True
3 0 True
4 0 False
5 3 False
6 1 True
7 1 False
8 4 True
9 4 False
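If the strict "only within the next N rows" semantics of the original loop matter, here is a vectorized sketch using negative shifts (my illustration, with df and N as in the question); note that, unlike the loop, it also checks the final N rows against whatever rows remain below them:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 5, 10), columns=['col_1'])
N = 5

# True where the same value reappears within the next 1..N rows
df['col_2'] = np.logical_or.reduce(
    [df['col_1'].eq(df['col_1'].shift(-k)) for k in range(1, N + 1)]
)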

Pandas - Find empty columns of the row and update in one column

I have the below code to read Excel values:
import pandas as pd
import numpy as np
import os
df = pd.read_excel(os.path.join(os.getcwd(), 'TestData.xlsx'))
print(df)
The Excel data is:
   Employee ID First Name Last Name     Contact Technology  Comment
0            1   KARTHICK      RAJU  9500012345       .NET      NaN
1            2       TEST       NaN  9840112345       JAVA      NaN
2            3        NaN      TEST  9145612345        AWS      NaN
3            4        NaN       NaN  9123498765     Python      NaN
4            5       TEST      TEST  9156478965        NaN      NaN
The below code returns True wherever a cell holds an empty value
print(df.isna())
like below
Employee ID First Name Last Name Contact Technology Comment
0 False False False False False True
1 False False True False False True
2 False True False False False True
3 False True True False False True
4 False False False False True True
I want to add the comment for each row like below
Comment
Last Name is empty
First Name is empty
First Name and Last Name are empty
Technology is empty
One way of doing this is iterating over each row to find the empty columns and updating the comment based on their indices.
If the table holds huge data, iteration may not be a good idea.
Is there a way to achieve this in a more pythonic way?
You can simplify the solution: instead of the "is"/"are" wording, use a "-empty" suffix, build the comments by matrix multiplication of the boolean mask with the column names via DataFrame.dot, and finally remove the trailing separator with Series.str.rstrip:
# drop the Comment column first if it already exists
df = df.drop('Comment', axis=1)
df['Comment'] = df.isna().dot(df.columns + '-empty, ').str.rstrip(', ')
print(df)
Employee ID First Name Last Name Contact Technology \
0 1 KARTHICK RAJU 9500012345 .NET
1 2 TEST NaN 9840112345 JAVA
2 3 NaN TEST 9145612345 AWS
3 4 NaN NaN 9123498765 Python
4 5 TEST TEST 9156478965 NaN
Comment
0
1 Last Name-empty
2 First Name-empty
3 First Name-empty, Last Name-empty
4 Technology-empty
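If the exact "is empty" / "are empty" phrasing from the question matters, a row-wise sketch (my addition; slower than the vectorized dot, and assuming the same frame):
def phrase(row):
    # names of the columns that are empty in this row
    cols = list(row.index[row])
    if not cols:
        return ''
    verb = 'is' if len(cols) == 1 else 'are'
    return ' and '.join(cols) + f' {verb} empty'

df['Comment'] = df.drop(columns='Comment', errors='ignore').isna().apply(phrase, axis=1)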

Use boolean series of different length to select rows from dataframe

I have a dataframe that looks like this:
df = pd.DataFrame({"piece": ["piece1", "piece2", "piece3", "piece4"], "No": [1, 1, 2, 3]})
No piece
0 1 piece1
1 1 piece2
2 2 piece3
3 3 piece4
I have a series with an index that corresponds to the "No"-column in the dataframe. It assigns boolean variables to the "No"-values, like so:
s = pd.Series([True, False, True, True])
0 True
1 False
2 True
3 True
dtype: bool
I would like to select those rows from the dataframe where in the series the "No"-value is True. This should result in
No piece
2 2 piece3
3 3 piece4
I've tried a lot of indexing, with df["No"].isin(s) and things like df[s["No"] == True], but nothing has worked yet.
I think you need to map the values in the No column to the True/False condition and use the result for subsetting:
df[df.No.map(s)]
# No piece
#2 2 piece3
#3 3 piece4
df.No.map(s)
# 0 False
# 1 False
# 2 True
# 3 True
# Name: No, dtype: bool
You are trying to index into s using df['No'], then use the result as a mask on df itself:
df[s[df['No']].values]
The final mask needs to be extracted as an array using values because the duplicates in the index cause an error otherwise.
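A quick check that the two approaches agree, with df and s as defined in the question:
import pandas as pd

df = pd.DataFrame({"piece": ["piece1", "piece2", "piece3", "piece4"], "No": [1, 1, 2, 3]})
s = pd.Series([True, False, True, True])

a = df[df.No.map(s)]
b = df[s[df['No']].values]
assert a.equals(b)  # both select rows 2 and 3
print(a)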

How do I delete a row in Pandas dataframe when a specific column contains a value that signals to me that the row should be deleted?

Very simple question, everyone, but it's nearly impossible to find answers to basic questions in the official documentation.
I have a dataframe object in Pandas that has rows and columns.
One of the columns, named "CBSM", contains boolean values. I need to delete all rows from the dataframe where the value of the CBSM column = "Y".
I see that there is a method called dataframe.drop()
labels, axis, and level are three of the parameters that the drop() method takes. I have no clue what values to give these parameters to accomplish my goal of deleting the rows in the fashion I described above. I have a feeling the drop() method is not the right way to do what I want.
Please advise, thanks.
This method is called boolean indexing.
You can try loc with str.contains:
df.loc[~df['CBSM'].str.contains('Y')]
Sample:
print(df)
A CBSM L
0 1 Y 4
1 1 N 6
2 2 N 3
print(df['CBSM'].str.contains('Y'))
0 True
1 False
2 False
Name: CBSM, dtype: bool
# inverted boolean series
print(~df['CBSM'].str.contains('Y'))
0 False
1 True
2 True
Name: CBSM, dtype: bool
print(df.loc[~df['CBSM'].str.contains('Y')])
A CBSM L
1 1 N 6
2 2 N 3
Or:
print(df.loc[~(df['CBSM'] == 'Y')])
A CBSM L
1 1 N 6
2 2 N 3
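Since the question specifically asks about drop(), an equivalent sketch that deletes the matching rows by index (boolean indexing is usually simpler); the sample frame is my assumption:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'CBSM': ['Y', 'N', 'N'], 'L': [4, 6, 3]})

# drop() removes rows by label, so pass it the index of the rows to delete
df = df.drop(df[df['CBSM'] == 'Y'].index)
print(df)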

Extracting all rows from pandas Dataframe that have certain value in a specific column

I am relatively new to Python/pandas and am struggling with extracting the correct data from a pd.DataFrame. What I actually have is a DataFrame with 3 columns:
data =
Position Letter Value
1 a TRUE
2 f FALSE
3 c TRUE
4 d TRUE
5 k FALSE
What I want to do is put all of the TRUE rows into a new Dataframe so that the answer would be:
answer =
Position Letter Value
1 a TRUE
3 c TRUE
4 d TRUE
I know that you can access a particular column using
data['Value']
but how do I extract all of the TRUE rows?
Thanks for any help and advice,
Alex
You can test which Values are True:
In [11]: data['Value'] == True
Out[11]:
0 True
1 False
2 True
3 True
4 False
Name: Value, dtype: bool
and then use fancy indexing to pull out those rows:
In [12]: data[data['Value'] == True]
Out[12]:
Position Letter Value
0 1 a True
2 3 c True
3 4 d True
Note: if the values are actually the strings 'TRUE' and 'FALSE' (they probably shouldn't be!), then use:
data['Value'] == 'TRUE'
You can wrap your value/values in a list and do the following:
new_df = df.loc[df['yourColumnName'].isin(['your', 'list', 'items'])]
This will return a new dataframe consisting of the rows where the values in your column match the items in your list.
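For example, with the data from the question (the list of letters is my illustration):
import pandas as pd

data = pd.DataFrame({
    'Position': [1, 2, 3, 4, 5],
    'Letter': ['a', 'f', 'c', 'd', 'k'],
    'Value': [True, False, True, True, False],
})

# keep only the rows whose Letter appears in the list
new_df = data.loc[data['Letter'].isin(['a', 'c', 'd'])]
print(new_df)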
