Pandas - Compare single row with all rows in same dataframe - python

I would like to add a column to an existing dataframe which shows a count value. The count value should compare a value in a given row versus all rows in another column.
In my example, I want to count, for each row, how many values in the entire 'end_date' column are earlier than that row's 'start_date'. Adding the count to the dataframe like so:
   start_date   end_date  count
1   2020-09-2  2020-09-3      1
2   2020-09-6  2020-09-7      3
3   2020-09-4  2020-09-5      2
4   2020-09-1  2020-09-1      0
I have tried
df['count'] = (df[df['end_date']<df['start_date']]).count()
but this results in the count column being 0 for all rows as the start_date is always less than the end_date within any one row.

import pandas as pd
my_dict = {'start_date': ['2020-09-02', '2020-09-06', '2020-09-04', '2020-09-01']}
df = pd.DataFrame.from_dict(my_dict)
df['count'] = 0
for index, row in df.iterrows():
    df.at[index, 'count'] = df[df['start_date'] < row['start_date']].count()[1]
print(df)

You want count[i] = the number of times compare[:] (the whole end_date column) is less than ref[i] (this row's start_date).
What you computed was count[i] = whether compare[i] < ref[i], a single row-wise comparison.
A straightforward way is to iterate over the rows and calculate each count individually.
for i, row in df.iterrows():
    df.at[i, 'count'] = (df['end_date'] < row['start_date']).sum()
(df['end_date'] < row['start_date']) returns a boolean Series: True where the condition is satisfied, False otherwise. .sum() counts True values as 1 and False values as 0.
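
If the row-by-row loop gets slow on a larger frame, a vectorized sketch using numpy broadcasting (not part of the original answer; it assumes the date columns are converted to datetime first) could look like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'start_date': pd.to_datetime(['2020-09-02', '2020-09-06', '2020-09-04', '2020-09-01']),
    'end_date': pd.to_datetime(['2020-09-03', '2020-09-07', '2020-09-05', '2020-09-01']),
})

# ends[j] < starts[i] for every pair (i, j); summing over j gives the count for row i
ends = df['end_date'].to_numpy()
starts = df['start_date'].to_numpy()[:, None]
df['count'] = (ends < starts).sum(axis=1)
print(df)

This builds an n x n boolean matrix, so it trades memory for speed compared with the loop.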

You can try with an outer join
counts = (
    pd.merge(
        df[["start_date"]].assign(temp=1),
        df[["end_date"]].assign(temp=1),
        on="temp",
        how="outer",
    )
    .query("start_date > end_date")
    .groupby("start_date")
    .temp.count()
)
df = df.merge(counts, on="start_date", how="left").fillna(0, downcast="infer")

Related

Efficient way to find row in df2 based on condition from value in df1

I have two dataframes. df1 has ~31,000 rows, while df2 has ~117,000 rows. I want to add a column to df1 based on the following conditions.
(df1.id == df2.id) and (df2.min_value < df1.value <= df2.max_value)
I know that df2 will return either 0 or 1 rows satisfying the condition for each value of id in df1. For each row in df1, I want to add a column from df2 when the above condition is satisfied.
My current code is as follows. It is a line by line approach.
new_df1 = pd.DataFrame(columns = df1.columns.tolist() + [new_col])
for i, row in df1.iterrows():
    val = row['value']
    id = row['id']
    dummy = df2[(df2.id == id) & (df2.max_value >= val) & (df2.min_value < val)]
    if dummy.shape[0] == 0:
        new_col = np.nan
    else:
        new_col = dummy.new_column.values[0]
    l = len(new_df1)
    new_df1.loc[l] = row.tolist() + [new_col]
This is a time costly approach. Is there a way to more efficiently do this problem?
You can merge df1 and df2 based on the id column:
merged_df = df1.merge(df2, on='id', how='left')
Now, any row in df1 for which the id matches an id of a row in df2 will have all the df2 columns placed alongside it. Then, you can simply filter the merged dataframe for your given condition:
merged_df.query('min_value < value and value <= max_value')
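A minimal end-to-end sketch of this approach (the sample data and the final back-join are illustrative additions, not part of the answer; rows of df1 without a match keep NaN in new_column):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [5, 12, 7]})
df2 = pd.DataFrame({'id': [1, 2], 'min_value': [0, 10],
                    'max_value': [10, 20], 'new_column': ['a', 'b']})

merged_df = df1.merge(df2, on='id', how='left')
matched = merged_df.query('min_value < value and value <= max_value')

# left-join the matches back so that df1 rows without a match keep NaN in new_column
new_df1 = df1.merge(matched[['id', 'value', 'new_column']], on=['id', 'value'], how='left')
print(new_df1)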

Comparing values from different rows in groupby

I would like to print each time inconsistency where a row's start differs from the previous row's end, grouped by the 'id' column. In the following data, the last row would be a case of inconsistency.
start,end,id
0,2,1
1,5,2
2,10,1
5,7,2
7,9,2
11,13,1
I have managed to do this using a for loop:
def check_consistency(df):
    grouped_df = df.groupby('id')
    for key, group in grouped_df:
        df = pd.DataFrame()
        df['start'] = group['start'].iloc[1:]
        df['end'] = group['end'].shift().iloc[1:]
        consistent = df['start'] == df['end']
        if not all(consistent):
            print(key)
            print(df[consistent == False])
Is there a way to achieve the same goal without using a for loop and creating an auxiliary DataFrame?
Edit: following is the expected output.
DataFrame:
df = pd.DataFrame({'start': [0,1,2,5,7,11], 'end': [2,5,10,7,9,13], 'id': [1,2,1,2,2,1]})
Expected output:
1
start end
5 11 10.0
First, we sort by id. Then we build a mask comparing each start with the previous row's end, within each id group.
For each group, the first entry of the mask is defaulted to True, since it has no previous row and should not be selected for our extraction.
Finally, we select the rows where the mask is False (start not equal to the previous row's end) by using .loc with the negation of the boolean mask.
df1 = df.sort_values('id', kind='mergesort')  # mergesort is stable, so the original order is kept within each id
mask = (df1['start']
        .eq(df1['end'].shift())
        .groupby(df1['id'])
        .transform(lambda x: [True] + x.iloc[1:].tolist())
        )
df1.loc[~mask]
Output:
start end id
5 11 13 1
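
A shorter variant (a sketch, not part of the answer above): shift 'end' within each id group directly, skip each group's first row, and compare in place:

import pandas as pd

df = pd.DataFrame({'start': [0,1,2,5,7,11], 'end': [2,5,10,7,9,13], 'id': [1,2,1,2,2,1]})
prev_end = df.groupby('id')['end'].shift()  # previous row's end within each id group (NaN for first rows)
print(df.loc[prev_end.notna() & (df['start'] != prev_end)])

For the example DataFrame this prints the same inconsistent row (index 5: start 11, end 13, id 1).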

pandas add column to dataframe aggregate on time series

I've done a dataframe aggregation and I want to add a new column in which, if there is a value > 0 in year 2020 for the row, it will put a 1, otherwise 0.
This is my code and the head of the dataframe:
df['year'] = pd.DatetimeIndex(df['TxnDate']).year  # add a year column
df['client'] = df['Customer'].str.split(' ').str[:3].str.join(' ')  # add a column with the first 3 words of Customer
Datedebut = df['year'].min()
Datefin = df['year'].max()
#print(df)
df1 = df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()
print(df1)
df1['nb2020']= np.where( df1['year']==2020, 1, 0)
The df1 printed before the last line looks like that:
The last line raises this error: KeyError: 'year'
Thanks
When you performed the aggregation and unstacked (df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()), the values of the year column were expanded into columns, and these columns are a MultiIndex. You can look at them by calling:
print (df1.columns)
And then you can select them.
Using the MultiIndex column
So to select the column which matches to 2020 you can use:
df1.loc[:, df1.columns.get_level_values(2).isin({2020})]
You can probably get the correct column and then check whether 2020 has a non-zero value using:
df1['nb2020'] = df1.loc[:,df1.columns.get_level_values('year').isin({2020})] > 0
If you would like to have the 1 and 0 (instead of the bool types), you can convert to int (using astype).
Renaming the columns
If you think this is a bit complicated, you might prefer to change the columns to a single index, using something like
df1.columns = df1.columns.get_level_values('year')
Or
df1.columns = df1.columns.get_level_values(2)
And then
df1['nb2020'] = (df1[2020] > 0).astype(int)
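
For reference, a small self-contained sketch of the whole flow (the Customer/TxnDate/Amount values are invented; only the column names come from the question):

import pandas as pd

df = pd.DataFrame({
    'Customer': ['Acme Corp Ltd branch', 'Acme Corp Ltd branch', 'Beta Inc SA office'],
    'TxnDate': ['2019-05-01', '2020-03-15', '2020-07-20'],
    'Amount': [100, 250, 0],
})
df['year'] = pd.DatetimeIndex(df['TxnDate']).year
df['client'] = df['Customer'].str.split(' ').str[:3].str.join(' ')

df1 = df.groupby(['client', 'year']).agg({'Amount': ['sum']}).unstack()
df1.columns = df1.columns.get_level_values('year')  # flatten the MultiIndex columns to just the years
df1['nb2020'] = (df1[2020] > 0).astype(int)
print(df1)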

Pandas: Filter by values within multiple columns

I'm trying to filter a dataframe based on the values within the multiple columns, based on a single condition, but keep other columns to which I don't want to apply the filter at all.
I've reviewed these answers, with the third being the closest, but still no luck:
how do you filter pandas dataframes by multiple columns
Filtering multiple columns Pandas
Python Pandas - How to filter multiple columns by one value
Setup:
import pandas as pd
df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2],
    'a': ['A', 'A', 'A', 'A', 'NONE'],
    'b': ['B', 'B', 'B', 'B', 'B'],
    'c': ['C', 'C', 'C', 'NONE', 'NONE']
}, columns=['month', 'a', 'b', 'c'])
l = ['month','a','c']
df = df.loc[df['month'] == df['month'].max(), df.columns.isin(l)].reset_index(drop = True)
Current Output:
month a c
0 2 A NONE
1 2 NONE NONE
Desired Output:
month a
0 2 A
1 2 NONE
I've tried:
sub = l[1:]
df = df[(df.loc[:, sub] != 'NONE').any(axis = 1)]
and many other variations (.all(), [sub, :], ~df.loc[...], (axis = 0)), but all with no luck.
Basically I want to drop any column (within the sub list) that has all 'NONE' values in it.
Any help is much appreciated.
You first want to substitute your 'NONE' with np.nan so that it is recognized as a null value by dropna. Then use loc with your boolean series and column subset. Then use dropna with axis=1 and how='all'
df.replace('NONE', np.nan) \
  .loc[df.month == df.month.max(), l].dropna(axis=1, how='all')
   month    a
3      2    A
4      2  NaN
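Note that the 'NONE' cells that survive the column drop come back as NaN because of the replace. If you want the literal string and a fresh index back (matching the desired output above), you could chain a fillna and reset_index; this is a small extension of the answer, not part of it:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2],
    'a': ['A', 'A', 'A', 'A', 'NONE'],
    'b': ['B', 'B', 'B', 'B', 'B'],
    'c': ['C', 'C', 'C', 'NONE', 'NONE']
})
l = ['month', 'a', 'c']

out = (df.replace('NONE', np.nan)
         .loc[df.month == df.month.max(), l]
         .dropna(axis=1, how='all')
         .fillna('NONE')
         .reset_index(drop=True))
print(out)
#    month     a
# 0      2     A
# 1      2  NONE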

python pandas if 1 in column A, use value of column B in same row

I have two columns, BinaryCol which is, as you may have guessed, 0s and 1s, and OnsetTime which ranges from 0 to 294. I want to make a new column which will contain the value of OnsetTime only for the rows where BinaryCol = 1
I currently have this:
df['Test'] = df['BinaryCol'].apply(lambda row: ['OnsetTime'] if row['BinaryCol'] > 0 else 0, axis=1)
but it doesn't work.
Just take the product of the two columns:
df['Test'] = df['OnsetTime'] * df['BinaryCol']
You can use numpy's where:
df['Test'] = np.where(df['BinaryCol'], df['OnsetTime'], np.NaN)
df['BinaryCol'] is the condition, df['OnsetTime'] is the value if the condition is True, and np.NaN the value if the condition is False.
You need to apply your function to a data frame, not a series
df['Test'] = df.apply(lambda row: (row['OnsetTime'] if row['BinaryCol'] == 1 else 0), axis = 1)
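
For a quick side-by-side check, a small sketch (the column values are invented) showing what each suggestion produces:

import numpy as np
import pandas as pd

df = pd.DataFrame({'BinaryCol': [0, 1, 1, 0], 'OnsetTime': [10, 37, 120, 294]})

df['Test_product'] = df['OnsetTime'] * df['BinaryCol']                  # 0 where BinaryCol is 0
df['Test_where'] = np.where(df['BinaryCol'], df['OnsetTime'], np.nan)   # NaN where BinaryCol is 0
df['Test_apply'] = df.apply(lambda row: row['OnsetTime'] if row['BinaryCol'] == 1 else 0, axis=1)
print(df)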
