Comparing values from different rows in groupby - python

I would like to print each time inconsistency, i.e. each row whose start differs from the previous row's end, grouped by the 'id' column. In the following data, the last row would be such an inconsistency.
start,end,id
0,2,1
1,5,2
2,10,1
5,7,2
7,9,2
11,13,1
I have managed to do this using a for loop:
def check_consistency(df):
    grouped_df = df.groupby('id')
    for key, group in grouped_df:
        df = pd.DataFrame()
        df['start'] = group['start'].iloc[1:]
        df['end'] = group['end'].shift().iloc[1:]
        consistent = df['start'] == df['end']
        if not all(consistent):
            print(key)
            print(df[consistent == False])
Is there a way to achieve the same goal without using a for loop and creating an auxiliary DataFrame?
Edit: following is the expected output.
DataFrame:
df = pd.DataFrame({'start': [0,1,2,5,7,11], 'end': [2,5,10,7,9,13], 'id': [1,2,1,2,2,1]})
Expected output:
1
start end
5 11 10.0

First, sort by id with a stable sort, so the original row order within each id is kept. Then build a mask that compares each start with the previous row's end, grouped by id.
Within each group, the first entry of the mask is forced to True, since that row has no previous row and should not be flagged.
Finally, select the rows where the mask is False (start not equal to the previous row's end) by using .loc with the negated boolean mask.
df1 = df.sort_values('id', kind='mergesort')  # mergesort is stable, so the original order is kept within each id
mask = (df1['start']
        .eq(df1['end'].shift())
        .groupby(df1['id'])
        .transform(lambda x: [True] + x.iloc[1:].tolist())
        )
df1.loc[~mask]
Output:
start end id
5 11 13 1
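For reference, a minimal alternative sketch on the question's df (not part of the answer above): a grouped shift lines up each row's start with the previous end of the same id, so no explicit sorting is needed.

import pandas as pd

df = pd.DataFrame({'start': [0, 1, 2, 5, 7, 11],
                   'end': [2, 5, 10, 7, 9, 13],
                   'id': [1, 2, 1, 2, 2, 1]})

prev_end = df.groupby('id')['end'].shift()          # NaN for each group's first row
bad = prev_end.notna() & df['start'].ne(prev_end)   # flag start != previous end within the id
print(df[bad])                                      # row 5: start 11, end 13, id 1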

Related

Removing rows based on the combined value of other rows

I want to remove the row where "SubAggregate" == 'All' if the other rows with the same "Month" and "MainAggregate" sum ("ValueTraded") to the same value as that 'All' row.
My idea was to group by "MainAggregate" and "Month": if a group's total equals twice the value in its 'All' row, that 'All' row should be deleted. I only got to the point where I grouped the data:
Tester = data.groupby(["Month", "MainAggregate"], as_index=False)["ValueTraded"].sum()
Tester["ValueTraded"] = Tester["ValueTraded"] / 2
An example of the data and the desired output were posted as images in the original question.
You can compare the per-group sum (via GroupBy.transform('sum')) with the first 'All' value per group (via GroupBy.transform('first')), and, because only the first duplicated 'All' row should be removed, also add DataFrame.duplicated:
mask = df['subaAggregate'].eq('All')
s = df.groupby(["Month", "MainAggregate"])["valueTrad"].transform('sum').div(2)
s1 = (df.assign(new=df['valueTrad'].where(mask))
        .groupby(["Month", "MainAggregate"])["new"].transform('first'))
out = df[s.ne(s1) | ~mask | df.duplicated(['Month', 'MainAggregate', 'subaAggregate'])]
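As a quick sanity check, a minimal sketch on made-up data (the original table was only posted as images, so the values below are assumptions; the column names follow the code above):

import pandas as pd

df = pd.DataFrame({'Month': [1, 1, 1, 2, 2, 2],
                   'MainAggregate': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
                   'subaAggregate': ['A', 'B', 'All', 'A', 'B', 'All'],
                   'valueTrad': [10, 20, 30, 5, 5, 11]})

mask = df['subaAggregate'].eq('All')
s = df.groupby(['Month', 'MainAggregate'])['valueTrad'].transform('sum').div(2)
s1 = (df.assign(new=df['valueTrad'].where(mask))
        .groupby(['Month', 'MainAggregate'])['new'].transform('first'))
out = df[s.ne(s1) | ~mask | df.duplicated(['Month', 'MainAggregate', 'subaAggregate'])]
# The (1, X) 'All' row is dropped because 10 + 20 equals its value 30;
# the (2, Y) 'All' row survives because 5 + 5 != 11.
print(out)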
Alternatively, compare only the filtered 'All' rows in boolean indexing, remove duplicates with DataFrame.drop_duplicates, and use the resulting indices to drop rows from the original DataFrame:
# unique indices are required
# df = df.reset_index(drop=True)
mask = df['subaAggregate'].eq('All')
s = df.groupby(["Month", "MainAggregate"])["valueTrad"].sum().div(2).rename('new')
df1 = df[mask].drop_duplicates().join(s, on=['Month', 'MainAggregate'])
out = df.drop(df1.index[df1['valueTrad'].eq(df1['new'])])
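On the same made-up frame as in the sketch above, the intermediate df1 makes the logic of this variant visible: it holds only the 'All' rows joined with half the group total, so equality of valueTrad and new marks the rows to drop.

import pandas as pd

df = pd.DataFrame({'Month': [1, 1, 1, 2, 2, 2],
                   'MainAggregate': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
                   'subaAggregate': ['A', 'B', 'All', 'A', 'B', 'All'],
                   'valueTrad': [10, 20, 30, 5, 5, 11]})

mask = df['subaAggregate'].eq('All')
s = df.groupby(['Month', 'MainAggregate'])['valueTrad'].sum().div(2).rename('new')
df1 = df[mask].drop_duplicates().join(s, on=['Month', 'MainAggregate'])
print(df1)   # the (1, X) 'All' row has valueTrad 30 == new 30.0, so it is dropped below
print(df.drop(df1.index[df1['valueTrad'].eq(df1['new'])]))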

Pandas - Compare single row with all rows in same dataframe

I would like to add a column to an existing dataframe which shows a count value. The count value should compare a value in a given row versus all rows in another column.
In my example I want to find the number of times a value in the entire 'end_date' column is earlier than current 'start_date' column. Adding the count to the dataframe like so:
start_date end_date count
1 2020-09-2 2020-09-3 1
2 2020-09-6 2020-09-7 3
3 2020-09-4 2020-09-5 2
4 2020-09-1 2020-09-1 0
I have tried
df['count'] = (df[df['end_date']<df['start_date']]).count()
but this results in the count column being 0 for all rows as the start_date is always less than the end_date within any one row.
import pandas as pd

my_dict = {'start_date': ['2020-09-02', '2020-09-06', '2020-09-04', '2020-09-01']}
df = pd.DataFrame.from_dict(my_dict)
df['count'] = 0
for index, row in df.iterrows():
    df.at[index, 'count'] = df[df['start_date'] < row['start_date']].count()[1]
print(df)
You want count[i] = the number of rows whose end_date is less than start_date[i].
What you wrote computes a purely row-wise comparison, end_date[i] < start_date[i], instead.
A straightforward way is to iterate over the rows and calculate individually.
for i, row in df.iterrows():
    df.at[i, 'count'] = (df['end_date'] < row['start_date']).sum()
(df['end_date'] < row['start_date']) returns a column of True or False depending on whether the condition is satisfied; .sum() counts True as 1 and False as 0.
You can try a cross join, built here as an outer merge on a temporary key:
counts = (
    pd.merge(
        df[["start_date"]].assign(temp=1),
        df[["end_date"]].assign(temp=1),
        on="temp",
        how="outer",
    )
    .query("start_date > end_date")
    .groupby("start_date")
    .temp.count()
)
df = df.merge(counts, on="start_date", how="left").fillna(0, downcast="infer")
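Another option, not taken from the answers above, is a broadcasting sketch; it assumes the dates are already parsed with pd.to_datetime and that the frame is small enough for an n-by-n comparison:

import pandas as pd

df = pd.DataFrame({'start_date': pd.to_datetime(['2020-09-02', '2020-09-06', '2020-09-04', '2020-09-01']),
                   'end_date':   pd.to_datetime(['2020-09-03', '2020-09-07', '2020-09-05', '2020-09-01'])})

end = df['end_date'].to_numpy()
start = df['start_date'].to_numpy()
df['count'] = (end[None, :] < start[:, None]).sum(axis=1)   # count end_dates earlier than each start_date
print(df)   # counts: 1, 3, 2, 0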

pandas add column to dataframe aggregate on time series

I've done a dataframe aggregation and I want to add a new column in which, if the row has a value > 0 for year 2020, it puts a 1, otherwise a 0.
This is my code (the head of the dataframe was posted as an image):
df['year'] = pd.DatetimeIndex(df['TxnDate']).year                      # add column year
df['client'] = df['Customer'].str.split(' ').str[:3].str.join(' ')     # add column with the first 3 words
Datedebut = df['year'].min()
Datefin = df['year'].max()
#print(df)
df1 = df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()
print(df1)
df1['nb2020'] = np.where(df1['year'] == 2020, 1, 0)
The print of df1 before the last line was posted as an image.
The last line raises: KeyError: 'year'
Thanks
When you performed the aggregation and unstacked (df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()), the values of the year column were expanded into columns, and those columns form a MultiIndex. You can look at it by calling:
print (df1.columns)
And then you can select them.
Using the MultiIndex column
So to select the column that corresponds to 2020 you can use:
df1.loc[:, df1.columns.get_level_values(2).isin({2020})]
You can then take that column and check whether it has a non-zero value for 2020 using:
df1['nb2020'] = df1.loc[:,df1.columns.get_level_values('year').isin({2020})] > 0
If you would like to have the 1 and 0 (instead of the bool types), you can convert to int (using astype).
Renaming the columns
If you find this a bit complicated, you might prefer to flatten the columns to a single index, using something like
df1.columns = df1.columns.get_level_values('year')
Or
df1.columns = df1.columns.get_level_values(2)
And then
df1['nb2020'] = (df1[2020] > 0).astype(int)
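Putting it together, a minimal end-to-end sketch on made-up data (the real dataframe was only shown as an image; the column names follow the question, and the fillna(0) for clients without a 2020 row is an assumption about the desired behaviour):

import pandas as pd

df = pd.DataFrame({'Customer': ['Acme Corp Ltd SiteA', 'Acme Corp Ltd SiteB', 'Beta Inc GmbH SiteC'],
                   'TxnDate': ['2019-05-01', '2020-03-15', '2020-07-20'],
                   'Amount': [100, 50, 0]})
df['year'] = pd.DatetimeIndex(df['TxnDate']).year
df['client'] = df['Customer'].str.split(' ').str[:3].str.join(' ')

df1 = df.groupby(['client', 'year']).agg({'Amount': ['sum']}).unstack()
df1.columns = df1.columns.get_level_values('year')    # flatten the MultiIndex to just the years
df1['nb2020'] = (df1[2020].fillna(0) > 0).astype(int)
print(df1)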

Get only first row per subject in dataframe

I was wondering if there is an easy way to get only the first row of each grouped object (subject id for example) in a dataframe. Doing this:
for index, row in df.iterrows():
    # do stuff
gives us each one of the rows, but I am interested in doing something like this:
groups = df.groupby('Subject id')
for index, row in groups.iterrows():
    # give me the first row of each group
    continue
Is there a pythonic way to do the above?
Direct solution - without .groupby() - by .drop_duplicates()
What you want is to keep only the row with the first occurrence of each value in a specific column:
df.drop_duplicates(subset='Subject id', keep='first')
General solution
Using the .apply(func) in Pandas:
df.groupby('Subject id').apply(lambda df: df.iloc[0, :])
It applies a function (usually generated on the fly with a lambda) to each group DataFrame produced by df.groupby() and combines the results into a single final DataFrame.
However, the solution by @AkshayNevrekar using .first() is really nice, and as he did there, you could also attach a .reset_index() at the end.
Let's call this the more general solution, since you could also take any nth row ... however, this works only if every sub-dataframe has more than n rows.
Otherwise, use:
n = 3
col = 'Subject id'
res_df = pd.DataFrame()
for name, df in df.groupby(col):
    if n < df.shape[0]:
        res_df = res_df.append(df.reset_index().iloc[n, :])
Or as a function:
def group_by_select_nth_row(df, col, n):
    res_df = pd.DataFrame()
    for name, df in df.groupby(col):
        if n < df.shape[0]:
            res_df = res_df.append(df.reset_index().iloc[n, :])
    return res_df
Quite confusingly, df.append(), in contrast to list.append(), does not change the original df; it returns a new DataFrame with the row appended.
Therefore you should always reassign the result if you want 'in place'-style appending, as one is used to from list.append().
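For the nth-row case specifically, a shorter option (not used in the answers here) is GroupBy.nth, which simply skips groups that are too short; note that the exact index layout of the result differs between pandas versions.

import pandas as pd

df = pd.DataFrame({'subject_id': [1, 1, 2, 2, 2, 3], 'val': [20, 32, 12, 34, 45, 43]})
print(df.groupby('subject_id').nth(0))   # first row of each group
print(df.groupby('subject_id').nth(2))   # third row, only for groups that have one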
Use first() to get the first row of each group.
df = pd.DataFrame({'subject_id': [1,1,2,2,2,3,4,4], 'val':[20,32,12,34,45,43,23,10]})
# print(df.groupby('subject_id').first().reset_index())
print(df.groupby('subject_id', as_index=False).first())
Output:
subject_id val
0 1 20
1 2 12
2 3 43
3 4 23

Pandas: Filter by values within multiple columns

I'm trying to filter a dataframe based on the values within multiple columns, using a single condition, but keep other columns to which I don't want to apply the filter at all.
I've reviewed these answers, with the third being the closest, but still no luck:
how do you filter pandas dataframes by multiple columns
Filtering multiple columns Pandas
Python Pandas - How to filter multiple columns by one value
Setup:
import pandas as pd
df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2],
    'a': ['A', 'A', 'A', 'A', 'NONE'],
    'b': ['B', 'B', 'B', 'B', 'B'],
    'c': ['C', 'C', 'C', 'NONE', 'NONE']
}, columns=['month', 'a', 'b', 'c'])

l = ['month', 'a', 'c']
df = df.loc[df['month'] == df['month'].max(), df.columns.isin(l)].reset_index(drop=True)
Current Output:
month a c
0 2 A NONE
1 2 NONE NONE
Desired Output:
month a
0 2 A
1 2 NONE
I've tried:
sub = l[1:]
df = df[(df.loc[:, sub] != 'NONE').any(axis = 1)]
and many other variations (.all(), [sub, :], ~df.loc[...], (axis = 0)), but all with no luck.
Basically I want to drop any column (within the sub list) that has all 'NONE' values in it.
Any help is much appreciated.
You first want to substitute your 'NONE' with np.nan so that it is recognized as a null value by dropna. Then use loc with your boolean series and column subset. Then use dropna with axis=1 and how='all'
import numpy as np

df.replace('NONE', np.nan) \
  .loc[df.month == df.month.max(), l].dropna(axis=1, how='all')
month a
3 2 A
4 2 NONE
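If the literal 'NONE' strings should survive in the kept columns, as in the desired output, a variant sketch (not the answer above) uses the 'NONE' test only to decide which columns to drop:

import pandas as pd

df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2],
    'a': ['A', 'A', 'A', 'A', 'NONE'],
    'b': ['B', 'B', 'B', 'B', 'B'],
    'c': ['C', 'C', 'C', 'NONE', 'NONE']
}, columns=['month', 'a', 'b', 'c'])
l = ['month', 'a', 'c']

sub = df.loc[df['month'] == df['month'].max(), l]
keep = sub.columns[~sub.eq('NONE').all()]            # drop columns that are all 'NONE' in the filtered rows
print(sub[keep].reset_index(drop=True))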
