This question is slightly more complicated than Remove duplicate rows in pandas dataframe based on condition:
Instead of one 'valu' column, I now have two columns 'valu1', 'valu2':
t valu1 valu2
2015-08-01 1 10
2015-08-01 2 11
2015-08-01 3 12
2015-09-31 4 15
2015-10-31 5 13
In the dataframe above, I want to remove the duplicate rows (i.e. rows where the column 't' is repeated) by keeping, for each repeated 't', the highest value in the 'valu1' column and the lowest value in the 'valu2' column.
Expected outcome:
t valu1 valu2
2015-08-01 3 10
2015-09-31 4 15
2015-10-31 5 13
The df.sort_values() and drop_duplicates with keep='last' mentioned in the linked question obviously don't work here: for 2015-08-01 the expected row combines valu1 from one original row with valu2 from another, so no single existing row can simply be kept.
Something I can think of now is:
# Let's call the dataframe df
dups = df[df['t'].duplicated()]['t'].drop_duplicates()  # get the duplicated dates
for d in dups:
    max_v1 = df[df['t'] == d]['valu1'].max()  # find the max of valu1 on day d
    min_v2 = df[df['t'] == d]['valu2'].min()  # find the min of valu2 on day d
    df.loc[df['t'] == d, 'valu1'] = max_v1  # set valu1 of day d to max_v1 (.loc so the assignment sticks)
    df.loc[df['t'] == d, 'valu2'] = min_v2  # set valu2 of day d to min_v2
df = df[~df['t'].duplicated()]  # keep one row per date, now that all rows of a date are identical
I think this should work, but it seems really unsophisticated, especially since I actually need to do this for a large dataset. Any idea of how I should approach this problem?
I think you are looking for
df.groupby('t').agg({'valu1':'max','valu2':'min'}).reset_index()
t valu1 valu2
0 2015-08-01 3 10
1 2015-09-31 4 15
2 2015-10-31 5 13
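For reference, here is a minimal runnable sketch of that groupby approach, rebuilding the sample frame from the question (the 't' values are kept as plain strings, since '2015-09-31' is not a valid calendar date):
import pandas as pd

# Rebuild the sample frame from the question; 't' stays as strings
df = pd.DataFrame({
    't': ['2015-08-01', '2015-08-01', '2015-08-01', '2015-09-31', '2015-10-31'],
    'valu1': [1, 2, 3, 4, 5],
    'valu2': [10, 11, 12, 15, 13],
})

result = df.groupby('t').agg({'valu1': 'max', 'valu2': 'min'}).reset_index()
print(result)  # matches the expected outcome above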
I have a table which contains ids, dates, a target (potentially multi-class, but for now binary, where 1 is a fail) and a yearmonth column based on the date column. Below are the first 8 rows of this table:
row id date       target yearmonth
0   A  2015-03-16 0      2015-03
1   A  2015-05-29 1      2015-05
2   A  2015-08-02 1      2015-08
3   A  2015-09-05 1      2015-09
4   A  2015-09-22 0      2015-09
5   A  2015-10-15 1      2015-10
6   A  2015-11-09 1      2015-11
7   B  2015-04-17 0      2015-04
I want to create lookback features for, let's say, the last 3 months, so that for each single row we look into the past and see how that id performed over the last 3 months. For example, for row 6, where the date is 9 Nov 2015, the percentage of fails for id A in the last 3 calendar months (so the whole of August, September and October) would be 75% (using rows 2-5).
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
                   'date': ['2015-03-16', '2015-05-29', '2015-08-02', '2015-09-05',
                            '2015-09-22', '2015-10-15', '2015-11-09', '2015-04-17'],
                   'target': [0, 1, 1, 1, 0, 1, 1, 0]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['yearmonth'] = df['date'].dt.to_period('M')

agg_dict = {
    "Total_Transactions": pd.NamedAgg(column='target', aggfunc='count'),
    "Fail_Count": pd.NamedAgg(column='target', aggfunc=lambda x: len(x[x == 1])),
    "Perc_Monthly_Fails": pd.NamedAgg(column='target', aggfunc=lambda x: len(x[x == 1]) / len(x) * 100)
}
df.groupby(['id', 'yearmonth']).agg(**agg_dict).reset_index(level=1)
I've done an aggregation using id and month (code above, output below) and I've tried things like rolling windows, but I couldn't find a way to actually aggregate looking back over a specific period for each single row. Any help is appreciated.
id yearmonth Total_Transactions Fail_Count Perc_Monthly_Fails
A  2015-03   1                  0          0
A  2015-05   1                  1          100
A  2015-08   1                  1          100
A  2015-09   2                  1          50
A  2015-10   1                  1          100
A  2015-11   1                  1          100
B  2015-04   1                  0          0
You can do this by merging the DataFrame with itself on 'id'.
First we'll create a first-of-month 'fom' column, since your date logic looks back over prior calendar months rather than from the date itself. Then we merge the DataFrame with itself, bringing the index along so we can assign the result back at the end.
With month offsets we can then filter down to only the observations that fall within 3 months of the observation for each row. Finally we group by the original index and take the mean of 'target' to get the percent failed, which we assign straight back (alignment on index).
Any NaN in the output means that row had no observations in the prior 3 months, so nothing can be calculated.
# df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['fom'] = df['date'].astype('datetime64[M]')  # Credit #anky

df1 = df.reset_index()
df1 = df1.drop(columns='target').merge(df1, on='id', suffixes=['', '_past'])
df1 = df1[df1.fom_past.between(df1.fom - pd.offsets.DateOffset(months=3),
                               df1.fom - pd.offsets.DateOffset(months=1))]
df['Pct_fail'] = df1.groupby('index').target.mean() * 100
id date target fom Pct_fail
0 A 2015-03-16 0 2015-03-01 NaN # No Rows to Avg
1 A 2015-05-29 1 2015-05-01 0.000000 # Avg Rows 0
2 A 2015-08-02 1 2015-08-01 100.000000 # Avg Rows 1
3 A 2015-09-05 1 2015-09-01 100.000000 # Avg Rows 2
4 A 2015-09-22 0 2015-09-01 100.000000 # Avg Rows 2
5 A 2015-10-15 1 2015-10-01 66.666667 # Avg Rows 2,3,4
6 A 2015-11-09 1 2015-11-01 75.000000 # Avg Rows 2,3,4,5
7 B 2015-04-17 0 2015-04-01 NaN # No Rows to Avg
If you're having an issue with memory, we can take a much slower loop approach, which subsets the frame for each row and then calculates the average from that subset.
import numpy as np

def get_prev_avg(row, df):
    df = df[df['id'].eq(row['id'])
            & df['fom'].between(row['fom'] - pd.offsets.DateOffset(months=3),
                                row['fom'] - pd.offsets.DateOffset(months=1))]
    if not df.empty:
        return df['target'].mean() * 100
    else:
        return np.NaN

# df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['fom'] = df['date'].astype('datetime64[M]')
df['Pct_fail'] = df.apply(lambda row: get_prev_avg(row, df), axis=1)
I have modified @ALollz's code so that it applies better to my original dataset, where I have a multi-class target and want to obtain the percentage of fails for classes 1 and 2 plus the number of transactions, and where I need to group by different columns over different periods of time. I also decided it's simpler and better to use the last x months prior to the date rather than calendar months. So my solution was this:
df = pd.DataFrame({'Id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
                   'Type': ['T1', 'T3', 'T1', 'T2', 'T2', 'T1', 'T1', 'T3'],
                   'date': ['2015-03-16', '2015-05-29', '2015-08-10', '2015-09-05',
                            '2015-09-22', '2015-11-08', '2015-11-09', '2015-04-17'],
                   'target': [2, 1, 2, 1, 0, 1, 2, 0]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

def get_prev_avg(row, df, columnname, lastxmonths):
    df = df[df[columnname].eq(row[columnname])
            & df['date'].between(row['date'] - pd.offsets.DateOffset(months=lastxmonths),
                                 row['date'] - pd.offsets.DateOffset(days=1))]
    if not df.empty:
        NrTransactions = len(df['target'])
        PctMinorFails = df['target'].where(df['target'] == 1).count() / len(df['target']) * 100
        PctMajorFails = df['target'].where(df['target'] == 2).count() / len(df['target']) * 100
        return pd.Series([NrTransactions, PctMinorFails, PctMajorFails])
    else:
        return pd.Series([np.NaN, np.NaN, np.NaN])

for lastxmonths in [3, 4]:
    for columnname in ['Id', 'Type']:
        df[['NrTransactionsBy' + str(columnname) + 'Last' + str(lastxmonths) + 'Months',
            'PctMinorFailsBy' + str(columnname) + 'Last' + str(lastxmonths) + 'Months',
            'PctMajorFailsBy' + str(columnname) + 'Last' + str(lastxmonths) + 'Months'
            ]] = df.apply(lambda row: get_prev_avg(row, df, columnname, lastxmonths), axis=1)
Each iteration takes a couple of hours on my original dataset, which is not great, but I'm unsure how to optimise it further.
I am looking to increase the speed of an operation within pandas, and I have learned that it is generally best to do so using vectorization. The problem I am looking for help with is vectorizing the following operation.
Setup:
df1 = a table with a date-time column, and city column
df2 = another (considerably larger) table with a date-time column, and city column
The Operation:
for i, row in df2.iterrows():
    for x, row2 in df1.iterrows():
        if row['date-time'] - row2['date-time'] > pd.Timedelta('8 hours') and row['city'] == row2['city']:
            df2.at[i, 'result'] = True
            break
As you might imagine, this operation is insanely slow on any dataset of a decent size. I am also just beginning to learn pandas vector operations and would like some help in figuring out a more optimal way to solve this problem.
I think what you need is merge() with numpy.where() to achieve the same result.
Since you don't have a reproducible sample in your question, kindly consider this:
>>> df1 = pd.DataFrame({'time':[24,20,15,10,5], 'city':['A','B','C','D','E']})
>>> df2 = pd.DataFrame({'time':[2,4,6,8,10,12,14], 'city':['A','B','C','F','G','H','D']})
>>> df1
time city
0 24 A
1 20 B
2 15 C
3 10 D
4 5 E
>>> df2
time city
0 2 A
1 4 B
2 6 C
3 8 F
4 10 G
5 12 H
6 14 D
From what I understand, you only need the rows in your df2 that have a matching value in df1's city column and where the difference between the dates is greater than 8 hours.
To do that, we need to merge on your city column:
>>> new_df = df2.merge(df1, how = 'inner', left_on = 'city', right_on = 'city')
>>> new_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
3 14 D 10
time_x is basically the time from your df2 dataframe, and time_y is from your df1.
Now we need to check the difference between those times and keep the rows where it is greater than 8, using numpy.where() to flag them so we can filter afterwards:
>>> new_df['flag'] = np.where(new_df['time_y'] - new_df['time_x'] > 8, ['Retain'], ['Remove'])
>>> new_df
time_x city time_y flag
0 2 A 24 Retain
1 4 B 20 Retain
2 6 C 15 Retain
3 14 D 10 Remove
Now that you have that, you can simply filter new_df by the flag column, dropping the flag column from the final output like so:
>>> final_df = new_df[new_df['flag'].isin(['Retain'])][['time_x', 'city', 'time_y']]
>>> final_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
And there you go, no looping needed. Hope this helps :D
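As a side note, the intermediate flag column is optional: the same result can be reached with direct boolean indexing on the merged frame. A small sketch using the same toy frames as above:
import pandas as pd

df1 = pd.DataFrame({'time': [24, 20, 15, 10, 5], 'city': ['A', 'B', 'C', 'D', 'E']})
df2 = pd.DataFrame({'time': [2, 4, 6, 8, 10, 12, 14],
                    'city': ['A', 'B', 'C', 'F', 'G', 'H', 'D']})

merged = df2.merge(df1, how='inner', on='city')             # default suffixes give time_x / time_y
final_df = merged[merged['time_y'] - merged['time_x'] > 8]  # keep only rows where the gap exceeds 8
print(final_df)
Whether you keep the flag column is mostly a matter of taste; it can be handy if you want to inspect the removed rows before filtering.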
What I wanna do:
Column 'angle' holds about 20 tracked angles per second (this can vary), but my 'Time' timestamp only has an accuracy of 1 s, so roughly 20 rows always share the same timestamp (the dataframe has over 1 million rows in total).
My result should be a new dataframe with one row per timestamp, where the angle for each timestamp is the median of the ~20 angle readings in that interval.
My Idea:
I iterate through the rows and check if the timestamp has changed.
If so, I select all timestamps until it changes, calculate the median, and append it to a new dataframe.
Nevertheless, I have many big data files and I am wondering if there is a faster way to achieve my goal.
Right now my code is the following (see below).
It is not fast and I think there must be a better way to do that with pandas/numpy (or something else?).
a = 0
for i in range(1, len(df1.index)):
    if df1.iloc[[a], [1]].iloc[0][0] == df1.iloc[[i], [1]].iloc[0][0]:
        continue
    else:
        if a == 0:
            df_result = df1[a:i-1].median()
        else:
            df_result = df_result.append(df1[a:i-1].median(), ignore_index=True)
        a = i
You can use groupby here. Below, I made a simple dummy dataframe.
import pandas as pd

df1 = pd.DataFrame({'time': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                    'angle': [8, 9, 7, 1, 4, 5, 11, 4, 3, 8, 7, 6]})
df1
time angle
0 1 8
1 1 9
2 1 7
3 1 1
4 1 4
5 1 5
6 2 11
7 2 4
8 2 3
9 2 8
10 2 7
11 2 6
Then, we group by the timestamp and take the median of the angle column within that group, and convert the result to a pandas dataframe.
df2 = pd.DataFrame(df1.groupby('time')['angle'].median())
df2 = df2.reset_index()
df2
time angle
0 1 6.0
1 2 6.5
You can also use .agg after the groupby to select the operation to apply per column:
df1.groupby('Time', as_index=False).agg({"angle":"median"})
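For a quick check, here is the same .agg call run against the dummy df1 from the previous answer (its column is lowercase 'time', whereas the original data uses 'Time'):
import pandas as pd

df1 = pd.DataFrame({'time': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                    'angle': [8, 9, 7, 1, 4, 5, 11, 4, 3, 8, 7, 6]})
print(df1.groupby('time', as_index=False).agg({'angle': 'median'}))
#    time  angle
# 0     1    6.0
# 1     2    6.5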
I have a pandas pivot table that lists individuals in rows and data sources across the columns. There are hundreds of individuals down the rows and hundreds of sources across the columns.
Desired_Value Source_1 Source_2 Source_3 ... Source_50
person1 20 20 20 20
person2 5 5 5 5
person3 Review 3 4 4 4
...
person50 1 1 1
What I want to do is create the Desired_Value column above. I want to pull in a value so long as it matches across all values (ignoring blank fields). If values do not match I want to show Review.
I use this pandas command to print my df to excel currently (without any Desired_Value column):
df13 = df12.pivot_table(index='person', columns = 'source_name', values = 'actual_data', aggfunc='first')
I'm new to Python so apologies if this is a silly question.
This is one method to do it:
df = df13.copy()
df = df.astype('Int64')  # So NaN and Int values can coexist

# Create the new column and move it to the front of the data frame
df['Desired_Value'] = np.nan
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]

# Loop over all rows and flag rows with conflicting values for review
for idx, row in df.iterrows():
    val = row.dropna().unique()
    if len(val) == 1:
        df.loc[idx, 'Desired_Value'] = val[0]
    else:
        df.loc[idx, 'Desired_Value'] = 'Review'

print(df)
Desired_Value Source_1 Source_2 Source_3 Source_50
person1 20 20 20 NaN 20
person2 5 5 NaN 5 5
person3 Review 3 4 4 4
person50 1 1 NaN NaN 1
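If the row loop becomes slow with hundreds of people and sources, a vectorized variant is possible. The following is only a sketch of an alternative, not the method above; it assumes the pivoted values in df13 are numeric (or at least mutually comparable), and rows that are entirely blank end up flagged as Review here:
import numpy as np

out = df13.copy()
# A row is consistent when it holds exactly one distinct non-NaN value
consistent = out.nunique(axis=1).eq(1)
# Any non-NaN value from the row will do; max() works because all values agree when consistent
single_val = out.max(axis=1)
out.insert(0, 'Desired_Value', np.where(consistent, single_val, 'Review'))
print(out)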
I have about 20 data frames, all having the same columns, and I would like to add their data into one empty data frame. But when I use my code
interested_freq
UPC CPC freq
0 136.0 B64G 2
1 136.0 H01L 1
2 136.0 H02S 1
3 244.0 B64G 1
4 244.0 H02S 1
5 257.0 B64G 1
6 257.0 H01L 1
7 312.0 B64G 1
8 312.0 H02S 1
list_of_lists = []
max_freq = df_interested_freq[df_interested_freq['freq'] == df_interested_freq['freq'].max()]
for row, cols in max_freq.iterrows():
    interested_freq = df_interested_freq[df_interested_freq['freq'] != 1]
    interested_freq
    list_of_lists.append(interested_freq)
list_of_lists
to append the first data frame, and then change the names in that code, hoping that it will append more data
list_of_lists = []
for row, cols in max_freq.iterrows():
    interested_freq_1 = df_interested_freq_1[df_interested_freq_1['freq'] != 1]
    interested_freq_1
    list_of_lists.append(interested_freq_1)
list_of_lists
but the first data frame disappears and only the most recently appended data shows. Have I done something wrong?
One way to create a new DataFrame from an existing DataFrame is to use df.copy().
Here is the detailed documentation.
df.copy() is very relevant here because, without it, changing a subset of data in the new dataframe can change the initial DataFrame, so you have a fair chance of losing your actual DataFrame; that is why you need it.
Suppose the example DataFrame is df1:
>>> df1
col1 col2
1 11 12
2 21 22
Solution: you can use the df.copy() method as follows, which will carry the data along.
>>> df2 = df1.copy()
>>> df2
col1 col2
1 11 12
2 21 22
In case you need the new dataframe (df2) to be created like df1 but don't want the values to be carried across the DF, you have the option to use the reindex_like() method.
>>> df2 = pd.DataFrame().reindex_like(df1)
# df2 = pd.DataFrame(data=np.nan,columns=df1.columns, index=df1.index)
>>> df2
col1 col2
1 NaN NaN
2 NaN NaN
Why do you use append here? It's not a list. Once you have the first dataframe (called df1, for example), try:
new_df = df1
new_df = pd.concat([new_df, df2])
You can do the same thing for all 20 dataframes.
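For instance, a small sketch that concatenates an arbitrary list of same-column frames in one call (the two frames below are hypothetical stand-ins for the 20 dataframes):
import pandas as pd

# Hypothetical stand-ins for the 20 dataframes with identical columns
frames = [pd.DataFrame({'col1': [11], 'col2': [12]}),
          pd.DataFrame({'col1': [21], 'col2': [22]})]

combined = pd.concat(frames, ignore_index=True)  # one concat call handles any number of frames
print(combined)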