Imagine that I've got the following DataFrame
                  A | B | C | D
-------------------------------
2000-01-01 00:00:00 | 1 | 1 | 1
2000-01-01 00:04:30 | 1 | 2 | 2
2000-01-01 00:04:30 | 2 | 3 | 3
2000-01-02 00:00:00 | 1 | 4 | 4
And I want to drop rows where B are equal and the values in A are "close". Say, within five minutes of each other. So in this case drop the first two rows, but keep the last two.
So, instead of doing df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False), I'd like something that's more like df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False, func={'A': some_func}), with:
def some_func(ts1, ts2):
    delta = ts1 - ts2
    return abs(delta.total_seconds()) >= 5 * 60
Is there a way to do this in Pandas?
m = df.groupby('B').A.apply(lambda x: x.diff().dt.total_seconds() < 300)
m2 = df.B.duplicated(keep=False) & (m | m.shift(-1))
df[~m2]
A B C D
2 2000-01-01 00:04:30 2 3 3
3 2000-01-02 00:00:00 1 4 4
Details
m is a mask marking rows that are within 5 minutes of the previous row in their B group.
m
0 False
1 True
2 False
3 False
Name: A, dtype: bool
m2 is the final mask of all items that must be dropped.
m2
0 True
1 True
2 False
3 False
dtype: bool
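drop_duplicates has no hook for a custom closeness test, but the same masks can be wrapped in a small helper so the time window becomes a parameter. A minimal sketch (the helper name is hypothetical, and it assumes rows are sorted by A within each B group):
import pandas as pd

def drop_close_duplicates(df, minutes=5):
    # drop every row whose B value also appears on a neighbouring row
    # whose A timestamp is within `minutes` of it
    secs = minutes * 60
    prev_gap = df.groupby('B')['A'].diff().dt.total_seconds().abs()
    next_gap = df.groupby('B')['A'].diff(-1).dt.total_seconds().abs()
    close = (prev_gap < secs) | (next_gap < secs)
    return df[~close]

df = pd.DataFrame({
    'A': pd.to_datetime(['2000-01-01 00:00:00', '2000-01-01 00:04:30',
                         '2000-01-01 00:04:30', '2000-01-02 00:00:00']),
    'B': [1, 1, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 2, 3, 4]})

drop_close_duplicates(df)
#                     A  B  C  D
# 2 2000-01-01 00:04:30  2  3  3
# 3 2000-01-02 00:00:00  1  4  4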
Here I break the steps down; you can test with your real data to see whether it works for you.
df['dropme'] = df.A.diff().shift(-1).dt.total_seconds() / 60   # minutes to the next row
df['dropme2'] = df.A
df.loc[df.dropme <= 5, 'dropme2'] = 1
df.drop_duplicates(['dropme2'], keep=False).drop(['dropme', 'dropme2'], axis=1)
Out[553]:
A B C D
2 2000-01-01 00:04:30 2 3 3
3 2000-01-02 00:00:00 1 4 4
Write a function that accepts a DataFrame, calculates the delta between two successive timestamps, and returns the filtered DataFrame. Then groupby & apply.
import pandas as pd
import datetime

# this one preserves 1 row from two or more close-by rows.
def filter_window(df):
    df['filt'] = (df.A - df.A.shift(1)) / datetime.timedelta(minutes=1)
    df['filt'] = df.filt.fillna(10.0)
    df = df[(df.filt > 5.0) | pd.isnull(df.filt)]
    return df[['A', 'C', 'D']]

df2 = df.groupby('B').apply(filter_window).reset_index()
# With your sample dataset, this is the output of df2
A B C D
0 2000-01-01 00:00:00 1 1 1
1 2000-01-02 00:00:00 1 4 4
2 2000-01-01 00:04:30 2 3 3
# this one drops all close-by rows.
def filter_window2(df):
    df['filt'] = (df.A - df.A.shift(1)) / datetime.timedelta(minutes=1)
    df['filt2'] = (df.A.shift(-1) - df.A) / datetime.timedelta(minutes=1)
    df['filt'] = df.filt.fillna(df.filt2)
    df = df[(df.filt > 5.0) | pd.isnull(df.filt)]
    return df[['A', 'C', 'D']]

df3 = df.groupby('B').apply(filter_window2).reset_index()
# With your sample dataset, this is the output of df3
A B C D
0 2000-01-02 00:00:00 1 4 4
1 2000-01-01 00:04:30 2 3 3
Related
I was doing some coding and realized something: I think there is an easier way of doing this.
So I have a DataFrame like this:
>>> df = pd.DataFrame({'a': [1, 'A', 2, 'A'], 'b': ['A', 3, 'A', 4]})
a b
0 1 A
1 A 3
2 2 A
3 A 4
And I want to remove all of the 'A's from the data, but I also want to squeeze the DataFrame together. What I mean by squeezing is to end up with this:
a b
0 1 3
1 2 4
I have a solution as follows:
a = df['a'][df['a'] != 'A']
b = df['b'][df['b'] != 'A']
df2 = pd.DataFrame({'a': a.tolist(), 'b': b.tolist()})
print(df2)
This works, but I suspect there is an easier way; I've stopped coding for a while, so I'm a bit rusty...
Note:
All columns have the same number of 'A's, so there is no problem there.
You can try boolean indexing with loc to remove the A values:
pd.DataFrame({c: df.loc[df[c] != 'A', c].tolist() for c in df})
Result:
a b
0 1 3
1 2 4
This would do:
In [1513]: df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy()))
Out[1513]:
a b
0 1.0 3.0
1 2.0 4.0
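The 1.0/3.0 in the output are a side effect of the intermediate NaN: the columns come back as float. If the integer dtype matters, it can be cast back afterwards; a small sketch, assuming every surviving value really is an integer:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 'A', 2, 'A'], 'b': ['A', 3, 'A', 4]})
out = df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy()))
out = out.astype(int)   # cast the floats back to int
out
#    a  b
# 0  1  3
# 1  2  4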
We can use df.melt, then filter out the 'A' values, then df.pivot:
out = df.melt().query("value!='A'")
out.index = out.groupby('variable')['variable'].cumcount()
out.pivot(columns='variable', values='value').rename_axis(columns=None)
a b
0 1 3
1 2 4
Details
out = df.melt().query("value!='A'")
variable value
0 a 1
2 a 2
5 b 3
7 b 4
# We set this as index so it helps in `df.pivot`
out.groupby('variable')['variable'].cumcount()
0 0
2 1
5 0
7 1
dtype: int64
out.pivot(columns='variable', values='value').rename_axis(columns=None)
a b
0 1 3
1 2 4
Another alternative
df = df.mask(df.eq('A'))
out = df.stack()
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
a b
0 1 3
1 2 4
Details
df = df.mask(df.eq('A'))
a b
0 1 NaN
1 NaN 3
2 2 NaN
3 NaN 4
out = df.stack()
0 a 1
1 b 3
2 a 2
3 b 4
dtype: object
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
a b
0 1 3
1 2 4
I need to select the rows of the last value for each user_id and date, but when the last value in the metric column is 'leave', select the last 2 rows (if they exist).
My data:
df = pd.DataFrame({
"user_id": [1,1,1, 2,2,2]
,'subscription': [1,1,2,3,4,5]
,"metric": ['enter', 'stay', 'leave', 'enter', 'leave', 'enter']
,'date': ['2020-01-01', '2020-01-01', '2020-03-01', '2020-01-01', '2020-01-01', '2020-01-02']
})
#result
user_id subscription metric date
0 1 1 enter 2020-01-01
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
Expected output:
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01 # kept because the last metric is 'leave' inside the [user_id, date] group
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
What I've tried: drop_duplicates and groupby; both give the same result, only the last value:
df.drop_duplicates(['user_id', 'date'], keep='last')
#or
df.groupby(['user_id', 'date']).tail(1)
You can use boolean masking and build three True/False conditions in the variables a, b, and c. Then filter for the rows where a, b, or c is True with the or operator |:
a = df.groupby(['user_id', 'date', df.groupby(['user_id', 'date']).cumcount()])['metric'].transform('last') == 'leave'
b = df.groupby(['user_id', 'date'])['metric'].transform('count') == 1
c = a.shift(-1) & (b == False)
df = df[a | b | c]
print(a, b, c)
df
#a groupby on the two required keys plus a group that finds the cumulative count, which is necessary in order to return True for the last "metric" within the group.
0 False
1 False
2 True
3 False
4 True
5 False
Name: metric, dtype: bool
#b if something has a count of one, then you want to keep it.
0 False
1 False
2 True
3 False
4 False
5 True
Name: metric, dtype: bool
#c simply use .shift(-1) to find the row before a 'leave' row. For the condition to be satisfied, the count for that group must be > 1.
0 False
1 True
2 False
3 True
4 False
5 False
Name: metric, dtype: bool
Out[18]:
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
This is one way, but in my opinion it is slow, since we are iterating through the groups:
df["date"] = pd.to_datetime(df["date"])
df = df.assign(metric_is_leave=df.metric.eq("leave"))
pd.concat(
[
value.iloc[-2:, :-1] if value.metric_is_leave.any() else value.iloc[-1:, :-1]
for key, value in df.groupby(["user_id", "date"])
]
)
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
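For reference, the same selection can also be written with two vectorised group-level masks instead of iterating over the groups; this is only a sketch alongside the answers above, not taken from them (data copied from the question):
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'subscription': [1, 1, 2, 3, 4, 5],
    'metric': ['enter', 'stay', 'leave', 'enter', 'leave', 'enter'],
    'date': ['2020-01-01', '2020-01-01', '2020-03-01',
             '2020-01-01', '2020-01-01', '2020-01-02']})

g = df.groupby(['user_id', 'date'])
pos_from_end = g.cumcount(ascending=False)            # 0 = last row of its group
last_is_leave = g['metric'].transform('last').eq('leave')

# keep the last row of every group, plus the row just before a trailing 'leave'
df[pos_from_end.eq(0) | (pos_from_end.eq(1) & last_is_leave)]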
I have a DataFrame (df) that looks like the following:
+----------+----+
| dd_mm_yy | id |
+----------+----+
| 01-03-17 | A |
| 01-03-17 | B |
| 01-03-17 | C |
| 01-05-17 | B |
| 01-05-17 | D |
| 01-07-17 | A |
| 01-07-17 | D |
| 01-08-17 | C |
| 01-09-17 | B |
| 01-09-17 | B |
+----------+----+
This the end result i would like to compute:
+----------+----+-----------+
| dd_mm_yy | id | cum_count |
+----------+----+-----------+
| 01-03-17 | A | 1 |
| 01-03-17 | B | 1 |
| 01-03-17 | C | 1 |
| 01-05-17 | B | 2 |
| 01-05-17 | D | 1 |
| 01-07-17 | A | 2 |
| 01-07-17 | D | 2 |
| 01-08-17 | C | 1 |
| 01-09-17 | B | 2 |
| 01-09-17 | B | 3 |
+----------+----+-----------+
Logic
Calculate the cumulative occurrences of values in id, but within a specified time window, for example 4 months; i.e. every 5th month the counter resets to one.
To get the cumulative occurrences we can use df.groupby('id').cumcount() + 1.
Focusing on id = B, we see that the 2nd occurrence of B is after 2 months, so cum_count = 2. The next occurrence of B is at 01-09-17; looking back 4 months we only find one other occurrence, so cum_count = 2, etc.
My approach is to call a helper function from df.groupby('id').transform. I feel this is more complicated and slower than it could be, but it seems to work.
# test data
date id cum_count_desired
2017-03-01 A 1
2017-03-01 B 1
2017-03-01 C 1
2017-05-01 B 2
2017-05-01 D 1
2017-07-01 A 2
2017-07-01 D 2
2017-08-01 C 1
2017-09-01 B 2
2017-09-01 B 3
# preprocessing
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Encode the ID strings to numbers to have a column
# to work with after grouping by ID
df['id_code'] = pd.factorize(df['id'])[0]
# solution
def cumcounter(x):
    y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
    gr = x.groupby('date')
    adjust = gr.rank(method='first') - gr.size()
    y += adjust
    return y

df['cum_count'] = df.groupby('id')['id_code'].transform(cumcounter)
# output
df[['id', 'id_code', 'cum_count_desired', 'cum_count']]
id id_code cum_count_desired cum_count
date
2017-03-01 A 0 1 1
2017-03-01 B 1 1 1
2017-03-01 C 2 1 1
2017-05-01 B 1 2 2
2017-05-01 D 3 1 1
2017-07-01 A 0 2 2
2017-07-01 D 3 2 2
2017-08-01 C 2 1 1
2017-09-01 B 1 2 2
2017-09-01 B 1 3 3
The need for adjust
If the same ID occurs multiple times on the same day, the slicing approach that I use will overcount each of the same-day IDs, because the date-based slice immediately grabs all of the same-day values when the list comprehension encounters the date on which multiple IDs show up. Fix:
Group the current DataFrame by date.
Rank each row in each date group.
Subtract from these ranks the total number of rows in each date group. This produces a date-indexed Series of ascending negative integers, ending at 0.
Add these non-positive integer adjustments to y.
This only affects one row in the given test data -- the second-last row, because B appears twice on the same day.
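A minimal sketch of what adjust looks like for the B rows of the test data (values and dates copied from the example above):
import pandas as pd

# id_code for the four B rows (all the same code), indexed by their dates
b = pd.Series([1, 1, 1, 1],
              index=pd.to_datetime(['2017-03-01', '2017-05-01',
                                    '2017-09-01', '2017-09-01']).rename('date'))
gr = b.groupby('date')
adjust = gr.rank(method='first') - gr.size()
adjust
# date
# 2017-03-01    0.0
# 2017-05-01    0.0
# 2017-09-01   -1.0
# 2017-09-01    0.0
# dtype: float64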
Including or excluding the left endpoint of the time interval
To count rows as old as or newer than 4 calendar months ago, i.e., to include the left endpoint of the 4-month time interval, leave this line unchanged:
y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
To count rows strictly newer than 4 calendar months ago, i.e., to exclude the left endpoint of the 4-month time interval, use this instead:
y = [x.loc[d - pd.DateOffset(months=4, days=-1):d].count() for d in x.index]
You can extend the groupby with a grouper:
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount()
Out[48]:
date id cum_count
0 2017-03-01 A 0
1 2017-03-01 B 0
2 2017-03-01 C 0
3 2017-05-01 B 0
4 2017-05-01 D 0
5 2017-07-01 A 0
6 2017-07-01 D 1
7 2017-08-01 C 0
8 2017-09-01 B 0
9 2017-09-01 B 1
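Two caveats worth noting: cumcount is zero-based, and pd.Grouper(freq='4M') buckets the dates into fixed four-month calendar bins rather than looking back four months from each row, so rows near a bin boundary (e.g. A on 2017-07-01) come out differently from the desired cum_count. Adding 1 at least aligns the numbering (still assuming date has already been parsed to datetime):
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount() + 1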
We can make use of .apply row-wise to work on a sliced df as well. The slice is based on relativedelta from dateutil.
from dateutil.relativedelta import relativedelta
import pandas as pd

def get_cum_sum(slice, row):
    if slice.shape[0] == 0:
        return 1
    return slice[slice['id'] == row.id].shape[0]

d = {'dd_mm_yy': ['01-03-17','01-03-17','01-03-17','01-05-17','01-05-17','01-07-17','01-07-17','01-08-17','01-09-17','01-09-17'],
     'id': ['A','B','C','B','D','A','D','C','B','B']}
df = pd.DataFrame(data=d)
df['dd_mm_yy'] = pd.to_datetime(df['dd_mm_yy'], format='%d-%m-%y')
df['cum_sum'] = df.apply(lambda current_row: get_cum_sum(df[(df.index <= current_row.name) & (df.dd_mm_yy >= (current_row.dd_mm_yy - relativedelta(months=+4)))], current_row), axis=1)
>>> df
dd_mm_yy id cum_sum
0 2017-03-01 A 1
1 2017-03-01 B 1
2 2017-03-01 C 1
3 2017-05-01 B 2
4 2017-05-01 D 1
5 2017-07-01 A 2
6 2017-07-01 D 2
7 2017-08-01 C 1
8 2017-09-01 B 2
9 2017-09-01 B 3
I was wondering whether .rolling could be used, but months are not a fixed period, so it might not work.
Say I have this DataFrame, and I want every unique user ID to have its own rank value based on the date stamp:
In [93]:
df = pd.DataFrame({
'userid':['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'date_stamp':['2016-02-01', '2016-02-01', '2016-02-04', '2016-02-08', '2016-02-04', '2016-02-10', '2016-02-10', '2016-02-12'],
'tie_breaker':[1,2,3,4,1,2,3,4]})
df['date_stamp'] = df['date_stamp'].map(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
df['rank'] = df.groupby(['userid'])['date_stamp'].rank(ascending=True, method='min')
df
Out[93]:
date_stamp tie_breaker userid rank
0 2016-02-01 1 a 1
1 2016-02-01 2 a 1
2 2016-02-04 3 a 3
3 2016-02-08 4 a 4
4 2016-02-04 1 b 1
5 2016-02-10 2 b 2
6 2016-02-10 3 b 2
7 2016-02-12 4 b 4
So that's fine, but what if I wanted to add another field to serve as a tie-breaker when there are two of the same dates? I was hoping something would be as easy as:
df['rank'] = df.groupby(['userid'])[['date_stamp','tie_breaker']].rank(ascending=True, method='min')
but that doesn't work - any ideas?
ideal output:
date_stamp tie_breaker userid rank
0 2/1/16 1 a 1
1 2/1/16 2 a 2
2 2/4/16 3 a 3
3 2/8/16 4 a 4
4 2/4/16 1 b 1
5 2/10/16 2 b 2
6 2/10/16 3 b 3
7 2/12/16 4 b 4
Edited to have real data
Looks like the top solution here doesn't handle zeros in the tie_breaker field correctly - any idea what's going on?
df = pd.DataFrame({
'userid':['10010012083198581013', '10010012083198581013', '10010012083198581013', '10010012083198581013','10010012083198581013'],
'date_stamp':['2015-12-26 13:24:37', '2015-11-25 11:24:13', '2015-10-25 12:13:59', '2015-02-16 22:59:58','2015-08-17 11:43:43'],
'tie_breaker':[460000156735858, 460000152444239, 460000147374709, 11083155016444116916,0]})
df['date_stamp'] = df['date_stamp'].map(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
df['userid'] = df['userid'].astype(str)
df['tie_breaker'] = df['tie_breaker'].astype(str)
def myrank(g):
    return pd.DataFrame(1 + np.lexsort((g['tie_breaker'].rank(),
                                        g['date_stamp'].rank())),
                        index=g.index)

df['rank'] = df.groupby(['userid']).apply(myrank)
df.sort_values('date_stamp')
Out[101]:
date_stamp tie_breaker userid rank
3 2015-02-16 11083155016444116916 10010012083198581013 2
4 2015-08-17 0 10010012083198581013 1
2 2015-10-25 460000147374709 10010012083198581013 3
1 2015-11-25 460000152444239 10010012083198581013 5
0 2015-12-26 460000156735858 10010012083198581013 4
With this dataframe:
df = pd.DataFrame({
'userid':['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'date_stamp':['2016-02-01', '2016-02-01', '2016-02-04', '2016-02-08',
'2016-02-04', '2016-02-10', '2016-02-10', '2016-02-12'],
'tie_breaker':[1,2,3,4,1,2,3,4]})
One way to do it is:
import numpy as np

def myrank(g):
    return pd.DataFrame(1 + np.lexsort((g['tie_breaker'].rank(),
                                        g['date_stamp'].rank())),
                        index=g.index)

df['rank'] = df.groupby(['userid']).apply(myrank)
Output:
date_stamp tie_breaker userid rank
0 2016-02-01 1 a 1
1 2016-02-01 2 a 2
2 2016-02-04 3 a 3
3 2016-02-08 4 a 4
4 2016-02-04 1 b 1
5 2016-02-10 2 b 2
6 2016-02-10 3 b 3
7 2016-02-12 4 b 4
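About the edited example with real data above: np.lexsort returns the positions that would sort each group (an argsort), not the rank of each row, so 1 + np.lexsort(...) only looks like a rank when the group happens to be sorted already; on the unsorted real data it produces the permutation you saw. Casting tie_breaker to str also makes ties order lexicographically rather than numerically. A sketch that avoids both issues by sorting on both keys and numbering the rows within each user (keeping tie_breaker numeric):
import pandas as pd

df = pd.DataFrame({
    'userid': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'date_stamp': pd.to_datetime(['2016-02-01', '2016-02-01', '2016-02-04', '2016-02-08',
                                  '2016-02-04', '2016-02-10', '2016-02-10', '2016-02-12']),
    'tie_breaker': [1, 2, 3, 4, 1, 2, 3, 4]})

# sort by date then tie_breaker, and number the rows within each user in that order
df['rank'] = (df.sort_values(['date_stamp', 'tie_breaker'])
                .groupby('userid')
                .cumcount() + 1)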
I have this pandas dataframe:
SourceDomain 1 2 3
0 www.theguardian.com profile.theguardian.com 1 Directed
1 www.theguardian.com membership.theguardian.com 2 Directed
2 www.theguardian.com subscribe.theguardian.com 3 Directed
3 www.theguardian.com www.google.co.uk 4 Directed
4 www.theguardian.com jobs.theguardian.com 5 Directed
I would like to add a new column which is a pandas series created like this:
Weights = Weights.value_counts()
However, when I try to add the new column using edgesFile[4] = Weights it fills it with NA instead of the values:
SourceDomain 1 2 3 4
0 www.theguardian.com profile.theguardian.com 1 Directed NaN
1 www.theguardian.com membership.theguardian.com 2 Directed NaN
2 www.theguardian.com subscribe.theguardian.com 3 Directed NaN
3 www.theguardian.com www.google.co.uk 4 Directed NaN
4 www.theguardian.com jobs.theguardian.com 5 Directed NaN
How can I add the new column keeping the values?
Thanks!
Dani
You are getting NaNs because the index of Weights does not match up with the index of edgesFile. If you want Pandas to ignore Weights.index and just paste the values in order then pass the underlying NumPy array instead:
edgesFile[4] = Weights.values
Here is an example which demonstrates the difference:
In [14]: df = pd.DataFrame(np.arange(4)*10, index=list('ABCD'))
In [15]: df
Out[15]:
0
A 0
B 10
C 20
D 30
In [16]: s = pd.Series(np.arange(4), index=list('CDEF'))
In [17]: s
Out[17]:
C 0
D 1
E 2
F 3
dtype: int64
Here we see Pandas aligning the index:
In [18]: df[4] = s
In [19]: df
Out[19]:
0 4
A 0 NaN
B 10 NaN
C 20 0
D 30 1
Here, Pandas simply pastes the values in s into the column:
In [20]: df[4] = s.values
In [21]: df
Out[21]:
0 4
A 0 0
B 10 1
C 20 2
D 30 3
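In recent pandas versions, Series.to_numpy() is the preferred spelling for the same thing:
df[4] = s.to_numpy()   # same as s.values: ignore the index, paste by position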
This is a small example of your question:
You can add a new column with a column name to an existing DataFrame:
>>> df = DataFrame([[1,2,3],[4,5,6]], columns = ['A', 'B', 'C'])
>>> df
A B C
0 1 2 3
1 4 5 6
>>> s = Series([7,8,9])
>>> s
0 7
1 8
2 9
>>> df['D']=s
>>> df
A B C D
0 1 2 3 7
1 4 5 6 8
Or, you can make a DataFrame from the Series and then concat:
>>> df = DataFrame([[1,2,3],[4,5,6]])
>>> df
0 1 2
0 1 2 3
1 4 5 6
>>> s = DataFrame(Series([7,8]))  # if you don't provide a column name, the default name will be 0
>>> s
0
0 7
1 8
>>> df = pd.concat([df,s], axis=1)
>>> df
0 1 2 0
0 1 2 3 7
1 4 5 6 8
Hope this will help