I have read various posts with similar questions but couldn't find exactly this one.
I have two pandas DataFrames that I want to merge. They have timestamps as their indexes.
The 2nd DataFrame basically overlaps the 1st, so they share rows with the same timestamps and values.
I would like to drop these duplicated rows because they share everything: the index and the values in all columns.
Rows that don't share both the index and the column values should be kept.
So far, I have looked at:
Index.drop_duplicates: this is not what I am looking for. It doesn't check that the column values are the same, and I want to keep rows with the same timestamp but different column values.
DataFrame.drop_duplicates: same problem the other way around; it doesn't look at the index, and if rows have the same column values but different indexes, I want to keep them.
To give an example, I am re-using the data given in the answer below.
df1
Value
2012-02-01 12:00:00 10
2012-02-01 12:30:00 10
2012-02-01 13:00:00 20
2012-02-01 13:30:00 30
df2
Value
2012-02-01 12:30:00 20
2012-02-01 13:00:00 20
2012-02-01 13:30:00 30
2012-02-02 14:00:00 10
The result I would like to obtain is the following:
Value
2012-02-01 12:00:00 10 #(from df1)
2012-02-01 12:30:00 10 #(from df1)
2012-02-01 12:30:00 20 #(from df2 - same index as in df1, but different value)
2012-02-01 13:00:00 20 #(in df1 & df2, only one kept)
2012-02-01 13:30:00 30 #(in df1 & df2, only one kept)
2012-02-02 14:00:00 10 #(from df2)
Any idea how to do this?
Thanks for your help!
Assume that you have the 2 following DataFrames:
df:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 10
2 2012-02-01 13:00:00 20
3 2012-02-01 13:30:00 30
4 2012-02-02 14:00:00 10
5 2012-02-02 14:30:00 10
6 2012-02-02 15:00:00 20
7 2012-02-02 15:30:00 30
df2:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 21
2 2012-02-01 12:40:00 22
3 2012-02-01 13:00:00 20
4 2012-02-01 13:30:00 30
To generate the result, run:
pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates().reset_index(drop=True)
The result, for the above data, is:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 10
2 2012-02-01 12:30:00 21
3 2012-02-01 12:40:00 22
4 2012-02-01 13:00:00 20
5 2012-02-01 13:30:00 30
6 2012-02-02 14:00:00 10
7 2012-02-02 14:30:00 10
8 2012-02-02 15:00:00 20
9 2012-02-02 15:30:00 30
drop_duplicates drops duplicated rows, keeping the first occurrence.
Since no subset parameter has been passed, two rows are treated as
duplicates when all of their columns are identical.
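Since the question has the timestamps in the index rather than in a Date column, here is a minimal sketch of the same idea, assuming the question's df1 and df2 with an unnamed DatetimeIndex (so reset_index produces a column called 'index'):
import pandas as pd

merged = (
    pd.concat([df1, df2])
      .reset_index()          # move the timestamp index into a regular column
      .drop_duplicates()      # duplicate = same timestamp AND same column values
      .set_index('index')     # restore the timestamps as the index
      .sort_index()
)
If the index has a name, use that name in set_index instead of 'index'.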
Just improving on the first answer: pass Date to drop_duplicates
pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates('Date').reset_index(drop=True)
EDIT: My main goal is to avoid a for loop and to find a way of grouping the data efficiently/fast.
I am trying to solve a problem which is about grouping together different rows of data based on an ID and a 30-day time window.
I have the following example data:
ID     Time
12345  2021-01-01 14:00:00
12345  2021-01-15 14:00:00
12345  2021-01-29 14:00:00
12345  2021-02-15 14:00:00
12345  2021-02-16 14:00:00
12345  2021-03-15 14:00:00
12345  2021-04-24 14:00:00
12344  2021-01-24 14:00:00
12344  2021-01-25 14:00:00
12344  2021-04-24 14:00:00
And I would like to have the following data:
ID     Time                 Group
12345  2021-01-01 14:00:00  1
12345  2021-01-15 14:00:00  1
12345  2021-01-29 14:00:00  1
12345  2021-02-15 14:00:00  2
12345  2021-02-16 14:00:00  2
12345  2021-03-15 14:00:00  3
12345  2021-04-24 14:00:00  4
12344  2021-01-24 14:00:00  5
12344  2021-01-25 14:00:00  5
12344  2021-04-24 14:00:00  6
(The group numbers do not have to continue across IDs: since ID 12344 starts its own set of groups, its groups 5 and 6 could just as well be labelled 1 and 2.)
I can then differentiate based on the ID column, so the Group value does not need to be unique, but it can be.
The most important part is to separate the data by ID, then check all the rows for each ID and assign an ID to each 30-day time window. By 30-day time window I mean that, e.g., the first time frame for ID 12345 starts at 2021-01-01 and goes up to 2021-01-31 (this should be group 1), and the second time frame for ID 12345 starts at 2021-02-01 and goes up to 2021-03-02 (30 days again).
The problem I faced with the following code is that it starts the windows at the first date it finds in the dataframe:
grouped_data = df.groupby(["ID",pd.Grouper(key = "Time", freq = "30D")]).count()
In the above code I have just tried to count the rows (which wouldn't give me the Group, but I have tried to group it with my logic).
I hope someone can help me with this, because I have tried so many different things and nothing has worked. I have already tried the following (but maybe incorrectly):
pd.rolling()
pd.Grouper()
for loop
etc.
I really don't want to use a for loop as I have 1.5 million rows.
I have tried to vectorize the for loop, but I am not really familiar with vectorization and struggled to convert it.
Please let me know if I can use pd.Grouper differently so that I get these results. Thanks in advance.
For arbitrary windows you can use pandas.cut.
E.g., for 30-day bins starting at 2021-01-01 00:00:00 and covering the entirety of 2021, you can use:
bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)
group will label each row with an interval, which you can then group on, etc. If you want the groups to have labels 0, 1, 2, ... you can map the values with:
dict(zip(group.unique(), range(group.nunique())))
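Putting the two snippets above together, a minimal sketch (column names are taken from the question; the integer labels follow order of first appearance, as noted):
import pandas as pd

bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)
# map each observed interval to 0, 1, 2, ... in order of first appearance
df["Group"] = group.map(dict(zip(group.unique(), range(group.nunique()))))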
EDIT: an approach where the windows are disjoint 30-day intervals, each starting at a time taken from the Time column:
times = df["Time"].sort_values()
ii = pd.IntervalIndex.from_arrays(times, times+pd.Timedelta("30 days"))
disjoint_intervals = []
prev_interval = None
for i, interval in enumerate(ii):
    if prev_interval is None or interval.left >= prev_interval.right:  # no overlap
        prev_interval = interval
        disjoint_intervals.append(i)
bins = ii[disjoint_intervals]
group = pd.cut(df["Time"], bins)
Apologies, this is not a vectorised approach; I'm struggling to think whether one could exist.
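For completeness, a sketch of how the same idea could be applied per ID, as the question asks. The helper name make_bins is made up here, and closed="left" is used so that the timestamp opening a window falls inside its own window:
import pandas as pd

def make_bins(times):
    # greedily build disjoint 30-day windows, each starting at an observed time
    times = times.sort_values()
    ii = pd.IntervalIndex.from_arrays(times, times + pd.Timedelta("30 days"),
                                      closed="left")
    keep, prev = [], None
    for i, interval in enumerate(ii):
        if prev is None or interval.left >= prev.right:  # no overlap
            prev = interval
            keep.append(i)
    return ii[keep]

# label rows within each ID; the labels restart at 0 for every ID
df["Group"] = df.groupby("ID")["Time"].transform(
    lambda s: pd.cut(s, make_bins(s)).cat.codes
)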
SOLUTION:
The solution which worked for me is the following:
I imported the sampleData from Excel into a dataframe. The data looks like this:
ID     Time
12345  2021-01-01 14:00:00
12345  2021-01-15 14:00:00
12345  2021-01-29 14:00:00
12345  2021-02-15 14:00:00
12345  2021-02-16 14:00:00
12345  2021-03-15 14:00:00
12345  2021-04-24 14:00:00
12344  2021-01-24 14:00:00
12344  2021-01-25 14:00:00
12344  2021-04-24 14:00:00
Then I have used the following steps:
Import the data:
df_test = pd.read_excel(r"sampleData.xlsx")
Order the dataframe so we have the correct order of ID and Time:
df_test_ordered = df_test.sort_values(["ID","Time"])
df_test_ordered = df_test_ordered.reset_index(drop=True)
I also reset the index and dropped the old one, as it interfered with my calculations later on.
Create a column with the time difference to the previous row:
df_test_ordered.loc[df_test_ordered["ID"] == df_test_ordered["ID"].shift(1),"time_diff"] = df_test_ordered["Time"] - df_test_ordered["Time"].shift(1)
Transform timedelta64[ns] to timedelta64[D]:
df_test_ordered["time_diff"] = df_test_ordered["time_diff"].astype("timedelta64[D]")
Calculate the cumsum per ID:
df_test_ordered["cumsum"] = df_test_ordered.groupby("ID")["time_diff"].transform(pd.Series.cumsum)
Fill the NaN values (forward fill and then backfill, so each NaN takes a neighbouring value):
df_final = df_test_ordered.ffill().bfill()
Create the window by dividing the cumulative sum by 30 (the 30-day time period):
df_final["Window"] = df_final["cumsum"] / 30
df_final["Window_int"] = df_final["Window"].astype(int)
The "Window_int" column is now a kind of ID (not unique; but unique within the groups of column "ID").
Furthermore, I needed to backfill the dataframe because there were NaN values: the time difference is only calculated when the previous row has the same ID, otherwise it is set to NaN. Backfilling simply replaces such a NaN with the next time difference, which makes no difference mathematically and assigns the correct value.
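For reference, the same idea can be written more compactly with groupby. This is only a sketch, not a drop-in replacement: it fills the first gap of each ID with 0 instead of backfilling, and it uses .dt.days because astype("timedelta64[D]") is rejected by recent pandas versions, so the Window values can differ slightly from the table below:
import pandas as pd

df = df_test.sort_values(["ID", "Time"]).reset_index(drop=True)
# day gap to the previous row of the same ID (0 for the first row of each ID)
day_gaps = df.groupby("ID")["Time"].diff().dt.days.fillna(0)
# cumulative days elapsed per ID, bucketed into 30-day windows
df["Window_int"] = (day_gaps.groupby(df["ID"]).cumsum() // 30).astype(int)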
Solution dataframe:
ID Time time_diff cumsum Window Window_int
0 12344 2021-01-24 14:00:00 1.0 1.0 0.032258 0
1 12344 2021-01-25 14:00:00 1.0 1.0 0.032258 0
2 12344 2021-04-24 14:00:00 89.0 90.0 2.903226 2
3 12345 2021-01-01 14:00:00 14.0 14.0 0.451613 0
4 12345 2021-01-15 14:00:00 14.0 14.0 0.451613 0
5 12345 2021-01-29 14:00:00 14.0 28.0 0.903226 0
6 12345 2021-02-15 14:00:00 17.0 45.0 1.451613 1
7 12345 2021-02-16 14:00:00 1.0 46.0 1.483871 1
8 12345 2021-03-15 14:00:00 27.0 73.0 2.354839 2
9 12345 2021-04-24 14:00:00 40.0 113.0 3.645161 3
Suppose we have two dataframes, one with a timestamp and the other with start and end timestamps. df1 and df2 as:
df1:
DateTime             Value1
2020-01-11 12:30:00  1
2020-01-11 13:00:00  2
2020-02-11 13:30:00  3
2020-02-11 14:00:00  4
2020-02-11 14:30:00  5
2020-02-11 15:00:00  6
2020-02-11 15:30:00  7
2020-02-11 16:00:00  8

df2:
StartDateTime        EnddDateTime         Value2
2020-01-11 12:23:12  2020-01-11 13:10:00  a
2020-01-11 14:12:20  2020-01-11 14:20:34  b
2020-01-11 15:20:00  2020-01-11 15:28:10  c
2020-01-11 15:45:20  2020-01-11 16:26:23  d
Each timestamp in df1 represents a half-hour period starting from the time in the DateTime column. I want to match the df2 start and end times with these 30-minute periods. A value of df2 may fall in two rows of df1 if its period (the time between start and end) overlaps two DateTime periods of df1, even by only one second. The outcome should be a dataframe as below.
DateTime             Value1  Value2
2020-01-11 12:30:00  1       a
2020-01-11 13:00:00  2       a
2020-02-11 13:30:00  3       NaN
2020-02-11 14:00:00  4       b
2020-02-11 14:30:00  5       NaN
2020-02-11 15:00:00  6       c
2020-02-11 15:30:00  7       d
2020-02-11 16:00:00  8       d
Any suggestions to efficiently merge large data?
There may be shorter, better answers out there, because I am going longhand.
First, melt the second data frame:
df3=pd.melt(df2, id_vars=['Value2'], value_vars=['StartDateTime', 'EnddDateTime'],value_name='DateTime').sort_values(by='DateTime')
Create temp columns on both dfs. The reason is that you want to take the time from the datetime and append it to a uniform date, to be used in the merge:
df1['DateTime1']=pd.Timestamp('today').strftime('%Y-%m-%d') + ' ' +pd.to_datetime(df1['DateTime']).dt.time.astype(str)
df3['DateTime1']=pd.Timestamp('today').strftime('%Y-%m-%d') + ' ' +pd.to_datetime(df3['DateTime']).dt.time.astype(str)
Convert the new columns computed above to datetime:
df3["DateTime1"]=pd.to_datetime(df3["DateTime1"])
df1["DateTime1"]=pd.to_datetime(df1["DateTime1"])
Finally, merge_asof with a time tolerance:
final = pd.merge_asof(df1, df3, on="DateTime1", tolerance=pd.Timedelta("39min"),
                      suffixes=('_', '_df2')).drop(columns=['DateTime1', 'variable', 'DateTime_df2'])
DateTime_ Value1 Value2
0 2020-01-11 13:00:00 2 a
1 2020-02-11 13:30:00 3 a
2 2020-02-11 14:00:00 4 NaN
3 2020-02-11 14:30:00 5 b
4 2020-02-11 15:00:00 6 NaN
5 2020-02-11 15:30:00 7 c
6 2020-02-11 16:00:00 8 d
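As a brute-force cross-check (quadratic in the number of rows, so not suited to large data, but it makes the overlap rule explicit), one can build explicit intervals. Column names are taken from the question, the 30-minute slot length is assumed from its text, and both DateTime columns are assumed to already be datetime dtype:
import pandas as pd

# 30-minute slots of df1 and the [start, end] spans of df2
slots = pd.IntervalIndex.from_arrays(
    df1["DateTime"], df1["DateTime"] + pd.Timedelta(minutes=30), closed="left")
spans = pd.IntervalIndex.from_arrays(
    df2["StartDateTime"], df2["EnddDateTime"], closed="both")

# for every slot, collect the Value2 labels of all overlapping spans
df1["Value2"] = [
    ", ".join(df2.loc[spans.overlaps(slot), "Value2"]) or None
    for slot in slots
]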
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
The Hora_Retiro column is of timedelta64[ns] type.
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index starts at 00:00:02, and I want it to start at 00:00:00 and then go in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column:
df['hour'] = df['Hora_Retiro'].dt.hour
And then group by hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
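One caveat: if Hora_Retiro really is of timedelta64[ns] dtype, as stated in the question, the .dt.hour accessor is not available for timedeltas; the hour part can be taken from .dt.components instead (a small assumed variation on the snippet above):
# hours component of a timedelta column (assumes the values stay below 24 h,
# as in the question's data)
df['hour'] = df['Hora_Retiro'].dt.components.hours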
I assume that the Hora_Retiro column in your DataFrame is of
Timedelta type. It is not datetime, as in that case the date part
would also be printed.
Indeed, your code creates groups starting at the minute / second
taken from the first row.
To group by "full hours":
round each element in this column to hour,
then group (just by this rounded value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.apply(
    lambda tt: tt.round('H'))).count_uses.count()
However, I advise you to decide what you want to count:
rows, or the values in the count_uses column.
In the second case, replace the count function with sum.
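Note also that round('H') sends, e.g., 08:40:00 to 09:00:00. If the goal is to count everything from 08:00:00 up to 08:59:59 in the 08:00:00 bucket, flooring (a small variation on the snippet above) may be closer to what is asked:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H'))['count_uses'].sum()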
All, I'm a newbie to Python and am stuck on the problem below. I have a DF as:
ipdb> DF
asofdate port_id
1 2010-01-01 76
2 2010-04-01 43
3 2011-02-01 76
4 2013-01-02 93
5 2017-02-01 43
For the yearly gaps, i.e. 2012, 2014, 2015, and 2016, I'd like to fill in each gap using the New Year date of the missing year and the port_id from the previous year. Ideally, I'd like:
ipdb> DF
asofdate port_id
1 2010-01-01 76
2 2010-04-01 43
3 2011-02-01 76
4 2012-01-01 76
5 2013-01-02 93
6 2014-01-01 93
7 2015-01-01 93
8 2016-01-01 93
9 2017-02-01 43
I tried multiple approaches but to no avail. Could some expert shed some light on how to make this work? Thanks much in advance!
You can use set difference with range to find the missing years and then append a dataframe:
# convert to datetime if not already converted
df['asofdate'] = pd.to_datetime(df['asofdate'])
# calculate missing years
years = df['asofdate'].dt.year
missing = set(range(years.min(), years.max())) - set(years)
# append dataframe, sort and front-fill
df = df.append(pd.DataFrame({'asofdate': pd.to_datetime(list(missing), format='%Y')}))\
.sort_values('asofdate')\
.ffill()
print(df)
asofdate port_id
1 2010-01-01 76.0
2 2010-04-01 43.0
3 2011-02-01 76.0
1 2012-01-01 76.0
4 2013-01-02 93.0
2 2014-01-01 93.0
3 2015-01-01 93.0
0 2016-01-01 93.0
5 2017-02-01 43.0
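A side note: DataFrame.append was removed in pandas 2.0, so on recent versions the same append step can be written with pd.concat (everything else unchanged):
import pandas as pd

missing_df = pd.DataFrame({'asofdate': pd.to_datetime(list(missing), format='%Y')})
df = (pd.concat([df, missing_df])
        .sort_values('asofdate')
        .ffill())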
I would create a helper dataframe, containing all the year start dates, then filter out the ones where the years match what is in df, and finally merge them together:
# First make sure it is proper datetime
df['asofdate'] = pd.to_datetime(df.asofdate)
# Create your temporary dataframe of year start dates
helper = pd.DataFrame({'asofdate':pd.date_range(df.asofdate.min(), df.asofdate.max(), freq='YS')})
# Filter out the rows where the year is already in df
helper = helper[~helper.asofdate.dt.year.isin(df.asofdate.dt.year)]
# Merge back in to df, sort, and forward fill
new_df = df.merge(helper, how='outer').sort_values('asofdate').ffill()
>>> new_df
asofdate port_id
0 2010-01-01 76.0
1 2010-04-01 43.0
2 2011-02-01 76.0
5 2012-01-01 76.0
3 2013-01-02 93.0
6 2014-01-01 93.0
7 2015-01-01 93.0
8 2016-01-01 93.0
4 2017-02-01 43.0
I need to reshape a dataframe that looks like df1 and turn it into df2. There are 2 considerations for this procedure:
I need to be able to set the number of rows to be sliced as a parameter (length).
I need to split date and time from the index, and use date in the reshape as the column names and keep time as the index.
Current df1
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10
Desired Output df2 - With the parameter 'length=5'
2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
What I have done:
My approach was to create a multi-index (Date - Time) and then do a pivot table or some sort of reshape to achieve the desired df output.
import pandas as pd
'''
First separate time and date
'''
df['TimeStamp'] = df.index
df['date'] = df.index.date
df['time'] = df.index.time
'''
Then create a way to separate the slices and make those specific dates available in order to create a multi-index.
'''
for index, row in df.iterrows():
    df['Num'] = np.arange(len(df))

for index, row in df.iterrows():
    if row['Num'] % 5 == 0:
        df.loc[index, 'EventDate'] = df.loc[index, 'Date']
df.set_index(['EventDate', 'Hour'], inplace=True)
del df['Date']
del df['Num']
del df['TimeStamp']
Problem: a NaN appears next to each date in the first level of the multi-index. And even if that worked well, I can't figure out how to do what I need with a multi-index df.
I'm stuck. I appreciate any input.
import numpy as np
import pandas as pd
import io
data = '''\
val
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10'''
df = pd.read_table(io.StringIO(data), sep='\s{2,}', engine='python', parse_dates=True)
chunksize = 5
chunks = len(df)//chunksize
# label each block of `chunksize` rows with the date of its first row
df['Date'] = np.repeat(df.index.date[::chunksize], chunksize)[:len(df)]
# the times of the first block become the final index
index = df.index.time[:chunksize]
# position of each row within its block
df['Time'] = np.tile(np.arange(chunksize), chunks)
df = df.set_index(['Date', 'Time'], append=False)
df = df['val'].unstack('Date')
df.index = index
print(df)
yields
Date 2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
Note that the final DataFrame has an index with non-unique entries. (The
18:00:00 is repeated.) Some DataFrame operations are problematic when the
index has repeated entries, so in general it is better to avoid this if
possible.
First of all, I'm assuming your datetime column is actually of datetime type; if not, use df['t'] = pd.to_datetime(df['t']) to convert it.
Then set your index using a MultiIndex and unstack...
df.index = pd.MultiIndex.from_tuples(df['t'].apply(lambda x: [x.time(),x.date()]))
df['v'].unstack()
This would be a canonical approach for pandas:
First, setup with imports and data:
import pandas as pd
from io import StringIO
txt = '''2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10'''
Now read in the DataFrame, and pivot on the correct columns:
df1 = pd.read_csv(StringIO(txt), sep=' ',
                  names=['d', 't', 'n'])
print(df1.pivot(index='t', columns='d', values='n'))
prints a pivoted df:
d 2007-08-07 2007-08-08 2007-11-02 2007-11-03
t
00:00:00 NaN 2 NaN 7
06:00:00 NaN 3 NaN 8
12:00:00 NaN 4 NaN 9
18:00:00 1 5 6 10
You won't get a length of 5, though. The following,
2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
is incorrect, as 18:00:00 appears twice under the same date column, while in your initial data those two values belong to different dates.