Compare two dataframes and delete dates not present in both - python

I have two dataframes and want to compare them and delete the days in df2 which are not present in df1. I tried:
df2[~df2.Date.isin(df1.Date)]
but this does not work and I get an empty dataframe. df2 should end up looking like df1. The dataframes look like the following:
df1
Date
0 20-12-16
1 21-12-16
2 22-12-16
3 23-12-16
4 27-12-16
5 28-12-16
6 29-12-16
7 30-12-16
8 02-01-17
9 03-01-17
10 04-01-17
11 05-01-17
12 06-01-17
df2
Date
0 20-12-16
1 21-12-16
2 22-12-16
3 23-12-16
4 24-12-16
5 25-12-16
6 26-12-16
7 27-12-16
8 28-12-16
9 29-12-16
10 30-12-16
11 31-12-16
12 01-01-17
13 02-01-17
14 03-01-17
15 04-01-17
16 05-01-17
17 06-01-17

It seems the dtypes are different; for the comparison to work they need to be the same. Check them by:
print (df1.Date.dtype)
print (df2.Date.dtype)
and then convert if necessary:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
I add another two solutions - the first with numpy.in1d and the second with merge, which needs no mask because its default join is inner (note that merge produces a fresh 0-based index):
df = df2[np.in1d(df2.Date, df1.Date)]
print (df)
Date
0 2016-12-20
1 2016-12-21
2 2016-12-22
3 2016-12-23
7 2016-12-27
8 2016-12-28
9 2016-12-29
10 2016-12-30
13 2017-01-02
14 2017-01-03
15 2017-01-04
16 2017-01-05
17 2017-01-06
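As an aside, newer NumPy versions recommend np.isin over np.in1d (which is deprecated there); a sketch of the same mask with it:
import numpy as np

# same boolean membership test as np.in1d, using the non-deprecated API
df = df2[np.isin(df2.Date, df1.Date)]
print (df)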
df = df1.merge(df2, on='Date')
print (df)
Date
0 2016-12-20
1 2016-12-21
2 2016-12-22
3 2016-12-23
4 2016-12-27
5 2016-12-28
6 2016-12-29
7 2016-12-30
8 2017-01-02
9 2017-01-03
10 2017-01-04
11 2017-01-05
12 2017-01-06
Sample:
d1 = {'Date': ['20-12-16', '21-12-16', '22-12-16', '23-12-16', '27-12-16', '28-12-16', '29-12-16', '30-12-16', '02-01-17', '03-01-17', '04-01-17', '05-01-17', '06-01-17']}
d2 = {'Date': ['20-12-16', '21-12-16', '22-12-16', '23-12-16', '24-12-16', '25-12-16', '26-12-16', '27-12-16', '28-12-16', '29-12-16', '30-12-16', '31-12-16', '01-01-17', '02-01-17', '03-01-17', '04-01-17', '05-01-17', '06-01-17']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
print (df1.Date.dtype)
object
print (df2.Date.dtype)
object
df1['Date'] = pd.to_datetime(df1['Date'], format='%d-%m-%y')
df2['Date'] = pd.to_datetime(df2['Date'], format='%d-%m-%y')
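With the dtypes aligned, the isin approach from the question now works as well; a quick check:
# keep only the rows of df2 whose Date also appears in df1
print (df2[df2.Date.isin(df1.Date)])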

Your mistake is one of logic. You want to select the dates of df2 that are in df1, so you should write
df2[df2.Date.isin(df1.Date)]
not its negation: the ~ operator inverts the boolean mask, so your version selects the rows whose dates are not in df1.
You could also obtain the same result with set operations (the double difference is just the intersection of the two date sets):
set(df2.Date) - (set(df2.Date) - set(df1.Date))
which can be turned back into a dataframe with:
pd.DataFrame(sorted(set(df2.Date) - (set(df2.Date) - set(df1.Date))), columns=["Date"])
Note that sets do not preserve order, hence the explicit sorted; you may prefer to handle the ordering in pandas instead.
df = pd.DataFrame(list(set(df2.Date) - (set(df2.Date) - set(df1.Date))), columns=["Date"])
df.Date = [date.date() for date in df.Date]
or
df.Date = df.Date.dt.date
(see How do I convert dates in a Pandas data frame to a 'date' data type?)
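For completeness, a minimal runnable sketch of the set approach with the question's column name (the small frames here are illustrative, not the question's full data):
import pandas as pd

df1 = pd.DataFrame({'Date': pd.to_datetime(['2016-12-20', '2016-12-21', '2016-12-27'])})
df2 = pd.DataFrame({'Date': pd.to_datetime(['2016-12-20', '2016-12-21', '2016-12-24', '2016-12-27'])})

# set(df2) minus (set(df2) minus set(df1)) is the intersection of the two date sets
common = set(df2.Date) - (set(df2.Date) - set(df1.Date))
df = pd.DataFrame(sorted(common), columns=['Date'])
print (df)   # 2016-12-20, 2016-12-21, 2016-12-27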

Related

Monthly climatology across several years, repeated for each day in that month over all years

I need to find the monthly climatology of some data that has daily values across several years. The code below sufficiently summarizes what I am trying to do. monthly_mean holds the averages over all years for specific months. I then need to assign that average in a new column for each day in a specific month over all of the years. For whatever reason, my assignment, df['A Climatology'] = group['A Climatology'], is only assigning values to the month of December. How can I make the assignment happen for all months?
data = np.random.randint(5,30,size=(365*3,3))
df = pd.DataFrame(data, columns=['A', 'B', 'C'], index=pd.date_range('2021-01-01', periods=365*3))
df['A Climatology'] = np.nan
monthly_mean = df['A'].groupby(df.index.month).mean()
for month, group in df.groupby(df.index.month):
    group['A Climatology'] = monthly_mean.loc[month]
    df['A Climatology'] = group['A Climatology']
df
Your code sets the whole column equal to the group, so on every iteration of the loop you overwrite the df's values with only that group's values. That is why your df ends up with only December filled in: it is the last month in the loop.
monthly_mean = df['A'].groupby(df.index.month).mean()
for month, group in df.groupby(df.index.month):
    df.loc[lambda df: df.index.month == month, 'A Climatology'] = monthly_mean.loc[month]
This way you directly set the df's values on the rows whose index month equals the loop's month, instead of overwriting the whole column each iteration.
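As an aside, the loop can be avoided entirely; a sketch with groupby().transform, which broadcasts each month's mean back to every row in one step (this replaces the loop, it is not part of the answer above):
import numpy as np
import pandas as pd

data = np.random.randint(5, 30, size=(365 * 3, 3))
df = pd.DataFrame(data, columns=['A', 'B', 'C'],
                  index=pd.date_range('2021-01-01', periods=365 * 3))

# transform returns one value per original row, so the per-month mean
# is written to every day of that month across all years
df['A Climatology'] = df['A'].groupby(df.index.month).transform('mean')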
merged_df = pd.merge(df,
                     monthly_mean,
                     how='left',
                     left_on=df.index.month,
                     right_on=monthly_mean.index).drop('key_0', axis=1).set_index(df.index)
A_x B C A Climatology A_y
2021-01-01 12 20 18 NaN 16.752688
2021-01-02 24 26 11 NaN 16.752688
2021-01-03 18 27 15 NaN 16.752688
2021-01-04 18 5 22 NaN 16.752688
2021-01-05 10 15 25 NaN 16.752688
... ... ... ... ... ...
2023-12-27 19 15 11 16.11828 16.118280
2023-12-28 16 23 25 16.11828 16.118280
2023-12-29 6 13 16 16.11828 16.118280
2023-12-30 10 9 14 16.11828 16.118280
2023-12-31 15 22 17 16.11828 16.118280
Or to do this without creating a new data frame:
df = df.reset_index().merge(monthly_mean, how='left', left_on=df.index.month, right_on=monthly_mean.index).set_index('index')
monthly_mean:
1 16.752688
2 16.476190
3 16.795699
4 17.111111
5 17.795699
6 18.111111
7 16.806452
8 15.236559
9 15.600000
10 18.279570
11 16.555556
12 16.118280
Name: A, dtype: float64

How to replace timestamp across the columns using pandas

df = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'time_1': ['2173/04/11 12:35:00', '2173/04/12 12:50:00', '2173/04/11 12:59:00', '2173/04/12 13:14:00'],
    'time_2': ['2173/04/12 16:35:00', '2173/04/13 18:50:00', '2173/04/13 22:59:00', '2173/04/21 17:14:00'],
    'val': [5, 5, 40, 40],
    'iid': [12, 12, 12, 12]
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['time_2'] = pd.to_datetime(df['time_2'])
df['day'] = df['time_1'].dt.day
Currently my dataframe looks as shown below.
I would like to replace the time in the time_1 column with 00:00:00 and in the time_2 column with 23:59:00.
This is what I tried, but it doesn't work:
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.datetime.strftime(x, "%H:%M:%S") == "00:00:00") #approach 1
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.pd.Timestamp(hour = '00', second = '00')) #approach 2
I expect my output dataframe to look as shown below.
Note that pandas hides the time component when every datetime in a column is midnight (00:00:00), which is why time_1 displays as a bare date.
Use Series.dt.floor or Series.dt.normalize to remove the times, and for the second column add a DateOffset:
df['time_1'] = pd.to_datetime(df['time_1']).dt.floor('d')
#alternative
#df['time_1'] = pd.to_datetime(df['time_1']).dt.normalize()
df['time_2']=pd.to_datetime(df['time_2']).dt.floor('d') + pd.DateOffset(hours=23, minutes=59)
df['day'] = df['time_1'].dt.day
print (df)
subject_id time_1 time_2 val iid day
0 1 2173-04-11 2173-04-12 23:59:00 5 12 11
1 1 2173-04-12 2173-04-13 23:59:00 5 12 12
2 2 2173-04-11 2173-04-13 23:59:00 40 12 11
3 2 2173-04-12 2173-04-21 23:59:00 40 12 12
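An equivalent sketch with dt.normalize and a plain Timedelta instead of floor/DateOffset (same result, different spelling; the small frame is illustrative):
import pandas as pd

df = pd.DataFrame({
    'time_1': ['2173/04/11 12:35:00', '2173/04/12 12:50:00'],
    'time_2': ['2173/04/12 16:35:00', '2173/04/13 18:50:00'],
})

# normalize() zeroes the time-of-day; adding a fixed Timedelta then sets 23:59
df['time_1'] = pd.to_datetime(df['time_1']).dt.normalize()
df['time_2'] = pd.to_datetime(df['time_2']).dt.normalize() + pd.Timedelta(hours=23, minutes=59)
print (df)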

how to compare two dates columns with common category in pandas?

I have a data set with rows of the same category, and I want to compare its two date columns within each category.
I want to see if DATE1 is less than the values in DATE2 of the same CATEGORY, and find the earliest DATE2 it is greater than.
I'm trying this, but I'm not getting the results that I am looking for:
df['test'] = np.where(m['DATE1'] < df['DATE2'], Y, N)
CATEGORY DATE1 DATE2 GREATERTHAN GREATERDATE
0 23 2015-01-18 2015-01-15 Y 2015-01-10
1 11 2015-02-18 2015-02-19 N 0
2 23 2015-03-18 2015-01-10 Y 2015-01-10
3 11 2015-04-18 2015-08-18 Y 2015-02-19
4 23 2015-05-18 2015-02-21 Y 2015-01-10
5 11 2015-06-18 2015-08-18 Y 2015-02-19
6 15 2015-07-18 2015-02-18 0 0
df['DATE1'] = pd.to_datetime(df['DATE1'])
df['DATE2'] = pd.to_datetime(df['DATE2'])
df['GREATERTHAN'] = np.where(df['DATE1'] > df['DATE2'], 'Y', 'N')
## Getting the earliest date for which data is available, per category
earliest_dates = df.groupby(['CATEGORY']).apply(lambda x: x['DATE1'].append(x['DATE2']).min()).to_frame()
## Merging to get the earliest date column per category
df.merge(earliest_dates, left_on = 'CATEGORY', right_on = earliest_dates.index, how = 'left')
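Note that Series.append was removed in pandas 2.0, so the groupby line above fails on recent versions; a sketch of the same computation with pd.concat (the tiny frame here is illustrative, not the question's data):
import pandas as pd

df = pd.DataFrame({
    'CATEGORY': [23, 11, 23],
    'DATE1': pd.to_datetime(['2015-01-18', '2015-02-18', '2015-03-18']),
    'DATE2': pd.to_datetime(['2015-01-15', '2015-02-19', '2015-01-10']),
})

# pd.concat replaces the removed Series.append; the rest is unchanged
earliest_dates = (df.groupby('CATEGORY')
                    .apply(lambda x: pd.concat([x['DATE1'], x['DATE2']]).min())
                    .to_frame('EARLIEST'))
df = df.merge(earliest_dates, left_on='CATEGORY', right_index=True, how='left')
print (df)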

How to make a title that would show the range of the date column?

I am writing the results of a pandas groupby to a txt file. I would like to add a sentence referring to the date range the info covers. Example:
data for date 12/09/2018 to 16/09/2018
dates user quantity
0 Sep user_05 23
1 Sep user_06 22
2 Sep user_06 23
3 Sep user_07 22
4 Sep user_11 22
5 Sep user_12 20
6 Sep user_20 34
7 Sep user_20 34
If I do this:
x['dates'].max()
gives:
Timestamp('2018-09-16 00:00:00')
and
x['dates'].min()
gives:
Timestamp('2018-09-12 00:00:00')
But how can I make it appear in a sentence before the results?
Use:
#sample data
rng = pd.date_range('2017-04-03', periods=10)
x = pd.DataFrame({'dates': rng, 'a': range(10)})
print (x)
dates a
0 2017-04-03 0
1 2017-04-04 1
2 2017-04-05 2
3 2017-04-06 3
4 2017-04-07 4
5 2017-04-08 5
6 2017-04-09 6
7 2017-04-10 7
8 2017-04-11 8
9 2017-04-12 9
#convert timestamps to strings
maxval = x['dates'].max().strftime('%d/%m/%Y')
minval = x['dates'].min().strftime('%d/%m/%Y')
#create sentence, 3.6+ solution
a = f'data for date {minval} to {maxval}'
#solution for Python below 3.6
a = 'data for date {} to {}'.format(minval, maxval)
print (a)
data for date 03/04/2017 to 12/04/2017
#write sentence to file
df1 = pd.Series(a)
df1.to_csv('output.csv', index=False, header=False)
#append DataFrame to file
x.to_csv('output.csv', mode='a', index=False)
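The same file can also be written through a single file handle, assuming the a and x from the snippet above:
# write the sentence first, then append the frame below it
with open('output.csv', 'w') as f:
    f.write(a + '\n')
x.to_csv('output.csv', mode='a', index=False)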

how to get the datetimes before and after some specific dates in Pandas?

I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col1, the x days before and x days after that date.
That means the resulting DataFrame will contain more rows (up to n(1 + 2x), where n is the original number of dates in col1, minus any overlaps).
How can I do that in a proper Pandonic way?
Output would be (for x=1)
col1
2015-02-01
2015-02-02
2015-02-03
2015-04-04
etc
Thanks!
you can do it this way, but I'm not sure that it's the best / fastest way to do it:
In [143]: df
Out[143]:
col1
0 2015-02-02
1 2015-04-05
2 2016-07-02
In [144]: %paste
N = 2
(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
                                                 x + pd.Timedelta(days=N))))
 .stack()
 .drop_duplicates()
 .reset_index(level=[0, 1], drop=True)
 .to_frame(name='col1')
)
## -- End pasted text --
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
Something like this takes a dataframe with a datetime.date column and then stacks another Series underneath it, containing timedelta-shifted copies of the original dates.
import datetime
import pandas as pd
df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)}, {'date': datetime.date(2016, 1, 1)}], columns=['date'])
df = pd.concat([df.date, df.date + datetime.timedelta(days=1)], ignore_index=True).to_frame()
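On newer pandas versions the same ±x-day expansion can also be written with explode; a sketch (x and the frame here are illustrative):
import pandas as pd

df = pd.DataFrame({'col1': pd.to_datetime(['2015-02-02', '2015-04-05', '2016-07-02'])})
x = 1

# build the list of dates in the ±x-day window per row, then flatten it
out = (df['col1']
       .apply(lambda d: list(pd.date_range(d - pd.Timedelta(days=x),
                                           d + pd.Timedelta(days=x))))
       .explode()
       .drop_duplicates()
       .sort_values()
       .reset_index(drop=True)
       .to_frame('col1'))
print (out)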
