Join two pandas dataframes with duplicate values [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have two tables
masterblok:
BLOCKID PLANTINGDATE PLANTED_HA
A001 01-JAN-08 13.86
A002 01-JAN-08 13.24
A002 31-MAR-18 1.99
A003 01-JAN-08 14.76
A003 31-MAR-18 2.48
pest_perperiod: (note that there are FIELDCODE values other than A002)
FIELDCODE PERIOD
A002 2019-01-01
A002 2019-02-01
A002 2019-03-01
A002 2019-04-01
A002 2019-05-01
I want to join the two dataframes so that each row in pest_perperiod gets one or more corresponding PLANTINGDATE values (like a cross join in SQL), so I can calculate the retention rate since the active month for each BLOCKID and PLANTINGDATE.
I tried (and also the other way around):
pest_perperiod.join(masterblok.set_index('BLOCKID'), on='FIELDCODE')
but it raised an error because duplicate values still exist. How can I do this?

I think you just want merge:
masterblok.merge(pest_perperiod, left_on='BLOCKID', right_on='FIELDCODE')
output:
BLOCKID PLANTINGDATE PLANTED_HA FIELDCODE PERIOD
0 A002 01-JAN-08 13.24 A002 2019-01-01
1 A002 01-JAN-08 13.24 A002 2019-02-01
2 A002 01-JAN-08 13.24 A002 2019-03-01
3 A002 01-JAN-08 13.24 A002 2019-04-01
4 A002 01-JAN-08 13.24 A002 2019-05-01
5 A002 31-MAR-18 1.99 A002 2019-01-01
6 A002 31-MAR-18 1.99 A002 2019-02-01
7 A002 31-MAR-18 1.99 A002 2019-03-01
8 A002 31-MAR-18 1.99 A002 2019-04-01
9 A002 31-MAR-18 1.99 A002 2019-05-01
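For reference, a minimal self-contained sketch (sample data abbreviated from the question): DataFrame.join is index-oriented and raised on the duplicated keys here, while merge performs a many-to-many, SQL-style inner join, so duplicate keys on either side simply multiply out.
import pandas as pd

# Abbreviated sample data from the question
masterblok = pd.DataFrame({
    'BLOCKID': ['A001', 'A002', 'A002'],
    'PLANTINGDATE': ['01-JAN-08', '01-JAN-08', '31-MAR-18'],
    'PLANTED_HA': [13.86, 13.24, 1.99],
})
pest_perperiod = pd.DataFrame({
    'FIELDCODE': ['A002', 'A002'],
    'PERIOD': ['2019-01-01', '2019-02-01'],
})

# Inner merge on the key columns: each A002 planting date pairs with each period
result = masterblok.merge(pest_perperiod, left_on='BLOCKID', right_on='FIELDCODE')
print(result)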

Related

How to create a new column based on a time interval?

I want to create a new column based on 6-hour time intervals from a datetime column. How can I do that?
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
0 A002 R051 02-00-00 59 ST NQR456W BMT 05/29/2021 00:00:00 REGULAR 7578734 2590325
1 A002 R051 02-00-00 59 ST NQR456W BMT 05/29/2021 04:00:00 REGULAR 7578740 2590327
2 A002 R051 02-00-00 59 ST NQR456W BMT 05/29/2021 08:00:00 REGULAR 7578749 2590340
3 A002 R051 02-00-00 59 ST NQR456W BMT 05/29/2021 12:00:00 REGULAR 7578789 2590386
4 A002 R051 02-00-00 59 ST NQR456W BMT 05/29/2021 16:00:00 REGULAR 7578897 259041
pandas has a floor function for datetimes:
df['DATETIME'].dt.floor('6H')
Note that the column needs to be of datetime type. Example output (original timestamps on the left, floored values on the right):
0 1
0 2021-06-06 00:00:00 2021-06-06 00:00:00
1 2021-06-06 01:00:00 2021-06-06 00:00:00
2 2021-06-06 02:00:00 2021-06-06 00:00:00
3 2021-06-06 03:00:00 2021-06-06 00:00:00
4 2021-06-06 04:00:00 2021-06-06 00:00:00
5 2021-06-06 05:00:00 2021-06-06 00:00:00
6 2021-06-06 06:00:00 2021-06-06 06:00:00
7 2021-06-06 07:00:00 2021-06-06 06:00:00
8 2021-06-06 08:00:00 2021-06-06 06:00:00
9 2021-06-06 09:00:00 2021-06-06 06:00:00
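If the timestamps are split across separate date and time columns, as in the sample, they need to be combined into a single datetime column first. A minimal sketch, assuming the DATE/TIME layout from the sample (DATETIME and BUCKET are names introduced here):
import pandas as pd

# Hypothetical frame with the DATE/TIME layout from the sample
df = pd.DataFrame({
    'DATE': ['05/29/2021', '05/29/2021', '05/29/2021'],
    'TIME': ['00:00:00', '04:00:00', '08:00:00'],
})

# Concatenate the two string columns and parse the result as datetimes
df['DATETIME'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'])

# Bucket each timestamp into its 6-hour interval
df['BUCKET'] = df['DATETIME'].dt.floor('6H')
print(df)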
If instead you want a new column whose date/time is offset by 6 hours from the DATETIME column, you can use pd.DateOffset, as follows:
df['New_DATETIME'] = pd.to_datetime(df['DATETIME']) + pd.DateOffset(hours=6)

Pandas grouping by start of the month with pd.Grouper

I have a DataFrame with hourly timestamps:
2019-01-01 0:00:00 1
2019-01-01 1:00:00 2
2019-01-11 3:00:00 1
2019-01-21 4:00:00 2
2019-02-01 0:00:00 1
2019-03-05 1:00:00 2
2019-03-21 3:00:00 1
2019-04-08 4:00:00 2
I am using the Pandas Grouper to group and sum the data monthly:
monthly_data = df.groupby(pd.Grouper(freq='M', label='left')).sum()
Expected output:
2019-01-01 0:00:00 6
2019-02-01 0:00:00 1
2019-03-01 0:00:00 3
2019-04-01 0:00:00 2
Actual output:
2018-12-31 0:00:00 6
2019-01-31 0:00:00 1
2019-02-28 0:00:00 3
2019-03-30 0:00:00 2
How can I get the labels of the groups to be the first element in the group?
Thank you
Use the freq MS (Month Start), rather than M (Month End).
See dateoffset objects in the docs.
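Applied to the code from the question, that might look like the following sketch (assuming the timestamps form the DataFrame's index; with MS the bin labels already fall on the first of the month, so label='left' is no longer needed):
monthly_data = df.groupby(pd.Grouper(freq='MS')).sum()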
Alternatively, use resample to aggregate on a DatetimeIndex:
df.resample('MS').sum()
value
date
2019-01-01 6
2019-02-01 1
2019-03-01 3
2019-04-01 2
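If the timestamps live in an ordinary column rather than the index, pd.Grouper accepts a key argument (a sketch; 'date' stands in for the actual column name):
monthly_data = df.groupby(pd.Grouper(key='date', freq='MS')).sum()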

Calculating difference in minutes based on 30 minute interval?

I had a df such as
ID | Half Hour Bucket | clock in time | clock out time | Rate
232 | 4/1/19 8:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54
342 | 4/1/19 8:30 PM | 4/1/19 7:12 PM | 4/1/19 7:22 PM | 0.23
232 | 4/1/19 7:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54
I want my output to be
ID | Half Hour Bucket | clock in time | clock out time | Rate | Mins
232 | 4/1/19 8:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54 |
342 | 4/1/19 8:30 PM | 4/1/19 7:12 PM | 4/1/19 7:22 PM | 0.23 |
232 | 4/1/19 7:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54 |
Where minutes represents the difference between clock out time and clock in time.
But I only want each row to contain the minutes that fall within that row's half hour bucket.
For example, for ID 342 it would be ten minutes, and those 10 mins would sit on that row.
But for ID 232 the clock in to clock out time spans over 3 hours. I would only want the 30 mins for 8:00-8:30 in the first row and the 18 mins (7:12-7:30) in the third row. For the half hour buckets that don't exist in the original rows (8:30-9:00, 9:00-9:30, and so on), I would want new rows in the same df containing NaNs for everything except the half hour bucket, the mins for that bucket, and the rate carried over from the original row.
So the 30 mins from 8:00-8:30 would stay in the first row, plus 5 new rows for all the half hour buckets after 4/1/19 8:00 PM. Is this possible?
I thank anyone for their time!
Realised my first answer probably wasn't what you wanted. This version, hopefully, is. It was a bit more involved than I first assumed!
Create Data
First of all create a dataframe to work with, based on that supplied in the question. The resultant formatting isn't quite the same but that would be easily fixed, so I've left it as-is here.
import math
import numpy as np
import pandas as pd
# Create a dataframe to work with from the data provided in the question
columns = ['id', 'half_hour_bucket', 'clock_in_time', 'clock_out_time', 'rate']
data = [[232, '4/1/19 8:00 PM', '4/1/19 7:12 PM', '4/1/19 10:45 PM', 0.54],
        [342, '4/1/19 8:30 PM', '4/1/19 7:12 PM', '4/1/19 07:22 PM', 0.23],
        [232, '4/1/19 7:00 PM', '4/1/19 7:12 PM', '4/1/19 10:45 PM', 0.54]]
df = pd.DataFrame(data, columns=columns)
def convert_cols_to_dt(df):
    # Convert all columns except id and rate to datetime format
    for col in df:
        if col not in ['id', 'rate']:
            df[col] = pd.to_datetime(df[col])
    return df
df = convert_cols_to_dt(df)
# Create the mins column
df['mins'] = (df.clock_out_time - df.clock_in_time)
Output:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 0 days 03:33:00.000000000
1 342 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 0 days 00:10:00.000000000
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 0 days 03:33:00.000000000
Solution
Next, define a simple function that returns a list whose length equals the number of 30-minute intervals spanned by the mins column:
def upsample_list(x):
    multiplier = math.ceil(x.total_seconds() / (60 * 30))
    return list(range(multiplier))
And apply this to the dataframe:
df['samples'] = df.mins.apply(upsample_list)
Next, create a new row for each list item in the 'samples' column (using the answer provided by Roman Pekar here):
s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'sample'
Join s to the dataframe and clean up the extra columns:
df = df.drop('samples', axis=1).join(s, how='inner').drop('sample', axis=1)
Which gives us this:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
1 342 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 00:10:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
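As an aside, on pandas 0.25 or later, DataFrame.explode collapses the apply/stack/join sequence above into one call. A sketch:
# Each list element in 'samples' becomes its own row; then drop the helper column
df = df.explode('samples').drop('samples', axis=1)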
Nearly there!
Reset the index:
df = df.reset_index(drop=True)
Set duplicate rows to NaN:
df = df.mask(df.duplicated())
Which gives:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232.0 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
1 NaN NaT NaT NaT NaN NaT
2 NaN NaT NaT NaT NaN NaT
3 NaN NaT NaT NaT NaN NaT
4 NaN NaT NaT NaT NaN NaT
5 NaN NaT NaT NaT NaN NaT
6 NaN NaT NaT NaT NaN NaT
7 NaN NaT NaT NaT NaN NaT
8 342.0 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 00:10:00
9 232.0 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
10 NaN NaT NaT NaT NaN NaT
11 NaN NaT NaT NaT NaN NaT
12 NaN NaT NaT NaT NaN NaT
13 NaN NaT NaT NaT NaN NaT
14 NaN NaT NaT NaT NaN NaT
15 NaN NaT NaT NaT NaN NaT
16 NaN NaT NaT NaT NaN NaT
Lastly, forward fill the half_hour_bucket and rate columns.
df[['half_hour_bucket', 'rate']] = df[['half_hour_bucket', 'rate']].ffill()
Final output:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232.0 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
1 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
2 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
3 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
4 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
5 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
6 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
7 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
8 342.0 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 00:10:00
9 232.0 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
10 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
11 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
12 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
13 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
14 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
15 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
16 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT

Subtract dates if indexes match each other in Python Pandas

I have two dataframes:
print (df1)
ID Birthday
0 A000 1990-01-01
1 A001 1991-05-05
2 A002 1970-10-01
3 A003 1980-07-07
4 A004 1945-08-15
print (df2)
ID Date from
0 A000 2010.01
1 A001 2012.01
2 A002 2010.01
3 A002 2010.01
4 A002 2010.11
5 A003 2009.05
6 A003 2010.01
7 A004 2010.01
8 A005 2007.11
9 A006 2017.01
df1 consists of IDs and birthdays, and df2 contains IDs and dates. Some of the values in df2.ID are not in df1.ID (i.e. A005 and A006).
What I am trying:
I'd like to calculate the difference between df1.Birthday and df2.Date if df2.ID exists in df1.ID.
What I have done so far:
df1['Birthday'] = pd.to_datetime(df1['Birthday'])
df2['Date from'] = pd.to_datetime(df2['Date from'])
x1 = df1.set_index(['ID'])['Birthday']
x2 = df2.set_index(['ID'])['Date from']
x3 = x2.sub(x1,fill_value=0)
print(x3)
ID
A000 -7305 days +00:00:00.000002
A001 -7794 days +00:00:00.000002
A002 -273 days +00:00:00.000002
A002 -273 days +00:00:00.000002
A002 -273 days +00:00:00.000002
A003 -3840 days +00:00:00.000002
A003 -3840 days +00:00:00.000002
A004 8905 days 00:00:00.000002
A005 0 days 00:00:00.000002
A006 0 days 00:00:00.000002
dtype: timedelta64[ns]
There is a problem: the two A003 rows get the same value even though they correspond to different dates. I am not sure how to proceed to the next step. Thank you in advance for any assistance you can provide.
First, I would left merge the dataframes to make sure things were lining up properly. Then subtract the two date columns in a new column:
import pandas
from io import StringIO
data1 = StringIO("""\
ID Birthday
A000 1990-01-01
A001 1991-05-05
A002 1970-10-01
A003 1980-07-07
A004 1945-08-15
""")
data2 = StringIO("""\
ID Date_from
A000 2010.01
A001 2012.01
A002 2010.01
A002 2010.01
A002 2010.11
A003 2009.05
A003 2010.01
A004 2010.01
A005 2007.11
A006 2017.01
""")
x1 = pandas.read_table(data1, sep=r'\s+', parse_dates=['Birthday'])
x2 = pandas.read_table(data2, sep=r'\s+', parse_dates=['Date_from'])
data = (
x2.merge(right=x1, left_on='ID', right_on='ID', how='left')
.assign(Date_diff=lambda df: df['Date_from'] - df['Birthday'])
)
print(data)
And that gives me:
ID Date_from Birthday Date_diff
0 A000 2010-01-01 1990-01-01 7305 days
1 A001 2012-01-01 1991-05-05 7546 days
2 A002 2010-01-01 1970-10-01 14337 days
3 A002 2010-01-01 1970-10-01 14337 days
4 A002 2010-11-01 1970-10-01 14641 days
5 A003 2009-05-01 1980-07-07 10525 days
6 A003 2010-01-01 1980-07-07 10770 days
7 A004 2010-01-01 1945-08-15 23515 days
8 A005 2007-11-01 NaT NaT
9 A006 2017-01-01 NaT NaT
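If a plain integer day count is more convenient than a timedelta, the .dt.days accessor converts it (a follow-on sketch; 'Days' is a name introduced here):
# Integer day counts; unmatched IDs (NaT) come through as NaN
data['Days'] = data['Date_diff'].dt.days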
Use the dateutil package to get the difference in years, months, and days:
from dateutil import relativedelta as rdelta
from datetime import date
d1 = date(2010, 5, 1)
d2 = date(2012, 1, 1)
rd = rdelta.relativedelta(d2, d1)
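The components can then be read off the result; a small usage sketch (for the dates above, the delta is 1 year, 8 months):
print(rd.years, rd.months, rd.days)  # 1 8 0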

In pandas how to filter based on a particular weekday and range of time

My data frame looks something like this:
C/A UNIT SCP DATEn TIMEn DESCn ENTRIESn EXITSn
0 A002 R051 02-00-00 08-18-12 00:00:00 REGULAR 3759779 1297676
1 A002 R051 02-00-00 08-18-12 04:00:00 REGULAR 3759809 1297680
2 A002 R051 02-00-00 08-18-12 08:00:00 REGULAR 3759820 1297701
3 A002 R051 02-00-00 08-18-12 12:00:00 REGULAR 3759879 1297799
4 A002 R051 02-00-00 08-18-12 16:00:00 REGULAR 3760073 1297863
5 A002 R051 02-00-00 08-18-12 20:00:00 REGULAR 3760367 1297920
6 A002 R051 02-00-00 08-19-12 00:00:00 REGULAR 3760494 1297958
7 A002 R051 02-00-00 08-19-12 04:00:00 REGULAR 3760525 1297962
8 A002 R051 02-00-00 08-19-12 08:00:00 REGULAR 3760545 1297983
9 A002 R051 02-00-00 08-19-12 12:00:00 REGULAR 3760603 1298048
10 A002 R051 02-00-00 08-19-12 16:00:00 REGULAR 3760750 1298104
11 A002 R051 02-00-00 08-19-12 20:00:00 REGULAR 3760982 1298137
12 A002 R051 02-00-00 08-20-12 00:00:00 REGULAR 3761088 1298175
13 A002 R051 02-00-00 08-20-12 04:00:00 REGULAR 3761098 1298186
14 A002 R051 02-00-00 08-20-12 08:00:00 REGULAR 3761130 1298265
This code filters down to the month of July:
july_station = df[['COUNTn']]\
    [(df.DATETIMEn >= datetime.datetime.strptime('07-01-13', '%m-%d-%y')) &\
     (df.DATETIMEn <= datetime.datetime.strptime('07-31-13', '%m-%d-%y'))]\
    .groupby(df.UNIT)\
    .sum()
The above code filters only by month.
What if I have to filter entries between midnight and 4am on Fridays in July 2013? Is this the right approach?
july_station1 = df[['COUNTn']]\
    [(df.DATETIMEn >= datetime.datetime.strptime('07-01-13 00:00 5', '%m-%d-%y %H:%M %A')) &\
     (df.DATETIMEn <= datetime.datetime.strptime('07-31-13 04:00 5', '%m-%d-%y %H:%M %A'))]\
    .groupby(df.UNIT)\
    .sum()
If your column is a datetime column, you can get weekday and hour with column.dt.weekday (monday = 0, sunday = 6), and column.dt.hour. Also you can use between on your series to do range comparison more elegantly:
df.DATEn = pd.to_datetime(df.DATEn)
df.TIMEn = pd.to_datetime(df.TIMEn)
# Friday -> weekday 4 (Monday = 0)
mask = (df.DATEn.dt.weekday == 4) & df.TIMEn.dt.hour.between(0, 4)
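Combining that mask with the July 2013 filter from the original code might look like the following sketch (COUNTn and UNIT as in the question; string comparisons work once DATEn is a datetime column):
# Fridays between midnight and 4am, restricted to July 2013
july_fridays = df[mask & (df.DATEn >= '2013-07-01') & (df.DATEn <= '2013-07-31')]
result = july_fridays.groupby('UNIT')['COUNTn'].sum()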
