I have a data frame with several columns and rows. One column, 'Date', is in the format 'month/day/year Hour:Min:Sec PM', and I need to get from the data frame only the rows that match the Hour:Min:Sec part of that column. The column is stored as object.
df.loc[df['Date'] == 'month/day/year 11:00:00 PM'].copy()
This only works when I specify the month/day/year, but I want to obtain the rows that correspond to the time regardless of the day. Does anyone know how this can be achieved?
This can be done in two steps: first create an intermediate column holding only the time, then do the filtering.
>>> import datetime
>>> import pandas as pd
>>> df = pd.DataFrame([[datetime.datetime(2018,1,1,2,2,2), 1], [datetime.datetime(2018,1,1,3,3,3), 2]], columns=['Date','Val'])
>>> df
Date Val
0 2018-01-01 02:02:02 1
1 2018-01-01 03:03:03 2
1) Create intermediate col
>>> df['new'] = df['Date'].transform(lambda x: x.time())
>>> df
Date Val new
0 2018-01-01 02:02:02 1 02:02:02
1 2018-01-01 03:03:03 2 03:03:03
2) Do filtering
>>> df[df['new'] == datetime.time(2,2,2)]
Date Val new
0 2018-01-01 02:02:02 1 02:02:02
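Since the question says the 'Date' column is stored as object, a minimal sketch of the same idea (assuming the strings parse cleanly with pd.to_datetime) would be to convert first and then filter on the .dt.time accessor:

import datetime
import pandas as pd

# assuming df['Date'] holds strings such as '1/2/2018 11:00:00 PM'
df['Date'] = pd.to_datetime(df['Date'])                 # object -> datetime64
mask = df['Date'].dt.time == datetime.time(23, 0, 0)    # 11:00:00 PM
result = df.loc[mask].copy()

This avoids the intermediate column entirely, at the cost of changing the dtype of 'Date'.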
Related
I have a dataframe with two columns, Case and Date. Here Date is actually the starting date. I want to populate it as a time series, i.e. add three (month_num) more dates to each case and remove the original ones.
original dataframe:
Case Date
0 1 2010-01-01
1 2 2011-04-01
2 3 2012-08-01
after populating dates:
Case Date
0 1 2010-02-01
1 1 2010-03-01
2 1 2010-04-01
3 2 2011-05-01
4 2 2011-06-01
5 2 2011-07-01
6 3 2012-09-01
7 3 2012-10-01
8 3 2012-11-01
I tried declaring an empty dataframe with the same column names and data types, and used a for loop over Case and month_num to add rows into the new dataframe.
import pandas as pd
data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns = ['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)
df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])
month_num = 3
for c in df.Case:
    for m in range(1, month_num+1):
        temp = df.loc[df['Case']==c]
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code works; however, when the original dataframe and month_num become large, it takes a very long time to run. Are there any better ways to do what I need? Thanks a lot!!
Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As that answer suggests, you may want to collect the dataframes you create in the loop in an external list, and then concatenate the list once after the loop finishes.
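A minimal sketch of that change, keeping your loop structure but moving the concatenation outside of it (frames is just a helper list introduced here):

frames = []
for c in df.Case:
    for m in range(1, month_num + 1):
        temp = df.loc[df['Case'] == c].copy()   # copy so the original rows stay untouched
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        frames.append(temp)

df_new = pd.concat(frames, ignore_index=True)

Each pd.concat copies all the data accumulated so far, so calling it once instead of inside the inner loop removes the repeated copying.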
Given your input data this is what worked on my notebook:
df2 = pd.DataFrame()
df2['Date'] = df['Date'].apply(lambda x: pd.date_range(start=x, periods=3, freq='M')).explode()
df3 = pd.merge_asof(df2, df, on='Date')
df3['Date'] = df3['Date'] + pd.DateOffset(days=1)
df3[['Case', 'Date']]
We create df2 and populate its 'Date' column with the needed dates derived from the original df.
Then df3 results from a merge_asof between df2 and df (to populate the 'Case' column).
Finally, we offset the resulting column by 1 day.
Let's take this sample dataframe:
df = pd.DataFrame({'ID':[1,1,2,2,3],'Date_min':["2021-01-01","2021-01-20","2021-01-28","2021-01-01","2021-01-02"],'Date_max':["2021-01-23","2021-12-01","2021-09-01","2021-01-15","2021-01-09"]})
df["Date_min"] = df["Date_min"].astype('datetime64')
df["Date_max"] = df["Date_max"].astype('datetime64')
ID Date_min Date_max
0 1 2021-01-01 2021-01-23
1 1 2021-01-20 2021-12-01
2 2 2021-01-28 2021-09-01
3 2 2021-01-01 2021-01-15
4 3 2021-01-02 2021-01-09
I would like to check, for each ID, whether there are overlapping date ranges. I can use a loopy solution like the following one, but it is not efficient and consequently quite slow on a really big dataframe:
L_output = []
for index, row in df.iterrows():
    if len(df[(df["ID"] == row["ID"]) & (df["Date_min"] <= row["Date_min"]) &
              (df["Date_max"] >= row["Date_min"])].index) > 1:
        print("overlapping date ranges for ID %d" % row["ID"])
        L_output.append(row["ID"])
Output :
overlapping date ranges for ID 1
Does anyone know of a better way to check that ID 1 has overlapping date ranges?
Expected output :
[1]
Try:
Create a column "Dates" that contains a list of dates from "Date_min" to "Date_max" for each row
explode the "Dates" column
get the duplicated rows
df["Dates"] = df.apply(lambda row: pd.date_range(row["Date_min"], row["Date_max"]), axis=1)
df = df.explode("Dates").drop(["Date_min", "Date_max"], axis=1)
#if you want all the ID and Dates that are duplicated/overlap
>>> df[df.duplicated()]
ID Dates
1 1 2021-01-20
1 1 2021-01-21
1 1 2021-01-22
1 1 2021-01-23
#if you just want a count of overlapping dates per ID
>>> df.groupby("ID").agg(lambda x: x.duplicated().sum())
Dates
ID
1 4
2 0
3 0
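If you want exactly the expected output (the list of IDs that have overlaps) from the exploded frame above, a short follow-up along these lines should work:

overlapping_ids = df[df.duplicated()]["ID"].unique().tolist()
print(overlapping_ids)  # [1]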
You can convert your datetime objects into timestamps, then construct pd.Interval objects and iterate over all possible interval combinations for each ID:
from itertools import combinations
import pandas as pd
def group_has_overlap(group):
    timestamps = group[["Date_min", "Date_max"]].values.tolist()
    for t1, t2 in combinations(timestamps, 2):
        i1 = pd.Interval(t1[0], t1[1])
        i2 = pd.Interval(t2[0], t2[1])
        if i1.overlaps(i2):
            return True
    return False

for ID, group in df.groupby("ID"):
    print(ID, group_has_overlap(group))
Output is:
1 True
2 False
3 False
Set the index as an IntervalIndex, and use groupby to get your overlapping IDs:
(df.set_index(pd.IntervalIndex
.from_arrays(df.Date_min,
df.Date_max,
closed='both'))
.groupby('ID')
.apply(lambda df: df.index.is_overlapping)
)
ID
1 True
2 False
3 False
dtype: bool
I have a data frame with a column indicating the number of months. I would like to create a new column, starting from an initial date, let’s say 2015-01-01 and add all the months to this initial date. For example, if the month column has values [0, 1, 2, …,72], then I would like to have a column called Date of the form [2015-01-01,2015-02-01,2015-03-01,…].
How could I achieve this?
Use offsets.DateOffset and add it to the starting datetime:
df = pd.DataFrame({'n': [0,1,2,72]})
start = '2015-01-01'
df['new'] = pd.to_datetime(start) + df['n'].apply(lambda x: pd.offsets.DateOffset(months=x))
print (df)
n new
0 0 2015-01-01
1 1 2015-02-01
2 2 2015-03-01
3 72 2021-01-01
I have a dataframe like this:
ID   Date
01   2020-01-02
01   2020-01-03
02   2020-01-02
I need to create a new column that, for each specific ID and Date, gives me the number of rows that have the same ID but an earlier date.
So the output for the example df will look like this:
ID   Date         Count
01   2020-01-02   0
01   2020-01-03   1
02   2020-01-02   0
I have tried working with auxiliary tables, and also with groupby using a lambda function, but I have no real idea how to continue.
This will create a new column with the count.
df['Date'] = pd.to_datetime(df['Date'])
df['Count'] = df.groupby('ID')['Date'].rank(ascending=True).astype(int) - 1
First you need to be sure that you are comparing dates.
df["Date"] = pd.to_datetime(df['Date'], format="%Y-%m-%d")
Then you can create the new column called 'Count' by iterating over each row using df.apply.
def count_earlier_dates(row):
    # count rows with the same ID and a strictly earlier date
    return df[(df['ID'] == row['ID']) & (df['Date'] < row['Date'])].count()['ID']

df['Count'] = df.apply(lambda row: count_earlier_dates(row), axis=1)
Let us try factorize
df['new'] = df.sort_values('Date').groupby('ID')['Date'].transform(lambda x : x.factorize()[0])
df
ID Date new
0 1 2020-01-02 0
1 1 2020-01-03 1
2 2 2020-01-02 0
I want to merge two data frames based on two columns: "Code" and "Date". It is straightforward to merge data frames based on "Code"; however, in the case of "Date" it becomes tricky: there is no exact match between the Dates in df1 and df2. So I want to select the closest Dates. How can I do this?
df = df1[column_names1].merge(df2[column_names2], on='Code')
I don't think there's a quick, one-line way to do this kind of thing, but I believe the best approach is to do it this way:
add a column to df1 with the closest date from the appropriate group in df2
call a standard merge on these
As the size of your data grows, this "closest date" operation can become rather expensive unless you do something sophisticated. I like to use scikit-learn's NearestNeighbors code for this sort of thing.
I've put together one approach to that solution that should scale relatively well.
First we can generate some simple data:
import pandas as pd
import numpy as np
dates = pd.date_range('2015', periods=200, freq='D')
rand = np.random.RandomState(42)
i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i1],
                    'val1': rand.rand(5)})
df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i2],
                    'val2': rand.rand(5)})
Let's check these out:
>>> df1
Code Date val1
0 0 2015-01-16 0.975852
1 0 2015-01-31 0.516300
2 1 2015-04-06 0.322956
3 1 2015-05-09 0.795186
4 1 2015-06-08 0.270832
>>> df2
Code Date val2
0 1 2015-02-03 0.184334
1 1 2015-04-13 0.080873
2 0 2015-05-02 0.428314
3 1 2015-06-26 0.688500
4 0 2015-06-30 0.058194
Now let's write an apply function that adds a column of nearest dates to df1 using scikit-learn:
from sklearn.neighbors import NearestNeighbors
def find_nearest(group, match, groupname):
    match = match[match[groupname] == group.name]
    nbrs = NearestNeighbors(n_neighbors=1).fit(match['Date'].values[:, None])
    dist, ind = nbrs.kneighbors(group['Date'].values[:, None])
    group['Date1'] = group['Date']
    group['Date'] = match['Date'].values[ind.ravel()]
    return group
df1_mod = df1.groupby('Code').apply(find_nearest, df2, 'Code')
>>> df1_mod
Code Date val1 Date1
0 0 2015-05-02 0.975852 2015-01-16
1 0 2015-05-02 0.516300 2015-01-31
2 1 2015-04-13 0.322956 2015-04-06
3 1 2015-04-13 0.795186 2015-05-09
4 1 2015-06-26 0.270832 2015-06-08
Finally, we can merge these together with a straightforward call to pd.merge:
>>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
Code Date val1 Date1 val2
0 0 2015-05-02 0.975852 2015-01-16 0.428314
1 0 2015-05-02 0.516300 2015-01-31 0.428314
2 1 2015-04-13 0.322956 2015-04-06 0.080873
3 1 2015-04-13 0.795186 2015-05-09 0.080873
4 1 2015-06-26 0.270832 2015-06-08 0.688500
Notice that rows 0 and 1 both matched the same val2; this is expected given the way you described your desired solution.
Here's an alternative solution:
Merge on Code.
Add a date difference column according to your need (I used abs in the example below) and sort the data using the new column.
Group by the records of the first data frame and for each group take a record from the second data frame with the closest date.
Code:
df = df1.reset_index()[column_names1].merge(df2[column_names2], on='Code')
df['DateDiff'] = (df['Date1'] - df['Date2']).abs()
df.sort_values('DateDiff').groupby('index').first().reset_index()
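As a side note, if your pandas version supports it, pd.merge_asof with direction='nearest' can do this kind of nearest-date merge directly. A sketch, assuming both frames have 'Code' and 'Date' columns (the names from the question) and are sorted by 'Date':

df1_sorted = df1.sort_values('Date')
df2_sorted = df2.sort_values('Date')
merged = pd.merge_asof(df1_sorted, df2_sorted, on='Date', by='Code', direction='nearest')

Like the scikit-learn approach above, every row of df1 gets matched, so several rows may share the same partner from df2.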