Throw out rows of same ID that are close in time - python

I have a pandas dataframe that includes a column with an ID called VIN and a column with a date. If the same VIN has multiple rows with dates that are less than 2 months apart, I would like to throw out the later dates. Here's a minimal example:
rng = pd.date_range('2015-02-24', periods=5, freq='M')
df = pd.DataFrame({ 'Date': rng, 'ID': ['ABD','ABD','CDE','CDE','FEK'] })
df.head()
Here I would like to throw out row 1 and 3.

You can use .groupby() on column ID and get the difference between 2 dates with .diff() and check whether it is less than 2 months by comparing with np.timedelta64(2, 'M'). Then filter by .loc on the boolean mask of negation of the condition.
mask = df.groupby('ID')['Date'].diff() < np.timedelta64(2, 'M')
df_filtered = df.loc[~mask]
Result:
print(df_filtered)
Date ID
0 2015-02-28 ABD
2 2015-04-30 CDE
4 2015-06-30 FEK

Related

How to populate date in a dataframe using pandas in python

I have a dataframe with two columns, Case and Date. Here Date is actually the starting date. I want to populate it as a time series, saying add three (month_num) more dates to each case and removing the original ones.
original dataframe:
Case Date
0 1 2010-01-01
1 2 2011-04-01
2 3 2012-08-01
after populating dates:
Case Date
0 1 2010-02-01
1 1 2010-03-01
2 1 2010-04-01
3 2 2011-05-01
4 2 2011-06-01
5 2 2011-07-01
6 3 2012-09-01
7 3 2012-10-01
8 3 2012-11-01
I tried to declare an empty dataframe with the same column names and data type, and used for loop to loop over Case and month_num, and add rows into the new dataframe.
import pandas as pd
data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns = ['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)
df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])
month_num = 3
for c in df.Case:
for m in range(1, month_num+1):
temp = df.loc[df['Case']==c]
temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code can work, however, when the original dataframe and month_num become large, it took huge time to run. Are there any better ways to do what I need? Thanks a alot!!
Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate once the list.
Given your input data this is what worked on my notebook:
df2=pd.DataFrame()
df2['Date']=df['Date'].apply(lambda x: pd.date_range(start=x, periods=3,freq='M')).explode()
df3=pd.merge_asof(df2,df,on='Date')
df3['Date']=df3['Date']+ pd.DateOffset(days=1)
df3[['Case','Date']]
We create a df2 to which we populate 'Date' with the needed dates coming from the original df
Then df3 resulting of a merge_asof between df2 and df (to populate the 'Case' column)
Finally , we offset the resulting column off 1 day

How to join data from two seperate columns based upon condition to another dataframe?

I want to add data from df2 if date is greater than 01/01/2015 and df1 if its below than 01/01/2015. Unsure how to do this as the columns are of difference length.
I have a main DF df1, and two dataframes containing data called df2 and df3.
df2 looks something like this:
day Value
01/01/2015 5
02/01/2015 6
...
going up to today,
I also have DF3 which is the same data but goes from 2000 to today so like
day Value
01/01/2000 10
02/01/2000 15
...
I want to append a Value column to DF1 that is the values of DF3 when date is less than 2015, and the values of DF2 when the date is more than 01/01/2015( including). Unsure how to do this using a condition. I think there is a way with np.where but unsure how.
To add more context.
I want to codify this statement:
df1[values] = DF2[Values] when date is bigger than 1/1/2015 and df1[values] = DF3[Values] when date is less than 1/1/2015
If you want to join 2 dataframe, you can use df1.mrege (df2,on="col_name"). If you want a condition on one dataframe and join with the other,
do it like this,
import pandas as pd
#generating 2 dataframes
date1 = pd.date_range(start='1/01/2015', end='2/02/2015')
val1 = [2,3,5]*11. #length equal to date
dict1 = {"Date":date2,"Value":val1}
df1 = pd.DataFrame(dict1)
df1. #view df1
date2 = pd.date_range(start='1/01/2000', end='2/02/2020')
val2 = [21,15]*3669. #length equal to date
dict2 = {"Date":date2,"Value":val2}
df2 = pd.DataFrame(dict2)
df2. #view df1
df3 =df1[df1["Date"]<"2015-02-01"] #whatever condition you want apply
and store on different dataframe
df2.merge(df3,on="Date")
df2 #joined dataframe
This is how you can join 2 dataframe on date with a condition, just apply
the conditon and store i another dataframe and join it with the first
dataframe by df.merge(df3,on="Date")

Drop rows based on date condition

I have a Pandas DataFrame called new in which the YearMonth column has date in the format of YYYY-MM. I want to drop the rows based on the condition: if the date is beyond "2020-05". I tried using this:
new = new.drop(new[new.YearMonth>'2020-05'].index)
but its not working displaying a syntax error of "invalid token".
Here is a sample DataFrame:
>>> new = pd.DataFrame({
'YearMonth': ['2014-09', '2014-10', '2020-09', '2021-09']
})
>>> print(new)
YearMonth
0 2014-09
1 2014-10
2 2020-09
3 2021-09
The expected DataFrame after the drop should be:
YearMonth
0 2014-09
1 2014-10
Just convert to datetime, then format it to month and subset it.
from datetime import datetime as dt
new['YearMonth']=pd.to_datetime(new['YearMonth']).dt.to_period('M')
new=new[~(new['YearMonth']>'2020-05')]
I think you want boolean indexing with change > to <= so comparing by month periods working nice:
new = pd.DataFrame({
'YearMonth': pd.to_datetime(['2014-09', '2014-10', '2020-09', '2021-09']).to_period('m')
})
print (new)
YearMonth
0 2014-09
1 2014-10
2 2020-09
3 2021-09
df = new[new.YearMonth <= pd.Period('2020-05', freq='m')]
print (df)
YearMonth
0 2014-09
1 2014-10
In newest versions of pandas also working with compare by strings:
df = new[new.YearMonth <= '2020-05']

How to check and count if value of dataframe one exist in other dataframe?

I have 2 dataframes, added with pd.read_csv. I create dataframe like this:
df1= pd.read_csv('exo.csv', delimiter=';', encoding='latin1', parse_dates=['date'], dayfirst=True)
the 2 datafames are:`
df1:
date number
jan-16
feb-17
march-17
april-17
Df2:
date
09/01/2016
08/02/2017
15/02/2017
13/03/2017
25/08/2017
I would like to check if value of df1.date exists in df2.value. If yes, the column df1['number'] will count the number of appearance. The result of Df1 should then be like this:
date number
jan-16 1
feb-17 2 (=> for instance, feb-17 has found 2 times in Df2['date'])
How can i do this ? do I need to change the date format ?
I thank you in advance,
you need to group by df2.date and then count 2
than you can merge df df2 to df1 by 'date1'
df2['date2'] = pd.to_datetime(df2['date'],format='%d/%m/%Y')
df2['date1'] = df2.date2.dt.strftime('%b-%y').astype(str).str.lower()
b = pd.DataFrame(df2.groupby('date1')['date'].count())
b.columns = ['number']
b = b.reset_index()
then merge
df1['date']=df1.date.str.lower()
df1.merge(b,right_on='date1' , left_on='date',how='left')

Create new column based on another column for a multi-index Panda dataframe

I'm running Python 3.5 on Windows and writing code to study financial econometrics.
I have a multi-index panda dataframe where the level=0 index is a series of month-end dates and the level=1 index is a simple integer ID. I want to create a new column of values ('new_var') where for each month-end date, I look forward 1-month and get the values from another column ('some_var') and of course the IDs from the current month need to align with the IDs for the forward month. Here is a simple test case.
import pandas as pd
import numpy as np
# Create some time series data
id = np.arange(0,5)
date = [pd.datetime(2017,1,31)+pd.offsets.MonthEnd(i) for i in [0,1]]
my_data = []
for d in date:
for i in id:
my_data.append((d, i, np.random.random()))
df = pd.DataFrame(my_data, columns=['date', 'id', 'some_var'])
df['new_var'] = np.nan
df.set_index(['date', 'id'], inplace=True)
# Drop an observation to reflect my true data
df.drop(('2017-02-28',3), level=None, inplace=True)
df
# The desired output....
list1 = df.loc['2017-01-31'].index.labels[1].tolist()
list2 = df.loc['2017-02-28'].index.labels[1].tolist()
common = list(set(list1) & set(list2))
for i in common:
df.loc[('2017-01-31', i)]['new_var'] = df.loc[('2017-02-28', i)]['some_var']
df
I feel like there is a better way to get my desired output. Maybe I should just embrace the "for" loop? Maybe a better solution is to reset the index?
Thank you,
F
I would create a integer column representing the date, substrate one from it (to shift it by one month) and the merge the value left on back to the original dataframe.
Out[28]:
some_var
date id
2017-01-31 0 0.736003
1 0.248275
2 0.844170
3 0.671364
4 0.034331
2017-02-28 0 0.051586
1 0.894579
2 0.136740
4 0.902409
df = df.reset_index()
df['n_group'] = df.groupby('date').ngroup()
df_shifted = df[['n_group', 'some_var','id']].rename(columns={'some_var':'new_var'})
df_shifted['n_group'] = df_shifted['n_group']-1
df = df.merge(df_shifted, on=['n_group','id'], how='left')
df = df.set_index(['date','id']).drop('n_group', axis=1)
Out[31]:
some_var new_var
date id
2017-01-31 0 0.736003 0.051586
1 0.248275 0.894579
2 0.844170 0.136740
3 0.671364 NaN
4 0.034331 0.902409
2017-02-28 0 0.051586 NaN
1 0.894579 NaN
2 0.136740 NaN
4 0.902409 NaN

Categories