Remove rows based on date column [Pandas] - python

I would like to filter for customer_id'sthat first appear after a certain date in this case 2019-01-10 and then create a new df with a list of new customers
df
date customer_id
2019-01-01 429492
2019-01-01 344343
2019-01-01 949222
2019-01-10 429492
2019-01-10 344343
2019-01-10 129292
Output df
customer_id
129292
This is what I have tried so far but this gives me also customer_id's that were active before 10th January 2019
s = df.loc[df["date"]>="2019-01-10", "customer_id"]
df_new = df[df["customer_id"].isin(s)]
df_new

You can use boolean indexing with filtering with Series.isin:
df["date"] = pd.to_datetime(df["date"])
mask1 = df["date"]>="2019-01-10"
mask2 = df["customer_id"].isin(df.loc[~mask1,"customer_id"])
df = df.loc[mask1 & ~mask2, ['customer_id']]
print (df)
customer_id
5 129292

df['date'] = pd.to_datetime(df['date'])
cutoff = pd.to_datetime('2019-01-10')
mask = df['date'] >= cutoff
customers_before = df.loc[~mask, 'customer_id'].unique().tolist()
customers_after = df.loc[mask, 'customer_id'].unique().tolist()
result = set(customers_after) - set(customers_before)

"then create a new df with a list of new customers" so in this case your output is null, because 2019-01-10 is last date, there is no new customers after this date
but if you want to get list of customers after certain date or equal than :
df=pd.DataFrame({
'date':['2019-01-01','2019-01-01','2019-01-01',
'2019-01-10','2019-01-10','2019-01-10'],
'customer_id':[429492,344343,949222,429492,344343,129292]
})
certain_date=pd.to_datetime('2019-01-10')
df.date=pd.to_datetime(df.date)
df=df[
df.date>=certain_date
]
print(df)
date customer_id
3 2019-01-10 429492
4 2019-01-10 344343
5 2019-01-10 129292

If your 'date' column has datetime objects you just have to do:
df_new = df[df['date'] >= datetime(2019, 1, 10)]['customer_id']
If your 'date' column doesn't contain datetime objects, you should convert it first it by using to_datetime method:
df['date'] = pd.to_datetime(df['date'])
And then apply the methodology described above.

Related

use groupby() and for loop to count column values with conditions

The logic of what I am trying to do I think is best explained with code:
import pandas as pd
import numpy as np
from datetime import timedelta
random.seed(365)
#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [start_date + timedelta(days = np.random.exponential(scale = 100)) for start_date in start_date]
df = pd.DataFrame(
{"start_date":start_date,
"end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
I first create a pd.Series with the 1st day of every month in the entire history of the data:
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I then want to do is count the number of df["start_date"] values which are less than the 1st day of each month in the series and where the df["end_date"] values are null (recorded as NaT)
I would think I would use a for loop to do this and somehow groupby the dates series so that the resulting output looks something like this:
month_start
count
2015-01-01
5
2015-02-01
10
2015-03-01
35
The count column in the resulting output is a count of the number of df rows where the df["start_date"] values are less than the 1st of each month in the series and where the df["end_date"] values are null - this occurs for every value in the series
Here is the logic of what I am trying to do:
df.groupby(by = dates)[["start_date", "end_date"]].apply(
lambda x: [x["start_date"] < date for date in dates] & x["end_date"].isnull == True
)
Is this what you want:
df2 = df[df['end_date'].isnull()]
dates_count = dates.apply(lambda x: df2[df2['start_date'] < x]['start_date'].count())
print(pd.concat([dates, dates_count], axis=1))
IIUC, group by period (shifted by 1 month) and count the NaT, then cumsum to accumulate the counts:
(df['end_date'].isna()
.groupby(df['start_date'].dt.to_period('M').add(1).dt.start_time)
.sum()
.cumsum()
)
Output:
start_date
2015-02-01 0
2015-03-01 0
2015-04-01 0
2015-05-01 0
2015-06-01 0
...
2022-06-01 122
2022-07-01 127
2022-08-01 133
2022-09-01 138
2022-10-01 140
Name: end_date, Length: 93, dtype: int64

Date not equal to today

id
date
0
2021-18-01
1
2021-17-01
How can I keep rows if the date column has a value not equal to today's date (17th Jan)?
df[df.date != datetime.datetime.today().date()]
Expected Output
id
date
0
2021-18-01
try:
df.date = pd.to_datetime(df.date, format="%Y-%d-%m")
df[df.date!=str(datetime.datetime.today().date())]
date
1 2021-01-17
Setup:
Here we populate a dataframe and convert the data to correct datatypes
df = pd.DataFrame([dict(id=0, date='2021-01-17'),
dict(id=1, date='2021-01-18'),
dict(id=2, date='2021-01-19')])
df = df.set_index('id')
df.date = pd.to_datetime(df.date)
The result would look like this:
id
date
0
2021-01-17
1
2021-01-18
2
2021-01-19
Filtering:
df.loc[df.date.dt.date != pd.Timestamp.now().date()]
The result would look like this (In my timezone it is January, 18th already):
id
date
0
2021-01-17
2
2021-01-19
Explanation
We use LocIndexer .loc accessor to filter the dataframe via an array of booleans.
In order to make the comparison correct we take the date() part from the current timestamp on the right hand side and use the DateTimeProperties .dt accessor to use date property of the underlying Datetime object for the left hand side.

Create new date column in python pandas

I'm trying to create a new date column based on an existing date column in my dataframe. I want to take all the dates in the first column and make them the first of the month in the second column so:
03/15/2019 = 03/01/2019
I know I can do this using:
df['newcolumn'] = pd.to_datetime(df['oldcolumn'], format='%Y-%m-%d').apply(lambda dt: dt.replace(day=1)).dt.date
My issues is some of the data in the old column is not valid dates. There is some text data in some of the rows. So, I'm trying to figure out how to either clean up the data before I do this like:
if oldcolumn isn't a date then make it 01/01/1990 else oldcolumn
Or, is there a way to do this with try/except?
Any assistance would be appreciated.
At first we generate some sample data:
df = pd.DataFrame([['2019-01-03'], ['asdf'], ['2019-11-10']], columns=['Date'])
This can be safely converted to datetime
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
mask = df['Date'].isnull()
df.loc[mask, 'Date'] = dt.datetime(1990, 1, 1)
Now you don't need the slow apply
df['New'] = df['Date'] + pd.offsets.MonthBegin(-1)
Try with the argument errors=coerce.
This will return NaT for the text values.
df['newcolumn'] = pd.to_datetime(df['oldcolumn'],
format='%Y-%m-%d',
errors='coerce').apply(lambda dt: dt.replace(day=1)).dt.date
For example
# We have this dataframe
ID Date
0 111 03/15/2019
1 133 01/01/2019
2 948 Empty
3 452 02/10/2019
# We convert Date column to datetime
df['Date'] = pd.to_datetime(df.Date, format='%m/%d/%Y', errors='coerce')
Output
ID Date
0 111 2019-03-15
1 133 2019-01-01
2 948 NaT
3 452 2019-02-10

Comparing today date with date in dataframe

Comparing today date with date in dataframe
Sample Data
id date
1 1/2/2018
2 1/5/2019
3 5/3/2018
4 23/11/2018
Desired output
id date
2 1/5/2019
4 23/11/2018
My current code
dfdateList = pd.DataFrame()
dfDate= self.df[["id", "date"]]
today = datetime.datetime.now()
today = today.strftime("%d/%m/%Y").lstrip("0").replace(" 0", "")
expList = []
for dates in dfDate["date"]:
if dates <= today:
expList.append(dates)
dfdateList = pd.DataFrame(expList)
Currently my code is printing every single line despite the conditions, can anyone guide me? thanks
Pandas has native support for a large class of operations on datetimes, so one solution here would be to use pd.to_datetime to convert your dates from strings to pandas' representation of datetimes, pd.Timestamp, then just create a mask based on the current date:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df[df['date'] > pd.Timestamp.now()]
For example:
In [34]: df['date'] = pd.to_datetime(df['date'], dayfirst=True)
In [36]: df
Out[36]:
id date
0 1 2018-02-01
1 2 2019-05-01
2 3 2018-03-05
3 4 2018-11-23
In [37]: df[df['date'] > pd.Timestamp.now()]
Out[37]:
id date
1 2 2019-05-01
3 4 2018-11-23

Date comparison in python

i have already two datasets each one has 2 columns (date, close)
i want to compare date of the first dataset to the date of the second dataset if they are the same date the close of the second dataset takes the value relative to the date in question else it takes the value of the date of previous day.
This is the dataset https://www.euronext.com/fr/products/equities/FR0000120644-XPAR
https://fr.finance.yahoo.com/quote/%5EFCHI/history?period1=852105600&period2=1528873200&interval=1d&filter=history&frequency=1d
This is my code:
import numpy as np
from datetime import datetime , timedelta
import pandas as pd
#import cac 40 stock index (dataset1)
df = pd.read_csv('cac 40.csv')
df = pd.DataFrame(df)
#import Danone index(dataset2)
df1 = pd.read_excel('Price_Data_Danone.xlsx',header=3)
df1 = pd.DataFrame(df1)
#check the number of observation of both datasets and get the minimum number
if len(df1)>len(df):
size=len(df)
elif len(df1)<len(df):
size=len(df1)
else:
size=len(df)
#get new close values of dataset2 relative to the date in datset1
close1=np.zeros((size))
for i in range(0,size,1):
# find the date of dataset1 in dataset 2
if (df['Date'][i]in df1['Date']):
#get the index of the date and the corresponding value of close and store it in close1
close1[i]=df['close'][df1.loc['Date'][i], df['Date']]
else:
#if the date doesen't exist in datset2
#take value of close of previous date of datatset1
close1[i]=df['close'][df1.loc['Date'][i-1], df['Date']]
This is my trail, i got this error :
KeyError: 'the label [Date] is not in the [index]'
Examples:
we look for the value df['Date'][1] = '5/06/2009' in the column df1['Date']
we get its index in df1['Date']
then close1=df1['close'][index]
else if df['Date'][1] = '5/06/2009' not in df1['Date']
we get the index of the previous date df['Date'][0] = '4/06/2009'
close1=df1['close'][previous index]
Your error happens in line:
close1[i]=df['close'][df1.loc['Date'][i], df['Date']]
If your goal here is to get close value from df given i index you should write:
close[i] = df['close'][i]
See if that helps, unfortunately I don't understand fully what you are trying to accomplish, for example why do you set size to the length of shorter table?
Also, as long as I downloaded correct files, your condition df['Date'][i]in df1['Date'] might not work, one date format uses - and the other \.
Solution
import pandas as pd
pd.set_option('expand_frame_repr', False)
# load both files
df = pd.read_csv('CAC.csv')
df1 = pd.read_csv('DANONE.csv', header=3)
# ensure date format is the same between two
df.Date = pd.to_datetime(df.Date, dayfirst=True)
df1.Date = pd.to_datetime(df1.Date, dayfirst=True)
# you need only Date and Close columns as far as I understand
keep_columns = ['Date', 'Close']
# let's keep only these columns then
df = df[keep_columns]
df1 = df1[keep_columns]
# merge two tables on Date, method is left so that for every row in df we
# 'append' row from df1 if possible, if not there will be NaN value,
# for readability I added suffixes df - CAC and df1 - DANONE
merged = pd.merge(df,
df1,
on='Date',
how='left',
suffixes=['CAC', 'DANONE'])
# now for all missing values in CloseDANONE, so if there is Date in df
# but not in df1 we fill this value with LAST available
merged.CloseDANONE.fillna(method='ffill', inplace=True)
# we get values from CloseDANONE column as long as it's not null
close1 = merged.loc[merged.CloseDANONE.notnull(), 'CloseDANONE'].values
Below you can see:
last 6 values from df - CAC
Date Close
5522 2018-06-06 5457.560059
5523 2018-06-07 5448.359863
5524 2018-06-08 5450.220215
5525 2018-06-11 5473.910156
5526 2018-06-12 5453.370117
5527 2018-06-13 5468.240234
last 6 values from df1 - DANONE:
Date Close
0 2018-06-06 63.86
1 2018-06-07 63.71
2 2018-06-08 64.31
3 2018-06-11 64.91
4 2018-06-12 65.43
last 6 rows from merged:
Date CloseCAC CloseDANONE
5522 2018-06-06 5457.560059 63.86
5523 2018-06-07 5448.359863 63.71
5524 2018-06-08 5450.220215 64.31
5525 2018-06-11 5473.910156 64.91
5526 2018-06-12 5453.370117 65.43
5527 2018-06-13 5468.240234 65.43
For every value that was present in df we get value from df1, but 2018-06-13 is not present in df1 so I fill it with last available value which is 65.43 from 2018-06-12.

Categories