Apply a function to an already existing date column - Python

I have the fictitious code below (my real code is sensitive):
df
record_id date sick funny happy
XK2C0001-3 7/10/2018 2 1 1
XK2C0002-1 7/10/2018 2 4 1
XK2C0003-9 7/11/2018 2 4 1
ZT2C0004-7 7/11/2018 2 4 1
XK2C0005-4 7/11/2018 1 1 1
XK2C0001-3 7/10/2018 2 4 1
XK2C0002-1 7/10/2018 2 4 1
XK2C0003-9 7/11/2018 1 4 1
XK2C0004-7 7/11/2018 2 4 1
ZT2C0005-4 7/11/2018 2 4 1
male_gender=df.loc[(df['sick'] == 1) | (df['funny'] == 1) | (df['happy'] == 1)]
male_gender['date'].value_counts().head()
2018-10-02 22
2018-10-03 14
2018-10-05 10
2018-11-01 10
2018-10-22 10
Name: date, dtype: int64
and I have the following working code to filter the last 7 weekdays:
from datetime import date, timedelta

today = date.today()  # 'today' is assumed to be today's date; date.today() is one option
prev_days = [today - timedelta(days=i) for i in range(10)]
prev_days = [d for d in prev_days if d.weekday() < 5]  # keep Mon-Fri only
for d in prev_days[:7]:
    print(d)
My question is: how do I apply the logic above to the dataframe column "date"? I just want the idea; the data above are fictitious, so you may use another example.
Edit: I only want to know how many male_gender rows I have in the last 7 weekdays relative to today.

Convert your df['date'] to a datetime series, filter your dataframe, and then use pd.Series.value_counts:
df['date'] = pd.to_datetime(df['date'])
m1 = (df['sick'] == 1) | (df['funny'] == 1) | (df['happy'] == 1) # custom conditions
m2 = df['date'] >= pd.Timestamp('today') - pd.DateOffset(days=7) # last 7 days
m3 = ~df['date'].dt.weekday.isin([5, 6]) # not Sat or Sun
res = df.loc[m1 & m2 & m3, 'date'].value_counts()
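Note that the combination of m2 and m3 keeps weekdays that fall within the last 7 calendar days, which is at most 5 weekdays. If you need exactly the last 7 weekdays relative to today (as in the edit), one option is to build that set of days explicitly, for example with pd.bdate_range, and filter on membership. A minimal sketch, assuming df['date'] has already been converted to datetime as above:
import pandas as pd

today = pd.Timestamp('today').normalize()
last7 = pd.bdate_range(end=today, periods=7)  # last 7 business days (Mon-Fri), ending today

m1 = (df['sick'] == 1) | (df['funny'] == 1) | (df['happy'] == 1)
m2 = df['date'].dt.normalize().isin(last7)

res = df.loc[m1 & m2, 'date'].value_counts()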

Related

Select rows from dataframe where the difference in time is smallest per group

QUESTION: How do I find all rows in a pandas dataframe which have the minimum time difference when compared to the Advicehour of an advice?
Example:
Advicenr Advicehour Setdownnr Zone Setdownhour
0 A 1 A 16 **2** <-- zone 16 is closest to advicehour of A
1 A 1 A 16 **3**
2 A 2 A 18 5
3 A 2 A 18 8
4 B 4 B 19 18
5 B 8 B 20 **12** <-- zone 20 is closest to advicehour of B
Expected output:
Advicenr Advicehour Setdownnr Zone Setdownhour
0 A 1 A 16 3
1 A 1 A 16 2
5 B 8 B 20 12
It is not possible for the setdown to happen before the advice, and it should also not be possible for an advice for a different zone to have a timestamp before the previous one ended.
First create a column for the absolute differences between the columns, then get the Zone with the minimal difference per group, and select all matching rows:
df['diff'] = df['Setdownhour'].sub(df['Advicehour']).abs()
s = df.set_index('Zone').groupby('Advicenr', sort=False)['diff'].transform('idxmin')
df = df[(s == s.index).to_numpy()]
print (df)
Advicenr Advicehour Setdownnr Zone Setdownhour diff
0 A 1 A 16 2 1
1 A 1 A 16 3 2
5 B 8 B 20 12 4
Solution without helper column in output:
s = df['Setdownhour'].sub(df['Advicehour']).abs()
s1 = df.assign(s = s).set_index('Zone').groupby('Advicenr')['s'].transform('idxmin')
df = df[(s1 == s1.index).to_numpy()]
print (df)
Advicenr Advicehour Setdownnr Zone Setdownhour
0 A 1 A 16 2
1 A 1 A 16 3
5 B 8 B 20 12
Thanks to the advice from Jezrael, I ended up doing:
df['diff'] = df['Setdownhour'].sub(inner_join_tote_nr['Advicehour']).abs()
df['avg_diff'] = df.groupby(['Setdownnr', 'Advicehour', 'Zone'])['diff'].transform('min')
s = df.groupby(['Advicenr', 'Advicehour'], sort=False)['avg_diff'].min().reset_index()
selected = pd.merge(s, inner_join_tote_nr, left_on=['Advicenr','Advicehour', 'avg_diff'], right_on = ['Advicenr','Advicehour', 'avg_diff'])
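As a side note, a hedged alternative to the idxmin-based solutions above, assuming that all rows tied on the minimum difference per Advicenr should be kept, is to broadcast the per-group minimum with transform('min') and compare:
diff = df['Setdownhour'].sub(df['Advicehour']).abs()
res = df[diff == diff.groupby(df['Advicenr']).transform('min')]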

How to modify a DataFrame to remove ids within certain bounds of another time variable

I have a dataframe that looks like:
id TakingTime
1 03-01-2015
1 18-07-2015
1 22-10-2015
1 14-01-2016
2 11-02-2015
2 28-02-2015
2 18-04-2015
2 19-05-2015
3 11-02-2015
3 16-11-2015
3 19-02-2016
3 21-04-2016
4 03-01-2015
4 03-01-2015
4 03-01-2015
4 03-01-2015
The output desired is :
id TakingTime
1 03-01-2015
1 18-07-2015
1 22-10-2015
1 14-01-2016
3 11-02-2015
3 16-11-2015
3 19-02-2016
3 21-04-2016
I want to remove all ids for which the time between the first and the last taking time is less than one year, i.e. keep only ids whose taking times span at least one year (as in the desired output above).
I tried with
df[df.groupby('ID')['takingtime'].transform(lambda x: x.nunique() > 1)]
But I'm not sure if this is the right way to do it, and if so, what does the > 1 mean? Days, months, years...?
Use:
idx = df.groupby('id').TakingTime.transform(lambda x: x.dt.year.diff().sum().astype(bool))
df[idx]
Output:
id TakingTime
0 1 2015-03-01
1 1 2015-07-18
2 1 2015-10-22
3 1 2016-01-14
8 3 2015-11-02
9 3 2015-11-16
10 3 2016-02-19
11 3 2016-04-21
Explanation:
For each id, take the difference across the years. If there's any difference greater than 0 (i.e. sum().astype(bool)), it returns True. We used transform to replicate the output for the whole group. Finally, slice the dataframe with the resulting boolean mask.
Edit:
To analyze a specific amount of time (in days):
days = 865
df.groupby('id').TakingTime.transform(lambda x: (x.max() - x.min()).days >= days)
or:
from datetime import timedelta
days = timedelta(865)
df.groupby('id').TakingTime.transform(lambda x: (x.max() - x.min()) >= days)
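Putting it together for the stated goal (keep only ids whose first-to-last span is at least one year), a minimal sketch assuming day-first dates and a 365-day threshold:
import pandas as pd

df = pd.DataFrame({
    'id':         [1, 1, 2, 2, 4, 4],
    'TakingTime': ['03-01-2015', '14-01-2016', '11-02-2015',
                   '19-05-2015', '03-01-2015', '03-01-2015'],
})
df['TakingTime'] = pd.to_datetime(df['TakingTime'], dayfirst=True)

# span between first and last taking time, broadcast to every row of the id
span = df.groupby('id')['TakingTime'].transform(lambda x: x.max() - x.min())
print(df[span >= pd.Timedelta(days=365)])  # only id 1 remains in this small sample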

Selecting row from a group on highest score based on two columns

Data
Sentence Score_Unigram Score_Bigram versionId
0 As of Dat 5 1 269004158
1 Date Docum 4 3 269004158
2 As of Dat 4 1 269004158
3 Date Docum 5 3 345973060
4 x Indicate 4 1 372529352
5 Date Docum 5 3 372529352
6 1 Financial 9 1 372529352
7 020 per shar 2 0 372529352
8 Date $ in 8 1 372529352
9 Date $ in 9 4 372529352
10 4 --------- 4 1 372529352
11 Date Begin 1 0 372529352
Required Output
Sentence Score_Unigram Score_Bigram versionId
0 As of Dat 5 1 269004158
3 Date Docum 5 3 345973060
9 Date $ in 9 4 372529352
Objective
Group by versionId, get the row with the maximum Score_Unigram; if there is more than one such row, check the Score_Bigram column and take the row with the highest value (if there is still more than one such row, return all of them).
What have I tried
maximum = 0
index_to_pick = []
for index, row_data in a.iterrows():
    if row_data['Score_Unigram'] > maximum:
        maximum = row_data['Score_Unigram']
        score_bigram = row_data['Score_Bigram']
        index_to_pick.append(index)
    elif row_data['Score_Unigram'] == maximum:
        if row_data['Score_Bigram'] > score_bigram:
            maximum = row_data['Score_Unigram']
            score_bigram = row_data['Score_Bigram']
            index_to_pick = []
            index_to_pick.append(index)
        elif row_data['Score_Bigram'] == score_bigram:
            index_to_pick.append(index)
a.loc[[index_to_pick[0]]]
Output
Sentence Score_Unigram Score_Bigram versionId
5 Date $ in 9 4 372529352
Okay, the approach is not pretty I guess (since the data is large), so I'm looking for an efficient one.
I tried idxmax, but that returns only the top row. This might be a duplicate, but I wasn't able to find one. Thanks for the help!
Use double filtering by boolean indexing - first by the max of Score_Unigram per versionId and then by Score_Bigram:
df = df[df.groupby('versionId')['Score_Unigram'].transform('max') == df['Score_Unigram']]
df = df[df.groupby(['versionId', 'Score_Unigram'])['Score_Bigram'].transform('max') == df['Score_Bigram']]
print (df)
Sentence Score_Unigram Score_Bigram versionId
0 As of Dat 5 1 269004158
3 Date Docum 5 3 345973060
9 Date $ in 9 4 372529352
try this on your df :
df.sort_values(['Score_Unigram','Score_Bigram'],ascending=False).head(1)
Output:
Sentence Score_Unigram Score_Bigram versionId
5 Date $ in 9 4 372529352
I believe you don't need to sort the data; just compare to the max value of those 2 columns:
df[(df['Score_Unigram'] == df['Score_Unigram'].max()) &
   (df['Score_Bigram'] == df['Score_Bigram'].max())]
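Note that the last two snippets compare against the maximum of the whole frame rather than per versionId. A per-group variant of the sort-based idea, as a sketch (it keeps only one row per versionId, so ties on both scores are dropped rather than returned):
res = (df.sort_values(['Score_Unigram', 'Score_Bigram'], ascending=False)
         .groupby('versionId')
         .head(1)
         .sort_index())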

How to extract features using date range within a month?

I would like to extract features from a datetime column for a day/date, for example between day 1 and 10; the output is stored under a column called early_month as 1, or 0 otherwise.
The following question I posted earlier gave me a solution using indexer_between_time in order to use time ranges.
How to extract features using time range?
I am using the following code to extract days of the month from date.
df["date_of_month"] = df["purchase_date"].dt.day
Thank you.
It's not clear from your question, but if you are trying to create a column that contains a 1 if the day is between 1 and 10, or 0 otherwise, it's very simple:
df['early_month'] = df['date_of_month'].apply(lambda x: 1 if x <= 10 else 0)
df['mid_month'] = df['date_of_month'].apply(lambda x: 1 if x >= 11 and x <= 20 else 0)
If you are a Python beginner and would rather avoid lambda functions, you could achieve the same result by creating a function and then applying it like so:
def create_date_features(day, min_day, max_day):
    if day >= min_day and day <= max_day:
        return 1
    else:
        return 0

df['early_month'] = df['date_of_month'].apply(create_date_features, min_day=1, max_day=10)
df['mid_month'] = df['date_of_month'].apply(create_date_features, min_day=11, max_day=20)
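A vectorised variant of the same idea, as a sketch (it assumes the df['date_of_month'] column from above and avoids apply entirely):
df['early_month'] = df['date_of_month'].between(1, 10).astype(int)
df['mid_month'] = df['date_of_month'].between(11, 20).astype(int)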
I believe you need to convert the boolean mask to integers - True values are processed like 1s:
rng = pd.date_range('2017-04-03', periods=10, freq='17D')
df = pd.DataFrame({'purchase_date': rng, 'a': range(10)})
m2 = df["purchase_date"].dt.day <= 10
df['early_month'] = m2.astype(int)
print (df)
purchase_date a early_month
0 2017-04-03 0 1
1 2017-04-20 1 0
2 2017-05-07 2 1
3 2017-05-24 3 0
4 2017-06-10 4 1
5 2017-06-27 5 0
6 2017-07-14 6 0
7 2017-07-31 7 0
8 2017-08-17 8 0
9 2017-09-03 9 1
Detail:
print (df["purchase_date"].dt.day <= 10)
0 True
1 False
2 True
3 False
4 True
5 False
6 False
7 False
8 False
9 True
Name: purchase_date, dtype: bool
Maybe you need this one:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'time': ['11.07.2018', '12.07.2018', '13.07.2018', '14.07.2018', '15.07.2018']})
df.time = pd.to_datetime(df.time, format='%d.%m.%Y')
df[df.time > datetime(2018, 7, 13)]              # if you need to filter by date
df[df.time.dt.day > datetime(2018, 7, 13).day]   # if you need to filter by day of month

Count number of rows for each ID within 1 year

I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need between with boolean indexing to filter first, and then groupby with size to aggregate.
The outputs are concatenated, and reindex adds the missing rows, filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
#convert to datetime (if first number is day, add parameter dayfirst)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()
print (now)
oneyearbeforenow = now - pd.offsets.DateOffset(years=1)
oneyearafternow = now + pd.offsets.DateOffset(years=1)
#first filter
a = df[df['Date'].between(oneyearbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyearafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
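Since the question asks for the counts as new columns on the original frame, one hedged way to attach the windowed counts from above back onto df is to map the a and b Series by ID (the column names here are made up):
df['last_year_count'] = df['ID'].map(a).fillna(0).astype(int)
df['next_year_count'] = df['ID'].map(b).fillna(0).astype(int)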
EDIT:
If you need to compare each date against a per-group reference date (here the group's last date, x.iat[-1]) plus or minus a one-year offset, you need a custom function with the conditions and then sum the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(),
                         (x < x.iat[-1] + offs).sum()], index=['last', 'next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
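If the columns should instead be computed per row (for each entry, how many of the same ID's entries fall in the year before or after that entry's own date), here is a hedged sketch using pairwise comparisons inside each group; the names window_counts, last_year and next_year are made up, and the approach is O(n^2) per ID, so it suits moderate group sizes:
import numpy as np

def window_counts(g):
    # compare every date of the group with every other date of the same group
    d = g['Date'].to_numpy()
    year = np.timedelta64(365, 'D')
    last = ((d[None, :] >= d[:, None] - year) & (d[None, :] < d[:, None])).sum(axis=1)
    nxt = ((d[None, :] > d[:, None]) & (d[None, :] <= d[:, None] + year)).sum(axis=1)
    return pd.DataFrame({'last_year': last, 'next_year': nxt}, index=g.index)

df = df.join(df.groupby('ID', group_keys=False).apply(window_counts))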
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of the missing in-between years; see year 2015.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only the year in the columns:
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
