How to extract features using date range within a month? - python

I would like to extract a feature from a datetime column based on a day-of-month range: for example, if the day falls between 1 and 10, a column called
early_month
should hold 1, and 0 otherwise.
A question I posted earlier, linked below, was solved using indexer_between_time for time ranges.
How to extract features using time range?
I am using the following code to extract the day of the month from the date.
df["date_of_month"] = df["purchase_date"].dt.day
Thank you.

It's not clear from your question, but if you are trying to create a column that contains a 1 if the day is between 1 and 10, or 0 otherwise, it's very simple:
df['early_month'] = df['date_of_month'].apply(lambda x: 1 if x <= 10 else 0)
df['mid_month'] = df['date_of_month'].apply(lambda x: 1 if x >= 11 and x <= 20 else 0)
If, as a Python beginner, you would rather avoid lambda functions, you can achieve the same result by defining a named function and applying it:
def create_date_features(day, min_day, max_day):
    # 1 when day falls inside [min_day, max_day], else 0
    if min_day <= day <= max_day:
        return 1
    else:
        return 0
df['early_month'] = df['date_of_month'].apply(create_date_features, min_day=1, max_day=10)
df['mid_month'] = df['date_of_month'].apply(create_date_features, min_day=11, max_day=20)
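If you later want to drop apply entirely, Series.between (inclusive on both ends by default) gives the same result vectorized; a minimal sketch of that alternative:
df['early_month'] = df['date_of_month'].between(1, 10).astype(int)
df['mid_month'] = df['date_of_month'].between(11, 20).astype(int)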

I believe you need to convert the boolean mask to integers - True values are processed as 1s:
rng = pd.date_range('2017-04-03', periods=10, freq='17D')
df = pd.DataFrame({'purchase_date': rng, 'a': range(10)})
m2 = df["purchase_date"].dt.day <= 10
df['early_month'] = m2.astype(int)
print (df)
  purchase_date  a  early_month
0    2017-04-03  0            1
1    2017-04-20  1            0
2    2017-05-07  2            1
3    2017-05-24  3            0
4    2017-06-10  4            1
5    2017-06-27  5            0
6    2017-07-14  6            0
7    2017-07-31  7            0
8    2017-08-17  8            0
9    2017-09-03  9            1
Detail:
print (df["purchase_date"].dt.day <= 10)
0     True
1    False
2     True
3    False
4     True
5    False
6    False
7    False
8    False
9     True
Name: purchase_date, dtype: bool
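The mask and the cast also work inline, without the intermediate m2 variable; the same idea in one line:
df['early_month'] = (df["purchase_date"].dt.day <= 10).astype(int)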

Maybe you need this one:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'a':[1,2,3,4,5], 'time':['11.07.2018','12.07.2018','13.07.2018','14.07.2018','15.07.2018']})
df.time = pd.to_datetime(df.time, format='%d.%m.%Y')
df[df.time > datetime(2018,7,13)]  # if you need to filter by full date
df[df.time.dt.day > datetime(2018,7,13).day]  # if you need to filter by day of month
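And if the goal is the day-range feature from the question above, the .dt.day accessor combines naturally with between; a small sketch on the same df:
df['early_month'] = df.time.dt.day.between(1, 10).astype(int)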

Related

Iterate through data frame

My code pulls a dataframe object and I'd like to mask the dataframe.
If a value <= 15 then change value to 1 else change value to 0.
import pandas as pd
XTrain = pd.read_excel('C:\\blahblahblah.xlsx')
for each in XTrain:
    if each <= 15:
        each = 1
    else:
        each = 0
I'm coming from VBA and .NET, so I know it's not very pythonic, but it seems super easy to me...
The code hits an error since it iterates through the df header.
So I tried to check for type
for each in XTrain:
    if isinstance(each, str) is False:
        if each <= 15:
            each = 1
        else:
            each = 0
This time it got to the final header but did not progress into the dataframe.
This makes me think I am not looping through the dataframe correctly?
Been stumped for hours, could anyone send me a little help?
Thank you!
for each in XTrain always loops through the column names only; that's how pandas is designed.
pandas allows comparison/arithmetic operations with numbers directly. So you want:
# le is less than or equal to
XTrain.le(15).astype(int)
# same as
# (XTrain <= 15).astype(int)
If you really want to iterate (don't), remember that a dataframe is two dimensional. So something like this:
for index, row in df.iterrows():
    for cell in row:
        if cell <= 15:
            # do something
            # note: cell = 1 might not modify the cell in the original dataframe
            # this is a python thing and you will get used to it
            pass
        else:
            # do something else
            pass
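If you genuinely need to write values back while iterating, DataFrame.at assigns by label into the original frame; a minimal sketch (slow, shown only to illustrate the write-back caveat in the comments above, assuming a small dataframe df of mostly numeric cells):
for index, row in df.iterrows():
    for col in df.columns:
        cell = row[col]
        # skip strings; .at writes into the original dataframe,
        # unlike rebinding a loop variable
        if not isinstance(cell, str):
            df.at[index, col] = 1 if cell <= 15 else 0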
Setup
df = pd.DataFrame({'A' : range(0, 20, 2), 'B' : list(range(10, 19)) + ['a']})
print(df)
    A   B
0   0  10
1   2  11
2   4  12
3   6  13
4   8  14
5  10  15
6  12  16
7  14  17
8  16  18
9  18   a
Solution: pd.to_numeric
to avoid problems with str values, together with DataFrame.le:
df.apply(lambda x: pd.to_numeric(x, errors='coerce')).le(15).astype(int)
Output
   A  B
0  1  1
1  1  1
2  1  1
3  1  1
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  0  0
If you want to keep the string values:
df2 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
new_df = df2.where(lambda x: x.isna(), df2.le(15).astype(int)).fillna(df)
print(new_df)
   A  B
0  1  1
1  1  1
2  1  1
3  1  1
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  0  a
Use applymap to apply the function to each element of the dataframe and lambda to write the function.
df.applymap(lambda x: x if isinstance(x, str) else 1 if x <= 15 else 0)
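One caveat worth noting: on pandas 2.1+ applymap is deprecated in favour of the elementwise DataFrame.map, so the same line would become:
df.map(lambda x: x if isinstance(x, str) else 1 if x <= 15 else 0)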

Efficiently create row based on date pandas

Currently I am creating a series of columns that each contain a boolean based on a date column in the DataFrame I am using:
df['bool1'] = [1 if x > pd.to_datetime('20190731') else 0 for x in df['date']]
df['bool2'] = [1 if x > pd.to_datetime('20190803') else 0 for x in df['date']]
df['bool3'] = [1 if x > pd.to_datetime('20190813') else 0 for x in df['date']]
I figured that a list comprehension like this is a pythonic way of solving the problem. I feel like my code is very clear in what it is doing, and somebody could easily follow it.
There is a potential improvement in, say, creating a dictionary like {'bool1': '20190731'} and looping through the key:value pairs so that I don't repeat the line of code. But that only reduces the line count while improving readability and scalability; it won't actually make my code run faster.
However, my problem is that this code is actually very slow to run. Should I be using a lambda function to speed it up? What is the fastest way to write this code?
I think a dictionary mapping the new column names to the dates to compare against is a nice idea.
d = {'bool1':'20190731', 'bool2':'20190803', 'bool3':'20190813'}
Then it is possible to create the new columns in a loop:
for k, v in d.items():
    df[k] = (df['date'] > pd.to_datetime(v)).astype(int)
    # alternative
    # df[k] = np.where(df['date'] > pd.to_datetime(v), 1, 0)
To improve performance, use broadcasting in numpy:
rng = pd.date_range('20190731', periods=20)
df = pd.DataFrame({'date': rng})
d = {'bool1':'20190731', 'bool2':'20190803', 'bool3':'20190813'}
#pandas 0.24+
mask = df['date'].to_numpy()[:, None] > pd.to_datetime(list(d.values())).to_numpy()
#pandas below
#mask = df['date'].values[:, None] > pd.to_datetime(list(d.values())).values
arr = np.where(mask, 1, 0)
df = df.join(pd.DataFrame(arr, columns=d.keys()))
print (df)
         date  bool1  bool2  bool3
0  2019-07-31      0      0      0
1  2019-08-01      1      0      0
2  2019-08-02      1      0      0
3  2019-08-03      1      0      0
4  2019-08-04      1      1      0
5  2019-08-05      1      1      0
6  2019-08-06      1      1      0
7  2019-08-07      1      1      0
8  2019-08-08      1      1      0
9  2019-08-09      1      1      0
10 2019-08-10      1      1      0
11 2019-08-11      1      1      0
12 2019-08-12      1      1      0
13 2019-08-13      1      1      0
14 2019-08-14      1      1      1
15 2019-08-15      1      1      1
16 2019-08-16      1      1      1
17 2019-08-17      1      1      1
18 2019-08-18      1      1      1
19 2019-08-19      1      1      1
With numpy.where it should be faster:
df['bool1'] = np.where(df['date'] > pd.to_datetime('20190731'), 1, 0)
df['bool2'] = np.where(df['date'] > pd.to_datetime('20190803'), 1, 0)
df['bool3'] = np.where(df['date'] > pd.to_datetime('20190813'), 1, 0)
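The three nearly identical lines can also be collapsed with the dictionary idea mentioned above, building every column in a single assign call; a minimal sketch:
d = {'bool1': '20190731', 'bool2': '20190803', 'bool3': '20190813'}
df = df.assign(**{k: (df['date'] > pd.to_datetime(v)).astype(int) for k, v in d.items()})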

Apply a function to already existing Date column

I have the fictitious data below (my real data is sensitive):
df
record_id   date       sick  funny  happy
XK2C0001-3  7/10/2018     2      1      1
XK2C0002-1  7/10/2018     2      4      1
XK2C0003-9  7/11/2018     2      4      1
ZT2C0004-7  7/11/2018     2      4      1
XK2C0005-4  7/11/2018     1      1      1
XK2C0001-3  7/10/2018     2      4      1
XK2C0002-1  7/10/2018     2      4      1
XK2C0003-9  7/11/2018     1      4      1
XK2C0004-7  7/11/2018     2      4      1
ZT2C0005-4  7/11/2018     2      4      1
male_gender=df.loc[(df['sick'] == 1) | (df['funny'] == 1) | (df['happy'] == 1)]
male_gender['date'].value_counts().head()
2018-10-02    22
2018-10-03    14
2018-10-05    10
2018-11-01    10
2018-10-22    10
Name: date, dtype: int64
and I have the below working code to get the last 7 weekdays:
from datetime import date, timedelta
today = date.today()  # `today` was assumed defined in the original snippet
prev_days = [today - timedelta(days=i) for i in range(10)]
prev_days = [d for d in prev_days if d.weekday() < 5]  # keep Mon-Fri only
for d in prev_days[:7]:
    print(d)
My question is: how do I apply the function above to the dataframe column "date"? I just want the idea; the data above are fictitious, so you may give another example.
Edit: I want to know how many male_gender do I have in the last 7 weekdays relative to today only.
Convert your df['date'] to a datetime series, filter your dataframe, and then use pd.Series.value_counts:
df['date'] = pd.to_datetime(df['date'])
m1 = (df['sick'] == 1) | (df['funny'] == 1) | (df['happy'] == 1) # custom conditions
m2 = df['date'] >= pd.Timestamp('today') - pd.DateOffset(days=7) # last 7 days
m3 = ~df['date'].dt.weekday.isin([5, 6]) # not Sat or Sun
res = df.loc[m1 & m2 & m3, 'date'].value_counts()
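Since the edit asks for a single number rather than per-day counts, the combined mask can simply be summed; a short follow-up sketch:
# total male_gender rows in the last 7 days, weekends excluded
total = (m1 & m2 & m3).sum()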

Iterating through DataFrame and keeping track of a certain sequence duration

I'd like to figure out how often negative values occur and how long each negative stretch lasts.
example df
d = {'value': [1,2,-3,-4,-5,6,7,8,-9,-10], 'period':[1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(data=d)
I checked which rows have negative values with df['value'] < 0.
I thought I could just iterate through each row, keep a counter for when a negative value occurs, and perhaps move that row to another df, as I would like to save the beginning and ending periods.
What I'm currently trying:
def count_negatives(df):
    df_negatives = pd.DataFrame(columns=['start', 'end', 'counter'])
    for index, row in df.iterrows():
        counter = 0
        df_negative_index = 0
        while row['value'] < 0:
            # if it's the first one, add it to df as start?
            # grab the last one and add it as end
            # constantly overwrite the counter?
            counter += 1
        # add counter to df row
        df_negatives['counter'] = counter
    return df_negatives
Except that gives me an infinite loop, I think. If I replace the while with an if, I'm stuck coming up with a way to keep track of how long.
I think it is better to avoid loops:
# compare by <
a = df['value'].lt(0)
# running sum
b = a.cumsum()
# counter only for consecutive negative values
df['counter'] = b - b.mask(a).ffill().fillna(0).astype(int)
print (df)
   value  period  counter
0      1       1        0
1      2       2        0
2     -3       3        1
3     -4       4        2
4     -5       5        3
5      6       6        0
6      7       7        0
7      8       8        0
8     -9       9        1
9    -10      10        2
Or if you don't need to reset the counter:
a = df['value'].lt(0)
# replace values by 0 where the mask is False
df['counter'] = a.cumsum().where(a, 0)
print (df)
   value  period  counter
0      1       1        0
1      2       2        0
2     -3       3        1
3     -4       4        2
4     -5       5        3
5      6       6        0
6      7       7        0
7      8       8        0
8     -9       9        4
9    -10      10        5
If you want the start and end periods:
# compare for negative mask
a = df['value'].lt(0)
# inverted mask
b = (~a).cumsum()
# filter only negative rows
c = b[a].reset_index()
# aggregate first and last value per group
df = (c.groupby('value')['index']
        .agg([('start', 'first'), ('end', 'last')])
        .reset_index(drop=True))
print (df)
   start  end
0      2    4
1      8    9
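To also answer the "how long" part directly, a size aggregation can sit alongside first/last; a sketch building on the frame c above (df_runs is just a hypothetical name, to avoid overwriting df):
df_runs = (c.groupby('value')['index']
             .agg([('start', 'first'), ('end', 'last'), ('length', 'size')])
             .reset_index(drop=True))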
I would like to save the beginning period and ending period.
If this is your requirement, you can use itertools.groupby. Note also a period series is not required, as Pandas provides a natural integer index (beginning at 0) if not explicitly provided.
from itertools import groupby
from operator import itemgetter
d = {'value': [1,2,-3,-4,-5,6,7,8,-9,-10]}
df = pd.DataFrame(data=d)
ranges = []
for k, g in groupby(enumerate(df['value'][df['value'] < 0].index), lambda x: x[0] - x[1]):
    group = list(map(itemgetter(1), g))
    ranges.append((group[0], group[-1]))
print(ranges)
[(2, 4), (8, 9)]
Then, to convert to a dataframe:
df = pd.DataFrame(ranges, columns=['start', 'end'])
print(df)
   start  end
0      2    4
1      8    9
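From there, the duration of each negative stretch is just the difference of the two columns; a one-line follow-up on the same frame:
df['length'] = df['end'] - df['start'] + 1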

Consolidate time periods data in pandas

How do I consolidate time periods data in Python pandas?
I want to manipulate data from
person  start      end
1       2001-1-8   2002-2-14
1       2002-2-14  2003-3-1
2       2001-1-5   2002-2-16
2       2002-2-17  2003-3-9
to
person  start     end
1       2001-1-8  2003-3-1
2       2001-1-5  2003-3-9
I want to check first whether the last end and the new start are within 1 day of each other. If not, keep the original data structure; if so, consolidate.
df.sort_values(["person", "start", "end"], inplace=True)

def condense(df):
    # assumes start/end are already datetime and timedelta is imported
    # (from datetime import timedelta)
    df['prev_end'] = df["end"].shift(1)
    df['dont_condense'] = (abs(df['prev_end'] - df['start']) > timedelta(days=1))
    df["group"] = df['dont_condense'].fillna(False).cumsum()
    return df.groupby("group").apply(lambda x: pd.Series({"person": x.iloc[0].person,
                                                          "start": x.iloc[0].start,
                                                          "end": x.iloc[-1].end}))

df.groupby("person").apply(condense).reset_index(drop=True)
You can use the following if each group contains only 2 rows, the required difference is 1 or 0 days, and all data are sorted:
print (df)
   person      start        end
0       1   2001-1-8  2002-2-14
1       1  2002-2-14   2003-3-1
2       2   2001-1-5  2002-2-16
3       2  2002-2-17   2003-3-9
4       3   2001-1-2  2002-2-14
5       3  2002-2-17  2003-3-10
df.start = pd.to_datetime(df.start)
df.end = pd.to_datetime(df.end)
def f(x):
    # if you need a difference of 0 days only, use:
    # a = (x['start'] - x['end'].shift()) == pd.Timedelta(days=0)
    a = (x['start'] - x['end'].shift()).isin([pd.Timedelta(days=1), pd.Timedelta(days=0)])
    if a.any():
        x.end = x['end'].shift(-1)
    return x
df1 = df.groupby('person').apply(f).dropna().reset_index(drop=True)
print (df1)
   person      start        end
0       1 2001-01-08 2003-03-01
1       2 2001-01-05 2003-03-09
2       3 2001-01-02 2002-02-14
3       3 2002-02-17 2003-03-10
