I have a dataframe that looks something like this
dt user
0 2016-01-01 a
1 2016-01-02 a
2 2016-01-03 a
3 2016-01-04 a
4 2016-01-05 a
5 2016-01-06 a
6 2016-01-01 b
7 2016-01-02 b
8 2016-01-03 b
9 2016-01-04 b
10 2016-01-05 b
11 2016-01-06 b
12 2016-01-07 b
13 2015-12-31 c
14 2016-01-01 c
15 2016-01-02 c
16 2016-01-03 c
17 2016-01-04 c
18 2016-01-05 c
19 2016-01-06 c
20 2016-01-07 c
21 2016-01-08 c
22 2016-01-09 c
23 2016-01-10 c
I want to find the missing dates for each user. For the date ranges, the minimum date is 2015-12-31 and the maximum date is 2016-01-10. The result would look like this:
user missing_days
a 5
b 4
c 0
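For reference, here is a minimal sketch that reconstructs this sample frame (it assumes dt is stored as date strings, which is what the answers below convert from):

import pandas as pd

df = pd.DataFrame({
    'dt': (list(pd.date_range('2016-01-01', '2016-01-06').strftime('%Y-%m-%d'))
           + list(pd.date_range('2016-01-01', '2016-01-07').strftime('%Y-%m-%d'))
           + list(pd.date_range('2015-12-31', '2016-01-10').strftime('%Y-%m-%d'))),
    'user': ['a'] * 6 + ['b'] * 7 + ['c'] * 11,
})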
Use isin to check the full date range against each user's group of dates, then agg the sum of the negated boolean mask for each group:
df['dt'] = pd.to_datetime(df['dt'])  # if the `dt` column is already datetime dtype, skip this
check_dates = pd.date_range('2015-12-31', '2016-01-10', freq='D')
s = df.groupby('user')['dt'].agg(lambda x: (~check_dates.isin(x)).sum())
Out[920]:
user
a 5
b 4
c 0
Name: dt, dtype: int64
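If you want the two-column DataFrame shown in the question rather than a Series, a small follow-up step is to reset the index:

result = s.rename('missing_days').reset_index()
#   user  missing_days
# 0    a             5
# 1    b             4
# 2    c             0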
### Convert your dates to datetime
df['dt'] = pd.to_datetime(df['dt'], infer_datetime_format=True)
### Create the list of dates per user
user_days = df.groupby('user')['dt'].apply(list)
### Initialize the final dataframe
df_miss_dates = pd.DataFrame(user_days)
all_dates = pd.date_range('2015-12-31', '2016-01-10', freq='D')
### Find the number of missing dates per user
df_miss_dates['missing_days'] = df_miss_dates['dt'].apply(lambda x: len(set(all_dates) - set(x)))
df_miss_dates.drop(columns='dt', inplace=True)
print(df_miss_dates)
Output:
missing_days
user
a 5
b 4
c 0
You can do it this way:
from datetime import date, timedelta

sdate = date(2015, 12, 31)  # start date
edate = date(2016, 1, 10)   # end date
delta = edate - sdate       # as timedelta

days = []
for i in range(delta.days + 1):
    day = sdate + timedelta(days=i)
    days.append(str(day))

user = []
missing_days = []
for user_n in df.user.unique():
    user_days = df.loc[df.user == user_n, 'dt'].to_list()
    md = len([day for day in days if day not in user_days])
    user.append(user_n)
    missing_days.append(md)

new_df = pd.DataFrame({'user': user, 'missing_days': missing_days})
new_df
Output:
  user  missing_days
0    a             5
1    b             4
2    c             0
Define the following function:
def missingDates(grp: pd.Series, d1: pd.Timestamp, d2: pd.Timestamp):
    ndTotal = (d2 - d1).days + 1
    ndPresent = grp[grp.between(d1, d2)].index.size
    return ndTotal - ndPresent
Then apply it to each group and convert the result into a DataFrame (from your post, you want just a DataFrame with 2 columns):
result = df.groupby('user')['dt'].apply(missingDates,
    pd.to_datetime('2015-12-31'), pd.to_datetime('2016-01-10'))\
    .rename('missing_days').reset_index()
The result is:
user missing_days
0 a 5
1 b 4
2 c 0
My solution relies on the fact that dates within each group are unique and have no time component. If these conditions were not met, you would need to normalize the dates and call unique first.
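A minimal sketch of that guard (my addition, assuming dt might contain times or duplicate dates):

def missingDates(grp: pd.Series, d1: pd.Timestamp, d2: pd.Timestamp):
    ndTotal = (d2 - d1).days + 1
    days = grp.dt.normalize().drop_duplicates()  # strip the time part and de-duplicate
    ndPresent = days[days.between(d1, d2)].size
    return ndTotal - ndPresent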
Additional remark: consider changing dt (the column name) to some other name, because dt is the name of the datetime accessor in pandas. It is bad practice to shadow standard pandas names with column or variable names.
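For instance (the new column name is only illustrative):

df = df.rename(columns={'dt': 'date'})  # avoid clashing with the pandas .dt accessor name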
I have a pandas dataframe with date index. Like this
A B C
date
2021-04-22 2 1 3
2021-05-22 3 2 4
2021-06-22 4 3 5
2021-07-22 5 4 6
2021-08-22 6 5 7
I want to create a new dataframe that selects rows that are only for 2 days previous for a given date. So for example if I give selected = '2021-08-22', what I need is a new dataframe like below
A B C
date
2021-07-22 5 4 6
2021-08-22 6 5 7
Can someone please help me with this? Many thanks for your help.
You can convert the index to a DatetimeIndex, then use df[start_date : end_date]:
df.index = pd.to_datetime(df.index)
selected = '2021-08-22'
res = df[(pd.to_datetime(selected)-pd.Timedelta(days=2)) : selected]
print(res)
            A  B  C
date
2021-08-22  6  5  7
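Note that slice-based selection like this generally requires a sorted (monotonic) DatetimeIndex; if yours is not sorted, sort it first:

df = df.sort_index()  # slicing by date range needs a monotonic DatetimeIndex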
I'm assuming that you meant months instead of days.
You can use the df.apply method to filter the dataframe rows with a function.
Here is a function that receives the inputs you described and returns the new dataframe:
from datetime import datetime

def filter_df(df, date, num_months):
    # assumes 'date' is a string column in '%Y-%m-%d' format
    def diff_month(row):
        date1 = datetime.strptime(row["date"], '%Y-%m-%d')
        date2 = datetime.strptime(date, '%Y-%m-%d')
        return (date1.year - date2.year) * 12 + date1.month - date2.month
    return df[df.apply(diff_month, axis=1) > -num_months]

print(filter_df(df, "2021-08-22", 2))
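Alternatively, here is a sketch (not from the original answer) that uses pd.DateOffset for the month window, again assuming 'date' is a string column; note the window bounds are inclusive here, unlike the strict month-difference test above:

import pandas as pd

def filter_df_offset(df, date, num_months):
    dates = pd.to_datetime(df['date'])
    end = pd.to_datetime(date)
    start = end - pd.DateOffset(months=num_months)
    # keep rows whose date falls in the closed window [start, end]
    return df[dates.between(start, end)]

print(filter_df_offset(df, "2021-08-22", 2))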
I want to keep the records whose create_date falls between from_date and to_date, and for the records that do not fall in that period, keep only the record with the largest index within each 'indicator' group.
from_date = '2022-01-01'
to_date = '2022-04-10'
indicator create_date
0 A 2022-01-03
1 B 2021-12-30
2 B 2021-07-11
3 C 2021-02-10
4 C 2021-09-08
5 C 2021-07-24
6 C 2021-01-30
Here is the result I want:
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
I've been looking for a solution for a long time, but I only found "How to get the index of the smallest value", and I can't find a way to select by the largest index.
You can create a helper column with the index values, get the maximal index per indicator with DataFrameGroupBy.idxmax, and finally select the rows with DataFrame.loc:
df2 = df.loc[df.assign(tmp=df.index).groupby('indicator')['tmp'].idxmax()]
print (df2)
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
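If the rows of each indicator already appear in increasing index order (as in this sample), an equivalent shortcut is to take the last row per group:

df2 = df.groupby('indicator').tail(1)  # last row per indicator == largest index here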
EDIT: If you need the maximal index only among the rows that do not fall between from_date and to_date, use boolean indexing and then join with concat:
from_date = '2022-01-01'
to_date = '2022-04-10'
df['create_date'] = pd.to_datetime(df['create_date'])
m = df['create_date'].between(from_date, to_date)
df2 = df.loc[df.assign(tmp=df.index)[~m].groupby('indicator')['tmp'].idxmax()]
print (df2)
indicator create_date
2 B 2021-07-11
6 C 2021-01-30
df = pd.concat([df[m], df2])
print (df)
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
You can try
df['create_date'] = pd.to_datetime(df['create_date'])
m = df['create_date'].between(from_date, to_date)
df_ = df[~m].groupby('indicator', as_index=False).apply(lambda g: g.loc[[max(g.index)]]).droplevel(level=0)
out = pd.concat([df[m], df_], axis=0).sort_index()
print(out)
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
I have a DataFrame which contains data from last year, but the dates column has some missing dates
date
0 2019-10-21
1 2019-10-29
2 2019-11-01
3 2019-11-04
4 2019-11-05
I want to create a dictionary of gaps between dates, so keys would be start dates and values as end dates, something like:
dates_gaps = {2019-10-21:2019-10-29, 2019-10-29:2019-11-01,2019-11-01:2019-11-04 ...}
so I created a column to indicate whether a gap exists with the following:
df['missing_dates'] = df['date'].diff().dt.days > 1
which outputs the following:
# True indicates that there is a gap before this date
0 2019-10-21 False
1 2019-10-29 True
2 2019-11-01 True
3 2019-11-04 True
4 2019-11-05 False
and I'm having trouble going forward from here
You can add a condition to also keep the missing (NaN) first diff, convert the date column to strings with Series.dt.strftime, and finally build the dictionary with zip:
diff = df['date'].diff()
s = df.loc[(diff.dt.days > 1) | diff.isna(), 'date'].dt.strftime('%Y-%m-%d')
print (s)
0 2019-10-21
1 2019-10-29
2 2019-11-01
3 2019-11-04
Name: date, dtype: object
d = dict(zip(s, s.shift(-1)[:-1]))
print (d)
{'2019-10-21': '2019-10-29', '2019-10-29': '2019-11-01', '2019-11-01': '2019-11-04'}
Just convert these dates into datetime and find the difference between two adjacent dates:
a = pd.to_datetime('1900-01-01', format='%Y-%m-%d')
b = pd.to_datetime('1900-02-01', format='%Y-%m-%d')
c = a-b
c.days # -31
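Putting that idea together with the question (a sketch, assuming df['date'] is already datetime and sorted):

diffs = df['date'].diff().dt.days           # days between adjacent rows
starts = df['date'].shift(1)[diffs > 1]     # date before each gap
ends = df['date'][diffs > 1]                # date after each gap
dates_gaps = dict(zip(starts.dt.strftime('%Y-%m-%d'),
                      ends.dt.strftime('%Y-%m-%d')))
print(dates_gaps)
# {'2019-10-21': '2019-10-29', '2019-10-29': '2019-11-01', '2019-11-01': '2019-11-04'}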
I have a list of dates and alphabets. I have to find the count of each alphabet occurring within a week. I'm trying to group them by alphabet and resample by '1w', but I get a weird data frame that contains a MultiIndex. How can I do all this and get a DataFrame with just three columns: the resampled date, the alphabet (score) and the count?
PS: What I'm looking for is, for each week, the count of occurrences of every alphabet in that week.
Something like this:
datetime alphabet count
2016-12-27 22:57:45.407246 a 1
2016-12-30 22:57:45.407246 a 2
2017-01-02 22:57:45.407246 a 0
2016-12-27 22:57:45.407246 b 0
2016-12-30 22:57:45.407246 b 1
2017-01-02 22:57:45.407246 b 4
2016-12-27 22:57:45.407246 c 7
2016-12-30 22:57:45.407246 c 0
2017-01-02 22:57:45.407246 c 0
Here is the code
import random
import pandas as pd
import datetime
def randchar(a, b):
    return chr(random.randint(ord(a), ord(b)))
# Create a datetime variable for today
base = datetime.datetime.today()
# Create a list variable that creates 365 days of rows of datetime values
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
score_list =[randchar('a', 'h') for i in range(365)]
df = pd.DataFrame()
# Create a column from the datetime variable
df['datetime'] = date_list
# Convert that column into a datetime datatype
df['datetime'] = pd.to_datetime(df['datetime'])
# Set the datetime column as the index
df.index = df['datetime']
# Create a column from the numeric score variable
df['score'] = score_list
df_s = df.groupby('score').resample('1w').count()
You can apply a groupby + count with two grouping keys: a pd.Grouper with a weekly frequency, and the score column. Finally, unstack the result.
df = df.groupby([pd.Grouper(freq='1w'), 'score']).count().unstack(fill_value=0)
df.head()
datetime
score a b c d e f g h
datetime
2016-12-25 0 0 1 1 0 1 0 1
2017-01-01 1 0 0 1 3 0 2 0
2017-01-08 0 3 1 1 1 0 0 1
2017-01-15 1 2 0 2 0 0 1 1
2017-01-22 0 1 2 1 1 2 0 0
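If you prefer the long layout from the question (one row per week and letter), a small follow-up sketch can stack the result back ('datetime' is the counted column left over after the groupby):

long_df = (df['datetime']   # the single counted column under the MultiIndex
             .stack()       # -> Series indexed by (week, score)
             .rename('count')
             .reset_index())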
I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need between with boolean indexing to filter first, and then groupby with size aggregation.
The outputs are concatenated, and reindex adds the missing rows, filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
#convert to datetime (if first number is day, add parameter dayfirst)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()
print (now)
oneyarbeforenow = now - pd.offsets.DateOffset(years=1)
oneyarafternow = now + pd.offsets.DateOffset(years=1)
#first filter
a = df[df['Date'].between(oneyarbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyarafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
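If you want descriptive column names instead of 0 and 1, you could rename them (the names here are only illustrative):

df1.columns = ['last_year', 'next_year']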
EDIT:
If you need to compare each date against the group's last date plus or minus a one-year offset, you need a custom function with the conditions, summing the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(),
                         (x < x.iat[-1] + offs).sum()], index=['last', 'next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of the missing in-between years (see year 2015):
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only the year in the column labels:
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1