Find alphabet counts per week in python pandas

I have a list of dates and alphabets. I have to find the count of each alphabet occurring within a week. I am trying to group them by alphabet and resample by '1w', but I get a weird DataFrame containing a MultiIndex. How can I do all this and get a DataFrame with just three columns containing the score, the resampled date, and the count?
PS: What I am looking for is, for each week, a count of occurrences of every alphabet in that week.
Something like this:
datetime alphabet count
2016-12-27 22:57:45.407246 a 1
2016-12-30 22:57:45.407246 a 2
2017-01-02 22:57:45.407246 a 0
2016-12-27 22:57:45.407246 b 0
2016-12-30 22:57:45.407246 b 1
2017-01-02 22:57:45.407246 b 4
2016-12-27 22:57:45.407246 c 7
2016-12-30 22:57:45.407246 c 0
2017-01-02 22:57:45.407246 c 0
Here is the code:
import random
import datetime

import pandas as pd

def randchar(a, b):
    return chr(random.randint(ord(a), ord(b)))

# Create a datetime variable for today
base = datetime.datetime.today()
# Create a list of 365 days of datetime values
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
score_list = [randchar('a', 'h') for i in range(365)]
df = pd.DataFrame()
# Create a column from the datetime variable
df['datetime'] = date_list
# Convert that column into a datetime dtype
df['datetime'] = pd.to_datetime(df['datetime'])
# Set the datetime column as the index
df.index = df['datetime']
# Create a column from the score variable
df['score'] = score_list
df_s = df.groupby('score').resample('1w').count()

You can apply a groupby + count with two grouping keys:
a pd.Grouper with a weekly frequency
the score column
Finally, unstack the result.
df = df.groupby([pd.Grouper(freq='1w'), 'score']).count().unstack(fill_value=0)
df.head()
datetime
score a b c d e f g h
datetime
2016-12-25 0 0 1 1 0 1 0 1
2017-01-01 1 0 0 1 3 0 2 0
2017-01-08 0 3 1 1 1 0 0 1
2017-01-15 1 2 0 2 0 0 1 1
2017-01-22 0 1 2 1 1 2 0 0
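To get the three-column long format the question asked for, here is a hedged follow-up sketch (not part of the original answer): stack the score level back into rows.
# Stack the score column level back into the index, then flatten;
# the counted column is named 'datetime', so rename it to 'count'
long_df = df.stack().rename(columns={'datetime': 'count'}).reset_index()
# long_df now has the columns: datetime, score, count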

Related

Select rows based on column condition and date time

From the picture below (a table in the original post), we see that serial C failed on 3rd January and A failed on 5th January, within a 6-day period. I am interested in taking samples from the 3 days before the failure of each serial number.
My code:
import pandas as pd
import numpy as np
import datetime
from datetime import date, timedelta

df = pd.read_csv('https://gist.githubusercontent.com/JishanAhmed2019/e464ca4da5c871428ca9ed9264467aa0/raw/da3921c1953fefbc66dddc3ce238dac53142dba8/failure.csv', sep='\t')
df['date'] = pd.to_datetime(df['date'])
#df.drop(columns=df.columns[0], axis=1, inplace=True)
df = df.sort_values(by='date')
d = datetime.timedelta(days=3)
# Earliest failure date per serial number, shifted back 3 days
df_fail_date = df[df['failure'] == 1].groupby(['serial_number'])['date'].min()
df_fail_date = df_fail_date - d
df_fail_date
I was not able to move further to sample my data. I am interested in getting the following data, that is, the 3 days before the failure. Serial C had only 1 day available before its failure, so I want to keep that one as well. It would be nice to add a duration column counting the days before the failure occurred. I appreciate your suggestions. Thanks!
Expected output dataframe:
You can use a groupby.rolling to get the dates/serials to keep, then merge to select:
df['date'] = pd.to_datetime(df['date'])

N = 3
m = (df.sort_values(by='date')
       .loc[::-1]
       .groupby('serial_number', group_keys=False)
       .rolling(f'{N+1}d', on='date')
       ['failure'].max().eq(1)
       .iloc[::-1]
     )
out = df.merge(m[m], left_on=['serial_number', 'date'],
               right_index=True, how='right')
Output:
date serial_number failure_x smart_5_raw smart_187_raw failure_y
2 2014-01-01 C 0 0 80 True
8 2014-01-02 C 0 0 200 True
4 2014-01-03 C 1 0 120 True
7 2014-01-02 A 0 0 180 True
5 2014-01-03 A 0 0 140 True
9 2014-01-04 A 0 0 280 True
14 2014-01-05 A 1 0 400 True
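The question also asked for a duration column counting the days before the failure. As a hedged sketch on top of this answer (not part of the original), a cumulative count per serial does it, assuming the kept rows are sorted by date as above:
# Number each kept row within its serial, oldest first
out = out.sort_values(['serial_number', 'date'])
out['duration'] = out.groupby('serial_number').cumcount() + 1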
Another possible solution:
N = 4
df['date'] = pd.to_datetime(df['date'])

(df[df.groupby('serial_number')['failure'].transform(sum) == 1]
   .sort_values(by=['serial_number', 'date'])
   .groupby('serial_number')
   .apply(lambda g: g.assign(
       # duration runs up to min(N, len(g)); in groups longer than N,
       # the oldest rows get non-positive values and are dropped below
       duration=1 + np.arange(min(0, min(N, len(g)) - len(g)), min(N, len(g)))))
   .loc[lambda x: x['duration'] > 0]
   .reset_index(drop=True))
Output:
date serial_number failure smart_5_raw smart_187_raw duration
0 2014-01-02 A 0 0 180 1
1 2014-01-03 A 0 0 140 2
2 2014-01-04 A 0 0 280 3
3 2014-01-05 A 1 0 400 4
4 2014-01-01 C 0 0 80 1
5 2014-01-02 C 0 0 200 2
6 2014-01-03 C 1 0 120 3

Count cases by date and save them in a new dataframe

In one data frame (called X) I have the Patient_admitted_id, Date, and Hospital_ID of patients who tested positive for covid (I show this data frame below). I want to generate a separate data frame (called Y) with calendar Dates, the total number of covid Cases, and the cumulative cases.
I don't know how to generate the Cases column.
X data frame:
data = {'Patient_admitted_id': [214321, 224323, 3234234, 23423],
        # Just an example; the real X data frame contains proper
        # datetime values
        'Date': ['2021-01-22', '2021-01-22', '2021-01-22', '2021-01-20'],
        'Hospital_ID': ['1', '2', '3', '2'],
        }
X = pd.DataFrame(data, columns=['Patient_admitted_id', 'Date', 'Hospital_ID'])
X
Patient_admitted_id Date Hospital_ID
0 214321 2021-01-22 1
1 224323 2021-01-22 2
2 3234234 2021-01-22 3
3 23423 2021-01-20 2
...
Desirable Y data frame:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
...
Use DataFrame.resample by day, count with Resampler.size, and add Series.cumsum for the cumulative counts:
X['Date'] = pd.to_datetime(X['Date'])
df = X.resample('D', on='Date').size().reset_index(name='Cases')
df['Cumulative'] = df['Cases'].cumsum()
print (df)
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
You can use groupby on the Date column and call size to get the count for each date, then call cumsum on Cases to get the desired output. Note that, unlike resample, groupby will not fill in missing calendar days (2021-01-21 is absent below):
out = X.groupby('Date').size().to_frame('Cases').reset_index()
out['Cumulative'] = out['Cases'].cumsum()
The out variable holds the desired dataframe.
OUTPUT:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-22 3 4
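To also list the missing days with zero cases, here is a hedged sketch (assuming X['Date'] has already been converted with pd.to_datetime):
# Fill in the missing calendar days after the groupby
out = X.groupby('Date').size().to_frame('Cases')
full_range = pd.date_range(out.index.min(), out.index.max(), freq='D')
out = out.reindex(full_range, fill_value=0).rename_axis('Date').reset_index()
out['Cumulative'] = out['Cases'].cumsum()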
Just adding a solution with pd.Grouper:
X['Date'] = pd.to_datetime(X['Date'])
df = X.groupby(pd.Grouper(key='Date', freq='D')).size().reset_index(name='Cases')
df['Cumulative'] = df.Cases.cumsum()
df
Output
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4

Find missing days and grouping

I have a dataframe that looks something like this
dt user
0 2016-01-01 a
1 2016-01-02 a
2 2016-01-03 a
3 2016-01-04 a
4 2016-01-05 a
5 2016-01-06 a
6 2016-01-01 b
7 2016-01-02 b
8 2016-01-03 b
9 2016-01-04 b
10 2016-01-05 b
11 2016-01-06 b
12 2016-01-07 b
13 2015-12-31 c
14 2016-01-01 c
15 2016-01-02 c
16 2016-01-03 c
17 2016-01-04 c
18 2016-01-05 c
19 2016-01-06 c
20 2016-01-07 c
21 2016-01-08 c
22 2016-01-09 c
23 2016-01-10 c
I want to find the missing dates for each user. For the date ranges, the minimum date is 2015-12-31 and the maximum date is 2016-01-10. The result would look like this:
user missing_days
a 5
b 4
c 0
Use isin to check the date range against each user's group of dates, and sum the negated boolean mask within each group:
df['dt'] = pd.to_datetime(df['dt']) # if the `dt` column is already datetime dtype, skip this
check_dates = pd.date_range('2015-12-31', '2016-01-10', freq='D')
s = df.groupby('user')['dt'].agg(lambda x: (~check_dates.isin(x)).sum())
Out[920]:
user
a 5
b 4
c 0
Name: dt, dtype: int64
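To match the two-column frame shown in the question, here is a small hedged follow-up (not part of the original answer):
result = s.rename('missing_days').reset_index()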
### Convert your dates to datetime
df['dt'] = pd.to_datetime(df['dt'], infer_datetime_format=True)
### Create the list of dates per user
user_days = df.groupby('user')['dt'].apply(list)
### Initialize the final dataframe
df_miss_dates = pd.DataFrame(user_days)
all_dates = pd.date_range('2015-12-31', '2016-01-10', freq='D')
### Find the number of missing dates per user
df_miss_dates['missing_days'] = df_miss_dates['dt'].apply(lambda x: len(set(all_dates) - set(x)))
df_miss_dates.drop(columns='dt', inplace=True)
print(df_miss_dates)
Output:
missing_days
user
a 5
b 4
c 0
You can do it this way:
from datetime import date, timedelta

sdate = date(2015, 12, 31)  # start date
edate = date(2016, 1, 10)   # end date
delta = edate - sdate       # as timedelta

days = []
for i in range(delta.days + 1):
    day = sdate + timedelta(days=i)
    days.append(str(day))

user = []
missing_days = []
for user_n in df.user.unique():
    user_days = df.loc[df.user == user_n, 'dt'].to_list()
    md = len([day for day in days if day not in user_days])
    user.append(user_n)
    missing_days.append(md)

new_df = pd.DataFrame({'user': user, 'missing_days': missing_days})
new_df
Output:
user missing_days
a 5
b 4
c 0
Define the following function:
def missingDates(grp: pd.Series, d1: pd.Timestamp, d2: pd.Timestamp):
    ndTotal = (d2 - d1).days + 1
    ndPresent = grp[grp.between(d1, d2)].index.size
    return ndTotal - ndPresent

Then apply it to each group and convert the result into a DataFrame (as I see from your post, you want just a DataFrame with 2 columns):
result = df.groupby('user')['dt'].apply(missingDates,
    pd.to_datetime('2015-12-31'), pd.to_datetime('2016-01-10'))\
    .rename('missing_days').reset_index()
The result is:
user missing_days
0 a 5
1 b 4
2 c 0
My solution relies on the fact that the dates within each group are unique and carry no time part. If these conditions were not met, the dates would have to be normalized and deduplicated (unique) first.
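A hedged sketch of that normalization (not in the original answer), as a variant of the function above:
def missingDates(grp: pd.Series, d1: pd.Timestamp, d2: pd.Timestamp):
    ndTotal = (d2 - d1).days + 1
    # Assumption: dt may contain time parts or repeated days,
    # so drop both before counting
    days = grp.dt.normalize().drop_duplicates()
    return ndTotal - days[days.between(d1, d2)].size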
Additional remark: change dt (the column name) to some other name, because dt is the name of the datetime accessor in pandas. It is bad practice to shadow standard pandas names with column or variable names.

How to extract features using date range within a month?

I would like to extract features from a datetime column: for example, for days 1 to 10 of the month, the output stored under a column called early_month should be 1, and 0 otherwise.
The following question I posted earlier gave me a solution using indexer_between_time in order to use time ranges:
How to extract features using time range?
I am using the following code to extract days of the month from date.
df["date_of_month"] = df["purchase_date"].dt.day
Thank you.
It's not clear from your question, but if you are trying to create a column that contains a 1 if the day is between 1 and 10, or 0 otherwise, it's very simple:
df['early_month'] = df['date_of_month'].apply(lambda x: 1 if x <= 10 else 0)
df['mid_month'] = df['date_of_month'].apply(lambda x: 1 if x >= 11 and x <= 20 else 0)
If you are a Python beginner and would rather avoid lambda functions, you can achieve the same result by defining a named function and applying it:
def create_date_features(day, min_day, max_day):
    if day >= min_day and day <= max_day:
        return 1
    else:
        return 0

df['early_month'] = df['date_of_month'].apply(create_date_features, min_day=1, max_day=10)
df['mid_month'] = df['date_of_month'].apply(create_date_features, min_day=11, max_day=20)
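A compact alternative (an assumption of mine, not from either answer) is to bin the day once with pd.cut and one-hot encode the result:
# Hypothetical variant: label each day as early/mid/late in one pass
bins = pd.cut(df['date_of_month'], bins=[0, 10, 20, 31],
              labels=['early_month', 'mid_month', 'late_month'])
df = df.join(pd.get_dummies(bins).astype(int))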
I believe you need to convert the boolean mask to integers - True values are treated as 1s:
rng = pd.date_range('2017-04-03', periods=10, freq='17D')
df = pd.DataFrame({'purchase_date': rng, 'a': range(10)})
m2 = df["purchase_date"].dt.day <= 10
df['early_month'] = m2.astype(int)
print (df)
purchase_date a early_month
0 2017-04-03 0 1
1 2017-04-20 1 0
2 2017-05-07 2 1
3 2017-05-24 3 0
4 2017-06-10 4 1
5 2017-06-27 5 0
6 2017-07-14 6 0
7 2017-07-31 7 0
8 2017-08-17 8 0
9 2017-09-03 9 1
Detail:
print (df["purchase_date"].dt.day <= 10)
0 True
1 False
2 True
3 False
4 True
5 False
6 False
7 False
8 False
9 True
Name: purchase_date, dtype: bool
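As a hedged extension of the same vectorized pattern (not in the original answer), Series.between covers the mid_month feature from the first answer:
df['mid_month'] = df['purchase_date'].dt.day.between(11, 20).astype(int)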
Maybe you need this one:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'time': ['11.07.2018', '12.07.2018', '13.07.2018', '14.07.2018', '15.07.2018']})
df.time = pd.to_datetime(df.time, format='%d.%m.%Y')
df[df.time > datetime(2018, 7, 13)]  # if you need to filter by date
df[df.time.dt.day > 13]              # if you need to filter by day of the month

Count number of rows for each ID within 1 year

I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year, and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations, but there must be a better way!
I think you need between with boolean indexing to filter first, and then groupby with size to aggregate.
The outputs are concatenated, and reindex adds the missing rows, filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
#convert to datetime (if the first number is the day, add the dayfirst parameter)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()  # pd.datetime is deprecated; pd.Timestamp works the same here
print (now)
oneyearbeforenow = now - pd.offsets.DateOffset(years=1)
oneyearafternow = now + pd.offsets.DateOffset(years=1)
#first filter
a = df[df['Date'].between(oneyearbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyearafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
EDIT:
If you need to compare each date against a per-group reference date (here the group's last date) plus or minus a one-year offset, you need a custom function that sums the True values of each comparison:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(),
                         (x < x.iat[-1] + offs).sum()], index=['last', 'next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
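The question can also be read as wanting, for every row, a count relative to that row's own date. Here is a hedged sketch for the "last year" side (the "next year" side would mirror it on a reversed sort), assuming the dates are already parsed:
# Count, per row, how many other entries of the same ID fall within
# the 365 days up to and including that row's date
df = df.sort_values(['ID', 'Date'])
counts = (df.assign(one=1)
            .groupby('ID')
            .rolling('365D', on='Date')['one']
            .sum())
df['last_year'] = counts.to_numpy().astype(int) - 1  # exclude the row itself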
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of the missing in-between years; note year 2015 in the output below.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only year in columns
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
