Count cases by dates and save it in a new dataframe - python

In one data frame (called X) I have the Patient_admitted_id, Date, and Hospital_ID of patients who tested positive for covid (I show this data frame below). I want to generate a separate data frame (called Y) with calendar dates, the total number of covid cases per day, and the cumulative cases.
I don't know how to generate the Cases column.
X data frame:
import pandas as pd

data = {'Patient_admitted_id': [214321, 224323, 3234234, 23423],
        'Date': ['2021-01-22', '2021-01-22', '2021-01-22', '2021-01-20'],  # example only; the real X contains proper date values generated with datetime
        'Hospital_ID': ['1', '2', '3', '2'],
        }
X = pd.DataFrame(data, columns=['Patient_admitted_id', 'Date', 'Hospital_ID'])
X
Patient_admitted_id Date Hospital_ID
0 214321 2021-01-22 1
1 224323 2021-01-22 2
2 3234234 2021-01-22 3
3 23423 2021-01-20 2
...
Desired Y data frame:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
...

Use DataFrame.resample by day, count the rows with Resampler.size, and then use Series.cumsum for the cumulative counts:
X['Date']= pd.to_datetime(X['Date'])
df = X.resample('D', on='Date').size().reset_index(name='Cases')
df['Cumulative'] = df['Cases'].cumsum()
print (df)
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4

You can use groupby on the Date column and call size to get the count for each individual date, then simply call cumsum on Cases to get the desired output. Note that, unlike resample, this does not create rows for days with zero cases (2021-01-21 is missing from the output); one way to fill those gaps is sketched after the output below.
out = X.groupby('Date').size().to_frame('Cases').reset_index()
out['Cumulative'] = out['Cases'].cumsum()
The out variable holds the resulting dataframe.
OUTPUT:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-22 3 4
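
Since the question asks for every calendar date, here is a minimal sketch of filling in the gap days after the groupby; it assumes Date has first been converted with pd.to_datetime, as in the resample answer:
import pandas as pd

# Sketch: reindex against a full daily range so days with no cases get 0.
X['Date'] = pd.to_datetime(X['Date'])
out = X.groupby('Date').size().to_frame('Cases')
full_days = pd.date_range(out.index.min(), out.index.max(), freq='D')
out = out.reindex(full_days, fill_value=0)   # 2021-01-21 appears with Cases == 0
out['Cumulative'] = out['Cases'].cumsum()    # take the cumsum only after filling the gaps
out = out.rename_axis('Date').reset_index()
print(out)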

Just adding a solution with pd.Grouper
X['Date']= pd.to_datetime(X['Date'])
df = X.groupby(pd.Grouper(key='Date', freq='D')).size().reset_index(name='Cases')
df['Cumulative'] = df.Cases.cumsum()
df
Output
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4

Related

pandas calculate date difference in a new column referencing previous cell

I have a pandas DataFrame that contains dates in a column Date. I need to add another column Days that contains the difference in days from the previous row, so the date in the i-th row is differenced against the one in row i-1. The first difference should be 0.
Date Days
08-01-1997 0
09-01-1997 1
10-01-1997 1
13-01-1997 3
14-01-1997 1
15-01-1997 1
01-03-1997 45
03-03-1997 2
04-03-1997 1
05-03-1997 1
13-06-1997 100
I tried a few approaches, but they were not useful.
First convert the Date column to a pandas datetime, then calculate the difference, which is a timedelta object; from there, take the days via Series.dt and assign 0 to the first value:
>>> df['Date']=pd.to_datetime(df['Date'], dayfirst=True)
>>> df['Days']=(df['Date']-df['Date'].shift()).dt.days.fillna(0).astype(int)
OUTPUT
df
Date Days
0 1997-01-08 0
1 1997-01-09 1
2 1997-01-10 1
3 1997-01-13 3
4 1997-01-14 1
5 1997-01-15 1
6 1997-03-01 45
7 1997-03-03 2
8 1997-03-04 1
9 1997-03-05 1
10 1997-06-13 100
You can use diff as well:
df['date_up'] = pd.to_datetime(df['Date'],dayfirst=True)
df['date_diff'] = df['date_up'].diff()
df['date_diff_num_days'] = df['date_diff'].dt.days.fillna(0).astype(int)
df.head()
Date Days date_up date_diff date_diff_num_days
0 08-01-1997 0 1997-01-08 NaT 0
1 09-01-1997 1 1997-01-09 1 days 1
2 10-01-1997 1 1997-01-10 1 days 1
3 13-01-1997 3 1997-01-13 3 days 3
4 14-01-1997 1 1997-01-14 1 days 1
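
If the helper columns are not needed, the same diff-based idea collapses into a single assignment; a minimal sketch writing straight into Days:
df['Days'] = pd.to_datetime(df['Date'], dayfirst=True).diff().dt.days.fillna(0).astype(int)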

Just keep the first value of every minute in pandas dataframe

I want to reduce my data. My initial dataframe looks as follows:
index  time [hh:mm:ss]           value1  value2
0      0 days 00:00:00.000000    3       4
1      0 days 00:00:04.000000    5       2
2      0 days 00:02:02.002300    7       9
3      0 days 00:02:03.000000    9       7
4      0 days 03:02:03.000000    4       3
Now I want to reduce the data so that only the first row of every new minute (and likewise every new hour and day) is kept; all remaining rows within that minute should be dropped.
So the resulting table looks as follows:
index  time                      value1  value2
0      0 days 00:00:00.000000    3       4
2      0 days 00:02:02.002300    7       9
4      0 days 03:02:03.000000    4       3
Any ideas how to approach this?
The values are timedeltas, so it is possible to create a TimedeltaIndex and use DataFrame.resample by 1 minute with Resampler.first. Resampling adds a row for every minute in between, so drop the all-NaN rows afterwards:
df.index = pd.to_timedelta(df['time [hh:mm:ss]'])
df = df.resample('1Min').first().dropna(how='all').reset_index(drop=True)
print (df)
time [hh:mm:ss] value1 value2
0 0 days 00:00:00.000000 3.0 4.0
1 0 days 00:02:02.002300 7.0 9.0
2 0 days 03:02:03.000000 4.0 3.0
You could extract the day, hour, and minute using apply and multiple splits, and then drop the duplicates, keeping the first value:
dms = df['time [hh:mm:ss]'].apply(lambda x: ':'.join( [x.split(' days ')[0], *x.split('days ')[1].split(':')[:2]]) )
df.iloc[dms.drop_duplicates().index]
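
If the time column is already a timedelta (as the first answer assumes), a shorter variant of the same keep-first idea is to floor it to the minute and drop the duplicates; this is a sketch under that assumption, and unlike grouping on hour and minute alone it also keeps equal minutes on different days apart:
import pandas as pd

# Sketch: floor each timedelta to its minute; duplicated() marks every row
# after the first one within the same minute, so ~ keeps only the first.
minute = pd.to_timedelta(df['time [hh:mm:ss]']).dt.floor('min')
print(df[~minute.duplicated()])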
from io import StringIO
import pandas as pd

d = '''index,time,value1,value2
0,0 days 00:00:00.000000,3,4
1,0 days 00:00:04.000000,5,2
2,0 days 00:02:02.002300,7,9
3,0 days 00:02:03.000000,9,7
4,0 days 03:02:03.000000,4,3'''
df = pd.read_csv(StringIO(d), parse_dates=True)
df
df['time1'] = pd.to_datetime(df['time'].str.slice(7))
df.set_index('time1',inplace=True)
df
df.groupby([df.index.hour,df.index.minute]).head(1).sort_index().reset_index(drop=True)

How to group and aggregate data starting from constant and ending on changing date? [duplicate]

I need to aggregate data between a constant date, like the first day of the year, and all the other dates through the year. There are two variants of this problem:
easier - sum:
created_at value
01-01-2012 5
02-01-2012 6
05-01-2012 1
05-01-2012 1
01-02-2012 3
02-02-2012 2
05-02-2012 1
which should output:
Date Month to date sum Year to date sum
01-01-2012 5 5
02-01-2012 11 11
05-01-2012 13 13
01-02-2012 3 14
02-02-2012 5 15
05-02-2012 6 16
and harder - count unique:
created_at value
01-01-2012 a
02-01-2012 b
05-01-2012 c
05-01-2012 c
01-02-2012 a
02-02-2012 a
05-02-2012 d
which should output:
Date Month to date unique Year to date unique
01-01-2012 1 1
02-01-2012 2 2
05-01-2012 3 3
01-02-2012 1 3
02-02-2012 1 3
05-02-2012 2 4
The data is, of course, in a pandas DataFrame. The obvious but very clumsy way is to write a for loop between the starting date and the moving one. The problem looks like a popular one. Is there some reasonable pandas built-in way for this type of computation? Regarding counting unique values, I also want to avoid stacking lists, as I have a large number of rows and unique values.
I was checking out pandas window functions, but they don't look like a solution.
Try with groupby:
Cumulative sum:
df["created_at"] = pd.to_datetime(df["created_at"], format="%d-%m-%Y")
df["Month to date sum"] = df.groupby(df["created_at"].dt.month)["value"].transform('cumsum')
df["Year to date sum"] = df.groupby(df["created_at"].dt.year)["value"].transform('cumsum')
>>> df
created_at value Month to date sum Year to date sum
0 2012-01-01 5 5 5
1 2012-01-02 6 11 11
2 2012-01-05 1 12 12
3 2012-02-01 3 3 15
4 2012-02-02 2 5 17
5 2012-02-05 1 6 18
Cumulative unique count:
df2["created_at"] = pd.to_datetime(df2["created_at"], format="%d-%m-%Y")
df2["Month to date unique"] = df2.groupby(df2["created_at"].dt.month)["value"].apply(lambda x: (~x.duplicated()).cumsum())
df2["Year to date unique"] = df2.groupby(df2["created_at"].dt.year)["value"].apply(lambda x: (~x.duplicated()).cumsum())
>>> df2
created_at value Month to date unique Year to date unique
0 2012-01-01 a 1 1
1 2012-01-02 b 2 2
2 2012-01-05 c 3 3
3 2012-02-01 a 1 3
4 2012-02-02 a 1 3
5 2012-02-05 d 2 4
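
One caveat on both snippets: grouping on dt.month alone would merge, say, January 2012 with January 2013 if the data ever spanned more than one year. A sketch of a month-to-date sum that keys on both year and month:
# Sketch: month-to-date sum that stays correct across multiple years.
keys = [df["created_at"].dt.year, df["created_at"].dt.month]
df["Month to date sum"] = df.groupby(keys)["value"].cumsum()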

How to add a column with conditions on another Dataframe?

Motivation: I want to check whether customers have bought anything during the 2 months since their first purchase (retention).
Resources: I have 2 tables:
buy date, ID, and purchase code
ID and the first day the customer bought
Sample data:
Table1
Date ID Purchase_code
2019-01-01 1 AQT1
2019-01-02 1 TRR1
2019-03-01 1 QTD1
2019-02-01 2 IGJ5
2019-02-05 2 ILW2
2019-02-20 2 WET2
2019-02-28 2 POY6
Table 2
ID First_Buy_Date
1 2019-01-01
2 2019-02-01
The expected result:
ID First_Buy_Date Retention Frequency_buy_at_first_month
1 2019-01-01 1 2
2 2019-02-01 0 4
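
For reference, the two tables can be reconstructed as the df1 and df2 used in the answer below; a minimal sketch:
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2019-01-01', '2019-01-02', '2019-03-01', '2019-02-01',
             '2019-02-05', '2019-02-20', '2019-02-28'],
    'ID': [1, 1, 1, 2, 2, 2, 2],
    'Purchase_code': ['AQT1', 'TRR1', 'QTD1', 'IGJ5', 'ILW2', 'WET2', 'POY6'],
})
df2 = pd.DataFrame({'ID': [1, 2],
                    'First_Buy_Date': ['2019-01-01', '2019-02-01']})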
First convert the columns to datetimes if necessary, then add the first-purchase dates with DataFrame.merge, and create the new columns by comparing with Series.le or Series.gt and converting to integers:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['First_Buy_Date'] = pd.to_datetime(df2['First_Buy_Date'])
df = df1.merge(df2, on='ID', how='left')
df['Retention'] = (df['First_Buy_Date'].add(pd.DateOffset(months=2))
                     .le(df['Date'])
                     .astype(int))
df['Frequency_buy_at_first_month'] = (df['First_Buy_Date'].add(pd.DateOffset(months=1))
                                        .gt(df['Date'])
                                        .astype(int))
Last, aggregate by GroupBy.agg with max (since only a 0 or 1 output is needed) and sum to count the values:
df1 = (df.groupby(['ID', 'First_Buy_Date'], as_index=False)
         .agg({'Retention': 'max', 'Frequency_buy_at_first_month': 'sum'}))
print (df1)
ID First_Buy_Date Retention Frequency_buy_at_first_month
0 1 2019-01-01 1 2
1 2 2019-02-01 0 4

Python - Appending row number with reverse counter

I have yet another Python question. This one can probably be achieved with the help of a loop; however, I was looking for a leaner solution.
Suppose that I have a data frame like the one built in the Setup below:
I am looking for code to generate the column ID, which is no more than a descending counter for when the value in column Sold changes; i.e., for each salesman I would like the ID column to give the number of days left until the sold value changes.
For example, on date 01/01/2018, salesman Joe would have ID = 2 because the signal changes in 2 days.
Any ideas on how to solve this one?
Many thanks.
J
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame([
    pd.Series(pd.date_range('1/1/2018', '1/7/2018').append(pd.date_range('1/1/2018', '1/7/2018'))),
    pd.Series(['Joe']*7 + ['Helen']*7),
    pd.Series([1,1,0,0,0,0,1,0,1,1,0,1,0,0]),
]).T
df.columns = ['date', 'salesman', 'sold']
df['date'] = pd.to_datetime(df['date'])
Computation:
df['changes'] = df.groupby('salesman')['sold'].expanding().apply(lambda x: (np.diff(x) != 0).sum()).reset_index(drop = True)
df['id'] = df.groupby(['salesman', 'changes']).apply(lambda grp: pd.Series(len(grp) - grp.sort_values('date').reset_index().index)).reset_index(drop = True)
df.drop('changes', axis = 1, inplace = True)
Results:
>>> df
date salesman sold id
0 2018-01-01 Joe 1 2
1 2018-01-02 Joe 1 1
2 2018-01-03 Joe 0 4
3 2018-01-04 Joe 0 3
4 2018-01-05 Joe 0 2
5 2018-01-06 Joe 0 1
6 2018-01-07 Joe 1 1
7 2018-01-01 Helen 0 1
8 2018-01-02 Helen 1 2
9 2018-01-03 Helen 1 1
10 2018-01-04 Helen 0 1
11 2018-01-05 Helen 1 1
12 2018-01-06 Helen 0 2
13 2018-01-07 Helen 0 1
Explanation:
Create a 'changes' column that increments every time an individual salesperson's 'sold' field changes. Then, for each increment group (still grouped by salesperson), get the length of the group (which equals how many subsequent rows of that value there are) and subtract from it the index of each row, sorted by date. The result of that subtraction is a series that descends from the length of the group to 1. Reset the index and merge back to your original dataframe. It's a bit of a confusing solution, but it should work.
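
Assuming the rows are already in date order within each salesman (as they are in the Setup), a more vectorized sketch of the same idea builds a run id with shift/cumsum and counts backwards inside each run with cumcount:
# Sketch: the run id increments whenever 'sold' changes within a salesman;
# cumcount(ascending=False) then counts down to 1 inside each run.
run = (df['sold'] != df.groupby('salesman')['sold'].shift()).cumsum()
df['id'] = df.groupby(['salesman', run]).cumcount(ascending=False) + 1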
