Summary of data for each month - python

I have health diagnosis data for last year and I would like to get the count of diagnoses for each month. Here is my data:
import pandas as pd
cars2 = {'ID': [22, 100, 47, 35, 60],
         'Date': ['2020-04-11', '2021-04-12', '2020-05-13', '2020-05-14', '2020-06-15'],
         'diagnosis': ['bacteria sepsis', 'bacteria sepsis', 'Sepsis', 'Risk sepsis', 'Neonatal sepsis'],
         'outcome': ['alive', 'alive', 'dead', 'alive', 'dead']}
df2 = pd.DataFrame(cars2, columns=['ID', 'Date', 'diagnosis', 'outcome'])
print(df2)
How can I get diagnosis counts for each month? For example, how many diagnoses of bacteria sepsis we had in that month. The final result should be a table showing value counts of diagnosis for each month.

If you want to see results per month, you can use pivot_table.
df2.pivot_table(index=['outcome', 'diagnosis'],
                columns=pd.to_datetime(df2['Date']).dt.month,
                aggfunc='size', fill_value=0)
Date 4 5 6
outcome diagnosis
alive Risk sepsis 0 1 0
bacteria sepsis 2 0 0
dead Neonatal sepsis 0 0 1
Sepsis 0 1 0
4, 5 and 6 are the months in your dataset.
Try playing around with the parameters; you might land on a view that better suits your ideal result.
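If you need to keep April 2020 and April 2021 apart (.dt.month alone folds them into the same column), a minimal variation is to use year-month periods for the columns; an untested sketch along the same lines:
df2.pivot_table(index=['outcome', 'diagnosis'],
                columns=pd.to_datetime(df2['Date']).dt.to_period('M'),
                aggfunc='size', fill_value=0)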

I modified your dataframe by setting the Date column as index:
import pandas as pd
cars2 = {'ID': [22, 100, 47, 35, 60],
         'Date': ['2020-04-11', '2021-04-12', '2020-05-13', '2020-05-14', '2020-06-15'],
         'diagnosis': ['bacteria sepsis', 'bacteria sepsis', 'Sepsis', 'Risk sepsis', 'Neonatal sepsis'],
         'outcome': ['alive', 'alive', 'dead', 'alive', 'dead']}
df2 = pd.DataFrame(cars2, columns=['ID', 'Date', 'diagnosis', 'outcome'])
df2.index = pd.to_datetime(df2['Date'])  # <--- I set your Date column as the index (also convert it to datetime)
df2.drop('Date', inplace=True, axis=1)   # <--- Drop the Date column
print(df2)
Then, if you group the dataframe by a pd.Grouper and the columns you want to group by (diagnosis and outcome):
df2.groupby([pd.Grouper(freq='M'), 'diagnosis','outcome']).count()
Output:
ID
Date diagnosis outcome
2020-04-30 bacteria sepsis alive 1
2020-05-31 Risk sepsis alive 1
Sepsis dead 1
2020-06-30 Neonatal sepsis dead 1
2021-04-30 bacteria sepsis alive 1
Note: the freq='M' in pd.Grouper groups the dataframe by month. Read more about the freq attribute here
Edit: Assigning the grouped dataframe to new_df and resetting the other indices except Date:
new_df = df2.groupby([pd.Grouper(freq='M'), 'diagnosis','outcome']).count()
new_df.reset_index(level=[1,2],inplace=True)
Iterate over each month and get the table separately inside df_list:
import numpy as np  # needed for np.unique below

df_list = []  # <--- this will contain each separate table for each month
for month in np.unique(new_df.index):
    df_list += [pd.DataFrame(new_df.loc[[month]])]
df_list[0] # <-- get the first dataframe in df_list
will return:
diagnosis outcome ID
Date
2020-04-30 bacteria sepsis alive 1
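If what you ultimately want is a single month-by-diagnosis count table rather than a list of per-month frames, something like this rough sketch (assuming the same date-indexed df2 as above) should also work:
counts = df2.groupby([pd.Grouper(freq='M'), 'diagnosis']).size().unstack(fill_value=0)
print(counts)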

First you need to create a month variable through the to_datetime() function; then you can group by month and take value_counts() of diagnosis within each month:
import pandas as pd
df2['month'] = pd.to_datetime(df2['Date']).dt.month
df2.groupby('month').apply(lambda x: x['diagnosis'].value_counts())
month
4 bacteria sepsis 2
5 Risk sepsis 1
Sepsis 1
6 Neonatal sepsis 1
Name: diagnosis, dtype: int64
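The result above is a Series with a (month, diagnosis) MultiIndex; if you would rather see it as a table, unstacking it should work, assuming the month column created above:
df2.groupby('month')['diagnosis'].value_counts().unstack(fill_value=0)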

I think what you mean by "for each month" is not the month alone, but the year-month combination. As such, let's approach it as follows:
First, we create a 'year-month' column based on the Date column. Then use .groupby() on this new year-month column and get .value_counts() on the diagnosis column, as follows:
df2['year-month'] = pd.to_datetime(df2['Date']).dt.strftime("%Y-%m")
df2.groupby('year-month')['diagnosis'].value_counts().to_frame(name='Count').reset_index()
Result:
year-month diagnosis Count
0 2020-04 bacteria sepsis 1
1 2020-05 Risk sepsis 1
2 2020-05 Sepsis 1
3 2020-06 Neonatal sepsis 1
4 2021-04 bacteria sepsis 1
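For the same year-month counts in a wide layout (one column per diagnosis), pd.crosstab is a compact alternative; a minimal sketch:
pd.crosstab(df2['year-month'], df2['diagnosis'])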


How to subset and eliminate rows by condition

I have a dataset that has many columns: among them AMS card number, registration date, and first purchase date. The data has duplicates for a large number of AMS card numbers. The final dataset needs to be unique on card number. I need to keep the rows in the dataset corresponding to the latest registration date and earliest first purchase date and this is how I've done it. I'm pretty sure it works, but it's too slow, since the dataset has over 1 million rows. In the grand scheme of python and pandas this is not an exorbitant number, which is why I'm certain my algorithm is poor and needs to be rewritten. I'm new to Pandas and fairly new to Python.
amsset = set(df["AMS Card"]) #capture all unique AMS numbers for each in amsset:
samecarddf = df.loc[df["AMS Card"] == each] #put all rows of df with same ams numbers in samecarddf
lensamecarddf = len(samecarddf)
if lensamecarddf > 1: #if there is more than one row with the same ams number in samecarddf
latestreg = samecarddf['Customer Reg Date'].max() #find the latest registration date
samecarddf = samecarddf.loc[samecarddf['Customer Reg Date'] == latestreg] #keep the rows with the latest registration date
earliestpur = samecarddf['Customer First Purchase Date'].min() #find earliest first purchase date
samecarddf = samecarddf.loc[samecarddf["Customer First Purchase Date"] == earliestpur] #keep the rows with the earliest first purchase date
dffinal = dffinal.append(samecarddf).drop_duplicates() #put all rows with 1 ams or those with latest registration and earliest first purchase and drop any remaining duplicates
Here's a way to do what your question asks:
# Update df to contain only unique `AMS Card` values,
# and in case of duplicates, choose the row with the latest `Customer Reg Date` and
# (among duplicates thereof) the earliest `Customer First Purchase Date`.
dffinal = (df
           .sort_values(['AMS Card', 'Customer Reg Date', 'Customer First Purchase Date'],
                        ascending=[True, False, True])
           .drop_duplicates(['AMS Card'])
           .drop_duplicates(['AMS Card', 'Customer Reg Date']))
Sample input:
AMS Card Customer Reg Date Customer First Purchase Date some_data
0 1 2020-01-01 2021-01-01 1
1 2 2020-01-01 2021-02-01 2
2 2 2020-01-01 2021-03-01 3
3 3 2020-01-01 2021-04-01 4
4 3 2020-02-01 2021-05-01 5
5 3 2020-02-01 2021-06-01 6
Output:
AMS Card Customer Reg Date Customer First Purchase Date some_data
0 1 2020-01-01 2021-01-01 1
1 2 2020-01-01 2021-02-01 2
4 3 2020-02-01 2021-05-01 5
As an alternative, the sorting can be split into two parts (so that we sort on Customer First Purchase Date only after removing duplicates of Customer Reg Date):
dffinal = (df
           .sort_values(['AMS Card', 'Customer Reg Date'], ascending=[True, False])
           .drop_duplicates(['AMS Card'])
           .sort_values(['AMS Card', 'Customer First Purchase Date'], ascending=[True, True])
           .drop_duplicates(['AMS Card', 'Customer Reg Date']))
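A vectorized groupby/transform version of the same rule (latest registration per card, then earliest first purchase) could look roughly like this untested sketch, assuming the column names from the question and dates that compare correctly (datetimes or ISO-formatted strings):
latest_reg = df.groupby('AMS Card')['Customer Reg Date'].transform('max')
kept = df[df['Customer Reg Date'] == latest_reg]
earliest_pur = kept.groupby('AMS Card')['Customer First Purchase Date'].transform('min')
dffinal = (kept[kept['Customer First Purchase Date'] == earliest_pur]
           .drop_duplicates('AMS Card'))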

Pandas subtract above row

Basically this is the challenge I have:
I have a data set with a time range and IDs, and what I need to do is find whether an ID is duplicated within a date range.
123 transaction 1/1/2021
345 transaction 1/1/2021
123 transaction 1/2/2021
123 transaction 1/20/2021
Here I want to return 1 for ID 123 because the duplicate transaction falls within a range of 7 days.
I can do this with Excel, where I added some more date ranges depending on the day (for example Wednesday has a range of up to 6 days, Thursday 5 days, Friday 4 days), but I have no idea how to accomplish this with pandas.
The reason I want to do this with pandas is that each data set has up to 1M rows, which takes forever in Excel; on top of that I need to split by category, and it's just a pain to do all that manual work.
Are there any recommendations or ideas on how to accomplish this task?
The df:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(
"""id,trans_date
123,1/1/2021
345,1/1/2021
123,1/2/2021
123,1/20/2021
345,1/3/2021
"""
))  # added extra record for demo
df
df
id trans_date
0 123 1/1/2021
1 345 1/1/2021
2 123 1/2/2021
3 123 1/20/2021
4 345 1/3/2021
df['trans_date'] = pd.to_datetime(df['trans_date'])
As you have to look at each of the ids separately, you can group by id and then get the maximum and minimum dates; if the difference is greater than 7 days, the result is True, otherwise False.
result = df.groupby('id')['trans_date'].apply(
    lambda x: True if (x.max() - x.min()).days > 7 else False)
result
id
123 True
345 False
Name: trans_date, dtype: bool
If you just need the required ids, then
result.index[result].values
array([123])
The context and data you've provided about your situation are scanty, but you can probably do something like this:
>>> df
id type date
0 123 transaction 2021-01-01
1 345 transaction 2021-01-01
2 123 transaction 2021-01-02
3 123 transaction 2021-01-20
>>> dupes = df.groupby(pd.Grouper(key='date', freq='W'))['id'].apply(pd.Series.duplicated)
>>> dupes
0 False
1 False
2 True
3 False
Name: id, dtype: bool
There, item 2 (the third item) is True because 123 already occurred in the past week.
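If you then want one flag per id (1 when any of its rows was a within-week duplicate), you could roll the row-level flags up, roughly like:
>>> df.assign(dup=dupes).groupby('id')['dup'].any().astype(int)
which should give 1 for 123 and 0 for 345.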
As far as I can understand the question, I think this is what you need.
from datetime import datetime
import pandas as pd

df = pd.DataFrame({
    "id": [123, 345, 123, 123],
    "name": ["transaction", "transaction", "transaction", "transaction"],
    "date": ["01/01/2021", "01/01/2021", "01/02/2021", "01/10/2021"]
})

def dates_in_range(dates):
    num_days_frame = 6
    processed_dates = sorted([datetime.strptime(date, "%m/%d/%Y") for date in dates])
    difference_in_range = any(
        abs(processed_dates[i] - processed_dates[i - 1]).days < num_days_frame
        for i in range(1, len(processed_dates))
    )
    return difference_in_range and 1 or 0

group = df.groupby("id")
df_new = group.apply(lambda x: dates_in_range(x["date"]))
print(df_new)
"""
print(df_new)
id
123 1
345 0
"""
Here you first group by id so that all dates for a particular id end up in the same group.
A function is then applied to each group's dates: they are sorted first, and then consecutive items are checked against the defined range. The sorting makes sure that comparing consecutive dates actually captures the closest pairs.
Finally, if any pair of consecutive sorted dates is less than num_days_frame (6) days apart, we return 1, else 0.
All that being said, this might not be as performant, since each group's dates are sorted separately. One way to avoid that is to sort the entire df first and then apply the group operation, which ensures the dates arrive already sorted, as sketched below.
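A rough, untested sketch of that sort-first variant, using the column names from above:
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df_sorted = df.sort_values(['id', 'date'])
df_new = (df_sorted.groupby('id')['date']
          .apply(lambda d: int((d.diff().dt.days < 6).any())))
print(df_new)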

Shift time in multi-index to merge

I want to merge two datasets that are indexed by time and id. The problem is, the time is slightly different in each dataset. In one dataset, the time (Monthly) is mid-month, so the 15th of every month. In the other dataset, it is the last business day. This should still be a one-to-one match, but the dates are not exactly the same.
My approach is to shift mid-month dates to business day end-of-month dates.
Data:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BMonthEnd

dt = pd.date_range('1/1/2011', '12/31/2011', freq='D')
dt = dt[dt.day == 15]
lst = [1, 2, 3]
idx = pd.MultiIndex.from_product([dt, lst], names=['date', 'id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)
df.head()
output:
0
date id
2011-01-15 1 -0.598584
2 -0.484455
3 -2.044912
2011-02-15 1 -0.017512
2 0.852843
This is what I want (I removed the performance warning):
In[83]:df.index.levels[0] + BMonthEnd()
Out[83]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
'2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
dtype='datetime64[ns]', freq='BM')
However, indexes are immutable, so this does not work:
In: df.index.levels[0] = df.index.levels[0] + BMonthEnd()
TypeError: 'FrozenList' does not support mutable operations.
The only solution I've got is to reset_index(), change the dates, then set_index() again:
df.reset_index(inplace=True)
df['date'] = df['date'] + BMonthEnd()
df.set_index(['date','id'], inplace=True)
This gives what I want, but is this the best way? Is there a set_level_values() function (I didn't see it in the API)?
Or maybe I'm taking the wrong approach to the merge. I could merge the dataset with keys df.index.get_level_values(0).year, df.index.get_level_values(0).month and id but this doesn't seem much better.
You can use set_levels in order to set multiindex levels:
df.index.set_levels(df.index.levels[0] + pd.tseries.offsets.BMonthEnd(),
                    level='date', inplace=True)
>>> df.head()
0
date id
2011-01-31 1 -1.410646
2 0.642618
3 -0.537930
2011-02-28 1 -0.418943
2 0.983186
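Note that recent pandas versions (2.0 and later) no longer accept inplace=True in set_levels, so there you would assign the result back instead, along the lines of:
df.index = df.index.set_levels(df.index.levels[0] + pd.tseries.offsets.BMonthEnd(),
                               level='date')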
You could just build it again:
df.index = pd.MultiIndex.from_arrays([
    df.index.get_level_values(0) + BMonthEnd(),
    df.index.get_level_values(1),
])
set_levels implicitly rebuilds the index under the covers. If you have more than two levels, this solution becomes unwieldy, so consider using set_levels for typing brevity.
Since you want to merge anyway, you can forget about changing the index and use pandas.merge_asof().
Data
df1
0
date id
2011-01-15 1 -0.810581
2 1.177235
3 0.083883
2011-02-15 1 1.217419
2 -0.970804
3 1.262364
2011-03-15 1 -0.026136
2 -0.036250
3 -1.103929
2011-04-15 1 -1.303298
And here is one with last business day of the month, df2
0
date id
2011-01-31 1 -0.277675
2 0.086539
3 1.441449
2011-02-28 1 1.330212
2 -0.028398
3 -0.114297
2011-03-31 1 -0.031264
2 -0.787093
3 -0.133088
2011-04-29 1 0.938732
merge
Use df1 as your left DataFrame and then choose the merge direction as forward, since the last business day is always after the 15th. Optionally, you can set a tolerance. This is useful when you are missing a month in the right DataFrame: it will prevent you from merging 03-31-2011 to 02-15-2011 if you are missing data for the last business day of February.
import pandas as pd
pd.merge_asof(df1.reset_index(), df2.reset_index(), by='id', on='date',
direction='forward', tolerance=pd.Timedelta(days=20)).set_index(['date', 'id'])
Results in
0_x 0_y
date id
2011-01-15 1 -0.810581 -0.277675
2 1.177235 0.086539
3 0.083883 1.441449
2011-02-15 1 1.217419 1.330212
2 -0.970804 -0.028398
3 1.262364 -0.114297
2011-03-15 1 -0.026136 -0.031264
2 -0.036250 -0.787093
3 -1.103929 -0.133088
2011-04-15 1 -1.303298 0.938732
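Alternatively, if you want the plain year-and-month merge the question alludes to, a rough sketch is to collapse both date columns to a monthly Period and merge on (month, id); the suffixes below are just illustrative names:
left = df1.reset_index()
right = df2.reset_index()
left['month'] = left['date'].dt.to_period('M')
right['month'] = right['date'].dt.to_period('M')
merged = left.merge(right, on=['month', 'id'], suffixes=('_mid', '_eom'))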

How to change value of second duplicate in row

I have not been able to find an answer to the following online. Your valuable help would be much appreciated.
I have a DataFrame like this with 20k rows:
ID Date Country
2008-0001 2008-01-02 India
2008-0001 2008-01-02 France
2008-0002 2008-01-03 USA
I want to take all the duplicates in ID, such as in rows 1 and 2, and then change the second occurrence to one above the highest number after the dash.
So for instance, because there is already 2008-0002 (assume that 0002 is the highest number after the dash in that column for that year), I want to increment to one above that, so one of the duplicate ID values 2008-0001 would become 2008-0003.
I can identify and drop duplicates using the following code
drop_duplicate_df = train_df.drop_duplicates(['ID'])
but this is not what I need.
I believe this will get it done:
# assumes 'ID' has already been split into an integer year part 'ID1'
# and an integer sequence part 'ID2'
isdup = df.duplicated(subset=['ID1', 'ID2'])
dups, uniques = df[isdup], df[~isdup]

ids = ['ID1', 'ID2']
for i, row in dups.iterrows():
    while (row[ids] == uniques[ids]).all(axis=1).any():
        row.loc['ID2'] += 1
    uniques = uniques.append(row)

id1 = uniques.ID1.astype(str)
id2 = uniques.ID2.astype(str).str.zfill(4)
uniques.loc[:, 'ID'] = id1 + '-' + id2
uniques.drop(['ID1', 'ID2'], axis=1, inplace=True)

print(uniques.sort_index())
ID Date Country
0 2008-0001 2008-01-02 India
1 2008-0003 2008-01-02 France
2 2008-0002 2008-01-03 USA
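Note that the snippet above presumes ID has already been split into helper columns ID1 and ID2; a hypothetical sketch of that preprocessing could be:
parts = df['ID'].str.split('-', expand=True)
df['ID1'] = parts[0].astype(int)
df['ID2'] = parts[1].astype(int)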
The below works with the sample data and assumes you have data for several years, all of which you want to relabel according to the same logic:
df.Date = pd.to_datetime(df.Date)    # to datetime to extract years
years = df.groupby(df.Date.dt.year)  # analysis per year

new_df = pd.DataFrame()
for year, data in years:
    data.loc[data.duplicated(subset='ID'), 'ID'] = '{0}-{1}'.format(
        year, str(int(df.ID.max().split('-')[1]) + 1).zfill(4))
    new_df = pd.concat([new_df, data])
to get:
ID Date Country
0 2008-0001 2008-01-02 India
1 2008-0003 2008-01-02 France
2 2008-0002 2008-01-03 USA

Pandas - SQL case statement equivalent

NOTE: Looking for some help on an efficient way to do this besides a mega join and then calculating the difference between dates
I have table1 with country ID and a date (no duplicates of these values), and I want to summarize table2 information (which has country, date, cluster_x and a count variable, where cluster_x is cluster_1, cluster_2, cluster_3). For each row of table1 I want to append each cluster ID and the summed count from table2, where the date from table2 occurred within 30 days prior to the date in table1.
I believe this is simple in SQL; how do I do this in Pandas?
select a.date, a.country,
       sum(case when a.date - b.date between 1 and 30 then b.cluster_1 else 0 end) as cluster1,
       sum(case when a.date - b.date between 1 and 30 then b.cluster_2 else 0 end) as cluster2,
       sum(case when a.date - b.date between 1 and 30 then b.cluster_3 else 0 end) as cluster3
from table1 a
left outer join table2 b
  on a.country = b.country
group by a.date, a.country
EDIT:
Here is a somewhat altered example. Say this is table1, an aggregated data set with date, country, cluster and count. Below it is the "query" dataset (table2). In this case we want to sum the count field from table1 for cluster1, cluster2, cluster3 (there are actually 100 of them) corresponding to the country id, as long as the date field in table1 falls within the 30 days prior.
So for example, the first row of the query dataset has date 2/2/2015 and country 1. In table 1, there is only one row within 30 days prior and it is for cluster 2 with count 2.
Here is a dump of the two tables in CSV:
date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3
and table2:
date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2
Edit: Oop - wish I would have seen that edit about joining before submitting. Np, I'll leave this as it was fun practice. Critiques welcome.
Where table1 and table2 are located in the same directory as this script at "table1.csv" and "table2.csv", this should work.
I didn't get the same result as your examples with 30 days - had to bump it to 31 days, but I think the spirit is here:
import pandas as pd
import numpy as np

table1_path = './table1.csv'
table2_path = './table2.csv'

with open(table1_path) as f:
    table1 = pd.read_csv(f)
table1.date = pd.to_datetime(table1.date)

with open(table2_path) as f:
    table2 = pd.read_csv(f)
table2.date = pd.to_datetime(table2.date)

joined = pd.merge(table2, table1, how='outer', on=['country'])
joined['datediff'] = joined.date_x - joined.date_y

filtered = joined[(joined.datediff >= np.timedelta64(1, 'D')) & (joined.datediff <= np.timedelta64(31, 'D'))]

gb_date_x = filtered.groupby(['date_x', 'country', 'cluster'])
summed = pd.DataFrame(gb_date_x['count'].sum())
result = summed.unstack()
result.reset_index(inplace=True)
result.fillna(0, inplace=True)
My test output:
ipdb> table1
date country cluster count
0 2014-01-30 00:00:00 1 1 1
1 2015-02-03 00:00:00 1 1 3
2 2015-01-30 00:00:00 1 2 2
3 2015-04-15 00:00:00 1 2 5
4 2015-03-01 00:00:00 2 1 6
5 2015-07-01 00:00:00 2 2 4
6 2015-01-31 00:00:00 2 3 8
7 2015-01-21 00:00:00 2 1 2
8 2015-01-21 00:00:00 2 1 3
ipdb> table2
date country
0 2015-02-01 00:00:00 1
1 2015-04-21 00:00:00 1
2 2015-02-21 00:00:00 2
...
ipdb> result
date_x country count
cluster 1 2 3
0 2015-02-01 00:00:00 1 0 2 0
1 2015-02-21 00:00:00 2 5 0 8
2 2015-04-21 00:00:00 1 0 5 0
UPDATE:
I think it doesn't make much sense to use pandas for processing data that can't fit into your memory. Of course there are some tricks for dealing with that, but it's painful.
If you want to process your data efficiently, you should use a proper tool for that.
I would recommend having a closer look at Apache Spark SQL, where you can process your distributed data on multiple cluster nodes, with much more memory/processing power/IO/etc. compared to the single-computer pandas approach.
Alternatively you can try an RDBMS like Oracle DB (very expensive, especially the software licences, and the free version is full of limitations) or free alternatives like PostgreSQL (I can't say much about it, for lack of experience) or MySQL (not as powerful as Oracle; for example, there is no native/clear solution for dynamic pivoting, which you most probably will want to use).
OLD answer:
you can do it this way (please find explanations as comments in the code):
#
# <setup>
#
dates1 = pd.date_range('2016-03-15', '2016-04-15')
dates2 = ['2016-02-01', '2016-05-01', '2016-04-01', '2015-01-01', '2016-03-20']
dates2 = [pd.to_datetime(d) for d in dates2]
countries = ['c1', 'c2', 'c3']

t1 = pd.DataFrame({
    'date': dates1,
    'country': np.random.choice(countries, len(dates1)),
    'cluster': np.random.randint(1, 4, len(dates1)),
    'count': np.random.randint(1, 10, len(dates1))
})
t2 = pd.DataFrame({'date': np.random.choice(dates2, 10), 'country': np.random.choice(countries, 10)})
#
# </setup>
#

# merge two DFs by `country`
merged = pd.merge(t1.rename(columns={'date': 'date1'}), t2, on='country')

# filter dates and drop 'date1' column
merged = merged[(merged.date <= merged.date1 + pd.Timedelta('30days'))
                &
                (merged.date >= merged.date1)
               ].drop(['date1'], axis=1)

# group `merged` DF by ['country', 'date', 'cluster'],
# sum up `counts` for overlapping dates,
# reset the index,
# pivot: convert `cluster` values to columns,
#        taking sums of `count` as values,
#        NaN's will be replaced with zeroes
# and finally reset the index
r = merged.groupby(['country', 'date', 'cluster'])\
          .sum()\
          .reset_index()\
          .pivot_table(index=['country', 'date'],
                       columns='cluster',
                       values='count',
                       aggfunc='sum',
                       fill_value=0)\
          .reset_index()

# rename numeric columns to: 'cluster_N'
rename_cluster_cols = {x: 'cluster_{0}'.format(x) for x in t1.cluster.unique()}
r = r.rename(columns=rename_cluster_cols)
Output (for my datasets):
In [124]: r
Out[124]:
cluster country date cluster_1 cluster_2 cluster_3
0 c1 2016-04-01 8 0 11
1 c2 2016-04-01 0 34 22
2 c3 2016-05-01 4 18 36
