This is my dataframe:

ID number  Date purchase
1          2022-05-01
1          2021-03-03
1          2020-01-03
2          2019-01-03
2          2018-01-03
I want to get a wide dataframe with all the dates in separate columns per ID number.
So like this:
ID number  Date 1      Date 2      Date 3
1          2022-05-01  2021-03-03  2020-01-03
2          2019-01-03  2018-01-03
After that, I want to calculate the difference between these dates.
First step is GroupBy.cumcount with DataFrame.pivot:
df['Date purchase'] = pd.to_datetime(df['Date purchase'])
df1 = (df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
         .assign(g=lambda x: x.groupby('ID number').cumcount())
         .pivot(index='ID number', columns='g', values='Date purchase')
         .rename(columns=lambda x: f'Date {x + 1}'))
print (df1)
g Date 1 Date 2 Date 3
ID number
1 2022-05-01 2021-03-03 2020-01-03
2 2019-01-03 2018-01-03 NaT
Then for differences between columns use DataFrame.diff:
df2 = df1.diff(-1, axis=1)
print (df2)
g Date 1 Date 2 Date 3
ID number
1 424 days 425 days NaT
2 365 days NaT NaT
If you need averages:
df3 = df1.apply(pd.Series.mean, axis=1).reset_index(name='Avg Dates').rename_axis(None, axis=1)
print (df3)
ID number Avg Dates
0 1 2021-03-02 16:00:00
1 2 2018-07-04 12:00:00
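Putting the whole answer together, a minimal end-to-end sketch (pandas assumed; the frame is rebuilt from the question's sample data):

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'ID number': [1, 1, 1, 2, 2],
    'Date purchase': ['2022-05-01', '2021-03-03', '2020-01-03',
                      '2019-01-03', '2018-01-03'],
})
df['Date purchase'] = pd.to_datetime(df['Date purchase'])

# One column per purchase, newest first per ID
df1 = (df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
         .assign(g=lambda x: x.groupby('ID number').cumcount())
         .pivot(index='ID number', columns='g', values='Date purchase')
         .rename(columns=lambda x: f'Date {x + 1}'))

# Gap between each purchase and the next older one
df2 = df1.diff(-1, axis=1)
```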
Could you do something like this?
def format_dataframe(df):
    """
    Formats the dataframe to the following:
    | ID number| Date 1         | Date 2         | Date 3         |
    | -------- | -------------- | -------------- | -------------- |
    | 1        | 2022-05-01     | 2021-03-03     | 2020-01-03     |
    | 2        | 2019-01-03     | 2018-01-03     |                |
    """
    df = df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
    df = df.assign(g=lambda x: x.groupby('ID number').cumcount() + 1)
    df = df.pivot(index='ID number', columns='g', values='Date purchase')
    df = df.rename(columns=lambda g: f'Date {g}')
    return df.reset_index()
initial situation:
d = {'IdNumber': [1,1,1,2,2], 'Date': ['2022-05-01', '2021-03-03','2020-01-03','2019-01-03','2018-01-03']}
df = pd.DataFrame(data=d)
date conversion:
df['Date'] = pd.to_datetime(df['Date'])
creating new column:
df1 = df.assign(Col=lambda x: x.groupby('IdNumber').cumcount())
pivoting:
df1 = df1.pivot(index='IdNumber', columns='Col', values='Date')
rename columns:
df1 = df1.rename(columns=lambda c: 'Date{}'.format(c + 1))
reset index:
df1 = df1.reset_index()
final result:
Col IdNumber Date1 Date2 Date3
0 1 2022-05-01 2021-03-03 2020-01-03
1 2 2019-01-03 2018-01-03 NaT
I have two dataframes:
df1 = pd.DataFrame(index = [0,1,2], columns=['timestamp', 'order_id', 'account_id', 'USD', 'CAD'])
df1['timestamp']=['2022-01-01','2022-01-02','2022-01-03']
df1['account_id']=['usdcad','usdcad','usdcad']
df1['order_id']=['11233123','12313213','12341242']
df1['USD'] = [1,2,3]
df1['CAD'] = [4,5,6]
df1:
timestamp account_id order_id USD CAD
0 2022-01-01 usdcad 11233123 1 4
1 2022-01-02 usdcad 12313213 2 5
2 2022-01-03 usdcad 12341242 3 6
df2 = pd.DataFrame(index = [0,1], columns = ['timestamp','account_id', 'currency','balance'])
df2['timestamp']=['2021-12-21','2021-12-21']
df2['account_id']=['usdcad','usdcad']
df2['currency'] = ['USD', 'CAD']
df2['balance'] = [2,3]
df2:
timestamp account_id currency balance
0 2021-12-21 usdcad USD 2
1 2021-12-21 usdcad CAD 3
I would like to add a row to df1 at index 0, and fill that row with the balance of df2 based on currency. So the final df should look like this:
df:
timestamp account_id order_id USD CAD
0 0 0 0 2 3
1 2022-01-01 usdcad 11233123 1 4
2 2022-01-02 usdcad 12313213 2 5
3 2022-01-03 usdcad 12341242 3 6
How can I do this in a pythonic way? Thank you
Set the index of df2 to currency, keep the balance column, and transpose so the currencies become columns; then concatenate df1 underneath (DataFrame.append was removed in pandas 2.0, so use pd.concat):
df_out = pd.concat([df2.set_index('currency')[['balance']].T, df1], ignore_index=True).fillna(0)
print(df_out)
   USD  CAD   timestamp  order_id account_id
0    2    3           0         0          0
1    1    4  2022-01-01  11233123     usdcad
2    2    5  2022-01-02  12313213     usdcad
3    3    6  2022-01-03  12341242     usdcad
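Rebuilt end-to-end, the pd.concat approach can be sanity-checked; the frames mirror the question's setup:

```python
import pandas as pd

df1 = pd.DataFrame({
    'timestamp': ['2022-01-01', '2022-01-02', '2022-01-03'],
    'order_id': ['11233123', '12313213', '12341242'],
    'account_id': ['usdcad', 'usdcad', 'usdcad'],
    'USD': [1, 2, 3],
    'CAD': [4, 5, 6],
})
df2 = pd.DataFrame({
    'timestamp': ['2021-12-21', '2021-12-21'],
    'account_id': ['usdcad', 'usdcad'],
    'currency': ['USD', 'CAD'],
    'balance': [2, 3],
})

# One-row frame with USD/CAD columns holding the balances,
# stacked on top of df1
top = df2.set_index('currency')[['balance']].T
df_out = pd.concat([top, df1], ignore_index=True).fillna(0)
```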
I have 2 Dataframes like this:
ID Date1
1 2018-02-01
2 2019-03-01
3 2005-09-02
4 2021-11-09
And then I have this Dataframe:
ID Date2
4 2003-02-01
4 2004-03-11
3 1998-02-11
2 1999-02-11
1 2000-09-25
What I want to do is find the difference between the dates that share the same ID across the two dataframes, using this function:
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)
and summing up the differences for the corresponding ID.
The expected output would be:
Date is the summed-up difference in days for the corresponding ID.
ID Date
1 6338
2 7323
3 2760
4 13308
Solution if df1.ID has no duplicates (only df2.ID does): use Series.map to create a new column, subtract with Series.sub, convert the timedeltas to days with Series.dt.days, and last aggregate sum:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2['Date'] = df2['ID'].map(df1.set_index('ID')['Date1']).sub(df2['Date2']).dt.days
print (df2)
ID Date2 Date
0 4 2003-02-01 6856
1 4 2004-03-11 6452
2 3 1998-02-11 2760
3 2 1999-02-11 7323
4 1 2000-09-25 6338
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
Or use DataFrame.merge instead of map:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2 = df1.merge(df2, on='ID')
df2['Date'] = df2['Date1'].sub(df2['Date2']).dt.days
print (df2)
ID Date1 Date2 Date
0 1 2018-02-01 2000-09-25 6338
1 2 2019-03-01 1999-02-11 7323
2 3 2005-09-02 1998-02-11 2760
3 4 2021-11-09 2003-02-01 6856
4 4 2021-11-09 2004-03-11 6452
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
Does this work:
d = pd.merge(df1, df2)
d[['Date1','Date2']] = d[['Date1','Date2']].apply(pd.to_datetime, format='%Y-%m-%d')
d['Date'] = d['Date1'] - d['Date2']
d.groupby('ID')['Date'].sum().reset_index()
ID Date
0 1 6338 days
1 2 7323 days
2 3 2760 days
3 4 13308 days
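As a cross-check, the expected totals can be reproduced in plain Python with the question's days_between helper (data copied from the two tables above):

```python
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)

# Date1 per ID (first dataframe)
date1 = {1: '2018-02-01', 2: '2019-03-01', 3: '2005-09-02', 4: '2021-11-09'}
# (ID, Date2) rows (second dataframe)
rows = [(4, '2003-02-01'), (4, '2004-03-11'), (3, '1998-02-11'),
        (2, '1999-02-11'), (1, '2000-09-25')]

# Sum the day differences per ID
totals = {}
for i, d2 in rows:
    totals[i] = totals.get(i, 0) + days_between(date1[i], d2)
```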
I have dataframe like this:
Date Location_ID Problem_ID
---------------------+------------+----------
2013-01-02 10:00:00 | 1 | 43
2012-08-09 23:03:01 | 5 | 2
...
How can I count how often a Problem occurs per day and per Location?
Use groupby, either converting the Date column to dates or using Grouper, and aggregate with size:
print (df)
Date Location_ID Problem_ID
0 2013-01-02 10:00:00 1 43
1 2012-08-09 23:03:01 5 2
#if necessary convert column to datetimes
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby([df['Date'].dt.date, 'Location_ID']).size().reset_index(name='count')
print (df1)
Date Location_ID count
0 2012-08-09 5 1
1 2013-01-02 1 1
Or:
df1 = (df.groupby([pd.Grouper(key='Date', freq='D'), 'Location_ID'])
.size()
.reset_index(name='count'))
If first column is index:
print (df)
Location_ID Problem_ID
Date
2013-01-02 10:00:00 1 43
2012-08-09 23:03:01 5 2
df.index = pd.to_datetime(df.index)
df1 = (df.groupby([df.index.date, 'Location_ID'])
.size()
.reset_index(name='count')
.rename(columns={'level_0':'Date'}))
print (df1)
Date Location_ID count
0 2012-08-09 5 1
1 2013-01-02 1 1
df1 = (df.groupby([pd.Grouper(level='Date', freq='D'), 'Location_ID'])
.size()
.reset_index(name='count'))
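If a wide day-by-location table is acceptable, pd.crosstab is a one-call alternative; a sketch on the question's two sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2013-01-02 10:00:00', '2012-08-09 23:03:01'],
    'Location_ID': [1, 5],
    'Problem_ID': [43, 2],
})
df['Date'] = pd.to_datetime(df['Date'])

# Rows: calendar day, columns: Location_ID, cells: number of problems
wide = pd.crosstab(df['Date'].dt.date, df['Location_ID'])
```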
I have the below dataframe. Date is in DD/MM/YYYY format.
Date id
1/5/2017 2:00 PM 100
1/5/2017 3:00 PM 101
2/5/2017 10:00 AM 102
3/5/2017 09:00 AM 103
3/5/2017 10:00 AM 104
4/5/2017 09:00 AM 105
I need to group by date, ignoring the time, and count the number of ids per day. The output dataframe should be as below:
DATE Count
1/5/2017 2 -> count 100,101
2/5/2017 1
3/5/2017 2
4/5/2017 1
I need an efficient way to achieve this.
Use:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = df['Date'].dt.date.value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
Alternative solution:
df1 = df.groupby(df['Date'].dt.date).size().reset_index(name='Count')
print (df1)
DATE Count
0 2017-05-01 2
1 2017-05-02 1
2 2017-05-03 2
3 2017-05-04 1
If you need the same string format, work on the raw strings (before the to_datetime conversion):
df1 = df['Date'].str.split().str[0].value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
Alternative solution:
new = df['Date'].str.split().str[0]
df1 = df.groupby(new).size().reset_index(name='Count')
print (df1)
Date Count
0 1/5/2017 2
1 2/5/2017 1
2 3/5/2017 2
3 4/5/2017 1
I have this dataframe (type could be 1 or 2):
user_id | timestamp | type
1 | 2015-5-5 12:30 | 1
1 | 2015-5-5 14:00 | 2
1 | 2015-5-5 15:00 | 1
I want to group my data by six hours and when doing this I want to keep type as:
1 (if there is only 1 within that 6 hour frame)
2 (if there is only 2 within that 6 hour frame) or
3 (if there was both 1 and 2 within that 6 hour frame)
Here is my code:
df = df.groupby(['user_id', pd.TimeGrouper(freq='6H')]).mean()
which produces:
user_id | timestamp | type
1 | 2015-5-5 12:00 | 4
However, I want to get 3 instead of 4. How can I replace the mean() in my groupby code to produce the desired output?
Try this:
In [54]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]) \
.agg({'type':lambda x: x.unique().sum()})
Out[54]:
type
user_id timestamp
1 2015-05-05 12:00:00 3
PS: this only works with the given types (1, 2), since their sum is 3.
Another data set:
In [56]: df
Out[56]:
user_id timestamp type
0 1 2015-05-05 12:30:00 1
1 1 2015-05-05 14:00:00 1
2 1 2015-05-05 15:00:00 1
3 1 2015-05-05 20:00:00 1
In [57]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]).agg({'type':lambda x: x.unique().sum()})
Out[57]:
type
user_id timestamp
1 2015-05-05 12:00:00 1
2015-05-05 18:00:00 1
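If the types are not guaranteed to be exactly 1 and 2, the sum-of-uniques trick breaks; a hedged variant maps the set of observed types explicitly (sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1],
    'timestamp': pd.to_datetime(['2015-05-05 12:30',
                                 '2015-05-05 14:00',
                                 '2015-05-05 15:00']),
    'type': [1, 2, 1],
})

def combine(types):
    # 3 when both types occur in the window, else the single type seen
    s = set(types)
    return 3 if s == {1, 2} else s.pop()

out = (df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')])['type']
         .agg(combine))
```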