I have what I assume might be a complex ask.
My DataFrame has a Min and a Max column for each date type. For each date type I want to store the column header in a column called "Dates", and then create two new columns, "Min" and "Max", to hold the corresponding values.
DataFrame

ID  Item  DateMade_Min  DateMade_Max  DelDate_Min  DelDate_Max  ExpDate_Min  ExpDate_Max
1   2322  01/01/2020    01/01/2020    05/06/2020   07/06/2020   06/05/2022   09/09/2022
2   4454  03/04/2020    01/01/2021    07/08/2020   31/08/2020   15/12/2022   09/01/2023
Desired Output

ID  Item  Dates     Min         Max
1   2322  DateMade  01/01/2020  01/01/2020
1   2322  DelDate   05/06/2020  07/06/2020
1   2322  ExpDate   06/05/2022  09/09/2022
2   4454  DateMade  03/04/2020  01/01/2021
2   4454  DelDate   07/08/2020  31/08/2020
2   4454  ExpDate   15/12/2022  09/01/2023
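For anyone who wants to run the answers below, here is a minimal reconstruction of the input frame above (my assumption, with the dates kept as dd/mm/yyyy strings):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2],
    'Item': [2322, 4454],
    'DateMade_Min': ['01/01/2020', '03/04/2020'],
    'DateMade_Max': ['01/01/2020', '01/01/2021'],
    'DelDate_Min': ['05/06/2020', '07/08/2020'],
    'DelDate_Max': ['07/06/2020', '31/08/2020'],
    'ExpDate_Min': ['06/05/2022', '15/12/2022'],
    'ExpDate_Max': ['09/09/2022', '09/01/2023'],
})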
You can reshape with an intermediate stacking and a MultiIndex:
out = (df
   .set_index(['ID', 'Item'])
   # split the column names on '_' into a two-level MultiIndex: (DateMade, Min), (DateMade, Max), ...
   .pipe(lambda d: d.set_axis(d.columns.str.split('_', expand=True), axis=1))
   # move the first level (the date type) into the index, leaving Min/Max as columns
   .stack(0)
   .reset_index().rename(columns={'level_2': 'Dates'})
)
Output:
ID Item Dates Max Min
0 1 2322 DateMade 01/01/2020 01/01/2020
1 1 2322 DelDate 07/06/2020 05/06/2020
2 1 2322 ExpDate 09/09/2022 06/05/2022
3 2 4454 DateMade 01/01/2021 03/04/2020
4 2 4454 DelDate 31/08/2020 07/08/2020
5 2 4454 ExpDate 09/01/2023 15/12/2022
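Note that the stacked result lists Max before Min (alphabetical order). If you want the exact column order of the desired output, a small follow-up reorder works (a sketch, assuming the result is in out):

out = out[['ID', 'Item', 'Dates', 'Min', 'Max']]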
Alternatively, you can use the janitor helper module and its pivot_longer function:
# pip install pyjanitor
import janitor

out = df.pivot_longer(
    index=['ID', 'Item'],
    names_to=('Dates', '.value'),
    names_sep='_',
    sort_by_appearance=True
)
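Here the '.value' placeholder tells pivot_longer that the part of each split column name after the separator (Min/Max) should stay as column headers, while the part matched by 'Dates' (DateMade, DelDate, ExpDate) is gathered into the Dates column.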
Here is one way to do it:
# melt the DF so each Date*_Min/_Max column becomes a row
df2 = df.melt(['ID', 'Item'])

# split the former column names on '_' into the date type and the min/max part
df2[['date', 'minmax']] = df2['variable'].str.split('_', expand=True)

# pivot to put Min and Max back into their own columns
df2.pivot(index=['ID', 'Item', 'date'], columns='minmax', values='value').reset_index()
minmax ID Item date Max Min
0 1 2322 DateMade 01/01/2020 01/01/2020
1 1 2322 DelDate 07/06/2020 05/06/2020
2 1 2322 ExpDate 09/09/2022 06/05/2022
3 2 4454 DateMade 01/01/2021 03/04/2020
4 2 4454 DelDate 31/08/2020 07/08/2020
5 2 4454 ExpDate 09/01/2023 15/12/2022
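If you assign the pivoted result to a variable, a small clean-up matches the headers of the desired output (a sketch; out is just my name for the result above):

out = df2.pivot(index=['ID', 'Item', 'date'], columns='minmax', values='value').reset_index()
out = out.rename(columns={'date': 'Dates'}).rename_axis(columns=None)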
Related
DataFrame

ID  DateMade    DelDate     ExpDate
1   01/01/2020  05/06/2020  06/05/2022
1   01/01/2020  07/06/2020  07/05/2022
1   01/01/2020  07/06/2020  09/09/2022
2   03/04/2020  07/08/2020  15/12/2022
2   05/06/2020  23/08/2020  31/12/2022
2   01/01/2021  31/08/2020  09/01/2023
What I want to do is group by ID and create columns for the min and max date of each column, but I'm not sure where to start. I know there are aggregate functions that work well with one column, but is there a straightforward solution when dealing with multiple columns?
Desired Output

ID  DateMade_Min  DateMade_Max  DelDate_Min  DelDate_Max  ExpDate_Min  ExpDate_Max
1   01/01/2020    01/01/2020    05/06/2020   07/06/2020   06/05/2022   09/09/2022
2   03/04/2020    01/01/2021    07/08/2020   31/08/2020   15/12/2022   09/01/2023
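For reference, a minimal reconstruction of the example frame above (my assumption, dates as dd/mm/yyyy strings):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'DateMade': ['01/01/2020', '01/01/2020', '01/01/2020', '03/04/2020', '05/06/2020', '01/01/2021'],
    'DelDate': ['05/06/2020', '07/06/2020', '07/06/2020', '07/08/2020', '23/08/2020', '31/08/2020'],
    'ExpDate': ['06/05/2022', '07/05/2022', '09/09/2022', '15/12/2022', '31/12/2022', '09/01/2023'],
})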
First convert the listed columns to datetimes with DataFrame.apply and to_datetime, then aggregate min and max, and finally flatten the MultiIndex columns, capitalizing the min/max part:
cols = ['DateMade','DelDate','ExpDate']
df[cols] = df[cols].apply(pd.to_datetime, dayfirst=True)
df1 = df.groupby('ID')[cols].agg(['min','max'])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1].capitalize()}')
df1 = df1.reset_index()
print (df1)
ID DateMade_Min DateMade_Max DelDate_Min DelDate_Max ExpDate_Min \
0 1 2020-01-01 2020-01-01 2020-06-05 2020-06-07 2022-05-06
1 2 2020-04-03 2021-01-01 2020-08-07 2020-08-31 2022-12-15
ExpDate_Max
0 2022-09-09
1 2023-01-09
To restore the original datetime format, add a lambda function with Series.dt.strftime:
cols = ['DateMade','DelDate','ExpDate']
df[cols] = df[cols].apply(pd.to_datetime, dayfirst=True)
df1 = df.groupby('ID')[cols].agg(['min','max'])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1].capitalize()}')
df1 = df1.apply(lambda x: x.dt.strftime('%d/%m/%Y'))
df1 = df1.reset_index()
print (df1)
ID DateMade_Min DateMade_Max DelDate_Min DelDate_Max ExpDate_Min \
0 1 01/01/2020 01/01/2020 05/06/2020 07/06/2020 06/05/2022
1 2 03/04/2020 01/01/2021 07/08/2020 31/08/2020 15/12/2022
ExpDate_Max
0 09/09/2022
1 09/01/2023
I have the following dataset and would like to group by Name, month, and year, counting how many times each name appears in every month of every year:
Date Name
2019-11-10 18:59:31+00:00 A
2020-11-07 18:59:31+00:00 A
2021-05-10 18:59:31+00:00 B
2020-11-09 18:59:31+00:00 C
2021-05-01 18:59:31+00:00 B
2020-12-10 18:59:31+00:00 C
2019-12-10 18:59:31+00:00 B
I do not know exactly what the result should look like, but I expect something similar to this so that I can then make a graph:
2019-11 A 1
2020-11 A 1
2021-05 B 2
2020-11 C 1
2020-12 C 1
2019-12 B 1
I have tried the following method:
df.groupby(pd.Grouper(key='Date',freq='1M')).groupby('Name').count()
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Try:
df.groupby([df['Date'].dt.strftime('%Y-%m'),'Name'])['Name'].count().rename('count').reset_index()
Output:
Date Name count
0 2019-11 A 1
1 2019-12 B 1
2 2020-11 A 1
3 2020-11 C 1
4 2020-12 C 1
5 2021-05 B 2
If Date is the index of the DataFrame, it should be reset to a regular column first, and both groupers passed in a single groupby call:
df = df.reset_index()
df.groupby([pd.Grouper(key='Date', freq='1M'), 'Name']).count()
Try converting Date to year/month using to_period('M'):
df.groupby([df.Date.dt.to_period('M'),df.Name]).agg(count = ('Date','count')).reset_index()
Result is:
Date Name count
0 2019-11 A 1
1 2019-12 B 1
2 2020-11 A 1
3 2020-11 C 1
4 2020-12 C 1
5 2021-05 B 2
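Since the goal is a graph, note that the Date column in this result is a Period; a small follow-up (a sketch, assuming the result above is assigned to res) converts it for plotting:

res['Date'] = res['Date'].dt.to_timestamp()   # month-start timestamps; or res['Date'].astype(str) for labels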
I have two DataFrames like this:
ID Date1
1 2018-02-01
2 2019-03-01
3 2005-09-02
4 2021-11-09
And then I have this DataFrame:
ID Date2
4 2003-02-01
4 2004-03-11
3 1998-02-11
2 1999-02-11
1 2000-09-25
What I want to do is find the difference between the dates that share the same ID across the two DataFrames, using this function:
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)
and then summing up the differences for each corresponding ID.
The expected output would be:
(Here Date is the summed-up difference in days for the corresponding ID.)
ID Date
1 6338
2 7323
3 2760
4 13308
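A minimal reconstruction of the two frames above (my assumption, dates as ISO strings):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Date1': ['2018-02-01', '2019-03-01', '2005-09-02', '2021-11-09']})
df2 = pd.DataFrame({'ID': [4, 4, 3, 2, 1],
                    'Date2': ['2003-02-01', '2004-03-11', '1998-02-11', '1999-02-11', '2000-09-25']})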
Solution if df1.ID has no duplicates (only df2.ID does): use Series.map to bring Date1 into df2, subtract with Series.sub, convert the timedeltas to days with Series.dt.days, and finally aggregate with sum:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2['Date'] = df2['ID'].map(df1.set_index('ID')['Date1']).sub(df2['Date2']).dt.days
print (df2)
ID Date2 Date
0 4 2003-02-01 6856
1 4 2004-03-11 6452
2 3 1998-02-11 2760
3 2 1999-02-11 7323
4 1 2000-09-25 6338
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
Or use DataFrame.merge instead of map:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2 = df1.merge(df2, on='ID')
df2['Date'] = df2['Date1'].sub(df2['Date2']).dt.days
print (df2)
ID Date1 Date2 Date
0 1 2018-02-01 2000-09-25 6338
1 2 2019-03-01 1999-02-11 7323
2 3 2005-09-02 1998-02-11 2760
3 4 2021-11-09 2003-02-01 6856
4 4 2021-11-09 2004-03-11 6452
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
Does this work?
d = pd.merge(d1,d2)
d[['Date1','Date2']] = d[['Date1','Date2']].apply(pd.to_datetime, format = '%Y-%m-%d')
d['Date'] = d['Date1'] - d['Date2']
d.groupby('ID')['Date'].sum().reset_index()
ID Date
0 1 6338 days
1 2 7323 days
2 3 2760 days
3 4 13308 days
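To get plain integers instead of timedeltas, as in the expected output, take .dt.days before summing (a small tweak of the snippet above):

d['Date'] = (d['Date1'] - d['Date2']).dt.days
d.groupby('ID')['Date'].sum().reset_index()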
I'd like to copy or duplicate the rows of a DataFrame based on the value of a column, in this case orig_qty. So if I have a DataFrame and using pandas==0.24.2:
import pandas as pd
d = {'a': ['2019-04-08', 4, 115.00], 'b': ['2019-04-09', 2, 103.00]}
df = pd.DataFrame.from_dict(
d,
orient='index',
columns=['date', 'orig_qty', 'price']
)
Input
>>> print(df)
date orig_qty price
a 2019-04-08 4 115.0
b 2019-04-09 2 103.0
So in the example above the row with orig_qty=4 should be duplicated 4 times and the row with orig_qty=2 should be duplicated 2 times. After this transformation I'd like a DataFrame that looks like:
Desired Output
>>> print(new_df)
date orig_qty price fifo_qty
1 2019-04-08 4 115.0 1
2 2019-04-08 4 115.0 1
3 2019-04-08 4 115.0 1
4 2019-04-08 4 115.0 1
5 2019-04-09 2 103.0 1
6 2019-04-09 2 103.0 1
Note I do not really care about the index after the transformation. I can elaborate more on the use case for this, but essentially I'm doing some FIFO accounting where important changes can occur between values of orig_qty.
Use Index.repeat, DataFrame.loc, DataFrame.assign and DataFrame.reset_index:
new_df = (df.loc[df.index.repeat(df['orig_qty'])]
            .assign(fifo_qty=1)
            .reset_index(drop=True))
Output:
date orig_qty price fifo_qty
0 2019-04-08 4 115.0 1
1 2019-04-08 4 115.0 1
2 2019-04-08 4 115.0 1
3 2019-04-08 4 115.0 1
4 2019-04-09 2 103.0 1
5 2019-04-09 2 103.0 1
Use np.repeat:
import numpy as np

new_df = pd.DataFrame({col: np.repeat(df[col], df.orig_qty) for col in df.columns})
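To match the desired output exactly, you could follow up by adding the fifo_qty column and resetting the index (a sketch):

new_df = new_df.assign(fifo_qty=1).reset_index(drop=True)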
I have a DataFrame like the one below, and for each Id I need to find the missing quarters and the count of missing quarters in each gap.
The DataFrame is:
year Data Id
2019Q4 57170 A
2019Q3 55150 A
2019Q2 51109 A
2019Q1 51109 A
2018Q1 57170 B
2018Q4 55150 B
2017Q4 51109 C
2017Q2 51109 C
2017Q1 51109 C
Desired Output

Id  Start year  End year  Count
B   2018Q2      2018Q3    2
B   2017Q3      2018Q3    1
How can I achieve this using Python pandas?
Use:
#data changed to a more general example - multiple missing years per group
print (df)
year Data Id
0 2015 57170 A
1 2016 55150 A
2 2019 51109 A
3 2023 51109 A
4 2000 47740 B
5 2002 44563 B
6 2003 43643 C
7 2004 42050 C
8 2007 37312 C
import numpy as np

#add rows for the missing years within each group using reindex
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
         .reset_index(name='val'))
print (df1)
Id year val
0 A 2015 A
1 A 2016 A
2 A 2017 NaN
3 A 2018 NaN
4 A 2019 A
5 A 2020 NaN
6 A 2021 NaN
7 A 2022 NaN
8 A 2023 A
9 B 2000 B
10 B 2001 NaN
11 B 2002 B
12 C 2003 C
13 C 2004 C
14 C 2005 NaN
15 C 2006 NaN
16 C 2007 C
#boolean mask marking the non-NaN rows, stored for reuse
m = df1['val'].notnull().rename('g')
#create an index from the cumulative sum so each run of consecutive NaNs gets a unique group
df1.index = m.cumsum()
#keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
          .agg(['first','last','size'])
          .reset_index(level=1, drop=True)
          .reset_index())
print (df2)
Id first last size
0 A 2017 2018 2
1 A 2020 2022 3
2 B 2001 2001 1
3 C 2005 2006 2
EDIT:
#convert to datetimes
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
#resample by start of months with asfreq
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .resample('MS')
         .asfreq()
         .rename('val')
         .reset_index())
print (df1)
Id year val
0 A 2015-05-01 A
1 A 2015-06-01 NaN
2 A 2015-07-01 A
3 A 2015-08-01 NaN
4 A 2015-09-01 A
5 B 2000-01-01 B
6 B 2000-02-01 NaN
7 B 2000-03-01 B
8 C 2003-01-01 C
9 C 2003-02-01 C
10 C 2003-03-01 NaN
11 C 2003-04-01 NaN
12 C 2003-05-01 C
m = df1['val'].notnull().rename('g')
#create an index from the cumulative sum so each run of consecutive NaNs gets a unique group
df1.index = m.cumsum()
#keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
          .agg(['first','last','size'])
          .reset_index(level=1, drop=True)
          .reset_index())
print (df2)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
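The question's data is actually quarterly, so here is a sketch of the same idea adapted to quarters, assuming the year column holds strings like '2019Q4':

import pandas as pd

#parse the quarter strings into Periods
df['year'] = pd.PeriodIndex(df['year'], freq='Q')

#add rows for the missing quarters within each Id using a quarterly period_range
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(pd.period_range(x.index.min(), x.index.max(),
                                                    freq='Q', name='year')))
         .reset_index(name='val'))

#then reuse the same NaN-run aggregation as above
m = df1['val'].notnull().rename('g')
df1.index = m.cumsum()
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
          .agg(['first','last','size'])
          .reset_index(level=1, drop=True)
          .reset_index())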