pandas shifting missing months - python

Let's assume the following dataframe and shift operation:
import pandas as pd

d = {'col1': ['2022-01-01','2022-02-01','2022-03-01','2022-05-01'], 'col2': [1,2,3,4]}
df = pd.DataFrame(d)
df['shifted'] = df['col2'].shift(1, fill_value=0)
I want to create a column containing the previous month's value, filling in 0 for months that do not exist, so the desired result would look like:
col1        col2  shifted
2022-01-01     1        0
2022-02-01     2        1
2022-03-01     3        2
2022-05-01     4        0
So in the last line the value is 0 because there is no data for April.
But at the moment it looks like this:
col1        col2  shifted
2022-01-01     1        0
2022-02-01     2        1
2022-03-01     3        2
2022-05-01     4        3
Does anyone know how to achieve this?

One idea is to create a monthly PeriodIndex so it is possible to shift by months, then replace the missing values with 0:
df = df.set_index(pd.to_datetime(df['col1']).dt.to_period('M'))
df['shifted'] = df['col2'].shift(1, freq='M').reindex(df.index, fill_value=0)
print(df)
col1 col2 shifted
col1
2022-01 2022-01-01 1 0
2022-02 2022-02-01 2 1
2022-03 2022-03-01 3 2
2022-05 2022-05-01 4 0
Finally, it is possible to remove the PeriodIndex:
df = df.reset_index(drop=True)
print(df)
col1 col2 shifted
0 2022-01-01 1 0
1 2022-02-01 2 1
2 2022-03-01 3 2
3 2022-05-01 4 0
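A hedged alternative that avoids changing the index (a sketch, assuming pandas is imported as pd and df is the original frame): build a Period-indexed lookup Series and map each row's previous month onto it.
# Look up last month's value per row; months missing from the data become 0.
per = pd.to_datetime(df['col1']).dt.to_period('M')
lookup = pd.Series(df['col2'].values, index=per)
df['shifted'] = (per - 1).map(lookup).fillna(0).astype(int)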

Related

Count number of days in each continuous period pandas

Suppose I have the following df N03_zero (date_code is already datetime):
item_code date_code
8028558104973 2022-01-01
8028558104973 2022-01-02
8028558104973 2022-01-03
8028558104973 2022-01-06
8028558104973 2022-01-07
7622300443269 2022-01-01
7622300443269 2022-01-10
7622300443269 2022-01-11
513082 2022-01-01
513082 2022-01-02
513082 2022-01-03
Millions of rows with date_code assigned to some item_code.
I am trying to get the number of days in each continuous period for each item_code; other similar questions haven't helped me.
The expected df should be:
item_code continuous_days
8028558104973 3
8028558104973 2
7622300443269 1
7622300443269 2
513082 3
Once the sequence of days breaks, it should count the days in that sequence and then start counting again.
The aim is then to be able to get a dataframe with the count, min, max, and mean for each item_code.
Like this:
item_code no. periods min max mean
8028558104973 2 2 3 2.5
7622300443269 2 1 2 1.5
513082 1 3 3 3
Any suggestions?
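For a reproducible run, here is a hedged constructor for the sample above (keeping item_code as plain int is an assumption):
import pandas as pd

# Sample frame from the question; date_code is built as datetime directly.
df = pd.DataFrame({
    'item_code': [8028558104973] * 5 + [7622300443269] * 3 + [513082] * 3,
    'date_code': pd.to_datetime(
        ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-06', '2022-01-07',
         '2022-01-01', '2022-01-10', '2022-01-11',
         '2022-01-01', '2022-01-02', '2022-01-03']),
})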
For consecutive days, take the difference with Series.diff, convert it to days with Series.dt.days, flag breaks where it is not equal to 1 with Series.ne, and form group ids with the cumulative sum Series.cumsum; then use GroupBy.size, remove the second index level with DataFrame.droplevel, and create the DataFrame:
df['date_code'] = pd.to_datetime(df['date_code'])
df1 = (df.groupby(['item_code', df['date_code'].diff().dt.days.ne(1).cumsum()], sort=False)
         .size()
         .droplevel(1)
         .reset_index(name='continuous_days'))
print(df1)
item_code continuous_days
0 8028558104973 3
1 8028558104973 2
2 7622300443269 1
3 7622300443269 2
4 513082 3
Then aggregate the values with named aggregation via GroupBy.agg:
df2 = (df1.groupby('item_code', sort=False, as_index=False)
          .agg(**{'no. periods': ('continuous_days', 'size'),
                  'min': ('continuous_days', 'min'),
                  'max': ('continuous_days', 'max'),
                  'mean': ('continuous_days', 'mean')}))
print(df2)
item_code no. periods min max mean
0 8028558104973 2 2 3 2.5
1 7622300443269 2 1 2 1.5
2 513082 1 3 3 3.0
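An equivalent sketch without named aggregation; the rename afterwards is only needed because of the space in 'no. periods':
# Aggregate the single column once, then rename 'size' to the desired label.
df2 = (df1.groupby('item_code', sort=False)['continuous_days']
          .agg(['size', 'min', 'max', 'mean'])
          .rename(columns={'size': 'no. periods'})
          .reset_index())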

Combine consecutive rows of unsorted dates (one day before, one day after, or the same day) into one [duplicate]

I would like to combine rows with the same id, consecutive dates, and the same feature values.
I have the following dataframe:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-15 1 1
1 A 2020-01-16 2020-01-30 1 1
2 A 2020-01-31 2020-02-15 0 1
3 A 2020-07-01 2020-07-15 0 1
4 B 2020-01-31 2020-02-15 0 0
5 B 2020-02-16 NaT 0 0
And the expected result is:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
I have been trying answers from other posts, but they don't really match my use case.
Thanks in advance!
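For reference, a hedged constructor for the frame above (None for the open-ended End is an assumption; pd.to_datetime turns it into NaT):
import pandas as pd

# Sample frame from the question; dtypes follow the conversion step below.
df = pd.DataFrame({
    'Id': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Start': ['2020-01-01', '2020-01-16', '2020-01-31', '2020-07-01', '2020-01-31', '2020-02-16'],
    'End': ['2020-01-15', '2020-01-30', '2020-02-15', '2020-07-15', '2020-02-15', None],
    'Feature1': [1, 1, 0, 0, 0, 0],
    'Feature2': [1, 1, 1, 1, 0, 0],
})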
You can approach this by:
Get the day diff of each consecutive pair of entries within the same group by subtracting the previous End (obtained with GroupBy.shift()) from the current Start.
Set a group number group_no such that a new group number is issued when the day diff from the previous entry within the group is greater than 1.
Then group by Id and group_no and aggregate the Start and End dates for each group using .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False during grouping. Furthermore, to get the last entry of End within the group, we use x.iloc[-1] instead of 'last'.
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])
day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()
df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first',
                  }))
Result:
print(df_out)
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
Extract the month from both date columns:
df['sMonth'] = pd.to_datetime(df['Start']).dt.month
df['eMonth'] = pd.to_datetime(df['End']).dt.month
Now group the dataframe by ['Id','Feature1','Feature2','sMonth','eMonth'] and we get the result:
(df.groupby(['Id','Feature1','Feature2','sMonth','eMonth'])
   .agg({'Start': 'min', 'End': 'max'})
   .reset_index()
   .drop(['sMonth','eMonth'], axis=1))
Result
Id Feature1 Feature2 Start End
0 A 0 1 2020-01-31 2020-02-15
1 A 0 1 2020-07-01 2020-07-15
2 A 1 1 2020-01-01 2020-01-30
3 B 0 0 2020-01-31 2020-02-15

Python: concat rows of two dataframes where not all columns are the same

I have two dataframes:
EDIT:
df1 = pd.DataFrame(index = [0,1,2], columns=['timestamp', 'order_id', 'account_id', 'USD', 'CAD'])
df1['timestamp']=['2022-01-01','2022-01-02','2022-01-03']
df1['account_id']=['usdcad','usdcad','usdcad']
df1['order_id']=['11233123','12313213','12341242']
df1['USD'] = [1,2,3]
df1['CAD'] = [4,5,6]
df1:
timestamp account_id order_id USD CAD
0 2022-01-01 usdcad 11233123 1 4
1 2022-01-02 usdcad 12313213 2 5
2 2022-01-03 usdcad 12341242 3 6
df2 = pd.DataFrame(index = [0,1], columns = ['timestamp','account_id', 'currency','balance'])
df2['timestamp']=['2021-12-21','2021-12-21']
df2['account_id']=['usdcad','usdcad']
df2['currency'] = ['USD', 'CAD']
df2['balance'] = [2,3]
df2:
timestamp account_id currency balance
0 2021-12-21 usdcad USD 2
1 2021-12-21 usdcad CAD 3
I would like to add a row to df1 at index 0, and fill that row with the balance of df2 based on currency. So the final df should look like this:
df:
timestamp account_id order_id USD CAD
0 0 0 0 2 3
1 2022-01-01 usdcad 11233123 1 4
2 2022-01-02 usdcad 12313213 2 5
3 2022-01-03 usdcad 12341242 3 6
How can I do this in a pythonic way? Thank you
Set the index of df2 to currency, keep the balance row, and transpose it so the currencies become columns; then concatenate this dataframe with df1 (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df_out = pd.concat([df2.set_index('currency')[['balance']].T, df1],
                   ignore_index=True).fillna(0)
print(df_out)
  USD CAD   timestamp  order_id account_id
0   2   3           0         0          0
1   1   4  2022-01-01  11233123     usdcad
2   2   5  2022-01-02  12313213     usdcad
3   3   6  2022-01-03  12341242     usdcad
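If the column order from the question's desired output is wanted, a small hedged follow-up:
# Reorder columns to match the desired layout from the question.
df_out = df_out[['timestamp', 'account_id', 'order_id', 'USD', 'CAD']]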

how to zip and also melt any number of columns in python

My table looks like this:
no type 2020-01-01 2020-01-02 2020-01-03 ...
1  x    1          2          3
2  b    4          3          0
What I want to do is melt the date columns and their values into separate new columns. I have done it, but I had to specify the columns to melt, as in the script below:
cols_dict = dict(zip(df.iloc[:, 3:100].columns, df.iloc[:, 3:100].values[0]))
id_vars = [col for col in df.columns if isinstance(col, str)]
df = df.melt(id_vars=id_vars, var_name='date', value_name='value')
The expected result I want is:
no type date value
1 x 2020-01-01 1
1 x 2020-01-02 2
1 x 2020-01-03 3
2 b 2020-01-01 4
2 b 2020-01-02 3
2 b 2020-01-03 0
I assume that date columns will keep being added to the data frame as time goes by, so my script would no longer work once there are more than 100 date columns.
How should I write my script so it handles any number of date columns in the future, given that my current script can only access up to column number 100?
Thanks in advance.
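A hedged constructor for the sample (string labels for the date columns are an assumption); the stack-based answer below then works for any number of date columns:
import pandas as pd

# Sample frame from the question; new date columns may be appended over time.
df = pd.DataFrame({
    'no': [1, 2],
    'type': ['x', 'b'],
    '2020-01-01': [1, 4],
    '2020-01-02': [2, 3],
    '2020-01-03': [3, 0],
})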
>>> df.set_index(["no", "type"]) \
.rename_axis(columns="date") \
.stack() \
.rename("value") \
.reset_index()
no type date value
0 1 x 2020-01-01 1
1 1 x 2020-01-02 2
2 1 x 2020-01-03 3
3 2 b 2020-01-01 4
4 2 b 2020-01-02 3
5 2 b 2020-01-03 0
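A hedged melt-based alternative (a sketch assuming 'no' and 'type' are the only identifier columns), which avoids hard-coding the 3:100 slice; the sort only reproduces the displayed row order:
# Everything except the id columns is treated as a date column.
out = (df.melt(id_vars=['no', 'type'], var_name='date', value_name='value')
         .sort_values(['no', 'date'], ignore_index=True))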

Pandas - Counting the number of days for group by

I want to count the number of days after grouping by 2 columns:
groups = df.groupby([df.col1,df.col2])
Now I want to count the number of days relevant for each group:
result = groups['date_time'].dt.date.nunique()
I'm using something similar when I want to group by day, but here I get an error:
AttributeError: Cannot access attribute 'dt' of 'SeriesGroupBy' objects, try using the 'apply' method
What is the proper way to get the number of days?
You need another variation of groupby: group the converted column directly, use apply, or define a helper column first:
df['date_time'].dt.date.groupby([df.col1, df.col2]).nunique()

df.groupby(['col1','col2'])['date_time'].apply(lambda x: x.dt.date.nunique())

df['date_time1'] = df['date_time'].dt.date
a = df.groupby([df.col1, df.col2]).date_time1.nunique()
Sample:
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10, freq='15H')
df = pd.DataFrame({'date_time': rng, 'col1': [0]*5 + [1]*5, 'col2': [2]*3 + [3]*4+ [4]*3})
print(df)
col1 col2 date_time
0 0 2 2015-02-24 00:00:00
1 0 2 2015-02-24 15:00:00
2 0 2 2015-02-25 06:00:00
3 0 3 2015-02-25 21:00:00
4 0 3 2015-02-26 12:00:00
5 1 3 2015-02-27 03:00:00
6 1 3 2015-02-27 18:00:00
7 1 4 2015-02-28 09:00:00
8 1 4 2015-03-01 00:00:00
9 1 4 2015-03-01 15:00:00
#solution with apply
df1 = df.groupby(['col1','col2'])['date_time'].apply(lambda x: x.dt.date.nunique())
print(df1)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time, dtype: int64
#create new helper column
df['date_time1'] = df['date_time'].dt.date
df2 = df.groupby([df.col1,df.col2]).date_time1.nunique()
print(df2)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time1, dtype: int64
df3 = df['date_time'].dt.date.groupby([df.col1,df.col2]).nunique()
print(df3)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time, dtype: int64
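For completeness, a hedged one-expression variant using assign (the helper column name 'day' is hypothetical):
# Build the helper column on the fly instead of mutating df.
df4 = (df.assign(day=df['date_time'].dt.date)
         .groupby(['col1', 'col2'])['day']
         .nunique())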
