Python/Pandas: dataframe merge and fillna

I'm trying to merge two Pandas dataframes on date and then forward-fill the NaN values, but only up to a specific date. I have the following example data:
df_1:

date   value1
01/12  10
02/12  20
03/12  30
04/12  40
05/12  50
06/12  60
07/12  70

df_2:

date   value2
01/12  100
03/12  300
05/12  500
I use the following line:
df = pd.merge(df_1, df_2, how='left', on=['date'])
I get this:

date   value1  value2
01/12  10      100
02/12  20      NaN
03/12  30      300
04/12  40      NaN
05/12  50      500
06/12  60      NaN
07/12  70      NaN
What I want to achieve is to forward-fill the NaN values in df['value2'] up to 05/12 and not all the way to 07/12.

First, convert date to datetime format so you can use a conditional comparison; pd.to_datetime will display the values as YYYY-MM-DD by default.
Next, create a mask for your condition (rows up to 05/12) and use loc to fill only those rows.
Lastly, convert date back from datetime to string.
df['date'] = pd.to_datetime(df['date'], format='%d/%m')
# mask the rows before the cutoff date
mask = df['date'].lt(pd.to_datetime('05/12', format='%d/%m'))
# forward-fill value2 only on the masked rows
df.loc[mask, 'value2'] = df.loc[mask, 'value2'].ffill()
# restore the original dd/mm string format
df['date'] = df['date'].dt.strftime('%d/%m')
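If you'd rather not hard-code the cutoff, a minimal sketch of an alternative (assuming the default RangeIndex): fill only up to the last row where value2 actually has data.
# last_valid_index gives the label of the last non-NaN value2 row
last = df['value2'].last_valid_index()
df.loc[:last, 'value2'] = df.loc[:last, 'value2'].ffill()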

Related

Pandas - Compute sum of a column as week-wise columns

I have a table like below containing values for multiple IDs:
ID  value  date
1   20     2022-01-01 12:20
2   25     2022-01-04 18:20
1   10     2022-01-04 11:20
1   150    2022-01-06 16:20
2   200    2022-01-08 13:20
3   40     2022-01-04 21:20
1   75     2022-01-09 08:20
I would like to calculate the week-wise sum of values for all IDs:
The start date is given (for example, 01-01-2022).
Weeks are calculated based on the range every Saturday 00:00 to the next Friday 23:59 (i.e. Week 1 is from 01-01-2022 00:00 to 07-01-2022 23:59)
ID  Week 1 sum  Week 2 sum  Week 3 sum  ...
1   180         75          --          --
2   25          200         --          --
3   40          --          --          --
There's a pandas function (pd.Grouper) that lets you specify a groupby instruction.1 In this case, that specification is to "resample" date at a weekly frequency that ends on Fridays.2 Since you also need to group by ID as well, add it alongside the grouper.
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# pivot the dataframe: ID down the rows, Friday-anchored weeks across the columns
df1 = (
    df.groupby(['ID', pd.Grouper(key='date', freq='W-FRI')])['value'].sum()
      .unstack(fill_value=0)
)
# rename columns
df1.columns = [f"Week {c} sum" for c in range(1, df1.shape[1] + 1)]
df1 = df1.reset_index()
1 What you actually need is a pivot_table result, but groupby + unstack is equivalent to pivot_table and is more convenient here.
2 Because Jan 1, 2022 is a Saturday, you need to anchor the weekly bins on Friday (freq='W-FRI' labels each week by the Friday that ends it).
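For the sample data above, this should print something like the following (weeks with no rows show 0 because of fill_value=0):
print(df1)
   ID  Week 1 sum  Week 2 sum
0   1         180          75
1   2          25         200
2   3          40           0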
You can compute a week column. If all of your data comes from the same year you can extract just the week number, but that is unlikely in real scenarios; with data from multiple years it is wiser to derive a combination of year and week number:
df['Year-Week'] = df['date'].dt.strftime('%Y-%U')
Under the Saturday-to-Friday scheme you described, both 2022-01-01 and 2022-01-04 18:20 should map to week 2022-01. Note that %U numbers weeks from Sunday, so check the boundary dates against your scheme.
To calculate your pivot table, you can use pandas pivot_table. Example code:
pd.pivot_table(df, values='value', index=['ID'], columns=['Year-Week'], aggfunc='sum')
Let's define a formatting helper.
def fmt(row):
    return f"{row.year}-{row.week:02d}"  # we ignore row.day
Now it's easy.
>>> df = pd.DataFrame([dict(id=1, value=20, date="2022-01-01 12:20"),
...                    dict(id=2, value=25, date="2022-01-04 18:20")])
>>> df['date'] = pd.to_datetime(df.date)
>>> df['iso'] = df.date.dt.isocalendar().apply(fmt, axis='columns')
>>> df
   id  value                date      iso
0   1     20 2022-01-01 12:20:00  2021-52
1   2     25 2022-01-04 18:20:00  2022-01
Just groupby the ISO week.
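A minimal sketch of that last step, reusing the iso column built above (unstack just spreads the weeks into columns):
>>> df.groupby(['id', 'iso'])['value'].sum().unstack(fill_value=0)
iso  2021-52  2022-01
id
1         20        0
2          0       25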

Pandas: Filter or Groupby then transform to select the last row

This post references one of my earlier posts on SO.
Just to reiterate, I have a dataframe df as
Date        Group  Value  Duration
2018-01-01  A      20     30
2018-02-01  A      10     60
2018-03-01  A      25     88     <----- last row for Group A
2018-01-01  B      15     180
2018-02-01  B      30     210
2018-03-01  B      25     238    <----- last row of Group B
Considering the last row of each Group: if its Duration value is less than 90, we omit that group. So my resulting dataframe df_final should look like
Date        Group  Value  Duration
2018-01-01  B      15     180
2018-02-01  B      30     210
2018-03-01  B      25     238
There are two ways we can approach this problem.
The first is the filter method:
df.groupby('Group').filter(lambda x: x.Duration.max()>=90)
The second is the groupby.transform method:
df = df[df.groupby('Group')['Duration'].transform('last') >= 90]
But I want to filter by the Date column and NOT by Duration. I get the correct result with the following code:
df_interim = df.loc[(df['Date']=='2018-03-01') & (df['Duration'] >= 90)]
df_final = df.merge(df_interim[['Group','Date']], on='Group', how='right').reset_index()
In the above code, I have hard-coded the Date.
My question is: how can I dynamically select the last date in the dataframe, and then perform the filter or groupby.transform on Group?
Any clue?
We can select the last date per group with transform as well:
lastd = df.groupby('Group')['Date'].transform('max')
df_interim = df.loc[(df['Date'] == lastd) & (df['Duration'] >= 90)]
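From there you can reuse the merge from your question to keep every row of the surviving groups:
# right-join back onto the groups that passed the filter (same as your own code)
df_final = df.merge(df_interim[['Group','Date']], on='Group', how='right').reset_index()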
I think you need to first filter for the maximum Date per group with DataFrameGroupBy.idxmax, then select those rows with DataFrame.loc to keep all columns:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.loc[df.groupby('Group')['Date'].idxmax()]
print (df1)
        Date Group  Value  Duration
2 2018-03-01     A     25        88
5 2018-03-01     B     25       238
Then filter by Duration only the rows with the maximal Date:
g = df1.loc[df1['Duration'] >= 90, 'Group']
print (g)
5    B
Name: Group, dtype: object
And last, filter the original Group column by Series.isin with boolean indexing:
df = df[df['Group'].isin(g)]
print (df)
        Date Group  Value  Duration
3 2018-01-01     B     15       180
4 2018-02-01     B     30       210
5 2018-03-01     B     25       238

Group by and fill missing datetime values

What I'm trying to do is group a Pandas dataframe by contract and date, and fill in missing datetime values.
My input is this:
contract  datetime             value1  value2
x         2019-01-01 00:00:00  50      60
x         2019-01-01 01:00:00  30      60
x         2019-01-01 02:00:00  70      80
y         2019-01-01 00:00:00  30      100
What I want to do is to have all possible datetimes (from 00:00:00 to 23:00:00) for each contract, and fill missing values with NaN or None.
Thank you very much.
You can use DataFrame.reindex per group with DataFrame.groupby and a lambda function:
df['datetime'] = pd.to_datetime(df['datetime'])

# reindex each group to a full day of hourly timestamps (00:00 to 23:00)
f = lambda x: x.reindex(pd.date_range(x.index.min().floor('d'),
                                      x.index.max().floor('d') + pd.Timedelta(23, 'H'),
                                      freq='H'))

df1 = (df.set_index('datetime')
         .groupby('contract')
         .apply(f)
         .drop('contract', axis=1)
         .rename_axis(['contract', 'datetime'])  # name the index levels for reset_index
         .reset_index())
print (df1)
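A quick sanity check for the sample above; after the reindex, each contract should have a full day of 24 hourly rows:
print(df1.groupby('contract').size())
contract
x    24
y    24
dtype: int64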

Using pandas to csv, how to organize time and numerical data in a multi-level index

Using pandas to write to a CSV, I want monthly Income sums for each unique Source. Month is in datetime format.
I have tried resampling and groupby methods, but groupby neglects the month and resampling neglects the source. I currently have a multi-level index with Month and Source as indexes.
Month       Source  Income
2019-03-01  A       100
2019-03-05  B       50
2019-03-06  A       4
2019-03-22  C       60
2019-04-23  A       40
2019-04-24  A       100
2019-04-24  C       30
2019-06-01  C       100
2019-06-01  B       90
2019-06-08  B       20
2019-06-12  A       50
2019-06-27  C       50
I can groupby Source which neglects date, or I can resample for date which neglects source. I want monthly sums for each unique source.
What you have in the Month column is a Timestamp, so you can pull the month attribute out of each Timestamp and then apply the groupby method, like this:
df.columns = ['Timestamp', 'Source', 'Income']
# extract the month number from each Timestamp
df['Month'] = df['Timestamp'].dt.month
df1 = df.groupby(['Month', 'Source'])[['Income']].sum()
The output should look like this:
              Income
Month Source
3     A          104
      B           50
      C           60
4     A          140
      C           30
6     A           50
      B          110
      C          150
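Note that a bare month number would merge January 2019 with January 2020. A hedged alternative sketch that keeps the year, assuming Timestamp is already datetime:
# to_period('M') yields year-month periods such as 2019-03
df1 = df.groupby([df['Timestamp'].dt.to_period('M'), 'Source'])[['Income']].sum()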

(python) Using diff() function in a DataFrame

How can I use the diff() function so that the result resets to zero whenever the date in the current row differs from the date in the previous row?
For instance, I have the df below containing ts and value. When generating value_diff I can use:
df['value_diff'] = df.value.diff()
but in this case the row at index 4 will have value_diff = 200, and I need it reset to zero because the date has changed.
i  ts                       value  value_diff
0  2019-01-02 11:48:01.001  100    0
1  2019-01-02 14:26:01.001  150    50
2  2019-01-02 16:12:01.001  75     -75
3  2019-01-02 18:54:01.001  50     -25
4  2019-01-03 09:12:01.001  250    0
5  2019-01-03 12:25:01.001  310    60
6  2019-01-03 16:50:01.001  45     -265
7  2019-01-03 17:10:01.001  30     -15
I know I could build a loop for it, but I was wondering if it can be solved in a fancier way, maybe using lambda functions.
You want to use groupby and then fillna to get the 0 values.
import pandas as pd

# Read your example back from the clipboard; the index column shifts the
# date into 'i' and the time into 'ts', so stitch them back together
df = pd.read_clipboard()
df['ts'] = df['i'] + ' ' + df['ts']
df.drop(['i', 'value_diff'], axis=1, inplace=True)
# Now we have your original
print(df.head())
# Convert ts to datetime
df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True)
# Add a date column for us to group by
df['date'] = df['ts'].dt.date
# Apply diff per date; fillna turns each date's leading NaN into 0
df['value_diff'] = df.groupby('date')['value'].diff().fillna(0)
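If ts is already a datetime column, a minimal self-contained sketch of the same idea (data abbreviated from the question):
import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime(['2019-01-02 11:48:01', '2019-01-02 14:26:01',
                          '2019-01-03 09:12:01', '2019-01-03 12:25:01']),
    'value': [100, 150, 250, 310],
})
# diff within each calendar date; the first row per date becomes NaN -> 0
df['value_diff'] = df.groupby(df['ts'].dt.date)['value'].diff().fillna(0)
print(df)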
