Leading and Trailing Padding Dates in Pandas DataFrame - python

This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# date field a datetime.datetime values
account_id amount
date
2018-01-01 1 100.0
2018-01-01 1 50.0
2018-06-01 1 200.0
2018-07-01 2 100.0
2018-10-01 2 200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty dates". I have tried to reindex on a date_range and period_range, I have tried to merge another index. I have tried all sorts of things all day, and I have read alot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, and then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value.
This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
account_id amount
account_id date date date
1 2018 1 2018-01-31 2 150.0
6 2018-06-30 1 200.0
2 2018 7 2018-07-31 2 100.0
10 2018-10-31 2 200.0
And this is what I want to achieve (basically reindex the data by date_range(start='2018-01-01', period=12, freq='M')
(Ideally I would want the month to be transposed by year across the top as columns)
amount
account_id Year Month
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
....
12 200.0
2 2018 1 NaN
....
7 100.0
....
10 200.0
....
12 NaN

One way is to reindex
s=df.groupby([df['account_id'],df.index.year,df.index.month]).sum()
idx=pd.MultiIndex.from_product([s.index.levels[0],s.index.levels[1],list(range(1,13))])
s=s.reindex(idx)
s
Out[287]:
amount
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
2 2018 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 100.0
8 NaN
9 NaN
10 200.0
11 NaN
12 NaN

Related

Reorganizing pandas dataframe turning Column into new Header, Original Header to be part of multiindex with a prexisting Column

I have been tasked with reorganizing a fairly large data set for analysis. I want to make a dataframe where each employee has a list of Stats associated with their Employee Number ordered based on how many periods they have been with the company. The data does not go all the way back to the start of the company so some employees will not appear in the first period. My guess is there's some combination of pivot and merge that I am unable to wrap my head around.
df1 looks like this:
Periods since Start Period Employee Number Wage Sick Days
0 3 202001 101 20 14
1 2 202001 102 15 12
2 1 202001 103 10 17
3 4 202002 101 20 14
4 3 202002 102 20 10
5 2 202002 103 10 13
6 5 202003 101 25 13
7 4 202003 102 20 9
8 3 202003 103 10 13
And I want df2 (Column# for reference only):
Column1 Column2 Column3 Column4 Column5
101 102 103
1 Wage NaN NaN 10
1 Sick Days NaN NaN 17
2 Wage NaN 15 10
2 Sick Days NaN 12 13
3 Wage 20 20 10
3 Sick Days 14 10 13
4 Wage 20 20 NaN
4 Sick Days 14 9 NaN
Column1 = 'Periods since Start'
Column2 = "Stat" e.g. 'Wage', 'Sick Days'
Column3 - Column 5 Headers = 'Employee Number'
First thoughts were to try pivot/merge/stack but I have had no good results.
The second option I thought of was to create a dataframe with the index and headers that I wanted and then populate it from df1
import pandas as pd
import numpy as np
stat_list = ['Wage', 'Sick Days']
largest_period = df1['Periods since Start'].max()
df2 = np.tile(stat_list, largest_period)
df2 = pd.DataFrame(data=df2, columns = ['Stat'])
df2['Period_Number'] = df2.groupby('Stat').cumcount()+1
df2 = pd.DataFrame(index = df2[['Period_Number', 'Stat']],
columns = df1['Employee Number'])
Which Yields:
Employee Number 101 102 103
(1, 'Wage') NaN NaN NaN
(1, 'Sick Days') NaN NaN NaN
(2, 'Wage') NaN NaN NaN
(2, 'Sick Days') NaN NaN NaN
(3, 'Wage') NaN NaN NaN
(3, 'Sick Days') NaN NaN NaN
(4, 'Wage') NaN NaN NaN
(4, 'Sick Days') NaN NaN NaN
But I am at a loss on how to populate it.
You can .melt and then .unstack the dataframe.
Finish up up with some multiindex column clean up using .droplevel and passing axis=1 to drop unnecessary levels on columns rather than the default axis=0, which would drop index columns. You can also use reset_index() to bring the index columns into your dataframe:
df = (df.melt(id_vars=['Periods since Start', 'Employee Number'],
value_vars=['Wage', 'Sick Days'])
.set_index(['Periods since Start', 'Employee Number', 'variable']).unstack(1)
.droplevel(0, axis=1)
.reset_index())
df
Out[1]:
Employee Number Periods since Start variable 101 102 103
0 1 Sick Days NaN NaN 17.0
1 1 Wage NaN NaN 10.0
2 2 Sick Days NaN 12.0 13.0
3 2 Wage NaN 15.0 10.0
4 3 Sick Days 14.0 10.0 13.0
5 3 Wage 20.0 20.0 10.0
6 4 Sick Days 14.0 9.0 NaN
7 4 Wage 20.0 20.0 NaN
8 5 Sick Days 13.0 NaN NaN
9 5 Wage 25.0 NaN NaN
When melting the dataframe, you can pass var_name= as the default is "variable". If you do that make sure to change the column name when using set_index() as well.
Try this, first melt the dataframe keeping Periods since Start, Employee Number, and Period in the index. Next, pivot the dataframe making rows and columns with 'value' from melt the values in the pivoted dataframe. Lastly, cleanup index with reset_index and remove the column index header name using rename_axis:
df.melt(['Periods since Start', 'Employee Number', 'Period'])\
.pivot(['Periods since Start', 'variable'], 'Employee Number', 'value')\
.reset_index()\
.rename_axis(None, axis=1)
Output:
Periods since Start variable 101 102 103
0 1 Sick Days NaN NaN 17.0
1 1 Wage NaN NaN 10.0
2 2 Sick Days NaN 12.0 13.0
3 2 Wage NaN 15.0 10.0
4 3 Sick Days 14.0 10.0 13.0
5 3 Wage 20.0 20.0 10.0
6 4 Sick Days 14.0 9.0 NaN
7 4 Wage 20.0 20.0 NaN
8 5 Sick Days 13.0 NaN NaN
9 5 Wage 25.0 NaN NaN

Adding column in pandas based on values from other columns with conditions

I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sales and NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment I am doing some grouping on the unit, keeping those that have several rows, then extracting the information for these units that are associated with the minimal date. Then joining this table with my original table keeping only the rows that have a different date in the 2 tables that have been merged.
I feel like there is a much simple way to do this but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix and join to append new DataFrame to original:
#if real data are not sorted
#df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print (df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN

Group by and find sum for groups but return NaN as NaN, not 0

I have a dataframe where each unique group has 4 rows.
So I need to group by columns that makes them unique and does some aggregations such as max, min, sum and average.
But the problem is that I have for some group all NaN values (in some column) and returns me a 0. Is it possible to return me a NaN?
For example:
df
time id el conn column1 column2 column3
2018-02-11 14:00:00 1 a 12 8 5 NaN
2018-02-11 14:00:00 1 a 12 1 NaN NaN
2018-02-11 14:00:00 1 a 12 3 7 NaN
2018-02-11 14:00:00 1 a 12 4 12 NaN
2018-02-11 14:00:00 2 a 5 NaN 5 5
2018-02-11 14:00:00 2 a 5 NaN 3 2
2018-02-11 14:00:00 2 a 5 NaN NaN 6
2018-02-11 14:00:00 2 a 5 NaN 7 NaN
So, for example, I need to groupby ('id', 'el', 'conn') and find sum for column1, column3 and column2. (In real case I have a lot more columns need to be performed aggregation on).
I have tried a few ways: .sum(), .transform('sum'), but returns me a zero for group with all NaN values.
Desired output:
time id el conn column1 column2 column3
2018-02-11 14:00:00 1 a 12 16 24 NaN
2018-02-11 14:00:00 2 a 5 NaN 15 13
Any help is welcomed.
Change parameter min_count to 1 - this working in last pandas version 0.22.0:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 1. This means the sum or product of an all-NA or empty series is NaN.
df = df.groupby(['time','id', 'el', 'conn'], as_index=False).sum(min_count=1)
print (df)
time id el conn column1 column2 column3
0 2018-02-11 14:00:00 1 a 12 16.0 24.0 NaN
1 2018-02-11 14:00:00 2 a 5 NaN 15.0 13.0
I think it should be something like this.
df.groupby(['time','id','el','conn']).sum()
Output in Python 2:
Some little tutorial for groupby I find interesting in these cases:
https://chrisalbon.com/python/data_wrangling/pandas_apply_operations_to_groups/
https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm

Merge pandas df based on 2 keys

I have 2 df and I would like to merge them based on 2 keys - ID and date:
I following is just a small slice of the entire df
df_pw6
ID date pw10_0 pw50_0 pw90_0
0 153 2018-01-08 27.88590 43.2872 58.2024
0 2 2018-01-05 11.03610 21.4879 31.6997
0 506 2018-01-08 6.98468 25.3899 45.9486
df_ex
date ID measure f188 f187 f186 f185
0 2017-07-03 501 NaN 1 0.5 7 4.0
1 2017-07-03 502 NaN 0 2.5 5 3.0
2 2018-01-08 506 NaN 5 9.0 9 1.2
As you can see, only the third row has a match.
When I type:
#check date
df_ex.iloc[2,0]== df_pw6.iloc[1,1]
True
#check ID
df_ex.iloc[2,1] == df_pw6.iloc[2,0]
True
Now I try to merge them:
df19 = pd.merge(df_pw6,df_ex,on=['date','ID'])
I get an empty df
When I try:
df19 = pd.merge(df_pw6,df_ex,how ='left',on=['date','ID'])
I get:
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 153 2018-01-08 00:00:00 27.88590 43.2872 58.2024 NaN NaN NaN NaN NaN
1 2 2018-01-05 00:00:00 11.03610 21.4879 31.6997 NaN NaN NaN NaN NaN
2 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN NaN NaN NaN NaN
My desired result should be:
> ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
>
> 0 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2
I have run your codes post your edit, and I succeeded in getting the desired result.
import pandas as pd
# copy paste your first df by hand
pw = pd.read_clipboard()
# copy paste your second df by hand
ex = pd.read_clipboard()
pd.merge(pw,ex,on=['date','ID'])
# output [edited. now it is the correct result OP wanted.]
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 506 2018-01-08 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2

Elegant resample for groups in Pandas

For a given pandas data frame called full_df which looks like
index id timestamp data
------- ---- ------------ ------
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-05-01 9.0
The start and end dates (and the time delta between start and end) are varying.
But I need a id wise resampled version (added rows marked with *)
index id timestamp data
------- ---- ------------ ------ ----
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-03-01 NaN *
4 1 2017-04-01 13.0
5 2 2017-02-01 1.0
6 2 2017-03-01 2.0
7 2 2017-04-01 NaN *
8 2 2017-05-01 9.0
Because the dataset is very large I was wondering if there is more efficient way of doing so than
Do full_df.groupby('id')
Do for each group df
df.index = pd.DatetimeIndex(df['timestamp'])
all_days = pd.date_range(df.index.min(), df.index.max(), freq='MS')
df = df.reindex(all_days)
Combine all groups again with a new index
That's time consuming and not very elegant. Any ideas?
Using resample
In [1175]: (df.set_index('timestamp').groupby('id').resample('MS').asfreq()
.drop(['id', 'index'], 1).reset_index())
Out[1175]:
id timestamp data
0 1 2017-01-01 10.0
1 1 2017-02-01 11.0
2 1 2017-03-01 NaN
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-04-01 NaN
7 2 2017-05-01 9.0
Details
In [1176]: df
Out[1176]:
index id timestamp data
0 1 1 2017-01-01 10.0
1 2 1 2017-02-01 11.0
2 3 1 2017-04-01 13.0
3 4 2 2017-02-01 1.0
4 5 2 2017-03-01 2.0
5 6 2 2017-05-01 9.0
In [1177]: df.dtypes
Out[1177]:
index int64
id int64
timestamp datetime64[ns]
data float64
dtype: object
Edit to add: this way does the min/max of dates for full_df, not df. If there wide variation in start/end dates between IDs this will unfortunately inflate the dataframe and #JohnGalt method is better. Nevertheless I'll leave this here as an alternate approach as it ought to be faster than groupby/resample for cases where it is appropriate.
I think the most efficient approach is likely going to be with stack/unstack or melt/pivot.
You could do something like this, for example:
full_df.set_index(['timestamp','id']).unstack('id').stack('id',dropna=False)
index data
timestamp id
2017-01-01 1 1.0 10.0
2 NaN NaN
2017-02-01 1 2.0 11.0
2 4.0 1.0
2017-03-01 1 NaN NaN
2 5.0 2.0
2017-04-01 1 3.0 13.0
2 NaN NaN
2017-05-01 1 NaN NaN
2 6.0 9.0
Just add reset_index().set_index('id') if you want it to display more like how you have it above. Note in particular the use of dropna=False with stack which preserves the NaN placeholders. Without that, the stack/unstack method just leaves you back where you started.
This method automatically includes the min & max dates, and all dates present for at least one timestamp. If there are interior timestamps missing for everyone, then you need to add a resample like this:
full_df.set_index(['timestamp','id']).unstack('id')\
.resample('MS').mean()\
.stack('id',dropna=False)

Categories