I need to transform a data frame into what I think are adjacency matrices or some sort of pivot table using a datetime column. I have tried a lot of googling but haven't found anything, so any help on how to do this, or even what I should be googling, would be appreciated.
Here is a simplified version of my data:
import pandas as pd

df = pd.DataFrame({'Location': [1]*7 + [2]*7,
                   'Postcode': ['XXX XXX']*7 + ['YYY YYY']*7,
                   'Date': ['03-12-2021', '04-12-2021', '05-12-2021', '06-12-2021', '07-12-2021',
                            '08-12-2021', '09-12-2021', '03-12-2021', '04-12-2021', '05-12-2021',
                            '06-12-2021', '07-12-2021', '08-12-2021', '09-12-2021'],
                   'Var 1': [6.9, 10.2, 9.2, 7.6, 9.8, 8.6, 10.6, 9.9, 9.4, 9, 9.4, 9.1, 8, 9.9],
                   'Var 2': [14.5, 6.2, 9.7, 12.7, 14.8, 12, 12.2, 12.3, 14.2, 13.8, 11.7, 17.8,
                             10.7, 12.3]})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
Location Postcode Date Var 1 Var 2
0 1 XXX XXX 2021-12-03 6.9 14.5
1 1 XXX XXX 2021-12-04 10.2 6.2
2 1 XXX XXX 2021-12-05 9.2 9.7
3 1 XXX XXX 2021-12-06 7.6 12.7
4 1 XXX XXX 2021-12-07 9.8 14.8
5 1 XXX XXX 2021-12-08 8.6 12.0
6 1 XXX XXX 2021-12-09 10.6 12.2
7 2 YYY YYY 2021-12-03 9.9 12.3
8 2 YYY YYY 2021-12-04 9.4 14.2
9 2 YYY YYY 2021-12-05 9.0 13.8
10 2 YYY YYY 2021-12-06 9.4 11.7
11 2 YYY YYY 2021-12-07 9.1 17.8
12 2 YYY YYY 2021-12-08 8.0 10.7
13 2 YYY YYY 2021-12-09 9.9 12.3
The output I want to create shows what each variable will be +1, +2, +3, etc. days on from the Date, i.e. extra "+1 day", "+2 day", ... columns for each variable, per Location and Date. But I have no idea how or where to start. My only thought is several for loops, but in reality I have hundreds of locations and 10 variables for 14 dates each, so it is a large dataset and this would be very inefficient. I feel like there should be a function or simpler way to achieve this.
Create a DatetimeIndex and then use DataFrameGroupBy.shift, adding a suffix with DataFrame.add_suffix; format the day number with {i:02} so that 01, 02, ..., 10, 11 sort correctly as column names in the last step:
df = df.set_index('Date')

for i in range(1, 7):
    df = df.join(df.groupby('Location')[['Var 1', 'Var 2']].shift(freq=f'-{i}d')
                   .add_suffix(f'+ Day {i:02}'), on=['Location', 'Date'])

df = df.set_index(['Location', 'Postcode'], append=True).sort_index(axis=1)
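If groupby().shift(freq=...) behaves differently on your pandas version, a self-merge gives the same result. This is only a sketch, starting again from the original df (before any set_index); the column names just mirror the suffix used above:

out = df.copy()
for i in range(1, 7):
    # Shift Date back by i days, so the value observed i days later lines up with today's row.
    future = (df.assign(Date=df['Date'] - pd.Timedelta(days=i))
                .drop(columns='Postcode')
                .rename(columns={'Var 1': f'Var 1+ Day {i:02}',
                                 'Var 2': f'Var 2+ Day {i:02}'}))
    out = out.merge(future, on=['Location', 'Date'], how='left')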
I have this multiindex df:
YEARS_TMAX TMAX YEARS_TMAX TMAX YEARS_TMAX
MONTH April April August August December .....
CODE NAME
000130 RICA PLAYA 21.0 31.5 21.0 21.5 22.0
000132 PUERTO PIZARRO 12.0 33.8 12.0 32.4 11.0
000134 PAPAYAL 23.0 33.2 22.0 22.4 21.0
000135 EL SALTO 22.0 33.6 23.0 22.8 22.0
000136 CAÑAVERAL 16.0 32.7 15.0 33.1 11.0
... ... ... ... ...
158317 SUSAPAYA 19.0 17.6 19.0 17.3 21.0
158321 PALCA 16.0 19.3 17.0 19.8 16.0
158323 TALABAYA 12.0 17.6 13.0 17.5 13.0
158326 CAPAZO 17.0 13.6 17.0 13.0 19.0
158328 PAUCARANI 14.0 13.3 13.0 11.9 15.0
I want to sort columns by month name (and TMAX columns first) like this:
TMAX YEARS_TMAX TMAX YEARS_TMAX TMAX
MONTH January January February February March .....
CODE NAME
000130 RICA PLAYA 22.0 31.5 23.0 27.5 23.0
000132 PUERTO PIZARRO 17.0 32.8 18.0 30.4 18.0
000134 PAPAYAL 25.0 32.2 26.0 28.4 25.0
000135 EL SALTO 26.0 31.6 26.0 26.8 26.0
000136 CAÑAVERAL 16.0 32.7 18.0 31.1 15.0
... ... ... ... ...
158317 SUSAPAYA 19.0 17.6 19.0 17.3 21.0
158321 PALCA 16.0 19.3 17.0 19.8 16.0
158323 TALABAYA 12.0 17.6 13.0 17.5 13.0
158326 CAPAZO 17.0 13.6 17.0 13.0 19.0
158328 PAUCARANI 14.0 13.3 13.0 11.9 15.0
So I wrote this code:
source: Sort "Date" in Multi-Index
dates = pd.to_datetime(df.columns.get_level_values(1), format='%B')
df.columns = [df.columns.get_level_values(0), dates]
df = df.sort_index(axis=1, level=1)
This is meant to sort the columns by month, but dates does not contain month names; it ends up holding arbitrary dates instead.
How can I solve this?
Thanks in advance.
Use a CategoricalDtype: creating an ordered dtype from calendar.month_name ensures the correct ordering when sorting.
month_dtype = pd.CategoricalDtype(categories=list(month_name), ordered=True)
df.columns = [df.columns.get_level_values(0),
              df.columns.get_level_values(1).astype(month_dtype)]
df = df.sort_index(axis=1, level=[1, 0])
Sample Data and Imports:
from calendar import month_name
import pandas as pd
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
    columns=pd.MultiIndex.from_product([
        ['YEARS_TMAX', 'TMAX'],
        ['March', 'January', 'February']
    ])
)
df before sort:
YEARS_TMAX TMAX
March January February March January February
0 1 2 3 4 5 6
1 7 8 9 10 11 12
df after sort:
TMAX YEARS_TMAX TMAX YEARS_TMAX TMAX YEARS_TMAX
January January February February March March
0 5 2 6 3 4 1
1 11 8 12 9 10 7
The datetime approach would also work, but converting back to strings would be necessary with DatetimeIndex.strftime:
df.columns = [df.columns.get_level_values(0),
              pd.to_datetime(df.columns.get_level_values(1), format='%B')]
df = df.sort_index(axis=1, level=[1, 0])

# convert back to strings
df.columns = [df.columns.get_level_values(0),
              df.columns.get_level_values(1).strftime('%B')]
df:
TMAX YEARS_TMAX TMAX YEARS_TMAX TMAX YEARS_TMAX
January January February February March March
0 5 2 6 3 4 1
1 11 8 12 9 10 7
The drawback of this approach is that level 1 is once again a plain string type, so it would have to be converted again any time the ordering needs to change, since lexicographic ordering is not what is wanted.
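A small sketch of that drawback, using the sample df above after the datetime round-trip (and the month_dtype defined earlier): a naive re-sort is alphabetical, and calendar order only comes back after re-applying the ordered dtype.

# Level 1 is back to plain strings, so re-sorting is lexicographic:
print(df.sort_index(axis=1, level=1).columns.get_level_values(1).unique())
# -> February, January, March

# Re-applying the ordered categorical dtype restores calendar order:
df.columns = [df.columns.get_level_values(0),
              df.columns.get_level_values(1).astype(month_dtype)]
print(df.sort_index(axis=1, level=1).columns.get_level_values(1).unique())
# -> January, February, March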
I'm having trouble removing the index column in pandas after a groupby and unstack of a DataFrame.
My original DataFrame looks like this:
example = pd.DataFrame({'date': ['2016-12', '2016-12', '2017-01', '2017-01', '2017-02', '2017-02', '2017-02'],
                        'customer': [123, 456, 123, 456, 123, 456, 456],
                        'sales': [10.5, 25.2, 6.8, 23.4, 29.5, 23.5, 10.4]})
example.head(10)
output:
      date  customer  sales
0  2016-12       123   10.5
1  2016-12       456   25.2
2  2017-01       123    6.8
3  2017-01       456   23.4
4  2017-02       123   29.5
5  2017-02       456   23.5
6  2017-02       456   10.4
Note that it's possible to have multiple sales for one customer per month (like in rows 5 and 6).
My aim is to convert the DataFrame into an aggregated DataFrame like this:
customer  2016-12  2017-01  2017-02
     123     10.5      6.8     29.5
     456     25.2     23.4     33.9
My solution so far:
example = example[['date', 'customer', 'sales']].groupby(['date', 'customer']).sum().unstack('date')
example.head(10)
output:
            sales
date      2016-12  2017-01  2017-02
customer
123          10.5      6.8     29.5
456          25.2     23.4     33.9
example = example['sales'].reset_index(level=[0])
example.head(10)
output:
date  customer  2016-12  2017-01  2017-02
0          123     10.5      6.8     29.5
1          456     25.2     23.4     33.9
At this point I'm unable to remove the "date" column:
example.reset_index(drop = True)
example.head()
output:
date  customer  2016-12  2017-01  2017-02
0          123     10.5      6.8     29.5
1          456     25.2     23.4     33.9
It just stays the same. Have you got any ideas?
An alternative to your solution, but the key is just to add a rename_axis(columns=None), since "date" is the name of the columns axis:
(example[["date", "customer", "sales"]]
.groupby(["date", "customer"])
.sum()
.unstack("date")
.droplevel(0, axis="columns")
.rename_axis(columns=None)
.reset_index())
customer 2016-12 2017-01 2017-02
0 123 10.5 6.8 29.5
1 456 25.2 23.4 33.9
Why not directly go with pivot_table?
(example
 .pivot_table('sales', index='customer', columns="date", aggfunc='sum')
 .rename_axis(columns=None)
 .reset_index())
customer 2016-12 2017-01 2017-02
0 123 10.5 6.8 29.5
1 456 25.2 23.4 33.9
I have the following dataframe with 22 columns:
ID S0 S1 S2 .....
ABC 10.4 5.58
ABC 12.6
ABC 8.45
LMN 5.6
LMN 8.7
I have to ffill() the values based on groups. Intended result:
ID SS RR S2 ...
ABC 10.4 5.58
ABC 12.6 10.4 5.58
ABC 12.6 10.4 8.45
LMN 5.6
LMN 8.7 5.6
I am using the following code to get the S0, S1, ... values:
df[['Resistance', 'cumcountR']].pivot(columns='cumcountR').droplevel(0, axis=1).add_prefix('R').drop(columns='R-1.0').ffill()
A little help would be appreciated. Thanks!
Try GroupBy.ffill:
out = df.groupby('ID').ffill()
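A minimal, self-contained sketch of that suggestion. The data here is made up to resemble the question's layout, since the exact column alignment above is ambiguous:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['ABC', 'ABC', 'ABC', 'LMN', 'LMN'],
                   'S0': [10.4, np.nan, np.nan, 5.6, np.nan],
                   'S1': [5.58, 12.6, np.nan, np.nan, 8.7],
                   'S2': [np.nan, np.nan, 8.45, np.nan, np.nan]})

# Forward-fill within each ID group; the grouping column itself is not returned,
# so put it back alongside the filled value columns.
out = df.groupby('ID').ffill()
out.insert(0, 'ID', df['ID'])
print(out)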
I would like to substitute the NaN and NaT values of the Value1 column with values calculated by a function that takes as input Value2 and Value3 (if they exist) from the same row as Value1. This should be done for each ID. To do this I would use groupby and then apply, but I get an error: 'Series' objects are mutable, thus they cannot be hashed. Could you help me? Thanks in advance!
ID1 = [2002070, 2002070, 2002740, 2002740, 2003010]
ID2 = [2002070, 200800, 200800, 2002740, 2002740]
ID3 = [2002740, 2002740, 2002070, 2002070, 2003010]
Value1 = [4.5, 4.2, 3.7, 4.8, 4.4]
Value2 = [7.2, 6.4, 10, 2.3, 1.5]
Value3 = [8.4, 8.4, 8.4, 7.4, 7.4]
date1 = ['2008-05-14', '2005-12-07', '2008-10-27', '2009-04-20', '2012-03-01']
date2 = ['2005-12-07', '2003-10-10', '2004-05-14', '2011-06-03', '2015-07-05']
date3 = ['2010-10-22', '2012-03-01', '2013-11-28', '2005-12-07', '2012-03-01']

date1 = pd.to_datetime(date1)
date2 = pd.to_datetime(date2)
date3 = pd.to_datetime(date3)

df1 = pd.DataFrame({'ID': ID1, 'Value1': Value1, 'Date1': date1}).sort_values('Date1')
df2 = pd.DataFrame({'ID': ID2, 'Value2': Value2, 'Date2': date2}).sort_values('Date2')
df3 = pd.DataFrame({'ID': ID3, 'Value3': Value3, 'Date3': date3}).sort_values('Date3')

ok = df1.merge(df2, left_on=['ID', 'Date1'], right_on=['ID', 'Date2'], how='outer', sort=True)
ok1 = ok.merge(df3, left_on='ID', right_on='ID', how='inner', sort=True)
The df I obtain is this:
ID Value1 Date1 Value2 Date2 Value3 Date3
0 2002070 4.2 2005-12-07 7.2 2005-12-07 7.4 2005-12-07
1 2002070 4.2 2005-12-07 7.2 2005-12-07 8.4 2013-11-28
2 2002070 4.5 2008-05-14 NaN NaT 7.4 2005-12-07
3 2002070 4.5 2008-05-14 NaN NaT 8.4 2013-11-28
4 2002740 3.7 2008-10-27 NaN NaT 8.4 2010-10-22
5 2002740 3.7 2008-10-27 NaN NaT 8.4 2012-03-01
6 2002740 4.8 2009-04-20 NaN NaT 8.4 2010-10-22
7 2002740 4.8 2009-04-20 NaN NaT 8.4 2012-03-01
8 2002740 NaN NaT 2.3 2011-06-03 8.4 2010-10-22
9 2002740 NaN NaT 2.3 2011-06-03 8.4 2012-03-01
10 2002740 NaN NaT 1.5 2015-07-05 8.4 2010-10-22
11 2002740 NaN NaT 1.5 2015-07-05 8.4 2012-03-01
12 2003010 4.4 2012-03-01 NaN NaT 7.4 2012-03-01
This is the function I made:
def func(Value2, Value3):
    return Value2 / ((Value3 / 100) ** 2)
result = ok1.groupby("ID").Value1.apply(func(ok1.Value2, ok1.Value3))
Do you know how to apply this function only where Value1 is NaN? And how to set the NaT Date1 values equal to Date2?
The output of func is another Series, and pandas is not sure what you want to do with it - what would it mean to apply this series to the groups?
Is it that you want the values of this series to be assigned wherever there is a missing Value1 in the original DataFrame?
In that case
imputes = ok1.Value2.div(ok1.Value3.div(100).pow(2)) # same as your function
# overwrite missing values with the corresponding imputed values
ok1.Value1.fillna(imputes, inplace=True)
# overwrite missing dates with dates from another column
ok1.Date1.fillna(ok1.Date2, inplace=True)
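A note in case you are on a recent pandas with copy-on-write enabled: the in-place fillna on an attribute-accessed column may not actually modify ok1. A safer variant of the same computation is to assign back explicitly:

imputes = ok1['Value2'] / (ok1['Value3'] / 100) ** 2
ok1['Value1'] = ok1['Value1'].fillna(imputes)
ok1['Date1'] = ok1['Date1'].fillna(ok1['Date2'])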
However, it's not clear to me that this is quite what you wanted, given the presence of the groupby.
I have missing data at the start of a DataFrame for one series, and I want to fill those NAs by growing back the series using the growth rate of another.
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [np.nan, np.nan, np.nan, 6, 6.7, 6.78, 7, 9.1],
                   'Y': [5.4, 5.7, 5.5, 6.1, 6.5, 6.80, 7.1, 9.12]})
X Y
0 NaN 5.40
1 NaN 5.70
2 NaN 5.50
3 6.00 6.10
4 6.70 6.50
5 6.78 6.80
6 7.00 7.10
7 9.10 9.12
i.e. what I want is:
df2 = pd.DataFrame({'X': [5.31147, 5.60656, 5.40984, 6, 6.7, 6.78, 7, 9.1],
                    'Y': [5.4, 5.7, 5.5, 6.1, 6.5, 6.80, 7.1, 9.12]})
So the two series have the same growth rates over those first few originally missing values:
df2.pct_change()
X Y
0 NaN NaN
1 0.055556 0.055556
2 -0.035088 -0.035088
3 0.109091 0.109091
4 0.116667 0.065574
5 0.011940 0.046154
6 0.032448 0.044118
7 0.300000 0.284507
Any ideas? I've figured out how to iterate backwards and save the output to a list, but that's bulky, and I would still need to prepend the result to the original DataFrame.
You could let
first_non_nan = df.X.isnull().idxmin()
changes = df.Y[:first_non_nan + 1].pct_change()

while first_non_nan > 0:
    df.loc[first_non_nan - 1, 'X'] = df.loc[first_non_nan, 'X'] / (changes[first_non_nan] + 1)
    first_non_nan -= 1
Result:
In [48]: df
Out[48]:
X Y
0 5.311475 5.40
1 5.606557 5.70
2 5.409836 5.50
3 6.000000 6.10
4 6.700000 6.50
5 6.780000 6.80
6 7.000000 7.10
7 9.100000 9.12
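Since the missing values here are all at the start, the same back-fill can also be done without a loop: growing X back with Y's growth rate is equivalent to scaling Y so that it meets X at the first observed point. A vectorised sketch under that assumption (leading NaNs only):

first = df['X'].first_valid_index()              # first row where X is observed
scale = df.loc[first, 'X'] / df.loc[first, 'Y']  # factor that makes Y meet X at that row
df['X'] = df['X'].fillna(df['Y'] * scale)        # fills only the leading NaNs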