pandas ffill() with groupby - python

I have the following dataframe with 22 columns:
ID S0 S1 S2 .....
ABC 10.4 5.58
ABC 12.6
ABC 8.45
LMN 5.6
LMN 8.7
I have to ffill() the values based on groups. Intended result:
ID SS RR S2 ...
ABC 10.4 5.58
ABC 12.6 10.4 5.58
ABC 12.6 10.4 8.45
LMN 5.6
LMN 8.7 5.6
I am using the following code to get S0,S1... values:
df[['Resistance', 'cumcountR']].pivot(columns='cumcountR').droplevel(0, axis=1).add_prefix('R').drop(columns='R-1.0').ffill()
Any help would be appreciated. Thanks!

Try with GroupBy.ffill
out = df.groupby('ID').ffill()
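A minimal sketch on a toy frame (the column layout here is an assumption based on the question, not the asker's actual data):
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': ['ABC', 'ABC', 'ABC', 'LMN', 'LMN'],
                   'S0': [10.4, np.nan, np.nan, 5.6, np.nan],
                   'S1': [5.58, 12.6, 8.45, np.nan, 8.7]})
out = df.groupby('ID').ffill()   # forward-fill S0/S1 within each ID group
out.insert(0, 'ID', df['ID'])    # groupby().ffill() drops the grouping column
print(out)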

Related

AttributeError: 'list' object has no attribute 'assign'

I have this dataframe:
SRC Coup Vint Bal Mar Apr May Jun Jul BondSec
0 JPM 1.5 2021 43.9 5.6 4.9 4.9 5.2 4.4 FNCL
1 JPM 1.5 2020 41.6 6.2 6.0 5.6 5.8 4.8 FNCL
2 JPM 2.0 2021 503.9 7.1 6.3 5.8 6.0 4.9 FNCL
3 JPM 2.0 2020 308.3 9.3 7.8 7.5 7.9 6.6 FNCL
4 JPM 2.5 2021 345.0 8.6 7.8 6.9 6.8 5.6 FNCL
5 JPM 4.5 2010 5.7 21.3 20.0 18.0 17.7 14.6 G2SF
6 JPM 5.0 2019 2.8 39.1 37.6 34.6 30.8 24.2 G2SF
7 JPM 5.0 2018 7.3 39.8 37.1 33.4 30.1 24.2 G2SF
8 JPM 5.0 2010 3.9 23.3 20.0 18.6 17.9 14.6 G2SF
9 JPM 5.0 2009 4.2 22.8 21.2 19.5 18.6 15.4 G2SF
I want to duplicate all the rows that have FNCL as the BondSec, and rename the value of BondSec in those new duplicate rows to FGLMC. I'm able to accomplish half of that with the following code:
if "FGLMC" not in jpm['BondSec']:
is_FNCL = jpm['BondSec'] == "FNCL"
FNCL_try = jpm[is_FNCL]
jpm.append([FNCL_try]*1,ignore_index=True)
But if I instead try to implement the change to the BondSec value in the same line as below:
jpm.append(([FNCL_try]*1).assign(**{'BondSecurity': 'FGLMC'}),ignore_index=True)
I get the following error:
AttributeError: 'list' object has no attribute 'assign'
Additionally, I would like to insert the duplicated rows based on an index condition, not just at the bottom as additional rows. The condition cannot be simply a row position because this will have to work on future files with different numbers of rows. So I would like to insert the duplicated rows at the position where the BondSec column values change from FNCL to FNCI (FNCI is not showing here, but basically it would be right below the last row with FNCL). I'm assuming this could be done with an np.where function call, but I'm not sure how to implement that.
I'll also eventually want to do this same exact process with rows with FNCI as the BondSec value (duplicating them and transforming the BondSec value to FGCI, and inserting at the index position right below the last row with FNCI as the value).
I'd suggest a helper function to handle all your duplications:
def duplicate_and_rename(df, target, value):
    return pd.concat([df, df[df["BondSec"] == target].assign(BondSec=value)])
Then:
for target, value in (("FNCL", "FGLMC"), ("FNCI", "FGCI")):
    df = duplicate_and_rename(df, target, value)
Then after all that, you can categorize the BondSec column and use a custom order:
ordering = ["FNCL", "FGLMC", "FNCI", "FGCI", "G2SF"]
df["BondSec"] = pd.Categorical(df["BondSec"], ordering)
df = df.sort_values("BondSec").reset_index(drop=True)
Alternatively, you can use a dictionary for your ordering, as explained in this answer.
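For reference, a hedged sketch of that dictionary approach (the mapping simply mirrors the ordering list above; sort_values with a key requires pandas >= 1.1):
order_map = {"FNCL": 0, "FGLMC": 1, "FNCI": 2, "FGCI": 3, "G2SF": 4}
df = (df.sort_values("BondSec", key=lambda s: s.map(order_map))  # rank-based sort
        .reset_index(drop=True))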

Pandas dataframe to multiple adjacency matrices

I need to transform a data frame into what I think are adjacency matrices or some sort of pivot table using a datetime column. I have tried a lot of googling but haven't found anything, so any help in how to do this or even what I should be googling would be appreciated.
Here is a simplified version of my data:
df = pd.DataFrame({'Location': [1]*7 + [2]*7,
                   'Postcode': ['XXX XXX']*7 + ['YYY YYY']*7,
                   'Date': ['03-12-2021', '04-12-2021', '05-12-2021', '06-12-2021',
                            '07-12-2021', '08-12-2021', '09-12-2021', '03-12-2021',
                            '04-12-2021', '05-12-2021', '06-12-2021', '07-12-2021',
                            '08-12-2021', '09-12-2021'],
                   'Var 1': [6.9, 10.2, 9.2, 7.6, 9.8, 8.6, 10.6,
                             9.9, 9.4, 9, 9.4, 9.1, 8, 9.9],
                   'Var 2': [14.5, 6.2, 9.7, 12.7, 14.8, 12, 12.2,
                             12.3, 14.2, 13.8, 11.7, 17.8, 10.7, 12.3]})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
Location Postcode Date Var 1 Var 2
0 1 XXX XXX 2021-12-03 6.9 14.5
1 1 XXX XXX 2021-12-04 10.2 6.2
2 1 XXX XXX 2021-12-05 9.2 9.7
3 1 XXX XXX 2021-12-06 7.6 12.7
4 1 XXX XXX 2021-12-07 9.8 14.8
5 1 XXX XXX 2021-12-08 8.6 12.0
6 1 XXX XXX 2021-12-09 10.6 12.2
7 2 YYY YYY 2021-12-03 9.9 12.3
8 2 YYY YYY 2021-12-04 9.4 14.2
9 2 YYY YYY 2021-12-05 9.0 13.8
10 2 YYY YYY 2021-12-06 9.4 11.7
11 2 YYY YYY 2021-12-07 9.1 17.8
12 2 YYY YYY 2021-12-08 8.0 10.7
13 2 YYY YYY 2021-12-09 9.9 12.3
The output I want to create is what each variable will be at +1, +2, +3, etc. days from each row's Date, laid out as additional columns per variable.
But I have no idea how or where to start. My only thought is several for loops but in reality I have hundreds of locations and 10 variables for 14 Dates each, so it is a large dataset and this would be very inefficient. I feel like there should be a function or simpler way to achieve this.
Create a DatetimeIndex and then use DataFrameGroupBy.shift with a negative freq; add a suffix via DataFrame.add_suffix, formatting the counter as {i:02} so the column names (01, 02, ..., 10, 11) sort correctly in the last step:
df = df.set_index('Date')
for i in range(1, 7):
    df = df.join(df.groupby('Location')[['Var 1', 'Var 2']]
                   .shift(freq=f'-{i}d')
                   .add_suffix(f'+ Day {i:02}'),
                 on=['Location', 'Date'])
df = df.set_index(['Location', 'Postcode'], append=True).sort_index(axis=1)
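If each location's dates are consecutive with no gaps, a positional shift within each group is an equivalent sketch (that regularity is an assumption about the data, and this starts again from the original df):
wide = df.set_index(['Location', 'Postcode', 'Date'])
parts = [wide]
for i in range(1, 7):
    parts.append(wide.groupby(level='Location').shift(-i)  # values i rows ahead
                     .add_suffix(f'+ Day {i:02}'))
res = pd.concat(parts, axis=1).sort_index(axis=1)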

Transforming yearwise data using pandas

I have a dataframe that looks like this:
Temp
Date
1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6
1981-01-05 15.8
... ...
1981-12-27 15.5
1981-12-28 13.3
1981-12-29 15.6
1981-12-30 15.2
1981-12-31 17.4
365 rows × 1 columns
And I want to transform it so that it looks like this:
1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
0 20.7 17.0 18.4 19.5 13.3 12.9 12.3 15.3 14.3 14.8
1 17.9 15.0 15.0 17.1 15.2 13.8 13.8 14.3 17.4 13.3
2 18.8 13.5 10.9 17.1 13.1 10.6 15.3 13.5 18.5 15.6
3 14.6 15.2 11.4 12.0 12.7 12.6 15.6 15.0 16.8 14.5
4 15.8 13.0 14.8 11.0 14.6 13.7 16.2 13.6 11.5 14.3
... ... ... ... ... ... ... ... ... ... ...
360 15.5 15.3 13.9 12.2 11.5 14.6 16.2 9.5 13.3 14.0
361 13.3 16.3 11.1 12.0 10.8 14.2 14.2 12.9 11.7 13.6
362 15.6 15.8 16.1 12.6 12.0 13.2 14.3 12.9 10.4 13.5
363 15.2 17.7 20.4 16.0 16.3 11.7 13.3 14.8 14.4 15.7
364 17.4 16.3 18.0 16.4 14.4 17.2 16.7 14.1 12.7 13.0
My attempt:
groups = df.groupby(df.index.year)
keys = groups.groups.keys()
years = pd.DataFrame()
for key in keys:
    years[key] = groups.get_group(key)['Temp'].values
Question:
The above code gives me my desired output, but is there a more efficient way of transforming this?
I can't post the whole dataset here because the dataframe has 3650 rows, so you can download the csv file (60.6 kb) for testing from here.
Try grabbing the year and dayofyear from the index then pivoting:
import pandas as pd
import numpy as np
# Create Random Data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1982-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape),
                  index=dr,
                  columns=['Temp'])
# Get Year and Day of Year
df['year'] = df.index.year
df['day'] = df.index.dayofyear
# Pivot
p = df.pivot(index='day', columns='year', values='Temp')
print(p)
p:
year 1981 1982
day
1 38 85
2 51 70
3 76 61
4 71 47
5 44 76
.. ... ...
361 23 22
362 42 64
363 84 22
364 26 56
365 67 73
Run-Time via Timeit
import timeit

setup = '''
import pandas as pd
import numpy as np
# Create Random Data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1983-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape),
                  index=dr,
                  columns=['Temp'])'''

pivot = '''
df['year'] = df.index.year
df['day'] = df.index.dayofyear
p = df.pivot(index='day', columns='year', values='Temp')'''

groupby_for = '''
groups = df.groupby(df.index.year)
keys = groups.groups.keys()
years = pd.DataFrame()
for key in keys:
    years[key] = groups.get_group(key)['Temp'].values'''

if __name__ == '__main__':
    print("Pivot")
    print(timeit.timeit(setup=setup, stmt=pivot, number=1000))
    print("Groupby For")
    print(timeit.timeit(setup=setup, stmt=groupby_for, number=1000))
Pivot
1.598973
Groupby For
2.3967995999999996
Additional note: the groupby/for option will not work across leap years, since it cannot handle 1984 having 366 days instead of 365; pivot works regardless.
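If positional alignment (row 0 = the first observation of each year) is enough, a hedged sketch with GroupBy.cumcount sidesteps the day-of-year shift in leap years (reusing the year column created above):
# Number rows within each year instead of using dayofyear, so a leap year
# just contributes one extra row rather than shifting March onward.
df['n'] = df.groupby(df.index.year).cumcount()
p = df.pivot(index='n', columns='year', values='Temp')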

Pandas: How to remove the index column after groupby and unstack?

I've got trouble removing the index column in pandas after groupby and unstack a DataFrame.
My original DataFrame looks like this:
example = pd.DataFrame({'date': ['2016-12', '2016-12', '2017-01', '2017-01',
                                 '2017-02', '2017-02', '2017-02'],
                        'customer': [123, 456, 123, 456, 123, 456, 456],
                        'sales': [10.5, 25.2, 6.8, 23.4, 29.5, 23.5, 10.4]})
example.head(10)
output:
      date  customer  sales
0  2016-12       123   10.5
1  2016-12       456   25.2
2  2017-01       123    6.8
3  2017-01       456   23.4
4  2017-02       123   29.5
5  2017-02       456   23.5
6  2017-02       456   10.4
Note that it's possible to have multiple sales for one customer per month (like in row 5 and 6).
My aim is to convert the DataFrame into an aggregated DataFrame like this:
customer  2016-12  2017-01  2017-02
     123     10.5      6.8     29.5
     456     25.2     23.4     33.9
My solution so far:
example = example[['date', 'customer', 'sales']].groupby(['date', 'customer']).sum().unstack('date')
example.head(10)
output:
            sales
date      2016-12  2017-01  2017-02
customer
123          10.5      6.8     29.5
456          25.2     23.4     33.9
example = example['sales'].reset_index(level=[0])
example.head(10)
output:
date  customer  2016-12  2017-01  2017-02
0          123     10.5      6.8     29.5
1          456     25.2     23.4     33.9
At this point I'm unable to remove the "date" column:
example.reset_index(drop = True)
example.head()
output:
date  customer  2016-12  2017-01  2017-02
0          123     10.5      6.8     29.5
1          456     25.2     23.4     33.9
It just stays the same. Have you got any ideas?
An alternative to your solution; the key is just to add rename_axis(columns=None), since date is the name of the columns axis:
(example[["date", "customer", "sales"]]
 .groupby(["date", "customer"])
 .sum()
 .unstack("date")
 .droplevel(0, axis="columns")
 .rename_axis(columns=None)
 .reset_index())
   customer  2016-12  2017-01  2017-02
0       123     10.5      6.8     29.5
1       456     25.2     23.4     33.9
Why not directly go with pivot_table?
(example
 .pivot_table('sales', index='customer', columns='date', aggfunc='sum')
 .rename_axis(columns=None)
 .reset_index())
   customer  2016-12  2017-01  2017-02
0       123     10.5      6.8     29.5
1       456     25.2     23.4     33.9

How to concat and transpose two tables in python

I do not know why my code is not working; I want to transpose and concat two tables in python.
my code:
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame({'TR': np.arange(1, 6).repeat(5),
                   'A': np.random.randint(1, 100, 25),
                   'B': np.random.randint(50, 100, 25),
                   'C': np.random.randint(50, 1000, 25),
                   'D': np.random.randint(5, 100, 25)})
table = df.groupby('TR').mean().round(decimals=1)
table2 = df.drop(['TR'], axis=1).sem().round(decimals=1)
table2 = table2.T
pd.concat([table, table2])
The output should be:
TR A B C D
1 54.0 68.6 795.8 49.8
2 61.4 67.8 524.8 52.8
3 54.0 73.6 556.6 46.6
4 35.6 69.2 207.2 46.4
5 44.4 85.0 639.8 73.8
st 6.5 3.4 62.5 6.4
Assign a name to the Series, then append it:
table2.name = 'st'
table = table.append(table2)
table
A B C D
TR
1 55.8 73.2 536.8 42.8
2 31.0 75.4 731.2 43.6
3 42.0 68.8 598.6 32.4
4 33.6 79.0 300.8 43.6
5 70.2 72.2 566.8 54.8
st 5.9 3.2 62.5 5.9
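Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; an equivalent with concat:
table2.name = 'st'
table = pd.concat([table, table2.to_frame().T])  # one-row frame indexed 'st'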
