Pandas: Rolling sum with multiple indexes (i.e. panel data)

I have a DataFrame with a MultiIndex and would like to create a rolling sum of some data, but restarting for each id in the index.
For instance, say I have two index levels (Firm and Year) and some data named zdata. A working example is the following:
import pandas as pd
# generating data
firms = ['firm1']*5+['firm2']*5
years = [2000+i for i in range(5)]*2
zdata = [1 for i in range(10)]
# Creating the dataframe
mydf = pd.DataFrame({'firms':firms,'year':years,'zdata':zdata})
# Setting the two indexes
mydf.set_index(['firms','year'],inplace=True)
print(mydf)
            zdata
firms year
firm1 2000      1
      2001      1
      2002      1
      2003      1
      2004      1
firm2 2000      1
      2001      1
      2002      1
      2003      1
      2004      1
And now, I would like to have a rolling sum that starts over for each firm. However, if I type
new_rolling_df=mydf.rolling(window=2).sum()
print(new_rolling_df)
            zdata
firms year
firm1 2000    NaN
      2001    2.0
      2002    2.0
      2003    2.0
      2004    2.0
firm2 2000    2.0
      2001    2.0
      2002    2.0
      2003    2.0
      2004    2.0
It doesn't take the MultiIndex into account and just does a normal rolling sum down the whole column. Does anyone have an idea how I should do this, especially since I actually have more than two index levels (firm, worker, country, year)?
Thanks,
Adrien

Option 1
mydf.unstack(0).rolling(2).sum().stack().swaplevel(0, 1).sort_index()
Option 2
mydf.groupby(level=0, group_keys=False).rolling(2).sum()
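The asker mentions having more than two index levels (firm, worker, country, year). Option 2 extends naturally: group on every index level except the time level so the window restarts for each panel unit. A minimal sketch, assuming the time level comes last:
# group on all index levels except the last (time) level
unit_levels = list(range(mydf.index.nlevels - 1))
mydf.groupby(level=unit_levels, group_keys=False).rolling(2).sum()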

Related

Can you subtract from multiple DF1 columns based on a single DF2 column?

I have DF1 with several int columns and DF2 with 1 int column
DF1:
Year Industrial Consumer Discretionary Technology Utilities Energy Materials Communications Consumer Staples Health Care #No L1 US Agg Financials China Agg EU Agg
2001 5.884277 6.013842 6.216585 6.640594 6.701400 8.488806 7.175017 6.334284 6.082113 0.000000 5.439149 4.193736 4.686188 4.294788
2002 5.697814 6.277471 5.241045 6.608475 6.983511 8.089475 7.399775 5.882947 5.818563 7.250000 4.877012 3.635425 4.334125 3.944324
2003 5.144356 6.503754 6.270268 5.737079 6.466985 8.122228 7.040089 5.461827 5.385670 5.611753 4.163365 2.888026 3.955665 3.464020
2004 5.436486 6.463149 4.500574 5.329104 5.863406 7.562982 6.521106 5.990889 4.874258 6.554348 4.384878 3.502861 4.556418 3.412025
2005 5.003606 6.108812 5.732764 5.543677 6.131144 7.239053 7.228042 5.421092 5.561518 NaN 4.660754 3.970243 3.944251 3.106951
2006 4.505980 6.017253 4.923927 5.955308 5.799030 7.425253 6.942308
DF2:
Year Values
2002 4.514752
2003 3.994849
2004 4.254575
2005 4.277520
2006 4.784476
etc..
The indexes are the same for the two DataFrames.
The goal is to create DF3 by subtracting DF2 from every single column of DF1. (DF2 - DF1 = DF3)
Anywhere there is a NaN, it should skip the math.
Assuming "Year" is the index for both (if not, you can make it the index using set_index), you can use sub on axis:
df3 = df1.sub(df2['Values'], axis=0)
Output:
      Industrial  Consumer  Discretionary  Technology  Utilities    Energy  \
Year
2001         NaN       NaN            NaN         NaN        NaN       NaN
2002    1.183062  1.762719       0.726293    2.093723   2.468759  3.574723
2003    1.149507  2.508905       2.275419    1.742230   2.472136  4.127379
2004    1.181911  2.208574       0.245999    1.074529   1.608831  3.308407
2005    0.726086  1.831292       1.455244    1.266157   1.853624  2.961533
2006   -0.278496  1.232777       0.139451    1.170832   1.014554  2.640777
      Materials  Communications  Consumer.1   Staples  Health_Care    US_Agg  \
Year
2001        NaN             NaN         NaN       NaN          NaN       NaN
2002   2.885023        1.368195    1.303811  2.735248     0.362260 -0.879327
2003   3.045240        1.466978    1.390821  1.616904     0.168516 -1.106823
2004   2.266531        1.736314    0.619683  2.299773     0.130303 -0.751714
2005   2.950522        1.143572    1.283998       NaN     0.383234 -0.307277
2006   2.157832             NaN         NaN       NaN          NaN       NaN
      Financials  China_Agg
Year
2001         NaN        NaN
2002   -0.180627  -0.570428
2003   -0.039184  -0.530829
2004    0.301843  -0.842550
2005   -0.333269  -1.170569
2006         NaN        NaN
If you want to subtract df1 from df2 instead, you can use rsub instead of sub. It's not clear which one you want, since you describe subtracting DF2 from DF1 but your formula (DF2 - DF1 = DF3) says the opposite.
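In that case the reversed call, aligning on the index just like sub, would be:
df3 = df1.rsub(df2['Values'], axis=0)  # computes df2['Values'] - df1 for every column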

How to iterate over columns and check condition by group

I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan         48
2      2002  AFG      nan         49
3      2003  AFG      nan         50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work by filtering out the countries where all values are NaN in one of (inflation, GDP):
(
    df.groupby(['country'])
      .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
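With the sample data above (loaded with a default RangeIndex), only the USA rows survive, since AFG has no inflation values at all and CHI has no GDP values:
   index  year country  inflation    GDP
6      7  2001     USA        NaN  220.0
7      8  2002     USA        4.0  250.0
8      9  2003     USA        2.5  280.0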
Note, if you have more than two columns you can work on a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
If you want this to work with a specific range of year instead of all columns, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
You can also try this:
# check where the sum is equal to 0 - means no values in the column for a specific country
group_by = df.groupby(['country']).agg({'inflation':sum, 'GDP':sum}).reset_index()
# extract only countries with information on both columns
indexes = group_by[(group_by['GDP'] != 0) & (group_by['inflation'] != 0)].index
final_countries = list(group_by.loc[group_by.index.isin(indexes), :]['country'])
# keep the rows contains the countries
df = df.drop(df[~df.country.isin(final_countries)].index)
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls, after it's reshaped:
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)  # delete rows where any value is null
To convert back to long, you can use pd.melt.
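Here is a minimal sketch of that round trip on the sample data, adapted so that whole countries (rather than single years) are dropped; stack is used in place of pd.melt because the pivoted columns form a MultiIndex:
# wide: one column per (variable, country) pair
wide = df.pivot(index='year', columns='country', values=['inflation', 'GDP'])
# countries with an all-NaN column for some variable
bad = wide.columns[wide.isna().all()].get_level_values('country').unique()
# drop those countries and reshape back to long
result = (wide.drop(columns=bad, level='country')
              .stack('country')
              .reset_index())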

Pivot table with Multi Index Dataframe

I am struggling with how to pivot a DataFrame with multi-indexed columns. First I import the data from an .xlsx file, and from there I've tried to generate a certain DataFrame.
Note: I'm not allowed to embed images, so that's the reason for the links.
import pandas as pd
import numpy as np
# Read Excel file
df = pd.read_excel("myFile.xlsx", header=[0])
Output: Click
If you want, here you can see the File: Link to File
# Get Month and Year of the Dataframe
month_year = df.iloc[:, 5:-1].columns
month_list = []
year_list = []
for x in range(len(month_year)-1):
    if "Unnamed" not in month_year[x]:
        month_list.append(month_year[x].split()[0])
        year_list.append(month_year[x].split()[1])
# Read Excel file with headers 1 & 2
df = pd.read_excel("myFile.xlsx", header=[0,1])
Output: Click
# Join both indexes excluding the ones with Unnamed
df.columns = [str(x[0] + " " + x[1]) if("Unnamed" not in x[1]) else str(x[0]) for x in df.columns ]
Output: Click
# Adding month and list columns to the DataFrame
df['Month'] = month_list
df['Year'] = year_list
Output: Click
I want the output DataFrame to be like the following:
Desired Output
You should clean it up a bit, because I do not know how the Total column should be handled.
The code below reads the Excel file with a two-level header, modifies the column names a bit, then stacks and extracts the Year and Month columns.
df = pd.read_excel("Downloads/myFile.xlsx", header=[0,1], index_col=[0, 1, 2])
df.index.names = ['Project', 'KPI', 'Metric']
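# give the last column (Total) a full two-level name, ('Total', 'Total'),
# so it matches the (Month Year, Values) pattern of the other headers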
df.columns = df.columns.delete(-1).union([('Total', 'Total')])
df.columns.names = ['Month_Year', 'Values']
(df
 .stack(level=0)
 .rename_axis(columns=None)
 .reset_index()
 .assign(Year=lambda df: df.Month_Year.str.split(" ").str[-1],
         Month=lambda df: df.Month_Year.str.split(" ").str[0])
 .drop(columns='Month_Year')
)
      Project      KPI            Metric    Real   Target  Total   Year     Month
0   Project 1  Numeric  Project 1 Metric    10.0     30.0    NaN   2019     April
1   Project 1  Numeric  Project 1 Metric   651.0  51651.0    NaN   2019  February
2   Project 1  Numeric  Project 1 Metric   200.0    215.0    NaN   2019   January
3   Project 1  Numeric  Project 1 Metric     2.0      5.0    NaN   2019     March
4   Project 1  Numeric  Project 1 Metric     NaN      NaN    9.0  Total     Total
5   Project 2  General  Project 2 Metric    20.0     10.0    NaN   2019     April
6   Project 2  General  Project 2 Metric   500.0    100.0    NaN   2019  February
7   Project 2  General  Project 2 Metric   749.0     12.0    NaN   2019   January
8   Project 2  General  Project 2 Metric     1.0      7.0    NaN   2019     March
9   Project 2  General  Project 2 Metric     NaN      NaN    7.0  Total     Total
10  Project 3  Numeric  Project 3 Metric    30.0     20.0    NaN   2019     April
11  Project 3  Numeric  Project 3 Metric   200.0     55.0    NaN   2019  February
12  Project 3  Numeric  Project 3 Metric  5583.0     36.0    NaN   2019   January
13  Project 3  Numeric  Project 3 Metric     3.0      7.0    NaN   2019     March
14  Project 3  Numeric  Project 3 Metric     NaN      NaN    4.0  Total     Total

Pandas - Add missing years in time series data with duplicate years

I have a dataset like this where data for some years are missing.
County Year Pop
12 1999 1.1
12 2001 1.2
13 1999 1.0
13 2000 1.1
I want something like
County Year Pop
12 1999 1.1
12 2000 NaN
12 2001 1.2
13 1999 1.0
13 2000 1.1
13 2001 NaN
I have tried setting the index to year and then using reindex with another dataframe of just years (the method mentioned here: Pandas: Add data for missing months), but it gives me the error that it can't reindex with duplicate values. I have also tried df.loc, but it has the same issue. I even tried a full outer join with a blank df of just years, but that also didn't work.
How can I solve this?
Make a MultiIndex so you don't have duplicates:
df.set_index(['County', 'Year'], inplace=True)
Then construct a full MultiIndex with all the combinations:
index = pd.MultiIndex.from_product(df.index.levels)
Then reindex:
df.reindex(index)
The construction of the MultiIndex is untested and may need a little tweaking (e.g. if a year is entirely absent from all counties), but I think you get the idea.
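For reference, a runnable version of this idea on the sample data (here every year appears under at least one county, so the observed levels are complete):
import pandas as pd
df = pd.DataFrame({'County': [12, 12, 13, 13],
                   'Year': [1999, 2001, 1999, 2000],
                   'Pop': [1.1, 1.2, 1.0, 1.1]})
df = df.set_index(['County', 'Year'])
full = pd.MultiIndex.from_product(df.index.levels)  # level names are carried over
print(df.reindex(full).reset_index())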
I'm working under the assumption that you may want to add all years between the minimum and maximum years. It may be the case that you were missing 2000 for both Counties 12 and 13.
I'll construct a pd.MultiIndex from_product using unique values from the 'County' column and all integer years between and including the min and max years in the 'Year' column.
Note: this solution fills in all missing years even if they aren't currently present.
mux = pd.MultiIndex.from_product([
    df.County.unique(),
    range(df.Year.min(), df.Year.max() + 1)
], names=['County', 'Year'])
df.set_index(['County', 'Year']).reindex(mux).reset_index()
County Year Pop
0 12 1999 1.1
1 12 2000 NaN
2 12 2001 1.2
3 13 1999 1.0
4 13 2000 1.1
5 13 2001 NaN
You can use pivot_table:
In [11]: df.pivot_table(values="Pop", index="County", columns="Year")
Out[11]:
Year    1999  2000  2001
County
12       1.1   NaN   1.2
13       1.0   1.1   NaN
and stack the result (a Series is required):
In [12]: df.pivot_table(values="Pop", index="County", columns="Year").stack(dropna=False)
Out[12]:
County  Year
12      1999    1.1
        2000    NaN
        2001    1.2
13      1999    1.0
        2000    1.1
        2001    NaN
dtype: float64
Or you can try some black magic :P
min_year, max_year = df.Year.min(), df.Year.max()
df.groupby('County').apply(lambda g: g.set_index("Year").reindex(range(min_year, max_year+1))).drop("County", axis=1).reset_index()
You mentioned you've tried to join to a blank df and this approach can actually work.
Setup:
df = pd.DataFrame({'County': {0: 12, 1: 12, 2: 13, 3: 13},
                   'Pop': {0: 1.1, 1: 1.2, 2: 1.0, 3: 1.1},
                   'Year': {0: 1999, 1: 2001, 2: 1999, 3: 2000}})
Solution
# create a new blank df with all the required Years for each County
# (itertools.product stands in for pd.tools.util.cartesian_product,
# which was removed in later pandas versions)
import itertools
df_2 = pd.DataFrame(list(itertools.product(df.County.unique(), range(1999, 2002))),
                    columns=['County', 'Year'])
# Left join the new dataframe to the existing dataframe to populate the Pop values.
pd.merge(df_2, df, on=['Year', 'County'], how='left')
Out[73]:
County Year Pop
0 12 1999 1.1
1 12 2000 NaN
2 12 2001 1.2
3 13 1999 1.0
4 13 2000 1.1
5 13 2001 NaN
Here is a function inspired by the accepted answer but for a case where the time-variable starts and stops at different places for different group ids. The only difference from the accepted answer is that I manually construct the multi-index.
def fill_gaps_in_panel(df, group_col, year_col):
    """
    Fills the gaps in a panel by constructing an index
    based on the group col and the sequence of years between min-year
    and max-year for each group id.
    """
    index_group = []
    index_time = []
    for group in df[group_col].unique():
        _min = df.loc[df[group_col] == group, year_col].min()
        _max = df.loc[df[group_col] == group, year_col].max() + 1
        index_group.extend([group for t in range(_min, _max)])
        index_time.extend([t for t in range(_min, _max)])
    multi_index = pd.MultiIndex.from_arrays(
        [index_group, index_time], names=(group_col, year_col))
    df.set_index([group_col, year_col], inplace=True)
    return df.reindex(multi_index)
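A usage sketch with this question's sample data (the function mutates its argument via set_index, hence the copy); note that County 13 only gets years up to its own maximum, 2000:
filled = fill_gaps_in_panel(df.copy(), 'County', 'Year')
print(filled.reset_index())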

Split string in a column based on character position

I have a dataframe like this:
Basic Stats Min Max Mean Stdev
1 LT50300282010256PAC01 0.336438 0.743478 0.592622 0.052544
2 LT50300282009269PAC01 0.313259 0.678561 0.525667 0.048047
3 LT50300282008253PAC01 0.374522 0.746828 0.583513 0.055989
4 LT50300282007237PAC01 -0.000000 0.749325 0.330068 0.314351
5 LT50300282006205PAC01 -0.000000 0.819288 0.600136 0.170060
and for the column Basic Stats I want to retain only the characters at positions 9 through 12 (the year), so for row 1 I only want to retain 2010 and for row 2 I only want to retain 2009. Is there a way to do this?
Just use the vectorised .str accessor to slice your strings:
In [23]:
df['Basic Stats'].str[9:13]
Out[23]:
0 2010
1 2009
2 2008
3 2007
4 2006
Name: Basic Stats, dtype: object
One way would be to use
df['Basic Stats'] = df['Basic Stats'].map(lambda x: x[9:13])
You can use str.slice:
df["Basic Stats"] = df["Basic Stats"].str.slice(9,13)
Output:
Basic Stats Min Max Mean Stdev
0 2010 0.336438 0.743478 0.592622 0.052544
1 2009 0.313259 0.678561 0.525667 0.048047
2 2008 0.374522 0.746828 0.583513 0.055989
3 2007 -0.000000 0.749325 0.330068 0.314351
4 2006 -0.000000 0.819288 0.600136 0.170060
You can do:
df["Basic Stats"] = [ x[9:13] for x in df["Basic Stats"] ]
