Pandas-Add missing years in time series data with duplicate years

Pandas-Add missing years in time series data with duplicate years - python

I have a dataset like this where data for some years are missing .
County Year Pop
12 1999 1.1
12 2001 1.2
13 1999 1.0
13 2000 1.1
I want something like
County Year Pop
12 1999 1.1
12 2000 NaN
12 2001 1.2
13 1999 1.0
13 2000 1.1
13 2001 nan
I have tried setting index to year and then using reindex with another dataframe of just years method (mentioned here Pandas: Add data for missing months) but it gives me error cant reindex with duplicate values. I have also tried df.loc but it has same issue. I even tried a full outer join with blank df of just years but that also didnt work.
How can I solve this?

Make a MultiIndex so you don't have duplicates:
df.set_index(['County', 'Year'], inplace=True)
Then construct a full MultiIndex with all the combinations:
index = pd.MultiIndex.from_product(df.index.levels)
Then reindex:
df.reindex(index)
The construction of the MultiIndex is untested and may need a little tweaking (e.g. if a year is entirely absent from all counties), but I think you get the idea.

I'm working under the assumption that you may want to add all years between the minimum and maximum years. It may be the case that you were missing 2000 for both Counties 12 and 13.
I'll construct a pd.MultiIndex from_product using unique values from the 'County' column and all integer years between and including the min and max years in the 'Year' column.
Note: this solution fills in all missing years even if they aren't currently present.
mux = pd.MultiIndex.from_product([
df.County.unique(),
range(df.Year.min(), df.Year.max() + 1)
], names=['County', 'Year'])
df.set_index(['County', 'Year']).reindex(mux).reset_index()
County Year Pop
0 12 1999 1.1
1 12 2000 NaN
2 12 2001 1.2
3 13 1999 1.0
4 13 2000 1.1
5 13 2001 NaN

You can use pivot_table:
In [11]: df.pivot_table(values="Pop", index="County", columns="Year")
Out[11]:
Year 1999 2000 2001
County
12 1.1 NaN 1.2
13 1.0 1.1 NaN
and stack the result (a Series is required):
In [12]: df.pivot_table(values="Pop", index="County", columns="Year").stack(dropna=False)
Out[12]:
County Year
12 1999 1.1
2000 NaN
2001 1.2
13 1999 1.0
2000 1.1
2001 NaN
dtype: float64

Or you can try some black magic :P
min_year, max_year = df.Year.min(), df.Year.max()
df.groupby('County').apply(lambda g: g.set_index("Year").reindex(range(min_year, max_year+1))).drop("County", axis=1).reset_index()

You mentioned you've tried to join to a blank df and this approach can actually work.
Setup:
df = pd.DataFrame({'County': {0: 12, 1: 12, 2: 13, 3: 13},
'Pop': {0: 1.1, 1: 1.2, 2: 1.0, 3: 1.1},
'Year': {0: 1999, 1: 2001, 2: 1999, 3: 2000}})
Solution
#create a new blank df with all the required Years for each County
df_2 = pd.DataFrame(np.r_[pd.tools.util.cartesian_product([df.County.unique(),np.arange(1999,2002)])].T, columns=['County','Year'])
#Left join the new dataframe to the existing dataframe to populate the Pop values.
pd.merge(df_2,df,on=['Year','County'],how='left')
Out[73]:
County Year Pop
0 12 1999 1.1
1 12 2000 NaN
2 12 2001 1.2
3 13 1999 1.0
4 13 2000 1.1
5 13 2001 NaN

Here is a function inspired by the accepted answer but for a case where the time-variable starts and stops at different places for different group ids. The only difference from the accepted answer is that I manually construct the multi-index.
def fill_gaps_in_panel(df, group_col, year_col):
"""
Fills the gaps in a panel by constructing an index
based on the group col and the sequence of years between min-year
and max-year for each group id.
"""
index_group = []
index_time = []
for group in df[group_col].unique():
_min = df.loc[df[group_col]==group, year_col].min()
_max = df.loc[df[group_col]==group, year_col].max() + 1
index_group.extend([group for t in range(_min, _max)])
index_time.extend([t for t in range(_min, _max)])
multi_index = pd.MultiIndex.from_arrays(
[index_group, index_time], names=(group_col, year_col))
df.set_index([group_col, year_col], inplace=True)
return df.reindex(multi_index)

Related

How to iterate over columns and check condition by group

I have data for many countries over a period of time (2001-2003). It looks something like this:
index
year
country
inflation
GDP
1
2001
AFG
nan
48
2
2002
AFG
nan
49
3
2003
AFG
nan
50
4
2001
CHI
3.0
nan
5
2002
CHI
5.0
nan
6
2003
CHI
7.0
nan
7
2001
USA
nan
220
8
2002
USA
4.0
250
9
2003
USA
2.5
280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?

This should work by filtering all values that have nan in one of (inflation, GDP):
(
df.groupby(['country'])
.filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
Note, if you have more than two columns you can work on a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
If you want this to work with a specific range of year instead of all columns, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())

You can also try this:
# check where the sum is equal to 0 - means no values in the column for a specific country
group_by = df.groupby(['country']).agg({'inflation':sum, 'GDP':sum}).reset_index()
# extract only countries with information on both columns
indexes = group_by[ (group_by['GDP'] != 0) & ( group_by['inflation'] != 0) ].index
final_countries = list(group_by.loc[ group_by.index.isin(indexes), : ]['country'])
# keep the rows contains the countries
df = df.drop(df[~df.country.isin(final_countries)].index)

You could reshape the data frame from long to wide, drop nulls, and then convert back to wide.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls, after its reshaped:
df.dropna(axis=0, how= 'any', thresh=None, subset=None, inplace=True) # Delete rows, where any value is null
To convert back to long, you can use pd.melt.

Write python function that sums values from certain rows for each index type using groupby

In my dataframe, df, I am trying to sum the values from the value column for each Product and Year for two periods of the year (Month), specifically Months 1 through 3 and Months 9 through 11. I know I need to use groupby to group Products and Years, and possibly use a lambda function (or an if statement) to separate the two periods of time.
Here's my data frame df:
import pandas as pd
products = {'Product': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C',
'C','C','C'],
'Month': [1,1,3,4,5,10,4,5,10,11,2,3,5,3,9,
10,11,12],
'Year': [1999,1999,1999,1999,1999,1999,2017,2017,1988,1988,2002,2002,2002,2003,2003,
2003,2003,2003],
'value': [250,810,1200,340,250,800,1200,400,250,800,1200,300,290,800,1200,300, 1200, 300]
}
df = pd.DataFrame(products, columns= ['Product', 'Month','Year','value'])
df
And I want a table that looks something like this:
products = {'Product': ['A','A','B','B','C','C','C'],
'MonthGroups': ['Month1:3','Month9:11','Month1:3','Month9:11','Month1:3','Month1:3','Month9:11'],
'Year': [1999,1999,2017,1988,2002, 2003, 2003],
'SummedValue': [2260, 800, 0, 1050, 1500, 800, 2700]
}
new_df = pd.DataFrame(products, columns= ['Product', 'MonthGroups','Year','SummedValue'])
new_df
What I have so far that is that I should use groupby to group Product and Year. What I'm stuck on is defining the two "Month Groups": Months 1 through 3 and Months 9 through 11, which should be the sum of value per year.
df.groupby(['Product','Year']).value.sum().loc[lambda p: p > 10].to_frame()
This isn't right though because it needs to sum based on the month groups.

First created new column by numpy.select with DataFrame.assign, then aggregate also by MonthGroups and because groupby by default remove rows with misisng values if column used for by parameter (like here MonthGroups) are omitted not matched groups:
df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
df['Month'].between(9,11)],
['Month1:3','Month9:11'], default=None))
.groupby(['Product','MonthGroups','Year']).value
.sum()
.reset_index(name='SummedValue')
)
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
If need also 0 sum values for not matched rows:
df2 = df[['Product','Year']].drop_duplicates().assign(MonthGroups='Month1:3',SummedValue=0)
df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
df['Month'].between(9,11)],
['Month1:3','Month9:11'], default=None))
.groupby(['Product','MonthGroups','Year']).value
.sum()
.reset_index(name='SummedValue')
.append(df2)
.drop_duplicates(['Product','MonthGroups','Year'])
)
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
6 B Month1:3 2017 0
8 B Month1:3 1988 0

A little different approach using pd.cut:
bins = [0,3,8,11]
s = pd.cut(df['Month'],bins,labels=['1:3','irrelevant','9:11'])
(df[s.isin(['1:3','9:11'])].assign(MonthGroups=s.astype(str))
.groupby(['Product','MonthGroups','Year'])['value'].sum().reset_index())
Product MonthGroups Year value
0 A 1:3 1999 2260
1 A 9:11 1999 800
2 B 9:11 1988 1050
3 C 1:3 2002 1500
4 C 1:3 2003 800
5 C 9:11 2003 2700

groupby().mean() don't work under for loop

I have a dictionary named c with objects as dataframe, each dataframe has 3 columns: 'year' 'month' & 'Tmed' , I want to calculate the monthly mean values of Tmed for each year, I used
for i in range(22) : c[i].groupby(['year','month']).mean().reset_index()
This returns
year month Tmed
0 2018 12 14.8
2 2018 12 12.0
3 2018 11 16.1
5 2018 11 9.8
6 2018 11 9.8
9 2018 11 9.3
4425 rows × 3 columns
The index is not as it should be, and for the 11th month of 2018 for example, there should be only one row but as you see the dataframe has more than one.
I tried the code on a single dataframe and it gave the wanted result :
c[3].groupby(['year','month']).mean().reset_index()
year month Tmed
0 1999 9 23.950000
1 1999 10 19.800000
2 1999 11 12.676000
3 1999 12 11.012000
4 2000 1 9.114286
5 2000 2 12.442308
6 2000 3 13.403704
7 2000 4 13.803846
8 2000 5 17.820000
.
.
.
218 2018 6 21.093103
219 2018 7 24.977419
220 2018 8 26.393103
221 2018 9 24.263333
222 2018 10 19.069565
223 2018 11 13.444444
224 2018 12 13.400000
225 rows × 3 columns
I need to put for loop because I have many dataframes, I can't figure out the issue, any help would be gratefull.

I don't see a reason why your code should fail. I tried below and got the required results:
import numpy as np
import pandas as pd
def getRandomDataframe():
rand_year = pd.DataFrame(np.random.randint(2010, 2011,size=(50, 1)), columns=list('y'))
rand_month = pd.DataFrame(np.random.randint(1, 13,size=(50, 1)), columns=list('m'))
rand_value = pd.DataFrame(np.random.randint(0, 100,size=(50, 1)), columns=list('v'))
df = pd.DataFrame(columns=['year', 'month', 'value'])
df['year'] = rand_year
df['month'] = rand_month
df['value'] = rand_value
return df
def createDataFrameDictionary():
_dict = {}
length = 3
for i in range(length):
_dict[i] = getRandomDataframe()
return _dict
c = createDataFrameDictionary()
for i in range(3):
c[i] = c[i].groupby(['year','month'])['value'].mean().reset_index()
# Check results
print(c[0])

Please check if the year, month combo repeats in different dataframes which could be the reason for the repeat.
In your scenario, it may be a good idea to collect the groupby.mean results for each dataframe in another dataframe and do a groupby mean again on the new dataframe

Can you try the following:
main_df = pd.DataFrame()
for i in range(22):
main_df = pd.concat([main_df, c[i].groupby(['year','month']).mean().reset_index()])
print(main_df.groupby(['year','month']).mean())

Reset the index for a pandas DataFrame created from a groupby or pivot?

I have data that contains prices, volumes and other data about various financial securities. My input data looks like the following:
import numpy as np
import pandas
prices = np.random.rand(15) * 100
volumes = np.random.randint(15, size=15) * 10
idx = pandas.Series([2007, 2007, 2007, 2007, 2007, 2008,
2008, 2008, 2008, 2008, 2009, 2009,
2009, 2009, 2009], name='year')
df = pandas.DataFrame.from_items([('price', prices), ('volume', volumes)])
df.index = idx
# BELOW IS AN EXMPLE OF WHAT INPUT MIGHT LOOK LIKE
# IT WON'T BE EXACT BECAUSE OF THE USE OF RANDOM
# price volume
# year
# 2007 0.121002 30
# 2007 15.256424 70
# 2007 44.479590 50
# 2007 29.096013 0
# 2007 21.424690 0
# 2008 23.019548 40
# 2008 90.011295 0
# 2008 88.487664 30
# 2008 51.609119 70
# 2008 4.265726 80
# 2009 34.402065 140
# 2009 10.259064 100
# 2009 47.024574 110
# 2009 57.614977 140
# 2009 54.718016 50
I want to produce a data frame that looks like:
year 2007 2008 2009
0 0.121002 23.019548 34.402065
1 15.256424 90.011295 10.259064
2 44.479590 88.487664 47.024574
3 29.096013 51.609119 57.614977
4 21.424690 4.265726 54.718016
I know of one way to produce the output above using groupby:
df = df.reset_index()
grouper = df.groupby('year')
df2 = None
for group, data in grouper:
series = data['price'].copy()
series.index = range(len(series))
series.name = group
df2 = pandas.DataFrame(series) if df2 is None else pandas.concat([df2, series], axis=1)
And I also know that you can do pivot to get a DataFrame that has NaNs for the missing indices on the pivot:
# df = df.reset_index()
df.pivot(columns='year', values='price')
# Output
# year 2007 2008 2009
# 0 0.121002 NaN NaN
# 1 15.256424 NaN NaN
# 2 44.479590 NaN NaN
# 3 29.096013 NaN NaN
# 4 21.424690 NaN NaN
# 5 NaN 23.019548 NaN
# 6 NaN 90.011295 NaN
# 7 NaN 88.487664 NaN
# 8 NaN 51.609119 NaN
# 9 NaN 4.265726 NaN
# 10 NaN NaN 34.402065
# 11 NaN NaN 10.259064
# 12 NaN NaN 47.024574
# 13 NaN NaN 57.614977
# 14 NaN NaN 54.718016
My question is the following:
Is there a way that I can create my output DataFrame in the groupby without creating the series, or is there a way I can re-index my input DataFrame so that I get the desired output using pivot?

You need to label each year 0-4. To do this, use the cumcount after grouping. Then you can pivot correctly using that new column as the index.
df['year_count'] = df.groupby(level='year').cumcount()
df.reset_index().pivot(index='year_count', columns='year', values='price')
year 2007 2008 2009
year_count
0 61.682275 32.729113 54.859700
1 44.231296 4.453897 45.325802
2 65.850231 82.023960 28.325119
3 29.098607 86.046499 71.329594
4 67.864723 43.499762 19.255214

You can use groupby with apply new Series created with numpy array by values and then reshape by unstack:
print (df.groupby(level='year')['price'].apply(lambda x: pd.Series(x.values)).unstack(0))
year 2007 2008 2009
0 55.360804 68.671626 78.809139
1 50.246485 55.639250 84.483814
2 17.646684 14.386347 87.185550
3 54.824732 91.846018 60.793002
4 24.303751 50.908714 22.084445

Pandas: Rolling sum with multiple indexes (i.e. panel data)

I have a dataframe with multiple index and would like to create a rolling sum of some data, but for each id in the index.
For instance, let us say I have two indexes (Firm and Year) and I have some data with name zdata. The working example is the following:
import pandas as pd
# generating data
firms = ['firm1']*5+['firm2']*5
years = [2000+i for i in range(5)]*2
zdata = [1 for i in range(10)]
# Creating the dataframe
mydf = pd.DataFrame({'firms':firms,'year':years,'zdata':zdata})
# Setting the two indexes
mydf.set_index(['firms','year'],inplace=True)
print(mydf)
zdata
firms year
firm1 2000 1
2001 1
2002 1
2003 1
2004 1
firm2 2000 1
2001 1
2002 1
2003 1
2004 1
And now, I would like to have a rolling sum that starts over for each firm. However, if I type
new_rolling_df=mydf.rolling(window=2).sum()
print(new_rolling_df)
zdata
firms year
firm1 2000 NaN
2001 2.0
2002 2.0
2003 2.0
2004 2.0
firm2 2000 2.0
2001 2.0
2002 2.0
2003 2.0
2004 2.0
It doesn't take into account the multiple index and just make a normal rolling sum. Anyone has an idea how I should do (especially since I have even more indexes than 2 (firm, worker, country, year)
Thanks,
Adrien

Option 1
mydf.unstack(0).rolling(2).sum().stack().swaplevel(0, 1).sort_index()
Option 2
mydf.groupby(level=0, group_keys=False).rolling(2).sum()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas-Add missing years in time series data with duplicate years - python

Or you can try some black magic :P min_year, max_year = df.Year.min(), df.Year.max() df.groupby('County').apply(lambda g: g.set_index("Year").reindex(range(min_year, max_year+1))).drop("County", axis=1).reset_index()

Related

How to iterate over columns and check condition by group

Write python function that sums values from certain rows for each index type using groupby

groupby().mean() don't work under for loop

Reset the index for a pandas DataFrame created from a groupby or pivot?

Pandas: Rolling sum with multiple indexes (i.e. panel data)

Categories

Resources