I have the following DataFrame:
import pandas as pd

data = {'year': [2010, 2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2013],
        'store_number': ['1944', '1945', '1946', '1947', '1948', '1949', '1947', '1948', '1949', '1947'],
        'retailer_name': ['Walmart', 'Walmart', 'CRV', 'CRV', 'CRV', 'Walmart', 'Walmart', 'CRV', 'CRV', 'CRV'],
        'product': ['a', 'b', 'a', 'a', 'b', 'a', 'b', 'a', 'a', 'c'],
        'amount': [5, 5, 8, 6, 1, 5, 10, 6, 12, 11]}
stores = pd.DataFrame(data, columns=['retailer_name', 'store_number', 'year', 'product', 'amount'])
stores.set_index(['retailer_name', 'store_number', 'year', 'product'], inplace=True)
stores.groupby(level=[0, 1, 2, 3]).sum()
I want to transform the following DataFrame:
amount
retailer_name store_number year product
CRV 1946 2011 a 8
1947 2012 a 6
2013 c 11
1948 2011 a 6
b 1
1949 2012 a 12
Walmart 1944 2010 a 5
1945 2010 b 5
1947 2010 b 10
1949 2012 a 5
into a DataFrame of rows:
retailer_name store_number year a b c
CRV 1946 2011 8 0 0
CRV 1947 2012 6 0 0
etc...
The products are known ahead of time.
Any idea how to do this?
Please see below for the solution. Thanks to EdChum for corrections to the original post.
Without reset_index()
stores.groupby(level=[0, 1, 2, 3]).sum().unstack().fillna(0)
amount
product a b c
retailer_name store_number year
CRV 1946 2011 8 0 0
1947 2012 6 0 0
2013 0 0 11
1948 2011 6 1 0
1949 2012 12 0 0
Walmart 1944 2010 5 0 0
1945 2010 0 5 0
1947 2010 0 10 0
1949 2012 5 0 0
With reset_index()
stores.groupby(level=[0, 1, 2, 3]).sum().unstack().reset_index().fillna(0)
retailer_name store_number year amount
product a b c
0 CRV 1946 2011 8 0 0
1 CRV 1947 2012 6 0 0
2 CRV 1947 2013 0 0 11
3 CRV 1948 2011 6 1 0
4 CRV 1949 2012 12 0 0
5 Walmart 1944 2010 5 0 0
6 Walmart 1945 2010 0 5 0
7 Walmart 1947 2010 0 10 0
8 Walmart 1949 2012 5 0 0
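As a follow-up, the reset_index() variant keeps a two-level column header (('amount', 'a'), ('retailer_name', ''), and so on). A minimal sketch for flattening it into plain column names, assuming the result above is stored in a hypothetical variable out:
out = stores.groupby(level=[0, 1, 2, 3]).sum().unstack().reset_index().fillna(0)
# Keep the product label where it exists, otherwise fall back to the first level name.
out.columns = [prod if prod else name for name, prod in out.columns]
print(out)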
Unstack product from the index and fill NaN values with zero.
df = stores.groupby(level=[0, 1, 2, 3]).sum().unstack('product')
mask = pd.IndexSlice['amount', :]
df.loc[:, mask] = df.loc[:, mask].fillna(0)
>>> df
amount
product a b c
retailer_name store_number year
CRV 1946 2011 8 0 0
1947 2012 6 0 0
2013 0 0 11
1948 2011 6 1 0
1949 2012 12 0 0
Walmart 1944 2010 5 0 0
1945 2010 0 5 0
1947 2010 0 10 0
1949 2012 5 0 0
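For completeness, a similar table can also be sketched with pivot_table on the flat data instead of unstack; note that the 'amount' column level is dropped because values is passed as a single column name:
stores.reset_index().pivot_table(index=['retailer_name', 'store_number', 'year'],
                                 columns='product',
                                 values='amount',
                                 aggfunc='sum',
                                 fill_value=0)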
Related
Is it possible to convert a DataFrame in Pandas like this into a time series where each year sits behind the last one?
This is likely what df.unstack() is meant for.
import numpy as np
import pandas as pd

np.random.seed(111)  # reproducibility
df = pd.DataFrame(
    data={
        "2009": np.random.randn(12),
        "2010": np.random.randn(12),
        "2011": np.random.randn(12),
    },
    index=range(1, 13)
)
print(df)
Out[45]:
2009 2010 2011
1 -1.133838 -1.440585 0.570594
2 0.384319 0.773703 0.915420
3 1.496554 -1.027967 -1.669341
4 -0.355382 -0.090986 0.482714
5 -0.787534 0.492003 -0.310473
6 -0.459439 0.424672 2.394690
7 -0.059169 1.283049 1.550931
8 -0.354174 0.315986 -0.646465
9 -0.735523 -0.408082 -0.928937
10 -1.183940 -0.067948 -1.654976
11 0.238894 -0.952427 0.350193
12 -0.589920 -0.110677 -0.141757
df_out = df.unstack().reset_index()
df_out.columns = ["year", "month", "value"]
print(df_out)
Out[46]:
year month value
0 2009 1 -1.133838
1 2009 2 0.384319
2 2009 3 1.496554
3 2009 4 -0.355382
4 2009 5 -0.787534
5 2009 6 -0.459439
6 2009 7 -0.059169
7 2009 8 -0.354174
8 2009 9 -0.735523
9 2009 10 -1.183940
10 2009 11 0.238894
11 2009 12 -0.589920
12 2010 1 -1.440585
13 2010 2 0.773703
14 2010 3 -1.027967
15 2010 4 -0.090986
16 2010 5 0.492003
17 2010 6 0.424672
18 2010 7 1.283049
19 2010 8 0.315986
20 2010 9 -0.408082
21 2010 10 -0.067948
22 2010 11 -0.952427
23 2010 12 -0.110677
24 2011 1 0.570594
25 2011 2 0.915420
26 2011 3 -1.669341
27 2011 4 0.482714
28 2011 5 -0.310473
29 2011 6 2.394690
30 2011 7 1.550931
31 2011 8 -0.646465
32 2011 9 -0.928937
33 2011 10 -1.654976
34 2011 11 0.350193
35 2011 12 -0.141757
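As a side note, the same long format can also be reached with melt after moving the month index into a column (the column order differs slightly); a minimal sketch, assuming the df built above:
df_long = (df.rename_axis('month')
             .reset_index()
             .melt(id_vars='month', var_name='year', value_name='value'))
print(df_long.head())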
If I have an id with a start date of 1988, an end date of 2018, and a value of 21100, how can I create an array or DataFrame of the years from 1988 to 2018, i.e. (1988, 1989, 1990, ..., 2018), with each year set to the same value of 21100?
So I basically want something that looks like:
date, id1, id2
1988, 21100, 0
1989, 21100, 0
1990, 21100, 0
...
1994, 21100, 4598
...
2013, 21100, 4598
...
2018, 21100, 0
How could I do this? I want the array to start populating the value at the start date and to stop populating at the end date. I have multiple ids (268) and I want to loop through each of them, adding a new column (id2, id3, ..., id268). So, for example, id2 runs from 1994 to 2013 with a value of 4598.
EDIT:
example = pd.DataFrame({
'id': ['id1', 'id2', 'id3', 'id4'],
'start date': ['1988', '1988', '2000', '2005'],
'end date': ['2018', '2013', '2005', '2017'],
'value': [2100, 4568, 7896, 68909]
})
print (example)
id start date end date value
0 id1 1988 2018 2100
1 id2 1988 2013 4568
2 id3 2000 2005 7896
3 id4 2005 2017 68909
You can create a Series for each id in a list comprehension and join them together with concat, replace missing values, convert to integers, and finally convert the index to a date column:
L = [pd.Series(v, index=range(int(s), int(e)+1)) for s,e,v in
zip(example['start date'], example['end date'], example['value'])]
df1 = (pd.concat(L, axis=1, keys=example['id'])
.fillna(0)
.astype(int)
.rename_axis('date')
.reset_index())
print (df1)
id date id1 id2 id3 id4
0 1988 2100 4568 0 0
1 1989 2100 4568 0 0
2 1990 2100 4568 0 0
3 1991 2100 4568 0 0
4 1992 2100 4568 0 0
5 1993 2100 4568 0 0
6 1994 2100 4568 0 0
7 1995 2100 4568 0 0
8 1996 2100 4568 0 0
9 1997 2100 4568 0 0
10 1998 2100 4568 0 0
11 1999 2100 4568 0 0
12 2000 2100 4568 7896 0
13 2001 2100 4568 7896 0
14 2002 2100 4568 7896 0
15 2003 2100 4568 7896 0
16 2004 2100 4568 7896 0
17 2005 2100 4568 7896 68909
18 2006 2100 4568 0 68909
19 2007 2100 4568 0 68909
20 2008 2100 4568 0 68909
21 2009 2100 4568 0 68909
22 2010 2100 4568 0 68909
23 2011 2100 4568 0 68909
24 2012 2100 4568 0 68909
25 2013 2100 4568 0 68909
26 2014 2100 0 0 68909
27 2015 2100 0 0 68909
28 2016 2100 0 0 68909
29 2017 2100 0 0 68909
30 2018 2100 0 0 0
Use the DataFrame constructor with range:
start = 1988
end = 2019
val = 21100
df = pd.DataFrame({'date':range(start, end),
'id1': val})
print (df)
date id1
0 1988 21100
1 1989 21100
2 1990 21100
3 1991 21100
4 1992 21100
5 1993 21100
6 1994 21100
7 1995 21100
8 1996 21100
9 1997 21100
10 1998 21100
11 1999 21100
12 2000 21100
13 2001 21100
14 2002 21100
15 2003 21100
16 2004 21100
17 2005 21100
18 2006 21100
19 2007 21100
20 2008 21100
21 2009 21100
22 2010 21100
23 2011 21100
24 2012 21100
25 2013 21100
26 2014 21100
27 2015 21100
28 2016 21100
29 2017 21100
30 2018 21100
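If you then need one column per id, a minimal loop-based sketch over the example frame from the EDIT above (an alternative to the concat approach shown earlier):
years = range(1988, 2019)
out = pd.DataFrame({'date': years})
for _, row in example.iterrows():
    start, end = int(row['start date']), int(row['end date'])
    # Fill the value inside [start, end], zero elsewhere.
    out[row['id']] = [row['value'] if start <= y <= end else 0 for y in years]
print(out.head())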
I have a database in panel data form:
Date id variable1 variable2
2015 1 10 200
2016 1 17 300
2017 1 8 400
2018 1 11 500
2015 2 12 150
2016 2 19 350
2017 2 15 250
2018 2 9 450
2015 3 20 100
2016 3 8 220
2017 3 12 310
2018 3 14 350
And I have a list with the labels for the ids:
List = ['Argentina', 'Brazil','Chile']
I want to replace the values of id with the labels from my list, to get the following. Thanks in advance.
Date id variable1 variable2
2015 Argentina 10 200
2016 Argentina 17 300
2017 Argentina 8 400
2018 Argentina 11 500
2015 Brazil 12 150
2016 Brazil 19 350
2017 Brazil 15 250
2018 Brazil 9 450
2015 Chile 20 100
2016 Chile 8 220
2017 Chile 12 310
2018 Chile 14 350
map is the way to go, with enumerate:
d = {k:v for k,v in enumerate(List, start=1)}
df['id'] = df['id'].map(d)
Output:
Date id variable1 variable2
0 2015 Argentina 10 200
1 2016 Argentina 17 300
2 2017 Argentina 8 400
3 2018 Argentina 11 500
4 2015 Brazil 12 150
5 2016 Brazil 19 350
6 2017 Brazil 15 250
7 2018 Brazil 9 450
8 2015 Chile 20 100
9 2016 Chile 8 220
10 2017 Chile 12 310
11 2018 Chile 14 350
Try
df['id'] = df['id'].map({1: 'Argentina', 2: 'Brazil', 3: 'Chile'})
or
df['id'] = df['id'].map({k+1: v for k, v in enumerate(List)})
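One caveat: map turns any id that is missing from the dictionary into NaN. If you would rather leave unmapped ids untouched, a possible sketch is replace instead:
# Values absent from the dictionary are kept as-is by replace.
df['id'] = df['id'].replace({k + 1: v for k, v in enumerate(List)})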
I have a dataframe consisting of Year, Month, Temperature. Now, I need to create seasonal means, such as DJF (Dec, Jan, Feb), MAM (Mar, Apr, May), JJA (Jun, Jul, Aug), SON (Sep, Oct, Nov).
But how can I take into account that each DJF mean should combine December from one year with January and February of the following year?
This is the code I have so far:
z = {1: 'DJF', 2: 'DJF', 3: 'MAM', 4: 'MAM', 5: 'MAM', 6: 'JJA', 7: 'JJA', 8: 'JJA', 9: 'SON', 10: 'SON',
11: 'SON', 12: 'DJF'}
df['season'] = df['Mon'].map(z)
The problem with the above code is that when I group by year and season to calculate the means, the values for DJF will be incorrect, since they take Dec, Jan, and Feb of the same year.
df.groupby(['Year','season']).mean()
I think you can create a PeriodIndex with to_datetime and to_period,
then shift by one month and convert to quarters with asfreq.
Finally, group by the index and aggregate the mean:
df['Day'] = 1
df.index = pd.to_datetime(df[['Year','Month','Day']]).dt.to_period('M')
df = df.shift(1, freq='M').asfreq('Q')
print (df.groupby(level=0)['Temperature'].mean())
Sample:
rng = pd.date_range('2017-04-03', periods=20, freq='M')
df = pd.DataFrame({'Date': rng, 'Temperature': range(20)})
df['Year'] = df.Date.dt.year
df['Month'] = df.Date.dt.month
df = df.drop('Date', axis=1)
print (df)
Temperature Year Month
0 0 2017 4
1 1 2017 5
2 2 2017 6
3 3 2017 7
4 4 2017 8
5 5 2017 9
6 6 2017 10
7 7 2017 11
8 8 2017 12
9 9 2018 1
10 10 2018 2
11 11 2018 3
12 12 2018 4
13 13 2018 5
14 14 2018 6
15 15 2018 7
16 16 2018 8
17 17 2018 9
18 18 2018 10
19 19 2018 11
df['Day'] = 1
df.index = pd.to_datetime(df[['Year','Month','Day']]).dt.to_period('M')
df = df.shift(1, freq='M').asfreq('Q')
print (df)
Temperature Year Month Day
2017Q2 0 2017 4 1
2017Q2 1 2017 5 1
2017Q3 2 2017 6 1
2017Q3 3 2017 7 1
2017Q3 4 2017 8 1
2017Q4 5 2017 9 1
2017Q4 6 2017 10 1
2017Q4 7 2017 11 1
2018Q1 8 2017 12 1
2018Q1 9 2018 1 1
2018Q1 10 2018 2 1
2018Q2 11 2018 3 1
2018Q2 12 2018 4 1
2018Q2 13 2018 5 1
2018Q3 14 2018 6 1
2018Q3 15 2018 7 1
2018Q3 16 2018 8 1
2018Q4 17 2018 9 1
2018Q4 18 2018 10 1
2018Q4 19 2018 11 1
print (df.groupby(level=0)['Temperature'].mean())
2017Q2 0.5
2017Q3 3.0
2017Q4 6.0
2018Q1 9.0
2018Q2 12.0
2018Q3 15.0
2018Q4 18.0
Freq: Q-DEC, Name: Temperature, dtype: float64
And finally, if you need a season column:
df1 = df.groupby(level=0)['Temperature'].mean().rename_axis('per').reset_index()
z = {1: 'DJF',2: 'MAM', 3: 'JJA', 4: 'SON'}
df1['season'] = df1['per'].dt.quarter.map(z)
df1['year'] = df1['per'].dt.year
print (df1)
per Temperature season year
0 2017Q2 0.5 MAM 2017
1 2017Q3 3.0 JJA 2017
2 2017Q4 6.0 SON 2017
3 2018Q1 9.0 DJF 2018
4 2018Q2 12.0 MAM 2018
5 2018Q3 15.0 JJA 2018
6 2018Q4 18.0 SON 2018
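An alternative sketch that avoids the PeriodIndex entirely: keep the month-to-season map from the question and count December towards the following year's DJF by incrementing its year before grouping (assuming the sample columns Year, Month and Temperature from above):
z = {1: 'DJF', 2: 'DJF', 3: 'MAM', 4: 'MAM', 5: 'MAM', 6: 'JJA', 7: 'JJA',
     8: 'JJA', 9: 'SON', 10: 'SON', 11: 'SON', 12: 'DJF'}
df['season'] = df['Month'].map(z)
# December belongs to the DJF season of the following year.
df['season_year'] = df['Year'] + (df['Month'] == 12)
print(df.groupby(['season_year', 'season'])['Temperature'].mean())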
I have the following DataFrame:
data = {'year': [2010, 2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2013],
'store_number': ['1944', '1945', '1946', '1947', '1948', '1949', '1947', '1948', '1949', '1947'],
'retailer_name': ['Walmart', 'Walmart', 'CRV', 'CRV', 'CRV', 'Walmart', 'Walmart', 'CRV', 'CRV', 'CRV'],
'month': [1, 12, 3, 11, 10, 9, 5, 5, 4, 3],
'amount': [5, 5, 8, 6, 1, 5, 10, 6, 12, 11]}
stores = pd.DataFrame(data, columns=['retailer_name', 'store_number', 'year', 'month', 'amount'])
stores.set_index(['retailer_name', 'store_number', 'year', 'month'], inplace=True)
That looks like:
amount
retailer_name store_number year month
Walmart 1944 2010 1 5
1945 2010 12 5
CRV 1946 2011 3 8
1947 2012 11 6
1948 2011 10 1
Walmart 1949 2012 9 5
1947 2010 5 10
CRV 1948 2011 5 6
1949 2012 4 12
1947 2013 3 11
How can I sort the groups:
stores_g = stores.groupby(level=0)
by 'year' and 'month' in decreasing order?
You can use sort_index on specific index levels and specify whether the ordering should be ascending or not:
In [148]:
stores.sort_index(level=['year','month'], ascending=False)
Out[148]:
amount
retailer_name store_number year month
CRV 1947 2013 3 11
2012 11 6
Walmart 1949 2012 9 5
CRV 1949 2012 4 12
1948 2011 10 1
5 6
1946 2011 3 8
Walmart 1945 2010 12 5
1947 2010 5 10
1944 2010 1 5
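If the intent is instead to keep each retailer's rows together and only sort 'year' and 'month' in decreasing order within every group, a possible sketch (per-level ascending flags work in recent pandas versions):
stores.sort_index(level=['retailer_name', 'year', 'month'],
                  ascending=[True, False, False])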