Set "Year" column to individual columns to create a panel - python

I am trying to reshape the following dataframe into panel data form by spreading the "Year" column so that each year becomes an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want each year to be an individual column; here is an example of the shape I am after:
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack, but I don't think I want a multi-level index as a result. I have been looking through the documentation for to_frame etc., but I can't see what I am looking for.
If anyone can help that would be great!

Use set_index with append=True then select the column 0 and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
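If you would rather end up with plain columns and no axis labels left over, a small follow-up works. A minimal sketch, assuming the value column really is the string '0' as in the line above (if it is the integer 0, select [0] instead):
wide = df.set_index('Award Year', append=True)['0'].unstack()

# Remove the 'Award Year' label from the columns and turn 'State' back into a column
wide.columns.name = None
wide = wide.reset_index()
print(wide)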

pivot_table can help:
df2 = pd.pivot_table(df, values='0', columns='Award Year', index=['State'])
df2
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
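For reference, a minimal self-contained sketch of the pivot_table route, using a few rows copied from the question (the string column name '0' is an assumption). Note that pivot_table's default aggfunc is 'mean', which simply returns the stored value here because each State/year pair occurs once:
import pandas as pd

toy = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Wyoming'],
    'Award Year': [2003, 2004, 2011],
    '0': [89, 92, 4],
})
wide = pd.pivot_table(toy, values='0', columns='Award Year', index=['State'])
print(wide)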

Related

Replace missing values based on value of a specific column in Python

I would like to replace missing values based on the values of the column Submitted.
Find below what I have:
Year  Country  Submitted  Age12  Age14
2018  CHI      1          267    NaN
2019  CHI      NaN        NaN    NaN
2020  CHI      1          244    203
2018  ALB      1          163    165
2019  ALB      1          NaN    NaN
2020  ALB      1          161    NaN
2018  GER      1          451    381
2019  GER      NaN        NaN    NaN
2020  GER      1          361    321
And this is what I would like to have:
Year  Country  Submitted  Age12  Age14
2018  CHI      1          267    NaN
2019  CHI      NaN        267    NaN
2020  CHI      1          244    203
2018  ALB      1          163    165
2019  ALB      1          NaN    NaN
2020  ALB      1          161    NaN
2018  GER      1          451    381
2019  GER      NaN        451    381
2020  GER      1          361    321
I tried using the command df.fillna(axis=0, method='ffill')
But this replaces every NaN with the value from the previous row, which is not what I want, because some values should be kept as NaN when the "Submitted" column value is 1.
I would like to replace the values with the previous row's values only if the respective "Submitted" value is NaN.
Thank you
Try using where together with what you did:
df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill'))
This will replace the entries only when Submitted is null.
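A minimal runnable sketch of that approach on a cut-down version of the frame (CHI rows only; the numbers are copied from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2018, 2019, 2020],
    'Country': ['CHI', 'CHI', 'CHI'],
    'Submitted': [1, np.nan, 1],
    'Age12': [267, np.nan, 244],
    'Age14': [np.nan, np.nan, 203],
})

# Keep rows where Submitted is present; elsewhere take the forward-filled values.
# Note that this also forward-fills the Submitted column itself on those rows.
filled = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill'))
print(filled)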
You can do a conditional ffill() using np.where:
import numpy as np

(
    df.assign(Age12=np.where(df.Submitted.isna(), df.Age12.ffill(), df.Age12))
      .assign(Age14=np.where(df.Submitted.isna(), df.Age14.ffill(), df.Age14))
)
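If there are more than two Age columns, the same idea can be written once with a dict comprehension; a sketch reusing the column naming from the question:
import numpy as np

age_cols = [c for c in df.columns if c.startswith('Age')]
df.assign(**{c: np.where(df.Submitted.isna(), df[c].ffill(), df[c]) for c in age_cols})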
You can use .filter() to select the related columns and put the columns in the list cols. Then, use .mask() to change the values of the selected columns by forward fill using ffill() when Submitted is NaN, as follows:
cols = df.filter(like='Age').columns
df[cols] = df[cols].mask(df['Submitted'].isna(), df[cols].ffill())
Result:
print(df)
Year Country Submitted Age12 Age14
0 2018 CHI 1.0 267.0 NaN
1 2019 CHI NaN 267.0 NaN
2 2020 CHI 1.0 244.0 203.0
3 2018 ALB 1.0 163.0 165.0
4 2019 ALB 1.0 NaN NaN
5 2020 ALB 1.0 161.0 NaN
6 2018 GER 1.0 451.0 381.0
7 2019 GER NaN 451.0 381.0
8 2020 GER 1.0 361.0 321.0
I just used a for loop to check and update the values in the dataframe:
import pandas as pd

new_data = [[2018, 'CHI', 1, 267, 30],
            [2019, 'CHI', 'NaN', 'NaN', 'NaN'],
            [2020, 'CHI', 1, 244, 203]]
df = pd.DataFrame(new_data, columns=['Year', 'Country', 'Submitted', 'Age12', 'Age14'])

prevValue12 = df.iloc[0]['Age12']
prevValue14 = df.iloc[0]['Age14']

for index, row in df.iterrows():
    if row['Submitted'] == 'NaN':
        df.at[index, 'Age12'] = prevValue12
        df.at[index, 'Age14'] = prevValue14
    prevValue12 = row['Age12']
    prevValue14 = row['Age14']

print(df)
output
Year Country Submitted Age12 Age14
0 2018 CHI 1 267 30
1 2019 CHI NaN 267 30
2 2020 CHI 1 244 203
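If the frame holds real missing values (np.nan) rather than the literal string 'NaN', the same loop only needs the check swapped; a sketch under that assumption:
import numpy as np
import pandas as pd

new_data = [[2018, 'CHI', 1, 267, 30],
            [2019, 'CHI', np.nan, np.nan, np.nan],
            [2020, 'CHI', 1, 244, 203]]
df = pd.DataFrame(new_data, columns=['Year', 'Country', 'Submitted', 'Age12', 'Age14'])

prev12, prev14 = df.iloc[0]['Age12'], df.iloc[0]['Age14']
for index, row in df.iterrows():
    if pd.isna(row['Submitted']):      # real-NaN check instead of == 'NaN'
        df.at[index, 'Age12'] = prev12
        df.at[index, 'Age14'] = prev14
    prev12, prev14 = row['Age12'], row['Age14']
print(df)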

How to drop subdataframe if it contains more than 40% NaN Pandas

Hello everyone, I have the following problem:
I have panel data for 400,000 objects, and I want to drop an object if its rows contain more than 40% NaNs.
For example:
inn time_reg revenue1 balans1 equity1 opprofit1 \
0 0101000021 2006 457000.0 115000.0 28000.0 29000.0
1 0101000021 2007 1943000.0 186000.0 104000.0 99000.0
2 0101000021 2008 2812000.0 318000.0 223000.0 127000.0
3 0101000021 2009 2673000.0 370000.0 242000.0 39000.0
4 0101000021 2010 3240000.0 435000.0 45000.0 NaN
... ... ... ... ... ... ...
4081810 9909403758 2003 6943000.0 2185000.0 2136000.0 -97000.0
4081811 9909403758 2004 6504000.0 2245000.0 2196000.0 -34000.0
4081812 9909403758 2005 NaN NaN NaN NaN
4081813 9909403758 2006 NaN NaN NaN NaN
4081814 9909403758 2007 NaN NaN NaN NaN
grossprofit1 netprofit1 currentassets1 stliabilities1
0 92000.0 18000.0 105000.0 87000.0
1 189000.0 76000.0 176000.0 82000.0
2 472000.0 119000.0 308000.0 95000.0
3 483000.0 29000.0 360000.0 128000.0
4 NaN 35000.0 NaN NaN
... ... ... ... ...
4081810 2365000.0 -59000.0 253000.0 49000.0
4081811 2278000.0 60000.0 425000.0 49000.0
4081812 NaN NaN NaN NaN
4081813 NaN NaN NaN NaN
4081814 NaN NaN NaN NaN
I have this dataframe, and for each sub-dataframe grouped by (inn, time_reg) I need to drop it if the share of NaNs in the columns (revenue1, balans1, equity1, opprofit1, grossprofit1, netprofit1, currentassets1, stliabilities1) is more than 40%.
I have an idea of how to do it in a loop, but that takes a lot of time.
For example:
inn time_reg revenue1 balans1 equity1 opprofit1 \
4081809 9909403758 2002 6078000.0 2270000.0 2195000.0 -32000.0
4081810 9909403758 2003 6943000.0 2185000.0 2136000.0 -97000.0
4081811 9909403758 2004 6504000.0 2245000.0 2196000.0 -34000.0
4081812 9909403758 2005 NaN NaN NaN NaN
4081813 9909403758 2006 NaN NaN NaN NaN
4081814 9909403758 2007 NaN NaN NaN NaN
grossprofit1 netprofit1 currentassets1 stliabilities1
4081809 1324000.0 NaN 234000.0 75000.0
4081810 2365000.0 -59000.0 253000.0 49000.0
4081811 2278000.0 60000.0 425000.0 49000.0
4081812 NaN NaN NaN NaN
4081813 NaN NaN NaN NaN
4081814 NaN NaN NaN NaN
This sub-dataframe should be dropped, because it contains more than 40% NaNs.
inn time_reg revenue1 balans1 equity1 opprofit1 \
0 0101000021 2006 457000.0 115000.0 28000.0 29000.0
1 0101000021 2007 1943000.0 186000.0 104000.0 99000.0
2 0101000021 2008 2812000.0 318000.0 223000.0 127000.0
3 0101000021 2009 2673000.0 370000.0 242000.0 39000.0
4 0101000021 2010 3240000.0 435000.0 45000.0 NaN
5 0101000021 2011 3480000.0 610000.0 71000.0 NaN
6 0101000021 2012 4820000.0 710000.0 139000.0 149000.0
7 0101000021 2013 5200000.0 790000.0 148000.0 170000.0
8 0101000021 2014 5450000.0 830000.0 155000.0 180000.0
9 0101000021 2015 5620000.0 860000.0 164000.0 189000.0
10 0101000021 2016 5860000.0 885000.0 175000.0 200000.0
11 0101000021 2017 15112000.0 1275000.0 298000.0 323000.0
grossprofit1 netprofit1 currentassets1 stliabilities1
0 92000.0 18000.0 105000.0 87000.0
1 189000.0 76000.0 176000.0 82000.0
2 472000.0 119000.0 308000.0 95000.0
3 483000.0 29000.0 360000.0 128000.0
4 NaN 35000.0 NaN NaN
5 NaN 61000.0 NaN NaN
6 869000.0 129000.0 700000.0 571000.0
7 1040000.0 138000.0 780000.0 642000.0
8 1090000.0 145000.0 820000.0 675000.0
9 1124000.0 154000.0 850000.0 696000.0
10 1172000.0 165000.0 875000.0 710000.0
11 3023000.0 288000.0 1265000.0 977000.0
This sub-dataframe contains less than 40% NaNs and must remain in the final dataframe.
Would a loop still be too slow if you used a numpy/pandas function for the counting? You could use someDataFrame.isnull().sum().sum().
That is probably a lot faster than writing your own loop over all the values in a dataframe, since those libraries tend to have very efficient implementations of those kinds of functions.
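For instance, a minimal sketch for a single object's sub-frame; the inn value comes from the question, and it assumes inn is stored as a string (the leading zeros in the sample suggest it is) and that only the eight financial columns should be counted:
cols = ['revenue1', 'balans1', 'equity1', 'opprofit1',
        'grossprofit1', 'netprofit1', 'currentassets1', 'stliabilities1']

sub = df[df['inn'] == '9909403758']                           # one object's rows
nan_share = sub[cols].isnull().sum().sum() / sub[cols].size   # share of NaN cells
print(nan_share > 0.4)                                        # True -> drop this object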
You can use the filter method of pd.DataFrame.groupby.
This allows you to pass a function that indicates whether a sub-frame should be kept (in this case, whether it contains at most 40% NaNs in the relevant columns). To get that information, you can use numpy to count the NaNs, as in getNanFraction:
import numpy as np

def getNanFraction(df):
    # Share of NaN cells among the value columns (everything except "inn")
    values = df.drop("inn", axis=1).values
    return np.isnan(values).sum() / values.size

df.groupby("inn").filter(lambda x: getNanFraction(x) < 0.4)

Filling nan of one column with the values of another Python

I have a dataset that was produced by merging two sources so that they could fill in each other's missing values.
The problem is that I still have some columns with missing data that I now want to fill with the values that aren't missing.
The merged data set looks like this for an input:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 nan 1956 nan
Johnson AL 1 nan nan 1956 nan
Johnson AL 2 1 nan 1999 nan
Johnson AL 2 0 nan 1999 nan
Debra AK 1A 0 nan 2000 nan
Debra AK 1B nan 20 nan 1997
Debra AK 2 nan 10 nan 2009
Debra AK 3 nan 1 nan 2008
.
.
What I'd want for an output is this:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 1 1956 1956
Johnson AL 2 1 1 1999 1999
Johnson AL 2 0 0 1999 1999
Debra AK 1A 0 0 2000 2000
Debra AK 1B 20 20 1997 1997
Debra AK 2 10 10 2009 2009
Debra AK 3 1 1 2008 2008
.
.
So I want all NaN values to be replaced by the associated values in their paired columns: match Number_x with Number_y and Op_x with Op_y.
One thing to note is that when two IDs are the same, their values will sometimes be different, like Johnson with ID = 2, which has different numbers but the same Op values. I want to keep these because I need to investigate them further.
Also, if a row is missing both Number_x and Number_y, I want to drop that row, like the Johnson row where both are NaN.
Let us do groupby with axis=1:
df.groupby(df.columns.str.split('_').str[0], axis=1).first().dropna(subset=['Number', 'Op'])
ID Name Number Op State
0 1 Johnson 1.0 1956.0 AL
2 2 Johnson 1.0 1999.0 AL
3 2 Johnson 0.0 1999.0 AL
4 1A Debra 0.0 2000.0 AK
5 1B Debra 20.0 1997.0 AK
6 2 Debra 10.0 2009.0 AK
7 3 Debra 1.0 2008.0 AK
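On newer pandas versions, grouping along axis=1 is deprecated, so the same result can be sketched with fillna across the paired columns instead (column names taken from the question):
import pandas as pd

out = pd.DataFrame({
    'Name': df['Name'],
    'State': df['State'],
    'ID': df['ID'],
    'Number': df['Number_x'].fillna(df['Number_y']),
    'Op': df['Op_x'].fillna(df['Op_y']),
}).dropna(subset=['Number', 'Op'])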

How to concat two dataframes in python

I have two data frames. I want to join them so that I can check the quantity of a given week in every year in a single data frame.
df1= City Week qty Year
hyd 35 10 2015
hyd 36 15 2015
hyd 37 11 2015
hyd 42 10 2015
hyd 23 10 2016
hyd 32 15 2016
hyd 37 11 2017
hyd 42 10 2017
pune 35 10 2015
pune 36 15 2015
pune 37 11 2015
pune 42 10 2015
pune 23 10 2016
pune 32 15 2016
pune 37 11 2017
pune 42 10 2017
df2= city Week qty Year
hyd 23 10 2015
hyd 32 15 2015
hyd 35 12 2016
hyd 36 15 2016
hyd 37 11 2016
hyd 42 10 2016
hyd 43 12 2016
hyd 44 18 2016
hyd 35 11 2017
hyd 36 15 2017
hyd 37 11 2017
hyd 42 10 2017
hyd 51 14 2017
hyd 52 17 2017
pune 35 12 2016
pune 36 15 2016
pune 37 11 2016
pune 42 10 2016
pune 43 12 2016
pune 44 18 2016
pune 35 11 2017
pune 36 15 2017
pune 37 11 2017
pune 42 10 2017
pune 51 14 2017
pune 52 17 2017
I want to join the two data frames as shown in the result: append the quantity of that week in every year for each city in a single data frame.
city Week qty Year y2016_wk qty y2017_wk qty y2015_week qty
hyd 35 10 2015 2016_35 12 2017_35 11 nan nan
hyd 36 15 2015 2016_36 15 2017_36 15 nan nan
hyd 37 11 2015 2016_37 11 2017_37 11 nan nan
hyd 42 10 2015 2016_42 10 2017_42 10 nan nan
hyd 23 10 2016 nan nan 2017_23 x 2015_23 10
hyd 32 15 2016 nan nan 2017_32 y 2015_32 15
hyd 37 11 2017 2016_37 11 nan nan 2015_37 x
hyd 42 10 2017 2016_42 10 nan nan 2015_42 y
pune 35 10 2015 2016_35 12 2017_35 11 nan nan
pune 36 15 2015 2016_36 15 2017_36 15 nan nan
pune 37 11 2015 2016_37 11 2017_37 11 nan nan
pune 42 10 2015 2016_42 10 2017_42 10 nan nan
You can break down your task into a few steps:
Combine your dataframes df1 and df2.
Create a list of dataframes from your combined dataframe, splitting by year.
At the same time, rename columns to reflect year, set index to Week.
Finally, concatenate along axis=1 and reset_index.
Here is an example:
df = pd.concat([df1, df2], ignore_index=True)
dfs = [df[df['Year'] == y].rename(columns=lambda x: x + '_' + str(y) if x != 'Week' else x)
         .set_index('Week') for y in df['Year'].unique()]
res = pd.concat(dfs, axis=1).reset_index()
Result:
print(res)
Week qty_2015 Year_2015 qty_2016 Year_2016 qty_2017 Year_2017
0 35 10.0 2015.0 12.0 2016.0 11.0 2017.0
1 36 15.0 2015.0 15.0 2016.0 15.0 2017.0
2 37 11.0 2015.0 11.0 2016.0 11.0 2017.0
3 42 10.0 2015.0 10.0 2016.0 10.0 2017.0
4 43 NaN NaN 12.0 2016.0 NaN NaN
5 44 NaN NaN 18.0 2016.0 NaN NaN
6 51 NaN NaN NaN NaN 14.0 2017.0
7 52 NaN NaN NaN NaN 17.0 2017.0
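If you also need to keep the city (as in the desired output), the same approach can be extended by indexing on both City and Week; a sketch that assumes the city column is named consistently ('City') in both frames:
dfs = [
    df[df['Year'] == y]
      .rename(columns=lambda c: c + '_' + str(y) if c not in ('City', 'Week') else c)
      .set_index(['City', 'Week'])
    for y in df['Year'].unique()
]
res = pd.concat(dfs, axis=1).reset_index()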
Personally I don't think your example output is that readable, so unless you need that format for a specific reason I might consider using a pivot table. I also think the code required is cleaner.
import pandas as pd
df3 = pd.concat([df1, df2], ignore_index=True)
df4 = df3.pivot(index='Week', columns='Year', values='qty')
print(df4)
Year 2015 2016 2017
Week
35 10.0 12.0 11.0
36 15.0 15.0 15.0
37 11.0 11.0 11.0
42 10.0 10.0 10.0
43 NaN 12.0 NaN
44 NaN 18.0 NaN
51 NaN NaN 14.0
52 NaN NaN 17.0
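One caveat: with both cities present in the combined frame, pivoting on Week alone raises a duplicate-entries error, because the same (Week, Year) pair occurs for both hyd and pune. Including the city in the index avoids that; a sketch assuming a consistent 'City' column name:
df4 = df3.pivot_table(index=['City', 'Week'], columns='Year', values='qty')
print(df4)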

Python Pandas pivot with values equal to simple function of specific column

import pandas as pd
olympics = pd.read_csv('olympics.csv')
Edition NOC Medal
0 1896 AUT Silver
1 1896 FRA Gold
2 1896 GER Gold
3 1900 HUN Bronze
4 1900 GBR Gold
5 1900 DEN Bronze
6 1900 USA Gold
7 1900 FRA Bronze
8 1900 FRA Silver
9 1900 USA Gold
10 1900 FRA Silver
11 1900 GBR Gold
12 1900 SUI Silver
13 1900 ZZX Gold
14 1904 HUN Gold
15 1904 USA Bronze
16 1904 USA Gold
17 1904 USA Silver
18 1904 CAN Gold
19 1904 USA Silver
I can pivot the data frame to get an aggregate value:
pivot = olympics.pivot_table(index='Edition', columns='NOC', values='Medal', aggfunc='count')
NOC AUT CAN DEN FRA GBR GER HUN SUI USA ZZX
Edition
1896 1.0 NaN NaN 1.0 NaN 1.0 NaN NaN NaN NaN
1900 NaN NaN 1.0 3.0 2.0 NaN 1.0 1.0 2.0 1.0
1904 NaN 1.0 NaN NaN NaN NaN 1.0 NaN 4.0 NaN
Rather than having the total number of medals in values=, I am interested in having a tuple (a triple) of (#Gold, #Silver, #Bronze), with (0, 0, 0) in place of NaN.
How do I do that succinctly and elegantly?
No need to use pivot_table, as a plain pivot is perfectly fine with a tuple for a value:
Use value_counts to count all medals.
Create a multi-index for all combinations of editions, countries and medals.
reindex with fill_value=0.
counts = df.groupby(['Edition', 'NOC']).Medal.value_counts()
mux = pd.MultiIndex.from_product(
    [c.values for c in counts.index.levels], names=counts.index.names)
counts = counts.reindex(mux, fill_value=0).unstack('Medal')
counts = counts[['Bronze', 'Silver', 'Gold']]
pd.Series([tuple(l) for l in counts.values.tolist()], counts.index).unstack()
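An alternative sketch along the same lines, using crosstab for the counting and ordering the tuple as (Gold, Silver, Bronze) as in the question; this is a variant, not the code above:
import pandas as pd

ct = pd.crosstab([olympics['Edition'], olympics['NOC']], olympics['Medal'])

# Build every Edition/NOC combination so missing pairs become (0, 0, 0)
full = pd.MultiIndex.from_product(ct.index.levels, names=ct.index.names)
ct = ct.reindex(full, fill_value=0)[['Gold', 'Silver', 'Bronze']]

# Pack the three counts into a tuple per cell and pivot NOC into columns
result = ct.apply(tuple, axis=1).unstack('NOC')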
