Group by data to dataframe columns with pandas - python

I have the following DataFrame. Describe the cities where each user lived
City Name Date
0 Seattle Alice 2017
1 Seattle Bob 2011
2 Portland Mallory 2010
3 Seattle Mallory 2016
4 Memphis Bob 2012
5 Portland Mallory 2013
Can you with pandas achieve the following?
Name City1 Date1 City2 Date2 City3 Date3
0 Alice Seattle 2017 NaN NaN NaN NaN
1 Bob Seattle 2011 Memphis 2012 NaN NaN
2 Mallory Portland 2010 Seattle 2016 Portland 2013
Thank you very much!

You can use groupby with custom function where create new DataFrame, then unstack, sort second level of MultiIndex by sort_index and last use join for remove it:
df1 = df.groupby('Name')['City','Date']
.apply(lambda x: pd.DataFrame(x.values,
columns = ['City','Date'],
index = np.arange(1, len(x) + 1).astype(str)))
.unstack()
df1 = df1.sort_index(axis=1, level=1).replace({None:np.nan})
df1.columns = df1.columns.map(''.join)
print (df1)
City1 Date1 City2 Date2 City3 Date3
Name
Alice Seattle 2017 NaN NaN NaN NaN
Bob Seattle 2011 Memphis 2012.0 NaN NaN
Mallory Portland 2010 Seattle 2016.0 Portland 2013.0

Related

column names setup after group by the data in python

My table is as bellowed
datetime source Day area Town County Country
0 2019-01-01 16:22:46 1273 Tuesday Brighton Brighton East Sussex England
1 2019-01-02 09:33:29 1823 Wednesday Taunton Taunton Somerset England
2 2019-01-02 09:44:46 1977 Wednesday Pontefract Pontefract West Yorkshire England
3 2019-01-02 10:01:42 1983 Wednesday Isle of Wight NaN NaN NaN
4 2019-01-02 12:03:13 1304 Wednesday Dover Dover Kent England
My codes are
counts_by_counties = call_by_counties.groupby(['County','Town']).count()
counts_by_counties.head()
My grouped result (Do the column name disappeared?)
datetime source Day area Country
County Town
Aberdeenshire Aberdeen 8 8 8 8 8
Banchory 1 1 1 1 1
Blackburn 18 18 18 18 18
Ellon 6 6 6 6 6
Fraserburgh 2 2 2 2 2
I used this codes to rename the column, I am wondering if there is other efficent way to change the column name.
# slicing of the table
counts_by_counties = counts_by_counties[['datetime',]]
# rename by datetime into Counts
counts_by_counties.rename(columns={'datetime': 'Counts'})
Expected result
Counts
County Town
Aberdeenshire Aberdeen 8
Banchory 1
Blackburn 18
Call reset_index as below.
Replace
counts_by_counties = call_by_counties.groupby(['County','Town']).count()
with
counts_by_counties = call_by_counties.groupby(['County','Town']).count().reset_index()

Set "Year" column to individual columns to create a panel

I am trying to reshape the following dataframe such that it is in panel data form by moving the "Year" column such that each year is an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want the years to each be individual columns, this is an example,
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack but I don't think I want a multilevel index as a result. I have been looking through the documentation at to_frame etc. but I can't see what I am looking for.
If anyone can help that would be great!
Use set_index with append=True then select the column 0 and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
Pivot Table can help.
df2 = pd.pivot_table(df,values='0', columns='AwardYear', index=['State'])
df2
Result:
AwardYear 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0

Given same value in one column, concatenate remaining rows?

Given the pandas DataFrame:
name hobby since
paul A 1995
john A 2005
paul B 2015
mary G 2013
chris E 2005
chris D 2001
paul C 1986
I would like to get:
name hobby1 since1 hobby2 since2 hobby3 since3
paul A 1995 B 2015 C 1986
john A 2005 NaN NaN NaN NaN
mary G 2013 NaN NaN NaN NaN
chris E 2005 D 2001 NaN NaN
I.e. I would like to have one row per name. The maximum number of hobbies a person can have, say 3 in this case, is something I know in advance. What would be the most elegant/short way to do this?
You can first melt and then , groupby.cumcount() to add to the variable and then pivot using pivot_table():
m=df.melt('name')
(m.assign(variable=m.variable+(m.groupby(['name','variable']).cumcount()+1).astype(str))
.pivot_table(index='name',columns='variable',values='value',aggfunc='first')
.rename_axis(None,axis=1))
hobby1 hobby2 hobby3 since1 since2 since3
name
chris E D NaN 2005 2001 NaN
john A NaN NaN 2005 NaN NaN
mary G NaN NaN 2013 NaN NaN
paul A B C 1995 2015 1986
Use cumcount and unstack. Finally, use multiindex.map to join 2-level columns to one level
df1 = df.set_index(['name', df.groupby('name').cumcount().add(1)]) \
.unstack().sort_index(1,level=1)
df1.columns = df1.columns.map('{0[0]}{0[1]}'.format)
Out[812]:
hobby1 since1 hobby2 since2 hobby3 since3
name
chris E 2005.0 D 2001.0 NaN NaN
john A 2005.0 NaN NaN NaN NaN
mary G 2013.0 NaN NaN NaN NaN
paul A 1995.0 B 2015.0 C 1986.0
Maybe something like this? But you would need to rename the columns after with this solution.
df["combined"] = [ "{}_{}".format(x,y) for x,y in zip(df.hobby,df.since)]
df.groupby("name")["combined"]
.agg(lambda x: "_".join(x))
.str.split("_",expand=True)
The result is:
0 1 2 3 4 5
name
chris E 2005 D 2001 None None
john A 2005 None None None None
mary G 2013 None None None None
paul A 1995 B 2015 C 1986

Filling nan of one column with the values of another Python

I have a dataset that has been merged together to fill missing values from one another.
The problem is that I have some columns with missing data that I want to now fill with the values that aren't missing.
The merged data set looks like this for an input:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 nan 1956 nan
Johnson AL 1 nan nan 1956 nan
Johnson AL 2 1 nan 1999 nan
Johnson AL 2 0 nan 1999 nan
Debra AK 1A 0 nan 2000 nan
Debra AK 1B nan 20 nan 1997
Debra AK 2 nan 10 nan 2009
Debra AK 3 nan 1 nan 2008
.
.
What I'd want for an output is this:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 1 1956 1956
Johnson AL 2 1 1 1999 1999
Johnson AL 2 0 0 1999 1999
Debra AK 1A 0 0 2000 2000
Debra AK 1B 20 20 1997 1997
Debra AK 2 10 10 2009 2009
Debra AK 3 1 1 2008 2008
.
.
So I want it so that all nan values are replaced by the associated values in their columns - match Number_x to Number_y and Op_x to Op_y.
One thing to note is that when there are two IDs that are the same sometimes their values will be different; like Johnson with ID = 2 which has different numbers but the same op values. I want to keep these because I need to investigate them more.
Also, if the row has two missing values for Number_x and Number_y I want to take that row out - like Johnson with Number_x and Number_y missing as a nan value.
let us do groupby with axis =1
df.groupby(df.columns.str.split('_').str[0],1).first().dropna(subset=['Number','Op'])
ID Name Number Op State
0 1 Johnson 1.0 1956.0 AL
2 2 Johnson 1.0 1999.0 AL
3 2 Johnson 0.0 1999.0 AL
4 1A Debra 0.0 2000.0 AK
5 1B Debra 20.0 1997.0 AK
6 2 Debra 10.0 2009.0 AK
7 3 Debra 1.0 2008.0 AK

Pandas merge 2 databases based on 2 keys

I am trying to merge 2 pandas df's:
ISO3 Year Calories
AFG 1960 2300
AFG 1961 2323
...
USA 2005 2800
USA 2006 2828
and
ISO3 Year GDP
AFG 1980 3600
AFG 1981 3636
...
USA 2049 10000
USA 2050 10100
I have tried pd.merge(df1,df2,on=['ISO3','Year'],how=outer) and many others, but for some reason it does not work, any help?
pd.merge(df1, df2, on=['ISO3', 'Year'], how='outer') should work just fine. Can you modify the example, or post a concrete example of df1, df2 which is not yielding the correct result?
In [58]: df1 = pd.read_table('data', sep='\s+')
In [59]: df1
Out[59]:
ISO3 Year Calories
0 AFG 1960 2300
1 AFG 1961 2323
2 USA 2005 2800
3 USA 2006 2828
In [60]: df2 = pd.read_table('data2', sep='\s+')
In [61]: df2
Out[61]:
ISO3 Year GDP
0 AFG 1980 3600
1 AFG 1981 3636
2 USA 2049 10000
3 USA 2050 10100
In [62]: pd.merge(df1, df2, on=['ISO3', 'Year'], how='outer')
Out[62]:
ISO3 Year Calories GDP
0 AFG 1960 2300 NaN
1 AFG 1961 2323 NaN
2 USA 2005 2800 NaN
3 USA 2006 2828 NaN
4 AFG 1980 NaN 3600
5 AFG 1981 NaN 3636
6 USA 2049 NaN 10000
7 USA 2050 NaN 10100

Categories