Filling NaN in one column with the values of another column (Python)

I have a dataset that was created by merging two tables so that each could fill in the other's missing values.
The problem is that some columns still have missing data, which I now want to fill with the non-missing values from the matching column.
The merged dataset looks like this:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 nan 1956 nan
Johnson AL 1 nan nan 1956 nan
Johnson AL 2 1 nan 1999 nan
Johnson AL 2 0 nan 1999 nan
Debra AK 1A 0 nan 2000 nan
Debra AK 1B nan 20 nan 1997
Debra AK 2 nan 10 nan 2009
Debra AK 3 nan 1 nan 2008
.
.
What I'd want for an output is this:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 1 1956 1956
Johnson AL 2 1 1 1999 1999
Johnson AL 2 0 0 1999 1999
Debra AK 1A 0 0 2000 2000
Debra AK 1B 20 20 1997 1997
Debra AK 2 10 10 2009 2009
Debra AK 3 1 1 2008 2008
.
.
So I want it so that all nan values are replaced by the associated values in their columns - match Number_x to Number_y and Op_x to Op_y.
One thing to note is that when there are two IDs that are the same sometimes their values will be different; like Johnson with ID = 2 which has different numbers but the same op values. I want to keep these because I need to investigate them more.
Also, if the row has two missing values for Number_x and Number_y I want to take that row out - like Johnson with Number_x and Number_y missing as a nan value.

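You can group the columns by their prefix with groupby on axis=1, take the first non-null value in each group, and then drop the rows where the merged value is still missing. Note that this collapses each _x/_y pair into a single column instead of keeping both. In recent pandas versions groupby(axis=1) is deprecated; transposing first (df.T.groupby(...).first().T) is the usual substitute.
df.groupby(df.columns.str.split('_').str[0], axis=1).first().dropna(subset=['Number','Op'])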
ID Name Number Op State
0 1 Johnson 1.0 1956.0 AL
2 2 Johnson 1.0 1999.0 AL
3 2 Johnson 0.0 1999.0 AL
4 1A Debra 0.0 2000.0 AK
5 1B Debra 20.0 1997.0 AK
6 2 Debra 10.0 2009.0 AK
7 3 Debra 1.0 2008.0 AK
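If both columns of each pair should be kept, as in the asker's expected output, one possible alternative is Series.combine_first. This is only a sketch: the DataFrame below is rebuilt by hand from the sample in the question, and the variable names out and filled are mine, not from the original answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question (an assumption about the real dtypes).
df = pd.DataFrame({
    "Name": ["Johnson"] * 4 + ["Debra"] * 4,
    "State": ["AL"] * 4 + ["AK"] * 4,
    "ID": ["1", "1", "2", "2", "1A", "1B", "2", "3"],
    "Number_x": [1, np.nan, 1, 0, 0, np.nan, np.nan, np.nan],
    "Number_y": [np.nan, np.nan, np.nan, np.nan, np.nan, 20, 10, 1],
    "Op_x": [1956, 1956, 1999, 1999, 2000, np.nan, np.nan, np.nan],
    "Op_y": [np.nan, np.nan, np.nan, np.nan, np.nan, 1997, 2009, 2008],
})

out = df.copy()
for base in ["Number", "Op"]:
    x, y = f"{base}_x", f"{base}_y"
    # Take the _x value where present, otherwise fall back to the _y value.
    filled = out[x].combine_first(out[y])
    out[x] = filled
    out[y] = filled

# Drop rows where Number was missing on both sides, then renumber the rows.
out = out.dropna(subset=["Number_x", "Number_y"], how="all").reset_index(drop=True)
print(out)
```

Rows with the same ID but different values (Johnson, ID 2) are left untouched, so they remain available for further investigation.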

Related

column names setup after group by the data in python

My table is shown below:
datetime source Day area Town County Country
0 2019-01-01 16:22:46 1273 Tuesday Brighton Brighton East Sussex England
1 2019-01-02 09:33:29 1823 Wednesday Taunton Taunton Somerset England
2 2019-01-02 09:44:46 1977 Wednesday Pontefract Pontefract West Yorkshire England
3 2019-01-02 10:01:42 1983 Wednesday Isle of Wight NaN NaN NaN
4 2019-01-02 12:03:13 1304 Wednesday Dover Dover Kent England
My code is:
counts_by_counties = call_by_counties.groupby(['County','Town']).count()
counts_by_counties.head()
My grouped result (did the column names disappear?):
datetime source Day area Country
County Town
Aberdeenshire Aberdeen 8 8 8 8 8
Banchory 1 1 1 1 1
Blackburn 18 18 18 18 18
Ellon 6 6 6 6 6
Fraserburgh 2 2 2 2 2
I used this code to rename the column. Is there a more efficient way to change the column name?
# keep only the datetime column
counts_by_counties = counts_by_counties[['datetime']]
# rename datetime to Counts (rename returns a new DataFrame, so assign the result)
counts_by_counties = counts_by_counties.rename(columns={'datetime': 'Counts'})
Expected result
Counts
County Town
Aberdeenshire Aberdeen 8
Banchory 1
Blackburn 18
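County and Town have not disappeared; groupby moved them into the index. Call reset_index to turn them back into regular columns.
Replace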
counts_by_counties = call_by_counties.groupby(['County','Town']).count()
with
counts_by_counties = call_by_counties.groupby(['County','Town']).count().reset_index()
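If only a single Counts column is needed, a possible shortcut is to count rows with size() and name the result while resetting the index. This is a sketch on a cut-down copy of the sample table (only three of the original columns are rebuilt here). Note that size() counts rows per group, whereas count() counts non-null values per column; the two agree for a column that has no missing values.

```python
import numpy as np
import pandas as pd

# Cut-down version of the sample table from the question.
call_by_counties = pd.DataFrame({
    "source": [1273, 1823, 1977, 1983, 1304],
    "Town": ["Brighton", "Taunton", "Pontefract", np.nan, "Dover"],
    "County": ["East Sussex", "Somerset", "West Yorkshire", np.nan, "Kent"],
})

# size() gives one number per group; reset_index(name=...) names that column
# and turns County and Town back into ordinary columns in a single step.
counts_by_counties = (
    call_by_counties.groupby(["County", "Town"])
    .size()
    .reset_index(name="Counts")
)
print(counts_by_counties)
```

Rows whose County or Town is NaN are dropped by groupby by default, which is why the Isle of Wight row does not appear in the result.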

Set "Year" column to individual columns to create a panel

I am trying to reshape the following dataframe such that it is in panel data form by moving the "Year" column such that each year is an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want each year to be its own column, for example:
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack but I don't think I want a multilevel index as a result. I have been looking through the documentation at to_frame etc. but I can't see what I am looking for.
If anyone can help that would be great!
Use set_index with append=True then select the column 0 and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
A pivot table can help. The column name must match the DataFrame exactly ('Award Year', with a space):
df2 = pd.pivot_table(df, values='0', columns='Award Year', index=['State'])
df2
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
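Both answers leave State as the index and keep "Award Year" as the name of the column axis. To get the flat layout shown in the question, the axis name can be cleared and the index reset. The following is a sketch that assumes the value column is named with the string '0', as in the answers above, and uses only a few rows taken from the question's output.

```python
import pandas as pd

# A few rows copied from the question; State is the index, as in the original frame.
df = pd.DataFrame(
    {"Award Year": [2003, 2004, 2011, 2012], "0": [89, 92, 4, 2]},
    index=pd.Index(["Alabama", "Alabama", "Wyoming", "Wyoming"], name="State"),
)

wide = (
    df.reset_index()
    .pivot(index="State", columns="Award Year", values="0")
    .rename_axis(None, axis=1)  # drop the "Award Year" label on the column axis
    .reset_index()              # make State an ordinary column again
)
print(wide)
```

pivot raises an error if a (State, Award Year) pair occurs more than once; in that case pivot_table with an explicit aggfunc is the safer choice.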

Given same value in one column, concatenate remaining rows?

Given the pandas DataFrame:
name hobby since
paul A 1995
john A 2005
paul B 2015
mary G 2013
chris E 2005
chris D 2001
paul C 1986
I would like to get:
name hobby1 since1 hobby2 since2 hobby3 since3
paul A 1995 B 2015 C 1986
john A 2005 NaN NaN NaN NaN
mary G 2013 NaN NaN NaN NaN
chris E 2005 D 2001 NaN NaN
I.e. I would like to have one row per name. The maximum number of hobbies a person can have, say 3 in this case, is something I know in advance. What would be the most elegant/short way to do this?
You can first melt, then use groupby.cumcount() to number the repeated variables, and finally pivot with pivot_table():
m=df.melt('name')
(m.assign(variable=m.variable+(m.groupby(['name','variable']).cumcount()+1).astype(str))
.pivot_table(index='name',columns='variable',values='value',aggfunc='first')
.rename_axis(None,axis=1))
hobby1 hobby2 hobby3 since1 since2 since3
name
chris E D NaN 2005 2001 NaN
john A NaN NaN 2005 NaN NaN
mary G NaN NaN 2013 NaN NaN
paul A B C 1995 2015 1986
Use cumcount and unstack. Finally, use MultiIndex.map to join the two column levels into one:
df1 = df.set_index(['name', df.groupby('name').cumcount().add(1)]) \
.unstack().sort_index(axis=1, level=1)
df1.columns = df1.columns.map('{0[0]}{0[1]}'.format)
Out[812]:
hobby1 since1 hobby2 since2 hobby3 since3
name
chris E 2005.0 D 2001.0 NaN NaN
john A 2005.0 NaN NaN NaN NaN
mary G 2013.0 NaN NaN NaN NaN
paul A 1995.0 B 2015.0 C 1986.0
Maybe something like this? With this solution you would need to rename the columns afterwards.
df["combined"] = ["{}_{}".format(x, y) for x, y in zip(df.hobby, df.since)]
(df.groupby("name")["combined"]
.agg(lambda x: "_".join(x))
.str.split("_", expand=True))
The result is:
0 1 2 3 4 5
name
chris E 2005 D 2001 None None
john A 2005 None None None None
mary G 2013 None None None None
paul A 1995 B 2015 C 1986

Calculate Number of Rows containing NaN values

I have a DataFrame df, shown below, and I need to count the number of rows that contain NaN values.
Name Age City Country
0 jack NaN Sydeny Australia
1 Riti NaN Delhi India
2 Vikas 31 NaN India
3 Neelu 32 Bangalore India
4 Steve 16 New York US
5 John 11 NaN NaN
6 NaN NaN NaN NaN
To get the answer I tried
df.isnull().sum().sum()
This gives me 9 because it counts every NaN value, but the answer should be 5, the number of rows that contain at least one NaN. I do not know how to calculate this.
You need df.any() over axis=1 after you check isnull():
df.isnull().any(axis=1).sum()
#5
Here is a step-by-step example.
Example DataFrame:
>>> df
Name Age City Country
0 jack NaN Sydeny Australia
1 Riti NaN Delhi India
2 Vikas 31.0 NaN India
3 Neelu 32.0 Bangalore India
4 John 16.0 New York US
5 John 11.0 NaN NaN
6 NaN NaN NaN NaN
To flag the rows that contain NaN with a boolean:
>>> df.isnull().any(axis=1)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
dtype: bool
To get the index labels of the rows where NaN appears:
>>> df.index[df.isnull().any(axis=1)]
Int64Index([0, 1, 2, 5, 6], dtype='int64')
Finally, the answer directly:
>>> df.isnull().any(axis=1).sum()
5
OR
>>> len(df.index[df.isnull().any(axis=1)])
5
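To make the difference between counting NaN cells and counting rows with NaN concrete, here is a small self-contained sketch. The DataFrame is rebuilt by hand from the question's table, and the variable names are mine.

```python
import numpy as np
import pandas as pd

# Table rebuilt by hand from the question.
df = pd.DataFrame({
    "Name": ["jack", "Riti", "Vikas", "Neelu", "Steve", "John", np.nan],
    "Age": [np.nan, np.nan, 31, 32, 16, 11, np.nan],
    "City": ["Sydeny", "Delhi", np.nan, "Bangalore", "New York", np.nan, np.nan],
    "Country": ["Australia", "India", "India", "India", "US", np.nan, np.nan],
})

# Counts every missing cell in the frame.
total_nan_cells = int(df.isna().sum().sum())

# Counts each row once if it has at least one missing cell.
rows_with_nan = int(df.isna().any(axis=1).sum())

# Complementary check: rows that survive dropna() have no missing cells.
rows_without_nan = len(df.dropna())

print(total_nan_cells, rows_with_nan, rows_without_nan)
```

The first number matches the 9 the asker saw, the second is the 5 they were after, and the last two add up to the number of rows in the frame.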

Adding column in pandas based on values from other columns with conditions

I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sales and NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment I group on the unit, keep the units that have several rows, and extract the information associated with the minimal date for each of them. I then join this table with my original table, keeping only the rows whose dates differ between the two merged tables.
I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix, then join to append the new DataFrame to the original:
#if real data are not sorted
#df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print (df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
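For reference, a self-contained version of the same idea on the sample data, as a sketch. The only changes from the answer are that the shifted columns are selected explicitly, so the result does not depend on how the grouping column is handled, and that the intermediate names prev and result are mine.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "unit": [1, 1, 2, 2, 3, 3, 4],
    "year": [2018, 2013, 2015, 2015, 2017, 2002, 2016],
    "month": [6, 4, 10, 2, 4, 6, 1],
    "price": [100, 70, 80, 110, 120, 90, 55],
})

# Newest sale first within each unit, so the "previous" sale is the next row.
df = df.sort_values(["unit", "year", "month"], ascending=[True, False, False])

# shift(-1) pulls the next row's values up within each unit;
# the last row of each unit has no earlier sale and gets NaN.
prev = (
    df.groupby("unit", sort=False)[["year", "month", "price"]]
    .shift(-1)
    .add_prefix("prev_")
)
result = df.join(prev)
print(result)
```

Because the shifted columns contain NaN, they come back as floats, which matches the output shown in the answer.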
