How to compact anything not null in pandas - python

Here's my input:
code   US   UK  Germany  Japan
AR5    13  NaN        7    NaN
A85   NaN    9      NaN      8
Here's my desired output; anything not null should be registered:
code  country
A85   UK
A85   Japan
AR5   US
AR5   Germany
Regards

You can melt then dropna:
df.melt('code', var_name='Country').dropna()
Output:
  code  Country  value
0  AR5       US   13.0
3  A85       UK    9.0
4  AR5  Germany    7.0
7  A85    Japan    8.0
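If you need exactly the two-column output from the question (no value column), a small follow-up on the same idea (the out name is just for illustration):
out = df.melt('code', var_name='country').dropna(subset=['value'])[['code', 'country']]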

Related

Use dataframe columns as arguments for function

I want to get arguments from a datafile (excel, .csv, whatever) and pass them as arguments to a Python function.
To get the arguments from the datafile I've converted it to a Pandas dataframe, created a list of the dataframe's index, and iterated over that list, looking up the cell values and passing them as arguments.
I've got some working code (see below) but I feel like it's kinda clunky.
Is there a better way to do this?
import pandas as pd
import os

curdir = os.getcwd()

def pdFunc(Name, Module):  # function that takes multiple arguments from the dataframe
    print(str(Name) + ',' + str(Module))
    # further code will be added here which will create new .csv files etc.
    # This output is not suitable to be placed in a dataframe.

assetList = os.path.join(curdir, 'Lists', 'Assets_ShortTesting_v1.0.xlsx')  # path to the Excel file with the data
assetdf = pd.read_excel(assetList)  # importing the data into a dataframe
indexList = assetdf.index.tolist()  # creating a list to iterate over
for i in indexList:  # iterating over the list
    pdFunc(assetdf.loc[i]['Name'], assetdf.loc[i]['Module'])  # finding the cell values and passing them as arguments to the function
Here's the dataframe:
Name ISIN SymbolYF SymbolInvestpy Currency Country Exchange Type Module Constituent of
0 Adyen N.V. NaN ADYEN.AS NaN EUR Netherlands NaN Stock 1 AEX
1 Aegon N.V. NaN AGN.AS NaN EUR Netherlands NaN Stock 1 AEX
2 Aalberts N.V. NaN AALB.AS NaN EUR Netherlands NaN Stock 1 AMX
3 ABN AMRO Bank N.V. NaN ABN.AS NaN EUR Netherlands NaN Stock 1 AMX
4 Anheuser-Busch InBev SA/NV NaN ABI.BR NaN EUR Belgium NaN Stock 2 BEL20
5 Ackermans & Van Haaren NV NaN ACKB.BR NaN EUR Belgium NaN Stock 2 BEL20
6 L'Air Liquide S.A. NaN AI.PA NaN EUR France NaN Stock 2 CAC40
7 Airbus SE NaN AIR.PA NaN EUR France NaN Stock 2 CAC40
8 Vonovia SE NaN VNA.DE NaN EUR Germany NaN Stock 2 DAX
9 US Dollar NaN USD-EUR NaN EUR US NaN Forex 3 Forex
10 Shiba Inu NaN SHIB-EUR NaN EUR NaN NaN Crypto 3 Forex
11 FTSE 1000 NaN ^FTSE NaN EUR United Kingdom NaN Index 3 Index
12 Wheat NaN ZW=F NaN USD NaN NaN Commodity 3 Commodity
13 Apple Inc. NaN AAPL NaN USD US NaN Stock 4 US MegaCap
14 Sirius XM Holdings Inc. NaN SIRI NaN USD US NaN Stock 4 US High volume
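A tidier pattern for this kind of row-wise call is itertuples; a sketch against the same assetdf and pdFunc from the question, which avoids building the index list and the repeated .loc lookups:
for row in assetdf.itertuples(index=False):
    pdFunc(row.Name, row.Module)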

Set "Year" column to individual columns to create a panel

I am trying to reshape the following dataframe such that it is in panel data form by moving the "Year" column such that each year is an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want each year to be an individual column. Here is an example:
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack, but I don't think I want a multilevel index as a result. I have been looking through the documentation (to_frame etc.), but I can't see what I am looking for.
If anyone can help that would be great!
Use set_index with append=True, then select the '0' column and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
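Note that ['0'] assumes the count column's label is the string '0'; if it is the integer 0 (as a groupby size or value_counts result often produces), select it as [0] instead, e.g.:
df = df.set_index('Award Year', append=True)[0].unstack()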
pivot_table can also help. Note that State is the index in the frame above, so reset it into a column first:
df2 = pd.pivot_table(df.reset_index(), values='0', columns='Award Year', index='State')
df2
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
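If you'd rather see 0 than NaN for the missing State/Year combinations, pivot_table also accepts fill_value (a sketch on the same call):
df2 = pd.pivot_table(df.reset_index(), values='0', columns='Award Year', index='State', fill_value=0)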

Given same value in one column, concatenate remaining rows?

Given the pandas DataFrame:
name hobby since
paul A 1995
john A 2005
paul B 2015
mary G 2013
chris E 2005
chris D 2001
paul C 1986
I would like to get:
name hobby1 since1 hobby2 since2 hobby3 since3
paul A 1995 B 2015 C 1986
john A 2005 NaN NaN NaN NaN
mary G 2013 NaN NaN NaN NaN
chris E 2005 D 2001 NaN NaN
I.e. I would like to have one row per name. The maximum number of hobbies a person can have, say 3 in this case, is something I know in advance. What would be the most elegant/short way to do this?
You can first melt, then use groupby.cumcount() to number the repeats within each variable, and finally pivot using pivot_table():
m = df.melt('name')
(m.assign(variable=m.variable + (m.groupby(['name', 'variable']).cumcount() + 1).astype(str))
  .pivot_table(index='name', columns='variable', values='value', aggfunc='first')
  .rename_axis(None, axis=1))
hobby1 hobby2 hobby3 since1 since2 since3
name
chris E D NaN 2005 2001 NaN
john A NaN NaN 2005 NaN NaN
mary G NaN NaN 2013 NaN NaN
paul A B C 1995 2015 1986
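The pivoted columns come out grouped as hobby1..hobby3 then since1..since3 rather than interleaved as in the desired output. A small reordering sketch, assuming the pivoted frame above is bound to res and at most 3 hobbies:
res = res[[f'{c}{i}' for i in range(1, 4) for c in ('hobby', 'since')]]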
Use cumcount and unstack. Finally, use MultiIndex.map to join the two-level columns into one level:
df1 = df.set_index(['name', df.groupby('name').cumcount().add(1)]) \
        .unstack().sort_index(axis=1, level=1)
df1.columns = df1.columns.map('{0[0]}{0[1]}'.format)
Out[812]:
hobby1 since1 hobby2 since2 hobby3 since3
name
chris E 2005.0 D 2001.0 NaN NaN
john A 2005.0 NaN NaN NaN NaN
mary G 2013.0 NaN NaN NaN NaN
paul A 1995.0 B 2015.0 C 1986.0
Maybe something like this? But you would need to rename the columns afterwards with this solution.
df["combined"] = [ "{}_{}".format(x,y) for x,y in zip(df.hobby,df.since)]
df.groupby("name")["combined"]
.agg(lambda x: "_".join(x))
.str.split("_",expand=True)
The result is:
0 1 2 3 4 5
name
chris E 2005 D 2001 None None
john A 2005 None None None None
mary G 2013 None None None None
paul A 1995 B 2015 C 1986
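For the renaming step mentioned above, a hedged sketch, assuming the split result is bound to out (the name is illustrative) and at most 3 hobbies as in the question:
out = (df.groupby("name")["combined"]
         .agg("_".join)
         .str.split("_", expand=True))
out.columns = ['hobby1', 'since1', 'hobby2', 'since2', 'hobby3', 'since3']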

Calculate Number of Rows containing NaN values

I have a DataFrame df, given below, and I have to calculate the number of rows containing NaN values.
Name Age City Country
0 jack NaN Sydeny Australia
1 Riti NaN Delhi India
2 Vikas 31 NaN India
3 Neelu 32 Bangalore India
4 Steve 16 New York US
5 John 11 NaN NaN
6 NaN NaN NaN NaN
To get the answer I tried
df.isnull().sum().sum()
It gives me 9 because it counts every NaN value, but the correct answer is 5, the number of rows that contain a NaN value. I do not know how to calculate this.
You need .any(axis=1) after the isnull() check: it flags each row that has at least one NaN, and summing the booleans counts those rows:
df.isnull().any(axis=1).sum()
#5
Here is a worked example of how to get it.
Example DF
>>> df
Name Age City Country
0 jack NaN Sydeny Australia
1 Riti NaN Delhi India
2 Vikas 31.0 NaN India
3 Neelu 32.0 Bangalore India
4 John 16.0 New York US
5 John 11.0 NaN NaN
6 NaN NaN NaN NaN
To flag the NaN rows with booleans:
>>> df.isnull().any(1)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
dtype: bool
To get the rows where NaN appears:
>>> df.index[df.isnull().any(axis=1)]
Int64Index([0, 1, 2, 5, 6], dtype='int64')
Finally, your answer directly:
>>> df.isnull().any(axis=1).sum()
5
OR
>>> len(df.index[df.isnull().any(axis=1)])
5

Filling NaN of one column with the values of another in Python

I have a dataset that was merged together so the two sources could fill each other's missing values.
The problem is that some columns still have missing data that I now want to fill with the values that aren't missing.
The merged data set looks like this for an input:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 nan 1956 nan
Johnson AL 1 nan nan 1956 nan
Johnson AL 2 1 nan 1999 nan
Johnson AL 2 0 nan 1999 nan
Debra AK 1A 0 nan 2000 nan
Debra AK 1B nan 20 nan 1997
Debra AK 2 nan 10 nan 2009
Debra AK 3 nan 1 nan 2008
.
.
What I'd want for an output is this:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 1 1956 1956
Johnson AL 2 1 1 1999 1999
Johnson AL 2 0 0 1999 1999
Debra AK 1A 0 0 2000 2000
Debra AK 1B 20 20 1997 1997
Debra AK 2 10 10 2009 2009
Debra AK 3 1 1 2008 2008
.
.
So I want all NaN values replaced by the associated values in their partner columns: match Number_x to Number_y and Op_x to Op_y.
One thing to note is that when two IDs are the same, their values can sometimes differ, like Johnson with ID = 2, which has different numbers but the same op values. I want to keep these because I need to investigate them further.
Also, if a row is missing both Number_x and Number_y, I want to take that row out, like the Johnson row where both are NaN.
Let's do a groupby over axis=1, grouping columns by their prefix before the underscore:
df.groupby(df.columns.str.split('_').str[0], axis=1).first().dropna(subset=['Number', 'Op'])
ID Name Number Op State
0 1 Johnson 1.0 1956.0 AL
2 2 Johnson 1.0 1999.0 AL
3 2 Johnson 0.0 1999.0 AL
4 1A Debra 0.0 2000.0 AK
5 1B Debra 20.0 1997.0 AK
6 2 Debra 10.0 2009.0 AK
7 3 Debra 1.0 2008.0 AK
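An equivalent, more explicit sketch using fillna, pairing each _x column with its _y twin and then dropping rows where both Number columns are missing (column names taken from the question; this keeps the _x/_y layout of the desired output):
for base in ['Number', 'Op']:
    df[f'{base}_x'] = df[f'{base}_x'].fillna(df[f'{base}_y'])  # fill _x gaps from _y
    df[f'{base}_y'] = df[f'{base}_y'].fillna(df[f'{base}_x'])  # then fill _y gaps back from _x
df = df.dropna(subset=['Number_x', 'Number_y'])  # drop rows where both Number values were NaN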
