Values not replaced using .ix in Pandas library - python

I have defined a simple function to replace missing values in numerical columns with the average of the non-missing values for each column. The function is syntactically correct and generates the correct values; however, the missing values are not actually getting replaced.
Below is the code snippet
def fillmissing_with_mean(df1):
    df2 = df1._get_numeric_data()
    for i in range(len(df2.columns)):
        df2[df2.iloc[:, i].isnull()].iloc[:, i] = df2.iloc[:, i].mean()
    return df2

fillmissing_with_mean(df)
The data frame which is passed looks like this:
age gender job name height
NaN F student alice 165.0
26.0 None student john 180.0
NaN M student eric 175.0
58.0 None manager paul NaN
33.0 M engineer julie 171.0
34.0 F scientist peter NaN
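
The reason nothing is replaced is chained indexing: df2[...].iloc[:, i] = ... first materializes a filtered copy of the frame and then writes the mean into that copy, leaving df2 untouched. A minimal in-place fix of the same function, using .loc:
def fillmissing_with_mean(df1):
    df2 = df1._get_numeric_data()
    for col in df2.columns:
        # .loc assigns in place; the chained indexing above wrote to a temporary copy
        df2.loc[df2[col].isnull(), col] = df2[col].mean()
    return df2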

You do not need to worry about selecting the numeric columns: when you take the mean, only the numeric columns are affected, and fillna accepts a pd.Series:
df.fillna(df.mean())
Out[1398]:
age gender job name height
0 37.75 F student alice 165.00
1 26.00 None student john 180.00
2 37.75 M student eric 175.00
3 58.00 None manager paul 172.75
4 33.00 M engineer julie 171.00
5 34.00 F scientist peter 172.75
More Info
df.mean()
Out[1399]:
age 37.75
height 172.75
dtype: float64
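Note that on recent pandas versions (2.0+), df.mean() raises a TypeError when non-numeric columns are present instead of silently skipping them, so you may need numeric_only=True:
df.fillna(df.mean(numeric_only=True))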

This may be what you need. skipna=True by default, but I've included it here explicitly so you know what it's doing.
for col in ['age', 'height']:
    df[col] = df[col].fillna(df[col].mean(skipna=True))
# age gender job name height
# 0 37.75 F student alice 165.00
# 1 26.00 None student john 180.00
# 2 37.75 M student eric 175.00
# 3 58.00 None manager paul 172.75
# 4 33.00 M engineer julie 171.00
# 5 34.00 F scientist peter 172.75
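If you'd rather not hard-code the column names, a small sketch applying the same fill to every numeric column:
num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())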

Related

Compare two dataframes based on one grain column and print out differences into a txt file

I have two dataframes df1 and df2 with different row sizes but the same columns; the ID column is common across both dataframes. I want to write the differences to a text file. For example:
df1:
ID Name Age Profession sex
1 Tom 20 engineer M
2 nick 21 doctor M
3 krishi 19 lawyer F
4 jacky 18 dentist F
df2:
ID Name Age Profession sex
1 Tom 20 plumber M
2 nick 21 doctor M
3 krishi 23 Analyst F
4 jacky 18 dentist F
The resultant text file should look like:
ID : 1
Profession_old Profession_new
engineer plumber
ID : 3
Age_old Age_new Profession_old Profession_new
19 23 lawyer Analyst
You can use compare and a loop:
df3 = df1.set_index('ID').compare(df2.set_index('ID'))
df3.columns = (df3.rename({'self': 'old', 'other': 'new'}, level=1, axis=1)
.columns.map('_'.join)
)
for id, row in df3.iterrows():
    print(f'ID : {id}')
    print(row.dropna().to_frame().T.to_string(index=False))
    print()
output:
ID : 1
Profession_old Profession_new
engineer plumber
ID : 3
Age_old Age_new Profession_old Profession_new
19.0 23.0 lawyer Analyst
NB. print is used here for the demo; to write to a file instead, open it in write mode and move the writes inside the loop:
with open('file.txt', 'w') as f:
    for id, row in df3.iterrows():
        f.write(f'ID : {id}\n')
        f.write(row.dropna().to_frame().T.to_string(index=False))
        f.write('\n\n')
You could also directly use df3:
Age_old Age_new Profession_old Profession_new
ID
1 NaN NaN engineer plumber
3 19.0 23.0 lawyer Analyst
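If the per-ID formatting isn't required, a minimal sketch that writes this table as-is:
with open('file.txt', 'w') as f:
    f.write(df3.to_string())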

Pandas DataFrame GroupBy Rank

DataFrame:
account_id plan_id policy_group_nbr plan_type split_eff_date splits
0 470804 8739131 Conversion732 Onsite Medical Center 1/19/2022 Bob Smith (28.2) | John Doe (35.9) | A...
1 470804 8739131 Conversion732 Onsite Medical Center 1/21/2022 Bob Smith (19.2) | John Doe (34.6) | A...
2 470809 2644045 801790 401(k) 1/18/2022 Jim Jones (100)
3 470809 2644045 801790 401(k) 1/5/2022 Other Name (50) | Jim Jones (50)
4 470809 2738854 801789 401(k) 1/18/2022 Jim Jones (100)
... ... ... ... ... ... ...
1720 3848482 18026734 24794 Accident 1/20/2022 Bill Underwood (50) | Jim Jones (50)
1721 3848482 18026781 BCSC FSA Admin 1/20/2022 Bill Underwood (50) | Jim Jones (50)
1722 3927880 19602958 Consulting Other 1/20/2022 Bill Brown (50) | Tim Scott (50)
1723 3927880 19863300 Producer Expense 5500 Filing 1/20/2022 Bill Brown (50) | Tim Scott (50)
1724 3927880 19863300 Producer Expense 5500 Filing 1/21/2022 Bill Brown (50) | Tim Scott (50)
I need to group by (account_id, plan_id, policy_group_nbr, plan_type), sorted by split_eff_date (desc), in order to remove all rows for the group except the most recent date while keeping all columns. I can get a rank; however, when attempting to pass an argument to the lambda function, I'm receiving a TypeError.
Working as expected:
splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values().rank())
This raises TypeError: incompatible index of inserted column with frame index:
splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values(ascending=False).rank())
Passing the axis argument didn't seem to help either... Is this a simple syntax issue, or am I not understanding the function properly?
It's easier -- and typically faster -- to do this with .transform().
Easier because when you sort descending, the index doesn't match when you try to assign back to the original DataFrame. I tried not using an index in the .groupby(), but wasn't able to get that working.
Link to the documentation for .transform(): https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.transform.html
I recommend using .transform() like this, and be sure to supply the ascending=False kwarg to .rank() as well:
df["rank2"] = df.groupby(["account_id", "plan_id", "policy_group_nbr", "plan_type"])[
"split_eff_date"
].transform(
lambda x: x.sort_values(ascending=False).rank(ascending=False, method="first")
)
Result with both kinds of ranking -- I took just the first 5 rows from your sample data:
In [93]: df
Out[93]:
account_id plan_id policy_group_nbr plan_type split_eff_date rank rank2
3 470809 2644045 801790 401(k) 2022-01-05 1.0 2.0
2 470809 2644045 801790 401(k) 2022-01-18 2.0 1.0
4 470809 2738854 801789 401(k) 2022-01-18 1.0 1.0
0 470804 8739131 Conversion732 Onsite Medical Center 2022-01-19 1.0 2.0
1 470804 8739131 Conversion732 Onsite Medical Center 2022-01-21 2.0 1.0
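Since the end goal is to keep only the most recent date per group, you can filter on the rank, or skip ranking entirely; a sketch, assuming split_eff_date is (or has been parsed as) a datetime column:
# keep only the highest-ranked (most recent) row per group
latest = df[df["rank2"] == 1.0]

# or, without ranking: sort descending and keep the first row per group
latest = (
    df.sort_values("split_eff_date", ascending=False)
      .drop_duplicates(["account_id", "plan_id", "policy_group_nbr", "plan_type"])
)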

Amend row in a data-frame if it exists in another data-frame

I have two dataframes, DfMaster and DfError.
DfMaster looks like:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans A
4 3643 Kevin Franks S
5 244 Stella Howard D
and DfError looks like
Id Name Building
0 4567 John Evans A
1 244 Stella Howard D
In DfMaster I would like to change the Building value for a record to DD if it appears in the DfError data-frame. So my desired output would be:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans DD
4 3643 Kevin Franks S
5 244 Stella Howard DD
I am trying to use the following:
DfMaster.loc[DfError['Id'], 'Building'] = 'DD'
however I get an error:
KeyError: "None of [Int64Index([4567,244], dtype='int64')] are in the [index]"
What have I done wrong?
Try this using np.where:
import numpy as np

errors = list(DfError['Id'].unique())
DfMaster['Building'] = np.where(DfMaster['Id'].isin(errors), 'DD', DfMaster['Building'])
DataFrame.loc expects that you input an index or a Boolean series, not a value from a column.
I believe this should do the trick:
DfMaster.loc[DfMaster['Id'].isin(DfError['Id']), 'Building'] = 'DD'
Basically, it says:
for all rows where the Id value is present in DfError['Id'], set the value of 'Building' to 'DD'.
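Spelled out as two steps, this is the Boolean mask that .loc receives:
# Boolean Series aligned to DfMaster's index; True only for Ids 4567 and 244
mask = DfMaster['Id'].isin(DfError['Id'])
DfMaster.loc[mask, 'Building'] = 'DD'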

Merge two pandas dataframes to create a new dataframe with a specific operation

I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the second dataframe gains new columns with the count of each ethnicity per company, such as American: 2, Mexican: 5, and so on, so that later on I can calculate a diversity score.
The output dataframe should look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group using groupby with size and unstack, then join to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need to replace each unit suffix with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
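Note that the zero-padding dict assumes integer prefixes: '5.2B' would become the string '5.2000000000', i.e. the float 5.2 rather than 5.2e9. A sketch using multipliers instead, assuming each value ends in a single M/B suffix (upper or lower case):
mult = {'M': 1e6, 'B': 1e9}
df['a'] = df['sale'].str[:-1].astype(float) * df['sale'].str[-1].str.upper().map(mult)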

Adding a function to a string split command in Pandas

I have a dataframe that has 20 or so columns in it. One of the columns is called 'director_name' and has values such as 'John Doe' or 'Jane Doe'. I want to split this into 2 columns, 'First_Name' and 'Last_Name'. When I run the following it works as expected and splits the string into 2 columns:
data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', expand=True)
data
First_Name Last_Name
John Doe
It works great; however, it does NOT work when I have NULL (NaN) values under 'director_name'. It throws the following error:
'Columns must be same length as key'
I'd like to add a function which checks whether the value != null, then runs the command listed above, and otherwise enters 'NA' for First_Name and Last_Name.
Any ideas how I would go about that?
EDIT:
I just checked the file and I'm not sure NULL is the issue. I have some names that are 3-4 words long, e.g.:
John Allen Doe
John Allen Doe Jr
Maybe I can't split this into First_Name and Last_Name.
Hmmmm
Here is a way: split, and choose, say, the first two values as first name and last name.
Id name
0 1 James Cameron
1 2 Martin Sheen
2 3 John Allen Doe
3 4 NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
You get
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN None
Use str.split (no parameter, because the splitter defaults to whitespace) with str indexing to select items from the resulting lists by position:
print (df.name.str.split())
0 [James, Cameron]
1 [Martin, Sheen]
2 [John, Allen, Doe]
3 NaN
Name: name, dtype: object
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split().str[1]
#data borrow from A-Za-z answer
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN NaN
It is also possible to use the parameter n to split only once, keeping everything after the first name together:
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split(n=1).str[1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
Solution with str.rsplit (splitting from the right):
df['First_Name'] = df.name.str.rsplit(n=1).str[0]
df['Last_Name'] = df.name.str.rsplit().str[-1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
This should fix your problem
Setup
import numpy as np
import pandas as pd

data = pd.DataFrame({'director_name': {0: 'John Doe', 1: np.nan, 2: 'Alan Smith'}})
data
Out[457]:
director_name
0 John Doe
1 NaN
2 Alan Smith
Solution
# use a lambda function to check for NaN before splitting the column
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()), axis=1)
data
Out[446]:
director_name First_Name Last_Name
0 John Doe John Doe
1 NaN NaN NaN
2 Alan Smith Alan Smith
If you need to take only the first 2 names, you can do:
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()).iloc[:2], axis=1)
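An alternative worth knowing: str.split with n=1 and expand=True always yields exactly two columns (as long as at least one name contains a space) and propagates NaN on its own, so no lambda is needed when first name = first token and last name = the rest:
data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', n=1, expand=True)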
