Pandas DataFrame: Fill NaN values based on multiple criteria - python

I'm currently wrangling a big data set of 2 million rows from Lyft for a Udacity project. The DataFrame looks like this:
id name latitude longitude
0 148.0 Horton St at 40th St 37.829705 -122.287610
1 376.0 Illinois St at 20th St 37.760458 -122.387540
2 453.0 Brannan St at 4th St 37.777934 -122.396973
3 182.0 19th Street BART Station 37.809369 -122.267951
4 237.0 Fruitvale BART Station 37.775232 -122.224498
5 NaN NaN 37.775232 -122.224498
As the last row shows, I have a lot of NaN values for id and name; however, latitude and longitude are almost never empty. My assumption is that I could extract the name from other rows that share the same combination of latitude and longitude.
Once I have the name, I would try filling the NaN values for id using name:
dict_id = dict(zip(df['name'], df['id']))
df['id'] = df['id'].fillna(df['name'].map(dict_id))
However, I struggle because with latitude and longitude I have two values to match against the name.
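One way to extend the dict idea to two key columns is to key the lookup on (latitude, longitude) tuples. A minimal sketch, assuming matching stations have exactly equal coordinates (known and name_by_coord are illustrative helper names):
import pandas as pd

# Build a lookup from coordinate pairs to the names seen elsewhere
known = df.dropna(subset=['name'])
name_by_coord = dict(zip(zip(known['latitude'], known['longitude']), known['name']))

# Map each row's coordinate pair through the lookup to fill missing names
coords = pd.Series(list(zip(df['latitude'], df['longitude'])), index=df.index)
df['name'] = df['name'].fillna(coords.map(name_by_coord))
Once name is filled, the dict_id snippet above can fill id. The merge-based answer below achieves the same without an intermediate dict.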

You can left-merge the DataFrame with a copy of itself after dropna, then rename the columns:
m = df.merge(df.dropna(subset=['name']), on=['latitude', 'longitude'],
             how='left', suffixes=('', '_y'))
out = (m.drop(['id', 'name'], axis=1)
        .rename(columns={'id_y': 'id', 'name_y': 'name'})
        .reindex(df.columns, axis=1))
id name latitude longitude
0 148.0 Horton St at 40th St 37.829705 -122.287610
1 376.0 Illinois St at 20th St 37.760458 -122.387540
2 453.0 Brannan St at 4th St 37.777934 -122.396973
3 182.0 19th Street BART Station 37.809369 -122.267951
4 237.0 Fruitvale BART Station 37.775232 -122.224498
5 237.0 Fruitvale BART Station 37.775232 -122.224498
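An alternative sketch without a merge: group on the coordinate pair and fill each NaN from its group's first non-null value (groupby's 'first' skips NaN):
for col in ['id', 'name']:
    df[col] = df[col].fillna(
        df.groupby(['latitude', 'longitude'])[col].transform('first'))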

Related

How can we iterate within a particular row with known index in a pandas data frame?

I have a data frame named df_cp which has the data below:
CompanyID Project01 Project02 Project03 Project04 Project05
0 134 oil furniture NaN NaN NaN
1 LCM oil furniture car NaN NaN
2 Z01 oil furniture NaN NaN NaN
3 453 oil furniture agro meat NaN
I need to insert a new project name for CompanyID 'LCM' at the first empty cell in the row with index 1. I found the index of the row of interest using this:
index_row = df_cp[df_cp['CompanyID']=='LCM'].index
How can I iterate within the row at index 1? The task is to replace the first NaN in that row with "Healthcare".
Please help with this.
IIUC, you can use isna and idxmax:
# df.loc[1].isna() is a boolean mask over row 1; idxmax() returns the
# label of its first True value, i.e. the first NaN column
df.loc[1, df.loc[1].isna().idxmax()] = 'Healthcare'
Output:
CompanyID Project01 Project02 Project03 Project04 Project05
0 134 oil furniture NaN NaN NaN
1 LCM oil furniture car Healthcare NaN
2 Z01 oil furniture NaN NaN NaN
3 453 oil furniture agro meat NaN
Note: idxmax returns the index of the first occurrence of the maximum value.
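To see why idxmax lands on the first NaN, note that True sorts above False in a boolean mask; a tiny standalone check:
import pandas as pd

row = pd.Series({'Project01': 'oil', 'Project02': None, 'Project03': None})
print(row.isna().idxmax())  # 'Project02' -- label of the first True, i.e. the first NaN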
More generally:
m = df['CompanyID'] == 'LCM'
# For each matching row, idxmax(axis=1) gives the first NaN column label
df.loc[m, df[m].isna().idxmax(axis=1)] = 'Healthcare'
df
Output:
CompanyID Project01 Project02 Project03 Project04 Project05
0 134 oil furniture NaN NaN NaN
1 LCM oil furniture car Healthcare NaN
2 Z01 oil furniture NaN NaN NaN
3 453 oil furniture agro meat NaN

pandas groupby replace based on condition

I have a dataset structures as below:
index country city Data
0 AU Sydney 23
1 AU Sydney 45
2 AU Unknown 2
3 CA Toronto 56
4 CA Toronto 2
5 CA Ottawa 1
6 CA Unknown 2
I want to replace 'Unknown' in the city column with the mode of the occurrences of cities per country. The result would be:
...
2 AU Sydney 2
...
6 CA Toronto 2
I can get the city modes with:
city_modes = df.groupby('country')['city'].apply(lambda x: x.mode().iloc[0])
And I can replace values with:
df['city'] = df['city'].replace('Unknown', 'something')
But I can't work out how to combine these to replace 'Unknown' only within each country, based on that country's most frequent city.
Any ideas?
Use transform to create a Series the same size as the original DataFrame, then set the new values with numpy.where:
import numpy as np

city_modes = df.groupby('country')['city'].transform(lambda x: x.mode().iloc[0])
df['city'] = np.where(df['city'] == 'Unknown', city_modes, df['city'])
Or:
df.loc[df['city'] == 'Unknown', 'city'] = city_modes
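Putting it together with the sample data as a runnable sketch (note that 'Unknown' itself counts toward the mode, so exclude it first if it might dominate a group):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'country': ['AU', 'AU', 'AU', 'CA', 'CA', 'CA', 'CA'],
    'city': ['Sydney', 'Sydney', 'Unknown', 'Toronto', 'Toronto', 'Ottawa', 'Unknown'],
    'Data': [23, 45, 2, 56, 2, 1, 2],
})

# Per-country mode, broadcast back to every row of that country
city_modes = df.groupby('country')['city'].transform(lambda x: x.mode().iloc[0])
df['city'] = np.where(df['city'] == 'Unknown', city_modes, df['city'])
print(df)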

pandas fill missing country values based on city if it exists

I'm trying to fill in missing country names in my dataframe based on city names whose country already exists elsewhere. For example, in the dataframe below I want to replace the NaN for city Bangalore with country India, since that city already appears with its country:
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
I believe
df1.groupby('City')['Country'].ffill()
should resolve your issue by forward filling missing values within each group. (Note this fills a NaN only when a non-null Country appears earlier in its group.)
One of the ways could be -
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one nasty way to do it:
First use forward fill and then backward fill (to handle the case where a NaN occurs first in its group):
df1['Country'] = (df1.groupby('City')['Country'].ffill()
                      .groupby(df1['City']).bfill())

How can I plot a pivot table value?

I have a pivot table and I want to plot the values for the 12 months of each year for each town.
                                  2010-01     2010-02     2010-03
City       RegionName
Atlanta    Downtown                   NaN         NaN         NaN
           Midtown            194.263702  196.319964  197.946962
Alexandria Alexandria West            NaN         NaN         NaN
           Landmark-Van Dorn          NaN         NaN         NaN
How can I select only the values for each region of each town? I thought maybe it would be better to change the column names with years and months to datetime format and set them as index. How can I do this?
The result must be:
         City        RegionName
2010-01  Atlanta     Downtown                    NaN
                     Midtown              194.263702
         Alexandria  Alexandria West             NaN
                     Landmark-Van Dorn           NaN
Here's some similar dummy data to play with:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays([['A', 'A', 'B', 'C', 'C'],
                                 ['A1', 'A2', 'B1', 'C1', 'C2']],
                                names=['City', 'Region'])
idcol = pd.date_range('2012-01', freq='M', periods=12)
df = pd.DataFrame(np.random.rand(5, 12), index=idx,
                  columns=[t.strftime('%Y-%m') for t in idcol])
Let's see what we've got:
print(df.iloc[:, :3])
2012-01 2012-02 2012-03
City Region
A A1 0.513709 0.941354 0.133290
A2 0.734199 0.005218 0.068914
B B1 0.043178 0.124049 0.603469
C C1 0.721248 0.483388 0.044008
C2 0.784137 0.864326 0.450250
Let's convert the column labels to datetimes:
df.columns = pd.to_datetime(df.columns)
Now to plot you just need to transpose:
df.T.plot()
Update after you updated your question:
Use stack, and then reorder if you want:
df = df.stack().reorder_levels([2,0,1])
df.head()
City Region
2012-01-01 A A1 0.513709
2012-02-01 A A1 0.941354
2012-03-01 A A1 0.133290
2012-04-01 A A1 0.324518
2012-05-01 A A1 0.554125
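To plot from this shape, one sketch is to unstack the City and Region levels so each town/region pair becomes its own line, indexed by date:
# df is now the stacked Series with a (date, City, Region) MultiIndex
df.unstack(['City', 'Region']).plot()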

Aggregate function to data frame in pandas

I want to create a dataframe from an aggregate function. I thought it would create a dataframe by default, as this solution states (Converting a Pandas GroupBy object to DataFrame), but it creates a series and I don't know why.
The dataframe is from Kaggle's San Francisco Salaries. My code:
df=pd.read_csv('Salaries.csv')
in: type(df)
out: pandas.core.frame.DataFrame
in: df.head()
out: EmployeeName JobTitle TotalPay TotalPayBenefits Year Status 2BasePay 2OvertimePay 2OtherPay 2Benefits 2Year
0 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 567595.43 567595.43 2011 NaN 167411.18 0.00 400184.25 NaN 2011-01-01
1 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 538909.28 538909.28 2011 NaN 155966.02 245131.88 137811.38 NaN 2011-01-01
2 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 335279.91 335279.91 2011 NaN 212739.13 106088.18 16452.60 NaN 2011-01-01
3 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 332343.61 332343.61 2011 NaN 77916.00 56120.71 198306.90 NaN 2011-01-01
4 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 326373.19 326373.19 2011 NaN 134401.60 9737.00 182234.59 NaN 2011-01-01
in: df2=df.groupby(['JobTitle'])['TotalPay'].mean()
type(df2)
out: pandas.core.series.Series
I want df2 to be a dataframe with the columns 'JobTitle' and 'TotalPay'.
Breaking down your code:
df2 = df.groupby(['JobTitle'])['TotalPay'].mean()
The groupby is fine. It's the ['TotalPay'] that is the misstep. That tells the groupby to execute the mean function only on the pd.Series df['TotalPay'] for each group defined by ['JobTitle']. Instead, you want to refer to this column with [['TotalPay']]. Notice the double brackets. Those double brackets say pd.DataFrame.
Recap
df2 = df.groupby(['JobTitle'])[['TotalPay']].mean()
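If you would rather have JobTitle as a regular column instead of the index, two equivalent options:
df2 = df.groupby('JobTitle', as_index=False)['TotalPay'].mean()
# or
df2 = df.groupby('JobTitle')['TotalPay'].mean().reset_index()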
