pandas groupby replace based on condition - python

I have a dataset structured as below:
index country city Data
0 AU Sydney 23
1 AU Sydney 45
2 AU Unknown 2
3 CA Toronto 56
4 CA Toronto 2
5 CA Ottawa 1
6 CA Unknown 2
I want to replace 'Unknown' in the city column with the mode of the occurrences of cities per country. The result would be:
...
2 AU Sydney 2
...
6 CA Toronto 2
I can get the city modes with:
city_modes = df.groupby('country')['city'].apply(lambda x: x.mode().iloc[0])
And I can replace values with:
df['column'] = df.column.replace('Unknown', 'something')
But I can't work out how to combine these to only replace the unknowns for each country based on the mode of occurrence of its cities.
Any ideas?

Use transform to get a Series the same length as the original DataFrame, then set the new values with numpy.where:
city_modes = df.groupby('country')['city'].transform(lambda x: x.mode().iloc[0])
df['city'] = np.where(df['city'] == 'Unknown', city_modes, df['city'])
Or with boolean indexing:
df.loc[df['city'] == 'Unknown', 'city'] = city_modes
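For reference, a minimal runnable sketch using the sample data from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['AU', 'AU', 'AU', 'CA', 'CA', 'CA', 'CA'],
                   'city': ['Sydney', 'Sydney', 'Unknown', 'Toronto', 'Toronto', 'Ottawa', 'Unknown'],
                   'Data': [23, 45, 2, 56, 2, 1, 2]})

# broadcast each country's most frequent city to every row of that country
city_modes = df.groupby('country')['city'].transform(lambda x: x.mode().iloc[0])
df['city'] = np.where(df['city'] == 'Unknown', city_modes, df['city'])
print(df)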

Related

How to replace a part of column value with values from another two columns based on a condition in pandas

I have a dataframe df as shown below. I want to replace all the temp_id column values that contain an underscore (_) with a new value built from the numerical part of the temp_id plus the city and country column values.
df
temp_id city country
12225IND DELHI IND
14445UX_TY AUSTIN US
56784SIN BEDOK SIN
72312SD_IT_UZ NEW YORK US
47853DUB DUBAI UAE
80976UT_IS_SZ SYDENY AUS
89012TY_JP_IS TOKOYO JPN
51309HJ_IS_IS
42087IND MUMBAI IND
Expected Output
temp_id city country
12225IND DELHI IND
14445AUSTINUS AUSTIN US
56784SIN BEDOK SIN
72312NEWYORKUS NEW YORK US
47853DUB DUBAI UAE
80976SYDENYAUS SYDENY AUS
89012TOKOYOJPN TOKOYO JPN
51309HJ_IS_IS
42087IND MUMBAI IND
How can this be done in pandas (Python)?
Use boolean indexing:
# find rows with a value in both city and country
m1 = df[['city', 'country']].notna().all(axis=1)
# find rows with a "_" in temp_id
m2 = df['temp_id'].str.contains('_')
# both conditions above
m = m1 & m2
# replace matching rows with the leading digits + city + country
df.loc[m, 'temp_id'] = (df.loc[m, 'temp_id'].str.extract(r'^(\d+)', expand=False)
                        + df.loc[m, 'city'].str.replace(' ', '')
                        + df.loc[m, 'country'])
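Note that expand=False makes str.extract return a Series instead of a one-column DataFrame, so the extracted digits can be concatenated directly with the city and country columns.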
Output:
temp_id city country
0 12225IND DELHI IND
1 14445AUSTINUS AUSTIN US
2 56784SIN BEDOK SIN
3 72312NEWYORKUS NEW YORK US
4 47853DUB DUBAI UAE
5 80976SYDENYAUS SYDENY AUS
6 89012TOKOYOJPN TOKOYO JPN
7 51309HJ_IS_IS None None
8 42087IND MUMBAI IND
You can apply re.sub row-wise to the temp_id column, using a regular expression to capture the leading digits. A guard is needed so rows without an underscore or without city/country values are left untouched. Here is an example:
import re
df['temp_id'] = df.apply(lambda r: re.sub(r'^(\d+).*', r'\1' + r['city'].replace(' ', '') + r['country'], r['temp_id'])
                         if '_' in r['temp_id'] and pd.notna(r['city']) else r['temp_id'], axis=1)
This matches any temp_id containing an underscore, keeps its leading digits, and replaces the rest with the corresponding city (spaces removed) and country values. The result is the temp_id column in the desired format.
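As a quick check, a minimal sketch on a subset of the sample data (constructed here by hand):
import re
import pandas as pd

df = pd.DataFrame({'temp_id': ['14445UX_TY', '12225IND', '51309HJ_IS_IS'],
                   'city': ['AUSTIN', 'DELHI', None],
                   'country': ['US', 'IND', None]})
df['temp_id'] = df.apply(lambda r: re.sub(r'^(\d+).*', r'\1' + r['city'].replace(' ', '') + r['country'], r['temp_id'])
                         if '_' in r['temp_id'] and pd.notna(r['city']) else r['temp_id'], axis=1)
print(df['temp_id'].tolist())  # ['14445AUSTINUS', '12225IND', '51309HJ_IS_IS']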

replace row data in pandas based on another dataframe

I have a sample dataframe
name
0 Newyork
1 Los Angeles
2 Ohio
3 Washington DC
4 Kentucky
I also have a second dataframe
name ratio
0 Newyork 1:2
1 Kentucky 3:7
2 Florida 1:5
3 SF 2:9
How can I replace the values in the name column of df2 with 'Not Available' if the name is present in df1?
Desired result:
name ratio
0 Not Available 1:2
1 Not Available 3:7
2 Florida 1:5
3 SF 2:9
Use numpy.where:
df2['name'] = np.where(df2['name'].isin(df1['name']), 'Not Available', df2['name'])
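For reference, a minimal runnable sketch with the sample frames from the question; an equivalent with boolean indexing is shown in the comment:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['Newyork', 'Los Angeles', 'Ohio', 'Washington DC', 'Kentucky']})
df2 = pd.DataFrame({'name': ['Newyork', 'Kentucky', 'Florida', 'SF'],
                    'ratio': ['1:2', '3:7', '1:5', '2:9']})

# replace names that also appear in df1
df2['name'] = np.where(df2['name'].isin(df1['name']), 'Not Available', df2['name'])
# equivalent: df2.loc[df2['name'].isin(df1['name']), 'name'] = 'Not Available'
print(df2)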

How can I read in row names as they were originally, using pandas.read_csv( )?

I need to read in a .csv file which contains a distance matrix, so it has identical row names and column names, and it's important to have both. However, the code below only gets me a dataframe where the row names end up in an extra "Unnamed: 0" column and the index becomes integers again, which is very inconvenient for indexing later.
DATA = pd.read_csv("https://raw.githubusercontent.com/PawinData/UC/master/DistanceMatrix_shortestnetworks.csv")
I did check the documentation of pandas.read_csv and played with index_col, header, names, etc., but none of them seemed to work. Can anybody help me out here?
Use the index_col=0 parameter to use the first column as the index:
url = "https://raw.githubusercontent.com/PawinData/UC/master/DistanceMatrix_shortestnetworks.csv"
DATA = pd.read_csv(url, index_col=0)
print (DATA.head())
Imperial Kern Los Angeles Orange Riverside San Bernardino \
Imperial 0 3 3 2 1 2
Kern 3 0 1 2 2 1
Los Angeles 3 1 0 1 2 1
Orange 2 2 1 0 1 1
Riverside 1 2 2 1 0 1
San Diego San Luis Obispo Santa Barbara Ventura
Imperial 1 4 4 4
Kern 3 1 1 1
Los Angeles 2 2 2 1
Orange 1 3 3 2
Riverside 1 3 3 3
This issue most likely arises because your CSV was saved along with its RangeIndex, which usually doesn't have a name. The fix would actually need to be done when saving the DataFrame: data.to_csv('file.csv', index=False)
To read the unnamed column as the index, specify index_col=0 in pd.read_csv; this reads the first column in as the index:
data = pd.read_csv("https://raw.githubusercontent.com/PawinData/UC/master/DistanceMatrix_shortestnetworks.csv", index_col=0)
And to drop an existing unnamed column, use data.drop(columns=data.filter(regex="Unnamed").columns, inplace=True)
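For reference, a small round-trip sketch (the filename here is just illustrative): writing the frame with its index kept, then reading it back with index_col=0, preserves the row labels:
import pandas as pd

dm = pd.DataFrame([[0, 3], [3, 0]],
                  index=['Imperial', 'Kern'],
                  columns=['Imperial', 'Kern'])
dm.to_csv('distances.csv')                        # the index is written as the first column
back = pd.read_csv('distances.csv', index_col=0)  # the first column becomes the index again
print(back.equals(dm))  # True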

Find two matching rows in a Pandas DataFrame to calculate value

I want to find a matching row for another row in a Pandas dataframe. Given this example frame:
name location type year area delta
0 building NY a 2019 650.3 ?
1 building NY b 2019 400.0 ?
2 park LA a 2017 890.7 ?
3 lake SF b 2007 142.2 ?
4 park LA b 2017 333.3 ?
...
Each row has a matching row where all values are equal except "type" and "area". For example, rows 0 and 1 match, as do rows 2 and 4, ...
I want to somehow get the matching rows and write the difference between their areas into the "delta" column (e.g. |650.3 - 400.0| = 250.3 for row 0).
The "delta" column doesn't exist yet, but an empty column could easily be added with df["delta"] = 0. I just don't know how to fill the delta column for ALL rows.
I tried getting a matching row with df[name = 'building' & location = 'type' ... ~& type = 'a'], but I can't edit the result I get from that. Maybe I also don't quite understand when I get a copy and when a reference.
I hope my problem is clear. If not, I am happy to explain further.
Thanks a lot already for your help!
IIUC, you want groupby.transform:
df['delta'] = (df.groupby(df.columns.difference(['type', 'area']).tolist())['area']
                 .transform('diff').abs())
print(df)
name location type year area delta
0 building NY a 2019 650.3 NaN
1 building NY b 2019 400.0 250.3
2 park LA a 2017 890.7 NaN
3 lake SF b 2007 142.2 NaN
4 park LA b 2017 333.3 557.4
If you want to write the difference into both rows of the delta column:
df['delta'] = (df.groupby(df.columns.difference(['type', 'area']).tolist())['area']
                 .transform(lambda x: x.diff().bfill()).abs())
print(df)
name location type year area delta
0 building NY a 2019 650.3 250.3
1 building NY b 2019 400.0 250.3
2 park LA a 2017 890.7 557.4
3 lake SF b 2007 142.2 NaN
4 park LA b 2017 333.3 557.4
Detail:
df.columns.difference(['type', 'area']).tolist()
# or: [*df.columns.difference(['type', 'area'])]
# output: ['location', 'name', 'year']
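For reference, a minimal runnable sketch of the transform approach on the sample data, selecting the area column explicitly so only it is diffed:
import pandas as pd

df = pd.DataFrame({'name': ['building', 'building', 'park', 'lake', 'park'],
                   'location': ['NY', 'NY', 'LA', 'SF', 'LA'],
                   'type': ['a', 'b', 'a', 'b', 'b'],
                   'year': [2019, 2019, 2017, 2007, 2017],
                   'area': [650.3, 400.0, 890.7, 142.2, 333.3]})
group_cols = df.columns.difference(['type', 'area']).tolist()  # ['location', 'name', 'year']
df['delta'] = df.groupby(group_cols)['area'].transform(lambda x: x.diff().bfill()).abs()
print(df)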
A solution with merge (taking the absolute difference, per the question):
df['other_type'] = np.where(df['type'] == 'a', 'b', 'a')
(df.merge(df,
          left_on=['name', 'location', 'year', 'type'],
          right_on=['name', 'location', 'year', 'other_type'],
          suffixes=['', '_r'])
   .assign(delta=lambda x: (x['area'] - x['area_r']).abs())
   .drop(['area_r', 'other_type_r'], axis=1)
)
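Note that the default inner merge drops rows without a matching partner (like lake SF in the sample); passing how='left' to merge keeps them, with NaN in the delta column.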

pandas fill missing country values based on city if it exists

I'm trying to fill in country names in my dataframe when they are null, based on city and country pairs that already exist. For example, in the dataframe below I want to replace the NaN for the city Bangalore with the country India, since that city already appears elsewhere in the dataframe.
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
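For reference, a runnable sketch with the sample data, showing the intermediate mapping Series:
import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['Bangalore', 'Delhi', 'London', 'California', 'Dubai', 'Abu Dhabi', 'Bangalore'],
                   'Country': ['India', 'India', 'UK', 'USA', 'UAE', 'UAE', np.nan]})
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
print(g['Bangalore'])  # 'India' - the value used to fill the last row
df['Country'] = df['Country'].fillna(df['City'].map(g))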
I believe
df1['Country'] = df1.groupby('City')['Country'].ffill()
should resolve your issue by forward filling the missing values within each group. Note this only works when a non-null value appears before the NaN within the group.
One of the ways could be -
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one hacky way to do it:
first forward fill and then backward fill (for the case where the NaN occurs first in a group):
df = df.groupby('City')[['City', 'Country']].ffill().groupby('City')[['City', 'Country']].bfill()
