pandas fill missing country values based on city if it exists - python

I'm trying to fill in country names in my dataframe where they are null, based on city and country pairs that do exist. For example, see the dataframe below: here I want to replace the NaN for City Bangalore with Country India, since that City exists elsewhere in the dataframe.
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).

You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
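For reference, a minimal runnable version of this answer, with the dataframe built from the question's data:
import pandas as pd
import numpy as np

df = pd.DataFrame({'City': ['Bangalore', 'Delhi', 'London', 'California',
                            'Dubai', 'Abu Dhabi', 'Bangalore'],
                   'Country': ['India', 'India', 'UK', 'USA', 'UAE', 'UAE', np.nan]})

# one Country per City, taken from the non-null rows
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))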

I believe
df1['Country'] = df1.groupby('City')['Country'].ffill()
should resolve your issue by forward filling missing values within each City group. (fillna(method='ffill') is deprecated in recent pandas, so ffill is used directly, and the result needs to be assigned back.)
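Note that a forward fill alone cannot fill a NaN that appears before any non-null value in its group. A minimal sketch covering both directions, in case that edge case matters:
# fill within each City group in both directions, so a NaN that
# occurs first in its group is also filled
df1['Country'] = df1.groupby('City')['Country'].transform(lambda s: s.ffill().bfill())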

One of the ways could be -
# take one non-null Country per City into a helper column 'C'
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
# left-join the helper column back on City
df1 = df1.merge(non_null_cities, on='City', how='left')
# fill missing countries from the helper column, then drop it
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!

Here is one nasty way to do it:
first use a forward fill and then a backward fill (for the case where the NaN occurs first in a group).
df = df.groupby('City')[['City','Country']].ffill().groupby('City')[['City','Country']].bfill()

Related

How to replace a part of column value with values from another two columns based on a condition in pandas

I have a dataframe df as shown below. I want to replace all the temp_id column values that contain an underscore (_) with a new value combining the numerical part of the temp_id + the city + country column values.
df
temp_id city country
12225IND DELHI IND
14445UX_TY AUSTIN US
56784SIN BEDOK SIN
72312SD_IT_UZ NEW YORK US
47853DUB DUBAI UAE
80976UT_IS_SZ SYDENY AUS
89012TY_JP_IS TOKOYO JPN
51309HJ_IS_IS
42087IND MUMBAI IND
Expected Output
temp_id city country
12225IND DELHI IND
14445AUSTINUS AUSTIN US
56784SIN BEDOK SIN
72312NEWYORKUS NEW YORK US
47853DUB DUBAI UAE
80976SYDENYAUS SYDENY AUS
89012TOKOYOJPN TOKOYO JPN
51309HJ_IS_IS
42087IND MUMBAI IND
How can this be done in pandas/python?
Use boolean indexing:
# find rows with value in country and city
m1 = df[['city', 'country']].notna().all(axis=1)
# find rows with a "_"
m2 = df['temp_id'].str.contains('_')
# both conditions above
m = m1 & m2
# replace matching rows by number + city + country
df.loc[m, 'temp_id'] = (df.loc[m, 'temp_id'].str.extract(r'^(\d+)', expand=False)
                        + df.loc[m, 'city'].str.replace(' ', '')
                        + df.loc[m, 'country'])
Output:
temp_id city country
0 12225IND DELHI IND
1 14445AUSTINUS AUSTIN US
2 56784SIN BEDOK SIN
3 72312NEWYORKUS NEW YORK US
4 47853DUB DUBAI UAE
5 80976SYDENYAUS SYDENY AUS
6 89012TOKOYOJPN TOKOYO JPN
7 51309HJ_IS_IS None None
8 42087IND MUMBAI IND
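For reference, the sample dataframe can be built like this before applying the masks above:
import pandas as pd

df = pd.DataFrame({
    'temp_id': ['12225IND', '14445UX_TY', '56784SIN', '72312SD_IT_UZ', '47853DUB',
                '80976UT_IS_SZ', '89012TY_JP_IS', '51309HJ_IS_IS', '42087IND'],
    'city': ['DELHI', 'AUSTIN', 'BEDOK', 'NEW YORK', 'DUBAI', 'SYDENY', 'TOKOYO', None, 'MUMBAI'],
    'country': ['IND', 'US', 'SIN', 'US', 'UAE', 'AUS', 'JPN', None, 'IND'],
})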
You can use re with a row-wise apply on the temp_id column, using a regular expression to match the part of the values you want to keep. Here is an example:
import re
df['temp_id'] = df.apply(lambda r: re.match(r'\d+', r['temp_id']).group()
                         + r['city'].replace(' ', '') + r['country']
                         if '_' in r['temp_id'] and pd.notna(r['city']) else r['temp_id'],
                         axis=1)
This matches the leading digits of each temp_id that contains an underscore and replaces the rest with the corresponding city (spaces removed) and country values from the same row; rows without an underscore, or without city/country data, are left unchanged. The result is the temp_id column in the desired format.

Is there a way to take all of my row values and make them col headers?

I currently have a long list of countries (234 values). For simplicity's sake, only 10 are shown here. This is what my df currently looks like:
Country
China
India
U.S.
Indonesia
Pakistan
...
Montserrat
Falkland Islands
Niue
Tokelau
Vatican City
I want to create a matrix of some sort, where the countries listed in each row are also the col headers. This is what I want my output dataframe to look like (the same countries as both the row index and the column headers):
                  China  India  U.S.  Indonesia  Pakistan  ...  Montserrat  Falkland Islands  Niue  Tokelau  Vatican City
China
India
U.S.
Indonesia
Pakistan
...
Montserrat
Falkland Islands
Niue
Tokelau
Vatican City
So to reiterate the question: how do I take the value in each row of col 1 and copy it to be my dataframe column headers to create a matrix? This dataframe is also being scraped from a website using requests and Beautiful Soup, so it isn't like I can create a CSV file of a pre-made dataframe. Is what I want to do possible?
Initialize a Pandas DataFrame as follows:
countryList = ['China', 'India', 'U.S.']
pd.DataFrame(columns=countryList, index=countryList)
and just extend countryList according to your use case. This yields an empty dataframe to insert data into:
       China  India  U.S.
China    NaN    NaN   NaN
India    NaN    NaN   NaN
U.S.     NaN    NaN   NaN
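Since the country list is scraped rather than read from a pre-made CSV, the same idea works with whatever list you collect. A minimal sketch, assuming the scraped names end up in a plain Python list:
import pandas as pd

# hypothetical list standing in for the 234 scraped country names
countries = ['China', 'India', 'U.S.', 'Indonesia', 'Pakistan']

# empty square matrix with the countries as both index and column headers
matrix = pd.DataFrame(index=countries, columns=countries)
print(matrix.shape)  # (5, 5)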
Will something like this work?
data = ["US","China","England","Spain",'Brazil']
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df
Output:
Country US China England Spain Brazil
0 US
1 China
2 England
3 Spain
4 Brazil
You can even set the country as the index like:
data = ["US","China","England","Spain","Brazil"]
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df = df.set_index(df.Country)[df.Country.values].rename_axis(index=None)
Output:
US China England Spain Brazil
US
China
England
Spain
Brazil
7Shoe's answer is good, but in case you already have a dataframe:
import pandas as pd
df = pd.DataFrame({'Country':['U.S.','Canada', 'India']})
pd.DataFrame(columns=df.Country, index=df.Country).rename_axis(None)
Output
Country U.S. Canada India
U.S. NaN NaN NaN
Canada NaN NaN NaN
India NaN NaN NaN

How can I fill cells of a new column based on a substring of the original data using pandas?

There are 2 dataframes, and they have similar data.
A dataframe
Index Business Address
1 Oils Moskva, Russia
2 Foods Tokyo, Japan
3 IT California, USA
... etc.
B dataframe
Index Country Country Calling Codes
1 USA +1
2 Egypt +20
3 Russia +7
4 Korea +82
5 Japan +81
... etc.
I want to add a column named 'Country Calling Codes' to the A dataframe, too.
To do this, the 'Country' column in the B dataframe should be compared with the data in the 'Address' column: if the string in 'A.Address' includes the string in 'B.Country', then 'B.Country Calling Codes' should be inserted into 'A.Country Calling Codes' of the matching row.
Result is:
Index Business Address Country Calling Codes
1 Oils Moskva, Russia +7
2 Foods Tokyo, Japan +81
3 IT California, USA +1
I don't know how to deal with this issue because I don't have much experience using pandas. I would be very grateful for any help.
Use Series.str.extract to get the possible Country strings from the Address column, then map them with Series.map:
d = B.drop_duplicates('Country').set_index('Country')['Country Calling Codes']
s = A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False)
A['Country Calling Codes'] = s.map(d)
print (A)
Index Business Address Country Calling Codes
0 1 Oils Moskva, Russia +7
1 2 Foods Tokyo, Japan +81
2 3 IT California, USA +1
Detail:
print (A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False))
0 Russia
1 Japan
2 USA
Name: Address, dtype: object
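For reference, a minimal runnable sketch of the same approach, with the dataframes built from the question's data:
import pandas as pd

A = pd.DataFrame({'Business': ['Oils', 'Foods', 'IT'],
                  'Address': ['Moskva, Russia', 'Tokyo, Japan', 'California, USA']})
B = pd.DataFrame({'Country': ['USA', 'Egypt', 'Russia', 'Korea', 'Japan'],
                  'Country Calling Codes': ['+1', '+20', '+7', '+82', '+81']})

# one calling code per country, then match any country name inside Address
d = B.drop_duplicates('Country').set_index('Country')['Country Calling Codes']
A['Country Calling Codes'] = A['Address'].str.extract(f'({"|".join(d.index)})', expand=False).map(d)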

Left join tables (1:n) using Pandas, keeping number of rows the same as left table

How do I left join tables with a 1:n relationship, while keeping the number of rows the same as the left table and concatenating any duplicate data with a character/string like ';'?
Example: Country Table
CountryID Country Area
1 UK 1029
2 Russia 8374
Cities Table
CountryID City
1 London
1 Manchester
2 Moscow
2 Ufa
I want:
CountryID Country Area Cities
1 UK 1029 London;Manchester
2 Russia 8374 Moscow;Ufa
I know how to perform a normal left join
country.merge(city, how='left', on='CountryID')
which gives me four rows instead of two:
Area Country CountryID City
1029 UK 1 London
1029 UK 1 Manchester
8374 Russia 2 Moscow
8374 Russia 2 Ufa
If performance is important, use map with a Series created by groupby + join to build the new column in df1:
df1['Cities'] = df1['CountryID'].map(df2.groupby('CountryID')['City'].apply(';'.join))
print (df1)
CountryID Country Area Cities
0 1 UK 1029 London;Manchester
1 2 Russia 8374 Moscow;Ufa
Detail:
print (df2.groupby('CountryID')['City'].apply(';'.join))
CountryID
1 London;Manchester
2 Moscow;Ufa
Name: City, dtype: object
Another solution with join:
df = df1.join(df2.groupby('CountryID')['City'].apply(';'.join), on='CountryID')
print (df)
CountryID Country Area City
0 1 UK 1029 London;Manchester
1 2 Russia 8374 Moscow;Ufa
This will give you the desired result:
df1.merge(df2, on='CountryID').groupby(['CountryID', 'Country', 'Area']).agg({'City': lambda x: ';'.join(x)}).reset_index()
# CountryID Country Area City
#0 1 UK 1029 London;Manchester
#1 2 Russia 8374 Moscow;Ufa
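For reference, a minimal runnable sketch of the map-based solution with the question's tables; the trailing fillna('') is an assumption about how a country with no cities should look:
import pandas as pd

df1 = pd.DataFrame({'CountryID': [1, 2], 'Country': ['UK', 'Russia'], 'Area': [1029, 8374]})
df2 = pd.DataFrame({'CountryID': [1, 1, 2, 2],
                    'City': ['London', 'Manchester', 'Moscow', 'Ufa']})

# map each CountryID to its ';'-joined cities; IDs with no cities get ''
df1['Cities'] = df1['CountryID'].map(df2.groupby('CountryID')['City'].agg(';'.join)).fillna('')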

how to remove entire row if a particular column has duplicate values in a dataframe in python

I have a dataframe like this,
df,
Name City
0 sri chennai
1 pedhci pune
2 bahra pune
there is a duplicate in the City column.
I tried:
df["City"].drop_duplicates()
but it returns only that particular column.
my desired output should be
output_df
Name City
0 sri chennai
1 pedhci pune
You can use:
df2 = df.drop_duplicates(subset='City')
if you want to store the result in a new dataframe, or:
df.drop_duplicates(subset='City',inplace=True)
if you want to update df.
This produces:
>>> df
City Name
0 chennai sri
1 pune pedhci
2 pune bahra
>>> df.drop_duplicates(subset='City')
City Name
0 chennai sri
1 pune pedhci
This will thus only take duplicates for City into account; duplicates in Name are ignored.
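drop_duplicates keeps the first occurrence of each value by default; the keep parameter controls this:
# keep the last occurrence of each City instead of the first
df.drop_duplicates(subset='City', keep='last')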
