Derive pandas column from another - python

I have a column like the following in my dataframe.
What I want is to create a new column, country, based on the language column. If someone's language is "eng", the country column should be filled with "UK".
Desired output
NB: This is a sample I created in Excel; I am working with pandas in a Jupyter notebook.

Considering this to be your df:
In [1359]: df = pd.DataFrame({'driver':['Hamilton', 'Sainz', 'Giovanazi'], 'language':['eng', 'spa', 'ita']})
In [1360]: df
Out[1360]:
driver language
0 Hamilton eng
1 Sainz spa
2 Giovanazi ita
And this to be your language-country mapping:
In [1361]: mapping = {'eng': 'UK', 'spa': 'Spain', 'ita': 'Italy'}
You can use Series.map on the language column to solve it:
In [1363]: df['country'] = df.language.map(mapping)
In [1364]: df
Out[1364]:
driver language country
0 Hamilton eng UK
1 Sainz spa Spain
2 Giovanazi ita Italy
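If some languages have no entry in the mapping, map leaves NaN in their place. A minimal sketch of handling that case follows; the 'ger' row and the 'Unknown' placeholder are assumptions added for illustration and are not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({'driver': ['Hamilton', 'Sainz', 'Vettel'],
                   'language': ['eng', 'spa', 'ger']})
mapping = {'eng': 'UK', 'spa': 'Spain', 'ita': 'Italy'}

# Keys missing from the mapping ('ger' here) become NaN after map
df['country'] = df['language'].map(mapping)

# Replace those NaN values with a placeholder of your choosing
df['country'] = df['country'].fillna('Unknown')
print(df)
```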


Using df1 as a lookup table for df2, df2 has more unique values than df1 in Python

I have a df of US citizens and their states, and I would like to use it as a lookup for world citizens
df1=
[Sam, New York;
Nick, California;
Sarah, Texas]
df2 =
[Sam;
Phillip;
Will;
Sam]
I would like to either df2.replace() with the states or create df3 where my output is:
[New York;
NaN;
NaN;
New York]
I have tried mapping with set_index and dict(zip()) but have had no luck so far.
Thank you.
How about this method:
import pandas as pd
df1 = pd.DataFrame([['Sam', 'New York'], ['Nick', 'California'], ['Sarah', 'Texas']],
                   columns=['name', 'state'])
display(df1)
df2 = pd.DataFrame(['Sam', 'Phillip', 'Will', 'Sam'], columns=['name'])
display(df2)
# Left merge keeps every row of df2; names missing from df1 get NaN
df2.merge(df1, on='name', how='left')
resulting in
name state
0 Sam New York
1 Nick California
2 Sarah Texas
name
0 Sam
1 Phillip
2 Will
3 Sam
name state
0 Sam New York
1 Phillip NaN
2 Will NaN
3 Sam New York
You can then select just the state column from the merged dataframe, e.g. df2.merge(df1, on='name', how='left')['state'].
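Since the question mentions trying set_index, here is a sketch of that route as an alternative to the merge. It assumes the names in df1 are unique; Series.map raises an error if the lookup index contains duplicates.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Sam', 'Nick', 'Sarah'],
                    'state': ['New York', 'California', 'Texas']})
df2 = pd.DataFrame({'name': ['Sam', 'Phillip', 'Will', 'Sam']})

# Build a name -> state lookup Series indexed by name
lookup = df1.set_index('name')['state']

# Names that are absent from the lookup come back as NaN
df2['state'] = df2['name'].map(lookup)
print(df2)
```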

How do I deal with a column in Pandas that equals "NA"? [duplicate]

I know this sounds dumb, but I can't figure out what to do about data in a spreadsheet that equals "NA" (in my case, it's an abbreviation for "North America"). When I do a Pandas "read_excel", the data gets brought in as "NaN" instead of "NA".
Is "NA" also considered "Not a Number" like NaN is?
The input Excel sheet cells contain NA. The dataframe contains "NaN".
Any way to avoid this?
Solution
By default, pandas treats the string "NA" as a missing-value marker when reading files. You can switch off this automatic detection by passing keep_default_na=False to pandas.read_excel(), as follows.
I am using the demo test.xlsx file that I created in the Dummy Data section.
pd.read_excel('test.xlsx', keep_default_na=False)
## Output
# Region Country
# 0 NA Canada
# 1 NA USA
# 2 SA Brazil
# 3 EU Sweden
# 4 AU Australia
Dummy Data
import pandas as pd
# Create a dummy dataframe for demo purpose
df = pd.DataFrame({'Region': ['NA', 'NA', 'SA', 'EU', 'AU'],
'Country': ['Canada', 'USA', 'Brazil', 'Sweden', 'Australia']})
# Create an excel file with this data
df.to_excel('test.xlsx', index=False)
# Show dataframe
print(df)
Output
Region Country
0 NA Canada
1 NA USA
2 SA Brazil
3 EU Sweden
4 AU Australia
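If some cells should still be read as missing, keep_default_na=False can be combined with an explicit na_values list. The sketch below uses read_csv on an in-memory string so that it runs without an Excel file; read_excel accepts the same two parameters. The Population column and the 'n/a' marker are assumptions made up for the demo.

```python
import io
import pandas as pd

csv_text = "Region,Country,Population\nNA,Canada,\nSA,Brazil,n/a\n"

# keep_default_na=False disables the built-in markers, so 'NA' survives as text.
# na_values then lists the only strings that should still become NaN.
df = pd.read_csv(io.StringIO(csv_text),
                 keep_default_na=False,
                 na_values=['', 'n/a'])
print(df)
```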

Create a two-way table from dictionary of combinations

I'm writing some simple code to build a two-way table of distances between various cities.
Basically, I have a list of cities (say just 3: Paris, Berlin, London), and I created a combination between them with itertools (so I have Paris-Berlin, Paris-London, Berlin-London). I parsed the distances from a website and saved them in a dictionary (so I have: {Paris: {Berlin : 878.36, London : 343.67}, Berlin : {London : 932.14}}).
Now I want to create a two-way table, so that I can look up a pair of cities in Excel (I need it in Excel unfortunately, otherwise with Python all of this would be unnecessary!) and get the distance back. The table has to be complete (i.e. not triangular, so that I can look for London-Paris or Paris-London, and the value must be there for both row/column pairs). Is something like this easily possible? I was thinking I probably need to fill in my dictionary (i.e. create something like {Paris : {Berlin : 878.36, London : 343.67}, Berlin : {Paris : 878.36, London : 932.14}, London : {Paris : 343.67, Berlin : 932.14}}) and then feed it to Pandas, but I'm not sure it's the fastest way. Thank you!
I think this does something like what you need:
import pandas as pd
data = {'Paris': {'Berlin': 878.36, 'London': 343.67}, 'Berlin': {'London': 932.14}}
# Create data frame from dict
df = pd.DataFrame(data)
# Rename index
df.index.name = 'From'
# Make index into a column
df = df.reset_index()
# Turn destination columns into rows
df = df.melt(id_vars='From', var_name='To', value_name='Distance')
# Drop missing values (distance to oneself)
df = df.dropna()
# Concatenate with itself but swapping the order of cities
df = pd.concat([df, df.rename(columns={'From' : 'To', 'To': 'From'})], sort=False)
# Reset index
df = df.reset_index(drop=True)
print(df)
Output:
From To Distance
0 Berlin Paris 878.36
1 London Paris 343.67
2 London Berlin 932.14
3 Paris Berlin 878.36
4 Paris London 343.67
5 Berlin London 932.14
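The long table above can also be reshaped into the square city-by-city layout the question asks for. This is a sketch under the assumption that each pair appears only once in the dictionary; the names rows, long_df, both and table are illustrative.

```python
import pandas as pd

data = {'Paris': {'Berlin': 878.36, 'London': 343.67}, 'Berlin': {'London': 932.14}}

# Flatten the nested dict into (From, To, Distance) rows
rows = [(a, b, d) for a, inner in data.items() for b, d in inner.items()]
long_df = pd.DataFrame(rows, columns=['From', 'To', 'Distance'])

# Append the reversed pairs so both directions are present
both = pd.concat([long_df, long_df.rename(columns={'From': 'To', 'To': 'From'})],
                 ignore_index=True)

# Pivot into a square table; the diagonal (a city to itself) stays NaN
table = both.pivot(index='From', columns='To', values='Distance')
print(table)
```

The resulting table can be written out with table.to_excel(...) in the same way as any other dataframe.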

pandas fill missing country values based on city if it exists

I'm trying to fill in missing country names in my dataframe based on other rows where the same city appears with its country. For example, in the dataframe below I want to replace the NaN for City Bangalore with Country India, because that city exists elsewhere in the dataframe with a country.
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
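A small check of that claim, using a made-up three-row frame in which the NaN row comes before the row that carries the country:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Bangalore', 'Delhi', 'Bangalore'],
                   'Country': [None, 'India', 'India']})

# Lookup built only from rows where Country is known, one row per City
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']

# Fill gaps from the lookup; row order does not matter here
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
```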
I believe
df1['Country'] = df1.groupby('City')['Country'].ffill()
should resolve your issue by forward filling missing values within each group. Note that this only helps when a non-null value comes before the NaN within the group.
One of the ways could be -
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one nasty way to do it.
First forward fill and then backward fill within each city (to cover the case where a NaN occurs first):
df['Country'] = df.groupby('City')['Country'].transform(lambda s: s.ffill().bfill())

Merging Two Dataframes without a Key Column

I have a requirement where I want to merge two data frames without any key column.
From the input table, I am treating first three columns as one data frame and the last column as another one. My plan is to sort the second data frame and then merge it to the first one without any key column so that it looks like the above output.
Is it possible to merge in this way or if there are any alternatives?
One way is to use pd.DataFrame.join after filtering out null values.
Data from @ALollz.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
res = df1.join(pd.DataFrame(list(filter(None, df2.values)), columns=['comments']))
Result:
Country comments
0 USA X
1 UK Y
2 Finland Z
3 Spain NaN
4 Australia NaN
If by "sort the second dataframe" you mean move the NULL values to the end of the list and keep the rest of the order intact, then this will get the job done.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia'],
'Name': ['Sam', 'Chris', 'Jeff', 'Kartik', 'Mavenn']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
# Drop the nulls, renumber from 0, and let index alignment pad the tail with NaN
df1['Comments'] = df2['Comments'].dropna().reset_index(drop=True)
Country Name Comments
0 USA Sam X
1 UK Chris Y
2 Finland Jeff Z
3 Spain Kartik NaN
4 Australia Mavenn NaN
IIUC:
input['Comments'] = input.Comments.sort_values().values
Output:
Comments Country Name
1 X USA Sam
2 Y UK Chris
3 Z Finland Jeff
4 NaN Spain Kartik
5 NaN Australia Maven
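The answers above differ in one respect: sort_values reorders the comments alphabetically, while dropping nulls keeps their original order. A sketch with a made-up series, deliberately not in alphabetical order, makes the difference visible:

```python
import pandas as pd

comments = pd.Series(['Z', None, 'X', None, 'Y'])

# sort_values orders the values and puts missing entries last
alphabetical = comments.sort_values().reset_index(drop=True)

# dropna keeps the original order; reindex pads the tail with NaN
order_kept = (comments.dropna()
                      .reset_index(drop=True)
                      .reindex(range(len(comments))))

print(alphabetical.tolist())
print(order_kept.tolist())
```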
