Pandas: How to map the values of a Dataframe to another Dataframe? - python

I am new to Python and am learning through some use cases of my own.
I have two DataFrames. The main one has an empty Country column that I need to fill in. The other one holds the values to use, in a column named 'Countries', and they need to be mapped into the main DataFrame by matching its 'Data' column against the text in 'Name Data'.
(Apologies if this question has already been answered.)
Below is the Main DataFrame:
Name Data | Country
----------------------------- | ---------
Arjun Kumar Reddy las Vegas |
Divya london Khosla |
new delhi Pragati Kumari |
Will London Turner |
Joseph Mascurenus Bombay |
Jason New York Bourne |
New york Vice Roy |
Joseph Mascurenus new York |
Peter Parker California |
Bruce (istanbul) Wayne |
Below is the Referenced DataFrame:
Data | Countries
-------------- | ---------
las Vegas | US
london | UK
New Delhi | IN
London | UK
bombay | IN
New York | US
New york | US
new York | US
California | US
istanbul | TR
Moscow | RS
Cape Town | SA
And what I want in the result will look like below:
Name Data | Country
----------------------------- | ---------
Arjun Kumar Reddy las Vegas | US
Divya london Khosla | UK
new delhi Pragati Kumari | IN
Will London Turner | UK
Joseph Mascurenus Bombay | IN
Jason New York Bourne | US
New york Vice Roy | US
Joseph Mascurenus new York | US
Peter Parker California | US
Bruce (istanbul) Wayne | TR
Please note that the two DataFrames are not the same size.
I thought of using map or the fuzzywuzzy library, but couldn't get the result I wanted.

Find the key from the reference DataFrame that appears in each row and extract it (import re is needed for the re.I flag).
import re
regex = '(' + ')|('.join(ref_df['Data']) + ')'
df['key'] = df['Name Data'].str.extract(regex, flags=re.I).bfill(axis=1)[0]
>>> df
Name Data key
0 Arjun Kumar Reddy las Vegas las Vegas
1 Bruce (istanbul) Wayne istanbul
2 Joseph Mascurenus new York new York
>>> ref_df
Data Country
0 las Vegas US
1 new York US
2 istanbul TR
Merge the two DataFrames on the extracted key.
pd.merge(df, ref_df, left_on='key', right_on='Data')
Name Data key Data Country
0 Arjun Kumar Reddy las Vegas las Vegas las Vegas US
1 Bruce (istanbul) Wayne istanbul istanbul TR
2 Joseph Mascurenus new York new York new York US
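One thing to watch with the approach above: because of flags=re.I the extracted key keeps the casing found in 'Name Data', so a row such as 'new delhi Pragati Kumari' would not merge with 'New Delhi' in the reference frame, and place names containing regex metacharacters could be misread as pattern syntax. Below is a minimal sketch of one way around both. It assumes the reference column is named 'Countries' as in the question and that lowercasing is an acceptable way to normalise place names; the lookup dict and the variable names are my own, not part of the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({'Name Data': ['Arjun Kumar Reddy las Vegas',
                                 'new delhi Pragati Kumari',
                                 'Joseph Mascurenus Bombay',
                                 'Bruce (istanbul) Wayne']})
ref_df = pd.DataFrame({'Data': ['las Vegas', 'New Delhi', 'bombay', 'istanbul', 'Moscow'],
                       'Countries': ['US', 'IN', 'IN', 'TR', 'RS']})

# Lowercased lookup: place name -> country code.
# Case variants collapse into one entry; the last one seen wins.
lookup = dict(zip(ref_df['Data'].str.lower(), ref_df['Countries']))

# A single capture group with escaped alternatives, longer names first.
names = sorted(lookup, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(n) for n in names) + ')'

# Extract case-insensitively, then normalise the key before looking it up.
key = df['Name Data'].str.extract(pattern, flags=re.I, expand=False).str.lower()
df['Country'] = key.map(lookup)
print(df)
```

Rows where no place name is found end up with NaN in 'Country', which makes unmatched rows easy to spot.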

It looks like everything is sorted, so you can merge on the index:
mdf.merge(rdf, left_index=True, right_index=True)

Related

Python pandas for manipulating text & inconsistent data

How do I take specific text from one column in pandas when the format is inconsistent? For example, like this:
Area | Owners
Bali Island: 4600 | John
Java Island:7200 | Van Hour
Hallo Island : 2400| Petra
and the result should look like this:
Area | Owners | Area Number
Bali Island: 4600 | John | 4600
Java Island:7200 | Van Hour | 7200
Hallo Island : 2400| Petra | 2400
You could use str.extract:
df['Area Number'] = df['Area'].str.extract(r'(\d+)$', expand=False)
output:
Area Owners Area Number
0 Bali Island: 4600 John 4600
1 Java Island:7200 Van Hour 7200
2 Hallo Island : 2400 Petra 2400
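Note that str.extract returns strings. If the numbers are needed for arithmetic, a small sketch along these lines could convert them. It assumes the digits always sit at the end of 'Area' (optionally followed by whitespace), and the intermediate name digits is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Area': ['Bali Island: 4600', 'Java Island:7200', 'Hallo Island : 2400'],
                   'Owners': ['John', 'Van Hour', 'Petra']})

# expand=False returns a Series when the pattern has a single capture group.
digits = df['Area'].str.extract(r'(\d+)\s*$', expand=False)

# Convert to numbers; rows without trailing digits become NaN instead of raising.
df['Area Number'] = pd.to_numeric(digits, errors='coerce')
print(df)
```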

Changing values in a column based on a match

I have a pandas DataFrame that contains the names of Brazilian universities, but sometimes these names appear in a short form and sometimes in a long form (for example, the Universidade Federal do Rio de Janeiro is sometimes identified as UFRJ).
The DataFrame looks like this:
| college |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| UFRJ |
| Universidade de Sao Paulo |
| USP |
| Catholic University of Minas Gerais |
I also have another one that holds, in separate columns, the short name and the long name of SOME (not all) of those universities. It looks like this:
| long_name | short_name |
|----------------------------------------|------------|
| Universidade Federal do Rio de Janeiro | UFRJ |
| Universidade de Sao Paulo | USP |
What I want is: substitute all short names by long names, so in this context, the first dataframe would have the college column changed to this:
| college |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| Universidade Federal do Rio de Janeiro |
| Universidade de Sao Paulo |
| Universidade de Sao Paulo |
| Catholic University of Minas Gerais | <--- note: this one does not have a match, so it stays the same
Is there a way to do that using pandas and numpy (or any other library)?
Use Series.map with a Series built from the second DataFrame (short_name as the index, long_name as the values). Values with no match come back as missing, so Series.fillna restores the original names:
df1['college'] = (df1['college'].map(df2.set_index('short_name')['long_name'])
.fillna(df1['college']))
print (df1)
college
0 Universidade Federal do Rio de Janeiro
1 Universidade Federal do Rio de Janeiro
2 Universidade de Sao Paulo
3 Universidade de Sao Paulo
4 Catholic University of Minas Gerais

How to split a pandas string to extract middle names?

I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.
The data would look like this:
ID| Complete_Name | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"
Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.
df = pd.read_csv("list.pip", sep="|")
df["First Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])
The desired results should look like this:
ID| Complete_Name | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | | VON HELSINKI
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| JENSEN
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ
A single str.extract call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex Breakdown
^                     # Start-of-line
(?P<Last_Name>        # First named capture group - Last Name
    .*                # Match anything until...
)
,                     # ...we see a comma
\s                    # whitespace
(?P<First_Name>       # Second capture group - First Name
    \S+               # Match all non-whitespace characters
)
\b                    # Word boundary
\s*                   # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name>      # Third capture group - Zero or more middle names
    .*                # Match everything till the end of string
)
I think you can do:
# take the Complete_Name column for individuals and split it at the comma
df2 = (df.loc[df['Type'].eq('I'), 'Complete_Name'].str
         .split(',', expand=True)
         .fillna(''))
# remove extra spaces
for col in df2.columns:
    df2[col] = df2[col].str.strip()
# split the given names on the first space and put them next to the last name
df2 = pd.concat([df2[0], df2[1].str.split(' ', n=1, expand=True)], axis=1)
df2.columns = ['last', 'first', 'middle']
# join the data frames
df = pd.concat([df[['ID', 'Complete_Name', 'Type']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID', 'Complete_Name', 'Type', 'first', 'middle', 'last']]
# replace missing values with empty strings
df = df.fillna('')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Here's another answer that uses some simple lambda functionality.
import numpy as np
import pandas as pd
""" Create data and data frame """
info_dict = {
'ID': [1,2,3,4,5,],
'Complete_Name':[
'JERRY, Ben',
'VON HELSINKI, Olga',
'JENSEN, James Goodboy Dean',
'THE COMPANY',
'CRUZ, Juan S. de la',
],
'Type':['I','I','I','C','I',],
}
data = pd.DataFrame(info_dict, columns = info_dict.keys())
""" List of columns to add """
name_cols = [
'First Name',
'Middle Name',
'Last Name',
]
"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())
# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]
# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES
# Optional: We can replace all blank values with numpy.NaN values using regular expressions.
data = data.replace(r'^$', np.nan, regex=True)
Then you should end up with something like this:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Or, replace NaN values with blank strings:
data = data.replace(np.nan, r'', regex=False)
Then you have:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ

Phrase similarity from List

Hi assuming I have 2 lists:
names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
and a dataframe with some sentences
ex:
DataframeOrig
Index | Sent
0 | Mandy went to France on the Eiffel Tower
1 | Daniele was dancing on top of the box
2 | I am eating on top of the table
3 | Maria went to the valley of the kings
I would like to use a distance metric like difflib to scan the sentences and compare phrases to the list having a determined offset. Hopefully the result of this would be:
Index | Sent | Result
0 | Mandy went to France on the Eiffel Tower | Mandy
1 | Daniele was dancing on top of the box | Daniel
2 | I am eating on top of the table | on top of the table
3 | Maria went to the valley of the kings | Mario, valley of the kings
How would you go about it without using loads of loops to get phrase matches?
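A rough sketch of one possible approach with difflib is shown below. It slides a window of as many words as the phrase has over each sentence and keeps the phrase when any window is similar enough. The function name fuzzy_hits and the cutoff of 0.8 are assumptions of mine and would need tuning. It still loops over phrases and windows, and it will not reproduce the desired table exactly, since a near miss such as "on top of the box" can also clear the threshold.

```python
from difflib import SequenceMatcher

names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
sentences = ['Mandy went to France on the Eiffel Tower',
             'Daniele was dancing on top of the box',
             'I am eating on top of the table',
             'Maria went to the valley of the kings']

def fuzzy_hits(sentence, phrases, cutoff=0.8):
    """Return the phrases that approximately occur in `sentence`."""
    words = sentence.lower().split()
    hits = []
    for phrase in phrases:
        target = phrase.lower()
        n = len(target.split())
        # Every run of n consecutive words in the sentence is a candidate.
        windows = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
        if any(SequenceMatcher(None, target, w).ratio() >= cutoff for w in windows):
            hits.append(phrase)
    return hits

results = [fuzzy_hits(s, names + places) for s in sentences]
for sentence, hits in zip(sentences, results):
    print(sentence, '->', ', '.join(hits))
```

With a DataFrame, the same function could be applied to the sentence column via Series.apply.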

Map US state names to two-letter abbreviations given in a separate dictionary

Suppose now I have a dataframe with 2 columns: State and City.
Then I have a separate dict with the two-letter acronym for each state. Now I want to add a third column to map state name with its two-letter acronym. What should I do in Python/Pandas? For instance the sample question is as follows:
import pandas as pd
a = pd.Series({'State': 'Ohio', 'City':'Cleveland'})
b = pd.Series({'State':'Illinois', 'City':'Chicago'})
c = pd.Series({'State':'Illinois', 'City':'Naperville'})
d = pd.Series({'State': 'Ohio', 'City':'Columbus'})
e = pd.Series({'State': 'Texas', 'City': 'Houston'})
f = pd.Series({'State': 'California', 'City': 'Los Angeles'})
g = pd.Series({'State': 'California', 'City': 'San Diego'})
state_city = pd.DataFrame([a,b,c,d,e,f,g])
state_2 = {'OH': 'Ohio','IL': 'Illinois','CA': 'California','TX': 'Texas'}
Now I have to map the column State in the df state_city using the dictionary of state_2. The mapped df state_city should contain three columns: state, city, and state_2letter.
The original dataset I had had multiple columns with nearly all US major cities.
Therefore it will be less efficient to do it manually. Is there any easy way to do it?
For one, it's probably easier to store the key-value pairs as state name: abbreviation in your dictionary, like this:
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX'}
Instead of retyping it, you can build this from your original dictionary by swapping keys and values:
state_2 = {state: abbrev for abbrev, state in state_2.items()}
Using pandas.Series.map:
>>> state_city['abbrev'] = state_city['State'].map(state_2)
>>> state_city
City State abbrev
0 Cleveland Ohio OH
1 Chicago Illinois IL
2 Naperville Illinois IL
3 Columbus Ohio OH
4 Houston Texas TX
5 Los Angeles California CA
6 San Diego California CA
I agree with @blacksite that the state_2 dictionary should map state names to abbreviations, like this:
state_2 = {'Ohio': 'OH','Illinois': 'IL','California': 'CA','Texas': 'TX'}
Then use pandas.Series.replace:
state_city['state_2letter'] = state_city.State.replace(state_2)
state_city
|-|State |City |state_2letter|
|-|----- |------ |----------|
|0| Ohio | Cleveland | OH|
|1| Illinois | Chicago | IL|
|2| Illinois | Naperville | IL|
|3| Ohio | Columbus | OH|
|4| Texas | Houston | TX|
|5| California| Los Angeles | CA|
|6| California| San Diego | CA|
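The two answers behave differently when a state is missing from the dictionary: map returns NaN for it, while replace leaves the original value in place. A small sketch to illustrate, using a made-up 'Nevada' row that is not in the original data:

```python
import pandas as pd

state_city = pd.DataFrame({'State': ['Ohio', 'Texas', 'Nevada'],
                           'City': ['Cleveland', 'Houston', 'Reno']})
state_2 = {'Ohio': 'OH', 'Texas': 'TX'}  # 'Nevada' is deliberately missing

mapped = state_city['State'].map(state_2)        # unmatched values become NaN
replaced = state_city['State'].replace(state_2)  # unmatched values stay as they are

print(mapped.tolist())
print(replaced.tolist())
```

Which one is preferable depends on whether unmatched states should stand out as missing or pass through unchanged.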
