How do I extract specific text from one column in pandas when the format is inconsistent? For example, like this:
Area | Owners
Bali Island: 4600 | John
Java Island:7200 | Van Hour
Hallo Island : 2400| Petra
and the desired result would look like this:
Area | Owners | Area Number
Bali Island: 4600 | John | 4600
Java Island:7200 | Van Hour | 7200
Hallo Island : 2400| Petra | 2400
You could use str.extract:
df['Area Number'] = df['Area'].str.extract(r'(\d+)$')
output:
Area Owners Area Number
0 Bali Island: 4600 John 4600
1 Java Island:7200 Van Hour 7200
2 Hallo Island : 2400 Petra 2400
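Note that str.extract returns strings; if you need the new column as numbers, one small optional follow-up (a sketch, assuming every Area value ends in digits) is to convert it with pd.to_numeric:
df['Area Number'] = pd.to_numeric(df['Area'].str.extract(r'(\d+)$', expand=False))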
I have a Pandas DataFrame which contains the names of Brazilian universities, but sometimes these names appear in a short form and sometimes in a long form (for example, the Universidade Federal do Rio de Janeiro is sometimes identified as UFRJ).
The DataFrame looks like this:
| college |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| UFRJ |
| Universidade de Sao Paulo |
| USP |
| Catholic University of Minas Gerais |
And I have another one which has, in separate columns, the short name and the long name of SOME (not all) of those universities. It looks like this:
| long_name | short_name |
|----------------------------------------|------------|
| Universidade Federal do Rio de Janeiro | UFRJ |
| Universidade de Sao Paulo | USP |
What I want is to replace all short names with long names, so in this context the college column of the first dataframe would be changed to this:
| college |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| Universidade Federal do Rio de Janeiro |
| Universidade de Sao Paulo |
| Universidade de Sao Paulo |
| Catholic University of Minas Gerais | <--- note: this one does not have a match, so it stays the same
Is there a way to do that using pandas and numpy (or any other library)?
Use Series.map with a mapping built from the second DataFrame; values with no match become missing, so chain Series.fillna to keep the original value:
df1['college'] = (df1['college'].map(df2.set_index('short_name')['long_name'])
.fillna(df1['college']))
print (df1)
college
0 Universidade Federal do Rio de Janeiro
1 Universidade Federal do Rio de Janeiro
2 Universidade de Sao Paulo
3 Universidade de Sao Paulo
4 Catholic University of Minas Gerais
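An alternative sketch is Series.replace with a dict built from the second DataFrame; unmatched values are left untouched, so no fillna is needed (though map + fillna is usually faster on large frames):
mapping = dict(zip(df2['short_name'], df2['long_name']))
df1['college'] = df1['college'].replace(mapping)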
I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.
The data would look like this:
ID| Complete_Name | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"
Some names have only a first and a last name, while others have one or two middle names in between. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names:
import numpy as np
import pandas as pd

df = pd.read_csv("list.pip", sep="|")
df["First Name"] = np.where(df["Type"] == "I", df['Complete_Name'].str.split(',').str.get(1), "")
df["Last Name"] = np.where(df["Type"] == "I", df['Complete_Name'].str.split(',').str.get(0), "")
The desired results should look like this:
ID| Complete_Name | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | | VON HELSINKI
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| JENSEN
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ
A single str.extract call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex Breakdown
^                   # Start-of-line
(?P<Last_Name>      # First named capture group - Last Name
    .*              # Match anything until...
)
,                   # ...we see a comma, followed by a literal space
(?P<First_Name>     # Second named capture group - First Name
    \S+             # Match all non-whitespace characters
)
\b                  # Word boundary
\s*                 # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name>    # Third named capture group - zero or more middle names
    .*              # Match everything till the end of string
)
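If you want the columns to match the headings in the question, an optional follow-up (just a sketch) is to rename the extracted columns before concatenating:
u = u.rename(columns={'First_Name': 'First Name',
                      'Middle_Name': 'Middle Name',
                      'Last_Name': 'Last Name'})
df = pd.concat([df, u], axis=1).fillna('')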
I think you can do:
# take the complete_name column and split it multiple times
df2 = (df.loc[df['Type'].eq('I'), 'Complete_Name'].str
         .split(',', expand=True)
         .fillna(''))
# remove extra spaces
for col in df2.columns:
    df2[col] = [val.strip() for val in df2[col]]
# split the name on the first space and join it back
df2 = pd.concat([df2[0], df2[1].str.split(' ', n=1, expand=True)], axis=1)
df2.columns = ['last', 'first', 'middle']
# join the data frames
df = pd.concat([df[['ID', 'Complete_Name', 'Type']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID', 'Complete_Name', 'Type', 'first', 'middle', 'last']]
# remove None values
df = df.replace([None], '')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Here's another answer that uses some simple lambda functionality.
import numpy as np
import pandas as pd
""" Create data and data frame """
info_dict = {
    'ID': [1, 2, 3, 4, 5],
    'Complete_Name': [
        'JERRY, Ben',
        'VON HELSINKI, Olga',
        'JENSEN, James Goodboy Dean',
        'THE COMPANY',
        'CRUZ, Juan S. de la',
    ],
    'Type': ['I', 'I', 'I', 'C', 'I'],
}
data = pd.DataFrame(info_dict, columns = info_dict.keys())
""" List of columns to add """
name_cols = [
    'First Name',
    'Middle Name',
    'Last Name',
]
"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())
# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]
# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES
# Optional: We can replace all blank values with numpy.NaN values using regular expressions.
data = data.replace(r'^$', np.nan, regex=True)
Then you should end up with something like this:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Or, replace NaN values with blank strings:
data = data.replace(np.nan, '', regex=False)
Then you have:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Hi, assuming I have two lists:
names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
and a dataframe with some sentences
ex:
DataframeOrig
Index | Sent
0 | Mandy went to France on the Eiffel Tower
1 | Daniele was dancing on top of the box
2 | I am eating on top of the table
3 | Maria went to the valley of the kings
I would like to use a distance metric such as difflib to scan the sentences and compare them against the phrases in the lists, allowing for small differences. Hopefully the result of this would be:
Index | Sent | Result
0 | Mandy went to France on the Eiffel Tower | Mandy
1 | Daniele was dancing on top of the box | Daniel
2 | I am eating on top of the table | on top of the table
3 | Maria went to the valley of the kings | Mario, valley of the kings
How would you go about it without using loads of loops to get phrase matches?
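A minimal sketch of one possible approach, using difflib.SequenceMatcher and a sliding word window the same length as each candidate phrase (the 0.8 cutoff and the combined candidate list are arbitrary choices, so for instance 'France' would also be reported for the first sentence):
import difflib
import pandas as pd

names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
candidates = names + places

df = pd.DataFrame({'Sent': [
    'Mandy went to France on the Eiffel Tower',
    'Daniele was dancing on top of the box',
    'I am eating on top of the table',
    'Maria went to the valley of the kings',
]})

def find_matches(sentence, threshold=0.8):
    words = sentence.split()
    found = []
    for cand in candidates:
        n = len(cand.split())
        # compare the candidate against every window of the same word length
        for i in range(len(words) - n + 1):
            window = ' '.join(words[i:i + n])
            if difflib.SequenceMatcher(None, window.lower(), cand.lower()).ratio() >= threshold:
                found.append(cand)
                break
    return ', '.join(found)

df['Result'] = df['Sent'].apply(find_matches)
This still loops over candidates and windows internally; for large data a vectorised approach or a dedicated fuzzy-matching library may be a better fit.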
Suppose now I have a dataframe with 2 columns: State and City.
Then I have a separate dict with the two-letter abbreviation for each state. Now I want to add a third column that maps each state name to its two-letter abbreviation. What should I do in Python/Pandas? For instance, a sample setup is as follows:
import pandas as pd
a = pd.Series({'State': 'Ohio', 'City':'Cleveland'})
b = pd.Series({'State':'Illinois', 'City':'Chicago'})
c = pd.Series({'State':'Illinois', 'City':'Naperville'})
d = pd.Series({'State': 'Ohio', 'City':'Columbus'})
e = pd.Series({'State': 'Texas', 'City': 'Houston'})
f = pd.Series({'State': 'California', 'City': 'Los Angeles'})
g = pd.Series({'State': 'California', 'City': 'San Diego'})
state_city = pd.DataFrame([a,b,c,d,e,f,g])
state_2 = {'OH': 'Ohio','IL': 'Illinois','CA': 'California','TX': 'Texas'}
Now I have to map the State column of the df state_city using the dictionary state_2. The mapped df state_city should contain three columns: state, city, and state_2letter.
The original dataset has many more rows covering nearly all major US cities, so doing this manually would be inefficient. Is there an easy way to do it?
For one, it's probably easier to store the key-value pairs as state name: abbreviation in your dictionary, like this:
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX'}
You can invert your existing dictionary easily:
state_2 = {name: abbrev for abbrev, name in state_2.items()}
Using pandas.DataFrame.map:
>>> state_city['abbrev'] = state_city['State'].map(state_2)
>>> state_city
City State abbrev
0 Cleveland Ohio OH
1 Chicago Illinois IL
2 Naperville Illinois IL
3 Columbus Ohio OH
4 Houston Texas TX
5 Los Angeles California CA
6 San Diego California CA
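If some states could be missing from the dictionary, a small extra step (just a sketch, mirroring the fillna pattern used earlier) keeps the original value instead of NaN:
state_city['abbrev'] = state_city['State'].map(state_2).fillna(state_city['State'])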
I agree with @blacksite that the state_2 dictionary should map its values like this:
state_2 = {'Ohio': 'OH','Illinois': 'IL','California': 'CA','Texas': 'TX'}
Then use Series.replace:
state_city['state_2letter'] = state_city.State.replace(state_2)
state_city
|   | State      | City        | state_2letter |
|---|------------|-------------|---------------|
| 0 | Ohio       | Cleveland   | OH            |
| 1 | Illinois   | Chicago     | IL            |
| 2 | Illinois   | Naperville  | IL            |
| 3 | Ohio       | Columbus    | OH            |
| 4 | Texas      | Houston     | TX            |
| 5 | California | Los Angeles | CA            |
| 6 | California | San Diego   | CA            |