Difflib error when applying onto two columns in pandas dataframe - python

I have a DataFrame that looks like this:
Cities Cities_Dict
"San Francisco" ["San Francisco", "New York", "Boston"]
"Los Angeles" ["Los Angeles"]
"berlin" ["Munich", "Berlin"]
"Dubai" ["Dubai"]
I want to create a new column that compares the city from the first column to the list of cities from the second column and finds the closest match.
I use difflib for that:
df["new_col"]=difflib.get_close_matches(df["Cities"],df["Cities_Dict"])
However, I get an error:
TypeError: object of type 'float' has no len()

Use DataFrame.apply with lambda function and axis=1 for processing by rows:
import difflib, ast
#if necessary convert values to lists
#df['Cities_Dict'] = df['Cities_Dict'].apply(ast.literal_eval)
f = lambda x: difflib.get_close_matches(x["Cities"],x["Cities_Dict"])
df["new_col"] = df.apply(f, axis=1)
print (df)
Cities Cities_Dict new_col
0 San Francisco [San Francisco, New York, Boston] [San Francisco]
1 Los Angeles [Los Angeles] [Los Angeles]
2 berlin [Munich, Berlin] [Berlin]
3 Dubai [Dubai] [Dubai]
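For reference, difflib.get_close_matches compares one string against a list of strings, which is why it has to be called once per row rather than on whole columns. A minimal sketch of a single call, using sample values from the question (the variable names are only illustrative):

```python
import difflib

# One word against one list of candidates; the default cutoff is 0.6.
matches = difflib.get_close_matches("berlin", ["Munich", "Berlin"])

# A stricter cutoff rejects the case-mismatched candidate.
strict = difflib.get_close_matches("berlin", ["Munich", "Berlin"], cutoff=0.9)

print(matches, strict)
```

The result is always a list, possibly empty, so picking a single value needs a fallback for the empty case.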
EDIT:
To get the first match as a string, or an empty string when no match is found, use:
f = lambda x: next(iter(difflib.get_close_matches(x["Cities"],x["Cities_Dict"])), '')
df["new_col"] = df.apply(f, axis=1)
print (df)
Cities Cities_Dict new_col
0 San Francisco [San Francisco, New York, Boston] San Francisco
1 Los Angeles [Los Angeles] Los Angeles
2 berlin [Munich, Berlin] Berlin
3 Dubai [Dubai] Dubai
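EDIT1: If the data may contain problematic values (for example NaN or non-string entries), use try-except:
def f(x):
    try:
        return difflib.get_close_matches(x["Cities"],x["Cities_Dict"])[0]
    except Exception:
        # no match found, or the inputs are not a string / list of strings
        return ''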
df["new_col"] = df.apply(f, axis=1)
print (df)
Cities Cities_Dict new_col
0 NaN [San Francisco, New York, Boston]
1 Los Angeles [10]
2 berlin [Munich, Berlin] Berlin
3 Dubai [Dubai] Dubai
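If catching every exception feels too broad, the inputs can be checked explicitly instead. This is only a sketch under the assumption that bad rows are either a non-string city or a list containing non-strings; the function name closest_city is hypothetical and the sample data mirrors the output above:

```python
import difflib
import pandas as pd

df = pd.DataFrame({
    "Cities": [float("nan"), "Los Angeles", "berlin", "Dubai"],
    "Cities_Dict": [["San Francisco", "New York", "Boston"], [10],
                    ["Munich", "Berlin"], ["Dubai"]],
})

def closest_city(row):
    word = row["Cities"]
    candidates = row["Cities_Dict"]
    # Skip rows where the city is missing or the candidates are not a list.
    if not isinstance(word, str) or not isinstance(candidates, list):
        return ""
    # Ignore non-string candidates such as the integer 10.
    candidates = [c for c in candidates if isinstance(c, str)]
    matches = difflib.get_close_matches(word, candidates, n=1)
    return matches[0] if matches else ""

df["new_col"] = df.apply(closest_city, axis=1)
print(df)
```

Unlike the try-except version, unexpected errors are not silently swallowed here.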

Related

Keyword categorization from strings in a new column in pandas

This is not the best approach, but this is what I have done so far:
I have this example df:
df = pd.DataFrame({
'City': ['I lived Los Angeles', 'I visited London and Toronto','the best one is Toronto', 'business hub is in New York',' Mexico city is stunning']
})
df
gives:
City
0 I lived Los Angeles
1 I visited London and Toronto
2 the best one is Toronto
3 business hub is in New York
4 Mexico city is stunning
I am trying to match (case-insensitively) city names from a nested dict and create one new column per country, named after the country and holding int values, for statistical purposes.
So, here is my nested dict as a reference for countries and cities:
country = { 'US': ['New York','Los Angeles','San Diego'],
'CA': ['Montreal','Toronto','Manitoba'],
'UK': ['London','Liverpool','Manchester']
}
and I created a function that should look for the cities in the df, match them against the dict, and then create a column with the country name:
def get_country(x):
    count = 0
    for k,v in country.items():
        for y in v:
            if y.lower() in x:
                df[k] = count + 1
            else:
                return None
then applied it to df:
df.City.apply(lambda x: get_country(x.lower()))
I got the following output:
City US
0 I lived Los Angeles 1
1 I visited London and Toronto 1
2 the best one is Toronto 1
3 business hub is in New York 1
4 Mexico city is stunning 1
Expected output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
Here is a solution based on your function. I renamed the variables to make the code more readable and easier to follow.
df = pd.DataFrame({
'City': ['I lived Los Angeles',
'I visited London and Toronto',
'the best one is Toronto',
'business hub is in New York',
' Mexico city is stunning']
})
country_cities = {
'US': ['New York','Los Angeles','San Diego'],
'CA': ['Montreal','Toronto','Manitoba'],
'UK': ['London','Liverpool','Manchester']
}
def get_country(text):
    text = text.lower()
    # start every country at 0 so that all columns are always present
    country_counts = dict.fromkeys(country_cities, 0)
    for country, cities in country_cities.items():
        for city in cities:
            if city.lower() in text:
                country_counts[country] += 1
    return pd.Series(country_counts)
df = df.join(df.City.apply(get_country))
Output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
Solution based on Series.str.count
A simpler solution is to use Series.str.count to count the occurrences of the regex pattern city1|city2|etc for each country (the pattern matches city1 or city2 or etc). Using the same setup as above:
country_patterns = {country: '|'.join(cities) for country, cities in country_cities.items()}
for country, pat in country_patterns.items():
    df[country] = df['City'].str.count(pat)
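The question asks for case-insensitive matching, and city names can in principle contain regex metacharacters. A hedged variant of the same idea passes re.IGNORECASE through the flags argument of Series.str.count and escapes each name with re.escape; the sample sentences here are made up for illustration:

```python
import re
import pandas as pd

df = pd.DataFrame({'City': ['i lived in LOS ANGELES',
                            'I visited london and toronto',
                            'Mexico city is stunning']})
country_cities = {
    'US': ['New York', 'Los Angeles', 'San Diego'],
    'CA': ['Montreal', 'Toronto', 'Manitoba'],
    'UK': ['London', 'Liverpool', 'Manchester'],
}

for country, cities in country_cities.items():
    # Escape each name so it is matched literally, then join with "or".
    pattern = '|'.join(re.escape(city) for city in cities)
    df[country] = df['City'].str.count(pattern, flags=re.IGNORECASE)

print(df)
```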
Why doesn't your solution work?
if y.lower() in x:
    df[k] = count + 1
else:
    return None
The reason your function doesn't produce the right output is that you are returning None as soon as a city is not found in the text: the remaining countries and cities are not checked, because the return statement immediately exits the function.
What is happening is that only US cities are checked, and the line df[k] = count + 1 (in this case k = 'US' and the value is 1) creates an entire column named k filled with the value 1. It does not set a single value for that row; it creates or overwrites the full column. When using apply you want to compute a single row or value (the input of the function), so don't modify the main DataFrame directly inside the function.
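The difference between assigning a scalar to a column and returning a value from the applied function can be seen in a small sketch (the toy data and the column name US_per_row are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'City': ['los angeles', 'toronto', 'berlin']})

# Assigning a scalar to df[column] fills every row of that column.
df['US'] = 1

# Returning a value from the applied function gives one result per row.
df['US_per_row'] = df['City'].apply(lambda text: int('los angeles' in text))

print(df)
```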
You can achieve this result using a lambda function to check if any city for each country is contained in the string, after first lower-casing the city names in country:
cl = { k : list(map(str.lower, v)) for k, v in country.items() }
for ctry, cities in cl.items():
    df[ctry] = df['City'].apply(lambda s:any(c in s.lower() for c in cities)).astype(int)
Output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0

Split a row into more rows based on a string (regex)

I have this df and I want to split it:
cities3 = {'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks']}
cities4 = pd.DataFrame(cities3)
to get a new df in which each team has its own row.
What code can I use?
You can split your column based on an upper-case letter preceded by a lower-case one using this regex:
(?<=[a-z])(?=[A-Z])
and then you can use the technique described in this answer to replace the column with its exploded version:
cities4 = cities4.assign(NHL=cities4['NHL'].str.split(r'(?<=[a-z])(?=[A-Z])')).explode('NHL')
Output:
Metropolitan NHL
0 New York Rangers
0 New York Islanders
0 New York Devils
1 Los Angeles Kings
1 Los Angeles Ducks
2 San Francisco Sharks
If you want to reset the index (to 0..5) you can do this (either after the above command or as a part of it)
cities4.reset_index().reindex(cities4.columns, axis=1)
Output:
Metropolitan NHL
0 New York Rangers
1 New York Islanders
2 New York Devils
3 Los Angeles Kings
4 Los Angeles Ducks
5 San Francisco Sharks
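The lookaround pattern can be checked on its own with re.split before applying it to the column. The sketch below also uses the ignore_index option of DataFrame.explode, which assumes pandas 1.1 or later, as an alternative way to obtain a 0..5 index; the names parts and exploded are illustrative:

```python
import re
import pandas as pd

# Split at every position between a lower-case and an upper-case letter.
pattern = r'(?<=[a-z])(?=[A-Z])'
parts = re.split(pattern, 'RangersIslandersDevils')

cities4 = pd.DataFrame({
    'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
    'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks'],
})

# ignore_index=True renumbers the exploded rows from 0.
exploded = (cities4.assign(NHL=cities4['NHL'].str.split(pattern))
            .explode('NHL', ignore_index=True))
print(exploded)
```

Splitting on a zero-width match like this relies on Python 3.7 or later.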

Segregate a column values in 2 columns

I have got data like this:
Col
Texas[x]
Dallas
Austin
California[x]
Los Angeles
San Francisco
What I want is this:
col1           Col2
Texas[x]       Dallas
               Austin
California[x]  Los Angeles
               San Francisco
Please help!
Use str.extract to create the columns and then clean up:
import numpy as np
df.Col.str.extract(r'(.*\[x\])?(.*)').ffill()\
    .replace('', np.nan).dropna()\
    .rename(columns = {0:'Col1', 1: 'Col2'})\
    .set_index('Col1')
                        Col2
Col1
Texas[x]              Dallas
Texas[x]              Austin
California[x]    Los Angeles
California[x]  San Francisco
Update: To address the follow-up question.
df.Col.str.extract(r'(.*\[x\])?(.*)').ffill()\
    .replace('', np.nan).dropna()\
    .rename(columns = {0:'Col1', 1: 'Col2'})
You get
            Col1           Col2
1       Texas[x]         Dallas
2       Texas[x]         Austin
4  California[x]    Los Angeles
5  California[x]  San Francisco
It seems that [x] marks a state in the list. You can iterate over the DataFrame using iterrows, something like this:
state = None # initialize as None, in case something goes wrong
city = None
rowlist = []
for idx, row in df.iterrows():
    # get the state
    if '[x]' in row['Col']:
        state = row['Col']
        continue
    # now, get the cities
    city = row['Col']
    rowlist.append([state, city])
df2 = pd.DataFrame(rowlist)
This assumes that your initial DataFrame is called df and the column name is Col, and it only works if each state is followed by its cities, which seems to be the case in your data sample.
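A loop-free alternative, offered only as a sketch and not taken from either answer, masks the state rows with Series.where and forward-fills them. It assumes, like the loop above, that state rows end with [x] and precede their cities; the names is_state and out are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Col': ['Texas[x]', 'Dallas', 'Austin',
                           'California[x]', 'Los Angeles', 'San Francisco']})

# True for rows that hold a state marker.
is_state = df['Col'].str.endswith('[x]')

out = pd.DataFrame({
    # Keep only state values, then carry each one down to its cities.
    'Col1': df['Col'].where(is_state).ffill(),
    'Col2': df['Col'],
})
# Drop the state rows themselves and renumber.
out = out[~is_state].reset_index(drop=True)
print(out)
```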

String mode aggregation with group by function

I have dataframe which looks like below
Country City
UK London
USA Washington
UK London
UK Manchester
USA Washington
USA Chicago
I want to group by country and aggregate to the most frequent city in each country.
My desired output looks like this:
Country City
UK London
USA Washington
Because London and Washington each appear 2 times, whereas Manchester and Chicago appear only once.
I tried
from scipy.stats import mode
df_summary = df.groupby('Country')['City'].\
apply(lambda x: mode(x)[0][0]).reset_index()
But it seems it won't work on strings
I can't replicate your error, but you can use pd.Series.mode, which accepts strings and returns a series, using iat to extract the first value:
res = df.groupby('Country')['City'].apply(lambda x: x.mode().iat[0]).reset_index()
print(res)
Country City
0 UK London
1 USA Washington
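As a self-contained check of the pd.Series.mode approach, the sketch below rebuilds the sample data from the question. It also illustrates one design consequence: mode returns its values sorted, so with a tie iat[0] picks the alphabetically first city (the name tie is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['UK', 'USA', 'UK', 'UK', 'USA', 'USA'],
    'City': ['London', 'Washington', 'London', 'Manchester', 'Washington', 'Chicago'],
})

# Most frequent city per country; mode() always returns a Series.
res = df.groupby('Country')['City'].agg(lambda s: s.mode().iat[0]).reset_index()

# With a tie, both values are returned in sorted order.
tie = pd.Series(['b', 'a']).mode()

print(res)
print(tie.tolist())
```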
Try the following:
>>> df.City.mode()
0 London
1 Washington
dtype: object
Or use scipy.stats.mode with a lambda (this relies on SciPy accepting string input, which may not hold in newer releases):
import pandas as pd
from scipy import stats
df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]})
City
Country
UK London
USA Washington
# df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]}).reset_index()
It also returns the count, if you don't want only the first value:
>>> df.groupby('Country').agg({'City': lambda x:stats.mode(x)})
City
Country
UK ([London], [2])
USA ([Washington], [2])

Pandas Dataframe, How can I split a column into two by "," when some rows may have more than 1 ","

I have a dataframe. One of the columns is a combination of CITY and STATE. I want to split this column to two columns, CITY and STATE using:
df['CITY'],df['STATE'] = df['WORKSITE'].str.split(",")
And I got this error:
ValueError Traceback (most recent call last) in ()
----> 1 df['CITY'],df['STATE'] = df['WORKSITE'].str.split(",")
ValueError: too many values to unpack (expected 2)
So, I'm wondering whether there is a way to ignore the exceptions or to detect which rows are causing the problem.
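Set n=2 in the split call and keep only the first two pieces:
import pandas as pd
x = ['New York, NY', 'Los Angeles, CA', 'Kansas City, KS, US']
df = pd.DataFrame(x, columns=['WORKSITE'])
# expand=True returns one column per piece; strip removes the space after each comma
parts = df['WORKSITE'].str.split(',', n=2, expand=True)
df['CITY'] = parts[0].str.strip()
df['STATE'] = parts[1].str.strip()
print(df)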
Output
WORKSITE CITY STATE
0 New York, NY New York NY
1 Los Angeles, CA Los Angeles CA
2 Kansas City, KS, US Kansas City KS
I got some help from looking at this answer to this question.
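To detect which rows do not fit the "CITY, STATE" shape, the commas can be counted per row first. This sketch also shows a different design choice from the answer above: splitting with n=1 keeps everything after the first comma in STATE instead of discarding the extra piece. The names extra and parts are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'WORKSITE': ['New York, NY', 'Los Angeles, CA',
                                'Kansas City, KS, US']})

# Rows with more than one comma are the ones that break a two-way unpack.
extra = df[df['WORKSITE'].str.count(',') > 1]

# n=1 splits only at the first comma, so the remainder stays together.
parts = df['WORKSITE'].str.split(',', n=1, expand=True)
df['CITY'] = parts[0].str.strip()
df['STATE'] = parts[1].str.strip()

print(extra)
print(df)
```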
