Extract country from cities in pandas - python

I have an array of city names and want to group them by country. Is there a library I can install that will do that?
e.g array(['Los Angeles', 'Detroit', 'Seattle', 'Atlanta', 'Santiago',
'Pittsburgh', 'Seoul', 'Santa Clara', 'Austin', 'Chicago'])
I want to know the country they belong to and add a new country column in my dataframe.

I agree with what has been said in the comments - there is no clear way to join a city to a country when city names are not unique.
For example if we run...
import pandas as pd
df = pd.read_csv('https://datahub.io/core/world-cities/r/world-cities.csv')
df.rename(columns ={"name":"city"}, inplace=True)
print(df)
Outputs:
# create a list of city names for testing...
myCityList = ['Los Angeles', 'Detroit', 'Seattle', 'Atlanta', 'Santiago', 'Pittsburgh', 'Seoul', 'Santa Clara', 'Austin', 'Chicago']
# pull out all the rows matching a city in the test list..
df.query(f'city=={myCityList}')
Outputs:
However, something is wrong: there are more rows listed than items in the test city list (and Santiago is clearly listed multiple times)...
print(len(myCityList))
print(df.query(f'city=={myCityList}').shape[0])
Outputs:
10
15
The above may be useful, but use it with caution: it is not guaranteed to return the correct country for a given city.
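One way to handle the ambiguity is to count how many countries each city name maps to and only join the names that map to exactly one. The sketch below assumes a tiny hand-made lookup table (the `lookup` frame and its rows are hypothetical stand-ins for the world-cities data) and flags ambiguous names instead of guessing.

```python
import pandas as pd

# Hypothetical, hand-made lookup table standing in for the world-cities data.
lookup = pd.DataFrame({
    "city": ["Seattle", "Santiago", "Santiago", "Seoul"],
    "country": ["United States", "Chile", "Spain", "South Korea"],
})

cities = pd.DataFrame({"city": ["Seattle", "Santiago", "Seoul"]})

# Number of distinct countries each city name appears in.
n_countries = lookup.groupby("city")["country"].nunique()

# Join only the names that map to exactly one country.
unambiguous = lookup[lookup["city"].map(n_countries) == 1]
result = cities.merge(unambiguous, on="city", how="left")

# Mark names that need manual resolution (their country stays missing).
result["ambiguous"] = result["city"].map(n_countries).gt(1)
print(result)
```

Rows flagged as ambiguous would still need extra information (state, coordinates, population) to be resolved.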

Related

Find duplicate values in pandas column that may contain an additional word

Suppose a dataframe with a column of company names:
namelist = ['canadian pacific railway',
'nestlé canada',
'nestlé',
'chicken farmers of canada',
'cenovus energy',
'merck frosst canada',
'food banks canada',
'canadian fertilizer institute',
'balanceco',
'bell textron canada',
'safran landing systems canada',
'airbus canada',
'airbus',
'investment counsel association of canada',
'teck resources',
'fertilizer canada',
'engineers canada',
'google',
'google canada']
s = pd.Series(namelist)
I would like to isolate all rows of company names that are duplicated, whether or not they contain the word "canada". In this example, the filtered column should contain:
'nestlé canada'
'nestlé'
'airbus canada'
'airbus'
'google'
'google canada'
The goal is to standardize the names to a single form.
I can't wholesale remove the word because there are other company names with that word I want to preserve, like "fertilizer canada" and "engineers canada".
Is there a regex pattern or another clever way to get this?
You can replace the trailing "canada" and then use duplicated to construct a Boolean mask to get the values from the original Series. (With keep=False, all duplicates are marked as True.)
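# strip only a trailing " canada" (regex=True is needed for the $ anchor in pandas 2.0+)
s[s.str.replace(r' canada$', '', regex=True).duplicated(keep=False)]
# or, more loosely, remove "canada" and surrounding whitespace wherever it occurs
s[s.str.replace(r'\s*canada\s*', '', regex=True).duplicated(keep=False)]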
Output:
1 nestlé canada
2 nestlé
10 airbus canada
11 airbus
16 google
17 google canada
dtype: object
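Since the stated goal is to standardize the names, the same mask can be used to rewrite only the duplicated entries. This is a minimal sketch on a shortened, illustrative list; choosing the stripped name as the standard form is an assumption.

```python
import pandas as pd

s = pd.Series(["nestlé canada", "nestlé", "fertilizer canada", "google", "google canada"])

# Name with a trailing " canada" removed.
stripped = s.str.replace(r"\s+canada$", "", regex=True)

# True for every name whose stripped form occurs more than once.
mask = stripped.duplicated(keep=False)

# Keep the original where the name is unique, use the stripped form otherwise.
standardized = s.where(~mask, stripped)
print(standardized.tolist())
```

Names such as "fertilizer canada" are left untouched because their stripped form is not duplicated.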

How to identify dummy data in pandas and delete?

Is there a way to identify the dummy data in a dataframe and delete them? In my data below, there are random characters in each column that I need to delete.
import pandas as pd
import numpy as np
data = {'Name' : ['Tom', 'AABBCC', 'Joseph', 'Krish', 'XXXX', 'John', 'U'],
'Address1': ['High Street', 'uwdfjfuf', '00000', 'Green Lane', 'Kingsway', 'Church Street', 'iwefwfn'],
'Address2': ['Park Avenue', 'The Crescent', 'ABCXYZ', 'Highfield Road', 'Stanley Road', 'New Street', '1ca2s597']}
contact_details = pd.DataFrame(data)
#Code to identify and delete dummy data
print(contact_details)
Output of the above code:
     Name       Address1        Address2
0     Tom    High Street     Park Avenue
1  AABBCC       uwdfjfuf    The Crescent
2  Joseph          00000          ABCXYZ
3   Krish     Green Lane  Highfield Road
4    XXXX       Kingsway    Stanley Road
5    John  Church Street      New Street
6       U        iwefwfn        1ca2s597
Have you investigated your data? Is the "good data" always a mix of lowercase and uppercase characters? If so, you could write a function to find the dummy data, for example:
def is_dummy(text):
    # text that is entirely lowercase or entirely uppercase is treated as dummy
    return text.lower() == text or text.upper() == text
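Applied to a whole column, that heuristic could look like the sketch below. The helper name `looks_like_dummy` is made up for illustration, and the rule itself is only an assumption about this particular dataset: it would also flag legitimate values that happen to be all lowercase, all uppercase, or purely numeric.

```python
import pandas as pd

def looks_like_dummy(text):
    # Heuristic only: treat all-lowercase or all-uppercase strings as dummy.
    return text.lower() == text or text.upper() == text

names = pd.Series(['Tom', 'AABBCC', 'Joseph', 'Krish', 'XXXX', 'John', 'U'])

# Boolean mask of suspicious values, then keep the rest.
mask = names.map(looks_like_dummy)
clean = names[~mask]
print(clean.tolist())
```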
Without a good definition of good and bad values for each column, there's really nothing you can do automatically. There are a couple of data cleaning tricks you can use to make those values easier to find in a large dataset.
Starting with your original dataset:
import pandas as pd
data = {'Name': ['Tom', 'AABBCC', 'Joseph', 'Krish', 'XXXX', 'John', 'U'],
'Address1': ['High Street', 'uwdfjfuf', '00000', 'Green Lane', 'Kingsway', 'Church Street', 'iwefwfn'],
'Address2': ['Park Avenue', 'The Crescent', 'ABCXYZ', 'Highfield Road', 'Stanley Road', 'New Street', '1ca2s597']}
contact_details = pd.DataFrame(data)
The first thing you can do is get the unique values for a column to reduce the number of values you're looking through.
# get all the unique values in the 'Name' column
names = contact_details['Name'].unique()
Next you can sort them so that any near-duplicates will stand out more easily. Near duplicates happen often with errors in data entry.
# sort them alphabetically and then take a closer look
names.sort()
print(list(names))
So for example, if you had seen the values ' Tom', 'Tom', and 'Tom ', you would know you need to strip whitespace from the names.
contact_details['Name'] = contact_details['Name'].str.strip()
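To see the effect, compare the sorted unique values before and after stripping. The padded names below are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical column with stray leading/trailing whitespace.
names = pd.Series([' Tom', 'Tom', 'Tom ', 'John'])

# Before: the three variants of "Tom" count as different values.
before = sorted(names.unique())

# After: .str.strip() removes the whitespace element-wise.
after = sorted(names.str.strip().unique())

print(before)
print(after)
```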
Another benefit of sorting the unique values in a column is that strings starting with a digit all appear at the beginning of the list, and lowercase strings are sorted to the end. This makes a couple of your 'Address1' values stand out.
# get all the unique values in the 'Address1' column
address1 = contact_details['Address1'].unique()
address1.sort()
print(list(address1))
This gives me the list of unique values:
['00000', 'Church Street', 'Green Lane', 'High Street', 'Kingsway', 'iwefwfn', 'uwdfjfuf']
It's not clear yet if the first value is valid, but those last two look suspect. If I want to remove those, I can filter them out by selecting all rows where Address1 is not in a list of bad values.
contact_details_filtered = contact_details[~contact_details['Address1'].isin(['iwefwfn', 'uwdfjfuf'])]
print(contact_details_filtered)
This gives me the output:
     Name       Address1        Address2
0     Tom    High Street     Park Avenue
2  Joseph          00000          ABCXYZ
3   Krish     Green Lane  Highfield Road
4    XXXX       Kingsway    Stanley Road
5    John  Church Street      New Street
Row 2 is definitely suspect, and row 4 is questionable, but I think you get the idea of how to find and remove values that look like placeholders or just bad data.

Create New Column With List Comprehension Python

I am trying to create a new column that contains city names. I also have a list containing the city names needed and CSV files that have city names under different column names.
What I am trying to do is to check whether the city names in the list exist in a specific range of columns of the CSV files and fill that particular city name in the new column City.
My code is:
import pandas as pd
import numpy as np
City_Name_List = ['Amsterdam', 'Antwerp', 'Brussels', 'Ghent', 'Asheville', 'Austin', 'Boston', 'Broward County',
'Cambridge', 'Chicago', 'Clark County Nv', 'Columbus', 'Denver', 'Hawaii', 'Jersey City', 'Los Angeles',
'Nashville', 'New Orleans', 'New York City', 'Oakland', 'Pacific Grove', 'Portland', 'Rhode Island', 'Salem Or', 'San Diego']
data = {'host_identity_verified':['t','t','t','t','t','t','t','t','t','t'],
'neighbourhood':['Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands', 'NaN',
'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands',
'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands', 'NaN',
'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands'],
'neighbourhood_cleansed':['Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West',
'Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West'],
'neighbourhood_group_cleansed': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN'],
'latitude':[ 52.36575, 52.36509, 52.37297, 52.38761, 52.36719, 52.36575, 52.36509, 52.37297, 52.38761, 52.36719]}
df = pd.DataFrame(data)
df['City'] = [x for x in City_Name_List if x in df.loc[:,'host_identity_verified':'latitude'].values][0]
When I run the code, I get this message:
Traceback (most recent call last):
File "C:/Users/YAZAN/PycharmProjects/Yazan_Work/try.py", line 63, in <module>
df['City'] = [x for x in City_Name_List if x in df.loc[:,'host_identity_verified':'latitude'].values][0]
IndexError: list index out of range
This is due to the fact that the city Amsterdam in the data is followed by other words.
I want my output to be as follow:
0 Amsterdam
1 Amsterdam
2 Amsterdam
3 Amsterdam
4 Amsterdam
5 Amsterdam
6 Amsterdam
7 Amsterdam
8 Amsterdam
9 Amsterdam
Name: City, dtype: object
I tried relentlessly to solve this issue. I tried to use endswith, startswith, regex, but to no avail. I might be using both methods wrongly. I hope someone can help me.
Basic Solution Using Pandas.DataFrame.Apply
df['City'] = df.apply(
    # first listed city found in 'neighbourhood'; '' when none matches (e.g. the 'NaN' rows)
    lambda row: next((x for x in City_Name_List if x in row.loc['neighbourhood']), ''),
    axis=1
)
Following execution of the above, df['City'] will contain a city (defined by its inclusion in City_Name_List) if one is found within the 'neighbourhood' column of each row.
Modified Solution
You could be a bit more explicit by specifying that City should be populated from the substring before the first comma in the 'neighbourhood' field of each row. This may be a good idea if the 'neighbourhood' column is reliably uniform in structure, as it helps guard against unwanted behaviour from similarly named cities, cities that are substrings of other cities in City_Name_List, etc.
df['City'] = df.apply(
    # only look at the text before the first comma
    lambda row: next((x for x in City_Name_List if x in row.loc['neighbourhood'].split(',')[0]), ''),
    axis=1
)
Note: The above solutions are simply examples of how you may resolve the problems that you are having. They do not take into account proper handling of exceptions, edge cases etc. As ever you should take care to account for such considerations in your code.
df['City'] = df['neighbourhood'].apply(lambda x: [i for i in x.split(',') if i in City_Name_List])
df['City'] = df['City'].apply(lambda x: "" if len(x) == 0 else x[0])
The issue is that when you say x in df.loc[] you are not checking if the city name is in each particular string, but rather if the city name is in the entire Series, which it is not. What you need is something like this:
df['city'] = [x[0] if x[0] in City_Name_List else '' for x in df['neighbourhood'].str.split(',')]
This splits each row in df['neighbourhood'] on the commas and takes the first value, then checks whether that value is in your list of city names. If it is, the value is placed in the 'city' Series; otherwise an empty string is used.
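The same idea can also be written without a Python-level loop by using the pandas string accessor. This is only a sketch on a reduced frame; the 'Antwerp, Flanders, Belgium' row is invented for illustration, and 'NaN' is kept as a literal string as in the question.

```python
import pandas as pd

City_Name_List = ['Amsterdam', 'Antwerp', 'Brussels']

df = pd.DataFrame({'neighbourhood': [
    'Amsterdam, North Holland, Netherlands',
    'NaN',                           # literal string, as in the question's data
    'Antwerp, Flanders, Belgium',    # hypothetical extra row
]})

# Text before the first comma, with surrounding whitespace removed.
first_part = df['neighbourhood'].str.split(',').str[0].str.strip()

# Keep it only when it is a known city; otherwise use an empty string.
df['City'] = first_part.where(first_part.isin(City_Name_List), '')
print(df['City'].tolist())
```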

How to assign rows a number based on a level in pandas dataframe?

I have the following code:
from pandas import DataFrame
import pandas as pd
data = {'City': ['NY', 'NY', 'Arizona'], 'Doctor': ['Dr. Prof. Vera', 'Dr. Prof. Vera', 'Dr. Martin'], 'Type': ['Checked', 'Checked', 'Ordered'], 'Covid-Patient': ['yes', 'no', 'no']}
df = DataFrame(data).set_index(['City', 'Doctor', 'Type'])
df['Dr-Nr.'] = pd.Series(df.groupby(['Doctor']).cumcount()+1)
Which results in:
But what I want is an individual number of the Doctor in a new column Dr-Nr..
Apparently, the grouping by level Doctor does not seem to have an effect. Any help is appreciated!
You can rank() index level Doctor:
df['Dr-Nr.'] = df.assign(d_=df.index.get_level_values('Doctor'))['d_'].rank(method='dense').astype(int)
The order of indexing will be alphabetical here, so:
                                Covid-Patient  Dr-Nr.
City    Doctor         Type
NY      Dr. Prof. Vera Checked            yes       2
                       Checked             no       2
Arizona Dr. Martin     Ordered             no       1
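If the numbers should follow the order in which doctors first appear rather than alphabetical order, `pd.factorize` is a possible alternative. Note that this is a different technique from the `rank()` approach above, shown only as a sketch on the question's data.

```python
import pandas as pd

data = {'City': ['NY', 'NY', 'Arizona'],
        'Doctor': ['Dr. Prof. Vera', 'Dr. Prof. Vera', 'Dr. Martin'],
        'Type': ['Checked', 'Checked', 'Ordered'],
        'Covid-Patient': ['yes', 'no', 'no']}
df = pd.DataFrame(data).set_index(['City', 'Doctor', 'Type'])

# factorize assigns 0, 1, 2, ... in order of first appearance.
codes, uniques = pd.factorize(df.index.get_level_values('Doctor'))

# Shift by one so the numbering starts at 1.
df['Dr-Nr.'] = codes + 1
print(df)
```

Here 'Dr. Prof. Vera' gets 1 and 'Dr. Martin' gets 2, the reverse of the dense-rank result.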

How do I get all the values of different strings that end with a certain word

My dataframe has a column called Borough which contains values like these:
"east toronto", "west toronto", "central toronto" and "west toronto", along with other region names.
Now I want a regular expression which gets me the data of every entry that ends with "toronto". How do I do that?
I tried this:
tronto_data = df_toronto[df_toronto['Borough'] = .*Toronto$].reset_index(drop=True)
tronto_data.head(7)
If the data is well formatted, you can split each string on the space and compare the final word to "toronto". For example:
df = pd.DataFrame({'column': ['west toronto', 'central toronto', 'some place']})
mask_df = df['column'].str.split(' ', expand=True)
which returns:
         0        1
0     west  toronto
1  central  toronto
2     some    place
You can then use the final column to find the rows that end with "toronto".
toronto_df = df[mask_df[1]=='toronto']
Edit:
I did not know there was a string method .endswith, which is the better way to do this. However, this solution does provide two columns, which may be useful.
As @Code_10 notes in a comment, you can use str.endswith. Try the following:
df = pd.DataFrame({'city': ['east toronto', 'west toronto', 'other', 'central toronto']})
df_toronto = df[df['city'].str.endswith('toronto')]
#df_toronto.head()
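Since the question asked for a regular expression, a regex-based variant is sketched below using `Series.str.contains` with an end-of-string anchor. The sample boroughs are illustrative, and `case=False` is an assumption to cover both "Toronto" and "toronto".

```python
import pandas as pd

# Illustrative data; the real Borough column may differ.
df = pd.DataFrame({'Borough': ['East Toronto', 'west toronto',
                               'Scarborough', 'Toronto Islands']})

# "toronto$" matches only at the end of the string; na=False treats
# missing values as non-matches.
mask = df['Borough'].str.contains(r'toronto$', case=False, regex=True, na=False)

toronto_data = df[mask].reset_index(drop=True)
print(toronto_data)
```

'Toronto Islands' is excluded because the match is anchored to the end of the string.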
