Create New Column With List Comprehension Python - python

I am trying to create a new column that contains city names. I also have a list containing the city names needed and CSV files that have city names under different column names.
What I am trying to do is to check whether the city names in the list exist in a specific range of columns of the CSV files and fill that particular city name in the new column City.
My code is:
import pandas as pd
import numpy as np
City_Name_List = ['Amsterdam', 'Antwerp', 'Brussels', 'Ghent', 'Asheville', 'Austin', 'Boston', 'Broward County',
'Cambridge', 'Chicago', 'Clark County Nv', 'Columbus', 'Denver', 'Hawaii', 'Jersey City', 'Los Angeles',
'Nashville', 'New Orleans', 'New York City', 'Oakland', 'Pacific Grove', 'Portland', 'Rhode Island', 'Salem Or', 'San Diego']
data = {'host_identity_verified':['t','t','t','t','t','t','t','t','t','t'],
'neighbourhood':['Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands', 'NaN',
'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands',
'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands', 'NaN',
'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands'],
'neighbourhood_cleansed':['Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West',
'Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West'],
'neighbourhood_group_cleansed': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN'],
'latitude':[ 52.36575, 52.36509, 52.37297, 52.38761, 52.36719, 52.36575, 52.36509, 52.37297, 52.38761, 52.36719]}
df = pd.DataFrame(data)
df['City'] = [x for x in City_Name_List if x in df.loc[:,'host_identity_verified':'latitude'].values][0]
When I run the code, I get this message:
Traceback (most recent call last):
File "C:/Users/YAZAN/PycharmProjects/Yazan_Work/try.py", line 63, in <module>
df['City'] = [x for x in City_Name_List if x in df.loc[:,'host_identity_verified':'latitude'].values][0]
IndexError: list index out of range
This is due to the fact that the city Amsterdam in the data is followed by other words.
I want my output to be as follows:
0 Amsterdam
1 Amsterdam
2 Amsterdam
3 Amsterdam
4 Amsterdam
5 Amsterdam
6 Amsterdam
7 Amsterdam
8 Amsterdam
9 Amsterdam
Name: City, dtype: object
I tried relentlessly to solve this issue. I tried endswith, startswith, and regex, but to no avail. I might be using these methods wrongly. I hope someone can help me.

Basic Solution Using Pandas.DataFrame.Apply
df['City'] = df.apply(
    # take the first listed city that occurs in 'neighbourhood'; None if nothing matches
    lambda row: next((x for x in City_Name_List if x in row.loc['neighbourhood']), None),
    axis=1
)
Following execution of the above, df['City'] will contain a city (defined by its inclusion in City_Name_List) if one is found within the 'neighbourhood' column of each row, and None otherwise.
Modified Solution
You could be a bit more explicit by specifying that City should be populated from the substring before the first comma in the 'neighbourhood' field of each row. This may be a good idea if the 'neighbourhood' column is reliably uniform in structure, as it helps to guard against unwanted behaviour arising from similarly named cities, cities that are substrings of other cities in City_Name_List, etc.
df['City'] = df.apply(
    # only look at the text before the first comma
    lambda row: next((x for x in City_Name_List if x in row.loc['neighbourhood'].split(',')[0]), None),
    axis=1
)
Note: The above solutions are simply examples of how you may resolve the problems that you are having. They do not take into account proper handling of exceptions, edge cases etc. As ever you should take care to account for such considerations in your code.
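As one illustration of the edge cases mentioned in the note, here is a minimal sketch of a helper that tolerates genuinely missing values (a float NaN rather than the string 'NaN') and rows with no match. The helper name first_city and the shortened sample data are assumptions made for this example, not part of the original answer.

```python
import numpy as np
import pandas as pd

City_Name_List = ['Amsterdam', 'Antwerp', 'Brussels']  # shortened list for the example

def first_city(text, cities=City_Name_List):
    """Return the first city found before the first comma of text, or None."""
    if not isinstance(text, str):  # real missing values (np.nan) are floats, not strings
        return None
    head = text.split(',')[0]
    return next((c for c in cities if c in head), None)

# Sample column with one real NaN and one 'NaN' string, as in the question's data
df = pd.DataFrame({'neighbourhood': ['Amsterdam, North Holland, Netherlands', np.nan, 'NaN']})
df['City'] = df['neighbourhood'].apply(first_city)
print(df['City'].tolist())
```

Applying the helper to a single column with Series.apply also avoids the row-wise df.apply(..., axis=1), which is usually slower.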

df['City'] = df['neighbourhood'].apply(lambda x: [i for i in x.split(',') if i in City_Name_List])
df['City'] = df['City'].apply(lambda x: "" if len(x) == 0 else x[0])

The issue is that when you write x in df.loc[...].values you are not checking whether the city name occurs inside each individual string, but whether it is exactly equal to one of the cell values, which it is not. What you need is something like this:
df['City'] = [x[0] if x[0] in City_Name_List else '' for x in df['neighbourhood'].str.split(',')]
This splits each row in df['neighbourhood'] on the commas, takes the first value, checks whether that value is in your list of city names, and if so places it in the 'City' Series (otherwise it places an empty string).
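If you would rather avoid a Python-level loop, a vectorized alternative is to build one regular expression from the city list and let Series.str.extract pull out the first match. This is only a sketch: the shortened city list and the sample rows are made up for illustration, and rows without a match come back as NaN.

```python
import re
import pandas as pd

City_Name_List = ['Amsterdam', 'Antwerp', 'New York City']  # shortened list for the example
df = pd.DataFrame({'neighbourhood': [
    'Amsterdam, North Holland, Netherlands',
    'NaN',                                     # the question stores missing values as the string 'NaN'
    'New York City, New York, United States',  # made-up row for illustration
]})

# Longest names first, so a longer name is preferred over a shorter one it contains
names = sorted(City_Name_List, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(n) for n in names) + ')'

# One capture group plus expand=False returns a Series
df['City'] = df['neighbourhood'].str.extract(pattern, expand=False)
print(df)
```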

Related

Find duplicate values in pandas column that may contain an additional word

Suppose a dataframe with a column of company names:
namelist = ['canadian pacific railway',
'nestlé canada',
'nestlé',
'chicken farmers of canada',
'cenovus energy',
'merck frosst canada',
'food banks canada',
'canadian fertilizer institute',
'balanceco',
'bell textron canada',
'safran landing systems canada',
'airbus canada',
'airbus',
'investment counsel association of canada',
'teck resources',
'fertilizer canada',
'engineers canada',
'google',
'google canada']
s = pd.Series(namelist)
I would like to isolate all rows of company names that are duplicated, whether or not they contain the word "canada". In this example, the filtered column should contain:
'nestlé canada'
'nestlé'
'airbus canada'
'airbus'
'google'
'google canada'
The goal is to standardize the names to a single form.
I can't wholesale remove the word because there are other company names with that word I want to preserve, like "fertilizer canada" and "engineers canada".
Is there a regex pattern or another clever way to get this?
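You can replace the trailing "canada" and then use duplicated to construct a Boolean mask to get the values from the original Series. (With keep=False, all duplicates are marked as True.) Pass regex=True explicitly, because recent pandas versions treat the pattern as a literal string by default.
s[s.str.replace(' canada$', '', regex=True).duplicated(keep=False)]
Alternatively, to also strip surrounding whitespace (note that this pattern removes "canada" anywhere in the name, not only at the end):
s[s.str.replace(r'\s*canada\s*', '', regex=True).duplicated(keep=False)]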
Output:
1 nestlé canada
2 nestlé
10 airbus canada
11 airbus
16 google
17 google canada
dtype: object
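Since the stated goal is to standardize the names, the same mask can be taken one step further. The sketch below is one possible way to do it, on a shortened version of the list: it rewrites only the duplicated names to their form without "canada" and leaves names such as "fertilizer canada" untouched. The variable names key, is_dup and standardized are chosen for this example.

```python
import pandas as pd

s = pd.Series(['nestlé canada', 'nestlé', 'fertilizer canada',
               'airbus', 'airbus canada', 'google'])

# Name with a trailing " canada" removed, used only as a comparison key
key = s.str.replace(r'\s+canada$', '', regex=True)

# True for every name whose key occurs more than once
is_dup = key.duplicated(keep=False)

# Keep the original name unless it belongs to a duplicated group
standardized = s.where(~is_dup, key)
print(standardized.tolist())
```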

Extract country from cities in pandas

I have an array of city names and I want to group them by country. Is there a library I can install that will do that?
e.g. array(['Los Angeles', 'Detroit', 'Seattle', 'Atlanta', 'Santiago',
'Pittsburgh', 'Seoul', 'Santa Clara', 'Austin', 'Chicago'])
I want to know the country they belong to and add a new country column in my dataframe.
I agree with what has been said in the comments - there is no clear way to join a city to a country when city names are not unique.
For example if we run...
import pandas as pd
df = pd.read_csv('https://datahub.io/core/world-cities/r/world-cities.csv')
df.rename(columns ={"name":"city"}, inplace=True)
print(df)
# create a list of city names for testing...
myCityList = ['Los Angeles', 'Detroit', 'Seattle', 'Atlanta', 'Santiago', 'Pittsburgh', 'Seoul', 'Santa Clara', 'Austin', 'Chicago']
# pull out all the rows matching a city in the test list..
df.query(f'city=={myCityList}')
However something is wrong because there are more rows listed than items in the test city list (and clearly Santiago is listed multiple times)...
print(len(myCityList))
print(df.query(f'city=={myCityList}').shape[0])
Outputs:
10
15
Maybe the above is useful but it has to be used with caution as it's not 100% guaranteed to output the correct country for a given city.
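One cautious way to use such a lookup table is to map only the city names that belong to exactly one country and leave the ambiguous ones empty for manual review. The sketch below uses a tiny hand-written table instead of the downloaded CSV, so the rows and the column names city and country are assumptions for illustration.

```python
import pandas as pd

# Tiny stand-in for a world-cities table (illustrative rows only)
cities = pd.DataFrame({
    'city':    ['Santiago', 'Santiago', 'Seoul', 'Austin'],
    'country': ['Chile', 'Philippines', 'South Korea', 'United States'],
})

# Count how many distinct countries each city name appears in
n_countries = cities.groupby('city')['country'].nunique()
ambiguous = n_countries[n_countries > 1].index.tolist()

# Lookup built only from unambiguous names
lookup = cities[~cities['city'].isin(ambiguous)].set_index('city')['country']

my_cities = pd.DataFrame({'city': ['Seoul', 'Santiago', 'Austin']})
my_cities['country'] = my_cities['city'].map(lookup)  # ambiguous names stay NaN
print(my_cities)
```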

Conditional extraction of a substring from a string

I've got a DataFrame with a column that is an object and contains the full address in one of the following formats:
'street name, building number',
'city, street, building number',
'city, district, street, building number'.
Regardless of the format I need to extract the name of the street and copy it to a new column. I cannot attach the original DF since all the information is in Russian. I've created a dummy DF instead:
df = pd.DataFrame({'address':['new york city, the bronx borough, willis avenue, building 34',
'town of salem, main street, building 105',
'second boulevard, 12'],
'street':0})
N.B. Different parts of one string are always separated by one comma. The substring with the street name in it always contains one of the words: 'street', 'avenue', 'boulevard'.
After several hours of Googling I've come up with something like this but to no avail:
street_list = ['street', 'avenue', 'boulevard']
for row in df:
    for x in street_list:
        if df.loc[row, 'address'].split(', ')[0].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[0]
        elif df.loc[row, 'address'].split(', ')[1].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[1]
        elif df.loc[row, 'address'].split(', ')[2].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[2]
This code doesn't work for me. Is it possible to tweak it somehow so that it works(or maybe you know a better solution)?
Please let me know if any additional information is required.
As far as I understand:
1. The street can be at one of a few positions in the comma-separated values, depending on how many parts the address has.
2. The street part must also pass a substring check.
In the code below:
Point 1 is represented by streetMap.
Point 2 is represented by the any() condition.
import pandas as pd
df = pd.DataFrame({'address':['new york city, the bronx borough, willis avenue, building 34',
'town of salem, main street, building 105',
'second boulevard, 12'],
'street':0})
streetMap = {2:0,3:1,4:2} # Map of length of items to location of street.
street_list = ['street', 'avenue', 'boulevard']
addresses = df['address']
streets = []
for address in addresses:
    items = address.split(', ')
    streetCandidate = items[streetMap[len(items)]]
    street = streetCandidate if any([s in streetCandidate for s in street_list]) else "NA"
    streets.append(street)
df['street'] = streets
print(df)
Output:
                                             address            street
0  new york city, the bronx borough, willis avenu...     willis avenue
1           town of salem, main street, building 105       main street
2                               second boulevard, 12  second boulevard
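Because the question states that the street part always contains one of the keywords, a variant that does not depend on the position at all is also possible. The following is a minimal sketch under that assumption; the helper name find_street is made up for this example.

```python
import pandas as pd

street_words = ('street', 'avenue', 'boulevard')

def find_street(address):
    """Return the first comma-separated part that contains a street keyword, else None."""
    for part in address.split(', '):
        if any(word in part for word in street_words):
            return part
    return None

df = pd.DataFrame({'address': ['new york city, the bronx borough, willis avenue, building 34',
                               'town of salem, main street, building 105',
                               'second boulevard, 12']})
df['street'] = df['address'].apply(find_street)
print(df['street'].tolist())
```

This avoids the KeyError that streetMap would raise for an address with a number of parts other than 2, 3 or 4.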

How do I get all the values of different strings that end with a certain word

My dataframe has a column called Borough which contains values like these:
"east toronto", "west toronto", "central toronto" and "west toronto", along with other region names.
Now I want a regular expression which gets me the data of every entry that ends with "toronto". How do I do that?
I tried this:
tronto_data = df_toronto[df_toronto['Borough'] = .*Toronto$].reset_index(drop=True)
tronto_data.head(7)
If the data is well formatted, you can split the string on the space and access the final word, comparing it to "toronto". For example:
df = pd.DataFrame({'column': ['west toronto', 'central toronto', 'some place']})
mask_df = df['column'].str.split(' ', expand=True)
which returns:
         0        1
0     west  toronto
1  central  toronto
2     some    place
You can then access the final column to work out which rows end with "toronto".
toronto_df = df[mask_df[1]=='toronto']
Edit:
I did not know there was a string method .endswith, which is the better way to do this. However, this solution does provide two columns, which may be useful.
As @Code_10 mentioned in a comment, you can use str.endswith. Try the following:
df = pd.DataFrame({'city': ['east toronto', 'west toronto', 'other', 'central toronto']})
df_toronto = df[df['city'].str.endswith('toronto')]
#df_toronto.head()
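Since the question asked for a regular expression, here is a sketch of the regex equivalent using Series.str.contains with an end-of-string anchor. The sample boroughs are made up for illustration, and case=False is an assumption added so that both "Toronto" and "toronto" match.

```python
import pandas as pd

df_toronto = pd.DataFrame({'Borough': ['East Toronto', 'west toronto',
                                       'Central Toronto', 'North York']})

# 'toronto$' matches only when the value ends with "toronto"
mask = df_toronto['Borough'].str.contains(r'toronto$', case=False, regex=True)
toronto_data = df_toronto[mask].reset_index(drop=True)
print(toronto_data)
```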

How does a double "for" work in list comprehension?

So, for a background of the problem from where this question emerges, kindly refer to this link.
As the accepted answer suggested, I went ahead with the provided code and was able to accomplish what I initially wanted. But making a dictionary was not my final goal. My ultimate aim with that dictionary was to transform it into a DataFrame, which I was able to.
Here is what I did:
df = pd.DataFrame(([st, cty] for st, cty in dic.items() for cty in dic[st]),
columns = ["State", "City"])
For your ready reference, the dic variable is as follows:
{'Alabama': ['Auburn',
'Florence',
'Jacksonville',
'Livingston',
'Montevallo',
'Troy',
'Tuscaloosa',
'Tuskegee'],
'Alaska': ['Fairbanks'],
'Arizona': ['Flagstaff', 'Tempe', 'Tucson'],
'Arkansas': ['Arkadelphia',
'Conway',
'Fayetteville',
'Jonesboro',
'Magnolia',
'Monticello',
'Russellville',
'Searcy'],
'California': ['Angwin',
'Arcata',
'Berkeley',
'Chico',
'Claremont',
'Cotati',
'Davis',
'Irvine',
'Isla Vista',
'University Park, Los Angeles',
'Merced',
'Orange',
'Palo Alto',
'Pomona',
'Redlands',
'Riverside',
'Sacramento',
'University District, San Bernardino',
'San Diego',
'San Luis Obispo',
'Santa Barbara',
'Santa Cruz',
'Turlock',
'Westwood, Los Angeles',
'Whittier'],
'Colorado': ['Alamosa',
'Boulder',
'Durango',
'Fort Collins',
'Golden',
'Grand Junction',
'Greeley',
'Gunnison',
'Pueblo, Colorado'],
'Connecticut': ['Fairfield',
'Middletown',
'New Britain',
'New Haven',
'New London',
'Storrs',
'Willimantic'],
'Delaware': ['Dover', 'Newark'], .... all the other states with their city names
The output that I got after running the above code is as follows (a screenshot):
My query is: Although I got the desired output, and although I formulated that "DataFrame comprehension", so to speak, myself, I do not fully understand the double for.
Can someone please explain how exactly does a for inside another for work in these kind of situations. I am a beginner in Pandas.
That is a generator and has nothing to do with Pandas.
The term ([x, y] for x in q for y in p) is a Python generator expression. You can assign this to a variable, say g = ([x, y] for x in q for y in p), and then iterate over it:
for element in g:
    print(element)
Pandas accepts generators at this point and iterates over them to get all values for the DataFrame.
The double for is evaluated like this:
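def pairs():  # generator function equivalent to the expression above
    for x in q:
        for y in p:
            yield [x, y]
So what this generator produces is a flat sequence of all the combinations of the elements in q and p. In your code the inner iterable is dic[st], which depends on the outer variable, so each state is paired only with its own cities.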
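To make the equivalence concrete for the State/City case, here is a small sketch with a shortened version of the dictionary from the question. The names rows_comp and rows_loop are chosen for this example.

```python
import pandas as pd

dic = {'Alaska': ['Fairbanks'],
       'Arizona': ['Flagstaff', 'Tempe', 'Tucson']}

# Comprehension form: the leftmost "for" is the outer loop
rows_comp = [[st, cty] for st, cities in dic.items() for cty in cities]

# The same thing written as explicit nested loops
rows_loop = []
for st, cities in dic.items():   # outer loop: one state at a time
    for cty in cities:           # inner loop: that state's cities only
        rows_loop.append([st, cty])

assert rows_comp == rows_loop

df = pd.DataFrame(rows_comp, columns=['State', 'City'])
print(df)
```

Reading the clauses left to right gives the nesting order from outermost to innermost, which is why the inner for may use the variable bound by the outer one.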
