Extracting strings from a list by a specific word

Extracting strings from a list by a specific word - python

I have this column of addresses in pandas and I want to select only those addresses in the US, however I either get an empty string or thrown an error.
Here's what I have done:
0 238 Lincoln St, Hahnville, LA 70057, USA
1 101 Home Pl Ln, Hahnville, LA 70057, USA
2 1250 Poydras St, New Orleans, LA 70113, USA
3 1117 Broadway STE 401, Tacoma, WA 98402, USA
4 2715 N Junett St, Tacoma, WA 98407, USA
5 Hillstrust Primary School, 29 Nethan St, Govan, Glasgow G51 3LX, UK
6 5778+JM Godalming, UK
7 569 Durham Rd, Low Fell, Gateshead NE9 5EY, UK
8 Pennine Way, Barnard Castle DL12, UK
9 14 Studios Rd, Shepperton TW17 0QW, UK
matching = [s for s in final_data["full_address"] if "USA" in s]
matching
#returns: TypeError: argument of type 'float' is not iterable
#Whereas
ab = [final_data["full_address"]]
matching = [s for s in ab if "USA" in s]
matching
#returns: []
Expected output:
0 238 Lincoln St, Hahnville, LA 70057, USA
1 101 Home Pl Ln, Hahnville, LA 70057, USA
2 1250 Poydras St, New Orleans, LA 70113, USA
3 1117 Broadway STE 401, Tacoma, WA 98402, USA
4 2715 N Junett St, Tacoma, WA 98407, USA

Try this:
import pandas as pd
data = {
'full_address': [
'238 Lincoln St, Hahnville, LA 70057, USA', '101 Home Pl Ln, Hahnville, LA 70057, USA', '1250 Poydras St, New Orleans, LA 70113, USA',
'1117 Broadway STE 401, Tacoma, WA 98402, USA', '2715 N Junett St, Tacoma, WA 98407, USA', '5778+JM Godalming, UK', '569 Durham Rd, Low Fell, Gateshead NE9 5EY, UK',
'Pennine Way, Barnard Castle DL12, UK', '14 Studios Rd, Shepperton TW17 0QW, UK'
]
}
df = pd.DataFrame(data)
matching = df[df['full_address'].str.contains("USA")]
print(matching)
Output:
full_address
0 238 Lincoln St, Hahnville, LA 70057, USA
1 101 Home Pl Ln, Hahnville, LA 70057, USA
2 1250 Poydras St, New Orleans, LA 70113, USA
3 1117 Broadway STE 401, Tacoma, WA 98402, USA
4 2715 N Junett St, Tacoma, WA 98407, USA

Hello I have tried to recreate your scenario and in this it is working I just added a query with contain statement on specific column which is here is country
import pandas as pd
# Build cars DataFrame
names = ['238 Lincoln St, Hahnville, LA 70057, USA', '101 Home Pl Ln, Hahnville, LA 70057, USA', 'Hillstrust Govan, Glasgow G51 3LX, UK']
dict = { 'country':names}
cars = pd.DataFrame(dict)
b = cars.query('country.str.contains("USA")', engine='python')
print(b)

Related

Python Dict Comprehension retrieve value from 1 dataframe column if match another column value

I have a dataframe and there are 2 columns ["country"] and ["city"] which basically informs of the country and their cities.
I need to create a dict using dict comprehensions, to get as a key, the country and as values, a list of the city/cities (some of them only have one city, others many).
I'm able to define the keys and create a list but all the cities existing appears a values, I am not able to create the condition that the country of the value should be the key:
Dic = {k: list(megacities["city"]) for k,f in megacities.groupby('country')}
for k in Dic:
print("{}:{}\n".format(k, Dic[k]))
Part of the output that I receive is:
Argentina:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
Bangladesh:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
Brazil:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
So basically the expect output would be:
Argentina:['Buenos Aires']
Bangladesh:['Dhaka']
Brazil:['São Paulo', 'Rio de Janeiro']
How can I should proceed in terms of syntaxis to stablish that condition for the value in the dict comprehension?
Lastly, the dataframe:
city city_ascii lat lng country iso2 iso3 admin_name capital population id
0 Tokyo Tokyo 35.6839 139.7744 Japan JP JPN Tōkyō primary 39105000 1392685764
1 Jakarta Jakarta -6.2146 106.8451 Indonesia ID IDN Jakarta primary 35362000 1360771077
2 Delhi Delhi 28.6667 77.2167 India IN IND Delhi admin 31870000 1356872604
3 Manila Manila 14.6000 120.9833 Philippines PH PHL Manila primary 23971000 1608618140
4 São Paulo Sao Paulo -23.5504 -46.6339 Brazil BR BRA São Paulo admin 22495000 1076532519
5 Seoul Seoul 37.5600 126.9900 South Korea KR KOR Seoul primary 22394000 1410836482
6 Mumbai Mumbai 19.0758 72.8775 India IN IND Mahārāshtra admin 22186000 1356226629
7 Shanghai Shanghai 31.1667 121.4667 China CN CHN Shanghai admin 22118000 1156073548
8 Mexico City Mexico City 19.4333 -99.1333 Mexico MX MEX Ciudad de México primary 21505000 1484247881
9 Guangzhou Guangzhou 23.1288 113.2590 China CN CHN Guangdong admin 21489000 1156237133
10 Cairo Cairo 30.0444 31.2358 Egypt EG EGY Al Qāhirah primary 19787000 1818253931
11 Beijing Beijing 39.9040 116.4075 China CN CHN Beijing primary 19437000 1156228865
12 New York New York 40.6943 -73.9249 United States US USA New York NaN 18713220 1840034016
13 Kolkāta Kolkata 22.5727 88.3639 India IN IND West Bengal admin 18698000 1356060520
14 Moscow Moscow 55.7558 37.6178 Russia RU RUS Moskva primary 17693000 1643318494
15 Bangkok Bangkok 13.7500 100.5167 Thailand TH THA Krung Thep Maha Nakhon primary 17573000 1764068610
16 Dhaka Dhaka 23.7289 90.3944 Bangladesh BD BGD Dhaka primary 16839000 1050529279
17 Buenos Aires Buenos Aires -34.5997 -58.3819 Argentina AR ARG Buenos Aires, Ciudad Autónoma de primary 16216000 1032717330
18 Ōsaka Osaka 34.7520 135.4582 Japan JP JPN Ōsaka admin 15490000 1392419823
19 Lagos Lagos 6.4500 3.4000 Nigeria NG NGA Lagos minor 15487000 1566593751
20 Istanbul Istanbul 41.0100 28.9603 Turkey TR TUR İstanbul admin 15311000 1792756324
21 Karachi Karachi 24.8600 67.0100 Pakistan PK PAK Sindh admin 15292000 1586129469
22 Kinshasa Kinshasa -4.3317 15.3139 Congo (Kinshasa) CD COD Kinshasa primary 15056000 1180000363
23 Shenzhen Shenzhen 22.5350 114.0540 China CN CHN Guangdong minor 14678000 1156158707
24 Bangalore Bangalore 12.9791 77.5913 India IN IND Karnātaka admin 13999000 1356410365
25 Ho Chi Minh City Ho Chi Minh City 10.8167 106.6333 Vietnam VN VNM Hồ Chí Minh admin 13954000 1704774326
26 Tehran Tehran 35.7000 51.4167 Iran IR IRN Tehrān primary 13819000 1364305026
27 Los Angeles Los Angeles 34.1139 -118.4068 United States US USA California NaN 12750807 1840020491
28 Rio de Janeiro Rio de Janeiro -22.9083 -43.1964 Brazil BR BRA Rio de Janeiro admin 12486000 1076887657
29 Chengdu Chengdu 30.6600 104.0633 China CN CHN Sichuan admin 11920000 1156421555
30 Baoding Baoding 38.8671 115.4845 China CN CHN Hebei NaN 11860000 1156256829
31 Chennai Chennai 13.0825 80.2750 India IN IND Tamil Nādu admin 11564000 1356374944
32 Lahore Lahore 31.5497 74.3436 Pakistan PK PAK Punjab admin 11148000 1586801463
33 London London 51.5072 -0.1275 United Kingdom GB GBR London, City of primary 11120000 1826645935
34 Paris Paris 48.8566 2.3522 France FR FRA Île-de-France primary 11027000 1250015082
35 Tianjin Tianjin 39.1467 117.2056 China CN CHN Tianjin admin 10932000 1156174046
36 Linyi Linyi 35.0606 118.3425 China CN CHN Shandong NaN 10820000 1156086320
37 Shijiazhuang Shijiazhuang 38.0422 114.5086 China CN CHN Hebei admin 10784600 1156217541
38 Zhengzhou Zhengzhou 34.7492 113.6605 China CN CHN Henan admin 10136000 1156183137
39 Nanyang Nanyang 32.9987 112.5292 China CN CHN Henan NaN 10013600 1156192287
Many thanks!

Try:
d = {i: g["city"].to_list() for i, g in df.groupby("country")}
print(d)
Prints:
{
"Argentina": ["Buenos Aires"],
"Bangladesh": ["Dhaka"],
"Brazil": ["São Paulo", "Rio de Janeiro"],
"China": [
"Shanghai",
"Guangzhou",
"Beijing",
"Shenzhen",
"Chengdu",
"Baoding",
"Tianjin",
"Linyi",
"Shijiazhuang",
"Zhengzhou",
"Nanyang",
],
"Congo (Kinshasa)": ["Kinshasa"],
"Egypt": ["Cairo"],
"France": ["Paris"],
"India": ["Delhi", "Mumbai", "Kolkāta", "Bangalore", "Chennai"],
"Indonesia": ["Jakarta"],
"Iran": ["Tehran"],
"Japan": ["Tokyo", "Ōsaka"],
"Mexico": ["Mexico City"],
"Nigeria": ["Lagos"],
"Pakistan": ["Karachi", "Lahore"],
"Philippines": ["Manila"],
"Russia": ["Moscow"],
"South Korea": ["Seoul"],
"Thailand": ["Bangkok"],
"Turkey": ["Istanbul"],
"United Kingdom": ["London"],
"United States": ["New York", "Los Angeles"],
"Vietnam": ["Ho Chi Minh City"],
}

Since you are doing the groupby, You need to fetch city from the group
Dic = {k: f['city'].unique() for k,f in megacities.groupby('country')}

Draw a Map of cities in python

I have a ranking of countries across the world in a variable called rank_2000 that looks like this:
Seoul
Tokyo
Paris
New_York_Greater
Shizuoka
Chicago
Minneapolis
Boston
Austin
Munich
Salt_Lake
Greater_Sydney
Houston
Dallas
London
San_Francisco_Greater
Berlin
Seattle
Toronto
Stockholm
Atlanta
Indianapolis
Fukuoka
San_Diego
Phoenix
Frankfurt_am_Main
Stuttgart
Grenoble
Albany
Singapore
Washington_Greater
Helsinki
Nuremberg
Detroit_Greater
TelAviv
Zurich
Hamburg
Pittsburgh
Philadelphia_Greater
Taipei
Los_Angeles_Greater
Miami_Greater
MannheimLudwigshafen
Brussels
Milan
Montreal
Dublin
Sacramento
Ottawa
Vancouver
Malmo
Karlsruhe
Columbus
Dusseldorf
Shenzen
Copenhagen
Milwaukee
Marseille
Greater_Melbourne
Toulouse
Beijing
Dresden
Manchester
Lyon
Vienna
Shanghai
Guangzhou
San_Antonio
Utrecht
New_Delhi
Basel
Oslo
Rome
Barcelona
Madrid
Geneva
Hong_Kong
Valencia
Edinburgh
Amsterdam
Taichung
The_Hague
Bucharest
Muenster
Greater_Adelaide
Chengdu
Greater_Brisbane
Budapest
Manila
Bologna
Quebec
Dubai
Monterrey
Wellington
Shenyang
Tunis
Johannesburg
Auckland
Hangzhou
Athens
Wuhan
Bangalore
Chennai
Istanbul
Cape_Town
Lima
Xian
Bangkok
Penang
Luxembourg
Buenos_Aires
Warsaw
Greater_Perth
Kuala_Lumpur
Santiago
Lisbon
Dalian
Zhengzhou
Prague
Changsha
Chongqing
Ankara
Fuzhou
Jinan
Xiamen
Sao_Paulo
Kunming
Jakarta
Cairo
Curitiba
Riyadh
Rio_de_Janeiro
Mexico_City
Hefei
Almaty
Beirut
Belgrade
Belo_Horizonte
Bogota_DC
Bratislava
Dhaka
Durban
Hanoi
Ho_Chi_Minh_City
Kampala
Karachi
Kuwait_City
Manama
Montevideo
Panama_City
Quito
San_Juan
What I would like to do is a map of the world where those cities are colored according to their position on the ranking above. I am opened to further solutions for the representation (such as bubbles of increasing dimension according to the position of the cities in the rank or, if necessary, representing only a sample of countries taken from the top rank, the middle and the bottom).
Thank you,
Federico

Your question has two parts; finding the location of each city and then drawing them on the map. Assuming you have the latitude and longitude of each city, here's how you'd tackle the latter part.
I like Folium (https://pypi.org/project/folium/) for drawing maps. Here's an example of how you might draw a circle for each city, with it's position in the list is used to determine the size of that circle.
import folium
cities = [
{'name':'Seoul', 'coodrs':[37.5639715, 126.9040468]},
{'name':'Tokyo', 'coodrs':[35.5090627, 139.2094007]},
{'name':'Paris', 'coodrs':[48.8588787,2.2035149]},
{'name':'New York', 'coodrs':[40.6976637,-74.1197631]},
# etc. etc.
]
m = folium.Map(zoom_start=15)
for counter, city in enumerate(cities):
circle_size = 5 + counter
folium.CircleMarker(
location=city['coodrs'],
radius=circle_size,
popup=city['name'],
color="crimson",
fill=True,
fill_color="crimson",
).add_to(m)
m.save('map.html')
Output:
You may need to adjust the circle_size calculation a little to work with the number of cities you want to include.

How to apply a pandas geocode function to Pyspark column

Table is like this
id
ADDRESS
0
6101 SUMMITVIEW AVE STE 200 YAKIMA
1
527 CEDAR WAY SUITE 105 OAKMONT
2
1700 N ROSE AVE SUITE 460 OXNARD
3
1275 YORK AVE NEW YORK
4
2300 MANCHESTER EXPY A SUITE 101 A COLUMBUS
5
401 N MICHIGAN AVE CHICAGO
6
111 GROSSMAN DR INTERNAL MEDICINE BRAINTREE
7
1850 N CENTRAL AVE STE 1600 PHOENIX
8
47 NEW SCOTLAND AVENUE ALBANY MEDICAL CENTER A...
9
201 N VINE ST EL DORADO
10
4420 LAKE BOONE TRL RALEIGH
11
2727 W HOLCOMBE BLVD HOUSTON
12
850 PETER BRYCE BLVD TUSCALOOSA
13
1803 WEHRLI RD NAPERVILLE
14
4321 N MACDILL AVE STE 203 TAMPA
15
111 CONTINENTAL DR SUITE 412 NEWARK
16
1834 E INNOVATION PARK DR ORO VALLEY
17
880 KEMPSVILLE RD SUITE 2200 NORFOLK
18
701 PRINCETON AVE SW BIRMINGHAM
19
4729 COUNTY ROAD 101 MINNETONKA
import pandas as pd
import geopandas as gpd
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import matplotlib.pyplot as plt
import folium
from folium.plugins import FastMarkerCluster
locator = Nominatim(user_agent="myGeocoder")
from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(locator.geocode,min_delay_seconds=0.0, error_wait_seconds=1.0, swallow_exceptions=True, return_value_on_exception=None)
apprix_1_na['location'] = apprix_1_na['ADDRESS'].apply(geocode)
apprix_1_na['point'] = apprix_1_na['location'].apply(lambda loc: tuple(loc.point) if loc enter code hereelse None)
I want this code to work in Pyspark for longitude and latitude

I'll show a "complex" example with GoogleV3 API. It is easy suitable to your case
from geopy.geocoders import GoogleV3
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType, ArrayType
df = spark.createDataFrame([("123 Fake St, Springfield, 12345, USA",),("1000 N West Street, Suite 1200 Wilmington, DE 19801, USA",)], ["address"])
df.display()
address
123 Fake St, Springfield, 12345, USA
1000 N West Street, Suite 1200 Wilmington, DE 19801, USA
#udf(returnType=ArrayType(FloatType()))
def geoloc(address):
api = 'your_api_key_here'
geolocator = GoogleV3(api)
#get lat_long
return geolocator.geocode(address)[1]
#find coord
df = df.withColumn('geocode', geoloc(col('address')))
#separate tuple
df = df.withColumn("latitude", col('geocode').getItem(0))\
.withColumn("longitude", col('geocode').getItem(1))
df.display()
address
geocode
latitude
longitude
123 Fake St, Springfield, 12345, USA
[44.046238, -123.022026]
44.046238
-123.022026
1000 N West Street, Suite 1200 Wilmington, DE 19801, USA
[39.74717, -75.54999]
39.74717
-75.54999

Python Scaling loops

For each letter in the alphabet. The code should go to website.com/a and grab a table. Then it should check for a next button grab the link and makesoup and grab the next table and repeat until there is no valid next link. Then move to website.com/b(next letter in alphabet) and repeat. But I can only get as far as 2 pages for each letter. the first for loop grabs page 1 and the second grabs page 2 for each letter. I know I could write a loop for as many pages as needed but that is not scalable. How can I fix this?
from nfl_fun import make_soup
import urllib.request
import os
from string import ascii_lowercase
import requests
letter = ascii_lowercase
link = "https://www.nfl.com"
for letter in ascii_lowercase:
soup = make_soup(f"https://www.nfl.com/players/active/{letter}")
for tbody in soup.findAll("tbody"):
for tr in tbody.findAll("a"):
if tr.has_attr("href"):
print(tr.attrs["href"])
for letter in ascii_lowercase:
soup = make_soup(f"https://www.nfl.com/players/active/{letter}")
for page in soup.footer.findAll("a", {"nfl-o-table-pagination__next"}):
pagelink = ""
footer = ""
footer = page.attrs["href"]
pagelink = f"{link}{footer}"
print(footer)
getpage = requests.get(pagelink)
if getpage.status_code == 200:
next_soup = make_soup(pagelink)
for next_page in next_soup.footer.findAll("a", {"nfl-o-table-pagination__next"}):
print(getpage)
for tbody in next_soup.findAll("tbody"):
for tr in tbody.findAll("a"):
if tr.has_attr("href"):
print(tr.attrs["href"])
soup = next_soup
Thank You again,

There is an element in there that says when the "Next" button is inactive. So that'll tell you you are on the last page. So what you can do is a while loop, and just keep going to the next page, until it reaches the last page (Ie "Next" is inactive) and then tell it to stop the loop and go to the next letter:
from bs4 import BeautifulSoup
from string import ascii_lowercase
import requests
import pandas as pd
import re
letters = ascii_lowercase
link = "https://www.nfl.com"
results = pd.DataFrame()
for letter in letters:
continueToNextPage = True
after = ''
page=1
while continueToNextPage == True:
# Get the Table
url = f"https://www.nfl.com/players/active/{letter}?query={letter}&after={after}"
response = requests.get(url, 'html.parser')
soup = BeautifulSoup(response.text, 'html.parser')
temp_df = pd.read_html(response.text)[0]
results = results.append(temp_df, sort=False).reset_index(drop=True)
print ("{letter}: Page: {page}".format(letter=letter.upper(), page=page))
# Check if next page is inactive
buttons = soup.find('div', {'class':'nfl-o-table-pagination__buttons'})
regex = re.compile('.*pagination__next.*is-inactive.*')
if buttons.find('span', {'class':regex}):
continueToNextPage = False
else:
after = buttons.find('a', {'title':'Next'})['href'].split('after=')[-1]
page+=1
Output:
print (results)
Player Current Team Position Status
0 Chidobe Awuzie Dallas Cowboys CB ACT
1 Josh Avery Seattle Seahawks DT ACT
2 Genard Avery Philadelphia Eagles DE ACT
3 Anthony Averett Baltimore Ravens CB ACT
4 Lee Autry Chicago Bears DT ACT
5 Denico Autry Indianapolis Colts DT ACT
6 Tavon Austin Dallas Cowboys WR UFA
7 Blessuan Austin New York Jets CB ACT
8 Antony Auclair Tampa Bay Buccaneers TE ACT
9 Jeremiah Attaochu Denver Broncos LB ACT
10 Hunter Atkinson Atlanta Falcons OT ACT
11 John Atkins Detroit Lions DE ACT
12 Geno Atkins Cincinnati Bengals DT ACT
13 Marcell Ateman Las Vegas Raiders WR ACT
14 George Aston New York Giants RB ACT
15 Dravon Askew-Henry New York Giants DB ACT
16 Devin Asiasi New England Patriots TE ACT
17 George Asafo-Adjei New York Giants OT ACT
18 Ade Aruna Las Vegas Raiders DE ACT
19 Grayland Arnold Philadelphia Eagles SAF ACT
20 Dan Arnold Arizona Cardinals TE ACT
21 Damon Arnette Las Vegas Raiders CB UDF
22 Ray-Ray Armstrong Dallas Cowboys LB UFA
23 Ka'John Armstrong Denver Broncos OT ACT
24 Dorance Armstrong Dallas Cowboys DE ACT
25 Cornell Armstrong Houston Texans CB ACT
26 Terron Armstead New Orleans Saints OT ACT
27 Ryquell Armstead Jacksonville Jaguars RB ACT
28 Arik Armstead San Francisco 49ers DE ACT
29 Alex Armah Carolina Panthers FB ACT
... ... ... ...
3180 Clive Walford Miami Dolphins TE UFA
3181 Cameron Wake Tennessee Titans DE UFA
3182 Corliss Waitman Pittsburgh Steelers P ACT
3183 Rick Wagner Green Bay Packers OT ACT
3184 Bobby Wagner Seattle Seahawks MLB ACT
3185 Ahmad Wagner Chicago Bears WR ACT
3186 Colby Wadman Denver Broncos P ACT
3187 Christian Wade Buffalo Bills RB ACT
3188 LaAdrian Waddle Buffalo Bills OT UFA
3189 Oshane Ximines New York Giants LB ACT
3190 Trevon Young Cleveland Browns DE ACT
3191 Sam Young Las Vegas Raiders OT ACT
3192 Kenny Young Los Angeles Rams ILB ACT
3193 Chase Young Washington Redskins DE UDF
3194 Bryson Young Atlanta Falcons DE ACT
3195 Isaac Yiadom Denver Broncos CB ACT
3196 T.J. Yeldon Buffalo Bills RB ACT
3197 Deon Yelder Kansas City Chiefs TE ACT
3198 Rock Ya-Sin Indianapolis Colts CB ACT
3199 Eddie Yarbrough Minnesota Vikings DE ACT
3200 Marshal Yanda Baltimore Ravens OG ACT
3201 Tavon Young Baltimore Ravens CB ACT
3202 Brandon Zylstra Carolina Panthers WR ACT
3203 Jabari Zuniga New York Jets DE UDF
3204 Greg Zuerlein Dallas Cowboys K ACT
3205 Isaiah Zuber New England Patriots WR ACT
3206 Justin Zimmer Cleveland Browns DT ACT
3207 Anthony Zettel Minnesota Vikings DE ACT
3208 Kevin Zeitler New York Giants OG ACT
3209 Olamide Zaccheaus Atlanta Falcons WR ACT
[3210 rows x 4 columns]

Split column in DataFrame based on item in list

I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.
Input data:
Address Text
0 11 North Warren Circle Lisbon Falls ME 04252
1 227 Cony Street Augusta ME 04330
2 70 Buckner Drive Battle Creek MI
3 718 Perry Street Big Rapids MI
4 14857 Martinsville Road Van Buren MI
5 823 Woodlawn Ave Dallas TX 75208
6 2525 Washington Avenue Waco TX 76710
7 123 South Main St Dallas TX 75201
The output I'm trying to achieve (for all rows, but I only wrote out the first two to save time)
City State Postcode
0 Lisbon Falls ME 04252
1 Augusta ME 04330
My code:
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand = True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand = True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
# This is where I got stuck
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df

Here's a way to do that:
import pandas as pd
# data
df = pd.DataFrame(
['11 North Warren Circle Lisbon Falls ME 04252',
'227 Cony Street Augusta ME 04330',
'70 Buckner Drive Battle Creek MI',
'718 Perry Street Big Rapids MI',
'14857 Martinsville Road Van Buren MI',
'823 Woodlawn Ave Dallas TX 75208',
'2525 Washington Avenue Waco TX 76710',
'123 South Main St Dallas TX 75201'],
columns=['Address Text'])
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
def find_city(address, state, street_synonyms):
for syn in street_synonyms:
if syn in address:
# remove street
city = address.split(syn)[-1]
# remove State and postcode
city = city.split(state)[0]
return city
df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)
print(df[['City', 'State', 'Zip']])
"""
City State Zip
0 Lisbon Falls ME 04252
1 Augusta ME 04330
2 Battle Creek MI NaN
3 Big Rapids MI NaN
4 Van Buren MI 14857
5 Dallas TX 75208
6 nue Waco TX 76710
7 Dallas TX 75201
"""

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting strings from a list by a specific word - python

Related

Python Dict Comprehension retrieve value from 1 dataframe column if match another column value

Draw a Map of cities in python

How to apply a pandas geocode function to Pyspark column

Python Scaling loops

Split column in DataFrame based on item in list

Categories

Resources