Substituting part of an array using re - Python

9 7 316 Lake St Arran Dr St. Catharines, ON L2N 4H4 Phone: 905-934-5885 112.9 123 130 --- 1/1/18
10 Esso 142 Lakeshore Rd Geneva St St. Catharines, ON L2N 2T5 Phone: 905-646-4558 112.7 125.9 131.9 --- 1/1/18
11 Petro-Canada 533 Lake St Linwell Rd St. Catharines, ON L2N 4H6 Phone: (905) 937-7719 112.9 125.9 131.9 124.9 1/1/18
I have the above data, where I need to change (905) to 905- so that all rows are in the same format. I have tried reading this content as a list and using re:
import re
for line in data:
    line = re.sub(r"(905) ", "905-", line)
    print(line)
But it is not working. How do I replace it?

If all you want is a simple replacement, then you shouldn't use re:
line = line.replace("(905) ", "905-")
If you need to replace more prefixes than just 905, only then you need regular expressions:
line = re.sub(r"\((\d{3})\) ", r"\1-", line)
That would also replace (204) 342-4532 with 204-342-4532.
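For instance, a quick check of the generalized pattern on a made-up sample line:
import re

line = "Phone: (204) 342-4532"  # hypothetical sample
print(re.sub(r"\((\d{3})\) ", r"\1-", line))  # Phone: 204-342-4532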

Escape the parentheses in the RE like this:
re.sub(r"\(905\) ", "905-", line)

You need to escape the parentheses because they are special characters:
for line in data:
    line = re.sub(r"\(905\) ", "905-", line)
    print(line)
Output:
9 7 316 Lake St Arran Dr St. Catharines, ON L2N 4H4 Phone: 905-934-5885 112.9 123 130 --- 1/1/18
10 Esso 142 Lakeshore Rd Geneva St St. Catharines, ON L2N 2T5 Phone: 905-646-4558 112.7 125.9 131.9 --- 1/1/18
11 Petro-Canada 533 Lake St Linwell Rd St. Catharines, ON L2N 4H6 Phone: 905-937-7719 112.9 125.9 131.9 124.9 1/1/18

Related

How to apply a pandas geocode function to Pyspark column

Table is like this:
id  ADDRESS
0   6101 SUMMITVIEW AVE STE 200 YAKIMA
1   527 CEDAR WAY SUITE 105 OAKMONT
2   1700 N ROSE AVE SUITE 460 OXNARD
3   1275 YORK AVE NEW YORK
4   2300 MANCHESTER EXPY A SUITE 101 A COLUMBUS
5   401 N MICHIGAN AVE CHICAGO
6   111 GROSSMAN DR INTERNAL MEDICINE BRAINTREE
7   1850 N CENTRAL AVE STE 1600 PHOENIX
8   47 NEW SCOTLAND AVENUE ALBANY MEDICAL CENTER A...
9   201 N VINE ST EL DORADO
10  4420 LAKE BOONE TRL RALEIGH
11  2727 W HOLCOMBE BLVD HOUSTON
12  850 PETER BRYCE BLVD TUSCALOOSA
13  1803 WEHRLI RD NAPERVILLE
14  4321 N MACDILL AVE STE 203 TAMPA
15  111 CONTINENTAL DR SUITE 412 NEWARK
16  1834 E INNOVATION PARK DR ORO VALLEY
17  880 KEMPSVILLE RD SUITE 2200 NORFOLK
18  701 PRINCETON AVE SW BIRMINGHAM
19  4729 COUNTY ROAD 101 MINNETONKA
import pandas as pd
import geopandas as gpd
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import matplotlib.pyplot as plt
import folium
from folium.plugins import FastMarkerCluster

locator = Nominatim(user_agent="myGeocoder")
geocode = RateLimiter(locator.geocode, min_delay_seconds=0.0, error_wait_seconds=1.0, swallow_exceptions=True, return_value_on_exception=None)
apprix_1_na['location'] = apprix_1_na['ADDRESS'].apply(geocode)
apprix_1_na['point'] = apprix_1_na['location'].apply(lambda loc: tuple(loc.point) if loc else None)
I want this code to work in PySpark to get the longitude and latitude.
I'll show a "complex" example with the GoogleV3 API; it is easily adaptable to your case:
from geopy.geocoders import GoogleV3
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType, ArrayType
df = spark.createDataFrame([("123 Fake St, Springfield, 12345, USA",),("1000 N West Street, Suite 1200 Wilmington, DE 19801, USA",)], ["address"])
df.display()
address
123 Fake St, Springfield, 12345, USA
1000 N West Street, Suite 1200 Wilmington, DE 19801, USA
@udf(returnType=ArrayType(FloatType()))
def geoloc(address):
    api = 'your_api_key_here'
    geolocator = GoogleV3(api)
    # get lat/long
    return geolocator.geocode(address)[1]
# find coord
df = df.withColumn('geocode', geoloc(col('address')))
# separate tuple
df = df.withColumn("latitude", col('geocode').getItem(0))\
       .withColumn("longitude", col('geocode').getItem(1))
df.display()
address                                                   geocode                   latitude   longitude
123 Fake St, Springfield, 12345, USA                      [44.046238, -123.022026]  44.046238  -123.022026
1000 N West Street, Suite 1200 Wilmington, DE 19801, USA  [39.74717, -75.54999]     39.74717   -75.54999
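If you want to stay with the free Nominatim geocoder from the question rather than GoogleV3, here is a minimal sketch of the same UDF approach (geoloc_nominatim is a name I made up; Nominatim is heavily rate-limited, so this is only sensible for small volumes):
from geopy.geocoders import Nominatim
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, FloatType

@udf(returnType=ArrayType(FloatType()))
def geoloc_nominatim(address):
    locator = Nominatim(user_agent="myGeocoder")
    loc = locator.geocode(address)
    # [lat, lon], or None when the address cannot be resolved
    return [loc.latitude, loc.longitude] if loc else None

df = df.withColumn('geocode', geoloc_nominatim(col('address')))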

Split column in DataFrame based on item in list

I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.
Input data:
Address Text
0 11 North Warren Circle Lisbon Falls ME 04252
1 227 Cony Street Augusta ME 04330
2 70 Buckner Drive Battle Creek MI
3 718 Perry Street Big Rapids MI
4 14857 Martinsville Road Van Buren MI
5 823 Woodlawn Ave Dallas TX 75208
6 2525 Washington Avenue Waco TX 76710
7 123 South Main St Dallas TX 75201
The output I'm trying to achieve (for all rows, but I only wrote out the first two to save time)
City State Postcode
0 Lisbon Falls ME 04252
1 Augusta ME 04330
My code:
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand = True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand = True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
# This is where I got stuck
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df
Here's a way to do that:
import pandas as pd
# data
df = pd.DataFrame(
['11 North Warren Circle Lisbon Falls ME 04252',
'227 Cony Street Augusta ME 04330',
'70 Buckner Drive Battle Creek MI',
'718 Perry Street Big Rapids MI',
'14857 Martinsville Road Van Buren MI',
'823 Woodlawn Ave Dallas TX 75208',
'2525 Washington Avenue Waco TX 76710',
'123 South Main St Dallas TX 75201'],
columns=['Address Text'])
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
def find_city(address, state, street_synonyms):
for syn in street_synonyms:
if syn in address:
# remove street
city = address.split(syn)[-1]
# remove State and postcode
city = city.split(state)[0]
return city
df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)
print(df[['City', 'State', 'Zip']])
"""
City State Zip
0 Lisbon Falls ME 04252
1 Augusta ME 04330
2 Battle Creek MI NaN
3 Big Rapids MI NaN
4 Van Buren MI 14857
5 Dallas TX 75208
6 nue Waco TX 76710
7 Dallas TX 75201
"""

AWK reformat portion of results (names) within larger string

My goal is to reformat names from Last First Middle (LFM) to First Middle Last (FML), which are part of a larger string. Here's some sample data:
Name, Address1, Address2
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO
What I would like is:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
I know how to reorder the first column, but I end up dropping the rest of the row data:
# Return the first column via comma separation (name), then separate by spaces.
# If there are two strings but not three (only a last and first name),
# then change the order to first last.
awk -F, '{print $1}' | awk -F" " '$2!="" && $3=="" {print $2,$1}' >> names.txt
awk -F, '{print $1}' | awk -F" " '$3!="" && $4=="" {print $3,$1,$2}' >> names.txt
...# Continue to iterate column numbers
If there's an easier way to take the last string found and move it to the front, I'd like to hear about it, but here's my real interest...
My problem is that I want to reorder the space-separated fields of the 1st comma-separated field (what I did above), but then also print the rest of the comma-separated data.
Is there a way I can store the address info in a variable and append it after the space-separated names?
Alternatively, could I do some kind of nested split?
I'm currently doing this with awk in bash, but am willing to use python/pandas or any other efficient methods.
Thanks for the help!
Using sed, looks terrible but works:
sed -E '2,$s/^([^ ,]*) ([^ ,]*)( [^,]*)?/\2\3 \1/' in
and POSIX version:
sed '2,$s/^\([^ ,]*\) \([^ ,]*\)\( [^,]*\)*/\2\3 \1/' in
output:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
The following AWK script, as ugly as it is, works for your inputs (run with awk -F, -f script.awk):
{
    split($1, names, " ");
    for (i = 2; i <= length(names); i++)
        printf("%s ", names[i]);
    printf("%s, ", names[1]);
    for (i = 2; i < NF; i++)
        printf("%s,", $i);
    print($NF)
}
Input:
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO
Output:
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
The same solution in Python:
import sys
import re

for line in sys.stdin:
    parts = re.split(r'\s*,\s*', line.rstrip('\n'))
    names = parts[0].split()
    print(", ".join([" ".join(names[1:] + names[:1])] + parts[1:]))
Another awk. This one works with the header line and Madonna (i.e. single-word fields):
$ awk '                    # using awk
BEGIN{FS=OFS=","}          # csv
{
    n=split($1,a," ")      # split the first field into array a
    for(i=n;i>1;i--)       # iterate back from the last element of a
        a[1]=a[i] " " a[1] # prepend it to the first element of a
    $1=a[1]                # replace the first field with the first element of a
}1' file                   # output
Output:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
Madonna, ...
$ awk '
BEGIN { FS=OFS=", " }
$1 ~ / / {
    last = rest = $1
    sub(/ .*/, "", last)
    sub(/[^ ]+ /, "", rest)
    $1 = rest " " last
}
{ print }
' file
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
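Since you mentioned being open to python/pandas, here is a minimal sketch of the same reordering there (assuming the input, including the header line, is in a file named names.txt):
import pandas as pd

df = pd.read_csv("names.txt", skipinitialspace=True)
# "Smith Joe M" -> "Joe M Smith": move the first word to the end;
# single-word names have no space, so the pattern leaves them alone
df["Name"] = df["Name"].str.replace(r'^(\S+)\s+(.*)$', r'\2 \1', regex=True)
print(df.to_csv(index=False))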

How to get the most popular combination (set of values and their count) from a CSV file as input

I have a CSV file with passengers' travel data:
Now using this CSV file as input, we need to find the most popular trip.
Start Time     End Time       Trip Duration  Start Station                         End Station                   User Type   Gender  Birth Year
1/1/2017 0:00  1/1/2017 0:06  356            Canal St & Taylor St                  Canal St & Monroe St ()       Customer
1/1/2017 0:02  1/1/2017 0:08  327            Larrabee St & Menomonee St            Sheffield Ave & Kingsbury St  Subscriber  Male    1984
1/1/2017 0:06  1/1/2017 0:18  745            Orleans St & Chestnut St (NEXT Apts)  Ashland Ave & Blackhawk St    Subscriber  Male    1985
1/1/2017 0:07  1/1/2017 0:12  323            Franklin St & Monroe St               Clinton St & Tilden St        Subscriber  Male    1990
def popular_trip(csv_file):
    '''TODO: fill out docstring with description, arguments, and return values.
    Question: What is the most popular trip?
    '''
    # TODO: complete function
import pandas as pd

df = pd.DataFrame({"start": ['a', 'a', 'c', 'b'],
                   "end": ['e', 'e', 'g', 'f']})  # sample df
trip_series = df["start"].astype(str) + " to " + df["end"].astype(str)
trip_series.describe()
output:
count         4
unique        3
top      a to e
freq          2
dtype: object
most_popular_trip = trip_series.describe()["top"]
print(most_popular_trip)
output: 'a to e'
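Applied to the actual file, a minimal sketch assuming the column headers shown in the question ("Start Station", "End Station"):
import pandas as pd

def popular_trip(csv_file):
    '''Return the most frequent "Start Station to End Station" trip.'''
    df = pd.read_csv(csv_file)
    trips = df["Start Station"] + " to " + df["End Station"]
    # idxmax() returns the trip label with the highest count
    return trips.value_counts().idxmax()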

Python pandas regular expression replace part of the matching pattern

I've got a bunch of addresses like so:
df['street'] =
5311 Whitsett Ave 34
355 Sawyer St
607 Hampshire Rd #358
342 Old Hwy 1
267 W Juniper Dr 402
What I want to do is to remove those numbers at the end of the street part of the addresses to get:
df['street'] =
5311 Whitsett Ave
355 Sawyer St
607 Hampshire Rd
342 Old Hwy 1
267 W Juniper Dr
I have my regular expression like this:
df['street'] = df.street.str.replace(r"""\s(?:dr|ave|rd)[^a-zA-Z]\D*\d+$""", '', case=False)
which gives me this:
df['street'] =
5311 Whitsett
355 Sawyer St
607 Hampshire
342 Old Hwy 1
267 W Juniper
It dropped the words 'Ave', 'Rd' and 'Dr' from my original street addresses. Is there a way to keep part of the regular expression pattern (in my case 'Ave', 'Rd' and 'Dr') and replace the rest?
EDIT:
Notice the address 342 Old Hwy 1. I do not want to take out the number in such cases. That's why I specified the patterns ('Ave', 'Rd', 'Dr', etc.) to have better control over what gets changed.
import re

df_street = '''
5311 Whitsett Ave 34
355 Sawyer St
607 Hampshire Rd #358
342 Old Hwy 1
267 W Juniper Dr 402
'''
# digits at the end are preceded by one of (Ave, Rd, Dr) and a space,
# may be preceded by a #, and are followed by optional spaces and the newline
df_street = re.sub(r'(Ave|Rd|Dr)\s+#?\d+\s*\n', r'\1\n', df_street, flags=re.MULTILINE|re.IGNORECASE)
print(df_street)
5311 Whitsett Ave
355 Sawyer St
607 Hampshire Rd
342 Old Hwy 1
267 W Juniper Dr
You should use the following regex:
>>> import re
>>> example_str = "607 Hampshire Rd #358"
>>> re.sub(r"\s*\#?[^\D]+\s*$", r"", example_str)
'607 Hampshire Rd'
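For the DataFrame in the question, the same capture-group idea works directly with str.replace (a sketch assuming the 'Ave'/'Rd'/'Dr' list from the question; extend the alternation as needed):
df['street'] = df['street'].str.replace(
    r'\b(Ave|Rd|Dr)\s+#?\d+\s*$',  # keep the street type, drop the trailing unit number
    r'\1',
    case=False,
    regex=True,
)
This leaves "342 Old Hwy 1" untouched, since "Hwy" is not in the alternation.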
