AWK reformat portion of results (names) within larger string - python

My goal is to reformat names from Last First Middle (LFM) to First Middle Last (FML), which are part of a larger string. Here's some sample data:
Name, Address1, Address2
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO
What I would like is:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
I know how to reorder the first column, but I end up dropping the rest of the row data:
# Return the first column via comma separation (name), then separate it by spaces.
# If there are two strings but not three (only a last and a first name),
# then change the order to first last.
awk -F, '{print $1}'| awk -F" " '$2!="" && $3=="" {print $2,$1}' >> names.txt
awk -F, '{print $1}'| awk -F" " '$3!="" && $4=="" {print $3,$1,$2}' >> names.txt
...# Continue to iterate column numbers
If there's an easier way to take the last string found and move it to the front, I'd like to hear about it, but here's my real interest...
My problem is that I want to reorder the space-separated fields of the first comma-separated field (what I did above), but then also print the rest of the comma-separated data.
Is there a way I can store the address info in a variable and append it after the space-separated names?
Alternatively, could I do some kind of nested split?
I'm currently doing this with awk in bash, but am willing to use python/pandas or any other efficient methods.
Thanks for the help!

Using sed, looks terrible but works:
sed -E '2,$s/^([^ ,]*) ([^ ,]*)( [^,]*)?/\2\3 \1/' in
and POSIX version:
sed '2,$s/^\([^ ,]*\) \([^ ,]*\)\( [^,]*\)*/\2\3 \1/' in
output:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO

The following AWK script, as ugly as it is, works for your inputs (run with awk -F, -f script.awk):
{
    split($1, names, " ");
    for (i = 2; i <= length(names); i++)
        printf("%s ", names[i]);
    printf("%s, ", names[1]);
    for (i = 2; i < NF; i++)
        printf("%s,", $i);
    print($NF)
}
Input:
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO
Output:
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
The same solution in Python:
import sys
import re

for line in sys.stdin:
    # strip the trailing newline so print doesn't double it;
    # a raw string avoids the invalid-escape warning in the regex
    parts = re.split(r'\s*,\s*', line.strip())
    names = parts[0].split()
    print(", ".join([" ".join(names[1:] + names[:1])] + parts[1:]))
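If you'd rather use pandas (the question mentions it as an option), here is a minimal sketch of the same reordering, assuming the data is valid CSV with the header row shown; the inline sample below stands in for a real file:

```python
import io
import pandas as pd

# inline stand-in for the real CSV file
csv_data = """Name, Address1, Address2
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO"""

# skipinitialspace drops the blank after each comma
df = pd.read_csv(io.StringIO(csv_data), skipinitialspace=True)

def lfm_to_fml(name):
    # move the first space-separated word (the last name) to the end;
    # single-word names pass through unchanged
    parts = name.split()
    return " ".join(parts[1:] + parts[:1])

df["Name"] = df["Name"].apply(lfm_to_fml)
print(df.to_csv(index=False))
```

This keeps the address columns intact because only the Name column is transformed.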

Another awk. This one works with the header line and Madonna (ie. single word fields):
$ awk '                        # using awk
BEGIN { FS = OFS = "," }       # csv
{
    n = split($1, a, " ")      # split the first field to a
    for (i = n; i > 1; i--)    # iterate back from the last element of a
        a[1] = a[i] " " a[1]   # prepending to the first element of a
    $1 = a[1]                  # replace the first field with the first element of a
}1' file                       # output
Output:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
Madonna, ...

$ awk '
BEGIN { FS = OFS = ", " }
$1 ~ / / {
    last = rest = $1
    sub(/ .*/, "", last)       # keep only the first word (the last name)
    sub(/[^ ]+ /, "", rest)    # drop the first word, keeping the rest
    $1 = rest " " last
}
{ print }
' file
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO


Split column in DataFrame based on item in list

I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.
Input data:
Address Text
0 11 North Warren Circle Lisbon Falls ME 04252
1 227 Cony Street Augusta ME 04330
2 70 Buckner Drive Battle Creek MI
3 718 Perry Street Big Rapids MI
4 14857 Martinsville Road Van Buren MI
5 823 Woodlawn Ave Dallas TX 75208
6 2525 Washington Avenue Waco TX 76710
7 123 South Main St Dallas TX 75201
The output I'm trying to achieve (for all rows, but I only wrote out the first two to save time)
City State Postcode
0 Lisbon Falls ME 04252
1 Augusta ME 04330
My code:
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand = True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand = True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
# This is where I got stuck
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df
Here's a way to do that:
import pandas as pd

# data
df = pd.DataFrame(
    ['11 North Warren Circle Lisbon Falls ME 04252',
     '227 Cony Street Augusta ME 04330',
     '70 Buckner Drive Battle Creek MI',
     '718 Perry Street Big Rapids MI',
     '14857 Martinsville Road Van Buren MI',
     '823 Woodlawn Ave Dallas TX 75208',
     '2525 Washington Avenue Waco TX 76710',
     '123 South Main St Dallas TX 75201'],
    columns=['Address Text'])

# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)

# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]

def find_city(address, state, street_synonyms):
    for syn in street_synonyms:
        if syn in address:
            # remove street
            city = address.split(syn)[-1]
            # remove State and postcode
            city = city.split(state)[0]
            return city

df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)
print(df[['City', 'State', 'Zip']])
"""
City State Zip
0 Lisbon Falls ME 04252
1 Augusta ME 04330
2 Battle Creek MI NaN
3 Big Rapids MI NaN
4 Van Buren MI 14857
5 Dallas TX 75208
6 nue Waco TX 76710
7 Dallas TX 75201
"""
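One caveat visible in that output: row 6 shows "nue Waco" because "Ave" matches as a substring inside "Avenue". A sketch of a variant of find_city that anchors the synonyms on word boundaries (with longer names listed first) avoids the substring problem:

```python
import re

street_synonyms = ["Avenue", "Circle", "Street", "Drive", "Road", "Ave", "St"]
# \b anchors mean "Ave" cannot match inside "Avenue" or "St" inside "South"
syn_re = re.compile(r"\b(" + "|".join(street_synonyms) + r")\b")

def find_city(address, state):
    m = syn_re.search(address)
    if not m:
        return None
    # everything between the street synonym and the state is the city
    return address[m.end():].split(state)[0].strip()
```

It would slot into the answer above as df.apply(lambda x: find_city(x['Address Text'], x['State']), axis=1), though the Zip extraction still picks up a house number when the row has no postcode.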

Compute a Python program to display the following output with the original 13 states in alphabetical order

The file States.txt contains the 50 U.S. states in the order in which they joined the union. Write a program, using the for loop, to display the following output with the original 13 states in alphabetical order.
Content inside the file States.txt:
Delaware
Pennsylvania
New Jersey
Georgia
Connecticut
Massachusetts
Maryland
South Carolina
New Hampshire
Virginia
New York
North Carolina
Rhode Island
Vermont
Kentucky
Tennessee
Ohio
Louisiana
Indiana
Mississippi
Illinois
Alabama
Maine
Missouri
Arkansas
Michigan
Florida
Texas
Iowa
Wisconsin
California
Minnesota
Oregon
Kansas
West Virginia
Nevada
Nebraska
Colorado
North Dakota
South Dakota
Montana
Washington
Idaho
Wyoming
Utah
Oklahoma
New Mexico
Arizona
Alaska
Hawaii
My current coding:
infile = open("States.txt", 'r')
states = [line.rstrip() for line in infile]
states.sort()  # I sort those 50 states in alphabetical order first
for state in states:
    # I locate the exact positions of those 13 states
    print("\r", states[6], "\n", states[7], "\n", states[9], "\n", states[19], "\n", states[20], "\n", states[28], "\n", states[29], "\n", states[31], "\n", states[32], "\n", states[37], "\n", states[38], "\n", states[39], "\n", states[45], end="")
    break
infile.close()
Although I can display the expected output with the code above, it is definitely not a good use of the for loop, since I could produce the exact same result without the loop at all.
The expected output for the program (the 13 states need to be displayed vertically):
Connecticut
Delaware
Georgia
Maryland
Massachusetts
New Hampshire
New Jersey
New York
North Carolina
Pennsylvania
Rhode Island
South Carolina
Virginia
You could use slicing to take the first 13 entries of states before sorting:
infile = open("States.txt", 'r')
states = [line.rstrip() for line in infile]
original_states = states[:13]
original_states.sort()
for state in original_states:
    print(state)
infile.close()
So if the full list contains 50 states and you need to output 13 of them in sorted order, you could make a list containing the 13 key states, check in your for loop whether each state is one of them, and print it only if it is.
You can use itertools.islice to read only the first 13 lines of the file before sorting them:
from itertools import islice

with open('States.txt') as infile:
    print(*sorted(islice(infile, 13)), sep='')
This is my first contribution to stackoverflow so please go easy on me. I want to get more involved and give back because I have been helped greatly by this community.
Someone probably can simplify this more, but this is my solution:
filename = "States.txt"
states = list()
with open(filename) as file:
    # convert to a list of words
    items = list(file.read().split())
Generate the list of state names from the file. Because the file contains space-separated names, you need to keep some names together, e.g. New York:
two_word_state_names = ['New', 'North', 'South', 'West', 'Rhode']
skip = False
for idx, item in enumerate(items):
    if skip:
        skip = False
        continue
    elif item in two_word_state_names:
        states.append(str(items[idx] + " " + items[idx + 1]))
        skip = True  # skip the next iteration because the next item is the second word of the state
    else:
        states.append(item)
Now you can sort and print the first thirteen:
thirteen_states = states[:13]
thirteen_states.sort()
# Print result
for state in thirteen_states:
    print(state)
Connecticut
Delaware
Georgia
Maryland
Massachusetts
New Hampshire
New Jersey
New York
North Carolina
Pennsylvania
Rhode Island
South Carolina
Virginia
Your original question had the states as space-separated words, which required extra logic to extract and represent the two-word states. With your edited question, it is much simpler:
with open('States.txt') as file:
    states = list(file.read().split('\n')[:13])
states.sort()
for state in states:
    print(state)
Connecticut
Delaware
Georgia
Maryland
Massachusetts
New Hampshire
New Jersey
New York
North Carolina
Pennsylvania
Rhode Island
South Carolina
Virginia

Append a sub-string at the beginning in a DataFrame instead of at the end

I'd like to extract a sub-string from Name and add it in front of Address, but str.cat appends it to the end by default.
My data:
Name | Address
Eleanor A. Martin #/222 Rhapsody | Street 32601 Florida
Ann K. Wagner | 3071 Half and Half Drive Hialeah FL 33012
My code:
df = pd.DataFrame([['Eleanor A. Martin #/222 Rhapsody ','Street 32601 Florida'],['Ann K. Wagner','3071 Half and Half Drive Hialeah FL 33012']],columns=['Name','Address'])
df['Address'] = df['Address'].str.cat(df['Name'].str.extract(r'#/(.*)'), sep=' ', na_rep = '').str.strip()
Current Result:
Name | Address
Eleanor A. Martin #/222 Rhapsody | Street 32601 Florida 222 Rhapsody
Ann K. Wagner | 3071 Half and Half Drive Hialeah FL 33012
Desired result:
Name | Address
Eleanor A. Martin #/222 Rhapsody | 222 Rhapsody Street 32601 Florida
Ann K. Wagner | 3071 Half and Half Drive Hialeah FL 33012
This is not working on my data set (it mixes up different rows):
df['Address'] = df['Name'].str.extract(r'#/(.*)') + " " + df['Address']
How can I add the sub-string from Name in front of the string in Address?
First pass expand=False so Series.str.extract returns a Series instead of a DataFrame, then append the separator, replace missing values with empty strings, and finally add the Address column:
df['Address'] = (df['Name'].str.extract(r'#/(.*)', expand=False).add(" ").fillna('') +
                 df['Address'])
Alternative:
df['Address'] = ((df['Name'].str.extract(r'#/(.*)', expand=False) + " ").fillna('') +
                 df['Address'])
print(df)
Name \
0 Eleanor A. Martin #/222 Rhapsody
1 Ann K. Wagner
Address
0 222 Rhapsody Street 32601 Florida
1 3071 Half and Half Drive Hialeah FL 33012
Similar to your original solution:
df['Address'] = df['Name'].str.extract(r'#/(.*)').str.cat(df['Address'], sep=' ', na_rep = '').str.strip()

How to apply regex to get the exact house number with approximate residual address match

import re

list = []
for element in address1:
    z = re.match(r"^\d+", element)
    if z:
        list.append(z.string)
get_best_fuzzy("SATYAGRAH;OPP. RAJ SUYA BUNGLOW", list)
I have tried the code above; it gives me an approximate match for the addresses in my text file. How can I get an exact house-number match combined with an approximate match on the rest of the address? My addresses are in this format:
1004; Jay Shiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India
1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India
101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India
101/32-B; SHREE GANESH COMPLEX VEER SAVARKAR BLOCK; SHAKARPUR; EASE DEL HI DELHI Delhi 110092 India
You can try this.
Code:
import re

address = ["1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India",
           "101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India",
           "101/32-B; SHREE GANESH COMPLEX VEER SAVARKAR BLOCK; SHAKARPUR; EASE DEL HI DELHI Delhi 110092 India"]
for i in address:
    z = re.match(r"^([^ ;]+)", i)   # everything up to the first space or semicolon
    print(z.group())
Output :
1004
101
101/32-B
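Building on that, and since get_best_fuzzy isn't shown in the question, here is one possible sketch using only the stdlib's difflib: require the house number to match exactly, then rank candidates by an approximate ratio on the remainder of the address (the helper names are illustrative):

```python
import re
from difflib import SequenceMatcher

addresses = [
    "1004; Jay Shiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India",
    "1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India",
    "101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India",
]

def split_house_number(address):
    # house number = everything before the first space or semicolon
    m = re.match(r"^([^ ;]+)[ ;]*(.*)", address)
    return m.group(1), m.group(2)

def best_match(query, candidates):
    q_num, q_rest = split_house_number(query)
    best, best_score = None, -1.0
    for cand in candidates:
        c_num, c_rest = split_house_number(cand)
        if c_num != q_num:  # exact house-number match required
            continue
        score = SequenceMatcher(None, q_rest.lower(), c_rest.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best

# e.g. best_match("1004; Jayshiva Tower; Near Azad Society", addresses)
```

Candidates whose house number differs are never considered, so only the free-text part of the address is matched approximately.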

Pandas exits early before I can read entire Excel file

I'm trying to read an Excel file into Pandas, access particular columns of each row, geocode each address to coordinates, and then write them to a CSV.
The geocoding part works well, and as far as I know my loop starts out fine, since it can read the address. However, it just stops at 22 rows. I have no clue why; I've been using Pandas with this same Excel file for something else and it does fine there. The file has 27k rows in it, and printing data.__len__() gives me 27395. Any help?
##### READ IN DATA
file = r'rollingsales_manhattan.xls'

# Read in the data from the Excel file
data = pd.read_excel(file)

# g = geocoder.osm(str(data['ADDRESS'].iloc[0]) + " New York City, NY " + str(data['ZIP CODE'].iloc[0]))

with open("geotagged_manhattan.csv", 'wb') as result_file:
    wr = csv.writer(result_file)
    for index, d in enumerate(data):
        print(str(data['ADDRESS'].iloc[index]) + " New York City, NY " + str(data['ZIP CODE'].iloc[index]))
Then my output...
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
229 EAST 2ND STREET New York City, NY 10009
243 EAST 7TH STREET New York City, NY 10009
238 EAST 4TH STREET New York City, NY 10009
303 EAST 4TH STREET New York City, NY 10009
Process finished with exit code 0
Iterating over a DataFrame directly yields its column names, so your loop ran once per column, not once per row; that's why it stopped after ~22 iterations. Iterate over the two Series together with zip() instead:
with open("geotagged_manhattan.csv", 'w', newline='') as result_file:
    wr = csv.writer(result_file)
    for a, z in zip(data['ADDRESS'], data['ZIP CODE']):
        print(str(a) + " New York City, NY " + str(z))
