Python categorize and create new columns from CSV full addresses

How to categorize and create new columns from full addresses?
The address string is comma-separated and uses these keywords:
*district
*city
*borough
*village
*municipality
*county
The source full address format can look like this:
col1;col2;fulladdress
data1_1;data2_1;Some district, Cityname, Some county
data1_2;data2_2;Another village, Another municipality, Another county
data1_3;data2_3;Third city, Third county
data1_4;data2_4;Forth borough, Forth municipality, Forth county
There is one peculiarity with one city in particular: the city is called "Clause", and in the full address string it is sometimes written as "Clause city" and sometimes as just "Clause".
For example:
Clause city, Third county
Seventh district, Clause, Seventh municipality
I want to keep only one form, "Clause city", to avoid duplicates in the output. If "Clause" appears alone in the full address string, it should be renamed to "Clause city" in the CSV.
The source data file is called data.csv, and the categorized version should be written to exported.csv.
All I have is this:
import pandas as pd
df = pd.read_csv('data.csv', delimiter=';')
address = df.fulladdress.str.split(',', expand=True)
district = df[df.fulladdress.str.match('district')]
city = df[df.fulladdress.str.match('city')]
borough = df[df.fulladdress.str.match('borough')]
village = df[df.fulladdress.str.match('village')]
municipality = df[df.fulladdress.str.match('municipality')]
county = df[df.fulladdress.str.match('county')]
df.to_csv('exported.csv', sep=';', encoding='utf8', index=True)
print ("Done!")

If there is only one problem city, I think you could use replace?
raw csv data:
Some district, Cityname, Some county
Another village, Another municipality, Another county
Third city, Third county
Forth borough, Forth municipality, Forth county
Clause city, Third county
Seventh district, Clause, Seventh municipality
replace solution:
df = pd.read_csv('clause.csv', names=['district', 'city', 'country'], sep=',')
# need to strip white space for replace to work
df = df.apply(lambda x: x.str.strip())
df.replace('Clause', 'Clause city', inplace=True)
output:
           district                  city               country
0     Some district              Cityname           Some county
1   Another village  Another municipality        Another county
2        Third city          Third county                   NaN
3     Forth borough    Forth municipality          Forth county
4       Clause city          Third county                   NaN
5  Seventh district           Clause city  Seventh municipality
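The replace approach above fixes the "Clause" naming but still assigns columns by position. A minimal sketch of keyword-based categorization is below, using only the standard library. The helper names categorize and convert are made up for illustration, and treating a part without any keyword (such as "Cityname") as a city is an assumption, not something the question states.

```python
import csv
import io

# Keywords from the question; each one becomes its own output column.
KEYWORDS = ["district", "city", "borough", "village", "municipality", "county"]


def categorize(full_address):
    """Map each comma-separated part to the keyword it ends with."""
    row = dict.fromkeys(KEYWORDS, "")
    for part in full_address.split(","):
        part = part.strip()
        if part == "Clause":  # normalize the one special-case city
            part = "Clause city"
        for keyword in KEYWORDS:
            if part.lower().endswith(" " + keyword):
                row[keyword] = part
                break
        else:
            # Assumption: a part without any keyword (e.g. "Cityname") is a city.
            if part and not row["city"]:
                row["city"] = part
    return row


def convert(src, dst):
    """Read ';'-separated rows with a fulladdress column, write categorized rows."""
    reader = csv.DictReader(src, delimiter=";")
    fields = [f for f in reader.fieldnames if f != "fulladdress"] + KEYWORDS
    writer = csv.DictWriter(dst, fieldnames=fields, delimiter=";", lineterminator="\n")
    writer.writeheader()
    for record in reader:
        address = record.pop("fulladdress")
        record.update(categorize(address))
        writer.writerow(record)


# In-memory stand-in for data.csv so the sketch runs without any files.
sample = io.StringIO(
    "col1;col2;fulladdress\n"
    "data1_1;data2_1;Some district, Cityname, Some county\n"
    "data1_2;data2_2;Seventh district, Clause, Seventh municipality\n"
)
out = io.StringIO()
convert(sample, out)
print(out.getvalue())
```

For real files, convert could be called with open('data.csv', newline='') and open('exported.csv', 'w', newline='') in place of the StringIO objects.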

Related

Conditional extraction of a substring from a string

I've got a DataFrame with a column that is an object and contains the full address in one of the following formats:
'street name, building number',
'city, street, building number',
'city, district, street, building number'.
Regardless of the format I need to extract the name of the street and copy it to a new column. I cannot attach the original DF since all the information is in Russian. I've created a dummy DF instead:
df = pd.DataFrame({'address':['new york city, the bronx borough, willis avenue, building 34',
'town of salem, main street, building 105',
'second boulevard, 12'],
'street':0})
N.B. Different parts of one string are always separated by one comma. The substring with the street name in it always contains one of the words: 'street', 'avenue', 'boulevard'.
After several hours of Googling I've come up with something like this but to no avail:
street_list = ['street', 'avenue', 'boulevard']
for row in df:
    for x in street_list:
        if df.loc[row, 'address'].split(', ')[0].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[0]
        elif df.loc[row, 'address'].split(', ')[1].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[1]
        elif df.loc[row, 'address'].split(', ')[2].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[2]
This code doesn't work for me. Is it possible to tweak it somehow so that it works(or maybe you know a better solution)?
Please let me know if any additional information is required.

As far as I understand:
1. The street can be in one of several positions in the comma-separated values, depending on the number of items.
2. The street candidate needs an additional substring check.
In the code below:
Point 1 is represented by streetMap
Point 2 is represented by the 'any' condition
import pandas as pd
df = pd.DataFrame({'address':['new york city, the bronx borough, willis avenue, building 34',
'town of salem, main street, building 105',
'second boulevard, 12'],
'street':0})
streetMap = {2:0,3:1,4:2} # Map of length of items to location of street.
street_list = ['street', 'avenue', 'boulevard']
addresses = df['address']
streets = []
for address in addresses:
    items = address.split(', ')
    streetCandidate = items[streetMap[len(items)]]
    street = streetCandidate if any([s in streetCandidate for s in street_list]) else "NA"
    streets.append(street)
df['street'] = streets
print(df)
Output:
0 new york city, the bronx borough, willis avenu... willis avenue
1 town of salem, main street, building 105 main street
2 second boulevard, 12 second boulevard
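Since the question states that the street part always contains one of the keywords, the position map can be skipped entirely. The following is only a sketch of that alternative; the function name extract_street and the "NA" default are choices made here, not part of the answer above.

```python
STREET_WORDS = ("street", "avenue", "boulevard")


def extract_street(address, default="NA"):
    """Return the first comma-separated part that mentions a street word."""
    for part in address.split(","):
        part = part.strip()
        if any(word in part.lower() for word in STREET_WORDS):
            return part
    return default


addresses = [
    "new york city, the bronx borough, willis avenue, building 34",
    "town of salem, main street, building 105",
    "second boulevard, 12",
]
streets = [extract_street(a) for a in addresses]
print(streets)  # ['willis avenue', 'main street', 'second boulevard']
```

With the DataFrame from the question, the same function could be applied as df['street'] = df['address'].apply(extract_street).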

strip data frame cell then create columns

I'm trying to take the info from a DataFrame and break it out into columns with the following header names. The info is all crammed into one cell.
I'm new to Python, so be gentle.
Thanks for the help.
My code:
r=requests.get('https://nclbgc.org/search/licenseDetails?licenseNumber=80479')
page_data = soup(r.text, 'html.parser')
company_info = [' '.join(' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('tr'))]
df = pd.DataFrame(company_info, columns = ['ic_number, status, renewal_date, company_name, address, county, telephon, limitation, residential_qualifiers'])
print(df)
The result I get:
['License Number, 80479 Status, Valid Renewal Date, n/a Name, DLR Construction, LLC Address, 3217 Vagabond Dr Monroe, NC 28110 County, Union Telephone, (980) 245-0867 Limitation, Limited Classifications, Residential Qualifiers, Arteaga, Vicky Rodriguez']

You can use read_html with some post processing:
url = 'https://nclbgc.org/search/licenseDetails?licenseNumber=80479'
#select first table from list of tables, remove rows that contain only NaNs
df = pd.read_html(url)[0].dropna(how='all')
#forward fill NaNs in first column
df[0] = df[0].ffill()
#merge values in second column
df = df.groupby(0)[1].apply(lambda x: ' '.join(x.dropna())).to_frame().rename_axis(None).T
print (df)
Address Classifications County License Number \
1 3217 Vagabond Dr Monroe, NC 28110 Residential Union 80479
Limitation Name Qualifiers Renewal Date \
1 Limited DLR Construction, LLC Arteaga, Vicky Rodriguez
Status Telephone
1 Valid (980) 245-0867

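Replace the df line as shown below:
df = pd.DataFrame(company_info, columns = ['ic_number', 'status', 'renewal_date', 'company_name', 'address', 'county', 'telephon', 'limitation', 'residential_qualifiers'])
Each column name in columns must be a separate quoted string; otherwise the whole thing is treated as a single column name. Note that each row of the data must then contain one value per column; company_info as built above is a single string, so it would still need to be split into nine values first.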

Extract last term after comma into new column

I have a pandas dataframe which is essentially 2 columns and 9000 rows
CompanyName | CompanyAddress
and the address is in the form
Line1, Line2, ..LineN, PostCode
i.e. basically different numbers of comma-separated items in a string (or dtype 'object'), and I want to just pull out the post code i.e. the item after the last comma in the field
I've tried the Dot notation string manipulation suggestions (possibly badly):
df_address['CompanyAddress'] = df_address['CompanyAddress'].str.rsplit(', ')
which just put '[ ]' around the fields - I had no success trying to isolate the last component of any split-up/partitioned string, with maxsplit kicking up errors.
I had a small degree of success following EdChum's comment on "Pandas split Column into multiple columns by comma":
pd.concat([df_address[['CompanyName']], df_address['CompanyAddress'].str.rsplit(', ', expand=True)], axis=1)
However, whilst isolating the Postcode, this just creates multiple columns and the post code is in columns 3-6... equally no good.
It feels incredibly close, please advise.
EmployerName Address
0 FAUCET INN LIMITED [Union, 88-90 George Street, London, W1U 8PA]
1 CITIBANK N.A [Citigroup Centre,, Canary Wharf, Canada Squar...
2 AGENCY 2000 LIMITED [Sovereign House, 15 Towcester Road, Old Strat...
3 Transform Trust [Unit 11 Castlebridge Office Village, Kirtley ...
4 R & R.C.BOND (WHOLESALE) LIMITED [One General Street, Pocklington Industrial Es...
5 MARKS & SPENCER FINANCIAL SERVICES PLC [Marks & Spencer Financial, Services Kings Mea...

Given the DataFrame,
df = pd.DataFrame({'Name': ['ABC'], 'Address': ['Line1, Line2, LineN, PostCode']})
Address Name
0 Line1, Line2, LineN, PostCode ABC
If you need only the post code, you can extract it using rsplit and reassign it to the Address column. This saves you the concat step.
df['Address'] = df['Address'].str.rsplit(',', n=1).str[-1].str.strip()
You get
Address Name
0 PostCode ABC
Edit: Given that you have a dataframe with the address values in a list
df = pd.DataFrame({'Name': ['FAUCET INN LIMITED'], 'Address': [['Union, 88-90 George Street, London, W1U 8PA']]})
Address Name
0 [Union, 88-90 George Street, London, W1U 8PA] FAUCET INN LIMITED
You can get last element using
df['Address'] = df['Address'].apply(lambda x: x[0].split(',')[-1].strip())
You get
Address Name
0 W1U 8PA FAUCET INN LIMITED

Just rsplit the existing column into 2 columns - the existing one and a new one. Or two new ones if you want to keep the existing column intact.
df[['Address', 'PostCode']] = df['Address'].str.rsplit(', ', n=1, expand=True)
Edit: Since OP's Address column is a list with 1 string in it, here is a solution for that specifically:
df[['Address', 'PostCode']] = df['Address'].map(lambda x: x[0]).str.rsplit(', ', n=1, expand=True)

rsplit returns a list; try rsplit(',')[-1] to get the last element of the source line.
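To make the "split once from the right" idea concrete, here is a small plain-Python sketch that works on a single address string. The name split_postcode is hypothetical, and returning an empty first element when there is no comma at all is an assumption about how such rows should be handled.

```python
def split_postcode(address):
    """Split 'Line1, ..., PostCode' into (rest, postcode) at the last comma."""
    rest, sep, last = address.rpartition(",")
    if not sep:
        # Assumption: with no comma at all, the whole string is the post code.
        return "", address.strip()
    return rest.strip(), last.strip()


print(split_postcode("Union, 88-90 George Street, London, W1U 8PA"))
# ('Union, 88-90 George Street, London', 'W1U 8PA')
```

On a DataFrame column of strings, something like df['CompanyAddress'].map(lambda a: split_postcode(a)[1]) would give just the post codes.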

Iterate geolocation over pandas dataframe

I have a dataframe that has two columns, Hospital name and Address, and I want to iterate through each address to find the latitude and longitude. My code seems to be taking the first row in the dataframe and I can't seem to select the address to find the coordinates.
import pandas
from geopy.geocoders import Nominatim
geolocator = Nominatim()
for index, item in df.iterrows():
    location = geolocator.geocode(item)
    df["Latitude"].append(location.latitude)
    df["Longitude"].append(location.longitude)
Here is the code I used to scrape the website. Copy and run this and you'll have the data set.
import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np
r=requests.get("https://www.privatehealth.co.uk/hospitals-and-clinics/orthopaedic-surgery/?offset=300")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all(["div"],{"class":"col-9"})
names = []
for item in all:
    d={}
    d["Hospital Name"] = item.find(["h3"],{"class":"mb6"}).text.replace("\n","")
    d["Address"] = item.find(["p"],{"class":"mb6"}).text.replace("\n","")
    names.append(d)
df=pandas.DataFrame(names)
df = df[['Hospital Name','Address']]
df
Currently the data looks like (one hospital example):
Hospital Name |Address
Fulwood Hospital|Preston, PR2 9SZ
The final output that I'm trying to achieve looks like.
Hospital Name |Address | Latitude | Longitude
Fulwood Hospital|Preston, PR2 9SZ|53.7589938|-2.7051618

Seems like there are a few issues here. Using data from the URL you provided:
df.head()
Hospital Name Address
0 Fortius Clinic City London, EC4N 7BE
1 Pinehill Hospital - Ramsay Health Care UK Hitchin, SG4 9QZ
2 Spire Montefiore Hospital Hove, BN3 1RD
3 Chelsea & Westminster Hospital London, SW10 9NH
4 Nuffield Health Tunbridge Wells Hospital Tunbridge Wells, TN2 4UL
(1) If your data frame column names really are Hospital name and Address, then you need to use item.Address in the call to geocode().
Just using item will give you both Hospital name and Address.
for index, item in df.iterrows():
    print(f"index: {index}")
    print(f"item: {item}")
    print(f"item.Address only: {item.Address}")
# Output:
index: 0
item: Hospital Name Fortius Clinic City
Address London, EC4N 7BE
Name: 0, dtype: object
item.Address only: London, EC4N 7BE
...
(2) You noted that your data frame only has two columns. If that's true, you'll get a KeyError when you try to perform operations on df["Latitude"] and df["Longitude"], because they don't exist.
(3) Using apply() on the Address column might be clearer than iterrows().
Note that this is a stylistic point, and debatable. (The first two points are actual errors.)
For example, using the provided URL:
from geopy.geocoders import Nominatim
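# recent geopy versions require an explicit user_agent; "hospital-geocoder" is just a placeholder name
geolocator = Nominatim(user_agent="hospital-geocoder")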
tmp = df.head().copy()
latlon = tmp.Address.apply(lambda addr: geolocator.geocode(addr))
tmp["Latitude"] = [x.latitude for x in latlon]
tmp["Longitude"] = [x.longitude for x in latlon]
Output:
Hospital Name Address \
0 Fortius Clinic City London, EC4N 7BE
1 Pinehill Hospital - Ramsay Health Care UK Hitchin, SG4 9QZ
2 Spire Montefiore Hospital Hove, BN3 1RD
3 Chelsea & Westminster Hospital London, SW10 9NH
4 Nuffield Health Tunbridge Wells Hospital Tunbridge Wells, TN2 4UL
Latitude Longitude
0 51.507322 -0.127647
1 51.946413 -0.279165
2 50.840871 -0.180561
3 51.507322 -0.127647
4 51.131528 0.278068
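One more thing to watch for: geocode() returns None when it finds no match, so x.latitude would fail for such rows. A rough sketch of a guard is below. It takes the geocoding function as a parameter so it can be tried without network access; lat_lon, fake_geocode and the Location tuple are stand-ins invented for this example, and the coordinates are the ones from the question's expected output.

```python
from collections import namedtuple

# Stand-in for a geocoder result; real results also expose latitude/longitude.
Location = namedtuple("Location", "latitude longitude")


def lat_lon(geocode, address):
    """Return (latitude, longitude), or (None, None) if nothing was found."""
    location = geocode(address)
    if location is None:
        return (None, None)
    return (location.latitude, location.longitude)


def fake_geocode(address):
    """Offline replacement for geolocator.geocode, for demonstration only."""
    known = {"Preston, PR2 9SZ": Location(53.7589938, -2.7051618)}
    return known.get(address)


print(lat_lon(fake_geocode, "Preston, PR2 9SZ"))   # (53.7589938, -2.7051618)
print(lat_lon(fake_geocode, "Nowhere, XX0 0XX"))   # (None, None)
```

In the real code, geolocator.geocode would be passed instead of fake_geocode.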

Rearrange CSV data

I have 2 CSV files with different column orders. For example, the first file starts with the 10-digit mobile number, while that column is at position 4 in the second file.
I need to merge all the customer data into a single csv file. The order of the columns should be as follows:
mobile pincode model Name Address Location pincode date
mobile Name Address Model Location pincode Date
9845299999 Raj Shah nagar No 22 Rivi Building 7Th Main I Crz Mumbai 17/02/2011
9880877777 Managing Partner M/S Aitas # 1010, 124Th Main, Bk Stage. - Bmw 320 D Hyderabad 560070 30-Dec-11
Name Address Location mobile pincode Date Model
Asvi Developers pvt Ltd fantry Road Nariman Point, 1St Floor, No. 150 Chennai 9844066666 13/11/2011 Crz
L R Shiva Gaikwad & Sudha Gaikwad # 42, Suvarna Mansion, 1St Cross, 17Th Main, Banjara Hill, B S K Stage,- Bangalore 9844233333 560085 40859 Mercedes_E 350 Cdi
Second task and that may be slightly difficult is that the new files expected may have a totally different column sequence. In that case I need to extract 10 digits mobile number and 6 digits pincode column. I need to write the code that will guess the city column if it matches with any of the given city list. The new files are expected to have relevant column headings but the column heading may be slightly different. for e.g. "customer address" instead of "address". How do I handle such data?
sed 's/.*\([0-9]\{10\}\).*/\1,&/' input
I have been suggested to use sed to rearrange the 10 digits column at the beginning. But I do also need to rearrange the text columns. For e.g. if a column matches the entries in the following list then it is undoubtedly model column.
['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
If at least 10% of a column's entries match the list above, it is the "model" column and should be at position 3, after mobile and pincode.

For your first question, I suggest using pandas to load both files and then concat. After that you can rearrange your columns.
import pandas as pd
dataframe1 = pd.read_csv('file1.csv')
dataframe2 = pd.read_csv('file2.csv')
combined = pd.concat([dataframe1, dataframe2]) # the resulting column order may not be the one you want
To get the desired order (the names must match the CSV headers exactly, including case):
result_df = combined[['mobile', 'pincode', 'Model', 'Name', 'Address', 'Location', 'Date']]
and then result_df.to_csv('output.csv', index=False) to export to a CSV file.
For the second one, you can do something like this (assuming you have loaded a csv file into df like above)
match_model = lambda m: m in ['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
for c in df:
    # a column is treated as 'Model' if more than 10% of its values are known models
    if df[c].map(match_model).sum() / len(df) > 0.1:
        print("Column %s is 'Model'" % c)
        df.rename(columns={c: 'Model'}, inplace=True)
You can modify the matching function match_model to use regex instead if you want.
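The same share-of-matches idea can be extended to the mobile and pincode columns from the question using regular expressions. The sketch below is one possible way to do it; guess_kind is a hypothetical helper, and the 10% threshold is simply borrowed from the rule the question gives for the model column.

```python
import re

MOBILE = re.compile(r"\d{10}")   # exactly 10 digits
PINCODE = re.compile(r"\d{6}")   # exactly 6 digits


def guess_kind(values, threshold=0.1):
    """Guess whether a column holds mobile numbers or pincodes, else None."""
    values = [str(v).strip() for v in values]
    if not values:
        return None

    def share(pattern):
        # Fraction of values that consist entirely of the pattern.
        return sum(bool(pattern.fullmatch(v)) for v in values) / len(values)

    if share(MOBILE) > threshold:
        return "mobile"
    if share(PINCODE) > threshold:
        return "pincode"
    return None


print(guess_kind(["9845299999", "9880877777"]))  # mobile
print(guess_kind(["560070", "", "560085"]))      # pincode
print(guess_kind(["Mumbai", "Hyderabad"]))       # None
```

With pandas, each column could be checked with guess_kind(df[c]). Reading the CSV with dtype=str would avoid numbers being turned into floats such as 560070.0, which this pattern would not match.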
