I've got a DataFrame with a column that is an object and contains the full address in one of the following formats:
'street name, building number',
'city, street, building number',
'city, district, street, building number'.
Regardless of the format, I need to extract the street name and copy it to a new column. I cannot attach the original DF since all the information is in Russian, so I've created a dummy DF instead:
df = pd.DataFrame({'address': ['new york city, the bronx borough, willis avenue, building 34',
                               'town of salem, main street, building 105',
                               'second boulevard, 12'],
                   'street': 0})
N.B. Different parts of one string are always separated by one comma. The substring with the street name in it always contains one of the words: 'street', 'avenue', 'boulevard'.
After several hours of Googling I've come up with something like this, but to no avail:
street_list = ['street', 'avenue', 'boulevard']
for row in df:
    for x in street_list:
        if df.loc[row, 'address'].split(', ')[0].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[0]
        elif df.loc[row, 'address'].split(', ')[1].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[1]
        elif df.loc[row, 'address'].split(', ')[2].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[2]
This code doesn't work for me. Is it possible to tweak it somehow so that it works (or maybe you know a better solution)?
Please let me know if any additional information is required.
As far as I understand:
1. The street can be in one of several positions in the comma-separated values, depending on their length.
2. The street has an additional substring check.
In the below code:
Point 1 is represented by the streetMap
Point 2 is represented by the 'any' condition
import pandas as pd
df = pd.DataFrame({'address': ['new york city, the bronx borough, willis avenue, building 34',
                               'town of salem, main street, building 105',
                               'second boulevard, 12'],
                   'street': 0})
streetMap = {2:0,3:1,4:2} # Map of length of items to location of street.
street_list = ['street', 'avenue', 'boulevard']
addresses = df['address']
streets = []
for address in addresses:
    items = address.split(', ')
    streetCandidate = items[streetMap[len(items)]]
    street = streetCandidate if any([s in streetCandidate for s in street_list]) else "NA"
    streets.append(street)
df['street'] = streets
print(df)
Output:
                                            address            street
0  new york city, the bronx borough, willis avenu...     willis avenue
1          town of salem, main street, building 105       main street
2                               second boulevard, 12  second boulevard
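If you'd rather not maintain the length-to-position map, a variant that simply scans each comma-separated part also works. This is just a sketch reusing the street_list above; find_street is a helper name I've made up:
# Pick the first comma-separated part containing a street keyword, else "NA".
def find_street(address):
    for part in address.split(', '):
        if any(s in part for s in street_list):
            return part
    return "NA"

df['street'] = df['address'].apply(find_street)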
During data entry, some of the states were added to the same cell as the address line. The city and state vary and are generally unknown. There are also some cases of a comma that would need to be removed.
    AddrLines       AddrCity AddrState
0  123 street  Titusville FL       NaN
1    456 road       Viera FL       NaN
2   789 place  Melbourne, Fl       NaN
3     001 ave         Wright        VA
My goal is to clean up the City column while moving the state over to the State column. Something like this, but removing the state and the comma at the same time.
df.loc[(df['AddrCity'].str.contains(' Fl')),'AddrState'] = 'FL'
df.loc[(df['AddrCity'].str.contains(' FL')),'AddrState'] = 'FL'
df.loc[(df['AddrCity'].str.contains(', Fl')),'AddrState'] = 'FL'
I was able to get this by performing the following
df1[['AddrCity', 'State_Holder']] = df1['AddrCity'].str.replace(', ', ' ').str.replace(' ', ', ').str.split(', ', n=1, expand=True)
df1['AddrState'] = np.where(df1['AddrState'].isna(), df1['State_Holder'], df1['AddrState'])
df1 = df1.drop(columns = ['State_Holder'])
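An alternative sketch with str.extract: the regex and the parts name below are my own, and the pattern assumes the state is a two-letter token at the end of the cell, preceded by a comma and/or space:
# Capture a trailing two-letter state; rows without one are left untouched.
parts = df1['AddrCity'].str.extract(r'^(?P<City>.+?)[,\s]+(?P<State>[A-Za-z]{2})$')
df1['AddrCity'] = parts['City'].fillna(df1['AddrCity'])
df1['AddrState'] = df1['AddrState'].fillna(parts['State'].str.upper())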
Is there a way to identify dummy data in a dataframe and delete it? In my data below, there are random characters in each column that I need to delete.
import pandas as pd
import numpy as np
data = {'Name': ['Tom', 'AABBCC', 'Joseph', 'Krish', 'XXXX', 'John', 'U'],
        'Address1': ['High Street', 'uwdfjfuf', '00000', 'Green Lane', 'Kingsway', 'Church Street', 'iwefwfn'],
        'Address2': ['Park Avenue', 'The Crescent', 'ABCXYZ', 'Highfield Road', 'Stanley Road', 'New Street', '1ca2s597']}
contact_details = pd.DataFrame(data)
#Code to identify and delete dummy data
print(contact_details)
Output of the above code:
Name Address1 Address2
0 Tom High Street Park Avenue
1 AABBCC uwdfjfuf The Crescent
2 Joseph 00000 ABCXYZ
3 Krish Green Lane Highfield Road
4 XXXX Kingsway Stanley Road
5 John Church Street New Street
6 U iwefwfn 1ca2s597
Have you investigated your data? Is the "good" data always a combination of lowercase and uppercase characters? If so, you could write a function to find the dummy data, for example:
def is_dummy(text):
    # all-lowercase or all-uppercase text is treated as dummy
    return text.lower() == text or text.upper() == text
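Applying that heuristic across the whole frame could then look like this (a sketch using the is_dummy helper above):
# Flag cells that look like dummy data, then keep only fully clean rows.
mask = contact_details.apply(lambda col: col.map(is_dummy))
print(contact_details[~mask.any(axis=1)])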
Without a good definition of good and bad values for each column, there's really nothing you can do automatically. There are a couple of data cleaning tricks you can use to make those values easier to find in a large dataset.
Starting with your original dataset:
import pandas as pd
data = {'Name': ['Tom', 'AABBCC', 'Joseph', 'Krish', 'XXXX', 'John', 'U'],
        'Address1': ['High Street', 'uwdfjfuf', '00000', 'Green Lane', 'Kingsway', 'Church Street', 'iwefwfn'],
        'Address2': ['Park Avenue', 'The Crescent', 'ABCXYZ', 'Highfield Road', 'Stanley Road', 'New Street', '1ca2s597']}
contact_details = pd.DataFrame(data)
The first thing you can do is get the unique values for a column to reduce the number of values you're looking through.
# get all the unique values in the 'Name' column
names = contact_details['Name'].unique()
Next you can sort them so that any near-duplicates will stand out more easily. Near duplicates happen often with errors in data entry.
# sort them alphabetically and then take a closer look
names.sort()
print(list(names))
So for example, if you had seen the values ' Tom', 'Tom', and 'Tom ', you'd know you need to strip whitespace from names.
contact_details['Name'] = contact_details['Name'].str.strip()
Another benefit of sorting unique values in a column is that string values that start with a number will all be at the beginning of the list, and lowercase strings will be sorted to the end. This makes a couple of your 'Address1' values stand out.
# get all the unique values in the 'Address1' column
address1 = contact_details['Address1'].unique()
address1.sort()
print(list(address1))
This gives me the list of unique values:
['00000', 'Church Street', 'Green Lane', 'High Street', 'Kingsway', 'iwefwfn', 'uwdfjfuf']
It's not clear yet if the first value is valid, but those last two look suspect. If I want to remove those, I can filter them out by selecting all rows where Address1 is not in a list of bad values.
contact_details_filtered = contact_details[~contact_details['Address1'].isin(['iwefwfn', 'uwdfjfuf'])]
print(contact_details_filtered)
This gives me the output:
Name Address1 Address2
0 Tom High Street Park Avenue
2 Joseph 00000 ABCXYZ
3 Krish Green Lane Highfield Road
4 XXXX Kingsway Stanley Road
5 John Church Street New Street
Row 2 is definitely suspect, and row 4 is questionable, but I think you get the idea of how to find and remove values that look like placeholders or just bad data.
How to categorize and create new columns from full addresses?
The address string is comma-separated and uses these keywords:
* district
* city
* borough
* village
* municipality
* county
The source full address format can look like this:
col1;col2;fulladdress
data1_1;data2_1;Some district, Cityname, Some county
data1_2;data2_2;Another village, Another municipality, Another county
data1_3;data2_3;Third city, Third county
data1_4;data2_4;Forth borough, Forth municipality, Forth county
There is one peculiarity with one city in particular: the city is called "Clause", and sometimes it is written out as "Clause city" while other times it is just "Clause" in the full address string.
For example:
Clause city, Third county
Seventh district, Clause, Seventh municipality
So, I want to keep only one format, "Clause city", to avoid duplicate output. If "Clause" appears alone in the full address string, it should be renamed to "Clause city" in the CSV.
The source data file is called data.csv, and the categorized version is exported.csv.
All I have is this:
import pandas as pd
df = pd.read_csv('data.csv', delimiter=';')
address = df.fulladdress.str.split(',', expand=True)
district = df[df.fulladdress.str.match('district')]
city = df[df.fulladdress.str.match('city')]
borough = df[df.fulladdress.str.match('borough')]
village = df[df.fulladdress.str.match('village')]
municipality = df[df.fulladdress.str.match('municipality')]
county = df[df.fulladdress.str.match('county')]
df.to_csv('exported.csv', sep=';', encoding='utf8', index=True)
print ("Done!")
If there is only one problem city, I think you could use replace?
raw csv data:
Some district, Cityname, Some county
Another village, Another municipality, Another county
Third city, Third county
Forth borough, Forth municipality, Forth county
Clause city, Third county
Seventh district, Clause, Seventh municipality
replace solution:
df = pd.read_csv('clause.csv', names=['district', 'city', 'county'], sep=',')
# need to strip whitespace for replace to work
df = df.apply(lambda x: x.str.strip())
df.replace('Clause', 'Clause city', inplace=True)
output:
           district                  city                county
0     Some district              Cityname           Some county
1   Another village  Another municipality        Another county
2        Third city          Third county                   NaN
3     Forth borough    Forth municipality          Forth county
4       Clause city          Third county                   NaN
5  Seventh district           Clause city  Seventh municipality
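For the broader goal of pulling each keyword into its own column, here is a rough sketch built on the formats described in the question. The regex and the new column names are my own assumptions, and matching is case-sensitive:
import pandas as pd

df = pd.read_csv('data.csv', delimiter=';')

# Normalize the special case first: a bare "Clause" becomes "Clause city".
df['fulladdress'] = df['fulladdress'].str.replace(
    r'\bClause\b(?! city)', 'Clause city', regex=True)

# Extract the comma-separated part ending with each keyword, if present.
for kw in ['district', 'city', 'borough', 'village', 'municipality', 'county']:
    df[kw] = df['fulladdress'].str.extract(rf'([^,]*\b{kw})', expand=False).str.strip()

df.to_csv('exported.csv', sep=';', encoding='utf8', index=True)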
My dataframe has a column called Borough which contains values like these:
"east toronto", "west toronto", "central toronto" and "west toronto", along with other region names.
Now I want a regular expression which gets me the data of every entry that ends with "toronto". How do I do that?
I tried this:
tronto_data = df_toronto[df_toronto['Borough'] = .*Toronto$].reset_index(drop=True)
tronto_data.head(7)
If the data is well formatted, you can split the string on the space and access the final word, comparing it to 'toronto'. For example:
df = pd.DataFrame({'column': ['west toronto', 'central toronto', 'some place']})
mask_df = df['column'].str.split(' ', expand=True)
which returns:
0 1
0 west toronto
1 central toronto
2 some place
You can then access the final column to work out which rows end with 'toronto'.
toronto_df = df[mask_df[1]=='toronto']
Edit:
I did not know there was a string method .endswith, which is the better way to do this. However, this solution does provide two columns, which may be useful.
As #Code_10 mentions in a comment, you can use str.endswith. Try the below:
df = pd.DataFrame({'city': ['east toronto', 'west toronto', 'other', 'central toronto']})
df_toronto = df[df['city'].str.endswith('toronto')]
#df_toronto.head()
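Since the question explicitly asked for a regular expression, an anchored pattern with str.contains is another option. Note that case=False is my own assumption, since the sample values are lowercase while the question writes "Toronto":
tronto_data = df_toronto[df_toronto['Borough'].str.contains(r'toronto$', case=False)].reset_index(drop=True)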
I have a pandas dataframe which is essentially 2 columns and 9000 rows
CompanyName | CompanyAddress
and the address is in the form
Line1, Line2, ..LineN, PostCode
i.e. basically different numbers of comma-separated items in a string (dtype 'object'), and I want to pull out just the post code, i.e. the item after the last comma in the field.
I've tried the Dot notation string manipulation suggestions (possibly badly):
df_address['CompanyAddress'] = df_address['CompanyAddress'].str.rsplit(', ')
which just put '[ ]' around the fields - I had no success trying to isolate the last component of any split-up/partitioned string, with maxsplit kicking up errors.
I had a small degree of success following EdChums comment to Pandas split Column into multiple columns by comma
pd.concat([df_address[['CompanyName']], df_address['CompanyAddress'].str.rsplit(', ', expand=True)], axis=1)
However, whilst isolating the Postcode, this just creates multiple columns and the post code is in columns 3-6... equally no good.
It feels incredibly close, please advise.
EmployerName Address
0 FAUCET INN LIMITED [Union, 88-90 George Street, London, W1U 8PA]
1 CITIBANK N.A [Citigroup Centre,, Canary Wharf, Canada Squar...
2 AGENCY 2000 LIMITED [Sovereign House, 15 Towcester Road, Old Strat...
3 Transform Trust [Unit 11 Castlebridge Office Village, Kirtley ...
4 R & R.C.BOND (WHOLESALE) LIMITED [One General Street, Pocklington Industrial Es...
5 MARKS & SPENCER FINANCIAL SERVICES PLC [Marks & Spencer Financial, Services Kings Mea...
Given the DataFrame,
df = pd.DataFrame({'Name': ['ABC'], 'Address': ['Line1, Line2, LineN, PostCode']})
Address Name
0 Line1, Line2, LineN, PostCode ABC
If you need only post code, you can extract that using rsplit and re-assign it to the column Address. It will save you the step of concat.
df['Address'] = df['Address'].str.rsplit(', ').str[-1]
You get
Address Name
0 PostCode ABC
Edit: Given that you have a dataframe with address values in a list
df = pd.DataFrame({'Name': ['FAUCET INN LIMITED'], 'Address': [['Union, 88-90 George Street, London, W1U 8PA']]})
Address Name
0 [Union, 88-90 George Street, London, W1U 8PA] FAUCET INN LIMITED
You can get the last element using
df['Address'] = df['Address'].apply(lambda x: x[0].split(',')[-1].strip())
You get
Address Name
0 W1U 8PA FAUCET INN LIMITED
Just rsplit the existing column into 2 columns - the existing one and a new one. Or two new ones if you want to keep the existing column intact.
df[['Address', 'PostCode']] = df['Address'].str.rsplit(', ', n=1, expand=True)
Edit: Since OP's Address column is a list with 1 string in it, here is a solution for that specifically:
df[['Address', 'PostCode']] = df['Address'].map(lambda x: x[0]).str.rsplit(', ', n=1, expand=True)
rsplit returns a list; try rsplit(',', 1)[-1] to get the last element of the source string.
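As one more sketch, assuming Address holds plain strings as in the first example above, a single str.extract call captures everything after the last comma and skips the rsplit/unpack step entirely:
# Grab the final comma-separated item and trim surrounding whitespace.
df['PostCode'] = df['Address'].str.extract(r'([^,]+)$', expand=False).str.strip()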