Could someone help me solve a problem? I have a dataset that contains restaurants and their addresses, and looks like this:
import pandas as pd
import re
df = pd.DataFrame(
    data=[
        ['rest_1', 'city City_name, street Street, bld 1'],
        ['rest_2', 'city City_name, street Street Name, bld 2'],
        ['rest_3', 'City_name, street 1-st Street Name, building 2'],
        ['rest_4', 'city City_name, Street Name street, flat 1'],
        ['rest_5', 'City_name city, Streen Name avemue, flat 2'],
        ['rest_6', 'city City_name, bdr Street Name Second_name, flt 3'],
        ['rest_7', 'street Street, bld 3'],
        ['rest_8', 'Doublename Street street, building 4']
    ],
    columns=['restaurant', 'address']
)
print(df)
restaurant address
0 rest_1 city City_name, street Street, bld 1
1 rest_2 city City_name, street Street Name, bld 2
2 rest_3 City_name, street 1-st Street Name, building 2
3 rest_4 city City_name, Street Name street, flat 1
4 rest_5 City_name city, Streen Name avemue, flat 2
5 rest_6 city City_name, bdr Street Name Second_name, f...
6 rest_7 street Street, bld 3
7 rest_8 Doublename Street street, building 4
I need to create an additional column with only the street name. I've made a function with a regular expression and applied it:
def extract(street):
    try:
        street_name = re.search(r',+[\w -№]*,', street).group()
        return street_name[1:-1]
    except AttributeError:  # no match; the function returns None
        print(street)
df['street'] = df['address'].apply(extract)
The problem is that the addresses have different formats, and a few of them don't contain a city. As a result, I get this table:
print(df)
restaurant address street
0 rest_1 city City_name, street Street, bld 1 street Street
1 rest_2 city City_name, street Street Name, bld 2 street Street Name
2 rest_3 City_name, street 1-st Street Name, building 2 street 1-st Street Name
3 rest_4 city City_name, Street Name street, flat 1 Street Name street
4 rest_5 City_name city, Streen Name avemue, flat 2 Streen Name avemue
5 rest_6 city City_name, bdr Street Name Second_name, f... bdr Street Name Second_name
6 rest_7 street Street, bld 3 None
7 rest_8 Doublename Street street, building 4 None
How can I apply another regular expression only to the None cells in the resulting dataframe, to get a result like this:
restaurant address street
0 rest_1 city City_name, street Street, bld 1 street Street
1 rest_2 city City_name, street Street Name, bld 2 street Street Name
2 rest_3 City_name, street 1-st Street Name, building 2 street 1-st Street Name
3 rest_4 city City_name, Street Name street, flat 1 Street Name street
4 rest_5 City_name city, Streen Name avemue, flat 2 Streen Name avemue
5 rest_6 city City_name, bdr Street Name Second_name, f... bdr Street Name Second_name
6 rest_7 street Street, bld 3 street Street
7 rest_8 Doublename Street street, building 4 Doublename Street street
I will be thankful for any help!
I've done it with a simple loop:
for i in df[df['street'].isnull()].index:
    df.loc[i, 'street'] = re.search(r'^[\w -№]*,', df.loc[i, 'address']).group()[:-1]
and got the result I needed
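For reference, the same two-pass idea can be vectorized, avoiding the explicit loop. A minimal sketch with the same df, using simpler between-commas patterns rather than the original character class (my own simplification, not the asker's exact regex):

first = df['address'].str.extract(r',\s*([^,]+),', expand=False)  # text between the first two commas
fallback = df['address'].str.extract(r'^([^,]+),', expand=False)  # text before the first comma, for one-comma rows
df['street'] = first.fillna(fallback)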
I am trying to get the zip code that follows the specific word 'zip_code' within a string.
I have a DataFrame with a column named "location"; each row of this column contains a string, and I want to find the word "zip_code" and extract the value that comes after it.
Input
name location
Bar1 LA Jefferson zip_code 202378 Avenue free
Pizza Avenue 45 zip_code 45623 wichita st
Tacos Las Americas avenue 7 zip_code 67890 nicolas st
Expected output
name location
Bar1 202378
Pizza 45623
Tacos 67890
So far, following an example, I was able to extract the zip code from a single string:
str = "address1 355 Turnpike Ste 4 address3 zip_code 02021 country US "
str.split("zip_code")[1].split()[0]
>> 02021
But I do not know how to do the same for each row of my location column.
The best way is to use str.extract(), which accepts a regex and applies the search to each row.
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['Bar1', 'Pizza', 'Tacos'],
                   'location': ['LA Jefferson zip_code 202378 Avenue free', 'Avenue 45 zip_code 45623 wichita st', 'Las Americas avenue 7 zip_code 67890 nicolas st']})
df['location'] = df['location'].str.extract(r'zip_code\s(.*?)\s')
>>> df
name location
0 Bar1 202378
1 Pizza 45623
2 Tacos 67890
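For comparison, the asker's split-based approach can also be vectorized with pandas string methods; a minimal sketch (same df as above, assuming every row contains the word zip_code):

df['location'] = df['location'].str.split('zip_code').str[1].str.split().str[0]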
I have two datasets that I am doing a left outer merge on in Pandas. Here's the first:
Name Address
0 Joe Schmuckatelli 123 Main Street
1 Fred Putzarelli 456 Pine Street
2 Harry Cox 789 Vine Street
And the second:
Address InvoiceNum
0 123 Main Street 51450
1 456 Pine Street 51389
2 789 Vine Street 58343
3 123 Main Street 52216
4 456 Pine Street 53124-001
5 789 Vine Street 61215
6 789 Vine Street 51215-001
The merged data looks like this:
Name Address InvoiceNum
0 Joe Schmuckatelli 123 Main Street 51450
1 Joe Schmuckatelli 123 Main Street 52216
2 Fred Putzarelli 456 Pine Street 51389
3 Fred Putzarelli 456 Pine Street 53124-001
4 Harry Cox 789 Vine Street 58343
5 Harry Cox 789 Vine Street 61215
6 Harry Cox 789 Vine Street 51215-001
Ideally I would like to have one line per address with all of the invoice numbers for that address in the third column, like this:
Name Address InvoiceNum
0 Joe Schmuckatelli 123 Main Street 51450, 52216
1 Fred Putzarelli 456 Pine Street 51389, 53124-001
2 Harry Cox 789 Vine Street 58343, 61215, 51215-001
The code I used to merge the data looks like this:
mergedData = pd.merge(complaintData, invoiceData, on='Address', how='left')
Is there a way to do this easily in Pandas or some other way?
We can groupby-aggregate the values in df2, joining the strings together for each Address, before the join / merge with df1:
new_df = df1.join(
    df2.groupby('Address')['InvoiceNum'].aggregate(', '.join),
    on='Address',
    how='left'
)
new_df:
Name Address InvoiceNum
0 Joe Schmuckatelli 123 Main Street 51450, 52216
1 Fred Putzarelli 456 Pine Street 51389, 53124-001
2 Harry Cox 789 Vine Street 58343, 61215, 51215-001
Either join or merge works here, although join has slightly less overhead in this case since the result of the groupby has Address as its index.
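For completeness, a merge-based equivalent would look roughly like this (a sketch using the same df1 and df2 as in the setup below; as_index=False keeps Address as a regular column so it can be merged on):

invoices = df2.groupby('Address', as_index=False)['InvoiceNum'].aggregate(', '.join)
new_df = df1.merge(invoices, on='Address', how='left')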
Setup:
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Joe Schmuckatelli', 'Fred Putzarelli', 'Harry Cox'],
    'Address': ['123 Main Street', '456 Pine Street', '789 Vine Street']
})
df2 = pd.DataFrame({
    'Address': ['123 Main Street', '456 Pine Street', '789 Vine Street',
                '123 Main Street', '456 Pine Street', '789 Vine Street',
                '789 Vine Street'],
    'InvoiceNum': ['51450', '51389', '58343', '52216', '53124-001', '61215',
                   '51215-001']
})
I have a large dataset from a csv file to clean using the patterns I've identified, but I can't upload the file here, so I've hardcoded a small sample to give an overview of what I'm looking for. The identified patterns are the repeated characters in the values. If you look at the dataframe below, there are repeated 'single characters' like ssssss, fffff, aaaaa, etc., repeated 'double characters' like dgdg, bvbvbv, tutu, etc., and repeated 'triple characters' such as yutyut and fdgfdg.
Given this, would it also be possible to delete the rows with ANY repeated single/double/triple characters, so that I can apply the same logic to the large dataset? The dataframe here only shows the patterns I identified above, but the large dataset could contain repeated characters of ANY letters, like 'uuuu', 'zzzz', 'eded', 'rsrsrs', 'xyzxyz', etc.
Address1 Address2 Address3 Address4
0 High Street Park Avenue St. John’s Road The Grove
1 wssssss The Crescent tyutyut Mill Road
2 qfdgfdgdg dddfffff qdffgfdgfggfbvbvbv sefsdfdyuytutu
3 Green Lane Highfield Road Springfield Road School Lane
4 Kingsway Stanley Road George Street Albert Road
5 Church Street New Street Queensway Broadway
6 qaaaaass mjkhjk chfghfghh fghfhfh
Here's the code:
import pandas as pd
import numpy as np
data = {'Address1': ['High Street', 'wssssss', 'qfdgfdgdg', 'Green Lane', 'Kingsway', 'Church Street', 'qaaaaass'],
        'Address2': ['Park Avenue', 'The Crescent', 'dddfffff', 'Highfield Road', 'Stanley Road', 'New Street', 'mjkhjk'],
        'Address3': ['St. John’s Road', 'tyutyut', 'qdffgfdgfggfbvbvbv', 'Springfield Road', 'George Street', 'Queensway', 'chfghfghh'],
        'Address4': ['The Grove', 'Mill Road', 'sefsdfdyuytutu', 'School Lane', 'Albert Road', 'Broadway', 'fghfhfh']}
address_details = pd.DataFrame(data)
#Code to delete the data for the identified patterns
print(address_details)
The output I expect is:
Address1 Address2 Address3 Address4
0 High Street Park Avenue St. John’s Road The Grove
1 Green Lane Highfield Road Springfield Road School Lane
2 Kingsway Stanley Road George Street Albert Road
3 Church Street New Street Queensway Broadway
Please advise, thank you!
Try str.contains with loc and agg:
print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"(.)\1+\b"), axis=1).any(axis=1)])
Output:
Address1 Address2 Address3 Address4
0 High Street Park Avenue St. John’s Road The Grove
3 Green Lane Highfield Road Springfield Road School Lane
4 Kingsway Stanley Road George Street Albert Road
5 Church Street New Street Queensway Broadway
Or, if you care about the index:
print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"(.)\1+\b"), axis=1).any(axis=1)].reset_index(drop=True))
Output:
Address1 Address2 Address3 Address4
0 High Street Park Avenue St. John’s Road The Grove
1 Green Lane Highfield Road Springfield Road School Lane
2 Kingsway Stanley Road George Street Albert Road
3 Church Street New Street Queensway Broadway
Edit:
For only lowercase letters, try:
print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"([a-z]+)\1{1,}\b"), axis=1).any(axis=1)].reset_index(drop=True))
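To see what the two patterns actually match, here is a small standalone sketch with plain re and made-up test strings (not from the dataset):

import re

single = re.compile(r"(.)\1+\b")           # one character repeated 2+ times, ending at a word boundary
grouped = re.compile(r"([a-z]+)\1{1,}\b")  # a lowercase group repeated, ending at a word boundary

for s in ["wssssss", "tyutyut", "rsrsrs", "Green Lane"]:
    print(s, bool(single.search(s)), bool(grouped.search(s)))
# wssssss True True; tyutyut False True; rsrsrs False True; Green Lane False False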
I have a dataframe with two columns, Person_name and Company_name. I want to create two more columns called Name and Name_Type. Name would be the combined Person_name / Company_name value, and the Name_Type column would record whether the name is a Person type or a Company type. Some rows have empty strings, which creates four possibilities:
1) Empty Person + Empty Company = can be left blank.
2) Empty Person + Company Name = Company Name value
3) Person Name + Empty Company = Person Name value
4) Both names = split them into two rows. I cannot figure out how to do that.
I am a Python and Pandas beginner and haven't come across an answer online; hoping to find something here. Please excuse formatting or other errors.
Input:
df = pd.DataFrame({"Person_name": ["Aaron", "", "Phil", "Joe"],
"Company_name": ["", "XYZ Inc", "ABC LLC", ""]})
Company_name Person_name
0 Aaron
1 XYZ Inc
2 ABC LLC Phil
3 Joe
Expected output:
Company_name Person_name Name Name_Type
0 Aaron Aaron Person_name
1 XYZ Inc XYZ Inc Company_name
2 ABC LLC Phil Phil Person_name
2 ABC LLC Phil ABC LLC Company_name
3 Joe Joe Person_name
You can use apply, unstack and merge:
df = pd.DataFrame({"Person_name": ["Aaron", "", "Phil", "Joe"],
"Company_name": ["", "XYZ Inc", "ABC LLC", ""]})
def logic(row):
if row.Company_name and row.Person_name:
return pd.Series([[row.Person_name, "Person_name"], [row.Company_name, "Company_name"]])
else:
return pd.Series([[row.Person_name, "Person_name"] if row.Person_name else [row.Company_name, "Company_name"]])
df2 = df.apply(logic, 1).unstack().apply(pd.Series).dropna().reset_index().set_index("level_1").sort_index()
dff = pd.merge(df,df2, left_index=True, right_index=True).iloc[:, [0,1,3,4]]
dff.columns = ["Company_name", "Person_name", "Name", "Name_Type"]
Output
Company_name Person_name Name Name_Type
0 Aaron Aaron Person_name
1 XYZ Inc XYZ Inc Company_name
2 ABC LLC Phil Phil Person_name
2 ABC LLC Phil ABC LLC Company_name
3 Joe Joe Person_name
Use melt and merge (this assumes df1 = df.reset_index(), so the original index is available as a column; see the runnable sketch after the output):
(df1.melt('index', var_name='Name_Type', value_name='Name')
    .replace('', np.nan).dropna()
    .merge(df1, on='index').sort_values('index')
    .set_index('index'))
Output:
Name_Type Name Person_name Company_name
index
0 Person_name Aaron Aaron
1 Company_name XYZ Inc XYZ Inc
2 Person_name Phil Phil ABC LLC
2 Company_name ABC LLC Phil ABC LLC
3 Person_name Joe Joe
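A minimal runnable sketch of the whole answer, under the assumption above that the original index is first turned into a column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Person_name": ["Aaron", "", "Phil", "Joe"],
                   "Company_name": ["", "XYZ Inc", "ABC LLC", ""]})
df1 = df.reset_index()  # exposes the original index as a column named 'index'

out = (df1.melt('index', var_name='Name_Type', value_name='Name')
          .replace('', np.nan).dropna()
          .merge(df1, on='index').sort_values('index')
          .set_index('index'))
print(out)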
I'm trying to read an Excel file into Pandas, access particular columns of each row, and geocode an address to coordinates, then write them to a csv.
The geocoding part works well, and as far as I know my loop starts out fine, since it can read the address. However, it just stops at 22 rows. I have no clue why; I've been using Pandas with this same Excel file for something else and it does fine. The file has 27k rows in it, and printing data.__len__() gives me 27395. Any help?
##### READ IN DATA
import csv
import pandas as pd

file = r'rollingsales_manhattan.xls'
# Read in the data from the Excel file
data = pd.read_excel(file)
# g = geocoder.osm(str(data['ADDRESS'].iloc[0]) + " New York City, NY " + str(data['ZIP CODE'].iloc[0]))
with open("geotagged_manhattan.csv", 'wb') as result_file:
    wr = csv.writer(result_file)
    for index, d in enumerate(data):
        print(str(data['ADDRESS'].iloc[index]) + " New York City, NY " + str(data['ZIP CODE'].iloc[index]))
Then my output...
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
229 EAST 2ND STREET New York City, NY 10009
243 EAST 7TH STREET New York City, NY 10009
238 EAST 4TH STREET New York City, NY 10009
303 EAST 4TH STREET New York City, NY 10009
Process finished with exit code 0
Iterating over a DataFrame directly yields its column labels, not its rows, which is why your loop stopped after one pass over the columns rather than the 27k rows. You need to use the iteritems() method to iterate over each Pandas Series instead. To iterate over both at once, use map() like such...
with open("geotagged_manhattan.csv", 'wb') as result_file:
wr = csv.writer(result_file)
for a, z in map(None, data['ADDRESS'].iteritems(), data['ZIP CODE'].iteritems()):
print(str(a[1]) + " New York City, NY " + str(z[1]))
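Note that map(None, ...) and opening the csv file in 'wb' mode are Python 2 idioms. On Python 3, a rough equivalent of the same loop, assuming the same file and column names, would be:

import csv
import pandas as pd

data = pd.read_excel('rollingsales_manhattan.xls')
with open('geotagged_manhattan.csv', 'w', newline='') as result_file:
    wr = csv.writer(result_file)
    # zip pairs the two columns row by row
    for address, zip_code in zip(data['ADDRESS'], data['ZIP CODE']):
        print(str(address) + ' New York City, NY ' + str(zip_code))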