Consolidating data in Pandas - python

I have two datasets that I am doing a left outer merge on in Pandas. Here's the first:
                Name          Address
0  Joe Schmuckatelli  123 Main Street
1    Fred Putzarelli  456 Pine Street
2          Harry Cox  789 Vine Street
And the second:
           Address InvoiceNum
0  123 Main Street      51450
1  456 Pine Street      51389
2  789 Vine Street      58343
3  123 Main Street      52216
4  456 Pine Street  53124-001
5  789 Vine Street      61215
6  789 Vine Street  51215-001
The merged data looks like this:
                Name          Address InvoiceNum
0  Joe Schmuckatelli  123 Main Street      51450
1  Joe Schmuckatelli  123 Main Street      52216
2    Fred Putzarelli  456 Pine Street      51389
3    Fred Putzarelli  456 Pine Street  53124-001
4          Harry Cox  789 Vine Street      58343
5          Harry Cox  789 Vine Street      61215
6          Harry Cox  789 Vine Street  51215-001
Ideally I would like to have one line per address with all of the invoice numbers for that address in the third column, like this:
                Name          Address               InvoiceNum
0  Joe Schmuckatelli  123 Main Street             51450, 52216
1    Fred Putzarelli  456 Pine Street         51389, 53124-001
2          Harry Cox  789 Vine Street  58343, 61215, 51215-001
The code I used to merge the data looks like this:
mergedData = pd.merge(complaintData, invoiceData, on='Address', how='left')
Is there a way to do this easily in Pandas or some other way?

We can groupby-aggregate the values in df2 by joining the strings together for each Address, then join / merge with df1:
new_df = df1.join(
    df2.groupby('Address')['InvoiceNum'].aggregate(', '.join),
    on='Address',
    how='left'
)
new_df:
                Name          Address               InvoiceNum
0  Joe Schmuckatelli  123 Main Street             51450, 52216
1    Fred Putzarelli  456 Pine Street         51389, 53124-001
2          Harry Cox  789 Vine Street  58343, 61215, 51215-001
Either join or merge works here, although join has slightly less overhead in this case since the result of the groupby has Address as its index.
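For reference, a sketch of the equivalent merge call; as_index=False keeps Address as a regular column on the aggregated frame so it can be merged on directly:

# one row per Address, with the invoice numbers joined into a string
agg = df2.groupby('Address', as_index=False)['InvoiceNum'].agg(', '.join)
new_df = df1.merge(agg, on='Address', how='left')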
Setup:
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Joe Schmuckatelli', 'Fred Putzarelli', 'Harry Cox'],
    'Address': ['123 Main Street', '456 Pine Street', '789 Vine Street']
})
df2 = pd.DataFrame({
    'Address': ['123 Main Street', '456 Pine Street', '789 Vine Street',
                '123 Main Street', '456 Pine Street', '789 Vine Street',
                '789 Vine Street'],
    'InvoiceNum': ['51450', '51389', '58343', '52216', '53124-001', '61215',
                   '51215-001']
})

Related

Delete the rows with repeated characters in the dataframe

I have a large dataset from a CSV file to clean using the patterns I've identified, but I can't upload the file here, so I've hardcoded a small sample to give an overview of what I'm looking for. The identified patterns are repeated characters in the values. If you look at the dataframe below, there are repeated 'single characters' like ssssss, fffff, aaaaa, etc., repeated 'double characters' like dgdg, bvbvbv, tutu, etc., and repeated 'triple characters' such as yutyut and fdgfdg.
Given this, would it also be possible to delete the rows with ANY repeated 'single/double/triple characters' so that I can apply the same approach to the large dataset? For example, the dataframe here only shows the patterns identified above, but the large dataset could contain repeated runs of ANY letters, like 'uuuu', 'zzzz', 'eded', 'rsrsrs', 'xyzxyz', etc.
        Address1        Address2            Address3        Address4
0    High Street     Park Avenue     St. John’s Road       The Grove
1        wssssss    The Crescent             tyutyut       Mill Road
2      qfdgfdgdg        dddfffff  qdffgfdgfggfbvbvbv  sefsdfdyuytutu
3     Green Lane  Highfield Road    Springfield Road     School Lane
4       Kingsway    Stanley Road       George Street     Albert Road
5  Church Street      New Street           Queensway        Broadway
6       qaaaaass          mjkhjk           chfghfghh         fghfhfh
Here's the code:
import pandas as pd
import numpy as np

data = {'Address1': ['High Street', 'wssssss', 'qfdgfdgdg', 'Green Lane', 'Kingsway', 'Church Street', 'qaaaaass'],
        'Address2': ['Park Avenue', 'The Crescent', 'dddfffff', 'Highfield Road', 'Stanley Road', 'New Street', 'mjkhjk'],
        'Address3': ['St. John’s Road', 'tyutyut', 'qdffgfdgfggfbvbvbv', 'Springfield Road', 'George Street', 'Queensway', 'chfghfghh'],
        'Address4': ['The Grove', 'Mill Road', 'sefsdfdyuytutu', 'School Lane', 'Albert Road', 'Broadway', 'fghfhfh']}
address_details = pd.DataFrame(data)

# Code to delete the data for the identified patterns
print(address_details)
The output I expect is:
        Address1        Address2          Address3     Address4
0    High Street     Park Avenue   St. John’s Road    The Grove
1     Green Lane  Highfield Road  Springfield Road  School Lane
2       Kingsway    Stanley Road     George Street  Albert Road
3  Church Street      New Street         Queensway     Broadway
Please advise, thank you!
Try str.contains with loc and agg; the pattern (.)\1+\b matches a character immediately repeated one or more times and ending at a word boundary:
print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"(.)\1+\b"), axis=1).any(1)])
Output:
        Address1        Address2          Address3     Address4
0    High Street     Park Avenue   St. John’s Road    The Grove
3     Green Lane  Highfield Road  Springfield Road  School Lane
4       Kingsway    Stanley Road     George Street  Albert Road
5  Church Street      New Street         Queensway     Broadway
Or, if you want a clean, renumbered index:
print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"(.)\1+\b"), axis=1).any(1)].reset_index(drop=True))
Output:
        Address1        Address2          Address3     Address4
0    High Street     Park Avenue   St. John’s Road    The Grove
1     Green Lane  Highfield Road  Springfield Road  School Lane
2       Kingsway    Stanley Road     George Street  Albert Road
3  Church Street      New Street         Queensway     Broadway
Edit:
For only lowercase letters, try:
print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"([a-z]+)\1{1,}\b"), axis=1).any(1)].reset_index(drop=True))
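To see what that lowercase pattern does, a minimal sketch; the backreference \1 requires the captured run of letters to repeat immediately, ending at a word boundary:

import re

for s in ['uuuu', 'rsrsrs', 'xyzxyz', 'Street']:
    print(s, bool(re.search(r"([a-z]+)\1{1,}\b", s)))
# uuuu True, rsrsrs True, xyzxyz True,
# Street False ('ee' is followed by 't', not a word boundary)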

groupby data with columns that have mixed data types

Let's say I had this sample of a mixed dataset:
df:
            Property   Name  Date of entry         Old data     Updated data
                City    Jim       1/7/2021     Jacksonville            Miami
               State   Jack       1/8/2021               TX               CA
                 Zip    Joe       2/2/2021            11111            22222
             Address  Harry       2/3/2021         123 lane       123 street
           Telephone   Lisa       3/1/2021    111-111-11111     333-333-3333
               Email  Tammy       3/2/2021  tammy#yahoo.com  tammy#gmail.com
Date Product Ordered   Lisa       3/3/2021         2/1/2021        2/10/2021
         Order count  Tammy       3/4/2021                2                3
I'd like to group all this data, starting with Property, and have it look like this:
grouped:
Property  Name    Date of entry  Old data  Updated Data
City      names1  date 1         data 1    data 2
          names2  date 2         data 1    data 2
          names3  date 3         data 1    data 2
State     names1  date 1         data 1    data 2
          names2  date 2         data 1    data 2
          names3  date 3         data 1    data 2
grouped = pd.DataFrame(df.groupby(['Property', 'Name', 'Date of entry', 'Old data', 'Updated data'])
                         .size(), columns=['Count'])
grouped
and I get a TypeError saying: '<' not supported between instances of 'int' and 'datetime.datetime'
Is there some sort of formatting that I need to do to the df['Old data'] & df['Updated data'] columns to allow them to be added to the groupby?
Added data types:
Property: object
Name: object
Date of entry: datetime
Old data: object
Updated data: object
I modified your initial data to get a better view of the output.
You can try pivot_table instead of groupby; the mixed-type columns are carried through as values rather than used as group keys, so the int-vs-datetime comparison never happens:
df.pivot_table(index=['Property', 'Name', 'Date of entry'], aggfunc=lambda x: x)
Output:
                                             Old data     Updated data
Property             Name  Date of entry
Address              Harry 2/3/2021          123 lane       123 street
                     Lisa  2/3/2021          123 lane       123 street
City                 Jack  1/8/2021                TX            Miami
                     Jim   1/7/2021      Jacksonville            Miami
                     Tammy 1/8/2021                TX            Miami
Date Product Ordered Lisa  3/3/2021          2/1/2021        2/10/2021
Email                Tammy 3/2/2021   tammy#yahoo.com  tammy#gmail.com
Order count          Jack  3/4/2021                 2                3
                     Tammy 3/4/2021                 2                3
State                Jack  1/8/2021                TX               CA
Telephone            Lisa  3/1/2021     111-111-11111     333-333-3333
Zip                  Joe   2/2/2021             11111            22222
The whole code:
import pandas as pd
from io import StringIO

txt = '''Property  Name  Date of entry  Old data  Updated data
City  Jim  1/7/2021  Jacksonville  Miami
City  Jack  1/8/2021  TX  Miami
State  Jack  1/8/2021  TX  CA
Zip  Joe  2/2/2021  11111  22222
Order count  Jack  3/4/2021  2  3
Address  Harry  2/3/2021  123 lane  123 street
Telephone  Lisa  3/1/2021  111-111-11111  333-333-3333
Address  Lisa  2/3/2021  123 lane  123 street
Email  Tammy  3/2/2021  tammy#yahoo.com  tammy#gmail.com
Date Product Ordered  Lisa  3/3/2021  2/1/2021  2/10/2021
Order count  Tammy  3/4/2021  2  3
City  Tammy  1/8/2021  TX  Miami
'''
# fields are separated by two or more spaces, so sep=r'\s{2,}' keeps
# multi-word values like 'Date Product Ordered' in one column
df = pd.read_csv(StringIO(txt), header=0, skipinitialspace=True, sep=r'\s{2,}', engine='python')
print(df.pivot_table(index=['Property', 'Name', 'Date of entry'], aggfunc=lambda x: x))
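If you'd rather keep the original groupby, one workaround (a sketch, not part of the answer above) is to cast the two mixed-type columns to strings first, which sidesteps the int-vs-datetime comparison pandas makes while sorting the group keys:

# cast the object columns holding mixed values to plain strings before grouping
grouped = (df.astype({'Old data': str, 'Updated data': str})
             .groupby(['Property', 'Name', 'Date of entry', 'Old data', 'Updated data'])
             .size()
             .to_frame('Count'))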

Split column in DataFrame based on item in list

I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.
Input data:
                                    Address Text
0  11 North Warren Circle Lisbon Falls ME 04252
1               227 Cony Street Augusta ME 04330
2               70 Buckner Drive Battle Creek MI
3                 718 Perry Street Big Rapids MI
4           14857 Martinsville Road Van Buren MI
5               823 Woodlawn Ave Dallas TX 75208
6           2525 Washington Avenue Waco TX 76710
7              123 South Main St Dallas TX 75201
The output I'm trying to achieve (for all rows; I only wrote out the first two to save time):
           City State Postcode
0  Lisbon Falls    ME    04252
1       Augusta    ME    04330
My code:
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)

# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]

# This is where I got stuck ('syn' is not defined here)
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df
Here's a way to do that:
import pandas as pd

# data
df = pd.DataFrame(
    ['11 North Warren Circle Lisbon Falls ME 04252',
     '227 Cony Street Augusta ME 04330',
     '70 Buckner Drive Battle Creek MI',
     '718 Perry Street Big Rapids MI',
     '14857 Martinsville Road Van Buren MI',
     '823 Woodlawn Ave Dallas TX 75208',
     '2525 Washington Avenue Waco TX 76710',
     '123 South Main St Dallas TX 75201'],
    columns=['Address Text'])

# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)

# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]

def find_city(address, state, street_synonyms):
    for syn in street_synonyms:
        if syn in address:
            # remove street
            city = address.split(syn)[-1]
            # remove state and postcode
            city = city.split(state)[0]
            return city

df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)
print(df[['City', 'State', 'Zip']])
"""
City State Zip
0 Lisbon Falls ME 04252
1 Augusta ME 04330
2 Battle Creek MI NaN
3 Big Rapids MI NaN
4 Van Buren MI 14857
5 Dallas TX 75208
6 nue Waco TX 76710
7 Dallas TX 75201
"""

Replace values in None cells in Pandas with regular expression

Could someone help me solve a problem? I have a dataset which contains restaurants and their addresses, and it looks like this:
import pandas as pd
import re

df = pd.DataFrame(
    data=[
        ['rest_1', 'city City_name, street Street, bld 1'],
        ['rest_2', 'city City_name, street Street Name, bld 2'],
        ['rest_3', 'City_name, street 1-st Street Name, building 2'],
        ['rest_4', 'city City_name, Street Name street, flat 1'],
        ['rest_5', 'City_name city, Streen Name avemue, flat 2'],
        ['rest_6', 'city City_name, bdr Street Name Second_name, flt 3'],
        ['rest_7', 'street Street, bld 3'],
        ['rest_8', 'Doublename Street street, building 4']
    ],
    columns=['restaurant', 'address']
)
print(df)
  restaurant                                             address
0     rest_1               city City_name, street Street, bld 1
1     rest_2          city City_name, street Street Name, bld 2
2     rest_3     City_name, street 1-st Street Name, building 2
3     rest_4         city City_name, Street Name street, flat 1
4     rest_5         City_name city, Streen Name avemue, flat 2
5     rest_6  city City_name, bdr Street Name Second_name, f...
6     rest_7                               street Street, bld 3
7     rest_8               Doublename Street street, building 4
I need to create an additional column with only the street name. I've made a function with a regular expression and applied it:
def extract(street):
    try:
        street_name = re.search(r',+[\w -№]*,', street).group()
        return street_name[1:-1]
    except AttributeError:
        print(street)

df['street'] = df['address'].apply(extract)
The problem is that the addresses have different formats, and a few of them don't contain a city. So, as a result, I get this table:
print(df)
  restaurant                                             address                       street
0     rest_1                city City_name, street Street, bld 1                street Street
1     rest_2           city City_name, street Street Name, bld 2           street Street Name
2     rest_3      City_name, street 1-st Street Name, building 2      street 1-st Street Name
3     rest_4          city City_name, Street Name street, flat 1           Street Name street
4     rest_5          City_name city, Streen Name avemue, flat 2           Streen Name avemue
5     rest_6   city City_name, bdr Street Name Second_name, f...  bdr Street Name Second_name
6     rest_7                                street Street, bld 3                         None
7     rest_8                Doublename Street street, building 4                         None
How can I apply another regular expression only to the None cells in the resulting dataframe, to get a result such as:
  restaurant                                             address                       street
0     rest_1                city City_name, street Street, bld 1                street Street
1     rest_2           city City_name, street Street Name, bld 2           street Street Name
2     rest_3      City_name, street 1-st Street Name, building 2      street 1-st Street Name
3     rest_4          city City_name, Street Name street, flat 1           Street Name street
4     rest_5          City_name city, Streen Name avemue, flat 2           Streen Name avemue
5     rest_6   city City_name, bdr Street Name Second_name, f...  bdr Street Name Second_name
6     rest_7                                street Street, bld 3                street Street
7     rest_8                Doublename Street street, building 4     Doublename Street street
I'll be thankful for any help!
I've done it with a simple loop:
for i in df[df['street'].isnull()].index:
    # [:-1] drops the trailing comma that the pattern captures
    df.loc[i, 'street'] = re.search(r'^[\w -№]*,', df.loc[i, 'address']).group()[:-1]
and got the result I needed
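A vectorized alternative to the loop (a sketch, not from the original post) is to run the fallback pattern over the whole column and fill in only the missing cells:

# text before the first comma, used only where the first pass found nothing
fallback = df['address'].str.extract(r'^([\w -№]*),', expand=False)
df['street'] = df['street'].fillna(fallback)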

Group a dataframe by a column and concatenate strings in another

I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
   Postcode           Borough     Neighbourhood
0       M3A        North York         Parkwoods
1       M4A        North York  Victoria Village
2       M5A  Downtown Toronto      Harbourfront
3       M5A  Downtown Toronto       Regent Park
4       M6A        North York  Lawrence Heights
5       M6A        North York    Lawrence Manor
6       M7A      Queen's Park      Not assigned
7       M9A         Etobicoke  Islington Avenue
8       M1B       Scarborough             Rouge
9       M1B       Scarborough           Malvern
10      M3B        North York   Don Mills North
...
I want to make a grouped dataframe where the rows are grouped by Postcode and the Neighbourhood values become a single concatenated string per Postcode...
something like:
  Postcode           Borough              Neighbourhood
0      M3A        North York                  Parkwoods
1      M4A        North York           Victoria Village
2      M5A  Downtown Toronto  Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not return a new dataframe; df is unchanged when I inspect it after running.
if I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into a Series rather than a DataFrame?
Use this code:
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood': lambda x: ', '.join(x)}).reset_index()
reset_index() will take your group-by columns out of the index, return them as regular columns, and create a new integer index.
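As a small simplification (same behaviour, just less wrapping), ', '.join can be passed directly as the aggregation function, and as_index=False replaces the reset_index() call:

# str.join accepts the group's Series directly, so no lambda is needed
new_df = df.groupby(['Postcode', 'Borough'], as_index=False).agg({'Neighbourhood': ', '.join})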
