Combining columns of dataframe based on value in another column - python

Input df(example)
Country SubregionA SubregionB
BRA State of Acre Brasiléia
BRA State of Acre Cruzeiro do Sul
USA AL Bibb County
USA AL Blount County
USA AL Bullock County
Output df
Country SubregionA SubregionB
BRA State of Acre State of Acre - Brasiléia
BRA State of Acre State of Acre - Cruzeiro do Sul
USA AL AL Bibb County
USA AL AL Blount County
USA AL AL Bullock County
The code snippet is fairly self-explanatory, but when executed it seems to run forever. What could be going wrong? (The dataframe 'data' is quite large, around 250K+ rows.)
for row in data.itertuples():
    region = data['Country']
    if region == 'ARG' :
        data['SubregionB'] = data[['SubregionA', 'SubregionB']].apply(lambda row: '-'.join(row.values.astype(str)), axis=1)
    elif region == 'BRA' :
        data['SubregionB'] = data[['SubregionA', 'SubregionB']].apply(lambda row: '-'.join(row.values.astype(str)), axis=1)
    elif region == 'USA':
        data['SubregionB'] = data[['SubregionA', 'SubregionB']].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
    else:
        pass
Explanation: I am trying to join the columns SubregionA and SubregionB based on the value in the 'Country' column. The separators differ by country, so I wrote multiple if-else branches. It takes too long to execute; how can I make this faster?

You can use numpy.select with Series.isin and join columns with +:
print (df)
Country SubregionA SubregionB
0 BRA State of Acre Brasilia
1 BRA State of Acre Cruzeiro do Sul
2 USA AL Bibb County
3 USA AL Blount County
4 USA AL Bullock County
5 JAP AAA BBBB
import numpy as np
reg1 = ['ARG','BRA']
reg2 = ['USA']
a = np.select([df['Country'].isin(reg1), df['Country'].isin(reg2)],
              [df['SubregionA'] + ' - ' + df['SubregionB'],
               df['SubregionA'] + ' ' + df['SubregionB']],
              default=df['SubregionB'])
df['SubregionB'] = a
print (df)
Country SubregionA SubregionB
0 BRA State of Acre State of Acre - Brasilia
1 BRA State of Acre State of Acre - Cruzeiro do Sul
2 USA AL AL Bibb County
3 USA AL AL Blount County
4 USA AL AL Bullock County
5 JAP AAA BBBB
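If the list of countries grows, one alternative to adding more conditions is to keep the separators in a dictionary and look them up per row. The following is only a sketch of that idea, not part of the answer above: the `separators` dictionary and the small sample frame are assumptions made for illustration.

```python
import pandas as pd

# Small sample frame modelled on the question's data (illustrative only).
df = pd.DataFrame({
    "Country": ["BRA", "BRA", "USA", "JAP"],
    "SubregionA": ["State of Acre", "State of Acre", "AL", "AAA"],
    "SubregionB": ["Brasiléia", "Cruzeiro do Sul", "Bibb County", "BBBB"],
})

# Hypothetical lookup table: one separator per country that needs joining.
separators = {"ARG": " - ", "BRA": " - ", "USA": " "}

# Look up the separator for every row; countries not listed become NaN.
sep = df["Country"].map(separators)
known = sep.notna()

# Join the columns only where a separator is defined; other rows stay untouched.
df.loc[known, "SubregionB"] = (
    df.loc[known, "SubregionA"] + sep[known] + df.loc[known, "SubregionB"]
)
print(df)
```

Like the numpy.select version, this works on whole columns at once instead of calling apply inside a Python loop, which is what makes the original snippet so slow on 250K+ rows.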


Extracting strings from a list by a specific word

I have a column of addresses in pandas and I want to select only the addresses in the US, but I either get an empty list or an error.
Here's what I have done:
0 238 Lincoln St, Hahnville, LA 70057, USA
1 101 Home Pl Ln, Hahnville, LA 70057, USA
2 1250 Poydras St, New Orleans, LA 70113, USA
3 1117 Broadway STE 401, Tacoma, WA 98402, USA
4 2715 N Junett St, Tacoma, WA 98407, USA
5 Hillstrust Primary School, 29 Nethan St, Govan, Glasgow G51 3LX, UK
6 5778+JM Godalming, UK
7 569 Durham Rd, Low Fell, Gateshead NE9 5EY, UK
8 Pennine Way, Barnard Castle DL12, UK
9 14 Studios Rd, Shepperton TW17 0QW, UK
matching = [s for s in final_data["full_address"] if "USA" in s]
matching
#returns: TypeError: argument of type 'float' is not iterable
#Whereas
ab = [final_data["full_address"]]
matching = [s for s in ab if "USA" in s]
matching
#returns: []
Expected output:
0 238 Lincoln St, Hahnville, LA 70057, USA
1 101 Home Pl Ln, Hahnville, LA 70057, USA
2 1250 Poydras St, New Orleans, LA 70113, USA
3 1117 Broadway STE 401, Tacoma, WA 98402, USA
4 2715 N Junett St, Tacoma, WA 98407, USA
Try this:
import pandas as pd
data = {
    'full_address': [
        '238 Lincoln St, Hahnville, LA 70057, USA', '101 Home Pl Ln, Hahnville, LA 70057, USA', '1250 Poydras St, New Orleans, LA 70113, USA',
        '1117 Broadway STE 401, Tacoma, WA 98402, USA', '2715 N Junett St, Tacoma, WA 98407, USA', '5778+JM Godalming, UK', '569 Durham Rd, Low Fell, Gateshead NE9 5EY, UK',
        'Pennine Way, Barnard Castle DL12, UK', '14 Studios Rd, Shepperton TW17 0QW, UK'
    ]
}
df = pd.DataFrame(data)
matching = df[df['full_address'].str.contains("USA")]
print(matching)
Output:
full_address
0 238 Lincoln St, Hahnville, LA 70057, USA
1 101 Home Pl Ln, Hahnville, LA 70057, USA
2 1250 Poydras St, New Orleans, LA 70113, USA
3 1117 Broadway STE 401, Tacoma, WA 98402, USA
4 2715 N Junett St, Tacoma, WA 98407, USA
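The TypeError in the question ("argument of type 'float' is not iterable") most likely means the real column contains missing values, since NaN is a float. The second attempt returns [] because `ab` is a list holding a single Series, and `in` on a Series checks its index labels rather than its values. A minimal sketch of handling the missing values, assuming the column is named full_address as in the question (the sample rows here are a shortened, illustrative subset):

```python
import numpy as np
import pandas as pd

# Shortened sample with one missing address to reproduce the NaN case.
final_data = pd.DataFrame({
    "full_address": [
        "238 Lincoln St, Hahnville, LA 70057, USA",
        np.nan,
        "5778+JM Godalming, UK",
        "2715 N Junett St, Tacoma, WA 98407, USA",
    ]
})

# na=False treats missing addresses as "no match" instead of propagating NaN;
# regex=False searches for the literal text "USA".
mask = final_data["full_address"].str.contains("USA", na=False, regex=False)
matching = final_data[mask]
print(matching)
```

Without na=False the mask itself would contain NaN for the missing row, which cannot be used directly for boolean indexing.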
I recreated your scenario and it works here. I used query with str.contains on the relevant column, which in this example is named country:
import pandas as pd
# Build a small example DataFrame of addresses
names = ['238 Lincoln St, Hahnville, LA 70057, USA', '101 Home Pl Ln, Hahnville, LA 70057, USA', 'Hillstrust Govan, Glasgow G51 3LX, UK']
data = {'country': names}
cars = pd.DataFrame(data)
b = cars.query('country.str.contains("USA")', engine='python')
print(b)

Pandas Merge Result Output Next Row

Suppose I have two dataframes
df_1
city state salary
New York NY 85000
Chicago IL 65000
Miami FL 75000
Dallas TX 78000
Seattle WA 96000
df_2
city state taxes
New York NY 15000
Chicago IL 5000
Miami FL 6500
Next, I join the two dataframes
joined_df = df_1.merge(df_2, how='inner', left_on=['city'], right_on = ['city'])
The Result:
joined_df
city state salary city state taxes
New York NY 85000 New York NY 15000
Chicago IL 65000 Chicago IL 5000
Miami FL 75000 Miami FL 6500
Is there any way I can stack the two dataframes on top of each other, joining on the city, instead of extending each row horizontally, like below?
Requested:
joined_df
city      state  salary  taxes
New York  NY     85000
New York  NY             15000
Chicago   IL     65000
Chicago   IL             5000
Miami     FL     75000
Miami     FL             6500
How can I do this in pandas?
In this case we might need to use merge to restrict to the relevant rows before concat if we need to consider both city and state.
rel_df_1 = df_1.merge(df_2)[df_1.columns]
rel_df_2 = df_2.merge(df_1)[df_2.columns]
df = pd.concat([rel_df_1, rel_df_2]).sort_values(['city', 'state'])
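For reference, here is a self-contained version of the merge-then-concat idea, built from the question's sample data. It is a sketch rather than a definitive solution; the `stacked` name and the final reset_index call are additions for illustration.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "city": ["New York", "Chicago", "Miami", "Dallas", "Seattle"],
    "state": ["NY", "IL", "FL", "TX", "WA"],
    "salary": [85000, 65000, 75000, 78000, 96000],
})
df_2 = pd.DataFrame({
    "city": ["New York", "Chicago", "Miami"],
    "state": ["NY", "IL", "FL"],
    "taxes": [15000, 5000, 6500],
})

# merge() without `on` joins on all shared columns (city and state here),
# so each frame is restricted to the rows present in both.
rel_df_1 = df_1.merge(df_2)[df_1.columns]
rel_df_2 = df_2.merge(df_1)[df_2.columns]

# Stack the two restricted frames and group rows for the same city together.
stacked = (
    pd.concat([rel_df_1, rel_df_2])
    .sort_values(["city", "state"])
    .reset_index(drop=True)
)
print(stacked)
```

Cells with no value come out as NaN, so salary and taxes become float columns. Note that sort_values orders the cities alphabetically rather than in their original order.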
You can use pd.concat to achieve that (DataFrame.append was a shortcut for concat, but it was deprecated in pandas 1.4 and removed in 2.0):
result = pd.concat([df1, df2], sort=False)
If your dataframes have overlapping indexes, you can use:
pd.concat([df1, df2], ignore_index=True, sort=False)
Also, you can look for more information here
UPDATE: After combining your dataframes, you can filter the result to keep only the rows whose city appears in both dataframes:
result = result.loc[result['city'].isin(df1['city'])
                    & result['city'].isin(df2['city'])]
Try with stack():
stacked = df_1.merge(df_2, on=["city", "state"]).set_index(["city", "state"]).stack()
output = pd.concat([stacked.where(stacked.index.get_level_values(-1)=="salary"),
                    stacked.where(stacked.index.get_level_values(-1)=="taxes")],
                   axis=1,
                   keys=["salary", "taxes"]) \
           .droplevel(-1) \
           .reset_index()
>>> output
city state salary taxes
0 New York NY 85000.0 NaN
1 New York NY NaN 15000.0
2 Chicago IL 65000.0 NaN
3 Chicago IL NaN 5000.0
4 Miami FL 75000.0 NaN
5 Miami FL NaN 6500.0

Assign values from a dictionary to a new column based on condition

This is my data frame:
City        sales
San Diego   500
Texas       400
Nebraska    300
Macau       200
Rome        100
London      50
Manchester  70
I want to add the country as a final column, so that it looks like this:
City        sales  Country
San Diego   500    US
Texas       400    US
Nebraska    300    US
Macau       200    Hong Kong
Rome        100    Italy
London      50     England
Manchester  70     England
The countries are stored in the dictionary below:
country={'US':['San Diego','Texas','Nebraska'], 'Hong Kong':'Macau', 'England':['London','Manchester'],'Italy':'Rome'}
It's a little complicated because you have lists and strings as the values and strings are technically iterable, so distinguishing is more annoying. But here's a function that can flatten your dict:
def flatten_dict(d):
    nd = {}
    for k,v in d.items():
        # Check if it's a list, if so then iterate through
        if ((hasattr(v, '__iter__') and not isinstance(v, str))):
            for item in v:
                nd[item] = k
        else:
            nd[v] = k
    return nd
d = flatten_dict(country)
#{'San Diego': 'US',
# 'Texas': 'US',
# 'Nebraska': 'US',
# 'Macau': 'Hong Kong',
# 'London': 'England',
# 'Manchester': 'England',
# 'Rome': 'Italy'}
df['Country'] = df['City'].map(d)
You can implement this using geopy
You can install geopy by pip install geopy
Here is the documentation : https://pypi.org/project/geopy/
# import libraries
from geopy.geocoders import Nominatim
# you need to mention a name for the app
geolocator = Nominatim(user_agent="some_random_app_name")
# get country name
df['Country'] = df['City'].apply(lambda x : geolocator.geocode(x).address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 中国
4 Rome 100 Italia
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
# to get country name in english
df['Country'] = df['City'].apply(lambda x : geolocator.reverse(geolocator.geocode(x).point, language='en').address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 China
4 Rome 100 Italy
5 London 50 United Kingdom
6 Manchester 70 United Kingdom

Split column in DataFrame based on item in list

I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.
Input data:
Address Text
0 11 North Warren Circle Lisbon Falls ME 04252
1 227 Cony Street Augusta ME 04330
2 70 Buckner Drive Battle Creek MI
3 718 Perry Street Big Rapids MI
4 14857 Martinsville Road Van Buren MI
5 823 Woodlawn Ave Dallas TX 75208
6 2525 Washington Avenue Waco TX 76710
7 123 South Main St Dallas TX 75201
The output I'm trying to achieve (for all rows, but I only wrote out the first two to save time)
City State Postcode
0 Lisbon Falls ME 04252
1 Augusta ME 04330
My code:
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand = True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand = True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
# This is where I got stuck
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df
Here's a way to do that:
import pandas as pd
# data
df = pd.DataFrame(
['11 North Warren Circle Lisbon Falls ME 04252',
'227 Cony Street Augusta ME 04330',
'70 Buckner Drive Battle Creek MI',
'718 Perry Street Big Rapids MI',
'14857 Martinsville Road Van Buren MI',
'823 Woodlawn Ave Dallas TX 75208',
'2525 Washington Avenue Waco TX 76710',
'123 South Main St Dallas TX 75201'],
columns=['Address Text'])
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
def find_city(address, state, street_synonyms):
    for syn in street_synonyms:
        if syn in address:
            # remove street
            city = address.split(syn)[-1]
            # remove State and postcode
            city = city.split(state)[0]
            return city
df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)
print(df[['City', 'State', 'Zip']])
"""
City State Zip
0 Lisbon Falls ME 04252
1 Augusta ME 04330
2 Battle Creek MI NaN
3 Big Rapids MI NaN
4 Van Buren MI 14857
5 Dallas TX 75208
6 nue Waco TX 76710
7 Dallas TX 75201
"""

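The output above has two visible glitches: row 6 comes out as "nue Waco" because "Ave" matches inside "Avenue", and row 4 picks up the street number 14857 as a postcode. One possible way around both, sketched here under the assumption that every address ends with the city, a two-letter state and an optional five-digit postcode, is a single regular expression with word boundaries. The `pattern` and `parts` names are illustrative, not from the answer.

```python
import re
import pandas as pd

df = pd.DataFrame(
    ['11 North Warren Circle Lisbon Falls ME 04252',
     '227 Cony Street Augusta ME 04330',
     '70 Buckner Drive Battle Creek MI',
     '718 Perry Street Big Rapids MI',
     '14857 Martinsville Road Van Buren MI',
     '823 Woodlawn Ave Dallas TX 75208',
     '2525 Washington Avenue Waco TX 76710',
     '123 South Main St Dallas TX 75201'],
    columns=['Address Text'])

street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]

# Longest synonyms first, wrapped in \b word boundaries, so "Ave" cannot
# match inside "Avenue" and "St" cannot match inside "Street".
alternation = "|".join(
    sorted((re.escape(s) for s in street_synonyms), key=len, reverse=True)
)

# City is whatever sits between the street word and the trailing
# "STATE [ZIP]" at the end of the string; the postcode is optional.
pattern = (
    rf"\b(?:{alternation})\b\s+(?P<City>.+?)"
    rf"\s+(?P<State>[A-Z]{{2}})(?:\s+(?P<Zip>\d{{5}}))?$"
)

parts = df["Address Text"].str.extract(pattern)
print(parts)
```

This still breaks if a city name itself contains one of the street words (for example a city starting with "St"), so it should be checked against the real data.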
how to preserve original indexes in the new dataframe

def answer_eight():
    templist = list()
    for county, region, p15, p14, ste, cty in zip(census_df.CTYNAME,
                                                  census_df.REGION,
                                                  census_df.POPESTIMATE2015,
                                                  census_df.POPESTIMATE2014,
                                                  census_df.STNAME,
                                                  census_df.CTYNAME):
        # print(county)
        if region == 1 or region == 2:
            if county.startswith('Washington'):
                if p15 > p14:
                    templist.append((ste, cty))
    labels = ['STNAME', 'CTYNAME']
    df = pd.DataFrame.from_records(templist, columns=labels)
    return df
STNAME CTYNAME
0 Iowa Washington County
1 Minnesota Washington County
2 Pennsylvania Washington County
3 Rhode Island Washington County
4 Wisconsin Washington County
All of these CTYNAME rows have different indexes in the original census_df. How can I carry those indexes over to the new DataFrame so that the answer looks like this:
STNAME CTYNAME
12 Iowa Washington County
222 Minnesota Washington County
400 Pennsylvania Washington County
2900 Rhode Island Washington County
2999 Wisconsin Washington County
I'd include the index with the other things you are zipping:
def answer_eight():
    templist = list()
    index = list()
    zipped = zip(
        census_df.CTYNAME,
        census_df.REGION,
        census_df.POPESTIMATE2015,
        census_df.POPESTIMATE2014,
        census_df.STNAME,
        census_df.CTYNAME,
        census_df.index
    )
    for county, region, p15, p14, ste, cty, idx in zipped:
        # print(county)
        if region == 1 or region == 2:
            if county.startswith('Washington'):
                if p15 > p14:
                    templist.append((ste, cty))
                    index.append(idx)
    labels = ['STNAME', 'CTYNAME']
    df = pd.DataFrame(templist, index, labels)
    return df.rename_axis(census_df.index.name)
Before you start filtering, you can assign the original index to a column with:
census_df['original index'] = census_df.index
Then just treat it like one of the other columns you're selecting from.
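Another option, not taken from the answers above, is to skip the loop entirely and filter with a boolean mask: rows selected with .loc keep their original index labels automatically. The sketch below uses a tiny made-up stand-in for census_df (the values and index labels are invented for illustration); only the column names come from the question.

```python
import pandas as pd

# Invented stand-in for census_df; only the column names match the question.
census_df = pd.DataFrame(
    {
        "REGION": [2, 2, 3, 1],
        "STNAME": ["Iowa", "Iowa", "Texas", "Rhode Island"],
        "CTYNAME": ["Washington County", "Adams County",
                    "Washington County", "Washington County"],
        "POPESTIMATE2014": [100, 50, 70, 90],
        "POPESTIMATE2015": [110, 60, 80, 95],
    },
    index=[12, 13, 400, 2900],
)

# Same three conditions as the loop, expressed on whole columns.
mask = (
    census_df["REGION"].isin([1, 2])
    & census_df["CTYNAME"].str.startswith("Washington")
    & (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
)

# .loc keeps the original index labels of the selected rows.
result = census_df.loc[mask, ["STNAME", "CTYNAME"]]
print(result)
```

Because no new DataFrame is built from a list of tuples, there is nothing to carry over by hand.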
