Split a row into more rows based on a string (regex) - python

I have this df and I want to split it:
import pandas as pd

cities3 = {'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
           'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks']}
cities4 = pd.DataFrame(cities3)
to get a new df where each team in the NHL column is on its own row, still paired with its Metropolitan value.
What code can I use?

You can split your column based on an upper-case letter preceded by a lower-case one using this regex:
(?<=[a-z])(?=[A-Z])
and then you can use the technique described in this answer to replace the column with its exploded version:
cities4 = cities4.assign(NHL=cities4['NHL'].str.split(r'(?<=[a-z])(?=[A-Z])')).explode('NHL')
Output:
Metropolitan NHL
0 New York Rangers
0 New York Islanders
0 New York Devils
1 Los Angeles Kings
1 Los Angeles Ducks
2 San Francisco Sharks
If you want to reset the index (to 0..5), you can do this (either after the above command or as part of it):
cities4.reset_index().reindex(cities4.columns, axis=1)
Output:
Metropolitan NHL
0 New York Rangers
1 New York Islanders
2 New York Devils
3 Los Angeles Kings
4 Los Angeles Ducks
5 San Francisco Sharks
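For reference, a self-contained sketch that chains the pieces together (this assumes pandas 0.25 or newer, which introduced DataFrame.explode; names follow the question):
import pandas as pd

cities3 = {'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
           'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks']}
cities4 = pd.DataFrame(cities3)

# Split each NHL string at every lower-to-upper case boundary, give every
# team its own row, then renumber the index 0..5.
cities4 = (cities4
           .assign(NHL=cities4['NHL'].str.split(r'(?<=[a-z])(?=[A-Z])'))
           .explode('NHL')
           .reset_index(drop=True))
print(cities4)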

Related

PySpark: create new column based on dictionary values matching with string in another column

I have a dataframe A that looks like this:
ID  SOME_CODE  TITLE
1   024df3     Large garden in New York, New York
2   0ffw34     Small house in dark Detroit, Michigan
3   93na09     Red carpet in beautiful Miami
4   8339ct     Skyscraper in Los Angeles, California
5   84p3k9     Big shop in northern Boston, Massachusetts
I also have another dataframe B:
City         Shortcut
Los Angeles  LA
New York     NYC
Miami        MI
Boston       BO
Detroit      DTW
I would like to add a new "SHORTCUT" column to dataframe A, based on whether the "TITLE" column in A contains a city from the "City" column in dataframe B.
I have tried to use dataframe B as a dictionary and map it onto dataframe A, but I can't get around the fact that the city names appear in the middle of the sentence.
The desired output is:
ID  SOME_CODE  TITLE                                       SHORTCUT
1   024df3     Large garden in New York, New York          NYC
2   0ffw34     Small house in dark Detroit, Michigan       DTW
3   93na09     Red carpet in beautiful Miami, Florida      MI
4   8339ct     Skyscraper in Los Angeles, California       LA
5   84p3k9     Big shop in northern Boston, Massachusetts  BO
I would appreciate your help.
You can leverage the pandas apply function. See if this helps:
import numpy as np
import pandas as pd

data1 = {'id': range(5),
         'some_code': ["024df3", "0ffw34", "93na09", "8339ct", "84p3k9"],
         'title': ["Large garden in New York, New York",
                   "Small house in dark Detroit, Michigan",
                   "Red carpet in beautiful Miami",
                   "Skyscraper in Los Angeles, California",
                   "Big shop in northern Boston, Massachusetts"]}
df1 = pd.DataFrame(data=data1)

data2 = {'city': ["Los Angeles", "New York", "Miami", "Boston", "Detroit"],
         'shortcut': ["LA", "NYC", "MI", "BO", "DTW"]}
df2 = pd.DataFrame(data=data2)

# Creating a list of cities.
cities = list(df2['city'].values)

def matcher(x):
    # Return the shortcut of the first city found inside the title, else NaN.
    for index, city in enumerate(cities):
        if x.lower().find(city.lower()) != -1:
            return df2.iloc[index]["shortcut"]
    return np.nan

df1['shortcut'] = df1['title'].apply(matcher)
print(df1.head())
This would generate the following output:
id some_code title shortcut
0 0 024df3 Large garden in New York, New York NYC
1 1 0ffw34 Small house in dark Detroit, Michigan DTW
2 2 93na09 Red carpet in beautiful Miami MI
3 3 8339ct Skyscraper in Los Angeles, California LA
4 4 84p3k9 Big shop in northern Boston, Massachusetts BO
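As an alternative to a row-by-row apply, here is a vectorized sketch that extracts the first matching city with Series.str.extract and maps it to its shortcut; the pattern construction and the case-insensitive flag are my own choices here, not part of the original answer, and df1/df2 are the frames built above:
import re

# One alternation pattern from the city list, with regex-special characters escaped.
pattern = '(' + '|'.join(re.escape(c) for c in df2['city']) + ')'

# Extract the first city mentioned in each title, then map city -> shortcut
# via a lowercase lookup so the match is case-insensitive.
shortcut_map = {c.lower(): s for c, s in zip(df2['city'], df2['shortcut'])}
found = df1['title'].str.extract(pattern, flags=re.IGNORECASE, expand=False)
df1['shortcut'] = found.str.lower().map(shortcut_map)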

Pandas: Split and/or update columns, based on inconsistent data?

So I have a column that contains baseball team names, and I want to split it into 2 new columns that will separately contain the city name and the team name.
Team
New York Giants
Atlanta Braves
Chicago Cubs
Chicago White Sox
I would like to get something like this:
Team               City      Franchise
New York Giants    New York  Giants
Atlanta Braves     Atlanta   Braves
Chicago Cubs       Chicago   Cubs
Chicago White Sox  Chicago   White Sox
What have I tried so far?
Using split and rsplit: it gets the job done, but I can't unify it.
Did the count df['cnt'] = df.asc.apply(lambda x: len(str(x).split(' '))) to get the number of words, so I know what kind of cases I have.
There are 3 different cases:
Standard one (e.g. Atlanta Braves)
City with 2 words (e.g. New York Giants)
Franchise name with 2 words (e.g. Chicago White Sox)
What would I like to do?
Split based on conditions (if cnt=2 then split on the 1st occurrence). I can't find the syntax for this; how would this go?
Update based on names (e.g. if ['Col_name'].str.contains("York" or "Angeles") then split on the 2nd occurrence). Also, I can't find working syntax; an example for this?
What would be a good approach to solve this?
Thanks!
Use:
#part of cities with space
cities = ['York','Angeles']
#test rows
m = df['Team'].str.contains('|'.join(cities))
#first split by first space to 2 new columns
df[['City','Franchise']] = df['Team'].str.split(n=1, expand=True)
#split by second space only filtered rows
s = df.loc[m, 'Team'].str.split(n=2)
#update values
df.update(pd.concat([s.str[:2].str.join(' '), s.str[2]], axis=1, ignore_index=True).set_axis(['City','Franchise'], axis=1))
print (df)
Team City Franchise
0 New York Giants New York Giants
1 Atlanta Braves Atlanta Braves
2 Chicago Cubs Chicago Cubs
3 Chicago White Sox Chicago White Sox
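Another option, sketched here under the same assumption that 'York' and 'Angeles' are the only second words of multi-word cities, is a single regex with named groups and Series.str.extract:
# City is one word, optionally followed by 'York'/'Angeles'; the rest is the franchise.
pattern = r'^(?P<City>\w+(?:\s+(?:York|Angeles))?)\s+(?P<Franchise>.+)$'
df[['City', 'Franchise']] = df['Team'].str.extract(pattern)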

Replacing a string with a list of strings in a pandas dataframe, separated by a capital letter

DATA
Metropolitan area Population (2016 est.)[8] NHL
0 New York 20153634 RangersIslandersDevils
1 Los Angeles 13310447 KingsDucks
2 San Jose 6657982 Sharks
3 Chicago 9512999 Blackhawks
I want the output to be:
Metropolitan area Population (2016 est.)[8] NHL
0 New York 20153634 ['Rangers','Islanders','Devils']
1 Los Angeles 13310447 ['Kings','Ducks']
2 San Jose 6657982 Sharks
3 Chicago 9512999 Blackhawks
I want these strings to be in a list so that I can use explode() later. Please help.
You can split using a regex with a lookbehind and a lookahead (both zero-width, so no letters are consumed):
df['NHL'].str.split(r'(?<=[a-z])(?=[A-Z])')
Output:
0 [Rangers, Islanders, Devils]
1 [Kings, Ducks]
2 [Sharks]
3 [Blackhawks]
The pattern r'(?<=[a-z])(?=[A-Z])' matches the empty position between a lowercase letter and the uppercase letter that follows it, so each team name stays intact.
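For illustration (not part of the original answer), plain re shows why the zero-width version matters; splitting on '[a-z](?=[A-Z])' would consume the matched lowercase letter:
import re

s = 'RangersIslandersDevils'

# Consuming split: the lowercase letter before each boundary is lost.
print(re.split(r'[a-z](?=[A-Z])', s))       # ['Ranger', 'Islander', 'Devils']

# Zero-width split (needs Python 3.7+): both letters stay with their words.
print(re.split(r'(?<=[a-z])(?=[A-Z])', s))  # ['Rangers', 'Islanders', 'Devils']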

Python/Pandas for loop through a list only working on the last item in the list

This is a bit strange to me...
I have a DataFrame with a 'utility' column and an 'envelope' column.
I have a list of cities that get sent special envelopes:
['Chicago', 'New York', 'Dallas', 'LA']
I need to loop through each value in the utility column, check if it's in the list of cities that get sent special envelopes, and if it is, add the utility name to the envelope column.
This is the code I wrote to do that:
utilityEnv = ['Chicago', 'New York', 'Dallas', 'LA']

for i in utilityEnv:
    print(i)
    for j in df.index:
        if i in df.at[j, 'utility']:
            print('true')
            df.at[j, 'envelope'] = df.at[j, 'utility']
        else:
            df.at[j, 'envelope'] = 'ABF'
When I run this code, it prints the utility name and then a bunch of 'true's for each utility, as expected, each time it is about to set the envelope column equal to the utility column. But the final df shows that the envelope column was set equal to the utility column ONLY for LA, not for any of the other cities, even though many 'true's are printed for the other utilities, which means the code reached that block for them as well.
For example:
This is what happens:
utility envelope
0 Chicago ABF
1 New York ABF
2 Austin ABF
3 Sacramento ABF
4 Boston ABF
5 LA LA
6 Dallas ABF
7 LA LA
8 Chicago ABF
9 Austin ABF
This is what supposed to happen:
utility envelope
0 Chicago Chicago
1 New York New York
2 Austin ABF
3 Sacramento ABF
4 Boston ABF
5 LA LA
6 Dallas Dallas
7 LA LA
8 Chicago Chicago
9 Austin ABF
Sorry about the formatting; I had to do it on my phone.
Any idea why this is happening??
Use Series.where with Series.isin
df['envelope']=df['utility'].where(df['utility'].isin(utilityEnv), 'ABF')
Output
utility envelope
0 Chicago Chicago
1 New York New York
2 Austin ABF
3 Sacramento ABF
4 Boston ABF
5 LA LA
6 Dallas Dallas
7 LA LA
8 Chicago Chicago
9 Austin ABF
This is much faster than using loops; pandas methods are vectorized for exactly this kind of operation. (The reason your loop only keeps LA is that the outer loop over utilityEnv rewrites the entire envelope column on every pass: each city's pass resets every non-matching row back to 'ABF', so only the matches from the last city, 'LA', survive.)
For reference, here is a corrected version of the loop, but you should not use it:
for i in df.index:
    val = df.at[i, 'utility']
    if val in utilityEnv:
        df.at[i, 'envelope'] = val
    else:
        df.at[i, 'envelope'] = 'ABF'

Group a dataframe by a column and concatenate strings in another

I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront
3 M5A Downtown Toronto Regent Park
4 M6A North York Lawrence Heights
5 M6A North York Lawrence Manor
6 M7A Queen's Park Not assigned
7 M9A Etobicoke Islington Avenue
8 M1B Scarborough Rouge
9 M1B Scarborough Malvern
10 M3B North York Don Mills North
...
I want to make a grouped dataframe where the rows are grouped by Postcode and all the Neighbourhood values within each Postcode become one concatenated string...
something like:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not return a new dataframe; df is still the original dataframe when I inspect it after running.
If I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into an object?
Use this code
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood':lambda x:', '.join(x)}).reset_index()
reset_index() takes your group-by columns out of the index, returns them as regular columns of the dataframe, and creates a new integer index. (Your original version turns df into a Series rather than a DataFrame because you selected a single column, ['Neighbourhood'], before applying the join; grouping on both Postcode and Borough and aggregating with agg keeps a DataFrame once the index is reset.)
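A minimal sketch on a few of the sample rows above (column names and values taken from the question):
import pandas as pd

df = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M5A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Harbourfront', 'Regent Park'],
})

new_df = (df.groupby(['Postcode', 'Borough'])
            .agg({'Neighbourhood': lambda x: ', '.join(x)})
            .reset_index())
print(new_df)
#   Postcode           Borough              Neighbourhood
# 0      M3A        North York                  Parkwoods
# 1      M4A        North York           Victoria Village
# 2      M5A  Downtown Toronto  Harbourfront, Regent Park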
