Pandas: Split and/or update columns, based on inconsistent data? - python

I have a column that contains baseball team names, and I want to split it into two new columns holding the city name and the team (franchise) name separately.
Team
New York Giants
Atlanta Braves
Chicago Cubs
Chicago White Sox
I would like to get something like this:
Team               City      Franchise
New York Giants    New York  Giants
Atlanta Braves     Atlanta   Braves
Chicago Cubs       Chicago   Cubs
Chicago White Sox  Chicago   White Sox
What have I tried so far?
Using split and rsplit: each gets the job done for some rows, but I can't combine them into one rule.
Counting the words with df['cnt'] = df.asc.apply(lambda x: len(str(x).split(' '))), so I know which cases I have.
There are 3 different cases:
Standard one (e.g. Atlanta Braves)
City with 2 words (e.g. New York Giants)
Franchise name with 2 words (e.g. Chicago White Sox)
What would I like to do?
Split based on conditions (if cnt=2, then split on the 1st occurrence of a space). I can't find the syntax for this; how would it go?
Update based on names (e.g. if ['Col_name'].str.contains("York" or "Angeles"), then split on the 2nd occurrence). I can't find working syntax for this either; is there an example?
What would be a good approach to solve this?
Thanks!

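Use:
import pandas as pd

# distinctive words from city names that contain a space
cities = ['York', 'Angeles']
# flag the rows whose team name contains one of those words
m = df['Team'].str.contains('|'.join(cities))
# default case: split every row on the first space into 2 new columns
df[['City', 'Franchise']] = df['Team'].str.split(n=1, expand=True)
# for the flagged rows only, split on the first two spaces (3 parts)
s = df.loc[m, 'Team'].str.split(n=2)
# overwrite the flagged rows: the first 2 parts form the city, the 3rd is the franchise
fixed = pd.concat([s.str[:2].str.join(' '), s.str[2]], axis=1, ignore_index=True)
df.update(fixed.set_axis(['City', 'Franchise'], axis=1))
print(df)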
   Team               City      Franchise
0  New York Giants    New York  Giants
1  Atlanta Braves     Atlanta   Braves
2  Chicago Cubs       Chicago   Cubs
3  Chicago White Sox  Chicago   White Sox
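As an alternative to the mask-and-update approach above, the split can also be done in one pass with a regular expression. This is only a sketch, not the answerer's method: it assumes you can list the multi-word cities up front (the `multi_word_cities` list below is hypothetical and would need to be extended for real data), tries those first, and otherwise treats the first word as the city.

```python
import re
import pandas as pd

df = pd.DataFrame({"Team": ["New York Giants", "Atlanta Braves",
                            "Chicago Cubs", "Chicago White Sox"]})

# Hypothetical list of cities whose names contain a space; extend as needed.
multi_word_cities = ["New York", "Los Angeles"]

# Try the known multi-word cities first, then fall back to a single word.
city_alt = "|".join(re.escape(c) for c in multi_word_cities)
pattern = rf"^(?P<City>{city_alt}|\S+)\s+(?P<Franchise>.+)$"

# The named groups become the column names of the extracted frame.
df = df.join(df["Team"].str.extract(pattern))
print(df)
```

Everything after the city is kept as the franchise, so two-word franchise names such as White Sox need no special handling.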

Related

PySpark: create new column based on dictionary values matching with string in another column

I have a dataframe A that looks like this:
ID  SOME_CODE  TITLE
1   024df3     Large garden in New York, New York
2   0ffw34     Small house in dark Detroit, Michigan
3   93na09     Red carpet in beautiful Miami
4   8339ct     Skyscraper in Los Angeles, California
5   84p3k9     Big shop in northern Boston, Massachusetts
I have also another dataframe B:
City         Shortcut
Los Angeles  LA
New York     NYC
Miami        MI
Boston       BO
Detroit      DTW
I would like to add a new "SHORTCUT" column to dataframe A, based on whether the "TITLE" column in A contains a city from the "City" column in dataframe B.
I have tried to use dataframe B as a dictionary and map it onto dataframe A, but I can't get around the fact that the city names appear in the middle of the sentence.
The desired output is:
ID  SOME_CODE  TITLE                                       SHORTCUT
1   024df3     Large garden in New York, New York          NYC
2   0ffw34     Small house in dark Detroit, Michigan       DTW
3   93na09     Red carpet in beautiful Miami, Florida      MI
4   8339ct     Skyscraper in Los Angeles, California       LA
5   84p3k9     Big shop in northern Boston, Massachusetts  BO
I will appreciate your help.
You can use the pandas apply function (note that this is a pandas solution, not a PySpark one).
See if this helps:
import numpy as np
import pandas as pd

data1 = {'id': range(5),
         'some_code': ["024df3", "0ffw34", "93na09", "8339ct", "84p3k9"],
         'title': ["Large garden in New York, New York",
                   "Small house in dark Detroit, Michigan",
                   "Red carpet in beautiful Miami",
                   "Skyscraper in Los Angeles, California",
                   "Big shop in northern Boston, Massachusetts"]}
df1 = pd.DataFrame(data=data1)
data2 = {'city': ["Los Angeles", "New York", "Miami", "Boston", "Detroit"],
         'shortcut': ["LA", "NYC", "MI", "BO", "DTW"]}
df2 = pd.DataFrame(data=data2)

# Creating a list of cities.
cities = list(df2['city'].values)

def matcher(x):
    # Return the shortcut of the first listed city found in the title (case-insensitive).
    for index, city in enumerate(cities):
        if x.lower().find(city.lower()) != -1:
            return df2.iloc[index]["shortcut"]
    # No city matched.
    return np.nan

df1['shortcut'] = df1['title'].apply(matcher)
print(df1.head())
This produces the following output:
   id  some_code  title                                       shortcut
0  0   024df3     Large garden in New York, New York          NYC
1  1   0ffw34     Small house in dark Detroit, Michigan       DTW
2  2   93na09     Red carpet in beautiful Miami               MI
3  3   8339ct     Skyscraper in Los Angeles, California       LA
4  4   84p3k9     Big shop in northern Boston, Massachusetts  BO
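The row-by-row `apply` above can also be written with vectorised string methods. The sketch below is a pandas alternative, not a PySpark one, and swaps in a different technique: it builds one regular expression from the city names, extracts the leftmost match from each title, and maps it to the shortcut. One behavioural difference: `matcher` returns the first city in list order that appears anywhere in the title, whereas this version returns the city that appears first in the title.

```python
import re
import pandas as pd

df1 = pd.DataFrame({"title": ["Large garden in New York, New York",
                              "Small house in dark Detroit, Michigan",
                              "Red carpet in beautiful Miami",
                              "Skyscraper in Los Angeles, California",
                              "Big shop in northern Boston, Massachusetts"]})
df2 = pd.DataFrame({"city": ["Los Angeles", "New York", "Miami", "Boston", "Detroit"],
                    "shortcut": ["LA", "NYC", "MI", "BO", "DTW"]})

# Lower-cased city name -> shortcut, so the lookup is case-insensitive.
lookup = dict(zip(df2["city"].str.lower(), df2["shortcut"]))

# One capture group that matches any of the known city names.
pattern = "(" + "|".join(re.escape(c) for c in df2["city"]) + ")"

# Extract the leftmost city in each title (NaN when nothing matches), then map it.
found = df1["title"].str.extract(pattern, flags=re.IGNORECASE)[0]
df1["shortcut"] = found.str.lower().map(lookup)
print(df1)
```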

Split a row into more rows based on a string (regex)

I have this df and I want to split it:
cities3 = {'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
           'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks']}
cities4 = pd.DataFrame(cities3)
to get a new df with one row per team.
What code can I use?
You can split your column based on an upper-case letter preceded by a lower-case one using this regex:
(?<=[a-z])(?=[A-Z])
and then you can use the technique described in this answer to replace the column with its exploded version:
cities4 = cities4.assign(NHL=cities4['NHL'].str.split(r'(?<=[a-z])(?=[A-Z])')).explode('NHL')
Output:
   Metropolitan   NHL
0  New York       Rangers
0  New York       Islanders
0  New York       Devils
1  Los Angeles    Kings
1  Los Angeles    Ducks
2  San Francisco  Sharks
If you want to reset the index (to 0..5), you can do this (either after the above command or as part of it):
cities4.reset_index().reindex(cities4.columns, axis=1)
Output:
   Metropolitan   NHL
0  New York       Rangers
1  New York       Islanders
2  New York       Devils
3  Los Angeles    Kings
4  Los Angeles    Ducks
5  San Francisco  Sharks
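For reference, here is a self-contained sketch of the same split-and-explode step that renumbers the rows in the same call. It assumes pandas 1.1 or later, where `explode` accepts `ignore_index=True`; the variable names `boundary` and `exploded` are only illustrative.

```python
import pandas as pd

cities4 = pd.DataFrame({
    "Metropolitan": ["New York", "Los Angeles", "San Francisco"],
    "NHL": ["RangersIslandersDevils", "KingsDucks", "Sharks"],
})

# Zero-width boundary between a lower-case and an upper-case letter.
boundary = r"(?<=[a-z])(?=[A-Z])"

# Split each cell into a list of teams, then give every team its own row;
# ignore_index=True renumbers the result 0..n-1 instead of repeating labels.
exploded = (
    cities4.assign(NHL=cities4["NHL"].str.split(boundary))
           .explode("NHL", ignore_index=True)
)
print(exploded)
```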

Replacing a string with a list of strings in a pandas dataframe, separated at capital letters

DATA
   Metropolitan area  Population (2016 est.)[8]  NHL
0  New York           20153634                   RangersIslandersDevils
1  Los Angeles        13310447                   KingsDucks
2  San Jose           6657982                    Sharks
3  Chicago            9512999                    Blackhawks
I want the output to be:
   Metropolitan area  Population (2016 est.)[8]  NHL
0  New York           20153634                   ['Rangers','Islanders','Devils']
1  Los Angeles        13310447                   ['Kings','Ducks']
2  San Jose           6657982                    Sharks
3  Chicago            9512999                    Blackhawks
I want these strings to be in lists so that I can use explode() later. Please help.
You can split using a regex with positive lookahead:
df['NHL'].str.split('[a-z](?=[A-Z])')
Output:
0 [Ranger, Islander, Devils]
1 [King, Ducks]
2 [Sharks]
3 [Blackhawks]
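The pattern '[a-z](?=[A-Z])' matches a lowercase letter that is followed by an uppercase letter. Note that the matched lowercase letter is part of the separator and is therefore removed, which is why the output above shows Ranger, Islander and King rather than Rangers, Islanders and Kings. A fully zero-width pattern such as '(?<=[a-z])(?=[A-Z])' splits at the same positions without dropping a character.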
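To make the difference concrete, the following sketch runs both patterns on the NHL column from the question. The variable names are illustrative only.

```python
import pandas as pd

nhl = pd.Series(["RangersIslandersDevils", "KingsDucks", "Sharks", "Blackhawks"])

# The lower-case letter is matched, so it is consumed as part of the separator.
consuming = nhl.str.split(r"[a-z](?=[A-Z])")

# Lookbehind plus lookahead: the match is empty, so no character is lost.
zero_width = nhl.str.split(r"(?<=[a-z])(?=[A-Z])")

print(consuming.tolist())
print(zero_width.tolist())
```

Strings with no lower-to-upper boundary, such as Sharks, come back as one-element lists in both cases, so `explode()` can be applied to the whole column afterwards.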

Group a dataframe by a column and concatenate strings in another

I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
    Postcode  Borough           Neighbourhood
0   M3A       North York        Parkwoods
1   M4A       North York        Victoria Village
2   M5A       Downtown Toronto  Harbourfront
3   M5A       Downtown Toronto  Regent Park
4   M6A       North York        Lawrence Heights
5   M6A       North York        Lawrence Manor
6   M7A       Queen's Park      Not assigned
7   M9A       Etobicoke         Islington Avenue
8   M1B       Scarborough       Rouge
9   M1B       Scarborough       Malvern
10  M3B       North York        Don Mills North
...
I want to group the dataframe by Postcode so that all Neighbourhoods sharing a Postcode are concatenated into a single comma-separated string...
something like:
   Postcode  Borough           Neighbourhood
0  M3A       North York        Parkwoods
1  M4A       North York        Victoria Village
2  M5A       Downtown Toronto  Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not change df: when I look at df after running it, I still see the original dataframe.
If I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into an object?
Selecting a single column before apply returns a Series rather than a DataFrame, which is why df changes type when you assign the result back. Use this code instead:
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood':lambda x:', '.join(x)}).reset_index()
reset_index() moves the group-by columns out of the index and back into regular columns, and creates a new integer index.
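A minimal runnable sketch of the same idea, using the first six rows from the question as sample data. It passes `', '.join` directly to `agg` instead of a lambda, and keeps the intermediate Series in a separate variable (`joined`, an illustrative name) to show where the type changes.

```python
import pandas as pd

df = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront",
                      "Regent Park", "Lawrence Heights", "Lawrence Manor"],
})

# Grouping and selecting one column yields a Series indexed by the group keys.
joined = df.groupby(["Postcode", "Borough"])["Neighbourhood"].agg(", ".join)

# reset_index() turns the group keys back into columns, giving a DataFrame.
new_df = joined.reset_index()
print(new_df)
```

Grouping by both Postcode and Borough keeps the Borough column in the result; this assumes each Postcode belongs to a single Borough, as in the sample shown.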

Assign a Row to Data Frame Header that Starts with a Specific String from Excel- Pandas

I have many Excel files in different formats. Some of them look like this, which is the normal case: a single header row that pandas can read directly.
#  First Column  Second Column  Address           City       State  Zip
1  House         The Clairs     4321 Main Street  Chicago    IL     54872
2  Restaurant    The Monks      6323 East Wing    Milwaukee  WI     45458
and some of them are in various formats with multiple headers:
Table 1
Comp ID Info
#  First Column  Second Column  Address             City       State  Zip
1  Office        The Fairs      1234 Main Street    Seattle    WA     54872
2  College       The Blanks     4523 West Street    Madison    WI     45875
3  Ground        The Brewers    895 Toronto Street  Madrid     IA     56487
Table2
Comp ID Info
#  First Column  Second Column  Address             City       State  Zip
1  College       The Banks      568 Old Street      Cleveland  OH     52125
2  Professional  The Circuits   695 New Street      Boston     MA     36521
As you can see above, there are three different levels of headers. Every file is guaranteed to have a row that starts with First Column.
For an individual file like this, I can read it as below, which works fine.
xls = pd.ExcelFile(r'mypath\myfile.xlsx')
df = pd.read_excel(xls, 'mysheet', header=[2])
However, I need a final data frame like this (appended together with the files that have only one header):
   First Column  Second Column  Address             City       State  Zip
0  House         The Clair      4321 Main Street    Chicago    IL     54872
1  Restaurant    The Monks      6323 East Wing      Milwaukee  WI     45458
2  Office        The Fairs      1234 Main Street    Seattle    WA     54872
3  College       The Blanks     4523 West Street    Madison    WI     45875
4  Ground        The Brewers    895 Toronto Street  Madrid     IA     56487
5  College       The Banks      568 Old Street      Cleveland  OH     52125
6  Professional  The Circuits   695 New Street      Boston     MA     36521
Since I have many files, I want to read each file in my folder and clean it up so that it has a single header row. Had I known the index position of the row I need as the header, I could simply do something like in this post.
However, as some of those files can have multiple headers in different formats (I showed 2 extra headers in the example above; some have 4), I want to iterate through each file and set the row that starts with First Column as the header at the beginning of the file.
Additionally, I want to drop the rows in the middle of the file that repeat the First Column header.
Once each file has a single clean header starting with First Column, I can append the data frames and create the output file I need. How can I achieve this in pandas? Any help or suggestions would be great.
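One possible way to approach this is sketched below; it is an illustration under stated assumptions, not a tested solution for the real files. The idea is to read each sheet with no header at all, find the first row that contains the text First Column, use it as the header, and drop every later repeat of it. The helper name `promote_header` is hypothetical, and the rule for discarding title rows such as "Table2" (drop rows that are mostly empty) is a guess about how those rows look once loaded. The example uses a small in-memory frame in place of a real sheet.

```python
import pandas as pd

def promote_header(raw, marker="First Column"):
    """Use the first row containing `marker` as the header (hypothetical helper).

    `raw` is assumed to be a sheet loaded without a header, e.g. with
    pd.read_excel(path, sheet_name=..., header=None).
    """
    # True for every row in which some cell equals the marker text.
    is_header = raw.eq(marker).any(axis=1).to_numpy()
    if not is_header.any():
        raise ValueError(f"no row contains {marker!r}")
    first = is_header.argmax()  # position of the first header row

    # Keep the rows below the first header, minus any repeated header rows.
    body = raw.iloc[first + 1:][~is_header[first + 1:]]
    # Assumption: title rows ("Table2", "Comp ID Info") are mostly empty cells.
    body = body.dropna(thresh=raw.shape[1] // 2 + 1)
    body.columns = raw.iloc[first].to_list()
    return body.reset_index(drop=True)

# Small stand-in for a sheet read with header=None.
raw = pd.DataFrame([
    ["Table 1", None, None, None],
    ["Comp ID Info", None, None, None],
    ["#", "First Column", "Second Column", "City"],
    [1, "Office", "The Fairs", "Seattle"],
    [2, "College", "The Blanks", "Madison"],
    ["Table2", None, None, None],
    ["Comp ID Info", None, None, None],
    ["#", "First Column", "Second Column", "City"],
    [1, "College", "The Banks", "Cleveland"],
])
clean = promote_header(raw)
print(clean)
```

With real files, the cleaned frames from each workbook could then be combined with `pd.concat(frames, ignore_index=True)`.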
