How do I use df.str.replace() only for complete matches? - python

I want to replace values in my df, but only if the values are a complete match, not partial. Here's an example:
import pandas as pd
df = pd.DataFrame({'Name':['Mark', 'Laura', 'Adam', 'Roger', 'Anna'],
'City':['Los Santos', 'Montreal', 'Los', 'Berlin', 'Glasgow']})
print(df)
Name City
0 Mark Los Santos
1 Laura Montreal
2 Adam Los
3 Roger Berlin
4 Anna Glasgow
I want to replace Los by Los Santos but if I do it the intuitive way, it results in this:
df['City'].str.replace('Los', 'Los Santos')
Out[121]:
0 Los Santos Santos
1 Montreal
2 Los Santos
3 Berlin
4 Glasgow
Name: City, dtype: object
Obviously, I don't want Los Santos Santos.

Use Series.replace, because Series.str.replace replaces substrings by default:
df['City'] = df['City'].replace('Los', 'Los Santos')

You can also use:
df['City'].str.replace(r'^Los$', 'Los Santos', regex=True)
0 Los Santos
1 Montreal
2 Los Santos
3 Berlin
4 Glasgow
Name: City, dtype: object
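For comparison, here is a minimal sketch of three ways to replace only cells that are exactly 'Los', using the sample frame from the question. The variable names (whole_value, anchored, masked) are only illustrative, and the regex variant assumes a pandas version where regex=True has to be passed explicitly.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mark', 'Laura', 'Adam', 'Roger', 'Anna'],
                   'City': ['Los Santos', 'Montreal', 'Los', 'Berlin', 'Glasgow']})

# Series.replace compares whole cell values, so 'Los Santos' is left alone
whole_value = df['City'].replace('Los', 'Los Santos')

# Regex form: ^ and $ anchor the pattern to the full string
anchored = df['City'].str.replace(r'^Los$', 'Los Santos', regex=True)

# A boolean mask on exact equality gives the same result
masked = df['City'].mask(df['City'].eq('Los'), 'Los Santos')

print(whole_value.tolist())
# ['Los Santos', 'Montreal', 'Los Santos', 'Berlin', 'Glasgow']
```

All three leave 'Los Santos' in row 0 untouched. Series.replace is probably the simplest choice when no pattern matching is needed.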

Related

Pandas Groupby Newbie Conundrum

Using Pandas 1.4.2, Python 3.9.12
I have a data frame as follows:
Neighbourhood No-show
0 JARDIM DA PENHA No
1 JARDIM DA PENHA Yes
2 MATA DA PRAIA No
3 PONTAL DE CAMBURI No
4 JARDIM DA PENHA No
5 MARIA ORTIZ Yes
6 MARIA ORTIZ Yes
7 MATA DA PRAIA Yes
8 PONTAL DE CAMBURI No
9 MARIA ORTIZ No
How would I use groupby to get the total(count) of 'Yes' and total(count) of 'No' grouped by each 'Neighbourhood'? I keep getting 'NoNoYesNo' if I use .sum() and if I can get these grouped correctly by Neighbourhood I think I can graph much easier.
This data frame is truncated as there are numerous other columns but these are the only 2 I care about for this exercise.
Use df.groupby() as follows:
totals = df.groupby(['Neighbourhood','No-show'])['No-show'].count()
print(totals)
Neighbourhood No-show
JARDIM DA PENHA No 2
Yes 1
MARIA ORTIZ No 1
Yes 2
MATA DA PRAIA No 1
Yes 1
PONTAL DE CAMBURI No 2
Name: No-show, dtype: int64
Good point raised by @JonClements: you might want to add .unstack(fill_value=0) to that. So:
totals_unstacked = df.groupby(['Neighbourhood','No-show'])['No-show'].count().unstack(fill_value=0)
print(totals_unstacked)
No-show No Yes
Neighbourhood
JARDIM DA PENHA 2 1
MARIA ORTIZ 1 2
MATA DA PRAIA 1 1
PONTAL DE CAMBURI 2 0
You can use:
df[['Neighbourhood', 'No-show']].value_counts()
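As a rough sketch of how the value_counts approach compares with a wide table, the snippet below rebuilds the ten sample rows from the question. pd.crosstab is swapped in here as an alternative to groupby plus unstack; it was not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    'Neighbourhood': ['JARDIM DA PENHA', 'JARDIM DA PENHA', 'MATA DA PRAIA',
                      'PONTAL DE CAMBURI', 'JARDIM DA PENHA', 'MARIA ORTIZ',
                      'MARIA ORTIZ', 'MATA DA PRAIA', 'PONTAL DE CAMBURI',
                      'MARIA ORTIZ'],
    'No-show': ['No', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No'],
})

# Long form: one count per (Neighbourhood, No-show) pair that occurs
counts = df[['Neighbourhood', 'No-show']].value_counts()

# Wide form: one row per neighbourhood, one column per answer,
# with 0 where a combination never occurs
table = pd.crosstab(df['Neighbourhood'], df['No-show'])
print(table)
```

The wide form is usually easier to plot, for example with table.plot(kind='bar').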

How to replace a list with first element of list in pandas dataframe column?

I have a pandas dataframe df, which look like this:
df = pd.DataFrame({'Name':['Harry', 'Sam', 'Raj', 'Jamie', 'Rupert'],
'Country':['USA', "['USA', 'UK', 'India']", "['India', 'USA']", 'Russia', 'China']})
Name Country
Harry USA
Sam ['USA', 'UK', 'India']
Raj ['India', 'USA']
Jamie Russia
Rupert China
Some values in Country column are list, and I want to replace those list with the first element in the list, so that it will look like this:
Name Country
Harry USA
Sam USA
Raj India
Jamie Russia
Rupert China
As you have strings, you could use a regex here:
df['Country'] = df['Country'].str.extract(r'((?<=\[["\'])[^"\']*|^[^"\']+$)', expand=False)
output (as a new column for clarity):
Name Country Country2
0 Harry USA USA
1 Sam ['USA', 'UK', 'India'] USA
2 Raj ['India', 'USA'] India
3 Jamie Russia Russia
4 Rupert China China
regex:
( # start capturing
(?<=\[["\']) # if preceded by [" or ['
[^"\']* # get all text until " or '
| # OR
^[^"\']+$ # get whole string if it doesn't contain " or '
) # stop capturing
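To check the pattern on the sample values, a small sketch could look like this. The names countries, pattern and first are only illustrative, and expand=False is used so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

countries = pd.Series(['USA', "['USA', 'UK', 'India']", "['India', 'USA']",
                       'Russia', 'China'])

# Same regex as above, written as a triple-quoted raw string so that
# neither kind of quote needs escaping
pattern = r'''((?<=\[["'])[^"']*|^[^"']+$)'''

first = countries.str.extract(pattern, expand=False)
print(first.tolist())
# ['USA', 'USA', 'India', 'Russia', 'China']
```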
Try something like:
import ast
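def changeStringList(value):
    try:
        myList = ast.literal_eval(value)
        return myList[0]
    except (ValueError, SyntaxError, TypeError, IndexError):
        # Not a non-empty list literal: keep the original value
        return value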
df["Country"] = df["Country"].apply(changeStringList)
df
Output
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
Note that, by using the changeStringList function, we try to reform the string list to an interpretable list of strings and return the first value. If it is not a list, then it returns the value itself.
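A self-contained variant of the same idea is sketched below. The helper name first_of_list_string is hypothetical, and the explicit isinstance check is an added assumption, so that only real lists are unpacked.

```python
import ast
import pandas as pd

df = pd.DataFrame({'Name': ['Harry', 'Sam', 'Raj', 'Jamie', 'Rupert'],
                   'Country': ['USA', "['USA', 'UK', 'India']", "['India', 'USA']",
                               'Russia', 'China']})

def first_of_list_string(value):
    """Return the first item if value is the text of a non-empty list, else value."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Plain text such as 'USA' is not a Python literal
        return value
    if isinstance(parsed, list) and parsed:
        return parsed[0]
    return value

result = df['Country'].apply(first_of_list_string)
print(result.tolist())
# ['USA', 'USA', 'India', 'Russia', 'China']
```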
Try this:
import ast
df['Country'] = df['Country'].where(df['Country'].str.contains('[', regex=False), '[\'' + df['Country'] + '\']').apply(ast.literal_eval).str[0]
Output:
>>> df
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
A regex solution.
import re
tempArr = []
for val in df["Country"]:
    if val.startswith("["):
        val = re.findall(r"[A-Za-z]+", val)[0]
        tempArr.append(val)
    else:
        tempArr.append(val)
df["Country"] = tempArr
df
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
If the values are strings, you could use Series.str.strip to remove the '[' and ']', then Series.str.split to turn each row into a list, and finally the .str accessor to take the first element:
df['Country'] = df['Country'].str.strip('[|]').str.split(',')\
.str[0].str.replace("'", "")
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China

Split a row into more rows based on a string (regex)

I have this df and I want to split it:
cities3 = {'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks']}
cities4 = pd.DataFrame(cities3)
to get a new df with one NHL team per row (the desired result was posted as an image).
What code can I use?
You can split your column based on an upper-case letter preceded by a lower-case one using this regex:
(?<=[a-z])(?=[A-Z])
and then you can use the technique described in this answer to replace the column with its exploded version:
cities4 = cities4.assign(NHL=cities4['NHL'].str.split(r'(?<=[a-z])(?=[A-Z])')).explode('NHL')
Output:
Metropolitan NHL
0 New York Rangers
0 New York Islanders
0 New York Devils
1 Los Angeles Kings
1 Los Angeles Ducks
2 San Francisco Sharks
If you want to reset the index (to 0..5), you can do this (either after the above command or as part of it):
cities4.reset_index(drop=True)
Output:
Metropolitan NHL
0 New York Rangers
1 New York Islanders
2 New York Devils
3 Los Angeles Kings
4 Los Angeles Ducks
5 San Francisco Sharks
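As an alternative sketch, the teams can be extracted rather than split: str.findall collects every run of one capital letter followed by lower-case letters. This assumes each team name is a single capitalised word, which holds for the sample data but not for names like 'Red Wings'. The names teams and exploded are only illustrative.

```python
import pandas as pd

cities4 = pd.DataFrame({'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
                        'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks']})

# Each match is one capitalised word, e.g. 'Rangers'
teams = cities4['NHL'].str.findall(r'[A-Z][a-z]+')

# ignore_index=True renumbers the rows 0..5 in the same step
exploded = cities4.assign(NHL=teams).explode('NHL', ignore_index=True)
print(exploded)
```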

How to add pandas column values based on key from a dictionary in Python?

My dataframe looks like this:
I want to add the league the club plays in, and the country that league is based in, as new columns, for every row.
I initially tried this using dictionaries with the clubs/countries, and returning the key for a value:
club_country_dict = {'La Liga':['Real Madrid','FC Barcelona'],'France Ligue 1':['Paris Saint-Germain']}
key_list=list(club_country_dict.keys())
val_list=list(club_country_dict.values())
But this ran into issues, since each of my keys' values is actually a list rather than a single value.
I then tried some IF THEN logic, by just having standalone variables for each league, and checking if the club value was in each variable:
la_Liga = ['Real Madrid','FC Barcelona']
for row in data:
    if data['Club'] in la_Liga:
        data['League'] = 'La Liga'
Apologies for the messy question. Basically I'm looking to add two new columns to my dataset, 'League' and 'Country', based on the 'Club' column value. I'm not sure what the easiest way to do this is, but I've hit walls trying different ways. Thanks in advance.
You could convert the dictionary to a data frame and then merge:
df = pd.DataFrame({"Name": ["CR", "Messi", "neymar"], "Club": ["Real Madrid", "FC Barcelona", "Paris Saint-Germain"]})
df.merge(pd.DataFrame(club_country_dict.items(), columns=['League', 'Club']).explode('Club'),
on = 'Club', how='left')
Here is one simple way to solve it: use the pandas apply function on rows.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
import pandas as pd
df = pd.DataFrame({"name": ["CR", "Messi", "neymar"], "club": ["RM", "BR", "PSG"]})
country = {"BR": "Spain", "PSG": "France", "RM": "Spain"}
df["country"] = df.apply(lambda row: country[row.club], axis=1)
print(df)
Output:
name club country
0 CR RM Spain
1 Messi BR Spain
2 neymar PSG France
Try pandas replace feature for Series.
df = pd.DataFrame({"Name" : ['Cristiano', 'L. Messi', "Neymar"], 'Club' : ["Real Madrid", "FC Barcelona", "Paris Saint-Germain"]})
df:
Name Club
0 Cristiano Real Madrid
1 L. Messi FC Barcelona
2 Neymar Paris Saint-Germain
Now add new column:
club_country_dict = {'Real Madrid': 'La Liga',
'FC Barcelona' : "La Liga",
'Paris Saint-Germain': 'France Ligue 1'}
df['League'] = df.Club.replace(club_country_dict)
df:
Name Club League
0 Cristiano Real Madrid La Liga
1 L. Messi FC Barcelona La Liga
2 Neymar Paris Saint-Germain France Ligue 1
To cope with the "list problem" in club_country_dict, convert it to
the following Series:
league_club = pd.Series(club_country_dict.values(), index=club_country_dict.keys(),
name='Club').explode()
The result is:
La Liga Real Madrid
La Liga FC Barcelona
France Ligue 1 Paris Saint-Germain
Name: Club, dtype: object
You should also have a "connection" between the league name and its
country (another Series):
league_country = pd.Series({'La Liga': 'Spain', 'France Ligue 1': 'France'}, name='Country')
Of course, add here other leagues of interest with their countries.
The next step is to join them into club_details DataFrame, with Club
as the index:
club_details = league_club.to_frame().join(league_country).reset_index()\
.rename(columns={'index':'League'}).set_index('Club')
The result is:
League Country
Club
Paris Saint-Germain France Ligue 1 France
Real Madrid La Liga Spain
FC Barcelona La Liga Spain
Then, assuming that your first DataFrame is named player, generate
the final result:
result = player.join(club_details, on='Club')
The result is:
Name Club League Country
0 Cristiano Ronaldo Real Madrid La Liga Spain
1 L. Messi FC Barcelona La Liga Spain
2 Neymar Paris Saint-Germain France Ligue 1 France
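A shorter route to the same two columns is sketched below, assuming the league-to-clubs dictionary from the question and a separate league-to-country dictionary. The name club_league is hypothetical, and league_country is written here as a plain dict rather than a Series. The dictionary of lists is inverted once, then Series.map does both lookups.

```python
import pandas as pd

player = pd.DataFrame({'Name': ['Cristiano', 'L. Messi', 'Neymar'],
                       'Club': ['Real Madrid', 'FC Barcelona', 'Paris Saint-Germain']})
club_country_dict = {'La Liga': ['Real Madrid', 'FC Barcelona'],
                     'France Ligue 1': ['Paris Saint-Germain']}
league_country = {'La Liga': 'Spain', 'France Ligue 1': 'France'}

# Invert league -> [clubs] into club -> league
club_league = {club: league
               for league, clubs in club_country_dict.items()
               for club in clubs}

# map returns NaN for clubs missing from the dictionary instead of raising
player['League'] = player['Club'].map(club_league)
player['Country'] = player['League'].map(league_country)
print(player)
```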

Python/Pandas for loop through a list only working on the last item in the list

This is a bit strange to me...
I have a DataFrame with a 'utility' column and an 'envelope' column.
I have a list of cities that get sent special envelopes:
['Chicago', 'New York', 'Dallas', 'LA']
I need to loop through each value in the utility column, check if it's in the list of cities that get sent special envelopes, and if it is, add the utility name to the envelope column.
This is the code I wrote to do that:
utilityEnv = ['Chicago', 'New York', 'Dallas', 'LA']
for i in utilityEnv:
    print(i)
    for j in df.index:
        if i in df.at[j, 'utility']:
            print('true')
            df.at[j, 'envelope'] = df.at[j, 'utility']
        else:
            df.at[j, 'envelope'] = 'ABF'
When I run this code, it prints the utility name, then a bunch of 'true'-s for each utility as it's supposed to each time it's going to set the envelope column to equal the utility column, but, the final df shows that the envelope columns were set to equal the utility column ONLY for LA, and not for any of the other cities. Even though there are many 'true'-s printed for the other utilities which means it made it into that block for other utilities as well.
For example:
This is what happens:
utility envelope
0 Chicago ABF
1 New York ABF
2 Austin ABF
3 Sacramento ABF
4 Boston ABF
5 LA LA
6 Dallas ABF
7 LA LA
8 Chicago ABF
9 Austin ABF
This is what is supposed to happen:
utility envelope
0 Chicago Chicago
1 New York New York
2 Austin ABF
3 Sacramento ABF
4 Boston ABF
5 LA LA
6 Dallas Dallas
7 LA LA
8 Chicago Chicago
9 Austin ABF
Sorry about the formatting I had to do it on my phone
Any idea why this is happening??
Use Series.where with Series.isin
df['envelope']=df['utility'].where(df['utility'].isin(utilityEnv), 'ABF')
Output
utility envelope
0 Chicago Chicago
1 New York New York
2 Austin ABF
3 Sacramento ABF
4 Boston ABF
5 LA LA
6 Dallas Dallas
7 LA LA
8 Chicago Chicago
9 Austin ABF
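This is much faster than using loops; pandas methods are designed for exactly this kind of task. The nested loop in the question fails because each pass of the outer loop rewrites the envelope column for every row, so only the result of the last city (LA) survives.
Here is a corrected loop, although you should prefer the vectorized version above: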
for i in df.index:
    val = df.at[i, 'utility']
    if val in utilityEnv:
        df.at[i, 'envelope'] = val
    else:
        df.at[i, 'envelope'] = 'ABF'
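For anyone who wants to reproduce the behaviour, here is a small sketch that runs the nested loop from the question next to the where/isin version on the ten sample rows. The frames buggy and fixed are illustrative names, and the envelope column is pre-filled with 'ABF' only so that the loop has a column to write into.

```python
import pandas as pd

utilityEnv = ['Chicago', 'New York', 'Dallas', 'LA']
df = pd.DataFrame({'utility': ['Chicago', 'New York', 'Austin', 'Sacramento', 'Boston',
                               'LA', 'Dallas', 'LA', 'Chicago', 'Austin']})

# Nested loop from the question: every pass over a city rewrites ALL rows,
# so whatever the earlier cities set is overwritten by the last pass (LA)
buggy = df.copy()
buggy['envelope'] = 'ABF'
for city in utilityEnv:
    for j in buggy.index:
        if city in buggy.at[j, 'utility']:
            buggy.at[j, 'envelope'] = buggy.at[j, 'utility']
        else:
            buggy.at[j, 'envelope'] = 'ABF'

# Vectorized version: keep the utility where it is in the list, else 'ABF'
fixed = df.copy()
fixed['envelope'] = fixed['utility'].where(fixed['utility'].isin(utilityEnv), 'ABF')

print(buggy['envelope'].tolist())
print(fixed['envelope'].tolist())
```

Note also that `city in some_string` is a substring test, while isin compares whole values, which is usually what is wanted here.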
