Remove characters from a string in a dataframe - python

python beginner here. I would like to change some characters in a column in a dataframe under certain conditions.
The dataframe looks like this:
import pandas as pd
import numpy as np
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue (VS)', 'red', 'yellow (AG)', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data, index = ['0', '1', '2', '3'])
df
My goal is to remove, in the favorite_color column, the space followed by the parenthesized two letters:
blue instead of blue (VS).
There are 26 letter variations that I have to remove, but only one format: the value, followed by a space, an opening parenthesis, two letters, and a closing parenthesis.
From what I understood, in regexp that should be:
 \(..\)
I tried using str.replace but it only works for exact match and it replaces the whole value.
I also tried this:
df.loc[df['favorite_color'].str.contains('VS'), 'favorite_color'] = 'random'
it also replaces the whole value.
I saw that I can only rewrite the value but I also saw that using this:
df[0].str.slice(0, -5)
I could remove the last 5 characters of a string containing my search.
In my mind, I should make a list of the 26 occurrences that I want removed and parse through the column to remove them while keeping the text before. I searched for posts similar to my problem but could not find a solution. Do you have any idea for a direction?

You can use str.replace with the regex pattern r"\(.*?\)" (pass regex=True, since recent pandas versions no longer treat the pattern as a regex by default).
Ex:
import pandas as pd
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue (VS)', 'red', 'yellow (AG)', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data, index = ['0', '1', '2', '3'])
df["newCol"] = df["favorite_color"].str.replace("(\(.*?\))", "").str.strip()
print( df )
Output:
age favorite_color grade name newCol
0 20 blue (VS) 88 Willard Morris blue
1 19 red 92 Al Jennings red
2 22 yellow (AG) 95 Omar Mullins yellow
3 21 green 70 Spencer McDaniel green
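Since the asker notes the format is always a space followed by exactly two letters in parentheses at the end of the value, the pattern can also be anchored more tightly. A minimal sketch, assuming the codes are always two uppercase letters:

```python
import pandas as pd

df = pd.DataFrame({'favorite_color': ['blue (VS)', 'red', 'yellow (AG)', 'green']})

# Anchor the pattern to the exact format described in the question:
# a space, an open parenthesis, exactly two uppercase letters, and a
# close parenthesis at the end of the string.
df['favorite_color'] = df['favorite_color'].str.replace(
    r' \([A-Z]{2}\)$', '', regex=True)
print(df['favorite_color'].tolist())  # ['blue', 'red', 'yellow', 'green']
```

The anchored pattern will not touch parenthesized text elsewhere in a value, which the looser `\(.*?\)` would remove.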

Related

Parse the column value and save the first section in new column

I need to parse column values in a data frame and save the first parsed section in a new column if the value has a parsing delimiter like "-"; if not, leave it empty.
import pandas as pd
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'code': ['01-02-11-55-00115','11-02-11-55-00445','test', '31-0t-11-55-00115'],
'favorite_color': ['blue', 'blue', 'yellow', 'green'],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)
df.head()
Adding a new column that has the first parsed section, the expected column values are:
01
11
null
31
df['parsed'] = df['code'].apply(lambda x: x.split('-')[0] if '-' in x else 'null')
will output:
name code favorite_color grade parsed
0 Willard Morris 01-02-11-55-00115 blue 88 01
1 Al Jennings 11-02-11-55-00445 blue 92 11
2 Omar Mullins test yellow 95 null
3 Spencer McDaniel 31-0t-11-55-00115 green 70 31
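As an alternative to the lambda, the split-and-check logic can be expressed with a vectorized str.extract. A sketch using the same data (note it yields NaN rather than the string 'null' for non-matching rows):

```python
import pandas as pd

df = pd.DataFrame({'code': ['01-02-11-55-00115', '11-02-11-55-00445',
                            'test', '31-0t-11-55-00115']})

# str.extract returns NaN where the pattern does not match, which
# replaces the explicit '-' check in the lambda.
df['parsed'] = df['code'].str.extract(r'^([^-]+)-')
print(df['parsed'].tolist())
```

NaN is usually preferable to the literal string 'null', since downstream code can use isna() on it.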

Replace certain values of one column, with different values from a different df, pandas

I have a df,
for example -
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
'age': [21, 23, 24, 28],
'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer'],
'knowledge':['python', 'medical','sql','c++'],
})
and another df -
df2 = pd.DataFrame({'occupation': ['data scientist', 'data analyst'],
'knowledge':['5', '4'],
})
I want to replace the knowledge values of the first DF with the knowledge values of the second, but only for the rows where the occupation matches.
That would make the first DF look like this:
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
'age': [21, 23, 24, 28],
'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer'],
'knowledge':['5', 'medical','4','c++'],
})
I tried to do stuff with replace, but it didn't work...
You may try this:
occ_know_dict = df2.set_index('occupation').to_dict()['knowledge']
df['knowledge'] = df.apply(
    lambda row: occ_know_dict.get(row['occupation'], row['knowledge']), axis=1)
You can also map the knowledge column of df2 onto df via the shared occupation column, then update df with the result.
df['knowledge'].update(df['occupation'].map(df2.set_index('occupation')['knowledge']))
Note that update happens inplace.
print(df)
name age occupation knowledge
0 name1 21 data scientist 5
1 name2 23 doctor medical
2 name3 24 data analyst 4
3 name4 28 engineer c++
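If you prefer to avoid the in-place update, the same map can be combined with fillna. A minimal sketch with the data from the question:

```python
import pandas as pd

df = pd.DataFrame({'occupation': ['data scientist', 'doctor',
                                  'data analyst', 'engineer'],
                   'knowledge': ['python', 'medical', 'sql', 'c++']})
df2 = pd.DataFrame({'occupation': ['data scientist', 'data analyst'],
                    'knowledge': ['5', '4']})

# map() yields NaN for occupations missing from df2; fillna() keeps
# the original value for those rows, so no in-place update is needed.
mapped = df['occupation'].map(df2.set_index('occupation')['knowledge'])
df['knowledge'] = mapped.fillna(df['knowledge'])
print(df['knowledge'].tolist())  # ['5', 'medical', '4', 'c++']
```

This returns a new column rather than mutating through update, which can be easier to reason about in a longer pipeline.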

Group by and Filter with Pandas without losing groupby

I'm a beginner in the subject and didn't find anything to help me here so far.
I'm struggling to group my data and then filter it with a value I need.
Like the example,
I need to Know, for example, how many Red Cars Juan bought.
(Red Cars sells for each client).
When I try, I lose the group or the filter; I can't do both.
Can someone help me or suggest a post please?
Edit 1.
With the help of the community, I found this solution:
df = df.loc[:, df.columns.intersection(['Name', 'Car colour', 'Amount'])]
df = df.query('`Car colour` == "Red"')
df.groupby(['Name', 'Car colour'])['Amount'].sum().reset_index()
If you want the amount sold by group of Name and Car colour, then try
df.groupby(['Name', 'Car colour'])['Amount'].sum().reset_index()
Name Car colour Amount
0 Juan green 1
1 Juan red 3
2 Wilson blue 1
3 carlos yellow 1
GroupBy.sum
import pandas as pd
data = {"Name": ["Juan", "Wilson", "Carlos", "Juan", "Juan", "Wilson", "Juan", "Carlos"],
"Car Color": ["Red", "Blue", "Yellow", "Red", "Red", "Red", "Red", "Green"],
"Amount": [24, 28, 40, 22, 29, 33, 31, 50]}
df = pd.DataFrame(data)
print(df)
df.groupby(['Name', 'Car Color']).sum()
You can group by multiple columns by passing a list of column names to the groupby function, then taking the sum of each group.
import pandas as pd
df = pd.DataFrame({'Name': ['Juan', 'Wilson', 'Carlos', 'Juan', 'Juan', 'Wilson', 'Juan'],
'Car Color': ['Red', 'Blue', 'Yellow', 'Red', 'Red', 'Red', 'Green'],
'Amount': [1, 1, 1, 1, 1, 1, 1]})
print(df)
agg_df = df.groupby(['Name', 'Car Color']).sum()
print(agg_df)
Output:
                 Amount
Name   Car Color
Carlos Yellow         1
Juan   Green          1
       Red            3
Wilson Blue           1
       Red            1
Note that the resulting dataframe has a multi-index, so you can get the number of red cars that Juan bought by passing a tuple of values to loc.
cars = agg_df.loc[[('Juan', 'Red')]]
print(cars)
Output:
Amount
Name Car Color
Juan Red 3
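To answer the original question ("how many Red Cars Juan bought") directly, a boolean filter followed by a sum avoids the multi-index lookup entirely. A sketch with the Amount data from the first answer:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Juan', 'Wilson', 'Carlos', 'Juan', 'Juan',
                            'Wilson', 'Juan', 'Carlos'],
                   'Car Color': ['Red', 'Blue', 'Yellow', 'Red', 'Red',
                                 'Red', 'Red', 'Green'],
                   'Amount': [24, 28, 40, 22, 29, 33, 31, 50]})

# Filtering with a boolean mask first, then summing, answers the
# single question directly without building the full groupby table.
red_juan = df[(df['Name'] == 'Juan') & (df['Car Color'] == 'Red')]
print(red_juan['Amount'].sum())  # 106
```

The groupby form is still the better choice when you need the totals for every Name/colour combination at once.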

Is there a way to create a pandas dataframe column based on current sort position and another column?

I have multiple columns of data and I want to find where a value lies within its "class" as well as overall.
Here is some example data (let's assume the "class" we're measuring against is eye_color and the metric is score):
import pandas as pd
raw_data = {'name': ['Alex', 'Alicia', 'Omar', 'Louise', 'Alice'],
'age': [20, 19, 35, 24, 32],
'eye_color': ['blue', 'blue', 'brown', "green", "brown"],
'score': [88, 92, 95, 70, 96]}
df = pd.DataFrame(raw_data)
df = df.sort_values(['eye_color', 'score'], ascending=[True, False])
I want to create a column that would use the current sort order to give a value of "Brown1" for Alice, "Brown2" for Omar, "Green1" for Louise, etc.
I'm not sure how to approach this, and I'm fairly sure there's an easy way to do it before I overengineer a function that re-sorts by each class and then recreates an index or something...
Use groupby().cumcount():
df['new'] = df['eye_color'] + df.groupby('eye_color').cumcount().add(1).astype(str)
Output:
name age eye_color score new
1 Alicia 19 blue 92 blue1
0 Alex 20 blue 88 blue2
4 Alice 32 brown 96 brown1
2 Omar 35 brown 95 brown2
3 Louise 24 green 70 green1
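The question asked for capitalized labels like "Brown1"; adding str.capitalize() to the same cumcount idea gives exactly that. A sketch with the question's data:

```python
import pandas as pd

raw_data = {'name': ['Alex', 'Alicia', 'Omar', 'Louise', 'Alice'],
            'age': [20, 19, 35, 24, 32],
            'eye_color': ['blue', 'blue', 'brown', 'green', 'brown'],
            'score': [88, 92, 95, 70, 96]}
df = pd.DataFrame(raw_data)
df = df.sort_values(['eye_color', 'score'], ascending=[True, False])

# cumcount() numbers rows within each eye_color group in the current
# sort order; str.capitalize() produces the "Brown1"-style labels.
df['new'] = (df['eye_color'].str.capitalize()
             + (df.groupby('eye_color').cumcount() + 1).astype(str))
print(df['new'].tolist())  # ['Blue1', 'Blue2', 'Brown1', 'Brown2', 'Green1']
```

Because cumcount respects the frame's current row order within each group, sorting before the groupby is what makes the numbering follow the score ranking.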

filter pandas where some columns contain any of the words in a list

I would like to filter a dataframe. The resulting dataframe should contain all the rows where any of a number of columns contains any of the words in a list.
I started to use for loops, but there should be a better pythonic/pandonic way.
Example:
# importing pandas
import pandas as pd
# Creating the dataframe with dict of lists
df = pd.DataFrame({'Name': ['Geeks', 'Peter', 'James', 'Jack', 'Lisa'],
'Team': ['Boston', 'Boston', 'Boston', 'Chele', 'Barse'],
'Position': ['PG', 'PG', 'UG', 'PG', 'UG'],
'Number': [3, 4, 7, 11, 5],
'Age': [33, 25, 34, 35, 28],
'Height': ['6-2', '6-4', '5-9', '6-1', '5-8'],
'Weight': [89, 79, 113, 78, 84],
'College': ['MIT', 'MIT', 'MIT', 'Stanford', 'Stanford'],
'Salary': [99999, 99994, 89999, 78889, 87779]},
index =['ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
df1 = df[df['Team'].str.contains("Boston") | df['College'].str.contains('MIT')]
print(df1)
So it is clear how to filter columns individually that contain a particular word.
Further on, it is also clear how to filter rows per column containing any of the strings in a list:
df[df.Name.str.contains('|'.join(search_values ))]
Where search_values contains a list of words or strings.
search_values = ['boston','mike','whatever']
I am looking for a short way to code
#pseudocode
give me a subframe of df where any of the columns 'Name','Position','Team' contains any of the words in search_values
I know I can do
df[df['Name'].str.contains('|'.join(search_values)) | df['Position'].str.contains('|'.join(search_values)) | df['Team'].str.contains('|'.join(search_values))]
but if I would have like 20 columns that would be a mess of a line of code
any suggestion?
EDIT Bonus:
When looking in a list of columns, i.e. 'Name', 'Position', 'Team', how can I also include the index? Passing ['index', 'Name', 'Position', 'Team'] does not work.
thanks.
I had a look to this:
https://www.geeksforgeeks.org/get-all-rows-in-a-pandas-dataframe-containing-given-substring/
https://kanoki.org/2019/03/27/pandas-select-rows-by-condition-and-string-operations/
Filter out rows based on list of strings in Pandas
You can also stack and aggregate with any on level=0 (recent pandas versions removed Series.any(level=0), so use groupby(level=0).any() instead):
cols_list = ['Name','Team'] #add column names
df[df[cols_list].stack().str.contains('|'.join(search_values), case=False, na=False)
.groupby(level=0).any()]
Name Team Position Number Age Height Weight College Salary
ind1 Geeks Boston PG 3 33 6-2 89 MIT 99999
ind2 Peter Boston PG 4 25 6-4 79 MIT 99994
ind3 James Boston UG 7 34 5-9 113 MIT 89999
Do apply with any (here c1, c2, ... stand for the column names to search):
df[df[[c1, c2]].apply(lambda x: x.str.contains('|'.join(search_values)), axis=1).any(axis=1)]
You can simply apply in this case:
cols_to_filter = ['Name', 'Position', 'Team']
search_values = ['word1', 'word2']
patt = '|'.join(search_values)
mask = df[cols_to_filter].apply(lambda x: x.str.contains(patt)).any(axis=1)
df[mask]
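For the EDIT bonus, the index itself is not a column, so str.contains cannot see it; reset_index() turns it into one. A runnable sketch (search_values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Geeks', 'Peter', 'James', 'Jack', 'Lisa'],
                   'Team': ['Boston', 'Boston', 'Boston', 'Chele', 'Barse'],
                   'Position': ['PG', 'PG', 'UG', 'PG', 'UG']},
                  index=['ind1', 'ind2', 'ind3', 'ind4', 'ind5'])

search_values = ['boston', 'lisa']
patt = '|'.join(search_values)

# reset_index() turns the index into an ordinary 'index' column, so it
# can be included in the list of columns to search.
searchable = df.reset_index()
mask = (searchable[['index', 'Name', 'Position', 'Team']]
        .apply(lambda col: col.str.contains(patt, case=False, na=False))
        .any(axis=1))
# mask has a fresh 0..n-1 index, so align by position when indexing df.
print(df[mask.to_numpy()])
```

Converting the mask to a NumPy array sidesteps the index mismatch between the reset frame and the original.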
