Parse the column value and save the first section in new column - python

I need to parse the column values in a data frame and save the first parsed section in a new column when the value contains a delimiter such as "-"; otherwise the new column should be left empty.
import pandas as pd
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'code': ['01-02-11-55-00115','11-02-11-55-00445','test', '31-0t-11-55-00115'],
'favorite_color': ['blue', 'blue', 'yellow', 'green'],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)
df.head()
The new column should hold the first parsed section; the expected column values are:
01
11
null
31

df['parsed'] = df['code'].apply(lambda x: x.split('-')[0] if '-' in x else 'null')
will output:
name code favorite_color grade parsed
0 Willard Morris 01-02-11-55-00115 blue 88 01
1 Al Jennings 11-02-11-55-00445 blue 92 11
2 Omar Mullins test yellow 95 null
3 Spencer McDaniel 31-0t-11-55-00115 green 70 31
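A vectorized variant is sketched below, assuming a real missing value (NaN) is acceptable in place of the string 'null'. It splits on "-" with the .str accessor and keeps the first piece only where the delimiter is present. The name has_delim is introduced here for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'code': ['01-02-11-55-00115', '11-02-11-55-00445', 'test', '31-0t-11-55-00115'],
})

# True where the value contains the literal delimiter "-"
has_delim = df['code'].str.contains('-', regex=False)

# Take the first section of every value, then blank out (NaN) the rows
# that had no delimiter to split on
df['parsed'] = df['code'].str.split('-').str[0].where(has_delim)

print(df)
```

Using NaN rather than a placeholder string lets isna() and dropna() work on the new column.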

Related

Pandas dataframe sorting string values and by descending aggregated values

I'm working on transforming a dataframe to show the top 3 earners.
The dataframe looks like this
import pandas as pd
data = {'Name': ['Allistair', 'Bob', 'Carrie', 'Diane', 'Allistair', 'Bob', 'Carrie','Evelyn'], 'Sale': [20, 21, 19, 18, 5, 300, 35, 22]}
df = pd.DataFrame(data)
print(df)
Name Sale
0 Allistair 20
1 Bob 21
2 Carrie 19
3 Diane 18
4 Allistair 5
5 Bob 300
6 Carrie 35
7 Evelyn 22
In my actual dataset I have several more columns and rows, and I want to end up with something like
Name Sale
0 Bob 321
1 Carrie 54
2 Allistair 25
None of the approaches I've found quite gets there, because I get the error
'Name' is both an index level and a column label, which is ambiguous.
Use groupby:
>>> df.groupby('Name').sum().sort_values('Sale', ascending=False)
Sale
Name
Bob 321
Carrie 54
Allistair 25
Evelyn 22
Diane 18
Thanks to @Andrej Kasely above, this returns the top 3:
df.groupby("Name")["Sale"].sum().nlargest(3)
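If a DataFrame with a fresh 0..2 index is wanted, as in the desired output, one possible sketch is shown below. It assumes the same data as in the question; top3 is a name chosen here for illustration. Passing as_index=False keeps Name as an ordinary column instead of moving it into the index.

```python
import pandas as pd

data = {'Name': ['Allistair', 'Bob', 'Carrie', 'Diane', 'Allistair', 'Bob', 'Carrie', 'Evelyn'],
        'Sale': [20, 21, 19, 18, 5, 300, 35, 22]}
df = pd.DataFrame(data)

top3 = (
    df.groupby('Name', as_index=False)['Sale'].sum()  # one row per name, Name stays a column
      .nlargest(3, 'Sale')                            # three largest totals, highest first
      .reset_index(drop=True)                         # renumber rows 0, 1, 2
)
print(top3)
```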

Is there a way to create a pandas dataframe column based on current sort position and another column?

I have multiple columns of data and I want to find where a value lies within its "class" as well as overall.
Here is some example data (let's assume the "class" we're measuring against is eye_color and the metric is score):
import pandas as pd
raw_data = {'name': ['Alex', 'Alicia', 'Omar', 'Louise', 'Alice'],
'age': [20, 19, 35, 24, 32],
'eye_color': ['blue', 'blue', 'brown', "green", "brown"],
'score': [88, 92, 95, 70, 96]}
df = pd.DataFrame(raw_data)
df = df.sort_values(['eye_color', 'score'], ascending=[True, False])
I want to create a column that would use the current sort order to give a value of "Brown1" for Alice, "Brown2" for Omar, "Green1" for Louise, etc.
I'm not sure how to approach this, and I'm fairly sure there's an easy way to do it before I overengineer a function that re-sorts based on each class and then recreates an index or something...
Use groupby().cumcount():
df['new'] = df['eye_color'] + df.groupby('eye_color').cumcount().add(1).astype(str)
Output:
name age eye_color score new
1 Alicia 19 blue 92 blue1
0 Alex 20 blue 88 blue2
4 Alice 32 brown 96 brown1
2 Omar 35 brown 95 brown2
3 Louise 24 green 70 green1
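The question also asks for the position overall. A possible sketch using rank is shown below; it does not depend on the current sort order. The column names class_rank, overall_rank and label are introduced here for illustration, and method="first" is an assumed tie-breaking choice.

```python
import pandas as pd

raw_data = {'name': ['Alex', 'Alicia', 'Omar', 'Louise', 'Alice'],
            'age': [20, 19, 35, 24, 32],
            'eye_color': ['blue', 'blue', 'brown', 'green', 'brown'],
            'score': [88, 92, 95, 70, 96]}
df = pd.DataFrame(raw_data)

# Position within each eye_color group, highest score first
df['class_rank'] = (
    df.groupby('eye_color')['score']
      .rank(ascending=False, method='first')
      .astype(int)
)

# Position across the whole frame, highest score first
df['overall_rank'] = df['score'].rank(ascending=False, method='first').astype(int)

# Combined label such as "brown1"
df['label'] = df['eye_color'] + df['class_rank'].astype(str)

print(df.sort_values(['eye_color', 'class_rank']))
```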

Filter dataframes based on 2 columns of another dataframe in python

I have a DataFrame like this:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'nick', 'juli', 'Tom', 'nick', 'juli','nick', 'juli','Tom'], 'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo', 'maths', 'maths', 'geo', 'science'], 'marks':[99, 98, 95, 90, 99, 98, 97, 95, 96, 98]}
df1 = pd.DataFrame(data)
df1
Name subject marks
0 Tom eng 99
1 Jack maths 98
2 nick geo 95
3 juli maths 90
4 Tom science 99
5 nick geo 98
6 juli maths 97
7 nick maths 95
8 juli geo 96
9 Tom science 98
another dataframe as :
data2 = {'Name':['Jack', 'nick', 'Tom', 'juli', 'Tom', 'nick','nick', 'juli'], 'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo', 'maths', 'geo']}
df2 = pd.DataFrame(data2)
df2
Name subject
0 Jack eng
1 nick maths
2 Tom geo
3 juli maths
4 Tom science
5 nick geo
6 nick maths
7 juli geo
I want to filter df2 based on the combination of 'Name' and 'subject' in df1. If a particular combination of 'Name' and 'subject' appears more than once in df1 and also appears in df2, the matching rows from df2 should be returned as output.
Desired output:
pd.DataFrame({'Names':['Tom', 'juli', 'nick'], 'subject': ['science', 'maths', 'geo']})
Name subject
0 nick geo
1 juli maths
2 Tom science
Can anyone help without using the 'merge' option?
I believe you need to filter only the rows with duplicated values. Chain DataFrame.duplicated with keep=False and DataFrame.duplicated without this parameter, which keeps the duplicated rows while dropping the first row of each group, and then use merge for an inner join:
df11 = df1[df1.duplicated(subset=['Name','subject'], keep=False) &
df1.duplicated(subset=['Name','subject'])]
df3 = df11.merge(df2, suffixes=('_',''))[df2.columns]
print (df3)
Name subject
0 nick geo
1 juli maths
2 Tom science
Another similar idea is to filter df1 down to the columns of df2 before the merge:
cols = df2.columns
df11 = df1.loc[df1[cols].duplicated(keep=False) & df1[cols].duplicated(), cols]
df3 = df11.merge(df2)
print (df3)
Name subject
0 nick geo
1 juli maths
2 Tom science
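Since the question asks for a solution without merge, a merge-free sketch is given below. It compares the (Name, subject) pairs through a MultiIndex and isin; the names keys, dup_keys, wanted and result are introduced here for illustration. Note that the rows come back in df2's order, which differs from the row order shown in the desired output.

```python
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'nick', 'juli', 'Tom', 'nick', 'juli', 'nick', 'juli', 'Tom'],
        'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo', 'maths', 'maths', 'geo', 'science'],
        'marks': [99, 98, 95, 90, 99, 98, 97, 95, 96, 98]}
df1 = pd.DataFrame(data)

data2 = {'Name': ['Jack', 'nick', 'Tom', 'juli', 'Tom', 'nick', 'nick', 'juli'],
         'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo', 'maths', 'geo']}
df2 = pd.DataFrame(data2)

keys = ['Name', 'subject']

# Pairs that occur more than once in df1 (one row per pair)
dup_keys = df1.loc[df1.duplicated(subset=keys), keys].drop_duplicates()
wanted = pd.MultiIndex.from_frame(dup_keys)

# Boolean mask: which df2 rows carry one of the wanted pairs
mask = pd.MultiIndex.from_frame(df2[keys]).isin(wanted)

result = df2[mask].drop_duplicates().reset_index(drop=True)
print(result)
```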

How to calculate specific cell values in python pandas?

I have a pandas dataframe as follows:
import pandas as pd
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue', 'blue', 'yellow', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)
I want to divide the numeric age and grade values by 125.0 in rows where favorite_color is blue, by 130.0 where it is yellow, and by 135.0 where it is green. The results must be inserted in new columns age_new and grade_new.
With the code below I receive an error.
df['age_new'] =(df.loc[df['favorite_color']=='blue']/125.0)
df['age_new'] =(df.loc[df['favorite_color']=='yellow']/130.0)
df['age_new'] =(df.loc[df['favorite_color']=='green']/135.0)
df['grade_new'] =(df.loc[df['favorite_color']=='blue']/125.0)
df['grade_new'] =(df.loc[df['favorite_color']=='yellow']/130.0)
df['grade_new'] =(df.loc[df['favorite_color']=='green']/135.0)
Error:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
Use map:
mods = {'blue': 125, 'yellow': 130, 'green': 135}
df.assign(
mods=df.favorite_color.map(mods),
age_new=lambda d: d.age / d.mods,
grade_new=lambda d: d.grade / d.mods
)
name age favorite_color grade mods age_new grade_new
0 Willard Morris 20 blue 88 125 0.160000 0.704000
1 Al Jennings 19 blue 92 125 0.152000 0.736000
2 Omar Mullins 22 yellow 95 130 0.169231 0.730769
3 Spencer McDaniel 21 green 70 135 0.155556 0.518519
Similarly, with div and join:
mods = {'blue': 125, 'yellow': 130, 'green': 135}
df.join(df[['age', 'grade']].div(df.favorite_color.map(mods), axis=0).add_suffix('_new'))
name age favorite_color grade age_new grade_new
0 Willard Morris 20 blue 88 0.160000 0.704000
1 Al Jennings 19 blue 92 0.152000 0.736000
2 Omar Mullins 22 yellow 95 0.169231 0.730769
3 Spencer McDaniel 21 green 70 0.155556 0.518519
You can use .replace instead of .loc, so that you only perform the operation once.
import pandas as pd
raw_data = {
'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue', 'blue', 'yellow', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)
color_d = {
"blue": 125,
"yellow": 130,
"green": 135
}
df[["age_new", "grade_new"]] = df[["age", "grade"]].div(
df['favorite_color'].replace(color_d),
axis=0)
df.head()
Which gives
name age favorite_color grade age_new grade_new
0 Willard Morris 20 blue 88 0.160000 0.704000
1 Al Jennings 19 blue 92 0.152000 0.736000
2 Omar Mullins 22 yellow 95 0.169231 0.730769
3 Spencer McDaniel 21 green 70 0.155556 0.518519
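For reference, the TypeError in the question arises because df.loc[mask] selects whole rows, including the string columns name and favorite_color, and strings cannot be divided by a number. A minimal sketch of the original .loc idea that selects only the numeric column on each side is shown below; the names divisors and mask are introduced here for illustration. The map and replace answers above remain the more concise options.

```python
import pandas as pd

raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
            'age': [20, 19, 22, 21],
            'favorite_color': ['blue', 'blue', 'yellow', 'green'],
            'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)

divisors = {'blue': 125.0, 'yellow': 130.0, 'green': 135.0}

# Start with empty float columns so every row has a defined value
df['age_new'] = float('nan')
df['grade_new'] = float('nan')

for color, divisor in divisors.items():
    mask = df['favorite_color'] == color
    # Select a single numeric column, not the whole row
    df.loc[mask, 'age_new'] = df.loc[mask, 'age'] / divisor
    df.loc[mask, 'grade_new'] = df.loc[mask, 'grade'] / divisor

print(df)
```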

Remove characters from a string in a dataframe

Python beginner here. I would like to change some characters in a column in a dataframe under certain conditions.
The dataframe looks like this:
import pandas as pd
import numpy as np
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue (VS)', 'red', 'yellow (AG)', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data, index = ['0', '1', '2', '3'])
df
My goal is to remove, in the column last name, the space followed by the two letters in parentheses:
blue instead of blue (VS).
There are 26 letter variations that I have to remove, but only one format: last_name followed by a space, an opening parenthesis, two letters, and a closing parenthesis.
From what I understand, the regexp should be:
( \(..\)
I tried using str.replace but it only works for exact match and it replaces the whole value.
I also tried this:
df.loc[df['favorite_color'].str.contains('VS'), 'favorite_color'] = 'random'
it also replaces the whole value.
I saw that I can only rewrite the value but I also saw that using this:
df[0].str.slice(0, -5)
I could remove the last 5 characters of a string containing my search.
In my mind I should make a list of the 26 occurrences that I want removed and parse through the column to remove those while keeping the text before them. I searched for posts similar to my problem but could not find a solution. Do you have any idea for a direction?
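You can use str.replace with the pattern r"\(.*?\)" and regex=True (recent pandas versions treat the pattern as a literal string unless regex=True is passed):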
Ex:
import pandas as pd
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue (VS)', 'red', 'yellow (AG)', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data, index = ['0', '1', '2', '3'])
df["newCol"] = df["favorite_color"].str.replace(r"\(.*?\)", "", regex=True).str.strip()
print( df )
Output:
age favorite_color grade name newCol
0 20 blue (VS) 88 Willard Morris blue
1 19 red 92 Al Jennings red
2 22 yellow (AG) 95 Omar Mullins yellow
3 21 green 70 Spencer McDaniel green
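The question describes one fixed format: a space, then two letters in parentheses at the end of the value. A stricter pattern for that case is sketched below, under the assumption that the two letters are always uppercase. The column name clean is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'favorite_color': ['blue (VS)', 'red', 'yellow (AG)', 'green'],
})

# \s*        optional whitespace before the parenthesis
# \([A-Z]{2}\)  exactly two uppercase letters inside literal parentheses
# $          only at the end of the value
df['clean'] = df['favorite_color'].str.replace(r'\s*\([A-Z]{2}\)$', '', regex=True)

print(df)
```

Anchoring the pattern to the end of the string leaves any other parenthesized text in the value untouched, so no list of the 26 variations is needed.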
