Python Pandas calculate value_counts of two columns and use groupby - python

I have a dataframe:
data = {'label': ['cat', 'dog', 'dog', 'cat', 'cat'],
        'breeds': ['bengal', 'shar pei', 'pug', 'maine coon', 'maine coon'],
        'nicknames': [['Loki', 'Loki'], ['Max'], ['Toby', 'Zeus ', 'Toby'], ['Marty'], ['Erin ', 'Erin']],
        'eye color': [['blue', 'green'], ['green'], ['brown', 'brown', 'brown'], ['blue'], ['green', 'brown']]}
frame = pd.DataFrame(data)
Output:
  label      breeds            nicknames              eye color
0   cat      bengal         [Loki, Loki]          [blue, green]
1   dog    shar pei                [Max]                [green]
2   dog         pug  [Toby, Zeus , Toby]  [brown, brown, brown]
3   cat  maine coon              [Marty]                 [blue]
4   cat  maine coon        [Erin , Erin]         [green, brown]
I want to group by ['label', 'breeds'] and calculate the number of unique values of 'nicknames' and 'eye color', but output them in separate columns: 'nickname_count' and 'eye_count'.
This code outputs everything in one column; how do I output the counts separately?
frame2 = frame.groupby(['label', 'breeds'])[['nicknames', 'eye color']].apply(lambda x: x.astype('str').value_counts().to_dict())

First, we use a groupby with sum on the list columns, since sum concatenates the lists together:
>>> df_grouped = df.groupby(['label', 'breeds']).agg({'nicknames': sum, 'eye color': sum}).reset_index()
>>> df_grouped
  label      breeds             nicknames              eye color
0   cat      bengal          [Loki, Loki]          [blue, green]
1   cat  maine coon  [Marty, Erin , Erin]   [blue, green, brown]
2   dog         pug   [Toby, Zeus , Toby]  [brown, brown, brown]
3   dog    shar pei                 [Max]                [green]
Then, we can count the number of unique values in each list by converting it to a set, taking its length with len, and saving the output in two new columns to get the expected result:
>>> df_grouped['nickname_count'] = df_grouped['nicknames'].apply(lambda x: len(set(x)))
>>> df_grouped['eye_count'] = df_grouped['eye color'].apply(lambda x: len(set(x)))
>>> df_grouped
  label      breeds             nicknames              eye color  nickname_count  eye_count
0   cat      bengal          [Loki, Loki]          [blue, green]               1          2
1   cat  maine coon  [Marty, Erin , Erin]   [blue, green, brown]               3          3
2   dog         pug   [Toby, Zeus , Toby]  [brown, brown, brown]               2          1
3   dog    shar pei                 [Max]                [green]               1          1
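As a side note (a hedged alternative, not from the original answer), the same per-group unique counts can be computed without concatenating the lists first, by exploding each list column inside the group and calling nunique. A minimal sketch using the question's data (note the trailing spaces in 'Zeus ' and 'Erin ', which make those values distinct, as in the output above):

```python
import pandas as pd

frame = pd.DataFrame({
    'label': ['cat', 'dog', 'dog', 'cat', 'cat'],
    'breeds': ['bengal', 'shar pei', 'pug', 'maine coon', 'maine coon'],
    'nicknames': [['Loki', 'Loki'], ['Max'], ['Toby', 'Zeus ', 'Toby'],
                  ['Marty'], ['Erin ', 'Erin']],
    'eye color': [['blue', 'green'], ['green'], ['brown', 'brown', 'brown'],
                  ['blue'], ['green', 'brown']],
})

# Explode each list column so every element gets its own row,
# then count distinct values per (label, breeds) group.
grouped = frame.groupby(['label', 'breeds'])
nickname_count = grouped['nicknames'].apply(lambda s: s.explode().nunique())
eye_count = grouped['eye color'].apply(lambda s: s.explode().nunique())

result = pd.concat(
    [nickname_count.rename('nickname_count'), eye_count.rename('eye_count')],
    axis=1,
).reset_index()
print(result)
```

This avoids building the intermediate concatenated lists, which can matter when the groups are large.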

Related

How to split phrases into words in a data frame by multiple delimiters, ignoring NaN?

I have a data frame that needs to be split and cleaned. I'm trying to separate phrases into words using regular expressions, but I'm not getting exactly what I want. In addition, I need to lowercase the words and remove extra characters (I wanted to do this with strip() and lower(), but I don't know where to apply them). Another problem is NaN values: they need to be ignored, but they become lists.
Right now my function looks like this:
def splitSentence(Sentence):
    words = re.split(r';|,|/|&|\||-|:| ', str(Sentence))
    words.sort()
    # words.strip(' , !').lower()
    return words

df = pd.DataFrame({'Name': ['Mark', 'Ann', 'John', 'Elsa', 'Emma', 'Andrew', 'Max', 'Rose', 'Donald', 'Hugh', 'Alex'],
                   'Color': [np.nan, np.nan, np.nan, 'blue teal/ blue gray', 'blue| green|grey it changes', 'BLACK!!!!', 'blue&green', 'dichromatic: one blue| one green', 'green;very very orangey brown and blue', 'Hazel, Green,Gray', 'dark-coffee']})
df
      Name                                   Color
0     Mark                                     NaN
1      Ann                                     NaN
2     John                                     NaN
3     Elsa                    blue teal/ blue gray
4     Emma             blue| green|grey it changes
5   Andrew                               BLACK!!!!
6      Max                              blue&green
7     Rose        dichromatic: one blue| one green
8   Donald  green;very very orangey brown and blue
9     Hugh                       Hazel, Green,Gray
10    Alex                             dark-coffee
I apply my function to the dataframe and get this:
df['Color'].apply(lambda x: splitSentence(x))
0 [nan]
1 [nan]
2 [nan]
3 [, blue, blue, gray, teal]
4 [, blue, changes, green, grey, it]
5 [BLACK!!!!]
6 [blue, green]
7 [, , blue, dichromatic, green, one, one]
8 [and, blue, brown, green, orangey, very, very]
9 [, Gray, Green, Hazel]
10 [coffee, dark]
But I need to get this (without the square brackets):
0 NaN
1 NaN
2 NaN
3 blue, gray, teal
4 blue, changes, green, grey, it
5 black
6 blue, green
7 blue, dichromatic, green, one
8 and, blue, brown, green, orangey, very
9 gray, green, hazel
10 coffee, dark
Can you please tell me how I can fix my code? Thanks
import pandas as pd
import numpy as np
import re

# Create DataFrame
df = pd.DataFrame({'Name': ['Mark', 'Ann', 'John', 'Elsa', 'Emma', 'Andrew', 'Max', 'Rose', 'Donald', 'Hugh', 'Alex'],
                   'Color': [np.nan, np.nan, np.nan, 'blue teal/ blue gray', 'blue| green|grey it changes', 'BLACK!!!!', 'blue&green', 'dichromatic: one blue| one green', 'green;very very orangey brown and blue', 'Hazel, Green,Gray', 'dark-coffee']})

# Function
def splitSentence(Sentence):
    # Keep NaN as NaN instead of turning it into the string 'nan'
    if pd.isna(Sentence):
        return Sentence
    # Added ! to the delimiter list
    words = re.split(r';|,|/|&|\||-|:|!| ', str(Sentence))
    # Lowercase, drop empty strings, deduplicate, and sort
    new_words_list = sorted(set(x.lower() for x in words if x != ''))
    # Join all elements in the list with ', '
    return ", ".join(new_words_list)

df['Color'].apply(splitSentence)
Out[27]:
0                                        NaN
1                                        NaN
2                                        NaN
3                           blue, gray, teal
4             blue, changes, green, grey, it
5                                      black
6                                blue, green
7              blue, dichromatic, green, one
8     and, blue, brown, green, orangey, very
9                         gray, green, hazel
10                              coffee, dark
Name: Color, dtype: object
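A more pandas-native variant is also possible (a hedged sketch, not from the original answers): the vectorized string methods propagate NaN automatically, so no explicit NaN check is needed. The character class below is assumed equivalent to the question's delimiter list, with runs of delimiters collapsed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Elsa', 'Andrew', 'Mark'],
    'Color': ['blue teal/ blue gray', 'BLACK!!!!', np.nan],
})

# Lowercase, split on any run of delimiters, drop empties and duplicates;
# NaN passes through untouched because .str methods propagate it.
cleaned = (
    df['Color']
    .str.lower()
    .str.split(r'[;,/&|\-:! ]+', regex=True)
    .apply(lambda ws: ', '.join(sorted(set(w for w in ws if w)))
           if isinstance(ws, list) else ws)
)
print(cleaned)
```

The isinstance check is what leaves the NaN rows alone: split returns a list only for actual strings.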

How to rename Pandas columns based on mapping?

I have a dataframe where column Name contains values such as the following (the rest of the columns do not affect how this question is answered I hope):
Chicken
Chickens
Fluffy Chicken
Whale
Whales
Blue Whale
Tiger
White Tiger
Big Tiger
Now, I want to ensure that we rename these entries to be like the following:
Chicken
Chicken
Chicken
Whale
Whale
Whale
Tiger
Tiger
Tiger
Essentially substituting anything that has 'Chicken' to just be 'Chicken', anything with 'Whale' to be just 'Whale', and anything with 'Tiger' to be just 'Tiger'.
What is the best way to do this? There are almost 1 million rows in the dataframe.
Sorry just to add, I have a list of what we expect i.e.
['Chicken', 'Whale', 'Tiger']
They can appear in any order in the column
I should also add that the column might contain things like "Mushroom" or "Eggs" that do not need substituting from the original list.
Try with str.extract:
l = ['Chicken', 'Whale', 'Tiger']
df['new'] = df['Name'].str.extract('(' + '|'.join(l) + ')')[0]
Out[10]:
0 Chicken
1 Chicken
2 Chicken
3 Whale
4 Whale
5 Whale
6 Tiger
7 Tiger
8 Tiger
Name: 0, dtype: object
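Since the column may also contain values like "Mushroom" that should be left as-is, one option (a hedged addition, not part of the original answer) is to fill the non-matching rows back with the original values; str.extract returns NaN where no keyword matches. A sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Fluffy Chicken', 'Blue Whale', 'Mushroom']})
l = ['Chicken', 'Whale', 'Tiger']

# Rows that match none of the keywords come back as NaN from str.extract,
# so fill them with the original value to leave e.g. 'Mushroom' untouched.
df['new'] = df['Name'].str.extract('(' + '|'.join(l) + ')')[0].fillna(df['Name'])
print(df)
```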

How to split two strings into different columns in Python with Pandas?

I am new to this, and I need to split a column that contains two strings into 2 columns, like this:
Initial dataframe:
    Full String
0  Orange Juice
1     Pink Bird
2     Blue Ball
3     Green Tea
4    Yellow Sun
Final dataframe:
  First String Second String
0       Orange         Juice
1         Pink          Bird
2         Blue          Ball
3        Green           Tea
4       Yellow           Sun
I tried this, but it doesn't work:
df['First String'] , df['Second String'] = df['Full String'].str.split()
and this:
df['First String', 'Second String'] = df['Full String'].str.split()
How to make it work? Thank you!!!
The key here is to include the parameter expand=True in your str.split() to expand the split strings into separate columns.
Type it like this:
df[['First String','Second String']] = df['Full String'].str.split(expand=True)
Output:
    Full String First String Second String
0  Orange Juice       Orange         Juice
1     Pink Bird         Pink          Bird
2     Blue Ball         Blue          Ball
3     Green Tea        Green           Tea
4    Yellow Sun       Yellow           Sun
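A hedged aside, not part of the original answer: if a value could ever contain more than two words, adding n=1 limits the split to the first space, so everything after it stays in the second column:

```python
import pandas as pd

df = pd.DataFrame({'Full String': ['Orange Juice', 'Light Blue Ball']})

# n=1 splits only on the first space: 'Light Blue Ball' -> 'Light', 'Blue Ball'
df[['First String', 'Second String']] = df['Full String'].str.split(' ', n=1, expand=True)
print(df)
```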
Have you tried this solution?
https://stackoverflow.com/a/14745484/15320403
df = pd.DataFrame(df['Full String'].str.split(' ', n=1).tolist(), columns=['First String', 'Second String'])

Find extras between two columns of two dataframes - subtract [duplicate]

This question already has answers here:
Diff of two Dataframes
(7 answers)
Closed 4 years ago.
I have 2 dataframes (df_a and df_b) with 2 columns: 'Animal' and 'Name'.
The bigger dataframe contains extra animals of the same type that the other does not. How do I find those extra animal/name rows, i.e. (df_a - df_b)?
Dataframe A
Animal Name
dog john
dog henry
dog betty
dog smith
cat charlie
fish tango
lion foxtrot
lion lima
Dataframe B
Animal Name
dog john
cat charlie
dog betty
fish tango
lion foxtrot
dog smith
In this case, the extra would be:
Animal Name
dog henry
lion lima
Attempt: I tried using
df_c = df_a.subtract(df_b, axis='columns')
but got the following error: "unsupported operand type(s) for -: 'unicode' and 'unicode'", which makes sense since they are strings, not numbers. Is there any other way?
You are looking for a left_only merge.
merged = pd.merge(df_a,df_b, how='outer', indicator=True)
merged.loc[merged['_merge'] == 'left_only'][['Animal', 'Name']]
Output
Animal Name
1 dog henry
7 lion lima
Explanation:
merged = pd.merge(df_a,df_b, how='outer', indicator=True)
Gives:
Animal Name _merge
0 dog john both
1 dog henry left_only
2 dog betty both
3 dog smith both
4 cat charlie both
5 fish tango both
6 lion foxtrot both
7 lion lima left_only
The extra animals are in df_a only, which is denoted by left_only.
Using isin, after concatenating the string columns row-wise with sum (each row becomes a single string such as 'dogjohn'):
df1[~df1.sum(axis=1).isin(df2.sum(axis=1))]
Out[611]:
Animal Name
1 dog henry
7 lion lima
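A further hedged sketch (not from the original answers): the same difference can be computed without merging, by comparing (Animal, Name) tuples directly:

```python
import pandas as pd

df_a = pd.DataFrame({'Animal': ['dog', 'dog', 'cat', 'lion'],
                     'Name': ['john', 'henry', 'charlie', 'lima']})
df_b = pd.DataFrame({'Animal': ['dog', 'cat'],
                     'Name': ['john', 'charlie']})

# Build (Animal, Name) tuples for df_b, then keep the rows of df_a
# whose pair never appears in df_b.
pairs_b = set(map(tuple, df_b[['Animal', 'Name']].to_numpy()))
extra = df_a[~df_a[['Animal', 'Name']].apply(tuple, axis=1).isin(pairs_b)]
print(extra)
```

Unlike the row-wise sum trick, this cannot produce false matches when column values happen to concatenate to the same string.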

Merge two dataframes by partial string match

I am trying to merge two fairly large dataframes of different sizes based on partial string matches.
df1$code contains all 12 digit codes, while df2$code contains a mix of codes with 10-12 digits, where some of the shorter codes are substring matches to the 12 digit codes in df1$code.
Therefore, I need to merge all 12 digit matches between the two dataframes, but also those records in df2 that have 10-11 digit codes that are substring matches to the df1.
Example dataframes:
df1 <- data.frame(code_1 = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
                  name = c('bob', 'joe', 'sally', 'john', 'lucy', 'alan', 'fred', 'stephanie', 'greg', 'tom'))
df2 <- data.frame(code_2 = c('123456789012', '2109876543', '7890543211', '98765678900', '12345665432', '678905432156'),
                  color = c('blue', 'red', 'green', 'purple', 'orange', 'brown'))
df3 (merged)
code_1 code_2 name color
123456789012 123456789012 bob blue
210987654321 2109876543 joe red
567890543211 7890543211 sally green
987656789001 98765678900 john purple
123456654321 12345665432 lucy orange
678905432156 678905432156 alan brown
Try this SQL join.
library(sqldf)
sqldf("select a.code_1, b.code_2, a.name, b.color
from df2 b left join df1 a on a.code_1 like '%' || b.code_2 || '%'")
giving:
code_1 code_2 name color
1 123456789012 123456789012 bob blue
2 210987654321 2109876543 joe red
3 567890543211 7890543211 sally green
4 987656789001 98765678900 john purple
5 123456654321 12345665432 lucy orange
6 678905432156 678905432156 alan brown
Update: Updated answer to reflect change in question so that (1) the substring can be anywhere in the target string and (2) names of code columns have changed to code_1 and code_2.
We can use grep + sapply to extract indices of matches from df2$code for each df1$code and create a matchID out of it. Next, we merge on matchID to get desired output:
df1$matchID = row.names(df1)
df2$matchID = sapply(df2$code, function(x) grep(x, df1$code))
df_merge = merge(df1, df2, by = "matchID")[-1]
Note that if a df1$code does not match any df2$code, df2$matchID will be blank, and so would not merge with df1$matchID.
Results:
> df2
code color matchID
1 123456789012 blue 1
2 2109876543 red 2
3 7890543211 green 3
4 98765678900 purple 4
5 12345665432 orange 5
6 678905432156 brown 6
7 14124124124 black
> df_merge
code.x name code.y color
1 123456789012 bob 123456789012 blue
2 210987654321 joe 2109876543 red
3 567890543211 sally 7890543211 green
4 987656789001 john 98765678900 purple
5 123456654321 lucy 12345665432 orange
6 678905432156 alan 678905432156 brown
Data (Added non-match for better demo):
df1 <- data.frame(code = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
                  name = c('bob', 'joe', 'sally', 'john', 'lucy', 'alan', 'fred', 'stephanie', 'greg', 'tom'),
                  stringsAsFactors = FALSE)
df2 <- data.frame(code = c('123456789012', '2109876543', '7890543211', '98765678900', '12345665432', '678905432156', '14124124124'),
                  color = c('blue', 'red', 'green', 'purple', 'orange', 'brown', 'black'),
                  stringsAsFactors = FALSE)
Updated per new info. This should work:
df2$New <- lapply(df2$code_2, grep, df1$code_1,value=T)
combined <- merge(df1,df2, by.x="code_1", by.y="New")
code_1 name code_2 color
1 123456654321 lucy 12345665432 orange
2 123456789012 bob 123456789012 blue
3 210987654321 joe 2109876543 red
4 567890543211 sally 7890543211 green
5 678905432156 alan 678905432156 brown
6 987656789001 john 98765678900 purple
In python/pandas, you can do:
from pandas import DataFrame, Series

df1 = DataFrame(dict(
    code1=('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
    name=('bob', 'joe', 'sally', 'john', 'lucy', 'alan', 'fred', 'stephanie', 'greg', 'tom')))
df2 = DataFrame(dict(
    code2=('123456789012', '2109876543', '7890543211', '98765678900', '12345665432', '678905432156'),
    color=('blue', 'red', 'green', 'purple', 'orange', 'brown')))

# Index in df1 of the first 12-digit code containing each shorter code
matches = [df1[df1['code1'].str.contains(x, regex=False)].index[0] for x in df2['code2']]
print(
    # .to_numpy() so the list of df1 indices becomes the new index,
    # rather than being used to reindex df2['code2'] by label
    df1.assign(subcode=Series(data=df2['code2'].to_numpy(), index=matches))
       .merge(df2, left_on='subcode', right_on='code2')
       .drop('subcode', axis='columns')
)
And that dumps:
code1 name code2 color
0 123456789012 bob 123456789012 blue
1 210987654321 joe 2109876543 red
2 567890543211 sally 7890543211 green
3 987656789001 john 98765678900 purple
4 123456654321 lucy 12345665432 orange
5 678905432156 alan 678905432156 brown
Note: I hate using loops with dataframes, but this, uh, works, I guess.
