Complex Multiindex Sort with Categorical Values in Pandas - python

I have a dataframe with 4 categorical index levels:
grey cat male ralph ...
grey cat female bessie ...
yellow parrot female lisa ...
black dog male fido ...
orange parrot female janie ...
orange parrot male pete ...
black dog male will ...
grey cat female wanda ...
white dog female karen ...
black cat male albert ...
I want to sort the data in the following order, without referring specifically to the index values:
First by animal
Second by color
Third by gender
I'll potentially want to slice by (animal, color, gender) groups.
Where the first, second, and third categorical values are the same, I want the records sorted in ascending alphabetical order of the fourth level (see grey, cat, female below: bessie is ordered before wanda). On reflection, level 4 (name) may not need to be an index level.
So the resulting dataframe would look like the following (only indices shown):
black cat male albert ...
grey cat female bessie ...
grey cat female wanda ...
grey cat male ralph ...
black dog male fido ...
black dog male will ...
white dog female karen ...
orange parrot female janie ...
orange parrot male pete ...
yellow parrot female lisa ...
I may use the code for other data sets, so I want to write it generically (not referring to the specific contents of the data set).
I'm stumped. Can someone provide some guidance?
Thanks.
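One possible approach, as a minimal sketch: pass the desired level order to sort_index. The level names ("color", "animal", "gender", "name"), the "value" column, and the helper sort_by_levels are assumptions made for illustration; the question does not name the levels. The sample uses plain string levels; with ordered categorical levels the sort should follow the category order instead of alphabetical order.

```python
import pandas as pd

# Index tuples in the order given in the question (level names are assumed)
rows = [
    ("grey", "cat", "male", "ralph"),
    ("grey", "cat", "female", "bessie"),
    ("yellow", "parrot", "female", "lisa"),
    ("black", "dog", "male", "fido"),
    ("orange", "parrot", "female", "janie"),
    ("orange", "parrot", "male", "pete"),
    ("black", "dog", "male", "will"),
    ("grey", "cat", "female", "wanda"),
    ("white", "dog", "female", "karen"),
    ("black", "cat", "male", "albert"),
]
index = pd.MultiIndex.from_tuples(rows, names=["color", "animal", "gender", "name"])
df = pd.DataFrame({"value": range(len(rows))}, index=index)


def sort_by_levels(frame, order):
    """Sort rows by the given index levels, in that priority order (hypothetical helper)."""
    return frame.sort_index(level=order)


# Sort by animal, then color, then gender, then name; the level layout is unchanged
result = sort_by_levels(df, ["animal", "color", "gender", "name"])
print(result.index.get_level_values("name").tolist())
```

Because the helper only takes a list of level names, it does not depend on the contents of any particular data set.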

Related

How to rename Pandas columns based on mapping?

I have a dataframe where column Name contains values such as the following (the rest of the columns do not affect how this question is answered I hope):
Chicken
Chickens
Fluffy Chicken
Whale
Whales
Blue Whale
Tiger
White Tiger
Big Tiger
Now, I want to ensure that we rename these entries to be like the following:
Chicken
Chicken
Chicken
Whale
Whale
Whale
Tiger
Tiger
Tiger
Essentially substituting anything that contains 'Chicken' with just 'Chicken', anything that contains 'Whale' with just 'Whale', and anything that contains 'Tiger' with just 'Tiger'.
What is the best way to do this? There are almost 1 million rows in the dataframe.
Sorry just to add, I have a list of what we expect i.e.
['Chicken', 'Whale', 'Tiger']
They can appear in any order in the column
What I should also add is, the column might contain things like "Mushroom" or "Eggs" that do not need substituting from the original list.
Try with str.extract
#l = ['Chicken', 'Whale', 'Tiger']
df['new'] = df['col'].str.extract('('+'|'.join(l)+')')[0]
Out[10]:
0 Chicken
1 Chicken
2 Chicken
3 Whale
4 Whale
5 Whale
6 Tiger
7 Tiger
8 Tiger
Name: 0, dtype: object
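Note that str.extract returns NaN for rows that match none of the expected words, such as "Mushroom" or "Eggs". A minimal sketch of one way to keep those values, assuming the column is called Name as in the question (the variable names expected, pattern, and extracted are illustrative):

```python
import re

import pandas as pd

df = pd.DataFrame({"Name": [
    "Chicken", "Chickens", "Fluffy Chicken",
    "Whale", "Whales", "Blue Whale",
    "Tiger", "White Tiger", "Big Tiger",
    "Mushroom", "Eggs",
]})
expected = ["Chicken", "Whale", "Tiger"]

# Build one alternation pattern; re.escape guards against regex metacharacters
pattern = "(" + "|".join(re.escape(word) for word in expected) + ")"

# expand=False returns a Series when the pattern has a single capture group
extracted = df["Name"].str.extract(pattern, expand=False)

# Rows with no match are NaN after extract, so fall back to the original value
df["Name"] = extracted.fillna(df["Name"])
print(df["Name"].tolist())
```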

Python Pandas calculate value_counts of two columns and use groupby

I have a dataframe :
data = {'label': ['cat','dog','dog','cat','cat'],
        'breeds': [ 'bengal','shar pei','pug','maine coon','maine coon'],
        'nicknames':[['Loki','Loki' ],['Max'],['Toby','Zeus ','Toby'],['Marty'],['Erin ','Erin']],
        'eye color':[['blue','green'],['green'],['brown','brown','brown'],['blue'],['green','brown']]}
frame = pd.DataFrame(data)
Output:
label breeds nicknames eye color
0 cat bengal [Loki,Loki] [blue, green]
1 dog shar pei [Max] [green]
2 dog pug [Toby,Zeus,Toby] [brown, brown, brown]
3 cat maine coon [Marty] [blue]
4 cat maine coon [Erin,Erin] [green, brown]
I want to group by ['label', 'breeds'] and count the unique values of nicknames and eye color, but output the counts in separate columns: 'nickname_count' and 'eye_count'.
This code outputs only in one column, how do I output separately?
frame2=frame.groupby(['label','breeds'])['nicknames','eye color'].apply(lambda x: x.astype('str').value_counts().to_dict())
First, we use a groupby with sum on the lists, since sum concatenates the lists together:
>>> df_grouped = df.groupby(['label', 'breeds']).agg({'nicknames': sum, 'eye color': sum}).reset_index()
>>> df_grouped
label breeds nicknames eye color
0 cat bengal [Loki, Loki] [blue, green]
1 cat maine coon [Marty, Erin , Erin] [blue, green, brown]
2 dog pug [Toby, Zeus , Toby] [brown, brown, brown]
3 dog shar pei [Max] [green]
Then, we can count the number of unique values in each list by converting it to a set and taking its length, saving the output in two new columns to get the expected result:
>>> df_grouped['nickname_count'] = df_grouped['nicknames'].apply(lambda x: list(set(x))).str.len()
>>> df_grouped['eye_count'] = df_grouped['eye color'].apply(lambda x: list(set(x))).str.len()
>>> df_grouped
label breeds nicknames eye color nickname_count eye_count
0 cat bengal [Loki, Loki] [blue, green] 1 2
1 cat maine coon [Marty, Erin , Erin] [blue, green, brown] 3 3
2 dog pug [Toby, Zeus , Toby] [brown, brown, brown] 2 1
3 dog shar pei [Max] [green] 1 1
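The two steps above can also be combined with named aggregation, so the counts land in separate columns directly. This is only a sketch; count_unique is a hypothetical helper, not part of pandas. As in the output above, 'Erin ' (with a trailing space) and 'Erin' are counted as different values.

```python
import pandas as pd

frame = pd.DataFrame({
    "label": ["cat", "dog", "dog", "cat", "cat"],
    "breeds": ["bengal", "shar pei", "pug", "maine coon", "maine coon"],
    "nicknames": [["Loki", "Loki"], ["Max"], ["Toby", "Zeus ", "Toby"], ["Marty"], ["Erin ", "Erin"]],
    "eye color": [["blue", "green"], ["green"], ["brown", "brown", "brown"], ["blue"], ["green", "brown"]],
})


def count_unique(lists):
    """Flatten every list in the group and count the distinct items (hypothetical helper)."""
    return len({item for sublist in lists for item in sublist})


# Named aggregation: output_column=(input_column, function)
counts = (
    frame.groupby(["label", "breeds"])
    .agg(
        nickname_count=("nicknames", count_unique),
        eye_count=("eye color", count_unique),
    )
    .reset_index()
)
print(counts)
```

If trailing whitespace should not create separate values, the items could be stripped inside count_unique before building the set.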

How can I match a dataframe's rows in python, creating a new column based on correct matching?

Working in python with a dataframe, I'm trying to match certain rows and create a new column based on successful matching - e.g. if 'Breed' + 'Color' match, put 'Name' of matched row in 'Mate' column of the Male in the pair. For example, in the table below Adam/Eve and Antony/Cleopatra should be matched, resulting in Eve and Cleopatra being put in the 'Mate' column for Adam and Antony, respectively. Since Clyde and Beauty have different breeds, this does not occur.
Name Breed Color Sex Mate?
Adam Boxer White Male (Eve)
Eve Boxer White Female
Antony Lab Chocolate Male (Cleopatra)
Cleopatra Lab Chocolate Female
Clyde Husky Gray Male
Beauty Bulldog Gray Female
Thanks!!
from collections import defaultdict
# First collect the Potential names of mates
mates = defaultdict(list)
for row in df.itertuples():
    props = (row.Breed, row.Color)
    mates[props].append(row.Name)
# Secondly create Mate column
def return_mates(name, breed, color):
    match = (breed, color)
    return [m for m in mates[match] if m != name]
df.loc[:, 'Mate'] = df[['Name', 'Breed', 'Color']].apply(lambda x: return_mates(*x), axis=1)
One way:
df['Mate?'] = df.Name.map(dict(df.groupby(['Breed', 'Color'])['Name'].agg(list).apply(pd.Series).dropna().values)).fillna('')
OUTPUT:
Name Breed Color Sex Mate?
0 Adam Boxer White Male Eve
1 Eve Boxer White Female
2 Antony Lab Chocolate Male Cleopatra
3 Cleopatra Lab Chocolate Female
4 Clyde Husky Gray Male
5 Beauty Bulldog Gray Female
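Another option, sketched under some assumptions: pair males with females through a merge on Breed and Color, so the result does not depend on the row order within each group. This assumes names are unique and that each Breed/Color pair has at most one female; the column name Mate and the intermediate names (females, males, mate_by_name) are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Adam", "Eve", "Antony", "Cleopatra", "Clyde", "Beauty"],
    "Breed": ["Boxer", "Boxer", "Lab", "Lab", "Husky", "Bulldog"],
    "Color": ["White", "White", "Chocolate", "Chocolate", "Gray", "Gray"],
    "Sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
})

# Females become the candidate mates, keyed by Breed and Color
females = (
    df.loc[df["Sex"] == "Female", ["Breed", "Color", "Name"]]
    .rename(columns={"Name": "Mate"})
)

# An inner merge keeps only males that have a female with the same Breed and Color
males = df.loc[df["Sex"] == "Male", ["Name", "Breed", "Color"]].merge(
    females, on=["Breed", "Color"], how="inner"
)

# Map each male's name to his mate; everyone else gets an empty string
mate_by_name = males.set_index("Name")["Mate"]
df["Mate"] = df["Name"].map(mate_by_name).fillna("")
print(df)
```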

Find extras between two columns of two dataframes - subtract [duplicate]

I have 2 dataframes (df_a and df_b) with 2 columns: 'Animal' and 'Name'.
In the bigger dataframe, there are more animals of the same type than the other. How do I find the extra animals of the same type by name? i.e. (df_a - df_b)
Dataframe A
Animal Name
dog john
dog henry
dog betty
dog smith
cat charlie
fish tango
lion foxtrot
lion lima
Dataframe B
Animal Name
dog john
cat charlie
dog betty
fish tango
lion foxtrot
dog smith
In this case, the extra would be:
Animal Name
dog henry
lion lima
Attempt: I tried using
df_c = df_a.subtract(df_b, axis='columns')
but got the following error "unsupported operand type(s) for -: 'unicode' and 'unicode'", which makes sense since they are strings not numbers. Is there any other way?
You are looking for a left_only merge.
merged = pd.merge(df_a,df_b, how='outer', indicator=True)
merged.loc[merged['_merge'] == 'left_only'][['Animal', 'Name']]
Output
Animal Name
1 dog henry
7 lion lima
Explanation:
merged = pd.merge(df_a,df_b, how='outer', indicator=True)
Gives:
Animal Name _merge
0 dog john both
1 dog henry left_only
2 dog betty both
3 dog smith both
4 cat charlie both
5 fish tango both
6 lion foxtrot both
7 lion lima left_only
The extra animals are in df_a only, which is denoted by left_only.
Using isin
df_a[~df_a.sum(1).isin(df_b.sum(1))]
Out[611]:
Animal Name
1 dog henry
7 lion lima
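The isin approach above concatenates the strings in each row, so two different rows could collide (for example 'do' + 'gjohn' and 'dog' + 'john'). A hedged alternative sketch compares whole (Animal, Name) pairs through a MultiIndex; the names keys_a, keys_b, and extra are illustrative.

```python
import pandas as pd

df_a = pd.DataFrame({
    "Animal": ["dog", "dog", "dog", "dog", "cat", "fish", "lion", "lion"],
    "Name": ["john", "henry", "betty", "smith", "charlie", "tango", "foxtrot", "lima"],
})
df_b = pd.DataFrame({
    "Animal": ["dog", "cat", "dog", "fish", "lion", "dog"],
    "Name": ["john", "charlie", "betty", "tango", "foxtrot", "smith"],
})

# Treat each row as an (Animal, Name) pair rather than one concatenated string
keys_a = pd.MultiIndex.from_frame(df_a[["Animal", "Name"]])
keys_b = pd.MultiIndex.from_frame(df_b[["Animal", "Name"]])

# Keep the rows of df_a whose pair does not occur anywhere in df_b
extra = df_a[~keys_a.isin(keys_b)]
print(extra)
```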

Merge two dataframes by partial string match

I am trying to merge two fairly large dataframes of different sizes based on partial string matches.
df1$code contains all 12 digit codes, while df2$code contains a mix of codes with 10-12 digits, where some of the shorter codes are substring matches to the 12 digit codes in df1$code.
Therefore, I need to merge all 12 digit matches between the two dataframes, but also those records in df2 that have 10-11 digit codes that are substring matches to the df1.
Example dataframes:
df1 <- data.frame(code_1 = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'))
df2 <- data.frame(code_2 = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
color = c('blue', 'red', 'green', 'purple', 'orange', 'brown'))
df3 (merged)
code_1 code_2 name color
123456789012 123456789012 bob blue
210987654321 2109876543 joe red
567890543211 7890543211 sally green
987656789001 98765678900 john purple
123456654321 12345665432 lucy orange
678905432156 678905432156 alan brown
Try this SQL join.
library(sqldf)
sqldf("select a.code_1, b.code_2, a.name, b.color
from df2 b left join df1 a on a.code_1 like '%' || b.code_2 || '%'")
giving:
code_1 code_2 name color
1 123456789012 123456789012 bob blue
2 210987654321 2109876543 joe red
3 567890543211 7890543211 sally green
4 987656789001 98765678900 john purple
5 123456654321 12345665432 lucy orange
6 678905432156 678905432156 alan brown
Update: Updated answer to reflect change in question so that (1) the substring can be anywhere in the target string and (2) names of code columns have changed to code_1 and code_2.
We can use grep + sapply to extract indices of matches from df2$code for each df1$code and create a matchID out of it. Next, we merge on matchID to get desired output:
df1$matchID = row.names(df1)
df2$matchID = sapply(df2$code, function(x) grep(x, df1$code))
df_merge = merge(df1, df2, by = "matchID")[-1]
Note that if a df1$code does not match any df2$code, df2$matchID will be blank, and so would not merge with df1$matchID.
Results:
> df2
code color matchID
1 123456789012 blue 1
2 2109876543 red 2
3 7890543211 green 3
4 98765678900 purple 4
5 12345665432 orange 5
6 678905432156 brown 6
7 14124124124 black
> df_merge
code.x name code.y color
1 123456789012 bob 123456789012 blue
2 210987654321 joe 2109876543 red
3 567890543211 sally 7890543211 green
4 987656789001 john 98765678900 purple
5 123456654321 lucy 12345665432 orange
6 678905432156 alan 678905432156 brown
Data (Added non-match for better demo):
df1 <- data.frame(code = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'),
stringsAsFactors = FALSE)
df2 <- data.frame(code = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156', '14124124124'),
color = c('blue', 'red', 'green', 'purple', 'orange', 'brown', 'black'),
stringsAsFactors = FALSE)
Updated per new info. This should work:
df2$New <- lapply(df2$code_2, grep, df1$code_1,value=T)
combined <- merge(df1,df2, by.x="code_1", by.y="New")
code_1 name code_2 color
1 123456654321 lucy 12345665432 orange
2 123456789012 bob 123456789012 blue
3 210987654321 joe 2109876543 red
4 567890543211 sally 7890543211 green
5 678905432156 alan 678905432156 brown
6 987656789001 john 98765678900 purple
In python/pandas, you can do:
from pandas import DataFrame, Series
df1 = DataFrame(dict(
    code1 = ('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
    name = ('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom')))
df2 = DataFrame(dict(
    code2 = ('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
    color = ('blue', 'red', 'green', 'purple', 'orange', 'brown')))
# For each df2 code, find the index of the first df1 row whose code contains it
matches = [df1[df1['code1'].str.contains(x)].index[0] for x in df2['code2']]
print(
    # .to_numpy() assigns the codes by position; passing the Series itself would realign on df2's index
    df1.assign(subcode=Series(data=df2['code2'].to_numpy(), index=matches))
    .merge(df2, left_on='subcode', right_on='code2')
    .drop('subcode', axis='columns')
)
And that dumps:
code1 name code2 color
0 123456789012 bob 123456789012 blue
1 210987654321 joe 2109876543 red
2 567890543211 sally 7890543211 green
3 987656789001 john 98765678900 purple
4 123456654321 lucy 12345665432 orange
5 678905432156 alan 678905432156 brown
Note: I hate using loops with dataframes, but this, uh, works, I guess.
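A loop-free variant in pandas, as a sketch only: build every (df1, df2) pairing with a cross join and keep the pairs where code2 is a substring of code1. This assumes pandas 1.2 or later for how="cross", and it creates len(df1) * len(df2) intermediate rows, so it may not be practical for very large dataframes. The names pairs, is_substring, and merged are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "code1": ["123456789012", "210987654321", "567890543211", "987656789001",
              "123456654321", "678905432156", "768927461037", "780125634701",
              "673940175372", "167438501473"],
    "name": ["bob", "joe", "sally", "john", "lucy", "alan",
             "fred", "stephanie", "greg", "tom"],
})
df2 = pd.DataFrame({
    "code2": ["123456789012", "2109876543", "7890543211",
              "98765678900", "12345665432", "678905432156"],
    "color": ["blue", "red", "green", "purple", "orange", "brown"],
})

# Cross join: one row for every combination of a df1 row and a df2 row
pairs = df1.merge(df2, how="cross")

# Plain substring test (no regex), evaluated pair by pair
is_substring = [c2 in c1 for c1, c2 in zip(pairs["code1"], pairs["code2"])]

merged = pairs[is_substring].reset_index(drop=True)
print(merged)
```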
