Pandas - Multiindex Division [i.e. Division by Group] - python

Aim: I'm trying to divide each row in a multilevel index by the total number in each group.
More specifically: given the following data, I want to divide the number of Red and Blue marbles by the total in each group (i.e. the sum over Colour within each Date and Country).
                      Number
Date Country Colour
2011 US      Red           4
             Blue          6
2012 IN      Red           9
     IE      Red           5
             Blue          5
2013 JP      Red          15
             Blue         25
This would give the following answer:
                      Number
Date Country Colour
2011 US      Red         0.4
             Blue        0.6
2012 IN      Red         1.0
     IE      Red         0.5
             Blue        0.5
2013 JP      Red       0.375
             Blue      0.625
Here is the code to reproduce the data:
import numpy as np
import pandas as pd

arrays = [np.array(['2011', '2011', '2012', '2012', '2012', '2013', '2013']),
          np.array(['US', 'US', 'IN', 'IE', 'IE', 'JP', 'JP']),
          np.array(['Red', 'Blue', 'Red', 'Red', 'Blue', 'Red', 'Blue'])]
df = pd.DataFrame([4, 6, 9, 5, 5, 15, 25], index=arrays, columns=['Number'])
df.index.names = ['Date', 'Country', 'Colour']

This can be done in a single step with groupby and transform:
df.groupby(level=['Date', 'Country']).transform(lambda x: x/x.sum())
                      Number
Date Country Colour
2011 US      Red       0.400
             Blue      0.600
2012 IN      Red       1.000
     IE      Red       0.500
             Blue      0.500
2013 JP      Red       0.375
             Blue      0.625
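An equivalent way to write this, assuming the df built above, is to broadcast each group's total back to every row with transform('sum') and divide; the built-in 'sum' is usually faster than a Python lambda:

# Per-(Date, Country) totals aligned to the original rows, then an elementwise division.
group_totals = df.groupby(level=['Date', 'Country'])['Number'].transform('sum')
print(df['Number'] / group_totals)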

Related

Merge two dataframes on nearest value while duplicating rows

I have two dataframes,
DF1 =   NUM1    Car   COLOR
         100  Honda    blue
         100  Honda  yellow
         200  Volvo     red

DF2 =   NUM2    Car      STATE
         110  Honda       good
         110  Honda        bad
         230  Volvo    not bad
         230  Volvo  excellent
I want to merge them on nearest value in columns NUM1 & NUM2 in order to get this desired dataframe:
DF3 =   NUM    CAR   COLOR      STATE
        100  HONDA    blue       good
        100  HONDA    blue        bad
        100  HONDA  yellow       good
        100  HONDA  yellow        bad
        200  VOLVO     red    not bad
        200  VOLVO     red  excellent
I've tried this:
df3 = pd.merge_asof(df1, df2, left_on="NUM1", right_on="NUM2")
But this is the result I get:
DF3 =   NUM    CAR   COLOR    STATE
        100  HONDA    blue     good
        100  HONDA  yellow     good
        200  VOLVO     red  not bad
IIUC, you might need to combine merge_asof and merge:
# Step 1: for every row of DF1, find the nearest NUM2 in DF2.
# merge_asof needs both sides sorted on the key; DF2[['NUM2']] passes the key
# column as a one-column DataFrame.
key = pd.merge_asof(DF1.reset_index().sort_values(by='NUM1'),
                    DF2[['NUM2']].sort_values(by='NUM2'),
                    left_on='NUM1', right_on='NUM2',
                    direction='nearest')['NUM2']

# Step 2: an ordinary many-to-many merge on that nearest key, after dropping the
# columns DF2 shares with DF1 so they are not duplicated.
DF1.merge(DF2.drop(columns=DF1.columns.intersection(DF2.columns)),
          left_on=key, right_on='NUM2')

How to split phrases into words in a data frame by multiple delimiters, ignoring NaN?

I have a data frame that needs to be split and cleaned. I'm trying to separate phrases into words using regular expressions, but I'm not getting exactly what I want. In addition, I need to lowercase the words and remove extra characters (I wanted to do this with strip() and lower(), but I don't know where to apply them). There is also a problem with NaN values: they need to be ignored, but they end up as lists.
Right now my function looks like this:
def splitSentence(Sentence):
    words = re.split(r';|,|/|&|\||-|:| ', str(Sentence))
    words.sort()
    # words.strip(' , !').lower()
    return words
df = pd.DataFrame({'Name': ['Mark', 'Ann', 'John', 'Elsa', 'Emma', 'Andrew', 'Max', 'Rose', 'Donald', 'Hugh', 'Alex'],
                   'Color': [np.nan, np.nan, np.nan, 'blue teal/ blue gray', 'blue| green|grey it changes', 'BLACK!!!!', 'blue&green', 'dichromatic: one blue| one green', 'green;very very orangey brown and blue', 'Hazel, Green,Gray', 'dark-coffee']})
df
Name Color
0 Mark NaN
1 Ann NaN
2 John NaN
3 Elsa blue teal/ blue gray
4 Emma blue| green|grey it changes
5 Andrew BLACK!!!!
6 Max blue&green
7 Rose dichromatic: one blue| one green
8 Donald green;very very orangey brown and blue
9 Hugh Hazel, Green,Gray
10 Alex dark-coffee
I apply my function to the dataframe and get this:
df['Color'].apply(lambda x: splitSentence(x))
0 [nan]
1 [nan]
2 [nan]
3 [, blue, blue, gray, teal]
4 [, blue, changes, green, grey, it]
5 [BLACK!!!!]
6 [blue, green]
7 [, , blue, dichromatic, green, one, one]
8 [and, blue, brown, green, orangey, very, very]
9 [, Gray, Green, Hazel]
10 [coffee, dark]
But I need to get this (without the square brackets):
0 NaN
1 NaN
2 NaN
3 blue, gray, teal
4 blue, changes, green, grey, it
5 black
6 blue, green
7 blue, dichromatic, green, one
8 and, blue, brown, green, orangey, very
9 gray, green, hazel
10 coffee, dark
Can you please tell me how I can fix my code? Thanks.
import pandas as pd
import numpy as np
import re

# Create DataFrame
df = pd.DataFrame({'Name': ['Mark', 'Ann', 'John', 'Elsa', 'Emma', 'Andrew', 'Max', 'Rose', 'Donald', 'Hugh', 'Alex'],
                   'Color': [np.nan, np.nan, np.nan, 'blue teal/ blue gray', 'blue| green|grey it changes', 'BLACK!!!!', 'blue&green', 'dichromatic: one blue| one green', 'green;very very orangey brown and blue', 'Hazel, Green,Gray', 'dark-coffee']})

# Function
def splitSentence(Sentence):
    # Added ! to the delimiter list
    words = re.split(r';|,|/|&|\||-|:|!| ', str(Sentence))
    words.sort()
    # Create a new list of lowercase words without empty strings
    new_words_list = [x.lower() for x in words if x != '']
    # Join all the elements in the list with ','
    joined_string = ",".join(new_words_list)
    return joined_string
df['Color'].apply(lambda x: splitSentence(x))
Out[27]: 0 nan
1 nan
2 nan
3 blue,blue,gray,teal
4 blue,changes,green,grey,it
5 black
6 blue,green
7 blue,dichromatic,green,one,one
8 and,blue,brown,green,orangey,very,very
9 gray,green,hazel
10 coffee,dark
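The result above still turns missing values into the string 'nan' and keeps duplicates ('blue,blue', 'one,one'), whereas the desired output keeps NaN and lists each word once. A minimal sketch of one way to cover both, assuming the same df (an early return for pd.isna keeps NaN rows as NaN, and a set removes duplicates):

import re
import numpy as np
import pandas as pd

def split_sentence(sentence):
    # Leave missing values untouched so they stay NaN in the result.
    if pd.isna(sentence):
        return np.nan
    # Split on the delimiters, drop empty strings, lowercase, de-duplicate, sort.
    words = re.split(r';|,|/|&|\||-|:|!| ', str(sentence))
    unique_words = sorted({w.lower() for w in words if w != ''})
    return ", ".join(unique_words)

df['Color'].apply(split_sentence)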

pandas filter if string in column A contains column B element

Let's say we have this df
d = pd.DataFrame({'year': [2010, 2020, 2010], 'colors': ['red', 'white', 'blue'], "shirt" : ["red shirt", "green and red shirt", "yellow shirt"] })
like this:
year colors shirt
0 2010 red red shirt
1 2020 white green and red shirt
2 2010 blue yellow shirt
I want to keep only the rows in which the "shirt" column contains the value in the "colors" column, while also filtering on the "year" column
desired output:
year colors shirt
0 2010 red red shirt
I tried d[(d.year == 2010) & (d.shirt.str.contains(d.colors))], but I get this error:
'Series' objects are mutable, thus they cannot be hashed
It is a big df that I am working on. How can I solve this with a pandas function?
I believe you need df.apply
Ex:
df = pd.DataFrame({'year': [2010, 2020, 2010], 'colors': ['red', 'white', 'blue'], "shirt" : ["red shirt", "green and red shirt", "yellow shirt"] })
print(df[(df.year == 2010) & df.apply(lambda x: x.colors in x.shirt, axis=1)])
Output:
year colors shirt
0 2010 red red shirt
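Because the question mentions a big df, it is worth noting that a row-wise apply can be slow; a plain list comprehension over the two columns is a common lighter-weight alternative (a sketch, reusing pd and the df from the answer above):

# Build the substring mask with zip instead of DataFrame.apply(axis=1).
mask = pd.Series([color in shirt for color, shirt in zip(df['colors'], df['shirt'])],
                 index=df.index)
print(df[(df['year'] == 2010) & mask])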

For loop and if for multiple conditions and change another column in the same row in pandas

I am trying to change one column when certain strings appear in the other column of the same row. I am new to Pandas.
I need to change the price of some oranges to 200, but not the price of 'Red Orange'. I cannot change the names in the "fruits" column. The real strings are much longer; I shortened them here for convenience.
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 16
Red Orange from Costa 15
Pink Orange from Brazil 19
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
so that the final result would be
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 200
Red Orange from Costa 15
Pink Orange from Brazil 200
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
I tried
df.loc[df['fruits'].str.lower().str.contains('orange'), 'price'] = 200
but this changes the price of more rows than the 2 items I want.
I also tried a for loop once, but that changed the price of the entire column.
You can use regex:
import re
df.loc[df['fruits'].str.lower().str.contains(r'(?<!red) orange', regex = True), 'price'] = 200
(?<!red) is a negative lookbehind: the pattern will not match ' orange' when 'red' comes right before it. The mandatory space before the word 'orange' also means it only matches when 'orange' is not the first word, so you don't have to worry about it being the colour that describes something else.
df.loc[((df['fruits'].str.contains('orange')) & (~df['fruits'].str.contains('Red'))),'price'] = 200
We check for 'orange' and use ~ to confirm 'Red' is not present in the string. If both conditions are true, the price is changed to 200.
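If the desired table is taken literally (only 'Blue Orange' and 'Pink Orange' change, and the lowercase 'Purple orange' keeps its price of 5), a case-sensitive variant of the same lookbehind idea reproduces exactly that; the DataFrame construction below is my own reconstruction of the printed data:

import pandas as pd

df = pd.DataFrame({'fruits': ['Green apple from us', 'Orange Apple from US',
                              'Mango from Canada', 'Blue Orange from Mexico',
                              'Red Orange from Costa', 'Pink Orange from Brazil',
                              'Yellow Pear from Guatemala', 'Black Melon from Guatemala',
                              'Purple orange from Honduras'],
                   'price': [10, 11, 15, 16, 15, 19, 32, 4, 5]})

# ' Orange' (capitalised, with a leading space) not preceded by 'Red' gets the new price;
# 'Red Orange', 'Orange Apple' and 'Purple orange' are left alone.
df.loc[df['fruits'].str.contains(r'(?<!Red) Orange', regex=True), 'price'] = 200
print(df)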

Select two sets of columns by column names in Pandas

Take the DataFrame from the answer to Loc vs. iloc vs. ix vs. at vs. iat? as an example.
df = pd.DataFrame(
    {'age': [30, 2, 12, 4, 32, 33, 69],
     'color': ['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
     'food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
     'height': [165, 70, 120, 80, 180, 172, 150],
     'score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
     'state': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},
    index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']
)
Now I want all columns except 'food' and 'height'.
I thought something like df.loc[:,['age':'color', 'score':'state']] would work, but Python raises SyntaxError: invalid syntax.
I am aware of one workaround: df.drop(columns=['food', 'height']). However, in my real-life situation I have hundreds of columns to drop, and typing out all the column names is inefficient.
I am expecting something similar to dplyr::select(df, -(food:height)) or dplyr::select(df, age:color, score:state) in R.
I have also read Selecting/Excluding sets of columns in Pandas.
First, find all columns lying between food and height (inclusive).
# iloc[-1:0] selects no rows, so only the label-based slice of the columns is evaluated.
c = df.iloc[-1:0].loc[:, 'food':'height'].columns
Next, filter with difference, isin, or setdiff1d:
df[df.columns.difference(c)]
Or,
df.loc[:, ~df.columns.isin(c)]
Or,
df[np.setdiff1d(df.columns, c)]
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
First get the positions of the column names with Index.get_loc, and then use numpy.r_ to join all the slicers together:
a = np.r_[df.columns.get_loc('age'):df.columns.get_loc('color') + 1,
          df.columns.get_loc('score'):df.columns.get_loc('state') + 1]
df = df.iloc[:, a]
print(df)
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
One option for flexible column selection is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.select_columns(slice('age', 'color'), slice('score', 'state'))
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
df.select_columns(slice('food', 'height'), invert = True)
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
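If pulling in pyjanitor is not an option, the "drop a label range" idea can also be written with plain pandas by feeding a loc column slice to drop; a minimal sketch using the original df from the question:

# Columns from 'food' to 'height' (inclusive) selected by label, then dropped,
# so none of the intermediate names has to be typed out.
cols_to_drop = df.loc[:, 'food':'height'].columns
print(df.drop(columns=cols_to_drop))

Unlike difference or setdiff1d, this keeps the remaining columns in their original order.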
