Select two sets of columns by column names in Pandas - python

Take the DataFrame from the answer to Loc vs. iloc vs. ix vs. at vs. iat? as an example:
df = pd.DataFrame(
{'age':[30, 2, 12, 4, 32, 33, 69],
'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
'height':[165, 70, 120, 80, 180, 172, 150],
'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},
index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']
)
Now I want all columns except 'food' and 'height'.
I thought something like df.loc[:, ['age':'color', 'score':'state']] would work, but Python raises SyntaxError: invalid syntax.
I am aware of one workaround: df.drop(columns=['food', 'height']). However, in my real-life situation I have hundreds of columns to drop, and typing out all the column names is inefficient.
I am looking for something similar to dplyr::select(df, -(food:height)) or dplyr::select(df, age:color, score:state) in R.
I have also read Selecting/Excluding sets of columns in Pandas.

First, find all columns lying between 'food' and 'height' (inclusive):
c = df.loc[:, 'food':'height'].columns
Next, filter with Index.difference, Index.isin, or numpy.setdiff1d:
df[df.columns.difference(c)]
Or,
df.loc[:, ~df.columns.isin(c)]
Or,
df[np.setdiff1d(df.columns, c)]
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
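Note that Index.difference (and numpy.setdiff1d) return the remaining columns sorted alphabetically. If the original column order matters, Index.drop preserves it; a minimal sketch on a trimmed-down version of the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [30, 2], 'color': ['blue', 'green'], 'food': ['Steak', 'Lamb'],
     'height': [165, 70], 'score': [4.6, 8.3], 'state': ['NY', 'TX']},
    index=['Jane', 'Nick']
)

# labels of the contiguous block to exclude
c = df.loc[:, 'food':'height'].columns

# Index.drop removes the labels while keeping the original column order
out = df[df.columns.drop(c)]
print(out.columns.tolist())  # ['age', 'color', 'score', 'state']
```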

First, get the positions of the column names with Index.get_loc, then use numpy.r_ to join all the slicers together:
a = np.r_[df.columns.get_loc('age'):df.columns.get_loc('color')+1,
df.columns.get_loc('score'):df.columns.get_loc('state')+1]
df = df.iloc[:, a]
print (df)
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
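As a quick illustration of what numpy.r_ does with slice arguments, it concatenates the ranges into a single array of positions:

```python
import numpy as np

# two slice ranges joined into one index array
a = np.r_[0:2, 4:6]
print(a)  # [0 1 4 5]
```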

One option for flexible column selection is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.select_columns(slice('age', 'color'), slice('score', 'state'))
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
df.select_columns(slice('food', 'height'), invert = True)
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX


How to find sum of Few Columns in pandas Dataframe and leave the rest as it is? [duplicate]

I have a dataset that contains NBA players' average statistics per game. Some players' statistics appear in multiple rows because they played for different teams during the season.
For example:
Player Pos Age Tm G GS MP FG
8 Jarrett Allen C 22 TOT 28 10 26.2 4.4
9 Jarrett Allen C 22 BRK 12 5 26.7 3.7
10 Jarrett Allen C 22 CLE 16 5 25.9 4.9
I want to average Jarrett Allen's stats and put them into a single row. How can I do this?
You can groupby and use agg to get the mean. For the non numeric columns, let's take the first value:
df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
                          for k, v in df.dtypes[1:].items()})
output:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22 TOT 18.666667 6.666667 26.266667 4.333333
NB. content of the dictionary comprehension:
{'Pos': 'first',
'Age': 'mean',
'Tm': 'first',
'G': 'mean',
'GS': 'mean',
'MP': 'mean',
'FG': 'mean'}
x = [['a', 12, 5],['a', 12, 7], ['b', 15, 10],['b', 15, 12],['c', 20, 1]]
import pandas as pd
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')
df2 = df.groupby(['name', 'age']).mean()
print(df2)
Output:
name age score
0 a 12 5
1 a 12 7
2 b 15 10
3 b 15 12
4 c 20 1
-----------
score
name age
a 12 6
b 15 11
c 20 1
Option 1
If one considers the dataframe df that the OP shares in the question, the following will do the work
df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22.0 TOT 18.666667 6.666667 26.266667 4.333333
This one uses:
pandas.DataFrame.groupby to group by the Player column
pandas.core.groupby.GroupBy.agg to aggregate the values based on a custom-made lambda function.
pandas.api.types.is_string_dtype to check if a column is of string type
Let's test it with a new dataframe, df2, with more elements in the Player column.
import numpy as np
df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
'Age': np.random.randint(0, 100, 9),
'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
'G': np.random.randint(0, 100, 9),
'GS': np.random.randint(0, 100, 9),
'MP': np.random.uniform(0, 100, 9),
'FG': np.random.uniform(0, 100, 9)})
[Out]:
Player Pos Age Tm G GS MP FG
0 John Collins PF 71 ATL 75 39 16.123225 77.949756
1 John Collins PF 60 ATL 49 49 30.308092 24.788401
2 John Collins PF 52 ATL 33 92 11.087317 58.488575
3 Trae Young PG 72 ATL 20 91 62.862313 60.169282
4 Trae Young PG 85 ATL 61 77 30.248551 85.169038
5 Clint Capela C 73 ATL 5 67 45.817690 21.966777
6 Jarrett Allen C 23 TOT 60 51 93.076624 34.160823
7 Jarrett Allen C 12 BRK 2 77 74.318568 78.755869
8 Jarrett Allen C 44 CLE 82 81 7.375631 40.930844
If one tests the operation on df2, one will get the following
df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Clint Capela C 95.000000 ATL 30.000000 98.000000 46.476398 17.987104
Jarrett Allen C 60.000000 TOT 48.666667 19.333333 70.050540 33.572896
John Collins PF 74.333333 ATL 50.333333 52.666667 78.181457 78.152235
Trae Young PG 57.500000 ATL 44.500000 47.500000 46.602543 53.835455
Option 2
Depending on the desired output, assuming that one only wants to group by player (independently of Age or Tm), a simpler solution would be to just group by and pass .mean() as follows
df_new3 = df.groupby('Player').mean()
[Out]:
Age G GS MP FG
Player
Jarrett Allen 22.0 18.666667 6.666667 26.266667 4.333333
Notes:
The output of this previous operation won't display non-numerical columns (apart from the Player name).
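On recent pandas releases (2.0 and later), the bare .mean() raises a TypeError when the group still contains string columns instead of silently dropping them; passing numeric_only=True restores the behaviour shown above. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Player': ['Jarrett Allen'] * 3,
                   'Tm': ['TOT', 'BRK', 'CLE'],  # non-numeric column
                   'G': [28, 12, 16],
                   'FG': [4.4, 3.7, 4.9]})

# average only the numeric columns; non-numeric ones are dropped
out = df.groupby('Player').mean(numeric_only=True)
print(out)
```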

Functions for finding the average

I am a newbie here. English is not my native language, so please excuse any grammatical mistakes. I need to compute the average BMI per hair colour using the DataFrame df below.
# 1. Here we import pandas
import pandas as pd
# 2. Here we import numpy
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'Age':[18, 21, 28, 19, 23, 22, 18, 24, 25, 20],
'Hair colour':['Blonde', 'Brown', 'Black', 'Blonde', 'Blonde', 'Black','Brown', 'Brown', 'Black', 'Black'],
'Length (in cm)':np.random.normal(175, 10, 10).round(1),
'Weight (in kg)':np.random.normal(70, 5, 10).round(1)},
index = ['Leon', 'Mirta', 'Nathan', 'Linda', 'Bandar', 'Violeta', 'Noah', 'Niji', 'Lucy', 'Mark'],)
The result should be a named vector, i.e. averages labelled by hair colour.
Firstly, I wrote the function of BMI:
# function
def BMI():
    df['weight (in kg)'] / (df['Length']/100)**2
However, I don't know what my next step is.
Can you advise me on how to find the average BMI per hair colour?
You can use df.groupby(), which is a functionality within Pandas.
For your particular case, once the BMI column has been added to df, you may use
df.groupby('Hair colour')['BMI'].mean()
which gives output
Hair colour
Black 23.003356
Blonde 18.806844
Brown 23.271460
Name: BMI, dtype: float64
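A complete sketch of this answer, with the seeded Length/Weight values from the question reproduced as literals so it runs deterministically:

```python
import pandas as pd

df = pd.DataFrame({
    'Hair colour': ['Blonde', 'Brown', 'Black', 'Blonde', 'Blonde',
                    'Black', 'Brown', 'Brown', 'Black', 'Black'],
    'Length (in cm)': [192.6, 179.0, 184.8, 197.4, 193.7,
                       165.2, 184.5, 173.5, 174.0, 179.1],
    'Weight (in kg)': [70.7, 77.3, 73.8, 70.6, 72.2,
                       71.7, 77.5, 69.0, 71.6, 65.7],
}, index=['Leon', 'Mirta', 'Nathan', 'Linda', 'Bandar',
          'Violeta', 'Noah', 'Niji', 'Lucy', 'Mark'])

# BMI = weight in kg divided by height in metres squared
df['BMI'] = df['Weight (in kg)'] / (df['Length (in cm)'] / 100) ** 2

# average BMI per hair colour
avg = df.groupby('Hair colour')['BMI'].mean()
print(avg)
```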
You can either filter or group by.
Your BMI function does not work as written, because it:
references columns that do not exist ('weight (in kg)' and 'Length' instead of 'Weight (in kg)' and 'Length (in cm)')
does nothing with its return value, so the computed result gets discarded
Filtering:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'Age':[18, 21, 28, 19, 23, 22, 18, 24, 25, 20],
'Hair colour':['Blonde', 'Brown', 'Black', 'Blonde',
'Blonde', 'Black','Brown', 'Brown', 'Black',
'Black'],
'Length (in cm)':np.random.normal(175, 10, 10).round(1),
'Weight (in kg)':np.random.normal(70, 5, 10).round(1)},
index = ['Leon', 'Mirta', 'Nathan', 'Linda', 'Bandar',
'Violeta', 'Noah', 'Niji', 'Lucy', 'Mark'],)
print(df)
# calculate BMI - not as function, using correct column names
df["BMI"] = df['Weight (in kg)'] / (df['Length (in cm)']/100)**2
print(df)
# filter to brown
brown = df[df["Hair colour"] == "Brown"]
print(brown)
print(brown["BMI"].mean())
Output:
# calculated BMI
Age Hair colour Length (in cm) Weight (in kg) BMI
Leon 18 Blonde 192.6 70.7 19.059296
Mirta 21 Brown 179.0 77.3 24.125339
Nathan 28 Black 184.8 73.8 21.609884
Linda 19 Blonde 197.4 70.6 18.118006
Bandar 23 Blonde 193.7 72.2 19.243229
Violeta 22 Black 165.2 71.7 26.272359
Noah 18 Brown 184.5 77.5 22.767165
Niji 24 Brown 173.5 69.0 22.921875
Lucy 25 Black 174.0 71.6 23.649095
Mark 20 Black 179.1 65.7 20.482087
# filtered output
Age Hair colour Length (in cm) Weight (in kg) BMI
Mirta 21 Brown 179.0 77.3 24.125339
Noah 18 Brown 184.5 77.5 22.767165
Niji 24 Brown 173.5 69.0 22.921875
# avg BMI
23.271459786871446
Groupby:
# use groupby
grouped = df.groupby('Hair colour')
print(*grouped, sep="\n\n")
# https://stackoverflow.com/questions/51091331
print(grouped.get_group("Brown")["BMI"].mean())
Output:
# grouped output
('Black', Age Hair colour Length (in cm) Weight (in kg) BMI
Nathan 28 Black 184.8 73.8 21.609884
Violeta 22 Black 165.2 71.7 26.272359
Lucy 25 Black 174.0 71.6 23.649095
Mark 20 Black 179.1 65.7 20.482087)
('Blonde', Age Hair colour Length (in cm) Weight (in kg) BMI
Leon 18 Blonde 192.6 70.7 19.059296
Linda 19 Blonde 197.4 70.6 18.118006
Bandar 23 Blonde 193.7 72.2 19.243229)
('Brown', Age Hair colour Length (in cm) Weight (in kg) BMI
Mirta 21 Brown 179.0 77.3 24.125339
Noah 18 Brown 184.5 77.5 22.767165
Niji 24 Brown 173.5 69.0 22.921875)
# avg BMI
23.271459786871446

check if a name in one dataframe exist in other dataframe python

I am a beginner in Python and trying to find a solution for the following problem.
I have a csv file:
name, mark
Anna,24
John,19
Mike,22
Monica,20
Alex, 17
Daniel, 26
And xls file:
name, group
John, red
Anna, blue
Monica, blue
Mike, yellow
Alex, red
I am trying to get the result:
group, mark
Red, 26
Blue, 44
Yellow, 22
The number in the result shows the total mark for the whole group.
I tried to find similar problems but was not successful, and I do not have enough experience to work out exactly what to do and which commands to use.
Use pd.read_csv (and pd.read_excel for the xls file) with DataFrame.merge and GroupBy.sum:
In [89]: df1 = pd.read_csv('file1.csv')
In [89]: df1
Out[89]:
name mark
0 Anna 24
1 John 19
2 Mike 22
3 Monica 20
4 Alex 17
5 Daniel 26
In [90]: df2 = pd.read_excel('file2.xls')
In [90]: df2
Out[90]:
name group
0 John red
1 Anna blue
2 Monica blue
3 Mike yellow
4 Alex red
In [94]: df = df1.merge(df2).groupby('group').sum().reset_index()
In [95]: df
Out[95]:
group mark
0 blue 44
1 red 36
2 yellow 22
EDIT: If you have other columns that you don't want to sum, aggregate only mark:
In [284]: df1.merge(df2).groupby('group').agg({'mark': 'sum'}).reset_index()
Out[284]:
group mark
0 blue 44
1 red 36
2 yellow 22
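Note that merge defaults to an inner join on the shared name column, so Daniel (who has no group in the second file) is dropped before the sum. A self-contained sketch of the whole pipeline, with the two files rebuilt as in-memory frames:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Anna', 'John', 'Mike', 'Monica', 'Alex', 'Daniel'],
                    'mark': [24, 19, 22, 20, 17, 26]})
df2 = pd.DataFrame({'name': ['John', 'Anna', 'Monica', 'Mike', 'Alex'],
                    'group': ['red', 'blue', 'blue', 'yellow', 'red']})

# inner join keeps only names present in both frames, then sum per group
out = df1.merge(df2).groupby('group')['mark'].sum().reset_index()
print(out)
```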

Pandas replacing the number only for the columns that contains number

I have a dataframe with more than 100 columns. I am trying to replace numbers across the whole dataframe, but only in the columns that contain numbers (int/float/any numeric format).
I know how to handle each column separately, but I am looking for some concise code that efficiently replaces a value with -5 if the value < 0 and with 888 if the value > 50.
Below is the code.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Name': ['Avery Bradley', 'Jae Crowder', 'John Holland', 'R.J. Hunter'],
'Team': ['Boston Celtics',
'Boston Celtics',
'Boston Celtics',
'Boston Celtics'],
'Number1': [0.0, 999.0, -30.0, 28.0],
'Number2': [1000, 500, -10, 25],
'Position': ['PG', 'SF', 'SG', 'SG']})
#df["Number1"].values[df["Number1"] > 50] = 999
#df["Number1"].values[df["Number1"] < 0] = -5
# this applies the comparison to every column, including the non-numeric ones
df[df > 50] = 888
df[df < 0] = -5
You can use select_dtypes with np.select for multiple conditions here:
m = df.select_dtypes(np.number)
df[m.columns] = np.select([m>50,m<0],[888,-5],m)
print(df)
Name Team Number1 Number2 Position
0 Avery Bradley Boston Celtics 0.0 888.0 PG
1 Jae Crowder Boston Celtics 888.0 888.0 SF
2 John Holland Boston Celtics -5.0 -5.0 SG
3 R.J. Hunter Boston Celtics 28.0 25.0 SG
Use:
c = df.select_dtypes(np.number).columns
df[c] = df[c].mask(df[c] > 50, 888)
df[c] = df[c].mask(df[c] < 0, -5)
print (df)
Name Team Number1 Number2 Position
0 Avery Bradley Boston Celtics 0.0 888 PG
1 Jae Crowder Boston Celtics 888.0 888 SF
2 John Holland Boston Celtics -5.0 -5 SG
3 R.J. Hunter Boston Celtics 28.0 25 SG

Pandas - Multiindex Division [i.e. Division by Group]

Aim: I'm trying to divide each row in a multilevel index by the total number in each group.
More specifically: Given the following data, I want to divide the number of Red and Blue marbles by the total number in each group (i.e. the sum across Date, Country and Colour)
Number
Date Country Colour
2011 US Red 4
Blue 6
2012 IN Red 9
IE Red 5
Blue 5
2013 JP Red 15
Blue 25
This would give the following answer:
Number
Date Country Colour
2011 US Red 0.4
Blue 0.6
2012 IN Red 1.0
IE Red 0.5
Blue 0.5
2013 JP Red 0.375
Blue 0.625
Here is the code to reproduce the data:
import numpy as np
import pandas as pd

arrays = [np.array(['2011', '2011', '2012', '2012', '2012', '2013', '2013']),
          np.array(['US', 'US', 'IN', 'IE', 'IE', 'JP', 'JP']),
          np.array(['Red', 'Blue', 'Red', 'Red', 'Blue', 'Red', 'Blue'])]
df = pd.DataFrame({'number': [4, 6, 9, 5, 5, 15, 25]}, index=arrays)
df.index.names = ['Date', 'Country', 'Colour']
A shorter version would be:
df.groupby(level=['Date', 'Country']).transform(lambda x: x/x.sum())
number
Date Country Colour
2011 US Red 0.400
Blue 0.600
2012 IN Red 1.000
IE Red 0.500
Blue 0.500
2013 JP Red 0.375
Blue 0.625
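An equivalent that avoids the Python-level lambda is to divide the frame by the group-wise sums directly, which is typically faster on large frames; a sketch on the same data:

```python
import numpy as np
import pandas as pd

arrays = [np.array(['2011', '2011', '2012', '2012', '2012', '2013', '2013']),
          np.array(['US', 'US', 'IN', 'IE', 'IE', 'JP', 'JP']),
          np.array(['Red', 'Blue', 'Red', 'Red', 'Blue', 'Red', 'Blue'])]
df = pd.DataFrame({'number': [4, 6, 9, 5, 5, 15, 25]}, index=arrays)
df.index.names = ['Date', 'Country', 'Colour']

# transform('sum') broadcasts each (Date, Country) group sum back to row shape,
# so the division normalises every row within its group
out = df / df.groupby(level=['Date', 'Country']).transform('sum')
print(out)
```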
