Recode multiple values in several columns in Python [similar to R]

I am trying to translate my R script to Python. I have survey data with several date-of-birth and education-level columns, one of each per family member (from member 1 to member 10). Here is a sample:
id_name dob_1 dob_2 dob_3 education_1 education_2 education_3
12 1958 2001 2005 1 5 1
13 1990 1999 1932 2 1 3
14 1974 1965 1965 3 3 3
15 1963 1963 1990 4 3 1
16 2020 1995 1988 1 1 2
I had a function in R to check the logic and recode wrong education levels in all the columns, like this:
# R function
edu_recode <- function(dob, edu){
  case_when(
    dob >= 2003 & (edu == 1 | edu == 2 | edu == 3 | edu == 4) ~ 8,
    dob > 2000 & (edu == 1 | edu == 2 | edu == 3 | edu == 4) ~ 1,
    dob >= 1996 & (edu == 3 | edu == 4) ~ 2,
    dob > 1995 & edu == 4 ~ 3,
    (dob >= 2001 & dob <= 2002) & edu == 8 ~ 1,
    TRUE ~ as.numeric(edu)
  )
}
and apply it for all columns like this:
library(tidyverse)
df %>%
  mutate(education_1 = edu_recode(dob_1, education_1),
         education_2 = edu_recode(dob_2, education_2),
         education_3 = edu_recode(dob_3, education_3),
         education_4 = edu_recode(dob_4, education_4),
         education_5 = edu_recode(dob_5, education_5),
         education_6 = edu_recode(dob_6, education_6),
         education_7 = edu_recode(dob_7, education_7),
         education_8 = edu_recode(dob_8, education_8),
         education_9 = edu_recode(dob_9, education_9),
         education_10 = edu_recode(dob_10, education_10)
  )
Is there a way to do a similar process in Python instead of manually recoding each column?

You can write a function that combines pipe with np.select, together with a dictionary comprehension (to abstract away as much of the manual processing as possible):
import numpy as np
import pandas as pd

def edu_recode(df, dob, edu):
    df = df.copy()
    # the conditions mirror the R case_when, in order of priority
    cond1 = (df[dob] >= 2003) & (df[edu].isin([1, 2, 3, 4]))
    cond2 = (df[dob] > 2000) & (df[edu].isin([1, 2, 3, 4]))
    cond3 = (df[dob] >= 1996) & (df[edu].isin([3, 4]))
    cond4 = (df[dob] > 1995) & (df[edu] == 4)
    cond5 = (df[dob].isin([2001, 2002])) & (df[edu] == 8)
    condlist = [cond1, cond2, cond3, cond4, cond5]
    choicelist = [8, 1, 2, 3, 1]
    # np.select picks the first matching condition; the default
    # (like R's TRUE ~ ...) keeps the original education value
    return np.select(condlist, choicelist, pd.to_numeric(df[edu]))
# sticking to the sample data, you can extend this
mapping = {f"education_{num}": df.pipe(edu_recode, f"dob_{num}", f"education_{num}")
           for num in range(1, 4)}
df.assign(**mapping)
id_name dob_1 dob_2 dob_3 education_1 education_2 education_3
0 12 1958 2001 2005 1 5 8
1 13 1990 1999 1932 2 1 3
2 14 1974 1965 1965 3 3 3
3 15 1963 1963 1990 4 3 1
4 16 2020 1995 1988 8 1 2
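Sticking to the naming pattern in the sample, extending this to all ten family members should only require changing range(1, 4) to range(1, 11) in the dictionary comprehension.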

Related

How to add column for every month and generate number i.e. 1,2,3..etc

I have a huge CSV file loaded as a dataframe; however, I don't have a date column. I only have the sales for every month from Jan-2022 until Dec-2034. Below is an example of my dataframe:
import pandas as pd
data = [[6661, 'Mobile Phone', 43578, 5000, 78564, 52353, 67456, 86965, 43634, 32546, 56332, 58944, 98878, 68588, 43634, 3463, 74533, 73733, 64436, 45426, 57333, 89762, 4373, 75457, 74845, 86843, 59957, 74563, 745335, 46342, 463473, 52352, 23622],
[6672, 'Play Station', 4475, 2546, 5757, 2352, 57896, 98574, 53536, 56533, 88645, 44884, 76585, 43575, 74573, 75347, 57573, 5736, 53737, 35235, 5322, 54757, 74573, 75473, 77362, 21554, 73462, 74736, 1435, 4367, 63462, 32362, 56332],
[6631, 'Laptop', 35347, 36376, 164577, 94584, 78675, 76758, 75464, 56373, 56343, 54787, 7658, 76584, 47347, 5748, 8684, 75373, 57573, 26626, 25632, 73774, 847373, 736646, 847457, 57346, 43732, 347346, 75373, 6473, 85674, 35743, 45734],
[6600, 'Camera', 14365, 60785, 25436, 46747, 75456, 97644, 63573, 56433, 25646, 32548, 14325, 64748, 68458, 46537, 7537, 46266, 7457, 78235, 46223, 8747, 67453, 4636, 3425, 4636, 352236, 6622, 64625, 36346, 46346, 35225, 6436],
[6643, 'Lamp', 324355, 143255, 696954, 97823, 43657, 66686, 56346, 57563, 65734, 64484, 87685, 54748, 9868, 573, 73472, 5735, 73422, 86352, 5325, 84333, 7473, 35252, 7547, 73733, 7374, 32266, 654747, 85743, 57333, 46346, 46266]]
ds = pd.DataFrame(data, columns = ['ID', 'Product', 'SalesJan-22', 'SalesFeb-22', 'SalesMar-22', 'SalesApr-22', 'SalesMay-22', 'SalesJun-22', 'SalesJul-22', 'SalesAug-22', 'SalesSep-22', 'SalesOct-22', 'SalesNov-22', 'SalesDec-22', 'SalesJan-23', 'SalesFeb-23', 'SalesMar-23', 'SalesApr-23', 'SalesMay-23', 'SalesJun-23', 'SalesJul-23', 'SalesAug-23', 'SalesSep-23', 'SalesOct-23', 'SalesNov-23', 'SalesDec-23', 'SalesJan-24', 'SalesFeb-24', 'SalesMar-24', 'SalesApr-24', 'SalesMay-24', 'SalesJun-24', 'SalesJul-24'])
Since I have more than 10 monthly sales columns, I want to insert a date column after each monthly sales column. The first 6 months should be numbered 1, the next 12 months 2, the 12 months after that 3, the subsequent 12 months 4, and so on.
Below is a sample of the result I want: (image omitted)
Is there any way to perform the loop and add the date column beside each monthly sales column?
Here is the simplest approach I can think of:
for i, col in enumerate(ds.columns[2:]):
    # str.removeprefix needs Python 3.9+; first 6 months -> 1, then +1 per further 12 months
    ds.insert(2 * i + 2, col.removeprefix("Sales"), (i - 6) // 12 + 2)
Here is a vectorized approach (calling insert repeatedly is inefficient):
import numpy as np

# convert (valid) columns to datetime
cols = pd.to_datetime(ds.columns, format='Sales%b-%y', errors='coerce')
# identify valid dates
m = cols.notna()
# get the year
y = cols[m].year
# calculate the number (1 for the first 6 months, then +1 per 12 months)
num = ((cols[m].month + 12*(y - y.min())) + 5)//12 + 1
# slice the date columns, assign the numbers, rename
df2 = (ds.loc[:, m].assign(**dict(zip(ds.columns[m], num)))
         .rename(columns=lambda x: x[5:])
      )
# get the new order of columns; the stable sort keeps each Sales column
# ahead of its matching date column
idx = np.r_[np.zeros((~m).sum()), np.tile(np.arange(m.sum()), 2) + 1]
# concat and reorder
out = pd.concat([ds, df2], axis=1).iloc[:, np.argsort(idx, kind='stable')]
print(out)
print(out)
output:
     ID       Product  SalesJan-22  Jan-22  SalesFeb-22  Feb-22  ...  SalesJun-24  Jun-24  SalesJul-24  Jul-24
0  6661  Mobile Phone        43578       1         5000       1  ...        52352       3        23622       4
1  6672  Play Station         4475       1         2546       1  ...        32362       3        56332       4
2  6631        Laptop        35347       1        36376       1  ...        35743       3        45734       4
3  6600        Camera        14365       1        60785       1  ...        35225       3         6436       4
4  6643          Lamp       324355       1       143255       1  ...        46346       3        46266       4
Here's a little solution. (I put the year instead of your 1, 2, ... numbering since I thought it was more representative, but you can change it easily.)
idx_counter = 0
for idx, col in enumerate(ds.columns):
    if col.startswith('Sales'):
        date = col.replace('Sales', '')
        year = col.split('-')[1]
        ds.insert(loc=idx + 1 + idx_counter, column=date, value=[year] * ds.shape[0])
        idx_counter += 1
output:
ID Product SalesJan-22 Jan-22 SalesFeb-22 Feb-22 SalesMar-22 Mar-22 SalesApr-22 Apr-22 ... SalesMar-24 Mar-24 SalesApr-24 Apr-24 SalesMay-24 May-24 SalesJun-24 Jun-24 SalesJul-24 Jul-24
0 6661 Mobile Phone 43578 22 5000 22 78564 22 52353 22 ... 745335 24 46342 24 463473 24 52352 24 23622 24
1 6672 Play Station 4475 22 2546 22 5757 22 2352 22 ... 1435 24 4367 24 63462 24 32362 24 56332 24
2 6631 Laptop 35347 22 36376 22 164577 22 94584 22 ... 75373 24 6473 24 85674 24 35743 24 45734 24
3 6600 Camera 14365 22 60785 22 25436 22 46747 22 ... 64625 24 36346 24 46346 24 35225 24 6436 24
4 6643 Lamp 324355 22 143255 22 696954 22 97823 22 ... 654747 24 85743 24 57333 24 46346 24 46266 24
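Swapping the year for the 1, 2, 3, ... numbering from the question is a small change; a sketch reusing the same loop (idx_counter doubles as a 0-based month index):
idx_counter = 0
for idx, col in enumerate(ds.columns):
    if col.startswith('Sales'):
        date = col.replace('Sales', '')
        # first 6 months -> 1, then +1 for every further 12 months
        num = 1 if idx_counter < 6 else (idx_counter - 6) // 12 + 2
        ds.insert(loc=idx + 1 + idx_counter, column=date, value=[num] * ds.shape[0])
        idx_counter += 1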
This should do the trick.
import math

new_cols = []
old_cols = [x for x in ds.columns if x.startswith('Sales')]
for i, col in enumerate(old_cols):
    new_cols.append(col[5:])
    # first 6 months -> 1, then +1 per 12 months
    if i < 6:
        val = 1
    else:
        val = ((i + 6) / 12) + 1
    ds[col[5:]] = math.floor(val)
ds[['ID', 'Product'] + [x for y in zip(old_cols, new_cols) for x in y]]

Pandas filter using multiple conditions and ignore entries that contain duplicates of substring

I have a dataframe derived from a massive list of market tickers from a crypto exchange.
The list includes ALL combos yet I only need the tickers that are vs USD stablecoins.
The 1st 15 entries of the original dataframe...
Asset Price
0 1INCHBTC 0.00009650
1 1INCHBUSD 5.74340000
2 1INCHUSDT 5.74050000
3 AAVEBKRW 164167.00000000
4 AAVEBNB 0.77600000
5 AAVEBTC 0.00615200
6 AAVEBUSD 365.00200000
7 AAVEDOWNUSDT 2.02505200
8 AAVEETH 0.17212000
9 AAVEUPUSDT 81.89500000
10 AAVEUSDT 365.57600000
11 ACMBTC 0.00018420
12 ACMBUSD 10.91700000
13 ACMUSDT 10.89500000
14 ADAAUD 1.59600000
Now...there are many USD stablecoins, however not every ticker has a pair with one.
So I used the most popular ones in order to make sure every asset has at least one match.
df = df.loc[(df.Asset.str[-3:] == 'DAI') |
            (df.Asset.str[-4:] == 'USDT') |
            (df.Asset.str[-4:] == 'BUSD') |
            (df.Asset.str[-4:] == 'TUSD')]
The 1st 15 entries of the new but 'messy' dataframe...
Asset Price
0 1INCHBUSD 5.74340000
1 1INCHUSDT 5.74050000
2 AAVEBUSD 365.00200000
3 AAVEDOWNUSDT 2.02505200
4 AAVEUPUSDT 81.89500000
5 AAVEUSDT 365.57600000
6 ACMBUSD 10.91700000
7 ACMUSDT 10.89500000
8 ADABUSD 1.21439000
9 ADADOWNUSDT 3.46482700
10 ADATUSD 1.21284000
11 ADAUPUSDT 76.12900000
12 ADAUSDT 1.21394000
13 AERGOBUSD 0.43012000
14 AIONBUSD 0.07210000
How do I filter/merge entries in this dataframe so that it removes duplicates?
I also need the suffix removed from the end, so I'm left with just the asset and its USD price.
It should look something like this...
Asset Price
0 1INCH 5.74340000
2 AAVE 365.00200000
3 AAVEDOWN 2.02505200
4 AAVEUP 81.89500000
6 ACM 10.91700000
8 ADA 1.21439000
9 ADADOWN 3.46482700
11 ADAUP 76.12900000
13 AERGO 0.43012000
14 AION 0.07210000
This is for a portfolio tracker.
Also if there is a better way to do this without the middle step I'm all ears.
According to your expected output, you want to remove duplicates but keep the first item:
df.Asset = df.Asset.str.replace(r"(DAI|USDT|BUSD|TUSD)$", "", regex=True)
df = df.drop_duplicates(subset="Asset", keep="first")
print(df)
Prints:
Asset Price
0 1INCH 5.743400
2 AAVE 365.002000
3 AAVEDOWN 2.025052
4 AAVEUP 81.895000
6 ACM 10.917000
8 ADA 1.214390
9 ADADOWN 3.464827
11 ADAUP 76.129000
13 AERGO 0.430120
14 AION 0.072100
EDIT: To group and average:
df.Asset = df.Asset.str.replace(r"(DAI|USDT|BUSD|TUSD)$", "", regex=True)
df = df.groupby("Asset")["Price"].mean().reset_index()
print(df)
Prints:
Asset Price
0 1INCH 5.741950
1 AAVE 365.289000
2 AAVEDOWN 2.025052
3 AAVEUP 81.895000
4 ACM 10.906000
5 ADA 1.213723
6 ADADOWN 3.464827
7 ADAUP 76.129000
8 AERGO 0.430120
9 AION 0.072100
Just do
con1 = df.Asset.str[-3:] == 'DAI'
con2 = df.Asset.str[-4:] == 'USDT'
con3 = df.Asset.str[-4:] == 'BUSD'
con4 = df.Asset.str[-4:] == 'TUSD'
# np.select takes the condition arrays themselves, not their names as strings;
# the choices strip the matched suffix so duplicates can be detected per asset
df['new'] = np.select([con1, con2, con3, con4],
                      [df.Asset.str[:-3], df.Asset.str[:-4],
                       df.Asset.str[:-4], df.Asset.str[:-4]])
out = df[con1 | con2 | con3 | con4].groupby('new').head(1)
or
df[con1 | con2 | con3 | con4].drop_duplicates('new')
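If you'd rather skip the middle filtering step entirely, one route (a sketch, assuming the same four suffixes cover your assets) is a single regex that filters and strips in one pass via str.extract:
# tickers ending in a stablecoin suffix; the lazy named group captures the base asset
pat = r'^(?P<base>.+?)(?:DAI|USDT|BUSD|TUSD)$'
base = df.Asset.str.extract(pat)['base']

out = (df.assign(Asset=base)           # replace the ticker with its stripped base name
         .dropna(subset=['Asset'])     # drop rows that matched no suffix
         .drop_duplicates('Asset'))    # keep the first stablecoin quote per asset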

Matching lists to dataframes

I have a dataframe of people with Age as a column. I would like to match this age to a group, i.e. Baby=0-2 years old, Child=3-12 years old, Young=13-18 years old, Young Adult=19-30 years old, Adult=31-50 years old, Senior Adult=51-65 years old.
I created the lists that define these year groups, e.g. Adult=list(range(31,51)) etc.
How do I match the name of the list 'Adult' to the dataframe by creating a new column?
Small input: the dataframe is made up of three columns: df['Name'], df['Country'], df['Age'].
Name Country Age
Anthony France 15
Albert Belgium 54
.
.
.
Zahra Tunisia 14
So I need to match the age column with lists that I already have. The output should look like:
Name Country Age Group
Anthony France 15 Young
Albert Belgium 54 Senior Adult
.
.
.
Zahra Tunisia 14 Young
Thanks!
IIUC I would go with np.select:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [3, 20, 40]})
condlist = [df.Age.between(0, 2),
            df.Age.between(3, 12),
            df.Age.between(13, 18),
            df.Age.between(19, 30),
            df.Age.between(31, 50),
            df.Age.between(51, 65)]
choicelist = ['Baby', 'Child', 'Young',
              'Young Adult', 'Adult', 'Senior Adult']
df['Adult'] = np.select(condlist, choicelist)
Output:
Age Adult
0 3 Child
1 20 Young Adult
2 40 Adult
Here's a way to do that using pd.cut:
df = pd.DataFrame({"person_id": range(25), "age": np.random.randint(0, 100, 25)})
print(df.head(10))
==>
person_id age
0 0 30
1 1 42
2 2 78
3 3 2
4 4 44
5 5 43
6 6 92
7 7 3
8 8 13
9 9 76
df["group"] = pd.cut(df.age, [0, 18, 50, 100], labels=["child", "adult", "senior"])
print(df.head(10))
==>
person_id age group
0 0 30 adult
1 1 42 adult
2 2 78 senior
3 3 2 child
4 4 44 adult
5 5 43 adult
6 6 92 senior
7 7 3 child
8 8 13 child
9 9 76 senior
Per your question, if you have a few lists (like the ones below) and would like to use them for 'binning', you can do:
# for example, these are the lists
Adult = list(range(18,50))
Child = list(range(0, 18))
Senior = list(range(50, 100))
# Creating bins out of the lists.
bins = [min(l) for l in [Child, Adult, Senior]]
bins.append(max([max(l) for l in [Child, Adult, Senior]]))
labels = ["Child", "Adult", "Senior"]
# using the bins:
df["group"] = pd.cut(df.age, bins, labels=labels)
To make things more clear for beginners, you can define a function that returns the age group of each person, then use pandas.apply() to apply that function row by row and create the 'Group' column:
import pandas as pd

def age(row):
    a = row['Age']
    if 0 < a <= 2:
        return 'Baby'
    elif 2 < a <= 12:
        return 'Child'
    elif 12 < a <= 18:
        return 'Young'
    elif 18 < a <= 30:
        return 'Young Adult'
    elif 30 < a <= 50:
        return 'Adult'
    elif 50 < a <= 65:
        return 'Senior Adult'

df = pd.DataFrame({'Name': ['Anthony', 'Albert', 'Zahra'],
                   'Country': ['France', 'Belgium', 'Tunisia'],
                   'Age': [15, 54, 14]})
df['Group'] = df.apply(age, axis=1)
print(df)
print(df)
Output:
Name Country Age Group
0 Anthony France 15 Young
1 Albert Belgium 54 Senior Adult
2 Zahra Tunisia 14 Young

Pandas dataframe vectorizing/filtering: ValueError: Can only compare identically-labeled Series objects

I have two dataframes with NHL hockey stats. One contains every game played by every team for the last ten years, and the other is where I want to fill it up with calculated values. Simply put, I want to take a metric from a team's first five games, sum it, and put that into the other df. I've trimmed my dfs below to exclude other stats and will only look at one stat.
df_all contains all of the games:
>>> df_all
season gameId playerTeam opposingTeam gameDate xGoalsFor xGoalsAgainst
1 2008 2008020001 NYR T.B 20081004 2.287 2.689
6 2008 2008020003 NYR T.B 20081005 1.793 0.916
11 2008 2008020010 NYR CHI 20081010 1.938 2.762
16 2008 2008020019 NYR PHI 20081011 3.030 3.020
21 2008 2008020034 NYR N.J 20081013 1.562 3.454
... ... ... ... ... ... ... ...
142576 2015 2015030185 L.A S.J 20160422 2.927 2.042
142581 2017 2017030171 L.A VGK 20180411 1.275 2.279
142586 2017 2017030172 L.A VGK 20180413 1.907 4.642
142591 2017 2017030173 L.A VGK 20180415 2.452 3.159
142596 2017 2017030174 L.A VGK 20180417 2.427 1.818
df_sum_all will contain the calculated stats; for now it just has a bunch of empty columns:
>>> df_sum_all
season team xg5 xg10 xg15 xg20
0 2008 NYR 0 0 0 0
1 2009 NYR 0 0 0 0
2 2010 NYR 0 0 0 0
3 2011 NYR 0 0 0 0
4 2012 NYR 0 0 0 0
.. ... ... ... ... ... ...
327 2014 L.A 0 0 0 0
328 2015 L.A 0 0 0 0
329 2016 L.A 0 0 0 0
330 2017 L.A 0 0 0 0
331 2018 L.A 0 0 0 0
Here's my function for calculating the ratio of xGoalsFor and xGoalsAgainst.
def calcRatio(statfor, statagainst, games, season, team, statsdf):
    tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)]
                    .nsmallest(games, 'gameDate').eval(statfor).sum())
    tempAgainst = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)]
                        .nsmallest(games, 'gameDate').eval(statagainst).sum())
    tempRatio = tempFor / tempAgainst
    return tempRatio
I believe it's logical enough. I input the stat I want to make a ratio from, how many games to sum, the season and team to match on, and then where to get the stats from. I've tested these functions separately and know that I can filter just fine, and sum the stats, and so forth. Here's an example of a standalone implementation of the tempFor calculation:
>>> statsdf = df_all
>>> team = 'TOR'
>>> season = 2015
>>> games = 3
>>> statfor = 'xGoalsFor'
>>> tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
>>> print(tempFor)
8.618
See? It returns a value. However I can't do the same across the whole dataframe. What am I missing? I thought the way this works is essentially for every row, it sets the 'xg5' column to the output of the calcRatio function, which uses that row's 'season' and 'team' to filter on df_all.
>>> df_sum_all['xg5'] = calcRatio('xGoalsFor','xGoalsAgainst',5,df_sum_all['season'], df_sum_all['team'], df_all)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in calcRatio
File "/home/sebastian/.local/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 1142, in wrapper
raise ValueError("Can only compare identically-labeled " "Series objects")
ValueError: Can only compare identically-labeled Series objects
Cheers, thanks for any help!
Update: I used iterrows() and it worked fine, so I must just not understand vectorization very well. It's the same function, though - why does it work in one fashion, but not another?
>>> emptyseries = []
>>> for index, row in df_sum_all.iterrows():
... emptyseries.append(calcRatio('xGoalsFor','xGoalsAgainst',5,row['season'],row['team'], df_all))
...
>>> df_sum_all['xg5'] = emptyseries
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df_sum_all
season team xg5 xg10 xg15 xg20
0 2008 NYR 0.826260 0 0 0
1 2009 NYR 1.288390 0 0 0
2 2010 NYR 0.915942 0 0 0
3 2011 NYR 0.730498 0 0 0
4 2012 NYR 0.980744 0 0 0
.. ... ... ... ... ... ...
327 2014 L.A 0.823998 0 0 0
328 2015 L.A 1.147412 0 0 0
329 2016 L.A 1.054947 0 0 0
330 2017 L.A 1.369005 0 0 0
331 2018 L.A 0.721411 0 0 0
[332 rows x 6 columns]
"ValueError: Can only compare identically-labeled Series objects"
tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
tempAgainst = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statagainst).sum())
The inputs for the variables were:
team: df_sum_all['team']
season: df_sum_all['season']
statsdf: df_all
So in the expression (statsdf.playerTeam == team), pandas tries to compare a Series from df_all element-wise with a Series from df_sum_all. Since the two come from different dataframes with different indexes, they are not identically labeled, and you get the above error. It works with iterrows() because there row['team'] and row['season'] are scalars, and comparing a Series with a scalar is always well-defined.
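If you want to avoid iterrows(), one vectorized route (a sketch, assuming the column names from the question) is to aggregate each team/season's first five games once and merge the ratio back:
first5 = (df_all.sort_values('gameDate')
                .groupby(['playerTeam', 'season'])
                .head(5)                # each team/season's first five games
                .groupby(['playerTeam', 'season'])[['xGoalsFor', 'xGoalsAgainst']]
                .sum())
xg5 = (first5['xGoalsFor'] / first5['xGoalsAgainst']).rename('xg5')
df_sum_all = (df_sum_all.drop(columns='xg5')
                        .merge(xg5.reset_index(),
                               left_on=['team', 'season'],
                               right_on=['playerTeam', 'season'],
                               how='left')
                        .drop(columns='playerTeam'))
Here each team/season pair is computed once, instead of re-filtering df_all for every row of df_sum_all.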

Replacing values in a pandas dataframe based on multiple conditions

I have a fairly simple question based on this sample code:
import numpy as np
import pandas as pd

x1 = 10 * np.random.randn(10, 3)
df1 = pd.DataFrame(x1)
I am looking for a single DataFrame derived from df1 where positive values are replaced with "up", negative values are replaced with "down", and 0 values, if any, are replaced with "zero". I have tried using the .where() and .mask() methods but could not obtain the desired result.
I have seen other posts which filter according to multiple conditions at once, but they do not show how to replace values according to different conditions.
df1.apply(np.sign).replace({-1: 'down', 1: 'up', 0: 'zero'})
Output:
0 1 2
0 down up up
1 up down down
2 up down down
3 down down up
4 down down up
5 down up up
6 down up down
7 up down down
8 up up down
9 down up up
P.S. Getting exactly zero with randn is pretty unlikely, of course
For multiple conditions, i.e. (df['employrate'] <= 55) & (df['employrate'] > 50),
use this:
df['employrate'] = np.where(
    (df['employrate'] <= 55) & (df['employrate'] > 50), 11, df['employrate']
)
or you can do it this way as well,
gm.loc[(gm['employrate'] <55) & (gm['employrate'] > 50),'employrate']=11
here informal syntax can be:
<dataset>.loc[<filter1> & (<filter2>),'<variable>']='<value>'
out[108]:
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 11.000000 7.29
2 Algeria 11.000000 .69
3 Andorra nan 10.17
4 Angola 75.699997 5.57
therefore the syntax we used here is:
df['<column_name>'] = np.where((<filter 1>) & (<filter 2>), <new value>, df['<column_name>'])
For a single condition, i.e. (df['employrate'] > 70):
country employrate alcconsumption
0 Afghanistan 55.7000007629394 .03
1 Albania 51.4000015258789 7.29
2 Algeria 50.5 .69
3 Andorra 10.17
4 Angola 75.6999969482422 5.57
use this:
df.loc[df['employrate'] > 70, 'employrate'] = 7
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 51.400002 7.29
2 Algeria 50.500000 .69
3 Andorra nan 10.17
4 Angola 7.000000 5.57
therefore the syntax here is:
df.loc[<mask> (the mask generates the labels to index), <optional column(s)>]
In general, you could use np.select on the values and rebuild the DataFrame:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(10*np.random.randn(10, 3))
df1.iloc[0, 0] = 0  # so we can check the == 0 condition

conds = [df1.values < 0, df1.values > 0]
choices = ['down', 'up']
pd.DataFrame(np.select(conds, choices, default='zero'),
             index=df1.index,
             columns=df1.columns)
Output:
0 1 2
0 zero down up
1 up down up
2 up up up
3 down down down
4 up up up
5 up up up
6 up up down
7 up up down
8 down up down
9 up up down
IF condition with OR
from pandas import DataFrame
names = {'First_name': ['Jon','Bill','Maria','Emma']}
df = DataFrame(names,columns=['First_name'])
df.loc[(df['First_name'] == 'Bill') | (df['First_name'] == 'Emma'), 'name_match'] = 'Match'
df.loc[(df['First_name'] != 'Bill') & (df['First_name'] != 'Emma'), 'name_match'] = 'Mismatch'
print (df)
Output
First_name name_match
0 Jon Mismatch
1 Bill Match
2 Maria Mismatch
3 Emma Match
