Binning Categorical Columns Programmatically Using Python - python

I am trying to bin categorical columns programmatically - any ideas on how I can achieve this without manually hard-coding each value in that column?
Essentially, what I would like is a function that keeps a city name as-is while the cumulative share of values is within 80%, and replaces the remaining 20% of city names with the word 'Other'.
I.e. if the first 17 city names make up 80% of that column, keep those city names as-is, else return 'Other'.
E.g.:
0 Brighton
1 Yokohama
2 Levin
3 Melbourne
4 Coffeyville
5 Whakatane
6 Melbourne
7 Melbourne
8 Levin
9 Ashburn
10 Te Awamutu
11 Bishkek
12 Melbourne
13 Whanganui
14 Coffeyville
15 New York
16 Brisbane
17 Greymouth
18 Brisbane
19 Chuo City
20 Accra
21 Levin
22 Waiouru
23 Brisbane
24 New York
25 Chuo City
26 Lucerne
27 Whanganui
28 Los Angeles
29 Melbourne
df['city'].head(30).value_counts(ascending=False, normalize=True)*100
Melbourne 16.666667
Levin 10.000000
Brisbane 10.000000
Whanganui 6.666667
Coffeyville 6.666667
New York 6.666667
Chuo City 6.666667
Waiouru 3.333333
Greymouth 3.333333
Te Awamutu 3.333333
Bishkek 3.333333
Lucerne 3.333333
Ashburn 3.333333
Yokohama 3.333333
Whakatane 3.333333
Accra 3.333333
Brighton 3.333333
Los Angeles 3.333333
From Ashburn down, the names should be replaced with 'Other'.
I have tried the below, which is a start, but it's not exactly what I want:
city_map = dict(df['city'].value_counts(ascending=False, normalize=True)*100)
df['city_count'] = df['city'].map(city_map)

def count(df):
    if df["city_count"] > 10:
        return "High"
    elif df["city_count"] < 0:
        return "Medium"
    else:
        return "Low"

df.apply(count, axis=1)
I'm not expecting any code - just some guidance on where to start or ideas on how I can achieve this

We can groupby on city and get the size of each group. We divide those values by the length of the dataframe with len and calculate the cumsum. The last step is to check at which point we exceed the threshold, so we can broadcast the boolean series back to your dataframe with map.
import numpy as np

threshold = 0.7
m = df['city'].map(
    df.groupby('city')['city'].size()
      .sort_values(ascending=False)
      .div(len(df))
      .cumsum()
      .le(threshold)
)
df['city'] = np.where(m, df['city'], 'Other')
city
0 Other
1 Other
2 Levin
3 Melbourne
4 Coffeyville
5 Other
6 Melbourne
7 Melbourne
8 Levin
9 Ashburn
10 Other
11 Bishkek
12 Melbourne
13 Other
14 Coffeyville
15 New York
16 Brisbane
17 Other
18 Brisbane
19 Chuo City
20 Other
21 Levin
22 Other
23 Brisbane
24 New York
25 Chuo City
26 Other
27 Other
28 Other
29 Melbourne
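As a side note, the same keep-or-replace decision can be written more compactly with value_counts, which already sorts descending and can normalize for you. This is a sketch of the same logic, not the code from the original answer:
threshold = 0.7
keep = df['city'].value_counts(normalize=True).cumsum().le(threshold)  # True for cities inside the threshold
df['city'] = df['city'].where(df['city'].map(keep), 'Other')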
Old method
If I understand you correctly, you want to calculate a cumulative sum with .cumsum and check when it exceeds your set threshold.
Then we use np.where to conditionally fill in the city name or 'Other'.
threshold = 80
m = df['Normalized'].cumsum().le(threshold)
df['City'] = np.where(m, df['City'], 'Other')
City Normalized
0 Auckland 40.399513
1 Christchurch 13.130783
2 Wellington 12.267604
3 Hamilton 4.026242
4 Tauranga 3.867353
5 (not set) 3.540075
6 Dunedin 2.044508
7 Other 1.717975
8 Other 1.632849
9 Other 1.520342
10 Other 1.255651
11 Other 1.173878
12 Other 1.040508
13 Other 0.988166
14 Other 0.880502
15 Other 0.766877
16 Other 0.601468
17 Other 0.539067
18 Other 0.471824
19 Other 0.440903
20 Other 0.440344
21 Other 0.405884
22 Other 0.365836
23 Other 0.321131
24 Other 0.306602
25 Other 0.280524
26 Other 0.237123
27 Other 0.207878
28 Other 0.186084
29 Other 0.167085
30 Other 0.163732
31 Other 0.154977
Note: this method assumes that your Normalized column is sorted in descending order.
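For reference, a Normalized column like the one above could be built with value_counts, which returns the percentages already sorted descending. This is only a sketch, assuming a hypothetical per-row dataframe raw with a 'City' column:
import pandas as pd

# `raw` is hypothetical: one row per observation, with a 'City' column
s = raw['City'].value_counts(normalize=True).mul(100)
norm_df = pd.DataFrame({'City': s.index, 'Normalized': s.values})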

Related

Pull specific values from one dataframe based on values in another

I have two dataframes
df1:
Country  value  Average  Week Rank
UK       42     42       1
US       9      9.5      2
DE       10     9.5      3
NL       15     15.5     4
ESP      16     15.5     5
POL      17     18       6
CY       18     18       7
IT       20     18       8
AU       17     18       9
FI       18     18       10
SW       20     18       11
df2:
Country  value  Average  Year Rank
US       42     42       1
UK       9      9.5      2
ESP      10     9.5      3
SW       15     15.5     4
IT       16     15.5     5
POL      17     18       6
NO       18     18       7
SL       20     18       8
PO       17     18       9
FI       18     18       10
NL       20     18       11
DE       17     18       12
AU       18     18       13
CY       20     18       14
I'm looking to create a column in df1 that shows the 'Year Rank' of the countries in df1, so that I have the following:
Country  value  Average  Week Rank  Year Rank
UK       42     42       1          2
US       9      9.5      2          1
DE       10     9.5      3          9
NL       15     15.5     4          8
ESP      16     15.5     5          3
POL      17     18       6          6
CY       18     18       7          7
IT       20     18       8          5
AU       17     18       9          13
FI       18     18       10         10
SW       20     18       11         4
How would I loop through the countries in df1 and find the corresponding rank in df2?
Edit: I am only looking for the yearly rank of the countries in df1
Thanks!
Use:
df1['Year Rank'] = df1.merge(df2, on='Country')['Year Rank']
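Note that this relies on the merged result lining up row-for-row with df1's index. A map-based variant of the same lookup (a sketch, not part of the original answer) sidesteps that assumption:
df1['Year Rank'] = df1['Country'].map(df2.set_index('Country')['Year Rank'])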

How to select rows before and after NaN in pandas?

I have a dataframe which looks like this:
Name Age Job
0 Alex 20 Student
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
4 Rosa 20 senior manager
5 johanes 25 Dentist
6 lina 23 Student
7 yaser 25 Pilot
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
...
I want to select the rows before and after any row that has a NaN value in column Job, together with the NaN row itself. For that I have the following code:
Rows = df[df.shift(1, fill_value="dummy").Job.isna() | df.Job.isna() | df.shift(-1, fill_value="dummy").Job.isna()]
print(Rows)
the result is this:
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
The only problem here is row number 10: it should appear twice in the result, because it is at once the row after a NaN (row 9) and the row before a NaN (row 11), i.e. it sits between two rows with NaN values. So in the end I want to have this:
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
So every row that sits between two rows with NaN values should appear twice in the result (i.e. be duplicated). Is there any way to do this? Any help will be appreciated.
Use concat of three selections - the rows after a NaN, the rows before a NaN, and the NaN rows themselves - then restore the original order with sort_index. A row that is both before and after a NaN matches two of the selections, which produces the required duplicate:
m = df.Job.isna()
df = pd.concat([df[m.shift(fill_value=False)],
                df[m.shift(-1, fill_value=False)],
                df[m]]).sort_index()
print(df)
Name Age Job
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter

Pandas: Swapping specific column values within one dataframe and calculating its weighted averages

There exists the following dataframe:
year  pop0  pop1  city0   city1
2019  20    40    Malibu  NYC
2018  8     60    Sydney  Dublin
2018  36    23    NYC     Malibu
2020  17    44    Malibu  NYC
2019  5     55    Sydney  Dublin
I would like to calculate the weighted average for the population of each city pair as a new column. For example, the w_mean for Malibu / NYC = (23+20+17)/(36+40+44) = 0.5.
Following is the desired output:
year  pop0  pop1  city0   city1   w_mean
2018  23    36    Malibu  NYC     0.5
2019  20    40    Malibu  NYC     0.5
2020  17    44    Malibu  NYC     0.5
2018  8     60    Sydney  Dublin  0.113
2019  5     55    Sydney  Dublin  0.113
I already sorted the dataframe by its columns, but I have trouble swapping the third row from NYC/Malibu to Malibu/NYC along with its populations. Besides that, I can only calculate the w_mean for each row, not for each group. I tried groupby().mean() but didn't get any useful output.
Current code:
import pandas as pd
data = pd.DataFrame({'year': ["2019", "2018", "2018", "2020", "2019"], 'pop0': [20,8,36,17,5], 'pop1': [40,60,23,44,55], 'city0': ['Malibu','Sydney','NYC','Malibu','Sydney'], 'city1': ['NYC','Dublin','Malibu','NYC','Dublin']})
new = data.sort_values(by=['city0', 'city1'])
new['w_mean'] = new.apply(lambda row: row.pop0 / row.pop1, axis=1)
print(new)
What you can do is create tuples of (city, population), put the two tuples of each row into a list, and then sort it. By doing this for all rows, you can extract the new cities and populations (sorted alphabetically by city). This can be done as follows:
cities = [sorted([(e[0], e[1]), (e[2], e[3])]) for e in data[['city0','pop0','city1','pop1']].values]
data[['city0', 'pop0']] = [e[0] for e in cities]
data[['city1', 'pop1']] = [e[1] for e in cities]
Resulting dataframe:
year pop0 pop1 city0 city1
0 2019 20 40 Malibu NYC
1 2018 60 8 Dublin Sydney
2 2018 23 36 Malibu NYC
3 2020 17 44 Malibu NYC
4 2019 55 5 Dublin Sydney
Now, the mean_w column can be created using groupby and transform to create the two sums and then divide as follows:
data[['pop0_sum', 'pop1_sum']] = data.groupby(['city0', 'city1'])[['pop0', 'pop1']].transform('sum')
data['w_mean'] = data['pop0_sum'] / data['pop1_sum']
Result:
year pop0 pop1 city0 city1 pop0_sum pop1_sum w_mean
0 2019 20 40 Malibu NYC 60 120 0.500000
1 2018 60 8 Dublin Sydney 115 13 8.846154
2 2018 23 36 Malibu NYC 60 120 0.500000
3 2020 17 44 Malibu NYC 60 120 0.500000
4 2019 55 5 Dublin Sydney 115 13 8.846154
Any extra columns can now be dropped.
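For instance (a small usage sketch):
data = data.drop(columns=['pop0_sum', 'pop1_sum'])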
If the resulting w_mean column should always be at most one, then the last division can be done as follows instead, using np.where to divide the smaller sum by the larger one:
import numpy as np

data['w_mean'] = np.where(data['pop0_sum'] > data['pop1_sum'], data['pop1_sum'] / data['pop0_sum'], data['pop0_sum'] / data['pop1_sum'])
This will give 0.5 for the Malibu & NYC pair and 0.113043 for Dublin & Sydney.

Python: Sum values in col 1 if previous values in column 2 equals value in column 3 on each row

My data is in a csv like this
ID Date Year Home Team Away Team HP AP
1 09/02 1966 Miami Oakland 14 23
2 09/03 1966 Houston Denver 45 7
3 09/10 1966 Oakland Houston 31 0
4 09/27 1966 Houston Oakland 18 10
5 10/20 1966 Oakland Houston 21 18
On each row I want to sum the previously accumulated home and away points for both the home team and the away team.
I have used pandas groupby to get the home points for the home team and the away points for the away team, similar to below:
df1['HT_HP'] = df1.groupby('Home Team')['HP'].apply(lambda x: x.shift().cumsum())
But I can't do the same to get the previously scored away points for the home team and the previously scored home points for the away team.
So for the first Oakland vs Houston game there would be a column with Oakland's 23 away points and a separate column with Houston's 45 home points.
Expected outcome:
ID Date Year Home Team Away Team HP AP HT_AP AT_HP
1 09/02 1966 Miami Oakland 14 23 NaN NaN
2 09/03 1966 Houston Denver 45 7 NaN NaN
3 09/10 1966 Oakland Houston 31 0 23 45
4 09/27 1966 Houston Oakland 18 10 0 31
5 10/20 1966 Oakland Houston 21 18 33 63
I've tried this:
df1['HT_AGS'] = df1.where(df1['AwayTeam']==df1['HomeTeam']).groupby('HomeTeam')['FTHG'].apply(lambda x: x.shift().cumsum())
This returns a full column of NaN values.
In Excel it would be something similar to SUMIFS(F1:F3, D1:D3, E4).
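One way to approach this (a sketch, not an answer from the original thread; the helper prior_points below is hypothetical) is a row-wise scan that, for each game, sums the points the relevant team previously scored in the other role:
import pandas as pd

df = pd.DataFrame({
    'Home Team': ['Miami', 'Houston', 'Oakland', 'Houston', 'Oakland'],
    'Away Team': ['Oakland', 'Denver', 'Houston', 'Oakland', 'Houston'],
    'HP': [14, 45, 31, 18, 21],
    'AP': [23, 7, 0, 10, 18],
})

def prior_points(points, role_teams, lookup_teams):
    # For each row i, sum `points` over earlier rows where the team in
    # `role_teams` equals the team in `lookup_teams` at row i; NaN if none.
    out = []
    for i, team in enumerate(lookup_teams):
        mask = role_teams[:i] == team
        out.append(points[:i][mask].sum() if mask.any() else float('nan'))
    return out

# Away points previously scored by this row's home team:
df['HT_AP'] = prior_points(df['AP'], df['Away Team'], df['Home Team'])
# Home points previously scored by this row's away team:
df['AT_HP'] = prior_points(df['HP'], df['Home Team'], df['Away Team'])
print(df)
This reproduces the expected HT_AP and AT_HP columns above; a vectorized version would need the data reshaped to one row per team appearance.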

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add columns of varying length to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
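For the stated boxplot goal, the NaNs can usually be left in place, since pandas drops missing values per column when drawing boxplots. A small usage sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

new_df = df.pivot(index="yindex", columns="Year", values="Money")
new_df.plot.box()  # one box per year; NaNs are dropped per column
plt.show()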
