I have two dataframes:
df1:
Country  value  Average  Week Rank
UK       42     42       1
US       9      9.5      2
DE       10     9.5      3
NL       15     15.5     4
ESP      16     15.5     5
POL      17     18       6
CY       18     18       7
IT       20     18       8
AU       17     18       9
FI       18     18       10
SW       20     18       11
df2:
Country  value  Average  Year Rank
US       42     42       1
UK       9      9.5      2
ESP      10     9.5      3
SW       15     15.5     4
IT       16     15.5     5
POL      17     18       6
NO       18     18       7
SL       20     18       8
PO       17     18       9
FI       18     18       10
NL       20     18       11
DE       17     18       12
AU       18     18       13
CY       20     18       14
I'm looking to create a column in df1 that shows the 'Year Rank' of the countries in df1, so that I have the following:
Country  value  Average  Week Rank  Year Rank
UK       42     42       1          2
US       9      9.5      2          1
DE       10     9.5      3          12
NL       15     15.5     4          11
ESP      16     15.5     5          3
POL      17     18       6          6
CY       18     18       7          14
IT       20     18       8          5
AU       17     18       9          13
FI       18     18       10         10
SW       20     18       11         4
How would I loop through the countries in df1 and find the corresponding rank in df2?
Edit: I am only looking for the yearly rank of the countries in df1
Thanks!
Use:
df1['Year Rank'] = df1.merge(df2, on='Country')['Year Rank']
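Note that this one-liner relies on the merge result lining up with df1's default integer index. If row order or countries missing from df2 are a concern, a label-based lookup with `Series.map` avoids positional alignment; here is a minimal sketch using a three-country subset of the sample data (the frame construction is mine):

```python
import pandas as pd

# Subset of the sample frames from the question
df1 = pd.DataFrame({'Country': ['UK', 'US', 'DE'],
                    'value': [42, 9, 10],
                    'Week Rank': [1, 2, 3]})
df2 = pd.DataFrame({'Country': ['US', 'UK', 'DE'],
                    'Year Rank': [1, 2, 12]})

# Look up each country's yearly rank by label rather than by position
df1['Year Rank'] = df1['Country'].map(df2.set_index('Country')['Year Rank'])
```

Countries absent from df2 would simply get NaN instead of a misaligned value.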
I have a dataframe which looks like this :
Name Age Job
0 Alex 20 Student
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
4 Rosa 20 senior manager
5 johanes 25 Dentist
6 lina 23 Student
7 yaser 25 Pilot
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
.
.
.
.
I want to select the rows before and after the row that has NaN values in column Job with the row itself. For that I have the following code :
Rows = df[df.shift(1, fill_value="dummy").Job.isna() | df.Job.isna() | df.shift(-1, fill_value="dummy").Job.isna()]
print(Rows)
the result is this:
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
The only problem here is row number 10: it should appear twice in the result, because it is both the row after a NaN value (row 9) and the row before a NaN value (row 11); that is, it sits between two rows with NaN values. So in the end I want to have this:
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
So every row that sits between two rows with NaN values should appear twice in the result (i.e. be duplicated). Is there any way to do this? Any help will be appreciated.
Use concat to combine the rows before the NaNs, the rows after them, and the rows matching the condition itself:
m = df.Job.isna()
df = pd.concat([df[m.shift(fill_value=False)],
                df[m.shift(-1, fill_value=False)],
                df[m]]).sort_index()
print(df)
Name Age Job
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
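For reference, a self-contained runnable version of the above (the DataFrame construction is mine, mirroring the sample data from the question):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alex', 'Sara', 'john', 'kevin', 'Rosa', 'johanes', 'lina',
             'yaser', 'jason', 'Ali', 'Ahmad', 'Joe', 'Donald'],
    'Age': [20, 21, 23, 22, 20, 25, 23, 25, 20, 23, 21, 24, 29],
    'Job': ['Student', 'Doctor', np.nan, 'Teacher', 'senior manager',
            'Dentist', 'Student', 'Pilot', 'Manager', np.nan, 'Professor',
            np.nan, 'Waiter'],
})

m = df.Job.isna()
out = pd.concat([df[m.shift(fill_value=False)],     # rows just after a NaN
                 df[m.shift(-1, fill_value=False)], # rows just before a NaN
                 df[m]]).sort_index()               # the NaN rows themselves
```

Because row 10 is selected by both shifted masks, it appears twice after `sort_index()`, which is exactly the requested duplication.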
There exists the following dataframe:
year  pop0  pop1  city0   city1
2019  20    40    Malibu  NYC
2018  8     60    Sydney  Dublin
2018  36    23    NYC     Malibu
2020  17    44    Malibu  NYC
2019  5     55    Sydney  Dublin
I would like to calculate the weighted average for the population of each city pair as a new column. For example, the w_mean for Malibu / NYC = (23+20+17)/(36+40+44) = 0.5.
Following is the desired output:
year  pop0  pop1  city0   city1   w_mean
2018  23    36    Malibu  NYC     0.5
2019  20    40    Malibu  NYC     0.5
2020  17    44    Malibu  NYC     0.5
2018  8     60    Sydney  Dublin  0.113
2019  5     55    Sydney  Dublin  0.113
I already sorted the dataframe by its columns, but I have issues swapping the 3rd row from NYC/Malibu to Malibu/NYC with its populations. Besides that, I can only calculate the w_mean for each row but not for each group. I tried groupby().mean() but didn't get any useful output.
Current code:
import pandas as pd
data = pd.DataFrame({'year': ["2019", "2018", "2018", "2020", "2019"],
                     'pop0': [20, 8, 36, 17, 5],
                     'pop1': [40, 60, 23, 44, 55],
                     'city0': ['Malibu', 'Sydney', 'NYC', 'Malibu', 'Sydney'],
                     'city1': ['NYC', 'Dublin', 'Malibu', 'NYC', 'Dublin']})
new = data.sort_values(by=['city0', 'city1'])
new['w_mean'] = new.apply(lambda row: row.pop0 / row.pop1, axis=1)
print(new)
What you can do is create tuples of (city, population), put the two tuples of each row into a list, and then sort that list. Doing this for all rows lets you extract the new cities and populations (sorted alphabetically by city). This can be done as follows:
cities = [sorted([(e[0], e[1]), (e[2], e[3])])
          for e in data[['city0', 'pop0', 'city1', 'pop1']].values]
data[['city0', 'pop0']] = [e[0] for e in cities]
data[['city1', 'pop1']] = [e[1] for e in cities]
Resulting dataframe:
year pop0 pop1 city0 city1
0 2019 20 40 Malibu NYC
1 2018 60 8 Dublin Sydney
2 2018 23 36 Malibu NYC
3 2020 17 44 Malibu NYC
4 2019 55 5 Dublin Sydney
Now, the mean_w column can be created using groupby and transform to create the two sums and then divide as follows:
data[['pop0_sum', 'pop1_sum']] = data.groupby(['city0', 'city1'])[['pop0', 'pop1']].transform('sum')
data['w_mean'] = data['pop0_sum'] / data['pop1_sum']
Result:
year pop0 pop1 city0 city1 pop0_sum pop1_sum w_mean
0 2019 20 40 Malibu NYC 60 120 0.500000
1 2018 60 8 Dublin Sydney 115 13 8.846154
2 2018 23 36 Malibu NYC 60 120 0.500000
3 2020 17 44 Malibu NYC 60 120 0.500000
4 2019 55 5 Dublin Sydney 115 13 8.846154
Any extra columns can now be dropped.
If the resulting w_mean column should always be less than one, then the last division can be done as follows instead (this uses numpy, so `import numpy as np` is needed):
data['w_mean'] = np.where(data['pop0_sum'] > data['pop1_sum'],
                          data['pop1_sum'] / data['pop0_sum'],
                          data['pop0_sum'] / data['pop1_sum'])
This will give 0.5 for the Malibu & NYC pair and 0.113043 for Dublin & Sydney.
My data is in a csv like this
ID Date Year Home Team Away Team HP AP
1 09/02 1966 Miami Oakland 14 23
2 09/03 1966 Houston Denver 45 7
3 09/10 1966 Oakland Houston 31 0
4 09/27 1966 Houston Oakland 18 10
5 10/20 1966 Oakland Houston 21 18
On each row I want to sum the previously accumulated home and away points for both home team and away team.
I have used pandas groupby to get the home points for the home team and the away points for the away team similar to below
df1['HT_HP'] = df1.groupby('Home Team')['HP'].apply(lambda x: x.shift().cumsum())
But I can't do it to get the previously scored away points for the home team and the previously scored home points for the away team.
So for the first Oakland vs. Houston game, there would be a column showing Oakland's 23 previous away points and a separate column showing Houston's 45 previous home points.
Expected outcome:
ID Date Year Home Team Away Team HP AP HT_AP AT_HP
1 09/02 1966 Miami Oakland 14 23 NaN NaN
2 09/03 1966 Houston Denver 45 7 NaN NaN
3 09/10 1966 Oakland Houston 31 0 23 45
4 09/27 1966 Houston Oakland 18 10 0 31
5 10/20 1966 Oakland Houston 21 18 33 63
I've tried this
df1['HT_AGS'] = df1.where(df1['AwayTeam']==df1['HomeTeam']).groupby('HomeTeam')['FTHG'].apply(lambda x : x.shift().cumsum())
This returns a full column of NaN values
In Excel it would be something similar to SUMIFS(F1:F3, D1:D3, E4).
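One possible way to sketch this (not an approach from the question; the running-total dictionaries are my own device): iterate the games in order and record each team's accumulated home and away points before adding the current game's score:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Home Team': ['Miami', 'Houston', 'Oakland', 'Houston', 'Oakland'],
    'Away Team': ['Oakland', 'Denver', 'Houston', 'Oakland', 'Houston'],
    'HP': [14, 45, 31, 18, 21],
    'AP': [23, 7, 0, 10, 18],
})

home_pts, away_pts = {}, {}   # running totals per team
ht_ap, at_hp = [], []
for _, row in df.iterrows():
    # record totals *before* adding this game, so the current game is excluded
    ht_ap.append(away_pts.get(row['Home Team'], float('nan')))
    at_hp.append(home_pts.get(row['Away Team'], float('nan')))
    home_pts[row['Home Team']] = home_pts.get(row['Home Team'], 0) + row['HP']
    away_pts[row['Away Team']] = away_pts.get(row['Away Team'], 0) + row['AP']

df['HT_AP'] = ht_ap   # home team's previously scored away points
df['AT_HP'] = at_hp   # away team's previously scored home points
```

This reproduces the expected HT_AP/AT_HP columns above; a vectorised alternative would be to reshape the frame into one row per (team, game) and apply a shifted cumsum per team, in the spirit of the groupby attempt.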
I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0