Even though I have googled a lot, I couldn't find a solution to my problem.
I have this dataframe:
filter10 REF
0 NaN 0.00
1 NaN 0.75
2 NaN 1.50
3 NaN 2.25
4 NaN 3.00
5 NaN 3.75
6 NaN 4.50
...
15 2.804688 11.25
16 3.021875 12.00
17 3.578125 12.75
18 3.779688 13.50
...
27 NaN 20.25
28 NaN 21.00
29 NaN 21.75
30 NaN 22.50
31 6.746875 NaN
32 NaN NaN
...
I would now like to add a column df['DIFF']: a function should go through the whole filter10 column and, for each number, find the closest number in the REF column. It should then calculate the difference between the two and put it in the same row as the filter10 value.
I would like this output:
filter10 REF DIFF
0 NaN 0.00 NaN
1 NaN 0.75 NaN
2 NaN 1.50 NaN
3 NaN 2.25 NaN
4 NaN 3.00 NaN
5 NaN 3.75 NaN
6 NaN 4.50 NaN
...
15 2.804688 11.25 -0.195312 # 2.804688 - 3 (find closest value in REF) = -0.195312
16 3.021875 12.00 0.021875
17 3.578125 12.75 -0.171875
18 3.779688 13.50 0.029688
...
27 NaN 20.25 NaN
28 NaN 21.00 NaN
29 NaN 21.75 NaN
30 NaN 22.50 NaN
31 6.746875 NaN -0.003125
32 NaN NaN NaN
...
Use pandas.merge_asof to find the nearest value:
df['DIFF'] = (pd.merge_asof(df['filter10'].dropna().sort_values().reset_index(),
                            df[['REF']].dropna().sort_values('REF'),
                            left_on='filter10', right_on='REF', direction='nearest')
                .set_index('index')['REF'].rsub(df['filter10'])
              )
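Note that merge_asof requires both inputs to be sorted on their keys, hence the sort_values calls; reset_index stashes the original row labels first so that set_index('index') can align the differences back onto df.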
Output:
filter10 REF DIFF
0 NaN 0.00 NaN
1 NaN 0.75 NaN
2 NaN 1.50 NaN
3 NaN 2.25 NaN
4 NaN 3.00 NaN
5 NaN 3.75 NaN
6 NaN 4.50 NaN
15 2.804688 11.25 -0.195312
16 3.021875 12.00 0.021875
17 3.578125 12.75 -0.171875
18 3.779688 13.50 0.029688
27 NaN 20.25 NaN
28 NaN 21.00 NaN
29 NaN 21.75 NaN
30 NaN 22.50 NaN
31 6.746875 NaN 2.246875 # likely different due to missing data
32 NaN NaN NaN
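For a frame this small, the nearest-value lookup can also be cross-checked with a plain numpy broadcast; a minimal sketch (the DIFF_check column name is just for illustration):
import numpy as np
ref = df['REF'].dropna().to_numpy()
notna = df['filter10'].notna()
vals = df.loc[notna, 'filter10'].to_numpy()
# position of the closest REF value for every non-NaN filter10 entry
idx = np.abs(vals[:, None] - ref[None, :]).argmin(axis=1)
df.loc[notna, 'DIFF_check'] = vals - ref[idx]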
As an alternative, one can use cKDTree for this:
from io import StringIO
import pandas as pd
s=""" filter10 REF
0 NaN 0.00
1 NaN 0.75
2 NaN 1.50
3 NaN 2.25
4 NaN 3.00
5 NaN 3.75
6 NaN 4.50
15 2.804688 11.25
16 3.021875 12.00
17 3.578125 12.75
18 3.779688 13.50
27 NaN 20.25
28 NaN 21.00
29 NaN 21.75
30 NaN 22.50
31 6.746875 NaN
32 NaN NaN"""
df = pd.read_csv(StringIO(s), delimiter=r"\s+")
import numpy as np
from scipy.spatial import cKDTree  # or simply KDTree

tree = cKDTree(df.REF.values[:, None])  # query needs an (n, m) array, hence the added axis from [:, None]
df['DIFF'] = df.filter10 - np.array([df.REF.values[i] if not np.isinf(dist) else np.nan
                                     for dist, i in [tree.query(x, 1) for x in df.filter10]])
# filter10 REF DIFF
#0 NaN 0.00 NaN
#1 NaN 0.75 NaN
#2 NaN 1.50 NaN
#3 NaN 2.25 NaN
#4 NaN 3.00 NaN
#5 NaN 3.75 NaN
#6 NaN 4.50 NaN
#15 2.804688 11.25 -0.195312
#16 3.021875 12.00 0.021875
#17 3.578125 12.75 -0.171875
#18 3.779688 13.50 0.029688
#27 NaN 20.25 NaN
#28 NaN 21.00 NaN
#29 NaN 21.75 NaN
#30 NaN 22.50 NaN
#31 6.746875 NaN 2.246875
#32 NaN NaN NaN
The query method returns an infinite distance when the point in question is NaN.
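A quick way to see that behaviour (a minimal sketch; the exact sentinel index returned for a missing neighbour is an implementation detail):
import numpy as np
from scipy.spatial import cKDTree
tree = cKDTree(np.array([[0.0], [1.0], [2.0]]))
dist, i = tree.query([np.nan], 1)
print(dist)  # inf -> no valid neighbour, which the list comprehension above turns into NaN
print(i)     # tree.n (here 3), an out-of-range sentinel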
I have the following dataframe:
Book_No Replicate Sample Smell Taste Odour Volatility Notes
0 12, 43 1 control 0.3 10.0 71 1 NaN
1 12, 43 2 control 0.4 8.0 63 3 NaN
2 12, 43 3 control 0.1 3.0 22 2 NaN
3 19, 21 1 control 1.1 2.0 80 3 NaN
4 19, 21 2 control 0.4 8.0 0 4 NaN
5 19, 21 3 control 0.9 3.0 4 6 NaN
6 19, 21 4 control 2.1 6.0 50 4 NaN
7 11, 22 1 control 3.4 3.0 23 3 NaN
8 12, 43 1 Sample A 1.1 11.2 75 7 NaN
9 12, 43 2 Sample A 1.4 3.3 87 6 Temperature was too hot
10 12, 43 3 Sample A 0.7 7.4 91 5 NaN
11 19, 21 1 Sample B 2.1 3.2 99 7 NaN
12 19, 21 2 Sample B 2.2 11.3 76 8 NaN
13 19, 21 3 Sample B 1.9 9.3 89 9 sample spilt by user
14 19, 21 1 Sample C 3.2 4.0 112 10 NaN
15 19, 21 2 Sample C 2.1 5.0 96 15 NaN
16 19, 21 3 Sample C 2.7 7.0 105 13 Was too cold
17 11, 22 1 Sample C 2.4 3.0 121 19 NaN
I'd like to do two separate things. First, I'd like to calculate the mean values of the 'Smell', 'Volatility', 'Taste' and 'Odour' columns for the control samples, grouped by 'Book_No'. Then, subtract those mean values from the individual Sample A, Sample B and Sample C rows whose 'Book_No' matches that of the control. The resulting dataframe should look something like this:
Book_No Replicate Sample Smell Taste Odour Volatility Notes
0 12, 43 1 control 0.300000 10.00 71.0 1.00 NaN
1 12, 43 2 control 0.400000 8.00 63.0 3.00 NaN
2 12, 43 3 control 0.100000 3.00 22.0 2.00 NaN
3 19, 21 1 control 1.100000 2.00 80.0 3.00 NaN
4 19, 21 2 control 0.400000 8.00 0.0 4.00 NaN
5 19, 21 3 control 0.900000 3.00 4.0 6.00 NaN
6 19, 21 4 control 2.100000 6.00 50.0 4.00 NaN
7 11, 22 1 control 3.400000 3.00 23.0 3.00 NaN
8 12, 43 1 Sample A 0.833333 4.20 23.0 5.00 NaN
9 12, 43 2 Sample A 1.133333 -3.70 35.0 4.00 Temperature was too hot
10 12, 43 3 Sample A 0.433333 0.40 39.0 3.00 NaN
11 19, 21 1 Sample B 0.975000 -1.55 65.5 2.75 NaN
12 19, 21 2 Sample B 1.075000 6.55 42.5 3.75 NaN
13 19, 21 3 Sample B 0.775000 4.55 55.5 4.75 sample spilt by user
14 19, 21 1 Sample C -0.200000 1.00 89.0 7.00 NaN
15 19, 21 2 Sample C -1.300000 2.00 73.0 12.00 NaN
16 19, 21 3 Sample C -0.700000 4.00 82.0 10.00 Was too cold
17 11, 22 1 Sample C -1.000000 0.00 98.0 16.00 NaN
I've tried the following code, but neither attempt gives me what I need; plus, I'd have to copy and paste the code and change the column name for each column I want to apply it to:
df['Smell'] = df['Smell'] - df.groupby(['Book_No', 'Sample'])['Smell'].transform('mean')
and I've tried to apply a mask:
mask = df['Book_No'].unique()
df.loc[~mask, 'Smell'] = (df['Smell'] - df['Smell'].where(mask).groupby([df['Book_No'],df['Sample']]).transform('mean'))
Then, separately, I'd like to subtract the control values from the sample values, when the Book_No and replicate values match. The resulting dataframe should look something like this:
Book_No Replicate Sample Smell Taste Odour Volatility Unnamed: 7
0 12, 43 1 control 0.3 10.0 71 1 NaN
1 12, 43 2 control 0.4 8.0 63 3 NaN
2 12, 43 3 control 0.1 3.0 22 2 NaN
3 19, 21 1 control 1.1 2.0 80 3 NaN
4 19, 21 2 control 0.4 8.0 0 4 NaN
5 19, 21 3 control 0.9 3.0 4 6 NaN
6 19, 21 4 control 2.1 6.0 50 4 NaN
7 11, 22 1 control 3.4 3.0 23 3 NaN
8 12, 43 1 Sample A 0.8 1.2 4 6 NaN
9 12, 43 2 Sample A 1.0 -4.7 24 3 Temperature was too hot
10 12, 43 3 Sample A 0.6 4.4 69 3 NaN
11 19, 21 1 Sample B 1.0 1.2 19 4 NaN
12 19, 21 2 Sample B 1.8 3.3 76 4 NaN
13 19, 21 3 Sample B 1.0 6.3 85 3 sample spilt by user
14 19, 21 1 Sample C 2.1 2.0 32 7 NaN
15 19, 21 2 Sample C 1.7 -3.0 96 11 NaN
16 19, 21 3 Sample C 1.8 4.0 101 7 Was too cold
17 11, 22 1 Sample C -1.0 0.0 98 16 NaN
Could anyone kindly offer their assistance to help with these two scenarios?
Thank you in advance for any help
Splitting into different columns and reordering:
# This may be useful to you in the future, plus, ints are better than strings:
df[['Book', 'No']] = df.Book_No.str.split(', ', expand=True).astype(int)
cols = df.columns.tolist()
df = df[cols[-2:] + cols[1:-2]]
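This moves the two new integer columns to the front and drops the original Book_No string column.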
You should only focus on one problem at a time in your questions, so I'll help with the first part.
# Set some vars so we don't have to type these over and over:
cols = ['Smell', 'Volatility', 'Taste', 'Odour']
mask = df.Sample.eq('control')
group = ['Book', 'No']
# Find your control values:
ctrl_means = df[mask].groupby(group)[cols].mean()
# Apply your desired change:
df.loc[~mask, cols] = (df[~mask].groupby(group)[cols]
                         .apply(lambda x: x.sub(ctrl_means.loc[x.name])))
print(df)
Output:
Book No Replicate Sample Smell Taste Odour Volatility Notes
0 12 43 1 control 0.300000 10.00 71.0 1.00 NaN
1 12 43 2 control 0.400000 8.00 63.0 3.00 NaN
2 12 43 3 control 0.100000 3.00 22.0 2.00 NaN
3 19 21 1 control 1.100000 2.00 80.0 3.00 NaN
4 19 21 2 control 0.400000 8.00 0.0 4.00 NaN
5 19 21 3 control 0.900000 3.00 4.0 6.00 NaN
6 19 21 4 control 2.100000 6.00 50.0 4.00 NaN
7 11 22 1 control 3.400000 3.00 23.0 3.00 NaN
8 12 43 1 Sample A 0.833333 4.20 23.0 5.00 NaN
9 12 43 2 Sample A 1.133333 -3.70 35.0 4.00 Temperature was too hot
10 12 43 3 Sample A 0.433333 0.40 39.0 3.00 NaN
11 19 21 1 Sample B 0.975000 -1.55 65.5 2.75 NaN
12 19 21 2 Sample B 1.075000 6.55 42.5 3.75 NaN
13 19 21 3 Sample B 0.775000 4.55 55.5 4.75 sample spilt by user
14 19 21 1 Sample C 2.075000 -0.75 78.5 5.75 NaN
15 19 21 2 Sample C 0.975000 0.25 62.5 10.75 NaN
16 19 21 3 Sample C 1.575000 2.25 71.5 8.75 Was too cold
17 11 22 1 Sample C -1.000000 0.00 98.0 16.00 NaN
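As an alternative to the apply above, the same subtraction can be done by aligning the control means to the non-control rows directly; a sketch reusing the cols, mask, group and ctrl_means variables from above:
import pandas as pd
# Build a (Book, No) MultiIndex for every non-control row, align the
# control means to it, then subtract array-wise.
keys = pd.MultiIndex.from_frame(df.loc[~mask, group])
df.loc[~mask, cols] = df.loc[~mask, cols].to_numpy() - ctrl_means.reindex(keys).to_numpy()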
First we get the mean of the control samples:
cols = ['Smell', 'Taste', 'Odour', 'Volatility']
control_means = df[df.Sample.eq('control')].groupby(['Book_No'])[cols].mean()
Then subtract it from the remaining samples to get the fixed sample data. To utilize pandas' automatic alignment, we need to temporarily set the index:
new_idx = ['Book_No', df.index]
fixed_samples = (df.set_index(new_idx).loc[df.set_index(new_idx).Sample.ne('control'), cols]
                 - control_means).droplevel(0)
Finally simply assign them back into the dataframe:
df.loc[df.Sample.ne('control'), cols] = fixed_samples
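The droplevel(0) strips the temporary Book_No level again, so fixed_samples is indexed by the original row labels and the .loc assignment aligns correctly.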
Result:
Book_No Replicate Sample Smell Taste Odour Volatility Notes
0 12, 43 1 control 0.300000 10.00 71.0 1.00 NaN
1 12, 43 2 control 0.400000 8.00 63.0 3.00 NaN
2 12, 43 3 control 0.100000 3.00 22.0 2.00 NaN
3 19, 21 1 control 1.100000 2.00 80.0 3.00 NaN
4 19, 21 2 control 0.400000 8.00 0.0 4.00 NaN
5 19, 21 3 control 0.900000 3.00 4.0 6.00 NaN
6 19, 21 4 control 2.100000 6.00 50.0 4.00 NaN
7 11, 22 1 control 3.400000 3.00 23.0 3.00 NaN
8 12, 43 1 Sample A 0.833333 4.20 23.0 5.00 NaN
9 12, 43 2 Sample A 1.133333 -3.70 35.0 4.00 Temperature was too hot
10 12, 43 3 Sample A 0.433333 0.40 39.0 3.00 NaN
11 19, 21 1 Sample B 0.975000 -1.55 65.5 2.75 NaN
12 19, 21 2 Sample B 1.075000 6.55 42.5 3.75 NaN
13 19, 21 3 Sample B 0.775000 4.55 55.5 4.75 sample spilt by user
14 19, 21 1 Sample C 2.075000 -0.75 78.5 5.75 NaN
15 19, 21 2 Sample C 0.975000 0.25 62.5 10.75 NaN
16 19, 21 3 Sample C 1.575000 2.25 71.5 8.75 Was too cold
17 11, 22 1 Sample C -1.000000 0.00 98.0 16.00 NaN
If you want, you can squeeze it into a one-liner, but it is hardly comprehensible:
cols = ['Smell', 'Taste', 'Odour', 'Volatility']
new_idx = ['Book_No', df.index]
df.loc[df.Sample.ne('control'), cols] = (
    df.set_index(new_idx).loc[df.set_index(new_idx).Sample.ne('control'), cols]
    - df[df.Sample.eq('control')].groupby(['Book_No'])[cols].mean()
).droplevel(0)
I have a df below:
Country Product Value 11/01/1998 12/01/1998 01/01/1999 ... 07/01/2022 08/01/2022 09/01/2022 10/01/2022 11/01/2022 12/01/2022
0 France NaN Market 3330 7478 2273 ... NaN NaN NaN NaN NaN NaT
1 France NaN World 362 798 306 ... NaN NaN NaN NaN NaN NaT
3 Germany NaN Market 1452 2025 1314 ... NaN NaN NaN NaN NaN NaT
4 Germany NaN World 209 246 182 ... NaN NaN NaN NaN NaN NaT
6 Spain NaN Market 1943 2941 1426 ... NaN NaN NaN NaN NaN NaT
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
343 Serbia and Montenegro 0 World 0 0 0 ... NaN NaN NaN NaN NaN NaT
345 Slovenia 0 Market 26 24 20 ... NaN NaN NaN NaN NaN NaT
346 Slovenia 0 World 0 0 1 ... NaN NaN NaN NaN NaN NaT
348 Slovakia 0 Market 2 2 0 ... NaN NaN NaN NaN NaN NaT
349 Slovakia 0 World 1 1 0 ... NaN NaN NaN NaN NaN NaT
I'm trying to rearrange the data, and I figure I need some combination of transpose, melt, and/or stack. I've read through the documentation, but I can't make sense of it, and none of the combinations I've tried gives me what I need.
Columns should be: Country, Product, Market, World, Date (transposing the dates), with the values under the Market or World columns.
Any ideas?
Thanks so much and let me know if I can provide more information.
IIUC you need a combination of melt, set_index and unstack:
print(df.melt(id_vars=["Country", "Product", "Value"])
        .set_index(["Country", "Product", "Value", "variable"])
        .unstack("Value").reset_index())
Country Product variable value
Value Market World
0 France NaN 01/01/1999 2273 306
1 France NaN 07/01/2022 NaN NaN
2 France NaN 08/01/2022 NaN NaN
3 France NaN 09/01/2022 NaN NaN
4 France NaN 10/01/2022 NaN NaN
5 France NaN 11/01/1998 3330 362
6 France NaN 11/01/2022 NaN NaN
7 France NaN 12/01/1998 7478 798
8 France NaN 12/01/2022 NaT NaT
9 Germany NaN 01/01/1999 1314 182
10 Germany NaN 07/01/2022 NaN NaN
11 Germany NaN 08/01/2022 NaN NaN
12 Germany NaN 09/01/2022 NaN NaN
13 Germany NaN 10/01/2022 NaN NaN
14 Germany NaN 11/01/1998 1452 209
15 Germany NaN 11/01/2022 NaN NaN
16 Germany NaN 12/01/1998 2025 246
17 Germany NaN 12/01/2022 NaT NaT
18 Serbia and Montenegro 0.0 01/01/1999 NaN 0
19 Serbia and Montenegro 0.0 07/01/2022 NaN NaN
20 Serbia and Montenegro 0.0 08/01/2022 NaN NaN
21 Serbia and Montenegro 0.0 09/01/2022 NaN NaN
22 Serbia and Montenegro 0.0 10/01/2022 NaN NaN
23 Serbia and Montenegro 0.0 11/01/1998 NaN 0
24 Serbia and Montenegro 0.0 11/01/2022 NaN NaN
25 Serbia and Montenegro 0.0 12/01/1998 NaN 0
26 Serbia and Montenegro 0.0 12/01/2022 NaN NaT
27 Slovakia 0.0 01/01/1999 0 0
28 Slovakia 0.0 07/01/2022 NaN NaN
29 Slovakia 0.0 08/01/2022 NaN NaN
30 Slovakia 0.0 09/01/2022 NaN NaN
31 Slovakia 0.0 10/01/2022 NaN NaN
32 Slovakia 0.0 11/01/1998 2 1
33 Slovakia 0.0 11/01/2022 NaN NaN
34 Slovakia 0.0 12/01/1998 2 1
35 Slovakia 0.0 12/01/2022 NaT NaT
36 Slovenia 0.0 01/01/1999 20 1
37 Slovenia 0.0 07/01/2022 NaN NaN
38 Slovenia 0.0 08/01/2022 NaN NaN
39 Slovenia 0.0 09/01/2022 NaN NaN
40 Slovenia 0.0 10/01/2022 NaN NaN
41 Slovenia 0.0 11/01/1998 26 0
42 Slovenia 0.0 11/01/2022 NaN NaN
43 Slovenia 0.0 12/01/1998 24 0
44 Slovenia 0.0 12/01/2022 NaT NaT
45 Spain NaN 01/01/1999 1426 NaN
46 Spain NaN 07/01/2022 NaN NaN
47 Spain NaN 08/01/2022 NaN NaN
48 Spain NaN 09/01/2022 NaN NaN
49 Spain NaN 10/01/2022 NaN NaN
50 Spain NaN 11/01/1998 1943 NaN
51 Spain NaN 11/01/2022 NaN NaN
52 Spain NaN 12/01/1998 2941 NaN
53 Spain NaN 12/01/2022 NaT NaN
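If flat Market/World columns (and a friendlier Date column) are preferred over the two-level header, the melt can name the variable directly and the leftover column-axis label can be dropped; a sketch:
out = (df.melt(id_vars=["Country", "Product", "Value"], var_name="Date")
         .set_index(["Country", "Product", "Date", "Value"])["value"]
         .unstack("Value")
         .reset_index())
out.columns.name = None  # remove the residual "Value" axis label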
I want to automate the search process on a website and scrape the table of individual players (I'm getting the players' names from an Excel sheet). I then want to add the scraped information to the existing Excel sheet with the list of players. For each year that a player has been in the league, the player's name needs to be in the first column. So far, I have been able to grab the information from the existing Excel sheet, but I'm not sure how to automate the search process using it, or whether Selenium can help. The website is https://basketball.realgm.com/.
import openpyxl
path = r"C:\Users\Name\Desktop\NBAPlayers.xlsx"
workbook = openpyxl.load_workbook(path)
sheet = workbook.active
rows = sheet.max_row
cols = sheet.max_column
print(rows)
print(cols)
for r in range(2, rows+1):
    for c in range(2, cols+1):
        print(sheet.cell(row=r, column=c).value, end=" ")
    print()
I presume you have already got the names from the Excel sheet, so I used a name list here. The code uses the requests module to fetch each search page, Beautiful Soup to pull out the table, and then pandas to read the table into a dataframe.
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
playernames=['Dominique Jones', 'Joe Young', 'Darius Adams', 'Lester Hudson', 'Marcus Denmon', 'Courtney Fortson']
for name in playernames:
    fname = name.split(" ")[0]
    lname = name.split(" ")[1]
    url = "https://basketball.realgm.com/search?q={}+{}".format(fname, lname)
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.select_one(".tablesaw")
    dfs = pd.read_html(str(table))
    for df in dfs:
        print(df)
Output:
https://basketball.realgm.com/search?q=Dominique+Jones
Player Pos HT ... Draft Year College NBA
0 Dominique Jones G 6-4 ... 2010 South Florida Dallas Mavericks
1 Dominique Jones G 6-2 ... 2009 Liberty -
2 Dominique Jones PG 5-9 ... 2011 Fort Hays State -
[3 rows x 8 columns]
https://basketball.realgm.com/search?q=Joe+Young
Player Pos HT ... Draft Year College NBA
0 Joe Young F 6-6 ... 2007 Holy Cross -
1 Joe Young G 6-0 ... 2009 Canisius -
2 Joe Young G 6-2 ... 2015 Oregon Indiana Pacers
3 Joe Young G 6-2 ... 2009 Central Missouri -
[4 rows x 8 columns]
https://basketball.realgm.com/search?q=Darius+Adams
Player Pos HT ... Draft Year College NBA
0 Darius Adams PG 6-1 ... 2011 Indianapolis -
1 Darius Adams G 6-0 ... 2018 Coast Guard Academy -
[2 rows x 8 columns]
https://basketball.realgm.com/search?q=Lester+Hudson
Season Team GP GS MIN ... STL BLK PF TOV PTS
0 2009-10 * All Teams 25 0 5.3 ... 0.32 0.12 0.48 0.56 2.32
1 2009-10 * BOS 16 0 4.4 ... 0.19 0.12 0.44 0.56 1.38
2 2009-10 * MEM 9 0 6.8 ... 0.56 0.11 0.56 0.56 4.00
3 2010-11 WAS 11 0 6.7 ... 0.36 0.09 0.91 0.64 1.64
4 2011-12 * All Teams 16 0 20.9 ... 0.88 0.19 1.62 2.00 10.88
5 2011-12 * CLE 13 0 24.2 ... 1.08 0.23 2.00 2.31 12.69
6 2011-12 * MEM 3 0 6.5 ... 0.00 0.00 0.00 0.67 3.00
7 2014-15 LAC 5 0 11.1 ... 1.20 0.20 0.80 0.60 3.60
8 CAREER NaN 57 0 10.4 ... 0.56 0.14 0.91 0.98 4.70
[9 rows x 23 columns]
https://basketball.realgm.com/search?q=Marcus+Denmon
Season Team Location GP GS ... STL BLK PF TOV PTS
0 2012-13 SAN Las Vegas 5 0 ... 0.4 0.0 1.60 0.20 5.40
1 2013-14 SAN Las Vegas 5 1 ... 0.8 0.0 2.20 1.20 10.80
2 2014-15 SAN Las Vegas 6 2 ... 0.5 0.0 1.50 0.17 5.00
3 2015-16 SAN Salt Lake City 2 0 ... 0.0 0.0 0.00 0.00 0.00
4 CAREER NaN NaN 18 3 ... 0.5 0.0 1.56 0.44 6.17
[5 rows x 24 columns]
https://basketball.realgm.com/search?q=Courtney+Fortson
Season Team GP GS MIN FGM ... AST STL BLK PF TOV PTS
0 2011-12 * All Teams 10 0 9.5 1.10 ... 1.00 0.3 0.0 0.50 1.00 3.50
1 2011-12 * HOU 6 0 8.2 1.00 ... 0.83 0.5 0.0 0.33 0.83 3.00
2 2011-12 * LAC 4 0 11.5 1.25 ... 1.25 0.0 0.0 0.75 1.25 4.25
3 CAREER NaN 10 0 9.5 1.10 ... 1.00 0.3 0.0 0.50 1.00 3.50
[4 rows x 23 columns]
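To push the scraped tables back into an Excel file as the question asks, pandas can write one sheet per player; a minimal sketch (the output path and the frames dict are assumptions, to be filled inside the loop above):
import pandas as pd
frames = {}  # hypothetical: player name -> dataframe collected in the loop above
with pd.ExcelWriter(r"C:\Users\Name\Desktop\NBAPlayers_scraped.xlsx") as writer:
    for name, frame in frames.items():
        # Excel caps sheet names at 31 characters
        frame.to_excel(writer, sheet_name=name[:31], index=False)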
You have to have a URL list of the players and scrape each page using Beautiful Soup, for example (urllib2 is Python 2 only; on Python 3 use urllib.request):
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://example.com').read(), 'html.parser')
My data frame is below:
Date Country GDP
0 2011 United States 345.0
1 2012 United States 0.0
2 2013 United States 457.0
3 2014 United States 577.0
4 2015 United States 0.0
5 2016 United States 657.0
6 2011 UK 35.0
7 2012 UK 64.0
8 2013 UK 54.0
9 2014 UK 67.0
10 2015 UK 687.0
11 2016 UK 0.0
12 2011 China 34.0
13 2012 China 54.0
14 2013 China 678.0
15 2014 China 355.0
16 2015 China 5678.0
17 2016 China 345.0
I want to calculate each country's share of the combined GDP of all 3 countries in each year. I would like to add one more column called perc to the dataframe.
I implemented the code below:
import pandas as pd

countrylist = ['United States', 'UK', 'China']
for country in countrylist:
    for year in range(2011, 2016):
        df['perc'] = (df['GDP'][(df['Country']==country) & (df['Date']==year)]).astype(float) / df['GDP'][df['Date']==year].sum()
        print(df['perc'])
My output looks like this:
0 0.833333
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
0 NaN
1 0.0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
0 NaN
1 NaN
2 0.384357
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
....
I noticed that my previous results get wiped out when the next loop iteration starts, so ultimately I only have the last perc value. I think I should provide some position info when assigning df['perc'], such as:
df['perc'][([(df['Country']==country) & (df['Date']==year)])]=(df['GDP'][(df['Country']==country) & (df['Date']==year)]).astype(float)/df['GDP'][df['Date']==year].sum()
But it doesn't work. How can I dynamically insert the values?
Ideally, I should have:
Date Country GDP perc
0 2011 United States 345.0 0.81
1 2012 United States 0.0 0.0
2 2013 United States 457.0 0.23
3 2014 United States 577.0 xx
4 2015 United States 0.0 xx
5 2016 United States 657.0 xx
6 2011 UK 35.0 xx
7 2012 UK 64.0 xx
8 2013 UK 54.0 xx
9 2014 UK 67.0 xx
10 2015 UK 687.0 xx
11 2016 UK 0.0 xx
12 2011 China 34.0 xx
13 2012 China 54.0 xx
14 2013 China 678.0 xx
15 2014 China 355.0 xx
16 2015 China 5678.0 xx
17 2016 China 345.0 xx
You can use a groupby transform with sum here:
df.GDP/df.groupby('Date').GDP.transform('sum')
Out[161]:
0 0.833333
1 0.000000
2 0.384357
3 0.577578
4 0.000000
5 0.655689
6 0.084541
7 0.542373
8 0.045416
9 0.067067
10 0.107934
11 0.000000
12 0.082126
13 0.457627
14 0.570227
15 0.355355
16 0.892066
17 0.344311
Name: GDP, dtype: float64
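To store it as the perc column the question asks for, assign the result back:
df['perc'] = df.GDP / df.groupby('Date').GDP.transform('sum')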