I have 2 data frames, from which I want to create a third data frame(country) from data from the 2 data frames.
Below the data:
Indicator 1
country 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
1 Angola 200.0 193.0 185.0 176.0 167.0 157.0 148.0 138.0 129.0 120.0
2 Albania 24.5 23.1 21.8 20.4 19.2 17.9 16.7 15.5 14.4 13.3
195 Zambia 153.0 142.0 130.0 119.0 110.0 101.0 95.4 90.4 85.1 80.3
Indicator2
country 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
1 Angola 53.4 54.5 55.1 55.5 56.4 57.0 58.0 58.8 59.5 60.2
2 Albania 76.0 75.9 75.6 75.8 76.2 76.9 77.5 77.6 78.0 78.1
193 Zambia 45.2 45.9 46.6 47.7 48.7 50.0 51.9 54.1 55.7 56.5
I need to create a new data frame for each country like below
Angloa
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
Indicator1 200.0 193.0 185.0 176.0 167.0 157.0 148.0 138.0 129.0 120.0
Indicator2 53.4 54.5 55.1 55.5 56.4 57.0 58.0 58.8 59.5 60.2
I need to know the code for creating this new data frame
What you asked can be done this way :
# Setting up DataFrames
indicator1 = pd.DataFrame({
'country': ['Angola', 'Albania', 'Zambia'],
'2001': ["200.0", "24.5", "153.0"],
'2002': ["193.0", "23.1", "142.0"]
})
indicator2 = pd.DataFrame({
'country': ['Angola', 'Albania', 'Zambia'],
'2001': ["53.4", "76.0", "45.2"],
'2002': ["54.5", "75.9", "45.9"]
})
# For each country
for index, row in indicator1.iterrows():
# create a new variable with the country as name
globals()[f"{row['country']}"] = {}
# For each column of your 2 dataframes
for key, value in indicator1.iteritems():
if key != 'country':
globals()[f"{row['country']}"][key] = [row[key], indicator2.iloc[indicator2[indicator2['country'] == row[
'country']].index.values[0]][key]]
globals()[f"{row['country']}"] = pd.DataFrame(globals()[f"{row['country']}"])
I've only did it with an extract from your data, but it can be generalised. I'm not sure saving the newly created DataFrame like this is the best way, but I had no variable idea so I let you worry about this.
print(Angola)
# Output :
2001 2002
0 200.0 193.0
1 53.4 54.5
Related
I'm trying to scrape some NFL data from:
url = https://www.pro-football-reference.com/years/2019/opp.htm.
I first tried to scrape the data from the tables with pandas. I've done this before and it's always been straight forward. I expected pandas to return a list of all tables found on the page. However, when I ran
dfs = pd.read_html(url)
I only received the first two tables from the web page, Team Defense and Team Advanced Defense.
I then went to try to scrape the other tables with bs4 and requests. To test, I first only tried to scrape the first table:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('table', id = 'advanced_defense')
rows = table.find_all('tr')
for tr in rows:
td = tr.find_all('td')
row = [i.text for i in td]
print(row)
I was then able to simply change the id such that I returned both the Team Defense and Team Advanced Defense - the same two tables that pandas returned.
However, when I try to use the same method to scrape the other tables on the page I receive an error. I obtained the id by inspecting the web page in the same manner as the first two tables and am unable to get a result.
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('table', id = 'passing')
rows = table.find_all('tr')
for tr in rows:
td = tr.find_all('td')
row = [i.text for i in td]
print(row)
It is not able to find anything for table when attempting to scrape any of the other tables on the page as I receive the following error
AttributeError: 'NoneType' object has no attribute 'find_all'
I find it strange how both pandas and bs4 are only able to return the Team Defense and Team Advanced Defense tables.
I only intend to scrape the Team Defense, Passing Defense, and Rushing Defense tables.
How could I approach successfully scraping the Passing Defense and Rushing Defense tables?
So the sports reference.com sites are tricky in that the first table (or a few tables) do show up in the html source. The other tables are dynamically rendered. HOWEVER, those other tables are within the Comments within the html. So to get those other tables, you have to pull out the comments, then can use pandas or beautifulsoup to get those table tags.
So you can grab the team stats as you normally would. Then pull the comments and parse those other tables.
import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
dfs = [pd.read_html(url, header=0, attrs={'id':'team_stats'})[0]]
dfs[0].columns = dfs[0].iloc[0,:]
dfs[0] = dfs[0].iloc[1:,:].reset_index(drop=True)
for each in comments:
if 'table' in each and ('id="passing"' in each or 'id="rushing"' in each):
dfs.append(pd.read_html(each)[0])
Output:
for df in dfs:
print (df)
0 Rk Tm G PF ... 1stPy Sc% TO% EXP
0 1 New England Patriots 16 225 ... 39 19.4 17.3 165.75
1 2 Buffalo Bills 16 259 ... 33 23.6 12.4 39.85
2 3 Baltimore Ravens 16 282 ... 39 32.9 14.6 16.61
3 4 Chicago Bears 16 298 ... 30 31.5 10.7 -4.15
4 5 Minnesota Vikings 16 303 ... 31 34.5 17.0 -7.88
5 6 Pittsburgh Steelers 16 303 ... 30 29.9 19.0 85.78
6 7 Kansas City Chiefs 16 308 ... 39 34.6 13.6 -65.69
7 8 San Francisco 49ers 16 310 ... 30 29.0 14.2 77.41
8 9 Green Bay Packers 16 313 ... 20 34.5 14.1 -63.65
9 10 Denver Broncos 16 316 ... 34 37.3 8.4 -35.98
10 11 Dallas Cowboys 16 321 ... 38 35.5 9.9 -36.81
11 12 Tennessee Titans 16 331 ... 27 32.1 11.8 -54.20
12 13 New Orleans Saints 16 341 ... 43 34.7 12.7 -41.89
13 14 Los Angeles Chargers 16 345 ... 28 37.3 8.2 -86.11
14 15 Philadelphia Eagles 16 354 ... 28 33.9 10.2 -29.57
15 16 New York Jets 16 359 ... 40 34.4 10.1 -0.06
16 17 Los Angeles Rams 16 364 ... 30 33.7 12.7 -11.53
17 18 Indianapolis Colts 16 373 ... 23 39.3 13.1 -58.37
18 19 Houston Texans 16 385 ... 28 39.3 13.1 -160.87
19 20 Cleveland Browns 16 393 ... 37 36.9 11.2 -91.15
20 21 Jacksonville Jaguars 16 397 ... 33 37.4 9.2 -120.09
21 22 Seattle Seahawks 16 398 ... 25 37.1 16.3 -92.02
22 23 Atlanta Falcons 16 399 ... 30 42.8 9.0 -105.34
23 24 Oakland Raiders 16 419 ... 52 41.2 8.5 -159.71
24 25 Cincinnati Bengals 16 420 ... 21 39.8 8.8 -132.66
25 26 Detroit Lions 16 423 ... 39 40.1 9.0 -142.55
26 27 Washington Redskins 16 435 ... 34 41.9 12.2 -135.83
27 28 Arizona Cardinals 16 442 ... 38 42.6 9.5 -174.55
28 29 Tampa Bay Buccaneers 16 449 ... 39 39.6 13.5 12.23
29 30 New York Giants 16 451 ... 32 39.7 8.7 -105.11
30 31 Carolina Panthers 16 470 ... 30 41.4 9.4 -116.88
31 32 Miami Dolphins 16 494 ... 34 45.6 8.8 -175.02
32 NaN Avg Team NaN 365.0 ... 32.9 36.0 11.8 -56.6
33 NaN League Total NaN 11680 ... 1054 36.0 11.8 NaN
34 NaN Avg Tm/G NaN 22.8 ... 2.1 36.0 11.8 NaN
[35 rows x 28 columns]
Rk Tm G Cmp ... NY/A ANY/A Sk% EXP
0 1.0 San Francisco 49ers 16.0 318.0 ... 4.80 4.6 8.5 58.30
1 2.0 New England Patriots 16.0 303.0 ... 5.00 3.5 8.1 117.74
2 3.0 Pittsburgh Steelers 16.0 314.0 ... 5.50 4.7 9.5 20.19
3 4.0 Buffalo Bills 16.0 348.0 ... 5.20 4.7 7.4 30.01
4 5.0 Los Angeles Chargers 16.0 328.0 ... 6.50 6.3 6.1 -92.16
5 6.0 Baltimore Ravens 16.0 318.0 ... 5.70 5.2 6.4 15.40
6 7.0 Cleveland Browns 16.0 318.0 ... 6.30 6.1 6.9 -64.09
7 8.0 Kansas City Chiefs 16.0 352.0 ... 5.70 5.2 7.2 -36.78
8 9.0 Chicago Bears 16.0 362.0 ... 5.90 5.7 5.3 -47.04
9 10.0 Dallas Cowboys 16.0 370.0 ... 5.90 6.1 6.4 -67.46
10 11.0 Denver Broncos 16.0 348.0 ... 6.30 6.1 6.9 -61.45
11 12.0 Los Angeles Rams 16.0 348.0 ... 5.90 5.7 8.2 -42.76
12 13.0 Carolina Panthers 16.0 347.0 ... 6.20 5.8 8.9 -63.03
13 14.0 Green Bay Packers 16.0 326.0 ... 6.30 5.7 7.0 -27.30
14 15.0 Minnesota Vikings 16.0 394.0 ... 5.80 5.3 7.4 -34.01
15 16.0 Jacksonville Jaguars 16.0 327.0 ... 6.70 6.7 8.3 -98.77
16 17.0 New York Jets 16.0 363.0 ... 6.10 6.0 5.6 -79.16
17 18.0 Washington Redskins 16.0 371.0 ... 6.50 6.7 7.8 -135.17
18 19.0 Philadelphia Eagles 16.0 348.0 ... 6.30 6.4 7.0 -88.15
19 20.0 New Orleans Saints 16.0 371.0 ... 5.90 5.8 7.8 -94.59
20 21.0 Cincinnati Bengals 16.0 308.0 ... 7.40 7.4 5.8 -126.81
21 22.0 Atlanta Falcons 16.0 351.0 ... 6.90 7.0 5.0 -128.75
22 23.0 Indianapolis Colts 16.0 394.0 ... 6.60 6.4 6.8 -86.44
23 24.0 Tennessee Titans 16.0 386.0 ... 6.40 6.2 6.7 -92.39
24 25.0 Oakland Raiders 16.0 337.0 ... 7.40 7.8 5.7 -177.69
25 26.0 Miami Dolphins 16.0 344.0 ... 7.40 7.7 4.0 -172.01
26 27.0 Seattle Seahawks 16.0 383.0 ... 6.70 6.2 4.5 -77.18
27 28.0 New York Giants 16.0 369.0 ... 7.10 7.4 6.1 -152.48
28 29.0 Houston Texans 16.0 375.0 ... 6.90 7.1 5.0 -160.60
29 30.0 Tampa Bay Buccaneers 16.0 408.0 ... 6.10 6.2 6.6 -38.17
30 31.0 Arizona Cardinals 16.0 421.0 ... 7.00 7.7 6.2 -190.81
31 32.0 Detroit Lions 16.0 381.0 ... 7.10 7.7 4.4 -162.94
32 NaN Avg Team NaN 354.1 ... 6.29 6.2 6.7 -73.60
33 NaN League Total NaN 11331.0 ... 6.29 6.2 6.7 NaN
34 NaN Avg Tm/G NaN 22.1 ... 6.29 6.2 6.7 NaN
[35 rows x 25 columns]
Rk Tm G Att ... TD Y/A Y/G EXP
0 1.0 Tampa Bay Buccaneers 16.0 362.0 ... 11.0 3.3 73.8 56.23
1 2.0 New York Jets 16.0 417.0 ... 12.0 3.3 86.9 72.34
2 3.0 Philadelphia Eagles 16.0 353.0 ... 13.0 4.1 90.1 47.64
3 4.0 New Orleans Saints 16.0 345.0 ... 12.0 4.2 91.3 39.45
4 5.0 Baltimore Ravens 16.0 340.0 ... 12.0 4.4 93.4 -1.25
5 6.0 New England Patriots 16.0 365.0 ... 7.0 4.2 95.5 33.13
6 7.0 Indianapolis Colts 16.0 383.0 ... 8.0 4.1 97.9 21.54
7 8.0 Oakland Raiders 16.0 405.0 ... 15.0 3.9 98.1 17.69
8 9.0 Chicago Bears 16.0 414.0 ... 16.0 3.9 102.0 38.83
9 10.0 Buffalo Bills 16.0 388.0 ... 12.0 4.3 103.1 10.92
10 11.0 Dallas Cowboys 16.0 407.0 ... 14.0 4.1 103.5 25.11
11 12.0 Tennessee Titans 16.0 415.0 ... 14.0 4.0 104.5 28.27
12 13.0 Minnesota Vikings 16.0 404.0 ... 8.0 4.3 108.0 21.01
13 14.0 Pittsburgh Steelers 16.0 462.0 ... 7.0 3.8 109.6 63.09
14 15.0 Atlanta Falcons 16.0 421.0 ... 13.0 4.2 110.9 17.98
15 16.0 Denver Broncos 16.0 426.0 ... 9.0 4.2 111.4 12.72
16 17.0 San Francisco 49ers 16.0 401.0 ... 11.0 4.5 112.6 9.91
17 18.0 Los Angeles Chargers 16.0 429.0 ... 15.0 4.2 112.8 1.08
18 19.0 Los Angeles Rams 16.0 444.0 ... 15.0 4.1 113.1 21.49
19 20.0 New York Giants 16.0 469.0 ... 19.0 3.9 113.3 40.51
20 21.0 Detroit Lions 16.0 455.0 ... 13.0 4.1 115.9 17.32
21 22.0 Seattle Seahawks 16.0 388.0 ... 22.0 4.9 117.7 -17.45
22 23.0 Green Bay Packers 16.0 411.0 ... 15.0 4.7 120.1 -42.18
23 24.0 Arizona Cardinals 16.0 439.0 ... 9.0 4.4 120.1 15.13
24 25.0 Houston Texans 16.0 403.0 ... 12.0 4.8 121.1 -6.34
25 26.0 Kansas City Chiefs 16.0 416.0 ... 14.0 4.9 128.2 -41.35
26 27.0 Miami Dolphins 16.0 485.0 ... 15.0 4.5 135.4 -6.14
27 28.0 Jacksonville Jaguars 16.0 435.0 ... 23.0 5.1 139.3 -21.95
28 29.0 Carolina Panthers 16.0 445.0 ... 31.0 5.2 143.5 -62.69
29 30.0 Cleveland Browns 16.0 463.0 ... 19.0 5.0 144.7 -37.50
30 31.0 Washington Redskins 16.0 493.0 ... 14.0 4.7 146.2 -6.89
31 32.0 Cincinnati Bengals 16.0 504.0 ... 17.0 4.7 148.9 -12.07
32 NaN Avg Team NaN 418.3 ... 14.0 4.3 112.9 11.10
33 NaN League Total NaN 13387.0 ... 447.0 4.3 112.9 NaN
34 NaN Avg Tm/G NaN 26.1 ... 0.9 4.3 112.9 NaN
[35 rows x 9 columns]
I'm trying to plot fantasy points from two players in every game since the start of the NBA season.
I've created a dataframe that has the lines of every player, every night, and I want to plot every date that each have played.
The two dataframes look as such.
kemba[['Date','FP']]
Date FP
Rk
260 10/23/2019 2.0
532 10/25/2019 28.0
754 10/26/2019 49.0
1390 10/30/2019 35.0
1628 11/1/2019 39.5
2178 11/5/2019 32.5
2463 11/7/2019 17.5
2800 11/9/2019 40.0
3103 11/11/2019 37.5
3410 11/13/2019 37.0
3699 11/15/2019 25.0
4001 11/17/2019 22.5
4186 11/18/2019 22.0
4494 11/20/2019 9.5
4750 11/22/2019 4.0
5637 11/27/2019 50.5
5904 11/29/2019 19.0
6193 12/1/2019 22.5
6677 12/4/2019 43.5
6975 12/6/2019 26.0
7454 12/9/2019 33.5
7769 12/11/2019 57.0
7861 12/12/2019 31.5
8614 12/18/2019 35.5
9071 12/20/2019 5.0
9289 12/22/2019 26.0
100 12/25/2019 23.0
ingram[['Date','FP']]
Date FP
Rk
22 10/22/2019 31.5
441 10/25/2019 37.5
646 10/26/2019 57.0
984 10/28/2019 41.5
1439 10/31/2019 30.0
1718 11/2/2019 10.5
1994 11/4/2019 59.0
2586 11/8/2019 30.0
2757 11/9/2019 31.5
4245 11/19/2019 30.5
4532 11/21/2019 38.5
4864 11/23/2019 40.5
5022 11/24/2019 32.5
5496 11/27/2019 22.0
5784 11/29/2019 43.0
6111 12/1/2019 31.0
6404 12/3/2019 40.0
6737 12/5/2019 27.0
7038 12/7/2019 18.0
7372 12/9/2019 38.5
7668 12/11/2019 29.0
7958 12/13/2019 38.0
8283 12/15/2019 32.5
8551 12/17/2019 24.0
8612 12/18/2019 48.0
8891 12/20/2019 30.5
102 12/23/2019 31.0
55 12/25/2019 46.5
The data that I've plotted is such:
# creating x & y for Ingram
ingram_fp=ingram['FP']
ingram_date=ingram['Date']
# creating x and y for Kemmba
kemba_fp=kemba['FP']
kemba_date=kemba['Date']
fig=plt.figure()
plt.plot(kemba_date,kemba_fp,color='#FF5733',linewidth=1,marker='.',label='Walker')
plt.plot(ingram_date,ingram_fp,color='#33A7FF',marker='.',label='Ingram')
fig.autofmt_xdate()
plt.show()
When I do this, the link for Ingram is all over the place. Any idea on what went wrong?
This is the plot I get
It looks like Date might not be formatted as a date.
Modify your code as follows:
import pandas as pd
# creating x & y for Ingram
ingram_fp=ingram['FP']
ingram_date=pd.to_datetime(ingram['Date'])
# creating x and y for Kemmba
kemba_fp=kemba['FP']
kemba_date=pd.to_datetime(kemba['Date'])
df
Out[1]:
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
0 978.0 345 17.0 16.5 97 12.22 0 0 292.0 326.8 294.1
1 977.0 354 17.8 16.7 93 12.39 1 0 292.9 328.3 295.1
2 970.0 416 23.4 15.4 61 11.47 4 2 299.1 332.9 301.2
3 963.0 479 24.0 14.0 54 10.54 8 3 300.4 331.6 302.3
4 948.7 610 23.0 13.4 55 10.28 15 6 300.7 331.2 302.5
5 925.0 830 21.4 12.4 56 9.87 20 5 301.2 330.6 303.0
6 916.0 914 20.7 11.7 56 9.51 20 4 301.3 329.7 303.0
7 884.0 1219 18.2 9.2 56 8.31 60 4 301.8 326.7 303.3
8 853.1 1524 15.7 6.7 55 7.24 35 3 302.2 324.1 303.5
9 850.0 1555 15.4 6.4 55 7.14 20 2 302.3 323.9 303.6
10 822.8 1829 13.3 5.6 60 6.98 300 4 302.9 324.0 304.1
How do I interpolate the values of all the columns on specified PRES (pressure) values at say PRES=[950, 900, 875]? Is there an elegant pandas type of way to do this?
The only way I can think of doing this is to first start with making empty NaN values for the entire row for each specified PRES values in a loop, then set PRES as index and then use the pandas native interpolate option:
df.interpolate(method='index', inplace=True)
Is there a more elegant solution?
Use your solution with no loop - reindex by union original index values with PRES list, but working only if all values are unique:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = df.reindex(df.index.union(PRES)).sort_index(ascending=False).interpolate(method='index')
print (df)
HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
978.0 345.0 17.0 16.5 97.0 12.22 0.0 0.0 292.0 326.8 294.1
977.0 354.0 17.8 16.7 93.0 12.39 1.0 0.0 292.9 328.3 295.1
970.0 416.0 23.4 15.4 61.0 11.47 4.0 2.0 299.1 332.9 301.2
963.0 479.0 24.0 14.0 54.0 10.54 8.0 3.0 300.4 331.6 302.3
950.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
948.7 610.0 23.0 13.4 55.0 10.28 15.0 6.0 300.7 331.2 302.5
925.0 830.0 21.4 12.4 56.0 9.87 20.0 5.0 301.2 330.6 303.0
916.0 914.0 20.7 11.7 56.0 9.51 20.0 4.0 301.3 329.7 303.0
900.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
884.0 1219.0 18.2 9.2 56.0 8.31 60.0 4.0 301.8 326.7 303.3
875.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
853.1 1524.0 15.7 6.7 55.0 7.24 35.0 3.0 302.2 324.1 303.5
850.0 1555.0 15.4 6.4 55.0 7.14 20.0 2.0 302.3 323.9 303.6
822.8 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
If possible not unique values in PRES column, then use concat with sort_index:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = (pd.concat([df, pd.DataFrame(index=PRES)])
.sort_index(ascending=False)
.interpolate(method='index'))
I've had a really tough time figuring out how to reshape this DataFrame. Sorry about the wording of the question, this problem seems a bit specific.
I have data on several countries along with a column of 6 repeating features and the year this data was recorded. It looks something like this (minus some features and columns):
Country Feature 2005 2006 2007 2008 2009
0 Afghanistan Age Dependency 99.0 99.5 100.0 100.2 100.1
1 Afghanistan Birth Rate 44.9 43.9 42.8 41.6 40.3
2 Afghanistan Death Rate 10.7 10.4 10.1 9.8 9.5
3 Albania Age Dependency 53.5 52.2 50.9 49.7 48.7
4 Albania Birth Rate 12.3 11.9 11.6 11.5 11.6
5 Albania Death Rate 5.95 6.13 6.32 6.51 6.68
There doesn't seem to be any way to make pivot_table() work in this situation and I'm having trouble finding what other steps I can take to make it look how I want:
Age Dependency Birth Rate Death Rate
Afghanistan 2005 99.0 44.9 10.7
2006 99.5 43.9 10.4
2007 100.0 42.8 10.1
2008 100.2 41.6 9.8
2009 100.1 40.3 9.5
Albania 2005 53.5 12.3 5.95
2006 52.2 11.9 6.13
2007 50.9 11.6 6.32
2008 49.7 11.5 6.51
2009 48.7 11.6 6.68
Where the unique values of the 'Feature' column each become a column and the year columns each become part of a multiIndex with the country. Any help is appreciated, thank you!
EDIT: I checked the "duplicate" but I don't see how that question is the same as this one. How would I place the repeated values within my feature column as unique columns while at the same time moving the years to become a multi index with the countries? Sorry if I'm just not getting something.
Use melt with reshape by set_index and unstack:
df = (df.melt(['Country','Feature'], var_name='year')
.set_index(['Country','year','Feature'])['value']
.unstack())
print (df)
Feature Age Dependency Birth Rate Death Rate
Country year
Afghanistan 2005 99.0 44.9 10.70
2006 99.5 43.9 10.40
2007 100.0 42.8 10.10
2008 100.2 41.6 9.80
2009 100.1 40.3 9.50
Albania 2005 53.5 12.3 5.95
2006 52.2 11.9 6.13
2007 50.9 11.6 6.32
2008 49.7 11.5 6.51
2009 48.7 11.6 6.68
I have excel file with below data and I want to read data where First Column contains 'Area' & transpose, then again move & find where Column contains 'Area' & transpose
In this data total 3 table data given, I want to split it & then transpose. First Column contains Area code and other column name contains Year
Area 1980 1981 1982 1983
AU 33.7 38.8 40.2 42.5
BE 54.6 51.6 49.7 48.9
FI 43.2 49.6 58.8 71.1
Area 1979 1980 1981 1982
AU 29.8 33.7 38.8 40.2
BE 54.2 54.6 51.6 49.7
CA 39.4 44.3 50.6 48
Area 1978 1979 1980 1981
DK 58 57.2 54.5 53.2
FI 37.7 43.2 49.6 58.8
FR 41.6 49.9 55.4 58.5
Final Result expected:
Area variable value
AU 1980 33.7
other values
How to achieve this?
Assuming that we have the following list of DataFrame's:
In [106]: dfs
Out[106]:
[ Area 1980 1981 1982 1983
0 AU 33.7 38.8 40.2 42.5
1 BE 54.6 51.6 49.7 48.9
2 FI 43.2 49.6 58.8 71.1, Area 1979 1980 1981 1982
0 AU 29.8 33.7 38.8 40.2
1 BE 54.2 54.6 51.6 49.7
2 CA 39.4 44.3 50.6 48.0, Area 1978 1979 1980 1981
0 DK 58.0 57.2 54.5 53.2
1 FI 37.7 43.2 49.6 58.8
2 FR 41.6 49.9 55.4 58.5]
first we concatenate them horizontally:
In [107]: df = pd.concat([x.set_index('Area') for x in dfs], axis=1)
In [108]: df
Out[108]:
1980 1981 1982 1983 1979 1980 1981 1982 1978 1979 1980 1981
AU 33.7 38.8 40.2 42.5 29.8 33.7 38.8 40.2 NaN NaN NaN NaN
BE 54.6 51.6 49.7 48.9 54.2 54.6 51.6 49.7 NaN NaN NaN NaN
CA NaN NaN NaN NaN 39.4 44.3 50.6 48.0 NaN NaN NaN NaN
DK NaN NaN NaN NaN NaN NaN NaN NaN 58.0 57.2 54.5 53.2
FI 43.2 49.6 58.8 71.1 NaN NaN NaN NaN 37.7 43.2 49.6 58.8
FR NaN NaN NaN NaN NaN NaN NaN NaN 41.6 49.9 55.4 58.5
now we can stack DF and rename columns:
In [109]: df.stack().reset_index() \
.rename(columns={'level_0':'Area','level_1':'variable',0:'value'})
Out[109]:
Area variable value
0 AU 1980 33.7
1 AU 1981 38.8
2 AU 1982 40.2
3 AU 1983 42.5
4 AU 1979 29.8
5 AU 1980 33.7
6 AU 1981 38.8
7 AU 1982 40.2
8 BE 1980 54.6
9 BE 1981 51.6
10 BE 1982 49.7
11 BE 1983 48.9
12 BE 1979 54.2
13 BE 1980 54.6
14 BE 1981 51.6
15 BE 1982 49.7
16 CA 1979 39.4
17 CA 1980 44.3
18 CA 1981 50.6
19 CA 1982 48.0
20 DK 1978 58.0
21 DK 1979 57.2
22 DK 1980 54.5
23 DK 1981 53.2
24 FI 1980 43.2
25 FI 1981 49.6
26 FI 1982 58.8
27 FI 1983 71.1
28 FI 1978 37.7
29 FI 1979 43.2
30 FI 1980 49.6
31 FI 1981 58.8
32 FR 1978 41.6
33 FR 1979 49.9
34 FR 1980 55.4
35 FR 1981 58.5
what have you tried thus far?
Pandas is a really good library to use for data parsing etc.
you could implement something along the lines of...
import pandas as pd
df = pd.DataFrame.from_csv(csv_filename)
def create_new_table(df):
start = 0
end = 3
while (df.last_valid_index() != end):
#create a new dataframe with the relevant column
newdf.transpose()
start = end
end = end + 3