Read Excel row by row and transpose, Python 3.6 - python

I have an Excel file with the data below. I want to find each row where the first column contains 'Area', read the table that follows, and transpose it; then move on to the next row whose first column contains 'Area' and transpose again.
The data contains three tables in total, which I want to split and then transpose. The first column contains the area code and the other column names contain years.
Area 1980 1981 1982 1983
AU 33.7 38.8 40.2 42.5
BE 54.6 51.6 49.7 48.9
FI 43.2 49.6 58.8 71.1
Area 1979 1980 1981 1982
AU 29.8 33.7 38.8 40.2
BE 54.2 54.6 51.6 49.7
CA 39.4 44.3 50.6 48
Area 1978 1979 1980 1981
DK 58 57.2 54.5 53.2
FI 37.7 43.2 49.6 58.8
FR 41.6 49.9 55.4 58.5
Final expected result:
Area variable value
AU 1980 33.7
other values
How to achieve this?

Assuming that we have the following list of DataFrames:
In [106]: dfs
Out[106]:
[ Area 1980 1981 1982 1983
0 AU 33.7 38.8 40.2 42.5
1 BE 54.6 51.6 49.7 48.9
2 FI 43.2 49.6 58.8 71.1, Area 1979 1980 1981 1982
0 AU 29.8 33.7 38.8 40.2
1 BE 54.2 54.6 51.6 49.7
2 CA 39.4 44.3 50.6 48.0, Area 1978 1979 1980 1981
0 DK 58.0 57.2 54.5 53.2
1 FI 37.7 43.2 49.6 58.8
2 FR 41.6 49.9 55.4 58.5]
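(If you still need to build dfs from the sheet itself, here is one way, offered only as a sketch: the file name 'data.xlsx' and the assumption that every table's header row starts with 'Area' are mine, not from the question.)
import pandas as pd

# Read the whole sheet without treating any row as a header
raw = pd.read_excel('data.xlsx', header=None)

# Rows whose first cell is 'Area' mark the start of a new table
header_rows = raw.index[raw[0] == 'Area'].tolist()

dfs = []
for i, start in enumerate(header_rows):
    end = header_rows[i + 1] if i + 1 < len(header_rows) else len(raw)
    block = raw.iloc[start:end]
    table = block.iloc[1:].copy()
    table.columns = block.iloc[0]      # first row of the block becomes the column names
    dfs.append(table.reset_index(drop=True))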
First, we concatenate them horizontally:
In [107]: df = pd.concat([x.set_index('Area') for x in dfs], axis=1)
In [108]: df
Out[108]:
1980 1981 1982 1983 1979 1980 1981 1982 1978 1979 1980 1981
AU 33.7 38.8 40.2 42.5 29.8 33.7 38.8 40.2 NaN NaN NaN NaN
BE 54.6 51.6 49.7 48.9 54.2 54.6 51.6 49.7 NaN NaN NaN NaN
CA NaN NaN NaN NaN 39.4 44.3 50.6 48.0 NaN NaN NaN NaN
DK NaN NaN NaN NaN NaN NaN NaN NaN 58.0 57.2 54.5 53.2
FI 43.2 49.6 58.8 71.1 NaN NaN NaN NaN 37.7 43.2 49.6 58.8
FR NaN NaN NaN NaN NaN NaN NaN NaN 41.6 49.9 55.4 58.5
Now we can stack the DataFrame and rename the columns:
In [109]: df.stack().reset_index() \
             .rename(columns={'level_0':'Area','level_1':'variable',0:'value'})
Out[109]:
Area variable value
0 AU 1980 33.7
1 AU 1981 38.8
2 AU 1982 40.2
3 AU 1983 42.5
4 AU 1979 29.8
5 AU 1980 33.7
6 AU 1981 38.8
7 AU 1982 40.2
8 BE 1980 54.6
9 BE 1981 51.6
10 BE 1982 49.7
11 BE 1983 48.9
12 BE 1979 54.2
13 BE 1980 54.6
14 BE 1981 51.6
15 BE 1982 49.7
16 CA 1979 39.4
17 CA 1980 44.3
18 CA 1981 50.6
19 CA 1982 48.0
20 DK 1978 58.0
21 DK 1979 57.2
22 DK 1980 54.5
23 DK 1981 53.2
24 FI 1980 43.2
25 FI 1981 49.6
26 FI 1982 58.8
27 FI 1983 71.1
28 FI 1978 37.7
29 FI 1979 43.2
30 FI 1980 49.6
31 FI 1981 58.8
32 FR 1978 41.6
33 FR 1979 49.9
34 FR 1980 55.4
35 FR 1981 58.5
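An equivalent route, sketched here as an aside, is to melt each table into long form and concatenate the results; the row order differs from the stacked output above, but the columns match the requested Area/variable/value layout:
result = pd.concat(
    [x.melt(id_vars='Area', var_name='variable', value_name='value') for x in dfs],
    ignore_index=True)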

What have you tried thus far?
Pandas is a really good library for data parsing and similar tasks.
You could implement something along the lines of:
import pandas as pd

# pd.DataFrame.from_csv is deprecated; read the file without a header row
df = pd.read_csv(csv_filename, header=None)

def create_new_tables(df, rows_per_table=4):  # one header row plus three data rows per block
    tables = []
    for start in range(0, len(df), rows_per_table):
        # slice out one table's rows and transpose it
        chunk = df.iloc[start:start + rows_per_table]
        tables.append(chunk.transpose())
    return tables

Related

Plotting a stacked horizontal barplot

I have this dataframe called "df_pressure":
Ranking Squad Press Succ Succ% Fail Fail%
11 1 Manchester City 4254 1381 32.5 2873 67.5
10 2 Liverpool 5360 1731 32.3 3629 67.7
5 3 Chelsea 5533 1702 30.8 3831 69.2
16 4 Tottenham 5477 1523 27.8 3954 72.2
0 5 Arsenal 4772 1440 30.2 3332 69.8
12 6 Manchester Utd 5069 1462 28.8 3607 71.2
18 7 West Ham 4917 1372 27.9 3545 72.1
9 8 Leicester City 5982 1719 28.7 4263 71.3
3 9 Brighton 5670 1832 32.3 3838 67.7
19 10 Wolves 5529 1633 29.5 3896 70.5
13 11 Newcastle Utd 5430 1460 26.9 3970 73.1
6 12 Crystal Palace 6041 1809 29.9 4232 70.1
2 13 Brentford 5566 1609 28.9 3957 71.1
1 14 Aston Villa 5515 1524 27.6 3991 72.4
15 15 Southampton 5869 1806 30.8 4063 69.2
7 16 Everton 6346 1892 29.8 4454 70.2
8 17 Leeds United 7078 2118 29.9 4960 70.1
4 18 Burnley 5527 1499 27.1 4028 72.9
17 19 Watford 5730 1656 28.9 4074 71.1
14 20 Norwich City 6146 1570 25.5 4576 74.5
I then decided to create another dataframe for some columns only:
df_pressure_perc=df_pressure[['Squad','Succ%','Fail%']]
df_pressure_perc.reset_index(drop=True, inplace=True)
df_pressure_perc.set_index('Squad')
print(df_pressure_perc)
Output:
Squad Succ% Fail%
0 Manchester City 32.5 67.5
1 Liverpool 32.3 67.7
2 Chelsea 30.8 69.2
3 Tottenham 27.8 72.2
4 Arsenal 30.2 69.8
5 Manchester Utd 28.8 71.2
6 West Ham 27.9 72.1
7 Leicester City 28.7 71.3
8 Brighton 32.3 67.7
9 Wolves 29.5 70.5
10 Newcastle Utd 26.9 73.1
11 Crystal Palace 29.9 70.1
12 Brentford 28.9 71.1
13 Aston Villa 27.6 72.4
14 Southampton 30.8 69.2
15 Everton 29.8 70.2
16 Leeds United 29.9 70.1
17 Burnley 27.1 72.9
18 Watford 28.9 71.1
19 Norwich City 25.5 74.5
Based on this new dataframe "df_pressure_perc", I decided to create a stacked barplot with the following code: df_pressure_perc.plot(kind='barh', stacked=True, ylabel='Squad', colormap='tab10', figsize=(10, 6))
I realised the Y axis of my visualization was not labelled with the Squad names. I would like some advice on how I can get the Y axis to show the Squad names instead of 0-19.
[Visualization: stacked barplot]
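A sketch of one way to get the squad names onto the Y axis, assuming the df_pressure_perc frame shown above: set_index('Squad') returns a new DataFrame rather than modifying the frame in place, so its result has to be kept before plotting (barh uses the index for the y-axis tick labels).
# keep the result of set_index so 'Squad' becomes the index used for the y-axis labels
df_pressure_perc = df_pressure_perc.set_index('Squad')
df_pressure_perc.plot(kind='barh', stacked=True, colormap='tab10', figsize=(10, 6))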

How to create a dataframe matrix from other data frames

I have 2 data frames, from which I want to create a third data frame per country, built from the data in the 2 data frames.
Below the data:
Indicator 1
country 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
1 Angola 200.0 193.0 185.0 176.0 167.0 157.0 148.0 138.0 129.0 120.0
2 Albania 24.5 23.1 21.8 20.4 19.2 17.9 16.7 15.5 14.4 13.3
195 Zambia 153.0 142.0 130.0 119.0 110.0 101.0 95.4 90.4 85.1 80.3
Indicator2
country 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
1 Angola 53.4 54.5 55.1 55.5 56.4 57.0 58.0 58.8 59.5 60.2
2 Albania 76.0 75.9 75.6 75.8 76.2 76.9 77.5 77.6 78.0 78.1
193 Zambia 45.2 45.9 46.6 47.7 48.7 50.0 51.9 54.1 55.7 56.5
I need to create a new data frame for each country, like the one below:
Angola
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
Indicator1 200.0 193.0 185.0 176.0 167.0 157.0 148.0 138.0 129.0 120.0
Indicator2 53.4 54.5 55.1 55.5 56.4 57.0 58.0 58.8 59.5 60.2
I need to know the code for creating this new data frame.
What you asked can be done this way:
import pandas as pd

# Setting up DataFrames
indicator1 = pd.DataFrame({
    'country': ['Angola', 'Albania', 'Zambia'],
    '2001': ["200.0", "24.5", "153.0"],
    '2002': ["193.0", "23.1", "142.0"]
})
indicator2 = pd.DataFrame({
    'country': ['Angola', 'Albania', 'Zambia'],
    '2001': ["53.4", "76.0", "45.2"],
    '2002': ["54.5", "75.9", "45.9"]
})

# For each country
for index, row in indicator1.iterrows():
    # create a new variable with the country as its name
    globals()[f"{row['country']}"] = {}
    # For each column of the 2 dataframes (items() replaces the deprecated iteritems())
    for key, value in indicator1.items():
        if key != 'country':
            globals()[f"{row['country']}"][key] = [
                row[key],
                indicator2.iloc[indicator2[indicator2['country'] == row['country']].index.values[0]][key],
            ]
    # turn the collected dict into a DataFrame named after the country
    globals()[f"{row['country']}"] = pd.DataFrame(globals()[f"{row['country']}"])
I've only done this with an extract of your data, but it can be generalised. I'm not sure that saving the newly created DataFrames as global variables like this is the best approach, but I had no better idea for naming them, so I'll leave that decision to you.
print(Angola)
# Output :
2001 2002
0 200.0 193.0
1 53.4 54.5
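A sketch of an alternative that avoids globals(), using the indicator1 and indicator2 frames from the snippet above: stack the two indicators into one frame with a labelled level, then slice out one small DataFrame per country into a plain dict.
# indicator1 / indicator2 as defined in the snippet above
combined = pd.concat(
    {'Indicator1': indicator1.set_index('country'),
     'Indicator2': indicator2.set_index('country')},
    names=['indicator', 'country'])

country_tables = {
    country: combined.xs(country, level='country')
    for country in indicator1['country']}

print(country_tables['Angola'])   # rows: Indicator1 / Indicator2, columns: 2001 / 2002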

pandas grouping aggregation across multiple columns in a dataframe

I would like to derive the min and max for each year, region, and weather_type from a pandas dataframe. The dataframe looks like this:
year jan feb mar apr may jun aug sept oct nov dec region weathertype
1862 42.0 8.2 82.7 46.7 72.7 61.6 81.9 45.9 76.8 34.9 44.8 Anglia Rain
1863 58.3 15.7 24.0 17.5 27.9 75.2 38.5 71.5 71.7 77.5 32.0 Anglia Rain
1864 20.5 30.3 81.5 13.8 59.5 26.5 12.3 19.2 42.1 25.5 79.9 Anglia Rain
What is needed are two new columns giving the min and max for each year, region, and weather type, in effect aggregating across the month columns of each row, with the result added to the existing dataframe:
year min max
1862 42.0 81.9
1863 15.7 77.5
1864 12.3 81.5
My approach has been to use this code:
weather_data['max_value'] = weather_data.groupby(['year','region','weathertype'])['jan','feb','mar','apr','may','jun','jul','aug','sep','oct', 'nov','dec'].transform(np.min)
However, this produces a non-aggregated subset of the data that duplicates the existing frame, resulting in the following error:
Wrong number of items passed 12, placement implies 1
I then melted the dataframe into a long, rather than wide format:
year region Option_1 variable value
1862 Anglia Rain jan 42.0
1863 Anglia Rain jan 58.3
1864 Anglia Rain jan 20.5
I then used this code to try to produce what I needed:
weather_data['min_value'] = weather_data['value'].groupby(weather_data['region','Option_1']).transform(np.min)
but this produces a key error when a single list is used, while a double-bracketed list
[['region','Option_1']]
produces
Grouper for <class 'pandas.core.frame.DataFrame'> not 1-dimensional
Any suggestions at this point are gratefully received.
I would do:
(df.set_index(['year','region','weathertype'])
   .assign(min=lambda x: x.min(axis=1),
           max=lambda x: x.max(axis=1))
   .reset_index())
Output:
year region weathertype jan feb mar apr may jun aug sept oct nov dec min max
-- ------ -------- ------------- ----- ----- ----- ----- ----- ----- ----- ------ ----- ----- ----- ----- -----
0 1862 Anglia Rain 42 8.2 82.7 46.7 72.7 61.6 81.9 45.9 76.8 34.9 44.8 8.2 82.7
1 1863 Anglia Rain 58.3 15.7 24 17.5 27.9 75.2 38.5 71.5 71.7 77.5 32 15.7 77.5
2 1864 Anglia Rain 20.5 30.3 81.5 13.8 59.5 26.5 12.3 19.2 42.1 25.5 79.9 12.3 81.5
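If you prefer the melted long format the question ended up with, here is a sketch of the transform approach (the frame name weather_long is mine; its columns year, region, Option_1, variable, and value follow the melted example above): pass a list of column names to groupby rather than indexing the frame with a tuple.
# weather_long: the melted frame with columns year, region, Option_1, variable, value
weather_long['min_value'] = (weather_long
                             .groupby(['year', 'region', 'Option_1'])['value']
                             .transform('min'))
weather_long['max_value'] = (weather_long
                             .groupby(['year', 'region', 'Option_1'])['value']
                             .transform('max'))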

Not able to read txt file without comma separator in pandas python

CODE
import pandas
df = pandas.read_csv('biharpopulation.txt', delim_whitespace=True)
df.columns = ['SlNo','District','Total','Male','Female','Total','Male','Female','SC','ST','SC','ST']
DATA
SlNo District Total Male Female Total Male Female SC ST SC ST
1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.2 38.6 68.7
2 Nalanda 473786 248246 225540 970 524 446 20.2 0.0 29.4 29.8
3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.4 39.1 46.7
4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.6 37.9 44.6
5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.0 41.3 30.0
6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.8 40.5 38.6
7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.1 26.3 49.1
8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
9 Arawal 11479 57677 53802 294 179 115 18.8 0.04
10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.1 22.4 20.5
11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.1 35.7 49.7
Saran
12 Saran 389933 199772 190161 6667 3384 3283 12 0.2 33.6 48.5
13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.5 35.6 44.0
14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.3 32.1 37.8
15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.1 28.9 50.4
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.1 22.1 31.4
19 Sheohar 74391 39405 34986 64 35 29 14.4 0.0 16.9 38.8
20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.1 29.4 29.9
21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.0 24.7 49.5
22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.0 22.2 35.8
23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.1 25.1 22.0
24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.6 42.6 37.3
25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.1 31.4 78.6
26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.0 25.2 45.6
27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.7 26.8 12.9
28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.8 24.5 26.7
The issue is with these 2 lines:
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
If you can somehow remove the space between "E. Champaran" and "W. Champaran" (a preprocessing sketch follows the output below), then you can do this:
df = pd.read_csv('test.csv', sep=r'\s+', skip_blank_lines=True, skipinitialspace=True)
print(df)
SlNo District Total Male Female Total.1 Male.1 Female.1 SC ST SC.1 ST.1
0 1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.20 38.6 68.7
1 2 Nalanda 473786 248246 225540 970 524 446 20.2 0.00 29.4 29.8
2 3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.40 39.1 46.7
3 4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.60 37.9 44.6
4 5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.00 41.3 30.0
5 6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.80 40.5 38.6
6 7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.10 26.3 49.1
7 8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
8 9 Arawal 11479 57677 53802 294 179 115 18.8 0.04 NaN NaN
9 10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.10 22.4 20.5
10 11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.10 35.7 49.7
11 12 Saran 389933 199772 190161 6667 3384 3283 12.0 0.20 33.6 48.5
12 13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.50 35.6 44.0
13 14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.30 32.1 37.8
14 15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.10 28.9 50.4
15 16 E.Champaran 514119 270968 243151 4812 2518 2294 13.0 0.10 20.6 34.3
16 17 W.Champaran 434714 228057 206657 44912 23135 21777 14.3 1.50 22.3 24.1
17 18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.10 22.1 31.4
18 19 Sheohar 74391 39405 34986 64 35 29 14.4 0.00 16.9 38.8
19 20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.10 29.4 29.9
20 21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.00 24.7 49.5
21 22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.00 22.2 35.8
22 23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.10 25.1 22.0
23 24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.60 42.6 37.3
24 25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.10 31.4 78.6
25 26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.00 25.2 45.6
26 27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.70 26.8 12.9
27 28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.80 24.5 26.7
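One way to do that removal programmatically, offered as a sketch (the regular expression and the in-memory buffer are my additions, not part of the original answer):
import io
import re
import pandas as pd

with open('biharpopulation.txt') as f:
    text = f.read()

# Turn "E. Champaran" / "W. Champaran" into single whitespace-free tokens
text = re.sub(r'\b([EW])\.\s+Champaran\b', r'\1.Champaran', text)

df = pd.read_csv(io.StringIO(text), sep=r'\s+',
                 skip_blank_lines=True, skipinitialspace=True)
print(df)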
Your problem is that the CSV is whitespace-delimited, but some of your district names also have whitespace in them. Luckily, none of the district names contain '\t' characters, so we can fix this:
df = pandas.read_csv('biharpopulation.txt', delimiter='\t')

Reshaping Pandas DataFrame: switch columns to indices and repeated values as columns

I've had a really tough time figuring out how to reshape this DataFrame. Sorry about the wording of the question, this problem seems a bit specific.
I have data on several countries along with a column of 6 repeating features and the year this data was recorded. It looks something like this (minus some features and columns):
Country Feature 2005 2006 2007 2008 2009
0 Afghanistan Age Dependency 99.0 99.5 100.0 100.2 100.1
1 Afghanistan Birth Rate 44.9 43.9 42.8 41.6 40.3
2 Afghanistan Death Rate 10.7 10.4 10.1 9.8 9.5
3 Albania Age Dependency 53.5 52.2 50.9 49.7 48.7
4 Albania Birth Rate 12.3 11.9 11.6 11.5 11.6
5 Albania Death Rate 5.95 6.13 6.32 6.51 6.68
There doesn't seem to be any way to make pivot_table() work in this situation and I'm having trouble finding what other steps I can take to make it look how I want:
Age Dependency Birth Rate Death Rate
Afghanistan 2005 99.0 44.9 10.7
2006 99.5 43.9 10.4
2007 100.0 42.8 10.1
2008 100.2 41.6 9.8
2009 100.1 40.3 9.5
Albania 2005 53.5 12.3 5.95
2006 52.2 11.9 6.13
2007 50.9 11.6 6.32
2008 49.7 11.5 6.51
2009 48.7 11.6 6.68
Where the unique values of the 'Feature' column each become a column and the year columns each become part of a multiIndex with the country. Any help is appreciated, thank you!
EDIT: I checked the "duplicate" but I don't see how that question is the same as this one. How would I place the repeated values within my feature column as unique columns while at the same time moving the years to become a multi index with the countries? Sorry if I'm just not getting something.
Use melt, then reshape with set_index and unstack:
df = (df.melt(['Country','Feature'], var_name='year')
        .set_index(['Country','year','Feature'])['value']
        .unstack())
print(df)
Feature Age Dependency Birth Rate Death Rate
Country year
Afghanistan 2005 99.0 44.9 10.70
2006 99.5 43.9 10.40
2007 100.0 42.8 10.10
2008 100.2 41.6 9.80
2009 100.1 40.3 9.50
Albania 2005 53.5 12.3 5.95
2006 52.2 11.9 6.13
2007 50.9 11.6 6.32
2008 49.7 11.5 6.51
2009 48.7 11.6 6.68
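Since the question mentions pivot_table, here is an equivalent route, sketched with the same df: melt to long form first, after which pivot_table can build the same Country/year MultiIndex with one column per Feature.
out = (df.melt(['Country', 'Feature'], var_name='year')
         .pivot_table(index=['Country', 'year'], columns='Feature', values='value'))
print(out)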
