Reshaping Pandas DataFrame: switch columns to indices and repeated values as columns - python

I've had a really tough time figuring out how to reshape this DataFrame. Sorry about the wording of the question, this problem seems a bit specific.
I have data on several countries along with a column of 6 repeating features and the year this data was recorded. It looks something like this (minus some features and columns):
       Country         Feature  2005  2006   2007   2008   2009
0  Afghanistan  Age Dependency  99.0  99.5  100.0  100.2  100.1
1  Afghanistan      Birth Rate  44.9  43.9   42.8   41.6   40.3
2  Afghanistan      Death Rate  10.7  10.4   10.1    9.8    9.5
3      Albania  Age Dependency  53.5  52.2   50.9   49.7   48.7
4      Albania      Birth Rate  12.3  11.9   11.6   11.5   11.6
5      Albania      Death Rate  5.95  6.13   6.32   6.51   6.68
There doesn't seem to be any way to make pivot_table() work in this situation and I'm having trouble finding what other steps I can take to make it look how I want:
                  Age Dependency  Birth Rate  Death Rate
Afghanistan 2005            99.0        44.9        10.7
            2006            99.5        43.9        10.4
            2007           100.0        42.8        10.1
            2008           100.2        41.6         9.8
            2009           100.1        40.3         9.5
Albania     2005            53.5        12.3        5.95
            2006            52.2        11.9        6.13
            2007            50.9        11.6        6.32
            2008            49.7        11.5        6.51
            2009            48.7        11.6        6.68
Where the unique values of the 'Feature' column each become a column and the year columns each become part of a multiIndex with the country. Any help is appreciated, thank you!
EDIT: I checked the "duplicate" but I don't see how that question is the same as this one. How would I place the repeated values within my feature column as unique columns while at the same time moving the years to become a multi index with the countries? Sorry if I'm just not getting something.

Use melt to move the year columns into rows, then reshape with set_index and unstack so that each Feature value becomes a column:
df = (df.melt(['Country','Feature'], var_name='year')
        .set_index(['Country','year','Feature'])['value']
        .unstack())
print (df)
Feature           Age Dependency  Birth Rate  Death Rate
Country     year
Afghanistan 2005            99.0        44.9       10.70
            2006            99.5        43.9       10.40
            2007           100.0        42.8       10.10
            2008           100.2        41.6        9.80
            2009           100.1        40.3        9.50
Albania     2005            53.5        12.3        5.95
            2006            52.2        11.9        6.13
            2007            50.9        11.6        6.32
            2008            49.7        11.5        6.51
            2009            48.7        11.6        6.68
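Since the question mentions pivot_table(), here is a minimal sketch of the same reshape done with melt followed by pivot_table. The sample frame is a two-year extract of the question's data, and the names `long_df` and `result` are only illustrative. `aggfunc="first"` is an assumption that each (Country, year, Feature) combination occurs once.

```python
import pandas as pd

# Two-year extract of the data from the question
df = pd.DataFrame({
    "Country": ["Afghanistan"] * 3 + ["Albania"] * 3,
    "Feature": ["Age Dependency", "Birth Rate", "Death Rate"] * 2,
    "2005": [99.0, 44.9, 10.7, 53.5, 12.3, 5.95],
    "2006": [99.5, 43.9, 10.4, 52.2, 11.9, 6.13],
})

# Wide -> long: one row per (Country, Feature, year)
long_df = df.melt(id_vars=["Country", "Feature"], var_name="year")

# Long -> wide: Feature values become columns, (Country, year) becomes the index
result = long_df.pivot_table(index=["Country", "year"],
                             columns="Feature",
                             values="value",
                             aggfunc="first")
print(result)
```

The key point is that pivot_table needs the years in a single column first, which is what melt provides.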

Related

How to create a dataframe matrix from other data frames

I have 2 data frames, and I want to create a third data frame (one per country) from their data.
The data is below:
Indicator 1
country 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
1 Angola 200.0 193.0 185.0 176.0 167.0 157.0 148.0 138.0 129.0 120.0
2 Albania 24.5 23.1 21.8 20.4 19.2 17.9 16.7 15.5 14.4 13.3
195 Zambia 153.0 142.0 130.0 119.0 110.0 101.0 95.4 90.4 85.1 80.3
Indicator2
country 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
1 Angola 53.4 54.5 55.1 55.5 56.4 57.0 58.0 58.8 59.5 60.2
2 Albania 76.0 75.9 75.6 75.8 76.2 76.9 77.5 77.6 78.0 78.1
193 Zambia 45.2 45.9 46.6 47.7 48.7 50.0 51.9 54.1 55.7 56.5
I need to create a new data frame for each country like below
Angola
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
Indicator1 200.0 193.0 185.0 176.0 167.0 157.0 148.0 138.0 129.0 120.0
Indicator2 53.4 54.5 55.1 55.5 56.4 57.0 58.0 58.8 59.5 60.2
I need to know the code for creating this new data frame
What you asked can be done this way:
import pandas as pd

# Setting up DataFrames
indicator1 = pd.DataFrame({
    'country': ['Angola', 'Albania', 'Zambia'],
    '2001': ["200.0", "24.5", "153.0"],
    '2002': ["193.0", "23.1", "142.0"]
})
indicator2 = pd.DataFrame({
    'country': ['Angola', 'Albania', 'Zambia'],
    '2001': ["53.4", "76.0", "45.2"],
    '2002': ["54.5", "75.9", "45.9"]
})
# For each country
for index, row in indicator1.iterrows():
    # create a new variable with the country as name
    globals()[f"{row['country']}"] = {}
    # For each column of your 2 dataframes
    # (DataFrame.iteritems() was removed in pandas 2.0; items() replaces it)
    for key, value in indicator1.items():
        if key != 'country':
            # first entry comes from indicator1, second from the matching row of indicator2
            globals()[f"{row['country']}"][key] = [row[key], indicator2.iloc[indicator2[indicator2['country'] == row[
                'country']].index.values[0]][key]]
    globals()[f"{row['country']}"] = pd.DataFrame(globals()[f"{row['country']}"])
I've only done this with an extract of your data, but it can be generalised. I'm not sure that storing the newly created DataFrames as global variables is the best approach, but I had no better idea for naming them, so I'll leave that to you.
print(Angola)
# Output :
2001 2002
0 200.0 193.0
1 53.4 54.5
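As a possible alternative to creating global variables, the per-country frames can be kept in a dictionary. The sketch below is not the answer's method: it assumes numeric values rather than strings, and the names `combined` and `by_country` are illustrative.

```python
import pandas as pd

indicator1 = pd.DataFrame({
    "country": ["Angola", "Albania", "Zambia"],
    "2001": [200.0, 24.5, 153.0],
    "2002": [193.0, 23.1, 142.0],
})
indicator2 = pd.DataFrame({
    "country": ["Angola", "Albania", "Zambia"],
    "2001": [53.4, 76.0, 45.2],
    "2002": [54.5, 75.9, 45.9],
})

# Stack both frames vertically; the dict keys become an outer index level
combined = pd.concat(
    {"Indicator1": indicator1.set_index("country"),
     "Indicator2": indicator2.set_index("country")},
    names=["indicator", "country"],
)

# One DataFrame per country, rows labelled by indicator, columns by year
by_country = {
    name: combined.xs(name, level="country")
    for name in combined.index.get_level_values("country").unique()
}
print(by_country["Angola"])
```

A dictionary lookup such as `by_country["Angola"]` avoids name clashes with existing variables and works for country names that are not valid Python identifiers.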

Transforming yearwise data using pandas

I have a dataframe that looks like this:
Temp
Date
1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6
1981-01-05 15.8
... ...
1981-12-27 15.5
1981-12-28 13.3
1981-12-29 15.6
1981-12-30 15.2
1981-12-31 17.4
365 rows × 1 columns
And I want to transform it so that it looks like:
1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
0 20.7 17.0 18.4 19.5 13.3 12.9 12.3 15.3 14.3 14.8
1 17.9 15.0 15.0 17.1 15.2 13.8 13.8 14.3 17.4 13.3
2 18.8 13.5 10.9 17.1 13.1 10.6 15.3 13.5 18.5 15.6
3 14.6 15.2 11.4 12.0 12.7 12.6 15.6 15.0 16.8 14.5
4 15.8 13.0 14.8 11.0 14.6 13.7 16.2 13.6 11.5 14.3
... ... ... ... ... ... ... ... ... ... ...
360 15.5 15.3 13.9 12.2 11.5 14.6 16.2 9.5 13.3 14.0
361 13.3 16.3 11.1 12.0 10.8 14.2 14.2 12.9 11.7 13.6
362 15.6 15.8 16.1 12.6 12.0 13.2 14.3 12.9 10.4 13.5
363 15.2 17.7 20.4 16.0 16.3 11.7 13.3 14.8 14.4 15.7
364 17.4 16.3 18.0 16.4 14.4 17.2 16.7 14.1 12.7 13.0
My attempt:
groups=df.groupby(df.index.year)
keys=groups.groups.keys()
years=pd.DataFrame()
for key in keys:
    years[key]=groups.get_group(key)['Temp'].values
Question:
The above code gives me my desired output, but is there a more efficient way of transforming this?
I can't post the whole data because the dataframe has 3650 rows, so you can download the CSV file (60.6 kb) for testing from here
Try grabbing the year and dayofyear from the index then pivoting:
import pandas as pd
import numpy as np
# Create Random Data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1982-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape),
                  index=dr,
                  columns=['Temp'])
# Get Year and Day of Year
df['year'] = df.index.year
df['day'] = df.index.dayofyear
# Pivot
p = df.pivot(index='day', columns='year', values='Temp')
print(p)
p:
year 1981 1982
day
1 38 85
2 51 70
3 76 61
4 71 47
5 44 76
.. ... ...
361 23 22
362 42 64
363 84 22
364 26 56
365 67 73
Run-Time via Timeit
import timeit
setup = '''
import pandas as pd
import numpy as np
# Create Random Data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1983-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape),
                  index=dr,
                  columns=['Temp'])'''
pivot = '''
df['year'] = df.index.year
df['day'] = df.index.dayofyear
p = df.pivot(index='day', columns='year', values='Temp')'''
groupby_for = '''
groups=df.groupby(df.index.year)
keys=groups.groups.keys()
years=pd.DataFrame()
for key in keys:
    years[key]=groups.get_group(key)['Temp'].values'''
if __name__ == '__main__':
    print("Pivot")
    print(timeit.timeit(setup=setup, stmt=pivot, number=1000))
    print("Groupby For")
    print(timeit.timeit(setup=setup, stmt=groupby_for, number=1000))
Pivot
1.598973
Groupby For
2.3967995999999996
Additional note: the groupby/for option will not work when the data includes a leap year, because it cannot handle 1984 having 366 days instead of 365. Pivot works regardless.
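To illustrate the leap-year remark, here is a small sketch with synthetic data covering 1983 (365 days) and 1984 (366 days). The values are just a running counter, not real temperatures. The pivot produces 366 rows and leaves day 366 empty for the non-leap year.

```python
import numpy as np
import pandas as pd

# Synthetic data: one value per day across a normal year and a leap year
dr = pd.date_range("1983-01-01", "1984-12-31")
df = pd.DataFrame({"Temp": np.arange(len(dr), dtype=float)}, index=dr)

# Add year and day-of-year columns, then pivot days to rows and years to columns
p = (df.assign(year=df.index.year, day=df.index.dayofyear)
       .pivot(index="day", columns="year", values="Temp"))

print(p.shape)     # 366 rows, one per possible day of year
print(p.loc[366])  # NaN for 1983, a value for 1984
```

Note that with dayofyear, rows after 28 February line up by ordinal day rather than by calendar date when a leap year is involved.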

pandas grouping aggregation across multiple columns in a dataframe

I would like to derive the min and max for each year, region, and weather_type from a pandas dataframe. The dataframe looks like this:
year jan feb mar apr may jun aug sept oct nov dec region weathertype
1862 42.0 8.2 82.7 46.7 72.7 61.6 81.9 45.9 76.8 34.9 44.8 Anglia Rain
1863 58.3 15.7 24.0 17.5 27.9 75.2 38.5 71.5 71.7 77.5 32.0 Anglia Rain
1864 20.5 30.3 81.5 13.8 59.5 26.5 12.3 19.2 42.1 25.5 79.9 Anglia Rain
What is needed are the min and max for each region and year, effectively aggregating across the month columns of each row, with the result added to the existing dataframe as two new columns:
year min max
1862 42.0 81.9
1863 15.7 77.5
1864 12.3 81.5
My approach has been to use this code:
weather_data['max_value'] = weather_data.groupby(['year','region','weathertype'])['jan','feb','mar','apr','may','jun','jul','aug','sep','oct', 'nov','dec'].transform(np.min)
However, this produces a non-aggregated subset of the data, which duplicates the existing frame and results in the following error:
Wrong number of items passed 12, placement implies 1
I then melted the dataframe into a long, rather than wide format:
year region Option_1 variable value
1862 Anglia Rain jan 42.0
1863 Anglia Rain jan 58.3
1864 Anglia Rain jan 20.5
I used this code to try to produce what I needed:
weather_data['min_value'] = weather_data['value'].groupby(weather_data['region','Option_1']).transform(np.min)
but this produces a KeyError when the column names are passed as a single list, while a nested list,
[['region','Option_1']]
produces
Grouper for <class 'pandas.core.frame.DataFrame'> not 1-dimensional
Any suggestions at this point are gratefully received.
I would do:
(df.set_index(['year','region','weathertype'])
   .assign(min=lambda x: x.min(axis=1),
           max=lambda x: x.max(axis=1)
          )
   .reset_index())
Output:
year region weathertype jan feb mar apr may jun aug sept oct nov dec min max
-- ------ -------- ------------- ----- ----- ----- ----- ----- ----- ----- ------ ----- ----- ----- ----- -----
0 1862 Anglia Rain 42 8.2 82.7 46.7 72.7 61.6 81.9 45.9 76.8 34.9 44.8 8.2 82.7
1 1863 Anglia Rain 58.3 15.7 24 17.5 27.9 75.2 38.5 71.5 71.7 77.5 32 15.7 77.5
2 1864 Anglia Rain 20.5 30.3 81.5 13.8 59.5 26.5 12.3 19.2 42.1 25.5 79.9 12.3 81.5
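An equivalent way to get row-wise min and max, sketched here on a three-month extract of the data, is to select the month columns explicitly and reduce along axis=1. The name `month_cols` is illustrative, and the sketch assumes every column other than year, region and weathertype is a month.

```python
import pandas as pd

# Three-month extract of the weather data from the question
df = pd.DataFrame({
    "year": [1862, 1863],
    "jan": [42.0, 58.3],
    "feb": [8.2, 15.7],
    "mar": [82.7, 24.0],
    "region": ["Anglia", "Anglia"],
    "weathertype": ["Rain", "Rain"],
})

# Everything that is not an identifier column is treated as a month
month_cols = df.columns.difference(["year", "region", "weathertype"])

# axis=1 reduces across the columns of each row
df["min"] = df[month_cols].min(axis=1)
df["max"] = df[month_cols].max(axis=1)
print(df)
```

No groupby is needed here because each row already holds one year, region and weather type.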

Sorting values in a dataframe

I've been trying to sort the values in a dataframe that I've been given to work on.
The following is my dataframe.
1981 1.78
1982 1.74
1983 1.61
1984 1.62
1985 1.61
1986 1.43
1987 1.62
1988 1.96
1989 1.75
1990 1.83
1991 1.73
1992 1.72
1993 1.74
1994 1.71
1995 1.67
1996 1.66
1997 1.61
1998 1.48
1999 1.47
2000 1.6
2001 1.41
2002 1.37
2003 1.27
2004 1.26
2005 1.26
2006 1.28
2007 1.29
2008 1.28
2009 1.22
2010 1.15
2011 1.2
2012 1.29
2013 1.19
2014 1.25
2015 1.24
2016 1.2
2017 1.16
2018 1.14
I've been trying to sort my dataframe in descending order such that the highest values on the right would appear first. However whenever I try to sort it, it would only sort based on the year which are the values on the left.
dataframe.sort_values('1')
I've tried using sort_values and indicating '1' as the column that I want sorted. This however returns ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>
From the error that the OP mentioned, the data structure is a Series, so the sort function should be called directly, without a column name:
s = s.sort_values(ascending=False)
The error was raised because in pandas.Series.sort_values the first parameter is axis, so '1' was interpreted as an axis name.
For a DataFrame, the argument of sort_values() should be a column name:
df=df.sort_values("col2")
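A short sketch contrasting the two cases, using three values from the question. The column names `year` and `rate` are made up for illustration, since the original data has no visible column names.

```python
import pandas as pd

# A Series: values indexed by year, no column name to sort by
s = pd.Series([1.78, 1.74, 1.96], index=[1981, 1982, 1988])
s_sorted = s.sort_values(ascending=False)

# The same data as a DataFrame: here a column name is required
df = s.rename("rate").rename_axis("year").reset_index()
df_sorted = df.sort_values("rate", ascending=False)

print(s_sorted)
print(df_sorted)
```

Checking `type(obj)` first is a quick way to tell which of the two calls applies.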

Read excel row by row and do transpose, Python 3.6

I have an Excel file with the data below. I want to read the rows starting where the first column contains 'Area' and transpose them, then move on to the next row whose first column contains 'Area' and transpose again.
The data contains 3 tables in total; I want to split them and then transpose each one. The first column contains the area code and the other column names are years.
Area 1980 1981 1982 1983
AU 33.7 38.8 40.2 42.5
BE 54.6 51.6 49.7 48.9
FI 43.2 49.6 58.8 71.1
Area 1979 1980 1981 1982
AU 29.8 33.7 38.8 40.2
BE 54.2 54.6 51.6 49.7
CA 39.4 44.3 50.6 48
Area 1978 1979 1980 1981
DK 58 57.2 54.5 53.2
FI 37.7 43.2 49.6 58.8
FR 41.6 49.9 55.4 58.5
Final Result expected:
Area variable value
AU 1980 33.7
other values
How to achieve this?
Assuming that we have the following list of DataFrames:
In [106]: dfs
Out[106]:
[ Area 1980 1981 1982 1983
0 AU 33.7 38.8 40.2 42.5
1 BE 54.6 51.6 49.7 48.9
2 FI 43.2 49.6 58.8 71.1, Area 1979 1980 1981 1982
0 AU 29.8 33.7 38.8 40.2
1 BE 54.2 54.6 51.6 49.7
2 CA 39.4 44.3 50.6 48.0, Area 1978 1979 1980 1981
0 DK 58.0 57.2 54.5 53.2
1 FI 37.7 43.2 49.6 58.8
2 FR 41.6 49.9 55.4 58.5]
First we concatenate them horizontally:
In [107]: df = pd.concat([x.set_index('Area') for x in dfs], axis=1)
In [108]: df
Out[108]:
1980 1981 1982 1983 1979 1980 1981 1982 1978 1979 1980 1981
AU 33.7 38.8 40.2 42.5 29.8 33.7 38.8 40.2 NaN NaN NaN NaN
BE 54.6 51.6 49.7 48.9 54.2 54.6 51.6 49.7 NaN NaN NaN NaN
CA NaN NaN NaN NaN 39.4 44.3 50.6 48.0 NaN NaN NaN NaN
DK NaN NaN NaN NaN NaN NaN NaN NaN 58.0 57.2 54.5 53.2
FI 43.2 49.6 58.8 71.1 NaN NaN NaN NaN 37.7 43.2 49.6 58.8
FR NaN NaN NaN NaN NaN NaN NaN NaN 41.6 49.9 55.4 58.5
Now we can stack the DataFrame and rename the columns:
In [109]: df.stack().reset_index() \
.rename(columns={'level_0':'Area','level_1':'variable',0:'value'})
Out[109]:
Area variable value
0 AU 1980 33.7
1 AU 1981 38.8
2 AU 1982 40.2
3 AU 1983 42.5
4 AU 1979 29.8
5 AU 1980 33.7
6 AU 1981 38.8
7 AU 1982 40.2
8 BE 1980 54.6
9 BE 1981 51.6
10 BE 1982 49.7
11 BE 1983 48.9
12 BE 1979 54.2
13 BE 1980 54.6
14 BE 1981 51.6
15 BE 1982 49.7
16 CA 1979 39.4
17 CA 1980 44.3
18 CA 1981 50.6
19 CA 1982 48.0
20 DK 1978 58.0
21 DK 1979 57.2
22 DK 1980 54.5
23 DK 1981 53.2
24 FI 1980 43.2
25 FI 1981 49.6
26 FI 1982 58.8
27 FI 1983 71.1
28 FI 1978 37.7
29 FI 1979 43.2
30 FI 1980 49.6
31 FI 1981 58.8
32 FR 1978 41.6
33 FR 1979 49.9
34 FR 1980 55.4
35 FR 1981 58.5
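The answer above starts from a ready-made list of DataFrames. One way to get from the raw sheet to the long format, sketched here under the assumption that the sheet was read without a header (so the repeated 'Area' rows are ordinary data rows), is to number the tables with a cumulative sum and melt each block. The `raw` frame is a shortened stand-in for the real file and all names are illustrative.

```python
import pandas as pd

# Stand-in for the sheet read with header=None: two stacked tables
raw = pd.DataFrame([
    ["Area", 1980, 1981],
    ["AU", 33.7, 38.8],
    ["BE", 54.6, 51.6],
    ["Area", 1979, 1980],
    ["AU", 29.8, 33.7],
    ["CA", 39.4, 44.3],
])

# Each 'Area' row starts a new table, so a cumulative sum numbers the tables
table_id = (raw[0] == "Area").cumsum()

pieces = []
for _, block in raw.groupby(table_id):
    body = block.iloc[1:].copy()
    body.columns = block.iloc[0].tolist()     # first row of the block is the header
    pieces.append(body.melt(id_vars="Area"))  # columns: Area, variable, value

long_df = pd.concat(pieces, ignore_index=True)
print(long_df)
```

Melting each block separately sidesteps the duplicate year columns that appear when the tables are concatenated side by side.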
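What have you tried thus far?
Pandas is a really good library to use for data parsing etc.
You could implement something along the lines of...
import pandas as pd

# DataFrame.from_csv was removed in pandas 1.0; read_csv replaces it.
# header=None keeps the repeated 'Area' rows as ordinary data rows.
df = pd.read_csv(csv_filename, header=None)

def create_new_tables(df, size=4):
    # assumes every table is one 'Area' header row plus three data rows
    tables = []
    for start in range(0, len(df), size):
        block = df.iloc[start:start + size]
        # create a new dataframe with the relevant rows, headed by the 'Area' row
        newdf = block.iloc[1:].copy()
        newdf.columns = block.iloc[0]
        tables.append(newdf.transpose())
    return tables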
