Sorting values in a dataframe - python

I've been trying to sort the values in a dataframe that I've been given to work on.
The following is my dataframe.
1981 1.78
1982 1.74
1983 1.61
1984 1.62
1985 1.61
1986 1.43
1987 1.62
1988 1.96
1989 1.75
1990 1.83
1991 1.73
1992 1.72
1993 1.74
1994 1.71
1995 1.67
1996 1.66
1997 1.61
1998 1.48
1999 1.47
2000 1.6
2001 1.41
2002 1.37
2003 1.27
2004 1.26
2005 1.26
2006 1.28
2007 1.29
2008 1.28
2009 1.22
2010 1.15
2011 1.2
2012 1.29
2013 1.19
2014 1.25
2015 1.24
2016 1.2
2017 1.16
2018 1.14
I've been trying to sort my dataframe in descending order so that the highest values on the right appear first. However, whenever I try to sort it, it only sorts by the year, which is the column of values on the left.
dataframe.sort_values('1')
I've tried using sort_values and indicating '1' as the column that I want sorted. This, however, returns ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>

Judging from the error the OP mentioned, the data structure is a Series, not a DataFrame, so the sort function should be called directly without a column name:
s = s.sort_values(ascending=False)
The error was raised because in pandas.Series.sort_values the first argument is axis, so the '1' that was passed in was read as an axis rather than as a column name.
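A minimal sketch of the Series case, using a few of the values from the question. The variable name `s` and the choice of the year as the index are assumptions for illustration, not something the question states.

```python
import pandas as pd

# A few rows from the question, rebuilt as a Series indexed by year (assumed layout).
s = pd.Series([1.78, 1.74, 1.61, 1.96], index=[1981, 1982, 1983, 1988])

# A Series holds a single column of values, so no column name is passed.
s_sorted = s.sort_values(ascending=False)
print(s_sorted)
```

The years stay attached to their values because they are the index, which is why sorting by value still shows the matching year on the left.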

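If the object is a DataFrame, the argument of sort_values() should be the name of the column to sort by, with ascending=False for descending order:
df = df.sort_values("col2", ascending=False)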
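For comparison, here is a rough sketch of the DataFrame case. The column names `year` and `rate` are made up for the example, and the second half assumes the data started out as a Series that is converted with `to_frame()`.

```python
import pandas as pd

# Hypothetical two-column frame; "year" and "rate" are made-up column names.
df = pd.DataFrame({"year": [1981, 1982, 1988], "rate": [1.78, 1.74, 1.96]})

# Sort rows by the value column, largest first.
df_sorted = df.sort_values("rate", ascending=False)

# If the data arrives as a Series, to_frame() gives it a named column first.
s = pd.Series([1.78, 1.74, 1.96], index=[1981, 1982, 1988], name="rate")
from_series = s.to_frame().sort_values("rate", ascending=False)
print(df_sorted)
print(from_series)
```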

Related

Is there a way to group plots based on matching row values?

I have a data frame like shown below.
Country Type 2011 2012 2013
Afghanistan Estimate -1.63 -1.57 -1.41
Afghanistan Sources 5 8 7
Afghanistan Percentile 0.95 0.94 2.36
.
.
.
Zambia Estimate 1.63 1.57 1.41
Zambia Sources 7 10 8
Zambia Percentile 0.88 0.77 1.54
I am hoping to generate plots (preferably line graphs) for each country (Type will be used as legend). Is there a way to group plots for each country? I am relatively new and don't know where to begin.
I'm afraid you can't get away without at least some transformations.
If it's OK to use Seaborn for plotting, it could look something like this:
import pandas as pd
import seaborn as sns
from io import StringIO
df = pd.read_csv(StringIO('''
Country,Type,2011,2012,2013
Afghanistan,Estimate,-1.63,-1.57,-1.41
Afghanistan,Sources,5,8,7
Afghanistan,Percentile,0.95,0.94,2.36
Zambia,Estimate,1.63,1.57,1.41
Zambia,Sources,7,10,8
Zambia,Percentile,0.88,0.77,1.54
'''), dtype={'Country' : 'string',
'Type' : 'string',
'2011' : 'float',
'2012' : 'float',
'2013' : 'float'})
# Country Type 2011 2012 2013
# 0 Afghanistan Estimate -1.63 -1.57 -1.41
# 1 Afghanistan Sources 5.00 8.00 7.00
# 2 Afghanistan Percentile 0.95 0.94 2.36
# 3 Zambia Estimate 1.63 1.57 1.41
# 4 Zambia Sources 7.00 10.00 8.00
# 5 Zambia Percentile 0.88 0.77 1.54
# transform to long format
df = df.melt(id_vars=['Country', 'Type'],
value_vars=['2011','2012','2013'],
var_name='Year')
# df after melt:
# Country Type Year value
# 0 Afghanistan Estimate 2011 -1.63
# 1 Afghanistan Sources 2011 5.00
# 2 Afghanistan Percentile 2011 0.95
# 3 Zambia Estimate 2011 1.63
# 4 Zambia Sources 2011 7.00
# 5 Zambia Percentile 2011 0.88
# 6 Afghanistan Estimate 2012 -1.57
# 7 Afghanistan Sources 2012 8.00
# 8 Afghanistan Percentile 2012 0.94
# 9 Zambia Estimate 2012 1.57
# 10 Zambia Sources 2012 10.00
# 11 Zambia Percentile 2012 0.77
# 12 Afghanistan Estimate 2013 -1.41
# 13 Afghanistan Sources 2013 7.00
# 14 Afghanistan Percentile 2013 2.36
# 15 Zambia Estimate 2013 1.41
# 16 Zambia Sources 2013 8.00
# 17 Zambia Percentile 2013 1.54
sns.relplot(data=df, kind='line', x='Year',
y='value', hue='Type', col="Country")
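If Seaborn is not an option, the same long-format data can be split into one small table per country with pandas alone. This is only a sketch on a subset of the data above; the names `long_df` and `tables` are assumptions for the example.

```python
import pandas as pd

# Long-format data shaped like the melted frame above (a small subset).
long_df = pd.DataFrame({
    "Country": ["Afghanistan"] * 4 + ["Zambia"] * 4,
    "Type": ["Estimate", "Sources"] * 4,
    "Year": ["2011", "2011", "2012", "2012"] * 2,
    "value": [-1.63, 5.0, -1.57, 8.0, 1.63, 7.0, 1.57, 10.0],
})

# One Year x Type table per country; each table could be passed to
# DataFrame.plot() to draw one line per Type.
tables = {
    country: group.pivot(index="Year", columns="Type", values="value")
    for country, group in long_df.groupby("Country")
}
print(tables["Zambia"])
```

Each entry in `tables` has the years as its index and one column per Type, which is the layout pandas plotting expects for a line chart with a legend.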

AttributeError: 'list' object has no attribute 'assign'

I have this dataframe:
SRC Coup Vint Bal Mar Apr May Jun Jul BondSec
0 JPM 1.5 2021 43.9 5.6 4.9 4.9 5.2 4.4 FNCL
1 JPM 1.5 2020 41.6 6.2 6.0 5.6 5.8 4.8 FNCL
2 JPM 2.0 2021 503.9 7.1 6.3 5.8 6.0 4.9 FNCL
3 JPM 2.0 2020 308.3 9.3 7.8 7.5 7.9 6.6 FNCL
4 JPM 2.5 2021 345.0 8.6 7.8 6.9 6.8 5.6 FNCL
5 JPM 4.5 2010 5.7 21.3 20.0 18.0 17.7 14.6 G2SF
6 JPM 5.0 2019 2.8 39.1 37.6 34.6 30.8 24.2 G2SF
7 JPM 5.0 2018 7.3 39.8 37.1 33.4 30.1 24.2 G2SF
8 JPM 5.0 2010 3.9 23.3 20.0 18.6 17.9 14.6 G2SF
9 JPM 5.0 2009 4.2 22.8 21.2 19.5 18.6 15.4 G2SF
I want to duplicate all the rows that have FNCL as the BondSec, and rename the value of BondSec in those new duplicate rows to FGLMC. I'm able to accomplish half of that with the following code:
if "FGLMC" not in jpm['BondSec']:
    is_FNCL = jpm['BondSec'] == "FNCL"
    FNCL_try = jpm[is_FNCL]
    jpm.append([FNCL_try]*1,ignore_index=True)
But if I instead try to implement the change to the BondSec value in the same line as below:
jpm.append(([FNCL_try]*1).assign(**{'BondSecurity': 'FGLMC'}),ignore_index=True)
I get the following error:
AttributeError: 'list' object has no attribute 'assign'
Additionally, I would like to insert the duplicated rows based on an index condition, not just at the bottom as additional rows. The condition cannot be simply a row position because this will have to work on future files with different numbers of rows. So I would like to insert the duplicated rows at the position where the BondSec column values change from FNCL to FNCI (FNCI is not showing here, but basically it would be right below the last row with FNCL). I'm assuming this could be done with an np.where function call, but I'm not sure how to implement that.
I'll also eventually want to do this same exact process with rows with FNCI as the BondSec value (duplicating them and transforming the BondSec value to FGCI, and inserting at the index position right below the last row with FNCI as the value).
I'd suggest a helper function to handle all your duplications:
def duplicate_and_rename(df, target, value):
    return pd.concat([df, df[df["BondSec"] == target].assign(BondSec=value)])
Then
for target, value in (("FNCL", "FGLMC"), ("FNCI", "FGCI")):
    df = duplicate_and_rename(df, target, value)
Then after all that, you can make the BondSec column categorical with a custom order and sort the rows by it:
ordering = ["FNCL", "FGLMC", "FNCI", "FGCI", "G2SF"]
df["BondSec"] = pd.Categorical(df["BondSec"], categories=ordering, ordered=True)
df = df.sort_values("BondSec", kind="mergesort").reset_index(drop=True)
Alternatively, you can use a dictionary for your ordering, as explained in this answer.
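The AttributeError in the question comes from `[FNCL_try]*1` being a plain Python list, which has no `assign` method; `assign` has to be called on the DataFrame itself. The sketch below puts the pieces together on a cut-down frame and uses the dictionary ordering mentioned above. The name `rank` and the reduced data are assumptions for the example.

```python
import pandas as pd

# Cut-down version of the frame from the question.
jpm = pd.DataFrame({
    "Coup": [1.5, 2.0, 4.5, 5.0],
    "BondSec": ["FNCL", "FNCL", "G2SF", "G2SF"],
})

# Copy the FNCL rows, relabel the copies, and append them.
copies = jpm[jpm["BondSec"] == "FNCL"].assign(BondSec="FGLMC")
out = pd.concat([jpm, copies])

# Map each label to a rank; mergesort is stable, so rows keep their
# original order within each label.
rank = {"FNCL": 0, "FGLMC": 1, "FNCI": 2, "FGCI": 3, "G2SF": 4}
out = out.sort_values("BondSec", key=lambda col: col.map(rank), kind="mergesort")
out = out.reset_index(drop=True)
print(out)
```

Sorting by rank places the FGLMC copies directly below the last FNCL row without relying on a fixed row position, so it should carry over to files with a different number of rows.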

pandas grouping aggregation across multiple columns in a dataframe

I would like to derive the min and max for each year, region, and weather_type from a pandas dataframe. The dataframe looks like this:
year jan feb mar apr may jun aug sept oct nov dec region weathertype
1862 42.0 8.2 82.7 46.7 72.7 61.6 81.9 45.9 76.8 34.9 44.8 Anglia Rain
1863 58.3 15.7 24.0 17.5 27.9 75.2 38.5 71.5 71.7 77.5 32.0 Anglia Rain
1864 20.5 30.3 81.5 13.8 59.5 26.5 12.3 19.2 42.1 25.5 79.9 Anglia Rain
What is needed are two new columns giving the min and max for each region and year, in effect aggregating across the month columns of each row, with the result added to the existing dataframe:
year min max
1862 42.0 81.9
1863 15.7 77.5
1864 12.3 81.5
My approach has been to use this code:
weather_data['max_value'] = weather_data.groupby(['year','region','weathertype'])['jan','feb','mar','apr','may','jun','jul','aug','sep','oct', 'nov','dec'].transform(np.min)
However, this produces a non-aggregated subset of the data, which is a duplication of the existing frame, resulting in the following error:
Wrong number of items passed 12, placement implies 1
I then melted the dataframe into a long, rather than wide format:
year region Option_1 variable value
1862 Anglia Rain jan 42.0
1863 Anglia Rain jan 58.3
1864 Anglia Rain jan 20.5
I then tried this code to produce what I needed:
weather_data['min_value'] = weather_data['value'].groupby(weather_data['region','Option_1']).transform(np.min)
but this produces a KeyError when the columns are given as a single list, while the nested list
[['region','Option_1']]
produces
Grouper for <class 'pandas.core.frame.DataFrame'> not 1-dimensional
Any suggestions at this point are gratefully received.
I would do:
(df.set_index(['year','region','weathertype'])
.assign(min=lambda x: x.min(axis=1),
max=lambda x: x.max(axis=1)
)
.reset_index())
Output:
year region weathertype jan feb mar apr may jun aug sept oct nov dec min max
-- ------ -------- ------------- ----- ----- ----- ----- ----- ----- ----- ------ ----- ----- ----- ----- -----
0 1862 Anglia Rain 42 8.2 82.7 46.7 72.7 61.6 81.9 45.9 76.8 34.9 44.8 8.2 82.7
1 1863 Anglia Rain 58.3 15.7 24 17.5 27.9 75.2 38.5 71.5 71.7 77.5 32 15.7 77.5
2 1864 Anglia Rain 20.5 30.3 81.5 13.8 59.5 26.5 12.3 19.2 42.1 25.5 79.9 12.3 81.5
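Since each row already holds one year, region and weather type, no groupby is needed; the min and max can be taken across the month columns of each row. Here is a minimal sketch of that idea with the month columns named explicitly. It uses only three months from the question to stay short, and the list name `months` is an assumption.

```python
import pandas as pd

# Three months only, to keep the example short; values are from the question.
weather_data = pd.DataFrame({
    "year": [1862, 1863, 1864],
    "jan": [42.0, 58.3, 20.5],
    "feb": [8.2, 15.7, 30.3],
    "mar": [82.7, 24.0, 81.5],
    "region": ["Anglia"] * 3,
    "weathertype": ["Rain"] * 3,
})

# Name the month columns explicitly so other numeric columns (such as year) are left out.
months = ["jan", "feb", "mar"]
weather_data["min"] = weather_data[months].min(axis=1)
weather_data["max"] = weather_data[months].max(axis=1)
print(weather_data)
```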

Locate, extract and re-append year from column in pandas DataFrame

I've created a pandas dataframe using the 'read html' method from an external source. There's no problem creating the dataframe, however, I'm stuck trying to adjust the structure of the first column, 'Month'.
The data I'm scraping is updated once a month at the source, therefore, the solution requires a dynamic approach. So far I've only been able to achieve the desired outcome using .iloc to manually update each row, which works fine until the data is updated at source next month.
This is what my dataframe looks like:
df = pd.read_html(url)
df
Month Value
0 2017 NaN
1 November 1.29
2 December 1.29
3 2018 NaN
4 January 1.29
5 February 1.29
6 March 1.29
7 April 1.29
8 May 1.29
9 June 1.28
10 July 1.28
11 August 1.28
12 September 1.28
13 October 1.26
14 November 1.16
15 December 1.09
16 2019 NaN
17 January 1.25
18 February 1.34
19 March 1.34
20 April 1.34
This is my desired outcome:
df
Month Value
0 November 2017 1.29
2 December 2017 1.29
4 January 2018 1.29
5 February 2018 1.29
6 March 2018 1.29
7 April 2018 1.29
8 May 2018 1.29
9 June 2018 1.28
10 July 2018 1.28
11 August 2018 1.28
12 September 2018 1.28
13 October 2018 1.26
14 November 2018 1.16
15 December 2018 1.09
17 January 2019 1.25
18 February 2019 1.34
19 March 2019 1.34
20 April 2019 1.34
Right now the best idea I've come up with is to select, extract and append the year to each row in the 'Month' column until the month 'December' is reached, and then switch to the next year, but I have no idea how to implement this in code. Would this be a viable solution (and how could it be implemented?) or is there a better way?
Many thanks from a long time reader and first time poster of stackoverflow!
Use ffill based on Value: where Value is NaN the row holds a year, so forward-fill that year and append it to each month below it, then drop the year-only rows.
df.Month=df.Month+' '+df.Month.where(df.Value.isna()).ffill().astype(str)
df.dropna(inplace=True)
df
Out[29]:
Month Value
1 November 2017 1.29
2 December 2017 1.29
4 January 2018 1.29
5 February 2018 1.29
6 March 2018 1.29
7 April 2018 1.29
8 May 2018 1.29
9 June 2018 1.28
10 July 2018 1.28
11 August 2018 1.28
12 September 2018 1.28
13 October 2018 1.26
14 November 2018 1.16
15 December 2018 1.09
17 January 2019 1.25
18 February 2019 1.34
19 March 2019 1.34
20 April 2019 1.34
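The same idea, broken into steps on a self-contained frame. This is a sketch that assumes the years arrive as strings in the Month column and that only the year rows have NaN in Value; the intermediate name `year` is introduced for the example.

```python
import pandas as pd
import numpy as np

# Shape assumed from the question: year rows have NaN in Value.
df = pd.DataFrame({
    "Month": ["2017", "November", "December", "2018", "January"],
    "Value": [np.nan, 1.29, 1.29, np.nan, 1.29],
})

# Keep Month only on the year rows, then carry each year down to the rows below it.
year = df["Month"].where(df["Value"].isna()).ffill()

# Append the year to every row and drop the year-only rows.
df["Month"] = df["Month"] + " " + year
df = df.dropna(subset=["Value"])
print(df)
```

Because the year is picked up from whichever row has a missing Value, nothing needs to change when next month's data adds rows or a new year.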

Reshaping Pandas DataFrame: switch columns to indices and repeated values as columns

I've had a really tough time figuring out how to reshape this DataFrame. Sorry about the wording of the question, this problem seems a bit specific.
I have data on several countries along with a column of 6 repeating features and the year this data was recorded. It looks something like this (minus some features and columns):
Country Feature 2005 2006 2007 2008 2009
0 Afghanistan Age Dependency 99.0 99.5 100.0 100.2 100.1
1 Afghanistan Birth Rate 44.9 43.9 42.8 41.6 40.3
2 Afghanistan Death Rate 10.7 10.4 10.1 9.8 9.5
3 Albania Age Dependency 53.5 52.2 50.9 49.7 48.7
4 Albania Birth Rate 12.3 11.9 11.6 11.5 11.6
5 Albania Death Rate 5.95 6.13 6.32 6.51 6.68
There doesn't seem to be any way to make pivot_table() work in this situation and I'm having trouble finding what other steps I can take to make it look how I want:
Age Dependency Birth Rate Death Rate
Afghanistan 2005 99.0 44.9 10.7
2006 99.5 43.9 10.4
2007 100.0 42.8 10.1
2008 100.2 41.6 9.8
2009 100.1 40.3 9.5
Albania 2005 53.5 12.3 5.95
2006 52.2 11.9 6.13
2007 50.9 11.6 6.32
2008 49.7 11.5 6.51
2009 48.7 11.6 6.68
Where the unique values of the 'Feature' column each become a column and the year columns each become part of a multiIndex with the country. Any help is appreciated, thank you!
EDIT: I checked the "duplicate" but I don't see how that question is the same as this one. How would I place the repeated values within my feature column as unique columns while at the same time moving the years to become a multi index with the countries? Sorry if I'm just not getting something.
Use melt to get a long format, then reshape with set_index and unstack:
df = (df.melt(['Country','Feature'], var_name='year')
.set_index(['Country','year','Feature'])['value']
.unstack())
print (df)
Feature Age Dependency Birth Rate Death Rate
Country year
Afghanistan 2005 99.0 44.9 10.70
2006 99.5 43.9 10.40
2007 100.0 42.8 10.10
2008 100.2 41.6 9.80
2009 100.1 40.3 9.50
Albania 2005 53.5 12.3 5.95
2006 52.2 11.9 6.13
2007 50.9 11.6 6.32
2008 49.7 11.5 6.51
2009 48.7 11.6 6.68
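Once the frame is in long format, `pivot` can also do the final step. The sketch below uses a subset of the data from the question, and assumes a pandas version in which `pivot` accepts a list for `index`; the names `long_df` and `wide` are made up for the example.

```python
import pandas as pd

# Two countries, two features, two years from the question.
df = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "Feature": ["Birth Rate", "Death Rate", "Birth Rate", "Death Rate"],
    "2005": [44.9, 10.7, 12.3, 5.95],
    "2006": [43.9, 10.4, 11.9, 6.13],
})

# Long format first: one row per (Country, Feature, year).
long_df = df.melt(["Country", "Feature"], var_name="year")

# Then spread Feature back out into columns, keyed by (Country, year).
wide = long_df.pivot(index=["Country", "year"], columns="Feature", values="value")
print(wide)
```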
