Sort dataframe using dictionary as sort criteria - python

There is a similar question here but not exactly what I'm looking for.
I want to sort a dataframe based on a dictionary that specifies the column(s) to sort by as well as the order for each column.
Example:
df =
+-------+-------+-----------+------+
| Index | Time | Month | Year |
+-------+-------+-----------+------+
| 0 | 13:00 | January | 2018 |
| 1 | 14:30 | March | 2015 |
| 2 | 12:00 | November | 2003 |
| 3 | 10:15 | September | 2012 |
| 4 | 13:30 | October | 2012 |
| 5 | 06:25 | June | 2012 |
| 6 | 07:50 | August | 2019 |
| 7 | 09:20 | May | 2015 |
| 8 | 22:30 | July | 2016 |
| 9 | 23:05 | April | 2013 |
| 10 | 21:10 | April | 2008 |
+-------+-------+-----------+------+
sort_dict = {'Month': 'Ascending', 'Year': 'Descending', 'Time': 'Ascending'}
df.sort_values(by=sort_dict)
df =
+-------+-------+-----------+------+
| Index | Time | Month | Year |
+-------+-------+-----------+------+
| 0 | 13:00 | January | 2018 |
| 1 | 14:30 | March | 2015 |
| 9 | 23:05 | April | 2013 |
| 10 | 21:10 | April | 2008 |
| 7 | 09:20 | May | 2015 |
| 5 | 06:25 | June | 2012 |
| 8 | 22:30 | July | 2016 |
| 6 | 07:50 | August | 2019 |
| 3 | 10:15 | September | 2012 |
| 4 | 13:30 | October | 2012 |
| 2 | 12:00 | November | 2003 |
+-------+-------+-----------+------+
Any help is appreciated, thanks!
Column index would also be fine:
sort_dict = {2: 'Ascending', 3: 'Descending', 1: 'Ascending'}

EDIT: (thanks @Jon Clements)
In Python 3.6+, declaring sort_dict as a literal preserves the key order as written; prior to 3.6, however, dict literals won't necessarily preserve order. E.g. in 3.5, declaring sort_dict can end up as {'Month': 'Ascending', 'Time': 'Ascending', 'Year': 'Descending'}, which is a different iteration order and thus gives different sort results.
If you always need the same order, you can use an OrderedDict or build a Series with the constructor; their order does not depend on the Python version.
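For example, a minimal sketch of the OrderedDict route (the Series route follows below):
from collections import OrderedDict

# insertion order is preserved on every Python version
sort_dict = OrderedDict([('Month', 'Ascending'),
                         ('Year', 'Descending'),
                         ('Time', 'Ascending')])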
One possible solution is to create a helper Series, convert its index to a list for by, and pass a boolean list to the ascending parameter:
s = pd.Series(sort_dict)
print (s)
Month Ascending
Year Descending
Time Ascending
dtype: object
df = df.sort_values(by=s.index.tolist(), ascending=(s == 'Ascending').tolist())
print (df)
Time Month Year
Index
9 23:05 April 2013
10 21:10 April 2008
6 07:50 August 2019
0 13:00 January 2018
8 22:30 July 2016
5 06:25 June 2012
1 14:30 March 2015
7 09:20 May 2015
2 12:00 November 2003
4 13:30 October 2012
3 10:15 September 2012
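If the sort criteria use positional column indices instead (the second sort_dict in the question), a possible sketch, assuming Index is a regular column so that position 1 is Time, 2 is Month and 3 is Year, is to translate positions to labels first (the same key-order caveat from the EDIT applies):
sort_dict = {2: 'Ascending', 3: 'Descending', 1: 'Ascending'}
by = [df.columns[i] for i in sort_dict]                     # positions -> labels
ascending = [v == 'Ascending' for v in sort_dict.values()]  # parallel boolean list
df = df.sort_values(by=by, ascending=ascending)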

Related

Pandas Dataframe keep rows where values of 2 columns are in a list of couples

I have a list of (Year, Month) couples:
year_month = [(2020,8), (2021,1), (2021,6)]
and a dataframe df
| ID | Year | Month |
| 1 | 2020 | 1 |
| ... |
| 1 | 2020 | 12 |
| 1 | 2021 | 1 |
| ... |
| 1 | 2021 | 12 |
| 2 | 2020 | 1 |
| ... |
| 2 | 2020 | 12 |
| 2 | 2021 | 1 |
| ... |
| 2 | 2021 | 12 |
| 3 | 2021 | 1 |
| ... |
I want to select the rows where Year and Month correspond to one of the couples in the year_month list:
Output df:
| ID | Year | Month |
| 1 | 2020 | 8 |
| 1 | 2021 | 1 |
| 1 | 2021 | 6 |
| 2 | 2020 | 8 |
| 2 | 2021 | 1 |
| 2 | 2021 | 6 |
| 3 | 2020 | 8 |
| ... |
Any idea how to automate this, so that I only have to change the year_month couples?
I want to put many couples in year_month, so I want to keep a list of couples rather than listing every combination against df by hand. I don't want to do this:
df = df[((df['Year'] == 2020) & (df['Month'] == 8)) |
        ((df['Year'] == 2021) & (df['Month'] == 1)) |
        ((df['Year'] == 2021) & (df['Month'] == 6))]
You can use a list comprehension and filter your dataframe with your list of tuples as below:
year_month = [(2020,8), (2021,1), (2021,6)]
df[[i in year_month for i in zip(df.Year,df.Month)]]
Which gives only the paired values back:
ID Year Month
2 1 2021 1
6 2 2021 1
8 3 2021 1
One way using pandas.DataFrame.merge:
df.merge(pd.DataFrame(year_month, columns=["Year", "Month"]))
Output:
ID Year Month
0 1 2021 1
1 2 2021 1
2 3 2021 1
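A third option, assuming Year and Month are plain columns as shown, is to build the boolean mask with MultiIndex.isin, which avoids the Python-level loop and, unlike merge, keeps the original row index:
mask = df.set_index(['Year', 'Month']).index.isin(year_month)
df[mask]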

pull row with max date from groupby in python pandas

I'm trying to pull the max date from a df in the below format
index1   index2   col1
place1   2018     5
         2019     4
         2020     2
place2   2016     9
         2017     8
place3   2018     6
         2019     1
I'm trying to pull out the row for the maximum year available for each place. In the above example the final df would be:
place1 | 2020 | 2
place2 | 2017 | 8
place3 | 2019 | 1
You can use DataFrame.sort_values().groupby().last() to take the row with the maximum value in each group. In your case:
df.sort_values("index2").groupby("index1").last()
I think it should work for you.
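Note that sort_values accepts index level names as well as column labels (pandas 0.23+). On older versions, or if you prefer working on plain columns, a hedged equivalent, assuming index1/index2 are levels of a MultiIndex as in the display above:
out = df.reset_index().sort_values('index2').groupby('index1').last()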
I am a newbie in Python, but maybe this can help:
import pandas as pd

data = [['place1', '2018', '5'],
        ['place1', '2019', '4'],
        ['place1', '2020', '2'],
        ['place2', '2016', '9'],
        ['place2', '2017', '8'],
        ['place3', '2018', '6'],
        ['place3', '2019', '1']]
df = pd.DataFrame(data, columns=['index1', 'index2', 'col1'])
df.set_index(['index1', 'index2'], inplace=True)
df.reset_index(level=1, inplace=True)  # keep index1 as the index, index2 as a column
# latest year first within each place, then keep the first row per place
df = df.sort_values(['index1', 'index2'], ascending=False).groupby('index1').first()
df.set_index('index2', append=True, inplace=True)  # restore the MultiIndex
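A shorter route to the same result, sketched under the assumption that the years can be cast to int so they compare numerically:
df = pd.DataFrame(data, columns=['index1', 'index2', 'col1'])
df['index2'] = df['index2'].astype(int)
# idxmax returns the row label of each place's maximum year
df = df.loc[df.groupby('index1')['index2'].idxmax()]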

Need to reshape my dataframe (lots of column names)

I am trying to reshape a dataframe in pandas. I currently have one id variable, and the rest of the variables are in the format "variableyear", where year is between 2000 and 2016. I want to make a new year column (extracted from each variableyear name) and give each variable its own column. Here is an example dataset that looks similar to my real dataset (my data is confidential):
| name | income2015 | income2016 | children2015 | children2016 | education2015 | education2016
---|---------|------------|------------|--------------|--------------|---------------|---------------
0 | John | 1 | 4 | 7 | 10 | 13 | 16
1 | Phillip | 2 | 5 | 8 | 11 | 14 | 17
2 | Carl | 3 | 6 | 9 | 12 | 15 | 18
This is what I want:
| name | year | income | children | education
---|---------|------|--------|----------|-----------
0 | John | 2015 | 1 | 7 | 13
1 | Phillip | 2015 | 2 | 8 | 14
2 | Carl | 2015 | 3 | 9 | 15
3 | John | 2016 | 4 | 10 | 16
4 | Phillip | 2016 | 5 | 11 | 17
5 | Carl | 2016 | 6 | 12 | 18
I have already tried the following:
df2 = pd.melt(df, id_vars=['name'], value_vars=df.columns[1:])
df2['year'] = df2['variable'].map(lambda x: x[-4:])
df2['variable'] = df2['variable'].map(lambda x: x[:-4])
which gives me this:
   | name    | variable  | value | year
---|---------|-----------|-------|------
0 | John | income | 1 | 2015
1 | Phillip | income | 2 | 2015
2 | Carl | income | 3 | 2015
3 | John | income | 4 | 2016
4 | Phillip | income | 5 | 2016
5 | Carl | income | 6 | 2016
6 | John | children | 7 | 2015
7 | Phillip | children | 8 | 2015
8 | Carl | children | 9 | 2015
9 | John | children | 10 | 2016
10 | Phillip | children | 11 | 2016
11 | Carl | children | 12 | 2016
12 | John | education | 13 | 2015
13 | Phillip | education | 14 | 2015
14 | Carl | education | 15 | 2015
15 | John | education | 16 | 2016
16 | Phillip | education | 17 | 2016
17 | Carl | education | 18 | 2016
But now I have to reshape again... Is there an easier way to do this?
Also, here is my df in dictionary format:
{'children2015': {0: 7, 1: 8, 2: 9}, 'children2016': {0: 10, 1: 11, 2: 12}, 'education2015': {0: 13, 1: 14, 2: 15}, 'education2016': {0: 16, 1: 17, 2: 18}, 'income2015': {0: 1, 1: 2, 2: 3}, 'income2016': {0: 4, 1: 5, 2: 6}, 'name': {0: 'John', 1: 'Phillip', 2: 'Carl'}}
You can actually use pd.wide_to_long for just this. For the stubnames arg you can build the set of variable names in your df (excluding name, with the last 4 characters dropped) using this code: set([x[:-4] for x in df.columns[1:]]).
pd.wide_to_long(df,stubnames=set([x[:-4] for x in df.columns[1:]]),i=['name'],j='year').reset_index()
Output:
name year education income children
0 John 2015 13 1 7
1 Phillip 2015 14 2 8
2 Carl 2015 15 3 9
3 John 2016 16 4 10
4 Phillip 2016 17 5 11
5 Carl 2016 18 6 12
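Alternatively, the melt result you already have only needs one more pivot to finish the job. A sketch using pivot_table (each name/year/variable cell holds a single value, so the default mean aggregation just passes it through):
df2 = pd.melt(df, id_vars=['name'])
df2['year'] = df2['variable'].str[-4:]
df2['variable'] = df2['variable'].str[:-4]
out = (df2.pivot_table(index=['name', 'year'], columns='variable', values='value')
          .reset_index())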

How do I get the change from the same quarter in the previous year in a pandas datatable grouped by more than 1 column

I have a datatable that looks like this (but with more than 1 country and many more years worth of data):
| Country | Year | Quarter | Amount |
-------------------------------------------
| UK | 2014 | 1 | 200 |
| UK | 2014 | 2 | 250 |
| UK | 2014 | 3 | 200 |
| UK | 2014 | 4 | 150 |
| UK | 2015 | 1 | 230 |
| UK | 2015 | 2 | 200 |
| UK | 2015 | 3 | 200 |
| UK | 2015 | 4 | 160 |
-------------------------------------------
I want to get the change for each row from the same quarter in the previous year. So for the first 4 rows in the example the change would be null (because there is no previous data for that quarter). For 2015 quarter 1, the difference would be 30 (because quarter 1 for the previous year is 200, so 230 - 200 = 30). So the data table I'm trying to get is:
| Country | Year | Quarter | Amount | Change |
---------------------------------------------------|
| UK | 2014 | 1 | 200 | NaN |
| UK | 2014 | 2 | 250 | NaN |
| UK | 2014 | 3 | 200 | NaN |
| UK | 2014 | 4 | 150 | NaN |
| UK | 2015 | 1 | 230 | 30 |
| UK | 2015 | 2 | 200 | -50 |
| UK | 2015 | 3 | 200 | 0 |
| UK | 2015 | 4 | 160 | 10 |
---------------------------------------------------|
From looking at other questions I've tried using the .diff() method but I'm not quite sure how to get it to do what I want (or if I'll actually need to do something more brute force to work this out), e.g. I've tried:
df.groupby(by=["Country", "Year", "Quarter"]).sum().diff().head(10)
This yields the difference from the previous row in the table as a whole though, rather than the difference from the same quarter for the previous year.
Since you want the change within each Country and Quarter (rather than from the previous row overall), remove Year from the grouping:
df['Change'] = df.groupby(['Country', 'Quarter']).Amount.diff()
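diff() compares each row with the previous row of its group in frame order, so if the rows are not already guaranteed to be ordered by year, sort first; a sketch:
df = df.sort_values(['Country', 'Quarter', 'Year'])
df['Change'] = df.groupby(['Country', 'Quarter'])['Amount'].diff()
df = df.sort_index()  # optional: restore the original row order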

How to calculate Average of dates in groupby Python

I have a dataframe with more than 500K rows and 20 columns. I am trying to determine the frequency at which each personId receives something, based on the date_Recieved column; all of the other columns are irrelevant for this task but useful for subsequent ones.
|---------------------|------------------|
| personId | date_Recieved |
|---------------------|------------------|
| 1 | 2 feb 2016 |
|---------------------|------------------|
| 1 | 4 feb 2016 |
|---------------------|------------------|
| 1 | 6 feb 2016 |
|---------------------|------------------|
| 2 | 10 dec 2016 |
|---------------------|------------------|
| 2 | 1 jan 2017 |
|---------------------|------------------|
| 2 | 20 jan 2017 |
|---------------------|------------------|
date_Recieved is of type pandas.tslib.Timestamp. I am looking for something like this:
|---------------------|------------------|
| personId | Frequency |
|---------------------|------------------|
| 1 | 2 days |
|---------------------|------------------|
| 2 | 20.5 days |
|---------------------|------------------|
So on average person 1 receives something every 2 days and person 2 receives something every 20.5 days.
I tried using the groupby function but still haven't been able to get the response with my dataframe.
Can someone please help me with this?
Using groupby and a lambda:
df.groupby('personId').date_Recieved.apply(lambda x: x.diff().dropna().mean())
personId
1 2 days 00:00:00
2 20 days 12:00:00
Name: date_Recieved, dtype: timedelta64[ns]
setup
txt = """
personId date_Recieved
1 2 feb 2016
1 4 feb 2016
1 6 feb 2016
2 10 dec 2016
2 1 jan 2017
2 20 jan 2017
"""
df = pd.read_csv(StringIO(txt), sep='\s{2,}', engine='python', parse_dates=[1])
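As a closed-form alternative: for dates sorted within each person (as in the sample), the mean of the consecutive gaps telescopes to (max - min) / (count - 1), so this sketch gives the same result without materializing the diffs:
df.groupby('personId').date_Recieved.apply(lambda x: (x.max() - x.min()) / (x.count() - 1))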
