pull row with max date from groupby in python pandas

I'm trying to pull the max date from a df in the below format:
| index1 | index2 | col1 |
| place1 | 2018   | 5    |
|        | 2019   | 4    |
|        | 2020   | 2    |
| place2 | 2016   | 9    |
|        | 2017   | 8    |
| place3 | 2018   | 6    |
|        | 2019   | 1    |
I'm trying to pull rows out for the maximum years available for each place. In the above example the final df would be:
place1 | 2020 | 2
place2 | 2017 | 8
place3 | 2019 | 1

You can use dataframe.sort_values().groupby().last() to find the row with the maximum value in each group. In your case:
df.sort_values("index2").groupby("index1").last()
I think that should work for you.
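A minimal runnable sketch of that pattern, assuming index1 and index2 are ordinary columns (if they are index levels, call reset_index() first); the dummy data mirrors the table above:

```python
import pandas as pd

# Dummy data mirroring the question's table
df = pd.DataFrame({
    "index1": ["place1", "place1", "place1", "place2", "place2", "place3", "place3"],
    "index2": [2018, 2019, 2020, 2016, 2017, 2018, 2019],
    "col1":   [5, 4, 2, 9, 8, 6, 1],
})

# After sorting by year, the last row of each group is the latest year
out = df.sort_values("index2").groupby("index1").last()
```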

I am a newbie in Python, but maybe this can help:
import pandas as pd

data = [['place1', '2018', '5'],
        ['place1', '2019', '4'],
        ['place1', '2020', '2'],
        ['place2', '2016', '9'],
        ['place2', '2017', '8'],
        ['place3', '2018', '6'],
        ['place3', '2019', '1']]
df = pd.DataFrame(data, columns=['index1', 'index2', 'col1'])
df.set_index(['index1', 'index2'], inplace=True)
df.reset_index(level=1, inplace=True)  # keep index1 as the index, index2 as a column
# Sort descending and take the first row per place, i.e. the latest year
df = df.sort_values(['index1', 'index2'], ascending=False).groupby('index1').first()
df.set_index('index2', append=True, inplace=True)
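Another possible sketch for the flat-column layout: groupby().idxmax() keeps the whole row holding the latest year per place, without sorting first (dummy data reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "index1": ["place1", "place1", "place1", "place2", "place2", "place3", "place3"],
    "index2": [2018, 2019, 2020, 2016, 2017, 2018, 2019],
    "col1":   [5, 4, 2, 9, 8, 6, 1],
})

# idxmax returns, per place, the label of the row with the maximum year
out = df.loc[df.groupby("index1")["index2"].idxmax()]
```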

Related

Python: Increment if a name and year appears in a dataframe

I am wondering if there is an efficient way in Python to increment the occurrences of a horse's name based on the years it raced.
For example, consider the dataframe:
| | horse | year_race |
|---:|:--------------|------------:|
| 0 | Hackney | 2012 |
| 1 | Orlov Trotter | 2016 |
| 2 | Marwari | 2011 |
| 3 | Hackney | 2012 |
| 4 | Marwari | 2018 |
| 5 | Hackney | 2015 |
| 6 | Marwari | 2014 |
I would like the result to show the following:
{
    "Hackney": 1,
    "Orlov Trotter": 0,
    "Marwari": 2
}
If a horse only raced once, the occurrence should be 0. Hackney only has an occurrence of 1 because there is a duplicate entry in 2012. Marwari has an occurrence of 2 because the horse raced in 3 different years.
Is there a python way to solve this?
Thanks in advance
Use DataFrameGroupBy.nunique per horse, subtract 1 and convert to dictionary:
d = df.groupby('horse', sort=False)['year_race'].nunique().sub(1).to_dict()
print(d)
{'Hackney': 1, 'Orlov Trotter': 0, 'Marwari': 2}
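For completeness, a self-contained sketch with the table's data reconstructed as a dummy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "horse": ["Hackney", "Orlov Trotter", "Marwari", "Hackney",
              "Marwari", "Hackney", "Marwari"],
    "year_race": [2012, 2016, 2011, 2012, 2018, 2015, 2014],
})

# Distinct race years per horse, minus one for the first appearance
d = df.groupby("horse", sort=False)["year_race"].nunique().sub(1).to_dict()
```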

Reorganize data using Dataframe

Hi! I received a very disorganized and messy Excel file and need to reorganize it into a more presentable format. However, I am stuck on how to proceed. :(
Data received:
| 2020 | 2021 | 2022 | 2022 % Total | 2023E | 2024E | 2025E | ... |
| ---- | ---- | ---- | ------------ | ----- | ----- | ----- | --- |
| 0 | 3 | 6 | 9 | 12 | 15 | 18 | ... |
| 1 | 4 | 7 | 10 | 13 | 16 | 19 | ... |
| 2 | 5 | 8 | 11 | 14 | 17 | 20 | ... |
Expected output:
| Year | Value |
| ---- | ----- |
| 2020 | 0 |
| 2020 | 1 |
| 2020 | 2 |
| 2021 | 3 |
| 2021 | 4 |
| 2021 | 5 |
| 2022 | 6 |
| 2022 | 7 |
| 2022 | 8 |
The headers of the received file contain various years, starting from 2020.
I only need the data from the oldest year (2020) up to the latest valid year (2022); any column after the latest valid year is not required (e.g. from the header containing " % Total" onwards). The latest valid year advances every year, so next year I expect a new "2023" column in the 4th position.
After that I need to append the data from the "2020", "2021" and "2022" columns into a new "Value" column, with a new "Year" column created from the corresponding year header.
I am not sure whether it is something that can be achieved using Dataframe.
Any suggestions will be greatly appreciated!
Regards,
Shan
If you know the keys you want (i.e. 2020, 2021, 2022), you can do this:
import pandas as pd

# Create a dummy dataframe
df = pd.DataFrame({
    "2020": [0, 1, 2],
    "2021": [3, 4, 5],
    "2022": [6, 7, 8]
})

keys = ["2020", "2021", "2022"]
arr = []  # 2D list to hold the restructured data
for key in keys:
    arr.extend([[key, v] for v in df[key]])

new_df = pd.DataFrame(arr, columns=["Year", "Value"])
new_df.head()
You could generate the years list in code too, instead of hard coding that:
start = 2020
end = 2022
keys = [str(i) for i in range(start, end + 1)]
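An alternative sketch using melt(), with the year columns detected dynamically by stopping at the first non-numeric header (the column names here are assumptions based on the sample data):

```python
import pandas as pd

# Dummy frame including a trailing column that should be ignored
df = pd.DataFrame({
    "2020": [0, 1, 2],
    "2021": [3, 4, 5],
    "2022": [6, 7, 8],
    "2022 % Total": [9, 10, 11],
})

# Keep columns up to the first header that is not a plain year
year_cols = []
for c in df.columns:
    if not str(c).isdigit():
        break
    year_cols.append(c)

# melt() stacks the kept columns into Year/Value pairs
new_df = df[year_cols].melt(var_name="Year", value_name="Value")
```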

Pandas Dataframe keep rows where values of 2 columns are in a list of couples

I have a list of (year, month) couples:
year_month = [(2020,8), (2021,1), (2021,6)]
and a dataframe df
| ID | Year | Month |
| 1 | 2020 | 1 |
| ... |
| 1 | 2020 | 12 |
| 1 | 2021 | 1 |
| ... |
| 1 | 2021 | 12 |
| 2 | 2020 | 1 |
| ... |
| 2 | 2020 | 12 |
| 2 | 2021 | 1 |
| ... |
| 2 | 2021 | 12 |
| 3 | 2021 | 1 |
| ... |
I want to select rows where Year and Month correspond to one of the couples in the year_month list.
Output df:
| ID | Year | Month |
| 1 | 2020 | 8 |
| 1 | 2021 | 1 |
| 1 | 2021 | 6 |
| 2 | 2020 | 8 |
| 2 | 2021 | 1 |
| 2 | 2021 | 6 |
| 3 | 2020 | 8 |
| ... |
Any idea how to automate this, so I only have to change the year_month couples? I want to put many couples in year_month, so I want to keep a list of couples rather than spelling out every possibility in the filter. I don't want to do this:
df = df[((df['Year'] == 2020) & (df['Month'] == 8)) |
        ((df['Year'] == 2021) & (df['Month'] == 1)) |
        ((df['Year'] == 2021) & (df['Month'] == 6))]
You can use a list comprehension and filter your dataframe with your list of tuples as below:
year_month = [(2020,8), (2021,1), (2021,6)]
df[[i in year_month for i in zip(df.Year,df.Month)]]
Which gives only the paired values back:
ID Year Month
2 1 2021 1
6 2 2021 1
8 3 2021 1
One way using pandas.DataFrame.merge:
df.merge(pd.DataFrame(year_month, columns=["Year", "Month"]))
Output:
ID Year Month
0 1 2021 1
1 2 2021 1
2 3 2021 1
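A third possible sketch: build a (Year, Month) MultiIndex from the two columns and test membership against the tuple list in one vectorised step (dummy data abbreviated from the question):

```python
import pandas as pd

year_month = [(2020, 8), (2021, 1), (2021, 6)]
df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2],
    "Year":  [2020, 2020, 2021, 2021, 2021],
    "Month": [8, 12, 1, 6, 2],
})

# isin on a MultiIndex accepts a list of tuples
mask = pd.MultiIndex.from_frame(df[["Year", "Month"]]).isin(year_month)
out = df[mask]
```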

Sum two dataframes based on row and column

Given two DataFrames df_1
Code | Jan | Feb | Mar
a | 1 | 2 | 1
b | 3 | 4 | 3
and df_2
Code | Jan | Feb | Mar
a | 1 | 1 | 2
c | 7 | 0 | 0
I would like to sum these two tables based on row and column, so my result dataframe should look like this:
Code | Jan | Feb | Mar
a | 2 | 3 | 3
b | 3 | 4 | 3
c | 7 | 0 | 0
Is there an easy way to do this? I can do this using a lot of for loops and if statements, but that is very slow for large datasets.
Use concat and aggregate sum:
df = pd.concat([df_1, df_2]).groupby('Code', as_index=False).sum()
print (df)
Code Jan Feb Mar
0 a 2 3 3
1 b 3 4 3
2 c 7 0 0
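An alternative sketch without concat: index both frames by Code and use add() with fill_value=0, so codes present in only one frame count as zero in the other (note the result may be upcast to float because of the alignment):

```python
import pandas as pd

df_1 = pd.DataFrame({"Code": ["a", "b"], "Jan": [1, 3], "Feb": [2, 4], "Mar": [1, 3]})
df_2 = pd.DataFrame({"Code": ["a", "c"], "Jan": [1, 7], "Feb": [1, 0], "Mar": [2, 0]})

# Align on Code and add element-wise; fill_value=0 handles codes
# present in only one frame
out = (df_1.set_index("Code")
           .add(df_2.set_index("Code"), fill_value=0)
           .reset_index())
```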

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like: [ id | year | month | product_id | sales ]. I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
I would also like to add prev_year_mean_sale and prev_year_id_sale columns.
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3.
My current code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use merge() from the dataframe:
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales": "prev_month_id_sale"}),
                  how="left",
                  left_on=["year", "prev_month", "product_id"],
                  right_on=["year", "month", "product_id"])
The result this way will have more columns than you need; you should drop() some of them and/or rename() others.
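For the prev_month_mean_sale column a similar merge idea works: compute the mean over all sales per (year, month), shift it forward one month, and merge it back. A sketch on a cut-down dummy frame (it deliberately does not handle the December-to-January rollover, which would need extra logic):

```python
import pandas as pd

df = pd.DataFrame({
    "year":       [2018, 2018, 2018, 2018, 2018],
    "month":      [1, 1, 1, 2, 2],
    "product_id": [123, 234, 345, 123, 345],
    "sales":      [5, 4, 2, 3, 2],
})

# Mean of ALL sales in each (year, month)...
monthly = (df.groupby(["year", "month"], as_index=False)["sales"]
             .mean()
             .rename(columns={"sales": "prev_month_mean_sale"}))
# ...aligned with the following month (year rollover not handled here)
monthly["month"] += 1
df = df.merge(monthly, on=["year", "month"], how="left")
```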
