Using a column as a tiebreaker for maximums in Python

Reposted with clarification.
I am working on a dataframe that looks like the following:
+-------+----+------+------+
| Value | ID | Date | ID 2 |
+-------+----+------+------+
|     1 |  5 | 2012 |  111 |
|     1 |  5 | 2012 |  112 |
|     0 | 12 | 2017 |  113 |
|     0 | 12 | 2022 |  114 |
|     1 | 27 | 2005 |  115 |
|     1 | 27 | 2011 |  116 |
+-------+----+------+------+
Using only rows where "Value" == 1 ("Value" is boolean), I would like to group the dataframe by ID and write the string "Latest" into a new (blank) column for the latest row in each group, giving the following output:
+-------+----+------+------+--------+
| Value | ID | Date | ID 2 | Latest |
+-------+----+------+------+--------+
|     1 |  5 | 2012 |  111 |        |
|     1 |  5 | 2012 |  112 | Latest |
|     0 | 12 | 2017 |  113 |        |
|     0 | 12 | 2022 |  114 |        |
|     1 | 27 | 2005 |  115 |        |
|     1 | 27 | 2011 |  116 | Latest |
+-------+----+------+------+--------+
I am using the following code to find the maximum:
latest = df.query('Value==1').groupby("ID").max("Year").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df
But I have since realized that some of the max years are the same, i.e. a group could have 4 rows that all share the max year 2017. As the tiebreaker, I need to use the max ID 2 within each group.
latest = df.query('Value==1').groupby("ID").max("Year").groupby("ID 2").max("ID 2").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df
but it gives me a dataframe completely different from the one desired.

Try this:
df['Latest'] = np.where(df['ID 2'].eq(df.groupby(df['Value'].ne(df['Value'].shift(1)).cumsum())['ID 2'].transform('max')) & df['Value'].ne(0), 'Latest', '')
Output:
>>> df
   Value  ID  Date  ID 2  Latest
0      1   5  2012   111
1      1   5  2012   112  Latest
2      0  12  2017   113
3      0  12  2022   114
4      1  27  2005   115
5      1  27  2011   116  Latest
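As an aside, the grouper above keys on consecutive runs of equal Value, which coincides with the ID groups only because the sample is sorted by ID. Here is a sketch of the same idea that groups by ID directly and applies the Date-then-ID 2 tiebreak (my own illustration, assuming df holds the frame from the question):
import numpy as np

# among the Value == 1 rows, sort so the latest Date (ties broken by the
# largest ID 2) ends up last within each ID, then flag that row
cand = df.query('Value == 1').sort_values(['Date', 'ID 2'])
latest_idx = cand.groupby('ID').tail(1).index
df['Latest'] = np.where(df.index.isin(latest_idx), 'Latest', '')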

Here's one way, a bit similar to your own approach. Basically, groupby + last to get the latest row, assign a label column, then merge:
df = df.merge(df.groupby(['ID', 'Value'])['ID 2'].last().reset_index().assign(Latest=lambda x: np.where(x['Value'], 'Latest', '')), how='outer').fillna('')
or even this works:
df = df.query('Value==1').groupby('ID', as_index=False).last().assign(Latest='Latest').merge(df, how='outer').fillna('')
Output:
   Value  ID  Date  ID 2  Latest
0      1   5  2012   111
1      1   5  2012   112  Latest
2      0  12  2017   113
3      0  12  2022   114
4      1  27  2005   115
5      1  27  2011   116  Latest

Here is one with window functions:
c = df['Value'].ne(df['Value'].shift())
s = df['Date'].add(df['ID 2'])  # add the year and ID 2 to break ties between duplicate years
c1 = s.eq(s.groupby(c.cumsum()).transform('max')) & df['Value'].eq(1)
df['Latest'] = np.where(c1, 'Latest', '')
print(df)
   Value  ID  Date  ID 2  Latest
0      1   5  2012   111
1      1   5  2012   112  Latest
2      0  12  2017   113
3      0  12  2022   114
4      1  27  2005   115
5      1  27  2011   116  Latest

Related

Pandas Dataframe keep rows where values of 2 columns are in a list of couples

I have a list of (year, month) pairs:
year_month = [(2020,8), (2021,1), (2021,6)]
and a dataframe df
| ID | Year | Month |
| 1 | 2020 | 1 |
| ... |
| 1 | 2020 | 12 |
| 1 | 2021 | 1 |
| ... |
| 1 | 2021 | 12 |
| 2 | 2020 | 1 |
| ... |
| 2 | 2020 | 12 |
| 2 | 2021 | 1 |
| ... |
| 2 | 2021 | 12 |
| 3 | 2021 | 1 |
| ... |
I want to select rows where Year and Month correspond to one of the pairs in the year_month list:
Output df:
| ID | Year | Month |
| 1 | 2020 | 8 |
| 1 | 2021 | 1 |
| 1 | 2021 | 6 |
| 2 | 2020 | 8 |
| 2 | 2021 | 1 |
| 2 | 2021 | 6 |
| 3 | 2020 | 8 |
| ... |
Any idea how to automate this, so that I only have to change the year_month pairs?
I want to put many pairs in year_month, so I want to keep a list of pairs rather than spelling out every combination in the filter.
I don't want to do something like this:
df = df[((df['Year'] == 2020) & (df['Month'] == 8)) |
        ((df['Year'] == 2021) & (df['Month'] == 1)) |
        ((df['Year'] == 2021) & (df['Month'] == 6))]
You can use a list comprehension and filter your dataframe with your list of tuples as below:
year_month = [(2020,8), (2021,1), (2021,6)]
df[[i in year_month for i in zip(df.Year,df.Month)]]
Which gives only the paired values back:
   ID  Year  Month
2   1  2021      1
6   2  2021      1
8   3  2021      1
One way using pandas.DataFrame.merge:
df.merge(pd.DataFrame(year_month, columns=["Year", "Month"]))
Output:
   ID  Year  Month
0   1  2021      1
1   2  2021      1
2   3  2021      1
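Another option (a sketch, assuming Year and Month are plain integers like the tuples in year_month) is to build a temporary MultiIndex from the two columns and test membership with isin, which keeps the whole check vectorised:
mask = df.set_index(['Year', 'Month']).index.isin(year_month)
out = df[mask]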

How can I filter each subset of a dataframe for 2 conditions?

I need to check 2 conditions for sales data:
was sold in specific years
was sold on a minimum number of days with amount > 0
Dataframe df:
date | id | qual | amount
2020-09-01 | 123 | A | 100
2020-09-02 | 123 | A | 0
2020-09-03 | 123 | A | 90
2020-09-04 | 123 | A | 80
2020-09-01 | 123 | B | 8
2020-09-02 | 123 | B | 6
2020-09-03 | 123 | B | 4
2020-09-04 | 123 | B | 2
2021-02-01 | 123 | B | 18
2020-02-01 | 456 | A | 96
2021-02-02 | 456 | A | 90
2021-01-01 | 789 | A | 30
2021-01-02 | 789 | A | 31
2021-01-03 | 789 | A | 32
2021-01-04 | 789 | A | 29
The data frame has 10_000 IDs with ~1000 dates for each ID and 1 or 2 quality (qual) levels per ID.
The check needs to be performed for every combination of ID + qual level.
After I check each ID + Qual I want to filter my dataframe so that it only contains IDs + Qual combinations that passed that check.
ID: 123 with qual: A
has year 2020 and year 2021 in sales ❌
has at least 4 rows with amount > 0 ❌
-> does not pass
ID: 123 with qual: B
has year 2020 and year 2021 in sales ✅
has at least 4 rows with amount > 0 ✅
-> does pass
ID: 456 with qual: A
has year 2020 and year 2021 in sales ✅
has at least 4 rows with amount > 0 ❌
-> does not pass
ID: 789 with qual: A
has year 2020 and year 2021 in sales ❌
has at least 4 rows with amount > 0 ✅
-> does not pass
The result should therefore look like this:
date | id | qual | amount
2020-09-01 | 123 | B | 8
2020-09-02 | 123 | B | 6
2020-09-03 | 123 | B | 4
2020-09-04 | 123 | B | 2
2021-02-01 | 123 | B | 18
My code so far:
required_sales_years = [2020, 2021]
required_sales_days = 4
has_required_sales = []
for id in df["id"].unique().tolist():
    for qual in df["qual"].unique().tolist():
        temp = df.query(
            "id == @id and qual == @qual and amount > 0"
        )
        sales_years = temp["date"].dt.year.unique().tolist()
        check_sales_year = all(item in sales_years for item in required_sales_years)
        check_sales_days = len(temp.index) >= required_sales_days
        if check_sales_year and check_sales_days:
            has_required_sales.append((id, qual))
How can I do this?
Use groupby().transform to count the valid sales:
required_sales_years = [2020, 2021]
required_sales_days = 4
# intermediate variables
df['year'] = df.date.dt.year
df['valid'] = df['year'].isin(required_sales_years) & df['amount'].gt(0)
# groupby
groups = df.groupby(['id','qual'])
has_years = groups['year'].transform(lambda x: set(required_sales_years).issubset(set(x)))
valid_sales = groups['valid'].transform('sum') >= required_sales_days
output = df[has_years & valid_sales]
Output:
         date   id qual  amount  year  valid
4  2020-09-01  123    B       8  2020   True
5  2020-09-02  123    B       6  2020   True
6  2020-09-03  123    B       4  2020   True
7  2020-09-04  123    B       2  2020   True
8  2021-02-01  123    B      18  2021   True
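An alternative sketch that keeps whole (id, qual) groups with groupby().filter() and avoids the helper columns (my own illustration, assuming, as in the question's code, that date is already a datetime column):
required_sales_years = [2020, 2021]
required_sales_days = 4

def passes(g):
    sold = g[g['amount'] > 0]  # rows of this (id, qual) group that were actually sold
    years_ok = set(required_sales_years).issubset(sold['date'].dt.year.unique())
    days_ok = len(sold) >= required_sales_days
    return years_ok and days_ok

result = df.groupby(['id', 'qual']).filter(passes)
Note that filter calls the Python function once per group, so with 10_000 ids the transform-based approach above will generally be faster.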

Create a subset by filtering on Year

I have a sample dataset as shown below:
| Id | Year | Price |
|----|------|-------|
| 1 | 2000 | 10 |
| 1 | 2001 | 12 |
| 1 | 2002 | 15 |
| 2 | 2000 | 16 |
| 2 | 2001 | 20 |
| 2 | 2002 | 22 |
| 3 | 2000 | 15 |
| 3 | 2001 | 19 |
| 3 | 2002 | 26 |
I want to subset the dataset so that it only contains values for the last two years. I want to create a variable 'end_year', pass a year value to it, and then use it to subset the original dataframe to the last two years. Since new data keeps coming in, I wanted to use a variable. I have tried the code below but I'm getting an error.
end_year="2002"
df1=df[(df['Year'] >= end_year-1)]
Per the comments, Year is type object in the raw data. We should first cast it to int and then compare with numeric end_year:
df.Year = df.Year.astype(int)   # cast `Year` to `int`
end_year = 2002                 # now we can use `int` here too
df1 = df[df['Year'] >= end_year - 1]
   Id  Year  Price
1   1  2001     12
2   1  2002     15
4   2  2001     20
5   2  2002     22
7   3  2001     19
8   3  2002     26
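A minor variation, in case `Year` might contain non-numeric strings (an assumption, not stated in the question): pd.to_numeric with errors='coerce' turns bad values into NaN instead of raising during the cast.
import pandas as pd

df['Year'] = pd.to_numeric(df['Year'], errors='coerce')
end_year = 2002
df1 = df[df['Year'] >= end_year - 1]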

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like [ id | year | month | product_id | sales ]. I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
---|------|-------|------------|-------|----------------------|-------------------
 1 | 2018 |     1 |        123 |     5 |                  NaN |                NaN
 2 | 2018 |     1 |        234 |     4 |                  NaN |                NaN
 3 | 2018 |     1 |        345 |     2 |                  NaN |                NaN
 4 | 2018 |     2 |        123 |     3 |                  3.6 |                  5
 5 | 2018 |     2 |        345 |     2 |                  3.6 |                  2
 6 | 2018 |     3 |        123 |     4 |                  2.5 |                  3
 7 | 2018 |     3 |        234 |     6 |                  2.5 |                  0
 8 | 2018 |     3 |        567 |     7 |                  2.5 |                  0
 9 | 2019 |     1 |        234 |     4 |                  5.6 |                  6
10 | 2019 |     1 |        567 |     3 |                  5.6 |                  7
I would also like to add prev_year_mean_sale and prev_year_id_sale.
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3.
My current code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use the dataframe's merge():
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales", "prev_month_id"sale"}),
how="left",
left_on=["year", "prev_month", "product_id"],
right_on=["year", "month", "product_id"])
The result will have more columns than you need; you can drop() some of them and/or rename() others.
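For the other requested column, here is a hedged sketch for prev_month_mean_sale built on the same merge idea (my own illustration; it ignores year boundaries and gaps, so January and months with no preceding data come out as NaN rather than falling back to the latest earlier month as in the expected output above):
# mean of all sales per (year, month), shifted so it lines up with the next month
monthly_mean = df.groupby(['year', 'month'], as_index=False)['sales'].mean()
monthly_mean['month'] += 1
monthly_mean = monthly_mean.rename(columns={'sales': 'prev_month_mean_sale'})
result = df.merge(monthly_mean, on=['year', 'month'], how='left')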

Pandas: How to create a multi-indexed pivot

I have a set of experiments defined by two variables: scenario and height. For each experiment, I take 3 measurements: result 1, 2 and 3.
The dataframe that collects all the results looks like this:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Scenario']= np.repeat(['Scenario a','Scenario b','Scenario c'],3)
df['height'] = np.tile([0,1,2],3)
df['Result 1'] = np.arange(1,10)
df['Result 2'] = np.arange(20,29)
df['Result 3'] = np.arange(30,39)
If I run the following:
mypiv = df.pivot('Scenario','height').transpose()
writer = pd.ExcelWriter('test_df_pivot.xlsx')
mypiv.to_excel(writer,'test df pivot')
writer.save()
I obtain a dataframe where columns are the scenarios, and the rows have a multi-index defined by result and height:
+----------+--------+------------+------------+------------+
|          | height | Scenario a | Scenario b | Scenario c |
+----------+--------+------------+------------+------------+
| Result 1 |      0 |          1 |          4 |          7 |
|          |      1 |          2 |          5 |          8 |
|          |      2 |          3 |          6 |          9 |
| Result 2 |      0 |         20 |         23 |         26 |
|          |      1 |         21 |         24 |         27 |
|          |      2 |         22 |         25 |         28 |
| Result 3 |      0 |         30 |         33 |         36 |
|          |      1 |         31 |         34 |         37 |
|          |      2 |         32 |         35 |         38 |
+----------+--------+------------+------------+------------+
How can I create a pivot where the indices are swapped, i.e. height first, then result?
I couldn't find a way to create it directly. I managed to get what I want by swapping the levels and then re-sorting the results:
mypiv2 = mypiv.swaplevel(0,1 , axis=0).sortlevel(level=0,axis=0,sort_remaining=True)
but I was wondering if there is a more direct way.
You can first set_index and then stack with unstack:
print (df.set_index(['height','Scenario']).stack().unstack(level=1))
Scenario         Scenario a  Scenario b  Scenario c
height
0      Result 1           1           4           7
       Result 2          20          23          26
       Result 3          30          33          36
1      Result 1           2           5           8
       Result 2          21          24          27
       Result 3          31          34          37
2      Result 1           3           6           9
       Result 2          22          25          28
       Result 3          32          35          38
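A hedged alternative that builds the swapped layout in one go, by melting the result columns to long form and then pivoting with height first (a sketch using the df defined above):
# long form: one row per (Scenario, height, Result) with its value
long_df = df.melt(id_vars=['Scenario', 'height'], var_name='Result')
mypiv2 = long_df.pivot_table(index=['height', 'Result'], columns='Scenario', values='value')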
