Pandas: Group by and conditional sum based on value of current row - python

My dataframe looks like this:
customer_nr  order_value  year_ordered  payment_successful
          1           50          1980                   1
          1           75          2017                   0
          1           10          2020                   1
          2           55          2000                   1
          2          300          2007                   1
          2           15          2010                   0
I want to know the total amount a customer has successfully paid in the years before, for a specific order.
The expected output is as follows:
customer_nr  order_value  year_ordered  payment_successful  total_successfully_previously_paid
          1           50          1980                   1                                   0
          1           75          2017                   0                                  50
          1           10          2020                   1                                  50
          2           55          2000                   1                                   0
          2          300          2007                   1                                  55
          2           15          2010                   0                                 355
Closest I've gotten is this:
df.groupby(['customer_nr', 'payment_successful'], as_index=False)['order_value'].sum()
That just gives me the all-time sums of successfully and unsuccessfully paid amounts per customer. It doesn't restrict the sum to orders placed in previous years.

Try:
df["total_successfully_previously_paid"] = (df["payment_successful"].mul(df["order_value"])
.groupby(df["customer_nr"])
.transform(lambda x: x.cumsum().shift().fillna(0))
)
>>> df
customer_nr ... total_successfully_previously_paid
0 1 ... 0.0
1 1 ... 50.0
2 1 ... 50.0
3 2 ... 0.0
4 2 ... 55.0
5 2 ... 355.0
[6 rows x 5 columns]
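Note that this assumes each customer's rows are already in chronological order (as in the sample). If they might not be, sort first; a minimal sketch:
# ensure chronological order per customer before taking the cumulative sum
df = df.sort_values(["customer_nr", "year_ordered"]).reset_index(drop=True)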

Related

Random sampling of the data in Python

I have a dataframe with several columns and I need to re-sample from that data with more weight given to one category. I think np.random.choice should work, but I'm not sure how to implement it. Below is the example data from which I want to sample randomly, with a 70% probability of getting an expensive home (based on the Expensive_home column, value = 1) and a 30% probability for Expensive_home = 0. How can I create the re-sampled data file? Thank you!
ID Lot_Area Year_Built Full_Bath Bedroom Sale_Price Expensive_home
1 31770 1960 1 3 215000 0
2 11622 1961 1 2 105000 0
3 5389 1995 2 2 236500 0
4 8402 1998 2 3 180400 0
5 10176 1990 1 2 171500 0
6 6820 1985 1 1 212000 0
7 53504 2003 3 4 538000 1
8 12134 1988 2 4 164000 0
9 11394 2010 1 1 394432 1
10 19138 1951 1 2 141000 0
11 13175 1978 2 3 210000 0
12 11751 1977 2 3 190000 0
13 10625 1974 2 3 170000 0
14 7500 2000 2 3 216000 0
15 11241 1970 1 2 149000 0
16 2280 1978 2 3 146000 0
17 12858 2009 2 3 376162 1
18 12883 2009 2 3 290941 0
19 12182 2005 2 3 220000 0
20 11520 2005 2 3 275000 0
The result should be a similar data file, but with more randomly picked 1s in the last column.
To create a dataframe of the same length, giving expensive homes a higher chance of being selected and allowing replacement, use:
weights = df['Expensive_home'].replace({0: 30, 1: 70})
df1 = df.sample(len(df), replace=True, weights=weights)
To create a dataframe with all the expensive homes plus a 30% sample of the non-expensive ones, you can do:
expensive = df['Expensive_home'].astype(bool)
df2 = pd.concat([df[expensive], df[~expensive].sample(frac=0.3)])
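If you specifically want np.random.choice, as the question suggests, a possible sketch (df_resampled is a name introduced here for illustration; np.random.choice needs probabilities that sum to 1, not raw weights):
import numpy as np
# sample row labels with replacement, weighting expensive homes higher
p = df['Expensive_home'].replace({0: 30, 1: 70}).astype(float)
p = p / p.sum()  # normalize weights into probabilities
idx = np.random.choice(df.index, size=len(df), replace=True, p=p)
df_resampled = df.loc[idx]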

Groupby and get value offset by one year in pandas

My goal is to follow each ID that belongs to Category==1 on a given date, one year later. So I have a dataframe like this:
Period ID Amount Category
20130101 1 100 1
20130101 2 150 1
20130101 3 100 1
20130201 1 90 1
20130201 2 140 1
20130201 3 95 1
20130201 5 250 0
. . .
20140101 1 40 1
20140101 2 70 1
20140101 5 160 0
20140201 1 35 1
20140201 2 65 1
20140201 5 150 0
For example, in 20130201 I have 3 IDs that belong to Category 1: 1, 2, 3, but just 2 of them are present in 20140201: 1 and 2. So I need to get the value of Amount one year later, only for those IDs, like this:
Period ID Amount Category Amount_t1
20130101 1 100 1 40
20130101 2 150 1 70
20130101 3 100 1 nan
20130201 1 90 1 35
20130201 2 140 1 65
20130201 3 95 1 nan
20130201 5 250 0 nan
. . .
20140101 1 40 1 nan
20140101 2 70 1 nan
20140101 5 160 0 nan
20140201 1 35 1 nan
20140201 2 65 1 nan
20140201 5 150 0 nan
So, if the ID doesn't appear the next year or belongs to Category 0, I'll get a nan. My first approach was to get the list of unique IDs in each Period and then try to map that to the next year, using some combination of groupby() and isin(), like this:
aux = df[df.Category==1].groupby('Period').ID.unique()
aux.index = aux.index + pd.DateOffset(years=1)
But I didn't know how to keep going. I'm thinking some kind of groupby('ID') might be more efficient too. If it were a simple shift() that would be easy, but I'm not sure how to get the value offset by a year within each group.
You can create lagged features with an exact merge after you manually lag one of the join keys.
import pandas as pd

# Convert to datetime so we can do calendar-year subtraction
df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')

# Create a copy holding the lagged features. Here I'll split the steps out.
df2 = df.copy()
df2['Period'] = df2.Period - pd.offsets.DateOffset(years=1)  # 1 year lag
df2 = df2.rename(columns={'Amount': 'Amount_t1'})

# Keep only the rows you want to merge
df2 = df2[df2.Category.eq(1)]

# Bring in the lagged features
df.merge(df2, on=['Period', 'ID', 'Category'], how='left')
Period ID Amount Category Amount_t1
0 2013-01-01 1 100 1 40.0
1 2013-01-01 2 150 1 70.0
2 2013-01-01 3 100 1 NaN
3 2013-02-01 1 90 1 35.0
4 2013-02-01 2 140 1 65.0
5 2013-02-01 3 95 1 NaN
6 2013-02-01 5 250 0 NaN
7 2014-01-01 1 40 1 NaN
8 2014-01-01 2 70 1 NaN
9 2014-01-01 5 160 0 NaN
10 2014-02-01 1 35 1 NaN
11 2014-02-01 2 65 1 NaN
12 2014-02-01 5 150 0 NaN
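Note that merge returns a new frame rather than modifying df in place, so assign the result back if you want to keep Amount_t1:
df = df.merge(df2, on=['Period', 'ID', 'Category'], how='left')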

Calculating distance to a row with a certain value

I am working with a dataset in pandas in which maintenance work is done at various sites. The maintenance is done every four years at each site. I want to find the years since the last maintenance action at each site. I am giving only two sites in the following example, but the original dataset has thousands of them. My data only covers the years 2014 through 2017.
Action = 0 means no action has been performed that year; Action = 1 means some action has been done. Measurement is a performance reading related to the effect of the action. The action can happen in any year. I know that if an action has been performed in year Y, the previous maintenance was performed in year Y-4.
Site Year Action Measurement
A 2014 0 100
A 2015 0 150
A 2016 1 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 0 60
B 2017 0 110
Given this dataset, first I want to have a temporary dataset like this:
Site Year Action Measurement Years_Since_Last_Action
A 2014 0 100 2
A 2015 0 150 3
A 2016 1 300 4
A 2017 0 80 1
B 2014 0 200 3
B 2015 1 250 4
B 2016 0 60 1
B 2017 0 110 2
Then, I want to have:
Years_Since_Last_Action Mean_Measurement
1 70
2 105
3 175
4 275
Thanks in advance!
For your first question:
import numpy as np

# Get the year of the action for each site, then map it back onto every row
s = df.loc[df.Action == 1, ['Site', 'Year']].set_index('Site')
df['Newyear'] = df.Site.map(s.Year)
s1 = df.Year - df.Newyear
# np.where: wrap non-positive offsets around the 4-year cycle
df['action since last year'] = np.where(s1 <= 0, s1 + 4, s1)
df
Out[167]:
Site Year Action Measurement Newyear action since last year
0 A 2014 0 100 2016 2
1 A 2015 0 150 2016 3
2 A 2016 1 300 2016 4
3 A 2017 0 80 2016 1
4 B 2014 0 200 2015 3
5 B 2015 1 250 2015 4
6 B 2016 0 60 2015 1
7 B 2017 0 110 2015 2
For the second question:
df.groupby('action since last year').Measurement.mean()
Out[168]:
action since last year
1 70
2 105
3 175
4 275
Name: Measurement, dtype: int64
First, build your intermediate result using groupby, ffill/bfill, and a little arithmetic.
import numpy as np

v = (df.Year
     .where(df.Action.astype(bool))   # keep only the action years
     .groupby(df.Site)
     .ffill()
     .bfill()
     .sub(df.Year))                   # action year minus each row's year
df['Years_Since_Last_Action'] = np.select([v > 0, v < 0], [4 - v, v.abs()], default=4)
df
Site Year Action Measurement Years_Since_Last_Action
0 A 2014 0 100 2.0
1 A 2015 0 150 3.0
2 A 2016 1 300 4.0
3 A 2017 0 80 1.0
4 B 2014 0 200 3.0
5 B 2015 1 250 4.0
6 B 2016 0 60 1.0
7 B 2017 0 110 2.0
Next,
df.groupby('Years_Since_Last_Action', as_index=False).Measurement.mean()
Years_Since_Last_Action Measurement
0 1.0 70
1 2.0 105
2 3.0 175
3 4.0 275
How about:
# index of each site's Action == 1 row, broadcast to all of that site's rows
delta_year = df.loc[df.groupby("Site")["Action"].transform("idxmax"), "Year"].values
years_since = ((df.Year - delta_year) % 4).replace(0, 4)
df["Years_Since_Last_Action"] = years_since
out = df.groupby("Years_Since_Last_Action")["Measurement"].mean().reset_index()
out = out.rename(columns={"Measurement": "Mean_Measurement"})
which gives me
In [230]: df
Out[230]:
Site Year Action Measurement Years_Since_Last_Action
0 A 2014 0 100 2
1 A 2015 0 150 3
2 A 2016 1 300 4
3 A 2017 0 80 1
4 B 2014 0 200 3
5 B 2015 1 250 4
6 B 2016 0 60 1
7 B 2017 0 110 2
In [231]: out
Out[231]:
Years_Since_Last_Action Mean_Measurement
0 1 70
1 2 105
2 3 175
3 4 275
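As a quick sanity check of the modulo trick (a standalone sketch, not part of the answer above), site A's action year is 2016:
import pandas as pd
years = pd.Series([2014, 2015, 2016, 2017])
# (year - action_year) % 4, with 0 mapped to 4 for the action year itself
print(((years - 2016) % 4).replace(0, 4).tolist())  # [2, 3, 4, 1]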

Pandas: Inventory recalculation given a final value

I'm coding a Python script to recalculate the inventory of a specific SKU from today back over the past 365 days, given the actual stock. For that I'm using a pandas DataFrame, as shown below:
Index DATE SUM_IN SUM_OUT
0 5/12/18 500 0
1 5/13/18 0 -403
2 5/14/18 0 -58
3 5/15/18 0 -39
4 5/16/18 100 0
5 5/17/18 0 -98
6 5/18/18 276 0
7 5/19/18 0 -139
8 5/20/18 0 -59
9 5/21/18 0 -70
The dataframe presents the sum of quantities IN and OUT of the warehouse, grouped by date. My intention is to add a column named "STOCK" that presents the stock level of the SKU on each day. What I have is the actual stock level (index 9), so I need to recalculate all the levels day by day back through the date series (from index 9 to index 0).
In Excel it's easy: I can put the actual level in the last row and just extend the calculation until I reach the row at index 0 (in my spreadsheet, column E holds the formula and column G the desired output).
Can someone help me achieve this result?
I already have the stock level of the last day (i.e. 5/21/2018 is equal to 10). What I need is to place the number 10 at index 9 and calculate the stock levels of the past days, from index 8 back to 0.
The desired output should be:
Index DATE TRANSACTION_IN TRANSACTION_OUT SUM_IN SUM_OUT STOCK
0 5/12/18 1 0 500 0 500
1 5/13/18 0 90 0 -403 97
2 5/14/18 0 11 0 -58 39
3 5/15/18 0 11 0 -39 0
4 5/16/18 1 0 100 0 100
5 5/17/18 0 17 0 -98 2
6 5/18/18 1 0 276 0 278
7 5/19/18 0 12 0 -139 139
8 5/20/18 0 4 0 -59 80
9 5/21/18 0 7 0 -70 10
(Updated)
last_stock = 10  # replace with your actual closing stock level
a = (df.SUM_IN + df.SUM_OUT).cumsum()
df["STOCK"] = a - (a.iloc[-1] - last_stock)
Use cumsum to create the key for groupby, then use cumsum again:
import numpy as np

df['SUM_IN'].replace(0, np.nan).ffill() + df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()
Out[292]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 276.0
7 137.0
8 78.0
9 8.0
dtype: float64
Update
key = df['SUM_IN'].gt(0).cumsum()
s = df['SUM_IN'].replace(0, np.nan).ffill() + df.groupby(key).SUM_OUT.cumsum() - df.STOCK
df['SUM_IN'].replace(0, np.nan).ffill() + df.groupby(key).SUM_OUT.cumsum() - s.groupby(key).bfill().fillna(0)
Out[318]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 278.0
7 139.0
8 80.0
9 10.0
dtype: float64
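The heart of both versions is the grouping key df['SUM_IN'].gt(0).cumsum(), which labels each run of rows that follows an inflow. A quick way to inspect it on the sample data:
key = df['SUM_IN'].gt(0).cumsum()
print(key.tolist())  # [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]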

Why am I not able to drop values within columns on pandas using python3?

I have a DataFrame (df) with various columns. In this assignment I have to find the difference between summer gold medals and winter gold medals, relative to total medals, for each country using stats about the olympics.
I must only include countries which have at least one gold medal. I am trying to use dropna() to exclude the countries that don't have at least one medal. My current code:
def answer_three():
    df['medal_count'] = df['Gold'] - df['Gold.1']
    df['medal_count'].dropna()
    df['medal_dif'] = df['medal_count'] / df['Gold.2']
    df['medal_dif'].dropna()
    return df.head()

print(answer_three())
This results in the following output:
# Summer Gold Silver Bronze Total # Winter Gold.1 \
Afghanistan 13 0 0 2 2 0 0
Algeria 12 5 2 8 15 3 0
Argentina 23 18 24 28 70 18 0
Armenia 5 1 2 9 12 6 0
Australasia 2 3 4 5 12 0 0
Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 \
Afghanistan 0 0 0 13 0 0 2
Algeria 0 0 0 15 5 2 8
Argentina 0 0 0 41 18 24 28
Armenia 0 0 0 11 1 2 9
Australasia 0 0 0 2 3 4 5
Combined total ID medal_count medal_dif
Afghanistan 2 AFG 0 NaN
Algeria 15 ALG 5 1.0
Argentina 70 ARG 18 1.0
Armenia 12 ARM 1 1.0
Australasia 12 ANZ 3 1.0
I need to get rid of both the '0' values in "medal_count" and the NaN in "medal_dif".
I am also aware that the maths/the way I have written the code is probably not the right way to solve the question, but I think I need to start by dropping these values. Any help with any of the above is greatly appreciated.
You need to pass an axis into dropna, e.g. axis=1. An axis of 0 means rows and 1 means columns; 0 is the default. As you can see below, the entire column is dropped for axis=1.
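For example (a small illustrative sketch, not from the original answer; note also that dropna returns a new object rather than modifying anything in place, which is why the dropna() calls in the question have no effect):
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4, 5, 6]})
print(demo.dropna())        # axis=0 (default): drops the row containing NaN
print(demo.dropna(axis=1))  # axis=1: drops column "a", which contains NaN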
