Assign weights to observations based on date difference and sequence condition - python

I already asked a question about the same problem and @mozway helped a lot.
However, my logic for the weight assignment was wrong.
I need to form the following dataframe with a weight column:
id date status weight diff_in_days comment
-----------------------------------------------------------------
0 2 2019-02-03 reserve 0.003 0 1 / diff_days
1 2 2019-12-31 reserve 0.001 331 since diff to next is 1 day
2 2 2020-01-01 reserve 0.9 1 since the next date status is sold
3 2 2020-01-02 sold 1 1 sold
4 3 2020-01-03 reserve 0.001 0 since diff to next is 1 day
5 4 2020-01-03 booked 0.9 0 since the next date status is sold
6 3 2020-02-04 reserve 0.9 1 since the next date status is sold
7 4 2020-02-06 sold 1 3 sold
7 3 2020-02-07 sold 1 3 sold
To make the diff_in_days column I use:
df['diff_in_days'] = df.groupby('flat_id')['date'].diff().dt.days.fillna(0)
Is there a way to implement this pseudo-code without a for-loop:
for i in df.iterrows():
    df['weight'][i] = 1 / df['diff_in_days'][i+1]
    if df['status'][i+1] == 'sold':  # for each flat_id
        df['weight'][i] = 0.9
    if df['status'][i] == 'sold':
        df['weight'][i] = 1

Managed to do it like this:
import numpy as np

df.sort_values(['flat_id', 'date'], inplace=True)

# find the diff in days between dates per flat_id and shift it one row back
s = df.groupby('flat_id')['date'].diff().dt.days.shift(-1)

# assign the largest weight to rows with status == 'sold'
df['weight'] = np.where(df['status'].eq('sold'), s.max() + 10, s.fillna(0))

# now find rows with status == 'sold' and shift back one row to find their predecessors
s1 = df['status'].eq('sold').shift(-1)
s1 = s1.fillna(False)

# assign them the second-largest weight
df.loc[s1, 'weight'] = s.max() + 5

df['weight'] = df['weight'].ffill()
Final dataframe:
flat_id date status weight
4 2 2019-02-04 reserve 331.0
0 2 2020-01-01 reserve 336.0
1 2 2020-01-02 sold 341.0
2 3 2020-01-03 reserve 1.0
5 3 2020-01-04 reserve 336.0
7 3 2020-02-07 sold 341.0
3 4 2020-01-03 booked 336.0
6 4 2020-02-06 sold 341.0
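For completeness, here is a minimal vectorized sketch that produces the 1/diff, 0.9, 1 weights of the pseudo-code directly, rather than the rank-style weights above. It assumes df is already sorted by flat_id and date, and that date is a datetime column:

# days and status of the next observation within the same flat_id
diff_days = df.groupby('flat_id')['date'].diff().dt.days
next_diff = diff_days.groupby(df['flat_id']).shift(-1)
next_status = df.groupby('flat_id')['status'].shift(-1)

df['weight'] = 1 / next_diff                     # default: 1 / diff_in_days of the next row
df.loc[next_status.eq('sold'), 'weight'] = 0.9   # row directly before a 'sold' row
df.loc[df['status'].eq('sold'), 'weight'] = 1    # the 'sold' row itself
# rows with no successor in their flat_id keep NaN here; fill as needed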

Related

How to find the hour of the day when sales are highest for products in a .csv using pandas, and merge columns based on similar first names?

The .csv looks like this:
Date+time: shirt shirt (dress) shorts shorts (dress) accessories
2019-01-01 5:00 5 3 2 2 3
2019-01-01 5:05 1 1 4 1 5
2019-01-01 5:10 1 2 1 2 9
...
2019-12-31 11:55 5 2 1 1 7
Is there a way to combine the columns that share a common first name? For instance, the program should look for columns with a similar first name, such as shirt and shirt (dress); these should be merged together and treated as one entity, and the same for shorts.
How would I go about finding the highest purchasing hour for each day of the year, and then, for those highest purchasing hours, finding the percentage of the total for each product?
You could trim off the second part of the columns, so that the ones with the same first name are changed to have the same whole name:
df.columns = df.columns.str.split(' ').str[0]
Output:
>>> df
Date+time: shirt shirt shorts shorts accessories
0 2019-01-01 5:00 5 3 2 2 3
1 2019-01-01 5:05 1 1 4 1 5
2 2019-01-01 5:10 1 2 1 2 9
3 2019-12-31 11:55 5 2 1 1 7
Then, sum the columns with the same names together:
new_df = df.groupby(level=0, axis=1).sum()
Output:
>>> new_df
Date+time: accessories shirt shorts
0 2019-01-01+5:00 3 8 4
1 2019-01-01+5:05 5 2 5
2 2019-01-01+5:10 9 3 3
3 2019-12-31+11:55 7 7 2
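If your pandas version warns about axis=1 in groupby (it is deprecated in recent releases), an equivalent sketch is to transpose, group, and transpose back; this assumes the 'Date+time:' column name from the sample:

new_df = (df.set_index('Date+time:')      # keep the timestamp out of the sums
            .T.groupby(level=0).sum().T   # sum columns sharing the same name
            .reset_index())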

Pandas: Create new column and populate with value from previous row based on conditions

I have the following dataframe:
import pandas as pd

df = pd.DataFrame({
    'KEY': ['1', '1', '1', '1', '1', '1', '1', '2', '2'],
    'DATE': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-08', '2020-01-08', '2020-01-08', '2020-01-08', '2020-02-01', '2020-02-01'],
    'ENDNO': ['1000', '1000', '1000', '2000', '2000', '2000', '2000', '400', '400'],
    'ITEM': ['PAPERCLIPS', 'BINDERS', 'STAPLES', 'PAPERCLIPS', 'BINDERS', 'STAPLES', 'TAPE', 'PENCILS', 'PENS'],
})
KEY DATE ENDNO ITEM
1 2020-01-01 1000 PAPERCLIPS
1 2020-01-01 1000 BINDERS
1 2020-01-01 1000 STAPLES
1 2020-01-08 2000 PAPERCLIPS
1 2020-01-08 2000 BINDERS
1 2020-01-08 2000 STAPLES
1 2020-01-08 2000 TAPE
2 2020-02-01 400 PENCILS
2 2020-02-01 400 PENS
I need to add a new column called "STARTNO" and populate it based on multiple conditions:
if KEY <> KEY of row above, STARTNO = 0
else
(if DATE = DATE of row above, STARTNO = STARTNO of row above
else STARTNO = ENDNO of row above)
It should end up looking something like this:
KEY DATE STARTNO ENDNO ITEM
1 2020-01-01 0 1000 PAPERCLIPS
1 2020-01-01 0 1000 BINDERS
1 2020-01-01 0 1000 STAPLES
1 2020-01-08 1000 2000 PAPERCLIPS
1 2020-01-08 1000 2000 BINDERS
1 2020-01-08 1000 2000 STAPLES
1 2020-01-08 1000 2000 TAPE
2 2020-02-01 0 400 PENCILS
2 2020-02-01 0 400 PENS
If I were just evaluating one statement, I know I could use a lambda, but I'm not sure how to do a nested statement in pandas and reference the row above.
Would someone please point me in the right direction?
Thanks!
ETA:
Quang Hoang's answer almost got me what I needed. I realized I missed one aspect of my initial list.
I've added a new item called "TAPE" and updated the dataframe script above.
Applying the groupby clause works well for all items except TAPE. With TAPE, it puts the STARTNO back at 0; however, I actually need the STARTNO to be the same as the ENDNO for the previous items with the same KEY and DATE. If I change the code to:
df['STARTNO'] = df.groupby(['KEY','DATE'])['ENDNO'].shift(fill_value=0)
it starts the STARTNO back at 0 whenever the date changes, which is incorrect.
How do I change the code so that it takes the ENDNO for the previous row when the KEY and DATE match?
I think this is groupby().shift():
df['STARTNO'] = df.groupby(['KEY','ITEM'])['ENDNO'].shift(fill_value=0)
Output:
KEY DATE ENDNO ITEM STARTNO
0 1 2020-01-01 1000 PAPERCLIPS 0
1 1 2020-01-01 1000 BINDERS 0
2 1 2020-01-01 1000 STAPLES 0
3 1 2020-01-08 2000 PAPERCLIPS 1000
4 1 2020-01-08 2000 BINDERS 1000
5 1 2020-01-08 2000 STAPLES 1000
6 2 2020-02-01 400 PENCILS 0
7 2 2020-02-01 400 PENS 0
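For the edited requirement (TAPE should inherit the ENDNO of the previous KEY/DATE block rather than restart at 0), a hedged sketch is to compute the shift on one row per (KEY, DATE) block and merge it back. Column names come from the question, and it assumes df does not already carry a STARTNO column:

# one row per (KEY, DATE) block, with the previous block's ENDNO within each KEY
blocks = (df.drop_duplicates(['KEY', 'DATE'])[['KEY', 'DATE', 'ENDNO']]
            .assign(STARTNO=lambda d: d.groupby('KEY')['ENDNO'].shift(fill_value=0)))

# broadcast the block-level STARTNO back to every item row
df = df.merge(blocks[['KEY', 'DATE', 'STARTNO']], on=['KEY', 'DATE'], how='left')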

Creating a new dataframe from a multi index dataframe using some conditions

I have a time series dataset which is basically consumption data of materials over the past 5 years
Material No Consumption Date Consumption
A 2019-06-01 1
A 2019-07-01 2
A 2019-08-01 3
A 2019-09-01 4
A 2019-10-01 0
A 2019-11-01 0
A 2019-12-01 0
A 2020-01-01 1
A 2020-02-01 2
A 2020-03-01 3
A 2020-04-01 0
A 2020-05-01 0
B 2019-06-01 0
B 2019-07-01 0
B 2019-08-01 0
B 2019-09-01 4
B 2019-10-01 0
B 2019-11-01 0
B 2019-12-01 0
B 2020-01-01 4
B 2020-02-01 2
B 2020-03-01 8
B 2020-04-01 0
B 2020-05-01 0
From the above dataframe, I want to see the number of months in which the material had at least 1 unit of consumption. The output dataframe should look something like this.
Material no_of_months(Jan2020-May2020) no_of_months(Jun2019-May2020)
A 3 7
B 3 4
Currently I'm sub-setting the dataframe and using a groupby to count the unique entries with non-zero consumption. However, this requires creating multiple dataframes for different periods and then merging them. I was wondering if this could be done in a better way, perhaps using dictionaries.
consumption_jan20_may20 = consumption.loc[consumption['Consumption Date'] >= '2020-01-01', ['Material No', 'Consumption Date', 'Consumption']]
# grouper here is a monthly pd.Grouper on 'Consumption Date', defined elsewhere
consumption_jan20_may20 = consumption_jan20_may20.groupby([pd.Grouper(key='Material No'), grouper])['Consumption'].count().reset_index()
consumption_jan20_may20 = consumption_jan20_may20.groupby('Material No').count().reset_index()
consumption_jan20_may20.columns = ['Material No', 'no_of_months(Jan2020-May2020)', 'dummy']
consumption_jan20_may20 = consumption_jan20_may20[['Material No', 'no_of_months(Jan2020-May2020)']]
You can first limit the data you are investigating to a range of months. Let's say you want to check the data for the first 5 months:
df = df[:5]
Then you can use the code below to find the months in which the material usage is not zero:
df_nonezero = df[df['Consumption'] != 0]
If you want to see in how many months the consumption is not zero, you can simply take the length of the new dataframe:
len(df_nonezero)
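A hedged sketch that produces the requested two-window output directly is shown below. It assumes 'Consumption Date' is already a datetime column, the dataframe is named consumption as in the question, and there is one row per material per month; the helper name is made up:

import pandas as pd

def months_with_consumption(d, start, end, label):
    # months (unique dates) in the window where the material consumed at least one unit
    mask = d['Consumption Date'].between(start, end) & d['Consumption'].gt(0)
    return d.loc[mask].groupby('Material No')['Consumption Date'].nunique().rename(label)

out = pd.concat(
    [months_with_consumption(consumption, '2020-01-01', '2020-05-31', 'no_of_months(Jan2020-May2020)'),
     months_with_consumption(consumption, '2019-06-01', '2020-05-31', 'no_of_months(Jun2019-May2020)')],
    axis=1,
).fillna(0).astype(int).reset_index()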

creating daily price change for a product on a pandas dataframe

I am working on a data set with the following columns:
order_id
order_item_id
product mrp
units
sale_date
I want to create a new column which shows how much the mrp changed since the last time this product was sold. Is there a way I can do this with a pandas dataframe?
Sorry if this question is very basic, but I am pretty new to pandas.
Sample data:
expected data:
For each row of the data I want to check the amount of price change for the last time the product was sold.
You can do this as follows:
# define a function that applies a rolling window calculation,
# taking the difference between the last value and the current value
def calc_mrp(ser):
    # in case you want the relative change, just
    # divide by x[1] or x[0] in the lambda function
    return ser.rolling(window=2).apply(lambda x: x[1] - x[0])

# apply this to the grouped 'product_mrp' column
# and store the result in a new column
df['mrp_change'] = df.groupby('product_id')['product_mrp'].apply(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows for which there is no previous order with the same product_id.
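For what it's worth, the same per-product change can be computed with a single grouped diff (same assumed column names):

# difference from the previous row within each product_id
df['mrp_change'] = df.groupby('product_id')['product_mrp'].diff()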

make a shift by index with a pandas dataframe

Is there a pandas way to do this:
from datetime import timedelta

predicted_sells = []
for row in df.values:
    index_tms = row[0]
    delta = index_tms + timedelta(hours=1)
    try:
        sells_to_predict = df.loc[delta]['cars_sold']
    except KeyError:
        sells_to_predict = None
    predicted_sells.append(sells_to_predict)
df['sell_to_predict'] = predicted_sells
Example explanation:
sell is the number of cars I sold at time tms. sell_to_predict is the number of cars I sold the hour after; I want to predict that, so I want to build a new column containing, at time tms, the number of cars I will sell at time tms + 1h.
Before my code runs, the data looks like this:
tms sell
2015-11-23 15:00:00 6
2015-11-23 16:00:00 2
2015-11-23 17:00:00 10
After it runs, it looks like this:
tms sell sell_to_predict
2015-11-23 15:00:00 6 2
2015-11-23 16:00:00 2 10
2015-11-23 17:00:00 10 NaN
I create a new column based on a shift of another column, but it's not a shift by a fixed number of rows; it's a shift based on the index (here the index is a timestamp).
Here is another example, a little more complex.
Before:
sell random
store hour
1 1 1 9
2 7 7
2 1 4 3
2 2 3
After:
sell random predict
store hour
1 1 1 9 7
2 7 7 NaN
2 1 4 3 2
2 2 3 NaN
Have you tried shift?
E.g.:
df = pd.DataFrame(list(range(4)))
df.columns = ['sold']
df['predict'] = df.sold.shift(-1)
df
sold predict
0 0 1
1 1 2
2 2 3
3 3 NaN
The answer was to resample so there are no holes, and then apply the answer to this question: How do you shift a Pandas DataFrame with a multiindex?
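A brief sketch of that idea for both examples above; it assumes tms is the DatetimeIndex of the hourly frame and that the MultiIndex level is named 'store':

# hourly example: fill missing hours so that one row ahead means one hour ahead
hourly = df.resample('1H').asfreq()
hourly['sell_to_predict'] = hourly['sell'].shift(-1)

# store/hour MultiIndex example: shift within each store
df['predict'] = df.groupby(level='store')['sell'].shift(-1)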
