I have a time series dataset which is basically consumption data of materials over the past 5 years
Material No Consumption Date Consumption
A 2019-06-01 1
A 2019-07-01 2
A 2019-08-01 3
A 2019-09-01 4
A 2019-10-01 0
A 2019-11-01 0
A 2019-12-01 0
A 2020-01-01 1
A 2020-02-01 2
A 2020-03-01 3
A 2020-04-01 0
A 2020-05-01 0
B 2019-06-01 0
B 2019-07-01 0
B 2019-08-01 0
B 2019-09-01 4
B 2019-10-01 0
B 2019-11-01 0
B 2019-12-01 0
B 2020-01-01 4
B 2020-02-01 2
B 2020-03-01 8
B 2020-04-01 0
B 2020-05-01 0
From the above dataframe, I want to see the number of months in which the material had at least 1 unit of consumption. The output dataframe should look something like this.
Material no_of_months(Jan2020-May2020) no_of_months(Jun2019-May2020)
A 3 7
B 3 4
Currently I'm subsetting the data frame and using a groupby to count the unique entries with non-zero consumption. However, this requires creating a separate data frame for each period and then merging them. I was wondering whether this could be done in a better way, for example using dictionaries.
consumption_jan20_may20 = consumption.loc[consumption['Consumption Date']>='2020-01-01',['Material No','Consumption Date','Consumption']]
consumption_jan20_may20 = consumption_jan20_may20.groupby([pd.Grouper(key='Material No'),grouper])['Consumption'].count().reset_index()
consumption_jan20_may20 = consumption_jan20_may20.groupby('Material No').count().reset_index()
consumption_jan20_may20.columns = ['Material No','no_of_months(Jan2020-May2020)','dummy']
consumption_jan20_may20 = consumption_jan20_may20[['Material No','no_of_months(Jan2020-May2020)']]
You can first limit the data to the range of months you are investigating. Let's say you want to check the first 6 rows (six months of data for one material):
df = df[:6]
Then you can use the code below to keep only the months in which the material's consumption is not zero:
df_nonzero = df[df['Consumption'] != 0]
If you want to see how many months have non-zero consumption, simply take the length of the new data frame:
len(df_nonzero)
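The question asks for one count per material and per period, ideally driven by a dictionary. Below is a minimal sketch of that idea, not the approach from the answer above. It assumes one row per material per month (as in the sample data) and uses a hypothetical `periods` dictionary that maps each output column name to inclusive start and end dates.

```python
import pandas as pd

# Rebuild the sample data from the question.
dates = pd.date_range("2019-06-01", "2020-05-01", freq="MS")
consumption = pd.DataFrame({
    "Material No": ["A"] * 12 + ["B"] * 12,
    "Consumption Date": list(dates) * 2,
    "Consumption": [1, 2, 3, 4, 0, 0, 0, 1, 2, 3, 0, 0,
                    0, 0, 0, 4, 0, 0, 0, 4, 2, 8, 0, 0],
})

# Hypothetical mapping: output column name -> (start, end), both inclusive.
periods = {
    "no_of_months(Jan2020-May2020)": ("2020-01-01", "2020-05-01"),
    "no_of_months(Jun2019-May2020)": ("2019-06-01", "2020-05-01"),
}

nonzero = consumption["Consumption"] > 0
counts = {}
for label, (start, end) in periods.items():
    in_period = consumption["Consumption Date"].between(
        pd.Timestamp(start), pd.Timestamp(end)
    )
    # Summing a boolean mask per material counts the months that are
    # both inside the period and have non-zero consumption.
    counts[label] = (nonzero & in_period).groupby(consumption["Material No"]).sum()

result = pd.DataFrame(counts).rename_axis("Material No").reset_index()
print(result)
```

Adding another period only requires another entry in the dictionary; no intermediate data frames need to be merged.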
I already asked a question about the same problem, and @mozway helped a lot.
However, my logic for the weight assignment was wrong.
I need to build the following dataframe with a weight column:
id date status weight diff_in_days comment
-----------------------------------------------------------------
0 2 2019-02-03 reserve 0.003 0 1 / diff_days
1 2 2019-12-31 reserve 0.001 331 since diff to next is 1 day
2 2 2020-01-01 reserve 0.9 1 since the next date status is sold
3 2 2020-01-02 sold 1 1 sold
4 3 2020-01-03 reserve 0.001 0 since diff to next is 1 day
5 4 2020-01-03 booked 0.9 0 since the next date status is sold
6 3 2020-02-04 reserve 0.9 1 since the next date status is sold
7 4 2020-02-06 sold 1 3 sold
7 3 2020-02-07 sold 1 3 sold
To make diff_in_days column I use:
df['diff_in_days'] = df.groupby('flat_id')['date'].diff().dt.days.fillna(0)
Is there a way to implement this pseudo-code without a for-loop?
for i in df.iterrows():
    df['weight'][i] = 1 / df['diff_in_days'][i+1]
    # the next-row check applies within each flat_id
    if df['status'][i+1] == 'sold':
        df['weight'][i] = 0.9
    if df['status'][i] == 'sold':
        df['weight'][i] = 1
I managed to do it like this:
df.sort_values(['flat_id', 'date'], inplace=True)
# find the difference in days between consecutive dates per flat_id and shift it one row back
s = df.groupby('flat_id')['date'].diff().dt.days.shift(-1)
# assign the highest weight to rows with status == 'sold'
df['weight'] = np.where(df['status'].eq('sold'), s.max() + 10, s.fillna(0))
# find the rows with status == 'sold' and shift back one row to find their predecessors
s1 = df["status"].eq("sold").shift(-1, fill_value=False)
# assign them the second-highest weight
df.loc[s1, "weight"] = s.max() + 5
df["weight"] = df["weight"].ffill()
Final dataframe:
flat_id date status weight
4 2 2019-02-04 reserve 331.0
0 2 2020-01-01 reserve 336.0
1 2 2020-01-02 sold 341.0
2 3 2020-01-03 reserve 1.0
5 3 2020-01-04 reserve 336.0
7 3 2020-02-07 sold 341.0
3 4 2020-01-03 booked 336.0
6 4 2020-02-06 sold 341.0
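For comparison, here is a minimal sketch that follows the pseudo-code from the question literally (1 / days to the next row, 0.9 before a sale, 1 for a sale), which is not what the solution above does. The sample rows are illustrative, and the names `days_to_next` and `next_status` are introduced here. It does not try to reproduce hand-written values such as 0.001 in the question's table, since the rule behind those is not stated.

```python
import numpy as np
import pandas as pd

# Illustrative sample rows (an assumption, loosely based on the question).
df = pd.DataFrame({
    "flat_id": [2, 2, 2, 2, 3, 3, 3],
    "date": pd.to_datetime([
        "2019-02-03", "2019-12-31", "2020-01-01", "2020-01-02",
        "2020-01-03", "2020-02-04", "2020-02-07",
    ]),
    "status": ["reserve", "reserve", "reserve", "sold",
               "reserve", "reserve", "sold"],
})

df = df.sort_values(["flat_id", "date"]).reset_index(drop=True)
by_flat = df.groupby("flat_id")

# Days until the next row of the same flat (NaN on each flat's last row).
days_to_next = (by_flat["date"].shift(-1) - df["date"]).dt.days
# Status of the next row of the same flat.
next_status = by_flat["status"].shift(-1)

df["weight"] = np.where(
    df["status"].eq("sold"),
    1.0,                                                      # the sale itself
    np.where(next_status.eq("sold"), 0.9, 1 / days_to_next),  # row before a sale, else 1/gap
)
print(df)
```

Because both shifts are done per group, the last row of one flat never looks at the first row of the next flat.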
I have a DataFrame df and I am trying to calculate a cumulative count: for each row, the number of dates in the recovery_date column that are earlier than or equal to the date in the at column.
Here is the original df:
at recovery_date
0 2020-02-01 2020-03-02
1 2020-03-01 2020-03-31
2 2020-04-01 2020-05-01
3 2020-05-01 2020-05-31
4 2020-06-01 2020-07-01
Here is the desired outcome:
at recovery_date result
0 2020-02-01 2020-03-02 0
1 2020-03-01 2020-03-31 0
2 2020-04-01 2020-05-01 2
3 2020-05-01 2020-05-31 3
4 2020-06-01 2020-07-01 4
The interpretation is that, for each at, result is the number of recovery_date values that fall before it or on the same day.
I am trying to avoid using a for loop as I am implementing this for a time-sensitive application.
This is a solution I was able to find, however I am looking for something more performant:
def how_many(at: pd.Timestamp, recoveries: pd.Series) -> int:
    return (at >= recoveries).sum()

df["result"] = [how_many(row["at"], df["recovery_date"][:idx]) for idx, row in df.iterrows()]
Thanks a lot!!
You're looking for something like this:
df['result'] = df['at'].apply(lambda at: (at >= df['recovery_date']).sum())
What this does: for each value in the at column, it compares at against every recovery_date, which gives an array of True (=1) and False (=0) values marking the recovery dates that fall on or before at, and then sums them.
This yields your desired output:
at recovery_date count result
0 2020-02-01 2020-03-02 1 0
1 2020-03-01 2020-03-31 1 0
2 2020-04-01 2020-05-01 1 2
3 2020-05-01 2020-05-31 1 3
4 2020-06-01 2020-07-01 1 4
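Since the question asks for something more performant, here is a sketch of an alternative that swaps the per-row comparison for a sort plus binary search with `numpy.searchsorted`. Like the `apply` version, it counts all recovery dates on or before each at, not only those in earlier rows, which gives the same result for this data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "at": pd.to_datetime(
        ["2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01"]
    ),
    "recovery_date": pd.to_datetime(
        ["2020-03-02", "2020-03-31", "2020-05-01", "2020-05-31", "2020-07-01"]
    ),
})

# Sort the recovery dates once, then binary-search each `at` value.
sorted_recoveries = np.sort(df["recovery_date"].to_numpy())

# side="right" returns the number of elements <= at, so ties are counted.
df["result"] = np.searchsorted(sorted_recoveries, df["at"].to_numpy(), side="right")
print(df)
```

This does O(n log n) work instead of comparing every at against every recovery_date.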
I have the following dataframe:
df = pd.DataFrame({'No': [0, 0, 0, 1, 1, 2, 2],
                   'date': ['2020-01-15', '2019-12-16', '2021-03-01', '2018-05-19', '2016-04-08', '2020-01-02', '2020-03-07']})
df.date = pd.to_datetime(df.date)
No date
0 0 2020-01-15
1 0 2019-12-16
2 0 2021-03-01
3 1 2018-05-19
4 1 2016-04-08
5 2 2020-01-02
6 2 2020-03-07
I want to drop the rows if all the date values are earlier than 2020-01-01 for each unique number in No column, i.e. I want to drop rows with the indices 3 and 4.
Is it possible to do it without a for loop?
Use groupby and transform:
>>> df[df.groupby('No')['date'].transform('max')>='2020-01-01']
No date
0 0 2020-01-15
1 0 2019-12-16
2 0 2021-03-01
5 2 2020-01-02
6 2 2020-03-07
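To see why this works, the following sketch splits the one-liner into steps. The intermediate names `latest_per_group` and `keep` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "No": [0, 0, 0, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2020-01-15", "2019-12-16", "2021-03-01", "2018-05-19",
        "2016-04-08", "2020-01-02", "2020-03-07",
    ]),
})

# transform("max") returns one value per original row: the latest date
# within that row's "No" group, aligned with df's index.
latest_per_group = df.groupby("No")["date"].transform("max")

# A group is kept if its latest date is on or after the cutoff, which means
# not all of its dates are earlier than 2020-01-01.
keep = latest_per_group >= pd.Timestamp("2020-01-01")

filtered = df[keep]
print(filtered)
```

Because the mask has the same index as df, it can be used directly for boolean indexing without a loop over the groups.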
I have a pandas dataframe looking like this.
Location Part UnitCost DemandType Demand Period
NL 12345 6 GENERAL 4 2017-10-01 00:00:00
NL 12345 6 GENRAL 6 2017-12-01 00:00:00
There was no demand in November, but there is no record of that; the month is simply missing. I want it added. What can I do to turn the data into this:
Location Part UnitCost DemandType Demand Period
NL 12345 6 GENERAL 4 2017-10-01 00:00:00
NL 12345 6 GENERAL 0 2017-11-01 00:00:00
NL 12345 6 GENERAL 6 2017-12-01 00:00:00
Furthermore, I want to add all months with zero demand from 2017-10-01 till 2020-03-01.
It is important that this is done for the unique combination of Location and Part. There are more than 100 unique combinations of Location and Part in my dataframe.
Thank you very much in advance!
Here is one way:
df['Period'] = pd.to_datetime(df['Period']) #Make sure Period is datetime dtype
df1 = df.set_index('Period') #Set Index for resample in next statement
df1.resample('MS').ffill().assign(Demand=df1['Demand']).fillna(0).reset_index()
Output:
Period Location Part UnitCost DemandType Demand
0 2017-10-01 NL 12345 6 GENERAL 4.0
1 2017-11-01 NL 12345 6 GENERAL 0.0
2 2017-12-01 NL 12345 6 GENRAL 6.0
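The resample above handles a single Location/Part combination and only the months between the first and last record. For the full requirement (every combination, every month from 2017-10-01 to 2020-03-01), one possible sketch builds a grid of all combinations and months and left-joins the data onto it. It assumes pandas 1.2 or newer for `how="cross"`, and that UnitCost and DemandType are constant within a combination so they can be filled from existing rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["NL", "NL"],
    "Part": [12345, 12345],
    "UnitCost": [6, 6],
    "DemandType": ["GENERAL", "GENERAL"],
    "Demand": [4, 6],
    "Period": pd.to_datetime(["2017-10-01", "2017-12-01"]),
})

keys = ["Location", "Part"]
all_months = pd.DataFrame(
    {"Period": pd.date_range("2017-10-01", "2020-03-01", freq="MS")}
)

# Every (Location, Part) combination paired with every month in the range.
grid = df[keys].drop_duplicates().merge(all_months, how="cross")

# Left-join the real records; months without a record get NaN.
filled = grid.merge(df, on=keys + ["Period"], how="left")
filled["Demand"] = filled["Demand"].fillna(0).astype(int)

# Assumption: these attributes are constant per combination, so copy them
# from the nearest existing row within the same group.
attrs = ["UnitCost", "DemandType"]
filled[attrs] = filled.groupby(keys)[attrs].transform(lambda s: s.ffill().bfill())
print(filled.head())
```

The grid approach scales to the 100+ combinations mentioned in the question, because the cross join and the merge are both vectorized.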
Is there a pandas way to do this?
from datetime import timedelta

predicted_sells = []
for row in df.values:
    index_tms = row[0]
    delta = index_tms + timedelta(hours=1)
    try:
        sells_to_predict = df.loc[delta]['cars_sold']
    except KeyError:
        # no row exists one hour later
        sells_to_predict = None
    predicted_sells.append(sells_to_predict)
df['sell_to_predict'] = predicted_sells
example explanation:
sell is the number of cars I sold at the time tms. sell_to_predict is the number of cars I sold the hour after. I want to predict that. So I want to build a new column containing at the time tms the number of cars I will sell at the time tms+1h
before my code it looks like that
tms sell
2015-11-23 15:00:00 6
2015-11-23 16:00:00 2
2015-11-23 17:00:00 10
after it looks like that
tms sell sell_to_predict
2015-11-23 15:00:00 6 2
2015-11-23 16:00:00 2 10
2015-11-23 17:00:00 10 NaN
I create a new column by shifting another column, but not by a fixed number of rows: the shift is based on the index (here the index is a timestamp).
Here is another example, a little more complex:
before :
            sell  random
store hour
1     1        1       9
      2        7       7
2     1        4       3
      2        2       3
after :
            sell  random  predict
store hour
1     1        1       9        7
      2        7       7      NaN
2     1        4       3        2
      2        2       3      NaN
Have you tried shift?
For example:
df = pd.DataFrame(list(range(4)))
df.columns = ['sold']
df['predict'] = df.sold.shift(-1)
df
sold predict
0 0 1
1 1 2
2 2 3
3 3 NaN
The answer was to resample first so that the index has no holes, and then apply the answer to this question: How do you shift Pandas DataFrame with a multiindex?
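As a rough sketch of both cases from the question (this is not the code from the linked answer): a Series can be shifted by a time offset instead of a row count by passing `freq` to `shift`, and a MultiIndex frame can be shifted within each group with `groupby(...).shift`. The sample frame below deliberately has a hole at 17:00 to show that the time-based shift copes with it without resampling.

```python
import pandas as pd

# Hourly sales with a missing 17:00 row.
df = pd.DataFrame(
    {"sell": [6, 2, 10]},
    index=pd.to_datetime(
        ["2015-11-23 15:00", "2015-11-23 16:00", "2015-11-23 18:00"]
    ),
)

# Move every timestamp one hour back; assigning the result aligns on the
# index, so each row receives the value recorded exactly one hour later.
df["sell_to_predict"] = df["sell"].shift(periods=-1, freq=pd.Timedelta(hours=1))
print(df)

# MultiIndex case: shift within each store so values never cross stores.
stores = pd.DataFrame(
    {"sell": [1, 7, 4, 2], "random": [9, 7, 3, 3]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1), (1, 2), (2, 1), (2, 2)], names=["store", "hour"]
    ),
)
stores["predict"] = stores.groupby(level="store")["sell"].shift(-1)
print(stores)
```

Note that the group-wise shift is still row-based, so it assumes the hours within each store are contiguous; that is the gap that resampling first would close.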