I have a sparse dataframe including dates of when inventory is bought or sold like the following:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
The first step I would like to solve is to add in the missing dates. I know you can use resample, but I am highlighting this part in case it has an impact on the next, more difficult part. As below:
Date Inventory
2017-01-01 10
2017-01-02 NaN
2017-01-03 NaN
2017-01-04 NaN
2017-01-05 -5
2017-01-06 NaN
2017-01-07 15
2017-01-08 NaN
2017-01-09 -20
The final step is to have it fill forward over the NaNs, except that each new value gets added to the running value from the row above, so that the final dataframe looks like the following:
Date Inventory
2017-01-01 10
2017-01-02 10
2017-01-03 10
2017-01-04 10
2017-01-05 5
2017-01-06 5
2017-01-07 20
2017-01-08 20
2017-01-09 0
2017-01-10 0
I am trying to get a pythonic approach to this and not a loop-based approach, as that will be very slow.
The example should also work for a table with multiple columns as such:
Date InventoryA InventoryB
2017-01-01 10 NaN
2017-01-02 NaN NaN
2017-01-03 NaN 5
2017-01-04 NaN 5
2017-01-05 -5 NaN
2017-01-06 NaN -10
2017-01-07 15 NaN
2017-01-08 NaN NaN
2017-01-09 -20 NaN
would become:
Date InventoryA InventoryB
2017-01-01 10 0
2017-01-02 10 0
2017-01-03 10 5
2017-01-04 10 10
2017-01-05 5 10
2017-01-06 5 0
2017-01-07 20 0
2017-01-08 20 0
2017-01-09 0 0
2017-01-10 0 0
Hope that helps too. I think the current solution will have a problem with the NaNs as shown.
Thanks.
You can just fill the missing values with 0 after resampling (no inventory change on that day), and then use cumsum:
df.fillna(0).cumsum()
You're simply doing the two steps in the wrong order :)
df['Inventory'].cumsum().resample('D').ffill()
Edit: you might need to set the Date as index first.
df = df.set_index('Date')
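Putting the two answers above together, a minimal runnable sketch using the question's sample data (`.ffill()` is the modern spelling of `.pad()`; both orderings are shown):

```python
import pandas as pd

# Sample data from the question: dates on which inventory changes
df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-05", "2017-01-07", "2017-01-09"]),
    "Inventory": [10, -5, 15, -20],
}).set_index("Date")

# Approach 1: resample to daily first, treat missing days as zero change, then cumsum
daily = df.resample("D").asfreq().fillna(0).cumsum()

# Approach 2: cumsum first, then resample and forward-fill the running total
daily2 = df["Inventory"].cumsum().resample("D").ffill()

print(daily["Inventory"].tolist())
# [10.0, 10.0, 10.0, 10.0, 5.0, 5.0, 20.0, 20.0, 0.0]
```

Note that resampling can only extend to the last observed date (2017-01-09 here); to reach 2017-01-10 as in the desired output, you would reindex with an explicitly longer date range first.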
Part 1: Assuming df is your
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
Then
import pandas as pd
import datetime

# Build a complete daily date range spanning the data
df_new = pd.DataFrame([df.Date.min() + datetime.timedelta(days=day)
                       for day in range((df.Date.max() - df.Date.min()).days + 1)])
# Left-join the original data onto the full range
df_new = df_new.merge(df, left_on=0, right_on='Date', how='left').drop('Date', axis=1)
df_new.columns = df.columns
Gives you :
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 NaN
2 2017-01-03 NaN
3 2017-01-04 NaN
4 2017-01-05 -5.0
5 2017-01-06 NaN
6 2017-01-07 15.0
7 2017-01-08 NaN
8 2017-01-09 -20.0
Part 2:
From the fillna method description:

method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use NEXT valid observation to fill gap.
df_new.Inventory = df_new.Inventory.fillna(method="ffill")
Gives you
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 10.0
2 2017-01-03 10.0
3 2017-01-04 10.0
4 2017-01-05 -5.0
5 2017-01-06 -5.0
6 2017-01-07 15.0
7 2017-01-08 15.0
8 2017-01-09 -20.0
You should be able to generalise it for more than one column once you understand how it can be done with one.
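For the multi-column table in the question, one sketch of the generalisation. Note that the desired output there is a running total, so missing days are filled with 0 and cumsum is applied column-wise, rather than a plain ffill:

```python
import numpy as np
import pandas as pd

# The question's two-column table, already on a complete daily index
idx = pd.date_range("2017-01-01", "2017-01-09", freq="D")
df = pd.DataFrame({
    "InventoryA": [10, np.nan, np.nan, np.nan, -5, np.nan, 15, np.nan, -20],
    "InventoryB": [np.nan, np.nan, 5, 5, np.nan, -10, np.nan, np.nan, np.nan],
}, index=idx)

# NaN means "no change that day"; cumsum turns changes into running totals per column
result = df.fillna(0).cumsum()
print(result["InventoryA"].tolist())  # [10.0, 10.0, 10.0, 10.0, 5.0, 5.0, 20.0, 20.0, 0.0]
print(result["InventoryB"].tolist())  # [0.0, 0.0, 5.0, 10.0, 10.0, 0.0, 0.0, 0.0, 0.0]
```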
Related
I am trying to fill all missing values until the end of the dataframe but am unable to do so. In the example below, I am taking the average of the last three values. My code only fills until 2017-01-10, whereas I want to fill until 2017-01-14. For 1/14, I want to use the values from 1/11, 1/12 and 1/13. Please help.
import pandas as pd

df = pd.DataFrame([
    {"ds": "2017-01-01", "y": 3},
    {"ds": "2017-01-02", "y": 4},
    {"ds": "2017-01-03", "y": 6},
    {"ds": "2017-01-04", "y": 2},
    {"ds": "2017-01-05", "y": 7},
    {"ds": "2017-01-06", "y": 9},
    {"ds": "2017-01-07", "y": 8},
    {"ds": "2017-01-08", "y": 2},
    {"ds": "2017-01-09"},
    {"ds": "2017-01-10"},
    {"ds": "2017-01-11"},
    {"ds": "2017-01-12"},
    {"ds": "2017-01-13"},
    {"ds": "2017-01-14"},
])
df["y"].fillna(df["y"].rolling(3, min_periods=1).mean(), axis=0, inplace=True)
Result:
ds y
0 2017-01-01 3.0
1 2017-01-02 4.0
2 2017-01-03 6.0
3 2017-01-04 2.0
4 2017-01-05 7.0
5 2017-01-06 9.0
6 2017-01-07 8.0
7 2017-01-08 2.0
8 2017-01-09 5.0
9 2017-01-10 2.0
10 2017-01-11 NaN
11 2017-01-12 NaN
12 2017-01-13 NaN
13 2017-01-14 NaN
Desired output:
You can iterate over the values in y and, when a NaN value is encountered, look at the 3 earlier values and use .at[] to set their mean as the new value:
import numpy as np

for index, value in df['y'].items():
    if np.isnan(value):
        df.at[index, 'y'] = df['y'].iloc[index-3: index].mean()
Resulting dataframe for the missing values:
7 2017-01-08 2.000000
8 2017-01-09 6.333333
9 2017-01-10 5.444444
10 2017-01-11 4.592593
11 2017-01-12 5.456790
12 2017-01-13 5.164609
13 2017-01-14 5.071331
I have a pandas dataframe where I want to return a function of every X items of a time series. For instance, my dataframe might look like:
date value
2017-01-01 1
2017-01-02 5
2017-01-03 2
2017-01-04 1
2017-01-05 6
2017-01-06 6
So, for example, if I want to pull the rolling average of every X values where X is 3, I would want a dataframe showing:
date value
2017-01-03 2.666
2017-01-04 2.666
2017-01-05 3
2017-01-06 4.333
Is there a dataframe operation that lets me pick a group of X values upon which to run a function?
I think you need rolling with mean, and then, if necessary, remove the first NaNs with dropna:
df['value'] = df['value'].rolling(3).mean()
df = df.dropna(subset=['value'])
print (df)
date value
2 2017-01-03 2.666667
3 2017-01-04 2.666667
4 2017-01-05 3.000000
5 2017-01-06 4.333333
It is also possible to use the min_periods parameter to avoid NaNs:
df['value'] = df['value'].rolling(3, min_periods=1).mean()
print (df)
date value
0 2017-01-01 1.000000
1 2017-01-02 3.000000
2 2017-01-03 2.666667
3 2017-01-04 2.666667
4 2017-01-05 3.000000
5 2017-01-06 4.333333
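And since the question asks about running an arbitrary function over each group of X values, note that rolling(...).apply accepts any function over the window. A sketch using the question's data, with median as a stand-in function (any built-in like median also has a direct rolling method):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=6, freq="D"),
    "value": [1, 5, 2, 1, 6, 6],
})

# apply runs the given function on each 3-value window (here: median of the window)
medians = df["value"].rolling(3).apply(lambda w: w.median(), raw=False)
# equivalently, for built-ins: df["value"].rolling(3).median()
print(medians.dropna().tolist())  # [2.0, 2.0, 2.0, 6.0]
```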
I have something like the following dataframe:
df = pd.Series(index=pd.date_range(start='1/1/2017', end='1/10/2017', freq='D'),
               data=[5, 5, 2, 1, 3, 4, 5, 6, 7, 8])
df
Out[216]:
2017-01-01 5
2017-01-02 5
2017-01-03 2
2017-01-04 1
2017-01-05 3
2017-01-06 4
2017-01-07 5
2017-01-08 6
2017-01-09 7
2017-01-10 8
Freq: D, dtype: int64
I want to identify the start date of the 3 day period that has the minimum total value. So in this example, 2017-01-03 through 2017-01-05, has the minimum value with a sum of 6 across those 3 days.
Is there a way to do this without looping through each 3 day window?
The result would be:
2017-01-03 6
And if there were multiple windows that have the same minimum sum, the result could have a record for each.
IIUC, use rolling:
df = pd.Series(index=pd.date_range(start='1/1/2017', end='1/10/2017', freq='D'),
               data=[5, 5, 2, 1, 3, 4, 5, 6, 7, 8])
df = df.to_frame()
# sum each 3-day window and label it by the window's start date
df['New'] = df.rolling(3).sum().shift(-2).values
df.loc[df.New == df.New.min(), :].drop(columns=0)
Out[685]:
New
2017-01-03 6.0
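The same idea as a self-contained sketch on the original Series, keeping every tied window start in case of a non-unique minimum (as the question requested):

```python
import pandas as pd

s = pd.Series([5, 5, 2, 1, 3, 4, 5, 6, 7, 8],
              index=pd.date_range("2017-01-01", "2017-01-10", freq="D"))

# Sum over each 3-day window, then shift so each sum is labelled by the window's start date
window_sums = s.rolling(3).sum().shift(-2)

# Boolean mask keeps all start dates whose window sum equals the minimum (handles ties)
result = window_sums[window_sums == window_sums.min()]
print(result)  # 2017-01-03    6.0
```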
I have a dataframe of values and a list of dates.
E.g.,
data = pd.DataFrame([1,3,5,7,2,3,9,1,3,8,4,5],index=pd.date_range(start='2017-01-01',periods=12),columns=['values'])
I want to replace the value of a date in the date list with a zero value. E.g.,
date_list = ['2017-01-04', '2017-01-07', '2017-01-10']
I have tried:
data[date_list] == 0
but this yields an error:
KeyError: "None of [['2017-01-04', '2017-01-07', '2017-01-10']] are in the [index]"
Does anyone have an idea of how to solve this? I have a very large dataframe and date list...
Another way,
In [11]: data[data.index.isin(date_list)] = 0
In [12]: data
Out[12]:
values
2017-01-01 1
2017-01-02 3
2017-01-03 5
2017-01-04 0
2017-01-05 2
2017-01-06 3
2017-01-07 0
2017-01-08 1
2017-01-09 3
2017-01-10 0
2017-01-11 4
2017-01-12 5
You need to convert that list to datetime and use the loc indexer:
data.loc[pd.to_datetime(date_list)] = 0
data
Out[19]:
values
2017-01-01 1
2017-01-02 3
2017-01-03 5
2017-01-04 0
2017-01-05 2
2017-01-06 3
2017-01-07 0
2017-01-08 1
2017-01-09 3
2017-01-10 0
2017-01-11 4
2017-01-12 5
This works here because the DataFrame has only one column; the assignment sets every column to zero for those dates. As jezrael pointed out, if you only want to set the values column to zero, you need to specify it:
data.loc[pd.to_datetime(date_list), 'values'] = 0
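To make the difference concrete, a small sketch with a hypothetical second column `other` (not in the original question) showing that the column-specific assignment leaves other columns untouched:

```python
import pandas as pd

# Two-column frame: 'values' as in the question, 'other' added for illustration
data = pd.DataFrame(
    {"values": [1, 3, 5, 7], "other": [9, 9, 9, 9]},
    index=pd.date_range("2017-01-01", periods=4),
)
date_list = ["2017-01-02", "2017-01-04"]

# Convert the strings to datetimes and zero only the 'values' column on those dates
data.loc[pd.to_datetime(date_list), "values"] = 0
print(data["values"].tolist())  # [1, 0, 5, 0]
print(data["other"].tolist())   # [9, 9, 9, 9]
```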
I have a dataframe to analyse that has a column of dates as datetimes, and a column of hours as integers.
I would like to combine the two columns into a single timestamp field for some further analysis, but cannot find a way to do so quickly.
I have this code that works, but it takes an inordinate amount of time due to the length of the dataframe (~1m entries):
for i in range(len(my_df)):
    my_df['gen_timestamp'][i] = datetime.datetime.combine(my_df['date'][i],
                                                          datetime.time(my_df['hour'][i]))
What I would like to do is to somehow convert the datetime type in my_df['date'] to an integer (say a timestamp in seconds) and the integer type in my_df['hour'], so that they can be quickly summed without the need for a laborious loop.
Worst case I then convert that integer back to a datetime in one go or just use seconds as my data type going forwards.
Thanks for any help.
IIUC you can construct a TimedeltaIndex and add this to your datetimes:
In [112]:
# sample data (assumes: import datetime as dt, import numpy as np, import pandas as pd)
df = pd.DataFrame({'date': pd.date_range(dt.datetime(2017, 1, 1), periods=10),
                   'hour': np.arange(10)})
df
Out[112]:
date hour
0 2017-01-01 0
1 2017-01-02 1
2 2017-01-03 2
3 2017-01-04 3
4 2017-01-05 4
5 2017-01-06 5
6 2017-01-07 6
7 2017-01-08 7
8 2017-01-09 8
9 2017-01-10 9
In [113]:
df['timestamp'] = df['date'] + pd.TimedeltaIndex(df['hour'], unit='h')
df
Out[113]:
date hour timestamp
0 2017-01-01 0 2017-01-01 00:00:00
1 2017-01-02 1 2017-01-02 01:00:00
2 2017-01-03 2 2017-01-03 02:00:00
3 2017-01-04 3 2017-01-04 03:00:00
4 2017-01-05 4 2017-01-05 04:00:00
5 2017-01-06 5 2017-01-06 05:00:00
6 2017-01-07 6 2017-01-07 06:00:00
7 2017-01-08 7 2017-01-08 07:00:00
8 2017-01-09 8 2017-01-09 08:00:00
9 2017-01-10 9 2017-01-10 09:00:00
So in your case I expect the following to work:
my_df['gen_timestamp'] = my_df['date'] + pd.TimedeltaIndex(my_df['hour'], unit='h')
This assumes that my_df['date'] is already datetime; if not, convert it first using my_df['date'] = pd.to_datetime(my_df['date']).
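As a related sketch, pd.to_timedelta is an equivalent, commonly used alternative to the TimedeltaIndex constructor (column names taken from the answer above):

```python
import datetime as dt
import numpy as np
import pandas as pd

my_df = pd.DataFrame({
    "date": pd.date_range(dt.datetime(2017, 1, 1), periods=3),
    "hour": np.array([0, 6, 12]),
})

# Convert the integer hours to a timedelta column and add it to the dates vectorised
my_df["gen_timestamp"] = my_df["date"] + pd.to_timedelta(my_df["hour"], unit="h")
print(my_df["gen_timestamp"].tolist())
# [Timestamp('2017-01-01 00:00:00'), Timestamp('2017-01-02 06:00:00'), Timestamp('2017-01-03 12:00:00')]
```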