My dataset has Customer_Code, As_Of_Date and 24 product columns. The product columns take values of 0 or 1. I sorted the dataset by Customer_Code and As_Of_Date. For each product column I want to subtract the previous row's value from the current row's value. The important thing is that the differences must be computed per customer, ordered by As_Of_Date.
I tried
df2.set_index('Customer_Code').diff()
and
df2.set_index('As_Of_Date').diff()
and
for i in new["Customer_Code"].unique():
    df14 = df12.set_index('As_Of_Date').diff()
but the result is not correct. My code works for the first customer but not for the second.
How can I do this?
You didn't share any data, so I made up something you may use. Your expected outcome is also missing. For future reference, please do not share data as images. Let's say you have this data:
id date product
0 12 2008-01-01 1
1 12 2008-01-01 2
2 12 2008-01-01 1
3 12 2008-01-02 4
4 12 2008-01-02 5
5 34 2009-01-01 6
6 34 2009-01-01 7
7 34 2009-01-01 84
8 34 2009-01-02 4
9 34 2009-01-02 3
10 34 2009-01-02 3
11 34 2009-01-03 5
12 34 2009-01-03 6
13 34 2009-01-03 8
As I understand it, you want to subtract the previous row's product value, grouped by id and date (adapt the grouping columns if yours differ). You then need to do this:
import numpy as np
mask = df.duplicated(['id', 'date'])
df['product_diff'] = np.where(mask, df['product'] - df['product'].shift(1), np.nan)
which returns:
id date product product_diff
0 12 2008-01-01 1 NaN
1 12 2008-01-01 2 1.0
2 12 2008-01-01 1 -1.0
3 12 2008-01-02 4 NaN
4 12 2008-01-02 5 1.0
5 34 2009-01-01 6 NaN
6 34 2009-01-01 7 1.0
7 34 2009-01-01 84 77.0
8 34 2009-01-02 4 NaN
9 34 2009-01-02 3 -1.0
10 34 2009-01-02 3 0.0
11 34 2009-01-03 5 NaN
12 34 2009-01-03 6 1.0
13 34 2009-01-03 8 2.0
or if you want it the other way around:
mask = df.duplicated(['id', 'date'])
df['product_diff'] = np.where(mask, df['product'] - df['product'].shift(-1), np.nan)
which gives:
id date product product_diff
0 12 2008-01-01 1 NaN
1 12 2008-01-01 2 1.0
2 12 2008-01-01 1 -3.0
3 12 2008-01-02 4 NaN
4 12 2008-01-02 5 -1.0
5 34 2009-01-01 6 NaN
6 34 2009-01-01 7 -77.0
7 34 2009-01-01 84 80.0
8 34 2009-01-02 4 NaN
9 34 2009-01-02 3 0.0
10 34 2009-01-02 3 -2.0
11 34 2009-01-03 5 NaN
12 34 2009-01-03 6 -2.0
13 34 2009-01-03 8 NaN
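If you want this for the original column layout (Customer_Code, As_Of_Date and the 24 product columns), here is a minimal groupby-based sketch. It assumes the frame is called df2 and uses hypothetical column names product_1 ... product_24; replace them with your real product columns:
import pandas as pd
# hypothetical product column names; replace with your 24 product columns
product_cols = [f'product_{i}' for i in range(1, 25)]
# sort so each customer's rows are in As_Of_Date order
df2 = df2.sort_values(['Customer_Code', 'As_Of_Date'])
# diff within each customer, so the first row of every customer stays NaN
df2[product_cols] = df2.groupby('Customer_Code')[product_cols].diff()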
I would like to duplicate the rows in a data frame by creating a sequence of n dates from the start date.
My input file format:
col1 col2 date
1 5 2015-07-15
2 6 2015-07-20
3 7 2015-07-25
My expected output:
col1 col2 date
1 5 2015-07-15
1 5 2015-07-16
1 5 2015-07-17
1 5 2015-07-18
1 5 2015-07-19
2 6 2015-07-20
2 6 2015-07-21
2 6 2015-07-22
2 6 2015-07-23
2 6 2015-07-24
3 7 2015-07-25
3 7 2015-07-26
3 7 2015-07-27
3 7 2015-07-28
3 7 2015-07-29
I need to create a sequence of dates with a one-day difference between them.
Thanks in advance.
Use:
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
n = 15
# create a date range of n daily periods starting from the first date
idx = pd.date_range(df['date'].iat[0], periods=n)
# create a DatetimeIndex with reindex and forward fill the values
df = (df.set_index('date')
        .reindex(idx, method='ffill')
        .reset_index()
        .rename(columns={'index': 'date'}))
print(df)
date col1 col2
0 2015-07-15 1 5
1 2015-07-16 1 5
2 2015-07-17 1 5
3 2015-07-18 1 5
4 2015-07-19 1 5
5 2015-07-20 2 6
6 2015-07-21 2 6
7 2015-07-22 2 6
8 2015-07-23 2 6
9 2015-07-24 2 6
10 2015-07-25 3 7
11 2015-07-26 3 7
12 2015-07-27 3 7
13 2015-07-28 3 7
14 2015-07-29 3 7
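The n = 15 above is hardcoded for this example. If you prefer not to hardcode it, one hedged option, assuming daily steps and that the last block should also span five days, is to derive n from the data before the reindex step:
# days between the first and last start date, plus five extra days for the last block
n = (df['date'].iloc[-1] - df['date'].iloc[0]).days + 5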
Import packages:
from datetime import datetime as dt
from datetime import timedelta
import numpy as np
import pandas as pd
Then create the date range as a df:
base = dt(2015, 7, 15)
arr = np.array([base + timedelta(days=i) for i in range(15)])
df_d = pd.DataFrame({'date_r' : arr})
Change the datatype of the original df's date column if you have not already:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
Then merge it with the original df and sort by date ascending:
df_merged = df.merge(df_d, how='right', left_on='date', right_on='date_r')
df_merged.sort_values('date_r', inplace=True)
You will get this df:
col1 col2 date date_r
0 1.0 5.0 2015-07-15 2015-07-15
3 NaN NaN NaT 2015-07-16
4 NaN NaN NaT 2015-07-17
5 NaN NaN NaT 2015-07-18
6 NaN NaN NaT 2015-07-19
1 2.0 6.0 2015-07-20 2015-07-20
7 NaN NaN NaT 2015-07-21
8 NaN NaN NaT 2015-07-22
9 NaN NaN NaT 2015-07-23
10 NaN NaN NaT 2015-07-24
2 3.0 7.0 2015-07-25 2015-07-25
11 NaN NaN NaT 2015-07-26
12 NaN NaN NaT 2015-07-27
13 NaN NaN NaT 2015-07-28
14 NaN NaN NaT 2015-07-29
Now, you will just need to forward fill using fillna(method='ffill').astype(int):
df_merged['col1'] = df_merged['col1'].fillna(method='ffill').astype(int)
df_merged['col2'] = df_merged['col2'].fillna(method='ffill').astype(int)
For completeness' sake, select and rename the columns to get back the originally intended df:
df_merged = df_merged[['col1', 'col2', 'date_r']]
df_merged.rename(columns={'date_r' : 'date'}, inplace=True)
For cosmetic purposes, reset the index:
df_merged.reset_index(inplace=True, drop=True)
print(df_merged)
to finally yield:
col1 col2 date
0 1 5 2015-07-15
1 1 5 2015-07-16
2 1 5 2015-07-17
3 1 5 2015-07-18
4 1 5 2015-07-19
5 2 6 2015-07-20
6 2 6 2015-07-21
7 2 6 2015-07-22
8 2 6 2015-07-23
9 2 6 2015-07-24
10 3 7 2015-07-25
11 3 7 2015-07-26
12 3 7 2015-07-27
13 3 7 2015-07-28
14 3 7 2015-07-29
A more generic way is to stretch out your time index and fill NaN with the previous values. Note that asfreq only fills between the first and last dates present in the data, so this output stops at 2015-07-25. Try this:
df['date']=pd.to_datetime(df['date'])
print(df.set_index('date').asfreq('D').ffill().reset_index())
Output:
date col1 col2
0 2015-07-15 1.0 5.0
1 2015-07-16 1.0 5.0
2 2015-07-17 1.0 5.0
3 2015-07-18 1.0 5.0
4 2015-07-19 1.0 5.0
5 2015-07-20 2.0 6.0
6 2015-07-21 2.0 6.0
7 2015-07-22 2.0 6.0
8 2015-07-23 2.0 6.0
9 2015-07-24 2.0 6.0
10 2015-07-25 3.0 7.0
I have several pandas DataFrames (in a normal Python list) that look like the following two. Note that there can be (and in fact are) missing values at random dates. I need to compute percentiles of TMAX and/or TMAX_ANOM across the DataFrames, for each date, ignoring the missing values.
YYYY MM DD TMAX TMAX_ANOM
0 1980 7 1 13.0 2.333333
1 1980 7 2 14.3 2.566667
2 1980 7 3 15.6 2.800000
3 1980 7 4 16.9 3.033333
4 1980 8 1 18.2 3.266667
5 1980 8 2 19.5 3.500000
6 1980 8 3 20.8 3.733333
7 1980 8 4 22.1 3.966667
8 1981 7 1 10.0 -0.666667
9 1981 7 2 11.0 -0.733333
10 1981 7 3 12.0 -0.800000
11 1981 7 4 13.0 -0.866667
12 1981 8 1 14.0 -0.933333
13 1981 8 2 15.0 -1.000000
14 1981 8 3 16.0 -1.066667
15 1981 8 4 17.0 -1.133333
16 1982 7 1 9.0 -1.666667
17 1982 7 2 9.9 -1.833333
18 1982 7 3 10.8 -2.000000
19 1982 7 4 11.7 -2.166667
20 1982 8 1 12.6 -2.333333
21 1982 8 2 13.5 -2.500000
22 1982 8 3 14.4 -2.666667
23 1982 8 4 15.3 -2.833333
YYYY MM DD TMAX TMAX_ANOM
0 1980 7 1 14.0 3.666667
1 1980 7 2 15.4 4.033333
2 1980 7 3 16.8 4.400000
3 1980 7 4 18.2 4.766667
4 1980 8 1 19.6 5.133333
6 1980 8 3 22.4 5.866667
7 1980 8 4 23.8 6.233333
8 1981 7 1 10.0 -0.333333
9 1981 7 2 11.0 -0.366667
10 1981 7 3 12.0 -0.400000
11 1981 7 4 13.0 -0.433333
12 1981 8 1 14.0 -0.466667
13 1981 8 2 15.0 -0.500000
14 1981 8 3 16.0 -0.533333
15 1981 8 4 17.0 -0.566667
16 1982 7 1 7.0 -3.333333
17 1982 7 2 7.7 -3.666667
18 1982 7 3 8.4 -4.000000
19 1982 7 4 9.1 -4.333333
20 1982 8 1 9.8 -4.666667
21 1982 8 2 10.5 -5.000000
23 1982 8 4 11.9 -5.666667
So just to be clear, in this example with just two DataFrames (and supposing the percentile is the median to simplify the discussion), as output I need a DataFrame with 24 rows, the same YYYY/MM/DD fields, and TMAX (and/or TMAX_ANOM) replaced as follows: for 1980/7/1 it must be the median of 13 and 14, for 1980/7/2 the median of 14.3 and 15.4, and so on. When there are missing values (for example 1980/8/2 in the second DataFrame here), the median must be computed from just the remaining DataFrames, so in this case the value would simply be 19.5.
I have not been able to find a clean way to accomplish this with either numpy or pandas. Any suggestions, or should I just resort to manual looping?
import pandas as pd
# use the dates as an index in both DataFrames
df1.index = pd.to_datetime(dict(year=df1.YYYY, month=df1.MM, day=df1.DD))
df2.index = pd.to_datetime(dict(year=df2.YYYY, month=df2.MM, day=df2.DD))
# bind the useful columns side by side
new_df = df1[['TMAX', 'TMAX_ANOM']].join(df2[['TMAX', 'TMAX_ANOM']], lsuffix='_df1', rsuffix='_df2')
# calculate the quantile row-wise; NaN values are skipped by default
new_df['TMAX_quantile'] = new_df[['TMAX_df1', 'TMAX_df2']].quantile(0.5, axis=1)
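If there are more than two DataFrames, a minimal sketch, assuming they sit in a Python list called dfs with the same YYYY/MM/DD/TMAX/TMAX_ANOM columns, is to concatenate them and group by the date; rows missing from individual frames are then ignored automatically:
import pandas as pd
frames = []
for d in dfs:
    d = d.copy()
    d['date'] = pd.to_datetime(dict(year=d.YYYY, month=d.MM, day=d.DD))
    frames.append(d)
stacked = pd.concat(frames)
# median per date across all DataFrames; NaN values are skipped
result = stacked.groupby('date')[['TMAX', 'TMAX_ANOM']].median().reset_index()
For a different percentile, replace .median() with, for example, .quantile(0.9).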
I have a dataframe that includes two columns like the following:
date value
0 2017-05-01 1
1 2017-05-08 4
2 2017-05-15 9
Each row shows the Monday of a week, and I have a value only for that specific day. I want to carry this value across all days of the week until the next Monday, and get the following output:
date value
0 2017-05-01 1
1 2017-05-02 1
2 2017-05-03 1
3 2017-05-04 1
4 2017-05-05 1
5 2017-05-06 1
6 2017-05-07 1
7 2017-05-08 4
8 2017-05-09 4
9 2017-05-10 4
10 2017-05-11 4
11 2017-05-12 4
12 2017-05-13 4
13 2017-05-14 4
14 2017-05-15 9
15 2017-05-16 9
16 2017-05-17 9
17 2017-05-18 9
18 2017-05-19 9
19 2017-05-20 9
20 2017-05-21 9
This link shows how to select a date range in a DataFrame, but I don't know how to fill the value column as I explained.
Here is a solution using pandas reindex and ffill:
import pandas as pd
from pandas.tseries.offsets import DateOffset
# Make sure date is treated as datetime
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%d")
# Create target dates: all days in the weeks of the original dataframe
new_index = pd.date_range(start=df['date'].iloc[0],
                          end=df['date'].iloc[-1] + DateOffset(6),
                          freq='D')
# Temporarily set dates as index, conform to the target dates, forward fill the data,
# then reset the index as in the original df
out = df.set_index('date')\
        .reindex(new_index).ffill()\
        .reset_index(drop=False)\
        .rename(columns={'index': 'date'})
Which gives the expected result:
date value
0 2017-05-01 1.0
1 2017-05-02 1.0
2 2017-05-03 1.0
3 2017-05-04 1.0
4 2017-05-05 1.0
5 2017-05-06 1.0
6 2017-05-07 1.0
7 2017-05-08 4.0
8 2017-05-09 4.0
9 2017-05-10 4.0
10 2017-05-11 4.0
11 2017-05-12 4.0
12 2017-05-13 4.0
13 2017-05-14 4.0
14 2017-05-15 9.0
15 2017-05-16 9.0
16 2017-05-17 9.0
17 2017-05-18 9.0
18 2017-05-19 9.0
19 2017-05-20 9.0
20 2017-05-21 9.0
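If the integer dtype of value matters (the reindex turns it into floats because of the intermediate NaNs), one option, assuming every day has been filled so no NaN remains, is to cast it back afterwards:
out['value'] = out['value'].astype(int)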
Given the following input, the goal is to group values by hour for each Date with average and sum functions.
A solution for grouping by hour alone exists, but it does not take new days into account.
Date Time F1 F2 F3
21-01-16 8:11 5 2 4
21-01-16 9:25 9 8 2
21-01-16 9:39 7 3 2
21-01-16 9:53 6 5 1
21-01-16 10:07 4 6 7
21-01-16 10:21 7 3 1
21-01-16 10:35 5 6 7
21-01-16 11:49 1 2 1
21-01-16 12:03 3 3 1
22-01-16 9:45 6 5 1
22-01-16 9:20 4 6 7
22-01-16 12:10 7 3 1
Expected output:
Date,Time,SUM F1,SUM F2,SUM F3,AVG F1,AVG F2,AVG F3
21-01-16,8:00,5,2,4,5,2,4
21-01-16,9:00,22,16,5,7.3,5.3,1.6
21-01-16,10:00,16,15,15,5.3,5,5
21-01-16,11:00,1,2,1,1,2,1
21-01-16,12:00,3,3,1,3,3,1
22-01-16,9:00,10,11,8,5,5.5,4
22-01-16,12:00,7,3,1,7,3,1
You can parse the dates while reading the CSV file:
from __future__ import print_function  # make it work with Python 2 and 3
import pandas as pd
df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 delim_whitespace=True)
print(df.groupby([df.index, df.Time.dt.hour]).agg(['mean', 'sum']))
Output:
F1 F2 F3
mean sum mean sum mean sum
Date Time
2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
9 7.333333 22 5.333333 16 1.666667 5
10 5.333333 16 5.000000 15 5.000000 15
11 1.000000 1 2.000000 2 1.000000 1
12 3.000000 3 3.000000 3 1.000000 1
2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
12 7.000000 7 3.000000 3 1.000000 1
Going all the way to CSV:
from __future__ import print_function
import pandas as pd
df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 delim_whitespace=True)
df2 = df.groupby([df.index, df.Time.dt.hour]).agg(['mean', 'sum'])
df3 = df2.reset_index()
# flatten the MultiIndex columns into single strings
df3.columns = [' '.join(col).strip() for col in df3.columns.values]
print(df3.to_csv(columns=df3.columns, index=False))
Output:
Date,Time,F1 mean,F1 sum,F2 mean,F2 sum,F3 mean,F3 sum
2016-01-21,8,5.0,5,2.0,2,4.0,4
2016-01-21,9,7.333333333333333,22,5.333333333333333,16,1.6666666666666667,5
2016-01-21,10,5.333333333333333,16,5.0,15,5.0,15
2016-01-21,11,1.0,1,2.0,2,1.0,1
2016-01-21,12,3.0,3,3.0,3,1.0,1
2016-01-22,9,5.0,10,5.5,11,4.0,8
2016-01-22,12,7.0,7,3.0,3,1.0,1
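The expected output formats the hour as 8:00, 9:00 and so on; if that exact format is wanted, a small addition before writing the CSV, assuming the df3 from above, would be:
df3['Time'] = df3['Time'].astype(str) + ':00'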
You can convert Time to datetime with to_datetime and then use groupby with agg:
print(df)
Date Time F1 F2 F3
0 2016-01-21 8:11 5 2 4
1 2016-01-21 9:25 9 8 2
2 2016-01-21 9:39 7 3 2
3 2016-01-21 9:53 6 5 1
4 2016-01-21 10:07 4 6 7
5 2016-01-21 10:21 7 3 1
6 2016-01-21 10:35 5 6 7
7 2016-01-21 11:49 1 2 1
8 2016-01-21 12:03 3 3 1
9 2016-01-22 9:45 6 5 1
10 2016-01-22 9:20 4 6 7
11 2016-01-22 12:10 7 3 1
df['Time'] = pd.to_datetime(df['Time'], format="%H:%M")
print(df)
Date Time F1 F2 F3
0 2016-01-21 1900-01-01 08:11:00 5 2 4
1 2016-01-21 1900-01-01 09:25:00 9 8 2
2 2016-01-21 1900-01-01 09:39:00 7 3 2
3 2016-01-21 1900-01-01 09:53:00 6 5 1
4 2016-01-21 1900-01-01 10:07:00 4 6 7
5 2016-01-21 1900-01-01 10:21:00 7 3 1
6 2016-01-21 1900-01-01 10:35:00 5 6 7
7 2016-01-21 1900-01-01 11:49:00 1 2 1
8 2016-01-21 1900-01-01 12:03:00 3 3 1
9 2016-01-22 1900-01-01 09:45:00 6 5 1
10 2016-01-22 1900-01-01 09:20:00 4 6 7
11 2016-01-22 1900-01-01 12:10:00 7 3 1
df = df.groupby([df['Date'], df['Time'].dt.hour]).agg(['mean','sum']).reset_index()
print(df)
Date Time F1 F2 F3
mean sum mean sum mean sum
0 2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
1 2016-01-21 9 7.333333 22 5.333333 16 1.666667 5
2 2016-01-21 10 5.333333 16 5.000000 15 5.000000 15
3 2016-01-21 11 1.000000 1 2.000000 2 1.000000 1
4 2016-01-21 12 3.000000 3 3.000000 3 1.000000 1
5 2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
6 2016-01-22 12 7.000000 7 3.000000 3 1.000000 1
And then you can flatten the column names with a list comprehension:
levels = df.columns.levels
labels = df.columns.labels
df.columns = [x + " " + y for x, y in zip(levels[0][labels[0]], df.columns.droplevel(0))]
print(df)
Date Time F1 mean F1 sum F2 mean F2 sum F3 mean F3 sum
0 2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
1 2016-01-21 9 7.333333 22 5.333333 16 1.666667 5
2 2016-01-21 10 5.333333 16 5.000000 15 5.000000 15
3 2016-01-21 11 1.000000 1 2.000000 2 1.000000 1
4 2016-01-21 12 3.000000 3 3.000000 3 1.000000 1
5 2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
6 2016-01-22 12 7.000000 7 3.000000 3 1.000000 1
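Note that in newer pandas versions the .labels attribute was renamed to .codes, so the levels/labels lines above may fail there. A simpler flattening that sidesteps this, used instead of those three lines while the columns are still a MultiIndex (the same idea as in the first answer), would be:
df.columns = [' '.join(col).strip() for col in df.columns.values]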