I am working with a MultiIndex data frame that has Date and location_id as its index levels.
import numpy as np
import pandas as pd

index_1 = ['2020-01-01', '2020-01-03', '2020-01-04']
index_2 = [100, 200, 300]
index = pd.MultiIndex.from_product([index_1, index_2], names=['Date', 'location_id'])
df = pd.DataFrame(np.random.randint(10, 100, 9), index)
df
0
Date location_id
2020-01-01 100 19
200 75
300 39
2020-01-03 100 11
200 91
300 80
2020-01-04 100 36
200 56
300 54
I want to fill in the missing dates, each with just one location_id, and fill the value with 0:
0
Date location_id
2020-01-01 100 19
200 75
300 39
2020-01-02 100 0
2020-01-03 100 11
200 91
300 80
2020-01-04 100 36
200 56
300 54
How can I achieve that? This approach would help, but only if my data frame were not multi-indexed.
You can get the unique values of the Date index level, generate all dates between the min and max with pd.date_range, and use difference with the unique Date values to find the missing ones. Then reindex df with the union of the original index and a MultiIndex.from_product built from the missing dates and the min of the location_id level.
#unique dates
m = df.index.unique(level=0)
# reindex
df = df.reindex(df.index.union(
pd.MultiIndex.from_product([pd.date_range(m.min(), m.max())
.difference(pd.to_datetime(m))
.strftime('%Y-%m-%d'),
[df.index.get_level_values(1).min()]])),
fill_value=0)
print(df)
0
2020-01-01 100 91
200 49
300 19
2020-01-02 100 0
2020-01-03 100 41
200 25
300 51
2020-01-04 100 44
200 40
300 54
Instead of pd.MultiIndex.from_product, you can also use product from itertools. Same result, but possibly faster.
from itertools import product
df = df.reindex(df.index.union(
list(product(pd.date_range(m.min(), m.max())
.difference(pd.to_datetime(m))
.strftime('%Y-%m-%d'),
[df.index.get_level_values(1).min()]))),
fill_value=0)
A pandas index is immutable, so you need to construct a new index. Move the index level location_id to a column, keep the unique rows, and call asfreq to create rows for the missing dates; assign the result to df2. Finally, use df.align to join both indices and fillna:
# Note: asfreq needs the Date level to be a DatetimeIndex
# (convert it with pd.to_datetime first if it holds strings).
df1 = df.reset_index(-1)                                    # move location_id to a column
df2 = df1.loc[~df1.index.duplicated()].asfreq('D').ffill()  # one row per date, new rows for missing dates
df_final = df.align(df2.set_index('location_id', append=True))[0].fillna(0)
Out[75]:
0
Date location_id
2020-01-01 100 19.0
200 75.0
300 39.0
2020-01-02 100 0.0
2020-01-03 100 11.0
200 91.0
300 80.0
2020-01-04 100 36.0
200 56.0
300 54.0
unstack/stack with asfreq (or reindex) would also work; note that it fills all location_ids for the missing dates, not just one:
new_df = df.unstack(fill_value=0)
new_df.index = pd.to_datetime(new_df.index)
new_df.asfreq('D').fillna(0).stack('location_id')
Output:
0
Date location_id
2020-01-01 100 78.0
200 25.0
300 89.0
2020-01-02 100 0.0
200 0.0
300 0.0
2020-01-03 100 79.0
200 23.0
300 11.0
2020-01-04 100 30.0
200 79.0
300 72.0
Related
I have the following dataframe:
amount
01-01-2020 100
01-02-2020 100
01-03-2020 100
01-04-2020 100
01-05-2020 100
01-06-2020 100
01-07-2020 100
01-08-2020 100
01-09-2020 100
01-10-2020 100
01-11-2020 100
01-12-2020 100
I need to add a new column that starts at 100 and increases by 10% every 4 months, i.e.:
amount result
01-01-2020 100 100
01-02-2020 100 100
01-03-2020 100 100
01-04-2020 100 100
01-05-2020 100 110
01-06-2020 100 110
01-07-2020 100 110
01-08-2020 100 110
01-09-2020 100 121
01-10-2020 100 121
01-11-2020 100 121
01-12-2020 100 121
I think you need Grouper with a 4-month frequency and GroupBy.ngroup to number the groups, then get the 10% steps by multiplying the group numbers by 100, dividing by 10, and finally adding 100:
df.index = pd.to_datetime(df.index, dayfirst=True)
df['result'] = df.groupby(pd.Grouper(freq='4MS')).ngroup().mul(100).div(10).add(100)
print (df)
amount result
2020-01-01 100 100.0
2020-02-01 100 100.0
2020-03-01 100 100.0
2020-04-01 100 100.0
2020-05-01 100 110.0
2020-06-01 100 110.0
2020-07-01 100 110.0
2020-08-01 100 110.0
2020-09-01 100 120.0
2020-10-01 100 120.0
2020-11-01 100 120.0
2020-12-01 100 120.0
If the datetimes are consecutive and each period is always exactly 4 rows, it is possible to use:
df['result'] = np.arange(len(df)) // 4 * 100 / 10 + 100
print (df)
amount result
2020-01-01 100 100.0
2020-02-01 100 100.0
2020-03-01 100 100.0
2020-04-01 100 100.0
2020-05-01 100 110.0
2020-06-01 100 110.0
2020-07-01 100 110.0
2020-08-01 100 110.0
2020-09-01 100 120.0
2020-10-01 100 120.0
2020-11-01 100 120.0
2020-12-01 100 120.0
Here is another way:
pct = .1
df['result'] = df['amount'] * (1 + pct) ** (np.arange(len(df))//4)
Note that this exponent formula compounds the 10% each period, so it reproduces the 121 asked for in the question, whereas the Grouper answer above adds a flat 10 per period (100, 110, 120); there is nothing that needs to be subtracted per period.
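For reference, a quick check of the compounded formula on 12 monthly rows reproduces the 100/110/121 pattern from the question (a minimal sketch, with pct = 0.1 as above):
import numpy as np

pct = 0.1
print(100 * (1 + pct) ** (np.arange(12) // 4))
# 100, 100, 100, 100, 110, 110, 110, 110, 121, 121, 121, 121 (up to float rounding)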
I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surface and volume values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and volume per apartment. There are two conditions on the original dataframe:
- the dataframe can contain empty cells
- when the surface or volume values are equal for all rows within an ID (all the same
  values for the same ID), the data (surface, volume) is not summed; instead a single
  value/row is passed to the new summary column (example: 'ID 4'), as this could be a
  mistake in the original dataframe where the government employee entered the total
  surface/volume for every room
Initial dataframe 'data':
print(data)
ID Surface Volume
0 2 10.0 25.0
1 2 12.0 30.0
2 2 24.0 60.0
3 2 8.0 20.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 NaN NaN
8 52 96.0 240.0
9 95 8.0 20.0
10 95 6.0 15.0
11 95 12.0 30.0
12 95 30.0 75.0
13 95 12.0 30.0
Desired output from 'df':
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0 #-> as the values are the same for each row of this ID in the original data, the sum is not taken, but only one of the rows is passed (see the second condition)
2 52 96.0 240.0
3 95 68.0 170.0
Tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [2,4,52,95]})
data = pd.DataFrame({"ID": [2,2,2,2,4,4,4,52,52,95,95,95,95,95],
"Surface": [10,12,24,8,84,84,84,np.nan,96,8,6,12,30,12],
"Volume": [25,30,60,20,200,200,200,np.nan,240,20,15,30,75,30]})
print(data)
#Tried something, but no idea how to do this actually:
df["Surface"] = data.groupby("ID").agg(sum)
df["Volume"] = data.groupby("ID").agg(sum)
print(df)
Two conditions are necessary here. First, test whether each column has a single unique value per group, using GroupBy.transform with DataFrameGroupBy.nunique and comparing to 1 with eq. Second, use DataFrame.duplicated on each column together with the ID column.
Chain both masks with & for bitwise AND, replace the matched values with NaN via DataFrame.mask, and finally aggregate with sum:
cols = ['Surface','Volume']
# m1: True where the column has a single unique value within the ID group
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
# m2: True where the (value, ID) pair duplicates an earlier row
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0
2 52 96.0 240.0
3 95 68.0 170.0
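To see which rows the combined mask hides before summing, you can inspect it on the sample data; only the identical ID 4 duplicates are masked, while the repeated rows of ID 95 are kept because that ID's values are not all equal:
print(data[(m1 & m2).any(axis=1)])
   ID  Surface  Volume
5   4     84.0   200.0
6   4     84.0   200.0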
If you need new columns filled with the aggregated sum values instead, use GroupBy.transform:
cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
data[cols] = data[cols].mask(m1 & m2).groupby(data["ID"]).transform('sum')
print(data)
ID Surface Volume
0 2 54.0 135.0
1 2 54.0 135.0
2 2 54.0 135.0
3 2 54.0 135.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 96.0 240.0
8 52 96.0 240.0
9 95 68.0 170.0
10 95 68.0 170.0
11 95 68.0 170.0
12 95 68.0 170.0
13 95 68.0 170.0
First, I want to forward fill my data by 1S for EACH UNIQUE VALUE in Group_Id, so basically grouping by Group_Id and then resampling with ffill.
Here is the data:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
I previously did this:
def daily_average_temperature(dfdf):
    INDEX = dfdf[['Group_Id','Timestamp','Data']]
    INDEX['Timestamp'] = pd.to_datetime(INDEX['Timestamp'])
    INDEX = INDEX.set_index('Timestamp')
    INDEX1 = INDEX.resample('1S').last().fillna(method='ffill')
    return INDEX1
This is wrong, as it didn't group the data by the different values of Group_Id first, but rather ignored that column.
Second, I would like to spread the Data values so that each row is a Group_Id, with positional indices as columns replacing Timestamp; it should look something like this:
x0 x1 x2 x3 x4 x5 ... Group_Id
0 40 31.05 25.5 25.5 25.5 25 ... 1
1 35 35.75 36.5 36.5 36.5 36.5 ... 2
2 25.5 25.5 25.5 25.5 25.5 25.5 ... 3
3 25.5 25.5 25.5 25.5 25.5 25.5 ... 4
4 25 25 25 25 25 25 ... 5
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
Please note that this table above is not related to the previous dataset but just used to show the format.
Thanks
Use DataFrame.groupby with DataFrameGroupBy.resample:
def daily_average_temperature(dfdf):
dfdf['Timestamp']=pd.to_datetime(dfdf['Timestamp'])
dfdf = (dfdf.set_index('Timestamp')
.groupby('Group_Id')['Data']
.resample('1S')
.last()
.ffill()
.reset_index())
return dfdf
print (daily_average_temperature(dfdf))
Group_Id Timestamp Data
0 50 2018-01-03 00:00:15 125.5
1 52 2018-01-02 00:00:09 127.0
2 52 2018-01-02 00:00:10 127.0
3 52 2018-01-02 00:00:11 127.0
4 52 2018-01-02 00:00:12 127.0
5 52 2018-01-02 00:00:13 126.5
6 101 2018-01-01 00:00:05 125.0
7 120 2018-01-01 00:00:07 125.5
8 120 2018-01-01 00:00:08 125.0
9 300 2018-01-04 00:00:14 127.0
10 300 2018-01-04 00:00:15 126.5
11 350 2018-01-05 00:00:19 125.5
EDIT: This solution uses the minimal and maximal datetimes to build a date_range for DataFrame.reindex over the DatetimeIndex in the columns after reshaping with Series.unstack; back filling is also added if necessary:
def daily_average_temperature(dfdf):
dfdf['Timestamp']=pd.to_datetime(dfdf['Timestamp'])
#remove ms for minimal and maximal seconds in data
s = dfdf['Timestamp'].dt.floor('S')
dfdf = (dfdf.set_index('Timestamp')
.groupby('Group_Id')['Data']
.resample('1S')
.last()
.unstack()
.reindex(pd.date_range(s.min(),s.max(), freq='S'), axis=1, method='ffill')
.rename_axis('Timestamp', axis=1)
.bfill(axis=1)
.ffill(axis=1)
.stack()
.reset_index(name='Data')
)
return dfdf
df = daily_average_temperature(dfdf)
print (df['Group_Id'].value_counts())
350 345615
300 345615
120 345615
101 345615
52 345615
50 345615
Name: Group_Id, dtype: int64
Another solution is similar, only the date_range is specified from string values (not dynamically from min and max):
def daily_average_temperature(dfdf):
dfdf['Timestamp']=pd.to_datetime(dfdf['Timestamp'])
#remove ms for minimal and maximal seconds in data
s = dfdf['Timestamp'].dt.floor('S')
dfdf = (dfdf.set_index('Timestamp')
.groupby('Group_Id')['Data']
.resample('1S')
.last()
.unstack()
.reindex(pd.date_range('2018-01-01','2018-01-08', freq='S'),
axis=1, method='ffill')
.rename_axis('Timestamp', axis=1)
.bfill(axis=1)
.ffill(axis=1)
.stack()
.reset_index(name='Data')
)
return dfdf
df = daily_average_temperature(dfdf)
print (df['Group_Id'].value_counts())
350 604801
300 604801
120 604801
101 604801
52 604801
50 604801
Name: Group_Id, dtype: int64
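For the second part of the question (one row per Group_Id with the timestamps spread into columns), you can stop before the final stack and rename the columns. A minimal sketch built on the same groupby/resample; the x0, x1, ... column names are just illustrative:
import pandas as pd

def wide_per_group(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    wide = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()          # one row per Group_Id, one column per second
                .ffill(axis=1)
                .bfill(axis=1))
    wide.columns = [f'x{i}' for i in range(wide.shape[1])]
    return wide.reset_index()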
I am doing a classification problem in which I am trying to predict whether a car will be refuelled the following day.
The data consists of a date, an ID for every car, and the distance to the destination.
What I want is a variable that is lagged by 3 days, not by 3 rows per car_ID, since not every car_ID is present on every day. Therefore, the lag should be based on the date and not on the rows.
If there are less than 3 days of history, the result should be -1.
Currently, I have this piece of code, which I intended to lag every row by 3 days:
data['distance_to_destination'].groupby(data['car_ID']).shift(3).tolist()
But this only lags by a number of rows, not by a number of days.
What I want to achieve is the column "lag_dtd_3":
date car_ID distance_to_destination lag_dtd_3
01/01/2019 1 100 -1
01/01/2019 2 200 -1
02/01/2019 1 80 -1
02/01/2019 2 170 -1
02/01/2019 3 500 -1
03/01/2019 2 120 -1
05/01/2019 1 25 80
05/01/2019 2 75 170
06/01/2019 1 20 -1
06/01/2019 2 30 120
06/01/2019 3 120 -1
One solution to lag the information by 3 days is to offset the index instead of shifting rows:
# assumes 'date' has been parsed with pd.to_datetime and set as the index
pivot = data.pivot(columns='car_ID')
shifted = pivot.copy()
shifted.index = shifted.index + pd.DateOffset(days=3)  # lag the index instead of shifting rows
shifted.columns = shifted.columns.set_levels(['lag_dtd_3'], level=0)
output = pd.concat([pivot, shifted], axis=1).stack('car_ID').reset_index('car_ID')
output['lag_dtd_3'] = output['lag_dtd_3'].fillna(-1)
output = output.dropna()
Output:
car_ID distance_to_destination lag_dtd_3
date
2019-01-01 1 100.0 -1.0
2019-01-01 2 200.0 -1.0
2019-01-02 1 80.0 -1.0
2019-01-02 2 170.0 -1.0
2019-01-02 3 500.0 -1.0
2019-01-03 2 120.0 -1.0
2019-01-05 1 25.0 80.0
2019-01-05 2 75.0 170.0
2019-01-06 1 20.0 -1.0
2019-01-06 2 30.0 120.0
2019-01-06 3 120.0 -1.0
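An alternative sketch that avoids the pivot: self-merge the frame on (car_ID, date + 3 days), so a row only receives a lag value when an observation exactly 3 days earlier exists. Column names follow the question's table, and the date column is assumed to be day-first:
import pandas as pd

data['date'] = pd.to_datetime(data['date'], dayfirst=True)
lagged = (data.assign(date=data['date'] + pd.Timedelta(days=3))
              .rename(columns={'distance_to_destination': 'lag_dtd_3'}))
output = data.merge(lagged[['date', 'car_ID', 'lag_dtd_3']],
                    on=['date', 'car_ID'], how='left')
output['lag_dtd_3'] = output['lag_dtd_3'].fillna(-1)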
I have a DatetimeIndex-indexed dataframe with two columns. The index is unevenly spaced.
A B
Date
2016-01-04 1 20
2016-01-12 2 10
2016-01-21 3 10
2016-01-25 2 20
2016-02-08 2 30
2016-02-15 1 20
2016-02-21 3 20
2016-02-25 2 20
I want to compute the dot product of time-series A and B over a rolling window of length 20 days.
It should return:
dot
Date
2016-01-04 Nan
2016-01-12 Nan
2016-01-21 Nan
2016-01-25 90
2016-02-08 130
2016-02-15 80
2016-02-21 140
2016-02-25 180
Here is how these are obtained:
90 = 2*10+3*10+2*20 (products obtained in the period from 2016-01-06 to 2016-01-25 included)
130 = 3*10+2*20+2*30 (products obtained in the period from 2016-01-20 to 2016-02-08)
80 = 1*20+2*30 (products obtained in the period from 2016-01-27 to 2016-02-15)
140 = 3*20+1*20+2*30 (products obtained in the period from 2016-02-02 to 2016-02-21)
180 = 2*20+3*20+1*20+2*30 (products obtained in the period from 2016-02-06 to 2016-02-25)
The dot product is an example that should be generalizable to any function taking two series and returning a value.
I think this should work: df.product(axis=1) across the rows, then df.rolling(period).sum(). Note that this uses a row-count window rather than a 20-day time window.
Dates = pd.to_datetime(['2016-01-04',
'2016-01-12',
'2016-01-21',
'2016-01-25',
'2016-02-08',
'2016-02-15',
'2016-02-21',
'2016-02-25',
'2016-02-26'
]
)
data = {'A': [i*10 for i in range(1,10)], 'B': [i for i in range(1,10)]}
df1 = pd.DataFrame(data = data, index = Dates)
df2 = df1.product(axis=1).rolling(3).sum()
df2.name = 'Dot'  # df2 is a Series, so set its name rather than columns
df2
output
2016-01-04 NaN
2016-01-12 NaN
2016-01-21 140.0
2016-01-25 290.0
2016-02-08 500.0
2016-02-15 770.0
2016-02-21 1100.0
2016-02-25 1490.0
2016-02-26 1940.0
Name: Dot, dtype: float64
And if your data is daily and you want 20-day buckets first, group it by 20 days and sum it up (or take the last value, according to what you want):
Dates1 = pd.date_range(start='2016-03-31', end = '2016-07-31')
data1 = {'A': [np.pi * i * np.random.rand()
for i in range(1, len(Dates1) + 1)],
'B': [i * np.random.randn() * 10
for i in range(1, len(Dates1) + 1)]}
df3 = pd.DataFrame(data = data1, index = Dates1)
df3.groupby(pd.Grouper(freq='20d')).sum()  # pd.TimeGrouper is removed in modern pandas; use pd.Grouper
A B
2016-03-31 274.224084 660.144639
2016-04-20 1000.456615 -2403.034012
2016-05-10 1872.422495 -1737.571080
2016-05-30 2121.497529 1157.710510
2016-06-19 3084.569208 -1854.258668
2016-07-09 3324.775922 -9743.113805
2016-07-29 505.162678 -1179.730820
and then use the dot product like I did above.
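For a genuinely time-based window, rolling also accepts an offset string, which matches the 20-day requirement in the question directly. A minimal sketch that reconstructs the question's frame (early rows get partial-window values instead of NaN unless you also require a minimum amount of history); for the dot product, summing the element-wise product per window is enough, while an arbitrary function of two series would need a loop over the window slices (e.g. via df.index.searchsorted):
import pandas as pd

idx = pd.to_datetime(['2016-01-04', '2016-01-12', '2016-01-21', '2016-01-25',
                      '2016-02-08', '2016-02-15', '2016-02-21', '2016-02-25'])
df = pd.DataFrame({'A': [1, 2, 3, 2, 2, 1, 3, 2],
                   'B': [20, 10, 10, 20, 30, 20, 20, 20]}, index=idx)

# 20-day time-based window over the element-wise product
df['dot'] = (df['A'] * df['B']).rolling('20D').sum()
print(df['dot'].tail(5))  # 90, 130, 80, 140, 180 for the last five dates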