I have the following dataframe in pandas:
code date tank product time_frst time_lst qty_frst qty_lst
123 2019-01-01 1 MS 02:00:00 10:00:00 234 100
123 2019-01-01 2 HS 02:30:00 19:00:00 200 50
123 2019-01-01 3 MS 00:30:00 22:00:00 300 500
My desired dataframe is as follows
code date tank product time qty
123 2019-01-01 1 MS 02:00:00 234
123 2019-01-01 1 MS 10:00:00 100
123 2019-01-01 2 HS 02:30:00 200
123 2019-01-01 2 HS 19:00:00 50
123 2019-01-01 3 MS 00:30:00 300
123 2019-01-01 3 MS 22:00:00 500
How can I do it in pandas?
Set all columns whose names contain no _ as the index, so the remaining column names can be split on _ into a MultiIndex in the columns. Then reshape with DataFrame.stack. The first reset_index drops the now unnecessary level (frst/lst), and the second converts the MultiIndex in the index back to columns:
df = df.set_index(['code','date','tank','product'])
df.columns = df.columns.str.split('_', expand=True)
df = df.stack().reset_index(level=4, drop=True).reset_index()
print (df)
code date tank product qty time
0 123 2019-01-01 1 MS 234 02:00:00
1 123 2019-01-01 1 MS 100 10:00:00
2 123 2019-01-01 2 HS 200 02:30:00
3 123 2019-01-01 2 HS 50 19:00:00
4 123 2019-01-01 3 MS 300 00:30:00
5 123 2019-01-01 3 MS 500 22:00:00
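As an alternative sketch (not part of the answer above), pd.wide_to_long can perform the same reshape in one call. The sample frame is rebuilt from the question, and the level name pos is a hypothetical label for the frst/lst suffix:

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'code': [123, 123, 123],
    'date': ['2019-01-01'] * 3,
    'tank': [1, 2, 3],
    'product': ['MS', 'HS', 'MS'],
    'time_frst': ['02:00:00', '02:30:00', '00:30:00'],
    'time_lst': ['10:00:00', '19:00:00', '22:00:00'],
    'qty_frst': [234, 200, 300],
    'qty_lst': [100, 50, 500],
})

keys = ['code', 'date', 'tank', 'product']

long_df = (
    # 'pos' (hypothetical name) receives the suffix after '_': 'frst' or 'lst'
    pd.wide_to_long(df, stubnames=['time', 'qty'], i=keys, j='pos',
                    sep='_', suffix=r'\w+')
    .reset_index()
    # 'frst' sorts before 'lst', so each tank keeps first-then-last order
    .sort_values(['tank', 'pos'])
    .loc[:, keys + ['time', 'qty']]
    .reset_index(drop=True)
)
print(long_df)
```

This also yields the column order time, qty requested in the question, whereas the stack approach sorts the new columns alphabetically.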
I'd like to change my dataframe adding time intervals for every hour during a month
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 00:01:00
2 1 2 2020-01-01 00:02:00
...
2230 5 7 2020-01-31 00:22:00
2231 5 7 2020-01-31 00:23:00
where 2231 = out_rows_number-1 = month_days_number*hours_per_day*orig_rows_number - 1
What is the proper way to perform it?
Use a cross join via DataFrame.merge with a new DataFrame that holds every hour of the month, created by date_range:
df1 = pd.DataFrame({'a':1,
'time':pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
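On pandas 1.2 or newer, the helper column a is not needed, because merge accepts how='cross' directly. A minimal sketch, assuming the sample frame from the question (the name hours is made up for illustration):

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({'money': [1, 4, 5], 'food': [2, 5, 7]})

# One row per hour of January 2020 (31 days * 24 hours = 744 rows)
hours = pd.DataFrame(
    {'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')}
)

# how='cross' pairs every row of df with every row of hours
out = df.merge(hours, how='cross')
print(out)
```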
First, I want to forward fill my data for EACH UNIQUE VALUE in Group_Id by 1S, so basically grouping by Group_Id then resample using ffill.
Here is the data:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
I previously did this:
def daily_average_temperature(dfdf):
    INDEX = dfdf[['Group_Id','Timestamp','Data']]
    INDEX['Timestamp']=pd.to_datetime(INDEX['Timestamp'])
    INDEX = INDEX.set_index('Timestamp')
    INDEX1 = INDEX.resample('1S').last().fillna(method='ffill')
    return INDEX1
This is wrong because it does not group the data by Group_Id first; it resamples all rows together and ignores that column.
Second, I would like to spread the Data values so each row is a group_id with index as columns replacing Timestamp, looks something like this:
x0 x1 x2 x3 x4 x5 ... Group_Id
0 40 31.05 25.5 25.5 25.5 25 ... 1
1 35 35.75 36.5 36.5 36.5 36.5 ... 2
2 25.5 25.5 25.5 25.5 25.5 25.5 ... 3
3 25.5 25.5 25.5 25.5 25.5 25.5 ... 4
4 25 25 25 25 25 25 ... 5
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
Please note that this table above is not related to the previous dataset but just used to show the format.
Thanks
Use DataFrame.groupby with DataFrameGroupBy.resample:
def daily_average_temperature(dfdf):
    dfdf['Timestamp']=pd.to_datetime(dfdf['Timestamp'])
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .ffill()
                .reset_index())
    return dfdf
print (daily_average_temperature(dfdf))
Group_Id Timestamp Data
0 50 2018-01-03 00:00:15 125.5
1 52 2018-01-02 00:00:09 127.0
2 52 2018-01-02 00:00:10 127.0
3 52 2018-01-02 00:00:11 127.0
4 52 2018-01-02 00:00:12 127.0
5 52 2018-01-02 00:00:13 126.5
6 101 2018-01-01 00:00:05 125.0
7 120 2018-01-01 00:00:07 125.5
8 120 2018-01-01 00:00:08 125.0
9 300 2018-01-04 00:00:14 127.0
10 300 2018-01-04 00:00:15 126.5
11 350 2018-01-05 00:00:19 125.5
EDIT: This solution reshapes with Series.unstack so that the timestamps become a DatetimeIndex in the columns, then uses DataFrame.reindex with a date_range from the minimal to the maximal datetime in the data. Back filling is also added in case a group starts with missing values:
def daily_average_temperature(dfdf):
    dfdf['Timestamp']=pd.to_datetime(dfdf['Timestamp'])
    #remove ms for minimal and maximal seconds in data
    s = dfdf['Timestamp'].dt.floor('S')
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range(s.min(),s.max(), freq='S'), axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data')
            )
    return dfdf
df = daily_average_temperature(dfdf)
print (df['Group_Id'].value_counts())
350 345615
300 345615
120 345615
101 345615
52 345615
50 345615
Name: Group_Id, dtype: int64
Another solution is similar, except that the date_range bounds are given as fixed strings instead of being derived from the minimum and maximum of the data:
def daily_average_temperature(dfdf):
    dfdf['Timestamp']=pd.to_datetime(dfdf['Timestamp'])
    #remove ms for minimal and maximal seconds in data
    s = dfdf['Timestamp'].dt.floor('S')
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range('2018-01-01','2018-01-08', freq='S'),
                         axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data')
            )
    return dfdf
df = daily_average_temperature(dfdf)
print (df['Group_Id'].value_counts())
350 604801
300 604801
120 604801
101 604801
52 604801
50 604801
Name: Group_Id, dtype: int64
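The second part of the question (one row per Group_Id, with columns x0, x1, ...) is not covered by the code above. One possible sketch, using two groups from the sample data: number the resampled rows inside each group with cumcount, then pivot. The names long_df, step and wide are made up for illustration, and groups with fewer samples end up padded with NaN:

```python
import pandas as pd

# Two groups taken from the sample data in the question
dfdf = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2018-01-01 00:00:05.523', '2018-01-01 00:00:05.757',
        '2018-01-02 00:00:09.507', '2018-01-02 00:00:13.743',
    ]),
    'Data': [125.5, 125.0, 127.0, 126.5],
    'Group_Id': [101, 101, 52, 52],
})

# Per-group 1-second resample with forward fill, as in the answer
long_df = (dfdf.set_index('Timestamp')
               .groupby('Group_Id')['Data']
               .resample('1s')
               .last()
               .ffill()
               .reset_index())

# Position of each resampled row inside its group: 0, 1, 2, ...
long_df['step'] = long_df.groupby('Group_Id').cumcount()

# One row per group, one column per position
wide = (long_df.pivot(index='Group_Id', columns='step', values='Data')
               .add_prefix('x')
               .reset_index())
print(wide)
```

Keeping step as an integer until after the pivot keeps the columns in numeric order (x2 before x10), which string labels would not.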
I have the time series dataframe below.
I want to delete rows based on a per-day condition: if aaa > 100 anywhere in a day, delete all rows of that day (in the sample below, delete all 2015-12-01 rows because the last 3 values of the aaa column are 1000).
....
date time aaa
2015-12-01,00:00:00,0
2015-12-01,00:15:00,0
2015-12-01,00:30:00,0
2015-12-01,00:45:00,0
2015-12-01,01:00:00,0
2015-12-01,01:15:00,0
2015-12-01,01:30:00,0
2015-12-01,01:45:00,0
2015-12-01,02:00:00,0
2015-12-01,02:15:00,0
2015-12-01,02:30:00,0
2015-12-01,02:45:00,0
2015-12-01,03:00:00,0
2015-12-01,03:15:00,0
2015-12-01,03:30:00,0
2015-12-01,03:45:00,0
2015-12-01,04:00:00,0
2015-12-01,04:15:00,0
2015-12-01,04:30:00,0
2015-12-01,04:45:00,0
2015-12-01,05:00:00,0
2015-12-01,05:15:00,0
2015-12-01,05:30:00,0
2015-12-01,05:45:00,0
2015-12-01,06:00:00,0
2015-12-01,06:15:00,0
2015-12-01,06:30:00,1000
2015-12-01,06:45:00,1000
2015-12-01,07:00:00,1000
....
How can I do it?
If you have a MultiIndex, first compare the values of aaa against the condition and use boolean indexing on the first index level to collect the dates that match. Then filter the frame with isin on that level, inverting the mask with ~:
print (df)
aaa
date time
2015-12-01 00:00:00 0
00:15:00 0
00:30:00 0
00:45:00 0
2015-12-02 05:00:00 0
05:15:00 200
05:30:00 0
05:45:00 0
2015-12-03 06:00:00 0
06:15:00 0
06:30:00 1000
06:45:00 1000
07:00:00 1000
lvl0 = df.index.get_level_values(0)
idx = lvl0[df['aaa'].gt(100)].unique()
print (idx)
Index(['2015-12-02', '2015-12-03'], dtype='object', name='date')
df = df[~lvl0.isin(idx)]
print (df)
aaa
date time
2015-12-01 00:00:00 0
00:15:00 0
00:30:00 0
00:45:00 0
If date is a regular column rather than an index level, compare the date column instead:
print (df)
date time aaa
0 2015-12-01 00:00:00 0
1 2015-12-01 00:15:00 0
2 2015-12-01 00:30:00 0
3 2015-12-01 00:45:00 0
4 2015-12-02 05:00:00 0
5 2015-12-02 05:15:00 200
6 2015-12-02 05:30:00 0
7 2015-12-02 05:45:00 0
8 2015-12-03 06:00:00 0
9 2015-12-03 06:15:00 0
10 2015-12-03 06:30:00 1000
11 2015-12-03 06:45:00 1000
12 2015-12-03 07:00:00 1000
idx = df.loc[df['aaa'].gt(100), 'date'].unique()
print (idx)
['2015-12-02' '2015-12-03']
df = df[~df['date'].isin(idx)]
print (df)
date time aaa
0 2015-12-01 00:00:00 0
1 2015-12-01 00:15:00 0
2 2015-12-01 00:30:00 0
3 2015-12-01 00:45:00 0
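The same filter can also be written with groupby and transform, which avoids building the intermediate list of dates. This is a sketch of an alternative, assuming date is a regular column as in the second example; daily_max and filtered are illustrative names:

```python
import pandas as pd

# Sample data rebuilt from the second example above
df = pd.DataFrame({
    'date': ['2015-12-01'] * 4 + ['2015-12-02'] * 4 + ['2015-12-03'] * 5,
    'time': ['00:00:00', '00:15:00', '00:30:00', '00:45:00',
             '05:00:00', '05:15:00', '05:30:00', '05:45:00',
             '06:00:00', '06:15:00', '06:30:00', '06:45:00', '07:00:00'],
    'aaa': [0, 0, 0, 0, 0, 200, 0, 0, 0, 0, 1000, 1000, 1000],
})

# Broadcast each day's maximum of aaa back onto every row of that day
daily_max = df.groupby('date')['aaa'].transform('max')

# Keep only the days whose maximum never exceeds 100
filtered = df[daily_max.le(100)]
print(filtered)
```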
I'd like to find faster code to achieve the same goal: for each row, compute the median of all data in the past 30 days. If there are fewer than 5 data points in that window, return np.nan.
import pandas as pd
import numpy as np
import datetime
def findPastVar(df, var='var' ,window=30, method='median'):
    # window= # of past days
    def findPastVar_apply(row):
        pastVar = df[var].loc[(df['timestamp'] - row['timestamp'] < datetime.timedelta(days=0)) & (df['timestamp'] - row['timestamp'] > datetime.timedelta(days=-window))]
        if len(pastVar) < 5:
            return(np.nan)
        if method == 'median':
            return(np.median(pastVar.values))
    df['past{}d_{}_median'.format(window,var)] = df.apply(findPastVar_apply,axis=1)
    return(df)
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
Data looks like this. In my real data, there are gaps in time and maybe more data points in one day.
In [47]: df.head()
Out[47]:
timestamp var
0 2011-01-01 00:00:00 -0.670695
1 2011-01-02 00:00:00 0.315148
2 2011-01-03 00:00:00 -0.717432
3 2011-01-04 00:00:00 2.904063
4 2011-01-05 00:00:00 -1.092813
Desired output:
In [55]: df.head(10)
Out[55]:
timestamp var past30d_var_median
0 2011-01-01 00:00:00 -0.670695 NaN
1 2011-01-02 00:00:00 0.315148 NaN
2 2011-01-03 00:00:00 -0.717432 NaN
3 2011-01-04 00:00:00 2.904063 NaN
4 2011-01-05 00:00:00 -1.092813 NaN
5 2011-01-06 00:00:00 -2.676784 -0.670695
6 2011-01-07 00:00:00 -0.353425 -0.694063
7 2011-01-08 00:00:00 -0.223442 -0.670695
8 2011-01-09 00:00:00 0.162126 -0.512060
9 2011-01-10 00:00:00 0.633801 -0.353425
However, my current code is slow:
In [49]: %timeit findPastVar(df)
1 loop, best of 3: 755 ms per loop
I need to run a large dataframe from time to time, so I want to optimize this code.
Any suggestions or comments are welcome.
New in pandas 0.19 is time aware rolling. It can deal with missing data.
Code:
print(df.rolling('30d', on='timestamp', min_periods=5)['var'].median())
Test Code:
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=60, freq='D')
# date_range already yields datetime64 values, so no cast to pd.Timestamp is needed
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
# duplicate one sample
df.loc[50, 'timestamp'] = df.loc[51, 'timestamp']
# drop some data
df = df.drop(range(15, 50))
df['median'] = df.rolling(
'30d', on='timestamp', min_periods=5)['var'].median()
Results:
timestamp var median
0 2011-01-01 00:00:00 -0.639901 NaN
1 2011-01-02 00:00:00 -1.212541 NaN
2 2011-01-03 00:00:00 1.015730 NaN
3 2011-01-04 00:00:00 -0.203701 NaN
4 2011-01-05 00:00:00 0.319618 -0.203701
5 2011-01-06 00:00:00 1.272088 0.057958
6 2011-01-07 00:00:00 0.688965 0.319618
7 2011-01-08 00:00:00 -1.028438 0.057958
8 2011-01-09 00:00:00 1.418207 0.319618
9 2011-01-10 00:00:00 0.303839 0.311728
10 2011-01-11 00:00:00 -1.939277 0.303839
11 2011-01-12 00:00:00 1.052173 0.311728
12 2011-01-13 00:00:00 0.710270 0.319618
13 2011-01-14 00:00:00 1.080713 0.504291
14 2011-01-15 00:00:00 1.192859 0.688965
50 2011-02-21 00:00:00 -1.126879 NaN
51 2011-02-21 00:00:00 0.213635 NaN
52 2011-02-22 00:00:00 -1.357243 NaN
53 2011-02-23 00:00:00 -1.993216 NaN
54 2011-02-24 00:00:00 1.082374 -1.126879
55 2011-02-25 00:00:00 0.124840 -0.501019
56 2011-02-26 00:00:00 -0.136822 -0.136822
57 2011-02-27 00:00:00 -0.744386 -0.440604
58 2011-02-28 00:00:00 -1.960251 -0.744386
59 2011-03-01 00:00:00 0.041767 -0.440604
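Note that the default time-based window includes the current row, while the function in the question only looks at strictly earlier rows (which is why the first non-NaN value appears at row 4 here but at row 5 in the desired output). A sketch of one way to match the question, assuming a pandas version whose rolling accepts the closed argument for time-based windows; closed='neither' excludes both the current timestamp and the point exactly 30 days back:

```python
import numpy as np
import pandas as pd

# Deterministic toy data: ten daily points with var = 0, 1, ..., 9
df = pd.DataFrame({
    'timestamp': pd.date_range('2011-01-01', periods=10, freq='D'),
    'var': np.arange(10, dtype=float),
})

# Window is the open interval (t - 30 days, t): strictly past rows only
past_median = (
    df.rolling('30d', on='timestamp', min_periods=5, closed='neither')['var']
      .median()
)
df['past30d_var_median'] = past_median
print(df)
```

With this toy data the first five rows are NaN, since row 5 is the first one with five earlier observations.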
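You can try rolling_median, an O(N log(window)) implementation that uses a skip list:
pd.rolling_median(df,window= 30,min_periods=5)
Note that pd.rolling_median was deprecated in pandas 0.18 and later removed; the equivalent call with the rolling method is:
df['var'].rolling(window=30, min_periods=5).median()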
When I use this syntax, it creates a Series rather than adding a column to my new dataframe sum.
My code:
sum = data['variance'] = data.budget + data.actual
My dataframe data currently has everything except the budget - actual column. How do I create a variance column?
cluster date budget actual budget - actual
0 a 2014-01-01 00:00:00 11000 10000 1000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
I think you've misunderstood some Python syntax. The following does two assignments:
In [11]: a = b = 1
In [12]: a
Out[12]: 1
In [13]: b
Out[13]: 1
So in your code it was as if you were doing:
sum = df['budget'] + df['actual'] # a Series
# and
df['variance'] = df['budget'] + df['actual'] # assigned to a column
The latter creates a new column for df:
In [21]: df
Out[21]:
cluster date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
In [22]: df['variance'] = df['budget'] + df['actual']
In [23]: df
Out[23]:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
As an aside, you shouldn't use sum as a variable name, as that overrides the built-in sum function.
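To make the chained assignment concrete, here is a small sketch. It uses subtraction, on the assumption (taken from the "budget - actual" header in the question) that the variance is meant to be a difference; the code above follows the question's own code and adds instead. The name total is illustrative and replaces sum:

```python
import pandas as pd

# First three rows of the sample data
df = pd.DataFrame({'budget': [11000, 1200, 200],
                   'actual': [10000, 1000, 100]})

# The right-hand side is evaluated once, then bound to both targets:
# the name `total` (a Series) and the new column df['variance']
total = df['variance'] = df['budget'] - df['actual']

print(type(total))
print(df)
```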
df['variance'] = df.loc[:,['budget','actual']].sum(axis=1)
You could also use the .add() function:
df.loc[:,'variance'] = df.loc[:,'budget'].add(df.loc[:,'actual'])
Same thing can be done using lambda function.
Here I am reading the data from a xlsx file.
import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name = 4)
print(df)
Output:
cluster Unnamed: 1 date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
Sum two columns into 3rd new one.
df['variance'] = df.apply(lambda x: x['budget'] + x['actual'], axis=1)
print(df)
Output:
cluster Unnamed: 1 date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
If "budget" has any NaN values but you don't want the result to be NaN, then try:
import math
import numpy as np
def fun (b, a):
    # treat a missing budget as 0 and return the actual value alone
    if math.isnan(b):
        return a
    else:
        return b + a
f = np.vectorize(fun, otypes=[float])
df['variance'] = f(df['budget'], df['actual'])
This is the most elegant solution; it follows DRY and works well. Note the double brackets, which select a list of columns:
dataframe_name[['col1', 'col2', 'col3']].sum(axis = 1, skipna = True)
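The same rule (a missing budget counts as 0, a missing actual still gives NaN) can also be expressed without np.vectorize by filling the budget column first. A minimal sketch with made-up sample values:

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing budget value
df = pd.DataFrame({'budget': [11000, np.nan, 200],
                   'actual': [10000, 1000, 100]})

# Replace NaN in budget with 0 before adding, so the row falls back to actual
df['variance'] = df['budget'].fillna(0) + df['actual']
print(df)
```

This stays vectorized, whereas np.vectorize calls the Python function once per row.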
Thank you.
eval lets you sum and create columns right away:
In [12]: data.eval('variance = budget + actual', inplace=True)
In [13]: data
Out[13]:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
Since inplace=True you don't need to assign it back to data.
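Without inplace=True, eval leaves the original frame untouched and returns a new DataFrame with the extra column. A short sketch on the first three rows of the sample data (result is an illustrative name):

```python
import pandas as pd

# First three rows of the sample data
data = pd.DataFrame({'budget': [11000, 1200, 200],
                     'actual': [10000, 1000, 100]})

# Returns a copy with the new column; `data` itself is unchanged
result = data.eval('variance = budget + actual')

print(data.columns.tolist())
print(result)
```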