So, I have two data frames, where the first one has the following structure:
'ds', '1_sensor_id', '1_val_1', '1_val_2'
0 2019-09-13 12:40:00 33469 30 43
1 2019-09-13 12:45:00 33469 43 43
The second one has the following structure:
'ds', '2_sensor_id', '2_val_1', '2_val_2'
0 2019-09-13 12:42:00 20006 6 50
1 2019-09-13 12:47:00 20006 5 80
What I want to do is merge the two pandas frames through interpolation. Ultimately, the merged frame should have a row for each timestamp in the ds column of frame 1, with the 2_val_1 and 2_val_2 columns interpolated at those timestamps. What would be the best way to do this in pandas? I tried the merge_asof function, but it does nearest-neighbour matching rather than interpolation, and I did not get all the timestamps back.
You can append one frame to the other and use interpolate(). Example:
import datetime
import pandas as pd
df1 = pd.DataFrame(columns=['ds', '1_sensor_id', '1_val_1', '1_val_2'],
                   data=[[datetime.datetime(2019, 9, 13, 12, 40, 00), 33469, 30, 43],
                         [datetime.datetime(2019, 9, 13, 12, 45, 00), 33469, 33, 43]])
df2 = pd.DataFrame(columns=['ds', '2_sensor_id', '2_val_1', '2_val_2'],
                   data=[[datetime.datetime(2019, 9, 13, 12, 42, 00), 20006, 6, 50],
                         [datetime.datetime(2019, 9, 13, 12, 47, 00), 20006, 5, 80]])
df = df1.append(df2, sort=False)
df.set_index('ds', inplace=True)
df.interpolate(method='time', limit_direction='backward', inplace=True)
print(df)
1_sensor_id 1_val_1 ... 2_val_1 2_val_2
ds ...
2019-09-13 12:40:00 33469.0 30.0 ... 6.0 50.0
2019-09-13 12:45:00 33469.0 33.0 ... 5.4 68.0
2019-09-13 12:42:00 NaN NaN ... 6.0 50.0
2019-09-13 12:47:00 NaN NaN ... 5.0 80.0
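Note that DataFrame.append was removed in pandas 2.0, so on current versions pd.concat takes its place. If, as in the question, you only want rows for the timestamps of frame 1, a minimal follow-up sketch (assuming the df1 and df2 defined above) is to interpolate on the sorted union of timestamps and then select frame 1's timestamps back out:
import pandas as pd
# combine both frames (pd.concat replaces the removed DataFrame.append) and sort by time
merged = pd.concat([df1, df2], sort=False).set_index('ds').sort_index()
# interpolate only the sensor-2 columns on the time axis; 'both' also fills the leading gap
merged[['2_val_1', '2_val_2']] = merged[['2_val_1', '2_val_2']].interpolate(method='time', limit_direction='both')
# keep only the timestamps that appear in frame 1
result = merged.loc[df1['ds']]
print(result)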
Related
Hey, a Python newbie here.
Suppose I have the first two columns of this dataframe:
df = pd.DataFrame({'group': ["Sun", "Moon", "Sun", "Moon", "Mars", "Mars"],
                   'score': [2, 13, 24, 15, 11, 44],
                   'datetime': ["2017-08-30 07:00:00", "2017-08-30 08:00:00", "2017-08-31 07:00:00", "2017-08-31 08:00:00", "2017-08-29 21:00:00", "2017-08-28 21:00:00"],
                   'difference': [2, 13, 22, 2, -33, 44]})
I want to create a new column named difference (I have put it there as an illustration), such that it equals:
the score value in that row minus the score value of the day before at the same hour, for that group.
E.g. the difference in row 3 is equal to:
the score in that row minus the score on the day before (the 30th) at 08:00:00 for that group (i.e. Moon), i.e. 15 - 13 = 2. If there is no row for the day before at the same time, the score of that row itself is taken (e.g. in row 0, for time 2017-08-30 07:00:00 there is no 2017-08-29 07:00:00, hence only the 2 is taken).
I write the following:
df['datetime'] = pd.to_datetime(df['datetime'])
before = df['datetime'] - pd.DateOffset(days=1)
df['difference'] = df.groupby(["group", "datetime"])['score'].sub(
    before.map(df.set_index('datetime')['score']), fill_value=0)
but I get the error:
AttributeError: 'SeriesGroupBy' object has no attribute 'sub'
What am I missing? Is there a more elegant solution?
MultiIndex.map
We can set the group column along with the before column as the index of the dataframe, map that MultiIndex against the score values indexed by (group, datetime), and then subtract the mapped scores from the score column to calculate the difference.
s = df.set_index(['group', before]).index.map(df.set_index(['group', 'datetime'])['score'])
df['difference'] = df['score'].sub(list(s), fill_value=0)
>>> df
group score datetime difference
0 Sun 2 2017-08-30 07:00:00 2.0
1 Moon 13 2017-08-30 08:00:00 13.0
2 Sun 24 2017-08-31 07:00:00 22.0
3 Moon 15 2017-08-31 08:00:00 2.0
4 Mars 11 2017-08-29 21:00:00 -33.0
5 Mars 44 2017-08-28 21:00:00 44.0
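For reference, a self-contained sketch of the same approach, with the before series from the question folded in (no new technique, just everything in one place):
import pandas as pd
df = pd.DataFrame({'group': ["Sun", "Moon", "Sun", "Moon", "Mars", "Mars"],
                   'score': [2, 13, 24, 15, 11, 44],
                   'datetime': pd.to_datetime(["2017-08-30 07:00:00", "2017-08-30 08:00:00",
                                               "2017-08-31 07:00:00", "2017-08-31 08:00:00",
                                               "2017-08-29 21:00:00", "2017-08-28 21:00:00"])})
# timestamp of the same hour on the previous day
before = df['datetime'] - pd.DateOffset(days=1)
# look up each row's (group, previous-day timestamp) in a (group, datetime) -> score mapping
s = df.set_index(['group', before]).index.map(df.set_index(['group', 'datetime'])['score'])
# subtract, treating rows without a previous-day score as 0
df['difference'] = df['score'].sub(list(s), fill_value=0)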
I've got the following pandas.DataFrame and would like to calculate a new column containing the timedelta between consecutive timestamps in the multi-index level Timestamp:
import pandas as pd
import numpy as np
data = {'Timestamp': [12, 12, 12, 22, 22, 22, 44, 44, 66, 102],
        'Customer': ['bmw', 'vw', 'vw', 'bmw', 'vw', 'vw', 'vw', 'vw', 'bmw', 'bmw'],
        'Series': ['series1', 'series1', 'series2', 'series1', 'series1', 'series2', 'series1', 'series2', 'series2', 'series1'],
        'time_delta': [np.nan, np.nan, np.nan, 10, 10, 10, 22, 22, 22, 36]}
df = pd.DataFrame(data).set_index(['Timestamp', 'Customer', 'Series'])
The column time_delta is the desired output I would like to achieve. I struggle somewhat because I cannot use the pandas.Series.diff() function, as the periods are not consistent. I want to do the timestamp delta calculation on the Timestamp level of the dataframe, but assign the result to all rows of that level. So for the first Timestamp value 12 there is no preceding timestamp, thus all rows for this timestamp are filled with np.nan. For the next timestamp 22, I can take the delta to 12 (which is 10) and fill it in for all rows of timestamp 22.
Let's try extracting the level values and calculating the difference from there:
df['time_delta'] = df.index.get_level_values('Timestamp')
s = df['time_delta'].diff()
df['time_delta'] = s.where(s>0).ffill()
Output:
time_delta
Timestamp Customer Series
12 bmw series1 NaN
vw series1 NaN
series2 NaN
22 bmw series1 10.0
vw series1 10.0
series2 10.0
44 vw series1 22.0
series2 22.0
66 bmw series2 22.0
102 bmw series1 36.0
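An equivalent sketch (assuming the df constructed in the question) computes the delta once per unique timestamp and maps it back to every row of that timestamp, which some may find more direct:
import pandas as pd
ts = df.index.get_level_values('Timestamp')
# delta between consecutive unique timestamps, keyed by the timestamp itself
deltas = pd.Series(ts.unique()).diff().set_axis(ts.unique())
# broadcast each timestamp's delta back to all rows of that timestamp
df['time_delta'] = ts.map(deltas)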
This question already has an answer here:
Pandas asfreq with weekly frequency
(1 answer)
Closed 2 years ago.
I create the following DataFrame:
import pandas as pd
d = {'T': [1, 2, 4, 15], 'H': [3, 4, 6, 8]}
df = pd.DataFrame(data=d, index=['10.09.2018 13:15:00','10.09.2018 13:30:00', '10.09.2018 14:00:00', '10.09.2018 22:00:00'])
df.index = pd.to_datetime(df.index)
And get the following result.
Out[30]:
T H
2018-10-09 13:15:00 1 3
2018-10-09 13:30:00 2 4
2018-10-09 14:00:00 4 6
2018-10-09 22:00:00 15 8
As you can see, one value is missing at 13:45:00 and many values are missing between 14:00 and 22:00.
Is there a way to automatically find the missing timestamps and insert rows with NaN values for them?
I want to achieve this:
Out[30]:
T H
2018-10-09 13:15:00 1 3
2018-10-09 13:30:00 2 4
2018-10-09 13:45:00 nan nan
2018-10-09 14:00:00 4 6
2018-10-09 14:15:00 nan nan
...
2018-10-09 21:45:00 nan nan
2018-10-09 22:00:00 15 8
You can create a second dataframe with the correct timestep as its index and join it with the original data. The following code worked in my case:
# your code
import pandas as pd
d = {'T': [1, 2, 4, 15], 'H': [3, 4, 6, 8]}
df = pd.DataFrame(data=d, index=['10.09.2018 13:15:00','10.09.2018 13:30:00', '10.09.2018 14:00:00', '10.09.2018 22:00:00'])
df.index = pd.to_datetime(df.index)
# generate second dataframe with needed index
timerange = pd.date_range('10.09.2018 13:15:00', periods=40, freq='15min')
df2 = pd.DataFrame(index=timerange)
# join the original dataframe with the new one
newdf = df.join(df2, how='outer')
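Alternatively, since the timestamps fall on a regular 15-minute grid, asfreq (the approach in the linked duplicate) gives the same result without building a second dataframe; a minimal sketch, assuming the df defined above:
# reindex onto a regular 15-minute grid; timestamps that were not present become NaN rows
newdf = df.asfreq('15min')
# equivalently: newdf = df.resample('15min').asfreq()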
I am using Python 3.6.4 and pandas 0.23.0. I have consulted the pandas 0.23.0 documentation for the DataFrame constructor and for append, but it does not mention anything about non-existent values, and I didn't find any similar example.
Consider following code:
import pandas as pd
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
index_yrs = [2016, 2017, 2018]
r2016 = [26, 27, 25, 22, 20, 23, 22, 20, 20, 18, 18, 19]
r2017 = [20, 21, 18, 16, 15, 15, 15, 15, 13, 13, 14, 15]
r2018 = [16, 18, 18, 18, 17]
df = pd.DataFrame([r2016], columns = months, index = [index_yrs[0]])
df = df.append(pd.DataFrame([r2017], columns = months, index = [index_yrs[1]]))
Now, how do I add r2018, which only has data through May?
I agree with RafaelC that padding your list for 2018 data with NaNs for missing values is the best way to do this. You can use np.nan from Numpy (which you will already have installed since you have Pandas) to generate NaNs.
import pandas as pd
import numpy as np
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
index_yrs = [2016, 2017, 2018]
As a small change to your code I've put data for all three years into a years list which we can pass as the data parameter for pd.DataFrame. This eliminates the need to append each row to the previous ones.
r2016 = [26, 27, 25, 22, 20, 23, 22, 20, 20, 18, 18, 19]
r2017 = [20, 21, 18, 16, 15, 15, 15, 15, 13, 13, 14, 15]
r2018 = [16, 18, 18, 18, 17]
years = [r2016] + [r2017] + [r2018]
This is what years looks like: [[26, 27, 25, 22, 20, 23, 22, 20, 20, 18, 18, 19],
[20, 21, 18, 16, 15, 15, 15, 15, 13, 13, 14, 15],
[16, 18, 18, 18, 17]].
As for padding your year 2018 with NaNs something like this might do the trick. We are just ensuring that if a year only has values for the first n months that the remaining months will be filled out with NaNs.
for year in years:
    if len(year) < 12:
        year.extend([np.nan] * (12 - len(year)))
Finally we can create your dataframe using the one liner below instead of appending row by row.
df = pd.DataFrame(years, columns=months, index=index_yrs).astype(float)
Output:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2016 26.0 27.0 25.0 22.0 20.0 23.0 22.0 20.0 20.0 18.0 18.0 19.0
2017 20.0 21.0 18.0 16.0 15.0 15.0 15.0 15.0 13.0 13.0 14.0 15.0
2018 16.0 18.0 18.0 18.0 17.0 NaN NaN NaN NaN NaN NaN NaN
You may notice that I converted the dtype of the values in the dataframe to float using .astype(float). I did this to make all of your columns the same dtype. If we don't call .astype(float), then Jan-May will be dtype int and Jun-Dec will be dtype float64.
You can add a row using pd.DataFrame.loc via a Series, so you only need to convert your list into a pd.Series object before adding the row:
df.loc[index_yrs[2]] = pd.Series(r2018, index=df.columns[:len(r2018)])
print(df)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2016 26.0 27.0 25.0 22.0 20.0 23.0 22.0 20.0 20.0 18.0 18.0 19.0
2017 20.0 21.0 18.0 16.0 15.0 15.0 15.0 15.0 13.0 13.0 14.0 15.0
2018 16.0 18.0 18.0 18.0 17.0 NaN NaN NaN NaN NaN NaN NaN
However, I strongly recommend you form a list of lists (with padding) before a single append. This is because list.append, or construction via a list comprehension, is cheap relative to repeated pd.DataFrame.append or pd.DataFrame.loc.
The above solution is recommended if you absolutely must add one row at a time.
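For completeness, a minimal sketch of that recommendation, reusing the padding idea from the answer above (assumes months, index_yrs, the r2016-r2018 lists, and numpy imported as np):
# pad each year's list out to 12 entries, then build the frame in a single constructor call
rows = [r2016, r2017, r2018]
padded = [r + [np.nan] * (12 - len(r)) for r in rows]
df = pd.DataFrame(padded, columns=months, index=index_yrs)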
I am very new to pandas and trying to get the row index for any value higher than lprice. Can someone give me a quick idea of what I am doing wrong?
Dataframe
StrikePrice
0 40.00
1 50.00
2 60.00
3 70.00
4 80.00
5 90.00
6 100.00
7 110.00
8 120.00
9 130.00
10 140.00
11 150.00
12 160.00
13 170.00
14 180.00
15 190.00
16 200.00
17 210.00
18 220.00
19 230.00
20 240.00
Now I am trying to figure out how to get the row index for any value which is higher than lprice:
lprice = 99
for strike in df['StrikePrice']:
    strike = float(strike)
    # print(strike)
    if strike >= lprice:
        print('The high strike is:' + str(strike))
        ce_1 = strike
        print(df.index['StrikePrice' == ce_1])
The above gives 0 as the index
I am not sure what I am doing wrong here.
Use the index attribute after boolean slicing:
lprice = 99
df[df.StrikePrice >= lprice].index
Int64Index([6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
If you insist on iterating and reporting the index when you find a match, you can modify your code:
lprice = 99
for idx, strike in df['StrikePrice'].iteritems():
    strike = float(strike)
    # print(strike)
    if strike >= lprice:
        print('The high strike is:' + str(strike))
        ce_1 = strike
        print(idx)
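One caveat: Series.iteritems() was deprecated and has been removed in pandas 2.0 in favor of Series.items(), so on current versions the same loop would be:
for idx, strike in df['StrikePrice'].items():
    if float(strike) >= lprice:
        print('The high strike is:' + str(strike))
        print(idx)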
I think the best approach is to filter the index by boolean indexing:
a = df.index[df['StrikePrice'] >= 99]
#alternative
#a = df.index[df['StrikePrice'].ge(99)]
Your code should be changed similarly:
lprice = 99
for strike in df['StrikePrice']:
    if strike >= lprice:
        print('The high strike is:' + str(strike))
        print(df.index[df['StrikePrice'] == strike])
numpy.where(condition[, x, y]) does exactly this if we specify only the condition: np.where(condition) returns the tuple condition.nonzero(), i.e. the indices where condition is True.
In [36]: np.where(df.StrikePrice >= lprice)[0]
Out[36]: array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=int64)
PS thanks #jezrael for the hint -- np.where() returns numerical index positions instead of DF index values:
In [41]: df = pd.DataFrame({'val':np.random.rand(10)}, index=pd.date_range('2018-01-01', freq='9999S', periods=10))
In [42]: df
Out[42]:
val
2018-01-01 00:00:00 0.459097
2018-01-01 02:46:39 0.148380
2018-01-01 05:33:18 0.945564
2018-01-01 08:19:57 0.105181
2018-01-01 11:06:36 0.570019
2018-01-01 13:53:15 0.203373
2018-01-01 16:39:54 0.021001
2018-01-01 19:26:33 0.717460
2018-01-01 22:13:12 0.370547
2018-01-02 00:59:51 0.462997
In [43]: np.where(df['val']>0.5)[0]
Out[43]: array([2, 4, 7], dtype=int64)
workaround:
In [44]: df.index[np.where(df['val']>0.5)[0]]
Out[44]: DatetimeIndex(['2018-01-01 05:33:18', '2018-01-01 11:06:36', '2018-01-01 19:26:33'], dtype='datetime64[ns]', freq=None)