Python pandas rolling.apply with two time-series inputs into a function

I have a DatetimeIndex indexed dataframe with two columns. The index is uneven.
             A   B
Date
2016-01-04   1  20
2016-01-12   2  10
2016-01-21   3  10
2016-01-25   2  20
2016-02-08   2  30
2016-02-15   1  20
2016-02-21   3  20
2016-02-25   2  20
I want to compute the dot product of time-series A and B over a rolling window of length 20 days.
It should return:
            dot
Date
2016-01-04  NaN
2016-01-12  NaN
2016-01-21  NaN
2016-01-25   90
2016-02-08  130
2016-02-15   80
2016-02-21  140
2016-02-25  180
Here is how these values are obtained:
90 = 2*10+3*10+2*20 (product obtained in period from 2016-01-06 to 2016-01-25 included)
130 = 3*10+2*20+2*30 (product obtained in period from 2016-01-20 to 2016-02-08)
80 = 1*20+2*30 (product obtained in period from 2016-01-27 to 2016-02-15)
140 = 3*20+1*20+2*30 (product obtained in period from 2016-02-02 to 2016-02-21)
180 = 2*20+3*20+1*20+2*30 (product obtained in period from 2016-02-06 to 2016-02-25)
The dot product is an example that should be generalizable to any function taking two series and returning a value.

I think this should work: take df.product() across rows, then df.rolling(period).sum().
import pandas as pd

Dates = pd.to_datetime(['2016-01-04', '2016-01-12', '2016-01-21',
                        '2016-01-25', '2016-02-08', '2016-02-15',
                        '2016-02-21', '2016-02-25', '2016-02-26'])
data = {'A': [i * 10 for i in range(1, 10)], 'B': [i for i in range(1, 10)]}
df1 = pd.DataFrame(data=data, index=Dates)
df2 = df1.product(axis=1).rolling(3).sum()
df2.name = 'Dot'  # df2 is a Series, so set its name (a Series has no .columns)
df2
output
2016-01-04 NaN
2016-01-12 NaN
2016-01-21 140.0
2016-01-25 290.0
2016-02-08 500.0
2016-02-15 770.0
2016-02-21 1100.0
2016-02-25 1490.0
2016-02-26 1940.0
Name: Dot, dtype: float64
And if your data is daily and you first want to aggregate it into 20-day buckets, group by 20 days and sum (or take the last value, depending on what you need):
import numpy as np

Dates1 = pd.date_range(start='2016-03-31', end='2016-07-31')
data1 = {'A': [np.pi * i * np.random.rand()
               for i in range(1, len(Dates1) + 1)],
         'B': [i * np.random.randn() * 10
               for i in range(1, len(Dates1) + 1)]}
df3 = pd.DataFrame(data=data1, index=Dates1)
df3.groupby(pd.Grouper(freq='20d')).sum()  # pd.Grouper replaces the deprecated pd.TimeGrouper
A B
2016-03-31 274.224084 660.144639
2016-04-20 1000.456615 -2403.034012
2016-05-10 1872.422495 -1737.571080
2016-05-30 2121.497529 1157.710510
2016-06-19 3084.569208 -1854.258668
2016-07-09 3324.775922 -9743.113805
2016-07-29 505.162678 -1179.730820
and then use dot product like I did above.
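For the 20-day window on the OP's uneven DatetimeIndex specifically, note that rolling also accepts a time offset, so the dot product can be computed directly. A minimal sketch (offset windows emit a value as soon as a single observation is inside the window, so the leading rows come out as numbers rather than the NaNs in the expected output):
df['dot'] = df.prod(axis=1).rolling('20D').sum()
For an arbitrary function of two series, one workaround is to roll over a positional counter and slice both columns inside a helper. roll_two below is a hypothetical name, and the sketch assumes df is sorted by its DatetimeIndex:
import numpy as np
import pandas as pd

def roll_two(df, window, func):
    # First row position inside each trailing time window.
    pos = pd.Series(np.arange(len(df)), index=df.index)
    starts = pos.rolling(window).min()
    out = []
    for end, start in enumerate(starts):
        sub = df.iloc[int(start):end + 1]  # rows inside the window ending at `end`
        out.append(func(sub['A'], sub['B']))
    return pd.Series(out, index=df.index)

roll_two(df, '20D', lambda a, b: a.dot(b))  # same result as the rolling dot product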

Related

For each row: search for the previous row at least 20 seconds before

Problem:
For each row of a DataFrame, I want to find the nearest prior row where the 'Datetime' value is at least 20 seconds before the current 'Datetime' value.
For example: if the previous 'Datetime' (at index i-1) is at least 20s earlier than the current one, it will be chosen. Otherwise (e.g. only 5 seconds earlier), move to i-2 and see if it is at least 20s earlier. Repeat until the condition is met, or until no such row can be found.
The expected result is a concatenation of the original df and the rows that were found. When no matching row at or more than 20 s before the current Datetime has been found, then the new columns are null (NaT or NaN, depending on the type).
Example data
df = pd.DataFrame({
    'Datetime': pd.to_datetime([
        f'2016-05-15 08:{M_S}+06:00'
        for M_S in ['36:21', '36:41', '36:50', '37:10', '37:19', '37:39']]),
    'A': [21, 43, 54, 2, 54, 67],
    'B': [3, 3, 45, 23, 8, 6],
})
Example result:
>>> res
Datetime A B Datetime_nearest A_nearest B_nearest
0 2016-05-15 08:36:21+06:00 21 3 NaT NaN NaN
1 2016-05-15 08:36:41+06:00 43 3 2016-05-15 08:36:21+06:00 21.0 3.0
2 2016-05-15 08:36:50+06:00 54 45 2016-05-15 08:36:21+06:00 21.0 3.0
3 2016-05-15 08:37:10+06:00 2 23 2016-05-15 08:36:50+06:00 54.0 45.0
4 2016-05-15 08:37:19+06:00 54 8 2016-05-15 08:36:50+06:00 54.0 45.0
5 2016-05-15 08:37:39+06:00 67 6 2016-05-15 08:37:19+06:00 54.0 8.0
The last three columns are the newly created columns, and the first three columns are the original dataset.
Two vectorized solutions
Note: we assume that the rows are sorted by Datetime. If that is not the case, then sort them first (O[n log n]).
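A minimal sketch of that preliminary sort (ignore_index keeps the default integer positions that both solutions rely on):
df = df.sort_values('Datetime', ignore_index=True)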
For 10,000 rows:
3.3 ms, using Numpy's searchsorted.
401 ms, using a rolling window of 20s, left-open.
1. Using np.searchsorted
We use np.searchsorted to find in one call the indices of all matching rows:
import numpy as np

min_dt = '20s'  # the minimum time gap
s = df['Datetime']
z = np.searchsorted(s, s - (pd.Timedelta(min_dt) - pd.Timedelta('1ns'))) - 1
E.g., for the OP's data, these indices are:
>>> z
array([-1, 0, 0, 2, 2, 4])
I.e.: z[0] == -1: no matching row; z[1] == 0: row 0 (08:36:21) is the nearest that is 20s or more before row 1 (08:36:41). z[2] == 0: row 0 is the nearest match for row 2 (row 1 is too close). Etc.
Why subtract 1? We use np.searchsorted to select the first row in the exclusion zone (i.e., too close); we then subtract 1 to get the correct row (the nearest one at least 20s before).
Why - 1ns? This is to make the search window left-open. A row at exactly 20s before the current one will not be in the exclusion zone, and thus will end up being the one selected as the match.
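A tiny illustration of the left-open behavior, on a hypothetical two-row series: with min_dt='20s', a row exactly 20s earlier is matched.
s = pd.Series(pd.to_datetime(['2016-05-15 08:36:00', '2016-05-15 08:36:20']))
z = np.searchsorted(s, s - (pd.Timedelta('20s') - pd.Timedelta('1ns'))) - 1
# z == [-1, 0]: row 0 is exactly 20s before row 1 and is selected as its match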
We then use z to select the matching rows (or nulls) and concatenate into the result. Putting it all in a function:
def select_np(df, min_dt='20s'):
    newcols = [f'{k}_nearest' for k in df.columns]
    s = df['Datetime']
    z = np.searchsorted(s, s - (pd.Timedelta(min_dt) - pd.Timedelta('1ns'))) - 1
    return pd.concat([
        df,
        df.iloc[z].set_axis(newcols, axis=1).reset_index(drop=True).where(pd.Series(z >= 0)),
    ], axis=1)
On the OP's example
>>> select_np(df[['Datetime', 'A', 'B']])
Datetime A B Datetime_nearest A_nearest B_nearest
0 2016-05-15 08:36:21+06:00 21 3 NaT NaN NaN
1 2016-05-15 08:36:41+06:00 43 3 2016-05-15 08:36:21+06:00 21.0 3.0
2 2016-05-15 08:36:50+06:00 54 45 2016-05-15 08:36:21+06:00 21.0 3.0
3 2016-05-15 08:37:10+06:00 2 23 2016-05-15 08:36:50+06:00 54.0 45.0
4 2016-05-15 08:37:19+06:00 54 8 2016-05-15 08:36:50+06:00 54.0 45.0
5 2016-05-15 08:37:39+06:00 67 6 2016-05-15 08:37:19+06:00 54.0 8.0
2. Using a rolling window (pure Pandas)
This was our original solution and uses pandas rolling with a Timedelta(20s) window size, left-open. It is still more optimized than a naive (O[n^2]) search, but is roughly 100x slower than select_np(), as pandas uses explicit loops in Python to find the window bounds for .rolling(): see get_window_bounds(). There is also some overhead due to having to make sub-frames, applying a function or aggregate, etc.
def select_pd(df, min_dt='20s'):
    newcols = [f'{k}_nearest' for k in df.columns]
    z = (
        df.assign(rownum=range(len(df)))
          .rolling(pd.Timedelta(min_dt), on='Datetime', closed='right')['rownum']
          .apply(min).astype(int) - 1
    )
    return pd.concat([
        df,
        df.iloc[z].set_axis(newcols, axis=1).reset_index(drop=True).where(z >= 0),
    ], axis=1)
3. Testing
First, we write an arbitrary-size test data generator:
def gen(n):
    return pd.DataFrame({
        'Datetime': pd.Timestamp('2020') +
                    np.random.randint(0, 30, n).cumsum() * pd.Timedelta('1s'),
        'A': np.random.randint(0, 100, n),
        'B': np.random.randint(0, 100, n),
    })
Example
np.random.seed(0)
tdf = gen(10)
>>> select_np(tdf)
Datetime A B Datetime_nearest A_nearest B_nearest
0 2020-01-01 00:00:12 21 87 NaT NaN NaN
1 2020-01-01 00:00:27 36 46 NaT NaN NaN
2 2020-01-01 00:00:48 87 88 2020-01-01 00:00:27 36.0 46.0
3 2020-01-01 00:00:48 70 81 2020-01-01 00:00:27 36.0 46.0
4 2020-01-01 00:00:51 88 37 2020-01-01 00:00:27 36.0 46.0
5 2020-01-01 00:01:18 88 25 2020-01-01 00:00:51 88.0 37.0
6 2020-01-01 00:01:21 12 77 2020-01-01 00:00:51 88.0 37.0
7 2020-01-01 00:01:28 58 72 2020-01-01 00:00:51 88.0 37.0
8 2020-01-01 00:01:37 65 9 2020-01-01 00:00:51 88.0 37.0
9 2020-01-01 00:01:56 39 20 2020-01-01 00:01:28 58.0 72.0
Speed
tdf = gen(10_000)
%timeit select_np(tdf)
3.31 ms ± 6.79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit select_pd(tdf)
401 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> select_np(tdf).equals(select_pd(tdf))
True
Scale sweep
We can now compare speed over a range of sizes, using the excellent perfplot package:
import perfplot
perfplot.plot(
    setup=gen,
    kernels=[select_np, select_pd],
    n_range=[2**k for k in range(4, 16)],
    equality_check=lambda a, b: a.equals(b),
)
Focusing on select_np:
perfplot.plot(
    setup=gen,
    kernels=[select_np],
    n_range=[2**k for k in range(4, 24)],
)
The following solution is memory-efficient, but it is not the fastest one, because it iterates over rows.
The fully vectorized version (the best I could think of on my own) would be faster, but it would use O(n^2) memory.
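For reference, a sketch of that O(n^2)-memory vectorized variant, using the example df constructed below: compare every pair of timestamps with broadcasting, then take the last matching position per row (-1 where there is none).
import numpy as np
import pandas as pd

times = df['Datetime'].to_numpy()
# ok[i, j] is True when row j is at least 20s before row i (an n x n mask)
ok = times[None, :] <= (times[:, None] - np.timedelta64(20, 's'))
has_match = ok.any(axis=1)
# position of the last True per row; rows without any match get -1
last_true = ok.shape[1] - 1 - ok[:, ::-1].argmax(axis=1)
selected = np.where(has_match, last_true, -1)
# `selected` plays the same role as `selected_indices` in the steps below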
Example dataframe:
timestamps = [pd.Timestamp('2016-01-01 00:00:00'),
              pd.Timestamp('2016-01-01 00:00:19'),
              pd.Timestamp('2016-01-01 00:00:20'),
              pd.Timestamp('2016-01-01 00:00:21'),
              pd.Timestamp('2016-01-01 00:00:50')]
df = pd.DataFrame({'Datetime': timestamps,
                   'A': np.arange(10, 15),
                   'B': np.arange(20, 25)})
              Datetime   A   B
0  2016-01-01 00:00:00  10  20
1  2016-01-01 00:00:19  11  21
2  2016-01-01 00:00:20  12  22
3  2016-01-01 00:00:21  13  23
4  2016-01-01 00:00:50  14  24
Solution:
times = df['Datetime'].to_numpy()  # it's convenient to have it as an ndarray
shifted_times = times - pd.Timedelta(20, unit='s')
useful is a list of the "useful" indices of df, i.e. those where the appended values will NOT be NaN:
useful = np.nonzero(shifted_times >= times[0])[0]
# useful == [2, 3, 4]
Truncate shifted_times from the beginning, to iterate through the useful elements only:
if len(useful) == 0:
    # all new columns will be NaNs
    first_i = 0  # this value will never actually be used
    useful_shifted_times = np.array([], dtype=shifted_times.dtype)
else:
    first_i = useful[0]  # first_i == 2
    useful_shifted_times = shifted_times[first_i:]
 
Find the corresponding index positions of df for each "useful" value (these index positions are essentially the indices of times that are selected for each element of useful_shifted_times):
selected_indices = []
# Iterate through `useful_shifted_times` one by one
# (`i` starts at `first_i`):
for i, shifted_time in enumerate(useful_shifted_times, first_i):
    selected_index = np.nonzero(times[:i] <= shifted_time)[0][-1]
    selected_indices.append(selected_index)
# selected_indices == [0, 0, 3]
 
Selected rows:
df_nearest = df.iloc[selected_indices].add_suffix('_nearest')
      Datetime_nearest  A_nearest  B_nearest
0  2016-01-01 00:00:00         10         20
0  2016-01-01 00:00:00         10         20
3  2016-01-01 00:00:21         13         23
 
Replace the indices of df_nearest to match those of the corresponding rows of df (basically, the last len(selected_indices) indices):
df_nearest.index = df.index[len(df) - len(selected_indices):]
      Datetime_nearest  A_nearest  B_nearest
2  2016-01-01 00:00:00         10         20
3  2016-01-01 00:00:00         10         20
4  2016-01-01 00:00:21         13         23
 
Append the selected rows to the original dataframe to get the final result:
new_df = df.join(df_nearest)
              Datetime   A   B     Datetime_nearest  A_nearest  B_nearest
0  2016-01-01 00:00:00  10  20                  NaT        NaN        NaN
1  2016-01-01 00:00:19  11  21                  NaT        NaN        NaN
2  2016-01-01 00:00:20  12  22  2016-01-01 00:00:00       10.0       20.0
3  2016-01-01 00:00:21  13  23  2016-01-01 00:00:00       10.0       20.0
4  2016-01-01 00:00:50  14  24  2016-01-01 00:00:21       13.0       23.0
 
Note: NaT stands for 'Not a Time'. It is the equivalent of NaN for time values.
Note: it also works as expected even when all of the shifted times ('Datetime' minus 20 sec) fall before the very first 'Datetime'; in that case all the new columns will be NaNs.
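Putting the steps above together as a single function (a sketch; select_iter is a hypothetical name, and the pandas/numpy imports are assumed):
def select_iter(df, min_dt='20s'):
    times = df['Datetime'].to_numpy()
    shifted_times = times - pd.Timedelta(min_dt)
    useful = np.nonzero(shifted_times >= times[0])[0]
    selected_indices = []
    if len(useful) > 0:
        first_i = useful[0]
        for i, shifted_time in enumerate(shifted_times[first_i:], first_i):
            selected_indices.append(np.nonzero(times[:i] <= shifted_time)[0][-1])
    df_nearest = df.iloc[selected_indices].add_suffix('_nearest')
    df_nearest.index = df.index[len(df) - len(selected_indices):]
    return df.join(df_nearest)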

Add Missing Date Index in a multiindex dataframe

I am working with a multi-index data frame that has Date and location_id as its index levels.
index_1 = ['2020-01-01', '2020-01-03', '2020-01-04']
index_2 = [100, 200, 300]
index = pd.MultiIndex.from_product([index_1, index_2],
                                   names=['Date', 'location_id'])
df = pd.DataFrame(np.random.randint(10, 100, 9), index)
df
0
Date location_id
2020-01-01 100 19
200 75
300 39
2020-01-03 100 11
200 91
300 80
2020-01-04 100 36
200 56
300 54
I want to fill in the missing dates with just one location_id, filled with 0:
0
Date location_id
2020-01-01 100 19
200 75
300 39
2020-01-02 100 0
2020-01-03 100 11
200 91
300 80
2020-01-04 100 36
200 56
300 54
How can I achieve that? This is helpful but only if my data frame was not multi indexed.
You can get the unique values of the Date index level, generate all dates between min and max with pd.date_range, and take the difference with the unique values to get the missing dates. Then reindex df with the union of the original index and a MultiIndex.from_product made of the missing dates and the min of the location_id level.
# unique dates
m = df.index.unique(level=0)

# reindex
df = df.reindex(df.index.union(
         pd.MultiIndex.from_product([pd.date_range(m.min(), m.max())
                                       .difference(pd.to_datetime(m))
                                       .strftime('%Y-%m-%d'),
                                     [df.index.get_level_values(1).min()]])),
     fill_value=0)
print(df)
0
2020-01-01 100 91
200 49
300 19
2020-01-02 100 0
2020-01-03 100 41
200 25
300 51
2020-01-04 100 44
200 40
300 54
Instead of pd.MultiIndex.from_product, you can also use product from itertools. Same result, but maybe faster.
from itertools import product

df = df.reindex(df.index.union(
         list(product(pd.date_range(m.min(), m.max())
                        .difference(pd.to_datetime(m))
                        .strftime('%Y-%m-%d'),
                      [df.index.get_level_values(1).min()]))),
     fill_value=0)
A pandas index is immutable, so you need to construct a new one. Move the index level location_id to a column, keep only the unique rows, and call asfreq to create rows for the missing dates; assign the result to df2. Finally, use df.align to join both indices and fillna:
df1 = df.reset_index(-1)
df2 = df1.loc[~df1.index.duplicated()].asfreq('D').ffill()
df_final = df.align(df2.set_index('location_id', append=True))[0].fillna(0)
Out[75]:
0
Date location_id
2020-01-01 100 19.0
200 75.0
300 39.0
2020-01-02 100 0.0
2020-01-03 100 11.0
200 91.0
300 80.0
2020-01-04 100 36.0
200 56.0
300 54.0
unstack/stack and asfreq/reindex would also work (note that this fills in all location_ids for the missing date, not just one):
new_df = df.unstack(fill_value=0)
new_df.index = pd.to_datetime(new_df.index)
new_df.asfreq('D').fillna(0).stack('location_id')
Output:
0
Date location_id
2020-01-01 100 78.0
200 25.0
300 89.0
2020-01-02 100 0.0
200 0.0
300 0.0
2020-01-03 100 79.0
200 23.0
300 11.0
2020-01-04 100 30.0
200 79.0
300 72.0

How to forward resample for each different value in a column

First, I want to forward fill my data by 1S for EACH UNIQUE VALUE in Group_Id, so basically group by Group_Id and then resample using ffill.
Here is the data:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
I previously did this:
def daily_average_temperature(dfdf):
    INDEX = dfdf[['Group_Id', 'Timestamp', 'Data']]
    INDEX['Timestamp'] = pd.to_datetime(INDEX['Timestamp'])
    INDEX = INDEX.set_index('Timestamp')
    INDEX1 = INDEX.resample('1S').last().fillna(method='ffill')
    return INDEX1
This is wrong, as it doesn't group the data by the different values of Group_Id first, but rather ignores the column.
Second, I would like to spread the Data values so that each row is a Group_Id, with positional indices as columns replacing Timestamp. It would look something like this:
x0 x1 x2 x3 x4 x5 ... Group_Id
0 40 31.05 25.5 25.5 25.5 25 ... 1
1 35 35.75 36.5 36.5 36.5 36.5 ... 2
2 25.5 25.5 25.5 25.5 25.5 25.5 ... 3
3 25.5 25.5 25.5 25.5 25.5 25.5 ... 4
4 25 25 25 25 25 25 ... 5
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
Please note that this table above is not related to the previous dataset but just used to show the format.
Thanks
Use DataFrame.groupby with DataFrameGroupBy.resample:
def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .ffill()
                .reset_index())
    return dfdf

print(daily_average_temperature(dfdf))
Group_Id Timestamp Data
0 50 2018-01-03 00:00:15 125.5
1 52 2018-01-02 00:00:09 127.0
2 52 2018-01-02 00:00:10 127.0
3 52 2018-01-02 00:00:11 127.0
4 52 2018-01-02 00:00:12 127.0
5 52 2018-01-02 00:00:13 126.5
6 101 2018-01-01 00:00:05 125.0
7 120 2018-01-01 00:00:07 125.5
8 120 2018-01-01 00:00:08 125.0
9 300 2018-01-04 00:00:14 127.0
10 300 2018-01-04 00:00:15 126.5
11 350 2018-01-05 00:00:19 125.5
EDIT: This solution uses the minimal and maximal datetimes to build a date_range for DataFrame.reindex on the columns (a DatetimeIndex) after reshaping by Series.unstack; back filling is also added where necessary:
def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    # remove ms for minimal and maximal seconds in data
    s = dfdf['Timestamp'].dt.floor('S')
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range(s.min(), s.max(), freq='S'),
                         axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data'))
    return dfdf

df = daily_average_temperature(dfdf)
print(df['Group_Id'].value_counts())
350 345615
300 345615
120 345615
101 345615
52 345615
50 345615
Name: Group_Id, dtype: int64
Another solution is similar; only the date_range is specified from string values (not dynamically from the min and max):
def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    # remove ms for minimal and maximal seconds in data
    s = dfdf['Timestamp'].dt.floor('S')
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range('2018-01-01', '2018-01-08', freq='S'),
                         axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data'))
    return dfdf

df = daily_average_temperature(dfdf)
print(df['Group_Id'].value_counts())
350 604801
300 604801
120 604801
101 604801
52 604801
50 604801
Name: Group_Id, dtype: int64
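For the second request (one row per Group_Id with positional columns replacing Timestamp), a sketch building on the tidy frame returned by the first daily_average_temperature above; the x0, x1, ... labels are just positional names, as in the question's illustration:
long = daily_average_temperature(dfdf)  # columns: Group_Id, Timestamp, Data
wide = (long.groupby('Group_Id')['Data']
            .apply(lambda s: s.reset_index(drop=True))
            .unstack())
wide.columns = [f'x{c}' for c in wide.columns]
wide = wide.reset_index()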

Pandas dataframe Groupby and retrieve date range

Here is the dataframe that I am working on. There are two pay periods defined:
the first 15 days and the last 15 days of each month.
date employee_id hours_worked id job_group report_id
0 2016-11-14 2 7.50 385 B 43
1 2016-11-15 2 4.00 386 B 43
2 2016-11-30 2 4.00 387 B 43
3 2016-11-01 3 11.50 388 A 43
4 2016-11-15 3 6.00 389 A 43
5 2016-11-16 3 3.00 390 A 43
6 2016-11-30 3 6.00 391 A 43
I need to group by employee_id and job_group, but at the same time I have to obtain the pay-period date range for each grouped row.
For example, the grouped results would be like the following:
Expected Output:
date employee_id hours_worked job_group report_id
1 2016-11-15 2 11.50 B 43
2 2016-11-30 2 4.00 B 43
4 2016-11-15 3 17.50 A 43
5 2016-11-16 3 9.00 A 43
Is this possible using pandas dataframe groupby?
Use pd.Grouper with SM (semi-month) frequency, and at the end add SemiMonthEnd:
df['date'] = pd.to_datetime(df['date'])

d = {'hours_worked': 'sum', 'report_id': 'first'}
df = (df.groupby(['employee_id', 'job_group',
                  pd.Grouper(freq='SM', key='date', closed='right')])
        .agg(d)
        .reset_index())
df['date'] = df['date'] + pd.offsets.SemiMonthEnd(1)
print(df)
employee_id job_group date hours_worked report_id
0 2 B 2016-11-15 11.5 43
1 2 B 2016-11-30 4.0 43
2 3 A 2016-11-15 17.5 43
3 3 A 2016-11-30 9.0 43
a. First, (for each employee_id) use multiple Grouper with .sum() on the hours_worked column. Second, use DateOffset to obtain a bi-weekly date column. After these 2 steps, I assign the date in the grouped DF based on 2 brackets (date ranges): if the day of month (from the date column) is <= 15, then I set the day in date to 15, else I set it to the month's last day. This day is then used to assemble a new date. I calculated the month-end day based on 1, 2.
b. (For each employee_id) get the .last() record for the job_group and report_id columns
c. merge a. and b. on the employee_id key
# a.
hours = (df.groupby([pd.Grouper(key='employee_id'),
                     pd.Grouper(key='date', freq='SM')])['hours_worked']
           .sum()
           .reset_index())
hours['date'] = pd.to_datetime(hours['date'])
hours['date'] = hours['date'] + pd.DateOffset(days=14)

# Assign day based on bracket (date range) 0-15 or bracket (date range) >15
from pandas.tseries.offsets import MonthEnd
hours['bracket'] = hours['date'] + MonthEnd(0)
hours['bracket'] = pd.to_datetime(hours['bracket']).dt.day
hours.loc[hours['date'].dt.day <= 15, 'bracket'] = 15
hours['date'] = pd.to_datetime(dict(year=hours['date'].dt.year,
                                    month=hours['date'].dt.month,
                                    day=hours['bracket']))
hours.drop('bracket', axis=1, inplace=True)

# b.
others = (df.groupby('employee_id')[['job_group', 'report_id']]
            .last()
            .reset_index())

# c.
merged = hours.merge(others, how='inner', on='employee_id')
Raw data for employee_id == 1 and employee_id == 3:
df.sort_values(by=['employee_id', 'date'], inplace=True)
print(df[df.employee_id.isin([1, 3])])
index date employee_id hours_worked id job_group report_id
0 0 2016-11-14 1 7.5 481 A 43
10 10 2016-11-21 1 6.0 491 A 43
11 11 2016-11-22 1 5.0 492 A 43
15 15 2016-12-14 1 7.5 496 A 43
25 25 2016-12-21 1 6.0 506 A 43
26 26 2016-12-22 1 5.0 507 A 43
6 6 2016-11-02 3 6.0 487 A 43
4 4 2016-11-08 3 6.0 485 A 43
3 3 2016-11-09 3 11.5 484 A 43
5 5 2016-11-11 3 3.0 486 A 43
20 20 2016-11-12 3 3.0 501 A 43
21 21 2016-12-02 3 6.0 502 A 43
19 19 2016-12-08 3 6.0 500 A 43
18 18 2016-12-09 3 11.5 499 A 43
Output
print(merged)
employee_id date hours_worked job_group report_id
0 1 2016-11-15 7.5 A 43
1 1 2016-11-30 11.0 A 43
2 1 2016-12-15 7.5 A 43
3 1 2016-12-31 11.0 A 43
4 2 2016-11-15 31.0 B 43
5 2 2016-12-15 31.0 B 43
6 3 2016-11-15 29.5 A 43
7 3 2016-12-15 23.5 A 43
8 4 2015-03-15 5.0 B 43
9 4 2016-02-29 5.0 B 43
10 4 2016-11-15 5.0 B 43
11 4 2016-11-30 15.0 B 43
12 4 2016-12-15 5.0 B 43
13 4 2016-12-31 15.0 B 43

pandas.Series() Creation using DataFrame Columns returns NaN Data entries

I'm attempting to convert a dataframe into a series using code which, simplified, looks like this:
import pandas as pd

dates = ['2016-1-{}'.format(i) for i in range(1, 21)]
values = [i for i in range(20)]
data = {'Date': dates, 'Value': values}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
ts = pd.Series(df['Value'], index=df['Date'])
print(ts)
However, print output looks like this:
Date
2016-01-01 NaN
2016-01-02 NaN
2016-01-03 NaN
2016-01-04 NaN
2016-01-05 NaN
2016-01-06 NaN
2016-01-07 NaN
2016-01-08 NaN
2016-01-09 NaN
2016-01-10 NaN
2016-01-11 NaN
2016-01-12 NaN
2016-01-13 NaN
2016-01-14 NaN
2016-01-15 NaN
2016-01-16 NaN
2016-01-17 NaN
2016-01-18 NaN
2016-01-19 NaN
2016-01-20 NaN
Name: Value, dtype: float64
Where does NaN come from? Is a view on a DataFrame object not a valid input for the Series class?
I have found the to_series function for pd.Index objects; is there something similar for DataFrames?
I think you can use values; it converts the column Value to an array:
ts = pd.Series(df['Value'].values, index=df['Date'])
import pandas as pd
import numpy as np

dates = ['2016-1-{}'.format(i) for i in range(1, 21)]
values = [i for i in range(20)]
data = {'Date': dates, 'Value': values}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])

print(df['Value'].values)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

ts = pd.Series(df['Value'].values, index=df['Date'])
print(ts)
Date
2016-01-01 0
2016-01-02 1
2016-01-03 2
2016-01-04 3
2016-01-05 4
2016-01-06 5
2016-01-07 6
2016-01-08 7
2016-01-09 8
2016-01-10 9
2016-01-11 10
2016-01-12 11
2016-01-13 12
2016-01-14 13
2016-01-15 14
2016-01-16 15
2016-01-17 16
2016-01-18 17
2016-01-19 18
2016-01-20 19
dtype: int64
Or you can use:
ts1 = pd.Series(data=values, index=pd.to_datetime(dates))
print(ts1)
2016-01-01 0
2016-01-02 1
2016-01-03 2
2016-01-04 3
2016-01-05 4
2016-01-06 5
2016-01-07 6
2016-01-08 7
2016-01-09 8
2016-01-10 9
2016-01-11 10
2016-01-12 11
2016-01-13 12
2016-01-14 13
2016-01-15 14
2016-01-16 15
2016-01-17 16
2016-01-18 17
2016-01-19 18
2016-01-20 19
dtype: int64
Thanks to @ajcr for the better explanation of why you get NaN:
When you give a Series or DataFrame column to pd.Series, it will reindex it using the index you specify. Since your DataFrame column has an integer index (not a date index) you get lots of missing values.
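A minimal demonstration of that reindexing behavior (a hypothetical three-element series):
import pandas as pd

s = pd.Series([10, 20, 30])            # default integer index 0..2
print(pd.Series(s, index=[1, 2, 99]))  # 1 -> 20.0, 2 -> 30.0, 99 -> NaN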
You can just do:
s = df.set_index('Date')
This is now a one-column dataframe.
If you really want it as a Series:
s = df.set_index('Date').Value
btw, NaN is numpy's Not-a-Number.
Using your method, you could use:
ts = pd.Series(df['Value'].values, name='Value', index=df['Date'])
The reason you are getting the NaNs is that you are not providing the data in the correct format: you are passing a Series (which carries its own integer index), so it gets reindexed by the Date index rather than used as raw values.
If you are only looking to create a series with those values, you could also have done:
pd.Series([i for i in range(20)], pd.date_range('2016-01-01', periods=20, freq='D'))
