The context
I am looking to apply a ufunc (cumsum in this case) to blocks of contiguous rows in a time series, which is stored in a pandas DataFrame.
This time series is sorted according to its DatetimeIndex.
Blocks are defined by a custom DatetimeIndex.
To do so, I came up with this (working) code:
# input dataset
import random
import pandas as pd

length = 10
ts = pd.date_range(start='2021/01/01 00:00', periods=length, freq='1h')
random.seed(1)
val = random.sample(range(1, 10+length), length)
df = pd.DataFrame({'val' : val}, index=ts)
# groupby custom datetimeindex
key_ts = [ts[i] for i in [1,3,7]]
df.loc[key_ts, 'id'] = range(len(key_ts))
df['id'] = df['id'].ffill()
# cumsum
df['cumsum'] = df.groupby('id')['val'].cumsum()
# initial dataset
In [13]: df
Out[13]:
val
2021-01-01 00:00:00 5
2021-01-01 01:00:00 3
2021-01-01 02:00:00 9
2021-01-01 03:00:00 4
2021-01-01 04:00:00 8
2021-01-01 05:00:00 13
2021-01-01 06:00:00 15
2021-01-01 07:00:00 14
2021-01-01 08:00:00 11
2021-01-01 09:00:00 7
# DatetimeIndex defining custom time intervals for 'resampling'.
In [14]: key_ts
Out[14]:
[Timestamp('2021-01-01 01:00:00', freq='H'),
Timestamp('2021-01-01 03:00:00', freq='H'),
Timestamp('2021-01-01 07:00:00', freq='H')]
# result
In [16]: df
Out[16]:
val id cumsum
2021-01-01 00:00:00 5 NaN NaN
2021-01-01 01:00:00 3 0.0 3
2021-01-01 02:00:00 9 0.0 12
2021-01-01 03:00:00 4 1.0 4
2021-01-01 04:00:00 8 1.0 12
2021-01-01 05:00:00 13 1.0 25
2021-01-01 06:00:00 15 1.0 40
2021-01-01 07:00:00 14 2.0 14
2021-01-01 08:00:00 11 2.0 25
2021-01-01 09:00:00 7 2.0 32
The question
Is groupby the most efficient in terms of CPU and memory in this case, where blocks are made of contiguous rows?
I would think that with groupby, a first pass over the full dataset is made to identify all the rows to group together.
Knowing that rows are contiguous in my case, I don't need to read the full dataset to know I have gathered all the rows of the current group.
As soon as I hit a row of the next group, I know the calculations for the previous group are done.
When rows are contiguous, the sorting step is also lighter.
Hence the question: is there a way to tell pandas about this to save some CPU?
Thanks in advance for your feedback.
groupby is clearly not the fastest solution here, because it has to use either a slow sort or slow hashing operations to group the values.
What you want to implement is called a segmented cumulative sum. You can implement this quite efficiently using NumPy, but it is a bit tricky to get right (especially because of the NaN values) and it is not the fastest solution, because it needs multiple passes over the id/val columns. The fastest solution is to use something like Numba to do this very quickly in one pass.
Here is an implementation:
import numpy as np
import numba as nb

# To avoid the compilation cost at runtime, use:
# @nb.njit('int64[:](float64[:],int64[:])')
@nb.njit
def segmentedCumSum(ids, values):
    size = len(ids)
    res = np.empty(size, dtype=values.dtype)
    if size == 0:
        return res
    zero = values.dtype.type(0)
    curValue = zero
    for i in range(size):
        if not np.isnan(ids[i]):
            if i > 0 and ids[i-1] != ids[i]:
                curValue = zero
            curValue += values[i]
            res[i] = curValue
        else:
            res[i] = -1
            curValue = zero
    return res

df['cumsum'] = segmentedCumSum(df['id'].to_numpy(), df['val'].to_numpy())
Note that ids[i-1] != ids[i] might fail with big floats because of their limited precision. The best solution is to use integers, with -1 replacing the NaN value. If you do want to keep the float values, you can use the expression np.abs(ids[i-1]-ids[i]) > epsilon with a very small epsilon. See this for more information.
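For reference, here is a minimal NumPy-only sketch of the same segmented cumulative sum. It assumes the blocks are contiguous and that the id array contains no NaN (e.g. after replacing NaN with an integer sentinel such as -1, as suggested above); the function name segmented_cumsum_np is only illustrative.

import numpy as np

def segmented_cumsum_np(ids, values):
    ids = np.asarray(ids)
    values = np.asarray(values)
    if ids.size == 0:
        return np.zeros(0, dtype=values.dtype)

    # True at the first row of each contiguous block.
    new_block = np.empty(len(ids), dtype=bool)
    new_block[0] = True
    new_block[1:] = ids[1:] != ids[:-1]

    total = np.cumsum(values)            # running sum over the whole series
    block_no = np.cumsum(new_block) - 1  # 0-based block number of each row
    offset = (total - values)[new_block] # running sum just before each block starts
    return total - offset[block_no]      # restart the sum at every block boundary

# Example usage with the dataframe from the question (NaN ids replaced by -1):
# df['cumsum'] = segmented_cumsum_np(df['id'].fillna(-1).to_numpy(), df['val'].to_numpy())

As expected, this needs several temporary arrays and passes, whereas the Numba version above works in a single pass.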
I have a dataframe (df) with a date index, and I want to achieve the following:
1. Take the Dates index and add one month -> e.g. nxt_dt = df.index + pd.DateOffset(months=1), and let's call df.index curr_dt.
2. Find the nearest entry in Dates that is >= nxt_dt.
3. Count the rows between curr_dt and nxt_dt and put the count into a column in df.
The result is supposed to look like this:
px_volume listed_sh ... iv_mid_6m '30d'
Dates ...
2005-01-03 228805 NaN ... 0.202625 21
2005-01-04 189983 NaN ... 0.203465 22
2005-01-05 224310 NaN ... 0.202455 23
2005-01-06 221988 NaN ... 0.202385 20
2005-01-07 322691 NaN ... 0.201065 21
Needless to say, the df only contains dates/rows for which there are observations.
I can think of a few ways to get this done with loops, but since the data I work with is quite big, I would really like to avoid looping through rows to fill the column.
Is there a way in pandas to get this done vectorized?
If you are OK with reindexing, this should do the job:
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': ['2020-01-01', '2020-01-08', '2020-01-24', '2020-01-29', '2020-02-09', '2020-03-04']})
df['date'] = pd.to_datetime(df['date'])
df['value'] = 1
df = df.set_index('date')
df = df.reindex(pd.date_range('2020-01-01','2020-03-04')).fillna(0)
df = df.sort_index(ascending=False)
df['30d'] = df['value'].rolling(30).sum() - 1
df.sort_index().query("value == 1")
gives:
value 30d
2020-01-01 1.0 3.0
2020-01-08 1.0 2.0
2020-01-24 1.0 2.0
2020-01-29 1.0 1.0
2020-02-09 1.0 NaN
2020-03-04 1.0 NaN
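If you would rather avoid the reindex, a loop-free alternative is to use searchsorted on the index itself. This is only a sketch, assuming the index is a sorted, duplicate-free DatetimeIndex; the exact inclusive/exclusive convention (and hence an off-by-one) may need adjusting to match the output you expect.

import numpy as np
import pandas as pd

nxt = df.index + pd.DateOffset(months=1)        # curr_dt shifted by one month
first_on_or_after = df.index.searchsorted(nxt)  # position of the first row >= nxt_dt
# Number of observed rows strictly after curr_dt and before nxt_dt:
df['30d'] = first_on_or_after - np.arange(len(df)) - 1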
I am learning to use the pandas resample() function, but the following code does not behave as expected: the resampled DataFrame comes back empty. I re-sampled the time series by day.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
range = pd.date_range('2015-01-01','2015-12-31',freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['distance'] = df['speed'] * 0.25
df['cumulative_distance'] = df.distance.cumsum()
print(df.head())
weekly_summary = pd.DataFrame()
weekly_summary['speed'] = df.speed.resample('D').mean()
weekly_summary['distance'] = df.distance.resample('D').sum()
print(weekly_summary.head())
Output
speed distance cumulative_distance
2015-01-01 00:00:00 40 10.00 10.00
2015-01-01 00:15:00 6 1.50 11.50
2015-01-01 00:30:00 31 7.75 19.25
2015-01-01 00:45:00 41 10.25 29.50
2015-01-01 01:00:00 59 14.75 44.25
[5 rows x 3 columns]
Empty DataFrame
Columns: [speed, distance]
Index: []
[0 rows x 2 columns]
Depending on your pandas version, how you do this will vary.
In pandas 0.19.0, your code works as expected:
In [7]: pd.__version__
Out[7]: '0.19.0'
In [8]: df.speed.resample('D').mean().head()
Out[8]:
2015-01-01 28.562500
2015-01-02 30.302083
2015-01-03 30.864583
2015-01-04 29.197917
2015-01-05 30.708333
Freq: D, Name: speed, dtype: float64
In older versions your code might not work as written, but at least in 0.14.1 you can tweak it to do so:
>>> pd.__version__
'0.14.1'
>>> df.speed.resample('D').mean()
29.41087328767123
>>> df.speed.resample('D', how='mean').head()
2015-01-01 29.354167
2015-01-02 26.791667
2015-01-03 31.854167
2015-01-04 26.593750
2015-01-05 30.312500
Freq: D, Name: speed, dtype: float64
This looks like an issue with an old version of pandas; in newer versions, pandas will enlarge the DataFrame when you assign a new column whose index does not have the same shape. What should work is to not create an empty DataFrame, and instead pass the initial call to resample as the data argument to the DataFrame constructor:
In [8]:
range = pd.date_range('2015-01-01','2015-12-31',freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['distance'] = df['speed'] * 0.25
df['cumulative_distance'] = df.distance.cumsum()
print (df.head())
weekly_summary = pd.DataFrame(df.speed.resample('D').mean())
weekly_summary['distance'] = df.distance.resample('D').sum()
print( weekly_summary.head())
speed distance cumulative_distance
2015-01-01 00:00:00 28 7.0 7.0
2015-01-01 00:15:00 8 2.0 9.0
2015-01-01 00:30:00 10 2.5 11.5
2015-01-01 00:45:00 56 14.0 25.5
2015-01-01 01:00:00 6 1.5 27.0
speed distance
2015-01-01 27.895833 669.50
2015-01-02 29.041667 697.00
2015-01-03 27.104167 650.50
2015-01-04 28.427083 682.25
2015-01-05 27.854167 668.50
Here I pass the call to resample as the data argument to the DataFrame constructor; this takes the index and column name and creates a single-column DataFrame:
weekly_summary = pd.DataFrame(df.speed.resample('D').mean())
Subsequent column assignments should then work as expected, because they align on the already-resampled daily index.
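As a side note (not part of the original answers), on recent pandas versions you can also build both daily columns in a single resample call with agg; a minimal sketch:

import numpy as np
import pandas as pd

idx = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame({'speed': np.random.randint(0, 60, size=len(idx))}, index=idx)
df['distance'] = df['speed'] * 0.25

# One resample call producing both aggregations at once.
daily_summary = df.resample('D').agg({'speed': 'mean', 'distance': 'sum'})
print(daily_summary.head())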
I'm trying to merge a (pandas 0.14.1) dataframe and a series. The series should form a new column, with some NAs (since the index values of the series are a subset of the index values of the dataframe).
This works for a toy example, but not with my data (detailed below).
Example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6, 4), columns=['A', 'B', 'C', 'D'], index=pd.date_range('1/1/2011', periods=6, freq='D'))
df1
A B C D
2011-01-01 -0.487926 0.439190 0.194810 0.333896
2011-01-02 1.708024 0.237587 -0.958100 1.418285
2011-01-03 -1.228805 1.266068 -1.755050 -1.476395
2011-01-04 -0.554705 1.342504 0.245934 0.955521
2011-01-05 -0.351260 -0.798270 0.820535 -0.597322
2011-01-06 0.132924 0.501027 -1.139487 1.107873
s1 = pd.Series(np.random.randn(3), name='foo', index=pd.date_range('1/1/2011', periods=3, freq='2D'))
s1
2011-01-01 -1.660578
2011-01-03 -0.209688
2011-01-05 0.546146
Freq: 2D, Name: foo, dtype: float64
pd.concat([df1, s1],axis=1)
A B C D foo
2011-01-01 -0.487926 0.439190 0.194810 0.333896 -1.660578
2011-01-02 1.708024 0.237587 -0.958100 1.418285 NaN
2011-01-03 -1.228805 1.266068 -1.755050 -1.476395 -0.209688
2011-01-04 -0.554705 1.342504 0.245934 0.955521 NaN
2011-01-05 -0.351260 -0.798270 0.820535 -0.597322 0.546146
2011-01-06 0.132924 0.501027 -1.139487 1.107873 NaN
The situation with my data (see below) seems basically identical - concatenating a series with a DatetimeIndex whose values are a subset of the dataframe's. But it gives the ValueError in the title, with shapes (5, 286) and (5, 276). Why doesn't it work?
In[187]: df.head()
Out[188]:
high low loc_h loc_l
time
2014-01-01 17:00:00 1.376235 1.375945 1.376235 1.375945
2014-01-01 17:01:00 1.376005 1.375775 NaN NaN
2014-01-01 17:02:00 1.375795 1.375445 NaN 1.375445
2014-01-01 17:03:00 1.375625 1.375515 NaN NaN
2014-01-01 17:04:00 1.375585 1.375585 NaN NaN
In [186]: df.index
Out[186]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 17:00:00, ..., 2014-01-01 21:30:00]
Length: 271, Freq: None, Timezone: None
In [189]: hl.head()
Out[189]:
2014-01-01 17:00:00 1.376090
2014-01-01 17:02:00 1.375445
2014-01-01 17:05:00 1.376195
2014-01-01 17:10:00 1.375385
2014-01-01 17:12:00 1.376115
dtype: float64
In [187]:hl.index
Out[187]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 17:00:00, ..., 2014-01-01 21:30:00]
Length: 89, Freq: None, Timezone: None
In: pd.concat([df, hl], axis=1)
Out: [stack trace] ValueError: Shape of passed values is (5, 286), indices imply (5, 276)
I had a similar problem (join worked, but concat failed).
Check for duplicate index values in df1 and s1 (e.g. with df1.index.is_unique).
Removing the duplicate index values (e.g. df.drop_duplicates(inplace=True)) or using one of the methods here https://stackoverflow.com/a/34297689/7163376 should resolve it.
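A quick diagnostic along those lines, assuming the df and hl objects from the question:

# Both should be True for concat to align cleanly on the index.
print(df.index.is_unique, hl.index.is_unique)

# If one is False, list the offending timestamps.
print(hl.index[hl.index.duplicated()])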
My problem was different indices; the following code solved my problem.
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
df = pd.concat([df1, df2], axis=1)
To drop duplicate indices, use df = df.loc[df.index.drop_duplicates()]. C.f. pandas.pydata.org/pandas-docs/stable/generated/… – BallpointBen Apr 18 at 15:25
This is wrong, but I can't reply directly to BallpointBen's comment due to low reputation. The reason it's wrong is that df.index.drop_duplicates() returns the list of unique index labels, but when you index back into the dataframe using those unique labels it still returns all records, because label-based lookup on a duplicated label returns every row carrying that label.
Instead, use df.index.duplicated(), which returns a boolean array (add the ~ to keep the not-duplicated records):
df = df.loc[~df.index.duplicated()]
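A tiny illustration of the difference on toy data (just to show the behaviour described above):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'a', 'b'])

# Label-based lookup expands the duplicated label 'a' back into two rows: 3 rows total.
print(df.loc[df.index.drop_duplicates()])

# The boolean mask keeps only the first occurrence of each label: 2 rows.
print(df.loc[~df.index.duplicated()])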
Aus_lacy's post gave me the idea of trying related methods, of which join does work:
In [196]:
hl.name = 'hl'
Out[196]:
'hl'
In [199]:
df.join(hl).head(4)
Out[199]:
high low loc_h loc_l hl
2014-01-01 17:00:00 1.376235 1.375945 1.376235 1.375945 1.376090
2014-01-01 17:01:00 1.376005 1.375775 NaN NaN NaN
2014-01-01 17:02:00 1.375795 1.375445 NaN 1.375445 1.375445
2014-01-01 17:03:00 1.375625 1.375515 NaN NaN NaN
Some insight into why concat works on the example but not this data would be nice though!
Your indexes probably contain duplicated values.
import pandas as pd
T1_INDEX = [
0,
1, # <= !!! if I write e.g.: "0" here then it fails
0.2,
]
T1_COLUMNS = [
'A', 'B', 'C', 'D'
]
T1 = [
[1.0, 1.1, 1.2, 1.3],
[2.0, 2.1, 2.2, 2.3],
[3.0, 3.1, 3.2, 3.3],
]
T2_INDEX = [
1.2,
2.11,
]
T2_COLUMNS = [
'D', 'E', 'F',
]
T2 = [
[54.0, 5324.1, 3234.2],
[55.0, 14.5324, 2324.2],
# [3.0, 3.1, 3.2],
]
df1 = pd.DataFrame(T1, columns=T1_COLUMNS, index=T1_INDEX)
df2 = pd.DataFrame(T2, columns=T2_COLUMNS, index=T2_INDEX)
print(pd.concat([pd.DataFrame({})] + [df2, df1], axis=1))
Try sorting the index after concatenating them:
result = pd.concat([df1, df2]).sort_index()
Maybe it is simple - try this:
If you have a DataFrame, make sure that both matrices or vectors that you're trying to combine have the same row names/index.
I had the same issue and changed the row indices so that they match each other.
Here is an example where a matrix (principal components) and a vector (target) have the same row indices (I circled them in blue on the left side of the screenshot, not reproduced here).
Before, when it was not working, I had the matrix with normal row indices (0, 1, 2, 3) while the vector had row indices (ID0, ID1, ID2, ID3).
Then I changed the vector's row indices to (0, 1, 2, 3) and it worked for me.
I want to select all rows with a particular index. My DataFrame looks like this:
>>> df
Code
Patient Date
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
2 2001-1-17 22:00:00 z
2002-1-21 00:00:00 d
2003-1-21 00:00:00 a
2005-12-1 00:00:00 ba
Selecting one of the first (Patient) index works:
>>> df.loc[1]
Code
Patient Date
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
But selecting multiple of the first (Patient) index does not:
>>> df.loc[[1, 2]]
Code
Patient Date
1 2003-01-12 00:00:00 a
2 2001-1-17 22:00:00 z
However, I would like to get the entire dataframe (i.e. all rows for patients 1 and 2 - the original dataframe).
When using a single index it works fine. For example:
>>> df.reset_index().set_index("Patient").loc[[1, 2]]
Date Code
Patient
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
2 2001-1-17 22:00:00 z
2002-1-21 00:00:00 d
2003-1-21 00:00:00 a
2005-12-1 00:00:00 ba
TL;DR Why do I have to repeat the index when using multiple indexes but not when I use a single index?
EDIT: Apparently it can be done with something like:
>>> df.loc[df.index.get_level_values("Patient").isin([1, 2])]
But this seems quite dirty to me. Is this the way, or is there any better way?
For pandas version 0.14, the recommended way, according to the comment above, is:
df.loc[([1,2],),:]
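A small self-contained sketch of that indexing pattern on a toy two-level index (the Patient/Date values below are made up for illustration):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, '2003-01-12'), (1, '2003-02-13'), (2, '2001-01-17'), (2, '2002-01-21')],
    names=['Patient', 'Date'],
)
df = pd.DataFrame({'Code': ['a', 'b', 'z', 'd']}, index=idx)

# A tuple whose first element is a list of labels selects every row of the
# remaining levels for those Patients, without having to repeat labels:
print(df.loc[([1, 2],), :])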