Pandas groupby: efficiently extract indexes

Pandas groupby: efficiently extract indexes - python

I am trying to efficiently split a 3D point cloud into a number of 2D tiles/segments.
Using a combination of numpy's searchsorted() and pandas groupby() functions i have been able to achieve sorting the data into groups with pleasing speed.
For example:
import numpy as np
import pandas as pd
import time
scale=100
n_points= 1000000
n_tiles = 1000000
pos = np.empty((n_points,3))
pos[:,0]=np.random.random(n_points)*scale
pos[:,1]=np.random.random(n_points)*scale
pos[:,2]=np.random.random(n_points)
df = pd.DataFrame(pos)
# create bounds for each segment
min_bound,max_bound = 0,scale
x_segment_bounds,xstep = np.linspace(min_bound, max_bound, num=n_tiles**0.5,retstep = True)
x_segment_bounds[0]=x_segment_bounds[0]+xstep/2
y_segment_bounds,ystep = np.linspace(min_bound, max_bound, num=n_tiles**0.5,retstep=True)
y_segment_bounds[0]=y_segment_bounds[0]+ystep/2
# sort into bins
time_grab = time.clock()
bins_x = np.searchsorted(x_segment_bounds, pos[:, 0])
bins_y = np.searchsorted(y_segment_bounds, pos[:, 1])
print("Time for binning: ", time.clock()-time_grab)
df["bins_x"] = bins_x.astype(np.uint16)
df["bins_y"] = bins_y.astype(np.uint16)
# group points
time_grab = time.clock()
segments = df.groupby(['bins_x', 'bins_y'])
print("Time for grouping: ", time.clock()-time_grab)
Produces:
Time for binning: 0.1390
Time for grouping: 0.0043
The problem i am having is in efficiently accessing the point indexes that belong to each group in the pandas groupby object.
For example looping through each group is very inefficient:
segment_indices = []
for i,segment in enumerate(segments):
segment_indices.append(segment[1].index.values)
takes ~70 seconds.
I have found this method for retrieving the indexes:
segments = df.groupby(['bins_x', 'bins_y']).apply(lambda x: x.index.tolist())
which takes ~10 seconds, however it is still comparatively quite slow compared to the binning and grouping functions. Since i'm simply trying to copy the data to a new array or list, and not actually performing any computation on it, i am expecting much better efficiency. I would expect speeds at least similar to the binning and grouping operations.
I am curious if there is a more efficient way of extracting the indexes (or any of the information) from a groupby object? Alternatively is there another method for segmenting/ grouping points which doesn't use pandas, such as a numpy or scipy alternative?

Related

Numpy (or scipy) binning of time series values based on timestamps

I'm trying to bin (downsample) a time series based on its timestamps. For instance:
import numpy as np
import pandas as pd
timestamps = np.linspace(0, 1000, 10000)
values = np.random.random(10000)
I usually convert it to a dataframe, and use cut (or qcut) to create the bins:
timeseries_df = pd.DataFrame({"Timestamps": timestamps, "Values": values})
timeseries_df["Bins"] = pd.cut(timeseries_df["Timestamps"],100) #downsampling by two orders of magnitude
ds_timestamps = timeseries_df.groupby("Bins").max()["Timestamps"]
ds_values = timeseries_df.groupby("Bins").mean()["Values"]
This works, but I'm writing functions that I can reuse and I'd like to avoid using pandas if possible. I've tried implementing a version of what's been suggested here
ds_timestamps = np.linspace(timestamps.min(), timestamps.max(), 100)
digitized_timestamps = np.digitize(timestamps, ds_timestamps)
ds_values = [values[digitized_timestamps == i+1].mean() for i in range(len(ds_timestamps))]
This also works but is extremely slow. Is there another way of doing this?

As mentioned in the comments, if your primary concern for not using Pandas is speed, I'd actually recommend using it, because it's not written entirely in Python, but it has many internal portions written using Cython (basically C), so they're very, very fast.

Efficient way to rebuild a dictionary of dataframes

I have a dictionary that is filled with multiple dataframes. Now I am searching for an efficient way for changing the key structure, but the solution I have found is rather slow when more dataframes / bigger dataframes are involved. Thats why I wanted to ask if anyone might know a more convenient / efficient / faster way or approach than mine. So first, I created this example to show where I initially started:
import pandas as pd
import numpy as np
# assign keys to dic
teams = ["Arsenal", "Chelsea", "Manchester United"]
dic_teams = {}
# fill dic with random entries
for t1 in teams:
dic_teams[t1] = pd.DataFrame({'date': pd.date_range("20180101", periods=30),
'Goals': pd.Series(np.random.randint(0,5, size = 30)),
'Chances': pd.Series(np.random.randint(0,15, size = 30)),
'Fouls': pd.Series(np.random.randint(0, 20, size = 30)),
'Offside': pd.Series(np.random.randint(0, 10, size = 30))})
dic_teams[t1] = dic_teams[t1].set_index('date')
dic_teams[t1].index.name = None
Now I basically have a dictionary where every key is a team, which means I have a dataframe for every team with information on their game performance over time. Now I would prefer to change this particular dictionary so I get a structure where the key is the date, instead of a team. This would mean that I have a dataframe for every date, which is filled with the performance of each team on that date. I managed to do that using the following code, which works but is really slow once I add more teams and performance factors:
# prepare lists for looping
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}
# new structure where key = date
for d in dates:
dic_dates[d] = pd.DataFrame(index = teams, columns = perf)
for t2 in teams:
dic_dates[d].loc[t2] = dic_teams[t2].loc[d]
Because I am using a nested loop, the restructuring of my dictionary is slow. Does anyone have an idea how I could improve the second piece of code? I'm not necessarily searching just for a solution, also for a logic or idea how to do better.
Thanks in advance, any help is highly appreciated

Creating a Pandas dataframes the way you do is (strangely) awfully slow, as well as direct indexing.
Copying a dataframe is surprisingly quite fast. Thus you can use an empty reference dataframe copied multiple times. Here is the code:
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
zygote = pd.DataFrame(index = teams, columns = perf)
dic_dates = {}
# new structure where key = date
for d in dates:
dic_dates[d] = zygote.copy()
for t2 in teams:
dic_dates[d].loc[t2] = dic_teams[t2].loc[d]
This is about 2 times faster than the reference on my machine.
Overcoming the slow dataframe direct indexing is tricky. We can use numpy to do that. Indeed, we can convert the dataframe to a 3D numpy array, use numpy to perform the transposition, and finally convert the slices into dataframes again. Note that this approach assumes that all values are integers and that the input dataframe are well structured.
Here is the final implementation:
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}
# Create a numpy array from Pandas dataframes
# Assume the order of the `dates` and `perf` indices are the same in all dataframe (and their order)
full = np.empty(shape=(len(teams), len(dates), len(perf)), dtype=int)
for tId,tName in enumerate(teams):
full[tId,:,:] = dic_teams[tName].to_numpy()
# New structure where key = date, created from the numpy array
for dId,dName in enumerate(dates):
dic_dates[dName] = pd.DataFrame({pName: full[:,dId,pId] for pId,pName in enumerate(perf)}, index = teams)
This implementation is 6.4 times faster than the reference on my machine. Note that about 75% of the time is sadly spent in the pd.DataFrame calls. Thus, if you want a faster code, use a basic 3D numpy array!

Pandas - expanding inverse quantile function

I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
a b
1 0.277438 0.042671
.. ... ...
499 0.570952 0.865869
[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
a b
0 99 99
.. .. ..
499 58 84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
pd.expanding_quantile(df.T.unstack(), integer/100.0)
for integer in range(0,101,1)})
percentile_mask = pd.Series(index = df.unstack().unstack().unstack().index)
for integer in range(0,100,1):
percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
(df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!

As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse-quantile of a value relative to an existing data set is analagous to finding the position we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast with dealing with mostly sorted data (and inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data, and with each row iteration, add it to our existing list and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
""" Reconstruct a DataFrame of expanding quantiles by row """
# Construct skeleton of DataFrame what we'll fill with quantile values
quantile_df = pd.DataFrame(np.NaN, index=df.index, columns=df.columns)
# Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
num_valid = np.sum(~np.isnan(df.values))
sorted_array = np.empty(num_valid)
# We want to maintain that sorted_array[:length] has data and is sorted
length = 0
# Iterates over ndarray rows
for i, row_array in enumerate(df.values):
# Extract non-NaN numpy array from row
row_is_nan = np.isnan(row_array)
add_array = row_array[~row_is_nan]
# Add new data to our sorted_array and sort.
new_length = length + len(add_array)
sorted_array[length:new_length] = add_array
length = new_length
sorted_array[:length].sort(kind="mergesort")
# Query the relative positions, divide by length to get quantiles
quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(np.float) / length
# Insert values into quantile_df
quantile_df.iloc[i][~row_is_nan] = quantile_row
return quantile_df
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right' which determines whether you want your prospective inserted position to be the first or last suitable position possible. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(np.float) / (length * 2)

Yet not quite clear, but do you want a cumulative sum divided by total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b

Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezeing seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)
for current_index in df.index:
preceding_rows = df.loc[:current_index, :]
# Combine values from all columns into a single 1D array
# * 2 should be * N if you have N columns
combined = preceding_rows.values.reshape((1, len(preceding_rows) *2)).squeeze()
a_percentile[current_index] = stats.percentileofscore(
combined,
df.loc[current_index, 'a'],
kind='weak'
)
b_percentile[current_index] = stats.percentileofscore(
combined,
df.loc[current_index, 'b'],
kind='weak'
)

Interpolating one time series onto another in pandas

I have one set of values measured at regular times. Say:
import pandas as pd
import numpy as np
rng = pd.date_range('2013-01-01', periods=12, freq='H')
data = pd.Series(np.random.randn(len(rng)), index=rng)
And another set of more arbitrary times, for example, (in reality these times are not a regular sequence)
ts_rng = pd.date_range('2013-01-01 01:11:21', periods=7, freq='87Min')
ts = pd.Series(index=ts_rng)
I want to know the value of data interpolated at the times in ts.
I can do this in numpy:
x = np.asarray(ts_rng,dtype=np.float64)
xp = np.asarray(data.index,dtype=np.float64)
fp = np.asarray(data)
ts[:] = np.interp(x,xp,fp)
But I feel pandas has this functionality somewhere in resample, reindex etc. but I can't quite get it.

You can concatenate the two time series and sort by index. Since the values in the second series are NaN you can interpolate and the just select out the values that represent the points from the second series:
pd.concat([data, ts]).sort_index().interpolate().reindex(ts.index)
or
pd.concat([data, ts]).sort_index().interpolate()[ts.index]

Assume you would like to evaluate a time series ts on a different datetime_index. This index and the index of ts may overlap. I recommend to use the following groupby trick. This essentially gets rid of dubious double stamps. I then forward interpolate but feel free to apply more fancy methods
def interpolate(ts, datetime_index):
x = pd.concat([ts, pd.Series(index=datetime_index)])
return x.groupby(x.index).first().sort_index().fillna(method="ffill")[datetime_index]

Here's a clean one liner:
ts = np.interp( ts_rng.asi8 ,data.index.asi8, data[0] )

doing better than numpy's in1d mask function: ordered arrays?

This operation needs to be applied as fast as possible as the actual arrays which contain millions of elements. This is a simple version of the problem.
So, I have a random array of unique integers (normally millions of elements).
totalIDs = [5,4,3,1,2,9,7,6,8 ...]
I have another array (normally a tens of thousands) of unique integers which I can create a mask.
subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...
I can use numpy to do
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
I can then extract the information I want of another array using the mask (say column 0 contains the one I want).
variable = allvariables[mask][:,0]
Now given that the IDs are unique in both arrays, is there any way to speed this up significantly. It takes a long time to construct the mask for a few thousand points (subsampleIDs) matching against millions of IDs (totalIDs).
I thought of going through it once and writing out a binary file of an index (to speed up future searches).
for i in range(0,3):
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
index[mask] = i
where X is in subsampleIDsX. Then I can just do:
for i in range(0,3):
if index[i] == i:
rowmatch = i
break
variable = allvariables[rowmatch:len(subsampleIDs),0]
right? But this is also slow because there is a conditional in the loop to find when it first matches. Is there a faster way to find when a number first appears in an ordered array so the conditional doesn't slow the loop?

I suggest you use DataFrame in Pandas. the index of the DataFrame is the totalIDs, and you can select subsampleIDs by: df.ix[subsampleIDs].
Create some test data first:
import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))
subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)
Then convert you data in to a DataFrame:
import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs)
df.ix[subsampleIDs]
DataFrame use a hashtable to map the index to it's location, it's very fast.

Often this kind of indexing is best performed using a DB (with proper column-indexing).
Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d, which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.
EDIT:
Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs, and M the size of sampleIDs (times a constant you can play with by changing the number of buckets). Here too, you can split totalIDs to buckets only once, which gives you a nifty O(N+M1+M2+...).
Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas groupby: efficiently extract indexes - python

Related

Numpy (or scipy) binning of time series values based on timestamps

Efficient way to rebuild a dictionary of dataframes

Pandas - expanding inverse quantile function

Interpolating one time series onto another in pandas

doing better than numpy's in1d mask function: ordered arrays?

Categories

Resources