I'm trying to bin (downsample) a time series based on its timestamps. For instance:
import numpy as np
import pandas as pd
timestamps = np.linspace(0, 1000, 10000)
values = np.random.random(10000)
I usually convert it to a dataframe, and use cut (or qcut) to create the bins:
timeseries_df = pd.DataFrame({"Timestamps": timestamps, "Values": values})
timeseries_df["Bins"] = pd.cut(timeseries_df["Timestamps"],100) #downsampling by two orders of magnitude
ds_timestamps = timeseries_df.groupby("Bins").max()["Timestamps"]
ds_values = timeseries_df.groupby("Bins").mean()["Values"]
This works, but I'm writing functions that I can reuse and I'd like to avoid using pandas if possible. I've tried implementing a version of what's been suggested here
ds_timestamps = np.linspace(timestamps.min(), timestamps.max(), 100)
digitized_timestamps = np.digitize(timestamps, ds_timestamps)
ds_values = [values[digitized_timestamps == i+1].mean() for i in range(len(ds_timestamps))]
This also works but is extremely slow. Is there another way of doing this?
As mentioned in the comments, if your main reason for avoiding Pandas is speed, I'd actually recommend using it: it isn't written entirely in Python, and many of its internals are implemented in Cython (essentially C), so they're very, very fast.
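For instance, the cut/groupby logic from the question can be wrapped into a reusable helper so Pandas stays an implementation detail. This is a minimal sketch reusing the imports and arrays from the question; the function name and signature are illustrative, not from the original post:
def downsample(timestamps, values, n_bins=100):
    # Bin by timestamp and aggregate each bin (max timestamp, mean value),
    # mirroring the cut/groupby approach from the question.
    df = pd.DataFrame({"Timestamps": timestamps, "Values": values})
    grouped = df.groupby(pd.cut(df["Timestamps"], n_bins), observed=True)
    return (grouped["Timestamps"].max().to_numpy(),
            grouped["Values"].mean().to_numpy())

ds_timestamps, ds_values = downsample(timestamps, values, n_bins=100)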
Related
I have a dictionary that is filled with multiple dataframes. Now I am searching for an efficient way of changing the key structure, but the solution I have found is rather slow when more or bigger dataframes are involved. That's why I wanted to ask if anyone knows a more convenient, efficient, or faster approach than mine. So first, I created this example to show where I initially started:
import pandas as pd
import numpy as np
# assign keys to dic
teams = ["Arsenal", "Chelsea", "Manchester United"]
dic_teams = {}
# fill dic with random entries
for t1 in teams:
    dic_teams[t1] = pd.DataFrame({'date': pd.date_range("20180101", periods=30),
                                  'Goals': pd.Series(np.random.randint(0, 5, size=30)),
                                  'Chances': pd.Series(np.random.randint(0, 15, size=30)),
                                  'Fouls': pd.Series(np.random.randint(0, 20, size=30)),
                                  'Offside': pd.Series(np.random.randint(0, 10, size=30))})
    dic_teams[t1] = dic_teams[t1].set_index('date')
    dic_teams[t1].index.name = None
I now basically have a dictionary where every key is a team, i.e. a dataframe for every team with information on their game performance over time. I would prefer to restructure this dictionary so that the key is the date instead of the team. This would mean I have a dataframe for every date, filled with the performance of each team on that date. I managed to do that using the following code, which works but gets really slow once I add more teams and performance factors:
# prepare lists for looping
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}
# new structure where key = date
for d in dates:
    dic_dates[d] = pd.DataFrame(index=teams, columns=perf)
    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]
Because I am using a nested loop, the restructuring of my dictionary is slow. Does anyone have an idea how I could improve the second piece of code? I'm not necessarily looking just for a solution, but also for the logic or an idea of how to do this better.
Thanks in advance, any help is highly appreciated
Creating Pandas dataframes the way you do is (strangely) awfully slow, and so is direct indexing.
Copying a dataframe, on the other hand, is surprisingly fast. Thus you can create an empty reference dataframe once and copy it many times. Here is the code:
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
zygote = pd.DataFrame(index = teams, columns = perf)
dic_dates = {}
# new structure where key = date
for d in dates:
    dic_dates[d] = zygote.copy()
    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]
This is about 2 times faster than the reference on my machine.
Overcoming the slow dataframe direct indexing is tricky. We can use numpy to do that. Indeed, we can convert the dataframes to a 3D numpy array, use numpy to perform the transposition, and finally convert the slices into dataframes again. Note that this approach assumes that all values are integers and that the input dataframes are well structured.
Here is the final implementation:
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}
# Create a numpy array from the Pandas dataframes
# Assume the `dates` and `perf` indices are the same (and in the same order) in every dataframe
full = np.empty(shape=(len(teams), len(dates), len(perf)), dtype=int)
for tId, tName in enumerate(teams):
    full[tId, :, :] = dic_teams[tName].to_numpy()
# New structure where key = date, created from the numpy array
for dId, dName in enumerate(dates):
    dic_dates[dName] = pd.DataFrame({pName: full[:, dId, pId] for pId, pName in enumerate(perf)}, index=teams)
This implementation is 6.4 times faster than the reference on my machine. Note that about 75% of the time is sadly spent in the pd.DataFrame calls. Thus, if you want faster code, use a plain 3D numpy array!
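To follow that last suggestion, here is a minimal sketch of keeping the result as a plain 3D numpy array with no per-date DataFrames at all; the lookup dictionaries and the helper performance_on are illustrative names, not part of the answer above:
date_pos = {d: i for i, d in enumerate(dates)}   # date -> position along axis 1 of `full`
team_pos = {t: i for i, t in enumerate(teams)}   # team -> position along axis 0 of `full`

def performance_on(date, team):
    # `full` has shape (teams, dates, perf); returns that team's perf values for the date
    return full[team_pos[team], date_pos[date], :]

arsenal_opening_day = performance_on(dates[0], "Arsenal")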
I think my code is inefficient and that there may be a better way to do it.
The objective of the code is to take an Excel listing and relate each element of a column to every other element of the same column; when certain conditions are met, the pair is stored in a new data frame with the combined information. In my case the file has more than 16,000 rows, so the exercise must perform 16,000 x 16,000 = 256,000,000 iterations, and it takes days to process.
The code I have is the following:
import pandas as pd
import numpy as np
excel1="Cs.xlsx"
dataframe1=pd.read_excel(excel1)
col_names=["Eb","Eb_n","Eb_Eb","L1","Ll1","L2","Ll2","D"]
my_df =pd.DataFrame(columns=col_names)
count_row = dataframe1.shape[0]
print(count_row)
for n in range(0, count_row):
    for p in range(0, count_row):
        if (abs(dataframe1.iloc[n, 1] - dataframe1.iloc[p, 1]) < 0.27
                and abs(dataframe1.iloc[n, 2] - dataframe1.iloc[p, 2]) < 0.27):
            Nb_Nb = dataframe1.iloc[n, 0] + "_" + dataframe1.iloc[p, 0]
            myrow = pd.Series([dataframe1.iloc[n, 0], dataframe1.iloc[p, 0], Nb_Nb,
                               dataframe1.iloc[n, 1], dataframe1.iloc[n, 2],
                               dataframe1.iloc[p, 1], dataframe1.iloc[p, 2]],
                              index=["Eb", "Eb_n", "Eb_Eb", "L1", "Ll1", "L2", "Ll2"])
            my_df = my_df.append(myrow, ignore_index=True)
print(my_df.head(5))
To start with, you can try using a different Python structure. Dataframes take a lot of memory and are slower to process.
Ordered from simple structures with more efficient processing to complex structures with less efficient processing (a numpy sketch follows the list):
Lists
Dictionaries
Numpy Arrays
Pandas Series
Pandas Dataframes
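As an illustration of the "Numpy Arrays" entry above, here is a hedged sketch of the pairwise comparison from the question done with broadcasting instead of the nested loop. It assumes, as in the original code, that column 0 holds the name and columns 1 and 2 hold the two numeric values:
import numpy as np
import pandas as pd

dataframe1 = pd.read_excel("Cs.xlsx")
names = dataframe1.iloc[:, 0].to_numpy()
a = dataframe1.iloc[:, 1].to_numpy(dtype=float)
b = dataframe1.iloc[:, 2].to_numpy(dtype=float)

# Evaluate both |difference| < 0.27 conditions for all pairs at once.
# Note: the intermediate matrices are n*n elements, so 16,000 rows needs a
# few GB of RAM; processing the rows in chunks is an option if memory is tight.
close = (np.abs(a[:, None] - a[None, :]) < 0.27) & (np.abs(b[:, None] - b[None, :]) < 0.27)
n_idx, p_idx = np.nonzero(close)

my_df = pd.DataFrame({
    "Eb": names[n_idx],
    "Eb_n": names[p_idx],
    "Eb_Eb": [f"{x}_{y}" for x, y in zip(names[n_idx], names[p_idx])],
    "L1": a[n_idx],
    "Ll1": b[n_idx],
    "L2": a[p_idx],
    "Ll2": b[p_idx],
})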
I am trying to efficiently split a 3D point cloud into a number of 2D tiles/segments.
Using a combination of numpy's searchsorted() and pandas' groupby() functions I have been able to sort the data into groups with pleasing speed.
For example:
import numpy as np
import pandas as pd
import time
scale=100
n_points= 1000000
n_tiles = 1000000
pos = np.empty((n_points,3))
pos[:,0]=np.random.random(n_points)*scale
pos[:,1]=np.random.random(n_points)*scale
pos[:,2]=np.random.random(n_points)
df = pd.DataFrame(pos)
# create bounds for each segment
min_bound,max_bound = 0,scale
x_segment_bounds, xstep = np.linspace(min_bound, max_bound, num=int(n_tiles**0.5), retstep=True)
x_segment_bounds[0] = x_segment_bounds[0] + xstep/2
y_segment_bounds, ystep = np.linspace(min_bound, max_bound, num=int(n_tiles**0.5), retstep=True)
y_segment_bounds[0] = y_segment_bounds[0] + ystep/2
# sort into bins
time_grab = time.perf_counter()
bins_x = np.searchsorted(x_segment_bounds, pos[:, 0])
bins_y = np.searchsorted(y_segment_bounds, pos[:, 1])
print("Time for binning: ", time.perf_counter() - time_grab)
df["bins_x"] = bins_x.astype(np.uint16)
df["bins_y"] = bins_y.astype(np.uint16)
# group points
time_grab = time.perf_counter()
segments = df.groupby(['bins_x', 'bins_y'])
print("Time for grouping: ", time.perf_counter() - time_grab)
Produces:
Time for binning: 0.1390
Time for grouping: 0.0043
The problem I am having is in efficiently accessing the point indexes that belong to each group in the pandas groupby object.
For example, looping through each group is very inefficient:
segment_indices = []
for i, segment in enumerate(segments):
    segment_indices.append(segment[1].index.values)
takes ~70 seconds.
I have found this method for retrieving the indexes:
segments = df.groupby(['bins_x', 'bins_y']).apply(lambda x: x.index.tolist())
which takes ~10 seconds; however, it is still quite slow compared to the binning and grouping functions. Since I'm simply copying the data to a new array or list, and not actually performing any computation on it, I would expect much better efficiency, at least speeds similar to the binning and grouping operations.
I am curious whether there is a more efficient way of extracting the indexes (or any other information) from a groupby object. Alternatively, is there another method for segmenting/grouping points that doesn't use pandas, such as a numpy or scipy alternative?
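For what it's worth, one numpy-only pattern sometimes used for this kind of grouping is to sort the point indices by a flattened tile id and split at the boundaries. A hedged sketch reusing bins_x, bins_y and y_segment_bounds from above (the variable names tile_id, order, boundaries and segment_indices are illustrative):
# Combine the two bin numbers into a single tile id per point
tile_id = bins_x.astype(np.int64) * (len(y_segment_bounds) + 1) + bins_y
order = np.argsort(tile_id, kind="stable")        # point indices sorted by tile
boundaries = np.flatnonzero(np.diff(tile_id[order])) + 1
segment_indices = np.split(order, boundaries)     # one index array per non-empty tile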
In Pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows, relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get a similar behaviour working with Dask. I realise things like row-shifts may be difficult to manage with Dask's chunked system, but I don't know of a better way to compare each row with the subsequent one.
What I'd like to be able to do is this:
import numpy as np
import pandas as pd
import dask.dataframe as dd
with pd.HDFStore(path) as store:
    data = dd.from_hdf(store, 'sim')[col1]
shifted = data.shift(1)
idx = data.apply(np.sign) != shifted.apply(np.sign)
in order to create a boolean series indicating the locations of sign changes in the data. (I am aware that method would also catch changes from a signed value to zero)
I would then use the boolean series to index a different Dask dataframe for plotting.
Rolling functions
Currently dask.dataframe does not implement the shift operation. It could, though, if you raise an issue. In principle this is not so dissimilar from the rolling operations that dask.dataframe does support, like rolling_mean, rolling_sum, etc.
Actually, if you were to create a Pandas function that adheres to the same API as these pandas.rolling_foo functions, then you could use the dask.dataframe.rolling.wrap_rolling function to turn your pandas-style rolling function into a dask.dataframe rolling function.
dask.dataframe.rolling_sum = wrap_rolling(pandas.rolling_sum)
The following code might help to shift down the series.
s = dd_df['column'].rolling(window=2).sum() - dd_df['column']
Edit (03/09/2019):
When you are rolling and finding the sum, for a particular row,
result[i] = row[i-1] + row[i]
Then by subtracting the old value of the column from the result, you are doing the following operation:
final_row[i] = result[i] - row[i]
Which equals:
final_row[i] = row[i-1] + row[i] - row[i]
Which ultimately results in the whole column getting shifted down once.
Tip:
If you want to shift it down multiple rows, you should actually execute the whole operation again that many times with the same window.
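A small pandas-only check of that identity (the same arithmetic carries over to a Dask series); the sample values are arbitrary:
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])
print((s.rolling(window=2).sum() - s).tolist())  # [nan, 3.0, 1.0, 4.0, 1.0]
print(s.shift(1).tolist())                       # [nan, 3.0, 1.0, 4.0, 1.0]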
This operation needs to be as fast as possible, as the actual arrays contain millions of elements. This is a simplified version of the problem.
So, I have a random array of unique integers (normally millions of elements).
totalIDs = [5,4,3,1,2,9,7,6,8 ...]
I have other arrays (normally tens of thousands of elements each) of unique integers from which I can create a mask:
subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...
I can use numpy to do
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
I can then extract the information I want from another array using the mask (say column 0 contains what I want).
variable = allvariables[mask][:,0]
Now, given that the IDs are unique in both arrays, is there any way to speed this up significantly? It takes a long time to construct the mask for a few thousand points (subsampleIDs) matched against millions of IDs (totalIDs).
I thought of going through it once and writing out a binary file of an index (to speed up future searches).
for i in range(0, 3):
    mask = np.in1d(totalIDs, subsampleIDs, assume_unique=True)
    index[mask] = i
where X is in subsampleIDsX. Then I can just do:
for i in range(0, 3):
    if index[i] == i:
        rowmatch = i
        break
variable = allvariables[rowmatch:len(subsampleIDs), 0]
right? But this is also slow because there is a conditional in the loop to find when it first matches. Is there a faster way to find when a number first appears in an ordered array so the conditional doesn't slow the loop?
I suggest you use a DataFrame in Pandas. The index of the DataFrame is totalIDs, and you can select subsampleIDs with df.loc[subsampleIDs].
Create some test data first:
import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))
subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)
Then convert your data into a DataFrame:
import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs)
df.loc[subsampleIDs]
The DataFrame uses a hash table to map index labels to their locations, so this lookup is very fast.
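If you need positional row indices (for example to index another array such as allvariables) rather than the selected rows themselves, here is a hedged sketch using Index.get_indexer on the test data above:
rows = df.index.get_indexer(subsampleIDs)  # integer positions into totalIDs, -1 if absent
variable = v1[rows]                        # pull the same rows from the original array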
Often this kind of indexing is best performed using a DB (with proper column-indexing).
Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d, which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.
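A hedged sketch of that "sort once" idea, using a binary search instead of in1d and reusing totalIDs, subsampleIDs1 and allvariables from the question. It assumes every subsample ID actually occurs in totalIDs, as stated; the helper name rows_for is illustrative:
import numpy as np

order = np.argsort(totalIDs)        # one-off preprocessing
sorted_ids = totalIDs[order]

def rows_for(subsample_ids):
    # Binary search in the pre-sorted IDs, then map back to original row numbers
    pos = np.searchsorted(sorted_ids, subsample_ids)
    return order[pos]

variable = allvariables[rows_for(subsampleIDs1), 0]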
EDIT:
Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs and M the size of sampleIDs (times a constant you can play with by changing the number of buckets). Here too, you can split totalIDs into buckets only once, which gives you a nifty O(N+M1+M2+...).
Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python