I have the following (time-series) data:
t = [5.13, 5.27, 5.40, 5.46, 190.99, 191.13, 191.267, 368.70, 368.83, 368.90, 368.93]
y = [17.17, 17.18, 17.014, 17.104, 16.981, 16.96, 16.85, 17.27, 17.66, 17.76, 18.01]
so, groups of data within short time intervals, separated cleanly by long time gaps.
I'm looking for a simple method that will intelligently average these together; sort of a 'Bayesian blocks' but for non-histogram data.
One could do a simple moving average, or numpy convolution, but I'm looking for something a bit smarter that will generalize to larger, similar, but not identical datasets.
It's easy with Pandas. First, construct a DataFrame:
df = pd.DataFrame({'t':t,'y':y})
Then label the groups according to a time threshold:
groups = (df.t.diff() > 10).cumsum()
That gives you [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2], because cumsum() on a boolean array increments wherever the input is true.
Finally, use groupby():
df.groupby(groups).mean()
It gives you:
         t          y
t
0    5.315  17.117000
1  191.129  16.930333
2  368.840  17.675000
If you need plain NumPy arrays at the end, just tack on .t.values and .y.values.
If you don't know a priori what time threshold to use, I'm sure you can come up with some heuristic, perhaps involving simple statistics on df.t and df.t.diff().
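For example, one possible heuristic (just a sketch; the factor of 5 is an arbitrary choice you would tune for your data) is to flag any gap that is much larger than the median gap:
import pandas as pd

t = [5.13, 5.27, 5.40, 5.46, 190.99, 191.13, 191.267, 368.70, 368.83, 368.90, 368.93]
y = [17.17, 17.18, 17.014, 17.104, 16.981, 16.96, 16.85, 17.27, 17.66, 17.76, 18.01]
df = pd.DataFrame({'t': t, 'y': y})

# Treat any gap much larger than the typical (median) gap as a group break.
gaps = df.t.diff()
groups = (gaps > 5 * gaps.median()).cumsum()
averaged = df.groupby(groups).mean()
On this data it reproduces the same three groups as the fixed threshold of 10.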
I'm currently trying to create a column in a pandas dataframe, that creates a counter that equals the number of rows in the dataframe, divided by 2. Here is my code so far:
# Fill the cycles column with however many rows exist / 2
for x in ((jac_output.index)/2):
    jac_output.loc[x, 'Cycles'] = x+1
However, I've noticed that it misses out values every so often (screenshot of the resulting dataframe omitted).
Why would my counter miss a value every so often as it gets higher? And is there another way of optimizing this, as it seems to be quite slow?
You may have removed some data from the dataframe, so some indices are missing; therefore you should use reset_index to renumber them, or you can just use
for x in np.arange(0, len(jac_output.index), 1)/2:
You can view jac_output.index as a list like [0, 1, 2, ...]. When you divide it by 2, it results in [0, 0.5, 1, ...]. 0.5 is surely not in your original index.
To slice the index into half, you can try:
jac_output.index[:len(jac_output.index)//2]
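If the goal is just to number the first half of the rows 1, 2, 3, ... in a 'Cycles' column (the reading suggested by the index-slicing above), a vectorized sketch along these lines avoids the loop and the missing-label problem entirely (the 'value' column is only a stand-in for the real data):
import numpy as np
import pandas as pd

# Toy stand-in for jac_output; note the gaps in the index, as if rows had
# been filtered out earlier.
jac_output = pd.DataFrame({'value': range(8)}, index=[0, 1, 2, 4, 5, 7, 8, 9])

# Number the first half of the rows 1, 2, 3, ...; the assignment aligns on
# the existing index labels, so index gaps are handled automatically and the
# remaining rows get NaN.
half = len(jac_output) // 2
jac_output['Cycles'] = pd.Series(np.arange(1, half + 1),
                                 index=jac_output.index[:half])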
Empirically it seems that whenever you set_index on a Dask dataframe, Dask will always put rows with equal indexes into a single partition, even if it results in wildly imbalanced partitions.
Here is a demonstration:
import pandas as pd
import dask.dataframe as dd
users = [1]*1000 + [2]*1000 + [3]*1000
df = pd.DataFrame({'user': users})
ddf = dd.from_pandas(df, npartitions=1000)
ddf = ddf.set_index('user')
counts = ddf.map_partitions(lambda x: len(x)).compute()
counts.loc[counts > 0]
# 500 1000
# 999 2000
# dtype: int64
However, I found no guarantee of this behaviour anywhere.
I have tried to sift through the code myself but gave up. I believe one of these inter-related functions probably holds the answer:
set_index
set_partitions
rearrange_by_column
rearrange_by_column_tasks
SimpleShuffleLayer
When you set_index, is it the case that a single index can never be in two different partitions? If not, then under what conditions does this property hold?
Bounty: I will award the bounty to an answer that draws from a reputable source. For example, referring to the implementation to show that this property has to hold.
is it the case that a single index can never be in two different partitions?
No, it's certainly allowed. Dask even intends for this to happen. However, because of a bug in set_index, all the data will still end up in one partition.
An extreme example (every row is the same value except one):
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"A": [0] + [1] * 20})
In [4]: ddf = dd.from_pandas(df, npartitions=10)
In [5]: s = ddf.set_index("A")
In [6]: s.divisions
Out[6]: (0, 0, 0, 0, 0, 0, 0, 1)
As you can see, Dask intends for the 0s to be split up between multiple partitions. Yet when the shuffle actually happens, all the 0s still end up in one partition:
In [7]: import dask
In [8]: dask.compute(s.to_delayed()) # easy way to see the partitions separately
Out[8]:
([Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],)
This is because the code deciding which output partition a row belongs to doesn't consider duplicates in divisions. Treating divisions as a Series, it uses searchsorted with side="right", which is why all the data always ends up in the last partition.
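A minimal illustration of that mechanism (this only mimics the searchsorted behaviour described above, not Dask's actual partitioning code):
import pandas as pd

divisions = pd.Series([0, 0, 0, 0, 0, 0, 0, 1])   # the divisions from In [6]

# side="right" returns the position past the last duplicate 0, so every row
# with index 0 gets routed to the same, final candidate partition.
print(divisions.searchsorted(0, side="right"))    # 7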
I'll update this answer when the issue is fixed.
Is it the case that a single index can never be in two different partitions?
IIUC, the answer for practical purposes is yes.
A dask dataframe will in general have multiple partitions and dask may or may not know about the index values associated with each partition (see Partitions). If dask does know which partition contains which index range, then this will be reflected in df.divisions output (if not, the result of this call will be None).
When running .set_index, dask will compute divisions and it seems that in determining the divisions it will require that divisions are sequential and unique (except for the last element). The relevant code is here.
This raises two potential follow-up questions: why not allow arbitrary non-sequential indexing, and, as a special case of that, why not allow duplicate index values across partitions?
Regarding the first question: for smallish data it might be feasible to think about a design that allows non-sorted indexing, but you can imagine that general non-sorted indexing won't scale well, since dask would need to store the indexes for each partition somehow.
Regarding the second question: this seems like it should be possible, but right now it does not appear to be implemented correctly. See the snippet below:
# use this to generate 10 indexed partitions
import pandas as pd
for user in range(10):
    df = pd.DataFrame({'user_col': [user//3]*100})
    df['user'] = df['user_col']
    df = df.set_index('user')
    df.index.name = 'user_index'
    df.to_parquet(f'test_{user}.parquet', index=True)
# now load them into a dask dataframe
import dask.dataframe as dd
ddf = dd.read_parquet('test_*.parquet')
# dask will know about the divisions
print(ddf.known_divisions) # True
# further evidence
print(ddf.divisions) # (0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3)
# this should show three partitions, but will show only one
print(ddf.loc[0].npartitions) # 1
I have just noticed that Dask's documentation for shuffle says
After this operation, rows with the same value of on will be in the same partition.
This seems to confirm my empirical observation.
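If you want to check the behaviour on your own data rather than rely on that sentence, a rough sketch (independent of Dask internals) is to materialise the partitions and record which ones contain each index value:
import dask
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'user': [1]*1000 + [2]*1000 + [3]*1000})
ddf = dd.from_pandas(df, npartitions=1000).set_index('user')

# Compute each partition separately and note which partitions hold each value.
parts = dask.compute(*ddf.to_delayed())
membership = {}
for i, part in enumerate(parts):
    for value in part.index.unique():
        membership.setdefault(value, set()).add(i)

# If every value maps to exactly one partition, the observed property holds
# for this dataset (this is a check, not a proof of a general guarantee).
print({value: sorted(p) for value, p in membership.items()})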
I have some large numbers that are some sort of device identifiers:
clusteringOutput[:,1]
Out[140]:
array([1.54744609e+12, 1.54744946e+12, 1.54744133e+12, ...,
       1.54744569e+12, 1.54744570e+12, 1.54744571e+12])
Even though the numbers are large, there are only a handful of distinct values that just repeat over the entries.
I would like to remap those onto a smaller range of integers. So if there are only 100 distinct values, I would like to map them onto the scale 1-100, with a mapping table that allows me to find and see those mappings.
The remapping functions I've found on the internet typically rescale, and I do not want to rescale. I want concrete integer numbers that map the long IDs I have to numbers that are simpler on the eyes.
Any ideas on how I can implement that? I can use pandas data frames if it helps.
Thanks a lot
Alex
Use numpy.unique with return_inverse=True:
import numpy as np
arr = np.array([1.54744609e+12, 1.54744946e+12, 1.54744133e+12,
                1.54744133e+12, 1.54744569e+12, 1.54744570e+12,
                1.54744571e+12])
mapper, ind = np.unique(arr, return_inverse=True)
Output of ind:
array([4, 5, 0, 0, 1, 2, 3])
Remapping using mapper:
mapper[ind]
# array([1.54744609e+12, 1.54744946e+12, 1.54744133e+12, 1.54744133e+12,
# 1.54744569e+12, 1.54744570e+12, 1.54744571e+12])
Validation:
all(arr == mapper[ind])
# True
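If you also want the explicit mapping table mentioned in the question, here is a small sketch on top of the same np.unique call (the device_id/small_id column names are just placeholders):
import numpy as np
import pandas as pd

arr = np.array([1.54744609e+12, 1.54744946e+12, 1.54744133e+12,
                1.54744133e+12, 1.54744569e+12, 1.54744570e+12,
                1.54744571e+12])

mapper, ind = np.unique(arr, return_inverse=True)

# Small integer labels starting at 1, as asked for in the question.
small_ids = ind + 1              # array([5, 6, 1, 1, 2, 3, 4])

# Lookup table pairing each original ID with its new small integer.
mapping_table = pd.DataFrame({'device_id': mapper,
                              'small_id': np.arange(1, len(mapper) + 1)})
print(mapping_table)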
I need to convert a data frame to a sparse matrix. The data frame looks similar to this (the actual data is way too big: approx. 500,000 rows and 1,000 columns).
I need to convert it into a matrix such that the rows of the matrix are 'id', the columns are 'names', and only the finite values are stored; no NaNs should be stored (to reduce memory usage). When I tried using pd.pivot_table, it took a long time to build the matrix for my big data.
In R, there is a method called 'dMcast' for this purpose. I explored but could not find an equivalent in Python. I'm new to Python.
First I will convert the categorical names column to indices. Maybe pandas has this functionality already?
names = list('PQRSPSS')
name_ids_map = {n:i for i, n in enumerate(set(names))}
name_ids = [name_ids_map[n] for n in names]
Then I would use scipy.sparse.coo_matrix and then maybe convert that to another sparse format.
import scipy.sparse

ids = [1, 1, 1, 1, 2, 2, 3]
rating = [2, 4, 1, 4, 2, 2, 1]
sp = scipy.sparse.coo_matrix((rating, (ids, name_ids)))
print(sp)
sp.tocsc()
I am not aware of a sparse matrix library that can index a dimension with categorical data like 'R', 'S', etc.
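On the "maybe pandas has this functionality already" aside: pd.factorize does exactly this kind of categorical-to-integer encoding, for both axes. A sketch combining it with the scipy approach above, on the same toy data (the column names are assumptions about the original frame):
import pandas as pd
import scipy.sparse

df = pd.DataFrame({'id':     [1, 1, 1, 1, 2, 2, 3],
                   'names':  list('PQRSPSS'),
                   'rating': [2, 4, 1, 4, 2, 2, 1]})

# Encode ids and names as 0-based integer codes, keeping the label arrays
# so matrix positions can be translated back.
row_codes, row_labels = pd.factorize(df['id'])
col_codes, col_labels = pd.factorize(df['names'])

sp = scipy.sparse.coo_matrix(
    (df['rating'].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels))).tocsr()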
df1
Date        Topic   Return
1/1/2010    A,B     -0.308648967
1/2/2010    C,D     -0.465862046
1/3/2010    E        0.374052392
1/4/2010    F        0.520312204
1/5/2010    G        0.503889198
1/6/2010    H       -1.730646788
1/7/2010    L,M,N    1.756295613
1/8/2010    K       -0.598990239
......
1/30/2010   z        2,124355
plot = df1.plot(x='Date', y='Return')
How can I find the highest peaks and the lowest troughs on this graph and label these special points with the corresponding Topics?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Take an example data
data = {"Date":["date{i}".format(i=i) for i in range(10)], "Topic":["topic{i}".format(i=i) for i in range(10)], "Return":[1,2,3,2,1,2,4,7,1,3]}
df = pd.DataFrame.from_dict(data)
dates = np.array(df["Date"].tolist())
returns = np.array(df["Return"].tolist())
# Calculate the local minima and maxima
minimas = (np.diff(np.sign(np.diff(returns))) > 0).nonzero()[0] + 1
maximas = (np.diff(np.sign(np.diff(returns))) < 0).nonzero()[0] + 1
# Plot the entire data first
plt.plot(dates, returns)
# Then mark the maximas and the minimas
for minima in minimas:
    plt.plot(df.iloc[minima]["Date"], df.iloc[minima]["Return"], marker="o", label=str(df.iloc[minima]["Topic"]))
for maxima in maximas:
    plt.plot(df.iloc[maxima]["Date"], df.iloc[maxima]["Return"], marker="o", label=str(df.iloc[maxima]["Topic"]))
plt.legend()
plt.show()
Example dataframe:
    Date   Topic  Return
0  date0  topic0       1
1  date1  topic1       2
2  date2  topic2       3
3  date3  topic3       2
4  date4  topic4       1
5  date5  topic5       2
6  date6  topic6       4
7  date7  topic7       7
8  date8  topic8       1
9  date9  topic9       3
The plot it produces marks each local minimum and maximum with a labelled point (figure not reproduced here).
This depends a little bit on your definitions of "peak" and "trough". Oftentimes, a person might care about smoothed peaks and troughs to identify broad trends, especially in the presence of noise. In the event that you want every fine-grained dip or rise in the data though (and if your rows are sorted), you can cheat a little bit with vectorized routines from numpy.
import numpy as np
d = np.diff(df['Return'])
i = np.argwhere((d[:-1]*d[1:])<=0).flatten()
special_points = df['Topic'][i+1]
The first line with np.diff() compares each return value to the next return value. In particular, it subtracts them. Depending a little on your definition of a local peak/trough, these will have the property that you only have a feature you're looking for if these pairwise differences alternate in sign. Consider the following peak.
[1, 5, 1]
If you compute the pairwise differences, you get a slightly shorter vector
[4, -4]
Note that these alternate in sign. Hence, if you multiply them you get -16, which must be negative. This is the exact insight that our code uses to identify the peaks and troughs. The dimension reduction offsets things a little bit, so we shift the indices we find by 1 (in the df['Topic'][i+1] block).
Caveats: Note that we have <= instead of strict inequality. This is in case we have a wider peak than normal. Consider [1, 2, 2, 2, 2, 2, 1]. Arguably, the string of 2's represents a peak and would need to be captured. If that isn't desirable, make the inequality strict.
Additionally, if you're interested in wider peaks like that, this algorithm still isn't correct. It's plenty fast, but in general it only computes a superset of the peaks/troughs. Consider the following
[1, 2, 2, 3, 2, 1]
Arguably, the number 3 is the only peak in that dataset (depends a bit on your definitions of course), but our algorithm will also pick up the first and second instances of the number 2 due to their being on a shelf (being identical to a neighbor).
Extras: The scipy.signal module has a variety of peak-finding algorithms which may be better suited depending on any extra requirements you have on your peaks. Modifying this solution is unlikely to be as fast or clean as using an appropriate built-in signal processor. A call to scipy.signal.find_peaks() can basically replicate everything we've done here, and it has more options if you need them. Other algorithms like scipy.signal.find_peaks_cwt() might be more appropriate if you need any kind of smoothing or more complicated operations.
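For completeness, a minimal sketch of that scipy.signal.find_peaks route on the same toy data as the first answer; troughs are just peaks of the negated series:
import pandas as pd
from scipy.signal import find_peaks

data = {"Date": ["date{i}".format(i=i) for i in range(10)],
        "Topic": ["topic{i}".format(i=i) for i in range(10)],
        "Return": [1, 2, 3, 2, 1, 2, 4, 7, 1, 3]}
df = pd.DataFrame(data)

returns = df["Return"].to_numpy()
peaks, _ = find_peaks(returns)      # local maxima (indices 2 and 7 here)
troughs, _ = find_peaks(-returns)   # local minima (indices 4 and 8 here)

print(df.iloc[peaks][["Date", "Topic", "Return"]])
print(df.iloc[troughs][["Date", "Topic", "Return"]])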