Rolling mean with intervals - python

How can I efficiently compute the rolling mean at fixed intervals?
import numpy as np
import pandas as pd

n = 50
s = pd.Series(data=np.random.randint(0, 10, n),
              index=pd.date_range(pd.to_datetime('today').floor('D'), freq='D', periods=n))
E.g. in the series above, with an interval of 4 days and 3 elements per window, the i-th element of the new series would be s_i = (s_(i-4) + s_(i-8) + s_(i-12)) / 3.

Have you checked out pandas.DataFrame.rolling? It might have what you're looking for.
If I understand correctly, here is an example with an array of 1 to 50:
import numpy as np
import pandas as pd

interval = 4
window = 3
data = np.linspace(1, 50, 50)
arr = pd.Series(data[::interval])  # subset the data to every 4th value
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window)  # look forward 3 positions from each of those values
arr.rolling(indexer).mean()  # take the mean of each window
The output would be [5.0, 9.0, 13.0, 17.0, ...]: 5 is the average of 1, 5, and 9; 9 is the average of 5, 9, and 13; and so on.
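Note that this looks forward from every 4th element, whereas the formula in the question looks back from each element. A minimal sketch of the backward version on the original daily series s, assuming NaN is acceptable for the first interval*window entries:

interval = 4
window = 3
# t_i = (s_(i-4) + s_(i-8) + s_(i-12)) / 3; the first 12 entries have
# no full window behind them and come out as NaN
t = sum(s.shift(interval * k) for k in range(1, window + 1)) / window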

Related

Group rows based on +- threshold on high dimensional object

I have a large df with coordinates in multiple dimensions. I am trying to create classes (objects) based on a threshold difference between the coordinates. An example df is below:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
Based on this df, I want to assign each row to a class using a threshold of ±2 across all coordinates, so that the df has a unique group name added to each row. The output for this threshold function would be:
x   y   z   group
1   10   7  -
2   14   6  -
3    5   2  G1
4   14  43  -
5    3   1  G1
6   12  40  -
It is similar to clustering, but I want to use my own threshold functions. How can this be done in Python?
EDIT
To clarify: the threshold is applied to the coordinates themselves. All rows within ± the threshold across all coordinates will be grouped as a single object. It can also be seen as grouping rows based on a threshold across all columns and assigning a unique label to each group.
As far as I understand, what you need is the apply function. It was not clear from your statement whether you need all pairwise differences between the coordinates or just the neighbouring ones (x - y and y - z): row 5 has a difference of 4 between its x and z coordinates but is still assigned to class G1.
That's why I wrote it for both possibilities, so you can choose the one you need:
import pandas as pd
import numpy as np

def your_specific_function(row):
    # For all pairwise differences, use this instead:
    # diffs = np.array([abs(row.x - row.y), abs(row.y - row.z), abs(row.x - row.z)])
    # For only the neighbouring differences (x - y and y - z), use this:
    diffs = np.diff(row)
    if all(diffs <= 2):
        return 'G1'
    else:
        return '-'

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12], 'z': [7, 6, 2, 43, 1, 40]})
df['group'] = df.apply(your_specific_function, axis=1)
print(df.head())
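If performance matters on a larger df, a vectorized sketch of the same neighbouring-difference check (assuming the columns are ordered x, y, z as above):

import numpy as np

# column-wise differences y-x and z-y, checked in one pass
diffs_ok = (df[['x', 'y', 'z']].diff(axis=1).iloc[:, 1:] <= 2).all(axis=1)
df['group'] = np.where(diffs_ok, 'G1', '-')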

Cumulative average in python

I'm working with CSV files.
I'd like to create a continuously updated average of a sequence, i.e. output the running average at each position of a list.
list: [a, b, c, d, e, f]
formula:
(a)/1= ?
(a+b)/2=?
(a+b+c)/3=?
(a+b+c+d)/4=?
(a+b+c+d+e)/5=?
(a+b+c+d+e+f)/6=?
To demonstrate:
if I have the list [1, 4, 7, 4, 19]
my output should be [1, 2.5, 4, 4, 7]
explained:
(1)/1=1
(1+4)/2=2.5
(1+4+7)/3=4
(1+4+7+4)/4=4
(1+4+7+4+19)/5=7
As for my Python file, the code is simple:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('somecsvfile.csv')
x = []  # has to be a list from 1 to however many rows are in the "numbers" column: [1, 2, 3, 4, 5] etc.
# x will be used to divide the numbers selected in y to give us z
y = df['numbers']
z = ...  # new series derived from the continuous average of y
plt.plot(x, z)
plt.show()
If numpy is needed that is no problem.
pandas.DataFrame.expanding is what you need.
Using it you can just call df.expanding().mean() to get the result you want:
mean = df.expanding().mean()
print(mean)
0    1.0
1    2.5
2    4.0
3    4.0
4    7.0
If you want to do it just in one column, use pandas.Series.expanding.
Just use the column instead of df:
df['column_name'].expanding().mean()
You can use cumsum to get the cumulative sum and then divide by the running count to get the running average.
import numpy as np

x = np.array([1, 4, 7, 4, 19])
z = np.cumsum(x) / np.arange(1, len(x) + 1)
print(z)
output:
[1.  2.5 4.  4.  7. ]
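If you'd rather stay in the standard library for this step, a sketch with itertools.accumulate gives the same running average:

from itertools import accumulate

x = [1, 4, 7, 4, 19]
z = [total / count for count, total in enumerate(accumulate(x), start=1)]
print(z)  # [1.0, 2.5, 4.0, 4.0, 7.0]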
To give a complete answer to your question, filling in the blanks of your code using numpy and plotting:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# df = pd.read_csv('somecsvfile.csv')
# instead I just create a df with a column named 'numbers'
df = pd.DataFrame([1, 4, 7, 4, 19], columns=['numbers'])
x = range(1, len(df) + 1)  # x will be used to divide the numbers selected in y to give us z
y = df['numbers']
z = np.cumsum(y) / np.array(x)
plt.plot(x, z, 'o')
plt.xticks(x)
plt.xlabel('Entry')
plt.ylabel('Cumulative average')
But as pointed out by Augusto, you can also just put the whole thing into a DataFrame. Adding a bit more to his approach:
n = [1, 4, 7, 4, 19]
df = pd.DataFrame(n, columns=['numbers'])
# augment the index so it starts at 1 like you want
df.index = np.arange(1, len(df) + 1)
# create a new column for the cumulative average
df = df.assign(cum_avg=df['numbers'].expanding().mean())
#    numbers  cum_avg
# 1        1      1.0
# 2        4      2.5
# 3        7      4.0
# 4        4      4.0
# 5       19      7.0
# plot
df['cum_avg'].plot(linestyle='none',
                   marker='o',
                   xticks=df.index,
                   xlabel='Entry',
                   ylabel='Cumulative average')

Transform a Pandas series to be monotonic

I'm looking for a way to remove the points that ruin the monotonicity of a series.
For example
s = pd.Series([0,1,2,3,10,4,5,6])
or
s = pd.Series([0,1,2,3,-1,4,5,6])
we would extract
s = pd.Series([0,1,2,3,4,5,6])
NB: we assume that the first element is always correct.
Monotonic can mean either increasing or decreasing; the functions below will exclude all values that break monotonicity.
However, there seems to be some confusion in your question: given the series s = pd.Series([0,1,2,3,10,4,5,6]), 10 doesn't break the monotonicity condition; 4, 5, 6 do. So the correct answer there is 0, 1, 2, 3, 10.
import pandas as pd

s = pd.Series([0, 1, 2, 3, 10, 4, 5, 6])

def to_monotonic_inc(s):
    return s[s >= s.cummax()]

def to_monotonic_dec(s):
    return s[s <= s.cummin()]

print(to_monotonic_inc(s))
print(to_monotonic_dec(s))
The output is 0, 1, 2, 3, 10 for increasing and 0 for decreasing.
Perhaps you want to find the longest monotonic subsequence? Because that's a completely different search problem.
----- EDIT -----
Below is a simple way of finding the longest monotonically ascending subsequence given your constraints, using plain Python:
def get_longest_monotonic_asc(s):
    # sort the candidate (value, index) pairs by value, skipping the first
    # element, then greedily keep values whose original index moves forward
    enumerated = sorted([(v, i) for i, v in enumerate(s) if v >= s[0]])[1:]
    output = [s[0]]
    last_index = 0
    for v, i in enumerated:
        if i > last_index:
            last_index = i
            output.append(v)
    return output

s1 = [0, 1, 2, 3, 10, 4, 5, 6]
s2 = [0, 1, 2, 3, -1, 4, 5, 6]
print(get_longest_monotonic_asc(s1))
print(get_longest_monotonic_asc(s2))
'''
Output:
[0, 1, 2, 3, 4, 5, 6]
[0, 1, 2, 3, 4, 5, 6]
'''
Note that this solution involves a sort, which is O(n log n), plus a second pass, which is O(n).
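For comparison, here is a sketch of the classic patience-sorting approach to the longest strictly increasing subsequence, also O(n log n) but without assuming the first element is correct (the function name is my own):

from bisect import bisect_left

def longest_increasing_subsequence(seq):
    tail_vals, tail_idx = [], []   # smallest tail value (and its index) for each run length
    parent = [-1] * len(seq)       # predecessor of each element in its best run
    for i, v in enumerate(seq):
        k = bisect_left(tail_vals, v)   # strictly increasing: replace the first tail >= v
        parent[i] = tail_idx[k - 1] if k else -1
        if k == len(tail_vals):
            tail_vals.append(v)
            tail_idx.append(i)
        else:
            tail_vals[k] = v
            tail_idx[k] = i
    # walk the parent links back from the tail of the longest run
    out, i = [], tail_idx[-1]
    while i != -1:
        out.append(seq[i])
        i = parent[i]
    return out[::-1]

print(longest_increasing_subsequence([0, 1, 2, 3, 10, 4, 5, 6]))  # [0, 1, 2, 3, 4, 5, 6]
print(longest_increasing_subsequence([0, 1, 2, 3, -1, 4, 5, 6]))  # [0, 1, 2, 3, 4, 5, 6]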
Here is a way to produce a monotonically increasing series:
import pandas as pd

# create data
s = pd.Series([1, 2, 3, 4, 5, 4, 3, 2, 3, 4, 5, 6, 7, 8])
# find the max so far (i.e., the running max)
df = pd.concat([s.rename('orig'),
                s.cummax().rename('running_max')],
               axis=1)
# are we at or above the max so far?
df['keep?'] = (df['orig'] >= df['running_max'])
# filter out one or many points below the max so far
df = df.loc[df['keep?'], 'orig']
# verify that the remaining points are monotonically increasing
assert pd.Index(df).is_monotonic_increasing
# print(df.drop_duplicates())  # eliminates ties
print(df)  # keeps ties
0     1
1     2
2     3
3     4
4     5
10    5    # <-- same as the previous value: a tie
11    6
12    7
13    8
Name: orig, dtype: int64
You can compare the two graphically with s.plot() and df.plot().

Downsample pandas data frame based on count column

I have thousands of data frames like the following, though much larger (1,000,000 rows, 100 columns).
data = pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
                     'count': [45, 66, 6, 6, 1, 432, 3],
                     'Value': ['Apple', 'Boy', 'Car', 'Corn', 'Anne', 'Barnes', 'Bayesian']})
I want to randomly sample from this data frame, using the count column as weights, and make a new data frame such that the sum of count equals N.
The relative proportions should stay approximately the same, no resampled count should exceed the original count, and the values in cols1 (or any other column except Value and count) should remain the same.
For example, if N was 50, it might look like:
pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
              'count': [4, 7, 1, 1, 0, 37, 0],
              'Value': ['Apple', 'Boy', 'Car', 'Corn', 'Anne', 'Barnes', 'Bayesian']})
How can this be done?
Efficiency is key, otherwise I could expand the data frame based on count and randomly sample without replacement, then merge it back together.
Thanks,
Jack
Using multinomial sampling, this is relatively easy.
import numpy as np

def downsample(df, N):
    prob = df['count'] / df['count'].sum()
    # a single multinomial draw of N items, spread across the rows by weight
    df['count'] = np.random.multinomial(n=N, pvals=prob)
    return df[df['count'] != 0]
For OP's example:
downsample(data, 50)
returns:
    Value  cols1  count
1     Boy      5      1
3    Corn      4     16
5  Barnes     32     33
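One caveat: a multinomial draw samples with replacement, so a resampled count can occasionally exceed the original one. If that constraint is strict, a sketch using NumPy's multivariate hypergeometric sampler (sampling without replacement, available on np.random.Generator) enforces the cap:

import numpy as np

def downsample_capped(df, N, seed=None):
    rng = np.random.default_rng(seed)
    out = df.copy()
    # draw N items without replacement, so no row can exceed its original count
    out['count'] = rng.multivariate_hypergeometric(df['count'].to_numpy(), N)
    return out[out['count'] != 0]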

How to remove values from a Python array, perform an operation on them and then replace them in the original array

I'm working with a huge dataset. What I want to do is take all values > 0 from the array and place them in a new array, run statistics on those extracted values and then place the new values back in the original array.
Suppose I have an array [0, 0, 0, 0, 0, ..., .32, .44, 0, 0, 0] (i.e. the object arr in the script below): I want to remove values such as .32, .44, etc., and put them in a new array arr2.
Then I want to do a statistical analysis (PCA) on this second array, take the new values corresponding with the original position in the original array and replace the original values with these new values. I've started coding this below, but have no idea how to extract values > 0 while maintaining the position in the array.
import os
import numpy as np
import numpy.linalg as npl
import nibabel as nib
import matplotlib.pyplot as plt
from matplotlib.mlab import PCA
#from dipy.io.image import load_nifti, save_nifti
np.set_printoptions(precision=4, suppress=True)
FA = './all_FA_skeletonised.nii'
from dipy.io.image import load_nifti
img = nib.load(FA)
data = img.get_data()
data.shape #get x,y,z and subject # parameters from image
#place subject number into a variable
vol_shape = data.shape[:-1] # x,y,z coordinates
n_vols = data.shape[-1] # 28 subjects volumes
# N is the num of voxels (dimensions) in a volume
N = np.prod(vol_shape)
#- Reshape first dimension of whole image data array to N, and take
#- transpose
arr2 = []
arr = data.reshape(N, n_vols).T # 28 X 7,200,000 array
for a in arr:
    if a > 0:
        arr2.append(a)
row_means = np.outer(np.mean(arr2, axis=1), np.ones(N))
X = arr2 - row_means # mean center data array
#- Calculate unscaled covariance matrix of X
unscaled_covariance = X.dot(X.T)
unscaled_covariance.shape
# Calculate U, S, VT with SVD on unscaled covariance matrix
U, S, VT = npl.svd(unscaled_covariance)
#- Use subplots to make axes to plot first 10 principal component
#- vectors
#- Plot one component vector per sub-plot.
fig, axes = plt.subplots(10, 1)
for i, ax in enumerate(axes):
    ax.plot(U[:, i])
#- Calculate scalar projections for projecting X onto U
#- Put results into array C.
C = U.T.dot(X)
#- Put values in C back into original data matrix
I would extract the wanted values together with their positions in the original array and store them in a dictionary as index_in_the_original_array: value_in_the_original_array pairs. Then I would do the calculations on the values in the dictionary. Since the indices are preserved as the dictionary's keys, we can replace the processed values in the original array afterwards. In code:
from pprint import pprint

original_array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Collecting all values & indices of the elements that are greater than 5:
my_dictionary = {index: value for index, value in enumerate(original_array) if value > 5}
pprint(original_array)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pprint(my_dictionary)   # {5: 6, 6: 7, 7: 8, 8: 9, 9: 10}
# Doing the processing (here just incrementing the values by 2):
my_dictionary = {key: value + 2 for key, value in my_dictionary.items()}
pprint(my_dictionary)   # {5: 8, 6: 9, 7: 10, 8: 11, 9: 12}
# Replacing the new values into the original array:
for key, value in my_dictionary.items():
    original_array[key] = value
pprint(original_array)  # [1, 2, 3, 4, 5, 8, 9, 10, 11, 12]
Update
If we want to avoid the use of a dictionary, we could do the following, which is basically the same as above.
import numpy as np

def process_data(data):
    return data * 5

original_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
new_array = np.array([[index, value] for index, value in enumerate(original_array) if value > 5])
print(new_array)  # [[ 5  6]
                  #  [ 6  7]
                  #  [ 7  8]
                  #  [ 8  9]
                  #  [ 9 10]]
# Doing the processing (here, just using the above function that multiplies the values by 5):
new_array[:, 1] = process_data(new_array[:, 1])
print(new_array)  # [[ 5 30]
                  #  [ 6 35]
                  #  [ 7 40]
                  #  [ 8 45]
                  #  [ 9 50]]
# Replacing the new values into the original array:
for indx, val in new_array:
    original_array[indx] = val
print(original_array)  # [ 1  2  3  4  5 30 35 40 45 50]
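A numpy-idiomatic sketch of the same round trip using a boolean mask, which avoids the explicit index bookkeeping (the > 5 threshold and the *5 processing step are just the examples from above):

import numpy as np

original_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
mask = original_array > 5               # remembers which positions were extracted
extracted = original_array[mask]        # the values > 5, in their original order
original_array[mask] = extracted * 5    # process and write back in place
print(original_array)  # [ 1  2  3  4  5 30 35 40 45 50]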
edit: got the question wrong (see comments) so here's an update.
Say we have a = [0, 0, 1, 2, 0, 3] and b = [.1, .1, .1], and we want to combine them to get [0, 0, .1, .1, 0, .1], i.e. 0 stays at the same indexes and all other values get substituted:
import numpy as np

b = np.array([.1, .1, .1])
a = np.array([0, 0, 1, 2, 0, 3], dtype='float64')  # expects the same dtype
np.place(a, a > 0, b)  # modifies a in place
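Printing a afterwards confirms the substitution:

print(a)  # [0.  0.  0.1 0.1 0.  0.1]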
Back up a before the np.place line if you need its original values.
previous version:
Not sure whether I got you right: assuming that by 'maintaining the position in the array' you mean, for example, that [0,0,1,2,0,3,0] should evaluate to [1,2,3] (instead of [1,3,2] or something else), you can do this with a[a != 0], where a is your array. If you only want to knock off leading/trailing zeros, try numpy.trim_zeros instead.
Things would be different for 2D arrays or matrices, as you'd need to keep their shape.
