I'm working with CSV files.
I'd like to create a continuously updated average of a sequence, i.e. to output the running average at each successive value of a list. For example:
list: [a, b, c, d, e, f]
formula:
(a)/1 = ?
(a+b)/2 = ?
(a+b+c)/3 = ?
(a+b+c+d)/4 = ?
(a+b+c+d+e)/5 = ?
(a+b+c+d+e+f)/6 = ?
To demonstrate:
if I have the list [1, 4, 7, 4, 19]
my output should be [1, 2.5, 4, 4, 7]
explained:
(1)/1=1
(1+4)/2=2.5
(1+4+7)/3=4
(1+4+7+4)/4=4
(1+4+7+4+19)/5=7
As for my Python file, it's simple code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('somecsvfile.csv')
x = [] #has to be a list of 1 to however many rows are in the "numbers" column, will be a simple [1, 2, 3, 4, 5] etc...
#x will be used to divide the numbers selected in y to give us z
y = df['numbers']
z = #new dataframe derived from the continuous average of y
plt.plot(x, z)
plt.show()
If numpy is needed that is no problem.
pandas.DataFrame.expanding is what you need.
Using it you can just call df.expanding().mean() to get the result you want:
mean = df.expanding().mean()
print(mean)
Out[10]:
0    1.0
1    2.5
2    4.0
3    4.0
4    7.0
If you want to do it just in one column, use pandas.Series.expanding.
Just use the column instead of df:
df['column_name'].expanding().mean()
You can use cumsum to get the cumulative sum and then divide by the running element count to get the running average:
import numpy as np

x = np.array([1, 4, 7, 4, 19])
z = np.cumsum(x) / np.arange(1, len(x) + 1)
print(z)
output:
[1. 2.5 4. 4. 7. ]
To give a complete answer to your question, filling in the blanks of your code using numpy and plotting:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
#df = pd.read_csv('somecsvfile.csv')
#instead I just create a df with a column named 'numbers'
df = pd.DataFrame([1, 4, 7, 4, 19], columns = ['numbers',])
x = range(1, len(df)+1) #x will be used to divide the numbers selected in y to give us z
y = df['numbers']
z = np.cumsum(y) / np.array(x)
plt.plot(x, z, 'o')
plt.xticks(x)
plt.xlabel('Entry')
plt.ylabel('Cumulative average')
plt.show()
But as pointed out by Augusto, you can also just put the whole thing into a DataFrame. Adding a bit more to his approach:
n = [1, 4, 7, 4, 19]
df = pd.DataFrame(n, columns = ['numbers',])
#shift the index so it starts at 1, as you want
df.index = np.arange(1, len(df)+1)
# create a new column for the cumulative average
df = df.assign(cum_avg = df['numbers'].expanding().mean())
#    numbers  cum_avg
# 1        1      1.0
# 2        4      2.5
# 3        7      4.0
# 4        4      4.0
# 5       19      7.0
# plot
df['cum_avg'].plot(linestyle = 'none',
                   marker = 'o',
                   xticks = df.index,
                   xlabel = 'Entry',
                   ylabel = 'Cumulative average')
Related
I have a large df with coordinates in multiple dimensions. I am trying to create classes (objects) based on a threshold difference between the coordinates. An example df is below:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
Based on this df I want to assign each row to a group using a threshold of ±2 across all coordinates, so the df will have a unique group name added to each row. The output of this threshold function is:
x  y   z   group
1  10  7   -
2  14  6   -
3  5   2   G1
4  14  43  -
5  3   1   G1
6  12  40  -
It is similar to clustering, but I want to use my own threshold functions. How can this be done in Python?
EDIT
To clarify: the threshold is based on similar coordinates. All rows within ± the threshold across all coordinates will be grouped as a single object. It can also be seen as grouping rows based on a threshold across all columns and assigning a unique label to each group.
As far as I understood, what you need is the apply function. It was not very clear from your statement whether you need all pairwise differences between the coordinates or just the neighbouring differences (x-y and y-z). Row 5 has a difference of 4 between its x and z coordinates, yet is still assigned to class G1.
That's why I wrote it for the two possibilities and you can just choose which one you need more:
import pandas as pd
import numpy as np
def your_specific_function(row):
    '''
    For all differences use this:
    diffs = np.array([abs(row.x-row.y), abs(row.y-row.z), abs(row.x-row.z)])
    '''
    # for only x - y, y - z use this:
    diffs = np.diff(row)
    statement = all(diffs <= 2)
    if statement:
        return 'G1'
    else:
        return '-'
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
df['group'] = df.apply(your_specific_function, axis = 1)
print(df.head())
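For the example df this should print something like the following (head() shows the first five rows, matching the expected grouping from the question):
   x   y   z group
0  1  10   7     -
1  2  14   6     -
2  3   5   2    G1
3  4  14  43     -
4  5   3   1    G1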
How can I efficiently compute the rolling mean at fixed intervals?
import numpy as np
import pandas as pd
n=50
s = pd.Series(data = np.random.randint(0,10,n), index = pd.date_range(pd.to_datetime('today').floor('D'), freq='D', periods = n))
E.g. in the series above, with an interval of 4 days and 3 elements per window, the i-th element of the new series will be s'_i = ( s_(i-4) + s_(i-4*2) + s_(i-4*3) ) / 3.
Have you checked out pandas.DataFrame.rolling? It might have what you're looking for.
If I understand correctly, here is an example with an array of 1 to 50:
interval = 4
window = 3
data = np.linspace(1,50,50)
arr = pd.Series(np.array(data)[::interval]) #subset data by every 4th value
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window) #look forward 3 spaces on every 4th value
arr.rolling(indexer).mean() #take the mean of the window
The output would be an array [5, 9, 13, 17, ...], with 5 the average of 1, 5, and 9, and 9 the average of 5, 9, and 13.
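If you need the backward-looking version exactly as in the formula above, a minimal sketch with Series.shift, reusing the series s from the question (positions where a lag runs off the start of the series come out as NaN):
interval, window = 4, 3
# mean of s[i-4], s[i-4*2], s[i-4*3] at each position i
t = sum(s.shift(interval * k) for k in range(1, window + 1)) / window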
Let's say I have the following dataset:
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [1, 2, 3, 2, 3, 4, 3, 4, 5]
import pandas as pd
df = pd.DataFrame({'x':x,'y':y}) #dataframe to work with
which, plotted using matplotlib's scatter, looks like this.
I would like to select the bottom three points using pandas, without iterating over the rows of my dataframe (for speed on a large dataframe), and without simply selecting the 1st, 4th and 7th points of the dataframe:
I tried selecting based on a condition:
selected_df = df.loc[df["y"] <=3] #selects an extra point at x=1,y=2
This selects an extra point which I don't want. I also tried building two lists of values representing a line that separates the bottom points from others:
x_line = [1,2,3]
y_line = [1.5, 2.5, 3.5]
selected_df = df.loc[df["y"] <=y_line ] #y_line is a list, doesn't work
Unfortunately, I also must not solve it by filling y_line with more points to make it the same size as df["y"].
Can anyone please show me how to select the bottom points, preferably using DataFrame functions such as df.where or a condition? I would appreciate it very much.
IIUC, what you're essentially looking for is the lowest y for each x, so you can phrase this as a groupby problem:
>>> selected_df = df.groupby("x", as_index=False).y.min()
>>> selected_df
   x  y
0  1  1
1  2  2
2  3  3
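If you want the selected rows themselves (keeping their original indices) rather than a new frame of (x, y) pairs, a small variation with idxmin:
>>> df.loc[df.groupby("x")["y"].idxmin()]
   x  y
0  1  1
3  2  2
6  3  3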
I'm working with a huge dataset. What I want to do is take all values > 0 from the array, place them in a new array, run statistics on those extracted values, and then place the new values back in the original array.
Suppose I have an array [0,0,0,0,0, . . . .32, .44,0,0,0] (i.e. the object arr in the script below): I want to remove the values such as .32, .44, etc., and put them in a new array arr2.
Then I want to do a statistical analysis (PCA) on this second array, take the new values corresponding to the original positions in the original array, and replace the original values with them. I've started coding this below, but have no idea how to extract the values > 0 while maintaining their positions in the array.
import os
import nibabel as nib
import numpy as np
import numpy.linalg as npl
import matplotlib.pyplot as plt
from matplotlib.mlab import PCA
#from dipy.io.image import load_nifti, save_nifti
np.set_printoptions(precision=4, suppress=True)
FA = './all_FA_skeletonised.nii'
from dipy.io.image import load_nifti
img = nib.load(FA)
data = img.get_data()
data.shape #get x,y,z and subject # parameters from image
#place subject number into a variable
vol_shape = data.shape[:-1] # x,y,z coordinates
n_vols = data.shape[-1] # 28 subjects volumes
# N is the num of voxels (dimensions) in a volume
N = np.prod(vol_shape)
#- Reshape first dimension of whole image data array to N, and take
#- transpose
arr2 = []
arr = data.reshape(N, n_vols).T # 28 X 7,200,000 array
for a in arr:
    if a > 0:
        arr2.append(a)
row_means = np.outer(np.mean(arr2, axis=1), np.ones(N))
X = arr2 - row_means # mean center data array
#- Calculate unscaled covariance matrix of X
unscaled_covariance = X.dot(X.T)
unscaled_covariance.shape
# Calculate U, S, VT with SVD on unscaled covariance matrix
U, S, VT = npl.svd(unscaled_covariance)
#- Use subplots to make axes to plot first 10 principal component
#- vectors
#- Plot one component vector per sub-plot.
fig, axes = plt.subplots(10, 1)
for i, ax in enumerate(axes):
    ax.plot(U[:, i])
#- Calculate scalar projections for projecting X onto U
#- Put results into array C.
C = U.T.dot(X)
#- Put values in C back into original data matrix
I would extract the wanted values along with their positions in the original array and store them in a dictionary as index_in_the_original_array: value_in_the_original_array. Then I would do the calculations on the values in the dictionary. Finally, we still have the indices preserved (as keys in the dictionary) for placing the values back in the original array. In code:
from pprint import pprint
original_array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Collecting all values & indices of the elements that are greater than 5:
my_dictionary = {index: value for index, value in enumerate(original_array) if value > 5}
pprint(original_array) # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pprint(my_dictionary) # {5: 6, 6: 7, 7: 8, 8: 9, 9: 10}
# doing the processing (Here just incrementing the values by 2):
my_dictionary = {key: my_dictionary[key] + 2 for key in my_dictionary.keys()}
pprint(my_dictionary) # {5: 8, 6: 9, 7: 10, 8: 11, 9: 12}
# Replacing the new values into the original array:
for key in my_dictionary.keys():
    original_array[key] = my_dictionary[key]
pprint(original_array) # [1, 2, 3, 4, 5, 8, 9, 10, 11, 12]
Update
If we want to avoid the use of a dictionary, we could do the following, which is basically the same as above.
import numpy as np
def process_data(data):
    return data * 5
original_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
new_array = np.array([[index, value] for index, value in enumerate(original_array) if value > 5])
print(new_array) # [[ 5 6]
# [ 6 7]
# [ 7 8]
# [ 8 9]
# [ 9 10]]
# doing the processing (Here, just using the above function that multiplies the values by 5):
new_array[:, 1] = process_data(new_array[:, 1])
print(new_array) # [[ 5 30]
# [ 6 35]
# [ 7 40]
# [ 8 45]
# [ 9 50]]
# Replacing the new values into the original array:
for indx, val in new_array:
    original_array[indx] = val
print(original_array) # [ 1 2 3 4 5 30 35 40 45 50]
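As an aside, that final write-back loop can be replaced by a single fancy-indexing assignment, a sketch reusing the arrays above:
# column 0 holds the indices, column 1 the processed values
original_array[new_array[:, 0]] = new_array[:, 1]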
edit: got the question wrong (see comments) so here's an update.
Say we have a=[0,0,1,2,0,3] and b=[.1, .1, .1] and want to combine them to get [0, 0, .1, .1, 0, .1], i.e. zeros remain at the same indexes and all other values get substituted:
import numpy as np
b = np.array([.1, .1, .1])
a = np.array([0,0,1,2,0,3], dtype='float64') # expects same dtype
np.place(a, a>0, b) # modify in place
Back up a before the np.place line if you need its original values.
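The same round trip can also be written with a boolean mask, which matches the extract-process-replace workflow in the question (a sketch; the * 0.1 stands in for whatever statistics you actually run):
import numpy as np

a = np.array([0, 0, 1, 2, 0, 3], dtype='float64')
mask = a > 0               # remembers where the nonzero values live
extracted = a[mask]        # array([1., 2., 3.]) -- run your analysis on this
a[mask] = extracted * 0.1  # write the processed values back into place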
previous version:
Not sure whether I got you right; assuming that by 'maintaining the position in the array' you mean, for example, that [0,0,1,2,0,3,0] should evaluate to [1,2,3] (instead of [1,3,2] or something else), you can do this with a[a != 0], where a is your array. If you only want to knock off leading/trailing zeros, try numpy.trim_zeros instead.
Things will be different if the input is a 2D array or matrix, as you'll need to keep it in shape.
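For reference, a quick sketch of both options on a 1-D array:
import numpy as np

a = np.array([0, 0, 1, 2, 0, 3, 0])
print(a[a != 0])         # [1 2 3] -- drops every zero, keeps order
print(np.trim_zeros(a))  # [1 2 0 3] -- strips only leading/trailing zeros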
I'm not quite sure how to say this so I'll try to be clear in my description.
Right now I have a 3D numpy array where the 1st column represents a depth and the 2nd a position on the x-axis. My goal is to make a pcolor where the columns are spread out along the x-axis based on the values in a 1D float array.
Here's where it gets tricky: I only have the relative distances between points, that is, the distance between column 1 and column 2, and so on.
Here's an example of what I have and what I'd like:
darray = [[2 3 7 7]
[4 8 2 3]
[6 1 9 5]
[3 4 8 4]]
posarray = [ 3.767, 1.85, 0.762]
DesiredArray = [[2 0 0 0 3 0 7 7]
[4 0 0 0 8 0 2 3]
[6 0 0 0 1 0 9 5]
[3 0 0 0 4 0 8 4]]
How I tried implementing it:
def space_set(darr, sarr):
    spaced = np.zeros((260, 1 + int(sum(sarr))), dtype = float)
    x = 0
    for point in range(len(sarr)):
        spaced[:, x] = darr[:, point]
        x = int(sum(sarr[0:point]))
    spaced[:, -1] = darr[:, -1]
Then I was planning on using matplotlib's pcolor to plot it. This method seems to lose columns, though. Any ideas for either plotting directly or building a numpy array to plot? Thanks in advance.
Here's an example of what I'm looking for.
Since there is so much whitespace, perhaps it would be easier to draw the Rectangles, rather than use pcolor. As a bonus, you can place the rectangles exactly where you want them, rather than having to "snap" them to an integer-valued grid. And, you do not have to allocate space for a larger 2D array mainly filled with zeros. (In your case the memory required is probably measly, but the idea does not scale well, so it is nice if we can avoid doing that.)
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as patches
import matplotlib.cm as cm
def draw_rect(x, y, z):
    rect = patches.Rectangle((x, y), 1, 1, color = jet(z))
    ax.add_patch(rect)
jet = plt.get_cmap('jet')
fig = plt.figure()
ax = fig.add_subplot(111)
darray = np.array([[2, 3, 7, 7],
[4, 8, 2, 3],
[6, 1, 9, 5],
[3, 4, 8, 4]], dtype = 'float')
darray_norm = darray/darray.max()
posarray = [3.767, 1.85, 0.762]
x = np.cumsum(np.hstack((0, np.array(posarray)+1)))
for j, i in np.ndindex(darray.shape):
    draw_rect(x[j], i, darray_norm[i, j])
ax.set_xlim(x.min(),x.max()+1)
ax.set_ylim(0,len(darray))
ax.invert_yaxis()
m = cm.ScalarMappable(cmap = jet)
m.set_array(darray)
plt.colorbar(m)
plt.show()
yields a plot of the four data columns drawn as colored unit squares, spaced along the x-axis according to posarray, with a jet colorbar.