I have a series of datasets output by a program, and my goal is to plot the average of the datasets as a line graph with pyplot/numpy. The problem is that I cannot control the length of the output datasets.
For example, I have four datasets of between 200 and 400 points each, with x values normalised to the range 0 to 1, and I want to calculate the median line for the four datasets.
All I can think of so far is to interpolate (linearly would be sufficient) to add extra data points to the shorter sequences, or somehow periodically remove values from the longer sequences. Does anyone have any suggestions?
At the moment I am importing with csv.reader and appending row by row to a list, so the output is a list of lists, each holding a set of x-y coordinates, which I think is effectively a 2D array.
I was actually thinking it may be easier to delete excess data points than to interpolate. For example, starting with four lists, I could remove unnecessary points along the x axis (since the values are normalised and increasing) and cull points whose step size is too small, using the step sizes of the shortest list as a reference. This explanation may not be very clear; I will try to write up an example and post it tomorrow.
An example data set would be
line1 = [[0.33, 2], [0.66, 5], [1, 5]]
line2 = [[0.25, 43], [0.5, 53], [0.75, 6.5], [1, 986]]
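As a rough sketch of the interpolation idea using the two example lines above: resample every dataset onto a shared x grid with np.interp and then average (or take the median) point-wise. The 100-point grid and the flat extrapolation outside each line's x range are arbitrary choices here.
import numpy as np

line1 = [[0.33, 2], [0.66, 5], [1, 5]]
line2 = [[0.25, 43], [0.5, 53], [0.75, 6.5], [1, 986]]

# Common, normalised x grid (the number of points is arbitrary).
common_x = np.linspace(0, 1, 100)

# Interpolate each dataset's y values onto the common grid.
resampled = []
for line in (line1, line2):
    xs, ys = zip(*line)
    resampled.append(np.interp(common_x, xs, ys))

# Point-wise mean across datasets; np.median would give the median line instead.
mean_y = np.mean(resampled, axis=0)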
The solution I used in the end was to interpolate as suggested above; a simplified version of the code is included below.
First the data is imported as a dictionary for ease of access and manipulation:
def average(files, newfile):
    import csv
    dict = {}
    ln = []
    max = 0
    for x in files:
        with open(x + '.csv', 'r', newline='') as file:
            reader = csv.reader(file, delimiter=',')
            l = []
            for y in reader:
                l.append(y)
            dict[x] = l
            ln.append(x)
Next the length of the longest data set is established:
    for y in ln:
        if max == 0:
            max = len(dict[y])
        elif len(dict[y]) >= max:
            max = len(dict[y])
Next the number of points to insert into each dataset is determined:
    for y in ln:
        dif = max - len(dict[y])
Finally the intermediate values are calculated by linear interpolation and inserted into the dataset:
        for i in range(dif):
            loc = int(i * len(dict[y]) / dif)
            if loc == 0:
                loc = 1
            new = [(float(dict[y][loc - 1][x]) + float(dict[y][loc][x])) / 2
                   for x in range(len(dict[y][loc]))]
            dict[y].insert(loc, new)
Then taking the average is very simple:
    avg = []
    for x in range(len(dict[ln[0]])):
        # mean x and mean y across all datasets
        t = [sum(float(dict[u][x][0]) for u in ln) / len(ln),
             sum(float(dict[u][x][1]) for u in ln) / len(ln)]
        avg.append(t)
I'm not saying it's pretty code, but it does what I needed it to...
Related
I am new to Python. I am enumerating through a large list of data, as shown below, and would like to find the mean of every line.
for index, line in enumerate(data):
    # calculate the mean
However, the lines of this particular data set look like this:
[array([[2.3325655e-10, 2.4973504e-10],
        [1.3025138e-10, 1.3025231e-10]], dtype=float32)]
I would like to find the mean of both 2x1s separately, then the average of both means, so it outputs a single number.
Thanks in advance.
You probably do not need to enumerate through the list to achieve what you want. You can do it in two steps using list comprehension.
For example,
data = [[2.3325655e-10, 2.4973504e-10],
[1.3025138e-10, 1.3025231e-10]]
# Calculate the average for 2X1s or each row
avgs_along_x = [sum(line)/len(line) for line in data]
# Calculate the average along y
avg_along_y = sum(avgs_along_x)/len(avgs_along_x)
There are other ways to calculate the mean of a list in Python, for example with the standard-library statistics module.
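For instance, a minimal sketch of the same two-step calculation using statistics.mean, without writing the division by hand:
from statistics import mean

data = [[2.3325655e-10, 2.4973504e-10],
        [1.3025138e-10, 1.3025231e-10]]

row_means = [mean(row) for row in data]  # mean of each row
overall_mean = mean(row_means)           # single number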
If you are using numpy this can be done in one line.
import numpy as np
np.average(data, 1) # calculate the mean along x-axes denoted as 1
# To get what you want, we can pass tuples of axes.
np.average(data, (1,0))
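For the example values above, this gives roughly the following (approximate values shown in the comments):
import numpy as np

data = [[2.3325655e-10, 2.4973504e-10],
        [1.3025138e-10, 1.3025231e-10]]

np.average(data, 1)       # array([~2.4150e-10, ~1.3025e-10]), the per-row means
np.average(data, (1, 0))  # ~1.8587e-10, a single overall mean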
I have a huge dataframe with a lot of zero values, and I want to calculate the average of the numbers between the zero values. To keep it simple: the data shows, for example, 10 consecutive values, then a run of zeros, then values again. I just want to tell Python to calculate the average of each patch of the data.
The picture shows an example.
First of all, I'm a little bit confused about why you are using a DataFrame; this kind of data is more likely stored in a pd.Series, and I would suggest storing numeric data in a numpy array. Assuming that you have a pd.Series in front of you and you are trying to calculate the moving average between two consecutive points, there are two approaches you can follow:
zero-padding for the last element;
assuming circularity and taking the average between the first and the last value.
Here is the corresponding code:
import numpy as np
import pandas as pd

data_series = pd.Series([0, 0, 0.76231, 0.77669, 0, 0, 0, 0, 0, 0, 0, 0,
                         0.66772, 1.37964, 2.11833, 2.29178, 0, 0, 0, 0, 0])
np_array = np.array(data_series)

# assuming zero-padding
np_array_zero_pad = np.hstack((np_array, 0))
mvavrg_zeropad = [np.mean([np_array_zero_pad[i], np_array_zero_pad[i + 1]])
                  for i in range(len(np_array_zero_pad) - 1)]

# assuming circularity: the last value is paired with the first one
np_array_circ = np.hstack((np_array, np_array[0]))
mvavrg_circ = [np.mean([np_array_circ[i], np_array_circ[i + 1]])
               for i in range(len(np_array_circ) - 1)]
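If instead the goal is the average of each separate non-zero patch, as the question describes, one possible sketch with pandas groupby on the same example series:
import pandas as pd

s = pd.Series([0, 0, 0.76231, 0.77669, 0, 0, 0, 0, 0, 0, 0, 0,
               0.66772, 1.37964, 2.11833, 2.29178, 0, 0, 0, 0, 0])

nonzero = s != 0
# Give each contiguous run of values its own label: the counter increments
# every time the zero/non-zero state changes.
patch_id = (nonzero != nonzero.shift()).cumsum()
patch_means = s[nonzero].groupby(patch_id[nonzero]).mean()
print(patch_means)  # one average per non-zero patch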
I read a multidimensional array from a netCDF file.
The variable that I need to plot is named "em", and it has 4 dimensions: em(years, group, lat, lon).
The "group" dimension has 2 values; I am interested only in the first one.
So the only dimension that I need to manage is "years", which has 17 values. For the first plot I need to average the first 5 years, and for the second plot I have to average from the 6th year to the last year.
data = Dataset (r'D:\\Users\\file.nc')
lat = data.variables['lat'][:]
lon = data.variables['lon'][:]
year = data.variables['label'][:]
group = data.variables['group'][:]
em= data.variables['em'][:]
How can I create a 2-dimensional array by averaging over this array?
First one:
em = data.variables['em'][0:4][0][:][:]
Second one:
em = data.variables['em'][5:16][0][:][:]
I created a simple loop:
nyear = (2005 - 2000) + 1
for i in range(nyear):
    em_1 = data.variables['em'][i][0][:][:]
    em_1 += em_1
em_2000_2005 = em_1 / nyear
but I think there could be a more elegant way to do this in Python.
I would highly recommend using xarray for working with NetCDF files. Rather than keeping track of indices positionally, you can operate on them by name which greatly improves code readability. In your example all you would need to do is
import xarray as xr
ds = xr.open_dataset(r'D:\\Users\\file.nc')
em_mean1 = ds.em.isel(label = range(6,18)).mean()
em_mean2 = ds.em.isel(label = range(6)).mean()
The .isel method selects indices along the specified dimension (label in this case), and the .mean() method computes the average over the selection.
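If the goal is the two lat/lon maps described in the question (first 5 years vs. the 6th year onwards, first group only), a sketch along these lines keeps lat and lon as dimensions; the dimension names label and group are taken from the question's variable listing and may differ in the actual file:
import xarray as xr

ds = xr.open_dataset(r'D:\\Users\\file.nc')
# Select the first group, then average over the year dimension only,
# leaving a 2D (lat, lon) array for each plot.
em_map1 = ds.em.isel(group=0, label=range(0, 5)).mean(dim='label')    # first 5 years
em_map2 = ds.em.isel(group=0, label=range(5, 17)).mean(dim='label')   # 6th year to the last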
You can use NumPy:
em = data.variables['em'][:]
em_mean = np.mean(em, axis=0)  # average over the whole first dimension
If the data contains NaNs, just use NumPy's nanmean instead.
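For example, with em loaded as above:
em_mean = np.nanmean(em, axis=0)  # same as np.mean, but NaN entries are ignored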
As you want to average the first 5 years, for the first case use:
em_mean1 = np.squeeze(np.mean(em[0:5, :], axis=0))
and take for the plot (first group only):
em_mean1 = np.squeeze(em_mean1[0, :])
You can do similar for the second case.
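A sketch of the second case (6th year to the 17th), following the same pattern:
em_mean2 = np.squeeze(np.mean(em[5:17, :], axis=0))
em_mean2 = np.squeeze(em_mean2[0, :])  # first group only, for the plot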
I have a data set (CSV file) with three separate columns. Column 0 is the signal time, column 1 is the frequency, and column 2 is the intensity. There is a lot of noise in the data that can be sorted out by finding the variance of each signal frequency: if it is < 2332 then it is the right frequency. Hence, this is the data I want to calculate linear/polynomial regression on (p.s. I have to calculate the linear fit manually). The nested for-loop decision structure I have isn't currently working. Any solutions would be helpful, thanks!
data = csv.reader(file1)
sort = sorted(data, key=operator.itemgetter(1))  # sorted by the frequencies

for row in sort:
    x.append(float(row[0]))
    y.append(float(row[2]))
    frequencies.append(float(row[1]))

for i in range(499):
    freq_dict.update({frequencies[i]: [x[i], y[i]]})

for key in freq_dict.items():
    for row in sort:
        if key == float(row[1]):
            a.append(float(row[1]))
            b.append(float(row[2]))
            c.append(float(row[0]))
        else:
            num = np.var(a)
            if num < 2332.0:
                linearRegression(c, b, linear)
                print('yo')
                polyRegression(c, b, d, linear, py)
                mplot.plot(linear, py)
            else:
                a = []
                b = []
                c = []
Variances of 2332 or less are the frequencies I need.
I used a range of 499 because that is the length of my data set. Also, I tried to clear the lists (a, b, c) when the frequency wasn't correct.
There are several issues I see going on. I am unsure why you sort your data if you already know the exact values you are looking for, and I am unsure why you split the data into separate variables. The double for loop means that you repeat everything in sort for every single key in freq_dict; I am not sure that was your intention. Also, freq_dict.items() produces tuples (key, value pairs), so your key is a tuple and will never equal a float. Anyway, here is an attempt to rewrite some of the code.
import csv, numpy
import matplotlib.pyplot as plt
from scipy import stats

data = csv.reader(file1)  # Read file.
# Keep only rows whose frequency column meets the condition (values arrive as strings).
f_data = filter(lambda row: float(row[1]) < 2332.0, data)
x, _, y = zip(*f_data)  # Split data down columns (time, frequency, intensity).
x = numpy.array(x, dtype=float)
y = numpy.array(y, dtype=float)
# Standard linear stats function.
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
# Plot the data and the fit line.
plt.scatter(x, y)
plt.plot(x, x * slope + intercept)
plt.show()
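Since the question mentions that the linear fit has to be calculated manually, the closed-form least-squares formulas can stand in for stats.linregress. A minimal sketch (manual_linregress is just an illustrative name), assuming x and y are equal-length numeric sequences:
def manual_linregress(x, y):
    # Least-squares fit of y = slope * x + intercept, computed by hand.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept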
A solution closer to the original approach used the correlation coefficient of the lists instead; in a similar style it looked as follows:
for key, value in freq_dict.items():  # 1487 keys
    for row in sort:  # when row moves on to a new frequency it would compute corrcoef of an empty list
        if key == float(row[1]):  # 1487
            a.append(float(row[2]))
            b.append(float(row[0]))
        elif key != float(row[1]):
            if a:
                num = np.corrcoef(b, a)[0, 1]
                if (num < somenumber).any():
                    # do stuff
                    pass
            a = []  # clear the lists and reset the number
            b = []
            num = 0
I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
            a         b
1    0.277438  0.042671
..        ...       ...
499  0.570952  0.865869

[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
      a   b
0    99  99
..   ..  ..
499  58  84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({
    integer: pd.expanding_quantile(df.T.unstack(), integer / 100.0)
    for integer in range(0, 101, 1)})

percentile_mask = pd.Series(index=df.unstack().unstack().unstack().index)

for integer in range(0, 100, 1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer + 1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse-quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast at dealing with mostly sorted data (and at inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data and, on each row iteration, add the new values to it and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
    """Reconstruct a DataFrame of expanding quantiles by row."""
    # Construct the skeleton of the DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.nan, index=df.index, columns=df.columns)
    # Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)
    # We want to maintain that sorted_array[:length] has data and is sorted
    length = 0
    # Iterate over ndarray rows
    for i, row_array in enumerate(df.values):
        # Extract non-NaN numpy array from row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]
        # Add new data to our sorted_array and sort.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")
        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length
        # Insert values into quantile_df
        quantile_df.iloc[i, ~row_is_nan] = quantile_row
    return quantile_df
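For reference, a minimal usage sketch on a frame like the one in the question (assuming numpy and pandas are imported as np and pd):
df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])
quantile_df = quantiles_by_row(df)
print(quantile_df.tail())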
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right', which determine whether you want the prospective inserted position to be the first or the last suitable position possible. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
It's not yet quite clear, but do you want a cumulative sum divided by the total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b
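i.e., something along these lines for b (cum_b is just an illustrative column name):
norm_b = 100.0 / df.b.sum()
df['cum_b'] = df.b.cumsum() * norm_b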
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezing seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    # * 2 should be * N if you have N columns
    combined = preceding_rows.values.reshape((1, len(preceding_rows) * 2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'a'],
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'b'],
        kind='weak'
    )