Calculating an average for every X number of lines - python

I am trying to take data from a text file and calculate an average for every 600 lines of that file. I'm loading the text from the file, putting it into a numpy array and enumerating it. I can get the average for the first 600 lines but I'm not sure how to write a loop so that python calculates an average for every 600 lines and then puts this into a new text file. Here is my code so far:
import numpy as np
#loads file and places it in array
data = np.loadtxt('244UTZ10htz.txt', delimiter = '\t', skiprows = 2)
shape = np.shape(data)
#creates array for u wind values
for i,d in enumerate(data):
    data[i] = (d[3])
    if i == 600:
        minavg = np.mean(data[i == 600])
#finds total u mean for day
ubar = np.mean(data)

Based on what I understand from your question, it sounds like you have a file and want to take the mean of each block of 600 lines, repeating until there is no more data: at line 600 you average lines 0 - 600, at line 1200 you average lines 600 to 1200, and so on.
Modulus division would be one approach to taking the average when you hit every 600th line, without having to use a separate variable to keep count of how many lines you've looped through. Additionally, I used Numpy Array Slicing to create a view of the original data, containing only the 4th column of the data set.
This example should do what you want, but it is entirely untested... I'm also not terribly familiar with numpy, so there may be better ways to do this, as mentioned in the other answers:
import numpy as np
#loads file and places it in array
data = np.loadtxt('244UTZ10htz.txt', delimiter = '\t', skiprows = 2)
shape = np.shape(data)
data_you_want = data[:,3]
daily_averages = list()
#creates array for u wind values
for i,d in enumerate(data_you_want):
    # skip i == 0, where the slice below would be empty
    if i != 0 and (i % 600) == 0:
        avg_for_day = np.mean(data_you_want[i - 600:i])
        daily_averages.append(avg_for_day)
You can either modify the example above to write the mean out to a new file, instead of appending to a list as I have done, or just write the daily_averages list out to whatever file you want.
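For instance, a minimal sketch of the file-writing variant, assuming the loop above has already filled daily_averages (the output filename is just a placeholder):
# one average per line; np.savetxt accepts the plain Python list
np.savetxt('daily_averages.txt', daily_averages)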
As a bonus, here is a Python solution using only the CSV library. It hasn't been tested much, but theoretically should work and might be fairly easy to understand for someone new to Python.
import csv
data = list()
daily_average = list()
num_lines = 600
with open('testme.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter="\t")
    for i,row in enumerate(reader):
        if (i % num_lines) == 0 and i != 0:
            average = sum(data[i - num_lines:i]) / num_lines
            daily_average.append(average)
        # float() rather than int(), since the wind values need not be whole numbers
        data.append(float(row[3]))
Hope this helps!

A simple solution would be:
import numpy as np
data = np.loadtxt('244UTZ10htz.txt', delimiter = '\t', skiprows = 2)
mydata = []
counter = 0
for i,d in enumerate(data):
    mydata.append(d[3])
    counter += 1
    # Find the average of the previous 600 lines
    if counter == 600:
        minavg = np.mean(np.asarray(mydata))
        # reset the counter and start counting from 0
        counter = 0
        mydata = []

The following program uses array slicing to get the column, and then a list comprehension indexing into the column to get the means. It might be simpler to use a for loop for the latter.
Slicing / indexing into the array rather than creating new objects also has the advantage of speed, as you're just creating new views into existing data (see the small check after the example output below).
import numpy as np
# test data
nr = 11
nc = 3
a = np.array([np.array(range(nc)) + i*10 for i in range(nr)])
print(a)
# slice to get column
col = a[:,1]
print(col)
# comprehension to step through column to get means
numpermean = 2
means = [np.mean(col[i:min(len(col), i + numpermean)])
         for i in range(0, len(col), numpermean)]
print(means)
It prints:
[[ 0 1 2]
[ 10 11 12]
[ 20 21 22]
[ 30 31 32]
[ 40 41 42]
[ 50 51 52]
[ 60 61 62]
[ 70 71 72]
[ 80 81 82]
[ 90 91 92]
[100 101 102]]
[ 1 11 21 31 41 51 61 71 81 91 101]
[6.0, 26.0, 46.0, 66.0, 86.0, 101.0]
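As a quick aside, here is a small check (illustrative, not from the original answer) confirming that basic slicing returns a view, so writes through the slice are visible in the parent array:
import numpy as np

a = np.zeros((3, 3))
col = a[:, 1]     # basic slice: a view, not a copy
col[0] = 99
print(a[0, 1])    # 99.0 -- the parent array sees the change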

Something like this works. Maybe not that readable, but it should be fairly fast.
# number of complete 600-line blocks
n = data.shape[0] // 600
interestingData = data[:,3]
# reshape the first 600*n values into (n, 600) and average each row
daily_averages = np.mean(interestingData[:600*n].reshape(-1, 600), axis=1)
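Note that this drops any rows after the last complete block of 600. If you also want a (shorter) average over the leftover tail, a possible extension:
# hedged extension: append the mean of the remaining rows, if any
if interestingData.shape[0] % 600:
    daily_averages = np.append(daily_averages, interestingData[600*n:].mean())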

Related

Calculating the Difference in values in a dataframe

I have a dataframe that looks like this:
index Rod_1 label
0 [[1.94559799] [1.94498416] [1.94618273] ... [1.8941952 ] [1.89461277] [1.89435902]] F0
1 [[1.94129488] [1.94268905] [1.94327065] ... [1.93593512] [1.93689935] [1.93802091]] F0
2 [[1.94034818] [1.93996006] [1.93940095] ... [1.92700882] [1.92514855] [1.92449449]] F0
3 [[1.95784532] [1.96333782] [1.96036528] ... [1.94958261] [1.95199495] [1.95308231]] F2
Each cell in the Rod_1 column has an array of 12 million values. I'm trying to calculate the difference between every two consecutive values in this array to remove seasonality; that way my model may perform better.
This is the code that I've written:
interval = 1
for j in range(0, len(df_all['Rod_1'])):
    for i in range(1, len(df_all['Rod_1'][0])):
        df_all['Rod_1'][j][i - interval] = df_all['Rod_1'][j][i] - df_all['Rod_1'][j][i - interval]
I have 45 rows, and as I said each cell has 12 million values, so it takes my laptop 20 minutes to calculate this. Is there a faster way to do it?
Thanks in advance.
This should be much faster. I've tested up to 1M elements per cell for 10 rows, which took 1.5 seconds to calculate the diffs (but a lot longer to make the test table):
import pandas as pd
import numpy as np
import time

#Create test data
np.random.seed(1)
num_rows = 10
rod1_array_lens = 5 #I tried with this at 1000000
possible_labels = ['F0','F1']
df = pd.DataFrame({
    'Rod_1':[[[np.random.randint(10)] for _ in range(rod1_array_lens)] for _ in range(num_rows)],
    'label':np.random.choice(possible_labels, num_rows)
})

#flatten Rod_1 from [[1],[2],[3]] --> [1,2,3]
#then use np.roll to make the diffs, throwing away the last element since it rolls over
start = time.time() #starting timing now
df['flat_Rod_1'] = df['Rod_1'].apply(lambda v: np.array([z for x in v for z in x]))
df['diffs'] = df['flat_Rod_1'].apply(lambda v: (np.roll(v,-1)-v)[:-1])
print('Took', time.time()-start, 'to calculate diff')
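As a side note, numpy already has a built-in for consecutive differences, so the flatten-and-roll steps above could also be written with ravel and np.diff (a small equivalent sketch using the same df):
# np.asarray(...).ravel() flattens the (n, 1) nested lists; np.diff gives v[1:] - v[:-1]
df['flat_Rod_1'] = df['Rod_1'].apply(lambda v: np.asarray(v).ravel())
df['diffs'] = df['flat_Rod_1'].apply(np.diff)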

Create a data frame out of input_t (which is actually a few numbers that act as features) and output_t as output

This is my code:
output_data = []
out = ''
i = 0
P = 500
X = 40000
while i < 600:
    subVals = values[i:i+X]
    signal = subVals.val1
    signal, rpeaks = biosppy.signals.ecg.ecg(signal, show=False)[1:3]
    rpeaks = rpeaks.tolist()
    nni = tools.nn_intervals(rpeaks)
    fre = fd.welch_psd(nni)
    tm = td.nni_parameters(nni)
    f1 = fre['fft_peak']
    t1 = tm['nni_min']
    f11 = np.asarray(f1)
    t11 = np.asarray(t1)
    input_t = np.append(f11, t11)
    output_t = subVals.BLEEDING
    output_t = int(round(np.mean(output_t)))
    i += P
As you can see, we are in a loop, and the goal here is to create a data frame or a CSV file from input_t and output_t. Here is an example of them in one iteration:
input_t
array([2.83203125e-02, 1.21093750e-01, 3.33984375e-01, 8.17000000e+02])
output_t
0
I am trying to create a matrix where, for every row, the first four columns are one iteration of input_t and the last column is output_t. Based on the code, since i needs to be less than 600, starts at 0, and the step P is 500, the loop runs twice, which makes 2 rows and 5 columns in total (4 values from input_t and 1 value from output_t). I tried append, and I tried something like out += ",", but I am not sure why that is not working.
Initialise a variable as a list before the loop and add the results to it:
out = []
while i < 600:
    ....
    input_t = np.append(f11, t11)
    output_t = subVals.BLEEDING
    output_t = int(round(np.mean(output_t)))
    # list(...) turns the numpy array into a list, so + concatenates instead of broadcasting
    out.append(list(input_t) + [output_t])
Now out is a list of lists, which you can load into a DataFrame, as sketched below.
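For instance, a minimal sketch of that last step (the column names and filename are placeholders, not from the original code):
import pandas as pd

# each row: the four feature values from input_t followed by output_t
df = pd.DataFrame(out, columns=['f1', 'f2', 'f3', 'f4', 'output'])
df.to_csv('features.csv', index=False)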

How to form a numpy matrix of known number of rows but unknown number of columns?

I am trying to form a feature matrix that consists of peaks obtained through Power Spectral Density (PSD). I have around 1425 data files, and this number will be the number of rows of my matrix. However, the peaks vary from file to file, as each file might have a different number of peaks (say, for example, file 15 has 5 peaks and file 20 has 3 peaks). Therefore, how can I build a matrix when the number of columns varies from row to row?
I tried to implement the code below; however, I had to initialise the number of columns up front, and the index keeps growing past the size I initialised.
Data_PSD = np.zeros((15, 1425))
# Forming a feature matrix from peaks PSD values:
y_axis_PSD_peaks = np.asarray(y_axis[0].transpose())
DataFrame_Feature_PSD = np.array(y_axis_PSD_peaks)
DataSizecolumnPSD = DataSizecolumnPSD + 1
# print('Data Size column: ', DataSizecolumn)
Data_PSD[DataSizecolumnPSD - 1] = DataFrame_Feature_PSD
if DataSizecolumnPSD in range(1, y_axis.shape[0]):
    DataSizerowPSD = DataSizerowPSD + 1
    # print('Data Size row: ', DataSizerow)
    Data_PSD[DataSizerowPSD - 1] = DataFrame_Feature_PSD
if DataSizecolumnPSD == 0:
    Data_PSD[DataSizecolumnPSD - 1] = 0
print('Signal/Exp. {0}'.format(count))
# print('Data Frame: ', Data)
Data_tran = Data_PSD.transpose()
# print('Data Frame Transpose: ', Data_tran)
# PSD_Matrix = np.append(arr=PSD_Matrix, values=Data_tran)
Feature_PSD_Matrix = Data_tran
print('Saved PSD Dataset file', Feature_PSD_Matrix)
I keep on having error after reading file 16. The error is:
IndexError: index 15 is out of bounds for axis 0 with size 15
Update 1
I tried to implement the list approach; however, I am getting errors like TypeError: 'float' object is not iterable. The code for the list can be found below:
y_axis_PSD_peaks = np.asarray(y_axis[0].transpose())
DataFrame_Feature_PSD = np.array(y_axis_PSD_peaks)
Data_PSD = DataFrame_Feature_PSD.tolist()
for row in Data_PSD:
    for elem in row:
        print(elem, end=' , ')
Feature_PSD_Matrix = Data_PSD
I found this website: https://snakify.org/en/lessons/two_dimensional_lists_arrays/#section_4
Update 2
I managed to solve the problem with the following code:
for row in range(1425):
    Data_PSD.append(PSD_Peaks_list[0:PSD_Peaks_list_shape[0]])
    row += 1
    print(Data_PSD)
    break
Feature_PSD_Matrix = Data_PSD
Now I am trying to save the matrix to a txt file; however, it keeps writing empty lines, exactly one per row. How can I save a list of floats to a txt file? This is the code that produces the empty lines:
with open('filename.txt','w') as out_file:
    for i in range(len(Feature_PSD_Matrix)):
        out_string = ""
        # Feature_PSD_Matrix_str = map(str, Data_PSD)
        out_string = Data_PSD
        out_string = "\n"
        out_file.write('{} \n'.format(out_string))
Update 3
[[0.0007990594037643691, 0.0004403783323228196, 0.000598699480390045, 0.0007168056730554628], [0.0008817249681644274, 0.00045377012471994107, 0.0006534686195005596, 0.0008000537399756023], [0.0005702091244336749, 0.00037008247488756936, 0.0005395840488197647, 0.0005684589958136876]]
I want to change it to:
[[0.0007990594037643691, 0.0004403783323228196, 0.000598699480390045, 0.0007168056730554628]
[0.0008817249681644274, 0.00045377012471994107, 0.0006534686195005596, 0.0008000537399756023]
[0.0005702091244336749, 0.00037008247488756936, 0.0005395840488197647, 0.0005684589958136876]]
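Untested against the full pipeline, but a minimal sketch of the writing step, assuming Feature_PSD_Matrix is a list of lists of floats, that writes one row per line in the format above:
with open('filename.txt', 'w') as out_file:
    for row in Feature_PSD_Matrix:
        # join the values of one row with commas, one row per line
        out_file.write(', '.join(str(x) for x in row) + '\n')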

enumeration of elements for lists within lists

I have a collection of files (kind of like CSV, but no commas) with data arranged like the following:
RMS ResNum Scores Rank
30 1 44 5
12 1 99 2
2 1 60 1
1.5 1 63 3
12 2 91 4
2 2 77 3
I'm trying to write a script that counts through the data and outputs an integer: how many times we get a value of RMS below 3 AND a score above 51. Only if both these criteria are met should it add 1 to our count.
HOWEVER, the tricky part is that for any given "ResNum" it cannot add 1 multiple times. In other words, I want to sub-group the data by ResNum then decide 1 or 0 on whether or not those two criteria are met within that group.
So right now it would give an output of 3, whereas I want it to display 2 instead, since ResNum 1 is being counted twice here (two of its rows meet the criteria).
import glob
file_list = glob.glob("*")
file_list = sorted(file_list)
for input_file in file_list:
    masterlist = []
    opened_file = open(input_file,'r')
    count = 0
    for line in opened_file:
        data = line.split()
        templist = []
        templist.append(float(data[0])) #RMS
        templist.append(int(data[1])) #ResNum
        templist.append(float(data[2])) #Scores
        templist.append(float(data[3])) #Rank
        masterlist.append(templist)
Then here comes the part that needs modification (I think):
for placement in masterlist:
    if placement[0] < 3 and placement[2] > 51.0:
        count += 1
print(input_file)
print(count)
count = 0
Choose your data structures carefully to make your life easier.
import glob
file_list = glob.glob("*")
file_list = sorted(file_list)
grouper = {}
for input_file in file_list:
    with open(input_file) as f:
        grouper[input_file] = set()
        for line in f:
            rms, resnum, scores, rank = line.split()
            # float(rms), since RMS values like 1.5 would crash int(); 51 matches the stated criterion
            if float(rms) < 3 and float(scores) > 51:
                grouper[input_file].add(int(resnum))
for input_file, group in grouper.items():
    print(input_file)
    print(len(group))
This creates a dictionary of sets. The key of this dictionary is the file-name. The values are sets of the ResNums, added only when your condition holds. Since sets don't have repeated elements, the size of your set (len) will give you the right count of the number of times your condition was met, per ResNum, per file.
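As a small refinement of the same design, collections.defaultdict can create each per-file set on first access, which removes the explicit initialisation line (a self-contained illustration):
from collections import defaultdict

grouper = defaultdict(set)           # sets are created lazily on first access
grouper['file_a.txt'].add(1)
grouper['file_a.txt'].add(1)         # duplicate: silently ignored by the set
print(len(grouper['file_a.txt']))    # 1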

Translating a gridded csv file with numpy

I need to get some meteorological data into a MySQL database.
File inputFile.csv is a comma-delimited list of values. There are 241 lines and 481 values per line.
Each line maps to a certain latitude, and each value's position within the line maps to a certain longitude.
There are two additional files with the same structure, lat.csv and lon.csv. These files contain the coordinates that the values in inputFile.csv map to.
So to find the latitude and longitude for a value in inputFile.csv, we need to refer to the values at the same line/position (or row/column) within lat.csv and lon.csv
I want to translate inputFile.csv using lat.csv and lon.csv such that my output file contains a list of values (from inputFile.csv), latitudes, and longitudes.
Here is a small visual example:
inputFile.csv
3,5,1,4,5
1,4,1,2,5
5,7,3,8,0
lat.csv
22,31,51,21,52
55,21,24,66,12
11,23,12,55,55
lon.csv
12,35,12,52,11
35,11,25,33,42
62,53,45,25,54
output:
val lat lon
3 22 12
5 31 35
1 51 12
4 21 52
5 52 11
1 55 35
4 21 11
1 24 25
2 66 33
etc
What is the best way to do this in python/numpy?
I suppose that since you know the total size of the array that you want, you can preallocate it:
a = np.empty((241*481,3))
Now you can add the data:
for i,fname in enumerate(('inputFile.csv','lat.csv','lon.csv')):
    with open(fname) as f:
        data = np.fromfile(f, sep=',')
        a[:,i] = data.ravel()
If you don't know the number of elements up front, you can generate a 2-d list instead (a list of np.ndarrays):
alist = []
for fname in ('inputFile.csv','lat.csv','lon.csv'):
    with open(fname) as f:
        data = np.fromfile(f, sep=',')
        alist.append(data.ravel())
a = np.array(alist).T
Only with numpy functions:
import numpy as np
inputFile = np.genfromtxt('inputFile.csv', delimiter=',')
inputFile = inputFile.reshape(-1)  # reshape returns a new array, so the result must be reassigned
lat = np.genfromtxt('lat.csv', delimiter=',')
lat = lat.reshape(-1)
lon = np.genfromtxt('lon.csv', delimiter=',')
lon = lon.reshape(-1)
# transpose so each row is one (val, lat, lon) triple
output = np.vstack((inputFile, lat, lon)).T
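Either way, the resulting N x 3 array can then be written out for the database import, e.g. with np.savetxt (the filename and format here are placeholders):
# one 'val lat lon' triple per line; '%g' keeps the numbers compact
np.savetxt('output.txt', output, fmt='%g')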
