Let me start by saying that I know this error message has posts about it, but I'm not sure what's wrong with my code. The block of code works just fine for the first two loops, but then fails. I've even tried removing the first two loops from the data to rule out issues in the 3rd loop, but no luck. I did have it set to print out the unsorted temporary list, and it just prints an empty array for the 3rd loop.
Sorry for the wall of comments in my code, but I'd rather have each line commented than cause confusion over what I'm trying to accomplish.
TL;DR: I'm trying to find and remove outliers from a list of data, but only for groups of entries that have the same number in column 0.
Pastebin with data
import numpy as np, csv, multiprocessing as mp, mysql.connector as msc, pandas as pd
import datetime
#Declare unsorted data array
d_us = []
#Declare temporary array for use in loop
tmp = []
#Declare sorted data array
d = []
#Declare Sum variable
tot = 0
#Declare Mean variable
m = 0
#declare sorted final array
sort = []
#Declare number of STDs
t = 1
#Declare Standard Deviation variable
std = 0
#Declare z-score variable
z_score = 0
#Timestamp for output files
nts = datetime.datetime.now().timestamp()
#Create output file
with open(f"calib_temp-{nts}.csv", 'w') as ctw:
    pass
#Read data from CSV
with open("test.csv", 'r', newline='') as drh:
    fr_rh = csv.reader(drh, delimiter=',')
    for row in fr_rh:
        #append data to unsorted array
        d_us.append([float(row[0]),float(row[1])])
#Sort array by first column
d = np.sort(d_us)
#Calculate the range of the data
l = round((d[-1][0] - d[0][0]) * 10)
#Declare the starting value
s = d[0][0]
#Declare the ending value
e = d[-1][0]
#Set the while loop counter
n = d[0][0]
#Iterate through data
while n <= e:
    #Create array with difference column
    for row in d:
        if row[0] == n:
            diff = round(row[0] - row[1], 1)
            tmp.append([row[0],row[1],diff])
    #Convert to numpy array
    tmp = np.array(tmp)
    #Sort numpy array
    sort = tmp[np.argsort(tmp[:,2])]
    #Calculate sum of differences
    for row in tmp:
        tot = tot + row[2]
    #Calculate mean
    m = np.mean(tot)
    #Calculate Standard Deviation
    std = np.std(tmp[:,2])
    #Calculate outliers and write to output file
    for y in tmp:
        z_score = (y[2] - m)/std
        if np.abs(z_score) > t:
            with open(f"calib_temp-{nts}.csv", 'a', newline='') as ct:
                c = csv.writer(ct, delimiter = ',')
                c.writerow([y[0],y[1]])
    #Reset Variables
    tot = 0
    m = 0
    n = n + 0.1
    tmp = []
    std = 0
    z_score = 0
Do this before the loop:
#Create output file
ct = open(f"calib_temp-{nts}.csv", 'w')
c = csv.writer(ct, delimiter = ',')
Then change the loop to this. Note that I have moved your initializations to the top of the loop, so you don't need to initialize them twice. Note also the if tmp: line: it skips any value of n with no matching rows, which avoids the NumPy error raised when tmp[:,2] is applied to an empty array.
#Iterate through data
while n <= e:
    tot = 0
    m = 0
    tmp = []
    std = 0
    z_score = 0
    #Create array with difference column
    for row in d:
        if row[0] == n:
            diff = round(row[0] - row[1], 1)
            tmp.append([row[0],row[1],diff])
    #Skip groups with no matching rows
    if tmp:
        #Convert to numpy array
        tmp = np.array(tmp)
        #Sort numpy array
        sort = tmp[np.argsort(tmp[:,2])]
        #Calculate sum of differences
        for row in tmp:
            tot = tot + row[2]
        #Calculate mean (sum divided by the number of rows)
        m = tot / len(tmp)
        #Calculate Standard Deviation
        std = np.std(tmp[:,2])
        #Calculate outliers and write to output file
        for y in tmp:
            z_score = (y[2] - m)/std
            if np.abs(z_score) > t:
                c.writerow([y[0],y[1]])
    #Advance to the next group
    n = n + 0.1
#Close the output file once the loop has finished
ct.close()
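A likely reason the temporary list comes back empty on the third pass is floating-point drift: repeatedly adding 0.1 to n does not always land exactly on the values stored in column 0, so row[0] == n can fail for every row. The sketch below shows the drift and one possible workaround, which is to group on a rounded key instead of stepping a float counter. The helper name group_by_key and the sample rows are made up for illustration, and the code assumes column 0 holds values with one decimal place.

```python
from collections import defaultdict

import numpy as np

# Repeated addition of 0.1 drifts away from the "obvious" decimal value.
n = 0.1
n = n + 0.1
n = n + 0.1
drifted = (n != 0.3)  # True: n is 0.30000000000000004

def group_by_key(rows, decimals=1):
    """Group rows by column 0, rounded so that float noise cannot split a group."""
    groups = defaultdict(list)
    for first, second in rows:
        diff = round(first - second, decimals)
        groups[round(first, decimals)].append([first, second, diff])
    return {key: np.array(vals) for key, vals in groups.items()}

# Hypothetical sample data: (column 0, column 1)
rows = [(0.1, 0.2), (0.1, 0.3), (0.2, 0.2), (0.3, 0.1), (0.3, 0.5)]
groups = group_by_key(rows)
for key in sorted(groups):
    diffs = groups[key][:, 2]
    print(key, diffs.mean(), diffs.std())
```

Grouping this way also removes the need for the if tmp: check, because only keys that actually occur in the data are visited.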
I want to store certain values in a 2D array. In the code below, I want every value of sT to be stored in total: the inner loop should fill the rows of one column, and each increment of the outer loop should move on to the next column.
class pricing_lookback:
    def __init__(self,spot,rate,sigma,time,sims,steps):
        self.spot = spot
        self.rate = rate
        self.sigma = sigma
        self.time = time
        self.sims = sims
        self.steps = steps
        self.dt = self.time/self.steps
    def call_floatingstrike(self):
        simulationS = np.array([])
        simulationSt = np.array([])
        call2 = np.array([])
        total = np.empty(shape=[self.steps, self.sims])
        for j in range(self.sims):
            sT = self.spot
            pathwiseminS = np.array([])
            for i in range(self.steps):
                phi= np.random.normal()
                sT *= np.exp((self.rate-0.5*self.sigma*self.sigma)*self.dt + self.sigma*phi*np.sqrt(self.dt))
                pathwiseminS = np.append(pathwiseminS, sT)
                np.append(total,[[j,sT]])###This should store values in rows of j column
                #print (pathwiseminS)
                #tst1 = np.append(tst1, pathwiseminS[1])
            call2 = np.append(call2, max(pathwiseminS[self.steps-1]-self.spot,0))
            #print (pathwiseminS[self.steps-1])
            #print(call2)
            simulationSt = np.append(simulationSt,pathwiseminS[self.steps-1])
            simulationS = np.append(simulationS,min(pathwiseminS))
        call = max(np.average(simulationSt) - np.average(simulationS),0)
        return call, total#,call2,
Here is a simple example of what I think you are trying to do:
for i in range(5):
    row = np.random.rand(5,)
    if i == 0:
        my_array = row
    else:
        my_array = np.vstack((my_array, row))
    print(row)
However, this is not very efficient with memory, especially if you are dealing with large arrays, as this has to allocate new memory on every loop. It would be much better to preallocate an empty array and then populate it if possible.
To answer the question of how to append a column, it would be something like this:
import numpy as np
x = np.random.rand(5, 4)
column_to_append = np.random.rand(5,)
# np.insert returns a new array rather than modifying x, so keep the result
x = np.insert(x, x.shape[1], column_to_append, axis=1)
Again, this is not memory efficient and should be avoided whenever possible. Preallocation is much better.
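For the simulation in the question, preallocation could look roughly like the sketch below: total is created once with shape (steps, sims) and each value is written to its slot with total[i, j] = ..., so np.append is not needed at all. The function name simulate_paths, the seed argument, and the parameter values in the usage line are assumptions made for illustration, not part of the original class.

```python
import numpy as np

def simulate_paths(spot, rate, sigma, time, sims, steps, seed=0):
    """Return a (steps, sims) array; column j holds the j-th simulated path."""
    rng = np.random.default_rng(seed)
    dt = time / steps
    total = np.empty((steps, sims))  # allocated once, filled in place
    for j in range(sims):
        s_t = spot
        for i in range(steps):
            phi = rng.normal()
            s_t *= np.exp((rate - 0.5 * sigma**2) * dt + sigma * phi * np.sqrt(dt))
            total[i, j] = s_t  # row i (time step), column j (simulation)
    return total

# Hypothetical parameters, chosen only to exercise the function
paths = simulate_paths(spot=100.0, rate=0.05, sigma=0.2, time=1.0, sims=3, steps=4)
print(paths.shape)
```

With the paths stored this way, the final value and the minimum of every path are available as paths[-1] and paths.min(axis=0), which would replace the three arrays that the question grows with np.append.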
I know that python loops themselves are relatively slow when compared to other languages but when the correct functions are used they become much faster.
I have a pandas dataframe called "acoustics" which contains over 10 million rows:
print(acoustics)
                  timestamp            c0  rowIndex
0  2016-01-01T00:00:12.000Z  13931.500000   8158791
1  2016-01-01T00:00:30.000Z  14084.099609   8158792
2  2016-01-01T00:00:48.000Z  13603.400391   8158793
3  2016-01-01T00:01:06.000Z  13977.299805   8158794
4  2016-01-01T00:01:24.000Z  13611.000000   8158795
5  2016-01-01T00:02:18.000Z  13695.000000   8158796
6  2016-01-01T00:02:36.000Z  13809.400391   8158797
7  2016-01-01T00:02:54.000Z  13756.000000   8158798
and there is the code I wrote:
acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])
weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)
deltaAc = []
for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0']-acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])
deltaAc = pd.DataFrame(deltaAc)
It takes a huge amount of time, how can I make it faster?
You can use diff from pandas to build all the lagged differences for each row in one array, then multiply by your weights and finally sum over axis 1, like this:
deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'],
'summation': (np.array([acoustics.c0.diff(i) for i in range(5) ]).T[5:]
*np.array(weights)).sum(1)/sumWeights})
and you get the same values as I get with your code:
print (deltaAc)
timestamp summation
5 2016-01-01T00:02:18.000Z -41.799986
6 2016-01-01T00:02:36.000Z 51.418728
7 2016-01-01T00:02:54.000Z -3.111184
First optimization, weights[c]/sumWeights could be done outside the loop.
weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
sum += tmp[c]*...
I'm not familiar with pandas, but if you could extract your columns as 1D numpy array, it would be great for you. It might look something like:
# next lines to be tested, or find the correct way of extracting the column
c0_column = acoustics['c0'].values
time_column = acoustics['timestamp'].values
...
sum = np.zeros(shape=(len(acoustics)-5,))
deltaAc = []
for c in range(5):
    sum += tmp[c]*(c0_column[5:]-c0_column[5-c:len(acoustics)-c])
for i in range(len(acoustics)-5):
    deltaAc.append([time_column[5+i], sum[i]])
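DataFrames have a great method, rolling, for constructing and applying windowed transformations, so you don't need loops at all:
# df is your data frame
window_size = 5
weights = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
weights /= weights.sum()
# each window is ordered oldest to newest, so reverse the weights to apply weights[c] to lag c
rev_weights = weights[::-1]
# raw=True passes each window as a NumPy array, so x[-1] is the newest value
df.loc[:,'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(lambda x: ((x[-1] - x)*rev_weights).sum(), raw=True)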
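As a rough check on the vectorised approaches, the sketch below replaces the loop over rows with five whole-array operations and compares the result with the original row-by-row calculation. The function name weighted_lag_delta is made up for illustration, and the sample values are the first eight c0 entries from the question's printout.

```python
import numpy as np

def weighted_lag_delta(c0, weights):
    """Vectorised form of the question's double loop over rows and lags."""
    c0 = np.asarray(c0, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    k = len(w)
    out = np.zeros(len(c0) - k)
    for c in range(k):  # only k iterations, each one a whole-array operation
        out += w[c] * (c0[k:] - c0[k - c:len(c0) - c])
    return out

# First eight c0 values from the question's printout
c0 = [13931.5, 14084.099609, 13603.400391, 13977.299805,
      13611.0, 13695.0, 13809.400391, 13756.0]
weights = [1/9, 1/18, 1/27, 1/36, 1/54]
fast = weighted_lag_delta(c0, weights)

# Reference: the original row-by-row loop, kept only for comparison
w = np.array(weights) / np.sum(weights)
slow = [sum(w[c] * (c0[i] - c0[i - c]) for c in range(5))
        for i in range(5, len(c0))]
print(np.allclose(fast, slow))
```

The remaining Python loop runs once per lag rather than once per row, so its cost does not grow with the 10 million rows in the same way.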
Question: How could I perform the following task more efficiently?
My problem is as follows. I have a (large) 3D data set of points in real physical space (x,y,z). It has been generated by a nested for loop that looks like this:
# Generate given dat with its ordering
x_samples = 2
y_samples = 3
z_samples = 4
given_dat = np.zeros(((x_samples*y_samples*z_samples),3))
row_ind = 0
for z in range(z_samples):
    for y in range(y_samples):
        for x in range(x_samples):
            row = [x+.1,y+.2,z+.3]
            given_dat[row_ind,:] = row
            row_ind += 1
for row in given_dat:
    print(row)
For the sake of comparing it to another set of data, I want to reorder the given data into my desired order as follows (unorthodox, I know):
# Generate data with desired ordering
x_samples = 2
y_samples = 3
z_samples = 4
desired_dat = np.zeros(((x_samples*y_samples*z_samples),3))
row_ind = 0
for z in range(z_samples):
    for x in range(x_samples):
        for y in range(y_samples):
            row = [x+.1,y+.2,z+.3]
            desired_dat[row_ind,:] = row
            row_ind += 1
for row in desired_dat:
    print(row)
I have written a function that does what I want, but it is horribly slow and inefficient:
def bad_method(x_samp,y_samp,z_samp,data):
    zs = np.unique(data[:,2])
    xs = np.unique(data[:,0])
    rowlist = []
    for z in zs:
        for x in xs:
            for row in data:
                if row[0] == x and row[2] == z:
                    rowlist.append(row)
    new_data = np.vstack(rowlist)
    return new_data
# Shows that my function does what I want
fix = bad_method(x_samples,y_samples,z_samples,given_dat)
print('Unreversed data')
print(given_dat)
print('Reversed Data')
print(fix)
# If it didn't work this will throw an exception
assert(np.array_equal(desired_dat,fix))
How could I improve my function so it is faster? My data sets usually have roughly 2 million rows. It must be possible to do this with some clever slicing/indexing which I'm sure will be faster but I'm having a hard time figuring out how. Thanks for any help!
You could reshape your array, swap the axes as necessary and reshape back again:
# (No need to copy if you don't want to keep the given_dat ordering)
data = np.copy(given_dat).reshape(( z_samples, y_samples, x_samples, 3))
# swap the "y" and "x" axes
data = np.swapaxes(data, 1,2)
# back to 2-D array
data = data.reshape((x_samples*y_samples*z_samples,3))
assert(np.array_equal(desired_dat,data))
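If the sample counts are not known in advance, or the rows do not form a complete grid, a sort-based alternative may be worth considering. The sketch below uses np.lexsort, which is a different technique from the reshape above and costs a sort rather than a single pass. The function name reorder_z_x_y is made up for illustration.

```python
import numpy as np

def reorder_z_x_y(data):
    """Sort rows so that z varies slowest, then x, then y (fastest)."""
    # np.lexsort treats the LAST key in the tuple as the primary sort key.
    order = np.lexsort((data[:, 1], data[:, 0], data[:, 2]))
    return data[order]

# Small grid in the question's original ordering (z outer, y middle, x inner)
given = np.array([[x + .1, y + .2, z + .3]
                  for z in range(4) for y in range(3) for x in range(2)])
# The same grid in the desired ordering (z outer, x middle, y inner)
desired = np.array([[x + .1, y + .2, z + .3]
                    for z in range(4) for x in range(2) for y in range(3)])
print(np.array_equal(reorder_z_x_y(given), desired))
```

For a complete, regularly ordered grid the reshape and swapaxes version should be faster, since it only rearranges the array without sorting it.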
I have 3 lists, all of the same length. One of the lists holds a number representing a day, and the other two lists hold data which correspond to that day, e.g.
day = [1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4....]
data1 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4....] # (effectively random numbers)
data2 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4....] # (again, effectively random numbers)
What I need to do is to take data1 and data2 for day 1, perform operations on it, and then repeat the process for day 2, day 3, day 4 and so on.
I currently have:
def sortfile(day, data1, data2):
    x=[]
    y=[]
    date=[]
    temp1=[]
    temp2=[]
    i=00
    for i in range(0,len(day)-1):
        if day[i] == day[i+1]:
            x.append(data1[i])
            y.append(data2[i])
            i+=1
            #print x, y
        else:
            for i in range(len(x)):
                temp1.append(x)
            for i in range(len(y)):
                temp2.append(y)
            date.append(day[i])
            x=[]
            y=[]
            i+=1
    while i!=(len(epoch)-1):
        x.append(data1[i])
        y.append(data2[i])
        i+=1
    date.append(day[i])
    return date, temp1, temp2
This is supposed to append to the x array whilst the day stays the same, and then if it changes append all the data from the x array to the temp1 array, then clear the x array. It will then perform operations on temp1 and temp2. However, when I run this as a check (I'm aware that I'm not clearing temp1 and temp2 at any point), temp1 just fills with the full list of days and temp2 is empty. I'm not sure why this is and am open to completely restarting if a better way is suggested!
Thanks in advance.
Just zip the lists:
x = []
y = []
date = []
temp1 = []
temp2 = []
day = [1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4]
data1 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4]
data2 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4]
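zipped = list(zip(day, data1, data2)) # list() is needed on Python 3, where zip returns an iterator
# enumerate yields (index, row), so the row is unpacked in its own parentheses
for ind, (dy, dt1, dt2) in enumerate(zipped[:-1]):
    if zipped[ind+1][0] == dy:
        x.append(dt1)
        y.append(dt2)
    else:
        temp1 += x
        temp2 += y
        x = []
        y = []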
I'm not sure what your while loop is doing: it sits outside the for loop, and you never return or use x and y, so that code looks irrelevant and may well be why your function does not return what you expect.
groupby and zip are a good solution for this problem. groupby lets you group runs of sorted data together, and zip lets you access the elements at each index of day, data1, and data2 together as a tuple.
from operator import itemgetter
from itertools import groupby
day = [1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4]
data1 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4]
data2 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4]
x = []
y = []
for day_num, data in groupby(zip(day, data1, data2), itemgetter(0)):
    data = list(data)
    data1_total = sum(d[1] for d in data)
    x.append(data1_total)
    data2_total = sum(d[2] for d in data)
    y.append(data2_total)
itemgetter is just a function that tells groupby to group the tuple of elements by the first element in the tuple (the day value).
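To make the role of itemgetter(0) concrete, here is a small sketch that keeps each day's values instead of summing them. The sample lists are shortened, made-up data, and the dictionary name per_day is only illustrative. Note that groupby merges only consecutive rows with equal keys, so it assumes the day list is already sorted.

```python
from itertools import groupby
from operator import itemgetter

# Shortened, hypothetical sample data
day = [1, 1, 2, 2, 2, 3]
data1 = [10, 11, 20, 21, 22, 30]
data2 = [5, 6, 7, 8, 9, 1]

first = itemgetter(0)
key_of_row = first((2, 20, 7))  # picks out the day value of one zipped row

# Collect each day's rows instead of summing them.
per_day = {}
for day_num, rows in groupby(zip(day, data1, data2), key=first):
    rows = list(rows)  # the group iterator can only be consumed once
    per_day[day_num] = ([r[1] for r in rows], [r[2] for r in rows])

print(per_day[2])
```

Each entry of per_day then holds the two lists for one day, ready for whatever per-day operation is needed.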
Another option is to use defaultdict and simply iterate over days adding data as we go:
from collections import defaultdict
d1 = defaultdict(list)
d2 = defaultdict(list)
for n, d in enumerate(day):
    d1[d].append(data1[n])
    d2[d].append(data2[n])
This creates two dicts like {day: [value1, value2...]...}. Note that this solution doesn't require days to be sorted.
Running several threads is similar to running several different programs concurrently sharing the same data. You can call a thread for each array.
Read more about threading in the threading documentation.
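One way to run a task on each array in its own thread is the standard-library ThreadPoolExecutor, sketched below. The summarise function and the sample arrays are hypothetical stand-ins for whatever work is actually needed. Whether threads give any speed-up depends on the workload, so treat this as an illustration of the mechanics only.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def summarise(arr):
    """Hypothetical per-array task: return the mean and standard deviation."""
    return float(arr.mean()), float(arr.std())

# Hypothetical input arrays
arrays = [np.arange(5), np.arange(10), np.ones(4)]

# One worker thread per array; map() returns results in input order.
with ThreadPoolExecutor(max_workers=len(arrays)) as pool:
    results = list(pool.map(summarise, arrays))

print(results)
```

Because map() preserves the input order, results[k] always belongs to arrays[k], regardless of which thread finishes first.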