Fill missing values in Python array

Using: Python 2.7.1 on Windows
Hello, I fear this question has a very simple answer, but I just can't seem to find an appropriate and efficient solution (I have limited Python experience). I am writing an application that downloads historic weather data from a third-party API (Wunderground). The thing is, sometimes there's no value for a given hour (e.g., we have 20 degrees at 5 AM, no value for 6 AM, and 21 degrees at 7 AM). I need exactly one temperature value for every hour, so I figured I could just fit the data I do have and evaluate the points I'm missing (using SciPy's polyfit). That's all cool; however, I am having problems getting my program to detect whether the list has missing hours and, if so, insert the missing hour and calculate a temperature value. I hope that makes sense.
My attempt at handling the hours and temperatures list is the following:
from scipy import polyfit

# Evaluate simple quadratic function
def tempcal(array, x):
    return array[0]*x**2 + array[1]*x + array[2]

# Sample data, note it has missing hours.
# My final hrs list should look like range(25), with matching temperatures at every point
hrs = [1,2,3,6,9,11,13,14,15,18,19,20]
temps = [14.0,14.5,14.5,15.4,17.8,21.3,23.5,24.5,25.5,23.4,21.3,19.8]

# Fit coefficients
coefs = polyfit(hrs, temps, 2)

# Cycle control
i = 0
done = False
while not done:
    # It has a missing hour, insert it and calculate a temperature
    if hrs[i] != i:
        hrs.insert(i, i)
        temps.insert(i, tempcal(coefs, i))
    # We are done, leave now
    if i == 24:
        done = True
    i += 1
I can see why this isn't working: the program will eventually try to access indexes out of range for the hrs list. I am also aware that modifying a list's length inside a loop has to be done carefully. Sure enough, I am either not being careful enough or just overlooking a simpler solution altogether.
In my googling attempts to help myself I came across pandas (the library), but I feel like I can solve this problem without it (and I would rather do so).
Any input is greatly appreciated. Thanks a lot.
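For reference, a minimal sketch of one way to sidestep the in-place insertion problem entirely, using the hrs, temps, tempcal and coefs defined in the question, is to build new lists from a lookup instead of mutating hrs while looping:

# Hypothetical alternative: build new lists rather than inserting in place.
known = dict(zip(hrs, temps))        # hour -> measured temperature
full_hrs = list(range(25))
full_temps = [known[h] if h in known else tempcal(coefs, h) for h in full_hrs]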

When i equals 21, hrs[i] refers to the twenty-second value in the list, but at that point the list only has 21 values, so you get an IndexError.
In the future I recommend using PyCharm with breakpoints to debug, or a try/except construction.
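For illustration, a hedged sketch of the try/except route applied to the question's loop (hrs, temps, tempcal and coefs as defined in the question):

i = 0
while i < 25:
    try:
        if hrs[i] != i:              # missing hour: insert it
            hrs.insert(i, i)
            temps.insert(i, tempcal(coefs, i))
    except IndexError:               # ran off the end of hrs: append the missing tail hour
        hrs.append(i)
        temps.append(tempcal(coefs, i))
    i += 1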

Not sure I would recommend this way of interpolating values; I would have used the closest points surrounding the missing values instead of fitting the whole dataset. But using NumPy, your proposed way is fairly straightforward.
import numpy as np

hrs = np.array(hrs)
temps = np.array(temps)
newTemps = np.empty(25)
newTemps.fill(-300)  # just fill it with some invalid data; temperatures don't go this low so it should be safe
# fill in original values (hour h goes to index h-1)
newTemps[hrs - 1] = temps
# Get indices of missing values
missing = np.nonzero(newTemps == -300)[0]
# Calculate and insert missing values
newTemps[missing] = tempcal(coefs, missing + 1)
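As a sketch of the closest-points alternative mentioned above (assuming, like the code above, that hours run 1 through 25), np.interp does piecewise-linear interpolation using only the surrounding known points:

import numpy as np

all_hrs = np.arange(1, 26)
# np.interp interpolates linearly between the nearest known neighbours;
# beyond the known range (here, hours 21-25) it repeats the edge value.
all_temps = np.interp(all_hrs, hrs, temps)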

Related

pandas limited resample / windowed replacing of multiple rows of values

I am working with weather data for PV modules. The irradiance dataset (regular timeseries, 1 second data) I've been given shows an issue that occurs often in this field: occasionally, a zero value shows up when it shouldn't (daytime), e.g. due to an instrument or data writing error.
My solution that worked in the past is as below:
df['PoA_corr'] = df['PoA'].replace(0,np.nan).resample('1s').mean().interpolate(method='linear',axis=0).ffill().bfill()
where PoA is the original (with issues) and PoA_corr is my attempt at correcting the errors.
However, as can be seen from the image below, not all of the erroneous points have been corrected appropriately: the issue is that the point where PoA == 0 is preceded and followed by 1-4 points that are also incorrect (i.e. the "V" shape in the data, where the single point == 0 and its bad neighbours need to be replaced by an interpolated line between the pre- and post-"V" points).
I have a few ideas in mind, but am stumped as to which is best, and which would be most pythonic (or able to be made so).
1. Get a list of indices where PoA == 0, look 3 seconds (rows) above, then replace 6-8 s (= 6-8 rows) of data. I manage to find the list of points during the day using between_time, and then find the point above using a timedelta, yet I don't know how to replace/overwrite the subsequent 6-8 rows (or interpolate between point "X-4" and "X+4", where X is the location where PoA == 0). The df is large (2.3 GB), so I'm loath to use a for loop on this. At the moment, my list of datetimes where PoA == 0 during the day is found as:
df.between_time('09:00','16:00').loc[df['PoA']==0]['datetime']
2. Do some form of moving window on the data, so that if any value within the window == 0, interpolate between the first and last values of the window. Here I'm stumped as to how that could be done (a sketch of one possibility follows this question).
Is the solution to be found within pandas, or are numpy or pure python advisable?
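Not an authoritative answer, but a minimal sketch of idea 2 under the question's assumptions (a regular 1-second index, the column name PoA, and an assumed +/-4-row window around each zero):

import pandas as pd

# Mark zeros, dilate the mask 4 rows each way with a centered rolling max,
# blank out the dilated region, and interpolate across it.
zero_mask = (df['PoA'] == 0).astype(float)
dilated = zero_mask.rolling(9, center=True, min_periods=1).max() > 0
df['PoA_corr'] = df['PoA'].mask(dilated).interpolate(method='linear').ffill().bfill()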

Setting boundary limits to multiple operations with random number generators

I believe that my problem is really straightforward and there must be a really easy way to solve this issue; however, as I am quite new to Python, I could not sort it out on my own.
I will post a made-up example rather than the complex script I am currently working on, in case you want to test it yourself. Please consider the following:
import numpy as np
nData = 100
sigma_alpha = np.array([1,1])
alpha = [-23,0]
data_alpha1 = np.random.randn(nData)*sigma_alpha[0]+alpha[0]
data_alpha2 = np.random.randn(nData)*sigma_alpha[1]+alpha[1]
My issue is that I have to limit data_alpha1 and data_alpha2 to -25 as the lower limit and 25 as the upper limit; that is, all the elements of both arrays have to lie between those values. The solution I am looking for also has to handle cases like the following, where alpha sits at the boundary and multiple values will land beyond 25:
nData = 100
sigma_alpha = np.array([1,1])
alpha = [25,0]
data_alpha1 = np.random.randn(nData)*sigma_alpha[0]+alpha[0]
data_alpha2 = np.random.randn(nData)*sigma_alpha[1]+alpha[1]
The variable alpha is in a loop, so it has a dynamic value and is constantly being updated.
To sum up: what I have been trying to figure out is a way to make sure that data_alpha1 and data_alpha2 contain only values between -25 and 25, and if any value doesn't respect that condition, it should be set to the closest boundary it surpasses. For example, if an element of data_alpha1 < -25, it should be replaced by -25.
Hope that I managed to be succinct and precise. I would really appreciate your help on this one!
Like this:
data_alpha1[data_alpha1 > 25] = 25
data_alpha1[data_alpha1 < -25] = -25
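Equivalently, NumPy's clip handles both bounds in one call:

import numpy as np

data_alpha1 = np.clip(data_alpha1, -25, 25)
data_alpha2 = np.clip(data_alpha2, -25, 25)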

How to fix negative values in log?

So, I am getting the data from a txt file and I want to get specific data within the whole set. In the code, I am trying to grab it by specifying which indexes and which frequencies are being used for those indexes. But my log is showing a negative value and I don't know how to fix that. Code is below, thanks!
import numpy as np

# (bigcluster is loaded from the txt file earlier in the script)
indexes = [9,10,11,12,13]
frequenciesmh = [151,610,1400,4860,18000]
frequenciesgh = [i*10**-3 for i in frequenciesmh]
bigclusterallfluxes = bigcluster[indexes]
bigclusterlogflux151mhandredshift = [i[indexes] for i in bigcluster]
shiftedlogflux151mh = [np.interp(np.log10((151*10**-3)*i[0]),
                                 np.log10(frequenciesgh), i[1:])
                       for i in bigclusterlogflux151mhandredshift]
shiftflux151mh = [10**i for i in shiftedlogflux151mh]
bigclusterflux151mhandredshift = np.array(
    list(zip(shiftflux151mh, np.transpose(bigcluster)[9])))
I don't know what you are trying to fix exactly, but I would definitely NOT just flip the sign of the negative values, as that would force the power to always be positive (if you know some maths, you will understand that this means 1/16 ==> 16 while also 16 ==> 16, so distinct inputs collapse to the same output).
What you probably want, as you are working with frequencies (which are always between 0 and 1 if you normalize them; to do this, divide each of them by the sum of all of them, so your logarithm will always be smaller than or equal to 0), is to make them all positive by taking the minus log10 of your probability. This is quite a common value to have: 1 == 1/10, 2 == 1/100, etc. (in genetics at least these are called Phred values, I believe).
Summarizing: always take the minus log, not the log:
import math
-math.log10(0.0001)  # == 4.0
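As a hedged sketch of the normalization described above (frequency values borrowed from the question):

import math

freqs = [151, 610, 1400, 4860, 18000]
probs = [f / float(sum(freqs)) for f in freqs]   # each now in (0, 1]
neg_logs = [-math.log10(p) for p in probs]       # all >= 0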
The abs() function is what you are looking for.

Dividing Pandas DataFrame rows into similar time-based groups

I have a DataFrame with the results of a marathon race, where each row represents a runner and columns include data like "Start Time" (timedelta), "Net Time" (timedelta), and "Place" (int). A scatter plot of start time vs net time makes it easy to visually identify the different starting corrals (heats) in the race:
I'd like to analyze each heat separately, but I can't figure out how to divide them up. There are about 20,000 runners in the race. The start time spacings are not consistent, nor is the number of runners in a given corral.
Gist of the code I'm using to organize the data:
https://gist.github.com/kellbot/1bab3ae83d7b80ee382a
CSV with about 500 results:
https://github.com/kellbot/raceresults/blob/master/Full/B.csv
There are lots of ways you can do this (including throwing scipy's k-means at it), but simple inspection makes it clear that there's at least 60 seconds between heats. So all we need to do is sort the start times, find the 60s gaps, and every time we find a gap assign a new heat number.
This can be done easily using the diff-compare-cumsum pattern:
starts = df["Start Time"].copy()
starts = starts.sort_values()   # in modern pandas; older versions used starts.sort()
dt = starts.diff()
heat = (dt > pd.Timedelta(seconds=60)).cumsum()
heat = heat.sort_index()
which correctly picks up the 16 (apparent) groups, here coloured by heat number:
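To make the pattern concrete, a toy illustration with made-up start times:

import pandas as pd

s = pd.Series(pd.to_timedelta(['0s', '5s', '8s', '90s', '95s', '200s']))
labels = (s.diff() > pd.Timedelta(seconds=60)).cumsum()
# labels: 0, 0, 0, 1, 1, 2 -- a new heat starts at each gap longer than 60s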
If I understand correctly, you are asking for a way to algorithmically aggregate the Start Num values into different heats. This is a one dimensional classification/clustering problem.
A quick solution is to use one of the many Jenks natural breaks scripts. I have used drewda's version before:
https://gist.github.com/drewda/1299198
From inspection of the plot, we know there are 16 heats. So you can a priori select the number of classes to be 16.
import matplotlib.pyplot as plt
import jenks  # drewda's gist, saved locally as jenks.py

k = jenks.getJenksBreaks(full['Start Num'].tolist(), 16)
ax = full.plot(kind='scatter', x='Start Num', y='Net Time Sec', figsize=(15,15))
[plt.axvline(x) for x in k]
From your sample data, we see it does a pretty good job, but due to the sparsity of observations it fails to identify the break between the smallest Start Num bins:

Perceptron Learning Algorithm taking a lot of iterations to converge?

I am solving homework 1 of the Caltech Machine Learning Course (http://work.caltech.edu/homework/hw1.pdf). To solve questions 7-10 we need to implement a PLA. This is my implementation in Python:
import sys, math, random

w = []       # stores the weights
data = []    # stores the vector X (x1, x2, ...)
output = []  # stores the output (y)

# returns 1 if dot product is more than 0
def sign_dot_product(x):
    global w
    dot = sum([w[i]*x[i] for i in xrange(len(w))])
    if dot > 0:
        return 1
    else:
        return -1

# checks if a point is misclassified
def is_misclassified(rand_p):
    return (True if sign_dot_product(data[rand_p]) != output[rand_p] else False)

# loads data in the following format:
# x1 x2 ... y
# In the present case for d=2
# x1 x2 y
def load_data():
    f = open("data.dat", "r")
    global w
    for line in f:
        data_tmp = ([1] + [float(x) for x in line.split(" ")])
        data.append(data_tmp[0:-1])
        output.append(data_tmp[-1])

def train():
    global w
    w = [random.uniform(-1, 1) for i in xrange(len(data[0]))]  # initializes w with random weights
    iter = 1
    while True:
        rand_p = random.randint(0, len(output)-1)  # randomly picks a point
        check = [0]*len(output)  # check is a list; the ith location is 1 if the ith point is correctly classified
        while not is_misclassified(rand_p):
            check[rand_p] = 1
            rand_p = random.randint(0, len(output)-1)
            if sum(check) == len(output):
                print "All points successfully satisfied in ", iter-1, " iterations"
                print iter-1, w, data[rand_p]
                return iter-1
        sign = output[rand_p]
        w = [w[i] + sign*data[rand_p][i] for i in xrange(len(w))]  # changing weights
        if iter > 1000000:
            print "greater than 1000"
            print w
            return 10000000
        iter += 1

load_data()

def simulate():
    # tot_iter = train()
    tot_iter = sum([train() for x in xrange(100)])
    print float(tot_iter)/100

simulate()
The problem: according to the answer to question 7, it should take around 15 iterations for the perceptron to converge on a training set of this size, but my implementation takes an average of 50,000 iterations. The training data is supposed to be randomly generated, but I am generating data separable by simple lines such as x=4, y=2, etc. Is this the reason why I am getting the wrong answer, or is there something else wrong? A sample of my training data (separable using y=2):
1 2.1 1
231 100 1
-232 1.9 -1
23 232 1
12 -23 -1
10000 1.9 -1
-1000 2.4 1
100 -100 -1
45 73 1
-34 1.5 -1
It is in the format x1 x2 output(y)
It is clear that you are doing a great job learning both Python and classification algorithms with your effort.
However, some of the stylistic inefficiencies in your code make it difficult to help you, and they create a chance that part of the problem could be a miscommunication between you and the professor.
For example, does the professor wish for you to use the Perceptron in "online mode" or "offline mode"? In "online mode" you should move sequentially through the data points and never revisit any of them. From the assignment's conjecture that it should require only 15 iterations to converge, I am curious whether this implies that the first 15 data points, in sequential order, would result in a classifier that linearly separates your data set.
By instead sampling randomly with replacement, you might be causing yourself to take much longer (although, depending on the distribution and size of the data sample, this is admittedly unlikely since you'd expect roughly that any 15 points would do about as well as the first 15).
The other issue is that after you detect a correctly classified point (cases when not is_misclassified evaluates to True), if you then witness a new random point that is misclassified, your code will kick down into the larger section of the outer while loop and then go back to the top, where it will overwrite the check vector with all 0s.
This means that the only way your code will detect that it has correctly classified all the points is if the particular random sequence in which it evaluates them (in the inner while loop) happens to visit and correctly classify every single point without ever drawing a misclassified one in between.
I can't quite formalize why I think that will make the program take much longer, but it seems like your code is requiring a much stricter form of convergence, where it sort of has to learn everything all at once on one monolithic pass way late in the training stage after having been updated a bunch already.
One easy way to check whether my intuition about this is crappy would be to move the line check=[0]*len(output) outside of the while loop altogether and only initialize it one time.
Some general advice to make the code easier to manage:
Don't use global variables. Instead, let your function that loads and preps the data return things.
There are a few places where you say, for example,
return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)
This kind of thing can be simplified to
return sign_dot_product(data[rand_p]) != output[rand_p]
which is easier to read and conveys what criteria you're trying to check for in a more direct manner.
I doubt efficiency plays an important role since this seems to be a pedagogical exercise, but there are a number of ways to refactor your use of list comprehensions that might be beneficial. And if possible, just use NumPy which has native array types. Witnessing how some of these operations have to be expressed with list operations is lamentable. Even if your professor doesn't want you to implement with NumPy because she or he is trying to teach you pure fundamentals, I say just ignore them and go learn NumPy. It will help you with jobs, internships, and practical skill with these kinds of manipulations in Python vastly more than fighting with the native data types to do something they were not designed for (array computing).
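For reference, a minimal sketch of the same perceptron update in NumPy (not the assignment's required interface; it assumes X carries a leading bias column of 1s and y holds +/-1 labels):

import numpy as np

def perceptron(X, y, max_iter=100000):
    # X: (n, d) array whose first column is all 1s (bias); y: (n,) array of +/-1
    w = np.random.uniform(-1, 1, X.shape[1])
    for it in range(max_iter):
        preds = np.where(X.dot(w) > 0, 1, -1)
        wrong = np.nonzero(preds != y)[0]
        if wrong.size == 0:             # every point classified correctly
            return w, it
        p = np.random.choice(wrong)     # pick one misclassified point at random
        w = w + y[p] * X[p]             # standard PLA update
    return w, max_iter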
