pandas limited resample / windowed replacing of multiple rows of values - python

I am working with weather data for PV modules. The irradiance dataset (regular timeseries, 1 second data) I've been given shows an issue that occurs often in this field: occasionally, a zero value shows up when it shouldn't (daytime), e.g. due to an instrument or data writing error.
My solution that worked in the past is as below:
df['PoA_corr'] = df['PoA'].replace(0,np.nan).resample('1s').mean().interpolate(method='linear',axis=0).ffill().bfill()
where PoA is the original column (with the errors) and PoA_corr is my attempt at a corrected version.
However, as can be seen from the image below, not all of the erroneous points are corrected appropriately: the point where PoA == 0 is preceded and followed by 1-4 points that are also incorrect, i.e. the whole "V" shape in the data (with the single point == 0 at its bottom) needs to be replaced by an interpolated line between the points just before and just after the "V".
I have a few ideas in mind, but am stumped as to which is best, and which would be most pythonic (or able to be made so).
1. Get a list of indices where PoA == 0, look 3 seconds (rows) back, then replace 6-8 s (= 6-8 rows) of data. I manage to find the list of daytime points using between_time, and the point before using a timedelta, yet I don't know how to replace/overwrite the subsequent 6-8 rows (or interpolate between points "X-4" and "X+4", where X is the location where PoA == 0). The df is large (2.3 GB), so I'm loath to use a for loop on this. At the moment, my list of datetimes where PoA == 0 during the day is found as:
df.between_time('09:00','16:00').loc[df['PoA']==0]['datetime']
2. Do some form of moving window on the data, so that if any value within the window == 0, interpolate between the first and last values of the window. Here I'm stumped as to how that could be done.
Is the solution to be found within pandas, or are numpy or pure python advisable?
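One pandas-only possibility (a sketch, not a tested solution) is to widen the zero mask to the neighbouring samples with a centred rolling window and then interpolate across the whole "V". This assumes df has a 1-second DatetimeIndex, a 'PoA' column, and that at most 4 samples on either side of each zero are affected:

bad = df['PoA'].eq(0).astype(float)
# Widen the mask to the 4 samples before and after each zero (window of 9, centred).
# (Restrict to daytime first, e.g. with between_time, if night-time zeros are legitimate.)
bad = bad.rolling(window=9, center=True, min_periods=1).max().gt(0)
df['PoA_corr'] = (
    df['PoA']
    .mask(bad)                   # drop the zero and its incorrect neighbours
    .interpolate(method='time')  # linear in time across each masked gap
    .ffill()
    .bfill()
)

Everything here is vectorised, so it should scale to a large frame without a Python-level loop.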

Related

Debug Exact Cover Pentominoes, Wikipedia example incomplete? OR... I'm misunderstanding something (includes code)

The Problem:
I've implemented Knuth's DLX "dancing links" algorithm for Pentominoes in two completely different ways and am still getting incorrect solutions. The trivial Wikipedia example works OK (https://en.wikipedia.org/wiki/Knuth%27s_Algorithm_X#Example), but more complex examples fail.
Debugging the full Pentominoes game requires a table with almost 2,000 entries, so I came up with a greatly reduced puzzle (pictured below) that is still complex enough to show the errant behavior.
Below is my trivial 3x5 Pentominoes example, using only 3 pieces to place. I can work through the algorithm with pen and paper, and sure enough my code is doing exactly what I told it to, but on the very first step, it nukes all of my rows! When I look at the connectedness, the columns certainly do seem to be OK. So clearly I'm misunderstanding something.
The Data Model:
This is the trivial solution I'm trying to get DLX to solve:
Below is the "moves" table, which encodes all the valid moves that the 3 pieces can make. (I filter out moves where a piece would create a hole size not divisible by 5)
The left column is the encoded move; for example, the first row is piece "l", placed at 0,0, then rotated ONE 90-degree turn counter-clockwise.
A vertical bar (|) separates the move name from the bit columns.
The first 3 columns are the selector bits for which piece I'm referring to. Since "l" is the first piece (of only 3), it has a 1 in the leftmost column.
The next 15 columns are 1 bit for every spot on a 3x5 pentominoes board.
l_0,0_rr10|100111100001000000
l_0,1_rr10|100011110000100000
l_1,1_rr10|100000000111100001
l_0,0_rr01|100111101000000000
l_0,1_rr01|100011110100000000
l_1,0_rr01|100000001111010000
l_0,0_rr30|100100001111000000
l_1,0_rr30|100000001000011110
l_1,1_rr30|100000000100001111
l_0,1_rr01|100000010111100000
l_1,0_rr01|100000000001011110
l_1,1_rr01|100000000000101111
t_0,1_rr00|010011100010000100
t_0,0_rr10|010100001110010000
t_0,1_rr20|010001000010001110
t_0,2_rr30|010000010011100001
y_1,0_rr00|001000000100011110
y_1,1_rr00|001000000010001111
y_1,0_rr01|001000000100011110
y_1,1_rr01|001000000010001111
y_0,0_rr20|001111100010000000
y_0,1_rr20|001011110001000000
y_0,0_rr01|001111100100000000
y_0,1_rr01|001011110010000000
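For what it's worth, a tiny decoder (a sketch; the row-major 3x5 ordering of the 15 board bits is my assumption, not something stated above) makes it easy to sanity-check individual rows of this table:

def decode_move(line):
    """Split an encoded move into its name, piece index and 3x5 board bitmap."""
    name, bits = line.split("|")
    piece = bits[:3].index("1")                      # which of the 3 pieces is selected
    cells = bits[3:]                                 # the 15 board bits
    board = [[int(b) for b in cells[5*r:5*r + 5]] for r in range(3)]
    return name, piece, board

name, piece, board = decode_move("l_0,0_rr10|100111100001000000")
for row in board:
    print(row)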
An Example Failure:
The First Move kills all the rows of my array (disregarding the numeric header row and column)
Following the wikipedia article cited earlier, I do:
Look for minimum number of bits set in a column
4 is the minimum count, and column 2 is the leftmost column with that count
I choose the first row intersecting with column 2, which is row 13.
Column 2 and row 13 will be added to the columns and rows to be "covered" (a.k.a. deleted)
Now I look at row 13 and find all intersecting columns: 2, 5, 6, 7, 11 & 16
Now I look at all the rows that intersect with any 1 in any of those columns - THIS seems to be the problematic step - that criterion selects ALL 24 data rows to remove.
Since the board is empty, the system thinks it has found a valid solution.
Here's a picture of my pen-and-paper version of the algorithm:
Given the requests for code, I'm now attaching it. The comments at the top explain where to look.
Here's the code:
https://gist.github.com/ttennebkram/8bd27adece6fb3a5cd1bdb4ab9b51166
Second Test
There's a second 3x5 puzzle I thought of, but it hits the same problem the first example has. For the record, the second 3x5 is:
# Tiny Set 2: 3x5
# u u v v v
# u p p p v
# u u p p v
The issue you're seeing with your hand-run of the algorithm is that a matrix with no rows is not a solution. You need to eliminate all the columns; just getting rid of the rows is a failure. Your example run still has 12 columns left to cover, so it's not a success.
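For reference, here is a compact dict-of-sets formulation of Algorithm X (a sketch following Ali Assaf's well-known version, not the DLX code from the gist) in which the termination test is explicit: success means no columns are left to cover, whereas a column that still exists but has no candidate rows is a dead end.

def solve(X, Y, solution=None):
    """X: column name -> set of row names covering it; Y: row name -> list of its columns."""
    if solution is None:
        solution = []
    if not X:                                   # no columns left to cover: a real solution
        return list(solution)
    c = min(X, key=lambda col: len(X[col]))     # column with the fewest candidate rows
    for r in list(X[c]):                        # if X[c] is empty this loop never runs: dead end
        solution.append(r)
        removed = select(X, Y, r)
        found = solve(X, Y, solution)
        if found is not None:
            return found
        deselect(X, Y, r, removed)
        solution.pop()
    return None

def select(X, Y, r):
    removed = []
    for j in Y[r]:
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].remove(i)
        removed.append(X.pop(j))
    return removed

def deselect(X, Y, r, removed):
    for j in reversed(Y[r]):
        X[j] = removed.pop()
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].add(i)

Feeding the 24 move rows above into this (columns = the 3 piece selectors plus the 15 board cells) is a quick cross-check against the DLX implementation.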
Your exact cover implementation seems OK for the reduced instance, but the plotter was broken. I fixed it by changing
boardBitmap = fullBitmap[12:]
to
boardBitmap = fullBitmap[3:]
in plotMoveToBoard_np, since there are only three pieces in the reduced instance.
EDIT: there's also a problem with how you generate move names. There are distinct moves with the same name. There are also duplicate moves (which don't affect correctness but do affect performance). I changed
- g_rowNames.append(rowName)
+ g_rowNames.append(str(hash(str(finalBitmask))))
and 3x20 starts working as it should. (That's not a great way to generate the names, because in theory the hashes could collide, but it's one line.)
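An alternative to hashing the bitmask (again only a sketch, with illustrative names rather than the gist's) is to drop duplicate bitmasks up front and make each remaining name unique with a running index:

def dedupe_moves(raw_moves):
    """raw_moves: iterable of (name, bitmask_string) pairs, e.g. from the table above.
    Drops rows whose bitmask was already seen and suffixes names so they stay unique."""
    seen = set()
    unique = []
    for i, (name, bits) in enumerate(raw_moves):
        if bits in seen:
            continue                  # duplicate move: harmless for correctness, bad for speed
        seen.add(bits)
        unique.append(("%s#%d" % (name, i), bits))
    return unique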

Why does peakutils.peak.indexes() seem to ignore the provided threshold value?

I'm retrieving the arrays holding the power levels and frequencies, respectively, of a signal from the plt.psd() method:
Pxx, freqs = plt.psd(signals[0], NFFT=2048, Fs=sdr.sample_rate/1e6, Fc=sdr.center_freq/1e6, scale_by_freq=True, color="green")
Please ignore the green and red signals. Just the blue one is relevant for this question.
I'm able to have the peakutils.peak.indexes() method return the X and Y coordinates of a number of the most significant peaks (of the blue signal):
power_lvls = 10*log10(Pxx/(sdr.sample_rate/1e6))
indexes = peakutils.peak.indexes(np.array(power_lvls), thres=0.6/max(power_lvls), min_dist=120)
print("\nX: {}\n\nY: {}\n".format(freqs[indexes], np.array(power_lvls)[indexes]))
As can be seen, the coordinates fit the blue peaks quite nicely.
What I'm not satisfied with is the number of peak coordinates I receive from the peak.indexes() method. I'd like to have only the coordinates of all peaks above a certain power level returned, e.g., -25 (which would then be exactly 5 peaks for the blue signal). According to the documentation of the peak.indexes() method this is done by providing the desired value as thres parameter.
But no matter what I try as thres, the method seems to entirely ignore my value and instead solely rely on the min_dist parameter to determine the number of returned peaks.
What is wrong with my threshold value (which I believe means "peaks above the lower 60% of the plot" in my code now) and how do I correctly specify a certain power level (instead of a percentage value)?
[EDIT]
I figured out that apparently the thres parameter only takes float values between 0. and 1.
So, by changing my line slightly as follows I can now influence the number of returned peaks as desired:
indexes = peakutils.peak.indexes(np.array(power_lvls), thres=0.4, min_dist=1)
But that still leaves me with the question whether it's possible to somehow limit the result to the five highest peaks (provided num_of_peaks above thres >= 5).
I believe something like the following would return the five highest values:
print(power_lvls[np.argsort(power_lvls[indexes])[-5:]])
Unfortunately, though, negative values seem to be interpreted as the highest values in my power_lvls array. Can this line be changed such that (+)10 would be considered higher than, e.g., -40? Or is there another (better?) solution?
[EDIT 2]
These are the values I get as the six "highest" peaks:
power_lvls = 10*log10(Pxx/(sdr.sample_rate/1e6))+10*log10(8/3)
indexes = peakutils.indexes(power_lvls, thres=0.35, min_dist=1)
power_lvls_max = power_lvls[np.argsort(power_lvls[indexes])[-6:]]
print("Highest Peaks in Signal:\nX: \n\nY: {}\n".format(power_lvls_max))
After trying various things for hours without any improvement, I'm starting to think that these are neither valleys nor peaks, just some "random" values, which leads me to believe that there is a problem with my argsort line that I have to figure out first.
[EDIT 3]
The bottleneck.partition() method seems to return the correct values (even if apparently it does so in random order, not from leftmost peak to rightmost peak):
import bottleneck as bn
power_lvls_max = -bn.partition(-power_lvls[indexes], 6)[:6]
Luckily, the order of the peaks is not important for what I have planned to do with the coordinates. I do, however, still have to figure out how to match the Y values I now have to their corresponding X values ...
Also, while I do have a solution now, for learning purposes it would still be interesting to know what was wrong with my argsort attempt.
A simple way to solve this would be to add a constant (for example +50 dB) to your Pxx vector before the processing. That way you would avoid the negative-valued peaks. After the processing is done, you can subtract the constant again to get the right peak values.
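Alternatively (a sketch; peakutils rescales thres relative to the minimum and maximum of the data, which is why an absolute dB value appears to be ignored), an absolute level such as -25 dB can be mapped into the required 0-1 range before the call:

import numpy as np
import peakutils

level_db = -25.0                                    # desired absolute cut-off
y = np.asarray(power_lvls)                          # power_lvls from the question
thres = (level_db - y.min()) / (y.max() - y.min())  # rescale into peakutils' normalised range
indexes = peakutils.indexes(y, thres=thres, min_dist=120)
# If your peakutils version supports it, thres_abs=True lets you pass -25.0 directly instead.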
I figured out how to find the corresponding X values and get the full coordinates of the six highest peaks:
power_lvls = 10*log10(Pxx/(sdr.sample_rate/1e6))+10*log10(8/3)
indexes = peakutils.indexes(power_lvls, thres=0.35, min_dist=1)
print("Peaks in Signal 1\nX: {}\n\nY: {}\n".format(freqs[indexes], power_lvls[indexes]))
power_lvls_max = -bn.partition(-power_lvls[indexes], 6)[:6]
check = np.isin(power_lvls, power_lvls_max)
indexes_max = np.where(check)
print("Highest Peaks in Signal 1:\nX: {}\n\nY: {}\n".format(freqs[indexes_max], power_lvls[indexes_max]))
Now I have my "peak filtering" (kind of), which I originally tried to achieve by messing around with the thres value of peakutils.peak.indexes(). The code above gives me just the desired result:

Dividing Pandas DataFrame rows into similar time-based groups

I have a DataFrame with the results of a marathon race, where each row represents a runner and columns include data like "Start Time" (timedelta), "Net Time" (timedelta), and Place (int). A scatter plot of the start time vs net time makes it easy to visually identify the different starting corrals (heats) in the race:
I'd like to analyze each heat separately, but I can't figure out how to divide them up. There are about 20,000 runners in the race. The start time spacings are not consistent, nor is the number of runners in a given corral.
Gist of the code I'm using to organize the data:
https://gist.github.com/kellbot/1bab3ae83d7b80ee382a
CSV with about 500 results:
https://github.com/kellbot/raceresults/blob/master/Full/B.csv
There are lots of ways you can do this (including throwing scipy's k-means at it), but simple inspection makes it clear that there's at least 60 seconds between heats. So all we need to do is sort the start times, find the 60s gaps, and every time we find a gap assign a new heat number.
This can be done easily using the diff-compare-cumsum pattern:
starts = df["Start Time"].sort_values()   # Series.sort() is deprecated; sort_values() returns a sorted copy
dt = starts.diff()
heat = (dt > pd.Timedelta(seconds=60)).cumsum()
heat = heat.sort_index()                  # back to the original row order
which correctly picks up the 16 (apparent) groups, here coloured by heat number:
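Since heat is aligned to the original index after sort_index(), it can be attached straight back to the frame to analyse each heat separately (a sketch using the question's column names):

df["Heat"] = heat
for heat_number, runners in df.groupby("Heat"):
    print(heat_number, len(runners), runners["Net Time"].median())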
If I understand correctly, you are asking for a way to algorithmically aggregate the Start Num values into different heats. This is a one-dimensional classification/clustering problem.
A quick solution is to use one of the many Jenks natural breaks scripts. I have used drewda's version before:
https://gist.github.com/drewda/1299198
From inspection of the plot, we know there are 16 heats. So you can a priori select the number of classes to be 16.
k = jenks.getJenksBreaks(full['Start Num'].tolist(),16)
ax = full.plot(kind='scatter', x='Start Num', y='Net Time Sec', figsize=(15,15))
[plt.axvline(x) for x in k]
From your sample data, we see it does a pretty good job, but due to the sparsity of observations it fails to identify the break between the smallest Start Num bins:

fill missing values in python array

Using: Python 2.7.1 on Windows
Hello, I fear this question has a very simple answer, but I just can't seem to find an appropriate and efficient solution (I have limited Python experience). I am writing an application that just downloads historic weather data from a third-party API (Wunderground). The thing is, sometimes there's no value for a given hour (e.g., we have 20 degrees at 5 AM, no value for 6 AM, and 21 degrees at 7 AM). I need exactly one temperature value for every hour, so I figured I could just fit the data I do have and evaluate the points I'm missing (using SciPy's polyfit). That's all fine; however, I am having trouble getting my program to detect whether the list has missing hours and, if so, to insert the missing hour and calculate a temperature value. I hope that makes sense.
My attempt at handling the hours and temperatures list is the following:
from scipy import polyfit

# Evaluate a simple quadratic function
def tempcal(array, x):
    return array[0]*x**2 + array[1]*x + array[2]

# Sample data, note it has missing hours.
# My final hrs list should look like range(25), with matching temperatures at every point
hrs = [1,2,3,6,9,11,13,14,15,18,19,20]
temps = [14.0,14.5,14.5,15.4,17.8,21.3,23.5,24.5,25.5,23.4,21.3,19.8]

# Fit coefficients
coefs = polyfit(hrs, temps, 2)

# Cycle control
i = 0
done = False
while not done:
    # It has a missing hour, insert it and calculate a temperature
    if hrs[i] != i:
        hrs.insert(i, i)
        temps.insert(i, tempcal(coefs, i))
    # We are done, leave now
    if i == 24:
        done = True
    i += 1
I can see why this isn't working, the program will eventually try to access indexes out of range for the hrs list. I am also aware that modifying list's length inside a loop has to be done carefully. Surely enough I am either not being careful enough or just overlooking a simpler solution altogether.
In my googling attempts to help myself I came across pandas (the library) but I feel like I can solve this problem without it, (and I would rather do so).
Any input is greatly appreciated. Thanks a lot.
When i equals 21, it refers to the twenty-second value in the list, but there are only 21 values.
In the future I recommend using PyCharm with breakpoints for debugging, or a try-except construction.
I'm not sure I would recommend this way of interpolating values; I would have used the closest points surrounding the missing values instead of a fit to the whole dataset. But using numpy, your proposed way is fairly straightforward.
import numpy as np

hrs = np.array(hrs)
temps = np.array(temps)

newTemps = np.empty(25)
newTemps.fill(-300)  # just fill it with some invalid data; temperatures don't go this low so it should be safe

# Fill in original values
newTemps[hrs - 1] = temps

# Get indices of missing values
missing = np.nonzero(newTemps == -300)[0]

# Calculate and insert missing values
newTemps[missing] = tempcal(coefs, missing + 1)
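For completeness, the "closest surrounding points" idea mentioned above can be done in one call with np.interp, which fills each missing hour by linear interpolation between its nearest known neighbours (a sketch using the question's sample data):

import numpy as np

hrs = [1, 2, 3, 6, 9, 11, 13, 14, 15, 18, 19, 20]
temps = [14.0, 14.5, 14.5, 15.4, 17.8, 21.3, 23.5, 24.5, 25.5, 23.4, 21.3, 19.8]

all_hrs = np.arange(25)                     # the full 0..24 grid the question asks for
all_temps = np.interp(all_hrs, hrs, temps)  # linear between known hours, clamped at the ends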

Pebbling a Checkerboard with Dynamic Programming

I am trying to teach myself Dynamic Programming, and ran into this problem from MIT.
We are given a checkerboard which has 4 rows and n columns, and
has an integer written in each square. We are also given a set of 2n pebbles, and we want to
place some or all of these on the checkerboard (each pebble can be placed on exactly one square)
so as to maximize the sum of the integers in the squares that are covered by pebbles. There is
one constraint: for a placement of pebbles to be legal, no two of them can be on horizontally or
vertically adjacent squares (diagonal adjacency is ok).
(a) Determine the number of legal patterns that can occur in any column (in isolation, ignoring
the pebbles in adjacent columns) and describe these patterns.
Call two patterns compatible if they can be placed on adjacent columns to form a legal placement.
Let us consider subproblems consisting of the first k columns, where 1 ≤ k ≤ n. Each subproblem can
be assigned a type, which is the pattern occurring in its last column.
(b) Using the notions of compatibility and type, give an O(n)-time dynamic programming algorithm for computing an optimal placement.
OK, so for part (a): there are 8 possible patterns per column.
For part b, I'm unsure, but this is where I'm headed:
Split into sub-problems. Assume i indexes the columns 1, ..., n.
1. Define Cj[i] to be the optimal value by pebbling columns 0,...,i, such that column i has pattern type j.
2. Create 8 separate arrays of n elements for each pattern type.
I am not sure where to go from here. I realize there are solutions to this problem online, but the solutions don't seem very clear to me.
You're on the right track. As you examine each new column, you will end up computing all possible best-scores up to that point.
Let's say you built your compatibility list (a 2D array) and called it Li[y] such that for each pattern i there are one or more compatible patterns Li[y].
Now, you examine column j. First, you compute that column's isolated scores for each pattern i. Call it Sj[i]. For each pattern i and compatible
pattern x = Li[y], you need to maximize the total score Cj such that Cj[x] = Cj-1[i] + Sj[x]. This is a simple array test and update (if bigger).
In addition, you store the pebbling pattern that led to each score. When you update Cj[x] (ie you increase its score from its present value) then remember the initial and subsequent patterns that caused the update as Pj[x] = i. That says "pattern x gave the best result, given the preceding pattern i".
When you are all done, just find the pattern i with the best score Cn[i]. You can then backtrack using Pj to recover the pebbling pattern from each column that led to this result.
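A minimal sketch of the recurrence described above (my own illustration, not the original poster's code), with patterns as 4-bit masks so that compatibility is just a bitwise test:

def max_pebbling(board):
    """board[r][c]: integer on row r (0..3), column c of a 4 x n checkerboard."""
    n = len(board[0])
    # The 8 legal single-column patterns: 4-bit masks with no two adjacent bits set.
    patterns = [p for p in range(16) if p & (p << 1) == 0]

    def score(col, p):                        # S_j[p]: value of column col under pattern p
        return sum(board[r][col] for r in range(4) if (p >> r) & 1)

    # C[p] = best total for columns 0..j when column j uses pattern p
    C = {p: score(0, p) for p in patterns}
    for j in range(1, n):
        C = {p: score(j, p) +
                max(C[q] for q in patterns if p & q == 0)   # compatible: no shared row
             for p in patterns}
    return max(C.values())

# e.g. max_pebbling([[1, 2, 3], [9, 1, 1], [1, 1, 9], [2, 8, 2]])

To recover the actual placement, also keep, for each column j and pattern p, the predecessor pattern q that achieved the maximum (the Pj[x] array the answer mentions) and walk it backwards from the best final pattern.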
