I am plotting a pandas dataframe (see http://db.tt/9SG85XFK): an index of 'timestamp' with two variables (plotted as the blue and green curves).
I would like to extract the subsets of that dataframe over which the blue-curve variable is more or less constant (standard deviation below a specific value?).
For the attached plot that would yield three subsets, roughly 41000:41170, 41180:41315, and 41320:41580.
Is there a clean way to do this? I could do it with a loop, but I am not sure that's the right way.
Thanks,
N
You probably want the functionality of the rolling_std function.
Specify the width of the window you want to check the standard deviation over (say 100 data points), pick an appropriate standard-deviation threshold (say 10), and do:
import pandas as pd

s = pd.Series(...)  # however you get your data, e.g. the blue-curve column
std = s.rolling(100).std()  # pd.rolling_std(s, 100) in pandas < 0.18
selected = s[std < 10]
This gives you all the data points whose surrounding 100-point window has a standard deviation below 10.
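To turn those points into the separate contiguous subsets you describe, one option (a sketch, assuming an integer index like the 41000-41580 range in your plot) is to split wherever the selected index jumps:

runs = (selected.index.to_series().diff() != 1).cumsum()
for _, chunk in selected.groupby(runs):
    print(chunk.index[0], chunk.index[-1])  # start and end of each constant stretch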
Mates,
I need to solve a TSP problem, with the distance info given as follows.
I was trying to solve it using dictionaries but I can't figure it out.
(Actually, I don't even know how to post code as text here; I've only had two days with Python.)
Any advice/help would be more than welcome.
Thanks!
You can do something like this (a greedy nearest-neighbour heuristic):

import numpy as np

# distance matrix: TT[i, j] is the distance from node i to node j
TT = np.array([[0, 5, 2, 13, 4],
               [3, 0, 6, 3, 14],
               [2, 6, 0, 4, 5],
               [2, 3, 7, 0, 8],
               [4, 2, 5, 5, 0]])

# assume we start in node 0
currentStop = 0
routeList = [0]
for _ in range(len(TT) - 1):
    TT[:, currentStop] = 100000  # set column of visited stop to a very large number
    currentStop = np.argmin(TT[currentStop, :])  # nearest remaining stop
    routeList.append(currentStop)
print(routeList)
TSP is an NP-hard problem, which means it is computationally hard to obtain the optimal result. Trying all routes is possible when the data size is small, but in practice there can be thousands of nodes (nodes are the locations in a TSP problem), so search algorithms are used to obtain a sub-optimal solution in reasonable time.
Since you are using Python, I'd suggest you take a look at ortools. You can start from their simple examples. The library is written in C++ with a Python wrapper, so it is much faster than pure Python.
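As a rough sketch of what that looks like (adapted from the ortools routing examples and untested here; it reuses the TT matrix from the other answer):

from ortools.constraint_solver import pywrapcp, routing_enums_pb2

# one vehicle, starting and ending at node 0
manager = pywrapcp.RoutingIndexManager(len(TT), 1, 0)
routing = pywrapcp.RoutingModel(manager)

def distance_callback(from_index, to_index):
    # translate internal routing indices back to node numbers
    return int(TT[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)])

transit = routing.RegisterTransitCallback(distance_callback)
routing.SetArcCostEvaluatorOfAllVehicles(transit)

params = pywrapcp.DefaultRoutingSearchParameters()
params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC

solution = routing.SolveWithParameters(params)

# walk the solved route
index = routing.Start(0)
route = []
while not routing.IsEnd(index):
    route.append(manager.IndexToNode(index))
    index = solution.Value(routing.NextVar(index))
print(route)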
I have a few lists of movement tracking data, which look something like this.
I want to create a list of outputs that marks these large spikes, essentially telling me that there is movement at that point.
I applied a rolling standard deviation to the data with a window size of two and got this result.
Now I can see the spikes that mark the points of interest, but I am not sure how to do it in code: I need a statistical tool that measures these spikes so they can be flagged.
There are several approaches you can use for an anomaly detection task; the choice depends on your data.
If you want to use a statistical approach, you can use measures like the z-score or the IQR.
You can find tutorials for these measures, as well as for a statistical approach that uses the mean and variance.
Last but not least, I suggest you also check how to use a control chart, because in some cases it is enough.
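As a minimal sketch of the z-score and IQR ideas (the 3.0 and 1.5 thresholds are conventional defaults, not values derived from your data):

import numpy as np

def zscore_flags(x, threshold=3.0):
    # mark points more than `threshold` standard deviations from the mean
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_flags(x):
    # mark points more than 1.5 * IQR beyond the quartiles
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)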
I have been trying to find a way to fit some of my columns (which contain user click data) to a Poisson distribution in Python. These columns (e.g., click_website_1, click_website_2) may contain values ranging from 1 to thousands. I am trying to do this because it is recommended by some resources:
We recommend that count data should not be analysed by
log-transforming it, but instead models based on Poisson and negative
binomial distributions should be used.
I found some methods in scipy and numpy, but they seem to generate random numbers that follow a Poisson distribution. What I am interested in is fitting my own data to a Poisson distribution. Any library suggestions for doing this in Python?
Here is a quick way to check whether your data follow a Poisson distribution: you plot the data under the assumption that they follow a Poisson distribution with rate parameter lambda = data.mean().
import numpy as np
from scipy.special import factorial  # scipy.misc.factorial in older scipy

def poisson(k, lamb):
    """Poisson pmf; parameter lamb is the fit parameter."""
    return (lamb**k / factorial(k)) * np.exp(-lamb)

# let's collect the clicks since we are going to need them later
clicks = df["click_website_1"]
Here we use the pmf for the Poisson distribution.
Now let's do some modeling: from the data (click_website_1) we'll estimate the Poisson parameter using the MLE, which turns out to be just the mean:
lamb = clicks.mean()

# plot the pmf using lamb as an estimate for `lambda`;
# sort the counts in the column first
clicks.sort_values().apply(poisson, args=(lamb,)).plot()
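To compare the fit against the data directly, one option (a sketch, assuming matplotlib is available and clicks holds non-negative integer counts) is to plot the normalized histogram next to the pmf:

import matplotlib.pyplot as plt

k = np.arange(clicks.max() + 1)
plt.hist(clicks, bins=k, density=True, alpha=0.5, label="observed")
plt.plot(k, poisson(k, lamb), "o-", label="Poisson pmf, lambda = mean")
plt.legend()
plt.show()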
I have a dataframe in pandas and I am trying to find the easiest way to find the max value across the rows and create a new column with that max value. See below for an example:
MA10D   MA30D      MA50D    MA100D   MA200D
19.838  17.197333  16.5896  16.5207  16.52065
19.296  17.015333  16.4758  16.4676  16.48300
18.722  16.833000  16.3680  16.4106  16.44475
So in the first row of the new column I would want 19.838, then 19.296 and 18.722 (it is just by chance that in this example all the numbers fall under the MA10D column). Can someone help me find the best way to do this?
In pandas, the vast majority of operations apply down the rows, i.e. per column; that is axis=0. When it makes sense to apply an operation across the columns, i.e. per row, use axis=1.
Finding the maximum is an expected operation on a dataframe. df.max() is equivalent to df.max(axis=0) and gives one resulting row with the max per column. For your case, use df.max(axis=1).
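A minimal sketch with the example data (the new column name row_max is an arbitrary choice):

import pandas as pd

df = pd.DataFrame({
    "MA10D":  [19.838, 19.296, 18.722],
    "MA30D":  [17.197333, 17.015333, 16.833000],
    "MA50D":  [16.5896, 16.4758, 16.3680],
    "MA100D": [16.5207, 16.4676, 16.4106],
    "MA200D": [16.52065, 16.48300, 16.44475],
})
df["row_max"] = df.max(axis=1)  # max across columns, per row
print(df)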
How can I fit my data to an asymptotic power-law curve or an exponential-approach curve in R or Python?
My data essentially show that the y-axis increases continuously, but the delta (increase) decreases as x increases.
Any help will be much appreciated.
Using Python, if you have numpy and scipy installed, you could use curve_fit from the scipy package. It takes a user-defined function and x- as well as y-values (x_values and y_values in the code), and returns the optimized parameters and the covariance of the parameters.
import numpy
from scipy.optimize import curve_fit

def exponential(x, a, b):
    # model: y = a * exp(b * x)
    return a * numpy.exp(b * x)

fit_data, covariance = curve_fit(exponential, x_values, y_values, p0=(1., 1.))
This answer assumes you have your data as one-dimensional numpy arrays; you can easily convert your data into those, though.
The last argument (p0) contains starting values for the optimization. If you don't supply them, there might be problems in determining the number of parameters.
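For the asymptotic power-law case the question mentions, the same pattern works with a different model function (a sketch; the starting exponent 0.5 is an arbitrary guess):

def power_law(x, a, b):
    # model: y = a * x**b; 0 < b < 1 gives growth with a decreasing delta
    return a * x**b

fit_data_pl, covariance_pl = curve_fit(power_law, x_values, y_values, p0=(1., 0.5))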