My dataset is in hours and I need guidance on how to convert the standard deviation from hours to minutes.
In the code provided below, I only get the standard deviation in hours. How can I convert it to minutes?
np.sum(data['Time']*data['kneel']/sum(data['Time']))*60  # weighted mean of 'kneel', scaled from hours to minutes
data['kneel'].std()  # standard deviation of 'kneel', still in hours
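Since the standard deviation carries the same units as the data, it scales linearly with a change of units. A minimal sketch, assuming data['kneel'] holds durations in hours:

std_hours = data['kneel'].std()   # standard deviation in hours
std_minutes = std_hours * 60      # the same statistic, expressed in minutes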
I had a similar problem: I wanted to calculate cosine distance quickly, so I used the following solutions.
Try using Dask; it is fast compared to pandas. You can also try Numba, which compiles Python code to make it computationally fast.
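For example, a minimal Numba sketch of a fast cosine distance (the function and arrays here are illustrative, not from the original post):

import numpy as np
from numba import njit

@njit  # compiles the function to machine code on first call
def cosine_distance(u, v):
    dot = 0.0
    norm_u = 0.0
    norm_v = 0.0
    for i in range(u.shape[0]):
        dot += u[i] * v[i]
        norm_u += u[i] * u[i]
        norm_v += v[i] * v[i]
    return 1.0 - dot / np.sqrt(norm_u * norm_v)

u = np.random.rand(1000)
v = np.random.rand(1000)
print(cosine_distance(u, v))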
I am running into a roadblock and would appreciate some help on this.
Problem Statement:
I am trying to calculate XIRR for a cash flow over 30 years in Python.
What have I tried so far:
However, none of the established libraries (like numpy and pandas) seem to have support for this. After doing some research, I learned through this source (https://vindeep.com/Corporate/XIRRCalculation.aspx) that, with some simple manipulation, XIRR can be calculated from IRR.
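As I read that source, the manipulation is essentially: compute a periodic IRR on a regularly spaced (e.g., daily) cash flow series, then annualize it by compounding. A sketch of my understanding (hedged, since I may be misreading the source; cash_flows is a daily series like the one in the test below):

import numpy_financial as npf

# For daily-spaced cash flows, the annualized XIRR follows from the
# daily IRR by compounding over 365 days (per my reading of the source)
daily_irr = npf.irr(cash_flows)     # periodic (daily) rate
xirr = (1 + daily_irr) ** 365 - 1   # annualized rate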
So, all I need is a well-implemented IRR function. The functionality used to exist in numpy but has moved to a separate package (https://github.com/numpy/numpy-financial). While this package works, it is very, very slow. Here is a small test:
import pandas as pd
import numpy as np
import numpy_financial as npf
from time import time

# Generate some example data: an initial outflow followed by daily inflows
t = pd.date_range('2022-01-01', '2037-01-01', freq='D')
cash_flows = np.random.randint(10000, size=len(t) - 1)
cash_flows = np.insert(cash_flows, 0, -10000)

# Calculate IRR and time it
start_timer = time()
npf.irr(cash_flows)
stop_timer = time()
print(f"Time taken to calculate IRR over 30 years of daily data: {round((stop_timer - start_timer) / 60, 2)} minutes")
One other alternative seems to be https://github.com/better/irr - however, this has an edge case bug that has not been addressed in over 4 years.
Can anyone kindly point me to a more stable implementation? It feels like such a simple and commonly used piece of functionality that the lack of a good, stable implementation surprises me. Can someone point to any good resources?
Thanks
Uday
Try the pyxirr package. Implemented in Rust, it is blazing fast: for a 30-year period it took about 0.001 sec.
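A minimal usage sketch, based on pyxirr's xirr(dates, amounts) interface (the values here are illustrative):

from datetime import date
from pyxirr import xirr

dates = [date(2022, 1, 1), date(2023, 1, 1), date(2024, 1, 1)]
amounts = [-10000, 5500, 6000]

rate = xirr(dates, amounts)  # annualized internal rate of return
print(rate)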
pyxirr creator here. The library has been used in a financial project for over a year, but I only recently found the time to publish it. We had the task of quickly calculating XIRR for various portfolios and existing implementations quickly became a bottleneck. pyxirr also mimics some numpy-financial functions and works much faster.
The XIRR implementation in Excel is not always correct. In edge cases the algorithm does not converge and shows an incorrect result instead of an error or NA. The result can be checked with the xnpv function: xnpv(xirr_rate, dates, values) should be close to zero. Similarly, you can check irr using the npv function: npv(irr_rate, values), but note the difference in npv calculation between Excel and numpy-financial.
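A sketch of that sanity check, reusing the dates and amounts from the example above:

from pyxirr import xirr, xnpv

rate = xirr(dates, amounts)
residual = xnpv(rate, dates, amounts)  # should be close to zero
print(abs(residual) < 1e-6)            # False suggests a non-converged rate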
Taking a look at the implementation on their GitHub, it is pretty evident to me that the npf.irr() function is implemented well. Your alternative seems to be to implement the function yourself using NumPy operations, but I am doubtful that (a) this would be easy to accomplish or (b) it is even possible in pure Python.
NumPy Financial does its implementation using eigenvalues, which means it is performing complex mathematical operations. Perhaps, if you are not bound to Python, consider Microsoft's C# implementation of IRR and see if that works faster. I suspect it uses regression to calculate the IRR, so, depending on your guess, it may indeed be quicker than NumPy Financial.
Your final alternative is to continue with what you have at the moment and just run it on a more powerful machine. On my machine, this operation took about 71 seconds, and it does not even have a GPU. I am sure more powerful computers, with parallelization, should be able to compute this much, much faster.
Look at the answer I provided here: https://stackoverflow.com/a/66069439/4045275.
I didn't benchmark it against pyxirr.
I would like to investigate whether there are significant differences between three different groups. There are about 20 numerical attributes for these groups, and for each attribute there are about a thousand observations.
My first thought was to calculate a MANOVA. Unfortunately, the data are not normally distributed (tested with the Anderson-Darling test). From just looking at the data, the distribution is too narrow around the mean and has no tail at all.
When I calculate the MANOVA anyway, highly significant results come out that are completely against my expectations.
Therefore, I would like to calculate a multivariate Kruskal-Wallis test next. So far I have found scipy.stats.kruskal. Unfortunately, it only compares individual data series with each other. Is there an implementation in Python, similar to a MANOVA, into which you can read all attributes and all three groups and get a single result?
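The closest I have managed so far is looping scipy.stats.kruskal over the attributes and correcting for multiple testing; a sketch, where group_a, group_b, and group_c stand in for my three groups:

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Illustrative stand-ins: ~1000 observations of 20 attributes per group
rng = np.random.default_rng(0)
cols = [f"attr_{i}" for i in range(20)]
group_a = pd.DataFrame(rng.normal(0.0, 1.0, (1000, 20)), columns=cols)
group_b = pd.DataFrame(rng.normal(0.1, 1.0, (1000, 20)), columns=cols)
group_c = pd.DataFrame(rng.normal(0.0, 1.2, (1000, 20)), columns=cols)

# One Kruskal-Wallis test per attribute across the three groups
p_values = []
for col in cols:
    stat, p = stats.kruskal(group_a[col], group_b[col], group_c[col])
    p_values.append(p)

# Correct the per-attribute p-values for multiple testing (Bonferroni here)
rejected, p_adjusted, _, _ = multipletests(p_values, method='bonferroni')
results = pd.DataFrame({'attribute': cols, 'p_raw': p_values,
                        'p_adjusted': p_adjusted, 'significant': rejected})
print(results)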
If you need more information, please let me know.
Thanks a lot! :)
I am trying to write Python code that detects anomalies in time series data. My input data looks something like this:
Here, the regions marked in red are anomalies. I want to get the x-coordinates of the data points that are anomalous. So far I have tried a basic if condition (i.e., if rate < 100, the data point is anomalous) and various statistical techniques such as mean, standard deviation, and rolling average with different window sizes. However, none of them have worked well. Is there a way to achieve what I want using statistical methods? If there is no simple way to do this, I understand that I have to look at machine learning algorithms. In that case, which algorithm would be suitable for my dataset? Thank you.
It looks as if your data comes in lumps; if you are able to distinguish between the lumps (maybe by a certain delay between two samples), you can look at the distribution of the samples in each lump. If you know that your rate will never drop below 100, I would start with that to clean it up a bit, then look at the remaining distribution. The mode should help identify the "middle", most frequently occurring value. Cutting off everything beyond a certain number of standard deviations might work to get clean data, but there is no guarantee that you won't cut off some of the data you need.
Edit: you'd have to bin your data before getting the mode.
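A rough sketch of the idea: bin the data, take the centre of the fullest bin as the mode, then cut everything beyond k standard deviations (the threshold of 100 and the constants are illustrative):

import numpy as np

def find_anomalies(rate, k=3.0, n_bins=50):
    """Flag indices whose value is far from the mode of the distribution."""
    rate = np.asarray(rate, dtype=float)

    # Cheap pre-filter: the rate should never drop below 100
    candidate = rate < 100

    # Bin the remaining data; use the centre of the fullest bin as the mode
    clean = rate[~candidate]
    counts, edges = np.histogram(clean, bins=n_bins)
    mode = (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1]) / 2

    # Anything further than k standard deviations from the mode is suspect
    outlier = np.abs(rate - mode) > k * clean.std()
    return np.flatnonzero(candidate | outlier)  # x-coordinates (indices)

rate = np.concatenate([np.random.normal(500, 10, 200), [50, 40],
                       np.random.normal(500, 10, 200)])
print(find_anomalies(rate))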
In general, log and exp functions should be roughly the same speed. I would expect the numpy and scipy implementations to be relatively straightforward wrappers. numpy.log() and scipy.log() have similar speed, as expected. However, I found that numpy.log() is ~60% slower than these exp() functions, and scipy.log() is 100% slower. Does anyone know the reason for this?
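For reference, a comparison of this kind can be reproduced with a sketch like the following (absolute timings will vary by machine and library version):

import timeit
import numpy as np

x = np.random.rand(1_000_000) + 0.5  # keep values positive for log

t_exp = timeit.timeit(lambda: np.exp(x), number=100)
t_log = timeit.timeit(lambda: np.log(x), number=100)
print(f"exp: {t_exp:.3f}s  log: {t_log:.3f}s  ratio: {t_log / t_exp:.2f}")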
Not sure why you think that both should be "roughly the same speed". It's true that both can be calculated using a Taylor series (which, even by itself, means little without analyzing the error term), but then the numerical tricks kick in.
E.g., an algebraic identity can be used to transform the original exp Taylor series into a more efficient two-jump power series. For the power series, see here a discussion of by-case optimizations, some of which involve a lookup table.
Which arguments did you give the functions: the same ones? The worst case for each?
What was the accuracy of the results? And how do you measure the accuracy in each case: absolutely or relatively?
Edit: It should be noted that these libraries can also have different backends.
I have many lines of georeferenced hydrological data with weekly resolution:
Station name, Lat, Long, Week 1 average, Week 2 average ... Week 52 average
Unfortunately, I also have some data with only monthly resolution:
Station name, Lat, Long, January average, February average ... December average
Rather than "reinventing the wheel," can anyone recommend a favorite module, package, or technique that would provide a reasonable interpolation of weekly values from monthly values? Linear would be fine, but it would be nice if we could use the coordinates to improve the interpolation based on nearby stations.
I've tagged this post with python because it's the language I've been using recently (although not its statistical functions). If the answer is "use a stats program like R," so be it, but I'm curious as to what's out there for Python. Thanks!
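To illustrate what I mean by linear interpolation (ignoring the spatial aspect for now), a plain-pandas sketch with illustrative values:

import numpy as np
import pandas as pd

# Twelve monthly averages for one station, anchored at mid-month
months = pd.to_datetime([f'2012-{m:02d}-15' for m in range(1, 13)])
monthly = pd.Series(np.random.rand(12) * 100, index=months)

# Merge in a weekly index, interpolate linearly in time,
# then keep only the weekly points
weeks = pd.date_range(months[0], months[-1], freq='W')
weekly = (monthly.reindex(monthly.index.union(weeks))
                 .interpolate(method='time')
                 .reindex(weeks))
print(weekly.head())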
I haven't had a chance to dig into it, but the hpgl (High Performance Geostatistics Library) provides a number of kriging (geospatial interpolation) methods:
Algorithms
Simple Kriging (SK)
Ordinary Kriging (OK)
Indicator Kriging (IK)
Local Varying Mean Kriging (LVM Kriging)
Simple CoKriging (Markov Models 1 & 2)
Sequential Indicator Simulation (SIS)
Correlogram Local Varying Mean SIS (CLVM SIS)
Local Varying Mean SIS (LVM SIS)
Sequential Gaussian Simulation (SGS)
If you are interested in expanding your experience into R, there are a number of good, well-used and well-documented packages out there. I would start by looking at the Spatial Task View, which lists the packages that can be used for spatial data. One of its paragraphs deals with interpolation. I am most familiar with automap/gstat (I wrote automap); gstat in particular is a powerful geostatistics package which supports a wide range of methods.
http://cran.r-project.org/web/views/Spatial.html
Integrating Python and R can be done in multiple ways, e.g., using system calls or an in-memory link using RPy. See also:
Python interface for R Programming Language
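For the in-memory route, a minimal sketch using rpy2 (the actively maintained successor to RPy; this assumes R, rpy2, and the R gstat package are installed):

import rpy2.robjects as ro

# Evaluate R code from Python; anything gstat/automap can do in an
# R session can be driven from here
ro.r('library(gstat)')
version = ro.r('R.version.string')
print(version[0])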
I am looking into doing the same thing, and I found this kriging module written by Sat Kumar Tomer at AMBHAS.
There appear to be methods for producing variograms and performing ordinary kriging.
I'll update this answer if I use this and make further discoveries.
Since I originally posted this question (in 2012!), an actively developed Python kriging module has been released: https://github.com/bsmurphy/PyKrige
There's also this older option:
https://github.com/capaulson/pyKriging
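A minimal ordinary-kriging sketch with PyKrige (station coordinates and values are illustrative):

import numpy as np
from pykrige.ok import OrdinaryKriging

# A handful of stations: lon, lat, and the value to interpolate
lon = np.array([4.7, 4.9, 5.1, 5.3, 5.0])
lat = np.array([52.0, 52.2, 52.1, 52.3, 51.9])
val = np.array([10.0, 12.5, 11.2, 13.1, 9.8])

ok = OrdinaryKriging(lon, lat, val, variogram_model='linear')

# Predict on a regular grid; ss holds the kriging variance
grid_lon = np.linspace(4.6, 5.4, 20)
grid_lat = np.linspace(51.8, 52.4, 20)
z, ss = ok.execute('grid', grid_lon, grid_lat)
print(z.shape)  # (20, 20)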