I'm trying to draw an autocorrelation plot of a time series using pandas' autocorrelation_plot().
According to this SO post, pandas.tools was removed in 0.24.0 and the autocorrelation_plot function can now be found in pandas.plotting. However, the API reference does not mention this function.
I'm able to plot an autocorrelation by importing the function, but where can I find the documentation?
import numpy as np
from pandas.plotting import autocorrelation_plot # works fine
slope = -1
offset = 250
noise_scale = 100
npts = 100
x = np.linspace(0, 100, npts)
y = slope*x + noise_scale*np.random.rand(npts) + offset
autocorrelation_plot(y)
Python: 3.7.2
Pandas: 0.24.1
I think this would probably be more appropriate as an issue on GitHub.
In any case, autocorrelation_plot and the similar plots (andrews_curves, radviz,...) are probably going to be moved out of pandas, into a separate package. So you can expect to have to call something like pandas_matplotlib.autocorrelation_plot() in the future (see #28177).
In the meantime, I'm adding it and some other missing functions to the documentation in #28179. When the pull request is merged, you'll be able to see the docs in https://dev.pandas.io. But there is nothing very interesting for autocorrelation_plot:
Have a look at:
https://github.com/pandas-dev/pandas/blob/v0.24.1/pandas/plotting/_misc.py#L600
Looks like it was buried in the plotting._misc source code.
You can at least find a reference and a short doc here: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#visualization-autocorrelation
Btw, you can search the docs for any keyword: https://pandas.pydata.org/pandas-docs/stable/search.html?q=autocorrelation_plot&check_keywords=yes&area=default#
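For readers wondering what the plot actually shows, here is a minimal sketch of the lag-h sample autocorrelation, which, as far as I can tell from the _misc.py source linked above, is the quantity autocorrelation_plot draws. The function name sample_autocorrelation and the random-walk series are my own assumptions for illustration, not pandas API, and the sketch starts at lag 0 for convenience.

```python
import numpy as np

def sample_autocorrelation(values):
    """Return the sample autocorrelation r(h) for lags h = 0 .. n-1."""
    data = np.asarray(values, dtype=float)
    n = len(data)
    centered = data - data.mean()
    c0 = np.sum(centered ** 2) / n  # lag-0 autocovariance (the variance)
    return np.array([
        np.sum(centered[: n - h] * centered[h:]) / n / c0
        for h in range(n)
    ])

# Hypothetical test series: a random walk, which is strongly autocorrelated.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))

acf = sample_autocorrelation(series)
band_95 = 1.96 / np.sqrt(len(series))  # approximate 95% band for white noise
print(acf[:5], band_95)
```

Values outside plus or minus band_95 are the ones usually read as significantly different from zero.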
I'm trying to use matplotlib to plot a histogram but keep running into this error:
MatplotlibDeprecationWarning: The resize_event function was deprecated in Matplotlib 3.6 and will be removed two minor releases later. Use callbacks.process('resize_event', ResizeEvent(...)) instead.
Here's my code; feedback on how I can clean up the logical expressions also welcome.
lower_quartile = df['2020 Population'].quantile(0.25)
mid_quartile = df['2020 Population'].quantile(0.5)
upper_quartile = df['2020 Population'].quantile(0.75)
new_data = df.loc[df['2020 Population'] > lower_quartile]
final_2020_range = new_data.loc[df['2020 Population'] < upper_quartile]
check = final_2020_range['2020 Population']
plt.hist(check)
Seems like you can find your answer here:
https://github.com/matplotlib/matplotlib/issues/23921
In short: it's a bug, and it will be corrected in 3.6.1
I'm facing the same issue and found a workaround:
import matplotlib
matplotlib.use('TkAgg')
The warning will still appear, but the plots are now displayed.
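On the side request about tidying the logical expressions, one possible cleanup is to compute both quartiles in a single call and combine the two comparisons into one boolean mask built from the same column. This is only a sketch: the helper name between_quartiles and the toy frame are made up for illustration.

```python
import pandas as pd

def between_quartiles(frame, column):
    """Keep rows whose value lies strictly between the 25th and 75th percentile."""
    values = frame[column]
    lower, upper = values.quantile([0.25, 0.75])
    mask = (values > lower) & (values < upper)
    return frame.loc[mask]

# Toy data standing in for the real population table.
df = pd.DataFrame({"2020 Population": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
check = between_quartiles(df, "2020 Population")["2020 Population"]
print(check.tolist())  # prints [4, 5, 6]
```

The resulting check can be passed to plt.hist exactly as in the question.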
I have a dataframe like this:
I want to plot a correlation chart between variables "BTC" and "B3" similar to this one here:
https://charts.coinmetrics.io/correlations/
Can anyone point me to some material where I can study how to do it?
NumPy and matplotlib cover what you need; see their docs for your use case:
https://numpy.org/doc/1.20/reference/generated/numpy.correlate.html#numpy.correlate
https://matplotlib.org/stable/tutorials/index.html
Both are well known and widely used, but they are not part of the Python standard library.
Some would also use pandas, but IMHO it is not needed here at all.
You will also need to load your data into your app. Use something like csv from the standard library, or requests and json if you can call an API to get your data.
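As a rough sketch of the kind of chart linked in the question, which appears to show a correlation computed over a moving window, the following uses NumPy only. The rolling_correlation helper, the 30-point window, and the synthetic BTC and B3 series are all assumptions for illustration. np.corrcoef is used here rather than np.correlate because it returns the normalized Pearson coefficient directly.

```python
import numpy as np

def rolling_correlation(a, b, window):
    """Pearson correlation of a and b over a trailing window of fixed size."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both series must have the same length")
    out = np.full(a.shape, np.nan)  # first window-1 entries stay NaN
    for end in range(window, len(a) + 1):
        out[end - 1] = np.corrcoef(a[end - window:end], b[end - window:end])[0, 1]
    return out

# Synthetic stand-ins for the "BTC" and "B3" columns (not real data).
rng = np.random.default_rng(42)
btc = np.cumsum(rng.normal(size=250))
b3 = 0.5 * btc + np.cumsum(rng.normal(size=250))

corr = rolling_correlation(btc, b3, window=30)
print(corr[29:34])
```

Plotting corr against your date column with matplotlib's ax.plot then gives a line chart of correlation over time.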
To plot a correlation chart in Python:
import matplotlib.pyplot as plt
l = ['id1','id2','id3','id4']
y1 = [3,5,6,2]
y2 = [4,2,5,1]
fig,ax = plt.subplots()
ax.scatter(y1, y2)
for i, txt in enumerate(l):
    ax.annotate(txt, (y1[i], y2[i]))
plt.show()
I'm working in a Jupyter Notebook and deleted some code I thought I didn't need. Now I get an overflow error when running the notebook. I'm pretty sure the code used to work just fine and that the problem is caused by me stupidly deleting stuff.
Anyway, I can't seem to find what is missing and would really appreciate help. I take a list of coordinates, convert it to a LineString, and then transform it. Finally, I look up the length.
import pyproj
from pyproj import Transformer
from shapely.ops import transform
from shapely.geometry import LineString
route = [[41.875562, -87.624421], [29.949932, -90.070116], [40.712728, -74.006015]]
ls = LineString(route)
project = pyproj.Transformer.from_proj(
    pyproj.Proj(init='epsg:4326'),
    pyproj.Proj(init='epsg:3857'))
ls_metric = transform(project.transform, ls)
ls_metric_length = round(ls_metric.length / 1000)
This returns
OverflowError: cannot convert float infinity to integer
The problem already arises at ls_metric, which does not produce a valid LineString.
I ran your code and got this warning:
FutureWarning: '+init=<authority>:<code>' syntax is deprecated.
'<authority>:<code>' is the preferred initialization method
Sure enough, after changing the pyproj Transformer as the warning suggests, I got a result:
project = pyproj.Transformer.from_proj(
    pyproj.Proj('epsg:4326'),
    pyproj.Proj('epsg:3857'))
This gives a length of 3984 km.
I used the latest versions in a venv:
pyproj==2.6.0
Shapely==1.7.0
The warning above also gives another important note regarding axis order changes; in short:
pyproj.Proj('epsg:4326') works with [lat,lng], [lat,lng] ...
pyproj.Proj(init='epsg:4326') works with [lng,lat], [lng,lat] ...
the first one being the preferred way while the second is deprecated.
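To illustrate why the axis order matters, here is a pure-Python sketch of the spherical Web Mercator forward projection, which is my understanding of what EPSG:3857 applies. The helper names are hypothetical, and this is a cross-check rather than a replacement for pyproj. If the pairs are read as (lng, lat), the second point's -90.070116 becomes a latitude beyond the pole, which plausibly explains the infinite coordinates in the question.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius assumed for Web Mercator (EPSG:3857)

def to_web_mercator(lat_deg, lng_deg):
    """Project one (lat, lng) pair in degrees to (x, y) in metres."""
    if not -90.0 < lat_deg < 90.0:
        raise ValueError(f"latitude out of range: {lat_deg}")
    x = EARTH_RADIUS_M * math.radians(lng_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def path_length_km(points_lat_lng):
    """Sum the straight-line segment lengths of the projected path, in km."""
    projected = [to_web_mercator(lat, lng) for lat, lng in points_lat_lng]
    metres = sum(
        math.hypot(q[0] - p[0], q[1] - p[1])
        for p, q in zip(projected, projected[1:])
    )
    return metres / 1000

# Same route as in the question, given as [lat, lng] pairs.
route = [[41.875562, -87.624421], [29.949932, -90.070116], [40.712728, -74.006015]]
length_km = path_length_km(route)
print(round(length_km))
```

Within pyproj itself, Transformer.from_crs accepts always_xy=True, which is another option when your data is stored in (lng, lat) order.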
I've been a longtime R user, recently transitioning over to Python, and I've been trying to carry over my knowledge of plotting with ggplot2, since it is so intuitive. Plotnine is supposedly the most ggplot2-esque plotting library, and I've successfully recreated most graphs with it, except critically how to plot regular functions.
In R, you can easily define an equation as a function, pass it to a stat_function() layer, set the limits of the graph in place of the data argument, and successfully plot a parabola or the like. However, the syntax for setting the graph's limits must be different in Python (perhaps using numpy?), and equations are defined using sympy, which is another divergence for me.
So how would I go about plotting functions with plotnine? The above two hurdles are the two differences with ggplot2 that I think are causing me trouble, since plotnine has so few examples online.
P.S. This is an example of what I want to recreate in Python using plotnine:
> library(ggplot2)
> basic_plot <- function(x) x^2 + 2.5
> graph <- ggplot(data.frame(x=c(-5,5)), aes(x=x)) +
+ stat_function(fun = basic_plot)
> graph
You do not need numpy, it works just fine the "standard" way! :)
from plotnine import *
import pandas as pd
(ggplot(pd.DataFrame(data={"x": [-5, 5]}), aes(x="x"))
+ stat_function(fun=lambda x: x**2+2.5))
One of the main differences that caused me problems was the same as posted in the question. Specifically:
in R
aes(x = x) or aes(x)
in plotnine
aes(x = 'x')
If I try to run the script below I get the error: LinAlgError: SVD did not converge in Linear Least Squares. I have used the exact same script on a similar dataset and there it works. I have tried to search for values in my dataset that Python might interpret as a NaN but I cannot find anything.
My dataset is quite large and impossible to check by hand. (But I think my dataset is fine). I also checked the length of stageheight_masked and discharge_masked but they are the same. Does anyone know why there is an error in my script and what can I do about it?
import numpy as np
import datetime
import matplotlib.dates
import matplotlib.pyplot as plt
from scipy import polyfit, polyval
kwargs = dict(
    delimiter='\t',
    skip_header=0,
    missing_values='NaN',
    converters={0: matplotlib.dates.strpdate2num('%d-%m-%Y %H:%M')},
    dtype=float,
    names=True,
)
rating_curve_Gillisstraat = np.genfromtxt(r'G:\Discharge_and_stageheight_Gillisstraat.txt', **kwargs)
discharge = rating_curve_Gillisstraat['discharge'] #change names of columns
stageheight = rating_curve_Gillisstraat['stage'] - 131.258
#mask NaN
discharge_masked = np.ma.masked_array(discharge,mask=np.isnan(discharge)).compressed()
stageheight_masked = np.ma.masked_array(stageheight,mask=np.isnan(discharge)).compressed()
#sort
sort_ind = np.argsort(stageheight_masked)
stageheight_masked = stageheight_masked[sort_ind]
discharge_masked = discharge_masked[sort_ind]
#regression
a1,b1,c1 = polyfit(stageheight_masked, discharge_masked, 2)
discharge_predicted = polyval([a1,b1,c1],stageheight_masked)
print 'regression coefficients'
print (a1,b1,c1)
#create upper and lower uncertainty
upper = discharge_predicted*1.15
lower = discharge_predicted*0.85
#create scatterplot
plt.scatter(stageheight,discharge,color='b',label='Rating curve')
plt.plot(stageheight_masked,discharge_predicted,'r-',label='regression line')
plt.plot(stageheight_masked,upper,'r--',label='15% error')
plt.plot(stageheight_masked,lower,'r--')
plt.axhline(y=1.6,xmin=0,xmax=1,color='black',label='measuring range')
plt.title('Rating curve Catsop')
plt.ylabel('discharge')
plt.ylim(0,2)
plt.xlabel('stageheight[m]')
plt.legend(loc='upper left', title='Legend')
plt.grid(True)
plt.show()
I don't have your data file, but it is almost always the case that when you get that error you have NaNs or infinities in your data. Look for both of those using pd.notnull or np.isfinite.
As others have pointed out, the problem is likely that some rows have no numeric values for the algorithm to work with. This is an issue with most regressions.
That's the problem. The solution, then, is to do something about it, and that depends on the data. Often you can replace the NaNs with 0s, using pandas .fillna(0) for example. Sometimes you might have to interpolate missing values, and pandas .interpolate() is probably the simplest way to do that. Or, when it's not a time series, you might be able to simply drop the rows with NaNs in them, using for example the pandas .dropna() method. And sometimes it's not about the NaNs but about infs or other values, for which there are other solutions: https://stackoverflow.com/a/55293137/12213843
Exactly which way to go depends on the data, and it's up to you to interpret it. Domain knowledge goes a long way towards interpreting the data well.
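A minimal sketch of the "drop the bad rows" option with plain NumPy follows; the finite_pairs helper and the toy arrays are assumptions for illustration. Note that the question builds both masks from np.isnan(discharge) alone, so a NaN in stage would slip through. That is only a guess at the cause, since the data file is not available.

```python
import numpy as np

def finite_pairs(x, y):
    """Keep only positions where both x and y are finite (no NaN, no inf)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.isfinite(x) & np.isfinite(y)
    return x[keep], y[keep]

# Toy data: y = (x + 1)**2 with one NaN in x and one inf in y.
stage = np.array([0.0, 1.0, np.nan, 3.0, 4.0, 5.0])
discharge = np.array([1.0, 4.0, 9.0, np.inf, 25.0, 36.0])

stage_ok, discharge_ok = finite_pairs(stage, discharge)
coeffs = np.polyfit(stage_ok, discharge_ok, 2)
print(len(stage_ok), coeffs)
```

Building the mask from both arrays keeps them aligned, so the sort and fit steps that follow see the same rows.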
As ski_squaw mentions, the error is most often due to NaNs. For me, however, it appeared after a Windows update. I was using numpy version 1.16; moving to 1.19.3 solved the issue (run pip install numpy==1.19.3 --user in cmd).
This GitHub issue explains it in more detail:
https://github.com/numpy/numpy/issues/16744
Numpy 1.19.3 doesn't work on Linux and 1.19.4 doesn't work on Windows.
I developed my code on Windows 8.
Now that I'm using Windows 10, the problem popped up!
It was resolved as Joris said:
pip install numpy==1.19.3
My example after the fix:
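import numpy as np

def calculating_slope(x):
    # x is a pandas Series; turn infinities into NaN, then drop all NaNs
    x = x.replace(np.inf, np.nan).replace(-np.inf, np.nan).dropna()
    if len(x) > 1:
        # slope of a straight-line fit against the positional index
        slope = np.polyfit(range(len(x)), x, 1)[0]
    else:
        slope = 0
    return slope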