I am trying to decompose a time series. The dataset is a 2x8638 matrix. Here is the code:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
df1 = pd.read_csv("u_x_ts.csv").set_index("0")
df1.head()
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(df1, model='multiplicative')
result.plot()
plt.show()
Then Python returns this error message:
ValueError: Multiplicative seasonality is not appropriate for zero and negative values
I think statsmodels can't cope with the values at the beginning of the series, which are very small or zero.
If anyone knows a way around this problem, I'd appreciate it.
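The error just means the multiplicative model rejects observations that are zero or negative. As a minimal sketch of two ways around it (assuming df1 is the frame loaded above; you may also need a period argument if the index carries no inferred frequency):
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Option 1: an additive model accepts zero and negative values
result = seasonal_decompose(df1, model='additive')

# Option 2: keep the multiplicative model by shifting the series strictly above zero
offset = abs(float(df1.min().min())) + 1  # illustrative offset, pick one that suits your data
result = seasonal_decompose(df1 + offset, model='multiplicative')

result.plot()
plt.show()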
Can someone help me with how to create a scatter plot? I have written the following code, but it is not the scatter plot I expected: all the data is concentrated at only 3 values of the x-variable.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from scipy.stats import skew
from warnings import filterwarnings
filterwarnings('ignore')
df_transactions = pd.read_csv('transactions.csv')
daily_revenue= df_transactions.groupby("days_after_open").sum()['revenue']
df_transactions["daily_revenue"] = daily_revenue
x = df_transactions["days_after_open"]
y = df_transactions["daily_revenue"]
plt.scatter(x,y,alpha=0.2)
plt.xlabel("Days After Open (days)")
plt.ylabel("Daily Reveue ($)")
plt.savefig("plot")
(image of the resulting dataframe)
Define 'daily_revenue' properly before moving on to the scatter plot; as written, the grouped series is indexed by days_after_open, so assigning it back onto df_transactions does not line up with the rows, and y = df_transactions["daily_revenue"] does not hold the values you expect.
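As a sketch of what that means (assuming the same transactions.csv with days_after_open and revenue columns), the grouped series already has days_after_open as its index, so it can be plotted directly instead of being assigned back onto the original frame:
import pandas as pd
import matplotlib.pyplot as plt

df_transactions = pd.read_csv('transactions.csv')

# total revenue per day; the group keys become the index of the resulting series
daily_revenue = df_transactions.groupby("days_after_open")["revenue"].sum()

plt.scatter(daily_revenue.index, daily_revenue.values, alpha=0.2)
plt.xlabel("Days After Open (days)")
plt.ylabel("Daily Revenue ($)")
plt.savefig("plot")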
Hello and thank you in advance for your help!
I am getting ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None when I try to do a time series decomposition on data pulled from GitHub. I think I have a basic understanding of the error, but I do not get it when I read the data directly from a file on my computer instead of pulling it from GitHub. Why do I only get this error when I pull my data from GitHub, and how should I change my code so that I no longer get it?
import pandas as pd
import numpy as np
%matplotlib inline
from statsmodels.tsa.seasonal import seasonal_decompose
topsoil = pd.read_csv('https://raw.githubusercontent.com/the-datadudes/deepSoilTemperature/master/meanDickinson.csv', parse_dates=True)
topsoil = topsoil.dropna()
topsoil.head()
topsoil.plot();
result = seasonal_decompose(topsoil['Topsoil'],model='ad')
from pylab import rcParams
rcParams['figure.figsize'] = 12,5
result.plot();
Try this. The read_csv call leaves the frame with a plain RangeIndex (parse_dates=True only parses an index, and no index_col is given), so seasonal_decompose cannot infer a period; parse the Date column and set an explicit daily frequency instead:
import pandas as pd
import numpy as np
%matplotlib inline
from statsmodels.tsa.seasonal import seasonal_decompose
topsoil = pd.read_csv('https://raw.githubusercontent.com/the-datadudes/deepSoilTemperature/master/meanDickinson.csv',parse_dates=True)
topsoil = topsoil.dropna()
topsoil.head()
topsoil.plot();
topsoil['Date'] = pd.to_datetime(topsoil['Date'])
topsoil = topsoil.set_index('Date').asfreq('D')
result = seasonal_decompose(topsoil, model='ad')
from pylab import rcParams
rcParams['figure.figsize'] = 12,5
result.plot();
Output: the decomposition plot with observed, trend, seasonal, and residual panels.
Try adding this:
freq=12, extrapolate_trend=12
(on statsmodels 0.11 and later the argument is named period rather than freq). The full code would look like:
from pylab import rcParams
import statsmodels.api as sm
import matplotlib.pyplot as plt
rcParams['figure.figsize'] = 12, 8
# on statsmodels >= 0.11, use period=12 instead of freq=12
decomposition = sm.tsa.seasonal_decompose(data.Column, model='additive', freq=12, extrapolate_trend=12)
fig = decomposition.plot()
plt.show()
I'm trying to work around this issue that I am facing here.
#import libraries
from __future__ import division
from datetime import datetime, timedelta,date
import pandas as pd
%matplotlib inline
from sklearn.metrics import classification_report,confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
import plotly.offline as pyoff
import plotly.graph_objs as go
from sklearn.model_selection import KFold, cross_val_score, train_test_split
# initiate plotly
pyoff.init_notebook_mode()
# read data from csv and redo the data work we did before
tx_data = pd.read_csv(r'C:\Users\aayus\OneDrive\Desktop\Aayu\College Project\OnlineRetail.csv', encoding='latin1')
tx_data['InvoiceDate'] = pd.to_datetime(tx_data['InvoiceDate'])
tx_data
tx_uk = tx_data.query("Country=='United Kingdom'").reset_index(drop=True)
tx_uk
Everything is working perfectly up to here, but as soon as I add this section of the code, it gives an error.
#create 3m and 6m dataframes
tx_3m = tx_uk[(tx_uk.InvoiceDate < date(2011,6,1)) & (tx_uk.InvoiceDate >= date(2011,3,1))].reset_index(drop=True)
tx_6m = tx_uk[(tx_uk.InvoiceDate >= date(2011,6,1)) & (tx_uk.InvoiceDate < date(2011,12,1))].reset_index(drop=True)
The error is "Invalid comparison between dtype=datetime64[ns] and date".
I am still new to numpy and pandas, so I would really appreciate the help.
Thank you!
.dt.date converts the datetime64 series to datetime.date objects, which can be compared to date(2011,6,1).
e.g., tx_uk.InvoiceDate.dt.date < date(2011,6,1) will work!
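A sketch of how that fix (or, alternatively, a pd.Timestamp comparison) would look on the two frames from the question, assuming tx_uk as built above:
from datetime import date
import pandas as pd

# compare date against date via .dt.date ...
tx_3m = tx_uk[(tx_uk.InvoiceDate.dt.date >= date(2011, 3, 1)) &
              (tx_uk.InvoiceDate.dt.date < date(2011, 6, 1))].reset_index(drop=True)

# ... or compare the datetime64 column against pandas Timestamps directly
tx_6m = tx_uk[(tx_uk.InvoiceDate >= pd.Timestamp('2011-06-01')) &
              (tx_uk.InvoiceDate < pd.Timestamp('2011-12-01'))].reset_index(drop=True)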
This is a very straightforward question. I have an x-axis of years and a y-axis of numbers increasing linearly by 100. When plotting this with pandas and matplotlib, I get a graph that does not represent the data whatsoever. I need some help figuring this out because it is such a small amount of code:
The CSV is as follows:
A,B
2012,100
2013,200
2014,300
2015,400
2016,500
2017,600
2018,700
2012,800
2013,900
2014,1000
2015,1100
2016,1200
2017,1300
2018,1400
The Code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv("CSV/DSNY.csv")
data.set_index("A", inplace=True)
data.plot()
plt.show()
The graph this yields is clearly very inconsistent with the data. Any suggestions?
The default behaviour of matplotlib/pandas is to draw a line between successive data points, not to mark each data point with a symbol.
Fix: change data.plot() to data.plot(style='o'), or data.plot(marker='o', linewidth=0).
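As a minimal sketch of that fix, assuming the same CSV/DSNY.csv file as in the question:
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("CSV/DSNY.csv")
data.set_index("A", inplace=True)

# 'o' draws each (A, B) pair as an individual point instead of joining them with a line
data.plot(style='o')
plt.show()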
All you need is to sort by A before plotting; the line plot connects points in row order, so the unsorted duplicate years make the line jump back and forth.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv("CSV/DSNY.csv").reset_index()
data = data.sort_values('A')
data.set_index("A", inplace=True)
data.plot()
plt.show()
I have a series of data which consists of values from several experiments (1-40; in the MWE it is 1-5). The overall number of entries in my original data is about 4,000,000, which I am trying to smooth in order to display it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import spline
from statsmodels.nonparametric.smoothers_lowess import lowess
df = pd.DataFrame()
df["values"] = np.random.randint(100000, 200000, 1000)
df["id"] = [1,2,3,4,5] * 200
plt.figure(1, figsize=(11.69,8.27))
# Both fail for my amount of data:
plt.plot(spline(df["values"], df["id"], range(100)), "r-")
plt.plot(lowess(df["values"], df["id"]), "r-")
Both scipy.interpolate.spline and statsmodels.nonparametric.smoothers_lowess.lowess throw out-of-memory exceptions for my data. Is there any efficient way to solve this, like ggplot2's geom_smooth() in GNU R?
I can't quite tell what you're getting at with all the dimensions to your data, but one very simple thing you can try is to just use the 'markevery' kwarg like so:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(1, 100, int(1e7))  # linspace needs an integer sample count
y = x**2
plt.figure(1, figsize=(11.69, 8.27))
# markevery only thins out markers, so give plot a marker style for it to take effect
plt.plot(x, y, 'o', markevery=100)
plt.show()
This will only draw a marker at every nth point (n=100 here).
If that doesn't help then you may want to try just a simple numpy interpolation with fewer samples, like so:
x_large = np.linspace(1, 100, int(1e7))
y_large = x_large**2
x_small = np.linspace(1, 100, int(1e3))
# resample the 10,000,000-point curve down to 1,000 points
y_small = np.interp(x_small, x_large, y_large)
plt.plot(x_small, y_small)
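If you want a smoothed curve rather than thinned-out raw points, a pandas rolling mean is another lightweight option; this is only a sketch reusing the MWE's DataFrame, not a drop-in replacement for lowess or spline:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame()
df["values"] = np.random.randint(100000, 200000, 1000)
df["id"] = [1, 2, 3, 4, 5] * 200

# a centred rolling mean runs in a single pass, so it stays cheap even for millions of rows
df["smoothed"] = df["values"].rolling(window=50, center=True).mean()
plt.plot(df.index, df["values"], alpha=0.3)
plt.plot(df.index, df["smoothed"], "r-")
plt.show()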