I am trying to make a model for predicting energy production, by using ARMA model.
The data I can use for training is as following;
(https://github.com/soma11soma11/EnergyDataSimulationChallenge/blob/master/challenge1/data/training_dataset_500.csv)
ID Label House Year Month Temperature Daylight EnergyProduction
0 0 1 2011 7 26.2 178.9 740
1 1 1 2011 8 25.8 169.7 731
2 2 1 2011 9 22.8 170.2 694
3 3 1 2011 10 16.4 169.1 688
4 4 1 2011 11 11.4 169.1 650
5 5 1 2011 12 4.2 199.5 763
...............
11995 19 500 2013 2 4.2 201.8 638
11996 20 500 2013 3 11.2 234 778
11997 21 500 2013 4 13.6 237.1 758
11998 22 500 2013 5 19.2 258.4 838
11999 23 500 2013 6 22.7 122.9 586
As shown above, I can use data from July 2011 to May 2013 for training.
Using the training, I want to predict energy production on June 2013 for each 500 house.
The problem is that the time series data is not stationary and has trend components and seasonal components (I checked it as following.).
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data_train = pd.read_csv('../../data/training_dataset_500.csv')
rng=pd.date_range('7/1/2011', '6/1/2013', freq='M')
house1 = data_train[data_train.House==1][['EnergyProduction','Daylight','Temperature']].set_index(rng)
fig, axes = plt.subplots(nrows=1, ncols=3)
for i, column in enumerate(house1.columns):
house1[column].plot(ax=axes[i], figsize=(14,3), title=column)
plt.show()
With this data, I cannot implement ARMA model to get good prediction. So I want to get rid of the trend components and a seasonal components and make the time series data stationary. I tried this problem, but I could not remove these components and make it stationary..
I would recommend the Hodrick-Prescott (HP) filter, which is widely used in macroeconometrics to separate long-term trending component from short-term fluctuations. It is implemented statsmodels.api.tsa.filters.hpfilter.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
df = pd.read_csv('/home/Jian/Downloads/data.csv', index_col=[0])
# get part of the data
x = df.loc[df.House==1, 'Daylight']
# hp-filter, set parameter lamb=129600 following the suggestions for monthly data
x_smoothed, x_trend = sm.tsa.filters.hpfilter(x, lamb=129600)
fig, axes = plt.subplots(figsize=(12,4), ncols=3)
axes[0].plot(x)
axes[0].set_title('raw x')
axes[1].plot(x_trend)
axes[1].set_title('trend')
axes[2].plot(x_smoothed)
axes[2].set_title('smoothed x')
Related
i want to plot the data which is shown below and compere it to a function which gives me the theoretical plot. I am able to plot the data with its uncertainty, but i am struguling to plot the mathematical function function which gives me the theoretical plot.
amplitude uncertainty position
5.2 0.429343685 0
12.2 1.836833144 1
21.4 0.672431409 2
30.2 0.927812481 3
38.2 1.163321108 4
44.2 1.340998136 5
48.4 1.506088975 6
51 1.543016526 7
51.2 1.587229032 8
49.8 1.507327436 9
46.2 1.400355669 10
40.6 1.254401849 11
32.5 0.995301462 12
24.2 0.753044487 13
14 0.58 14
7 0.29 15
here is my code so far:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_excel("Verdier_6.xlsx")
verdier = data.values
frekvens = verdier [:,3]
effektresonans = verdier [:,0]
usikkerhet = verdier [:,1]
x = np.arange(0,15,0.1)
p= 28.2
r=0.8156
v= 343.8
f= 1117
y=p*np.sqrt(1+r**2+2*r*np.cos(((2*np.pi)/(v/f))*x))
plt.plot(x,y)
plt.plot(frekvens, effektresonans)
plt.errorbar(frekvens, effektresonans, usikkerhet, fmt = "o")
plt.title("")
plt.xlabel("Posisjon, X [cm]")
plt.ylabel("Amplitude, U [mV] ")
plt.grid()
plt.show()
And here is here is a image of the plot with only experimental data shown above:
and here is an image of how my experimental and theoretical plot look:
and here is an image of how the experimental and theoretical plot should look:
I have a mass pandas DataFrame df:
year count
1983 5
1983 4
1983 7
...
2009 8
2009 11
2009 30
and I aim to sample 10 data points per year 100 times and get the mean and standard deviation of count per year. The signs of the count values are determined randomly.
I want to randomly sample 10 data per year, which can be done by:
new_df = pd.DataFrame(columns=['year', 'count'])
ref = df.year.unique()
for i in range(len(ref)):
appended_df = df[df['year'] == ref[i]].sample(n=10)
new_df = pd.concat([new_df,appended_df])
Then, I assign a sign to count randomly (so that by random chance the count could be positive or negative) and rename it to value, which can be done by:
vlist = []
for i in range(len(new_df)):
if randint(0,1) == 0:
vlist.append(new_df.count.iloc[i])
else:
vlist.append(new_df.count.iloc[i] * -1)
new_data['value'] = vlist
Getting a mean and standard deviation per each year is quite simple:
xdf = new_data.groupby("year").agg([np.mean, np.std]).reset_index()
But I can't seem to find an optimal way to try this sampling 100 times per year, store the mean values, and get the mean and standard deviation of those 100 means per year. I could think of using for loop, but it would take too much of a runtime.
Essentially, the output should be in the form of the following (the values are arbitrary here):
year mean_of_100_means total_sd
1983 4.22 0.43
1984 -6.39 1.25
1985 2.01 0.04
...
2007 11.92 3.38
2008 -5.27 1.67
2009 1.85 0.99
Any insights would be appreciated.
Try:
def fn(x):
_100_means = [x.sample(10).mean() for i in range(100)]
return {
"mean_of_100_means": np.mean(_100_means),
"total_sd": np.std(_100_means),
}
print(df.groupby("year")["count"].apply(fn).unstack().reset_index())
EDIT: Changed the computation of means.
Prints:
year mean_of_100_means total_sd
0 1983 48.986 8.330787
1 1984 48.479 10.384896
2 1985 48.957 7.854900
3 1986 50.821 10.303847
4 1987 50.198 9.835832
5 1988 47.497 8.678749
6 1989 46.763 9.197387
7 1990 49.696 8.837589
8 1991 46.979 8.141969
9 1992 48.555 8.603597
10 1993 50.220 8.263946
11 1994 48.735 9.954741
12 1995 49.759 8.532844
13 1996 49.832 8.998654
14 1997 50.306 9.038316
15 1998 49.513 9.024341
16 1999 50.532 9.883166
17 2000 49.195 9.177008
18 2001 50.731 8.309244
19 2002 48.792 9.680028
20 2003 50.251 9.384759
21 2004 50.522 9.269677
22 2005 48.090 8.964458
23 2006 49.529 8.250701
24 2007 47.192 8.682196
25 2008 50.124 9.337356
26 2009 47.988 8.053438
The dataframe was created:
data = []
for y in range(1983, 2010):
for i in np.random.randint(0, 100, size=1000):
data.append({"year": y, "count": i})
df = pd.DataFrame(data)
I think you can use pandas groupby and sample functions together to take 10 samples from each year of your DataFrame. If you put this in a loop, then you can sample it 100 times, and combine the results.
It sounds like you only need the standard deviation of the 100 means (and you don't need the standard deviation of the sample of 10 observations), so you can calculate only the mean in your groupby and sample, then calculate the standard deviation from each of those 100 means when you are creating the total_sd column of your final DataFrame.
import numpy as np
import pandas as pd
np.random.seed(42)
## create a random DataFrame with 100 entries for the years 1980-1999, length 2000
df = pd.DataFrame({
'year':[year for year in list(range(1980, 2000)) for _ in range(100)],
'count':np.random.randint(1,100,size=2000)
})
list_of_means = []
## sample 10 observations from each year, and repeat this process 100 times, storing the mean for each year in a list
for _ in range(100):
df_sample = df.groupby("year").sample(10).groupby("year").mean()
list_of_means.append(df_sample['count'].tolist())
array_of_means = [np.array(x) for x in list_of_means]
result = pd.DataFrame({
'year': df.year.unique(),
'mean_of_100_means': [np.mean(k) for k in zip(*array_of_means)],
'total_sd': [np.std(k) for k in zip(*array_of_means)]
})
This results in:
>>> result
year mean_of_100_means total_sd
0 1980 50.316 8.656948
1 1981 48.274 8.647643
2 1982 47.958 8.598455
3 1983 49.357 7.854620
4 1984 48.977 8.523484
5 1985 49.847 7.114485
6 1986 47.338 8.220143
7 1987 48.106 9.413085
8 1988 53.487 9.237561
9 1989 47.376 9.173845
10 1990 46.141 9.061634
11 1991 46.851 7.647189
12 1992 49.389 7.743318
13 1993 52.207 9.333309
14 1994 47.271 8.177815
15 1995 52.555 8.377355
16 1996 47.606 8.668769
17 1997 52.584 8.200558
18 1998 51.993 8.695232
19 1999 49.054 8.178929
I want to display mean and standard deviation values above each of the boxplots in the grouped boxplot (see picture).
My code is
import pandas as pd
import seaborn as sns
from os.path import expanduser as ospath
df = pd.read_excel(ospath('~/Documents/Python/Kandidatspeciale/TestData.xlsx'),'Ark1')
bp = sns.boxplot(y='throw angle', x='incident angle',
data=df,
palette="colorblind",
hue='Bat type')
bp.set_title('Rubber Comparison',fontsize=15,fontweight='bold', y=1.06)
bp.set_ylabel('Throw Angle [degrees]',fontsize=11.5)
bp.set_xlabel('Incident Angle [degrees]',fontsize=11.5)
Where my dataframe, df, is
Bat type incident angle throw angle
0 euro 15 28.2
1 euro 15 27.5
2 euro 15 26.2
3 euro 15 27.7
4 euro 15 26.4
5 euro 15 29.0
6 euro 30 12.5
7 euro 30 14.7
8 euro 30 10.2
9 china 15 29.9
10 china 15 31.1
11 china 15 24.9
12 china 15 27.5
13 china 15 31.2
14 china 15 24.4
15 china 30 9.7
16 china 30 9.1
17 china 30 9.5
I tried with the following code. It needs to be independent of number of x (incident angles), for instance it should do the job for more angles of 45, 60 etc.
m=df.mean(axis=0) #Mean values
st=df.std(axis=0) #Standard deviation values
for i, line in enumerate(bp['medians']):
x, y = line.get_xydata()[1]
text = ' μ={:.2f}\n σ={:.2f}'.format(m[i], st[i])
bp.annotate(text, xy=(x, y))
Can somebody help?
This question brought me here since I was also looking for a similar solution with seaborn.
After some trial and error, you just have to change the for loop to:
for i in range(len(m)):
bp.annotate(
' μ={:.2f}\n σ={:.2f}'.format(m[i], st[i]),
xy=(i, m[i]),
horizontalalignment='center'
)
This change worked for me (although I just wanted to print the actual median values). You can also add changes like the fontsize, color or style (i.e., weight) just by adding them as arguments in annotate.
So I have the df.head() being displayed below.I wanted to display the progression of salaries across time spans.As you can see the teams will get repeated across the years and the idea is to
display how their salaries changed over time.So for teamID='ATL' I will have a graph that starts by 1985 and goes all the way to the present time.
I think I will need to select teams by their team ID and have the x axis display time (year) and Y axis display year. I don't know how to do that on Pandas and for each team in my data frame.
teamID yearID lgID payroll_total franchID Rank W G win_percentage
0 ATL 1985 NL 14807000.0 ATL 5 66 162 40.740741
1 BAL 1985 AL 11560712.0 BAL 4 83 161 51.552795
2 BOS 1985 AL 10897560.0 BOS 5 81 163 49.693252
3 CAL 1985 AL 14427894.0 ANA 2 90 162 55.555556
4 CHA 1985 AL 9846178.0 CHW 3 85 163 52.147239
5 ATL 1986 NL 17800000.0 ATL 4 55 181 41.000000
You can use seaborn for this:
import seaborn as sns
sns.lineplot(data=df, x='yearID', y='payroll_total', hue='teamID')
To get different plot for each team:
for team, d in df.groupby('teamID'):
d.plot(x='yearID', y='payroll_total', label='team')
import pandas as pd
import matplotlib.pyplot as plt
# Display the box plots on 3 separate rows and 1 column
fig, axes = plt.subplots(nrows=3, ncols=1)
# Generate a plot for each team
df[df['teamID'] == 'ATL'].plot(ax=axes[0], x='yearID', y='payroll_total')
df[df['teamID'] == 'BAL'].plot(ax=axes[1], x='yearID', y='payroll_total')
df[df['teamID'] == 'BOS'].plot(ax=axes[2], x='yearID', y='payroll_total')
# Display the plot
plt.show()
depending on how many teams you want to show you should adjust the
fig, axes = plt.subplots(nrows=3, ncols=1)
Finally, you could create a loop and create the visualization for every team
I am new to python and I'm trying to plot an overlaid histogram for a manipulated data set from Kaggle. I tried doing it with matplotlib. This is a dataset that shows the history of gun violence in USA in recent years. I have selected only few columns for EDA.
import pandas as pd
data_set = pd.read_csv("C:/Users/Lenovo/Documents/R related
Topics/Assignment/Assignment_day2/04 Assignment/GunViolence.csv")
state_wise_crime = data_set[['date', 'state', 'n_killed', 'n_injured']]
date_value = pd.to_datetime(state_wise_crime['date'])
import datetime
state_wise_crime['Month']= date_value.dt.month
state_wise_crime.drop('date', axis = 1)
no_of_killed = state_wise_crime.groupby(['state','Year'])
['n_killed','n_injured'].sum()
no_of_killed = state_wise_crime.groupby(['state','Year']
['n_killed','n_injured'].sum()
I want an overlaid histogram that shows the no. of people killed and no.of people injured with the different states on the x-axis
Welcome to Stack Overflow! From next time, please post your data like in below format (not a link or an image) to make us easier to work on the problem. Also, if you ask about a graph output, showing the contents of desired graph (even with hand drawing) would be very helpful :)
df
state Year n_killed n_injured
0 Alabama 2013 9 3
1 Alabama 2014 591 325
2 Alabama 2015 562 385
3 Alabama 2016 761 488
4 Alabama 2017 856 544
5 Alabama 2018 219 135
6 Alaska 2014 49 29
7 Alaska 2015 84 70
8 Alaska 2016 103 88
9 Alaska 2017 70 69
As I commented in your original post, a bar plot would be more appropriate than histogram in this case since your purpose appears to be visualizing the summary statistics (sum) of each year with state-wise comparison. As far as I know, the easiest option is to use Seaborn. It depends on how you want to show the data, but below is one example. The code is as simple as below.
import seaborn as sns
sns.barplot(x='Year', y='n_killed', hue='state', data=df)
Output:
Hope this helps.