I'm working on creating a linear trendline from data that contains dates and another measure (volume). The goal is to create a linear trendline that shows how volume trends over time.
The data looks as follows:
date typeID lowPrice highPrice avgPrice volume orders \
0 2003-11-30 22.0 9000.00 9000.00 9000.00 5.0 1.0
1 2003-12-31 22.0 9000.00 9000.00 9000.00 2.0 1.0
2 2004-01-31 22.0 15750.00 15750.00 15750.00 9.5 1.0
3 2004-02-29 22.0 7000.00 7000.00 7000.00 11.0 1.0
4 2004-03-31 22.0 7000.00 7000.00 7000.00 8.0 1.0
6 2004-05-31 22.0 15000.00 15000.00 15000.00 16.0 1.0
10 2004-09-30 22.0 6500.00 6500.00 6500.00 27.0 1.0
The issue is that for some months (the interval at which the dates are stored) no volume data is available, as can be seen above. The following is my current approach to creating a trendline from the available dates:
x = df2["date"]
df2["inc_dates"] = np.arange(len(x))
y = df2["ln_vold"]

plt.subplot(15, 4, count)
plt.plot_date(x, y, xdate=True)

model = smf.ols('ln_vold ~ inc_dates', missing="drop", data=df2).fit()
intercept, coef = model.params

l = [intercept]
for i in range(len(x) - 1):
    l.append(intercept + coef * i)

plt.plot_date(x, l, "r--", xdate=True)
However, the resulting plot clearly isn't the right trendline: the beginning is non-linear.
Now I don't see how this could go wrong, as all I do in the for loop is add a constant increment to an increasing integer. All I'd like to see is a straight trendline going from the intercept to the end.
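One likely culprit is an off-by-one in building l: the list is seeded with the intercept, and the loop's first iteration (i = 0) appends intercept + coef*0, i.e. the intercept again, so the first two fitted values coincide and the line starts flat. A minimal sketch with made-up coefficients:

```python
intercept, coef = 2.0, 0.5   # made-up values standing in for model.params
n = 5

# Original construction: seeded with the intercept, then i starts at 0,
# so the second element is intercept + coef*0 == intercept again.
l_orig = [intercept]
for i in range(n - 1):
    l_orig.append(intercept + coef * i)

# One value per observation, i running over the full range:
l_fixed = [intercept + coef * i for i in range(n)]

print(l_orig)   # [2.0, 2.0, 2.5, 3.0, 3.5] -- flat start
print(l_fixed)  # [2.0, 2.5, 3.0, 3.5, 4.0] -- straight line
```

A list comprehension over the full range(len(x)) gives one fitted value per date; the fitted results' model.predict(df2) would also produce them directly. Note too that inc_dates = np.arange(len(x)) treats every consecutive row as one step, so missing months are compressed rather than left as gaps in time.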
I have some trips, and each trip contains different steps. The data frame looks like the following:
tripId duration (s) distance (m) speed Km/h
1819714 NaN NaN NaN
1819714 6.0 8.511452 5.106871
1819714 10.0 6.908963 2.487227
1819714 5.0 15.960625 11.491650
1819714 6.0 26.481649 15.888989
... ... ... ... ...
1865507 6.0 16.280313 9.768188
1865507 5.0 17.347482 12.490187
1865507 5.0 14.266625 10.271970
1865507 6.0 22.884008 13.730405
1865507 5.0 21.565655 15.527271
I want to know if, on a trip X, the cyclist has braked (speed has decreased by at least 30%).
The problem is that the duration between every two steps is different each time.
For example, if in 6 seconds the speed of a person X decreased from 28 km/h to 15 km/h, we can say he has braked; but if the same drop took much longer, we could not say that.
My question is whether there is something I can apply to the whole data frame to detect braking in a way that makes sense.
The measure of braking is the change in speed relative to the change in time. From your data, I created a column 'acceleration', which is the change in speed (km/h) divided by the duration (seconds). A final column then flags braking when that value is less than -1 (km/h/s).
Note that you need to decide whether a reduction of 1 km/h per second is enough to count as braking.
df['speedChange'] = df.groupby('tripId')['speedKm/h'].diff()  # diff within each trip
df['acceleration'] = df['speedChange'] / df['duration(s)']
df['braking'] = df['acceleration'].apply(lambda x: 'yes' if x < -1 else 'no')
print(df)
Output:
      tripId  duration(s)  distance(m)  speedKm/h  speedChange  acceleration braking
0  1819714.0          6.0     8.511452   5.106871          NaN           NaN      no
1  1819714.0         10.0     6.908963   2.487227    -2.619644     -0.261964      no
2  1819714.0          5.0    15.960625  11.491650     9.004423      1.800885      no
3  1819714.0          6.0    26.481649  15.888989     4.397339      0.732890      no
4  1865507.0          6.0    16.280313   9.768188          NaN           NaN      no
5  1865507.0          5.0    17.347482  12.490187     2.721999      0.544400      no
6  1865507.0          5.0    14.266625  10.271970    -2.218217     -0.443643      no
7  1865507.0          6.0    22.884008  13.730405     3.458435      0.576406      no
The groupby keeps the diff within each trip, so the first row of trip 1865507 gets NaN instead of a spurious change carried over from the previous trip; in this small sample, no row decelerates faster than 1 km/h/s.
I have time-series data in a dataframe. Is there any way to calculate for each day the percent change of that day's value from the average of the previous 7 days?
I have tried
df['Change'] = df['Column'].pct_change(periods=7)
However, this simply computes the percent change between t and t−7 days. I need something like: for each value Ti, find the average of the previous 7 days and subtract it from Ti.
Sure, you can for example use:
s = df['Column']
n = 7
mean = s.rolling(n, closed='left').mean()
df['Change'] = (s - mean) / mean
Note on closed='left'
There was a bug prior to pandas=1.2.0 that caused incorrect handling of closed for fixed windows. Make sure you have pandas>=1.2.0; for example, pandas=1.1.3 will not give the result below.
As described in the docs:
closed: Make the interval closed on the ‘right’, ‘left’, ‘both’ or ‘neither’ endpoints. Defaults to ‘right’.
A simple way to understand is to try with some very simple data and a small window:
a = pd.DataFrame(range(5), index=pd.date_range('2020', periods=5))
b = a.assign(
    sum_left=a.rolling(2, closed='left').sum(),
    sum_right=a.rolling(2, closed='right').sum(),
    sum_both=a.rolling(2, closed='both').sum(),
    sum_neither=a.rolling(2, closed='neither').sum(),
)
>>> b
0 sum_left sum_right sum_both sum_neither
2020-01-01 0 NaN NaN NaN NaN
2020-01-02 1 NaN 1.0 1.0 NaN
2020-01-03 2 1.0 3.0 3.0 NaN
2020-01-04 3 3.0 5.0 6.0 NaN
2020-01-05 4 5.0 7.0 9.0 NaN
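Applying this back to the original question, a quick sanity check on a toy series (using a 2-day window instead of 7, so the arithmetic stays small):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=pd.date_range('2020', periods=3))
n = 2

mean = s.rolling(n, closed='left').mean()   # mean of the *previous* n values
change = (s - mean) / mean

# Day 3: the previous-2-day mean is (10 + 20) / 2 = 15,
# so the change is (30 - 15) / 15 = 1.0, i.e. +100%.
# The first two days have fewer than n prior observations, hence NaN.
print(change)
```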
I recently came across a k-means tutorial that looks a bit different from what I remember the algorithm to be, but it should still do the same; after all, it's k-means. So I gave it a try on some data. Here's how the code looks:
# Assignment stage:
def assignment(data, centroids):
    for i in centroids.keys():
        # sqrt((x1-x2)^2 + (y1-y2)^2 + ...)
        data['distance_from_{}'.format(i)] = np.sqrt(
            (data['soloRatio'] - centroids[i][0])**2
            + (data['secStatus'] - centroids[i][1])**2
            + (data['shipsDestroyed'] - centroids[i][2])**2
            + (data['combatShipsLost'] - centroids[i][3])**2
            + (data['miningShipsLost'] - centroids[i][4])**2
            + (data['exploShipsLost'] - centroids[i][5])**2
            + (data['otherShipsLost'] - centroids[i][6])**2
        )
        print(data['distance_from_{}'.format(i)])
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    data['closest'] = data.loc[:, centroid_distance_cols].idxmin(axis=1)
    data['closest'] = data['closest'].astype(str).str.replace(r'\D+', '', regex=True)
    return data

data = assignment(data, centroids)
and:
# Update stage:
import copy
old_centroids = copy.deepcopy(centroids)

def update(k):
    for i in centroids.keys():
        centroids[i][0] = np.mean(data[data['closest'] == i]['soloRatio'])
        centroids[i][1] = np.mean(data[data['closest'] == i]['secStatus'])
        centroids[i][2] = np.mean(data[data['closest'] == i]['shipsDestroyed'])
        centroids[i][3] = np.mean(data[data['closest'] == i]['combatShipsLost'])
        centroids[i][4] = np.mean(data[data['closest'] == i]['miningShipsLost'])
        centroids[i][5] = np.mean(data[data['closest'] == i]['exploShipsLost'])
        centroids[i][6] = np.mean(data[data['closest'] == i]['otherShipsLost'])
    return k

# TODO: add graphical representation?
while True:
    closest_centroids = data['closest'].copy(deep=True)
    centroids = update(centroids)
    data = assignment(data, centroids)
    if closest_centroids.equals(data['closest']):
        break
When I run the initial assignment stage, it returns the distances. However, when I run the update stage, all distance values become NaN, and I just don't know why or at which point exactly this happens. Maybe I made a mistake I can't spot?
Here's an excerpt of the data I'm working with:
Unnamed: 0 characterID combatShipsLost exploShipsLost miningShipsLost \
0 0 90000654.0 8.0 4.0 5.0
1 1 90001581.0 97.0 5.0 1.0
2 2 90001595.0 61.0 0.0 0.0
3 3 90002023.0 22.0 1.0 0.0
4 4 90002030.0 74.0 0.0 1.0
otherShipsLost secStatus shipsDestroyed soloRatio
0 0.0 5.003100 1.0 10.0
1 0.0 2.817807 6251.0 6.0
2 0.0 -2.015310 752.0 0.0
3 4.0 5.002769 43.0 5.0
4 1.0 3.090204 301.0 7.0
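One likely cause of the NaNs, judging from the code shown: assignment() ends by turning 'closest' into strings ('0', '1', ...), while update() compares it to the integer centroid keys. A string never equals an int, so every data[data['closest'] == i] selection is empty, its mean is NaN, and the NaN centroids make every distance NaN on the next assignment pass. A toy reproduction of the type mismatch:

```python
import pandas as pd

# Stand-in for data['closest'] after assignment: idxmin returns labels like
# 'distance_from_0' and str.replace leaves the *string* '0', not the int 0.
closest = pd.Series(['0', '1', '0'])
matches_as_str = (closest == 0).sum()              # 0 -- '0' never equals int 0

# One possible fix (an assumption about the intent): cast the labels back
# to int once, so they match the integer centroid keys used in update().
matches_as_int = (closest.astype(int) == 0).sum()  # 2

print(matches_as_str, matches_as_int)
```

It is also worth noting that update(k) never uses its argument; it mutates the global centroids and returns k unchanged.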
I am trying to loop through a list to create a series of boxplots using Matplotlib. Each item in the list should produce a figure with two boxplots: one using df1 data and one using df2 data.
I am successfully plotting x1, but x2 is blank and I don't know why.
I am using jupyter notebook with Python 3. Any help is appreciated!
df1 = df[df.order == 1]
df2 = df[df.order == 0]
lst = ['device', 'ship', 'bill']

i = 0
for item in lst:
    plt.figure(i)
    x1 = df1[item].values
    x2 = df2[item].values
    plt.boxplot([x1, x2])
    plt.title(item)
    i = i + 1
The series that I'm trying to plot have the following format with several thousand observations each:
df.order == 1
df['device'] df['ship'] df['bill']
0.0 0.0 0.0
19.0 5.0 0.0
237.0 237.0 237.0
df.order == 0
df['device'] df['ship'] df['bill']
1.0 21.0 0.0
75.0 31.0 100.0
5.0 18.0 71.0
The dataframe contains data for orders. The columns listed in lst are of dtype float64.
Solved it... a couple of NaN values appear to have been preventing me from plotting.
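For reference, the NaN issue can also be handled without touching the dataframe, by filtering each array before handing it to plt.boxplot (a sketch with made-up values; as observed above, NaNs can leave a box blank):

```python
import numpy as np

x1 = np.array([0.0, 19.0, np.nan, 237.0])   # made-up values with a NaN
x2 = np.array([1.0, np.nan, 75.0, 5.0])

# Drop NaNs from each series before plotting:
clean = [x[~np.isnan(x)] for x in (x1, x2)]
print([len(c) for c in clean])  # [3, 3]
# plt.boxplot(clean) then draws both boxes.
```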
My dataframe below is filtered to one value of the column "IdBox" (= 4), which lets me plot the data only for IdBox = 4.
I can't find a way to write a function that redoes this plot quickly when the IdBox value changes. My IdBox values range from 4 to 9, which means 6 graphs.
chaudiere4 = yy[(yy.NameDeviceType== "Chaudière_logement") & (yy.IdBox == 4.0)]
In [898]: chaudiere4
Out[898]:
UnitDeviceType NameDeviceType IdBox IdDeviceValue ValueDeviceValue weekday hour ONOFF
DateDeviceValue
2015-11-27 17:54:00 On/Off Chaudière_logement 4.0 536448.0 On 4.0 17.0 1
2015-11-27 17:54:00 On/Off Chaudière_logement 4.0 536449.0 Off 4.0 17.0 0
2015-11-27 17:54:00 On/Off Chaudière_logement 4.0 536450.0 On 4.0 17.0 1
2015-11-27 17:54:00 On/Off Chaudière_logement 4.0 536451.0 Off 4.0 17.0 0
2015-11-27 18:09:00 On/Off Chaudière_logement 4.0 536453.0 On 4.0 18.0 1
I created a column called ONOFF and grouped it by hour, taking the mean, to do the plot.
chaudiere4 = chaudiere4['ONOFF'].groupby(chaudiere4['hour']).mean()
chaudiere4.plot(kind='bar')
plt.title("Chaudiere ON/OFF")
plt.xlabel('hour')
plt.legend('ONOFF')
plt.axis([0, 24, 0, 1])
plt.show()
Is there a way to do this quickly with a function, instead of creating a new dataframe chaudiere5 for IdBox=5, chaudiere6 for IdBox=6, and so on?
If I understand correctly:
yy[(yy.NameDeviceType == "Chaudière_logement") & (yy.IdBox == 4.0)] \
    .groupby('hour')['ONOFF'].mean() \
    .plot.bar()
You can create a little function for that:
def my_plot(df, IdBox=0, title='Chaudiere ON/OFF'):
    df[(df.NameDeviceType == "Chaudière_logement") & (df.IdBox == IdBox)] \
        .groupby('hour')['ONOFF'].mean() \
        .plot.bar(title=title)
    plt.axis([0, 24, 0, 1])
    plt.show()
now you can call it like this:
my_plot(df, 4.0)
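Since the IdBox values run from 4 to 9, the helper can simply be called in a loop. A self-contained sketch (the function repeated so the snippet runs on its own, toy data standing in for yy, a headless Agg backend, and plt.show() left out so the figures can be saved afterwards):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

def my_plot(df, IdBox=0, title='Chaudiere ON/OFF'):
    df[(df.NameDeviceType == "Chaudière_logement") & (df.IdBox == IdBox)] \
        .groupby('hour')['ONOFF'].mean() \
        .plot.bar(title=title)
    plt.axis([0, 24, 0, 1])

# Toy frame standing in for yy (only two boxes, to keep it short):
yy = pd.DataFrame({
    'NameDeviceType': ['Chaudière_logement'] * 4,
    'IdBox': [4.0, 4.0, 5.0, 5.0],
    'hour': [17.0, 18.0, 17.0, 18.0],
    'ONOFF': [1, 0, 1, 1],
})

for idbox in (4.0, 5.0):   # with the real data: for idbox in range(4, 10)
    plt.figure()
    my_plot(yy, idbox, title='Chaudiere ON/OFF (IdBox={})'.format(idbox))
```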