My dataframe below is filtered for the value "IdBox" == 4, which lets me plot the data for that box only.
I cannot find a way to write a function that redoes this plot quickly when the IdBox value changes. My IdBox values range from 4 to 9, which means six graphs.
chaudiere4 = yy[(yy.NameDeviceType== "Chaudière_logement") & (yy.IdBox == 4.0)]
In [898]: chaudiere4
Out[898]:
UnitDeviceType NameDeviceType IdBox IdDeviceValue ValueDeviceValue weekday hour ONOFF
DateDeviceValue
2015-11-27 17:54:00 On/Off Chaudière_logement 4.0 536448.0 On 4.0 17.0 1
2015-11-27 17:54:00 On/Off Chaudière_logement 4.0 536449.0 Off 4.0 17.0 0
2015-11-27 17:54:00 On/Off Chaudière_logement 4.0 536450.0 On 4.0 17.0 1
2015-11-27 17:54:00 On/Off Chaudière_logement 4.0 536451.0 Off 4.0 17.0 0
2015-11-27 18:09:00 On/Off Chaudière_logement 4.0 536453.0 On 4.0 18.0 1
I created a column called ONOFF, grouped it by hour, and took the mean to do the plot.
chaudiere4 = chaudiere4['ONOFF'].groupby(chaudiere4['hour']).mean()
chaudiere4.plot(kind='bar')
plt.title("Chaudiere ON/OFF")
plt.xlabel('hour')
plt.legend(['ONOFF'])
plt.axis([0, 24, 0, 1])
plt.show()
Is there a way to do this quickly with a function, instead of creating chaudiere5 for IdBox=5, chaudiere6 for IdBox=6, and so on?
IIUC:
yy[(yy.NameDeviceType == "Chaudière_logement") & (yy.IdBox == 4.0)] \
  .groupby('hour')['ONOFF'].mean() \
  .plot.bar()
You can create a little function for that:
def my_plot(df, IdBox=0, title='Chaudiere ON/OFF'):
    df[(df.NameDeviceType == "Chaudière_logement") & (df.IdBox == IdBox)] \
      .groupby('hour')['ONOFF'].mean() \
      .plot.bar(title=title)
    plt.axis([0, 24, 0, 1])
    plt.show()
Now you can call it like this:
my_plot(df, 4.0)
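To get all six graphs at once, the helper can be driven from a loop. A minimal sketch, with a made-up sample frame standing in for `yy`, and `savefig` used instead of `show` so each box gets its own file:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in scripts
import matplotlib.pyplot as plt

def my_plot(df, IdBox=0, title='Chaudiere ON/OFF'):
    # Filter one box, average ONOFF per hour, and draw a bar chart
    means = (df[(df.NameDeviceType == "Chaudière_logement") & (df.IdBox == IdBox)]
             .groupby('hour')['ONOFF'].mean())
    means.plot.bar(title='{} (IdBox={})'.format(title, int(IdBox)))
    plt.savefig('chaudiere_{}.png'.format(int(IdBox)))  # one file per box
    plt.close()
    return means

# Made-up sample frame standing in for `yy`
yy = pd.DataFrame({
    'NameDeviceType': ['Chaudière_logement'] * 4,
    'IdBox': [4.0, 4.0, 5.0, 5.0],
    'hour': [17.0, 17.0, 18.0, 18.0],
    'ONOFF': [1, 0, 1, 1],
})

# Loop over the boxes actually present, which avoids plotting empty subsets
for box in sorted(yy['IdBox'].unique()):
    my_plot(yy, box)
```

Looping over `yy['IdBox'].unique()` rather than a hard-coded `range(4, 10)` also sidesteps an error if one of the boxes has no rows.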
p = df[df['Name'].str.contains(NER, na = True)]
p['Result'] = p['Result'].astype('float')
p.groupby(["Name"])["Result"].plot(legend=True ,figsize=(15,10))
plt.legend(loc ='upper right')
plt.savefig('figure.png')
plt.close()
How can I turn the Timestamp column into the values of the X axis in the plot
I tried:
p.set_index("Timestamp", inplace=True)
But the x-axis starts from 00:00:00 and not from the time of the first index (09:34:54).
p after the line: p['Result'] = p['Result'].astype('float')
Timestamp Name Result
7 09:34:54 TRX0_NER_M0 1.0
8 09:34:54 TRX0_NER_M1 1.0
9 09:34:54 TRX1_NER_M0 1.0
10 09:34:54 TRX1_NER_M1 1.0
11 09:34:54 TRX2_NER_M0 1.0
... ... ... ...
401465 09:47:00 TRX1_NER_M1 1.0
401466 09:47:00 TRX2_NER_M0 1.0
401467 09:47:00 TRX2_NER_M1 1.0
401468 09:47:00 TRX3_NER_M0 1.0
401469 09:47:01 TRX3_NER_M1 1.0
[38341 rows x 3 columns]
I can see you are using times in hh:mm:ss format. You can use this code to get the number of seconds, which can then be used for the x axis. I created a function you can use:
def get_seconds(timestamp):
    time = list(map(int, timestamp.split(":")))
    h = time[0]
    m = time[1]
    s = time[2]
    total = h*3600 + m*60 + s
    return total
print(get_seconds("09:34:54"))
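To put those seconds on the x axis, the function can be mapped over the Timestamp column. A minimal sketch, assuming Timestamp holds hh:mm:ss strings (the small frame below is made up); `pd.to_timedelta(p['Timestamp']).dt.total_seconds()` would produce the same numbers:

```python
import pandas as pd

def get_seconds(timestamp):
    # "hh:mm:ss" -> total seconds since midnight
    h, m, s = map(int, timestamp.split(":"))
    return h * 3600 + m * 60 + s

# Made-up frame shaped like `p` in the question
p = pd.DataFrame({
    'Timestamp': ['09:34:54', '09:34:54', '09:47:00'],
    'Name': ['TRX0_NER_M0', 'TRX0_NER_M1', 'TRX0_NER_M0'],
    'Result': [1.0, 1.0, 1.0],
})

# Numeric seconds column usable as a plot's x axis
p['seconds'] = p['Timestamp'].map(get_seconds)
print(p['seconds'].tolist())  # [34494, 34494, 35220]
```

With `seconds` as the index, the plot's x axis starts at the first observation (34494, i.e. 09:34:54) instead of midnight.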
I recently came across a k-means tutorial that looks a bit different from what I remember the algorithm to be, but it should still do the same thing; after all, it's k-means. So I gave it a try with some data. Here's how the code looks:
# Assignment stage:
def assignment(data, centroids):
    for i in centroids.keys():
        # sqrt((x1-x2)^2 + (y1-y2)^2 + ...)
        data['distance_from_{}'.format(i)] = np.sqrt(
            (data['soloRatio'] - centroids[i][0])**2
            + (data['secStatus'] - centroids[i][1])**2
            + (data['shipsDestroyed'] - centroids[i][2])**2
            + (data['combatShipsLost'] - centroids[i][3])**2
            + (data['miningShipsLost'] - centroids[i][4])**2
            + (data['exploShipsLost'] - centroids[i][5])**2
            + (data['otherShipsLost'] - centroids[i][6])**2
        )
        print(data['distance_from_{}'.format(i)])
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    data['closest'] = data.loc[:, centroid_distance_cols].idxmin(axis=1)
    data['closest'] = data['closest'].astype(str).str.replace(r'\D+', '')
    return data

data = assignment(data, centroids)
and:
# Update stage:
import copy
old_centroids = copy.deepcopy(centroids)

def update(k):
    for i in centroids.keys():
        centroids[i][0] = np.mean(data[data['closest'] == i]['soloRatio'])
        centroids[i][1] = np.mean(data[data['closest'] == i]['secStatus'])
        centroids[i][2] = np.mean(data[data['closest'] == i]['shipsDestroyed'])
        centroids[i][3] = np.mean(data[data['closest'] == i]['combatShipsLost'])
        centroids[i][4] = np.mean(data[data['closest'] == i]['miningShipsLost'])
        centroids[i][5] = np.mean(data[data['closest'] == i]['exploShipsLost'])
        centroids[i][6] = np.mean(data[data['closest'] == i]['otherShipsLost'])
    return k
# TODO: add graphical representation?
while True:
    closest_centroids = data['closest'].copy(deep=True)
    centroids = update(centroids)
    data = assignment(data, centroids)
    if closest_centroids.equals(data['closest']):
        break
When I run the initial assignment stage, it returns the distances. However, when I run the update stage, all distance values become NaN, and I just don't know why or at which point exactly this happens... Maybe I made a mistake I can't spot?
Here's an excerpt of the data I'm working with:
Unnamed: 0 characterID combatShipsLost exploShipsLost miningShipsLost \
0 0 90000654.0 8.0 4.0 5.0
1 1 90001581.0 97.0 5.0 1.0
2 2 90001595.0 61.0 0.0 0.0
3 3 90002023.0 22.0 1.0 0.0
4 4 90002030.0 74.0 0.0 1.0
otherShipsLost secStatus shipsDestroyed soloRatio
0 0.0 5.003100 1.0 10.0
1 0.0 2.817807 6251.0 6.0
2 0.0 -2.015310 752.0 0.0
3 4.0 5.002769 43.0 5.0
4 1.0 3.090204 301.0 7.0
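One guess at the NaN cause, since no answer is included here: `astype(str).str.replace(r'\D+', '')` turns `closest` into strings such as `'0'` and `'1'`, while `centroids.keys()` are presumably ints, so `data['closest'] == i` matches nothing and the mean of an empty selection is NaN. A minimal sketch of that mismatch with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up frame: 'closest' holds *strings*, as it would after str.replace
df = pd.DataFrame({'closest': ['0', '0', '1'], 'x': [1.0, 2.0, 3.0]})

# Comparing string labels to an int key selects no rows, so the mean is NaN
empty_mean = np.mean(df[df['closest'] == 0]['x'])

# Casting back to int (or comparing against str(i)) selects the rows again
ok_mean = np.mean(df[df['closest'].astype(int) == 0]['x'])
print(empty_mean, ok_mean)  # nan 1.5
```

If this is indeed the issue, adding `.astype(int)` after the `str.replace` in the assignment stage would keep the labels comparable to the centroid keys.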
I'm working on creating a linear trendline from data that contains dates and another measure (volume). The goal is to create a linear trendline that shows how volume trends over time.
The data looks as follows:
date typeID lowPrice highPrice avgPrice volume orders \
0 2003-11-30 22.0 9000.00 9000.00 9000.00 5.0 1.0
1 2003-12-31 22.0 9000.00 9000.00 9000.00 2.0 1.0
2 2004-01-31 22.0 15750.00 15750.00 15750.00 9.5 1.0
3 2004-02-29 22.0 7000.00 7000.00 7000.00 11.0 1.0
4 2004-03-31 22.0 7000.00 7000.00 7000.00 8.0 1.0
6 2004-05-31 22.0 15000.00 15000.00 15000.00 16.0 1.0
10 2004-09-30 22.0 6500.00 6500.00 6500.00 27.0 1.0
The issue is that for some months (the interval at which the dates are stored) there is no volume data available, as can be seen above. The following is the approach I currently take to create a trendline from the available dates.
x = df2["date"]
df2["inc_dates"] = np.arange(len(x))
y = df2["ln_vold"]

plt.subplot(15, 4, count)
plt.plot_date(x, y, xdate=True)

model = smf.ols('ln_vold ~ inc_dates', missing="drop", data=df2).fit()
intercept, coef = model.params

l = [intercept]
for i in range(len(x) - 1):
    l.append(intercept + coef*i)

plt.plot_date(x, l, "r--", xdate=True)
However, the resulting plot clearly isn't the right trendline, as the beginning is non-linear.
Now I don't see how this could go wrong, as all I do in the for-loop is add constant values to an increasing integer. All I'd like to see is a linear trendline going straight from the intercept to the end.
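One guess at the cause (not confirmed in the post): `l` starts as `[intercept]` and the loop's first iteration appends `intercept + coef*0`, which is `intercept` again, so the trendline's first two points coincide and the line starts flat. A minimal sketch with made-up parameter values:

```python
# Made-up fitted parameters standing in for model.params
intercept, coef = 8.0, 2.0
n = 5  # stands in for len(x)

# Construction from the question: the first value is duplicated
l_buggy = [intercept]
for i in range(n - 1):
    l_buggy.append(intercept + coef * i)

# One value per date, strictly linear from the first point on
l_fixed = [intercept + coef * i for i in range(n)]

print(l_buggy)  # [8.0, 8.0, 10.0, 12.0, 14.0]
print(l_fixed)  # [8.0, 10.0, 12.0, 14.0, 16.0]
```

Under that assumption, replacing the loop with the one-line comprehension (or with `model.predict(df2)`) would give a straight line from the first date onward.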
I'm looking to adjust values of one column based on a conditional in another column.
I'm using np.busday_count, but I don't want the weekend values to behave like a Monday (Sat to Tues is currently given 1 working day; I'd like that to be 2).
dispdf = df[(df.dispatched_at.isnull()==False) & (df.sold_at.isnull()==False)]
dispdf["dispatch_working_days"] = np.busday_count(dispdf.sold_at.tolist(), dispdf.dispatched_at.tolist())
for i in range(len(dispdf)):
    if dispdf.dayofweek.iloc[i] == 5 or dispdf.dayofweek.iloc[i] == 6:
        dispdf.dispatch_working_days.iloc[i] += 1
Sample:
dayofweek dispatch_working_days
43159 1.0 3
48144 3.0 3
45251 6.0 1
49193 3.0 0
42470 3.0 1
47874 6.0 1
44500 3.0 1
43031 6.0 3
43193 0.0 4
43591 6.0 3
Expected Results:
dayofweek dispatch_working_days
43159 1.0 3
48144 3.0 3
45251 6.0 2
49193 3.0 0
42470 3.0 1
47874 6.0 2
44500 3.0 1
43031 6.0 2
43193 0.0 4
43591 6.0 4
At the moment I'm using this for loop to add a working day to Saturday and Sunday values, and it's slow!
Can I use vectorization instead to speed this up? I tried using .apply, but to no avail.
Pretty sure this works, but there are more optimized implementations:
def adjust_dispatch(df_line):
    if df_line['dayofweek'] >= 5:
        return df_line['dispatch_working_days'] + 1
    else:
        return df_line['dispatch_working_days']

df['dispatch_working_days'] = df.apply(adjust_dispatch, axis=1)
The for loop in your code could be replaced by this line (note `>= 5`, so Saturday, day 5, is included too):
dispdf.loc[dispdf.dayofweek >= 5, 'dispatch_working_days'] += 1
Or you could use numpy.where:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
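A minimal numpy.where sketch, assuming a frame shaped like the sample above (the data here is made up):

```python
import numpy as np
import pandas as pd

# Made-up frame shaped like dispdf in the question
dispdf = pd.DataFrame({
    'dayofweek': [1.0, 6.0, 3.0, 5.0],
    'dispatch_working_days': [3, 1, 0, 2],
})

# Add one working day wherever the sale happened on a weekend (day 5 or 6)
dispdf['dispatch_working_days'] = np.where(
    dispdf['dayofweek'] >= 5,
    dispdf['dispatch_working_days'] + 1,
    dispdf['dispatch_working_days'],
)
print(dispdf['dispatch_working_days'].tolist())  # [3, 2, 0, 3]
```

Both the `.loc` assignment and `np.where` operate on whole columns at once, so either should be far faster than the row-by-row loop.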
I am trying to loop through a list to create a series of boxplots using Matplotlib. Each item in the list should print a plot that has 2 boxplots, 1 using df1 data and 1 using df2 data.
I am successfully plotting x1, but x2 is blank and I don't know why.
I am using jupyter notebook with Python 3. Any help is appreciated!
df1 = df[df.order == 1]
df2 = df[df.order == 0]
lst = ['device', 'ship', 'bill']
i = 0
for item in lst:
    plt.figure(i)
    x1 = df1[item].values
    x2 = df2[item].values
    plt.boxplot([x1, x2])
    plt.title(item)
    i = i + 1
The series that I'm trying to plot have the following format with several thousand observations each:
df['order'] == 1
df['device'] df['ship'] df['bill']
0.0 0.0 0.0
19.0 5.0 0.0
237.0 237.0 237.0
df['order'] == 0
df['device'] df['ship'] df['bill']
1.0 21.0 0.0
75.0 31.0 100.0
5.0 18.0 71.0
The dataframe contains data for orders. The columns listed in lst are of dtype float64.
Solved it: a couple of NaN values appear to have been preventing the second boxplot from rendering.
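For anyone hitting the same thing, dropping the NaNs before calling boxplot is enough. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in scripts
import matplotlib.pyplot as plt

# Made-up data: a NaN hides in the second group's column
df1 = pd.DataFrame({'device': [0.0, 19.0, 237.0]})
df2 = pd.DataFrame({'device': [1.0, np.nan, 75.0]})

x1 = df1['device'].dropna().values
x2 = df2['device'].dropna().values  # drop NaNs so the second box renders
plt.boxplot([x1, x2])
plt.title('device')
plt.savefig('box_device.png')
plt.close()
```

Alternatively, `df2[item].dropna().values` inside the loop from the question achieves the same per-column cleanup.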