I recently came across a k-means tutorial that looks a bit different from what I remember the algorithm to be, but it should still do the same thing; after all, it's k-means. So I gave it a try with some data. Here's how the code looks:
# Assignment Stage:
def assignment(data, centroids):
    for i in centroids.keys():
        # sqrt((x1-x2)^2 + (y1-y2)^2 + etc)
        data['distance_from_{}'.format(i)] = (
            np.sqrt((data['soloRatio'] - centroids[i][0])**2
                    + (data['secStatus'] - centroids[i][1])**2
                    + (data['shipsDestroyed'] - centroids[i][2])**2
                    + (data['combatShipsLost'] - centroids[i][3])**2
                    + (data['miningShipsLost'] - centroids[i][4])**2
                    + (data['exploShipsLost'] - centroids[i][5])**2
                    + (data['otherShipsLost'] - centroids[i][6])**2
                    ))
        print(data['distance_from_{}'.format(i)])
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    data['closest'] = data.loc[:, centroid_distance_cols].idxmin(axis=1)
    data['closest'] = data['closest'].astype(str).str.replace('\D+', '')
    return data

data = assignment(data, centroids)
and:
# Update stage:
import copy

old_centroids = copy.deepcopy(centroids)

def update(k):
    for i in centroids.keys():
        centroids[i][0] = np.mean(data[data['closest'] == i]['soloRatio'])
        centroids[i][1] = np.mean(data[data['closest'] == i]['secStatus'])
        centroids[i][2] = np.mean(data[data['closest'] == i]['shipsDestroyed'])
        centroids[i][3] = np.mean(data[data['closest'] == i]['combatShipsLost'])
        centroids[i][4] = np.mean(data[data['closest'] == i]['miningShipsLost'])
        centroids[i][5] = np.mean(data[data['closest'] == i]['exploShipsLost'])
        centroids[i][6] = np.mean(data[data['closest'] == i]['otherShipsLost'])
    return k

# TODO: add graphical representation?
while True:
    closest_centroids = data['closest'].copy(deep=True)
    centroids = update(centroids)
    data = assignment(data, centroids)
    if closest_centroids.equals(data['closest']):
        break
When I run the initial assignment stage, it returns the distances. However, when I run the update stage, all distance values become NaN, and I just don't know why or at which point exactly this happens. Maybe I made a mistake I can't spot?
Here's an excerpt of the data I'm working with:
Unnamed: 0 characterID combatShipsLost exploShipsLost miningShipsLost \
0 0 90000654.0 8.0 4.0 5.0
1 1 90001581.0 97.0 5.0 1.0
2 2 90001595.0 61.0 0.0 0.0
3 3 90002023.0 22.0 1.0 0.0
4 4 90002030.0 74.0 0.0 1.0
otherShipsLost secStatus shipsDestroyed soloRatio
0 0.0 5.003100 1.0 10.0
1 0.0 2.817807 6251.0 6.0
2 0.0 -2.015310 752.0 0.0
3 4.0 5.002769 43.0 5.0
4 1.0 3.090204 301.0 7.0
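The code that builds centroids isn't shown above, so this is only a guess, but one classic way these NaNs arise is a type mismatch between the 'closest' labels and the centroid keys: after .astype(str).str.replace('\D+', ''), 'closest' holds strings such as '0' or '1', so if centroids is keyed by ints, data['closest'] == i matches no rows, the selection is empty, and the mean of an empty selection is NaN. A minimal sketch of that failure mode (the label values here are made up):

import numpy as np
import pandas as pd

# What 'closest' looks like after idxmin + the regex replace: strings.
labels = pd.Series(['0', '1', '0'])

# Comparing string labels against an int centroid key matches nothing...
print((labels == 1).any())                    # False

# ...so each per-cluster selection is empty, and its mean is NaN.
print(np.mean(pd.Series([], dtype=float)))    # nan

Once a centroid turns NaN, the next assignment call propagates it into every distance column. If this is the cause, converting the labels back with astype(int) before the update stage would be one way to confirm.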
I have some trips, and each trip contains different steps; the data frame looks like the following:
tripId duration (s) distance (m) speed Km/h
1819714 NaN NaN NaN
1819714 6.0 8.511452 5.106871
1819714 10.0 6.908963 2.487227
1819714 5.0 15.960625 11.491650
1819714 6.0 26.481649 15.888989
... ... ... ... ...
1865507 6.0 16.280313 9.768188
1865507 5.0 17.347482 12.490187
1865507 5.0 14.266625 10.271970
1865507 6.0 22.884008 13.730405
1865507 5.0 21.565655 15.527271
I want to know if, on a trip X, the cyclist has braked (i.e., the speed has decreased by at least 30%).
The problem is that the duration between every two steps is different each time.
For example, if in 6 seconds the speed of a person X has decreased from 28 km/h to 15 km/h, we can say he has braked; but if the duration were longer, we would not be able to say that.
My question is whether there is something I can apply to detect a braking process across the whole data frame in a way that makes sense.
The measure of braking is the change in speed relative to the change in time. From your data, I created a column 'acceleration', which is the change in speed (km/h) divided by the duration (seconds). The final column then flags braking wherever that value is less than -1 (km/h/s).
Note that you need to decide whether a reduction of 1 km/h per second is a good enough threshold to be considered braking.
df['speedChange'] = df['speedKm/h'].diff()
df['acceleration'] = df['speedChange'] / df['duration(s)']
df['braking'] = df['acceleration'].apply(lambda x: 'yes' if x<-1 else 'no')
print(df)
Output:
tripId duration(s) distance(m) speedKm/h speedChange acceleration braking
0 1819714.0 6.0 8.511452 5.106871 NaN NaN no
1 1819714.0 10.0 6.908963 2.487227 -2.619644 -0.261964 no
2 1819714.0 5.0 15.960625 11.491650 9.004423 1.800885 no
3 1819714.0 6.0 26.481649 15.888989 4.397339 0.732890 no
4 1865507.0 6.0 16.280313 9.768188 -6.120801 -1.020134 yes
5 1865507.0 5.0 17.347482 12.490187 2.721999 0.544400 no
6 1865507.0 5.0 14.266625 10.271970 -2.218217 -0.443643 no
7 1865507.0 6.0 22.884008 13.730405 3.458435 0.576406 no
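One caveat: df['speedKm/h'].diff() as written also runs across the boundary between two trips, so the first row of a new trip inherits a change from the last row of the previous trip. If changes should only be computed within a single trip, a per-trip variant (same column names assumed) might look like this:

import numpy as np

# Restrict diff() to rows of the same trip; each trip's first row gets NaN,
# which compares as False against the threshold and is flagged 'no'.
df['speedChange'] = df.groupby('tripId')['speedKm/h'].diff()
df['acceleration'] = df['speedChange'] / df['duration(s)']
df['braking'] = np.where(df['acceleration'] < -1, 'yes', 'no')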
I am new to pandas, so sorry for the naiveté.
I have two dataframes.
One is out.hdf:
999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074
999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292
999999 2014 1 9 12 27 10.98 13.265 -89.835 55.0 4.5 0.0 0.0 0.0 603947030
999999 2014 1 9 20 57 44.88 23.273 -80.778 15.0 5.1 0.0 0.0 0.0 603947340
and the other is out.res (the first column is the station name):
061Z 56.72 0.0 P 603879074
061Z 29.92 0.0 P 603879074
0614 46.24 0.0 P 603879292
109C 87.51 0.0 P 603947030
113A 66.93 0.0 P 603947030
113A 26.93 0.0 P 603947030
121A 31.49 0.0 P 603947340
The last column in both dataframes is an ID.
I want to create a new dataframe that puts lines with the same ID from the two dataframes together, in this way: first read a line from hdf, then put the lines from res with the same ID beneath it, but without keeping the ID column from res.
The new dataframe:
"999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074"
061Z 56.72 0.0 P
061Z 29.92 0.0 P
"999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292"
0614 46.24 0.0 P
"999999 2014 1 9 12 27 10.98 13.265 -89.835 55.0 4.5 0.0 0.0 0.0 603947030"
109C 87.51 0.0 P
113A 66.93 0.0 P
113A 26.93 0.0 P
"999999 2014 1 9 20 57 44.88 23.273 -80.778 15.0 5.1 0.0 0.0 0.0 603947340"
121A 31.49 0.0 P
My code to do this is:
import csv
import pandas as pd
import numpy as np

path = './'
hdf = pd.read_csv(path + 'out.hdf', delimiter='\t', header=None)
res = pd.read_csv(path + 'out.res', delimiter='\t', header=None)

### creating input to the format of ph2dt-jp/ph
with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')
    i = 0
    with open('./out.hdf', 'r') as a_file:
        for line in a_file:
            liney = line.strip()
            writer.writerow(np.array([liney]))
            print(liney)
            j = 0
            with open('./out.res', 'r') as a_file:
                for line in a_file:
                    if res.iloc[j, 4] == hdf.iloc[i, 14]:
                        strng = res.iloc[j, [0, 1, 2, 3]]
                        print(strng)
                        writer.writerow(np.array(strng))
                    j += 1
            i += 1
The goal is to keep just unique stations in the third dataframe. I used these commands on res to keep unique stations before creating the third dataframe:
res.drop_duplicates([0], keep = 'last', inplace = True)
and
res.groupby([0], as_index = False).last()
and they work fine. The problem is that for a large data set, with thousands of lines, using these commands causes some lines of the res file to be omitted from the third dataframe.
Could you please let me know what I should do to get the same result for a large dataset?
I am going crazy here; thanks for your time and help in advance.
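As an aside, the nested loops above re-read out.res once per hdf line and index res by a counter of raw file lines, which can drift out of step once rows have been dropped from res. A sketch of a lookup-based rewrite (assuming the column layout shown above, with the ID in hdf column 14 and res column 4; the exact write format may need adjusting):

import pandas as pd

hdf = pd.read_csv('out.hdf', delimiter='\t', header=None)
res = pd.read_csv('out.res', delimiter='\t', header=None)

groups = res.groupby(4)   # group the station lines by their ID column once
with open('new_df', 'w', encoding='UTF8') as f:
    for _, row in hdf.iterrows():
        f.write(' '.join(map(str, row)) + '\n')                # the event line
        if row[14] in groups.groups:
            for _, srow in groups.get_group(row[14]).iterrows():
                f.write(' '.join(map(str, srow[:4])) + '\n')   # station lines, ID dropped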
I found the problem and hope this is helpful for others in the future.
In the large data set, the duplicated stations repeat many times, but not consecutively, and drop_duplicates() was keeping just one of them overall. However, I wanted to drop only consecutive duplicates, not all of them. I've done this using shift:
unique_stations = res.loc[res[0].shift() != res[0]]
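To see the difference on a toy example (station names borrowed from the data above):

import pandas as pd

res = pd.DataFrame({0: ['061Z', '061Z', '0614', '061Z', '113A', '113A']})

# shift() compares each station with the one directly above it, so only
# consecutive repeats are dropped; '061Z' at index 3 survives because it
# does not follow another '061Z'.
unique_stations = res.loc[res[0].shift() != res[0]]
print(unique_stations[0].tolist())   # ['061Z', '0614', '061Z', '113A']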
I have time-series data in a dataframe. Is there any way to calculate for each day the percent change of that day's value from the average of the previous 7 days?
I have tried
df['Change'] = df['Column'].pct_change(periods=7)
However, this simply finds the difference between t and t-7 days. I need something like: for each value Ti, find the average of the previous 7 days, and subtract it from Ti.
Sure, you can for example use:
s = df['Column']
n = 7
mean = s.rolling(n, closed='left').mean()
df['Change'] = (s - mean) / mean
Note on closed='left'
There was a bug prior to pandas=1.2.0 that caused incorrect handling of closed for fixed windows. Make sure you have pandas>=1.2.0; for example, pandas=1.1.3 will not give the result below.
As described in the docs:
closed: Make the interval closed on the ‘right’, ‘left’, ‘both’ or ‘neither’ endpoints. Defaults to ‘right’.
A simple way to understand is to try with some very simple data and a small window:
a = pd.DataFrame(range(5), index=pd.date_range('2020', periods=5))
b = a.assign(
sum_left=a.rolling(2, closed='left').sum(),
sum_right=a.rolling(2, closed='right').sum(),
sum_both=a.rolling(2, closed='both').sum(),
sum_neither=a.rolling(2, closed='neither').sum(),
)
>>> b
0 sum_left sum_right sum_both sum_neither
2020-01-01 0 NaN NaN NaN NaN
2020-01-02 1 NaN 1.0 1.0 NaN
2020-01-03 2 1.0 3.0 3.0 NaN
2020-01-04 3 3.0 5.0 6.0 NaN
2020-01-05 4 5.0 7.0 9.0 NaN
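Applying the recipe from the answer to the same toy frame (window of 2 instead of 7, values worked out by hand from the definition):

s = a[0]
mean = s.rolling(2, closed='left').mean()
a['Change'] = (s - mean) / mean
# 2020-01-03: mean of previous two values (0, 1) is 0.5 -> (2 - 0.5) / 0.5 = 3.0
# 2020-01-04: mean of (1, 2) is 1.5 -> (3 - 1.5) / 1.5 = 1.0
# 2020-01-05: mean of (2, 3) is 2.5 -> (4 - 2.5) / 2.5 = 0.6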
I'm working on creating a linear trendline from data that contains dates and another measure (volume). The goal is to create a linear trendline that shows how volume trends over time.
The data looks as follows:
date typeID lowPrice highPrice avgPrice volume orders \
0 2003-11-30 22.0 9000.00 9000.00 9000.00 5.0 1.0
1 2003-12-31 22.0 9000.00 9000.00 9000.00 2.0 1.0
2 2004-01-31 22.0 15750.00 15750.00 15750.00 9.5 1.0
3 2004-02-29 22.0 7000.00 7000.00 7000.00 11.0 1.0
4 2004-03-31 22.0 7000.00 7000.00 7000.00 8.0 1.0
6 2004-05-31 22.0 15000.00 15000.00 15000.00 16.0 1.0
10 2004-09-30 22.0 6500.00 6500.00 6500.00 27.0 1.0
The issue is that for some months (the interval at which the dates are stored) there is no volume data available, as can be seen above. Thus, the following is the approach I currently take to create a trendline from the available dates:
x = df2["date"]
df2["inc_dates"] = np.arange(len(x))
y = df2["ln_vold"]
plt.subplot(15, 4, count)
plt.plot_date(x, y, xdate = True)
model = smf.ols('ln_vold ~ inc_dates', missing = "drop", data = df2).fit()
intercept, coef = model.params
l = [intercept]
for i in range(len(x) -1):
l.append(intercept + coef*i)
plt.plot_date(x, l, "r--", xdate = True)
However, the plot this currently produces clearly isn't the right trendline: the beginning is non-linear.
I don't see how this could go wrong, as all I do in the for-loop is add constant increments to the intercept. All I'd like to see is a straight trendline going from the intercept to the end.
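One likely culprit is an off-by-one in how l is built: it is seeded with intercept, and the loop's first iteration then appends intercept + coef*0, which is intercept again, so the first two points coincide (a flat start) and every later point lags one step behind its date. Evaluating the fitted equation directly on inc_dates sidesteps the hand-built list; a sketch, assuming intercept and coef are as unpacked above:

# Evaluate the fitted line at every observation instead of appending
# terms one by one; no two entries can collide at the start.
trend = intercept + coef * df2['inc_dates']
plt.plot_date(x, trend, 'r--', xdate=True)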
I'm looking to adjust values of one column based on a conditional in another column.
I'm using np.busday_count, but I don't want the weekend values to behave like a Monday (Sat to Tues is given 1 working day, I'd like that to be 2)
dispdf = df[(df.dispatched_at.isnull() == False) & (df.sold_at.isnull() == False)]
dispdf["dispatch_working_days"] = np.busday_count(dispdf.sold_at.tolist(), dispdf.dispatched_at.tolist())

for i in range(len(dispdf)):
    if dispdf.dayofweek.iloc[i] == 5 or dispdf.dayofweek.iloc[i] == 6:
        dispdf.dispatch_working_days.iloc[i] += 1
Sample:
dayofweek dispatch_working_days
43159 1.0 3
48144 3.0 3
45251 6.0 1
49193 3.0 0
42470 3.0 1
47874 6.0 1
44500 3.0 1
43031 6.0 3
43193 0.0 4
43591 6.0 3
Expected Results:
dayofweek dispatch_working_days
43159 1.0 3
48144 3.0 3
45251 6.0 2
49193 3.0 0
42470 3.0 1
47874 6.0 2
44500 3.0 1
43031 6.0 2
43193 0.0 4
43591 6.0 4
At the moment I'm using this for-loop to add a working day to Saturday and Sunday values, and it's slow!
Can I use vectorization instead to speed this up? I tried using .apply, but to no avail.
Pretty sure this works, but there are more optimized implementations:
def adjust_dispatch(df_line):
    if df_line['dayofweek'] >= 5:
        return df_line['dispatch_working_days'] + 1
    else:
        return df_line['dispatch_working_days']

df['dispatch_working_days'] = df.apply(adjust_dispatch, axis=1)
The for-loop in your code could be replaced by this line (note >= 5, so that Saturday is included as well):
dispdf.loc[dispdf.dayofweek >= 5, 'dispatch_working_days'] += 1
or you could use numpy.where
https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
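For completeness, a minimal np.where version of the same adjustment (using the >= 5 weekend test from above):

import numpy as np

# Add one working day wherever dayofweek is Saturday (5) or Sunday (6);
# all other rows keep their current value.
dispdf['dispatch_working_days'] = np.where(
    dispdf['dayofweek'] >= 5,
    dispdf['dispatch_working_days'] + 1,
    dispdf['dispatch_working_days'],
)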