I have a time series with hourly frequency and a label per day. I would like to fix the class imbalance by oversampling while preserving the sequence for each one day period. Ideally I would be able to use ADASYN or another method better than random oversampling. Here is what the data looks like:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
np.random.seed(seed=1111)
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(45), freq='H')
data = np.random.random(size=len(days))
data2 = np.random.random(size=len(days))
df = pd.DataFrame({'DateTime': days, 'col1': data, 'col_2' : data2})
df['Date'] = [df.loc[i,'DateTime'].floor('D') for i in range(len(df))]
class_labels = []
for i in df['Date'].unique():
    class_labels.append([i, np.random.choice((1,2,3,4,5,6,7,8,9,10), size=1,
                                              p=(.175,.035,.016,.025,.2,.253,.064,.044,.072,.116))[0]])
class_labels = pd.DataFrame(class_labels)
df['class_label'] = [class_labels[class_labels.loc[:,0] == df.loc[i,'Date']].loc[:,1].values[0] for i in range(len(df))]
df = df.set_index('DateTime')
df.drop('Date',axis=1,inplace=True)
print(df['class_label'].value_counts())
df.head(15)
Out[209]:
5 264
1 240
6 145
9 120
7 120
10 72
8 72
4 24
2 24
Out[213]:
col1 col_2 class_label
DateTime
2019-02-01 18:28:29.214935 0.095549 0.307041 6
2019-02-01 19:28:29.214935 0.925004 0.981620 6
2019-02-01 20:28:29.214935 0.343573 0.610662 6
2019-02-01 21:28:29.214935 0.310477 0.482961 6
2019-02-01 22:28:29.214935 0.002010 0.242208 6
2019-02-01 23:28:29.214935 0.235595 0.355516 6
2019-02-02 00:28:29.214935 0.237792 0.028726 5
2019-02-02 01:28:29.214935 0.735916 0.221198 5
2019-02-02 02:28:29.214935 0.495468 0.712723 5
2019-02-02 03:28:29.214935 0.784425 0.818065 5
2019-02-02 04:28:29.214935 0.126506 0.414326 5
2019-02-02 05:28:29.214935 0.606649 0.264835 5
2019-02-02 06:28:29.214935 0.466121 0.244843 5
2019-02-02 07:28:29.214935 0.237132 0.298100 5
2019-02-02 08:28:29.214935 0.435159 0.621991 5
I would like to use ADASYN or SMOTE, but even random oversampling to fix the class imbalance would be good.
The desired result is in hourly increments like the original, has one label per day and classes are balanced:
print(df['class_label'].value_counts())
Out[211]:
5 264
1 264
6 264
9 264
7 264
10 264
8 264
4 264
2 264
Using groupby with a list comprehension, then sampling each subset with replacement:
newdf=pd.concat([y.sample(264,replace=True) for _, y in df.groupby('class_label')])
newdf.class_label.value_counts()
9 264
7 264
5 264
1 264
10 264
8 264
6 264
4 264
2 264
Name: class_label, dtype: int64
You really can't "oversample" time series data, at least not in the same sense that you can with unordered data. It wouldn't be possible to have 264 examples of every class; that would mean inserting new data into the time series between existing points and throwing all of the time-sensitive patterns out of whack.
The best option (as far as oversampling goes) is to synthetically generate one or more new time series based on your original data. One option: for each point, pick a random class, then interpolate between the closest data points of that class from the original time series. Another option: randomly sample 24 points from each class (which will always include all of class 2 and 4) and interpolate the rest of the time series a few times until you have a set of balanced time series.
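As a very rough sketch of the interpolation idea (an illustration only, not ADASYN or SMOTE; it assumes full 24-hour days with the col1/col_2 columns from the example above, and blends two existing days of the same class hour by hour):
import numpy as np
import pandas as pd

def synthesize_day(df, label, alpha=0.5, rng=None):
    # Blend two randomly chosen existing days of `label`, hour by hour,
    # into one new synthetic day (columns col1/col_2 as in the example above).
    rng = rng or np.random.default_rng()
    sub = df[df['class_label'] == label]
    days = [g[['col1', 'col_2']].to_numpy()
            for _, g in sub.groupby(sub.index.floor('D'))]
    days = [d for d in days if len(d) == 24]            # keep only complete days
    a, b = rng.integers(len(days), size=2)
    blended = alpha * days[a] + (1 - alpha) * days[b]   # hour-wise interpolation
    out = pd.DataFrame(blended, columns=['col1', 'col_2'])
    out['class_label'] = label
    return out
Synthetic days like this could then be appended (with fresh hourly timestamps) to the minority classes until the counts are balanced.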
A much better option is to address class imbalance some other way, say by changing your loss/error function.
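For example, a hedged sketch using scikit-learn's class weighting (the resulting weight_map would then be passed to whatever model you train):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = df['class_label'].to_numpy()      # here: one label per hourly row
classes = np.unique(labels)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=labels)
weight_map = dict(zip(classes, weights))
# Pass class_weight=weight_map to estimators that support it, or use the
# per-sample weights weight_map[y_i] in a custom loss.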
Related
I'm using Python with pandas to export data from a database to CSV. The data looks like this when exported. I get about 100 logs per day, so this is purely for visualisation purposes:
time                  Buf1  Buf2
12/12/2022 19:15:56     12     3
12/12/2022 18:00:30      5    18
11/12/2022 15:15:08     12     3
11/12/2022 15:15:08     10     9
Right now I only dump the "raw" data into a CSV, but I need to generate a min, max and average value for each day. What's the best way to do that? I've been trying some min() and max() functions, but the problem is that I have multiple days in these CSV files. I've also tried manipulating the data in Python itself, but I'm worried I'll miss something and the data will no longer be correct.
I would like to end up with something like this:
time        buf1_max  buf_min
12/12/2022        12        3
12/12/2022        12       10
Here you go, step by step.
In [27]: df['time'] = df['time'].astype("datetime64").dt.date
In [28]: df
Out[28]:
time Buf1 Buf2
0 2022-12-12 12 3
1 2022-12-12 5 18
2 2022-11-12 12 3
3 2022-11-12 10 9
In [29]: df = df.set_index("time")
In [30]: df
Out[30]:
Buf1 Buf2
time
2022-12-12 12 3
2022-12-12 5 18
2022-11-12 12 3
2022-11-12 10 9
In [31]: df.groupby(df.index).agg(['min', 'max', 'mean'])
Out[31]:
Buf1 Buf2
min max mean min max mean
time
2022-11-12 10 12 11.0 3 9 6.0
2022-12-12 5 12 8.5 3 18 10.5
Another approach is to use pivot_table to simplify the grouping (keep in mind to convert the 'time' column to datetime64 format as suggested above):
import pandas as pd
import numpy as np
df.pivot_table(
index='time',
values=['Buf1', 'Buf2'],
aggfunc={'Buf1':[min, max, np.mean], 'Buf2':[min, max, np.mean]}
)
You can add any aggfunc as you wish.
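If you want flat column names like buf1_max / buf1_min from the question rather than a two-level header, one optional extra (shown here on the groupby result from Out[31] above, whose column levels are (column, statistic)) is to join the levels:
out = df.groupby(df.index).agg(['min', 'max', 'mean'])   # as in Out[31] above
# the two column levels are e.g. ('Buf1', 'max'); join them into flat names
out.columns = [f'{col.lower()}_{stat}' for col, stat in out.columns]
out = out.reset_index()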
I have two data frames:
DF1
cid dt tm id distance
2 ed032f716995 2021-01-22 16:42:48 43 21.420561
3 16e2fd96f9ca 2021-01-23 23:19:43 539 198.359355
102 cf092e68fa82 2021-01-22 09:03:14 8 39.599627
104 833ccf05433b 2021-01-24 02:53:08 11 33.168314
DF2
id cluster
0 3
1 6 7,8,43
2 20 1817
3 25
4 10 11,13,14,15,9,539
I want to search each id in df1 in cluster column of df2. The desired output is:
cid dt tm id distance cluster
2 ed032f716995 2021-01-22 16:42:48 43 21.420561 7,8,43
3 16e2fd96f9ca 2021-01-23 23:19:43 539 198.359355 11,13,14,15,9,539
102 cf092e68fa82 2021-01-22 09:03:14 8 39.599627 7,8,43
104 833ccf05433b 2021-01-24 02:53:08 11 33.168314 11,13,14,15,9,539
In the first row of df1 above, since 43 is present in one of the clusters in df2, I include that entire cluster value for the row.
I tried the following:
for index, rows in df1.iterrows():
    for idx, rws in df2.iterrows():
        if str(rows['id']) in str(rws['cluster']):
            print([rows['id'], rws['cluster']])
This seems to work. However, since df2['cluster'] is a string, it returns a result even on a partial match. For example, if df1['id'] = 34 and df2['cluster'] contains 344,432, etc., it still matches on the 344 and returns a positive result.
I tried another option from SO here:
d = {k: set(v.split(',')) for k, v in df2.set_index('id')['cluster'].items()}
df1['idc'] = [next(iter([k for k, v in d.items() if set(x).issubset(v)]), '') for x in str(df1['id'])]
However, with the above I get an error indicating that the lengths differ between the two datasets.
How do I get the cluster mapped based on exact match of the id column in df1?
One way is to split the cluster column, explode it and map:
to_map = (df2.assign(cluster_i=df2.cluster.str.split(','))
.explode('cluster_i').dropna()
.set_index('cluster_i')['cluster']
)
df1['cluster'] = df1['id'].astype(str).map(to_map)
Output:
cid dt tm id distance cluster
2 ed032f716995 2021-01-22 16:42:48 43 21.420561 7,8,43
3 16e2fd96f9ca 2021-01-23 23:19:43 539 198.359355 11,13,14,15,9,539
102 cf092e68fa82 2021-01-22 09:03:14 8 39.599627 7,8,43
104 833ccf05433b 2021-01-24 02:53:08 11 33.168314 11,13,14,15,9,539
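One caveat worth noting (an optional extra, not part of the mapping itself): any id in df1 that never appears in df2['cluster'] will map to NaN. If you prefer an empty string there, something like this works:
df1['cluster'] = df1['id'].astype(str).map(to_map).fillna('')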
I am implementing a genetic algorithm. The algorithm has to run for a number of iterations (between 100 and 500), and in each iteration all 100 individuals are evaluated for their 'fitness'. To this end, I have written an evaluate function. However, evaluating the fitness of the 100 individuals already takes 13 seconds for a single iteration. I have to speed this up massively in order to implement an efficient algorithm.
The evaluate function takes two arguments and then performs some calculations. I will share part of the function, since a similar form of calculation is repeated after it. Specifically, I perform a groupby on a dataframe called df_demand and then take the sum of a list comprehension that uses the resulting dataframe from the groupby and another dataframe called df_distance. A snippet of df_demand looks as follows, but has larger dimensions in reality (the index is just 0, 1, 2, ...):
date customer deliveries warehouse
2020-10-21 A 30 1
2020-10-21 A 47 1
2020-10-21 A 59 2
2020-10-21 B 130 3
2020-10-21 B 102 3
2020-10-21 B 95 2
2020-10-22 A 55 1
2020-10-22 A 46 4
2020-10-22 A 57 4
2020-10-22 B 89 3
2020-10-22 B 104 3
2020-10-22 B 106 4
and a snippet of df_distance is (where the columns are the warehouses):
index 1 2 3 4
A 30.2 54.3 76.3 30.9
B 96.2 34.2 87.7 102.4
C 57.0 99.5 76.4 34.5
Next, I want to group df_demand so that each combination of (date, customer, warehouse) appears once and all deliveries for that combination are summed. Finally, I want to calculate the total costs. Currently, I have the following, but it is too slow:
def evaluate(df_demand, df_distance):
    costs = df_demand.groupby(["date", "customer", "warehouse"]).sum().reset_index()
    cost = sum([math.ceil(costs.iat[i, 3] / 20) * df_distance.loc[costs.iat[i, 1], costs.iat[i, 2]] for i in range(len(costs))])
    # etc...
    return cost
Since I have to do many iterations and considering the fact that dimensions of my data are considerably larger, my question is: what is the fastest way to do this operation?
Let's try:
def get_cost(df, df2):
    '''
    df: deliveries data
    df2: distance data
    '''
    pivot = np.ceil(df.pivot_table(index=['customer', 'warehouse'], columns=['date'],
                                   values='deliveries', aggfunc='sum', fill_value=0)
                      .div(20))
    return pivot.mul(df2.rename_axis(index='customer', columns='warehouse').stack(),
                     axis='rows').sum().sum()
Consider the values below:
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
From here, how do I construct an index (base=100)? My desired output should be:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve the objective with an iterative (loop) solution, but that may not be practical if the data depth and breadth is large. Secondly, is there a way I can do this in a single step on multiple columns? Thank you all for any guidance.
An index (base=100) is the relative change of a series in relation to its first element. So there's no need to take a detour through relative changes and recalculate the index from them when you can get it directly by
df = pd.Series(array1)/array1[0]*100
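The same idea extends to several columns in a single step: divide the whole frame by its first row and scale by 100. (df_wide below is a hypothetical frame built from the example data just to illustrate.)
import pandas as pd

# hypothetical wide frame with two series, reusing array1 from above
df_wide = pd.DataFrame({'a': array1, 'b': array1[::-1]})
indexed = df_wide.div(df_wide.iloc[0]).mul(100)   # every column rebased to 100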
As far as I know, there is still no off-the-shelf expanding-window version of pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd
series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
# compute percentage change with respect to the first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64
I have a dataframe with 2 columns and 3000 rows.
The first column represents time in time-steps. For example, the first row is 0, the second is 1, ..., the last one is 2999.
The second column represents pressure. The pressure changes as we move down the rows but shows repetitive behaviour: every few steps it drops to its minimum value (which is 375), then goes up again, then hits 375 again, and so on.
What I want to do in Python is to iterate over the rows and find:
1) at which time-steps the pressure is at its minimum, and
2) the frequency between the minimum values.
import numpy as np
import pandas as pd
import numpy.random as rnd
import scipy.linalg as lin
from matplotlib.pylab import *
import re
from pylab import *
import datetime
df = pd.read_csv('test.csv')
row = next(df.iterrows())[0]
dataset = np.loadtxt(df, delimiter=";")
df.columns = ["Timestamp", "Pressure"]
print(df[[0, 1]])
You don't need to iterate row-wise: you can compare the entire column against the min value to build a mask, then use the mask to find the timestep diff:
Data setup:
In [44]:
df = pd.DataFrame({'timestep':np.arange(20), 'value':np.random.randint(375, 400, 20)})
df
Out[44]:
timestep value
0 0 395
1 1 377
2 2 392
3 3 396
4 4 377
5 5 379
6 6 384
7 7 396
8 8 380
9 9 392
10 10 395
11 11 393
12 12 390
13 13 393
14 14 397
15 15 396
16 16 393
17 17 379
18 18 396
19 19 390
mask the df by comparing the column against the min value:
In [45]:
df[df['value']==df['value'].min()]
Out[45]:
timestep value
1 1 377
4 4 377
We can use the mask with loc to find the corresponding 'timestep' value and use diff to find the interval differences:
In [48]:
df.loc[df['value']==df['value'].min(),'timestep'].diff()
Out[48]:
1 NaN
4 3.0
Name: timestep, dtype: float64
You can divide the above by 1/60 to find the frequency with respect to 1 minute, or whatever frequency unit you desire.
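As a hedged follow-up sketch: the average spacing between minima and its reciprocal can be computed directly from the diff above; converting that into a physical frequency depends on how long one timestep actually is.
gaps = df.loc[df['value'] == df['value'].min(), 'timestep'].diff().dropna()
mean_period = gaps.mean()         # average number of timesteps between minima
freq_per_step = 1 / mean_period   # minima per timestep; rescale by your sampling rate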