I have a Pandas DataFrame of timestamps:
0 2020-08-01 23:59:59
1 2020-08-01 23:59:49
2 2020-08-01 20:52:17
3 2020-08-01 19:02:34
4 2020-08-01 18:38:06
I want to add a column that indexes the rows by cluster, e.g. as follows:
0 2020-08-01 23:59:59 1
1 2020-08-01 23:59:49 1
2 2020-08-01 20:52:17 2
3 2020-08-01 19:02:34 3
4 2020-08-01 18:38:06 3
I have written the code below. For this example, as we can see, 3 clusters can be made by grouping the nearest/closest timestamps.
from sklearn.cluster import KMeans
mat = df['datetime'].values
kmeans = KMeans(n_clusters=3)
kmeans.fit(mat.iloc[:,1:])
y_kmeans = kmeans.predict(mat.iloc[:,1:])
df['cluster'] = y_kmeans
However, the above code didn't work either. I have millions of rows and obviously don't know how many clusters I would need. I read that the Elbow Method can be used, but I'm not exactly sure how it is done. Can someone point me in the right direction?
KMeans assumes that you already know the number of clusters.
If you want a method that determines the number of clusters algorithmically, you can e.g. use DBSCAN which forms a cluster whenever a group of data points is "close" to each other (closeness determined by the eps parameter). If you have a large number of samples and this is very costly, you can also try to explore any clusters in the data using a smaller (representative) subset of the data.
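For example, here is a minimal sketch on the datetime column from the question; eps is in seconds here, 600 (i.e. 10 minutes) is just an illustrative value you would tune, and min_samples=1 means every point gets assigned to some cluster:
import pandas as pd
from sklearn.cluster import DBSCAN

df['datetime'] = pd.to_datetime(df['datetime'])
# DBSCAN needs a numeric 2-D array, so express each timestamp as seconds since the epoch
X = (df['datetime'].astype('int64') // 10**9).to_numpy().reshape(-1, 1)
db = DBSCAN(eps=600, min_samples=1).fit(X)  # timestamps within 600 s of a neighbour join the same cluster
df['cluster'] = db.labels_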
I am having a bit of trouble getting some pandas code to work. My basic problem is: given a set of transactions and a set of balances, I need to come up with "balancing transactions", i.e. fake transactions (which will be tagged as such) that make the sum of transactions equal to the balances (ignore the fact that in most cases this isn't a good idea; it makes sense in the context I am working in, I promise!).
Sample data:
import pandas as pd
from io import StringIO
txn_data = StringIO(
"""Contract,Txndate,TxnAmount
"wer42134423",1/1/2014, 50
"wer42134423",1/2/2014, -10
"wer42134423",1/3/2014, 100
"wer42134423",1/4/2014, -50
"wer42134423",1/5/2014, -10
"wer42134423",1/6/2014, 20
"wer42134423",1/7/2014, 50
"wer42134423",1/8/2014, -70
"wer42134423",1/10/2014, 21
"wer42134423",1/11/2014, -3
"""
)
txns=pd.read_csv(txn_data,parse_dates=["Txndate"])
txns.head()
balance_data = StringIO(
"""Contract,Baldate,Amount
"wer42134423", 1/1/2014, 50
"wer42134423", 1/4/2014, 100
"wer42134423", 1/9/2014, 96
"wer42134423", 1/11/2014, 105
"""
)
balances=pd.read_csv(balance_data,parse_dates=["Baldate"])
txns["CumulativeSumofTxns"]=txns.groupby("Contract")["TxnAmount"].cumsum()
balances_merged=pd.merge_asof(balances,txns,by="Contract",left_on=["Baldate"],right_on=["Txndate"])
balances_merged.head()
I can do this fairly easily in Excel; I merge the cumulative sum of transactions onto my balance data, then just apply a fairly simple sum formula, and then everything can balance out.
However, I cannot for the life of me figure out how to do the same in Pandas (without manually iterating through each "cell", which would be horrendous for performance). After doing a lot of digging it almost seems like the expanding window function would do the trick, but I couldn't get that to work after multiple attempts with shifting and such. I think the problem is that every entry in my new column depends both on entries in the same row (namely, the current balance and the cumulative sum of transactions) and on all the prior entries in the column (namely, all the prior balancing transactions). Any help appreciated!
IIUC, is this what you want?
balances_merged['Cumulative Sum Balancing Transactions'] = balances_merged['Amount'] - balances_merged['CumulativeSumofTxns']
balances_merged['Balancing Transaction'] = balances_merged['Cumulative Sum Balancing Transactions'].diff()
balances_merged
Output:
Contract Baldate Amount Txndate TxnAmount CumulativeSumofTxns Cumulative Sum Balancing Transactions Balancing Transaction
0 wer42134423 2014-01-01 50 2014-01-01 50 50 0 NaN
1 wer42134423 2014-01-04 100 2014-01-04 -50 90 10 10.0
2 wer42134423 2014-01-09 96 2014-01-08 -70 80 16 6.0
3 wer42134423 2014-01-11 105 2014-01-11 -3 98 7 -9.0
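If it helps, here is a rough sketch of how the fake balancing transactions themselves could then be built and appended, tagged with a hypothetical IsBalancing column (a name I've made up, not part of your data):
# keep only the dates where a non-zero balancing transaction is actually needed
bal_txns = balances_merged.loc[
    balances_merged['Balancing Transaction'].fillna(0) != 0,
    ['Contract', 'Baldate', 'Balancing Transaction']
].rename(columns={'Baldate': 'Txndate', 'Balancing Transaction': 'TxnAmount'})
bal_txns['IsBalancing'] = True  # hypothetical tag column

txns['IsBalancing'] = False
all_txns = pd.concat([txns, bal_txns], ignore_index=True, sort=False).sort_values(['Contract', 'Txndate'])
Note that the first row's diff is NaN; if a balancing entry were ever needed on the very first balance date, you could fill that NaN from the 'Cumulative Sum Balancing Transactions' value itself.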
I have a large csv with the following format:
timestamp,name,age
2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john
2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john
2020-03-01 00:00:10,nick
2020-03-01 00:00:12,john
2020-03-01 00:00:54,hank
2020-03-01 00:01:03,peter
I load csv into a dataframe with:
df = pd.read_csv("/home/test.csv")
and then I want to create multiple dataframes every 2 seconds. For example:
df1 contains:
2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john
df2 contains :
2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john
and so on.
I managed to split the timestamps with the command below:
full_idx = pd.date_range(start=df['timestamp'].min(), end = df['timestamp'].max(), freq ='0.2T')
but how can I store these split dataframes? How can I split a dataset into multiple dataframes based on timestamps?
This question may help: Pandas: Timestamp index rounding to the nearest 5th minute
import numpy as np
import pandas as pd
df = pd.read_csv("test.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])
ns2sec = 2 * 1000000000  # 2 seconds in nanoseconds
# next we round each timestamp down to the nearest 2-second boundary
timestamp_rounded = df['timestamp'].astype(np.int64) // ns2sec
df['full_idx'] = pd.to_datetime(timestamp_rounded * ns2sec)
# store array for each unique value of your idx
store_array = []
for value in df['full_idx'].unique():
    store_array.append(df[df['full_idx'] == value][['timestamp', 'name', 'age']])
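Each element of store_array is then one 2-second slice of the original frame, for example:
store_array[0]    # the rows whose timestamps fall in the first 2-second bin
len(store_array)  # number of distinct (non-empty) bins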
How about .resample()?
#first loading your data
>>> import pandas as pd
>>>
>>> df = pd.read_csv('dates.csv', index_col='timestamp', parse_dates=True)
>>> df.head()
name age
timestamp
2020-03-01 00:00:01 nick NaN
2020-03-01 00:00:01 john NaN
2020-03-01 00:00:02 nick NaN
2020-03-01 00:00:02 john NaN
2020-03-01 00:00:04 peter NaN
#resampling it at a frequency of 2 seconds
>>> resampled = df.resample('2s')
>>> type(resampled)
<class 'pandas.core.resample.DatetimeIndexResampler'>
#iterating over the resampler object and storing the sliced dfs in a dictionary
>>> df_dict = {}
>>> for i, (timestamp, df) in enumerate(resampled):
...     df_dict[i] = df
>>> df_dict[0]
name age
timestamp
2020-03-01 00:00:01 nick NaN
2020-03-01 00:00:01 john NaN
Now for some explanation...
resample() is great for rebinning DataFrames based on time (I use it often for downsampling time series data), but it can also be used simply to cut up the DataFrame, as you want to do. Iterating over the resampler object produced by df.resample() yields tuples of (name of the bin, DataFrame corresponding to that bin): e.g. the first tuple is (timestamp at the start of the first 2-second bin, data falling in those first 2 seconds). So to get the DataFrames out, we can loop over this object and store them somewhere, like a dict.
Note that this will produce every 2-second interval from the start to the end of the data, so many will be empty given your data. But you can add a step to filter those out if needed.
Additionally, you could manually assign each sliced DataFrame to its own variable, but that would be cumbersome (you would probably need a line for each 2-second bin, rather than one small loop). With a dictionary you can still refer to each DataFrame by name. You could also use an OrderedDict, a list, or whatever collection you prefer.
A couple points on your script:
Setting freq to "0.2T" gives 12-second bins (0.2 * 60); you probably want freq="2s" instead.
The example df1 and df2 are "out of phase": one is binned over 1-2 seconds (starting on an odd second), while the other covers 4-5 seconds (starting on an even second). So the date_range you mentioned wouldn't create those bins; it would create dfs covering either 0-1s, 2-3s, 4-5s, ... or 1-2s, 3-4s, 5-6s, ..., depending on which timestamp it started on.
For the latter point, you can use the base argument of .resample() to set the "phase" of the resampling. So in the case above, base=0 would start bins on even numbers, and base=1 would start bins on odds.
This is assuming you are okay with that type of binning - if you really want 1-2 seconds and 4-5 seconds to be in different bins, you would have to do something more complicated I believe.
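For example, a rough sketch of the odd-phase version (base works on older pandas; newer versions spell the same shift with the offset argument):
# bins starting on odd seconds, so 00:00:01 and 00:00:02 land in the same bin
resampled_odd = df.resample('2s', offset='1s')   # on older pandas: df.resample('2s', base=1)

df_dict = {}
for i, (timestamp, chunk) in enumerate(resampled_odd):
    if not chunk.empty:          # optionally drop the empty 2-second bins
        df_dict[i] = chunk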
I'm currently exploring the use of Random Forests to predict future values of occurrences (my ARIMA model gave me really bad forecasts, so I'm trying to evaluate other options). I'm fully aware that the bad results might be due to the fact that I don't have a lot of data and the quality isn't the greatest. My initial data consisted simply of the number of occurrences per date. I then added separate columns representing the day, month, year, day of the week (which was later one-hot encoded), and two columns with lagged values (one with the value observed the day before and another with the value observed two days before). The final data looks like this:
Count Year Month Day Count-1 Count-2 Friday Monday Saturday Sunday Thursday Tuesday Wednesday
196.0 2017.0 7.0 10.0 196.0 196.0 0 1 0 0 0 0 0
264.0 2017.0 7.0 11.0 196.0 196.0 0 0 0 0 0 1 0
274.0 2017.0 7.0 12.0 264.0 196.0 0 0 0 0 0 0 1
286.0 2017.0 7.0 13.0 274.0 264.0 0 0 0 0 1 0 0
502.0 2017.0 7.0 14.0 286.0 274.0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
I then trained a random forest with the count as the label (what I'm trying to predict) and all the rest as features. I also made a 70/30 train/test split, trained on the train data and then used the test set to evaluate the model (code below):
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
The results I obtained were pretty good: MAE=1.71 and Accuracy of 89.84%.
First question: is there any possibility that I'm crazily overfitting the data? I just want to make sure I'm not making some big mistake that's giving me better results than I should get.
Second question: with the model trained, how do I use RF to predict future values? My goal is to give weekly forecasts for the number of occurrences, but I'm kind of stuck on how to do that.
If someone who's a bit better and more experienced than me at this could help, I'd very much appreciate it! Thanks
Addressing your first question: random forests can tend to overfit, but that should be checked by comparing the MAE, MSE and RMSE on your training set against those on your test set. What do you mean by accuracy? Your R squared? That said, a usual way to work with models is to let them overfit at first, so you get a decent accuracy/MSE/RMSE, and later apply regularization techniques to deal with the overfitting, e.g. by setting a high min_samples_leaf or a low max_depth; a high n_estimators is also good.
Secondly, to use your model to predict future values, you use the exact same model you trained, applied to the dataset you want to make predictions on. Of course, the features given at training time must match the inputs given when forecasting. Also keep in mind that, as time passes, the newly observed data will be very valuable for improving your model: add it to your training dataset and retrain.
forecasting = rf.predict(dataset_to_be_forecasted)
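To make the weekly forecast concrete, here is a rough sketch of the feed-the-lags-back-in approach, assuming the column layout from the question (Year, Month, Day, Count-1, Count-2 and the one-hot weekday columns) and that rf was trained on those columns in that order; forecast_week is a helper name of my own:
import pandas as pd

weekdays = ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']
feature_cols = ['Year', 'Month', 'Day', 'Count-1', 'Count-2'] + weekdays

def forecast_week(rf, last_date, last_count, prev_count, n_days=7):
    """Predict one day at a time, feeding each prediction back in as the next day's lag."""
    preds = []
    for date in pd.date_range(last_date + pd.Timedelta(days=1), periods=n_days, freq='D'):
        row = {'Year': date.year, 'Month': date.month, 'Day': date.day,
               'Count-1': last_count, 'Count-2': prev_count}
        for day in weekdays:  # one-hot encode the weekday, matching the training columns
            row[day] = 1 if date.day_name() == day else 0
        y_hat = rf.predict(pd.DataFrame([row])[feature_cols])[0]
        preds.append((date, y_hat))
        prev_count, last_count = last_count, y_hat  # shift the lags forward
    return preds
Here last_date is the last observed date (a pd.Timestamp) and last_count / prev_count are the last two observed counts.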
I have a dataset like this, with data at 10-second intervals.
rec NO2_RAW NO2
0 2019-05-31 13:42:15 0.01 9.13
1 2019-05-31 13:42:25 17.0 51.64
2 2019-05-31 13:42:35 48.4 111.69
The timestamps are not consistent throughout the table. There are instances where, after a long gap, the timestamps start again from a new time. For example, after 2019-05-31 16:00:00 they resume at 2019-06-01 00:00:08.
I want to fill in the missing rows, using the expected time difference between two consecutive rows (10 s), and assign NaN values to the missing times.
I saw this example Search Missing Timestamp and display in python? but it is meant for consistent data. I want to calculate a 15-minute moving average from this data, so I need the data to be consistent.
Can someone please help?
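In case a sketch helps: assuming the timestamp column is called rec (as in the sample) and that NO2_15min is just a name I made up, you can put the data on a regular 10-second grid with resample (empty slots become NaN) and then take a 15-minute rolling mean:
import pandas as pd

df['rec'] = pd.to_datetime(df['rec'])
df = df.set_index('rec').sort_index()

# regular 10-second grid: slots with no reading become NaN rows
df_regular = df.resample('10S').mean()

# 15-minute moving average over the now-consistent index (NaN slots are skipped)
df_regular['NO2_15min'] = df_regular['NO2'].rolling('15min').mean()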
I need to fill the NA values with the mean of the three values preceding each NA.
This is my dataset:
RECEIPT_MONTH_YEAR NET_SALES
0 2014-01-01 818817.20
1 2014-02-01 362377.20
2 2014-03-01 374644.60
3 2014-04-01 NA
4 2014-05-01 NA
5 2014-06-01 NA
6 2014-07-01 NA
7 2014-08-01 46382.50
8 2014-09-01 55933.70
9 2014-10-01 292303.40
10 2014-10-01 382928.60
Is this dataset a .csv file or a dataframe? Is the NA a 'NaN' or a string?
import pandas as pd
import numpy as np
df=pd.read_csv('your dataset',sep=' ')
df = df.replace('NA', np.nan)
df.fillna(method='ffill',inplace=True)
You mention something about the mean of 3 values; the above simply forward-fills the last observation before the NaNs begin. This is often a good approach for forecasting (better than taking means in certain cases, if persistence is important).
ind = df['NET_SALES'].index[df['NET_SALES'].apply(np.isnan)]
Meanof3 = df['NET_SALES'].iloc[ind[0]-3:ind[0]].mean(skipna=True)
df['NET_SALES'] = df['NET_SALES'].fillna(Meanof3)
Maybe the answer can be generalised and improved if more is known about the dataset, for example if you always want to take the mean of the last 3 measurements before any NA. The above lets you find the indices that are NaN and then take the mean of the 3 values before them, while ignoring any NaNs.
This is simple but it works:
df_data.fillna(0, inplace=True)
for i in range(0, len(df_data)):
    if df_data.loc[i, 'NET_SALES'] == 0.00:
        condtn = df_data.loc[i-1, 'NET_SALES'] + df_data.loc[i-2, 'NET_SALES'] + df_data.loc[i-3, 'NET_SALES']
        df_data.loc[i, 'NET_SALES'] = condtn / 3
You could use fillna (assuming that your NA is already np.nan) and rolling mean:
import pandas as pd
import numpy as np
df = pd.DataFrame([818817.2,362377.2,374644.6,np.nan,np.nan,np.nan,np.nan,46382.5,55933.7,292303.4,382928.6], columns=["NET_SALES"])
df["NET_SALES"] = df["NET_SALES"].fillna(df["NET_SALES"].shift(1).rolling(3, min_periods=1).mean())
Out:
NET_SALES
0 818817.2
1 362377.2
2 374644.6
3 518613.0
4 368510.9
5 374644.6
6 NaN
7 46382.5
8 55933.7
9 292303.4
10 382928.6
If you want earlier imputed values to feed into the later means as well, I guess you'll need to use a loop.
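For completeness, here is a small sketch of such a loop, along the same lines as the earlier answer but testing for NaN directly instead of filling with 0 first, so that earlier imputed values do feed into later means:
import pandas as pd

s = df['NET_SALES'].copy()
for i in range(len(s)):
    if pd.isna(s.iloc[i]):
        # mean of the previous three values, which may themselves already be imputed
        s.iloc[i] = s.iloc[max(i - 3, 0):i].mean()
df['NET_SALES'] = s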