How to split a dataframe according to subject ID - python

I have a CSV file with 3 columns containing an image data set. The 1st column, 'ID', represents the patient ID; the 2nd and 3rd columns represent the side and the label of the data set, respectively. I would like to split this dataframe into a test set and a train set according to patient ID, such that no patient ID is repeated in both sets; that is, a train ID must not be present in the test set. I am using the code below:
# Splitting the dataframe into train and test sets
import numpy as np

df_Datacopy = df_Data.copy()  # copy the df
# df_Datacopy = df_Datacopy.sort_values(by=['ID'])
df_Datacopy = df_Datacopy.sample(frac=1)  # shuffle the rows
train_df = df_Datacopy.sample(frac=0.80, random_state=0)  # train split size: 80%
# sorted according to ID
train_df = train_df.sort_values(by=['ID'])
# test split, obtained by removing the train index
test_df = df_Datacopy.drop(train_df.index)
# sorted according to ID
test_df = test_df.sort_values(by=['ID'])
u1 = np.unique(train_df['ID'])
u2 = np.unique(test_df['ID'])
# IDs that appear in BOTH sets -- this should be empty, but is not
print(set(u1).intersection(set(u2)))
I tried to split the test and train sets, but the problem is that I have seen some IDs present in both the test and the train set. It would be a great help if I could get some assistance, including a code example.
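For reference, scikit-learn ships a splitter for exactly this group-disjoint case. A minimal sketch using GroupShuffleSplit, assuming df_Data is the dataframe from the question:

from sklearn.model_selection import GroupShuffleSplit

# One 80/20 split in which every patient ID lands entirely in train or in test
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
train_idx, test_idx = next(gss.split(df_Data, groups=df_Data['ID']))
train_df = df_Data.iloc[train_idx]
test_df = df_Data.iloc[test_idx]

# Sanity check: no ID appears in both sets
assert set(train_df['ID']).isdisjoint(set(test_df['ID']))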

Simple Python Lists Approach
I would recommend using simple Python lists for this as the preferred and simpler approach. Since you started with pandas, I'll also provide a way to use pandas methods to achieve something similar, but with a possibly worse outcome.
import random as rand

whole_dataset_list = df_copy.to_numpy().tolist()
patientid_list = df_copy['ID'].to_numpy().tolist()
patientid_set = list(set(patientid_list))  # unique patient IDs

rand.shuffle(patientid_set)
# Cut the shuffled unique IDs into 80/10/10 slices
n = len(patientid_set)
train_set_by_patientID = patientid_set[:int(n * 0.8)]            # 80%
val_set_by_patientID = patientid_set[int(n * 0.8):int(n * 0.9)]  # 10%
test_set_by_patientID = patientid_set[int(n * 0.9):]             # 10%
After splitting these lists, you can use them to obtain the final train/val/test splits as follows:
train_set, val_set, test_set = [], [], []
for row in whole_dataset_list:
    curr_pt_id = row[0]  # assuming 'ID' is the first column
    if curr_pt_id in train_set_by_patientID:
        train_set.append(row)
    elif curr_pt_id in val_set_by_patientID:
        val_set.append(row)
    elif curr_pt_id in test_set_by_patientID:
        test_set.append(row)
    else:
        raise RuntimeError("Row does not belong to any split")
Finally, you can convert back to dataframes if you want, like so:
train_df = pd.DataFrame(train_set, columns=df_copy.columns)
val_df = pd.DataFrame(val_set, columns=df_copy.columns)
test_df = pd.DataFrame(test_set, columns=df_copy.columns)
Second Option, Using Pandas Only:
Here sop_uid is a unique index. I am using a train/val/test split instead of a train/test split, but that can easily be changed.
dff = dff.sort_values(by="patient_id")
# count of instances (rows) per patient, broadcast back onto every row
dff["count_instances"] = dff.groupby("patient_id")["sop_uid"].transform("count")
df_Datacopy = dff.copy()
# train split: 90%, weighted so that rows of patients with many instances tend to be drawn together
train_df = df_Datacopy.sample(frac=0.90, weights='count_instances', random_state=0)
train_df = train_df.sort_values(by=['count_instances'], ascending=False)
# test split, obtained by removing the train index
test_df = df_Datacopy.drop(train_df.index)
# sorted according to count_instances
test_df = test_df.sort_values(by=['count_instances'], ascending=False)
# sample again to carve a validation split out of the train split (0.90 * 0.89 ~ 80% of the total)
train_df = train_df.sample(frac=0.89, weights='count_instances', random_state=0)
train_df = train_df.sort_values(by=['count_instances'], ascending=False)
val_df = df_Datacopy.drop(train_df.index.append(test_df.index))
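Note that this pandas-only option samples rows rather than patients, so it can still place the same patient ID in more than one split. If a strictly patient-disjoint split is wanted in pandas alone, a short sketch (reusing df_Data and the 'ID' column from the question):

# Sample 80% of the unique patient IDs for train, then filter rows by membership
train_ids = df_Data['ID'].drop_duplicates().sample(frac=0.80, random_state=0)
train_df = df_Data[df_Data['ID'].isin(train_ids)]
test_df = df_Data[~df_Data['ID'].isin(train_ids)]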

I recommend using a boolean mask to filter the dataset.
If you want to split 50/50, checking whether the ID is even or odd might work.
Since you didn't provide any sample data or further detail on the criteria to split by, I suggest:
train_df= df[df.ID % 2 == 0]
test_df = df[df.ID % 2 != 0]
Is that what you wanted to achieve?
If not, please provide more information on the result you want.

Related

I'm getting a ValueError: unable to convert str to float 'XX'

Some background: I'm taking a machine learning class on customer segmentation. My coding environment is pandas (Python) and sklearn. I have two datasets, a general population dataset and a customer demographics dataset, with 85 identical columns.
I'm calling a function I created to run preprocessing steps on the 'customers' data, steps that were previously run outside this function on the general population dataset. Within the function is a loop that replaces missing values with np.nan. Here is the loop:
# replacing missing data with NaNs
# feat_sum is a dataframe (feature_summary) of coded values
for i in range(len(feat_sum)):
    mi_unk = feat_sum.iloc[i]['missing_or_unknown']  # locate column and values
    mi_unk = mi_unk.strip('[').strip(']').split(',')  # strip the brackets, then split
    mi_unk = [int(val) if (val != '' and val != 'X' and val != 'XX') else val for val in mi_unk]
    if mi_unk != ['']:
        featsum_attrib = feat_sum.iloc[i]['attribute']
        df = df.replace({featsum_attrib: mi_unk}, np.nan)
Toward the end of the function I'm engineering new variables:
#Investigate "CAMEO_INTL_2015" and engineer two new variables.
df['WEALTH'] = df['CAMEO_INTL_2015']
df['LIFE_STAGE'] = df['CAMEO_INTL_2015']
mf_wealth_dict = {'11':1, '12':1, '13':1, '14':1, '15':1, '21':2, '22':2, '23':2, '24':2, '25':2, '31':3,'32':3, '33':3, '34':3, '35':3, '41':4, '42':4, '43':4, '44':4, '45':4, '51':5, '52':5, '53':5, '54':5, '55':5}
mf_lifestage_dict = {'11':1, '12':2, '13':3, '14':4, '15':5, '21':1, '22':2, '23':3, '24':4, '25':5, '31':1, '32':2, '33':3, '34':4, '35':5, '41':1, '42':2, '43':3, '44':4, '45':5, '51':1, '52':2, '53':3, '54':4, '55':5}
#replacing the 'WEALTH' and 'LIFE_STAGE' columns with values from the dictionaries
df['WEALTH'].replace(mf_wealth_dict, inplace=True)
df['LIFE_STAGE'].replace(mf_lifestage_dict, inplace=True)
Near the end of the project code, I'm running an imputer to replace the np.nans; this ran successfully on the general population dataset (azdias):
az_imp = Imputer(strategy="most_frequent")
azdias_cleaned_imp = pd.DataFrame(az_imp.fit_transform(azdias_cleaned_encoded))
So when I call the clean_data function, passing the 'customers' dataframe (clean_data(customers)), it gives me ValueError: could not convert str to float: 'XX' on this line:
customers_imp = Imputer(strategy="most_frequent")
---> 19 customers_cleaned_imputed = pd.DataFrame(customers_imp.fit_transform(customers_cleaned_encoded))
In the data dictionary for the CAMEO_INTL_2015 column of the dataset, the very last category is 'XX': unknown. When I run a value count on the WEALTH and LIFE_STAGE columns, there are 124 occurrences of 'XX' in those two columns. No other columns in the dataset have the 'XX' value except these. Again, I did not run into this problem with the other dataset. I know this is wordy, but any help is appreciated, and I can provide the project code as well.
A mentor and I tried troubleshooting by looking at all the steps that were performed on both datasets, to no avail. I was expecting the 'XX' values to be dealt with by the loop I mentioned earlier.
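One plausible explanation, hedged since the full clean_data function isn't shown: the loop only replaces the values listed in missing_or_unknown for the original attributes, while WEALTH and LIFE_STAGE are copied from CAMEO_INTL_2015 afterwards, so any 'XX' entries survive into the new columns and the numeric imputer chokes on them. A minimal sketch of a fix, mapping 'XX' to NaN right after the engineering step:

import numpy as np

# 'XX' means unknown in the CAMEO_INTL_2015 data dictionary, so treat it as
# missing before imputation (the wealth/life-stage dicts have no 'XX' key)
df['WEALTH'] = df['WEALTH'].replace('XX', np.nan)
df['LIFE_STAGE'] = df['LIFE_STAGE'].replace('XX', np.nan)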

Grouped Time Series forecasting with scikit-hts

I am trying to forecast sales for multiple time series I took from Kaggle's Store Item Demand Forecasting Challenge. It consists of long-format time series for 10 stores and 50 items, resulting in 500 time series stacked on top of each other. For each store and each item, I have 5 years of daily records with weekly and annual seasonalities.
In total there are: 365.2 days * 5 years * 10 stores * 50 items = 913,000 records.
From my understanding, based on what I've read so far on hierarchical and grouped time series, the whole dataframe could be structured as a grouped time series rather than a strict hierarchical time series, since aggregation can be done at the store or item level interchangeably.
I want to find a way to forecast all 500 time series (for store1_item1, store1_item2, ..., store10_item50) for the next year (from 01-jan-2015 to 31-dec-2015) using the scikit-hts library and its AutoArimaModel function, which is a wrapper around pmdarima's AutoArima.
To handle the two levels of seasonality, I added Fourier terms as exogenous features to deal with the annual seasonality, while auto_arima deals with the weekly seasonality.
My problem is that I get an error during the prediction step. Here's the error message:
ValueError: Provided exogenous values are not of the appropriate shape. Required (365, 4), got (365, 8).
I assume something is wrong with the exogenous dictionary, but I do not know how to solve the issue, as I'm using scikit-hts for the first time. To do this, I followed the official documentation of scikit-hts here.
EDIT:
I had not seen that a similar bug had been reported on GitHub. After implementing the proposed fix locally, I could get some results. However, even though the code now runs without errors, some of the forecasts are negative, as raised in the comments below this post, and we even get disproportionate values for the positive ones.
Here are the plots for all the combinations of store and item. You can see that this seems to work for only one combination.
df.loc['2014','store_1_item_1'].plot()
predictions.loc['2015','store_1_item_1'].plot()
df.loc['2014','store_1_item_2'].plot()
predictions.loc['2015','store_1_item_2'].plot()
df.loc['2014','store_2_item_1'].plot()
predictions.loc['2015','store_2_item_1'].plot()
df.loc['2014','store_2_item_2'].plot()
predictions.loc['2015','store_2_item_2'].plot()
Complete code:
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
import hts
from hts.hierarchy import HierarchyTree
from hts.model import AutoArimaModel
from hts import HTSRegressor
# read data from the csv file
data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Train/Test split with reduced size
train_data = data.query('store == [1,2] and item == [1, 2]').loc['2013':'2014']
test_data = data.query('store == [1,2] and item == [1, 2]').loc['2015']
# Create the stores time series
# For each timestamp group by store and apply sum
stores_ts = train_data.drop(columns=['item']).groupby(['date','store']).sum()
stores_ts = stores_ts.unstack('store')
stores_ts.columns = stores_ts.columns.droplevel(0)
stores_ts.columns = ['store_' + str(i) for i in stores_ts.columns]
# Create the items time series
# For each timestamp group by item and apply sum
items_ts = train_data.drop(columns=['store']).groupby(['date','item']).sum()
items_ts = items_ts.unstack('item')
items_ts.columns = items_ts.columns.droplevel(0)
items_ts.columns = ['item_' + str(i) for i in items_ts.columns]
# Create the stores_items time series
# For each timestamp group by store AND by item and apply sum
store_item_ts = train_data.pivot_table(index= 'date', columns=['store', 'item'], aggfunc='sum')
store_item_ts.columns = store_item_ts.columns.droplevel(0)
# Rename the columns as store_i_item_j
col_names = []
for i in store_item_ts.columns:
    col_name = 'store_' + str(i[0]) + '_item_' + str(i[1])
    col_names.append(col_name)
store_item_ts.columns = col_names
# Create a new dataframe and add the root level of the hierarchy as the sum of all stores (or all items)
df = pd.DataFrame()
df['total'] = stores_ts.sum(1)
# Concatenate all created dataframes into one df
# df is the dataframe that will be used for model training
df = pd.concat([df, stores_ts, items_ts, store_item_ts], axis=1)
# Build fourier terms for train and test sets
four_terms = FourierFeaturizer(365.2, 1)
# Build the exogenous features dataframe for training data
exog_train_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog = four_terms.fit_transform(train_data.query(f'store == {i} and item == {j}').sales)
        exog.columns = [f'store_{i}_item_{j}_' + x for x in exog.columns]
        exog_train_df = pd.concat([exog_train_df, exog], axis=1)
exog_train_df['date'] = df.index
exog_train_df.set_index('date', inplace=True)
# add the exogenous features dataframe to df before training
df = pd.concat([df, exog_train_df], axis=1)
# Build the exogenous features dataframe for test set
# It will be used only when using model.predict()
exog_test_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog_test = four_terms.fit_transform(test_data.query(f'store == {i} and item == {j}').sales)
        exog_test.columns = [f'store_{i}_item_{j}_' + x for x in exog_test.columns]
        exog_test_df = pd.concat([exog_test_df, exog_test], axis=1)
# Build the hierarchy of the Grouped Time Series
stores = list(stores_ts.columns)
items = list(items_ts.columns)
store_items = col_names
# Exogenous features mapping
exog_store_items = {e: [v for v in exog_train_df.columns if v.startswith(e)] for e in store_items}
exog_stores = {e:[v for v in exog_train_df.columns if v.startswith(e)] for e in stores}
exog_items = {e:[v for v in exog_train_df.columns if v.find(e) != -1] for e in items}
exog_total = {'total':[v for v in exog_train_df.columns if v.find('FOURIER') != -1]}
# Merge all dictionaries
exog_to_merge = [exog_store_items, exog_stores, exog_items, exog_total]
exogenous = {k:v for x in exog_to_merge for k,v in x.items()}
# Build hierarchy
total = {'total': stores + items}
store_h = {k: [v for v in store_items if v.startswith(k)] for k in stores}
hierarchy = {**total, **store_h}
# Hierarchy tree automatically created by hts
ht = HierarchyTree.from_nodes(nodes=hierarchy, df=df, exogenous=exogenous)
# Instantiate the auto arima model using HTSRegressor
autoarima = HTSRegressor(model='auto_arima', D=1, m=7, seasonal=True, revision_method='OLS', n_jobs=12)
# Fit the model to the training df that includes time series and exog_train_df
# Set exogenous param to the previously built dictionary
model = autoarima.fit(df, hierarchy, exogenous=exogenous)
# Make predictions
# Set the exogenous_df param
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365)
Other approaches I thought of, and that I have already implemented successfully for a single series (for store 1 and item 1, for example):
TBATS applied to each series independently, inside a loop across all 500 time series
auto_arima (SARIMAX) with exogenous features (Fourier terms to deal with the weekly and annual seasonalities), applied to each series independently inside a loop across all 500 time series (a rough sketch of this approach follows below)
What do you think of these approaches? Do you have other suggestions on how to scale ARIMA to multiple time series?
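A sketch of how that per-series loop could look, assuming a recent pmdarima in which exogenous regressors are passed as X (older versions use an exogenous argument instead), and reusing the column names built above:

import pmdarima as pm

# One independent SARIMAX per bottom-level series: m=7 handles the weekly
# seasonality, the matching Fourier columns handle the annual one
forecasts = {}
for col in store_item_ts.columns:
    exog = exog_train_df[[c for c in exog_train_df.columns if c.startswith(col)]]
    exog_future = exog_test_df[[c for c in exog_test_df.columns if c.startswith(col)]]
    fit = pm.auto_arima(df[col], X=exog, m=7, seasonal=True, suppress_warnings=True)
    forecasts[col] = fit.predict(n_periods=365, X=exog_future)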
I also want to try LSTMs, but I'm new to data science and deep learning and do not know how to prepare the data. Should I keep the data in its original form (long format) and apply one-hot encoding to the train_data['store'] and train_data['item'] columns, or should I start with the df I ended up with here?
I hope this helped you fix the issue with the exogenous regressors. To handle the negative forecasts, I would suggest trying a square-root transformation.
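A minimal sketch of that idea, reusing the names from the code above. This is only an assumption about the wiring, with one caveat worth noting: square roots do not sum, so the hierarchy no longer aggregates exactly on the transformed scale.

import numpy as np

# Fit on sqrt-transformed series, then square the forecasts to invert the
# transform -- squared values can never be negative
value_cols = [c for c in df.columns if 'FOURIER' not in c]
df_sqrt = df.copy()
df_sqrt[value_cols] = np.sqrt(df_sqrt[value_cols])

model = autoarima.fit(df_sqrt, hierarchy, exogenous=exogenous)
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365)
predictions[value_cols] = predictions[value_cols] ** 2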

How can I extract the newly added rows after SMOTE (imblearn module)

Is it possible to extract the newly added rows from a pandas dataframe that were created by imblearn's SMOTE function?
I think I figured it out. Apparently they are appended at the end of the dataframe returned by fit_resample.
My target is "DIED":
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTENC
from imblearn.combine import SMOTETomek

smotez = SMOTENC([10, 11], random_state=555, k_neighbors=10)
smote_tomek = SMOTETomek(random_state=555, smote=smotez, n_jobs=-1)
X_train_new, y_train_new = smote_tomek.fit_resample(X_train, y_train)
train_data_new = pd.concat([X_train_new.iloc[1:], y_train_new], axis=1)
train_data_new.dropna(inplace=True)
smote_data = train_data_new.iloc[len(train_data) - 1:, ]
print("Y_train_smote:\n", np.asarray(np.unique(smote_data['DIED'], return_counts=True)).T, smote_data['DIED'].mean())
As you can see, all rows are of the minority class ("DIED")
Y_train_smote:
[[ 1 91936]] 1.0
Double-checking, the expression below should return 0:
print(len(smote_data) + len(X_train) - len(X_train_new))
0
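As a side note: with plain SMOTE (no Tomek-link cleaning, which can also remove original rows), a simpler slice should work, since fit_resample returns the original samples first and appends the synthetic ones. A sketch, assuming a recent imblearn that returns dataframes when given dataframes:

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=555)
X_res, y_res = sm.fit_resample(X_train, y_train)
# original rows come back first, so everything past len(X_train) is synthetic
synthetic_X = X_res.iloc[len(X_train):]
synthetic_y = y_res.iloc[len(y_train):]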

Randomization of a list with conditions using Pandas

I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential piece of code, and that's actually my problem. My goal is to create a somewhat-automated script, probably including a for-loop (which I've tried, unsuccessfully).
The main aim is to create a randomization loop which takes the original dataset, shown in the screenshot below:
[screenshot of the original dataset]
From this data set, it picks rows randomly, one by one, and saves them to another Excel list. The point is that each pick's position01 and position02 values must never match either of the previous pick's values in those two columns. That should eventually create an Excel sheet of randomized rows in which every row shares no position01/position02 values with the row before it: row 2 should not include any of those values from row 1, row 3 should not contain the values of row 2, etc. It should also iterate over the range of the list length, which is 0-11. The Excel output is also important, since I need the rest of the columns; I just need to shuffle the order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample()  # randomly select one row from imageDataset
emptyExcel = pd.concat([emptyExcel, randomPick])  # append the row (DataFrame.append was removed in pandas 2.x)
randomPickIndex = randomPick.index.tolist()  # get the index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex)  # delete the row selected before
# getting the raw values from the row; 'position01'/'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick['position02'].values[0]
# getting a dataset that excludes rows containing the first pick's position01/position02 values
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from the dataset that excludes the first pick
randomPick2 = isit.sample()
# save it in the df
emptyExcel = pd.concat([emptyExcel, randomPick2])
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the comparison of the raw values against the dataset that no longer includes the earlier rows:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT: pick - save - match - pick again... until the end of the dataset (indices 0-11)
In the end I used a solution provided by David Bridges (post from Sep 19, 2019) on the PsychoPy forums. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I just adjusted the condition in the for-loop to my case, like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer, and hopefully I did not spam this over here too much.
import itertools as it
import random
import pandas as pd
# list of pair of numbers
tmp1 = list(it.permutations(range(6), 2))
df = pd.DataFrame(tmp1, columns=["position01","position02"])
df1 = pd.DataFrame()
i = random.choice(df.index)
df1 = pd.concat([df1, df.loc[[i]]], ignore_index=True)
df = df.drop(index=i)
while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"] != val[0]) & (df["position01"] != val[1]) & (df["position02"] != val[0]) & (df["position02"] != val[1])]
    if tmp.empty:  # looped 10000 times; this was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = pd.concat([df1, df.loc[[i]]], ignore_index=True)
    df = df.drop(index=i)
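As a design note on the answer above: the greedy pass can in principle dead-end (tmp empty), even if that never happened in 10000 runs here. A sketch of a wrapper that simply restarts on a dead end, using a hypothetical helper shuffle_no_repeats and the same position01/position02 columns:

import pandas as pd

def shuffle_no_repeats(df, max_tries=100):
    # Greedy reshuffle with restarts: retry whenever the greedy pass dead-ends
    for _ in range(max_tries):
        remaining = df.sample(frac=1)  # start from a random order
        rows = [remaining.iloc[0]]
        remaining = remaining.iloc[1:]
        while not remaining.empty:
            last = rows[-1]
            ok = remaining[
                (remaining["position01"] != last["position01"])
                & (remaining["position01"] != last["position02"])
                & (remaining["position02"] != last["position01"])
                & (remaining["position02"] != last["position02"])
            ]
            if ok.empty:
                break  # dead end -- restart from scratch
            pick = ok.sample()
            rows.append(pick.iloc[0])
            remaining = remaining.drop(index=pick.index)
        else:
            return pd.DataFrame(rows).reset_index(drop=True)
    raise RuntimeError("no valid ordering found in max_tries attempts")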

DataFrame has float values but calling to_csv() on it generates an empty CSV

Consider the following code:
from collections import OrderedDict
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

# all_features and all_labels are defined earlier in the project
columns = ['tf-idf', 'bag_of_words']
index = ['MultinomialNB', 'LinearSVC', 'LogisticRegression',
         'DecisionTreeClassifier', 'MLPClassifier']
df = pd.DataFrame(columns=columns, index=index)
estimators_dict = OrderedDict([('MultiNomialNB', MultinomialNB()),
                               ('LinearSVC', LinearSVC()),
                               ('LogisticRegression', LogisticRegression()),
                               ('DecisionTreeClassifier', DecisionTreeClassifier()),
                               ('MLPClassifier', MLPClassifier(max_iter=10))])
transformers_dict = OrderedDict([('tf-idf', TfidfVectorizer(max_features=500)),
                                 ('bag_of_words', CountVectorizer())])
steps = []
for transformer_name, transformer in transformers_dict.items():
    steps.append((transformer_name, transformer))
    for estimator_name, estimator in estimators_dict.items():
        steps.append((estimator_name, estimator))
        model = Pipeline(steps)
        predicted_labels = cross_val_predict(model, all_features, all_labels, cv=5)
        # f1 is a float
        f1 = f1_score(all_labels, predicted_labels, average='weighted')
        # writing to the DataFrame
        df[transformer_name][estimator_name] = round(f1, 2)
        # this correctly shows the value just written
        print(str(df[str(transformer_name)][str(estimator_name)]))  # line a
        del steps[1]
    del steps[0]
# but writing to csv creates a file with no values whatsoever
df.to_csv('classification_results_f1score')  # line b
Quick context: in my classification task I am using a set of feature transformers and a set of sklearn classifiers. I am running all possible combinations of the two sets to see which model performs best.
I am calculating the f1-score (a float value) of each model and storing it in a dataframe. The value is successfully written to the dataframe; I am able to verify this by accessing it (line a).
But after all the model runs are over (at the end of both for loops), when I write the dataframe to a CSV, it generates a CSV as follows:
,tf-idf,bag_of_words
MultinomialNB,,
LinearSVC,,
LogisticRegression,,
DecisionTreeClassifier,,
MLPClassifier,,
What seems to be the issue here? Why are the values not showing up in the csv?
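Two things worth checking, offered as guesses since the surrounding code isn't shown in full: the estimator key 'MultiNomialNB' does not match the index label 'MultinomialNB' (note the capital N), so that assignment targets a label that isn't in the index; and df[col][row] = ... is chained assignment, which pandas does not guarantee will write through to the original dataframe. A sketch of a safer assignment to drop into the inner loop:

# inside the inner loop, instead of df[transformer_name][estimator_name] = ...
# .loc assigns directly on the dataframe (no chained indexing) and makes a
# label mismatch such as 'MultiNomialNB' vs 'MultinomialNB' easy to spot
df.loc[estimator_name, transformer_name] = round(f1, 2)
print(df.loc[estimator_name, transformer_name])  # equivalent of line a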
