Drop bad data from dataset Tensorflow

I have a training pipeline using tf.data. The dataset contains some bad elements, in my case values of 0. How do I drop these elements based on their value? I want to remove them inside the pipeline during training, since the dataset is large.
Assume the following pseudocode:
def parse_function(element):
    height = element['height']
    if height <= 0: skip()  # How to skip this value?
    labels = element['label']
    features = {}
    features['height'] = height
    return features, labels
ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.map(parse_function)
Would something like ds.skip(1) conditioned on the feature value work, or should I assign such elements some sort of neutral weight/loss instead?

You can use tf.data.Dataset.filter:
def filter_func(elem):
    """ return True if the element is to be kept """
    return tf.math.greater(elem['height'], 0)
ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.filter(filter_func)
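To see the filter and the parsing step working together, here is a minimal sketch. It assumes TensorFlow 2.x with eager execution, and it uses a small in-memory dictionary of made-up values in place of ds_files; the names toy_data and keep_valid are hypothetical.

```python
import tensorflow as tf

# Made-up data standing in for the real files: two heights are invalid (<= 0).
toy_data = {
    "height": [1.8, 0.0, 1.6, -1.0, 1.7],
    "label": [0, 1, 0, 1, 1],
}
ds = tf.data.Dataset.from_tensor_slices(toy_data)

def keep_valid(elem):
    """Return True if the element should be kept."""
    return tf.math.greater(elem["height"], 0.0)

def parse_function(elem):
    """Split an element into (features, label) after filtering."""
    features = {"height": elem["height"]}
    return features, elem["label"]

# Filter first, so parse_function only ever sees valid elements.
clean_ds = ds.filter(keep_valid).map(parse_function)

heights = [float(features["height"].numpy()) for features, _ in clean_ds]
print(heights)
```

Because the filter runs lazily inside the pipeline, the bad elements are dropped as the data streams through, without a separate cleaning pass over the whole dataset.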

Assuming that element is a data frame in your code, then it would be:
def parse_function(element):
    element = element.query('height>0')
    labels = element['label']
    features = {}
    features['height'] = element['height']
    return features, labels
ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.map(parse_function)

Related

Going from code that compares one value to all other values to all values of all other values

I've written code that takes an n-gram size, a specific index, and a threshold, and returns the values whose similarity falls within that threshold. However, it currently only compares one set of tokens (at a given index) to the sets of tokens at all other indices. I want to compare the set of tokens at every index to the sets of tokens at all other indices. I don't think this is a difficult question, but python is my main language and I struggle with for loops a bit.
So essentially, the variable token in the function should actually iterate over each string in the column, and be compared with comp_token and the index call would be removed, since it would be iterating over all indices.
Let me know if that isn't clear enough and I will think more about how to say this: it is just difficult because the thing I am asking is the thing I am struggling with.
import numpy as np
import pandas as pd
import py_stringmatching as sm
data = ['Time', "NY Times", 'Atlantic']
ph = pd.DataFrame(data, columns=['companies'])
ph.reset_index(inplace=True)
jac = sm.Jaccard()
def predict_label(num, index, thresh):
    qg_num_tok = sm.QgramTokenizer(qval=num)
    companies = ph.companies.to_list()
    ids = ph['index']
    companies_qg_num_token = {}
    companies_id2index = {}
    for i in range(len(companies)):
        companies_id2index[i] = companies[i]
        companies_qg_num_token[i] = qg_num_tok.tokenize(companies[i])
    predicted_label = [1]
    token = companies_qg_num_token[index]  # tokens of the company at the chosen index
    for comp_name in ids[1:]:
        comp_token = companies_qg_num_token[comp_name]
        sim = jac.get_sim_score(token, comp_token)
        if sim > thresh:
            predicted_label.append(1)
        else:
            predicted_label.append(0)
    # companies_id2index must be equal to token number
    ph.loc[ph['index'] != companies_id2index[index], 'label'] = 0  # if not equal to index
    ph['prediction'] = predicted_label
    print(token)
    print(companies_id2index[index])
    return ph.query('prediction==1')
predict_label(2, 1, .5)  # num is the q-gram size (2 assumed here), not the dataframe
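One way to compare every entry with every other entry is to iterate over all index pairs rather than fixing one index. The sketch below is only an illustration under stated assumptions: it replaces py_stringmatching with a small hand-written q-gram tokenizer and Jaccard function (no padding characters, lowercased input), so the scores will differ from the library's, and the names qgrams, jaccard and similar_pairs are hypothetical.

```python
from itertools import combinations

def qgrams(text, q=2):
    """Return the set of character q-grams of text (no padding)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similar_pairs(names, q=2, thresh=0.5):
    """Compare every name with every other name exactly once."""
    tokens = [qgrams(name.lower(), q) for name in names]
    pairs = []
    for i, j in combinations(range(len(names)), 2):
        score = jaccard(tokens[i], tokens[j])
        if score > thresh:
            pairs.append((names[i], names[j], score))
    return pairs

companies = ["Time", "NY Times", "Atlantic"]
pairs = similar_pairs(companies, q=2, thresh=0.3)
print(pairs)
```

combinations yields each unordered pair once, so the index argument disappears and no entry is compared with itself.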

Array to columns in dataframe

I've built a functioning classification model following this tutorial.
I read in a CSV and then pass each row's text value into a function that calls the classification model to make a prediction. The function returns an array, which I need to put into columns in the dataframe.
Function:
def get_top_k_predictions(model, X_test, k):
    # get probabilities instead of predicted labels, since we want to collect the top k
    np.set_printoptions(suppress=True)
    probs = model.predict_proba(X_test)
    # GET TOP K PREDICTIONS BY PROB - note these are just indices
    best_n = np.argsort(probs, axis=1)[:, -k:]
    # GET CATEGORY OF PREDICTIONS
    preds = [
        [(model.classes_[predicted_cat], distribution[predicted_cat])
         for predicted_cat in prediction]
        for distribution, prediction in zip(probs, best_n)]
    preds = [item[::-1] for item in preds]
    return preds
Function Call:
for index, row in df.iterrows():
    category_test_features = category_loaded_transformer.transform(df['Text'].values.astype('U'))
    df['PREDICTION'] = get_top_k_predictions(category_loaded_model, category_test_features, 9)
This is the output from the function:
[[('Learning Activities', 0.001271131465669718),
('Communication', 0.002696299964802842),
('Learning Objectives', 0.002774964762863968),
('Learning Technology', 0.003557563051027678),
('Instructor/TAs', 0.004512712287403168),
('General', 0.006675929282872587),
('Learning Materials', 0.013051869950436862),
('Course Structure', 0.02781481160602757),
('Community', 0.9376447176288959)]]
I want the output to look like this in the end.

Your function returns a list that contains a list of tuples? Why the double-nested list? One way I can think of:
tmp = {}
for index, row in df.iterrows():
    predictions = get_top_k_predictions(...)
    tmp[index] = {
        key: value for key, value in predictions[0]
    }
tmp = pd.DataFrame(tmp).T
df = df.join(tmp)  # join returns a new dataframe, so keep the result
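The same idea can be sketched without a row loop, assuming the predictions for all rows are already available as one list of (category, probability) tuples per row. The data below is made up for illustration, and preds and pred_columns are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({"Text": ["great peers", "confusing slides"]})

# Made-up predictions in the shape returned by get_top_k_predictions:
# one inner list of (category, probability) tuples per input row.
preds = [
    [("Community", 0.94), ("Course Structure", 0.03)],
    [("Learning Materials", 0.81), ("Community", 0.10)],
]

# Each inner list becomes a dict, and each dict becomes one row of columns.
pred_columns = pd.DataFrame([dict(row) for row in preds], index=df.index)
result = df.join(pred_columns)
print(result)
```

Categories that are missing from a row's top-k list show up as NaN in that row, which may or may not be what you want for the final table.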

How to split dataframe according sub ID

I have a CSV file with 3 columns describing an image data set. The first column, 'ID', holds the patient ID; the second and third columns hold the side and the label, respectively. I would like to split this dataframe into train and test sets by patient ID, so that no patient ID appears in both sets; that is, an ID in the train set must not be present in the test set. I tried the code below:
# Split the dataframe into train and test sets
df_Datacopy = df_Data.copy() # copy the df
#df_Datacopy= df_Datacopy.sort_values(by=['ID'])
df_Datacopy = df_Datacopy.sample(frac=1)
train_df = df_Datacopy.sample(frac=0.80, random_state=0) # train split size 80%
# sorted according to ID
train_df = train_df.sort_values(by=['ID'])
# test split, obtained by removing the train index
test_df = df_Datacopy.drop(train_df.index)
# sorted according to ID
test_df = test_df.sort_values(by=['ID'])
u1 = np.unique(train_df['ID'])
u2 = np.unique(test_df['ID'])
print(set(u1).intersection(set(u2)))  # IDs present in both sets
I tried to split the data into test and train sets, but I see that some IDs are present in both. A code example would be a great help.

Simple Python Lists Approach
I would recommend plain Python lists as the simpler approach. Since you started with pandas, I'll also provide a pandas-based way to achieve something similar, though possibly with a worse outcome.
import random as rand
whole_dataset_list = df_copy.to_numpy().tolist()
patientid_list = df_copy['ID'].to_numpy().tolist()
patientid_set = list(set(patientid_list))
rand.shuffle(patientid_set)
# Change the numbers so that they represent an 80%/10%/10% slice of your patient IDs
train_set_by_patientID = patientid_set[0:800]  # 80%
val_set_by_patientID = patientid_set[800:900]  # 10%
test_set_by_patientID = patientid_set[900:]  # 10%
After splitting these lists, you can use them to obtain the final train/val/test splits:
id_col = df_copy.columns.get_loc('ID')  # position of the ID value within each row
train_set, val_set, test_set = [], [], []
for i in range(len(whole_dataset_list)):
    curr_pt_id = whole_dataset_list[i][id_col]
    if curr_pt_id in train_set_by_patientID:
        train_set.append(whole_dataset_list[i])
    elif curr_pt_id in val_set_by_patientID:
        val_set.append(whole_dataset_list[i])
    elif curr_pt_id in test_set_by_patientID:
        test_set.append(whole_dataset_list[i])
    else:
        raise RuntimeError("Patient ID is not assigned to any split")
Finally you can come back to a dataframe if you want as such:
train_df = pd.DataFrame(train_set, columns=df_copy.columns)
val_df = pd.DataFrame(val_set, columns=df_copy.columns)
test_df = pd.DataFrame(test_set, columns=df_copy.columns)
Second Option using Pandas Only:
Here sop_uid is a unique index. I am using a train/val/test split instead of a train/test split, but that can be changed easily.
dff.sort_values(by="patient_id", axis=0, inplace=True)
# groupby_agg is not a core pandas method; it appears to come from the pyjanitor extension
count_study = dff.groupby_agg(by='patient_id', agg='count', agg_column_name='sop_uid', new_column_name="count_instances")
df_Datacopy = count_study
train_df = df_Datacopy.sample(frac=0.90, weights='count_instances', random_state=0)  # train + val split size 90%
train_df = train_df.sort_values(by=['count_instances'], ascending=False)
# test split, obtained by removing the train index
test_df = df_Datacopy.drop(train_df.index)
# sorted according to count_instances
test_df = test_df.sort_values(by=['count_instances'], ascending=False)
# Sample again to separate train from val
train_df = train_df.sample(frac=0.89, weights='count_instances', random_state=0)  # 0.90 * 0.89, about 80% overall
train_df = train_df.sort_values(by=['count_instances'], ascending=False)
val_df = df_Datacopy.drop(train_df.index.append(test_df.index))
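The core of a split by patient is to shuffle the unique IDs rather than the rows, and then select rows by ID membership. Here is a minimal sketch of that idea with made-up data; the column names follow the question, everything else (the values, the 80/20 ratio, the fixed seed) is an assumption for illustration.

```python
import random
import pandas as pd

# Made-up data: several rows per patient ID.
df = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
    "side": list("LRLLRLRLRL"),
    "label": [0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
})

# Shuffle the unique patient IDs, not the rows.
ids = sorted(df["ID"].unique().tolist())
random.Random(0).shuffle(ids)

# Assign 80% of the patients (not 80% of the rows) to the train set.
n_train = int(0.8 * len(ids))
train_ids = set(ids[:n_train])

train_df = df[df["ID"].isin(train_ids)]
test_df = df[~df["ID"].isin(train_ids)]

# No patient can appear on both sides of the split.
overlap = set(train_df["ID"]) & set(test_df["ID"])
print(overlap)
```

Since patients have different numbers of rows, the row-level ratio will only approximate 80/20. If scikit-learn is available, GroupShuffleSplit in sklearn.model_selection implements the same group-wise idea.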

I recommend using a boolean mask to filter the dataset.
If you want a 50/50 split, checking whether the ID is even or odd might work.
Since you didn't provide any sample data or further detail on which criteria to split by, I suggest:
train_df = df[df.ID % 2 == 0]
test_df = df[df.ID % 2 != 0]
Is that what you wanted to achieve?
If not, please provide more information on the result you want.

index 0 is out of bounds for axis 0 with size 0 Python

PLEASE READ:
I have looked at the other answers related to this question and none of them solves my specific problem, so please read on.
I have the code below. It keeps the Title column and concatenates the remaining columns into one, so that a cosine similarity matrix can be built.
The main part is the recommendations function, which is supposed to take a Title as input and return the top 10 matches for that title. Instead I get the error "index 0 is out of bounds for axis 0 with size 0", and I have no idea why.
import pandas as pd
from rake_nltk import Rake
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7')
df = df[['Title','Genre','Director','Actors','Plot']]
df.head()
df['Key_words'] = ""
for index, row in df.iterrows():
    plot = row['Plot']
    # instantiating Rake, by default it uses english stopwords from NLTK
    # and discards all punctuation characters as well
    r = Rake()
    # extracting the words by passing the text
    r.extract_keywords_from_text(plot)
    # getting the dictionary with key words as keys and their scores as values
    key_words_dict_scores = r.get_word_degrees()
    # assigning the key words to the new column for the corresponding movie
    row['Key_words'] = list(key_words_dict_scores.keys())
# dropping the Plot column
df.drop(columns = ['Plot'], inplace = True)
# instantiating and generating the count matrix
df['bag_of_words'] = df[df.columns[1:]].apply(lambda x: ' '.join(x.astype(str)), axis=1)
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim
indices = pd.Series(df.index)
# defining the function that takes in movie title
# as input and returns the top 10 recommended movies
def recommendations(title, cosine_sim = cosine_sim):
    #print(title)
    # initializing the empty list of recommended movies
    recommended_movies = []
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    print('idx is '+ idx)
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
    return recommended_movies

This line:
idx = indices[indices == title].index[0]
will fail if there is no match. For example:
df.loc[df['Title']=='This is not a valid title'].index[0]
raises:
IndexError: index 0 is out of bounds for axis 0 with size 0
You need to confirm that the title you pass in is actually in the data before trying to access anything associated with it. Note that the in operator on a Series tests the index labels, not the values, so the check below uses indices.values:
def recommendations(title, cosine_sim = cosine_sim):
    #print(title)
    # initializing the empty list of recommended movies
    recommended_movies = []
    if title not in indices.values:
        raise KeyError("title is not in indices")
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    print('idx is', idx)
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
    return recommended_movies
This loop also seems to be doing nothing:
for index, row in df.iterrows():
    plot = row['Plot']
If you just want a single plot record with which to do some development, try:
plot = df['Plot'].sample(n=1)
Finally, recommendations uses the global variable indices. In general this is bad practice: if indices changes outside the scope of recommendations, the function might break. I would consider refactoring this to be a little less brittle overall.
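In the posted code, indices is built from the default integer index, so comparing it with a title string can never match. One possible way around this, sketched below with made-up data, is to build a lookup from title to row position and to check membership against its index; find_position is a hypothetical helper, not part of the original code.

```python
import pandas as pd

# Made-up stand-in for the movie dataframe.
df = pd.DataFrame({
    "Title": ["Alien", "Heat", "Up"],
    "Genre": ["Horror", "Crime", "Family"],
})

# Lookup Series: the titles are the index labels, the row positions are the values.
indices = pd.Series(df.index, index=df["Title"])

def find_position(title, indices=indices):
    """Return the row position for title, or raise a clear error if it is unknown."""
    if title not in indices.index:
        raise KeyError(f"{title!r} is not a known title")
    return int(indices[title])

pos = find_position("Heat")

try:
    find_position("Not a real title")
    missing_raised = False
except KeyError:
    missing_raised = True

print(pos, missing_raised)
```

This sketch assumes titles are unique; with duplicate titles, indices[title] would return several positions and the helper would need to pick one.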

Pandas dataframe update keys

I'm unable to update a pandas DataFrame using the DataFrame.update() method; I always get None as the result.
The DataFrame has keys because it is the result of concatenating 2 DataFrames.
I calculate the z-score for the float32 columns only, and then update the DataFrame with the new values for those columns.
class MySimpleScaler(object):
    def __init__(self):
        self._means = None
        self._stds = None
    def preprocess(self, data):
        """Calculate z-score for dataframe"""
        if self._means is None:  # During training only
            self._means = data.select_dtypes('float32').mean()
        if self._stds is None:  # During training only
            self._stds = data.select_dtypes('float32').std()
            if not self._stds.all():
                raise ValueError('At least one column has standard deviation of 0.')
        z1 = (data.select_dtypes('float32') - self._means) / self._stds
        return data.update(z1)
all_x = pd.concat([train_x, eval_x], keys=['train', 'eval'])
scaler = MySimpleScaler()
all_x = scaler.preprocess(all_x)
train_x, eval_x = all_x.xs('train'), all_x.xs('eval')
When I run data.update(z1), it always returns None.
I need to reuse the scaler object later to calculate z-scores for new dataframes.

Just as when you add to a set, update is an in-place operation, and in-place operations return None. The data will be updated, but the value returned by the call will be None.

DataFrame update is an in-place operation. It will always return None, but the dataframe will be modified.
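A small sketch of what this means in practice, using made-up data: call update for its side effect and then return the dataframe itself. The function name scale_in_place is hypothetical and only mirrors the shape of the preprocess method from the question.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": np.array([1.0, 2.0, 3.0], dtype="float32"),
})

def scale_in_place(frame):
    """Z-score the float32 columns of frame in place and return the frame."""
    floats = frame.select_dtypes("float32")
    z1 = (floats - floats.mean()) / floats.std()
    returned = frame.update(z1)  # modifies frame; the return value is None
    assert returned is None
    return frame  # return the modified dataframe, not the result of update

scaled = scale_in_place(data)
print(scaled)
```

Because the change happens in place, scaled and data are the same object here; pass in a copy first if the unscaled values are still needed.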
