I am new to Python, so I apologize in advance if this is too simple. I cannot find anything on this, and this question did not help.
My code is:
from collections import Counter
from imblearn.over_sampling import SMOTE

# Split data
y = starbucks_smote.iloc[:, -1]
X = starbucks_smote.drop('label', axis = 1)
# Count labels by type
counter = Counter(y)
print(counter)
Counter({0: 9634, 1: 2895})
# Transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
# Print the oversampled dataset
counter = Counter(y)
print(counter)
Counter({0: 9634, 1: 9634})
How do I save the oversampled dataset for future work?
I tried
data_res = np.concatenate((X, y), axis = 1)
data_res.to_csv('sample_smote.csv')
Got an error
ValueError: all the input arrays must have same number of dimensions,
but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)
Appreciate any tips!
You may create a dataframe:
data_res = pd.DataFrame(X)
data_res['y'] = y
and then save data_res to CSV.
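For example (index=False just keeps pandas from writing the row index as an extra column):
data_res.to_csv('sample_smote.csv', index=False)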
A solution based on concatenating numpy arrays is also possible, but np.vstack is needed to make the dimensions compatible:
data_res = np.concatenate((X, np.vstack(y)), axis = 1)
data_res = pd.DataFrame(data_res)
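and save it with to_csv as above. A minor variant, assuming y is a one-dimensional array: reshaping y into a column vector does the same job as np.vstack:
data_res = np.concatenate((X, np.asarray(y).reshape(-1, 1)), axis = 1)
data_res = pd.DataFrame(data_res)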
I'm trying to store 60 values in x and the next one in y, then shift up by 1 and store the next 60 values in x and the one after them in y. But the for loop only works once for the x values; the y values somehow do get stored properly. The size of my dataset is not the problem, as storing the y values works. It's just the x values that don't get stored: when debugging, the x sets are just empty arrays except the first one. What am I doing wrong?
import math
import numpy as np
import pandas as pd
df = pd.read_csv('dataset.csv')
print(df)
dataset = df.values
data_len = math.ceil(len(dataset) * .1)
data = dataset[0:data_len , :]
print(data)
x = []
y = []
for i in range(60, len(data)):
    x.append(data[60-i:i, 0])
    y.append(data[i, 0])
I think you should try to avoid using a for loop; in numpy it is often faster to create masks and work with those. If I understand your question correctly, you want to store 0-59 in x, then 60 in y, then 61-119 in x, then 120 in y, and so on.
If that is the correct understanding, I would try this instead:
dataset = np.random.randint(1, 1000, size=1000)  # generating an example
mask = np.zeros(len(dataset), dtype=bool)
mask[60::60] = 1  # indices of the ys
y = dataset[mask]
x = dataset[~mask]
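If, on the other hand, the loop in the question is meant to build overlapping windows (each x entry holding the previous 60 values), then the bug is in the slice itself: data[60-i:i, 0] becomes an empty slice once i exceeds 60, because the start index goes negative, which matches the "empty arrays except the first one" symptom. Swapping the indices fixes that reading (a sketch against the question's own variables):
x = []
y = []
for i in range(60, len(data)):
    x.append(data[i-60:i, 0])  # the 60 values before index i
    y.append(data[i, 0])       # the single value at index i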
I am trying to write a custom loss function for a person-reidentification task which is trained in a multi-task learning setting along with object detection. The filtered label values are of the shape (batch_size, num_boxes). I would like to create a mask such that only the values which repeat in dim 1 are considered for further calculations. How do I do this in TF/Keras-backend?
Short Example:
Input labels = [[0,0,0,0,12,12,3,3,4], [0,0,10,10,10,12,3,3,4]]
Required output: [[0,0,0,0,1,1,1,1,0],[0,0,1,1,1,0,1,1,0]]
(Basically I want to filter out only duplicates and discard unique identities for the loss function).
I guess a combination of tf.unique and tf.scatter could be used, but I do not know how.
This code works:
import tensorflow as tf

x = tf.constant([[0,0,0,0,12,12,3,3,4], [0,0,10,10,10,12,3,3,4]])

def mark_duplicates_1D(x):
    # mark entries whose value occurs more than once in the row
    y, idx, count = tf.unique_with_counts(x)
    comp = tf.math.greater(count, 1)
    comp = tf.cast(comp, tf.int32)
    res = tf.gather(comp, idx)
    # zero out the background label 0 regardless of its count
    mult = tf.math.not_equal(x, 0)
    mult = tf.cast(mult, tf.int32)
    res *= mult
    return res

# apply the 1-D routine row by row
res = tf.map_fn(fn=mark_duplicates_1D, elems=x)
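A quick check against the example in the question (assuming TF 2.x eager execution):
print(res.numpy())
# [[0 0 0 0 1 1 1 1 0]
#  [0 0 1 1 1 0 1 1 0]]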
I get a ValueError: Found input variables with inconsistent numbers of samples: [20000, 1] when I run the following, even though the row counts of x and y are correct. I load the RCV1 dataset, get the indices of the categories with the top x documents, create a list of tuples with an equal number of randomly selected positives and negatives for each category, and then finally attempt to run a logistic regression on one of the categories.
import numpy as np
import sklearn.datasets
from sklearn import model_selection, preprocessing
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
from scipy import sparse
rcv1 = sklearn.datasets.fetch_rcv1()
def get_top_cat_indices(target_matrix, num_cats):
    cat_counts = target_matrix.sum(axis=0)
    #cat_counts = cat_counts.reshape((1,103)).tolist()[0]
    cat_counts = cat_counts.reshape((103,))
    #b = sorted(cat_counts, reverse=True)
    ind_temp = np.argsort(cat_counts)[::-1].tolist()[0]
    ind = [ind_temp[i] for i in range(5)]
    return ind
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        cat_present = x.tocsr()[np.where(temp.sum(axis=1)>0)[0],:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[np.where(temp.sum(axis=1)==0)[0],:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        sampled_y_pos = temp.tocsr()[idx_cat,:]
        sampled_y_neg = temp.tocsr()[idx_nocat,:]
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst
ind = get_top_cat_indices(rcv1.target, 5)
test_res = prepare_data(train_x, train_y, ind, 20000)
x, y = test_res[0]
print(x.shape)
print(y.shape)
LogisticRegression().fit(x, y)
Could it be an issue with the sparse matrices, or a problem with dimensionality (there are 20K samples and 47K features)?
When I run your code, I get the following error:
AttributeError: 'bool' object has no attribute 'any'
That's because y for LogisticRegression needs to be a numpy array. So I changed the last line to:
LogisticRegression().fit(x, y.A.flatten())
Then I get the following error:
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
This is because your sampling code has a bug. You need to subset the y array to the rows having that category before applying the sampling indices. See the code below:
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        c1 = np.where(temp.sum(axis=1)>0)[0]
        c2 = np.where(temp.sum(axis=1)==0)[0]
        cat_present = x.tocsr()[c1,:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[c2,:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        sampled_y_pos = temp.tocsr()[c1][idx_cat,:]
        print(sampled_y_pos.nnz)
        sampled_y_neg = temp.tocsr()[c2][idx_nocat,:]
        print(sampled_y_neg.nnz)
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst
Now everything works like a charm.
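For completeness, a minimal usage sketch; the question never shows how train_x and train_y are defined, so the first line is an assumption:
train_x, train_y = rcv1.data, rcv1.target  # assumption: not shown in the question
ind = get_top_cat_indices(rcv1.target, 5)
test_res = prepare_data(train_x, train_y, ind, 20000)
x, y = test_res[0]
LogisticRegression().fit(x, y.A.flatten())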
Remove rows from numpy array when they are repeated less than n times
Cause:
I have a certain dataset that is 1 GB in size.
It has 29,118,021 samples and 108,390 classes.
However, some classes have just 1 sample, or 3 samples, and so on...
Problem:
I want to remove the rows/classes from the numpy array that are present/repeated fewer than N times.
Reference
XgBoost : The least populated class in y has only 1 members, which is too few
Attempt that failed
train_x, train_y, test_x, test_id = loader.load()
n_samples = train_y.shape[0]
unique_labels, y_inversed = np.unique(train_y, return_inverse=True)
label_counts = bincount(y_inversed)
min_labels = np.min(label_counts)
print "Total Rows ", n_samples
print "unique_labels ", unique_labels.shape[0]
print "label_counts ", label_counts[:]
print "min labels ", min_labels
unique_labels = unique_labels.astype(np.uint8)
unique_amounts = np.empty(shape=unique_labels.shape, dtype=np.uint8)
for u in xrange(0, unique_labels.shape[0]):
    if u % 100 == 0:
        print "Processed ", str(u)
    for index in xrange(0, train_y.shape[0]):
        if train_y[index] == unique_labels[u]:
            unique_amounts[u] = unique_amounts[u] + 1
for k in xrange(0, unique_amounts.shape[0]):
    if unique_amounts[k] == 1:
        print "\n"
        print "value :", unique_amounts[k]
        print "at ", k
The code above takes too long. Even after I left it running on the server for a whole night, it didn't even reach the halfway point.
Load method
This is my load method.
I could load it and keep it as a dataframe.
def load():
    train = pd.read_csv('input/train.csv', index_col=False, header='infer')
    test = pd.read_csv('input/test.csv', index_col=False, header='infer')
    # drop useless columns
    train.drop('row_id', axis=1, inplace=True)
    acc = train["accuracy"].iloc[:].as_matrix()
    x = train["x"].iloc[:].as_matrix()
    y = train["y"].iloc[:].as_matrix()
    time = train["time"].iloc[:].as_matrix()
    train_y = train["place_id"].iloc[:].as_matrix()
    ####################################################################################
    acc = acc.reshape(-1, 1)
    x = x.reshape(-1, 1)
    y = y.reshape(-1, 1)
    time = time.reshape(-1, 1)
    train_y = train_y.reshape(-1, 1)
    ####################################################################################
    train_x = np.hstack((acc, x, y, time))
    ####################################################################################
    acc = test["accuracy"].iloc[:].as_matrix()
    x = test["x"].iloc[:].as_matrix()
    y = test["y"].iloc[:].as_matrix()
    time = test["time"].iloc[:].as_matrix()
    test_id = test['row_id'].iloc[:].as_matrix()
    #######################
    acc = acc.reshape(-1, 1)
    x = x.reshape(-1, 1)
    y = y.reshape(-1, 1)
    time = time.reshape(-1, 1)
    #######################
    test_x = np.hstack((acc, x, y, time))
    return train_x, train_y, test_x, test_id
The numpy_indexed package (disclaimer: I am its author) contains a multiplicity function, which leads to a very readable way of performing such manipulations:
import numpy_indexed as npi
samples_mask = npi.multiplicity(train_y) >= n_min
filtered_train_y = train_y[samples_mask]
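If you would rather not add a dependency, the same mask can be built with plain numpy (a sketch, assuming train_y is a 1-D label array and n_min is your threshold):
import numpy as np

unique_labels, y_inversed, label_counts = np.unique(train_y, return_inverse=True, return_counts=True)
samples_mask = label_counts[y_inversed] >= n_min  # True where the label occurs at least n_min times
filtered_train_y = train_y[samples_mask]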
I would keep your data in a dataframe format.
That way, you can use some useful methods from the pandas module, and that should be quicker than looping.
First, get the different labels associated with df with df['labels'].value_counts().
(I assume that the labels column name is 'labels').
Then, get the labels that appear fewer than n_min times and drop those rows:
vc = df['labels'].value_counts()
labels = vc[vc < n_min].index
df = df[~df['labels'].isin(labels)]
Hope that helps!
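A quick illustration on hypothetical toy data:
import pandas as pd

df = pd.DataFrame({'labels': [1, 1, 1, 2, 3, 3]})
n_min = 2
vc = df['labels'].value_counts()
rare = vc[vc < n_min].index
df = df[~df['labels'].isin(rare)]
print(df)  # the single row with label 2 is gone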
I am comparing 2 numpy arrays and want to add them together. But before doing so, I need to make sure they are the same size. If the sizes are not the same, I want to take the smaller one and fill its last rows with zeros to match the shape.
Both arrays have 16 columns and N rows. I am assuming it should be pretty straightforward, but I can't get my head around it. So far I am able to compare the two arrays' shapes.
import csv
import numpy as np
import sys
data = np.genfromtxt('./test1.csv', dtype=float, delimiter=',')
data_sys = np.genfromtxt('./test2.csv', dtype=float, delimiter=',')
print data.shape
print data_sys.shape
if data.shape != data_sys.shape:
    print "we have an error"
This is the output I got:
=============New file.csv============
(603, 16)
(604, 16)
we have an error
I want to fill the last row of the "data" array with 0 so that I can add the 2 arrays.
Thanks for your help.
You can use np.vstack((array1, array2)) from numpy, which stacks arrays vertically. For example:
A = np.random.randint(2, size = (2, 16))
B = np.random.randint(2, size = (5, 16))
print A.shape
print B.shape
if A.shape[0] < B.shape[0]:
    A = np.vstack((A, np.zeros((B.shape[0] - A.shape[0], 16))))
elif A.shape[0] > B.shape[0]:
    B = np.vstack((B, np.zeros((A.shape[0] - B.shape[0], 16))))
print A.shape
print A
In your case:
if data.shape[0] < data_sys.shape[0]:
    data = np.vstack((data, np.zeros((data_sys.shape[0] - data.shape[0], 16))))
elif data.shape[0] > data_sys.shape[0]:
    data_sys = np.vstack((data_sys, np.zeros((data.shape[0] - data_sys.shape[0], 16))))
I assume that your matrices always have the same number of columns; if not, you can similarly use hstack to stack them horizontally.
If you have only two files, and their shapes differ in just the 0th dimension, a simple check and copy is probably easiest, though it lacks generality:
import numpy as np
data = np.genfromtxt('./test1.csv', dtype=float, delimiter=',')
data_sys = np.genfromtxt('./test2.csv', dtype=float, delimiter=',')
fill_value = 0 # could be np.nan or something else instead
if data.shape[0]>data_sys.shape[0]:
    temp = data_sys
    data_sys = np.ones(data.shape)*fill_value
    data_sys[:temp.shape[0],:] = temp
elif data.shape[0]<data_sys.shape[0]:
    temp = data
    data = np.ones(data_sys.shape)*fill_value
    data[:temp.shape[0],:] = temp
print 'Using conditional:'
print data.shape
print data_sys.shape
if data.shape != data_sys.shape:
    print "we have an error"
A much more general solution is a custom class--overkill for your two files but much easier if you have lots of files to handle. The basic idea is that static class variables sx and sy keep track of the largest widths and heights, and are used when get_data is called, to output a standard shape array. This is pre-filled with your desired fill value, and the actual data from the corresponding file are copied into the upper left corner of the standard shape array:
import numpy as np
class IsomorphicArray:
    sy = 0 # static class variable
    sx = 0 # static class variable
    fill_value = 0.0
    def __init__(self, csv_filename):
        self.data = np.genfromtxt(csv_filename, dtype=float, delimiter=',')
        self.instance_sy, self.instance_sx = self.data.shape
        if self.instance_sy > IsomorphicArray.sy:
            IsomorphicArray.sy = self.instance_sy
        if self.instance_sx > IsomorphicArray.sx:
            IsomorphicArray.sx = self.instance_sx
    def get_data(self):
        out = np.ones((IsomorphicArray.sy, IsomorphicArray.sx))*self.fill_value
        out[:self.instance_sy, :self.instance_sx] = self.data
        return out

isomorphic_array_list = []
for filename in ['./test1.csv', './test2.csv']:
    isomorphic_array_list.append(IsomorphicArray(filename))

numpy_array_list = []
for isomorphic_array in isomorphic_array_list:
    numpy_array_list.append(isomorphic_array.get_data())

print 'Using custom class:'
for numpy_array in numpy_array_list:
    print numpy_array.shape
Assuming both arrays have 16 columns
len1 = len(data)
len2 = len(data_sys)
if len1 < len2:
    data = np.append(data, np.zeros((len2-len1, 16)), axis=0)
elif len2 < len1:
    data_sys = np.append(data_sys, np.zeros((len1-len2, 16)), axis=0)
print data.shape
print data_sys.shape
if data.shape != data_sys.shape:
    print "we have an error"
else:
    print "we r good"
Numpy provides an append function to add values to an array: see here for details. For multi-dimensional arrays you can define along which axis the values should be added. As you already know which of your arrays is the smaller one, just create a zero-filled array of the desired size with numpy.zeros and then append it to your target array.
It might be necessary to flatten your array first and then reshape it.
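A minimal sketch of that approach (the shapes here are illustrative):
import numpy as np

a = np.ones((3, 16))
b = np.ones((5, 16))
missing = b.shape[0] - a.shape[0]
a = np.append(a, np.zeros((missing, a.shape[1])), axis=0)  # pad a with zero rows
print(a.shape)  # (5, 16)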
I had a similar situation: two arrays of sizes mask_in: (n1, m1) and mask_ot: (n2, m2) that were generated through a mask of a 2D image of size (N, M), where mask_ot is larger than mask_in and both share a common center (X0, Y0). I followed the approach suggested by @AniaG using vstack and hstack. I simply obtained the shapes of both arrays and their size difference, and finally accounted for the number of missing elements at both ends.
Here is what I got:
mask_in = np.random.randint(2, size = (2, 8))
mask_ot = np.random.randint(2, size = (6, 16))

mask_in_amp = mask_in
dif_row = mask_ot.shape[0] - mask_in_amp.shape[0]
dif_col = mask_ot.shape[1] - mask_in_amp.shape[1]
complete_row = dif_row // 2  # integer division, so np.zeros gets an int
complete_col = dif_col // 2
mask_in_amp = np.vstack((mask_in_amp, np.zeros((complete_row, mask_in_amp.shape[1]))))
mask_in_amp = np.vstack((np.zeros((complete_row, mask_in_amp.shape[1])), mask_in_amp))
mask_in_amp = np.hstack((mask_in_amp, np.zeros((mask_in_amp.shape[0], complete_col))))
mask_in_amp = np.hstack((np.zeros((mask_in_amp.shape[0], complete_col)), mask_in_amp))
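For centered padding like this, numpy's pad function does the same job in one call (a sketch; the pad widths are computed as above, with the remainder going to one side when the difference is odd):
top = dif_row // 2
bottom = dif_row - top
left = dif_col // 2
right = dif_col - left
mask_in_amp = np.pad(mask_in, ((top, bottom), (left, right)), mode='constant', constant_values=0)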
If you don't care about the exact shapes of two arrays you can also do the following:
if data.size == data_sys.size:
    print ('arrays have the same number of elements, and possibly shape')
else:
    print ('arrays do not have the same shape for sure')