Tensorflow target column always returns 1 - python

I'm working on a classification problem with TensorFlow and I'm new to this. After the transformation below I expect to see two target values (1 and 0). Since I don't know whether this is normal, my question is: is it expected for the whole target column to come out as 1, as shown below? Thank you.
df['target'] = np.where(df['Class']== 2, 0, 1)
df = df.drop(columns=['Class'])
Then, when I run the line below, the target column shows nothing but 1s.
print(df.head(50))

Just change the last parameter to the array you are making the comparison on.
This will replace the values of 2 with 1 in df['Class'] and leave every other value unchanged:
df['target'] = np.where(df['Class']== 2, 1, df['Class'])
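As a quick sanity check (an addition, assuming df is the frame from the question), look at the actual values: if df['Class'] never equals 2, the np.where condition is False on every row and the whole target column comes out identical.
print(df['Class'].unique())          # inspect the raw labels (run this before dropping 'Class')
print(df['target'].value_counts())   # should show counts for more than one target value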

Related

Python lambda based on condition in two columns

I need to add an indicator column to my dataframe that flags users with a promo code (1 if on promo, else 0). I need to look at two columns and see if any promo code exists under either of col_promo_1, col_promo_2. This is the code I'm using, but it returns NaN values:
df['promo_ind'] = df[['col_promo_1', 'col_promo_2']].apply(lambda x: 1 if x is not None else 0)
However, when I use the code with only one column, for example col_promo_1, the result is accurate. Any thoughts on how I can get this fixed?
Make a new column:
df['promo_ind'] = 0
You can build a mask and use it to set the values in the correct places:
df.loc[df['col_promo_1'].notna() | df['col_promo_2'].notna(), 'promo_ind'] = 1
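Equivalently (a one-step variant, not part of the original answer), the boolean mask can be cast directly to integers:
df['promo_ind'] = (df['col_promo_1'].notna() | df['col_promo_2'].notna()).astype(int)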
Sticking to your approach, let's assume you have the example DataFrame below (df) with two columns (promo1 and promo2), and the goal is to indicate promo status in a third column if a user is on either promo1 or promo2.
import pandas as pd
df = pd.DataFrame(data={'promo1': [0, 1, 0, 1], 'promo2': [0, 0, 1, 1]})
The line below creates a third column, checks the two existing columns at every row, and calculates the corresponding promo status. (The issue with the posted code is that "x" takes the DataFrame's columns one by one, whereas you want to take rows and check them. The fix is to pass axis=1 to the apply() method.)
df['promo_ind'] = df[['promo1', 'promo2']].apply(lambda row: 0 if (row['promo1']==0 and row['promo2']==0) else 1, axis=1)
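Running the two snippets above should leave df looking like this (expected output, for illustration):
   promo1  promo2  promo_ind
0       0       0          0
1       1       0          1
2       0       1          1
3       1       1          1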

python bypassing binary if statement and issuing all 0's

I'm trying to create binary values (0, 1) in a dataset and python is completely bypassing the if statements and I'm not sure why. I'm assuming my syntax is incorrect, but I have no idea what to do.
Example code:
df['Binary'] = 0
for row in df['standard_deviation']:
    if row > 2:  # greater than 2 standard deviations
        df['Binary'] = 1
    else:
        df['Binary'] = 0
Expected output: Binary should come out with 0s and 1s, but instead it's assigning all 0s to the data.
Any help is greatly appreciated!
EDIT: prints of the "row" value. (Screenshot "Row from dataset" not reproduced here.)
If I understand what you want to do, it's for the 'Binary' column to contain a 1 or 0 for each row, depending on the value of the 'standard_deviation' column for each row.
But instead, by writing df['Binary'] = 0 or df['Binary'] = 1 you are assigning that value to the entire column on every iteration, so the column ends up holding whatever the last assignment was.
What you want in this case is simply:
df['Binary'] = (df['standard_deviation'] > 2).astype('uint8')
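A minimal sketch on made-up data (the column values here are illustrative, not from the question):
import pandas as pd

df = pd.DataFrame({'standard_deviation': [1.5, 2.7, 0.9, 3.1]})
df['Binary'] = (df['standard_deviation'] > 2).astype('uint8')
print(df['Binary'].tolist())  # [0, 1, 0, 1]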

Handling ragged CSV columns in pandas

I have a CSV file containing data: (just the first ten rows of data are listed)
0,11,31,65,67
1,31,33,67
2,33,43,67
3,31,33,67
4,24,31,33,65,67,68,71,75,76,93,97
5,31,33,67
6,65,93
7,2,33,34,51,66,67,84
8,44,55,66
9,2,33,51,54,67,84
10,33,51,66,67,84
The first column indicates the row number (e.g. the first column in the first row is 0). When I try to use
import pandas as pd
df0 = pd.read_csv('df0.txt', header=None, sep=',')
The following error occurs:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 5, saw 12
I guess pandas computes the number of columns when it reads the first row (5 columns). How can I declare the number of columns myself? It is known that there are 120 class labels in total, so I guess 121 columns should be enough.
Further, how can I transform it into a one-hot encoded format? I want to use a neural network model to process the data.
For your first problem, you can pass a names=... parameter to read_csv:
df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',')
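For illustration (a small sketch using an inline string rather than the real file), rows shorter than the declared names list are padded with NaN out to the full column count:
import io
import pandas as pd

sample = '0,11,31,65,67\n4,24,31,33,65,67,68\n'
df = pd.read_csv(io.StringIO(sample), header=None, names=range(8), sep=',')
print(df)  # the first row ends with three NaN columns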
As for your second problem, there's an existing solution here that uses sklearn.OneHotEncoder. If you are looking to convert each column to a one hot encoding, you may use it.
I gave this my best shot, but I don't think it's too good. I do think it gets at what you're asking. Based on my own ML knowledge and your question, I took you to be asking the following:
1.) You have a csv of numbers
2.) This is for a problem with 120 classes
3.) You want a matrix with 1s and 0s for each class
4.) For example, a csv such as:
1, 3
2, 3, 6
would become the following feature matrix (one column per class label, one row per csv line):
Column:  1  2  3  6
         1  0  1  0
         0  1  1  1
Thus this code achieves that, but it is surely not optimized (note the helper function has to be defined before the loop that calls it):
df = pd.read_csv(file, header=None, names=range(121), sep=',')

def func(df1, df2):
    # We can't join if columns overlap. Use set operations to identify
    non_overlapping_columns = list(set(df2.columns) - set(df1.columns))
    overlapping_columns = list(set(df2.columns) - set(non_overlapping_columns))
    # Join where possible
    df2_join = df2[non_overlapping_columns]
    df3 = df1.join(df2_join)
    # Manually add columns for overlaps
    for k in overlapping_columns:
        df3[k] = df3[k] + df2[k]
    return df3

# One dummy (indicator) frame per original column
one_hot = []
for k in df.columns:
    one_hot.append(pd.get_dummies(df[k]))

# Merge the dummy frames, summing any columns they share
for n, frame in enumerate(one_hot):
    if n == 0:
        df = frame
    else:
        df = func(df1=df, df2=frame)
From here you could feed it into sklearn's OneHotEncoder, as @cᴏʟᴅsᴘᴇᴇᴅ noted.
That would look like this (note that OneHotEncoder takes no data in its constructor; the data goes to fit_transform):
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
onehot = encoder.fit_transform(df)  # returns a sparse matrix

import sys
sys.getsizeof(onehot)  # smaller than the pandas frame
sys.getsizeof(df)
I guess I'm unsure whether the assumptions I noted above are what you want done with your data; it seems perhaps they aren't. I took each line in your csv to be indicating the classes that exist for that row, but I'm still a little unclear on it.
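If that reading is right, a more direct route (an alternative sketch, not from the original answers, assuming the file is named df0.txt and the first field of each line is a row index) is sklearn's MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

rows = []
with open('df0.txt') as f:
    for line in f:
        if not line.strip():
            continue
        # drop the leading row index, keep the class labels
        rows.append([int(v) for v in line.strip().split(',')[1:]])

mlb = MultiLabelBinarizer(classes=list(range(121)))
onehot = mlb.fit_transform(rows)  # shape (n_rows, 121), 1 where a label appears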

NumPy - Getting column and row number to reshape array

I am learning neural networks and I am trying to automate some of the processes.
Right now, I have code to split the dataset randomly, a 284807x31 piece. Then, I need to separate inputs and outputs, meaning I need to select the entire array until the last column and, on the other hand, select only the last column. For some reason I can't figure out how to do this properly and I am stuck at splitting and separating the set as explained above. Here's my code so far (the part that refers to this specific problem):
train, test, cv = np.vsplit(data[np.random.permutation(data.shape[0])], (6,8))
# Should select entire array except the last column
train_inputs = np.resize(train, len(train[:,1]), -1)
test_inputs = np.resize(test, len(test[:,1]), -1)
cv_inputs = np.resize(cv, len(cv[:,1]), -1)
# Should select **only** the last column.
train_outs = train[:, 30]
test_outs = test[:, 30]
cv_outs = test[:, 30]
The idea is that I'd like the machine to find the column count of the corresponding dataset and do the intended resizing on its own. The second part should select only the last column, and I am not sure whether it works because the script stops before that. The error is, by the way:
Traceback (most recent call last):
File "src/model.py", line 43, in <module>
train_inputs = np.resize(train, len(train[:,1]), -1)
TypeError: resize() takes exactly 2 arguments (3 given)
PS: Now that I am looking at the documentation, I can see I am very far from the solution but I really can't figure it out. It's the first time I am using NumPy.
Thanks in advance.
Some slicing should help:
Should select entire array except the last column
train_inputs = train[:,:-1]
test_inputs = test[:,:-1]
cv_inputs = cv[:,:-1]
and:
Should select only the last column.
train_outs = train[:,-1]
test_outs = test[:, -1]
cv_outs = cv[:, -1]
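A quick check of those slices on a toy array (illustrative shapes only, not your real data):
import numpy as np

data = np.arange(12).reshape(4, 3)  # stand-in for the real dataset
inputs = data[:, :-1]               # every column except the last
outputs = data[:, -1]               # only the last column
print(inputs.shape, outputs.shape)  # (4, 2) (4,)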

Taking mean along columns with masks in Python

I have a 2D array containing data from some measurements. I have to take mean along each column considering good data only.
Hence I have another 2D array of the same shape which contains 1s and 0s showing whether data at that (i,j) is good or bad. Some of the "bad" data can be nan as well.
def mean_exc_mask(x, mas):  # x is the real data array
    # mas tells if the data at the location is good/bad (0 means good)
    sum_array = np.zeros(len(x[0]))
    avg_array = np.zeros(len(x[0]))
    items_array = np.zeros(len(x[0]))
    for i in range(0, len(x[0])):    # We take a specific column first
        for j in range(0, len(x)):   # And then parse across rows
            if mas[j][i] == 0:       # If the data is good
                sum_array[i] = sum_array[i] + x[j][i]
                items_array[i] = items_array[i] + 1
        if items_array[i] == 0:      # If none of the data is good for a particular column
            avg_array[i] = np.nan
        else:
            avg_array[i] = float(sum_array[i]) / items_array[i]
    return avg_array
I am getting all values as nan!
Any ideas about what's going wrong here, or some other way to do it?
The code seems to work for me, but you can do it a whole lot simpler by using the built-in aggregation in NumPy:
(x*(m==0)).sum(axis=0)/(m==0).sum(axis=0)
I tried it with:
x=np.array([[-0.32220561, -0.93043128, 0.37695923],[ 0.08824206, -0.86961453, -0.54558324],[-0.40942331, -0.60216952, 0.17834533]])
and
m = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1]])
If you post example data, it is often easier to give a qualified answer.
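One caveat worth adding (an assumption based on the question's note that some bad cells can be NaN): NaN * 0 is still NaN in NumPy, so the one-liner above propagates NaNs from masked-out cells. Zeroing the bad cells out first avoids that:
import numpy as np

good = (m == 0)
# Replace bad cells with 0 before summing, so their NaNs cannot leak through
col_means = np.where(good, x, 0).sum(axis=0) / good.sum(axis=0)
Columns with no good data divide by zero and come out as NaN (with a runtime warning), matching the loop version's behavior.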
