Importing CSV data and splitting correctly - python

I'm trying to create a neural network using NumPy, but I'm having trouble importing and splitting my CSV file. When I run my code it seems to think I want all of my rows, when I only want the 70% that I split off in the code below. My data is 470 rows in total, so the 70% split should be 329 rows.
I'm not sure if the way I've indexed my x/y split has messed things up. When I print 'trainingdata' before splitting it into x/y, it shows 329 rows, which is the correct value. 'trainingdata' is also only used in this section of code, so I don't believe the problem is in a later function.
dataset = pd.read_csv('data.csv', names=header_list)
trainingdata = dataset.sample(frac=0.7)
testingdata = dataset.drop(trainingdata.index)
X = trainingdata.iloc[:,1:].to_numpy().astype('float32')
Y = trainingdata.iloc[:,0].to_numpy().reshape(-1, 1).astype('float32')
print(Y.shape)
print(X.shape)
The output message is:
ValueError: Shape of passed values is (329, 1), indices imply (470, 1)
EDIT:
# convert output to a DataFrame
predicted_output = pd.DataFrame(
    predicted_output,
    columns=["prediction"],
    index=df.index)
Sorry for the confusion. Basically, I need to use my 70/30 split data in the code above. Calling the pre-split DataFrame works fine, but when I try to call 'testingdata' (the 70/30 split), it fails because by that point it has been converted to a NumPy array, which has no index for the function to use.
I'm not sure how to convert my 'trainingdata' array into something callable that is usable by the function above. Is there specific syntax for this?
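A minimal sketch of one way around this: keep the split as a DataFrame (only call to_numpy() on copies), and build the prediction frame from the test split's own index instead of the full pre-split index. The toy frame and the mean-based "prediction" below are placeholders, not the asker's actual data or network.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv: first column is the label
df = pd.DataFrame(np.random.rand(10, 4), columns=["label", "a", "b", "c"])

trainingdata = df.sample(frac=0.7, random_state=0)   # 7 of 10 rows
testingdata = df.drop(trainingdata.index)            # remaining 3 rows

X_test = testingdata.iloc[:, 1:].to_numpy().astype("float32")
predicted = X_test.mean(axis=1)  # stand-in for the network's predictions

# Wrap the NumPy predictions back into a DataFrame, reusing the test
# split's original row index rather than the full df.index -- this is
# what avoids "indices imply (470, 1)" when only 329/141 rows exist
predicted_output = pd.DataFrame(predicted, columns=["prediction"],
                                index=testingdata.index)
```

The key point is that `testingdata.index` has exactly as many entries as the prediction array, whereas the pre-split `df.index` does not.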

Related

Cannot plot or use .tolist() on pd dataframe column

So I am reading in data from a CSV and saving it to a DataFrame so I can use the columns. Here is my code:
filename = open(r"C:\Users\avalcarcel\Downloads\Data INSTR 9 8_16_2022 11_02_42.csv")
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(filename,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
length_ = len(df.date)
scan = list(range(1,length_+1))
plt.plot(scan,df.ch104)
plt.show()
When I try to plot scan vs. df.ch104, I get the following exception thrown:
'value' must be an instance of str or bytes, not a None
So what I thought to do was make each column in my df a list:
ch104 = df.ch104.tolist()
But it is changing my data from this (screenshot: before .tolist()) to this (screenshot: after .tolist()).
This also happens when I use df.ch104.values.tolist()
Can anyone help me? I haven't used python/pandas in a while and I am just trying to get the data read in first. Thanks!
So, the df.ch104.values.tolist() code basically turns your column into a 2-D N×1 array, but what you want is a 1-D array of size N. Transpose it before you call .tolist(), which gives a 1×N array, then take [0] to flatten it to N values:
df.ch104.values.T.tolist()[0]
Might I also suggest you include dropna() to avoid the 'value' must be an instance of str or bytes, not a None error:
df.dropna(subset=['ch104']).ch104.values.T.tolist()[0]
The error clearly says there are None or NaN values in your dataframe. You need to check for None and deal with them - replace with a suitable value or delete them.
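As a quick illustration of the dropna suggestion, with a toy column standing in for the CSV data:

```python
import numpy as np
import pandas as pd

# Toy column with a missing reading, standing in for df.ch104
df = pd.DataFrame({"ch104": [1.0, np.nan, 3.0, 4.0]})

# A pandas Series already flattens to a plain 1-D list,
# but the NaN entry is still in it:
raw = df["ch104"].tolist()

# Dropping the NaN rows first gives clean numeric data for plotting
clean = df.dropna(subset=["ch104"])["ch104"].tolist()
```

Here `clean` is `[1.0, 3.0, 4.0]`, which matplotlib can plot without complaint.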

Confused with this simple normalization function

Apparently the code below is supposed to normalize the columns of the x values.
# normalize columns
def normalize_cols(m):
    col_max = m.max(axis=0)
    col_min = m.min(axis=0)
    return (m - col_min) / (col_max - col_min)
x_vals_train = np.nan_to_num(normalize_cols(x_vals_train))
x_vals_test = np.nan_to_num(normalize_cols(x_vals_test))
However, I am a bit confused.
Firstly, does the function normalize the data column by column? If so, how and why? (Because we are passing in whole arrays at once.)
Secondly, documentation of np.nan_to_num says that:
Replace nan with zero and inf with large finite numbers.
But why is that used here? I don't get why we would need to replace NaN values with zeros after we normalize the data.
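To make this concrete, here is a small sketch: max(axis=0) and min(axis=0) each return one value per column, and NumPy broadcasting then applies the per-column scaling to every row at once. A constant column makes col_max equal to col_min, so the division is 0/0 and produces NaN, which is why np.nan_to_num is applied afterwards.

```python
import numpy as np

def normalize_cols(m):
    col_max = m.max(axis=0)  # per-column maxima, shape (n_cols,)
    col_min = m.min(axis=0)  # per-column minima, shape (n_cols,)
    # broadcasting stretches the (n_cols,) vectors across every row
    return (m - col_min) / (col_max - col_min)

m = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 30.0, 5.0]])
out = np.nan_to_num(normalize_cols(m))
# columns 0 and 1 each get scaled to span 0..1 independently;
# the constant column 2 gave 0/0 = NaN, which nan_to_num replaces with 0
```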

Slicing columns in Python

I am new to Python. I want to slice out the columns from index 1 to the end of a matrix and perform some operations on those sliced-out columns. Following is the code:
import numpy as np
import pandas as pd
train_df = pd.read_csv('train_475_60_W1.csv',header = None)
train = train_df.to_numpy()  # .as_matrix() was removed in newer pandas versions
y = train[:,0]
X = train[:,1:-1]
The problem is that if I execute "train.shape", it gives me (89512, 61), but when I execute "X.shape", it gives me (89512, 59). I was expecting to get 60, as I want to operate on all the columns except the first one. Can anyone please help me solve this?
In the line
X = train[:,1:-1]
you cut off the last column. -1 refers to the last column, and Python includes the beginning but not the end of a slice - so lst[2:6] would give you entries 2,3,4, and 5. Correct it to
X = train[:,1:]
BTW, you can make your code format properly by including four spaces before each line (you can just highlight it and hit Ctrl+K).
The thing to know about slicing a single dimension, even in normal lists, is that it looks like this:
[start:end]
with start included and end excluded.
You can also use these:
[:x]  # from the start to x
[x:]  # from x to the end
You can then generalize that to 2D or more, so in your case it would be:
X = train[:,1:]  # the first : takes all rows, and 1: takes all columns except the first
It's worth practicing these on small examples.
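A tiny check of both slices on a toy array (any small 2-D array shows the same off-by-one):

```python
import numpy as np

train = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

X_wrong = train[:, 1:-1]  # drops the first AND last column -> 2 columns left
X_right = train[:, 1:]    # drops only the first column -> 3 columns left
```

Scaled up to a (89512, 61) matrix, the first form yields 59 columns and the second the expected 60.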

zip rows of pandas DataFrame with list/array of values

My current code is
from numpy import *

def buildRealDataObject(x):
    loc = array(x[0])
    trueClass = x[1]
    evid = ones(len(loc))
    evid[isnan(loc)] = 0
    loc[isnan(loc)] = 0
    return DataObject(location=loc, trueClass=trueClass, evidence=evid)

if trueClasses is None:
    trueClasses = zeros(len(dataset), dtype=int8).tolist()
realObjects = list(map(lambda x: buildRealDataObject(x), zip(dataset, trueClasses)))
and it is working. What I expect is to create one realObject for each row of the DataFrame dataset, combined with the corresponding entry of trueClasses. I am not really sure why it works, though, because if I run list(zip(dataset, trueClasses)) I just get something like [(0, 0.0), (1, 0.0)]. The two columns of dataset are called 0 and 1. So my first question is: why is this working, and what is happening here?
However, I think this might still be wrong on some level, because it might only work due to a "clever implicit transformation" on the side of pandas. Also, for the line evid[isnan(loc)] = 0 I now get the error
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
How should I rewrite this code instead?
Currently the zip works on columns instead of rows. Use one of the methods from Pandas convert dataframe to array of tuples to make the zip work on rows instead of columns. For example, substitute
zip(dataset, trueClasses)
with
zip(dataset.values, trueClasses)
Considering this post, if you already have l = list(data_train.values) for some reason, then zip(l, eClass) is faster than zip(dataset.values, trueClasses). However, if you don't, the transformation takes too much time to be worth it in my tests.

Saving/loading a table (with different column lengths) using numpy

A bit of context: I am writing code to save the data I plot to a text file. The data should be stored in such a way that it can be loaded back by a script and displayed again (this time without performing any calculation). The initial idea was to store the data in columns with the format x1, y1, x2, y2, x3, y3...
I am using code that simplifies to something like this (incidentally, I am not sure if using a list to group my arrays is the most efficient approach):
import numpy as np
MatrixResults = []
x1 = np.array([1,2,3,4,5,6])
y1 = np.array([7,8,9,10,11,12])
x2 = np.array([0,1,2,3])
y2 = np.array([0,1,4,9])
MatrixResults.append(x1)
MatrixResults.append(y1)
MatrixResults.append(x2)
MatrixResults.append(y2)
MatrixResults = np.array(MatrixResults)
TextFile = open('/Users/UserName/Desktop/Datalog.txt',"w")
np.savetxt(TextFile, np.transpose(MatrixResults))
TextFile.close()
However, this code gives an error when any of the data sets have different lengths. I have read similar questions:
Can numpy.savetxt be used on N-dimensional ndarrays with N>2?
Table, with the different length of columns
However, these require breaking the format (either by flattening or by padding the shorter columns with filler strings).
My issue summarises as:
1) Is there any method to transpose the arrays and, at the same time, save each one individually as a consecutive column?
2) Or is there any way to append columns to a text file (given a certain number of rows and columns to skip)?
3) Should I try this with another library such as pandas?
Thank you very much for any advice.
Edit 1:
After looking a bit more, it seems that leaving blank spaces is more inefficient than padding the lists.
In the end I wrote my own routine (not sure if there is a numpy function for this) in which I pad the arrays to matching lengths with "nan" values.
To get the data back I use the genfromtxt method and then this line:
x = x[~isnan(x)]
to remove those cells from the arrays.
If I find a better solution I will post it :)
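The nan-padding approach the asker describes can be sketched like this (the arrays are the ones from the question; the savetxt/genfromtxt calls are shown as comments since file paths vary):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([0.0, 1.0, 4.0, 9.0])
cols = [x1, x2]

# Pad every column with nan up to the longest one, then place them
# side by side so each array occupies one aligned text column
n = max(len(c) for c in cols)
padded = np.full((n, len(cols)), np.nan)
for j, c in enumerate(cols):
    padded[:len(c), j] = c

# np.savetxt("Datalog.txt", padded)          # columns stay aligned
# after np.genfromtxt, x = x[~np.isnan(x)]   # recovers each original array
```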
To save your arrays you can use np.savez and read them back with np.load. Passing each array as a separate named argument avoids packing different-length arrays into one ragged 2-D array:
# Write to file
np.savez(filename, x1=x1, y1=y1, x2=x2, y2=y2)
# Read back
data = np.load(filename + '.npz')
x1 = data['x1']
As a side note, you should follow naming conventions, i.e. only class names start with upper-case letters.
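A runnable roundtrip of the savez/load idea, saving each ragged array under its own key (the temp-file path is illustrative):

```python
import os
import tempfile

import numpy as np

x1 = np.array([1, 2, 3, 4, 5, 6])
x2 = np.array([0, 1, 2, 3])

path = os.path.join(tempfile.mkdtemp(), "datalog.npz")
np.savez(path, x1=x1, x2=x2)  # one named entry per array; lengths may differ

data = np.load(path)
# data["x1"] and data["x2"] come back with their original lengths,
# so no padding or flattening is needed
```

The trade-off versus savetxt is that .npz is a binary format, so the file is no longer human-readable.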
