Hi there. I am working with the scikit-learn digits data set, and I have split the data,
so I have x_train and y_train arrays.
The arrays are aligned so that x_train[0] belongs to y_train[0].
print x_train.shape
(1347, 64)
print y_train.shape
(1347,)
print set(y_train)
(0,1,2,3,4,5,6,7,8,9)
I would like to extract a random sample from x_train given set(y_train), i.e. to resample my data by extracting just one random observation of the set(y). However, I don't know if I can do this with numpy or pandas. Does anyone have an idea of how to deal with this?
Thank you very much.
It is not clear what you want to do.
The set(y) contains all the available labels of your dataset X.
In general (until you specify what you need), use np.random.choice:
You have this:
print set(y)
(0,1,2,3,4,5,6,7,8,9)
Convert it first to a list:
index_all = list(set(y))
Now, randomly sample the set(y):
# this is a random index (class/label) from 0 to 9.
random_index = np.random.choice(index_all, 1)
Now, I see 2 possibilities (I believe you want Case 2):
1) Directly index x with this random index (random based on the set(y)).
If x is a numpy array:
x[random_index, :]
This returns the row of x whose position equals the sampled label, so only rows 0 to 9 can ever be picked.
2) Resample x so that the result has the label y. The label 'y' is the one chosen randomly above (random_index).
x[y==random_index]
This returns every observation of x whose label equals random_index, not just one. To end up with a single observation, pick one row at random from this subset.
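Since the question is ambiguous, here is a minimal sketch of two possible readings: one random observation for one randomly chosen label, and one random observation for every label. The arrays are stand-in data with the shapes from the question, and the names `rng`, `rows`, `x_resampled` and `y_resampled` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data with the same shapes as in the question (hypothetical values).
x = rng.normal(size=(1347, 64))
y = rng.integers(0, 10, size=1347)

# Reading 1: pick a label at random, then one row that carries that label.
label = rng.choice(np.unique(y))
row = rng.choice(np.flatnonzero(y == label))
observation = x[row]  # a single observation of shape (64,)

# Reading 2: one random observation for every label present in y.
labels = np.unique(y)
rows = np.array([rng.choice(np.flatnonzero(y == c)) for c in labels])
x_resampled = x[rows]  # one row per label
y_resampled = y[rows]  # the matching labels, in sorted order
```

`np.flatnonzero(y == c)` lists the positions where the label is `c`, so choosing from it keeps x and y aligned.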
This is the approach I generally use for constructing a dataframe and extracting data from it.
import numpy as np
import pandas as pd
#Dummy arrays for x and y
x_train = np.zeros((1347,64))
y_train = np.ones((1347))
#First we pair up the arrays according to their index using zip. Only use this
#method if both arrays are of equal length.
training_dataset = list(zip(x_train,y_train))
#Next we load the dataset as a dataframe using Pandas
df = pd.DataFrame(data=training_dataset)
#Check that the dataframe is what you want
df.head()
#If you would like to extract a random row, you may use
df.sample(n=1)
#Alternatively, if you would like to extract a specific row (e.g. the 10th row, aka index 9)
df.iloc[9]
I hope I've understood what you wanted to achieve but if not, feel free to let me know so I can amend my answer!
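If the aim is one random row per label, a possible variation is to keep the 64 features as separate columns, add the labels as an extra column, and sample within each group. This is only a sketch on stand-in data; the column name `label` is an assumption, and `groupby(...).sample` needs pandas 1.1 or newer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in arrays with the shapes from the question (hypothetical values).
x_train = rng.normal(size=(1347, 64))
y_train = rng.integers(0, 10, size=1347)

# One column per feature, plus a column holding the label of each row.
df = pd.DataFrame(x_train)
df["label"] = y_train

# Draw one random row from each label group.
one_per_label = df.groupby("label").sample(n=1, random_state=0)
```

The index of `one_per_label` still holds the original row positions, so the sampled rows can be traced back to x_train.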
I am trying to extract a section (sub-matrix) of the numbers in a pandas dataframe, as marked in the picture embedded above.
Can anyone assist me? I want to perform analytics on a section (matrix) of a bigger data frame. Thank you in advance!
You can use the .iloc[] indexer to select the rows and columns you want by position.
dataframe.iloc[5:15,6:15]
This should select rows 5-14 and columns 6-14.
Not sure if the numbers are correct but I think this method is what you were looking for.
edit: changed .loc[] to .iloc[] because we're using index values, and cleaned it up a bit
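To check the bounds, here is a small sketch on a made-up 20x20 frame (the frame and its values are assumptions for illustration). Like Python slices, .iloc[] excludes the end position.

```python
import numpy as np
import pandas as pd

# Hypothetical 20x20 frame whose cell values encode their position (row * 20 + col).
dataframe = pd.DataFrame(np.arange(400).reshape(20, 20))

# Rows 5..14 and columns 6..14: the end of each slice is excluded.
block = dataframe.iloc[5:15, 6:15]

print(block.shape)
print(block.index[0], block.index[-1])
print(block.columns[0], block.columns[-1])
```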
Here is the code to iterate over the whole dataframe:
# df = big data frame
shape = (10, 10)  # shape of the matrix to be analyzed, here 10x10
step = 1  # step size: iterate over every number
# or
step = 10  # step size: iterate block by block
# keep in mind, iterating by block can leave some data out at the end of the rows and columns
# you can set step = shape if you are working with a matrix that isn't square; just be sure to change step in the code below to step[0] and step[1] respectively
for row in range(0, df.shape[0] - shape[0] + 1, step):  # number of rows of big dataframe - number of rows of matrix to be analyzed
    for col in range(0, df.shape[1] - shape[1] + 1, step):  # number of columns of big dataframe - number of columns of matrix to be analyzed
        matrix = df.iloc[row:shape[0] + row, col:shape[1] + col]  # slice out the matrix and assign it to 'matrix'
        # analyze matrix here
This is basically the same as what @dafmedinama said; I just added more comments, simplified specifying the shape of the matrix, and included a step variable in case you don't want to move the matrix one number at a time.
Let sub_rows and sub_cols be the dimensions of the sub-dataframe to be extracted:
import pandas as pd
sub_rows = 10  # Number of rows to be extracted
sub_cols = 3  # Number of columns to be extracted
if sub_rows > len(df.index):
    print("Defined sub dataframe rows are more than in the original dataframe")
elif sub_cols > len(df.columns):
    print("Defined sub dataframe columns are more than in the original dataframe")
else:
    for i in range(0, len(df.index) - sub_rows + 1):
        for j in range(0, len(df.columns) - sub_cols + 1):
            sub_df = df.iloc[i:i+sub_rows, j:j+sub_cols]  # Extracted dataframe
            # Put here the code you need for your analysis
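If the frame is purely numeric, the same sliding sub-matrices can be produced without an explicit slicing loop. This sketch swaps in NumPy's `sliding_window_view` (NumPy 1.20 or newer), which is not what the answers above use; the 12x13 frame and the per-window sum are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical numeric frame, 12 rows by 13 columns.
df = pd.DataFrame(np.arange(12 * 13).reshape(12, 13))

# windows[i, j] is the 10x10 block that starts at row i and column j,
# i.e. the same values as df.iloc[i:i+10, j:j+10].
windows = sliding_window_view(df.to_numpy(), (10, 10))

# Example analysis: the sum of every window, computed in one call.
window_sums = windows.sum(axis=(2, 3))
```

The result is a read-only view, so it does not copy the data; this corresponds to the step = 1 case of the loops above.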
Looking to print the minimum values of numpy array columns.
I am using a loop in order to do this.
The array is shaped (20, 3) and I want to find the min values of columns, starting with the first (i.e. col_value=0)
I have coded
col_value=0
for col_value in X:
    print(X[:, col_value].min)
    col_value += 1
However, it is coming up with an error
"arrays used as indices must be of integer (or boolean) type"
How do I fix this?
Let me suggest an alternative approach that you might find useful. NumPy's min() has an axis argument that you can use to find the minimum values along a given dimension.
Example:
X = np.random.randn(20, 3)
print(X.min(axis=0))
This prints a numpy array with the minimum value of each column of X.
You don't need col_value=0 nor do you need col_value+=1.
import numpy
x = numpy.array([1,23,4,6,0])
print(x.min())
EDIT:
Sorry didn't see that you wanted to iterate through columns.
import numpy as np
X = np.array([[1,2], [3,4]])
for col in X.T:
    print(col.min())
Transposing the array is one of the simplest solutions.
X=np.array([[11,2,14],
            [5,15, 7],
            [8,9,20]])
X=X.T #Transposing the array
for i in X:
    print(min(i))
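For completeness, the loop from the question can also be kept with two changes. `for col_value in X` iterates over the rows of X, so each `col_value` is a float array, which is what triggers the index error; and `.min` needs parentheses to be called. A minimal sketch, assuming a random (20, 3) array as stand-in data:

```python
import numpy as np

# Stand-in for the (20, 3) array from the question.
X = np.random.default_rng(0).normal(size=(20, 3))

col_mins = []
# Loop over integer column positions instead of over the rows of X.
for col_value in range(X.shape[1]):
    col_min = X[:, col_value].min()  # note the parentheses: min() is a method call
    print(col_min)
    col_mins.append(col_min)
```

The values collected in `col_mins` match `X.min(axis=0)`.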
I have a table (DataFrame) created in pandas. It is a 2D table with integers as the column index and integers as the row index (they are position x and position y).
I know how to get the value in a "cell" of that table using the indexes, but I would like to get a linearly interpolated value "from between" the columns and rows.
Preferably, I would like to do this for a large number of x, y pairs that are kept in two tables, Position_x (m x n) and Position_y (m x n), and put the results into a table Results (m x n).
https://i.stack.imgur.com/utv03.png
Here is example of such procedure in Excel:
https://superuser.com/questions/625154/what-is-the-simplest-way-to-interpolate-and-lookup-in-an-x-y-table-in-excel
Thanks
Szymon
I've found something that gets 90% of the way there; however, it has two disadvantages:
1) the index and columns need to be strictly increasing,
2) for a set of n input pairs it returns an n x n result array instead of just n results (in the example below, for 3 pairs of input points I need only 3 resulting values, but this code gives 9 values, one for every combination of input points).
Here is what I've found:
import scipy
import scipy.interpolate
import numpy as np
import pandas as pd
x=np.array([0,10,25,60,100]) #Index
y=np.array([1000,1200,1400,1600]) #Column
data=np.array([[60,54,33,0],
[50,46,10,0],
[42,32,5,0],
[30,30,2,0],
[10,10,0,0]])
Table_to_Interpolate=pd.DataFrame(data,index=x,columns=y)
sp=scipy.interpolate.RectBivariateSpline(x,y,data, kx=1, ky=1, s=0)
Input_Xs=12, 44, 69
Input_Ys=1150, 1326, 1416
Results=pd.DataFrame(sp(Input_Xs, Input_Ys), index=Input_Xs, columns=Input_Ys,)
It's not perfect, but it's the best I could find.
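Regarding the second disadvantage, the spline object can be evaluated pair by pair instead of on the full grid, either with `grid=False` or with the `ev` method. This is a sketch on the same table; the names `pairwise` and `same` are made up for illustration.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

x = np.array([0, 10, 25, 60, 100])      # index
y = np.array([1000, 1200, 1400, 1600])  # columns
data = np.array([[60, 54, 33, 0],
                 [50, 46, 10, 0],
                 [42, 32, 5, 0],
                 [30, 30, 2, 0],
                 [10, 10, 0, 0]])

# kx=1, ky=1, s=0 gives an interpolating spline of degree 1 in both directions.
sp = RectBivariateSpline(x, y, data, kx=1, ky=1, s=0)

input_xs = np.array([12, 44, 69])
input_ys = np.array([1150, 1326, 1416])

# grid=False evaluates at the pairs (x_i, y_i) only: 3 values instead of 3 x 3.
pairwise = sp(input_xs, input_ys, grid=False)

# sp.ev is an equivalent way to request pairwise evaluation.
same = sp.ev(input_xs, input_ys)
```

With `grid=False` the inputs are broadcast against each other, so two (m x n) arrays of positions should give an (m x n) array of results.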
If I understood your question correctly:
you can start by using pandas.melt to convert the multi-column result table into a one-column result table.
Then you can use ben-t's great answer to interpolate.
Hope this helps.
I have a 5D array called predictors with a shape of [6,288,37,90,107] where 6 is the number of variables,
288 is the time series of those variables,
37 is the k locations,
90 is the j locations,
107 is the i locations.
I want to have a pandas dataframe that includes columns of each variable timeseries at each k,j,i location so that of course will be a lot of columns.
Then I would like to somehow obtain the names for each column.
For example the first column would be var1_k_j_i = predictors[0,:,0,0,0]
except in the name I actually want the k location, j location,
and i location instead of k_j_i.
Since there are so many I can't do this by hand so I was hoping for a suggestion on the best way to organize this into a pandas dataframe and obtain the names? A loop possibly?
So in summary by the end of this I would like my 5D array of predictors turned into a large pandas dataframe where each column is a variable located at different k,j,i locations with the corresponding names of the variable and location in the header or first row of the dataframe.
Sounds like you need to have fun with reshape here.
Flattening the k, j, i locations into a single axis is as easy as one reshape. I'm not sure whether a second reshape can give you the 2D layout you need directly, so I'm proposing a loop, as follows.
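import itertools
import pandas as pd
dfs = []
# predictors is the 5D array from the question
new_matrix = predictors.reshape([6, 288, 37*90*107])
for var in range(6):
    # k, j, i combinations in the same order as the flattened last axis
    iterator = itertools.product(range(37), range(90), range(107))
    columns = ['var%i_' % var + '_'.join(map(str, x)) for x in iterator]
    dfs.append(pd.DataFrame(new_matrix[var], columns=columns))
# put the six per-variable frames side by side (288 rows each)
result = pd.concat(dfs, axis=1)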
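The loop can probably be avoided: moving the time axis to the front and flattening the remaining axes gives the 2D layout in one step. The sketch below uses small made-up dimensions so it runs quickly, and it names the variables from var1 as in the question; both choices are assumptions.

```python
import itertools

import numpy as np
import pandas as pd

# Small hypothetical dimensions standing in for (6, 288, 37, 90, 107).
n_var, n_time, n_k, n_j, n_i = 2, 5, 3, 4, 2
predictors = np.arange(n_var * n_time * n_k * n_j * n_i, dtype=float).reshape(
    n_var, n_time, n_k, n_j, n_i
)

# Put time first, then flatten (variable, k, j, i) into one column axis.
flat = np.moveaxis(predictors, 1, 0).reshape(n_time, -1)

# itertools.product varies its last argument fastest, matching the C-order flatten.
names = [
    f"var{v + 1}_{k}_{j}_{i}"
    for v, k, j, i in itertools.product(
        range(n_var), range(n_k), range(n_j), range(n_i)
    )
]

df = pd.DataFrame(flat, columns=names)
```

With the full-size array this produces about 2.1 million columns, so memory use is worth checking before building the frame.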
So I am trying to create an array and then access the columns by name. So I came up with something like this:
import numpy as np
data = np.ndarray(shape=(3,1000),
dtype=[('x',np.float64),
('y',np.float64),
('z',np.float64)])
I am confused as to why
data.shape
and
data['x'].shape
both come back as (3,1000). This is causing me issues when I'm trying to populate my data fields
data['x'] = xvalues
where xvalues has a shape of (1000,). Is there a better way to do this?
The reason why it comes out the same is because 'data' has a bit more structure than the one revealed by shape.
Example:
data[0][0] returns:
(6.9182540632428e-310, 6.9182540633353e-310, 6.9182540633851e-310)
while data['x'][0][0] returns:
6.9182540632427993e-310
So data contains 3 rows and 1000 columns, and each element is a 3-tuple.
data['x'] is the first element of that tuple of all combinations of 3 rows and 1000 columns, so the shape is (3,1000) as well.
Just set shape=(1000,). The three-field dtype already gives you three named fields (x, y and z), which act as the columns.
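A minimal sketch of that suggestion, assuming `np.zeros` is acceptable for allocating the array (it initializes the memory, unlike `np.ndarray`) and using a made-up `xvalues`:

```python
import numpy as np

# Three named float fields; each record holds one (x, y, z) triple.
dtype = [("x", np.float64), ("y", np.float64), ("z", np.float64)]

# One record per point: the shape counts records, not fields.
data = np.zeros(1000, dtype=dtype)

# Hypothetical values with shape (1000,), as in the question.
xvalues = np.linspace(0.0, 1.0, 1000)
data["x"] = xvalues

print(data.shape)
print(data["x"].shape)
```

Both shapes are now (1000,), so assigning an array of shape (1000,) to a field works directly.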