I have a 149x5 NumPy array. I need to save a portion (30%) of the values, selected randomly from the whole array. Additionally, the selected values will be deleted from the data.
What I have so far:
import pandas as pd
from random import randint

# Load dataset
data = pd.read_csv('iris.csv')

# Randomly select 30% (45) of the rows from the dataset
random_rows = data.sample(45)

# Container for the values to be saved
values = []

# Iterate over the rows and select one value at random from each
for index, row in data.iterrows():
    # Random column index between 0 and 4
    rand_selector = randint(0, 4)
    # Somehow save the deleted value and its position in the data object
    value = ??  # <-------
    values.append(value)
    # Delete the randomly chosen value
    del row[rand_selector]
To add further: the data in value will later be compared to the values imputed in its place by other methods (data imputation), so I need the position of each deleted value in the original dataset.
Given a 2D NumPy matrix m, this method returns an array of length 0.3*m.size containing length-3 arrays, each consisting of a random value and its coordinates in m.
def pickRand30(data):
    # Pick 30% of the flat indices without repetition
    rand = np.random.choice(np.arange(data.size), size=int(data.size * 0.3), replace=False)
    # Convert flat indices to row/column coordinates
    indexes1 = rand // data.shape[1]
    indexes2 = rand % data.shape[1]
    # Each row of the result is (value, row index, column index)
    return np.array((data[indexes1, indexes2], indexes1, indexes2)).T
You can delete the entries by using their coordinates; however, you may want to look into masked arrays instead of deleting single entries from a matrix.
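As an illustration, here is a minimal sketch (using a random 149x5 matrix as a stand-in for the iris values) of how the coordinates returned by pickRand30 could be used to mask the selected entries instead of deleting them:

import numpy as np

data = np.random.rand(149, 5)            # stand-in for the 149x5 iris values
picked = pickRand30(data)                # rows of (value, row index, column index)

rows = picked[:, 1].astype(int)
cols = picked[:, 2].astype(int)

masked = np.ma.masked_array(data, mask=np.zeros(data.shape, dtype=bool))
masked.mask[rows, cols] = True           # hide the selected entries

# picked[:, 0] keeps the original values, so they can later be compared
# with whatever an imputation method fills in at (rows, cols).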
I'm trying to do the following in Python using NumPy.
At every step I receive a row of values, and I call a function for each column.
To make it simple, assume I call a function GetRowOfValues().
After 5 rows I want to sum each column
and return a full row which is the sum of all 5 rows received.
Does anyone have an idea how to implement this using NumPy?
Thanks for the help.
I'm assuming that rows have a fixed length n and that their values are of float data type.
import numpy as np

n = 10                                  # adjust according to your need
cache = np.empty((5, n), dtype=float)   # allocate an empty 5xn array

cycle = True
while cycle:
    for i in range(5):
        cache[i, :] = GetRowOfValues()  # save the result of the function call in the i-th row
    column_sum = np.sum(cache, axis=0)  # sum by column
    # your logic here...
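As a quick self-contained check, here is the same idea with GetRowOfValues stubbed out by a random-row generator (the stub is purely hypothetical; substitute your real function):

import numpy as np

n = 10

def GetRowOfValues():
    # hypothetical stub standing in for the real data source
    return np.random.rand(n)

cache = np.empty((5, n), dtype=float)
for i in range(5):
    cache[i, :] = GetRowOfValues()

column_sum = cache.sum(axis=0)   # one row: the per-column sum of the 5 received rows
print(column_sum.shape)          # (10,)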
I am trying to extract a section (matrix) of the numbers in a pandas DataFrame, as marked in the picture embedded above.
Could anyone assist me? I want to perform analytics based on a section (matrix) of a bigger DataFrame. Thank you in advance!
You can use the .iloc[] function to select the rows and columns you want.
dataframe.iloc[5:15,6:15]
This should select rows 5-14 and columns 6-14.
Not sure if the numbers are correct, but I think this method is what you were looking for.
Edit: changed .loc[] to .iloc[] because we're using index positions, and cleaned it up a bit.
Here is the code to iterate over the whole DataFrame:

# df = the big DataFrame
shape = (10, 10)  # shape of the matrix to be analyzed, here 10x10

step = 1   # step size: iterate over every position
# or
step = 10  # step size: iterate block by block
# Keep in mind that iterating block by block will leave some data out at the end of the rows and columns.
# You can set step = shape if you are working with a matrix that isn't square; just be sure to change
# step in the code below to step[0] and step[1] respectively.

for row in range(0, df.shape[0] - shape[0] + 1, step):       # rows of the big DataFrame minus rows of the matrix to be analyzed
    for col in range(0, df.shape[1] - shape[1] + 1, step):   # columns of the big DataFrame minus columns of the matrix to be analyzed
        matrix = df.iloc[row:shape[0] + row, col:shape[1] + col]  # slice out the sub-matrix and call it 'matrix'
        # analyze the matrix here
This is basically the same as @dafmedinama said; I just added more commenting, simplified specifying the shape of the matrix, and included a step variable in case you don't want to iterate over every single position each time you move the matrix.
Let sub_rows and sub_cols be the dimensions of the sub-DataFrame to be extracted:

import pandas as pd

sub_rows = 10  # number of rows to be extracted
sub_cols = 3   # number of columns to be extracted

if sub_rows > len(df.index):
    print("The requested sub-DataFrame has more rows than the original DataFrame")
elif sub_cols > len(df.columns):
    print("The requested sub-DataFrame has more columns than the original DataFrame")
else:
    for i in range(0, len(df.index) - sub_rows + 1):
        for j in range(0, len(df.columns) - sub_cols + 1):
            sub_df = df.iloc[i:i + sub_rows, j:j + sub_cols]  # extracted sub-DataFrame
            # Put here the code you need for your analysis
I have netCDF data that is masked. The data is in (time, latitude, longitude). I would like to make an array of the same size as the original data, with zeros where the data is masked and ones where it is not masked. So far I have tried to make this function:
def find_unmasked_values(data):
    empty = np.ones((len(data), len(data[0]), len(data[0, 0])))
    for k in range(0, len(data[0, 0]), 1):      # third coordinate
        for j in range(0, len(data[0]), 1):     # second coordinate
            for i in range(0, len(data), 1):    # first coordinate
                if ma.is_mask(data[i, j, k]) is True:
                    empty[i, j, k] = 0
    return(empty)
But this only returns an array with ones and no zeros, even though there are masked values in the data. If you have suggestions on how to improve the efficiency of the code, I would also be very happy.
Thanks,
Keep it simple! There is no need for all the manual loops, which will make your approach very slow for large data sets. A small example with some other data (where thl is a masked variable):
import netCDF4 as nc4
nc = nc4.Dataset('bomex_qlcore_0000000.nc')
var = nc['default']['thl'][:]
mask_1 = var.mask # masked=True, not masked=False
mask_2 = ~var.mask # masked=False, not masked=True
# What you need:
int_mask = mask_2.astype(int) # masked=0, not masked=1
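If you don't have that netCDF file at hand, the same idea can be tried on a small synthetic masked array (purely illustrative data and mask):

import numpy as np
import numpy.ma as ma

data = np.arange(12, dtype=float).reshape(2, 2, 3)
mask = data % 5 == 0                             # mask a few entries deterministically
var = ma.masked_array(data, mask=mask)

int_mask = (~ma.getmaskarray(var)).astype(int)   # masked=0, not masked=1
print(int_mask.shape)                            # (2, 2, 3), same shape as the data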
p.s.: some other notes:
Instead of len(array), len(array[0]), et cetera, you can also directly get the shape of your array with array.shape, which returns a tuple with the array dimensions.
If you want to create a new array with the same dimensions as another one, just use empty = np.ones_like(data) (or np.zeros_like(data) if you want an array of zeros).
ma.is_mask() already returns a bool; no need to compare it with True.
Don't confuse is with ==: Is there a difference between "==" and "is"?
Here is my problem: I have to generate some synthetic data (7/8 columns), correlated with each other (using the Pearson coefficient). I can do this easily, but next I have to insert a percentage of duplicates in each column (yes, the Pearson coefficient will be lower), different for each column.
The problem is that I don't want to insert those duplicates by hand, because in my case it would be like cheating.
Does anyone know how to generate correlated data that already contains duplicates? I've searched, but the questions are usually about dropping or avoiding duplicates.
Language: Python 3
To generate correlated data I'm using this simple code: Generating correlated data
Try something like this:

# Pick random row indices (with possible repeats) covering the requested percentage
indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))
# Append the selected rows to the array (an ndarray has no append method, so concatenate instead)
array = np.concatenate([array, array[indices]], axis=0)
Here I assume that your data is stored in array, which is an ndarray where each row contains your 7/8 columns of data.
The above code creates an array of random indices, whose entries (rows) are selected and appended to the array again.
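A minimal usage sketch with made-up data and a 20% duplicate percentage (both are assumptions for illustration):

import numpy as np

array = np.random.rand(100, 8)   # hypothetical correlated data with 8 columns
percentage = 0.2                 # add duplicates equal to 20% of the rows

indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))
array = np.concatenate([array, array[indices]], axis=0)

print(array.shape)               # (120, 8)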
I found the solution.
I'm posting the code; it might be helpful for someone.
# These are the data, generated randomly with a given shape
rnd = np.random.random(size=(10**7, 8))

# This array represents one column of the covariance matrix (I want correlated data,
# so I randomly choose numbers between 0.8 and 0.95)
attr1 = np.random.uniform(0.8, .95, size=(8, 1))
# attr2 ... attr8 are created like attr1, with varying ranges of values (all above 0.7)

# corr_mat is the matrix, a union of the columns
corr_mat = np.column_stack((attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8))

from statsmodels.stats.correlation_tools import cov_nearest
# Using this function I find the nearest covariance matrix to my matrix,
# to be sure that it is positive definite
a = cov_nearest(corr_mat)

from scipy.linalg import cholesky
upper_chol = cholesky(a)

# Finally, compute the inner product of rnd and upper_chol
ans = rnd @ upper_chol

# ans now holds randomly correlated data (high correlation, but it is customizable)
# Next I create a pandas DataFrame with the values of ans
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4',
                                'att5', 'att6', 'att7', 'att8'])

# The last step is to truncate the float values of ans in a varying way, so I get
# duplicates in varying percentages
a = df.values
for i in range(8):
    trunc = np.random.randint(5, 12)
    print(trunc)
    a.T[i] = a.T[i].round(decimals=trunc)
# The float values of ans have 16 decimals, so I randomly choose an int
# between 5 and 12 and use it to truncate each value
Finally, these are my duplicate percentages for each column (a sketch of how such rates can be computed follows the list):
duplicate rate attribute: att1 = 5.159390000000002
duplicate rate attribute: att2 = 11.852260000000001
duplicate rate attribute: att3 = 12.036079999999998
duplicate rate attribute: att4 = 35.10611
duplicate rate attribute: att5 = 4.6471599999999995
duplicate rate attribute: att6 = 35.46553
duplicate rate attribute: att7 = 0.49115000000000464
duplicate rate attribute: att8 = 37.33252
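For reference, a per-column duplicate rate like the ones above could be computed roughly like this (my own sketch, not part of the original post):

# Percentage of rows in each column that duplicate an earlier value in that column
for col in df.columns:
    dup_rate = df[col].duplicated().mean() * 100
    print(f"duplicate rate attribute: {col} = {dup_rate}")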
I have a large (50000 x 50000) 64-bit integer NumPy array containing 10-digit numbers. There are about 250,000 unique numbers in the array.
I have a second reclassification table which maps each unique value from the first array to an integer between 1 and 100. My hope is to reclassify the values of the first array to the corresponding values from the table.
I've tried two methods of doing this, and while they work, they are quite slow. In both methods I create a blank (zeros) array of the same dimensions.
new_array = np.zeros(old_array.shape)
First method:
for old_value, new_value in lookup_array:
    new_array[old_array == old_value] = new_value
Second method, where lookup_table is a pandas DataFrame with the headings "Old" and "New":
for new_value, old_values in lookup_table.groupby("New"):
    new_array[np.in1d(old_array, old_values)] = new_value
Is there a faster way to reclassify the values?
Store the lookup table as a 250,000-element array where, for each index, you have the mapped value. For example, if you have something like:
lookups = [(old_value_1, new_value_1), (old_value_2, new_value_2), ...]
Then you can do:
idx, val = np.asarray(lookups).T
lookup_array = np.zeros(idx.max() + 1, dtype=val.dtype)  # index = old value, entry = new value
lookup_array[idx] = val
When you get that, you can get your transformed array simply as:
new_array = lookup_array[old_array]
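A small self-contained check of the idea, with toy values standing in for the real 10-digit codes:

import numpy as np

# Toy reclassification pairs (old value -> new value) and a toy "old" array
lookups = [(3, 1), (7, 2), (9, 3)]
old_array = np.array([[3, 7, 9],
                      [9, 3, 3]])

idx, val = np.asarray(lookups).T
lookup_array = np.zeros(idx.max() + 1, dtype=val.dtype)
lookup_array[idx] = val

new_array = lookup_array[old_array]
print(new_array)
# [[1 2 3]
#  [3 1 1]]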