I'm trying to do the following in Python using NumPy.
At every step I receive a row of values by calling a function; to keep it simple, assume the function is GetRowOfValues().
After 5 rows I want to sum each column and return a full row that is the column-wise sum of the 5 rows received.
Does anyone have an idea how to implement this using NumPy?
Thanks for the help.
I'm assuming that rows have a fixed length n and that their values are of float data type.
import numpy as np

n = 10  # adjust according to your needs
cache = np.empty((5, n), dtype=float)  # allocate an empty 5 x n array

cycle = True
while cycle:
    for i in range(5):
        cache[i, :] = GetRowOfValues()  # save the result of the function call in the i-th row
    column_sum = np.sum(cache, axis=0)  # sum by column
    # your logic here...
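For a quick test without the real data source, a stand-in GetRowOfValues() (hypothetical, just so the loop runs) could look like this:

import numpy as np

n = 10

def GetRowOfValues():
    # stand-in for the real data source: returns one row of n floats
    return np.random.rand(n)

cache = np.empty((5, n), dtype=float)
for i in range(5):
    cache[i, :] = GetRowOfValues()
column_sum = cache.sum(axis=0)  # one value per column, shape (n,)
print(column_sum)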
I have a numpy array as follows, and I want to take a random sample of n rows.
array([[996.924, 265.879, 191.655],
       [996.924, 265.874, 191.655],
       [996.925, 265.884, 191.655],
       [997.294, 265.621, 192.224],
       [997.294, 265.643, 192.225],
       [997.304, 265.652, 192.223]], dtype=float32)
I've tried:
rows_id = random.sample(range(0,arr.shape[1]-1), 1)
row = arr[rows_id, :]
But this index mask only returns a single row; I want to return n rows as a NumPy array (without duplication).
You have three key issues. First, arr.shape[1] returns the number of columns, while you want the number of rows, arr.shape[0]. Second, the stop parameter of range is exclusive, so you don't really need the -1. Third, the last parameter to random.sample is the number of rows to draw, which you set to 1.
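With those three fixes applied, your original random.sample approach would look roughly like this (a sketch, using a small stand-in arr):

import random
import numpy as np

arr = np.array([[996.924, 265.879, 191.655],
                [996.924, 265.874, 191.655],
                [996.925, 265.884, 191.655],
                [997.294, 265.621, 192.224]], dtype=np.float32)
n = 2
rows_id = random.sample(range(arr.shape[0]), n)  # n distinct row indices
rows = arr[rows_id, :]                           # n rows, no duplicates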
A better way to do what you're trying might be np.random.choice with replace=False instead.
Try this, where x is your original array:
import numpy as np

n = 2  # number of rows to sample
idx = np.random.choice(len(x), n, replace=False)  # n distinct row indices
result = np.array([x[i] for i in idx])            # gather the selected rows
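Since idx is an integer array, plain fancy indexing gives the same rows without the Python-level loop:
result = x[idx]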
I have an array A that has 1 million rows and 3 columns. In the last column there are unique integers that help identify data in the other two columns. I would like to keep only the data whose unique integer occurs exactly three times, and delete all other rows (i.e. those whose integer appears only once, twice, or four times, for example). Below is a function, remove_loose_ends, that I wrote to handle this. However, this function is being called many times and is the bottleneck of the entire program. Are there any possible enhancements that could remove the loop from this operation or otherwise decrease its runtime?
import numpy as np
import time

def remove_loose_ends(A):
    # get unique counts
    unique_id, unique_counter = np.unique(A[:, 2], return_counts=True)
    # initialize outgoing index mask
    good_index = np.array([[True] * (A.shape[0])])
    # loop through all unique values and flip the matching rows to False if they are not triplets
    for i in range(0, len(unique_id)):
        if unique_counter[i] != 3:
            good_index = good_index ^ (A[:, 2] == unique_id[i])
    # return incoming array with mask applied
    return A[np.squeeze(good_index), :]

# example array A
A = np.random.rand(1000000, 3)
# making last column "unique" integers
A[:, 2] = (A[:, 2] * 1e6).astype(np.int64)
# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)
So, the main problem is that you basically loop over all the values twice, making it roughly an n² operation.
What you could do instead is create a boolean array directly from the output of numpy.unique and let it do the indexing for you.
For example, something like this:
import numpy as np
import time

def remove_loose_ends(A):
    # get unique counts, plus the inverse mapping from each row back to its unique value
    _, unique_inverse, unique_counter = np.unique(A[:, 2], return_inverse=True, return_counts=True)
    # Obtain boolean array of which integers occurred 3 times
    idx = unique_counter == 3
    # Obtain boolean array of which rows have integers that occurred 3 times
    row_idx = idx[unique_inverse]
    # return incoming array with mask applied
    return A[row_idx, :]

# example array A
A = np.random.rand(1000000, 3)
# making last column "unique" integers
A[:, 2] = (A[:, 2] * 1e6).astype(np.int64)
# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)
I tried timing both versions.
I stopped the function you posted after 15 minutes, whereas the one above takes around 0.15 s on my PC.
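To see how the boolean indexing trick works, here is a minimal sketch on a tiny made-up ID column:

import numpy as np

col = np.array([7, 7, 7, 3, 3, 9])
uniq, inv, counts = np.unique(col, return_inverse=True, return_counts=True)
# uniq   -> [3, 7, 9]
# inv    -> [1, 1, 1, 0, 0, 2]   (position of each value within uniq)
# counts -> [2, 3, 1]
keep = counts == 3    # [False, True, False]
row_mask = keep[inv]  # [True, True, True, False, False, False] - rows whose ID occurs 3 times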
I am trying to extract a section (matrix) of the numbers in a pandas dataframe, as marked in the picture embedded above.
Can anyone assist me? I want to perform analytics based on a section (matrix) of a bigger dataframe. Thank you in advance!
You can use the .iloc[] indexer to select the rows and columns you want.
dataframe.iloc[5:15,6:15]
This should select rows 5-14 and columns 6-14.
Not sure if the numbers are correct but I think this method is what you were looking for.
edit: changed .loc[] to .iloc[] because we're using index values, and cleaned it up a bit
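As a quick, self-contained illustration on a made-up dataframe (the values are just placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(400).reshape(20, 20))
sub = df.iloc[5:15, 6:15]  # rows 5-14, columns 6-14
print(sub.shape)           # (10, 9)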
Here is the code to iterate over the whole dataframe:
# df = big dataframe
shape = (10, 10)  # shape of the matrix to be analyzed, here 10x10
step = 1          # step size: iterate over every position
# or
step = 10         # step size: iterate block by block
# Keep in mind that iterating block by block will leave some data out at the end of the rows and columns.
# You can set step = shape if you are working with a matrix that isn't square; just be sure to change
# step in the code below to step[0] and step[1] respectively.
for row in range(0, len(df.index) - shape[0] + 1, step):        # rows of big dataframe minus rows of matrix to be analyzed
    for col in range(0, len(df.columns) - shape[1] + 1, step):  # columns of big dataframe minus columns of matrix to be analyzed
        matrix = df.iloc[row:shape[0] + row, col:shape[1] + col]  # slice out the sub-matrix and call it 'matrix'
        # analyze matrix here
This is basically the same as what @dafmedinama said; I just added more commenting, simplified specifying the shape of the matrix, and included a step variable so you don't have to iterate over every single position each time you move the matrix.
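A minimal, runnable sketch of the loop above on a small made-up dataframe (sizes are purely illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 20))
shape = (10, 10)
step = 10
for row in range(0, len(df.index) - shape[0] + 1, step):
    for col in range(0, len(df.columns) - shape[1] + 1, step):
        matrix = df.iloc[row:shape[0] + row, col:shape[1] + col]
        print(row, col, matrix.values.mean())  # stand-in for the real analysis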
Let sub_rows and sub_cols be the dimensions of the sub-dataframe to be extracted:
import pandas as pd

sub_rows = 10  # number of rows to be extracted
sub_cols = 3   # number of columns to be extracted

if sub_rows > len(df.index):
    print("Defined sub dataframe rows are more than in the original dataframe")
elif sub_cols > len(df.columns):
    print("Defined sub dataframe columns are more than in the original dataframe")
else:
    for i in range(0, len(df.index) - sub_rows):
        for j in range(0, len(df.columns) - sub_cols):
            sub_df = df.iloc[i:i + sub_rows, j:j + sub_cols]  # extracted dataframe
            # Put here the code you need for your analysis
I am running a simple Python script for MC (Monte Carlo) simulation. Basically it reads through every row in the dataframe and selects the max and min of the two variables. Then the simulation is run 1000 times, selecting a random value between the min and max, computing the product, and writing the P50 value back to the data table.
Somehow the P50 output is the same for all rows. Any help on where I am going wrong?
import pandas as pd
import random
import numpy as np

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000

for index, row in df.iterrows():
    outdata = np.zeros(shape=(NumSim,), dtype=float)
    for k in range(NumSim):
        phi = (row['P_min'] + (row['P_max'] - row['P_min']) * random.uniform(0, 1))
        ht = (row['H_min'] + (row['H_max'] - row['H_min']) * random.uniform(0, 1))
        outdata[k] = phi * ht
    df['out_p50'] = np.percentile(outdata, 50)

print(df)
By writing df['out_p50'] = np.percentile(outdata, 50) you are saying that you want the whole column to be set to the given value, not a specific row of the column. Therefore, the numbers are generated and saved, but they are saved to the whole column each time, and in the end you see the last generated number in every row.
Instead, use df.loc[index, 'out_p50'] = np.percentile(outdata, 50) to specify the row you want to set.
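Putting it together, the corrected loop would look roughly like this (same data as in the question):

import pandas as pd
import random
import numpy as np

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000

for index, row in df.iterrows():
    outdata = np.zeros(NumSim, dtype=float)
    for k in range(NumSim):
        phi = row['P_min'] + (row['P_max'] - row['P_min']) * random.uniform(0, 1)
        ht = row['H_min'] + (row['H_max'] - row['H_min']) * random.uniform(0, 1)
        outdata[k] = phi * ht
    df.loc[index, 'out_p50'] = np.percentile(outdata, 50)  # write only this row

print(df)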
Yup, you're writing a scalar value to the entire column and overwriting that value on each iteration. For a quick fix you can simply specify the row with df.loc. Also consider using np.median(outdata) instead of the percentile call.
Perhaps the most important feature of pandas is its built-in support for vectorization: you work with entire columns of data rather than looping through the data frame. Think of it like a list comprehension in which you don't need the for row in df iteration at the end.
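For example, a vectorized sketch of this simulation (not the original code; it draws all samples for all rows at once with NumPy's random generator) could look like this:

import numpy as np
import pandas as pd

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000
rng = np.random.default_rng()

# one (n_rows, NumSim) matrix of draws per variable
phi = rng.uniform(df['P_min'].to_numpy()[:, None], df['P_max'].to_numpy()[:, None], size=(len(df), NumSim))
ht = rng.uniform(df['H_min'].to_numpy()[:, None], df['H_max'].to_numpy()[:, None], size=(len(df), NumSim))

df['out_p50'] = np.percentile(phi * ht, 50, axis=1)  # one P50 per row
print(df)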
I have a 2D NumPy array and it's huge. I have some computer memory, which is not so huge.
A single copy of the array fits snugly in the computer memory. A second copy of this array brings the computer to its knees crying.
Before I can cut the matrix up into smaller, more manageable chunks, I need to add a few rows to it and remove some. Luckily I need to remove more rows than I add, so in theory this could all be done in-place. I'm working on a function to accomplish this, but I'm curious what advice any of you can give me.
The plan so far:
1. Make a list of rows to remove.
2. Make a matrix of rows to add.
3. Replace the rows to remove by the rows to add (one by one; cannot use fancy indexing here?).
4. Move any rows that still need to be removed to the end of the matrix.
5. Call .resize() on the matrix to resize it in memory.
Especially step 4 is hard to implement efficiently.
Code so far:
import numpy as np
n_rows = 100
n_columns = 1000000
n_rows_to_drop = 20
n_rows_to_add = 10
# Init huge array
data = np.random.rand(n_rows, n_columns)
# Some rows we drop
to_drop = np.arange(n_rows)
np.random.shuffle(to_drop)
to_drop = to_drop[:n_rows_to_drop]
# Some rows we add
new_data = np.random.rand(n_rows_to_add, n_columns)
# Start replacing rows with new rows
for new_data_idx, to_drop_idx in enumerate(to_drop):
    if new_data_idx >= n_rows_to_add:
        break  # no more new data to add
    # Replace a row to drop with a new row
    data[to_drop_idx] = new_data[new_data_idx]
# These should still be dropped
to_drop = to_drop[n_rows_to_add:]
to_drop.sort()
# Make a list of row indices to keep, last rows first
to_keep = set(range(n_rows)) - set(to_drop)
to_keep = list(to_keep)
to_keep.sort()
to_keep = to_keep[::-1]
# Replace rows to drop with rows at the end of the matrix
for to_drop_idx, to_keep_idx in zip(to_drop, to_keep):
    if to_drop_idx > to_keep_idx:
        # All remaining rows to drop are at the end of the matrix
        break
    data[to_drop_idx] = data[to_keep_idx]
# Resize matrix in memory
data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
This seems to work, but is there any way to make this more elegant/efficient? Any way to check whether a copy of the huge array is made at some point?
This seems to perform the same as your code but is a little briefer. I'm relatively sure no copy of the big array is made here: assignment through fancy indexing writes into the existing buffer, and the row-by-row copies use plain integer indexing, which gives views.
import numpy as np
n_rows = 100
n_columns = 100000
n_rows_to_drop = 20
n_rows_to_add = 10
# Init huge array
data = np.random.rand(n_rows, n_columns)
# Some rows we drop
to_drop = np.random.randint(0, n_rows, n_rows_to_drop)
to_drop = np.unique(to_drop)
# Some rows we add
new_data = np.random.rand(n_rows_to_add, n_columns)
# Start replacing rows with new rows
data[to_drop[:n_rows_to_add]] = new_data
# These should still be dropped
to_drop = to_drop[n_rows_to_add:]
# Make a list of row indices to keep, last rows first
to_keep = np.setdiff1d(np.arange(n_rows), to_drop, assume_unique=True)[-len(to_drop):][::-1]
# Replace rows to drop with rows at the end of the matrix
for to_drop_i, to_keep_i in zip(to_drop, to_keep):
    data[to_drop_i] = data[to_keep_i]
# Resize matrix in memory
data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
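As for checking whether a copy is made (not part of the original answer, just a quick sketch on a toy array): np.shares_memory shows that reading through fancy indexing copies, while integer indexing gives views and assigning through fancy indexing writes in place.

import numpy as np

a = np.arange(20).reshape(4, 5)
read = a[[0, 2]]                  # fancy-index read: this is a copy
print(np.shares_memory(a, read))  # False
print(np.shares_memory(a, a[1]))  # True - integer indexing gives a view
a[[0, 2]] = 99                    # fancy-index assignment writes in place
print(a[0, 0], a[2, 0])           # 99 99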